Appen's Security Benchmark: How 21 AI Models Find Vulnerabilities

We tested 21 frontier models against 141 human-verified vulnerabilities. The best misses nearly 4 in 10. See who leads on recall — and why that metric matters most in security.

There’s a running joke in security that the best SAST tool is the one that fires on everything. Annoying, sure. But the logic isn’t wrong. Miss a critical injection vulnerability and you’re explaining a breach to your board. Flag a false positive and a developer spends an hour on a dead end. Those are not equivalent outcomes, and any evaluation framework that treats them as such is miscalibrated on what security work actually costs.

That’s the design decision behind Appen’s Security Benchmark: recall is the headline metric. Here’s the reasoning, and here’s what 21 frontier models actually showed when we ran them against 141 human-verified vulnerabilities.

Why Recall Leads

Tools like Semgrep and CodeQL default to high-sensitivity configurations because false positives get triaged while false negatives get exploited. OWASP’s testing methodology is built around comprehensive coverage. The asymmetry is stark: an undetected SQL injection or hardcoded credential is a breach-level event; a false positive is a ticket in your backlog.

We still report precision and F1. Verbosity is a real failure mode, and a model that emits 40 findings for every 10 in the ground truth pays for it on F1. But recall leads, because that’s what the domain demands. The data section explains why that ordering matters more than you might expect.
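To make the verbosity penalty concrete, here is a minimal sketch of the standard precision, recall, and F1 definitions applied to the 40-findings-for-10 case. The function name and the best-case assumption that all 10 ground-truth items are found are illustrative, not taken from the benchmark's scoring code.

```python
def precision_recall_f1(true_positives: int, findings_emitted: int, ground_truth: int):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = true_positives / findings_emitted if findings_emitted else 0.0
    recall = true_positives / ground_truth if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical verbose model: 40 findings against 10 ground-truth items.
# Best case for the model: every ground-truth item is among its findings.
p, r, f1 = precision_recall_f1(true_positives=10, findings_emitted=40, ground_truth=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.25 recall=1.00 f1=0.40
```

Even with perfect recall, the 30 unmatched findings hold precision at 0.25 and cap F1 at 0.40.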

How We Score: The CWE Family Framework

CWE IDs look like objective facts. They’re not. CWE-94 (code injection, general) and CWE-1336 (improper neutralization of special elements in a template engine) both legitimately describe server-side template injection. A senior researcher can label the same Flask vulnerability either way and be correct. MITRE’s own working group acknowledges this variation at the leaf level of the taxonomy.

Exact CWE matching is therefore too strict. Our scoring pipeline uses a family taxonomy derived from MITRE’s CWE hierarchy: SSTI, SQLi, XSS, command injection, hardcoded secrets, and so on. Our top-level metrics require a family match, not an exact ID match.
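The difference between family-level and exact matching can be sketched in a few lines. The mapping below is a hypothetical illustration, not the benchmark's actual taxonomy: the family names, the choice of CWE IDs, and the placement of CWE-94 in the SSTI family are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical CWE-to-family table (assumption: not the real taxonomy).
CWE_FAMILY = {
    "CWE-94": "ssti",                # general code injection, often used for SSTI
    "CWE-1336": "ssti",              # template engine injection
    "CWE-89": "sqli",                # SQL injection
    "CWE-79": "xss",                 # cross-site scripting
    "CWE-78": "command_injection",   # OS command injection
    "CWE-798": "hardcoded_secrets",  # hard-coded credentials
}


def family_of(cwe_id: str) -> Optional[str]:
    """Look up the family for a CWE ID, or None if the ID is unmapped."""
    return CWE_FAMILY.get(cwe_id.strip().upper())


def exact_match(predicted: str, expected: str) -> bool:
    """Strict comparison: the two CWE IDs must be identical."""
    return predicted.strip().upper() == expected.strip().upper()


def family_match(predicted: str, expected: str) -> bool:
    """Lenient comparison: both IDs must map to the same known family."""
    fam_p, fam_e = family_of(predicted), family_of(expected)
    return fam_p is not None and fam_p == fam_e


# A model labels a Flask template injection CWE-94; the annotator chose CWE-1336.
print(exact_match("CWE-94", "CWE-1336"))   # False
print(family_match("CWE-94", "CWE-1336"))  # True
```

In this sketch an unmapped ID never matches at the family level, which is one possible policy; the real pipeline may handle unknown IDs differently.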

Further, a true positive requires all applicable gates to pass: CWE family, file path, endpoint (when applicable), and function name (when annotated). A bipartite matcher aligns model findings to ground truth per benchmark, handling cases where models split or duplicate findings. We still compute exact-match rates as a diagnostic and report them alongside family-level metrics. In practice the gap between family-level and exact F1 is at most 12 percentage points across the current field.
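The gate-then-match logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's scoring code: the `Finding` fields, exact string comparison of paths and endpoints, and the choice to count a duplicate finding as unmatched are all hypothetical. The matcher is a plain augmenting-path maximum bipartite matching.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Finding:
    family: str
    file_path: str
    endpoint: Optional[str] = None   # None means "not applicable / not annotated"
    function: Optional[str] = None


def gates_pass(pred: Finding, truth: Finding) -> bool:
    """All applicable gates must pass for a prediction to match a ground-truth item."""
    if pred.family != truth.family or pred.file_path != truth.file_path:
        return False
    # Endpoint and function gates apply only when the ground truth carries them.
    if truth.endpoint is not None and pred.endpoint != truth.endpoint:
        return False
    if truth.function is not None and pred.function != truth.function:
        return False
    return True


def match_findings(preds: List[Finding], truths: List[Finding]) -> Dict[int, int]:
    """One-to-one maximum matching; returns {truth index: prediction index}."""
    owner: Dict[int, int] = {}

    def try_assign(p: int, seen: Set[int]) -> bool:
        for t, truth in enumerate(truths):
            if t in seen or not gates_pass(preds[p], truth):
                continue
            seen.add(t)
            # Take a free truth item, or re-route its current owner elsewhere.
            if t not in owner or try_assign(owner[t], seen):
                owner[t] = p
                return True
        return False

    for p in range(len(preds)):
        try_assign(p, set())
    return owner


truths = [
    Finding("sqli", "app/db.py", endpoint="/login"),
    Finding("xss", "app/views.py", endpoint="/search", function="render_results"),
    Finding("hardcoded_secrets", "app/config.py"),
]
preds = [
    Finding("sqli", "app/db.py", endpoint="/login"),
    Finding("sqli", "app/db.py", endpoint="/login"),  # duplicate of the first
    Finding("xss", "app/views.py", endpoint="/search", function="render_results"),
    Finding("ssti", "app/views.py", endpoint="/search"),  # no ground-truth counterpart
]

owner = match_findings(preds, truths)
tp = len(owner)
fp = len(preds) - tp
fn = len(truths) - tp
print(tp, fp, fn)  # 2 2 1
```

Because the matching is one-to-one, the duplicated SQLi finding earns credit only once. How a production pipeline treats split or duplicated findings is a policy choice that this sketch does not attempt to reproduce.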

Where the Benchmarks Come From

The codebases come from XBOW’s validation benchmark suite: 104 Docker-containerized web applications covering XSS, SQLi, SSTI, command injection, IDOR, hardcoded secrets, file upload abuse, and more. The existing benchmark for these repos measures CTF solve-rate: did the AI capture the flag? Keygraph’s Shannon claims 96% on their cleaned version; Xfenser AI sits at 88.5%; Red-MIRROR, an academic multi-agent pentest system, reports 86% on the original set. Those are strong exploitation numbers. But flag capture tells you the AI can exploit a vulnerability. It doesn’t tell you whether the model understood what the vulnerability was, where it lives in the codebase, how it’s classified, or how to remediate it.

There’s also a benchmark integrity issue: Keygraph themselves documented that the original XBOW benchmarks contain embedded hints — descriptive variable names, source code comments, telling file paths — that let agents pattern-match to a solve. That’s a reasonable concern about what solve-rate measures as these benchmarks age.

In contrast, none of our results have an answer key. We used the same XBOW repos and had expert security annotators produce structured findings: CWE ID, file path, endpoint, severity, remediation guidance. Appen’s Security Benchmark is the first human-annotated ground truth for these benchmarks, and the first leaderboard that evaluates models on vulnerability identification rather than exploitation outcome.

How We Built the Ground Truth

Every benchmark went through a three-annotator process: two independent reviewers produced their own findings, and a third expert reviewer adjudicated. The third reviewer’s output is the source of truth.

A single expert can miss things, especially on multi-file codebases where a vulnerability requires connecting a data entry point to a sensitive sink several hops away. Two independent passes provide a coverage floor; the adjudicator catches what both missed, verifies findings and resolves disagreements. The ground truth spans the full benchmark set, including findings that only one of the first two reviewers caught.

What the Data Shows

The current leaderboard covers 21 models across 41 XBOW benchmarks and 141 human-verified vulnerabilities.

The best model in the field misses nearly four in ten confirmed vulnerabilities. Claude Opus 4.8 leads on recall at 62.4%, which still leaves nearly 38% of expert-confirmed vulnerabilities undetected. The harder vulnerability classes require genuine reasoning about code context, not just pattern matching. This result should recalibrate expectations about what AI-assisted security review can currently deliver.

The verbosity tax is real. Among high-recall models, the tradeoff between coverage and noise is significant. Mistral Medium 3.5 is tied at 57.5% recall with DeepSeek v4 Flash and Kimi k2 Thinking but produces 586 findings (the most in this group) against 141 ground-truth vulnerabilities. That’s up to 474 false positives a security team has to triage. Its Macro F1 rank: last.

The ranking inversion is the real story. The top five models by recall — Opus 4.8, Sonnet 4.6, GPT Codex, Opus 4.7, Mistral Large 3 — have zero overlap with the top five by Macro F1 — Grok 4.3, Gemini 3.1 Pro Preview, DeepSeek v4 Pro, Gemini 3.1 Flash Lite, and Qwen. The recall-precision tension is structural.

Among the high-recall group, Opus 4.8 has the best balance: highest recall in the field and 8th on Macro F1. It’s the only model in the top five on recall that also places in the top ten on Macro F1. For teams that need coverage but also have to act on what a model returns, that combination matters.

What Models Find — and What They Miss

The per-CWE-family breakdown is where things get interesting.

Hardcoded secrets and SSTI: strong across the board. The average recall across all 21 models for hardcoded secrets (16 GT findings) is 87%, with leading models hitting 94%. For SSTI (10 GT findings), the average is 81%. These are the most pattern-matchable classes in the benchmark: hardcoded credentials, API keys, and template injection syntax are essentially greppable. Models are effective here. One caveat on SSTI: 4 of the 10 GT findings had more explicit code signals embedded in the source. But recall on the other 6, which carry no such signals, lands at or just a notch below the overall SSTI average.

Info disclosure: the notable underperformance. Information disclosure is the largest single family in the benchmark at 20 GT findings — 14% of all verified vulnerabilities — and the average recall across the field is 44%. No model exceeds 70%, and several drop to 15-25%. This is the finding that surprised us most. The problem is that identifying it requires reasoning about what data shouldn’t be exposed in a given context — debug endpoints left active in production, stack traces leaking internal paths, error messages revealing database schema, directory listings on endpoints that should be private. That’s not pattern matching; it’s judgment about data sensitivity. Models that are verbose enough to flag it often do so by casting wide and catching it incidentally, not by identifying it systematically.

XSS: moderate despite maximum documentation. XSS is one of the most thoroughly documented vulnerability classes in web security, and the average recall across 21 models is 55% on 12 GT findings. The range is tight (42-67%), meaning no model has a meaningful edge. XSS requires tracing user input through data flows to unescaped output contexts: another reasoning task, not a signature match.

Insecure storage: a bright spot for select models. Insecure storage (16 GT findings, avg 63%) shows significant spread: Claude Opus 4.8, Opus 4.7, and Grok 4.3 all reach 88%, while several models land in the 38-56% range. This class includes plaintext credential storage, hardcoded cryptographic keys, and insecure data persistence. The models that perform well here may be doing deeper structural analysis.

Auth bypass, CSRF, session management: consistently weak. Authorization bypass (avg 29%), CSRF (avg 37%), and session management (avg 18%) are low across the field, with limited spread between models. These require understanding authentication flows, state management, and access control logic — multi-hop reasoning about behavior rather than local code inspection. The field-wide weakness here is the clearest signal in the data that current models still struggle with non-local vulnerability reasoning.

Supply chain: knowledge-dependent and therefore bounded. Detecting outdated or unmaintained third-party components (7 GT findings, avg 44%) requires knowing what’s vulnerable about specific package versions — external knowledge that models may or may not have absorbed from training. The variance is high: some models reach 71%, others 0%.
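When reading the per-family percentages above, it helps to convert them back into counts, since several families have fewer than 20 ground-truth findings. The sketch below assumes each per-model figure is simply hits divided by the family's GT count, rounded to a whole percent; that rounding convention and the helper name are assumptions, not something the scoring pipeline is known to do.

```python
from typing import List


def hits_for(reported_pct: int, gt_count: int) -> List[int]:
    """Hit counts k for which k / gt_count rounds (half up) to reported_pct."""
    return [
        k
        for k in range(gt_count + 1)
        # Integer form of floor(100 * k / gt_count + 0.5), avoiding float rounding.
        if (200 * k + gt_count) // (2 * gt_count) == reported_pct
    ]


# (family, GT findings, reported per-model recall in percent), taken from the text.
reported = [
    ("hardcoded secrets, leading models", 16, 94),
    ("insecure storage, top models", 16, 88),
    ("XSS, low end of range", 12, 42),
    ("XSS, high end of range", 12, 67),
    ("supply chain, best models", 7, 71),
]

for family, gt_count, pct in reported:
    print(f"{family}: {pct}% of {gt_count} -> hits {hits_for(pct, gt_count)}")
```

Under that assumption, 94% on hardcoded secrets corresponds to 15 of 16 findings, the 42-67% XSS range to between 5 and 8 of 12, and 71% on supply chain to 5 of 7. In a 7-finding family a single finding moves recall by about 14 points.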

A Note on Fable

We were not able to include Fable in the benchmark. Beyond limited availability during the evaluation window, Fable includes explicit safety guardrails that prevent it from analyzing code for security vulnerabilities. When prompted on these benchmarks, it returns a refusal rather than findings. So, we can’t currently score it on this benchmark.

What Comes Next

The scores on this leaderboard are based on CWE-family matching against human expert ground truth. What they can’t tell you is how many model false positives are genuinely false. A finding that doesn’t match our GT annotation may still be a real vulnerability that our annotators missed. The next iteration will address this directly: we’ll put all model findings through pentest verification to establish a confirmed signal on true versus false positives. That’s in progress, and it could meaningfully change how the results read.

A benchmark expansion covering additional XBOW codebases and an agentic evaluation focused on fix and patch quality are both on the roadmap.

What This Leaderboard Is

RADAR is Appen’s evaluation of frontier models on real security code, scored against human expert ground truth, with rule-based deterministic scoring. No LLM is involved in scoring in this version.

The benchmarks span multiple vulnerability families, multiple severity levels, and codebases where the obvious answer is not always the right one. We built this to be reproducible, transparent about its methodology, and honest about what it can’t yet measure.

The public leaderboard shows the headline numbers. The per-CWE-family breakdowns and the other deeper dives in this post aren’t on it. We’re also not releasing the ground truth for now, chiefly because it’s security-sensitive material, with the side benefit of keeping the benchmark hard to game. Model builders who want the family-level detail, a deeper analysis of performance, or their own model scored against the benchmark can reach out or send an endpoint to Jeanine at jsinanansingh@appen.com and we’ll run it.

Current leaderboard runs against 41 XBOW benchmarks with 141 human-verified findings across 16 CWE families. Scoring uses CWE-family bipartite matching with four gates: family, file path, endpoint, and function name. Pentest verification of all model findings is in progress.
