Benchmarking Subquadratic’s Latest Model & SSA Kernel
56× faster than FlashAttention-2 at 1M tokens. Independent efficiency, retrieval, and SWE-Bench benchmark of sparse self-attention. Download the full report.
Transformer attention is a fundamental bottleneck for the industry: at long context lengths, the O(n²) cost of attending over every token pair becomes the dominant constraint on what's deployable. While this limitation has largely been accepted, Subquadratic's Sparse Self-Attention (SSA) kernel claims to break that curve, replacing full dense attention with a learned sparse routing pass that scales linearly with context length.
That's a strong claim. Appen independently evaluated Subquadratic's latest model release across four benchmarks spanning efficiency profiling, long-context retrieval, and real-world code intelligence to determine whether the numbers hold at production-relevant context windows.
This post covers selected highlights. The full report including per-run profiler data, measured FLOP validation, and signed attestations is available to download here.
The Efficiency Case: Quadratic vs. Linear in Practice
Standard full self-attention scales as O(n²): double the context length, quadruple the compute. This is not a constant-factor problem; it is a structural one that compounds quickly. At 1M tokens, the theoretical compute burden of dense attention is ~64× that of 128K. The practical consequence is that very long contexts simply aren't viable for most production deployments using unmodified transformer architectures.
SSA addresses this by replacing the full attention pass with a selection mechanism that identifies the most relevant token pairs and routes compute only to those. The claimed outcome is linear scaling: each doubling of context doubles, rather than quadruples, compute.
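To make the divergence concrete, here is a back-of-the-envelope sketch (illustrative only: constant factors are dropped, and the sparse cost model simply assumes the linear-scaling claim holds):

```python
# Relative compute growth from a 128K-token baseline, comparing an O(n^2)
# dense attention pass with an O(n) sparse pass. Constant factors are
# dropped; only the growth rates matter here.
BASE = 131_072  # 128K tokens

for n in (131_072, 262_144, 524_288, 1_048_576):
    dense = (n / BASE) ** 2   # quadruples per context doubling
    sparse = n / BASE         # doubles per context doubling
    print(f"{n:>9} tokens: dense ~{dense:4.0f}x baseline, sparse ~{sparse:2.0f}x")
```

At 1M tokens (8× the baseline), dense compute grows ~64× while a linear pass grows only 8×, which is exactly the gap the efficiency benchmarks below set out to measure.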
What the Data Shows
Appen measured end-to-end wall clock latency for SSA and FlashAttention-2 (FA2) at four context lengths on NVIDIA B200 hardware (CUDA 13.0, PyTorch 2.11.0, bfloat16). Results are the mean of 5 timed runs after 3 warmup iterations.
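As a sketch of that warmup-then-measure protocol (the report's actual harness is in the full download; `flash_attn_func` and the tensor arguments here are placeholders, not Appen's code):

```python
import time
import torch

def mean_latency(fn, *args, warmup=3, runs=5):
    """Mean wall clock latency of fn(*args): `warmup` untimed iterations
    followed by `runs` timed ones, synchronising the GPU around each run."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()  # ensure kernels finish before stopping the clock
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Usage (placeholder callable; q, k, v are bfloat16 CUDA tensors):
#   latency_s = mean_latency(flash_attn_func, q, k, v)
```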
FA2 follows the theoretical O(n²) curve almost exactly. SSA follows a linear curve with near-perfect fidelity: latency growth of 7.95× over an 8× context increase is a near-exact match to O(n) behaviour. The 56.2× end-to-end speedup at 1M tokens is not a cherry-picked figure: it is the natural consequence of two architectures diverging along their respective complexity curves.
At 1 million tokens, SSA completes in 381 ms. The equivalent FA2 pass takes 21.4 seconds, a 56× wall clock difference that widens predictably with every additional token.
FLOP Efficiency
Wall clock time conflates compute with hardware effects; FLOP counts provide an architecture-independent measure of the computational work actually performed. Using the standard FlashAttention / MLPerf formula, validated against torch.profiler measurements (matching theory to within 0.7–3.9% across all tested lengths), the measured FLOP reduction at 1M tokens is 62.8×.
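For reference, the dense-attention FLOP convention used by the FlashAttention benchmark scripts looks like the sketch below; treat the exact constants as an assumption about the report's methodology rather than a quote from it:

```python
def dense_attention_flops(batch, seqlen, nheads, headdim, causal=False):
    # Two matmuls per head (Q @ K^T and attn @ V), each costing
    # 2 * seqlen^2 * headdim multiply-add FLOPs; causal masking halves the work.
    flops = 4 * batch * nheads * seqlen**2 * headdim
    return flops // 2 if causal else flops

# Each doubling of seqlen quadruples dense FLOPs; e.g. for any nheads/headdim:
#   dense_attention_flops(1, 1_048_576, 8, 128) ==
#       64 * dense_attention_flops(1, 131_072, 8, 128)
```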
The linearity of SSA's FLOP growth is notable: each context doubling approximately doubles its FLOPs, confirming the architectural claim holds at the compute level, not just at the wall clock level. The full report includes per-run profiler data and the torch.profiler validation methodology.
Long-Context Retrieval Quality: Does Efficiency Trade Off Against Accuracy?
Efficiency gains at long contexts are only meaningful if model quality holds. Sparse attention architectures have historically traded some retrieval accuracy for compute savings; the question is how much, and at what context lengths. Appen evaluated retrieval quality across two benchmarks spanning 128K and 1M token windows.
RULER at 128K Tokens
RULER (Retrieval Under Long-context Evaluation Regimen) is a publicly available NVIDIA benchmark covering single- and multi-hop question answering, word extraction, variable tracking, and multi-needle retrieval. All tasks were run at 131,072 tokens with 100 samples per task. Question answering and extraction tasks were evaluated with Claude Opus 4.6 as an LLM judge, an established methodology that credits semantically correct answers exact-string matching would penalise.
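Appen's judge prompt is not reproduced on this page; the sketch below shows the general LLM-judge pattern being described, with a hypothetical `call_judge` callable standing in for the actual Claude Opus 4.6 client:

```python
JUDGE_PROMPT = """You are grading a long-context retrieval answer.
Question: {question}
Reference answer: {gold}
Model answer: {answer}
Reply with exactly one word: CORRECT if the model answer is semantically
equivalent to the reference answer, otherwise INCORRECT."""

def judge_answer(call_judge, question: str, gold: str, answer: str) -> bool:
    # call_judge: hypothetical callable that sends a prompt to the judge model
    # and returns its text response; swap in any API client.
    verdict = call_judge(JUDGE_PROMPT.format(question=question, gold=gold, answer=answer))
    return verdict.strip().upper().startswith("CORRECT")
```

This is why a semantically correct paraphrase ("roughly 42" vs. "the answer is 42") scores as a hit where exact-string matching would mark it wrong.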
Perfect scores on all single-needle retrieval and structural tracking tasks confirm reliable information extraction throughout a 128K context. Multi-key retrieval shows the expected degradation as the number of simultaneous targets increases, a well-documented pattern across all evaluated architectures that is noted explicitly in the RULER paper itself. This is a benchmark stress-test by design, not an SSA-specific limitation.
MRCR at 1 Million Tokens
MRCR (Multi-Round Co-reference Resolution) is the harder test. Appen evaluated exclusively on the 8-needle tier, which requires simultaneous retrieval of eight distinct targets, on the largest input bucket in the MRCR dataset: 512K to 1M tokens. This is a context length that is computationally unreachable for most dense-attention deployments.
86.2% on the hardest retrieval tier at 1M tokens is a substantive result. The error pattern is bimodal: the model either retrieves all eight needles correctly or fails the sample outright, an all-or-nothing retrieval behaviour. The full data breakdown and per-sample analysis are in the report.
Real-World Code Intelligence: SWE-Bench Verified
SWE-Bench Verified presents actual GitHub issues from curated open-source Python repositories. The model must produce a code patch that resolves the issue and passes all existing tests without regressions. There is no partial credit: a patch either resolves the issue or it doesn't. Extended thinking was enabled to reflect realistic agentic deployment conditions.
Resolving 81.8% of real GitHub issues, with all repository tests passing, places Subquadratic among the top performers on SWE-Bench Verified at the time of evaluation. Combined with the efficiency results, this signals that the sparsity architecture does not impose a quality penalty on complex, multi-step reasoning tasks.
Independence and Methodology
Appen operated with full independence throughout this assessment. Access was scoped to Subquadratic's API endpoints and authentication keys only; no model weights, training data, fine-tuning configurations, or benchmark ground-truth labels were provided in advance. All measurements reflect authentic, uninfluenced model performance.
The evaluation was led by Jeanine Sinanan-Singh (Director of GenAI Research) and Sergio Bruccoleri (VP, Operations). The full technical report includes the complete methodology, hardware configuration, per-run timing data, torch.profiler validation outputs, and signed attestation of independence.
What's in the Full Report
This landing page covers the headline findings. The complete technical report includes:
- Per-run wall clock measurements (5 timed runs, 3 warmup iterations) at all four context lengths
- torch.profiler FLOP validation with match analysis against theoretical predictions (0.7–3.9% variance)
- Full RULER per-task breakdown: all 13 task types, 100 samples each
- MRCR per-sample results and error pattern analysis at 1M tokens
- SWE-Bench Verified pass/fail breakdown with extended thinking configuration
- Signed attestation of evaluation independence and methodology
Download the Full Benchmark Report
Complete methodology, profiler data, per-task breakdowns, and signed attestation of independence.