Putting SubQ 1.1 Small Preview to the Test on Long-Context Retrieval & LiveCodeBench
A third-party benchmark assessment of Subquadratic's preview models, conducted by Appen across long-context retrieval, code generation, business-workflow automation, and graduate-level reasoning benchmarks.
Frontier AI models are built on more than algorithms - they depend on rigorous, expert-validated evaluation to prove what they can actually do. Appen was engaged by Subquadratic to independently assess their latest preview models across publicly recognised benchmarks.
Appen has evaluated the latest version of Subquadratic’s 1.1 Small Preview models. Early access is available through the waitlist at SubQ.AI, and the latest model card can be found here:
https://subq.ai/docs/subq-1-1-small-model-card.pdf
Long-Context Retrieval: Needle-in-a-Haystack
The Needle-in-a-Haystack (NIAH) evaluation tests whether a model can locate a specific fact embedded at varying depths within a very long context. Appen used the niah_single_1 task from the RULER suite, running 50 samples per context tier at temperature 0; every run completed with zero execution errors.
Subquadratic’s small models returned the target value verbatim on every sample at the 1M and 2M token tiers - 100% retrieval accuracy and 100% exact-match. At the 6M and 12M token tiers, the nano variant of the model held at 98% exact-match, sustaining near-perfect long-context retrieval at scales few models are tested at.
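To make the two reported metrics concrete, here is a minimal sketch of how a needle can be placed at a chosen depth and how responses might be scored. It is not the RULER harness or Appen's pipeline. The function names are hypothetical, and the metric definitions are assumptions: "retrieval accuracy" is taken to mean the target value appears somewhere in the response, and "exact-match" to mean the response equals the target once surrounding whitespace is stripped.

```python
def insert_needle(filler: list[str], needle: str, depth: float) -> str:
    """Place the needle sentence at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def score(responses: list[str], targets: list[str]) -> dict[str, float]:
    """Score responses under the two ASSUMED metric definitions described above."""
    if not responses or len(responses) != len(targets):
        raise ValueError("responses and targets must be non-empty and equal length")
    n = len(responses)
    # Assumed "retrieval accuracy": target appears anywhere in the response.
    found = sum(t in r for r, t in zip(responses, targets))
    # Assumed "exact-match": response is the target and nothing else.
    exact = sum(r.strip() == t for r, t in zip(responses, targets))
    return {"retrieval_accuracy": found / n, "exact_match": exact / n}


if __name__ == "__main__":
    context = insert_needle(["Filler one.", "Filler two."], "The magic number is 12345.", 0.5)
    print(context)
    print(score(["12345", "The number is 12345."], ["12345", "12345"]))
```

Under these definitions a response can count as retrieved without being an exact match, which is why the two figures are reported separately. With 50 samples per tier, a 98% score on a tier would correspond to 49 of 50 responses matching.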
Code Generation: LiveCodeBench
LiveCodeBench evaluates code generation on competitive programming problems drawn continuously from live contest platforms, with release-date filtering to limit data contamination. This makes it a more reliable signal for real coding performance than benchmarks like SWE-bench that are more susceptible to gaming - models can’t overfit to a static problem set when the problems keep changing.
Appen evaluated 1,055 problems with four completions each (4,220 total), reporting pass@1 and pass@4 by difficulty. The evaluated model - subq-2m-preview-small - achieved 89.7% pass@4.
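For readers unfamiliar with pass@k, the sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of completions per problem and c the number that pass. Whether Appen's pipeline used exactly this estimator is an assumption; the function names and the sample counts in the example are illustrative, not taken from the report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n completions sampled, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the per-problem count of correct completions."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Illustrative counts for four problems with n = 4 completions each (not real data).
    counts = [0, 1, 4, 2]
    print(mean_pass_at_k(counts, n=4, k=1))
    print(mean_pass_at_k(counts, n=4, k=4))
```

With four completions per problem, pass@4 reduces to the fraction of problems where at least one completion passes, and pass@1 reduces to the average fraction of passing completions per problem.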
About Appen
Appen powers the human data behind frontier AI. With verified domain specialists across 50+ fields, 235+ languages, and operations in 170 countries, Appen delivers the expert-validated data and evaluation that trains and tests the models shaping today’s AI landscape. Appen’s services span six purpose-built data products: Frontier Alignment, Agentic AI, Speech & Audio, Multimodal AI, Physical AI, and Model Integrity.
The full technical report for SubQ 1.1 Small Preview is available to download.