
Appen Contributes Private Benchmark Data to the Open ASR Leaderboard

Appen
May 2026

We partnered with Hugging Face to add a private evaluation track to the Open ASR Leaderboard: high-quality ASR datasets kept off-limits from model developers, giving the community a cleaner signal on real-world performance.

“When a measure becomes a target, it ceases to be a good measure.” (Goodhart’s Law)

As ASR models have improved, so have strategies for gaming leaderboards: training on public test sets, curating data that mirrors known evaluation distributions, optimizing for macroaverages rather than generalization. The Open ASR Leaderboard has been visited over 710K times since September 2023, and that visibility is exactly what makes its public test sets a target.

Appen partnered with Hugging Face to contribute a private benchmark track: datasets held back from any public release, evaluated only by Hugging Face, and designed to measure genuine capability across accent diversity and speech styles. The full technical write-up is on Hugging Face: Adding Benchmaxxer Repellant to the Open ASR Leaderboard.

The Datasets

Appen contributed seven English ASR evaluation sets covering scripted read speech and spontaneous conversational speech across four regional accents. Gender balance is near 50/50 across all splits, a deliberate choice, since gender and accent gaps in benchmarks have historically masked real performance disparities.

| Dataset | Accent | Hours | M / F % | Style | Transcription |
| --- | --- | --- | --- | --- | --- |
| Appen Scripted AU | Australian | 1.42 | 49 / 51 | Read | Punctuated, cased |
| Appen Scripted CA | Canadian | 1.53 | 52 / 48 | Read | Punctuated, cased |
| Appen Scripted IN | Indian | 1.02 | 49 / 51 | Read | Punctuated, cased |
| Appen Scripted US | American | 1.45 | 49 / 51 | Read | Punctuated, cased |
| Appen Conversational IN | Indian | 1.37 | 51 / 49 | Spontaneous | Punctuated, disfluencies |
| Appen Conversational US003 | American | 1.64 | 49 / 51 | Spontaneous | Punctuated, cased, disfluencies |
| Appen Conversational US004 | American | 1.65 | 49 / 51 | Spontaneous | Punctuated, disfluencies |

All model outputs and reference transcripts are normalized using a Whisper-based normalizer that strips punctuation and casing and maps to American spelling, the same normalization applied across the public sets.
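
As a rough illustration of that normalize-then-score step, here is a minimal sketch assuming the openai-whisper package's EnglishTextNormalizer and the jiwer library; the leaderboard's own pipeline may differ in detail, and the example strings are invented:

```python
# Minimal sketch of normalize-then-score, assuming the openai-whisper package's
# EnglishTextNormalizer and the jiwer library (not the leaderboard's exact code).
from whisper.normalizers import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()  # strips punctuation/casing, maps to American spelling

references = ["Dr. Smith analysed the colour samples, didn't he?"]   # invented example
hypotheses = ["doctor smith analyzed the color samples didn't he"]   # invented example

norm_refs = [normalizer(r) for r in references]
norm_hyps = [normalizer(h) for h in hypotheses]

# WER is computed on the normalized strings, so superficial differences in
# punctuation, casing, and British vs. American spelling are not penalized.
print("WER:", jiwer.wer(norm_refs, norm_hyps))
```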

How the Private Track Works

The private datasets live in a dedicated “🔒 Private data” tab on the leaderboard. Scores are aggregated with no individual split scores exposed, preventing optimization against specific providers or accents. Five aggregate metrics are reported:

| Column | Definition |
| --- | --- |
| Average WER | Macroaverage of per-provider averages (providers weighted equally) |
| Avg Scripted | Macroaverage across all scripted (read) splits |
| Avg Conversational | Macroaverage across all spontaneous conversational splits |
| Avg US | Macroaverage across all American English splits |
| Avg non-US | Macroaverage across all non-American splits |
Private data tab: metric definitions and top model rankings (Qwen/Qwen3-ASR-1.7B: 8.46 | assemblyai/universal-3-pro: 8.81 | zoom/scribe_v1: 9.12)
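
As a rough illustration of how these aggregate columns could be computed from per-split WERs, here is a small sketch; the split names mirror the table above, but the numbers are made up, and the real Average WER additionally averages per-provider means across every contributing provider, not just Appen:

```python
# Illustrative macroaverages over per-split WERs (values are invented).
from statistics import mean

split_wer = {
    "Appen Scripted AU": 7.1,
    "Appen Scripted CA": 6.8,
    "Appen Scripted IN": 9.4,
    "Appen Scripted US": 6.2,
    "Appen Conversational IN": 12.3,
    "Appen Conversational US003": 10.8,
    "Appen Conversational US004": 11.1,
}

scripted       = [w for s, w in split_wer.items() if "Scripted" in s]
conversational = [w for s, w in split_wer.items() if "Conversational" in s]
us             = [w for s, w in split_wer.items() if " US" in s]
non_us         = [w for s, w in split_wer.items() if " US" not in s]

# The leaderboard's Average WER goes one step further: it first averages within
# each data provider, then macroaverages across providers so each is weighted equally.
print("Avg Scripted:      ", round(mean(scripted), 2))
print("Avg Conversational:", round(mean(conversational), 2))
print("Avg US:            ", round(mean(us), 2))
print("Avg non-US:        ", round(mean(non_us), 2))
```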

By default, private sets are excluded from the main leaderboard macroaverage. On the main leaderboard, two new toggle columns appear: Private data (scripted) and Private data (conversational).

Main leaderboard with Private data columns toggled OFF; the column selector shows the two unchecked private data options

When toggled on, they enter the macroaverage and the Rank Δ column shows how model ordering shifts relative to the public-only baseline.

Main leaderboard with both private data columns toggled ON; zoom/scribe_v1 rises ▲3 to #1 (Average WER 6.24), with the full Rank Δ column visible
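
For intuition, a minimal sketch of how a Rank Δ column can be derived by comparing the public-only ordering with the ordering once the private sets enter the macroaverage; the model names and WERs below are placeholders, not leaderboard values:

```python
# Toy Rank-Δ computation: public-only ranking vs. ranking with private sets included.
# Model names and scores are placeholders for illustration only.
public_wer   = {"model_a": 5.3, "model_b": 5.4, "model_c": 5.5, "model_d": 5.6}
combined_wer = {"model_a": 6.9, "model_b": 6.5, "model_c": 7.2, "model_d": 6.2}

def ranking(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its 1-based rank (lower WER = better rank)."""
    ordered = sorted(scores, key=scores.get)
    return {model: i + 1 for i, model in enumerate(ordered)}

public_rank = ranking(public_wer)
combined_rank = ranking(combined_wer)

for model in combined_rank:
    delta = public_rank[model] - combined_rank[model]  # positive = moved up
    arrow = f"▲{delta}" if delta > 0 else (f"▼{-delta}" if delta < 0 else "=")
    print(f"{model}: rank {combined_rank[model]} (Rank Δ {arrow})")
```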

What the Rankings Show

On the public-only leaderboard, ibm-granite/granite-speech-4.1-2b leads with a mean WER of 5.33.

Public leaderboard (default view) top 5: ibm-granite/granite-speech-4.1-2b (5.33), CohereLabs/cohere-transcribe-03-2026 (5.42), ibm-granite/granite-4.0-1b-speech (5.52), nvidia/canary-qwen-2.5b (5.63), ibm-granite/granite-speech-3.3-8b (5.74)

When Appen’s private datasets are toggled in (covering Australian, Canadian, Indian, and American accents across scripted and conversational conditions), zoom/scribe_v1 moves from #4 to #1 (Average WER 6.24, Rank Δ ▲3). The model that appeared strongest on public benchmarks drops one position. These shifts are not anomalies; they are exactly what the private track is built to detect: models whose public scores overstate their generalization.

The scripted/conversational gap is particularly diagnostic. Models optimized for clean read speech tend to perform well on public datasets (which skew toward controlled conditions) but show substantially higher WER on spontaneous conversational audio. The Avg Scripted vs. Avg Conversational split makes that gap explicit and model-specific.

Why Private Datasets Protect the Benchmark

Public test sets can be used in training. Once a dataset is open, model developers can train on it directly or curate similarly distributed data to inflate their scores. A leaderboard built entirely on public sets increasingly measures optimization ability.

Having multiple data providers distributes the structural advantage: a model trained on Appen data gains no edge on evaluation sets from other providers, and vice versa. And to prevent any provider’s data from distorting overall rankings, the private sets are excluded from the default macroaverage and are opt-in only.

How the Datasets Were Built

Contributors were recruited against all three attributes of the benchmark scope simultaneously: accent, environment, and device. Each passed a spoken-language assessment validating proficiency and accent, plus a recording-environment check confirming access to the specific device and acoustic setting required.

Scripted and conversational speech followed separate protocols. The read-speech scripts target specific phoneme, named-entity, number, and domain-vocabulary distributions. Conversational prompts are designed to elicit disfluencies, turn-taking, and informal registers, not cleaned-up speech.

Every recording passes two QA gates before transcription: automated checks (sample rate, codec, signal-to-noise ratio) and a human verification step. Files that fail either gate are flagged for re-recording.
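
As an illustration of what the automated gate might look like (the thresholds, SNR heuristic, and helper names below are assumptions for this sketch, not Appen's production checks):

```python
# Illustrative automated QA gate: sample-rate check plus a crude SNR estimate.
# Thresholds and the SNR heuristic are assumptions, not Appen's production pipeline.
import numpy as np
import soundfile as sf

MIN_SAMPLE_RATE = 16_000   # assumed minimum, for illustration
MIN_SNR_DB = 15.0          # assumed minimum, for illustration

def estimate_snr_db(audio: np.ndarray, frame_len: int = 2048) -> float:
    """Rough SNR estimate: loudest frames (speech) vs. quietest frames (noise floor)."""
    frames = [audio[i:i + frame_len] for i in range(0, len(audio) - frame_len, frame_len)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames]) + 1e-12
    return 10 * np.log10(np.percentile(energies, 90) / np.percentile(energies, 10))

def passes_automated_qa(path: str) -> bool:
    if sf.info(path).samplerate < MIN_SAMPLE_RATE:
        return False
    audio, _ = sf.read(path)
    if audio.ndim > 1:                     # mix multi-channel recordings down to mono
        audio = audio.mean(axis=1)
    return estimate_snr_db(audio) >= MIN_SNR_DB
```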

Transcription combines automated quality scoring with human post-editing. Segments below the quality threshold go to post-editors for correction. All transcripts are then reviewed by senior auditors against a style guide, with speaker attribution and turn boundaries validated for multi-speaker recordings.
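
A toy sketch of that routing step (the quality scores, threshold, and data shape are placeholders, not Appen's actual tooling):

```python
# Toy routing of transcript segments based on an automated quality score.
# The threshold and score source are placeholders for illustration only.
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.85  # assumed cut-off for this illustration

@dataclass
class Segment:
    segment_id: str
    draft_transcript: str
    quality_score: float  # produced by an automated quality model (not shown)

def route_for_post_editing(segments: list[Segment]) -> tuple[list[Segment], list[Segment]]:
    """Accept high-scoring segments; send the rest to human post-editors."""
    accepted = [s for s in segments if s.quality_score >= QUALITY_THRESHOLD]
    to_post_edit = [s for s in segments if s.quality_score < QUALITY_THRESHOLD]
    return accepted, to_post_edit
```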

The speech AI community has made huge strides in model performance, but the benchmarks used to measure that progress haven’t kept pace. Leaderboards only tell the full story when the underlying data reflects how speech technology is actually used.
Sergio Bruccoleri
VP of Delivery, Appen
Reliable AI evaluation starts with high-quality data, and we’re excited to partner with Appen to launch this new track in the Open ASR Leaderboard.
Eric Bezzam
Hugging Face

Full technical details: huggingface.co/blog/open-asr-leaderboard-private-data