Hugging Face's Eric Bezzam and Appen's Sergio Bruccoleri on benchmaxxing, accent gaps, and the private evaluation track changing how speech models are ranked.

The promise of voice AI is intuitive: speak naturally, get results. But behind every voice interface - customer support bot, medical assistant, smart device - lies a complex evaluation challenge that most benchmark leaderboards fail to fully capture.

In Benchmarked, the debut episode of The Data Layer by Appen, host Karla Heredia brings together two people who have spent the last year working directly on this problem: Eric Bezzam, who works on Audio ML and leads the Open ASR Leaderboard at Hugging Face, and Sergio Bruccoleri, VP of GenAI Operations at Appen.

The conversation begins with a clear-eyed diagnosis. Speech AI has surged back to the forefront of AI development - not because the science suddenly solved old problems, but because LLMs created a new demand for natural interaction. Typing into ChatGPT is effective. Speaking to it is the next frontier. But that shift brings a reckoning: the models powering these experiences are often evaluated on public datasets that have been quietly gamed.

That practice - “benchmaxxing” - is the central problem this episode addresses. When models are fine-tuned with implicit knowledge of public test sets, their scores stop being a reliable signal of real-world quality. Eric explains how this concern prompted Hugging Face to reach out to Appen to build a curated private dataset that model developers couldn’t train against.

What followed was more than a data handoff. Over four months, Appen and Hugging Face worked to ensure the private dataset had the right diversity of accents, conditions, and conversational styles - scripted versus spontaneous, near-field versus far-field, with accent coverage that exposed real model weaknesses. Indian and Canadian English showed the largest gaps in model performance. Conversational speech proved substantially harder than scripted speech.

Sergio adds the practitioner’s view: the difference between a model that tops a leaderboard and one that actually works in production. Environment matters enormously - clean academic recordings bear little resemblance to a TV soundbar in a living room or a voice assistant on a factory floor. Vocabulary, turn-taking, and latency are production considerations that rarely appear in benchmark scores.

The episode closes with a look ahead. Both guests see the next five years defined by full-duplex voice systems, voice AI in robotics, and a long-overdue focus on accessibility - ensuring speech systems work not just for standard American English speakers, but for people across languages, accents, and speech abilities.

What emerges is a picture of a field that’s maturing: moving past single-number rankings toward multi-signal evaluation - and asking harder questions about what “good enough” really means in production.
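To make the contrast between a single-number ranking and multi-signal evaluation concrete, here is a minimal sketch that computes word error rate (WER) both overall and per slice. It is an illustration only: the slice names, the transcripts, and the helper functions `word_errors` and `wer_by_slice` are hypothetical, the episode does not describe how the private track is scored, and real evaluations normally normalize casing and punctuation first, which this sketch skips.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis) tuples."""
    errors, words = defaultdict(int), defaultdict(int)
    for name, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        for key in (name, "overall"):
            errors[key] += e
            words[key] += n
    # Skip slices with no reference words to avoid dividing by zero.
    return {k: errors[k] / words[k] for k in words if words[k]}


# Made-up transcripts, purely for illustration (not real benchmark data).
samples = [
    ("scripted", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("conversational", "yeah so I was thinking maybe tomorrow",
     "yes I was thinking tomorrow"),
]

scores = wer_by_slice(samples)
for name, wer in sorted(scores.items()):
    print(f"{name}: {wer:.2%}")
```

The overall figure averages over every word, so a model can look strong in aggregate while one slice, such as a particular accent or spontaneous speech, carries most of the errors. Reporting the per-slice numbers alongside the aggregate is one simple way to surface that.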

Speakers

Eric Bezzam
Audio ML
Hugging Face

Eric Bezzam is part of the Audio ML team at Hugging Face, where he contributes to the Transformers library - one of the most widely used open-source machine learning frameworks in the world, with over 160,000 GitHub stars - and leads evaluation efforts including the Open ASR Leaderboard. His work spans speech recognition, speech generation, codecs, and music AI. Before Hugging Face, Eric completed a PhD and worked at Sonos, where he was part of the team deploying ASR systems across millions of devices in varying acoustic environments. He is also a co-founder of LauzHack, a leading European hackathon, and has previously worked as a VC analyst. Based in France, he brings a global perspective to the challenges of accent diversity and real-world speech robustness.

Sergio Bruccoleri
VP of GenAI Operations
Appen

Sergio Bruccoleri leads GenAI operations at Appen, overseeing the delivery and research teams responsible for building high-quality AI training and evaluation data at scale. With deep experience in speech AI, Sergio brings a practitioner’s perspective to the gap between benchmark performance and production deployment - with particular focus on acoustic environment, accent diversity, and the real-world conditions that standard datasets rarely capture. He represented Appen at ICASSP 2026 in Barcelona, where the private ASR track collaboration with Hugging Face was publicly unveiled.

Karla Heredia
Director of GenAI Delivery
Appen

Karla Heredia brings years of experience improving ASR data quality for major technology companies, including extensive work on Amazon Alexa. As Director of GenAI Delivery at Appen and host of The Data Layer, she connects deep technical expertise with the human realities of voice AI - from accent representation to multilingual accessibility. Fluent in both Spanish and English, Karla has a firsthand perspective on the challenges of building speech systems that work for everyone, not just the majority use case.