Human feedback. Real-world relevance. Responsible AI.
On June 18, 2025, Appen and AI Circle brought together over 70 AI professionals in Mountain View, California, to explore what it takes to build safe, scalable, and multilingual AI systems. Expert panelists from Google, Meta, Amazon, NVIDIA, Microsoft, and Appen led a discussion that included Appen’s latest research into the persistent challenges of multilingual AI translation.
From Agents to AGI
Moderated by NVIDIA’s Pratik Mehta, the panel explored critical topics such as the evolving role of the human in the loop (HITL), challenges in red teaming and evaluation, and the shifting landscape of fine-tuning for foundation models.
The lineup of speakers included:
- Si Chen – VP of Strategy and Marketing, Appen
- Himanshu Gupta – Applied Scientist, Amazon
- Mark Hoffmann – Staff ML Engineer, Meta
- Arsalan Mosenia – Tech Lead for AI Agents, Google
- Mayur Shastri – Software Engineer, Microsoft
Key Takeaways
1. Human-in-the-Loop Is Evolving, Fast
At Meta, HITL isn’t just annotation. It’s a multi-layered process involving subject-matter expert validation, iterative relabeling, and quality loops across use cases like legal data and low-resource languages.
Appen’s Si Chen emphasized that HITL is core to building AI that respects linguistic and cultural nuance, especially across 500+ languages and 200+ markets. But she also noted that the challenge lies in sourcing and training the right experts and in capturing their knowledge in ways that models can learn from.
“A lot of the focus today around how we leverage experts is: how do we find the right rubrics, the right benchmarks, the right guidelines to extract the knowledge that exists in the minds of these experts, and shift it toward consistent representation?”
– Si Chen, Appen
This shift toward domain-specific human expertise, not just annotation at scale, reflects the industry’s recognition that model quality depends as much on who is doing the annotation as on what is being annotated.
2. Fine-Tuning Isn’t Going Away
Forget the idea that fine-tuning is obsolete. Microsoft’s Mayur Shastri underscored the need for domain-specific fine-tuning, especially in fields like healthcare and legal reasoning. Retrieval-augmented generation (RAG) may be emerging, but it complements rather than replaces fine-tuning for many enterprise needs.
3. Agents Are the Next Frontier, But Not Without Risk
Google’s Arsalan Mosenia highlighted the shift from agent frameworks to autonomous agents capable of completing enterprise tasks end to end. But adoption won’t be easy: concerns around security, reliability, and compounding errors in agentic systems remain major challenges, especially in sensitive applications where AI safety is paramount.
4. Synthetic Data Is Powerful, But Needs Human Oversight
Synthetic data is crucial for scaling, but the panel agreed it must be paired with human oversight to ensure high-quality outputs. Si noted that synthetically generated data often lacks the cultural, domain, and linguistic grounding required for safe and aligned model behavior.
5. Benchmarks Are Breaking, And That’s Not a Bad Thing
Several panelists shared concerns that benchmarks are increasingly outpaced by model performance. As Mayur noted, the real challenge is evaluating models in context-specific, evolving, and high-stakes scenarios, and that’s where human expertise remains essential.
Open Questions for the Industry
The audience Q&A pushed the conversation further:
- How do we preserve human judgment as AI systems grow in autonomy?
- Who gets to decide how models should respond, especially in subjective domains?
- What’s the path from HITL to AI-in-the-loop?
There were no easy answers. But the consensus was clear: building real-world AI isn’t just about scale. It’s about nuance, responsibility, and collaboration between humans and machines.
Looking Ahead
As AI becomes more embedded in everyday products, from shopping assistants to global ad platforms, the need for human-guided, multilingual, and culturally aware systems has never been greater.
Events like AI Circle x Appen serve as a crucial reminder that you can’t outsource alignment to your training data pipeline. You need people - experts, linguists, domain specialists - in the loop, every step of the way.
Contact us to learn more about how Appen partners with leading teams to build safe and scalable AI.