How Microsoft is Advancing Equitable Knowledge Access with AI Translation
“Appen’s breadth of language expertise and ability to source speakers, even for under-resourced languages, has allowed us to offer a wide range of languages and dialects, with Microsoft Translator.”
– Marco Casalaina, VP of Products for Microsoft Azure AI
Introduction: Making Translation Accessible to All
In the early days of online translation, the process was clunky and often inaccurate: literal, word-for-word translations missed the nuances of language and caused significant misunderstandings. Microsoft Translator, powered by Azure Cognitive Services, changed that by delivering faster, more accurate translations, enabling seamless multi-language communication. Microsoft Translator continually adds new languages to its platform, going beyond the world’s most widely spoken languages to include those that are less commonly spoken. This effort not only promotes language preservation but also fosters equitable access to knowledge for speakers of all languages, making it easier for everyone to engage in cross-cultural communication.
To expand its language capabilities, Microsoft faced the challenge of sourcing and annotating large datasets, particularly for less frequently spoken languages. To tackle this, Microsoft turned to Appen, a leader in AI data solutions, to help meet its data requirements and scale its translation efforts.
About Microsoft Translator
Microsoft Translator is a real-time translation tool powered by AI that provides text, voice, and image translations across multiple languages. The technology is part of the Azure Cognitive Services suite and supports individual users, businesses, and developers by offering translation services that facilitate communication across different languages. The platform was initially focused on supporting widely spoken languages but has grown to incorporate lesser-known languages in order to preserve linguistic diversity and ensure global access to information. By expanding its language offerings, Microsoft Translator contributes to the broader goal of breaking down language barriers and enabling cross-cultural communication on a global scale.
Goals: Expanding AI for Language Equity
Microsoft Translator’s main goal in collaborating with Appen was to significantly increase the number of languages available on the platform, particularly those spoken by smaller communities. By expanding its language capabilities, Microsoft sought to:
- Ensure equitable access to knowledge for speakers of all languages, including those that are rare or endangered.
- Preserve linguistic diversity by digitizing less commonly spoken languages and preventing them from disappearing.
- Improve the accuracy of translations through high-quality, annotated datasets.
- Address potential bias in AI translation models by developing tools that account for gender-ambiguous sentences and other language-specific nuances.
Meeting these goals would allow Microsoft Translator to make a broader impact, ensuring that AI-driven translation services are accessible to people worldwide, regardless of their native language.
Challenge: Sourcing Data for Rare Languages
Microsoft Translator uses AI to translate between languages, but building accurate machine translation models requires vast, well-annotated datasets. For commonly spoken languages, data is readily available. However, for rare and less-documented languages, sourcing sufficient data poses significant challenges.
The challenge was twofold:
- Data Collection: Microsoft needed large datasets from native speakers, but for some languages, it was difficult to find fluent speakers or existing data to use.
- Data Annotation: Accurately transcribing and translating data into the target languages required not just linguistic expertise but also an understanding of each language's cultural context and structure, across diverse alphabets, phonetic systems, and grammars.
Additionally, Microsoft needed a solution to address potential translation bias, such as ensuring accurate translations for gender-ambiguous source sentences. These complex requirements made it essential to find a partner capable of providing tailored solutions for diverse languages.
Solution: Appen’s Expertise in Multilingual Data Annotation
With 25+ years of experience in data sourcing, preparation, and evaluation, Appen was well-equipped to meet Microsoft Translator's needs.
Data Sourcing from Native Speakers
Appen collaborated with local communities to source language data directly from native speakers. By working with fluent speakers of rare languages, Appen collected high-quality language samples that accurately represented the linguistic and cultural nuances of each language.
Customized Data Annotation
Appen's team of experts annotated the collected data by transcribing and translating each sample with precision. This process included multiple layers of quality assurance to ensure the highest level of accuracy in every translation. Additionally, Appen developed a solution for generating multiple translations for gender-ambiguous sentences, allowing Microsoft to address potential biases in its AI translation models.
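To make the idea of multiple translations for a gender-ambiguous sentence concrete, here is a minimal Python sketch of how such an annotated record might be represented. It is an illustration only: the class names, field names, and the English-to-Spanish example are assumptions made for this sketch, not Appen's or Microsoft's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranslationVariant:
    """One target-language rendering of the source sentence (hypothetical schema)."""
    text: str
    gender: str  # e.g. "feminine", "masculine", or "neutral"


@dataclass
class AnnotatedSentence:
    """A source sentence paired with every translation variant annotators supplied."""
    source_lang: str
    target_lang: str
    source_text: str
    variants: List[TranslationVariant] = field(default_factory=list)

    def is_gender_ambiguous(self) -> bool:
        # The source counts as ambiguous when annotators supplied
        # variants carrying more than one distinct gender label.
        return len({v.gender for v in self.variants}) > 1

    def variant_for(self, gender: str) -> Optional[str]:
        # Return the first variant with the requested gender label, if any.
        for variant in self.variants:
            if variant.gender == gender:
                return variant.text
        return None


# Illustrative record: English "doctor" does not mark gender, Spanish does.
example = AnnotatedSentence(
    source_lang="en",
    target_lang="es",
    source_text="The doctor is here.",
    variants=[
        TranslationVariant("La doctora está aquí.", "feminine"),
        TranslationVariant("El doctor está aquí.", "masculine"),
    ],
)
```

Keeping every variant in the record, rather than a single "best" translation, lets a downstream system present both options or choose one explicitly instead of silently defaulting to one gender.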
Phonetic Similarity and Transliteration
For languages with different alphabets or phonetic systems, Appen applied phonetic similarity and transliteration techniques to ensure that the datasets were correctly formatted and ready for use in machine learning models.
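As a rough illustration of what transliteration and phonetic similarity involve, the Python sketch below maps a handful of Cyrillic letters to Latin ones and compares words by a simplified, Soundex-style consonant key. The mapping table, function names, and scoring approach are assumptions chosen for this example; the source does not describe the specific techniques Appen used, and a production system would need complete, language-specific rules.

```python
import difflib

# Toy, deliberately incomplete Cyrillic-to-Latin table (assumption for illustration).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "з": "z", "и": "i", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "х": "kh", "ш": "sh",
}

# Consonant classes borrowed from Soundex: letters in one class sound alike.
CONSONANT_CLASSES = {}
for letters, code in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                      ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CONSONANT_CLASSES[letter] = code


def transliterate(text: str) -> str:
    """Replace each known Cyrillic letter with its Latin counterpart."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())


def phonetic_key(word: str) -> str:
    """Build a simplified Soundex-style key: consonant class codes only.

    Unlike full Soundex, the first letter is not kept and the key is not
    padded or truncated. Any unmapped character (vowel, h, w, y, space)
    is dropped and ends a run of repeated codes.
    """
    key = []
    last = None
    for ch in word.lower():
        code = CONSONANT_CLASSES.get(ch)
        if code is None:
            last = None
            continue
        if code != last:
            key.append(code)
        last = code
    return "".join(key)


def phonetic_similarity(a: str, b: str) -> float:
    """Score two Latin-script words from 0.0 to 1.0 by comparing their keys."""
    key_a, key_b = phonetic_key(a), phonetic_key(b)
    if not key_a or not key_b:
        return 0.0
    return difflib.SequenceMatcher(None, key_a, key_b).ratio()
```

For example, `transliterate("казах")` yields `"kazakh"`, and `phonetic_similarity("kazakh", "qazaq")` scores the two spellings as equivalent because their consonants fall into the same classes.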
Result: Expanding Microsoft Translator’s Language Portfolio
Thanks to the collaboration with Appen, Microsoft Translator was able to scale its language capabilities significantly. The platform now supports 110 languages, including rare and endangered languages. Appen played a critical role in sourcing and annotating data for 108 of those languages.
Newly added languages and variants, many of them less commonly spoken, include:
- Assamese
- Basque
- Dari & Pashto
- Kazakh
- Kurdish
- Maori
- Marathi, Gujarati, Punjabi, Malayalam, Kannada
- Odia
- European and Brazilian Portuguese
By expanding its language offerings, Microsoft Translator has made significant strides in preserving endangered languages and promoting global access to knowledge. The work between Microsoft and Appen demonstrates how AI, combined with high-quality data, can drive greater inclusivity and equity in language access.
The Importance of Collaboration in AI Development
The success of the Microsoft Translator project highlights the importance of collaboration between AI technology developers and data providers. By partnering with Appen, Microsoft was able to overcome the complex challenges of sourcing and annotating data for rare languages, ensuring that its AI models were trained on diverse, representative datasets.
This collaboration also set a new standard for ethical AI development by addressing translation bias and ensuring that the AI-powered tool is accessible to people worldwide, regardless of their native language.
Why Appen?
Appen’s unique ability to deliver customized, high-quality data solutions for AI projects was key to the success of Microsoft Translator. Its comprehensive approach included:
- Expert Data Collection: Sourcing language data from fluent speakers worldwide.
- Precise Data Annotation: Delivering accurately transcribed and translated data with rigorous quality assurance.
- Scalability: Meeting the needs of large-scale AI projects by delivering vast datasets on time.
Through Appen’s support, Microsoft Translator has become a global leader in AI-powered language translation, helping to make knowledge accessible to all.
Conclusion
The partnership between Microsoft Translator and Appen underscores the critical role that high-quality data plays in developing AI technologies. With Appen’s support, Microsoft was able to expand its language portfolio to 110 languages, ensuring that speakers of even the rarest languages can access digital knowledge and engage in global conversations. This collaboration not only strengthened Microsoft’s AI capabilities but also advanced the broader goal of making AI-driven technology more inclusive and equitable for users around the world.