Multilingual LLM Translation: Evaluating Cultural Nuance in Generative AI

Large language models (LLMs) are rapidly expanding their capabilities, but deploying them in new global markets and multilingual AI workflows necessitates a reassessment of current model benchmarking strategies. Today’s leading multilingual LLMs top AI translation leaderboards on grammar and literal meaning. However, for LLM translations to be reliable without human revision, the generated output must accurately localise tone, humour, and figurative language to maintain cultural relevance.
This original research from Appen explores how leading multilingual LLMs perform when translating culturally nuanced language in a real-world business scenario. Focusing on marketing copy (rich with puns, idioms, and figurative language), this pilot study analyses LLM translation across 20+ languages, from high-resource languages like Spanish and French to regional languages like Gujarati and Igbo.
LLM Translation vs Localisation
Localisation goes beyond translation, adapting content to resonate with specific cultural, regional, or linguistic audiences. A translation may be grammatically and literally accurate but fail from a localisation perspective if the tone, message, and intended outcome of the original communication are not preserved.
In many scenarios, such as the marketing emails in this study, poor localisation can have disastrous results ranging from comedic miscommunication to offensive content and AI safety risks. On the other hand, effective localisation builds trust and resonates with local audiences.
Why this research matters
Traditional localisation is labour-intensive, requiring insights from translators experienced with the linguistic and cultural contexts of both the source and target languages. This is a costly and time-consuming task, which drives demand for multilingual LLMs that can reliably perform not only direct translation but also localisation.
Our findings show that, despite impressive grammatical translation, LLMs routinely mistranslate idioms and puns across all languages. Even high-resource languages such as French and Spanish suffered mistranslations and required human intervention. This pilot introduces new approaches to evaluating multilingual LLMs. By focusing on localisation (the translation of tone, figurative language, and similar features) rather than conventional benchmarks alone, it tests models’ real-world capabilities more rigorously than literal translation, and it incorporates the expertise of human evaluators to highlight the gap between “accurate translation” and effective localisation.
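The core of this evaluation idea, scoring literal accuracy separately from cultural dimensions so the gap between the two becomes measurable, can be sketched as a minimal rubric aggregator. The dimension names and the 1-5 scale below are illustrative assumptions, not Appen's actual instrument:

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rubric: the 1-5 scores a human evaluator assigns
# to a single translated segment. Dimension names are assumptions.
@dataclass
class SegmentRating:
    literal_accuracy: int  # is the surface meaning preserved?
    tone: int              # does register/voice match the source?
    figurative: int        # are idioms/puns localised, not calqued?

def localisation_score(ratings: list[SegmentRating]) -> dict[str, float]:
    """Aggregate per-segment human ratings into two summary scores.

    Keeping the literal dimension separate from the cultural ones
    makes a "grammatically accurate but culturally flat" translation
    visible, instead of averaging the problem away.
    """
    return {
        "literal": mean(r.literal_accuracy for r in ratings),
        "localisation": mean(mean([r.tone, r.figurative]) for r in ratings),
    }

# Example: segments that are literally accurate but culturally flat,
# e.g. a pun translated word-for-word.
ratings = [
    SegmentRating(literal_accuracy=5, tone=3, figurative=2),
    SegmentRating(literal_accuracy=4, tone=2, figurative=1),
]
scores = localisation_score(ratings)  # {'literal': 4.5, 'localisation': 2.0}
```

A model that topped a conventional leaderboard would score well on `literal` here while the `localisation` score exposes the failures this study targets.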
Download the research paper
With multilingual LLMs increasingly used in translation and localisation workflows, this pilot exposes critical gaps in cultural alignment. Learn how LLMs handle nuance, where they fall short, and how combining AI with human expertise can unlock effective global communication.
In this paper, you’ll learn about:
- Opportunities for growth in state-of-the-art multilingual LLMs, despite their high performance on standard benchmarks
- Which types of language (e.g., idioms, puns, cultural references) cause the most consistent translation failures
- How linguistic features and LLM training data influence translation quality across languages
- Where human oversight remains essential to deliver high-quality, culturally relevant translations