Polish emerges as top language in multilingual AI benchmark testing
Researchers test 26 languages across seven synthetic tasks at context lengths from 8K to 128K tokens, finding performance gaps widen at longer contexts for low-resource languages.
On March 3, 2025, researchers from the University of Maryland, Microsoft, and UMass Amherst announced the release of ONERULER, a benchmark designed to evaluate how well large language models handle long contexts across 26 languages. The research paper, published at the Conference on Language Modeling (COLM) 2025, reveals unexpected findings about model performance across different writing systems and resource levels.
The benchmark adapts the English-only RULER framework with seven synthetic tasks that test both retrieval and aggregation capabilities. Five are variants of the needle-in-a-haystack (NIAH) challenge, in which models must locate specific information within lengthy documents; the other two are aggregation tasks that require synthesizing information across the entire context. A notable modification introduces the possibility of a nonexistent needle, where models receive credit for correctly identifying that no answer exists.
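To make the setup concrete, here is a minimal sketch of how a single NIAH instance could be assembled. The function name, filler handling, and placement logic are illustrative assumptions, not the benchmark's actual code:

```python
import random

def build_niah_prompt(haystack_sents, instruction, needle=None):
    """Assemble one needle-in-a-haystack example.

    haystack_sents: filler sentences forming the long context.
    instruction: the task instruction in the target language.
    needle: the fact to hide, or None for the nonexistent-needle variant.
    """
    sents = list(haystack_sents)
    if needle is not None:
        # Hide the needle at a random position inside the context.
        sents.insert(random.randrange(len(sents) + 1), needle)
    context = " ".join(sents)
    # When needle is None, a correct model should answer "none".
    return f"{instruction}\n\n{context}\n\nAnswer:"
```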
Creating ONERULER involved a two-step process. Researchers first wrote instructions for each task in English, then collaborated with native speakers to translate these instructions into 25 additional languages. The team hired 17 Upwork annotators for 18 languages and recruited six volunteers for the remaining seven languages, paying translators US$25 per language. All annotators were native speakers with strong English proficiency, and they received context about task objectives to ensure high-quality translations.
The 26 languages span diverse language families, writing systems, and resource availability levels. They include Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hindi, Hungarian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Serbian, Sesotho, Spanish, Swahili, Swedish, Tamil, Ukrainian, and Vietnamese. These languages represent different scripts such as Latin, Cyrillic, logographic, and various typological features including variations in word order and morphological complexity.
Researchers defined low-resource languages using Wikipedia article counts, setting a minimum threshold of 250,000 articles for high-resource classification. By this definition, Hindi, Sesotho, Swahili, and Tamil qualify as low-resource languages despite Hindi having approximately 600 million speakers.
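Under this definition, resource classification reduces to a simple threshold check; a minimal sketch (the Hindi article count shown is an illustrative assumption, not a figure from the paper):

```python
HIGH_RESOURCE_MIN_ARTICLES = 250_000  # threshold used by the researchers

def resource_level(wikipedia_articles: int) -> str:
    """Classify a language by its Wikipedia article count."""
    if wikipedia_articles >= HIGH_RESOURCE_MIN_ARTICLES:
        return "high-resource"
    return "low-resource"

# Hindi falls below the threshold despite ~600 million speakers.
print(resource_level(160_000))  # hypothetical Hindi count -> "low-resource"
```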
The team evaluated four open-weight models of different sizes—Qwen 2.5 (7B and 72B), Llama 3.1 (8B), and Llama 3.3 (70B)—alongside two closed-source models: OpenAI's o3-mini-high and Google's Gemini 1.5 Flash. Testing spanned four context lengths: 8,000, 32,000, 64,000, and 128,000 tokens, with 50 examples per configuration, totaling 5,200 prompts per task per model.
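The 5,200 figure follows directly from the evaluation grid (26 languages × 4 context lengths × 50 examples); a quick sketch of the arithmetic:

```python
num_languages = 26
context_lengths = [8_000, 32_000, 64_000, 128_000]
examples_per_config = 50

total_prompts = num_languages * len(context_lengths) * examples_per_config
print(total_prompts)  # 5200 -- prompts per task, per model
```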
Gemini 1.5 Flash emerged as the strongest performer overall, followed by Qwen 2.5 72B. The o3-mini-high model, despite its reasoning capabilities, struggled significantly with longer contexts. At 128,000 tokens in English, o3-mini-high achieved only 67% accuracy compared to 92% on Polish and 89% on Ukrainian.
The benchmark revealed a widening performance gap between high- and low-resource languages as context length increased. At 8,000 tokens, the difference in aggregate accuracy between the top five and bottom five languages by Wikipedia size stood at 11%. This gap expanded dramatically to 34% at 128,000 tokens. The researchers speculate this widening disparity stems from a lack of low-resource data used during long-context extension training.
Polish unexpectedly topped the performance rankings at context lengths of 64,000 and 128,000 tokens, achieving an average accuracy of 88% across all models. English ranked only sixth among the 26 languages with an average accuracy of 83.9%. Chinese performed surprisingly poorly, ranking as the fourth-worst language with an average accuracy of 62.1%. The top positions were dominated by Slavic, Romance, and Germanic languages, all of which have a large Wikipedia presence and most of which use the Latin script.
Models performed better on the multi-query variant of the NIAH task than on the single-query version for languages other than English. Researchers found that models returned incorrect "none" answers more frequently in the single-query task, leading to greater performance degradation there.
Introducing the possibility of a nonexistent needle significantly impacted performance. Adding the instruction "If no such number exists, please answer 'none'" decreased single-query NIAH accuracy by 32% at a context length of 128,000 tokens in English. Many models, particularly o3-mini-high, exhibited a common failure mode of responding "none" when the needle actually existed in the context. At long contexts, o3-mini-high incorrectly predicted the absence of answers even in high-resource languages.
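A grader for this variant must accept "none" only when no needle was inserted. The sketch below is an assumption about how such scoring might work (the normalization and substring matching are not the paper's exact grading rules):

```python
from typing import Optional

def score_answer(prediction: str, gold: Optional[str]) -> bool:
    """Return True if the model's answer counts as correct.

    gold is None when no needle was inserted; the model must then
    answer "none" to receive credit.
    """
    pred = prediction.strip().lower()
    if gold is None:
        return pred == "none"
    # Answering "none" when the needle exists is the failure mode
    # the study observed most often with o3-mini-high.
    return pred != "none" and gold.lower() in pred
```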
The Common Word Extraction (CWE) aggregation task proved substantially more challenging than the retrieval tasks. It requires models to identify the ten most frequent words in a long list. In the easy setting, where each of the most frequent words appears 30 times and each distractor appears three times, models exceeded 80% accuracy at 8,000 tokens, but performance dropped drastically as context length increased, leaving average English accuracy across all models at only 31.5%. In the hard setting, where the most frequent words appear 20 times and distractors 10 times, accuracy approached 0% across all models.
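A sketch of how an easy-setting CWE instance could be generated, under the frequencies described above; the function names and shuffling scheme are illustrative assumptions:

```python
import random
from collections import Counter

def make_cwe_instance(common, distractors, common_freq=30, distractor_freq=3):
    """Build the word list for one Common Word Extraction example.

    common: the ten target words, each repeated common_freq times.
    distractors: filler words, each repeated distractor_freq times.
    """
    words = [w for w in common for _ in range(common_freq)]
    words += [w for w in distractors for _ in range(distractor_freq)]
    random.shuffle(words)
    return words

def gold_answer(words):
    # The correct response: the ten most frequent words in the list.
    return {w for w, _ in Counter(words).most_common(10)}
```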
Cross-lingual experiments, where instructions and context appeared in different languages, demonstrated that instruction language significantly impacts performance. When given English contexts, switching instruction language to Korean decreased average accuracy across all models from 91% to 71% at 64,000 tokens. Conversely, when the context was in Korean, switching instructions to English or Polish improved performance. At 128,000 tokens, average accuracy increased from 61% to 77% when instructions switched from Korean to English.
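Mechanically, the cross-lingual condition simply decouples the instruction language from the context language when assembling the prompt. A sketch with hypothetical instruction strings (these are not the benchmark's actual translations):

```python
# Hypothetical per-language instruction templates.
INSTRUCTIONS = {
    "en": "Find the special number hidden in the text. "
          "If no such number exists, answer 'none'.",
    "ko": "텍스트에 숨겨진 특별한 숫자를 찾으세요. "
          "없으면 'none'이라고 답하세요.",
}

def cross_lingual_prompt(instruction_lang: str, context: str) -> str:
    # Instruction language and context language vary independently,
    # e.g. English instructions over a Korean context.
    return f"{INSTRUCTIONS[instruction_lang]}\n\n{context}\n\nAnswer:"
```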
Reasoning models exhibited unusual behavior on these synthetic tasks. The o3-mini-high model produced significantly more reasoning tokens for incorrect answers than for correct ones, suggesting inefficient reasoning on simple retrieval tasks. On aggregation tasks, it failed to generate answers within its 10,000-token output limit for almost every sample across all languages and context sizes. This overthinking persisted even with smaller contexts, with reasoning outputs sometimes exceeding the length of the given context itself.
The research highlights disparities in long-context capabilities across languages. All models demonstrated strong aggregate accuracy at 8,000 tokens but still struggled with low-resource languages like Swahili and Sesotho. This issue was most pronounced in open models, with Llama models exhibiting the most severe performance drops.

The findings hold significant implications for the marketing community. Performance Max and AI-powered advertising campaigns increasingly rely on language models that must function effectively across diverse linguistic contexts. Marketing professionals targeting international audiences need to understand that AI performance varies dramatically by language, with resource-poor languages facing particular challenges.
Gemini's strong performance across languages demonstrates that long-context capabilities represent a critical competitive advantage in AI development. With Gemini's expanding capabilities in multimodal understanding, including enhanced video and document processing, the model's multilingual prowess positions Google favorably for global marketing applications.
The benchmark's tokenization challenges underscore complexities in evaluating multilingual models. One Tamil document measured 42,124 tokens using Gemini's tokenizer but 103,990 tokens using Qwen's tokenizer. This discrepancy complicates fair comparisons across models and highlights how tokenizer efficiency affects practical deployment costs and performance.
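Such discrepancies are easy to measure directly. The sketch below compares two open tokenizers on a short Tamil phrase; Gemini's tokenizer is only exposed through Google's API, so multilingual BERT stands in as the second example, and the sample string is illustrative:

```python
from transformers import AutoTokenizer

qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "நீண்ட சூழல் மதிப்பீடு"  # short Tamil phrase ("long context evaluation")

# Disable special tokens so the counts reflect segmentation alone.
print(len(qwen_tok(text, add_special_tokens=False)["input_ids"]))
print(len(mbert_tok(text, add_special_tokens=False)["input_ids"]))
```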
Top-performing languages on long-context tasks
The benchmark revealed unexpected language rankings at 64,000 and 128,000 token context lengths. The top ten languages by average accuracy across all models were:
- Polish - 88% average accuracy
- Russian - 86% average accuracy
- French - 85% average accuracy
- Italian - 84% average accuracy
- Spanish - 84% average accuracy
- English - 83.9% average accuracy
- Ukrainian - 83% average accuracy
- Swedish - 82% average accuracy
- Portuguese - 81% average accuracy
- German - 80% average accuracy
Why Polish emerged as the top performer
Polish achieved the highest average accuracy despite not dominating pretraining datasets like English or Chinese. The language belongs to the Slavic family and uses Latin script with diacritical marks, combining morphological richness with familiar orthography. Polish features complex grammatical structures including seven cases, three genders, and extensive inflection patterns. These characteristics may have created robust training signals that transferred well to long-context tasks. The language also benefits from substantial Wikipedia presence with over 1.6 million articles, placing it ninth globally. Poland's active tech sector and digital economy likely contributed to quality training data. The morphological complexity might actually help models by providing clear grammatical markers that assist in tracking information across long contexts, while the Latin script ensures compatibility with tokenization schemes optimized for Western European languages.
Why Russian ranked second in performance
Russian demonstrated strong performance with 86% average accuracy, leveraging its position as one of the world's major languages with approximately 255 million speakers. The Cyrillic script, while different from Latin, received substantial attention in model development given Russian's geopolitical and economic significance. Russian benefits from over 2 million Wikipedia articles, ranking sixth globally in content availability. The language's rich morphological system, featuring six grammatical cases and complex verb aspects, provides dense information encoding that may improve model comprehension. Russia's significant presence in technology development, scientific publishing, and digital content creation ensured quality training data across diverse domains. The language's systematic grammatical structure, despite its complexity, creates predictable patterns that long-context models can learn effectively. Models trained on substantial Russian corpora likely developed strong capabilities for tracking grammatical relationships across extended texts.
Why French secured third place
French achieved 85% average accuracy, benefiting from its status as a global language spoken across multiple continents by approximately 312 million people. With over 2.6 million Wikipedia articles, French ranks third in content availability, ensuring comprehensive representation in training datasets. The language's morphological system, while complex, follows relatively regular patterns compared to some other Romance languages. French's extensive use in international diplomacy, business, education, and digital media created diverse, high-quality training corpora. The language's clear syntactic structure and explicit grammatical markers help models maintain context over long passages. France's leadership in AI research and technology adoption likely influenced training data quality. French orthography, despite its spelling complexities, uses standard Latin characters that tokenize efficiently. The language's prominence in both European and global contexts ensured attention during model development and evaluation processes.
Why Italian ranked fourth
Italian secured 84% average accuracy through its combination of regular grammatical patterns and substantial digital presence. With approximately 67 million speakers and over 1.9 million Wikipedia articles, Italian maintains strong representation in training datasets despite a smaller speaker population than some lower-ranked languages. The language features relatively transparent orthography compared to French or English, where spelling closely matches pronunciation. Italian's morphological system, while inflected, follows highly regular patterns that create clear training signals. Italy's significant contributions to digital humanities, cultural content, and technology sectors provided quality training data. The language's melodic structure and systematic grammar may have created particularly clear patterns for long-context tracking. Italian benefits from being studied extensively as a second language, generating educational content that reinforces grammatical structures. The Romance language family's consistent features across Italian, French, and Spanish suggest these languages share advantageous characteristics for long-context processing.
Why Spanish tied for fourth position
Spanish matched Italian at 84% average accuracy, supported by its massive speaker base of approximately 560 million people, making it the world's second-most spoken native language. With over 2 million Wikipedia articles, Spanish ranks seventh in content availability. The language's global spread across Europe, Latin America, and growing United States populations created diverse training data representing multiple dialects and registers. Spanish features relatively straightforward orthography and systematic grammatical structures, including clear verb conjugation patterns and consistent gender marking. The language's extensive use in business, entertainment, social media, and education generated substantial high-quality digital content. Latin America's growing technology sector and Spain's developed digital economy contributed to training data quality. Spanish's phonetic writing system and regular morphological patterns likely provided clear signals for model learning. The language's rising importance in global commerce and AI development attention likely ensured quality representation in multilingual training efforts.
Why English ranked only sixth
English achieved 83.9% average accuracy despite dominating AI training datasets, a result the researchers themselves termed "surprising." While English appears most frequently in pretraining corpora, several factors may have limited its long-context performance. English orthography features significant irregularities where spelling doesn't match pronunciation, potentially creating ambiguous training signals. The language's relatively impoverished morphological system, lacking the gender markers, case endings, and extensive verb conjugations found in other Indo-European languages, may provide fewer grammatical cues for tracking information across long contexts. English's status as a global lingua franca means training data includes significant amounts of non-native speaker content, potentially introducing inconsistencies. The language's massive vocabulary and numerous synonyms might create semantic ambiguities that complicate long-context comprehension. Many models may have been optimized specifically for English short-context tasks rather than long-context understanding. The nonexistent-needle option affected all languages but may have disproportionately impacted English, the language for which models most likely saw NIAH-style training data that never included a "none" option.
Why Ukrainian achieved strong seventh place
Ukrainian secured 83% average accuracy, demonstrating impressive performance for a language with approximately 40 million speakers. The language benefits from over 1.3 million Wikipedia articles and substantial digital content creation, particularly following increased international attention since 2014. Ukrainian uses Cyrillic script like Russian but maintains distinct grammatical and lexical features. The language's morphological richness, including seven grammatical cases and extensive verb conjugation patterns, likely provided clear signals for long-context tracking. Ukraine's significant technology sector, including numerous software development companies and AI researchers, contributed to quality training data. The language received increased attention in recent AI development efforts as developers sought to support Ukrainian speakers. Ukrainian's systematic grammatical structure and rich inflectional system create explicit markers that help models maintain contextual understanding. The language's growing digital presence and active online communities generated diverse, contemporary training content across multiple domains.
Why Swedish ranked eighth
Swedish achieved 82% average accuracy, performing well despite having only approximately 10 million native speakers. The language benefits from over 2.6 million Wikipedia articles, placing it among the largest Wikipedias in any language—a remarkable total given its small speaker population. This exceptional Wikipedia presence reflects Sweden's high internet penetration, digital literacy, and strong knowledge-sharing culture. Swedish features relatively straightforward grammar compared to other Germanic languages, with simpler case systems and regular verb conjugations. The language's morphological transparency and consistent orthography likely provided clear training signals. Sweden's leadership in technology adoption, digital government services, and AI research ensured high-quality training data. The country's strong educational system and English proficiency meant Swedish content often maintained high linguistic quality. Swedish's North Germanic features, shared with Norwegian and Danish, may have benefited from transfer learning across related languages. The language's efficient information encoding and systematic structure suited long-context processing tasks.
Why Portuguese secured ninth place
Portuguese achieved 81% average accuracy, supported by approximately 264 million speakers across Portugal, Brazil, Africa, and Asia. With over 1.1 million Wikipedia articles, Portuguese maintains substantial digital presence. Brazilian Portuguese dominates the speaker population and digital content creation, while European Portuguese maintains strong representation in formal contexts. The language's Romance structure provides clear grammatical markers through gender agreement, verb conjugation, and systematic morphology. Brazil's large population, growing technology sector, and active social media presence generated substantial training data. Portuguese orthography, while containing some irregularities, follows relatively consistent patterns compared to English or French. The language's use in international business, entertainment, and diplomacy ensured diverse training corpora. Portuguese shares many structural features with Spanish and other Romance languages, potentially benefiting from transfer learning. Brazil's AI research community and Portugal's technology sector contributed to training data quality. The language's clear syntactic patterns and explicit grammatical relationships likely supported long-context comprehension.
Why German completed the top ten
German achieved 80% average accuracy despite its reputation for grammatical complexity. With approximately 134 million speakers and nearly 3 million Wikipedia articles, German maintains one of the largest bodies of content on the platform. The language's elaborate grammatical system, featuring four cases, three genders, and complex compound noun formation, might seem disadvantageous, but these features provide explicit structural information. German's systematic case marking and gender agreement create clear relationships between words across long contexts. Germany's technological leadership, strong AI research community, and extensive digital infrastructure ensured high-quality training data. The language's compound word formation, while creating long individual words, provides precise semantic information in compact form. German orthography follows relatively consistent rules despite its complexity. The language's importance in scientific publishing, engineering documentation, and business communications generated technical training corpora. German's clear syntactic rules and explicit grammatical markers may have helped models track information across extended contexts despite the language's surface complexity.
Timeline
- February 23, 2024: Google integrates Gemini into Performance Max campaigns
- December 2024: OpenAI introduces enhanced memory features for ChatGPT
- January 2025: Qwen 2.5 paper shows perfect NIAH performance in single-language testing
- February 13, 2025: Google announces memory capabilities for Gemini Advanced
- March 3, 2025: Researchers from University of Maryland, Microsoft, and UMass Amherst announce ONERULER benchmark release
- July 2025: Gemini expands to Wear OS smartwatches with voice recognition
- July 10, 2025: WordStream study finds 20% error rate across AI responses for PPC strategy, with Gemini showing best performance at 6% error rate
- August 13, 2025: Gemini introduces personalization and temporary chat features
- August 26, 2025: Google launches state-of-the-art image generation in Gemini with character consistency features
- September 10, 2025: Google announces text guidelines for AI-powered advertising campaigns
Five Ws Summary
Who: Researchers Yekyung Kim, Jenna Russell, Marzena Karpinska, and Mohit Iyyer from the University of Maryland, Microsoft, and UMass Amherst created ONERULER. They hired 17 Upwork annotators and recruited six volunteers as native speakers to translate instructions. The study evaluated models from OpenAI, Google, Meta, and Alibaba's Qwen team.
What: ONERULER is a multilingual benchmark testing long-context language models across 26 languages using seven synthetic tasks. These include five needle-in-a-haystack variants and two aggregation tasks, evaluated at context lengths of 8,000, 32,000, 64,000, and 128,000 tokens. The benchmark introduces a novel feature allowing for nonexistent needles, where models must identify when no answer exists.
When: The research paper was submitted on March 3, 2025, with subsequent revisions in August and September 2025. It was published at the Conference on Language Modeling (COLM) 2025.
Where: The benchmark evaluates models globally across 26 languages representing diverse language families and writing systems, including Latin, Cyrillic, Hanzi, Kanji/Kana, Hangul, Perso-Arabic, Tamil, and Devanagari scripts. Languages span five continents and range from high-resource languages with millions of Wikipedia articles to low-resource languages with fewer than 2,000 articles.
Why: The research addresses a critical gap in understanding how language models perform with long contexts across multiple languages, moving beyond English-only or limited multilingual evaluations. The findings matter for developing more equitable AI systems, improving multilingual training pipelines, and informing international marketing strategies that increasingly rely on AI-powered tools across diverse linguistic markets.