Polish emerged as the top language for complex AI reasoning, leaving English in sixth place.
When a team of researchers from the University of Maryland, Microsoft, and UMass Amherst set out to explore how artificial intelligence handles long strings of language, they didn’t expect to be surprised. The dominance of English and Chinese in AI is well-known. These languages flood the internet, saturate training datasets, and shape the performance of every major model from ChatGPT to Google’s Gemini.
But the researchers weren’t testing raw fluency or translation. They were probing something deeper. Could an AI, when fed tens of thousands of words in one sitting, retrieve a crucial detail buried in a sea of text? Could it understand and aggregate information across a sprawling document?
And that’s when the surprise came. Under these long-context conditions, it wasn’t English or Chinese that performed best. It was Polish.
In a new AI benchmark study published at the 2025 Conference on Language Modeling, the team found that Polish was the most effective language for complex long-context tasks, with an average accuracy of 88%. English ranked sixth, while Chinese, long assumed to be a linguistic stronghold for machine learning, placed near the bottom.
“English and Chinese dominate the pretraining data…and so we might expect them to be the top-performing languages,” wrote the researchers. “However, at context lengths of 64K and 128K, we unexpectedly observe that Polish is the top performer.”
The researchers built a new evaluation tool called ONERULER, expanding earlier English-only tests to cover 26 languages, from Polish and Russian to Swahili and Tamil. They tested six major AI systems under identical conditions, including models from OpenAI, Google's Gemini, Alibaba's Qwen, and Meta's Llama.
The researchers tasked these systems with analyzing vast amounts of text, up to 128,000 tokens long, to find information or synthesize meaning. The most demanding tasks resembled finding a “needle in a haystack”: a hidden clue or number buried in a book-length passage.
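The paper's exact harness isn't reproduced here, but the needle-in-a-haystack setup is simple to sketch. The snippet below is a minimal, hypothetical Python illustration: it buries a "needle" sentence at a random position inside filler text scaled to a rough token budget, then builds the retrieval prompt an evaluator would send to a model. The filler, the needle wording, and the characters-per-token heuristic are all assumptions for illustration, not the study's actual materials.

```python
import random

def build_haystack_prompt(needle: str, filler: str,
                          target_tokens: int = 128_000,
                          chars_per_token: int = 4) -> str:
    """Bury `needle` at a random spot in ~target_tokens of filler text.

    chars_per_token is a crude heuristic; a real harness would count
    tokens with the model's own tokenizer.
    """
    target_chars = target_tokens * chars_per_token
    repeats = max(1, target_chars // (len(filler) + 1))
    haystack = (filler + " ") * repeats
    insert_at = random.randrange(len(haystack))
    context = haystack[:insert_at] + needle + " " + haystack[insert_at:]
    return ("Read the document below, then answer the question.\n\n"
            f"{context}\n"
            "Question: What is the magic number mentioned above?")

# Hypothetical usage: a benchmark like this repeats the test across
# languages and context lengths, then checks the answer for "7481".
prompt = build_haystack_prompt(
    needle="The magic number is 7481.",
    filler="Grass grows, rivers flow, and the sun rises in the east.",
    target_tokens=64_000,
)
```

A model that truly handles long contexts answers correctly no matter where the needle lands; models that degrade over distance start missing clues buried deep in the middle of the document.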
The result upended expectations. Polish, a Slavic language known for its complex grammar, emerged as the best performer. Russian, French, Italian, and Spanish followed close behind.
By contrast, languages from the Bantu family, such as Swahili and Sesotho, performed poorly despite being spoken by more than 350 million people worldwide. Chinese, with its logographic script, ranked fourth from the bottom.
The authors suspect that Polish benefited from its Latin alphabet, rich morphology, and possibly the syntactic regularity that helps large language models keep track of relationships between words over long distances. Another factor may be that Polish data, though smaller in volume than English, is cleaner and more consistent.
“Humans have trouble with it, but not AI,” quipped the Polish Patent Office in a social media post after the findings were released.
Still, the researchers stress that the result is not about the innate “superiority” of Polish, but about how models process information. “We hope the release of ONERULER will facilitate future research into improving multilingual and cross-lingual long-context training pipelines,” the team wrote.
The findings show that data abundance doesn't automatically translate into better understanding. English may dominate global training datasets, but language structure and tokenization (how models split words into machine-readable units) strongly shape performance.
Slavic and Romance languages, which use inflected words and Latin or Cyrillic scripts, consistently outperformed others. Models struggled with languages that use non-Latin scripts or agglutinative forms, like Korean or Tamil, where words are built from long chains of morphemes.
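Tokenization effects are easy to observe directly. The sketch below (not from the study) uses OpenAI's open-source tiktoken library to count how many tokens the same short sentence costs in different scripts; the translations are approximate and exact counts depend on the tokenizer, so treat it as an illustration rather than a measurement.

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer behind several OpenAI models; other
# vendors' tokenizers will give different counts.
enc = tiktoken.get_encoding("cl100k_base")

# Approximate translations of one short sentence (illustrative only).
samples = {
    "English": "The cat sits on the mat.",
    "Polish":  "Kot siedzi na macie.",
    "Chinese": "猫坐在垫子上。",
    "Korean":  "고양이가 매트 위에 앉아 있다.",
}

for lang, text in samples.items():
    # A language that needs more tokens per sentence fits less actual
    # content into the same fixed context window.
    print(f"{lang:8s} {len(enc.encode(text)):3d} tokens")
```

If one language routinely burns more tokens per sentence than another, a fixed 128,000-token window holds less of its content, which may partly explain why some scripts fare worse as contexts grow.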
This growing gap between “high-resource” and “low-resource” languages widens as AI systems process longer and longer contexts. As the authors note, “performance disparities between high- and low-resource languages increase as context length increases.”
The discovery comes as Poland invests heavily in homegrown AI development. Earlier this year, the government launched PLLuM, the Polish Large Language Model, now used by local administrations to automate official communication and summarize documents.
In that context, the new benchmark serves as a reminder that even smaller languages can lead in the AI era. As large models evolve toward processing ever-longer contexts, their success may depend increasingly on the diversity of languages they understand.
And somewhere in that multilingual maze, Polish—the language of Mickiewicz and Copernicus—may hold an unexpected key to making AI perform better.