A large-scale study shows that the training process turning raw language models into helpful chatbots also weakens their ability to mimic human behavior. The effect gets worse with each new generation.
Language models are increasingly used as stand-ins for human test subjects to predict reactions to policy measures, simulate clinical training for psychiatrists, or model how students learn.
A new study from an international research consortium, including scientists from Helmholtz Munich, arrives at an inconvenient finding: the very training steps that turn language models into useful assistants make them worse at modeling human behavior.
The study builds on Psych-201, a new dataset of transcripts from behavioral experiments. It covers about 208,000 participants and roughly 26 million individual responses from hundreds of experiments, which makes it several times larger than any previous collection of its kind.
Each data point captures a participant’s full run through an experiment, along with detailed metadata like age, nationality, questionnaire responses, and other traits. The dataset was assembled through an open research collaboration involving researchers from more than 35 institutions.
The researchers compared models from the Qwen3, Llama3, and OLMo 3 families, testing both base models and their various post-trained variants. Base models are trained only to predict the next word in text.
From there, extra training produces the versions tuned for instruction-following, step-by-step reasoning, or image processing. The metric: how well each model predicts the actual answers human participants gave.
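To make the idea of "predicting the actual answers" concrete, here is a minimal sketch of one common way to score such predictions: the average negative log-likelihood a model assigns to the options people actually chose. This is an assumption for illustration, not necessarily the study's exact metric, and the function name, the probabilities, and the choices are all hypothetical.

```python
import math

def mean_negative_log_likelihood(predicted, observed):
    """Average -log p(observed choice) over all responses; lower is better.

    predicted: list of dicts mapping each answer option to a probability.
    observed:  list of the options the human participants actually chose.
    """
    total = 0.0
    for probs, choice in zip(predicted, observed):
        # Floor the probability so a zero never produces log(0).
        p = max(probs.get(choice, 0.0), 1e-12)
        total += -math.log(p)
    return total / len(observed)

# Hypothetical toy data: two trials with answer options "A" and "B".
human_choices = ["A", "B"]
base_model = [{"A": 0.6, "B": 0.4}, {"A": 0.3, "B": 0.7}]
assistant_model = [{"A": 0.99, "B": 0.01}, {"A": 0.99, "B": 0.01}]

base_score = mean_negative_log_likelihood(base_model, human_choices)
assistant_score = mean_negative_log_likelihood(assistant_model, human_choices)
print(f"base: {base_score:.3f}, assistant: {assistant_score:.3f}")
```

In this toy setup the overconfident model is punished heavily for the one trial where the human picked the option it considered nearly impossible.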
The result holds across all families and sizes. Base models predict human behavior better than their post-trained descendants. The effect shows up for every common training objective, hitting hardest with reasoning models, followed by instruction tuning and vision extensions. In nearly every head-to-head comparison, the base model outperforms its specialized variant.
One obvious counter-explanation: maybe assistant models just answer more deterministically and fail to capture the natural spread of human behavior. The researchers tested this with an accuracy analysis on a subset of tasks with discrete answer options. Post-trained models still performed worse, making higher determinism unlikely as the sole explanation.
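The logic of that check can be sketched as follows. Accuracy looks only at each model's single most likely option, so a model is not penalized for being overly confident as long as its top pick matches the human's. The function name and the toy data are hypothetical and only illustrate the idea; the study's actual analysis may differ in detail.

```python
def top_choice_accuracy(predicted, observed):
    """Share of responses where the model's most probable option
    equals the option the human actually chose."""
    hits = 0
    for probs, choice in zip(predicted, observed):
        top_option = max(probs, key=probs.get)
        if top_option == choice:
            hits += 1
    return hits / len(observed)

# Hypothetical toy data: three trials with answer options "A" and "B".
human_choices = ["A", "B", "A"]
spread_model = [{"A": 0.6, "B": 0.4}] * 3    # hedges its predictions
sharp_model = [{"A": 0.99, "B": 0.01}] * 3   # nearly deterministic

spread_accuracy = top_choice_accuracy(spread_model, human_choices)
sharp_accuracy = top_choice_accuracy(sharp_model, human_choices)
print(spread_accuracy, sharp_accuracy)
```

Both toy models receive the same accuracy because they share the same top pick on every trial. A model that still scores worse under this measure is therefore wrong about which option people prefer, not merely too confident.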
While base models steadily improve from Qwen2 through Qwen2.5 to Qwen3, getting better at predicting human behavior with each generation, the gap to their derived assistant models keeps growing. Ongoing advances in post-training are making the divergence from human behavior worse.
The biggest distortion shows up in language tasks and reasoning. The researchers offer a plausible explanation: base models are, at their core, models of human language and therefore well-calibrated for language processing tasks. Post-training techniques like reinforcement learning from human feedback push them away from that original objective toward more user-friendly or normatively correct answers.
The same thing happens with reasoning. Human decisions are shaped by heuristics and systematic biases that base models apparently pick up. Reasoning training optimizes for logically correct answers instead, overwriting exactly the human quirks that matter for behavioral simulation.
A second finding concerns a widely used technique: giving language models participant-specific information to put them into a particular role. In the study, this took the form of an interview format where demographic details about each person were prepended before the experiment. Where available, the prompts included age, gender, nationality, education, clinical diagnoses, and questionnaire scores.
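A rough sketch of how such a prompt might be assembled is shown below. The field names, the interview wording, and the helper `build_persona_prompt` are all hypothetical; the article only states that available demographic details were prepended in an interview format.

```python
# Hypothetical field names, loosely following the traits listed above.
PERSONA_FIELDS = [
    "age",
    "gender",
    "nationality",
    "education",
    "clinical_diagnoses",
    "questionnaire_scores",
]

def build_persona_prompt(participant, experiment_text):
    """Prepend an interview-style persona block to the experiment transcript.

    Fields that are missing or None are skipped, mirroring "where available".
    """
    turns = []
    for field in PERSONA_FIELDS:
        value = participant.get(field)
        if value is None:
            continue
        label = field.replace("_", " ")
        turns.append(f"Interviewer: Please state your {label}.\nParticipant: {value}")
    if not turns:
        return experiment_text
    return "\n".join(turns) + "\n\n" + experiment_text

# Hypothetical participant record with one unavailable field.
participant = {"age": 34, "nationality": "German", "education": None}
prompt = build_persona_prompt(participant, "Trial 1: Choose option A or B.")
print(prompt)
```

The model would then be asked to continue the transcript as this participant, and its predictions compared against what the person actually did.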
The effect was practically zero. That held even when the analysis was limited to developmental psychology experiments, where age-related differences should be informative. Earlier work had shown that persona prompts can produce human-like response distributions at the population level. But the new study questions whether they actually predict individual behavior or just look plausible on the surface.
The authors see their findings as a variation of a known problem: extra training toward specific goals can degrade abilities acquired during pretraining. To test whether this is a hard limit, they looked at Centaur – a model specifically fine-tuned on a portion of the behavioral data.
Centaur showed much higher agreement with human behavior even on new tasks that weren’t part of its training. So extra training can help, but only when it targets behavioral modeling rather than logical correctness.
For research practice, the takeaway is clear: the convenient, readily available assistant models aren’t automatically the best choice for behavioral simulations. The researchers recommend either raw base models or variants trained specifically for behavioral simulation. Code and data are available on Hugging Face and GitHub.
That chatbot models have their pitfalls as digital test subjects isn’t new. A recent study of nine open-source language models found that optimizing for more human-sounding output comes at the cost of factual precision, and a classifier unmasked AI responses with 70 to 80 percent accuracy. The persona trick also worked worse than expected.
Another study found that models can barely pose as weak or strong learners on command, with their hit rates shifting by less than a percentage point. And when it comes to reasoning, a deep gap remains: an analysis of more than 170,000 reasoning traces showed that reasoning models think differently from humans, falling into a kind of sequential autopilot.