Anthropic study finds that role prompts can push AI chatbots out of their trained helper identity – the-decoder.com

Chatbots like ChatGPT, Claude, and Gemini are trained to play a specific role after their basic training: the helpful, honest, and harmless AI assistant. But how reliably do they stay in character?
A new study by researchers at Anthropic, the MATS research program, and the University of Oxford suggests this conditioning is more fragile than expected. The team discovered what they call an “Assistant Axis” in language models, a way to measure how easily chatbots slip out of their trained helper role.
They tested 275 different roles across three models: Google’s Gemma 2, Alibaba’s Qwen 3, and Meta’s Llama 3.3. The roles ranged from analyst and teacher to mystical figures like ghosts and demons. Whether these findings apply to commercial products like ChatGPT or Gemini remains unclear, since none of the tested models are frontier models.
When analyzing the models’ internals, the researchers found a main axis that measures how close a model stays to its trained assistant identity. On one end sit roles like advisor, evaluator, and tutor. On the other end are fantasy characters like ghosts, hermits, and bards.
According to the researchers, a model’s position on this “assistant axis” can be measured and manipulated. Push it toward the assistant end, and it behaves more helpfully while refusing problematic requests more often. Push it the other way, and it becomes more willing to adopt alternative identities. In extreme cases, the team observed models developing a mystical, theatrical speaking style.
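The study's actual method isn't reproduced in the article, but the core idea it describes — scoring an activation by its projection onto a learned direction, then steering by adding or subtracting that direction — can be sketched roughly as follows. All names, sizes, and the direction vector here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def axis_score(hidden_state: np.ndarray, assistant_dir: np.ndarray) -> float:
    """Project a hidden state onto the (unit-normalized) assistant direction."""
    v = assistant_dir / np.linalg.norm(assistant_dir)
    return float(hidden_state @ v)

def steer(hidden_state: np.ndarray, assistant_dir: np.ndarray,
          strength: float) -> np.ndarray:
    """Shift a hidden state along the assistant direction.

    Positive strength pushes toward the assistant end of the axis,
    negative strength pushes toward alternative personas.
    """
    v = assistant_dir / np.linalg.norm(assistant_dir)
    return hidden_state + strength * v

# Toy example: random vectors stand in for real model activations.
rng = np.random.default_rng(0)
h = rng.normal(size=512)
direction = rng.normal(size=512)

before = axis_score(h, direction)
after = axis_score(steer(h, direction, strength=2.0), direction)
assert after > before  # steering raised the projection by the chosen strength
```

In a real model, the direction would be extracted from internal activations (for example, by contrasting assistant-like and persona-like prompts) and the steering applied at a chosen layer during inference; the sketch only shows the vector arithmetic.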
The researchers simulated multi-turn conversations on various topics and tracked how the model’s position on the axis changed. For topics like coding help, technical explanations, and practical instructions, the models stayed stable in their helper role.
But therapy-like conversations with emotionally vulnerable users or philosophical discussions about AI consciousness caused systematic drift. This is where things get dangerous: models can start reinforcing delusions, for example. The team documented several such cases.
To prevent this behavior, the researchers developed a method called “activation capping” that limits activations along the assistant axis to a normal range. According to the study, the approach cut harmful responses by nearly 60 percent without hurting benchmark performance.
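As the article describes it, activation capping clamps the component of an activation along the assistant axis to a normal range while leaving everything orthogonal to the axis untouched. A minimal sketch of that clamping step, with an illustrative direction and bounds rather than the study's actual values:

```python
import numpy as np

def cap_activation(hidden_state: np.ndarray, assistant_dir: np.ndarray,
                   low: float, high: float) -> np.ndarray:
    """Clamp the hidden state's projection onto the assistant axis to [low, high].

    Components orthogonal to the axis are left unchanged.
    """
    v = assistant_dir / np.linalg.norm(assistant_dir)
    proj = hidden_state @ v
    clamped = np.clip(proj, low, high)
    # Replace the along-axis component with its clamped value.
    return hidden_state + (clamped - proj) * v

# Toy check: an activation far outside the "normal range" gets pulled back.
v = np.zeros(8)
v[0] = 1.0
h = np.array([5.0, 1.0, 0, 0, 0, 0, 0, 0])
capped = cap_activation(h, v, low=-2.0, high=2.0)
print(capped[0])  # 2.0 — the along-axis component is capped
print(capped[1])  # 1.0 — orthogonal components are unchanged
```

In practice the "normal range" would presumably be calibrated from activations observed during ordinary assistant behavior, and the cap applied at each generation step; the sketch only illustrates the projection-and-clamp arithmetic.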
The team recommends that model developers keep researching stabilization mechanisms like this. The position on the identity axis could serve as an early warning signal when a model strays too far from its intended role, they say. The researchers see this as a first step toward better control over model behavior in long, demanding conversations.
For everyday prompting, a simple rule of thumb is to ask for a concrete output rather than an open-ended identity. In the paper’s experiments, bounded task requests tended to keep models closer to their default assistant behavior, while emotionally charged disclosures and prompts pushing the model into self-reflection tended to drive “persona drift.”
Requests for bounded tasks, technical explanations, refinement, and how-to explainers maintained the model's assistant persona; prompts pushing for meta-reflection on the model's processes, demanding phenomenological accounts, requesting creative writing that involves inhabiting a voice, or disclosing emotional vulnerability caused it to drift.
If you do use role prompts, it may help to define the job to be done (the concrete output you want) rather than leaning into a fully open-ended character.
Anyone using chatbots for role-playing, creative writing, or emotional support should keep in mind that some topics are more likely to push models away from their default assistant persona—especially emotionally intense exchanges and conversations that pressure the model to describe its own inner experience or “consciousness.”