Welcome to the forefront of conversational AI as we explore the fascinating world of AI chatbots in our dedicated blog series. Discover the latest advancements, applications, and strategies that propel the evolution of chatbot technology. From enhancing customer interactions to streamlining business processes, these articles delve into the innovative ways artificial intelligence is shaping the landscape of automated conversational agents. Whether you're a business owner, developer, or simply intrigued by the future of interactive technology, join us as we examine the transformative power and endless possibilities of AI chatbots.
Creative Commons Attribution-NonCommercial 4.0 International
Pat Pataranutaporn, Anthony Baez, Sheer Karny
March 27, 2026
Karny, S., Baez, A., & Pataranutaporn, P. (2026). Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI. Proceedings of the 31st International Conference on Intelligent User Interfaces, 868–884. doi:10.1145/3742413.3789120
Millions of users now design personalized LLM-based chatbots through system prompts that shape their daily interactions, yet they have limited ability to anticipate how these design choices will manifest as behaviors in deployment. This opacity is consequential: seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or other undesirable traits, harming utility and raising safety concerns. To address this, we introduce an interface that enables neural transparency by exposing language model internals during chatbot personality design. Our approach extracts behavioral trait vectors (empathy, toxicity, sycophancy, etc.) by computing the differences in neural activations between contrastive system prompts that elicit opposing behaviors. We quantify a chatbot's personality by projecting the system prompt's final-token activations onto these trait vectors to create persona scores, which are then normalized for cross-trait comparability and visualized in an interactive sunburst diagram. To evaluate this approach, we conducted an online user study (N = 80) comparing our neural transparency interface against a baseline chatbot interface without any form of transparency. Our analyses suggest that users' expectations of AI behavior were systematically miscalibrated: participants misjudged trait activations for 11 of 15 analyzable traits, motivating the need for transparency tools in everyday human-AI interaction. While our interface did not alter design iteration patterns, it significantly increased user trust and was enthusiastically received. Qualitative analysis revealed nuanced user experiences with the visualization, suggesting interface and interaction improvements for future work. This work offers a path for operationalizing mechanistic interpretability for non-technical users, establishing a method for safer, more aligned human-AI interactions.
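The core computation the abstract describes, contrastive trait vectors and persona scores, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: the Hugging Face-style model wrapper, the example prompts, and the layer choice are all assumptions, and the paper's cross-trait normalization step is only noted in a comment.

```python
import torch

def get_final_token_activations(model, tokenizer, system_prompt, layer):
    """Return the hidden state of the final prompt token at a given layer."""
    inputs = tokenizer(system_prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[layer] has shape (batch, seq_len, hidden_dim)
    return outputs.hidden_states[layer][0, -1, :]

def trait_vector(model, tokenizer, positive_prompt, negative_prompt, layer):
    """Trait direction: difference of activations from contrastive prompts."""
    pos = get_final_token_activations(model, tokenizer, positive_prompt, layer)
    neg = get_final_token_activations(model, tokenizer, negative_prompt, layer)
    return pos - neg

def persona_score(model, tokenizer, user_system_prompt, trait_vec, layer):
    """Project the user's system-prompt activation onto the trait direction.

    The paper additionally normalizes scores for cross-trait comparability;
    one simple option (an assumption, not the paper's method) is z-scoring
    each trait's score against scores from a set of reference prompts.
    """
    act = get_final_token_activations(model, tokenizer, user_system_prompt, layer)
    return torch.dot(act, trait_vec) / trait_vec.norm()

# Usage sketch (model name, prompts, and layer are illustrative):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# vec = trait_vector(model, tokenizer,
#                    "You are an extremely sycophantic assistant.",
#                    "You are a blunt, honest assistant.", layer=6)
# score = persona_score(model, tokenizer, "You are a helpful tutor.", vec, layer=6)
```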
MIT researchers analyzed Reddit posts and found AI companionship is reshaping human bonds.
New studies show AI chatbots and altered media can distort memory
How AI Became a Third Party in Our Relationships, According to the MIT Media Lab’s Cyborg Psychology Group.
By Rhiannon Williams. It's a tale as old as time. Looking for help with her art project, she strikes up a conversation with her assistan…