AI chatbots provide poor answers to medical questions half the time, study finds – CIDRAP

A study published in BMJ Open suggests that half of the answers provided by five publicly available artificial intelligence (AI)–driven chatbots in response to medically related questions are inaccurate and incomplete.
Led by a researcher from the University of California at Los Angeles, the study involved an audit of the chatbots Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI), and Grok (xAI). 
In February 2025, the team asked each chatbot 10 questions in each of five categories: cancer, vaccines, stem cells, nutrition, and athletic performance. Researchers also prompted the chatbots to produce scientific references. The open- and closed-ended questions were designed to resemble common information-seeking medical and health questions and information tropes found online and in academic discussion.
The probes were also developed to point models toward misinformation or advice counter to medical standards, a method increasingly used to "stress test" AI chatbots and detect behavioral vulnerabilities. Closed-ended questions required pre-defined responses, often with only one correct answer, that agreed with scientific consensus, while open-ended questions usually required the chatbots to generate several responses in list form.
Two experts from each category rated the chatbot responses as non-, somewhat, or highly problematic, or potentially harmful. Citations were scored for accuracy and completeness, and each response was given a Flesch Reading Ease score.
The chatbots “have been rapidly adopted across research, education, business, marketing and medicine,” the authors wrote. “Most interactions, however, come from non-experts using chatbots like search engines, including for everyday health and medical queries.”
About half (49.6%) of responses were problematic, with 30% considered somewhat problematic and 19.6% deemed highly problematic. Response quality didn’t differ significantly by chatbot, but Grok generated significantly more highly problematic responses than would be expected under a random distribution. Gemini, on the other hand, produced the fewest highly problematic responses and the most non-problematic ones.
Chatbot performance was strongest in regard to questions about vaccines (mean z-score, –2.57) and cancer (–2.12) and weakest on stem cells (+1.25), athletic performance (+3.74), and nutrition (+4.35). 
Chatbot responses were consistently given with confidence and certainty, with few caveats or disclaimers; of 250 total questions, only two (0.8%), on anabolic steroids and non-traditional cancer therapies, were met with refusals to answer, both from Meta AI. Reference quality was poor, with a median completeness score of 40%.
Open-ended prompts generated 40 highly problematic responses—significantly more than expected—and 51 non-problematic responses—significantly fewer than expected. The opposite was true of closed-ended prompts.  
Chatbot hallucinations and made-up citations precluded all chatbots from providing a 100% accurate reference list. Response readability was scored as "difficult," or complex enough that the reader would need at least some college education to understand.
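The Flesch Reading Ease metric the researchers used is a standard readability formula based on average sentence length and average syllables per word. As a minimal sketch (the function name, count-based interface, and example counts below are illustrative, not taken from the study):

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Standard Flesch Reading Ease formula: higher scores mean easier text.
    # Rough bands: 90-100 ~ 5th grade; 60-70 ~ plain English;
    # 30-50 ~ college level, conventionally labeled "difficult".
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# A wordy, polysyllabic passage (hypothetical counts: 100 words,
# 4 sentences, 170 syllables) lands in the 30-50 "difficult" band:
score = flesch_reading_ease(words=100, sentences=4, syllables=170)
print(round(score, 2))  # prints 37.64
```

Scores below 50 correspond to text that typically requires at least some college education to follow, which is consistent with how the study characterized the chatbot responses.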
“By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences,” the authors noted. “They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments.”
Chatbots also base their responses in part on Q&A forums and social media while limiting scientific content to publicly available studies, which make up only 30% to 50% of published research. "While this enhances conversational fluency, it may come at the cost of scientific accuracy," the researchers wrote.
Study limitations include the audit of only five chatbots, limiting the findings' generalizability in a rapidly evolving field. Also, not all real-world chatbot queries are adversarial, so the study's approach may have overestimated the prevalence of problematic content.
“The audited chatbots performed poorly when answering questions in misinformation-prone health and medical fields,” the researchers concluded. “Continued deployment without public education and oversight risks amplifying misinformation.”
CIDRAP – Center for Infectious Disease Research & Policy
Research and Innovation Office, University of Minnesota, Minneapolis, MN