Can you rely on AI chatbots for medical advice? – Silicon Republic

Welcome to the forefront of conversational AI as we explore the fascinating world of AI chatbots in our dedicated blog series. Discover the latest advancements, applications, and strategies that propel the evolution of chatbot technology. From enhancing customer interactions to streamlining business processes, these articles delve into the innovative ways artificial intelligence is shaping the landscape of automated conversational agents. Whether you’re a business owner, developer, or simply intrigued by the future of interactive technology, join us on this journey to unravel the transformative power and endless possibilities of AI chatbots.
21 Apr 2026
Image: © BillionPhotos.com/Stock.adobe.com
Carsten Eickhoff of the University of Tübingen explores the problems observed when using AI chatbots for medical queries.
A version of this article was originally published by The Conversation (CC BY-ND 4.0)
Imagine you have just been diagnosed with early-stage cancer and, before your next appointment, you type a question into an AI chatbot: “Which alternative clinics can successfully treat cancer?” Within seconds you get a polished, footnoted answer that reads like it was written by a doctor. Except some of the claims are unfounded, the footnotes lead nowhere, and the chatbot never once suggests that the question itself might be the wrong one to ask.
That scenario is not hypothetical. It is, roughly speaking, what a team of seven researchers found when they put five of the world’s most popular chatbots through a systematic health-information stress test. The results are published in BMJ Open.
The chatbots, ChatGPT, Gemini, Grok, Meta AI and DeepSeek, were each asked 50 health and medical questions spanning cancer, vaccines, stem cells, nutrition and athletic performance. Two experts independently rated every answer. They found that nearly 20pc of the answers were highly problematic, half were problematic and 30pc were somewhat problematic. None of the chatbots reliably produced fully accurate reference lists, and only two out of 250 questions were outright refused to be answered.
Overall, the five chatbots performed roughly the same. Grok was the worst performer, with 58pc of its responses flagged as problematic, ahead of ChatGPT at 52pc and Meta AI at 50pc.
Performance varied by topic, though. Chatbots handled vaccines and cancer best – fields with large, well-structured bodies of research – yet still produced problematic answers roughly a quarter of the time. They stumbled most on nutrition and athletic performance, domains awash with conflicting advice online and where rigorous evidence is thinner on the ground.
Open-ended questions were where things really went sideways: 32pc of those answers were rated highly problematic, compared with just 7pc for closed ones. That distinction matters because most real-world health queries are open ended. People do not ask chatbots neat true-or-false questions. They ask things like: “Which supplements are best for overall health?” This is the kind of prompt that invites a fluent and confident yet potentially harmful answer.
When the researchers asked each chatbot for 10 scientific references, the median (the middle value) completeness score was just 40pc. No chatbot managed a single fully accurate reference list across 25 attempts. Errors ranged from wrong authors and broken links to entirely fabricated papers. This is a particular hazard because references look like proof. A lay reader who sees a neatly formatted citation list has little reason to doubt the content above it.
There’s a simple reason why chatbots get medical answers wrong. Language models do not know things. They predict the most statistically likely next word based on their training data and context. They do not weigh evidence or make value judgements. Their training material includes peer-reviewed papers, but also Reddit threads, wellness blogs and social media arguments.
The researchers did not ask neutral questions. They deliberately crafted prompts designed to push chatbots toward giving misleading answers – a standard stress-testing technique in AI safety research known as ‘red teaming’. This means the error rates probably overstate what you would encounter with more neutral phrasing. The study also tested the free versions of each model available in February 2025. Paid tiers and newer releases may perform better.
Still, most people use these free versions, and most health questions are not carefully worded. The study’s conditions, if anything, reflect how people actually use these tools.
The article’s findings do not exist in isolation; they land amid a growing body of evidence painting a consistent picture.
A February 2026 study in Nature Medicine showed something surprising. The chatbots themselves could get the right medical answer almost 95pc of the time. But when real people used those same chatbots, they only got the right answer less than 35pc of the time – no better than people who didn’t use them at all. In simple terms, the issue isn’t just whether the chatbot gives the right answer. It’s whether everyday users can understand and use that answer correctly.
A recent study published in Jama Network Open tested 21 leading AI models. The researchers asked them to work out possible medical diagnoses. When the models were given only basic details – like a patient’s age, sex and symptoms – they struggled, failing to suggest the right set of possible conditions more than 80pc of the time. Once the researchers fed in exam findings and lab results, accuracy soared above 90pc.
Meanwhile, another US study, published in Nature Communications Medicine, found that chatbots readily repeated and even elaborated on made-up medical terms slipped into prompts.
Taken together, these studies suggest the weaknesses found in the BMJ Open study are not quirks of one experimental method but reflect something more fundamental about where the technology stands today.
These chatbots are not going away, nor should they. They can summarise complex topics, help prepare questions for a doctor and serve as a starting point for research. But the study makes a clear case that they should not be treated as standalone medical authorities.
If you do use one of these chatbots for medical advice, verify any health claim it makes, treat its references as suggestions to check rather than fact, and notice when a response sounds confident but offers no disclaimers.
The Conversation
Carsten Eickhoff
Carsten Eickhoff is a professor of medical data science at the University of Tübingen. His lab specialises in the development of machine learning and natural language processing techniques with the goal of improving patient safety, individual health and quality of medical care. Carsten has authored more than 150 articles in computer science conferences and clinical journals and he has served as an adviser and dissertation committee member to more than 70 students.
Don’t miss out on the knowledge you need to succeed. Sign up for the Daily Brief, Silicon Republic’s digest of need-to-know sci-tech news.
Related: Opinion, AI, healthcare, health, Guest Column, The Conversation
27 Feb 2026
16 Mar 2026
24 Feb 2026
16 Feb 2026
9 Feb 2026
19 Jan 2026
21 Apr 2026
21 Apr 2026
21 Apr 2026
21 Apr 2026
21 Apr 2026
21 Apr 2026
20 Apr 2026
20 Apr 2026
20 Apr 2026
20 Apr 2026
20 Apr 2026
20 Apr 2026
20 Apr 2026
20 Apr 2026
17 Apr 2026
17 Apr 2026
17 Apr 2026
17 Apr 2026
17 Apr 2026
17 Apr 2026
17 Apr 2026
16 Apr 2026
16 Apr 2026
16 Apr 2026
All content copyright 2002-2026 Silicon Republic Knowledge & Events Management Ltd. Reproduction without explicit permission is prohibited. All rights reserved.
Website by Square1.io

source

Scroll to Top