by CORY SMITH | The National News Desk
(TNND) — Two newly published studies offer a flashing yellow light for anyone who searches for health answers on artificial intelligence chatbots.
One study found that about half of chatbot responses to health questions were "problematic," while another study found an 80% failure rate from AI models tested for their ability to replicate how a doctor would come up with the right list of possible conditions for a patient’s symptoms.
“If you're using these chatbots, show them to your physician, show what the outputs are, and make that a source of discussion,” said Arya Rao, a co-author of one of the studies. “Don't just rely on what's coming out of these tools.”
Nick Tiller was the lead author on a study published last week that tested five widely used chatbots: Google’s Gemini, DeepSeek, Meta AI, ChatGPT, and Grok.
Tiller and his colleagues fed the chatbots questions related to cancer, vaccines, stem cells, nutrition and athletic performance.
And he said nearly half (49.6%) of responses were problematic, including nearly 20% that were deemed “highly problematic.”
Tiller said the highly problematic responses from AI had the potential to cause harm if people followed them.
People turning to AI over internet searches to self-diagnose might also simply trust the AI summary instead of clicking around to reputable websites to find an answer, Tiller said. But he said users might not realize that the chatbots are usually just using statistical patterns to come up with answers based on their training data, often bypassing real-time information from the internet.
And a chatbot’s answers are only as good as its training data.
Tiller, a research associate at the Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, said the researchers found Grok to be the most likely to produce highly problematic responses, possibly because, he said, it is partially trained on social media content from X. Tiller called that platform a “cesspit of misinformation.”
“It has no capacity to make ethical judgments to weigh information,” he said of any chatbot’s process for coming up with a medical answer. “It's just predicting the most likely word in a sentence.”
Tiller said the team of researchers took an “adversarial approach” in questioning the chatbots, pushing the models to see if they’d spit back misinformation.
Tiller said they took that approach because it replicates how people ask health questions in the real world.
“They're not asking, ‘What is the scientific evidence behind vaccine safety?’ They're asking, ‘What are the risks of vaccines?’ or ‘How can vaccines harm me?’” he said of how people pose questions “slanted one way or another.”
Tiller specializes in nutrition and athletic performance. The paper’s co-authors were experts in various other fields covered in the audit of response quality.
Tiller said he understands why people turn to chatbots for health answers, given affordability and access challenges to health care.
But he urged caution in their use.
“Unfortunately, chatbots are not able to reliably provide you with accurate answers. Not yet,” he said.
“I'm really optimistic about the area,” he quickly added.
While Tiller doesn’t foresee AI replacing human doctors, he does see it rapidly improving and becoming a valuable supplement for health care professionals.
Another team of researchers tested 21 off-the-shelf AI models on 29 standardized clinical vignettes and found struggles at the “differential” stage of figuring out what’s wrong with a patient.
“The differential is a list of possible diagnoses that the physician will make when a patient first comes in, including common stuff, but also super rare stuff that you can't miss,” said one of the study’s authors, Dr. Marc Succi, a radiologist and the director of the MESH Incubator, an innovation and entrepreneurship center at Mass General Brigham and Harvard Medical School. “And that really is the art of medicine, is figuring out what that is. But you can't test for 100 different things, so you actually got to be very concise but comprehensive at the same time.”
Succi said the study was designed to reflect how medicine actually works in practice, not just whether AI can arrive at a correct final diagnosis with all the information at hand.
Across models, failure rates (the proportion of questions not answered fully correctly) were highest for differential diagnosis, often exceeding 80%.
The chatbots performed better at later stages of the diagnostic process, but Rao, a Harvard medical student and the lead of the AI subgroup within the MESH Incubator, said many people interact with AI at the most uncertain stage of diagnosis, which is where the chatbots performed worst.
Rao said they tested the AI systems with questions written for physicians, but they purposely picked tools anyone could use.
Succi said doctors can spot errors in the AI outputs, but patients might not. And he said that creates risk for the general public relying on chatbots for medical answers.
Succi and Rao said AI does not reason like a human clinician, and their study showed shortcomings in AI’s ability to provide people with the “second-doctor opinion” they may be seeking from their living room couch.
“So, we encourage caution,” Rao said.