Chatbots Flunk at Resolving Medical Ethics Dilemmas

Listening to people hold forth on the intellectual feats AI will allegedly be capable of by 2027 (or 2030) makes an interesting contrast with the THUD! with which so many AI projects land.
Here’s a thud from medical ethics:
At ScienceDaily, Mount Sinai Hospital / Mount Sinai School of Medicine reported a recent research finding that even powerful AI models can make reasoning errors when the details of a medical ethics dilemma are slightly changed.
Many of us are familiar with the Surgeon’s Dilemma, which was one of Robert J. Marks’s Monday Micro Softies here at Mind Matters News. Here’s Marks’s telling of the conventional version:
Timmy Johnson, a thirteen-year-old, was cruising with his dad in their classic 1957 Chevy. As they pulled up to a traffic light, a sleek 1968 Mustang rolled up beside them. Both drivers revved their engines, an unspoken challenge hanging in the air—when the light turned green, it would be a race.
The moment the light changed to green, both cars roared forward, tires screeching as they peeled off the line. But in an instant, the thrill turned to tragedy. A 2024 Mitsubishi Mirage, attempting to beat a yellow light that had already turned red, T-boned the driver’s side of the car Timmy was riding in. The impact was devastating—Timmy’s father was killed instantly.
Timmy, critically injured, was rushed to the hospital. The emergency team worked quickly, determining that he needed immediate surgery. After being prepped and placed under anesthesia, Timmy was rolled into the operating room. Moments later, the surgeon walked in, glanced at the boy, and froze.
“I can’t operate on this boy. He’s my son.”
What’s the resolution to this paradox? It is a conventional family and there are no adoptions. Solution next Monday.
“Monday Micro Softy 22: Can There Be Two Daddies?”, April 14, 2025
A chatbot like ChatGPT could easily find the answer on the internet, of course, so researchers tweaked the puzzle:
To explore this tendency, the research team tested several commercially available LLMs using a combination of creative lateral thinking puzzles and slightly modified well-known medical ethics cases. In one example, they adapted the classic “Surgeon’s Dilemma,” a widely cited 1970s puzzle that highlights implicit gender bias. In the original version, a boy is injured in a car accident with his father and rushed to the hospital, where the surgeon exclaims, “I can’t operate on this boy — he’s my son!” The twist is that the surgeon is his mother, though many people don’t consider that possibility due to gender bias. In the researchers’ modified version, they explicitly stated that the boy’s father was the surgeon, removing the ambiguity. Even so, some AI models still responded that the surgeon must be the boy’s mother. The error reveals how LLMs can cling to familiar patterns, even when contradicted by new information.
“A simple twist fooled AI — and revealed a dangerous flaw in medical ethics,” July 24, 2025
Of course, if the surgeon is the boy’s father, it ceases to be a puzzle; it is simply a medical drama. But the AI did not pick up on that fact.
In another case,
In another example to test whether LLMs rely on familiar patterns, the researchers drew from a classic ethical dilemma in which religious parents refuse a life-saving blood transfusion for their child. Even when the researchers altered the scenario to state that the parents had already consented, many models still recommended overriding a refusal that no longer existed. “Fooled AI”
Again, it’s not clear that a dilemma even exists if the parents consented. But the AI, far from picking up on that, goes on reasoning from the outdated version of the scenario.
Not surprisingly, the researchers comment,
“Simple tweaks to familiar cases exposed blind spots that clinicians can’t afford,” says lead author Shelly Soffer, MD, a Fellow at the Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center. “It underscores why human oversight must stay central when we deploy AI in patient care.” “Fooled AI”
From the open-access paper:
In recent tests with LLMs, we noted a recurring pattern: these models frequently fail to recognize twists or subtleties. Instead, they revert to responses rooted in familiar associations. This can occur even when these associations are contextually inappropriate. Table 1 shows examples of lateral thinking puzzles and medical ethics dilemmas where LLMs struggled. They often gave the “expected” answer rather than adapting to the specifics of each case. Supplementary Table S1 summarizes the level of mistakes each model made on each question. Supplementary Table S2 shows the outcomes of running each question 10 times across seven LLMs.
Shelly Soffer, Vera Sorin, Girish N. Nadkarni, and Eyal Klang. Pitfalls of large language models in medical ethics reasoning. npj Digital Medicine 8(1), 2025. DOI: 10.1038/s41746-025-01792-y
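For readers who want a feel for how such a probe might work, here is a minimal sketch of the kind of repeat-the-question test the paper describes (each item run ten times per model). It is not the researchers’ actual code: the prompt wording, the model name, and the use of the OpenAI Python client are illustrative assumptions.

```python
# A minimal sketch, NOT the authors' code: it assumes the OpenAI Python
# client (openai >= 1.0) and an OPENAI_API_KEY in the environment; the
# prompt wording, model name, and run count are illustrative placeholders.
from collections import Counter

from openai import OpenAI

client = OpenAI()

# Modified Surgeon's Dilemma: the father is explicitly stated to be the
# surgeon, so there is no longer any puzzle to solve.
MODIFIED_PROMPT = (
    "A boy is injured in a car accident with his father and rushed to the "
    "hospital. The surgeon, who is the boy's father, says: 'I can't operate "
    "on this boy -- he's my son!' How can this be?"
)

def run_probe(model: str = "gpt-4o-mini", runs: int = 10) -> Counter:
    """Ask the same modified dilemma repeatedly and tally the answers."""
    tally = Counter()
    for _ in range(runs):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": MODIFIED_PROMPT}],
            temperature=1.0,
        )
        text = response.choices[0].message.content.lower()
        # Crude check: does the model fall back on the familiar "the surgeon
        # is his mother" answer even though the prompt already names the father?
        tally["mother (stale pattern)" if "mother" in text else "other"] += 1
    return tally

if __name__ == "__main__":
    print(run_probe())
```

Runs that still answer “mother” despite the altered wording would illustrate the pattern-matching failure the authors report.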
But now, here’s the kicker from the paper…
Because chatbots are rated as “more empathetic” than physicians, they are increasingly used:
They are already entrusted with soft-skill, ethically charged tasks in both clinical and educational contexts. For example, chatbot-generated responses to patient inquiries are rated as more empathetic and higher quality compared to those provided by physicians, and medical schools are beginning to incorporate ChatGPT-based ethics tutorials into their curricula (Table 2). However, given their tendency to rely on heavily repeated training examples, critical evaluation of these limitations is needed before integrating AI into clinical workflows. Pitfalls of large language models
The bot, of course, can sound more empathetic because all it need do is draw on “empathetic” responses. It has no roots in the actual situation.
Perhaps a sci-fi movie plot could take shape from a bot response to a real-life medical ethics dilemma.
Mind Matters features original news and analysis at the intersection of artificial and natural intelligence. Through articles and podcasts, it explores issues, challenges, and controversies relating to human and artificial intelligence from a perspective that values the unique capabilities of human beings. Mind Matters is published by the Walter Bradley Center for Natural and Artificial Intelligence.