Voices & Thought Leaders - Tuesday, 21 April 2026

Gary Marcus: Four Studies Show Chatbots Fail at Medical Diagnosis


Gary Marcus has spent the past week reading peer-reviewed medical studies, and the results are grim. Four separate papers, published in the last few months, tested large language models on diagnostic reasoning and triage. The error rates hover near 50%. Worse, the models express high confidence even when they're catastrophically wrong.

This isn't Marcus being reflexively anti-AI. These are controlled studies from research teams testing whether LLMs like GPT-4 and Claude can handle real medical scenarios. The answer, consistently, is no - not yet, and not safely.

What the Studies Found

One study tested diagnostic reasoning across common presentations; the models got roughly half the cases wrong. Another tested triage decisions - the critical "does this person need emergency care, or can they wait?" judgement that shapes outcomes. Again, near-50% error rates, with dangerous misclassifications in both directions: telling someone with sepsis to wait, or sending someone with indigestion to emergency.
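A sketch of why "errors in both directions" matters: a confusion matrix separates the two failure modes, which carry very different harms. All counts below are invented for illustration - the studies report error rates, not these specific numbers.

```python
# Hypothetical triage confusion matrix (invented counts, illustration only).
# Keys: (true urgency, model's triage call) -> number of cases.
cases = {
    ("emergency", "emergency"): 27,   # correct: sent to emergency care
    ("emergency", "wait"):      23,   # dangerous: urgent case told to wait
    ("wait",      "wait"):      26,   # correct: genuinely safe to wait
    ("wait",      "emergency"): 24,   # wasteful: non-urgent case sent to ER
}

total = sum(cases.values())
errors = sum(n for (true, pred), n in cases.items() if true != pred)
under_triage = cases[("emergency", "wait")]  # the life-threatening direction

print(f"overall error rate: {errors / total:.0%}")   # 47% with these counts
print(f"under-triaged emergencies: {under_triage}")
```

The headline error rate hides the asymmetry: the 23 under-triaged emergencies are far more dangerous than the 24 over-triaged ones, which is why a single accuracy number understates the risk.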

The confidence problem is worse than the errors. A model that says "I'm not sure, consult a doctor" is annoying but safe. A model that says "this is definitely X" when it's actually Y is dangerous. The studies found LLMs consistently overstate certainty, which means users trust them more than they should.
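The gap the studies describe - stated certainty far above actual accuracy - is what calibration measures. A minimal sketch with invented (confidence, correct?) pairs: a model that is wrong half the time while claiming roughly 90% certainty throughout.

```python
# Invented predictions for illustration: (stated confidence, was it correct?).
predictions = [
    (0.92, True), (0.94, False), (0.88, True), (0.91, False),
    (0.93, False), (0.90, True), (0.92, False), (0.88, True),
]

avg_confidence = sum(c for c, _ in predictions) / len(predictions)
accuracy = sum(ok for _, ok in predictions) / len(predictions)

# The gap between the two numbers is the overconfidence the studies flag:
# users hear ~91%, but the model delivers 50%.
print(f"stated confidence: {avg_confidence:.0%}")
print(f"actual accuracy:   {accuracy:.0%}")
```

A well-calibrated model would keep these two numbers close; the danger is precisely that stated confidence carries no information about which answers are in the wrong half.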

Marcus's core argument: we are deploying these systems faster than we are understanding their failure modes. Chatbots are already being marketed for health advice. People are already using them. And the research says they fail at the exact tasks people assume they're good at - pattern recognition, diagnostic thinking, risk assessment.

Why This Matters Now

The timing is crucial. We're past the "will AI be used in medicine?" debate. It's already happening. Patients ask ChatGPT about symptoms. Doctors use Claude to draft notes. Startups pitch diagnostic tools built on LLMs. The question isn't whether, it's how fast - and whether we're moving faster than the evidence supports.

Marcus isn't calling for a ban. He's calling for public education and regulatory oversight. People need to know these tools make mistakes, often, and with confidence. Doctors need to know the error rates aren't edge cases - they're structural limits of how LLMs work. And regulators need to treat medical AI like medical devices, not software updates.

The alternative, he argues, is amplified misinformation at scale. A chatbot that gives bad advice to one person is a problem. A chatbot embedded in a health app used by millions is a public health crisis. The studies suggest we're closer to the latter than the former, and the oversight isn't keeping pace.

The Harder Question

Here's what Marcus doesn't say but the studies imply: LLMs might be fundamentally the wrong tool for this job. Diagnostic reasoning isn't autocomplete. It's probabilistic inference over incomplete, noisy data, guided by domain knowledge and experience. LLMs are very good at predicting the next word. That overlaps with medical reasoning sometimes, but not reliably, and not in ways we can predict.

The 50% error rate isn't a training problem or a prompt engineering problem. It's a signal that the architecture isn't suited to the task. You can make it better with more data, better fine-tuning, retrieval-augmented generation - but you can't make it safe without changing what it is. And we're deploying it anyway, because it sounds confident and it's easy to integrate.

The frustration in Marcus's piece is palpable. He's been saying this for months, and the deployment curve keeps outpacing the research. Every week, another health app adds an AI chat feature. Every week, another study shows why that's premature. The gap between what we know and what we're building is widening, not closing.

What Should Happen Next

Marcus wants three things: labelling (so people know they're talking to a bot, not a doctor), education (so people understand the limits), and regulation (so companies can't deploy medical AI without proving it's safe). None of this is radical. It's the baseline for any medical technology.

The challenge is that LLMs don't feel like medical devices. They feel like software. And software moves fast, iterates in public, and fixes bugs in production. That works for search engines. It doesn't work for diagnostic tools. The studies make that clear. The question is whether the industry will listen before the first high-profile failure.

Right now, the evidence says: don't trust your chatbot for medical advice. The risk isn't that it might be wrong - it's that it will be wrong, roughly half the time, and you won't know which half.

Source: Gary Marcus, "Don't Trust Your Chatbot for Medical Advice"