Gary Marcus has spent the past week reading peer-reviewed medical studies, and the results are grim. Four separate papers, published in the last few months, tested large language models on diagnostic reasoning and triage. The error rates hover near 50%. Worse, the models express high confidence even when they're catastrophically wrong.
This isn't Marcus being reflexively anti-AI. These are controlled studies from research teams testing whether LLMs like GPT-4 and Claude can handle real medical scenarios. The answer, consistently, is no - not yet, and not safely.
What the Studies Found
One study tested diagnostic reasoning across common presentations. The models got roughly half the cases wrong. Another tested triage decisions - the critical "does this person need emergency care or can they wait" judgement that shapes outcomes. Again, near-50% error rates, with dangerous misclassifications in both directions. Telling someone with sepsis to wait. Sending someone with indigestion to emergency.
The confidence problem is worse than the errors. A model that says "I'm not sure, consult a doctor" is annoying but safe. A model that says "this is definitely X" when it's actually Y is dangerous. The studies found LLMs consistently overstate certainty, which means users trust them more than they should.
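To make the overconfidence point concrete, here is a minimal sketch of how calibration is usually checked: group predictions by the confidence the model states, then compare that stated confidence to how often the answers were actually right. The numbers below are invented for illustration, not taken from the studies.

```python
# Calibration check with made-up numbers (not the studies' data).
# Each tuple: (model's stated confidence, whether the diagnosis was actually correct).
predictions = [
    (0.95, False), (0.92, False), (0.90, True), (0.88, False),
    (0.75, True),  (0.70, False), (0.65, True), (0.60, False),
    (0.55, True),  (0.50, False),
]

def calibration_gaps(preds, bins=(0.5, 0.7, 0.9, 1.01)):
    """Average |stated confidence - observed accuracy| within each confidence bin."""
    results = []
    for lo, hi in zip(bins, bins[1:]):
        bucket = [(c, ok) for c, ok in preds if lo <= c < hi]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        results.append((avg_conf, accuracy, abs(avg_conf - accuracy)))
    return results

for avg_conf, acc, gap in calibration_gaps(predictions):
    print(f"stated ~{avg_conf:.0%} confident, right {acc:.0%} of the time (gap {gap:.0%})")
```

A well-calibrated model shows small gaps at every confidence level. The pattern the studies describe is the opposite: the gap is largest exactly where the model sounds most sure, which is why a confident-sounding answer is not a trustworthy one.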
Marcus's core argument: we are deploying these systems faster than we understand their failure modes. Chatbots are already being marketed for health advice. People are already using them. And the research says they fail at the exact tasks people assume they're good at - pattern recognition, diagnostic thinking, risk assessment.
Why This Matters Now
The timing is crucial. We're past the "will AI be used in medicine?" debate. It's already happening. Patients ask ChatGPT about symptoms. Doctors use Claude to draft notes. Startups pitch diagnostic tools built on LLMs. The question isn't whether, it's how fast - and whether we're moving faster than the evidence supports.
Marcus isn't calling for a ban. He's calling for public education and regulatory oversight. People need to know these tools make mistakes, often, and with confidence. Doctors need to know the error rates aren't edge cases - they're structural limits of how LLMs work. And regulators need to treat medical AI like medical devices, not software updates.
The alternative, he argues, is amplified misinformation at scale. A chatbot that gives bad advice to one person is a problem. A chatbot embedded in a health app used by millions is a public health crisis. The studies suggest we're closer to the latter than the former, and the oversight isn't keeping pace.
The Harder Question
Here's what Marcus doesn't say but the studies imply: LLMs might be fundamentally the wrong tool for this job. Diagnostic reasoning isn't autocomplete. It's probabilistic inference over incomplete, noisy data, guided by domain knowledge and experience. LLMs are very good at predicting the next word. That overlaps with medical reasoning sometimes, but not reliably, and not in ways we can predict.
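For contrast, here is what "probabilistic inference over incomplete, noisy data" looks like in its simplest textbook form: a naive Bayes update over candidate diagnoses. The priors and likelihoods below are invented purely for illustration; nothing in an LLM's next-word objective guarantees it performs anything like this update consistently.

```python
# Toy Bayesian diagnostic update with invented numbers, purely for illustration.
# P(diagnosis | findings) is proportional to P(diagnosis) * product of P(finding | diagnosis).

priors = {"sepsis": 0.01, "flu": 0.30, "indigestion": 0.69}

# Likelihood of observing each finding under each diagnosis (made up).
likelihoods = {
    "fever":     {"sepsis": 0.90, "flu": 0.80, "indigestion": 0.05},
    "low_bp":    {"sepsis": 0.70, "flu": 0.05, "indigestion": 0.02},
    "confusion": {"sepsis": 0.60, "flu": 0.10, "indigestion": 0.01},
}

def posterior(findings):
    """Update the prior over diagnoses with each observed finding (naive Bayes)."""
    scores = dict(priors)
    for finding in findings:
        for dx in scores:
            scores[dx] *= likelihoods[finding][dx]
    total = sum(scores.values())
    return {dx: score / total for dx, score in scores.items()}

print(posterior(["fever"]))                         # flu remains the most likely explanation
print(posterior(["fever", "low_bp", "confusion"]))  # probability mass shifts sharply toward sepsis
```

The point isn't that this toy model is good medicine. It's that diagnosis has an explicit structure - priors, evidence, uncertainty that updates as findings accumulate - which next-word prediction only mimics when the training data happens to cooperate.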
The 50% error rate isn't a training problem or a prompt engineering problem. It's a signal that the architecture isn't suited to the task. You can make it better with more data, better fine-tuning, retrieval-augmented generation - but you can't make it safe without changing what it is. And we're deploying it anyway, because it sounds confident and it's easy to integrate.
The frustration in Marcus's piece is palpable. He's been saying this for months, and the deployment curve keeps outpacing the research. Every week, another health app adds an AI chat feature. Every week, another study shows why that's premature. The gap between what we know and what we're building is widening, not closing.
What Should Happen Next
Marcus wants three things: labelling (so people know they're talking to a bot, not a doctor), education (so people understand the limits), and regulation (so companies can't deploy medical AI without proving it's safe). None of this is radical. It's the baseline for any medical technology.
The challenge is that LLMs don't feel like medical devices. They feel like software. And software moves fast, iterates in public, and fixes bugs in production. That works for search engines. It doesn't work for diagnostic tools. The studies make that clear. The question is whether the industry will listen before the first high-profile failure.
Right now, the evidence says: don't trust your chatbot for medical advice. The risk isn't that it might be wrong - it's that it will be wrong, roughly half the time, and you won't know which half.