A Stanford research team gave frontier AI models a medical imaging test. The models scored at expert level. Then the researchers removed the images entirely - just text descriptions remained - and ran the test again.
The scores barely changed.
Gary Marcus flagged this study because it exposes something most people building with vision models haven't noticed yet: these systems are very good at generating plausible medical language and not particularly good at actually seeing.
The fabrication problem
The models weren't guessing randomly. They were synthesising answers that sounded clinically correct based on the text context alone - diagnosis labels, patient history, test parameters. When they did look at the images, they often used them to confirm what the text had already suggested, not to discover new information.
This is the fabrication problem Marcus keeps circling back to. It's not that the models produce random nonsense. It's that they produce confident, coherent, plausible-sounding output based on pattern matching, not understanding. And in domains like medical imaging, where a missed diagnosis has consequences, plausible isn't good enough.
The benchmark scores suggest competence. The mechanism behind the scores suggests something else entirely - statistical correlation dressed up as visual reasoning.
What stays safe from disruption
Marcus argues this has implications for which jobs are actually at risk. Tasks that require real spatial reasoning - architecture, film editing, engineering design - are harder to automate than the headlines suggest. If a model can't reliably distinguish meaningful visual information from textual priming, it's not going to replace the architect reviewing structural plans or the editor cutting a scene for emotional pacing.
The jobs most vulnerable aren't the ones requiring visual expertise. They're the ones requiring plausible-sounding text generation - customer service scripts, marketing copy, basic report writing. These are tasks where fabrication is harder to detect because there's no ground truth to check against. The output just needs to sound right.
The benchmark problem
This study is part of a bigger pattern Marcus has been tracking. Benchmark performance keeps hitting new highs while real-world capability stays stubbornly inconsistent. Models ace tests designed to measure understanding but fail basic reasoning tasks that weren't in the training data.
The medical imaging case is stark because the stakes are clear. But the same dynamic shows up everywhere vision models are deployed. They're very good at tasks that can be solved with statistical shortcuts - recognising common objects, labelling familiar scenes, matching text to images in predictable ways. They're much worse at tasks requiring actual spatial reasoning - understanding 3D structure, tracking object relationships over time, reasoning about physical causality.
The difference matters. If you're building systems that need reliable visual understanding - medical diagnosis, autonomous navigation, quality control in manufacturing - you can't trust benchmark scores alone. You need to test whether the model is actually seeing or just generating statistically likely responses based on textual cues.
What this means for builders
If you're deploying vision models in production, Marcus's point is worth sitting with. Don't assume competence from benchmark performance. Test the edge cases. Remove the textual scaffolding and see if the visual reasoning holds up. And in high-stakes domains, keep a human in the loop - not because AI can't help, but because plausible fabrication is harder to catch than obvious failure.
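The "remove the textual scaffolding" check can be sketched as a simple ablation harness: score the same evaluation set twice, once with images and once with text only, and compare. Everything below is illustrative, not a real API - `query_model` is a hypothetical stand-in for whatever model call you actually use, and the stub deliberately answers from the text alone to show what a shortcut-driven model looks like in this test.

```python
def query_model(image, text):
    # Hypothetical stand-in for a real vision-language model API.
    # This stub ignores the image and answers from textual cues only,
    # mimicking the shortcut behaviour the Stanford study surfaced.
    return "pneumonia" if "cough" in text else "normal"

def accuracy(cases, use_image=True):
    # Score the model over the evaluation set, optionally withholding
    # the image to ablate the visual input.
    correct = 0
    for case in cases:
        image = case["image"] if use_image else None
        if query_model(image, case["report"]) == case["label"]:
            correct += 1
    return correct / len(cases)

# Tiny illustrative evaluation set (made-up filenames and labels).
cases = [
    {"image": "xray_1.png", "report": "58yo, persistent cough", "label": "pneumonia"},
    {"image": "xray_2.png", "report": "routine screening", "label": "normal"},
]

with_images = accuracy(cases, use_image=True)
without_images = accuracy(cases, use_image=False)

# If removing the images barely moves the score, the model is leaning
# on textual priming rather than visual evidence.
print(f"with images: {with_images:.2f}, text only: {without_images:.2f}")
```

With the stub, both scores come out identical - exactly the red flag the study describes. In a real deployment you would run this over a held-out set with ground-truth labels and treat a small with-versus-without gap as evidence the model isn't actually doing visual reasoning.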
The Stanford study isn't an argument against using vision models. It's an argument for understanding what they're actually doing under the hood - and building systems that account for the gap between statistical correlation and genuine understanding.