Machine learning models are remarkably good at making predictions. They're less good at explaining WHY they made those predictions. And in fields like medical imaging, where a wrong diagnosis can have serious consequences, that opacity is a real problem.
Researchers at MIT have developed a method that forces AI models to use concepts they've already learned when explaining their decisions. The result? Explanations that are both more accurate and more useful to the humans making final decisions.
The Explainability Problem
Here's the issue: current AI models learn patterns in data, but they don't naturally explain those patterns in ways humans understand. When a model identifies a tumour in a scan, it might be picking up on features we recognise - like shape or texture - but it might also be using correlations we can't see or wouldn't trust.
Existing explanation methods try to reverse-engineer what the model was thinking. They look at which parts of an image the model focused on and try to describe that in human terms. But these explanations are often unreliable. The model might say it's looking at tumour texture when it's actually relying on something else entirely.
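The post-hoc methods described above often work by asking which inputs most influenced the prediction. A minimal sketch of one such technique, gradient saliency, is below; the toy model and image are hypothetical stand-ins, not the paper's setup:

```python
import torch
import torch.nn as nn

# Hypothetical toy classifier standing in for a real imaging model;
# any differentiable model would work the same way.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 2))
model.eval()

image = torch.rand(1, 1, 32, 32, requires_grad=True)

# Gradient saliency: how much does each pixel influence the winning class score?
logits = model(image)
logits[0, logits.argmax()].backward()
saliency = image.grad.abs().squeeze()  # one importance score per pixel

print(saliency.shape)  # a 32x32 heatmap over the input
```

Note what this gives you: a heatmap of *where* the model looked, with no guarantee of *what* it saw there - exactly the reliability gap the article describes.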
The MIT team took a different approach. Instead of trying to extract explanations after the fact, they force the model to use specific, understandable concepts WHILE it's making predictions.
Extracting Concepts the Model Already Knows
The breakthrough is in how they identify concepts. The researchers developed a technique to extract concepts that are already embedded in the model's internal representations. Think of it like this: the model has learned patterns during training, but those patterns are scattered across millions of parameters. The new method finds those patterns and labels them with terms doctors actually use.
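The paper's exact extraction procedure isn't detailed here, but a common way to test whether a concept is already encoded in a model's internal representations is to fit a linear probe on hidden activations - the idea behind concept activation vectors. A sketch under that assumption, with synthetic activations and a hypothetical concept label standing in for real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical setup: hidden-layer activations for scans labelled with and
# without a concept such as "spiculated margin". Real activations would come
# from an intermediate layer of a trained network.
acts_with_concept = rng.normal(loc=1.0, size=(100, 64))
acts_without = rng.normal(loc=-1.0, size=(100, 64))

X = np.vstack([acts_with_concept, acts_without])
y = np.array([1] * 100 + [0] * 100)

# If a linear probe separates the groups, the concept is (roughly) linearly
# encoded; the probe's weight vector points along the "concept direction".
probe = LogisticRegression().fit(X, y)
concept_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

print(probe.score(X, y))  # near 1.0 on this clearly separated toy data
```

The same probing idea scales to many concepts at once: one direction per clinical term, recovered from parameters the model already learned.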
Once these concepts are extracted, the model is required to base its explanations on them. It can't point to vague regions of an image or rely on hidden correlations. It has to say: "I'm seeing this shape, this texture, this pattern" - using the vocabulary it's been taught.
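Routing every prediction through named concepts resembles a concept bottleneck architecture: the final decision can only see the concept scores, so those scores *are* the explanation. A minimal sketch - the concept names, layer sizes, and wiring are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Hypothetical concept vocabulary a radiologist might recognise.
CONCEPTS = ["irregular shape", "heterogeneous texture", "spiculated margin"]

class ConceptBottleneck(nn.Module):
    """Scores named concepts first, then diagnoses only from those scores."""
    def __init__(self, n_features=128, n_classes=2):
        super().__init__()
        self.to_concepts = nn.Linear(n_features, len(CONCEPTS))
        self.to_label = nn.Linear(len(CONCEPTS), n_classes)

    def forward(self, x):
        concept_scores = torch.sigmoid(self.to_concepts(x))
        # The classifier head receives ONLY the concept scores, so hidden
        # correlations in the raw features cannot bypass the explanation.
        return self.to_label(concept_scores), concept_scores

model = ConceptBottleneck()
logits, scores = model(torch.rand(1, 128))
for name, score in zip(CONCEPTS, scores[0]):
    print(f"{name}: {score:.2f}")
```

The design choice matters: because the bottleneck is architectural, the explanation is faithful by construction rather than reconstructed after the fact.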
The results are striking. In tests on medical imaging datasets, the method improved both the accuracy of predictions AND the faithfulness of explanations. The stated explanations aligned with what the model was actually computing, not just with what researchers hoped it was doing.
Why This Matters Beyond Medicine
Medical imaging is the obvious use case, but the implications are broader. Any field where humans need to trust AI decisions - loan approvals, hiring systems, autonomous vehicles - could benefit from this approach.
The key insight is this: interpretability shouldn't be an afterthought. If you build it into how the model makes decisions, you get explanations that are both more accurate and more trustworthy. You also catch cases where the model is relying on spurious correlations - the kinds of patterns that work in training data but fail in the real world.
For developers building AI systems, this research offers a practical path forward. Instead of choosing between accuracy and interpretability, you can design models that deliver both. That's not just better engineering - it's essential for deploying AI in situations where lives are on the line.
The full research is available from MIT AI News, where the team has shared both their methodology and datasets for further experimentation.