New research reveals a structural inefficiency in how frontier reasoning models think: between 61% and 93% of their internal reasoning steps are redundant. These steps don't change the final answer. They're cognitive wheel-spinning, baked into the training process itself.
The study, published on arXiv, examined models trained with reinforcement learning to "think out loud" through multi-step reasoning. The promise was transparency - we could see the model's work, verify its logic, catch errors early. The reality is messier.
Researchers found that if you remove most of a model's reasoning chain - the internal monologue it generates before answering - the final answer stays the same. The model arrives at the correct conclusion whether it shows 100 steps of work or 10. The extra 90 steps aren't wrong. They're just... there. Decorative scaffolding around a decision the model had already made.
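To make the idea of a "redundant step" concrete, here is a minimal sketch of a leave-one-out ablation test. It is not the study's actual procedure, which the article does not describe in detail. The function `redundant_fraction`, the `answer_fn` callback and the toy model below are all hypothetical stand-ins. In practice `answer_fn` would call a real model with a truncated reasoning chain.

```python
from typing import Callable, Sequence


def redundant_fraction(
    steps: Sequence[str],
    answer_fn: Callable[[Sequence[str]], str],
) -> float:
    """Fraction of steps whose individual removal leaves the answer unchanged."""
    if not steps:
        return 0.0
    baseline = answer_fn(steps)
    redundant = 0
    for i in range(len(steps)):
        # Drop step i and ask for the answer again.
        ablated = list(steps[:i]) + list(steps[i + 1:])
        if answer_fn(ablated) == baseline:
            redundant += 1
    return redundant / len(steps)


def toy_answer(steps: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: only 'add N' steps affect the answer."""
    total = 0
    for step in steps:
        if step.startswith("add "):
            total += int(step.split()[1])
    return str(total)


chain = [
    "restate the problem",
    "add 2",
    "double-check the wording",
    "add 3",
    "summarise",
]
print(redundant_fraction(chain, toy_answer))  # 0.6
```

In this toy chain, three of the five steps can each be removed without changing the answer, so 60% of the chain counts as redundant. Removing one step at a time is a simplification: it will miss steps that are only redundant in combination with others.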
Why This Happens
The redundancy isn't a bug in any single model. It's structural to how these systems are trained. When you reward a model for reaching the right answer through multi-step reasoning, you reward correctness, not efficiency. Nothing in that signal penalises length, so the model learns to generate reasoning chains that look thorough rather than chains that are minimal and necessary.
Think of it like a student who's learned that longer essays get better marks. They pad. They repeat themselves in slightly different words. They add steps that don't advance the argument but make the work look more substantial. The teacher (the reward model) can't tell the difference between genuine depth and performative elaboration, so the student optimises for length, not clarity.
Frontier models are doing the same thing. They've learned that verbose reasoning chains correlate with high rewards, so they generate verbose reasoning chains. The system can't distinguish between a step that changes the outcome and a step that just looks plausible.
What This Means for Inference Costs
Redundant reasoning isn't just an academic curiosity. It's expensive. Every reasoning step costs tokens. Tokens cost money. If 90% of those steps are redundant - near the top of the study's 61-93% range - you're paying for cognitive theatre, not cognitive work.
For developers building on reasoning models, this changes the maths. Reasoning tokens are billed whether or not they add value. If most don't, you're buying hallway conversations when you needed a meeting. The output is the same, but at 90% redundancy the reasoning portion of the bill is 10x what it needed to be.
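The arithmetic behind that multiplier is simple enough to sketch. If a fraction r of reasoning tokens is redundant, you pay 1 / (1 - r) times what the necessary tokens alone would cost. The function names, call volume, token count and price below are illustrative assumptions, not figures from the study or from any provider's price list.

```python
def overhead_multiplier(redundant_fraction: float) -> float:
    """Total reasoning tokens divided by the tokens that were actually needed."""
    if not 0.0 <= redundant_fraction < 1.0:
        raise ValueError("redundant_fraction must be in [0, 1)")
    return 1.0 / (1.0 - redundant_fraction)


def wasted_spend(
    calls: int,
    reasoning_tokens_per_call: int,
    price_per_million_tokens: float,
    redundant_fraction: float,
) -> float:
    """Money spent on reasoning tokens that did not change the answer."""
    total_tokens = calls * reasoning_tokens_per_call
    return total_tokens * redundant_fraction * price_per_million_tokens / 1_000_000


# The study's reported range, plus the 90% figure used in this article.
for r in (0.61, 0.90, 0.93):
    print(f"{r:.0%} redundant -> {overhead_multiplier(r):.2f}x")
# 61% redundant -> 2.56x
# 90% redundant -> 10.00x
# 93% redundant -> 14.29x

# Hypothetical workload: 1M calls, 2,000 reasoning tokens each, 10.00 per million tokens.
print(f"{wasted_spend(1_000_000, 2_000, 10.0, 0.90):,.0f}")  # 18,000
```

Across the reported 61-93% range, the overhead on reasoning tokens runs from roughly 2.5x to just over 14x. The 10x figure sits inside that range but is not its floor.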
This also matters for latency. Reasoning models are slower than direct-answer models because they generate long chains of thought before producing a response. If most of that chain is redundant, we're waiting for nothing. A model that could answer in 10 steps instead takes 100, and the user sits there watching a spinner.
The Training Problem
Fixing this requires rethinking how reasoning models are trained. Right now, the reward signal only cares about the final answer. If the model gets it right, all the steps that led there get reinforced - even the redundant ones. The system has no incentive to prune unnecessary reasoning.
What would help: rewarding economy as well as correctness. Penalise models for taking more steps than necessary. Train them to recognise when they've gathered enough information to commit to an answer, rather than continuing to elaborate. Teach them to stop thinking when thinking stops helping.
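One way to picture a penalty on unnecessary steps is a shaped reward that keeps the correctness signal but subtracts a small amount for every step over a budget. This is a minimal sketch under stated assumptions, not a method taken from the study. The names `shaped_reward`, `step_budget`, `penalty_per_step` and `max_penalty` are hypothetical, and choosing a sensible budget is exactly the hard part.

```python
def shaped_reward(
    is_correct: bool,
    num_steps: int,
    step_budget: int,
    penalty_per_step: float = 0.005,
    max_penalty: float = 0.5,
) -> float:
    """Correctness reward minus a capped penalty for steps beyond the budget.

    Wrong answers always score 0.0, so the length penalty never makes a short
    wrong answer look better than a long correct one.
    """
    if not is_correct:
        return 0.0
    excess_steps = max(0, num_steps - step_budget)
    penalty = min(max_penalty, penalty_per_step * excess_steps)
    return 1.0 - penalty


# A correct answer within budget keeps the full reward.
concise = shaped_reward(is_correct=True, num_steps=10, step_budget=10)

# A correct answer that takes 100 steps against a budget of 10 scores lower,
# but still higher than any wrong answer.
verbose = shaped_reward(is_correct=True, num_steps=100, step_budget=10)
wrong = shaped_reward(is_correct=False, num_steps=5, step_budget=10)

print(concise > verbose > wrong)  # True
```

Capping the penalty below the correctness reward is a deliberate choice in this sketch: it keeps the ordering "concise and correct, then verbose and correct, then wrong", so the model is never rewarded for stopping early at the expense of the answer.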
This is harder than it sounds. You'd need a way to measure necessity - to distinguish between a step that adds new information and a step that rephrases existing information. That requires supervision at the chain level, not just at the answer level. Most training pipelines don't have that.
Why It Matters Now
Reasoning models are being positioned as the next frontier in AI capability. OpenAI's o1, Google's Gemini reasoning mode, and other systems promise better performance on complex tasks by "thinking harder". But if 90% of that thinking is redundant, the performance gains come at a cost that doesn't scale.
For businesses evaluating reasoning models, this research is a reminder to test inference costs in production, not just capability benchmarks. A model that scores 5% higher on a reasoning benchmark but takes 10x longer to respond might not be worth it. The redundancy tax is real, and it compounds across millions of API calls.
The good news: this is a training problem, not a capability ceiling. Models can reason efficiently - they're just not trained to. The next generation of reasoning systems will need to optimise for both correctness and economy. Until then, expect to pay for a lot of thinking that doesn't actually think.