Anthropic shipped Claude Opus 4.7 last week. The headline claim - improvements across every single benchmark compared to 4.6 - undersells what actually changed. This isn't a minor update. The model got measurably smarter while getting cheaper to run.
The full analysis from Latent Space walks through the technical details, but the pattern is clear: better reasoning scores, stronger vision capabilities, improved instruction following, and faster response times. Meanwhile, token costs dropped and context handling improved.
That combination - better and cheaper - is rare. Usually you trade one for the other. Opus 4.7 moved both needles in the right direction.
The Tokenizer Question
Buried in the technical details is something developers noticed immediately: tokenizer changes. The way Claude breaks text into tokens affects everything - how much context fits in a prompt, how many tokens a response costs, how the model handles different languages.
Early testing suggests Opus 4.7 tokenizes more efficiently than 4.6, which means more actual content fits in the same token budget. For developers working near context limits, that's not a minor improvement. It changes what's possible in a single prompt.
The efficiency gain compounds when you're making thousands of API calls. Fewer tokens per request mean lower costs, but also faster processing and a lower chance of hitting rate limits. The economics shift.
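To see how a tokenizer change compounds at volume, here is a back-of-the-envelope sketch. Every number in it is hypothetical - the request volume, the per-million-token price, and the ~10% efficiency gain are illustrative assumptions, not published figures.

```python
# Illustrative sketch: how a ~10% tokenizer efficiency gain compounds
# across a high-volume workload. All numbers are hypothetical.

def monthly_cost(tokens_per_request: int, requests_per_month: int,
                 price_per_million: float) -> float:
    """Total spend for a month of API traffic."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1_000_000 * price_per_million

# Assumed figures - not real pricing or traffic data.
REQUESTS = 500_000
PRICE = 15.0  # dollars per million input tokens (hypothetical)

old_cost = monthly_cost(2_000, REQUESTS, PRICE)  # older tokenizer
new_cost = monthly_cost(1_800, REQUESTS, PRICE)  # ~10% fewer tokens

print(f"old: ${old_cost:,.0f}  new: ${new_cost:,.0f}  "
      f"saved: {1 - new_cost / old_cost:.0%}")
```

The point isn't the exact figures - it's that a per-request saving that looks trivial scales linearly with call volume, which is why tokenizer efficiency shows up as a line item on high-volume budgets.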
Vision Gets Sharper
Claude's vision capabilities jumped noticeably. Users testing document analysis, chart interpretation, and visual reasoning tasks reported clearer outputs and fewer hallucinations. The model seems more confident about what it's actually seeing versus what it's inferring.
This matters for practical applications. A model that can reliably extract data from invoices, read handwritten notes, or interpret technical diagrams opens up automation opportunities that weren't reliable six months ago. Vision was Claude's weak point. It's catching up fast.
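As a concrete illustration of the invoice-extraction use case, here is a sketch of a vision request in the general shape of Anthropic's Messages API. The model id is a placeholder, the payload is assembled but never sent, and the field layout should be treated as an approximation rather than a verified reference.

```python
# Sketch of a document-extraction request shaped like Anthropic's
# Messages API. The model name is a placeholder; the payload is
# built but not sent, so treat field names as an approximation.
import base64
import json

def build_invoice_request(image_bytes: bytes) -> dict:
    """Assemble a vision request asking the model to pull
    structured fields out of an invoice image."""
    return {
        "model": "claude-opus-4-7",  # hypothetical model id
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {
                     "type": "base64",
                     "media_type": "image/png",
                     "data": base64.b64encode(image_bytes).decode(),
                 }},
                {"type": "text",
                 "text": "Extract vendor, date, and total as JSON."},
            ],
        }],
    }

payload = build_invoice_request(b"\x89PNG placeholder bytes")
print(json.dumps(payload)[:80])
```

The design choice worth noting: asking the model to return structured JSON rather than prose is what turns "the model can see the invoice" into an automation step you can actually parse downstream.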
The Mythos Debate
Inside the AI community, there's speculation that Opus 4.7 might be a distilled version of Mythos - a rumoured larger model Anthropic is training. The theory goes: train a massive model, then compress its knowledge into a smaller, faster version for production use.
If true, that would explain the across-the-board improvements. Distillation done well can preserve most of a large model's capabilities while cutting inference costs dramatically. It's how you get both better performance and lower prices.
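The distillation idea behind the Mythos theory can be sketched in a few lines. This is a minimal, pure-Python illustration of the standard technique - a student trained to match a teacher's temperature-softened output distribution - not a claim about how Anthropic trains anything. The logits and temperature below are made up.

```python
# Minimal sketch of knowledge distillation: a student model learns to
# match the teacher's softened output distribution, not just the hard
# labels. Pure-Python illustration; real training uses a framework.
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax. A higher temperature exposes more
    of the teacher's 'dark knowledge' about near-miss classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.2, 1.1, -0.5]       # made-up logits
close_student = [3.0, 1.0, -0.4]  # tracks the teacher closely
far_student = [-0.5, 1.1, 3.2]    # disagrees with the teacher

print(distill_loss(teacher, close_student))  # small
print(distill_loss(teacher, far_student))    # large
```

Minimizing this loss over many examples is how a compact student can inherit most of a much larger teacher's behaviour - which is exactly the shape of result the Mythos theory is invoked to explain.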
Anthropic hasn't confirmed this. But the benchmark jumps are consistent with distillation from a stronger teacher model. The question is whether Mythos itself will ever ship publicly, or if it exists purely as a training tool for smaller, practical models.
What Builders Are Saying
Hands-on feedback from developers using Opus 4.7 in production is consistently positive. Faster response times. Better instruction following. More reliable outputs on edge cases. The kinds of improvements that don't show up in benchmarks but matter enormously in real applications.
Several developers noted that tasks requiring multi-step reasoning - like debugging code or analysing complex documents - felt noticeably more reliable. The model seems to hold context better and make fewer logical leaps that don't quite connect.
Token economics also shifted. Some developers reported 15-20% cost reductions on typical workloads due to more efficient tokenization and faster completion times. That's enough to change budget calculations for high-volume applications.
The Competitive Picture
Opus 4.7 puts Claude firmly back in the conversation as a GPT-4 alternative for production use. For months, Claude was the fallback option - good enough, but not the first choice. This release narrows that gap considerably.
The vision improvements are particularly significant. OpenAI's GPT-4 Vision set the standard. Claude was catching up. Now it's close enough that developers are choosing based on other factors - API reliability, rate limits, specific use case performance - rather than assuming GPT-4 Vision is the only option.
For businesses evaluating which model to build on, Opus 4.7 changes the calculation. It's not just about which model scores highest on benchmarks. It's about which model delivers reliable results at a price point that works for your use case. Claude just became more competitive on both fronts.
The pace here is what stands out. Opus 4.6 launched less than a year ago. We're already seeing meaningful improvements across the board. That cadence suggests these models aren't plateauing - they're still climbing steeply.