A developer spent 95 days tracking the performance of Claude Code and Codex on SWE-Bench-Pro, a benchmark that tests AI models on real-world software engineering tasks. The data reveals something nobody was talking about: Claude Opus's pass rate spiked 11 points from version 4.6 to 4.7, held steady for a month, then dropped 13 points this week.
The tracking method borrows from finance: candlestick charts, the kind traders use for stock prices, applied to model performance. Each candle shows the range of pass rates across test runs, making it easy to spot when a model's behaviour shifts. The full dataset and charts are on Dev.to, and they tell a story about what happens when AI models get updated in production.
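As a rough illustration of the candlestick idea, here is a minimal Python sketch that folds a chronological list of pass rates into one candle per period. The `Candle` type, the `build_candles` helper, the weekly grouping and the numbers are all assumptions made for this example, not the developer's actual code or data.

```python
from typing import NamedTuple


class Candle(NamedTuple):
    open: float   # first pass rate recorded in the period
    high: float   # best pass rate in the period
    low: float    # worst pass rate in the period
    close: float  # last pass rate recorded in the period


def build_candles(runs):
    """Group (period, pass_rate) pairs into one candle per period.

    Assumes `runs` is already in chronological order.
    """
    grouped = {}
    for period, rate in runs:
        grouped.setdefault(period, []).append(rate)
    return {
        period: Candle(rates[0], max(rates), min(rates), rates[-1])
        for period, rates in grouped.items()
    }


# Made-up pass rates for illustration only (hypothetical, not the real dataset).
runs = [
    ("week-1", 0.50), ("week-1", 0.52), ("week-1", 0.51),
    ("week-2", 0.49), ("week-2", 0.40), ("week-2", 0.38),
]
candles = build_candles(runs)
for period, candle in candles.items():
    print(period, candle)
```

A tall candle whose close sits well below its open, like the hypothetical second week here, is the visual signature of a shift within the period.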
The Pattern Nobody Expected
Claude Opus 4.6 was scoring around 40% on SWE-Bench-Pro tasks. When version 4.7 launched, pass rates jumped to 51%, an 11-point improvement. For a month, the model held that line. Then, this week, it dropped to 38%. By the developer's own criteria that is not measurement noise: the drop crosses the statistical significance threshold they set.
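The article does not say which significance test the developer used or how many tasks each run covers. One common way to ask whether a gap like 51% versus 38% is more than noise is a two-proportion z-test, sketched below. The function names and both sample sizes are assumptions chosen purely for illustration.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two pass rates (pooled variance)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err


def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Pass rates from the article; task counts per run are ASSUMED, not reported.
for n_tasks in (100, 300):
    z = two_proportion_z(0.51, n_tasks, 0.38, n_tasks)
    print(f"n={n_tasks}: z={z:.2f}, p={two_sided_p_value(z):.4f}")
```

The same 13-point gap yields a larger z statistic as the number of tasks per run grows, which is why any claim of significance depends on the sample size behind each data point.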
Codex, by contrast, was flat across three releases. No spikes, no drops. Boring, but predictable. For developers building on top of these models, predictability often matters more than peak performance. A model that scores 45% every time is easier to build around than one that swings between 38% and 51%.
The implications are immediate for anyone using Claude Code in production. If your workflow relies on specific pass rates, you're now dealing with a 13-point drop from last month's baseline. That's the difference between a tool that handles most of your boilerplate and one that needs constant supervision.
What SWE-Bench-Pro Actually Measures
SWE-Bench-Pro isn't a toy benchmark. It pulls real issues from GitHub repositories and asks models to generate patches that solve them. The tests run against actual codebases, with real dependencies and edge cases. Pass rates reflect whether the model's output compiles, runs, and solves the problem without breaking existing functionality.
This is harder than generating isolated code snippets. The model needs to understand context, maintain coding style, handle imports correctly, and avoid introducing bugs. A 40% pass rate means the model solves roughly four out of ten of the benchmark's issues without human intervention. That's useful, but it's not replacing a developer - it's augmenting one.
The drop from 51% to 38% this week suggests something changed under the hood. Anthropic hasn't announced a new release, which raises questions about what triggered the shift. Model updates sometimes happen silently - performance optimisations, safety guardrails, or infrastructure changes that affect output without a version bump.
Why This Matters for Builders
Developers building on top of Claude Code face a specific problem: you can't always freeze a model's behaviour the way you can pin a software dependency. When the service behind the API changes, your application gets the new behaviour whether you want it or not. If that behaviour includes a 13-point drop in pass rates, your product just got worse overnight.
The alternative is to version-pin where possible, but that's not always an option. API providers sunset old models, forcing upgrades. And even when version pinning is available, you're choosing between stability and access to improvements. A model that held at 51% for a month looked like a safe bet - until it didn't.
This is where open-source models offer a different trade-off. You can download the weights and run them locally, so the model only changes when you choose to change it. The performance might be lower, but the predictability is higher. For production systems, that trade-off starts to look attractive.
The Bigger Picture
The fact that one developer tracked this for 95 days and found something significant says a lot about the current state of AI deployment. Most companies using these models don't have visibility into performance changes over time. They notice when something breaks, but gradual degradation or sudden spikes go undetected.
This is a monitoring problem. If you're building on AI models, you need the same kind of observability you'd apply to any critical dependency. Track pass rates, log failures, measure variance. The candlestick approach this developer used is simple but effective - it shows trends, spikes, and drops in a format that makes changes obvious.
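A minimal sketch of that kind of check, in Python: compare the latest pass rate against a baseline built from recent history and flag large deviations. The `has_drifted` name, the three-sigma threshold, the minimum history length and the spread floor are all assumptions for illustration, and would need tuning against your own data.

```python
import statistics


def has_drifted(history, latest, min_points=5, sigmas=3.0, floor=0.02):
    """Return True when `latest` sits far outside the recent baseline.

    history: recent pass rates (0.0 to 1.0) that form the baseline
    floor:   minimum spread, so a very quiet history does not trigger
             alerts on tiny wobbles (an assumed value)
    """
    if len(history) < min_points:
        return False  # not enough data to form a baseline yet
    baseline = statistics.fmean(history)
    spread = max(statistics.stdev(history), floor)
    return abs(latest - baseline) / spread >= sigmas


# Hypothetical history hovering around 51%, then two candidate new readings.
history = [0.51, 0.50, 0.52, 0.51, 0.50, 0.52]
print(has_drifted(history, 0.38))  # a 13-point drop from the baseline
print(has_drifted(history, 0.50))  # an ordinary wobble
```

The check is symmetric on purpose: an unexplained jump is as worth investigating as a drop, since either one means the dependency changed.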
The data also challenges the narrative that AI models only get better over time. Claude Opus 4.7 improved, then regressed. That's normal software behaviour - regressions happen, especially when teams are shipping fast - but it's a reminder that AI models are infrastructure, not magic. They need the same kind of operational discipline as databases, APIs, and authentication systems.
For anyone relying on Claude Code right now, the message is clear: measure what you depend on. If your workflow assumes a certain pass rate, track it. If you're building a product on top of these models, build in tolerance for variance. And if you need guaranteed performance, consider whether you need to own the model entirely.