Inside Marbl Sunday, 19 April 2026

The Most Expensive Tokens Are the Ones You Spend Recovering from Bad Decisions


Claude Opus 4.7 launched this week using 35% fewer tokens than its predecessor while topping every coding benchmark. At the same time, I spent six hours trying to work around a Cloudflare Pages Functions cache that refused to update, lost half a Saturday to A&E, and still shipped eleven commits on Luma between hospital trips.

The week proved something I've been suspecting for months: we are now in an era where AI output quality depends less on the model and more on what you feed it, how you structure the conversation, and when you stop talking.

The 35% token reduction nobody's talking about

Opus 4.7 uses 35% fewer tokens than its predecessor. And it is beating that predecessor on every major coding benchmark. That is not a minor efficiency gain. That is a shift in how these systems work.

Cost per task is dropping while capability rises. The implication is straightforward: if you are still throwing maximum context at every problem, you are overpaying for reasoning you do not need. The model has become better at working with less. Your prompts have not caught up.

I have been running Luma's article processing pipeline on Claude for months now. We cache the full article context once at the start of each session. First call pays full price. Every call after that costs roughly 10% of the original. The model sees the same system prompt, the same style guide, the same structural requirements every time. We are not re-transmitting that information. We are referencing it.

Prompt caching is not a nice-to-have anymore. It is table stakes. If your pipeline makes repeated API calls with shared context and you are not caching it, you are burning money on redundant tokens.
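As a minimal sketch of how this looks in practice with Anthropic's Messages API: the cache_control marker on a system block is the real caching mechanism, while the helper function, the model ID, and the max_tokens value are illustrative choices, not the Luma pipeline itself.

```python
def cached_request(system_text: str, user_text: str,
                   model: str = "claude-sonnet-4-5") -> dict:
    """Build a Messages API payload that marks the shared system
    prompt (style guide, structural requirements) as cacheable.

    The first call pays full price to write the cache; later calls
    that send the identical prefix read it at the discounted rate.
    Model ID and token limit here are placeholder assumptions.
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # cache_control tells the API this block is a stable
                # prefix worth caching across calls in the session
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```

The key design point is that everything shared goes into the cached prefix and only the per-article text varies, so the expensive part of the prompt is transmitted once and referenced thereafter.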

Agent-legible code is a token strategy

Here is something I did not expect to care about a year ago: I now treat "can an AI understand this without me explaining it" as a code quality metric.

Readable code means fewer tokens to reason about it. A 200-line function with four nested conditionals and variable names like tmp or data2 forces the model to burn context reconstructing what you meant. Ten small functions with explicit names and clear types? The model reads it once and moves on.

This is not about making code "AI-friendly" at the expense of human readability. It is the opposite. The things that make code legible to an AI - clear naming, small functions, explicit types, minimal nesting - are the same things that make it legible to the next developer. Who might be you, six months later, trying to remember why you wrote it that way.

Agent-legible code is just good code. The token efficiency is a side effect.
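A hypothetical before/after makes the point concrete. Instead of one function with four nested conditionals, the same publishing check becomes small predicates with explicit names; every function and field name below is invented for illustration.

```python
def is_publishable(article: dict) -> bool:
    # Each condition reads as a sentence; a model (or a colleague)
    # never has to reconstruct what "tmp" or "data2" meant.
    return (
        has_title(article)
        and is_long_enough(article)
        and not is_draft(article)
    )

def has_title(article: dict) -> bool:
    return bool(article.get("title", "").strip())

def is_long_enough(article: dict, min_words: int = 300) -> bool:
    return len(article.get("body", "").split()) >= min_words

def is_draft(article: dict) -> bool:
    return article.get("status") == "draft"
```

Each helper is cheap to read in isolation, so a reviewer, human or AI, spends tokens on the logic that matters rather than on decoding the structure.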

Match the model to the task, not the budget

I tier models now. Opus for architecture decisions. Sonnet for implementation. Haiku for classification and routing. This is not cost minimisation. This is cognition matching.

If the task requires reasoning about trade-offs, considering edge cases, or evaluating multiple approaches, I use Opus. If the task is implementing a well-defined function or translating a clear spec into code, Sonnet does it faster and cheaper. If the task is "does this text contain a URL" or "which category does this belong to", Haiku is more than sufficient.

Defaulting to the most capable model for everything is like hiring a senior architect to paint the walls. You will get painted walls. You will also get an invoice that makes you wince.

The right model is the smallest one that will do the job. Everything else is waste.
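The tiering above can be sketched as a simple dispatch table. The task categories mirror the ones in this section; the model tier names are shorthand, not exact API identifiers.

```python
from enum import Enum

class Task(Enum):
    ARCHITECTURE = "architecture"        # trade-offs, edge cases, design
    IMPLEMENTATION = "implementation"    # clear spec to working code
    CLASSIFICATION = "classification"    # routing, yes/no checks

# Tier names are shorthand placeholders, not API model IDs.
MODEL_FOR_TASK = {
    Task.ARCHITECTURE: "opus",
    Task.IMPLEMENTATION: "sonnet",
    Task.CLASSIFICATION: "haiku",
}

def pick_model(task: Task) -> str:
    """Return the smallest model tier that will do the job."""
    return MODEL_FOR_TASK[task]
```

The useful property of making this an explicit table is that the routing decision is reviewable: when a task category is misfiled, you change one line, not a scattering of hardcoded model names.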

Three independent judges

I have started running every significant design decision past three models from three different providers. Claude, ChatGPT, Gemini. Same question. Three independent answers.

Where they agree, I ship. Where they split, I investigate before committing. This is not about building our Moirai system or some elaborate consensus engine. It is about catching blind spots. Each model has biases. Each training set has gaps. Asking the same question three times surfaces the places where the answer is less certain than it appears.

You do not need infrastructure for this - although we're building it for you. You need three browser tabs and five minutes. Paste the question. Read the answers. Look for divergence. When all three models give you the same recommendation, it is probably solid. When one disagrees, read the minority report. It is usually pointing at something real.

When the platform layer is broken, stop optimising inside the problem

This week I spent six hours trying to fix a Cloudflare Pages Functions cache that would not update. I deleted the function. I redeployed the project. I changed the route. I added cache-busting headers. Nothing worked. The old version kept serving.

Four hours in, I should have stopped. The premise was wrong. I was trying to fix a platform-layer failure from inside the application layer. That does not work. What works is stepping out and re-architecting what depends on it.

I decoupled the function entirely. Moved the logic elsewhere. Removed the dependency. Problem solved. Not because I fixed the cache. Because I stopped needing it to work.

Sometimes the right optimisation is to stop optimising. If you are four hours into debugging something that should have taken thirty minutes, the problem is not your debugging technique. It is your starting assumption.

Five things you can do this week

1. Turn on prompt caching. If your pipeline makes repeated API calls with shared context - a system prompt, a knowledge base, a style guide - cache it. Anthropic and OpenAI both support it. The cost drop is immediate and material. First call pays full price. Every call after costs a fraction.

2. Match the model to the task, not the budget. Default to the smallest model that will do the job. Use Opus only where reasoning matters. Most classification and routing work is Haiku-tier. Paying for capability you do not need is not being thorough. It is being wasteful.

3. Write code for two readers - you and the agent. Clear naming, small functions, explicit types. An AI reviewing a 200-line function with four nested conditionals burns a lot more tokens than reviewing ten small functions. Agent-legible code is not a compromise. It is just better code.

4. Run important decisions past multiple models. You do not need a Moirai system to do this. Paste the same design question into Claude, ChatGPT, and Gemini. Where they agree, ship. Where they split, investigate before committing. Consensus is a signal. Divergence is information.

5. When a platform layer is broken, stop trying to fix it inside the problem. If you are four hours into debugging something that should have taken thirty minutes, the premise is usually wrong. Step out. Re-architect what depends on it. Decoupling is a first-class solution, not a workaround.

The most expensive tokens

The most expensive tokens in any pipeline are the ones you spend recovering from a bad decision shipped too quickly. Everything else is pennies.

Caching cuts costs by 90%. Model tiering cuts costs by another 70%. Agent-legible code reduces token burn on every interaction. These are all real savings. But none of them matter if you ship something that breaks and spend the next week firefighting.
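Those savings compound rather than add, which is worth seeing as arithmetic. Assuming the two cuts apply independently (a simplification), each one reduces what the previous one left behind.

```python
def combined_savings(cache_cut: float = 0.90, tier_cut: float = 0.70) -> float:
    """Multiplicative stacking: each cut applies to the spend
    remaining after the previous cut, not to the original total."""
    remaining = (1 - cache_cut) * (1 - tier_cut)  # 0.10 * 0.30 = 0.03
    return 1 - remaining                          # 97% of spend eliminated
```

Three percent of the original bill remains, which is exactly why a single bad decision that forces a week of rework dwarfs everything the optimisation recovered.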

The week taught me that optimisation has limits. At some point, the right move is not to optimise further. It is to step back and question whether you are solving the right problem.

Luma's daily digest tracks these shifts as they happen. If you are building with AI tools and want to stay ahead of where this is going, subscribe to Luma. Every evening, straight to your inbox.


About the Curator

Richard Bland
Founder, Marbl Codes

27 years in web design and development. Building AI collaborators for the everyday business owner.
