AI teams face a strange new problem: how do you encourage people to use AI more without creating runaway costs? Give teams unlimited tokens and watch the bill explode. Cap usage too hard and nobody builds anything useful.
Latent Space's latest episode dives into what they're calling "tasteful tokenmaxxing" - the art of getting maximum value from AI systems without burning money on wasteful generation. The conversation cuts to the heart of how companies are actually deploying agents in production.
Depth Versus Breadth
The core debate playing out across AI labs right now: should agents refine a single answer through multiple passes (depth), or generate many different answers in parallel and pick the best one (breadth)?
Depth looks like this: generate a response, critique it, refine it, critique again, refine again. Each pass costs tokens, but you're building on previous work. The model learns from its mistakes within a single conversation.
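In code, the depth loop is something like this minimal sketch, where `generate` and `critique` are hypothetical hooks standing in for whatever model calls your stack actually makes:

```python
from typing import Callable, Optional

# Depth: refine one candidate across multiple passes.
# `generate` and `critique` are hypothetical hooks, not a real API.
def refine(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_passes: int = 3,
) -> str:
    answer = generate(prompt, None, None)            # first draft
    for _ in range(max_passes):
        feedback = critique(prompt, answer)          # model reviews its own output
        if feedback is None:                         # nothing left to fix: stop early
            break
        answer = generate(prompt, answer, feedback)  # build on previous work
    return answer
```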
Breadth looks like this: generate five different responses in parallel, evaluate all of them, pick the winner. More upfront cost, but you're exploring different solution paths simultaneously. Sometimes the best answer comes from a direction the first attempt would never have found.
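The breadth version, sketched with the same kind of hypothetical hooks plus a `score` function to pick the winner:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Breadth: generate n candidates in parallel, score each, keep the best.
# `generate` and `score` are hypothetical hooks for your model and evaluator.
def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 5,
) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))  # pay n upfront
    return max(candidates, key=lambda c: score(prompt, c))   # pick the winner
```

Note the shape of the cost: depth spends tokens sequentially and can stop early; breadth spends them all up front but finishes in one round trip.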
Neither approach is clearly better. It depends on the problem, the model, and - critically - what you're optimising for. If you're writing code, breadth often wins because the search space is large and you can test all candidates. If you're doing complex reasoning, depth tends to work better because each pass adds context the next one needs.
The Incentive Problem
Here's where it gets messy: if you incentivise engineers to use AI by making tokens free, they won't think about efficiency. Why would they? The bill isn't theirs.
But if you make the budget visible and tie it to team goals, suddenly people start caring about whether that fifth refinement pass actually improved the output. They start asking whether running thirty parallel candidates is worth it, or if five would do the job.
The best system, according to the Latent Space discussion, isn't unlimited tokens or hard caps. It's transparency plus soft limits. Show teams what they're spending in real time. Set budgets that encourage exploration but require justification for excess. Make it easy to see which workflows are efficient and which are burning tokens for marginal gains.
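A sketch of what that could look like in practice - the budget figure and reporting choices below are assumptions for illustration, not anything prescribed in the episode:

```python
from collections import defaultdict

# Soft limits, not hard caps: spend stays visible, overages get flagged
# rather than blocked. The budget number is illustrative only.
class TokenBudget:
    def __init__(self, soft_limit: int = 5_000_000):  # tokens per team per month (assumed)
        self.soft_limit = soft_limit
        self.spend: dict[str, int] = defaultdict(int)

    def record(self, workflow: str, tokens: int) -> None:
        self.spend[workflow] += tokens
        total = sum(self.spend.values())
        if total > self.soft_limit:
            # Don't block the call; surface it so the excess needs justifying.
            heaviest = max(self.spend, key=self.spend.get)
            print(f"over soft limit: {total:,} tokens (heaviest workflow: {heaviest})")

    def report(self) -> dict[str, int]:
        # Real-time visibility: which workflows are burning tokens.
        return dict(sorted(self.spend.items(), key=lambda kv: -kv[1]))
```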
The Agentic Context Explosion
Agentic systems make this worse. A single user query might trigger ten agent calls, each with its own context window. If every agent gets the full conversation history plus tool documentation plus retrieved knowledge, you're sending megabytes of text for what might be a simple function call.
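Rough arithmetic makes the multiplication obvious (every figure below is illustrative):

```python
# What one user query costs when every sub-agent gets the full shared context.
# All token counts are made-up, order-of-magnitude numbers.
history_tokens   = 8_000    # full conversation history
tool_doc_tokens  = 12_000   # documentation for every available tool
retrieval_tokens = 10_000   # retrieved knowledge
agent_calls      = 10

naive_total = agent_calls * (history_tokens + tool_doc_tokens + retrieval_tokens)
print(f"{naive_total:,} input tokens for one query")  # 300,000
```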
Smart teams are solving this with context pruning - giving each agent exactly what it needs, not everything available. The routing agent doesn't need the full knowledge base. The code-writing agent doesn't need the conversation history from three turns ago. Surgical context injection cuts token usage by half in some systems, with no quality loss.
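A sketch of the idea - the agent names and context fields here are hypothetical:

```python
from dataclasses import dataclass

# Surgical context injection: each agent declares what it needs and gets
# nothing else. Agents and fields are invented for illustration.
@dataclass
class Context:
    history: str
    tool_docs: dict[str, str]
    knowledge: str

NEEDS = {
    "router": ["history"],                 # routing needs intent, not the knowledge base
    "coder":  ["tool_docs", "knowledge"],  # code-writing doesn't need old turns
}

def build_context(agent: str, ctx: Context) -> dict:
    return {field: getattr(ctx, field) for field in NEEDS[agent]}
```

The win is structural: adding a new agent means declaring its needs, not inheriting everything by default.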
The other lever is result caching. If five users ask similar questions in an hour, why regenerate the same context five times? Cache the knowledge retrieval, cache the tool documentation, cache the reasoning steps. The full discussion breaks down caching strategies that work at scale.
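A minimal version of the idea is memoising the expensive, repeatable steps. This sketch assumes exact-match keys and a one-hour TTL; both are arbitrary choices, not recommendations from the episode:

```python
import hashlib
import time
from typing import Callable

# Cache expensive, repeatable steps (retrieval, tool docs) so similar
# queries don't regenerate the same context over and over.
class ResultCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, str]] = {}

    def get_or_compute(self, key_text: str, compute: Callable[[], str]) -> str:
        key = hashlib.sha256(key_text.encode()).hexdigest()
        hit = self.store.get(key)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]                       # cache hit: zero new tokens
        value = compute()                       # cache miss: pay once
        self.store[key] = (time.time(), value)
        return value
```

An exact-match key only catches identical inputs; catching merely similar questions needs a semantic cache keyed on embeddings, a common extension of this pattern.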
The Deeper Question
Underneath the tactical conversation about tokens and budgets is a more interesting question: are we solving problems that justify this cost?
A coding agent that burns ten thousand tokens to generate a hundred-line function might be worth it if that function would have taken an engineer two hours. But a customer service agent that uses five thousand tokens to answer "What are your opening hours?" probably isn't.
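Put numbers on the first case (the prices and rates below are assumptions, not figures from the episode):

```python
# Illustrative ROI check: token cost versus engineer time saved.
# All prices and rates are assumed for the sake of the arithmetic.
tokens_used   = 10_000
price_per_1k  = 0.01    # dollars per 1K tokens (assumed blended rate)
hours_saved   = 2.0
engineer_rate = 75.0    # dollars per hour (assumed)

cost  = tokens_used / 1_000 * price_per_1k  # $0.10
value = hours_saved * engineer_rate         # $150.00
print(f"spend ${cost:.2f} to save ${value:.2f} -> {value / cost:.0f}x return")
# The opening-hours answer: same arithmetic, seconds saved, the ratio collapses.
```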
The companies getting this right are measuring value, not just cost. They're tracking time saved, errors prevented, tasks automated - and comparing that to the token spend. When the ROI is clear, the budget conversation gets easier. When it's not, you're just burning money on automation theatre.
Tasteful tokenmaxxing isn't really about tokens. It's about knowing what you're building and why, and making sure the technical decisions align with that. Depth versus breadth, context size, caching strategy - these are all downstream of the question: what are we actually trying to achieve here?
Most teams skip that question and jump straight to implementation. The token bill is what forces them to come back and answer it properly.