A Claude agent hitting API limits in the first hour isn't a usage spike. It's a design flaw. Most token burn comes from verbose output - agents writing full sentences when a two-word status would do. The fix isn't better prompting. It's treating output as a line item with a budget.
The analysis from DEV.to shows that switching from prose to structured JSON, setting explicit context budgets per agent, and removing preambles cuts token consumption by 60-75%. Same functionality. Quarter of the cost.
Why Agents Write Too Much
Language models are trained to write naturally. That means full sentences, explanatory context, polite preambles. When you prompt an agent to "analyse this data", it doesn't just return results - it explains what it's doing, why it matters, what it found interesting. That's great for a chatbot. It's wasteful for an agent.
The problem compounds when agents chain together. Agent A produces verbose output. Agent B reads all of it as input, adds its own verbose response. Agent C does the same. Each step in the chain multiplies token usage. A five-step workflow that could run on 2,000 tokens ends up consuming 15,000 because nobody told the agents to be concise.
And here's the thing: the extra verbosity doesn't improve quality. An agent returning {"status": "complete", "items_processed": 47} is just as useful as one returning "I have successfully completed the processing task. In total, 47 items were processed without errors. The operation concluded normally." The second version burns 25 tokens for information the first version conveys in 8.
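The gap is easy to see with a rough estimate. A minimal sketch, using the common ~4-characters-per-token heuristic (an approximation, not a real tokenizer - actual counts depend on the model's vocabulary):

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters of English text per token."""
    return max(1, len(text) // 4)

prose = ("I have successfully completed the processing task. In total, 47 items "
         "were processed without errors. The operation concluded normally.")
structured = '{"status": "complete", "items_processed": 47}'

# The structured reply is a small fraction of the prose reply's size.
savings = 1 - approx_tokens(structured) / approx_tokens(prose)
```

Multiply that per-response saving across every step of a chained workflow and the totals diverge quickly.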
The JSON Fix
Structured output solves this immediately. Instead of asking agents to "explain" or "describe", ask for JSON with specific fields. Define the schema upfront. Lock down what information you actually need and strip everything else.
Before: "Analyse the customer data and summarise the findings."
After: "Return JSON: {customer_count: int, high_value_count: int, issues: [string]}"
The model still does the same analysis. It just skips the essay. Token usage drops by half or more, depending on how verbose your original prompts allowed the agent to be.
This isn't about losing information. It's about being precise. If you need context, add a field for it. If you don't, don't let the model write it anyway. Every token you don't need costs money and latency. Structure forces you to decide what actually matters.
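A minimal sketch of that discipline, assuming a hypothetical workflow around the example schema above (no real API call - the canned reply stands in for the model):

```python
import json

# Hypothetical schema: only the fields the downstream step actually needs.
SCHEMA_FIELDS = {"customer_count", "high_value_count", "issues"}

def build_prompt(task: str) -> str:
    """Wrap the task with an explicit JSON schema instead of asking for prose."""
    return (
        f"{task}\n"
        'Return ONLY JSON: {"customer_count": int, "high_value_count": int, "issues": [string]}'
    )

def parse_result(raw: str) -> dict:
    """Parse the model's reply and drop any field outside the schema."""
    data = json.loads(raw)
    return {k: v for k, v in data.items() if k in SCHEMA_FIELDS}

# Canned model reply - a stray "note" field the schema never asked for gets stripped.
raw = '{"customer_count": 120, "high_value_count": 14, "issues": [], "note": "All done!"}'
result = parse_result(raw)
```

The stripping step matters as much as the prompt: it stops unrequested fields from leaking into the next agent's context.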
Context Budgets Per Agent
Most workflows don't need every agent to see the full conversation history. Agent A needs its instructions and the current task. It doesn't need to know what Agent C did three steps ago. But by default, you're passing the entire context window forward, and it grows with every interaction.
Set explicit context budgets. Each agent gets exactly the information it needs - no more. If Agent B only needs the output from Agent A, pass only that. If Agent C needs a summary of prior steps, pass a summary, not the raw transcripts.
This requires thinking through your data flow. What does each agent actually use? What's just bloat? Most multi-agent systems pass far more context than necessary because it's easier than being selective. But the cost adds up fast. Ruthless context pruning is one of the highest-leverage optimisations you can make.
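One way to enforce the budget is a small filter between steps. A sketch, assuming each step's output is a dict (the field names and the 800-character cap are illustrative):

```python
import json

def budget_context(prior_steps: list[dict], needed_keys: set[str], max_chars: int = 800) -> str:
    """Forward only the fields the next agent uses, capped at a hard character budget."""
    slimmed = [{k: v for k, v in step.items() if k in needed_keys} for step in prior_steps]
    blob = json.dumps(slimmed, separators=(",", ":"))  # compact encoding, no pretty-printing
    return blob[:max_chars]

# Agent C only needs the counts from earlier steps, not the raw transcripts.
history = [
    {"status": "complete", "items_processed": 47, "transcript": "long raw output..."},
    {"status": "complete", "items_processed": 12, "transcript": "more raw output..."},
]
context_for_c = budget_context(history, needed_keys={"items_processed"})
```

The hard cap is deliberate: a budget you can exceed isn't a budget.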
Kill the Preambles
Language models love preambles. "Certainly, I'll help you with that." "Here's what I found." "Let me break this down for you." Every response starts with pleasantries because that's what the training data looks like.
For agents, this is pure waste. Tell the model explicitly: no preambles, no sign-offs, no politeness. Just the answer. Add it to your system prompt. Enforce it in your output validation. A well-designed agent returns data, not conversation.
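The validation side can be as simple as a regex pass over the response. A sketch with a few illustrative patterns - extend the list with whatever your model actually emits:

```python
import re

# Illustrative preamble patterns; real models vary, so treat this as a starting point.
PREAMBLE = re.compile(
    r"^(certainly[.,!]?\s*|sure[.,!]?\s*|here's what i found[:.]?\s*|"
    r"let me break this down( for you)?[:.]?\s*)",
    re.IGNORECASE,
)

def strip_preamble(text: str) -> str:
    """Remove one leading pleasantry so only the payload remains."""
    return PREAMBLE.sub("", text.strip(), count=1)
```

Stripping is a safety net, not a fix: the system prompt should stop the preamble being generated at all, since stripped tokens were still paid for.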
The same goes for explanations of reasoning. Unless you specifically need the chain of thought, don't ask for it. The model will happily explain its logic in detail if you let it. That explanation costs tokens. If the output is correct, the reasoning is a luxury you probably can't afford at scale.
What Good Agents Look Like
A token-efficient agent is ruthlessly minimal. It receives structured input, performs one clear task, and returns structured output. No fluff. No context it doesn't need. No explaining itself unless explicitly asked.
This feels unnatural at first. We're used to conversational interfaces. But agents aren't chatbots. They're functions. Treat them like functions - defined inputs, defined outputs, no side conversation. The moment you start letting agents "talk", you're burning tokens on theatre.
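The function framing can be made literal. A minimal sketch with hypothetical types - the model call is stubbed out, but the shape is the point: typed input in, typed output out, nothing else:

```python
from dataclasses import dataclass

@dataclass
class ProcessInput:
    item_ids: list[str]

@dataclass
class ProcessOutput:
    status: str
    items_processed: int

def process_agent(task: ProcessInput) -> ProcessOutput:
    """Stand-in for a model call: defined input, defined output, no side conversation."""
    return ProcessOutput(status="complete", items_processed=len(task.item_ids))
```

If an agent's interface can't be written down this plainly, that's usually a sign it is doing conversation, not work.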
The 60-75% reduction in token usage isn't theoretical. It's what happens when you stop asking language models to write prose and start asking them to return data. Same intelligence. Same capability. Quarter of the cost. Build accordingly.