Builders & Makers · Tuesday, 17 March 2026

Token Prices Dropped 10x, But Your AI Bills Went Up. Here's Why.


Token prices have fallen dramatically. OpenAI, Anthropic, Google - they've all slashed pricing over the past year. By some measures, tokens are now 10x cheaper than they were 18 months ago.

So why are AI bills going up?

This is one of those situations where the obvious answer turns out to be completely wrong. Lower prices should mean lower costs. But in AI, task complexity is increasing faster than prices are falling. And that gap is swallowing every efficiency gain.

The LLMflation Problem

There's a term for this: LLMflation. As models get more capable, developers use them for more complex tasks. Those tasks consume more tokens. Even at lower per-token prices, the total cost climbs.

Here's what the numbers look like in practice. A simple Q&A chatbot might use 50-500 tokens per interaction. That's cheap, even at old pricing. But a multi-agent system - where several AI models collaborate, plan, and iterate on a task - can burn through 1 million tokens or more for a single workflow.

Think of it like mobile data pricing. When data got cheaper, people didn't use the same amount and pay less. They started streaming video. Usage exploded. Total bills stayed the same or went up, even though the cost per gigabyte dropped.

AI is following the exact same pattern. Cheaper tokens don't reduce usage. They enable new use cases that consume tokens at a completely different scale.

Token Consumption Scales: From Simple to Staggering

To understand why costs are rising, you need to see how token consumption scales with task complexity:

Simple Q&A: 50-500 tokens per interaction. This is what most people think of when they imagine using an AI. Quick question, quick answer. Cheap.

Document analysis: 5,000-50,000 tokens. Now you're feeding context - contracts, reports, technical documentation. The model needs to read, understand, and synthesise. Token count jumps.

Multi-step workflows: 100,000-500,000 tokens. An agent that researches a topic, drafts content, revises based on feedback, and formats output is making multiple passes. Each step consumes tokens. Each iteration adds cost.

Multi-agent collaboration: 1,000,000+ tokens. Multiple AI models working together, each with their own context, planning and reasoning in parallel. This is where costs spiral. And this is where the industry is heading.

The pattern is clear. As AI capabilities improve, developers build more ambitious systems. Those systems consume orders of magnitude more tokens. Even at 10x lower prices, the total bill climbs.
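To make the arithmetic concrete, here's a sketch with round, invented numbers — the per-token prices and token counts below are illustrative assumptions, not any provider's actual rates:

```python
# Illustrative cost comparison: 10x cheaper tokens vs. more ambitious workloads.
# All prices and token counts are made-up round numbers for the sake of the arithmetic.

OLD_PRICE_PER_MTOK = 10.00   # hypothetical old price: $10 per million tokens
NEW_PRICE_PER_MTOK = 1.00    # hypothetical new price: 10x cheaper

def cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost for a given number of tokens."""
    return tokens / 1_000_000 * price_per_mtok

# A simple Q&A bot at OLD pricing: 500 tokens per interaction.
qa_old = cost(500, OLD_PRICE_PER_MTOK)

# A multi-agent workflow at NEW pricing: 1,000,000 tokens per run.
agent_new = cost(1_000_000, NEW_PRICE_PER_MTOK)

print(f"Q&A at old prices:         ${qa_old:.4f} per interaction")
print(f"Multi-agent at new prices: ${agent_new:.2f} per workflow")
# Tokens got 10x cheaper, but the workflow consumes 2,000x more of them,
# so the per-task bill is 200x higher than the old chatbot's.
```

Swap in your provider's real rates and your own token counts; the shape of the result is what matters, not the specific dollars.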

Margin Compression at Scale

For API providers - OpenAI, Anthropic, Google - this creates a brutal dynamic. They're cutting prices to stay competitive. But their infrastructure costs aren't falling nearly as fast.

Compute, energy, and network costs drop slowly. Token prices drop fast. The gap between what they charge and what it costs to deliver is narrowing. That's margin compression. And at scale, it becomes unsustainable.
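A rough sketch of why that squeezes margins — every figure here is invented for illustration, since real provider economics aren't public:

```python
# Hypothetical margin compression: prices fall 10x while serving costs fall only 5x.

old_price_per_mtok = 10.00   # what the provider charged per million tokens (illustrative)
new_price_per_mtok = 1.00    # after a 10x price cut
old_cost_per_mtok = 4.00     # infrastructure cost to serve a million tokens (illustrative)
new_cost_per_mtok = 0.80     # compute and energy got cheaper too, but only 5x

old_margin = (old_price_per_mtok - old_cost_per_mtok) / old_price_per_mtok
new_margin = (new_price_per_mtok - new_cost_per_mtok) / new_price_per_mtok

print(f"old gross margin: {old_margin:.0%}")   # healthy
print(f"new gross margin: {new_margin:.0%}")   # compressed
```

With these made-up numbers, margin falls from 60% to 20% even though costs dropped 5x — the price cut simply outpaced the cost curve.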

Some providers are betting they can offset lower margins with higher volume. Others are banking on efficiency improvements in model inference. But there's a limit to how far you can compress margins before the economics stop working.

This is why you're seeing API providers start to differentiate on features, not just price. Faster response times. Better quality outputs. Specialised models for specific tasks. They're trying to shift the conversation away from cost-per-token and toward value-per-task.

AI FinOps: The Discipline Nobody Wanted

All of this has given birth to something new: AI FinOps. Financial operations for AI spending. It's like cloud FinOps, but messier because token consumption is harder to predict and optimise.

If you're running AI workloads at any meaningful scale, you need to start thinking like a FinOps engineer:

Track consumption by task type. Not all prompts are created equal. Know which workflows are burning through tokens and why.

Optimise prompt design. Verbose prompts cost more. Every unnecessary word wastes tokens. Engineers are learning to write tight, efficient prompts the same way they once optimised database queries.

Cache aggressively. If you're asking the same question repeatedly, don't reprocess it every time. Cache results. Reuse outputs. Reduce redundant API calls.

Choose the right model for the task. Don't use a frontier model for tasks a smaller, cheaper model can handle. Match capability to need. Overpaying for intelligence you don't need is wasteful.

Set budget alerts. Token consumption can spiral fast. Know when you're approaching limits before the bill arrives.
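Pulled together, a minimal version of these practices might look like the sketch below. The model names, prices, token counts, and thresholds are all placeholders — substitute your own provider's rates and a real API call where indicated:

```python
import functools

# Placeholder per-million-token prices; substitute your provider's real rates.
PRICES = {"small-model": 0.25, "frontier-model": 10.00}
BUDGET_ALERT_USD = 50.00

usage_by_task: dict[str, float] = {}   # running spend in dollars, keyed by task type

def record(task_type: str, model: str, tokens: int) -> None:
    """Track consumption by task type and alert before the bill arrives."""
    spend = tokens / 1_000_000 * PRICES[model]
    usage_by_task[task_type] = usage_by_task.get(task_type, 0.0) + spend
    if sum(usage_by_task.values()) > BUDGET_ALERT_USD:
        print(f"ALERT: total spend has passed ${BUDGET_ALERT_USD:.2f}")

def pick_model(task_type: str) -> str:
    """Match capability to need: only route hard tasks to the frontier model."""
    return "frontier-model" if task_type in {"multi_agent", "analysis"} else "small-model"

@functools.lru_cache(maxsize=1024)
def cached_answer(prompt: str) -> str:
    """Cache repeated prompts so identical questions aren't reprocessed."""
    model = pick_model("qa")
    record("qa", model, tokens=500)        # placeholder token count
    return f"(response from {model})"      # stand-in for a real API call

cached_answer("What is LLMflation?")   # first call: hits the 'API', records spend
cached_answer("What is LLMflation?")   # repeat: served from cache, no new spend
print(usage_by_task)
```

Production setups would persist the usage log and wire the alert into monitoring rather than `print`, but the division of concerns — track, route, cache, alert — is the point.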

What This Means for Builders

If you're building with AI, the lesson is simple: falling token prices are not a free pass to ignore costs. In fact, they're the opposite. They're an invitation to build more complex systems that will cost more, even at lower unit prices.

The developers who will succeed in this environment are the ones who treat token consumption as a first-class concern - not an afterthought. Design for efficiency. Measure ruthlessly. Optimise continuously.

Because the cost of AI isn't going down. It's just changing shape. And if you're not paying attention, it'll catch you off guard.

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.


© 2026 MEM Digital Ltd t/a Marbl Codes