Voices & Thought Leaders Monday, 6 April 2026

Token Demand Rises 2.5x Faster Than Supply Can Scale


The price of API tokens has been falling for months. OpenAI cut prices. Anthropic cut prices. Google cut prices. On the surface, it looks like a race to the bottom.

But Azeem Azhar's latest analysis shows something stranger happening underneath: demand is rising 2.5 times faster than providers can scale supply. Cheap tokens aren't driving a price war - they're masking a supply crunch.

The Numbers Behind the Squeeze

OpenAI processed 15 billion tokens per minute in April 2025, up from 6 billion in October 2024 - a two-and-a-half-fold jump in six months. The infrastructure required to handle that kind of growth doesn't appear overnight.

Anthropic responded by tightening session limits - fewer tokens per conversation, shorter context windows for some users. That's not a product decision. That's a capacity decision. When you can't meet demand, you ration supply.

Google, meanwhile, has reportedly pushed older hardware to full utilisation. The kind of chips that were supposed to be phased out are now running flat-out because newer capacity can't come online fast enough. Azhar's data suggests this isn't an anomaly - it's the new baseline.

Why Cheap Tokens Hide the Problem

Lower prices signal abundance. If something costs less, we assume there's plenty of it. But in this case, providers are cutting prices to stay competitive while quietly struggling to keep up with usage.

The economics are brutal. Training models is expensive. Running inference at scale is expensive. Acquiring and deploying hardware takes time. Lowering prices makes sense if you're trying to lock in customers before competitors do - but it doesn't solve the infrastructure bottleneck.

For developers building on these APIs, the cheap token pricing looks like a gift. But the hidden cost is reliability. When demand spikes, something has to give: response times slow down, access gets throttled, or features get pulled back. None of those show up on the pricing page.

What This Means for Builders

If you're running a business that depends on LLM APIs, this squeeze has practical implications. First, multi-vendor fallback strategies stop being optional. Relying on a single provider means you're exposed when their capacity tightens. Having a backup - even if it's slightly worse quality - keeps your product running when the primary provider starts throttling.
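That fall-through pattern can be sketched in a few lines. Everything here is a stand-in: `call_provider` is a hypothetical wrapper around whatever SDKs you actually use, and the provider names are placeholders - the point is the priority-ordered fallback, not any real API.

```python
def call_provider(name: str, prompt: str) -> str:
    """Hypothetical stand-in for a real SDK call.

    Here the 'primary' provider simulates a capacity squeeze by
    raising, so the fallback path gets exercised.
    """
    if name == "primary":
        raise RuntimeError("rate limited")
    return f"[{name}] response to: {prompt}"

def complete_with_fallback(prompt: str, providers: list[str]) -> str:
    """Try each provider in priority order; fall through on failure."""
    errors = []
    for name in providers:
        try:
            return call_provider(name, prompt)
        except RuntimeError as exc:  # throttled, timed out, or over quota
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each provider entry would also carry its own model name and prompt formatting, since a backup model rarely accepts identical parameters - but the control flow stays this simple.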

Second, usage monitoring becomes critical. If token prices keep falling but rate limits keep tightening, you need visibility into your actual consumption patterns. The bill might be smaller, but the reliability might be shakier. That's a trade-off worth tracking.

Third, local models start looking more attractive. If API reliability becomes a question mark, the case for running models on your own hardware strengthens - even if the upfront cost is higher. No rate limits. No throttling. No waiting for someone else's infrastructure to catch up.
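Whether local hardware pencils out is a straightforward break-even calculation. A back-of-envelope sketch - every figure in the example is an assumption chosen for illustration, not a quote of any provider's pricing:

```python
def breakeven_months(hardware_cost: float,
                     tokens_per_month: float,
                     api_price_per_million: float,
                     local_running_cost_per_month: float) -> float:
    """Months until owned hardware is cheaper than metered API tokens."""
    api_monthly = tokens_per_month / 1_000_000 * api_price_per_million
    monthly_saving = api_monthly - local_running_cost_per_month
    if monthly_saving <= 0:
        return float("inf")  # the API stays cheaper at this volume
    return hardware_cost / monthly_saving

# Illustrative numbers: a $6,000 workstation, 500M tokens/month,
# $2 per 1M API tokens, $100/month in power and upkeep.
# Saving: $1,000 - $100 = $900/month, so roughly 6.7 months to break even.
```

The sensitivity is the interesting part: at low volumes the result is infinite and the API wins outright, which is why this argument only bites for teams with heavy, sustained inference loads.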

The Bigger Picture

This isn't a temporary blip. Demand for inference is structural - every new AI feature, every chatbot, every automation layer adds to the load. Meanwhile, chip fabrication timelines haven't changed. New fabs take years to build. New chip designs take years to productionise.

The gap between demand growth and supply scaling isn't closing. It's widening. That creates pressure on pricing, on reliability, and on the business models of every company building on these APIs.

Azhar's analysis doesn't predict a collapse - but it does highlight a fragility. The infrastructure underpinning the AI boom is stretched thin. Lower prices make it easier to build. But they don't make the system more resilient. For anyone betting their business on API access, that's the number worth watching.



Today's Sources

DEV.to AI
Automated Screenshot Workflow: Stop Re-Screenshotting Your App
n8n Blog
RAG System Architecture: Components, Implementation, and Best Practices
Hacker News Best
Show HN: A 9M Parameter LLM Built to Teach How Language Models Work
Towards Data Science
Proxy-Pointer RAG: Vectorless Accuracy at Vector RAG Scale
DEV.to AI
Building AI-Powered Frontends: From Clicks to Intent
DEV.to AI
One Prompt Replaced 3 Hours of Daily Coding (Russian language article)
ROS Discourse
Next Generation Open-RMF Roadmap
ROS Discourse
QERRA-v2: Hybrid Quantum-Ethical Safety Layer for Humanoid Robots
The Robot Report
NORD Launches MAXXDRIVE Gear Units for Mining Automation
Azeem Azhar
Data to Start Your Week: The AI Squeeze
Ben Thompson Stratechery
OpenAI Buys TBPN, Tech and the Token Tsunami
Gary Marcus
The Back Story Behind Medvi's '$1.8 Billion' Valuation

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.


© 2026 MEM Digital Ltd t/a Marbl Codes