The price of API tokens has been falling for months. OpenAI cut prices. Anthropic cut prices. Google cut prices. On the surface, it looks like a race to the bottom.
But Azeem Azhar's latest analysis shows something stranger happening underneath: demand is rising 2.5 times faster than providers can scale supply. Cheap tokens aren't driving a price war - they're masking a supply crunch.
The Numbers Behind the Squeeze
OpenAI processed 15 billion tokens per minute in April 2025. That's up from 6 billion in October 2024 - two and a half times the throughput in six months. The infrastructure required to handle that kind of growth doesn't appear overnight.
Anthropic responded by tightening session limits - fewer tokens per conversation, shorter context windows for some users. That's not a product decision. That's a capacity decision. When you can't meet demand, you ration supply.
Google, meanwhile, has reportedly pushed older hardware to full utilisation. The kind of chips that were supposed to be phased out are now running flat-out because newer capacity can't come online fast enough. Azhar's data suggests this isn't an anomaly - it's the new baseline.
Why Cheap Tokens Hide the Problem
Lower prices signal abundance. If something costs less, we assume there's plenty of it. But in this case, providers are cutting prices to stay competitive while quietly struggling to keep up with usage.
The economics are brutal. Training models is expensive. Running inference at scale is expensive. Acquiring and deploying hardware takes time. Lowering prices makes sense if you're trying to lock in customers before competitors do - but it doesn't solve the infrastructure bottleneck.
For developers building on these APIs, the cheap token pricing looks like a gift. But the hidden cost is reliability. When demand spikes, something has to give: response times slow, access gets throttled, or features get pulled back. None of those show up on the pricing page.
What This Means for Builders
If you're running a business that depends on LLM APIs, this squeeze has practical implications. First, multi-vendor fallback strategies stop being optional. Relying on a single provider means you're exposed when their capacity tightens. Having a backup - even one slightly lower in quality - keeps your product running when the primary provider starts throttling.
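A fallback like this can be a thin wrapper around whatever client calls you already make. The sketch below is illustrative, not tied to any real SDK: `providers` is just an ordered list of names paired with callables, and the retry/backoff policy is a placeholder you would tune to each provider's actual rate-limit behaviour.

```python
import time


def call_with_fallback(prompt, providers, max_retries=1, backoff_seconds=0.0):
    """Try providers in priority order; move to the next on failure.

    `providers` is a list of (name, call_fn) pairs. Each call_fn takes a
    prompt and returns a response string, raising on throttling or timeout.
    The names and call signatures here are hypothetical placeholders.
    """
    last_error = None
    for name, call_fn in providers:
        for _ in range(max_retries + 1):
            try:
                return name, call_fn(prompt)
            except Exception as err:  # e.g. rate-limit or timeout errors
                last_error = err
                time.sleep(backoff_seconds)  # real code: exponential backoff
    raise RuntimeError(f"all providers failed: {last_error!r}")
```

The useful property is that the fallback decision lives in one place: when the primary starts throwing throttling errors, traffic shifts to the backup without any change to calling code.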
Second, usage monitoring becomes critical. If token prices keep falling but rate limits keep tightening, you need visibility into your actual consumption patterns. The bill might be smaller, but the reliability might be shakier. That's a trade-off worth tracking.
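Tracking consumption against a rate limit doesn't require much machinery. Here is a minimal sketch of a rolling-window monitor - the limit, window size, and 80% warning threshold are all illustrative numbers, not any provider's actual policy.

```python
import time
from collections import deque


class TokenUsageMonitor:
    """Rolling-window token counter that flags when usage nears a limit.

    The per-minute limit and warning threshold are assumptions you would
    replace with your provider's published (or observed) numbers.
    """

    def __init__(self, limit_per_minute, window_seconds=60):
        self.limit = limit_per_minute
        self.window = window_seconds
        self.events = deque()  # (timestamp, token_count) pairs

    def record(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def usage(self):
        return sum(tokens for _, tokens in self.events)

    def near_limit(self, threshold=0.8):
        return self.usage() >= threshold * self.limit
```

Wired into your request path, `near_limit()` gives you an early signal to shed load or switch providers before you hit a hard 429, rather than after.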
Third, local models start looking more attractive. If API reliability becomes a question mark, the case for running models on your own hardware strengthens - even if the upfront cost is higher. No rate limits. No throttling. No waiting for someone else's infrastructure to catch up.
The Bigger Picture
This isn't a temporary blip. Demand for inference is structural - every new AI feature, every chatbot, every automation layer adds to the load. Meanwhile, chip fabrication timelines haven't changed. New fabs take years to build. New chip designs take years to productionise.
The gap between demand growth and supply scaling isn't closing. It's widening. That creates pressure on pricing, on reliability, and on the business models of every company building on these APIs.
Azhar's analysis doesn't predict a collapse - but it does highlight a fragility. The infrastructure underpinning the AI boom is stretched thin. Lower prices make it easier to build. But they don't make the system more resilient. For anyone betting their business on API access, that's the number worth watching.