Azeem Azhar noticed something that changes the economics of AI entirely. We've spent three years obsessing over training costs - the billions spent on compute clusters, the energy consumption of model development, the race for parameter counts. Meanwhile, inference quietly became the bigger cost. Not for labs training foundation models, but for everyone else actually using them.
Inference is what happens when you ask a model to do something. Generate text, analyse an image, make a decision. Each request burns tokens - computational work that costs money. Right now, those costs are high enough that they shape what gets built. A startup automating customer support has to calculate token spend per conversation and decide if the economics work. An enterprise deploying AI code review has to budget for millions of inference calls per month. Tokens aren't just a metric. They're a constraint.
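The unit-economics calculation a startup like that runs can be sketched in a few lines. The per-token prices, conversation sizes, and volumes below are illustrative assumptions, not real provider rates:

```python
# Back-of-envelope token economics for an AI support bot.
# All prices and volumes are assumed for illustration.

INPUT_PRICE_PER_M = 3.00    # $ per 1M input tokens (assumed rate)
OUTPUT_PRICE_PER_M = 15.00  # $ per 1M output tokens (assumed rate)

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation at the assumed per-token rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A typical exchange: prompt plus history in, reply out.
per_conversation = conversation_cost(input_tokens=4_000, output_tokens=800)
print(f"${per_conversation:.4f} per conversation")   # $0.0240

monthly = per_conversation * 50_000  # at 50k conversations a month
print(f"${monthly:,.2f} per month")                  # $1,200.00
```

At these assumed numbers, a bot that costs a couple of cents per conversation still runs to four figures a month at scale - which is exactly why tokens act as a constraint on what gets built.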
Azhar calls this the shift to an inference-first economy. In this model, tokens become a productive input like electricity or bandwidth - something you budget for, optimise around, and use to calculate unit economics. The companies that figure out how to deliver capability at lower token costs win. The ones that don't, price themselves out of the market.
What changes when inference costs drop
Model providers are already competing on price. OpenAI cut API costs by 75% in one year. Anthropic and Google are racing to undercut them. Open-source models running locally eliminate inference costs entirely for some use cases. This isn't a distant future scenario - it's happening now, and the effects are immediate.
When inference gets cheap, three things become possible. First, always-on AI shifts from luxury to default. Instead of triggering AI only when needed, you can run it continuously - monitoring data streams, watching for anomalies, providing real-time suggestions. Second, high-frequency use cases become viable. Customer support bots that previously rationed AI calls because of cost can now use GPT for every interaction. Third, local deployment makes sense for privacy-sensitive work. With no API costs, there's no reason to send data to the cloud.
The knock-on effects are bigger than the direct cost savings. Cheaper inference means developers can experiment freely instead of rationing API calls during development. It means startups can build products with AI at the core instead of bolting it on as a premium feature. It means enterprises can deploy AI in low-margin workflows where the ROI calculation never worked before.
The agent economy nobody's ready for
Azhar sees AI agents as the logical endpoint of cheap inference. Not chatbots that answer questions, but systems that replace entire workflows. An agent that monitors your inbox, understands context across months of conversation, drafts replies, schedules follow-ups, and escalates edge cases doesn't just assist you - it replaces the administrative layer of knowledge work entirely.
That shift is already happening in pockets. Developers are using AI to generate boilerplate code, write tests, and review pull requests. Marketing teams are using it to draft email campaigns, generate variations, and analyse performance. Customer success teams are using it to summarise support tickets and suggest responses. Each of these is a workflow that previously required human judgment at every step. Now the AI handles the routine 80%, and humans focus on the exceptions.
The pattern is consistent: identify a workflow with clear inputs and outputs, document the decision logic, feed it to an AI agent, and let it run. What used to take a team of three takes one person plus an agent. What used to take a sprint happens overnight. The productivity gains aren't incremental - they're structural.
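That pattern can be made concrete with a minimal sketch. The keyword rules below stand in for an agent's documented decision logic; the function names, thresholds, and ticket shape are all hypothetical, chosen to show the routine/exception split rather than any real agent framework:

```python
# Sketch of the workflow pattern: clear inputs and outputs, documented
# decision logic, routine cases handled automatically, exceptions
# escalated to a human. Names and rules are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Ticket:
    subject: str
    body: str

# Assumed decision logic: these topics are routine enough to auto-handle.
ROUTINE_KEYWORDS = {"password", "invoice", "refund"}

def handle(ticket: Ticket) -> str:
    """Route one ticket: auto-respond to the routine majority, escalate the rest."""
    words = set(ticket.body.lower().split())
    if words & ROUTINE_KEYWORDS:
        return f"auto-reply: standard response for '{ticket.subject}'"
    return f"escalated: '{ticket.subject}' needs human review"

inbox = [
    Ticket("Login issue", "I forgot my password again"),
    Ticket("Contract question", "Can we renegotiate clause 4?"),
]
for ticket in inbox:
    print(handle(ticket))
```

In a real deployment the keyword check would be an AI call, but the structure is the same: the agent clears the routine cases on its own, and everything it can't classify lands in a human queue.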
But Azhar raises a critical warning about verification. As AI analysis becomes easier to produce, the discipline around verification is collapsing. He points to examples of people sharing AI-generated insights - market analysis, data correlations, strategic recommendations - without checking whether the output is accurate. It's exploratory work presented as verified fact, and the consequences compound when others build on top of unverified claims.
The responsibility gap
There's a growing gap between what AI can produce and what humans can verify. A model can generate a 10-page market analysis in 30 seconds. Reading it carefully, checking sources, and validating claims take an hour. The incentive structure rewards speed over accuracy - share the AI output now, let someone else find the errors later.
This isn't a technical problem with models. It's a human problem with how we use them. AI makes exploration cheap and verification expensive. That imbalance creates risk, especially in domains where wrong information has consequences. A flawed investment thesis, a biased hiring algorithm, a medical recommendation based on hallucinated data - these aren't hypothetical failures. They're happening now, and the systems that catch them are overwhelmed.
Azhar's argument is that the responsibility sits with the person sharing the output, not the model producing it. If you publish AI analysis without verification, you own the error. If you deploy an agent without testing edge cases, you own the failure. The tooling makes it easy to abdicate responsibility - don't.
The inference-first economy is here. Tokens are productive inputs. Agents are replacing workflows. The economics have shifted in ways that make entirely new products viable. But the discipline around verification hasn't caught up, and that gap is where the risks live. Cheap inference is a capability unlock. What we build with it, and how carefully we verify it, is still on us.