Voices & Thought Leaders · Sunday, 22 March 2026

Tokens as electricity: Why inference costs matter more than training costs now


Azeem Azhar noticed something that changes the economics of AI entirely. We've spent three years obsessing over training costs - the billions spent on compute clusters, the energy consumption of model development, the race for parameter counts. Meanwhile, inference quietly became the bigger cost. Not for labs training foundation models, but for everyone else actually using them.

Inference is what happens when you ask a model to do something. Generate text, analyse an image, make a decision. Each request burns tokens - computational work that costs money. Right now, those costs are high enough that they shape what gets built. A startup automating customer support has to calculate token spend per conversation and decide if the economics work. An enterprise deploying AI code review has to budget for millions of inference calls per month. Tokens aren't just a metric. They're a constraint.
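That per-conversation calculation can be sketched in a few lines. The prices and token counts below are illustrative assumptions, not quotes from any provider:

```python
# Illustrative unit economics for an AI support bot.
# All prices and token counts here are assumptions, not real figures.

INPUT_PRICE_PER_1M = 3.00    # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_1M = 15.00  # $ per 1M output tokens (assumed)

def cost_per_conversation(turns, in_tokens_per_turn, out_tokens_per_turn):
    """Estimated model cost of one support conversation."""
    input_cost = turns * in_tokens_per_turn * INPUT_PRICE_PER_1M / 1_000_000
    output_cost = turns * out_tokens_per_turn * OUTPUT_PRICE_PER_1M / 1_000_000
    return input_cost + output_cost

# A 6-turn conversation at ~800 input and ~300 output tokens per turn:
cost = cost_per_conversation(turns=6, in_tokens_per_turn=800, out_tokens_per_turn=300)
print(f"${cost:.4f} per conversation")  # → $0.0414 per conversation
```

At roughly four cents a conversation, a support desk handling a million conversations a month is budgeting tens of thousands of dollars in inference alone - which is exactly why tokens behave like a metered utility.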

Azhar calls this the shift to an inference-first economy. In this model, tokens become a productive input like electricity or bandwidth - something you budget for, optimise around, and use to calculate unit economics. The companies that figure out how to deliver capability at lower token costs win. The ones that don't, price themselves out of the market.

What changes when inference costs drop

Model providers are already competing on price. OpenAI cut API costs by 75% in one year. Anthropic and Google are racing to undercut them. Open-source models running locally eliminate inference costs entirely for some use cases. This isn't a distant future scenario - it's happening now, and the effects are immediate.

When inference gets cheap, three things become possible. First, always-on AI shifts from luxury to default. Instead of triggering AI only when needed, you can run it continuously - monitoring data streams, watching for anomalies, providing real-time suggestions. Second, high-frequency use cases become viable. Customer support bots that previously rationed AI calls because of cost can now use GPT for every interaction. Third, local deployment makes sense for privacy-sensitive work. With no API costs, there's no reason to send data to the cloud.

The knock-on effects are bigger than the direct cost savings. Cheaper inference means developers can experiment freely instead of rationing API calls during development. It means startups can build products with AI at the core instead of bolting it on as a premium feature. It means enterprises can deploy AI in low-margin workflows where the ROI calculation never worked before.

The agent economy nobody's ready for

Azhar sees AI agents as the logical endpoint of cheap inference. Not chatbots that answer questions, but systems that replace entire workflows. An agent that monitors your inbox, understands context across months of conversation, drafts replies, schedules follow-ups, and escalates edge cases doesn't just assist you - it replaces the administrative layer of knowledge work entirely.

That shift is already happening in pockets. Developers are using AI to generate boilerplate code, write tests, and review pull requests. Marketing teams are using it to draft email campaigns, generate variations, and analyse performance. Customer success teams are using it to summarise support tickets and suggest responses. Each of these is a workflow that previously required human judgment at every step. Now the AI handles the routine 80%, and humans focus on the exceptions.

The pattern is consistent: identify a workflow with clear inputs and outputs, document the decision logic, feed it to an AI agent, and let it run. What used to take a team of three takes one person plus an agent. What used to take a sprint happens overnight. The productivity gains aren't incremental - they're structural.
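The pattern above can be sketched as a minimal agent loop. This is a hedged illustration, not anyone's production system: `call_model` is a hypothetical stand-in for a real LLM API client, and the confidence heuristic and threshold are assumptions:

```python
# Minimal sketch of the workflow pattern: clear inputs and outputs,
# documented decision logic, routine cases handled automatically,
# exceptions escalated to a human. `call_model` is a hypothetical
# stand-in for an LLM client; the heuristic is purely illustrative.

from dataclasses import dataclass

@dataclass
class Ticket:
    body: str

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off between "routine" and "exception"

def call_model(prompt: str) -> tuple[str, float]:
    # Stand-in for an LLM call: returns (draft_reply, confidence).
    # A real agent would take confidence from the model or a verifier;
    # here a trivial length heuristic marks short tickets as routine.
    routine = len(prompt) < 200
    return ("Here's a suggested fix...", 0.95 if routine else 0.4)

def escalate_to_human(ticket: Ticket) -> str:
    return f"ESCALATED: {ticket.body[:40]}"

def handle(ticket: Ticket) -> str:
    draft, confidence = call_model(f"Draft a reply to: {ticket.body}")
    if confidence >= CONFIDENCE_THRESHOLD:
        return draft                   # the routine ~80%
    return escalate_to_human(ticket)   # humans handle the exceptions

print(handle(Ticket("Password reset isn't working.")))
```

The design choice that matters is the escalation branch: the agent runs unattended only where its decision logic is documented and its failures route to a person.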

But Azhar raises a critical warning about verification. As AI analysis becomes easier to produce, the discipline around verification is collapsing. He points to examples of people sharing AI-generated insights - market analysis, data correlations, strategic recommendations - without checking if the output is accurate. It's exploratory work presented as verified fact, and the consequences compound when others build on top of unverified claims.

The responsibility gap

There's a growing gap between what AI can produce and what humans can verify. A model can generate a 10-page market analysis in 30 seconds. Reading it carefully, checking sources, and validating claims takes an hour. The incentive structure rewards speed over accuracy - share the AI output now, let someone else find the errors later.

This isn't a technical problem with models. It's a human problem with how we use them. AI makes exploration cheap and verification expensive. That imbalance creates risk, especially in domains where wrong information has consequences. A flawed investment thesis, a biased hiring algorithm, a medical recommendation based on hallucinated data - these aren't hypothetical failures. They're happening now, and the systems that catch them are overwhelmed.

Azhar's argument is that the responsibility sits with the person sharing the output, not the model producing it. If you publish AI analysis without verification, you own the error. If you deploy an agent without testing edge cases, you own the failure. The tooling makes it easy to abdicate responsibility - don't.

The inference-first economy is here. Tokens are productive inputs. Agents are replacing workflows. The economics have shifted in ways that make entirely new products viable. But the discipline around verification hasn't caught up, and that gap is where the risks live. Cheap inference is a capability unlock. What we build with it, and how carefully we verify it, is still on us.


Today's Sources

DEV.to AI
How I built an AI health coach with Next.js, Supabase & GPT-5.2 - from wearable APIs to recovery predictions
DEV.to AI
Turning GitHub Copilot CLI into an AI Agent via ACP
Hacker News Best
Professional video editing, right in the browser with WebGPU and WASM
Hacker News Best
Floci - A free, open-source local AWS emulator
Hacker News Best
The three pillars of JavaScript bloat
Towards Data Science
Escaping the SQL Jungle
DEV.to AI
Automate or Stagnate: AI-Powered Customs for Southeast Asia Sellers
DEV.to AI
Amazon Q in Practice: How AI Is Transforming My AWS Workflow Between the Console and VS Code
Towards Data Science
Building a Navier-Stokes Solver in Python from Scratch: Simulating Airflow
The Robot Report
The great robot race: How companies can balance speed to market and compliance in the U.S.
Robohub
Robot Talk Episode 149 - Robot safety and security, with Krystal Mattich
The Robot Report
Allient to present new generation of mobile robot drive systems at LogiMAT
The Robot Report
How offline programming reduces machining automation deployment times
DEV.to AI
From Pixels to Physicality: Engineering Olaf with Reinforcement Learning, Control Systems, and Illusion Design
Azeem Azhar
🔮 Exponential View #566: A solar shield; AI agents; human judgment; China's robots++
Sebastian Raschka
A Visual Guide to Attention Variants in Modern LLMs

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.


© 2026 MEM Digital Ltd t/a Marbl Codes