Voices & Thought Leaders · Monday, 11 May 2026

Ben Thompson on Agentic Inference: Speed Stops Mattering

Ben Thompson published a piece this week arguing that agentic inference - where AI systems work autonomously without human oversight - will require fundamentally different compute architecture than the answer-generation systems we've built so far.

His core claim: speed becomes irrelevant. Memory hierarchy and cost become everything.

Two Types of Inference

Thompson separates AI inference into two categories. Answer inference is what we have now - a user asks a question, the model generates a response, the interaction ends. The constraint is latency. Nobody wants to wait 30 seconds for ChatGPT to respond. Speed is the product.

Agentic inference is different. The AI system receives a task and works on it for hours, days, or weeks without human involvement. It might be analysing legal documents, monitoring system logs, coordinating supply chain decisions, or managing customer support queues. The constraint isn't speed - it's cost and reliability.

When a human isn't waiting for the result, sub-second latency stops mattering. What matters is whether the system can keep working cheaply enough that the task is economically viable.
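
The split shows up directly in code shape. A minimal sketch in Python, with a hypothetical call_model function standing in for any inference API:

import time

def call_model(prompt: str) -> str:
    # Stand-in for a real inference API; hypothetical.
    return f"response to {prompt!r}"

# Answer inference: a human is waiting, so per-call latency is the product.
def answer(question: str) -> str:
    return call_model(question)

# Agentic inference: nobody is waiting on any single step, so each
# iteration can run on slow, cheap compute. The constraint is total
# cost across many steps, not response time.
def run_agent(task: str, max_steps: int = 5) -> list[str]:
    results = []
    for step in range(max_steps):
        results.append(call_model(f"{task}, step {step}"))
        time.sleep(1)  # a real agent might idle for minutes between steps
    return results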

The Compute Shift

Current AI infrastructure is optimised for throughput. Expensive GPUs sit in data centres, processing thousands of requests per second. High utilisation is critical because the hardware costs millions. You want those chips busy every millisecond.

But agentic systems don't care about milliseconds. They care about memory and persistence. An agent working on a multi-day task needs to maintain context - the full conversation history, intermediate results, external data it's gathered. That context can't fit in GPU memory. It needs to live somewhere cheaper.

Thompson argues this points toward a different memory hierarchy. Keep the active working set in fast memory. Push the bulk of context into slower, cheaper storage. Optimise for cost per token stored, not tokens per second processed.
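
A toy version of that hierarchy, assuming nothing about any particular runtime: a small hot window stays in RAM, everything older spills to append-only disk storage, and the slow path of re-reading old context is taken only when a step actually needs it.

import json
from collections import deque
from pathlib import Path

class TieredContext:
    # Hot tier in RAM, cold bulk on disk: optimised for cost per token
    # stored, not tokens per second served.
    def __init__(self, cold_path: str, hot_items: int = 32):
        self.hot = deque(maxlen=hot_items)  # small, fast, expensive tier
        self.cold = Path(cold_path)         # large, slow, cheap tier
        self.cold.touch()

    def append(self, item: dict) -> None:
        if len(self.hot) == self.hot.maxlen:
            # Spill the oldest hot item to cold storage before it is evicted.
            with self.cold.open("a") as f:
                f.write(json.dumps(self.hot[0]) + "\n")
        self.hot.append(item)

    def working_set(self) -> list[dict]:
        # What actually gets sent to the model on each step.
        return list(self.hot)

    def recall(self, predicate) -> list[dict]:
        # Rare, slow path: scan cold storage when old context is needed.
        with self.cold.open() as f:
            return [r for line in f if predicate(r := json.loads(line))]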

This isn't a minor tweak. It's a different architecture. The chip designs, the data centre layouts, the software stack - all of it gets rebuilt around persistence and cost rather than speed.

Why This Matters for Builders

If Thompson's right - and the logic is compelling - then the current wave of AI infrastructure is solving yesterday's problem. We've built everything around answering questions quickly. But the bigger opportunity is systems that work autonomously on complex tasks.

For developers, this means the tools you need don't exist yet. You can't just spin up an API and expect it to handle long-running agents. You need state management, cost controls, reliability guarantees, and graceful degradation when context grows too large. None of that comes standard.
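
As a sketch of what that application-level plumbing looks like, here are two of those guardrails: a spend ceiling and graceful degradation when context outgrows its limit. The prices, limits, and summarise helper are illustrative assumptions, not any framework's API.

def summarise(chunks: list[str]) -> str:
    # Stand-in for a real summarisation call; purely illustrative.
    return f"[summary of {len(chunks)} earlier messages]"

class BudgetExceeded(Exception):
    pass

class AgentGuards:
    # Illustrative numbers only; real prices and limits vary by provider.
    PRICE_PER_1K_TOKENS = 0.002
    MAX_CONTEXT_TOKENS = 100_000

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, tokens: int) -> None:
        # Cost control: fail loudly rather than burn money for days.
        self.spent_usd += tokens / 1000 * self.PRICE_PER_1K_TOKENS
        if self.spent_usd > self.budget_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.2f}")

    def degrade(self, context: list[str], count_tokens,
                keep_recent: int = 8) -> list[str]:
        # Graceful degradation: fold older turns into one summary instead
        # of crashing when the context no longer fits.
        if (len(context) > keep_recent
                and sum(map(count_tokens, context)) > self.MAX_CONTEXT_TOKENS):
            context = [summarise(context[:-keep_recent])] + context[-keep_recent:]
        return context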

The companies building agent frameworks - LangChain, LlamaIndex, emerging startups - are all grappling with this. How do you keep an agent running for three days without losing context? How do you pause execution, swap out the model, and resume without breaking the task? How do you keep costs under control when context balloons to millions of tokens?
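
The pause-and-resume question mostly reduces to making the agent's state serialisable and checkpointing it after every step. A minimal sketch, assuming a JSON-friendly state dict rather than any specific framework:

import json
from pathlib import Path

def checkpoint(state: dict, path: str) -> None:
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = Path(path + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)

def resume(path: str, model_name: str) -> dict:
    # The model is named in the state, not baked into the code, so it can
    # be swapped between pause and resume without breaking the task.
    state = json.loads(Path(path).read_text())
    state["model"] = model_name
    return state

def step(state: dict, path: str) -> dict:
    # One unit of work, checkpointed before the loop moves on.
    state["step"] = state.get("step", 0) + 1
    state.setdefault("results", []).append(f"result {state['step']}")
    checkpoint(state, path)
    return state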

These aren't solved problems. They're active research questions that happen to have immediate commercial relevance.

The Cost Problem

Speed costs money. A fast GPU processes requests in milliseconds but burns kilowatts doing it. For answer inference, that's fine - you're serving thousands of users per second. For agentic inference, it's wasteful. The agent doesn't need sub-second responses. It needs to keep running cheaply.

This opens the door for different hardware. CPUs are slower than GPUs but use less power. Edge devices are slower still but cost almost nothing to run continuously. If latency doesn't matter, you can use cheaper compute and pocket the difference.
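
The back-of-envelope version, using assumed prices purely for illustration (real rates vary by provider, region, and year):

# A 72-hour agent task on two kinds of compute. Prices are assumptions.
HOURS = 72
gpu_per_hour = 2.50   # assumed on-demand datacentre GPU rate
cpu_per_hour = 0.10   # assumed large CPU instance rate

gpu_cost = HOURS * gpu_per_hour   # $180.00
cpu_cost = HOURS * cpu_per_hour   # $7.20

# If each step tolerates seconds of latency, the CPU run completes the
# same task roughly 25x cheaper.
print(f"GPU ${gpu_cost:.2f} vs CPU ${cpu_cost:.2f}")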

The economic incentive is significant. Answer inference is a race to serve more users faster. Agentic inference is a race to run longer tasks cheaper. Different races produce different winners.

What Changes

If agentic inference becomes the dominant workload - and that's a big if - several things shift.

First, model size matters less than memory efficiency. A smaller model that fits entirely in cheap RAM might outperform a larger model that requires expensive GPU memory for context.

Second, persistent storage becomes critical infrastructure. Agents need to checkpoint their state frequently. Fast, reliable, cheap storage - not cutting-edge GPUs - becomes the bottleneck.

Third, orchestration tooling becomes more important than model quality. An agent that can recover from failures, handle interruptions, and manage complex workflows beats a more accurate model that crashes when context grows large.

Fourth, cost accounting gets complicated. With answer inference, you charge per request. With agentic inference, you charge for... what? Time spent? Tokens processed? Tasks completed? The billing model isn't obvious.
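
All three dimensions are cheap to meter, which is part of why the choice is a pricing question rather than a technical one. A purely illustrative sketch tracking them side by side:

import time
from dataclasses import dataclass, field

@dataclass
class UsageMeter:
    # Three candidate billing dimensions, tracked simultaneously.
    started_at: float = field(default_factory=time.monotonic)
    tokens: int = 0
    tasks_completed: int = 0

    def record_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.tokens += prompt_tokens + completion_tokens

    def record_task(self) -> None:
        self.tasks_completed += 1

    def invoice(self) -> dict:
        hours = (time.monotonic() - self.started_at) / 3600
        return {
            "by_time": hours,                  # charge per wall-clock hour?
            "by_tokens": self.tokens,          # per token processed?
            "by_tasks": self.tasks_completed,  # per completed task?
        }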

The Uncertainty

Thompson's argument is logical but untested at scale. We don't have many real-world examples of long-running agentic systems. The use cases exist - legal analysis, code review, monitoring dashboards - but the infrastructure is still being figured out.

It's also possible that latency matters more than he suggests. Even if a human isn't waiting, other systems might be. An agent coordinating supply chain decisions might need to respond to external events quickly. An agent managing infrastructure might need to react to outages in seconds, not minutes.

But the core insight holds: if AI systems are going to work autonomously on complex tasks, they need different infrastructure than the systems we've built for chatbots. Speed stops being the constraint. Cost and persistence become the game.

Read Thompson's full analysis at Stratechery.


