Training large models used to be the expensive part. You'd spend months and millions burning through GPUs to create a new model; inference - actually using it - was cheap by comparison. That ratio is flipping.
Latent Space just published an analysis showing inference compute is now the constraint. Not training. Not data quality. The compute required to serve billions of queries at scale. This changes procurement strategies, chip design priorities, and where the money flows.
The Shift in GPU Workload Economics
Here's the pattern: training a frontier model is a one-time cost. Expensive, yes - hundreds of millions in compute. But you do it once. Inference happens every time someone uses the model. Multiply that by millions of users, thousands of queries per second, and inference compute eclipses training compute within months.
OpenAI's GPT-4 cost an estimated $100 million to train. The inference cost to serve it to ChatGPT's 200 million weekly users? Multiples of that, every quarter. That's why inference is now the strategic constraint, not model development.
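A back-of-envelope calculation shows the shape of that crossover. Every figure below is an assumption picked for illustration - swap in your own estimates - but the point survives: a one-off training run gets overtaken by recurring serving costs within a quarter or two.

```python
# Illustrative arithmetic only: every figure below is an assumption chosen to
# show the shape of the crossover, not a reported cost.

training_cost = 100e6              # assumed one-off training run ($)
weekly_users = 200e6               # assumed weekly active users
queries_per_user_per_week = 15     # assumed average queries per user
cost_per_query = 0.01              # assumed blended serving cost per query ($)

weekly_inference = weekly_users * queries_per_user_per_week * cost_per_query
quarterly_inference = weekly_inference * 13

print(f"Weekly inference spend:    ${weekly_inference / 1e6:,.0f}M")
print(f"Quarterly inference spend: ${quarterly_inference / 1e6:,.0f}M")
print(f"Quarterly spend vs training run: {quarterly_inference / training_cost:.1f}x")
```

With those assumed inputs, quarterly serving spend comes out at roughly four times the training run. The lever that matters is cost per query, which is exactly where the rest of this piece goes.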
This matters for chip manufacturers. NVIDIA's H100 GPUs were designed for training workloads - massive parallel compute, high memory bandwidth. Inference needs different optimisations: latency over throughput, smaller batch sizes, faster token generation. The next generation of chips will prioritise inference, because that's where the volume is.
Disaggregation and the CPU Refresh Cycle
The other shift is workload disaggregation. Training happens in centralised clusters - massive GPU farms running for months. Inference happens everywhere: edge devices, regional data centres, customer premises. That's a different infrastructure problem.
Intel and AMD are pushing CPU refresh cycles specifically for inference. Their pitch: you don't need cutting-edge GPUs for every inference task. Smaller models running on modern CPUs with hardware acceleration can handle a lot of queries more cost-effectively than spinning up GPU instances.
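As a sketch of what that looks like in practice, the snippet below runs a small quantised model entirely on CPU threads using llama-cpp-python. The model path and thread count are placeholders, not a recommendation of any specific model; the point is that no GPU is involved anywhere.

```python
# Minimal CPU-only inference sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF file path is a placeholder - any small quantised model works.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/small-7b-q4.gguf",  # hypothetical quantised 7B model
    n_ctx=2048,        # context window
    n_threads=8,       # run entirely on CPU threads
    n_gpu_layers=0,    # offload nothing to a GPU
)

result = llm(
    "Summarise in one sentence: inference compute is the new bottleneck.",
    max_tokens=64,
    temperature=0.2,
)
print(result["choices"][0]["text"].strip())
```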
The maths works for latency-insensitive tasks. If you're generating marketing copy or summarising documents, a 200ms response time on a CPU is fine. If you're doing real-time voice transcription, you need GPU speed. The market is bifurcating: high-value, latency-sensitive inference stays on GPUs. Everything else migrates to cheaper CPU-based inference.
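One way to picture that bifurcation is a routing rule at the serving layer: send a request to the cheap CPU pool unless its latency budget demands GPU speed. This is a hypothetical sketch with assumed latency figures, not anyone's production router.

```python
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    task: str
    latency_budget_ms: int   # how long the caller can wait for a response

# Assumed, illustrative latencies for each pool.
CPU_POOL_TYPICAL_LATENCY_MS = 200   # small quantised model on CPU
GPU_POOL_TYPICAL_LATENCY_MS = 30    # large model on dedicated accelerators

def route(request: InferenceRequest) -> str:
    """Send latency-tolerant work to the cheaper CPU pool, the rest to GPUs."""
    if request.latency_budget_ms >= CPU_POOL_TYPICAL_LATENCY_MS:
        return "cpu-pool"
    return "gpu-pool"

print(route(InferenceRequest("summarise-document", latency_budget_ms=2000)))      # cpu-pool
print(route(InferenceRequest("live-voice-transcription", latency_budget_ms=100))) # gpu-pool
```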
What This Means for Model Deployment
Smaller, distilled models are suddenly more valuable. If inference compute is the constraint, a 7B parameter model that runs fast is better than a 70B model that's marginally more accurate but 10x slower. OpenAI's recent pricing cuts reflect this - they're betting on volume over margin, which only works if inference costs drop.
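The economics of that trade-off are easy to sketch. All the numbers here are assumptions for illustration, not benchmarks: a smaller model generates tokens far faster on the same hardware, so its cost per million tokens is a fraction of the larger model's.

```python
# Illustrative only: assumed throughput and hardware cost, not measured benchmarks.
gpu_hour_cost = 4.00               # assumed $/hour for a single accelerator

models = {
    "7B distilled":  1500,         # assumed tokens/second on that accelerator
    "70B full-size":  150,         # assumed ~10x slower, as in the text
}

for name, tokens_per_sec in models.items():
    tokens_per_hour = tokens_per_sec * 3600
    cost_per_million = gpu_hour_cost / (tokens_per_hour / 1e6)
    print(f"{name:>13}: ~${cost_per_million:.2f} per million output tokens")
```

Under those assumptions the distilled model serves tokens at roughly a tenth of the cost, which is the gap a marginal accuracy gain has to justify.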
Edge inference is the next frontier. Running models locally - on phones, laptops, IoT devices - eliminates the round-trip to a data centre. Apple's on-device models, Google's Gemini Nano, Meta's Llama running on consumer hardware - these aren't just privacy plays. They're inference cost plays. Every query handled locally is one less server instance to spin up.
This is why companies like Groq and Cerebras are positioning themselves as inference specialists. Their chips are optimised for low-latency token generation, not training throughput. If inference is the bottleneck, the companies solving inference speed win the next wave of contracts.
The Strategic Implications
Model labs are shifting budget allocation. Less on training runs, more on inference infrastructure. That means more engineers working on serving optimisations, quantisation, and caching strategies. The cutting-edge research isn't just "make the model better" - it's "make the model faster to run at scale".
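Caching is the most direct of those serving optimisations: if the same prompt arrives twice, never pay for the second generation. A minimal exact-match cache looks like the sketch below - `call_model` is a placeholder for whatever serving backend you use. Real systems layer prefix (KV) and semantic caching on top, but the cost logic is the same.

```python
import hashlib

# Minimal exact-match response cache: identical prompts never hit the model twice.
_cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    # Placeholder for the expensive inference call.
    return f"(model output for: {prompt!r})"

def cached_completion(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # only pay for inference on a miss
    return _cache[key]

cached_completion("Summarise our refund policy.")   # miss: runs the model
cached_completion("Summarise our refund policy.")   # hit: served from cache, zero compute
```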
Pricing competition is accelerating. If inference costs are the barrier to adoption, whoever drives those costs down fastest captures market share. OpenAI, Anthropic, Google, and open-source providers are all racing to offer cheaper inference. The margin compression is real, but the volume opportunity is bigger.
This also changes the venture landscape. Startups building inference-optimised infrastructure - caching layers, edge runtimes, quantisation tools - are suddenly more interesting than yet another model wrapper. The infrastructure layer is where the value accrues when inference is the constraint.
The inflection point isn't coming. It's here. Training costs plateau as models hit diminishing returns. Inference costs scale linearly with adoption. If AI is going to be in every application, every workflow, every device, inference compute is the real cost to solve. The companies that solve it own the next decade.