Today's Overview
The entire conversation about AI compute is about to split in two. Ben Thompson calls it "the inference shift" - the realisation that when humans aren't waiting for an answer, speed stops being the constraint. A robot solving a task overnight doesn't care if it takes an hour or a day. An agent managing your inbox doesn't check the clock. This changes what hardware you build and how much you're willing to pay for it.
Why Memory Beats Speed
Cerebras went public last week on the strength of a chip that does one thing exceptionally well: it generates tokens incredibly fast. The WSE-3 has roughly 6,000 times the memory bandwidth of an H100. But that speed advantage only matters if your entire model fits on-chip. The moment an agent needs to hold context - trace logs, conversation history, previous decisions, tool results - the architecture falls apart. That's not a flaw in Cerebras. It's a flaw in the assumption that inference = latency.
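The "fits on-chip" constraint is easy to sketch. Cerebras publishes 44 GB of on-chip SRAM for the WSE-3; the model shapes and context sizes below are illustrative assumptions for the arithmetic, not measurements:

```python
# Back-of-envelope: does a model, plus a growing agent context, fit on-chip?
# 44 GB SRAM is Cerebras's published WSE-3 figure; everything else here is
# an assumed, illustrative model shape.

WSE3_SRAM_GB = 44  # on-chip memory

def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GB at a given precision (fp16 = 2 bytes)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_val: int = 2) -> float:
    """KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim per token."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_val / 1e9

small = weights_gb(8)    # an 8B-class model: 16 GB, fits comfortably
large = weights_gb(70)   # a 70B-class model: 140 GB, does not fit at all

# An agent holding 200k tokens of trace logs and tool results
# (hypothetical 70B-like shape: 80 layers, 8 KV heads, head dim 128)
ctx = kv_cache_gb(200_000, layers=80, kv_heads=8, head_dim=128)

print(f"8B weights:  {small:.0f} GB (fits: {small < WSE3_SRAM_GB})")
print(f"70B weights: {large:.0f} GB (fits: {large < WSE3_SRAM_GB})")
print(f"200k-token KV cache alone: {ctx:.3f} GB")
```

Under these assumptions the context alone outgrows the chip long before the weights are the problem, which is the point: agent workloads are memory-capacity bound, not bandwidth bound.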
The real constraint for agentic systems is context. Agents need memory hierarchies, not raw compute. This is what the Arize team discovered building Alyx: truncating context breaks reasoning. Summarisation gives the model too much control and loses detail. What works is head/tail preservation paired with a retrievable memory store - the opposite of what current GPU infrastructure is optimised for. If you're running agents overnight, traditional DRAM makes more sense than high-bandwidth memory. Cheaper. Slower. But you don't notice the slowness when the task takes twelve hours anyway.
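The head/tail pattern can be sketched in a few lines. This is my own minimal version of the idea described above, not Arize's Alyx implementation: keep the start and end of the history verbatim, and move the middle into a retrievable store instead of summarising or truncating it.

```python
# Minimal sketch of head/tail preservation with a retrievable memory store.
# The class names, window sizes, and keyword scoring are assumptions;
# a real agent would use embeddings for retrieval.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Naive retrievable store standing in for a vector database."""
    entries: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Crude keyword-overlap scoring in place of vector search.
        def score(e: str) -> int:
            return len(set(query.lower().split()) & set(e.lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]

def compact_context(messages: list[str], store: MemoryStore,
                    head: int = 2, tail: int = 4) -> list[str]:
    """Preserve head and tail verbatim; offload the middle to the store."""
    if len(messages) <= head + tail:
        return messages
    for m in messages[head:-tail]:
        store.add(m)
    marker = f"[{len(messages) - head - tail} earlier steps moved to memory]"
    return messages[:head] + [marker] + messages[-tail:]

store = MemoryStore()
history = [f"step {i}: ran tool, got result {i}" for i in range(20)]
window = compact_context(history, store)
print(len(window))                    # head + marker + tail = 7
print(store.retrieve("result 7")[0])  # the offloaded middle stays reachable
```

Nothing is lost: the middle steps survive verbatim in the store and come back on demand, which is what distinguishes this from summarisation.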
This creates an opening for different chip architectures entirely. China, without cutting-edge silicon, has everything needed for agentic inference: "fast enough" chips, "fast enough" CPUs, DRAM, storage. The US focus on bleeding-edge compute for training and answer inference leaves room for someone else to own the infrastructure layer that actually matters at scale - the one that's mostly waiting on memory.
The Economic Bargain Breaks
Separately, TSMC isn't buying ASML's newest lithography machines through 2029. The reason: cost. For fifty years, the semiconductor industry ran on a bargain - each more expensive tool delivered cheaper chips. The overall cost per transistor stopped falling in 2011. A narrower metric that tracks lithography economics specifically kept improving until recently; now it too has reversed. You can still make faster chips. You just can't make cheaper ones. That economic shift matters more than the physics.
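The bargain is one division: cost per transistor = wafer cost / transistors per wafer. With hypothetical numbers chosen only to show the shape, not real foundry figures, you can watch it break:

```python
# Illustrative arithmetic only - every figure below is hypothetical.
# Cost per transistor keeps falling while density gains outrun tool cost;
# it stalls when the two grow at the same rate.

def cost_per_billion_transistors(wafer_cost_usd: float,
                                 density_mtr_per_mm2: float,
                                 usable_mm2: float = 60_000) -> float:
    """Wafer cost divided by transistors per wafer, in $ per billion."""
    transistors = density_mtr_per_mm2 * 1e6 * usable_mm2
    return wafer_cost_usd / (transistors / 1e9)

# Old regime: a node shrink doubles density while wafer cost rises ~30%.
old = cost_per_billion_transistors(10_000, 100)
new = cost_per_billion_transistors(13_000, 200)   # denser AND cheaper

# New regime: density gains shrink to ~1.3x while cost still rises ~30%.
stall = cost_per_billion_transistors(13_000 * 1.3, 200 * 1.3)

print(f"old node:  ${old:.2f}/Btr")
print(f"shrink:    ${new:.2f}/Btr (cheaper - the bargain holds)")
print(f"next node: ${stall:.2f}/Btr (flat - faster, but no longer cheaper)")
```

When the cost and density multipliers converge, the division goes flat: you still get denser, faster chips, but each transistor stops getting cheaper.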
These two threads - agents that don't need speed, and chip economics that no longer reward miniaturisation - are about to reshape what gets built. The next phase of AI infrastructure won't look like a scaled version of what Nvidia built. It will look like something optimised for memory, context, and cost over latency. The companies that recognise this shift early will own the 2027-2030 infrastructure build.