Anthropic finally brought 1-million-token context windows to general availability this month. That sounds impressive until you realise Google's Gemini and OpenAI's models got there months earlier. The race to bigger context windows has slowed down. The question is why.
Latent Space calls it a "context drought" - and the reason isn't about algorithms or training techniques. It's about physical hardware hitting a wall. Specifically, memory. You can't just keep expanding context windows indefinitely when the GPUs running these models have finite RAM.
The Hardware Bottleneck Nobody Talks About
Here's the thing about context windows. When an AI model processes a million-token prompt, it has to keep attention state - the key-value cache - for every one of those tokens in memory at once. That's not storage - that's active, high-speed memory sitting on the GPU itself. And that memory is expensive, power-hungry, and physically limited.
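To put rough numbers on it, here's a back-of-envelope sketch. Every dimension in it is an illustrative assumption - roughly the shape of a 70B-class open model, not any vendor's published spec - but the scaling is the point: the cache grows linearly with context length.

```python
# Back-of-envelope KV cache size at long context. Every model
# dimension below is an illustrative assumption (roughly the shape
# of a 70B-class open model), not any vendor's published spec.

n_layers = 80        # transformer layers
n_kv_heads = 8       # key/value heads (grouped-query attention)
head_dim = 128       # dimension per attention head
bytes_per_value = 2  # fp16/bf16 precision

def kv_cache_bytes(context_tokens: int) -> int:
    # Each token stores one key vector and one value vector per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens

for tokens in (128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:,.0f} GiB of KV cache per request")
```

Under those assumptions, a single million-token request wants roughly 300 GiB of cache - several times the 80 GB of memory on a top-end accelerator, before the model weights are even loaded.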
Think of it like RAM in your laptop. You can have all the processing power in the world, but if you run out of memory, everything grinds to a halt. AI models face the same constraint, just at a much larger scale. The chips can only hold so much data at once.
This isn't a software problem you can code your way around. It's a fundamental constraint of physics and economics. Bigger context windows require more memory. More memory means bigger, more expensive chips. At some point, the cost stops making sense.
Context Rationing - The New Reality
What happens when you can't just keep expanding context windows? You start rationing. Latent Space explores this idea of "context rationing" as an emerging economic reality - the notion that context becomes a limited resource you manage strategically, not an infinite buffer you assume is always available.
In practical terms, this means developers need to get smarter about what they include in prompts. Do you really need the entire conversation history, or just the last few exchanges? Do you need the full document, or can you summarise and pass key excerpts?
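Here's a minimal sketch of that kind of trimming, assuming a plain list of chat messages and a crude characters-per-token estimate - both are illustrative, not any particular API's format:

```python
# Keep only the most recent messages that fit a token budget.
# The 4-characters-per-token estimate is a rough heuristic, not a
# real tokenizer; swap in your provider's tokenizer for accuracy.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Walk backwards from the newest message, keeping what fits."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "First question about invoices..."},
    {"role": "assistant", "content": "A long answer..."},
    {"role": "user", "content": "Follow-up about the March totals"},
]
# With a tight budget, only the latest message survives.
print(trim_history(history, budget=10))
```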
We've seen this pattern before in computing. When memory was scarce, programmers wrote tighter, more efficient code. When bandwidth was limited, compression techniques improved. Constraints drive innovation - but they also force trade-offs.
What This Means For Builders
If you're building AI applications today, don't assume context windows will keep growing at the same rate. Plan for a world where context is finite and costs money. That changes how you architect systems.
Retrieval-augmented generation (RAG) becomes more important, not less. Instead of stuffing everything into the context window, you retrieve relevant information on demand. It's more complex to build, but it scales better when memory is the bottleneck.
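In miniature, the pattern looks like this - with naive keyword overlap standing in for a real embedding search, and hypothetical helper names throughout:

```python
# Retrieval-augmented generation in miniature: instead of sending
# every document, score and select the few most relevant chunks.
# Naive keyword overlap stands in for a real vector search here.

def score(query: str, chunk: str) -> int:
    q_words = set(query.lower().split())
    return sum(1 for w in chunk.lower().split() if w in q_words)

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]

documents = [
    "Refund policy: customers may return goods within 30 days.",
    "Shipping: orders over 50 GBP ship free within the UK.",
    "Warranty: hardware faults are covered for two years.",
]
query = "How long do customers have to return goods?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using only this context:\n{context}\n\nQ: {query}"
print(prompt)
```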
Similarly, summarisation and compression techniques matter more. If you can distil a 100,000-token document into a 5,000-token summary without losing critical information, you've just made your system 20 times more efficient.
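One common shape for that distillation is hierarchical, sometimes called map-reduce summarisation: compress each chunk, then compress the combined result. In this sketch, `summarise` just truncates so the example runs; in practice it would be a model call:

```python
# Hierarchical (map-reduce) summarisation: compress each chunk,
# then compress the combined result. `summarise` here is a runnable
# stand-in that truncates; in practice it would be an LLM call.

def summarise(text: str, max_chars: int) -> str:
    return text[:max_chars]  # placeholder for a real model call

def chunks(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def compress_document(doc: str, chunk_chars: int = 8_000) -> str:
    # Map: summarise each chunk; reduce: summarise the summaries.
    partials = [summarise(c, max_chars=800) for c in chunks(doc, chunk_chars)]
    return summarise("\n".join(partials), max_chars=2_000)

print(len(compress_document("lorem ipsum " * 20_000)))  # 240k chars in, 2k out
```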
The other implication is cost. Right now, API providers charge by the token. If context windows stop growing and per-token prices don't keep falling, every unnecessary token in a prompt is money spent for nothing. Waste adds up fast.
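The arithmetic is worth doing explicitly. The per-token price below is a placeholder, not any provider's actual rate:

```python
# Illustrative cost of prompt waste at scale. The per-token price
# is an assumed placeholder; substitute your provider's real rate.

price_per_million_input_tokens = 3.00  # USD, assumed

def monthly_cost(tokens_per_request: int, requests_per_day: int) -> float:
    tokens = tokens_per_request * requests_per_day * 30
    return tokens / 1_000_000 * price_per_million_input_tokens

lean = monthly_cost(tokens_per_request=2_000, requests_per_day=10_000)
bloated = monthly_cost(tokens_per_request=50_000, requests_per_day=10_000)
print(f"lean: ${lean:,.0f}/mo  bloated: ${bloated:,.0f}/mo")
```

At those assumed numbers, the same workload costs $1,800 a month with lean prompts and $45,000 a month with bloated ones.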
The Bigger Picture
This context drought reflects a broader reality in AI development. We're moving from the "scale at all costs" phase to the "optimise what we have" phase. The low-hanging fruit - just making models bigger and feeding them more context - has mostly been picked.
That's not a bad thing. It forces the industry to get smarter about efficiency, architecture, and real-world constraints. The next wave of innovation won't be "we added another zero to the context window." It'll be "we figured out how to do more with less."
For business owners and developers, the takeaway is clear. Build for the world where context is limited, not the fantasy where it's infinite. The hardware has spoken.