Morning Edition

LLMs Hit a Wall. Here's What's Next.

Today's Overview

There's a quiet reckoning happening in AI right now. Frontier LLMs like GPT-5 and Claude are scoring below 1% on ARC-AGI-3, a new benchmark launched yesterday that measures something traditional tests never have: whether AI can explore, discover rules, and solve problems with zero instructions in interactive game environments.

This isn't a minor difficulty bump. The benchmark shifted from static grid puzzles to turn-based 3D games where agents must figure out both what to do and how to do it through pure interaction. A simple reinforcement-learning baseline (a CNN combined with sparse rewards) reached 12.58%; language models flatlined. The gap reveals something uncomfortable: current AI systems excel at pattern-matching static data but struggle with sustained sequential reasoning and learning from environmental feedback. For businesses building on LLMs, this matters: it suggests the path forward isn't just bigger models, but fundamentally different architectures.
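The capability being tested here (discovering an environment's hidden rules through interaction alone) can be illustrated with a toy sketch. This is not ARC-AGI-3 itself; the environment and all names below are invented for illustration. A tabular Q-learning agent, given only a sparse reward, learns which action actually does anything:

```python
import random

# Toy environment with a hidden rule: only one action (here, action 2)
# moves the agent toward the goal. The agent is never told this; reward
# is sparse -- 1.0 on reaching the goal, 0.0 otherwise.
class HiddenRuleEnv:
    def __init__(self, size=5, good_action=2, n_actions=4):
        self.size, self.good_action, self.n_actions = size, good_action, n_actions
        self.reset()

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        if action == self.good_action:
            self.pos += 1  # hidden dynamics: only this action makes progress
        done = self.pos >= self.size
        return self.pos, (1.0 if done else 0.0), done

def q_learn(env, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning against an unknown environment."""
    rng = random.Random(seed)
    q = [[0.0] * env.n_actions for _ in range(env.size + 1)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.randrange(env.n_actions) if rng.random() < eps \
                else max(range(env.n_actions), key=lambda i: q[s][i])
            s2, r, done = env.step(a)
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = q_learn(HiddenRuleEnv())
# After training, the greedy choice in the start state is the hidden action.
print(max(range(4), key=lambda a: q[0][a]))
```

The point of the toy is the contrast: nothing in the agent's inputs describes the rule, so it can only be recovered through trial, feedback, and credit assignment over time, which is exactly the regime where static pattern-matchers flatline.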

Making AI Workflows Actually Work

While frontier models hit their limits on reasoning tasks, a parallel shift is happening in how teams actually deploy AI agents. A new framework called aqm lets you define multi-agent workflows in pure YAML: no Python glue code, and no API keys required if you've already configured your AI CLI. Agents, handoffs, quality gates, and retry logic all live in a single file. This isn't a research breakthrough, but it is practical: teams stop rewriting the same orchestration boilerplate and instead treat agent pipelines as configuration. That's a small shift that scales across every startup using Claude or Gemini internally.
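To make the "pipelines as configuration" idea concrete, here is a purely hypothetical sketch of what such a file could look like. These field names are invented for illustration and are not aqm's documented schema:

```yaml
# Hypothetical workflow definition -- field names are illustrative,
# not aqm's actual schema.
agents:
  drafter:
    model: claude          # uses your already-configured AI CLI
    prompt: "Draft a changelog entry from the diff."
  reviewer:
    model: gemini
    prompt: "Check the draft for accuracy and tone."

workflow:
  - agent: drafter
  - agent: reviewer
    handoff_from: drafter
    quality_gate: "No factual errors"   # reject and retry if unmet
    retries: 2
```

The appeal is that everything that usually lives in ad-hoc orchestration scripts (who runs when, who hands off to whom, what counts as acceptable output) becomes a declarative artifact you can diff and review.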

Real-World Progress Where It Counts

Not everything is hitting a wall. MIT and Symbotic shipped a hybrid system combining deep reinforcement learning with classical planning algorithms that achieves 25% better throughput in e-commerce warehouses by predicting congestion and dynamically prioritizing robot paths. The system adapts to new warehouse layouts without retraining. For warehouse operators, even a 2-3% throughput gain translates to millions in annual cost savings. This is what happens when you stop chasing AGI benchmarks and focus on a specific problem: moving robots efficiently through constrained space. Reinforcement learning works. Classical optimization works. Together, they work better.
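The general pattern (a learned congestion predictor feeding a classical planner) can be sketched in a few lines. This is a minimal illustration of the hybrid idea, not MIT and Symbotic's actual system: in production the predictor would be a trained model, while here it is a hand-coded stand-in that flags one busy aisle:

```python
import heapq

def predicted_congestion(cell):
    """Stand-in for a learned model: penalize cells in a known busy aisle."""
    busy_aisle = {(1, 1), (1, 2), (1, 3)}
    return 5.0 if cell in busy_aisle else 0.0

def plan(grid_w, grid_h, start, goal, congestion=predicted_congestion):
    """Dijkstra over a 4-connected grid with congestion-inflated edge costs."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < grid_w and 0 <= nxt[1] < grid_h:
                nd = d + 1.0 + congestion(nxt)  # base cost + predicted delay
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt], prev[nxt] = nd, cell
                    heapq.heappush(pq, (nd, nxt))
    # Reconstruct the path from goal back to start.
    path, cell = [goal], goal
    while cell != start:
        cell = prev[cell]
        path.append(cell)
    return path[::-1]

# The planner routes around the congested aisle rather than through it.
route = plan(4, 4, (0, 0), (3, 3))
print(route)
```

The division of labor is the point: the learned component only has to answer "how slow will this cell be?", a prediction problem it is good at, while guarantees about path optimality come from the classical algorithm. Swapping the predictor for one trained on a new layout changes routes without touching the planner.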

The broader pattern is clear: AI isn't hitting a ceiling everywhere, just in specific directions. Interactive reasoning and exploration remain hard. Multi-turn collaboration with humans is improving but fragile. But focused applications-warehouse logistics, document extraction, structured data mapping-are producing real value. For builders deciding what to build on, the question isn't whether AI works. It's which problems have AI solutions that actually work, and which are still waiting for the next architectural shift.