You've upgraded to the latest model. Your agent still crashes halfway through a task. You add more compute. It makes the same mistakes. You blame the model. You're looking in the wrong place.
Here's what nobody tells you about production AI agents: the model is almost never your problem. The scaffolding around it - the architecture you've built to support it - that's where agents live or die.
A recent analysis from ClawGenesis documents case after case of identical models producing wildly different results. Same GPT-4. Same task. One agent completes it smoothly. The other falls apart. The difference? Tool descriptions. Error handling. State management. The boring stuff.
The Real Bottlenecks Nobody Talks About
Tool descriptions sound trivial until you realise your agent is choosing between twelve functions based on a sentence you wrote in ten seconds. That sentence is everything. An agent with a vague tool description will try the wrong tool, fail, retry with another wrong tool, and burn through your token budget before giving up. A well-described tool gets picked first time.
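To make that concrete, here's a minimal sketch of the difference, written in the OpenAI function-calling schema style. The tool names, descriptions, and fields are hypothetical, not taken from the ClawGenesis analysis:

```python
# Hypothetical tool schemas in the OpenAI function-calling style.
# Same shape, very different outcomes for the agent choosing between them.

vague_tool = {
    "name": "search",
    "description": "Searches stuff.",  # the ten-second sentence
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

precise_tool = {
    "name": "search_order_history",
    "description": (
        "Look up a customer's past orders by customer ID. "
        "Use this when the user asks about a previous purchase, refund, "
        "or delivery. Do NOT use this for product catalogue questions - "
        "use search_catalogue instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
}
```

The second description does three jobs in three sentences: what the tool does, when to reach for it, and when not to. That last part is what keeps the agent from trying it as a fallback for everything.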
The number of tools matters more than you'd think. Give an agent thirty tools and watch it drown in choice paralysis. It's not stupidity - it's probability. More options mean more edge cases, more ambiguity, more chances to pick wrong. The best production agents often work with fewer than ten tools, each doing one thing extremely well.
Then there's error handling. Models don't fail gracefully by default. They hallucinate, they retry the same broken approach, they get stuck in loops. Your architecture needs to catch this. A simple retry mechanism with backoff can lift a 40% success rate to 85%. State tracking - knowing what the agent has already tried - stops it repeating mistakes. These aren't advanced techniques. They're basic engineering that most agent implementations skip.
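Here's a minimal sketch of both mechanisms. The TransientToolError is an assumed exception type that your tool wrappers would raise on retryable failures like timeouts or rate limits:

```python
import random
import time


class TransientToolError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_backoff(fn, *args, max_retries=4, base_delay=1.0, **kwargs):
    """Retry a flaky tool call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except TransientToolError:
            if attempt == max_retries - 1:
                raise  # out of retries: fail loudly rather than loop forever
            # Sleep 1s, 2s, 4s, ... plus jitter so parallel agents don't stampede.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))


# State tracking: remember what the agent has already tried so it can't
# burn its token budget repeating the exact same failed call.
attempted: set[tuple[str, str]] = set()


def is_new_attempt(tool_name: str, args_fingerprint: str) -> bool:
    key = (tool_name, args_fingerprint)
    if key in attempted:
        return False  # seen before: force the agent onto a different approach
    attempted.add(key)
    return True
```

Neither function is clever. That's the point - this is the plumbing that most agent codebases are missing.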
What Good Scaffolding Actually Looks Like
The ClawGenesis analysis includes a case study that's worth paying attention to. Two teams building the same agent. Same model, same task - processing customer support tickets. Team A's agent had a 42% completion rate. Team B's hit 89%. The model was identical. The difference was entirely architectural.
Team B limited tool count to eight core functions. Each tool had a three-part description: what it does, when to use it, what NOT to use it for. They implemented retry logic with exponential backoff. They tracked state across calls so the agent could resume after interruptions. They added a "confidence threshold" - if the agent wasn't sure which tool to use, it asked for human confirmation rather than guessing.
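The confidence threshold is the piece most teams skip, so here's a minimal sketch of the pattern as described. The 0.75 threshold and the scoring interface are my assumptions, not figures from the case study:

```python
class NeedsHumanReview(Exception):
    """Raised instead of guessing when no tool scores above the threshold."""


CONFIDENCE_THRESHOLD = 0.75  # assumed value; the write-up doesn't give one


def route(tool_scores: dict[str, float]) -> str:
    """Pick a tool, or escalate rather than guess when confidence is low."""
    best_tool, best_score = max(tool_scores.items(), key=lambda kv: kv[1])
    if best_score < CONFIDENCE_THRESHOLD:
        raise NeedsHumanReview(f"best guess {best_tool!r} at {best_score:.2f}")
    return best_tool
```

Where the scores come from is an implementation detail - the model's own ranking, logprobs, a lightweight classifier. What matters is that "ask a human" is a legal output, so the agent never has to guess.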
Team A had eighteen tools with single-sentence descriptions. No retry logic. No state tracking. When the agent failed, it just... stopped. The model was doing its job. The architecture was failing it.
Why This Matters Now
We're entering a phase where model capabilities are outpacing our ability to deploy them well. GPT-4, Claude, Gemini - they're all capable of complex agent work. But most production agents still fail more often than they succeed, and teams keep blaming the model.
The real problem is simpler and harder: we're not building the infrastructure these models need to work reliably. Tool design, error handling, state management - this is plumbing work. It's not exciting. It doesn't make for good demos. But it's the difference between an agent that works in a demo and one that works in production.
For developers building agents right now, the message is clear. Stop upgrading models hoping for better results. Start auditing your tool descriptions. Count your tools - if you have more than twelve, you probably have too many. Implement proper error handling. Track state. Test failure modes.
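That last item, testing failure modes, deserves one concrete illustration. Here's a hypothetical pytest-style test against a stub agent; StubAgent and its escalation behaviour are stand-ins for whatever your framework actually exposes:

```python
class StubResult:
    def __init__(self, status: str, tool_calls: int):
        self.status, self.tool_calls = status, tool_calls


class StubAgent:
    """Minimal stand-in agent: retries a tool up to max_retries, then escalates."""

    def __init__(self, tool, max_retries: int = 3):
        self.tool, self.max_retries = tool, max_retries

    def run(self, task: str) -> StubResult:
        for attempt in range(1, self.max_retries + 1):
            try:
                self.tool(task)
                return StubResult("done", attempt)
            except TimeoutError:
                continue  # retry; bounded by max_retries
        return StubResult("escalated", self.max_retries)


def test_agent_escalates_after_bounded_retries():
    def always_times_out(_task):
        raise TimeoutError("simulated upstream outage")

    result = StubAgent(always_times_out, max_retries=3).run("Where is my order?")
    assert result.status == "escalated"  # fails loudly, doesn't just stop
    assert result.tool_calls == 3        # bounded retries, no infinite loop
```

If your agent can't pass a test like this against its own tools, no model upgrade will save it.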
The model you have is almost certainly good enough. The architecture around it probably isn't. That's your bottleneck.