The HITL Problem: Why Most AI Agents Aren't Ready for Production

Today's Overview

When an AI agent needs to pause and ask for human approval, something breaks. Not metaphorically. The agent process dies, the pending `await` doesn't survive a restart, the "approve or reject" prompt lands on stdin instead of the channel the reviewer actually watches, or the human clicks through without thinking and the agent learns to ignore the feedback.

An audit of twelve popular AI-agent frameworks exposed this: only three have production-ready human-in-the-loop (HITL) primitives, and even those leave half the problem to the developer. LangGraph wins on durability: paused state can be checkpointed to PostgreSQL, so any worker can resume it. Pydantic AI has the cleanest typed API for deferred tool approval. Mastra leads TypeScript shops. But no framework covers all six axes a production HITL system needs: durable storage, typed I/O, channel abstraction, idempotency, a verifier hook to quality-check the human's response before resuming, and an admin UI to see what's in flight. The rest, CrewAI, LangChain, and AutoGen among them, reduce "pause for a human" to `input()`, which works in a terminal and nowhere else.
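To make two of those axes concrete, here is a minimal sketch of durable storage plus idempotent resume. It is not taken from any of the audited frameworks: `ApprovalStore`, `ApprovalRequest`, and the table layout are hypothetical names, and SQLite stands in for PostgreSQL only so the example runs with the standard library.

```python
import json
import sqlite3
import uuid
from dataclasses import dataclass
from typing import Literal, Optional

Decision = Literal["approve", "reject"]


@dataclass(frozen=True)
class ApprovalRequest:
    """Typed view of one paused tool call (hypothetical shape)."""
    request_id: str
    tool_name: str
    tool_args: dict
    decision: Optional[Decision] = None


class ApprovalStore:
    """Hypothetical store: pending approvals live in a database, not in process memory."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS approvals ("
            "request_id TEXT PRIMARY KEY, tool_name TEXT NOT NULL, "
            "tool_args TEXT NOT NULL, decision TEXT)"
        )
        self.db.commit()

    def pause(self, tool_name: str, tool_args: dict) -> str:
        """Persist the pending call and return an id the agent can resume from later."""
        request_id = uuid.uuid4().hex
        self.db.execute(
            "INSERT INTO approvals (request_id, tool_name, tool_args) VALUES (?, ?, ?)",
            (request_id, tool_name, json.dumps(tool_args)),
        )
        self.db.commit()
        return request_id

    def resolve(self, request_id: str, decision: Decision) -> bool:
        """Record a decision at most once.

        Returns False when the request is unknown or already resolved, so a
        double-clicked button or a retried webhook cannot flip the outcome.
        """
        cur = self.db.execute(
            "UPDATE approvals SET decision = ? "
            "WHERE request_id = ? AND decision IS NULL",
            (decision, request_id),
        )
        self.db.commit()
        return cur.rowcount == 1

    def load(self, request_id: str) -> Optional[ApprovalRequest]:
        """Rebuild the paused request from storage, e.g. after a worker restart."""
        row = self.db.execute(
            "SELECT request_id, tool_name, tool_args, decision "
            "FROM approvals WHERE request_id = ?",
            (request_id,),
        ).fetchone()
        if row is None:
            return None
        return ApprovalRequest(row[0], row[1], json.loads(row[2]), row[3])
```

Passing a file path instead of the default `:memory:` is what would let the pending request outlive the process. The conditional `UPDATE` is the idempotency guard: the first `resolve` call wins and every later one is a no-op.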

What Builders Are Actually Doing This Week

Beyond frameworks, three concrete patterns showed up in the feed. First, Genkit added middleware, giving developers an interception layer around model calls and tool execution. That is useful for adding reliability and safety checks without rewriting the core loop.

Second, forward deployed engineers (FDEs) are in demand again. Google cut its hiring process for them from 6 weeks to 2 days, and OpenAI and Anthropic both spun out separate FDE companies funded by private equity rather than hiring FDEs directly. That's telling: FDEs are turning into solutions architects, not platform engineers.

Third, in the "local LLMs powering actual systems" category, someone built a terminal-based insult sword-fighting game using Gemma 4 on CPU. The point was not practicality. It was to show that smaller models trained on domain-specific data (here, a Monkey Island duel dataset) can handle structured multi-turn tasks better than a baseline chat model.
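For readers who haven't used the middleware pattern from the Genkit item, the sketch below shows the general idea in plain Python. It is not Genkit's API: `with_retries`, `block_large_amounts`, and `apply_middleware` are hypothetical names, and a tool is assumed to be a function that takes a dict of arguments.

```python
from typing import Any, Callable, Dict

ToolCall = Callable[[Dict[str, Any]], Any]
Middleware = Callable[[ToolCall], ToolCall]


def with_retries(max_attempts: int) -> Middleware:
    """Reliability check: retry the wrapped call when it raises RuntimeError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def middleware(next_call: ToolCall) -> ToolCall:
        def wrapped(args: Dict[str, Any]) -> Any:
            last_error: Exception = RuntimeError("no attempt made")
            for _ in range(max_attempts):
                try:
                    return next_call(args)
                except RuntimeError as err:
                    last_error = err
            raise last_error
        return wrapped
    return middleware


def block_large_amounts(limit: float) -> Middleware:
    """Safety check: refuse the call before it reaches the tool."""
    def middleware(next_call: ToolCall) -> ToolCall:
        def wrapped(args: Dict[str, Any]) -> Any:
            if args.get("amount", 0) > limit:
                raise ValueError(f"amount exceeds limit of {limit}")
            return next_call(args)
        return wrapped
    return middleware


def apply_middleware(call: ToolCall, *middlewares: Middleware) -> ToolCall:
    """Wrap `call` so the first middleware listed runs outermost."""
    for mw in reversed(middlewares):
        call = mw(call)
    return call
```

The tool itself never changes. Each check wraps the call from the outside, which is why this kind of layer can be added without touching the core agent loop.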

The Pattern: Production Readiness Means You Have to Build It

If you're shipping an agent that needs human approval, don't expect your framework to carry the load. LangGraph gets you durable pause/resume. Then you're building the rest: the Slack integration, the dashboard, the idempotency guards, the verifier model. Same story with Pydantic AI and Mastra: they set up the foundation, you lay the bricks. The industry hasn't solved HITL yet. It's solved the pause. Everything after that pause is still on you.
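As a rough picture of what "the rest" looks like in code, here is a minimal sketch of a channel abstraction with a verifier hook in front of the resume step. Every name in it (`Channel`, `HumanResponse`, `reject_rubber_stamps`, `request_approval`) is hypothetical, and the response-time threshold is only a stand-in for whatever verifier model or rule you would actually use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol


@dataclass(frozen=True)
class HumanResponse:
    approved: bool
    comment: str
    seconds_to_respond: float


class Channel(Protocol):
    """Hypothetical interface: Slack, email, or a web UI would each implement ask()."""

    def ask(self, prompt: str) -> HumanResponse: ...


# A verifier returns None to accept a response, or a reason string to reject it.
Verifier = Callable[[HumanResponse], Optional[str]]


def reject_rubber_stamps(min_seconds: float = 3.0) -> Verifier:
    """Assumed heuristic: an approval given faster than min_seconds was not read."""
    def verify(response: HumanResponse) -> Optional[str]:
        if response.approved and response.seconds_to_respond < min_seconds:
            return "approved too quickly to have been read"
        return None
    return verify


def request_approval(
    channel: Channel, prompt: str, verifier: Verifier, max_attempts: int = 2
) -> HumanResponse:
    """Ask until the verifier accepts the response or attempts run out."""
    for _ in range(max_attempts):
        response = channel.ask(prompt)
        problem = verifier(response)
        if problem is None:
            return response
        prompt = f"{prompt}\n(Previous answer not accepted: {problem})"
    # Fail closed: an unverified approval is treated as a rejection.
    return HumanResponse(approved=False, comment="verification failed", seconds_to_respond=0.0)


class ScriptedChannel:
    """Test double that replays canned responses in place of a real channel."""

    def __init__(self, responses: List[HumanResponse]) -> None:
        self._responses = list(responses)
        self.prompts: List[str] = []

    def ask(self, prompt: str) -> HumanResponse:
        self.prompts.append(prompt)
        return self._responses.pop(0)
```

Because the agent only sees `Channel.ask`, swapping the terminal for Slack becomes a matter of writing one more class, and the verifier stays the same regardless of where the answer came from.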