Most AI Frameworks Treat Human Approval Like a Console Prompt

A developer just audited twelve AI agent frameworks to see how they handle human approval. Only three of them came close to getting it right.

The audit covered LangGraph, Pydantic AI, Mastra, and nine others - frameworks people actually use to build AI agents in production. The author scored each one out of 30 across six criteria, including durability (what happens if the system crashes mid-approval?), idempotency (can you safely retry?), typed input/output, channel abstraction, and whether the framework forces you to block the entire agent while waiting for a human.

Most of them failed spectacularly. Eight frameworks scored below 10 out of 30. The worst offenders reduce human approval to a literal input() call - the Python equivalent of stopping your entire application to wait for someone to type something into a terminal. If the process dies, the approval request vanishes. If the user refreshes the page, nothing happens. If two requests come in at once, the second one overwrites the first.

This isn't theoretical. If you're building an AI agent that needs approval before spending money, modifying data, or taking any action with consequences, you need durability. The agent should be able to pause, store the approval request somewhere persistent, and resume when the human responds - even if that takes three days. Most of these frameworks can't do that without you rebuilding the entire approval system yourself.

The Three That Actually Work

Three frameworks scored 15 or higher: LangGraph (18/30), Mastra (16/30), and Pydantic AI (15/30). What they have in common: they treat human approval as a first-class async operation, not a blocking input call.

LangGraph uses persistent checkpoints. If your agent needs approval, it saves its state, pauses, and waits for a signal. The process can die. The server can restart. When the human clicks "approve", the agent picks up exactly where it left off. That's what production-ready looks like.
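To make the checkpoint idea concrete, here is a minimal sketch using only the Python standard library. It is not LangGraph's API: the step names, the JSON checkpoint file, and the run_agent function are all hypothetical, chosen just to show the save, pause, and resume cycle.

```python
import json
import os
import tempfile

# Hypothetical three-step agent; the middle step needs a human decision.
STEPS = ["draft_refund", "await_approval", "issue_refund"]


def save_checkpoint(path, state):
    # Write to a temp file and rename, so a crash never leaves a half-written file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0, "approved": None, "log": []}
    with open(path) as f:
        return json.load(f)


def run_agent(path, decision=None):
    """Run until finished or until approval is needed. Safe to call after a restart."""
    state = load_checkpoint(path)
    at_gate = state["step"] < len(STEPS) and STEPS[state["step"]] == "await_approval"
    if decision is not None and at_gate:
        state["approved"] = decision
    while state["step"] < len(STEPS):
        name = STEPS[state["step"]]
        if name == "await_approval":
            if state["approved"] is None:
                save_checkpoint(path, state)  # persist, then stop instead of blocking
                return "paused"
            if not state["approved"]:
                state["log"].append("rejected")
                state["step"] = len(STEPS)
                save_checkpoint(path, state)
                return "done"
        else:
            state["log"].append(name)  # stand-in for the real side effect
        state["step"] += 1
        save_checkpoint(path, state)
    return "done"


if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "agent.json")
    print(run_agent(checkpoint))                 # pauses at the approval step
    print(run_agent(checkpoint, decision=True))  # a later process resumes and finishes
```

Every call starts by reading the checkpoint, so the second call could just as well run in a new process three days later. One caveat of this sketch: a crash between a side effect and the following save would replay that step, so real steps still need to be idempotent.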

Mastra separates the approval request from the execution flow. The agent doesn't block - it hands the request off to a channel and keeps running other tasks. When approval comes back, it resumes. This is how you build systems that handle hundreds of approval requests without grinding to a halt.
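The non-blocking hand-off can be sketched with asyncio. This is an illustration of the pattern only, not Mastra's API (Mastra is a TypeScript framework): ApprovalChannel, guarded_task, and background_task are hypothetical names, and the channel here lives in memory, so it shows the concurrency half of the story and not durability.

```python
import asyncio


class ApprovalChannel:
    """Hypothetical channel: hands a request to some UI and resolves it later."""

    def __init__(self):
        self._pending = {}

    def request(self, request_id, summary):
        # Return a future immediately; nothing blocks while the human decides.
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        return future

    def resolve(self, request_id, approved):
        future = self._pending.pop(request_id, None)
        if future is not None and not future.done():
            future.set_result(approved)


async def guarded_task(channel, request_id, events):
    # Only this task waits; the event loop stays free for everything else.
    approved = await channel.request(request_id, "delete 3 records")
    events.append(f"{request_id}:{'ran' if approved else 'skipped'}")


async def background_task(events):
    for i in range(3):
        events.append(f"tick{i}")
        await asyncio.sleep(0)  # yield to other tasks


async def main():
    channel = ApprovalChannel()
    events = []
    waiting = asyncio.create_task(guarded_task(channel, "req-1", events))
    await background_task(events)   # other work proceeds while req-1 is pending
    channel.resolve("req-1", True)  # the human responds
    await waiting
    return events


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The background ticks are recorded before the approval lands, which is the point: the pending request costs one suspended task, not a stalled agent.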

Pydantic AI uses typed input and output schemas. The agent knows exactly what it's asking for, and the human knows exactly what they're approving. No ambiguous strings, no parsing errors, no "did they mean yes or Yes or y?"
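Here is a rough sketch of what typed approval schemas buy you, written with standard-library dataclasses and an Enum as a stand-in for Pydantic models. The field names and the parse_response helper are assumptions for illustration, not Pydantic AI's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"


@dataclass(frozen=True)
class ApprovalRequest:
    # What the agent is asking for, spelled out field by field.
    request_id: str
    action: str
    amount_cents: int

    def __post_init__(self):
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")


@dataclass(frozen=True)
class ApprovalResponse:
    # What the human decided, limited to the values in Decision.
    request_id: str
    decision: Decision
    reviewer: str


def parse_response(payload: dict) -> ApprovalResponse:
    """Validate an untrusted payload; anything outside the schema raises."""
    return ApprovalResponse(
        request_id=str(payload["request_id"]),
        decision=Decision(payload["decision"]),  # "yes", "Yes", or "y" raise ValueError
        reviewer=str(payload["reviewer"]),
    )


if __name__ == "__main__":
    request = ApprovalRequest("req-1", "refund", 4000)
    response = parse_response(
        {"request_id": request.request_id, "decision": "approve", "reviewer": "dana"}
    )
    print(response.decision)
```

Because the decision is an enum and not a free-form string, an ambiguous answer fails loudly at the boundary and never reaches the agent.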

Why This Matters Beyond Agents

The broader point here isn't just about AI frameworks. It's about how we're building tools for systems that need human oversight. If the default approach is a blocking input call, we're designing for demos, not production.

Real systems fail. Networks drop. Browsers crash. Users walk away from their screens. If your approval mechanism can't survive any of that, you're not building something reliable - you're building something that works until it doesn't, and then loses the approval request entirely.

The audit also exposes a gap in how these frameworks think about concurrency. Most of them assume one agent, one approval, one human, all in a single synchronous flow. But production systems run multiple agents. They handle multiple users. They need queues, retries, and state persistence. The frameworks that score well are the ones that thought about this from the start.

For developers building on these frameworks, the lesson is clear: test your approval flow under failure conditions before you ship. Kill the process mid-approval. Restart the server. Send two approval requests at once. If your system can't handle that, you're one crash away from a lost request and a very confused user.
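Those checks can be turned into a small test. The sketch below assumes a hypothetical SQLite-backed approval table, not any framework's real storage, and it simulates a restart by closing and reopening the database; a stricter version would kill a real child process.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS approvals ("
        "request_id TEXT PRIMARY KEY, summary TEXT NOT NULL, decision TEXT)"
    )
    conn.commit()
    return conn


def create_request(conn, request_id, summary):
    # INSERT OR IGNORE makes a retry with the same id a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO approvals (request_id, summary) VALUES (?, ?)",
        (request_id, summary),
    )
    conn.commit()


def decide(conn, request_id, decision):
    # Only the first decision is recorded; returns True if this call recorded it.
    cur = conn.execute(
        "UPDATE approvals SET decision = ? WHERE request_id = ? AND decision IS NULL",
        (decision, request_id),
    )
    conn.commit()
    return cur.rowcount == 1


def pending(conn):
    rows = conn.execute(
        "SELECT request_id FROM approvals WHERE decision IS NULL ORDER BY request_id"
    )
    return [row[0] for row in rows]


def run_failure_checks(path):
    conn = open_store(path)
    create_request(conn, "req-1", "refund $40")
    create_request(conn, "req-2", "delete account")  # second request at the same time
    create_request(conn, "req-1", "refund $40")      # retry after a timeout
    conn.close()                                     # simulated crash

    conn = open_store(path)                          # simulated restart
    survived = pending(conn)
    first = decide(conn, "req-1", "approve")
    duplicate = decide(conn, "req-1", "reject")      # double click or replayed callback
    remaining = pending(conn)
    conn.close()
    return survived, first, duplicate, remaining


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "approvals.db")
    print(run_failure_checks(db_path))
```

If both requests survive the restart, the retry creates no duplicate, and the second decision on req-1 is refused, the flow passes the three checks named above.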

Read the full audit on Dev.to for the detailed scoring breakdown and code examples from each framework.