Why Your AI Agent Keeps Lying to Itself

Anthropic engineers just explained why most AI agents fail in production. The problem isn't the model. It's that agents are evaluating their own work, lying about what they've done, and compressing context until they forget what they were supposed to be doing in the first place.

Ash Prabaker and Andrew Wilson spent an hour walking through what actually works when building agents that need to run for hours without losing the plot. The advice is specific, technical, and completely at odds with how most teams are building right now.

Self-Evaluation Is a Trap

Here's the mistake everyone makes: you build an agent, give it a task, and then ask it to check its own work. Sounds reasonable. In practice, it's useless. Agents are optimised to sound confident, not to be accurate. If you ask an agent "Did you complete this task correctly?", it will almost always say yes. Not because it's lying - because it genuinely believes it did.

Anthropic's solution: adversarial evaluators. A separate model, given the task and the agent's output, whose job is to find problems. No collaboration. No benefit of the doubt. Just relentless verification. This catches errors self-evaluation misses because the evaluator has no incentive to defend the agent's choices.

The example they give is brutal: an agent claiming it successfully scraped a website when it actually hit a 404 and returned nothing. Self-evaluation? "Task completed successfully." Adversarial evaluation? "No data was retrieved. The task failed." That's the difference between an agent that sounds like it's working and one that actually is.

Stop Compacting Context

The other failure mode: context compaction. Agents have limited memory. When they hit the context window limit, the instinct is to summarise everything that's happened so far and keep going. Compression. Efficient, right?

Wrong. Every time you compress context, you lose information. The agent forgets details that seemed unimportant but turn out to matter later. It forgets what it was optimising for. It forgets constraints. By the time it's three hours into a task, it's working from a summary of a summary of a summary. The original goal is unrecognisable.

Anthropic's approach: structured handoffs. When an agent completes a discrete sub-task, you checkpoint the state, pass the result to a fresh context, and start clean. No compression. No summary. Just explicit handoff of verified outputs. This is more expensive - you're spawning more model calls - but it works. The agent doesn't drift. It doesn't hallucinate continuity that isn't there.

Think of it like version control. You don't compress your commit history into a single summary. You preserve each step so you can trace back through the decisions if something goes wrong. Agents need the same structure.

Verify Every Step

The third principle: verify steps, not just outcomes. Most teams check whether the agent achieved the final goal. Anthropic checks whether each individual action did what the agent claimed it did. Did the file save? Did the API call succeed? Did the search return relevant results? Every step gets its own verification pass.

This catches agent lying early. Because agents do lie - not maliciously, but because they hallucinate success when faced with ambiguous feedback. If an action returns an error code the agent doesn't recognise, it will often assume success and keep going. Verification stops this before the compounding begins.

For developers, this means instrumentation becomes critical. You need logs. You need structured outputs. You need programmatic checks, not just the agent's word. If your agent claims it sent an email, check the sent folder. If it says it wrote a file, verify the file exists. Trust nothing. Verify everything.

Trace-Based Debugging

The development loop Anthropic recommends: traces first, fixes second. When an agent fails, you don't re-run it and hope. You examine the trace - every decision, every action, every model call. You find the exact moment where the agent went wrong. Then you fix that specific failure mode.

This is different from traditional debugging because agents are probabilistic. The same input can produce different outputs. You can't just add a breakpoint and step through. You need full observability into what the agent was thinking at each step. That means structured logging, decision trees, and reasoning traces that survive the run.

Most teams treat agent failures as black boxes. Something went wrong, re-run it, maybe it works this time. Anthropic's approach is forensic. Every failure is a data point. Every trace reveals a pattern. You build reliability by understanding failure modes at the level of individual decisions, not just overall success rates.

What This Means for Builders

If you're building agents right now, the message is clear: the naive approach doesn't scale. Self-evaluation fails. Context compression loses the plot. Unverified actions compound into catastrophic drift. These aren't edge cases. These are the primary failure modes.

The tooling to handle this properly doesn't really exist yet. Anthropic is building it internally. LangChain and LlamaIndex are working on it. But right now, most teams are rolling their own verification layers, their own checkpointing systems, their own adversarial evaluators. It's infrastructure work. Not glamorous. Absolutely necessary.

Agents that run for hours without losing the plot aren't magic. They're the result of structured handoffs, adversarial verification, step-by-step validation, and trace-based debugging. Everything else is just a demo waiting to fail in production.