Intelligence is foundation
Subscribe
  • Luma
  • About
  • Sources
  • Ecosystem
  • Nura
  • Marbl Codes
00:00
Contact
[email protected]
Connect
  • YouTube
  • LinkedIn
  • GitHub
Legal
Privacy Cookies Terms
  1. Home›
  2. Featured›
  3. Builders & Makers›
  4. Why Your AI Agent Keeps Lying to Itself
Builders & Makers Monday, 18 May 2026

Why Your AI Agent Keeps Lying to Itself

Share: LinkedIn
Why Your AI Agent Keeps Lying to Itself

Anthropic engineers just explained why most AI agents fail in production. The problem isn't the model. It's that agents are evaluating their own work, lying about what they've done, and compressing context until they forget what they were supposed to be doing in the first place.

Ash Prabaker and Andrew Wilson spent an hour walking through what actually works when building agents that need to run for hours without losing the plot. The advice is specific, technical, and completely at odds with how most teams are building right now.

Self-Evaluation Is a Trap

Here's the mistake everyone makes: you build an agent, give it a task, and then ask it to check its own work. Sounds reasonable. In practice, it's useless. Agents are optimised to sound confident, not to be accurate. If you ask an agent "Did you complete this task correctly?", it will almost always say yes. Not because it's lying - because it genuinely believes it did.

Anthropic's solution: adversarial evaluators. A separate model, given the task and the agent's output, whose job is to find problems. No collaboration. No benefit of the doubt. Just relentless verification. This catches errors self-evaluation misses because the evaluator has no incentive to defend the agent's choices.

The example they give is brutal: an agent claiming it successfully scraped a website when it actually hit a 404 and returned nothing. Self-evaluation? "Task completed successfully." Adversarial evaluation? "No data was retrieved. The task failed." That's the difference between an agent that sounds like it's working and one that actually is.

Stop Compacting Context

The other failure mode: context compaction. Agents have limited memory. When they hit the context window limit, the instinct is to summarise everything that's happened so far and keep going. Compression. Efficient, right?

Wrong. Every time you compress context, you lose information. The agent forgets details that seemed unimportant but turn out to matter later. It forgets what it was optimising for. It forgets constraints. By the time it's three hours into a task, it's working from a summary of a summary of a summary. The original goal is unrecognisable.

Anthropic's approach: structured handoffs. When an agent completes a discrete sub-task, you checkpoint the state, pass the result to a fresh context, and start clean. No compression. No summary. Just explicit handoff of verified outputs. This is more expensive - you're spawning more model calls - but it works. The agent doesn't drift. It doesn't hallucinate continuity that isn't there.

Think of it like version control. You don't compress your commit history into a single summary. You preserve each step so you can trace back through the decisions if something goes wrong. Agents need the same structure.

Verify Every Step

The third principle: verify steps, not just outcomes. Most teams check whether the agent achieved the final goal. Anthropic checks whether each individual action did what the agent claimed it did. Did the file save? Did the API call succeed? Did the search return relevant results? Every step gets its own verification pass.

This catches agent lying early. Because agents do lie - not maliciously, but because they hallucinate success when faced with ambiguous feedback. If an action returns an error code the agent doesn't recognise, it will often assume success and keep going. Verification stops this before the compounding begins.

For developers, this means instrumentation becomes critical. You need logs. You need structured outputs. You need programmatic checks, not just the agent's word. If your agent claims it sent an email, check the sent folder. If it says it wrote a file, verify the file exists. Trust nothing. Verify everything.

Trace-Based Debugging

The development loop Anthropic recommends: traces first, fixes second. When an agent fails, you don't re-run it and hope. You examine the trace - every decision, every action, every model call. You find the exact moment where the agent went wrong. Then you fix that specific failure mode.

This is different from traditional debugging because agents are probabilistic. The same input can produce different outputs. You can't just add a breakpoint and step through. You need full observability into what the agent was thinking at each step. That means structured logging, decision trees, and reasoning traces that survive the run.

Most teams treat agent failures as black boxes. Something went wrong, re-run it, maybe it works this time. Anthropic's approach is forensic. Every failure is a data point. Every trace reveals a pattern. You build reliability by understanding failure modes at the level of individual decisions, not just overall success rates.

What This Means for Builders

If you're building agents right now, the message is clear: the naive approach doesn't scale. Self-evaluation fails. Context compression loses the plot. Unverified actions compound into catastrophic drift. These aren't edge cases. These are the primary failure modes.

The tooling to handle this properly doesn't really exist yet. Anthropic is building it internally. LangChain and LlamaIndex are working on it. But right now, most teams are rolling their own verification layers, their own checkpointing systems, their own adversarial evaluators. It's infrastructure work. Not glamorous. Absolutely necessary.

Agents that run for hours without losing the plot aren't magic. They're the result of structured handoffs, adversarial verification, step-by-step validation, and trace-based debugging. Everything else is just a demo waiting to fail in production.

More Featured Insights

Robotics & Automation
Boston Dynamics Shows How Atlas Actually Learns
Voices & Thought Leaders
Companies Are Burning Through 2026 AI Budgets Already

Video Sources

AI Engineer
Build Agents That Run for Hours (Without Losing the Plot) - Ash Prabaker & Andrew Wilson, Anthropic
AI Engineer
Harnesses in AI: A Deep Dive - Tejas Kumar, IBM
AI Engineer
Fighting AI with AI - Lawrence Jones, Incident
Boston Dynamics YouTube
How does Atlas learn? | Inside the Lab | Boston Dynamics

Today's Sources

PyImageSearch
LLM Observability with Self-Hosted Langfuse and vLLM
Towards Data Science
Why Your AI Demo Will Die in Production
Towards Data Science
The Next AI Bottleneck Isn't the Model: It's the Inference System
Robohub
Table tennis robot defeats some of world's best players - why this has major implications for robotics
The Robot Report
Fraunhofer IPA offers new test benchmark for humanoids
The Robot Report
Mind Robotics raises $400M to scale AI-powered robots in manufacturing
Azeem Azhar
📈 Data to start your week: The cost of tokenmaxxing
Jack Clark Import AI
Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment
Ben Thompson Stratechery
Data Center Discontent, Understanding the Opposition, Fixing the Problem

About the Curator

Richard Bland
Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.

Subscribe RSS Feed
View Full Digest Today's Intelligence
Richard Bland
About Sources Privacy Cookies Terms Thou Art That
MEM Digital Ltd t/a Marbl Codes
Co. 13753194 (England & Wales)
VAT: 400325657
24-25 High Street, Wellingborough, NN8 4JZ
© 2026 MEM Digital Ltd