When AI Stops Performing: Building Agents That Stay Reliable

Today's Overview

An AI system that works today can silently degrade tomorrow. Your support agent was resolving 70% of tickets last month. This month, it's 65%. No errors in the logs. No alerts fired. Just a quiet drift downward that nobody noticed until customers started complaining.

The Measurement Problem Nobody Talks About

This is the practical problem facing anyone shipping AI into production right now. Once an agent leaves the lab and starts handling real work, whether that's customer support, code generation, or prospecting, you need a way to measure whether it's actually getting better or worse. n8n's new evaluation framework shows what this looks like in practice: you run representative inputs through your workflow, compare outputs against quality criteria, and track metrics over time. Not before deployment. After. The goal is to catch drift before your users do.
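The loop described above, run representative inputs, compare against criteria, track a metric, alert on drift, can be sketched in a few lines. This is an illustrative sketch, not n8n's actual API: `run_workflow`, `TestCase`, and the alert threshold are all assumptions.

```python
# Minimal post-deployment evaluation loop (illustrative sketch, not n8n's API).
# run_workflow, TestCase, and the 0.9 threshold are assumptions for this demo.
from dataclasses import dataclass

@dataclass
class TestCase:
    input_text: str   # representative production-like input
    expected: str     # known-good output to compare against

def run_workflow(input_text: str) -> str:
    """Stand-in for the deployed agent/workflow under test."""
    return input_text.strip().lower()

def evaluate(cases: list[TestCase], threshold: float = 0.9) -> dict:
    """Run each case through the workflow and score exact matches."""
    passed = sum(run_workflow(c.input_text) == c.expected for c in cases)
    score = passed / len(cases)
    # "alert" fires when the score drifts below the threshold -- the quiet
    # 70% -> 65% slide from the intro would trip this, not a log error.
    return {"score": score, "passed": passed, "total": len(cases),
            "alert": score < threshold}

cases = [TestCase("  Refund  ", "refund"), TestCase("Billing", "billing")]
result = evaluate(cases)
```

The point of the sketch is the schedule, not the scoring: you re-run the same fixed cases on a cadence after deployment, so a drop in `score` is visible the week it happens.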

The framework combines several concrete approaches. Exact matching works for structured outputs: classification, data extraction, anything with a clear right answer. LLM-as-a-Judge handles the subjective cases: whether a customer support response actually sounds helpful, whether generated code does what you intended. Tool-use evaluation, specific to agents, checks whether an agent called the right APIs in the right order. Safety evaluation catches PII leaks and policy violations. The strongest systems combine all of them.
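Tool-use evaluation is the least familiar of these, so here is a minimal sketch of the idea: verify that the required tool calls appear in the agent's trace in the right order, while tolerating extra calls in between. The tool names and the expected sequence are illustrative assumptions, not part of any framework's API.

```python
# Sketch of a tool-use check: did the agent call the required tools in order?
# Tool names and the expected sequence below are illustrative assumptions.
def tools_in_order(called: list[str], expected: list[str]) -> bool:
    """True if `expected` appears as a subsequence of `called`:
    the agent may make extra calls, but required ones must occur in order."""
    it = iter(called)
    # `tool in it` advances the iterator, so each match must come after
    # the previous one -- a standard subsequence check.
    return all(tool in it for tool in expected)

trace = ["lookup_customer", "search_kb", "fetch_order", "send_reply"]
assert tools_in_order(trace, ["lookup_customer", "fetch_order", "send_reply"])
assert not tools_in_order(trace, ["fetch_order", "lookup_customer"])
```

A subsequence check like this is deliberately loose: it fails the run only when a required call is missing or out of order, which is usually the behavior you want to catch first.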

What's changed this week is visibility. Ethyl Pratt's breakdown of which agent work actually has budget shows which categories have moved from demo territory into operating reality. Customer support resolution agents are first because HubSpot moved to outcome-based pricing: you pay per resolved conversation, not per prompt. Prospecting agents are next because the KPI is obvious: booked meetings. Voice receptionists and coding agents follow because they solve measurable problems with measurable ROI. The pattern is clear: agents with budget are agents with owners who track specific metrics and can explain the business case in numbers.

The Physical World Still Moves Differently

Meanwhile, in robotics, Colin Angle (the iRobot co-founder behind the Roomba) is attempting something much harder: a companion robot that actually stays in your home because it's emotionally useful, not just mechanically interesting. The Familiar machine is quadruped, fuzzy, touch-sensitive, and designed to have no screen, because Angle's thesis is that screens don't solve loneliness. The robot learns your routines, nudges you to take walks, greets you after work. It communicates through motion and behavior, not conversation. This is the bet that emotional believability doesn't require perfect intelligence. It's also a very different category of problem from agent orchestration: it requires personality consistency over months of interaction, not accuracy on a single task.

The same through-line connects agent evaluation and companion robots: durability matters more than brilliance. An agent that resolves 85% of support tickets consistently is more valuable than one that's brilliant on Monday and useless on Friday. A robot that's predictably responsive to context is more useful than one that's occasionally impressive and often irrelevant. The infrastructure to measure and maintain that consistency is now becoming the actual product boundary.