Your AI agent worked perfectly in testing. Three weeks into production, it's giving garbage responses and nobody noticed until a customer complained. n8n's production playbook is the manual for catching that drift before it costs you users.
The core problem with AI in production is that it fails silently. Traditional software throws errors. AI just gets worse. A slight model update changes response patterns. A new user query type exposes a weakness in your prompt. The system keeps running - it's just not working anymore.
The playbook covers five evaluation approaches: exact matching, code validation, tool-use scoring, LLM-as-a-Judge, and safety checks. Each catches a different failure mode. The practical bit is that it includes templates for building golden datasets and scheduling recurring evals, so you're not manually checking output quality every week.
Golden Datasets and What They Actually Catch
A golden dataset is a set of test inputs with known-good outputs. For an AI agent that generates SQL queries, that's sample questions paired with the correct SQL. For a customer support bot, it's common queries paired with acceptable response templates.
Building one is tedious. You need 50-100 examples covering edge cases, ambiguous inputs, and typical user behaviour. But once it's built, you can run it nightly and catch drift before customers do. Exact matching works for structured output like code or JSON. LLM-as-a-Judge works for natural language where multiple responses are acceptable.
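For the structured-output case, an exact-match eval can be sketched in a few lines. Everything here is illustrative - the dataset format, the field names, and the SQL examples are assumptions, not n8n's templates - but the shape is the same: known inputs, known-good outputs, a pass rate.

```python
# A golden dataset: inputs paired with known-good outputs.
# The format and field names are illustrative, not n8n's template.
GOLDEN = [
    {"input": "How many users signed up last week?",
     "expected": "SELECT COUNT(*) FROM users "
                 "WHERE created_at >= NOW() - INTERVAL '7 days';"},
    {"input": "List the top 5 customers by revenue.",
     "expected": "SELECT customer_id, SUM(amount) AS revenue FROM orders "
                 "GROUP BY customer_id ORDER BY revenue DESC LIMIT 5;"},
]

def normalize(sql: str) -> str:
    # Collapse whitespace and case so trivial formatting
    # differences don't fail the eval.
    return " ".join(sql.lower().split())

def exact_match_eval(generate, dataset=GOLDEN) -> float:
    """Run every golden example through the agent, return the pass rate."""
    passed = sum(
        normalize(generate(case["input"])) == normalize(case["expected"])
        for case in dataset
    )
    return passed / len(dataset)
```

Here `generate` is whatever calls your agent; in a real setup it would hit your model or workflow endpoint. The normalization step is a design choice - without it, a harmless whitespace change in the model's output reads as a regression.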
The trick is scheduling this to run automatically. If you're manually checking output quality, you'll do it for a week and then stop because you're shipping features. Set up a recurring eval that runs overnight and triggers an alert if scores drop below a threshold. That's the difference between catching a problem in development versus in customer complaints.
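A minimal version of that overnight job is just the eval plus a threshold check. This is a sketch, not n8n's implementation: the webhook URL is hypothetical, and the scheduling itself would come from cron or an n8n Schedule Trigger rather than the script.

```python
import json
import urllib.request

ALERT_THRESHOLD = 0.8  # alert when the pass rate drops below 80%
WEBHOOK_URL = "https://hooks.example.com/eval-alerts"  # hypothetical endpoint

def nightly_eval(run_eval, alert=None):
    """Run the eval suite; fire an alert if the score is below threshold.

    Schedule this externally, e.g. cron `0 3 * * *` or a workflow trigger.
    """
    score = run_eval()
    if score < ALERT_THRESHOLD:
        message = f"Eval score dropped to {score:.0%} (threshold {ALERT_THRESHOLD:.0%})"
        if alert is not None:
            alert(message)  # injected handler, easy to test
        else:
            payload = json.dumps({"text": message}).encode()
            urllib.request.urlopen(urllib.request.Request(
                WEBHOOK_URL, data=payload,
                headers={"Content-Type": "application/json"}))
    return score
```

Injecting the alert handler keeps the job testable and lets you swap Slack for email or a pager without touching the eval logic.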
Tool-Use Scoring and Safety Checks
If your agent uses tools - function calls, API requests, database queries - you need tool-use scoring. This catches cases where the agent chose the wrong tool or called it with malformed parameters. It's especially critical for agents that interact with external systems, because a bad tool call can break downstream services.
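One way to score a tool call - a sketch under assumed schemas, not the playbook's scoring code - is to check both halves separately: did the agent pick the right tool, and were the parameters present and well-typed?

```python
# Minimal tool registry: each tool declares its required parameters and types.
# Illustrative only; a real agent would use JSON Schema or similar.
TOOLS = {
    "get_weather": {"city": str},
    "query_db": {"sql": str, "timeout_s": int},
}

def score_tool_call(expected_tool: str, call: dict) -> float:
    """Score one tool call: 0 for the wrong tool, partial credit for bad params."""
    if call.get("tool") != expected_tool:
        return 0.0  # wrong tool entirely - the worst failure mode
    schema = TOOLS[expected_tool]
    args = call.get("args", {})
    ok = sum(
        1 for name, typ in schema.items()
        if name in args and isinstance(args[name], typ)
    )
    return ok / len(schema)  # fraction of parameters present and well-typed
```

Separating tool choice from parameter shape matters in practice: a wrong tool is a reasoning failure, while malformed parameters often trace back to the schema description in your prompt.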
Safety checks are the last line. These catch toxic output, prompt injection attempts, or responses that leak system prompts. n8n's approach is to run these on every production request, not just in testing. The latency hit is 50-100ms. The alternative is shipping a bot that insults customers or exposes your infrastructure details.
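The cheapest layer of that per-request check is pattern matching. The patterns below are illustrative examples of injection and leakage signatures, not a complete list, and a production deployment would layer a trained classifier on top - but a regex pass like this costs microseconds, well inside the 50-100ms budget.

```python
import re

# Patterns suggesting prompt-injection attempts or system-prompt probing.
# Illustrative only; extend and tune for your own traffic.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"system prompt",
]

def safety_check(text: str) -> list[str]:
    """Return the patterns matched; an empty list means the text passed."""
    return [p for p in INJECTION_PATTERNS
            if re.search(p, text, re.IGNORECASE)]
```

Run it on both inbound requests and outbound responses - injection attempts arrive in the former, leaked infrastructure details leave in the latter.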
Alert Thresholds That Don't Spam You
The hardest part of monitoring AI systems is setting thresholds that catch real problems without triggering false alarms. If you alert on every eval that scores below 90%, you'll ignore alerts within a week. If you only alert at 50%, you've already lost half your quality.
n8n's recommendation: track score trends over time and alert on sudden drops, not absolute thresholds. If your eval scores 85% for two weeks and then drops to 70% overnight, that's a signal. If it's been hovering at 70% since launch, that's just your baseline and you need better prompts.
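Drop detection against a rolling baseline is a few lines. The 14-day window is an assumption - a tuning choice, not a number from the playbook.

```python
from statistics import mean

def drop_from_baseline(history: list[float], window: int = 14) -> float:
    """Relative drop of the latest score versus the mean of the prior window.

    Returns 0.0 when there's no drop (or not enough history to judge).
    """
    if len(history) < 2:
        return 0.0
    baseline = mean(history[-(window + 1):-1])  # scores before the latest
    latest = history[-1]
    return max(0.0, (baseline - latest) / baseline)
```

By construction, a system that has hovered at 70% since launch reports a drop of zero - matching the point above that a flat baseline is a prompt problem, not an alert.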
For critical systems, they suggest tiered alerts: a 10% drop triggers a Slack message, a 20% drop pages someone, and a 30% drop automatically rolls back to the previous model version. That requires infrastructure most teams don't have yet, but it's where this is heading.
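The tiered policy itself is simple to express; the hard part is the rollback infrastructure behind it. A sketch, with the three handlers injected so it plugs into whatever Slack, paging, and deploy tooling you have:

```python
def tiered_alert(drop: float, notify, page, rollback):
    """Escalate on relative score drop, using the tiers described above.

    Handlers are injected callables - wire them to your own alerting stack.
    """
    if drop >= 0.30:
        rollback()  # automatic rollback to the previous model version
    elif drop >= 0.20:
        page()      # wake someone up
    elif drop >= 0.10:
        notify()    # Slack message, no escalation
```

Note the checks run highest-tier first, so a 35% drop triggers only the rollback rather than all three alerts at once.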
What This Means for Builders
If you're shipping AI features, you need eval infrastructure before you need a marketing site. This isn't nice-to-have monitoring. It's the only way to know your system is still working.
Start with a small golden dataset - 20 examples covering the main use cases. Run it manually once a week. When that becomes annoying, automate it. When automation becomes critical, add alerting. The companies that survive the AI production phase are the ones who built this early, not the ones with the cleverest prompts.
The playbook is worth reading in full if you're running AI in production. It's not theory - it's the actual steps and code templates for catching problems before they compound, which is the difference between shipping AI features and shipping AI features that still work in six months.