Agents fail in ways that are hard to measure. They call the wrong API. They hallucinate function arguments. They get stuck in loops. And when you change the prompt to fix one failure, you break something else - and you don't find out until users report weird behaviour three days later.
Laurie Voss from Arize walked through how to build evaluation pipelines that solve this. The AI Engineer workshop covers tracing, failure categorisation, code evals, LLM judges, and experiments that prove - actually prove - whether a prompt change improves performance.
This is practical, hands-on work. Not theory about what evals should be. Instructions for building them from scratch.
The Vibes Problem
Most teams test agents by running them a few times, checking if the output "feels right", and shipping. This works until it doesn't. You make a prompt tweak. The agent seems better. Three weeks later, you discover it stopped handling edge cases correctly. But you've shipped five more changes since then, so you have no idea which one broke it.
The vibes problem is that subjective assessment doesn't scale. You need objective, repeatable measurements. That means automated evals that run on every code change, before anything ships.
Tracing: What Did the Agent Actually Do?
Before you can evaluate an agent, you need to see what it's doing. Tracing captures every step - which tools it called, what arguments it passed, what responses it received, how it decided what to do next.
Without tracing, debugging agents is guesswork. With tracing, you can replay failures, identify where the logic broke, and test fixes against the exact scenario that failed. Voss demonstrates setting up trace collection using OpenTelemetry standards, so your eval infrastructure isn't locked to one vendor.
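Concretely, a minimal version of that setup with the OpenTelemetry Python SDK looks something like this - the span names and attributes are illustrative choices, not a fixed schema:

```python
# Minimal OpenTelemetry setup for tracing agent steps.
# Span names and attributes are illustrative, not a fixed schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export spans to stdout for local debugging; swap in an OTLP exporter
# to send them to whichever tracing backend you use.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent")

def call_tool(name: str, arguments: dict) -> dict:
    """Wrap every tool call in a span so failures can be replayed later."""
    with tracer.start_as_current_span("tool_call") as span:
        span.set_attribute("tool.name", name)
        span.set_attribute("tool.arguments", str(arguments))
        result = {"status": "ok"}  # stand-in for the real tool invocation
        span.set_attribute("tool.status", result["status"])
        return result

call_tool("get_weather", {"city": "Berlin"})
```

Because this is plain OpenTelemetry, the same spans can flow to Arize, Jaeger, or anything else that speaks OTLP.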
Tracing also reveals patterns. If your agent consistently fails when users ask for data in a certain format, that's not a one-off bug - it's a systematic problem with how you're handling output formatting. Traces make those patterns visible.
Failure Categorisation: Not All Errors Are Equal
Agents fail in different ways, and each type of failure needs different measurement:
Tool calling errors - The agent calls the wrong API, or passes malformed arguments. These are straightforward to catch with schema validation (see the sketch after this list).
Logic errors - The agent calls the right tools in the wrong order, or misinterprets context. These require task-specific evals that check the reasoning chain.
Incompleteness - The agent stops before finishing the task. Detection requires defining what "done" looks like for each task type.
Hallucination - The agent invents data or capabilities. Requires fact-checking against ground truth.
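As a concrete example of the first category, here's a sketch of argument validation with Pydantic - the GetWeatherArgs schema is a hypothetical tool, but the pattern is general:

```python
# Catching malformed tool arguments with schema validation.
# GetWeatherArgs is a hypothetical tool schema for illustration.
from pydantic import BaseModel, ConfigDict, ValidationError

class GetWeatherArgs(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject hallucinated argument names
    city: str
    units: str  # e.g. "metric" or "imperial"

def validate_tool_call(arguments: dict) -> str | None:
    """Return an error description, or None if the arguments are valid."""
    try:
        GetWeatherArgs(**arguments)
        return None
    except ValidationError as exc:
        return str(exc)

# A hallucinated argument name fails validation immediately:
error = validate_tool_call({"city": "Berlin", "temperature_scale": "celsius"})
assert error is not None
```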
Voss walks through building separate eval pipelines for each category. Trying to catch everything with one test leads to vague metrics that don't tell you what's actually broken.
Code Evals: When Assertions Are Enough
Not everything needs an LLM judge. For deterministic tasks, traditional unit tests work fine. If your agent is supposed to query a database and return results in JSON, write an assertion that checks the output structure. If it's meant to call three APIs in sequence, verify the call order.
Code evals are fast, cheap, and unambiguous. A passing test means the agent did exactly what it should. A failing test tells you precisely what broke. Use them wherever you can before reaching for LLM-based evaluation.
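A sketch of what those two checks could look like as pytest-style tests - run_agent here is a hypothetical stand-in for your own harness, stubbed so the example runs:

```python
# Plain code evals: no LLM required. run_agent() is a hypothetical
# harness that returns the agent's output plus a log of tool calls.
import json
from types import SimpleNamespace

def run_agent(task: str):
    """Stand-in for the real agent harness; replace with your own.
    This stub just makes the example self-contained and runnable."""
    output = json.dumps([{"order_id": 1, "total": 42.0}])
    tool_calls = [SimpleNamespace(name=n) for n in
                  ["lookup_order", "check_refund_policy", "issue_refund"]]
    return output, tool_calls

def test_output_is_valid_json_with_expected_keys():
    output, _ = run_agent("list all orders from last week")
    parsed = json.loads(output)  # fails loudly if the output isn't JSON
    assert isinstance(parsed, list)
    assert all("order_id" in row for row in parsed)

def test_apis_called_in_expected_order():
    _, tool_calls = run_agent("refund order 1234")
    names = [call.name for call in tool_calls]
    assert names == ["lookup_order", "check_refund_policy", "issue_refund"]
```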
LLM Judges: When You Need Semantic Understanding
Some tasks require judgment. Did the agent's explanation make sense? Is the tone appropriate for the context? Did it capture the user's intent correctly? Code assertions can't measure these.
LLM judges work by giving a separate model (often GPT-4) a rubric and asking it to score the agent's output. The trick is writing good rubrics - vague criteria produce inconsistent scores. Specific criteria with examples produce reliable measurements.
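A minimal judge under those assumptions might look like this - the rubric wording, the 1-5 scale, and the model name are all illustrative, and the OpenAI client is just one possible backend:

```python
# A minimal LLM judge: score one output against a specific rubric.
# Rubric, scale, and model name are illustrative choices.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the agent's answer from 1 to 5 for factual accuracy:
5 - every claim is supported by the provided context
3 - mostly correct, but at least one claim is unsupported
1 - the answer contradicts the context or invents facts
Reply with the number only."""

def judge_accuracy(context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: substitute whichever judge model you use
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
        temperature=0,  # keep scoring as deterministic as the model allows
    )
    return int(response.choices[0].message.content.strip())
```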
Voss demonstrates prompt templates for common evaluation tasks: factual accuracy, completeness, tone, and formatting. The key insight: LLM judges need to be evaluated too. You test them against human-scored examples to verify they're judging correctly before you trust their scores in production.
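One way to run that check: score a human-labelled set with the judge and measure agreement before trusting it. The data format below is an assumption:

```python
# Meta-eval: compare judge scores to human-scored examples before
# trusting the judge in production. The example format is an assumption.
def judge_agreement(examples: list[dict], judge) -> float:
    """examples: [{"context": ..., "answer": ..., "human_score": int}, ...]
    Returns the fraction of examples where the judge lands within one
    point of the human score."""
    matches = 0
    for ex in examples:
        model_score = judge(ex["context"], ex["answer"])
        if abs(model_score - ex["human_score"]) <= 1:
            matches += 1
    return matches / len(examples)

# e.g. require 90% agreement before wiring the judge into CI:
# assert judge_agreement(labelled_set, judge_accuracy) >= 0.9
```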
Experiments That Prove Prompt Changes Work
The point of all this infrastructure is to run experiments. You change the prompt, run your eval suite, and see if performance improves across your test set. Not on three hand-picked examples - on hundreds of real scenarios that cover edge cases.
This is what lets you ship with confidence. You're not guessing whether the new prompt is better. You're measuring it. And when something regresses, you know immediately, before users find out.
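In code, such an experiment can be as small as running the same eval suite over both prompt variants and comparing pass rates. Everything below is a hypothetical harness shape, not a specific framework:

```python
# Sketch of a prompt experiment: same dataset, same evals, two prompts.
# All names and structures here are hypothetical stand-ins.
def run_experiment(dataset, prompts, run_agent, evals) -> dict:
    """Compare prompt variants on the same eval suite.
    dataset: list of {"input": ...} test cases
    prompts: e.g. {"baseline": PROMPT_V1, "candidate": PROMPT_V2}
    run_agent: callable (prompt, input) -> output, your own harness
    evals: list of callables (case, output) -> bool
    Returns a pass rate per prompt variant."""
    rates = {}
    for label, prompt in prompts.items():
        results = [fn(case, run_agent(prompt, case["input"]))
                   for case in dataset for fn in evals]
        rates[label] = sum(results) / len(results)
    return rates

# Ship only if the candidate beats the baseline across the whole suite:
# rates = run_experiment(dataset, prompts, run_agent, evals)
# assert rates["candidate"] >= rates["baseline"]
```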
The workshop includes example eval datasets for common agent tasks: data retrieval, multi-step reasoning, API orchestration. These aren't perfect, but they're a starting point that beats testing by hand.
For teams building agents, this is the difference between "we think it works" and "we know it works, and here's the data". The latter is what you need before putting agents in front of users who depend on them.