Web Development Thursday, 30 April 2026

Why Agent Quality Is 80% Harness, 20% Model

Most developer attention on AI agents goes to the model. Which LLM, what parameters, how many tokens. The teams building production agents - Claude Code, OpenHands, SWE-agent - say that's backwards.

This 16-part field guide from production builders makes the case: agent quality is 80% harness, 20% model. The harness is everything that wraps the model - how you structure the loop, manage tools, compress context, handle memory, enforce security. That's where the real engineering is.

Loop Design Matters More Than Model Choice

The core agent loop is deceptively simple: observe, think, act, repeat. The complexity is in how you implement each step. Do you let the model see raw tool output or do you filter it? Do you retry failed actions or move on? Do you compress the conversation history or keep everything?
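The loop described above can be sketched in a few lines. This is a minimal illustration, not any specific framework's implementation; the `think` and `act` callables are hypothetical stand-ins for the model and the tool layer.

```python
def run_agent(task, think, act, max_steps=10):
    """Observe-think-act loop: run until `think` signals done or steps run out."""
    history = [("task", task)]          # observation log the model sees
    for _ in range(max_steps):
        decision = think(history)       # model proposes the next action
        if decision is None:            # model decides the task is complete
            return history
        observation = act(decision)     # execute the action, capture the result
        history.append((decision, observation))
    return history
```

Every design question in the paragraph above lives inside this loop: filtering happens before `observation` is appended, retries happen around `act`, and compression happens to `history` between iterations.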

Production agents make specific choices here. OpenHands runs a tight loop with aggressive context compression. SWE-agent keeps more history but filters tool output heavily. Claude Code balances both. There's no universal answer, but there is a pattern: the harness defines the behaviour more than the model does.

One example: tool execution. You can let the model call tools directly, or you can validate the calls first. Direct execution is faster but fragile - the model will call tools incorrectly and break things. Validated execution is slower but reliable. Production systems all use validation. The speed hit is worth the reliability gain.
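A validation layer can be as simple as checking a proposed call against a registry before executing it. The tool names and schema here are illustrative, not drawn from any production system:

```python
# Registry of permitted tools: expected argument names plus the callable.
TOOLS = {
    "read_file": {"args": {"path"}, "fn": lambda path: f"<contents of {path}>"},
}

def execute(call):
    """Validate a proposed tool call before running it; reject bad calls."""
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    if set(call.get("args", {})) != tool["args"]:
        return {"error": "argument mismatch"}
    return {"result": tool["fn"](**call["args"])}
```

The rejected call returns a structured error the model can read and correct on the next loop iteration, rather than crashing the harness.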

Context Is a Resource You Manage

Models have token limits. Agents generate conversation history faster than those limits allow. If you don't compress context, the agent hits its token limit mid-task and fails. If you compress too aggressively, it loses critical information.

The guide covers three compression strategies production teams use: sliding window (keep recent messages, drop old ones), summarisation (compress old messages into summaries), and hierarchical (keep critical messages, summarise the rest). Each has trade-offs. Sliding window is simple but loses long-term context. Summarisation keeps context but adds latency. Hierarchical is complex but effective.
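The simplest of the three, sliding window, fits in a few lines. This sketch adds one common refinement: pinning the first message (the task statement) so it survives the window, an assumption on my part rather than something the guide prescribes.

```python
def sliding_window(messages, keep_recent=4):
    """Keep the first (task) message plus the most recent N; drop the middle."""
    if len(messages) <= keep_recent + 1:
        return messages                       # nothing to drop yet
    return [messages[0]] + messages[-keep_recent:]
```

Summarisation and hierarchical compression replace the dropped middle with a model-generated summary instead of discarding it, which is where the added latency comes from.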

The insight: context management isn't a side concern. It's a core architectural decision that determines what tasks your agent can complete. Get it wrong and the agent will hit token limits on anything non-trivial. Get it right and you can run multi-hour tasks without degradation.

Memory Architecture Defines Capability

Short-term memory - the conversation history - is one thing. Long-term memory is another. Production agents need both. Short-term memory handles the immediate task. Long-term memory handles learned patterns, user preferences, and cross-session context.

The teams building these systems use different architectures. Some store memory in vector databases and retrieve relevant chunks each loop. Others use structured databases with explicit schema. Others use a hybrid: structured data for facts, vectors for patterns.

What they all do: separate ephemeral state from persistent state. The agent needs to know what happened five minutes ago (ephemeral) and what it learned five days ago (persistent). Mixing them creates confusion. Separating them creates clarity.
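That separation can be made explicit in the harness's memory interface. A minimal sketch, with a dict standing in for whatever persistent store (vector DB, structured DB, hybrid) the team actually uses:

```python
class Memory:
    """Keep ephemeral session state and persistent learned state apart."""

    def __init__(self, store=None):
        self.session = []             # ephemeral: what happened this session
        self.store = store or {}      # persistent: survives across sessions

    def observe(self, event):
        self.session.append(event)    # ephemeral by default

    def learn(self, key, fact):
        self.store[key] = fact        # explicit promotion to long-term memory

    def end_session(self):
        self.session.clear()          # ephemeral state is discarded
```

The design choice worth noting: nothing becomes persistent by accident. Promotion to long-term memory is an explicit `learn` call, which is what keeps the two layers from bleeding into each other.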

Security and Multi-Tenancy Are Not Optional

A production agent runs user code, accesses user data, and makes API calls on behalf of users. If you don't isolate users from each other, one user's agent can leak data to another. If you don't sandbox code execution, a malicious prompt can compromise the system.

The guide covers isolation strategies: separate execution environments per user, sandboxed tool execution, permission systems that limit what tools agents can access. These aren't features you add later. They're architectural requirements from day one.
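A permission system of the kind listed above can start as a per-tenant allowlist the tool layer consults before executing anything. Tenant IDs and tool names here are illustrative:

```python
# Per-tenant tool allowlists; in production this would live in a
# policy store, not a module-level constant.
PERMISSIONS = {
    "tenant-a": {"read_file", "search"},
    "tenant-b": {"search"},
}

def authorise(tenant, tool):
    """Return True only if this tenant may use this tool; default deny."""
    return tool in PERMISSIONS.get(tenant, set())
```

The important property is default deny: an unknown tenant or an unlisted tool gets nothing, so a compromised prompt can't escalate by naming a tool the policy never mentioned.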

The pattern across production systems: security isn't bolted on, it's baked in. The harness enforces isolation. The tools run in sandboxes. The memory layer respects tenant boundaries. The model itself has no special privileges - it's just another component running in a controlled environment.

Performance Is About the Harness, Not the Model

A fast model in a slow harness is a slow agent. A slow model in a fast harness is often faster. The harness determines latency more than the model does.

Production teams optimise the harness: parallel tool execution where possible, lazy loading of context, caching of repeated operations, streaming output to reduce perceived latency. These optimisations don't change what the model does. They change how quickly the user sees results.
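Two of those optimisations, parallel tool execution and caching of repeated calls, can be sketched with the standard library alone. `slow_tool` is a hypothetical stand-in for a real tool with I/O latency:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=256)               # cache repeated identical calls
def slow_tool(arg):
    return f"result({arg})"           # placeholder for a real I/O-bound tool

def run_parallel(args):
    """Execute independent tool calls concurrently instead of sequentially."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(slow_tool, args))
```

Neither change touches the model. The same inference, wrapped in a harness that overlaps tool I/O and skips repeated work, simply reaches the user sooner.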

The bottleneck in most agent systems isn't model inference. It's tool execution, context retrieval, or waiting for external APIs. The harness controls all of those. Optimise the harness and the agent feels faster even if the model is the same.

What This Means for Builders

If you're building an agent, spend your time on the harness. The model is important, but it's also the part that improves without your effort. GPT-5 will be better than GPT-4. The harness won't improve unless you improve it.

The guide provides the playbook: how to structure the loop, manage context, design memory, enforce security, optimise performance. It's not theory. It's synthesis from teams running agents in production at scale. The lesson is clear: agent quality is engineering, not model selection. Build the harness right and the rest follows.

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.
