Two Ways to Make Agents Durable: Replay vs Snapshot

Eric Allam from Trigger.dev gave a talk this week breaking down how to build agents that don't lose their place when something goes wrong. His key insight: durability isn't one problem. It's two separate problems that need different solutions.

Context persistence and execution durability sound similar. They're not. Understanding the difference changes how you architect agent systems.

The Two Durability Problems

Context persistence is about memory. An LLM-based agent builds up conversational history as it works - the user's request, the tool calls it has made, the responses it has received, its intermediate reasoning steps. If the process crashes, that history must survive. The agent needs to pick up exactly where it left off, with full memory of what happened before.

The solution Allam describes is an append-only log. Every interaction gets written to durable storage immediately. User message: logged. Tool call: logged. Tool response: logged. Model generation: logged. If the process dies, you replay the log to reconstruct the conversation state.
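
Here's a minimal Python sketch of that kind of log, assuming a local JSONL file stands in for durable storage. The ConversationLog class, the record shape (seq, kind, payload) and the file layout are illustrative assumptions, not something shown in the talk, and handling of a half-written final line after a crash is left out.

```python
import json
import os
from pathlib import Path


class ConversationLog:
    """Append-only JSONL log: one record per interaction (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        # Continue numbering from whatever is already on disk after a restart.
        self._next_seq = len(self.replay())

    def append(self, kind, payload):
        """Write one record and force it to disk before returning its seq."""
        record = {"seq": self._next_seq, "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._next_seq += 1
        return record["seq"]

    def replay(self, after_seq=-1):
        """Return every record with seq > after_seq, in the order written."""
        if not self.path.exists():
            return []
        records = []
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    if record["seq"] > after_seq:
                        records.append(record)
        return records
```

The fsync on every append is what makes "logged" mean "on disk". It's also where the latency goes, so a production system would more likely write to a database or a log service than to a local file.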

This is conceptually simple, but there's a catch. Replaying a long conversation is slow. If your agent has made 200 tool calls over three hours, reconstructing state means re-processing all 200 interactions. That takes time, and any step that has to go back through the model costs tokens.

Execution Durability: The Harder Problem

Execution durability is about runtime state. Your agent isn't just having a conversation - it's running code. Variables are in memory, network connections are open, file handles are active, threads are executing. If the process crashes, all of that disappears.

Traditional approaches to durability use checkpoints. Periodically, you serialise the entire runtime state to disk. If the process dies, you load the most recent checkpoint and resume. This works, but it's slow. Serialising a complex runtime state can take seconds. Deserialising it can take longer. And checkpoints grow large quickly.
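
For contrast, here's roughly what an application-level checkpoint looks like in Python. This is a sketch under assumptions: save_checkpoint and load_checkpoint are made-up names, and pickle stands in for whatever serialiser a real system would use.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Serialise state to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Serialisation or the swap failed: drop the partial temp file.
        os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Load the most recent checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The limitation shows up quickly: this only saves what you explicitly hand it, and only what the serialiser can handle. Locks, open sockets and running threads can't be pickled, so anything like that has to be rebuilt by hand after a restore.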

Allam's team at Trigger.dev took a different approach: Firecracker microVMs with memory snapshots.

The Firecracker Solution

Firecracker is the virtualisation technology AWS uses for Lambda. It's designed to start virtual machines in milliseconds with minimal overhead. Each VM is isolated and lightweight.

Instead of checkpointing application state, Trigger.dev snapshots the entire VM memory. The agent runs in a Firecracker microVM. When you need durability, you pause the VM and save its memory to disk. The entire runtime state - variables, stack frames, and the guest's view of its network sockets and file handles - gets captured in one atomic snapshot.
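
To make the pause-and-save step concrete, here's a sketch of the HTTP requests involved, written against Firecracker's API socket as I understand it from the project's snapshot documentation. The talk doesn't say how Trigger.dev drives Firecracker, so treat this as an illustration rather than their implementation. The helper names and file paths are made up, and the endpoint and field names should be checked against the Firecracker version you run.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, which is how Firecracker exposes its API."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def snapshot_requests(snapshot_path, mem_path):
    """Requests to pause a running microVM, snapshot it, then resume it."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,  # VM and device state
            "mem_file_path": mem_path,       # guest memory contents
        }),
        ("PATCH", "/vm", {"state": "Resumed"}),
    ]


def restore_requests(snapshot_path, mem_path):
    """Request to load a snapshot into a fresh, not-yet-booted Firecracker process."""
    return [
        ("PUT", "/snapshot/load", {
            "snapshot_path": snapshot_path,
            "mem_backend": {"backend_type": "File", "backend_path": mem_path},
            "resume_vm": True,
        }),
    ]


def send(socket_path, method, path, body):
    """Send one request to the API socket and return the HTTP status code."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(method, path, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()
```

A caller would loop over snapshot_requests("/snapshots/agent.vmstate", "/snapshots/agent.mem") and pass each tuple to send along with the VM's socket path. Note that the application inside the VM takes no part in any of this.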

The numbers Allam shared are compelling. A snapshot compresses to 14MB. Saving takes under a second. Restoring takes 100 milliseconds. Compare that to traditional checkpoint-restore systems where serialisation can take 5-10 seconds and deserialisation even longer.

Why This Works

The key insight is that memory snapshots capture everything. You don't need to explicitly serialise each piece of state. You don't need to worry about what's in memory versus what's on disk. You pause the entire VM, save the memory pages, and you're done.

This has a few advantages. First, it's language-agnostic. The snapshot happens at the VM level, not the application level. Your agent can be written in Python, Node, Rust, whatever - the snapshot mechanism doesn't care.

Second, it's atomic. Either the whole snapshot succeeds or it fails. You don't have partial checkpoints where some state was saved and some wasn't.

Third, it's fast enough to do frequently. With sub-second snapshot times, you can checkpoint after every tool call or every significant state change. That means when you restore, you're never more than a few seconds behind where you were.
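
To put rough numbers on "frequently", here's a back-of-the-envelope calculation in Python. It borrows the 200-tool-call, three-hour run from the replay example and rounds "under a second" up to a full second per snapshot. The function name is made up, and the 14MB figure is the one from the talk; real snapshot sizes will depend on the workload.

```python
TOOL_CALLS = 200
RUN_SECONDS = 3 * 60 * 60      # three hours
SNAPSHOT_SAVE_SECONDS = 1.0    # upper bound for "under a second"
SNAPSHOT_SIZE_MB = 14


def snapshot_budget(tool_calls, run_seconds, save_seconds, size_mb):
    """Cost of taking one snapshot after every tool call."""
    total_save = tool_calls * save_seconds
    return {
        "total_save_seconds": total_save,
        "share_of_run": total_save / run_seconds,
        "storage_mb_if_all_kept": tool_calls * size_mb,
    }


budget = snapshot_budget(TOOL_CALLS, RUN_SECONDS,
                         SNAPSHOT_SAVE_SECONDS, SNAPSHOT_SIZE_MB)
# total_save_seconds: 200.0, share_of_run: about 0.0185,
# storage_mb_if_all_kept: 2800
```

So snapshotting after every tool call costs at most about 200 seconds across the run, under 2% of the three hours. Keeping every snapshot would take about 2.8GB, but recovery only needs the latest one.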

The Trade-offs

This approach isn't free. Running every agent in its own microVM has overhead. Memory usage is higher - each VM needs its own kernel, even if it's minimal. Orchestration gets more complex - you're managing VMs, not just processes.

There's also the cold start problem. Restoring from a snapshot is fast, but it's not zero. If your agent needs to respond to external events in real time, 100ms might matter. For long-running background tasks, it's negligible. For interactive agents, it might be noticeable.

And snapshots only work if your agent's external dependencies are resumable. If your agent was in the middle of a database transaction when it crashed, the snapshot won't help - the transaction is gone. You need idempotent operations and retry logic regardless.
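
Here's a small Python sketch of what "idempotent operations and retry logic" can look like on the agent side. IdempotentExecutor and its parameters are hypothetical, and the in-memory dict stands in for what would need to be durable storage in practice.

```python
import time


class IdempotentExecutor:
    """Run each keyed operation at most once to completion; retry on dropped connections."""

    def __init__(self):
        # Assumption: in-memory for the sketch; a real system needs durable storage.
        self._results = {}

    def run(self, key, operation, retries=3, delay_seconds=0.0):
        if retries < 1:
            raise ValueError("retries must be at least 1")
        # Already completed under this key: return the recorded result.
        if key in self._results:
            return self._results[key]
        for attempt in range(1, retries + 1):
            try:
                result = operation()
            except ConnectionError:
                if attempt == retries:
                    raise  # out of attempts, surface the failure
                time.sleep(delay_seconds)
            else:
                self._results[key] = result
                return result
```

This only covers the agent's half of the problem. A crash between the operation succeeding and the result being recorded would still run it twice, so the service on the other end should also deduplicate on the same key.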

Combining the Approaches

Allam's point is that you need both mechanisms. The append-only log handles context persistence - it's your conversation history, your audit trail, your replay mechanism. The VM snapshot handles execution durability - it's your runtime state, your in-progress computations, your quick-resume capability.

Together, they give you robust durability. If a process crashes mid-execution, you restore the VM snapshot to get back to the last known state. If the snapshot is corrupted or missing, you fall back to replaying the append-only log. You have two independent recovery paths.

What This Means for Agent Builders

If you're building agent systems, durability isn't optional. Agents run for extended periods. Networks fail. Processes crash. Servers restart. If your agent loses state when that happens, it's not production-ready.

The two-mechanism approach - logs for context, snapshots for execution - gives you a clear architecture. Log every interaction immediately. Snapshot the runtime state frequently. When something fails, restore the snapshot and replay any log entries that came after it.
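
The recovery logic can be sketched in a few lines of Python. A plain dict stands in for the restored VM here, which is a big simplification, and the record shape (seq, kind, payload) and function names are assumptions made for the example.

```python
import copy


def apply_record(state, record):
    """Fold one log record into the in-memory conversation state."""
    state["history"].append((record["kind"], record["payload"]))
    state["last_seq"] = record["seq"]
    return state


def recover(snapshot, log_records):
    """Restore from the snapshot if there is one, then replay newer log entries.

    With no usable snapshot (None), start empty and replay the whole log.
    """
    if snapshot is not None:
        state = copy.deepcopy(snapshot)
    else:
        state = {"history": [], "last_seq": -1}
    for record in log_records:
        # Skip anything the snapshot already reflects.
        if record["seq"] > state["last_seq"]:
            apply_record(state, record)
    return state
```

Both paths should end in the same state. The snapshot path just gets there with less replaying, and that's why the snapshot has to record the sequence number of the last log entry it includes.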

Firecracker isn't the only way to snapshot a whole runtime, but it's one of the fastest. CRIU (Checkpoint/Restore In Userspace) is an alternative that works at the process level, for Linux processes and containers rather than whole VMs. WebAssembly runtimes with pause/resume capabilities are emerging. The principle is the same: capture the entire runtime, not just application state.

For teams building agent platforms, this is table stakes. Users expect agents to survive failures. The question isn't whether to build durability - it's how to build it efficiently. Allam's approach gives you a concrete starting point: separate context from execution, and handle each with the right tool.

Watch the full talk at AI Engineer.