Notion shipped AI agents to millions of users. But before they got there, they threw away their architecture five times and started over.
Simon Last and Sarah Sachs - Notion's technical leads on AI - walked through the journey on Latent Space. What emerges is a candid picture of what it actually takes to move from AI experimentation to production at scale. The short version: nothing worked the first time, or the second, or the third.
The Five Rebuilds
Notion's AI story started in 2022 with experiments that, by their own admission, failed. The team tried to build agents using the tools available at the time. The models weren't good enough. The architectures were fragile. User feedback was brutal.
So they rebuilt. Then rebuilt again. Each iteration scrapped core assumptions. By the time they shipped what became Token Town - their internal name for the agent ecosystem - they'd gone through five complete architectural overhauls. That's not iteration. That's fundamental rethinking, five times over.
What changed between failed experiments and production-ready agents wasn't just model quality, though that helped. It was understanding what agents actually needed to do versus what the hype promised they could do. Early attempts tried to build general-purpose reasoning. What shipped was specific, constrained, and deeply integrated with Notion's existing tools.
The Tools Question - MCP vs CLIs
One of the more revealing threads in the conversation: how agents access tools. Notion built over 100 internal tools that their agents can call. Not theoretical capabilities - actual, production tools doing real work.
The debate around Model Context Protocol versus traditional command-line interfaces isn't academic when you're maintaining that many tools. MCP promises standardised agent-to-tool communication. CLIs are proven, understood, debuggable. Notion's engineering team had strong opinions about trade-offs between elegance and reliability.
Their approach leans practical. Tools need to be discoverable, documented, and testable by humans before agents touch them. If an engineer can't debug what a tool does when it breaks, agents won't magically figure it out either. The architecture reflects that - tools are built for human understanding first, then exposed to agents.
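Notion didn't share their internal tool definitions, but the principle is easy to sketch. Here's a minimal, hypothetical version in TypeScript - the ToolDef shape, the search_pages tool, and the CLI wiring are all invented for illustration. The point is the contract: one definition that an engineer can invoke and debug from a terminal before an agent ever sees it.

```typescript
// Hypothetical tool contract: one definition that an engineer can run
// from a terminal before an agent ever calls it. ToolDef, search_pages,
// and the CLI entry point are all invented for illustration.

interface ToolDef<In, Out> {
  name: string;
  description: string; // read by engineers and by the model
  run: (input: In) => Promise<Out>;
}

const searchPages: ToolDef<{ query: string; limit?: number }, { ids: string[] }> = {
  name: "search_pages",
  description: "Full-text search over workspace pages. Returns page ids.",
  run: async ({ query, limit = 10 }) => {
    console.error(`search_pages query=${JSON.stringify(query)} limit=${limit}`);
    return { ids: [] }; // a real implementation would hit the search index
  },
};

// CLI entry point, so a human can debug the tool directly:
//   node tool.js search_pages '{"query":"roadmap"}'
const [, , toolName, rawArgs] = process.argv;
if (toolName === searchPages.name && rawArgs) {
  searchPages.run(JSON.parse(rawArgs)).then((out) =>
    console.log(JSON.stringify(out, null, 2)),
  );
}
```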
Evals, Pricing, and Production Reality
Here's where the conversation got into details that matter for anyone building production AI. Evaluations - testing whether agents actually work - turned out to be harder than building the agents themselves. Notion runs continuous eval pipelines, but the metrics that correlate with user satisfaction aren't always obvious.
An agent might technically complete a task successfully while frustrating the user through clumsy interactions. Or it might fail the task but surface information that helps the user solve the problem themselves. Traditional pass/fail metrics miss this nuance entirely.
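One way to make that nuance measurable - purely illustrative, not Notion's actual metrics - is to score transcripts on several dimensions and let a clumsy success rank below a useful failure. The dimensions and weights below are assumptions:

```typescript
// Hypothetical scoring of an agent transcript on more than pass/fail.
// Dimensions and weights are illustrative assumptions, not Notion's metrics.

interface EvalResult {
  taskCompleted: boolean;   // did the agent reach the goal state?
  steps: number;            // tool calls and turns taken
  clarifications: number;   // times it bounced the problem back to the user
  surfacedContext: boolean; // did it show the user useful information en route?
}

function score(r: EvalResult): number {
  let s = r.taskCompleted ? 1.0 : 0.0;
  s -= Math.min(0.5, 0.02 * Math.max(0, r.steps - 5)); // penalise drawn-out, clumsy runs
  s -= 0.1 * r.clarifications;                          // penalise needless back-and-forth
  if (!r.taskCompleted && r.surfacedContext) s += 0.4;  // a useful failure beats a silent one
  return Math.max(0, Math.min(1, s));
}

// A technically successful but frustrating run can score below a helpful failure:
console.log(score({ taskCompleted: true,  steps: 40, clarifications: 3, surfacedContext: false })); // 0.2
console.log(score({ taskCompleted: false, steps: 6,  clarifications: 0, surfacedContext: true  })); // 0.38
```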
Pricing came up as an unsolved problem. When agents call multiple tools, make reasoning steps, and iterate on solutions, costs spiral unpredictably. Notion's approach involves aggressive caching and constraining agent behaviour, but fundamentally, nobody's figured out how to price agent work in a way that feels fair to users and sustainable for providers.
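Those two levers - caching and constraint - are at least straightforward to sketch. A hypothetical version: memoise identical tool calls within a run, and hard-cap per-run spend so iteration stops instead of spiralling. The prices, cap, and names are made up.

```typescript
// Hypothetical cost controls: memoise identical tool calls within a run,
// and hard-cap spend so the agent stops iterating instead of spiralling.
// Prices, cap, and names are made-up assumptions.

const CACHE = new Map<string, unknown>();

async function cachedCall<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (CACHE.has(key)) return CACHE.get(key) as T; // repeat call costs nothing
  const result = await fn();
  CACHE.set(key, result);
  return result;
}

class RunBudget {
  private spentUsd = 0;
  constructor(private capUsd: number) {}

  charge(tokens: number, usdPerMillionTokens: number): void {
    this.spentUsd += (tokens / 1_000_000) * usdPerMillionTokens;
    if (this.spentUsd > this.capUsd) {
      throw new Error(`run budget exceeded: $${this.spentUsd.toFixed(4)}`);
    }
  }
}

void (async () => {
  const budget = new RunBudget(0.5);  // cap each run at 50 cents (illustrative)
  budget.charge(120_000, 3.0);        // 120k tokens at $3/M tokens ≈ $0.36

  const a = await cachedCall("search:roadmap", async () => ({ ids: ["p1"] }));
  const b = await cachedCall("search:roadmap", async () => ({ ids: ["p2"] }));
  console.log(a === b); // true - the second call never executed
})();
```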
The Software Factory Vision
Last and Sachs talked about software factories - the idea that AI agents will eventually assemble applications from components the way factories assemble products from parts. It's a compelling vision. Notion's already partway there with agents that can scaffold documents, populate databases, and wire up automations.
But the reality check matters. The agents that work in production are narrow, constrained, and deeply integrated with existing systems. They're not general-purpose builders. They're specialised tools that know Notion's architecture intimately and operate within careful guardrails.
The software factory vision requires agents that can understand requirements, select appropriate components, integrate them reliably, and handle edge cases. We're not there yet. What Notion shipped is more like a semi-automated assembly line where humans still make key decisions and agents handle repetitive execution.
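That division of labour has a simple shape in code. A hypothetical sketch - the types and names are invented - where the agent proposes a plan, a human approves it, and only then does the agent grind through the steps:

```typescript
// Hypothetical human-in-the-loop gate: the agent proposes, a person
// approves, the agent executes. Types and names are invented.

interface PlanStep {
  description: string;
  execute: () => Promise<void>;
}

async function runWithApproval(
  plan: PlanStep[],
  approve: (summary: string) => Promise<boolean>, // the human decision point
): Promise<void> {
  const summary = plan.map((s, i) => `${i + 1}. ${s.description}`).join("\n");
  if (!(await approve(summary))) return; // rejected: the agent does nothing
  for (const step of plan) {
    await step.execute(); // approved: the agent handles the rote execution
  }
}
```

The human makes the one decision that matters; the agent does the typing. That's the assembly line as it exists today, not the factory.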
What Developers Should Take Away
If you're building agents, Notion's journey offers useful patterns. First, expect to rebuild. Multiple times. The architecture that works in a demo rarely survives contact with real users. Second, tools matter more than models. An agent with access to well-designed, reliable tools outperforms a smarter agent with poor tooling. Third, evals are product design, not just engineering. What you measure shapes what you build.
The gap between AI demos and AI products is wider than most people expect. Notion spent years crossing it, with substantial engineering resources and tight model partnerships. Smaller teams should think carefully about where they can differentiate. Competing on agent intelligence is expensive and probably futile. Competing on tool quality, integration depth, and workflow understanding - that's winnable.
The software factory future might arrive eventually. But the path there runs through companies like Notion - doing the unglamorous work of making agents reliable, debuggable, and genuinely useful for specific tasks. Five rebuilds later, they've got something that works. That's the standard.