Cloudflare finished 18 months of resilience work this week with the rollout of Snapstone - an internal tool that treats configuration changes like software deployments, complete with progressive rollout, automatic rollback, and the kind of paranoid testing discipline that only comes from breaking the internet twice.
The project, internally called Code Orange, started after a 2022 outage where a single malformed regex in a firewall rule took down 19 of Cloudflare's data centres. The fix was never about the regex. It was about building a system where one mistake can't cascade across the network.
What Snapstone Actually Does
Configuration changes are the silent killer of uptime. Software gets tested, reviewed, deployed in stages. Config files get edited, committed, and pushed live across the entire fleet because they're "just config." Then someone adds a typo, a bad value, or a rule that interacts badly with another rule, and suddenly you're debugging a global outage.
Snapstone applies software deployment discipline to config changes. Every configuration update gets rolled out progressively - 1% of the fleet, then 10%, then 50%, then 100%. If errors spike at any stage, the system automatically rolls back. If a change breaks health checks, it never reaches production.
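Stripped to its essentials, that loop is easy to picture. Here's a minimal sketch in Python, where the helpers (push_config, error_rate, rollback_config, passes_health_checks) are placeholders for whatever the real fleet-management layer provides - an illustration of the pattern, not Cloudflare's code:

```python
import time

# --- Placeholders: swap these for whatever your fleet-management layer provides. ---

def passes_health_checks(version: str) -> bool:
    """Pre-flight validation against synthetic traffic (placeholder)."""
    return True

def push_config(version: str, fraction_of_fleet: float) -> None:
    """Apply the config version to the given fraction of servers (placeholder)."""
    print(f"pushing {version} to {fraction_of_fleet:.0%} of the fleet")

def error_rate(version: str) -> float:
    """Observed error rate on servers running this version (placeholder)."""
    return 0.0

def rollback_config(version: str) -> None:
    """Revert every server to the previous known-good version (placeholder)."""
    print(f"rolling back {version}")

# --- The rollout loop itself ---

ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]  # 1% -> 10% -> 50% -> 100%
ERROR_BUDGET = 0.001                        # abort if the error rate exceeds 0.1%
SOAK_SECONDS = 300                          # how long to watch metrics at each stage

def progressive_rollout(version: str) -> bool:
    """Push a config version stage by stage, rolling back on any regression."""
    if not passes_health_checks(version):
        return False  # a change that breaks health checks never reaches production

    for fraction in ROLLOUT_STAGES:
        push_config(version, fraction)
        time.sleep(SOAK_SECONDS)  # let metrics accumulate before judging this stage

        if error_rate(version) > ERROR_BUDGET:
            rollback_config(version)
            return False

    return True
```

The point of the loop isn't sophistication - it's that no single push ever touches the whole fleet at once.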
The technical innovation isn't the progressive rollout itself - that's standard practice for code. It's extending it to configuration, which requires rethinking how config files are structured, versioned, and applied across thousands of servers in dozens of data centres.
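In practice that usually means treating each change as an immutable, versioned record that carries its own rollout plan, rather than a file edited in place. A rough sketch of what such a record might hold - the fields are illustrative, not Cloudflare's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ConfigChange:
    """An immutable, versioned config change - deployed like a release, not a file edit."""
    version: str                  # identifier for this revision
    previous_version: str         # what to restore on rollback
    payload: dict                 # the actual configuration content
    author: str
    rollout_stages: tuple = (0.01, 0.10, 0.50, 1.00)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: a firewall-rule change that can be rolled back by re-applying @41.
change = ConfigChange(
    version="firewall-rules@42",
    previous_version="firewall-rules@41",
    payload={"action": "block", "pattern": r"^/admin/.*$"},
    author="jdoe",
)
```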
The AI Codex Nobody's Talking About
Buried in Cloudflare's announcement is something more interesting than Snapstone: the company rebuilt its internal incident response documentation as an AI-enforced Codex.
Best-practice documents have a fatal flaw - nobody reads them during an outage. When systems are down and customers are angry, engineers skip documentation and start making changes based on instinct. That's when you get a second outage caused by the fix for the first outage.
Cloudflare's solution was to embed best practices into the tools themselves. The AI Codex doesn't just document emergency procedures - it enforces them. Try to bypass progressive rollout during an incident and the system asks you to confirm you understand the risk. Try to apply a config change without testing and it blocks you until you've validated against synthetic traffic.
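As a mental model, those guardrails behave like a pre-apply gate. The sketch below uses invented names (validated_against_synthetic_traffic, operator_confirmed_risk) to show the shape of the enforcement, not the Codex's actual interface:

```python
class ChangeBlocked(Exception):
    """Raised when a change fails a mandatory guardrail."""

def validated_against_synthetic_traffic(change) -> bool:
    """Placeholder: the real check would replay synthetic traffic in a sandbox."""
    return True

def operator_confirmed_risk(prompt: str) -> bool:
    """Placeholder: interactive confirmation, logged for the post-incident review."""
    return input(f"{prompt} Type 'I understand the risk' to continue: ") == "I understand the risk"

def deploy(change, progressive: bool = True) -> None:
    """Placeholder: hand off to the deployment system."""
    print(f"deploying {change} (progressive={progressive})")

def apply_change(change, *, skip_progressive_rollout: bool = False) -> None:
    # Guardrail 1: untested changes are blocked outright, incident or not.
    if not validated_against_synthetic_traffic(change):
        raise ChangeBlocked("validate against synthetic traffic before applying")

    # Guardrail 2: bypassing progressive rollout requires an explicit, logged confirmation.
    if skip_progressive_rollout and not operator_confirmed_risk(
        "You are bypassing progressive rollout during an incident."
    ):
        raise ChangeBlocked("bypass not confirmed")

    deploy(change, progressive=not skip_progressive_rollout)
```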
It's not a chatbot. It's a constraint system dressed up as helpful guidance. And it works because it doesn't rely on engineers remembering to check documentation when everything's on fire.
Emergency Access, Rebuilt
The other piece of Code Orange was rebuilding emergency access procedures. When part of Cloudflare's network is unreachable, engineers need a way to get in and fix things without relying on the very infrastructure that's broken.
The old system was a collection of SSH keys, VPN configs, and tribal knowledge about which backup routes worked when the primary paths were down. The new system is codified, tested monthly, and includes automatic failover to secondary access methods if the primary route fails health checks.
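Codified, here, means the fallback order itself lives in code and gets exercised on a schedule. A toy version of that failover logic, with invented host names standing in for real out-of-band paths:

```python
import subprocess

# Ordered access paths: primary first, then progressively more out-of-band fallbacks.
# Host names and probe commands are illustrative, not real infrastructure.
ACCESS_PATHS = [
    {"name": "primary-vpn", "probe": ["ping", "-c", "1", "vpn.internal.example"]},
    {"name": "backup-vpn",  "probe": ["ping", "-c", "1", "vpn-backup.example"]},
    {"name": "oob-console", "probe": ["ping", "-c", "1", "console.oob.example"]},
]

def healthy(probe_cmd: list[str]) -> bool:
    """Health check for one access path: does the probe command succeed?"""
    try:
        return subprocess.run(probe_cmd, capture_output=True, timeout=10).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False

def pick_access_path() -> str:
    """Return the first access path whose health check passes."""
    for path in ACCESS_PATHS:
        if healthy(path["probe"]):
            return path["name"]
    raise RuntimeError("no access path is reachable - escalate to on-site hands")

if __name__ == "__main__":
    print(f"use access path: {pick_access_path()}")
```

Running this monthly, rather than discovering mid-incident that the backup VPN config expired, is the whole point.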
This is boring infrastructure work. It's also the difference between a 10-minute outage and a 3-hour outage when something breaks. The companies that invest in this kind of resilience engineering are the ones that stay up when everyone else is posting incident reports.
Why This Matters for Smaller Teams
Cloudflare operates at a scale most companies will never reach. But the principles behind Code Orange apply at every level: progressive rollout, automatic rollback, config-as-code, and embedding best practices into tooling rather than documentation.
A five-person startup doesn't need Snapstone. But it does need a way to catch mistakes before they hit production. That could be as simple as a staging environment that mirrors production config, a deploy script that applies changes to one server before the rest, or a checklist that forces you to validate changes before pushing them live.
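For the deploy-script version, a canary-first push can be genuinely tiny. The sketch below assumes hypothetical host names and an apply-config command; swap in whatever your stack actually uses:

```python
import subprocess
import sys

# Hypothetical hosts - replace with your own inventory.
CANARY = "web-1.example.com"
REST_OF_FLEET = ["web-2.example.com", "web-3.example.com", "web-4.example.com"]

def push(host: str) -> None:
    """Apply the pending config change to one host (placeholder for your deploy command)."""
    subprocess.run(["ssh", host, "sudo", "apply-config"], check=True)

def health_ok(host: str) -> bool:
    """Placeholder health check - hit a status endpoint, grep error logs, whatever you trust."""
    return subprocess.run(
        ["curl", "-fsS", f"https://{host}/healthz"], capture_output=True
    ).returncode == 0

if __name__ == "__main__":
    push(CANARY)
    if not health_ok(CANARY):
        sys.exit(f"{CANARY} failed health checks - fix before touching the rest of the fleet")
    for host in REST_OF_FLEET:
        push(host)
```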
The specific tools matter less than the discipline. Cloudflare's advantage isn't that they built Snapstone - it's that they're willing to slow down deployments in exchange for reliability. Most teams aren't. They optimise for speed until an outage forces them to optimise for resilience.
The Real Cost of Outages
Cloudflare's 2022 outages were measured in minutes, but the reputational cost lasted months. When you're infrastructure for the internet, reliability isn't a feature - it's the entire value proposition. A CDN that goes down isn't a minor inconvenience; it's a reason to evaluate competitors.
That's why Code Orange took 18 months and involved rethinking core systems rather than just patching the immediate problem. Cloudflare didn't just want to prevent the same outage from happening again. They wanted to prevent the category of outages caused by config mistakes, rushed deployments, and emergency procedures that hadn't been tested under real conditions.
The result is a network that's genuinely harder to break by accident. And in infrastructure, that's as close as you get to a competitive moat.