Training massive AI models requires thousands of GPUs working in perfect coordination. If one GPU's data arrives late, the entire cluster stalls. Network congestion - the thing that makes your video buffer - becomes a training bottleneck costing millions.
OpenAI just published details on Multipath Reliable Connection (MRC), a new network protocol they developed with AMD, Broadcom, Intel, Microsoft, and Nvidia. The goal: keep GPU clusters synchronised during training without the stalls that plague traditional networks. They're making it an open standard.
The Problem: Network Congestion Kills GPU Utilisation
When you train a large model, every GPU needs to share its gradients with every other GPU, constantly. Traditional networks use TCP - the protocol that underpins most internet traffic. TCP is reliable, but it's designed for a world where occasional delays are acceptable. Drop a packet, and TCP treats it as a congestion signal: it retransmits, backs off, and slows the whole flow down to recover.
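To make that traffic pattern concrete, here's a minimal simulation of a ring all-reduce, the collective operation most training frameworks use to sum gradients across workers. This is a sketch of the general pattern, not anything specific to MRC: the point is that every step is a synchronous exchange, so the whole ring moves at the pace of its slowest link.

```python
def ring_allreduce(grads: list[list[float]]) -> list[list[float]]:
    """Sum gradients across workers with a ring all-reduce (simulated serially).

    Each of the 2*(n-1) steps is a synchronous exchange: every worker must
    receive before the next step begins, so one delayed transfer stalls
    all n workers. This illustrates the pattern, not any real transport.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "gradient length must divide evenly into n chunks"
    chunk = size // n
    data = [list(g) for g in grads]

    def seg(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the full
    # sum of chunk (i + 1) % n. Sends are snapshotted first to simulate
    # all workers exchanging simultaneously.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][seg((i - step) % n)]) for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n
            data[dst][seg(c)] = [a + b for a, b in zip(data[dst][seg(c)], payload)]

    # Phase 2, all-gather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][seg((i + 1 - step) % n)]) for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][seg(c)] = payload
    return data

# Four "GPUs", each holding an 8-element gradient equal to its own rank.
out = ring_allreduce([[float(w)] * 8 for w in range(4)])
assert all(g == [6.0] * 8 for g in out)  # 0 + 1 + 2 + 3 everywhere
```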
For GPU training, this is catastrophic. If one GPU's update is delayed, thousands of others sit idle waiting. The compute you're paying for - data centre power, cooling, hardware depreciation - burns money while nothing happens. At scale, network inefficiency isn't an annoyance. It's a direct hit to training economics.
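The economics are easy to sketch. The numbers below are assumptions chosen for scale - cluster size, GPU price, stall fraction - not figures from OpenAI, but they show how quickly idle GPUs turn into real money:

```python
# Back-of-envelope cost of network stalls. All inputs are assumptions.
GPUS = 10_000            # GPUs in the training cluster (assumed)
COST_PER_GPU_HOUR = 2.0  # $/GPU-hour, a rough cloud-ish figure (assumed)
TRAINING_DAYS = 90       # length of the run (assumed)
STALL_FRACTION = 0.10    # share of wall-clock time lost waiting on the network

hours = TRAINING_DAYS * 24
total = GPUS * COST_PER_GPU_HOUR * hours
wasted = total * STALL_FRACTION
print(f"total compute bill:        ${total:,.0f}")   # $43,200,000
print(f"burned while GPUs wait:    ${wasted:,.0f}")  # $4,320,000
```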
Mark Handley and Greg Steinbrecher, the engineers behind MRC, explain it clearly in OpenAI's podcast episode. Existing protocols weren't built for this use case. They needed something that could handle the specific traffic patterns of distributed training: lots of small messages, all equally important, all time-sensitive.
How MRC Works
MRC takes a different approach. Instead of relying on a single path between GPUs and retrying when packets are lost, it uses multiple network paths simultaneously. If one path gets congested, traffic shifts to another. The protocol is designed for reliability without the retransmission delays that stall training.
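OpenAI hasn't published MRC's wire format here, but the core idea - keep several paths live and steer traffic away from whichever one looks congested - can be sketched in a few lines. Everything below (the path scores, the decay factors, the class itself) is invented for illustration; it is not MRC's actual algorithm:

```python
import random

class MultipathSender:
    """Toy multipath scheduler: spray packets across paths by health score."""

    def __init__(self, num_paths: int):
        # One score per path; higher means recently healthy.
        self.scores = [1.0] * num_paths

    def pick_path(self) -> int:
        # Weighted choice: congested paths still get probed occasionally,
        # so the sender notices when they recover.
        return random.choices(range(len(self.scores)), weights=self.scores)[0]

    def on_ack(self, path: int) -> None:
        self.scores[path] = min(self.scores[path] * 1.1, 1.0)

    def on_congestion(self, path: int) -> None:
        # Back off this path; traffic shifts to the others on the next
        # pick, instead of the whole flow stalling as single-path TCP would.
        self.scores[path] = max(self.scores[path] * 0.5, 0.05)

# Usage: send a burst while path 0 is congested for the first half.
sender = MultipathSender(num_paths=4)
for i in range(1000):
    p = sender.pick_path()
    if p == 0 and i < 500:
        sender.on_congestion(p)  # pretend path 0 is congested early on
    else:
        sender.on_ack(p)
```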
Critically, MRC was built with input from the entire supply chain: AMD, Broadcom, Intel, Microsoft, and Nvidia. This isn't a proprietary OpenAI technology. It's being released as an open standard, meaning any company running large GPU clusters can adopt it.
The technical details matter, but the bigger point is this: OpenAI identified a bottleneck that couldn't be solved with existing tools, built a solution with the hardware vendors who control the ecosystem, and released it publicly. That approach - recognising where infrastructure fails and fixing it collaboratively - is how you build tools that scale.
Why This Matters Beyond OpenAI
GPU clusters aren't just for frontier model training anymore. Research labs, enterprises, and startups are all running distributed workloads. If network inefficiency is costing OpenAI millions, it's costing everyone else proportionally.
By making MRC an open standard, OpenAI is pushing the entire industry forward. Smaller labs benefit from the same efficiency gains. Cloud providers can offer better GPU utilisation. Hardware manufacturers can optimise for a protocol that's actually designed for AI workloads.
This is infrastructure work - the unglamorous layer beneath the models everyone talks about. But infrastructure is where the real leverage lives. A 10% improvement in effective GPU utilisation is roughly a 10% reduction in training cost, forever, for everyone who adopts it.
The podcast episode is worth listening to if you care about how AI systems actually get built. Handley and Steinbrecher walk through the problem, the design decisions, and the trade-offs in detail. It's rare to get this level of technical transparency from a frontier lab.
MRC won't make headlines the way a new model does. But it's the kind of work that determines who can afford to train at scale, and how fast the next generation of models arrives. Unsexy, critical, and now available to everyone.