
OpenAI opens MRC protocol for AI supercomputer networks

On May 5, 2026, OpenAI published the MRC (Multipath Reliable Connection) specification — a new GPU cluster networking protocol developed over two years with AMD, Broadcom, Intel, Microsoft, and NVIDIA. Already deployed on the Stargate supercomputer in Texas, MRC has been released to the public domain via the Open Compute Project to help the broader industry build more resilient networks for frontier model training.

Key takeaways

  • MRC was developed over 2 years in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA
  • The protocol is already running on OpenAI's NVIDIA GB200 supercomputers — including Stargate (Abilene, TX, Oracle/OCI) and Microsoft Fairwater
  • Multi-plane network topology connects over 100,000 GPUs using only 2 switch tiers instead of the conventional 3–4
  • Packet spraying distributes a single transfer across hundreds of paths simultaneously — eliminating core network bottlenecks
  • MRC 1.0 specification available free of charge through the Open Compute Project (OCP)

Network as the bottleneck for AI training

Training large language models is fundamentally a coordination problem. Thousands of GPUs must exchange data in tightly synchronized lockstep — a single late transfer can stall the entire operation. As clusters grow to hundreds of thousands of processors, the network ceases to be background infrastructure and becomes a critical limiting factor.

OpenAI spent several years building successive generations of its own supercomputers. Experience from three generations of clusters, gathered before Stargate came online, led the team to one conclusion: conventional network protocols do not scale adequately. With millions of transfers per training step, even brief link failures or network congestion meant job restarts or multi-second stalls.

Three pillars of MRC

MRC (Multipath Reliable Connection) addresses three specific weaknesses of conventional compute networks.

Multi-plane networks

Instead of treating each network interface as a single 800 Gb/s link, MRC splits it into eight 100 Gb/s links connected to eight different switches, forming eight separate parallel planes. The key advantage: the same switching capacity that provides 64 ports at 800 Gb/s provides 512 ports at 100 Gb/s, so each switch reaches eight times as many endpoints. This allows more than 131,000 GPUs to be interconnected with just two tiers of switches, whereas a conventional 800 Gb/s network would require three or four.
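
To make the scale arithmetic concrete, here is a back-of-the-envelope sketch in Python. It assumes an idealized, non-blocking two-tier leaf/spine (Clos) topology in which each leaf splits its ports evenly between hosts and spine uplinks; real deployments reserve ports for redundancy and management, so these figures are upper bounds rather than spec numbers.

```python
# Back-of-the-envelope scale math for the multi-plane topology described above.
# Assumes an idealized non-blocking two-tier leaf/spine (Clos) network; real
# deployments reserve ports for redundancy, so these figures are upper bounds.

SWITCH_CAPACITY_GBPS = 64 * 800               # 51.2 Tb/s of switching capacity

ports_at_800g = SWITCH_CAPACITY_GBPS // 800   # 64 ports per switch
ports_at_100g = SWITCH_CAPACITY_GBPS // 100   # 512 ports per switch

def max_endpoints_two_tier(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier Clos: each leaf splits its
    ports evenly between hosts (down) and spine uplinks (up), and the spine
    radix limits the fabric to `radix` leaves."""
    hosts_per_leaf = radix // 2
    max_leaves = radix
    return hosts_per_leaf * max_leaves

print(max_endpoints_two_tier(ports_at_800g))  # 2048   endpoints at 800 Gb/s
print(max_endpoints_two_tier(ports_at_100g))  # 131072 endpoints at 100 Gb/s
```

Since each GPU's interface appears once in every plane as one of its eight 100 Gb/s links, the per-plane endpoint count is also the GPU count, which is where the "more than 131,000 GPUs" figure comes from.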

Packet spraying

Traditional protocols route each transfer along a single path, causing collisions and congestion. MRC distributes packets from a single transfer across hundreds of paths through all planes simultaneously. Packets may arrive out of order — each carries its destination memory address, so the receiver can write them to memory as they arrive. If a path starts congesting, MRC dynamically reroutes packets to alternatives. If a packet is lost, the protocol assumes a failure and immediately stops using that path, rather than waiting for dynamic routing to react.
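
The mechanics are easier to see in code. The following minimal Python sketch illustrates the idea only: the packet layout, path selection, and function names are illustrative assumptions, not details taken from the MRC specification.

```python
import random

# Minimal sketch of packet spraying: one transfer is split into packets, each
# carrying its own destination memory offset, and the packets are spread
# round-robin across many paths. Because every packet knows where it lands,
# the receiver can place them in memory in any arrival order.

PACKET_SIZE = 4096
NUM_PATHS = 8                     # e.g. one path per network plane

def spray(transfer: bytes, healthy_paths: list[int]):
    """Yield (path, dest_offset, payload) tuples, rotating across paths."""
    for i in range(0, len(transfer), PACKET_SIZE):
        path = healthy_paths[(i // PACKET_SIZE) % len(healthy_paths)]
        yield path, i, transfer[i:i + PACKET_SIZE]

def receive(packets, total_len: int) -> bytearray:
    """Write each packet at its carried offset; arrival order doesn't matter."""
    buf = bytearray(total_len)
    for _path, offset, payload in packets:
        buf[offset:offset + len(payload)] = payload
    return buf

data = bytes(random.getrandbits(8) for _ in range(64 * 1024))
packets = list(spray(data, healthy_paths=list(range(NUM_PATHS))))
random.shuffle(packets)           # simulate heavily out-of-order arrival
assert receive(packets, len(data)) == data
```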

Source routing with SRv6

Instead of conventional dynamic routing (BGP), MRC uses IPv6 Segment Routing (SRv6): the sender encodes the full path for each packet directly in the destination address. Switches do not recompute routes — they apply static tables configured once at initialization. This eliminates an entire class of dynamic routing failures, which in practice have been a source of hard-to-diagnose outages.
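
A toy example of the source-routing idea follows. The addressing scheme and helper names are invented for illustration; the actual SRv6 encoding uses a segment routing header and is documented in the whitepaper.

```python
import ipaddress

# Illustrative sketch of source routing: the sender chooses the full path as
# an ordered list of segments (IPv6 addresses identifying hops) and attaches
# it to the packet. Each switch only forwards toward the next segment using a
# static table; no route is computed in flight. Addresses are made up.

def build_segment_list(plane: int, tier1_switch: int, dest_gpu: int) -> list[str]:
    """Encode a full path under a hypothetical addressing scheme:
    one prefix per plane, one address per hop."""
    return [
        str(ipaddress.IPv6Address(f"2001:db8:{plane}::1:{tier1_switch}")),  # tier-1 hop
        str(ipaddress.IPv6Address(f"2001:db8:{plane}::2:{dest_gpu}")),      # destination NIC
    ]

def next_hop(segments: list[str]) -> tuple[str, list[str]]:
    """At each switch: forward toward the first remaining segment, then pop it."""
    return segments[0], segments[1:]

segments = build_segment_list(plane=3, tier1_switch=17, dest_gpu=4242)
while segments:
    hop, segments = next_hop(segments)
    print("forward to", hop)
```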

Stargate: the protocol in production

MRC is not a research proposal — it is deployed in production. It runs on all of OpenAI's largest NVIDIA GB200 clusters: the Stargate supercomputer in Abilene, Texas (managed by Oracle Cloud Infrastructure) and Microsoft's Fairwater supercomputers. The protocol has been used to train multiple OpenAI frontier models, including those powering ChatGPT and Codex.

Production data confirms the system's resilience. During one training run, engineers had to reboot four Tier-1 switches. Before MRC, this would have required careful coordination with training teams to avoid job interruption. With MRC, the reboot went unnoticed by the cluster. Similarly, repeated link flaps between Tier-0 and Tier-1 switches had no measurable impact on synchronous pretraining jobs.

When one port of an 8-port GPU network interface failed, MRC detected the loss, recalculated paths bypassing the damaged plane, and informed peers not to route inbound traffic through it. The resulting training slowdown was noticeably smaller than the proportional capacity loss (one eighth of throughput).
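
A rough sketch of that failover behaviour, in the same illustrative spirit as the earlier examples (class and method names are invented; in the real system this logic lives in the NIC and the MRC transport):

```python
# Sketch of plane failover: the sender tracks which planes are healthy and
# simply stops spraying packets onto a plane once its port is reported down,
# so throughput degrades by at most that plane's share of bandwidth.

class PlaneAwareSender:
    def __init__(self, num_planes: int = 8):
        self.healthy = set(range(num_planes))

    def mark_plane_down(self, plane: int) -> None:
        """Called when a NIC port / plane failure is detected."""
        self.healthy.discard(plane)

    def pick_plane(self, packet_index: int) -> int:
        """Round-robin over the planes that are still usable."""
        planes = sorted(self.healthy)
        return planes[packet_index % len(planes)]

sender = PlaneAwareSender()
sender.mark_plane_down(5)                    # one of eight ports fails
chosen = {sender.pick_plane(i) for i in range(1000)}
assert 5 not in chosen and len(chosen) == 7  # traffic spread over 7 planes
```

Because traffic is already spread evenly across planes, losing one plane costs at most one eighth of peak bandwidth, which is consistent with the observed slowdown being smaller than the proportional capacity loss.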

Opening the standard

The MRC 1.0 specification has been released free of charge through the Open Compute Project — an initiative founded by Facebook in 2011 that has become the main platform for sharing open data center infrastructure standards. A technical whitepaper, "Resilient AI Supercomputer Networking using MRC and SRv6," accompanies the specification with full implementation details.

MRC extends the existing RoCE (RDMA over Converged Ethernet) standard developed by the InfiniBand Trade Association (IBTA), drawing on techniques from the Ultra Ethernet Consortium (UEC). Its publication fits OpenAI's broader infrastructure strategy: standardizing key layers of compute infrastructure to enable AI scaling beyond the resources of any single company.

Partners have already followed up in practice: AMD published its own technical commentary on MRC, Broadcom described its implementation in networking silicon, and NVIDIA and Intel confirmed deployments in their respective infrastructures.

Why it matters

For years, the frontier of AI training was defined by raw compute: more GPUs and faster accelerators. It is increasingly clear that the next bottleneck is the network. With 900 million weekly ChatGPT users, OpenAI faces scaling pressure that cannot be addressed by simply adding more chips.

MRC signals that OpenAI treats network infrastructure as a core competitive asset, not a commodity sourced from external vendors. Opening the specification is also a strategic move: if AMD, Broadcom, NVIDIA, and Microsoft build around MRC, OpenAI effectively becomes the defining voice in next-generation AI networking design.

For the rest of the sector — from hyperscalers to startups building clusters — MRC sets a new reference point. Two-tier networks capable of connecting 130,000 GPUs were previously out of reach for standard solutions. Publishing an open standard broadens that access — though crucially, MRC requires the latest 800 Gb/s network interfaces, meaning older clusters cannot take full advantage of the protocol.

What's next?

  • MRC 1.0 is already available through OCP — industry partners can implement the protocol in their own products and clusters; first commercial deployments outside the OpenAI/Microsoft/Oracle ecosystem are likely within 12–18 months
  • AMD, Broadcom, and NVIDIA have announced MRC support in their next-generation networking hardware — details are being released through their respective technical blogs
  • Stargate clusters are targeted to exceed 1 million GPUs per OpenAI's announced infrastructure expansion plan — MRC will be the networking protocol for that infrastructure
