
OpenAI MRC explained: Multipath Reliable Connection for GPU supercomputer networking (2026)

OpenAI MRC: multipath GPU networking (RoCE, packet spraying, SRv6) for frontier training; OCP spec. Diagrams from OpenAI’s post; LLM tokens vs fabric packets; ExplainX skills & MCP.

10 min read · Yash Thakker
Tags: OpenAI, MRC, Networking, AI training, RoCE, SRv6, GPU clusters


On May 5, 2026, OpenAI published "Supercomputer networking to accelerate large scale AI training," introducing MRC (Multipath Reliable Connection): a GPU networking protocol co-developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA, released through the Open Compute Project (OCP), and documented in the paper "Resilient AI Supercomputer Networking using MRC and SRv6."

This article gives a deeper, keyword-grounded walkthrough of that announcement: why synchronous frontier pretraining turns the network into a tail-latency problem, how multi-plane 800Gb/s-class fabrics change cluster shape, what packet spraying, path retirement, and packet trimming buy you, and why IPv6 Segment Routing (SRv6) source routing is part of the story. We then connect the stack to LLM tokens (inference billing and context windows) and to ExplainX surfaces for builders.

Visuals and attribution

The diagrams and facility photograph in this post are reproduced from OpenAI’s May 5, 2026 engineering article (same URLs as their Contentful-hosted assets). Animations referenced in the original post (packet collision and spraying) are credited there to Mark Handley; we include static diagrams OpenAI published alongside that narrative. If OpenAI updates the page, prefer their live post as the canonical visual source.


TL;DR

| Topic | Takeaway |
| --- | --- |
| Problem | Synchronous pretraining is tail-latency sensitive; collectives wait on the straggler; link and switch faults become routine at 100k+ GPU scale |
| MRC | Multipath over RoCE-class Ethernet, adaptive spraying, packet trimming, SRv6 source routing, static switch forwarding tables |
| Topology | Multi-plane fabrics: OpenAI cites on the order of ~131k GPUs with two switch tiers under their stated assumptions vs three–four tiers for some single-plane designs |
| Artifacts | OCP MRC 1.0 PDF · Paper PDF |
| Tokens | Not your chat tokenizer: tokens live at the model API layer; MRC lives in the GPU fabric layer |

Why AI training turns supercomputer networking into a bottleneck

Large-scale transformer pretraining is not embarrassingly parallel at the micro-step level: a single optimizer step can require many millions of RDMA-style transfers (gradient all-reduce, expert-parallel exchanges, checkpoint shards, and more—patterns depend on the strategy). Collective operations are worst-case-latency sensitive: one slow participant delays everyone.

OpenAI frames two networking imperatives at Stargate-class scale:

  1. Minimize avoidable congestion. Some incasts (many senders targeting one receiver) are physics; the design goal is to avoid additional hotspots from path selection and under-utilized capacity.
  2. Minimize failure blast radius. At enough scale, link flaps and switch faults are continuous background noise rather than rare incidents. A fabric that stalls for seconds while BGP-class reconvergence runs can waste enormous GPU-hours; checkpoint restarts are worse.

They describe very large synchronous jobs as a “failure amplifier”: adding GPUs increases statistical exposure to stragglers and faults.
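The "failure amplifier" framing can be made concrete with a little probability. The sketch below is illustrative only: it assumes faults are independent and uses a made-up per-component fault probability (`P_FAULT`), neither of which comes from OpenAI's post.

```python
def prob_any_fault(n_components: int, p_fault: float) -> float:
    """Probability that at least one of n independent components faults
    during a given window: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_fault) ** n_components


if __name__ == "__main__":
    # Hypothetical per-component fault probability for one time window.
    P_FAULT = 1e-6
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} components -> P(any fault) = {prob_any_fault(n, P_FAULT):.4f}")
```

Because a synchronous step waits for every participant, the probability that some component misbehaves in a window is what matters, and it climbs toward 1 as the component count grows.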

Keywords: synchronous distributed training, GPU collective communication, RDMA RoCE AI cluster, tail latency straggler, AI supercomputer network design.


Multipath Reliable Connection (MRC): what it is

In OpenAI’s telling, MRC is a transport-level approach for 800Gb/s-class NICs that:

  • Sprays a single logical transfer across many paths (“hundreds” in their post) spanning all multi-plane Ethernet planes.
  • Delivers out-of-order packets correctly because MRC packets carry enough information for the receiver to place data into the right GPU memory locations.
  • Adapts when a path is congested or lossy: shift load, retire suspect paths quickly, probe to distinguish failure vs transient loss.
  • Uses packet trimming so congestion at the destination triggers explicit retransmit signals instead of silently dropping payloads—reducing false “this path is dead” conclusions.
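To illustrate the out-of-order delivery point in the list above, here is a minimal sketch of offset-based placement. It is not the MRC wire format: the `Packet` type and its `offset` field are hypothetical stand-ins for "enough information for the receiver to place data."

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    offset: int     # hypothetical field: byte offset in the destination buffer
    payload: bytes


def split_message(message: bytes, mtu: int) -> list[Packet]:
    """Cut a message into packets that each carry their own placement offset."""
    return [Packet(i, message[i:i + mtu]) for i in range(0, len(message), mtu)]


def receive(packets: list[Packet], total_len: int) -> bytes:
    """Place each packet directly at its offset, in whatever order it arrives."""
    buf = bytearray(total_len)
    received = 0
    for pkt in packets:
        buf[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload
        received += len(pkt.payload)
    if received != total_len:
        raise ValueError("message incomplete")
    return bytes(buf)


if __name__ == "__main__":
    message = bytes(range(256)) * 4
    packets = split_message(message, mtu=100)
    random.shuffle(packets)  # simulate arrival over many paths
    assert receive(packets, len(message)) == message
```

Because every packet is self-describing, the receiver needs no reorder queue; arrival order stops mattering, which is what makes spraying across many paths practical.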

MRC builds on RDMA over Converged Ethernet (RoCE)—the InfiniBand Trade Association (IBTA) family of standards for hardware-accelerated remote memory access among GPUs and CPUs—and incorporates ideas from the Ultra Ethernet Consortium (UEC), extended with SRv6 for large-scale AI fabrics.

Keywords: Multipath Reliable Connection protocol, RoCE training cluster, UEC Ethernet AI, GPU memory RDMA, AI fabric congestion control.


Multi-plane GPU fabrics (the topology MRC assumes)

Traditional thinking treats a fast NIC as one fat link. OpenAI’s multi-plane pattern splits it: for example, one 800Gb/s interface feeds eight parallel planes at ~100Gb/s each rather than acting as a single 800Gb/s logical channel.

Why operators care: Path diversity and failure isolation improve when traffic can fan across planes. Switch radix math changes too—OpenAI gives a worked example: a switch that terminates 64 ports at 800Gb/s might instead attach 512 ports at 100Gb/s, enabling a fully connected cluster on the order of ~131,000 GPUs with two switch tiers, where a conventional single-plane 800Gb/s design might need three or four tiers (per their post—validate in the OCP PDF and paper before treating the numbers as universal).
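The ~131,000 figure is consistent with standard folded-Clos (fat-tree) arithmetic. The sketch below assumes a non-blocking design in which every switch below the top tier splits its ports evenly between downlinks and uplinks, so the endpoint count is 2 × (radix / 2)^tiers. That formula is our assumption for illustration; OpenAI's post does not state it, and real deployments may oversubscribe or reserve ports.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints in a non-blocking folded Clos: 2 * (radix / 2) ** tiers."""
    return 2 * (radix // 2) ** tiers


def tiers_needed(radix: int, endpoints: int) -> int:
    """Smallest number of switch tiers that reaches the endpoint count."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


if __name__ == "__main__":
    print(max_endpoints(512, 2))       # 131072: 512 x 100Gb/s ports, two tiers
    print(max_endpoints(64, 3))        # 65536: 64 x 800Gb/s ports, three tiers
    print(tiers_needed(64, 131_072))   # 4
    print(tiers_needed(512, 131_072))  # 2
```

Under these assumptions, each plane built from 512-port switches reaches 131,072 endpoints in two tiers, while a 64-port single-plane design falls short at three tiers and needs a fourth.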

The catch: classic single-path RoCE keeps packets in order on one path. That under-uses multi-plane capacity and still allows flows to collide on hot links—bad for synchronous training where outliers dominate.

Diagram: multi-plane network connecting over 100k GPUs with two switch tiers, from OpenAI’s MRC article

Source: OpenAI — Supercomputer networking to accelerate large scale AI training, May 5, 2026 (diagram as published).

Diagram: single-path RoCE-style flows colliding and congesting links; collective latency tied to worst flows

Source: OpenAI (2026), ibid.; original post notes congestion animation by Mark Handley.

Keywords: multi-plane Ethernet AI cluster, 800GbE GPU networking, two-tier AI switch fabric, 131000 GPU network topology.


Packet spraying, path retirement, and packet trimming

Spraying means striping packets from the same logical transfer across many paths at once, including across planes. That spreads load so individual links are less likely to become persistent hotspots—important when throughput variance across flows translates directly into step-time variance.

When loss happens, OpenAI describes a conservative stance: assume the path might be bad, stop using it, retransmit, then probe to see whether it was a blip or a real fault. That endpoint-driven loop is the basis for their claim of microsecond-scale reactions, versus seconds of routing instability in some legacy fabrics.
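The spray, retire, retransmit, probe loop described above can be sketched as a toy sender. This is a simplification under stated assumptions: it uses plain round-robin instead of MRC's adaptive load balancing, and the `PathSet` class and its method names are hypothetical, not taken from the OCP spec.

```python
class PathSet:
    """Tracks which paths a sender may currently spray across."""

    def __init__(self, path_ids):
        self.active = list(path_ids)
        self.retired = set()
        self._next = 0

    def pick(self) -> int:
        """Round-robin over the active paths (a stand-in for adaptive spraying)."""
        if not self.active:
            raise RuntimeError("no active paths")
        path = self.active[self._next % len(self.active)]
        self._next += 1
        return path

    def on_loss(self, path: int) -> None:
        """Conservative reaction: stop using a path as soon as it loses a packet."""
        if path in self.active:
            self.active.remove(path)
            self.retired.add(path)

    def on_probe_ok(self, path: int) -> None:
        """A successful probe shows the loss was transient; reinstate the path."""
        if path in self.retired:
            self.retired.discard(path)
            self.active.append(path)


def spray(paths: PathSet, n_packets: int, broken: set[int]) -> dict[int, int]:
    """Send n packets; on loss, retire the path and retransmit on another one."""
    delivered: dict[int, int] = {}
    for _ in range(n_packets):
        while True:
            path = paths.pick()
            if path in broken:
                paths.on_loss(path)
                continue  # retransmit the same packet on the next path
            delivered[path] = delivered.get(path, 0) + 1
            break
    return delivered
```

For example, `spray(PathSet(range(4)), 12, broken={2})` delivers all 12 packets over paths 0, 1, and 3 and leaves path 2 retired until a later `on_probe_ok(2)` reinstates it. The decision is local to the sender, which is why no fabric-wide reconvergence is involved.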

Packet trimming addresses loss from receiver-side congestion: instead of dropping silently, a switch can trim payload, forward headers, and let the receiver request retransmits. That reduces spurious path retirement when the problem was not a broken plane but downstream pressure.
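A minimal sketch of the trimming idea follows, assuming a single byte-limited switch queue. The `TrimmingQueue` class, the `trimmed` flag, and the sequence-number field are hypothetical illustrations rather than the behavior of any specific switch ASIC or the MRC packet format.

```python
from collections import deque
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False


class TrimmingQueue:
    """Toy switch queue: when the payload no longer fits, keep only the header."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.queue: deque[Packet] = deque()

    def enqueue(self, pkt: Packet) -> None:
        if self.used + len(pkt.payload) > self.capacity:
            # Trim instead of dropping: the header still reaches the receiver.
            pkt = replace(pkt, payload=b"", trimmed=True)
        self.used += len(pkt.payload)
        self.queue.append(pkt)


def retransmit_requests(packets) -> list[int]:
    """The receiver sees trimmed headers and asks for exactly those sequences."""
    return [p.seq for p in packets if p.trimmed]


if __name__ == "__main__":
    q = TrimmingQueue(capacity_bytes=250)
    for seq in range(4):
        q.enqueue(Packet(seq, b"x" * 100))
    print(retransmit_requests(q.queue))  # [2, 3]
```

Because the trimmed header still arrives, the receiver learns which packets were lost to congestion and can request them explicitly, so the sender has no reason to conclude the path itself is broken.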

Diagram: MRC spreading packets across many paths to reduce hotspots

Source: OpenAI (2026), ibid.; original post notes spraying animation by Mark Handley.

Keywords: adaptive packet spraying, multipath load balancing GPU cluster, network packet trimming ECN, microsecond failover AI training.


SRv6 source routing vs dynamic interior routing (BGP-style)

A second architectural move in OpenAI’s article is disabling dynamic interior routing (they name BGP) in favor of Segment Routing over IPv6 (SRv6): senders encode the sequence of switch identifiers in the packet so forwarding is deterministic. Switches pop segments and use static tables provisioned at bring-up, not continuously mutating control-plane state.

Why pair this with MRC? If endpoints can spray, detect bad paths, and stop using them without needing the fabric to globally reconverge, much of the motivation for live dynamic path recomputation inside the training mesh diminishes. The trade-off is operational: you need rigorous engineering to design, provision, and reason about static source-routed topologies—this is hyperscale infrastructure, not a default for general enterprise LANs.
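Source routing over static tables can be sketched in a few lines. This is a deliberate simplification: real SRv6 encodes segments as IPv6 addresses in a routing header, whereas the string identifiers, the `STATIC_TABLES` contents, and the port numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    segments: list[str]  # remaining hops chosen by the sender (hypothetical IDs)
    payload: bytes


# Static per-switch tables, provisioned once: next segment -> egress port.
STATIC_TABLES: dict[str, dict[str, int]] = {
    "t0-a": {"t1-3": 7},
    "t1-3": {"t0-b": 2},
    "t0-b": {"host-42": 11},
}


def forward(switch: str, pkt: Packet) -> tuple[str, int]:
    """Pop the next segment and look up its egress port in the static table."""
    next_hop = pkt.segments.pop(0)
    return next_hop, STATIC_TABLES[switch][next_hop]


def trace(first_switch: str, pkt: Packet) -> list[tuple[str, int]]:
    """Follow the packet hop by hop; returns (switch, egress port) pairs."""
    hops = []
    node = first_switch
    while pkt.segments:
        next_hop, port = forward(node, pkt)
        hops.append((node, port))
        node = next_hop
    return hops


if __name__ == "__main__":
    pkt = Packet(["t1-3", "t0-b", "host-42"], b"gradient shard")
    print(trace("t0-a", pkt))  # [('t0-a', 7), ('t1-3', 2), ('t0-b', 11)]
```

The switches never compute a route; they only consume the sender's choice. Steering around a fault therefore means the sender picks a different segment list, with no table updates in the fabric.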

Diagram: SRv6 encoding the path in the packet; deterministic forwarding for AI fabrics

Source: OpenAI (2026), ibid.

Keywords: SRv6 AI networking, IPv6 segment routing GPU cluster, static routing table data center, BGP vs source routing HPC.


Deployment claims: GB200, OCI Abilene, Microsoft Fairwater

OpenAI states MRC is deployed on their largest NVIDIA GB200 supercomputers, including Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft Fairwater, with NVIDIA and Broadcom hardware involvement, and that multiple frontier models were trained using it.

Stargate supercomputer site (OCI Abilene, Texas), still from OpenAI’s article

Source: OpenAI (2026), ibid. (facility photograph as published).

Treat these as vendor disclosures until independent observers publish compatible measurements; the durable artifacts for researchers are the OCP spec and the accompanying paper.

Keywords: NVIDIA GB200 training cluster, Oracle Abilene AI supercomputer, Microsoft Fairwater OpenAI, frontier model pretraining infrastructure.


What OpenAI reports in production (and how to read it)

Their post describes frequent tier-0↔tier-1 link flaps with no measurable impact on synchronous pretraining, reboots of four tier-1 (T1) switches during a ChatGPT/Codex training run without coordinating trainers, and quick recovery after losing an entire T1 switch, with a chart in the original article.

Diagram: production training data—MRC reaction to complete loss of a T1 switch

Source: OpenAI (2026), ibid.

Due diligence: correlate with figures and methodology in the PDF; blog prose is not a substitute for reproducible evaluation.


Contrast table: classic single-path RoCE vs MRC-style fabrics

| Dimension | Typical single-path RoCE-style mental model | OpenAI’s MRC narrative |
| --- | --- | --- |
| Path use | One primary path preserves order | Spray across many paths; reorder at destination |
| Multi-plane | Often under-utilized or poorly balanced | Designed to load-balance across planes |
| Loss interpretation | May conflate congestion and failure | Trimming + probes to reduce false path retirement |
| Control plane | Dynamic routing common | SRv6 + static tables; endpoints steer around faults |
| Time scale | Seconds of instability possible | Claims microsecond-scale path decisions |

Who needs to care (and who does not)

Strongly relevant: hyperscale training teams, AI HPC network architects, switch and NIC vendors, cloud providers hosting frontier clusters, and standardization bodies (OCP, UEC, IBTA).

Indirectly relevant: application developers shipping API products—you rarely touch SRv6, but you operate in the world these systems make possible: cheaper inference curves, new model generations, and fiercer competition on latency and price per token.

For day-to-day builder work, pair posts like this with token literacy and agent tooling (below).


LLM tokens: what they are—and why they are not “packets on the wire”

Packets on a training fabric move tensors, gradients, optimizer state, and checkpoints between machines at line rate. Tokens are discrete units from a tokenizer—the pieces a language model scores during inference and training.

In plain terms:

  • The model consumes token IDs in a fixed vocabulary, not raw Unicode strings.
  • Prompts with code, JSON, or URLs usually cost more tokens than plain prose.
  • Providers bill input and output tokens separately; caching and prefix pricing may apply.

Training token throughput (tokens per second across the cluster) and API tokens (billable units) share vocabulary ideas but not economics: pretraining burns collectives and power; your product burns dollars per million tokens and rate limits.
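On the API side, the arithmetic is simple enough to sketch. The prices in the example are placeholders, not any provider's actual rates, and the function ignores caching and prefix discounts.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one request when input and output tokens are priced separately."""
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical prices: $2 per million input tokens, $8 per million output.
    print(request_cost_usd(1_200, 300, usd_per_m_input=2.0, usd_per_m_output=8.0))
```

With these placeholder rates, a 1,200-token prompt and a 300-token reply cost the same on each side of the bill, which is why output length often dominates spend.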

Start here: What are LLM tokens—including examples and billing?

Bridge: MRC-class networks help labs ship checkpoints; token-aware prompting helps you ship those models into production without surprises.


ExplainX resources (builder side of the stack)

For reliability patterns at the agent layer (evals, tools, routing), see Agent harness engineering. For MCP fundamentals: What is MCP?.



Primary sources

Partner posts linked from OpenAI: AMD, Broadcom, Microsoft, NVIDIA


Specifications, deployments, and vendor narratives change. This article reflects May 2026 context tied to OpenAI’s announcement; validate in the OCP spec and paper before procurement, benchmarking, or architecture sign-off. Visual assets are credited to OpenAI’s public post; ExplainX hosts local copies for performance and editorial stability.
