
Synchronous Distributed Training

All workers process the same SGD step in lockstep and aggregate gradients via all-reduce, making distributed training mathematically equivalent to single-node SGD on a large minibatch.

Category

Abstraction level: Operation level
  • LLM training (GPT, LLaMA, Claude, Gemini) on thousands of GPUs
  • Computer vision (ResNet, ViT) at ImageNet scale and beyond
  • Reinforcement learning with massive parallel rollouts
  • Production-scale recommendation models
  • Multimodal foundation models

Training step:
(1) Each of N workers receives a B/N shard of the minibatch and runs forward + backward on a local copy of the model.
(2) All workers issue an all-reduce on the gradient buffer (typically ring all-reduce in NCCL/MPI or tree-reduce on a hierarchy).
(3) When the collective completes, every worker holds an identical sum/mean of the gradients.
(4) The local optimizer (Adam, SGD) updates the weights; because all replicas started identical and received identical gradients, the weights stay in sync without further communication.

FSDP/ZeRO add extra all-gather and reduce-scatter operations to shard optimizer state, gradients, and parameters.
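
A minimal sketch of one such step in PyTorch, assuming the process group is already initialized (e.g. via torchrun) and every rank starts from an identical model copy; the model, loss function, and batch names are placeholders:

```python
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, local_batch):
    inputs, targets = local_batch              # this rank's B/N shard
    optimizer.zero_grad(set_to_none=True)

    loss = loss_fn(model(inputs), targets)     # (1) local forward
    loss.backward()                            # (1) local backward

    world_size = dist.get_world_size()
    for p in model.parameters():               # (2) all-reduce the gradient buffer
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size               # (3) identical mean gradient on every rank

    optimizer.step()                           # (4) identical update on every rank
    return loss.detach()
```

In practice torch.nn.parallel.DistributedDataParallel performs steps (2) and (3) automatically, bucketing gradients and overlapping the all-reduce with the backward pass.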

Training large models on a single GPU is infeasible due to time and memory limits. Asynchronous parameter-server schemes scale poorly because of stale gradients. Synchronous training solves both while preserving convergence equivalent to single-node SGD on a large minibatch.

Parallelism

Fully parallel

Paradigm

Dense

All paths active

All workers perform exactly the same operations at every step; there is no conditional execution at the system level.

Number of workers (N)

Critical
  • 8 (single node)
  • 256 (Goyal et al. 2017)
  • 10000+ (frontier LLMs)

Number of GPUs/TPUs participating in training. Scales the global minibatch and demands linear LR scaling.
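
A back-of-the-envelope sketch of the linear scaling rule from Goyal et al.; the base values below are illustrative:

```python
# Linear LR scaling rule (Goyal et al. 2017): scale the reference learning rate
# by the ratio of the global batch to the reference batch. Values are illustrative.
base_lr = 0.1                  # LR tuned for a reference batch of 256
base_batch = 256
per_worker_batch = 32
num_workers = 256              # N

global_batch = per_worker_batch * num_workers      # 32 * 256 = 8192
scaled_lr = base_lr * global_batch / base_batch    # 0.1 * 8192 / 256 = 3.2
```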

Global batch size

Critical

Total minibatch B = per_worker × N. Too large a global batch causes a generalization gap.

All-reduce algorithm

Standard

Ring vs tree vs hierarchical (NVLink+IB) all-reduce. Affects scaling and topology tolerance.
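
A simplified alpha-beta cost model illustrating the trade-off, assuming a uniform per-link bandwidth W and per-step latency a (both illustrative); NCCL's real tree algorithm pipelines chunks, so its bandwidth term is better than this naive estimate:

```python
import math

# Rough alpha-beta cost model for all-reducing a gradient buffer of S bytes
# across N workers; a = per-step latency (s), W = link bandwidth (bytes/s).
def ring_allreduce_time(S, N, a, W):
    # 2*(N-1) steps, each moving S/N bytes: bandwidth-optimal, latency grows with N.
    return 2 * (N - 1) * (a + S / (N * W))

def naive_tree_allreduce_time(S, N, a, W):
    # Reduce up + broadcast down a binary tree: ~2*log2(N) steps of S bytes each.
    return 2 * math.ceil(math.log2(N)) * (a + S / W)

# Example: 1 GB of gradients, 512 workers, 5 us latency, 50 GB/s per link.
print(ring_allreduce_time(1e9, 512, 5e-6, 50e9))        # ~0.045 s
print(naive_tree_allreduce_time(1e9, 512, 5e-6, 50e9))  # ~0.36 s
```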

Gradient accumulation steps

Standard

Micro-batches accumulated before the all-reduce; emulates a larger effective batch under memory constraints.
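
A hedged sketch of gradient accumulation under DistributedDataParallel, using no_sync() to suppress the all-reduce on all but the last micro-batch; names and sizes are illustrative:

```python
import contextlib

# Effective global batch = micro_batch_size * accumulation_steps * world_size.
def accumulation_step(ddp_model, optimizer, loss_fn, micro_batches):
    optimizer.zero_grad(set_to_none=True)
    accum_steps = len(micro_batches)
    for i, (inputs, targets) in enumerate(micro_batches):
        last = (i == accum_steps - 1)
        # no_sync() defers gradient synchronization; only the final backward
        # triggers a single all-reduce over the accumulated gradients.
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(inputs), targets) / accum_steps
            loss.backward()
    optimizer.step()
```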

Common pitfalls

Stragglers
HIGH

The slowest worker (e.g., GPU thermal throttling or a failing NIC) delays the whole step; the risk grows with cluster size.

Backup workers, elastic training, redundant replicas, and per-host monitoring.

Large-batch generalization gap
MEDIUM

Very large global batches (>32k) can degrade generalization even though training loss still converges.

Linear LR scaling + warmup, LARS/LAMB optimizers, and label smoothing.
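
A minimal sketch of linear warmup into the scaled learning rate, following the Goyal et al. recipe; the warmup length and hand-off schedule are illustrative:

```python
# Gradual warmup: ramp linearly from the reference LR to the scaled LR over the
# first warmup_steps, then hand off to the normal decay schedule.
def warmup_lr(step, warmup_steps, base_lr, scaled_lr):
    if step < warmup_steps:
        return base_lr + (scaled_lr - base_lr) * step / warmup_steps
    return scaled_lr  # replace with the usual step/cosine decay after warmup
```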

Communication bottleneck
HIGH

All-reduce latency grows with the number of nodes; for small models, communication can dominate compute.

Gradient compression, compute/communication overlap (PyTorch DDP bucketing), and a high-bandwidth interconnect (RoCE, InfiniBand, NVLink).
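
A hedged PyTorch sketch of these mitigations: DDP bucketing to overlap the all-reduce with the backward pass, plus an fp16 compression hook. The model and rank setup are placeholders, and hook availability varies across PyTorch versions:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes the NCCL process group is already initialized (e.g. by torchrun).
local_rank = dist.get_rank() % torch.cuda.device_count()
model = nn.Linear(4096, 4096).cuda(local_rank)   # placeholder model

# Gradients are all-reduced in 25 MB buckets as soon as each bucket is ready,
# overlapping communication with the rest of the backward pass.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

# Compress each gradient bucket to fp16 before it goes on the wire.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```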

Failure recovery
CRITICAL

A single GPU/node failure in classic synchronous training forces a restart from checkpoint; in 10k+ GPU clusters the MTBF is on the order of hours.

Frequent checkpointing, async checkpointing (e.g. PyTorch DCP), and elastic training (TorchElastic).
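
A hedged sketch of periodic sharded checkpointing with torch.distributed.checkpoint (DCP); the interval and path are illustrative, and the exact API differs across PyTorch versions (older releases use save_state_dict):

```python
import torch.distributed.checkpoint as dcp

CHECKPOINT_EVERY = 500   # illustrative interval, in optimizer steps

def maybe_checkpoint(step, model, optimizer):
    if step % CHECKPOINT_EVERY != 0:
        return
    state = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    # Every rank writes its own shard; no rank-0 gather of the full model.
    dcp.save(state, checkpoint_id=f"/checkpoints/step_{step}")
```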

GENESIS · Source paper

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2017 · arXiv (Facebook AI Research tech report) · Priya Goyal, Piotr Dollár, Ross Girshick et al.
2012

DistBelief (Google): early parameter server

Dean et al. introduce the asynchronous parameter server with DistBelief, one of the first large-scale distributed deep-learning systems.

2017

Goyal et al.: synchronous training on 256 GPUs

breakthrough

Linear scaling rule + warmup enable training ImageNet in 1h on 256 GPUs. Establishes synchronous training as the standard.

2017

Horovod (Uber)

Sergeev & Del Balso release Horovod, a synchronous training framework based on ring all-reduce with NCCL/MPI.

2019

ZeRO (DeepSpeed): optimizer state partitioning

breakthrough

Rajbhandari et al. introduce ZeRO, reducing memory by sharding optimizer state, gradients, and parameters (stage 1/2/3).

2022

PyTorch FSDP: stable release

Fully Sharded Data Parallel (a ZeRO-3-style sharding approach built into PyTorch core) becomes a dominant mechanism for LLM training.

2024

Training on 100k+ GPUs

xAI Colossus, Meta Llama 3, and OpenAI GPT-4 training runs use clusters of roughly 25k to 100k+ GPUs with synchronous SGD and RoCE/IB scale-out.

GPU Tensor Cores · PRIMARY

Synchronous training is the primary workload of NVIDIA H100/H200/B200 clusters with NVLink + InfiniBand/RoCE.

TPU · PRIMARY

Google TPU pods are built around synchronous training with a dedicated ICI all-reduce fabric.

Commonly used with

RoCE

RDMA over Converged Ethernet (RoCE) is a family of network protocols standardized by the InfiniBand Trade Association (IBTA) that bring RDMA semantics (remote memory access that bypasses the host CPU networking stack) onto Ethernet. Three variants exist: RoCE v1 operates as an Ethernet link-layer protocol (Ethertype 0x8915) confined to a single broadcast domain; the experimental RoCE v1.5 runs over IP; RoCE v2 encapsulates packets inside UDP/IP (port 4791) and is routable across IPv4/IPv6 networks. To approach InfiniBand-class performance, RoCE typically requires a lossless Ethernet fabric configured with Priority Flow Control (PFC) and Data Center Bridging (DCB); RoCE v2 additionally defines an ECN-based congestion-control mechanism using CNP frames. RoCE is today the dominant interconnect for GPU clusters in large-scale AI training, with end-to-end latencies as low as 1.3 µs on modern host-channel adapters.

MRC

Multipath Reliable Connection (MRC) is a network protocol designed for training frontier AI models on supercomputer clusters with more than 100,000 GPUs. It extends the RDMA over Converged Ethernet (RoCE) standard from the InfiniBand Trade Association and builds on techniques from the Ultra Ethernet Consortium (UEC), adding SRv6 source routing on top. MRC has been deployed across all of OpenAI's largest NVIDIA GB200 supercomputers, including the Stargate site operated with Oracle Cloud Infrastructure in Abilene, Texas, and in Microsoft Fairwater supercomputers. The specification was published on May 5, 2026 as an Open Compute Project (OCP) contribution and is publicly available. MRC addresses three problems of large-scale synchronous training: it enables two-tier multi-plane networks connecting 131,000 GPUs instead of conventional three- or four-tier designs, virtually eliminates core network congestion via adaptive packet spraying, and routes around failures on a microsecond timescale using static source routing instead of dynamic BGP.

IB

InfiniBand (IB) is a networking standard maintained by the InfiniBand Trade Association (IBTA, founded 1999), in which hosts connect to the fabric via Host Channel Adapters (HCAs) and peripherals via Target Channel Adapters (TCAs). Its switched-fabric topology, credit-based link-level flow control, and native RDMA deliver microsecond latencies (1.3 µs at QDR, <0.6 µs at HDR) and full line rate without packet loss. Successive bandwidth generations are: SDR (8 Gbit/s 4×, 2001), DDR (16, 2005), QDR (32, 2007), FDR (54.54, 2011), EDR (100, 2014), HDR (200, 2018), NDR (400, 2022), and XDR (800, 2024). InfiniBand supports five message types: RDMA read/write, channel send/receive, transactional operations, multicast, and atomics. The Linux kernel has supported IB since 2.6.11 (2005) via the OpenFabrics Enterprise Distribution (OFED) and the so-called verbs API. After 2014, IB briefly led the TOP500 interconnect ranking, but Ethernet/RoCE later reclaimed market share. In 2019 NVIDIA acquired Mellanox, the last independent vendor, and today IB is the primary scale-out fabric of NVIDIA's AI platforms (Quantum-2, Quantum-X800), used for LLM training in conjunction with NVLink/NVSwitch.
