Synchronous Distributed Training
All workers process the same SGD step in lockstep and aggregate gradients via all-reduce, making distributed training mathematically equivalent to single-node SGD on a large minibatch.
Training step:
1. Each of the N workers receives a B/N shard of the minibatch and runs forward + backward on a local copy of the model.
2. All workers issue an all-reduce on the gradient buffer (typically ring all-reduce in NCCL/MPI, or tree-reduce on a hierarchical topology).
3. When the collective completes, every worker holds an identical sum/mean of the gradients.
4. The local optimizer (Adam, SGD) updates the weights; because all replicas started identical and received identical gradients, the weights stay in sync without further communication.
FSDP/ZeRO add extra all-gather and reduce-scatter operations to shard optimizer state, gradients, and parameters.
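The sketch below shows one such step written directly against `torch.distributed`; it is a minimal illustration, not the DDP implementation (which buckets gradients and overlaps the all-reduce with the backward pass). `model`, `loss_fn`, and the input shard are placeholders, and the script is assumed to be launched with `torchrun` so the process group can initialize from the environment.

```python
# One synchronous data-parallel step with an explicit gradient all-reduce.
# Illustrative sketch: model, loss_fn, inputs, and targets are placeholders;
# launch with torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set.
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    world_size = dist.get_world_size()

    # (1) forward + backward on this worker's B/N shard
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # (2)+(3) all-reduce so every replica ends up with the mean gradient
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

    # (4) identical update on identical gradients keeps the replicas in sync
    optimizer.step()
    return loss.detach()
```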
Training large models on a single GPU is infeasible due to time and memory limits. Asynchronous parameter-server schemes scale poorly because of stale gradients. Synchronous training solves both while preserving convergence equivalent to single-node SGD on a large minibatch.
Fully parallel
Dense
All paths active
All workers perform exactly the same operations at every step; there is no conditional execution at the system level.
Number of workers (N)
- 8 (single node)
- 256 (Goyal et al. 2017)
- 10000+ (frontier LLMs)
Number of GPUs/TPUs participating in training. Scales the global minibatch and demands linear LR scaling.
Global batch size
Total minibatch B = per_worker × N. Too large a global batch causes a generalization gap.
All-reduce algorithm
Ring vs. tree vs. hierarchical (NVLink + IB) all-reduce; the choice affects scaling behavior and tolerance to the network topology.
Gradient accumulation steps
Micro-batches accumulated before the all-reduce; this emulates a larger effective batch under memory constraints.
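A minimal sketch of gradient accumulation under PyTorch DDP, assuming the model is already wrapped in `DistributedDataParallel`: the `no_sync()` context manager suppresses the all-reduce on all but the last micro-batch, so communication happens once per effective batch. `micro_batches` and `loss_fn` are placeholders.

```python
# Gradient accumulation with DDP: all-reduce only on the final micro-batch.
# Illustrative sketch; ddp_model, loss_fn, and micro_batches are placeholders.
import contextlib

def accumulated_step(ddp_model, optimizer, loss_fn, micro_batches):
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)
    for i, (x, y) in enumerate(micro_batches):
        # Skip gradient synchronization for all but the last micro-batch.
        ctx = ddp_model.no_sync() if i < n - 1 else contextlib.nullcontext()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / n  # average over micro-batches
            loss.backward()
    optimizer.step()
```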
Common pitfalls
Stragglers (severity: HIGH)
The slowest worker (e.g., GPU thermal throttling or a failing NIC) delays the whole step; the risk grows with cluster size.
Backup workers, elastic training, redundant replicas, and per-host monitoring.
Large-batch generalization gap (severity: MEDIUM)
Very large global batches (>32k) can degrade generalization even though the training loss continues to converge.
Linear LR scaling + warmup, LARS/LAMB optimizers, and label smoothing.
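A small sketch of the linear scaling rule with warmup in the spirit of Goyal et al.: with the per-worker batch fixed, the reference learning rate is multiplied by the number of workers and ramped up linearly over the first steps. The model, base LR, and warmup length below are illustrative placeholders.

```python
# Linear LR scaling + warmup sketch (values are illustrative).
import torch

model = torch.nn.Linear(1024, 1000)   # placeholder model
world_size = 256                        # number of workers N
base_lr = 0.1                           # reference LR for a single worker
warmup_steps = 500

optimizer = torch.optim.SGD(model.parameters(),
                            lr=base_lr * world_size, momentum=0.9)

def lr_lambda(step):
    # Linear warmup from ~0 to the full scaled LR, then constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```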
Communication bottleneck (severity: HIGH)
All-reduce time scales with the number of nodes; for small models, communication can dominate compute.
Gradient compression, compute/communication overlap (PyTorch DDP bucketing), and a high-bandwidth interconnect (RoCE, InfiniBand, NVLink).
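A hedged example of both mitigations in PyTorch: DDP already buckets gradients and overlaps the all-reduce with the backward pass, and a communication hook can additionally compress gradients to fp16 before they hit the wire. The bucket size and tiny model are illustrative; launch with `torchrun` on a multi-GPU node.

```python
# DDP with bucketed, overlapped all-reduce plus fp16 gradient compression.
# Illustrative sketch; bucket_cap_mb and the model are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

# Halve communication volume by casting gradients to fp16 before the all-reduce.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```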
Failure recovery (severity: CRITICAL)
A single GPU/node failure in classic sync training forces a restart from checkpoint; in 10k+ GPU clusters the MTBF is on the order of hours.
Frequent checkpointing, async checkpointing (e.g. PyTorch DCP), and elastic training (TorchElastic).
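A minimal sketch of periodic checkpointing for classic synchronous training: rank 0 saves the replicated state and a barrier keeps every replica in lockstep across the save. The path and interval are placeholders; sharded or asynchronous saving would instead go through torch.distributed.checkpoint (DCP) as mentioned above.

```python
# Periodic rank-0 checkpointing; path and interval are illustrative.
import torch
import torch.distributed as dist

def maybe_checkpoint(step, model, optimizer, every=1000, path="ckpt.pt"):
    if step % every != 0:
        return
    if dist.get_rank() == 0:
        # Replicas are identical in synchronous training, so rank 0's copy suffices.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optim": optimizer.state_dict()}, path)
    dist.barrier()  # no worker starts the next step until the save completes
```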
GENESIS · Source paper
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
DistBelief (Google): early parameter server
Dean et al. introduce the asynchronous parameter server as the first distributed deep-learning architecture.
Goyal et al.: synchronous training on 256 GPUs
Breakthrough: the linear scaling rule + warmup enable training ImageNet in 1 hour on 256 GPUs. Establishes synchronous training as the standard.
Horovod (Uber)
Sergeev & Del Balso release Horovod, a synchronous training framework based on ring all-reduce with NCCL/MPI.
ZeRO (DeepSpeed): optimizer state partitioning
Breakthrough: Rajbhandari et al. introduce ZeRO, reducing memory by sharding optimizer state, gradients, and parameters (stages 1/2/3).
PyTorch FSDP: stable release
Fully Sharded Data Parallel (a ZeRO-3-style implementation in PyTorch core) becomes the dominant mechanism for LLM training.
Training on 100k+ GPUs
xAI Colossus, Meta Llama 3, and OpenAI GPT-4 are trained on 25k-100k+ GPU clusters using synchronous SGD with RoCE/IB scale-out.
Synchronous training is the primary workload of NVIDIA H100/H200/B200 clusters with NVLink + InfiniBand/RoCE.
Google TPU pods are built around synchronous training with a dedicated ICI all-reduce fabric.
Commonly used with
RoCE
RDMA over Converged Ethernet (RoCE) is a family of network protocols standardized by the InfiniBand Trade Association (IBTA) that bring RDMA semantics (remote memory access that bypasses the host CPU networking stack) onto Ethernet. Three variants exist: RoCE v1 operates as an Ethernet link-layer protocol (Ethertype 0x8915) confined to a single broadcast domain; the experimental RoCE v1.5 runs over IP; RoCE v2 encapsulates packets inside UDP/IP (port 4791) and is routable across IPv4/IPv6 networks. To approach InfiniBand-class performance, RoCE typically requires a lossless Ethernet fabric configured with Priority Flow Control (PFC) and Data Center Bridging (DCB); RoCE v2 additionally defines an ECN-based congestion-control mechanism using CNP frames. RoCE is today the dominant interconnect for GPU clusters in large-scale AI training, with end-to-end latencies as low as 1.3 µs on modern host-channel adapters.
MRC
Multipath Reliable Connection (MRC) is a network protocol designed for training frontier AI models on supercomputer clusters with more than 100,000 GPUs. It extends the RDMA over Converged Ethernet (RoCE) standard from the InfiniBand Trade Association and builds on techniques from the Ultra Ethernet Consortium (UEC), adding SRv6 source routing on top. MRC has been deployed across all of OpenAI's largest NVIDIA GB200 supercomputers, including the Stargate site operated with Oracle Cloud Infrastructure in Abilene, Texas, and in Microsoft Fairwater supercomputers. The specification was published on May 5, 2026 as an Open Compute Project (OCP) contribution and is publicly available. MRC addresses three problems of large-scale synchronous training: it enables two-tier multi-plane networks connecting 131,000 GPUs instead of conventional three- or four-tier designs, virtually eliminates core network congestion via adaptive packet spraying, and routes around failures on a microsecond timescale using static source routing instead of dynamic BGP.
IB
InfiniBand (IB) is a networking standard maintained by the InfiniBand Trade Association (IBTA, founded 1999), in which hosts connect to the fabric via Host Channel Adapters (HCAs) and peripherals via Target Channel Adapters (TCAs). Its switched-fabric topology, credit-based link-level flow control, and native RDMA deliver microsecond latencies (1.3 µs at QDR, <0.6 µs at HDR) and full line rate without packet loss. Successive bandwidth generations are: SDR (8 Gbit/s over 4×, 2001), DDR (16, 2005), QDR (32, 2007), FDR (54.54, 2011), EDR (100, 2014), HDR (200, 2018), NDR (400, 2022), and XDR (800, 2024). InfiniBand supports five message types: RDMA read/write, channel send/receive, transactional operations, multicast, and atomics. The Linux kernel has supported IB since 2.6.11 (2005) via the OpenFabrics Enterprise Distribution (OFED) and the so-called verbs API. After 2014, IB briefly led the TOP500 interconnect ranking, but Ethernet/RoCE later reclaimed market share. In 2019 NVIDIA acquired Mellanox, the last independent vendor, and today IB is the primary scale-out fabric of NVIDIA's AI platforms (Quantum-2, Quantum-X800), used for LLM training in conjunction with NVLink/NVSwitch.
Sources

| Title | Publisher | Type |
|---|---|---|
| Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour | arXiv (Facebook AI Research) | scientific article |
| Horovod: fast and easy distributed deep learning in TensorFlow | arXiv | scientific article |
| ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | arXiv (Microsoft) | scientific article |
| PyTorch DDP: official notes | Meta / PyTorch | documentation |
| NCCL Documentation | NVIDIA | documentation |