Training step: (1) Each of N workers receives a B/N shard of the minibatch and runs forward + backward on a local copy of the model. (2) All workers issue an all-reduce on the gradient buffer (typically ring all-reduce in NCCL/MPI or tree-reduce on a hierarchy). (3) When the collective completes, every worker holds an identical sum/mean of gradients. (4) The local optimizer (Adam, SGD) updates the weights โ because all replicas started identical and received identical gradients, the weights stay in sync without further communication. FSDP/ZeRO add extra all-gather and reduce-scatter operations to shard optimizer state, gradients, and parameters.
Training large models on a single GPU is infeasible due to time and memory limits. Asynchronous parameter-server schemes scale poorly because of stale gradients. Synchronous training solves both while preserving convergence equivalent to single-node SGD on a large minibatch.
The slowest worker (e.g., GPU thermal throttling or a failing NIC) delays the whole step โ risk grows with cluster size.
Very large global batches (>32k) can degrade generalization despite maintained training convergence.
All-reduce time scales with the number of nodes; for small models, communication can dominate compute.
A single GPU/node failure in classic sync training forces a restart from checkpoint โ in 10k+ GPU clusters MTBF is on the order of hours.
Dean et al. introduce the asynchronous parameter server as the first distributed deep-learning architecture.
Linear scaling rule + warmup enable training ImageNet in 1h on 256 GPUs. Establishes synchronous training as the standard.
Sergeev & Del Balso release Horovod โ a synchronous training framework based on ring all-reduce with NCCL/MPI.
Rajbhandari et al. introduce ZeRO, reducing memory by sharding optimizer state, gradients, and parameters (stage 1/2/3).
Fully Sharded Data Parallel (a ZeRO-3 port into PyTorch core) becomes the dominant LLM training mechanism.
xAI Colossus, Meta Llama 3, and OpenAI GPT-4 are trained on 25-100k+ GPU clusters using synchronous SGD with RoCE/IB scale-out.
Number of GPUs/TPUs participating in training. Scales the global minibatch and demands linear LR scaling.
Total minibatch B = per_worker ร N. Too large causes a generalization gap.
Ring vs tree vs hierarchical (NVLink+IB) all-reduce. Affects scaling and topology tolerance.
Micro-batches accumulated before all-reduce โ emulates a larger effective batch under memory constraints.
All workers perform exactly the same operations at every step โ no conditional execution at the system level.
Synchronous training is the primary workload of NVIDIA H100/H200/B200 clusters with NVLink + InfiniBand/RoCE.
Google TPU pods are built around synchronous training with a dedicated ICI all-reduce fabric.