The model contains multiple sub-networks (experts). A routing mechanism (gate) selects a subset of experts (e.g., top-2) for each token. Only the selected experts are activated during computation; the others are skipped.
Scaling dense neural networks is computationally expensive because every token activates all parameters. MoE enables scaling the parameter count without proportionally increasing compute cost.
A set of N parallel sub-networks, each independently parameterized. In the Transformer context, experts are typically feed-forward networks (FFNs) with identical architecture but separate weight matrices. As a result of competitive routing, each expert tends to specialize on a different subset of the input distribution.
A trainable network (typically a linear projection followed by softmax) that computes a score for each expert given the current input token. In sparse MoE, only the top-k experts by score are activated; in soft MoE, all experts are weighted and summed. The router parameters are optimized jointly with the expert parameters via gradient descent.
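The router described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `router_scores`, `top_k_gate`, and `W_router` are hypothetical, the toy sizes are made up, and renormalizing the gate weights over the selected experts is one common convention assumed here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_scores(token, weights):
    """Linear projection: one logit per expert (weights has one row per expert)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]

def top_k_gate(token, weights, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    probs = softmax(router_scores(token, weights))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    # Assumption: gate weights are renormalized over the selected experts.
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical toy setup: 4 experts, 3-dimensional token representation.
W_router = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
token = [2.0, 1.0, -1.0]
gates = top_k_gate(token, W_router, k=2)  # experts 0 and 3 have the largest logits
```

In a soft MoE, the top-k step would be skipped and every expert would receive its softmax weight.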
An auxiliary loss term added to the training objective that measures the imbalance of token routing across experts and penalizes skewed distributions. Without this mechanism, the router tends to collapse onto a small number of dominant experts through a self-reinforcing feedback loop. The specific formulation varies across implementations (importance loss + load loss in Shazeer et al. 2017; simplified scalar auxiliary loss in Switch Transformer).
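As one concrete instance, the simplified auxiliary loss from Switch Transformer is alpha · N · Σ f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below assumes top-1 routing; the function name, argument layout, and the alpha value are illustrative.

```python
def switch_aux_loss(router_probs, assignments, num_experts, alpha):
    """Compute alpha * N * sum_i f_i * P_i for one batch.

    router_probs: per-token list of router probabilities over experts.
    assignments:  per-token index of the expert the token was dispatched to.
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    p = [sum(tok[i] for tok in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: 4 tokens spread evenly over 4 experts, uniform probabilities.
balanced = switch_aux_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4, alpha=0.01)

# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed = switch_aux_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4, alpha=0.01)
```

With these inputs the balanced case evaluates to alpha and the fully collapsed case to alpha · N, which is why minimizing the term pushes the router toward a uniform distribution.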
Without explicit load balancing, the router converges to routing most or all tokens to a small subset of experts through a self-reinforcing feedback loop: favored experts receive more training signal, become better, and are selected more often. This leaves most experts undertrained and wastes model capacity.
The auxiliary loss coefficient (alpha) must be carefully tuned. Too large a value causes the auxiliary loss to dominate the training objective, degrading model quality. Too small a value fails to prevent expert collapse. The optimal value depends on model scale, batch size, and number of experts.
When more tokens are routed to an expert than its capacity allows, the excess tokens are dropped (or passed through a residual connection without expert processing). Token dropping degrades model quality, especially for tokens in high-demand input regions.
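A minimal sketch of how capacity overflow leads to dropped tokens, assuming tokens are admitted in batch order (first come, first served). The function name and the toy batch are hypothetical; real implementations may prioritize tokens differently.

```python
def dispatch_with_capacity(assignments, num_experts, capacity):
    """Admit tokens to experts in order; tokens arriving after an expert is full are dropped.

    Returns (kept, dropped): kept maps expert index -> list of token indices,
    dropped lists the token indices that overflowed and would bypass the
    expert via the residual connection.
    """
    kept = {e: [] for e in range(num_experts)}
    dropped = []
    for token_idx, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(token_idx)
        else:
            dropped.append(token_idx)
    return kept, dropped

# Hypothetical batch of 8 tokens, 4 experts, capacity of 2 tokens per expert.
assignments = [0, 0, 0, 1, 2, 0, 3, 1]
kept, dropped = dispatch_with_capacity(assignments, num_experts=4, capacity=2)
```

Here expert 0 is requested four times but can hold only two tokens, so the third and fourth requests (tokens 2 and 5) are dropped while expert 2 and expert 3 run below capacity.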
Distributed MoE with expert parallelism requires all-to-all communication to dispatch tokens to their assigned expert devices and collect results. At large scale, this communication overhead can become the dominant bottleneck, especially on clusters with limited inter-node bandwidth.
The top-k selection operation is not differentiable, which can introduce high-variance gradients through the router and cause training instability, especially at large learning rates or with aggressive capacity constraints.
Jacobs et al. introduce the Mixture of Experts architecture: a system of parallel expert networks with a gating network that produces soft weights over experts via softmax, trained jointly via a supervised learning procedure. They demonstrate task decomposition on a vowel discrimination task.
Jordan and Jacobs extend the MoE framework to a hierarchical tree structure where each node is itself a gating network, enabling recursive decomposition of the input space. Training is formalized with an EM algorithm.
Shazeer et al. (Google Brain) introduce the Sparsely-Gated Mixture-of-Experts layer: sparse top-k gating with noisy gating for load balancing, applied convolutionally between LSTM layers. They report a more than 1000x increase in model capacity with only minor losses in computational efficiency, and demonstrate models with up to 137 billion parameters. This paper establishes the modern sparse MoE paradigm for large-scale deep learning.
Lepikhin et al. (Google) apply sparse MoE to Transformer encoder-decoder models at 600B parameter scale using automatic sharding (XLA SPMD). They introduce per-expert capacity limits and random routing for the second expert in top-2 setups to improve load balancing.
Fedus, Zoph, and Shazeer (Google) demonstrate that top-1 routing (each token routed to exactly one expert) achieves competitive quality with simpler implementation and lower communication overhead than top-2. Scale to 1.6 trillion parameters. Introduce a simplified auxiliary load balancing loss.
The DeepSeek-MoE line of work (notably DeepSeek-V3) reports that auxiliary-loss-free load balancing, implemented as an expert-wise bias on routing scores, can achieve better model quality than traditional auxiliary loss approaches because it avoids the interference gradients that load balancing losses introduce into training.
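The expert-wise bias idea can be sketched as follows. This is a simplified illustration under stated assumptions: the bias is used only to choose experts (not to weight their outputs), it is nudged by a fixed step after each batch, and the names `select_with_bias`, `update_bias`, and `step` are hypothetical.

```python
def select_with_bias(scores, bias, k):
    """Pick the top-k experts by (score + bias); the bias affects selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, load, step):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - step)
        elif l < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

# Hypothetical router that always prefers expert 0 before any bias is applied.
scores = [0.9, 0.5, 0.4, 0.3]
bias = [0.0] * 4
first_choice = select_with_bias(scores, bias, k=1)

for _ in range(10):                  # simulate 10 batches of 8 identical tokens
    load = [0] * 4
    for _token in range(8):
        for e in select_with_bias(scores, bias, k=1):
            load[e] += 1
    bias = update_bias(bias, load, step=0.1)
```

No gradient flows through the bias update, so the balancing pressure does not compete with the language modeling loss in the way an auxiliary loss term does.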
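Time complexity: O(k · C_expert) per token per MoE layer, where k is the number of activated experts and C_expert is the cost of one expert forward pass (the cost of computing router scores is comparatively small). Space complexity: O(N · P_expert) total parameters, where N is the number of experts and P_expert is the parameter count of one expert.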
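A small worked example makes the gap between stored and active parameters concrete. It assumes each expert is a two-matrix FFN without biases, so P_expert = 2 · d_model · d_ff; the layer sizes are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Parameters of a two-matrix FFN expert, ignoring biases (assumed expert shape)."""
    return 2 * d_model * d_ff

def moe_layer_params(num_experts, top_k, d_model, d_ff):
    """Return (total, active_per_token) expert parameters for one MoE layer."""
    p_expert = ffn_params(d_model, d_ff)
    total = num_experts * p_expert   # space: O(N * P_expert)
    active = top_k * p_expert        # per-token compute scales with k, not N
    return total, active

# Hypothetical sizes: 8 experts, top-2 routing, d_model=1024, d_ff=4096.
total, active = moe_layer_params(num_experts=8, top_k=2, d_model=1024, d_ff=4096)
```

With these sizes the layer stores about 67.1 million expert parameters but touches about 16.8 million per token, a ratio of N / k = 4.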
In distributed MoE with expert parallelism, tokens must be dispatched from their originating device to the device holding the selected expert, and results must be collected back. This all-to-all communication scales with batch size and number of devices and can become the dominant bottleneck at large scale.
Total number of expert sub-networks. Controls the parameter count of the MoE layer. Increasing N scales model capacity without increasing per-token FLOPs. Common values range from 8 to thousands.
Number of experts activated per token per MoE layer. k=1 (Switch Transformer) minimizes compute; k=2 (as in GShard) is a common choice in practice. Higher k tends to improve routing stability but increases per-token FLOPs.
Multiplier on the average number of tokens per expert per batch. Determines the maximum token buffer per expert. Values above 1.0 reduce token dropping at the cost of higher memory. Values below 1.0 increase dropping.
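The relationship can be written as capacity = capacity_factor · (tokens per batch / number of experts). The sketch below rounds up with `math.ceil`, which is an assumption; implementations differ in rounding and in whether they scale by k for top-k routing.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum tokens buffered per expert: ceil(capacity_factor * tokens / experts)."""
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)

# Hypothetical batch of 1024 tokens routed across 8 experts (average 128 per expert).
cap_tight = expert_capacity(1024, 8, capacity_factor=1.0)   # no slack: 128 slots
cap_slack = expert_capacity(1024, 8, capacity_factor=1.25)  # 25% headroom: 160 slots
```

The extra 32 slots per expert in the second case are memory reserved whether or not they are filled, which is the cost side of the tradeoff.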
Scaling coefficient (alpha) for the load balancing auxiliary loss added to the training objective. Too high causes instability and degrades model quality; too low leads to expert collapse. Requires careful tuning per model scale.
In Transformer-based MoE models, not every FFN layer is replaced by a MoE layer. The interleaving pattern (e.g., every other layer, every 4th layer) controls the tradeoff between expert capacity and communication overhead.
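An interleaving pattern can be expressed as a simple per-layer plan. The helper below is hypothetical, and placing the MoE layer on every `moe_every`-th block counted from 1 is an assumed convention; models differ in the offset they use.

```python
def layer_plan(num_layers, moe_every):
    """Label each Transformer block's FFN as 'moe' or 'dense'.

    Every `moe_every`-th block (1-indexed) uses an MoE layer; the rest stay dense.
    """
    return ["moe" if (i + 1) % moe_every == 0 else "dense" for i in range(num_layers)]

every_other = layer_plan(num_layers=6, moe_every=2)
every_fourth = layer_plan(num_layers=8, moe_every=4)
```

Fewer MoE layers mean fewer all-to-all exchanges per forward pass, at the price of fewer expert parameters overall.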
In the original soft MoE formulation (Jacobs et al. 1991), all experts are weighted and summed (all_paths_active). The sparse top-k variant (Shazeer et al. 2017) is dominant in modern LLM applications.
A learned linear router computes a score for each of the N experts given the input token representation. Only the top-k experts (k typically 1 or 2) are activated; the others contribute no computation. Router scores are used to weight and sum the outputs of the selected experts. Noisy gating (additive Gaussian noise before top-k selection) is a common variant for improved load balancing.
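The following sketch combines noisy top-k selection with the weighted sum of expert outputs. It is a simplification: it uses a fixed `noise_std` rather than a learned, input-dependent noise scale, applies softmax over the selected logits only, and stands in scalar functions for the expert FFNs. All names are hypothetical.

```python
import math
import random

def noisy_top_k_combine(logits, expert_fns, x, k, noise_std=0.0, rng=None):
    """Add Gaussian noise to router logits, keep the top-k, softmax over them,
    and return the gate-weighted sum of the selected experts' outputs."""
    rng = rng or random.Random(0)
    noisy = [l + rng.gauss(0.0, noise_std) for l in logits]
    chosen = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Softmax restricted to the selected experts; unselected experts are never called.
    m = max(noisy[i] for i in chosen)
    exps = {i: math.exp(noisy[i] - m) for i in chosen}
    z = sum(exps.values())
    gates = {i: exps[i] / z for i in chosen}
    output = sum(gates[i] * expert_fns[i](x) for i in chosen)
    return output, gates

# Hypothetical scalar "experts" standing in for FFNs.
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v, lambda v: v * v]
logits = [0.0, 3.0, 3.0, -1.0]
y, gates = noisy_top_k_combine(logits, experts, x=2.0, k=2, noise_std=0.0)
```

With `noise_std=0.0` the selection is deterministic; a positive value lets lower-ranked experts occasionally win the top-k cut, which spreads training signal more evenly.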
Within a single device, expert computations are fully parallel across tokens assigned to that expert. Across devices, expert parallelism is used: each device holds a subset of experts and processes only the tokens routed to it.
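A minimal sketch of the bookkeeping behind expert parallelism: map each expert to a device, then group tokens by destination device as an all-to-all dispatch would. The contiguous placement and the function names are assumptions, and the sketch requires the number of experts to be divisible by the number of devices.

```python
from collections import defaultdict

def expert_to_device(expert, num_experts, num_devices):
    """Contiguous placement: each device holds num_experts // num_devices experts."""
    experts_per_device = num_experts // num_devices
    return expert // experts_per_device

def build_dispatch(assignments, num_experts, num_devices):
    """Group token indices by the device that holds their assigned expert."""
    per_device = defaultdict(list)
    for token_idx, expert in enumerate(assignments):
        per_device[expert_to_device(expert, num_experts, num_devices)].append(token_idx)
    return dict(per_device)

# Hypothetical setup: 8 experts on 4 devices (2 experts each), 6 tokens, top-1 routing.
assignments = [0, 5, 1, 7, 4, 2]
dispatch = build_dispatch(assignments, num_experts=8, num_devices=4)
```

Each device then runs its local experts on the tokens it received, and a second all-to-all returns the outputs to the tokens' originating devices.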
Sparse MoE training and inference at scale require GPU clusters with high-bandwidth interconnects (NVLink, InfiniBand) for efficient all-to-all communication in expert parallelism. Expert FFN computations are dense matrix multiplications that benefit from Tensor Core acceleration.
Google's GShard and Switch Transformer were developed and trained on TPU pods; GShard used XLA SPMD for automatic sharding, while Switch Transformer was implemented with Mesh TensorFlow. The TPU inter-chip interconnect (ICI) provides high-bandwidth all-to-all communication well-suited to expert parallelism.