Architecture

MoE

Key innovation

Mixture of Experts introduces conditional computation, where only a subset of specialized sub-networks (experts) is activated per input example via a gating network, enabling model capacity to scale without a proportional increase in compute cost.

How it works

The model contains multiple sub-networks (experts). A routing mechanism (gate) selects a subset of experts (e.g., top-2) for each token. Only the selected experts are evaluated for that token; the remaining experts are skipped.
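
The routing step above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names (`top_k_route`, `moe_forward`) and the toy experts are hypothetical, and renormalizing the weights over the selected experts is one common choice rather than a universal rule.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; other experts never run."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](token)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]
routes = top_k_route(logits, k=2)
output = moe_forward([1.0, 1.0], logits, experts, k=2)
```

With these logits, experts 1 and 3 are selected and experts 0 and 2 are never called, which is where the compute saving comes from.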

Problem solved

Scaling dense neural networks is computationally expensive as every token activates all parameters. MoE enables scaling parameter count without proportionally increasing compute cost.
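
A rough parameter count makes the trade-off concrete. The sketch below assumes two-matrix FFN experts without biases and a linear router; the helper names and the layer sizes are illustrative only.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (d_model x d_ff and d_ff x d_model); biases ignored.
    return 2 * d_model * d_ff

def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (total parameters, parameters used per token) for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # assumption: router is a single linear projection
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active

total, active = moe_layer_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
ratio = total / active
```

For these sizes the layer holds about four times as many parameters as any single token uses, so capacity grows with the number of experts while per-token compute grows only with top-k.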

Components

Expert Networks

Collection of N parallel sub-networks (experts), each specializing in a distinct subset of the input space; in the Transformer context, experts are typically FFN networks.

A set of N parallel sub-networks, each independently parameterized. In the Transformer context, experts are typically feed-forward networks (FFN) with identical architecture but separate weight matrices. Each expert learns to specialize on a different subset of the input distribution as a result of competitive routing.

FFN experts

Standard two-layer feed-forward networks used as experts, replacing the dense FFN sub-layer in a Transformer block. The most common variant in LLM MoE architectures.

Shared expert

One or more experts that receive all tokens unconditionally, alongside the sparsely-routed experts. Used in architectures such as DeepSeek-MoE to separate general from specialized computation.
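
A minimal sketch of how a shared expert combines with routed experts, assuming the two outputs are simply summed. The function name and the toy experts are hypothetical, and the routing decisions are passed in precomputed.

```python
def moe_with_shared_expert(token, shared_expert, routed_experts, routes):
    """routes: [(expert_index, weight), ...] chosen by the router for this token."""
    out = list(shared_expert(token))  # the shared expert always runs
    for idx, weight in routes:        # routed experts run only when selected
        y = routed_experts[idx](token)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): identity as the shared expert, scalers as routed experts.
shared = lambda x: list(x)
routed = [lambda x: [2.0 * v for v in x], lambda x: [3.0 * v for v in x]]
result = moe_with_shared_expert([1.0, 2.0], shared, routed, routes=[(1, 1.0)])
```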

Gating / Router Network

Gating network trained jointly with the experts; produces weights or top-k expert selections for each input token / example.

A trainable network (typically a linear projection followed by softmax) that computes a score for each expert given the current input token. In sparse MoE, only the top-k experts by score are activated; in soft MoE, all experts are weighted and summed. The router parameters are optimized jointly with the expert parameters via gradient descent.

Noisy top-k gating

Adds tunable Gaussian noise to router logits before top-k selection to improve load balancing. Introduced by Shazeer et al. (2017).

Softmax router

Standard softmax over expert logits, used in Switch Transformer (top-1) and many subsequent architectures.
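
The noisy top-k variant can be sketched as follows, loosely following the formulation in Shazeer et al. (2017): Gaussian noise scaled by a softplus of a second set of logits is added before the top-k cut, and the softmax is taken over the surviving logits only. The function and argument names are assumptions, and the logits are passed in directly rather than computed from learned projections.

```python
import math
import random

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def noisy_top_k_gate(clean_logits, noise_logits, k, rng):
    """Return one gate value per expert; all but k of them are exactly zero."""
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    keep = set(sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k])
    m = max(noisy[i] for i in keep)
    # Softmax restricted to the kept experts; the rest get a gate of zero.
    exps = [math.exp(noisy[i] - m) if i in keep else 0.0 for i in range(len(noisy))]
    total = sum(exps)
    return [e / total for e in exps]

gates = noisy_top_k_gate([1.0, 0.5, -0.2, 0.3], [0.0, 0.0, 0.0, 0.0],
                         k=2, rng=random.Random(0))
```

Because of the noise, the selected pair can differ between calls for the same input, which is what spreads tokens across experts early in training.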

Load Balancing Mechanism

Auxiliary loss term added to the primary training loss, penalizing uneven token distribution across experts.

An auxiliary loss term added to the training objective that measures the imbalance of token routing across experts and penalizes skewed distributions. Without this mechanism, the router tends to collapse onto a small number of dominant experts through a self-reinforcing feedback loop. The specific formulation varies across implementations (importance loss + load loss in Shazeer et al. 2017; simplified scalar auxiliary loss in Switch Transformer).
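
As an illustration, the Switch Transformer form of the auxiliary loss is alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below computes it for top-1 routing in plain Python; the function name is hypothetical, and a real implementation would operate on tensors so that gradients flow through P_i.

```python
def switch_aux_loss(router_probs, alpha=1e-2):
    """router_probs: one list of expert probabilities per token."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in router_probs:
        counts[max(range(num_experts), key=lambda i: p[i])] += 1
    f = [c / num_tokens for c in counts]
    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = switch_aux_loss([[0.9, 0.1], [0.1, 0.9]])   # tokens split evenly
collapsed = switch_aux_loss([[0.9, 0.1], [0.9, 0.1]])  # every token picks expert 0
```

The collapsed routing yields the larger loss, so minimizing this term pushes the router toward an even split.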

Implementation

Reference implementations

lucidrains/mixture-of-experts (GitHub)

Python · Phil Wang (lucidrains)

Implementation pitfalls

Expert collapse and load imbalance · Critical

Without explicit load balancing, the router converges to routing most or all tokens to a small subset of experts through a self-reinforcing feedback loop: favored experts receive more training signal, become better, and are selected more often. This leaves most experts undertrained and wastes model capacity.

Fix: Add an auxiliary load balancing loss to the training objective. Alternatively, use auxiliary-loss-free approaches such as expert-wise routing bias with dynamic updates (DeepSeek approach). Monitor per-expert token counts during training. Consider noisy top-k gating.
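
A hedged sketch of the bias-based alternative: each expert carries a bias that is added to its routing score for selection only, and after every batch the bias is nudged down for overloaded experts and up for underloaded ones. The function names, the sign-based update, and the update rate are assumptions made for illustration, not the exact published procedure.

```python
def update_routing_bias(bias, tokens_per_expert, update_rate=0.001):
    """Nudge each expert's bias toward the mean load observed in the last batch."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        if load > mean_load:
            b -= update_rate  # overloaded: make the expert less likely to be picked
        elif load < mean_load:
            b += update_rate  # underloaded: make it more likely to be picked
        new_bias.append(b)
    return new_bias

def select_experts(scores, bias, k):
    """Pick the top-k experts by biased score.

    Assumption: the bias affects only which experts are selected; the gate
    weights applied to their outputs would still come from the raw scores.
    """
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

new_bias = update_routing_bias([0.0, 0.0, 0.0], tokens_per_expert=[10, 2, 6],
                               update_rate=0.1)
before = select_experts([0.5, 0.45, 0.1], [0.0, 0.0, 0.0], k=1)
after = select_experts([0.5, 0.45, 0.1], new_bias, k=1)
```

In this toy case the overloaded expert 0 loses the token to expert 1 after a single update, without any extra term in the training loss.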

Difficult auxiliary loss coefficient tuning · High

The auxiliary loss coefficient (alpha) must be carefully tuned. Too large a value causes the auxiliary loss to dominate the training objective, degrading model quality. Too small a value fails to prevent expert collapse. The optimal value depends on model scale, batch size, and number of experts.

Fix: Start with values in the range suggested for the chosen architecture (e.g., alpha=1e-2 in Switch Transformer). Monitor both load balance metrics and downstream task loss. Consider sweeping alpha on a smaller-scale model before committing to a full training run.

Capacity factor overflow and token dropping · Medium

When more tokens are routed to an expert than its capacity allows, the excess tokens are dropped (or passed through a residual connection without expert processing). Token dropping degrades model quality, especially for tokens in high-demand input regions.

Fix: Set the capacity factor above 1.0 (e.g., 1.25) to provide buffer. Monitor overflow rates during training. Consider expert choice routing (experts select top-k tokens rather than tokens selecting experts) to guarantee perfect load balancing.
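
The capacity rule and the resulting overflow can be sketched as below, assuming top-1 routing, first-come-first-served dispatch, and capacity rounded up. Both function names are hypothetical, and frameworks differ in how they round and in what they do with overflowing tokens.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # Even share of tokens per expert, scaled by the capacity factor.
    return math.ceil(num_tokens / num_experts * capacity_factor)

def dispatch_with_capacity(assignments, num_experts, capacity):
    """assignments[t] is the expert chosen for token t; returns (kept, dropped) token ids."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token_id, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_id)
        else:
            dropped.append(token_id)  # overflow: this token skips expert processing
    return kept, dropped

capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.25)
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 1, 2, 3, 1], 4, capacity)
overflow_rate = len(dropped) / 8
```

Tracking `overflow_rate` per layer during training is one simple way to see whether the capacity factor is too tight.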

All-to-all communication overhead in expert parallelism · High

Distributed MoE with expert parallelism requires all-to-all communication to dispatch tokens to their assigned expert devices and collect results. At large scale, this communication overhead can become the dominant bottleneck, especially on clusters with limited inter-node bandwidth.

Fix: Minimize the number of MoE layers (use MoE only in a fraction of Transformer layers). Use top-1 instead of top-2 routing to halve dispatch volume. Overlap communication with computation where the framework supports it. Profile all-to-all latency early.
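
A back-of-envelope estimate shows why top-1 routing halves the traffic. The sketch assumes the worst case in which every routed token copy crosses a device boundary once on dispatch and once on combine; the function name, the 2-byte activation size, and the batch dimensions are illustrative assumptions.

```python
def all_to_all_bytes_per_layer(num_tokens, d_model, top_k, bytes_per_value=2):
    """Upper-bound estimate of all-to-all traffic for one MoE layer, in bytes."""
    one_way = num_tokens * top_k * d_model * bytes_per_value  # dispatch to experts
    return 2 * one_way                                        # plus combine on the way back

top2 = all_to_all_bytes_per_layer(num_tokens=4096, d_model=1024, top_k=2)
top1 = all_to_all_bytes_per_layer(num_tokens=4096, d_model=1024, top_k=1)
```

The estimate scales linearly with top-k and with the number of MoE layers, which is why both appear in the fix above.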

Training instability caused by top-k routing discontinuity · Medium

The top-k selection operation is not differentiable, which can introduce high-variance gradients through the router and cause training instability, especially at large learning rates or with aggressive capacity constraints.

Fix: Use gradient clipping. Reduce the learning rate relative to the dense baseline. Apply noisy gating during training to smooth routing decisions. Consider soft MoE or differentiable routing variants if instability persists.
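
For reference, gradient clipping by global norm can be sketched in plain Python as follows. The function name and the flat list of gradients are simplifications; in practice the clipping utility of the training framework would be applied to all parameters, router included.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Scaling by a single factor preserves the gradient direction, so occasional high-variance router gradients are damped without biasing the update.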

Evolution

Original paper · 1991 · Neural Computation, vol. 3, no. 1, pp. 79–87 · Robert A. Jacobs

Adaptive Mixtures of Local Experts

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, Geoffrey E. Hinton

1991

Concept of MoE defined — Jacobs, Jordan, Nowlan, Hinton

Inflection point

Jacobs et al. introduce the Mixture of Experts architecture: a system of parallel expert networks with a gating network that produces soft weights over experts via softmax, trained jointly via a supervised learning procedure. Demonstrates task decomposition on a vowel discrimination task.

Adaptive Mixtures of Local Experts (paper)

1994

Hierarchical Mixtures of Experts (HME) — Jordan & Jacobs

Jordan and Jacobs extend the MoE framework to a hierarchical tree structure where each node is itself a gating network, enabling recursive decomposition of the input space. Training is formalized with an EM algorithm.

Hierarchical Mixtures of Experts and the EM Algorithm (paper)

2017

Sparsely-Gated MoE for deep networks — Shazeer et al., ICLR 2017

Inflection point

Shazeer et al. (Google Brain) introduce the Sparsely-Gated Mixture-of-Experts layer: sparse top-k gating with noisy gating for load balancing, applied convolutionally between LSTM layers. Achieves over 1000x improvement in model capacity with minor computational overhead. Demonstrates models with up to 137 billion parameters. This paper establishes the modern sparse MoE paradigm for large-scale deep learning.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (paper)

2020

GShard — scaling MoE to 600B parameters with automatic sharding

Lepikhin et al. (Google) apply sparse MoE to Transformer encoder-decoder models at 600B parameter scale using automatic sharding (XLA SPMD). Introduces per-expert capacity limits and random routing for the second expert in top-2 setups to improve load balancing.

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (paper)

2021

Switch Transformer — simplification to top-1 routing and scaling to one trillion parameters

Inflection point

Fedus, Zoph, and Shazeer (Google) demonstrate that top-1 routing (each token routed to exactly one expert) achieves competitive quality with a simpler implementation and lower communication overhead than top-2. They scale models to 1.6 trillion parameters and introduce a simplified auxiliary load balancing loss.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (paper)

2024

Auxiliary-loss-free load balancing — DeepSeek and subsequent MoE architectures

Work from DeepSeek, adopted in architectures such as DeepSeek-V3, shows that auxiliary-loss-free load balancing (an expert-wise bias on the routing scores, adjusted during training) can achieve better model quality than traditional auxiliary-loss approaches, because it avoids the interference gradients that load balancing losses introduce into training.

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts (paper)

Sources

Switch Transformers
