The starting point is the standard training-cost approximation C ≈ 6·N·D, where C is the training compute in FLOPs, N is the parameter count, and D is the number of training tokens. Hoffmann et al. (2022) applied three complementary approaches: (1) training models of fixed size for varying numbers of tokens; (2) IsoFLOP curves, where for each fixed budget C the model size N is varied (with D set by the budget) and the pair (N*, D*) with the lowest final loss is identified; and (3) a parametric loss model L(N, D) fitted to all 400+ runs. All three approaches agreed that the optimal N and D should grow in roughly equal proportion with C, corresponding to scaling exponents a ≈ 0.5 and b ≈ 0.5 in the relations N* ∝ C^a and D* ∝ C^b. In practice this yields the rule of thumb that compute-optimal training uses approximately 20 tokens per parameter. The rule assumes a sufficiently large, deduplicated corpus, a decoder-only transformer architecture, and a standard learning-rate schedule tuned to the chosen number of steps.
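As a rough illustration of how the two relations above combine, substituting D = r·N into C ≈ 6·N·D gives N = sqrt(C / (6·r)). The sketch below applies this with r = 20; the function name `compute_optimal_allocation` and its parameters are hypothetical, chosen only for this example, and the result is a back-of-the-envelope estimate rather than a fitted scaling law.

```python
import math


def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget C into (N, D) using C ~ 6*N*D and D = r*N.

    Hypothetical helper for illustration: r defaults to the ~20
    tokens-per-parameter rule of thumb, which is itself approximate.
    """
    if flops_budget <= 0 or tokens_per_param <= 0:
        raise ValueError("budget and ratio must be positive")
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = compute_optimal_allocation(1e21)
    print(f"N ~ {n:.3g} parameters, D ~ {d:.3g} tokens")
```

For a budget of 1e21 FLOPs this gives roughly 2.9 billion parameters and 58 billion tokens. Passing a different `tokens_per_param` shows how the split shifts when the 20:1 ratio is deliberately not followed.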
Earlier scaling laws (Kaplan et al., 2020) suggested that under a constrained compute budget one should mainly increase the number of model parameters, which led to very large but significantly undertrained models. Compute-Optimal Training addresses this misallocation: it specifies how to split a fixed FLOPs budget between model size and training data volume.
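The 20:1 rule minimizes pretraining loss for a given training budget but does not account for inference cost. For models that are heavily deployed in production, it is often worth training a smaller model on more tokens than the rule prescribes: the extra training compute is repaid by cheaper inference at comparable quality.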
The rule assumes a sufficiently large, deduplicated corpus. Repeating the same data over many epochs breaks this assumption, because D then overstates the number of "fresh" tokens the model actually sees.
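The scaling exponents were fit on models up to around 16B parameters, with IsoFLOP budgets ranging from roughly 6e18 to 3e21 FLOPs. Chinchilla itself, at roughly 5e23 FLOPs, was already an extrapolation beyond that range, and projecting to budgets orders of magnitude larger carries growing uncertainty.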
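The inference-cost and data-repetition caveats above can be made concrete with a rough sketch. It assumes inference costs about 2·N FLOPs per processed token (a common forward-pass approximation that is not stated in this text), and it assumes the two models being compared reach comparable quality, which the code cannot verify. All function names and the example numbers are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training cost (6*N*D) plus inference cost (assumed 2*N per token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens


def break_even_inference_tokens(n_large, d_large, n_small, d_small):
    """Inference volume beyond which the smaller, longer-trained model
    has lower lifetime FLOPs than the larger one.

    Solves 6*N1*D1 + 2*N1*T = 6*N2*D2 + 2*N2*T for T.
    Assumes both models are of comparable quality.
    """
    if n_small >= n_large:
        raise ValueError("n_small must be smaller than n_large")
    extra_training = 6.0 * (n_small * d_small - n_large * d_large)
    saving_per_token = 2.0 * (n_large - n_small)
    # If the small model is also cheaper to train, it wins from the start.
    return max(0.0, extra_training / saving_per_token)


def epochs_required(train_tokens, unique_tokens):
    """Passes over the corpus needed to reach D; values above 1
    mean data is repeated and the fresh-token assumption no longer holds."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return train_tokens / unique_tokens


if __name__ == "__main__":
    # Illustrative numbers only: a 70B model on 1.4T tokens versus a
    # hypothetical 35B model on 4T tokens assumed to match its quality.
    t = break_even_inference_tokens(70e9, 1.4e12, 35e9, 4e12)
    print(f"break-even at ~{t:.3g} inference tokens")
    print(f"epochs over a 2T-token corpus: {epochs_required(4e12, 2e12):.1f}")
```

With these illustrative inputs the break-even point is about 3.6e12 inference tokens, and reaching 4T training tokens from a 2T-token corpus would take two passes, which is the data-repetition regime the rule does not cover.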
Kaplan et al. (2020) publish "Scaling Laws for Neural Language Models", suggesting that under a constrained compute budget one should mainly grow the model rather than the dataset.
Hoffmann et al. (2022) introduce the compute-optimal rule and confirm it empirically with Chinchilla, a 70B-parameter model trained on 1.4T tokens, which outperforms Gopher 280B at the same FLOPs budget.
Touvron et al. (2023) train the 7B and 13B LLaMA models on about one trillion tokens each, intentionally going well beyond the compute-optimal point to achieve cheaper inference at comparable quality.
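The claims in this timeline can be checked with simple arithmetic using C ≈ 6·N·D and the D/N ratio. The sketch below uses nominal model sizes; Gopher's 300B training tokens come from its own report rather than from this text, and the LLaMA figures are the rounded 7B and 13B sizes with 1.0T tokens each. The helper names are hypothetical.

```python
# (parameters, training tokens), nominal published figures
MODELS = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1.0e12),
    "LLaMA-13B": (13e9, 1.0e12),
}


def training_flops(n_params, n_tokens):
    """Approximate training compute, C ~ 6*N*D."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params, n_tokens):
    """The D/N ratio; about 20 at the compute-optimal point."""
    return n_tokens / n_params


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(
            f"{name:<11} C ~ {training_flops(n, d):.2e} FLOPs, "
            f"D/N ~ {tokens_per_param(n, d):.1f}"
        )
```

Under this approximation Gopher and Chinchilla land within about 15% of each other in training compute (roughly 5.0e23 versus 5.9e23 FLOPs), while their D/N ratios are about 1 and 20. The LLaMA 7B and 13B models sit at roughly 143 and 77 tokens per parameter, far past the 20:1 point.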
Total compute budget C in FLOPs allocated for pretraining, jointly determining N and D.
The D/N ratio. According to the Chinchilla results the optimum is around 20 tokens per parameter.
Number of model parameters, chosen jointly with D for budget C.
Number of unique tokens processed during pretraining.
Compute-Optimal Training is a rule for allocating a FLOPs budget and does not depend on any specific hardware; it applies equally to GPU and TPU clusters.