Hoffmann et al. ran over 400 training jobs across model sizes from 70M to 16B parameters and dataset sizes from 5B to 500B tokens. They applied three independent methods to the data: (1) IsoFLOP curves: for fixed C, vary N and D and find the minimum of L; (2) parametric loss fit: fit L(N, D) = E + A/N^α + B/D^β; (3) training-curve envelopes: fix a set of model sizes, vary the number of training tokens, and take the lowest loss reached at each compute level. All three methods converged on roughly the same conclusion: optimal N* and D* both scale approximately as C^0.5, which translates to ~20 tokens per parameter. They then trained Chinchilla 70B on 1.4T tokens to verify the prediction. The model outperformed the 280B Gopher on MMLU (67.5% vs 60.0%) and on most other benchmarks.
Earlier Kaplan et al. (2020) scaling laws suggested favoring much larger models at the expense of training tokens for any given compute budget. This led to models like GPT-3 and Gopher, which were undertrained and suboptimal in resource allocation.
Hoffmann et al.'s first method: for a fixed compute budget C, vary N and D, measure L, and find the (N*, D*) that minimizes loss. Repeating this across multiple C values gives the optimal frontier.
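A minimal sketch of the IsoFLOP step for a single compute budget. It assumes C ≈ 6·N·D to recover D from N, fits a parabola to loss versus log10(N), and takes the vertex as N*. The function name `isoflop_minimum` is hypothetical and the loss values are synthetic, not measurements from the paper.

```python
import numpy as np

def isoflop_minimum(n_params, losses):
    """Fit a parabola to loss vs. log10(N) and return the N at its vertex."""
    x = np.log10(np.asarray(n_params, dtype=float))
    a, b, _c = np.polyfit(x, np.asarray(losses, dtype=float), deg=2)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return 10 ** (-b / (2 * a))

# One hypothetical compute budget and five model sizes trained under it.
C = 1e21
n_grid = np.array([2e8, 5e8, 1e9, 2e9, 5e9])

# Synthetic losses shaped like a valley with its bottom at N = 1e9.
toy_losses = 2.0 + 0.05 * (np.log10(n_grid) - 9.0) ** 2

n_star = isoflop_minimum(n_grid, toy_losses)
d_star = C / (6 * n_star)  # tokens implied by C ≈ 6·N·D (assumption)
```

Running the same fit at several budgets and regressing log N* on log C would give the scaling exponent for the frontier.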
Second method: fit the loss surface as a sum of an irreducible loss E plus two power-law terms with parameters α, β, A, B. Hoffmann et al. report α ≈ 0.34, β ≈ 0.28.
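The parametric form can be sketched as below. α and β are the values quoted above; E, A, and B are not given in this text, so the constants used here are the ones commonly quoted from the paper and should be treated as assumptions to verify. The closed-form optimum comes from minimizing the loss subject to C = 6·N·D, which is also an assumption of this sketch. The function names are hypothetical.

```python
# alpha and beta as quoted in the text; E, A, B are assumed values to verify.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def parametric_loss(n_params, n_tokens):
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_allocation(compute_flops):
    """Minimize parametric_loss subject to C = 6 * N * D (closed form)."""
    a = BETA / (ALPHA + BETA)   # exponent of N* in C
    b = ALPHA / (ALPHA + BETA)  # exponent of D* in C
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    k = compute_flops / 6
    n_opt = g * k**a
    d_opt = k**b / g
    return n_opt, d_opt

n_opt, d_opt = optimal_allocation(1e21)
```

With α = 0.34 and β = 0.28 the exponents come out to about 0.45 for N* and 0.55 for D*, close to but not exactly the 0.5 of the rule of thumb.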
The main practical takeaway: for compute-optimal training, the number of training tokens should be approximately 20× the number of model parameters.
Training cost is approximated as C ≈ 6·N·D FLOPs, the standard approximation for dense-attention transformers. Combined with D ≈ 20·N, it gives N* and D* for a given C: C ≈ 6·N·(20·N) = 120·N², so N* ≈ √(C/(6·20)) and D* ≈ 20·N*.
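As a quick worked example, the two formulas can be turned into a small helper. The function name `compute_optimal` and the example budget are illustrative; the 6 and the 20 are the approximations stated above, not exact constants.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # from C ≈ 6·N·D
TOKENS_PER_PARAM = 20      # rule-of-thumb ratio

def compute_optimal(compute_flops, ratio=TOKENS_PER_PARAM):
    """Return (N*, D*) for a training budget given in FLOPs."""
    n_star = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * ratio))
    d_star = ratio * n_star
    return n_star, d_star

# Example budget: 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs.
n_star, d_star = compute_optimal(5.88e23)
print(f"N* ≈ {n_star:.3g} parameters, D* ≈ {d_star:.3g} tokens")
```

For this budget the helper recovers roughly 70B parameters and 1.4T tokens, the Chinchilla configuration.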
Empirical verification of the prediction: 70B parameters trained on 1.4T tokens (ratio = 20:1) at the same compute budget as the 280B Gopher. Outperformed Gopher on MMLU at 67.5% vs 60.0%.
Chinchilla optimizes training cost only. For models served at scale (ChatGPT, Claude, Llama API), inference cost dominates, so training a smaller model for longer (over-training) is rational even though it is not compute-optimal for training alone.
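A rough way to see the trade-off is to add inference FLOPs to the training budget. The sketch below assumes about 2·N FLOPs per generated or processed token at inference (a common approximation, not a figure from this text) and assumes the two models reach comparable quality, which is not guaranteed. The function names and the 30B / 4T configuration are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (≈ 6·N·D) plus inference (≈ 2·N per served token) FLOPs."""
    training = 6 * n_params * train_tokens
    inference = 2 * n_params * served_tokens  # assumed forward-pass cost
    return training + inference

def breakeven_served_tokens(n_small, d_small, n_large, d_large):
    """Served-token volume above which the smaller model is cheaper overall."""
    if n_small >= n_large:
        raise ValueError("n_small must be smaller than n_large")
    extra_training = 6 * (n_small * d_small - n_large * d_large)
    saved_per_token = 2 * (n_large - n_small)
    return max(extra_training, 0.0) / saved_per_token

# Hypothetical comparison: compute-optimal 70B / 1.4T vs over-trained 30B / 4T.
breakeven = breakeven_served_tokens(30e9, 4e12, 70e9, 1.4e12)
```

Beyond the break-even volume, every additional served token favors the smaller model under these assumptions.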
The ratio of 20 was measured on predominantly English text from curated web corpora. For code, vision, multimodal, or reasoning-heavy domains the exponents α and β may differ, so the optimal ratio may differ as well.
Hoffmann et al. noted that Kaplan-era fits were distorted because the LR cooldown was not matched to each run's training duration. The same mistake still appears in replications.
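Epoch AI's replication attempt (2024) found that the parametric fit as originally reported does not reproduce from the paper's data and has implausibly narrow confidence intervals; their refit agrees with the other two methods at roughly 20 tokens per parameter. Furthermore, dataset quality matters substantially: clean code, synthetic reasoning data, and image captions have different scaling dynamics.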
OpenAI publishes "Scaling Laws for Neural Language Models". Suggests scaling N much faster than D, a recommendation Hoffmann et al. later showed to be suboptimal.
DeepMind shows GPT-3 and Gopher are undertrained. Introduces the compute-optimal ratio of ~20 tokens/parameter and verifies it with 70B Chinchilla.
Meta trains the smaller Llama-1 and Llama-2 models well above Chinchilla-optimal (50+ tokens/parameter), intentionally trading compute-optimality for lower inference cost.
Llama-3 is trained on 15T tokens — for the 8B model that is a ratio of ~1875, far beyond compute-optimal. The inference-cost-aware training era.
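The ratio arithmetic can be checked directly. The parameter and token counts are the ones quoted in this text; the helper name `tokens_per_param` is illustrative.

```python
def tokens_per_param(n_params, n_tokens):
    """Training tokens divided by model parameters."""
    return n_tokens / n_params

CHINCHILLA_RATIO = tokens_per_param(70e9, 1.4e12)  # 70B params, 1.4T tokens
LLAMA3_8B_RATIO = tokens_per_param(8e9, 15e12)     # 8B params, 15T tokens

# How many times past the ~20 tokens/parameter rule of thumb.
overtraining_factor = LLAMA3_8B_RATIO / CHINCHILLA_RATIO

print(f"Chinchilla 70B: {CHINCHILLA_RATIO:.0f} tokens/param")
print(f"Llama-3 8B:     {LLAMA3_8B_RATIO:.0f} tokens/param")
print(f"Over-training factor: {overtraining_factor:.1f}x")
```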
Epoch AI's independent replication finds that Hoffmann et al.'s reported parametric fit is inconsistent with their other two methods; a refit brings it back in line with a ratio of roughly 20.
Chinchilla ratio = 20. Llama-2 and Llama-3 train far above (50+, 100+) — this is intentional over-training to reduce inference cost.
N* ∝ C^0.5 per Hoffmann et al.'s fit. For C = 6e23 FLOPs the optimum is ~70B parameters (Chinchilla).
D* ∝ C^0.5; in practice D* ≈ 20·N*. For Chinchilla, 1.4T tokens.
Hoffmann et al. noted that Kaplan's original fits were distorted by an LR schedule that was not calibrated to each run's length. In Chinchilla the cooldown spans the full training horizon of each run, which is critical for replicability.
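To illustrate why the schedule length matters, here is a minimal cosine-decay sketch. It assumes decay to 10% of the peak learning rate and omits warmup; the function name `cosine_lr` and all numbers are illustrative, not taken from either paper's code.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_fraction=0.1):
    """Cosine decay from peak_lr to final_fraction * peak_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    floor = final_fraction * peak_lr
    return floor + (peak_lr - floor) * cosine

PEAK = 1.0
RUN_STEPS = 50_000

# Schedule matched to the run: LR has fully decayed when training stops.
matched_final_lr = cosine_lr(RUN_STEPS, total_steps=RUN_STEPS, peak_lr=PEAK)

# Schedule sized for a run twice as long: training stops mid-cooldown.
mismatched_final_lr = cosine_lr(RUN_STEPS, total_steps=2 * RUN_STEPS, peak_lr=PEAK)
```

In the mismatched case the run ends with the learning rate still at about 55% of peak, so its final loss is worse than a matched run of the same length would reach, which biases any fit built on it.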
Compute-optimal scaling is a mathematical relationship between N, D, C, and L. It does not depend on a specific hardware architecture so long as FLOPs can be measured.
Chinchilla was trained on DeepMind's TPU clusters. The entire IsoFLOP experiment (>400 runs) was conducted on TPUs.
Later GPU-trained model families (Llama on A100/H100, Mistral) are consistent with the scaling relationship not being hardware-specific. In practice, Chinchilla serves as a common baseline for GPU-trained LLMs.