Compute-Optimal Training
Shows that, for a fixed compute budget, the number of model parameters and the number of training tokens should be scaled in roughly equal proportion, rather than scaling the model far faster than the data.
The starting point is the standard training-cost approximation C ≈ 6·N·D, where N is the parameter count and D is the number of training tokens. Hoffmann et al. (2022) applied three complementary approaches: (1) training fixed-size models for varying numbers of tokens, (2) IsoFLOP curves, in which for each budget C the values of N and D are varied while keeping C constant and the (N*, D*) pair minimizing validation loss is identified, and (3) a parametric loss model L(N, D) fitted to all 400+ runs. All three methods agreed that the optimal N and D should grow in roughly equal proportion with C, corresponding to scaling exponents a ≈ 0.5 and b ≈ 0.5 in the relations N* ∝ C^a and D* ∝ C^b. In practice this yields the rule of thumb of approximately 20 tokens per parameter to keep training compute-optimal. The rule assumes a sufficiently large, deduplicated corpus, a decoder-only transformer architecture, and a standard learning-rate schedule tuned to the chosen number of steps.
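As a back-of-the-envelope illustration, the sketch below (Python; the function name is hypothetical) combines the approximation C ≈ 6·N·D with the ~20 tokens-per-parameter heuristic, so that C ≈ 120·N² and N* ≈ √(C/120):

```python
import math

def compute_optimal_allocation(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a pretraining FLOPs budget into (parameters, tokens).

    Combines the cost approximation C ~= 6 * N * D with the heuristic
    D ~= 20 * N, giving N* = sqrt(C / (6 * 20)). A rough sketch, not the
    parametric fit from Hoffmann et al. (2022).
    """
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Chinchilla training budget (~5.9e23 FLOPs):
n, d = compute_optimal_allocation(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # about 7e10 parameters and 1.4e12 tokens
```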
Earlier scaling laws (Kaplan et al., 2020) suggested that under a constrained compute budget one should mainly increase the number of model parameters, which led to the training of very large but significantly undertrained models. Compute-Optimal Training addresses this suboptimal allocation of the FLOPs budget between model size and training-data volume.
Compute budget (FLOPs)
Total compute budget C in FLOPs allocated for pretraining, jointly determining N and D.
Tokens per parameter
- 20: value recommended by Hoffmann et al. (2022).
The D/N ratio. According to the Chinchilla results the optimum is around 20 tokens per parameter.
Parameter count (N)
Number of model parameters, chosen jointly with D for budget C.
Training tokens (D)
Number of unique tokens processed during pretraining.
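To make the relationships among these quantities concrete, here is a minimal sketch of the parametric loss model mentioned above, L(N, D) = E + A/N^α + B/D^β; the constants default to roughly the values fitted by Hoffmann et al. (2022) and should be treated as illustrative:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric loss model L(N, D) = E + A / N**alpha + B / D**beta.

    Default constants are approximately the Chinchilla fit; illustrative only.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Two ways to spend the same ~5.9e23 FLOP budget (C ~= 6 * N * D):
budget = 5.9e23
for n in (70e9, 280e9):            # Chinchilla-sized vs. Gopher-sized model
    d = budget / (6.0 * n)         # tokens affordable at this model size
    print(f"N={n:.0e}  D={d:.2e}  predicted loss={chinchilla_loss(n, d):.3f}")
```

Under this fit, the 70B / ~1.4T-token split shows a lower predicted loss than the 280B / ~0.35T-token split at the same budget, consistent with the Chinchilla-vs-Gopher comparison.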
Common pitfalls
Confusing compute-optimal with inference-optimal (HIGH)
The 20:1 rule minimizes pretraining loss for a training budget but does not account for inference cost. Models heavily deployed in production are often worth training for longer (more tokens) to reduce inference cost.
Optimize the combined training + inference cost over the full model lifecycle rather than the pretraining cost alone.
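One hedged way to act on this recommendation is to sweep candidate model sizes, grow the training-token count for each until a target loss is reached (reusing the chinchilla_loss sketch above), and pick the size that minimizes lifetime FLOPs under an assumed inference volume. The ~2·N FLOPs-per-generated-token estimate, the function names, and all numbers are illustrative assumptions, not part of the original paper:

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """~6*N FLOPs per training token plus ~2*N FLOPs per inference token."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

def cheapest_model_for_target_loss(target_loss: float, inference_tokens: float,
                                   loss_fn=chinchilla_loss):
    """Sweep model sizes; for each, grow the token count until the parametric
    loss estimate reaches target_loss, then keep the lowest lifetime cost."""
    best = None
    for n_params in (1e9 * 2**k for k in range(10)):         # 1B .. 512B params
        d = 20.0 * n_params                                   # start at the 20:1 point
        while loss_fn(n_params, d) > target_loss and d < 1e15:
            d *= 1.05                                         # overtrain in 5% steps
        if loss_fn(n_params, d) > target_loss:
            continue                                          # this size cannot reach the target
        cost = lifetime_flops(n_params, d, inference_tokens)
        if best is None or cost < best[0]:
            best = (cost, n_params, d)
    return best                                               # (total FLOPs, N, D)

# Example: target the loss of a compute-optimal ~70B run under a heavy inference load
print(cheapest_model_for_target_loss(1.94, inference_tokens=2e12))
```

With a large enough projected inference volume, the sweep tends to prefer a smaller, overtrained model over the strictly compute-optimal one.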
Insufficient corpus (MEDIUM)
The rule assumes a sufficiently large, deduplicated corpus. Repeating the same data over many epochs breaks the assumption and distorts the relation between D and the effective count of "fresh" tokens.
Check the size of the unique corpus before planning pretraining and treat D as the number of unique tokens, not tokens with repetition.
Naive extrapolation beyond the experimental range (MEDIUM)
The scaling exponents were fit on models up to around 16B parameters and budgets up to roughly 5e23 FLOPs; extrapolating to budgets that are orders of magnitude larger is not always reliable.
For very large budgets run your own IsoFLOP curves instead of blindly relying on the Chinchilla exponents.
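A minimal sketch of the IsoFLOP procedure itself, assuming you have (model size, final loss) measurements at one fixed budget: fit a parabola in log N and read off its minimum. The data below is synthetic and the fit uses numpy.polyfit:

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit loss as a quadratic in log10(N) at a fixed FLOPs budget and
    return the model size at the fitted minimum (Chinchilla "Approach 2")."""
    x = np.log10(np.asarray(model_sizes, dtype=float))
    y = np.asarray(losses, dtype=float)
    a, b, c = np.polyfit(x, y, deg=2)   # y ~= a*x^2 + b*x + c
    x_min = -b / (2.0 * a)              # vertex of the parabola
    return 10.0 ** x_min

# Synthetic example: losses measured at one fixed budget for several model sizes
sizes  = [1e9, 2e9, 4e9, 8e9, 16e9]
losses = [2.31, 2.24, 2.21, 2.23, 2.30]
print(f"estimated N* ~ {isoflop_optimum(sizes, losses):.2e}")
```

Repeating this for several budgets gives the (C, N*) pairs from which the scaling exponent a is fitted.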
GENESIS · Source paper
Training Compute-Optimal Large Language Models
Kaplan scaling laws
Kaplan et al. publish "Scaling Laws for Neural Language Models", suggesting that under a constrained budget one should mainly grow the model.
Chinchilla and compute-optimal scaling
Breakthrough: Hoffmann et al. introduce the compute-optimal rule and empirically confirm it with the Chinchilla 70B / 1.4T-token model, outperforming Gopher 280B at the same FLOPs budget.
LLaMA: breaking the 20:1 rule
Touvron et al. train LLaMA on more than one trillion tokens for 7B–13B models, intentionally going beyond the compute-optimal point to achieve cheaper inference at comparable quality.
Compute-Optimal Training is a rule for allocating a FLOPs budget and does not depend on any specific hardware; it applies equally to GPU and TPU clusters.
BUILT ON
Scaling Laws (Kaplan / Chinchilla)
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
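For reference, the Kaplan-style power law for the data-unlimited regime can be written as L(N) = (N_c / N)^α_N; the constants in the sketch below are roughly the values reported in the 2020 paper and are illustrative only:

```python
def kaplan_loss_vs_params(n_params: float,
                          n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Power law L(N) = (N_c / N)**alpha_N for the data-unlimited regime.

    Constants are approximate values from Kaplan et al. (2020); illustrative only.
    """
    return (n_c / n_params) ** alpha_n

# Doubling parameters gives a small but predictable drop in loss:
print(kaplan_loss_vs_params(1e9), kaplan_loss_vs_params(2e9))
```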
Pretraining
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge": dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Commonly used with
Chinchilla (Compute-Optimal Scaling)
Chinchilla is the result of "Training Compute-Optimal Large Language Models" by DeepMind (Hoffmann et al., 2022). The paper empirically showed that models like GPT-3 (175B parameters, 300B tokens) were **massively undertrained**: for the same compute budget, a smaller model trained on far more data would yield better results. Chinchilla (70B parameters trained on 1.4T tokens) outperformed Gopher (280B parameters) and GPT-3 on most benchmarks at a fraction of the inference cost. The key takeaway: for compute-optimal training, N and D should scale roughly equally (~20 tokens per parameter), rather than N growing disproportionately faster (as Kaplan et al. suggested). Later work such as Llama, Mistral, and Gemma went even further, training smaller models on 100+ tokens per parameter, because for production-served models inference cost matters more than pure compute-optimal training.
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text corpora via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pretraining phase.
| Title | Publisher | Type |
|---|---|---|
| Training Compute-Optimal Large Language Models (arXiv:2203.15556) | arXiv / DeepMind | scientific article |
| Scaling Laws for Neural Language Models (Kaplan et al., 2020) | arXiv / OpenAI | scientific article |
| LLaMA: Open and Efficient Foundation Language Models | arXiv / Meta AI | scientific article |