Compute-Optimal Training
Shows that, for a fixed compute budget, the number of model parameters and the number of training tokens should be scaled in roughly equal proportion, rather than scaling the model far faster than the data.
The starting point is the standard training-cost approximation C ≈ 6·N·D, where N is the parameter count and D is the number of training tokens. Hoffmann et al. (2022) applied three complementary approaches: (1) training fixed-size models for varying numbers of tokens, (2) IsoFLOP curves, in which for each budget C the values of N and D are varied while keeping C constant and the (N*, D*) pair minimizing validation loss is identified, and (3) a parametric loss model L(N, D) fitted to all 400+ runs. All three methods agreed that the optimal N and D should grow in roughly equal proportion with C, corresponding to scaling exponents a ≈ 0.5 and b ≈ 0.5 in the relations N* ∝ C^a and D* ∝ C^b. In practice this yields the rule of thumb of approximately 20 tokens per parameter to keep training compute-optimal. The rule assumes a sufficiently large, deduplicated corpus, a decoder-only transformer architecture, and a standard learning-rate schedule tuned to the chosen number of steps.
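As a back-of-the-envelope illustration, the sketch below (Python; the function name is hypothetical) combines the approximation C ≈ 6·N·D with the ~20 tokens-per-parameter heuristic, so that C ≈ 120·N² and N* ≈ √(C/120):

```python
import math

def compute_optimal_allocation(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a pretraining FLOPs budget into (parameters, tokens).

    Combines the cost approximation C ~= 6 * N * D with the heuristic
    D ~= 20 * N, giving N* = sqrt(C / (6 * 20)). A rough sketch, not the
    parametric fit from Hoffmann et al. (2022).
    """
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the Chinchilla training budget (~5.9e23 FLOPs):
n, d = compute_optimal_allocation(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # about 7e10 parameters and 1.4e12 tokens
```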
Earlier scaling laws (Kaplan et al., 2020) suggested that under a constrained compute budget one should mainly increase the number of model parameters, which led to the training of very large but significantly undertrained models. Compute-Optimal Training addresses this suboptimal allocation of the FLOPs budget between model size and training-data volume.
Compute budget (FLOPs)
Total compute budget C in FLOPs allocated for pretraining, jointly determining N and D.
Tokens per parameter
- 20: value recommended by Hoffmann et al. (2022).
The D/N ratio. According to the Chinchilla results the optimum is around 20 tokens per parameter.
Parameter count (N)
Number of model parameters, chosen jointly with D for budget C.
Training tokens (D)
Number of unique tokens processed during pretraining.
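To make the relationships among these quantities concrete, here is a minimal sketch of the parametric loss model mentioned above, L(N, D) = E + A/N^α + B/D^β; the constants default to roughly the values fitted by Hoffmann et al. (2022) and should be treated as illustrative:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric loss model L(N, D) = E + A / N**alpha + B / D**beta.

    Default constants are approximately the Chinchilla fit; illustrative only.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Two ways to spend the same ~5.9e23 FLOP budget (C ~= 6 * N * D):
budget = 5.9e23
for n in (70e9, 280e9):            # Chinchilla-sized vs. Gopher-sized model
    d = budget / (6.0 * n)         # tokens affordable at this model size
    print(f"N={n:.0e}  D={d:.2e}  predicted loss={chinchilla_loss(n, d):.3f}")
```

Under this fit, the 70B / ~1.4T-token split shows a lower predicted loss than the 280B / ~0.35T-token split at the same budget, consistent with the Chinchilla-vs-Gopher comparison.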
Common pitfalls
Confusing compute-optimal with inference-optimal (HIGH)
The 20:1 rule minimizes pretraining loss for a training budget but does not account for inference cost. Models heavily deployed in production are often worth training for longer (more tokens) to reduce inference cost.
Optimize the combined training + inference cost over the full model lifecycle rather than the pretraining cost alone.
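One hedged way to act on this recommendation is to sweep candidate model sizes, grow the training-token count for each until a target loss is reached (reusing the chinchilla_loss sketch above), and pick the size that minimizes lifetime FLOPs under an assumed inference volume. The ~2·N FLOPs-per-generated-token estimate, the function names, and all numbers are illustrative assumptions, not part of the original paper:

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """~6*N FLOPs per training token plus ~2*N FLOPs per inference token."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

def cheapest_model_for_target_loss(target_loss: float, inference_tokens: float,
                                   loss_fn=chinchilla_loss):
    """Sweep model sizes; for each, grow the token count until the parametric
    loss estimate reaches target_loss, then keep the lowest lifetime cost."""
    best = None
    for n_params in (1e9 * 2**k for k in range(10)):         # 1B .. 512B params
        d = 20.0 * n_params                                   # start at the 20:1 point
        while loss_fn(n_params, d) > target_loss and d < 1e15:
            d *= 1.05                                         # overtrain in 5% steps
        if loss_fn(n_params, d) > target_loss:
            continue                                          # this size cannot reach the target
        cost = lifetime_flops(n_params, d, inference_tokens)
        if best is None or cost < best[0]:
            best = (cost, n_params, d)
    return best                                               # (total FLOPs, N, D)

# Example: target the loss of a compute-optimal ~70B run under a heavy inference load
print(cheapest_model_for_target_loss(1.94, inference_tokens=2e12))
```

With a large enough projected inference volume, the sweep tends to prefer a smaller, overtrained model over the strictly compute-optimal one.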
Insufficient corpus (MEDIUM)
The rule assumes a sufficiently large, deduplicated corpus. Repeating the same data over many epochs breaks the assumption and distorts the relation between D and the effective count of "fresh" tokens.
Check the size of the unique corpus before planning pretraining and treat D as the number of unique tokens, not tokens with repetition.
Naive extrapolation beyond the experimental range (MEDIUM)
The scaling exponents were fit on models up to around 16B parameters and budgets up to roughly 5e23 FLOPs; extrapolating to budgets that are orders of magnitude larger is not always reliable.
For very large budgets run your own IsoFLOP curves instead of blindly relying on the Chinchilla exponents.
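A minimal sketch of the IsoFLOP procedure itself, assuming you have (model size, final loss) measurements at one fixed budget: fit a parabola in log N and read off its minimum. The data below is synthetic and the fit uses numpy.polyfit:

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit loss as a quadratic in log10(N) at a fixed FLOPs budget and
    return the model size at the fitted minimum (Chinchilla "Approach 2")."""
    x = np.log10(np.asarray(model_sizes, dtype=float))
    y = np.asarray(losses, dtype=float)
    a, b, c = np.polyfit(x, y, deg=2)   # y ~= a*x^2 + b*x + c
    x_min = -b / (2.0 * a)              # vertex of the parabola
    return 10.0 ** x_min

# Synthetic example: losses measured at one fixed budget for several model sizes
sizes  = [1e9, 2e9, 4e9, 8e9, 16e9]
losses = [2.31, 2.24, 2.21, 2.23, 2.30]
print(f"estimated N* ~ {isoflop_optimum(sizes, losses):.2e}")
```

Repeating this for several budgets gives the (C, N*) pairs from which the scaling exponent a is fitted.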
GENESIS · Source paper
Training Compute-Optimal Large Language Models
Kaplan scaling laws
Kaplan et al. publish "Scaling Laws for Neural Language Models", suggesting that under a constrained budget one should mainly grow the model.
Chinchilla and compute-optimal scaling
Breakthrough: Hoffmann et al. introduce the compute-optimal rule and empirically confirm it with the Chinchilla 70B / 1.4T-token model, outperforming Gopher 280B at the same FLOPs budget.
LLaMA: breaking the 20:1 rule
Touvron et al. train LLaMA on more than one trillion tokens for 7B–13B models, intentionally going beyond the compute-optimal point to achieve cheaper inference at comparable quality.
Compute-Optimal Training is a rule for allocating a FLOPs budget and does not depend on any specific hardware; it applies equally to GPU and TPU clusters.
BUILT ON
Scaling Laws (Kaplan / Chinchilla)
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
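For reference, the Kaplan-style power law for the data-unlimited regime can be written as L(N) = (N_c / N)^α_N; the constants in the sketch below are roughly the values reported in the 2020 paper and are illustrative only:

```python
def kaplan_loss_vs_params(n_params: float,
                          n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Power law L(N) = (N_c / N)**alpha_N for the data-unlimited regime.

    Constants are approximate values from Kaplan et al. (2020); illustrative only.
    """
    return (n_c / n_params) ** alpha_n

# Doubling parameters gives a small but predictable drop in loss:
print(kaplan_loss_vs_params(1e9), kaplan_loss_vs_params(2e9))
```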
Pretraining
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge": dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Commonly used with
Chinchilla (Compute-Optimal Scaling)
Chinchilla is the result of "Training Compute-Optimal Large Language Models" by DeepMind (Hoffmann et al., 2022). The paper empirically showed that models like GPT-3 (175B parameters, 300B tokens) were **massively undertrained**: for the same compute budget, a smaller model trained on far more data would yield better results. Chinchilla (70B parameters trained on 1.4T tokens) outperformed Gopher (280B parameters) and GPT-3 on most benchmarks at a fraction of the inference cost. The key takeaway: for compute-optimal training, N and D should scale roughly equally (~20 tokens per parameter), rather than N growing disproportionately faster (as Kaplan et al. suggested). Later work such as Llama, Mistral, and Gemma went even further, training smaller models on 100+ tokens per parameter, because for production-served models inference cost matters more than pure compute-optimal training.
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text corpora via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pretraining phase.
| Title | Publisher | Type |
|---|---|---|
| Training Compute-Optimal Large Language Models (arXiv:2203.15556) | arXiv / DeepMind | scientific article |
| Scaling Laws for Neural Language Models (Kaplan et al., 2020) | arXiv / OpenAI | scientific article |
| LLaMA: Open and Efficient Foundation Language Models | arXiv / Meta AI | scientific article |