Robots Atlas

Chinchilla

Empirically demonstrated that for a fixed compute budget, model parameter count (N) and training token count (D) should scale at roughly equal rates — approximately 20 tokens per parameter — overturning the earlier Kaplan et al. recommendation of scaling N much faster than D.

Category
Abstraction level
Operation level
  • Planning LLM training under a fixed compute budget
  • Choosing (N, D) pairs for foundation model runs
  • Validating whether a training plan is undertrained
  • Reference point for over-training decisions

Hoffmann et al. ran over 400 training runs across model sizes from 70M to over 16B parameters and dataset sizes from 5B to 500B tokens. They fit three independent methods to the data: (1) IsoFLOP profiles — for a fixed C, vary N (and hence D) and find the N that minimizes the final loss; (2) a parametric loss fit — L(N, D) = E + A/N^α + B/D^β; (3) fixing model sizes, varying the number of training tokens, and taking the lower envelope of the training curves. All three methods converged on the same conclusion: the optimal N* and D* both scale roughly as C^0.5, which translates to ~20 tokens per parameter. They then trained Chinchilla (70B parameters on 1.4T tokens) to verify the prediction — the model outperformed the 280B Gopher on MMLU (67.5% vs 60.0%) and on most other benchmarks.

Earlier Kaplan et al. (2020) scaling laws suggested favoring much larger models at the expense of training tokens for any given compute budget. This led to models like GPT-3 and Gopher, which were undertrained and suboptimal in resource allocation.

01

IsoFLOP analysis

Empirical fitting method #1

Hoffmann et al.'s IsoFLOP method: for a fixed compute budget C, vary N and D along the constraint C ≈ 6·N·D, measure the final loss L, and find the (N*, D*) pair that minimizes it. Repeating this across multiple values of C traces out the compute-optimal frontier.
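A minimal sketch of this procedure (the measurements below are hypothetical, not the paper's data): at one fixed budget C, fit a parabola to final loss versus log model size and read the estimated N* off the vertex.

```python
# Sketch of the IsoFLOP procedure for a single compute budget C.
# Data are hypothetical placeholders; in practice use your own (N, loss) runs.
import numpy as np

# Hypothetical measurements at a fixed budget: model sizes and final losses.
N = np.array([0.25e9, 0.5e9, 1e9, 2e9, 4e9])      # parameters
loss = np.array([2.95, 2.80, 2.72, 2.74, 2.85])    # final training loss

# Quadratic fit in log-space: loss ≈ a*(log10 N)^2 + b*log10 N + c
a, b, c = np.polyfit(np.log10(N), loss, deg=2)
log_n_star = -b / (2 * a)                          # vertex of the parabola
print(f"estimated N* ≈ {10**log_n_star / 1e9:.2f}B parameters")

# Repeating this for several budgets C and regressing log N* on log C
# gives the exponent in N* ∝ C^a (≈ 0.5 in Hoffmann et al.).
```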

02

L(N, D) = E + A/N^α + B/D^β

Empirical fitting method #2

Second method: fit the loss surface as a sum of an irreducible loss E plus two power-law terms with parameters α, β, A, B. Hoffmann et al. report α ≈ 0.34, β ≈ 0.28.
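A sketch of such a fit using scipy.optimize.curve_fit. The (N, D, loss) triples below are synthetic placeholders generated to be roughly consistent with the paper's reported parameters; the paper itself uses a more robust Huber-loss fit in log space.

```python
# Sketch: fit L(N, D) = E + A/N^alpha + B/D^beta to (N, D, loss) observations.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Hypothetical observations (generated to be close to the paper's reported fit).
N = np.array([0.4e9, 1e9, 1e9, 4e9, 4e9, 16e9])
D = np.array([8e9, 20e9, 80e9, 80e9, 320e9, 320e9])
L = np.array([2.87, 2.58, 2.41, 2.28, 2.16, 2.07])

p0 = [1.7, 400.0, 400.0, 0.34, 0.28]   # initial guess near the reported values
(E, A, B, alpha, beta), _ = curve_fit(chinchilla_loss, (N, D), L, p0=p0, bounds=(0, np.inf))
print(f"E={E:.2f}  alpha={alpha:.2f}  beta={beta:.2f}")
```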

03

D/N ≈ 20

Engineering rule of thumb

The main practical takeaway: for compute-optimal training, the number of training tokens should be approximately 20× the number of model parameters.

04

C ≈ 6 · N · D

C ↔ (N, D) conversion

Modular

Standard approximation for dense-attention transformer training cost. Used to derive N* and D* for a given C: N* ≈ √(C/(6·20)), D* ≈ 20·N*.
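A tiny helper applying this conversion — a sketch, with the function name and the 20-tokens-per-parameter default taken from the rule of thumb above:

```python
# Sketch: derive a compute-optimal (N*, D*) from a budget C using C ≈ 6·N·D
# and D ≈ 20·N, which gives C ≈ 120·N² and hence N* = sqrt(C / 120).
import math

def chinchilla_optimal(C_flops: float, tokens_per_param: float = 20.0):
    n_star = math.sqrt(C_flops / (6.0 * tokens_per_param))
    d_star = tokens_per_param * n_star
    return n_star, d_star

# Chinchilla's approximate budget: 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
n, d = chinchilla_optimal(5.9e23)
print(f"N* ≈ {n/1e9:.0f}B parameters, D* ≈ {d/1e12:.2f}T tokens")
```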

05

Chinchilla 70B / 1.4T tokens

Validation experiment

Empirical verification of the prediction: 70B parameters trained on 1.4T tokens (ratio = 20:1) at the same compute budget as the 280B Gopher. Outperformed Gopher on MMLU at 67.5% vs 60.0%.

Tokens-per-parameter ratio (D/N)

Critical
  • ~1.7 — GPT-3 (175B parameters, 300B tokens)
  • 20 — Chinchilla-optimal
  • ~28 — Llama-2 70B (2T tokens)
  • ~150–200 — heavily over-trained modern models (e.g. Llama-3 70B at 15T tokens)

The Chinchilla ratio is 20. Llama-2 and Llama-3 train far above it (≈28 for Llama-2 70B, ≈200+ for Llama-3 70B, and higher still for the smaller variants) — intentional over-training to reduce inference cost.

Compute-optimal N* for given C

Critical

N* ∝ C^0.5 per Hoffmann et al.'s fit. For C = 6e23 FLOPs the optimum is ~70B parameters (Chinchilla).

Compute-optimal D* for given C

Critical

D* ∝ C^0.5; in practice D* ≈ 20·N*. For Chinchilla, 1.4T tokens.

Learning rate cooldown

Standard

Hoffmann et al. noted that Kaplan et al.'s original fits were distorted by a learning-rate cooldown that was not matched to the training length. In Chinchilla the cosine cooldown spans exactly the full training horizon — a mismatched schedule overestimates intermediate losses and skews the fit.

Common pitfalls

Confusing compute-optimal with deployment-optimal
HIGH

Chinchilla optimizes training cost. For models served at scale (ChatGPT, Claude, Llama API), inference cost dominates — and a smaller model trained longer (over-training) is rational despite suboptimal compute.

Define the objective as training_cost + λ · inference_cost · usage_volume. For high-usage products, λ shifts the optimum toward smaller models trained on more data.
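A sketch of that reasoning (not from the Chinchilla paper; the target loss, serving volume, and the ~2·N FLOPs-per-token inference estimate are illustrative assumptions): use the fitted loss surface to find how many tokens a smaller model needs to match a target loss, then minimize training plus lifetime inference FLOPs.

```python
# Sketch: deployment-aware model sizing under a fixed quality (loss) target.
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # Hoffmann et al.'s reported fit

def tokens_to_reach(loss_target, N):
    """Tokens D needed for an N-parameter model to reach loss_target (inf if unreachable)."""
    gap = loss_target - E - A / N**alpha
    return np.inf if gap <= 0 else (B / gap) ** (1.0 / beta)

def lifetime_flops(N, loss_target, tokens_served):
    """Training FLOPs (6·N·D) plus lifetime inference FLOPs (~2·N per token served)."""
    D = tokens_to_reach(loss_target, N)
    return 6.0 * N * D + 2.0 * N * tokens_served

# Illustrative: fixed target loss, 10T tokens served over the model's lifetime.
Ns = np.logspace(9.2, 11.5, 400)                        # ~1.6B .. ~300B parameters
costs = [lifetime_flops(n, loss_target=2.05, tokens_served=1e13) for n in Ns]
best = Ns[int(np.argmin(costs))]
print(f"deployment-aware optimum ≈ {best / 1e9:.1f}B parameters")
```

With a large serving volume the minimum lands on a model far smaller than the training-compute optimum, trained on many more than 20 tokens per parameter — the over-training pattern described above.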

Naively applying the 20:1 ratio to new modalities
HIGH

The 20-ratio was measured on English language with curated web corpora. For code, vision, multimodal, or reasoning-heavy domains the α, β exponents differ — the effective optimal ratio can be different.

For a new domain, fit your own IsoFLOP curves on small models (up to ~1B params) before committing to a full-scale run.

Uncalibrated learning rate cooldown
HIGH

Hoffmann et al. noted that Kaplan-era fits were distorted because LR cooldown did not align with training duration. The same mistake still appears in replications.

LR cooldown should terminate exactly at the planned training horizon (D tokens). Do not use a fixed-horizon cooldown across different D values.
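A minimal sketch of a horizon-matched cosine schedule (the function name, warmup length, and the 4M-tokens-per-step batch are illustrative assumptions):

```python
# Sketch: cosine LR schedule whose cooldown ends exactly at the planned horizon,
# with total_steps derived from the token budget D.
import math

def cosine_lr(step, total_steps, peak_lr, min_lr_ratio=0.1, warmup_steps=2000):
    """LR at `step`: linear warmup, then cosine decay reaching min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    min_lr = peak_lr * min_lr_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Tie the horizon to the token budget: total_steps = D / tokens_per_step.
D, tokens_per_step = 1.4e12, 4_000_000          # illustrative 4M-token batches
total_steps = int(D / tokens_per_step)           # cooldown ends exactly here
print(cosine_lr(0, total_steps, 3e-4), cosine_lr(total_steps, total_steps, 3e-4))
```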

Treating the 20-ratio as an immutable law of nature
MEDIUM

Epoch AI's replications (2024) suggest the original Chinchilla fits may underestimate optimal D. Furthermore, dataset quality matters substantially — clean code, synthetic reasoning data, and image captions have different scaling dynamics.

Treat the 20-ratio as a rough baseline, not a law. Measure IsoFLOP curves for your own dataset / modality.

GENESIS · Source paper

Training Compute-Optimal Large Language Models
2022 · NeurIPS 2022 · Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch et al.
2020

Kaplan et al. — first scaling laws (context)

OpenAI publishes "Scaling Laws for Neural Language Models". Suggests scaling N much faster than D — retrospectively shown to be incorrect.

2022

Hoffmann et al. publish Chinchilla

breakthrough

DeepMind shows GPT-3 and Gopher are undertrained. Introduces the compute-optimal ratio of ~20 tokens/parameter and verifies it with 70B Chinchilla.

2023

Llama 1 and 2 — intentional over-training

breakthrough

Meta trains Llama-1 and Llama-2 above Chinchilla-optimal (≈22–29 tokens/parameter for the 65B/70B flagships, 100+ for the smaller variants), intentionally trading training-compute optimality for lower inference cost.

2024

Llama-3 — extreme over-training (~200 tokens/param for the 70B, ~1875 for the 8B)

Llama-3 is trained on 15T tokens — for the 8B model that is a ratio of ~1875, far beyond compute-optimal, marking the start of the inference-cost-aware training era.

2024

Epoch AI — refit and critique of original Chinchilla

Epoch AI's independent replications suggest Hoffmann's original fits may underestimate optimal D — the effective ratio may exceed 20.

Hardware agnostic · PRIMARY

Compute-optimal scaling is a mathematical relationship between N, D, C, and L. It does not depend on a specific hardware architecture so long as FLOPs can be measured.

TPU · GOOD

Chinchilla was trained on DeepMind's TPU clusters. The entire IsoFLOP experiment (>400 runs) was conducted on TPUs.

GPU Tensor Cores · GOOD

Later GPU-trained models (Llama on A100/H100, Mistral) confirm that Chinchilla scaling holds equally well on GPUs. In practice, most modern GPU-trained LLMs use Chinchilla as a planning baseline.

BUILT ON

Scaling Laws (Kaplan / Chinchilla)

Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.

GO TO CONCEPT
Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to directly participate in computations involving every other position, so the path between any two positions has constant length (rather than linear, as in RNNs), enabling the model to learn long-range dependencies. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing: multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation — quadratic attention complexity with respect to sequence length (O(n²)) — is an active research direction (FlashAttention, sliding window, linear attention, SSM).

GO TO CONCEPT
LLM

A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.

GO TO CONCEPT
Pretraining

Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data — next tokens in text, masked words, future video frames, future robot states — without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge" — dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).

GO TO CONCEPT

EXTENDS

Scaling Laws (Kaplan / Chinchilla)

GO TO CONCEPT

Commonly used with

LLM

GO TO CONCEPT
Transformer

GO TO CONCEPT
Pretraining

GO TO CONCEPT
Emergent Abilities

Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities — such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages — do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better-than-random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include: Chain-of-Thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump — replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale — forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.

GO TO CONCEPT