IsoFLOP analysis
Empirical fitting method #1
Hoffmann et al.'s first method: for a fixed compute budget C, vary N and D, measure L, and find the (N*, D*) that minimizes loss. Repeated across multiple C values gives the optimal frontier.
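The procedure can be sketched numerically. This toy example is not the paper's data: Hoffmann et al.'s reported parametric fit is used as a stand-in ground-truth loss, and the minimum is located along a fixed-C slice.

```python
import numpy as np

# Toy illustration (not the paper's data): Hoffmann et al.'s reported
# parametric fit is used here as a stand-in ground-truth loss.
def toy_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

def isoflop_optimum(C, n_points=200):
    """At a fixed budget C, sweep model size N, set D = C/(6N), and
    return the N minimizing loss along the IsoFLOP slice."""
    Ns = np.logspace(8, 11, n_points)   # candidates: 100M..100B params
    Ds = C / (6.0 * Ns)                 # tokens implied by C = 6*N*D
    return Ns[int(np.argmin(toy_loss(Ns, Ds)))]

for C in (1e20, 1e21, 1e22):
    print(f"C = {C:.0e}: N* ≈ {isoflop_optimum(C):.2e} params")
```

Repeating this for several C values and regressing log N* on log C recovers the exponent of the optimal frontier.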
Empirically demonstrated that for a fixed compute budget, model parameter count (N) and training token count (D) should scale at roughly equal rates — approximately 20 tokens per parameter — overturning the earlier Kaplan et al. recommendation of scaling N much faster than D.
Hoffmann et al. ran over 400 training jobs across model sizes from 70M to 16B parameters and dataset sizes from 5B to 500B tokens. They fit three independent methods to the data: (1) IsoFLOP curves: for fixed C, vary N and D and find the minimum of L; (2) a parametric loss fit, L(N, D) = E + A/N^α + B/D^β; (3) training-curve envelopes: fix N, vary training duration, and take the lower envelope of the loss curves. All three methods converged on the same conclusion: optimal N* and D* both scale as roughly C^0.5, which translates to ~20 tokens per parameter. They then trained the 70B Chinchilla on 1.4T tokens to verify the prediction; it outperformed the 280B Gopher on MMLU (67.5% vs 60.0%) and on most other benchmarks.
Earlier Kaplan et al. (2020) scaling laws suggested favoring much larger models at the expense of training tokens for any given compute budget. This led to models like GPT-3 and Gopher, which were undertrained and suboptimal in resource allocation.
Empirical fitting method #2
Second method: fit the loss surface as a sum of an irreducible loss E plus two power-law terms with parameters α, β, A, B. Hoffmann et al. report α ≈ 0.34, β ≈ 0.28.
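A minimal sketch of this kind of fit on synthetic data (the paper itself fits a Huber loss on the log-loss; here a simpler grid-plus-least-squares stand-in is used): for fixed exponents the model is linear in (E, A, B), so the exponents can be grid-searched and the linear part solved exactly.

```python
import numpy as np

# Synthetic-data sketch (not the paper's runs): for fixed exponents
# (alpha, beta), L = E + A/N^alpha + B/D^beta is linear in (E, A, B),
# so we grid-search the exponents and solve least squares for the rest.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7.5, 10.0, 400)    # model sizes ~30M..10B params
D = 10 ** rng.uniform(9.5, 11.5, 400)    # datasets ~3B..300B tokens
E0, A0, B0, a0, b0 = 1.69, 406.4, 410.7, 0.34, 0.28  # Hoffmann-style "truth"
L = E0 + A0 / N**a0 + B0 / D**b0 + rng.normal(0.0, 0.005, N.shape)

best = None
for a_try in np.arange(0.20, 0.50, 0.01):
    for b_try in np.arange(0.20, 0.50, 0.01):
        X = np.column_stack([np.ones_like(N), N**-a_try, D**-b_try])
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        sse = float(((X @ coef - L) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, a_try, b_try, coef)

sse, a, b, (E, A, B) = best
print(f"alpha ≈ {a:.2f}, beta ≈ {b:.2f}, E ≈ {E:.2f}")
```

With clean, well-spread data the grid search recovers exponents close to the generating values; on real training runs the fit is noisier and sensitive to the range of (N, D) probed.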
Engineering rule of thumb
The main practical takeaway: for compute-optimal training, the number of training tokens should be approximately 20× the number of model parameters.
C ↔ (N, D) conversion
The standard approximation for dense-attention transformer training cost is C ≈ 6·N·D FLOPs. Combined with D ≈ 20·N, it gives N* and D* for a given C: N* ≈ √(C/(6·20)), D* ≈ 20·N*.
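Assuming the common C ≈ 6·N·D cost approximation and a fixed 20:1 tokens-per-parameter ratio, the conversion is a one-liner; a minimal sketch:

```python
# Minimal sketch: C = 6*N*D with D = r*N gives N = sqrt(C / (6*r)).
def compute_optimal_split(C, tokens_per_param=20.0):
    """Return (N_opt, D_opt) for a training budget C in FLOPs, under the
    dense-transformer cost approximation C = 6*N*D and a fixed D/N ratio."""
    N = (C / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

N, D = compute_optimal_split(6e23)       # roughly Chinchilla's budget
print(f"N* ≈ {N:.2e} params, D* ≈ {D:.2e} tokens")
# ≈ 7.1e10 params (~70B) and ≈ 1.4e12 tokens (~1.4T)
```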
Validation experiment
Empirical verification of the prediction: 70B parameters trained on 1.4T tokens (ratio = 20:1) at the same compute budget as the 280B Gopher. Outperformed Gopher on MMLU at 67.5% vs 60.0%.
Tokens-per-parameter ratio (D/N)
Chinchilla ratio = 20. Llama-2 and Llama-3 train far above it (50+ and 100+ tokens per parameter); this is intentional over-training to reduce inference cost.
Compute-optimal N* for given C
N* ∝ C^0.5 per Hoffmann et al.'s fit. For C = 6e23 FLOPs the optimum is ~70B parameters (Chinchilla).
Compute-optimal D* for given C
D* ∝ C^0.5; in practice D* ≈ 20·N*. For Chinchilla, 1.4T tokens.
Learning rate cooldown
Hoffmann et al. noted that Kaplan's original fits were distorted by a learning-rate cooldown that was not matched to training duration. In Chinchilla the cooldown spans the full training horizon, which is critical for replicability.
Chinchilla optimizes training cost. For models served at scale (ChatGPT, Claude, Llama API), inference cost dominates — and a smaller model trained longer (over-training) is rational despite suboptimal compute.
Define the objective as training_cost + λ · inference_cost · usage_volume. For high-usage products, λ shifts the optimum toward smaller models trained on more data.
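A toy sketch of this trade-off. The loss constants are Hoffmann et al.'s parametric fit; the target loss, the usage volumes, and the ~2N FLOPs-per-token inference cost are illustrative assumptions. It picks the cheapest (N, D) on an iso-loss contour once lifetime inference tokens are billed alongside training:

```python
import numpy as np

# Toy sketch with Hoffmann-style loss constants; the target loss, usage
# volumes, and the ~2N FLOPs/token inference cost are illustrative.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def cheapest_model(L_target, V):
    """Cheapest (N, D) reaching L_target when V lifetime inference
    tokens are billed alongside training (all costs in FLOPs)."""
    Ns = np.logspace(9, 11.5, 400)         # candidate sizes 1B..300B params
    gap = L_target - E - A / Ns**alpha     # loss budget left for the data term
    Ns, gap = Ns[gap > 0], gap[gap > 0]    # too-small N cannot reach the target
    Ds = (B / gap) ** (1 / beta)           # tokens needed on the iso-loss curve
    total = 6 * Ns * Ds + 2 * Ns * V       # training + lifetime inference
    i = int(np.argmin(total))
    return Ns[i], Ds[i]

for V in (0.0, 1e12, 1e14):               # lifetime inference tokens served
    N, D = cheapest_model(L_target=2.1, V=V)
    print(f"V={V:.0e}: N ≈ {N:.2e}, D ≈ {D:.2e}, D/N ≈ {D/N:.0f}")
```

As V grows the optimum moves to a smaller model trained on more tokens, which is exactly the over-training regime of Llama-2 and Llama-3.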
The 20-ratio was measured on English language with curated web corpora. For code, vision, multimodal, or reasoning-heavy domains the α, β exponents differ — the effective optimal ratio can be different.
For a new domain, fit your own IsoFLOP curves on small models (up to ~1B params) before committing to a full-scale run.
Hoffmann et al. noted that Kaplan-era fits were distorted because LR cooldown did not align with training duration. The same mistake still appears in replications.
LR cooldown should terminate exactly at the planned training horizon (D tokens). Do not use a fixed-horizon cooldown across different D values.
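A minimal sketch of a horizon-matched schedule (the peak/min learning rates and warmup fraction are illustrative): the cosine cooldown reaches its floor exactly at the planned D, so changing D changes the schedule too.

```python
import math

# Sketch: cosine schedule whose cooldown ends exactly at the planned
# horizon of D tokens; illustrative peak/min LR and warmup fraction.
def lr_at(tokens_seen, total_tokens, peak_lr=3e-4, min_lr=3e-5, warmup=0.01):
    """Linear warmup, then cosine decay reaching min_lr at total_tokens."""
    warmup_tokens = warmup * total_tokens
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The same token position gets a different LR under different horizons:
for D in (2e11, 1.4e12):
    print(f"D={D:.0e}: lr at 1e11 tokens = {lr_at(1e11, D):.2e}")
```

The point of the printout: reusing a schedule tuned for one D at another D puts every loss measurement at the wrong place on the cooldown, which is the distortion Hoffmann et al. identified in the Kaplan-era fits.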
Epoch AI's replications (2024) suggest the original Chinchilla fits may underestimate optimal D. Furthermore, dataset quality matters substantially — clean code, synthetic reasoning data, and image captions have different scaling dynamics.
Treat the 20-ratio as a rough baseline, not a law. Measure IsoFLOP curves for your own dataset / modality.
GENESIS · Source paper
Training Compute-Optimal Large Language Models
Kaplan et al. — first scaling laws (context)
OpenAI publishes "Scaling Laws for Neural Language Models". Suggests scaling N much faster than D — retrospectively shown to be incorrect.
Hoffmann et al. publish Chinchilla
breakthrough
DeepMind shows GPT-3 and Gopher are undertrained. Introduces the compute-optimal ratio of ~20 tokens/parameter and verifies it with the 70B Chinchilla.
Llama 1 and 2 — intentional over-training
breakthrough
Meta trains Llama-1 and Llama-2 well above Chinchilla-optimal (50+ tokens/parameter), intentionally trading compute-optimality for lower inference cost.
Llama-3 — extreme over-training
Llama-3 is trained on 15T tokens; for the 8B model that is a ratio of ~1875 tokens per parameter, far beyond compute-optimal. The inference-cost-aware training era.
Epoch AI — refit and critique of original Chinchilla
Epoch AI's independent replications suggest Hoffmann's original fits may underestimate optimal D — the effective ratio may exceed 20.
Compute-optimal scaling is a mathematical relationship between N, D, C, and L. It does not depend on a specific hardware architecture so long as FLOPs can be measured.
Chinchilla was trained on DeepMind's TPU clusters. The entire IsoFLOP experiment (>400 runs) was conducted on TPUs.
Later replications (Llama on A100/H100, Mistral) confirm Chinchilla scaling works equally well on GPUs. In practice, all modern GPU-trained LLMs use Chinchilla as a baseline.
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to attend directly to every other position, enabling the model to learn long-range dependencies in a constant number of sequential steps (versus linear in RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), is an active research direction (FlashAttention, sliding window, linear attention, SSM).
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data — next tokens in text, masked words, future video frames, future robot states — without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge" — dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities — such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages — do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better-than-random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include: Chain-of-Thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump — replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale — forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.
| Title | Publisher | Type |
|---|---|---|
| Training Compute-Optimal Large Language Models | DeepMind / NeurIPS 2022 | scientific article |
| Scaling Laws for Neural Language Models (Kaplan et al.) | OpenAI | scientific article |
| Chinchilla's scaling law fits are not as accurate as they seem | Epoch AI | blog |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Meta AI | scientific article |