Robots Atlas

Scaling Laws

Formalized empirical power-law relationships linking model performance to parameter count, data size, and compute budget, enabling performance prediction and optimal resource allocation.

Category
Abstraction level
Operation level
Language model training planning · Compute budget allocation · Model performance prediction · Architecture decisions

For language models, the loss L follows power laws in parameter count N, dataset size D, and compute budget C: L(N) ~ N^(-alpha_N), L(D) ~ D^(-alpha_D), L(C) ~ C^(-alpha_C), where the exponents alpha_N, alpha_D, alpha_C are characteristic of the model family and task. Researchers fit these power laws to experiments at modest N, D, and C, then extrapolate the fits to larger scales.
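As a minimal sketch of how such a fit works (the loss values below are made up for illustration, not real measurements), fitting alpha_N reduces to a least-squares line fit in log-log space:

```python
import numpy as np

# Hypothetical loss measurements at several parameter counts N
# (synthetic numbers, for illustration only).
N = np.array([1e6, 1e7, 1e8, 1e9])
L = np.array([5.0, 3.9, 3.0, 2.3])

# A power law L(N) = c * N^(-alpha_N) is linear in log-log space:
# log L = log c - alpha_N * log N, so a line fit recovers alpha_N.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_N = -slope

# Extrapolate the fitted law one order of magnitude beyond the data.
predicted_loss = np.exp(intercept) * (1e10) ** slope
print(f"alpha_N ~ {alpha_N:.3f}, predicted loss at N=1e10 ~ {predicted_loss:.2f}")
```

The same recipe applies to L(D) and L(C); the practical caveat is that extrapolation assumes the power-law regime continues to hold at the target scale.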

Lack of predictable compute-allocation principles: it was unclear whether it is better to train a large model briefly or a small model for longer, or how many parameters a given compute budget warrants.

GENESIS · Source paper

Scaling Laws for Neural Language Models
arXiv 2020 · Jared Kaplan, Sam McCandlish, Tom Henighan et al.
2020

Scaling Laws for Neural Language Models (OpenAI)

breakthrough

Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.

2022

Chinchilla scaling laws by Hoffmann et al.

breakthrough

Hoffmann et al. show GPT-3-era models were undertrained: for compute-optimal training, parameters N and training tokens D should be scaled up roughly equally with compute.
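The Chinchilla result is often summarized as a rule of thumb: with training compute C ~ 6*N*D FLOPs, compute-optimal training uses roughly 20 tokens per parameter, so N and D grow together as ~C^0.5. A sketch under that approximation (the factor 20 is a rounded rule of thumb, not an exact fit):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal (N, D) under C = 6*N*D and D = 20*N."""
    # C = 6 * N * D and D = tokens_per_param * N
    # => C = 6 * tokens_per_param * N^2 => N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.76e23 FLOP budget, roughly Chinchilla's training scale,
# recovers about 70B parameters and 1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Doubling the compute budget under this rule increases both N and D by ~sqrt(2), rather than putting all of the extra compute into a larger model.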

2023

Scaling laws for specific domains and modalities

Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.

Commonly used with

Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier recurrent (RNN, LSTM) and convolutional (CNN) approaches for sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are connected in a constant number of steps rather than the linearly many required by RNNs.

The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned).

Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). Its main limitation, attention cost that is quadratic in sequence length (O(n²)), is an active research direction (FlashAttention, sliding-window attention, linear attention, SSMs).
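The core operation described above can be sketched in a few lines. This is a single-head, unmasked scaled dot-product self-attention in NumPy (dimensions and weights are arbitrary illustration values, not from any real model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (no mask)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every position scores against every other: this n x n matrix
    # is the source of the O(n^2) cost noted above.
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8  # sequence length 4, model width 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

A full Transformer block wraps this in multiple heads, adds a feed-forward network, and applies residual connections and LayerNorm around both sublayers.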

GO TO CONCEPT