Scaling Laws
Empirical power-law relationships that link model performance to parameter count, dataset size, and compute budget, enabling performance prediction and optimal resource allocation.
For language models, loss scales as L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), and L(C) ∝ C^(-α_C), where N is parameter count, D is dataset size in tokens, C is training compute, and the exponents α are characteristic of the model family and task. Researchers fit these power laws to experimental results at various N, D, and C, then extrapolate to larger scales.
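A minimal sketch of how such a fit might look in practice (the measurements and fitted constants below are invented for illustration, not taken from the source paper): measure loss at several small model sizes, fit the power law as a straight line in log-log space, and extrapolate to a larger N.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) measurements at small scales.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
loss = np.array([5.1, 4.6, 4.1, 3.7, 3.3, 3.0])

# A power law L(N) = (N_c / N)**alpha_N is linear in log-log space:
# log L = -alpha_N * log N + alpha_N * log N_c, so fit it by linear regression.
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)
print(f"alpha_N ~= {alpha_N:.3f}, N_c ~= {N_c:.2e}")

# Extrapolate the fitted law to a much larger model.
print(f"predicted loss at 10B parameters: {(N_c / 1e10) ** alpha_N:.2f}")
```

The same procedure applies to L(D) and L(C), with dataset size or compute on the x-axis.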
Before scaling laws, compute allocation lacked predictable principles: it was unclear whether training a large model briefly or a smaller model for longer is optimal, or how many parameters a given compute budget warrants.
GENESIS · Source paper
Scaling Laws for Neural Language Models (OpenAI)
Breakthrough: Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.
Chinchilla scaling laws by Hoffmann et al.
Breakthrough: Hoffmann et al. show GPT-3-era models were significantly undertrained; at the compute optimum, parameters N and training tokens D should be scaled up in roughly equal proportion, as in the sketch below.
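A rough sketch of what the Chinchilla allocation implies, assuming the commonly cited approximations of C ≈ 6·N·D training FLOPs and about 20 training tokens per parameter at the optimum (these constants are the usual rules of thumb associated with the paper, not exact fitted coefficients):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training compute budget into parameters N and tokens D."""
    # Assumptions: C = 6 * N * D and D = tokens_per_param * N at the optimum,
    # so N = sqrt(C / (6 * tokens_per_param)) and D = tokens_per_param * N.
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the compute budget of a Chinchilla-scale (~70B-parameter) run.
n, d = chinchilla_optimal(5.76e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```

Because both N and D grow like the square root of C under these assumptions, doubling compute means making the model larger and training it on more tokens in roughly equal measure.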
Scaling laws for specific domains and modalities
Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.
Commonly used with
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequence tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are connected through a constant number of steps rather than the linear path length of RNNs. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). Its main limitation, the quadratic O(n²) cost of attention in sequence length, is an active research direction (FlashAttention, sliding-window attention, linear attention, SSMs).
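A minimal single-head scaled dot-product self-attention sketch (illustrative only, not a full Transformer; the shapes and random inputs are arbitrary), showing where the quadratic cost in sequence length comes from:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # (seq_len, seq_len) score matrix: every position attends to every other,
    # which is the source of the O(n^2) cost in sequence length.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # (seq_len, d_head)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (8, 4)
```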
| Title | Publisher | Type |
|---|---|---|
| Scaling Laws for Neural Language Models | — | scientific article |
| Training Compute-Optimal Large Language Models (Chinchilla) | — | scientific article |