Scaling Laws
Empirical power-law relationships that link model performance to parameter count, dataset size, and compute budget, enabling performance prediction and optimal resource allocation.
For language models, loss scales as L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), and L(C) ∝ C^(-α_C), where N is parameter count, D is dataset size in tokens, C is training compute, and the exponents α are characteristic of the model family and task. Researchers fit these power laws to experimental results at various N, D, and C, then extrapolate to larger scales.
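A minimal sketch of how such a fit might look in practice (the measurements and fitted constants below are invented for illustration, not taken from the source paper): measure loss at several small model sizes, fit the power law as a straight line in log-log space, and extrapolate to a larger N.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) measurements at small scales.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
loss = np.array([5.1, 4.6, 4.1, 3.7, 3.3, 3.0])

# A power law L(N) = (N_c / N)**alpha_N is linear in log-log space:
# log L = -alpha_N * log N + alpha_N * log N_c, so fit it by linear regression.
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)
print(f"alpha_N ~= {alpha_N:.3f}, N_c ~= {N_c:.2e}")

# Extrapolate the fitted law to a much larger model.
print(f"predicted loss at 10B parameters: {(N_c / 1e10) ** alpha_N:.2f}")
```

The same procedure applies to L(D) and L(C), with dataset size or compute on the x-axis.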
Before scaling laws, compute allocation lacked predictable principles: it was unclear whether training a large model briefly or a smaller model for longer is optimal, or how many parameters a given compute budget warrants.
GENESIS · Source paper
Scaling Laws for Neural Language Models (OpenAI)
Breakthrough: Kaplan et al. establish power-law relationships between compute, data, parameters, and loss.
Chinchilla scaling laws by Hoffmann et al.
Breakthrough: Hoffmann et al. show GPT-3-era models were significantly undertrained; at the compute optimum, parameters N and training tokens D should be scaled up in roughly equal proportion, as in the sketch below.
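A rough sketch of what the Chinchilla allocation implies, assuming the commonly cited approximations of C ≈ 6·N·D training FLOPs and about 20 training tokens per parameter at the optimum (these constants are the usual rules of thumb associated with the paper, not exact fitted coefficients):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training compute budget into parameters N and tokens D."""
    # Assumptions: C = 6 * N * D and D = tokens_per_param * N at the optimum,
    # so N = sqrt(C / (6 * tokens_per_param)) and D = tokens_per_param * N.
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly the compute budget of a Chinchilla-scale (~70B-parameter) run.
n, d = chinchilla_optimal(5.76e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```

Because both N and D grow like the square root of C under these assumptions, doubling compute means making the model larger and training it on more tokens in roughly equal measure.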
Scaling laws for specific domains and modalities
Researchers extend scaling laws to code, vision, multimodal, and reasoning tasks.
Commonly used with
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequence tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are connected through a constant number of steps rather than the linear path length of RNNs. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). Its main limitation, the quadratic O(n²) cost of attention in sequence length, is an active research direction (FlashAttention, sliding-window attention, linear attention, SSMs).
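A minimal single-head scaled dot-product self-attention sketch (illustrative only, not a full Transformer; the shapes and random inputs are arbitrary), showing where the quadratic cost in sequence length comes from:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # (seq_len, seq_len) score matrix: every position attends to every other,
    # which is the source of the O(n^2) cost in sequence length.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # (seq_len, d_head)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (8, 4)
```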
| Title | Publisher | Type |
|---|---|---|
| Scaling Laws for Neural Language Models | — | scientific article |
| Training Compute-Optimal Large Language Models (Chinchilla) | — | scientific article |