Induction Heads
Identified a specific type of attention head as the mechanistic substrate of in-context learning in Transformer models, linking mechanistic interpretability to emergent behavior.
An induction circuit consists of two attention heads acting in sequence. Head 1, a previous-token head, copies information about the preceding token into each position's representation. Head 2, the induction head, uses this copied information to locate earlier occurrences of the current token, attends to the token that followed each of them, and predicts that the same token will follow again: given a repeated pattern [A][B] … [A], it predicts [B]. The mechanism emerges abruptly during training, in a phase change that coincides with the model's acquisition of in-context learning.
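The lookup rule the circuit implements can be stated in a few lines of code. The sketch below is deliberately hard-coded (the function name and word-level tokenization are made up for illustration); real induction heads realize the same rule softly, through learned attention weights over the context.

```python
# Toy sketch of the idealized induction rule: [A][B] ... [A] -> predict [B].
# Hard-coded lookup for illustration only; real heads implement this softly
# via attention, with Head 1 supplying "what token came before me" features.

def induction_predict(tokens):
    """For each position, predict the token that followed the most recent
    earlier occurrence of the current token (None if there is no match)."""
    predictions = []
    for i, tok in enumerate(tokens):
        pred = None
        for j in range(i - 1, -1, -1):   # scan backwards for the current token
            if tokens[j] == tok:
                pred = tokens[j + 1]     # copy whatever followed it last time
                break
        predictions.append(pred)
    return predictions

tokens = "the cat sat on the mat , the cat".split()
print(induction_predict(tokens)[-1])     # -> "sat": followed the earlier "cat"
```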
Lack of understanding of the mechanisms behind Transformer models' capacity for in-context learning, a key emergent property first demonstrated in GPT-3.
GENESIS · Source paper
In-context Learning and Induction Heads
A Mathematical Framework for Transformer Circuits (Elhage et al.)
breakthrough · Elhage et al. lay groundwork for mechanistic interpretability, identifying attention head composition.
Discovery of Induction Heads (Olsson et al.)
breakthrough · Olsson et al. identify induction heads as the mechanistic source of in-context learning.
Extension to larger models (Nanda et al., Anthropic)
Follow-up work extends mechanistic interpretability findings to larger language models.
BUILT ON
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks for sequence tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so information travels between any two positions in a constant number of steps (rather than the linear path length of RNNs), making long-range dependencies easier to learn.

The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encodings (sinusoidal or learned).

The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). Its main limitation, the quadratic cost of attention in sequence length (O(n²)), remains an active research direction (FlashAttention, sliding-window attention, linear attention, SSMs).
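A minimal sketch of single-head scaled dot-product self-attention in NumPy may make the mechanism concrete. The shapes, weight initialization, and function name here are illustrative assumptions, not a production implementation; a full multi-head layer would run several such heads in parallel and concatenate their outputs.

```python
# Minimal sketch of single-head scaled dot-product self-attention
# (Vaswani et al., 2017). Shapes and initialization are illustrative.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # all-pairs interaction: O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # 5 tokens, d_model = 16
w = [rng.normal(size=(16, 8)) * 0.1 for _ in range(3)]
print(self_attention(x, *w).shape)                   # (5, 8)
```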
Commonly used with
ICL
In-Context Learning (ICL) is the ability of large language models to perform a new task from a handful of examples (called demonstrations or shots) given directly in the prompt, without modifying model weights. The concept was formalized by Brown et al. (2020) in the GPT-3 paper "Language Models are Few-Shot Learners" as a capability that emerges with model scale, demonstrated most clearly in the 175B-parameter GPT-3.

In ICL, the prompt contains k (input, output) pairs demonstrating the task, followed by a new query input. Conditioned on these examples, the model produces output following the demonstration pattern. The number of examples k defines the variants: zero-shot (k=0, natural-language task description only), one-shot (k=1), and few-shot (k=2–32, typically 4–8). Brown et al. showed that GPT-3 175B is competitive with fine-tuned models on many NLP tasks using few-shot prompting alone.

The underlying mechanism of ICL remains an active research topic. The main hypotheses: (1) ICL implements implicit gradient descent in attention activation space (Akyürek et al. 2022, von Oswald et al. 2023); (2) models perform pattern matching over distributions of patterns seen during pretraining (Xie et al. 2022, Bayesian inference framework); (3) ICL relies on induction heads, attention structures that form during pretraining (Olsson et al. 2022, Anthropic). Empirically, demonstration quality, ordering, and even the labels themselves significantly affect performance (Min et al. 2022).

ICL is the foundation of a broader family of prompt-engineering techniques: Chain-of-Thought (Wei et al. 2022) extends ICL with reasoning chains in the demonstrations, instruction tuning (FLAN, T0) strengthens zero-shot ICL, and Retrieval-Augmented Generation dynamically selects demonstrations from a knowledge base. ICL was the dominant paradigm for using LLMs from 2022–2024, before being supplemented by instruction-tuned models that require few or no examples.
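As an illustration of the few-shot format described above, the snippet below assembles a k=3 prompt for a sentiment task. The task, demonstrations, and field labels are invented for the example; any completion-style model would simply be handed the resulting string.

```python
# Illustrative few-shot (k=3) prompt construction for in-context learning.
# The task, demonstrations, and formatting are made-up examples.

demonstrations = [
    ("I loved this movie!", "positive"),
    ("Terribly boring film.", "negative"),
    ("An instant classic.", "positive"),
]
query = "The plot made no sense at all."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model continues the pattern

print(prompt)
```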
| Title | Publisher | Type |
|---|---|---|
| In-context Learning and Induction Heads | Transformer Circuits Thread (Anthropic) | scientific article |
| A Mathematical Framework for Transformer Circuits | Transformer Circuits Thread (Anthropic) | scientific article |