Induction Heads

Identified a specific type of attention head as the mechanistic substrate of in-context learning in Transformer models, linking mechanistic interpretability to emergent behavior.

BUILT ON

Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in „Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which allows every position in a sequence to directly participate in computations involving every other position, enabling the model to learn long-range dependencies in constant (not linear, as in RNNs) time. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing: multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation — quadratic attention complexity with respect to sequence length (O(n²)) — is an active research direction (FlashAttention, sliding window, linear attention, SSM).

GO TO CONCEPT

Commonly used with

ICL

In-Context Learning (ICL) is the ability of large language models to perform a new task from a handful of examples (called demonstrations or shots) given directly in the prompt, without modifying model weights. The concept was formalized by Brown et al. (2020) in the GPT-3 paper "Language Models are Few-Shot Learners" as an emergent capability of models at ≥175B-parameter scale. In ICL, the prompt contains k (input, output) pairs demonstrating the task, followed by a new query input. Conditioned on these examples, the model produces output following the demonstration pattern. The number of examples k defines variants: zero-shot (k=0, natural-language task description only), one-shot (k=1), and few-shot (k=2–32, typically 4–8). Brown et al. showed that GPT-3 175B achieves competitive performance against fine-tuned models on many NLP tasks — using few-shot prompting alone. The underlying mechanism of ICL remains an active research topic. Main hypotheses: (1) ICL implements implicit gradient descent in attention activation space (Akyürek et al. 2022, von Oswald et al. 2023); (2) models perform pattern matching over distributions of patterns seen during pretraining (Xie et al. 2022 — Bayesian inference framework); (3) ICL relies on induction heads — attention structures forming during pretraining (Olsson et al. 2022, Anthropic). Empirically, demonstration quality, ordering, and even labels significantly affect performance (Min et al. 2022). ICL is the foundation of a broader family of prompt-engineering techniques: Chain-of-Thought (Wei et al. 2022) extends ICL with reasoning chains in demonstrations, instruction tuning (FLAN, T0) strengthens zero-shot ICL, and Retrieval-Augmented Generation dynamically selects demonstrations from a knowledge base. ICL became the dominant paradigm for using LLMs from 2022–2024, before being supplemented by instruction-tuned models requiring fewer or no examples.

GO TO CONCEPT

Title	Publisher	Type
In-context Learning and Induction Heads	—	scientific article
A Mathematical Framework for Transformer Circuits	—	scientific article

In-context Learning and Induction Heads

scientific article

A Mathematical Framework for Transformer Circuits

scientific article

Back to technology catalog

Induction Heads

Use cases

How it works

Problem solved

History and evolution

Semantic relations

BUILT ON

Commonly used with

Sources