The induction mechanism consists of two attention heads acting in sequence. Head 1 (the previous-token head) copies information about the preceding token into each position's representation. Head 2 (the induction head proper) uses this copied information to attend to the position just after an earlier occurrence of the current token, then copies that token forward as its prediction, so a sequence [A][B] ... [A] is completed with [B]. The mechanism forms abruptly during a phase change early in training, which coincides with a sharp improvement in in-context learning.
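The two-step mechanism can be made concrete with a hand-built, symbolic sketch. This is not trained weights: tokens stand in for one-hot residual-stream vectors, and the `sharpness` constant and all function names are hypothetical choices for illustration. Head 1 shifts each token one position forward; Head 2 matches the current token against those shifted tokens and copies whatever sits at the matching position.

```python
import math
from typing import List, Optional


def softmax(xs: List[float]) -> List[float]:
    # Numerically stable softmax over a list of attention scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def previous_token_head(tokens: List[int]) -> List[Optional[int]]:
    # Head 1 (previous-token head): position i attends to position i - 1 and
    # copies that token into its own representation. Position 0 has no predecessor.
    return [None] + tokens[:-1]


def induction_head(tokens: List[int], prev_tokens: List[Optional[int]],
                   vocab_size: int, sharpness: float = 10.0) -> List[float]:
    # Head 2 (induction head), evaluated at the last position i.
    # Query: the current token. Key at position j: the token that preceded j,
    # as written by Head 1. A match means tokens[j - 1] == tokens[i], so position j
    # holds the token that followed an earlier occurrence of the current token.
    i = len(tokens) - 1
    query = tokens[i]
    scores = [sharpness if prev_tokens[j] == query else 0.0 for j in range(i + 1)]
    weights = softmax(scores)
    # Value at position j: a one-hot vector for tokens[j]. The head's output is the
    # attention-weighted sum, read here directly as logits over the vocabulary.
    logits = [0.0] * vocab_size
    for j, w in enumerate(weights):
        logits[tokens[j]] += w
    return logits


def predict_next(tokens: List[int], vocab_size: int) -> int:
    # Run Head 1, then Head 2, and return the highest-scoring next token.
    prev_tokens = previous_token_head(tokens)
    logits = induction_head(tokens, prev_tokens, vocab_size)
    return max(range(vocab_size), key=lambda t: logits[t])


# [A][B] ... [A] -> [B]: token 5 was previously followed by 7.
print(predict_next([5, 7, 2, 9, 5], vocab_size=10))  # 7
```

In a real model the match in Head 2 comes from K-composition: its key side reads what Head 1 wrote into the residual stream, rather than a literal equality test as in this sketch.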
The mechanisms behind in-context learning in Transformer models, a key emergent capability prominently demonstrated by GPT-3, are poorly understood.
Induction heads are identified from their attention patterns; without tools such as TransformerLens or activation patching, locating them in a specific model is unreliable.
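One common diagnostic, in the spirit of the prefix-matching test used by Olsson et al., feeds the model a random token block repeated twice and measures how much attention each head places on the token that followed the earlier occurrence of the current token. The sketch below computes that score from a single head's attention pattern. The patterns here are hypothetical hand-written stand-ins; in practice they would be read from a real model (for example from TransformerLens's activation cache). It ignores details such as a leading BOS token and accidental repeats inside the random block.

```python
import random
from typing import List

Pattern = List[List[float]]  # pattern[q][k]: attention weight from query q to key k


def make_repeated_tokens(seq_len: int, vocab_size: int, seed: int = 0) -> List[int]:
    # A random block repeated twice: the second half is only predictable by
    # looking back at the first half, which is exactly what induction heads do.
    rng = random.Random(seed)
    block = [rng.randrange(vocab_size) for _ in range(seq_len)]
    return block + block


def induction_score(pattern: Pattern, seq_len: int) -> float:
    # Query q in the second half holds the token from position q - seq_len.
    # An induction head should attend to key q - seq_len + 1: the token that
    # followed that earlier occurrence. Average that weight over the second half.
    weights = [pattern[q][q - seq_len + 1] for q in range(seq_len, 2 * seq_len)]
    return sum(weights) / len(weights)


def previous_token_pattern(n: int) -> Pattern:
    # Hypothetical previous-token head: each position attends to q - 1
    # (position 0 attends to itself).
    return [[1.0 if k == max(q - 1, 0) else 0.0 for k in range(n)] for q in range(n)]


def ideal_induction_pattern(seq_len: int) -> Pattern:
    # Hypothetical perfect induction head: second-half queries attend to
    # q - seq_len + 1; first-half queries (no earlier repeat) attend to position 0.
    n = 2 * seq_len
    rows = []
    for q in range(n):
        target = q - seq_len + 1 if q >= seq_len else 0
        rows.append([1.0 if k == target else 0.0 for k in range(n)])
    return rows


T = 8
print(induction_score(ideal_induction_pattern(T), T))     # 1.0
print(induction_score(previous_token_pattern(2 * T), T))  # 0.0
```

A head scoring near 1 is only a candidate: the previous-token head scores near 0 here even though the circuit depends on it, and a high score alone does not establish what a head does in ordinary text, which is why interventions such as activation patching are used to confirm the role.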
Large models have hundreds to thousands of attention heads serving different functions. Misclassifying heads as induction heads can lead to incorrect conclusions about model capabilities.
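Elhage et al. (2021) lay the groundwork for mechanistic interpretability of Transformers, describing how attention heads compose (Q-, K-, and V-composition) and first reporting induction heads in small attention-only models.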
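Head composition can be quantified from weights alone. The sketch below uses a normalized Frobenius-norm score of the kind Elhage et al. use, ||W_A W_B||_F / (||W_A||_F ||W_B||_F), and applies it to K-composition, the form an induction head relies on. It assumes a column-vector convention (an earlier head writes y = W_OV x; a later head scores x_q^T W_QK x_k), under which K-composition corresponds to the product W_QK W_OV. The 4-dimensional toy weights and all helper names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain matrix product a @ b.
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def frobenius(a: Matrix) -> float:
    return sum(x * x for row in a for x in row) ** 0.5


def composition_score(w_a: Matrix, w_b: Matrix) -> float:
    # ||W_A W_B||_F / (||W_A||_F ||W_B||_F): roughly, how much of what W_B writes
    # lands in the subspace W_A reads from (0 means none of it does).
    denom = frobenius(w_a) * frobenius(w_b)
    return frobenius(matmul(w_a, w_b)) / denom if denom else 0.0


def k_composition(w_qk_later: Matrix, w_ov_earlier: Matrix) -> float:
    # If the earlier head's output y = W_OV x_src lands on the key side, the later
    # head's score becomes x_q^T (W_QK W_OV) x_src, so K-composition uses W_QK @ W_OV.
    return composition_score(w_qk_later, w_ov_earlier)


def basis_matrix(n: int, ones: List[Tuple[int, int]]) -> Matrix:
    # n x n matrix with 1.0 at each (row, col) in `ones`, zeros elsewhere.
    m = [[0.0] * n for _ in range(n)]
    for i, j in ones:
        m[i][j] = 1.0
    return m


# Earlier head copies residual dims 0-1 into dims 2-3.
w_ov = basis_matrix(4, [(2, 0), (3, 1)])
# Later head whose keys read dims 2-3, where w_ov writes.
w_qk_reads_output = basis_matrix(4, [(0, 2), (1, 3)])
# Control head whose keys read dims 0-1, which w_ov never writes to.
w_qk_ignores_output = basis_matrix(4, [(0, 0), (1, 1)])
print(round(k_composition(w_qk_reads_output, w_ov), 4))  # 0.7071
print(k_composition(w_qk_ignores_output, w_ov))          # 0.0
```

Under the same convention, Q-composition would use W_OV^T W_QK and V-composition the product of the later and earlier heads' W_OV matrices.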
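Olsson et al. (2022) present evidence that induction heads may be the mechanism behind most in-context learning in Transformers, with the strongest (causal) evidence in small attention-only models and mainly correlational evidence in larger ones.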
Follow-up work extends mechanistic interpretability findings to larger language models.