Neural Networks: From Fundamentals to Modern AI · The Attention Mechanism and the Transformer

Multi-head attention and positional encoding

Introduction

Multi-head attention extends a single scaled dot-product attention operation to h parallel "heads". The input X is projected h times into queries, keys, and values (Q, K, V), each of dimension d_k = d_v = d_model / h. Each head runs its own scaled dot-product attention, and the head outputs are concatenated and passed through a final projection W_O ∈ R^{h·d_v × d_model}.

The motivation (Vaswani et al. 2017) is that a single head has to average all relation types into one attention distribution, whereas h heads let the network learn different roles in parallel: one head may track coreference, another relative position, yet another syntactic agreement. Empirically, many heads are redundant (Voita et al. 2019 and Michel et al. 2019 show that a large fraction of heads can be pruned with little or no BLEU loss), but h > 1 still gives a clear gain over h = 1.

The second topic of this lesson is positional information. Pure self-attention is permutation-equivariant: permuting the input tokens only permutes the outputs in the same way. Without a positional signal, the model therefore sees "cat ate fish" and "fish ate cat" as the same set of tokens and cannot tell them apart. Positional encoding solves this by adding position information to the token embeddings.

Vaswani et al. chose a sinusoidal encoding, which is deterministic and defined for every position, including positions beyond those seen in training. For position pos and dimension index i: PE(pos, 2i) = sin(pos / 10000^{2i/d_model}), PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}). BERT and GPT-2 instead use learned positional embeddings (a parameter table with one row per position).

Newer models encode relative rather than absolute position. T5 adds a learned relative position bias to the attention scores. LLaMA and Mistral use Rotary Position Embeddings (RoPE, Su et al. 2021), which encode position directly into Q and K by rotating pairs of their dimensions (equivalently, multiplying complex numbers by a position-dependent phase), so that the query-key score depends on position only through the relative offset. This tends to extrapolate better to longer contexts. ALiBi (Press et al. 2022) goes further: it uses no positional embeddings at all, only a linear penalty on the attention scores that grows with the distance |i-j|, with a separate slope per head.
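The multi-head computation described above can be sketched in plain Python, using lists of rows as matrices. This is a minimal illustration, not a reference implementation: the helper names (matmul, attention, init_params, multi_head_attention), the random initialisation, and the toy sizes are all assumptions made for the example, and masking, biases, and batching are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Hypothetical initialiser: Gaussian entries scaled by 1/sqrt(rows)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]


def init_params(d_model, h, seed=0):
    """Per-head W_Q, W_K, W_V (d_model x d_k) and W_O (h*d_v x d_model)."""
    rng = random.Random(seed)
    d_k = d_model // h  # d_k = d_v = d_model / h
    heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
             for _ in range(h)]
    w_o = random_matrix(h * d_k, d_model, rng)
    return heads, w_o


def multi_head_attention(x, heads, w_o):
    """Run every head on x, concatenate along features, then apply W_O."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # For each token, join its h output vectors of size d_v into one row.
    concat = [sum((out[t] for out in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)


d_model, h, n_tokens = 8, 2, 3
heads, w_o = init_params(d_model, h)
rng = random.Random(1)
x = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_tokens)]
y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 3 8
```

Because no positional signal is added, reordering the rows of x only reorders the rows of y in the same way, which is the permutation equivariance discussed in the text.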
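The sinusoidal formula can be turned into a small table-building function. The sketch below assumes an even d_model and uses a hypothetical function name, sinusoidal_encoding; it follows the sin/cos definition given above, with d taken to be d_model.

```python
import math


def sinusoidal_encoding(n_positions, d_model):
    """Return an n_positions x d_model table; d_model is assumed even."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table


pe = sinusoidal_encoding(n_positions=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Row pos of this table would be added element-wise to the embedding of the token at position pos. Since the table is computed rather than learned, it can be extended to any number of positions.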
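The rotation idea behind RoPE can be illustrated on a single query and key vector. The following is a sketch under stated assumptions: the function name rope is hypothetical, the base of 10000 mirrors the sinusoidal encoding, and consecutive dimensions (2i, 2i+1) are paired. Some implementations pair dimensions differently, so this should not be read as any particular model's code.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * base^(-2i/d)."""
    d = len(vec)  # assumed even
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]
score_a = dot(rope(q, 5), rope(k, 2))      # query at 5, key at 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # same offset, shifted by 100
print(abs(score_a - score_b) < 1e-9)  # True
```

The two scores agree because rotating q by position m and k by position n leaves a dot product that depends only on m - n. That is the sense in which RoPE encodes relative position.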
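The ALiBi bias is simple enough to write out directly. In this sketch the function names alibi_slopes and alibi_bias are hypothetical, and the slope schedule (a geometric sequence starting at 2^(-8/n) for n heads, with n a power of two) is assumed from the ALiBi paper rather than stated in this lesson. The bias uses the symmetric distance |i-j| as written above; in a causal decoder only the entries with j ≤ i would be used.

```python
def alibi_slopes(n_heads):
    """Slopes 2^(-8/n), 2^(-16/n), ...; assumes n_heads is a power of two."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(n_tokens, slope):
    """bias[i][j] = -slope * |i - j|, to be added to pre-softmax scores."""
    return [[-slope * abs(i - j) for j in range(n_tokens)]
            for i in range(n_tokens)]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
print(slopes[0])   # 0.5
print(bias[3][0])  # -1.5
```

Each head gets its own slope, so heads with steep slopes attend mostly to nearby tokens while heads with shallow slopes can look further back. No position vectors are added to the embeddings.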