Neural Networks: From Fundamentals to Modern AI · The Attention Mechanism and the Transformer

Self-attention — Query, Key, Value and scaled dot-product attention

Introduction

Scaled dot-product attention (Vaswani et al. 2017) is the operation

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

The three matrices Q, K, V are produced by linear projections of the input X with learned weights W_Q, W_K, W_V, that is, Q = X·W_Q, K = X·W_K, V = X·W_V. Each token therefore plays three roles: it asks (query), is asked (key), and contributes content (value). In self-attention, Q, K and V all come from the same sequence X through different projections; in cross-attention, Q comes from one sequence and K, V from another.

Three key technical questions:

(1) Why dot-product? All scores are computed in a single matrix product Q·K^T, which is fast on a GPU. Additive (Bahdanau) attention instead evaluates a small MLP for every query-key pair.

(2) Why scale by √d_k? If the components of a query q and a key k are independent with zero mean and unit variance, the dot product q·k has variance d_k. Without scaling, the scores grow in magnitude with d_k, softmax saturates (≈ argmax), and gradients vanish. Dividing by √d_k keeps the variance ≈ 1, so softmax stays "soft".

(3) Why separate Q and K rather than a symmetric matrix? Separate projections let the model express asymmetric relations: the score q_i·k_j ("from this look at that") generally differs from q_j·k_i ("from that look at this").

The output is a weighted sum of values: each query is scored against all keys, obtains a probability distribution α_i = softmax_i(score), and returns Σ_i α_i v_i. In effect, attention is a differentiable lookup over the sequence.

Complexity: for a sequence of length n, computing the n × n attention matrix costs O(n^2 · d_k) time and O(n^2) memory. This quadratic growth in n is a fundamental scaling limit and the motivation behind FlashAttention (Dao et al. 2022) and efficient-attention variants.
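The formula and the √d_k argument above can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not a reference implementation: a single head, no masking, no batching, and random matrices standing in for the learned weights. The function names (softmax, scaled_dot_product_attention, self_attention) and the sizes n, d_model, d_k are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and also return the weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) scaled scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted sum of values per query

def self_attention(X, W_Q, W_K, W_V):
    # Q, K, V are three different linear projections of the same sequence X.
    return scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 5 tokens, model width 16, head width 8.
n, d_model, d_k = 5, 16, 8
X = rng.standard_normal((n, d_model))
# Random stand-ins for the learned projection weights.
W_Q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

out, weights = self_attention(X, W_Q, W_K, W_V)
print("output shape:", out.shape)               # one d_k-dim vector per token
print("row sums of weights:", weights.sum(axis=-1))

# Empirical check of the scaling argument: for unit-variance components,
# the raw dot product has variance close to d, the scaled one close to 1.
d = 64
q = rng.standard_normal((10_000, d))
k = rng.standard_normal((10_000, d))
raw = np.sum(q * k, axis=1)
raw_var = raw.var()
scaled_var = (raw / np.sqrt(d)).var()
print("variance of q·k:", raw_var)
print("variance of q·k / sqrt(d):", scaled_var)
```

The weights array is the full n × n attention matrix, which is where the O(n^2) memory cost comes from. Because W_Q and W_K differ, weights[i, j] and weights[j, i] are generally not equal, which is the asymmetry discussed in question (3).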