Step 1: Input X ∈ R^(n×d_model) is linearly projected h times by learnable matrices W^Q_i, W^K_i, W^V_i ∈ R^(d_model × d_k) (typically d_k = d_v = d_model / h), producing h triplets (Q_i, K_i, V_i). Step 2: Each head independently computes head_i = SoftMax(Q_i K_i^T / √d_k) V_i — standard Scaled Dot-Product Attention in lower dimensionality. Step 3: All head outputs are concatenated along the feature dimension: Concat(head_1, …, head_h) ∈ R^(n × h·d_v = d_model). Step 4: The concatenated result passes through a final projection matrix W^O ∈ R^(d_model × d_model), producing MHA(Q, K, V) = Concat(head_1, …, head_h) W^O. In the original Transformer h=8, d_model=512, d_k=d_v=64.
A single Scaled Dot-Product Attention head averages all dependencies into one weighted vector — the model must trade off limited representational capacity between different relation types (syntactic, semantic, coreference, long-range). MHA fixes this by splitting d_model into h parallel subspaces where each head can independently specialize in a different attention pattern.
Three sets of learnable weight matrices W^Q_i, W^K_i, W^V_i ∈ R^(d_model × d_k) for each of the h heads. Project the shared input into h independent subspaces.
h independent Scaled Dot-Product Attention instances running in parallel. Each head can learn a different attention pattern (syntax, semantics, coreference, neighbor positions).
Official
Outputs from all h heads (each ∈ R^(n×d_v)) are concatenated along the feature dimension into a R^(n × h·d_v) tensor.
Learnable matrix W^O ∈ R^(d_model × d_model) that mixes information across heads and matches dimensionality to the rest of the network.
Without dividing the Q K^T dot products by √d_k, softmax saturates at large d_k, gradients vanish and the model fails to learn.
A missing or incorrect upper-triangular mask in the decoder lets the model "see the future" during training — perplexity looks fine but generation is broken.
Confusing dimension order (batch, heads, seq, d_k) vs (batch, seq, heads, d_k) when reshaping/transposing leaks information between positions and heads.
Recomputing K, V for all prior tokens on each new autoregressive step gives O(n³) instead of O(n²) and degrades LLM throughput 10–100×.
Vaswani et al. define MHA with h=8 heads and d_k=64 as the key element of the Transformer architecture, replacing recurrence with full parallelism.
Google (BERT) and OpenAI (GPT-1) prove MHA scales to hundreds of millions of parameters and dominates NLP benchmarks.
Noam Shazeer ("Fast Transformer Decoding") proposes MQA — a single shared K, V across all Q heads, reducing KV cache and accelerating inference.
Tri Dao et al. introduce FlashAttention — exact MHA with SRAM tiling, eliminating the n×n materialization in HBM. 2–4× faster, lower memory footprint.
Ainslie et al. propose GQA — a middle ground between MHA and MQA where groups of Q heads share K, V. Adopted by LLaMA 2, Mistral, LLaMA 3 as the LLM standard from 2023 onward.
Time complexity: O(n² · d_model). Space complexity: O(n² + n · d_model).
The n×n attention weight matrix must be fully materialized (or streamed as in FlashAttention) — this dominates both compute and HBM memory cost at long context lengths.
Every head activates for every token — unlike sparse attention or MoE, all compute paths are always used.
All h heads and all sequence positions are computed in parallel. This is the key difference from RNNs and the reason Transformers scale on GPUs/TPUs.
Number of parallel attention heads. Must divide d_model. More heads = more diverse attention patterns but smaller per-head d_k dimensionality.
Q, K, V dimensionality per head. Conventionally d_k = d_model / h, but this is not a hard requirement — can be decoupled.
Masking of disallowed positions: causal (decoder, GPT), full (encoder, BERT), padding mask, custom (sliding window).
GEMM operations (Q K^T and Attention·V) map perfectly to Tensor Cores; all h heads computed in parallel blocks.
TPU systolic arrays are optimized for dense matrix multiplications — directly leveraged by Transformers on TPU (BERT, T5, PaLM).
Feasible on CPU but O(n²) without hardware support gives low throughput — mainly used for small-model inference.