Architecture

MHA

2017 · Active · Published: 28 May 2026 · Updated: 28 May 2026

Key innovation

Running multiple independent Scaled Dot-Product Attention heads in parallel on linear projections of Q, K, V into lower-dimensional subspaces. This lets the model jointly attend to different representation subspaces (syntactic, semantic, long-range) at a total computational cost similar to that of a single full-dimensional head.

How it works

Step 1: The input X ∈ R^(n×d_model) (which serves as Q, K, and V in self-attention) is linearly projected h times by learnable matrices W^Q_i, W^K_i ∈ R^(d_model × d_k) and W^V_i ∈ R^(d_model × d_v) (typically d_k = d_v = d_model / h), producing h triplets (Q_i, K_i, V_i).

Step 2: Each head independently computes head_i = softmax(Q_i K_i^T / √d_k) V_i, which is standard Scaled Dot-Product Attention at lower dimensionality.

Step 3: All head outputs are concatenated along the feature dimension: Concat(head_1, …, head_h) ∈ R^(n × h·d_v), where h·d_v = d_model.

Step 4: The concatenated result passes through a final projection matrix W^O ∈ R^(d_model × d_model), producing MHA(Q, K, V) = Concat(head_1, …, head_h) W^O.

In the original Transformer, h = 8, d_model = 512, and d_k = d_v = 64.
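The four steps can be sketched in plain Python on toy sizes. This is a minimal illustration rather than a reference implementation: the helper names (`matmul`, `softmax`, `rand_matrix`, `multi_head_attention`), the random weights, and the sizes n = 4, d_model = 8, h = 2 are assumptions chosen for readability, and masking, batching, and biases are left out.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Self-attention over x (n x d_model) with one weight matrix per head."""
    heads = []
    for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v):
        # Step 1: project the shared input into this head's subspace.
        q, k, v = matmul(x, wq_i), matmul(x, wk_i), matmul(x, wv_i)
        d_k = len(q[0])
        # Step 2: scaled dot-product attention inside the subspace.
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
        heads.append(matmul(weights, v))
    # Step 3: concatenate head outputs along the feature dimension.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    # Step 4: the output projection W^O mixes information across heads.
    return matmul(concat, w_o)

rng = random.Random(0)
n, d_model, h = 4, 8, 2  # toy sizes, not the original h=8, d_model=512
d_k = d_model // h
x = rand_matrix(n, d_model, rng)
w_q = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
w_o = rand_matrix(h * d_k, d_model, rng)

out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(len(out), len(out[0]))  # 4 8
```

The output has the same shape as the input (n × d_model), which is what lets an MHA layer be stacked or wrapped in a residual connection.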

Problem solved

A single Scaled Dot-Product Attention head produces only one attention distribution per position, so all dependencies are averaged into one weighted vector and the model must trade off limited representational capacity between different relation types (syntactic, semantic, coreference, long-range). MHA addresses this by splitting d_model into h parallel subspaces where each head can independently specialize in a different attention pattern.

Components

Linear projections Q, K, V (per head) · Generating multiple input representations

Three sets of learnable weight matrices W^Q_i, W^K_i, W^V_i ∈ R^(d_model × d_k) for each of the h heads. Project the shared input into h independent subspaces.

Attention heads · Parallel attention computation in subspaces

h independent Scaled Dot-Product Attention instances running in parallel. Each head can learn a different attention pattern (syntax, semantics, coreference, neighbor positions).

Multi-Query Attention (MQA): All heads share a single K, V pair, which reduces the KV cache during inference.

Grouped-Query Attention (GQA): Groups of Q heads share K, V, a trade-off between MHA and MQA (LLaMA 2/3, Mistral).
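To make the KV-cache saving concrete, the arithmetic below compares the per-token cache size of MHA, GQA, and MQA. The decoder configuration (32 layers, 32 query heads, head dimension 128, 2 bytes per value) is hypothetical and chosen only for illustration; `kv_cache_bytes_per_token` is likewise an illustrative helper, not a library function.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

# Hypothetical decoder config, chosen only for illustration.
n_layers, n_q_heads, d_head = 32, 32, 128

for name, n_kv_heads in [("MHA", n_q_heads), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    size = kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head)
    print(f"{name}: {size // 1024} KiB per token")
# MHA: 512 KiB per token
# GQA (8 KV heads): 128 KiB per token
# MQA: 16 KiB per token
```

The cache scales with the number of K, V heads rather than the number of Q heads, so sharing K, V across heads shrinks it proportionally.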


Head concatenation · Combining parallel head outputs

Outputs from all h heads (each ∈ R^(n×d_v)) are concatenated along the feature dimension into a R^(n × h·d_v) tensor.

Final output projection W^O · Fusing head outputs back to d_model

Learnable matrix W^O ∈ R^(d_model × d_model) that mixes information across heads and matches dimensionality to the rest of the network.
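Adding up the components gives a quick sanity check on the weight count. The sketch below counts weights for one MHA layer, assuming no bias terms and d_k = d_v = d_model / h; `mha_param_count` is an illustrative helper name.

```python
def mha_param_count(d_model, h):
    """Weights in one MHA layer: per-head Q, K, V projections plus W^O (no biases)."""
    d_k = d_v = d_model // h
    qkv = h * (2 * d_model * d_k + d_model * d_v)  # W^Q_i, W^K_i, W^V_i for every head
    out = (h * d_v) * d_model                      # W^O
    return qkv + out

print(mha_param_count(d_model=512, h=8))                   # 1048576
print(mha_param_count(512, 1) == mha_param_count(512, 16))  # True
```

With d_k = d_model / h, the total is 4 · d_model² regardless of h, so adding heads redistributes the same parameter budget across subspaces instead of growing it.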

Implementation

Reference implementations

torch.nn.MultiheadAttention

Python · PyTorch

Official

Hugging Face Transformers — modeling_bert.BertSelfAttention

Python · Hugging Face

FlashAttention

CUDA / Python · Tri Dao et al.

Official

flax.linen.MultiHeadDotProductAttention

Python (JAX) · Google / Flax

Official

Implementation pitfalls

Forgetting to scale by √d_k · Critical

Without dividing the Q K^T dot products by √d_k, softmax saturates at large d_k, gradients vanish, and the model fails to learn.

Fix: Always use scaled dot-product attention; it is an integral part of MHA, not optional.
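The saturation effect is easy to observe numerically. The sketch below draws a random query and keys with unit-variance components (an assumption made for the demonstration, with d_k = 512 and 16 keys picked arbitrarily) and compares the largest softmax weight with and without the 1/√d_k factor.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
d_k, n_keys = 512, 16  # arbitrary demo sizes
q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]  # variance grows with d_k
scaled = [s / math.sqrt(d_k) for s in raw]              # variance stays near 1

peak_raw = max(softmax(raw))
peak_scaled = max(softmax(scaled))
print(f"largest weight without scaling: {peak_raw:.3f}")
print(f"largest weight with scaling:    {peak_scaled:.3f}")
```

Without scaling, nearly all probability mass tends to collapse onto one key, which is the regime where softmax gradients become very small.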

Incorrect causal mask in decoder · Critical

A missing or incorrect upper-triangular mask in the decoder lets the model "see the future" during training: training perplexity looks deceptively good, but generation is broken.

Fix: Verify the mask on a small example by comparing the output for the full sequence against prefix-by-prefix decoding. They should match at every position i.
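One way to run that check is sketched below for a single head. For brevity it assumes q = k = v = x (projections omitted), and the function names and the toy input are illustrative. The same test passes with a causal mask and fails when the mask is missing.

```python
import math

NEG_INF = float("-inf")

def attention(x, mask):
    """Single-head self-attention with q = k = v = x.
    mask[i][j] is True where position i may attend to position j."""
    d_k = len(x[0])
    out = []
    for i, q_i in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) if mask[i][j] else NEG_INF
            for j, k_j in enumerate(x)
        ]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # masked positions become 0.0
        total = sum(exps)
        out.append([sum(e * v_j[c] for e, v_j in zip(exps, x)) / total for c in range(d_k)])
    return out

def causal_mask(n):
    """Lower-triangular mask: position i sees positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """No masking at all; stands in for a forgotten causal mask."""
    return [[True] * n for _ in range(n)]

def matches_prefix_decoding(x, make_mask, tol=1e-9):
    """Compare full-sequence output at position i with the last output on prefix x[:i+1]."""
    full = attention(x, make_mask(len(x)))
    for i in range(len(x)):
        last = attention(x[: i + 1], make_mask(i + 1))[-1]
        if any(abs(a - b) > tol for a, b in zip(full[i], last)):
            return False
    return True

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
print(matches_prefix_decoding(x, causal_mask))  # True
print(matches_prefix_decoding(x, full_mask))    # False
```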

Wrong reshape when splitting/concatenating heads · High

Confusing dimension order (batch, heads, seq, d_k) vs (batch, seq, heads, d_k) when reshaping/transposing leaks information between positions and heads.

Fix: Use named tensors or einsum with clear index notation; test on a sequence with a known attention pattern.
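The bug can be reproduced without a tensor library by indexing a flat row-major buffer directly. In this sketch (function names and the tagging scheme are illustrative), each value encodes its origin as 100 · position + feature index, so any leak between positions is visible in the printed result.

```python
def split_heads(flat, n, h, d_k):
    """View a row-major (n, h*d_k) buffer as (n, h, d_k), then transpose to (h, n, d_k)."""
    return [[[flat[pos * h * d_k + head * d_k + j] for j in range(d_k)]
             for pos in range(n)] for head in range(h)]

def split_heads_wrong(flat, n, h, d_k):
    """Bug: reinterpret the same buffer directly as (h, n, d_k), skipping the transpose."""
    return [[[flat[head * n * d_k + pos * d_k + j] for j in range(d_k)]
             for pos in range(n)] for head in range(h)]

def merge_heads(heads):
    """Inverse of split_heads: (h, n, d_k) back to a row-major (n, h*d_k) buffer."""
    h, n = len(heads), len(heads[0])
    return [v for pos in range(n) for head in range(h) for v in heads[head][pos]]

n, h, d_k = 3, 2, 2
# Tag each value as 100*position + feature index so leaks are easy to spot.
flat = [100 * pos + f for pos in range(n) for f in range(h * d_k)]

good = split_heads(flat, n, h, d_k)
bad = split_heads_wrong(flat, n, h, d_k)
print(good[0])                    # [[0, 1], [100, 101], [200, 201]]
print(bad[0])                     # [[0, 1], [2, 3], [100, 101]]
print(merge_heads(good) == flat)  # True
```

In the correct split, head 0 holds features 0 and 1 of every position. In the buggy version, head 0's second "position" actually contains features 2 and 3 of position 0, so positions and heads are silently mixed while all shapes still look right.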

Improper KV cache in inference · High

Recomputing K, V for all prior tokens on each new autoregressive step makes the total attention cost of generating n tokens O(n³) instead of O(n²) and can degrade LLM throughput by 10–100×.

Fix: Implement a KV cache: keep K, V from previous steps and append only the new token. Consider MQA/GQA for large models.
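A single-head sketch of the difference follows. The weights and token embeddings are hypothetical values for illustration, and the counter tracks only K/V projections in one layer, so it shows the redundant work rather than the full cost of a multi-layer model.

```python
import math

def project(x, w):
    """Multiply one row vector by a weight matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / total for c in range(len(values[0]))]

# Hypothetical single-head weights and token embeddings, for illustration only.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.5], [-0.5, 0.5]]
w_v = [[1.0, 2.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        # Re-project every token seen so far on every step.
        keys = [project(x, w_k) for x in tokens[: t + 1]]
        values = [project(x, w_v) for x in tokens[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(project(tokens[t], w_q), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        # Project only the new token and append it to the cache.
        k_cache.append(project(x, w_k))
        v_cache.append(project(x, w_v))
        projections += 2
        outputs.append(attend(project(x, w_q), k_cache, v_cache))
    return outputs, projections

slow_out, slow_proj = decode_without_cache(tokens)
fast_out, fast_proj = decode_with_cache(tokens)
print(slow_out == fast_out)  # True
print(slow_proj, fast_proj)  # 20 8
```

Both paths produce identical outputs, but the uncached path performs a number of projections that grows quadratically with sequence length, while the cached path grows linearly.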

Evolution

Original paper · 2017 · NeurIPS 2017 · Ashish Vaswani

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

2017

Introduction of Multi-Head Attention in the Transformer

Inflection point

Vaswani et al. define MHA with h=8 heads and d_k=64 as the key element of the Transformer architecture, replacing recurrence with full parallelism.

2018

BERT and GPT-1 — MHA at scale

Google (BERT) and OpenAI (GPT-1) prove MHA scales to hundreds of millions of parameters and dominates NLP benchmarks.

2019

Multi-Query Attention (MQA)

Noam Shazeer ("Fast Transformer Decoding") proposes MQA — a single shared K, V across all Q heads, reducing KV cache and accelerating inference.

Fast Transformer Decoding: One Write-Head is All You Need (paper)

2022

FlashAttention — IO-aware MHA

Inflection point

Tri Dao et al. introduce FlashAttention: exact MHA computed with SRAM tiling, which eliminates the n×n attention-matrix materialization in HBM. Reported as 2–4× faster than standard attention implementations, with a lower memory footprint.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (paper)
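The reason tiling can stay exact is that softmax can be accumulated block by block with a running maximum and running sum. The sketch below illustrates that idea for a single query row in plain Python; it is not the FlashAttention kernel, and the function names, block size, and sample data are assumptions for the demonstration.

```python
import math

def attention_row_direct(q, keys, values):
    """Reference: materialize all scores for one query, then apply softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / total for c in range(len(values[0]))]

def attention_row_tiled(q, keys, values, block=2):
    """Process keys/values block by block, keeping only a running max, sum, and output."""
    run_max, run_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_blk]
        new_max = max(run_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(run_max - new_max)
        run_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]

q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-0.5, 0.5], [1.5, 1.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

direct = attention_row_direct(q, keys, values)
tiled = attention_row_tiled(q, keys, values, block=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(direct, tiled)))  # True
```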

2023

Grouped-Query Attention (GQA) in LLaMA 2

Inflection point

Ainslie et al. propose GQA, a middle ground between MHA and MQA where groups of Q heads share K, V. Adopted by LLaMA 2 (in its larger models), Mistral, and LLaMA 3, it became a common LLM default from 2023 onward.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (paper)