In an MQA attention layer, for input x: all H Q heads are computed independently (Q_i = x · W_Q^i), while K and V are computed once as single projections (K = x · W_K, V = x · W_V) without head dimension. In the attention operation each Q_i head attends to the same shared K and V. Implementation-wise it amounts to broadcasting K and V across the head dimension. The cache stores only one K and one V per token instead of H — H-fold reduction (typically 8-128×).
Standard Multi-Head Attention generates a KV cache size proportional to the number of heads H, making autoregressive inference of long contexts memory-bound and expensive — cost grows linearly with H.
Single W_K matrix projecting x to one Key head, shared across all Q heads.
Single W_V matrix projecting x to one Value head, shared across all Q heads.
H separate W_Q^i matrices producing H separate queries — preserves the multi-dimensional attention space on the query side.
MQA can degrade model quality by 1-3% on benchmarks and complicate convergence, especially when training from scratch.
MQA benefit materializes only at long context or large batch — for short prompts and batch=1 the cache reduction is negligible.
Shazeer identifies KV cache as inference bottleneck and proposes K/V sharing across Q heads.
Google uses MQA in PaLM 540B for faster inference while maintaining quality — first major production deployment.
TII uses MQA in Falcon-40B/180B — first widely available open-source model with MQA.
GQA interpolates between MHA and MQA, offering better quality/memory trade-off — supersedes MQA in newer models.
All Q heads are always active; the difference vs MHA lies in parameter structure, not activation pattern.