Architecture

MQA (Multi-Query Attention)

2019 · Active · Published: 29 May 2026 · Updated: 29 May 2026

Key innovation

A Multi-Head Attention variant in which all H query (Q) heads share a single Key/Value head pair, reducing KV cache size by a factor of H in exchange for a small loss in quality.

Category

Architecture

Abstraction level

Pattern

Operation level

Architecture block · Inference

Use cases

LLM inference with long contexts (Falcon, PaLM)

On-device / edge models where HBM is constrained

Model serving with large batch sizes (more requests per GPU)

Baseline for GQA, which interpolates between MHA and MQA

How it works

In an MQA attention layer, for input x, all H Q heads are computed independently (Q_i = x · W_Q^i), while K and V are each computed once as a single projection (K = x · W_K, V = x · W_V) with no head dimension. In the attention operation every head Q_i attends to the same shared K and V. In implementation terms this amounts to broadcasting K and V across the head dimension. The cache stores one K and one V vector per token in each layer instead of H of each, an H-fold reduction (typically 8-128×).
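The description above can be sketched in NumPy. This is a minimal single-sequence illustration, not a reference implementation: the function name `mqa_attention`, the weight layout, the causal mask, and the 1/sqrt(d_head) scaling (standard scaled dot-product attention, not stated in the text above) are assumptions, and the output projection is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa_attention(x, w_q, w_k, w_v, causal=True):
    """Multi-query attention for one sequence (illustrative shapes).

    x:   (T, d_model)          input tokens
    w_q: (H, d_model, d_head)  one query projection per head
    w_k: (d_model, d_head)     single shared key projection
    w_v: (d_model, d_head)     single shared value projection
    """
    T = x.shape[0]
    d_head = w_k.shape[1]

    q = np.einsum("td,hdk->htk", x, w_q)  # (H, T, d_head): H independent query heads
    k = x @ w_k                           # (T, d_head): no head dimension
    v = x @ w_v                           # (T, d_head): no head dimension

    # K is broadcast across the head dimension: (H, T, d_head) @ (d_head, T) -> (H, T, T)
    scores = q @ k.T / np.sqrt(d_head)
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    out = weights @ v                     # V is broadcast too: (H, T, d_head)
    # k and v are what a KV cache would store: one vector each per token.
    return out, k, v
```

Only `k` and `v`, each of shape (T, d_head), would need to be cached, whereas standard MHA would cache arrays of shape (H, T, d_head).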

Problem solved

Standard Multi-Head Attention keeps a KV cache whose size is proportional to the number of heads H, so both the cache memory and the bandwidth needed to read it at every decoding step grow linearly with H. This makes autoregressive inference over long contexts memory-bound and expensive.
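A rough sense of scale can be obtained from a back-of-the-envelope calculation. The helper `kv_cache_bytes` and the model configuration below are hypothetical and chosen only for illustration; they do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # The leading 2 accounts for storing both K and V.
    # bytes_per_elem=2 assumes a 16-bit cache (fp16 / bf16).
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Hypothetical configuration: 32 layers, 32 query heads, head size 128.
cfg = dict(n_layers=32, d_head=128, seq_len=8192, batch=16)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # MHA: one K/V head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # MQA: a single shared K/V head

print(f"MHA: {mha / 2**30:.1f} GiB, MQA: {mqa / 2**30:.2f} GiB, ratio: {mha // mqa}x")
# prints "MHA: 64.0 GiB, MQA: 2.00 GiB, ratio: 32x"
```

Under these assumptions the cache shrinks by exactly the number of heads, which is the linear dependence on H described above.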

Components

Shared K projection

Single W_K matrix projecting x to one Key head, shared across all Q heads.

Shared V projection

Single W_V matrix projecting x to one Value head, shared across all Q heads.

Independent Q heads

H separate W_Q^i matrices producing H separate queries, so each head can still form its own attention pattern over the shared keys and values.

Implementation

Reference implementations

Hugging Face Transformers — MQA implementations

Python · Hugging Face

Falcon — open-source MQA model

Python · Technology Innovation Institute (TII)

Implementation pitfalls

Quality drop and training instability · High

MQA can degrade model quality by 1-3% on benchmarks and complicate convergence, especially when training from scratch.

Fix: Use GQA with 4-8 groups instead of full MQA. Alternatively, uptrain from an MHA checkpoint (the Ainslie et al. method).
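In the Ainslie et al. recipe, the key and value projection matrices of the MHA checkpoint are mean-pooled into fewer heads, and the model is then trained further for a small fraction of the original budget. The pooling step might look like the following sketch; the function name `mean_pool_kv` and the (H, d_model, d_head) weight layout are assumptions, and the continued training itself is not shown.

```python
import numpy as np

def mean_pool_kv(w_mha, n_groups=1):
    """Average per-head K (or V) projection weights into n_groups heads.

    w_mha: (H, d_model, d_head) weights from a multi-head checkpoint.
    Returns an array of shape (n_groups, d_model, d_head).
    n_groups=1 yields a single shared head (MQA);
    1 < n_groups < H yields grouped heads (GQA).
    """
    n_heads, d_model, d_head = w_mha.shape
    if n_heads % n_groups != 0:
        raise ValueError("number of heads must be divisible by n_groups")
    # Consecutive heads are assigned to the same group, then averaged.
    grouped = w_mha.reshape(n_groups, n_heads // n_groups, d_model, d_head)
    return grouped.mean(axis=1)
```

The same function would be applied separately to the K weights and the V weights of each layer; the query projections are left unchanged.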

No benefit at small batch size · Low

The benefit of MQA materializes only with long contexts or large batches; for short prompts at batch=1 the cache is small to begin with, so the reduction is negligible.

Fix: Profile actual workload characteristics before choosing MQA vs MHA.

Evolution

Original paper · 2019 · arXiv preprint · Noam Shazeer

Fast Transformer Decoding: One Write-Head is All You Need

2019

MQA introduced (Shazeer)

Inflection point

Shazeer identifies the KV cache as the bottleneck of incremental decoding and proposes sharing K/V across Q heads.

Fast Transformer Decoding: One Write-Head is All You Need (paper)

2022

Adoption in PaLM (Google)

Google uses MQA in PaLM 540B for faster inference while maintaining quality — first major production deployment.

2023

Adoption in Falcon

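TII uses MQA in Falcon-7B; the larger Falcon-40B and Falcon-180B use a grouped variant with a small number of KV heads rather than one. Falcon is among the first widely available open-source model families to ship with MQA.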

2023

GQA as generalization (Ainslie et al.)

Inflection point

GQA interpolates between MHA and MQA, offering better quality/memory trade-off — supersedes MQA in newer models.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (paper)

Sources

Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)

PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)