Architecture

GQA

2023

Updated: 4 May 2026

Key innovation

Balances the trade-off between Multi-Head Attention (higher quality) and Multi-Query Attention (faster inference): query heads are divided into groups, and each group shares a single key head and value head. This shrinks the KV cache with little quality loss relative to MHA.

How it works

Instead of one KV head per query head (MHA) or a single KV head shared by all query heads in a layer (MQA), GQA divides the query heads into G groups, and each group shares one key head and one value head. E.g., 8 query heads in 2 groups of 4 → only 2 KV heads per layer instead of 8.
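The grouping can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a single sequence with no batch dimension, no causal mask, and query heads assigned to groups in consecutive order. The function name and tensor layout are chosen for illustration only.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (H, T, d) query heads; k, v: (G, T, d) shared KV heads. Returns (H, T, d)."""
    num_q_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("number of query heads must be divisible by number of KV heads")
    group_size = num_q_heads // num_kv_heads

    # Repeat each KV head so that query heads g*group_size .. (g+1)*group_size - 1
    # all attend with KV head g (assumed consecutive grouping).
    k_expanded = np.repeat(k, group_size, axis=0)  # (H, T, d)
    v_expanded = np.repeat(v, group_size, axis=0)  # (H, T, d)

    # Standard scaled dot-product attention per query head.
    scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)  # (H, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_expanded

# The example from the text: 8 query heads in 2 groups of 4, so only 2 KV heads.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))
k = rng.standard_normal((2, 5, 16))
v = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(q, k, v)
```

Only k and v (2 heads each) would need to be cached here. The repeated copies are a convenience for the matrix product and need not be stored.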

Problem solved

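Multi-Head Attention stores a separate key head and value head for every query head, so the KV cache grows with the number of heads (as well as with layers, sequence length, and batch size), and loading it becomes a memory-bandwidth bottleneck during autoregressive decoding. GQA cuts KV-cache memory by a factor of H/G, where H is the number of query heads and G is the number of groups.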
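To make the memory saving concrete, here is a back-of-the-envelope estimate. The model configuration below (32 layers, 32 query heads, head dimension 128, 4096 cached tokens, 2 bytes per value) is hypothetical and chosen only to give round numbers. The formula counts cached keys and values and ignores any other runtime overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, 16-bit values, one sequence of 4096 tokens.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(num_layers=32, num_kv_heads=1, head_dim=128, seq_len=4096)

for name, size in [("MHA (32 KV heads)", mha), ("GQA (8 KV heads)", gqa), ("MQA (1 KV head)", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads: 8 KV heads need a quarter of the MHA cache.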

Implementation

Implementation pitfalls

Choosing group count G requires experimentation (Medium)

G=1 is MQA (maximum savings, largest quality loss); G=H, where H is the number of query heads, is MHA (no savings). The optimal G depends on task and model size; no universal rule exists.
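A small helper can list the values of G worth trying. It assumes equal-sized groups, so G must divide the number of query heads; the function name is hypothetical. It only enumerates the search space and says nothing about which G gives the best quality.

```python
def candidate_group_counts(num_q_heads):
    """Return every G that splits the query heads into equal-sized groups."""
    return [g for g in range(1, num_q_heads + 1) if num_q_heads % g == 0]

num_q_heads = 8
for g in candidate_group_counts(num_q_heads):
    # KV-cache size relative to MHA is G / H.
    print(f"G={g}: {num_q_heads // g} query heads per group, "
          f"KV cache = {g}/{num_q_heads} of MHA")
```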

Incompatibility with MHA checkpoints during fine-tuning (Medium)

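Models pre-trained with MHA cannot be directly fine-tuned as GQA: the key and value projection weights must first be converted, e.g. by mean-pooling the heads within each group (the approach used by Ainslie et al.) or by pruning to a single head per group. This dedicated conversion step is followed by additional training (uptraining) so the model can adapt to the shared KV heads.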
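The mean-pooling conversion can be sketched as follows. This is a minimal illustration under stated assumptions: the key or value projection is a matrix of shape (d_model, H * head_dim) with each head occupying a contiguous block of columns, and consecutive heads form a group. Real checkpoints may use a different layout (for example transposed or fused QKV weights), so the reshape would need adapting. The function name is hypothetical.

```python
import numpy as np

def mean_pool_kv_heads(weight, num_q_heads, num_kv_heads):
    """Convert an MHA key or value projection into a GQA one by averaging heads.

    weight: (d_model, num_q_heads * head_dim), heads stored as contiguous
            column blocks (assumed layout).
    Returns an array of shape (d_model, num_kv_heads * head_dim).
    """
    d_model, total_dim = weight.shape
    if total_dim % num_q_heads != 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("head counts must divide the projection evenly")
    head_dim = total_dim // num_q_heads
    group_size = num_q_heads // num_kv_heads

    # Split columns into (group, head within group, head_dim), then average
    # over the heads that belong to the same group.
    grouped = weight.reshape(d_model, num_kv_heads, group_size, head_dim)
    return grouped.mean(axis=2).reshape(d_model, num_kv_heads * head_dim)

# Toy example: 4 MHA heads of dimension 2 pooled into 2 KV heads.
w_k = np.arange(3 * 8, dtype=float).reshape(3, 8)
w_k_gqa = mean_pool_kv_heads(w_k, num_q_heads=4, num_kv_heads=2)
```

The same function would be applied to both the key and the value projection of every layer. The converted weights are only a starting point for further training.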

Sources

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)