Instead of one KV pair per head (MHA) or a single KV pair shared by all heads (MQA), GQA partitions the query heads into groups, and each group shares one KV pair. E.g., 8 heads in 2 groups of 4 need only 2 KV pairs instead of 8.
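The grouping can be sketched as a mapping from query head index to KV head index. This is a minimal illustration, assuming contiguous grouping (consecutive query heads share a group); the function names are hypothetical and not taken from any library.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_groups: int) -> int:
    """Return the index of the KV pair that a given query head reads from.

    Assumes contiguous grouping: heads 0..k-1 form group 0, heads k..2k-1 form group 1, etc.
    """
    if num_query_heads % num_kv_groups != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_groups")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head out of range")
    heads_per_group = num_query_heads // num_kv_groups
    return query_head // heads_per_group


def kv_head_assignment(num_query_heads: int, num_kv_groups: int) -> list[int]:
    """List the KV pair index used by every query head."""
    return [
        kv_head_for_query_head(h, num_query_heads, num_kv_groups)
        for h in range(num_query_heads)
    ]


# 8 query heads in 2 groups of 4: only 2 distinct KV pairs are needed.
print(kv_head_assignment(8, 2))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With a single group every head maps to KV pair 0, and with as many groups as heads each head keeps its own KV pair.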
Multi-Head Attention requires a separate key-value pair per head, which is memory-intensive. GQA reduces KV cache memory usage by grouping query heads.
With G KV groups and H query heads: G=1 is MQA (maximum savings, quality loss), G=H is MHA (no savings). The optimal G depends on task and model size; no universal rule exists.
Models pre-trained with MHA cannot be directly fine-tuned as GQA: the KV head weights must first be converted (e.g. by averaging or pruning) in a dedicated conversion step.
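A minimal sketch of the averaging variant of such a conversion, using NumPy. The weight layout `(num_heads, head_dim, d_model)` and the function name `mean_pool_kv_heads` are assumptions for illustration; real checkpoints often store K and V projections as fused matrices that would need reshaping first, and the converted model would typically still need further training.

```python
import numpy as np


def mean_pool_kv_heads(weight: np.ndarray, num_kv_groups: int) -> np.ndarray:
    """Average per-head K (or V) projection weights within each group.

    weight: array of shape (num_heads, head_dim, d_model)  (assumed layout).
    Returns an array of shape (num_kv_groups, head_dim, d_model).
    """
    num_heads = weight.shape[0]
    if num_heads % num_kv_groups != 0:
        raise ValueError("num_heads must be divisible by num_kv_groups")
    heads_per_group = num_heads // num_kv_groups
    # Split the head axis into (group, head-within-group), then average within each group.
    grouped = weight.reshape(num_kv_groups, heads_per_group, *weight.shape[1:])
    return grouped.mean(axis=1)


# Hypothetical example: 8 MHA key heads pooled into 2 GQA key heads.
rng = np.random.default_rng(0)
w_k = rng.normal(size=(8, 64, 512))
w_k_gqa = mean_pool_kv_heads(w_k, num_kv_groups=2)
print(w_k_gqa.shape)  # (2, 64, 512)
```

The same function would be applied separately to the key and value projections of every layer; the query and output projections are left unchanged.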
GQA reduces KV cache size, which is particularly valuable on GPUs with limited VRAM for long contexts (128k+ tokens). It is natively supported by FlashAttention-2 and vLLM.
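To make the saving concrete, the cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The sketch below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, fp16) that is an assumption for illustration, not a specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # fp16 / bf16
) -> int:
    """Estimate KV cache size in bytes (keys plus values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element


GIB = 2**30
SEQ_LEN = 128 * 1024  # 128k tokens

# Hypothetical model: 32 layers, head_dim 128, 32 query heads.
for name, kv_heads in [("MHA (G=32)", 32), ("GQA (G=8)", 8), ("MQA (G=1)", 1)]:
    size = kv_cache_bytes(num_layers=32, num_kv_heads=kv_heads, head_dim=128, seq_len=SEQ_LEN)
    print(f"{name}: {size / GIB:.0f} GiB")
# MHA (G=32): 64 GiB
# GQA (G=8): 16 GiB
# MQA (G=1): 2 GiB
```

The cache scales linearly with the number of KV heads, so going from 32 to 8 KV heads cuts it by a factor of 4 for the same context length.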