Transformer from Scratch · Optimizations and Modern Variants
MQA, GQA and Lower Inference Cost
Introduction
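You will learn Multi-Query Attention (MQA) and Grouped-Query Attention (GQA): techniques that reduce the cost of the K/V cache during autoregressive generation by sharing key/value heads across query heads. MQA uses a single K/V head for all query heads, while GQA shares each K/V head among a group of query heads. Both keep the full set of query heads; what they give up, fully or partially, is a separate K/V head per query head.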
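A minimal sketch in plain Python with no dependencies illustrates both ideas. First, the K/V cache size depends on the number of K/V heads, not on the number of query heads. Second, in GQA each query head reads the K/V head of its group; MQA is the special case of one K/V head, and standard multi-head attention is the case of one K/V head per query head. The function names (`kv_cache_bytes`, `gqa_decode_step`), the head-to-group mapping `h // group_size`, the fp16 element size and the model dimensions in the usage example are illustrative assumptions, not taken from any particular model or library.

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # Keys and values are both cached (factor 2), for every layer, K/V head,
    # position and sequence in the batch. bytes_per_elem=2 assumes fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_decode_step(q_heads, k_cache, v_cache):
    """Attention output for one newly generated token (hypothetical helper).

    q_heads: n_q query vectors, each of length head_dim.
    k_cache, v_cache: n_kv entries, each a list of cached vectors (one per
        position), assumed to already include the new token's K/V.
    n_kv == n_q is ordinary multi-head attention, n_kv == 1 is MQA, and any
    other divisor of n_q gives GQA.
    """
    n_q, n_kv = len(q_heads), len(k_cache)
    if n_q % n_kv != 0:
        raise ValueError("number of query heads must be a multiple of K/V heads")
    group_size = n_q // n_kv
    outputs = []
    for h, q in enumerate(q_heads):
        g = h // group_size  # K/V head shared by this query head's group
        keys, values = k_cache[g], v_cache[g]
        scale = 1.0 / math.sqrt(len(q))
        # While decoding, every cached position is in the past: no causal mask.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    cfg = dict(n_layers=32, head_dim=128, seq_len=4096)
    for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
        mib = kv_cache_bytes(n_kv_heads=n_kv, **cfg) / 2**20
        print(f"{name}: {n_kv:2d} K/V heads -> {mib:.0f} MiB")
```

With 32 query heads in every case, the script reports 2048 MiB for MHA, 512 MiB for GQA with 8 K/V heads and 64 MiB for MQA: the cache shrinks in direct proportion to the number of K/V heads, while the number of query heads, and therefore the number of attention patterns per layer, stays the same.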