During the prefill phase (prompt processing), the model computes K and V tensors for all input tokens and stores them in a cache buffer per layer and per attention head. In the decode phase, for each new token only its own Q, K, V projections are computed; new K and V are appended to the cache, and attention is computed as Q_new · K_cacheᵀ over the entire cached context. Cache size grows linearly with context length and equals 2 · L · H · d_head · b · n bytes (2 for K+V, L layers, H heads, d_head dimension, b batch, n length) — typically in FP16/BF16 precision. The cache is allocated in accelerator HBM and read at every decoding step, making memory bandwidth the dominant bottleneck during generation.
Without KV cache, autoregressive decoding in a Transformer requires recomputing Key and Value projections for all previous tokens at every generation step, leading to quadratic complexity in sequence length and making long-text generation computationally infeasible.
Tensor of shape [batch, num_heads, seq_len, head_dim] storing Key projections for all previous tokens, per Transformer layer.
Tensor of the same shape as K buffer, storing Value projections. Together with K it constitutes the full attention layer context state.
Mechanism for appending newly computed K and V for the current token to the end of the buffer — typically implemented via preallocated tensor and write pointer.
Cache size grows linearly with context length and can easily exceed available HBM, especially at large batch sizes. Example: Llama-2-70B at 32k context and batch=8 requires ~160 GB of cache alone.
Traditional allocation of cache as contiguous max_context blocks leads to massive memory waste (60-80%) in continuous batching when sequences have varying lengths.
Any modification of the context prefix (system prompt, retrieved documents) invalidates the cache for all tokens from the change point — eliminates prompt caching benefits.
During decode, the model is memory-bound, not compute-bound — most time is spent reading the cache from HBM. Accelerators with high FLOPS but low memory bandwidth (e.g., some consumer GPUs) are underutilized during decode.
The original Transformer architecture in 'Attention Is All You Need' introduces self-attention. Autoregressive decoder implementations (GPT) quickly adopt K/V caching as an obvious optimization — without formal publication.
Noam Shazeer in 'Fast Transformer Decoding' identifies KV cache size as the main inference bottleneck and proposes MQA: a single K and V head shared across all Q heads. Reduces cache by a factor of H.
Pope, Douglas, Chowdhery et al. from Google publish the first detailed analysis of KV cache as the primary inference cost factor at LLM scale. The work formalizes the memory-bound nature of the decode phase.
GQA as MHA↔MQA compromise: groups of Q heads share one K/V pair. Standard in Llama-2-70B, Mistral, and most post-2023 models — reduces cache 4-8× without MQA's quality loss.
Kwon et al. introduce PagedAttention — KV cache paging modeled on OS virtual memory. Eliminates cache fragmentation, enables continuous batching, and 2-4× higher throughput in LLM serving.
Anthropic (August 2024) introduces prompt caching in Claude API — KV cache designed for sharing across requests with the same prefix. OpenAI and Google follow. Up to 90% cost and latency reduction for repeated contexts.
During decode with active KV cache, the dominant cost is reading the cache from HBM at every step — a memory-bound, not compute-bound operation. Cache size also limits maximum batch size and context length.
Data type for stored K and V. Standard is FP16/BF16; quantization to INT8/INT4 (KV cache quantization) reduces cache size 2-4× at the cost of minor quality loss.
Upper bound on cache size per inference session — directly determines maximum prompt + generation length.
Strategy for reclaiming memory when cache approaches the limit — e.g., sliding window (StreamingLLM, H2O), attention sink, no eviction.
KV cache does not change the model's activation pattern — all heads and layers are active. It only modifies the K/V computation reuse strategy across decoding steps.