Inference

KV Cache

2017 · Active · Published: 29 May 2026 · Updated: 29 May 2026

Key innovation

Caching of Key and Value tensors from previous steps of autoregressive decoding, reducing per-token attention cost from O(n²·d) to O(n·d) by eliminating recomputation of K and V for earlier tokens.

How it works

During the prefill phase (prompt processing), the model computes K and V tensors for all input tokens and stores them in a cache buffer per layer and per attention head. In the decode phase, only the new token's Q, K, and V projections are computed; the new K and V are appended to the cache, and attention is computed as softmax(Q_new · K_cacheᵀ / √d_head) · V_cache over the entire cached context. Cache size grows linearly with context length and equals 2 · L · H · d_head · b · n elements (2 for K+V, L layers, H key/value heads, d_head head dimension, b batch size, n sequence length); multiply by the bytes per element, typically 2 in FP16/BF16, to get the size in bytes. The cache is allocated in accelerator HBM and read at every decoding step, which makes memory bandwidth the dominant bottleneck during generation.
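The prefill and decode flow described above can be sketched for a single layer and a single head in plain Python. This is a minimal illustration, not any library's implementation: the class `SingleHeadKVCache`, the helper names, and the tiny 2×2 weight matrices are all hypothetical, and real systems operate on batched tensors on an accelerator.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class SingleHeadKVCache:
    """Hypothetical single-layer, single-head cache (illustration only)."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x):
        """Project one token embedding, append its K/V, attend over the cache."""
        q = matvec(self.Wq, x)
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        return attend(q, self.keys, self.values)

def last_token_without_cache(Wq, Wk, Wv, xs):
    """Baseline: recompute K and V for every token seen so far."""
    q = matvec(Wq, xs[-1])
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(q, keys, values)

# Made-up weights and token embeddings, chosen only for the demonstration.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = SingleHeadKVCache(Wq, Wk, Wv)
cached_outputs = [cache.step(x) for x in tokens]
baseline = last_token_without_cache(Wq, Wk, Wv, tokens)
```

Both paths produce the same output for the last token; the cached path performs one K and one V projection per step, while the baseline repeats them for the whole history.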

Problem solved

Without KV cache, autoregressive decoding in a Transformer requires recomputing Key and Value projections for all previous tokens at every generation step, leading to quadratic complexity in sequence length and making long-text generation computationally infeasible.

Components

K (Key) cache buffer

Tensor of shape [batch, num_heads, seq_len, head_dim] storing Key projections for all previous tokens, per Transformer layer.

V (Value) cache buffer

Tensor of the same shape as K buffer, storing Value projections. Together with K it constitutes the full attention layer context state.

Append operation

Mechanism for appending the newly computed K and V for the current token to the end of the buffer, typically implemented with a preallocated tensor and a write pointer.
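A preallocated buffer with a write pointer might look like the sketch below. The class `PreallocatedKVBuffer` and its methods are hypothetical, it covers one layer and one head only, and it uses nested Python lists where a real implementation would use an accelerator tensor.

```python
class PreallocatedKVBuffer:
    """Hypothetical fixed-capacity K/V buffer for one layer and one head."""

    def __init__(self, max_len, head_dim):
        self.max_len = max_len
        self.head_dim = head_dim
        # Allocate all storage once, up front.
        self.k = [[0.0] * head_dim for _ in range(max_len)]
        self.v = [[0.0] * head_dim for _ in range(max_len)]
        self.pos = 0  # write pointer: index of the next free slot

    def append(self, k_new, v_new):
        """Write one token's K and V in place and advance the pointer."""
        if self.pos >= self.max_len:
            raise OverflowError("KV buffer is full")
        if len(k_new) != self.head_dim or len(v_new) != self.head_dim:
            raise ValueError("vector length must equal head_dim")
        self.k[self.pos][:] = k_new
        self.v[self.pos][:] = v_new
        self.pos += 1

    def view(self):
        """Return only the filled part of the buffer."""
        return self.k[:self.pos], self.v[:self.pos]

buf = PreallocatedKVBuffer(max_len=4, head_dim=2)
buf.append([0.1, 0.2], [1.0, 2.0])
buf.append([0.3, 0.4], [3.0, 4.0])
keys, values = buf.view()
```

Writing in place avoids reallocating and copying the whole cache at every step, at the cost of reserving `max_len` slots whether or not they are used.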

Implementation

Reference implementations

Hugging Face Transformers — DynamicCache / StaticCache

Python · Hugging Face

Official

vLLM — PagedAttention

Python / CUDA · vLLM Project (UC Berkeley)

Official

NVIDIA TensorRT-LLM

C++ / CUDA / Python · NVIDIA

Official

FlashInfer — kernels for LLM serving

CUDA / Python · FlashInfer team

Official

Implementation pitfalls

Memory explosion at long context · High

Cache size grows linearly with context length and can easily exceed available HBM, especially at large batch sizes. Example: Llama-2-70B (80 layers, 8 KV heads via GQA, head dimension 128) at 32k context and batch=8 requires about 86 GB (80 GiB) of cache alone in FP16; with full multi-head attention (64 KV heads) the same setup would need about 8× more.

Fix: Use MQA/GQA for a smaller cache per token. Enable KV cache quantization (INT8/INT4). Use PagedAttention/vLLM for efficient memory management. Consider sliding window attention.
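The cache-size arithmetic for this pitfall can be checked with a few lines of Python. The function name `kv_cache_bytes` is hypothetical; the Llama-2-70B figures (80 layers, 64 query heads, 8 KV heads, head dimension 128) come from the published model configuration, and "32k" is assumed here to mean 32,768 tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """KV cache size in bytes; the leading 2 accounts for K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

GIB = 2 ** 30

# Llama-2-70B with GQA (8 KV heads), FP16, 32,768 tokens, batch of 8.
gqa_fp16 = kv_cache_bytes(80, 8, 128, 32768, 8)
# Same model shape if every one of the 64 heads kept its own K/V (plain MHA).
mha_fp16 = kv_cache_bytes(80, 64, 128, 32768, 8)
# GQA with an INT8 cache (1 byte per element, ignoring scale metadata).
gqa_int8 = kv_cache_bytes(80, 8, 128, 32768, 8, bytes_per_element=1)

print(gqa_fp16 / GIB, mha_fp16 / GIB, gqa_int8 / GIB)  # 80.0 640.0 40.0
```

Under these assumptions GQA alone accounts for an 8× reduction, and INT8 quantization halves the remainder.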

Memory fragmentation in batch serving · High

Traditional allocation of cache as contiguous max_context blocks leads to massive memory waste (60-80%) in continuous batching when sequences have varying lengths.

Fix: Use PagedAttention (vLLM): cache split into fixed blocks (e.g., 16 tokens) allocated on-demand, like virtual memory pages.
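The block-table idea behind paged allocation can be sketched as follows. This is a simplified illustration and not vLLM's actual API: the class `PagedKVAllocator`, its method names, and the bookkeeping dictionaries are hypothetical, and the sketch tracks only slot ownership, not the K/V data itself.

```python
class PagedKVAllocator:
    """Hypothetical block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
slots = [alloc.append_token("request-a") for _ in range(17)]
blocks_in_use = len(alloc.block_tables["request-a"])
```

A 17-token sequence occupies two 16-token blocks here, so at most one partially filled block per sequence is wasted, instead of a full max_context reservation.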

Cache invalidation on system prompt change · Medium

Any modification of the context prefix (system prompt, retrieved documents) invalidates the cache for all tokens from the change point onward, which eliminates the benefit of prompt caching for that part of the context.

Fix: Design prompts with an immutable prefix (system → cached examples → user query). Mid-conversation system messages require re-computing the cache tail.
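Why prompt ordering matters can be seen by measuring how much of a cached token sequence a new request can reuse. The function `reusable_prefix_len` and the token IDs below are made up for illustration; actual serving systems typically match prefixes at block or hash granularity rather than token by token.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by both sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Made-up token ids: system prompt, few-shot examples, then the user query.
system = [101, 102, 103]
examples = [201, 202, 203, 204]
cached = system + examples + [301, 302]

# Same prefix, different user query: the whole prefix is reusable.
stable = reusable_prefix_len(cached, system + examples + [401])
# System prompt edited at its first token: nothing is reusable.
edited = reusable_prefix_len(cached, [999, 102, 103] + examples + [401])
# Query placed before the examples: only the system prompt is reusable.
reordered = reusable_prefix_len(cached, system + [401] + examples)
```

Everything after the first differing token has to be recomputed, so volatile content belongs at the end of the prompt.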

Decode latency dominated by bandwidth · Medium

During decode, the model is memory-bound, not compute-bound — most time is spent reading the cache from HBM. Accelerators with high FLOPS but low memory bandwidth (e.g., some consumer GPUs) are underutilized during decode.

Fix: Choose accelerators with high HBM bandwidth (H100, MI300X). Use speculative decoding to increase compute utilization. Batch requests (continuous batching).
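A rough lower bound on decode step time follows from dividing the bytes that must be read per step by the memory bandwidth. The sketch below is a back-of-the-envelope model, not a benchmark: the function names are hypothetical, the numbers (14 GB of weights, 2 GB of cache per sequence, 1 TB/s bandwidth) are illustrative assumptions rather than measurements of any specific accelerator, and compute time and activation traffic are ignored.

```python
def min_decode_step_seconds(weight_bytes, kv_bytes_per_seq, batch_size,
                            bandwidth_bytes_per_s):
    """Lower bound on one decode step: weights are read once per step,
    while each sequence in the batch reads its own KV cache."""
    total_bytes = weight_bytes + batch_size * kv_bytes_per_seq
    return total_bytes / bandwidth_bytes_per_s

def min_seconds_per_token(weight_bytes, kv_bytes_per_seq, batch_size,
                          bandwidth_bytes_per_s):
    """Step time amortized over the tokens produced in that step."""
    step = min_decode_step_seconds(weight_bytes, kv_bytes_per_seq,
                                   batch_size, bandwidth_bytes_per_s)
    return step / batch_size

WEIGHTS = 14e9    # illustrative: about 7B parameters in FP16
KV_PER_SEQ = 2e9  # illustrative per-sequence cache size
BANDWIDTH = 1e12  # illustrative 1 TB/s

single = min_seconds_per_token(WEIGHTS, KV_PER_SEQ, 1, BANDWIDTH)
batched = min_seconds_per_token(WEIGHTS, KV_PER_SEQ, 8, BANDWIDTH)
```

In this model batching amortizes the weight reads across requests, but the KV cache term grows with batch size, which is why cache size per token eventually limits throughput.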

Evolution

Original paper · 2022 · MLSys 2023 · Reiner Pope

Efficiently Scaling Transformer Inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean

2017

Transformer introduced (Vaswani et al.)

Inflection point

The original Transformer architecture in 'Attention Is All You Need' introduces self-attention. Autoregressive decoder implementations (GPT) quickly adopt K/V caching as an obvious optimization — without formal publication.

Attention Is All You Need (paper)

2019

Multi-Query Attention (Shazeer)

Inflection point

Noam Shazeer in 'Fast Transformer Decoding' identifies KV cache size as the main inference bottleneck and proposes MQA: a single K and V head shared across all Q heads. Reduces cache by a factor of H.

Fast Transformer Decoding: One Write-Head is All You Need (paper)

2022

Formalization in 'Efficiently Scaling Transformer Inference' (Pope et al.)

Inflection point

Pope, Douglas, Chowdhery et al. from Google publish the first detailed analysis of KV cache as the primary inference cost factor at LLM scale. The work formalizes the memory-bound nature of the decode phase.

Efficiently Scaling Transformer Inference (paper)

2023

Grouped Query Attention (Ainslie et al.)

GQA as MHA↔MQA compromise: groups of Q heads share one K/V pair. Standard in Llama-2-70B, Mistral, and most post-2023 models — reduces cache 4-8× without MQA's quality loss.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (paper)
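How grouped heads share K/V can be illustrated with a head-index mapping. The sketch assumes contiguous grouping (consecutive query heads share one K/V head); the function names are hypothetical, and the 64/8 split mirrors Llama-2-70B. MHA and MQA fall out as the two extreme settings.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def cache_reduction_factor(num_q_heads, num_kv_heads):
    """How much smaller the KV cache is compared with one K/V head per query head."""
    return num_q_heads // num_kv_heads

# GQA with 64 query heads and 8 K/V heads: groups of 8 query heads.
gqa_map = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
# MQA: a single K/V head serves every query head.
mqa_map = [kv_head_for_query_head(h, 64, 1) for h in range(64)]
# MHA: each query head has its own K/V head.
mha_map = [kv_head_for_query_head(h, 64, 64) for h in range(64)]
```

Only `num_kv_heads` K/V tensors are stored per layer, so the cache shrinks by the group size while every query head still computes its own attention pattern.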

2023

PagedAttention and vLLM (Kwon et al.)

Inflection point

Kwon et al. introduce PagedAttention, a KV cache paging scheme modeled on OS virtual memory. It nearly eliminates cache fragmentation, makes continuous batching far more memory-efficient, and delivers 2-4× higher throughput in LLM serving than earlier systems at comparable latency.

Efficient Memory Management for Large Language Model Serving with PagedAttention (paper)

2024

Prompt caching in commercial APIs (Anthropic, OpenAI, Google)

Anthropic (August 2024) introduces prompt caching in the Claude API: a KV cache designed to be shared across requests with the same prefix. Google (Gemini context caching) and OpenAI ship comparable features in 2024. Anthropic reports up to 90% cost reduction and up to 85% latency reduction for long repeated contexts.