The KV cache is divided into fixed-size physical blocks in HBM (typically 16 tokens per block per layer). Each sequence in a batch has a block table that maps logical block indices to physical block addresses. Blocks are allocated on demand as the sequence grows, so nothing is preallocated up to max_context. A custom CUDA attention kernel reads the non-contiguous cache through the block table. Prefix sharing uses per-block reference counts: sequences with a common prefix point to the same physical blocks, and a write to a shared block triggers copy-on-write. Because allocation is per block and no contiguous region has to be reserved, continuous batching can add and remove sequences between iterations without restarting the batch.
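The block-table mechanism can be illustrated with a minimal sketch. It assumes a block size of 16 tokens and uses plain integer IDs in place of physical HBM addresses; `Sequence`, `append_token`, `slot_of`, and `release` are hypothetical names for illustration, not the API of any real server.

```python
BLOCK_SIZE = 16  # tokens per block (assumed; the typical value mentioned above)


class Sequence:
    """Toy sequence that grows its block table one block at a time."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block IDs
        self.block_table = []           # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.free_blocks.pop())
        self.num_tokens += 1

    def slot_of(self, token_idx):
        """Translate a token position into (physical block, offset in block)."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def release(self):
        # A finished sequence returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_table)
        self.block_table = []
        self.num_tokens = 0


free_blocks = list(range(8))          # 8 physical blocks with IDs 0..7
a, b = Sequence(free_blocks), Sequence(free_blocks)
for _ in range(17):
    a.append_token()                  # 17 tokens occupy 2 blocks
for _ in range(5):
    b.append_token()                  # 5 tokens occupy 1 block

print(a.block_table, b.block_table, a.slot_of(16))
```

Because `release` hands blocks back as soon as a sequence finishes, a scheduler built on this idea could admit a new sequence at the next iteration, which is what makes continuous batching cheap in this design.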
Traditional LLM serving allocates the KV cache as one contiguous max_context region per request. The resulting internal and external fragmentation, together with reservation for tokens that are never generated, wastes 60-80% of KV cache memory and sharply limits maximum batch size and throughput.
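A back-of-the-envelope calculation shows how this waste arises. The model shape (40 layers, 40 KV heads, head dimension 128, fp16), the 2048-token reservation, and the 512-token actual length are all assumptions chosen for illustration, not measurements.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Keys and values (factor 2) are stored for every layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def wasted_fraction(reserved_tokens, used_tokens):
    # Share of a contiguous reservation that never holds a real token.
    return 1 - used_tokens / reserved_tokens


# Assumed model shape, roughly a 13B-class decoder in fp16.
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)
max_context = 2048     # assumed contiguous reservation per request
actual_tokens = 512    # assumed prompt plus generated length

reserved_mib = per_token * max_context / 2**20
print(f"{per_token} bytes per token, {reserved_mib:.0f} MiB reserved per request")
print(f"wasted: {wasted_fraction(max_context, actual_tokens):.0%}")
```

Under these assumptions, three quarters of each reservation is never used, which falls inside the 60-80% range quoted above.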
Block table: per-sequence mapping of logical block indices to physical HBM addresses, analogous to an OS page table.
Block manager: component that allocates and frees physical KV cache blocks in HBM and maintains the free-block pool.
PagedAttention kernel: specialized attention kernel that handles a non-contiguous KV cache via block-table indirection.
Copy-on-write: mechanism for sharing prefix blocks across sequences, duplicating a block lazily on its first modification.
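The interplay of the free pool, reference counts, and copy-on-write can be sketched as follows. This is a toy model under stated assumptions: `BlockManager`, `fork`, `free_block`, and `writable` are hypothetical names, block IDs are plain integers, and the actual copy of KV data on the GPU is omitted.

```python
class BlockManager:
    """Toy allocator: a free pool plus per-block reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV cache blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        """Share every block of a prefix with a new sequence."""
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)

    def free_block(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)

    def writable(self, block_table, logical_idx):
        """Return a block that is safe to write, copying it first if shared."""
        block = block_table[logical_idx]
        if self.ref_count[block] == 1:
            return block                  # exclusive owner: write in place
        new_block = self.allocate()       # shared: copy-on-write
        # A real system would also copy the block's KV data on the GPU here.
        self.free_block(block)            # drop this sequence's reference
        block_table[logical_idx] = new_block
        return new_block


mgr = BlockManager(num_blocks=4)
parent = [mgr.allocate(), mgr.allocate()]   # two prefix blocks
child = mgr.fork(parent)                    # child shares both blocks
mgr.writable(child, 1)                      # child writes to its last block

print(parent, child, mgr.ref_count)
```

After the write, the first block is still shared by both sequences while the second has been split, so only the block that actually diverged costs extra memory.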
Standard attention kernels of the time (FlashAttention, cuDNN) assumed a contiguous KV cache, so PagedAttention required a dedicated kernel with block-table indirection.
A small block size wastes less memory in each sequence's last, partially filled block, but it means longer block tables and less regular memory access. A large block size wastes more memory, especially for short sequences.
The block manager, reference counting, and copy-on-write add significant systems complexity, so memory issues are harder to debug than in simpler servers.
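The block-size trade-off can be made concrete with a small calculation. The sequence lengths and candidate block sizes below are assumptions for illustration only; the point is the direction of the trend, not the specific numbers.

```python
def blocks_needed(num_tokens, block_size):
    # Ceiling division: a partially filled last block still occupies a whole block.
    return (num_tokens + block_size - 1) // block_size


def wasted_slots(num_tokens, block_size):
    # Only the last block of a sequence can be partially filled.
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


seq_lens = [37, 200, 1025]   # assumed sequence lengths
for block_size in (8, 16, 128):
    waste = sum(wasted_slots(n, block_size) for n in seq_lens)
    entries = sum(blocks_needed(n, block_size) for n in seq_lens)
    print(f"block_size={block_size:3d}  wasted slots={waste:3d}  block-table entries={entries}")
```

Smaller blocks leave fewer empty token slots but need many more block-table entries, while larger blocks reverse both effects.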
The UC Berkeley team open-sources vLLM ahead of the paper's publication, and community adoption is immediate.
PagedAttention is formally published at SOSP 2023, a flagship systems conference, reporting 2-4× higher throughput than prior systems (FasterTransformer, Orca) at similar latency, with near-zero KV cache memory waste.
vLLM becomes one of the most widely used open-source inference servers. Paged KV caching is adopted by TensorRT-LLM, SGLang, and LMDeploy.
Automatic prefix caching extends PagedAttention by detecting and sharing common prefixes across requests without explicit marking.
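One plausible way to detect shared prefixes without explicit marking is to key each full block by a hash of its tokens chained with the hash of everything before it. The sketch below assumes that approach; `block_hashes`, `PrefixCache`, and `lookup_or_allocate` are hypothetical names, the block size of 4 is chosen for readability, and reference counting and eviction are left out. It is not a description of how any particular server implements the feature.

```python
import hashlib

BLOCK_SIZE = 4  # small assumed block size so the example is easy to follow


def block_hashes(token_ids):
    """Hash each full block together with everything that precedes it."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        chunk = ",".join(map(str, token_ids[start:start + BLOCK_SIZE])).encode()
        prev = hashlib.sha256(prev + b"|" + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.hash_to_block = {}   # chained block hash -> physical block ID
        self.next_block = 0

    def lookup_or_allocate(self, token_ids):
        """Return (block_table, reused_block_count) for a prompt's full blocks."""
        table, reused = [], 0
        for h in block_hashes(token_ids):
            if h in self.hash_to_block:
                reused += 1
            else:
                self.hash_to_block[h] = self.next_block
                self.next_block += 1
            table.append(self.hash_to_block[h])
        return table, reused


cache = PrefixCache()
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]   # two full blocks
table_a, reused_a = cache.lookup_or_allocate(system_prompt + [21, 22, 23, 24])
table_b, reused_b = cache.lookup_or_allocate(system_prompt + [31, 32, 33, 34])

print(table_a, reused_a)
print(table_b, reused_b)
```

Chaining the hashes means a block can only match if its entire prefix matches too, so the second request reuses the two system-prompt blocks and allocates a new block only where the prompts diverge.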
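PagedAttention eliminates external fragmentation and limits internal fragmentation to the last block of each sequence, without significant compute overhead. Block-table indirection does slow the attention kernel itself (the vLLM paper reports 20-26% higher kernel latency than FasterTransformer's contiguous-cache kernel), but attention is only part of each step, so the end-to-end overhead is minimal (~1-2%) and is more than offset by the higher throughput from larger batches.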
PagedAttention does not change the attention computation or the model's mathematics; it is purely a systems-level optimization of cache memory management.
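That the mathematics is unchanged can be checked with a small pure-Python sketch. It computes single-head attention for one query twice: once over contiguous K/V lists and once over the same data scattered into non-adjacent blocks and read back through a block table. The block size of 2, the sample vectors, and the names `attention` and `paged_attention` are assumptions for illustration. A real kernel would stream blocks on the GPU rather than gather them into lists first.

```python
import math

BLOCK_SIZE = 2  # assumed tiny block size for readability


def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                                   # for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def paged_attention(query, key_pool, value_pool, block_table, num_tokens):
    """Gather K/V through the block table, then run the same math."""
    keys, values = [], []
    for pos in range(num_tokens):
        block = block_table[pos // BLOCK_SIZE]
        offset = pos % BLOCK_SIZE
        keys.append(key_pool[block][offset])
        values.append(value_pool[block][offset])
    return attention(query, keys, values)


# Contiguous K/V for 5 tokens with head dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
query = [0.3, -0.7]

# Scatter the same data into non-adjacent physical blocks 4, 0, 2.
block_table = [4, 0, 2]
pad = [0.0, 0.0]
key_pool = [[pad, pad] for _ in range(5)]
value_pool = [[pad, pad] for _ in range(5)]
for pos, (k, v) in enumerate(zip(keys, values)):
    block, offset = block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
    key_pool[block][offset] = k
    value_pool[block][offset] = v

contiguous_out = attention(query, keys, values)
paged_out = paged_attention(query, key_pool, value_pool, block_table, len(keys))
print(contiguous_out == paged_out)
```

The two outputs agree because the block table changes only where each key and value is stored, not which values enter the softmax or in what order.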