Inference

PagedAttention

2023 · Active · Published: 29 May 2026 · Updated: 29 May 2026

Key innovation

KV cache paging modeled on operating-system virtual memory: it nearly eliminates memory fragmentation in LLM serving and supports continuous batching and cache sharing across requests. The original paper reports 2-4× higher throughput than prior serving systems at the same latency.

How it works

KV cache is divided into physical blocks in HBM (typically 16 tokens per block per layer). Each sequence in a batch has a logical block list (block table) mapping logical block index to physical HBM address. Blocks are allocated on-demand as the sequence grows — no max_context preallocation. A custom CUDA attention kernel handles non-contiguous cache, reading blocks via the block table. Prefix sharing is implemented via reference counting on blocks: when two sequences share a prefix, they share the physical blocks of that prefix. Modification (writing to a shared block) triggers copy-on-write. Continuous batching dynamically adds/removes sequences from the batch without restart, as per-block allocation eliminates the need to reserve a contiguous region.
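The interaction between on-demand block allocation and continuous batching can be illustrated with a toy scheduler loop. This is a minimal sketch under simplifying assumptions, not vLLM's scheduler: the names `BLOCK_SIZE`, `blocks_needed`, and `run` are hypothetical, every running sequence generates one token per step, and a sequence that cannot get a block simply stalls for that step.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per KV block (assumed; the typical value cited above)


def blocks_needed(num_tokens):
    """Number of KV blocks required to hold num_tokens tokens."""
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def run(requests, total_blocks):
    """Toy continuous-batching loop.

    requests: list of (name, prompt_len, tokens_to_generate), with at least
    one token to generate each. Returns (finish_order, free_blocks_at_end).
    """
    waiting = deque(requests)
    running = {}  # name -> [current_length, remaining_tokens]
    free = total_blocks
    finished = []

    while waiting or running:
        progressed = False

        # Admit waiting requests as long as their prompt fits in free blocks.
        while waiting and blocks_needed(waiting[0][1]) <= free:
            name, prompt_len, gen = waiting.popleft()
            free -= blocks_needed(prompt_len)
            running[name] = [prompt_len, gen]
            progressed = True

        # One decode step: each running sequence appends one token.
        for name in list(running):
            cur, rem = running[name]
            if cur % BLOCK_SIZE == 0:  # all held blocks are full
                if free == 0:
                    continue  # stall; a real server would preempt or swap
                free -= 1
            cur, rem = cur + 1, rem - 1
            progressed = True
            if rem == 0:
                free += blocks_needed(cur)  # return every block to the pool
                finished.append(name)
                del running[name]
            else:
                running[name] = [cur, rem]

        if not progressed:
            raise RuntimeError("out of KV blocks and nothing can make progress")

    return finished, free
```

Calling `run([("a", 16, 2), ("b", 16, 20), ("c", 16, 1)], total_blocks=4)` shows short requests finishing and returning their blocks while the long request keeps decoding, with no batch restart and no contiguous reservation.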

Problem solved

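Traditional LLM serving allocates the KV cache as one contiguous max_context block per request. Because most requests finish well short of max_context, 60-80% of KV cache memory in HBM is wasted through over-reservation and fragmentation, which drastically limits maximum batch size and therefore throughput.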
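A small calculation makes the scale of the waste concrete. The sequence lengths and the 2048-token max_context below are made-up illustrative values, and the helper names are hypothetical; the sketch only counts KV slots and ignores external fragmentation.

```python
def contiguous_waste(seq_lens, max_context):
    """Fraction of reserved KV slots left unused when every request
    reserves max_context slots up front."""
    reserved = max_context * len(seq_lens)
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens, block_size=16):
    """Fraction of allocated KV slots left unused with on-demand blocks;
    only the last block of each sequence can be partly empty."""
    allocated = sum(-(-n // block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / allocated


lens = [300, 520, 150, 900]  # hypothetical final sequence lengths, in tokens
print(f"contiguous preallocation: {contiguous_waste(lens, 2048):.1%} wasted")
print(f"paged, 16-token blocks:   {paged_waste(lens):.1%} wasted")
```

For these inputs the contiguous scheme wastes roughly three quarters of the reserved slots, while the paged scheme wastes under 2%, since the loss is bounded by one partial block per sequence.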

Components

Block table

Per-sequence mapping of logical block indices to physical HBM addresses — analog of an OS page table.
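The page-table analogy comes down to one integer division. The sketch below assumes 16-token blocks and a made-up block table; `locate` is a hypothetical helper name.

```python
BLOCK_SIZE = 16  # tokens per block (assumed)


def locate(block_table, token_pos):
    """Translate a logical token position into (physical_block, offset)."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return block_table[logical_block], offset


# Hypothetical table: logical blocks 0, 1, 2 live in physical blocks 7, 2, 11.
block_table = [7, 2, 11]

print(locate(block_table, 0))   # first token of the sequence
print(locate(block_table, 20))  # falls in logical block 1
print(locate(block_table, 47))  # last slot of logical block 2
```

The physical blocks need not be adjacent or ordered, which is what removes the requirement for a contiguous region per request.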

Block manager

Component allocating/freeing physical KV cache blocks in HBM, maintaining a free block pool.
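A free-pool allocator of this kind can be sketched in a few lines. This is an illustrative toy, not the block manager of any of the implementations listed below; the class and method names are hypothetical.

```python
class BlockManager:
    """Toy allocator over a fixed pool of physical KV blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        """Take one block from the free pool."""
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def free(self, block_id):
        """Return a block to the free pool."""
        self.free_blocks.append(block_id)

    def append_token(self, block_table, seq_len, block_size=16):
        """Grow block_table if the next token does not fit.

        Returns the new sequence length.
        """
        if seq_len == len(block_table) * block_size:
            block_table.append(self.allocate())
        return seq_len + 1


mgr = BlockManager(num_blocks=4)
table, length = [], 0
for _ in range(33):  # 33 tokens need three 16-token blocks
    length = mgr.append_token(table, length)
print(len(table), len(mgr.free_blocks))
```

Blocks are taken only when a sequence crosses a block boundary, so memory use tracks the actual sequence length rather than a worst-case reservation.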

Custom paged attention CUDA kernel

Specialized attention kernel handling non-contiguous KV cache via block-table indirection.
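The block-table indirection can be shown with a pure-Python reference for a single query and a single head. This is a correctness sketch only: a real CUDA kernel reads and reduces block by block on the GPU instead of gathering keys and values into a list, and all names here are hypothetical.

```python
import math

BLOCK_SIZE = 4  # small block size to keep the example readable


def attention(q, keys, values):
    """Single-query softmax attention over contiguous key/value lists."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def paged_attention(q, k_cache, v_cache, block_table, seq_len):
    """Same result, with keys/values fetched through the block table.

    k_cache and v_cache are lists of physical blocks, each holding
    BLOCK_SIZE vectors.
    """
    keys, values = [], []
    for pos in range(seq_len):
        logical, offset = divmod(pos, BLOCK_SIZE)
        physical = block_table[logical]
        keys.append(k_cache[physical][offset])
        values.append(v_cache[physical][offset])
    return attention(q, keys, values)


# Six tokens of 2-dimensional keys/values, stored contiguously as a reference.
keys = [[0.1 * i, 1.0] for i in range(6)]
values = [[float(i), -float(i)] for i in range(6)]
q = [0.5, -0.25]

# Scatter them into a 4-block physical cache via a non-contiguous table.
block_table = [3, 0]
k_cache = [[[0.0, 0.0]] * BLOCK_SIZE for _ in range(4)]
v_cache = [[[0.0, 0.0]] * BLOCK_SIZE for _ in range(4)]
for pos in range(6):
    logical, offset = divmod(pos, BLOCK_SIZE)
    k_cache[block_table[logical]][offset] = keys[pos]
    v_cache[block_table[logical]][offset] = values[pos]

ref = attention(q, keys, values)
out = paged_attention(q, k_cache, v_cache, block_table, seq_len=6)
print(all(math.isclose(a, b) for a, b in zip(ref, out)))
```

The paged result matches the contiguous one because only the storage layout differs; the attention math is unchanged.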

Reference counting + copy-on-write

Mechanism for sharing prefix blocks across sequences with lazy duplication on modification.
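Reference counting with lazy duplication can be sketched as follows. This is a toy model with hypothetical names (`SharedBlocks`, `fork`, `write`, `release`); it stores arbitrary Python values in place of KV tensors.

```python
class SharedBlocks:
    """Toy reference-counted block pool with copy-on-write."""

    def __init__(self, num_blocks, block_size=16):
        self.data = [[None] * block_size for _ in range(num_blocks)]
        self.refcount = [0] * num_blocks
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        block = self.free_blocks.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table):
        """Create a new sequence that shares every block of an existing one."""
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)

    def write(self, block_table, logical_block, offset, value):
        """Write one slot, copying the block first if it is shared."""
        block = block_table[logical_block]
        if self.refcount[block] > 1:
            new_block = self.allocate()
            self.data[new_block] = list(self.data[block])  # lazy duplication
            self.refcount[block] -= 1
            block_table[logical_block] = new_block
            block = new_block
        self.data[block][offset] = value

    def release(self, block_table):
        """Drop one reference per block; free blocks that reach zero."""
        for block in block_table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                self.free_blocks.append(block)


pool = SharedBlocks(num_blocks=4, block_size=4)
parent = [pool.allocate()]
pool.write(parent, 0, 0, "a")   # sole owner: written in place
child = pool.fork(parent)       # both tables point at the same block
pool.write(child, 0, 1, "b")    # shared: child gets a private copy first
print(parent, child)
```

After the second write the two block tables diverge, and the parent's block is untouched by the child's modification.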

Implementation

Reference implementations

vLLM

Python / CUDA · vLLM Project (UC Berkeley / community)

Official

NVIDIA TensorRT-LLM (PagedKVCache)

C++ / CUDA / Python · NVIDIA

Official

SGLang

Python / CUDA · SGLang team (UC Berkeley / Stanford)

LMDeploy

Python / CUDA · InternLM

Implementation pitfalls

Custom CUDA kernel required · Medium

Standard attention kernels (FlashAttention, cuDNN) were originally written for a contiguous KV cache. PagedAttention requires a kernel that supports block-table indirection.

Fix: Use existing implementations (vLLM, TensorRT-LLM, SGLang) rather than writing your own kernel.

Block size as trade-off · Low

Small block size = less waste but more block-table overhead and less regular memory access. Large block size = more waste for short sequences.

Fix: The default of 16 tokens works well for most workloads; benchmark for specific length distributions.
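Before benchmarking, the two sides of the trade-off can be estimated from a length distribution. The sketch below counts unused slots in each sequence's last block against the number of block-table entries; the sequence lengths are hypothetical, and memory-access regularity is not modeled.

```python
def tradeoff(seq_lens, block_size):
    """Return (wasted_slots, block_table_entries) for the given lengths."""
    blocks = [-(-n // block_size) for n in seq_lens]  # ceiling division
    wasted = sum(b * block_size - n for b, n in zip(blocks, seq_lens))
    return wasted, sum(blocks)


lens = [37, 210, 1000, 65]  # hypothetical sequence lengths, in tokens
for bs in (8, 16, 32, 128):
    wasted, entries = tradeoff(lens, bs)
    print(f"block_size={bs:>3}  wasted_slots={wasted:>4}  table_entries={entries}")
```

Larger blocks shrink the block table but leave more empty slots per sequence, which matters most when many sequences are short.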

Systems complexity · Low

The block manager, reference counting, and copy-on-write add significant systems complexity; debugging memory issues is harder than in simpler servers.

Fix: Rely on mature implementations (vLLM has the broadest user base and bug-fix history).

Evolution

Original paper · 2023 · SOSP 2023 · Woosuk Kwon

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

2023

vLLM open-source release (June 2023)

UC Berkeley team open-sources vLLM before the paper publication — immediate community adoption.

2023

SOSP 2023 publication (Kwon et al.)

Inflection point

Formal PagedAttention publication at the flagship systems conference. 2-4× throughput vs competitors, near-zero memory waste.

Efficient Memory Management for Large Language Model Serving with PagedAttention (paper)

2024

De facto LLM serving standard

vLLM becomes one of the most widely used open-source inference servers. PagedAttention-style KV cache management is adopted by TensorRT-LLM, SGLang, and LMDeploy.

2024

Automatic Prefix Caching in vLLM

Extension of PagedAttention with automatic detection and sharing of prefixes across requests without explicit marking.
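One way to detect shared prefixes without explicit marking is to key each full block by a hash of its tokens chained with the hash of the preceding block. The sketch below illustrates that idea under stated assumptions; it is not vLLM's actual code, and `block_hashes` and `PrefixCache` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # small block size to keep the example readable


def block_hashes(token_ids):
    """Chained hash per full block.

    Each hash covers every token up to and including that block, so two
    prompts get equal hashes only for blocks with an identical prefix.
    """
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        chunk = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(chunk).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixCache:
    """Maps block hashes to physical block ids (toy model, no eviction)."""

    def __init__(self):
        self.hash_to_block = {}
        self.next_block = 0

    def get_blocks(self, token_ids):
        """Return (block_ids, num_reused) for the full blocks of a prompt."""
        block_ids, reused = [], 0
        for h in block_hashes(token_ids):
            if h in self.hash_to_block:
                reused += 1
            else:
                self.hash_to_block[h] = self.next_block
                self.next_block += 1
            block_ids.append(self.hash_to_block[h])
        return block_ids, reused


cache = PrefixCache()
first = cache.get_blocks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
second = cache.get_blocks([1, 2, 3, 4, 5, 6, 7, 8, 99, 98, 97, 96])
print(first, second)
```

The second prompt reuses the two blocks covering its shared 8-token prefix and allocates a new block only for the part that differs. Chaining the parent hash means a block with the same tokens at a different position, or after a different prefix, is not mistaken for a match.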

Sources

Efficient Memory Management for LLM Serving with PagedAttention (Kwon et al., 2023)

vLLM project — GitHub

vLLM documentation

vLLM blog — Easy, Fast, and Cheap LLM Serving with PagedAttention
