
Prompt Caching

2023 | Active | Published: 17 May 2026 | Updated: 17 May 2026

How it works

1. Prefix hashing: the inference engine hashes the prefix token sequence (e.g. SHA-256 or faster xxHash). The hash is the key into the on-device cache table.

2. Lookup: before prefill, the engine checks whether the KV-cache for that prefix (or its longest matching prefix) is already in memory. On a hit, the engine skips the prefill computation (attention and feed-forward passes) for those tokens and reuses their stored keys and values.

3. Partial hit (prefix match): if the new prompt shares its first N tokens with the cached one, the engine reuses KV for those N and prefills only the tail. Hence the static-first, dynamic-last layout requirement.

4. Eviction policy: the cache has limited capacity, so entries are evicted by LRU, LFU, or an explicit TTL. Anthropic uses 5-minute and 1-hour TTLs set through explicit `cache_control` markers. OpenAI caches automatically and retains entries for minutes after last use. vLLM applies LRU at the memory-block level (PagedAttention blocks); SGLang applies LRU to the nodes of its radix tree (RadixAttention).

5. Billing: prefix tokens are paid for once on cache write, at a premium (Anthropic: 1.25× the normal input price for the 5-minute TTL, 2× for the 1-hour TTL). Subsequent reads are much cheaper (Anthropic 0.1×, OpenAI 0.5×, Google ~0.25×). Generated (output) tokens are always billed at full price.

6. Cache-friendly layout: to maximize hits, prompts are structured as [system + tools + documents + RAG] (static, cached) || [user question + history] (dynamic, prefilled each time).
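The lookup, partial-hit, and eviction steps above can be sketched as a toy block-level prefix cache. This is a minimal illustration, not how any particular engine is implemented: the block size, the chained SHA-256 keys, the `PrefixCache` class, and the placeholder KV payload are all assumptions made for the demo.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per block (assumed small value for the demo)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a block's key depends on the whole prefix before it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixCache:
    """Toy LRU cache mapping block hash -> placeholder KV payload (hypothetical class)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()

    def match(self, tokens):
        """Return how many leading tokens have their KV blocks cached."""
        matched = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break  # first missing block ends the reusable prefix
            self.blocks.move_to_end(h)  # mark as recently used
            matched += BLOCK_SIZE
        return matched

    def insert(self, tokens):
        """Store every full block, evicting least recently used blocks when over capacity."""
        for h in block_hashes(tokens):
            self.blocks[h] = "kv-placeholder"
            self.blocks.move_to_end(h)
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


cache = PrefixCache(capacity_blocks=8)
system = list(range(12))                  # 12 static tokens = 3 full blocks
cache.insert(system + [100, 101])         # first request: static prefix + short tail
hit = cache.match(system + [200, 201, 202, 203])  # second request: same prefix, new tail
shifted = cache.match([9] + system)       # one extra leading token changes every block hash
```

Here `hit` is 12 (only the new tail needs prefill) and `shifted` is 0, which is why the static part has to come first and stay byte-for-byte identical.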

Problem solved

Prefill is the dominant inference cost for applications with long, repetitive context: chatbots with large system prompts, agents with tool definitions, RAG with large retrieved chunks, AI IDEs with an entire monorepo in context. Without prompt caching, every request pays the same prefill bill even though as much as 95% of the context never changes. Prompt caching avoids this by reusing the work already spent processing the shared prefix. Result: 50–90% lower input-token cost and 5–10× faster TTFT (time to first token) on cache hits. With a careful static-then-dynamic prompt layout, production cache-hit rates can exceed 80%.

Components

KV-cache

Prefix hash / lookup index

Eviction policy

Block / page allocator

Cache-friendly prompt layout

Implementation

Reference implementations

vLLM — automatic prefix caching (PagedAttention)

SGLang — RadixAttention

Anthropic Prompt Caching docs

OpenAI Prompt Caching docs

Google Vertex AI Context Caching

llama.cpp prompt caching (`--prompt-cache`)

Implementation pitfalls

Invalidation by a single mid-prefix token [High]

The cache only works on a strict prefix match. Changing a date, timestamp, or user ID in the middle of the system prompt invalidates the entire cache from that position onward. Mitigation: always push dynamic variables to the end of the prompt.
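A small sketch of the effect, assuming a whitespace split as a stand-in for a real tokenizer and two hypothetical prompt templates (`bad_prompt`, `good_prompt`). It measures how many leading tokens two consecutive requests share when only the date changes.

```python
def tokenize(text):
    # Whitespace split stands in for a real tokenizer (assumption for the demo).
    return text.split()


def common_prefix_len(a, b):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


RULES = "You are a support agent. Follow the policy. Answer briefly."


def bad_prompt(date, question):
    # Dynamic value in the middle of the static instructions.
    return (f"You are a support agent. Today is {date}. "
            f"Follow the policy. Answer briefly. Question: {question}")


def good_prompt(date, question):
    # All static text first, dynamic values last.
    return f"{RULES} Today is {date}. Question: {question}"


q = "Where is my order?"
bad_shared = common_prefix_len(tokenize(bad_prompt("2026-05-16", q)),
                               tokenize(bad_prompt("2026-05-17", q)))
good_shared = common_prefix_len(tokenize(good_prompt("2026-05-16", q)),
                                tokenize(good_prompt("2026-05-17", q)))
```

In this toy case the mid-prompt date leaves 7 reusable tokens, while the static-first layout leaves 12, that is, all of the instructions. With a system prompt of thousands of tokens the gap grows accordingly.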

Cache write premium [Medium]

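The first cache write costs more than a normal prefill (Anthropic: 1.25×). Caching only pays off when the same prefix is used at least twice within the TTL, that is, one write plus at least one read: 1.25 + 0.1 = 1.35× the base prefix price, versus 2× for two uncached requests. A prefix that is written once and never read again costs 25% more than not caching it.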
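The break-even point can be checked with a few lines of arithmetic. This is a sketch that only models the prefix tokens, in units of the base input price; the function names are hypothetical, the default multipliers are the Anthropic figures quoted in this entry, and the 2.0× multiplier in the second call is an illustrative higher write premium.

```python
def cost_with_cache(requests, write_mult=1.25, read_mult=0.1):
    """Relative prefix cost: one cache write followed by (requests - 1) cache reads."""
    if requests <= 0:
        return 0.0
    return write_mult + read_mult * (requests - 1)


def cost_without_cache(requests):
    """Relative prefix cost when every request pays full prefill."""
    return float(max(requests, 0))


def break_even_requests(write_mult=1.25, read_mult=0.1):
    """Smallest number of requests within the TTL at which caching is cheaper."""
    if read_mult >= 1.0:
        raise ValueError("cache reads must be cheaper than uncached input")
    n = 1
    while cost_with_cache(n, write_mult, read_mult) >= cost_without_cache(n):
        n += 1
    return n


default_break_even = break_even_requests()              # 1.25x write, 0.1x read
pricier_break_even = break_even_requests(write_mult=2.0)  # illustrative higher write premium
ten_requests_ratio = cost_with_cache(10) / cost_without_cache(10)
```

With the default multipliers caching is cheaper from the second request on; with a 2.0× write premium it takes three requests. Ten requests against the same prefix cost about 21.5% of the uncached price for the prefix portion.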

VRAM as the bottleneck [High]

A 100k-token KV-cache for a 70B model occupies tens of GB of HBM per cached prefix. Self-hosted vLLM/SGLang deployments quickly exhaust GPU memory when many long prefixes are cached at the same time.
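The "tens of GB" figure follows from the usual KV-cache size formula. The sketch below assumes a Llama-2-70B-like shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and 16-bit values; other 70B models and quantized caches will give different numbers.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for `tokens` positions."""
    # Factor 2 = one tensor for keys plus one for values, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed model shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 values.
size_bytes = kv_cache_bytes(100_000, layers=80, kv_heads=8, head_dim=128)
size_gib = size_bytes / 2**30

# Same shape without grouped-query attention (64 KV heads), for comparison.
size_mha_gib = kv_cache_bytes(100_000, layers=80, kv_heads=64, head_dim=128) / 2**30
```

Under these assumptions one 100k-token prefix needs about 30.5 GiB, and roughly eight times that without grouped-query attention, so a single accelerator holds only a handful of such prefixes alongside the model weights.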

TTL and unpredictable cache miss [Medium]

The cache expires once the TTL elapses without a hit (Anthropic 5 min / 1 h, OpenAI "minutes"). Apps with bursty traffic (sporadic queries) lose the cache between bursts and pay to rebuild it. Mitigation: send a keep-alive ping before the TTL runs out, or request the 1-hour TTL through `cache_control`.
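A toy simulation of this effect, assuming a single cached prefix whose TTL is refreshed on every use (write or read). The `simulate_ttl` helper and the arrival times are made up for illustration.

```python
def simulate_ttl(arrival_times, ttl_seconds):
    """Count cache hits and misses for one prefix under a sliding TTL."""
    hits = 0
    misses = 0
    expires_at = None  # no cache entry yet
    for t in arrival_times:
        if expires_at is not None and t < expires_at:
            hits += 1
        else:
            misses += 1  # entry absent or expired: pay the cache write again
        expires_at = t + ttl_seconds  # every use refreshes the TTL
    return hits, misses


# Bursty traffic: three quick requests, a 13-minute gap, two more, then a long gap.
arrivals = [0, 60, 120, 900, 960, 4000]

short_ttl = simulate_ttl(arrivals, ttl_seconds=300)    # 5-minute TTL
long_ttl = simulate_ttl(arrivals, ttl_seconds=3600)    # 1-hour TTL
```

With the 5-minute TTL this trace gives 3 hits and 3 misses; with the 1-hour TTL it gives 5 hits and 1 miss. Whether the longer TTL is worth it depends on its write premium and on how often the gaps exceed the short TTL.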

Cross-tenant cache poisoning (theoretical) [Low]

If a cloud provider naively shared the prefix cache across customers, one tenant could learn whether another tenant has already submitted a given prefix by measuring response latency (a timing side-channel). Major providers document that the cache is isolated per organization / API key.
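For self-hosted setups, one simple way to get the same isolation is to mix a tenant identifier into the cache key, so identical prefixes from different tenants never collide. This is a hedged sketch with a hypothetical `cache_key` helper, not a description of how any provider implements isolation.

```python
import hashlib


def cache_key(tenant_id, token_ids):
    """Derive a prefix-cache key that is scoped to one tenant."""
    h = hashlib.sha256()
    h.update(tenant_id.encode("utf-8"))
    h.update(b"\x00")  # separator so tenant id and tokens cannot blur together
    h.update(",".join(str(t) for t in token_ids).encode("ascii"))
    return h.hexdigest()


prefix = [101, 2023, 2003, 1037]
key_a = cache_key("tenant-a", prefix)
key_a_again = cache_key("tenant-a", prefix)
key_b = cache_key("tenant-b", prefix)
```

The same tenant gets the same key on every request (`key_a == key_a_again`), while another tenant sending the identical prefix gets a different key and therefore always a cold cache, which removes the timing signal at the cost of less sharing.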