1. Prefix hashing: the inference engine hashes the prefix token sequence (e.g. with SHA-256, or with a faster non-cryptographic hash such as xxHash). The hash is the key into the on-device cache table.
2. Lookup: before prefill, the engine checks whether the KV-cache for that prefix (or its longest matching prefix) is already in memory. On a hit, the engine skips the prefill computation for those tokens and reuses their stored keys and values.
3. Partial hit (prefix match): if the new prompt shares its first N tokens with the cached one, the engine reuses KV for those N and prefills only the tail. Hence the static-first, dynamic-last layout requirement.
4. Eviction policy: cache capacity is limited, so entries are evicted by LRU, LFU, or an explicit TTL. Anthropic uses a 5-minute TTL by default (1 hour optional), set through explicit `cache_control` markers. OpenAI caches automatically and retains entries for some minutes after last use. vLLM/SGLang use LRU at the memory-block level (PagedAttention).
5. Billing: prefix tokens are billed at a premium when the cache is written (Anthropic: 1.25× the normal input price), and subsequent reads are much cheaper (Anthropic 0.1×, OpenAI 0.5×, Google ~0.25×). A prefix that expires and is written again pays the write price again. Generated (output) tokens are always billed at full price.
6. Cache-friendly layout: to maximize hits, prompts are structured as [system + tools + documents + RAG] (static, cached) || [user question + history] (dynamic, prefilled each time).
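Steps 1 to 4 can be sketched as a toy block-level prefix cache. This is only an illustration under stated assumptions: `PrefixCache`, `block_hashes`, and the block size of 4 are hypothetical names and values, the KV tensors are replaced by a placeholder string, and real engines such as vLLM or SGLang differ in many details.

```python
import hashlib
from collections import OrderedDict

BLOCK = 4  # tokens per block; tiny for illustration (an assumption, not a real default)


def block_hashes(tokens):
    """Chained hash per full block: each hash depends on every token before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK  # ignore the trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(str(t) for t in tokens[i:i + BLOCK]).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Toy block-level prefix cache with LRU eviction (hypothetical class)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block hash -> stand-in for a KV block

    def _touch(self, hashes):
        # Touch tail blocks first so the head of a prefix ends up most recent
        # and is evicted last; evicting a head would orphan every later block.
        for h in reversed(hashes):
            self.blocks[h] = "kv"
            self.blocks.move_to_end(h)

    def match(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        matched = []
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            matched.append(h)
        self._touch(matched)
        return len(matched) * BLOCK

    def insert(self, tokens):
        """Record all full blocks of `tokens`, evicting least recently used blocks."""
        self._touch(block_hashes(tokens))
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)


cache = PrefixCache(capacity_blocks=8)
system = list(range(100, 112))               # 12 "static" tokens = 3 blocks
cache.insert(system + [1, 2, 3, 4, 5])       # caches 4 full blocks
reused = cache.match(system + [9, 9, 9, 9])  # same prefix, different tail
print(reused)  # tokens whose prefill can be skipped
```

Here the second prompt shares its first 12 tokens with the cached one, so `reused` is 12 and only the 4-token tail would need prefill. Chaining each block hash to the previous one is what makes a lookup a prefix match rather than a match on isolated blocks.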
Prefill is the dominant inference cost for applications with long, repetitive context: chatbots with large system prompts, agents with tool definitions, RAG with massive chunks, AI IDEs with an entire monorepo in context. Without prompt caching, every request pays the same prefill bill, even though most of the context (often around 95%) never changes. Prompt caching solves this by "remembering" the work spent processing the shared prefix. Typical reported results are a 50–90% reduction in input cost and a 5–10× faster time to first token (TTFT) on cache hits; the exact figures depend on the provider, the model, and the prefix length. With careful prompt layout (static-then-dynamic), production cache-hit rates can exceed 80%.
The cache only works on a strict prefix match. Changing a date, timestamp, or user ID in the middle of the system prompt invalidates the entire cache from that position onward. Mitigation: always push dynamic variables to the end of the prompt.
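A small sketch of why the position of a dynamic value matters. Whitespace splitting stands in for a real tokenizer, and `SYSTEM`, `bad_prompt`, and `good_prompt` are hypothetical names chosen for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading items that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a support assistant. Follow the policy document below."


def bad_prompt(date, question):
    # Dynamic value first: the prefix changes as soon as the date changes.
    return f"Today is {date}. {SYSTEM} Question: {question}".split()


def good_prompt(date, question):
    # Static text first, dynamic values last.
    return f"{SYSTEM} Today is {date}. Question: {question}".split()


q = "How do I reset my password?"
bad = shared_prefix_len(bad_prompt("2024-05-01", q), bad_prompt("2024-05-02", q))
good = shared_prefix_len(good_prompt("2024-05-01", q), good_prompt("2024-05-02", q))
print(bad, good)
```

With the date at the front, only the two tokens before it are shared between the two days; with the date at the end, the whole static block plus the words leading up to the date (12 tokens here) remains a reusable prefix.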
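The first cache write costs more than a normal prefill (Anthropic: 1.25×). Caching only pays off when the same prefix is used at least twice within the TTL, that is, one write followed by at least one read: at 1.25× write and 0.1× read, two requests cost 1.35 prefix units instead of 2. A prefix that is written and never read again costs 25% more than not caching at all.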
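The break-even arithmetic can be checked with a few lines. This sketch counts only the prefix tokens and uses the Anthropic multipliers quoted in this document as default arguments; `cost_ratio` is a hypothetical helper, and other providers or TTL tiers would need different values.

```python
def cost_ratio(reads, write_mult=1.25, read_mult=0.1):
    """Cached cost divided by uncached cost for one write plus `reads` cache hits.

    Costs are in units of one normal prefill of the prefix.
    """
    cached = write_mult + read_mult * reads
    uncached = 1 + reads
    return cached / uncached


for reads in (0, 1, 9):
    print(reads, round(cost_ratio(reads), 3))
```

A ratio above 1 means caching lost money (a write with no later read), and a ratio below 1 means it paid off. The ratio approaches the read multiplier as the number of hits grows.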
A 100k-token KV-cache for a 70B model occupies tens of GB of HBM per cached prefix, even with grouped-query attention. Self-hosted vLLM/SGLang deployments quickly exhaust memory when many long prefixes are cached at the same time.
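The memory figure follows from the usual KV-cache size formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The configuration below (80 layers, 8 KV heads, head dimension 128, FP16) is an assumption modeled on a Llama-style 70B model with grouped-query attention; other 70B models may differ.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for `tokens` tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumed Llama-style 70B configuration with grouped-query attention, FP16.
gqa = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=8, head_dim=128)

# Same model shape without GQA (64 KV heads), for comparison.
mha = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=64, head_dim=128)

print(f"GQA: {gqa / 1e9:.1f} GB, MHA: {mha / 1e9:.1f} GB")
```

Under these assumptions a single 100k-token prefix takes about 32.8 GB with grouped-query attention and about 262 GB without it, which is why a handful of long cached prefixes can fill an accelerator.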
The cache expires after its TTL (Anthropic 5 min or 1 h, OpenAI "minutes"). Apps with bursty traffic (sporadic queries) lose the cache between bursts and pay to rebuild it. Mitigation: send a periodic keep-alive request that reuses the prefix, or request the 1-hour TTL through `cache_control`.
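The effect of the TTL on bursty traffic can be estimated with a short simulation. It assumes a single cached prefix whose TTL is refreshed on every use; `count_hits` and the arrival times are made up for illustration.

```python
def count_hits(arrival_times, ttl):
    """Count cache hits for one prefix, given request times in seconds.

    Assumes every request either writes the cache (miss) or refreshes it (hit).
    """
    hits = 0
    expires_at = None
    for t in arrival_times:
        if expires_at is not None and t <= expires_at:
            hits += 1
        expires_at = t + ttl  # written on a miss, refreshed on a hit
    return hits


# Three bursts separated by long idle gaps (times in seconds).
arrivals = [0, 60, 120, 900, 960, 2400]

short_ttl = count_hits(arrivals, ttl=300)    # 5-minute TTL
long_ttl = count_hits(arrivals, ttl=3600)    # 1-hour TTL
print(short_ttl, long_ttl)
```

With the 5-minute TTL the first request of every burst is a miss, giving 3 hits out of 6 requests; with the 1-hour TTL only the very first request misses, giving 5 hits. A keep-alive request has the same effect as a longer TTL, at the price of extra cache reads.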
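Cross-tenant cache sharing is a security risk: if a cloud provider naively shares the prefix cache across customers, one customer can detect whether another has already sent a given prefix by timing the response (a timing side-channel). The major providers state that they isolate the cache per organization / API key.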
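One simple way to get this isolation in a self-hosted cache is to mix a tenant identifier into the cache key, so that identical prefixes from different tenants never collide. This is a minimal sketch, not a description of how any provider implements isolation; `tenant_cache_key` and the tenant IDs are hypothetical.

```python
import hashlib


def tenant_cache_key(tenant_id, tokens):
    """Hash of the token prefix, namespaced by tenant."""
    h = hashlib.sha256()
    h.update(tenant_id.encode())
    h.update(b"\x00")  # separator so the tenant ID cannot run into the token data
    h.update(",".join(str(t) for t in tokens).encode())
    return h.hexdigest()


prefix = [101, 2023, 2003, 1037, 4937]
key_a = tenant_cache_key("org-a", prefix)
key_b = tenant_cache_key("org-b", prefix)
print(key_a != key_b)  # same prefix, different tenants, different cache entries
```

Because the two keys differ, a request from one tenant can never hit an entry written by the other, so response timing reveals nothing about the other tenant's prompts. The cost is that identical prefixes are stored once per tenant.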