Components

Prompt Caching — What It Is and How It Works

Sir Robot29 May 2026 · 9 min read

Sir Robot

29 May 2026 · 9 min readAI-assisted · editorial review

Prompt Caching is an optimisation mechanism in large language model APIs that allows repeated reuse of processed prompt fragments without recomputing them from scratch. For queries containing long, repetitive contexts — such as extensive system instructions, documents, or few-shot examples — it can make API calls up to ten times cheaper and several times faster.

What is Prompt Caching?

Prompt Caching is a technique that allows a language model to temporarily store intermediate computation results for the fixed, repeatable part of a query. Instead of processing the same context on every API call, the system can retrieve pre-computed representations and only process the new, variable part of the input.

It is purely an infrastructure-level tool. It does not change model behavior, does not affect the quality of generated text, and does not add memory to the model. It operates exclusively as a computational layer cache, increasing cost efficiency and response speed in scenarios where the same data is sent to the model repeatedly.

The term appears under different names across providers: Anthropic calls it _prompt caching_ (using a `cache_control` mechanism), OpenAI applies _prompt caching_ automatically without any developer configuration, and Google Gemini offers _context caching_ as a separate API resource with a dedicated name and configurable TTL.

Who is behind it?

Anthropic introduced Prompt Caching for Claude models in August 2024. OpenAI launched its own automatic caching implementation for GPT-4o and o1 models in October 2024. Google Gemini has offered Context Caching as a separate API feature for Gemini 1.5 Pro and Flash since mid-2024.

Each provider developed an independent implementation suited to their system architecture — though the concept of the underlying mechanism, the KV cache (Key-Value cache), originates from Transformer efficiency research and was well understood in the research community long before commercial deployment.

How does it work?

Technically, Prompt Caching is built on KV Cache (Key-Value Cache) — a data structure created during the processing of token sequences by a Transformer. For each model layer and each input token, key (K) and value (V) vectors are computed. In standard processing, these vectors are recomputed from scratch on every API call.

The caching mechanism allows these vectors to be preserved for the fixed, initial portion of a prompt. If the next query begins with an identical prefix, the system skips its computation and immediately begins processing only the new portion of the input.

In practice:

First call: the model processes the entire prompt (fixed prefix + variable user query), charged at full input token price.
The system creates and stores the KV Cache for the fixed prefix.
Subsequent calls with the same prefix: the model reads from cache, processes only the variable part — cached tokens cost a fraction of the standard rate.

The critical requirement is token order — cache is built for a prefix, meaning tokens at the beginning of the prompt. Any change to even a single token in the fixed portion invalidates the cache.

What are its key components?

Prompt Caching implementations differ between providers, but each contains several common elements:

Fixed prefix (cacheable prefix) — the part of the prompt that remains identical across all calls. Typically a system instruction, a long document, a few-shot example set, or a knowledge base.

Variable part (variable suffix) — the user query or dynamic data. This portion is always charged at the standard input rate.

TTL (Time to Live) — how long the cache is retained:

Anthropic Claude: 5 minutes (refreshable by subsequent calls); extended cache beta models may hold it longer
OpenAI: from a few minutes to several hours (server-dependent, no fixed value in documentation)
Google Gemini Context Caching: 1 minute to 1 hour (developer-configurable)

Minimum token threshold — caching is not cost-effective for short prompts:

Anthropic: minimum 1,024 tokens for Claude 3.5 Sonnet, minimum 2,048 for older models
OpenAI: minimum 1,024 tokens
Google Gemini: minimum 32,768 tokens for Gemini 1.5 Pro (significantly higher threshold)

Cache token pricing — the cost of reading from cache is lower than standard processing:

Anthropic: cached tokens cost approximately 10% of standard input price (90% discount)
OpenAI: cached tokens cost 50% of standard input price (50% discount)
Google Gemini: cached tokens cost approximately 25% of standard input price (75% discount), with an additional per-hour storage charge

Cache write vs. cache read

It is easy to assume that because caching lowers the bill, the cache itself is free. It is not — the mechanism involves two distinct costs:

Cache write — a one-time charge for creating the cache when the prefix is processed for the first time. At Anthropic, a cache write is even more expensive than regular input (about 25% more, i.e. ~1.25× the input price).
Cache read — the much cheaper cost of reusing the stored cache on subsequent calls (about 10% of the input price at Anthropic).

On first use, the provider has to create the cache, which incurs an extra charge; only subsequent calls benefit from the cheaper read. That is why caching pays off mainly when the same context is reused many times — the write cost must be amortized across enough reads. This is a crucial consideration from a system architect’s perspective.

What can it be used for?

Prompt Caching is most cost-effective in scenarios where the same large context appears in many consecutive API calls:

Extensive system instructions — applications with system prompts of several thousand tokens (detailed behavioral rules, persona definitions, response formatting). Instead of paying for those tokens on every user query, they are buffered once.
Document analysis in multi-turn conversations — scenario: a user uploads a contract PDF and asks dozens of questions about its content. The document is cached once; each question pays only for its own tokens.
Few-shot learning with examples — when a prompt contains dozens of training examples (e.g., input-output pairs for a specific classification task), those examples can be cached and reused across all subsequent calls.
Retrieval-Augmented Generation (RAG) — RAG systems that inject long knowledge base passages into every prompt can benefit from caching when the same documents are retrieved repeatedly.
Chatbots with long conversation context — in multi-turn conversations, the entire conversation history could potentially be cached, though precise control over the fixed prefix is needed.
Agentic workflows — agent systems in which an agent repeatedly references the same contextual resources (task specifications, tool definitions, rules).

How does it differ from other approaches?

Prompt Caching should be distinguished from several related but different concepts:

vs. semantic caching — Prompt Caching works at the exact token level — it requires an identical prefix byte-for-byte. Semantic caching (e.g., used in tools like GPTCache) works differently: if two queries have similar semantic meaning (but not necessarily identical tokens), it returns a stored response. Semantic caching caches _output_; Prompt Caching caches _internal model computations_.
vs. fine-tuning — Fine-tuning permanently embeds knowledge into model weights, eliminating the need to provide it in the prompt. Prompt Caching does not change the model — knowledge must still be in the prompt, but the cost of providing it is reduced.
vs. conversation history management — Managing conversation history is an application-level decision — what to keep in Context Window and for how long. Prompt Caching is an infrastructure optimization — how to cost-effectively process that context.
vs. RAG — RAG is an information selection strategy (retrieve → inject → generate). Prompt Caching is an execution optimization (if you inject the same documents repeatedly, pay for them once). Both mechanisms can coexist.

Key limitations and challenges

Sensitivity to token order — Even a single change at the beginning of a prompt invalidates the entire cache. Dynamic data (e.g., timestamps, session IDs, system variables) cannot appear in the fixed prefix portion of the prompt.
Cold start — The first API call always pays the full input token price. In scenarios with few calls, savings may be minimal or nonexistent.
Additional storage cost (Google) — Google Gemini charges an hourly storage fee for maintaining the cache. With infrequent calls, the storage cost may exceed the savings from cheaper tokens.
No cache hit guarantee — OpenAI does not guarantee how long a cache will remain active. With irregular traffic or infrastructure changes on their end, the cache may not be available.
Limited control at OpenAI — OpenAI's automatic caching works without developer configuration, which is convenient but prevents explicit control over what is cached and for how long.
Minimum token threshold as a barrier — The requirement of at least 1,024 (or 32,768 for Gemini) tokens means caching is unavailable for short prompts, excluding some use cases.

Why does it matter?

Prompt Caching addresses a real tension that became apparent as long-context models became widespread. Models like Claude 3.5 Sonnet (200,000 tokens) or Gemini 1.5 Pro (1 million tokens) allow loading entire codebases, legal documents, or correspondence histories into a prompt — but each subsequent call with the same context incurs an identical computational and financial cost.

For small prototypes, this is an academic consideration. For production applications handling thousands of sessions per day, it represents a fundamental difference in economics. A chatbot application analyzing medical documents, where each prompt includes a 50-page clinical context, can reduce input token costs by 80–90% at high traffic through caching.

The deeper consequence is architectural: Prompt Caching changes the cost calculus for long-context solutions versus RAG. Historically, RAG was the preferred strategy partly because injecting large documents on every call was prohibitively expensive. Caching makes "long context" a genuinely competitive alternative for at least some use cases where RAG retrieval precision is uncertain or where a complete context is truly needed.

It is also a signal about where the market is heading: model providers are actively reducing cost barriers for context-rich applications, which will gradually shift the equilibrium between prompt minimalism and full contextual expression.

Sources

Anthropic — Prompt Caching documentation — link
OpenAI — Prompt Caching guide — link
Google Gemini — Context Caching documentation — link
Anthropic — Claude API Messages reference — link

Share this insight

01Course

Prompt Caching — What It Is and How It Works

What is Prompt Caching?

Who is behind it?

How does it work?

What are its key components?

Cache write vs. cache read

What can it be used for?

How does it differ from other approaches?

Key limitations and challenges

Why does it matter?

Sources

Prompt Engineering in Practice

KV Cache

Context Window

RAG

LLM

Transformer

Tokenization

Efficient Memory Management for Large Language Model Serving with PagedAttention

Attention Is All You Need

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Prompt Caching — What It Is and How It Works

What is Prompt Caching?

Who is behind it?

How does it work?

What are its key components?

Cache write vs. cache read

What can it be used for?

How does it differ from other approaches?

Key limitations and challenges

Why does it matter?

Sources

Go deeper

Prompt Engineering in Practice

KV Cache

Context Window

RAG

LLM

Transformer

Tokenization

Efficient Memory Management for Large Language Model Serving with PagedAttention

Attention Is All You Need

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks