Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.
What is Prompt Caching?
When you send a request to an LLM API, the model must process every input token through its neural network layers to build an internal representation called the KV-cache (key-value cache). This KV-cache is the model's "understanding" of your input, and computing it accounts for a significant portion of the inference cost.
Prompt caching stores this KV-cache on the provider's servers between requests. When a subsequent request shares the same prefix tokens, the provider loads the cached KV-cache instead of recomputing it. This saves compute, and the savings are passed on to you as a price discount.
The key insight: prompt caching works on prefixes, not arbitrary substrings. The cached portion must start from the beginning of the prompt. This means your prompt architecture should place static content (system prompt, tool definitions, few-shot examples) at the beginning, and dynamic content (user query, conversation history) at the end.
Both Anthropic and OpenAI implement prompt caching at the infrastructure level. You do not need to manage a cache yourself — the provider handles storage, invalidation, and matching. Your job is to structure prompts so that the cacheable prefix is maximized.
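The "prefix, not substring" rule can be made concrete with a toy matcher. This is an illustrative sketch, not provider code — it just shows that a change anywhere early in the token sequence wipes out everything that follows it, even if the rest of the prompt is identical:

```python
def cached_prefix_len(prev_tokens: list[str], new_tokens: list[str]) -> int:
    """Length of the shared leading run of tokens — the only part a
    provider-side prompt cache can reuse."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

base          = ["SYSTEM", "You", "are", "helpful", "USER", "hi"]
same_prefix   = ["SYSTEM", "You", "are", "helpful", "USER", "bye"]
changed_start = ["system", "You", "are", "helpful", "USER", "hi"]

# Only the trailing user turn differs: a long prefix is reusable.
assert cached_prefix_len(base, same_prefix) == 5
# A change in the very first token invalidates the entire cache.
assert cached_prefix_len(base, changed_start) == 0
```

This is why the structural advice below matters: the more of your prompt that is byte-for-byte stable from the start, the more the provider can reuse.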
Anthropic vs. OpenAI Caching Compared
The two major providers implement prompt caching differently. Understanding these differences is critical for optimizing across providers.
| Feature | Anthropic | OpenAI |
|---|---|---|
| Activation | Explicit: add cache_control breakpoints | Automatic: no code changes needed |
| Discount | 90% off cached input tokens | 50% off cached input tokens |
| Cache Write Cost | 25% surcharge on first write | No surcharge |
| TTL (Time-to-Live) | 5 minutes (refreshed on hit) | Managed by OpenAI (no explicit TTL) |
| Minimum Tokens | 1,024 tokens (Claude Sonnet), 2,048 (Haiku) | No minimum |
| Max Breakpoints | 4 per request | N/A (automatic) |
| Applies To | System prompt, tools, first user message | All input messages |
| Output Token Discount | None | None |
Anthropic's 90% discount is more aggressive than OpenAI's 50%, but it requires explicit opt-in and carries a 25% surcharge on the first cache write. Even so, Anthropic caching breaks even by the second request with the same prefix: the first request pays 125% (write surcharge) and the second pays 10% (cache hit), so the average after 2 requests is 67.5% — already cheaper than the 100% you would pay without caching. By the 5th request, the average drops to 33%, and by the 20th it falls below 16%, approaching the 10% floor.
OpenAI's automatic caching requires zero code changes and has no write surcharge, making it easier to adopt. But the 50% discount ceiling means the long-run savings are lower than Anthropic's 90% ceiling. For high-volume applications, Anthropic's caching is significantly more valuable despite the setup complexity.
Designing Cache-Friendly Prompts
The most important prompt architecture decision for caching is putting static content first. The cache matches on a contiguous prefix from the start of the prompt. Any variation in the prefix invalidates the cache. Here is the optimal structure:
┌─────────────────────────────────────┐
│ 1. System prompt (static) │ ← CACHED
│ 2. Tool definitions (static) │ ← CACHED
│ 3. Few-shot examples (static) │ ← CACHED
│ 4. Retrieved context (semi-static) │ ← Sometimes cached
│ 5. Conversation history (dynamic) │ ← NOT cached
│ 6. Current user query (dynamic) │ ← NOT cached
└─────────────────────────────────────┘

For Anthropic, add cache_control breakpoints at the boundaries:
{
"system": [
{
"type": "text",
"text": "You are a helpful assistant for CostHawk...",
"cache_control": { "type": "ephemeral" }
}
],
"messages": [
{
"role": "user",
"content": "What is my current spend?"
}
]
}

Common mistakes that break caching:
- Including timestamps in the system prompt: A system prompt with "Today's date is March 16, 2026" changes daily, invalidating the cache every 24 hours. Move dynamic dates to the user message instead.
- Randomizing few-shot example order: If you shuffle few-shot examples between requests, the prefix changes every time. Use a fixed order.
- Per-user system prompt customization: Adding the user's name or preferences to the system prompt creates a unique prefix per user. Move personalization to a user message.
- Dynamic tool definitions: If tool definitions change between requests (e.g., filtering available tools), the prefix varies. Keep tool definitions stable or move dynamic tools to the end.
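The rules above can be folded into a single request-builder so the cache-safe structure is enforced in one place. The helper below is a hypothetical sketch (the prompt text, tool definition, and function name are illustrative, not CostHawk's actual code): static content goes into the cached prefix, and anything dynamic — including the date — goes into the user turn at the end:

```python
# Static, cacheable parts: defined once, never interpolated per-request.
SYSTEM_PROMPT = "You are a helpful assistant for CostHawk..."  # no dates, no user names
TOOLS = [{"name": "get_spend", "description": "Fetch current spend.",
          "input_schema": {"type": "object"}}]

def build_request(history: list[dict], user_query: str, today: str) -> dict:
    """Assemble an Anthropic-style request body with static content first."""
    return {
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache boundary
            }
        ],
        "tools": TOOLS,  # fixed set, fixed order between requests
        "messages": history + [
            # Dynamic values live after the cacheable prefix.
            {"role": "user", "content": f"Today's date is {today}. {user_query}"}
        ],
    }

req = build_request([], "What is my current spend?", "2026-03-16")
assert req["system"][0]["cache_control"] == {"type": "ephemeral"}
assert "2026-03-16" not in req["system"][0]["text"]  # date stays out of the prefix
```

Centralizing request assembly like this makes cache-breaking changes (a stray timestamp, a shuffled tool list) show up in code review instead of in your bill.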
Break-Even Analysis
Prompt caching has a cost: Anthropic charges a 25% surcharge on the first cache write. This means caching is not free — you need a minimum number of requests with the same prefix for the investment to pay off.
Anthropic break-even calculation:
Without caching: N requests × 1.0 × input_rate = N × rate
With caching: 1 × 1.25 × rate + (N-1) × 0.10 × rate
Break-even when:
N × rate = 1.25 × rate + (N-1) × 0.10 × rate
N = 1.25 + 0.10N - 0.10
0.90N = 1.15
N = 1.28
→ Caching pays off after just 2 requests.

After 2 requests, the average cost per request is (1.25 + 0.10) / 2 = 0.675, or 67.5% of the uncached cost — a 32.5% savings. After 10 requests: (1.25 + 9 × 0.10) / 10 = 0.215, a 78.5% savings. After 100 requests: (1.25 + 99 × 0.10) / 100 = 0.1115, approaching the 90% maximum discount.
OpenAI break-even: Since OpenAI has no write surcharge, caching is profitable from the very first cache hit. The first request costs 1.0× and the second costs 0.5×, for an average of 0.75× — an immediate 25% savings.
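The arithmetic above can be checked with a small calculator. This sketch expresses cost as a fraction of the uncached input rate and assumes the entire prefix is cached on every request after the first — a best case, since real hit rates are lower:

```python
def anthropic_avg_cost(n: int) -> float:
    """Average input cost per request (fraction of uncached cost) after n
    requests sharing one prefix: 1.25x write, then 0.10x cached reads."""
    return (1.25 + (n - 1) * 0.10) / n

def openai_avg_cost(n: int) -> float:
    """OpenAI: no write surcharge, cached reads billed at 0.50x."""
    return (1.0 + (n - 1) * 0.50) / n

assert round(anthropic_avg_cost(2), 4) == 0.675    # 32.5% savings
assert round(anthropic_avg_cost(10), 4) == 0.215   # 78.5% savings
assert round(anthropic_avg_cost(100), 4) == 0.1115 # approaching the 90% floor
assert round(openai_avg_cost(2), 4) == 0.75        # immediate 25% savings
```

The two curves cross early: Anthropic is more expensive on request one, cheaper from roughly request two onward, and far cheaper at high volume.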
The practical question is not "should I use caching?" (yes, always) but "how much will it save based on my traffic patterns?" The answer depends on your cache hit rate, which depends on how well your prompt architecture supports prefix matching and how frequently your prefixes repeat within the TTL window.
Prompt Caching vs. Semantic Caching
Prompt caching and semantic caching are fundamentally different optimizations that operate at different layers of the stack:
| Dimension | Prompt Caching | Semantic Caching |
|---|---|---|
| Level | Provider infrastructure | Application layer |
| What's cached | KV-cache (model internal state) | Complete responses |
| Match criteria | Exact prefix match | Semantic similarity (cosine >0.95) |
| Savings | 50-90% on input tokens only | 100% (skips LLM call entirely) |
| Output tokens | Still generated and billed | Served from cache (free) |
| Freshness | Always current (model generates fresh output) | May serve stale answers |
| Implementation | Prompt architecture changes | Embedding + vector database |
| Applicable to | All requests (if prefix matches) | Only queries with semantic matches |
The two approaches are complementary, not competing. Prompt caching reduces the input cost of every request that shares a prefix. Semantic caching eliminates the LLM cost entirely for repeated or near-duplicate queries. A well-optimized system uses both: prompt caching saves 50-90% on input tokens for all requests, and semantic caching eliminates the full cost for 20-40% of queries that are semantically similar to previously answered ones.
The combined savings can be dramatic. If semantic caching handles 30% of queries (saving 100% of LLM cost on those), and prompt caching saves 70% on input tokens for the remaining 70%, the blended cost reduction across all queries is approximately 50-60%.
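That blended figure depends on what share of your spend is input tokens, which the article does not fix; the sketch below makes that share an explicit parameter (the 60% value is an assumption for illustration) so you can plug in your own ratio:

```python
def blended_savings(semantic_hit_rate: float,
                    prompt_cache_input_discount: float,
                    input_cost_share: float) -> float:
    """Fraction of total LLM spend eliminated when both caches are active.
    Semantic hits skip the call entirely; prompt caching only discounts
    the input-token share of the remaining traffic."""
    remaining = 1.0 - semantic_hit_rate
    return semantic_hit_rate + remaining * prompt_cache_input_discount * input_cost_share

# 30% semantic hits, 70% effective input discount on the rest,
# input tokens assumed to be 60% of total spend:
savings = blended_savings(0.30, 0.70, 0.60)
assert round(savings, 3) == 0.594  # ~59%, inside the article's 50-60% range
```

If output tokens dominate your bill, the blended number falls, because prompt caching never discounts output tokens — only semantic caching eliminates them.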
Monitoring Cache Hit Rates
Cache hit rate is the percentage of input tokens that are served from cache rather than reprocessed. Both Anthropic and OpenAI report cache statistics in the usage object of each API response.
Anthropic returns three fields in the usage object:
{
"usage": {
"input_tokens": 500,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 2500,
"output_tokens": 300
}
}

In this example, 2,500 of the 3,000 total input tokens were served from cache — an 83% cache hit rate. The 500 non-cached tokens are the dynamic portion of the prompt (user query, conversation history).
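Computing the hit rate from Anthropic's usage object is a one-liner once you remember that input_tokens excludes the cached and cache-written tokens, so the total is the sum of all three fields:

```python
def anthropic_cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, from an Anthropic
    usage object (input_tokens excludes cached/written tokens)."""
    cached = usage.get("cache_read_input_tokens", 0)
    total = (usage["input_tokens"]
             + usage.get("cache_creation_input_tokens", 0)
             + cached)
    return cached / total if total else 0.0

usage = {"input_tokens": 500, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 2500, "output_tokens": 300}
assert round(anthropic_cache_hit_rate(usage), 2) == 0.83  # 2500 / 3000
```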
OpenAI returns similar data in the usage.prompt_tokens_details field:
{
"usage": {
"prompt_tokens": 3000,
"prompt_tokens_details": {
"cached_tokens": 2500
},
"completion_tokens": 300
}
}

CostHawk ingests these cache statistics automatically and provides:
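Because OpenAI bills cached tokens at half the normal input rate, the dollar savings on a single request fall out of the usage object directly. A minimal sketch, assuming a hypothetical $2.50/1M input rate (note that in OpenAI's schema, prompt_tokens already includes the cached tokens):

```python
def openai_cached_savings(usage: dict, input_rate_per_mtok: float) -> float:
    """Dollars saved on one request: cached tokens are billed at 50% of
    the normal input rate, so the discount is half-rate on that portion."""
    cached = usage["prompt_tokens_details"]["cached_tokens"]
    return cached / 1_000_000 * input_rate_per_mtok * 0.5

usage = {"prompt_tokens": 3000,
         "prompt_tokens_details": {"cached_tokens": 2500},
         "completion_tokens": 300}
# At the assumed $2.50/1M rate, 2,500 cached tokens save about a third of a cent:
assert round(openai_cached_savings(usage, 2.50), 6) == 0.003125
```

Summed over millions of requests, this per-request fraction of a cent is where the dashboard-level savings figures come from.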
- Cache hit rate dashboard: Overall and per-endpoint cache hit rates trended over time.
- Cache savings calculator: Exact dollar savings from caching compared to the same traffic without caching.
- Cache miss analysis: Identification of requests with low cache hit rates, which may indicate prompt architecture issues like dynamic content in the prefix.
- TTL optimization: For Anthropic's 5-minute TTL, CostHawk helps you identify endpoints where request frequency is too low to maintain cache warmth, suggesting architectural changes like batching or request pacing.
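The TTL-warmth problem in the last bullet is easy to quantify from request timestamps. This sketch (illustrative, not CostHawk's implementation) counts how many requests arrive after a gap longer than the TTL and therefore pay the full cache-write cost again, given that Anthropic's 5-minute window is refreshed on every hit:

```python
def cold_cache_requests(timestamps: list[float], ttl_seconds: float = 300.0) -> int:
    """Count requests that land on a cold cache: the first request, plus
    any request whose gap from the previous one exceeds the TTL."""
    if not timestamps:
        return 0
    cold = 1  # the very first request always writes the cache
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > ttl_seconds:
            cold += 1
    return cold

# Requests at t = 0s, 120s, 900s, 960s: the 900s request hits a cold
# cache (780s gap > 300s TTL), so 2 of 4 requests pay the write cost.
assert cold_cache_requests([0, 120, 900, 960]) == 2
```

If this ratio is high for an endpoint, batching or pacing requests to land inside the TTL window converts cache writes into cache reads.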
Frequently Asked Questions
Do I need to change my code to use prompt caching?
For OpenAI, no — caching is automatic. For Anthropic, yes: you add cache_control breakpoints to your system prompt and messages to indicate where the cache boundary should be. This gives you more control over what gets cached but requires prompt architecture changes. In both cases, the most important change is structural: organize your prompts so that static content (system prompt, tool definitions, examples) comes first and dynamic content (user query, conversation history) comes last. This maximizes the cacheable prefix.

How long does the prompt cache last?
Anthropic's cache has a 5-minute TTL that is refreshed on every cache hit, so steady traffic keeps it warm indefinitely. OpenAI manages cache lifetime automatically and publishes no explicit TTL.

What is the minimum number of tokens for caching?
Anthropic requires a minimum cacheable prefix of 1,024 tokens on Claude Sonnet (2,048 on Haiku) before it honors cache_control breakpoints. This means short system prompts may need to be padded or combined with tool definitions to reach the threshold. OpenAI has no published minimum — caching applies automatically regardless of prefix length. In practice, the savings from caching are proportional to the number of cached tokens, so caching a 200-token prefix saves much less in absolute terms than caching a 3,000-token prefix. Most production applications have system prompts and tool definitions well above the 1,024-token threshold.

Does Anthropic's 25% write surcharge make caching expensive?
No. The surcharge applies only to the first cache write, and the break-even analysis shows caching is already cheaper than no caching by the second request with the same prefix.

Can I cache conversation history?
Yes. On Anthropic, you can place cache_control breakpoints at the end of the current conversation history, and the next turn will cache everything up to that point. However, the 5-minute TTL means this only helps if turns are spaced less than 5 minutes apart. For most chatbot applications, system prompt caching provides the bulk of savings (90% discount on 1,000-3,000 tokens), while conversation history caching provides a smaller incremental benefit.

What is the difference between prompt caching and KV-cache?
The KV-cache is the model's internal representation of your input, computed during inference. Prompt caching is the provider-side feature that stores that KV-cache between requests so it does not have to be recomputed for a repeated prefix.

Can prompt caching cause stale or incorrect responses?
No. Unlike semantic caching, prompt caching only reuses the processed input state — the model still generates a fresh output for every request, so responses are always current.

How much can prompt caching save me per month?
It depends on your cache hit rate, which depends on how well your prompt architecture supports prefix matching and how often your prefixes repeat within the TTL window. The savings apply to input tokens only, at up to 50% (OpenAI) or 90% (Anthropic) on the cached portion of your traffic.
Related Terms
Semantic Caching
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20-40%.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
