Glossary · Optimization · Updated 2026-03-16

Prompt Caching

A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.

Definition

What is Prompt Caching?

Prompt caching is a server-side optimization offered by AI providers that stores the processed representation of prompt prefixes across requests. When consecutive requests share the same prefix (system prompt, tool definitions, or static context), the provider reuses the cached computation instead of reprocessing those tokens. Anthropic offers a 90% discount on cached input tokens with a 5-minute TTL and a minimum of 1,024 tokens. OpenAI applies an automatic 50% discount to cached prefixes of 1,024 tokens or more, with no explicit TTL management required. Prompt caching applies only to input tokens — output tokens are always generated fresh and priced at the full rate.

Impact

Why It Matters for AI Costs

System prompts and tool definitions are sent with every API request but rarely change between calls. For an application with a 3,000-token system prompt processing 10,000 requests per day on Claude Sonnet ($3.00/MTok input), the 90% cache discount saves $2.70 per million cached tokens. At 30 million cached-prefix tokens per day, that is roughly $81 per day, or about $2,430 over a month (before the one-time write surcharge), from a single optimization that requires minimal code changes. CostHawk tracks cache hit rates and calculates the exact dollar savings from caching across all your API calls.
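The arithmetic above can be sketched as a small helper. This is a rough model, not a billing tool: it assumes Claude Sonnet list pricing ($3.00/MTok input, 90% cache-read discount), ignores the one-time 25% cache-write surcharge, and assumes every request hits the cache.

```python
def daily_cache_savings(prefix_tokens: int, requests_per_day: int,
                        input_rate_per_mtok: float = 3.00,
                        cache_discount: float = 0.90) -> float:
    """Rough daily savings from caching a static prompt prefix.

    Assumes every request reads the cached prefix; ignores the
    one-time write surcharge and TTL expirations.
    """
    cached_tokens = prefix_tokens * requests_per_day
    savings_per_mtok = input_rate_per_mtok * cache_discount
    return cached_tokens / 1_000_000 * savings_per_mtok

# 3,000-token system prompt, 10,000 requests/day on Sonnet:
print(daily_cache_savings(3_000, 10_000))  # prints 81.0 (dollars/day)
```

Multiply by 30 for a rough monthly figure; CostHawk computes the exact number from observed cache-read token counts rather than this idealized model.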

What is Prompt Caching?

When you send a request to an LLM API, the model must process every input token through its neural network layers to build an internal representation called the KV-cache (key-value cache). This KV-cache is the model's "understanding" of your input, and computing it accounts for a significant portion of the inference cost.

Prompt caching stores this KV-cache on the provider's servers between requests. When a subsequent request shares the same prefix tokens, the provider loads the cached KV-cache instead of recomputing it. This saves compute and is passed on to you as a price discount.

The key insight: prompt caching works on prefixes, not arbitrary substrings. The cached portion must start from the beginning of the prompt. This means your prompt architecture should place static content (system prompt, tool definitions, few-shot examples) at the beginning, and dynamic content (user query, conversation history) at the end.

Both Anthropic and OpenAI implement prompt caching at the infrastructure level. You do not need to manage a cache yourself — the provider handles storage, invalidation, and matching. Your job is to structure prompts so that the cacheable prefix is maximized.

Anthropic vs. OpenAI Caching Compared

The two major providers implement prompt caching differently. Understanding these differences is critical for optimizing across providers.

| Feature | Anthropic | OpenAI |
|---|---|---|
| Activation | Explicit: add cache_control breakpoints | Automatic: no code changes needed |
| Discount | 90% off cached input tokens | 50% off cached input tokens |
| Cache Write Cost | 25% surcharge on first write | No surcharge |
| TTL (Time-to-Live) | 5 minutes (refreshed on hit) | Managed by OpenAI (no explicit TTL) |
| Minimum Tokens | 1,024 (Claude Sonnet), 2,048 (Haiku) | 1,024 |
| Max Breakpoints | 4 per request | N/A (automatic) |
| Applies To | System prompt, tools, first user message | All input messages |
| Output Token Discount | None | None |

Anthropic's 90% discount is more aggressive than OpenAI's 50%, but it requires explicit opt-in and carries a 25% write surcharge on the first cache write. Even so, Anthropic caching breaks even after just 2 requests with the same prefix: the first request pays 125% (write surcharge), the second pays 10% (cache hit), so the average after 2 requests is 67.5% — already cheaper than the 100% you would pay without caching. By the 5th request, the average drops to 33%, and by the 20th it falls to about 16%, approaching the 10% floor.

OpenAI's automatic caching requires zero code changes and has no write surcharge, making it easier to adopt. But the 50% discount ceiling means the long-run savings are lower than Anthropic's 90% ceiling. For high-volume applications, Anthropic's caching is significantly more valuable despite the setup complexity.

Designing Cache-Friendly Prompts

The most important prompt architecture decision for caching is putting static content first. The cache matches on a contiguous prefix from the start of the prompt. Any variation in the prefix invalidates the cache. Here is the optimal structure:

┌─────────────────────────────────────┐
│ 1. System prompt (static)           │ ← CACHED
│ 2. Tool definitions (static)        │ ← CACHED
│ 3. Few-shot examples (static)       │ ← CACHED
│ 4. Retrieved context (semi-static)  │ ← Sometimes cached
│ 5. Conversation history (dynamic)   │ ← NOT cached
│ 6. Current user query (dynamic)     │ ← NOT cached
└─────────────────────────────────────┘

For Anthropic, add cache_control breakpoints at the boundaries:

{
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant for CostHawk...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "What is my current spend?"
    }
  ]
}

Common mistakes that break caching:

  • Including timestamps in the system prompt: A system prompt with "Today's date is March 16, 2026" changes daily, invalidating the cache every 24 hours. Move dynamic dates to the user message instead.
  • Randomizing few-shot example order: If you shuffle few-shot examples between requests, the prefix changes every time. Use a fixed order.
  • Per-user system prompt customization: Adding the user's name or preferences to the system prompt creates a unique prefix per user. Move personalization to a user message.
  • Dynamic tool definitions: If tool definitions change between requests (e.g., filtering available tools), the prefix varies. Keep tool definitions stable or move dynamic tools to the end.
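One way to avoid these pitfalls is to assemble requests so that anything volatile lives in the user message and never in the cached prefix. A minimal sketch, following Anthropic's documented cache_control request shape; the helper name and example strings are hypothetical:

```python
from datetime import date


def build_request(static_system: str, static_examples: list[str],
                  user_query: str) -> dict:
    """Assemble a cache-friendly request: static prefix first, dynamic last."""
    system_blocks = [
        {
            "type": "text",
            # Static text only: no timestamps, no user names, and the
            # few-shot examples joined in a fixed order.
            "text": static_system + "\n\n" + "\n".join(static_examples),
            # Cache boundary after the static prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    messages = [
        {
            "role": "user",
            # Dynamic content (today's date, the query) goes after the prefix.
            "content": f"Today's date is {date.today().isoformat()}.\n{user_query}",
        }
    ]
    return {"system": system_blocks, "messages": messages}
```

Because the date string moves into the user message, the system prefix stays byte-identical across days and users, so the cache keeps matching.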

Break-Even Analysis

Prompt caching has a cost: Anthropic charges a 25% surcharge on the first cache write. This means caching is not free — you need a minimum number of requests with the same prefix for the investment to pay off.

Anthropic break-even calculation:

Without caching: N requests × 1.0 × input_rate = N × rate
With caching:    1 × 1.25 × rate + (N-1) × 0.10 × rate

Break-even when:
  N × rate = 1.25 × rate + (N-1) × 0.10 × rate
  N = 1.25 + 0.10N - 0.10
  0.90N = 1.15
  N = 1.28

→ Caching pays off after just 2 requests.

After 2 requests, the average cost per request is (1.25 + 0.10) / 2 = 0.675, or 67.5% of the uncached cost — a 32.5% savings. After 10 requests: (1.25 + 9 × 0.10) / 10 = 0.215, a 78.5% savings. After 100 requests: (1.25 + 99 × 0.10) / 100 = 0.1115, approaching the 90% maximum discount.

OpenAI break-even: Since OpenAI has no write surcharge, caching is profitable from the very first cache hit. The first request costs 1.0× and the second costs 0.5×, for an average of 0.75× — an immediate 25% savings.
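The break-even arithmetic for both providers can be checked with one small function; the multipliers come directly from the pricing rules above (25% write surcharge and 10% read rate for Anthropic, no surcharge and 50% read rate for OpenAI):

```python
def avg_cost_multiplier(n_requests: int, write_surcharge: float,
                        read_rate: float) -> float:
    """Average per-request input cost as a fraction of the uncached price.

    Assumes the first request writes the cache and all subsequent
    requests hit it within the TTL.
    """
    first = 1.0 + write_surcharge        # cache write
    rest = (n_requests - 1) * read_rate  # cache reads
    return (first + rest) / n_requests

# Anthropic: 25% write surcharge, reads at 10% of list price.
print(avg_cost_multiplier(2, 0.25, 0.10))   # prints 0.675
print(avg_cost_multiplier(10, 0.25, 0.10))  # ~0.215
# OpenAI: no surcharge, reads at 50%.
print(avg_cost_multiplier(2, 0.0, 0.50))    # prints 0.75
```

Any result below 1.0 means caching is already cheaper than the uncached baseline for that request count.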

The practical question is not "should I use caching?" (yes, always) but "how much will it save based on my traffic patterns?" The answer depends on your cache hit rate, which depends on how well your prompt architecture supports prefix matching and how frequently your prefixes repeat within the TTL window.

Prompt Caching vs. Semantic Caching

Prompt caching and semantic caching are fundamentally different optimizations that operate at different layers of the stack:

| Dimension | Prompt Caching | Semantic Caching |
|---|---|---|
| Level | Provider infrastructure | Application layer |
| What's cached | KV-cache (model internal state) | Complete responses |
| Match criteria | Exact prefix match | Semantic similarity (cosine > 0.95) |
| Savings | 50-90% on input tokens only | 100% (skips LLM call entirely) |
| Output tokens | Still generated and billed | Served from cache (free) |
| Freshness | Always current (model generates fresh output) | May serve stale answers |
| Implementation | Prompt architecture changes | Embedding + vector database |
| Applicable to | All requests (if prefix matches) | Only queries with semantic matches |

The two approaches are complementary, not competing. Prompt caching reduces the input cost of every request that shares a prefix. Semantic caching eliminates the LLM cost entirely for repeated or near-duplicate queries. A well-optimized system uses both: prompt caching saves 50-90% on input tokens for all requests, and semantic caching eliminates the full cost for 20-40% of queries that are semantically similar to previously answered ones.

The combined savings can be dramatic. If semantic caching handles 30% of queries (saving 100% of LLM cost on those), and prompt caching saves 70% on input tokens for the remaining 70%, the blended cost reduction across all queries is approximately 50-60%.
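That blended estimate can be made explicit. A rough model, with one assumption not stated by either provider: that input tokens account for about half of total request cost.

```python
def blended_cost_fraction(semantic_hit_rate: float,
                          input_cost_share: float,
                          prompt_cache_savings: float) -> float:
    """Fraction of the original LLM bill remaining after both optimizations.

    Semantic cache hits cost ~0; misses pay full output cost plus
    prompt-cache-discounted input cost.
    """
    miss_rate = 1.0 - semantic_hit_rate
    per_miss = (input_cost_share * (1.0 - prompt_cache_savings)
                + (1.0 - input_cost_share))
    return miss_rate * per_miss

# 30% semantic hit rate, inputs = 50% of cost, 70% input savings from caching:
remaining = blended_cost_fraction(0.30, 0.50, 0.70)
print(round(1 - remaining, 3))  # prints 0.545 (about a 55% total reduction)
```

Varying the input-cost share between 40% and 60% keeps the blended reduction in roughly the 50-60% range cited above.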

Monitoring Cache Hit Rates

Cache hit rate is the percentage of input tokens that are served from cache rather than reprocessed. Both Anthropic and OpenAI report cache statistics in the usage metadata of each API response.

Anthropic returns three fields in the usage object:

{
  "usage": {
    "input_tokens": 500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 2500,
    "output_tokens": 300
  }
}

In this example, 2,500 of the 3,000 total input tokens were served from cache — an 83% cache hit rate. The 500 non-cached tokens are the dynamic portion of the prompt (user query, conversation history).

OpenAI returns similar data in the usage.prompt_tokens_details field:

{
  "usage": {
    "prompt_tokens": 3000,
    "prompt_tokens_details": {
      "cached_tokens": 2500
    },
    "completion_tokens": 300
  }
}
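Extracting a hit rate from either payload is straightforward. A sketch using plain dicts in place of the SDK response objects; the field names follow the two usage shapes shown above:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, for either provider.

    Anthropic's input_tokens excludes cached tokens, so the total is the
    sum of all three input fields; OpenAI's prompt_tokens already
    includes cached_tokens.
    """
    if "cache_read_input_tokens" in usage:  # Anthropic shape
        cached = usage["cache_read_input_tokens"]
        total = (usage["input_tokens"]
                 + usage.get("cache_creation_input_tokens", 0)
                 + cached)
    else:  # OpenAI shape
        cached = usage["prompt_tokens_details"]["cached_tokens"]
        total = usage["prompt_tokens"]
    return cached / total if total else 0.0

anthropic_usage = {"input_tokens": 500, "cache_creation_input_tokens": 0,
                   "cache_read_input_tokens": 2500, "output_tokens": 300}
print(round(cache_hit_rate(anthropic_usage), 3))  # prints 0.833
```

Aggregating this per endpoint over time is exactly the trend CostHawk's dashboard surfaces.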

CostHawk ingests these cache statistics automatically and provides:

  • Cache hit rate dashboard: Overall and per-endpoint cache hit rates trended over time.
  • Cache savings calculator: Exact dollar savings from caching compared to the same traffic without caching.
  • Cache miss analysis: Identification of requests with low cache hit rates, which may indicate prompt architecture issues like dynamic content in the prefix.
  • TTL optimization: For Anthropic's 5-minute TTL, CostHawk helps you identify endpoints where request frequency is too low to maintain cache warmth, suggesting architectural changes like batching or request pacing.

FAQ

Frequently Asked Questions

Do I need to change my code to use prompt caching?
It depends on the provider. OpenAI's prompt caching is fully automatic — no code changes required. Any request that shares a prefix with a recent request will automatically receive the 50% input token discount. Anthropic's caching requires explicit opt-in: you add cache_control breakpoints to your system prompt and messages to indicate where the cache boundary should be. This gives you more control over what gets cached but requires prompt architecture changes. In both cases, the most important change is structural: organize your prompts so that static content (system prompt, tool definitions, examples) comes first and dynamic content (user query, conversation history) comes last. This maximizes the cacheable prefix.
How long does the prompt cache last?
Anthropic's cache has an explicit 5-minute TTL (time-to-live) that refreshes on every cache hit. This means the cache stays warm as long as you make at least one request with the same prefix every 5 minutes. For applications with steady traffic (more than 12 requests per hour with the same system prompt), the cache effectively never expires during operating hours. For low-traffic endpoints, the cache may expire between requests, reducing the benefit. OpenAI does not publish an explicit TTL — caching is managed automatically by their infrastructure. In practice, OpenAI cache hits are reliable for requests within seconds to minutes of each other, but there is no guaranteed retention window. CostHawk tracks cache hit rates over time so you can see exactly how your TTL patterns affect savings.
What is the minimum number of tokens for caching?
Anthropic requires a minimum of 1,024 tokens for Claude Sonnet models and 2,048 tokens for Haiku models to activate caching. Content below these thresholds cannot be cached even with cache_control breakpoints. This means short system prompts may need to be combined with tool definitions to reach the threshold. OpenAI's automatic caching likewise applies only to prompts of 1,024 tokens or longer. In practice, the savings from caching are proportional to the number of cached tokens, so caching a 1,200-token prefix saves much less in absolute terms than caching a 3,000-token prefix. Most production applications have system prompts and tool definitions well above the 1,024-token threshold.
Does Anthropic's 25% write surcharge make caching expensive?
The 25% write surcharge applies only to the first request that creates the cache entry. All subsequent cache reads get the 90% discount. This means caching breaks even after just 2 requests: request 1 costs 1.25× and request 2 costs 0.10×, for an average of 0.675× — already 32.5% cheaper than not caching. By the 10th request, the average cost is 0.215× (78.5% savings). By the 100th request, it approaches 0.10× (90% savings). The write surcharge is negligible for any endpoint that processes more than a handful of requests with the same prefix within the 5-minute TTL window. The only scenario where the surcharge hurts is if you have many unique prefixes with very low reuse — but in that case, caching would not be beneficial regardless of the surcharge.
Can I cache conversation history?
Technically yes, but it is rarely beneficial because conversation history changes with every turn. Prompt caching works on contiguous prefixes — the cached portion must start from the beginning and be identical across requests. In a conversation, turns 1-5 might be identical between your request for turn 6 and turn 7, so the history of turns 1-5 can be cached for turn 7's request. Anthropic's approach handles this well: you can set cache_control breakpoints at the end of the current conversation history, and the next turn will cache everything up to that point. However, the 5-minute TTL means this only helps if turns are spaced less than 5 minutes apart. For most chatbot applications, system prompt caching provides the bulk of savings (90% discount on 1,000-3,000 tokens), while conversation history caching provides a smaller incremental benefit.
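The moving-breakpoint pattern described above can be sketched as follows. The message contents are hypothetical and history turns are assumed to carry plain-string content; the cache_control placement follows Anthropic's documented request shape:

```python
def with_history_breakpoint(system_prompt: str, history: list[dict],
                            new_user_turn: str) -> dict:
    """Mark the end of existing history as a cache boundary so the next
    request within the TTL can reuse everything up to that point."""
    messages = [dict(m) for m in history]  # shallow copies of prior turns
    if messages:
        last = messages[-1]
        # Wrap the last history turn's text in a content block that
        # carries the cache breakpoint.
        last["content"] = [{
            "type": "text",
            "text": last["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    messages.append({"role": "user", "content": new_user_turn})
    return {
        "system": [{"type": "text", "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": messages,
    }
```

This uses two of the four allowed breakpoints: one after the system prompt, one after the latest history turn.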
What is the difference between prompt caching and KV-cache?
The KV-cache (key-value cache) is the internal data structure that transformer models build during inference. It stores the processed representation of all input tokens so the model can reference them efficiently during output generation. The KV-cache is created fresh for every request by default — this computation is a significant portion of inference cost. Prompt caching is the provider-side optimization that persists the KV-cache between requests. Instead of recomputing the KV-cache from scratch for each request, the provider stores it and reloads it when a matching prefix is detected. So KV-cache is the technical mechanism, and prompt caching is the product feature that exploits it for cost savings. You do not need to understand KV-cache internals to use prompt caching — just structure your prompts with static content first.
How does prompt caching interact with batch processing?
Prompt caching and batch processing are independent discounts that can potentially be combined. Batch processing gives a flat 50% discount on both input and output tokens. Prompt caching discounts input tokens by 50% (OpenAI) or 90% (Anthropic). On Anthropic, a batch request with cached input could theoretically achieve an input rate of $0.15/MTok for Sonnet (50% batch discount on $3.00 = $1.50, then 90% cache discount = $0.15) — a 95% reduction from the $3.00 list price. However, the interaction between batch pricing and cache pricing varies by provider and may not always stack perfectly. Check current provider documentation for the exact combined rates. CostHawk calculates your effective rate based on actual billed amounts regardless of discount stacking rules.
How much can prompt caching save me per month?
The savings depend on three factors: (1) the size of your cacheable prefix, (2) the number of requests that share that prefix, and (3) the provider's cache discount. Example calculation: System prompt of 2,000 tokens, 50,000 requests per day, using Claude Sonnet at $3.00/MTok. Without caching: 2,000 × 50,000 × $3.00/1M = $300/day in system prompt costs. With Anthropic caching at a 90% discount and a 95% hit rate: 5% of the 100M daily prefix tokens at $3.75/MTok (write surcharge) + 95% at $0.30/MTok = $18.75 + $28.50 = $47.25/day. Daily savings: $252.75. Monthly savings: roughly $7,580. That is from caching just the system prompt. Adding tool definitions (another 1,000-2,000 tokens) would increase savings proportionally. CostHawk's cache savings calculator models this for your actual traffic patterns.
Can prompt caching cause stale or incorrect responses?
No. Prompt caching only caches the model's internal representation of the input — it does not cache the output. Every request still generates a fresh response based on the full input (both cached and uncached portions). The model has no idea whether its input was served from cache or reprocessed; the internal representation is mathematically identical either way. This is a critical difference from semantic caching, which does return previously generated outputs and can serve stale answers. Prompt caching is completely safe from a correctness perspective — it is a pure infrastructure optimization with no impact on output quality or freshness. The only thing it changes is the cost of processing input tokens.
What cache hit rate should I target?
A well-architected application should achieve 60-90% cache hit rates on input tokens. The exact target depends on the ratio of static to dynamic content in your prompts. If your system prompt and tool definitions are 3,000 tokens and your average user query plus conversation history is 1,000 tokens, your theoretical maximum cache hit rate is 75% (3,000 out of 4,000 total input tokens). In practice, you might achieve 70-75% due to occasional cache misses from TTL expiration or prefix variations. If your cache hit rate is below 50%, investigate: (1) Is dynamic content accidentally placed before static content in the prompt? (2) Are there variations in your system prompt (timestamps, user names) that break prefix matching? (3) Is request frequency too low to keep the cache warm within Anthropic's 5-minute TTL? CostHawk's cache miss analysis identifies the root cause of low hit rates.


Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.