Semantic Caching
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20-40%.
How Semantic Caching Works
Semantic caching operates in four steps for every incoming query:
- Embed the query: The incoming user query is converted into a vector using an embedding model (e.g., OpenAI's text-embedding-3-small at $0.02/MTok). This produces a 1,536-dimensional vector representing the semantic meaning of the query.
- Search the cache: The query vector is compared against all previously cached query vectors using cosine similarity. This search runs against a vector database (Redis with vector search, Pinecone, Qdrant, or pgvector) and typically completes in 1-5 milliseconds.
- Check the threshold: If the highest-similarity cached entry exceeds the configured threshold (typically 0.95 cosine similarity), it is considered a match. A score of 0.95+ means the queries are semantically near-identical — different wording but the same intent.
- Return or generate: On a cache hit, the previously generated response is returned immediately. On a cache miss, the query is forwarded to the LLM, and both the query embedding and the generated response are stored in the cache for future matches.
The net effect is that paraphrased versions of the same question — "What is my current spend?", "How much have I spent?", "Show me my spending" — all resolve to the same cached answer after the first one is generated. This eliminates redundant LLM calls without requiring exact text matching.
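The four-step loop above can be sketched as a minimal in-memory version. This is a toy sketch, not a library: `embed` and `generate` are placeholders for a real embedding model and LLM call, and `SemanticCache` is an illustrative name.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """In-memory toy cache over the four steps: embed, search, threshold, return/generate."""

    def __init__(self, embed, generate, threshold: float = 0.95):
        self.embed = embed          # stand-in for an embedding model call
        self.generate = generate    # stand-in for an LLM call
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def query(self, text: str) -> str:
        vector = self.embed(text)                                  # Step 1: embed the query
        best = max(                                                # Step 2: search the cache
            ((cosine_similarity(vector, v), r) for v, r in self.entries),
            default=(0.0, ""),
        )
        if best[0] >= self.threshold:                              # Step 3: check the threshold
            return best[1]                                         # cache hit
        response = self.generate(text)                             # Step 4: miss, call the LLM
        self.entries.append((vector, response))                    # store for future matches
        return response
```

A production version replaces the linear scan with a vector database query, but the control flow is the same.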
Implementing Semantic Caching
There are two main approaches to implementing semantic caching: using a purpose-built library like GPTCache, or building a custom solution with an embedding model and a vector database.
Option 1: GPTCache (Python)

```python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import OpenAI as OpenAIEmbedding
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize embedding model
embedding = OpenAIEmbedding()

# Configure cache storage (SQLite + FAISS for local dev)
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=embedding.dimension)
data_manager = get_data_manager(cache_base, vector_base)

# Initialize cache with distance-based similarity evaluation
cache.init(
    embedding_func=embedding.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# Use the cached client — identical API to openai
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is my current spend?"}],
)
```

Option 2: Custom Redis vector search solution (TypeScript)
```typescript
import { OpenAI } from "openai";
import { createClient } from "redis";

const openai = new OpenAI();
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect(); // node-redis clients must connect before use

const SIMILARITY_THRESHOLD = 0.95;

async function semanticCachedQuery(query: string): Promise<string> {
  // Step 1: Embed the query
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const vector = embedding.data[0].embedding;

  // Step 2: Search Redis for the nearest cached query vector
  // (assumes an index "idx:cache" with a COSINE vector field named "vector")
  const results = await redis.ft.search(
    "idx:cache",
    "*=>[KNN 1 @vector $vec AS score]",
    {
      PARAMS: { vec: Buffer.from(new Float32Array(vector).buffer) },
      SORTBY: "score",
      LIMIT: { from: 0, size: 1 },
      DIALECT: 2, // KNN query syntax requires query dialect 2
    },
  );

  // Step 3: Check threshold (RediSearch returns cosine distance; similarity = 1 - distance)
  if (results.documents.length > 0) {
    const topScore = 1 - parseFloat(results.documents[0].value.score as string);
    if (topScore >= SIMILARITY_THRESHOLD) {
      return results.documents[0].value.response as string; // Cache hit
    }
  }

  // Step 4: Cache miss — call LLM and store result
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: query }],
  });
  const response = completion.choices[0].message.content!;

  // Store query, response, and embedding for future matches
  await redis.hSet(`cache:${Date.now()}`, {
    query,
    response,
    vector: Buffer.from(new Float32Array(vector).buffer),
  });
  return response;
}
```

Both approaches achieve the same goal: intercepting LLM calls and serving cached responses for semantically similar queries. The GPTCache approach is faster to set up; the custom approach gives you full control over storage, TTL, and eviction policies.
Threshold Tuning
The similarity threshold is the most important configuration parameter in semantic caching. It determines the tradeoff between cache hit rate (higher hits = more savings) and accuracy (lower threshold = more false matches = wrong answers served).
| Cosine Similarity Threshold | Typical Cache Hit Rate | False Match Risk | Best For |
|---|---|---|---|
| 0.99+ | 5-10% | Near zero | High-stakes applications (medical, legal, financial) |
| 0.97-0.99 | 10-20% | Very low | General-purpose Q&A, documentation search |
| 0.95-0.97 | 20-35% | Low | Customer support, FAQ bots, product Q&A |
| 0.92-0.95 | 30-45% | Moderate | Casual chatbots, brainstorming, non-critical content |
| 0.90-0.92 | 40-55% | High | Not recommended for production |
| Below 0.90 | 50%+ | Very high | Not recommended — too many false matches |
The recommended starting point is 0.95. At this threshold, queries must be very semantically similar to match — different wording of the same question, not just vaguely related topics. "What is my current spend?" matches "How much have I spent?" (cosine ~0.97) but does not match "How can I reduce my spend?" (cosine ~0.88).
Tune the threshold based on your application's tolerance for incorrect cached answers. Monitor false match rates by sampling cache hits and verifying the cached answer is appropriate for the new query. CostHawk tracks cache hit rates at each threshold level so you can A/B test different thresholds and measure the cost savings vs. quality tradeoff empirically.
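The sampling approach can be made concrete: label a sample of (new query, cached query) pairs as same-intent or not, record their cosine similarities, and measure hit rate and false-match rate at each candidate threshold. A minimal sketch, with hypothetical labeled data:

```python
def evaluate_threshold(samples: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """samples: (cosine_similarity, truly_same_intent) pairs from sampled query pairs.
    Returns (cache_hit_rate, false_match_rate) at the given threshold."""
    hits = [(sim, same) for sim, same in samples if sim >= threshold]
    hit_rate = len(hits) / len(samples)
    false_matches = sum(1 for _, same in hits if not same)
    false_rate = false_matches / len(hits) if hits else 0.0
    return hit_rate, false_rate

# Hypothetical labeled sample: (similarity, same intent?)
samples = [(0.99, True), (0.97, True), (0.96, True), (0.96, False),
           (0.93, False), (0.91, False), (0.88, False), (0.85, False)]

# Sweep candidate thresholds to see the hit-rate vs. false-match tradeoff
sweep = {t: evaluate_threshold(samples, t) for t in (0.99, 0.95, 0.90)}
```

Lowering the threshold raises both numbers, which is exactly the tradeoff in the table above.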
Cost-Benefit Analysis
Semantic caching adds costs (embedding calls, vector database infrastructure) while saving costs (eliminated LLM calls). Here is how to calculate the ROI:
Costs of semantic caching:
- Embedding per query: Every incoming query must be embedded, even cache hits. Using text-embedding-3-small at $0.02/MTok with an average query of 50 tokens: $0.000001 per query. Negligible.
- Vector database: Redis with vector search on a modest instance costs $30-$100/month. Pinecone's free tier handles 100K vectors. Managed pgvector on Supabase starts at $0/month (free tier) to $25/month.
- Storage: Each cached entry stores an embedding (6KB for 1,536 dimensions) plus the response text (~2KB average). 100K cached entries = ~800MB. Minimal storage cost.
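The storage figures follow directly from the vector size, assuming float32 embeddings and the ~2KB average response used above:

```python
DIMENSIONS = 1536            # text-embedding-3-small output dimensions
BYTES_PER_FLOAT = 4          # float32
AVG_RESPONSE_BYTES = 2048    # ~2KB average cached response text (assumed)

embedding_bytes = DIMENSIONS * BYTES_PER_FLOAT       # 6,144 bytes, ~6KB per vector
entry_bytes = embedding_bytes + AVG_RESPONSE_BYTES   # ~8KB per cached entry
total_mb = 100_000 * entry_bytes / 1_000_000         # ~800MB for 100K entries
```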
Savings from semantic caching:
- A cache hit eliminates both input and output token costs for the LLM call. If your average LLM call costs $0.01 (median for GPT-4o RAG queries), and you achieve a 30% cache hit rate on 100,000 queries/month, you save: 30,000 × $0.01 = $300/month.
- Total caching infrastructure cost: $30-$100/month.
- Net monthly savings: $200-$270/month.
The ROI improves dramatically with scale and with more expensive models. For an application processing 1 million queries per month on Claude Sonnet (average cost per query, or CPQ, of $0.02) with a 25% cache hit rate: savings = 250,000 × $0.02 = $5,000/month. Infrastructure cost remains $50-$100/month. That is a 50-100x return on the caching investment.
The break-even point for semantic caching is typically 1,000-5,000 queries per month, depending on your average CPQ and infrastructure costs. Below that volume, the infrastructure overhead may not justify the savings.
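The arithmetic above generalizes to a quick calculator. The infrastructure figures passed in below are illustrative midpoints of the ranges in this section; embedding cost (~$0.000001/query) is ignored as negligible:

```python
def monthly_roi(queries: int, cost_per_query: float,
                hit_rate: float, infra_cost: float) -> dict:
    """Gross and net monthly savings from semantic caching."""
    gross = queries * hit_rate * cost_per_query   # eliminated LLM calls
    return {
        "gross": gross,
        "net": gross - infra_cost,
        "roi_multiple": gross / infra_cost,
    }

# The two scenarios from this section (infra costs are assumed midpoints)
small = monthly_roi(100_000, 0.01, 0.30, 65.0)
large = monthly_roi(1_000_000, 0.02, 0.25, 75.0)
```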
When NOT to Use Semantic Caching
Semantic caching is not appropriate for all workloads. Avoid it in these scenarios:
- Personalized responses: If the response depends on the specific user's data ("What is MY spend this month?"), a cached answer from a different user is incorrect. You can work around this by including user-specific context in the cache key, but this dramatically reduces the cache hit rate since each user has a unique cache space.
- Time-sensitive data: Queries about current data ("What is the price of Bitcoin right now?") should not serve cached answers because the answer changes frequently. You can mitigate this with short TTLs (5-15 minutes), but the hit rate drops accordingly.
- Creative and generative tasks: Users asking for creative writing, brainstorming, or varied content expect different responses each time. Serving a cached response defeats the purpose.
- Multi-turn conversations: Each turn in a conversation depends on the full history. Caching individual turns risks serving contextually inappropriate responses. Semantic caching works best for stateless, single-turn queries.
- Low-volume applications: If you process fewer than 1,000 queries per month, the infrastructure overhead of semantic caching likely exceeds the savings. Use prompt caching instead, which has zero infrastructure cost.
- High-stakes decisions: Medical diagnosis, legal advice, and financial recommendations should always be generated fresh. The risk of a false cache match serving incorrect advice outweighs the cost savings.
A good rule of thumb: semantic caching works best for stateless, factual, high-volume queries where the same question asked by different users should receive the same answer.
Semantic vs. Prompt Caching Compared
Semantic caching and prompt caching are complementary optimizations that work at different levels. Here is a comprehensive comparison:
| Dimension | Semantic Caching | Prompt Caching |
|---|---|---|
| Layer | Application (you build and maintain it) | Provider infrastructure (managed for you) |
| What is cached | Complete LLM responses | KV-cache of input token computation |
| Match type | Semantic similarity (fuzzy) | Exact prefix match |
| Input token savings | 100% on cache hit | 50-90% on matched prefix |
| Output token savings | 100% on cache hit | None — output is always regenerated |
| Cache hit rate | 20-40% (depends on query diversity) | 60-90% (depends on prompt structure) |
| Latency on hit | 1-10ms (vector search + response retrieval) | Slightly faster than uncached (skips KV compute) |
| Infrastructure cost | $30-$100/month for vector database | $0 (provider-managed) |
| Correctness risk | Possible false matches at low thresholds | Zero — output is always fresh |
| Implementation effort | Medium (embedding pipeline + vector DB + TTL logic) | Low (prompt restructuring + optional breakpoints) |
| Best for | Repetitive stateless queries (FAQ, support, docs) | All workloads with stable system prompts |
The optimal strategy is to use both:
- Prompt caching reduces input costs on every request by 50-90%. Zero effort on OpenAI (automatic), minimal effort on Anthropic (add breakpoints).
- Semantic caching eliminates the full LLM cost for 20-40% of queries. Requires infrastructure investment but delivers the highest absolute savings for repetitive workloads.
Combined, these two strategies can reduce total LLM costs by 50-70% for applications with stable system prompts and repetitive query patterns. CostHawk tracks both cache layers independently so you can see the marginal contribution of each optimization.
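The combined effect can be estimated with a back-of-the-envelope model: semantic hits save the full query cost, and prompt caching then discounts the input share of the remaining misses. All parameter values below are illustrative assumptions, not measurements:

```python
def combined_savings(input_share: float, prompt_cache_discount: float,
                     prompt_hit_rate: float, semantic_hit_rate: float) -> float:
    """Estimated fraction of total LLM cost saved by layering both caches.
    Semantic hits (full cost saved) are counted first; prompt caching
    discounts the input-token share of the remaining misses."""
    miss_rate = 1 - semantic_hit_rate
    prompt_savings = miss_rate * input_share * prompt_cache_discount * prompt_hit_rate
    return semantic_hit_rate + prompt_savings

# Illustrative: input tokens are 80% of cost, 75% prompt-cache discount,
# 90% prefix hit rate, 30% semantic hit rate
estimate = combined_savings(0.8, 0.75, 0.9, 0.3)
```

With these assumed parameters the estimate lands near the upper end of the 50-70% range cited above.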
Frequently Asked Questions
- What embedding model should I use for semantic caching?
- What cosine similarity threshold should I start with?
- How do I handle cache invalidation with semantic caching?
- Can semantic caching work with streaming responses?
- What vector database should I use for semantic caching?
- How much latency does semantic caching add to requests?
- What cache hit rate can I realistically expect?
- How does semantic caching handle multi-language queries?
- Can I use semantic caching with RAG applications?
- What is the ROI of semantic caching compared to other optimizations?
Related Terms
Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.
Embedding
A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.
Retrieval-Augmented Generation (RAG)
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
