Glossary · Optimization · Updated 2026-03-16

Semantic Caching

An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, typically reducing API calls by 20-40% in repetitive workloads.

Definition

What is Semantic Caching?

Semantic caching is an application-layer optimization that intercepts LLM API calls and checks whether a semantically similar query has already been answered. Instead of matching on exact text (like traditional caching), semantic caching converts queries into embedding vectors and uses cosine similarity to find near-matches above a configurable threshold (typically 0.95). When a match is found, the cached response is returned instantly without making an LLM call, eliminating both input and output token costs entirely. Semantic caching can reduce total API calls by 20-40% for applications with repetitive query patterns like FAQ bots, customer support, and documentation Q&A.

Impact

Why It Matters for AI Costs

Unlike prompt caching (which only discounts input tokens), semantic caching eliminates the entire LLM call — both input and output costs drop to zero for cache hits. The only cost is a cheap embedding call ($0.02/MTok) to vectorize the incoming query for similarity matching. For applications where 20-40% of queries are paraphrases of previously asked questions, semantic caching delivers the largest absolute cost reduction of any single optimization technique. CostHawk tracks semantic cache hit rates alongside LLM costs so you can measure the exact ROI of your caching layer.

How Semantic Caching Works

Semantic caching operates in four steps for every incoming query:

  1. Embed the query: The incoming user query is converted into a vector using an embedding model (e.g., OpenAI's text-embedding-3-small at $0.02/MTok). This produces a 1,536-dimensional vector representing the semantic meaning of the query.
  2. Search the cache: The query vector is compared against all previously cached query vectors using cosine similarity. This search runs against a vector database (Redis with vector search, Pinecone, Qdrant, or pgvector) and typically completes in 1-5 milliseconds.
  3. Check the threshold: If the highest-similarity cached entry exceeds the configured threshold (typically 0.95 cosine similarity), it is considered a match. A score of 0.95+ means the queries are semantically near-identical — different wording but the same intent.
  4. Return or generate: On a cache hit, the previously generated response is returned immediately. On a cache miss, the query is forwarded to the LLM, and both the query embedding and the generated response are stored in the cache for future matches.

The net effect is that paraphrased versions of the same question — "What is my current spend?", "How much have I spent?", "Show me my spending" — all resolve to the same cached answer after the first one is generated. This eliminates redundant LLM calls without requiring exact text matching.
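The four steps above can be condensed into a minimal in-memory sketch. The `embed` and `call_llm` functions here are stand-ins (a real system would call an embedding API and an LLM), and the toy vectors are hand-picked purely to illustrate hit/miss behavior:

```python
import math

THRESHOLD = 0.95

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in embedder: hand-made vectors instead of a real embedding model.
_FAKE_VECTORS = {
    "What is my current spend?":  [0.9, 0.1, 0.4],
    "How much have I spent?":     [0.88, 0.12, 0.42],  # paraphrase: very similar
    "How can I reduce my spend?": [0.2, 0.9, 0.1],     # related topic, different intent
}

def embed(query):
    return _FAKE_VECTORS[query]                  # step 1: embed

def call_llm(query):
    return f"LLM answer for: {query}"            # stand-in for a real LLM call

_cache = []  # list of (vector, response) pairs

def cached_query(query):
    vec = embed(query)
    best = max(_cache, key=lambda e: cosine(vec, e[0]), default=None)  # step 2: search
    if best and cosine(vec, best[0]) >= THRESHOLD:                     # step 3: threshold
        return best[1], "hit"
    response = call_llm(query)                                         # step 4: generate
    _cache.append((vec, response))                                     # ...and store
    return response, "miss"
```

With these toy vectors, the paraphrase resolves to the cached answer while the related-but-different query falls below the threshold and triggers a fresh call.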

Implementing Semantic Caching

There are two main approaches to implementing semantic caching: using a purpose-built library like GPTCache, or building a custom solution with an embedding model and a vector database.

Option 1: GPTCache (Python)

from gptcache import cache, Config
from gptcache.adapter import openai
from gptcache.embedding import OpenAI as OpenAIEmbedding
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize embedding model
embedding = OpenAIEmbedding()

# Configure cache storage (SQLite + FAISS for local dev)
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=embedding.dimension)
data_manager = get_data_manager(cache_base, vector_base)

# Initialize the cache; Config controls the similarity threshold
cache.init(
    embedding_func=embedding.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    config=Config(similarity_threshold=0.95),
)
cache.set_openai_key()

# Use the cached client — identical API to openai
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is my current spend?"}],
)

Option 2: Custom Redis vector search solution (TypeScript)

import { OpenAI } from "openai";
import { createClient } from "redis";

const openai = new OpenAI();
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const SIMILARITY_THRESHOLD = 0.95;

async function semanticCachedQuery(query: string): Promise<string> {
  // Step 1: Embed the query
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const vector = embedding.data[0].embedding;

  // Step 2: Search Redis for similar cached queries.
  // Assumes an "idx:cache" index with a vector field was created via FT.CREATE;
  // KNN queries require query dialect 2.
  const results = await redis.ft.search("idx:cache", "*=>[KNN 1 @vector $vec AS score]", {
    PARAMS: { vec: Buffer.from(new Float32Array(vector).buffer) },
    SORTBY: "score",
    LIMIT: { from: 0, size: 1 },
    DIALECT: 2,
  });

  // Step 3: Check threshold
  if (results.documents.length > 0) {
    const topScore = 1 - parseFloat(results.documents[0].value.score as string);
    if (topScore >= SIMILARITY_THRESHOLD) {
      return results.documents[0].value.response as string; // Cache hit
    }
  }

  // Step 4: Cache miss — call LLM and store result
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: query }],
  });
  const response = completion.choices[0].message.content!;

  // Store in cache for future matches
  await redis.hSet(`cache:${Date.now()}`, {
    query,
    response,
    vector: Buffer.from(new Float32Array(vector).buffer),
  });

  return response;
}

Both approaches achieve the same goal: intercepting LLM calls and serving cached responses for semantically similar queries. The GPTCache approach is faster to set up; the custom approach gives you full control over storage, TTL, and eviction policies.

Threshold Tuning

The similarity threshold is the most important configuration parameter in semantic caching. It determines the tradeoff between cache hit rate (higher hits = more savings) and accuracy (lower threshold = more false matches = wrong answers served).

Cosine Similarity Threshold | Typical Cache Hit Rate | False Match Risk | Best For
0.99+                       | 5-10%                  | Near zero        | High-stakes applications (medical, legal, financial)
0.97-0.99                   | 10-20%                 | Very low         | General-purpose Q&A, documentation search
0.95-0.97                   | 20-35%                 | Low              | Customer support, FAQ bots, product Q&A
0.92-0.95                   | 30-45%                 | Moderate         | Casual chatbots, brainstorming, non-critical content
0.90-0.92                   | 40-55%                 | High             | Not recommended for production
Below 0.90                  | 50%+                   | Very high        | Not recommended (too many false matches)

The recommended starting point is 0.95. At this threshold, queries must be very semantically similar to match — different wording of the same question, not just vaguely related topics. "What is my current spend?" matches "How much have I spent?" (cosine ~0.97) but does not match "How can I reduce my spend?" (cosine ~0.88).

Tune the threshold based on your application's tolerance for incorrect cached answers. Monitor false match rates by sampling cache hits and verifying the cached answer is appropriate for the new query. CostHawk tracks cache hit rates at each threshold level so you can A/B test different thresholds and measure the cost savings vs. quality tradeoff empirically.
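The monitoring loop described above can be sketched as a small helper. The sampling mechanism, step size, and 3% tolerance are illustrative defaults, not a prescribed policy:

```python
def adjust_threshold(threshold, sampled_hits, max_false_rate=0.03,
                     step=0.01, ceiling=0.99):
    """Raise the similarity threshold when sampled cache hits show too many
    false matches.

    sampled_hits: list of booleans, True when a human reviewer judged the
    cached answer appropriate for the new query.
    """
    if not sampled_hits:
        return threshold
    false_rate = sampled_hits.count(False) / len(sampled_hits)
    if false_rate > max_false_rate:
        return min(threshold + step, ceiling)  # tighten matching
    return threshold                           # current threshold is acceptable
```

Run this weekly over a random sample of ~100 cache hits; repeated small bumps converge on the tightest threshold your traffic tolerates.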

Cost-Benefit Analysis

Semantic caching adds costs (embedding calls, vector database infrastructure) while saving costs (eliminated LLM calls). Here is how to calculate the ROI:

Costs of semantic caching:

  • Embedding per query: Every incoming query must be embedded, even cache hits. Using text-embedding-3-small at $0.02/MTok with an average query of 50 tokens: $0.000001 per query. Negligible.
  • Vector database: Redis with vector search on a modest instance costs $30-$100/month. Pinecone's free tier handles 100K vectors. Managed pgvector on Supabase starts at $0/month (free tier) to $25/month.
  • Storage: Each cached entry stores an embedding (6KB for 1,536 dimensions) plus the response text (~2KB average). 100K cached entries = ~800MB. Minimal storage cost.

Savings from semantic caching:

  • A cache hit eliminates both input and output token costs for the LLM call. If your average LLM call costs $0.01 (median for GPT-4o RAG queries), and you achieve a 30% cache hit rate on 100,000 queries/month, you save: 30,000 × $0.01 = $300/month.
  • Total caching infrastructure cost: $30-$100/month.
  • Net monthly savings: $200-$270/month.

The ROI improves dramatically with scale and with more expensive models. For an application processing 1 million queries per month on Claude Sonnet (average CPQ $0.02) with a 25% cache hit rate: savings = 250,000 × $0.02 = $5,000/month. Infrastructure cost remains $50-$100/month. That is a 50-100x return on the caching investment.

The break-even point for semantic caching is typically 1,000-5,000 queries per month, depending on your average CPQ and infrastructure costs. Below that volume, the infrastructure overhead may not justify the savings.
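The ROI arithmetic above reduces to one line. This sketch omits the per-query embedding cost, which the section establishes as negligible:

```python
def semantic_cache_roi(queries_per_month, hit_rate, avg_cost_per_query, infra_cost):
    """Net monthly savings: eliminated LLM calls minus caching infrastructure.

    Embedding cost (~$0.000001/query) is omitted as negligible.
    """
    gross_savings = queries_per_month * hit_rate * avg_cost_per_query
    return gross_savings - infra_cost
```

Plugging in the two scenarios from the text: 100K queries at a 30% hit rate and $0.01 CPQ nets about $200/month against $100 of infrastructure, while 1M queries at 25% and $0.02 CPQ nets about $4,900/month.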

When NOT to Use Semantic Caching

Semantic caching is not appropriate for all workloads. Avoid it in these scenarios:

  • Personalized responses: If the response depends on the specific user's data ("What is MY spend this month?"), a cached answer from a different user is incorrect. You can work around this by including user-specific context in the cache key, but this dramatically reduces the cache hit rate since each user has a unique cache space.
  • Time-sensitive data: Queries about current data ("What is the price of Bitcoin right now?") should not serve cached answers because the answer changes frequently. You can mitigate this with short TTLs (5-15 minutes), but the hit rate drops accordingly.
  • Creative and generative tasks: Users asking for creative writing, brainstorming, or varied content expect different responses each time. Serving a cached response defeats the purpose.
  • Multi-turn conversations: Each turn in a conversation depends on the full history. Caching individual turns risks serving contextually inappropriate responses. Semantic caching works best for stateless, single-turn queries.
  • Low-volume applications: If you process fewer than 1,000 queries per month, the infrastructure overhead of semantic caching likely exceeds the savings. Use prompt caching instead, which has zero infrastructure cost.
  • High-stakes decisions: Medical diagnosis, legal advice, and financial recommendations should always be generated fresh. The risk of a false cache match serving incorrect advice outweighs the cost savings.

A good rule of thumb: semantic caching works best for stateless, factual, high-volume queries where the same question asked by different users should receive the same answer.
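These exclusions can be encoded as a gating policy in front of the cache. The query-type taxonomy here is hypothetical; a real system would classify queries with a router or metadata, and personalized queries get a per-user namespace so answers never leak across users:

```python
def cache_policy(query_type, user_id=None):
    """Return (should_cache, namespace) for a query (hypothetical taxonomy).

    Workloads flagged above as poor fits are not cached at all;
    personalized queries are isolated in a per-user namespace.
    """
    no_cache = {"creative", "time_sensitive", "multi_turn", "high_stakes"}
    if query_type in no_cache:
        return False, None
    if query_type == "personalized":
        return True, f"cache:user:{user_id}"
    return True, "cache:shared"
```

Note the tradeoff the section calls out: per-user namespaces preserve correctness but shrink the effective hit rate, since each user warms a separate cache.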

Semantic vs. Prompt Caching Compared

Semantic caching and prompt caching are complementary optimizations that work at different levels. Here is a comprehensive comparison:

Dimension             | Semantic Caching                                   | Prompt Caching
Layer                 | Application (you build and maintain it)            | Provider infrastructure (managed for you)
What is cached        | Complete LLM responses                             | KV-cache of input token computation
Match type            | Semantic similarity (fuzzy)                        | Exact prefix match
Input token savings   | 100% on cache hit                                  | 50-90% on matched prefix
Output token savings  | 100% on cache hit                                  | None (output is always regenerated)
Cache hit rate        | 20-40% (depends on query diversity)                | 60-90% (depends on prompt structure)
Latency on hit        | 1-10ms (vector search + response retrieval)        | Slightly faster than uncached (skips KV compute)
Infrastructure cost   | $30-$100/month for vector database                 | $0 (provider-managed)
Correctness risk      | Possible false matches at low thresholds           | Zero (output is always fresh)
Implementation effort | Medium (embedding pipeline + vector DB + TTL logic)| Low (prompt restructuring + optional breakpoints)
Best for              | Repetitive stateless queries (FAQ, support, docs)  | All workloads with stable system prompts

The optimal strategy is to use both:

  1. Prompt caching reduces input costs on every request by 50-90%. Zero effort on OpenAI (automatic), minimal effort on Anthropic (add breakpoints).
  2. Semantic caching eliminates the full LLM cost for 20-40% of queries. Requires infrastructure investment but delivers the highest absolute savings for repetitive workloads.

Combined, these two strategies can reduce total LLM costs by 50-70% for applications with stable system prompts and repetitive query patterns. CostHawk tracks both cache layers independently so you can see the marginal contribution of each optimization.
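The layered savings can be estimated with a simple model. The input-token cost share (60%), prompt-cache discount (75%), and hit rates used in the example below are illustrative assumptions, not measured figures:

```python
def combined_savings(semantic_hit_rate, input_share, prompt_discount, prompt_hit_rate):
    """Fraction of total LLM cost removed by layering both cache types.

    Semantic cache hits eliminate whole calls; on the remaining misses,
    prompt caching discounts the input-token share of the cost.
    """
    residual = 1.0 - semantic_hit_rate
    return semantic_hit_rate + residual * input_share * prompt_discount * prompt_hit_rate
```

With a 30% semantic hit rate, a 60% input-cost share, a 75% prompt-cache discount, and an 80% prompt-cache hit rate, the model yields roughly 55% total savings, consistent with the 50-70% range above.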

FAQ

Frequently Asked Questions

What embedding model should I use for semantic caching?
Use the cheapest embedding model that produces high-quality similarity scores. OpenAI's text-embedding-3-small ($0.02/MTok, 1,536 dimensions) is the most popular choice — it is extremely cheap, fast, and produces excellent similarity scores for query matching. The larger text-embedding-3-large ($0.13/MTok, 3,072 dimensions) offers marginally better similarity quality but at 6.5x the cost with negligible improvement for caching use cases. Avoid using your LLM model for embeddings — completion models are orders of magnitude more expensive than embedding models. For on-premise or open-source alternatives, models like sentence-transformers/all-MiniLM-L6-v2 produce good similarity scores at zero per-token cost (only compute cost). The embedding model choice rarely matters for caching quality — the similarity threshold tuning has far more impact.
What cosine similarity threshold should I start with?
Start with 0.95. This threshold catches genuine paraphrases ("What is my current spend?" matching "How much have I spent?") while rejecting related-but-different queries ("What is my spend?" not matching "How do I reduce my spend?"). At 0.95, you can expect 20-35% cache hit rates for FAQ and customer support applications. If you need higher hit rates and can tolerate occasional mismatches, lower to 0.93. If you need maximum accuracy, raise to 0.97-0.99. Never go below 0.90 in production — the false match rate becomes unacceptable. Monitor false matches by sampling 100 random cache hits per week and verifying the cached answer is appropriate. If more than 2-3% are incorrect, raise the threshold by 0.01-0.02. CostHawk can log cache hit details for this quality assurance process.
How do I handle cache invalidation with semantic caching?
Cache invalidation is the hardest part of semantic caching. Use a multi-layered approach: (1) Time-based TTL: Set a maximum cache lifetime appropriate to your data freshness requirements. For FAQ-style content, 24-72 hours is reasonable. For data-dependent answers, 5-30 minutes. (2) Version-based invalidation: When your underlying data changes (e.g., a product update, pricing change, or documentation revision), invalidate all cached entries that reference the changed data. Tag cache entries with source document IDs to enable targeted invalidation. (3) Explicit purge: Provide an admin endpoint or CostHawk integration to manually purge specific cached entries when you know they are stale. (4) Staleness scoring: Track the age of each cached entry and deprioritize older entries in similarity matching, preferring fresher answers when multiple entries match.
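The first two layers (TTL and version-based invalidation) can be sketched with a tagged cache entry. The in-memory dict stands in for whatever store you use; the `now` parameter exists so expiry can be tested deterministically:

```python
import time

class CacheEntry:
    """A cached answer tagged with its source documents and an expiry time."""
    def __init__(self, response, source_doc_ids, ttl_seconds, now=None):
        self.response = response
        self.source_doc_ids = set(source_doc_ids)
        self.expires_at = (now if now is not None else time.time()) + ttl_seconds

entries = {}  # cache_key -> CacheEntry

def is_fresh(entry, now=None):
    # Layer 1: time-based TTL
    return (now if now is not None else time.time()) < entry.expires_at

def invalidate_by_doc(doc_id):
    # Layer 2: purge every answer derived from a changed source document
    stale = [key for key, e in entries.items() if doc_id in e.source_doc_ids]
    for key in stale:
        del entries[key]
    return len(stale)
```

Tagging entries with source document IDs at write time is what makes layer 2 cheap: a documentation revision becomes one targeted purge instead of a full cache flush.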
Can semantic caching work with streaming responses?
Yes, but it requires buffering. When a cache hit occurs, you can either (1) return the complete cached response as a single non-streamed response, or (2) simulate streaming by chunking the cached response and sending it in small pieces with artificial delays to mimic the LLM streaming experience. Option 1 is simpler and provides the fastest user experience — the response appears instantly. Option 2 maintains a consistent UX between cached and non-cached responses, which matters if your frontend is designed for streaming. For cache misses, stream the LLM response normally while simultaneously buffering it for cache storage. Once the full response is received, store it in the cache along with the query embedding. Most implementations use option 1, since users prefer faster responses and an instant answer reads as an upgrade rather than an inconsistency.
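Option 2 (simulated streaming) amounts to a small generator over the cached text. The chunk size is arbitrary; a real implementation would also insert short delays between chunks:

```python
def chunk_cached_response(response, chunk_size=8):
    """Yield a cached response in small pieces to mimic LLM streaming (option 2)."""
    for i in range(0, len(response), chunk_size):
        yield response[i:i + chunk_size]
```

The frontend consumes these chunks through the same code path as genuine LLM stream events, keeping the UX uniform across hits and misses.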
What vector database should I use for semantic caching?
For most applications, Redis with the vector search module (RediSearch) is the best choice. It provides sub-millisecond vector search, runs in memory for maximum speed, and is available as a managed service from Redis Cloud ($0-$30/month for caching workloads). If you already use PostgreSQL/Supabase, pgvector is an excellent zero-cost option that adds vector search to your existing database. For larger-scale deployments (millions of cached entries), dedicated vector databases like Pinecone, Qdrant, or Weaviate offer better scaling characteristics. For local development and testing, FAISS (an in-memory vector library from Meta) requires no infrastructure. Choose based on your existing stack: Redis if you already use Redis, pgvector if you use Postgres, Pinecone if you need managed scaling, FAISS if you are prototyping.
How much latency does semantic caching add to requests?
Semantic caching adds two latency components: (1) embedding the query (15-50ms via OpenAI API, or 1-5ms with a local embedding model), and (2) vector similarity search (1-5ms for Redis/FAISS, 5-20ms for cloud-hosted vector DBs). Total added latency: 15-70ms on cache misses, 15-55ms on cache hits. Compare this to the LLM call latency that cache hits eliminate: 500-3,000ms for GPT-4o, 1,000-5,000ms for Claude Sonnet. A cache hit saves 500-5,000ms of LLM latency while adding only 15-55ms of cache lookup latency — a 10-100x net improvement in response time. On cache misses, the 15-70ms overhead is negligible compared to the LLM call that follows. To minimize latency, use a local embedding model (eliminates the API round-trip) and colocate your vector database with your application server.
What cache hit rate can I realistically expect?
Cache hit rates vary dramatically by application type. FAQ bots and documentation Q&A see 30-50% hit rates because users repeatedly ask similar questions about the same topics. Customer support chatbots achieve 20-35% because many support tickets involve the same issues. General-purpose chatbots see 10-20% because conversations are more diverse. Code generation achieves 5-15% because queries are highly specific. Creative writing achieves near 0% because every request is unique. These rates assume a 0.95 similarity threshold and a warm cache with at least 1,000 entries. Hit rates improve over time as the cache grows and covers more query variations. CostHawk tracks your cache hit rate trend and projects the steady-state hit rate based on your query diversity curve.
How does semantic caching handle multi-language queries?
Modern embedding models like text-embedding-3-small are multilingual — they produce similar vectors for semantically equivalent queries regardless of language. "What is my current spend?" in English and its French equivalent will have high cosine similarity (typically 0.90-0.95). However, cross-language matching may fall slightly below a 0.95 threshold, causing cache misses for the same question in different languages. You have two options: (1) lower the threshold to 0.92-0.93 to catch cross-language matches (increases false match risk), or (2) pre-translate all queries to a canonical language before embedding (adds translation latency and cost). Option 2 is more reliable but adds 100-200ms of latency. For most applications, serving separate caches per language is the simplest approach — it avoids cross-language accuracy issues and lets you tune TTLs and thresholds per market.
Can I use semantic caching with RAG applications?
Yes, but with caveats. In a RAG application, the response depends on both the user query and the retrieved context. Two identical queries might receive different answers if the underlying knowledge base has changed. For semantic caching in RAG, you should include the retrieved context hash in the cache key — this ensures that cached answers are only served when both the query is semantically similar AND the retrieved context is identical. This reduces the cache hit rate compared to pure query-based caching, but maintains answer accuracy. An alternative approach is to cache at the retrieval layer rather than the generation layer: cache the query-to-chunks mapping so you skip the embedding + vector search step for repeated queries, but always regenerate the answer from the (potentially updated) chunks. This preserves freshness while saving retrieval costs.
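The context-hash cache key described above can be sketched as follows. The `query_bucket` argument is a hypothetical identifier for the query's semantic match group; the hash covers the retrieved chunks so a knowledge-base change automatically produces a different key:

```python
import hashlib

def rag_cache_key(query_bucket, retrieved_chunks):
    """Combine the query's semantic bucket with a hash of the retrieved
    context, so a cached answer is reused only when both match."""
    context_hash = hashlib.sha256("\n".join(retrieved_chunks).encode()).hexdigest()[:16]
    return f"{query_bucket}:{context_hash}"
```

Editing a single retrieved chunk changes the key, which is exactly the freshness guarantee RAG caching needs.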
What is the ROI of semantic caching compared to other optimizations?
Semantic caching delivers the highest per-hit savings of any optimization (100% cost elimination for cache hits), but its total impact depends on your cache hit rate. At a 30% hit rate, semantic caching reduces total LLM costs by 30%. For comparison: prompt caching reduces input costs by 50-90% (effective total cost reduction of 20-45% depending on input/output split), model routing reduces costs by 50-90% on routed queries (effective total reduction of 30-60%), and prompt compression reduces input costs by 20-40% (effective total reduction of 10-20%). Semantic caching wins when your application has high query repetition. For unique queries (code generation, creative writing), it provides minimal benefit. The best strategy is to layer optimizations: prompt caching (always), model routing (for mixed workloads), semantic caching (for repetitive queries). CostHawk's optimization recommendations are ranked by projected ROI based on your actual workload patterns.

Related Terms

Prompt Caching

A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.

Read more

Embedding

A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.

Read more

Retrieval-Augmented Generation (RAG)

An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.

Read more

Cost Per Query

The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.

Read more

Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Read more

Token Pricing

The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.

Read more

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.