Glossary › Optimization · Updated 2026-03-16

Retrieval-Augmented Generation (RAG)

An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.

Definition

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a system architecture that augments a large language model's capabilities by retrieving relevant information from an external knowledge base at query time. A RAG pipeline has three stages: (1) the user's query is converted into a vector embedding, (2) a vector database performs a similarity search to find the most relevant document chunks, and (3) those chunks are injected into the LLM prompt alongside the user's question. The LLM then generates a response grounded in the retrieved context. RAG was introduced by Lewis et al. (2020) at Meta AI and has become the dominant pattern for building knowledge-intensive AI applications. From a cost perspective, RAG replaces the need to fine-tune models or stuff entire knowledge bases into prompts, instead retrieving only the 3–10 most relevant chunks (typically 500–3,000 total tokens) per query — dramatically reducing input token costs compared to brute-force context stuffing.
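The three stages above can be sketched in a few lines of Python. This is a toy, offline illustration: the `embed` function below is a stand-in word-count "embedding" and the in-memory list stands in for a vector database — a real pipeline would call an embedding model (such as text-embedding-3-small) and an ANN index instead.

```python
import math

def embed(text):
    # Stand-in for a real embedding API call: simple word counts so the
    # example runs offline. A real pipeline calls an embedding model here.
    vec = {}
    for word in text.lower().split():
        w = word.strip(".,?!")
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=3):
    # Stages 1 and 2: embed the query, then rank chunks by similarity.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(query, retrieved):
    # Stage 3: inject the retrieved chunks into the LLM prompt.
    context = "\n\n".join(retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3-5 business days.",
]
top = retrieve("How long do refunds take?", chunks, top_k=1)
prompt = build_prompt("How long do refunds take?", top)
```

The prompt that reaches the LLM contains only the retrieved chunk, not the whole corpus — which is exactly where the token savings discussed below come from.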

Impact

Why It Matters for AI Costs

RAG is one of the most powerful cost optimization patterns in production AI systems. Without RAG, teams face a brutal tradeoff: either fine-tune the model (expensive, inflexible) or stuff relevant documents into the prompt (token-expensive at scale). RAG provides a middle path that is both cheaper and more flexible.

The cost math: Consider a customer support bot that needs access to 500 pages of documentation (approximately 250,000 tokens). Without RAG, you might try to fit as much context as possible into each request — perhaps 50,000 tokens of "most relevant" documentation. With RAG, a vector search retrieves the 3–5 most relevant chunks, totaling 1,500 tokens.

  • Without RAG: 50,000 input tokens × $2.50/MTok = $0.125 per query
  • With RAG: 1,500 input tokens + 500-token query = 2,000 tokens × $2.50/MTok = $0.005 per query

That is a 25x cost reduction per query. At 10,000 queries per day, RAG saves $1,200/day or $36,000/month in input token costs alone.
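The arithmetic above is easy to reproduce for your own token counts and rates; here it is as a minimal Python sketch using the same GPT-4o input price:

```python
GPT4O_INPUT_PER_MTOK = 2.50  # $ per million input tokens, as in the example above

def input_cost(tokens, rate=GPT4O_INPUT_PER_MTOK):
    # Cost in dollars for a given number of input tokens.
    return tokens / 1_000_000 * rate

without_rag = input_cost(50_000)     # $0.125 per query
with_rag = input_cost(1_500 + 500)   # $0.005 per query (chunks + query)
daily_savings = (without_rag - with_rag) * 10_000  # $1,200/day at 10k queries
```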

The cost of the RAG infrastructure itself is modest: embedding the query costs ~$0.000002 (negligible), and vector database hosting runs $20–$200/month depending on the provider and data volume. The LLM token savings dwarf the infrastructure costs by orders of magnitude.

CostHawk tracks the full cost of RAG pipelines — embedding costs, generation costs, and the per-query token breakdown — so you can quantify exactly how much your RAG implementation saves compared to alternative approaches.

The RAG Architecture

A production RAG system consists of several interconnected components, each with its own cost profile:

1. Document Ingestion Pipeline:

  • Documents are loaded from sources (databases, file storage, APIs, web crawlers)
  • A chunking algorithm splits documents into smaller pieces (typically 200–1,000 tokens each)
  • Each chunk is processed through an embedding model to produce a vector representation
  • Vectors are stored in a vector database with metadata (source, timestamp, section, tags)
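A minimal version of the chunking step looks like this. For simplicity the sketch counts words rather than model tokens, and the overlap between consecutive chunks helps preserve context across chunk boundaries; a production pipeline would count tokens with the embedding model's tokenizer instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    # Fixed-size chunking with overlap, counted in words for simplicity.
    # Production pipelines typically count model tokens via a tokenizer.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks
```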

2. Query Pipeline (runs per request):

  • The user's query is embedded using the same embedding model (cost: ~$0.000002 per query)
  • The vector database performs approximate nearest-neighbor (ANN) search to find the top-K most similar chunks
  • Retrieved chunks are formatted and injected into the LLM prompt as context
  • The LLM generates a response grounded in the retrieved context

3. Feedback Loop (optional but recommended):

  • User feedback on answer quality is logged
  • Low-quality answers trigger re-ranking or chunk refinement
  • Analytics identify which documents are most frequently retrieved and which queries produce poor results

The cost breakdown for a typical RAG query:

| Component | Cost per Query | % of Total |
|---|---|---|
| Query embedding | $0.000002 | <0.1% |
| Vector DB search | $0.00001–$0.0001 | <1% |
| LLM generation (with retrieved context) | $0.003–$0.02 | 99%+ |

The LLM call dominates cost, which means optimizing the number and size of retrieved chunks is the primary lever for RAG cost optimization.

Three Cost Layers of RAG

Every RAG system has three distinct cost layers, and understanding each is essential for budgeting and optimization:

Layer 1: Embedding Costs

Embedding models convert text into vectors. This happens at two points: during document ingestion (one-time per document) and during query time (once per query). Current embedding pricing:

| Model | Price per 1M Tokens | Dimensions |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 1,536 |
| OpenAI text-embedding-3-large | $0.13 | 3,072 |
| Voyage AI voyage-3 | $0.06 | 1,024 |
| Cohere embed-v3 | $0.10 | 1,024 |

Embedding a 100,000-page knowledge base (roughly 50 million tokens) costs $1.00–$6.50 as a one-time expense. Query embedding is negligible — even at 100,000 queries/day with 50-token average queries, the daily embedding cost is $0.10.

Layer 2: Vector Database Costs

Vector databases store and search embeddings. Options range from free (open-source Chroma running locally) to managed services:

  • Pinecone: $70/month for the starter tier (1M vectors), scaling with vector count and queries
  • Weaviate Cloud: $25/month for basic tier
  • Qdrant Cloud: Free tier available, paid plans from $25/month
  • Supabase pgvector: Included with your Postgres plan, no additional cost

For most applications, vector database costs are $25–$200/month — a small fraction of LLM costs.

Layer 3: Generation Costs

This is the LLM call with the retrieved context injected. It is by far the largest cost component. The key variable is how many tokens of context you retrieve: retrieving 5 chunks of 300 tokens each adds 1,500 input tokens per query. On GPT-4o, that is $0.00375 per query. The generation cost is directly proportional to the amount of retrieved context, which is why chunk size and top-K settings are critical cost levers.

RAG vs Fine-Tuning: Cost Comparison

RAG and fine-tuning are the two primary approaches for giving an LLM domain-specific knowledge. They have fundamentally different cost profiles:

| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Upfront cost | Low: embedding docs ($1–$50), vector DB setup | High: training run ($10–$1,000+), dataset curation |
| Per-query cost | Higher input tokens (retrieved context adds 500–3,000 tokens) | Lower input tokens (no context needed), but higher per-token rate (2–6x base) |
| Knowledge updates | Instant: add/update documents, re-embed | Slow: retrain model, redeploy (hours to days) |
| Knowledge scope | Unlimited: any document can be added | Limited: model absorbs patterns, not raw facts |
| Accuracy on factual Q&A | Higher: answers grounded in retrieved text | Lower: model may hallucinate or blend training data |
| Latency | Higher: embedding + vector search adds 50–200ms | Lower: single model call, no retrieval step |
| Maintenance | Moderate: keep documents indexed, monitor retrieval quality | High: retrain on schedule, manage model versions |

Cost comparison at 50,000 queries/day on GPT-4o:

RAG approach: 2,000 input tokens (query + retrieved context) + 400 output tokens per query

  • Input: 100M tokens/day × $2.50/MTok = $250/day
  • Output: 20M tokens/day × $10.00/MTok = $200/day
  • Vector DB: ~$3/day (amortized monthly cost)
  • Total: $453/day ($13,590/month)

Fine-tuned approach: 500 input tokens (no context needed) + 400 output tokens, but at 3x base rate

  • Input: 25M tokens/day × $7.50/MTok = $187.50/day
  • Output: 20M tokens/day × $30.00/MTok = $600/day
  • Total: $787.50/day ($23,625/month)

In this scenario, RAG is 42% cheaper than fine-tuning, despite the additional input tokens. The higher per-token rate for fine-tuned inference outweighs the input token savings. This is the typical pattern for knowledge-intensive applications.
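The comparison above generalizes to a small cost function you can re-run with your own volumes and rates. The fine-tuned rates here follow the 3x multiplier assumed in the scenario:

```python
def daily_cost(queries, in_tokens, out_tokens, in_rate, out_rate, fixed=0.0):
    # Rates in $ per million tokens; `fixed` covers infrastructure such as
    # the amortized daily vector database bill.
    return (queries * in_tokens / 1e6 * in_rate
            + queries * out_tokens / 1e6 * out_rate
            + fixed)

rag_daily = daily_cost(50_000, 2_000, 400, in_rate=2.50, out_rate=10.00, fixed=3.0)
ft_daily = daily_cost(50_000, 500, 400, in_rate=7.50, out_rate=30.00)
# rag_daily = 453.0 and ft_daily = 787.5, matching the figures above
```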

Fine-tuning wins when: (1) you need to change the model's behavior or tone rather than its knowledge, (2) latency requirements are extremely tight, or (3) your prompts are very short and the fine-tuned per-token rate premium is small relative to the input token savings.

Chunking Strategies and Cost Impact

How you split documents into chunks directly affects both retrieval quality and cost. The chunk size determines how many tokens of context are retrieved per query, which is the primary driver of RAG's LLM generation cost.

Small chunks (100–300 tokens):

  • Pros: More precise retrieval (each chunk is narrowly focused), lower per-query token cost (retrieving 5 small chunks = 500–1,500 tokens)
  • Cons: May miss surrounding context, requires more chunks (higher top-K) for comprehensive answers, more vectors to store
  • Best for: FAQ databases, factual lookups, structured content

Medium chunks (300–800 tokens):

  • Pros: Good balance of precision and context, reasonable per-query cost
  • Cons: May include some irrelevant content alongside the relevant passage
  • Best for: General documentation, knowledge bases, support articles

Large chunks (800–2,000 tokens):

  • Pros: Rich context per chunk, fewer chunks needed (lower top-K), better for nuanced questions
  • Cons: Higher per-query token cost (retrieving 3 large chunks = 2,400–6,000 tokens), more irrelevant content per chunk
  • Best for: Legal documents, research papers, long-form technical content

Cost comparison at 10,000 queries/day on GPT-4o:

| Strategy | Tokens Retrieved/Query | Daily Input Token Cost | Monthly Cost |
|---|---|---|---|
| Small (5 × 200 tokens) | 1,000 | $25 | $750 |
| Medium (5 × 500 tokens) | 2,500 | $62.50 | $1,875 |
| Large (3 × 1,500 tokens) | 4,500 | $112.50 | $3,375 |

The difference between small and large chunks is a 4.5x cost difference in retrieved context tokens. This is not trivial at scale.

Advanced strategies:

  • Hierarchical chunking: Store both small chunks (for retrieval precision) and their parent sections (for context). Retrieve the small chunk, then expand to the parent section only when needed.
  • Semantic chunking: Split on topic boundaries rather than fixed token counts. This produces more coherent chunks that require fewer retrievals per query.
  • Adaptive top-K: Retrieve fewer chunks for simple queries and more for complex ones, reducing average token consumption.
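Adaptive top-K can be as simple as a heuristic on query complexity. This is a toy sketch — the word-count and clause thresholds are illustrative assumptions, and a production system might instead use a small, cheap classifier model to pick K:

```python
def adaptive_top_k(query, base_k=3, max_k=8):
    # Toy heuristic: longer, multi-clause questions retrieve more chunks.
    # Thresholds here are illustrative, not tuned values.
    words = len(query.split())
    clauses = query.count(",") + query.count(" and ")
    return min(base_k + words // 10 + clauses, max_k)
```

Simple lookups stay at the cheap base K; only genuinely complex queries pay for extra context tokens.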

Optimizing RAG Costs

Beyond chunk size and top-K tuning, there are several advanced strategies for reducing RAG costs:

1. Semantic caching: Cache the LLM response for semantically similar queries. If a user asks "What is your return policy?" and another asks "How do I return an item?", these are semantically equivalent. By caching the first response and serving it for the second query, you eliminate the LLM call entirely — a 100% cost reduction for cached queries. Tools like GPTCache and Redis with vector search enable semantic caching. Cache hit rates of 15–40% are common for customer-facing applications, translating to proportional cost savings.
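A semantic cache can be sketched in a few lines. This toy version uses word-count "embeddings" and a linear scan so it runs offline; a real implementation (e.g. GPTCache or Redis with vector search) would use a proper embedding model and an indexed similarity search, and the 0.6 threshold here is an illustrative assumption you would tune:

```python
import math

def embed(text):
    # Stand-in embedding: word counts. Real caches use an embedding model.
    vec = {}
    for word in text.lower().split():
        w = word.strip(".,?!")
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_response)

    def get(self, query):
        # Return a cached response if any stored query is similar enough.
        q = embed(query)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response  # cache hit: the LLM call is skipped entirely
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))
```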

2. Re-ranking before generation: Instead of sending all top-K retrieved chunks to the LLM, use a lightweight re-ranker model (like Cohere Rerank at $1.00 per 1,000 searches) to score and filter chunks. Send only the top 2–3 most relevant chunks to the LLM. This reduces input tokens by 40–60% while often improving answer quality by removing marginally relevant context that dilutes the signal.

3. Query routing: Not every query needs RAG. Simple greetings, meta-questions ("What can you help me with?"), and questions within the model's training knowledge can be answered directly without retrieval. A lightweight classifier that routes 20–30% of queries away from the RAG pipeline eliminates the retrieval and context cost for those queries.
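The router itself can start as a cheap rule-based gate before graduating to a classifier model. The word lists and prefixes below are illustrative assumptions, not a production ruleset:

```python
SMALLTALK = {"hi", "hello", "hey", "thanks", "thank", "you", "bye"}
META_PREFIXES = ("what can you", "who are you", "how do you work")

def needs_retrieval(query):
    # Toy router: greetings and meta-questions skip the RAG pipeline,
    # saving the embedding, search, and context tokens for those queries.
    q = query.lower().strip()
    if any(q.startswith(p) for p in META_PREFIXES):
        return False
    words = {w.strip("!?.,") for w in q.split()}
    if words and words <= SMALLTALK:
        return False
    return True
```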

4. Hybrid search: Combine vector similarity search with keyword search (BM25). Hybrid search often retrieves more relevant chunks than vector-only search, which means you can use a smaller top-K (retrieving 3 chunks instead of 5) and still get good results. Fewer chunks = fewer input tokens = lower cost.
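A common way to merge the two result lists is reciprocal rank fusion (RRF), which scores each chunk by its rank in every list using the standard constant k=60. A minimal sketch, with illustrative chunk IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of chunk IDs, e.g. one from vector search
    # and one from BM25. Each appearance contributes 1/(k + rank).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]   # ranked by embedding similarity
keyword_hits = ["c1", "c9", "c3"]  # ranked by BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that rank well in both lists float to the top, which is what lets you trim top-K without losing the genuinely relevant results.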

5. Embedding model selection: Cheaper embedding models are often sufficient for retrieval. OpenAI's text-embedding-3-small at $0.02/MTok performs within 2–3% of text-embedding-3-large at $0.13/MTok on most benchmarks. For a pipeline that embeds 10 million tokens per month, this saves $1,100/year.

6. Prompt caching for RAG: If your system prompt and instructions are the same across RAG queries (only the retrieved chunks change), Anthropic's prompt caching gives a 90% discount on the cached portion. This can reduce the system prompt overhead from thousands of tokens at full price to a fraction of the cost.
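With Anthropic's Messages API, this means marking the stable system prompt as cacheable via the documented `cache_control` field while the retrieved chunks ride in the per-query user message. The sketch below only builds the request body (no live API call), and the model name is illustrative:

```python
def build_cached_request(system_prompt, retrieved_chunks, user_question):
    # The stable system prompt is marked cacheable; the retrieved chunks
    # and question vary per query, so they go in the user message.
    context = "\n\n".join(retrieved_chunks)
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}",
            }
        ],
    }
```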

CostHawk's per-request analytics help you measure the impact of each optimization by tracking input tokens, output tokens, and cost broken down by whether RAG was used, how many chunks were retrieved, and which model generated the response.

RAG and CostHawk

CostHawk provides purpose-built monitoring for RAG pipelines that goes beyond basic token counting:

Embedding cost tracking: CostHawk tracks embedding API calls separately from generation calls, showing you the total cost of your embedding pipeline including both document ingestion (batch embedding) and query-time embedding. While embedding costs are typically small, they are not zero — especially for teams that re-index frequently or have large document corpora.

Retrieval token analysis: For each generation request, CostHawk can break down input tokens into categories: system prompt, user query, and retrieved context. This reveals the true cost of retrieval — how many tokens your RAG pipeline is adding to each request, and whether that context is worth the cost. If retrieved context accounts for 80% of your input tokens, optimizing chunk size and top-K has a higher ROI than optimizing your system prompt.

Cache hit rate monitoring: If you implement semantic caching, CostHawk tracks cache hits versus cache misses, showing you the cost savings from caching and the opportunity cost of cache misses. This data helps you tune cache TTL, similarity thresholds, and cache size for optimal cost savings.

Cost-per-query by pipeline stage: CostHawk aggregates costs across the entire RAG pipeline — embedding, retrieval infrastructure, and generation — giving you a true cost-per-query metric. This is essential for pricing your product, forecasting costs, and comparing the ROI of RAG versus alternative approaches like fine-tuning or context stuffing.

A/B testing support: When testing different chunking strategies, embedding models, or top-K settings, CostHawk's tagging system lets you compare cost and quality metrics side-by-side. Tag each variant, run them in parallel, and see which configuration delivers the best cost-quality tradeoff.

Anomaly detection for RAG: CostHawk detects anomalies specific to RAG pipelines, such as sudden increases in retrieved context size (suggesting a retrieval bug), spikes in embedding API calls (suggesting a re-indexing job running more frequently than expected), or drops in cache hit rate (suggesting a cache configuration issue). These RAG-specific alerts help you catch cost inefficiencies before they accumulate.

FAQ

Frequently Asked Questions

How much does it cost to build a RAG pipeline?
The cost of building a RAG pipeline varies widely based on your document corpus size, query volume, and technology choices. A minimal setup — embedding 10,000 documents (~5M tokens) with OpenAI text-embedding-3-small, storing them in Supabase pgvector (included with your Postgres plan), and querying with GPT-4o mini — costs approximately $0.10 for initial embedding, $0/month for vector storage (bundled), and $0.001–$0.005 per query for generation. At 1,000 queries/day, expect $30–$150/month in total RAG costs. A production-grade setup with Pinecone ($70+/month), a re-ranker ($50+/month), and GPT-4o for generation at 10,000 queries/day runs $500–$2,000/month. The generation LLM cost is always the dominant expense — typically 90%+ of total RAG costs. Focus your optimization efforts there.
Does RAG reduce hallucinations?
RAG significantly reduces hallucinations for questions that can be answered from the retrieved documents. By grounding the model's response in specific text passages, RAG gives the LLM factual source material to draw from rather than relying on potentially outdated or incorrect parametric knowledge. Studies show RAG reduces hallucination rates by 40–70% compared to raw LLM generation for factual Q&A tasks. However, RAG does not eliminate hallucinations entirely. The model can still misinterpret retrieved context, combine information from multiple chunks incorrectly, or hallucinate when the retrieved context is insufficient to answer the question. Implementing citation tracking (having the model reference which chunk its answer came from) improves both accuracy and user trust, while adding minimal token overhead (typically 50–100 extra output tokens).
How many documents can a RAG system handle?
There is no theoretical limit on the number of documents a RAG system can handle — it depends on your vector database capacity and budget. Pinecone's enterprise tier supports billions of vectors. Open-source solutions like pgvector can handle millions of vectors on a single Postgres instance. The cost scales primarily with storage: 1 million 1,536-dimension vectors require approximately 6 GB of storage. Embedding cost scales linearly with document volume — embedding 1 million documents (~500M tokens) with text-embedding-3-small costs approximately $10. The real constraint is retrieval quality: as your corpus grows, the ratio of relevant to irrelevant content decreases, making precise retrieval harder. At very large scales (10M+ documents), invest in metadata filtering, hierarchical indexing, and hybrid search to maintain retrieval precision without increasing top-K (and therefore cost).
What is the best embedding model for RAG?
The best embedding model depends on your accuracy requirements and budget. For most applications, OpenAI's text-embedding-3-small ($0.02/MTok, 1,536 dimensions) offers the best cost-performance ratio — it scores within 2–3% of the larger model on standard benchmarks while costing 6.5x less. For applications requiring maximum retrieval accuracy (legal, medical, financial), text-embedding-3-large ($0.13/MTok, 3,072 dimensions) or Cohere embed-v3 ($0.10/MTok) provide measurably better results. For teams that want zero marginal embedding cost, open-source models like BGE-large and E5-mistral can be self-hosted on a single GPU. The embedding model affects retrieval quality, which in turn affects how many chunks you need to retrieve (and therefore your LLM generation cost). A better embedding model that lets you retrieve 3 chunks instead of 5 may save more on LLM costs than it costs in embedding fees.
How does RAG compare to using a larger context window?
Using a larger context window (stuffing more documents into the prompt) is the brute-force alternative to RAG. It is simpler to implement — no vector database, no embedding pipeline, no retrieval logic — but dramatically more expensive. Sending 50,000 tokens of documentation with every query on GPT-4o costs $0.125 per query in input tokens alone. RAG retrieval of 2,000 tokens of relevant context costs $0.005 per query — a 25x saving. At 10,000 queries/day, that is $1,200/day vs $50/day. Additionally, research shows that models perform worse with very long contexts due to the 'lost in the middle' effect — information in the middle of a long prompt is less likely to be used than information at the beginning or end. RAG avoids this by providing only the most relevant context. The only scenario where context stuffing beats RAG is when your entire knowledge base fits in a small context window (under 5,000 tokens) and query volume is low.
What vector database should I use for RAG?
The choice depends on your scale, budget, and existing infrastructure. For teams already using Supabase or Postgres, pgvector is the best starting point — zero additional cost, no new infrastructure, and adequate performance for up to 1–5 million vectors. For teams needing higher performance at scale, Pinecone ($70+/month) offers managed infrastructure with excellent query latency and built-in filtering. Weaviate and Qdrant offer strong open-source options with optional managed cloud services. For cost-conscious teams with high vector counts, self-hosted Milvus or Qdrant on commodity hardware provides the lowest cost-per-query at scale. The vector database is typically 1–5% of your total RAG cost, so optimize your LLM generation costs first before worrying about vector database pricing.
How do I measure RAG quality and cost together?
Measuring RAG effectiveness requires tracking both quality metrics and cost metrics for every query. On the quality side, track: (1) retrieval relevance (are the retrieved chunks actually relevant to the query?), (2) answer accuracy (is the generated response correct?), (3) answer completeness (does it fully address the question?), and (4) hallucination rate (does it include information not supported by the retrieved context?). On the cost side, track: (1) input tokens per query (including retrieved context), (2) output tokens per query, (3) embedding costs, and (4) total cost per query. CostHawk's per-request analytics provide the cost side automatically. For quality, implement a lightweight evaluation pipeline using a cheaper model (like GPT-4o mini) to score relevance and accuracy, or collect human feedback on a sample of responses. The goal is to find the configuration (chunk size, top-K, embedding model, generation model) that maximizes quality per dollar spent.
Can I use RAG with open-source models?
Absolutely. RAG is architecture-agnostic — it works with any LLM, whether API-hosted or self-hosted. In fact, RAG is especially valuable with open-source models because it compensates for their smaller training datasets and lower parameter counts by providing relevant context at query time. A Llama 3 70B model with RAG can match or exceed GPT-4o without RAG on domain-specific factual Q&A tasks. The cost advantage of combining RAG with open-source models is significant: zero per-token generation fees (just GPU compute), low embedding costs (use an open-source embedding model too), and a free vector database (pgvector or self-hosted Qdrant). The tradeoff is operational complexity — you are managing the LLM serving infrastructure, the embedding model, and the vector database yourself. For teams spending $5,000+/month on API-hosted RAG, self-hosting the entire stack can reduce costs by 60–80% while maintaining quality.


Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.