Glossary · Infrastructure · Updated 2026-03-16

Embedding

A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.

Definition

What is an Embedding?

An embedding is a fixed-length array of floating-point numbers (a vector) that encodes the semantic meaning of a piece of text. Embedding models — distinct from generative LLMs — are neural networks trained to map text inputs to points in a high-dimensional vector space such that semantically similar texts are located near each other. The sentence "How do I reset my password?" and "I forgot my login credentials" would produce vectors with high cosine similarity (close to 1.0), while "How do I reset my password?" and "What is the weather forecast?" would produce vectors with low similarity (close to 0.0).

Embeddings power a wide range of production AI systems: semantic search, retrieval-augmented generation (RAG), document classification, anomaly detection, recommendation engines, and deduplication. Unlike generative models that produce text output and bill for output tokens, embedding models consume only input tokens and return a vector — there are no output tokens. This makes their pricing model simpler but the volume dynamics different: RAG pipelines may embed millions of documents upfront and then embed every user query at inference time, creating both batch and real-time cost components that require separate optimization strategies.

Impact

Why It Matters for AI Costs

Embeddings are the backbone of nearly every production RAG system, and RAG is the dominant architecture for grounding LLM responses in private data. If your application retrieves relevant context before generating a response — and most enterprise AI applications do — you are running an embedding pipeline whether you realize it or not.

The cost structure of embeddings differs fundamentally from generative models:

Model                       | Provider  | Price per 1M Tokens | Dimensions
----------------------------|-----------|---------------------|-----------
text-embedding-3-small      | OpenAI    | $0.02               | 1,536
text-embedding-3-large      | OpenAI    | $0.13               | 3,072
embed-v3 (English)          | Cohere    | $0.10               | 1,024
voyage-3                    | Voyage AI | $0.06               | 1,024
voyage-3-lite               | Voyage AI | $0.02               | 512
Gecko (textembedding-gecko) | Google    | $0.025              | 768

These per-token prices look negligible compared to generation models — text-embedding-3-small at $0.02/1M tokens is 125x cheaper than GPT-4o input at $2.50/1M. But embedding workloads involve massive token volumes. A RAG pipeline indexing 1 million documents averaging 2,000 tokens each requires 2 billion tokens to embed — costing $40 with text-embedding-3-small or $260 with text-embedding-3-large. Re-indexing after model upgrades or schema changes doubles those costs. At query time, embedding 100,000 user queries per day at an average of 50 tokens each adds 5 million tokens/day — modest, but it compounds.
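The indexing arithmetic above generalizes to any corpus and model; a minimal sketch:

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, price_per_million):
    """Indexing cost: total tokens embedded times the per-million-token rate."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# 1 million documents at 2,000 tokens each = 2 billion tokens
print(embedding_cost_usd(1_000_000, 2_000, 0.02))   # 40.0  (text-embedding-3-small)
print(embedding_cost_usd(1_000_000, 2_000, 0.13))   # 260.0 (text-embedding-3-large)
```

Doubling the result approximates one full re-index, per the note above.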

The hidden cost multiplier is re-embedding. When you switch embedding models (to get better retrieval quality), you must re-embed your entire corpus because vectors from different models are incompatible. Teams that do not plan for this discover a surprise four- or five-figure bill during model migrations. CostHawk tracks embedding API usage separately from generation usage, giving you visibility into both the batch (indexing) and real-time (query) components of your embedding costs.

What Are Embeddings?

At its core, an embedding is a translation of human-readable text into a mathematical representation that machines can compare, search, and cluster. When you send text to an embedding API, the model processes the input through multiple transformer layers and outputs a single vector — an array of floating-point numbers — that represents the meaning of the entire input.

Consider these three sentences:

  1. "The quarterly revenue exceeded projections by 12%"
  2. "Q3 earnings beat analyst estimates, up 12 percent"
  3. "The cat sat on the windowsill"

An embedding model would place sentences 1 and 2 close together in vector space (high cosine similarity, ~0.92) because they convey the same meaning, despite using different words. Sentence 3 would be far away from both (low similarity, ~0.15) because it is semantically unrelated.

This property — that semantic similarity maps to geometric proximity — is what makes embeddings so powerful for search and retrieval. Traditional keyword search fails when users use different terminology than what appears in documents. Embedding-based search succeeds because it operates on meaning, not exact word matches.
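Geometric proximity is measured with cosine similarity, which is just a normalized dot product. A toy sketch with made-up 3-d vectors (real embeddings have hundreds or thousands of dimensions, and the values here are illustrative, not model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: dot product
    divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the three sentences above
revenue_a = [0.80, 0.50, 0.10]   # "revenue exceeded projections"
revenue_b = [0.75, 0.55, 0.15]   # "earnings beat estimates"
cat       = [0.10, 0.20, 0.90]   # "the cat sat on the windowsill"

print(round(cosine_similarity(revenue_a, revenue_b), 2))  # high similarity
print(round(cosine_similarity(revenue_a, cat), 2))        # much lower
```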

How embedding models differ from generative models:

  • Input only. Embedding models accept text input and return a vector. There is no text generation, no output tokens, and no autoregressive decoding.
  • Fixed-size output. Regardless of input length, the output vector has a fixed number of dimensions (e.g., 1,536 for text-embedding-3-small). A 10-word query and a 500-word document both produce 1,536-dimensional vectors.
  • Batch-friendly. Embedding APIs accept batches of inputs in a single request (OpenAI supports up to 2,048 inputs per batch), enabling efficient bulk processing at reduced latency.
  • No temperature, no randomness. Embeddings are deterministic — the same input always produces the same vector (within floating-point precision).

The vector dimensionality is a key cost tradeoff. Higher-dimensional embeddings (3,072-d from text-embedding-3-large) capture more nuance and generally produce better retrieval quality, but they cost more to store in a vector database and more to compute similarity over. Lower-dimensional embeddings (512-d from voyage-3-lite) are cheaper to store and search but may sacrifice retrieval precision for some use cases.

Embedding Model Pricing

Embedding model pricing is straightforward — you pay per input token with no output token charge — but the total cost depends heavily on your workload pattern. There are two distinct cost components: indexing cost (embedding your document corpus) and query cost (embedding user queries at runtime).

Model                   | Provider  | Price / 1M Tokens | Dimensions | Max Input     | Best For
------------------------|-----------|-------------------|------------|---------------|---------
text-embedding-3-small  | OpenAI    | $0.02             | 1,536      | 8,191 tokens  | Cost-sensitive workloads, general-purpose search
text-embedding-3-large  | OpenAI    | $0.13             | 3,072      | 8,191 tokens  | High-precision retrieval, complex semantic tasks
embed-v3 (English)      | Cohere    | $0.10             | 1,024      | 512 tokens    | English-focused search, classification
embed-v3 (Multilingual) | Cohere    | $0.10             | 1,024      | 512 tokens    | Multilingual retrieval across 100+ languages
voyage-3                | Voyage AI | $0.06             | 1,024      | 32,000 tokens | Long-document embedding, code search
voyage-3-lite           | Voyage AI | $0.02             | 512        | 32,000 tokens | Budget workloads, rapid prototyping
voyage-code-3           | Voyage AI | $0.06             | 1,024      | 32,000 tokens | Code retrieval, technical documentation search
textembedding-gecko     | Google    | $0.025            | 768        | 3,072 tokens  | GCP-native workloads, Vertex AI integration

Indexing cost example: Embedding a corpus of 500,000 support articles averaging 1,500 tokens each:

Total tokens: 500,000 × 1,500 = 750,000,000 (750M tokens)

text-embedding-3-small: 750 × $0.02 = $15.00
text-embedding-3-large: 750 × $0.13 = $97.50
voyage-3:              750 × $0.06 = $45.00
embed-v3:              750 × $0.10 = $75.00

Query cost example: 200,000 user queries per day averaging 40 tokens each:

Daily tokens: 200,000 × 40 = 8,000,000 (8M tokens)

text-embedding-3-small: 8 × $0.02 = $0.16/day ($4.80/month)
text-embedding-3-large: 8 × $0.13 = $1.04/day ($31.20/month)
voyage-3:              8 × $0.06 = $0.48/day ($14.40/month)

Query costs are typically small. The real expense comes from re-indexing — which happens when you upgrade embedding models, change chunking strategies, or add new documents to a growing corpus. Budget for 2–4 full re-indexes per year as embedding models improve and your corpus evolves.

Embedding Dimensions and Cost Tradeoffs

The dimensionality of an embedding vector — the number of floating-point numbers in the array — has cascading effects on storage cost, search latency, and retrieval quality. Choosing the right dimensionality is a cost-quality tradeoff that depends on your use case and scale.

Storage cost by dimensionality:

Dimensions | Bytes per Vector (float32) | Storage for 10M Vectors | Example Model
-----------|----------------------------|-------------------------|--------------
512        | 2,048 bytes                | ~19 GB                  | voyage-3-lite
768        | 3,072 bytes                | ~29 GB                  | textembedding-gecko
1,024      | 4,096 bytes                | ~38 GB                  | voyage-3, embed-v3
1,536      | 6,144 bytes                | ~57 GB                  | text-embedding-3-small
3,072      | 12,288 bytes               | ~115 GB                 | text-embedding-3-large

At 10 million vectors, the storage difference between 512-d and 3,072-d embeddings is 96 GB — which translates to significant vector database costs. Pinecone, Weaviate, Qdrant, and pgvector all charge based on storage and/or compute, and higher dimensionality increases both.
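The figures in the table follow directly from dimensions × 4 bytes per float32 value; a quick sketch (raw vector data only, ignoring index structures and metadata overhead):

```python
def storage_gib(num_vectors, dimensions, bytes_per_float=4):
    """Raw float32 footprint of a vector index in GiB.
    Real vector databases add index and metadata overhead on top."""
    return num_vectors * dimensions * bytes_per_float / 2**30

print(round(storage_gib(10_000_000, 512)))    # ~19, matching the table
print(round(storage_gib(10_000_000, 3072)))   # ~114
```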

Search latency impact: Cosine similarity computation is proportional to vector dimensionality. Searching 3,072-d vectors is roughly 6x more compute-intensive than searching 512-d vectors. For real-time applications with strict latency requirements (sub-100ms), lower dimensionality enables faster retrieval without scaling up database compute.

Quality impact: Higher dimensionality generally captures more semantic nuance. On standard retrieval benchmarks (MTEB, BEIR), text-embedding-3-large (3,072-d) outperforms text-embedding-3-small (1,536-d) by 2–5% on average. Whether that quality difference matters depends on your use case. For a customer support chatbot, 2% better retrieval may not be perceptible. For a legal document search system where missing a relevant precedent has serious consequences, it may be critical.

Dimensionality reduction: OpenAI's text-embedding-3 models support a dimensions parameter that truncates the output vector to a specified size. You can request 256-d vectors from text-embedding-3-large, getting most of the model's quality at a fraction of the storage cost. This technique — called Matryoshka Representation Learning (MRL) — works because the most important information is packed into the first dimensions. In benchmarks, 512-d truncated vectors from text-embedding-3-large often outperform full 1,536-d vectors from text-embedding-3-small, giving you better quality at lower storage cost. This is one of the highest-leverage embedding cost optimizations available.
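A hedged sketch of what truncation does under MRL: keep the first N components and re-normalize to unit length. This approximates what the dimensions parameter returns for MRL-trained models; verify the exact behavior against the provider's current documentation before relying on it.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length.
    Valid for MRL-trained models (e.g., text-embedding-3), where the
    leading dimensions carry the most important information."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Truncate once at indexing time and apply the same truncation to query vectors, so both live in the same reduced space.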

Embeddings in RAG Pipelines

Retrieval-Augmented Generation (RAG) is the most common production use case for embeddings, and it introduces a multi-stage cost structure that teams must understand to optimize effectively. A RAG pipeline has two phases, each with distinct embedding cost dynamics.

Phase 1: Indexing (offline, batch)

During indexing, your document corpus is chunked into segments (typically 200–1,000 tokens each), and each chunk is embedded and stored in a vector database. The cost drivers are:

  • Corpus size: More documents = more tokens to embed. A 100,000-document knowledge base at 500 tokens per chunk = 50M tokens.
  • Chunking strategy: Smaller chunks produce more embeddings (higher cost) but may improve retrieval precision. Larger chunks produce fewer embeddings (lower cost) but may include irrelevant content.
  • Overlap: Overlapping chunks (e.g., 100-token overlap between consecutive chunks) improve retrieval at the boundary between chunks but increase total token count by 10–30%.
  • Re-indexing frequency: Every time you change your embedding model, chunking strategy, or add significant new content, you must re-embed. Budget for this.
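The interaction between chunk size and overlap can be sketched as follows. The 500-token chunks and 100-token overlap are illustrative defaults, and the integer list stands in for a real tokenizer's output:

```python
def chunk_tokens(tokens, chunk_size=500, overlap=100):
    """Split a token list into overlapping chunks. Overlap inflates the
    total token count sent to the embedding API."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = list(range(2_000))                 # stand-in for a 2,000-token document
chunks = chunk_tokens(doc)
embedded = sum(len(c) for c in chunks)
print(len(chunks), embedded)             # 5 chunks, 2,400 tokens (20% overhead)
```

The 20% overhead here sits inside the 10–30% range quoted above; larger overlaps or smaller chunks push it higher.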

Phase 2: Query (online, real-time)

When a user sends a query, the RAG pipeline embeds the query, searches the vector database for the most similar chunks, and passes those chunks as context to a generative model. The embedding cost at query time is typically negligible (a single 40-token query costs fractions of a cent), but the downstream generative cost is not. Each retrieved chunk becomes part of the generative model's input, consuming input tokens at the generation model's rate.

The hidden cost multiplier: RAG pipelines create a cost coupling between embedding and generation. If your retrieval step returns 5 chunks of 500 tokens each, that is 2,500 additional input tokens on every generative request. At GPT-4o's $2.50/1M input rate, that is $0.00625 per query — nearly 8,000x more expensive than embedding the query itself (a 40-token query at $0.02/1M costs $0.0000008). The embedding cost is a rounding error; the cost of using the embeddings (as retrieved context in generation) is where the real expense lies.

This means optimizing your RAG embedding pipeline is not just about reducing embedding API costs — it is about reducing the volume and size of retrieved chunks to minimize downstream generation costs. Better embeddings that retrieve fewer, more relevant chunks can actually increase embedding costs while decreasing total pipeline costs by reducing the context stuffed into the generation step.
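The coupling can be made concrete with a per-query cost split, using the prices quoted above (a sketch, not a billing calculator):

```python
def rag_query_cost(query_tokens, num_chunks, tokens_per_chunk,
                   embed_price_per_m, gen_input_price_per_m):
    """Split a RAG query's cost into its embedding and generation-input
    components. Prices are USD per 1M tokens."""
    embed_cost = query_tokens / 1e6 * embed_price_per_m
    context_cost = num_chunks * tokens_per_chunk / 1e6 * gen_input_price_per_m
    return embed_cost, context_cost

# 40-token query, 5 chunks of 500 tokens, text-embedding-3-small + GPT-4o input
embed, context = rag_query_cost(40, 5, 500, 0.02, 2.50)
print(f"${embed:.7f} embedding vs ${context:.5f} generation input")
```

Cutting retrieval from 5 chunks to 3 saves far more on the generation side than any embedding-model downgrade could.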

Reducing Embedding Costs

Embedding costs can be optimized across five dimensions: batching, caching, dimensionality reduction, model selection, and chunking strategy. Here are the practical techniques and their expected savings.

1. Batch API requests. Embedding APIs accept multiple inputs per request. OpenAI supports up to 2,048 inputs per batch call. Batching reduces HTTP overhead, improves throughput, and — for providers that offer batch pricing — can reduce per-token costs. OpenAI's Batch API provides a 50% discount ($0.01/1M instead of $0.02/1M for text-embedding-3-small) for asynchronous batch jobs that complete within 24 hours. If your indexing is not time-sensitive, always use batch mode.

2. Cache embedding results. Embedding the same text twice is pure waste. Implement an embedding cache keyed on a hash of the input text + model name. For query embeddings, a Redis or in-memory cache with a TTL of 1–24 hours catches repeated queries (which are common — many users ask similar questions). For document embeddings, store vectors alongside document hashes and only re-embed when the content changes. A well-implemented cache typically reduces embedding API calls by 20–40% for query workloads and eliminates redundant re-embedding for unchanged documents.

3. Use dimensionality reduction. As discussed in the previous section, OpenAI's dimensions parameter lets you request shorter vectors from text-embedding-3 models. But you can also apply post-hoc dimensionality reduction to any embedding model using PCA (Principal Component Analysis) or random projection. Reducing 1,536-d vectors to 512-d typically retains 95%+ of retrieval quality while cutting storage costs by 67% and speeding up vector search proportionally.

4. Choose the right model tier. Not every embedding task needs the best model. For internal search over well-structured documents with clear keywords, text-embedding-3-small ($0.02/1M) performs within 3% of text-embedding-3-large ($0.13/1M) — a 6.5x cost difference. Reserve the larger model for use cases where retrieval precision is critical (legal, medical, compliance). Run A/B tests on your actual data to quantify the quality difference before paying the premium.

5. Optimize chunking. Smaller chunks produce more embedding API calls but may improve retrieval. Larger chunks produce fewer calls but may dilute relevance. The sweet spot depends on your content: 300–500 tokens per chunk works well for most FAQ and documentation corpora. For long-form content (research papers, legal filings), 500–1,000 token chunks with 100-token overlap balances cost and quality. Avoid the common mistake of chunking too small (under 100 tokens) — this produces many low-context embeddings that increase both embedding and vector database costs without improving retrieval.
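Technique 1 (batching) amounts to slicing your chunk list into API-sized groups; a minimal sketch assuming the 2,048-input ceiling noted above:

```python
def batched(items, batch_size=2048):
    """Yield successive batches of at most `batch_size` items.
    2,048 matches OpenAI's per-request input limit for embeddings."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

chunks = [f"chunk-{i}" for i in range(5_000)]
batches = list(batched(chunks))
print(len(batches))   # 3 requests instead of 5,000
```

Each batch would then go into a single embeddings API call (or an asynchronous Batch API job for the 50% discount).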
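Technique 2 (caching) can be keyed on a hash of model name plus input text; `embed_fn` below is a stand-in for the real API call:

```python
import hashlib

_cache = {}

def cached_embed(text, model, embed_fn):
    """Return a cached vector when this exact (model, text) pair was
    embedded before; otherwise call `embed_fn` and cache the result."""
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)
    return _cache[key]

calls = []
def fake_embed(text):
    """Stand-in for an embedding API call; records each invocation."""
    calls.append(text)
    return [0.1, 0.2, 0.3]

cached_embed("reset password", "text-embedding-3-small", fake_embed)
cached_embed("reset password", "text-embedding-3-small", fake_embed)
print(len(calls))   # 1 -- the second call was a cache hit
```

In production, swap the dict for Redis with a TTL for query embeddings, and key document embeddings on a content hash so unchanged documents are never re-embedded.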
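Technique 3 (post-hoc dimensionality reduction) can be sketched with a PCA projection via NumPy's SVD. Fit the projection once on corpus embeddings, then apply the same mean and components to query embeddings so both stay comparable:

```python
import numpy as np

def pca_reduce(vectors, target_dims):
    """Project vectors onto their top principal components.
    Returns the reduced matrix plus the mean and components needed to
    project future (query) vectors into the same space."""
    X = np.asarray(vectors, dtype=np.float32)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:target_dims]            # (target_dims, orig_dims)
    return (X - mean) @ components.T, mean, components

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 1536))       # stand-in for real embeddings
reduced, mean, comps = pca_reduce(corpus, 512)
print(reduced.shape)                         # (1000, 512)
```

Measure retrieval quality on your own data after reducing; the 95%+ retention figure above is workload-dependent.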

Monitoring Embedding Usage

Embedding costs are easy to overlook because individual requests are cheap, but they accumulate silently — especially during indexing operations, model migrations, and traffic spikes. Effective monitoring requires tracking embedding usage as a distinct cost category, separate from generation costs.

Key metrics to monitor:

  • Daily embedding token volume: Track total tokens sent to embedding APIs per day, broken down by model. Sudden spikes indicate unexpected re-indexing, runaway batch jobs, or application bugs that embed content redundantly.
  • Embedding cost per query: For RAG pipelines, track the embedding cost component of each user query. This is typically tiny (sub-cent), but it establishes a baseline for detecting anomalies.
  • Indexing cost per run: Log the total cost of each indexing operation. Compare against previous runs to detect scope creep (growing corpus size), chunking changes that increase token volume, or accidental full re-indexes when incremental updates were intended.
  • Cache hit rate: If you implement embedding caching, monitor the hit rate. A healthy query cache should achieve 20–40% hit rate for diverse user queries and 60–80% for applications with repetitive query patterns. A dropping hit rate may indicate cache misconfiguration or a shift in query patterns.
  • Cost per document indexed: Normalize indexing costs by document count to detect changes in average document length, chunking strategy, or overlap settings.

Alerting thresholds:

  • Alert if daily embedding token volume exceeds 2x the 7-day moving average — this catches runaway indexing jobs.
  • Alert if embedding spend exceeds 10% of total AI API spend — for most workloads, embedding should be a small fraction of total cost, and exceeding this threshold suggests inefficiency.
  • Alert on new embedding models appearing in usage logs — unauthorized model changes can indicate configuration drift or unauthorized experimentation.
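The first threshold (2x the trailing 7-day moving average) takes only a few lines to implement; a sketch:

```python
def spike_alert(daily_tokens, today_tokens, factor=2.0):
    """Flag today's embedding token volume if it exceeds `factor` times
    the trailing 7-day moving average."""
    window = daily_tokens[-7:]
    baseline = sum(window) / len(window)
    return today_tokens > factor * baseline

history = [8_000_000] * 7                 # steady 8M tokens/day
print(spike_alert(history, 9_000_000))    # False: within normal range
print(spike_alert(history, 20_000_000))   # True: likely a runaway re-index
```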

CostHawk tracks embedding API calls as a dedicated category in your usage dashboard, with per-model and per-key breakdowns. You can set budget alerts specifically for embedding spend, separate from generation spend, ensuring that indexing operations do not silently consume your budget. The anomaly detection engine flags unusual embedding volume patterns, catching issues like accidental full corpus re-indexes before they complete and bill.

FAQ

Frequently Asked Questions

What is the difference between an embedding model and a generative LLM?
An embedding model and a generative LLM are both transformer-based neural networks, but they serve fundamentally different purposes and have different cost structures. An embedding model takes text input and produces a fixed-length vector (array of numbers) that represents the semantic meaning of that text. It does not generate any text output — there are no output tokens and no autoregressive decoding. A generative LLM takes text input and produces text output, generating tokens one at a time in a sequential process. The cost implications are significant: embedding models charge only for input tokens (since there is no output), while generative models charge for both input and output tokens, with output tokens costing 4–5x more. Embedding models are also dramatically cheaper per input token — typically $0.02–$0.13 per million tokens versus $0.15–$15.00 per million for generative models. However, embedding workloads often involve much higher token volumes (embedding entire document corpora), so total spend can still be meaningful.
How do I choose between text-embedding-3-small and text-embedding-3-large?
The decision comes down to retrieval quality requirements versus cost. text-embedding-3-large ($0.13/1M tokens, 3,072 dimensions) outperforms text-embedding-3-small ($0.02/1M tokens, 1,536 dimensions) by 2–5% on standard retrieval benchmarks. For most applications — customer support search, FAQ matching, general document retrieval — that 2–5% quality gap is imperceptible to users, and the 6.5x price difference makes text-embedding-3-small the clear choice. Choose text-embedding-3-large when retrieval precision has outsized business impact: legal document search where missing a relevant precedent carries risk, medical literature retrieval where completeness matters, or compliance systems where false negatives are costly. A practical approach is to start with text-embedding-3-small, measure your retrieval quality on a test set of real queries with human-labeled relevance judgments, and upgrade to text-embedding-3-large only if the quality gap is both measurable and material to your use case.
How much does it cost to embed a million documents?
The cost depends on document length and your chosen model. For a typical knowledge base with documents averaging 1,500 tokens after chunking (where each document might be split into 3 chunks of 500 tokens each), 1 million documents produce 3 million chunks totaling 1.5 billion tokens. At text-embedding-3-small pricing ($0.02/1M tokens), that is $30.00. At text-embedding-3-large pricing ($0.13/1M tokens), it is $195.00. At Voyage AI's voyage-3 pricing ($0.06/1M tokens), it is $90.00. These are one-time indexing costs, but you will pay them again every time you re-index — when you switch embedding models, change chunking strategies, or rebuild your vector database. Budget for 2–4 full re-indexes per year. Also factor in vector database storage costs: 3 million 1,536-d vectors require approximately 17 GB of storage, which costs $5–50/month depending on your vector database provider and tier. The embedding API cost is often smaller than the ongoing vector storage and search compute costs.
Can I use the same embeddings across different models or providers?
No. Embeddings from different models are incompatible and cannot be mixed or compared. Each embedding model maps text to its own unique vector space — the dimensions do not correspond to the same features across models. A vector from text-embedding-3-small and a vector from voyage-3 for the same text will have different dimensionality, different magnitudes, and different geometric relationships. Computing cosine similarity between vectors from different models produces meaningless results. This incompatibility has a critical cost implication: switching embedding models requires re-embedding your entire corpus. If you have 500 million tokens of indexed documents and migrate from text-embedding-3-small to text-embedding-3-large, you will pay $65 just for the re-embedding, plus any downtime or dual-index costs during the transition. Plan model migrations carefully, run quality evaluations before committing, and budget for the full re-embedding cost. CostHawk tracks which embedding model produced each batch of embeddings, helping you forecast migration costs before you pull the trigger.
What is Matryoshka Representation Learning and how does it save money?
Matryoshka Representation Learning (MRL) is a training technique where the most important semantic information is packed into the first N dimensions of an embedding vector, with progressively less important information in later dimensions. OpenAI's text-embedding-3 models are trained with MRL, which means you can request truncated vectors via the dimensions parameter — for example, requesting 256-d vectors from text-embedding-3-large instead of the full 3,072 dimensions. The API cost per token is the same regardless of requested dimensions (you still pay $0.13/1M tokens for text-embedding-3-large), but the downstream savings are substantial. Truncated vectors reduce vector database storage by up to 92% (256-d vs 3,072-d), speed up similarity search proportionally, and lower vector database compute costs. In benchmarks, 512-d truncated vectors from text-embedding-3-large often outperform full 1,536-d vectors from text-embedding-3-small, giving you the large model's quality at one-third the storage cost. This is one of the easiest wins in embedding cost optimization — a single parameter change that reduces infrastructure costs significantly.
How do embeddings work in a RAG pipeline and what are the cost implications?
In a RAG pipeline, embeddings serve two roles: indexing and retrieval. During indexing, your document corpus is chunked and each chunk is embedded into a vector stored in a database. During retrieval, the user's query is embedded, the vector database finds the most similar document chunks, and those chunks are passed as context to a generative model. The cost breakdown is revealing: embedding the query is nearly free (a 40-token query at $0.02/1M tokens costs $0.0000008). But the retrieved chunks become input tokens for the generative model, and that is where the real cost lives. If retrieval returns 5 chunks of 500 tokens each, that is 2,500 tokens of context at the generative model's input rate — $0.00625 per query on GPT-4o ($2.50/1M). The embedding cost is 0.01% of the generation cost. This means optimizing your embedding pipeline for better precision (returning fewer, more relevant chunks) actually saves money on the generation side, even if you use a more expensive embedding model to achieve it. The total RAG pipeline cost is dominated by generation, not embedding.
Should I use open-source embedding models to avoid API costs entirely?
Self-hosting open-source embedding models (like BGE, E5, or GTE from Hugging Face) eliminates per-token API costs but introduces infrastructure costs that may or may not be cheaper depending on your scale. Running a high-quality embedding model requires GPU compute: a single NVIDIA A10G instance on AWS costs approximately $1.00–$1.50/hour, and can process roughly 500–1,000 embedding requests per second depending on batch size and sequence length. For a team processing 10 million tokens per day, the API cost with text-embedding-3-small is $0.20/day ($6/month). Self-hosting would cost $720–$1,080/month for a single GPU instance running 24/7 — over 100x more expensive at this scale. Self-hosting breaks even at very high volumes, typically above 1 billion tokens per day, where API costs would exceed $600/month. Below that threshold, API-based embedding is almost always cheaper. If you are considering self-hosting for latency reasons (eliminating network round-trips), consider that most embedding APIs respond in 50–200ms, which is fast enough for all but the most latency-sensitive applications.
How often should I re-embed my document corpus and what does it cost?
Re-embedding frequency depends on three triggers: model upgrades, content changes, and chunking strategy revisions. For model upgrades, evaluate new embedding models quarterly. If a new model offers meaningfully better retrieval quality (test on your own data, not just benchmarks), plan a re-index. For content changes, implement incremental indexing — only embed new or modified documents, not the entire corpus. Most vector databases support upsert operations that make incremental updates straightforward. For chunking changes, a full re-index is unavoidable since different chunk sizes produce different vectors. A practical schedule for most teams is one full re-index per year (when upgrading embedding models) plus continuous incremental indexing for new content. Budget accordingly: if your corpus is 500 million tokens, a full re-index costs $10 with text-embedding-3-small or $65 with text-embedding-3-large. Factor in vector database downtime or dual-index costs during the migration. CostHawk's historical usage data helps you forecast re-indexing costs based on your actual corpus growth rate and embedding token consumption patterns.

Related Terms

Retrieval-Augmented Generation (RAG)

An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.


Semantic Caching

An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20–40%.


Token

The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.


Cost Per Token

The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.


Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.


Context Window

The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.


AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.