Embedding
A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.
Definition
What Is an Embedding?
An embedding model maps text to a point in high-dimensional space where semantic similarity becomes geometric proximity. "How do I reset my password?" and "I forgot my login credentials" would produce vectors with high cosine similarity (close to 1.0), while "How do I reset my password?" and "What is the weather forecast?" would produce vectors with low similarity (close to 0.0). Embeddings power a wide range of production AI systems: semantic search, retrieval-augmented generation (RAG), document classification, anomaly detection, recommendation engines, and deduplication. Unlike generative models that produce text output and bill for output tokens, embedding models consume only input tokens and return a vector — there are no output tokens. This makes their pricing model simpler but the volume dynamics different: RAG pipelines may embed millions of documents upfront and then embed every user query at inference time, creating both batch and real-time cost components that require separate optimization strategies.
Impact
Why It Matters for AI Costs
Embeddings are the backbone of nearly every production RAG system, and RAG is the dominant architecture for grounding LLM responses in private data. If your application retrieves relevant context before generating a response — and most enterprise AI applications do — you are running an embedding pipeline whether you realize it or not.
The cost structure of embeddings differs fundamentally from generative models:
| Model | Provider | Price per 1M Tokens | Dimensions |
|---|---|---|---|
| text-embedding-3-small | OpenAI | $0.02 | 1,536 |
| text-embedding-3-large | OpenAI | $0.13 | 3,072 |
| embed-v3 (English) | Cohere | $0.10 | 1,024 |
| voyage-3 | Voyage AI | $0.06 | 1,024 |
| voyage-3-lite | Voyage AI | $0.02 | 512 |
| textembedding-gecko (Gecko) | Google | $0.025 | 768 |
These per-token prices look negligible compared to generation models — text-embedding-3-small at $0.02/1M tokens is 125x cheaper than GPT-4o input at $2.50/1M. But embedding workloads involve massive token volumes. A RAG pipeline indexing 1 million documents averaging 2,000 tokens each requires 2 billion tokens to embed — costing $40 with text-embedding-3-small or $260 with text-embedding-3-large. Re-indexing after model upgrades or schema changes doubles those costs. At query time, embedding 100,000 user queries per day at an average of 50 tokens each adds 5 million tokens/day — modest, but it compounds.
The hidden cost multiplier is re-embedding. When you switch embedding models (to get better retrieval quality), you must re-embed your entire corpus because vectors from different models are incompatible. Teams that do not plan for this discover a surprise four- or five-figure bill during model migrations. CostHawk tracks embedding API usage separately from generation usage, giving you visibility into both the batch (indexing) and real-time (query) components of your embedding costs.
What Are Embeddings?
At its core, an embedding is a translation of human-readable text into a mathematical representation that machines can compare, search, and cluster. When you send text to an embedding API, the model processes the input through multiple transformer layers and outputs a single vector — an array of floating-point numbers — that represents the meaning of the entire input.
Consider these three sentences:
1. "The quarterly revenue exceeded projections by 12%"
2. "Q3 earnings beat analyst estimates, up 12 percent"
3. "The cat sat on the windowsill"
An embedding model would place sentences 1 and 2 close together in vector space (high cosine similarity, ~0.92) because they convey the same meaning, despite using different words. Sentence 3 would be far away from both (low similarity, ~0.15) because it is semantically unrelated.
This property — that semantic similarity maps to geometric proximity — is what makes embeddings so powerful for search and retrieval. Traditional keyword search fails when users use different terminology than what appears in documents. Embedding-based search succeeds because it operates on meaning, not exact word matches.
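This proximity test is straightforward to verify in code. Below is a minimal Python sketch of cosine similarity, using tiny made-up 4-dimensional vectors in place of real 1,536-dimensional embedding output; the similarity values are illustrative, not actual model scores.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction,
    values near 0.0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny made-up 4-d vectors standing in for real 1,536-d embeddings.
revenue_a = [0.8, 0.5, 0.1, 0.0]  # "revenue exceeded projections"
revenue_b = [0.7, 0.6, 0.2, 0.1]  # "earnings beat estimates"
cat_sat   = [0.0, 0.1, 0.2, 0.9]  # "the cat sat on the windowsill"

print(round(cosine_similarity(revenue_a, revenue_b), 2))  # 0.98 (semantically close)
print(round(cosine_similarity(revenue_a, cat_sat), 2))    # 0.08 (unrelated)
```

A real pipeline would compare vectors returned by an embedding API rather than hand-written lists, but the comparison step is exactly this arithmetic.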
How embedding models differ from generative models:
- Input only. Embedding models accept text input and return a vector. There is no text generation, no output tokens, and no autoregressive decoding.
- Fixed-size output. Regardless of input length, the output vector has a fixed number of dimensions (e.g., 1,536 for text-embedding-3-small). A 10-word query and a 500-word document both produce 1,536-dimensional vectors.
- Batch-friendly. Embedding APIs accept batches of inputs in a single request (OpenAI supports up to 2,048 inputs per batch), enabling efficient bulk processing at reduced latency.
- No temperature, no randomness. Embeddings are deterministic — the same input always produces the same vector (within floating-point precision).
The vector dimensionality is a key cost tradeoff. Higher-dimensional embeddings (3,072-d from text-embedding-3-large) capture more nuance and generally produce better retrieval quality, but they cost more to store in a vector database and more to compute similarity over. Lower-dimensional embeddings (512-d from voyage-3-lite) are cheaper to store and search but may sacrifice retrieval precision for some use cases.
Embedding Model Pricing
Embedding model pricing is straightforward — you pay per input token with no output token charge — but the total cost depends heavily on your workload pattern. There are two distinct cost components: indexing cost (embedding your document corpus) and query cost (embedding user queries at runtime).
| Model | Provider | Price / 1M Tokens | Dimensions | Max Input | Best For |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | $0.02 | 1,536 | 8,191 tokens | Cost-sensitive workloads, general-purpose search |
| text-embedding-3-large | OpenAI | $0.13 | 3,072 | 8,191 tokens | High-precision retrieval, complex semantic tasks |
| embed-v3 (English) | Cohere | $0.10 | 1,024 | 512 tokens | English-focused search, classification |
| embed-v3 (Multilingual) | Cohere | $0.10 | 1,024 | 512 tokens | Multilingual retrieval across 100+ languages |
| voyage-3 | Voyage AI | $0.06 | 1,024 | 32,000 tokens | Long-document embedding, code search |
| voyage-3-lite | Voyage AI | $0.02 | 512 | 32,000 tokens | Budget workloads, rapid prototyping |
| voyage-code-3 | Voyage AI | $0.06 | 1,024 | 32,000 tokens | Code retrieval, technical documentation search |
| textembedding-gecko | Google | $0.025 | 768 | 3,072 tokens | GCP-native workloads, Vertex AI integration |
Indexing cost example: Embedding a corpus of 500,000 support articles averaging 1,500 tokens each:
Total tokens: 500,000 × 1,500 = 750,000,000 (750M tokens)
text-embedding-3-small: 750 × $0.02 = $15.00
text-embedding-3-large: 750 × $0.13 = $97.50
voyage-3: 750 × $0.06 = $45.00
embed-v3: 750 × $0.10 = $75.00
Query cost example: 200,000 user queries per day averaging 40 tokens each:
Daily tokens: 200,000 × 40 = 8,000,000 (8M tokens)
text-embedding-3-small: 8 × $0.02 = $0.16/day ($4.80/month)
text-embedding-3-large: 8 × $0.13 = $1.04/day ($31.20/month)
voyage-3: 8 × $0.06 = $0.48/day ($14.40/month)
Query costs are typically small. The real expense comes from re-indexing — which happens when you upgrade embedding models, change chunking strategies, or add new documents to a growing corpus. Budget for 2–4 full re-indexes per year as embedding models improve and your corpus evolves.
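The arithmetic in these worked examples generalizes to a small estimator. The sketch below hardcodes the per-token prices from the table above; `indexing_cost` and `monthly_query_cost` are hypothetical helper names.

```python
# USD per 1M input tokens, taken from the pricing table above.
PRICE_PER_M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3": 0.06,
    "embed-v3": 0.10,
}

def indexing_cost(num_docs: int, avg_tokens: int, model: str) -> float:
    """One-time cost of embedding a document corpus."""
    return num_docs * avg_tokens / 1_000_000 * PRICE_PER_M[model]

def monthly_query_cost(queries_per_day: int, avg_tokens: int, model: str) -> float:
    """Recurring cost of embedding user queries, assuming a 30-day month."""
    return queries_per_day * avg_tokens / 1_000_000 * PRICE_PER_M[model] * 30

# The worked examples above:
print(round(indexing_cost(500_000, 1_500, "text-embedding-3-small"), 2))    # 15.0
print(round(monthly_query_cost(200_000, 40, "text-embedding-3-small"), 2))  # 4.8
```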
Embedding Dimensions and Cost Tradeoffs
The dimensionality of an embedding vector — the number of floating-point numbers in the array — has cascading effects on storage cost, search latency, and retrieval quality. Choosing the right dimensionality is a cost-quality tradeoff that depends on your use case and scale.
Storage cost by dimensionality:
| Dimensions | Bytes per Vector (float32) | Storage for 10M Vectors | Example Model |
|---|---|---|---|
| 512 | 2,048 bytes | ~19 GB | voyage-3-lite |
| 768 | 3,072 bytes | ~29 GB | textembedding-gecko |
| 1,024 | 4,096 bytes | ~38 GB | voyage-3, embed-v3 |
| 1,536 | 6,144 bytes | ~57 GB | text-embedding-3-small |
| 3,072 | 12,288 bytes | ~115 GB | text-embedding-3-large |
At 10 million vectors, the storage difference between 512-d and 3,072-d embeddings is 96 GB — which translates to significant vector database costs. Pinecone, Weaviate, Qdrant, and pgvector all charge based on storage and/or compute, and higher dimensionality increases both.
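The storage figures in the table follow directly from dimensions × 4 bytes per float32 value. A quick Python sketch of that calculation, ignoring index overhead such as HNSW graph structures:

```python
def vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage for a vector index, in GiB. Real vector databases
    add index overhead (graphs, metadata) on top of this baseline."""
    return num_vectors * dims * bytes_per_value / 1024**3

# 10 million vectors, matching the table above:
for dims in (512, 768, 1024, 1536, 3072):
    print(dims, round(vector_storage_gb(10_000_000, dims), 1))
```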
Search latency impact: Cosine similarity computation is proportional to vector dimensionality. Searching 3,072-d vectors is roughly 6x more compute-intensive than searching 512-d vectors. For real-time applications with strict latency requirements (sub-100ms), lower dimensionality enables faster retrieval without scaling up database compute.
Quality impact: Higher dimensionality generally captures more semantic nuance. On standard retrieval benchmarks (MTEB, BEIR), text-embedding-3-large (3,072-d) outperforms text-embedding-3-small (1,536-d) by 2–5% on average. Whether that quality difference matters depends on your use case. For a customer support chatbot, 2% better retrieval may not be perceptible. For a legal document search system where missing a relevant precedent has serious consequences, it may be critical.
Dimensionality reduction: OpenAI's text-embedding-3 models support a dimensions parameter that truncates the output vector to a specified size. You can request 256-d vectors from text-embedding-3-large, getting most of the model's quality at a fraction of the storage cost. This technique — called Matryoshka Representation Learning (MRL) — works because the most important information is packed into the first dimensions. In benchmarks, 512-d truncated vectors from text-embedding-3-large often outperform full 1,536-d vectors from text-embedding-3-small, giving you better quality at lower storage cost. This is one of the highest-leverage embedding cost optimizations available.
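Client-side, MRL truncation amounts to keeping the first N components and re-normalizing. The sketch below assumes you already have a full-width embedding vector; with OpenAI's API you would instead pass the `dimensions` request parameter and receive the shortened vector directly.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components of an MRL-trained embedding and
    re-normalize to unit length so cosine similarity remains meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Note that this only preserves quality for models trained with MRL; naively truncating vectors from an arbitrary embedding model discards information unpredictably.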
Embeddings in RAG Pipelines
Retrieval-Augmented Generation (RAG) is the most common production use case for embeddings, and it introduces a multi-stage cost structure that teams must understand to optimize effectively. A RAG pipeline has two phases, each with distinct embedding cost dynamics.
Phase 1: Indexing (offline, batch)
During indexing, your document corpus is chunked into segments (typically 200–1,000 tokens each), and each chunk is embedded and stored in a vector database. The cost drivers are:
- Corpus size: More documents = more tokens to embed. A 100,000-document knowledge base at 500 tokens per chunk = 50M tokens.
- Chunking strategy: Smaller chunks produce more embeddings (higher cost) but may improve retrieval precision. Larger chunks produce fewer embeddings (lower cost) but may include irrelevant content.
- Overlap: Overlapping chunks (e.g., 100-token overlap between consecutive chunks) improve retrieval at the boundary between chunks but increase total token count by 10–30%.
- Re-indexing frequency: Every time you change your embedding model, chunking strategy, or add significant new content, you must re-embed. Budget for this.
Phase 2: Query (online, real-time)
When a user sends a query, the RAG pipeline embeds the query, searches the vector database for the most similar chunks, and passes those chunks as context to a generative model. The embedding cost at query time is typically negligible (a single 40-token query costs fractions of a cent), but the downstream generative cost is not. Each retrieved chunk becomes part of the generative model's input, consuming input tokens at the generation model's rate.
The hidden cost multiplier: RAG pipelines create a cost coupling between embedding and generation. If your retrieval step returns 5 chunks of 500 tokens each, that is 2,500 additional input tokens on every generative request. At GPT-4o's $2.50/1M input rate, that is $0.00625 per query — several thousand times the cost of embedding the 40-token query itself with text-embedding-3-small. The embedding cost is a rounding error; the cost of using the embeddings (as retrieved context in generation) is where the real expense lies.
This means optimizing your RAG embedding pipeline is not just about reducing embedding API costs — it is about reducing the volume and size of retrieved chunks to minimize downstream generation costs. Better embeddings that retrieve fewer, more relevant chunks can actually increase embedding costs while decreasing total pipeline costs by reducing the context stuffed into the generation step.
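A small Python sketch makes this coupling explicit; `rag_query_cost` is a hypothetical helper, and the prices are the ones quoted above.

```python
def rag_query_cost(query_tokens: int, num_chunks: int, chunk_tokens: int,
                   embed_price_per_m: float, gen_input_price_per_m: float) -> dict:
    """Split the input-side cost of one RAG query into the embedding call
    and the retrieved context passed to the generation model."""
    embed = query_tokens / 1_000_000 * embed_price_per_m
    context = num_chunks * chunk_tokens / 1_000_000 * gen_input_price_per_m
    return {
        "embedding_usd": embed,
        "retrieved_context_usd": context,
        "context_multiple": context / embed,
    }

# 40-token query, 5 chunks of 500 tokens, small embeddings + GPT-4o input rate:
cost = rag_query_cost(40, 5, 500, 0.02, 2.50)
print(round(cost["retrieved_context_usd"], 5))  # 0.00625
print(cost["context_multiple"])                 # several thousand times the embedding cost
```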
Reducing Embedding Costs
Embedding costs can be optimized across five dimensions: batching, caching, dimensionality reduction, model selection, and chunking strategy. Here are the practical techniques and their expected savings.
1. Batch API requests. Embedding APIs accept multiple inputs per request. OpenAI supports up to 2,048 inputs per batch call. Batching reduces HTTP overhead, improves throughput, and — for providers that offer batch pricing — can reduce per-token costs. OpenAI's Batch API provides a 50% discount ($0.01/1M instead of $0.02/1M for text-embedding-3-small) for asynchronous batch jobs that complete within 24 hours. If your indexing is not time-sensitive, always use batch mode.
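A minimal batching sketch in Python; the commented-out API call is a hypothetical client, not a specific SDK's signature.

```python
from typing import Iterator

MAX_BATCH = 2_048  # OpenAI's documented per-request input limit

def batched(texts: list[str], size: int = MAX_BATCH) -> Iterator[list[str]]:
    """Yield successive slices so each embedding request carries many inputs
    instead of one, cutting HTTP round-trips by orders of magnitude."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

# One request per 2,048 chunks instead of one request per chunk, e.g.:
#   for batch in batched(corpus_chunks):
#       vectors = client.embed(model_name, batch)  # hypothetical client call
```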
2. Cache embedding results. Embedding the same text twice is pure waste. Implement an embedding cache keyed on a hash of the input text + model name. For query embeddings, a Redis or in-memory cache with a TTL of 1–24 hours catches repeated queries (which are common — many users ask similar questions). For document embeddings, store vectors alongside document hashes and only re-embed when the content changes. A well-implemented cache typically reduces embedding API calls by 20–40% for query workloads and eliminates redundant re-embedding for unchanged documents.
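A hash-keyed cache can be sketched in a few lines. `EmbeddingCache` below is a hypothetical in-memory stand-in for a Redis-backed implementation, and `embed_fn` is whatever function wraps your provider's API.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """In-memory cache keyed on sha256(model + text); a production version
    would back this with Redis and add a TTL for query embeddings."""

    def __init__(self, embed_fn: Callable[[str, str], list[float]], model: str):
        self._embed_fn = embed_fn  # wraps your provider's API: (model, text) -> vector
        self._model = model
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Keying on model + text ensures vectors from different models never mix.
        return hashlib.sha256(f"{self._model}:{text}".encode()).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(self._model, text)
        return self._store[key]
```

Tracking `hits` and `misses` also gives you the cache hit rate metric discussed in the monitoring section below.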
3. Use dimensionality reduction. As discussed in the previous section, OpenAI's dimensions parameter lets you request shorter vectors from text-embedding-3 models. But you can also apply post-hoc dimensionality reduction to any embedding model using PCA (Principal Component Analysis) or random projection. Reducing 1,536-d vectors to 512-d typically retains 95%+ of retrieval quality while cutting storage costs by 67% and speeding up vector search proportionally.
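Random projection is the simpler of these post-hoc techniques to sketch, since unlike PCA it needs no fitting pass over your corpus. A pure-Python illustration follows; a production system would use NumPy or scikit-learn for the matrix math.

```python
import math
import random

def random_projection_matrix(in_dims: int, out_dims: int, seed: int = 0) -> list[list[float]]:
    """Gaussian random projection matrix. By the Johnson-Lindenstrauss lemma,
    projecting onto random directions approximately preserves pairwise
    distances -- no training pass over the corpus needed, unlike PCA."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dims)
    return [[rng.gauss(0.0, scale) for _ in range(in_dims)] for _ in range(out_dims)]

def project(vec: list[float], matrix: list[list[float]]) -> list[float]:
    """Reduce a vector: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

# 1,536-d -> 512-d, cutting raw vector storage by two-thirds:
matrix = random_projection_matrix(1536, 512, seed=42)
```

The same fixed matrix must be applied to every stored vector and every query vector, or similarity scores become meaningless.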
4. Choose the right model tier. Not every embedding task needs the best model. For internal search over well-structured documents with clear keywords, text-embedding-3-small ($0.02/1M) performs within 3% of text-embedding-3-large ($0.13/1M) — a 6.5x cost difference. Reserve the larger model for use cases where retrieval precision is critical (legal, medical, compliance). Run A/B tests on your actual data to quantify the quality difference before paying the premium.
5. Optimize chunking. Smaller chunks produce more embedding API calls but may improve retrieval. Larger chunks produce fewer calls but may dilute relevance. The sweet spot depends on your content: 300–500 tokens per chunk works well for most FAQ and documentation corpora. For long-form content (research papers, legal filings), 500–1,000 token chunks with 100-token overlap balances cost and quality. Avoid the common mistake of chunking too small (under 100 tokens) — this produces many low-context embeddings that increase both embedding and vector database costs without improving retrieval.
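The overlap arithmetic above can be sketched as a sliding-window splitter. `chunk_tokens` is a hypothetical helper operating on pre-tokenized input (a token list) rather than raw text.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400, overlap: int = 100) -> list[list[str]]:
    """Sliding-window chunking over a pre-tokenized document. Overlapping
    windows preserve context at chunk boundaries at the price of
    re-embedding some tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# A 1,000-token document at 400-token chunks with 100-token overlap yields
# 3 chunks and 1,200 embedded tokens: a 20% overlap overhead.
```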
Monitoring Embedding Usage
Embedding costs are easy to overlook because individual requests are cheap, but they accumulate silently — especially during indexing operations, model migrations, and traffic spikes. Effective monitoring requires tracking embedding usage as a distinct cost category, separate from generation costs.
Key metrics to monitor:
- Daily embedding token volume: Track total tokens sent to embedding APIs per day, broken down by model. Sudden spikes indicate unexpected re-indexing, runaway batch jobs, or application bugs that embed content redundantly.
- Embedding cost per query: For RAG pipelines, track the embedding cost component of each user query. This is typically tiny (sub-cent), but it establishes a baseline for detecting anomalies.
- Indexing cost per run: Log the total cost of each indexing operation. Compare against previous runs to detect scope creep (growing corpus size), chunking changes that increase token volume, or accidental full re-indexes when incremental updates were intended.
- Cache hit rate: If you implement embedding caching, monitor the hit rate. A healthy query cache should achieve 20–40% hit rate for diverse user queries and 60–80% for applications with repetitive query patterns. A dropping hit rate may indicate cache misconfiguration or a shift in query patterns.
- Cost per document indexed: Normalize indexing costs by document count to detect changes in average document length, chunking strategy, or overlap settings.
Alerting thresholds:
- Alert if daily embedding token volume exceeds 2x the 7-day moving average — this catches runaway indexing jobs.
- Alert if embedding spend exceeds 10% of total AI API spend — for most workloads, embedding should be a small fraction of total cost, and exceeding this threshold suggests inefficiency.
- Alert on new embedding models appearing in usage logs — unauthorized model changes can indicate configuration drift or unauthorized experimentation.
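The first of these thresholds is straightforward to implement; `spike_alert` below is a hypothetical sketch over a list of daily token totals.

```python
def spike_alert(daily_tokens: list[float], multiplier: float = 2.0, window: int = 7) -> bool:
    """True if the most recent day's embedding token volume exceeds
    `multiplier` times the moving average of the preceding `window` days."""
    if len(daily_tokens) < window + 1:
        return False  # not enough history to establish a baseline
    *history, today = daily_tokens[-(window + 1):]
    baseline = sum(history) / window
    return today > multiplier * baseline
```

Run against a week of ~10M tokens/day, a 25M-token day trips the alert while a 15M-token day does not.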
CostHawk tracks embedding API calls as a dedicated category in your usage dashboard, with per-model and per-key breakdowns. You can set budget alerts specifically for embedding spend, separate from generation spend, ensuring that indexing operations do not silently consume your budget. The anomaly detection engine flags unusual embedding volume patterns, catching issues like accidental full corpus re-indexes before they complete and bill.
FAQ
Frequently Asked Questions
What is the difference between an embedding model and a generative LLM?
An embedding model consumes input tokens and returns a fixed-size vector: there is no text generation, no output tokens, and no autoregressive decoding, so you are billed for input only. A generative LLM produces text and bills for both input and output tokens. Embedding models are also deterministic — the same input always produces the same vector — whereas generative output varies with sampling settings. See "What Are Embeddings?" above for a fuller comparison.
How do I choose between text-embedding-3-small and text-embedding-3-large?
text-embedding-3-large ($0.13/1M tokens, 3,072 dimensions) outperforms text-embedding-3-small ($0.02/1M tokens, 1,536 dimensions) by 2–5% on standard retrieval benchmarks. For most applications — customer support search, FAQ matching, general document retrieval — that 2–5% quality gap is imperceptible to users, and the 6.5x price difference makes text-embedding-3-small the clear choice. Choose text-embedding-3-large when retrieval precision has outsized business impact: legal document search where missing a relevant precedent carries risk, medical literature retrieval where completeness matters, or compliance systems where false negatives are costly. A practical approach is to start with text-embedding-3-small, measure your retrieval quality on a test set of real queries with human-labeled relevance judgments, and upgrade to text-embedding-3-large only if the quality gap is both measurable and material to your use case.
How much does it cost to embed a million documents?
At an average of 1,500 tokens per document, a million documents is 1.5 billion tokens. At text-embedding-3-small pricing ($0.02/1M tokens), that is $30.00. At text-embedding-3-large pricing ($0.13/1M tokens), it is $195.00. At Voyage AI's voyage-3 pricing ($0.06/1M tokens), it is $90.00. These are one-time indexing costs, but you will pay them again every time you re-index — when you switch embedding models, change chunking strategies, or rebuild your vector database. Budget for 2–4 full re-indexes per year. Also factor in vector database storage costs: 3 million 1,536-d vectors require approximately 17 GB of storage, which costs $5–50/month depending on your vector database provider and tier. The embedding API cost is often smaller than the ongoing vector storage and search compute costs.
Can I use the same embeddings across different models or providers?
No. Embeddings from different models are not interchangeable: a vector from text-embedding-3-small and a vector from voyage-3 for the same text will have different dimensionality, different magnitudes, and different geometric relationships. Computing cosine similarity between vectors from different models produces meaningless results. This incompatibility has a critical cost implication: switching embedding models requires re-embedding your entire corpus. If you have 500 million tokens of indexed documents and migrate from text-embedding-3-small to text-embedding-3-large, you will pay $65 just for the re-embedding, plus any downtime or dual-index costs during the transition. Plan model migrations carefully, run quality evaluations before committing, and budget for the full re-embedding cost. CostHawk tracks which embedding model produced each batch of embeddings, helping you forecast migration costs before you pull the trigger.
What is Matryoshka Representation Learning and how does it save money?
OpenAI's text-embedding-3 models are trained with MRL, which means you can request truncated vectors via the dimensions parameter — for example, requesting 256-d vectors from text-embedding-3-large instead of the full 3,072 dimensions. The API cost per token is the same regardless of requested dimensions (you still pay $0.13/1M tokens for text-embedding-3-large), but the downstream savings are substantial. Truncated vectors reduce vector database storage by up to 92% (256-d vs 3,072-d), speed up similarity search proportionally, and lower vector database compute costs. In benchmarks, 512-d truncated vectors from text-embedding-3-large often outperform full 1,536-d vectors from text-embedding-3-small, giving you the large model's quality at one-third the storage cost. This is one of the easiest wins in embedding cost optimization — a single parameter change that reduces infrastructure costs significantly.
How do embeddings work in a RAG pipeline and what are the cost implications?
A RAG pipeline embeds your document corpus once during indexing (a batch cost) and embeds each user query at runtime (a real-time cost), then passes the retrieved chunks as context to a generative model. The query-time embedding cost is tiny; the dominant expense is the retrieved context consuming input tokens at the generation model's rate. Optimizing retrieval to return fewer, more relevant chunks therefore lowers total pipeline cost even if it raises embedding cost. See "Embeddings in RAG Pipelines" above for worked numbers.
Should I use open-source embedding models to avoid API costs entirely?
It depends on volume. Embedding 10 million query tokens per day with text-embedding-3-small is $0.20/day ($6/month). Self-hosting would cost $720–$1,080/month for a single GPU instance running 24/7 — over 100x more expensive at this scale. Self-hosting breaks even at very high volumes, typically above 1 billion tokens per day, where API costs would exceed $600/month. Below that threshold, API-based embedding is almost always cheaper. If you are considering self-hosting for latency reasons (eliminating network round-trips), consider that most embedding APIs respond in 50–200ms, which is fast enough for all but the most latency-sensitive applications.
How often should I re-embed my document corpus and what does it cost?
Re-embed when you switch embedding models, change your chunking strategy, or accumulate significant new content — budget for 2–4 full re-indexes per year. For a 500-million-token corpus, each full re-index costs $10 with text-embedding-3-small or $65 with text-embedding-3-large. Factor in vector database downtime or dual-index costs during the migration. CostHawk's historical usage data helps you forecast re-indexing costs based on your actual corpus growth rate and embedding token consumption patterns.
Related Terms
Retrieval-Augmented Generation (RAG)
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
Semantic Caching
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20–40%.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
