Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Definition
What is Prompt Compression?
Prompt compression is the practice of reducing the number of tokens in a prompt (system message, user input, conversation history, few-shot examples, and retrieved context) while maintaining enough semantic content for the model to produce high-quality output. Since input tokens are billed on every request and constitute 50–80% of total token costs for most applications, even modest compression yields significant savings at scale.
Prompt compression operates on a spectrum from manual techniques (removing filler words, restructuring instructions) to algorithmic approaches (LLMLingua, selective context pruning, dynamic few-shot selection). Manual compression is free to implement and typically achieves 20–40% token reduction. Algorithmic compression tools like LLMLingua report 2–5x compression ratios (50–80% reduction) with minimal quality degradation on standard benchmarks. The optimal approach depends on your prompt structure, quality requirements, and engineering investment budget.
Impact
Why It Matters for AI Costs
Input tokens are the hidden cost driver in most AI applications. A system prompt sent with every request is paid for every single time — a 2,000-token system prompt across 100,000 daily requests costs 200 million input tokens per day. At Claude 3.5 Sonnet's $3.00/MTok input rate, that is $600/day or $18,000/month just for the system prompt. Compressing that system prompt from 2,000 to 800 tokens saves $10,800/month with zero change to your application's capabilities. For RAG applications that stuff retrieved documents into the prompt, compression is even more impactful — reducing a 10,000-token context window to 4,000 tokens saves $1.80 per 100 requests ($54,000/month at 100K requests/day). Despite these numbers, most teams never optimize their prompts for token efficiency because they write prompts for human readability rather than token economy. CostHawk's prompt analytics identify your highest-cost prompts and estimate the savings potential from compression.
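The arithmetic above can be sketched as a quick back-of-the-envelope estimate. This is an illustrative helper, not a CostHawk API:

```python
MTOK = 1_000_000  # tokens per million

def monthly_prompt_cost(tokens_per_request: int, requests_per_day: int,
                        price_per_mtok: float, days: int = 30) -> float:
    """Monthly input cost of a prompt component that is sent with every request."""
    return tokens_per_request * requests_per_day * price_per_mtok / MTOK * days

# 2,000-token system prompt, 100K requests/day, $3.00/MTok input
before = monthly_prompt_cost(2_000, 100_000, 3.00)  # $18,000/month
after = monthly_prompt_cost(800, 100_000, 3.00)     # $7,200/month
print(f"Monthly savings: ${before - after:,.0f}")   # Monthly savings: $10,800
```

The same function reproduces the RAG figure: compressing 10,000 tokens of retrieved context to 4,000 at 100K requests/day saves `monthly_prompt_cost(6_000, 100_000, 3.00)` = $54,000/month.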
What is Prompt Compression?
Prompt compression encompasses any technique that reduces the token count of text sent to a language model while preserving the information the model needs to generate a correct response. It is the input-side counterpart to max_tokens (which controls output cost) — together, they form the two primary levers for per-request cost optimization.
Compression can be applied to every component of a prompt:
| Prompt Component | Typical Token Share | Compression Potential | Risk Level |
|---|---|---|---|
| System prompt | 10–30% | High (30–60% reduction) | Medium — overtrimming degrades instruction following |
| Few-shot examples | 15–40% | Very high (50–80% reduction via dynamic selection) | Medium — fewer examples may reduce consistency |
| Conversation history | 20–50% | High (40–70% via summarization) | Low — models handle summarized context well |
| Retrieved context (RAG) | 30–60% | Very high (50–80% via selective retrieval) | Medium — aggressive pruning may drop relevant information |
| User message | 5–15% | Low (10–20%) | High — modifying user input risks changing meaning |
The key principle is that models are more robust to compressed input than humans expect. A prompt written in terse, abbreviated language often produces output quality comparable to a verbose, naturally-written prompt — because the model's understanding of language is fundamentally different from human reading comprehension. Research from Microsoft (LLMLingua, 2023) demonstrated that up to 80% of tokens in a typical prompt can be removed while retaining 90%+ of the model's original answer quality on benchmark tasks.
Manual Compression Techniques
Manual prompt compression is the fastest, cheapest, and most reliable optimization you can make. It requires no tools, no libraries, and no infrastructure changes — just careful editing of your prompt text. Here are the key techniques with before-and-after examples:
1. Filler Word Removal
Remove words that add no information: "please", "kindly", "I would like you to", "it is important that", "make sure to", "you should always". Models do not need politeness tokens.
BEFORE (47 tokens):
"I would like you to please carefully analyze the following customer
review and determine whether the overall sentiment expressed by the
customer is positive, negative, or neutral. It is important that you
only respond with one of these three options."
AFTER (19 tokens):
"Classify this review's sentiment as positive, negative, or neutral.
Respond with one word only."

Result: 60% token reduction with identical output quality.
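Token counts like the ones above can be approximated without a tokenizer using the rough rule of thumb of about four characters per English token. This is a heuristic sketch, not a real tokenizer; use your provider's tokenizer for billing-accurate counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

before = ("I would like you to please carefully analyze the following customer "
          "review and determine whether the overall sentiment expressed by the "
          "customer is positive, negative, or neutral. It is important that you "
          "only respond with one of these three options.")
after = ("Classify this review's sentiment as positive, negative, or neutral. "
         "Respond with one word only.")
print(estimate_tokens(before), estimate_tokens(after))
```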
2. Structural Condensation
Replace prose instructions with structured formats (bullets, tables, schemas) that convey the same information in fewer tokens:
BEFORE (82 tokens):
"When you encounter a customer complaint about shipping, you should
categorize it into one of the following categories. If the complaint
is about a late delivery, categorize it as 'delay'. If the complaint
is about a damaged package, categorize it as 'damage'. If the complaint
is about a lost package, categorize it as 'lost'. If it doesn't fit
any of these categories, categorize it as 'other'."
AFTER (28 tokens):
"Categorize shipping complaint:
- Late delivery → delay
- Damaged package → damage
- Lost package → lost
- Otherwise → other"

Result: 66% token reduction.
3. Example Trimming
Few-shot examples are often over-detailed. Reduce examples to the minimum necessary to demonstrate the pattern:
BEFORE (3 full examples, 180 tokens):
Example 1: "The product arrived quickly and works great! I love it." → positive
Example 2: "Terrible quality. Broke after one day. Want a refund." → negative
Example 3: "It's okay. Does what it says but nothing special." → neutral
AFTER (3 minimal examples, 45 tokens):
"Great product, love it" → positive
"Terrible, broke, want refund" → negative
"It's okay, nothing special" → neutral

Result: 75% token reduction in the examples section.
4. Abbreviation and Shorthand
Models handle abbreviations, acronyms, and shorthand reliably. Replace verbose phrases with shorter alternatives:
- "respond with" → "output:"
- "for example" → "e.g."
- "the following" → omit entirely
- "make sure that" → omit entirely
- "in the format of" → "format:"
- "the user's message" → "input"
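Techniques 1 and 4 can be automated with simple pattern substitution. A minimal sketch (the rule list here is illustrative; extend it with the patterns you find in your own prompts):

```python
import re

# (pattern, replacement) pairs for filler removal and shorthand substitution
RULES = [
    (r"\bI would like you to\s+", ""),
    (r"\bplease\s+", ""),
    (r"\bit is important that\s+", ""),
    (r"\bmake sure (that|to)\s+", ""),
    (r"\bfor example\b", "e.g."),
    (r"\bthe following\s+", ""),
]

def shorten(prompt: str) -> str:
    """Apply filler-removal and shorthand rules, then collapse extra whitespace."""
    for pattern, replacement in RULES:
        prompt = re.sub(pattern, replacement, prompt, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", prompt).strip()

print(shorten("I would like you to please analyze the following review"))
# analyze review
```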
5. Deduplication
Audit your system prompt for repeated instructions. A common pattern is stating the same constraint in multiple ways ("Only output JSON" + "Do not include explanations" + "Respond with valid JSON only"). Consolidate to a single clear statement. In prompts audited by CostHawk's optimization advisor, 15–25% of tokens are duplicative instructions that can be safely removed.
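A first pass at the deduplication audit can be automated for exact repeats. This crude sketch only catches verbatim duplicate sentences; paraphrased repetition (the more common case) still needs a human or LLM review:

```python
import re

def find_repeated_instructions(prompt: str) -> list[str]:
    """Return sentences that appear verbatim more than once in a prompt."""
    seen: set[str] = set()
    repeats: list[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", prompt):
        key = sentence.strip().lower()
        if not key:
            continue
        if key in seen:
            repeats.append(sentence.strip())
        else:
            seen.add(key)
    return repeats

print(find_repeated_instructions("Only output JSON. Be concise. Only output JSON."))
# ['Only output JSON.']
```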
Algorithmic Compression
When manual optimization is not enough or not scalable (e.g., compressing dynamically retrieved RAG context), algorithmic compression tools can automatically reduce token counts:
LLMLingua / LLMLingua-2
Developed by Microsoft Research, LLMLingua uses a small language model (GPT-2 or LLaMA-7B) to identify and remove the tokens that contribute least to the prompt's semantic content. It scores each token's perplexity (how predictable the token is given its context): low-perplexity tokens (predictable, redundant) are removed, while high-perplexity tokens (information-dense, surprising) are preserved.
- Compression ratio: 2–5x (50–80% token reduction)
- Quality retention: 90–97% on standard benchmarks (GSM8K, BBH, ShareGPT)
- Latency overhead: 50–200ms per prompt (small model inference)
- Best for: Long prompts with retrieved context, few-shot examples, and verbose instructions
```python
# LLMLingua integration example (Python)
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    device_map="cpu"
)

original_prompt = """[Your 3000-token prompt here...]"""

compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.5,                                 # Target 50% compression
    force_tokens=['\n', '?', '.', '!', ','],  # Preserve structure
    drop_consecutive=True
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
# Example output: Original: 3000 tokens → Compressed: 1200 tokens → Ratio: 2.5x
```

Selective Context Pruning
For RAG applications, selective context pruning removes entire sentences or paragraphs from retrieved documents that are unlikely to contain the answer. This is coarser than LLMLingua's token-level compression but faster and more predictable:
- Retrieve N documents (e.g., 10 chunks of 500 tokens each = 5,000 tokens)
- Re-rank by relevance to the query
- Include only the top K chunks (e.g., top 3 = 1,500 tokens)
- Result: 70% reduction in retrieved context tokens
This is effectively a relevance filter rather than a compression algorithm, but the cost impact is identical — fewer input tokens means lower cost.
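The retrieve-then-re-rank step above reduces to a top-K selection over relevance scores. A minimal sketch, assuming your re-ranker has already produced a score per chunk:

```python
def prune_context(chunks: list[str], scores: list[float], top_k: int = 3) -> list[str]:
    """Keep only the top_k highest-scoring chunks (e.g. re-ranker relevance scores)."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# 4 retrieved chunks with re-ranker scores; keep the 2 most relevant
print(prune_context(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.3], top_k=2))
# ['b', 'c']
```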
Dynamic Few-Shot Selection
Instead of including all few-shot examples in every prompt, select only the examples most relevant to the current input. Use embedding similarity to match the current query against your example bank and include only the top 2–3 matches:
- Full example bank: 20 examples × 100 tokens = 2,000 tokens
- Dynamic selection: 3 examples × 100 tokens = 300 tokens
- Result: 85% reduction in few-shot token cost
- Quality impact: Often positive — relevant examples produce better outputs than a random assortment
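The embedding-similarity matching described above can be sketched with plain cosine similarity. The `bank` structure here (embedding, example) is an assumption about how you store your example library:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_examples(query_embedding: list[float],
                    bank: list[tuple[list[float], str]],
                    top_k: int = 3) -> list[str]:
    """Return the top_k examples whose embeddings are most similar to the query."""
    ranked = sorted(bank, key=lambda pair: cosine(query_embedding, pair[0]), reverse=True)
    return [example for _, example in ranked[:top_k]]

bank = [([1.0, 0.0], "A"), ([0.0, 1.0], "B"), ([0.9, 0.1], "C")]
print(select_examples([1.0, 0.0], bank, top_k=2))
# ['A', 'C']
```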
Compression Ratio vs Quality Tradeoff
Every compression technique trades token savings against potential quality degradation. The relationship is not linear — moderate compression often has negligible quality impact, while aggressive compression hits a cliff where quality drops sharply. Here are empirical benchmarks from published research and CostHawk customer data:
| Compression Level | Token Reduction | Quality Retention (avg) | Technique | Recommended For |
|---|---|---|---|---|
| Light | 20–30% | 98–100% | Manual filler removal, deduplication | All prompts — no reason not to do this |
| Moderate | 30–50% | 95–98% | Manual restructuring + example trimming | System prompts, few-shot templates |
| Aggressive | 50–70% | 90–95% | LLMLingua at 2–3x compression | RAG context, long documents |
| Extreme | 70–80% | 80–90% | LLMLingua at 4–5x compression | Only when cost is critical and quality tolerance is high |
| Maximum | 80%+ | Below 80% | Not recommended | Never — quality loss outweighs the cost savings |
Quality retention is measured as the percentage of outputs that match the uncompressed version's quality rating (typically judged by human evaluation or automated metrics like ROUGE, BLEU, or LLM-as-judge scoring). The numbers above are averages — actual results vary significantly by task type:
- Classification tasks are highly robust to compression. The key signal (sentiment words, category indicators) survives even aggressive compression. Quality remains above 95% at 5x compression.
- Summarization tasks are moderately robust. Compressing the source document can remove details that should appear in the summary. Quality drops to 90% at 3x compression.
- Code generation tasks are sensitive to compression. Variable names, function signatures, and logical structure must be preserved. Quality drops to 85% at 2x compression.
- Reasoning tasks (math, logic) are the most sensitive. Removing seemingly redundant context can change the problem. Quality drops to 80% at 2x compression.
The recommended approach is to establish a quality baseline for each prompt/endpoint, apply light-to-moderate compression, measure quality against the baseline, and only increase compression if quality remains above your threshold. CostHawk's A/B testing integration helps you measure quality impact alongside cost savings for each compression level.
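The baseline-then-escalate procedure above can be expressed as a simple gating loop. `evaluate` stands in for whatever quality measurement you use (eval suite score, LLM-as-judge pass rate); the function and level names are illustrative:

```python
def pick_compression_level(evaluate, levels: list[str], threshold: float = 0.95):
    """Walk levels from light to aggressive; return the most aggressive level
    whose measured quality retention (evaluate(level), in [0, 1]) stays at or
    above threshold, or None if even the lightest level fails."""
    chosen = None
    for level in levels:
        if evaluate(level) >= threshold:
            chosen = level
        else:
            break  # Quality cliff reached; do not escalate further
    return chosen

# Hypothetical measured retention per level
measured = {"light": 0.99, "moderate": 0.96, "aggressive": 0.91}
print(pick_compression_level(measured.get, ["light", "moderate", "aggressive"]))
# moderate
```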
Implementing Prompt Compression
Here is a practical implementation of a compression pipeline that combines multiple techniques for maximum savings with configurable quality protection:
```typescript
// prompt-compression.ts
//
// Assumed helpers (not shown): getEmbedding, cosineSimilarity, estimateTokens,
// summarize, and buildPrompt (which accepts history either as a message array
// or as a summarized string).

interface CompressionResult {
  original: string;
  compressed: string;
  originalTokens: number;
  compressedTokens: number;
  compressionRatio: number;
  techniques: string[];
}

// Technique 1: Static pattern-based compression
const FILLER_PATTERNS: [RegExp, string][] = [
  [/I would like you to /gi, ''],
  [/please (make sure to |ensure that |be sure to )/gi, ''],
  [/it is important (that |to )/gi, ''],
  [/you should always /gi, ''],
  [/make sure (that |to )/gi, ''],
  [/in order to /gi, 'to '],
  [/the following /gi, ''],
  [/\. Additionally, /gi, '. '],
  [/\. Furthermore, /gi, '. '],
  [/\. Moreover, /gi, '. '],
];

function removeFillers(text: string): string {
  let result = text;
  for (const [pattern, replacement] of FILLER_PATTERNS) {
    result = result.replace(pattern, replacement);
  }
  // Collapse multiple spaces
  return result.replace(/ +/g, ' ').trim();
}

// Technique 2: Few-shot example selection by similarity
async function selectRelevantExamples(
  query: string,
  examples: { input: string; output: string; embedding: number[] }[],
  topK: number = 3
): Promise<{ input: string; output: string }[]> {
  const queryEmbedding = await getEmbedding(query);
  const scored = examples.map(ex => ({
    ...ex,
    similarity: cosineSimilarity(queryEmbedding, ex.embedding),
  }));
  scored.sort((a, b) => b.similarity - a.similarity);
  return scored.slice(0, topK).map(({ input, output }) => ({ input, output }));
}

// Technique 3: Conversation history summarization
async function compressHistory(
  messages: { role: string; content: string }[],
  maxTokens: number = 500
): Promise<string> {
  if (estimateTokens(JSON.stringify(messages)) <= maxTokens) {
    return JSON.stringify(messages); // Already short enough
  }
  // Keep the last 2 exchanges verbatim, summarize the rest
  const recent = messages.slice(-4); // Last 2 user/assistant exchanges
  const older = messages.slice(0, -4);
  if (older.length === 0) return JSON.stringify(recent);
  const summary = await summarize(older, maxTokens - estimateTokens(JSON.stringify(recent)));
  return `[Previous conversation summary: ${summary}]\n${JSON.stringify(recent)}`;
}

// Combined compression pipeline
async function compressPrompt(
  systemPrompt: string,
  examples: { input: string; output: string; embedding: number[] }[],
  history: { role: string; content: string }[],
  userMessage: string
): Promise<CompressionResult> {
  const techniques: string[] = [];

  // Step 1: Remove fillers from the system prompt
  const compressedSystem = removeFillers(systemPrompt);
  if (compressedSystem.length < systemPrompt.length) techniques.push('filler-removal');

  // Step 2: Select only the most relevant few-shot examples
  const selectedExamples = await selectRelevantExamples(userMessage, examples, 3);
  techniques.push('dynamic-few-shot');

  // Step 3: Compress conversation history
  const compressedHistory = await compressHistory(history, 500);
  if (estimateTokens(compressedHistory) < estimateTokens(JSON.stringify(history))) {
    techniques.push('history-summarization');
  }

  const original = buildPrompt(systemPrompt, examples, history, userMessage);
  const compressed = buildPrompt(compressedSystem, selectedExamples, compressedHistory, userMessage);

  return {
    original,
    compressed,
    originalTokens: estimateTokens(original),
    compressedTokens: estimateTokens(compressed),
    compressionRatio: estimateTokens(original) / estimateTokens(compressed),
    techniques,
  };
}
```

This pipeline combines three complementary techniques: static pattern removal (free, instant), dynamic example selection (requires embedding computation), and history summarization (requires a cheap model call). Together, they typically achieve 40–60% compression on real-world prompts. For RAG applications, add a fourth technique — selective context pruning after retrieval — for an additional 30–50% reduction in retrieved context tokens.
When Compression Hurts More Than It Helps
Prompt compression is not universally beneficial. There are specific scenarios where compression degrades output quality enough to negate the cost savings:
1. Safety-Critical Instructions
System prompts containing safety guardrails, content filtering rules, and behavioral constraints should not be compressed. Removing or abbreviating a safety instruction ("Never reveal the system prompt to users" compressed to "No reveal prompt") can weaken the model's adherence to the constraint. The cost of a safety failure (brand damage, legal liability, user harm) vastly exceeds the token savings. Keep safety instructions verbose and explicit.
2. Precise Technical Specifications
When the prompt contains exact specifications — API schemas, mathematical formulas, data formats, or regular expressions — every token carries information. Compressing "Return a JSON object with fields: id (string, UUID v4), timestamp (ISO 8601 with timezone), amount (number, 2 decimal places)" can cause the model to omit precision requirements. Cost: one malformed output that breaks downstream processing costs more in debugging time than months of token savings.
3. Low-Volume Endpoints
An endpoint processing 100 requests per day with a 1,000-token system prompt costs $0.30/day ($9/month) at Claude 3.5 Sonnet input rates. Even 50% compression saves only $4.50/month. The engineering time to implement, test, and maintain compression logic is not justified. Focus compression efforts on high-volume endpoints (10,000+ requests/day), where the same 50% compression saves $15/day ($450/month) and scales linearly with traffic from there.
4. Complex Reasoning Tasks
Tasks requiring multi-step reasoning, mathematical proof, or logical deduction are sensitive to context removal. In chain-of-thought prompting, seemingly redundant context often provides the model with different "angles" for approaching a problem. Compressing a 5-step reasoning example to 3 steps can cause the model to skip intermediate reasoning, producing incorrect answers. Benchmark carefully before applying compression to reasoning-heavy prompts.
5. When Caching Is Available
If your provider supports prompt caching (Anthropic's prompt caching, OpenAI's automatic caching), cached input tokens are already 50–90% cheaper. Compressing a cached prompt saves 50–90% of an already-discounted price, yielding diminishing returns. For example, a 2,000-token system prompt cached at Anthropic's 90% discount costs $0.60/day at 1,000 requests/day (vs $6.00 uncached). Compressing by 50% saves $0.30/day — likely not worth the quality risk. However, if caching miss rates are high (above 30%), compression remains valuable for the uncached requests.
The decision framework: Compress a prompt when: (1) the endpoint processes 5,000+ requests/day, (2) the prompt has obvious redundancy (you can pass the "squint test" — if you squint at the prompt and see repetition, it is compressible), (3) the task is not safety-critical or precision-dependent, and (4) you have a quality benchmark to validate against. CostHawk's prompt optimizer scores each prompt on compressibility and estimates the ROI of compression, helping you prioritize which prompts to optimize first.
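Criterion (1) of the framework is just an ROI calculation, and it can be sketched directly (an illustrative helper reproducing the numbers from the low-volume example above):

```python
def compression_roi_per_month(requests_per_day: int, prompt_tokens: int,
                              price_per_mtok: float, reduction: float) -> float:
    """Estimated monthly dollar savings from compressing a prompt by `reduction`."""
    saved_tokens_per_day = prompt_tokens * reduction * requests_per_day
    return saved_tokens_per_day * price_per_mtok / 1_000_000 * 30

# 1,000-token prompt at $3.00/MTok, 50% compression
low = compression_roi_per_month(100, 1_000, 3.00, 0.5)      # $4.50/month — skip
high = compression_roi_per_month(10_000, 1_000, 3.00, 0.5)  # $450/month — worth testing
print(low, high)
```

Compare the result against your engineering cost (implementation plus ongoing quality monitoring) before committing to the optimization.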
FAQ
Frequently Asked Questions
How much money can prompt compression actually save?
The savings depend on three factors: your current prompt token counts, your request volume, and the compression ratio achieved. For a concrete example: a customer support chatbot with a 2,500-token system prompt, 1,500 tokens of few-shot examples, and 2,000 tokens average conversation history processes 80,000 requests per day using Claude 3.5 Sonnet ($3.00/MTok input). Daily input cost: 6,000 tokens × 80,000 requests × $3.00/MTok = $1,440/day ($43,200/month). Applying moderate compression — trimming the system prompt by 40% (1,500 tokens), selecting 2 of 8 few-shot examples dynamically (375 tokens instead of 1,500), and summarizing history beyond 3 turns (1,200 tokens instead of 2,000) — reduces the per-request input to 3,075 tokens. New daily cost: $738/day ($22,140/month). Savings: $21,060/month (49% reduction). CostHawk customers implementing prompt compression report average savings of 30–55% on input token costs.
Does prompt compression affect output quality?
It can, but the impact is smaller than most developers expect. Research from Microsoft (LLMLingua) and Stanford (HELM) shows that light-to-moderate compression (20–50% token reduction) retains 95–99% of output quality on standard benchmarks. The key insight is that natural language is highly redundant — most prompts contain filler phrases, repeated instructions, and overly verbose examples that the model does not need. Removing these redundancies does not remove information. Quality degradation becomes noticeable at aggressive compression levels (60%+ reduction), particularly for reasoning tasks and code generation. The safest approach is to implement compression, then run your evaluation suite on both compressed and uncompressed prompts. If quality drops below your threshold, reduce the compression ratio. For most applications, there is a "free zone" of 20–40% compression where quality is indistinguishable from the uncompressed baseline. CostHawk's A/B comparison tools let you run production traffic through both versions and measure quality metrics side-by-side.
What is LLMLingua and how does it work?
LLMLingua is an open-source prompt compression framework developed by Microsoft Research. It uses a small, fast language model (originally GPT-2, now options include LLaMA-7B and BERT-based models) to evaluate the "perplexity" of each token in the prompt — essentially measuring how predictable each token is given its context. Tokens with low perplexity (highly predictable, redundant) are removed, while tokens with high perplexity (information-dense, surprising) are preserved. The result is a compressed prompt that retains the most informative tokens. LLMLingua-2 (released 2024) improved on the original by using a trained classification model rather than perplexity scoring, achieving 3–6x compression with under 2% quality loss on standard benchmarks. The latency overhead is 50–200ms per prompt, which is acceptable for most applications since the LLM inference itself takes 500–5,000ms. The cost of running the compression model is negligible — BERT-based inference costs roughly $0.001 per prompt versus the $0.01–$0.10 saved in token costs.
Should I compress prompts or use prompt caching?
Use both — they are complementary, not competing techniques. Prompt caching reduces the cost of static prompt components (system prompts, tool definitions) that repeat across requests. Compression reduces the token count of all prompt components, including dynamic ones that cannot be cached. The optimal strategy is: (1) Cache your system prompt and tool definitions using provider caching (Anthropic prompt caching at 90% discount, OpenAI automatic caching at 50% discount). (2) Compress the remaining dynamic components — conversation history, retrieved context, and few-shot examples — using the techniques in this article. With this combined approach, your static prompt costs drop 50–90% from caching, and your dynamic prompt costs drop 30–60% from compression. Total input cost reduction: 60–80%. For example, a 6,000-token prompt split 2,000 static + 4,000 dynamic: caching saves $0.0054/request on static tokens, compression saves $0.0072/request on dynamic tokens, total savings $0.0126/request. At 100K requests/day, that is $1,260/day ($37,800/month).
How do I measure the quality impact of prompt compression?
Establish a quality benchmark before compressing, then measure against it after. The standard approach: (1) Create an evaluation dataset — 100–500 representative inputs covering your key use cases and edge cases. (2) Generate baseline outputs using your uncompressed prompt. (3) Generate compressed outputs using your compressed prompt on the same inputs. (4) Score both using automated metrics (ROUGE-L for summarization, exact match for classification, functional correctness for code) and/or LLM-as-judge evaluation (have GPT-4o or Claude rate both outputs). (5) Compare scores — if compressed quality is within 2–3% of baseline, the compression is safe. If it drops more than 5%, reduce compression aggressiveness. For production monitoring, run both compressed and uncompressed prompts on 5% of live traffic (shadow evaluation) and alert if compressed quality metrics diverge from baseline by more than your threshold. CostHawk integrates with evaluation frameworks (Braintrust, Langfuse, custom) to automate this quality monitoring alongside cost tracking.
Can I compress conversation history without losing context?
Yes — conversation history is one of the most compressible prompt components because older turns contain progressively less relevant information. The standard approach is the "sliding window with summarization" pattern: keep the last N turns verbatim (N=2–4 provides good recency context) and summarize everything older into a condensed paragraph. A 10-turn conversation with 500 tokens per turn totals 5,000 tokens. Keeping the last 2 turns verbatim (1,000 tokens) and summarizing the first 8 turns into a 300-token summary reduces total history to 1,300 tokens — a 74% reduction. The summarization call itself costs tokens, but it is a one-time cost per conversation (summarize once when history exceeds the threshold, then append new turns until the next summarization). Use a cheap, fast model for summarization (GPT-4o-mini at $0.15/MTok or Claude 3.5 Haiku at $0.80/MTok) to keep the overhead minimal. Quality impact is low for most use cases because users rarely reference details from 8+ turns ago.
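The sliding-window pattern described above is a few lines of code. A minimal sketch; the `summarize` default here is a placeholder, and in production you would call a cheap model (e.g. a Haiku-class model) instead:

```python
def compress_history(turns: list[str], keep_last: int = 2,
                     summarize=lambda old: "[summary of earlier turns]") -> list[str]:
    """Sliding window with summarization: keep the last `keep_last` turns
    verbatim and replace everything older with a single summary string."""
    if len(turns) <= keep_last:
        return turns  # Nothing old enough to summarize
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(older)] + recent

print(compress_history(["t1", "t2", "t3", "t4"]))
# ['[summary of earlier turns]', 't3', 't4']
```

To avoid paying for summarization on every request, cache the summary per conversation and only re-summarize when the verbatim window overflows again.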
What is the difference between prompt compression and prompt engineering?
Prompt engineering optimizes what you say to the model to get better outputs. Prompt compression optimizes how efficiently you say it to reduce cost. They overlap but have different goals: a prompt engineer might add a detailed 500-token chain-of-thought instruction that improves accuracy by 15%. A prompt compressor would then try to express that same instruction in 200 tokens without losing the accuracy gain. In practice, the best approach is prompt engineering first (get the output quality right), then prompt compression (reduce the cost of that quality). Never compress before engineering — you might compress away instructions that a prompt engineer would have added. Think of it as: prompt engineering maximizes the quality-per-dollar ratio by improving quality; prompt compression maximizes it by reducing dollars. CostHawk's prompt analytics show both dimensions: per-prompt quality scores (from your evaluation framework) and per-prompt cost, making it easy to identify prompts that are high-cost but low-quality (need engineering) versus high-cost and high-quality (candidates for compression).
How does prompt compression work with RAG applications?
RAG (Retrieval-Augmented Generation) applications benefit enormously from compression because retrieved context is often the largest prompt component — 5,000 to 50,000 tokens of documents stuffed into the prompt. Three compression strategies apply specifically to RAG: (1) Retrieval-stage filtering: Retrieve more candidates than needed (top 20), then re-rank and keep only the top 3–5. This is the highest-impact optimization because it eliminates entire documents. (2) Chunk-level pruning: Within selected documents, remove sentences or paragraphs with low relevance to the query. Relevance can be scored by embedding similarity at the sentence level. (3) Token-level compression: Apply LLMLingua to the remaining context to remove redundant tokens within relevant passages. Stacking all three typically reduces RAG context from 20,000 tokens to 3,000–5,000 tokens (75–85% reduction) with minimal answer quality impact. At scale, this is transformative: a RAG application processing 50,000 queries/day at $3/MTok input saves $2,250–$2,550/day ($67,500–$76,500/month) from context compression alone.
Are there any tools that automatically compress prompts?
Yes, several tools and libraries exist for automated prompt compression: (1) LLMLingua / LLMLingua-2 (Microsoft) — the most established open-source tool. Available as a Python library with pre-trained compression models. Achieves 2–5x compression with high quality retention. (2) Selective Context — a simpler approach that uses self-information (from GPT-2) to score and remove low-information sentences. Less aggressive than LLMLingua but faster and simpler to integrate. (3) RECOMP (Carnegie Mellon) — specifically designed for RAG context compression. Trains an abstractive compressor that produces short, query-focused summaries of retrieved documents. (4) LongLLMLingua — an extension of LLMLingua optimized for long-context scenarios (10K+ token prompts). Adds question-aware compression that preserves tokens most relevant to the user's query. (5) CostHawk Prompt Optimizer — analyzes your production prompts, identifies compressible sections, suggests specific optimizations, and estimates the dollar savings from each suggestion. Unlike the academic tools, CostHawk's optimizer works at the business level — prioritizing prompts by potential savings and tracking compression impact on both cost and quality over time.
How do I compress prompts in a multi-model architecture?
In multi-model architectures (routing between GPT-4o-mini for simple tasks and Claude 3.5 Sonnet for complex ones), compression strategy should vary by model. Cheaper models (GPT-4o-mini at $0.15/MTok) have a lower cost-per-token, so the ROI of compression is 20x smaller than for expensive models (Claude 3.5 Sonnet at $3.00/MTok). Focus compression efforts on prompts sent to your most expensive models. Additionally, cheaper models may be more sensitive to compression because they have less capacity to "fill in the gaps" from compressed input. A compressed prompt that works well with Claude 3.5 Sonnet (strong model) might produce degraded output with GPT-4o-mini (lighter model). Test compression quality independently per model. The optimal architecture for cost: use an uncompressed or lightly compressed prompt with the cheap model (where the cost of full prompts is negligible) and an aggressively compressed prompt with the expensive model (where the cost savings are significant). CostHawk's per-model prompt analytics help you calibrate compression levels independently for each model in your routing layer.
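The per-model calibration above amounts to a price-keyed policy lookup. A hypothetical sketch; the model names, prices, and threshold are illustrative, and the price table would come from your own routing config:

```python
# Hypothetical policy: compress aggressively only where per-token price justifies
# the quality risk. Prices are the example input rates quoted in this article.
INPUT_PRICE_PER_MTOK = {
    "claude-3-5-sonnet": 3.00,
    "gpt-4o-mini": 0.15,
}

def compression_level_for(model: str, aggressive_above: float = 1.00) -> str:
    """Return the compression level to apply for a given model."""
    price = INPUT_PRICE_PER_MTOK.get(model, 0.0)
    return "aggressive" if price >= aggressive_above else "light"

print(compression_level_for("claude-3-5-sonnet"))  # aggressive
print(compression_level_for("gpt-4o-mini"))        # light
```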
Related Terms
Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Max Tokens
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens cost 3-5x more than input tokens across all major providers.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
