Fine-Tuning
The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.
Why It Matters for AI Costs
Fine-tuning sits at the intersection of quality improvement and cost reduction, making it one of the most strategically valuable techniques in the AI cost optimization toolkit. The core economic proposition is straightforward: invest a fixed amount in training to unlock permanently lower per-query costs.
The math is compelling. Consider a customer support classification system processing 500,000 queries per day. Each query averages 200 input tokens and 50 output tokens.
| Approach | Model | Input Rate | Output Rate | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| Base model (large) | GPT-4o | $2.50/1M | $10.00/1M | $500 | $15,000 |
| Base model (small) | GPT-4o-mini | $0.15/1M | $0.60/1M | $30 | $900 |
| Fine-tuned (small) | GPT-4o-mini (ft) | $0.30/1M | $1.20/1M | $60 | $1,800 |
If the base GPT-4o-mini cannot match GPT-4o's classification accuracy for your specific task, you are stuck paying $15,000/month. But if a fine-tuned GPT-4o-mini achieves comparable accuracy, you drop to $1,800/month — an 88% cost reduction, saving $13,200/month or $158,400/year.
The fine-tuning cost is a one-time investment. Training GPT-4o-mini on 10,000 examples averaging 300 tokens each (3M training tokens) costs $9.00. With 3 training epochs, that is $27.00 total. You recover that investment in under two hours of production traffic at the reduced inference rate.
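As a sanity check, the comparison above reduces to a few lines of arithmetic — a sketch using the rates and volumes quoted in the table, not live pricing:

```python
def daily_cost(queries, in_tok, out_tok, in_rate, out_rate):
    """Daily inference cost in dollars; rates are $ per 1M tokens."""
    return queries * (in_tok * in_rate + out_tok * out_rate) / 1_000_000

Q, IN_TOK, OUT_TOK = 500_000, 200, 50   # the article's example workload

gpt4o   = daily_cost(Q, IN_TOK, OUT_TOK, 2.50, 10.00)  # ~$500/day
ft_mini = daily_cost(Q, IN_TOK, OUT_TOK, 0.30, 1.20)   # ~$60/day

# One-time training cost: 3M dataset tokens x 3 epochs x $3/1M
training = 27.00
hours_to_payback = training / ((gpt4o - ft_mini) / 24)  # under two hours
```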
The risk is that fine-tuning does not always work. If your task requires broad world knowledge, complex reasoning, or handling diverse edge cases, a fine-tuned small model may not match the large model's performance. This is why rigorous evaluation before and after fine-tuning is essential — you need to confirm the quality meets your bar before routing production traffic to the cheaper model. CostHawk's per-model cost tracking lets you run A/B tests between base and fine-tuned models, measuring both quality metrics and cost savings to make data-driven decisions about when fine-tuning pays off.
What is Fine-Tuning?
Fine-tuning is the process of taking a pre-trained foundation model and continuing its training on a smaller, task-specific dataset. The model starts with the broad language understanding it learned during pre-training (grammar, facts, reasoning patterns) and then specializes on your examples to learn your specific requirements: your output format, your terminology, your classification categories, your tone, your edge cases.
How fine-tuning works technically:
- Prepare training data. You create a dataset of input-output pairs that demonstrate the desired behavior. For a classification task, these might be customer messages paired with correct categories. For a generation task, these might be prompts paired with ideal responses. OpenAI requires a minimum of 10 examples but recommends 50–100 for noticeable improvement and 500+ for production-quality results.
- Upload and validate. The training data is uploaded to the provider, validated for format correctness, and queued for training.
- Training runs. The model processes your training data for a specified number of epochs (passes through the full dataset). Each epoch costs tokens_in_dataset × price_per_training_token. The default is typically 3–4 epochs; OpenAI auto-selects based on dataset size.
- Model deployment. The fine-tuned model is deployed and accessible via the same API under a custom model ID (e.g., ft:gpt-4o-mini-2024-07-18:my-org:custom-suffix:abc123). Inference calls are identical to the base model — same endpoints, same parameters — but use the custom model ID.
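Step 1 in practice means writing a JSONL file of chat-formatted examples; the sketch below follows the chat schema OpenAI documents for fine-tuning — the support categories and messages are invented for illustration:

```python
import json

# Hypothetical classification examples: customer message -> category.
examples = [
    ("Where is my order? It's been two weeks.", "shipping"),
    ("I was charged twice for the same item.", "billing"),
    ("How do I reset my password?", "account"),
]

# Each JSONL line is one training example in chat format:
# a system instruction, the user input, and the ideal assistant output.
with open("train.jsonl", "w") as f:
    for message, category in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Classify the support message."},
                {"role": "user", "content": message},
                {"role": "assistant", "content": category},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The resulting file is what gets uploaded in step 2; the provider validates each line against this schema before queuing the job.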
What fine-tuning changes:
- Output format consistency. Fine-tuned models learn to reliably produce output in your exact format — specific JSON schemas, classification labels, structured templates — without elaborate prompt instructions.
- Domain vocabulary. The model learns your specific terminology, product names, abbreviations, and jargon.
- Task-specific accuracy. Classification accuracy, extraction precision, and generation quality improve on your specific task distribution.
- Reduced prompt length. Because the behavior is learned during training, you need fewer instructions and examples in the prompt, reducing input tokens per request.
What fine-tuning does not change: the model's fundamental reasoning capabilities, its knowledge cutoff date, or its context window size. Fine-tuning teaches behavior patterns, not new facts.
Fine-Tuning Costs by Provider
Fine-tuning costs include three components: training cost (one-time), inference cost (ongoing, per-request), and hosting cost (some providers charge for model storage). Here is the current pricing landscape as of March 2026:
| Provider | Model | Training Cost (per 1M tokens) | Inference Input (per 1M tokens) | Inference Output (per 1M tokens) | Hosting |
|---|---|---|---|---|---|
| OpenAI | GPT-4o-mini | $3.00 | $0.30 | $1.20 | Free |
| OpenAI | GPT-4o | $25.00 | $3.75 | $15.00 | Free |
| OpenAI | GPT-3.5 Turbo | $8.00 | $3.00 | $6.00 | Free |
| Google | Gemini 1.5 Flash | Free (limited) | Standard rate | Standard rate | Free (limited) |
| Mistral | Mistral Small | $2.00 | $0.20 | $0.60 | Free |
| Mistral | Mistral Large | $8.00 | $4.00 | $12.00 | Free |
| Anthropic | Claude models | Not publicly available | N/A | N/A | N/A |
Training cost calculation example:
Dataset: 5,000 examples × 500 tokens average = 2,500,000 tokens (2.5M)
Epochs: 3 (default)
Total training tokens: 2.5M × 3 = 7.5M tokens
GPT-4o-mini: 7.5 × $3.00 = $22.50
GPT-4o: 7.5 × $25.00 = $187.50
Mistral Small: 7.5 × $2.00 = $15.00

Inference cost premium: Fine-tuned models cost more per inference token than their base counterparts. GPT-4o-mini base costs $0.15/$0.60 per MTok, while fine-tuned GPT-4o-mini costs $0.30/$1.20 — exactly 2x. This premium exists because the provider must load your custom model weights, which consumes dedicated GPU memory. Even with this premium, fine-tuned GPT-4o-mini ($0.30/$1.20) is dramatically cheaper than base GPT-4o ($2.50/$10.00) — an 8x savings on input and output tokens.
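The training-cost arithmetic above generalizes to a one-line helper (rates are the illustrative figures from the pricing table):

```python
def training_cost(examples, avg_tokens, epochs, rate_per_mtok):
    """One-time fine-tuning cost: dataset tokens x epochs x $/1M tokens."""
    return examples * avg_tokens * epochs * rate_per_mtok / 1_000_000

# 5,000 examples x 500 tokens x 3 epochs:
gpt4o_mini = training_cost(5_000, 500, 3, 3.00)   # 22.5
gpt4o      = training_cost(5_000, 500, 3, 25.00)  # 187.5
mistral    = training_cost(5_000, 500, 3, 2.00)   # 15.0
```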
Hidden costs to budget for:
- Iteration: Fine-tuning rarely succeeds on the first attempt. Budget for 3–5 training runs while tuning hyperparameters and refining your dataset.
- Data preparation: Curating high-quality training examples is labor-intensive. The human time spent creating, reviewing, and cleaning training data often exceeds the API training cost by 10–100x.
- Evaluation: Running the fine-tuned model against a held-out test set to measure quality costs inference tokens at the fine-tuned model's rate.
Fine-Tuning vs RAG vs Prompt Engineering
Fine-tuning is one of three primary techniques for improving LLM output quality, and each has distinct cost profiles, timelines, and best-fit use cases. Choosing the right technique — or combination — requires understanding these tradeoffs.
| Dimension | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Upfront cost | Near zero (developer time only) | Low–medium ($50–$500 for embedding + vector DB setup) | Medium ($20–$500 per training run) |
| Ongoing cost per query | High (long prompts with instructions and examples) | Medium (embedding query + retrieved context tokens) | Low (short prompts, no examples needed) |
| Time to deploy | Hours | Days to weeks | Days to weeks |
| Handles new knowledge | Only via prompt content | Yes (add documents to index) | No (requires retraining) |
| Output format control | Moderate (depends on prompt clarity) | Moderate (depends on prompt + retrieved context) | Strong (learned from examples) |
| Best for | Rapid prototyping, simple tasks, one-off queries | Knowledge-grounded answers, dynamic content, factual accuracy | Consistent formatting, classification, domain-specific behavior |
| Quality ceiling | Limited by prompt length and model capability | High for factual tasks with good retrieval | Highest for narrow, well-defined tasks |
When to choose prompt engineering: Start here. Always. Prompt engineering has zero upfront cost and can be iterated in minutes. For many tasks — especially those with clear instructions and few edge cases — a well-crafted prompt on a capable model is sufficient. Only escalate to RAG or fine-tuning when prompt engineering hits its limits.
When to choose RAG: Use RAG when the model needs access to specific, frequently changing knowledge that cannot fit in a prompt. Customer support over a product knowledge base, legal research across case law, technical documentation search — these are RAG sweet spots. RAG adds per-query cost (embedding + retrieved context tokens) but avoids the upfront training investment and handles dynamic content naturally.
When to choose fine-tuning: Use fine-tuning when you need consistent, specific behavior that is hard to elicit through prompting alone. Classification into custom categories, extraction of domain-specific entities, adherence to a precise output schema, or matching a specific writing style — these are fine-tuning sweet spots. The upfront training cost is amortized across all future queries, and the per-query savings from shorter prompts and smaller models compound over time.
Combining techniques: The most cost-effective production systems often combine all three. Fine-tune a small model for your core task, use RAG to inject relevant context, and use prompt engineering to handle edge cases and formatting. This layered approach minimizes per-query costs while maximizing quality across diverse inputs.
When Fine-Tuning Saves Money
Fine-tuning saves money in two distinct ways: by enabling model downgrading (replacing an expensive model with a cheaper fine-tuned one) and by enabling prompt compression (reducing per-request token counts because behavior is learned, not instructed). The breakeven analysis determines whether the upfront training investment pays off for your specific workload.
Breakeven formula:
breakeven_queries = training_cost / (cost_per_query_before - cost_per_query_after)
// Example: Replacing GPT-4o with fine-tuned GPT-4o-mini
// Before: 200 input + 100 output tokens on GPT-4o
// Cost: (200/1M × $2.50) + (100/1M × $10.00) = $0.001500
// After: 100 input + 80 output tokens on ft:GPT-4o-mini (shorter prompt needed)
// Cost: (100/1M × $0.30) + (80/1M × $1.20) = $0.000126
// Savings per query: $0.001374
// Training cost: $22.50 (5K examples × 500 tokens × 3 epochs × $3/1M)
// Breakeven: 22.50 / 0.001374 = 16,375 queries

At 500,000 queries per day, you break even in 47 minutes. The annual savings are $250,000+.
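The worked example above can be checked in a few lines (all figures are the article's illustrative rates and token counts):

```python
def cost_per_query(in_tok, out_tok, in_rate, out_rate):
    """Per-query cost in dollars; rates are $ per 1M tokens."""
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

before = cost_per_query(200, 100, 2.50, 10.00)  # GPT-4o, long prompt
after  = cost_per_query(100, 80, 0.30, 1.20)    # ft:GPT-4o-mini, short prompt

# Queries needed to recoup the $22.50 training investment
breakeven_queries = 22.50 / (before - after)    # ~16,375
```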
Scenarios where fine-tuning saves the most:
- High-volume, narrow tasks. Classification, entity extraction, sentiment analysis, and routing — tasks where the input-output pattern is consistent and repeatable. These tasks often do not need a frontier model's full reasoning capability, and a fine-tuned small model handles them perfectly.
- Long system prompts. If your current prompt includes 2,000+ tokens of instructions and few-shot examples, fine-tuning can eliminate most of that overhead. The behavior is baked into the model weights, so a 50-token system prompt replaces a 2,000-token one — saving 1,950 input tokens per request.
- Structured output generation. If you currently rely on extensive prompt instructions to get the model to output a specific JSON schema, fine-tuning on examples of the desired output format produces near-perfect schema adherence with minimal prompting.
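The prompt-compression saving in the second scenario is worth quantifying; this sketch uses the article's example token counts and its $0.30/1M fine-tuned input rate:

```python
saved_tokens = 2_000 - 50   # instructions baked into the model weights
rate = 0.30                 # $ per 1M input tokens, fine-tuned GPT-4o-mini

per_request = saved_tokens * rate / 1_000_000
print(f"${per_request:.6f} saved per request")        # $0.000585
print(f"${per_request * 500_000:.2f} saved per day")  # $292.50
```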
Scenarios where fine-tuning does NOT save money:
- Low volume. If you process fewer than 1,000 queries per day, the training cost may not be recouped for months, and the base model may be fine.
- Diverse, open-ended tasks. Chatbots handling arbitrary user questions benefit less from fine-tuning because the task distribution is too broad. RAG or a larger base model is usually more effective.
- Rapidly changing requirements. If your output format, classification categories, or domain knowledge changes monthly, the cost of repeated fine-tuning runs erodes savings.
CostHawk's model comparison analytics let you run side-by-side cost analysis of your current model versus a fine-tuned alternative, projecting annual savings based on your actual query volume and token distribution.
Fine-Tuning Best Practices
Successful fine-tuning requires disciplined data preparation, systematic evaluation, and iterative refinement. Here are the practices that separate cost-effective fine-tuning from wasted training spend.
1. Start with high-quality data, not more data. Quality trumps quantity for fine-tuning. 500 carefully curated, diverse, and correctly labeled examples outperform 5,000 noisy, repetitive, or inconsistent ones. Each training example should represent a real-world input your model will encounter in production, paired with the exact output you expect. Common data quality issues that waste training budget: duplicate examples, inconsistent formatting across examples, mislabeled classifications, and examples that contradict each other.
2. Use a held-out evaluation set. Reserve 15–20% of your dataset for evaluation. Never train on your eval set. After each fine-tuning run, measure performance on the eval set to determine whether the model meets your quality bar. Without rigorous evaluation, you cannot distinguish a successful fine-tune from an overfit or underfit model, and you risk deploying a model that costs less but produces worse results — a net negative.
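Practice 2 can be enforced mechanically. A minimal sketch — the dataset, labels, and `model_fn` here are placeholders, not a real model:

```python
import random

def split_dataset(examples, eval_fraction=0.2, seed=42):
    """Shuffle and split into (train, eval); never train on the eval set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

def accuracy(model_fn, eval_set):
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(1 for x, y in eval_set if model_fn(x) == y)
    return correct / len(eval_set)

# Placeholder data: 100 labeled support messages.
data = [(f"msg {i}", "billing" if i % 2 else "shipping") for i in range(100)]
train, eval_set = split_dataset(data)   # 80 train / 20 eval
```

Running `accuracy` on the same eval set before and after each fine-tuning run is what lets you compare base and fine-tuned models on equal footing.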
3. Fine-tune the smallest model first. Start with GPT-4o-mini ($3.00/1M training tokens) or Mistral Small ($2.00/1M), not GPT-4o ($25.00/1M). For many tasks, a fine-tuned small model achieves the same quality as a fine-tuned large model at a fraction of the training and inference cost. Only move to a larger base model if the small model's eval metrics are unacceptable after data quality improvements.
4. Optimize hyperparameters incrementally. OpenAI auto-selects epochs and learning rate, but manual tuning can improve results. Start with the default (auto) settings for your first run. If quality is insufficient, try: increasing epochs from 3 to 4–5 (if the model is underfitting), or reducing the learning rate multiplier (if the model is overfitting to training examples and losing generalization). Each hyperparameter experiment costs one training run, so budget accordingly — typically $15–$50 per experiment for GPT-4o-mini.
5. Monitor for overfitting. A fine-tuned model that memorizes training examples instead of learning general patterns will fail on novel inputs. Signs of overfitting: high accuracy on training data but low accuracy on eval data, and declining performance as you add more diverse test cases. Mitigation: use more diverse training examples, reduce epochs, or increase the dataset size.
6. Version your models and data. Track which dataset version produced which fine-tuned model, and log the training cost, eval metrics, and deployment date for each version. This audit trail is essential for understanding which investments paid off and which did not. CostHawk's model-level cost tracking provides the inference cost side of this ledger, showing exactly how much each fine-tuned model version costs in production.
Common Fine-Tuning Cost Mistakes
Fine-tuning can be a powerful cost optimization tool, but common mistakes can turn it into a money pit. Here are the six most expensive errors and how to avoid them.
1. Fine-tuning before trying prompt engineering. The single most common and expensive fine-tuning mistake is jumping to fine-tuning when prompt engineering would suffice. A well-crafted system prompt with 3–5 few-shot examples often achieves 90%+ of fine-tuning quality at zero training cost. Always exhaust prompt engineering options first. Try structured prompts, chain-of-thought instructions, and output format specifications before investing in fine-tuning. If prompt engineering gets you within 5% of your target quality, the cost of fine-tuning may not be justified unless your volume is very high.
2. Training on low-quality data. Garbage in, garbage out applies with extra force to fine-tuning. If your training examples contain inconsistencies (different output formats for similar inputs), errors (wrong classifications), or biases (overrepresentation of certain categories), the model learns these flaws. You pay for the training run, deploy a flawed model, discover quality issues in production, fix the data, and pay for another training run. At $25/1M training tokens for GPT-4o, each wasted iteration costs real money. Invest in data quality auditing before training.
3. Fine-tuning the wrong base model. Fine-tuning GPT-4o ($25/1M training, $3.75/$15.00 inference) when GPT-4o-mini ($3.00/1M training, $0.30/$1.20 inference) would achieve the same quality is an 8x training cost mistake and a 12x inference cost mistake. Start small. The whole point of fine-tuning is to get small model quality close to large model quality — so start with the small model.
4. Over-training (too many epochs). More epochs do not always mean better results. After 3–4 epochs on a well-curated dataset, additional passes typically cause overfitting without improving generalization. Each unnecessary epoch multiplies your training cost — 8 epochs costs 2.67x more than 3 epochs for zero benefit or worse performance. Use the validation loss curve to detect when additional training is counterproductive.
5. Not budgeting for iteration. Fine-tuning rarely succeeds on the first try. Plan for 3–5 training runs while you refine your dataset, adjust hyperparameters, and evaluate results. If GPT-4o-mini fine-tuning costs $22 per run, budget $110 for the iteration cycle, not $22. Teams that budget for a single run often abandon fine-tuning after one disappointing result, missing the gains that come from systematic iteration.
6. Forgetting about inference cost premiums. Fine-tuned models cost 1.5–2x more per inference token than their base counterparts. If your fine-tuning does not enable model downgrading (replacing a bigger model with a fine-tuned smaller one) or significant prompt compression, the net effect could be higher costs — you paid for training AND pay more per query. Always calculate the end-to-end cost including the inference premium before committing to fine-tuning.
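Mistake 6 is easy to verify numerically: with no model downgrade and no prompt compression, the 2x per-token premium doubles every query's cost. A sketch using the article's GPT-4o-mini rates:

```python
def cost_per_query(in_tok, out_tok, in_rate, out_rate):
    """Per-query cost in dollars; rates are $ per 1M tokens."""
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

# Same model size, same prompt length: the fine-tune only adds cost.
base = cost_per_query(200, 50, 0.15, 0.60)  # base GPT-4o-mini
ft   = cost_per_query(200, 50, 0.30, 1.20)  # fine-tuned, no compression

premium = ft / base   # ~2.0: every query costs twice as much
```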
Frequently Asked Questions
- How many training examples do I need for effective fine-tuning?
- How long does a fine-tuning job take to complete?
- Can I fine-tune on my proprietary data without it being used to train other models?
- What happens to my fine-tuned model if the base model is deprecated?
- Is fine-tuning or RAG better for reducing costs?
- How do I measure the ROI of fine-tuning?
- Can I fine-tune open-source models instead of using provider APIs?
- How does fine-tuning interact with prompt caching?
Related Terms
Retrieval-Augmented Generation (RAG)
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
Prompt Engineering
The practice of designing, structuring, and iterating on the text inputs (prompts) sent to large language models to elicit desired outputs. Prompt engineering directly affects AI API costs through two mechanisms: the token count of the prompt itself (input cost) and the length and quality of the model's response (output cost). A well-engineered prompt can reduce total per-request cost by 40–70% compared to a naive prompt while maintaining or improving output quality.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
