Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Why It Matters for AI Costs
Inference is where your money goes. Every dollar you spend on AI APIs is an inference cost. While headlines focus on the enormous cost of training frontier models (GPT-4 reportedly cost $100M+ to train, and GPT-5 likely exceeded $500M), API consumers do not pay training costs directly. Instead, providers amortize training costs across millions of customers and charge per-token inference fees that represent the marginal compute cost of running the model on your input.
The economics of inference at scale are striking:
| Workload | Queries/Day | Avg Input Tokens | Avg Output Tokens | Model | Daily Inference Cost | Monthly Cost |
|---|---|---|---|---|---|---|
| Customer support chatbot | 50,000 | 1,500 | 400 | GPT-4o | $387.50 | $11,625 |
| Code review assistant | 10,000 | 3,000 | 800 | Claude 3.5 Sonnet | $210.00 | $6,300 |
| Content classification | 500,000 | 200 | 20 | GPT-4o-mini | $21.00 | $630 |
| Document summarization | 5,000 | 8,000 | 1,000 | Claude 3.5 Sonnet | $195.00 | $5,850 |
These costs are purely inference — the compute to process your input and generate your output, running on GPU clusters operated by the provider. The total across these four workloads is $24,405/month, all inference.
The critical insight for cost management is that inference cost is a function of four variables: model choice (which determines per-token price), input token count (how much context you send), output token count (how much the model generates), and request volume (how many times per day you call the API). Optimizing any of these four variables directly reduces your inference bill. CostHawk tracks all four dimensions per request, enabling you to identify exactly which variable is driving costs and target your optimization efforts accordingly.
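The four-variable relationship can be captured in a few lines. This is a minimal sketch; the function name is illustrative, and the rates and workload figures are the GPT-4o list prices and chatbot row from the table above:

```python
def daily_inference_cost(requests_per_day, input_tokens, output_tokens,
                         input_rate_per_mtok, output_rate_per_mtok):
    """Daily spend as a function of the four cost variables.

    Rates are USD per 1M tokens; token counts are per-request averages.
    """
    per_request = (input_tokens * input_rate_per_mtok +
                   output_tokens * output_rate_per_mtok) / 1_000_000
    return requests_per_day * per_request

# Customer support chatbot row: 50K requests/day, 1,500 in / 400 out, GPT-4o
cost = daily_inference_cost(50_000, 1_500, 400, 2.50, 10.00)
# → 387.50 (dollars/day), matching the table
```

Halving any one of the four inputs halves the bill, which is why each variable is worth tracking separately.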
What is Inference?
Inference is the production phase of a machine learning model's lifecycle — the phase where the trained model does useful work by processing new inputs and producing outputs. In the lifecycle of an LLM, the sequence is: pre-training (learning language patterns from trillions of tokens), fine-tuning (specializing on specific tasks), and inference (serving predictions to end users). Pre-training and fine-tuning happen infrequently; inference happens continuously, millions of times per day across a provider's customer base.
The mechanics of LLM inference:
When you send a request to an LLM API, two distinct computational phases execute on the provider's GPU hardware:
Phase 1: Prefill (input processing). All input tokens (system prompt + user message + conversation history + tool definitions) are processed in parallel through the model's transformer layers. This phase builds the key-value (KV) cache — a compressed representation of the input that the model uses during generation. Prefill is computationally efficient because GPUs excel at parallel matrix multiplication. The time is proportional to input length but benefits from GPU parallelism.
Phase 2: Decode (output generation). Output tokens are generated one at a time. For each token, the model runs a forward pass that reads the KV cache (from prefill) and all previously generated tokens, computes attention across the full sequence, and samples the next token. This is inherently sequential — each token depends on the previous one — making it the computational bottleneck. The KV cache grows with each generated token, consuming additional GPU memory.
The prefill/decode distinction explains the input/output pricing split: output tokens cost 4–5x more than input tokens because decode is more compute-intensive per token than prefill. It also explains why time-to-first-token (TTFT) and tokens-per-second (TPS) are different metrics — TTFT measures prefill speed, while TPS measures decode speed.
Understanding these phases helps you reason about cost: a request with 10,000 input tokens and 100 output tokens is dominated by prefill cost, while a request with 100 input tokens and 5,000 output tokens is dominated by decode cost. The optimal cost reduction strategy differs accordingly.
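The prefill/decode split can be quantified directly from pricing. A small sketch (GPT-4o list rates used as an example; the function name is illustrative):

```python
def cost_split(input_tokens, output_tokens, input_rate=2.50, output_rate=10.00):
    """Return (prefill_cost, decode_cost) in USD; rates are USD per 1M tokens."""
    prefill = input_tokens * input_rate / 1e6
    decode = output_tokens * output_rate / 1e6
    return prefill, decode

# 10,000 input / 100 output: prefill-dominated ($0.025 vs $0.001)
summarize = cost_split(10_000, 100)
# 100 input / 5,000 output: decode-dominated ($0.00025 vs $0.05)
generate = cost_split(100, 5_000)
```

For the first request, trimming context is the lever; for the second, capping output length is.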
Inference vs Training Costs
The AI cost landscape has two fundamentally different cost categories: training (creating the model) and inference (using the model). For API consumers, this distinction is crucial because you pay only for inference — but understanding training costs explains why per-token inference prices are what they are.
| Dimension | Training | Inference |
|---|---|---|
| Who pays | Model provider (OpenAI, Anthropic, Google) | API consumer (you) |
| Frequency | Once (per model version) | Millions of times per day |
| Cost scale | $10M–$500M+ per frontier model | $0.001–$0.10 per request |
| Hardware | Thousands of GPUs for weeks/months | Individual GPU-seconds per request |
| Optimization lever | Not applicable (provider's problem) | Model choice, token reduction, caching, batching |
| Budget visibility | N/A | Per-request, per-key, per-model via CostHawk |
Why training costs matter to you (indirectly):
Providers set per-token inference prices to recoup training investment plus earn margin. A model that cost $100M to train and serves 100 trillion inference tokens over its lifetime needs to earn at least $0.000001 per token ($1.00 per million tokens) in gross margin to break even on training alone. This is why frontier models (expensive to train) have higher per-token prices than smaller models (cheaper to train). Understanding this relationship helps you predict pricing trends:
- As models get cheaper to train (through better algorithms, hardware efficiency), inference prices drop. GPT-4o's input price ($2.50/1M) is more than 90% lower than GPT-4's launch price ($30.00/1M for 8K context).
- As models serve more customers, training cost is amortized further, enabling price reductions. This is why model prices tend to decrease over time.
- Competition drives prices down. Anthropic, Google, Mistral, and open-source alternatives create pricing pressure that benefits API consumers.
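The amortization arithmetic behind this reasoning is simple enough to check directly (figures are the hypothetical ones used above):

```python
# Break-even gross margin needed to recoup training cost alone.
training_cost = 100e6        # $100M hypothetical frontier-model training run
lifetime_tokens = 100e12     # 100 trillion inference tokens served over its life

margin_per_token = training_cost / lifetime_tokens  # $0.000001 per token
margin_per_mtok = margin_per_token * 1e6            # $1.00 per 1M tokens
```

Doubling the tokens served over the model's lifetime halves the margin needed per token, which is the mechanism behind price cuts as a model's customer base grows.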
For self-hosted inference (running open-source models on your own hardware), you bear both training costs (if you fine-tune) and inference costs (GPU compute). This changes the cost calculus significantly — see the "Self-Hosted vs API Inference Costs" section below.
Factors That Affect Inference Cost
Four primary factors determine the cost of each inference request, and understanding them gives you concrete levers for optimization.
1. Model size and architecture. Larger models require more GPU compute per forward pass. GPT-4o (estimated at hundreds of billions of parameters across a mixture-of-experts architecture) costs $2.50/$10.00 per MTok. GPT-4o-mini (a much smaller distilled model) costs $0.15/$0.60 — a 16x reduction. The relationship between model size and per-token cost is roughly linear: 2x more parameters means approximately 2x more compute per token, which translates to approximately 2x higher pricing. Mixture-of-experts (MoE) architectures partially break this relationship by activating only a subset of parameters per token, which is why models like GPT-4o and Mixtral can be large in total parameters but moderate in per-token cost.
2. Input token count. More input tokens mean more work during the prefill phase. The cost is linear: 10,000 input tokens costs exactly 10x more than 1,000 input tokens at the same per-token rate. Input tokens include everything you send: system prompts, user messages, conversation history, tool definitions, and any context (RAG results, documents, code). System prompts are a particularly important target because they are repeated in every request — a 2,000-token system prompt across 100,000 daily requests consumes 200M input tokens per day ($500/day at GPT-4o input rates).
3. Output token count. More output tokens mean more work during the decode phase. Because decode is more compute-intensive per token, output tokens cost 4–5x more than input tokens. Output length is partly determined by the task (classification produces short outputs, essay writing produces long outputs) and partly by your control parameters (max_tokens) and prompt instructions ("be concise" vs "explain in detail"). Controlling output length is often the highest-ROI cost optimization because of the output token premium.
4. Request volume. The total number of API calls multiplies the per-request cost into your total spend. A request costing $0.01 is negligible in isolation but costs $300/day at 30,000 requests/day. Volume is driven by product usage (more users = more requests), feature design (real-time suggestions generate more requests than on-demand features), and architectural decisions (streaming vs batch processing). Reducing unnecessary requests through caching, deduplication, and batching directly reduces total inference cost.
Secondary factors:
- Batch vs real-time: OpenAI's Batch API offers 50% off for async jobs. If latency is not critical, batch processing halves your inference cost.
- Prompt caching: Anthropic offers a 90% discount and OpenAI a 50% discount on cached input token prefixes.
- Time of day: Some providers offer lower rates during off-peak hours (not yet common but emerging).
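The secondary factors compound with base pricing. A rough sketch of their combined effect (function and parameter names are illustrative; whether batch and caching discounts stack varies by provider, and this sketch applies them multiplicatively for illustration only):

```python
def discounted_input_cost(tokens, rate_per_mtok, cached_fraction=0.0,
                          cache_discount=0.5, batch=False):
    """Estimated input cost in USD after caching and batch discounts.

    cached_fraction: share of input tokens served from the prompt cache.
    cache_discount:  0.5 for OpenAI-style caching, 0.9 for Anthropic-style.
    batch:           apply the 50%-off async Batch API rate.
    """
    base = tokens * rate_per_mtok / 1e6
    cost = base * (1 - cached_fraction * cache_discount)
    if batch:
        cost *= 0.5
    return cost

# 1M input tokens at $2.50/1M, 80% of the prefix cached, run via Batch API:
est = discounted_input_cost(1_000_000, 2.50, cached_fraction=0.8, batch=True)
# → 0.75, versus $2.50 undiscounted
```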
Inference Optimization Techniques
Advanced inference optimization techniques reduce the compute cost per token by making the model run more efficiently, without changing the model's output quality. These are primarily relevant for self-hosted models but increasingly available through API providers as well.
1. Quantization. Quantization reduces the numerical precision of model weights from 32-bit or 16-bit floating point to 8-bit integers (INT8) or even 4-bit (INT4). This reduces memory footprint by 2–8x and speeds up computation by 1.5–3x because lower-precision operations are faster on modern GPUs. The quality tradeoff is typically minimal: INT8 quantization produces outputs within 1–2% of full-precision on most benchmarks. For self-hosted models, quantization is the single highest-impact optimization — it can double your throughput (halving per-inference cost) with negligible quality loss. For API users, quantization is transparent — providers use it internally to serve models more efficiently, and the cost savings are reflected in pricing.
2. KV cache optimization. The key-value cache stores intermediate attention computations and grows linearly with sequence length. For long-context requests (32K+ tokens), the KV cache can consume more GPU memory than the model weights themselves. Techniques like paged attention (used in vLLM) manage KV cache memory more efficiently, reducing waste from fragmentation. Multi-query attention (MQA) and grouped-query attention (GQA) reduce the KV cache size by sharing key-value heads across attention heads, cutting memory requirements by 4–8x. These optimizations are invisible to API users but directly affect the per-token costs providers can offer.
3. Speculative decoding. A small, fast "draft" model generates candidate tokens, and the large target model verifies them in a single forward pass. Because verification is parallelizable (like prefill), this can speed up decode by 2–3x when the draft model has high acceptance rates. The result is faster inference at the same quality, reducing per-request GPU-seconds and thus cost. Google uses speculative decoding in some Gemini serving configurations.
4. Continuous batching. Instead of processing one request at a time, the inference server batches multiple requests together, sharing the overhead of model weight loading across all requests in the batch. Continuous batching (also called iteration-level batching) dynamically adds new requests to the batch as previous ones finish generating, maximizing GPU utilization. This is the standard for production inference servers (vLLM, TensorRT-LLM, TGI) and is why API inference is much cheaper than naive single-request serving.
5. Model distillation. A smaller "student" model is trained to mimic the outputs of a larger "teacher" model. The student model is cheaper to run at inference time while preserving much of the teacher's quality. GPT-4o-mini is essentially a distilled version of the GPT-4o family — it achieves 80–90% of GPT-4o's quality at 1/16th the inference cost. If you cannot find a small enough model for your task via model selection, you can distill one via fine-tuning on the larger model's outputs.
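A quick way to see why quantization is the highest-impact self-hosting optimization: weight memory scales directly with numerical precision. A sketch for a 70B-parameter model (decimal GB, weights only; the KV cache and activations add more on top):

```python
def weight_gb(params, bits):
    """Approximate model weight memory in decimal gigabytes."""
    return params * bits / 8 / 1e9

for bits, name in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{name}: ~{weight_gb(70e9, bits):.0f} GB")
# FP16: ~140 GB, INT8: ~70 GB, INT4: ~35 GB
```

At FP16, a 70B model does not fit on a single 80GB GPU; at INT8 it does, which is why quantization can halve the hardware required for the same workload.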
Self-Hosted vs API Inference Costs
For teams with sufficient engineering resources and volume, self-hosting inference using open-source models can be cheaper than API pricing. But the breakeven calculation is more nuanced than most teams realize.
| Factor | API Inference | Self-Hosted Inference |
|---|---|---|
| Pricing model | Pay per token | Pay per GPU-hour (fixed + variable) |
| Cost at 1M tokens/day | $0.15–$15.00/day | $24–$72/day (1 GPU running 24/7) |
| Cost at 100M tokens/day | $15–$1,500/day | $48–$432/day (2–6 GPUs) |
| Cost at 1B tokens/day | $150–$15,000/day | $240–$2,160/day (10–30 GPUs) |
| Engineering overhead | None (API call) | High (infra, deployment, monitoring, scaling) |
| Scaling | Instant (provider handles capacity) | Manual (provision GPUs, manage load balancing) |
| Model quality | Frontier models (GPT-4o, Claude 3.5) | Open-source (Llama 3, Mistral, Qwen) |
| Latency control | Limited (provider's infrastructure) | Full (choose GPU type, location, batch size) |
Breakeven analysis:
A single NVIDIA A100 (80GB) on AWS costs approximately $3.00/hour ($72/day). Running Llama 3 70B (quantized to INT8) on this GPU, you can process approximately 40–80 tokens per second per concurrent request, with 4–8 concurrent requests via continuous batching. That is roughly 160–640 tokens per second sustained, or 13.8M–55.3M tokens per day.
```
// Self-hosted cost per million tokens:
$72/day ÷ 13.8M tokens/day = $5.22/1M tokens (conservative)
$72/day ÷ 55.3M tokens/day = $1.30/1M tokens (optimistic)

// API comparison:
// GPT-4o input: $2.50/1M, output: $10.00/1M (blended ~$5.00/1M)
// Claude 3.5 Sonnet input: $3.00/1M, output: $15.00/1M (blended ~$7.00/1M)
// GPT-4o-mini input: $0.15/1M, output: $0.60/1M (blended ~$0.30/1M)
```

Self-hosting breaks even against frontier API models (GPT-4o, Claude 3.5 Sonnet) at moderate volumes. But it does not break even against budget API models (GPT-4o-mini at $0.30/1M blended) unless your volume exceeds 200M+ tokens per day AND the open-source model meets your quality requirements.
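The same comparison can be framed as a breakeven volume: the daily token count at which a dedicated GPU becomes cheaper than paying per token. A minimal sketch, using the $72/day A100 figure and blended API rates from above (the function name is illustrative):

```python
def breakeven_tokens_per_day(gpu_cost_per_day, api_rate_per_mtok):
    """Daily token volume where fixed GPU cost equals per-token API cost."""
    return gpu_cost_per_day / api_rate_per_mtok * 1e6

# One A100 at $72/day vs a ~$5.00/1M blended frontier rate:
frontier = breakeven_tokens_per_day(72, 5.00)   # ≈ 14.4M tokens/day
# vs GPT-4o-mini's ~$0.30/1M blended rate:
budget = breakeven_tokens_per_day(72, 0.30)     # ≈ 240M tokens/day
```

Below the breakeven volume, the GPU sits partially idle while you pay for it around the clock; the API's pay-per-token model wins. Note the 240M figure is consistent with the "200M+ tokens per day" threshold above.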
Hidden self-hosting costs:
- Engineering time: Setting up and maintaining inference infrastructure (vLLM, TensorRT-LLM, Kubernetes, monitoring) requires 0.5–2 FTE of ML platform engineering.
- GPU availability: A100 and H100 instances can be scarce on cloud providers. Reserved instances lock in availability but require upfront commitment.
- Redundancy: Production workloads need at least 2 GPUs for availability. Add a load balancer, health checks, and failover.
- Model updates: When Llama 4 releases, you manage the upgrade: downloading weights, testing quality, updating serving infrastructure.
Inference Cost Monitoring
Inference cost monitoring is the practice of tracking, analyzing, and alerting on the cost of every API call your organization makes to LLM providers. Because inference is the sole cost category for API consumers, inference monitoring is AI cost monitoring.
Essential metrics to track:
- Cost per request: The total cost of each individual API call, calculated as (input_tokens × input_rate) + (output_tokens × output_rate). This is the atomic unit of cost monitoring. Tracking it per request allows you to identify expensive outliers, measure the impact of optimizations, and attribute costs to specific features, users, or teams.
- Cost per query (end-to-end): For multi-step systems (RAG pipelines, agent loops, chain-of-thought), the total cost includes multiple inference requests. A single user query that triggers an embedding call, 3 retrieval-augmented generation calls, and a summarization call costs the sum of all five requests. Track end-to-end cost per user action, not just per API call.
- Tokens per request (input and output separately): Token counts are the mechanical driver of cost. Tracking them separately reveals whether costs are driven by large inputs (long prompts, extensive context) or large outputs (verbose responses, runaway generation).
- Model distribution: What percentage of your inference budget goes to each model? If 80% of requests go to GPT-4o but 40% of those could be handled by GPT-4o-mini, you have a clear optimization opportunity worth quantifying.
- Latency percentiles (P50, P95, P99): Inference latency correlates with cost (longer requests consume more GPU-seconds) and affects user experience. High P95 latency often indicates requests with excessive input context or output length.
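Per-request and end-to-end cost can be computed from the same token counts. A sketch of the five-request query described above (model names, per-step token counts, and the rate table are illustrative; the embedding rate assumes OpenAI's text-embedding-3-small list price):

```python
# USD per 1M tokens: (input_rate, output_rate)
RATES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "text-embedding-3-small": (0.02, 0.0),
}

def request_cost(model, input_tokens, output_tokens):
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

# One user query fanning out into five inference requests:
steps = [
    ("text-embedding-3-small", 500, 0),   # embed the query
    ("gpt-4o", 4_000, 300),               # RAG generation call 1
    ("gpt-4o", 4_000, 300),               # RAG generation call 2
    ("gpt-4o", 4_000, 300),               # RAG generation call 3
    ("gpt-4o-mini", 1_200, 150),          # final summarization
]
total = sum(request_cost(m, i, o) for m, i, o in steps)
# → ~$0.039 per user query, almost all of it in the three GPT-4o calls
```

Attributing the total to the user action, rather than to five unrelated API calls, is what makes the cost-per-query metric actionable.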
Alerting best practices:
- Anomaly detection: Alert when hourly or daily inference spend deviates more than 2 standard deviations from the 7-day baseline. This catches runaway loops, misconfigured batch jobs, and traffic spikes before they become expensive.
- Budget thresholds: Set daily and monthly spending limits per API key, per project, and per model. Alert at 50%, 80%, and 100% of budget.
- Per-request cost spikes: Alert when any single request exceeds a defined cost threshold (e.g., $0.50). This catches accidentally large context windows, missing max_tokens limits, or agent loops that run too many iterations.
- Model drift: Alert when traffic shifts between models unexpectedly. For example, if a configuration change accidentally routes traffic from GPT-4o-mini to GPT-4o, the per-request cost increases 16x.
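The baseline anomaly check is straightforward to implement. A minimal sketch of the 2-standard-deviation rule using Python's statistics module (the function name, threshold, and spend history are illustrative):

```python
import statistics

def is_spend_anomaly(daily_spend_history, today_spend, n_sigma=2.0):
    """Flag today's spend if it deviates > n_sigma from the 7-day baseline."""
    baseline = daily_spend_history[-7:]
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return abs(today_spend - mean) > n_sigma * std

# Seven days hovering around $400/day, then a $900 day:
history = [395, 410, 388, 402, 415, 391, 399]
# is_spend_anomaly(history, 900) → True  (runaway loop or traffic spike)
# is_spend_anomaly(history, 405) → False (normal variation)
```

In production you would compute the baseline per key or per model rather than globally, so a spike in one workload is not masked by stable spend elsewhere.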
CostHawk provides all of these metrics out of the box: per-request cost tracking, per-model and per-key breakdowns, time-series spend dashboards, anomaly detection, and budget alerts. By routing API calls through CostHawk wrapped keys or syncing usage data via the MCP server, you get full inference cost visibility without changing your application code.
Frequently Asked Questions

- Why are output tokens more expensive than input tokens during inference?
- What is the difference between inference latency and inference cost?
- How does batch inference reduce costs?
- What is serverless inference and how does its cost model differ?
- How do I calculate the inference cost of an agent or multi-step workflow?
- What is provisioned throughput and when does it make sense financially?
- How does inference cost differ between streaming and non-streaming API calls?
- What trends are driving inference costs down over time?
Related Terms
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.

Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.

Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.

GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.

Serverless Inference
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include AWS Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.

Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
