Glossary · Infrastructure · Updated 2026-03-16

Inference

The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.

Definition

What is Inference?

Inference is the act of passing new input data through a trained neural network to produce an output — a prediction, a classification, a generated text sequence, or an embedding vector. In the context of large language models, inference encompasses two computational phases: the prefill phase, where all input tokens are processed in parallel to build the model's internal representation (key-value cache), and the decode phase, where output tokens are generated one at a time in an autoregressive loop, each conditioned on all preceding tokens.

Every API call you make to OpenAI, Anthropic, Google, or any other LLM provider is an inference request. The provider runs your input through the model on GPU hardware, generates the output, and bills you for the tokens consumed. For API consumers, inference cost is the AI cost — you do not pay for training the base model (that cost is amortized across all customers and embedded in per-token pricing). Understanding how inference works, what drives its cost, and how to optimize it is essential for controlling AI API spend.

Impact

Why It Matters for AI Costs

Inference is where your money goes. Every dollar you spend on AI APIs is an inference cost. While headlines focus on the enormous cost of training frontier models (GPT-4 reportedly cost $100M+ to train, and GPT-5 likely exceeded $500M), API consumers do not pay training costs directly. Instead, providers amortize training costs across millions of customers and charge per-token inference fees that represent the marginal compute cost of running the model on your input.

The economics of inference at scale are striking:

| Workload | Queries/Day | Avg Input Tokens | Avg Output Tokens | Model | Daily Inference Cost | Monthly Cost |
|---|---|---|---|---|---|---|
| Customer support chatbot | 50,000 | 1,500 | 400 | GPT-4o | $387.50 | $11,625 |
| Code review assistant | 10,000 | 3,000 | 800 | Claude 3.5 Sonnet | $210.00 | $6,300 |
| Content classification | 500,000 | 200 | 20 | GPT-4o-mini | $21.00 | $630 |
| Document summarization | 5,000 | 8,000 | 1,000 | Claude 3.5 Sonnet | $195.00 | $5,850 |

These costs are purely inference — the compute to process your input and generate your output, running on GPU clusters operated by the provider. The total across these four workloads is $24,405/month, all inference.

The critical insight for cost management is that inference cost is a function of four variables: model choice (which determines per-token price), input token count (how much context you send), output token count (how much the model generates), and request volume (how many times per day you call the API). Optimizing any of these four variables directly reduces your inference bill. CostHawk tracks all four dimensions per request, enabling you to identify exactly which variable is driving costs and target your optimization efforts accordingly.
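These four levers can be expressed as a single cost function. A minimal sketch in Python — the per-million-token rates are the list prices used elsewhere in this article; treat them as assumptions and check current provider pricing:

```python
# Assumed $/1M-token rates (list prices quoted in this article; verify current pricing).
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def daily_inference_cost(model: str, input_tokens: int, output_tokens: int,
                         requests_per_day: int) -> float:
    """Cost = volume x (input tokens x input rate + output tokens x output rate)."""
    p = PRICES[model]
    per_request = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return per_request * requests_per_day

# The chatbot workload from the table above: 50,000 requests/day on GPT-4o.
print(f"${daily_inference_cost('gpt-4o', 1_500, 400, 50_000):.2f}/day")  # → $387.50/day
```

Changing any one argument — a cheaper model, a shorter prompt, a capped output, fewer requests — moves the result proportionally, which is why all four are worth instrumenting separately.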

How Inference Works

Inference is the production phase of a machine learning model's lifecycle — the phase where the trained model does useful work by processing new inputs and producing outputs. In the lifecycle of an LLM, the sequence is: pre-training (learning language patterns from trillions of tokens), fine-tuning (specializing on specific tasks), and inference (serving predictions to end users). Pre-training and fine-tuning happen infrequently; inference happens continuously, millions of times per day across a provider's customer base.

The mechanics of LLM inference:

When you send a request to an LLM API, two distinct computational phases execute on the provider's GPU hardware:

Phase 1: Prefill (input processing). All input tokens (system prompt + user message + conversation history + tool definitions) are processed in parallel through the model's transformer layers. This phase builds the key-value (KV) cache — a compressed representation of the input that the model uses during generation. Prefill is computationally efficient because GPUs excel at parallel matrix multiplication. The time is proportional to input length but benefits from GPU parallelism.

Phase 2: Decode (output generation). Output tokens are generated one at a time. For each token, the model runs a forward pass that reads the KV cache (from prefill) and all previously generated tokens, computes attention across the full sequence, and samples the next token. This is inherently sequential — each token depends on the previous one — making it the computational bottleneck. The KV cache grows with each generated token, consuming additional GPU memory.

The prefill/decode distinction explains the input/output pricing split: output tokens cost 4–5x more than input tokens because decode is more compute-intensive per token than prefill. It also explains why time-to-first-token (TTFT) and tokens-per-second (TPS) are different metrics — TTFT measures prefill speed, while TPS measures decode speed.

Understanding these phases helps you reason about cost: a request with 10,000 input tokens and 100 output tokens is dominated by prefill cost, while a request with 100 input tokens and 5,000 output tokens is dominated by decode cost. The optimal cost reduction strategy differs accordingly.
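A quick sketch of that split, assuming GPT-4o list prices ($2.50 input / $10.00 output per 1M tokens):

```python
def cost_split(input_tokens: int, output_tokens: int,
               input_rate: float = 2.50, output_rate: float = 10.00):
    """Return (prefill-side cost, decode-side cost) in dollars.
    Rates are $/1M tokens; GPT-4o list prices assumed."""
    prefill = input_tokens * input_rate / 1_000_000
    decode = output_tokens * output_rate / 1_000_000
    return prefill, decode

# Prefill-dominated: long context, short answer (e.g., document Q&A).
print(cost_split(10_000, 100))   # input cost 25x the output cost
# Decode-dominated: short prompt, long generation (e.g., essay writing).
print(cost_split(100, 5_000))    # output cost 200x the input cost
```

For the first shape, prompt trimming and caching pay off; for the second, a tighter max_tokens and "be concise" instructions do.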

Inference vs Training Costs

The AI cost landscape has two fundamentally different cost categories: training (creating the model) and inference (using the model). For API consumers, this distinction is crucial because you pay only for inference — but understanding training costs explains why per-token inference prices are what they are.

| Dimension | Training | Inference |
|---|---|---|
| Who pays | Model provider (OpenAI, Anthropic, Google) | API consumer (you) |
| Frequency | Once (per model version) | Millions of times per day |
| Cost scale | $10M–$500M+ per frontier model | $0.001–$0.10 per request |
| Hardware | Thousands of GPUs for weeks/months | Individual GPU-seconds per request |
| Optimization lever | Not applicable (provider's problem) | Model choice, token reduction, caching, batching |
| Budget visibility | N/A | Per-request, per-key, per-model via CostHawk |

Why training costs matter to you (indirectly):

Providers set per-token inference prices to recoup training investment plus earn margin. A model that cost $100M to train and serves 100 trillion inference tokens over its lifetime needs to earn at least $1 per million tokens in gross margin to break even on training alone. This is why frontier models (expensive to train) have higher per-token prices than smaller models (cheaper to train). Understanding this relationship helps you predict pricing trends:

  • As models get cheaper to train (through better algorithms, hardware efficiency), inference prices drop. GPT-4o's input price ($2.50/1M) is more than 90% lower than GPT-4's launch price ($30.00/1M for 8K context).
  • As models serve more customers, training cost is amortized further, enabling price reductions. This is why model prices tend to decrease over time.
  • Competition drives prices down. Anthropic, Google, Mistral, and open-source alternatives create pricing pressure that benefits API consumers.
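The amortization arithmetic above is simple enough to check directly (the $100M training cost and 100-trillion-token lifetime are illustrative figures, not actual provider economics):

```python
training_cost = 100_000_000            # $100M, illustrative
lifetime_tokens = 100_000_000_000_000  # 100 trillion inference tokens, illustrative

# Gross margin per token needed to recover training cost alone.
breakeven_per_token = training_cost / lifetime_tokens
breakeven_per_million = breakeven_per_token * 1_000_000
print(breakeven_per_million)  # → 1.0 (i.e., $1 per million tokens)
```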

For self-hosted inference (running open-source models on your own hardware), you bear both training costs (if you fine-tune) and inference costs (GPU compute). This changes the cost calculus significantly — see the "Self-Hosted vs API Inference Costs" section below.

Factors That Affect Inference Cost

Four primary factors determine the cost of each inference request, and understanding them gives you concrete levers for optimization.

1. Model size and architecture. Larger models require more GPU compute per forward pass. GPT-4o (estimated at hundreds of billions of parameters across a mixture-of-experts architecture) costs $2.50/$10.00 per MTok. GPT-4o-mini (a much smaller distilled model) costs $0.15/$0.60 — a 16x reduction. The relationship between model size and per-token cost is roughly linear: 2x more parameters means approximately 2x more compute per token, which translates to approximately 2x higher pricing. Mixture-of-experts (MoE) architectures partially break this relationship by activating only a subset of parameters per token, which is why models like GPT-4o and Mixtral can be large in total parameters but moderate in per-token cost.

2. Input token count. More input tokens mean more work during the prefill phase. The cost is linear: 10,000 input tokens costs exactly 10x more than 1,000 input tokens at the same per-token rate. Input tokens include everything you send: system prompts, user messages, conversation history, tool definitions, and any context (RAG results, documents, code). System prompts are a particularly important target because they are repeated in every request — a 2,000-token system prompt across 100,000 daily requests costs 200M input tokens/day.

3. Output token count. More output tokens mean more work during the decode phase. Because decode is more compute-intensive per token, output tokens cost 4–5x more than input tokens. Output length is partly determined by the task (classification produces short outputs, essay writing produces long outputs) and partly by your control parameters (max_tokens) and prompt instructions ("be concise" vs "explain in detail"). Controlling output length is often the highest-ROI cost optimization because of the output token premium.

4. Request volume. The total number of API calls multiplies the per-request cost into your total spend. A request costing $0.01 is negligible in isolation but costs $300/day at 30,000 requests/day. Volume is driven by product usage (more users = more requests), feature design (real-time suggestions generate more requests than on-demand features), and architectural decisions (streaming vs batch processing). Reducing unnecessary requests through caching, deduplication, and batching directly reduces total inference cost.
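Because input cost is linear in token count (factor 2) and multiplies with volume (factor 4), the repeated-system-prompt overhead is easy to quantify. A sketch at GPT-4o's assumed input rate of $2.50/1M tokens:

```python
def system_prompt_daily_cost(prompt_tokens: int, daily_requests: int,
                             input_rate_per_mtok: float) -> float:
    """The same prefix is billed on every request, so its cost scales with volume."""
    return prompt_tokens * daily_requests * input_rate_per_mtok / 1_000_000

# 2,000-token system prompt x 100,000 requests/day = 200M input tokens/day.
print(system_prompt_daily_cost(2_000, 100_000, 2.50))  # → 500.0 ($/day)
```

Trimming that prompt to 500 tokens, or caching it, cuts a four-figure monthly line item before touching any other variable.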

Secondary factors:

  • Batch vs real-time: OpenAI's Batch API offers 50% off for async jobs. If latency is not critical, batch processing halves your inference cost.
  • Prompt caching: Anthropic offers a 90% discount and OpenAI a 50% discount on cached input token prefixes.
  • Time of day: Some providers offer lower rates during off-peak hours (not yet common but emerging).
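A sketch of how these discounts compound into an effective input rate. The discount figures are the ones quoted above; note this simple model ignores details like Anthropic's premium on cache writes:

```python
def effective_input_rate(base_rate: float, cached_fraction: float = 0.0,
                         cache_discount: float = 0.0,
                         batch_discount: float = 0.0) -> float:
    """Blend prompt-caching and batch discounts into one effective $/1M input rate.
    Discount figures are assumptions; confirm against current provider docs."""
    cached_rate = base_rate * (1 - cache_discount)
    blended = base_rate * (1 - cached_fraction) + cached_rate * cached_fraction
    return blended * (1 - batch_discount)

# Claude 3.5 Sonnet input at $3.00/1M with 80% of the prompt cached (90% discount):
print(round(effective_input_rate(3.00, cached_fraction=0.8, cache_discount=0.9), 2))
# GPT-4o input at $2.50/1M via the Batch API (50% off):
print(effective_input_rate(2.50, batch_discount=0.5))  # → 1.25
```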

Inference Optimization Techniques

Advanced inference optimization techniques reduce the compute cost per token by making the model run more efficiently, without changing the model's output quality. These are primarily relevant for self-hosted models but increasingly available through API providers as well.

1. Quantization. Quantization reduces the numerical precision of model weights from 32-bit or 16-bit floating point to 8-bit integers (INT8) or even 4-bit (INT4). This reduces memory footprint by 2–8x and speeds up computation by 1.5–3x because lower-precision operations are faster on modern GPUs. The quality tradeoff is typically minimal: INT8 quantization produces outputs within 1–2% of full-precision on most benchmarks. For self-hosted models, quantization is the single highest-impact optimization — it can double your throughput (halving per-inference cost) with negligible quality loss. For API users, quantization is transparent — providers use it internally to serve models more efficiently, and the cost savings are reflected in pricing.

2. KV cache optimization. The key-value cache stores intermediate attention computations and grows linearly with sequence length. For long-context requests (32K+ tokens), the KV cache can consume more GPU memory than the model weights themselves. Techniques like paged attention (used in vLLM) manage KV cache memory more efficiently, reducing waste from fragmentation. Multi-query attention (MQA) and grouped-query attention (GQA) reduce the KV cache size by sharing key-value heads across attention heads, cutting memory requirements by 4–8x. These optimizations are invisible to API users but directly affect the per-token costs providers can offer.

3. Speculative decoding. A small, fast "draft" model generates candidate tokens, and the large target model verifies them in a single forward pass. Because verification is parallelizable (like prefill), this can speed up decode by 2–3x when the draft model has high acceptance rates. The result is faster inference at the same quality, reducing per-request GPU-seconds and thus cost. Google uses speculative decoding in some Gemini serving configurations.

4. Continuous batching. Instead of processing one request at a time, the inference server batches multiple requests together, sharing the overhead of model weight loading across all requests in the batch. Continuous batching (also called iteration-level batching) dynamically adds new requests to the batch as previous ones finish generating, maximizing GPU utilization. This is the standard for production inference servers (vLLM, TensorRT-LLM, TGI) and is why API inference is much cheaper than naive single-request serving.

5. Model distillation. A smaller "student" model is trained to mimic the outputs of a larger "teacher" model. The student model is cheaper to run at inference time while preserving much of the teacher's quality. GPT-4o-mini is widely believed to be a distilled sibling of the GPT-4o family — it achieves 80–90% of GPT-4o's quality at roughly 1/16th the inference price. If you cannot find a small enough model for your task via model selection, you can distill one via fine-tuning on the larger model's outputs.
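The memory arithmetic behind quantization (technique 1) and KV cache optimization (technique 2) is easy to sketch. The model shapes below are illustrative, loosely based on Llama-3-70B-like configurations (80 layers, 128-dim heads), not exact published specs:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight storage only; excludes activations and KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                bytes_per_value: int = 2) -> float:
    """Per-sequence KV cache: K and V tensors across all layers (FP16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# Quantization: a 70B-parameter model at FP16, INT8, and INT4.
for bits in (16, 8, 4):
    print(f"70B weights @ {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")

# KV cache: full multi-head attention (64 KV heads) vs grouped-query attention (8).
mha = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)
gqa = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=32_768)
print(f"KV cache @ 32K tokens: MHA {mha:.1f} GB vs GQA {gqa:.1f} GB")
```

The numbers show why these techniques matter: INT8 halves weight memory, and GQA's 8x KV cache reduction is the difference between a long-context request fitting on one GPU or not.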

Self-Hosted vs API Inference Costs

For teams with sufficient engineering resources and volume, self-hosting inference using open-source models can be cheaper than API pricing. But the breakeven calculation is more nuanced than most teams realize.

| Factor | API Inference | Self-Hosted Inference |
|---|---|---|
| Pricing model | Pay per token | Pay per GPU-hour (fixed + variable) |
| Cost at 1M tokens/day | $0.15–$15.00/day | $24–$72/day (1 GPU running 24/7) |
| Cost at 100M tokens/day | $15–$1,500/day | $48–$144/day (2–6 GPUs) |
| Cost at 1B tokens/day | $150–$15,000/day | $240–$720/day (10–30 GPUs) |
| Engineering overhead | None (API call) | High (infra, deployment, monitoring, scaling) |
| Scaling | Instant (provider handles capacity) | Manual (provision GPUs, manage load balancing) |
| Model quality | Frontier models (GPT-4o, Claude 3.5) | Open-source (Llama 3, Mistral, Qwen) |
| Latency control | Limited (provider's infrastructure) | Full (choose GPU type, location, batch size) |

Breakeven analysis:

A single NVIDIA A100 (80GB) on AWS costs approximately $3.00/hour ($72/day). Running Llama 3 70B (quantized to INT8) on this GPU, you can process approximately 40–80 tokens per second per concurrent request, with 4–8 concurrent requests via continuous batching. That is roughly 160–640 tokens per second sustained, or 13.8M–55.3M tokens per day.

// Self-hosted cost per million tokens:
$72/day ÷ 13.8M tokens/day = $5.22/1M tokens (conservative)
$72/day ÷ 55.3M tokens/day = $1.30/1M tokens (optimistic)

// API comparison:
// GPT-4o input: $2.50/1M, output: $10.00/1M (blended ~$5.00/1M)
// Claude 3.5 Sonnet input: $3.00/1M, output: $15.00/1M (blended ~$7.00/1M)
// GPT-4o-mini input: $0.15/1M, output: $0.60/1M (blended ~$0.30/1M)

Self-hosting breaks even against frontier API models (GPT-4o, Claude 3.5 Sonnet) at moderate volumes. But it does not break even against budget API models (GPT-4o-mini at $0.30/1M blended) unless you can push per-token costs well below the A100 figures above — for example, newer GPUs at high sustained utilization, which typically requires volume exceeding 200M+ tokens per day — AND the open-source model meets your quality requirements.
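A small helper for re-running this breakeven with your own numbers. The figures below are the A100 estimates and blended API rates from this section, all of which are assumptions worth revisiting against current hardware and pricing:

```python
def cost_per_mtok(gpu_cost_per_day: float, mtok_per_day: float) -> float:
    """Self-hosted $ per million tokens: fixed GPU cost over daily throughput."""
    return gpu_cost_per_day / mtok_per_day

# Blended API rates from this section ($/1M tokens, assumed input/output mix).
API_BLENDED = {"gpt-4o": 5.00, "claude-3.5-sonnet": 7.00, "gpt-4o-mini": 0.30}

# One A100 at $72/day, conservative vs optimistic sustained throughput.
for label, mtok in [("conservative", 13.8), ("optimistic", 55.3)]:
    rate = cost_per_mtok(72, mtok)
    cheaper_than = [m for m, p in API_BLENDED.items() if rate < p]
    print(f"{label}: ${rate:.2f}/1M tokens — undercuts {cheaper_than}")
```

Even the optimistic self-hosted rate ($1.30/1M) never undercuts the budget model, which is the point the prose above makes.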

Hidden self-hosting costs:

  • Engineering time: Setting up and maintaining inference infrastructure (vLLM, TensorRT-LLM, Kubernetes, monitoring) requires 0.5–2 FTE of ML platform engineering.
  • GPU availability: A100 and H100 instances can be scarce on cloud providers. Reserved instances lock in availability but require upfront commitment.
  • Redundancy: Production workloads need at least 2 GPUs for availability. Add a load balancer, health checks, and failover.
  • Model updates: When Llama 4 releases, you manage the upgrade: downloading weights, testing quality, updating serving infrastructure.

Inference Cost Monitoring

Inference cost monitoring is the practice of tracking, analyzing, and alerting on the cost of every API call your organization makes to LLM providers. Because inference is the sole cost category for API consumers, inference monitoring is AI cost monitoring.

Essential metrics to track:

  • Cost per request: The total cost of each individual API call, calculated as (input_tokens × input_rate) + (output_tokens × output_rate). This is the atomic unit of cost monitoring. Tracking it per request allows you to identify expensive outliers, measure the impact of optimizations, and attribute costs to specific features, users, or teams.
  • Cost per query (end-to-end): For multi-step systems (RAG pipelines, agent loops, chain-of-thought), the total cost includes multiple inference requests. A single user query that triggers an embedding call, 3 retrieval-augmented generation calls, and a summarization call costs the sum of all five requests. Track end-to-end cost per user action, not just per API call.
  • Tokens per request (input and output separately): Token counts are the mechanical driver of cost. Tracking them separately reveals whether costs are driven by large inputs (long prompts, extensive context) or large outputs (verbose responses, runaway generation).
  • Model distribution: What percentage of your inference budget goes to each model? If 80% of requests go to GPT-4o but 40% of those could be handled by GPT-4o-mini, you have a clear optimization opportunity worth quantifying.
  • Latency percentiles (P50, P95, P99): Inference latency correlates with cost (longer requests consume more GPU-seconds) and affects user experience. High P95 latency often indicates requests with excessive input context or output length.
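The first two metrics above can be sketched as a minimal cost tracker. The rate table is an assumption based on the list prices quoted in this article; a real system should pull current pricing:

```python
RATES = {  # $/1M tokens (input, output); list prices assumed, verify current pricing
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost per request: the atomic unit of cost monitoring."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def query_cost(steps) -> float:
    """Cost per query (end-to-end): the sum over every inference call it triggers."""
    return sum(request_cost(model, i, o) for model, i, o in steps)

# A hypothetical RAG query: a cheap routing step plus one frontier generation step.
rag_query = [("gpt-4o-mini", 400, 30), ("gpt-4o", 3_000, 500)]
print(f"${query_cost(rag_query):.4f} per user query")
```

Logging both numbers per request and per user action is what lets you attribute spend to features rather than to raw API calls.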

Alerting best practices:

  • Anomaly detection: Alert when hourly or daily inference spend deviates more than 2 standard deviations from the 7-day baseline. This catches runaway loops, misconfigured batch jobs, and traffic spikes before they become expensive.
  • Budget thresholds: Set daily and monthly spending limits per API key, per project, and per model. Alert at 50%, 80%, and 100% of budget.
  • Per-request cost spikes: Alert when any single request exceeds a defined cost threshold (e.g., $0.50). This catches accidentally large context windows, missing max_tokens limits, or agent loops that run too many iterations.
  • Model drift: Alert when traffic shifts between models unexpectedly — for example, if a configuration change accidentally routes traffic from GPT-4o-mini to GPT-4o, the per-request cost increases 16x.
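The anomaly rule above (spend more than 2 standard deviations above a trailing 7-day baseline) is a few lines of Python; a sketch:

```python
from statistics import mean, stdev

def spend_anomaly(baseline_daily_spend: list[float], today_spend: float,
                  threshold_sigma: float = 2.0) -> bool:
    """Flag spend more than N standard deviations above the trailing baseline."""
    mu = mean(baseline_daily_spend)
    sigma = stdev(baseline_daily_spend)
    if sigma == 0:
        return today_spend != mu
    return (today_spend - mu) / sigma > threshold_sigma

week = [410.0, 395.5, 402.3, 388.9, 415.2, 401.7, 398.4]  # hypothetical 7-day baseline
print(spend_anomaly(week, 401.0))  # → False (a normal day)
print(spend_anomaly(week, 950.0))  # → True  (runaway loop or misconfigured batch job)
```

Production systems would add seasonality handling (weekday vs weekend baselines) and evaluate hourly, but the core check is this simple.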

CostHawk provides all of these metrics out of the box: per-request cost tracking, per-model and per-key breakdowns, time-series spend dashboards, anomaly detection, and budget alerts. By routing API calls through CostHawk wrapped keys or syncing usage data via the MCP server, you get full inference cost visibility without changing your application code.

FAQ

Frequently Asked Questions

Why are output tokens more expensive than input tokens during inference?
The price difference reflects a fundamental computational asymmetry. During the prefill phase (processing input tokens), the model runs a single forward pass over all input tokens in parallel. Modern GPUs are designed for exactly this kind of parallel matrix multiplication, making prefill highly efficient. During the decode phase (generating output tokens), the model must generate tokens sequentially — each new token depends on all previous tokens, requiring a separate forward pass per token. This sequential process cannot be parallelized across tokens and requires maintaining a key-value (KV) cache that grows with each generated token, consuming additional GPU memory. The decode phase also suffers from lower GPU utilization because each forward pass processes only one token instead of thousands. The net result is that generating one output token requires 3–5x more GPU-seconds than processing one input token, and providers pass this cost differential through in their pricing. This is why output tokens typically cost 4–5x more than input tokens across all major providers.
What is the difference between inference latency and inference cost?
Inference latency (the time for a request to complete) and inference cost (the dollar amount billed) are related but distinct. Both increase with token count, but they are not perfectly correlated. Latency has two components: time-to-first-token (TTFT), which is dominated by prefill computation and network overhead (typically 200–800ms for most models), and generation speed (tokens per second during decode, typically 30–100 tokens/second). A request with 10,000 input tokens and 100 output tokens has high TTFT but completes quickly. A request with 100 input tokens and 5,000 output tokens has low TTFT but takes 50–100 seconds to complete. Cost, on the other hand, is purely a function of token counts and per-token prices. A request that takes 2 seconds and costs $0.01 is identical in cost to a request that takes 10 seconds and costs $0.01 (same tokens, same model) — latency does not affect cost for API users. However, latency affects user experience and throughput, and optimizing for latency (smaller models, shorter outputs) often also optimizes for cost.
How does batch inference reduce costs?
Batch inference processes multiple requests asynchronously rather than serving them in real-time. OpenAI's Batch API offers a 50% discount on all models for batch jobs that complete within a 24-hour window. This means GPT-4o input drops from $2.50 to $1.25 per million tokens, and output from $10.00 to $5.00. The provider can offer this discount because batch processing allows better GPU utilization — they can schedule your jobs during off-peak hours, combine them with other batch jobs for optimal batching, and do not need to guarantee low-latency responses. For workloads that are not time-sensitive — nightly content moderation, bulk classification, document processing, training data generation, offline analytics — batch inference is an easy 50% cost reduction with zero quality impact. The tradeoff is latency: batch results may take minutes to hours instead of seconds. CostHawk tracks batch and real-time inference costs separately, helping you identify which workloads could be migrated to batch mode for immediate savings.
What is serverless inference and how does its cost model differ?
Serverless inference is a deployment model where you pay only for the GPU-seconds consumed during actual inference, with no charges when the model is idle. Providers like AWS SageMaker Serverless, Hugging Face Inference Endpoints (serverless), and Replicate offer this model. The cost structure is: you pay per-second of GPU compute during the inference request, with a cold start penalty (1–30 seconds) if the model needs to be loaded into GPU memory. Serverless inference is cost-effective for bursty, low-volume workloads because you avoid paying for idle GPU time. A model that processes 100 requests per hour for 2 seconds each costs 200 GPU-seconds per hour — versus 3,600 GPU-seconds per hour for a dedicated instance. The tradeoff is cold start latency: the first request after an idle period may take 5–30 seconds while the model loads. For high-volume workloads (thousands of requests per minute), dedicated GPU instances are cheaper per-inference because the per-second rate is lower and there are no cold starts. The breakeven is typically around 30–40% GPU utilization — below that, serverless is cheaper; above that, dedicated is cheaper.
How do I calculate the inference cost of an agent or multi-step workflow?
Agent and multi-step workflows multiply inference costs because each step is a separate API call. A ReAct-style agent that takes 5 tool-calling steps to answer a query makes 5 separate inference requests, each carrying the full conversation context plus growing tool results. The cost calculation requires summing all requests: if each step averages 3,000 input tokens and 200 output tokens on GPT-4o, the total is 15,000 input tokens ($0.0375) and 1,000 output tokens ($0.01) = $0.0475 per user query. Compare this to a single direct prompt that might cost $0.0075. Agent workflows are 5–10x more expensive per query than direct inference. To control agent inference costs: set a maximum step limit (e.g., 5 iterations), implement early termination when the agent has enough information, use model routing to run simpler tool-selection steps on cheaper models, and summarize intermediate results instead of passing full tool outputs in context. CostHawk's per-request grouping features let you tag all inference calls belonging to a single agent workflow, so you can see the true end-to-end cost per user query across all steps.
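The summation in this answer can be sketched in Python (GPT-4o list prices assumed; the per-step token averages are the illustrative figures from the answer):

```python
GPT4O_IN, GPT4O_OUT = 2.50, 10.00  # $/1M tokens, list prices assumed

def agent_query_cost(steps: int, input_per_step: int, output_per_step: int,
                     in_rate: float = GPT4O_IN, out_rate: float = GPT4O_OUT) -> float:
    """Total cost of an agent run: every step is a separate billed inference call."""
    total_in = steps * input_per_step
    total_out = steps * output_per_step
    return (total_in * in_rate + total_out * out_rate) / 1_000_000

# 5 tool-calling steps at 3,000 input / 200 output tokens each:
print(agent_query_cost(5, 3_000, 200))  # → 0.0475 per user query
```

Halving the step limit or routing tool-selection steps to a cheaper model drops this roughly proportionally, which is why step caps are the first control to add.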
What is provisioned throughput and when does it make sense financially?
Provisioned throughput (offered by Anthropic, AWS Bedrock, and Azure OpenAI) lets you reserve a fixed amount of inference capacity at a committed rate, typically with a 1–6 month term. Instead of paying per token, you pay a fixed hourly or monthly rate for guaranteed tokens-per-minute (TPM) capacity. For example, Anthropic's provisioned throughput for Claude might offer 100,000 TPM for a fixed monthly fee. The economics favor provisioned throughput when you have consistent, predictable high-volume workloads. If your usage is steady at 80,000+ TPM throughout the day, provisioned throughput can be 30–50% cheaper than on-demand per-token pricing. It also guarantees capacity — no rate limits, no queuing during peak demand. The risk is overprovisioning: if you reserve 100,000 TPM but only use 40,000 on average, you are paying for idle capacity. The breakeven is typically around 60–70% utilization of provisioned capacity. Below that, on-demand is cheaper. CostHawk's usage analytics show your TPM patterns over time, making it straightforward to assess whether your utilization profile justifies provisioned throughput commitments.
How does inference cost differ between streaming and non-streaming API calls?
Streaming and non-streaming API calls cost exactly the same in terms of token billing — the total input and output tokens are identical regardless of delivery mode. The difference is in how the response is delivered: non-streaming returns the complete response after all tokens are generated, while streaming sends tokens incrementally via server-sent events (SSE) as they are generated. The cost per token is identical. However, streaming has indirect cost implications. On the positive side, streaming enables you to implement early termination — if the first 100 tokens of a response indicate low quality or an error, you can cancel the request before the model generates the remaining 900 tokens, saving those output token costs. On the negative side, streaming keeps the connection open longer, which may increase infrastructure costs (load balancer connections, server memory) on your end. For most workloads, the cost difference is negligible. Choose streaming for user-facing applications (better perceived latency) and non-streaming for backend processing (simpler error handling). Either way, the token cost — which dominates your AI bill — is the same.
What trends are driving inference costs down over time?
Multiple technology and market trends are converging to reduce inference costs, and the trajectory is steep — inference prices have dropped roughly 10x in the past two years and are expected to continue declining. Hardware improvements: NVIDIA's H100 GPUs offer 3x the inference throughput of A100s, and the B200 generation promises another 2–3x improvement. Custom AI chips from Google (TPU v5), Amazon (Trainium/Inferentia), and startups like Groq and Cerebras are creating pricing pressure. Algorithmic efficiency: Mixture-of-experts architectures activate only a fraction of model parameters per token, reducing compute per inference. Speculative decoding, quantization (INT4, FP8), and KV cache optimization collectively improve inference throughput by 2–4x. Competition: The entry of Mistral, Cohere, Deepseek, and others has created aggressive pricing competition. Open-source models (Llama 3, Qwen) provide a free alternative that caps how much providers can charge for API inference. Scale economies: As LLM usage grows, providers spread fixed infrastructure costs across more tokens, enabling per-token price reductions. The practical implication: do not lock into long-term provisioned throughput commitments at current prices. Maintain flexibility to benefit from future price drops.

Related Terms

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.


Token

The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.


Cost Per Query

The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.


GPU Instance

Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.


Serverless Inference

Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include AWS Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.


Provisioned Throughput

Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.



Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.