Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Why It Matters for AI Costs
Token pricing is the mechanism that converts your AI usage into dollars. Understanding its structure is the foundation of AI cost management.
The pricing landscape is complex because every model has different rates, and the spread between the cheapest and most expensive options is enormous:
- Cheapest option: Gemini 2.0 Flash at $0.10/$0.40 per MTok (input/output)
- Most expensive option: Claude 3 Opus at $15.00/$75.00 per MTok
- Price ratio: 150x for input, 187.5x for output
This means the same query can cost two orders of magnitude more or less depending on which model you choose: 1,000 input tokens cost $0.0001 on Gemini 2.0 Flash versus $0.015 on Claude 3 Opus — a difference that compounds to tens of thousands of dollars at scale.
Token pricing also introduces asymmetry: output tokens cost 3–5x more than input tokens across all providers. This asymmetry means that reducing output length has a higher ROI than reducing input length for most workloads. A team that cuts average output from 500 tokens to 300 tokens saves more money than one that cuts input from 2,000 to 1,800 tokens.
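The savings comparison is easy to verify with a few lines, here using GPT-4o's published rates ($2.50 input / $10.00 output per MTok):

```typescript
// Per-request savings from trimming 200 tokens, at GPT-4o rates ($ per 1M tokens)
const INPUT_RATE = 2.50
const OUTPUT_RATE = 10.00

// Cutting output from 500 to 300 tokens (200 tokens saved)
const outputSavings = (200 / 1_000_000) * OUTPUT_RATE // ≈ $0.002 per request

// Cutting input from 2,000 to 1,800 tokens (200 tokens saved)
const inputSavings = (200 / 1_000_000) * INPUT_RATE   // ≈ $0.0005 per request

console.log(outputSavings / inputSavings)             // ≈ 4x more savings per token cut
```

The same 200-token trim is worth four times as much on the output side, which is why output optimization leads the priority list.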
Additionally, discount mechanisms like prompt caching (50–90% off repeated prefixes), batch processing (50% off for async workloads), and volume commitments can significantly alter your effective per-token rate. CostHawk tracks your actual blended rate — the average price you pay per token after all discounts — so you can see your true cost basis and identify where further optimization is possible.
How Token Pricing Works
Token pricing follows a simple formula, but the details matter for accurate cost calculation:
Request Cost = (Input Tokens × Input Rate) + (Output Tokens × Output Rate)
For a request using GPT-4o with 2,000 input tokens and 500 output tokens:
Cost = (2,000 / 1,000,000 × $2.50) + (500 / 1,000,000 × $10.00)
Cost = $0.005 + $0.005
Cost = $0.01
When prompt caching is active, the formula becomes:
Request Cost = (Cached Input Tokens × Cached Rate) + (Uncached Input Tokens × Input Rate) + (Output Tokens × Output Rate)
For the same request on Claude 3.5 Sonnet with 1,500 cached tokens and 500 uncached tokens:
Cost = (1,500 / 1,000,000 × $0.30) + (500 / 1,000,000 × $3.00) + (500 / 1,000,000 × $15.00)
Cost = $0.00045 + $0.0015 + $0.0075
Cost = $0.00945
Key billing details across providers:
- OpenAI rounds token counts per request before applying pricing. The minimum billable unit is 1 token.
- Anthropic bills exact token counts. Cached tokens require a minimum of 1,024 tokens in the prefix for caching to activate. There is a 25% write premium on the first request that populates the cache.
- Google applies tiered pricing for Gemini 1.5 Pro — requests exceeding 128K tokens pay double the per-token rate.
- All providers include system prompts, tool/function definitions, and message role tags ("user:", "assistant:") in the input token count.
CostHawk normalizes all of these provider-specific billing quirks into a unified cost-per-request metric, so you always know exactly what each request costs regardless of which provider and model you used.
Provider Pricing Comparison
A comprehensive comparison of token pricing across all major providers as of March 2026:
| Provider | Model | Input (per 1M) | Output (per 1M) | Cached Input (per 1M) | Batch Input (per 1M) |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | $1.25 | $1.25 |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | $0.075 | $0.075 |
| OpenAI | o1 | $15.00 | $60.00 | $7.50 | $7.50 |
| OpenAI | o3-mini | $1.10 | $4.40 | $0.55 | $0.55 |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | $0.30 | $1.50 |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 | $0.08 | $0.40 |
| Anthropic | Claude 3 Opus | $15.00 | $75.00 | $1.50 | $7.50 |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | $0.025 | $0.05 |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | $0.3125 | $0.625 |
| Mistral | Mistral Large | $2.00 | $6.00 | — | — |
| Mistral | Mistral Small | $0.10 | $0.30 | — | — |
Reading this table for cost optimization:
- The "Cached Input" column shows the effective rate when prompt caching is active. Anthropic's 90% discount on cached input is the most aggressive — if your prompts have consistent prefixes, Anthropic offers the best effective input rate for repeat requests.
- The "Batch Input" column shows rates for asynchronous batch processing. Both OpenAI and Anthropic offer 50% off input tokens for batch requests that tolerate 24-hour latency. Google Gemini offers batch at 50% off as well.
- Mistral does not yet offer prompt caching or batch discounts, but their base rates are very competitive, especially Mistral Small at $0.10/$0.30.
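One way to read the cached-input column is as a weighted average: a sketch of the effective input rate for a given cache-hit fraction, using the table's rates and ignoring Anthropic's one-time 25% cache-write premium:

```typescript
// Effective input rate ($ per 1M tokens) when a fraction of input tokens hit the cache
function effectiveInputRate(
  baseRate: number,
  cachedRate: number,
  cachedFraction: number
): number {
  return cachedFraction * cachedRate + (1 - cachedFraction) * baseRate
}

// Claude 3.5 Sonnet with 80% of input tokens cached: ≈ $0.84/MTok vs the $3.00 base rate
console.log(effectiveInputRate(3.00, 0.30, 0.8))
// GPT-4o at the same hit rate: $1.50/MTok vs the $2.50 base rate
console.log(effectiveInputRate(2.50, 1.25, 0.8))
```

At high cache-hit rates, Anthropic's deeper cache discount undercuts GPT-4o's lower base rate, which is why the hit fraction matters more than the sticker price.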
CostHawk maintains an up-to-date pricing database across all providers and automatically applies the correct rates when calculating your costs. When providers change pricing (which happens frequently), CostHawk updates within 24 hours so your cost reports remain accurate.
The Output-to-Input Cost Ratio
One of the most important patterns in token pricing is the consistent premium on output tokens. Across all providers, output tokens cost 3–5x more than input tokens:
| Model | Input Rate | Output Rate | Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4.0x |
| GPT-4o mini | $0.15 | $0.60 | 4.0x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5.0x |
| Claude 3.5 Haiku | $0.80 | $4.00 | 5.0x |
| Claude 3 Opus | $15.00 | $75.00 | 5.0x |
| Gemini 2.0 Flash | $0.10 | $0.40 | 4.0x |
| o1 | $15.00 | $60.00 | 4.0x |
This ratio has profound implications for cost optimization priorities:
Output-heavy workloads are disproportionately expensive. An application that generates 1,000 output tokens per request pays 4–5x more per output token than per input token. For a 50/50 split of input and output tokens at a 4x premium, output accounts for 80% of the total cost.
Optimization priority matrix:
| Workload Type | Typical I/O Ratio | Output % of Cost | Optimize First |
|---|---|---|---|
| Code generation | 1:2 (more output) | 89% | Output length |
| Chatbot / Q&A | 3:1 (more input) | 57% | Output length, then input |
| Summarization | 10:1 (much more input) | 29% | Input length (context size) |
| Classification | 20:1 (mostly input) | 17% | Input length (prompt efficiency) |
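The percentages in the matrix fall out of the ratio math directly, shown here at a 4x output premium:

```typescript
// Fraction of request cost attributable to output tokens, given the
// input:output token ratio and the output price premium (4x assumed here)
function outputShareOfCost(
  inputTokens: number,
  outputTokens: number,
  premium = 4
): number {
  return (outputTokens * premium) / (inputTokens + outputTokens * premium)
}

console.log(outputShareOfCost(1, 2))  // ≈ 0.89 — code generation
console.log(outputShareOfCost(3, 1))  // ≈ 0.57 — chatbot / Q&A
console.log(outputShareOfCost(10, 1)) // ≈ 0.29 — summarization
console.log(outputShareOfCost(20, 1)) // ≈ 0.17 — classification
```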
The key insight: for most workloads except pure classification, output token optimization delivers more savings per token reduced than input token optimization. Set max_tokens aggressively, instruct the model to be concise, use structured output formats (JSON) instead of natural language when the consumer is code, and avoid requesting explanations when you only need the answer.
Discount Mechanisms
All major providers offer discount mechanisms that can significantly reduce your effective per-token rate. Understanding and leveraging these discounts is one of the highest-ROI cost optimization strategies:
1. Prompt Caching (50–90% input discount)
Prompt caching gives a discount on input tokens when the beginning of your prompt matches a previous request. The mechanics differ by provider:
- Anthropic: 90% discount on cached input tokens. Requires 1,024+ token prefix match. 25% write premium on first request. Cache TTL: 5 minutes (extended with each hit). Best discount in the industry for applications with consistent system prompts.
- OpenAI: 50% discount on cached input tokens. Automatic — no configuration needed. Requires 1,024+ token prefix match. No write premium. Simpler but less aggressive discount.
- Google: Context caching available with explicit API. Pricing varies by model and cache duration. Requires explicit cache creation and management.
Effective savings example: An application with a 4,000-token system prompt, 1,000-token variable context, and 500-token output, making 50,000 requests/day on Claude 3.5 Sonnet:
- Without caching: (5,000 × $3.00/MTok + 500 × $15.00/MTok) × 50,000 = $750 + $375 = $1,125/day
- With caching (90% of system prompt cached): (4,000 × $0.30/MTok + 1,000 × $3.00/MTok + 500 × $15.00/MTok) × 50,000 = $60 + $150 + $375 = $585/day
- Savings: $540/day ($16,200/month)
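The savings figures above can be reproduced in a few lines (Claude 3.5 Sonnet rates, ignoring the one-time cache-write premium):

```typescript
const REQUESTS = 50_000
const INPUT = 3.00, CACHED = 0.30, OUTPUT = 15.00 // $ per 1M tokens

const perMTok = (tokens: number, rate: number) => (tokens / 1_000_000) * rate

// Without caching: 5,000 input + 500 output tokens per request
const withoutCaching = (perMTok(5_000, INPUT) + perMTok(500, OUTPUT)) * REQUESTS

// With caching: 4,000-token system prompt cached, 1,000 variable tokens uncached
const withCaching =
  (perMTok(4_000, CACHED) + perMTok(1_000, INPUT) + perMTok(500, OUTPUT)) * REQUESTS

console.log(withoutCaching)               // ≈ $1,125/day
console.log(withCaching)                  // ≈ $585/day
console.log(withoutCaching - withCaching) // ≈ $540/day saved
```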
2. Batch API (50% input discount)
Both OpenAI and Anthropic offer batch processing APIs that accept async workloads with up to 24-hour completion windows. In exchange, you get 50% off input token pricing. Batch is ideal for: data processing pipelines, nightly analysis jobs, content generation workflows, and any workload that does not require real-time response.
3. Volume Commitments
For teams spending $10,000+/month, most providers offer negotiated volume discounts. OpenAI's Committed Use Discount program offers 10–30% off for annual commitments. Anthropic offers custom pricing for enterprise accounts. These discounts stack with prompt caching and batch discounts.
4. Fine-tuned Model Trade-offs
Fine-tuned models charge a premium (2–6x base rate) but can reduce input tokens by eliminating few-shot examples. At high volume, the input savings can outweigh the rate premium. This is a discount in disguise — you pay more per token but use fewer tokens per request.
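The trade-off reduces to a break-even comparison. A sketch with illustrative numbers — the prompt sizes and the 2x premium are assumptions for the example, not any provider's actual fine-tune pricing:

```typescript
// Break-even check: does a fine-tuned model's token savings outweigh its rate premium?
// Illustrative assumptions, not real fine-tune pricing:
const basePrompt = 2_500 // tokens, including few-shot examples
const tunedPrompt = 500  // few-shot examples eliminated by fine-tuning
const baseRate = 0.15    // $ per 1M input tokens (GPT-4o mini base rate)
const tunedRate = 0.30   // assumed 2x fine-tune premium

const baseCost = (basePrompt / 1_000_000) * baseRate   // ≈ $0.000375 per request
const tunedCost = (tunedPrompt / 1_000_000) * tunedRate // ≈ $0.000150 per request

console.log(baseCost > tunedCost) // true — fewer tokens beat the rate premium here
```

Under these assumptions the fine-tuned model wins despite the premium; the result flips if the eliminated few-shot examples are small relative to the rest of the prompt.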
Calculating Your API Costs
Accurate cost calculation requires accounting for all token types, discount mechanisms, and provider-specific billing quirks. Here is a comprehensive framework:
Step 1: Identify your token volumes
For each endpoint or feature, measure or estimate:
- Average input tokens per request (including system prompt)
- Average output tokens per request
- Percentage of input tokens that are cacheable (static prefix)
- Daily request volume
Step 2: Apply the correct rates
// TypeScript cost calculator
interface CostParams {
  inputTokens: number
  outputTokens: number
  cachedInputTokens: number
  inputRate: number   // per 1M tokens
  outputRate: number  // per 1M tokens
  cachedRate: number  // per 1M tokens
  requestsPerDay: number
}

function calculateDailyCost(params: CostParams): number {
  const uncachedInput = params.inputTokens - params.cachedInputTokens
  const perRequestCost =
    (params.cachedInputTokens / 1_000_000) * params.cachedRate +
    (uncachedInput / 1_000_000) * params.inputRate +
    (params.outputTokens / 1_000_000) * params.outputRate
  return perRequestCost * params.requestsPerDay
}

// Example: GPT-4o with prompt caching
const daily = calculateDailyCost({
  inputTokens: 3000,
  outputTokens: 500,
  cachedInputTokens: 2000,
  inputRate: 2.50,
  outputRate: 10.00,
  cachedRate: 1.25,
  requestsPerDay: 50000,
})

console.log(`Daily cost: $${daily.toFixed(2)}`) // $500.00

Step 3: Account for hidden costs
- Retry overhead: If 5% of requests fail and retry, add 5% to your token volume.
- Tool/function definitions: If you use function calling, tool definitions are tokenized and count as input tokens. A set of 10 function definitions can add 500–2,000 tokens per request.
- Reasoning tokens: Models like o1 and o3-mini consume internal thinking tokens that are billed but not visible in the output. These can be 5–20x the visible output token count.
- Conversation history: For multi-turn applications, factor in the growing context size across turns.
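These adjustments can be folded into the estimate before pricing it. A sketch with illustrative multipliers — `withHiddenCosts` and its parameters are hypothetical names for this example:

```typescript
// Adjust raw token estimates for hidden costs (illustrative multipliers)
interface Estimate { inputTokens: number; outputTokens: number }

function withHiddenCosts(e: Estimate, opts: {
  retryRate: number           // fraction of requests retried, e.g. 0.05
  toolDefTokens: number       // tokens added per request by tool/function definitions
  reasoningMultiplier: number // billed-to-visible output ratio (1 for standard models)
}): Estimate {
  const retry = 1 + opts.retryRate
  return {
    inputTokens: Math.round((e.inputTokens + opts.toolDefTokens) * retry),
    outputTokens: Math.round(e.outputTokens * opts.reasoningMultiplier * retry),
  }
}

// 2,000 in / 500 out, 5% retries, 1,000 tokens of tool definitions, standard model
const adjusted = withHiddenCosts(
  { inputTokens: 2_000, outputTokens: 500 },
  { retryRate: 0.05, toolDefTokens: 1_000, reasoningMultiplier: 1 }
)
console.log(adjusted) // { inputTokens: 3150, outputTokens: 525 }
```

Even with no reasoning tokens, retries and tool definitions inflate the billable input by over 50% in this example.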
Step 4: Build a monthly forecast
Multiply daily cost by 30 and add a 15–25% buffer for usage growth and variance. Update the forecast monthly as actual usage data comes in. CostHawk automates this entire calculation, providing real-time cost tracking, historical trends, and automated forecasting based on your actual usage patterns.
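The forecast step is one line of arithmetic; a minimal sketch using the 15–25% buffer range from the text:

```typescript
// Monthly forecast: daily cost × 30, plus a growth/variance buffer
function monthlyForecast(dailyCost: number, bufferPct = 0.20): number {
  return dailyCost * 30 * (1 + bufferPct)
}

console.log(monthlyForecast(500))       // ≈ $18,000 — $500/day with the default 20% buffer
console.log(monthlyForecast(500, 0.15)) // ≈ $17,250 — lower bound of the 15–25% range
```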
Pricing Trends
AI token pricing has been declining rapidly and shows no signs of stopping. Understanding the trends helps you plan budgets, evaluate build-vs-buy decisions, and time optimization investments.
Historical price trajectory (OpenAI GPT-4 class models):
| Date | Model | Input (per 1M) | Output (per 1M) | % Drop from Previous |
|---|---|---|---|---|
| Mar 2023 | GPT-4 | $30.00 | $60.00 | — |
| Nov 2023 | GPT-4 Turbo | $10.00 | $30.00 | -67% / -50% |
| May 2024 | GPT-4o | $5.00 | $15.00 | -50% / -50% |
| Oct 2024 | GPT-4o (price cut) | $2.50 | $10.00 | -50% / -33% |
In roughly 19 months, GPT-4 class pricing dropped 92% for input and 83% for output. Similar drops occurred across other providers.
Drivers of price decline:
- Hardware improvements: Newer GPU architectures (NVIDIA H100, B100) deliver 2–3x more inference throughput per dollar than predecessors.
- Inference optimization: Techniques like speculative decoding, continuous batching, KV-cache optimization, and quantization allow providers to serve more requests per GPU.
- Competition: With OpenAI, Anthropic, Google, Meta (via open-source), Mistral, and others competing aggressively, pricing pressure is intense.
- Scale economics: As usage grows, providers amortize fixed costs (training, infrastructure) over more tokens, enabling lower per-token prices.
What this means for your budget:
- Costs you commit to today will likely be cheaper in 6 months. Avoid long-term commitments at current rates unless the discount is substantial (30%+).
- Optimize now, benefit twice. Cost optimizations you implement today (prompt caching, model routing, output capping) save money at current prices AND compound with future price cuts.
- Budget for 30–50% annual price decline when forecasting multi-year AI infrastructure costs. This avoids over-provisioning budgets and under-investing in AI capabilities.
- Do not wait for price drops to optimize. The money you save today by optimizing is real money in the bank, regardless of future pricing.
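For multi-year forecasting, the decline assumption can be applied as compound decay — a sketch using the midpoint of the 30–50% range above:

```typescript
// Project per-token rates forward assuming a steady annual price decline
function projectedRate(
  currentRate: number,
  annualDecline: number,
  years: number
): number {
  return currentRate * Math.pow(1 - annualDecline, years)
}

// GPT-4o input rate at a 40% annual decline (midpoint of the 30–50% range)
console.log(projectedRate(2.50, 0.4, 1)) // ≈ $1.50/MTok after one year
console.log(projectedRate(2.50, 0.4, 2)) // ≈ $0.90/MTok after two years
```

Treat the decline rate as a planning assumption, not a guarantee; re-anchor it to actual provider price changes each quarter.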
CostHawk tracks pricing changes across all providers and automatically updates cost calculations when rates change. Historical pricing data in CostHawk lets you see how your effective per-token cost has evolved over time and project future trends based on the historical decline curve.
Frequently Asked Questions
- Why are output tokens more expensive than input tokens?
- How does prompt caching work and how much does it save?
- What is batch API pricing and when should I use it?
- How do I compare pricing across providers for the same task?
- Are there free tiers or credits for AI APIs?
- How does reasoning model pricing (o1, o3-mini) differ from standard models?
- How often do AI API prices change?
- What is the most cost-effective model for high-volume workloads?
Related Terms
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens cost 3–5x more than input tokens across all major providers.
Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50–90% on subsequent requests.
Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
