Glossary · Billing & Pricing · Updated 2026-03-16

Token Pricing

The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.

Definition

What is Token Pricing?

Token pricing is the billing model used by virtually all LLM API providers, where customers pay a per-token fee for every token processed in a request. Pricing is split into at least two rates: input tokens (the prompt, system instructions, and context you send) and output tokens (the model's generated response). Many providers now offer a third rate for cached input tokens, which receive a 50–90% discount when a prompt prefix matches a previously processed request. Rates are quoted per million tokens (MTok) and vary widely by model — from $0.10/MTok for economy models to $75.00/MTok for frontier model output. Understanding token pricing is essential for budgeting, model selection, and cost optimization. Unlike traditional SaaS pricing with flat monthly fees, token pricing means your bill scales directly with usage volume and the specific models you choose, making cost management an ongoing engineering discipline rather than a one-time procurement decision.

Impact

Why It Matters for AI Costs

Token pricing is the mechanism that converts your AI usage into dollars. Understanding its structure is the foundation of AI cost management.

The pricing landscape is complex because every model has different rates, and the spread between the cheapest and most expensive options is enormous:

  • Cheapest option: Gemini 2.0 Flash at $0.10/$0.40 per MTok (input/output)
  • Most expensive option: Claude 3 Opus at $15.00/$75.00 per MTok
  • Price ratio: 150x for input, 187x for output

This means the same 1,000-token prompt can cost anywhere from $0.0001 (Gemini 2.0 Flash input) to $0.015 (Claude 3 Opus input) depending on which model you choose — a 150x difference that compounds to tens of thousands of dollars at scale.

Token pricing also introduces asymmetry: output tokens cost 3–5x more than input tokens across all providers. This asymmetry means that reducing output length has a higher ROI than reducing input length for most workloads. A team that cuts average output from 500 tokens to 300 tokens saves more money than one that cuts input from 2,000 to 1,800 tokens.

Additionally, discount mechanisms like prompt caching (50–90% off repeated prefixes), batch processing (50% off for async workloads), and volume commitments can significantly alter your effective per-token rate. CostHawk tracks your actual blended rate — the average price you pay per token after all discounts — so you can see your true cost basis and identify where further optimization is possible.

How Token Pricing Works

Token pricing follows a simple formula, but the details matter for accurate cost calculation:

Request Cost = (Input Tokens × Input Rate) + (Output Tokens × Output Rate)

For a request using GPT-4o with 2,000 input tokens and 500 output tokens:

Cost = (2,000 / 1,000,000 × $2.50) + (500 / 1,000,000 × $10.00)
Cost = $0.005 + $0.005
Cost = $0.01

When prompt caching is active, the formula becomes:

Request Cost = (Cached Input Tokens × Cached Rate) + (Uncached Input Tokens × Input Rate) + (Output Tokens × Output Rate)

For the same request on Claude 3.5 Sonnet with 1,500 cached tokens and 500 uncached tokens:

Cost = (1,500 / 1,000,000 × $0.30) + (500 / 1,000,000 × $3.00) + (500 / 1,000,000 × $15.00)
Cost = $0.00045 + $0.0015 + $0.0075
Cost = $0.00945

Key billing details across providers:

  • OpenAI rounds token counts per request before applying pricing. The minimum billable unit is 1 token.
  • Anthropic bills exact token counts. Cached tokens require a minimum of 1,024 tokens in the prefix for caching to activate. There is a 25% write premium on the first request that populates the cache.
  • Google applies tiered pricing for Gemini 1.5 Pro — requests exceeding 128K tokens pay double the per-token rate.
  • All providers include system prompts, tool/function definitions, and message role tags ("user:", "assistant:") in the input token count.
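To make one of these quirks concrete, here is a sketch of Google's tiered rate for Gemini 1.5 Pro. The rates are as quoted in this article; the function name is ours, and we assume the doubled rate applies to the entire request once it crosses 128K tokens (check Google's pricing page for the exact mechanics):

```typescript
// Sketch: Gemini 1.5 Pro tiered input pricing. Assumes the doubled
// rate applies to the whole request above the 128K threshold.
const GEMINI_15_PRO_INPUT_RATE = 1.25; // $ per 1M tokens, requests <= 128K
const TIER_THRESHOLD = 128_000;

function gemini15ProInputCost(inputTokens: number): number {
  const rate =
    inputTokens > TIER_THRESHOLD
      ? GEMINI_15_PRO_INPUT_RATE * 2
      : GEMINI_15_PRO_INPUT_RATE;
  return (inputTokens / 1_000_000) * rate;
}

console.log(gemini15ProInputCost(100_000)); // standard tier: $0.125
console.log(gemini15ProInputCost(200_000)); // doubled tier: $0.50
```

A 200K-token request costs four times a 100K-token request, not two: twice the tokens at twice the rate.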

CostHawk normalizes all of these provider-specific billing quirks into a unified cost-per-request metric, so you always know exactly what each request costs regardless of which provider and model you used.

Provider Pricing Comparison

A comprehensive comparison of token pricing across all major providers as of March 2026:

| Provider | Model | Input (per 1M) | Output (per 1M) | Cached Input (per 1M) | Batch Input (per 1M) |
|----------|-------|----------------|-----------------|-----------------------|----------------------|
| OpenAI | GPT-4o | $2.50 | $10.00 | $1.25 | $1.25 |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | $0.075 | $0.075 |
| OpenAI | o1 | $15.00 | $60.00 | $7.50 | $7.50 |
| OpenAI | o3-mini | $1.10 | $4.40 | $0.55 | $0.55 |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | $0.30 | $1.50 |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 | $0.08 | $0.40 |
| Anthropic | Claude 3 Opus | $15.00 | $75.00 | $1.50 | $7.50 |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | $0.025 | $0.05 |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | $0.3125 | $0.625 |
| Mistral | Mistral Large | $2.00 | $6.00 | n/a | n/a |
| Mistral | Mistral Small | $0.10 | $0.30 | n/a | n/a |

Reading this table for cost optimization:

  • The "Cached Input" column shows the effective rate when prompt caching is active. Anthropic's 90% discount on cached input is the most aggressive — if your prompts have consistent prefixes, Anthropic offers the best effective input rate for repeat requests.
  • The "Batch Input" column shows rates for asynchronous batch processing. Both OpenAI and Anthropic offer 50% off (on both input and output tokens) for batch requests that tolerate 24-hour latency. Google Gemini also offers batch at 50% off.
  • Mistral does not yet offer prompt caching or batch discounts, but their base rates are very competitive, especially Mistral Small at $0.10/$0.30.
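As a sketch of how these rates might be encoded programmatically, the table above can be stored as a small lookup structure. The shape and model keys below are illustrative, not any provider's or CostHawk's actual schema:

```typescript
// Sketch: per-model base rates from the comparison table, in $ per 1M
// tokens. Keys and structure are illustrative, not a real schema.
interface ModelRates {
  input: number;
  output: number;
  cachedInput?: number; // omitted where the provider has no caching discount
}

const RATES: Record<string, ModelRates> = {
  "gpt-4o": { input: 2.5, output: 10.0, cachedInput: 1.25 },
  "claude-3.5-sonnet": { input: 3.0, output: 15.0, cachedInput: 0.3 },
  "gemini-2.0-flash": { input: 0.1, output: 0.4, cachedInput: 0.025 },
  "mistral-small": { input: 0.1, output: 0.3 },
};

// Base (uncached, non-batch) cost of a single request.
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = RATES[model];
  if (!rates) throw new Error(`unknown model: ${model}`);
  return (
    (inputTokens / 1_000_000) * rates.input +
    (outputTokens / 1_000_000) * rates.output
  );
}
```

For example, `requestCost("gpt-4o", 2000, 500)` reproduces the $0.01 worked example earlier in this article.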

CostHawk maintains an up-to-date pricing database across all providers and automatically applies the correct rates when calculating your costs. When providers change pricing (which happens frequently), CostHawk updates within 24 hours so your cost reports remain accurate.

The Output-to-Input Cost Ratio

One of the most important patterns in token pricing is the consistent premium on output tokens. Across all providers, output tokens cost 3–5x more than input tokens:

| Model | Input Rate | Output Rate | Ratio |
|-------|------------|-------------|-------|
| GPT-4o | $2.50 | $10.00 | 4.0x |
| GPT-4o mini | $0.15 | $0.60 | 4.0x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5.0x |
| Claude 3.5 Haiku | $0.80 | $4.00 | 5.0x |
| Claude 3 Opus | $15.00 | $75.00 | 5.0x |
| Gemini 2.0 Flash | $0.10 | $0.40 | 4.0x |
| o1 | $15.00 | $60.00 | 4.0x |

This ratio has profound implications for cost optimization priorities:

Output-heavy workloads are disproportionately expensive. An application that generates 1,000 output tokens per request pays 4–5x more per output token than per input token. For a 50/50 split of input and output tokens at a 4x premium, output accounts for 80% of the total cost.

Optimization priority matrix:

| Workload Type | Typical I/O Ratio | Output % of Cost | Optimize First |
|---------------|-------------------|------------------|----------------|
| Code generation | 1:2 (more output) | 89% | Output length |
| Chatbot / Q&A | 3:1 (more input) | 57% | Output length, then input |
| Summarization | 10:1 (much more input) | 29% | Input length (context size) |
| Classification | 20:1 (mostly input) | 17% | Input length (prompt efficiency) |
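The "Output % of Cost" column follows directly from the I/O ratio and the output premium. A quick sketch, assuming the 4x premium used above (the function name is ours):

```typescript
// Share of total cost attributable to output tokens, given an
// input:output token ratio and an output-rate premium over input.
function outputCostShare(inputParts: number, outputParts: number, premium = 4): number {
  const inputCost = inputParts * 1;          // input priced at 1 unit/token
  const outputCost = outputParts * premium;  // output priced at `premium` units/token
  return outputCost / (inputCost + outputCost);
}

console.log(outputCostShare(1, 2));  // code generation: ~0.89
console.log(outputCostShare(3, 1));  // chatbot / Q&A: ~0.57
console.log(outputCostShare(10, 1)); // summarization: ~0.29
console.log(outputCostShare(20, 1)); // classification: ~0.17
```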

The key insight: for most workloads except pure classification, output token optimization delivers more savings per token reduced than input token optimization. Set max_tokens aggressively, instruct the model to be concise, use structured output formats (JSON) instead of natural language when the consumer is code, and avoid requesting explanations when you only need the answer.

Discount Mechanisms

All major providers offer discount mechanisms that can significantly reduce your effective per-token rate. Understanding and leveraging these discounts is one of the highest-ROI cost optimization strategies:

1. Prompt Caching (50–90% input discount)

Prompt caching gives a discount on input tokens when the beginning of your prompt matches a previous request. The mechanics differ by provider:

  • Anthropic: 90% discount on cached input tokens. Requires 1,024+ token prefix match. 25% write premium on first request. Cache TTL: 5 minutes (extended with each hit). Best discount in the industry for applications with consistent system prompts.
  • OpenAI: 50% discount on cached input tokens. Automatic — no configuration needed. Requires 1,024+ token prefix match. No write premium. Simpler but less aggressive discount.
  • Google: Context caching available with explicit API. Pricing varies by model and cache duration. Requires explicit cache creation and management.

Effective savings example: An application with a 4,000-token system prompt, 1,000-token variable context, and 500-token output, making 50,000 requests/day on Claude 3.5 Sonnet:

  • Without caching: (5,000 × $3.00/MTok + 500 × $15.00/MTok) × 50,000 = $750 + $375 = $1,125/day
  • With caching (system prompt cached at the 90% discount): (4,000 × $0.30/MTok + 1,000 × $3.00/MTok + 500 × $15.00/MTok) × 50,000 = $60 + $150 + $375 = $585/day
  • Savings: $540/day ($16,200/month)
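The arithmetic above can be reproduced in a few lines (Claude 3.5 Sonnet rates as quoted in this article):

```typescript
// Sketch: daily cost with and without prompt caching for the example
// above. Rates: $3.00/MTok input, $0.30/MTok cached, $15.00/MTok output.
const REQUESTS_PER_DAY = 50_000;
const MTOK = 1_000_000;

// 4,000-token system prompt + 1,000-token variable context + 500-token output.
const withoutCaching = ((5_000 * 3.0 + 500 * 15.0) / MTOK) * REQUESTS_PER_DAY;
const withCaching = ((4_000 * 0.3 + 1_000 * 3.0 + 500 * 15.0) / MTOK) * REQUESTS_PER_DAY;

console.log(withoutCaching);               // 1125 ($/day)
console.log(withCaching);                  // 585 ($/day)
console.log(withoutCaching - withCaching); // 540 ($/day saved)
```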

2. Batch API (50% discount)

Both OpenAI and Anthropic offer batch processing APIs that accept async workloads with up to 24-hour completion windows. In exchange, you get 50% off token pricing on both input and output. Batch is ideal for: data processing pipelines, nightly analysis jobs, content generation workflows, and any workload that does not require real-time response.

3. Volume Commitments

For teams spending $10,000+/month, most providers offer negotiated volume discounts. OpenAI's Committed Use Discount program offers 10–30% off for annual commitments. Anthropic offers custom pricing for enterprise accounts. These discounts stack with prompt caching and batch discounts.

4. Fine-tuned Model Trade-offs

Fine-tuned models charge a premium (2–6x base rate) but can reduce input tokens by eliminating few-shot examples. At high volume, the input savings can outweigh the rate premium. This is a discount in disguise — you pay more per token but use fewer tokens per request.
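A sketch of that break-even, with illustrative numbers throughout (the rates, the 2x premium, and the token counts below are assumptions for the example, not any provider's actual fine-tuning pricing):

```typescript
// Sketch: when do fewer input tokens beat a higher per-token rate?
// All numbers are illustrative assumptions, not provider pricing.
const BASE_RATE = 0.15;        // $/MTok, base model input
const TUNED_RATE = 0.30;       // $/MTok, fine-tuned model (2x premium)
const FEW_SHOT_TOKENS = 1_500; // example tokens the tuned model no longer needs

function inputCost(tokens: number, ratePerMTok: number): number {
  return (tokens / 1_000_000) * ratePerMTok;
}

// The base model needs few-shot examples on every request; the tuned one does not.
const taskTokens = 800; // the actual task prompt
const baseCost = inputCost(taskTokens + FEW_SHOT_TOKENS, BASE_RATE);
const tunedCost = inputCost(taskTokens, TUNED_RATE);
console.log(baseCost > tunedCost); // tuned wins when saved tokens outweigh the premium
```

With these numbers the tuned model wins whenever the task prompt is under 1,500 tokens: at a 2x premium, the break-even point is exactly the number of few-shot tokens eliminated.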

Calculating Your API Costs

Accurate cost calculation requires accounting for all token types, discount mechanisms, and provider-specific billing quirks. Here is a comprehensive framework:

Step 1: Identify your token volumes

For each endpoint or feature, measure or estimate:

  • Average input tokens per request (including system prompt)
  • Average output tokens per request
  • Percentage of input tokens that are cacheable (static prefix)
  • Daily request volume

Step 2: Apply the correct rates

// TypeScript cost calculator
interface CostParams {
  inputTokens: number
  outputTokens: number
  cachedInputTokens: number
  inputRate: number      // per 1M tokens
  outputRate: number     // per 1M tokens
  cachedRate: number     // per 1M tokens
  requestsPerDay: number
}

function calculateDailyCost(params: CostParams): number {
  const uncachedInput = params.inputTokens - params.cachedInputTokens
  const perRequestCost =
    (params.cachedInputTokens / 1_000_000) * params.cachedRate +
    (uncachedInput / 1_000_000) * params.inputRate +
    (params.outputTokens / 1_000_000) * params.outputRate
  return perRequestCost * params.requestsPerDay
}

// Example: GPT-4o with prompt caching
const daily = calculateDailyCost({
  inputTokens: 3000,
  outputTokens: 500,
  cachedInputTokens: 2000,
  inputRate: 2.50,
  outputRate: 10.00,
  cachedRate: 1.25,
  requestsPerDay: 50000,
})
console.log(`Daily cost: $${daily.toFixed(2)}`)  // $500.00

Step 3: Account for hidden costs

  • Retry overhead: If 5% of requests fail and retry, add 5% to your token volume.
  • Tool/function definitions: If you use function calling, tool definitions are tokenized and count as input tokens. A set of 10 function definitions can add 500–2,000 tokens per request.
  • Reasoning tokens: Models like o1 and o3-mini consume internal thinking tokens that are billed but not visible in the output. These can be 5–20x the visible output token count.
  • Conversation history: For multi-turn applications, factor in the growing context size across turns.
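These hidden costs can be folded into a token estimate as multipliers and per-request overheads. A rough sketch, where the retry rate, tool-definition size, and reasoning multiplier are illustrative assumptions:

```typescript
// Sketch: adjust a naive per-request token estimate for hidden costs.
// All parameter values below are illustrative, not measured.
interface HiddenCostParams {
  baseInputTokens: number;
  baseOutputTokens: number;
  retryRate: number;            // fraction of requests retried, e.g. 0.05
  toolDefinitionTokens: number; // tool/function schemas added to every request
  reasoningMultiplier: number;  // billed output / visible output (1 = none)
}

function adjustedTokens(p: HiddenCostParams): { input: number; output: number } {
  const retryFactor = 1 + p.retryRate; // retries reprocess the full request
  return {
    input: (p.baseInputTokens + p.toolDefinitionTokens) * retryFactor,
    output: p.baseOutputTokens * p.reasoningMultiplier * retryFactor,
  };
}

const adj = adjustedTokens({
  baseInputTokens: 2_000,
  baseOutputTokens: 500,
  retryRate: 0.05,
  toolDefinitionTokens: 1_000,
  reasoningMultiplier: 1, // standard (non-reasoning) model
});
console.log(adj); // input 3150, output 525
```

Feeding these adjusted volumes into the Step 2 calculator gives a more honest daily figure than the raw per-request averages.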

Step 4: Build a monthly forecast

Multiply daily cost by 30 and add a 15–25% buffer for usage growth and variance. Update the forecast monthly as actual usage data comes in. CostHawk automates this entire calculation, providing real-time cost tracking, historical trends, and automated forecasting based on your actual usage patterns.

Pricing Trends

AI token pricing has been declining rapidly and shows no signs of stopping. Understanding the trends helps you plan budgets, evaluate build-vs-buy decisions, and time optimization investments.

Historical price trajectory (OpenAI GPT-4 class models):

| Date | Model | Input (per 1M) | Output (per 1M) | % Drop from Previous |
|------|-------|----------------|-----------------|----------------------|
| Mar 2023 | GPT-4 | $30.00 | $60.00 | n/a |
| Nov 2023 | GPT-4 Turbo | $10.00 | $30.00 | -67% / -50% |
| May 2024 | GPT-4o | $5.00 | $15.00 | -50% / -50% |
| Oct 2024 | GPT-4o (price cut) | $2.50 | $10.00 | -50% / -33% |

In approximately 18 months, GPT-4 class pricing dropped 92% for input and 83% for output. Similar drops occurred across other providers.

Drivers of price decline:

  • Hardware improvements: Newer GPU architectures (NVIDIA H100, B100) deliver 2–3x more inference throughput per dollar than predecessors.
  • Inference optimization: Techniques like speculative decoding, continuous batching, KV-cache optimization, and quantization allow providers to serve more requests per GPU.
  • Competition: With OpenAI, Anthropic, Google, Meta (via open-source), Mistral, and others competing aggressively, pricing pressure is intense.
  • Scale economics: As usage grows, providers amortize fixed costs (training, infrastructure) over more tokens, enabling lower per-token prices.

What this means for your budget:

  • Costs you commit to today will likely be cheaper in 6 months. Avoid long-term commitments at current rates unless the discount is substantial (30%+).
  • Optimize now, benefit twice. Cost optimizations you implement today (prompt caching, model routing, output capping) save money at current prices AND compound with future price cuts.
  • Budget for 30–50% annual price decline when forecasting multi-year AI infrastructure costs. This avoids over-provisioning budgets and under-investing in AI capabilities.
  • Do not wait for price drops to optimize. The money you save today by optimizing is real money in the bank, regardless of future pricing.
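The decline assumption can be applied as a simple compound projection. The 40% figure below is an illustrative midpoint of the 30–50% range, a planning assumption rather than a forecast:

```typescript
// Sketch: project a per-MTok rate forward under an assumed annual
// price decline. The decline rate is a planning assumption.
function projectedRate(currentRate: number, annualDecline: number, years: number): number {
  return currentRate * Math.pow(1 - annualDecline, years);
}

// $10.00/MTok output today, assuming a 40% annual decline:
console.log(projectedRate(10.0, 0.4, 1)); // ~$6.00 after one year
console.log(projectedRate(10.0, 0.4, 2)); // ~$3.60 after two years
```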

CostHawk tracks pricing changes across all providers and automatically updates cost calculations when rates change. Historical pricing data in CostHawk lets you see how your effective per-token cost has evolved over time and project future trends based on the historical decline curve.

FAQ

Frequently Asked Questions

Why are output tokens more expensive than input tokens?
Output tokens cost 3–5x more than input tokens because of the fundamental asymmetry in how transformers process input versus generate output. During the prefill phase, the model processes all input tokens in parallel — a GPU-friendly operation that achieves high throughput. During the decode phase, the model generates output tokens one at a time, each requiring a full forward pass through the model. This sequential generation cannot be parallelized, resulting in lower GPU utilization and higher per-token compute cost. Additionally, the key-value (KV) cache that stores intermediate representations grows with each generated token, consuming GPU memory that could otherwise be used for processing other requests. The combination of sequential computation, growing memory overhead, and lower GPU utilization makes output generation inherently more expensive, and providers pass this cost difference through to their pricing.
How does prompt caching work and how much does it save?
Prompt caching stores the computed internal representations (KV-cache) of your prompt prefix so they do not need to be recomputed on subsequent requests with the same prefix. When you send a request with a 3,000-token system prompt and a 500-token user query, the provider caches the KV-cache for those 3,000 system tokens. On the next request with the same system prompt, the provider skips the expensive prefill computation for those tokens and charges a discounted rate. Anthropic offers a 90% discount on cached tokens (e.g., $0.30/MTok instead of $3.00/MTok for Claude 3.5 Sonnet). OpenAI offers 50% off ($1.25/MTok instead of $2.50/MTok for GPT-4o). For an application making 50,000 requests/day with a 3,000-token system prompt on Claude 3.5 Sonnet, caching saves: (3,000 × $2.70/MTok × 50,000) / 1,000,000 = $405/day or $12,150/month. The savings scale linearly with your system prompt size and request volume.
What is batch API pricing and when should I use it?
Batch API pricing offers a 50% discount on input and output tokens for requests submitted through an asynchronous batch endpoint. Instead of receiving immediate responses, you submit a batch of requests and receive results within a 24-hour window (typically much faster). OpenAI's batch API charges $1.25/$5.00 per MTok (vs $2.50/$10.00 real-time) for GPT-4o. Anthropic's message batches charge $1.50/$7.50 per MTok (vs $3.00/$15.00) for Claude 3.5 Sonnet. Use batch processing for: nightly data analysis jobs, content generation pipelines, document classification at scale, and any workload where latency is not critical. Do not use it for: user-facing chatbots, real-time recommendations, or any feature where users are waiting for a response. If 30% of your workload is batch-eligible, moving it to the batch API halves the cost of that portion (roughly 15% off your total bill) with zero quality trade-off.
How do I compare pricing across providers for the same task?
Comparing pricing across providers requires normalizing for both cost and quality. Here is a practical framework: (1) Define a representative set of test prompts for your actual workload. (2) Run each prompt against candidate models from each provider, recording token counts, latency, and output quality scores. (3) Calculate the cost per request using each provider's pricing: cost = (input_tokens × input_rate + output_tokens × output_rate) / 1,000,000. (4) Compute a cost-quality score: quality_score / cost_per_request. The model with the highest quality-per-dollar ratio is your best option. Important nuances: different tokenizers produce different token counts for the same text (so the same prompt may cost more tokens on one provider), and output length varies by model (some models are more verbose, generating more output tokens for the same prompt). CostHawk normalizes cross-provider comparisons by tracking actual token counts and costs for real production requests.
Are there free tiers or credits for AI APIs?
Yes, most providers offer free tiers or promotional credits for new users. OpenAI provides $5–$18 in free credits for new accounts (varies by region and time), which covers approximately 2–72 million tokens depending on the model. Anthropic offers a free tier through the API with rate limits (typically 40 requests/minute, 1M tokens/day for Claude 3.5 Sonnet). Google Gemini offers a generous free tier: Gemini 2.0 Flash allows 1,500 requests/day free of charge. Mistral offers a free tier through their La Plateforme with rate limits. Additionally, cloud providers (AWS Bedrock, Azure OpenAI, Google Cloud Vertex AI) often bundle AI API credits with cloud commitments. For startups, many providers have special programs: Anthropic has a startup program, OpenAI has a research access program, and Google's Cloud for Startups program includes Gemini credits. These free tiers are sufficient for development and prototyping but typically insufficient for production workloads.
How does reasoning model pricing (o1, o3-mini) differ from standard models?
Reasoning models have a fundamentally different pricing dynamic because of internal 'thinking tokens.' When a reasoning model processes a request, it generates extensive internal reasoning chains before producing the visible output. These thinking tokens are billed as output tokens even though the user never sees them. For example, OpenAI's o1 charges $15.00/MTok for input and $60.00/MTok for output (including thinking). A request with 1,000 input tokens and 200 visible output tokens might consume 3,000 thinking tokens internally, making the output cost (200 + 3,000) × $60.00/MTok = $0.192, plus $0.015 for input: $0.207 total. Compare this to GPT-4o for the same request: 1,000 × $2.50/MTok + 200 × $10.00/MTok = $0.0045 — the reasoning model is roughly 46x more expensive. OpenAI's o3-mini is a more economical reasoning option at $1.10/$4.40 per MTok with a 'low' reasoning effort setting that limits thinking tokens. Use reasoning models selectively for tasks that genuinely require step-by-step logic, not for general-purpose generation.
How often do AI API prices change?
AI API prices change frequently — typically every 3–6 months per provider, with the overall trend being downward. In 2024 alone, OpenAI adjusted GPT-4o pricing twice, Anthropic released three new model tiers with different pricing, and Google cut Gemini pricing multiple times. Price changes are almost always reductions for existing models, though new models may launch at higher prices before declining. Providers rarely increase prices for existing models — doing so would be a competitive disadvantage. When a provider releases a new model version (e.g., GPT-4o replacing GPT-4 Turbo), the new model is typically both better and cheaper than the old one. The practical implication is that cost forecasts older than 6 months are unreliable. Build your budgets with an assumed 30–50% annual price decline and re-evaluate quarterly. CostHawk tracks pricing changes across all providers and adjusts cost calculations automatically, so your dashboards always reflect current rates.
What is the most cost-effective model for high-volume workloads?
For high-volume workloads (100K+ requests/day), the most cost-effective model depends on your quality requirements. For tasks where economy models suffice (classification, extraction, simple formatting): Gemini 2.0 Flash at $0.10/$0.40 per MTok is the cheapest option, followed by Mistral Small at $0.10/$0.30 and GPT-4o mini at $0.15/$0.60. With prompt caching, Gemini 2.0 Flash drops to an effective $0.025/$0.40 for cached input — virtually free input tokens. For tasks requiring mid-tier capability: GPT-4o with prompt caching ($1.25/$10.00 effective rate) competes well with Claude 3.5 Sonnet with caching ($0.30/$15.00). The choice depends on whether your workload is output-heavy (GPT-4o wins on output rate) or has large cacheable prefixes (Claude wins on cached input rate). At 1M+ requests/day, contact providers for custom pricing — volume discounts of 10–30% are standard, and they stack with caching and batch discounts.

Related Terms

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.