Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Definition
What is Pay-Per-Token?
Pay-per-token is a usage-based billing model in which AI API providers charge customers based on the exact number of tokens processed in each request. There is no monthly subscription fee, no seat license, and no upfront commitment — you pay only for what you use. Input tokens (your prompt) and output tokens (the model's response) are metered separately, with output tokens typically costing 2–5x more than input tokens. This is the default pricing model for OpenAI, Anthropic, Google, Mistral, and virtually every other LLM API provider as of 2026.
For example, calling Claude 3.5 Sonnet via the Anthropic API costs $3.00 per million input tokens and $15.00 per million output tokens. If your application sends a 1,500-token prompt and receives a 500-token response, you pay (1,500 / 1,000,000 × $3.00) + (500 / 1,000,000 × $15.00) = $0.0045 + $0.0075 = $0.012 per request. Scale that to 100,000 requests per day and you are spending $1,200 daily — $36,000 per month — with no contractual cap unless you set one yourself.
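The arithmetic above can be sketched as a small helper using the Claude 3.5 Sonnet rates quoted in this section (the function name and structure are illustrative, not a CostHawk API):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Return the USD cost of one API call.

    Rates are USD per million tokens (MTok), as providers quote them.
    """
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate

# The example from the text: Claude 3.5 Sonnet at $3.00 / $15.00 per MTok.
cost = request_cost(1_500, 500, 3.00, 15.00)
print(f"${cost:.4f} per request")          # → $0.0120 per request
print(f"${cost * 100_000:,.0f} per day")   # → $1,200 per day at 100k requests
```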
Impact
Why It Matters for AI Costs
Pay-per-token is deceptively simple. The per-request cost looks tiny, but costs compound unpredictably as traffic grows, prompts get longer, and agentic workflows chain multiple calls together. Teams that do not monitor token-level spending routinely discover bills 3–10x higher than forecasted. Unlike traditional SaaS where costs are fixed and known in advance, pay-per-token pricing shifts all volume risk to the customer. CostHawk exists specifically to make this pricing model manageable — tracking every token, every request, and every dollar across all your providers in real time.
What is Pay-Per-Token Pricing?
Pay-per-token pricing is the standard billing mechanism for large language model APIs. Every API call you make is metered by counting the tokens in your input (the prompt, system message, conversation history, and any tool definitions) and the tokens in the output (the model's generated response). The provider multiplies each count by the respective per-token rate for the model you called, sums the two, and adds the result to your running bill.
This model has three defining characteristics:
- Zero fixed cost: If you make zero API calls in a month, you pay nothing. There is no base fee, platform fee, or minimum spend (though some enterprise contracts add these).
- Linear scaling: Your cost scales linearly with usage. Twice the tokens means twice the cost. There are no volume discounts in the standard pay-per-token model (volume discounts require separate committed-use agreements).
- Separate input/output rates: Output tokens are more expensive because autoregressive generation is more compute-intensive than encoding input. The ratio varies by provider and model — GPT-4o charges $2.50/$10.00 (4x), Claude 3.5 Sonnet charges $3.00/$15.00 (5x), and Gemini 1.5 Pro charges $1.25/$5.00 (4x).
This pricing model dominates because it aligns provider revenue with compute consumption. Providers pay for GPU-seconds, and token counts are a reasonable proxy for GPU time used. For customers, the appeal is low barriers to entry — you can experiment with a few dollars and scale up gradually.
How Usage-Based Billing Works
Understanding the mechanics of pay-per-token billing is essential for accurate forecasting. Here is the formula:
Request Cost = (Input Tokens × Input Price per Token) + (Output Tokens × Output Price per Token)

Providers quote prices per million tokens (MTok) for readability, but the actual billing unit is the individual token. Let's walk through a concrete calculation for a customer support chatbot using GPT-4o:
| Component | Tokens | Rate (per 1M) | Cost |
|---|---|---|---|
| System prompt | 800 | $2.50 input | $0.0020 |
| Conversation history (5 turns) | 2,400 | $2.50 input | $0.0060 |
| User message | 150 | $2.50 input | $0.000375 |
| Model response | 350 | $10.00 output | $0.0035 |
| Total per request | 3,700 | — | $0.01188 |
At 50,000 conversations per day with an average of 6 turns each, that is 300,000 API calls per day. Daily cost: 300,000 × $0.01188 = $3,564. Monthly cost: $106,920. Notice that the conversation history grows with each turn — turn 6 sends all previous turns as input, making later turns significantly more expensive than earlier ones. A 10-turn conversation costs roughly 3x more in total than a 3-turn conversation because of this compounding input cost.
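The compounding effect of resending history can be estimated with a short loop. This is a sketch that assumes each turn appends a fixed number of user and assistant tokens to the history; the default token counts below mirror the table above and are illustrative:

```python
def conversation_cost(turns: int, system_tokens: int = 800,
                      user_tokens: int = 150, reply_tokens: int = 350,
                      input_rate: float = 2.50, output_rate: float = 10.00) -> float:
    """Total USD cost of a conversation where every turn resends all history.

    Rates are per million tokens (GPT-4o pricing from the table above).
    """
    total, history = 0.0, 0
    for _ in range(turns):
        input_tokens = system_tokens + history + user_tokens
        total += (input_tokens * input_rate + reply_tokens * output_rate) / 1_000_000
        history += user_tokens + reply_tokens  # both sides join the history

    return total

# Later turns cost more because input grows every turn; the exact ratio
# between short and long conversations depends on the per-turn token counts.
print(f"3 turns:  ${conversation_cost(3):.4f}")
print(f"10 turns: ${conversation_cost(10):.4f}")
```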
Key nuances that affect your bill:
- Cached tokens: Anthropic offers a 90% discount on prompt caching hits. OpenAI offers 50% off cached input tokens. If your system prompt is cached, the 800-token system prompt above drops from $0.002 to $0.0002 (Anthropic) or $0.001 (OpenAI).
- Minimum billing: Some providers round up to minimum token counts. Always check your actual billed tokens against your expected tokens.
- Tool/function definitions: Tool schemas sent to the model count as input tokens. A complex tool schema can add 500–2,000 tokens to every request.
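The caching discounts above fold into the same arithmetic. A sketch, using the discount rates quoted in this section (90% off cache hits for Anthropic, 50% for OpenAI) applied to the 800-token system prompt from the table:

```python
def cached_input_cost(tokens: int, rate_per_mtok: float,
                      cache_discount: float) -> float:
    """USD cost of input tokens that hit the provider's prompt cache.

    cache_discount is the fraction taken off the normal input rate,
    e.g. 0.90 for Anthropic prompt caching, 0.50 for OpenAI.
    """
    return (tokens / 1_000_000) * rate_per_mtok * (1 - cache_discount)

# The 800-token system prompt from the table, billed at $2.50/MTok:
print(f"{cached_input_cost(800, 2.50, 0.90):.4f}")  # → 0.0002 (90% discount)
print(f"{cached_input_cost(800, 2.50, 0.50):.4f}")  # → 0.0010 (50% discount)
```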
Pay-Per-Token vs Flat-Rate Subscriptions
The AI industry now offers two distinct pricing paradigms. Pay-per-token (API billing) charges for exact usage, while flat-rate subscriptions (Claude Max at $100–$200/mo, ChatGPT Pro at $200/mo, ChatGPT Plus at $20/mo) charge a fixed monthly fee for a usage allowance. Each model has clear winners and losers depending on your situation.
| Factor | Pay-Per-Token (API) | Flat-Rate Subscription |
|---|---|---|
| Cost predictability | Low — varies with usage | High — fixed monthly bill |
| Cost efficiency at low volume | Excellent — pay only what you use | Poor — paying for unused capacity |
| Cost efficiency at high volume | Can be expensive — no volume cap | Excellent — unlimited within plan limits |
| Programmatic access | Full API access, any integration | Limited — chat UI or limited API |
| Model selection | Any model, switch per-request | Provider's bundled models only |
| Burst capacity | Scales instantly (within rate limits) | Throttled during peak periods |
| Multi-user support | One key serves entire application | Per-seat pricing adds up fast |
| Agentic workloads | Required — agents make API calls | Not supported for automated pipelines |
| Budget control | Requires monitoring (CostHawk) | Built-in — the subscription IS the cap |
The breakeven point varies by model. For Claude 3.5 Sonnet, a developer making roughly 250 complex requests per day (averaging 4,000 input + 1,000 output tokens each, or about $0.027 per request) would spend approximately $200/month on the API — making Claude Max's $200/month plan break even. Below that volume, pay-per-token is cheaper. Above it, the flat rate wins on cost but loses on flexibility. For production applications serving end users, pay-per-token is the only viable option because flat-rate subscriptions are designed for individual human use and their terms of service prohibit programmatic access at scale.
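The breakeven arithmetic can be checked directly. A sketch using the Claude 3.5 Sonnet rates and request profile described above (assumes a 30-day month):

```python
def breakeven_requests_per_day(subscription_usd_per_month: float,
                               input_tokens: int, output_tokens: int,
                               input_rate: float, output_rate: float,
                               days: int = 30) -> float:
    """Daily request volume at which API spend equals a flat subscription.

    Rates are USD per million tokens.
    """
    per_request = (input_tokens * input_rate
                   + output_tokens * output_rate) / 1_000_000
    return subscription_usd_per_month / (per_request * days)

# $200/mo vs Claude 3.5 Sonnet ($3 / $15 per MTok), 4,000 in + 1,000 out:
print(round(breakeven_requests_per_day(200, 4_000, 1_000, 3.00, 15.00)))  # → 247
```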
Cost Predictability Challenges
The biggest drawback of pay-per-token pricing is cost unpredictability. Unlike a $200/month subscription where the bill is known before the month starts, pay-per-token costs are determined by factors that change daily — sometimes hourly. Here are the primary sources of variance:
- Traffic spikes: A viral feature, marketing campaign, or seasonal event can 5–10x your daily API call volume overnight. If your baseline is $500/day and traffic triples, you are suddenly spending $1,500/day with no automatic circuit breaker.
- Prompt drift: Development teams frequently add context, examples, and instructions to prompts without realizing the cost impact. A system prompt that starts at 500 tokens in v1 often grows to 3,000 tokens by v10. Across 100,000 daily requests at $3/MTok input, that 2,500-token increase costs an extra $750/day ($22,500/month).
- Agent loops: Agentic workflows (tool-use, chain-of-thought, multi-step reasoning) make multiple API calls to complete a single user request. A ReAct agent might make 5–15 API calls per task. If each call averages 3,000 tokens in and 800 tokens out using Claude 3.5 Sonnet, each call costs about $0.021, so a single agent run costs $0.105–$0.315. At 10,000 agent runs per day, that is $1,050–$3,150 daily — and the variance between 5-call and 15-call runs makes forecasting extremely difficult.
- Retry storms: When a downstream service is degraded, retry logic can amplify costs. A 30-minute outage with aggressive retries (no exponential backoff) can generate 10x normal traffic and 10x normal cost during that window.
- Model upgrades: Switching from GPT-4o-mini ($0.15/$0.60 per MTok) to GPT-4o ($2.50/$10.00 per MTok) increases costs by 16x on the same traffic. Teams sometimes upgrade models in development and forget to benchmark the cost impact before deploying to production.
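A minimal guardrail against several of these failure modes is a trailing-average spike check. This is a sketch of the idea only, not a production anomaly detector; the 3x threshold is an assumption:

```python
def is_spend_spike(daily_spend: list[float], threshold: float = 3.0) -> bool:
    """Flag today's spend if it exceeds `threshold` times the trailing average.

    daily_spend: USD totals, oldest first; the last entry is today.
    """
    if len(daily_spend) < 2:
        return False  # not enough history to compare against
    *history, today = daily_spend
    baseline = sum(history) / len(history)
    return baseline > 0 and today > threshold * baseline

# A 3.2x jump over a ~$500/day baseline trips the check:
print(is_spend_spike([500, 520, 480, 510, 1_600]))  # → True
print(is_spend_spike([500, 520, 480, 510, 600]))    # → False
```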
CostHawk addresses these challenges through real-time cost tracking, anomaly detection alerts, and per-key budget controls that can automatically disable keys when spending exceeds thresholds.
When to Switch from Pay-Per-Token
Pay-per-token is not always the most cost-effective option. Here is a framework for evaluating when to switch to alternatives:
Switch to Provisioned Throughput when: Your monthly API spend exceeds $10,000 and your traffic is predictable (less than 2x variance between peak and trough). Anthropic, OpenAI, and AWS Bedrock offer reserved capacity at 30–50% discounts over pay-per-token rates. The breakeven for Anthropic's reserved capacity is roughly $8,000/month — below that, pay-per-token is cheaper because you would be paying for unused capacity.
Switch to Batch API when: Your workload is latency-tolerant (results needed in hours, not seconds). OpenAI's Batch API offers a 50% discount on GPT-4o ($1.25/$5.00 per MTok instead of $2.50/$10.00). Anthropic's Message Batches API offers similar discounts. If even 30% of your traffic can tolerate batch processing (e.g., nightly content generation, bulk classification, data extraction), you save 15% on your total bill by routing those requests to batch endpoints.
Switch to self-hosted models when: Your monthly spend exceeds $50,000 and your use case is well-served by open-source models (Llama 3, Mixtral, Qwen). Running Llama 3 70B on 4x A100 GPUs costs approximately $8,000–$12,000/month in cloud compute, which serves roughly the same throughput as $40,000–$60,000/month in API costs for a comparable model. The breakeven is typically 3–6 months including setup costs.
Stay on pay-per-token when: Your spend is under $5,000/month, your traffic is spiky and unpredictable, you need access to frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) that cannot be self-hosted, or you are in an early experimentation phase where usage patterns have not stabilized.
Managing Pay-Per-Token Costs
Effective cost management under pay-per-token pricing requires a layered approach combining monitoring, optimization, and guardrails:
1. Real-time visibility: You cannot optimize what you cannot see. Instrument every API call with cost tracking that captures model, token counts, latency, and request metadata (user ID, feature, environment). CostHawk provides this out of the box with a single SDK integration or proxy configuration.
2. Budget controls: Set hard and soft limits at every level — organization, project, API key, and individual user. A hard limit stops requests when the budget is exhausted. A soft limit sends alerts but allows traffic to continue. CostHawk supports both, with configurable alert thresholds at 50%, 75%, 90%, and 100% of budget.
3. Model routing: Not every request needs your most expensive model. Route simple tasks (classification, extraction, reformatting) to cheaper models like GPT-4o-mini ($0.15/$0.60 per MTok) or Claude 3.5 Haiku ($0.80/$4.00 per MTok) and reserve premium models for complex reasoning tasks. Teams that implement intelligent model routing typically reduce costs by 40–60% with minimal quality impact.
4. Prompt optimization: Audit your prompts quarterly. Remove redundant instructions, compress few-shot examples, and use structured output (JSON mode) to reduce output verbosity. A prompt optimization pass typically yields 20–40% token reduction.
5. Caching: Cache responses for identical or near-identical requests. Semantic caching (matching on intent rather than exact string) can achieve 15–30% cache hit rates for customer support and FAQ use cases, eliminating those API calls entirely. Additionally, use provider-level prompt caching (Anthropic's prompt caching, OpenAI's automatic caching) to reduce input token costs by 50–90% on repeated system prompts and context.
6. Output length control: Set max_tokens appropriate to each use case. A classification task needs 10–50 output tokens, not the default 4,096. Reducing max_tokens does not guarantee shorter responses, but it prevents runaway generation that wastes tokens on content you will truncate anyway.
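Point 3 above — routing simple tasks to cheaper models — can be sketched as a lookup keyed on task type. The task categories and routing table below are illustrative examples built from the prices quoted in this section, not a canonical taxonomy:

```python
# Illustrative routing table: task type → (model, input $/MTok, output $/MTok).
ROUTES = {
    "classification": ("gpt-4o-mini", 0.15, 0.60),
    "extraction":     ("gpt-4o-mini", 0.15, 0.60),
    "reasoning":      ("claude-3-5-sonnet", 3.00, 15.00),
}

def pick_model(task_type: str) -> str:
    """Route a request to the cheapest model adequate for its task type.

    Unknown task types fall back to the strong model to protect quality.
    """
    model, _, _ = ROUTES.get(task_type, ROUTES["reasoning"])
    return model

print(pick_model("classification"))  # → gpt-4o-mini
print(pick_model("planning"))        # → claude-3-5-sonnet (unknown: be safe)
```

The defaulting choice matters: failing cheap risks quality regressions on tasks the router has never seen, so the sketch defaults to the expensive model.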
FAQ
Frequently Asked Questions
How does pay-per-token pricing work for AI APIs?
Pay-per-token pricing charges you based on the exact number of tokens processed in each API call. Every request has two components: input tokens (your prompt, system message, and conversation history) and output tokens (the model's generated response). Each component is billed at a different rate — output tokens typically cost 2–5x more than input tokens. Your bill is the sum of all input and output token costs across all requests in a billing period. For example, if you use Claude 3.5 Sonnet and send 10 million input tokens and receive 2 million output tokens in a month, your bill is (10M × $3.00/MTok) + (2M × $15.00/MTok) = $30 + $30 = $60. There is no base fee, setup cost, or minimum commitment — if you make zero API calls, you pay zero dollars.
Is pay-per-token cheaper than a ChatGPT Pro or Claude Max subscription?
It depends entirely on your usage volume. ChatGPT Pro costs $200/month for unlimited GPT-4o access (with fair-use throttling), and Claude Max costs $100–$200/month depending on the tier. For a single developer making moderate use of the chat interface, these subscriptions are usually better value. However, pay-per-token is cheaper if you use less than approximately $200 worth of tokens per month. The crossover point for Claude 3.5 Sonnet is roughly 50–80 million total tokens per month. Below that, pay-per-token is cheaper. Above it, the flat subscription wins. Critically, subscriptions are per-seat (each developer needs their own), while a single API key on pay-per-token can serve your entire application and all its users. For production applications, pay-per-token is the only option — subscriptions are for human chat use only.
What happens if my pay-per-token costs spike unexpectedly?
Without safeguards, you will simply receive a larger bill at the end of the billing period. Most providers do not automatically cap your spending unless you configure limits. OpenAI allows you to set a monthly budget cap in the dashboard. Anthropic offers workspace-level spend limits. However, these provider-level controls are coarse — they shut off all API access when triggered, which causes application outages. A better approach is using CostHawk to set granular budget alerts and per-key spending limits. CostHawk detects anomalies in real time and can alert you via Slack, email, or webhook within minutes of a cost spike starting. You can also configure automatic key disabling at specific spend thresholds, giving you per-project and per-feature cost isolation so a runaway agent loop in one feature does not exhaust your entire budget.
Do all AI providers use pay-per-token pricing?
Nearly all major LLM API providers use pay-per-token as their default pricing model, including OpenAI, Anthropic, Google (Gemini API), Mistral, Cohere, and AI21 Labs. However, there are variations. Some providers also offer alternative pricing tiers: OpenAI has a Batch API at 50% discount for non-real-time workloads, Anthropic offers reserved capacity for high-volume customers, and AWS Bedrock provides both on-demand (pay-per-token) and provisioned throughput pricing. Image generation models (DALL-E, Stable Diffusion API) typically charge per image rather than per token. Embedding models also use pay-per-token pricing but at much lower rates — OpenAI's text-embedding-3-small costs $0.02 per million tokens, roughly 125x cheaper than GPT-4o input tokens.
How can I forecast my monthly costs under pay-per-token pricing?
Start by measuring your current usage patterns over at least two weeks. Track four metrics: average requests per day, average input tokens per request, average output tokens per request, and the model(s) you are using. Then multiply: [(daily requests × avg input tokens × input price) + (daily requests × avg output tokens × output price)] × 30 days. Add a 20–40% buffer for variance. For example, if you average 5,000 requests/day with 2,000 input tokens and 500 output tokens using GPT-4o: (5,000 × 2,000 × $2.50/MTok) + (5,000 × 500 × $10.00/MTok) = $25 + $25 = $50/day, or $1,500/month. With a 30% buffer, budget $1,950/month. CostHawk's forecasting dashboard automates this calculation using your actual historical usage data and projects costs forward with confidence intervals.
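The forecast described in this answer, as a sketch (the buffer fraction is a judgment call, as the answer notes; a 30-day month is assumed):

```python
def forecast_monthly_cost(requests_per_day: float,
                          avg_input_tokens: float, avg_output_tokens: float,
                          input_rate: float, output_rate: float,
                          buffer: float = 0.30, days: int = 30) -> float:
    """Project monthly USD spend from measured averages, plus a variance buffer.

    Rates are USD per million tokens.
    """
    daily = requests_per_day * (avg_input_tokens * input_rate
                                + avg_output_tokens * output_rate) / 1_000_000
    return daily * days * (1 + buffer)

# The GPT-4o example from the answer: 5,000 req/day, 2,000 in + 500 out.
print(f"${forecast_monthly_cost(5_000, 2_000, 500, 2.50, 10.00):,.0f}")  # → $1,950
```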
Are input and output tokens always priced differently?
Yes, with very few exceptions. Output tokens are more expensive because generating text is computationally harder than processing it. During input processing, the model can evaluate all tokens in parallel (a single forward pass). During output generation, the model must generate tokens one at a time (autoregressive decoding), with each new token requiring a separate forward pass through the model. This sequential process uses more GPU time per token. The price ratio varies by provider and model: GPT-4o charges 4x more for output ($2.50 vs $10.00 per MTok), Claude 3.5 Sonnet charges 5x more ($3.00 vs $15.00 per MTok), and Gemini 1.5 Pro charges 4x more ($1.25 vs $5.00 per MTok). This asymmetry means that controlling output length (via max_tokens and prompt design) is one of the most effective cost optimization strategies.
Can I set spending limits with pay-per-token pricing?
Yes, but the granularity depends on your tooling. At the provider level, OpenAI lets you set a monthly hard cap in the billing dashboard — once reached, all API calls fail. Anthropic offers workspace-level spend limits with similar behavior. These are blunt instruments that kill all traffic when triggered. For finer control, CostHawk provides per-key, per-project, and per-team budget limits with configurable actions (alert-only, throttle, or hard-stop). You can set a $500/month limit on your staging environment's API key while allowing production to spend up to $10,000/month, with Slack alerts at 50%, 75%, and 90% thresholds. This layered approach prevents budget overruns without risking production outages. We recommend always setting a provider-level hard cap as a safety net, even if it is 2–3x your expected spend.
How does pay-per-token pricing work with AI agents and tool use?
AI agents amplify pay-per-token costs significantly because a single user action can trigger multiple API calls in a chain. When an agent uses tools (function calling), the process works like this: (1) the user sends a request, (2) the model responds with a tool call, (3) your application executes the tool and sends the result back, (4) the model processes the result and either calls another tool or returns a final response. Each round-trip is a separate billed API call, and critically, each subsequent call includes the full conversation history from previous steps as input tokens. A 5-step agent run might generate 15,000–25,000 total input tokens and 3,000–5,000 total output tokens, costing $0.09–$0.15 per run with Claude 3.5 Sonnet. Without monitoring, a buggy agent that loops indefinitely can generate hundreds of dollars in charges in minutes. CostHawk's per-session cost tracking and anomaly detection are essential safeguards for agentic workloads.
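The history-resending pattern described above can be tallied step by step. A sketch with illustrative per-step token counts, using the Claude 3.5 Sonnet rates quoted in this section:

```python
def agent_run_cost(steps: int, tokens_per_step: int = 1_000,
                   output_per_step: int = 600,
                   input_rate: float = 3.00, output_rate: float = 15.00) -> float:
    """USD cost of an agent run where each step resends the full history.

    tokens_per_step: new input added each step (tool result, user message).
    Rates default to Claude 3.5 Sonnet per MTok; all counts are illustrative.
    """
    total, history = 0.0, 0
    for _ in range(steps):
        input_tokens = history + tokens_per_step
        total += (input_tokens * input_rate
                  + output_per_step * output_rate) / 1_000_000
        history = input_tokens + output_per_step  # history grows every step
    return total

# A 5-step run under these assumptions resends ~21k input tokens in total:
print(f"${agent_run_cost(5):.3f}")  # → $0.108
```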
What is the cheapest pay-per-token model available in 2026?
As of March 2026, the cheapest general-purpose LLM APIs are GPT-4o-mini at $0.15/$0.60 per million tokens (input/output), Gemini 1.5 Flash at $0.075/$0.30 per million tokens, and Claude 3.5 Haiku at $0.80/$4.00 per million tokens. For embedding-only workloads, OpenAI's text-embedding-3-small costs just $0.02 per million tokens. The cheapest option depends on your quality requirements — GPT-4o-mini and Gemini Flash are excellent for classification, extraction, and simple generation tasks but fall short on complex reasoning compared to GPT-4o or Claude 3.5 Sonnet. The best cost optimization strategy is model routing: use cheap models for simple tasks and expensive models for complex ones. CostHawk's model comparison dashboard shows cost-per-quality metrics to help you find the optimal model for each use case.
How do batch APIs reduce pay-per-token costs?
Batch APIs let you submit large sets of requests (hundreds to millions) as a single job that the provider processes asynchronously, typically within 24 hours. In exchange for giving up real-time responses, you get a significant discount: OpenAI's Batch API charges 50% less than real-time pricing ($1.25/$5.00 per MTok for GPT-4o instead of $2.50/$10.00). Anthropic's Message Batches API offers similar savings. Batch processing is ideal for workloads like nightly content generation, bulk document classification, dataset labeling, evaluation runs, and pre-computing responses for anticipated queries. If 40% of your 1 million daily requests can be batched, and your average request costs $0.01 at real-time rates, you save $0.005 × 400,000 = $2,000/day ($60,000/month). CostHawk tracks batch and real-time spending separately so you can measure exactly how much you are saving from batch routing.
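The savings arithmetic in this answer, as a sketch:

```python
def batch_savings_per_day(daily_requests: int, batchable_fraction: float,
                          avg_cost_per_request: float,
                          batch_discount: float = 0.50) -> float:
    """Daily USD saved by routing a fraction of traffic to a batch endpoint.

    batch_discount is the fraction taken off real-time pricing (0.50 for
    the OpenAI Batch API discount quoted above).
    """
    return (daily_requests * batchable_fraction
            * avg_cost_per_request * batch_discount)

# The example from the answer: 1M requests/day, 40% batchable, $0.01 each.
print(f"${batch_savings_per_day(1_000_000, 0.40, 0.01):,.0f}/day")  # → $2,000/day
```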
Related Terms
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Token Budget
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
Cost Anomaly Detection
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
AI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
