Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens typically cost 4-5x more than input tokens across major providers.
Understanding the Asymmetry
When you send a prompt to an LLM, the model processes your input tokens in parallel in a single forward pass through the network. This is computationally efficient: the model reads all input tokens at once and builds an internal representation. Output generation is fundamentally different. The model produces tokens one at a time in a sequential, autoregressive process in which each new token depends on every token that came before it. Generating 1,000 output tokens therefore requires roughly 1,000 sequential decode steps, each attending to a progressively longer context.
This computational asymmetry is why providers charge more for output. The GPU time required to generate a token is significantly higher than the GPU time required to process an input token. On most transformer architectures, the ratio of compute per output token to compute per input token ranges from 3x to 6x, depending on model size, batch efficiency, and hardware utilization. Providers pass this cost differential directly to customers through split pricing.
The asymmetry also has implications for capacity planning. Output-heavy workloads consume more GPU-seconds per request, which means providers can serve fewer concurrent requests. This reduced throughput is another factor in the higher output price.
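In code, split pricing reduces to metering each direction separately and summing. A minimal sketch (the function name and example token volumes are illustrative; the rates are the GPT-4o figures quoted in the table below):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Total cost of one API call under split input/output pricing."""
    return (input_tokens / 1_000_000) * input_price_per_1m + \
           (output_tokens / 1_000_000) * output_price_per_1m

# 2,000 input tokens and 500 output tokens at $2.50 / $10.00 per 1M:
# the 500 output tokens cost as much as all 2,000 input tokens.
print(f"${request_cost(2_000, 500, 2.50, 10.00):.4f}")  # about $0.0100
```

At a 4x multiplier, output only needs to be a quarter of the token volume to account for half the bill, which is why trimming responses pays off so quickly.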
Provider Pricing Comparison
The following table shows current input and output token pricing for major models as of March 2026. All prices are per 1 million tokens.
| Provider | Model | Input (per 1M) | Output (per 1M) | Ratio |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 4.0x |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 4.0x |
| OpenAI | o1 | $15.00 | $60.00 | 4.0x |
| OpenAI | o3-mini | $1.10 | $4.40 | 4.0x |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 5.0x |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 | 5.0x |
| Anthropic | Claude 3 Opus | $15.00 | $75.00 | 5.0x |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 4.0x |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 4.0x |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | 4.0x |
Notice that every provider maintains a consistent multiplier across their model lineup: OpenAI and Google use 4x, Anthropic uses 5x. This consistency means the optimization math is the same regardless of which model tier you use — reducing output tokens is always 4-5x more valuable than reducing input tokens by the same amount.
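That claim is easy to verify numerically. The sketch below compares the savings from trimming 1,000 tokens off each side of a GPT-4o call, using the rates from the table above:

```python
GPT4O_INPUT_PER_1M = 2.50    # $ per 1M input tokens (table above)
GPT4O_OUTPUT_PER_1M = 10.00  # $ per 1M output tokens

def savings(tokens_removed: int, price_per_1m: float) -> float:
    """Dollars saved by removing this many tokens at the given rate."""
    return tokens_removed / 1_000_000 * price_per_1m

input_saving = savings(1_000, GPT4O_INPUT_PER_1M)    # $0.0025
output_saving = savings(1_000, GPT4O_OUTPUT_PER_1M)  # $0.0100
print(output_saving / input_saving)  # ~4x, the provider's multiplier at every tier
```

Because each provider holds the multiplier constant across its lineup, this ratio of savings is the same whether you run the flagship model or the budget tier.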
Why Output Tokens Cost More
The cost difference comes down to three factors rooted in how transformer models work:
- Autoregressive generation: Each output token requires a separate forward pass through the model. A 500-token response requires 500 sequential inference steps, while a 500-token input is processed in a single parallel pass. The sequential nature of output generation prevents the kind of batching optimizations that make input processing efficient.
- KV-cache memory: During output generation, the model maintains a key-value cache that grows with each token produced. This cache consumes GPU memory proportional to the sequence length, reducing the number of concurrent requests a GPU can handle. More memory per request means fewer requests per GPU, which increases the cost per token.
- Speculative decoding overhead: Many providers use speculative decoding to accelerate output generation, where a smaller draft model proposes tokens that the main model verifies. While this improves latency, the draft-and-verify work adds per-output-token overhead that has no counterpart in input processing.
These three factors combine to make output generation 3-6x more expensive in raw compute cost per token. Providers set their pricing multipliers at 4-5x to reflect this reality while maintaining some margin.
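The KV-cache factor can be made concrete with a back-of-the-envelope estimate. The sketch below assumes illustrative dimensions for a hypothetical Llama-style model (32 layers, 8 KV heads, head dimension 128, fp16 cache); real architectures vary:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for one sequence: a key and a value
    vector per layer, per KV head, per token position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

# At these dimensions, a single 4,096-token sequence pins down 0.5 GiB
# of GPU memory that cannot serve other concurrent requests.
print(kv_cache_bytes(4_096) / 2**30)  # 0.5
```

The cache grows linearly with generated length, so long responses crowd out other requests and push up the effective cost per output token.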
Optimization Strategies for Output Reduction
Since output tokens cost 4-5x more than input tokens, reducing output length has an outsized impact on your bill. Here are proven strategies:
- Set explicit max_tokens: Always set a max_tokens limit appropriate to your use case. A classification task needs 10-50 tokens, not the default 4,096. Setting max_tokens: 50 on a classification endpoint prevents the model from generating verbose explanations you do not need.
- Request structured output: Use JSON mode or structured output schemas to force concise, machine-readable responses. A JSON object with three fields is typically 50-100 tokens. An equivalent natural-language paragraph is 200-400 tokens. That is a 2-4x reduction in output tokens.
- Use response formatting instructions: Explicitly tell the model how to format its response. "Respond with only the category name" produces 2-5 tokens. "Explain your reasoning and then provide the category" produces 200+ tokens. Be precise about what you want.
- Avoid chain-of-thought when unnecessary: Chain-of-thought prompting ("think step by step") dramatically increases output length — often by 5-10x. Use it for complex reasoning tasks where accuracy matters, but skip it for straightforward extraction and classification tasks.
- Post-process with smaller models: If you need a long response for internal processing but a short summary for the user, generate the full response with a capable model and then summarize it with a cheaper model like GPT-4o-mini.
Applying these strategies in combination can reduce output token spend by 40-70% without degrading the quality of results your application delivers to users.
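Two of the strategies above, the explicit max_tokens cap and structured output, live directly in the request payload. Below is a hedged sketch of an OpenAI-style chat-completion payload for a classification task; the field names follow the OpenAI API, while the model choice and prompt content are illustrative:

```python
# A hypothetical classification request: cap the output hard and force
# JSON so the model cannot pad the answer with prose.
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {
            "role": "system",
            "content": 'Classify the support ticket. Reply with JSON: '
                       '{"category": "<billing|bug|feature|other>"}',
        },
        {"role": "user", "content": "I was charged twice this month."},
    ],
    # Hard ceiling on output spend: a label needs tens of tokens, not 4,096.
    "max_tokens": 50,
    # JSON mode keeps the response machine-readable and short.
    "response_format": {"type": "json_object"},
}
```

The same dict can be passed to the OpenAI SDK or posted as raw JSON; other providers expose equivalent knobs under different field names.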
Measuring Input vs. Output Ratios
Your input-to-output ratio is a critical metric for cost forecasting and optimization prioritization. Calculate it as: total_input_tokens / total_output_tokens. A ratio of 10:1 means you send 10 input tokens for every 1 output token — common for classification and extraction tasks. A ratio of 1:3 means the model generates 3 tokens for every 1 you send — typical for creative writing and code generation.
Typical ratios by use case:
| Use Case | Input:Output Ratio | Output Share of Cost |
|---|---|---|
| Classification | 50:1 to 100:1 | 15-30% |
| Entity extraction | 20:1 to 50:1 | 20-40% |
| RAG Q&A | 5:1 to 15:1 | 40-65% |
| Summarization | 3:1 to 10:1 | 35-60% |
| Code generation | 1:1 to 1:5 | 65-90% |
| Creative writing | 1:2 to 1:10 | 75-95% |
| Agent chains | 2:1 to 1:2 | 50-75% |
Use cases with a low input-to-output ratio (meaning output-heavy) benefit most from output optimization strategies. Use cases with a high ratio (input-heavy) benefit more from prompt compression and caching strategies. CostHawk calculates this ratio automatically on the usage dashboard so you can prioritize optimizations by impact.
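Both columns of the table can be computed from raw usage logs. A minimal sketch (the example workload and rates are illustrative):

```python
def token_ratio(total_input: int, total_output: int) -> float:
    """Input-to-output ratio: above 1 is input-heavy, below 1 is output-heavy."""
    return total_input / total_output

def output_cost_share(total_input: int, total_output: int,
                      input_price_per_1m: float,
                      output_price_per_1m: float) -> float:
    """Fraction of total spend attributable to output tokens."""
    input_cost = total_input / 1e6 * input_price_per_1m
    output_cost = total_output / 1e6 * output_price_per_1m
    return output_cost / (input_cost + output_cost)

# A month of RAG Q&A: 5M input tokens, 1M output tokens at $2.50 / $10.00 per 1M.
print(token_ratio(5_000_000, 1_000_000))                     # 5.0, the 5:1 end of the band
print(output_cost_share(5_000_000, 1_000_000, 2.50, 10.00))  # ~0.44
```

Even a 5:1 input-heavy workload spends nearly half its budget on output at a 4x multiplier, consistent with the 40-65% band in the table.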
Tracking Token Direction with CostHawk
CostHawk provides granular visibility into input and output token consumption across every request, model, and project. The dashboard breaks down your spend into input cost and output cost columns, letting you instantly see which direction is driving your bill.
Key CostHawk features for token direction analysis:
- Per-request breakdown: Every logged API call shows input token count, output token count, input cost, and output cost as separate fields. Sort by output cost to find your most expensive responses.
- Aggregate ratio tracking: The usage analytics page displays your organization-wide input-to-output ratio over time. A trending shift toward more output-heavy usage signals growing costs that may need attention.
- Model-level comparison: Compare input/output splits across models to identify routing opportunities. If your GPT-4o requests are 80% output cost, routing some to GPT-4o-mini could save 90% on those specific calls.
- Anomaly detection on output spikes: CostHawk flags requests where output token counts are unusually high — a sign of verbose responses, infinite-loop agent behavior, or missing max_tokens constraints.
By treating input and output tokens as separate cost centers, CostHawk helps teams apply the right optimization strategy to the right side of their token spend.
Frequently Asked Questions
Why do output tokens cost more than input tokens?
Output generation is sequential and autoregressive: each token requires its own decode step, while input tokens are processed in a single parallel pass. The growing KV cache also ties up GPU memory, limiting how many requests a GPU can serve concurrently. The raw compute cost per output token runs 3-6x that of an input token, and providers price the difference at 4-5x.

What is a typical input-to-output token ratio?
It depends on the workload. Classification and extraction tasks run 20:1 to 100:1 (input-heavy), RAG Q&A runs 5:1 to 15:1, and code generation and creative writing invert the ratio to anywhere from 1:1 to 1:10 (output-heavy). The table above shows output's share of total cost for each case.

How can I reduce output tokens without losing quality?
(1) Set explicit max_tokens limits appropriate to each task — a classification endpoint needs 10-50 tokens, not 4,096. (2) Use structured output (JSON mode) to force concise, machine-readable responses instead of verbose natural language. (3) Write precise formatting instructions — "respond with only the category name" versus "explain and then categorize." (4) Skip chain-of-thought prompting for simple tasks where step-by-step reasoning adds cost without improving accuracy. (5) Use a two-model pipeline for cases where you need detailed internal processing but concise user-facing output. These strategies combined typically reduce output token spend by 40-70% without degrading result quality.

Does the input/output price ratio vary between providers?
Only modestly. OpenAI and Google price output at 4x input across their lineups, while Anthropic uses 5x, and each provider keeps its multiplier consistent across model tiers. The optimization math is therefore the same regardless of which tier you use.

How do cached tokens affect the input/output price split?
Prompt-caching discounts apply to input tokens only: cached input is billed at a reduced input rate, while output tokens are never cached. Heavy cache usage therefore shifts an even larger share of your bill toward output.

Do reasoning models like o1 have different input/output economics?
Yes. Reasoning models generate hidden reasoning tokens that are billed at the output rate, so effective output costs run far higher than the visible response suggests. Setting max_completion_tokens is critical for reasoning models to prevent runaway generation. CostHawk flags reasoning model requests that exceed configurable output thresholds so you can identify tasks that may not need the full reasoning capability.

Can I see input vs. output costs per feature in CostHawk?
Yes. Every logged request carries separate input and output token counts and costs, and the dashboard aggregates them by model and project, so you can attribute each side of the spend to the feature that generated it.

How does batch API pricing change the input/output calculation?
Batch APIs typically discount input and output by the same factor (OpenAI's Batch API, for example, takes 50% off both), so absolute costs drop but the input-to-output ratio and the relative value of output optimization are unchanged.

What tools can I use to count tokens before sending a request?
For OpenAI models, use the tiktoken library (Python) or tiktoken npm package (JavaScript/TypeScript). Call encoding_for_model('gpt-4o') to get the correct tokenizer, then encode(text).length for the count. For Anthropic models, use the anthropic SDK's built-in token counting endpoint or the anthropic-tokenizer package. Google provides token counting in the Gemini SDK via model.count_tokens(). Pre-counting tokens lets you estimate costs before incurring them, enforce token budgets at the application layer, and avoid surprises. CostHawk also provides estimated costs in the MCP server's response metadata so you can log projected vs. actual costs.

Should I optimize input tokens or output tokens first?
Output tokens first, since each one costs 4-5x more. Start by setting max_tokens limits, using structured JSON output, and removing chain-of-thought from tasks that do not need it. Once you have optimized output, move to input optimization: prompt compression, caching, and RAG chunking strategies. CostHawk's dashboard makes it easy to see which side of the split is driving your costs so you can prioritize accurately.

Related Terms
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Max Tokens
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Read moreAI Cost Glossary
