Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Definition
What is a Token?
A token is the smallest unit of text a large language model reads and writes — typically a sub-word fragment produced by the model's tokenizer. The word "unhappiness" might become two tokens: ["un", "happiness"], or three: ["un", "happ", "iness"], depending on the model's vocabulary. Every API call is metered in tokens. Providers like OpenAI, Anthropic, Google, and Mistral bill input tokens (the prompt you send) and output tokens (the completion the model generates) at separate per-million-token rates, with output tokens typically costing four to five times more than input tokens. Understanding how tokenization works and how tokens translate to dollars is the single most important concept for controlling AI API costs.
Impact
Why It Matters for AI Costs
Token counts are the primary cost driver for every AI API call you make. Consider the economics of a typical production workload using GPT-4o:
- Input tokens: $2.50 per 1 million tokens
- Output tokens: $10.00 per 1 million tokens
A single request with a 1,000-token prompt and a 500-token response costs:
(1,000 / 1,000,000) × $2.50 + (500 / 1,000,000) × $10.00 = $0.0025 + $0.005 = $0.0075
That looks negligible — less than a penny. But scale changes everything. At 100,000 queries per day, you are spending $750 per day, or roughly $22,500 per month. Add a system prompt that consumes 2,000 tokens per request and your input token count triples, adding roughly $15,000 and pushing the monthly bill toward $37,500.
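The arithmetic above can be wrapped in a small helper for back-of-the-envelope estimates (the rates are hardcoded from the GPT-4o example; substitute your own model's pricing):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.50, output_rate: float = 10.00) -> float:
    """Dollar cost of one request; rates are quoted per 1M tokens (GPT-4o here)."""
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate

per_request = request_cost(1_000, 500)    # $0.0075 per request
per_month = per_request * 100_000 * 30    # $22,500 at 100k requests/day
```

Running this against your actual average token counts is the fastest way to sanity-check a feature's cost before it ships.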
The insidious part is that token counts are invisible to most engineering teams. A developer adding "be thorough and explain your reasoning" to a system prompt may not realize they just increased average output length by 40% — roughly $6,000 per month on the workload above. A product manager who asks for "richer responses" triggers longer completions that cost more per request.
Token awareness is the foundation of AI cost management. Without it, teams overspend on bloated prompts, redundant conversation history, and unnecessarily verbose outputs. With it, teams can right-size every request and achieve the same quality at a fraction of the cost. CostHawk tracks token consumption at the per-request, per-key, per-project, and per-model level so you can see exactly where your tokens — and your dollars — are going.
How Tokenization Works
All modern large language models use a tokenizer to convert raw text into a sequence of integer IDs before processing. The dominant algorithm is Byte-Pair Encoding (BPE), originally developed for data compression and adapted for NLP by Sennrich et al. in 2016. Here is how it works at a high level:
1. Start with characters. The algorithm begins with every unique byte (or character) in the training corpus as its own token.
2. Count adjacent pairs. It scans the corpus and counts how often every pair of adjacent tokens appears.
3. Merge the most frequent pair. The most common pair is merged into a single new token and the corpus is re-encoded.
4. Repeat. Steps 2–3 repeat for a fixed number of iterations (typically 50,000–100,000 merges), building up a vocabulary of common sub-word units.
The result is a vocabulary where common words like "the" and "is" are single tokens, moderately common words like "running" may be one or two tokens, and rare or technical terms like "defenestration" or "HIPAA" may be split into three or more tokens.
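The merge loop described above can be sketched in a few lines of Python — a toy trainer over word counts, not the byte-level, heavily optimized implementations real tokenizers use:

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: returns the learned merge rules in order."""
    # Step 1: each word starts as a tuple of single-character tokens.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into one new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train(["low", "low", "lower", "newest", "newest"], num_merges=3)
# First merges: ('l', 'o'), then ('lo', 'w') — frequent sequences fuse first.
```

Frequent character runs like "lo" and "low" become single tokens after just a few merges, which is exactly why common English words end up as one token while rare terms stay fragmented.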
Each model family maintains its own tokenizer vocabulary. OpenAI uses cl100k_base for GPT-4 and GPT-4o (100,256 tokens in the vocabulary) and o200k_base for newer models. Anthropic uses a custom BPE tokenizer for Claude models. Google uses SentencePiece for Gemini. This means the same English sentence may produce a different token count depending on which model you send it to.
Key practical implications:
- English prose averages ~1.3 tokens per word, or ~4 characters per token.
- Source code is less token-efficient because variable names, operators, and whitespace patterns are less common in training data. Python code averages ~2.0 tokens per word-equivalent.
- Non-Latin scripts (Chinese, Japanese, Korean, Arabic) typically consume 1.5–3x more tokens per character than English because these characters appear less frequently in training data and are split into more sub-word units.
- JSON and structured data is surprisingly token-heavy. Curly braces, colons, quotes, and key names all consume tokens. A 1 KB JSON payload can easily be 300+ tokens.
Understanding these ratios is essential for estimating costs before you ship a feature. If your application sends structured JSON context, your token counts — and costs — will be higher than if you sent the same information as compressed plain text.
Token Counting by Content Type
Different content types produce vastly different token-per-character ratios. The table below shows approximate token counts for common content types using the GPT-4o tokenizer (o200k_base):
| Content Type | Example | Characters | Tokens | Chars/Token |
|---|---|---|---|---|
| English prose | A paragraph from a blog post | 1,000 | ~250 | ~4.0 |
| Technical documentation | API reference with code terms | 1,000 | ~280 | ~3.6 |
| Python source code | A utility function with docstring | 1,000 | ~330 | ~3.0 |
| TypeScript/JSX | A React component | 1,000 | ~350 | ~2.9 |
| JSON payload | API response with nested objects | 1,000 | ~380 | ~2.6 |
| Minified JavaScript | Bundled production code | 1,000 | ~400 | ~2.5 |
| Chinese text | A news article in Simplified Chinese | 1,000 | ~700 | ~1.4 |
| Base64-encoded data | An encoded image thumbnail | 1,000 | ~750 | ~1.3 |
These ratios have direct cost implications. If you are building a code review tool that sends entire source files as context, expect 30–50% more tokens per kilobyte than a chatbot that processes natural language. If your application handles multilingual content, budget for 2–3x the token cost compared to English-only workloads.
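These ratios can drive a quick pre-flight estimator (the chars-per-token values below are the approximations from the table, not exact tokenizer output):

```python
import math

# Approximate chars-per-token ratios from the table above (o200k_base estimates).
CHARS_PER_TOKEN = {
    "english_prose": 4.0,
    "python_code": 3.0,
    "json": 2.6,
    "chinese_text": 1.4,
}

def estimate_tokens(text: str, content_type: str = "english_prose") -> int:
    """Rough pre-flight estimate; always use the real tokenizer for billing."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[content_type])

estimate_tokens("x" * 1000, "json")   # ~385 tokens, vs ~250 for English prose
```

The same 1,000 characters estimate to roughly 50% more tokens as JSON than as prose, which is the gap the table quantifies.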
CostHawk displays per-request token counts broken down by input and output, allowing you to see exactly which requests are consuming the most tokens and why. Use this data to identify opportunities for compression, truncation, or format changes that reduce token consumption without sacrificing quality.
Input vs Output Token Economics
Every major LLM provider charges different rates for input and output tokens, and understanding this split is critical for cost optimization. Output tokens are more expensive because generating them requires autoregressive decoding — the model must produce one token at a time, each conditioned on all previous tokens. Input tokens, by contrast, can be processed in parallel during the "prefill" phase.
Here are the current rates for popular models (as of March 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output/Input Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4.0x |
| GPT-4o mini | $0.15 | $0.60 | 4.0x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5.0x |
| Claude 3.5 Haiku | $0.80 | $4.00 | 5.0x |
| Gemini 2.0 Flash | $0.10 | $0.40 | 4.0x |
| Gemini 1.5 Pro | $1.25 | $5.00 | 4.0x |
The key insight is that output tokens typically cost 4–5x more than input tokens. This has profound implications for cost optimization:
- Reducing output length by 20% saves more money than reducing input length by 20% for most workloads where input and output are comparable in size.
- Use `max_tokens` to cap output length and prevent runaway generation.
- Instruct the model to be concise. Adding "Respond in under 200 words" to your system prompt can cut output token costs by 30–50%.
- For structured outputs, prefer compact formats. A JSON response with short keys costs fewer output tokens than one with verbose, descriptive keys.
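To make the first point concrete, compare the per-request savings from trimming 20% off each side of a request with 1,000 input and 500 output tokens, at the GPT-4o rates in the table above:

```python
INPUT_RATE, OUTPUT_RATE = 2.50, 10.00   # GPT-4o, dollars per 1M tokens

def saving(tokens_removed: float, rate_per_million: float) -> float:
    """Dollars saved per request by removing tokens billed at the given rate."""
    return tokens_removed / 1_000_000 * rate_per_million

input_saving = saving(1_000 * 0.20, INPUT_RATE)    # cut 200 input tokens
output_saving = saving(500 * 0.20, OUTPUT_RATE)    # cut 100 output tokens
# The output-side cut saves 2x as much despite removing half as many tokens.
```

The 4x price ratio means every output token you avoid is worth four input tokens trimmed.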
CostHawk tracks input and output tokens separately in every usage report, so you can see exactly which side of the equation is driving your costs and optimize accordingly.
Token Counting in Practice
Accurate token counting before sending a request is essential for cost estimation, budget enforcement, and staying within context window limits. Here are the standard approaches for each major provider:
OpenAI — tiktoken library:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("How many tokens is this sentence?")
print(f"Token count: {len(tokens)}")  # Token count: 7

# For chat messages with per-message role overhead:
def count_chat_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    token_count = 0
    for msg in messages:
        token_count += 4  # message overhead (role tags and delimiters)
        token_count += len(enc.encode(msg["content"]))
    token_count += 2  # reply priming
    return token_count
```

Anthropic — anthropic-tokenizer:
```typescript
import { countTokens } from "@anthropic-ai/tokenizer"

const count = countTokens("How many tokens is this sentence?")
console.log(`Token count: ${count}`) // Token count: 8
```

Approximate counting without libraries:
```typescript
// Quick estimation: ~4 chars per token for English
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// More accurate: count whitespace-separated words and apply ~1.3 tokens/word
function betterEstimate(text: string): number {
  return Math.ceil(text.split(/\s+/).length * 1.3)
}
```

Important caveats:
- Chat-format messages add per-message overhead (typically 3–4 tokens per message for role tags and delimiters).
- Function/tool definitions included in the request are tokenized and count toward input tokens.
- System prompts are tokenized once per request. If you send the same system prompt with every request, multiply its token count by your daily request volume to see the true cost.
- The `usage` object returned by the API contains the exact `prompt_tokens` and `completion_tokens` — always use these for billing reconciliation rather than client-side estimates.
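Putting the estimate and the caveats together, a pre-flight budget check might look like this (the function name and the 20% headroom factor are illustrative choices, not a standard API):

```python
def within_budget(prompt: str, max_input_tokens: int,
                  chars_per_token: float = 4.0) -> bool:
    """Pre-flight check using the rough ~4-chars-per-token heuristic.
    Always reconcile against the API's usage object after the call."""
    estimated = len(prompt) / chars_per_token
    # Leave ~20% headroom: the heuristic can undercount by roughly that much.
    return estimated * 1.2 <= max_input_tokens
```

A guard like this catches runaway prompts before they are billed; the exact tokenizer (tiktoken or the provider's equivalent) should replace the heuristic when enforcing hard limits.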
Token Optimization Strategies
Reducing token consumption is the most direct lever for lowering AI API costs. Here are six proven strategies, ordered by typical impact:
1. Trim system prompts. System prompts are included in every request. A 2,000-token system prompt across 50,000 daily requests means 100 million input tokens per day — $250/day on GPT-4o. Audit your system prompts ruthlessly: remove redundant instructions, compress examples, and eliminate verbose formatting guidance. Many teams find they can cut system prompt length by 40–60% without any quality degradation.
2. Manage conversation history. Chatbot-style applications that send the full conversation history with each turn accumulate tokens fast — total tokens processed over a conversation grow quadratically with the number of turns. After 20 turns of conversation, you might be sending 15,000+ tokens of history with every request. Implement a sliding window (keep the last N turns), summarize older turns, or use RAG to retrieve relevant prior context instead of sending everything.
3. Cap output length. Set max_tokens on every request and instruct the model to be concise. Without a cap, models may generate 1,000+ tokens when 200 would suffice. Since output tokens cost 4–5x more than input tokens, this is the highest-ROI optimization for many workloads.
4. Choose the right model. Not every request needs the most capable model. Simple classification, extraction, and formatting tasks can often be handled by GPT-4o mini ($0.15/$0.60 per MTok) instead of GPT-4o ($2.50/$10.00 per MTok) — a 16x cost reduction. Use model routing to direct each request to the cheapest model that meets your quality bar.
5. Use prompt caching. Both OpenAI and Anthropic offer prompt caching that gives a 50–90% discount on input tokens for repeated prompt prefixes. If your system prompt is the same across all requests, caching can dramatically reduce your input token costs. Anthropic's caching gives a 90% discount on cached tokens; OpenAI's gives 50%.
6. Compress structured data. If you are sending JSON context, consider whether a more compact format (CSV, abbreviated key names, or even plain text summaries) would convey the same information in fewer tokens. A 5 KB JSON payload might be 1,500 tokens; the same data as a compact table might be 400 tokens.
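Strategies 2 and 6 can be sketched in a few lines (the message format and record fields here are illustrative, not a specific provider's schema):

```python
import csv
import io
import json

def sliding_window(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Strategy 2: keep the system prompt plus only the last max_turns messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

def compact_records(records: list[dict]) -> str:
    """Strategy 6: render flat records as CSV, which states each key once
    in the header instead of repeating it per record as JSON does."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(20)]
trimmed = sliding_window(history)   # 1 system message + the last 6 turns

records = [{"user_id": i, "plan": "pro", "monthly_spend": 42.5} for i in range(50)]
json_chars = len(json.dumps(records, indent=2))
csv_chars = len(compact_records(records))   # far fewer characters, hence fewer tokens
```

The character reduction from the CSV rendering translates roughly proportionally into token (and dollar) savings, per the content-type ratios earlier in this article.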
CostHawk's per-request analytics let you measure the before-and-after impact of each optimization, so you can prioritize the strategies that deliver the biggest savings for your specific workload.
Tokens and Cost Monitoring
Effective token monitoring requires visibility at multiple levels of granularity. CostHawk provides this through a layered monitoring architecture:
Per-request level: Every API call logged through CostHawk records the exact input token count, output token count, model used, and computed cost. This is the foundation for all higher-level analytics. You can drill into individual requests to understand why a particular call was expensive — perhaps it included an unexpectedly large context, or the model generated a verbose response.
Per-key level: CostHawk wrapped keys provide per-key attribution. If you issue separate keys for different services, teams, or environments, you can see token consumption broken down by key. This enables chargeback models and helps identify which service is driving the most token spend.
Per-project level: Tags and project labels let you aggregate token usage by business dimension — feature, team, customer, or environment (dev/staging/prod). Many teams discover that their development environment accounts for 30–50% of total token spend, leading to immediate savings through dev environment optimization.
Temporal trends: CostHawk's time-series dashboards show token consumption over hours, days, and weeks. This reveals patterns like batch jobs that spike token usage overnight, gradual context window creep as conversation histories grow, or sudden jumps when a new feature ships. Anomaly detection flags deviations from your baseline so you can investigate before costs spiral.
Model-level breakdown: See which models are consuming the most tokens and at what cost. This data feeds into model routing decisions — if 60% of your tokens go to an expensive model but only 20% of those requests actually need its capabilities, you have a clear optimization target.
The goal is to make token consumption as visible and actionable as any other infrastructure metric. Just as you monitor CPU utilization, memory usage, and request latency, token consumption should be a first-class metric in your observability stack. CostHawk makes this possible without requiring any changes to your application code — just route your API calls through a CostHawk wrapped key or sync your MCP telemetry.
FAQ
Frequently Asked Questions
How many tokens is one word on average?
For English prose, one word averages about 1.3 tokens, or roughly four characters per token. The ratio is higher for source code, JSON, and non-Latin scripts, which tokenizers split into more pieces.
Do whitespace and punctuation count as tokens?
Yes — the tokenizer encodes every byte of the input, including spaces, newlines, and punctuation. A leading space is usually folded into the following word's token rather than counted separately, but the braces, quotes, and delimiters in structured data all consume tokens, which is why JSON is more token-heavy than prose.
Why do different models produce different token counts for the same text?
Because each model family trains its own tokenizer vocabulary. GPT-4o uses the o200k_base tokenizer with a vocabulary of roughly 200,000 tokens, while GPT-4 used cl100k_base with 100,256 tokens. Anthropic's Claude models use a proprietary tokenizer, and Google's Gemini models use SentencePiece. A larger vocabulary generally means common phrases are encoded as fewer tokens (because more multi-character sequences exist as single tokens), but rare or domain-specific terms may still be split differently. In practice, the difference is usually 5–15% between models for standard English text, but can be larger for code, non-Latin scripts, or specialized content. When comparing costs across providers, always use each provider's actual tokenizer for accurate estimates rather than assuming universal token counts.
How do I count tokens before sending a request to the API?
For OpenAI models, use the tiktoken library (available in Python and as a WASM module for JavaScript). Call tiktoken.encoding_for_model("gpt-4o") to get the correct encoder, then encoder.encode(text) to get the token list. For Anthropic, use the @anthropic-ai/tokenizer npm package or the anthropic Python SDK's built-in token counting. For Google Gemini, use the countTokens API endpoint. If you need a quick estimate without a library, divide the character count by 4 for English prose — this gives a rough approximation within 10–20% accuracy. For production systems, always use the official tokenizer library for accuracy, especially when enforcing budget limits or context window constraints. CostHawk also exposes token counts in its API response metadata, so you can reconcile client-side estimates with server-reported actuals.
Why are output tokens more expensive than input tokens?
Because generating them costs more compute. Output tokens require autoregressive decoding — the model produces one token at a time, each conditioned on everything before it — while input tokens can be processed in parallel during the prefill phase. Providers pass that difference through in pricing, typically charging 4–5x more per output token.
How do tokens work with images and multimodal inputs?
Multimodal models meter images and audio in tokens too: the provider converts each piece of media into a token count based on its size or resolution (OpenAI, for example, bills images per 512-pixel tile plus a fixed base cost). The conversion varies by provider and model, so check each provider's pricing documentation.
What is the relationship between tokens and context windows?
The context window is the maximum number of tokens a model can process in a single request, covering input and output combined. With a 128K-token context window, if you set max_tokens to 4,000 for the output, you can send up to 124,000 tokens of input (system prompt + user messages + conversation history + tool definitions). Every token of input you send is billed as an input token, and every token the model generates is billed as an output token, both within the context window limit. The critical cost implication is that stuffing the context window inflates your bill proportionally. Sending 100,000 tokens of context costs 50x more in input tokens than sending 2,000 tokens. Teams that naively include full conversation history or entire documents in every request often discover their context window usage — and their costs — growing steadily over time. CostHawk's per-request analytics show context utilization as a percentage, making it easy to spot requests that are using more context than necessary.
How can I reduce token usage without sacrificing output quality?
Apply the six strategies above: trim system prompts, window or summarize conversation history, cap output with max_tokens, route simple requests to cheaper models, enable prompt caching, and compress structured context. Measure token counts before and after each change to confirm quality holds while costs drop.
Related Terms
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens cost 3–5x more than input tokens across all major providers.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Max Tokens
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
