Glossary · Billing & Pricing · Updated 2026-03-16

Token

The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.

Definition

What is a Token?

A token is the atomic unit of text that a large language model reads and generates. Modern LLMs do not process raw characters or whole words. Instead, they use a tokenizer — most commonly Byte-Pair Encoding (BPE) — to split text into sub-word fragments called tokens. In English prose, one token averages approximately four characters, or about three-quarters of a word. The word "unhappiness" might become two or three tokens: ["un", "happiness"] or ["un", "happ", "iness"], depending on the model's vocabulary. Every API call is metered in tokens. Providers like OpenAI, Anthropic, Google, and Mistral bill input tokens (the prompt you send) and output tokens (the completion the model generates) at separate per-million-token rates, with output tokens typically costing two to four times more than input tokens. Understanding how tokenization works and how tokens translate to dollars is the single most important concept for controlling AI API costs.

Impact

Why It Matters for AI Costs

Token counts are the primary cost driver for every AI API call you make. Consider the economics of a typical production workload using GPT-4o:

  • Input tokens: $2.50 per 1 million tokens
  • Output tokens: $10.00 per 1 million tokens

A single request with a 1,000-token prompt and a 500-token response costs:

(1,000 / 1,000,000) × $2.50 + (500 / 1,000,000) × $10.00 = $0.0025 + $0.005 = $0.0075

That looks negligible — less than a penny. But scale changes everything. At 100,000 queries per day, you are spending $750 per day, or roughly $22,500 per month. Add a system prompt that consumes 2,000 tokens per request and your input tokens triple, pushing the monthly bill to roughly $37,500.
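The arithmetic above is easy to wrap in a small helper for pre-flight estimates. A minimal sketch in Python, using the GPT-4o rates quoted above (the function name and defaults are illustrative, not part of any SDK):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.50, output_rate: float = 10.00) -> float:
    """Cost of one request in dollars; rates are $ per 1M tokens (GPT-4o defaults)."""
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate

# The worked example: 1,000-token prompt, 500-token response
per_request = request_cost(1_000, 500)      # 0.0075 dollars
monthly = per_request * 100_000 * 30        # 100k requests/day for 30 days
print(f"${per_request:.4f} per request, ${monthly:,.0f}/month")
```

Running the same helper with your own traffic numbers gives a quick sanity check before a feature ships.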

The insidious part is that token counts are invisible to most engineering teams. A developer adding "be thorough and explain your reasoning" to a system prompt may not realize they just increased average output length by 40% — in the running example above, 200 extra output tokens per request, adding roughly $6,000/month to the bill. A product manager who asks for "richer responses" triggers longer completions that cost more per request.

Token awareness is the foundation of AI cost management. Without it, teams overspend on bloated prompts, redundant conversation history, and unnecessarily verbose outputs. With it, teams can right-size every request and achieve the same quality at a fraction of the cost. CostHawk tracks token consumption at the per-request, per-key, per-project, and per-model level so you can see exactly where your tokens — and your dollars — are going.

How Tokenization Works

All modern large language models use a tokenizer to convert raw text into a sequence of integer IDs before processing. The dominant algorithm is Byte-Pair Encoding (BPE), originally developed for data compression and adapted for NLP by Sennrich et al. in 2016. Here is how it works at a high level:

  1. Start with characters. The algorithm begins with every unique byte (or character) in the training corpus as its own token.
  2. Count adjacent pairs. It scans the corpus and counts how often every pair of adjacent tokens appears.
  3. Merge the most frequent pair. The most common pair is merged into a single new token and the corpus is re-encoded.
  4. Repeat. Steps 2–3 repeat for a fixed number of iterations (typically 50,000–100,000 merges), building up a vocabulary of common sub-word units.

The result is a vocabulary where common words like "the" and "is" are single tokens, moderately common words like "running" may be one or two tokens, and rare or technical terms like "defenestration" or "HIPAA" may be split into three or more tokens.
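The merge loop described above can be demonstrated on a toy corpus. A simplified sketch (character-level symbols and whole-word inputs; real tokenizers operate on bytes and handle far larger vocabularies):

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules on a list of words, starting from single characters."""
    words = [list(w) for w in corpus]          # step 1: every character is a token
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):         # step 2: count adjacent pairs
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # step 3: pick the most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        for w in words:                        # re-encode the corpus with the new token
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges                              # step 4: repeat for num_merges rounds

print(bpe_merges(["low", "lower", "lowest", "low"], 3))
```

On this corpus the first merges build up the shared prefix "low", which is exactly how common sub-words end up as single tokens in a production vocabulary.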

Each model family maintains its own tokenizer vocabulary. OpenAI uses cl100k_base for GPT-4 and GPT-4o (100,256 tokens in the vocabulary) and o200k_base for newer models. Anthropic uses a custom BPE tokenizer for Claude models. Google uses SentencePiece for Gemini. This means the same English sentence may produce a different token count depending on which model you send it to.

Key practical implications:

  • English prose averages ~1.3 tokens per word, or ~4 characters per token.
  • Source code is less token-efficient because variable names, operators, and whitespace patterns are less common in training data. Python code averages ~2.0 tokens per word-equivalent.
  • Non-Latin scripts (Chinese, Japanese, Korean, Arabic) typically consume 1.5–3x more tokens per character than English because these characters appear less frequently in training data and are split into more sub-word units.
  • JSON and structured data is surprisingly token-heavy. Curly braces, colons, quotes, and key names all consume tokens. A 1 KB JSON payload can easily be 300+ tokens.

Understanding these ratios is essential for estimating costs before you ship a feature. If your application sends structured JSON context, your token counts — and costs — will be higher than if you sent the same information as compressed plain text.

Token Counting by Content Type

Different content types produce vastly different token-per-character ratios. The table below shows approximate token counts for common content types using the GPT-4o tokenizer (o200k_base):

Content Type             Example                                Characters   Tokens   Chars/Token
English prose            A paragraph from a blog post           1,000        ~250     ~4.0
Technical documentation  API reference with code terms          1,000        ~280     ~3.6
Python source code       A utility function with docstring      1,000        ~330     ~3.0
TypeScript/JSX           A React component                      1,000        ~350     ~2.9
JSON payload             API response with nested objects       1,000        ~380     ~2.6
Minified JavaScript      Bundled production code                1,000        ~400     ~2.5
Chinese text             A news article in Simplified Chinese   1,000        ~700     ~1.4
Base64-encoded data      An encoded image thumbnail             1,000        ~750     ~1.3

These ratios have direct cost implications. If you are building a code review tool that sends entire source files as context, expect 30–50% more tokens per kilobyte than a chatbot that processes natural language. If your application handles multilingual content, budget for 2–3x the token cost compared to English-only workloads.
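The ratios in the table can drive a rough pre-flight estimate by content type. A sketch, assuming the approximate chars-per-token figures above (the ratio map is illustrative, not an official constant; use a real tokenizer for billing-grade counts):

```python
import math

# Approximate chars-per-token ratios from the table above (o200k_base, rough)
CHARS_PER_TOKEN = {
    "english_prose": 4.0,
    "python": 3.0,
    "json": 2.6,
    "chinese": 1.4,
}

def estimate_tokens(text: str, content_type: str = "english_prose") -> int:
    """Rough token estimate from character count and content type."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[content_type])

payload = '{"user": "alice", "plan": "pro"}'
print(estimate_tokens(payload, "json"), estimate_tokens(payload, "english_prose"))
```

The same 1 KB of text estimates at ~250 tokens as prose but ~385 tokens as JSON, which is the cost gap the paragraph above describes.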

CostHawk displays per-request token counts broken down by input and output, allowing you to see exactly which requests are consuming the most tokens and why. Use this data to identify opportunities for compression, truncation, or format changes that reduce token consumption without sacrificing quality.

Input vs Output Token Economics

Every major LLM provider charges different rates for input and output tokens, and understanding this split is critical for cost optimization. Output tokens are more expensive because generating them requires autoregressive decoding — the model must produce one token at a time, each conditioned on all previous tokens. Input tokens, by contrast, can be processed in parallel during the "prefill" phase.

Here are the current rates for popular models (as of March 2026):

Model              Input (per 1M tokens)   Output (per 1M tokens)   Output/Input Ratio
GPT-4o             $2.50                   $10.00                   4.0x
GPT-4o mini        $0.15                   $0.60                    4.0x
Claude 3.5 Sonnet  $3.00                   $15.00                   5.0x
Claude 3.5 Haiku   $0.80                   $4.00                    5.0x
Gemini 2.0 Flash   $0.10                   $0.40                    4.0x
Gemini 1.5 Pro     $1.25                   $5.00                    4.0x

The key insight is that output tokens typically cost 4–5x more than input tokens. This has profound implications for cost optimization:

  • Reducing output length by 20% saves more money than reducing input length by 20% for most workloads where input and output are comparable in size.
  • Use max_tokens to cap output length and prevent runaway generation.
  • Instruct the model to be concise. Adding "Respond in under 200 words" to your system prompt can cut output token costs by 30–50%.
  • For structured outputs, prefer compact formats. A JSON response with short keys costs fewer output tokens than one with verbose, descriptive keys.
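The first bullet can be verified with quick arithmetic on the GPT-4o example used earlier in this page. A sketch comparing a 20% cut on each side (function name and traffic numbers are illustrative):

```python
def monthly_cost(input_toks: int, output_toks: int, requests_per_day: int,
                 input_rate: float = 2.50, output_rate: float = 10.00,
                 days: int = 30) -> float:
    """Monthly spend in dollars; rates are $ per 1M tokens (GPT-4o defaults)."""
    per_request = (input_toks * input_rate + output_toks * output_rate) / 1_000_000
    return per_request * requests_per_day * days

base = monthly_cost(1_000, 500, 100_000)        # the running example: $22,500/month
cut_input = monthly_cost(800, 500, 100_000)     # 20% fewer input tokens
cut_output = monthly_cost(1_000, 400, 100_000)  # 20% fewer output tokens
print(base - cut_input, base - cut_output)
```

Even though the input is twice as long as the output here, the 20% output cut saves twice as much money, because each output token costs 4x more.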

CostHawk tracks input and output tokens separately in every usage report, so you can see exactly which side of the equation is driving your costs and optimize accordingly.

Token Counting in Practice

Accurate token counting before sending a request is essential for cost estimation, budget enforcement, and staying within context window limits. Here are the standard approaches for each major provider:

OpenAI — tiktoken library:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("How many tokens is this sentence?")
print(f"Token count: {len(tokens)}")  # Token count: 7

# For chat messages with role overhead.
# NOTE: per-message and reply-priming overheads are approximations
# that vary by model; reconcile against the API's usage object.
def count_chat_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    token_count = 0
    for msg in messages:
        token_count += 4  # approx. per-message overhead (role tags, delimiters)
        token_count += len(enc.encode(msg["content"]))
    token_count += 2  # approx. reply priming
    return token_count

Anthropic — anthropic-tokenizer:

import { countTokens } from "@anthropic-ai/tokenizer"

const count = countTokens("How many tokens is this sentence?")
console.log(`Token count: ${count}`)  // Token count: 8

Approximate counting without libraries:

// Quick estimation: ~4 chars per token for English
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// More accurate for prose: count whitespace-separated words, ~1.3 tokens each
function betterEstimate(text: string): number {
  return Math.ceil(text.split(/\s+/).length * 1.3)
}

Important caveats:

  • Chat-format messages add per-message overhead (typically 3–4 tokens per message for role tags and delimiters).
  • Function/tool definitions included in the request are tokenized and count toward input tokens.
  • System prompts are tokenized once per request. If you send the same system prompt with every request, multiply its token count by your daily request volume to see the true cost.
  • The response usage object returned by the API contains the exact prompt_tokens and completion_tokens — always use these for billing reconciliation rather than client-side estimates.
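The system-prompt caveat above is easy to quantify. A sketch, assuming a 2,000-token system prompt, 50,000 requests per day, and GPT-4o input pricing (the function name is illustrative):

```python
def system_prompt_daily_cost(prompt_tokens: int, requests_per_day: int,
                             input_rate: float = 2.50) -> float:
    """Daily dollars spent re-sending the same system prompt; rate is $ per 1M tokens."""
    return prompt_tokens * requests_per_day / 1_000_000 * input_rate

print(system_prompt_daily_cost(2_000, 50_000))  # 250.0 dollars per day
```

A fixed prompt that looks free in a single request becomes a line item worth auditing at production volume.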

Token Optimization Strategies

Reducing token consumption is the most direct lever for lowering AI API costs. Here are six proven strategies, ordered by typical impact:

1. Trim system prompts. System prompts are included in every request. A 2,000-token system prompt across 50,000 daily requests means 100 million input tokens per day — $250/day on GPT-4o. Audit your system prompts ruthlessly: remove redundant instructions, compress examples, and eliminate verbose formatting guidance. Many teams find they can cut system prompt length by 40–60% without any quality degradation.

2. Manage conversation history. Chatbot-style applications that send the full conversation history with each turn accumulate tokens quadratically: every new turn resends all prior turns, so cumulative token spend grows with the square of the conversation length. After 20 turns of conversation, you might be sending 15,000+ tokens of history with every request. Implement a sliding window (keep the last N turns), summarize older turns, or use RAG to retrieve relevant prior context instead of sending everything.

3. Cap output length. Set max_tokens on every request and instruct the model to be concise. Without a cap, models may generate 1,000+ tokens when 200 would suffice. Since output tokens cost 4–5x more than input tokens, this is the highest-ROI optimization for many workloads.

4. Choose the right model. Not every request needs the most capable model. Simple classification, extraction, and formatting tasks can often be handled by GPT-4o mini ($0.15/$0.60 per MTok) instead of GPT-4o ($2.50/$10.00 per MTok) — a 16x cost reduction. Use model routing to direct each request to the cheapest model that meets your quality bar.

5. Use prompt caching. Both OpenAI and Anthropic offer prompt caching that gives a 50–90% discount on input tokens for repeated prompt prefixes. If your system prompt is the same across all requests, caching can dramatically reduce your input token costs. Anthropic's caching gives a 90% discount on cached tokens; OpenAI's gives 50%.

6. Compress structured data. If you are sending JSON context, consider whether a more compact format (CSV, abbreviated key names, or even plain text summaries) would convey the same information in fewer tokens. A 5 KB JSON payload might be 1,500 tokens; the same data as a compact table might be 400 tokens.
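Strategy 2's sliding window takes only a few lines. A minimal sketch that keeps the system message plus the last N turns (a production version would summarize dropped turns rather than discard them; message shapes follow the common role/content convention):

```python
def sliding_window(history: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep any system messages plus the most recent max_turns messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

# A 20-turn conversation with one system message
history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(20)
]
trimmed = sliding_window(history, max_turns=6)
print(len(trimmed))  # 7: the system message plus the last 6 turns
```

Applied before every request, this caps history tokens at a constant instead of letting them grow with conversation length.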

CostHawk's per-request analytics let you measure the before-and-after impact of each optimization, so you can prioritize the strategies that deliver the biggest savings for your specific workload.

Tokens and Cost Monitoring

Effective token monitoring requires visibility at multiple levels of granularity. CostHawk provides this through a layered monitoring architecture:

Per-request level: Every API call logged through CostHawk records the exact input token count, output token count, model used, and computed cost. This is the foundation for all higher-level analytics. You can drill into individual requests to understand why a particular call was expensive — perhaps it included an unexpectedly large context, or the model generated a verbose response.

Per-key level: CostHawk wrapped keys provide per-key attribution. If you issue separate keys for different services, teams, or environments, you can see token consumption broken down by key. This enables chargeback models and helps identify which service is driving the most token spend.

Per-project level: Tags and project labels let you aggregate token usage by business dimension — feature, team, customer, or environment (dev/staging/prod). Many teams discover that their development environment accounts for 30–50% of total token spend, leading to immediate savings through dev environment optimization.

Temporal trends: CostHawk's time-series dashboards show token consumption over hours, days, and weeks. This reveals patterns like batch jobs that spike token usage overnight, gradual context window creep as conversation histories grow, or sudden jumps when a new feature ships. Anomaly detection flags deviations from your baseline so you can investigate before costs spiral.

Model-level breakdown: See which models are consuming the most tokens and at what cost. This data feeds into model routing decisions — if 60% of your tokens go to an expensive model but only 20% of those requests actually need its capabilities, you have a clear optimization target.

The goal is to make token consumption as visible and actionable as any other infrastructure metric. Just as you monitor CPU utilization, memory usage, and request latency, token consumption should be a first-class metric in your observability stack. CostHawk makes this possible without requiring any changes to your application code — just route your API calls through a CostHawk wrapped key or sync your MCP telemetry.

FAQ

Frequently Asked Questions

How many tokens is one word on average?
In English, one word averages approximately 1.3 tokens. Short, common words like "the," "is," and "a" are typically a single token. Longer words like "unfortunately" or "implementation" are usually split into two or three tokens. Technical jargon, brand names, and uncommon words tend to produce more tokens because they appear less frequently in the tokenizer's training data and get split into smaller sub-word units. For quick estimation, you can multiply your word count by 1.3 to get an approximate token count, though this varies by content type. Source code and structured data like JSON tend to have higher token-to-word ratios (closer to 1.8–2.0) because of punctuation, operators, and non-natural-language patterns. CostHawk shows exact token counts for every request, so you can calibrate your estimates against real production data.
Do whitespace and punctuation count as tokens?
Yes. Whitespace characters (spaces, tabs, newlines) and punctuation marks are encoded into the token stream and count toward your billed token total. In BPE tokenization, a leading space is typically merged with the following word into a single token — so " hello" (space + hello) is one token, not two. However, multiple consecutive spaces, extra newlines, and trailing whitespace all add tokens. A common source of token waste is pretty-printed JSON with indentation: a 4-space indent on 200 lines adds roughly 200 extra tokens compared to minified JSON. Similarly, Markdown formatting with extra blank lines between paragraphs adds tokens that provide no semantic value to the model. Audit your prompts for unnecessary whitespace and consider minifying structured data before sending it to the API.
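The indentation point can be demonstrated without a tokenizer by comparing character counts, a rough proxy since whitespace runs become extra tokens (the sample payload is illustrative):

```python
import json

# A hypothetical API payload with 50 items
data = {"items": [{"id": i, "name": f"item-{i}", "price": i * 1.5}
                  for i in range(50)]}

pretty = json.dumps(data, indent=4)                 # pretty-printed, as many APIs return it
minified = json.dumps(data, separators=(",", ":"))  # no indentation or extra spaces

print(len(pretty), len(minified), len(pretty) - len(minified))
```

The two strings parse to identical data, but the pretty-printed version carries hundreds of extra whitespace characters that the model is billed for and gains nothing from.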
Why do different models produce different token counts for the same text?
Each model family uses its own tokenizer with a different vocabulary. OpenAI's GPT-4o uses the o200k_base tokenizer with a vocabulary of roughly 200,000 tokens, while GPT-4 used cl100k_base with 100,256 tokens. Anthropic's Claude models use a proprietary tokenizer, and Google's Gemini models use SentencePiece. A larger vocabulary generally means common phrases are encoded as fewer tokens (because more multi-character sequences exist as single tokens), but rare or domain-specific terms may still be split differently. In practice, the difference is usually 5–15% between models for standard English text, but can be larger for code, non-Latin scripts, or specialized content. When comparing costs across providers, always use each provider's actual tokenizer for accurate estimates rather than assuming universal token counts.
How do I count tokens before sending a request to the API?
For OpenAI models, use the official tiktoken library (available in Python and as a WASM module for JavaScript). Call tiktoken.encoding_for_model("gpt-4o") to get the correct encoder, then encoder.encode(text) to get the token list. For Anthropic, use the @anthropic-ai/tokenizer npm package or the anthropic Python SDK's built-in token counting. For Google Gemini, use the countTokens API endpoint. If you need a quick estimate without a library, divide the character count by 4 for English prose — this gives a rough approximation within 10–20% accuracy. For production systems, always use the official tokenizer library for accuracy, especially when enforcing budget limits or context window constraints. CostHawk also exposes token counts in its API response metadata, so you can reconcile client-side estimates with server-reported actuals.
Why are output tokens more expensive than input tokens?
Output tokens cost more because of the fundamental asymmetry in how transformer models process input versus generate output. During the prefill phase, the model processes all input tokens in parallel using matrix multiplications on the GPU — this is computationally efficient because modern GPUs are optimized for parallel operations. During the decode phase, the model generates output tokens one at a time in an autoregressive loop: each new token depends on all previous tokens, so the model must run a forward pass for every single output token. This sequential process is inherently less efficient and consumes more GPU-seconds per token. Additionally, during decoding, the model maintains a key-value (KV) cache that grows with each generated token, consuming GPU memory. The combination of sequential computation and growing memory requirements makes output generation 3–5x more compute-intensive per token than input processing, which is directly reflected in the pricing differential.
How do tokens work with images and multimodal inputs?
For multimodal models like GPT-4o and Claude 3.5, images are converted into a fixed or variable number of tokens depending on image size and detail level. OpenAI's GPT-4o charges based on image resolution: a low-detail image is 85 tokens, while a high-detail 2048x2048 image can be 1,105 tokens. Anthropic's Claude models calculate image tokens based on pixel count, at approximately (width × height) / 750 tokens. A 1024x1024 image in Claude costs approximately 1,400 tokens. These image tokens are billed at the same input token rate as text tokens. Audio inputs (for models that support them) are also tokenized — OpenAI's Whisper and GPT-4o audio mode convert audio into tokens at approximately 1 token per 10ms of audio. When building multimodal applications, factor in these conversion rates for accurate cost estimation. Resizing images to the minimum resolution that maintains quality is a quick win for reducing multimodal token costs.
What is the relationship between tokens and context windows?
The context window is the total token budget for a single API request, encompassing both input tokens and output tokens. If a model has a 128,000-token context window and you set max_tokens to 4,000 for the output, you can send up to 124,000 tokens of input (system prompt + user messages + conversation history + tool definitions). Every token of input you send is billed as an input token, and every token the model generates is billed as an output token, both within the context window limit. The critical cost implication is that stuffing the context window inflates your bill proportionally. Sending 100,000 tokens of context costs 50x more in input tokens than sending 2,000 tokens. Teams that naively include full conversation history or entire documents in every request often discover their context window usage — and their costs — growing geometrically over time. CostHawk's per-request analytics show context utilization as a percentage, making it easy to spot requests that are using more context than necessary.
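The budget arithmetic in this answer can be enforced client-side before a request is sent. A sketch, assuming a 128,000-token window (the function name and limits are illustrative; use your model's actual limits):

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """True if the prompt plus reserved output budget fits the context window."""
    return input_tokens + max_output_tokens <= context_window

# 124,000 input tokens + 4,000 reserved output tokens exactly fills the window
assert fits_context(124_000, 4_000)
# One thousand more input tokens and the request must be trimmed first
assert not fits_context(125_000, 4_000)
print("budget checks passed")
```

Running a check like this against a real tokenizer count lets you trim history proactively instead of handling context-length errors from the API.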
How can I reduce token usage without sacrificing output quality?
The most effective strategies for reducing tokens while maintaining quality are: (1) Optimize system prompts — rewrite instructions to be concise, remove redundant guidance, and test that shorter prompts produce equivalent outputs. Most teams can cut 30–50% of system prompt tokens without quality loss. (2) Use structured output formats — request JSON or CSV output instead of natural language when downstream code consumes the result. This produces shorter, more predictable outputs. (3) Implement prompt caching — Anthropic offers 90% discount and OpenAI offers 50% discount on cached input tokens for repeated prompt prefixes. (4) Manage conversation history — summarize older turns instead of sending raw history. A 50-token summary of 5 conversation turns replaces 2,000+ tokens of raw messages. (5) Route to smaller models — use GPT-4o mini or Claude Haiku for simple tasks where a frontier model is overkill. (6) Set explicit max_tokens — prevent the model from generating unnecessarily long responses. Track the impact of each change in CostHawk to confirm it reduces cost without degrading quality metrics.


Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.