Tokenization
The process of splitting raw text into discrete sub-word units called tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. Tokenization is the invisible first step of every LLM API call and directly determines how many tokens you are billed for — identical text fed through different tokenizers can produce token counts that vary by 10–40%, making tokenizer choice a material cost factor.
Definition
What is Tokenization?
Tokenizer vocabularies are large and provider-specific: OpenAI's current tokenizer (o200k_base) has a vocabulary of roughly 200,000 tokens; Anthropic's Claude tokenizer uses a proprietary BPE vocabulary of a similar scale; Google's Gemini models use a SentencePiece unigram vocabulary. Because every API provider bills by the token, understanding how tokenization works — and how different tokenizers handle your specific content — is foundational to accurate cost estimation, budget planning, and cross-provider cost comparison.
Impact
Why It Matters for AI Costs
Tokenization is the conversion layer between your text and your bill. Every character you send to an LLM API passes through a tokenizer that determines exactly how many billable units that text becomes. This makes tokenization the single most important mechanical process to understand for AI cost management.
Consider a concrete example. The sentence "The quick brown fox jumps over the lazy dog" produces:
- 9 tokens with OpenAI's `o200k_base` tokenizer (GPT-4o)
- 10 tokens with Anthropic's Claude tokenizer
- 11 tokens with Google's Gemini SentencePiece tokenizer
A 22% difference on a simple English sentence. Now scale that to production workloads processing millions of requests per day, and the cost implications become significant. A system processing 50 million tokens per day on a model charging $3.00 per million input tokens pays $150/day. If a different provider's tokenizer produces 15% fewer tokens for the same content, that is $22.50/day in savings — $8,212 per year — from tokenizer efficiency alone, before considering any difference in per-token price.
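The savings arithmetic above can be sketched directly. The volumes and prices below are the illustrative figures from this example, not live pricing:

```python
# Back-of-envelope tokenizer-savings math from the example above.
def daily_cost(tokens_per_day: int, price_per_million: float) -> float:
    """Daily spend for a given token volume and per-million-token price."""
    return tokens_per_day / 1_000_000 * price_per_million

baseline = daily_cost(50_000_000, 3.00)               # $150.00/day
efficient = daily_cost(int(50_000_000 * 0.85), 3.00)  # 15% fewer tokens, same price
savings_per_day = baseline - efficient                # $22.50/day
savings_per_year = savings_per_day * 365              # ~$8,212/year

print(f"${savings_per_day:.2f}/day, ${savings_per_year:,.0f}/year")
```

The same helper generalizes to any volume/price pair you want to compare.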
The impact is even more dramatic for non-English content and structured data. Chinese text can produce 2–3x more tokens than the equivalent semantic content in English, and JSON payloads are notoriously token-heavy due to punctuation overhead. Teams that understand these dynamics can make informed decisions about content format, language handling, and provider selection that compound into substantial savings.
CostHawk tracks per-request token counts across all providers, enabling you to compare tokenization efficiency for your actual production traffic — not theoretical benchmarks — and route requests to the most cost-efficient provider for each content type.
What is Tokenization?
Tokenization is the mandatory first step in every interaction with a large language model. Before an LLM can read your prompt, reason about it, or generate a response, your raw text must be converted into a sequence of integers — each integer representing a token from the model's fixed vocabulary. This process is deterministic: the same text always produces the same tokens with a given tokenizer, and it is reversible — tokens can be decoded back to the original text without information loss.
The need for tokenization arises from how neural networks operate. Transformers process fixed-dimensional vectors, not variable-length strings. The tokenizer bridges this gap by mapping every possible text input to a sequence of vocabulary indices, each of which is then looked up in an embedding table to produce the dense vectors the model actually computes on.
Why not just use characters or words? Character-level tokenization would require extremely long sequences (a 1,000-word paragraph is roughly 5,000 characters), making attention computation prohibitively expensive since transformer attention scales quadratically with sequence length. Word-level tokenization cannot handle misspellings, neologisms, or morphological variations without an impossibly large vocabulary. Sub-word tokenization — the approach used by BPE, WordPiece, and SentencePiece — strikes the optimal balance: common words become single tokens, uncommon words are split into recognizable fragments, and the vocabulary stays manageable (typically 32,000–200,000 entries).
The tokenization process is invisible to most API users — you send text, you get text back — but it operates behind every call and directly determines the token count that appears on your invoice. A 1,000-character English paragraph might become 250 tokens with one tokenizer and 290 with another. That 16% difference is pure cost overhead if you are on the less efficient tokenizer for your content type.
Understanding tokenization transforms cost management from guesswork to engineering. Instead of wondering why a seemingly short prompt costs more than expected, you can inspect the token breakdown, identify inefficiencies (unnecessary whitespace, verbose JSON keys, non-Latin characters that tokenize poorly), and make targeted optimizations that reduce your bill without changing what the model actually does.
Tokenization Algorithms Compared
Three tokenization algorithms dominate the LLM landscape, each with distinct mechanics and cost implications. Understanding their differences helps explain why the same text produces different token counts — and different costs — across providers.
| Algorithm | Used By | Vocabulary Size | Mechanism | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Byte-Pair Encoding (BPE) | OpenAI (GPT-4o, GPT-4), Anthropic (Claude), Meta (Llama 3), Mistral | 100K–200K | Iteratively merges the most frequent adjacent byte pairs in training data | Excellent compression for languages well-represented in training data; deterministic; fast encoding | Efficiency drops sharply for low-resource languages and domains underrepresented in training corpus |
| WordPiece | Google (BERT, ALBERT, DistilBERT), some older encoder models | 30K–110K | Similar to BPE but selects merges that maximize likelihood of the training data rather than raw frequency | Slightly better subword boundaries for morphologically rich languages; strong for classification tasks | Smaller vocabularies mean more tokens per text segment; largely superseded by SentencePiece/BPE for generative models |
| SentencePiece (Unigram) | Google (Gemini, T5, PaLM), some multilingual models | 32K–256K | Starts with a large candidate vocabulary and iteratively removes tokens that least affect the training data likelihood | Language-agnostic (operates on raw bytes/Unicode); handles multilingual and code content more uniformly | Can produce slightly longer sequences for English-heavy content compared to BPE with equivalent vocab size |
BPE in detail: Byte-Pair Encoding starts by treating every byte (or Unicode character) as an individual token. It then scans the training corpus to find the most frequently occurring adjacent pair of tokens, merges them into a new single token, and repeats. After 100,000+ merge operations, common English words like "the", "and", and "function" are single tokens, while rare terms are split into sub-word fragments. OpenAI's o200k_base tokenizer uses approximately 200,000 merge operations, producing a large vocabulary that compresses English very efficiently — roughly 4.0 characters per token on average.
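A toy version of the BPE training loop makes the mechanism concrete. This sketch works on characters instead of bytes, with a tiny corpus and a handful of merges; real tokenizers differ in scale, not in kind:

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int):
    """Toy BPE: start from individual characters, repeatedly merge the most
    frequent adjacent pair. Illustrative only -- production tokenizers operate
    on bytes with 100K+ merges learned from massive corpora."""
    tokens = list(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            # Replace every occurrence of the pair with the merged token.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("the theme then the thin the", 5)
print(merges)  # the first merges capture the most frequent sequences ("th", then "the")
print(tokens)
```

After just a few merges, the frequent string "the" collapses into a single token while rarer fragments stay split, which is exactly why common English words cost one token and rare identifiers cost several.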
SentencePiece in detail: Unlike BPE, which builds its vocabulary bottom-up, SentencePiece's unigram model works top-down. It starts with a massive candidate vocabulary (often 1 million+ entries) and iteratively prunes tokens whose removal has the least impact on the overall probability of the training corpus. This approach tends to produce more uniform tokenization across languages because it optimizes globally rather than greedily. For multilingual workloads, SentencePiece-based models often produce 5–15% fewer tokens for non-English text compared to BPE-based models.
Cost implication: The algorithm determines the vocabulary, the vocabulary determines token counts, and token counts determine cost. Choosing a provider whose tokenizer is most efficient for your content type can yield meaningful savings before you even consider per-token pricing differences.
How Tokenizers Affect Cost
Because each provider uses a different tokenizer, the same input text produces different token counts — and therefore different costs — depending on which API you call. This is not a theoretical concern; the differences are measurable and financially material at scale.
Here is a real-world comparison of token counts for identical text samples across major tokenizers:
| Text Sample | Characters | GPT-4o (o200k_base) | Claude (Anthropic BPE) | Gemini (SentencePiece) | Llama 3 (BPE 128K) |
|---|---|---|---|---|---|
| English blog post (500 words) | 2,847 | 648 | 671 | 702 | 659 |
| Python function with docstring | 1,200 | 389 | 412 | 421 | 395 |
| JSON API response (nested) | 3,500 | 1,102 | 1,187 | 1,095 | 1,134 |
| Chinese news article (500 chars) | 500 | 412 | 389 | 347 | 401 |
| Legal contract paragraph | 1,800 | 421 | 445 | 468 | 430 |
| TypeScript React component | 2,200 | 718 | 756 | 741 | 729 |
Key observations from this data:
- GPT-4o's 200K vocabulary gives it the best compression for English prose and code. Its larger vocabulary means more common multi-character sequences are captured as single tokens.
- Gemini's SentencePiece tokenizer excels on non-Latin scripts. The Chinese sample produces 16% fewer tokens on Gemini than GPT-4o, which translates to 16% lower input costs at identical per-token pricing.
- JSON and structured data show the most variability. Tokenizers handle punctuation-heavy content differently, and the range can be 10–15% between the most and least efficient tokenizer.
- Code tokenization is consistently less efficient than prose across all tokenizers. Expect 30–40% more tokens per character for source code compared to natural language.
To illustrate the financial impact: a team processing 100 million tokens per day of mixed English and JSON content on Claude 3.5 Sonnet ($3.00/1M input) pays $300/day. If GPT-4o produces 8% fewer tokens for the same content at $2.50/1M input, the equivalent cost is $230/day — a saving of $70/day, or $25,550 per year, from tokenizer efficiency and pricing combined.
CostHawk normalizes token counts across providers in its cost comparison reports, showing you the effective price-per-character (not just price-per-token) so you can make apples-to-apples comparisons that account for tokenizer efficiency differences.
Tokenization Across Languages
Tokenizer efficiency varies dramatically across languages because most tokenizers are trained primarily on English-heavy corpora. Languages with non-Latin scripts, agglutinative morphology, or limited representation in training data produce significantly more tokens per semantic unit, directly inflating costs for multilingual applications.
| Language | Script | Tokens per 1,000 Characters | Relative Cost vs English | Notes |
|---|---|---|---|---|
| English | Latin | ~250 | 1.0x (baseline) | Best-optimized across all tokenizers |
| Spanish | Latin | ~270 | 1.08x | Accented characters slightly less efficient |
| German | Latin | ~290 | 1.16x | Compound words split into more sub-tokens |
| French | Latin | ~265 | 1.06x | Close to English efficiency |
| Russian | Cyrillic | ~380 | 1.52x | Cyrillic characters less common in training data |
| Arabic | Arabic | ~420 | 1.68x | Right-to-left script, complex morphology |
| Hindi | Devanagari | ~480 | 1.92x | Devanagari characters often split into multiple tokens |
| Chinese (Simplified) | Han | ~650 | 2.60x | Each character often becomes 1–3 tokens |
| Japanese | Mixed (Kanji/Hiragana/Katakana) | ~580 | 2.32x | Mixed scripts add complexity |
| Korean | Hangul | ~520 | 2.08x | Syllable blocks partially mitigate inefficiency |
| Thai | Thai | ~550 | 2.20x | No word boundaries in script; tokenizer must infer |
Why does this happen? BPE and similar algorithms learn merge rules from training data. Since training corpora are overwhelmingly English-dominant (estimates range from 50–80% English for most foundation models), English text benefits from the most merge operations. Common English words and phrases are single tokens. Chinese characters, by contrast, appear less frequently in training data, so the tokenizer learns fewer merges for Chinese character sequences — meaning each character is often split into two or three byte-level tokens.
The cost impact is substantial. A customer support chatbot serving Chinese-speaking users will pay 2.5–3x more per conversation in token costs compared to an identical English chatbot, all else being equal. For a high-volume multilingual application processing 1 million conversations per month across 10 languages, the tokenization efficiency gap can represent $10,000–$50,000/month in additional costs for non-English traffic.
Mitigation strategies for multilingual workloads:
- Evaluate SentencePiece-based models (Gemini) for non-Latin content. SentencePiece's language-agnostic approach often produces 10–20% fewer tokens for CJK languages compared to BPE-based models.
- Pre-translate to English when feasible. If your pipeline involves sending large context blocks (e.g., documents for summarization), translating to English first and then processing with a BPE model can actually be cheaper than processing the original language directly — translation costs may be offset by tokenization savings.
- Monitor per-language costs with CostHawk's tag-based analytics. Tag requests by language to see exactly how much each language costs relative to English and identify the highest-ROI optimization targets.
- Use shorter prompts in high-cost languages. If Chinese tokenizes at 2.6x the cost of English, every word in your Chinese system prompt costs 2.6x more. Keep system prompts minimal for high-cost languages and rely more on few-shot examples that fit the model's training distribution.
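The per-language multipliers above can be turned into a rough cost estimator. The tokens-per-1,000-characters figures below are the approximate values from the table; real counts require the target model's tokenizer:

```python
# Rough per-language input-cost estimator using the approximate
# tokens-per-1,000-characters figures from the table above.
TOKENS_PER_1K_CHARS = {
    "english": 250, "german": 290, "russian": 380,
    "japanese": 580, "chinese": 650,
}

def estimated_cost(text_chars: int, language: str, price_per_million: float) -> float:
    """Estimate input cost for a text of a given length and language."""
    tokens = text_chars / 1000 * TOKENS_PER_1K_CHARS[language]
    return tokens / 1_000_000 * price_per_million

# Same 10,000-character document, same $3.00/1M model, different languages:
en = estimated_cost(10_000, "english", 3.00)
zh = estimated_cost(10_000, "chinese", 3.00)
print(f"English: ${en:.4f}, Chinese: ${zh:.4f} ({zh/en:.1f}x)")
```

This is a planning tool, not a billing tool: use it to flag which languages deserve per-language monitoring, then verify against the provider-reported usage counts.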
Tokenizing Code, JSON, and Structured Data
Source code and structured data formats like JSON, XML, and YAML are among the most token-inefficient content types, yet they are ubiquitous in AI-powered developer tools, code assistants, and data processing pipelines. Understanding why — and how to mitigate the cost — is critical for teams building these applications.
Why code tokenizes poorly:
- Punctuation density. Code contains far more punctuation (braces, brackets, semicolons, colons, operators) than natural language. Each punctuation character is typically its own token or shares a token with adjacent whitespace, producing a low characters-per-token ratio.
- Variable and function names. Identifiers like `calculateMonthlyRevenue` or `user_authentication_middleware` are domain-specific strings that do not appear in the tokenizer's training data as single entries. They are split into multiple sub-word tokens: `["calculate", "Monthly", "Revenue"]` or `["user", "_authentication", "_middleware"]`.
- Indentation. Whitespace-sensitive languages like Python pay a token cost for every level of indentation. A function nested 4 levels deep with 4-space indentation consumes 16 characters of whitespace per line — roughly 4 tokens per line just for indentation.
- Repetitive boilerplate. Import statements, type annotations, and framework boilerplate add tokens that carry structural but not semantic novelty.
JSON is particularly expensive:
```
// This JSON payload is 164 characters and ~62 tokens:
{
  "user_id": "usr_12345",
  "subscription_plan": "enterprise",
  "monthly_cost": 299.99,
  "features": ["analytics", "alerts", "api_access"]
}

// The same data as compact text is 95 characters and ~24 tokens:
user usr_12345 enterprise $299.99 analytics,alerts,api_access
```
That is a 2.6x token reduction for the same information by switching from JSON to a compact text format. At scale — say, 100,000 requests per day each including a 500-token JSON context block — switching to compact text could save roughly 31 million tokens per day, or about $77/day on GPT-4o input pricing (roughly $28,000/year).
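Even without changing the format, minification alone recovers part of this saving. A sketch using only the standard library, with a payload mirroring the example above:

```python
import json

payload = {
    "user_id": "usr_12345",
    "subscription_plan": "enterprise",
    "monthly_cost": 299.99,
    "features": ["analytics", "alerts", "api_access"],
}

# Pretty-printed: what you would typically copy from a log or debugger.
pretty = json.dumps(payload, indent=2)

# Minified: separators=(",", ":") drops the spaces after commas and colons,
# and omitting indent drops all newlines and indentation.
minified = json.dumps(payload, separators=(",", ":"))

# Fewer characters generally means fewer tokens for punctuation-heavy content.
print(len(pretty), "chars pretty vs", len(minified), "chars minified")
```

Minification is lossless (the data round-trips exactly), which makes it the safest first optimization for JSON-heavy prompts.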
Optimization strategies for structured data:
- Minify JSON before including it in prompts. Remove pretty-printing, collapse whitespace, and use short key names. `{"n":"John","a":30}` uses fewer tokens than `{"name": "John", "age": 30}`.
- Use CSV or TSV for tabular data. Column headers plus comma-separated rows produce 40–60% fewer tokens than the equivalent JSON array of objects.
- Summarize instead of including raw data. Instead of sending 50 rows of database results as JSON, send a natural-language summary: "The top 5 users by spend are: Alice ($4,200), Bob ($3,800)..." This conveys the same information in far fewer tokens.
- Strip comments and docstrings from code context. If you are sending source code as context for a code review or refactoring task, remove comments and docstrings that the model does not need to see. This can reduce code token counts by 15–25%.
- Use tree-sitter or AST-based extraction to send only the relevant code structures (function signatures, class definitions) rather than entire files.
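As a sketch of the CSV-vs-JSON point, the standard library is enough to compare the two encodings character for character (character count is a reasonable proxy for token count, at roughly 3–4 characters per token for structured text):

```python
import csv
import io
import json

rows = [
    {"name": "Alice", "plan": "pro", "spend": 4200},
    {"name": "Bob", "plan": "team", "spend": 3800},
]

# JSON repeats every key on every row.
as_json = json.dumps(rows)

# CSV states the keys once, in the header, then only values per row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "plan", "spend"])
writer.writeheader()
writer.writerows(rows)
as_csv = buf.getvalue()

print(len(as_json), "chars as JSON vs", len(as_csv), "chars as CSV")
```

The gap widens with row count, since JSON's per-row key overhead is fixed while CSV's header cost is paid once.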
CostHawk's per-request analytics show token counts alongside request metadata, making it straightforward to identify which requests contain token-heavy structured data and quantify the savings from format optimization.
Choosing Models by Tokenization Efficiency
Model selection is usually driven by capability, latency, and per-token price — but tokenization efficiency deserves a seat at the table. Two models with identical per-token pricing can have materially different effective costs if one tokenizer compresses your content more efficiently than the other.
Framework for tokenizer-aware model selection:
Step 1: Profile your content. Take a representative sample of your production prompts (at least 1,000 requests) and run them through each candidate model's tokenizer. Record the token count for each sample across all tokenizers. Tools for this include OpenAI's tiktoken library, Anthropic's @anthropic-ai/tokenizer, and Google's countTokens API endpoint.
Step 2: Calculate effective cost per character. This is the metric that accounts for tokenizer efficiency:
```
effective_cost_per_char = (price_per_million_tokens / 1,000,000) × (tokens / characters)

// Example for English prose:
// GPT-4o:            ($2.50 / 1M) × (250 tokens / 1,000 chars) = $0.000000625/char
// Claude 3.5 Sonnet: ($3.00 / 1M) × (268 tokens / 1,000 chars) = $0.000000804/char
// Gemini 1.5 Pro:    ($1.25 / 1M) × (280 tokens / 1,000 chars) = $0.000000350/char
```
Step 3: Factor in output tokenization. Output tokens matter too. If one model's tokenizer produces more compact output (fewer tokens for equivalent semantic content), your output costs decrease. This is harder to measure because output content varies, but you can estimate by comparing the average output token count for identical prompts across models.
Step 4: Weight by content mix. If 70% of your traffic is English prose and 30% is JSON, calculate a weighted effective cost that reflects your actual content distribution. The model that is cheapest for English prose may not be cheapest for JSON-heavy workloads.
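Steps 2 and 4 combine into a small calculator. The prose figure below matches the GPT-4o worked example above; the JSON tokens-per-1,000-characters figure is an assumption for illustration, so substitute measurements from your own traffic:

```python
def effective_cost_per_char(price_per_million: float, tokens_per_1k_chars: float) -> float:
    """Price per character, accounting for how densely the tokenizer packs text."""
    return price_per_million / 1_000_000 * (tokens_per_1k_chars / 1000)

# GPT-4o English prose (from the worked example) and an assumed JSON figure.
prose_cost = effective_cost_per_char(2.50, 250)
json_cost = effective_cost_per_char(2.50, 330)  # assumption: JSON tokenizes less densely

# Step 4: weight by the actual content mix (70% prose / 30% JSON).
weighted = 0.70 * prose_cost + 0.30 * json_cost
print(f"weighted effective cost: ${weighted:.9f}/char")
```

Running this per candidate model gives a single comparable number per provider, which is the figure that should drive the selection decision.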
Practical recommendations by content type:
| Content Type | Most Efficient Tokenizer | Runner-Up | Avoid |
|---|---|---|---|
| English prose | GPT-4o (o200k_base) | Llama 3 (128K BPE) | Older 32K-vocab models |
| Source code (Python/JS/TS) | GPT-4o (o200k_base) | Claude (Anthropic BPE) | SentencePiece (slightly less efficient for code) |
| JSON / structured data | Gemini (SentencePiece) | GPT-4o (o200k_base) | Small-vocab BPE models |
| Chinese / Japanese / Korean | Gemini (SentencePiece) | Llama 3 (128K BPE) | GPT-3.5-era tokenizers |
| Multilingual mixed | Gemini (SentencePiece) | GPT-4o (o200k_base) | English-optimized small vocab |
The takeaway: do not just compare price-per-token across providers. Compare price-per-character (or price-per-semantic-unit) to account for tokenizer efficiency. CostHawk's provider comparison reports show effective cost per request — normalizing for tokenizer differences — so you can see which provider is genuinely cheapest for your specific workload, not just cheapest on a per-token sticker price.
FAQ
Frequently Asked Questions
What is the difference between tokenization and tokenizing in the crypto/security sense?
In the LLM context, tokenization means splitting text into sub-word units for model processing and billing. In the payments and security context, tokenization means replacing sensitive data (such as a credit card number) with a non-sensitive surrogate token. The two concepts share a name but are otherwise unrelated; this glossary uses the LLM meaning throughout.
Can I change which tokenizer a model uses to reduce my costs?
No. The tokenizer is a fixed part of each model: when you call GPT-4o, your text is always encoded with o200k_base. When you call Claude 3.5 Sonnet, Anthropic's proprietary tokenizer is always used. You cannot bring your own tokenizer to a hosted API. However, you can influence how many tokens the tokenizer produces by optimizing your input content: minifying JSON, shortening variable names in code context, compressing whitespace, and using compact natural language. You can also choose which model (and therefore which tokenizer) to use for each request by implementing model routing. If your content tokenizes 15% more efficiently on GPT-4o than Claude, routing that content to GPT-4o saves 15% on input token costs — even if the per-token prices are similar. CostHawk's multi-provider analytics help you identify these routing opportunities.
How much do tokenizer vocabulary size differences actually matter for cost?
More than most teams expect. A larger vocabulary captures more multi-character sequences as single tokens, so the same text produces fewer billable tokens. As the comparison tables above show, the gap between tokenizers runs from a few percent to around 15% for English prose and structured data, and considerably wider for non-English text, which is financially material at scale even before per-token price differences.
Does prompt caching interact with tokenization?
Yes. Prompt caching operates on token sequences: a cached prefix is reused only when the incoming request's tokens match the cached tokens exactly, so any edit to the text, even a whitespace change, that alters tokenization invalidates the cache. Cached input tokens are also billed at a discounted rate, so cache hit rates and token counts jointly determine your effective input cost.
Why does the same word sometimes produce different numbers of tokens in different contexts?
Because BPE merge rules operate on byte sequences, not words, tokenization is sensitive to surrounding context. "Hello" at the start of a string might be a single token, but " Hello" (with a leading space) is typically also a single but different token — the space is merged with the word. However, if the word follows a newline, tab, or unusual punctuation, the byte context changes and the merge rules may produce a different tokenization. Additionally, capitalization matters: "the" and "The" are different byte sequences and may map to different tokens or different numbers of tokens. In practice, these contextual differences are small — usually 0–1 token per word — but they accumulate across long documents. This is why client-side token estimation functions that count words and multiply by 1.3 are only approximate. For exact counts, always use the official tokenizer library for your target model.
How do special tokens and chat formatting affect my billed token count?
Chat-formatted requests carry structural overhead beyond your visible text: special tokens mark message boundaries, roles (system, user, assistant), and turn endings, typically adding a few tokens per message on top of your content — and these overhead tokens are billable. For accurate accounting, rely on the provider-reported token count (the usage field), which includes all overhead tokens, giving you an accurate picture of true costs.
Is there a way to preview exactly how my text will be tokenized before sending it to the API?
Yes. For OpenAI models, the tiktoken Python library lets you encode text and see the exact token boundaries: `tiktoken.encoding_for_model('gpt-4o').encode('your text')` returns a list of token IDs, and `.decode_single_token_bytes()` shows you the text each token represents. OpenAI also offers an online Tokenizer tool at platform.openai.com/tokenizer. For Anthropic, the @anthropic-ai/tokenizer npm package provides countTokens() and encoding functions. For Google Gemini, call the countTokens REST endpoint or use the Python SDK's count_tokens() method. For local preview without API calls, the tokenizers Python library by Hugging Face can load any tokenizer from the Hub and encode text locally. These tools are invaluable for debugging unexpectedly high token counts, optimizing prompt formats, and verifying that compression techniques actually reduce tokens. Build token counting into your CI/CD pipeline to catch prompt size regressions before they reach production.
How will tokenization evolve, and should I future-proof my cost estimates?
Tokenizers keep evolving: vocabularies are growing (o200k_base roughly doubled the vocabulary of its predecessor cl100k_base), and newer tokenizers compress code and multilingual text more efficiently. Rather than hard-coding tokens-per-character assumptions, base your cost tracking on the usage object returned by each API call rather than client-side estimates. This way, as tokenizers improve and produce fewer tokens for the same content, your cost tracking automatically reflects the savings without code changes.
Related Terms
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
