Glossary · Billing & Pricing · Updated 2026-03-16

Tokenization

The process of splitting raw text into discrete sub-word units called tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. Tokenization is the invisible first step of every LLM API call and directly determines how many tokens you are billed for — identical text fed through different tokenizers can produce token counts that vary by 10–40%, making tokenizer choice a material cost factor.

Definition

What is Tokenization?

Tokenization is the process by which raw text — characters, words, punctuation, whitespace — is converted into a sequence of integer-indexed sub-word units that a large language model can process. Every LLM operates on tokens, not characters or words, and the specific algorithm used to perform this conversion is called a tokenizer.

The dominant tokenization algorithm in modern LLMs is Byte-Pair Encoding (BPE), but alternatives like WordPiece (used in BERT-family models) and SentencePiece (used in Google's Gemini and T5 families) are also widely deployed. Each algorithm produces a different vocabulary — a mapping from text fragments to integer IDs — which means the same English sentence can yield different token counts depending on which model's tokenizer processes it. OpenAI's GPT-4o tokenizer (o200k_base) has a vocabulary of roughly 200,000 tokens; Anthropic's Claude tokenizer uses a proprietary BPE vocabulary of a similar scale; Google's Gemini models use a SentencePiece unigram vocabulary.

Because every API provider bills by the token, understanding how tokenization works — and how different tokenizers handle your specific content — is foundational to accurate cost estimation, budget planning, and cross-provider cost comparison.

Impact

Why It Matters for AI Costs

Tokenization is the conversion layer between your text and your bill. Every character you send to an LLM API passes through a tokenizer that determines exactly how many billable units that text becomes. This makes tokenization the single most important mechanical process to understand for AI cost management.

Consider a concrete example. The sentence "The quick brown fox jumps over the lazy dog" produces:

  • 9 tokens with OpenAI's o200k_base tokenizer (GPT-4o)
  • 10 tokens with Anthropic's Claude tokenizer
  • 11 tokens with Google's Gemini SentencePiece tokenizer

A 22% difference on a simple English sentence. Now scale that to production workloads processing millions of requests per day, and the cost implications become significant. A system processing 50 million tokens per day on a model charging $3.00 per million input tokens pays $150/day. If a different provider's tokenizer produces 15% fewer tokens for the same content, that is $22.50/day in savings — $8,212 per year — from tokenizer efficiency alone, before considering any difference in per-token price.
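The arithmetic above can be sketched in a few lines. This is a back-of-envelope model using the figures from this paragraph; `daily_cost` is an illustrative helper, not a CostHawk API:

```python
# Back-of-envelope model of the savings described above. All figures
# come from the surrounding text; function and variable names are
# illustrative only.
def daily_cost(tokens_per_day: float, price_per_million: float) -> float:
    return tokens_per_day / 1_000_000 * price_per_million

baseline = daily_cost(50_000_000, 3.00)          # $150.00/day
efficient = daily_cost(50_000_000 * 0.85, 3.00)  # same content, 15% fewer tokens
savings_per_year = (baseline - efficient) * 365

print(baseline, baseline - efficient, round(savings_per_year))
# → 150.0 22.5 8212
```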

The impact is even more dramatic for non-English content and structured data. Chinese text can produce 2–3x more tokens than the equivalent semantic content in English, and JSON payloads are notoriously token-heavy due to punctuation overhead. Teams that understand these dynamics can make informed decisions about content format, language handling, and provider selection that compound into substantial savings.

CostHawk tracks per-request token counts across all providers, enabling you to compare tokenization efficiency for your actual production traffic — not theoretical benchmarks — and route requests to the most cost-efficient provider for each content type.

What is Tokenization?

Tokenization is the mandatory first step in every interaction with a large language model. Before an LLM can read your prompt, reason about it, or generate a response, your raw text must be converted into a sequence of integers — each integer representing a token from the model's fixed vocabulary. This process is deterministic: the same text always produces the same tokens with a given tokenizer, and it is reversible — tokens can be decoded back to the original text without information loss.

The need for tokenization arises from how neural networks operate. Transformers process fixed-dimensional vectors, not variable-length strings. The tokenizer bridges this gap by mapping every possible text input to a sequence of vocabulary indices, each of which is then looked up in an embedding table to produce the dense vectors the model actually computes on.

Why not just use characters or words? Character-level tokenization would require extremely long sequences (a 1,000-word paragraph is roughly 5,000 characters), making attention computation prohibitively expensive since transformer attention scales quadratically with sequence length. Word-level tokenization cannot handle misspellings, neologisms, or morphological variations without an impossibly large vocabulary. Sub-word tokenization — the approach used by BPE, WordPiece, and SentencePiece — strikes the optimal balance: common words become single tokens, uncommon words are split into recognizable fragments, and the vocabulary stays manageable (typically 32,000–200,000 entries).
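The quadratic-attention argument can be made concrete with rough arithmetic, using the approximations from the paragraph above (a 1,000-word paragraph is ~5,000 characters, and English sub-word BPE averages ~4 characters per token):

```python
# Rough comparison of attention cost for the same 1,000-word paragraph
# under character-level vs sub-word tokenization. Figures are the
# approximations used in the surrounding prose, not measurements.
chars = 5_000
char_level_len = chars      # one token per character
subword_len = chars // 4    # ~4 chars/token for English sub-word BPE

# Self-attention does O(n^2) pairwise comparisons per layer.
ratio = (char_level_len ** 2) / (subword_len ** 2)
print(ratio)  # → 16.0: character-level attention is ~16x more expensive
```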

The tokenization process is invisible to most API users — you send text, you get text back — but it operates behind every call and directly determines the token count that appears on your invoice. A 1,000-character English paragraph might become 250 tokens with one tokenizer and 290 with another. That 16% difference is pure cost overhead if you are on the less efficient tokenizer for your content type.

Understanding tokenization transforms cost management from guesswork to engineering. Instead of wondering why a seemingly short prompt costs more than expected, you can inspect the token breakdown, identify inefficiencies (unnecessary whitespace, verbose JSON keys, non-Latin characters that tokenize poorly), and make targeted optimizations that reduce your bill without changing what the model actually does.

Tokenization Algorithms Compared

Three tokenization algorithms dominate the LLM landscape, each with distinct mechanics and cost implications. Understanding their differences helps explain why the same text produces different token counts — and different costs — across providers.

| Algorithm | Used By | Vocabulary Size | Mechanism | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Byte-Pair Encoding (BPE) | OpenAI (GPT-4o, GPT-4), Anthropic (Claude), Meta (Llama 3), Mistral | 100K–200K | Iteratively merges the most frequent adjacent byte pairs in training data | Excellent compression for languages well-represented in training data; deterministic; fast encoding | Efficiency drops sharply for low-resource languages and domains underrepresented in training corpus |
| WordPiece | Google (BERT, ALBERT, DistilBERT), some older encoder models | 30K–110K | Similar to BPE but selects merges that maximize likelihood of the training data rather than raw frequency | Slightly better subword boundaries for morphologically rich languages; strong for classification tasks | Smaller vocabularies mean more tokens per text segment; largely superseded by SentencePiece/BPE for generative models |
| SentencePiece (Unigram) | Google (Gemini, T5, PaLM), some multilingual models | 32K–256K | Starts with a large candidate vocabulary and iteratively removes tokens that least affect the training data likelihood | Language-agnostic (operates on raw bytes/Unicode); handles multilingual and code content more uniformly | Can produce slightly longer sequences for English-heavy content compared to BPE with equivalent vocab size |

BPE in detail: Byte-Pair Encoding starts by treating every byte (or Unicode character) as an individual token. It then scans the training corpus to find the most frequently occurring adjacent pair of tokens, merges them into a new single token, and repeats. After 100,000+ merge operations, common English words like "the", "and", and "function" are single tokens, while rare terms are split into sub-word fragments. OpenAI's o200k_base tokenizer uses approximately 200,000 merge operations, producing a large vocabulary that compresses English very efficiently — roughly 4.0 characters per token on average.
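The merge loop described above can be sketched in a few dozen lines. This is a toy word-level illustration under stated simplifications (no byte fallback, no special tokens, nothing like a production tokenizer):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy illustration of BPE training: repeatedly merge the most
    frequent adjacent token pair. Not a production tokenizer."""
    # Start with every word split into single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged token.
        for i, w in enumerate(words):
            out, j = [], 0
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(w[j])
                    j += 1
            words[i] = out
    return merges

print(bpe_train(["the", "then", "there", "that"], 2))
# → [('t', 'h'), ('th', 'e')]
```

After just two merges on this tiny corpus, "the" has already collapsed toward a single token — the same dynamic that, after 100,000+ merges on a real corpus, makes common English words single tokens.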

SentencePiece in detail: Unlike BPE, which builds its vocabulary bottom-up, SentencePiece's unigram model works top-down. It starts with a massive candidate vocabulary (often 1 million+ entries) and iteratively prunes tokens whose removal has the least impact on the overall probability of the training corpus. This approach tends to produce more uniform tokenization across languages because it optimizes globally rather than greedily. For multilingual workloads, SentencePiece-based models often produce 5–15% fewer tokens for non-English text compared to BPE-based models.

Cost implication: The algorithm determines the vocabulary, the vocabulary determines token counts, and token counts determine cost. Choosing a provider whose tokenizer is most efficient for your content type can yield meaningful savings before you even consider per-token pricing differences.

How Tokenizers Affect Cost

Because each provider uses a different tokenizer, the same input text produces different token counts — and therefore different costs — depending on which API you call. This is not a theoretical concern; the differences are measurable and financially material at scale.

Here is a real-world comparison of token counts for identical text samples across major tokenizers:

| Text Sample | Characters | GPT-4o (o200k_base) | Claude (Anthropic BPE) | Gemini (SentencePiece) | Llama 3 (BPE 128K) |
|---|---|---|---|---|---|
| English blog post (500 words) | 2,847 | 648 | 671 | 702 | 659 |
| Python function with docstring | 1,200 | 389 | 412 | 421 | 395 |
| JSON API response (nested) | 3,500 | 1,102 | 1,187 | 1,095 | 1,134 |
| Chinese news article (500 chars) | 500 | 412 | 389 | 347 | 401 |
| Legal contract paragraph | 1,800 | 421 | 445 | 468 | 430 |
| TypeScript React component | 2,200 | 718 | 756 | 741 | 729 |

Key observations from this data:

  • GPT-4o's 200K vocabulary gives it the best compression for English prose and code. Its larger vocabulary means more common multi-character sequences are captured as single tokens.
  • Gemini's SentencePiece tokenizer excels on non-Latin scripts. The Chinese sample produces 16% fewer tokens on Gemini than GPT-4o, which translates to 16% lower input costs at identical per-token pricing.
  • JSON and structured data show the most variability. Tokenizers handle punctuation-heavy content differently, and the range can be 10–15% between the most and the least efficient tokenizer.
  • Code tokenization is consistently less efficient than prose across all tokenizers. Expect 30–40% more tokens per character for source code compared to natural language.

To illustrate the financial impact: a team processing 100 million tokens per day of mixed English and JSON content on Claude 3.5 Sonnet ($3.00/1M input) pays $300/day. If GPT-4o produces 8% fewer tokens for the same content at $2.50/1M input, the equivalent cost is $230/day — a saving of $70/day, or $25,550 per year, from tokenizer efficiency and pricing combined.
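Plugging the "English blog post" row of the table above into a per-character cost calculation shows why raw token counts alone can mislead. The per-token prices are the ones used in this section and may not reflect current provider pricing:

```python
# Effective price per character for the "English blog post" sample
# (2,847 characters) from the comparison table above. Prices are the
# illustrative per-million-input-token rates used in this section.
samples = {
    "gpt-4o": {"tokens": 648, "price_per_m": 2.50},
    "claude": {"tokens": 671, "price_per_m": 3.00},
    "gemini": {"tokens": 702, "price_per_m": 1.25},
}
chars = 2_847

for name, s in samples.items():
    per_char = s["price_per_m"] / 1_000_000 * s["tokens"] / chars
    print(f"{name}: ${per_char:.10f}/char")
```

Gemini produces the most tokens for this sample yet has the lowest effective cost per character, because its per-token price more than compensates for its slightly weaker compression.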

CostHawk normalizes token counts across providers in its cost comparison reports, showing you the effective price-per-character (not just price-per-token) so you can make apples-to-apples comparisons that account for tokenizer efficiency differences.

Tokenization Across Languages

Tokenizer efficiency varies dramatically across languages because most tokenizers are trained primarily on English-heavy corpora. Languages with non-Latin scripts, agglutinative morphology, or limited representation in training data produce significantly more tokens per semantic unit, directly inflating costs for multilingual applications.

| Language | Script | Tokens per 1,000 Characters | Relative Cost vs English | Notes |
|---|---|---|---|---|
| English | Latin | ~250 | 1.0x (baseline) | Best-optimized across all tokenizers |
| Spanish | Latin | ~270 | 1.08x | Accented characters slightly less efficient |
| German | Latin | ~290 | 1.16x | Compound words split into more sub-tokens |
| French | Latin | ~265 | 1.06x | Close to English efficiency |
| Russian | Cyrillic | ~380 | 1.52x | Cyrillic characters less common in training data |
| Arabic | Arabic | ~420 | 1.68x | Right-to-left script, complex morphology |
| Hindi | Devanagari | ~480 | 1.92x | Devanagari characters often split into multiple tokens |
| Chinese (Simplified) | Han | ~650 | 2.60x | Each character often becomes 1–3 tokens |
| Japanese | Mixed (Kanji/Hiragana/Katakana) | ~580 | 2.32x | Mixed scripts add complexity |
| Korean | Hangul | ~520 | 2.08x | Syllable blocks partially mitigate inefficiency |
| Thai | Thai | ~550 | 2.20x | No word boundaries in script; tokenizer must infer |

Why does this happen? BPE and similar algorithms learn merge rules from training data. Since training corpora are overwhelmingly English-dominant (estimates range from 50–80% English for most foundation models), English text benefits from the most merge operations. Common English words and phrases are single tokens. Chinese characters, by contrast, appear less frequently in training data, so the tokenizer learns fewer merges for Chinese character sequences — meaning each character is often split into two or three byte-level tokens.

The cost impact is substantial. A customer support chatbot serving Chinese-speaking users will pay 2.5–3x more per conversation in token costs compared to an identical English chatbot, all else being equal. For a high-volume multilingual application processing 1 million conversations per month across 10 languages, the tokenization efficiency gap can represent $10,000–$50,000/month in additional costs for non-English traffic.
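Under stated assumptions (illustrative traffic shares and a flat $3.00/1M input price; the multipliers come from the table above), a per-language cost estimate looks like this:

```python
# Estimate monthly input cost by language using the relative-cost
# multipliers from the table above. The traffic shares, volume, and
# price are illustrative assumptions, not CostHawk data.
multipliers = {"en": 1.0, "de": 1.16, "zh": 2.60, "ja": 2.32}
traffic_share = {"en": 0.6, "de": 0.1, "zh": 0.2, "ja": 0.1}
english_tokens_equiv = 100_000_000  # monthly volume at English efficiency
price_per_million = 3.00

cost = sum(
    english_tokens_equiv * share * multipliers[lang]
    / 1_000_000 * price_per_million
    for lang, share in traffic_share.items()
)
print(round(cost, 2))  # vs $300.00 if all traffic were English
```

For this mix, the tokenization gap alone inflates the bill from $300 to about $440 per month per 100M English-equivalent tokens.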

Mitigation strategies for multilingual workloads:

  • Evaluate SentencePiece-based models (Gemini) for non-Latin content. SentencePiece's language-agnostic approach often produces 10–20% fewer tokens for CJK languages compared to BPE-based models.
  • Pre-translate to English when feasible. If your pipeline involves sending large context blocks (e.g., documents for summarization), translating to English first and then processing with a BPE model can actually be cheaper than processing the original language directly — translation costs may be offset by tokenization savings.
  • Monitor per-language costs with CostHawk's tag-based analytics. Tag requests by language to see exactly how much each language costs relative to English and identify the highest-ROI optimization targets.
  • Use shorter prompts in high-cost languages. If Chinese tokenizes at 2.6x the cost of English, every word in your Chinese system prompt costs 2.6x more. Keep system prompts minimal for high-cost languages and rely more on few-shot examples that fit the model's training distribution.

Tokenizing Code, JSON, and Structured Data

Source code and structured data formats like JSON, XML, and YAML are among the most token-inefficient content types, yet they are ubiquitous in AI-powered developer tools, code assistants, and data processing pipelines. Understanding why — and how to mitigate the cost — is critical for teams building these applications.

Why code tokenizes poorly:

  • Punctuation density. Code contains far more punctuation (braces, brackets, semicolons, colons, operators) than natural language. Each punctuation character is typically its own token or shares a token with adjacent whitespace, producing a low characters-per-token ratio.
  • Variable and function names. Identifiers like calculateMonthlyRevenue or user_authentication_middleware are domain-specific strings that do not appear in the tokenizer's training data as single entries. They are split into multiple sub-word tokens: ["calculate", "Monthly", "Revenue"] or ["user", "_authentication", "_middleware"].
  • Indentation. Whitespace-sensitive languages like Python pay a token cost for every level of indentation. A function nested 4 levels deep with 4-space indentation consumes 16 characters of whitespace per line — roughly 4 tokens per line just for indentation.
  • Repetitive boilerplate. Import statements, type annotations, and framework boilerplate add tokens that carry structural but not semantic novelty.
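The identifier-fragmentation point can be illustrated with a rough camelCase/snake_case splitter. This is a heuristic approximation only; real BPE boundaries depend on the learned vocabulary and often keep underscores attached, as in the example above:

```python
import re

def split_identifier(name: str) -> list[str]:
    """Heuristic illustration of how identifiers fragment into
    sub-word pieces. Real tokenizer boundaries differ."""
    out = []
    for part in name.split("_"):
        # Break on camelCase humps, all-caps runs, and digit runs.
        out.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return out

print(split_identifier("calculateMonthlyRevenue"))
# → ['calculate', 'Monthly', 'Revenue']
print(split_identifier("user_authentication_middleware"))
```

Each fragment costs at least one token, so a 25-character identifier routinely consumes three or more tokens.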

JSON is particularly expensive:

```
// This JSON payload is ~145 characters and ~62 tokens:
{
  "user_id": "usr_12345",
  "subscription_plan": "enterprise",
  "monthly_cost": 299.99,
  "features": ["analytics", "alerts", "api_access"]
}

// The same data as compact text is ~61 characters and ~24 tokens:
user usr_12345 enterprise $299.99 analytics,alerts,api_access
```

That is a 2.6x token reduction for the same information by switching from JSON to a compact text format. At scale, say 100,000 requests per day each including a 500-token JSON context block, switching to compact text could save roughly 31 million tokens per day, or about $77/day on GPT-4o input pricing ($2.50/1M input), roughly $28,000 per year.

Optimization strategies for structured data:

  • Minify JSON before including it in prompts. Remove pretty-printing, collapse whitespace, and use short key names. {"n":"John","a":30} uses fewer tokens than {"name": "John", "age": 30}.
  • Use CSV or TSV for tabular data. Column headers plus comma-separated rows produce 40–60% fewer tokens than the equivalent JSON array of objects.
  • Summarize instead of including raw data. Instead of sending 50 rows of database results as JSON, send a natural-language summary: "The top 5 users by spend are: Alice ($4,200), Bob ($3,800)..." This conveys the same information in far fewer tokens.
  • Strip comments and docstrings from code context. If you are sending source code as context for a code review or refactoring task, remove comments and docstrings that the model does not need to see. This can reduce code token counts by 15–25%.
  • Use tree-sitter or AST-based extraction to send only the relevant code structures (function signatures, class definitions) rather than entire files.
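The first strategy, minifying JSON, can be sketched with the standard library. The comparison below is in characters, not tokens, since exact token counts depend on the tokenizer:

```python
import json

# The sample payload from the example above.
payload = {
    "user_id": "usr_12345",
    "subscription_plan": "enterprise",
    "monthly_cost": 299.99,
    "features": ["analytics", "alerts", "api_access"],
}

# Pretty-printed vs minified: identical data, fewer characters
# (and therefore fewer tokens when embedded in a prompt).
expanded = json.dumps(payload, indent=2)
minified = json.dumps(payload, separators=(",", ":"))

print(len(expanded), len(minified))
```

`separators=(",", ":")` drops the default space after each delimiter; combined with removing indentation, this is the cheapest token optimization available because it requires no schema changes.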

CostHawk's per-request analytics show token counts alongside request metadata, making it straightforward to identify which requests contain token-heavy structured data and quantify the savings from format optimization.

Choosing Models by Tokenization Efficiency

Model selection is usually driven by capability, latency, and per-token price — but tokenization efficiency deserves a seat at the table. Two models with identical per-token pricing can have materially different effective costs if one tokenizer compresses your content more efficiently than the other.

Framework for tokenizer-aware model selection:

Step 1: Profile your content. Take a representative sample of your production prompts (at least 1,000 requests) and run them through each candidate model's tokenizer. Record the token count for each sample across all tokenizers. Tools for this include OpenAI's tiktoken library, Anthropic's @anthropic-ai/tokenizer, and Google's countTokens API endpoint.

Step 2: Calculate effective cost per character. This is the metric that accounts for tokenizer efficiency:

```
effective_cost_per_char = (price_per_million_tokens / 1,000,000) × (tokens / characters)

// Example for English prose:
// GPT-4o: ($2.50 / 1M) × (250 tokens / 1000 chars) = $0.000000625/char
// Claude 3.5 Sonnet: ($3.00 / 1M) × (268 tokens / 1000 chars) = $0.000000804/char
// Gemini 1.5 Pro: ($1.25 / 1M) × (280 tokens / 1000 chars) = $0.000000350/char
```

Step 3: Factor in output tokenization. Output tokens matter too. If one model's tokenizer produces more compact output (fewer tokens for equivalent semantic content), your output costs decrease. This is harder to measure because output content varies, but you can estimate by comparing the average output token count for identical prompts across models.

Step 4: Weight by content mix. If 70% of your traffic is English prose and 30% is JSON, calculate a weighted effective cost that reflects your actual content distribution. The model that is cheapest for English prose may not be cheapest for JSON-heavy workloads.
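Steps 2 and 4 combine into a short weighted-cost sketch. The prose token densities match the worked example in Step 2; the JSON densities and the 70/30 mix are illustrative assumptions:

```python
# Weighted effective cost per character for a 70% prose / 30% JSON
# content mix, per Steps 2 and 4 above. JSON token densities are
# illustrative assumptions, not measured values.
def eff_cost_per_char(price_per_m: float, tokens_per_kchar: float) -> float:
    return price_per_m / 1_000_000 * tokens_per_kchar / 1_000

mix = {"prose": 0.7, "json": 0.3}
density = {  # tokens per 1,000 characters
    "gpt-4o": {"prose": 250, "json": 315},
    "gemini": {"prose": 280, "json": 310},
}
price = {"gpt-4o": 2.50, "gemini": 1.25}

for model, d in density.items():
    weighted = sum(mix[c] * eff_cost_per_char(price[model], d[c]) for c in mix)
    print(f"{model}: ${weighted:.10f}/char")
```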

Practical recommendations by content type:

| Content Type | Most Efficient Tokenizer | Runner-Up | Avoid |
|---|---|---|---|
| English prose | GPT-4o (o200k_base) | Llama 3 (128K BPE) | Older 32K-vocab models |
| Source code (Python/JS/TS) | GPT-4o (o200k_base) | Claude (Anthropic BPE) | SentencePiece (slightly less efficient for code) |
| JSON / structured data | Gemini (SentencePiece) | GPT-4o (o200k_base) | Small-vocab BPE models |
| Chinese / Japanese / Korean | Gemini (SentencePiece) | Llama 3 (128K BPE) | GPT-3.5-era tokenizers |
| Multilingual mixed | Gemini (SentencePiece) | GPT-4o (o200k_base) | English-optimized small vocab |

The takeaway: do not just compare price-per-token across providers. Compare price-per-character (or price-per-semantic-unit) to account for tokenizer efficiency. CostHawk's provider comparison reports show effective cost per request — normalizing for tokenizer differences — so you can see which provider is genuinely cheapest for your specific workload, not just cheapest on a per-token sticker price.

FAQ

Frequently Asked Questions

What is the difference between tokenization in AI/NLP and tokenization in the crypto/security sense?
In the AI/NLP context, tokenization refers exclusively to splitting text into sub-word units for model processing. This is entirely different from security tokenization (replacing sensitive data like credit card numbers with non-sensitive placeholder tokens) or cryptocurrency tokenization (representing assets as blockchain tokens). The terminology overlap is unfortunate but the concepts are unrelated. In AI cost management, when we say tokenization we always mean the text-to-sub-word-unit conversion performed by algorithms like BPE, WordPiece, or SentencePiece. This process determines how many billable tokens your API call consumes. Security tokenization and crypto tokenization have no bearing on your AI API costs. If you encounter the term in documentation, the context — NLP versus security versus blockchain — will clarify which meaning applies. CostHawk uses the term exclusively in its NLP sense: the mechanical process that converts your prompt text into the billable units that appear on your invoice.
Can I change which tokenizer a model uses to reduce my costs?
No. The tokenizer is a fixed component of each model — it is baked into the model during training and cannot be swapped or configured at inference time. When you call the GPT-4o API, your text is always tokenized with o200k_base. When you call Claude 3.5 Sonnet, Anthropic's proprietary tokenizer is always used. You cannot bring your own tokenizer to a hosted API. However, you can influence how many tokens the tokenizer produces by optimizing your input content: minifying JSON, shortening variable names in code context, compressing whitespace, and using compact natural language. You can also choose which model (and therefore which tokenizer) to use for each request by implementing model routing. If your content tokenizes 15% more efficiently on GPT-4o than Claude, routing that content to GPT-4o saves 15% on input token costs — even if the per-token prices are similar. CostHawk's multi-provider analytics help you identify these routing opportunities.
How much do tokenizer vocabulary size differences actually matter for cost?
Vocabulary size has a measurable but moderate impact on token counts for English prose, and a larger impact for code and non-English text. Larger vocabularies (like GPT-4o's 200K tokens) capture more multi-character sequences as single tokens, compressing text more efficiently. Moving from a 32K vocabulary (older models) to a 200K vocabulary typically reduces English token counts by 15–25%. For code, the improvement can be 20–30% because larger vocabularies are more likely to include common programming constructs as single tokens. For non-Latin scripts, vocabulary size matters less than how much of that vocabulary is allocated to the target language — a 200K vocabulary trained mostly on English may still tokenize Chinese poorly. In dollar terms, if you process 50 million tokens per day on a model charging $2.50 per million input tokens, a 20% token reduction from a more efficient tokenizer saves $25/day, or $9,125/year. This is material savings that requires zero code changes — just routing to a model with a better tokenizer for your content.
Does prompt caching interact with tokenization?
Yes, prompt caching and tokenization are closely linked. Prompt caching works at the token level — the provider identifies a matching prefix of tokens from a previous request and serves the cached key-value activations instead of recomputing them. The cache match is exact: the token sequence must be identical, byte-for-byte, to the cached prefix. This means even a single character change in your system prompt invalidates the cache and forces a full recomputation (and full-price billing) of all tokens. Anthropic's prompt caching offers a 90% discount on cached input tokens (you pay only 10% of the normal rate), while OpenAI's caching provides a 50% discount. The cost savings from caching multiply with tokenization optimization: if you reduce your system prompt from 2,000 tokens to 1,200 tokens and then cache those 1,200 tokens at a 90% discount, you are paying for only 120 effective tokens instead of the original 2,000 — a 94% reduction. The key takeaway is to optimize your prompt (reducing token count) first, then enable caching on the optimized version for maximum savings.
Why does the same word sometimes produce different numbers of tokens in different contexts?
BPE tokenization is context-sensitive at the byte level. A word at the beginning of a text (no leading space) may tokenize differently than the same word in the middle of a sentence (with a leading space), because BPE merges are learned on byte sequences that include whitespace. For example, "Hello" at the start of a string might be a single token, but " Hello" (with a leading space) is typically also a single but different token — the space is merged with the word. However, if the word follows a newline, tab, or unusual punctuation, the byte context changes and the merge rules may produce a different tokenization. Additionally, capitalization matters: "the" and "The" are different byte sequences and may map to different tokens or different numbers of tokens. In practice, these contextual differences are small — usually 0–1 token per word — but they accumulate across long documents. This is why client-side token estimation functions that count words and multiply by 1.3 are only approximate. For exact counts, always use the official tokenizer library for your target model.
How do special tokens and chat formatting affect my billed token count?
Every LLM API adds special tokens that are invisible to you but still count toward your billed token total. These include message boundary tokens (marking the start and end of each message in a chat sequence), role tokens (indicating whether a message is from the system, user, or assistant), and tool/function definition tokens. For OpenAI's chat API, each message adds approximately 4 overhead tokens for role and boundary markers. If your conversation has 20 messages, that is ~80 tokens of overhead — minor for a single request but meaningful at scale. Tool definitions are more expensive: a complex function schema with multiple parameters and descriptions can consume 200–500 tokens, and these are included in every request where tools are enabled. Anthropic's Claude similarly adds per-message overhead tokens. The practical implication is that your billed token count will always be slightly higher than what you count by tokenizing just the text content of your messages. CostHawk reports the actual billed token count (from the API response's usage field), which includes all overhead tokens, giving you an accurate picture of true costs.
Is there a way to preview exactly how my text will be tokenized before sending it to the API?
Yes. Each major provider offers tools for pre-flight token inspection. For OpenAI, the tiktoken Python library lets you encode text and see the exact token boundaries: tiktoken.encoding_for_model('gpt-4o').encode('your text') returns a list of token IDs, and .decode_single_token_bytes() shows you the text each token represents. OpenAI also offers an online Tokenizer tool at platform.openai.com/tokenizer. For Anthropic, the @anthropic-ai/tokenizer npm package provides countTokens() and encoding functions. For Google Gemini, call the countTokens REST endpoint or use the Python SDK's count_tokens() method. For local preview without API calls, the tokenizers Python library by Hugging Face can load any tokenizer from the Hub and encode text locally. These tools are invaluable for debugging unexpectedly high token counts, optimizing prompt formats, and verifying that compression techniques actually reduce tokens. Build token counting into your CI/CD pipeline to catch prompt size regressions before they reach production.
How will tokenization evolve, and should I future-proof my cost estimates?
Tokenization is actively evolving in several directions. First, vocabulary sizes are increasing: GPT-4o's 200K vocabulary is double GPT-4's 100K, and future models will likely push to 500K+ tokens, further improving compression efficiency (especially for code and non-English text). Second, byte-level models like Meta's experiments with byte-latent transformers could eventually eliminate sub-word tokenization entirely, processing raw bytes and making the concept of token counts less relevant — though this is years from production readiness. Third, multimodal tokenization for images, audio, and video will become standardized, with dedicated tokenizers producing visual or audio tokens that are billed alongside text tokens. For future-proofing, do not hardcode token-to-cost ratios in your budgets. Instead, use CostHawk or similar tools that dynamically read actual token counts from API responses. Design your cost monitoring around the usage object returned by each API call rather than client-side estimates. This way, as tokenizers improve and produce fewer tokens for the same content, your cost tracking automatically reflects the savings without code changes.

Related Terms

Token

The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.


Cost Per Token

The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.


Token Pricing

The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.


Context Window

The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.


Prompt Compression

Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.


Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.



Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.