Max Tokens
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Definition
What is Max Tokens?
max_tokens (also called max_output_tokens or max_completion_tokens depending on the provider) is an API parameter that sets an upper bound on how many tokens the model can generate in its response. It acts as a ceiling — the model will stop generating once it reaches this limit, even if the response is incomplete. Setting max_tokens=500 means the model will produce at most 500 output tokens, regardless of how much more it could say. Since output tokens are the most expensive component of an API call (typically 2–5x the cost of input tokens), this parameter is one of the most direct cost controls available to developers.
For example, if you are using Claude 3.5 Sonnet at $15.00 per million output tokens, the difference between max_tokens=4096 (the common default) and max_tokens=500 is the difference between a maximum output cost of $0.0614 and $0.0075 per request — an 88% reduction in worst-case output cost. In practice, the model often generates fewer tokens than the maximum, but setting a lower ceiling prevents expensive outlier responses and makes your cost distribution tighter and more predictable.
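The worst-case arithmetic above is easy to encode. Here is a minimal sketch (the helper name maxOutputCost and the hard-coded rates are illustrative, not part of any SDK):

```typescript
// Worst-case output cost for a single request: max_tokens × output price.
// Provider prices are quoted per million tokens (MTok).
function maxOutputCost(maxTokens: number, pricePerMTok: number): number {
  return (maxTokens * pricePerMTok) / 1_000_000;
}

const defaultCeiling = maxOutputCost(4096, 15); // ≈ $0.0614
const tunedCeiling = maxOutputCost(500, 15); // $0.0075
const reduction = 1 - tunedCeiling / defaultCeiling; // ≈ 0.88, an 88% cut
```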
Impact
Why It Matters for AI Costs
Most developers leave max_tokens at its default value or set it to the model's maximum, treating it as a "set and forget" parameter. This is one of the most common and easily fixable sources of AI cost waste. A classification endpoint that needs a one-word answer ("positive" or "negative") should not allow 4,096 output tokens. A summarization endpoint producing 2-sentence summaries should not allow 8,192 tokens. By right-sizing max_tokens for each use case, teams typically reduce output token costs by 30–70% with zero impact on quality. CostHawk's per-endpoint cost analytics help you identify which endpoints have the largest gap between actual output length and max_tokens setting.
What is the max_tokens Parameter?
The max_tokens parameter is included in the request body of every LLM API call. It tells the model the maximum number of tokens it is allowed to generate in its response. The parameter name varies slightly by provider:
- OpenAI: max_completion_tokens (newer endpoints) or max_tokens (legacy). Controls the maximum output length for chat completions.
- Anthropic: max_tokens. Required parameter — Anthropic does not set a default, so you must always specify it.
- Google (Gemini): maxOutputTokens. Optional, defaults to the model's maximum if not set.
- Mistral: max_tokens. Optional, defaults vary by model.
When the model reaches the max_tokens limit, it stops generating immediately. The response will include a finish_reason (or stop_reason) of "length" instead of "stop", indicating the response was truncated rather than naturally completed. This is an important signal — a high rate of length finish reasons means your max_tokens setting is too low and responses are being cut off, potentially degrading quality.
Here is a basic example of setting max_tokens in an API call:
// OpenAI
import OpenAI from "openai";

const openai = new OpenAI();
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Classify this review as positive or negative: ..." }],
  max_completion_tokens: 50,
});

// Anthropic
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();
const response = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 50,
  messages: [{ role: "user", content: "Classify this review as positive or negative: ..." }],
});

How max_tokens Affects Cost
The max_tokens parameter affects cost by capping the maximum possible output token count per request. You are billed for the actual tokens generated, not the max_tokens value — but setting a lower ceiling prevents expensive outlier responses. Here are concrete examples showing the impact:
| Use Case | max_tokens Setting | Avg Actual Output | Max Output Cost (Claude 3.5 Sonnet @ $15/MTok) | Avg Output Cost |
|---|---|---|---|---|
| Classification | 50 | 5 tokens | $0.00075 | $0.000075 |
| Classification (untuned) | 4,096 | 45 tokens | $0.0614 | $0.000675 |
| Summarization | 500 | 280 tokens | $0.0075 | $0.0042 |
| Summarization (untuned) | 4,096 | 650 tokens | $0.0614 | $0.00975 |
| Code generation | 2,000 | 800 tokens | $0.030 | $0.012 |
| Code generation (untuned) | 8,192 | 1,400 tokens | $0.1229 | $0.021 |
Notice two effects: (1) the max cost per request drops dramatically with a lower setting, making your worst-case scenarios much cheaper, and (2) the average output also tends to decrease, because the ceiling truncates the long tail of unusually verbose responses that would otherwise inflate the mean. The model itself does not see the max_tokens value — the limit is enforced server-side — so pair a lower ceiling with explicit length instructions in the prompt. Together, these changes often reduce actual output by 20–40%, not just the theoretical maximum.
For a team processing 100,000 classification requests per day, the difference between 45 average output tokens (untuned) and 5 tokens (max_tokens=50) saves: (40 tokens × 100,000 requests × $15/MTok) = $60/day or $1,800/month — just from one parameter change on one endpoint.
Provider Defaults and Behavior
Each provider handles max_tokens differently in terms of defaults, maximums, and behavior when the limit is reached:
| Provider | Parameter Name | Default | Maximum Value | Required? | Truncation Signal |
|---|---|---|---|---|---|
| OpenAI (GPT-4o) | max_completion_tokens | 16,384 | 16,384 | No | finish_reason: "length" |
| OpenAI (GPT-4o-mini) | max_completion_tokens | 16,384 | 16,384 | No | finish_reason: "length" |
| Anthropic (Claude 3.5 Sonnet) | max_tokens | None (required) | 8,192 | Yes | stop_reason: "max_tokens" |
| Anthropic (Claude 3.5 Haiku) | max_tokens | None (required) | 8,192 | Yes | stop_reason: "max_tokens" |
| Google (Gemini 1.5 Pro) | maxOutputTokens | 8,192 | 8,192 | No | finishReason: "MAX_TOKENS" |
| Google (Gemini 1.5 Flash) | maxOutputTokens | 8,192 | 8,192 | No | finishReason: "MAX_TOKENS" |
| Mistral (Large) | max_tokens | No limit (model max) | 32,768 | No | finish_reason: "length" |
Key observations:
- Anthropic requires max_tokens — this is actually a good design decision because it forces developers to think about output length. You cannot accidentally leave it at a high default.
- OpenAI defaults to 16,384 — generous but expensive. If your use case only needs 200 tokens of output, you are leaving room for 16,184 unnecessary tokens in worst-case scenarios.
- The maximum value limits model capability — models cannot generate more than their maximum regardless of what you set. A request with max_tokens=100000 is either clamped to the model's actual limit or rejected with a validation error, depending on the provider.
Setting max_tokens by Use Case
The optimal max_tokens value depends on the task. Setting it too low truncates useful output; setting it too high wastes money on verbose responses and allows worst-case cost outliers. Here are recommended settings based on common AI application patterns:
| Use Case | Recommended max_tokens | Typical Actual Output | Notes |
|---|---|---|---|
| Binary classification (yes/no, positive/negative) | 10–20 | 1–5 tokens | Use with structured output / JSON mode for consistent formatting |
| Multi-label classification | 50–100 | 10–40 tokens | Account for JSON array formatting overhead |
| Entity extraction (NER) | 200–500 | 50–300 tokens | Scales with document length and entity count |
| Short summarization (1–3 sentences) | 200–300 | 60–150 tokens | Explicitly state desired length in the prompt |
| Long summarization (paragraph+) | 500–1,000 | 200–600 tokens | Consider chunked summarization for very long documents |
| Question answering | 300–500 | 50–300 tokens | Highly variable; monitor finish_reason for truncation |
| Code generation (single function) | 1,000–2,000 | 200–1,000 tokens | Code is token-dense; underestimating causes broken output |
| Code generation (full file) | 4,000–8,000 | 1,000–4,000 tokens | Expensive; consider generating in sections |
| Creative writing (short) | 500–1,000 | 300–800 tokens | Models will fill available space; lower limits produce tighter prose |
| Creative writing (long-form) | 4,000–8,000 | 2,000–6,000 tokens | May need continuation for novel-length content |
| Data transformation / reformatting | 1.5x input length | ~1x input length | Output scales with input; set dynamically |
A best practice is to measure your actual output token distribution for each endpoint over a one-week baseline period, then set max_tokens to the 99th percentile of actual output length plus a 20% buffer. This captures virtually all legitimate responses while capping the expensive outliers. CostHawk's token analytics dashboard shows output length distributions per endpoint, making this analysis straightforward.
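That calibration step can be sketched as a small helper, assuming you have already collected a sample of actual output lengths per endpoint (the function name and percentile math are illustrative):

```typescript
// Recommend a max_tokens ceiling: the 99th percentile of observed output
// lengths plus a 20% buffer, per the guideline above.
function recommendMaxTokens(outputLengths: number[], buffer = 0.2): number {
  const sorted = [...outputLengths].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.99 * sorted.length) - 1);
  return Math.ceil(sorted[idx] * (1 + buffer));
}
```

Re-run the calibration periodically — prompt changes shift output distributions, and a ceiling set six months ago may be too tight or too loose today.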
The Ceiling vs Target Misconception
A common misconception is that max_tokens acts as a target — that setting max_tokens=500 tells the model to generate approximately 500 tokens. This is incorrect. max_tokens is a ceiling, not a target. The model generates tokens until it produces a natural stop token (end of message) or hits the max_tokens limit, whichever comes first.
However, there is a subtle statistical effect: average response length does tend to drop when max_tokens is lowered. This is not because the model "aims" for the limit — the parameter is enforced server-side and is not visible to the model during generation. The effect comes from truncation of the long tail: the occasional response that would have run to thousands of tokens is cut at the ceiling, pulling the average down. If you want the model to actually write shorter responses rather than have long ones cut off, the length guidance belongs in the prompt.
To control output length more precisely, use prompt instructions rather than relying on max_tokens alone:
// Instead of just setting max_tokens low:
max_tokens: 100

// Combine with explicit instructions:
max_tokens: 150 // Ceiling with buffer
prompt: "Summarize this article in exactly 2 sentences. Be concise."

// For structured output, specify format:
max_tokens: 50
prompt: "Classify the sentiment. Respond with a JSON object: {\"sentiment\": \"positive\" | \"negative\", \"confidence\": 0.0-1.0}"

The ideal approach combines both: use prompt instructions to guide the model toward your desired length, and use max_tokens as a safety net to catch outlier responses. If you only use max_tokens, you risk truncation (cutting off a response mid-sentence). If you only use prompt instructions, you risk the occasional 3,000-token response to a question that should take 200 tokens.
max_tokens and Context Window Interaction
The max_tokens parameter and the model's context window are related but separate constraints. The context window is the total token capacity — input plus output combined. The max_tokens parameter carves out a reservation from that total for the output. This interaction has important implications:
The fundamental equation:
Input Tokens + max_tokens ≤ Context Window Size

If your input is 120,000 tokens and you set max_tokens=16,384 on a model with a 128K context window, you are requesting 136,384 tokens — which exceeds the limit. The API will return an error. You need to either reduce your input or lower max_tokens.
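A defensive check before each call follows directly from that equation, assuming you can estimate input size with your provider's tokenizer (the helper below is a sketch; token counting itself is out of scope):

```typescript
// Clamp the requested max_tokens so input + output never exceeds the window.
// Returns 0 when the input alone fills or overflows the context window.
function clampMaxTokens(
  requestedMaxTokens: number,
  inputTokens: number,
  contextWindow: number
): number {
  return Math.max(0, Math.min(requestedMaxTokens, contextWindow - inputTokens));
}

// With a 128,000-token window and 120,000 input tokens, a requested
// max_tokens of 16,384 is clamped to 8,000.
```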
Practical implications:
- Long-context applications: If you are using RAG with large context chunks (50K+ tokens of retrieved documents), you must account for this when setting max_tokens. A 128K context window with 100K tokens of input leaves only 28K tokens for output — setting max_tokens higher than 28K will cause errors.
- Conversation history growth: In multi-turn conversations, each turn adds to the input. A conversation that starts with 2,000 input tokens might grow to 20,000 by turn 10. If max_tokens is set high (say 8,192), you will eventually hit the context window limit. Implement conversation history management (summarization, sliding window) to prevent this.
- Cost compounding: Here is the critical cost insight — every output token from the current turn becomes an input token in the next turn (if you include conversation history). So a max_tokens setting of 4,096 does not just cost you output tokens now; it costs you input tokens on every subsequent turn. In a 10-turn conversation, one verbose 4,000-token response adds 4,000 × 9 = 36,000 extra input tokens across the remaining turns. At $3.00/MTok input (Claude 3.5 Sonnet), that is $0.108 in input costs alone — from a single verbose response.
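The compounding arithmetic above generalizes to a one-line helper (names are illustrative; this assumes you replay the full history on every turn):

```typescript
// Extra input cost from replaying one response of `responseTokens` as
// conversation history on each of the remaining turns.
function compoundedInputCost(
  responseTokens: number,
  remainingTurns: number,
  inputPricePerMTok: number
): number {
  return (responseTokens * remainingTurns * inputPricePerMTok) / 1_000_000;
}

// 4,000-token response × 9 remaining turns at $3.00/MTok input → $0.108
```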
CostHawk tracks per-turn token counts and flags conversations where context window utilization exceeds 80%, helping you identify where history management or max_tokens tuning can reduce costs.
FAQ
Frequently Asked Questions
What happens if I set max_tokens too low?
If max_tokens is set lower than what the model needs to complete its response, the output will be truncated mid-sentence or mid-thought. The API response will include a finish_reason of "length" (OpenAI) or stop_reason of "max_tokens" (Anthropic) instead of the normal "stop" / "end_turn" value. Truncated responses can cause problems downstream — a JSON response cut off mid-object will fail to parse, a code snippet missing its closing brackets will not compile, and a summary missing its conclusion will confuse users. To find the right balance, monitor your truncation rate (percentage of responses with finish_reason="length"). A healthy truncation rate is under 1%. If it exceeds 5%, your max_tokens is too aggressive and you should increase it. CostHawk tracks finish_reason distributions per endpoint so you can spot truncation issues before they affect users.
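A monitoring sketch for that truncation-rate check (the function names are hypothetical; the thresholds mirror the guidance above):

```typescript
// Fraction of responses that were cut off at the max_tokens ceiling.
// "length" (OpenAI) and "max_tokens" (Anthropic) both indicate truncation.
function truncationRate(finishReasons: string[]): number {
  if (finishReasons.length === 0) return 0;
  const truncated = finishReasons.filter(
    (r) => r === "length" || r === "max_tokens"
  ).length;
  return truncated / finishReasons.length;
}

// Under 1% is healthy; over 5% means the ceiling is too aggressive.
function truncationStatus(rate: number): "healthy" | "watch" | "too aggressive" {
  if (rate < 0.01) return "healthy";
  return rate > 0.05 ? "too aggressive" : "watch";
}
```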
Does setting a lower max_tokens make the API call faster?
Yes, but only if the model would have generated more tokens than your new limit. The model generates tokens sequentially — each token takes roughly the same amount of time (typically 10–40 milliseconds per token depending on the model and provider). If you set max_tokens=500 and the model would have generated 2,000 tokens, you save the generation time for 1,500 tokens, which is approximately 15–60 seconds. However, if the model naturally stops at 300 tokens, setting max_tokens=500 versus max_tokens=4096 makes no difference in latency because the model stopped before hitting either limit. The latency benefit is most noticeable for open-ended generation tasks (creative writing, long explanations) where the model tends to use all available output space. For structured tasks (classification, JSON extraction), the model typically stops well short of any reasonable limit, so the latency benefit is minimal.
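The back-of-envelope latency saving can be expressed as follows (per-token generation time varies by model and load; the 10–40 ms range above is a rough figure, not a guarantee):

```typescript
// Seconds of generation time avoided when the ceiling cuts a response short.
function latencySavedSeconds(tokensAvoided: number, msPerToken: number): number {
  return (tokensAvoided * msPerToken) / 1000;
}

// Avoiding 1,500 tokens at 10–40 ms per token saves roughly 15–60 seconds.
```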
Should I set max_tokens dynamically based on the request?
Yes — dynamic max_tokens is a best practice for applications that handle diverse request types. A customer support bot might need 50 tokens for a "yes/no" answer but 1,500 tokens for a detailed troubleshooting guide. Setting a static max_tokens of 1,500 for all requests wastes money on the simple ones and potentially truncates the complex ones if you set it too low. Implement a routing layer that estimates the required output length based on request characteristics: question type, expected response format, and historical output lengths for similar queries. A simple approach is to maintain a lookup table of max_tokens values keyed by intent or endpoint. A more sophisticated approach uses a lightweight classifier to predict output length category (short/medium/long) and sets max_tokens accordingly. CostHawk's per-request token analytics provide the historical data you need to calibrate these dynamic settings.
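The lookup-table approach can be as simple as this sketch (the intent names and values are illustrative — calibrate them from your own output distributions):

```typescript
// Per-intent max_tokens ceilings, with a conservative fallback for
// intents that have not been calibrated yet.
const MAX_TOKENS_BY_INTENT: Record<string, number> = {
  yes_no: 20,
  classification: 50,
  short_summary: 300,
  troubleshooting: 1500,
};

function maxTokensFor(intent: string, fallback = 500): number {
  return MAX_TOKENS_BY_INTENT[intent] ?? fallback;
}
```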
What is the difference between max_tokens and max_completion_tokens?
They serve the same purpose but reflect naming differences across API versions and providers. OpenAI's newer chat completion API uses max_completion_tokens, while the legacy completion API used max_tokens. Anthropic's Messages API uses max_tokens. Google's Gemini API uses maxOutputTokens. Functionally, they all do the same thing: cap the maximum number of tokens in the model's generated response. The naming inconsistency is a source of confusion when switching between providers. If you use an abstraction layer (LangChain, LiteLLM, Vercel AI SDK), the library typically normalizes the parameter name for you. When working directly with provider SDKs, always check the current API documentation — using the wrong parameter name will either cause an error or be silently ignored, defaulting to the model's maximum output length and potentially inflating your costs.
How does max_tokens interact with streaming responses?
When using streaming (Server-Sent Events), max_tokens works identically to non-streaming requests — it caps the total number of output tokens. The difference is in how truncation manifests. In a non-streaming request, you receive the full response at once and can check finish_reason before displaying it. In a streaming request, tokens arrive incrementally, and the user sees the response building in real time. If the response hits the max_tokens limit, the stream ends abruptly with a final event containing finish_reason="length". The user sees the response stop mid-sentence, which is a poor experience. For streaming use cases, set max_tokens 20–30% higher than your expected output to avoid visible truncation, and handle the "length" finish_reason gracefully in your UI (e.g., showing a "response truncated" indicator). The cost implication is the same as non-streaming — you pay for exactly the tokens generated regardless of delivery method.
Can I use max_tokens to control costs on a per-request basis?
Absolutely — max_tokens is the most direct per-request cost control available. Unlike budget limits (which operate at the account or key level over time), max_tokens caps the cost of each individual API call. The maximum output cost for any request is: max_tokens × output_price_per_token. For Claude 3.5 Sonnet at $15/MTok output, setting max_tokens=500 means the maximum output cost per request is $0.0075, regardless of what the model tries to generate. Combined with input cost (which you can estimate from your prompt size), this gives you a tight cost ceiling per request. For budgeting purposes, your maximum daily output cost is: max_tokens × output_price × daily_request_count. With max_tokens=500, $15/MTok output rate, and 50,000 daily requests, your worst-case daily output cost is $375. Without max_tokens limits, the same 50,000 requests with max output (8,192 tokens each) could theoretically cost $6,144/day.
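That budgeting formula in code (names are illustrative):

```typescript
// Worst-case daily output spend: every request hits the ceiling.
function worstCaseDailyOutputCost(
  maxTokens: number,
  pricePerMTok: number,
  dailyRequests: number
): number {
  return (maxTokens * pricePerMTok * dailyRequests) / 1_000_000;
}

// 500-token ceiling at $15/MTok across 50,000 requests/day → $375/day;
// an 8,192-token ceiling on the same traffic → $6,144/day.
```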
What max_tokens value should I use for JSON / structured output?
For structured JSON output, calculate the expected token count based on your schema's field count and value sizes, then add a 50–100% buffer. A simple schema like {"sentiment": "positive", "confidence": 0.95} requires roughly 15–20 tokens. A more complex schema with 10 fields and array values might need 100–300 tokens. The key insight is that JSON formatting tokens (braces, colons, quotes, commas) add 30–50% overhead versus plain text. Here is a rough formula: estimated_tokens = (number_of_fields × 8) + (total_characters_in_values / 4) + 20 (for structure). If you use OpenAI's structured output mode or Anthropic's tool_use for JSON generation, the model is more token-efficient because it follows the schema precisely without verbose explanations. Always test with real data to calibrate — generate 100 sample outputs, measure the 99th percentile token count, and set max_tokens to that value plus 30%. CostHawk's output token histograms per endpoint make this calibration easy to maintain over time.
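That estimation formula, expressed as a hypothetical helper:

```typescript
// Rough output-token estimate for a JSON response, per the formula above:
// fields × 8 + value characters / 4 + 20 tokens of structural overhead.
function estimateJsonTokens(fieldCount: number, valueChars: number): number {
  return Math.ceil(fieldCount * 8 + valueChars / 4 + 20);
}

// A two-field sentiment object with ~20 characters of values comes out
// around 41 tokens; add a 50–100% buffer when setting max_tokens.
```

Treat the output as a starting point only — always calibrate against measured token counts from real responses.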
Does max_tokens affect input token billing?
No. The max_tokens parameter only affects the output side of billing. You are always charged for the full input token count regardless of your max_tokens setting. Even if you set max_tokens=1 (requesting a single output token), you still pay for every token in your prompt, system message, conversation history, and tool definitions. This means max_tokens is only effective at controlling costs when output tokens represent a significant portion of your total token cost. For applications with very long prompts and short outputs (e.g., RAG with large context retrieval), optimizing input tokens through better retrieval, prompt compression, and context management will have a larger cost impact than tuning max_tokens. For applications with short prompts and long outputs (e.g., content generation, code writing), max_tokens is your primary cost lever. CostHawk's input/output cost breakdown per endpoint shows you which lever matters more for each part of your application.
How do I handle the max_tokens limit gracefully in production?
Implement a three-layer defense: (1) Set max_tokens appropriately for each endpoint based on measured output distributions. (2) Check the finish_reason / stop_reason field in every API response. If it indicates truncation ("length" or "max_tokens"), decide whether to retry with a higher limit, return a partial result with a truncation indicator, or request a continuation. (3) Monitor your truncation rate and alert when it exceeds your threshold (we recommend alerting at 2% and investigating at 5%). For retry logic, increase max_tokens by 50–100% on the retry, but set a hard maximum to prevent infinite retries on pathologically long responses. For continuation logic, send the truncated response back as context and ask the model to continue from where it left off — but be aware this doubles the input tokens on the retry. CostHawk tracks finish_reason distributions and can alert you when truncation rates spike, which often indicates a prompt change that increased typical output length.
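The retry layer of that defense can be sketched as follows (`callModel` is a placeholder for your actual provider call; only the control flow is the point here):

```typescript
type GenResult = { text: string; finishReason: string };

// Call once; if the response hit the ceiling, retry once with a 100%
// larger limit, never exceeding hardMax (prevents unbounded retries).
async function generateWithRetry(
  callModel: (maxTokens: number) => Promise<GenResult>,
  maxTokens: number,
  hardMax: number
): Promise<GenResult> {
  const first = await callModel(maxTokens);
  const truncated =
    first.finishReason === "length" || first.finishReason === "max_tokens";
  if (!truncated || maxTokens >= hardMax) return first;
  return callModel(Math.min(maxTokens * 2, hardMax));
}
```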
Related Terms
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens cost 3-5x more than input tokens across all major providers.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Token Budget
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
