Tokens Per Second (TPS)
The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.
Why It Matters for AI Costs
TPS directly impacts three critical dimensions of AI application performance: user experience, infrastructure economics, and cost efficiency.
User experience: In streaming applications, TPS determines how fast text appears on screen. Human reading speed averages 250 words per minute (about 4.2 words per second, or roughly 5.5 tokens per second). Any TPS above ~6 tokens/second is faster than most users can read, meaning the streaming experience feels smooth and continuous. Below 6 TPS, users may perceive the generation as "stuttery" or slow, especially for shorter responses where the text trickles out word by word. The perceptual thresholds are:
| TPS Range | User Perception | UX Quality |
|---|---|---|
| > 60 TPS | Near-instantaneous; text appears in blocks | Excellent — almost indistinguishable from non-streamed |
| 30–60 TPS | Very fast streaming; smooth flow | Excellent — the standard for production chatbots |
| 15–30 TPS | Visible token-by-token generation | Good — the "ChatGPT effect" users are accustomed to |
| 6–15 TPS | Slow but readable streaming | Acceptable — user can read at generation pace |
| < 6 TPS | Frustratingly slow; words appear one at a time | Poor — users may abandon or retry |
Infrastructure economics: For self-hosted models, TPS is the primary measure of GPU utilization efficiency. A GPU running at 80 aggregate TPS is generating 80 billable tokens per second. At a hypothetical internal rate of $1.00 per 1,000 tokens, that GPU generates $0.08 of value per second, or $288 per hour. If the same GPU is poorly configured and runs at 20 aggregate TPS, it generates only $72 per hour of value from the same hardware investment. Maximizing aggregate TPS from your GPU fleet is the core objective of inference optimization.
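The arithmetic above can be expressed as a small helper (the function name and the per-1,000-token rate are illustrative, not a CostHawk API):

```typescript
// Illustrative helper: dollar value generated per hour by a GPU running at a
// given aggregate TPS, for a chosen internal rate in dollars per 1,000 tokens.
function gpuValuePerHour(aggregateTps: number, dollarsPer1kTokens: number): number {
  const dollarsPerSecond = aggregateTps * (dollarsPer1kTokens / 1000);
  return dollarsPerSecond * 3600;
}

gpuValuePerHour(80, 1.0); // ≈ $288/hour at 80 aggregate TPS
gpuValuePerHour(20, 1.0); // ≈ $72/hour from the same hardware at 20 TPS
```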
Cost efficiency: For API consumers, TPS interacts with cost through latency. Lower TPS means longer total latency for the same output length, which means longer-held connections, higher tail latencies, and increased risk of timeouts that trigger costly retries. A 500-token response at 40 TPS takes 12.5 seconds; at 80 TPS, it takes 6.25 seconds. The token cost is identical, but the latency-related operational costs (connection overhead, timeout retries, user abandonment) differ significantly. CostHawk tracks per-request TPS alongside cost metrics, enabling you to identify requests where low TPS is degrading UX or causing downstream cost impacts from retries and timeouts.
How TPS Is Calculated
TPS can be calculated at different stages and scopes, and the methodology matters for accurate measurement and meaningful comparisons.
Per-request TPS (output generation rate):
TPS = output_tokens / decode_time_seconds
Where:
output_tokens = number of tokens generated in the response
decode_time = total_latency - TTFT
(i.e., the time spent generating tokens, excluding prefill)

Example: A request with 180 ms TTFT, 2,680 ms total latency, and 200 output tokens:
decode_time = 2,680 - 180 = 2,500 ms = 2.5 seconds
TPS = 200 / 2.5 = 80 tokens/second

It is essential to subtract TTFT from total latency because the prefill phase (processing input) does not generate output tokens. Including prefill time in the denominator would artificially lower the calculated TPS, especially for requests with large inputs and short outputs.
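The same calculation as a small TypeScript helper (the function name is illustrative; in practice the output token count comes from the provider's usage object):

```typescript
// Per-request TPS: output tokens divided by decode time (total latency minus TTFT).
function perRequestTps(outputTokens: number, totalLatencyMs: number, ttftMs: number): number {
  const decodeSeconds = (totalLatencyMs - ttftMs) / 1000;
  if (decodeSeconds <= 0) {
    throw new Error("decode time must be positive; check TTFT and total latency");
  }
  return outputTokens / decodeSeconds;
}

// Worked example from above: 180 ms TTFT, 2,680 ms total latency, 200 output tokens
perRequestTps(200, 2680, 180); // 80 tokens/second
```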
Aggregate TPS (server throughput):
Aggregate TPS = sum of all concurrent per-request TPS
= total_output_tokens_generated / time_period

If an inference server is handling 10 concurrent requests, each generating at 35 TPS, the aggregate throughput is 350 TPS. This metric is critical for capacity planning: if your application needs to sustain 1,000 TPS during peak hours, you need enough GPU capacity to deliver that aggregate rate.
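A sketch of the aggregate calculation over a fixed measurement window (names are illustrative):

```typescript
// Aggregate TPS: total output tokens generated across all requests in a window,
// divided by the window length in seconds.
function aggregateTps(outputTokensPerRequest: number[], windowSeconds: number): number {
  const totalTokens = outputTokensPerRequest.reduce((sum, n) => sum + n, 0);
  return totalTokens / windowSeconds;
}

// 10 concurrent requests each generating at 35 TPS over a 10-second window
// produce 350 tokens apiece, for an aggregate of 350 TPS.
aggregateTps(Array(10).fill(350), 10); // 350
```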
Important measurement nuances:
- First-token bias: The first output token is generated as part of the prefill-to-decode transition and may have different timing than subsequent tokens. Many measurement frameworks exclude the first token from TPS calculations.
- Token batching in SSE: Providers may batch multiple tokens into a single SSE event for network efficiency. If you measure TPS by counting SSE events per second rather than tokens per second, you will undercount. Always use the actual token count from the usage object.
- End-of-generation slowdown: Some models slow down slightly near the end of a response as the probability distribution becomes more peaked. P50 TPS over the full response is more representative than TPS measured from any single time window.
- Quantization effects: Quantized models (INT8, INT4, GPTQ, AWQ) run at higher TPS than full-precision models because they require less memory bandwidth per token. However, the quality tradeoff means you cannot directly compare TPS across different quantization levels without also comparing output quality.
CostHawk calculates per-request TPS automatically by subtracting TTFT from total latency to obtain decode time, then dividing the output token count by that decode time, providing accurate TPS measurements without client-side instrumentation complexity.
TPS Benchmarks by Model and Hardware
TPS varies widely depending on the model, hardware, serving framework, and concurrency level. The following benchmarks reflect per-request TPS under typical production conditions with moderate concurrency (4–8 concurrent requests per GPU).
API-hosted models (per-request TPS as observed by clients):
| Model | Provider | P50 TPS | P95 TPS (lower bound) | Notes |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | ~110 TPS | ~65 TPS | Fastest generation among OpenAI models |
| GPT-4o | OpenAI | ~80 TPS | ~45 TPS | Consistent speed; good for streaming UX |
| Gemini 2.0 Flash | Google | ~150 TPS | ~90 TPS | Highest TPS among comparable-tier models |
| Gemini 1.5 Pro | Google | ~70 TPS | ~40 TPS | Larger context window reduces throughput |
| Claude 3.5 Haiku | Anthropic | ~100 TPS | ~60 TPS | Fast economy model |
| Claude 3.5 Sonnet | Anthropic | ~70 TPS | ~35 TPS | Balances speed and capability |
| Claude 3 Opus | Anthropic | ~35 TPS | ~18 TPS | Slower due to model size; noticeable in streaming |
Self-hosted models (per-request TPS, single A100 80GB GPU):
| Model | Quantization | Framework | Per-Request TPS (batch=1) | Aggregate TPS (batch=8) |
|---|---|---|---|---|
| Llama 3 8B | FP16 | vLLM | ~120 TPS | ~400 TPS |
| Llama 3 8B | INT4 (AWQ) | vLLM | ~180 TPS | ~600 TPS |
| Llama 3 70B | FP16 (4xA100) | vLLM | ~45 TPS | ~160 TPS |
| Llama 3 70B | INT4 (GPTQ) | vLLM | ~80 TPS | ~280 TPS |
| Mistral 7B | FP16 | TGI | ~110 TPS | ~380 TPS |
Key takeaways from benchmarks:
- Model size is the dominant factor. 7–8B parameter models consistently generate at 2–3x the TPS of 70B+ parameter models on the same hardware.
- Quantization provides 40–80% TPS improvement with minimal quality degradation for most tasks. INT4 quantization nearly doubles TPS compared to FP16 for memory-bandwidth-bound inference.
- Batching trades per-request TPS for aggregate throughput. Going from batch=1 to batch=8 typically reduces per-request TPS by 30–40% but increases aggregate TPS by 3–4x.
- API providers optimize for aggregate throughput, so per-request TPS from hosted APIs reflects the provider's batching and scheduling decisions, not the model's raw speed.
Factors That Determine TPS
TPS is constrained by a combination of hardware, model architecture, and serving configuration factors. Understanding these allows you to predict TPS for new configurations and diagnose why TPS is lower than expected.
1. GPU memory bandwidth (primary bottleneck). During autoregressive decoding, each output token requires reading the model's weights from GPU memory. This makes decode fundamentally memory-bandwidth-bound, not compute-bound. The token generation rate is approximately:
Max TPS ≈ memory_bandwidth_GB_s / model_size_GB
Example (A100 80GB, Llama 3 70B FP16):
Memory bandwidth: 2,039 GB/s
Model size: ~140 GB (70B params × 2 bytes FP16)
Theoretical max TPS: 2,039 / 140 ≈ 14.6 TPS (per batch element)
With batch=8: ~14.6 × 8 ≈ 117 aggregate TPS (theoretical)

In practice, utilization of theoretical bandwidth is 50–80%, so real-world TPS is lower. This formula explains why the H100 (3,350 GB/s bandwidth) generates tokens 50–65% faster than the A100 (2,039 GB/s) for the same model — it is the bandwidth improvement that matters, not the compute improvement.
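The roofline estimate can be sketched with the A100 numbers (the utilization factor below is an assumption for illustration, not a measured value):

```typescript
// Theoretical decode ceiling: memory bandwidth divided by bytes read per token,
// which is approximately the model size for batch=1 autoregressive decoding.
function theoreticalMaxTps(bandwidthGBs: number, modelSizeGB: number): number {
  return bandwidthGBs / modelSizeGB;
}

const ceiling = theoreticalMaxTps(2039, 140); // A100 + Llama 3 70B FP16: ~14.6 TPS
const realistic = ceiling * 0.65;             // assuming ~65% achievable bandwidth
```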
2. Model parameter count. More parameters means more bytes to read from memory per token, directly reducing TPS. A 70B parameter model at FP16 is 140 GB — roughly 10x larger than a 7B model at 14 GB. All else equal, the 7B model generates tokens ~10x faster. This is the fundamental reason smaller models have higher TPS.
3. Quantization level. INT8 quantization halves the model size (each parameter is 1 byte instead of 2), roughly doubling the effective memory bandwidth per token and increasing TPS by 40–80%. INT4 quantization quarters the model size, providing even higher TPS gains. The quality tradeoff is typically small — INT8 is nearly lossless for most models, and INT4 (with techniques like GPTQ or AWQ) maintains 95–99% of full-precision quality on standard benchmarks.
4. KV cache size. The key-value cache stores intermediate computations for all processed tokens (input + previously generated output). As the KV cache grows with sequence length, it competes with model weights for GPU memory bandwidth, gradually reducing TPS. This effect becomes noticeable at long sequence lengths (10,000+ tokens) and is one reason why generating the 500th token in a response is slightly slower than generating the 10th token.
5. Batch size (continuous batching). Modern serving frameworks like vLLM and TGI use continuous batching to process multiple requests simultaneously. Higher batch sizes improve aggregate TPS (more total tokens per second across all requests) but reduce per-request TPS because memory bandwidth is shared. The optimal batch size depends on the ratio of compute to memory bandwidth and the target per-request TPS for acceptable UX.
6. Speculative decoding. An advanced technique where a small "draft" model generates several candidate tokens quickly, and the large "target" model verifies them in a single forward pass. When the draft model's predictions are correct (typically 60–80% of the time), multiple tokens are confirmed per forward pass, increasing effective TPS by 1.5–2.5x. This technique is increasingly used by API providers to improve decode speed without changing the model itself.
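A common simplified model of the speculative decoding gain assumes an independent per-token acceptance rate p for k draft tokens per verification pass (real acceptance is correlated, so treat this as a rough estimate rather than a prediction):

```typescript
// Expected tokens confirmed per target-model forward pass: the run of accepted
// draft tokens plus the one token the target model always contributes, which
// sums to a truncated geometric series in the acceptance rate.
function expectedTokensPerPass(acceptRate: number, draftLength: number): number {
  return (1 - Math.pow(acceptRate, draftLength + 1)) / (1 - acceptRate);
}

expectedTokensPerPass(0.7, 4); // ~2.8 tokens per pass, vs exactly 1 without speculation
```

The net TPS improvement is smaller than this ratio because each pass also pays for the draft model's forward passes.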
TPS and Streaming UX Design
TPS directly shapes how users experience streaming AI responses. Designing your streaming UX around realistic TPS ranges ensures a smooth experience rather than a jarring one.
Token delivery patterns: At 80 TPS, a token arrives roughly every 12.5 ms. Modern browsers can render DOM updates at 60 FPS (every 16.7 ms), so individual tokens arrive slightly faster than the screen can update. The result is smooth, continuous text flow. At 20 TPS, a token arrives every 50 ms — clearly visible as individual word or sub-word additions, creating the characteristic "typewriter" effect familiar from ChatGPT. At 5 TPS, each token appears with a 200 ms gap, and the experience feels sluggish.
Buffering strategy: Rather than rendering each token as it arrives (which can cause layout thrash at high TPS), buffer tokens and flush to the DOM on an animation frame schedule:
// Buffer tokens and render at 60fps
let tokenBuffer = ""
let rafId: number | null = null
function onToken(token: string) {
tokenBuffer += token
if (!rafId) {
rafId = requestAnimationFrame(() => {
appendToDOM(tokenBuffer)
tokenBuffer = ""
rafId = null
})
}
}
// This naturally batches tokens arriving faster than 16.7ms
// At 80 TPS, ~1-2 tokens per frame
// At 200 TPS, ~3-4 tokens per frame

Markdown rendering considerations: Many AI responses include Markdown formatting (headers, lists, code blocks, tables). Rendering Markdown incrementally as tokens stream in is challenging because incomplete Markdown can produce broken formatting. Two approaches: (1) buffer until a natural break point (paragraph end, code fence close) before rendering, which slightly delays display but ensures clean formatting; (2) use a streaming-aware Markdown parser that gracefully handles incomplete syntax and re-renders as more tokens arrive. Libraries like react-markdown with a streaming wrapper handle this pattern.
Code block streaming: Code responses require syntax highlighting, which is expensive to re-run on every token. Buffer code block content and apply syntax highlighting only when the code block is complete (after the closing ```) or at throttled intervals during streaming. Highlighting every token wastes CPU and causes visual flicker.
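One way to implement the throttled-interval approach is a simple throttle wrapper (a sketch; `rehighlight` stands in for whatever highlighter you use):

```typescript
// Wrap an expensive operation so it runs at most once per interval.
// Calls arriving inside the interval are dropped, not queued.
function throttle(fn: () => void, intervalMs: number): () => void {
  let lastRun = 0;
  return () => {
    const now = Date.now();
    if (now - lastRun >= intervalMs) {
      lastRun = now;
      fn();
    }
  };
}

// Re-highlight the streaming code block at most every 250 ms:
// const rehighlightThrottled = throttle(rehighlight, 250);
// Call rehighlightThrottled() on each token; run rehighlight() once more
// when the closing ``` arrives so the final render is complete.
```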
Perceived speed tricks:
- Skeleton loading: Show a pulsing skeleton or typing indicator during TTFT to set expectations that a response is coming.
- Progressive disclosure: For long responses, consider rendering the first paragraph immediately and loading the rest as the user scrolls, making the response feel complete sooner.
- Token speed indicator: Some AI interfaces display the generation speed (e.g., "78 tok/s") as a subtle indicator. This sets expectations and serves as a quality signal — fast generation suggests the system is healthy.
CostHawk captures per-request TPS that you can use to segment your UX analytics. Correlate TPS with user engagement metrics (message rate, session duration, abandonment) to empirically determine the minimum acceptable TPS for your application.
Optimizing TPS
Optimization strategies differ depending on whether you are an API consumer (optimizing for per-request TPS from a hosted provider) or self-hosting (optimizing aggregate TPS from your GPU fleet).
For API consumers:
1. Choose faster models. Model selection is the primary lever. If streaming UX is critical and the task does not require frontier capability, use economy models (GPT-4o mini at ~110 TPS, Gemini Flash at ~150 TPS, Claude Haiku at ~100 TPS) instead of flagship models (GPT-4o at ~80 TPS, Claude Sonnet at ~70 TPS). The TPS difference is noticeable to users.
2. Use shorter max_tokens. Some providers may allocate fewer resources to requests with very high max_tokens values because they must reserve KV cache memory for the full potential output length. Setting max_tokens to a realistic ceiling (e.g., 1,000 instead of 4,096) can improve resource allocation and TPS in some cases.
3. Reduce input length. Shorter inputs mean a smaller KV cache during generation, which can slightly improve TPS because less memory bandwidth is consumed reading the cache during each decode step. The effect is small (2–10%) but compounds with the TTFT improvement from shorter inputs.
4. Use provisioned throughput. Providers offer reserved capacity tiers that guarantee consistent TPS without degradation from shared-fleet contention. OpenAI's provisioned throughput, Anthropic's reserved capacity, and Google's provisioned APIs all offer more consistent per-request TPS, especially during peak hours when shared-tier TPS can drop 30–50%.
For self-hosted models:
5. Quantize aggressively. INT4 quantization (GPTQ, AWQ, or GGUF) nearly doubles TPS for memory-bandwidth-bound inference with minimal quality loss. For most production use cases, INT4 or INT8 quantization is the highest-ROI optimization. Benchmark quality on your specific tasks to ensure acceptable output.
6. Use optimized serving frameworks. vLLM, TensorRT-LLM, and Text Generation Inference (TGI) implement PagedAttention, continuous batching, and KV cache management that dramatically improve both per-request and aggregate TPS compared to naive implementations. vLLM's PagedAttention alone can improve aggregate throughput by 2–4x over static batching.
7. Enable speculative decoding. Use a small draft model (e.g., a 1B parameter model) to speculatively generate candidate tokens, then verify with the target model. This can improve effective TPS by 1.5–2.5x when the draft model has a high acceptance rate for your domain.
8. Upgrade GPU hardware. The H100 provides 65% more memory bandwidth than the A100, translating directly to higher TPS. For high-volume workloads, the H100's higher per-unit cost is offset by the proportionally higher throughput. Newer GPUs like the H200 (with HBM3e) push bandwidth even further.
9. Tune batch size. Profile the tradeoff between per-request TPS and aggregate TPS for your specific model and hardware. If your SLA requires P95 per-request TPS above 40, you may need to limit batch size even though higher batching would improve aggregate throughput.
Monitoring TPS in Production
TPS monitoring complements TTFT and total latency monitoring to provide a complete picture of inference performance. While TTFT measures how quickly a response starts, TPS measures how quickly it flows — both are needed to fully characterize the streaming experience.
Key metrics to track:
- P50 per-request TPS: The typical generation speed your users experience. Track this by model to identify which models are delivering the best streaming UX.
- P5 per-request TPS (lower tail): The slowest 5% of requests. This is where users experience frustration. A P5 TPS below 10 indicates some requests are generating painfully slowly.
- Aggregate TPS (self-hosted): Total output tokens per second across your GPU fleet. This is the capacity metric that drives scaling decisions.
- TPS over time: Track TPS trends to detect gradual degradation. A slow decline in P50 TPS over weeks may indicate increasing KV cache sizes (from growing conversation histories), provider-side degradation, or infrastructure issues.
- TPS by input token count: Correlate TPS with input size to quantify the KV cache impact. If TPS drops 30% when input exceeds 8,000 tokens, you have a concrete argument for keeping prompts under that threshold.
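These percentiles can be computed with a simple nearest-rank sketch (production systems usually maintain histograms or sketches like t-digest instead of sorting raw samples):

```typescript
// Nearest-rank percentile over a window of per-request TPS samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const tpsWindow = [72, 68, 81, 9, 75, 70, 66, 79, 74, 12];
percentile(tpsWindow, 50); // P50: typical speed (70 here)
percentile(tpsWindow, 5);  // P5: the slow tail (9 here, below the 10 TPS threshold)
```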
Alerting thresholds:
| Alert | Condition | Suggested Action |
|---|---|---|
| Warning | P50 TPS drops below 80% of 7-day average | Investigate provider status, check for prompt size regressions |
| Critical | P5 TPS drops below 10 TPS for 10+ minutes | Activate faster model failover for latency-sensitive endpoints |
| Capacity (self-hosted) | Aggregate TPS exceeds 80% of theoretical maximum | Scale out GPU fleet or enable request queuing |
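The warning and critical rules from the table can be sketched as follows (thresholds copied from the table; the bookkeeping that tracks how long P5 has been depressed is omitted):

```typescript
type TpsAlert = "ok" | "warning" | "critical";

// p50/p5 are current per-request TPS percentiles; minutesP5Below10 is how long
// P5 has been under 10 TPS (duration tracking itself is left out of this sketch).
function classifyTps(
  p50: number,
  p5: number,
  sevenDayAvgP50: number,
  minutesP5Below10: number
): TpsAlert {
  if (p5 < 10 && minutesP5Below10 >= 10) return "critical";
  if (p50 < 0.8 * sevenDayAvgP50) return "warning";
  return "ok";
}

classifyTps(72, 45, 75, 0);  // "ok"
classifyTps(55, 30, 75, 0);  // "warning": 55 is below 80% of the 7-day average
classifyTps(70, 8, 75, 12);  // "critical": P5 under 10 TPS for 10+ minutes
```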
TPS and cost correlation: TPS does not directly affect per-token cost (you pay the same per token regardless of generation speed), but it has indirect cost impacts. Low TPS increases the probability of client-side timeouts, which trigger retries that double cost. Low TPS also increases connection hold time, which can exhaust connection pools and cause cascading failures in high-throughput applications. Track the correlation between TPS percentiles and retry rates to quantify this hidden cost.
Dashboard design: A TPS monitoring dashboard should include: (1) time-series of P50 and P5 per-request TPS by model, (2) histogram of TPS distribution to show the shape of the speed experience, (3) aggregate TPS gauge for self-hosted infrastructure showing utilization against capacity, (4) scatter plot of TPS vs input token count to visualize the KV cache impact curve.
CostHawk computes per-request TPS automatically from TTFT and total latency measurements, providing all of these dashboards without requiring any client-side TPS instrumentation. For self-hosted infrastructure, CostHawk integrates with vLLM and TGI metrics exporters to capture aggregate throughput data alongside per-request metrics.
Frequently Asked Questions
What is a good TPS for a production AI application?
Why does TPS vary so much between requests to the same model?
How does TPS relate to cost?
What is the difference between TPS and throughput?
How does quantization affect TPS?
Can TPS decrease during a single response generation?
How do I improve TPS when using an API provider?
What is the relationship between TPS, TTFT, and total latency?
Total Latency = TTFT + (Output Tokens / TPS × 1000) (in milliseconds). TTFT captures the time before generation begins (network transit + queue wait + prefill). TPS determines how fast output tokens are generated once generation starts. Total latency is the sum of both phases. For example: TTFT of 400 ms, 300 output tokens at 60 TPS → decode time = 300/60 = 5 seconds = 5,000 ms → total latency = 400 + 5,000 = 5,400 ms.

This relationship reveals which factor dominates latency for different request profiles. Short outputs (under 100 tokens): TTFT dominates — a 400 ms TTFT with 50 tokens at 60 TPS gives 400 + 833 = 1,233 ms total, with TTFT being 32% of total latency. Long outputs (500+ tokens): TPS dominates — a 400 ms TTFT with 500 tokens at 60 TPS gives 400 + 8,333 = 8,733 ms total, with TTFT being only 4.6%.

This means TTFT optimization has the most impact for short-response applications (classification, extraction, Q&A), while TPS optimization matters more for long-response applications (content generation, code generation, analysis). CostHawk decomposes total latency into TTFT and decode time on every request, making it easy to identify which factor to optimize for each endpoint.

Related Terms
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.