Tokens Per Second (TPS)
The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.
Why It Matters for AI Costs
TPS directly impacts three critical dimensions of AI application performance: user experience, infrastructure economics, and cost efficiency.
User experience: In streaming applications, TPS determines how fast text appears on screen. Human reading speed averages 250 words per minute (about 4.2 words per second, or roughly 5.5 tokens per second). Any TPS above ~6 tokens/second is faster than most users can read, meaning the streaming experience feels smooth and continuous. Below 6 TPS, users may perceive the generation as "stuttery" or slow, especially for shorter responses where the text trickles out word by word. The perceptual thresholds are:
| TPS Range | User Perception | UX Quality |
|---|---|---|
| > 60 TPS | Near-instantaneous; text appears in blocks | Excellent — almost indistinguishable from non-streamed |
| 30–60 TPS | Very fast streaming; smooth flow | Excellent — the standard for production chatbots |
| 15–30 TPS | Visible token-by-token generation | Good — the "ChatGPT effect" users are accustomed to |
| 6–15 TPS | Slow but readable streaming | Acceptable — user can read at generation pace |
| < 6 TPS | Frustratingly slow; words appear one at a time | Poor — users may abandon or retry |
Infrastructure economics: For self-hosted models, TPS is the primary measure of GPU utilization efficiency. A GPU running at 80 aggregate TPS is generating 80 billable tokens per second. At a hypothetical internal rate of $1.00 per 1,000 tokens, that GPU generates $0.08 of value per second, or $288 per hour. If the same GPU is poorly configured and runs at 20 aggregate TPS, it generates only $72 per hour of value from the same hardware investment. Maximizing aggregate TPS from your GPU fleet is the core objective of inference optimization.
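The arithmetic above can be expressed as a small helper (the function name and the per-1,000-token rate are illustrative, not a CostHawk API):

```typescript
// Illustrative helper: dollar value generated per hour by a GPU running at a
// given aggregate TPS, for a chosen internal rate in dollars per 1,000 tokens.
function gpuValuePerHour(aggregateTps: number, dollarsPer1kTokens: number): number {
  const dollarsPerSecond = aggregateTps * (dollarsPer1kTokens / 1000);
  return dollarsPerSecond * 3600;
}

gpuValuePerHour(80, 1.0); // ≈ $288/hour at 80 aggregate TPS
gpuValuePerHour(20, 1.0); // ≈ $72/hour from the same hardware at 20 TPS
```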
Cost efficiency: For API consumers, TPS interacts with cost through latency. Lower TPS means longer total latency for the same output length, which means longer-held connections, higher tail latencies, and increased risk of timeouts that trigger costly retries. A 500-token response at 40 TPS takes 12.5 seconds; at 80 TPS, it takes 6.25 seconds. The token cost is identical, but the latency-related operational costs (connection overhead, timeout retries, user abandonment) differ significantly. CostHawk tracks per-request TPS alongside cost metrics, enabling you to identify requests where low TPS is degrading UX or causing downstream cost impacts from retries and timeouts.
How TPS Is Calculated
TPS can be calculated at different stages and scopes, and the methodology matters for accurate measurement and meaningful comparisons.
Per-request TPS (output generation rate):
TPS = output_tokens / decode_time_seconds
Where:
output_tokens = number of tokens generated in the response
decode_time = total_latency - TTFT
(i.e., the time spent generating tokens, excluding prefill)

Example: A request with 180 ms TTFT, 2,680 ms total latency, and 200 output tokens:
decode_time = 2,680 - 180 = 2,500 ms = 2.5 seconds
TPS = 200 / 2.5 = 80 tokens/second

It is essential to subtract TTFT from total latency because the prefill phase (processing input) does not generate output tokens. Including prefill time in the denominator would artificially lower the calculated TPS, especially for requests with large inputs and short outputs.
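The same calculation as a small TypeScript helper (the function name is illustrative; in practice the output token count comes from the provider's usage object):

```typescript
// Per-request TPS: output tokens divided by decode time (total latency minus TTFT).
function perRequestTps(outputTokens: number, totalLatencyMs: number, ttftMs: number): number {
  const decodeSeconds = (totalLatencyMs - ttftMs) / 1000;
  if (decodeSeconds <= 0) {
    throw new Error("decode time must be positive; check TTFT and total latency");
  }
  return outputTokens / decodeSeconds;
}

// Worked example from above: 180 ms TTFT, 2,680 ms total latency, 200 output tokens
perRequestTps(200, 2680, 180); // 80 tokens/second
```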
Aggregate TPS (server throughput):
Aggregate TPS = sum of all concurrent per-request TPS
= total_output_tokens_generated / time_period

If an inference server is handling 10 concurrent requests, each generating at 35 TPS, the aggregate throughput is 350 TPS. This metric is critical for capacity planning: if your application needs to sustain 1,000 TPS during peak hours, you need enough GPU capacity to deliver that aggregate rate.
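A sketch of the aggregate calculation over a fixed measurement window (names are illustrative):

```typescript
// Aggregate TPS: total output tokens generated across all requests in a window,
// divided by the window length in seconds.
function aggregateTps(outputTokensPerRequest: number[], windowSeconds: number): number {
  const totalTokens = outputTokensPerRequest.reduce((sum, n) => sum + n, 0);
  return totalTokens / windowSeconds;
}

// 10 concurrent requests each generating at 35 TPS over a 10-second window
// produce 350 tokens apiece, for an aggregate of 350 TPS.
aggregateTps(Array(10).fill(350), 10); // 350
```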
Important measurement nuances:
- First-token bias: The first output token is generated as part of the prefill-to-decode transition and may have different timing than subsequent tokens. Many measurement frameworks exclude the first token from TPS calculations.
- Token batching in SSE: Providers may batch multiple tokens into a single SSE event for network efficiency. If you measure TPS by counting SSE events per second rather than tokens per second, you will undercount. Always use the actual token count from the usage object.
- End-of-generation slowdown: Some models slow down slightly near the end of a response as the probability distribution becomes more peaked. P50 TPS over the full response is more representative than TPS measured from any single time window.
- Quantization effects: Quantized models (INT8, INT4, GPTQ, AWQ) run at higher TPS than full-precision models because they require less memory bandwidth per token. However, the quality tradeoff means you cannot directly compare TPS across different quantization levels without also comparing output quality.
CostHawk calculates per-request TPS automatically by subtracting TTFT from total latency to obtain decode time, then dividing the output token count by that decode time, providing accurate TPS measurements without client-side instrumentation complexity.
TPS Benchmarks by Model and Hardware
TPS varies widely depending on the model, hardware, serving framework, and concurrency level. The following benchmarks reflect per-request TPS under typical production conditions with moderate concurrency (4–8 concurrent requests per GPU).
API-hosted models (per-request TPS as observed by clients):
| Model | Provider | P50 TPS | P95 TPS (lower bound) | Notes |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | ~110 TPS | ~65 TPS | Fastest generation among OpenAI models |
| GPT-4o | OpenAI | ~80 TPS | ~45 TPS | Consistent speed; good for streaming UX |
| Gemini 2.0 Flash | Google | ~150 TPS | ~90 TPS | Highest TPS among comparable-tier models |
| Gemini 1.5 Pro | Google | ~70 TPS | ~40 TPS | Larger context window reduces throughput |
| Claude 3.5 Haiku | Anthropic | ~100 TPS | ~60 TPS | Fast economy model |
| Claude 3.5 Sonnet | Anthropic | ~70 TPS | ~35 TPS | Balances speed and capability |
| Claude 3 Opus | Anthropic | ~35 TPS | ~18 TPS | Slower due to model size; noticeable in streaming |
Self-hosted models (per-request TPS, single A100 80GB GPU):
| Model | Quantization | Framework | Per-Request TPS (batch=1) | Aggregate TPS (batch=8) |
|---|---|---|---|---|
| Llama 3 8B | FP16 | vLLM | ~120 TPS | ~400 TPS |
| Llama 3 8B | INT4 (AWQ) | vLLM | ~180 TPS | ~600 TPS |
| Llama 3 70B | FP16 (4xA100) | vLLM | ~45 TPS | ~160 TPS |
| Llama 3 70B | INT4 (GPTQ) | vLLM | ~80 TPS | ~280 TPS |
| Mistral 7B | FP16 | TGI | ~110 TPS | ~380 TPS |
Key takeaways from benchmarks:
- Model size is the dominant factor. 7–8B parameter models consistently generate at 2–3x the TPS of 70B+ parameter models on the same hardware.
- Quantization provides 40–80% TPS improvement with minimal quality degradation for most tasks. INT4 quantization nearly doubles TPS compared to FP16 for memory-bandwidth-bound inference.
- Batching trades per-request TPS for aggregate throughput. Going from batch=1 to batch=8 typically reduces per-request TPS by 30–40% but increases aggregate TPS by 3–4x.
- API providers optimize for aggregate throughput, so per-request TPS from hosted APIs reflects the provider's batching and scheduling decisions, not the model's raw speed.
Factors That Determine TPS
TPS is constrained by a combination of hardware, model architecture, and serving configuration factors. Understanding these allows you to predict TPS for new configurations and diagnose why TPS is lower than expected.
1. GPU memory bandwidth (primary bottleneck). During autoregressive decoding, each output token requires reading the model's weights from GPU memory. This makes decode fundamentally memory-bandwidth-bound, not compute-bound. The token generation rate is approximately:
Max TPS ≈ memory_bandwidth_GB_s / model_size_GB
Example (A100 80GB, Llama 3 70B FP16):
Memory bandwidth: 2,039 GB/s
Model size: ~140 GB (70B params × 2 bytes FP16)
Theoretical max TPS: 2,039 / 140 ≈ 14.6 TPS (per batch element)
With batch=8: ~14.6 × 8 ≈ 117 aggregate TPS (theoretical)

In practice, utilization of theoretical bandwidth is 50–80%, so real-world TPS is lower. This formula explains why the H100 (3,350 GB/s bandwidth) generates tokens 50–65% faster than the A100 (2,039 GB/s) for the same model — it is the bandwidth improvement that matters, not the compute improvement.
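The roofline estimate can be sketched with the A100 numbers (the utilization factor below is an assumption for illustration, not a measured value):

```typescript
// Theoretical decode ceiling: memory bandwidth divided by bytes read per token,
// which is approximately the model size for batch=1 autoregressive decoding.
function theoreticalMaxTps(bandwidthGBs: number, modelSizeGB: number): number {
  return bandwidthGBs / modelSizeGB;
}

const ceiling = theoreticalMaxTps(2039, 140); // A100 + Llama 3 70B FP16: ~14.6 TPS
const realistic = ceiling * 0.65;             // assuming ~65% achievable bandwidth
```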
2. Model parameter count. More parameters means more bytes to read from memory per token, directly reducing TPS. A 70B parameter model at FP16 is 140 GB — roughly 10x larger than a 7B model at 14 GB. All else equal, the 7B model generates tokens ~10x faster. This is the fundamental reason smaller models have higher TPS.
3. Quantization level. INT8 quantization halves the model size (each parameter is 1 byte instead of 2), roughly doubling the effective memory bandwidth per token and increasing TPS by 40–80%. INT4 quantization quarters the model size, providing even higher TPS gains. The quality tradeoff is typically small — INT8 is nearly lossless for most models, and INT4 (with techniques like GPTQ or AWQ) maintains 95–99% of full-precision quality on standard benchmarks.
4. KV cache size. The key-value cache stores intermediate computations for all processed tokens (input + previously generated output). As the KV cache grows with sequence length, it competes with model weights for GPU memory bandwidth, gradually reducing TPS. This effect becomes noticeable at long sequence lengths (10,000+ tokens) and is one reason why generating the 500th token in a response is slightly slower than generating the 10th token.
5. Batch size (continuous batching). Modern serving frameworks like vLLM and TGI use continuous batching to process multiple requests simultaneously. Higher batch sizes improve aggregate TPS (more total tokens per second across all requests) but reduce per-request TPS because memory bandwidth is shared. The optimal batch size depends on the ratio of compute to memory bandwidth and the target per-request TPS for acceptable UX.
6. Speculative decoding. An advanced technique where a small "draft" model generates several candidate tokens quickly, and the large "target" model verifies them in a single forward pass. When the draft model's predictions are correct (typically 60–80% of the time), multiple tokens are confirmed per forward pass, increasing effective TPS by 1.5–2.5x. This technique is increasingly used by API providers to improve decode speed without changing the model itself.
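A common simplified model of the speculative decoding gain assumes an independent per-token acceptance rate p for k draft tokens per verification pass (real acceptance is correlated, so treat this as a rough estimate rather than a prediction):

```typescript
// Expected tokens confirmed per target-model forward pass: the run of accepted
// draft tokens plus the one token the target model always contributes, which
// sums to a truncated geometric series in the acceptance rate.
function expectedTokensPerPass(acceptRate: number, draftLength: number): number {
  return (1 - Math.pow(acceptRate, draftLength + 1)) / (1 - acceptRate);
}

expectedTokensPerPass(0.7, 4); // ~2.8 tokens per pass, vs exactly 1 without speculation
```

The net TPS improvement is smaller than this ratio because each pass also pays for the draft model's forward passes.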
TPS and Streaming UX Design
TPS directly shapes how users experience streaming AI responses. Designing your streaming UX around realistic TPS ranges ensures a smooth experience rather than a jarring one.
Token delivery patterns: At 80 TPS, a token arrives roughly every 12.5 ms. Modern browsers can render DOM updates at 60 FPS (every 16.7 ms), so individual tokens arrive slightly faster than the screen can update. The result is smooth, continuous text flow. At 20 TPS, a token arrives every 50 ms — clearly visible as individual word or sub-word additions, creating the characteristic "typewriter" effect familiar from ChatGPT. At 5 TPS, each token appears with a 200 ms gap, and the experience feels sluggish.
Buffering strategy: Rather than rendering each token as it arrives (which can cause layout thrash at high TPS), buffer tokens and flush to the DOM on an animation frame schedule:
// Buffer tokens and render at 60fps
let tokenBuffer = ""
let rafId: number | null = null
function onToken(token: string) {
tokenBuffer += token
if (!rafId) {
rafId = requestAnimationFrame(() => {
appendToDOM(tokenBuffer)
tokenBuffer = ""
rafId = null
})
}
}
// This naturally batches tokens arriving faster than 16.7ms
// At 80 TPS, ~1-2 tokens per frame
// At 200 TPS, ~3-4 tokens per frame

Markdown rendering considerations: Many AI responses include Markdown formatting (headers, lists, code blocks, tables). Rendering Markdown incrementally as tokens stream in is challenging because incomplete Markdown can produce broken formatting. Two approaches: (1) buffer until a natural break point (paragraph end, code fence close) before rendering, which slightly delays display but ensures clean formatting; (2) use a streaming-aware Markdown parser that gracefully handles incomplete syntax and re-renders as more tokens arrive. Libraries like react-markdown with a streaming wrapper handle this pattern.
Code block streaming: Code responses require syntax highlighting, which is expensive to re-run on every token. Buffer code block content and apply syntax highlighting only when the code block is complete (after the closing ```) or at throttled intervals during streaming. Highlighting every token wastes CPU and causes visual flicker.
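One way to implement the throttled-interval approach is a simple throttle wrapper (a sketch; `rehighlight` stands in for whatever highlighter you use):

```typescript
// Wrap an expensive operation so it runs at most once per interval.
// Calls arriving inside the interval are dropped, not queued.
function throttle(fn: () => void, intervalMs: number): () => void {
  let lastRun = 0;
  return () => {
    const now = Date.now();
    if (now - lastRun >= intervalMs) {
      lastRun = now;
      fn();
    }
  };
}

// Re-highlight the streaming code block at most every 250 ms:
// const rehighlightThrottled = throttle(rehighlight, 250);
// Call rehighlightThrottled() on each token; run rehighlight() once more
// when the closing ``` arrives so the final render is complete.
```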
Perceived speed tricks:
- Skeleton loading: Show a pulsing skeleton or typing indicator during TTFT to set expectations that a response is coming.
- Progressive disclosure: For long responses, consider rendering the first paragraph immediately and loading the rest as the user scrolls, making the response feel complete sooner.
- Token speed indicator: Some AI interfaces display the generation speed (e.g., "78 tok/s") as a subtle indicator. This sets expectations and serves as a quality signal — fast generation suggests the system is healthy.
CostHawk captures per-request TPS that you can use to segment your UX analytics. Correlate TPS with user engagement metrics (message rate, session duration, abandonment) to empirically determine the minimum acceptable TPS for your application.
Optimizing TPS
Optimization strategies differ depending on whether you are an API consumer (optimizing for per-request TPS from a hosted provider) or self-hosting (optimizing aggregate TPS from your GPU fleet).
For API consumers:
1. Choose faster models. Model selection is the primary lever. If streaming UX is critical and the task does not require frontier capability, use economy models (GPT-4o mini at ~110 TPS, Gemini Flash at ~150 TPS, Claude Haiku at ~100 TPS) instead of flagship models (GPT-4o at ~80 TPS, Claude Sonnet at ~70 TPS). The TPS difference is noticeable to users.
2. Use shorter max_tokens. Some providers may allocate fewer resources to requests with very high max_tokens values because they must reserve KV cache memory for the full potential output length. Setting max_tokens to a realistic ceiling (e.g., 1,000 instead of 4,096) can improve resource allocation and TPS in some cases.
3. Reduce input length. Shorter inputs mean a smaller KV cache during generation, which can slightly improve TPS because less memory bandwidth is consumed reading the cache during each decode step. The effect is small (2–10%) but compounds with the TTFT improvement from shorter inputs.
4. Use provisioned throughput. Providers offer reserved capacity tiers that guarantee consistent TPS without degradation from shared-fleet contention. OpenAI's provisioned throughput, Anthropic's reserved capacity, and Google's provisioned APIs all offer more consistent per-request TPS, especially during peak hours when shared-tier TPS can drop 30–50%.
For self-hosted models:
5. Quantize aggressively. INT4 quantization (GPTQ, AWQ, or GGUF) nearly doubles TPS for memory-bandwidth-bound inference with minimal quality loss. For most production use cases, INT4 or INT8 quantization is the highest-ROI optimization. Benchmark quality on your specific tasks to ensure acceptable output.
6. Use optimized serving frameworks. vLLM, TensorRT-LLM, and Text Generation Inference (TGI) implement PagedAttention, continuous batching, and KV cache management that dramatically improve both per-request and aggregate TPS compared to naive implementations. vLLM's PagedAttention alone can improve aggregate throughput by 2–4x over static batching.
7. Enable speculative decoding. Use a small draft model (e.g., a 1B parameter model) to speculatively generate candidate tokens, then verify with the target model. This can improve effective TPS by 1.5–2.5x when the draft model has a high acceptance rate for your domain.
8. Upgrade GPU hardware. The H100 provides 65% more memory bandwidth than the A100, translating directly to higher TPS. For high-volume workloads, the H100's higher per-unit cost is offset by the proportionally higher throughput. Newer GPUs like the H200 (with HBM3e) push bandwidth even further.
9. Tune batch size. Profile the tradeoff between per-request TPS and aggregate TPS for your specific model and hardware. If your SLA requires P95 per-request TPS above 40, you may need to limit batch size even though higher batching would improve aggregate throughput.
Monitoring TPS in Production
TPS monitoring complements TTFT and total latency monitoring to provide a complete picture of inference performance. While TTFT measures how quickly a response starts, TPS measures how quickly it flows — both are needed to fully characterize the streaming experience.
Key metrics to track:
- P50 per-request TPS: The typical generation speed your users experience. Track this by model to identify which models are delivering the best streaming UX.
- P5 per-request TPS (lower tail): The slowest 5% of requests. This is where users experience frustration. A P5 TPS below 10 indicates some requests are generating painfully slowly.
- Aggregate TPS (self-hosted): Total output tokens per second across your GPU fleet. This is the capacity metric that drives scaling decisions.
- TPS over time: Track TPS trends to detect gradual degradation. A slow decline in P50 TPS over weeks may indicate increasing KV cache sizes (from growing conversation histories), provider-side degradation, or infrastructure issues.
- TPS by input token count: Correlate TPS with input size to quantify the KV cache impact. If TPS drops 30% when input exceeds 8,000 tokens, you have a concrete argument for keeping prompts under that threshold.
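These percentiles can be computed with a simple nearest-rank sketch (production systems usually maintain histograms or sketches like t-digest instead of sorting raw samples):

```typescript
// Nearest-rank percentile over a window of per-request TPS samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const tpsWindow = [72, 68, 81, 9, 75, 70, 66, 79, 74, 12];
percentile(tpsWindow, 50); // P50: typical speed (70 here)
percentile(tpsWindow, 5);  // P5: the slow tail (9 here, below the 10 TPS threshold)
```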
Alerting thresholds:
| Alert | Condition | Suggested Action |
|---|---|---|
| Warning | P50 TPS drops below 80% of 7-day average | Investigate provider status, check for prompt size regressions |
| Critical | P5 TPS drops below 10 TPS for 10+ minutes | Activate faster model failover for latency-sensitive endpoints |
| Capacity (self-hosted) | Aggregate TPS exceeds 80% of theoretical maximum | Scale out GPU fleet or enable request queuing |
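The warning and critical rules from the table can be sketched as follows (thresholds copied from the table; the bookkeeping that tracks how long P5 has been depressed is omitted):

```typescript
type TpsAlert = "ok" | "warning" | "critical";

// p50/p5 are current per-request TPS percentiles; minutesP5Below10 is how long
// P5 has been under 10 TPS (duration tracking itself is left out of this sketch).
function classifyTps(
  p50: number,
  p5: number,
  sevenDayAvgP50: number,
  minutesP5Below10: number
): TpsAlert {
  if (p5 < 10 && minutesP5Below10 >= 10) return "critical";
  if (p50 < 0.8 * sevenDayAvgP50) return "warning";
  return "ok";
}

classifyTps(72, 45, 75, 0);  // "ok"
classifyTps(55, 30, 75, 0);  // "warning": 55 is below 80% of the 7-day average
classifyTps(70, 8, 75, 12);  // "critical": P5 under 10 TPS for 10+ minutes
```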
TPS and cost correlation: TPS does not directly affect per-token cost (you pay the same per token regardless of generation speed), but it has indirect cost impacts. Low TPS increases the probability of client-side timeouts, which trigger retries that double cost. Low TPS also increases connection hold time, which can exhaust connection pools and cause cascading failures in high-throughput applications. Track the correlation between TPS percentiles and retry rates to quantify this hidden cost.
Dashboard design: A TPS monitoring dashboard should include: (1) time-series of P50 and P5 per-request TPS by model, (2) histogram of TPS distribution to show the shape of the speed experience, (3) aggregate TPS gauge for self-hosted infrastructure showing utilization against capacity, (4) scatter plot of TPS vs input token count to visualize the KV cache impact curve.
CostHawk computes per-request TPS automatically from TTFT and total latency measurements, providing all of these dashboards without requiring any client-side TPS instrumentation. For self-hosted infrastructure, CostHawk integrates with vLLM and TGI metrics exporters to capture aggregate throughput data alongside per-request metrics.
Frequently Asked Questions
What is a good TPS for a production AI application?
Why does TPS vary so much between requests to the same model?
How does TPS relate to cost?
What is the difference between TPS and throughput?
How does quantization affect TPS?
Can TPS decrease during a single response generation?
How do I improve TPS when using an API provider?
What is the relationship between TPS, TTFT, and total latency?
Total Latency = TTFT + (Output Tokens / TPS × 1000) (in milliseconds). TTFT captures the time before generation begins (network transit + queue wait + prefill). TPS determines how fast output tokens are generated once generation starts. Total latency is the sum of both phases. For example: TTFT of 400 ms, 300 output tokens at 60 TPS → decode time = 300/60 = 5 seconds = 5,000 ms → total latency = 400 + 5,000 = 5,400 ms.

This relationship reveals which factor dominates latency for different request profiles. Short outputs (under 100 tokens): TTFT dominates — a 400 ms TTFT with 50 tokens at 60 TPS gives 400 + 833 = 1,233 ms total, with TTFT being 32% of total latency. Long outputs (500+ tokens): TPS dominates — a 400 ms TTFT with 500 tokens at 60 TPS gives 400 + 8,333 = 8,733 ms total, with TTFT being only 4.6%.

This means TTFT optimization has the most impact for short-response applications (classification, extraction, Q&A), while TPS optimization matters more for long-response applications (content generation, code generation, analysis). CostHawk decomposes total latency into TTFT and decode time on every request, making it easy to identify which factor to optimize for each endpoint.

Related Terms
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.