Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency is tightly coupled to cost: the most capable models are both the most expensive and the slowest, and options like provisioned throughput buy lower latency at higher spend.
Why It Matters for AI Costs
Latency is the primary determinant of user experience in AI-powered applications, and it trades off directly against cost in ways that catch teams off guard. Consider the latency profiles of models at different price points (March 2026 median values from production workloads):
| Model | Median TTFT | Generation Speed | Input Cost/MTok | Output Cost/MTok |
|---|---|---|---|---|
| GPT-4o mini | ~280ms | ~120 tok/s | $0.15 | $0.60 |
| Gemini 2.0 Flash | ~200ms | ~150 tok/s | $0.10 | $0.40 |
| GPT-4o | ~450ms | ~90 tok/s | $2.50 | $10.00 |
| Claude 3.5 Sonnet | ~520ms | ~80 tok/s | $3.00 | $15.00 |
| Claude 3 Opus | ~1,200ms | ~40 tok/s | $15.00 | $75.00 |
The pattern is clear: more capable (and more expensive) models are also slower. Claude 3 Opus costs roughly 150x more than Gemini 2.0 Flash on input tokens (and nearly 190x on output) while generating 3-4x slower. This creates a compounding problem: if you choose a frontier model for quality, you pay more per token and degrade the user experience with higher latency.
The cost implications of latency go beyond per-token pricing:
- User abandonment. Research consistently shows that AI chat interfaces with TTFT above 3 seconds see 25-40% higher abandonment rates. Each abandoned request still costs you tokens — the model processes the input and may have started generating output before the user navigates away.
- Timeout-driven retries. Applications with aggressive timeout settings may kill and retry slow requests, doubling or tripling the token cost for a single user action. If your timeout is 10 seconds and 5% of Claude Opus requests exceed that threshold, you are retrying the most expensive requests — compounding the cost.
- Throughput bottlenecks. In synchronous architectures, slow responses occupy server threads longer. A 5-second response ties up a connection 5x longer than a 1-second response, reducing your effective throughput and potentially requiring more infrastructure to handle the same traffic.
CostHawk tracks latency alongside cost for every request, making it possible to analyze the latency-cost tradeoff with real production data. You can identify requests where you are paying for frontier-model latency on tasks that a faster, cheaper model could handle equally well.
Understanding LLM Latency
LLM latency is fundamentally different from the latency of traditional web APIs, and understanding this difference is essential for building performant AI applications. When you call a conventional REST API, the server receives your request, processes it (usually in milliseconds to low hundreds of milliseconds), and returns the complete response. The latency is primarily determined by network round-trip time, server processing time, and database query time.
LLM APIs break this model in two important ways. First, the processing time is orders of magnitude longer — a single LLM inference can take 1-30 seconds depending on the model, input size, and output length. Second, the response is generated incrementally, one token at a time, which enables streaming — a delivery pattern where tokens are sent to the client as they are produced rather than waiting for the complete response.
Streaming fundamentally changes how users perceive latency. A request that takes 8 seconds to complete feels slow if the user stares at a loading spinner for 8 seconds. But if the first token appears after 400ms and text streams in at 80 tokens per second, the same 8-second request feels responsive — the user is reading as the response appears, and the perceived wait is only the 400ms TTFT.
This is why LLM latency must be understood as a composite metric, not a single number. The three components are:
- Network latency: The time for your request to reach the provider's servers and for the first response bytes to return. Typically 20-100ms for well-connected data centers, 50-300ms for global traffic. This is the same as traditional API latency and is optimized through geographic proximity and connection pooling.
- Time-to-first-token (TTFT): The time from when the provider receives your request to when the first output token is generated. This includes the prefill phase (processing all input tokens) and any queuing delay at the provider. TTFT scales with input length — a 10,000-token input takes longer to prefill than a 500-token input. Typical ranges: 100ms-3s depending on model, input size, and provider load.
- Generation time: The time to produce all output tokens after the first one. This is determined by the generation speed (tokens per second) and the total number of output tokens. At 80 tok/s, generating 400 tokens takes 5 seconds. Generation speed is primarily determined by the model architecture and the provider's GPU allocation.
The total end-to-end latency is: Network latency + TTFT + (Output tokens / Generation speed). This decomposition is critical for optimization because each component has different levers: network latency is optimized by infrastructure, TTFT is optimized by reducing input length and using prompt caching, and generation time is optimized by capping output length and choosing faster models.
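The decomposition can be expressed as a small estimator. This is a back-of-envelope sketch; the figures plugged in below are illustrative, not provider guarantees:

```python
def estimate_latency(network_s: float, ttft_s: float,
                     output_tokens: int, tokens_per_s: float) -> float:
    """End-to-end latency = network latency + TTFT + generation time."""
    return network_s + ttft_s + output_tokens / tokens_per_s

# 50ms network, 500ms TTFT, 400 output tokens at 80 tok/s:
# 0.05 + 0.5 + 5.0 = 5.55 seconds end-to-end
total = estimate_latency(0.05, 0.5, 400, 80)
```

Plugging in your own measured TTFT and generation speed makes this useful for capacity planning and timeout budgeting.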
Latency Components
Each component of LLM latency has distinct causes, measurement approaches, and optimization strategies. Understanding them individually is essential for targeted performance improvement.
1. Queuing Delay
Before your request even begins processing, it may sit in a queue at the provider's infrastructure. During peak hours or traffic spikes, queuing delays can add hundreds of milliseconds to seconds. Queuing delay is entirely outside your control for API-hosted models, but you can mitigate its impact through provider diversification (failover to a different provider when one is congested) and off-peak scheduling for batch workloads. CostHawk's latency tracking helps you identify time-of-day patterns in queuing delay so you can shift non-urgent workloads to lower-traffic periods.
2. Prefill Time (Input Processing)
The prefill phase processes all input tokens in parallel using GPU matrix multiplications. Prefill time scales roughly linearly with input token count, though the relationship is not perfectly linear due to attention mechanism complexity (which scales quadratically with sequence length for standard attention, though most modern models use efficient attention variants). Practical prefill times for common scenarios:
- 500-token input on GPT-4o: ~100-150ms
- 5,000-token input on GPT-4o: ~300-500ms
- 50,000-token input on GPT-4o: ~1,500-3,000ms
- 128,000-token input on Claude 3.5 Sonnet: ~3,000-8,000ms
The key insight is that long context inputs significantly increase TTFT. If your application sends 50,000 tokens of context with every request, you are adding 1-3 seconds to TTFT before a single output token is generated. Prompt caching dramatically reduces prefill time for repeated prefixes — both OpenAI and Anthropic cache the KV-cache from the prefill phase, so subsequent requests with the same prefix skip most of the prefill computation. Anthropic's prompt caching can reduce TTFT by 80-90% for cached prefixes.
3. Decode Time (Output Generation)
The decode phase generates output tokens sequentially. Each token requires a forward pass through the model that reads from the KV-cache and produces a probability distribution over the vocabulary. Decode speed (tokens per second) is primarily determined by the model's size and the provider's GPU allocation. Larger models generate tokens more slowly because each forward pass involves more computation. Current decode speeds for popular models range from ~40 tok/s (Claude 3 Opus) to 150+ tok/s (Gemini 2.0 Flash). Decode time is: Output tokens / Decode speed. A 500-token response at 80 tok/s takes 6.25 seconds of generation time.
4. Network Transfer
For streaming responses, network latency manifests as a constant offset on TTFT plus minimal per-chunk overhead. For non-streaming responses, the entire response must transit the network, adding latency proportional to response size (though LLM responses are text and therefore small relative to bandwidth). Network optimization is standard web engineering: use persistent connections (HTTP/2), minimize geographic distance to the provider, and avoid unnecessary proxies that add per-hop latency. If you route through an observability proxy (like CostHawk's), choose one with edge presence near your servers to minimize the additional hop.
Latency vs Cost Tradeoffs
Latency and cost are fundamentally coupled in LLM applications, and optimizing one often comes at the expense of the other. Understanding these tradeoffs enables informed decision-making rather than accidental overspending or underperformance.
Tradeoff 1: Model selection (faster = cheaper, but less capable)
Smaller, cheaper models are consistently faster than larger, more expensive ones. GPT-4o mini generates tokens at ~120 tok/s for $0.60/MTok output; Claude 3 Opus generates at ~40 tok/s for $75.00/MTok output. This means you pay 125x more for a model that is 3x slower. For tasks where the cheaper model produces acceptable quality, the latency-cost tradeoff is entirely one-directional: switching to a cheaper model simultaneously improves latency and reduces cost. The tradeoff only becomes a real tension when the cheaper model's quality is noticeably worse, forcing you to choose between user experience (fast but lower quality) and output quality (slow but better).
Tradeoff 2: Prompt caching (initial cost for ongoing latency savings)
Prompt caching reduces TTFT by 80-90% for repeated prefixes but has setup requirements. Anthropic charges a premium on the initial cache write and gives a 90% discount on subsequent cache reads. OpenAI's automatic caching gives a 50% discount on cached input tokens. The tradeoff: you pay full price (plus cache-write overhead) on the first request, then save on every subsequent request. For high-volume endpoints with stable system prompts, caching pays for itself within minutes. For low-volume or frequently-changing prompts, the cache hit rate may be too low to justify the complexity.
Tradeoff 3: Output length (shorter = faster and cheaper)
Reducing output length directly reduces both generation time and output token cost. Setting max_tokens=200 instead of allowing unbounded generation caps both the latency and cost of a request. The tradeoff is obvious: shorter outputs may omit useful information. The key is to match output length to the actual need. A classification endpoint that returns {"category": "billing"} should cap output at 50 tokens, not 4,096. A creative writing endpoint may legitimately need 1,000+ tokens. CostHawk's per-endpoint token distribution analysis shows you the actual output token distributions so you can set appropriate caps without truncating legitimate responses.
Tradeoff 4: Provisioned throughput (pay more for guaranteed latency)
Both OpenAI and Anthropic offer provisioned throughput options that guarantee faster response times in exchange for committed spend. OpenAI's Provisioned Throughput Units (PTUs) provide dedicated GPU capacity with guaranteed tokens-per-minute throughput and lower, more consistent latency. Anthropic offers similar reserved capacity for enterprise customers. The tradeoff is pure economics: you pay a fixed monthly commitment (typically 30-50% more than on-demand rates at equivalent volume) in exchange for guaranteed low latency and no rate limits. This makes sense for latency-sensitive applications processing predictable, high volumes — but is wasteful for bursty or low-volume workloads.
Tradeoff 5: Parallelism vs sequencing (faster total time, same total cost)
If a user action requires multiple LLM calls, you can run them in parallel to reduce total latency without changing total cost. Two 3-second calls in parallel take 3 seconds total instead of 6 seconds sequentially, with the same token consumption and cost. The tradeoff is architectural complexity: parallel execution requires careful orchestration, error handling for partial failures, and possibly more concurrent API connections (which count toward rate limits).
Latency Benchmarks by Model
Latency varies significantly across models, providers, and request characteristics. The following benchmarks represent median production values observed across thousands of real-world requests in March 2026. Use these as reference points for capacity planning and SLA setting, but always validate against your own workloads — your specific prompt patterns, geographic location, and traffic timing will produce different numbers.
| Model | Provider | Median TTFT | P95 TTFT | Generation Speed (tok/s) | 500-Token Response (Total) |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google | 180ms | 450ms | ~160 | ~3.3s |
| GPT-4o mini | OpenAI | 260ms | 680ms | ~125 | ~4.3s |
| Claude 3.5 Haiku | Anthropic | 300ms | 750ms | ~110 | ~4.8s |
| Mistral Small | Mistral | 220ms | 550ms | ~130 | ~4.1s |
| GPT-4o | OpenAI | 420ms | 1,100ms | ~90 | ~6.0s |
| Claude 3.5 Sonnet | Anthropic | 500ms | 1,400ms | ~82 | ~6.6s |
| Gemini 1.5 Pro | Google | 380ms | 950ms | ~95 | ~5.6s |
| Mistral Large | Mistral | 480ms | 1,200ms | ~70 | ~7.6s |
| Claude 3 Opus | Anthropic | 1,100ms | 3,200ms | ~42 | ~13.0s |
| o3-mini | OpenAI | 800ms* | 4,500ms* | ~100 | ~5.8s* |
*Reasoning model TTFTs are highly variable because they include internal "thinking" time that varies with problem complexity. A simple question might have a 500ms TTFT, while a complex math problem might think for 10+ seconds before producing the first visible token.
Key observations from these benchmarks:
- Economy models are 2-3x faster than frontier models. Gemini Flash and GPT-4o mini deliver sub-300ms TTFT and 120-160 tok/s generation. Claude Opus and reasoning models are 3-10x slower.
- P95 TTFT is 2-3x the median. Always design your timeouts and user experience around P95 or P99 latency, not the median. If your median TTFT is 500ms but P95 is 1,400ms, 5% of your users are waiting nearly 3x longer than average.
- Generation speed is the dominant factor for longer outputs. For a 500-token response, generation accounts for 80-90% of total latency. Reducing input tokens (which reduces TTFT) has diminishing returns once output generation dominates.
- Provider load causes 20-40% latency variation by time of day. US business hours (9am-5pm PT) typically show the highest latency for US-hosted providers. Schedule batch workloads for off-peak hours when possible.
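The "design around P95 or P99" advice translates directly into per-model timeout budgets: tail TTFT plus worst-case generation time plus a safety buffer. A sketch, where the inputs are assumptions you would replace with your own measured P99s:

```python
def timeout_seconds(p99_ttft_s: float, max_tokens: int,
                    gen_speed_tok_s: float, buffer_s: float = 3.0) -> float:
    """Per-model timeout budget: tail TTFT + worst-case generation + buffer."""
    return p99_ttft_s + max_tokens / gen_speed_tok_s + buffer_s

# GPT-4o with max_tokens=1000 at ~90 tok/s: 1.5 + ~11.1 + 3 = ~15.6s
t = timeout_seconds(1.5, 1000, 90)
```

A single global timeout cannot serve both a 260ms-TTFT economy model and a 3s-TTFT frontier model; compute one budget per model.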
CostHawk records TTFT and total latency for every request, allowing you to build your own provider-specific, model-specific, and time-of-day-specific benchmarks from your actual production traffic.
Reducing Latency Without Increasing Cost
Many latency optimizations also reduce cost, making them doubly valuable. Here are the most effective techniques, ordered by typical impact for production workloads:
1. Enable streaming (zero cost impact, massive perceived latency improvement)
Streaming delivers tokens to the client as they are generated instead of waiting for the complete response. This does not reduce total latency or cost — the same tokens are generated at the same speed — but it transforms the user experience. A 6-second response that streams token-by-token feels dramatically faster than a 6-second response that appears all at once after a loading spinner. Implementing streaming is straightforward with every major provider's SDK. OpenAI, Anthropic, and Google all support server-sent events (SSE) for streaming. If you are using CostHawk's proxy, streaming is transparently passed through — the proxy logs the complete request metadata without adding meaningful latency to the stream.
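Once you stream, measuring perceived latency on the client is straightforward. A minimal sketch that works with any iterable of text chunks — the chunk objects a real SDK yields are provider-specific, so extracting the text delta from each chunk is left as an assumption:

```python
import time

def consume_stream(chunks):
    """Drain a token stream, recording time-to-first-token and total latency."""
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:                      # provider SDK yields chunks here
        if ttft is None:
            ttft = time.monotonic() - start   # first token arrived
        parts.append(chunk)
    return "".join(parts), ttft, time.monotonic() - start
```

With the OpenAI or Anthropic SDKs you would pass the stream object and append each chunk's text delta rather than the raw chunk.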
2. Implement prompt caching (reduces cost AND latency)
Prompt caching eliminates redundant prefill computation for repeated prompt prefixes. If your system prompt is 2,000 tokens and you send it with every request, prompt caching computes the prefill for those 2,000 tokens once and reuses the cached KV-cache for subsequent requests. The impact on TTFT is dramatic: Anthropic's prompt caching reduces TTFT by 80-85% for the cached portion. On a 10,000-token input where 8,000 tokens are a cached system prompt, TTFT drops from ~600ms to ~200ms. Cost impact is also positive: Anthropic charges 10% of the normal input rate for cached tokens (90% discount), and OpenAI charges 50% of the normal rate. This is one of the rare optimizations that improves latency and reduces cost simultaneously.
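The read discounts translate into a simple cost comparison. A sketch using the discount levels quoted above (the 10% cached-rate fraction is Anthropic's 90% read discount; rates are dollars per million tokens):

```python
def input_cost_usd(total_tokens: int, cached_tokens: int,
                   rate_per_mtok: float,
                   cached_rate_fraction: float = 0.10) -> float:
    """Input cost when a cached prefix is billed at a fraction of the base rate."""
    uncached = (total_tokens - cached_tokens) * rate_per_mtok / 1_000_000
    cached = cached_tokens * rate_per_mtok * cached_rate_fraction / 1_000_000
    return uncached + cached

# 10,000-token input with an 8,000-token cached prefix at $3.00/MTok:
# $0.006 uncached + $0.0024 cached = $0.0084, vs $0.03 with no cache
with_cache = input_cost_usd(10_000, 8_000, 3.00)
no_cache = input_cost_usd(10_000, 0, 3.00)
```

For OpenAI-style 50% discounts, pass `cached_rate_fraction=0.50`.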
3. Reduce input token count (reduces cost AND TTFT)
Every input token adds to prefill time. Reducing input tokens directly reduces TTFT and input token cost. Practical approaches: trim system prompts by removing redundant instructions; use RAG to retrieve only relevant context instead of stuffing the full document; summarize conversation history instead of sending raw transcripts; compress structured data (use CSV instead of verbose JSON). A team that reduces average input tokens from 5,000 to 2,000 will see TTFT improve by approximately 40% and input token costs drop by 60%.
4. Set appropriate max_tokens (reduces cost AND generation time)
Setting max_tokens prevents the model from generating unnecessarily long responses. This caps both the generation time and the output token cost. The key is to set it tightly enough to prevent waste but loosely enough to avoid truncating legitimate responses. Analyze your output token distribution in CostHawk: if the P99 output length is 300 tokens, setting max_tokens=400 gives 33% headroom while preventing the rare 2,000-token runaway response that takes 25 seconds and costs 5x a normal response.
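The sizing rule can be automated from logged output lengths. A sketch using a simple nearest-rank P99; the 33% headroom default mirrors the example above:

```python
import math

def suggest_max_tokens(output_lengths, headroom: float = 1.33) -> int:
    """Suggest a max_tokens cap: nearest-rank P99 of observed outputs + headroom."""
    xs = sorted(output_lengths)
    p99 = xs[max(0, math.ceil(0.99 * len(xs)) - 1)]
    return math.ceil(p99 * headroom)

# 100 responses with lengths 1..100 tokens -> P99 = 99 -> suggested cap 132
cap = suggest_max_tokens(range(1, 101))
```

Re-run this periodically: if a prompt change shifts the output distribution, a stale cap starts truncating legitimate responses.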
5. Route to faster models for simple tasks (reduces cost AND latency)
Model routing is the practice of directing each request to the cheapest and fastest model that can handle it adequately. Simple tasks like classification, entity extraction, and format conversion rarely need a frontier model. Routing these to GPT-4o mini or Gemini Flash instead of GPT-4o or Claude Sonnet simultaneously reduces latency (economy models are 2-3x faster) and cost (economy models are 10-20x cheaper). Even a basic static routing rule — "all /api/classify requests use GPT-4o mini" — can capture significant savings. Dynamic routing with a lightweight classifier can push optimization further.
6. Parallelize independent LLM calls (same cost, lower total latency)
If a user action requires multiple independent LLM calls (e.g., summarize + classify + extract), run them in parallel instead of sequentially. Three 2-second calls in parallel take 2 seconds total instead of 6 seconds sequentially, with identical token consumption and cost. Use Promise.all() in JavaScript or asyncio.gather() in Python to parallelize calls. Be mindful of rate limits — parallel calls consume your rate limit budget faster.
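A sketch of the pattern in Python, where `call_llm` stands in for your own async wrapper around a provider SDK (the sleep simulates API latency):

```python
import asyncio

async def call_llm(task: str) -> str:
    """Placeholder for a real async provider call."""
    await asyncio.sleep(0.1)  # stands in for seconds of real API latency
    return f"{task}: done"

async def handle_action():
    # Independent calls run concurrently: wall time ~= the slowest single call,
    # while token consumption (and therefore cost) is unchanged.
    return await asyncio.gather(
        call_llm("summarize"),
        call_llm("classify"),
        call_llm("extract"),
    )

results = asyncio.run(handle_action())
```

`asyncio.gather` preserves argument order in its result list, so downstream code can rely on positions.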
Monitoring Latency
Effective latency monitoring for LLM applications requires tracking multiple dimensions that traditional APM tools do not capture. Here is a comprehensive monitoring strategy:
What to measure:
- TTFT (time-to-first-token): The most important user-facing latency metric for streaming applications. Measure at the client side (includes network latency) and, if possible, at the proxy side (excludes network latency to isolate provider performance). Track P50, P95, and P99 percentiles — averages hide the long-tail experiences that frustrate users.
- End-to-end latency: Total time from request to complete response. The most important metric for non-streaming applications and batch workloads. Track per-model and per-endpoint to identify which combinations are underperforming.
- Generation speed (tokens/second): Calculated as output_tokens / (end_to_end_latency - TTFT). This isolates the decode phase performance. A drop in generation speed often indicates provider-side GPU contention — the model is generating tokens more slowly because the provider's infrastructure is under load.
- Latency by input size: Plot TTFT against input token count. The relationship should be roughly linear; a steeper-than-expected slope suggests the model or provider is handling long inputs poorly. If TTFT spikes disproportionately at high input counts, consider implementing input truncation or chunking.
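The generation-speed metric is worth standardizing as one helper across your logging pipeline so every dashboard computes it identically:

```python
def generation_speed(output_tokens: int, e2e_latency_s: float,
                     ttft_s: float) -> float:
    """Decode-phase speed in tokens/second, isolated from prefill and queuing."""
    decode_s = e2e_latency_s - ttft_s
    if decode_s <= 0:
        raise ValueError("end-to-end latency must exceed TTFT")
    return output_tokens / decode_s

# 400 output tokens, 5.5s end-to-end, 0.5s TTFT -> 80 tok/s
speed = generation_speed(400, 5.5, 0.5)
```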
How to set alerts:
Static latency thresholds ("alert if P95 TTFT exceeds 3 seconds") are a reasonable starting point but suffer from two problems: they do not account for model-specific baselines (3 seconds is normal for Opus but alarming for GPT-4o mini), and they do not detect gradual degradation that stays below the threshold. A better approach combines static thresholds with anomaly detection:
- Set per-model static alerts based on 2x the historical P95 for that model
- Set anomaly alerts that fire when the current hour's P50 latency exceeds the trailing 7-day P50 by more than 50%
- Set input-size alerts that fire when average input tokens per request increases by more than 30% (a leading indicator of TTFT increases)
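The three rules can be encoded as one evaluation function. Field names here are illustrative assumptions; wire them to whatever your metrics store exposes:

```python
def evaluate_alerts(m: dict) -> list:
    """Return the names of the latency alert rules that fire for a snapshot."""
    fired = []
    if m["p95_ms"] > 2 * m["historical_p95_ms"]:          # per-model static rule
        fired.append("static_p95")
    if m["hour_p50_ms"] > 1.5 * m["trailing_7d_p50_ms"]:  # anomaly rule (+50%)
        fired.append("anomaly_p50")
    if m["avg_input_tokens"] > 1.3 * m["baseline_input_tokens"]:  # input-size rule
        fired.append("input_size")
    return fired

# Example snapshot: only the P50 anomaly rule fires
snapshot = {"p95_ms": 1200, "historical_p95_ms": 800,
            "hour_p50_ms": 900, "trailing_7d_p50_ms": 500,
            "avg_input_tokens": 2100, "baseline_input_tokens": 2000}
```

Run this per model, not globally, so a slow-but-normal Opus hour does not mask a genuinely degraded GPT-4o mini endpoint.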
Dashboards to build:
CostHawk provides pre-built latency dashboards, but if you are building custom monitoring, prioritize these views:
- Latency heatmap: P50/P95/P99 latency by model over the past 24 hours. This shows time-of-day patterns and provider-specific performance variations at a glance.
- TTFT vs input tokens scatter plot: Reveals the relationship between input size and responsiveness. Outliers (high TTFT for low input counts) indicate queuing delays or provider issues.
- Latency-cost quadrant chart: Plot each endpoint's average latency (x-axis) against average cost per request (y-axis). Endpoints in the high-latency/high-cost quadrant are optimization priorities. Endpoints in the low-latency/low-cost quadrant are well-optimized.
- Timeout and retry tracker: Count of requests that exceeded timeout thresholds or triggered retries, broken down by model. High timeout rates on a specific model suggest it is too slow for your SLA requirements.
The goal of latency monitoring is not just awareness — it is actionable intelligence. Every latency anomaly should lead to an investigation: is it a provider issue (check their status page), a prompt change (check input token trends), a traffic spike (check request volume), or a model routing misconfiguration (check model distribution)? CostHawk's correlated metrics make this investigation fast by showing latency, cost, token counts, and error rates on the same timeline.
Frequently Asked Questions
What is a good TTFT (time-to-first-token) for production applications?
Why does latency vary so much between identical requests?
How does input length affect latency?
What is the difference between latency and throughput?
How does prompt caching affect latency?
Should I use streaming or non-streaming responses?
How do reasoning models like o1 and o3-mini affect latency?
What timeout values should I set for LLM API calls?
A practical formula is: P99 TTFT + (max_tokens / generation speed) + buffer. For GPT-4o with max_tokens=1000: P99 TTFT ~1.5s + 1000/90 tok/s ~11s + 3s buffer = ~16s timeout. For GPT-4o mini with max_tokens=500: P99 TTFT ~0.8s + 500/125 tok/s ~4s + 2s buffer = ~7s timeout. For Claude 3 Opus with max_tokens=2000: P99 TTFT ~4s + 2000/42 tok/s ~48s + 5s buffer = ~57s timeout. For reasoning models, add generous headroom for thinking time: 60-120 seconds minimum. Common mistakes to avoid: (1) Using the same timeout for all models — a 10-second timeout is fine for GPT-4o mini but will kill a large share of Claude Opus requests. (2) Not accounting for output length — a request generating 2,000 tokens needs 4x the generation time of a 500-token request. (3) Retrying on timeout without checking why — if a model consistently times out, the fix is a higher timeout or a faster model, not more retries that multiply your cost. Monitor timeout rates per model in CostHawk and adjust thresholds based on actual P99 latency data.

Related Terms
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
P95 / P99 Latency
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
