P95 / P99 Latency
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
Why It Matters for AI Costs
Percentile latency metrics are the difference between believing your API integration is fast and knowing that your users experience it as fast. Consider a production deployment where your LLM-powered feature handles 50,000 requests per day:
- Average latency: 800 ms — looks great, well within a 2-second UX target.
- P50 latency: 620 ms — half your users get sub-second responses. Excellent.
- P95 latency: 2,800 ms — 2,500 users per day wait nearly 3 seconds. Noticeable.
- P99 latency: 8,200 ms — 500 users per day wait over 8 seconds. Some abandon the page.
If you only monitored the average, you would never know that 500 users per day are having a terrible experience. Worse, these tail-latency users are not randomly distributed. Provider-side queuing and GPU contention often correlate with peak traffic hours, meaning your highest-value users (those active during business hours) disproportionately encounter the slowest responses.
P95 and P99 are also the metrics that matter for SLA contracts. When an AI provider or an internal platform team commits to an SLA, it is almost always expressed as a percentile guarantee — for example, "P95 latency below 2,000 ms for requests under 4,000 input tokens." If you are not measuring percentiles, you cannot verify whether your provider is meeting their SLA or whether your own system is meeting the performance targets your product team has set.
From a cost perspective, tail latency often signals inefficiency. Requests that take 5x longer than the median may be hitting cold instances, waiting in provider queues due to rate limiting, or processing unnecessarily large context windows. Identifying and fixing these outliers frequently reduces both latency and cost simultaneously. CostHawk captures per-request latency alongside cost data, enabling you to correlate high-latency requests with specific models, prompt sizes, or time-of-day patterns and take targeted action.
What Are Percentile Latency Metrics?
Percentile latency metrics describe the distribution of response times across a population of requests by answering the question: "What is the maximum latency experienced by X% of requests?" They are computed by sorting all observed latency values and selecting the value at a specific position in the sorted array.
Formally, the Pth percentile of a dataset is the value below which P% of observations fall. For a dataset of N latency measurements sorted in ascending order:
- P50 (median): The value at position N × 0.50. Half of all requests are faster, half are slower. This is the "typical" experience.
- P75: The value at position N × 0.75. Three-quarters of requests complete within this time.
- P90: 90% of requests are at or below this threshold. Often used as a first-tier alert boundary.
- P95: The industry-standard SLA metric. Only 5% of requests exceed this value. At 100,000 requests per day, 5,000 requests are slower than your P95.
- P99: Captures the tail of the distribution. Only 1 in 100 requests is slower. At scale, this still represents thousands of real users per day.
- P99.9: The extreme tail. 1 in 1,000 requests. Relevant for high-traffic services where even 0.1% represents significant user impact.
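As an illustration of the position arithmetic above, here is a minimal Python sketch that computes a percentile from raw latency measurements using linear interpolation between adjacent observations (the same semantics as SQL's PERCENTILE_CONT):

```python
def percentile(values, p):
    """Linear-interpolation percentile over raw measurements.

    values: latency measurements in ms; p: fraction in [0, 1].
    """
    if not values:
        raise ValueError("no measurements")
    xs = sorted(values)
    # Fractional rank of the percentile within the sorted array (0-indexed).
    rank = p * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    frac = rank - lo
    # Interpolate between the two adjacent observations.
    return xs[lo] + (xs[hi] - xs[lo]) * frac

latencies_ms = [620, 640, 700, 780, 850, 900, 1_100, 1_600, 2_800, 8_200]
print(percentile(latencies_ms, 0.50))  # the "typical" experience
print(percentile(latencies_ms, 0.95))  # dominated by the tail values
```

Note how the P95 lands between the two slowest observations even though the median sits in the tight cluster: the tail values barely move the P50 but define the P95.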
In database systems, percentiles are typically computed using the PERCENTILE_CONT function, which performs linear interpolation between adjacent data points when the exact percentile position falls between two observations. For real-time monitoring, many observability platforms use approximate algorithms like t-digest or HDR histograms to compute percentiles over streaming data without storing every individual measurement.
For LLM API monitoring, the most operationally relevant percentiles are P50 (baseline performance), P95 (SLA target), and P99 (tail-latency investigation). CostHawk computes all three using PERCENTILE_CONT over raw request logs, providing exact values rather than approximations. This precision matters when you are holding a provider accountable to an SLA or diagnosing whether a latency regression is real or a statistical artifact.
A critical nuance: percentiles are not additive. If Service A has a P99 of 200 ms and Service B has a P99 of 300 ms, the P99 of a request that hits both services sequentially is not 500 ms. The combined P99 depends on the correlation structure between the two distributions. This makes end-to-end percentile monitoring (which CostHawk provides at the request level) more valuable than aggregating percentiles from individual components.
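A quick simulation makes the non-additivity concrete. The sketch below uses illustrative synthetic distributions (not real provider data): it sums two independent heavy-tailed latency samples and compares the combined P99 to the sum of the individual P99s.

```python
import random

def pctl(xs, p):
    """Nearest-rank percentile; good enough for a simulation."""
    xs = sorted(xs)
    return xs[min(int(p * len(xs)), len(xs) - 1)]

random.seed(42)
# Two independent heavy-tailed (log-normal) latency distributions, in ms.
a = [random.lognormvariate(5.0, 0.6) for _ in range(100_000)]
b = [random.lognormvariate(5.4, 0.6) for _ in range(100_000)]
# A request that hits both services sequentially pays both latencies.
combined = [x + y for x, y in zip(a, b)]

print(pctl(a, 0.99) + pctl(b, 0.99))  # naive "sum of P99s"
print(pctl(combined, 0.99))           # actual end-to-end P99
```

With independent services, the end-to-end P99 comes out well below the sum of the individual P99s, because both services rarely hit their tail on the same request. Correlated tails (e.g., both slow during peak hours) would push the combined P99 closer to the sum.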
Why Averages Lie
The arithmetic mean (average) is the most commonly reported latency statistic and also the most misleading. Averages are distorted by outliers and completely mask the shape of the latency distribution. To understand why, consider two hypothetical LLM API endpoints with identical average latencies:
| Metric | Endpoint A | Endpoint B |
|---|---|---|
| Average | 900 ms | 900 ms |
| P50 (Median) | 850 ms | 400 ms |
| P90 | 1,050 ms | 1,800 ms |
| P95 | 1,200 ms | 3,500 ms |
| P99 | 1,500 ms | 12,000 ms |
| Max | 2,100 ms | 45,000 ms |
Endpoint A has a tight, well-behaved distribution: most requests cluster near the average, and even the worst requests are only about 2x the median. Endpoint B has a bimodal distribution: most requests are fast (400 ms median), but a significant minority take 10x-100x longer, dragging the average up to 900 ms. If you only look at the average, these endpoints appear identical. In reality, Endpoint B delivers a much worse user experience because 5% of users wait over 3.5 seconds and 1% wait over 12 seconds.
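The effect is easy to reproduce. In this toy sketch (synthetic numbers chosen purely for illustration), two latency samples share an identical mean of 900 ms yet report wildly different P95s:

```python
import statistics

def pctl(xs, p):
    """Nearest-rank percentile for a small sample."""
    xs = sorted(xs)
    return xs[min(int(p * len(xs)), len(xs) - 1)]

a = [900] * 100                   # tight: every request takes ~900 ms
b = [400] * 95 + [10_400] * 5     # bimodal: fast majority, slow 5% tail

# Both endpoints report the same average latency...
assert statistics.mean(a) == statistics.mean(b) == 900
# ...but the tails tell a completely different story.
print(pctl(a, 0.95), pctl(b, 0.95))
```

Endpoint A's P95 equals its mean; Endpoint B's P95 is more than eleven times larger. An averages-only dashboard would show these two as identical.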
LLM APIs are particularly prone to bimodal and heavy-tailed latency distributions because of several factors:
- Variable output length: A request that generates 50 tokens finishes in 200 ms. A request to the same model that generates 2,000 tokens takes 8,000 ms. Both contribute to the same latency distribution.
- Provider-side batching: AI providers batch multiple requests onto the same GPU for throughput efficiency. When a batch includes one request with a very long output, all other requests in the batch may experience increased latency while waiting for the batch to complete.
- Queue wait time: During peak usage periods, requests may sit in a provider's queue for 500 ms to several seconds before GPU resources become available. This creates a bimodal pattern: requests that get immediate GPU access are fast; requests that queue are slow.
- Cold starts and model loading: Some providers dynamically scale their GPU fleet. The first request to a newly spun-up instance may incur a multi-second cold start penalty while model weights are loaded into GPU memory.
- Network variability: Cross-region API calls add 50-200 ms of network latency, and network jitter can add unpredictable variation.
The practical consequence is clear: never use averages as your primary latency metric for LLM APIs. Report P50 for the "typical" experience, P95 for your SLA target, and P99 to track tail behavior. When someone says "our average latency is 500 ms," the correct response is: "What is your P99?"
P95/P99 by Model and Provider
Latency characteristics vary enormously across models and providers. Smaller models on optimized infrastructure deliver sub-second P99 latencies, while frontier models processing large contexts can have P99 latencies exceeding 30 seconds. The following table shows representative P95 and P99 latencies observed in production workloads during Q1 2026, based on requests with 1,000 input tokens and 500 output tokens (a typical conversational query):
| Provider | Model | P50 (ms) | P95 (ms) | P99 (ms) | P99/P50 Ratio |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | 680 | 1,850 | 4,200 | 6.2x |
| OpenAI | GPT-4o mini | 320 | 780 | 1,600 | 5.0x |
| Anthropic | Claude 3.5 Sonnet | 750 | 2,100 | 5,800 | 7.7x |
| Anthropic | Claude 3.5 Haiku | 280 | 650 | 1,400 | 5.0x |
| Google | Gemini 2.0 Flash | 250 | 580 | 1,200 | 4.8x |
| Google | Gemini 1.5 Pro | 620 | 1,700 | 4,500 | 7.3x |
| Mistral | Mistral Large | 550 | 1,400 | 3,800 | 6.9x |
| Mistral | Mistral Small | 180 | 420 | 900 | 5.0x |
Key observations from this data:
- The P99/P50 ratio ranges from 4.8x to 7.7x, meaning the slowest 1% of requests take at least 5-8x longer than the median request. This ratio is significantly higher than for traditional web APIs, where P99/P50 is typically 2-4x.
- Smaller models have tighter distributions. GPT-4o mini, Claude 3.5 Haiku, Gemini Flash, and Mistral Small all have P99/P50 ratios around 5.0x, while their larger counterparts range from 6.2x to 7.7x. Larger models have more variable decode times and are more susceptible to batching delays.
- Provider infrastructure matters. Google's Gemini Flash has the lowest absolute P99 (1,200 ms) thanks to custom TPU hardware optimized for inference. Mistral Small on optimized inference endpoints achieves a 900 ms P99.
- These numbers scale with output length. For a request generating 2,000 tokens instead of 500, multiply all values by roughly 3-4x. A GPT-4o request generating 2,000 tokens can easily hit a P99 of 15,000-18,000 ms.
Input context size also affects latency, primarily through the prefill (time-to-first-token) phase. A request with 50,000 input tokens will have a TTFT 5-10x longer than a request with 1,000 input tokens, adding 500-2,000 ms to the total latency for frontier models. This is why CostHawk tracks both TTFT and total latency separately — a high P95 TTFT with a normal decode rate indicates a context-size problem, while a normal TTFT with a high P95 total latency indicates a long-output problem. Different root causes require different optimization strategies.
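The TTFT-versus-decode split can be sketched as a back-of-envelope model: TTFT grows with input tokens (prefill), while decode time grows with output tokens. The rates below (prefill_tps, decode_tps, overhead_ms) are illustrative assumptions, not measured figures for any provider.

```python
def estimated_latency_ms(input_tokens, output_tokens,
                         prefill_tps=5_000, decode_tps=80, overhead_ms=150):
    """Back-of-envelope LLM latency model (illustrative rates only).

    Returns (ttft_ms, total_ms): TTFT is fixed overhead plus prefill over
    all input tokens; decode adds one sequential step per output token.
    """
    ttft_ms = overhead_ms + input_tokens / prefill_tps * 1_000
    decode_ms = output_tokens / decode_tps * 1_000
    return ttft_ms, ttft_ms + decode_ms

# Context-size problem: huge input inflates TTFT, decode is unchanged.
print(estimated_latency_ms(50_000, 500))
# Long-output problem: normal TTFT, total latency dominated by decode.
print(estimated_latency_ms(1_000, 2_000))
```

Even this crude model shows why the two failure modes need separate metrics: the first request is slow before the first token arrives, the second is slow only after it.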
Setting SLAs
A Service Level Agreement (SLA) defines the performance guarantees your system commits to — both internally (engineering team to product team) and externally (your product to your customers). For LLM-powered features, latency SLAs must be expressed as percentile thresholds with clear measurement methodology, scope, and consequences for violation.
Choosing the right percentile for your SLA:
- P95 is the most common SLA metric for LLM API features. It balances operational achievability with user experience coverage — 95% of users get the guaranteed experience, and the 5% tail is monitored but not contractually bound.
- P99 SLAs are appropriate for critical paths — payment processing, safety-critical decisions, or premium-tier customers. They are harder to meet and require more infrastructure investment (redundancy, failover, request hedging).
- P50 SLAs are rarely used because they only guarantee the median experience. A system could have a 300 ms P50 and a 30-second P99, meeting a P50 SLA while still delivering an abysmal experience to the slowest 1% of its users.
How to define an LLM latency SLA:
- Segment by request profile. A single SLA for all requests is meaningless when a 100-token classification takes 200 ms and a 4,000-token generation takes 8 seconds. Define SLA tiers based on expected output length or task type. Example: "Tier 1 (classification/extraction, <200 output tokens): P95 < 800 ms. Tier 2 (generation, 200-1,000 output tokens): P95 < 3,000 ms. Tier 3 (long-form, >1,000 output tokens): P95 < 8,000 ms."
- Specify the measurement window. Is the SLA measured over 1 hour, 1 day, or 1 calendar month? Shorter windows detect regressions faster but are more susceptible to transient spikes. A common approach: measure over rolling 24-hour windows, alert on 1-hour windows, and report monthly for business reviews.
- Define what counts as a request. Does the SLA include retried requests? Requests that timeout? Requests during provider outages? Typically, provider-side outages are excluded (covered by the provider's own SLA), but application-level retries are included because they affect the user's perceived latency.
- Set a compliance target. "P95 < 2,000 ms, 99.5% of measurement windows compliant" means you can have 3.6 hours per month where P95 exceeds the target without violating the SLA. This provides headroom for maintenance windows and transient provider degradation.
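The compliance-headroom arithmetic in the last point is straightforward to sketch (assuming a 30-day month):

```python
HOURS_PER_MONTH = 30 * 24  # 720, assuming a 30-day month

def allowed_breach_hours(compliance_target):
    """Hours per month the P95 window may exceed target without an SLA violation."""
    return round(HOURS_PER_MONTH * (1 - compliance_target), 2)

print(allowed_breach_hours(0.995))  # 3.6 hours of headroom per month
```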
Example internal SLA for an AI-powered customer support chatbot:
| Metric | Target | Measurement | Action on Breach |
|---|---|---|---|
| TTFT P95 | < 600 ms | Rolling 1-hour window | Page on-call engineer |
| Total Latency P95 | < 3,000 ms | Rolling 1-hour window | Page on-call engineer |
| Total Latency P99 | < 8,000 ms | Rolling 24-hour window | Escalate to team lead |
| Error rate | < 0.5% | Rolling 1-hour window | Page on-call + failover |
CostHawk supports SLA monitoring by computing percentile latencies in real time across all requests flowing through wrapped keys. You can configure alerts that fire when P95 or P99 exceeds a threshold for a given model, project, or API key — turning your SLA from a document into an automated enforcement mechanism.
Reducing Tail Latency
Tail latency — the P95, P99, and beyond — is where user experience degrades and where the most impactful optimizations often live. Reducing tail latency for LLM APIs requires a different playbook than reducing median latency, because tail events are caused by specific, identifiable factors rather than baseline model speed. Here are seven proven strategies, ordered by typical impact:
1. Reduce output token count. The single largest driver of LLM latency is the number of tokens generated. Each output token requires a sequential decode step, so a request generating 2,000 tokens takes roughly 4x longer than one generating 500 tokens. The highest-latency requests in your distribution are almost always the ones generating the longest outputs. Set max_tokens to enforce a ceiling, and instruct the model to be concise. Cutting your max output from 2,000 to 800 tokens can reduce P99 by 50-60%.
2. Reduce input context size. Large input contexts increase time-to-first-token (TTFT) because the model must process all input tokens during the prefill phase. A request with 100,000 input tokens may have a TTFT of 3-5 seconds on frontier models, compared to 200-400 ms for a 1,000-token input. If your P95 TTFT is high, audit your context assembly pipeline: are you including full documents when summaries would suffice? Are you sending entire conversation histories when only the last 5 turns are relevant?
3. Implement request hedging. Send the same request to two provider endpoints simultaneously and use whichever responds first. This is the most effective technique for reducing P99 because it converts tail-latency events into a race where the fastest response wins. The cost is 2x token spend for hedged requests, but if you only hedge the slowest 5% (detected by a timeout threshold), the cost increase is manageable. For a request that times out after 3 seconds, fire a parallel request to an alternative model or region.
4. Choose the right model tier. If your P99 on GPT-4o is 4,200 ms and your product can tolerate the quality of GPT-4o mini (P99 of 1,600 ms), the fastest path to lower tail latency is model routing. Many teams implement a tiered system: send latency-sensitive requests to a fast model and quality-sensitive requests to a frontier model. This reduces P99 for the latency-sensitive tier by 60-70%.
5. Use streaming responses. Streaming does not reduce total latency (the last token arrives at the same time), but it dramatically improves perceived latency because the user sees the first token within the TTFT window (typically 200-800 ms) rather than waiting for the entire response. For interactive applications, streaming is effectively mandatory — a 5-second wait followed by a complete response feels much slower than seeing tokens appear after 300 ms.
6. Implement regional routing. If your users are globally distributed but your API calls all route to a single provider region, users far from that region add 100-300 ms of network round-trip time. Route requests to the nearest provider region (OpenAI and Anthropic both have multi-region deployments, and Google Cloud has Gemini endpoints in multiple regions). Network latency is additive and constant, so reducing it from 200 ms to 30 ms improves every percentile equally.
7. Add client-side timeouts with graceful degradation. Set a timeout at the percentile level where user experience becomes unacceptable (e.g., 5 seconds). When a request exceeds the timeout, cancel it and fall back to a cached response, a simpler model, or a non-AI code path. This does not reduce the actual tail latency but caps the user-perceived latency, turning P99 outliers into fast fallback responses. Track timeout rates in CostHawk to understand how often fallbacks trigger and whether the underlying latency problem is worsening.
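Strategies 3 and 7 can be combined into a single hedging pattern: wait on the primary model up to a hedge threshold, then race it against a faster fallback. The sketch below is illustrative only — call_model is a hypothetical stand-in for a real provider call, and the model names and simulated timings are made up.

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call — latency is simulated here."""
    await asyncio.sleep(1.0 if model == "frontier-model" else 0.15)
    return f"{model}: response to {prompt!r}"

async def hedged_call(prompt: str, hedge_after_s: float = 0.5) -> str:
    primary = asyncio.ensure_future(call_model("frontier-model", prompt))
    # Fast path: primary finishes within the hedge threshold.
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()
    # Slow path: fire a hedge to a faster model; first finisher wins.
    hedge = asyncio.ensure_future(call_model("fast-model", prompt))
    done, pending = await asyncio.wait({primary, hedge},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # cap spend: cancel the loser
    return done.pop().result()

print(asyncio.run(hedged_call("summarize this ticket")))
```

Because the hedge only fires for requests that have already blown past the threshold, the extra token spend is limited to the tail of the distribution rather than doubling every request.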
Percentile Monitoring
Monitoring percentile latencies in production requires purpose-built infrastructure because percentiles cannot be computed from pre-aggregated data — you need access to individual request-level measurements or statistically sound approximations. Here is how to build a robust percentile monitoring pipeline for LLM API calls.
Data collection: Every API request must record its total latency (and ideally TTFT, decode time, and queue wait time as separate fields). CostHawk captures this automatically for all requests routed through wrapped keys: each request log includes latency_ms, ttft_ms (when streaming), input_tokens, output_tokens, model, api_key_id, and timestamp. This per-request granularity is essential because percentiles must be computed over the raw data, not over averages or pre-bucketed histograms.
Computation methods:
- Exact computation (SQL): For datasets that fit in a single query (up to millions of rows with proper indexing), use PERCENTILE_CONT in PostgreSQL: SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95, PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99 FROM request_logs WHERE timestamp > NOW() - INTERVAL '1 hour'. This is the approach CostHawk uses for dashboard queries and SLA reporting.
- Approximate computation (streaming): For real-time alerting over high-throughput streams, use t-digest or DDSketch algorithms that maintain a compact data structure approximating the full distribution. These algorithms can compute percentiles with <1% error while using only a few kilobytes of memory, making them suitable for per-second percentile computation across thousands of API keys.
- Histogram buckets (Prometheus-style): Define latency buckets (e.g., 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s) and increment a counter for each request. Percentiles are estimated by interpolating between bucket boundaries. This approach is less accurate than t-digest but integrates naturally with Prometheus/Grafana monitoring stacks.
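The bucket-interpolation approach in the last bullet can be sketched in a few lines of Python (bucket boundaries assumed, matching the example bounds above):

```python
import bisect

# Upper bounds of latency buckets in ms (Prometheus-style histogram).
BUCKETS = [100, 250, 500, 1_000, 2_500, 5_000, 10_000, 30_000]

def observe(counts, latency_ms):
    """Increment the counter of the first bucket whose bound covers this latency."""
    i = bisect.bisect_left(BUCKETS, latency_ms)
    counts[min(i, len(BUCKETS) - 1)] += 1

def estimate_percentile(counts, p):
    """Estimate a percentile by linear interpolation inside the target bucket."""
    total = sum(counts)
    rank = p * total
    cum = 0
    for i, c in enumerate(counts):
        if cum + c >= rank and c > 0:
            lower = BUCKETS[i - 1] if i > 0 else 0
            upper = BUCKETS[i]
            # Assume observations are uniformly spread within the bucket.
            return lower + (upper - lower) * (rank - cum) / c
        cum += c
    return BUCKETS[-1]

counts = [0] * len(BUCKETS)
for ms in [120, 300, 300, 450, 700, 900, 1_200, 2_000, 4_000, 9_000]:
    observe(counts, ms)
print(estimate_percentile(counts, 0.95))
```

The uniform-within-bucket assumption is exactly where the accuracy loss comes from: with the coarse upper buckets above, a true P95 near 9,000 ms can be estimated thousands of milliseconds off, which is why bucket boundaries should be chosen to bracket your SLA thresholds tightly.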
Alerting on percentiles:
Configure alerts at two levels:
- Warning alert: P95 exceeds target over a 15-minute window. This catches gradual degradation (e.g., increasing context sizes, provider performance regression) early enough to investigate before users are heavily impacted.
- Critical alert: P99 exceeds 2x the target over a 5-minute window. This fires on sudden, severe latency spikes — provider outages, thundering-herd effects after a retry storm, or a code deployment that accidentally increased prompt sizes.
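The two rules above can be sketched as a simple evaluation function (the threshold values are illustrative, not CostHawk defaults):

```python
def evaluate_alerts(p95_15min_ms, p99_5min_ms,
                    p95_target_ms=2_000, p99_target_ms=8_000):
    """Map windowed percentile readings to alert levels per the two rules."""
    alerts = []
    # Warning: P95 exceeds its target over the 15-minute window.
    if p95_15min_ms > p95_target_ms:
        alerts.append("WARNING: P95 over target in 15-minute window")
    # Critical: P99 exceeds 2x its target over the 5-minute window.
    if p99_5min_ms > 2 * p99_target_ms:
        alerts.append("CRITICAL: P99 over 2x target in 5-minute window")
    return alerts

print(evaluate_alerts(p95_15min_ms=2_400, p99_5min_ms=17_000))
```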
Dashboards:
An effective latency dashboard shows:
- P50, P95, and P99 as three overlaid time-series lines, so you can see whether a latency increase affects all requests (P50 rises) or only the tail (P99 rises while P50 is stable).
- Latency broken down by model, so you can see if one model is disproportionately contributing to tail latency.
- A latency heatmap (time on x-axis, latency buckets on y-axis, color intensity for request count) that reveals the full distribution shape over time, exposing bimodal patterns that percentile lines alone might miss.
- Correlation with request volume, so you can distinguish demand-driven latency increases (more requests = more queuing) from infrastructure-driven ones (same volume, slower responses).
CostHawk's dashboard provides all of these views out of the box for any request flowing through wrapped keys or synced via the MCP server. Each chart is filterable by project, API key, model, and time range, so you can drill into exactly the slice of traffic that matters for your investigation.
Frequently Asked Questions
What is the difference between P95 and P99 latency?
Why do LLM APIs have higher tail latency than traditional REST APIs?
How do I measure P95 latency for my LLM API calls?
In SQL, run SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) FROM requests WHERE timestamp > NOW() - INTERVAL '1 hour' for the past hour's P95. In application code, you can maintain a sorted array or use a streaming algorithm like t-digest. For real-time monitoring, Prometheus histograms with bucket boundaries at 100ms, 250ms, 500ms, 1s, 2.5s, 5s, and 10s allow you to estimate percentiles by interpolation. CostHawk automates all of this: every request routed through a wrapped key has its latency recorded automatically, and the dashboard computes P50, P95, and P99 in real time. For self-hosted measurement, make sure you measure end-to-end latency (including network transit), not just server processing time, because network variability is a significant contributor to tail latency in cross-region API calls.
What P95 latency should I target for a user-facing LLM feature?
Does prompt caching affect P95/P99 latency?
How do streaming responses affect latency percentile measurement?
Can I compare P95 latency across different AI providers?
How does request volume affect P95 and P99 latency?
Related Terms
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Dashboards
Visual interfaces for monitoring AI cost, usage, and performance metrics in real-time. The command center for AI cost management — dashboards aggregate token spend, model utilization, latency, and budget health into a single pane of glass.
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
AI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
