P95 / P99 Latency
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
Why It Matters for AI Costs
Percentile latency metrics are the difference between believing your API integration is fast and knowing that your users experience it as fast. Consider a production deployment where your LLM-powered feature handles 50,000 requests per day:
- Average latency: 800 ms — looks great, well within a 2-second UX target.
- P50 latency: 620 ms — half your users get sub-second responses. Excellent.
- P95 latency: 2,800 ms — 2,500 users per day wait nearly 3 seconds. Noticeable.
- P99 latency: 8,200 ms — 500 users per day wait over 8 seconds. Some abandon the page.
If you only monitored the average, you would never know that 500 users per day are having a terrible experience. Worse, these tail-latency users are not randomly distributed. Provider-side queuing and GPU contention often correlate with peak traffic hours, meaning your highest-value users (those active during business hours) disproportionately encounter the slowest responses.
P95 and P99 are also the metrics that matter for SLA contracts. When an AI provider or an internal platform team commits to an SLA, it is almost always expressed as a percentile guarantee — for example, "P95 latency below 2,000 ms for requests under 4,000 input tokens." If you are not measuring percentiles, you cannot verify whether your provider is meeting their SLA or whether your own system is meeting the performance targets your product team has set.
From a cost perspective, tail latency often signals inefficiency. Requests that take 5x longer than the median may be hitting cold instances, waiting in provider queues due to rate limiting, or processing unnecessarily large context windows. Identifying and fixing these outliers frequently reduces both latency and cost simultaneously. CostHawk captures per-request latency alongside cost data, enabling you to correlate high-latency requests with specific models, prompt sizes, or time-of-day patterns and take targeted action.
What Are Percentile Latency Metrics?
Percentile latency metrics describe the distribution of response times across a population of requests by answering the question: "What is the maximum latency experienced by X% of requests?" They are computed by sorting all observed latency values and selecting the value at a specific position in the sorted array.
Formally, the Pth percentile of a dataset is the value below which P% of observations fall. For a dataset of N latency measurements sorted in ascending order:
- P50 (median): The value at position N × 0.50. Half of all requests are faster, half are slower. This is the "typical" experience.
- P75: The value at position N × 0.75. Three-quarters of requests complete within this time.
- P90: 90% of requests are at or below this threshold. Often used as a first-tier alert boundary.
- P95: The industry-standard SLA metric. Only 5% of requests exceed this value. At 100,000 requests per day, 5,000 requests are slower than your P95.
- P99: Captures the tail of the distribution. Only 1 in 100 requests is slower. At scale, this still represents thousands of real users per day.
- P99.9: The extreme tail. 1 in 1,000 requests. Relevant for high-traffic services where even 0.1% represents significant user impact.
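As an illustration of the position arithmetic above, here is a minimal Python sketch that computes a percentile from raw latency measurements using linear interpolation between adjacent observations (the same semantics as SQL's PERCENTILE_CONT):

```python
def percentile(values, p):
    """Linear-interpolation percentile over raw measurements.

    values: latency measurements in ms; p: fraction in [0, 1].
    """
    if not values:
        raise ValueError("no measurements")
    xs = sorted(values)
    # Fractional rank of the percentile within the sorted array (0-indexed).
    rank = p * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    frac = rank - lo
    # Interpolate between the two adjacent observations.
    return xs[lo] + (xs[hi] - xs[lo]) * frac

latencies_ms = [620, 640, 700, 780, 850, 900, 1_100, 1_600, 2_800, 8_200]
print(percentile(latencies_ms, 0.50))  # the "typical" experience
print(percentile(latencies_ms, 0.95))  # dominated by the tail values
```

Note how the P95 lands between the two slowest observations even though the median sits in the tight cluster: the tail values barely move the P50 but define the P95.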
In database systems, percentiles are typically computed using the PERCENTILE_CONT function, which performs linear interpolation between adjacent data points when the exact percentile position falls between two observations. For real-time monitoring, many observability platforms use approximate algorithms like t-digest or HDR histograms to compute percentiles over streaming data without storing every individual measurement.
For LLM API monitoring, the most operationally relevant percentiles are P50 (baseline performance), P95 (SLA target), and P99 (tail-latency investigation). CostHawk computes all three using PERCENTILE_CONT over raw request logs, providing exact values rather than approximations. This precision matters when you are holding a provider accountable to an SLA or diagnosing whether a latency regression is real or a statistical artifact.
A critical nuance: percentiles are not additive. If Service A has a P99 of 200 ms and Service B has a P99 of 300 ms, the P99 of a request that hits both services sequentially is not 500 ms. The combined P99 depends on the correlation structure between the two distributions. This makes end-to-end percentile monitoring (which CostHawk provides at the request level) more valuable than aggregating percentiles from individual components.
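A quick simulation makes the non-additivity concrete. The sketch below uses illustrative synthetic distributions (not real provider data): it sums two independent heavy-tailed latency samples and compares the combined P99 to the sum of the individual P99s.

```python
import random

def pctl(xs, p):
    """Nearest-rank percentile; good enough for a simulation."""
    xs = sorted(xs)
    return xs[min(int(p * len(xs)), len(xs) - 1)]

random.seed(42)
# Two independent heavy-tailed (log-normal) latency distributions, in ms.
a = [random.lognormvariate(5.0, 0.6) for _ in range(100_000)]
b = [random.lognormvariate(5.4, 0.6) for _ in range(100_000)]
# A request that hits both services sequentially pays both latencies.
combined = [x + y for x, y in zip(a, b)]

print(pctl(a, 0.99) + pctl(b, 0.99))  # naive "sum of P99s"
print(pctl(combined, 0.99))           # actual end-to-end P99
```

With independent services, the end-to-end P99 comes out well below the sum of the individual P99s, because both services rarely hit their tail on the same request. Correlated tails (e.g., both slow during peak hours) would push the combined P99 closer to the sum.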
Why Averages Lie
The arithmetic mean (average) is the most commonly reported latency statistic and also the most misleading. Averages are distorted by outliers and completely mask the shape of the latency distribution. To understand why, consider two hypothetical LLM API endpoints with identical average latencies:
| Metric | Endpoint A | Endpoint B |
|---|---|---|
| Average | 900 ms | 900 ms |
| P50 (Median) | 850 ms | 400 ms |
| P90 | 1,050 ms | 1,800 ms |
| P95 | 1,200 ms | 3,500 ms |
| P99 | 1,500 ms | 12,000 ms |
| Max | 2,100 ms | 45,000 ms |
Endpoint A has a tight, well-behaved distribution: most requests cluster near the average, and even the worst requests are only about 2x the median. Endpoint B has a bimodal distribution: most requests are fast (400 ms median), but a significant minority take 10x-100x longer, dragging the average up to 900 ms. If you only look at the average, these endpoints appear identical. In reality, Endpoint B delivers a much worse user experience because 5% of users wait over 3.5 seconds and 1% wait over 12 seconds.
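The effect is easy to reproduce. In this toy sketch (synthetic numbers chosen purely for illustration), two latency samples share an identical mean of 900 ms yet report wildly different P95s:

```python
import statistics

def pctl(xs, p):
    """Nearest-rank percentile for a small sample."""
    xs = sorted(xs)
    return xs[min(int(p * len(xs)), len(xs) - 1)]

a = [900] * 100                   # tight: every request takes ~900 ms
b = [400] * 95 + [10_400] * 5     # bimodal: fast majority, slow 5% tail

# Both endpoints report the same average latency...
assert statistics.mean(a) == statistics.mean(b) == 900
# ...but the tails tell a completely different story.
print(pctl(a, 0.95), pctl(b, 0.95))
```

Endpoint A's P95 equals its mean; Endpoint B's P95 is more than eleven times larger. An averages-only dashboard would show these two as identical.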
LLM APIs are particularly prone to bimodal and heavy-tailed latency distributions because of several factors:
- Variable output length: A request that generates 50 tokens finishes in 200 ms. A request to the same model that generates 2,000 tokens takes 8,000 ms. Both contribute to the same latency distribution.
- Provider-side batching: AI providers batch multiple requests onto the same GPU for throughput efficiency. When a batch includes one request with a very long output, all other requests in the batch may experience increased latency while waiting for the batch to complete.
- Queue wait time: During peak usage periods, requests may sit in a provider's queue for 500 ms to several seconds before GPU resources become available. This creates a bimodal pattern: requests that get immediate GPU access are fast; requests that queue are slow.
- Cold starts and model loading: Some providers dynamically scale their GPU fleet. The first request to a newly spun-up instance may incur a multi-second cold start penalty while model weights are loaded into GPU memory.
- Network variability: Cross-region API calls add 50-200 ms of network latency, and network jitter can add unpredictable variation.
The practical consequence is clear: never use averages as your primary latency metric for LLM APIs. Report P50 for the "typical" experience, P95 for your SLA target, and P99 to track tail behavior. When someone says "our average latency is 500 ms," the correct response is: "What is your P99?"
P95/P99 by Model and Provider
Latency characteristics vary enormously across models and providers. Smaller models on optimized infrastructure deliver sub-second P99 latencies, while frontier models processing large contexts can have P99 latencies exceeding 30 seconds. The following table shows representative P95 and P99 latencies observed in production workloads during Q1 2026, based on requests with 1,000 input tokens and 500 output tokens (a typical conversational query):
| Provider | Model | P50 (ms) | P95 (ms) | P99 (ms) | P99/P50 Ratio |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | 680 | 1,850 | 4,200 | 6.2x |
| OpenAI | GPT-4o mini | 320 | 780 | 1,600 | 5.0x |
| Anthropic | Claude 3.5 Sonnet | 750 | 2,100 | 5,800 | 7.7x |
| Anthropic | Claude 3.5 Haiku | 280 | 650 | 1,400 | 5.0x |
| Google | Gemini 2.0 Flash | 250 | 580 | 1,200 | 4.8x |
| Google | Gemini 1.5 Pro | 620 | 1,700 | 4,500 | 7.3x |
| Mistral | Mistral Large | 550 | 1,400 | 3,800 | 6.9x |
| Mistral | Mistral Small | 180 | 420 | 900 | 5.0x |
Key observations from this data:
- The P99/P50 ratio ranges from 4.8x to 7.7x, meaning the slowest 1% of requests take at least 5-8x longer than the median request. This ratio is significantly higher than for traditional web APIs, where P99/P50 is typically 2-4x.
- Smaller models have tighter distributions. GPT-4o mini, Claude 3.5 Haiku, Gemini Flash, and Mistral Small all have P99/P50 ratios around 5.0x, while their larger counterparts range from 6.2x to 7.7x. Larger models have more variable decode times and are more susceptible to batching delays.
- Provider infrastructure matters. Google's Gemini Flash has the lowest absolute P99 (1,200 ms) thanks to custom TPU hardware optimized for inference. Mistral Small on optimized inference endpoints achieves a 900 ms P99.
- These numbers scale with output length. For a request generating 2,000 tokens instead of 500, multiply all values by roughly 3-4x. A GPT-4o request generating 2,000 tokens can easily hit a P99 of 15,000-18,000 ms.
Input context size also affects latency, primarily through the prefill (time-to-first-token) phase. A request with 50,000 input tokens will have a TTFT 5-10x longer than a request with 1,000 input tokens, adding 500-2,000 ms to the total latency for frontier models. This is why CostHawk tracks both TTFT and total latency separately — a high P95 TTFT with a normal decode rate indicates a context-size problem, while a normal TTFT with a high P95 total latency indicates a long-output problem. Different root causes require different optimization strategies.
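The TTFT-versus-decode split can be sketched as a back-of-envelope model: TTFT grows with input tokens (prefill), while decode time grows with output tokens. The rates below (prefill_tps, decode_tps, overhead_ms) are illustrative assumptions, not measured figures for any provider.

```python
def estimated_latency_ms(input_tokens, output_tokens,
                         prefill_tps=5_000, decode_tps=80, overhead_ms=150):
    """Back-of-envelope LLM latency model (illustrative rates only).

    Returns (ttft_ms, total_ms): TTFT is fixed overhead plus prefill over
    all input tokens; decode adds one sequential step per output token.
    """
    ttft_ms = overhead_ms + input_tokens / prefill_tps * 1_000
    decode_ms = output_tokens / decode_tps * 1_000
    return ttft_ms, ttft_ms + decode_ms

# Context-size problem: huge input inflates TTFT, decode is unchanged.
print(estimated_latency_ms(50_000, 500))
# Long-output problem: normal TTFT, total latency dominated by decode.
print(estimated_latency_ms(1_000, 2_000))
```

Even this crude model shows why the two failure modes need separate metrics: the first request is slow before the first token arrives, the second is slow only after it.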
Setting SLAs
A Service Level Agreement (SLA) defines the performance guarantees your system commits to — both internally (engineering team to product team) and externally (your product to your customers). For LLM-powered features, latency SLAs must be expressed as percentile thresholds with clear measurement methodology, scope, and consequences for violation.
Choosing the right percentile for your SLA:
- P95 is the most common SLA metric for LLM API features. It balances operational achievability with user experience coverage — 95% of users get the guaranteed experience, and the 5% tail is monitored but not contractually bound.
- P99 SLAs are appropriate for critical paths — payment processing, safety-critical decisions, or premium-tier customers. They are harder to meet and require more infrastructure investment (redundancy, failover, request hedging).
- P50 SLAs are rarely used because they only guarantee the median experience. A system could have a 300 ms P50 and a 30-second P99, meeting a P50 SLA while still delivering an abysmal experience to the slowest 1% of its users.
How to define an LLM latency SLA:
- Segment by request profile. A single SLA for all requests is meaningless when a 100-token classification takes 200 ms and a 4,000-token generation takes 8 seconds. Define SLA tiers based on expected output length or task type. Example: "Tier 1 (classification/extraction, <200 output tokens): P95 < 800 ms. Tier 2 (generation, 200-1,000 output tokens): P95 < 3,000 ms. Tier 3 (long-form, >1,000 output tokens): P95 < 8,000 ms."
- Specify the measurement window. Is the SLA measured over 1 hour, 1 day, or 1 calendar month? Shorter windows detect regressions faster but are more susceptible to transient spikes. A common approach: measure over rolling 24-hour windows, alert on 1-hour windows, and report monthly for business reviews.
- Define what counts as a request. Does the SLA include retried requests? Requests that timeout? Requests during provider outages? Typically, provider-side outages are excluded (covered by the provider's own SLA), but application-level retries are included because they affect the user's perceived latency.
- Set a compliance target. "P95 < 2,000 ms, 99.5% of measurement windows compliant" means you can have 3.6 hours per month where P95 exceeds the target without violating the SLA. This provides headroom for maintenance windows and transient provider degradation.
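The compliance-headroom arithmetic in the last point is straightforward to sketch (assuming a 30-day month):

```python
HOURS_PER_MONTH = 30 * 24  # 720, assuming a 30-day month

def allowed_breach_hours(compliance_target):
    """Hours per month the P95 window may exceed target without an SLA violation."""
    return round(HOURS_PER_MONTH * (1 - compliance_target), 2)

print(allowed_breach_hours(0.995))  # 3.6 hours of headroom per month
```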
Example internal SLA for an AI-powered customer support chatbot:
| Metric | Target | Measurement | Action on Breach |
|---|---|---|---|
| TTFT P95 | < 600 ms | Rolling 1-hour window | Page on-call engineer |
| Total Latency P95 | < 3,000 ms | Rolling 1-hour window | Page on-call engineer |
| Total Latency P99 | < 8,000 ms | Rolling 24-hour window | Escalate to team lead |
| Error rate | < 0.5% | Rolling 1-hour window | Page on-call + failover |
CostHawk supports SLA monitoring by computing percentile latencies in real time across all requests flowing through wrapped keys. You can configure alerts that fire when P95 or P99 exceeds a threshold for a given model, project, or API key — turning your SLA from a document into an automated enforcement mechanism.
Reducing Tail Latency
Tail latency — the P95, P99, and beyond — is where user experience degrades and where the most impactful optimizations often live. Reducing tail latency for LLM APIs requires a different playbook than reducing median latency, because tail events are caused by specific, identifiable factors rather than baseline model speed. Here are seven proven strategies, ordered by typical impact:
1. Reduce output token count. The single largest driver of LLM latency is the number of tokens generated. Each output token requires a sequential decode step, so a request generating 2,000 tokens takes roughly 4x longer than one generating 500 tokens. The highest-latency requests in your distribution are almost always the ones generating the longest outputs. Set max_tokens to enforce a ceiling, and instruct the model to be concise. Cutting your max output from 2,000 to 800 tokens can reduce P99 by 50-60%.
2. Reduce input context size. Large input contexts increase time-to-first-token (TTFT) because the model must process all input tokens during the prefill phase. A request with 100,000 input tokens may have a TTFT of 3-5 seconds on frontier models, compared to 200-400 ms for a 1,000-token input. If your P95 TTFT is high, audit your context assembly pipeline: are you including full documents when summaries would suffice? Are you sending entire conversation histories when only the last 5 turns are relevant?
3. Implement request hedging. Send the same request to two provider endpoints simultaneously and use whichever responds first. This is the most effective technique for reducing P99 because it converts tail-latency events into a race where the fastest response wins. The cost is 2x token spend for hedged requests, but if you only hedge the slowest 5% (detected by a timeout threshold), the cost increase is manageable. For a request that times out after 3 seconds, fire a parallel request to an alternative model or region.
4. Choose the right model tier. If your P99 on GPT-4o is 4,200 ms and your product can tolerate the quality of GPT-4o mini (P99 of 1,600 ms), the fastest path to lower tail latency is model routing. Many teams implement a tiered system: send latency-sensitive requests to a fast model and quality-sensitive requests to a frontier model. This reduces P99 for the latency-sensitive tier by 60-70%.
5. Use streaming responses. Streaming does not reduce total latency (the last token arrives at the same time), but it dramatically improves perceived latency because the user sees the first token within the TTFT window (typically 200-800 ms) rather than waiting for the entire response. For interactive applications, streaming is effectively mandatory — a 5-second wait followed by a complete response feels much slower than seeing tokens appear after 300 ms.
6. Implement regional routing. If your users are globally distributed but your API calls all route to a single provider region, users far from that region add 100-300 ms of network round-trip time. Route requests to the nearest provider region (OpenAI and Anthropic both have multi-region deployments, and Google Cloud has Gemini endpoints in multiple regions). Network latency is additive and constant, so reducing it from 200 ms to 30 ms improves every percentile equally.
7. Add client-side timeouts with graceful degradation. Set a timeout at the percentile level where user experience becomes unacceptable (e.g., 5 seconds). When a request exceeds the timeout, cancel it and fall back to a cached response, a simpler model, or a non-AI code path. This does not reduce the actual tail latency but caps the user-perceived latency, turning P99 outliers into fast fallback responses. Track timeout rates in CostHawk to understand how often fallbacks trigger and whether the underlying latency problem is worsening.
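Strategies 3 and 7 can be combined into a single hedging pattern: wait on the primary model up to a hedge threshold, then race it against a faster fallback. The sketch below is illustrative only — call_model is a hypothetical stand-in for a real provider call, and the model names and simulated timings are made up.

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call — latency is simulated here."""
    await asyncio.sleep(1.0 if model == "frontier-model" else 0.15)
    return f"{model}: response to {prompt!r}"

async def hedged_call(prompt: str, hedge_after_s: float = 0.5) -> str:
    primary = asyncio.ensure_future(call_model("frontier-model", prompt))
    # Fast path: primary finishes within the hedge threshold.
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()
    # Slow path: fire a hedge to a faster model; first finisher wins.
    hedge = asyncio.ensure_future(call_model("fast-model", prompt))
    done, pending = await asyncio.wait({primary, hedge},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # cap spend: cancel the loser
    return done.pop().result()

print(asyncio.run(hedged_call("summarize this ticket")))
```

Because the hedge only fires for requests that have already blown past the threshold, the extra token spend is limited to the tail of the distribution rather than doubling every request.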
Percentile Monitoring
Monitoring percentile latencies in production requires purpose-built infrastructure because percentiles cannot be computed from pre-aggregated data — you need access to individual request-level measurements or statistically sound approximations. Here is how to build a robust percentile monitoring pipeline for LLM API calls.
Data collection: Every API request must record its total latency (and ideally TTFT, decode time, and queue wait time as separate fields). CostHawk captures this automatically for all requests routed through wrapped keys: each request log includes latency_ms, ttft_ms (when streaming), input_tokens, output_tokens, model, api_key_id, and timestamp. This per-request granularity is essential because percentiles must be computed over the raw data, not over averages or pre-bucketed histograms.
Computation methods:
- Exact computation (SQL): For datasets that fit in a single query (up to millions of rows with proper indexing), use PERCENTILE_CONT in PostgreSQL: SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95, PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99 FROM request_logs WHERE timestamp > NOW() - INTERVAL '1 hour'. This is the approach CostHawk uses for dashboard queries and SLA reporting.
- Approximate computation (streaming): For real-time alerting over high-throughput streams, use t-digest or DDSketch algorithms that maintain a compact data structure approximating the full distribution. These algorithms can compute percentiles with <1% error while using only a few kilobytes of memory, making them suitable for per-second percentile computation across thousands of API keys.
- Histogram buckets (Prometheus-style): Define latency buckets (e.g., 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s, 30s) and increment a counter for each request. Percentiles are estimated by interpolating between bucket boundaries. This approach is less accurate than t-digest but integrates naturally with Prometheus/Grafana monitoring stacks.
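The bucket-interpolation approach in the last bullet can be sketched in a few lines of Python (bucket boundaries assumed, matching the example bounds above):

```python
import bisect

# Upper bounds of latency buckets in ms (Prometheus-style histogram).
BUCKETS = [100, 250, 500, 1_000, 2_500, 5_000, 10_000, 30_000]

def observe(counts, latency_ms):
    """Increment the counter of the first bucket whose bound covers this latency."""
    i = bisect.bisect_left(BUCKETS, latency_ms)
    counts[min(i, len(BUCKETS) - 1)] += 1

def estimate_percentile(counts, p):
    """Estimate a percentile by linear interpolation inside the target bucket."""
    total = sum(counts)
    rank = p * total
    cum = 0
    for i, c in enumerate(counts):
        if cum + c >= rank and c > 0:
            lower = BUCKETS[i - 1] if i > 0 else 0
            upper = BUCKETS[i]
            # Assume observations are uniformly spread within the bucket.
            return lower + (upper - lower) * (rank - cum) / c
        cum += c
    return BUCKETS[-1]

counts = [0] * len(BUCKETS)
for ms in [120, 300, 300, 450, 700, 900, 1_200, 2_000, 4_000, 9_000]:
    observe(counts, ms)
print(estimate_percentile(counts, 0.95))
```

The uniform-within-bucket assumption is exactly where the accuracy loss comes from: with the coarse upper buckets above, a true P95 near 9,000 ms can be estimated thousands of milliseconds off, which is why bucket boundaries should be chosen to bracket your SLA thresholds tightly.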
Alerting on percentiles:
Configure alerts at two levels:
- Warning alert: P95 exceeds target over a 15-minute window. This catches gradual degradation (e.g., increasing context sizes, provider performance regression) early enough to investigate before users are heavily impacted.
- Critical alert: P99 exceeds 2x the target over a 5-minute window. This fires on sudden, severe latency spikes — provider outages, thundering-herd effects after a retry storm, or a code deployment that accidentally increased prompt sizes.
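The two rules above can be sketched as a simple evaluation function (the threshold values are illustrative, not CostHawk defaults):

```python
def evaluate_alerts(p95_15min_ms, p99_5min_ms,
                    p95_target_ms=2_000, p99_target_ms=8_000):
    """Map windowed percentile readings to alert levels per the two rules."""
    alerts = []
    # Warning: P95 exceeds its target over the 15-minute window.
    if p95_15min_ms > p95_target_ms:
        alerts.append("WARNING: P95 over target in 15-minute window")
    # Critical: P99 exceeds 2x its target over the 5-minute window.
    if p99_5min_ms > 2 * p99_target_ms:
        alerts.append("CRITICAL: P99 over 2x target in 5-minute window")
    return alerts

print(evaluate_alerts(p95_15min_ms=2_400, p99_5min_ms=17_000))
```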
Dashboards:
An effective latency dashboard shows:
- P50, P95, and P99 as three overlaid time-series lines, so you can see whether a latency increase affects all requests (P50 rises) or only the tail (P99 rises while P50 is stable).
- Latency broken down by model, so you can see if one model is disproportionately contributing to tail latency.
- A latency heatmap (time on x-axis, latency buckets on y-axis, color intensity for request count) that reveals the full distribution shape over time, exposing bimodal patterns that percentile lines alone might miss.
- Correlation with request volume, so you can distinguish demand-driven latency increases (more requests = more queuing) from infrastructure-driven ones (same volume, slower responses).
CostHawk's dashboard provides all of these views out of the box for any request flowing through wrapped keys or synced via the MCP server. Each chart is filterable by project, API key, model, and time range, so you can drill into exactly the slice of traffic that matters for your investigation.
Frequently Asked Questions
What is the difference between P95 and P99 latency?
Why do LLM APIs have higher tail latency than traditional REST APIs?
How do I measure P95 latency for my LLM API calls?
In SQL, run SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) FROM requests WHERE timestamp > NOW() - INTERVAL '1 hour' for the past hour's P95. In application code, you can maintain a sorted array or use a streaming algorithm like t-digest. For real-time monitoring, Prometheus histograms with bucket boundaries at 100ms, 250ms, 500ms, 1s, 2.5s, 5s, and 10s allow you to estimate percentiles by interpolation. CostHawk automates all of this: every request routed through a wrapped key has its latency recorded automatically, and the dashboard computes P50, P95, and P99 in real time. For self-hosted measurement, make sure you measure end-to-end latency (including network transit), not just server processing time, because network variability is a significant contributor to tail latency in cross-region API calls.
What P95 latency should I target for a user-facing LLM feature?
Does prompt caching affect P95/P99 latency?
How do streaming responses affect latency percentile measurement?
Can I compare P95 latency across different AI providers?
How does request volume affect P95 and P99 latency?
Related Terms
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Dashboards
Visual interfaces for monitoring AI cost, usage, and performance metrics in real-time. The command center for AI cost management — dashboards aggregate token spend, model utilization, latency, and budget health into a single pane of glass.
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
AI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
