Glossary › Observability · Updated 2026-03-16

Latency

The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency is tightly coupled to cost: the most capable models are typically both the most expensive and the slowest, and guaranteed low latency through provisioned throughput comes at a premium.

Definition

What is Latency?

Latency in the context of LLM APIs is the total time elapsed from the moment your application sends a request to the moment the complete response is received. Unlike traditional API latency, which is a single number (response time), LLM latency has a unique two-phase structure dictated by the transformer architecture. The first phase is the prefill, where the model processes all input tokens in parallel to build an internal representation. The second phase is autoregressive decoding, where the model generates output tokens one at a time, each conditioned on all previous tokens. This two-phase structure creates two distinct latency components that matter to users: time-to-first-token (TTFT), which determines how quickly the user sees the response start, and generation time, which determines how long it takes to produce the full output. For a streaming UI, TTFT drives the perception of responsiveness — a 200ms TTFT feels instant even if the total response takes 5 seconds. For a non-streaming API call, only the total end-to-end latency matters. Understanding and monitoring both components is essential because they have different causes, different optimization levers, and different cost implications.

Impact

Why It Matters for AI Costs

Latency is the primary determinant of user experience in AI-powered applications, and it trades off directly against cost in ways that catch teams off guard. Consider the latency profiles of models at different price points (March 2026 median values from production workloads):

Model             | Median TTFT | Generation Speed | Input Cost/MTok | Output Cost/MTok
GPT-4o mini       | ~280ms      | ~120 tok/s       | $0.15           | $0.60
Gemini 2.0 Flash  | ~200ms      | ~150 tok/s       | $0.10           | $0.40
GPT-4o            | ~450ms      | ~90 tok/s        | $2.50           | $10.00
Claude 3.5 Sonnet | ~520ms      | ~80 tok/s        | $3.00           | $15.00
Claude 3 Opus     | ~1,200ms    | ~40 tok/s        | $15.00          | $75.00

The pattern is clear: more capable (and more expensive) models are also slower. By the figures above, Claude 3 Opus costs roughly 150x more per input token than Gemini 2.0 Flash (and nearly 190x more per output token) while generating tokens 3-4x more slowly. This creates a compounding problem: if you choose a frontier model for quality, you pay more per token and degrade the user experience with higher latency.

The cost implications of latency go beyond per-token pricing:

  • User abandonment. Research consistently shows that AI chat interfaces with TTFT above 3 seconds see 25-40% higher abandonment rates. Each abandoned request still costs you tokens — the model processes the input and may have started generating output before the user navigates away.
  • Timeout-driven retries. Applications with aggressive timeout settings may kill and retry slow requests, doubling or tripling the token cost for a single user action. If your timeout is 10 seconds and 5% of Claude Opus requests exceed that threshold, you are retrying the most expensive requests — compounding the cost.
  • Throughput bottlenecks. In synchronous architectures, slow responses occupy server threads longer. A 5-second response ties up a connection 5x longer than a 1-second response, reducing your effective throughput and potentially requiring more infrastructure to handle the same traffic.

CostHawk tracks latency alongside cost for every request, making it possible to analyze the latency-cost tradeoff with real production data. You can identify requests where you are paying for frontier-model latency on tasks that a faster, cheaper model could handle equally well.

Understanding LLM Latency

LLM latency is fundamentally different from the latency of traditional web APIs, and understanding this difference is essential for building performant AI applications. When you call a conventional REST API, the server receives your request, processes it (usually in milliseconds to low hundreds of milliseconds), and returns the complete response. The latency is primarily determined by network round-trip time, server processing time, and database query time.

LLM APIs break this model in two important ways. First, the processing time is orders of magnitude longer — a single LLM inference can take 1-30 seconds depending on the model, input size, and output length. Second, the response is generated incrementally, one token at a time, which enables streaming — a delivery pattern where tokens are sent to the client as they are produced rather than waiting for the complete response.

Streaming fundamentally changes how users perceive latency. A request that takes 8 seconds to complete feels slow if the user stares at a loading spinner for 8 seconds. But if the first token appears after 400ms and text streams in at 80 tokens per second, the same 8-second request feels responsive — the user is reading as the response appears, and the perceived wait is only the 400ms TTFT.

This is why LLM latency must be understood as a composite metric, not a single number. The three components are:

  • Network latency: The time for your request to reach the provider's servers and for the first response bytes to return. Typically 20-100ms for well-connected data centers, 50-300ms for global traffic. This is the same as traditional API latency and is optimized through geographic proximity and connection pooling.
  • Time-to-first-token (TTFT): The time from when the provider receives your request to when the first output token is generated. This includes the prefill phase (processing all input tokens) and any queuing delay at the provider. TTFT scales with input length — a 10,000-token input takes longer to prefill than a 500-token input. Typical ranges: 100ms-3s depending on model, input size, and provider load.
  • Generation time: The time to produce all output tokens after the first one. This is determined by the generation speed (tokens per second) and the total number of output tokens. At 80 tok/s, generating 400 tokens takes 5 seconds. Generation speed is primarily determined by the model architecture and the provider's GPU allocation.

The total end-to-end latency is: Network latency + TTFT + (Output tokens / Generation speed). This decomposition is critical for optimization because each component has different levers: network latency is optimized by infrastructure, TTFT is optimized by reducing input length and using prompt caching, and generation time is optimized by capping output length and choosing faster models.
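
As a back-of-the-envelope illustration of this decomposition, here is a minimal sketch; the network, TTFT, and generation-speed figures in the example call are illustrative assumptions, not measurements:

```python
def estimate_latency_seconds(network_ms: float, ttft_ms: float,
                             output_tokens: int, gen_speed_tok_per_s: float) -> float:
    """End-to-end latency = network + TTFT + (output tokens / generation speed)."""
    return (network_ms + ttft_ms) / 1000 + output_tokens / gen_speed_tok_per_s

# Illustrative mid-tier model: 60ms network, 450ms TTFT, 400 tokens at 80 tok/s
print(estimate_latency_seconds(60, 450, 400, 80))  # ~5.5 seconds
```

Plugging in your own measured TTFT and generation speed for each model gives a quick sanity check on whether a given output length can fit inside your latency budget.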

Latency Components

Each component of LLM latency has distinct causes, measurement approaches, and optimization strategies. Understanding them individually is essential for targeted performance improvement.

1. Queuing Delay

Before your request even begins processing, it may sit in a queue at the provider's infrastructure. During peak hours or traffic spikes, queuing delays can add hundreds of milliseconds to seconds. Queuing delay is entirely outside your control for API-hosted models, but you can mitigate its impact through provider diversification (failover to a different provider when one is congested) and off-peak scheduling for batch workloads. CostHawk's latency tracking helps you identify time-of-day patterns in queuing delay so you can shift non-urgent workloads to lower-traffic periods.

2. Prefill Time (Input Processing)

The prefill phase processes all input tokens in parallel using GPU matrix multiplications. Prefill time scales roughly linearly with input token count, though the relationship is not perfectly linear due to attention mechanism complexity (which scales quadratically with sequence length for standard attention, though most modern models use efficient attention variants). Practical prefill times for common scenarios:

  • 500-token input on GPT-4o: ~100-150ms
  • 5,000-token input on GPT-4o: ~300-500ms
  • 50,000-token input on GPT-4o: ~1,500-3,000ms
  • 128,000-token input on Claude 3.5 Sonnet: ~3,000-8,000ms

The key insight is that long context inputs significantly increase TTFT. If your application sends 50,000 tokens of context with every request, you are adding 1-3 seconds to TTFT before a single output token is generated. Prompt caching dramatically reduces prefill time for repeated prefixes — both OpenAI and Anthropic cache the KV-cache from the prefill phase, so subsequent requests with the same prefix skip most of the prefill computation. Anthropic's prompt caching can reduce TTFT by 80-90% for cached prefixes.

3. Decode Time (Output Generation)

The decode phase generates output tokens sequentially. Each token requires a forward pass through the model that reads from the KV-cache and produces a probability distribution over the vocabulary. Decode speed (tokens per second) is primarily determined by the model's size and the provider's GPU allocation. Larger models generate tokens more slowly because each forward pass involves more computation. Current decode speeds for popular models range from roughly 40 tok/s (Claude 3 Opus) to 150-160 tok/s (Gemini 2.0 Flash). Decode time is: Output tokens / Decode speed. A 500-token response at 80 tok/s takes 6.25 seconds of generation time.

4. Network Transfer

For streaming responses, network latency manifests as a constant offset on TTFT plus minimal per-chunk overhead. For non-streaming responses, the entire response must transit the network, adding latency proportional to response size (though LLM responses are text and therefore small relative to bandwidth). Network optimization is standard web engineering: use persistent connections (HTTP/2), minimize geographic distance to the provider, and avoid unnecessary proxies that add per-hop latency. If you route through an observability proxy (like CostHawk's), choose one with edge presence near your servers to minimize the additional hop.

Latency vs Cost Tradeoffs

Latency and cost are fundamentally coupled in LLM applications, and optimizing one often comes at the expense of the other. Understanding these tradeoffs enables informed decision-making rather than accidental overspending or underperformance.

Tradeoff 1: Model selection (faster = cheaper, but less capable)

Smaller, cheaper models are consistently faster than larger, more expensive ones. GPT-4o mini generates tokens at ~120 tok/s for $0.60/MTok output; Claude 3 Opus generates at ~40 tok/s for $75.00/MTok output. This means you pay 125x more for a model that is 3x slower. For tasks where the cheaper model produces acceptable quality, the latency-cost tradeoff is entirely one-directional: switching to a cheaper model simultaneously improves latency and reduces cost. The tradeoff only becomes a real tension when the cheaper model's quality is noticeably worse, forcing you to choose between user experience (fast but lower quality) and output quality (slow but better).

Tradeoff 2: Prompt caching (initial cost for ongoing latency savings)

Prompt caching reduces TTFT by 80-90% for repeated prefixes but has setup requirements. Anthropic charges a small write fee for the initial cache creation and gives a 90% discount on subsequent reads. OpenAI's caching gives a 50% discount on cached input tokens. The tradeoff: you pay full price (plus cache creation overhead) on the first request, then save on every subsequent request. For high-volume endpoints with stable system prompts, caching pays for itself within minutes. For low-volume or frequently-changing prompts, the cache hit rate may be too low to justify the complexity.
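
To see how quickly caching pays for itself, here is a minimal sketch assuming a ~25% cache-write premium (Anthropic's commonly cited figure at the time of writing; verify current pricing) and the read discount described above:

```python
import math

def breakeven_cache_hits(write_premium: float, read_discount: float) -> int:
    """Cache hits needed before caching a prefix is cheaper than paying
    full price for that prefix on every request."""
    # First request: (1 + write_premium) per cached token instead of 1.0.
    # Each later hit: (1 - read_discount) per cached token instead of 1.0.
    return math.ceil(write_premium / read_discount)

print(breakeven_cache_hits(0.25, 0.90))  # 1 -> a single hit recoups the write premium
```

Under these assumptions the write premium is recovered on the very first cache hit, which is why high-volume endpoints break even within minutes.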

Tradeoff 3: Output length (shorter = faster and cheaper)

Reducing output length directly reduces both generation time and output token cost. Setting max_tokens=200 instead of allowing unbounded generation caps both the latency and cost of a request. The tradeoff is obvious: shorter outputs may omit useful information. The key is to match output length to the actual need. A classification endpoint that returns {"category": "billing"} should cap output at 50 tokens, not 4,096. A creative writing endpoint may legitimately need 1,000+ tokens. CostHawk's per-endpoint token distribution analysis shows you the actual output token distributions so you can set appropriate caps without truncating legitimate responses.

Tradeoff 4: Provisioned throughput (pay more for guaranteed latency)

Both OpenAI and Anthropic offer provisioned throughput options that guarantee faster response times in exchange for committed spend. Provisioned Throughput Units (PTUs), the Azure OpenAI variant of this model, provide dedicated GPU capacity with guaranteed tokens-per-minute throughput and lower, more consistent latency. Anthropic offers similar reserved capacity for enterprise customers. The tradeoff is pure economics: you pay a fixed monthly commitment (typically 30-50% more than on-demand rates at equivalent volume) in exchange for guaranteed low latency and no rate limits. This makes sense for latency-sensitive applications processing predictable, high volumes — but is wasteful for bursty or low-volume workloads.

Tradeoff 5: Parallelism vs sequencing (faster total time, same total cost)

If a user action requires multiple LLM calls, you can run them in parallel to reduce total latency without changing total cost. Two 3-second calls in parallel take 3 seconds total instead of 6 seconds sequentially, with the same token consumption and cost. The tradeoff is architectural complexity: parallel execution requires careful orchestration, error handling for partial failures, and possibly more concurrent API connections (which count toward rate limits).

Latency Benchmarks by Model

Latency varies significantly across models, providers, and request characteristics. The following benchmarks represent median production values observed across thousands of real-world requests in March 2026. Use these as reference points for capacity planning and SLA setting, but always validate against your own workloads — your specific prompt patterns, geographic location, and traffic timing will produce different numbers.

Model             | Provider  | Median TTFT | P95 TTFT | Generation Speed (tok/s) | 500-Token Response (Total)
Gemini 2.0 Flash  | Google    | 180ms       | 450ms    | ~160                     | ~3.3s
GPT-4o mini       | OpenAI    | 260ms       | 680ms    | ~125                     | ~4.3s
Claude 3.5 Haiku  | Anthropic | 300ms       | 750ms    | ~110                     | ~4.8s
Mistral Small     | Mistral   | 220ms       | 550ms    | ~130                     | ~4.1s
GPT-4o            | OpenAI    | 420ms       | 1,100ms  | ~90                      | ~6.0s
Claude 3.5 Sonnet | Anthropic | 500ms       | 1,400ms  | ~82                      | ~6.6s
Gemini 1.5 Pro    | Google    | 380ms       | 950ms    | ~95                      | ~5.6s
Mistral Large     | Mistral   | 480ms       | 1,200ms  | ~70                      | ~7.6s
Claude 3 Opus     | Anthropic | 1,100ms     | 3,200ms  | ~42                      | ~13.0s
o3-mini           | OpenAI    | 800ms*      | 4,500ms* | ~100                     | ~5.8s*

*Reasoning model TTFTs are highly variable because they include internal "thinking" time that varies with problem complexity. A simple question might have a 500ms TTFT, while a complex math problem might think for 10+ seconds before producing the first visible token.

Key observations from these benchmarks:

  • Economy models are 2-3x faster than frontier models. Gemini Flash and GPT-4o mini deliver sub-300ms TTFT and 120-160 tok/s generation. Claude Opus and reasoning models are 3-10x slower.
  • P95 TTFT is 2-3x the median. Always design your timeouts and user experience around P95 or P99 latency, not the median. If your median TTFT is 500ms but P95 is 1,400ms, 5% of your users are waiting nearly three times as long as the typical request.
  • Generation speed is the dominant factor for longer outputs. For a 500-token response, generation accounts for 80-90% of total latency. Reducing input tokens (which reduces TTFT) has diminishing returns once output generation dominates.
  • Provider load causes 20-40% latency variation by time of day. US business hours (9am-5pm PT) typically show the highest latency for US-hosted providers. Schedule batch workloads for off-peak hours when possible.

CostHawk records TTFT and total latency for every request, allowing you to build your own provider-specific, model-specific, and time-of-day-specific benchmarks from your actual production traffic.

Reducing Latency Without Increasing Cost

Many latency optimizations also reduce cost, making them doubly valuable. Here are the most effective techniques, ordered by typical impact for production workloads:

1. Enable streaming (zero cost impact, massive perceived latency improvement)

Streaming delivers tokens to the client as they are generated instead of waiting for the complete response. This does not reduce total latency or cost — the same tokens are generated at the same speed — but it transforms the user experience. A 6-second response that streams token-by-token feels dramatically faster than a 6-second response that appears all at once after a loading spinner. Implementing streaming is straightforward with every major provider's SDK. OpenAI, Anthropic, and Google all support server-sent events (SSE) for streaming. If you are using CostHawk's proxy, streaming is transparently passed through — the proxy logs the complete request metadata without adding meaningful latency to the stream.
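
A minimal streaming sketch with the OpenAI Python SDK, measuring client-side TTFT along the way; the model name and prompt are placeholders, and Anthropic's and Google's SDKs expose equivalent streaming iterators:

```python
import time
from openai import OpenAI  # assumes the official openai SDK and OPENAI_API_KEY set

client = OpenAI()
start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Summarize our refund policy in 3 bullets."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_token_at is None:
            first_token_at = time.monotonic()  # client-observed TTFT
        print(delta, end="", flush=True)

total = time.monotonic() - start
print(f"\nTTFT: {first_token_at - start:.2f}s, total: {total:.2f}s")
```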

2. Implement prompt caching (reduces cost AND latency)

Prompt caching eliminates redundant prefill computation for repeated prompt prefixes. If your system prompt is 2,000 tokens and you send it with every request, prompt caching computes the prefill for those 2,000 tokens once and reuses the cached KV-cache for subsequent requests. The impact on TTFT is dramatic: Anthropic's prompt caching reduces TTFT by 80-85% for the cached portion. On a 10,000-token input where 8,000 tokens are a cached system prompt, TTFT drops from ~600ms to ~200ms. Cost impact is also positive: Anthropic charges 10% of the normal input rate for cached tokens (90% discount), and OpenAI charges 50% of the normal rate. This is one of the rare optimizations that improves latency and reduces cost simultaneously.
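
Here is a minimal sketch of marking a stable system prompt as cacheable with the Anthropic Python SDK; the model name and prompt text are placeholders, and you should check Anthropic's docs for current minimum cacheable lengths and cache TTLs:

```python
from anthropic import Anthropic  # assumes ANTHROPIC_API_KEY is set

client = Anthropic()
STABLE_SYSTEM_PROMPT = "..."  # the ~2,000-token instructions reused on every request

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model choice
    max_tokens=400,
    system=[{
        "type": "text",
        "text": STABLE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix's prefill
    }],
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
# On subsequent requests, a nonzero cache_read_input_tokens in usage confirms a hit.
print(response.usage)
```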

3. Reduce input token count (reduces cost AND TTFT)

Every input token adds to prefill time. Reducing input tokens directly reduces TTFT and input token cost. Practical approaches: trim system prompts by removing redundant instructions; use RAG to retrieve only relevant context instead of stuffing the full document; summarize conversation history instead of sending raw transcripts; compress structured data (use CSV instead of verbose JSON). A team that reduces average input tokens from 5,000 to 2,000 will see TTFT improve by approximately 40% and input token costs drop by 60%.

4. Set appropriate max_tokens (reduces cost AND generation time)

Setting max_tokens prevents the model from generating unnecessarily long responses. This caps both the generation time and the output token cost. The key is to set it tightly enough to prevent waste but loosely enough to avoid truncating legitimate responses. Analyze your output token distribution in CostHawk: if the P99 output length is 300 tokens, setting max_tokens=400 gives 33% headroom while preventing the rare 2,000-token runaway response that takes 25 seconds and costs 5x a normal response.
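
As a sketch of deriving a cap from your own traffic (the percentile and headroom choices are assumptions to tune per endpoint):

```python
import math

def suggest_max_tokens(observed_output_lengths: list[int],
                       percentile: float = 0.99, headroom: float = 1.33) -> int:
    """Cap output at roughly the observed P99 length plus headroom."""
    ordered = sorted(observed_output_lengths)
    idx = min(len(ordered) - 1, math.floor(percentile * len(ordered)))
    return math.ceil(ordered[idx] * headroom)

# Logged output lengths for a hypothetical classification endpoint
print(suggest_max_tokens([22, 31, 18, 45, 27, 300, 29, 35]))  # ~399
```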

5. Route to faster models for simple tasks (reduces cost AND latency)

Model routing is the practice of directing each request to the cheapest and fastest model that can handle it adequately. Simple tasks like classification, entity extraction, and format conversion rarely need a frontier model. Routing these to GPT-4o mini or Gemini Flash instead of GPT-4o or Claude Sonnet simultaneously reduces latency (economy models are 2-3x faster) and cost (economy models are 10-20x cheaper). Even a basic static routing rule — "all /api/classify requests use GPT-4o mini" — can capture significant savings. Dynamic routing with a lightweight classifier can push optimization further.
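
A static routing rule of this kind is only a few lines of code; the endpoint paths and model names below are illustrative, not recommendations:

```python
# Illustrative static routing: fast, cheap models for simple endpoints,
# a frontier model only where output quality justifies the latency and cost.
MODEL_BY_ENDPOINT = {
    "/api/classify": "gpt-4o-mini",
    "/api/extract": "gemini-2.0-flash",
    "/api/draft": "claude-3-5-sonnet-latest",
}

def pick_model(endpoint: str, default: str = "gpt-4o-mini") -> str:
    return MODEL_BY_ENDPOINT.get(endpoint, default)

print(pick_model("/api/classify"))  # gpt-4o-mini
```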

6. Parallelize independent LLM calls (same cost, lower total latency)

If a user action requires multiple independent LLM calls (e.g., summarize + classify + extract), run them in parallel instead of sequentially. Three 2-second calls in parallel take 2 seconds total instead of 6 seconds sequentially, with identical token consumption and cost. Use Promise.all() in JavaScript or asyncio.gather() in Python to parallelize calls. Be mindful of rate limits — parallel calls consume your rate limit budget faster.
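
A minimal asyncio.gather sketch using the OpenAI async client; the model, prompts, and the three tasks are placeholders:

```python
import asyncio
from openai import AsyncOpenAI  # assumes OPENAI_API_KEY is set

client = AsyncOpenAI()

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return resp.choices[0].message.content

async def handle_document(doc: str) -> list[str]:
    # Three independent calls run concurrently: wall-clock time ~= slowest call,
    # token cost identical to running them sequentially.
    return await asyncio.gather(
        ask(f"Summarize:\n{doc}"),
        ask(f"Classify as billing, support, or sales:\n{doc}"),
        ask(f"List named entities as JSON:\n{doc}"),
    )

print(asyncio.run(handle_document("...document text...")))
```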

Monitoring Latency

Effective latency monitoring for LLM applications requires tracking multiple dimensions that traditional APM tools do not capture. Here is a comprehensive monitoring strategy:

What to measure:

  • TTFT (time-to-first-token): The most important user-facing latency metric for streaming applications. Measure at the client side (includes network latency) and, if possible, at the proxy side (excludes network latency to isolate provider performance). Track P50, P95, and P99 percentiles — averages hide the long-tail experiences that frustrate users.
  • End-to-end latency: Total time from request to complete response. The most important metric for non-streaming applications and batch workloads. Track per-model and per-endpoint to identify which combinations are underperforming.
  • Generation speed (tokens/second): Calculated as output_tokens / (end_to_end_latency - TTFT). This isolates the decode phase performance. A drop in generation speed often indicates provider-side GPU contention — the model is generating tokens more slowly because the provider's infrastructure is under load. A sketch computing these metrics from request logs follows this list.
  • Latency by input size: Plot TTFT against input token count. The relationship should be roughly linear; a steeper-than-expected slope suggests the model or provider is handling long inputs poorly. If TTFT spikes disproportionately at high input counts, consider implementing input truncation or chunking.
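
Here is a minimal sketch of computing these metrics from request logs; the record fields ttft_s, total_s, and output_tokens are assumptions about your own log schema:

```python
import statistics

def pctl(values: list[float], p: float) -> float:
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, round(p * (len(ordered) - 1)))]

def latency_metrics(requests: list[dict]) -> dict:
    """Summarize TTFT percentiles and decode speed from logged requests."""
    ttfts = [r["ttft_s"] for r in requests]
    speeds = [r["output_tokens"] / (r["total_s"] - r["ttft_s"])
              for r in requests if r["total_s"] > r["ttft_s"]]
    return {
        "ttft_p50": pctl(ttfts, 0.50),
        "ttft_p95": pctl(ttfts, 0.95),
        "ttft_p99": pctl(ttfts, 0.99),
        "median_gen_speed_tok_s": statistics.median(speeds),
    }
```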

How to set alerts:

Static latency thresholds ("alert if P95 TTFT exceeds 3 seconds") are a reasonable starting point but suffer from two problems: they do not account for model-specific baselines (3 seconds is normal for Opus but alarming for GPT-4o mini), and they do not detect gradual degradation that stays below the threshold. A better approach combines static thresholds with anomaly detection:

  • Set per-model static alerts based on 2x the historical P95 for that model
  • Set anomaly alerts that fire when the current hour's P50 latency exceeds the trailing 7-day P50 by more than 50% (a sketch of this rule follows the list)
  • Set input-size alerts that fire when average input tokens per request increases by more than 30% (a leading indicator of TTFT increases)
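
A minimal sketch of the combined rule, using the thresholds suggested above (tune both per model):

```python
def should_alert(current_hour_p50_s: float, trailing_7d_p50_s: float,
                 static_limit_s: float, drift_tolerance: float = 0.50) -> bool:
    """Fire on a static breach or on >50% drift above the 7-day baseline."""
    static_breach = current_hour_p50_s > static_limit_s
    baseline_drift = current_hour_p50_s > trailing_7d_p50_s * (1 + drift_tolerance)
    return static_breach or baseline_drift

# Example: baseline P50 of 0.5s, static limit of 2x historical P95 (2.8s)
print(should_alert(0.9, 0.5, 2.8))  # True -> 80% above baseline
```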

Dashboards to build:

CostHawk provides pre-built latency dashboards, but if you are building custom monitoring, prioritize these views:

  1. Latency heatmap: P50/P95/P99 latency by model over the past 24 hours. This shows time-of-day patterns and provider-specific performance variations at a glance.
  2. TTFT vs input tokens scatter plot: Reveals the relationship between input size and responsiveness. Outliers (high TTFT for low input counts) indicate queuing delays or provider issues.
  3. Latency-cost quadrant chart: Plot each endpoint's average latency (x-axis) against average cost per request (y-axis). Endpoints in the high-latency/high-cost quadrant are optimization priorities. Endpoints in the low-latency/low-cost quadrant are well-optimized.
  4. Timeout and retry tracker: Count of requests that exceeded timeout thresholds or triggered retries, broken down by model. High timeout rates on a specific model suggest it is too slow for your SLA requirements.

The goal of latency monitoring is not just awareness — it is actionable intelligence. Every latency anomaly should lead to an investigation: is it a provider issue (check their status page), a prompt change (check input token trends), a traffic spike (check request volume), or a model routing misconfiguration (check model distribution)? CostHawk's correlated metrics make this investigation fast by showing latency, cost, token counts, and error rates on the same timeline.

FAQ

Frequently Asked Questions

What is a good TTFT (time-to-first-token) for production applications?
The target TTFT depends on your application type and user expectations. For real-time chat interfaces where users are watching the response stream in, a P95 TTFT under 1 second is ideal and under 2 seconds is acceptable. Most economy models (GPT-4o mini, Gemini Flash, Claude Haiku) consistently achieve sub-500ms median TTFT with P95 under 1 second, making them well-suited for latency-sensitive chat applications. For mid-tier models (GPT-4o, Claude Sonnet), expect 400-600ms median TTFT with P95 around 1-1.5 seconds. For frontier and reasoning models (Claude Opus, o1, o3), TTFT can reach 1-5+ seconds, which is generally too slow for real-time chat but acceptable for applications where users expect longer processing times (code review, complex analysis, research tasks). For non-interactive applications like batch processing, document summarization pipelines, or background workers, TTFT is less important than total throughput — you care about tokens per dollar, not milliseconds to first token. The key principle: measure your actual TTFT distribution in production, set alerts at your acceptable P95 threshold, and route latency-sensitive requests to faster models.
Why does latency vary so much between identical requests?
LLM latency variance comes from multiple sources, some within your control and some outside it. The largest source of variance is provider-side queuing: your request competes for GPU resources with thousands of other customers' requests. During peak hours (US business hours), queuing delays can add hundreds of milliseconds that are absent during off-peak times. The second source is non-deterministic generation: even with the same prompt, the model may generate different-length responses on different runs (unless temperature is 0 and the model is truly deterministic, which most are not in practice). A response that generates 200 tokens will have half the generation time of a response that generates 400 tokens. The third source is infrastructure variation: the provider may route your request to different GPU hardware or data centers on different runs, and the performance of individual GPUs varies. The fourth source is network jitter: internet routing changes, packet loss, and TCP congestion affect request and response transit times. In practice, you should expect P95 latency to be 2-3x the median for any given model and input size. Design your application around worst-case latency (P95 or P99), not average latency. Use timeouts to cap the maximum wait time, and implement graceful degradation — if a request exceeds your latency budget, show a fallback response or retry with a faster model.
How does input length affect latency?
Input length directly affects the prefill phase, which determines TTFT. The relationship is approximately linear for standard transformer models: doubling the input token count roughly doubles the prefill time. For example, on GPT-4o, a 1,000-token input might have a ~200ms prefill time, while a 10,000-token input has a ~800ms prefill time. However, this linear relationship breaks down at very long context lengths because the attention mechanism has quadratic complexity — processing 100,000 tokens is more than 10x the work of processing 10,000 tokens. Modern models mitigate this with efficient attention implementations (FlashAttention, sliding window attention), but the effect is still measurable at long context lengths. Input length does NOT significantly affect generation speed (tokens per second). Once the prefill is complete and the KV-cache is built, the model generates output at roughly the same speed regardless of whether the input was 500 tokens or 50,000 tokens (though the KV-cache from a longer input consumes more GPU memory, which can slightly reduce generation speed). The practical implication is that reducing input length primarily improves TTFT (time-to-first-token) and input token cost, while reducing output length primarily improves total latency and output token cost. For streaming applications where TTFT is the critical user experience metric, input reduction is the highest-impact optimization.
What is the difference between latency and throughput?
Latency and throughput measure fundamentally different things, and optimizing for one does not necessarily improve the other. Latency is the time for a single request to complete — it answers 'how long does one user wait?' Throughput is the number of requests (or tokens) processed per unit time — it answers 'how many requests can we handle simultaneously?' A single LLM inference might have 5 seconds of latency, but if the provider can run 1,000 inferences in parallel, the throughput is 1,000 requests per 5 seconds, or 200 requests per second. You can have high throughput with high latency (batch processing where many slow requests run in parallel) or low throughput with low latency (a single-user application where each request is fast but only one runs at a time). The tradeoff becomes real when you hit provider rate limits. If your rate limit is 100 requests per minute and each request takes 5 seconds, you can have at most ~8 concurrent requests (100/60 × 5 ≈ 8). Reducing latency to 2 seconds lets you sustain ~3 concurrent requests within the same rate limit — the same throughput ceiling but faster individual responses. For cost, latency and throughput interact through retries and timeouts: higher latency leads to more timeouts, which trigger retries, which consume your throughput capacity and multiply your token costs.
How does prompt caching affect latency?
Prompt caching has a dramatic impact on TTFT by eliminating the prefill computation for repeated prompt prefixes. When you send a request with a prompt that shares a prefix with a recently cached prompt, the provider skips the prefill for the cached portion and only processes the new tokens. The latency reduction is proportional to the fraction of the input that is cached. For a typical scenario where 80% of input tokens come from a stable system prompt and 20% are the user's unique message, prompt caching reduces TTFT by approximately 70-80%. In absolute terms, this can mean the difference between a 600ms TTFT and a 150ms TTFT. Anthropic's prompt caching is the most aggressive: cached tokens are served from the KV-cache with a 90% cost discount and ~85% TTFT reduction. OpenAI's caching provides a 50% cost discount with similar latency benefits. To maximize cache hit rates, structure your prompts with the static content (system prompt, instructions, few-shot examples) at the beginning and the variable content (user message, dynamic context) at the end. Cache hits require an exact prefix match — if a single token in the cached prefix changes, the entire cache misses and the full prefill runs again. Monitor your cache hit rate in your observability dashboard; a rate below 80% on a high-volume endpoint with a stable system prompt suggests a prompt structure issue that is costing you both latency and money.
Should I use streaming or non-streaming responses?
Streaming is almost always the better choice for user-facing applications, and non-streaming is usually better for programmatic or batch workloads. The key distinction is whether a human is waiting for the response. For streaming: when a user is watching a chat interface, streaming provides dramatically better perceived performance. The user starts reading the response within hundreds of milliseconds (TTFT) instead of waiting seconds for the complete response. Streaming also enables progressive rendering — you can show partial results, update a UI incrementally, or even cancel a request mid-generation if the user sees the response is going in the wrong direction (saving tokens and cost on the truncated portion). Most modern chat UIs (ChatGPT, Claude, Gemini) use streaming by default. For non-streaming: when a downstream system consumes the response programmatically (parsing JSON output, feeding into another API, storing in a database), streaming adds complexity without benefit. You need the complete response before you can process it, so buffering a stream just adds code complexity. Batch processing workloads also benefit from non-streaming because they can process responses asynchronously without maintaining open SSE connections. Cost impact: streaming and non-streaming produce the same number of tokens and cost the same amount. The only cost difference arises if streaming enables mid-response cancellation — stopping generation early saves the output tokens that would have been generated, which directly reduces cost.
How do reasoning models like o1 and o3-mini affect latency?
Reasoning models introduce a fundamentally different latency profile compared to standard LLMs. Before producing visible output, reasoning models perform an internal 'thinking' phase where they break down the problem, explore solution paths, and verify their reasoning. This thinking phase can take anywhere from 500ms (simple questions) to 30+ seconds (complex math or logic problems), and the duration is unpredictable because it depends on problem complexity, not input length. The TTFT for reasoning models is therefore the thinking time plus the standard prefill time, making it highly variable and often much longer than standard models. OpenAI's o1 model has been observed to think for 5-15 seconds on moderately complex problems, with extreme cases exceeding 60 seconds. The newer o3-mini model is faster (typically 1-5 seconds of thinking) but still much less predictable than standard models. From a cost perspective, thinking tokens are billed as output tokens — so a request that generates 200 visible tokens but uses 3,000 thinking tokens costs 15x more in output tokens than the visible output would suggest. For latency-sensitive applications, reasoning models are generally unsuitable for real-time user-facing interactions. They are better suited for background tasks, code review, complex analysis, and scenarios where users explicitly expect longer processing times. If you need reasoning capabilities with lower latency, consider using a standard model with chain-of-thought prompting — you sacrifice some reasoning quality but gain predictable latency.
What timeout values should I set for LLM API calls?
Timeout values should be set based on the model, expected output length, and your application's tolerance for slow responses. A timeout that is too aggressive will kill requests that would have succeeded, triggering retries that double your cost. A timeout that is too generous will leave users waiting unnecessarily or tie up server resources. Here is a framework: calculate the expected maximum response time as P99 TTFT + (max_tokens / generation speed) + buffer. For GPT-4o with max_tokens=1000: P99 TTFT ~1.5s + 1000/90 tok/s ~11s + 3s buffer = ~16s timeout. For GPT-4o mini with max_tokens=500: P99 TTFT ~0.8s + 500/125 tok/s ~4s + 2s buffer = ~7s timeout. For Claude 3 Opus with max_tokens=2000: P99 TTFT ~4s + 2000/42 tok/s ~48s + 5s buffer = ~57s timeout. For reasoning models, add generous headroom for thinking time: 60-120 seconds minimum. Common mistakes to avoid: (1) Using the same timeout for all models — a 10-second timeout is fine for GPT-4o mini but will kill 30% of Claude Opus requests. (2) Not accounting for output length — a request generating 2,000 tokens needs 4x the generation time of a 500-token request. (3) Retrying on timeout without checking why — if a model consistently times out, the fix is a higher timeout or a faster model, not more retries that multiply your cost. Monitor timeout rates per model in CostHawk and adjust thresholds based on actual P99 latency data.
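
The framework in the last answer reduces to one formula; here is a minimal sketch (plug in your own P99 TTFT and generation-speed figures rather than these illustrative defaults):

```python
def timeout_seconds(p99_ttft_s: float, max_tokens: int,
                    gen_speed_tok_per_s: float, buffer_s: float = 3.0) -> float:
    """Timeout = P99 TTFT + worst-case generation time + safety buffer."""
    return p99_ttft_s + max_tokens / gen_speed_tok_per_s + buffer_s

print(timeout_seconds(1.5, 1000, 90))       # ~15.6s, matching the GPT-4o example
print(timeout_seconds(0.8, 500, 125, 2.0))  # ~6.8s for the GPT-4o mini example
```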

Related Terms

Time to First Token (TTFT)

The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.

Throughput

The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.

P95 / P99 Latency

Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.

Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Provisioned Throughput

Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.

LLM Observability

The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
