Glossary · Observability · Updated 2026-03-16

Time to First Token (TTFT)

The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.

Definition

What is Time to First Token (TTFT)?

Time to First Token (TTFT) measures the elapsed time, in milliseconds, between dispatching an LLM API request and receiving the very first token of the model's response. For streaming endpoints, where the server sends tokens incrementally via server-sent events (SSE) as they are generated, TTFT represents the initial perceived latency: how long the user stares at a blank screen or a loading spinner before text starts appearing.

TTFT is distinct from total latency (also called time to last token, or TTLT), which measures the time until the entire response is complete. In a streaming scenario, a request might have a TTFT of 300 ms but a total latency of 4,500 ms for a 600-token response. The user perceives responsiveness at 300 ms, even though the full answer takes 4.5 seconds to arrive.

TTFT is a function of several factors: the model's prefill time (processing the input prompt), the network round trip between client and inference server, any queuing delay if the provider's GPU fleet is saturated, and preprocessing overhead such as safety filters or prompt caching checks. Larger prompts increase prefill time because the model must compute attention across more tokens before generating the first output token.

TTFT has become one of the most important operational metrics for AI applications because users have been conditioned by ChatGPT, Claude, and Gemini to expect streaming responses that begin within 200–800 milliseconds. Applications with TTFT exceeding 2 seconds see measurably higher abandonment rates, making TTFT optimization a direct driver of user engagement and retention.

Impact

Why It Matters for AI Costs

TTFT is the difference between an AI application that feels instant and one that feels broken. Human perception research consistently shows that latency thresholds drive user behavior:

| TTFT Range | User Perception | Behavioral Impact |
|---|---|---|
| < 200 ms | Instantaneous | User perceives no wait; maximum engagement |
| 200–500 ms | Fast | Feels responsive; standard for well-optimized apps |
| 500–1,000 ms | Noticeable delay | User notices the wait but tolerates it with streaming |
| 1,000–2,000 ms | Slow | User attention begins to wander; consider loading indicators |
| 2,000–5,000 ms | Frustrating | Measurable increase in tab switches and abandonment |
| > 5,000 ms | Broken | Users assume an error occurred; high abandonment |

For AI-powered products, TTFT directly impacts key business metrics. A study of conversational AI interfaces found that reducing P50 TTFT from 1,200 ms to 400 ms increased message-send rates by 22% and session duration by 15%. Users who experience fast TTFT send more messages, explore more features, and are more likely to convert to paid plans.

TTFT also has cost implications. When TTFT is high, users often retry their requests (assuming the first attempt failed), generating duplicate LLM calls that double cost without adding value. At scale, retry-induced duplicate requests can account for 5–15% of total API spend. Monitoring and optimizing TTFT reduces both user frustration and wasted spend.

From an operational perspective, TTFT is a leading indicator of infrastructure health. A sudden spike in P95 TTFT often indicates GPU saturation at the provider, queuing delays from rate limiting, or a regression in prompt size that increased prefill time. CostHawk tracks TTFT at the per-request level and alerts when percentile TTFT deviates from your baseline, enabling rapid response to latency degradations before they impact user experience.

What Is TTFT?

Time to First Token is the wall-clock duration from the moment your application sends an HTTP request to an LLM API endpoint to the moment it receives the first byte of the streamed response containing a generated token. The measurement point is precise:

  • Start: The instant the HTTP request leaves your application (specifically, when the request body has been fully sent and the TCP socket is waiting for a response)
  • End: The instant the first SSE data: event containing a delta (content token) is received by your application

TTFT captures several phases of processing that occur before the model begins generating output:

  1. Network transit (outbound): Your request travels from your server to the provider's API gateway. Typical time: 10–50 ms depending on geographic distance and network conditions.
  2. API gateway processing: The provider validates your API key, checks rate limits, and routes the request to an inference server. Typical time: 5–20 ms.
  3. Queue wait: If all inference GPUs are busy, your request waits in a queue. This is the most variable component — it can be 0 ms under light load or 2,000+ ms during peak traffic or provider incidents.
  4. Prefill computation: The model processes your entire input prompt, computing attention across all input tokens to build the key-value cache. This is the core computational step. Typical time: 50–500 ms, scaling roughly linearly with input token count.
  5. First token generation: The model samples the first output token from its probability distribution. Typical time: 5–20 ms (one decode step).
  6. Network transit (return): The first SSE event travels back to your application. Typical time: 10–50 ms.

In practice, TTFT is dominated by prefill computation and queue wait. A well-optimized request with a short prompt and no queuing might achieve 100–200 ms TTFT. A long prompt during peak traffic might see 3,000+ ms TTFT, with most of that time split between queuing and prefill.
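As a rough illustration, the six phases above can be summed into a back-of-envelope estimate. Every constant below (network RTT, gateway overhead, prefill throughput) is an assumption chosen for illustration, not a provider specification:

```typescript
// Hypothetical TTFT estimator that sums the phases described above.
// All default rates are illustrative assumptions, not measured values.
function estimateTtftMs(opts: {
  inputTokens: number
  queueWaitMs?: number         // 0 under light load; 2,000+ ms under saturation
  networkRttMs?: number        // outbound + return transit combined
  prefillTokensPerSec?: number // assumed prefill throughput of the model
}): number {
  const {
    inputTokens,
    queueWaitMs = 0,
    networkRttMs = 40,
    prefillTokensPerSec = 15_000,
  } = opts
  const gatewayMs = 10 // API key validation, rate-limit check, routing
  const prefillMs = (inputTokens / prefillTokensPerSec) * 1000
  const firstDecodeMs = 10 // a single decode step for the first token
  return networkRttMs + gatewayMs + queueWaitMs + prefillMs + firstDecodeMs
}
```

Under these assumptions, a 1,000-token prompt with no queuing lands around 125 ms, while the same prompt behind a 2,000 ms queue exceeds 2,100 ms, which matches the intuition that queue wait and prefill dominate.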

The measurement must be taken at the client side to capture the full end-to-end experience. Server-side TTFT measurements miss network transit time and do not reflect what the user actually experiences.
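A client-side measurement can be sketched as a generic wrapper around any streaming SDK call. The `streamFactory` and `hasContent` parameters are placeholders standing in for, say, a call to `openai.chat.completions.create({ stream: true })` and a check that a chunk carries a non-empty content delta:

```typescript
// Sketch: measure TTFT around an arbitrary streaming call.
// `hasContent` filters out connection/setup chunks and empty deltas so
// only the first chunk with actual generated text stops the clock.
async function measureTtft<T>(
  streamFactory: () => Promise<AsyncIterable<T>>,
  hasContent: (chunk: T) => boolean,
): Promise<{ ttftMs: number | null; chunks: T[] }> {
  const start = performance.now() // capture BEFORE the request is dispatched
  const stream = await streamFactory()
  let ttftMs: number | null = null
  const chunks: T[] = []
  for await (const chunk of stream) {
    if (ttftMs === null && hasContent(chunk)) {
      ttftMs = performance.now() - start // first chunk with real content
    }
    chunks.push(chunk)
  }
  return { ttftMs, chunks }
}
```

Note that the timer starts when the request is dispatched, not when the prompt is constructed, and stops on the first content-bearing chunk rather than on response headers.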

TTFT vs Total Latency

TTFT and total latency (time to last token, TTLT) measure different aspects of the user experience and are affected by different factors. Understanding the distinction is essential for choosing the right optimization strategy.

| Metric | Measures | Depends On | Matters Most For |
|---|---|---|---|
| TTFT | Time until first token appears | Input size, queue wait, prefill speed, network | Streaming UX, perceived responsiveness |
| TTLT | Time until entire response is complete | TTFT + (output tokens × decode time per token) | Non-streaming apps, batch processing, total wait time |
| Inter-Token Latency (ITL) | Time between consecutive tokens during generation | Model decode speed, KV cache size, GPU utilization | Streaming smoothness, perceived generation speed |

Example scenario: A user sends a 2,000-token prompt and the model generates a 400-token response.

  • TTFT: 380 ms (prefill dominates)
  • Inter-token latency: 25 ms/token average
  • Decode time: 400 tokens × 25 ms = 10,000 ms
  • Total latency (TTLT): 380 + 10,000 = 10,380 ms

In a streaming scenario, the user sees the first word at 380 ms and watches text flow in over 10 seconds. This feels responsive because there is continuous visual feedback. In a non-streaming scenario, the user waits 10.4 seconds with no feedback, which feels painfully slow.

This is why streaming fundamentally changes which metric matters. For streaming applications, optimizing TTFT has the most impact on perceived performance. For non-streaming applications (batch processing, API-to-API calls, background jobs), total latency matters more than TTFT because there is no user watching a cursor.

When TTFT and TTLT diverge dramatically: A request with 200 ms TTFT but 30,000 ms TTLT indicates a short prompt (fast prefill) but very long output. Conversely, a request with 2,000 ms TTFT but 3,000 ms TTLT indicates a long prompt (slow prefill) with short output. These profiles require different optimization strategies:

  • High TTFT, low TTLT → optimize input (shorter prompts, prompt caching, smaller context)
  • Low TTFT, high TTLT → optimize output (set max_tokens, request concise responses, use a faster model for generation)

CostHawk tracks both TTFT and TTLT for every traced request, making it easy to diagnose which dimension of latency is dominating for each endpoint or use case.

TTFT Benchmarks by Model

TTFT varies significantly across models and providers. The table below shows approximate P50 TTFT benchmarks for common models under typical production conditions (1,000-token input prompt, no queuing delay, US-based client). These benchmarks are gathered from public performance reports and community benchmarks as of early 2026:

| Model | Provider | P50 TTFT | P95 TTFT | Notes |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | ~180 ms | ~450 ms | Fastest OpenAI model; excellent for latency-sensitive apps |
| GPT-4o | OpenAI | ~320 ms | ~800 ms | Good balance of speed and capability |
| Gemini 2.0 Flash | Google | ~150 ms | ~380 ms | Among the fastest models available; optimized for speed |
| Gemini 1.5 Pro | Google | ~350 ms | ~900 ms | Longer context window increases prefill time |
| Claude 3.5 Haiku | Anthropic | ~200 ms | ~500 ms | Anthropic's speed-optimized model |
| Claude 3.5 Sonnet | Anthropic | ~400 ms | ~1,100 ms | Higher capability comes with higher TTFT |
| Claude 3 Opus | Anthropic | ~800 ms | ~2,500 ms | Largest Anthropic model; slowest TTFT |
| o1 | OpenAI | ~2,000 ms | ~8,000 ms | Reasoning tokens add significant delay before visible output |
| Llama 3 70B (4×A100) | Self-hosted | ~250 ms | ~600 ms | Depends heavily on hardware and batching configuration |

Important caveats:

  • These benchmarks assume a 1,000-token input. TTFT scales roughly linearly with input size for the prefill-dominated component. A 10,000-token input will have roughly 3–5x higher TTFT than a 1,000-token input.
  • P95 TTFT is often 2–3x the P50 due to queuing variability. During provider capacity constraints, P95 can spike to 5–10x the P50.
  • Reasoning models (o1, o3-mini) have uniquely high TTFT because they generate internal "thinking" tokens before producing visible output. The thinking phase can take seconds or even minutes, and the first visible token only appears after thinking completes.
  • Prompt caching dramatically reduces TTFT when the cache is hit. Anthropic reports 80–85% reduction in TTFT for cached prompts. OpenAI's caching provides similar benefits.
  • Geographic distance adds 50–150 ms of network latency. An application in Singapore calling a US-based API endpoint will see consistently higher TTFT than one in Virginia.

Use these benchmarks as starting points, then measure your actual TTFT in production. CostHawk records TTFT for every streaming request and computes P50, P95, and P99 percentiles by model, endpoint, and time period so you can establish and track your own baseline.

Factors Affecting TTFT

TTFT is influenced by a chain of factors from your application through the network to the provider's GPU fleet. Understanding each factor helps you identify which ones are within your control and which require provider-level changes.

1. Input prompt size (high impact, within your control). The prefill phase — where the model processes your input to build the key-value cache — scales roughly linearly with the number of input tokens. A 500-token prompt might require 80 ms of prefill on GPT-4o, while a 10,000-token prompt might require 600 ms. This is the factor you have the most control over. Reducing input size through prompt optimization, summarization of conversation history, or selective context injection directly reduces TTFT. Every 1,000 tokens you trim from the input saves approximately 30–80 ms of TTFT depending on the model.

2. Prompt caching (high impact, within your control). When the provider's prompt cache contains the prefill computation for your prompt prefix (typically the system prompt and any static context), the model can skip the prefill for the cached portion and begin generating output much faster. Anthropic's prompt caching reduces TTFT by 80–85% on cache hits. OpenAI's caching provides comparable acceleration. Structuring your prompts so that the static portion comes first (system prompt, few-shot examples) and the variable portion comes last (user query) maximizes cache hit rates and minimizes TTFT.
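A cache-friendly request structure can be sketched as follows. The `cache_control` marker follows Anthropic's documented prompt-caching API; the prompt text, model ID, and variable names are placeholders:

```typescript
// Static prefix first, variable query last, so the provider can cache the
// prefill computation for everything up to the cache_control marker.
const LONG_SYSTEM_PROMPT =
  "You are a support assistant. [several thousand tokens of static instructions...]"
const userQuery = "How do I reset my password?"

const request = {
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text" as const,
      text: LONG_SYSTEM_PROMPT, // identical across requests: cacheable
      cache_control: { type: "ephemeral" as const }, // cache up to this block
    },
  ],
  messages: [
    { role: "user" as const, content: userQuery }, // changes per request
  ],
}
```

Changing even one token inside the cached prefix invalidates the cache, so the static block must be byte-for-byte identical across requests.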

3. Queue wait time (high impact, outside your control). When a provider's inference fleet is at capacity, requests queue. Queue wait time is added directly to TTFT and is the primary cause of TTFT spikes. During peak hours or provider incidents, queue times can exceed 5,000 ms. You can mitigate this by: using provisioned throughput (reserved capacity) for critical workloads, implementing multi-provider failover (if Anthropic is slow, route to OpenAI), or using batch endpoints for non-latency-sensitive workloads to avoid competing for real-time capacity.

4. Model size (medium impact, within your control via model selection). Larger models have more parameters, which means more computation during both prefill and decode. Economy models (GPT-4o mini, Gemini Flash, Claude Haiku) consistently deliver lower TTFT than their larger counterparts. Choosing a smaller model for latency-sensitive operations is one of the most effective TTFT optimizations.

5. Network latency (medium impact, partially within your control). The round-trip time between your server and the provider's API endpoint adds directly to TTFT. Deploying your application in the same cloud region as the provider's primary inference infrastructure (typically US East for OpenAI, US East or Europe for Anthropic) minimizes network latency. Using the provider's edge endpoints or CDN-accelerated endpoints (where available) can further reduce network transit time by 20–60 ms.

6. Request overhead (low impact, within your control). Large tool/function definitions, extensive JSON schemas, and verbose system prompts all increase the request payload size and processing overhead. While smaller than the prefill impact, minimizing unnecessary request metadata contributes to lower TTFT at the margins. Strip unused tool definitions, minimize JSON schema verbosity, and avoid sending tools that the model is unlikely to need for a given request.

Optimizing TTFT

Reducing TTFT requires a systematic approach that addresses each contributing factor. Here are seven optimization strategies, ordered by typical impact:

1. Enable prompt caching. This is the single highest-impact TTFT optimization for most applications. Both Anthropic and OpenAI support automatic prompt caching where repeated prompt prefixes are cached on the provider's infrastructure. Anthropic caches prompt prefixes of 1,024+ tokens automatically and reduces TTFT by 80–85% on cache hits. To maximize cache hit rate: place your system prompt and static context at the beginning of the message array, keep the static prefix consistent across requests, and put the variable user query at the end. A typical RAG application with a 3,000-token system prompt and 2,000-token static few-shot examples can achieve 85%+ cache hit rates, reducing TTFT from ~500 ms to ~100 ms for the cached portion.

2. Reduce input prompt length. Every token in your input adds prefill time. Audit your prompts for redundancy, verbosity, and unnecessary context. Common culprits include: overly detailed system prompts that repeat instructions the model already follows, full conversation history when a summary would suffice, retrieved context chunks that are marginally relevant, and verbose JSON schemas for tool definitions. Teams that audit their prompts typically find 30–50% of input tokens can be eliminated without quality loss, reducing TTFT proportionally.

3. Select faster models for latency-sensitive operations. If your application has a chatbot interface where TTFT is critical, route those requests to a fast model (GPT-4o mini, Gemini Flash, Claude Haiku). Reserve slower, more capable models for background analysis or complex reasoning tasks where latency is less important. A well-implemented model routing strategy can achieve sub-200 ms TTFT for interactive use cases while still using frontier models for tasks that need them.
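A minimal routing policy might look like the sketch below. The model IDs are current public names, but the routing rules themselves are illustrative assumptions, not a provider recommendation:

```typescript
// Hypothetical latency-aware model router: interactive requests take the
// fast-TTFT path; background work can afford a slower, more capable model.
type Task = { interactive: boolean; complex: boolean }

function pickModel(task: Task): string {
  if (task.interactive && !task.complex) return "gpt-4o-mini" // fast TTFT path
  if (task.interactive && task.complex) return "gpt-4o" // balance speed/capability
  return "o1" // background reasoning; TTFT is irrelevant here
}
```

In production this decision would typically also weigh cost budgets and per-model TTFT percentiles observed at runtime.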

4. Implement multi-provider failover. When one provider experiences high latency, route to another. A simple implementation sends the request to your primary provider, and if TTFT exceeds a threshold (e.g., 1,500 ms without receiving the first token), cancels the request and retries with a secondary provider. More sophisticated approaches use a load balancer that tracks real-time TTFT per provider and routes new requests to the provider with the lowest current TTFT.

import Anthropic from "@anthropic-ai/sdk"
import OpenAI from "openai"

const anthropic = new Anthropic()
const openai = new OpenAI()

// Abort the primary request if the stream has not started within 1,500 ms,
// then fail over to the secondary provider.
async function llmCallWithFailover(prompt: string) {
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 1_500)

  try {
    return await anthropic.messages.create(
      {
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
        stream: true,
      },
      { signal: controller.signal },
    )
  } catch (e) {
    // Only fail over on the deliberate timeout abort; rethrow real errors
    // (auth failures, malformed requests) so they surface instead of
    // silently generating duplicate spend on the secondary provider.
    if (!controller.signal.aborted) throw e
    return openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    })
  } finally {
    clearTimeout(timeout)
  }
}

5. Use provisioned throughput. For production workloads with consistent volume, provisioned throughput eliminates queue wait time by reserving dedicated GPU capacity. OpenAI, Anthropic, and Google all offer provisioned throughput tiers. The cost is higher per token (you pay for reserved capacity whether you use it or not), but the latency is consistently low because your requests skip the shared queue.

6. Warm the inference path. Some providers cache model weights in GPU memory per endpoint. Infrequently used model-region combinations may require loading model weights from disk, adding seconds to the first request. If your application uses a less common model, consider sending periodic keepalive requests to keep the inference path warm.

7. Optimize client-side measurement. Ensure your TTFT measurement is accurate by capturing the timestamp before the HTTP request is sent (not before prompt construction) and capturing the first token timestamp when the first SSE data: event with content arrives (not the HTTP response headers). Inaccurate measurement leads to misguided optimization efforts.

Monitoring TTFT

Effective TTFT monitoring requires tracking percentile distributions, not just averages. A P50 TTFT of 300 ms can mask a P99 of 5,000 ms — and it is the P99 experience that drives user complaints and abandonment.

Key metrics to track:

  • P50 TTFT: The median experience. This is what most users see most of the time. Target: under 500 ms for interactive applications.
  • P95 TTFT: The tail experience affecting 1 in 20 requests. Users who consistently land in the P95 will perceive your application as slow. Target: under 1,500 ms.
  • P99 TTFT: The worst-case experience for 1 in 100 requests. Important for SLA compliance. Target: under 3,000 ms.
  • TTFT by model: Break down TTFT percentiles by model to identify which models are contributing to latency. This data directly informs model routing decisions.
  • TTFT by input token count: Correlate TTFT with input size to quantify the prefill cost per token and identify requests with unnecessarily large inputs.
  • TTFT by time of day: Provider capacity varies throughout the day. US business hours typically have higher TTFT than overnight periods due to demand patterns.
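The percentiles above can be computed over a window of recorded TTFT samples with the nearest-rank method, sketched here:

```typescript
// Nearest-rank percentile over a window of TTFT samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples")
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length) // nearest-rank index (1-based)
  return sorted[Math.min(rank, sorted.length) - 1]
}

const ttftSamples = [120, 180, 210, 250, 300, 340, 410, 520, 950, 4800]
const p50 = percentile(ttftSamples, 50) // median experience: 300 ms
const p95 = percentile(ttftSamples, 95) // tail experience: 4,800 ms
```

Note how the single 4,800 ms outlier dominates the P95 while leaving the P50 untouched, which is exactly why averages hide the tail experience.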

Alerting strategy:

Set tiered alerts that escalate with severity:

| Alert Level | Condition | Action |
|---|---|---|
| Warning | P95 TTFT > 2x baseline for 5 minutes | Notify on-call engineer via Slack |
| Critical | P95 TTFT > 4x baseline for 5 minutes | Page on-call; consider activating failover |
| Emergency | P50 TTFT > 3x baseline for 10 minutes | Activate multi-provider failover; escalate to provider support |
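The escalation rules above reduce to a small classifier. The thresholds mirror the table; the baseline shape and function names are assumptions, and the sustained-duration conditions (5 or 10 minutes) would be enforced by the surrounding alerting system, not by this function:

```typescript
// Sketch of the tiered alert logic: checked from most to least severe.
type AlertLevel = "ok" | "warning" | "critical" | "emergency"

function classifyTtft(
  p50Ms: number,
  p95Ms: number,
  baseline: { p50Ms: number; p95Ms: number },
): AlertLevel {
  if (p50Ms > 3 * baseline.p50Ms) return "emergency" // median itself has degraded
  if (p95Ms > 4 * baseline.p95Ms) return "critical" // severe tail regression
  if (p95Ms > 2 * baseline.p95Ms) return "warning" // early tail drift
  return "ok"
}
```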

Dashboard design:

A TTFT monitoring dashboard should include: (1) a time-series chart showing P50, P95, and P99 TTFT over the last 24 hours, with the baseline overlaid for comparison; (2) a heatmap showing TTFT distribution by hour of day to reveal temporal patterns; (3) a breakdown by model showing which models contribute the most to tail latency; (4) a scatter plot of TTFT vs input token count to visualize the prefill cost curve.

CostHawk captures TTFT for every streaming request, computes all standard percentiles, and provides pre-built dashboards for TTFT monitoring. Anomaly detection uses a rolling-window statistical model that accounts for time-of-day and day-of-week patterns, reducing false alerts while catching genuine regressions. When TTFT spikes, CostHawk's trace drill-down shows whether the spike was caused by increased input size, provider queuing, or a specific model, enabling rapid root-cause identification.

FAQ

Frequently Asked Questions

What is a good TTFT for a production AI chatbot?
For a consumer-facing AI chatbot with streaming enabled, target a P50 TTFT under 400 ms and a P95 under 1,200 ms. These thresholds align with user perception research showing that responses beginning within 500 ms feel responsive, while delays beyond 1,500 ms cause noticeable frustration. The ChatGPT and Claude web interfaces have set user expectations in the 200–600 ms range for P50 TTFT. Enterprise applications with internal users can tolerate slightly higher TTFT (up to 800 ms P50) because users are more patient with internal tools than consumer products. Batch processing and API-to-API integrations where no human is waiting do not need TTFT optimization at all — total latency and throughput are more relevant metrics. If your current P50 TTFT exceeds 1,000 ms, the highest-impact optimizations are usually prompt caching (if not already enabled), reducing prompt size, or switching to a faster model for the interactive use case. CostHawk's TTFT analytics show your current percentiles and help you track improvements over time.
How does prompt caching affect TTFT?
Prompt caching is the single most effective technique for reducing TTFT. When the provider caches the key-value computations for your prompt prefix, subsequent requests with the same prefix skip the prefill phase for the cached portion, dramatically reducing TTFT. Anthropic's prompt caching reports an 80–85% reduction in TTFT for cached prefixes — a request that would have 600 ms TTFT without caching drops to 90–120 ms with a cache hit. OpenAI's automatic caching provides similar acceleration. The cache hit depends on your prompt structure: the system prompt and any static context must form a consistent prefix across requests. If you change even one token in the cached prefix, the cache is invalidated. To maximize cache effectiveness, place all static content (system prompt, few-shot examples, tool definitions) at the beginning of the message array and the variable content (user query, dynamic context) at the end. Cache TTLs vary by provider — Anthropic caches persist for about 5 minutes of inactivity, so maintaining steady request volume keeps the cache warm. CostHawk tracks cache hit rates and shows the TTFT improvement from caching so you can verify the optimization is working.
Why is my TTFT much higher than the model benchmarks suggest?
If your observed TTFT significantly exceeds published benchmarks, the most common causes are: (1) Large input prompts — benchmarks typically use 500–1,000 token inputs, but production prompts often include 5,000–20,000 tokens of system prompt, conversation history, and retrieved context. Prefill time scales roughly linearly with input size, so a 10x larger input means 3–5x higher TTFT. (2) Queue wait time — during peak hours or provider capacity constraints, requests can queue for 500–3,000 ms before processing begins. This is invisible in benchmarks (which assume no queuing) but very real in production. (3) Geographic distance — if your servers are in Europe or Asia and the provider's inference fleet is in US East, network round-trip adds 100–200 ms. (4) No prompt caching — benchmarks often assume cached prompts, which reduce TTFT by 80–85%. If you have not enabled or structured your prompts for caching, you are paying the full prefill cost on every request. (5) Rate limiting — some providers impose micro-delays on requests that approach rate limits. Check your provider dashboard for rate limit warnings. Diagnose by decomposing TTFT into its components: measure network RTT separately, check if TTFT correlates with input token count (prefill-dominated) or is constant regardless of input (queue-dominated).
How does TTFT differ for reasoning models like o1 and o3?
Reasoning models (OpenAI's o1, o3-mini, o3) have fundamentally different TTFT characteristics compared to standard LLMs. These models perform extended internal reasoning — generating hundreds or thousands of "thinking" tokens — before producing any visible output. The thinking phase happens entirely server-side, and the first visible token is only emitted after the model has completed its reasoning chain. This means TTFT for reasoning models is not primarily a function of prefill time or network latency; it is dominated by the length of the reasoning chain. A simple question might trigger 200 thinking tokens (TTFT ~2–3 seconds), while a complex math problem might trigger 10,000+ thinking tokens (TTFT ~30–60 seconds). Unlike standard models where TTFT is a fraction of total latency, reasoning model TTFT often represents 80–95% of total latency because the decode phase for visible output is relatively brief. This has implications for UX design: standard streaming UX patterns (showing a cursor immediately) do not work well with reasoning models because the user would see a cursor for 30+ seconds with no text. Instead, applications should show a "thinking" indicator during the reasoning phase. Some providers expose the thinking tokens in the stream so you can show a progress indicator.
Should I optimize TTFT or total latency?
The answer depends on whether your application uses streaming. For streaming applications (chatbots, writing assistants, code completion) where users watch tokens appear in real time, TTFT is the primary metric because it determines how quickly the user sees the first sign of a response. Once tokens start flowing, the streaming visual feedback keeps users engaged even if the total response takes several seconds. Optimize TTFT aggressively for these use cases. For non-streaming applications (batch processing, API-to-API calls, background analysis, structured data extraction) where the consumer waits for the complete response, total latency (TTLT) is what matters. TTFT is largely irrelevant because no user is watching a cursor. Optimize for total throughput and complete response time instead. For hybrid applications that sometimes stream and sometimes wait for the full response (e.g., a RAG system that streams the answer but also extracts structured metadata), optimize TTFT for the streamed path and total latency for the blocking path. Many teams make the mistake of optimizing only total latency and ignore TTFT, which leaves their streaming UX feeling slower than it needs to be.
How do I measure TTFT accurately in my application?
Accurate TTFT measurement requires capturing two precise timestamps: (1) when the request is dispatched and (2) when the first content token is received. The implementation depends on your SDK and language. In JavaScript/TypeScript with the OpenAI SDK: capture Date.now() immediately before calling openai.chat.completions.create({ stream: true }), then capture another timestamp inside the stream iterator when the first chunk with delta.content arrives. The difference is your TTFT. Be careful to exclude SSE connection-setup events and empty deltas — only count the first chunk that contains actual generated content. In Python with the Anthropic SDK: capture time.monotonic() before the API call and again when the first ContentBlockDelta event arrives. Use time.monotonic() rather than time.time() to avoid clock-synchronization artifacts. Common measurement mistakes include: starting the timer before prompt construction (inflates TTFT with client-side processing time), counting HTTP response headers as the first token (headers arrive before any generated content), and using wall-clock timestamps across async contexts where the event loop may introduce jitter. CostHawk measures TTFT server-side at the proxy layer with microsecond precision, providing a ground-truth measurement that you can compare against client-side measurements to quantify network transit time.
Does TTFT affect cost?
TTFT does not directly appear on your API bill — providers charge per token, not per millisecond of latency. However, TTFT has significant indirect cost implications. First, high TTFT causes user retries. When users perceive a request as stuck (typically after 2–3 seconds with no visible response), they often refresh the page or resend the message. Each retry generates a duplicate API call that consumes tokens and costs money. At scale, retry-induced duplicates can account for 5–15% of total API spend. Second, high TTFT is often a symptom of large input prompts, which directly increase cost. The same 10,000-token context that causes 800 ms TTFT also costs $0.025 in input tokens per request on GPT-4o. Optimizing TTFT by reducing prompt size thus reduces both latency and cost simultaneously. Third, the factors that improve TTFT (prompt caching, smaller models, shorter prompts) also happen to reduce cost. Prompt caching reduces input token cost by 50–90% while also reducing TTFT by 80–85%. Model routing to faster, cheaper models reduces both per-token cost and TTFT. This alignment means that TTFT optimization and cost optimization are complementary rather than competing objectives.
How does input size affect TTFT and is there a linear relationship?
Input size affects TTFT primarily through the prefill phase, and the relationship is approximately linear but not perfectly so. Doubling the input token count roughly doubles the prefill time, which translates to a proportional increase in the prefill component of TTFT. However, TTFT has fixed-cost components (network round-trip, API gateway processing, first decode step) that do not scale with input size, so total TTFT increases sub-linearly at small input sizes and approaches linearity at larger sizes. Empirically, for GPT-4o, the prefill rate is approximately 15,000–25,000 tokens per second, meaning each additional 1,000 input tokens adds roughly 40–65 ms to TTFT. For Claude 3.5 Sonnet, the prefill rate is approximately 10,000–18,000 tokens per second, so each additional 1,000 tokens adds roughly 55–100 ms. At very large input sizes (50,000+ tokens), some providers experience non-linear slowdown due to attention computation scaling quadratically with sequence length, though many modern models use optimized attention mechanisms (e.g., Flash Attention, multi-query attention) that maintain near-linear scaling up to the context window limit. The practical takeaway: reducing input by 5,000 tokens typically saves 200–500 ms of TTFT depending on the model, which is highly significant for interactive applications.

Related Terms

Latency

The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.


P95 / P99 Latency

Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.


Throughput

The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.


Prompt Caching

A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.


LLM Observability

The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.


Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.


AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.