Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Why It Matters for AI Costs
TTFT is the difference between an AI application that feels instant and one that feels broken. Human perception research consistently shows that latency thresholds drive user behavior:
| TTFT Range | User Perception | Behavioral Impact |
|---|---|---|
| < 200 ms | Instantaneous | User perceives no wait; maximum engagement |
| 200–500 ms | Fast | Feels responsive; standard for well-optimized apps |
| 500–1,000 ms | Noticeable delay | User notices the wait but tolerates it with streaming |
| 1,000–2,000 ms | Slow | User attention begins to wander; consider loading indicators |
| 2,000–5,000 ms | Frustrating | Measurable increase in tab switches and abandonment |
| > 5,000 ms | Broken | Users assume an error occurred; high abandonment |
For AI-powered products, TTFT directly impacts key business metrics. A study of conversational AI interfaces found that reducing P50 TTFT from 1,200 ms to 400 ms increased message-send rates by 22% and session duration by 15%. Users who experience fast TTFT send more messages, explore more features, and are more likely to convert to paid plans.
TTFT also has cost implications. When TTFT is high, users often retry their requests (assuming the first attempt failed), generating duplicate LLM calls that double cost without adding value. At scale, retry-induced duplicate requests can account for 5–15% of total API spend. Monitoring and optimizing TTFT reduces both user frustration and wasted spend.
From an operational perspective, TTFT is a leading indicator of infrastructure health. A sudden spike in P95 TTFT often indicates GPU saturation at the provider, queuing delays from rate limiting, or a regression in prompt size that increased prefill time. CostHawk tracks TTFT at the per-request level and alerts when percentile TTFT deviates from your baseline, enabling rapid response to latency degradations before they impact user experience.
What Is TTFT?
Time to First Token is the wall-clock duration from the moment your application sends an HTTP request to an LLM API endpoint to the moment it receives the first byte of the streamed response containing a generated token. The measurement point is precise:
- Start: The instant the HTTP request leaves your application (specifically, when the request body has been fully sent and the TCP socket is waiting for a response)
- End: The instant the first SSE data: event containing a delta (content token) is received by your application
TTFT captures several phases of processing that occur before the model begins generating output:
- Network transit (outbound): Your request travels from your server to the provider's API gateway. Typical time: 10–50 ms depending on geographic distance and network conditions.
- API gateway processing: The provider validates your API key, checks rate limits, and routes the request to an inference server. Typical time: 5–20 ms.
- Queue wait: If all inference GPUs are busy, your request waits in a queue. This is the most variable component — it can be 0 ms under light load or 2,000+ ms during peak traffic or provider incidents.
- Prefill computation: The model processes your entire input prompt, computing attention across all input tokens to build the key-value cache. This is the core computational step. Typical time: 50–500 ms, scaling roughly linearly with input token count.
- First token generation: The model samples the first output token from its probability distribution. Typical time: 5–20 ms (one decode step).
- Network transit (return): The first SSE event travels back to your application. Typical time: 10–50 ms.
In practice, TTFT is dominated by prefill computation and queue wait. A well-optimized request with a short prompt and no queuing might achieve 100–200 ms TTFT. A long prompt during peak traffic might see 3,000+ ms TTFT, with most of that time split between queuing and prefill.
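Because the phases are sequential, TTFT is simply their sum. The sketch below expresses that additive model; the component values are illustrative assumptions drawn from the typical ranges above, not measurements:

```typescript
// Illustrative TTFT component model; all values are assumed examples, in ms
interface TtftComponents {
  networkOutbound: number
  gatewayProcessing: number
  queueWait: number
  prefill: number
  firstDecodeStep: number
  networkReturn: number
}

// TTFT is the sum of the sequential phases
function estimateTtft(c: TtftComponents): number {
  return (
    c.networkOutbound +
    c.gatewayProcessing +
    c.queueWait +
    c.prefill +
    c.firstDecodeStep +
    c.networkReturn
  )
}

// Light load, short prompt: sums to 160 ms, dominated by prefill
const lightLoad = estimateTtft({
  networkOutbound: 20,
  gatewayProcessing: 10,
  queueWait: 0,
  prefill: 100,
  firstDecodeStep: 10,
  networkReturn: 20,
})

// Peak traffic, long prompt: sums to 2,560 ms, dominated by queue wait
const peakLoad = estimateTtft({
  networkOutbound: 20,
  gatewayProcessing: 10,
  queueWait: 2000,
  prefill: 500,
  firstDecodeStep: 10,
  networkReturn: 20,
})
```

Plugging in the two scenarios makes the point concrete: under load, queue wait alone can exceed every other component combined.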
The measurement must be taken at the client side to capture the full end-to-end experience. Server-side TTFT measurements miss network transit time and do not reflect what the user actually experiences.
TTFT vs Total Latency
TTFT and total latency (time to last token, TTLT) measure different aspects of the user experience and are affected by different factors. Understanding the distinction is essential for choosing the right optimization strategy.
| Metric | Measures | Depends On | Matters Most For |
|---|---|---|---|
| TTFT | Time until first token appears | Input size, queue wait, prefill speed, network | Streaming UX, perceived responsiveness |
| TTLT | Time until entire response is complete | TTFT + (output tokens × decode time per token) | Non-streaming apps, batch processing, total wait time |
| Inter-Token Latency (ITL) | Time between consecutive tokens during generation | Model decode speed, KV cache size, GPU utilization | Streaming smoothness, perceived generation speed |
Example scenario: A user sends a 2,000-token prompt and the model generates a 400-token response.
- TTFT: 380 ms (prefill dominates)
- Inter-token latency: 25 ms/token average
- Decode time: 400 tokens × 25 ms = 10,000 ms
- Total latency (TTLT): 380 + 10,000 = 10,380 ms
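The arithmetic above generalizes to a simple formula, TTLT = TTFT + output tokens × inter-token latency. A minimal sketch:

```typescript
// Total latency (TTLT) = TTFT + decode time for every output token
function estimateTtlt(
  ttftMs: number,
  outputTokens: number,
  interTokenLatencyMs: number,
): number {
  return ttftMs + outputTokens * interTokenLatencyMs
}

// The scenario from the text: 380 ms TTFT, 400 tokens at 25 ms/token
const totalMs = estimateTtlt(380, 400, 25) // 10380
```

This also shows why trimming output length is the lever for TTLT: halving the output tokens halves the decode term but leaves TTFT untouched.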
In a streaming scenario, the user sees the first word at 380 ms and watches text flow in over 10 seconds. This feels responsive because there is continuous visual feedback. In a non-streaming scenario, the user waits 10.4 seconds with no feedback, which feels painfully slow.
This is why streaming fundamentally changes which metric matters. For streaming applications, optimizing TTFT has the most impact on perceived performance. For non-streaming applications (batch processing, API-to-API calls, background jobs), total latency matters more than TTFT because there is no user watching a cursor.
When TTFT and TTLT diverge dramatically: A request with 200 ms TTFT but 30,000 ms TTLT indicates a short prompt (fast prefill) but very long output. Conversely, a request with 2,000 ms TTFT but 3,000 ms TTLT indicates a long prompt (slow prefill) with short output. These profiles require different optimization strategies:
- High TTFT, low TTLT → optimize input (shorter prompts, prompt caching, smaller context)
- Low TTFT, high TTLT → optimize output (set max_tokens, request concise responses, use a faster model for generation)
CostHawk tracks both TTFT and TTLT for every traced request, making it easy to diagnose which dimension of latency is dominating for each endpoint or use case.
TTFT Benchmarks by Model
TTFT varies significantly across models and providers. The table below shows approximate P50 TTFT benchmarks for common models under typical production conditions (1,000-token input prompt, no queuing delay, US-based client). These benchmarks are gathered from public performance reports and community benchmarks as of early 2026:
| Model | Provider | P50 TTFT | P95 TTFT | Notes |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | ~180 ms | ~450 ms | Fastest OpenAI model; excellent for latency-sensitive apps |
| GPT-4o | OpenAI | ~320 ms | ~800 ms | Good balance of speed and capability |
| Gemini 2.0 Flash | Google | ~150 ms | ~380 ms | Among the fastest models available; optimized for speed |
| Gemini 1.5 Pro | Google | ~350 ms | ~900 ms | Longer context window increases prefill time |
| Claude 3.5 Haiku | Anthropic | ~200 ms | ~500 ms | Anthropic's speed-optimized model |
| Claude 3.5 Sonnet | Anthropic | ~400 ms | ~1,100 ms | Higher capability comes with higher TTFT |
| Claude 3 Opus | Anthropic | ~800 ms | ~2,500 ms | Largest Anthropic model; slowest TTFT |
| o1 | OpenAI | ~2,000 ms | ~8,000 ms | Reasoning tokens add significant delay before visible output |
| Llama 3 70B (self-hosted, 4xA100) | Self-hosted | ~250 ms | ~600 ms | Depends heavily on hardware and batching configuration |
Important caveats:
- These benchmarks assume a 1,000-token input. The prefill component of TTFT scales roughly linearly with input size, but fixed overheads (network, gateway, first decode step) do not, so a 10,000-token input typically sees roughly 3–5x the TTFT of a 1,000-token input rather than a full 10x.
- P95 TTFT is often 2–3x the P50 due to queuing variability. During provider capacity constraints, P95 can spike to 5–10x the P50.
- Reasoning models (o1, o3-mini) have uniquely high TTFT because they generate internal "thinking" tokens before producing visible output. The thinking phase can take seconds or even minutes, and the first visible token only appears after thinking completes.
- Prompt caching dramatically reduces TTFT when the cache is hit. Anthropic reports 80–85% reduction in TTFT for cached prompts. OpenAI's caching provides similar benefits.
- Geographic distance adds 50–150 ms of network latency. An application in Singapore calling a US-based API endpoint will see consistently higher TTFT than one in Virginia.
Use these benchmarks as starting points, then measure your actual TTFT in production. CostHawk records TTFT for every streaming request and computes P50, P95, and P99 percentiles by model, endpoint, and time period so you can establish and track your own baseline.
Factors Affecting TTFT
TTFT is influenced by a chain of factors from your application through the network to the provider's GPU fleet. Understanding each factor helps you identify which ones are within your control and which require provider-level changes.
1. Input prompt size (high impact, within your control). The prefill phase — where the model processes your input to build the key-value cache — scales roughly linearly with the number of input tokens. A 500-token prompt might require 80 ms of prefill on GPT-4o, while a 10,000-token prompt might require 600 ms. This is the factor you have the most control over. Reducing input size through prompt optimization, summarization of conversation history, or selective context injection directly reduces TTFT. Every 1,000 tokens you trim from the input saves approximately 30–80 ms of TTFT depending on the model.
2. Prompt caching (high impact, within your control). When the provider's prompt cache contains the prefill computation for your prompt prefix (typically the system prompt and any static context), the model can skip the prefill for the cached portion and begin generating output much faster. Anthropic's prompt caching reduces TTFT by 80–85% on cache hits. OpenAI's caching provides comparable acceleration. Structuring your prompts so that the static portion comes first (system prompt, few-shot examples) and the variable portion comes last (user query) maximizes cache hit rates and minimizes TTFT.
3. Queue wait time (high impact, outside your control). When a provider's inference fleet is at capacity, requests queue. Queue wait time is added directly to TTFT and is the primary cause of TTFT spikes. During peak hours or provider incidents, queue times can exceed 5,000 ms. You can mitigate this by: using provisioned throughput (reserved capacity) for critical workloads, implementing multi-provider failover (if Anthropic is slow, route to OpenAI), or using batch endpoints for non-latency-sensitive workloads to avoid competing for real-time capacity.
4. Model size (medium impact, within your control via model selection). Larger models have more parameters, which means more computation during both prefill and decode. Economy models (GPT-4o mini, Gemini Flash, Claude Haiku) consistently deliver lower TTFT than their larger counterparts. Choosing a smaller model for latency-sensitive operations is one of the most effective TTFT optimizations.
5. Network latency (medium impact, partially within your control). The round-trip time between your server and the provider's API endpoint adds directly to TTFT. Deploying your application in the same cloud region as the provider's primary inference infrastructure (typically US East for OpenAI, US East or Europe for Anthropic) minimizes network latency. Using the provider's edge endpoints or CDN-accelerated endpoints (where available) can further reduce network transit time by 20–60 ms.
6. Request overhead (low impact, within your control). Large tool/function definitions, extensive JSON schemas, and verbose system prompts all increase the request payload size and processing overhead. While smaller than the prefill impact, minimizing unnecessary request metadata contributes to lower TTFT at the margins. Strip unused tool definitions, minimize JSON schema verbosity, and avoid sending tools that the model is unlikely to need for a given request.
Optimizing TTFT
Reducing TTFT requires a systematic approach that addresses each contributing factor. Here are seven optimization strategies, ordered by typical impact:
1. Enable prompt caching. This is the single highest-impact TTFT optimization for most applications. Both Anthropic and OpenAI support automatic prompt caching where repeated prompt prefixes are cached on the provider's infrastructure. Anthropic caches prompt prefixes of 1,024+ tokens automatically and reduces TTFT by 80–85% on cache hits. To maximize cache hit rate: place your system prompt and static context at the beginning of the message array, keep the static prefix consistent across requests, and put the variable user query at the end. A typical RAG application with a 3,000-token system prompt and 2,000-token static few-shot examples can achieve 85%+ cache hit rates, reducing TTFT from ~500 ms to ~100 ms for the cached portion.
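One way to structure requests for cache hits, sketched against Anthropic's Messages API. The cache_control placement follows Anthropic's documented prompt-caching pattern, but the system prompt text and function name here are illustrative; verify field names against the current API reference:

```typescript
// Static, identical on every request — this is the cacheable prefix
// (illustrative placeholder text)
const STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."

// Build a request body whose static prefix is cache-eligible,
// while the variable user query comes last.
function buildCacheableRequest(userQuery: string) {
  return {
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: STATIC_SYSTEM_PROMPT,
        // Marks the end of the cacheable prefix (Anthropic prompt caching)
        cache_control: { type: "ephemeral" },
      },
    ],
    // Variable content goes last so it never invalidates the cached prefix
    messages: [{ role: "user", content: userQuery }],
    stream: true,
  }
}

const req = buildCacheableRequest("How do I reset my password?")
```

The key design choice is ordering: anything that changes per request (the user query, retrieved documents that vary) must come after the cache marker, or every request misses the cache.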
2. Reduce input prompt length. Every token in your input adds prefill time. Audit your prompts for redundancy, verbosity, and unnecessary context. Common culprits include: overly detailed system prompts that repeat instructions the model already follows, full conversation history when a summary would suffice, retrieved context chunks that are marginally relevant, and verbose JSON schemas for tool definitions. Teams that audit their prompts typically find 30–50% of input tokens can be eliminated without quality loss, reducing TTFT proportionally.
3. Select faster models for latency-sensitive operations. If your application has a chatbot interface where TTFT is critical, route those requests to a fast model (GPT-4o mini, Gemini Flash, Claude Haiku). Reserve slower, more capable models for background analysis or complex reasoning tasks where latency is less important. A well-implemented model routing strategy can achieve sub-200 ms TTFT for interactive use cases while still using frontier models for tasks that need them.
4. Implement multi-provider failover. When one provider experiences high latency, route to another. A simple implementation sends the request to your primary provider, and if TTFT exceeds a threshold (e.g., 1,500 ms without receiving the first token), cancels the request and retries with a secondary provider. More sophisticated approaches use a load balancer that tracks real-time TTFT per provider and routes new requests to the provider with the lowest current TTFT.
```typescript
import Anthropic from "@anthropic-ai/sdk"
import OpenAI from "openai"

const anthropic = new Anthropic()
const openai = new OpenAI()

async function llmCallWithFailover(prompt: string) {
  // Abort the primary request if the stream hasn't opened within 1,500 ms
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 1500)
  try {
    return await anthropic.messages.create(
      {
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
        stream: true,
      },
      { signal: controller.signal },
    )
  } catch (e) {
    // Primary timed out or errored — fail over to the secondary provider
    return openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    })
  } finally {
    clearTimeout(timeout)
  }
}
```

5. Use provisioned throughput. For production workloads with consistent volume, provisioned throughput eliminates queue wait time by reserving dedicated GPU capacity. OpenAI, Anthropic, and Google all offer provisioned throughput tiers. The cost is higher per token (you pay for reserved capacity whether you use it or not), but the latency is consistently low because your requests skip the shared queue.
6. Warm the inference path. Some providers cache model weights in GPU memory per endpoint. Infrequently used model-region combinations may require loading model weights from disk, adding seconds to the first request. If your application uses a less common model, consider sending periodic keepalive requests to keep the inference path warm.
7. Optimize client-side measurement. Ensure your TTFT measurement is accurate by capturing the timestamp before the HTTP request is sent (not before prompt construction) and capturing the first token timestamp when the first SSE data: event with content arrives (not the HTTP response headers). Inaccurate measurement leads to misguided optimization efforts.
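A measurement wrapper along these lines captures the timestamp rules described above. This is a sketch: the Chunk shape mirrors OpenAI-style delta.content chunks, so adapt the field access to your SDK's stream type:

```typescript
// Minimal chunk shape; real SDKs nest this (e.g. chunk.choices[0].delta.content)
interface Chunk {
  content?: string // generated text carried by this chunk, if any
}

// Measure TTFT over any async stream of chunks. The timer starts when this
// function is called (i.e., just before the request is dispatched) and stops
// at the first chunk carrying actual generated content — empty setup events
// and empty deltas are ignored, per the measurement rules above.
async function measureTtft(
  stream: AsyncIterable<Chunk>,
): Promise<{ ttftMs: number | null; text: string }> {
  const start = Date.now()
  let ttftMs: number | null = null
  let text = ""
  for await (const chunk of stream) {
    if (chunk.content) {
      if (ttftMs === null) ttftMs = Date.now() - start
      text += chunk.content
    }
  }
  return { ttftMs, text }
}
```

Call measureTtft immediately around the API call so prompt-construction time stays out of the measurement; if no content chunk ever arrives, ttftMs stays null, which is itself worth alerting on.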
Monitoring TTFT
Effective TTFT monitoring requires tracking percentile distributions, not just averages. A P50 TTFT of 300 ms can mask a P99 of 5,000 ms — and it is the P99 experience that drives user complaints and abandonment.
Key metrics to track:
- P50 TTFT: The median experience. This is what most users see most of the time. Target: under 500 ms for interactive applications.
- P95 TTFT: The tail experience affecting 1 in 20 requests. Users who consistently land in the P95 will perceive your application as slow. Target: under 1,500 ms.
- P99 TTFT: The worst-case experience for 1 in 100 requests. Important for SLA compliance. Target: under 3,000 ms.
- TTFT by model: Break down TTFT percentiles by model to identify which models are contributing to latency. This data directly informs model routing decisions.
- TTFT by input token count: Correlate TTFT with input size to quantify the prefill cost per token and identify requests with unnecessarily large inputs.
- TTFT by time of day: Provider capacity varies throughout the day. US business hours typically have higher TTFT than overnight periods due to demand patterns.
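Percentiles can be computed from raw TTFT samples with a simple nearest-rank calculation. This is a minimal sketch; production systems typically use streaming sketches such as t-digest rather than sorting every sample:

```typescript
// Nearest-rank percentile over raw latency samples (ms):
// the smallest sample with at least p% of samples at or below it
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples")
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Ten illustrative TTFT samples with a heavy tail
const ttftSamples = [120, 150, 180, 200, 250, 300, 400, 900, 1500, 4800]
const p50 = percentile(ttftSamples, 50) // 250
const p95 = percentile(ttftSamples, 95) // 4800
```

Note how the median looks healthy (250 ms) while the P95 is nearly 20x worse — exactly the averaging trap the section warns about.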
Alerting strategy:
Set tiered alerts that escalate with severity:
| Alert Level | Condition | Action |
|---|---|---|
| Warning | P95 TTFT > 2x baseline for 5 minutes | Notify on-call engineer via Slack |
| Critical | P95 TTFT > 4x baseline for 5 minutes | Page on-call; consider activating failover |
| Emergency | P50 TTFT > 3x baseline for 10 minutes | Activate multi-provider failover; escalate to provider support |
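The tiered thresholds above can be expressed as a small classifier. This sketch checks only the baseline multiples; the "for 5/10 minutes" sustained-window condition is omitted for brevity and would be layered on top:

```typescript
type AlertLevel = "ok" | "warning" | "critical" | "emergency"

// Classify current percentile TTFT against the baseline multiples in the
// table. Emergency keys off P50 (the median itself is degraded); warning
// and critical key off P95 (the tail is degraded).
function classifyTtft(
  p50Ms: number,
  p95Ms: number,
  baselineP50: number,
  baselineP95: number,
): AlertLevel {
  if (p50Ms > 3 * baselineP50) return "emergency"
  if (p95Ms > 4 * baselineP95) return "critical"
  if (p95Ms > 2 * baselineP95) return "warning"
  return "ok"
}

// Baseline: P50 300 ms, P95 1,000 ms. A P95 of 2,500 ms is 2.5x baseline.
const level = classifyTtft(350, 2500, 300, 1000) // "warning"
```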
Dashboard design:
A TTFT monitoring dashboard should include: (1) a time-series chart showing P50, P95, and P99 TTFT over the last 24 hours, with the baseline overlaid for comparison; (2) a heatmap showing TTFT distribution by hour of day to reveal temporal patterns; (3) a breakdown by model showing which models contribute the most to tail latency; (4) a scatter plot of TTFT vs input token count to visualize the prefill cost curve.
CostHawk captures TTFT for every streaming request, computes all standard percentiles, and provides pre-built dashboards for TTFT monitoring. Anomaly detection uses a rolling-window statistical model that accounts for time-of-day and day-of-week patterns, reducing false alerts while catching genuine regressions. When TTFT spikes, CostHawk's trace drill-down shows whether the spike was caused by increased input size, provider queuing, or a specific model, enabling rapid root-cause identification.
Frequently Asked Questions
What is a good TTFT for a production AI chatbot?
How does prompt caching affect TTFT?
Why is my TTFT much higher than the model benchmarks suggest?
How does TTFT differ for reasoning models like o1 and o3?
Should I optimize TTFT or total latency?
How do I measure TTFT accurately in my application?
In JavaScript, capture a timestamp with Date.now() immediately before calling openai.chat.completions.create({ stream: true }), then capture another timestamp inside the stream iterator when the first chunk with delta.content arrives. The difference is your TTFT. Be careful to exclude SSE connection-setup events and empty deltas — only count the first chunk that contains actual generated content. In Python with the Anthropic SDK: capture time.monotonic() before the API call and again when the first ContentBlockDelta event arrives. Use time.monotonic() rather than time.time() to avoid clock-synchronization artifacts. Common measurement mistakes include: starting the timer before prompt construction (inflates TTFT with client-side processing time), counting HTTP response headers as the first token (headers arrive before any generated content), and using wall-clock timestamps across async contexts where the event loop may introduce jitter. CostHawk measures TTFT server-side at the proxy layer with microsecond precision, providing a ground-truth measurement that you can compare against client-side measurements to quantify network transit time.
Does TTFT affect cost?
How does input size affect TTFT and is there a linear relationship?
Related Terms
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
P95 / P99 Latency
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50-90% on subsequent requests.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
