Glossary · Observability · Updated 2026-03-16

Throughput

The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.

Definition

What is Throughput?

Throughput in the context of LLM applications measures the rate at which a system processes AI workloads over a given time period. It is expressed in three primary units depending on the context:

  • Requests per second (RPS) — the number of complete API calls processed per second; useful for capacity planning and load balancing.
  • Tokens per second (TPS) — the total tokens (input + output) processed per second; useful for understanding raw model utilization.
  • Tokens per minute (TPM) — the unit most LLM API providers use for rate limits and quota allocations.

Throughput is distinct from latency: latency measures how long a single request takes, while throughput measures how many requests the system handles in aggregate. A system can have high throughput with high latency (many slow requests processed in parallel) or low throughput with low latency (few fast requests processed sequentially). For teams consuming LLM APIs, throughput is constrained by two factors: the provider's rate limits (how many requests or tokens per minute you are allowed to send) and your application's concurrency capacity (how many simultaneous requests your architecture can manage). Understanding and optimizing throughput is essential because it determines both the scalability of your AI features and the economics of your AI spend — throughput directly maps to tokens consumed, which directly maps to dollars spent.

Impact

Why It Matters for AI Costs

Throughput is the bridge between your AI features and your infrastructure budget. Every token processed per second has a direct dollar cost, and at scale, small throughput inefficiencies compound into significant waste. Consider the economics of a production AI application processing 500 requests per minute:

Metric | Value
Request volume | 500 req/min (720,000 req/day)
Avg input tokens per request | 1,500
Avg output tokens per request | 400
Total tokens per day | 1.37 billion
Model | GPT-4o ($2.50/$10.00 per MTok)
Daily input cost | 1.08B tokens × $2.50/MTok = $2,700
Daily output cost | 288M tokens × $10.00/MTok = $2,880
Daily total | $5,580
Monthly total | $167,400

At this scale, a 10% throughput optimization — processing the same work with 10% fewer tokens — saves $16,740 per month. A 30% reduction through model routing, prompt optimization, and caching saves $50,220 per month. These are not hypothetical numbers; they represent the actual economics facing teams with production AI workloads.

Throughput also determines whether your AI features can scale to meet demand. If your rate limit is 60,000 TPM and each request consumes 2,000 tokens, you can process at most 30 requests per minute — about one every 2 seconds. If product growth pushes demand to 100 requests per minute, you will hit rate limits, queue requests, and degrade the user experience unless you increase throughput capacity (higher rate limits, multiple API keys, or provider diversification).

The hidden cost of throughput constraints is wasted spend on retry storms. When an application hits a rate limit (HTTP 429), the typical response is to retry with exponential backoff. Each retry re-sends the same tokens, consuming throughput capacity and cost without delivering value. A 5% rate limit hit rate that triggers one retry per blocked request effectively adds 5% to your token spend — and if the retry also hits the rate limit, the cost compounds further.
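The exponential backoff pattern described above can be sketched as follows. This is a minimal illustration, not a production client: `RateLimitError` is a stand-in for the provider SDK's HTTP 429 exception, and `call` is any function that makes one LLM request. The jitter factor spreads retries out so that many blocked clients do not all retry at the same instant and re-trigger the rate limit.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 (Too Many Requests) error."""

def call_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors with exponential backoff and jitter.

    Each retry re-sends the same request, so every retry consumes extra
    throughput capacity — which is exactly the hidden cost discussed above.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff (1s, 2s, 4s, ...) with +/-50% jitter.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

In a real client, the provider's `Retry-After` header (when present) should take precedence over the computed delay.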

CostHawk monitors throughput in real time, tracking RPS, TPS, and TPM alongside cost per token. This allows you to identify throughput bottlenecks, predict when you will hit rate limits based on growth trends, and optimize token efficiency to get more value from every unit of throughput capacity.

What is Throughput?

Throughput quantifies the productive capacity of your LLM-powered system. While the concept is simple — how much work gets done per unit of time — measuring throughput for LLM applications requires understanding which unit of measurement is appropriate for which context.

Requests per second (RPS) is the most intuitive throughput metric. It counts complete API calls regardless of their size. RPS is useful for capacity planning ("can our system handle 100 concurrent users?"), load balancer configuration, and application-level monitoring. However, RPS can be misleading for cost analysis because requests vary enormously in size: a classification request consuming 200 tokens and a document analysis request consuming 50,000 tokens are both "one request" but differ 250x in cost.

Tokens per second (TPS) measures the raw token processing rate of the system. This is the most granular throughput metric and the most useful for understanding model utilization and cost rate. TPS can be broken into input TPS (tokens sent to the model per second) and output TPS (tokens generated by the model per second). Because output tokens cost 4-5x more than input tokens, a system processing 1,000 output TPS costs significantly more than one processing 1,000 input TPS.

Tokens per minute (TPM) is the unit used by most LLM API providers for rate limits. OpenAI specifies rate limits in both RPM (requests per minute) and TPM. Anthropic uses tokens per minute for its rate tier definitions. Understanding your TPM consumption relative to your TPM allowance is essential for avoiding rate limit errors that degrade performance and waste money through retries.

Effective throughput vs raw throughput: An important distinction is between the total requests or tokens your system processes (raw throughput) and the requests or tokens that produce useful output (effective throughput). Retried requests, rate-limited requests, and requests that produce errors or unusable output consume throughput capacity and cost money but do not deliver value. If 8% of your requests are retries and 3% produce errors, your effective throughput is only 89% of your raw throughput — meaning 11% of your token spend is waste. CostHawk tracks both raw and effective throughput, highlighting the gap as an optimization opportunity.

Throughput as a cost rate: The most actionable way to think about throughput is as a cost rate — dollars per minute or dollars per hour. If your system processes 100,000 input tokens and 100,000 output tokens per minute on GPT-4o, your cost rate is approximately $0.25/minute input + $1.00/minute output = $1.25/minute, or $75/hour. Viewing throughput in dollar terms makes it immediately clear when something changes: if your cost rate jumps from $75/hour to $120/hour, you know throughput increased by 60% — and you need to understand whether that increase is driven by legitimate demand growth or by a technical problem (retry storm, prompt bloat, model routing error).

Measuring Throughput

Accurate throughput measurement requires instrumentation at multiple levels of your stack. Here is a practical guide to measuring each throughput dimension:

Application-level measurement (RPS):

Measure RPS at your API gateway or load balancer by counting the number of outbound LLM API calls per second. If you use CostHawk's proxy, this is automatic — every request is counted and timestamped. For custom measurement, instrument your LLM client wrapper to emit a counter metric on every call. Key considerations:

  • Count requests at the point of API call, not at the point of user request. A single user action might trigger multiple LLM calls (router + generator + guardrail = 3 RPS from 1 user request).
  • Separate successful requests from failed requests. A 429 rate limit error still consumed an attempt but did not produce useful output.
  • Track RPS by model, endpoint, and API key for granular attribution.

Token-level measurement (TPS/TPM):

Extract token counts from every API response's usage object. Aggregate input tokens and output tokens separately over your desired time window. The formula is straightforward:

Input TPS = sum(input_tokens for all requests in window) / window_seconds
Output TPS = sum(output_tokens for all requests in window) / window_seconds
Total TPM = sum(input_tokens + output_tokens for all requests in 1-minute window)

Be precise about what you are counting: some providers include system prompt tokens in the input count, some count them separately. Reasoning model thinking tokens are typically included in the output token count even though they do not appear in the visible response. Always use the provider-reported token counts rather than client-side estimates for billing accuracy.
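A minimal aggregation helper, assuming responses carry a usage object shaped like `{"input_tokens": ..., "output_tokens": ...}` — field names vary by provider (OpenAI uses `prompt_tokens`/`completion_tokens`), so adapt the keys to yours and, per the advice above, always use the provider-reported counts:

```python
def aggregate_tokens(responses):
    """Sum provider-reported token counts over a batch of API responses.

    `responses` is a list of dicts, each with a `usage` dict. Divide the
    totals by the window length (seconds or minutes) to get TPS or TPM.
    """
    input_total = sum(r["usage"]["input_tokens"] for r in responses)
    output_total = sum(r["usage"]["output_tokens"] for r in responses)
    return {
        "input_tokens": input_total,    # tokens sent to the model
        "output_tokens": output_total,  # tokens generated by the model
        "total_tokens": input_total + output_total,
    }
```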

Cost-rate measurement ($/min):

The most actionable throughput metric multiplies token throughput by pricing:

Cost rate ($/min) = (Input TPM × input_price_per_token) + (Output TPM × output_price_per_token)

For a system processing 50,000 input TPM and 15,000 output TPM on GPT-4o:

Cost rate = (50,000 × $0.0000025) + (15,000 × $0.000010) = $0.125 + $0.150 = $0.275/min = $16.50/hour
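The cost-rate formula translates directly into code. Prices here are expressed per million tokens (MTok), matching how providers publish them:

```python
def cost_rate_per_minute(input_tpm, output_tpm,
                         input_price_per_mtok, output_price_per_mtok):
    """Convert token throughput (TPM) into a dollar cost rate per minute."""
    return (input_tpm * input_price_per_mtok / 1_000_000
            + output_tpm * output_price_per_mtok / 1_000_000)
```

For the worked example above — 50,000 input TPM and 15,000 output TPM at GPT-4o's $2.50/$10.00 per MTok — this returns $0.275/min, i.e. $16.50/hour.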

CostHawk computes and displays cost rate in real time, allowing you to see your AI spend velocity at any moment — not just as a daily or monthly aggregate.

Throughput benchmarking:

Establish throughput baselines during normal operation so you can detect anomalies. Track your peak throughput (highest TPM in any 1-minute window over the past 30 days), average throughput (mean TPM during business hours), and ratio of peak to average. A peak-to-average ratio above 3:1 suggests bursty traffic patterns that may benefit from request queuing, batch processing, or provider-side burst capacity provisions.
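The baseline metrics above reduce to a small summary over per-minute TPM samples — a sketch, assuming you already collect one TPM value per minute:

```python
def throughput_baseline(tpm_samples):
    """Summarize per-minute TPM samples into peak, average, and peak:average.

    A peak-to-average ratio above ~3:1 suggests bursty traffic that may
    benefit from queuing, batching, or burst capacity provisions.
    """
    peak = max(tpm_samples)
    avg = sum(tpm_samples) / len(tpm_samples)
    return {"peak": peak, "average": avg, "peak_to_average": peak / avg}
```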

Throughput and Cost Scaling

The relationship between throughput and cost is deceptively simple at first glance — more throughput means more tokens, which means more money — but the details of how costs scale with throughput contain critical nuances that affect budgeting, architecture, and optimization strategy.

Linear scaling (the baseline):

At the most basic level, LLM API costs scale linearly with token throughput. Double your TPM and you double your cost. Process 10 million tokens per day and you pay 10x what you paid at 1 million tokens per day. There are no volume discounts in standard on-demand API pricing — your ten-millionth token costs exactly the same as your first. This linear relationship means that throughput growth must be carefully monitored and budgeted. A feature that works fine at 10,000 requests per day ($150/day) might become untenable at 100,000 requests per day ($1,500/day) without any code changes.

Sub-linear scaling opportunities:

Several techniques can make your cost scale sub-linearly with throughput — meaning costs grow more slowly than token volume:

  • Prompt caching: As throughput increases, cache hit rates tend to improve because more requests share common prefixes. A system prompt cache that serves 60% of requests at 1,000 RPM might serve 85% at 10,000 RPM (because the cache stays warm with higher traffic). This means input token costs grow slower than throughput — more requests, but a higher fraction use cached tokens (up to 90% cheaper, depending on the provider's cache pricing).
  • Semantic caching: Caching complete responses for identical or semantically similar queries can eliminate LLM calls entirely. At low throughput, few queries repeat. At high throughput, the probability of cache hits increases because the query distribution follows a power law — a small number of popular queries account for a large share of traffic. A semantic cache with a 15% hit rate at 1,000 RPM might achieve 25% at 10,000 RPM, effectively making 25% of your throughput free.
  • Batch API pricing: OpenAI's Batch API and Anthropic's Message Batches API offer 50% discounts for workloads that can tolerate up to 24 hours of latency. At high throughput, more workloads become batch-eligible because the pipeline processes enough volume to fill batches efficiently. A team processing 100,000 documents per day might batch 60% of them at half price, while a team processing 1,000 documents per day cannot fill batches efficiently.

Super-linear scaling risks:

Conversely, some patterns cause costs to scale faster than throughput — a dangerous dynamic that leads to budget blowouts:

  • Retry amplification: At low throughput, you are well below rate limits and retries are rare. As throughput approaches rate limit thresholds, rate limit errors increase, triggering retries that further increase throughput, which triggers more rate limits — a positive feedback loop. If 5% of requests at 80% rate limit utilization trigger retries, that might jump to 15% at 95% utilization, adding 10% more effective token cost.
  • Context accumulation: In conversational applications, higher throughput often means more active conversations, which means longer conversation histories being sent as context. If average context length grows from 2,000 tokens at 1,000 active conversations to 5,000 tokens at 5,000 active conversations, your input token cost per request has more than doubled even if request volume only increased 5x.
  • Infrastructure overhead: At high throughput, you may need to add proxy layers, load balancers, logging pipelines, and monitoring infrastructure that add fixed costs on top of per-token API costs. These infrastructure costs create a per-request overhead that does not exist at lower volumes.

CostHawk's throughput-cost correlation dashboards show you whether your costs are scaling linearly, sub-linearly, or super-linearly with throughput — giving you early warning if your scaling economics are deteriorating.

Provider Throughput Limits

Every LLM API provider imposes throughput limits (rate limits) that constrain how fast you can process requests. These limits vary by provider, model, pricing tier, and account type. Hitting these limits is one of the most common causes of production incidents in AI applications. Here are the current rate limit structures for major providers as of March 2026:

Provider | Tier | RPM Limit | TPM Limit | Notes
OpenAI | Free | 3 | 40,000 | Per model. 200 RPD on some models.
OpenAI | Tier 1 ($5+ spend) | 500 | 200,000 | Increases automatically with spend history.
OpenAI | Tier 3 ($100+ spend) | 5,000 | 2,000,000 | Most production workloads land here.
OpenAI | Tier 5 ($1,000+ spend) | 10,000 | 30,000,000 | High-volume production tier.
Anthropic | Build (Free) | 50 | 40,000 | Per model, across all models.
Anthropic | Build ($5+ credit) | 1,000 | 400,000 | Scales with spend. Per-model limits.
Anthropic | Scale (Custom) | 4,000+ | 4,000,000+ | Negotiated. Custom limits per model.
Google | Free | 15 | 1,000,000 | Gemini models. Generous token limits.
Google | Pay-as-you-go | 1,000 | 4,000,000 | Per model. Higher for Flash models.
Mistral | Standard | 1,000 | 2,000,000 | Per endpoint. Varies by model.

Understanding rate limit mechanics:

Rate limits are enforced per minute (rolling window) and typically apply per model and per API key. If your rate limit is 5,000 RPM for GPT-4o, you can send at most 5,000 requests to GPT-4o in any 60-second window. Requests to GPT-4o mini have their own separate limit. Some providers use a sliding window (track requests over a rolling 60-second period), while others use fixed 1-minute windows.

When you exceed a rate limit, the provider returns an HTTP 429 (Too Many Requests) response with a Retry-After header indicating how long to wait. This response still consumes a network round-trip but does not consume tokens or incur billing — you are only charged for successfully processed requests.

The real-world impact of rate limits:

Rate limits interact with throughput in ways that create production problems:

  • Headroom erosion: If your normal throughput is 3,500 RPM on a 5,000 RPM limit, you have 30% headroom for traffic spikes. A product launch, marketing event, or viral moment that pushes traffic up 40% will breach the limit, causing request failures and degraded user experience.
  • Burst penalties: Even if your average throughput is well below the limit, bursts can exceed it momentarily. A batch job that sends 2,000 requests in 10 seconds uses 12,000 RPM instantaneous throughput, which will trigger rate limits even if the average over the full minute is under 5,000.
  • Multi-key strategies: Some teams distribute requests across multiple API keys to multiply their effective rate limit. This works but adds complexity and may violate provider terms of service if used to circumvent intentional limits. CostHawk tracks throughput per key so you can monitor the load distribution across keys.

Optimizing for Throughput

Throughput optimization is about getting more useful work done per unit of time and cost. The strategies fall into two categories: increasing raw capacity (more requests per minute) and increasing efficiency (more value per request).

1. Request batching (maximize tokens per request)

Instead of sending one document per API call, batch multiple items into a single request. If you need to classify 100 emails, you can send them in batches of 10 with a prompt like "Classify each of the following 10 emails." This reduces request overhead (fewer HTTP round-trips, fewer per-message token overheads for system prompts) and stays further from RPM rate limits. The tradeoff is that a single failed batch request loses all 10 results, and latency for the first result increases because you wait for the entire batch. Practical batch sizes range from 5-20 items depending on the total token count — stay well within the model's context window and output token limits.

2. Asynchronous processing (decouple throughput from latency)

For workloads that do not require real-time responses, decouple the user-facing request from the LLM call using a message queue (Redis, SQS, RabbitMQ). The user submits a request, receives an immediate acknowledgment, and the LLM call is processed asynchronously. This allows you to smooth out traffic spikes (queue requests during bursts, process them at a steady rate), process requests during off-peak hours when rate limits have more headroom, and use batch APIs (50% discount) for latency-tolerant workloads. CostHawk's time-of-day throughput analysis helps you identify optimal processing windows.

3. Provider diversification (multiply capacity)

Using multiple LLM providers effectively multiplies your throughput capacity. If you have 5,000 RPM on OpenAI and 4,000 RPM on Anthropic, your combined capacity is 9,000 RPM. Route requests to the provider with the most available capacity, failing over to the secondary when the primary is congested. This also provides resilience — if one provider has an outage, traffic shifts to the other. The complexity cost is maintaining compatible prompts and handling different response formats across providers. Tools like Portkey and LiteLLM provide unified APIs that abstract provider differences.

4. Caching (eliminate redundant throughput)

Every cached response is a request you did not need to make, effectively increasing your throughput capacity by the cache hit rate. Implement caching at two levels: exact-match caching for identical queries (simple hash-based lookup, suitable for FAQ bots and search auto-suggest), and semantic caching for similar queries (embedding-based similarity search, suitable for customer support and knowledge bases where many users ask the same question in different words). A 20% cache hit rate on a 5,000 RPM workload eliminates 1,000 RPM of LLM calls — equivalent to adding 1,000 RPM of rate limit capacity at zero marginal cost.

5. Prompt optimization (reduce tokens per request)

Shorter prompts consume fewer tokens per request, which means your TPM rate limit accommodates more requests. If you reduce average input tokens from 2,000 to 1,200, the same 2,000,000 TPM limit now supports 1,667 RPM instead of 1,000 RPM — a 67% throughput increase with no infrastructure changes. Prompt optimization compounds with all other throughput strategies: every request that passes through your system benefits from shorter prompts, whether it is a real-time user request, a batched background job, or a retry.

6. Model routing for throughput (use different limits)

Different models have different rate limits. If you are hitting the GPT-4o rate limit at 5,000 RPM but GPT-4o mini allows 10,000 RPM, routing simple tasks to the cheaper model not only saves money but also frees up GPT-4o capacity for complex tasks. This is throughput optimization through demand shaping — moving workloads to where capacity is available rather than demanding more capacity where you are constrained.

Throughput Monitoring

Monitoring throughput is essential for maintaining service quality, controlling costs, and planning capacity. Here is a comprehensive monitoring framework for LLM throughput:

Real-time metrics to track:

  • Current RPS/RPM: The instantaneous request rate, smoothed over a 1-minute window. This is your primary capacity utilization metric. Display it alongside your rate limit to show utilization percentage (e.g., 3,200/5,000 RPM = 64% utilization).
  • Current TPS/TPM: Token throughput, split by input and output. Since output tokens cost more, a spike in output TPS has a bigger cost impact than a spike in input TPS of the same magnitude.
  • Rate limit utilization: Current throughput as a percentage of your rate limit. Alert when utilization exceeds 80% to give yourself headroom for traffic spikes. CostHawk calculates this automatically based on your provider tier and current throughput.
  • Request queue depth: If you use asynchronous processing, the number of pending requests in the queue. A growing queue indicates throughput demand exceeding processing capacity — you need to increase concurrency, add provider capacity, or optimize token efficiency.
  • Error rate by type: Break down errors into rate limit errors (429), server errors (500/503), timeout errors, and content filter errors. Each type has a different throughput implication: rate limit errors directly indicate throughput saturation, server errors indicate provider issues, and timeouts indicate latency problems that reduce effective throughput.

Trend analysis metrics:

  • Peak throughput trend: Track your daily peak RPM over the past 30, 60, and 90 days. If peak throughput is growing 15% month-over-month, you will hit your rate limit in N months — calculate when and plan capacity increases before you get there.
  • Throughput efficiency: Track the ratio of effective throughput (requests that produced useful output) to raw throughput (all requests including retries and errors). If this ratio is declining, an increasing share of your throughput — and cost — is waste.
  • Tokens per request trend: Average input and output tokens per request over time. An upward trend means each request consumes more throughput capacity, effectively reducing the number of requests you can process per minute even if your rate limit has not changed.
  • Cost per token trend: If you use multiple models, your blended cost per token changes as traffic shifts between models. Track this to detect unintentional model routing changes that increase effective cost.

Alerting rules:

  • Rate limit proximity: Alert when RPM or TPM exceeds 80% of the rate limit for any model. This gives you a 20% buffer to respond before users are affected.
  • Rate limit errors: Alert immediately when any rate limit error (HTTP 429) is returned. Even a single 429 in production indicates you are at the boundary and more will follow.
  • Throughput anomaly: Alert when current RPM exceeds 2x the trailing 7-day average for the current hour. This catches both legitimate traffic spikes (that need capacity planning) and technical issues (retry storms, runaway batch jobs).
  • Cost rate spike: Alert when the cost rate ($/hour) exceeds the budgeted rate by more than 30%. This is the ultimate throughput alert — it catches any issue (more requests, bigger requests, more expensive models, more retries) that increases spend velocity.
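The alerting rules above can be expressed as a single evaluation function — a sketch, with thresholds matching the rules as stated:

```python
def throughput_alerts(current_rpm, rpm_limit, trailing_avg_rpm,
                      cost_rate, budget_rate):
    """Evaluate the throughput alerting rules against current metrics."""
    alerts = []
    if current_rpm > 0.8 * rpm_limit:
        alerts.append("rate_limit_proximity")  # >80% of the rate limit
    if current_rpm > 2 * trailing_avg_rpm:
        alerts.append("throughput_anomaly")    # >2x trailing 7-day hourly avg
    if cost_rate > 1.3 * budget_rate:
        alerts.append("cost_rate_spike")       # >30% over budgeted $/hour
    return alerts
```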

CostHawk provides all of these metrics and alerts out of the box. Its real-time dashboard shows current throughput, rate limit utilization, cost rate, and error breakdown in a single view. Anomaly detection analyzes throughput patterns using statistical models trained on your historical data, so alerts are calibrated to your specific traffic patterns rather than generic thresholds.

FAQ

Frequently Asked Questions

What is the difference between RPS, TPS, and TPM?
These three metrics measure throughput at different granularities, and each is useful in different contexts. RPS (requests per second) counts the number of complete API calls processed per second. It is the most intuitive metric for application-level monitoring and capacity planning — 'can our system handle 50 concurrent users making 2 requests per second each?' However, RPS treats all requests equally, even though a 200-token classification request and a 50,000-token analysis request have vastly different costs and processing requirements. TPS (tokens per second) counts the raw tokens processed per second, giving a more accurate picture of actual model utilization. TPS is split into input TPS and output TPS, which matters because output tokens cost 4-5x more. TPM (tokens per minute) is essentially TPS multiplied by 60, but it is the standard unit used by LLM API providers for rate limits. OpenAI, Anthropic, and most other providers express their rate limits in TPM (and RPM). When monitoring throughput, use all three: RPS for application health, TPS for real-time cost rate monitoring, and TPM for rate limit utilization tracking. CostHawk displays all three simultaneously so you always have the right metric for whatever question you are investigating.
How do rate limits affect throughput in practice?
Rate limits create a hard ceiling on your throughput that manifests in production as HTTP 429 errors when you exceed the limit. The practical impact depends on how your application handles these errors. Most applications implement exponential backoff retry logic: when a 429 is received, wait 1 second and retry, then 2 seconds, then 4 seconds, up to a maximum. This means that hitting a rate limit does not cause permanent failures — requests eventually succeed — but it introduces delays and waste. Each retried request consumes a network round-trip and delays the response, even though the rejected request itself is not billed. The more insidious effect is throughput degradation: when you hit rate limits, your effective throughput drops below the limit because retries consume both time and rate limit budget. If 10% of your requests trigger retries at a 5,000 RPM limit, your effective throughput is closer to 4,500 successful RPM because the retried requests consume 500 RPM of capacity on their second attempt. At scale, this retry overhead compounds with traffic spikes, creating a feedback loop where rate limits cause retries that consume more rate limit capacity that triggers more rate limits. The solution is proactive throughput management: monitor utilization, implement client-side rate limiting (throttle before hitting the server-side limit), and distribute traffic across providers or time windows.
How do I calculate the throughput I need for my application?
Start with your user-facing requirements and work backward to token throughput. The formula is: Required TPM = (Concurrent users × Requests per user per minute) × (Avg input tokens + Avg output tokens). For example, a customer support chatbot with 200 concurrent users, each sending 2 messages per minute, with 1,500 input tokens and 400 output tokens per request: 200 × 2 × (1,500 + 400) = 760,000 TPM. Add a 50% buffer for traffic spikes: 760,000 × 1.5 = 1,140,000 TPM. Check this against your provider's rate limit. On OpenAI Tier 3, GPT-4o allows 2,000,000 TPM — so you have headroom. On Anthropic's standard Build tier, Claude 3.5 Sonnet allows 400,000 TPM — you would need to request a limit increase or use the Scale tier. Also consider peak vs average: if your traffic doubles during business hours, your peak TPM requirement might be 2x the average. Always provision for peak, not average. CostHawk's throughput forecasting tools project future TPM requirements based on your growth rate, helping you plan rate limit increases and budget adjustments before you hit capacity constraints.
What happens when I exceed my provider's rate limit?
When you exceed a rate limit, the provider returns an HTTP 429 (Too Many Requests) response. The exact behavior varies by provider. OpenAI returns a 429 with a Retry-After header indicating how many seconds to wait before retrying. The request is not processed and you are not charged for tokens. Anthropic returns a 429 with a similar retry indication and a message specifying which limit was exceeded (RPM or TPM). Google returns a 429 with resource exhausted details. All providers also set rate limit headers on successful responses (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, etc.) so you can proactively detect when you are approaching the limit. The downstream impact on your application depends on your error handling. With proper retry logic (exponential backoff with jitter), users experience increased latency but eventually get responses. Without retry logic, users see errors. The cost impact is primarily indirect: retried requests consume time and throughput capacity, and if your application queues requests during rate limit periods, the queue depth can grow, increasing response times for all users. The most costly scenario is a retry storm where aggressive retry logic amplifies the problem — each failed request generates multiple retries that all hit the same rate limit, consuming your entire rate limit budget on retries rather than new requests.
How does batch processing affect throughput and cost?
Batch processing is one of the most effective strategies for optimizing both throughput and cost for workloads that can tolerate latency. OpenAI's Batch API and Anthropic's Message Batches API accept large sets of requests and process them asynchronously within a 24-hour window, offering a 50% discount on per-token pricing. For throughput, batch processing has two key advantages: (1) batch requests have separate, higher rate limits than real-time requests, so they do not compete with your production traffic for rate limit capacity; (2) the provider optimizes batch processing for efficiency, often achieving higher aggregate throughput than you could with individual real-time requests. For a team spending $10,000/month on AI APIs, shifting 40% of workloads to batch processing saves $2,000/month — the equivalent of a 20% overall cost reduction with zero quality impact. Workloads well-suited for batching include document processing pipelines, nightly analytics runs, bulk classification or extraction jobs, content generation queues, and evaluation or testing suites. The tradeoff is latency: batch results may take minutes to hours. Design your application to separate latency-sensitive requests (user-facing chat, real-time search) from latency-tolerant requests (background processing, scheduled jobs) and route the latter through batch APIs. CostHawk tracks batch vs real-time spending separately so you can measure the savings from your batching strategy.
What is the relationship between throughput and auto-scaling?
For applications consuming LLM APIs (as opposed to self-hosting models), throughput auto-scaling operates differently than traditional infrastructure auto-scaling. In traditional systems, you scale by adding more compute instances when demand increases. With LLM APIs, the compute is managed by the provider — your scaling constraint is the rate limit, not your own infrastructure. This means auto-scaling for LLM applications focuses on three dimensions: (1) Connection pool scaling: Increasing the number of concurrent HTTP connections to the LLM API as request volume grows. Most HTTP clients default to 5-10 concurrent connections, which limits your RPS regardless of rate limits. Scale this to match your throughput needs. (2) Multi-provider routing: Automatically shifting traffic to a secondary provider when the primary provider's rate limit is approached. This effectively auto-scales your rate limit ceiling. (3) Queue-based buffering: When throughput demand exceeds rate limits, queue excess requests and process them as capacity becomes available. The queue depth becomes your auto-scaling signal — when it exceeds a threshold, escalate to a higher rate limit tier, add a provider, or alert the team. For self-hosted models, traditional auto-scaling applies: add more GPU instances when TPS demand exceeds current capacity. The economics differ from CPU auto-scaling because GPU instances are expensive ($1-4/hour each), so auto-scaling decisions must be cost-aware.
How do I monitor throughput across multiple providers?
Multi-provider throughput monitoring requires normalizing metrics across providers that report them differently. The challenges include: different rate limit structures (some providers limit RPM and TPM separately, others combine them), different response metadata formats (each provider's usage object has slightly different field names and semantics), and different billing cycles that make cost aggregation complex. The solution is a unified observability layer that ingests telemetry from all providers and normalizes it into a consistent format. CostHawk does this automatically — it tracks RPS, TPS, TPM, and cost rate for every provider in a single dashboard. For custom implementations, build a centralized telemetry pipeline that extracts the following from every API response regardless of provider: timestamp, provider name, model name, input token count, output token count, latency (TTFT and total), status code, and computed cost. Aggregate these into a time-series database and build dashboards that show both per-provider views (for capacity planning against each provider's rate limits) and unified views (for total throughput and total cost rate). The unified view is essential for answering questions like 'what is our total AI throughput right now?' and 'if OpenAI goes down, can Anthropic absorb the traffic?' CostHawk's multi-provider dashboard answers both questions at a glance, showing current throughput and rate limit utilization per provider alongside the aggregate view.
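The normalization step can be sketched like this. The field names follow the usage objects in OpenAI chat completions responses (`prompt_tokens`/`completion_tokens`) and Anthropic messages responses (`input_tokens`/`output_tokens`); the `NormalizedUsage` schema itself is an invented example, and a real pipeline would also capture timestamp, latency, status code, and computed cost as the text describes.

```python
from dataclasses import dataclass


@dataclass
class NormalizedUsage:
    provider: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def normalize(provider: str, response: dict) -> NormalizedUsage:
    """Map each provider's usage object onto one schema; every new provider
    gets its own branch here rather than leaking its field names downstream."""
    usage = response["usage"]
    if provider == "openai":       # usage: prompt_tokens / completion_tokens
        return NormalizedUsage(provider, response["model"],
                               usage["prompt_tokens"], usage["completion_tokens"])
    if provider == "anthropic":    # usage: input_tokens / output_tokens
        return NormalizedUsage(provider, response["model"],
                               usage["input_tokens"], usage["output_tokens"])
    raise ValueError(f"no normalizer for provider: {provider}")


a = normalize("openai", {"model": "gpt-4o-mini",
                         "usage": {"prompt_tokens": 900, "completion_tokens": 100}})
b = normalize("anthropic", {"model": "claude-sonnet",
                            "usage": {"input_tokens": 700, "output_tokens": 300}})
print(a.total_tokens + b.total_tokens)  # 2000 tokens across both providers
```

Once every response is in this shape, the per-provider and unified dashboard views are just two aggregations over the same records.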
What throughput levels justify moving from API to self-hosted models?
The crossover point where self-hosting becomes cheaper than API access depends on the model, your sustained throughput volume, and your team's infrastructure capabilities. As a rough guide, compare your monthly API bill to the all-in cost of a self-hosted deployment that can actually serve the same volume. Take Llama 3 70B as an open-weight alternative to GPT-4o mini ($0.15/$0.60 per MTok): at a 50/50 input/output mix, 100M tokens/day costs roughly $37.50/day, or about $1,125/month, via API. Self-hosting Llama 3 70B on 4x A100 GPUs runs approximately $4,320/month in GPU rental alone, so the GPU bill does not break even until sustained volume approaches roughly 400M tokens/day at that mix — and only if the cluster's throughput can actually serve that load. For mid-tier model equivalents the crossover arrives at much lower token volumes, because per-token API prices are an order of magnitude higher, but the quality comparison is harder to validate. The hidden costs of self-hosting that offset the per-token savings include: DevOps engineering time ($15,000-25,000/month in fully loaded salary), GPU instance management and monitoring, model serving framework maintenance (vLLM, TGI), auto-scaling infrastructure, and the opportunity cost of engineering time spent on inference infrastructure instead of product features. For most teams spending under $5,000-10,000/month on API costs, the total cost of ownership for self-hosting exceeds API costs. CostHawk's throughput and cost trending helps you track your trajectory toward the crossover point so you can plan the migration deliberately rather than reactively.
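One way to sanity-check the crossover for your own numbers is a break-even calculation. The sketch below compares only the API bill against the GPU rental bill — it deliberately ignores the hidden ops costs, which push the real break-even higher — and the $4,320/month figure, the GPT-4o mini prices, and the 50/50 token mix are assumptions for illustration; the answer moves substantially with the mix and the rental price.

```python
def api_cost_per_mtok(price_in: float, price_out: float, output_share: float) -> float:
    """Blended $/MTok given the output share of total tokens processed."""
    return price_in * (1 - output_share) + price_out * output_share


def break_even_tokens_per_day(self_host_monthly: float, blended_per_mtok: float) -> float:
    """Daily token volume at which the API bill equals the self-hosting bill
    (30-day month, GPU rental only -- no ops or engineering costs)."""
    mtok_per_month = self_host_monthly / blended_per_mtok
    return mtok_per_month * 1_000_000 / 30


# GPT-4o mini pricing at a 50/50 input/output mix vs. $4,320/month GPU rental.
blended = api_cost_per_mtok(0.15, 0.60, output_share=0.5)
tokens = break_even_tokens_per_day(4_320, blended)
print(round(blended, 3), round(tokens / 1e6))  # $0.375/MTok blended, ~384M tokens/day
```

Rerunning this with your actual token mix and a mid-tier model's prices is the fastest way to see whether the crossover is years away or already behind you.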

Related Terms

Rate Limiting

Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.

Read more

Latency

The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.

Read more

Tokens Per Second (TPS)

The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.

Read more

Provisioned Throughput

Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.

Read more

Batch API

Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.

Read more

LLM Observability

The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.

Read more

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.