Glossary · Infrastructure · Updated 2026-03-16

Provisioned Throughput

Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.

Definition

What is Provisioned Throughput?

Provisioned throughput is a capacity reservation model where an organization pre-purchases dedicated AI inference compute — measured in Provisioned Throughput Units (PTUs) for OpenAI, or Reserved Capacity for other providers — rather than paying per token on demand. Each PTU represents a fixed amount of model-specific inference capacity measured in tokens per minute. The provisioned capacity is dedicated to your organization, guaranteeing consistent latency and throughput regardless of other customers' usage patterns. OpenAI PTUs are available directly and through Azure OpenAI Service, with Azure pricing at approximately $0.60 per PTU per hour for GPT-4o-class models on monthly reservations. Provisioned throughput becomes cost-effective when sustained on-demand spending exceeds approximately $10,000-$15,000 per month on a single model.

Impact

Why It Matters for AI Costs

For high-volume AI applications, the pay-per-token model becomes increasingly expensive as traffic grows. Provisioned throughput flips the economics: instead of costs scaling linearly with usage, you pay a fixed monthly fee for a guaranteed capacity block, and any tokens processed within that capacity are effectively pre-paid. At sufficient scale, provisioned throughput costs 30-60% less per token than on-demand pricing. Beyond cost, provisioned throughput eliminates two operational headaches: rate limit contention (your dedicated capacity cannot be throttled by shared infrastructure load) and latency variability (dedicated GPUs provide consistent response times). CostHawk tracks provisioned capacity utilization alongside on-demand spend, helping you optimize the split between reserved and on-demand capacity.

What Is Provisioned Throughput?

Provisioned throughput is the AI inference equivalent of reserved instances in cloud computing. Instead of paying per token at the standard rate, you commit to a fixed amount of compute capacity for a defined period (typically 1 month or 1 year) at a discounted rate.

The concept is straightforward: if you know you will process 500 million tokens per month through GPT-4o, you can pre-purchase capacity to handle that volume at a lower effective per-token cost than the on-demand rate. The provider allocates dedicated GPU resources to your organization, ensuring that your traffic is never queued behind other customers' workloads.

Key characteristics of provisioned throughput:

  • Dedicated capacity — Your provisioned compute is not shared. Other customers' traffic spikes do not affect your latency or availability.
  • Guaranteed throughput — Each PTU provides a defined tokens-per-minute capacity. You can plan capacity with precision.
  • Fixed cost — Monthly cost is determined by the number of PTUs reserved, not by actual token consumption. Unused capacity is wasted.
  • Model-specific — PTUs are allocated for a specific model. GPT-4o PTUs cannot be used for GPT-4o-mini or vice versa.
  • Commitment required — Most provisioned offerings require a minimum 1-month commitment. Longer commitments (6 or 12 months) offer deeper discounts.

How PTUs Work

A Provisioned Throughput Unit (PTU) is a unit of model-specific inference capacity. The number of tokens per minute each PTU delivers varies by model because different models have different computational requirements per token.

For OpenAI models via Azure OpenAI Service, the approximate capacity per PTU is:

| Model | Input TPM per PTU | Output TPM per PTU |
|---|---|---|
| GPT-4o | ~2,500 | ~833 |
| GPT-4o-mini | ~37,000 | ~12,333 |
| GPT-4.1 | ~2,500 | ~833 |
| GPT-4.1-mini | ~11,000 | ~3,667 |

These numbers are approximate and vary based on request characteristics (longer prompts consume more memory, reducing effective throughput). The critical insight is that output tokens consume approximately 3x the compute of input tokens, so output-heavy workloads require more PTUs than input-heavy workloads of the same total token volume.

Sizing example: If your application processes 250,000 input tokens per minute and 50,000 output tokens per minute of GPT-4o traffic:

  • Input PTUs needed: 250,000 / 2,500 = 100 PTUs
  • Output PTUs needed: 50,000 / 833 = 60 PTUs
  • Total PTUs needed: max(100, 60) = 100 PTUs (the constraint is typically on one dimension)

In practice, you need to account for traffic variability. If your peak traffic is 2x your average, you need PTUs to cover the peak unless you are willing to spill excess traffic to on-demand (a hybrid approach discussed later).
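The sizing arithmetic above can be captured in a small helper. This is a sketch only: the default per-PTU throughput figures are the approximate GPT-4o values from the table and vary by model and request shape, and `peak_multiplier` is a simple way to provision for peaks rather than averages.

```python
import math

def ptus_needed(input_tpm, output_tpm,
                input_tpm_per_ptu=2_500,   # approximate GPT-4o figure
                output_tpm_per_ptu=833,    # approximate GPT-4o figure
                peak_multiplier=1.0):      # e.g. 2.0 to cover a 2x peak
    """Estimate the PTUs required for a traffic level.

    Capacity is constrained by whichever dimension (input or output)
    needs more PTUs, per the sizing example above.
    """
    input_ptus = math.ceil(input_tpm * peak_multiplier / input_tpm_per_ptu)
    output_ptus = math.ceil(output_tpm * peak_multiplier / output_tpm_per_ptu)
    return max(input_ptus, output_ptus)

print(ptus_needed(250_000, 50_000))                       # 100, as in the example
print(ptus_needed(250_000, 50_000, peak_multiplier=2.0))  # 200 to cover a 2x peak
```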

PTU allocation is not instantaneous. Requesting new PTU capacity can take hours to days depending on current GPU availability. Plan capacity changes well in advance of expected traffic increases.

Pricing and Breakeven Analysis

The fundamental question with provisioned throughput is: at what spend level does reserved capacity become cheaper than on-demand pricing?

Azure OpenAI PTU pricing (representative):

| Commitment | Price per PTU per Hour | Monthly Cost per PTU | Discount vs. Monthly |
|---|---|---|---|
| Monthly reservation | ~$0.60 | ~$438 | Baseline |
| 6-month reservation | ~$0.45 | ~$328.50 | ~25% discount |
| 12-month reservation | ~$0.36 | ~$262.80 | ~40% discount |

Breakeven calculation for GPT-4o:

One PTU provides approximately 2,500 input TPM or 833 output TPM. Let us calculate for a balanced workload (3:1 input-to-output ratio):

  • 1 PTU running at full utilization processes approximately 2,500 input TPM plus 833 output TPM of continuous throughput
  • Monthly tokens per PTU: 2,500 input TPM x 60 min x 24 hr x 30 days = ~108M input tokens/month, plus ~36M output tokens/month
  • On-demand cost for the same volume: (108M / 1M x $2.50) + (36M / 1M x $10.00) = $270 + $360 = $630/month
  • Monthly PTU cost: ~$438 (monthly reservation) or ~$262.80 (12-month reservation)

At full utilization, one PTU therefore replaces $630 of on-demand spend for a fixed cost of $262.80-$438, a saving of roughly 30-58%. The catch is utilization: PTU sizing must cover your capacity around the clock, while on-demand billing only charges for actual usage. The breakeven utilization is roughly 70% on a monthly reservation ($438 / $630) and roughly 42% on a 12-month reservation ($262.80 / $630); below those levels, the equivalent on-demand traffic costs less than the fixed PTU fee.

The realistic breakeven point is when on-demand spending exceeds approximately $10,000-$15,000 per month on a single model with consistent, predictable traffic patterns. Below this threshold, on-demand pricing is almost always cheaper because you only pay for what you use. Above this threshold, the economics shift in favor of provisioned capacity, especially with 6-12 month commitments.

Real-world example:

| Scenario | On-Demand Monthly Cost | Provisioned Monthly Cost (12mo) | Monthly Savings |
|---|---|---|---|
| 50 PTUs, high utilization (85%) | $26,775 | $13,140 | $13,635 (51%) |
| 50 PTUs, moderate utilization (60%) | $18,900 | $13,140 | $5,760 (30%) |
| 50 PTUs, low utilization (30%) | $9,450 | $13,140 | -$3,690 (loss) |

The table illustrates the critical factor: utilization. Provisioned throughput only saves money when utilization is consistently high. If you reserve 50 PTUs but only use 30% of the capacity, you are paying for 70% idle compute.
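The utilization sensitivity can be reduced to a single breakeven number. A sketch, taking $630/month as the on-demand value of one fully utilized GPT-4o PTU (from the breakeven calculation above) and the 12-month per-PTU cost implied by the scenario table ($13,140 / 50 PTUs):

```python
ON_DEMAND_PER_PTU = 630.00     # on-demand $/month replaced by one fully utilized GPT-4o PTU
PROVISIONED_PER_PTU = 262.80   # 12-month $/month per PTU implied by the table ($13,140 / 50)

# Utilization at which the fixed PTU fee equals the on-demand cost of the traffic it serves
breakeven_utilization = PROVISIONED_PER_PTU / ON_DEMAND_PER_PTU
print(f"breakeven utilization: {breakeven_utilization:.0%}")  # ~42% at the 12-month rate

def monthly_savings(ptus, utilization):
    """Savings vs. on-demand for a reservation at a given sustained utilization."""
    on_demand_cost = ptus * ON_DEMAND_PER_PTU * utilization
    return on_demand_cost - ptus * PROVISIONED_PER_PTU

print(round(monthly_savings(50, 0.85), 2))  # 13635.0, matching the 85% row
print(round(monthly_savings(50, 0.30), 2))  # -3690.0, matching the 30% row
```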

Provisioned vs On-Demand Comparison

The following comprehensive comparison helps you decide which model fits your workload:

| Factor | On-Demand (Pay-Per-Token) | Provisioned Throughput |
|---|---|---|
| Pricing model | Pay per token consumed | Fixed monthly fee for reserved capacity |
| Cost at low volume | Lower (you only pay for what you use) | Higher (paying for capacity even when idle) |
| Cost at high volume | Higher (linear cost scaling) | Lower (fixed cost regardless of volume within capacity) |
| Breakeven point | N/A | ~$10K-$15K/month sustained spend |
| Latency | Variable (affected by shared infrastructure load) | Consistent (dedicated resources) |
| Rate limits | Shared, tier-based (RPM/TPM caps) | Determined by provisioned capacity (no shared limits) |
| Capacity guarantee | Best-effort within rate limits | Guaranteed throughput up to provisioned capacity |
| Flexibility | Scale up/down instantly with traffic | Capacity changes take hours/days |
| Commitment | None (pay as you go) | Minimum 1-month, deeper discounts at 6/12 months |
| Model switching | Use any model, switch instantly | PTUs locked to specific model |
| Traffic variability handling | Excellent (costs match usage) | Poor (idle capacity is wasted) |
| Billing predictability | Low (varies with usage) | High (fixed monthly cost) |
| Minimum viable spend | $0 (free tier available) | ~$260-$440/month per PTU |

The decision framework is straightforward: if your AI spend on a single model is predictable, sustained, and exceeds $10K-$15K per month, evaluate provisioned throughput. If your traffic is variable, unpredictable, or spread across multiple models, stay on-demand. Most organizations benefit from a hybrid approach that combines both.

Capacity Planning

Accurate capacity planning is the difference between provisioned throughput saving 50% and costing more than on-demand. The goal is to provision enough capacity for your sustained baseline while handling peaks through on-demand overflow.

Step 1: Measure your traffic pattern

Before provisioning, collect at least 30 days of per-minute token consumption data, broken down by model. CostHawk's usage dashboard provides this data. Key metrics:

  • Average TPM — Your baseline sustained throughput. This is the minimum capacity to consider provisioning.
  • P95 TPM — The throughput at the 95th percentile. This represents your typical peak.
  • P99 TPM — The throughput at the 99th percentile. This represents rare spikes.
  • Peak-to-average ratio — How spiky your traffic is. A ratio of 2:1 means peaks are double your average. Ratios above 3:1 make provisioning challenging.
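These metrics are straightforward to derive from raw per-minute samples. A minimal sketch using a simple nearest-rank percentile; in practice your monitoring stack likely reports these directly:

```python
import statistics

def traffic_profile(tpm_samples):
    """Summarize per-minute token counts into the planning metrics above."""
    ordered = sorted(tpm_samples)
    n = len(ordered)

    def pct(p):  # simple nearest-rank percentile
        return ordered[min(n - 1, int(p * n))]

    avg = statistics.mean(ordered)
    return {
        "average_tpm": avg,
        "p95_tpm": pct(0.95),
        "p99_tpm": pct(0.99),
        "peak_to_average": max(ordered) / avg if avg else 0.0,
    }

# Example with synthetic data: 100 minutes of steadily rising traffic
profile = traffic_profile(list(range(1, 101)))
print(profile)
```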

Step 2: Choose your provisioning point

The optimal provisioning point depends on your tolerance for on-demand overflow:

  • Conservative: Provision for P50 (median) traffic. ~50% of requests use provisioned capacity; ~50% overflow to on-demand. Safest approach for first-time provisioning.
  • Balanced: Provision for P75 traffic. ~75% of requests use provisioned capacity. Good balance of savings and risk.
  • Aggressive: Provision for P95 traffic. ~95% of requests use provisioned capacity. Maximum savings but high risk of idle capacity during off-peak.
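Combining these percentile targets with the per-PTU throughput figures gives a PTU count per strategy. A hypothetical sketch that sizes on the input dimension only; a fuller version would size output TPM the same way and take the max, as in the sizing example earlier:

```python
# Map a provisioning strategy to a PTU count using a nearest-rank percentile
# over per-minute input-token samples (illustrative only).
PROVISIONING_POINTS = {"conservative": 0.50, "balanced": 0.75, "aggressive": 0.95}

def ptus_for_strategy(input_tpm_samples, strategy, input_tpm_per_ptu=2_500):
    ordered = sorted(input_tpm_samples)
    p = PROVISIONING_POINTS[strategy]
    target_tpm = ordered[min(len(ordered) - 1, int(p * len(ordered)))]
    return -(-target_tpm // input_tpm_per_ptu)  # ceiling division

# Example: 100 minutes of traffic ranging from 0 to 99,000 input TPM
samples = list(range(0, 100_000, 1_000))
for strategy in PROVISIONING_POINTS:
    print(strategy, ptus_for_strategy(samples, strategy))
```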

Step 3: Model traffic growth

Factor in expected growth over your commitment period. If you sign a 12-month reservation and expect traffic to double over that period, provision for the average expected traffic over the full year, not just today's traffic. Under-provisioning early and over-provisioning late roughly balance out.

Step 4: Account for seasonality

If your traffic has seasonal patterns (higher during business hours, lower on weekends, spikes during marketing campaigns), provisioned capacity should cover your sustained baseline, not your seasonal peaks. Use on-demand overflow for seasonal spikes.

Hybrid Provisioned + On-Demand Strategy

The most cost-effective approach for most organizations is a hybrid strategy that combines provisioned capacity for baseline traffic with on-demand overflow for peaks. This captures the bulk of provisioned savings while maintaining the flexibility to handle traffic variability.

How hybrid works:

  1. Provision enough PTUs to handle your sustained baseline traffic (typically P50-P75 of your traffic distribution).
  2. Configure your application to route requests to provisioned capacity first.
  3. When provisioned capacity is fully utilized, overflow requests automatically route to on-demand endpoints.
  4. On-demand requests pay standard per-token pricing.
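The overflow logic in steps 2-4 can be sketched at the application layer. `call_provisioned`, `call_on_demand`, and the stub clients below are hypothetical; real code would wrap your provider SDK's completion call for each deployment and translate its HTTP 429 rate-limit error into the exception used here:

```python
class CapacityExceeded(Exception):
    """Raised when the provisioned deployment signals HTTP 429."""

def route_request(prompt, call_provisioned, call_on_demand):
    try:
        return call_provisioned(prompt)   # dedicated capacity first
    except CapacityExceeded:
        return call_on_demand(prompt)     # spill over to pay-per-token

# Stub clients for illustration
def saturated_provisioned(prompt):
    raise CapacityExceeded()

def on_demand(prompt):
    return f"on-demand handled: {prompt}"

print(route_request("summarize this ticket", saturated_provisioned, on_demand))
```

Azure handles this routing for you when a standard deployment is configured as the fallback; the application-layer version is only needed when the provider does not.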

Cost optimization example:

| Metric | Pure On-Demand | Pure Provisioned (P95) | Hybrid (P50 Provisioned) |
|---|---|---|---|
| Monthly traffic | 500M tokens | 500M tokens | 500M tokens |
| Provisioned capacity | 0 PTUs | 100 PTUs | 50 PTUs |
| Provisioned cost/mo | $0 | $26,280 | $13,140 |
| On-demand overflow/mo | $25,000 | $500 | $6,250 |
| Total monthly cost | $25,000 | $26,780 | $19,390 |
| Savings vs on-demand | Baseline | -7% (loss) | 22.4% |

In this example, pure provisioned at P95 actually costs more than on-demand because the over-provisioned capacity is wasted. The hybrid approach at P50 saves 22.4% by efficiently combining cheap provisioned base capacity with flexible on-demand overflow.
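A quick sketch that reproduces the totals and the -7% / 22.4% figures from the dollar amounts in the table:

```python
def total_cost(provisioned_cost, overflow_cost):
    """Total monthly spend = fixed reservation fee + on-demand overflow."""
    return provisioned_cost + overflow_cost

# Dollar figures from the cost optimization example above
scenarios = {
    "pure on-demand": total_cost(0, 25_000),
    "pure provisioned (P95)": total_cost(26_280, 500),
    "hybrid (P50 provisioned)": total_cost(13_140, 6_250),
}
baseline = scenarios["pure on-demand"]
for name, cost in scenarios.items():
    savings = (baseline - cost) / baseline
    print(f"{name}: ${cost:,} ({savings:+.1%} vs on-demand)")
```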

Implementation considerations:

  • Azure OpenAI supports hybrid routing natively — configure a provisioned deployment as the primary and a standard deployment as the fallback. Azure handles the overflow automatically.
  • OpenAI direct API provisioned throughput requires coordinating with OpenAI sales. Hybrid routing must be implemented in your application layer.
  • CostHawk monitoring tracks provisioned utilization and on-demand overflow separately, showing you the exact cost split and helping you optimize the provisioning level over time.

Review your hybrid split monthly. If on-demand overflow consistently exceeds 40% of total traffic, consider increasing your provisioned capacity. If provisioned utilization consistently drops below 60%, consider reducing PTUs to avoid paying for idle capacity.
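These review thresholds can be encoded as a simple rule of thumb. The 40% overflow and 60% utilization cutoffs are the guidelines from this section, not provider recommendations:

```python
def rightsizing_advice(provisioned_utilization, overflow_share):
    """Apply the monthly review thresholds described above.

    provisioned_utilization: fraction of reserved capacity actually used
    overflow_share: fraction of total traffic served by on-demand overflow
    """
    if overflow_share > 0.40:
        return "increase provisioned capacity"
    if provisioned_utilization < 0.60:
        return "reduce PTUs to cut idle capacity"
    return "keep current reservation"

print(rightsizing_advice(0.85, 0.45))  # increase provisioned capacity
print(rightsizing_advice(0.45, 0.10))  # reduce PTUs to cut idle capacity
```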

CostHawk's provisioned throughput dashboard shows daily utilization curves, overflow rates, and projected savings for different provisioning levels, making it easy to right-size your reservation at every renewal period.

FAQ

Frequently Asked Questions

What is a PTU (Provisioned Throughput Unit)?
A PTU is a unit of model-specific AI inference capacity offered by OpenAI and Azure OpenAI Service. Each PTU provides a fixed amount of tokens-per-minute throughput for a specific model — the capacity varies by model because different models have different computational requirements per token. For GPT-4o, one PTU delivers approximately 2,500 input tokens per minute or 833 output tokens per minute. For GPT-4o-mini, a single PTU provides roughly 37,000 input tokens per minute because the smaller model requires less compute. PTUs are purchased in blocks and reserved for your exclusive use — they are not shared with other customers on the platform. The cost per PTU varies by commitment length: approximately $0.60 per hour (~$438 per month) for monthly reservations, with discounts reaching up to 40% off for 12-month commitments. PTUs are the fundamental building block of provisioned throughput capacity planning.
When does provisioned throughput become cost-effective?
Provisioned throughput typically becomes cost-effective when sustained on-demand spending on a single model exceeds $10,000-$15,000 per month with consistent, predictable traffic patterns that maintain high utilization of the reserved capacity. The exact breakeven depends on three primary factors: utilization rate (how consistently you use the provisioned capacity throughout the day and week), commitment length (longer commitments have significantly lower per-PTU costs, with 12-month reservations costing up to 40% less than monthly), and traffic variability (spiky traffic means more idle capacity during off-peak periods). At 85% sustained utilization with a 12-month commitment, provisioned throughput can save 40-60% versus on-demand pricing. At 50% utilization, the savings shrink to a modest 10-20%. Below 40% utilization, you typically lose money compared to staying entirely on on-demand pricing.
Can I mix provisioned and on-demand capacity?
Yes, and this hybrid approach is the recommended strategy for most organizations because it balances cost savings with operational flexibility. The hybrid model provisions enough dedicated capacity to handle your baseline sustained traffic (typically the P50-P75 percentile of your traffic distribution) and allows peak traffic that exceeds provisioned capacity to automatically overflow to standard on-demand endpoints at regular per-token pricing. Azure OpenAI Service supports this natively through deployment configuration — you set up a provisioned deployment as the primary endpoint and a standard pay-as-you-go deployment as the fallback. For OpenAI's direct API offering, you implement the hybrid routing logic in your application layer. The hybrid approach captures 60-80% of the savings that full provisioning would achieve while maintaining complete flexibility to handle traffic variability, seasonal spikes, and unexpected load without any risk of request failures due to capacity constraints.
What happens if I exceed my provisioned capacity?
When traffic exceeds your provisioned capacity, the behavior depends on your configuration. With Azure OpenAI, you can configure overflow to a standard (on-demand) deployment, in which case excess requests are processed at standard per-token pricing with no interruption. Without overflow configured, requests that exceed provisioned capacity receive HTTP 429 errors, similar to rate limiting. With OpenAI's direct provisioned offering, excess requests that cannot be served by your provisioned capacity are rejected unless you have a fallback on-demand deployment configured. Always configure overflow routing in production environments — treating provisioned capacity as a hard ceiling rather than a preferred tier risks user-facing outages during traffic spikes.
Are PTUs locked to a specific model?
Yes. PTUs are allocated for a specific model version. GPT-4o PTUs can only serve GPT-4o requests — they cannot be used for GPT-4o-mini, GPT-4.1, or any other model. This means you need separate PTU reservations for each model you want to provision. If you route traffic across three models, you need three separate PTU allocations, each sized for that model's traffic. This model-specificity is one of the main drawbacks of provisioned throughput compared to on-demand pricing, where you can switch models instantly. When a new model version is released, you typically need to migrate your PTU reservation to the new version, which may require coordinating with the provider.
How does provisioned throughput affect latency?
Provisioned throughput provides significantly more consistent and typically lower latency than on-demand processing for the same model. With on-demand pricing, your requests share GPU infrastructure with all other customers on the platform, which means latency varies based on overall system load and concurrent demand from other organizations. During peak usage hours, on-demand latency can increase by 50-200% compared to off-peak periods, creating unpredictable user experiences. Provisioned throughput eliminates this variability entirely by dedicating GPU resources exclusively to your organization. In practice, provisioned deployments typically deliver 20-40% lower median latency and 50-70% lower P99 tail latency compared to equivalent on-demand requests. For latency-sensitive applications such as real-time chat interfaces, voice agents, and interactive coding assistants, the latency consistency and reduction alone may justify the cost premium of provisioning even before considering volume discounts.
What is the minimum commitment for provisioned throughput?
Azure OpenAI Service offers provisioned throughput with a minimum monthly commitment. You can provision as few as 50 PTUs for some models, though the practical minimum depends on the model's PTU-to-throughput ratio. At ~$0.60/hour per PTU, 50 PTUs cost approximately $21,900 per month ($262,800 per year) on a monthly reservation — this is not a small commitment. OpenAI's direct provisioned offering is available through their sales team and typically requires an annual commitment with custom pricing based on volume. The minimum commitment varies but is generally targeted at organizations spending $50,000+ per month on OpenAI. For organizations below these thresholds, on-demand pricing with batch API discounts is usually more appropriate.
How do I monitor provisioned throughput utilization?
Azure OpenAI provides utilization metrics through Azure Monitor, showing tokens-per-minute consumption as a percentage of provisioned capacity. The key metric is 'provisioned managed utilization' — if this consistently exceeds 80%, you risk request failures during traffic spikes and should consider adding capacity. If it consistently stays below 40%, you are over-provisioned and wasting money. CostHawk integrates with Azure monitoring data to provide a unified view of provisioned utilization alongside on-demand spend, overflow rates, and cost comparisons. The CostHawk dashboard shows the exact dollar amount you saved by using provisioned versus what the same traffic would have cost on-demand, making renewal decisions data-driven.
Should I use provisioned throughput or the Batch API for cost savings?
These are complementary optimizations for different workload types. The Batch API provides a 50% discount for asynchronous workloads that can tolerate 24-hour turnaround — it is ideal for evaluations, data labeling, content generation, and other offline tasks. Provisioned throughput provides 30-60% savings for real-time, latency-sensitive workloads with sustained high volume — it is ideal for production chat interfaces, real-time APIs, and interactive applications. Use both: route asynchronous workloads to the Batch API for 50% savings, and provision capacity for real-time workloads that need guaranteed latency. CostHawk tracks both batch savings and provisioned savings separately so you can measure the total optimization impact.
What risks should I consider before committing to provisioned throughput?
The primary risks are: (1) Traffic decline — if usage drops after you commit, you pay for idle capacity with no refund. Mitigate by starting with monthly commitments before locking into annual terms. (2) Model obsolescence — if a better or cheaper model launches during your commitment, your PTUs are locked to the old model. Mitigate by keeping commitments short for rapidly evolving model categories. (3) Over-provisioning — provisioning for peak traffic instead of baseline wastes money on idle capacity during off-peak hours. Mitigate with the hybrid approach, provisioning for P50-P75 and using on-demand overflow. (4) Pricing changes — providers occasionally reduce on-demand pricing, which can make your provisioned reservation less competitive. There is no protection against this beyond shorter commitment terms.


Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.