Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
What Is Provisioned Throughput?
Provisioned throughput is the AI inference equivalent of reserved instances in cloud computing. Instead of paying per token at the standard rate, you commit to a fixed amount of compute capacity for a defined period (typically 1 month or 1 year) at a discounted rate.
The concept is straightforward: if you know you will process 500 million tokens per month through GPT-4o, you can pre-purchase capacity to handle that volume at a lower effective per-token cost than the on-demand rate. The provider allocates dedicated GPU resources to your organization, ensuring that your traffic is never queued behind other customers' workloads.
Key characteristics of provisioned throughput:
- Dedicated capacity — Your provisioned compute is not shared. Other customers' traffic spikes do not affect your latency or availability.
- Guaranteed throughput — Each PTU provides a defined tokens-per-minute capacity. You can plan capacity with precision.
- Fixed cost — Monthly cost is determined by the number of PTUs reserved, not by actual token consumption. Unused capacity is wasted.
- Model-specific — PTUs are allocated for a specific model. GPT-4o PTUs cannot be used for GPT-4o-mini or vice versa.
- Commitment required — Most provisioned offerings require a minimum 1-month commitment. Longer commitments (6 or 12 months) offer deeper discounts.
How PTUs Work
A Provisioned Throughput Unit (PTU) is a unit of model-specific inference capacity. The number of tokens per minute each PTU delivers varies by model because different models have different computational requirements per token.
For OpenAI models via Azure OpenAI Service, the approximate capacity per PTU is:
| Model | Input TPM per PTU | Output TPM per PTU |
|---|---|---|
| GPT-4o | ~2,500 | ~833 |
| GPT-4o-mini | ~37,000 | ~12,333 |
| GPT-4.1 | ~2,500 | ~833 |
| GPT-4.1-mini | ~11,000 | ~3,667 |
These numbers are approximate and vary based on request characteristics (longer prompts consume more memory, reducing effective throughput). The critical insight is that output tokens consume approximately 3x the compute of input tokens, so output-heavy workloads require more PTUs than input-heavy workloads of the same total token volume.
Sizing example: If your application processes 250,000 input tokens per minute and 50,000 output tokens per minute of GPT-4o traffic:
- Input PTUs needed: 250,000 / 2,500 = 100 PTUs
- Output PTUs needed: 50,000 / 833 = 60 PTUs
- Total PTUs needed: max(100, 60) = 100 PTUs (the constraint is typically on one dimension)
In practice, you need to account for traffic variability. If your peak traffic is 2x your average, you need PTUs to cover the peak unless you are willing to spill excess traffic to on-demand (a hybrid approach discussed later).
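The sizing arithmetic above can be sketched as a small helper, using the approximate GPT-4o capacities from the table (assumed values; real throughput varies with request shape):

```python
import math

# Approximate GPT-4o capacity per PTU from the table above (assumptions).
INPUT_TPM_PER_PTU = 2_500
OUTPUT_TPM_PER_PTU = 833

def ptus_needed(input_tpm: float, output_tpm: float, peak_factor: float = 1.0) -> int:
    """PTUs required to cover traffic, optionally scaled to peak (e.g. 2.0 for 2x peaks)."""
    input_ptus = math.ceil(input_tpm * peak_factor / INPUT_TPM_PER_PTU)
    output_ptus = math.ceil(output_tpm * peak_factor / OUTPUT_TPM_PER_PTU)
    # The binding constraint is whichever dimension needs more capacity.
    return max(input_ptus, output_ptus)

print(ptus_needed(250_000, 50_000))        # worked example above -> 100
print(ptus_needed(250_000, 50_000, 2.0))   # covering a 2x peak -> 200
```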
PTU allocation is not instantaneous. Requesting new PTU capacity can take hours to days depending on current GPU availability. Plan capacity changes well in advance of expected traffic increases.
Pricing and Breakeven Analysis
The fundamental question with provisioned throughput is: at what spend level does reserved capacity become cheaper than on-demand pricing?
Azure OpenAI PTU pricing (representative):
| Commitment | Price per PTU per Hour | Monthly Cost per PTU | Discount vs. Monthly |
|---|---|---|---|
| Monthly reservation | ~$6.00 | ~$4,380 | Baseline |
| 6-month reservation | ~$4.50 | ~$3,285 | ~25% discount |
| 12-month reservation | ~$3.60 | ~$2,628 | ~40% discount |
Breakeven calculation for GPT-4o:
One PTU provides approximately 2,500 input TPM or 833 output TPM. Let us calculate for a balanced workload (3:1 input-to-output ratio):
- 1 PTU sustains approximately 2,500 input TPM and 833 output TPM of continuous throughput
- Monthly tokens per PTU: 2,500 input TPM x 60 min x 24 hr x 30 days = ~108M input tokens/month, plus ~36M output tokens/month
- On-demand cost for same volume: (108M / 1M x $2.50) + (36M / 1M x $10.00) = $270 + $360 = $630/month
- Monthly PTU cost: ~$4,380/month
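The arithmetic above takes only a few lines to reproduce (prices are the representative GPT-4o on-demand rates of $2.50/1M input and $10.00/1M output used in this example):

```python
MINUTES_PER_MONTH = 60 * 24 * 30              # 43,200

input_tokens = 2_500 * MINUTES_PER_MONTH      # 108,000,000 (~108M)
output_tokens = 833 * MINUTES_PER_MONTH       # 35,985,600 (~36M)

# On-demand cost for the same volume at $2.50/1M input, $10.00/1M output:
on_demand = input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10.00
print(round(on_demand, 2))   # about $630/month, versus ~$4,380 for the PTU
```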
At face value, a PTU looks far more expensive than the on-demand cost of the tokens it can process. The key is that in practice, PTU sizing accounts for peak capacity while on-demand billing only charges for actual usage. The breakeven occurs when your sustained utilization is high enough that the fixed PTU cost is less than what the equivalent on-demand traffic would cost.
The realistic breakeven point is when on-demand spending exceeds approximately $10,000-$15,000 per month on a single model with consistent, predictable traffic patterns. Below this threshold, on-demand pricing is almost always cheaper because you only pay for what you use. Above this threshold, the economics shift in favor of provisioned capacity, especially with 6-12 month commitments.
Real-world example:
| Scenario | On-Demand Monthly Cost | Provisioned Monthly Cost (12mo) | Monthly Savings |
|---|---|---|---|
| 50 PTUs, high utilization (85%) | $26,775 | $13,140 | $13,635 (51%) |
| 50 PTUs, moderate utilization (60%) | $18,900 | $13,140 | $5,760 (30%) |
| 50 PTUs, low utilization (30%) | $9,450 | $13,140 | -$3,690 (loss) |
The table illustrates the critical factor: utilization. Provisioned throughput only saves money when utilization is consistently high. If you reserve 50 PTUs but only use 30% of the capacity, you are paying for 70% idle compute.
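The table's utilization math reduces to a single comparison. A minimal sketch using the figures above (50 PTUs on a 12-month reservation; $31,500 is the implied on-demand cost of that capacity's traffic at 100% utilization, derived from the 85% row):

```python
# Figures from the table above (representative, not current list prices).
ON_DEMAND_AT_FULL_UTIL = 31_500.0   # on-demand cost of 50 PTUs' traffic at 100% use
PROVISIONED_12MO = 13_140.0         # fixed 12-month reservation cost in the table

def monthly_savings(utilization: float) -> float:
    """Positive when the reservation beats paying on-demand for the same traffic."""
    return ON_DEMAND_AT_FULL_UTIL * utilization - PROVISIONED_12MO

print(monthly_savings(0.85))   # high utilization: about $13,635 saved
print(monthly_savings(0.30))   # low utilization: about -$3,690 (loss)
```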
Provisioned vs On-Demand Comparison
The following comprehensive comparison helps you decide which model fits your workload:
| Factor | On-Demand (Pay-Per-Token) | Provisioned Throughput |
|---|---|---|
| Pricing model | Pay per token consumed | Fixed monthly fee for reserved capacity |
| Cost at low volume | Lower (you only pay for what you use) | Higher (paying for capacity even when idle) |
| Cost at high volume | Higher (linear cost scaling) | Lower (fixed cost regardless of volume within capacity) |
| Breakeven point | N/A | ~$10K-$15K/month sustained spend |
| Latency | Variable (affected by shared infrastructure load) | Consistent (dedicated resources) |
| Rate limits | Shared, tier-based (RPM/TPM caps) | Determined by provisioned capacity (no shared limits) |
| Capacity guarantee | Best-effort within rate limits | Guaranteed throughput up to provisioned capacity |
| Flexibility | Scale up/down instantly with traffic | Capacity changes take hours/days |
| Commitment | None (pay as you go) | Minimum 1-month, deeper discounts at 6/12 months |
| Model switching | Use any model, switch instantly | PTUs locked to specific model |
| Traffic variability handling | Excellent (costs match usage) | Poor (idle capacity is wasted) |
| Billing predictability | Low (varies with usage) | High (fixed monthly cost) |
| Minimum viable spend | $0 (free tier available) | ~$2,600-$4,400/month per PTU |
The decision framework is straightforward: if your AI spend on a single model is predictable, sustained, and exceeds $10K-$15K per month, evaluate provisioned throughput. If your traffic is variable, unpredictable, or spread across multiple models, stay on-demand. Most organizations benefit from a hybrid approach that combines both.
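That framework can be encoded as a first-pass screen. The thresholds below are the rules of thumb from this article (the spend breakeven and the 3:1 peak-to-average ratio discussed under capacity planning), not provider guidance:

```python
def should_evaluate_provisioned(monthly_spend_per_model: float,
                                predictable: bool,
                                peak_to_average: float) -> bool:
    """First-pass screen for a single model's traffic, per the rules of thumb above."""
    return (monthly_spend_per_model >= 10_000   # sustained spend threshold
            and predictable                     # stable, forecastable traffic
            and peak_to_average <= 3.0)         # very spiky traffic provisions poorly

print(should_evaluate_provisioned(25_000, True, 2.0))   # True
print(should_evaluate_provisioned(5_000, True, 1.5))    # False: below threshold
```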
Capacity Planning
Accurate capacity planning is the difference between provisioned throughput saving 50% and costing more than on-demand. The goal is to provision enough capacity for your sustained baseline while handling peaks through on-demand overflow.
Step 1: Measure your traffic pattern
Before provisioning, collect at least 30 days of per-minute token consumption data, broken down by model. CostHawk's usage dashboard provides this data. Key metrics:
- Average TPM — Your baseline sustained throughput. This is the minimum capacity to consider provisioning.
- P95 TPM — The throughput at the 95th percentile. This represents your typical peak.
- P99 TPM — The throughput at the 99th percentile. This represents rare spikes.
- Peak-to-average ratio — How spiky your traffic is. A ratio of 2:1 means peaks are double your average. Ratios above 3:1 make provisioning challenging.
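Given 30 days of per-minute samples, these metrics are straightforward to compute. A sketch using nearest-rank percentiles:

```python
import statistics

def traffic_profile(tpm_samples: list[float]) -> dict:
    """Summarize per-minute token throughput into the metrics listed above."""
    s = sorted(tpm_samples)
    def percentile(p: float) -> float:
        # Nearest-rank percentile over the sorted samples.
        return s[min(len(s) - 1, int(p * len(s)))]
    avg = statistics.fmean(s)
    return {
        "average_tpm": avg,
        "p95_tpm": percentile(0.95),
        "p99_tpm": percentile(0.99),
        "peak_to_average": s[-1] / avg if avg else 0.0,
    }
```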
Step 2: Choose your provisioning point
The optimal provisioning point depends on your tolerance for on-demand overflow:
- Conservative: Provision for P50 (median) traffic. ~50% of requests use provisioned capacity; ~50% overflow to on-demand. Safest approach for first-time provisioning.
- Balanced: Provision for P75 traffic. ~75% of requests use provisioned capacity. Good balance of savings and risk.
- Aggressive: Provision for P95 traffic. ~95% of requests use provisioned capacity. Maximum provisioned coverage, but high risk of paying for idle capacity during off-peak periods.
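Choosing a provisioning point then maps a percentile of measured traffic onto a PTU count. A sketch, where `tpm_per_ptu` is the model's capacity figure (e.g. ~2,500 input TPM per PTU for GPT-4o):

```python
import math

def provision_for(tpm_samples: list[float], percentile: float, tpm_per_ptu: float) -> int:
    """PTUs needed to cover the given percentile of measured per-minute traffic."""
    s = sorted(tpm_samples)
    target_tpm = s[min(len(s) - 1, int(percentile * len(s)))]  # nearest-rank
    return math.ceil(target_tpm / tpm_per_ptu)

# Conservative, balanced, and aggressive sizing from the same measurements:
# provision_for(samples, 0.50, 2_500)
# provision_for(samples, 0.75, 2_500)
# provision_for(samples, 0.95, 2_500)
```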
Step 3: Model traffic growth
Factor in expected growth over your commitment period. If you sign a 12-month reservation and expect traffic to double over that period, provision for the average expected traffic over the full year, not just today's traffic. Under-provisioning early and over-provisioning late roughly balances out.
Step 4: Account for seasonality
If your traffic has seasonal patterns (higher during business hours, lower on weekends, spikes during marketing campaigns), provisioned capacity should cover your sustained baseline, not your seasonal peaks. Use on-demand overflow for seasonal spikes.
Hybrid Provisioned + On-Demand Strategy
The most cost-effective approach for most organizations is a hybrid strategy that combines provisioned capacity for baseline traffic with on-demand overflow for peaks. This captures the bulk of provisioned savings while maintaining the flexibility to handle traffic variability.
How hybrid works:
- Provision enough PTUs to handle your sustained baseline traffic (typically P50-P75 of your traffic distribution).
- Configure your application to route requests to provisioned capacity first.
- When provisioned capacity is fully utilized, overflow requests automatically route to on-demand endpoints.
- On-demand requests pay standard per-token pricing.
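In application code, the overflow step can be as simple as a try/fallback wrapper. A sketch: `CapacityExceeded` stands in for whatever signal your client raises when the provisioned deployment is saturated (typically an HTTP 429), and the two callables stand in for your actual client calls:

```python
class CapacityExceeded(Exception):
    """Raised when the provisioned deployment reports it is at capacity."""

def route(request, call_provisioned, call_on_demand):
    """Prefer dedicated PTUs; spill to pay-per-token on-demand when full."""
    try:
        return call_provisioned(request)   # fixed-cost, dedicated capacity
    except CapacityExceeded:
        return call_on_demand(request)     # standard per-token pricing

# Illustrative usage with hypothetical client objects:
# route(req, provisioned_client.complete, on_demand_client.complete)
```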
Cost optimization example:
| Metric | Pure On-Demand | Pure Provisioned (P95) | Hybrid (P50 Provisioned) |
|---|---|---|---|
| Monthly traffic | 500M tokens | 500M tokens | 500M tokens |
| Provisioned capacity | 0 PTUs | 200 PTUs | 100 PTUs |
| Provisioned cost/mo | $0 | $26,280 | $13,140 |
| On-demand overflow/mo | $25,000 | $500 | $6,250 |
| Total monthly cost | $25,000 | $26,780 | $19,390 |
| Savings vs on-demand | — | -7% (loss) | 22.4% |
In this example, pure provisioned at P95 actually costs more than on-demand because the over-provisioned capacity is wasted. The hybrid approach at P50 saves 22.4% by efficiently combining cheap provisioned base capacity with flexible on-demand overflow.
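The hybrid row reduces to a simple sum and ratio (figures from the table above):

```python
def hybrid_total(provisioned_cost: float, overflow_cost: float) -> float:
    """Total monthly spend: fixed reservation plus on-demand overflow."""
    return provisioned_cost + overflow_cost

def savings_vs_on_demand(total_cost: float, pure_on_demand_cost: float) -> float:
    """Fractional savings relative to staying entirely on-demand."""
    return (pure_on_demand_cost - total_cost) / pure_on_demand_cost

total = hybrid_total(13_140, 6_250)           # $19,390, as in the table
print(savings_vs_on_demand(total, 25_000))    # about 0.224 (22.4%)
```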
Implementation considerations:
- Azure OpenAI supports hybrid routing natively — configure a provisioned deployment as the primary and a standard deployment as the fallback. Azure handles the overflow automatically.
- OpenAI direct API provisioned throughput requires coordinating with OpenAI sales. Hybrid routing must be implemented in your application layer.
- CostHawk monitoring tracks provisioned utilization and on-demand overflow separately, showing you the exact cost split and helping you optimize the provisioning level over time.
Review your hybrid split monthly. If on-demand overflow consistently exceeds 40% of total traffic, consider increasing your provisioned capacity. If provisioned utilization consistently drops below 60%, consider reducing PTUs to avoid paying for idle capacity.
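That monthly review rule can be written down directly (the thresholds are the rules of thumb above):

```python
def review_reservation(overflow_share: float, provisioned_utilization: float) -> str:
    """Monthly right-sizing check using the rule-of-thumb thresholds above."""
    if overflow_share > 0.40:           # too much traffic spilling to on-demand
        return "increase PTUs"
    if provisioned_utilization < 0.60:  # paying for too much idle capacity
        return "decrease PTUs"
    return "hold"

print(review_reservation(0.50, 0.95))   # increase PTUs
print(review_reservation(0.10, 0.45))   # decrease PTUs
print(review_reservation(0.20, 0.80))   # hold
```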
CostHawk's provisioned throughput dashboard shows daily utilization curves, overflow rates, and projected savings for different provisioning levels, making it easy to right-size your reservation at every renewal period.
Frequently Asked Questions
What is a PTU (Provisioned Throughput Unit)?
When does provisioned throughput become cost-effective?
Can I mix provisioned and on-demand capacity?
What happens if I exceed my provisioned capacity?
Are PTUs locked to a specific model?
How does provisioned throughput affect latency?
What is the minimum commitment for provisioned throughput?
How do I monitor provisioned throughput utilization?
Should I use provisioned throughput or the Batch API for cost savings?
What risks should I consider before committing to provisioned throughput?
Related Terms
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Token Budget
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models, a 1,000x spread.
AI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
