Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Understanding Rate Limits
AI providers enforce rate limits to protect infrastructure stability, ensure fair access across customers, and manage GPU utilization. Rate limits are not a bug — they are a fundamental design constraint of shared AI infrastructure that you must architect around.
Every provider enforces multiple simultaneous limits. A single API call must pass all applicable limits to succeed:
- Requests per minute (RPM) — The number of API calls allowed per minute, regardless of token count. This limits call frequency.
- Tokens per minute (TPM) — The total number of input + output tokens allowed per minute. This limits throughput.
- Tokens per day (TPD) — Some providers cap daily token consumption, especially for newer or more expensive models.
- Requests per day (RPD) — A daily request count cap, common for image generation and embedding endpoints.
- Concurrent requests — Some providers limit how many requests can be in flight simultaneously.
If any single limit is exceeded, the request receives a 429 response. The most common bottleneck for low-volume applications is RPM (too many small requests), while high-volume applications typically hit TPM first (fewer large requests consuming many tokens).
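A quick way to estimate which limit binds for your traffic shape is to compare your RPM cap with the request rate your TPM budget supports at your average request size. The sketch below does exactly that arithmetic; the numbers in the usage note are illustrative, not any provider's actual limits.

```typescript
// Estimate which limit a workload hits first: divide the TPM budget by
// the average tokens per request to get the request rate TPM allows,
// then compare against the RPM cap.
function bindingLimit(
  rpm: number,
  tpm: number,
  avgTokensPerRequest: number
): "RPM" | "TPM" {
  const requestsTpmAllows = tpm / avgTokensPerRequest;
  return requestsTpmAllows < rpm ? "TPM" : "RPM";
}
```

For example, at 500 RPM and 30,000 TPM, requests averaging 50 tokens are RPM-bound, while requests averaging 500 tokens are TPM-bound.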
Most providers enforce limits over a rolling window rather than resetting at fixed calendar minutes. Consuming your entire RPM allocation in the first 5 seconds of a minute still leaves you throttled for roughly the next 55 seconds, until those requests age out of the window.
Provider Rate Limit Tiers
Provider rate limits vary dramatically based on your usage tier, which is determined by cumulative spend and account age. The following table shows representative limits for major providers as of early 2026:
| Provider | Tier | Model | RPM | TPM | Notes |
|---|---|---|---|---|---|
| OpenAI | Tier 1 ($5+ spent) | GPT-4o | 500 | 30,000 | Auto-upgrade after $50 spend |
| OpenAI | Tier 2 ($50+ spent) | GPT-4o | 5,000 | 450,000 | Auto-upgrade after $100 spend |
| OpenAI | Tier 3 ($100+ spent) | GPT-4o | 5,000 | 800,000 | Auto-upgrade after $250 spend |
| OpenAI | Tier 4 ($250+ spent) | GPT-4o | 10,000 | 2,000,000 | Auto-upgrade after $1,000 spend |
| OpenAI | Tier 5 ($1,000+ spent) | GPT-4o | 10,000 | 30,000,000 | Highest standard tier |
| Anthropic | Tier 1 (Build) | Claude Sonnet 4 | 50 | 40,000 | Default for new accounts |
| Anthropic | Tier 2 (Build) | Claude Sonnet 4 | 1,000 | 80,000 | After deposit/spend threshold |
| Anthropic | Tier 3 (Scale) | Claude Sonnet 4 | 2,000 | 160,000 | Requires Scale plan |
| Anthropic | Tier 4 (Scale) | Claude Sonnet 4 | 4,000 | 400,000 | Higher spend commitment |
| Google | Free tier | Gemini 2.5 Pro | 5 | N/A | Very limited for testing |
| Google | Pay-as-you-go | Gemini 2.5 Pro | 1,000 | 4,000,000 | Standard paid tier |
These numbers change frequently. The critical takeaway is that rate limits can differ by 100-1000x between the lowest and highest tiers. A Tier 1 OpenAI account gets 500 RPM, while a Tier 5 account gets 10,000 RPM. Planning capacity around your actual tier is essential.
Tier upgrades at OpenAI are automatic based on cumulative spend and account age. Anthropic requires plan upgrades or explicit limit increase requests. Google offers limit increases via their Cloud console.
HTTP 429 Handling and Exponential Backoff
When you hit a rate limit, the provider returns an HTTP 429 response with a Retry-After header indicating how many seconds to wait before retrying. Proper handling of 429 responses is critical — naive retry logic can amplify the problem and double your token costs.
The gold standard for retry logic is exponential backoff with jitter:
async function callWithBackoff<T>(
fn: () => Promise<T>,
maxRetries = 5,
baseDelay = 1000
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error: any) {
if (error?.status !== 429 || attempt === maxRetries) {
throw error;
}
// Use Retry-After header if present, otherwise exponential backoff
const retryAfter = error?.headers?.['retry-after'];
const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000
: baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
console.warn(
`Rate limited (attempt ${attempt + 1}/${maxRetries}). ` +
`Retrying in ${Math.round(delay / 1000)}s...`
);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
throw new Error('Max retries exceeded');
}

Key principles for 429 handling:
- Always respect Retry-After — If the provider tells you to wait 30 seconds, waiting 5 seconds and retrying will just generate another 429.
- Add jitter — The random component prevents the thundering herd problem where all your concurrent requests retry at the same instant and immediately hit the limit again.
- Cap max retries — Infinite retries can cascade into runaway costs. Set a maximum (typically 3-5 retries) and fail gracefully.
- Log retry events — Every retry represents wasted cost (the original request's input tokens were processed before the 429) and degraded latency. Track retry rates as a key operational metric.
- Distinguish 429 from 500 — A 429 means your request was valid but throttled. A 500 means something broke. Different error types require different retry strategies.
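One way to encode the 429-versus-5xx distinction is a small status-to-strategy map. This is one reasonable policy, not a provider requirement: 429 defers to Retry-After, 5xx retries with exponential backoff, and other 4xx errors fail fast.

```typescript
type Backoff = "retry-after" | "exponential" | "none";

// Map an HTTP status code to a retry strategy. 429 means throttled
// (honor Retry-After); 5xx means a transient server fault (use
// exponential backoff); other 4xx errors are client mistakes and
// should not be retried at all.
function retryStrategy(status: number): { retry: boolean; backoff: Backoff } {
  if (status === 429) return { retry: true, backoff: "retry-after" };
  if (status >= 500) return { retry: true, backoff: "exponential" };
  return { retry: false, backoff: "none" };
}
```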
Strategies for Working Within Rate Limits
Rather than simply reacting to 429 errors, proactive strategies can prevent you from hitting limits in the first place:
1. Request queuing and throttling
Implement a client-side rate limiter that paces outgoing requests to stay within your known limits. A token bucket or leaky bucket algorithm can smooth traffic to a steady rate:
class RateLimiter {
private activeRequests = 0;
private tokensThisMinute = 0;
constructor(
private maxRPM: number,
private maxTPM: number
) {
// Reset token counter every minute
setInterval(() => { this.tokensThisMinute = 0; }, 60000);
}
async submit<T>(fn: () => Promise<T>, estimatedTokens: number): Promise<T> {
// Wait if at RPM or TPM limit
while (
this.activeRequests >= this.maxRPM / 60 || // Spread RPM across seconds
this.tokensThisMinute + estimatedTokens > this.maxTPM
) {
await new Promise(r => setTimeout(r, 100));
}
this.activeRequests++;
this.tokensThisMinute += estimatedTokens;
try {
return await fn();
} finally {
this.activeRequests--;
}
}
}

2. API key rotation
If you have multiple API keys (e.g., per-project or per-team keys), you can distribute requests across keys to multiply your effective rate limit. Each key has independent limits. This is particularly useful for batch processing or evaluation workloads where the source of the request does not matter. CostHawk's wrapped keys make this easy — create multiple wrapped keys pointing to the same provider key and round-robin across them.
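A minimal round-robin selector is enough to spread a batch workload across keys. The key strings below are placeholders; a production version would also skip keys that are currently throttled.

```typescript
// Rotate through a pool of API keys so each key's independent rate
// limits are consumed evenly.
class KeyRotator {
  private index = 0;

  constructor(private readonly keys: string[]) {
    if (keys.length === 0) throw new Error("at least one key required");
  }

  next(): string {
    const key = this.keys[this.index];
    this.index = (this.index + 1) % this.keys.length;
    return key;
  }
}
```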
3. Model tiering for throughput
Smaller models typically have higher rate limits. If you are hitting TPM limits on GPT-4o, routing simpler requests to GPT-4o-mini not only saves money but also frees up TPM headroom for the requests that truly need the larger model.
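A routing rule for this can be as simple as a token-count threshold plus a complexity flag. The 1,000-token cutoff and the treatment of requests under it as "simple" are illustrative assumptions; real routers usually score complexity more carefully.

```typescript
// Route requests that are short and do not need deep reasoning to the
// smaller model, preserving the larger model's TPM headroom.
// The 1,000-token threshold is an illustrative cutoff, not a recommendation.
function pickModel(estimatedTokens: number, needsReasoning: boolean): string {
  if (!needsReasoning && estimatedTokens < 1000) {
    return "gpt-4o-mini";
  }
  return "gpt-4o";
}
```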
4. Request batching
Combine multiple small requests into fewer larger requests where the API supports it. Embedding endpoints accept arrays of inputs in a single call. Chat endpoints do not support batching within a single request, but the Batch API processes thousands of requests without consuming real-time rate limits.
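For embedding workloads, batching reduces RPM pressure by a factor of the batch size: N inputs consume ceil(N / batchSize) requests instead of N. A generic chunking helper is all that is needed; the right batch size depends on the endpoint's per-request input cap.

```typescript
// Split a list of inputs into batches so that N embedding inputs
// consume ceil(N / batchSize) API requests instead of N.
function chunk<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```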
5. Caching
Cache responses for deterministic or semi-deterministic requests. If the same prompt is sent repeatedly (e.g., a classification prompt with static instructions and a limited set of inputs), cache the output to avoid redundant API calls entirely. Even partial caching of system prompt responses can reduce RPM by 30-50%.
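A sketch of such a cache, keyed by a hash of model plus prompt. This version is in-memory with no TTL, eviction, or size bound, all of which a production deployment would add (e.g. Redis with an expiry).

```typescript
import { createHash } from "crypto";

// Cache responses keyed by a hash of model + prompt so repeated
// identical requests never reach the API. In-memory sketch only.
class ResponseCache {
  private store = new Map<string, string>();

  private key(model: string, prompt: string): string {
    return createHash("sha256").update(`${model}\u0000${prompt}`).digest("hex");
  }

  get(model: string, prompt: string): string | undefined {
    return this.store.get(this.key(model, prompt));
  }

  set(model: string, prompt: string, response: string): void {
    this.store.set(this.key(model, prompt), response);
  }
}
```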
Rate Limits and Cost Optimization
Rate limits and cost optimization are more closely linked than most teams realize. Rate limits shape your architecture, and architectural decisions shape your costs.
The retry cost tax: Every 429 error that eventually succeeds costs you the input tokens for both the original failed attempt and the successful retry. If your retry rate is 10%, you are paying approximately 10% more in input token costs than necessary. For a team spending $10,000/month, a 10% retry rate adds $1,000 in wasted spend. CostHawk tracks this as "retry overhead" in the cost dashboard.
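The arithmetic behind the retry tax is straightforward: every retried request pays its input tokens at least twice, once for the failed attempt and once for the successful retry. A rough estimator, with all inputs supplied by the caller:

```typescript
// Estimate monthly dollars wasted on re-processed input tokens.
// Each retried request pays its input tokens once for the failed
// attempt and again for the successful retry.
function retryOverheadUsd(
  requestsPerMonth: number,
  avgInputTokens: number,
  pricePerMillionInputTokens: number,
  retryRate: number
): number {
  const wastedTokens = requestsPerMonth * retryRate * avgInputTokens;
  return (wastedTokens / 1_000_000) * pricePerMillionInputTokens;
}
```

At one million requests a month averaging 1,000 input tokens at $10 per million tokens, a 10% retry rate wastes roughly $1,000 a month.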
Forced model upgrades: Teams sometimes upgrade to higher tiers (which require more spending) just to get higher rate limits. Before upgrading your tier, analyze whether request queuing, caching, or model routing could solve the throughput problem without increasing your spend commitment.
Batch API as rate limit bypass: The Batch API has separate, higher throughput limits than real-time endpoints. If you are hitting rate limits on bulk workloads, migrating to batch processing simultaneously solves the rate limit problem and saves 50% on costs.
Architectural decisions: Rate limits influence fundamental architecture choices. A system that makes one LLM call per user request is constrained by RPM. A system that makes five chained calls per user request hits RPM limits 5x faster. Designing for fewer, more efficient calls per user interaction improves both cost and rate limit headroom.
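The chaining effect can be made explicit: under an RPM cap, sustainable throughput in user interactions per minute is simply the cap divided by the number of calls each interaction makes.

```typescript
// User interactions per minute sustainable under an RPM cap when each
// interaction makes `callsPerInteraction` chained LLM calls.
function interactionsPerMinute(rpm: number, callsPerInteraction: number): number {
  return Math.floor(rpm / callsPerInteraction);
}
```

At 500 RPM, one call per interaction supports 500 interactions per minute; a five-call chain supports only 100.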
CostHawk's rate limit monitoring dashboard shows your current utilization as a percentage of your limit for each model, along with 429 event frequency and retry cost overhead. This data helps you make informed decisions about when to implement queuing, when to upgrade tiers, and when to restructure your calling patterns.
Monitoring Rate Limit Usage
Proactive monitoring of rate limit utilization prevents outages and identifies optimization opportunities before they become incidents.
Key metrics to track:
- Limit utilization percentage — Track TPM and RPM usage as a percentage of your limit. Alert at 70% utilization so you have time to react before hitting 100%.
- 429 error rate — The percentage of requests that receive a 429 response. Any rate above 1% indicates a systemic issue that needs architectural attention.
- Retry overhead cost — The dollar amount spent on input tokens for requests that were retried after a 429. This is pure waste.
- P99 latency including retries — Rate limit retries add seconds to minutes of latency. Track the tail latency impact.
- Per-model utilization — Some models may be near their limits while others have headroom. This informs model routing decisions.
OpenAI returns rate limit headers with every response: x-ratelimit-limit-requests, x-ratelimit-remaining-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-tokens, and x-ratelimit-reset-requests. Parse and log these headers to build a real-time utilization dashboard.
Anthropic returns similar headers: anthropic-ratelimit-requests-limit, anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-limit, and anthropic-ratelimit-tokens-remaining.
// Extract rate limit headers from API responses
function logRateLimitStatus(headers: Headers, model: string) {
const remaining = parseInt(headers.get('x-ratelimit-remaining-tokens') || '0');
const limit = parseInt(headers.get('x-ratelimit-limit-tokens') || '1');
const utilization = ((limit - remaining) / limit) * 100;
metrics.gauge('rate_limit.utilization_pct', utilization, { model });
if (utilization > 70) {
console.warn(`Rate limit warning: ${model} at ${utilization.toFixed(1)}% TPM utilization`);
}
}

CostHawk aggregates rate limit data across all your API keys and providers, providing a unified view of your rate limit posture. The dashboard highlights models approaching their limits and estimates the cost impact of current retry rates.
Frequently Asked Questions
- What exactly happens when I hit a rate limit?
- How do I check my current rate limits?
- Can I get my rate limits increased?
- Do different models have different rate limits?
- What is the difference between RPM and TPM rate limits?
- How do rate limits affect my costs?
- Should I use multiple API keys to get around rate limits?
- How does the Batch API interact with rate limits?
- What is a retry storm and how do I prevent one?
- How do rate limits work for streaming responses?
Related Terms
Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.

Token Budget
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.

Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.

Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.