Glossary · Infrastructure · Updated 2026-03-16

Rate Limiting

Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.

Definition

What is Rate Limiting?

Rate limiting is a traffic management mechanism that AI providers use to cap the number of requests per minute (RPM), tokens per minute (TPM), and in some cases tokens per day (TPD) that an individual organization or API key can consume. When a client exceeds these limits, the API returns an HTTP 429 Too Many Requests response, signaling the client to back off and retry later. Rate limits exist at multiple levels — per model, per API key, per organization — and vary dramatically based on the provider, model, and the customer's usage tier.

Impact

Why It Matters for AI Costs

Rate limits are one of the most common sources of production incidents in AI-powered applications. A sudden traffic spike, a retry storm, or a misconfigured batch job can push you past your limits, causing cascading failures as 429 errors trigger retries that generate more 429 errors. Understanding your rate limits, implementing proper backoff logic, and monitoring limit utilization are essential for both reliability and cost control — because every failed request that retries successfully costs you double the tokens. CostHawk tracks 429 error rates and retry overhead so you can quantify the cost impact of rate limit events.

Understanding Rate Limits

AI providers enforce rate limits to protect infrastructure stability, ensure fair access across customers, and manage GPU utilization. Rate limits are not a bug — they are a fundamental design constraint of shared AI infrastructure that you must architect around.

Every provider enforces multiple simultaneous limits. A single API call must pass all applicable limits to succeed:

  • Requests per minute (RPM) — The number of API calls allowed per minute, regardless of token count. This limits call frequency.
  • Tokens per minute (TPM) — The total number of input + output tokens allowed per minute. This limits throughput.
  • Tokens per day (TPD) — Some providers cap daily token consumption, especially for newer or more expensive models.
  • Requests per day (RPD) — A daily request count cap, common for image generation and embedding endpoints.
  • Concurrent requests — Some providers limit how many requests can be in flight simultaneously.

If any single limit is exceeded, the request receives a 429 response. Applications sending many small requests typically hit RPM first, while applications sending fewer, larger requests (high token volume per call) typically hit TPM first.

Rate limits reset on a rolling window basis, not at fixed calendar minutes. This means consuming all your RPM allocation in the first 5 seconds of a minute leaves you throttled for the remaining 55 seconds.
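As a rough sketch, you can estimate which limit a steady workload hits first by comparing its utilization of both caps. The numbers in the usage comments below are illustrative, not any provider's actual limits:

```typescript
// Estimate which rate limit a steady workload exhausts first.
interface Workload {
  requestsPerMinute: number;
  avgTokensPerRequest: number; // input + output tokens
}

function bindingLimit(
  w: Workload,
  maxRPM: number,
  maxTPM: number
): 'RPM' | 'TPM' | 'none' {
  const rpmUtilization = w.requestsPerMinute / maxRPM;
  const tpmUtilization = (w.requestsPerMinute * w.avgTokensPerRequest) / maxTPM;
  if (rpmUtilization < 1 && tpmUtilization < 1) return 'none';
  // Whichever cap is proportionally more consumed binds first
  return rpmUtilization >= tpmUtilization ? 'RPM' : 'TPM';
}

// A chatbot sending many small requests hits RPM first:
//   bindingLimit({ requestsPerMinute: 600, avgTokensPerRequest: 100 }, 500, 450_000)
// A document pipeline sending few large requests hits TPM first:
//   bindingLimit({ requestsPerMinute: 20, avgTokensPerRequest: 50_000 }, 500, 450_000)
```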

Provider Rate Limit Tiers

Provider rate limits vary dramatically based on your usage tier, which is determined by cumulative spend and account age. The following table shows representative limits for major providers as of early 2026:

| Provider  | Tier                   | Model           | RPM    | TPM        | Notes                           |
|-----------|------------------------|-----------------|--------|------------|---------------------------------|
| OpenAI    | Tier 1 ($5+ spent)     | GPT-4o          | 500    | 30,000     | Auto-upgrade after $50 spend    |
| OpenAI    | Tier 2 ($50+ spent)    | GPT-4o          | 5,000  | 450,000    | Auto-upgrade after $100 spend   |
| OpenAI    | Tier 3 ($100+ spent)   | GPT-4o          | 5,000  | 800,000    | Auto-upgrade after $250 spend   |
| OpenAI    | Tier 4 ($250+ spent)   | GPT-4o          | 10,000 | 2,000,000  | Auto-upgrade after $1,000 spend |
| OpenAI    | Tier 5 ($1,000+ spent) | GPT-4o          | 10,000 | 30,000,000 | Highest standard tier           |
| Anthropic | Tier 1 (Build)         | Claude Sonnet 4 | 50     | 40,000     | Default for new accounts        |
| Anthropic | Tier 2 (Build)         | Claude Sonnet 4 | 1,000  | 80,000     | After deposit/spend threshold   |
| Anthropic | Tier 3 (Scale)         | Claude Sonnet 4 | 2,000  | 160,000    | Requires Scale plan             |
| Anthropic | Tier 4 (Scale)         | Claude Sonnet 4 | 4,000  | 400,000    | Higher spend commitment         |
| Google    | Free tier              | Gemini 2.5 Pro  | 5      | N/A        | Very limited for testing        |
| Google    | Pay-as-you-go          | Gemini 2.5 Pro  | 1,000  | 4,000,000  | Standard paid tier              |

These numbers change frequently. The critical takeaway is that rate limits can differ by 100-1000x between the lowest and highest tiers. A Tier 1 OpenAI account gets 500 RPM, while a Tier 5 account gets 10,000 RPM. Planning capacity around your actual tier is essential.

Tier upgrades at OpenAI are automatic based on cumulative spend and account age. Anthropic requires plan upgrades or explicit limit increase requests. Google offers limit increases via their Cloud console.

HTTP 429 Handling and Exponential Backoff

When you hit a rate limit, the provider returns an HTTP 429 response with a Retry-After header indicating how many seconds to wait before retrying. Proper handling of 429 responses is critical — naive retry logic can amplify the problem and double your token costs.

The gold standard for retry logic is exponential backoff with jitter:

async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error?.status !== 429 || attempt === maxRetries) {
        throw error;
      }

      // Use Retry-After header if present, otherwise exponential backoff
      const retryAfter = error?.headers?.['retry-after'];
      const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000 // header value is in seconds
        : baseDelay * Math.pow(2, attempt) + Math.random() * 1000;

      console.warn(
        `Rate limited (attempt ${attempt + 1}/${maxRetries}). ` +
        `Retrying in ${Math.round(delay / 1000)}s...`
      );

      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Max retries exceeded');
}

Key principles for 429 handling:

  • Always respect Retry-After — If the provider tells you to wait 30 seconds, waiting 5 seconds and retrying will just generate another 429.
  • Add jitter — The random component prevents the thundering herd problem where all your concurrent requests retry at the same instant and immediately hit the limit again.
  • Cap max retries — Infinite retries can cascade into runaway costs. Set a maximum (typically 3-5 retries) and fail gracefully.
  • Log retry events — Every retry represents wasted cost (the original request's input tokens were processed before the 429) and degraded latency. Track retry rates as a key operational metric.
  • Distinguish 429 from 500 — A 429 means your request was valid but throttled. A 500 means something broke. Different error types require different retry strategies.
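The last principle above can be sketched as a small classifier. The bare `status` number is an assumption about how your client surfaces errors; adapt the extraction to your library:

```typescript
// Decide whether (and how) to retry based on the HTTP status code.
type RetryDecision =
  | { retry: true; respectRetryAfter: boolean }
  | { retry: false };

function classifyError(status: number): RetryDecision {
  if (status === 429) {
    // Throttled: the request was valid; back off for Retry-After and resend
    return { retry: true, respectRetryAfter: true };
  }
  if (status >= 500) {
    // Server fault: retry with backoff; a Retry-After header is rarely present
    return { retry: true, respectRetryAfter: false };
  }
  // Other 4xx (bad request, auth, not found): retrying cannot help
  return { retry: false };
}
```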

Strategies for Working Within Rate Limits

Rather than simply reacting to 429 errors, proactive strategies can prevent you from hitting limits in the first place:

1. Request queuing and throttling

Implement a client-side rate limiter that paces outgoing requests to stay within your known limits. A token bucket or leaky bucket algorithm can smooth traffic to a steady rate:

class RateLimiter {
  private activeRequests = 0;
  private tokensThisMinute = 0;

  constructor(
    private maxRPM: number,
    private maxTPM: number
  ) {
    // Reset the token counter every minute (a fixed window; a rolling
    // window would match provider behavior more closely)
    setInterval(() => { this.tokensThisMinute = 0; }, 60_000);
  }

  async submit<T>(fn: () => Promise<T>, estimatedTokens: number): Promise<T> {
    // Block until both the concurrency cap (RPM spread across seconds)
    // and the remaining TPM budget allow this request
    while (
      this.activeRequests >= Math.max(1, this.maxRPM / 60) ||
      this.tokensThisMinute + estimatedTokens > this.maxTPM
    ) {
      await new Promise(r => setTimeout(r, 100));
    }

    this.activeRequests++;
    this.tokensThisMinute += estimatedTokens;
    try {
      return await fn();
    } finally {
      this.activeRequests--;
    }
  }
}

2. API key rotation

If you have multiple API keys (e.g., per-project or per-team keys), you can distribute requests across keys to multiply your effective rate limit. Each key has independent limits. This is particularly useful for batch processing or evaluation workloads where the source of the request does not matter. CostHawk's wrapped keys make this easy — create multiple wrapped keys pointing to the same provider key and round-robin across them.
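A minimal round-robin rotator might look like the sketch below. The key names are placeholders, and the approach only multiplies limits when each key genuinely has independent quota (e.g. separate organizations or wrapped keys, as described above):

```typescript
// Round-robin over a pool of API keys with independent rate limits.
class KeyRotator {
  private index = 0;

  constructor(private keys: string[]) {
    if (keys.length === 0) throw new Error('at least one key required');
  }

  // Return the next key in the pool, cycling back to the start
  next(): string {
    const key = this.keys[this.index];
    this.index = (this.index + 1) % this.keys.length;
    return key;
  }
}

const rotator = new KeyRotator(['key-a', 'key-b', 'key-c']);
// rotator.next() returns 'key-a', then 'key-b', 'key-c', 'key-a', ...
```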

3. Model tiering for throughput

Smaller models typically have higher rate limits. If you are hitting TPM limits on GPT-4o, routing simpler requests to GPT-4o-mini not only saves money but also frees up TPM headroom for the requests that truly need the larger model.
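One crude sketch of such routing, using prompt length as a stand-in for complexity (the threshold and the idea of a boolean reasoning flag are assumptions; real routers often use a classifier or rules tuned to the workload):

```typescript
// Route requests between a large model and a cheaper, higher-limit one.
// The 4,000-character threshold is an arbitrary illustrative cutoff.
function chooseModel(prompt: string, needsReasoning: boolean): string {
  if (needsReasoning || prompt.length > 4000) {
    return 'gpt-4o'; // larger model: scarcer TPM, reserve for hard requests
  }
  return 'gpt-4o-mini'; // cheaper model with more rate limit headroom
}
```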

4. Request batching

Combine multiple small requests into fewer larger requests where the API supports it. Embedding endpoints accept arrays of inputs in a single call. Chat endpoints do not support batching within a single request, but the Batch API processes thousands of requests without consuming real-time rate limits.
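For endpoints that accept arrays, a simple chunking helper turns many calls into a few. The batch size is workload-dependent, and providers also cap array length and total tokens per call, so check your provider's documented limits:

```typescript
// Split a list of inputs into batches for array-accepting endpoints.
function chunk<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 1,000 embedding inputs at batchSize 100 become 10 requests instead
// of 1,000, consuming 10 RPM instead of 1,000 RPM
```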

5. Caching

Cache responses for deterministic or semi-deterministic requests. If the same prompt is sent repeatedly (e.g., a classification prompt with static instructions and a limited set of inputs), cache the output to avoid redundant API calls entirely. Even partial caching of system prompt responses can reduce RPM by 30-50%.
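A minimal sketch of such a cache, keyed by a hash of model plus prompt. This assumes deterministic outputs (e.g. temperature 0) and keeps everything in memory; a production version would add TTLs and a size bound:

```typescript
import { createHash } from 'crypto';

// In-memory cache of model responses, keyed by model + prompt hash.
class ResponseCache {
  private store = new Map<string, string>();

  private key(model: string, prompt: string): string {
    return createHash('sha256').update(`${model}\n${prompt}`).digest('hex');
  }

  async getOrCall(
    model: string,
    prompt: string,
    call: () => Promise<string>
  ): Promise<string> {
    const k = this.key(model, prompt);
    const hit = this.store.get(k);
    if (hit !== undefined) return hit; // cache hit: no API call, no tokens
    const result = await call();
    this.store.set(k, result);
    return result;
  }
}
```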

Rate Limits and Cost Optimization

Rate limits and cost optimization are more closely linked than most teams realize. Rate limits shape your architecture, and architectural decisions shape your costs.

The retry cost tax: Every 429 error that eventually succeeds costs you the input tokens for both the original failed attempt and the successful retry. If your retry rate is 10%, you are paying approximately 10% more in input token costs than necessary. For a team spending $10,000/month, a 10% retry rate adds $1,000 in wasted spend. CostHawk tracks this as "retry overhead" in the cost dashboard.
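The arithmetic above reduces to a one-line estimate, assuming each retried request pays its input tokens twice:

```typescript
// Estimate monthly dollars wasted on retried input tokens.
function retryOverheadUSD(
  monthlyInputSpendUSD: number,
  retryRate: number // fraction of requests retried after a 429
): number {
  // Each retried request's input tokens are paid for twice
  return monthlyInputSpendUSD * retryRate;
}

// retryOverheadUSD(10_000, 0.10) ≈ $1,000/month of pure waste
```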

Forced model upgrades: Teams sometimes upgrade to higher tiers (which require more spending) just to get higher rate limits. Before upgrading your tier, analyze whether request queuing, caching, or model routing could solve the throughput problem without increasing your spend commitment.

Batch API as rate limit bypass: The Batch API has separate, higher throughput limits than real-time endpoints. If you are hitting rate limits on bulk workloads, migrating to batch processing simultaneously solves the rate limit problem and saves 50% on costs.

Architectural decisions: Rate limits influence fundamental architecture choices. A system that makes one LLM call per user request is constrained by RPM. A system that makes five chained calls per user request hits RPM limits 5x faster. Designing for fewer, more efficient calls per user interaction improves both cost and rate limit headroom.

CostHawk's rate limit monitoring dashboard shows your current utilization as a percentage of your limit for each model, along with 429 event frequency and retry cost overhead. This data helps you make informed decisions about when to implement queuing, when to upgrade tiers, and when to restructure your calling patterns.

Monitoring Rate Limit Usage

Proactive monitoring of rate limit utilization prevents outages and identifies optimization opportunities before they become incidents.

Key metrics to track:

  • Limit utilization percentage — Track TPM and RPM usage as a percentage of your limit. Alert at 70% utilization so you have time to react before hitting 100%.
  • 429 error rate — The percentage of requests that receive a 429 response. Any rate above 1% indicates a systemic issue that needs architectural attention.
  • Retry overhead cost — The dollar amount spent on input tokens for requests that were retried after a 429. This is pure waste.
  • P99 latency including retries — Rate limit retries add seconds to minutes of latency. Track the tail latency impact.
  • Per-model utilization — Some models may be near their limits while others have headroom. This informs model routing decisions.

OpenAI returns rate limit headers with every response: x-ratelimit-limit-requests, x-ratelimit-remaining-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-tokens, and x-ratelimit-reset-requests. Parse and log these headers to build a real-time utilization dashboard.

Anthropic returns similar headers: anthropic-ratelimit-requests-limit, anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-limit, and anthropic-ratelimit-tokens-remaining.

// Extract rate limit headers from API responses (OpenAI-style names);
// `metrics` is a placeholder for your metrics client
function logRateLimitStatus(headers: Headers, model: string) {
  const remaining = parseInt(headers.get('x-ratelimit-remaining-tokens') || '0', 10);
  const limit = parseInt(headers.get('x-ratelimit-limit-tokens') || '1', 10);
  const utilization = ((limit - remaining) / limit) * 100;

  metrics.gauge('rate_limit.utilization_pct', utilization, { model });

  if (utilization > 70) {
    console.warn(`Rate limit warning: ${model} at ${utilization.toFixed(1)}% TPM utilization`);
  }
}

CostHawk aggregates rate limit data across all your API keys and providers, providing a unified view of your rate limit posture. The dashboard highlights models approaching their limits and estimates the cost impact of current retry rates.

FAQ

Frequently Asked Questions

What exactly happens when I hit a rate limit?
When you exceed a rate limit, the provider immediately returns an HTTP 429 Too Many Requests response without processing your request. The response includes a Retry-After header indicating how many seconds you should wait before retrying. Importantly, the input tokens you sent are not billed for the failed request — you are only billed when a request succeeds. However, if your retry logic re-sends the same request and it succeeds, you pay for all the input and output tokens of the successful attempt. The total latency for that request is now the original attempt time plus the backoff delay plus the successful retry time, which can be 10-60 seconds.
How do I check my current rate limits?
For OpenAI, visit the Usage page in your dashboard under Settings > Limits, which shows your current tier and all applicable rate limits per model. You can also check programmatically by inspecting the rate limit response headers (x-ratelimit-limit-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens) returned with every successful API response. For Anthropic, your rate limits are shown in the API Settings page of your Anthropic Console, and similar headers are included in every response. Google Cloud AI rate limits are configured and visible in the Cloud Console under API quotas. CostHawk aggregates rate limit information across all your providers into a single unified dashboard, so you can see all limits, current utilization percentages, and historical 429 error rates without visiting multiple provider portals.
Can I get my rate limits increased?
Yes, but the process varies by provider. OpenAI automatically increases rate limits as you progress through usage tiers — spending more money and maintaining your account over time unlocks higher tiers. Tier 1 starts at just $5 in total spend with 500 RPM for GPT-4o, while Tier 5 requires $1,000+ and provides 10,000 RPM. Anthropic offers tier upgrades through their console and sales team. For custom limits beyond the published tiers, both providers offer enterprise agreements with negotiated rate limits. Google Cloud allows quota increase requests through the Cloud Console. Before requesting increases, ensure you have implemented caching, queuing, and model routing — these often resolve rate limit issues without provider intervention.
Do different models have different rate limits?
Yes. Each model has its own independent rate limits. For example, on OpenAI Tier 3, GPT-4o has 5,000 RPM and 800,000 TPM, while GPT-4o-mini has 5,000 RPM and 4,000,000 TPM — five times the token throughput. On Anthropic, Claude Opus 4 typically has lower RPM and TPM limits than Claude Sonnet 4 or Claude Haiku 3.5. This asymmetry is intentional: more expensive models consume more compute and GPU memory, so providers allocate less capacity per customer. This is why model routing is a powerful strategy — routing simple requests to smaller models simultaneously saves money and frees up rate limit headroom on the expensive models.
What is the difference between RPM and TPM rate limits?
RPM (requests per minute) limits how many API calls you can make per minute regardless of size. TPM (tokens per minute) limits the total number of tokens (input + output) processed per minute regardless of how many requests generated them. You can hit either limit independently. A chatbot sending many small requests (100 tokens each) is more likely to hit RPM first. An application sending fewer but larger requests (50,000 tokens each for document processing) is more likely to hit TPM first. Both limits are enforced simultaneously, so even if you have RPM headroom, exceeding TPM will trigger a 429. Effective capacity planning requires tracking both metrics.
How do rate limits affect my costs?
Rate limits affect costs in three ways. First, retries waste money — every 429 error followed by a successful retry doubles the input token cost for that request. A 10% retry rate on $10,000 monthly spend adds approximately $1,000 in waste. Second, rate limits can force architectural decisions that increase costs, such as upgrading to a higher provider tier (which requires more minimum spend) or switching to a more expensive model that has higher limits. Third, rate limits can push you toward batch processing, which actually saves 50% — this is a case where the constraint produces a better outcome. CostHawk quantifies all three effects in its rate limit cost impact report.
Should I use multiple API keys to get around rate limits?
Rate limits are typically enforced at the organization level, not the individual API key level. Creating multiple API keys under the same organization will not increase your aggregate rate limits — all keys share the same organizational quota. However, if you have legitimate separate organizations or accounts (for example, different business units or customer accounts with separate billing), each organization has its own independent limits. Some teams use separate provider organizations for production, staging, and batch workloads to ensure production traffic is never throttled by non-production work consuming shared quota. This is a valid and recommended pattern for workload isolation, but creating multiple organizations solely to circumvent rate limits may violate provider terms of service and risk account suspension.
How does the Batch API interact with rate limits?
The Batch API has completely separate rate limits from the real-time API for OpenAI. Batch token usage does not count against your real-time TPM or RPM limits at all. This means you can process large batch jobs containing millions of tokens without degrading your real-time API availability or throughput for user-facing features. OpenAI provides a separate batch queue limit measured in enqueued tokens per day rather than tokens per minute. For Anthropic, the situation is different: batch requests currently share rate limits with real-time requests, which means submitting a large batch of thousands of requests can temporarily reduce available capacity that your real-time traffic needs. If you use Anthropic, plan batch submissions during off-peak hours or use separate workspaces to avoid impacting real-time production workloads.
What is a retry storm and how do I prevent one?
A retry storm occurs when a rate limit triggers 429 errors, which cause retries, which hit the rate limit again, which cause more retries, creating a cascading feedback loop. In severe cases, a retry storm can amplify traffic by 5-10x and persist for minutes after the initial trigger resolves. Prevention requires three mechanisms: exponential backoff (each retry waits exponentially longer), jitter (randomizing retry timing to prevent synchronized retries), and a circuit breaker (stop retrying entirely after a threshold is reached and fail fast instead). The circuit breaker is the most important — without it, exponential backoff alone only slows the storm rather than stopping it.
How do rate limits work for streaming responses?
Streaming responses consume rate limits the same as non-streaming responses. The RPM is consumed when the request begins, and the TPM is consumed based on the total input and output tokens of the completed response. Streaming does not provide any rate limit advantage — it only affects latency perception because the client sees partial results before the response is complete. One nuance: if a streaming response is interrupted mid-stream (e.g., the client disconnects), you are still billed for all tokens generated up to the point of disconnection, and those tokens count against your TPM. This is another reason to set appropriate max_tokens limits on every request.
