Failover
Automatically switching to a backup LLM provider when the primary fails or becomes unavailable. Failover prevents user-facing downtime in AI-powered features but introduces cost implications when backup providers have different pricing. A well-designed failover strategy balances reliability against budget impact.
Why It Matters for AI Costs
LLM provider outages are more common than most teams realize. In 2025 alone, OpenAI experienced 14 incidents with degraded API performance or full outages, Anthropic had 8 significant incidents, and Google Cloud's Gemini API had 6 documented outages. The average incident duration was 47 minutes, but some lasted several hours. For an application serving 10,000 users, a 47-minute outage means:
- 7,833 failed requests (at a baseline of 10,000 requests per hour)
- $0 in direct API costs saved — but potentially thousands in lost revenue or SLA penalties
- User trust erosion that takes weeks to rebuild
Failover eliminates this exposure. When the primary provider goes down, requests automatically redirect to a backup within seconds — often before users notice any disruption. The cost of failover is typically a 10–30% premium per request during the failover period (because backup providers may be more expensive) plus the engineering investment to implement and test the failover logic.
Consider the math for a $100,000/month AI API spend:
- Without failover: 14 outages × 47 minutes = ~11 hours of downtime per year. If each hour of downtime costs $5,000 in lost revenue, that is $55,000/year in losses.
- With failover: Near-zero downtime. Failover traffic costs ~20% more than primary, but only during outages. If failover handles 2% of total traffic annually, the premium is $100,000 × 0.02 × 0.20 = $400/year.
Failover delivers a 137:1 ROI ($55,000 saved for $400 in additional cost) in this scenario. Even in less extreme cases, the business case for failover is compelling for any application where LLM availability directly impacts users or revenue. CostHawk tracks failover events and their cost impact, showing you exactly how much your failover strategy costs versus how much downtime it prevents.
What is LLM Failover?
LLM failover is an automated mechanism that detects when your primary LLM provider is failing and redirects requests to a pre-configured backup. The detection can be based on explicit failure signals (HTTP 500 errors, connection timeouts, 429 rate limit responses) or implicit degradation signals (latency exceeding a threshold, response quality below a benchmark). Once a failure is detected, the failover system routes subsequent requests to the backup provider until the primary recovers.
Failover operates at a different layer than retry logic. A retry resends the same request to the same backend after a transient error, hoping the next attempt succeeds. Failover switches to an entirely different backend, acknowledging that the primary is experiencing a sustained issue that retries will not resolve. In practice, the two work together: retry the primary 1–2 times for transient errors, then fail over to the backup if the primary is persistently failing.
The failover lifecycle has four phases:
- Detection. The system identifies that the primary backend is failing. This can be immediate (a single request returns a 500 error) or statistical (error rate exceeds 10% over a 60-second window). Statistical detection is more robust because it avoids premature failover on isolated transient errors.
- Activation. The system switches new requests to the backup backend. In-flight requests to the primary may be abandoned (fast failover) or allowed to complete/timeout (graceful failover). The choice depends on your latency tolerance — fast failover minimizes user impact but may result in duplicate requests if the primary eventually responds.
- Operation. The backup backend serves all traffic while the primary is down. During this phase, the system periodically probes the primary (synthetic health check requests) to detect recovery. Cost tracking is critical during this phase because pricing may differ significantly.
- Recovery. The primary comes back online and begins passing health checks. The system gradually shifts traffic back to the primary — either immediately (hard recovery) or incrementally over several minutes (gradual recovery). Gradual recovery is safer because it validates that the primary is stable before sending full traffic. A common pattern is to shift 10% of traffic back to the primary, wait 2 minutes, then 25%, 50%, 75%, and finally 100%.
The entire failover lifecycle should be automated and require no human intervention. Manual failover — where an engineer notices an outage and reconfigures the system — is too slow for production use. By the time a human responds, users have already experienced minutes of degraded service.
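A minimal sketch of the statistical detection and gradual-recovery ramp described above, using the illustrative numbers from this section (60-second window, 10% error rate, 10/25/50/75/100% recovery steps). The class and function names are placeholders, not a prescribed API:

```typescript
// Sketch: statistical failure detection over a sliding 60-second window plus a
// gradual-recovery ramp. Window size, error-rate threshold, and step schedule
// mirror the illustrative values in this section.
class ErrorRateDetector {
  private events: { at: number; failed: boolean }[] = []

  record(failed: boolean): void {
    const now = Date.now()
    this.events.push({ at: now, failed })
    // Drop observations older than the 60-second window
    this.events = this.events.filter(e => now - e.at < 60_000)
  }

  shouldFailOver(): boolean {
    if (this.events.length < 10) return false // too few samples to judge
    const failures = this.events.filter(e => e.failed).length
    return failures / this.events.length > 0.10 // error rate above 10%
  }
}

// Gradual recovery: fraction of traffic returned to the primary at each step,
// with a pause between steps to confirm the primary stays healthy.
const RECOVERY_STEPS = [0.10, 0.25, 0.50, 0.75, 1.0]
const RECOVERY_STEP_INTERVAL_MS = 2 * 60_000 // 2 minutes between steps

function routeToPrimary(recoveryFraction: number): boolean {
  // Probabilistic split: send roughly `recoveryFraction` of requests to the primary
  return Math.random() < recoveryFraction
}
```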
Failover Strategies
There are three primary failover strategies, each with different cost, complexity, and recovery-time characteristics.
1. Hot standby. The backup provider receives a small percentage of live traffic at all times (typically 5–10%). This ensures the backup is warm — its connections are established, any caching layers are populated, and you have recent latency/quality data. When the primary fails, the backup simply absorbs the primary's traffic share, scaling from 5% to 100%. Hot standby provides the fastest failover (under 5 seconds) because the backup is already active and proven healthy. The cost overhead during normal operation is minimal: 5–10% of traffic routed to a possibly more expensive provider adds 1–3% to your total bill. Hot standby is the recommended strategy for production applications with high availability requirements.
2. Warm fallback. The backup provider is configured and tested but does not receive live traffic during normal operation. When the primary fails, the system activates the backup by establishing connections and sending the first request. Warm fallback has a longer activation time (5–30 seconds for connection establishment and first-request latency) but zero cost overhead during normal operation. The risk is that the backup may have changed since it was last tested — the provider may have updated its API, rate limits, or pricing. Regular health checks (every 5 minutes) mitigate this risk. Warm fallback is appropriate for applications with moderate availability requirements (99.9% SLA) where the cost of hot standby is not justified.
3. Degraded service. Instead of failing over to another provider, the application degrades gracefully when the primary fails. For a chatbot, this might mean returning a cached or template response ("I'm experiencing high demand, please try again in a moment"). For a code review tool, it might mean showing a "review pending" badge and processing the review asynchronously when the provider recovers. Degraded service has zero failover cost because no backup provider is invoked, but it provides a worse user experience than a true failover. It is appropriate for non-critical features where some degradation is acceptable, or as a last-resort fallback when all providers are down simultaneously.
Many production deployments combine strategies in a cascade: hot standby to Provider B when Provider A fails, warm fallback to Provider C if both A and B are down, and degraded service if all providers are unavailable. This layered approach maximizes availability while containing costs.
| Strategy | Failover Time | Normal-Operation Cost | Complexity | Best For |
|---|---|---|---|---|
| Hot standby | <5 seconds | +1–3% overhead | Medium | High-availability production apps |
| Warm fallback | 5–30 seconds | Zero overhead | Low–Medium | Moderate availability requirements |
| Degraded service | Instant | Zero overhead | Low | Non-critical features, last resort |
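One way to express a layered cascade like the one described above is a priority-ordered configuration. This is only a sketch — the provider names, traffic shares, and degraded-service message are placeholders:

```typescript
// Sketch: a layered failover cascade as a priority-ordered config.
const failoverCascade = [
  { tier: "primary",       provider: "provider-a", baselineTrafficShare: 0.95 },
  { tier: "hot-standby",   provider: "provider-b", baselineTrafficShare: 0.05 }, // kept warm
  { tier: "warm-fallback", provider: "provider-c", baselineTrafficShare: 0 },    // activated on demand
]

// Last resort when every provider in the cascade is unavailable.
const degradedServiceReply = {
  role: "assistant",
  content: "I'm experiencing high demand, please try again in a moment.",
}
```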
Cost of Failover
Failover has four distinct cost components: the premium cost during failover, the standing cost of backup capacity, the cost of over-provisioned backup capacity, and the engineering cost of implementation and testing. Understanding each component helps you make informed decisions about your failover architecture.
1. Premium cost during failover. When your primary provider goes down and traffic shifts to a backup, the backup may be more expensive. If your primary is Gemini 1.5 Pro ($1.25/MTok input) and your backup is Claude 3.5 Sonnet ($3.00/MTok input), you pay a 140% premium on every request during the failover period. For a workload processing 100,000 requests per hour, a 2-hour outage with failover to the more expensive provider adds:
```
Normal cost:   200K requests × 1,500 tokens × ($1.25 / 1M tokens) = $375
Failover cost: 200K requests × 1,500 tokens × ($3.00 / 1M tokens) = $900
Premium:       $900 − $375 = $525 for 2 hours of failover
```

This premium is almost always worth paying compared to the revenue lost from 2 hours of downtime. To minimize the premium, choose a backup provider with comparable or lower pricing than your primary. Gemini Flash ($0.10/MTok) as a backup for GPT-4o ($2.50/MTok) actually reduces cost during failover, though output quality may differ.
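The same arithmetic as a small helper. As in the example above, all tokens are priced at the input rate for simplicity; the function name is illustrative:

```typescript
// Sketch: failover cost premium, pricing every token at the input rate.
function failoverPremiumUsd(
  requests: number,
  tokensPerRequest: number,
  primaryPricePerMTok: number,
  backupPricePerMTok: number
): number {
  const millionsOfTokens = (requests * tokensPerRequest) / 1_000_000
  return millionsOfTokens * (backupPricePerMTok - primaryPricePerMTok)
}

// failoverPremiumUsd(200_000, 1_500, 1.25, 3.00) => 525
```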
2. Standing cost of backup capacity. Hot standby strategies route 5–10% of traffic to the backup during normal operation, incurring a small ongoing cost. If your primary processes $100,000/month in API costs and the backup is 20% more expensive per token, routing 5% of traffic to the backup adds: $100,000 × 0.05 × 0.20 = $1,000/month. This is the "insurance premium" for fast failover. Warm fallback and degraded service strategies have zero standing cost.
3. Over-provisioning costs. Some organizations maintain dedicated capacity (provisioned throughput) on their backup provider to ensure it can absorb the full primary workload during failover. This reserved capacity costs money whether or not it is used. A better approach for most teams is to use on-demand pricing for the backup and accept that rate limits may constrain throughput during failover. If your backup provider's rate limits are insufficient for your full workload, you may need multiple backup accounts or a tiered failover strategy that routes overflow traffic to a third provider.
4. Engineering cost. Building, testing, and maintaining failover logic is a real investment. Initial implementation takes 2–5 engineering days for a basic failover system. Testing requires simulating provider outages (chaos engineering), which requires additional tooling. Ongoing maintenance includes updating failover logic when providers change APIs or pricing, rotating backup API keys, and periodically running failover drills to verify the system works. CostHawk's failover event tracking and cost attribution reduce the monitoring burden by automatically calculating the cost impact of each failover event and alerting you to unexpected cost spikes.
Implementing Failover Logic
Here is a production-ready failover implementation in TypeScript that handles detection, activation, and recovery with proper error tracking and cost awareness.
```typescript
import OpenAI from "openai"

// Type aliases assuming the official openai Node SDK (v4+)
type ChatMessage = OpenAI.ChatCompletionMessageParam
type ChatResponse = OpenAI.ChatCompletion

interface RequestOptions {
  model: string
  maxTokens: number
}

interface ProviderConfig {
name: string
client: OpenAI
costPerInputMTok: number
costPerOutputMTok: number
isHealthy: boolean
consecutiveFailures: number
lastHealthCheck: number
}
class LLMFailoverManager {
private primary: ProviderConfig
private backups: ProviderConfig[]
private activeProvider: ProviderConfig
private readonly FAILURE_THRESHOLD = 3
private readonly HEALTH_CHECK_INTERVAL = 30_000 // 30s
private readonly RECOVERY_PROBE_INTERVAL = 60_000 // 60s
constructor(
primary: ProviderConfig,
backups: ProviderConfig[]
) {
this.primary = primary
this.backups = backups
this.activeProvider = primary
this.startHealthMonitor()
}
async sendRequest(
messages: ChatMessage[],
options: RequestOptions
): Promise<ChatResponse> {
try {
const response = await this.activeProvider.client
.chat.completions.create({
model: options.model,
messages,
max_tokens: options.maxTokens
})
// Reset failure counter on success
this.activeProvider.consecutiveFailures = 0
return response
} catch (error: any) {
return this.handleFailure(error, messages, options)
}
}
private async handleFailure(
error: any,
messages: ChatMessage[],
options: RequestOptions
): Promise<ChatResponse> {
this.activeProvider.consecutiveFailures++
// Retry on transient errors
if (this.isTransient(error) &&
this.activeProvider.consecutiveFailures < this.FAILURE_THRESHOLD) {
await this.delay(1000 * this.activeProvider.consecutiveFailures)
return this.sendRequest(messages, options)
}
// Failover to backup
this.activeProvider.isHealthy = false
console.warn(
`Provider ${this.activeProvider.name} marked unhealthy ` +
`after ${this.activeProvider.consecutiveFailures} failures`
)
const backup = this.backups.find(b => b.isHealthy)
if (!backup) throw new Error("All providers unavailable")
this.activeProvider = backup
return this.sendRequest(messages, options)
}
private isTransient(error: any): boolean {
return [429, 500, 502, 503, 504].includes(error.status)
}
private startHealthMonitor() {
setInterval(async () => {
// Probe unhealthy providers for recovery
if (!this.primary.isHealthy) {
try {
await this.primary.client.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: "ping" }],
max_tokens: 1
})
this.primary.isHealthy = true
this.primary.consecutiveFailures = 0
this.activeProvider = this.primary // Recover
} catch { /* still unhealthy */ }
}
}, this.RECOVERY_PROBE_INTERVAL)
}
private delay(ms: number) {
return new Promise(r => setTimeout(r, ms))
}
}
```

Key implementation details in this pattern:
- Consecutive failure threshold. Three consecutive failures trigger failover, not a single error. This prevents premature failover on isolated transient errors while responding quickly to sustained outages.
- Automatic recovery probing. The health monitor periodically checks the primary provider with a minimal-cost request (1 output token). When the primary recovers, traffic shifts back automatically without human intervention.
- Error classification. Only retryable HTTP status codes trigger the failover path. Client errors (400, 401, 403) indicate a problem with the request itself, not the provider, and should not trigger failover.
- Cost tracking integration. Wrap each request with CostHawk logging to track which provider served each request and calculate the cost premium during failover periods.
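As a sketch of what that cost attribution might look like independent of any specific tool, the per-request cost can be derived from the serving provider's prices and the usage block on the response. This reuses the ProviderConfig and ChatResponse types defined above and assumes an OpenAI-compatible usage format:

```typescript
// Sketch: dollar cost of a single response from the serving provider's prices.
function attributeCost(provider: ProviderConfig, response: ChatResponse): number {
  const usage = response.usage
  if (!usage) return 0
  return (
    (usage.prompt_tokens / 1_000_000) * provider.costPerInputMTok +
    (usage.completion_tokens / 1_000_000) * provider.costPerOutputMTok
  )
}
```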
Failover and Model Compatibility
The most challenging aspect of LLM failover is managing the differences between primary and backup models. Unlike traditional infrastructure failover (where the standby database is an exact replica), LLM failover switches between models that produce different outputs for the same input. Understanding and mitigating these differences is critical for maintaining application quality during failover events.
API format differences. Each provider has its own API format. OpenAI's chat completions API uses a messages array with role and content fields. Anthropic uses a separate system parameter and a different message structure. Google's Gemini API uses contents with parts. Your failover logic must translate between these formats when switching providers. Using a unified SDK like LiteLLM or Vercel AI SDK handles this translation automatically, making your failover code provider-agnostic.
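If you are not using a unified SDK, the translation layer for the common chat case is small. A sketch of mapping OpenAI-style messages to Anthropic's Messages API shape — system prompts move to a separate system field and only user/assistant turns stay in the messages array; the backup model name is an illustrative choice:

```typescript
// Sketch: OpenAI-style chat messages -> Anthropic Messages API request shape.
interface OpenAIStyleMessage { role: "system" | "user" | "assistant"; content: string }

function toAnthropicRequest(messages: OpenAIStyleMessage[], maxTokens: number) {
  // Collect system prompts into Anthropic's separate `system` field
  const system = messages
    .filter(m => m.role === "system")
    .map(m => m.content)
    .join("\n")

  // Keep only user/assistant turns in the messages array
  const turns = messages
    .filter(m => m.role !== "system")
    .map(m => ({ role: m.role as "user" | "assistant", content: m.content }))

  return {
    model: "claude-3-5-sonnet-latest", // illustrative backup model
    max_tokens: maxTokens,
    system: system || undefined,
    messages: turns,
  }
}
```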
Feature parity gaps. Not all providers support the same features. OpenAI's function calling has a different schema format than Anthropic's tool use. Some models support structured output (JSON mode) and others do not. Vision capabilities, audio processing, and extended context windows vary across providers. Your failover configuration must account for these gaps — if your primary provider supports a feature that your backup does not, you need either a compatible alternative or a graceful degradation path.
Key compatibility considerations by feature:
| Feature | OpenAI GPT-4o | Anthropic Claude 3.5 Sonnet | Google Gemini 1.5 Pro |
|---|---|---|---|
| Max context window | 128K tokens | 200K tokens | 2M tokens |
| Function/tool calling | Yes (functions API) | Yes (tool_use) | Yes (function_calling) |
| JSON mode | Yes | Yes (via tool_use) | Yes |
| Vision | Yes | Yes | Yes |
| Streaming | SSE | SSE | SSE |
| Prompt caching | Yes (50% discount) | Yes (90% discount) | Yes (context caching) |
Output quality testing. Before deploying failover to production, run your evaluation suite against both the primary and backup models. Measure accuracy, formatting compliance, response length, and any domain-specific metrics. If the backup model scores significantly lower on critical metrics, consider whether the failover is acceptable or whether you should invest in prompt adjustments specific to the backup model. Some teams maintain provider-specific system prompt variants that optimize each model's output for their use case. The failover logic selects the appropriate prompt variant when switching providers.
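A sketch of that provider-specific prompt selection; the prompt texts and provider keys are placeholders:

```typescript
// Sketch: system prompt variants keyed by provider, selected by failover logic.
const SYSTEM_PROMPT_VARIANTS: Record<string, string> = {
  "provider-a": "You are a concise assistant. Format answers as Markdown with ## headers.",
  "provider-b": "You are a concise assistant. Keep answers under 150 words and use ## headers.",
}

function systemPromptFor(providerName: string): string {
  return SYSTEM_PROMPT_VARIANTS[providerName] ?? SYSTEM_PROMPT_VARIANTS["provider-a"]
}
```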
Behavioral differences to watch for. Even when two models produce comparable quality outputs, they may differ in subtle ways that affect user experience: response length (Claude tends toward longer responses than GPT-4o), formatting preferences (Markdown headers vs. bold text), confidence calibration (some models hedge more than others), and handling of ambiguous instructions. For user-facing applications, these differences can be disorienting during a failover event. Mitigations include constraining output format with structured output modes and using detailed system prompts that specify expected behavior.
Monitoring Failover Events
Monitoring failover events is essential for understanding your system's reliability, measuring the cost impact of outages, and improving your failover configuration over time. Every failover event should be treated as an incident that generates a data record for analysis.
What to log for each failover event:
- Timestamp — when the failover was triggered.
- Trigger reason — what caused the failover (consecutive errors, latency threshold, rate limit exhaustion, manual trigger).
- Primary provider — which provider failed and what error codes were returned.
- Backup provider — which backup received the traffic.
- Duration — how long the failover lasted before the primary recovered.
- Requests affected — how many requests were served by the backup during the failover period.
- Cost impact — the additional cost incurred from using the backup provider versus what the primary would have cost.
- Quality impact — any measurable difference in output quality (if you run automated evaluations).
- Recovery method — automatic (health check detected recovery) or manual (engineer intervened).
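Captured as a structured record, those fields might look like the following sketch (the field names are illustrative, not a fixed schema):

```typescript
// Sketch: one failover event as a structured record for later analysis.
interface FailoverEvent {
  timestamp: string                  // ISO 8601, when failover was triggered
  triggerReason: "consecutive_errors" | "latency_threshold" | "rate_limit" | "manual"
  primaryProvider: string            // provider that failed
  primaryErrorCodes: number[]        // HTTP status codes observed before failover
  backupProvider: string             // provider that absorbed the traffic
  durationSeconds: number            // time until the primary recovered
  requestsServedByBackup: number
  costImpactUsd: number              // backup cost minus what the primary would have cost
  qualityDelta?: number              // eval-score difference, if automated evals run
  recoveryMethod: "automatic" | "manual"
}
```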
Alerting on failover events. Configure alerts at two levels:
- Immediate alert when failover activates. This notifies the on-call engineer that the primary provider is down and the system is running on backup. Even though the failover is automatic, human awareness is important in case the backup also starts failing or the failover has unexpected side effects.
- Cost threshold alert if failover cost premium exceeds a threshold. If your backup provider is significantly more expensive than your primary, a long failover event could spike your bill. Set an alert at a dollar threshold (for example, $500 in additional failover cost) to ensure someone is tracking the financial impact.
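A sketch of the cost-threshold check from the second alert level; the $500 figure and the notify callback are placeholders:

```typescript
// Sketch: alert once the cumulative failover cost premium crosses a threshold.
const FAILOVER_COST_ALERT_USD = 500

function checkFailoverCostAlert(
  cumulativePremiumUsd: number,
  notify: (message: string) => void
): void {
  if (cumulativePremiumUsd >= FAILOVER_COST_ALERT_USD) {
    notify(
      `Failover cost premium is $${cumulativePremiumUsd.toFixed(2)} ` +
      `(alert threshold $${FAILOVER_COST_ALERT_USD})`
    )
  }
}
```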
Post-incident analysis. After each failover event, review the data to answer four questions:
- Was the failover necessary? Did the primary truly fail, or did a transient error trigger premature failover? If premature, increase the failure threshold or detection window.
- Was failover fast enough? How many requests failed before failover activated? If too many, reduce the detection threshold or switch from warm fallback to hot standby.
- Was the backup adequate? Did the backup provider handle the traffic without issues? Were there quality complaints from users? If the backup struggled, consider adding a second backup or increasing the backup's rate limit tier.
- Was recovery smooth? Did the system correctly detect the primary's recovery and shift traffic back? Did the gradual recovery process work, or did it cause oscillation between providers?
CostHawk automatically detects failover events from changes in routing on wrapped keys and records them. The dashboard shows a timeline of failover events overlaid with cost data, making it easy to calculate the total cost of each outage — both the direct cost premium and the lost revenue from any requests that failed before failover activated. Over time, this data helps you optimize your failover configuration: tightening detection thresholds if failover is too slow, loosening them if failover triggers too often, or switching backup providers if the cost premium is too high.
Frequently Asked Questions
- How quickly should failover activate after a provider failure?
- Should I fail over to a cheaper or more expensive model?
- How do I test failover without causing a real outage?
- What happens to in-flight requests when failover activates?
- Can failover work with streaming responses?
- How do I handle failover when using provider-specific features?
- What is the cost impact of maintaining failover infrastructure?
- Should I use the same model on the backup provider or a different one?
Related Terms
Load Balancing
Distributing LLM API requests across multiple provider accounts, endpoints, or models to optimize for cost, latency, and availability. Load balancing prevents rate limit exhaustion on any single account and enables cost-aware request distribution.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
LLM Gateway
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
