Failover
Automatically switching to a backup LLM provider when the primary fails or becomes unavailable. Failover prevents user-facing downtime in AI-powered features but introduces cost implications when backup providers have different pricing. A well-designed failover strategy balances reliability against budget impact.
Why It Matters for AI Costs
LLM provider outages are more common than most teams realize. In 2025 alone, OpenAI experienced 14 incidents with degraded API performance or full outages, Anthropic had 8 significant incidents, and Google Cloud's Gemini API had 6 documented outages. The average incident duration was 47 minutes, but some lasted several hours. For an application serving 10,000 users, a 47-minute outage means:
- 7,833 failed requests (at a baseline of 10,000 requests per hour)
- $0 in direct API costs saved — but potentially thousands in lost revenue or SLA penalties
- User trust erosion that takes weeks to rebuild
Failover eliminates this exposure. When the primary provider goes down, requests automatically redirect to a backup within seconds — often before users notice any disruption. The cost of failover is typically a 10–30% premium per request during the failover period (because backup providers may be more expensive) plus the engineering investment to implement and test the failover logic.
Consider the math for a $100,000/month AI API spend:
- Without failover: 14 outages × 47 minutes = ~11 hours of downtime per year. If each hour of downtime costs $5,000 in lost revenue, that is $55,000/year in losses.
- With failover: Near-zero downtime. Failover traffic costs ~20% more than primary, but only during outages. If failover handles 2% of total traffic annually, the premium is $100,000 × 0.02 × 0.20 = $400/year.
Failover delivers a 137:1 ROI ($55,000 saved for $400 in additional cost) in this scenario. Even in less extreme cases, the business case for failover is compelling for any application where LLM availability directly impacts users or revenue. CostHawk tracks failover events and their cost impact, showing you exactly how much your failover strategy costs versus how much downtime it prevents.
What is LLM Failover?
LLM failover is an automated mechanism that detects when your primary LLM provider is failing and redirects requests to a pre-configured backup. The detection can be based on explicit failure signals (HTTP 500 errors, connection timeouts, 429 rate limit responses) or implicit degradation signals (latency exceeding a threshold, response quality below a benchmark). Once a failure is detected, the failover system routes subsequent requests to the backup provider until the primary recovers.
Failover operates at a different layer than retry logic. A retry resends the same request to the same backend after a transient error, hoping the next attempt succeeds. Failover switches to an entirely different backend, acknowledging that the primary is experiencing a sustained issue that retries will not resolve. In practice, the two work together: retry the primary 1–2 times for transient errors, then fail over to the backup if the primary is persistently failing.
The failover lifecycle has four phases:
- Detection. The system identifies that the primary backend is failing. This can be immediate (a single request returns a 500 error) or statistical (error rate exceeds 10% over a 60-second window). Statistical detection is more robust because it avoids premature failover on isolated transient errors.
- Activation. The system switches new requests to the backup backend. In-flight requests to the primary may be abandoned (fast failover) or allowed to complete/timeout (graceful failover). The choice depends on your latency tolerance — fast failover minimizes user impact but may result in duplicate requests if the primary eventually responds.
- Operation. The backup backend serves all traffic while the primary is down. During this phase, the system periodically probes the primary (synthetic health check requests) to detect recovery. Cost tracking is critical during this phase because pricing may differ significantly.
- Recovery. The primary comes back online and begins passing health checks. The system gradually shifts traffic back to the primary — either immediately (hard recovery) or incrementally over several minutes (gradual recovery). Gradual recovery is safer because it validates that the primary is stable before sending full traffic. A common pattern is to shift 10% of traffic back to the primary, wait 2 minutes, then 25%, 50%, 75%, and finally 100%.
The entire failover lifecycle should be automated and require no human intervention. Manual failover — where an engineer notices an outage and reconfigures the system — is too slow for production use. By the time a human responds, users have already experienced minutes of degraded service.
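A minimal sketch of the statistical detection and gradual-recovery ramp described above, using the illustrative numbers from this section (60-second window, 10% error rate, 10/25/50/75/100% recovery steps). The class and function names are placeholders, not a prescribed API:

```typescript
// Sketch: statistical failure detection over a sliding 60-second window plus a
// gradual-recovery ramp. Window size, error-rate threshold, and step schedule
// mirror the illustrative values in this section.
class ErrorRateDetector {
  private events: { at: number; failed: boolean }[] = []

  record(failed: boolean): void {
    const now = Date.now()
    this.events.push({ at: now, failed })
    // Drop observations older than the 60-second window
    this.events = this.events.filter(e => now - e.at < 60_000)
  }

  shouldFailOver(): boolean {
    if (this.events.length < 10) return false // too few samples to judge
    const failures = this.events.filter(e => e.failed).length
    return failures / this.events.length > 0.10 // error rate above 10%
  }
}

// Gradual recovery: fraction of traffic returned to the primary at each step,
// with a pause between steps to confirm the primary stays healthy.
const RECOVERY_STEPS = [0.10, 0.25, 0.50, 0.75, 1.0]
const RECOVERY_STEP_INTERVAL_MS = 2 * 60_000 // 2 minutes between steps

function routeToPrimary(recoveryFraction: number): boolean {
  // Probabilistic split: send roughly `recoveryFraction` of requests to the primary
  return Math.random() < recoveryFraction
}
```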
Failover Strategies
There are three primary failover strategies, each with different cost, complexity, and recovery-time characteristics.
1. Hot standby. The backup provider receives a small percentage of live traffic at all times (typically 5–10%). This ensures the backup is warm — its connections are established, any caching layers are populated, and you have recent latency/quality data. When the primary fails, the backup simply absorbs the primary's traffic share, scaling from 5% to 100%. Hot standby provides the fastest failover (under 5 seconds) because the backup is already active and proven healthy. The cost overhead during normal operation is minimal: 5–10% of traffic routed to a possibly more expensive provider adds 1–3% to your total bill. Hot standby is the recommended strategy for production applications with high availability requirements.
2. Warm fallback. The backup provider is configured and tested but does not receive live traffic during normal operation. When the primary fails, the system activates the backup by establishing connections and sending the first request. Warm fallback has a longer activation time (5–30 seconds for connection establishment and first-request latency) but zero cost overhead during normal operation. The risk is that the backup may have changed since it was last tested — the provider may have updated its API, rate limits, or pricing. Regular health checks (every 5 minutes) mitigate this risk. Warm fallback is appropriate for applications with moderate availability requirements (99.9% SLA) where the cost of hot standby is not justified.
3. Degraded service. Instead of failing over to another provider, the application degrades gracefully when the primary fails. For a chatbot, this might mean returning a cached or template response ("I'm experiencing high demand, please try again in a moment"). For a code review tool, it might mean showing a "review pending" badge and processing the review asynchronously when the provider recovers. Degraded service has zero failover cost because no backup provider is invoked, but it provides a worse user experience than a true failover. It is appropriate for non-critical features where some degradation is acceptable, or as a last-resort fallback when all providers are down simultaneously.
Many production deployments combine strategies in a cascade: hot standby to Provider B when Provider A fails, warm fallback to Provider C if both A and B are down, and degraded service if all providers are unavailable. This layered approach maximizes availability while containing costs.
| Strategy | Failover Time | Normal-Operation Cost | Complexity | Best For |
|---|---|---|---|---|
| Hot standby | <5 seconds | +1–3% overhead | Medium | High-availability production apps |
| Warm fallback | 5–30 seconds | Zero overhead | Low–Medium | Moderate availability requirements |
| Degraded service | Instant | Zero overhead | Low | Non-critical features, last resort |
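One way to express a layered cascade like the one described above is a priority-ordered configuration. This is only a sketch — the provider names, traffic shares, and degraded-service message are placeholders:

```typescript
// Sketch: a layered failover cascade as a priority-ordered config.
const failoverCascade = [
  { tier: "primary",       provider: "provider-a", baselineTrafficShare: 0.95 },
  { tier: "hot-standby",   provider: "provider-b", baselineTrafficShare: 0.05 }, // kept warm
  { tier: "warm-fallback", provider: "provider-c", baselineTrafficShare: 0 },    // activated on demand
]

// Last resort when every provider in the cascade is unavailable.
const degradedServiceReply = {
  role: "assistant",
  content: "I'm experiencing high demand, please try again in a moment.",
}
```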
Cost of Failover
Failover has four distinct cost components: the premium cost during failover, the standing cost of backup capacity, the cost of over-provisioned backup capacity, and the engineering cost of implementation and testing. Understanding each component helps you make informed decisions about your failover architecture.
1. Premium cost during failover. When your primary provider goes down and traffic shifts to a backup, the backup may be more expensive. If your primary is Gemini 1.5 Pro ($1.25/MTok input) and your backup is Claude 3.5 Sonnet ($3.00/MTok input), you pay a 140% premium on every request during the failover period. For a workload processing 100,000 requests per hour, a 2-hour outage with failover to the more expensive provider adds:
```
Normal cost:   200K requests × 1,500 tokens × ($1.25 / 1M tokens) = $375
Failover cost: 200K requests × 1,500 tokens × ($3.00 / 1M tokens) = $900
Premium:       $900 − $375 = $525 for 2 hours of failover
```

This premium is almost always worth paying compared to the revenue lost from 2 hours of downtime. To minimize the premium, choose a backup provider with comparable or lower pricing than your primary. Gemini Flash ($0.10/MTok) as a backup for GPT-4o ($2.50/MTok) actually reduces cost during failover, though output quality may differ.
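The same arithmetic as a small helper. As in the example above, all tokens are priced at the input rate for simplicity; the function name is illustrative:

```typescript
// Sketch: failover cost premium, pricing every token at the input rate.
function failoverPremiumUsd(
  requests: number,
  tokensPerRequest: number,
  primaryPricePerMTok: number,
  backupPricePerMTok: number
): number {
  const millionsOfTokens = (requests * tokensPerRequest) / 1_000_000
  return millionsOfTokens * (backupPricePerMTok - primaryPricePerMTok)
}

// failoverPremiumUsd(200_000, 1_500, 1.25, 3.00) => 525
```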
2. Standing cost of backup capacity. Hot standby strategies route 5–10% of traffic to the backup during normal operation, incurring a small ongoing cost. If your primary processes $100,000/month in API costs and the backup is 20% more expensive per token, routing 5% of traffic to the backup adds: $100,000 × 0.05 × 0.20 = $1,000/month. This is the "insurance premium" for fast failover. Warm fallback and degraded service strategies have zero standing cost.
3. Over-provisioning costs. Some organizations maintain dedicated capacity (provisioned throughput) on their backup provider to ensure it can absorb the full primary workload during failover. This reserved capacity costs money whether or not it is used. A better approach for most teams is to use on-demand pricing for the backup and accept that rate limits may constrain throughput during failover. If your backup provider's rate limits are insufficient for your full workload, you may need multiple backup accounts or a tiered failover strategy that routes overflow traffic to a third provider.
4. Engineering cost. Building, testing, and maintaining failover logic is a real investment. Initial implementation takes 2–5 engineering days for a basic failover system. Testing requires simulating provider outages (chaos engineering), which requires additional tooling. Ongoing maintenance includes updating failover logic when providers change APIs or pricing, rotating backup API keys, and periodically running failover drills to verify the system works. CostHawk's failover event tracking and cost attribution reduce the monitoring burden by automatically calculating the cost impact of each failover event and alerting you to unexpected cost spikes.
Implementing Failover Logic
Here is a production-ready failover implementation in TypeScript that handles detection, activation, and recovery with proper error tracking and cost awareness.
```typescript
import OpenAI from "openai"

// Type aliases assuming the official openai Node SDK (v4+)
type ChatMessage = OpenAI.ChatCompletionMessageParam
type ChatResponse = OpenAI.ChatCompletion

interface RequestOptions {
  model: string
  maxTokens: number
}

interface ProviderConfig {
name: string
client: OpenAI
costPerInputMTok: number
costPerOutputMTok: number
isHealthy: boolean
consecutiveFailures: number
lastHealthCheck: number
}
class LLMFailoverManager {
private primary: ProviderConfig
private backups: ProviderConfig[]
private activeProvider: ProviderConfig
private readonly FAILURE_THRESHOLD = 3
private readonly HEALTH_CHECK_INTERVAL = 30_000 // 30s
private readonly RECOVERY_PROBE_INTERVAL = 60_000 // 60s
constructor(
primary: ProviderConfig,
backups: ProviderConfig[]
) {
this.primary = primary
this.backups = backups
this.activeProvider = primary
this.startHealthMonitor()
}
async sendRequest(
messages: ChatMessage[],
options: RequestOptions
): Promise<ChatResponse> {
try {
const response = await this.activeProvider.client
.chat.completions.create({
model: options.model,
messages,
max_tokens: options.maxTokens
})
// Reset failure counter on success
this.activeProvider.consecutiveFailures = 0
return response
} catch (error: any) {
return this.handleFailure(error, messages, options)
}
}
private async handleFailure(
error: any,
messages: ChatMessage[],
options: RequestOptions
): Promise<ChatResponse> {
this.activeProvider.consecutiveFailures++
// Retry on transient errors
if (this.isTransient(error) &&
this.activeProvider.consecutiveFailures < this.FAILURE_THRESHOLD) {
await this.delay(1000 * this.activeProvider.consecutiveFailures)
return this.sendRequest(messages, options)
}
// Failover to backup
this.activeProvider.isHealthy = false
console.warn(
`Provider ${this.activeProvider.name} marked unhealthy ` +
`after ${this.activeProvider.consecutiveFailures} failures`
)
const backup = this.backups.find(b => b.isHealthy)
if (!backup) throw new Error("All providers unavailable")
this.activeProvider = backup
return this.sendRequest(messages, options)
}
private isTransient(error: any): boolean {
return [429, 500, 502, 503, 504].includes(error.status)
}
private startHealthMonitor() {
setInterval(async () => {
// Probe unhealthy providers for recovery
if (!this.primary.isHealthy) {
try {
await this.primary.client.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: "ping" }],
max_tokens: 1
})
this.primary.isHealthy = true
this.primary.consecutiveFailures = 0
this.activeProvider = this.primary // Recover
} catch { /* still unhealthy */ }
}
}, this.RECOVERY_PROBE_INTERVAL)
}
private delay(ms: number) {
return new Promise(r => setTimeout(r, ms))
}
}
```

Key implementation details in this pattern:
- Consecutive failure threshold. Three consecutive failures trigger failover, not a single error. This prevents premature failover on isolated transient errors while responding quickly to sustained outages.
- Automatic recovery probing. The health monitor periodically checks the primary provider with a minimal-cost request (1 output token). When the primary recovers, traffic shifts back automatically without human intervention.
- Error classification. Only retryable HTTP status codes trigger the failover path. Client errors (400, 401, 403) indicate a problem with the request itself, not the provider, and should not trigger failover.
- Cost tracking integration. Wrap each request with CostHawk logging to track which provider served each request and calculate the cost premium during failover periods.
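As a sketch of what that cost attribution might look like independent of any specific tool, the per-request cost can be derived from the serving provider's prices and the usage block on the response. This reuses the ProviderConfig and ChatResponse types defined above and assumes an OpenAI-compatible usage format:

```typescript
// Sketch: dollar cost of a single response from the serving provider's prices.
function attributeCost(provider: ProviderConfig, response: ChatResponse): number {
  const usage = response.usage
  if (!usage) return 0
  return (
    (usage.prompt_tokens / 1_000_000) * provider.costPerInputMTok +
    (usage.completion_tokens / 1_000_000) * provider.costPerOutputMTok
  )
}
```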
Failover and Model Compatibility
The most challenging aspect of LLM failover is managing the differences between primary and backup models. Unlike traditional infrastructure failover (where the standby database is an exact replica), LLM failover switches between models that produce different outputs for the same input. Understanding and mitigating these differences is critical for maintaining application quality during failover events.
API format differences. Each provider has its own API format. OpenAI's chat completions API uses a messages array with role and content fields. Anthropic uses a separate system parameter and a different message structure. Google's Gemini API uses contents with parts. Your failover logic must translate between these formats when switching providers. Using a unified SDK like LiteLLM or Vercel AI SDK handles this translation automatically, making your failover code provider-agnostic.
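If you are not using a unified SDK, the translation layer for the common chat case is small. A sketch of mapping OpenAI-style messages to Anthropic's Messages API shape — system prompts move to a separate system field and only user/assistant turns stay in the messages array; the backup model name is an illustrative choice:

```typescript
// Sketch: OpenAI-style chat messages -> Anthropic Messages API request shape.
interface OpenAIStyleMessage { role: "system" | "user" | "assistant"; content: string }

function toAnthropicRequest(messages: OpenAIStyleMessage[], maxTokens: number) {
  // Collect system prompts into Anthropic's separate `system` field
  const system = messages
    .filter(m => m.role === "system")
    .map(m => m.content)
    .join("\n")

  // Keep only user/assistant turns in the messages array
  const turns = messages
    .filter(m => m.role !== "system")
    .map(m => ({ role: m.role as "user" | "assistant", content: m.content }))

  return {
    model: "claude-3-5-sonnet-latest", // illustrative backup model
    max_tokens: maxTokens,
    system: system || undefined,
    messages: turns,
  }
}
```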
Feature parity gaps. Not all providers support the same features. OpenAI's function calling has a different schema format than Anthropic's tool use. Some models support structured output (JSON mode) and others do not. Vision capabilities, audio processing, and extended context windows vary across providers. Your failover configuration must account for these gaps — if your primary provider supports a feature that your backup does not, you need either a compatible alternative or a graceful degradation path.
Key compatibility considerations by feature:
| Feature | OpenAI GPT-4o | Anthropic Claude 3.5 Sonnet | Google Gemini 1.5 Pro |
|---|---|---|---|
| Max context window | 128K tokens | 200K tokens | 2M tokens |
| Function/tool calling | Yes (functions API) | Yes (tool_use) | Yes (function_calling) |
| JSON mode | Yes | Yes (via tool_use) | Yes |
| Vision | Yes | Yes | Yes |
| Streaming | SSE | SSE | SSE |
| Prompt caching | Yes (50% discount) | Yes (90% discount) | Yes (context caching) |
Output quality testing. Before deploying failover to production, run your evaluation suite against both the primary and backup models. Measure accuracy, formatting compliance, response length, and any domain-specific metrics. If the backup model scores significantly lower on critical metrics, consider whether the failover is acceptable or whether you should invest in prompt adjustments specific to the backup model. Some teams maintain provider-specific system prompt variants that optimize each model's output for their use case. The failover logic selects the appropriate prompt variant when switching providers.
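A sketch of that provider-specific prompt selection; the prompt texts and provider keys are placeholders:

```typescript
// Sketch: system prompt variants keyed by provider, selected by failover logic.
const SYSTEM_PROMPT_VARIANTS: Record<string, string> = {
  "provider-a": "You are a concise assistant. Format answers as Markdown with ## headers.",
  "provider-b": "You are a concise assistant. Keep answers under 150 words and use ## headers.",
}

function systemPromptFor(providerName: string): string {
  return SYSTEM_PROMPT_VARIANTS[providerName] ?? SYSTEM_PROMPT_VARIANTS["provider-a"]
}
```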
Behavioral differences to watch for. Even when two models produce comparable quality outputs, they may differ in subtle ways that affect user experience: response length (Claude tends toward longer responses than GPT-4o), formatting preferences (Markdown headers vs. bold text), confidence calibration (some models hedge more than others), and handling of ambiguous instructions. For user-facing applications, these differences can be disorienting during a failover event. Mitigations include constraining output format with structured output modes and using detailed system prompts that specify expected behavior.
Monitoring Failover Events
Monitoring failover events is essential for understanding your system's reliability, measuring the cost impact of outages, and improving your failover configuration over time. Every failover event should be treated as an incident that generates a data record for analysis.
What to log for each failover event:
- Timestamp — when the failover was triggered.
- Trigger reason — what caused the failover (consecutive errors, latency threshold, rate limit exhaustion, manual trigger).
- Primary provider — which provider failed and what error codes were returned.
- Backup provider — which backup received the traffic.
- Duration — how long the failover lasted before the primary recovered.
- Requests affected — how many requests were served by the backup during the failover period.
- Cost impact — the additional cost incurred from using the backup provider versus what the primary would have cost.
- Quality impact — any measurable difference in output quality (if you run automated evaluations).
- Recovery method — automatic (health check detected recovery) or manual (engineer intervened).
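Captured as a structured record, those fields might look like the following sketch (the field names are illustrative, not a fixed schema):

```typescript
// Sketch: one failover event as a structured record for later analysis.
interface FailoverEvent {
  timestamp: string                  // ISO 8601, when failover was triggered
  triggerReason: "consecutive_errors" | "latency_threshold" | "rate_limit" | "manual"
  primaryProvider: string            // provider that failed
  primaryErrorCodes: number[]        // HTTP status codes observed before failover
  backupProvider: string             // provider that absorbed the traffic
  durationSeconds: number            // time until the primary recovered
  requestsServedByBackup: number
  costImpactUsd: number              // backup cost minus what the primary would have cost
  qualityDelta?: number              // eval-score difference, if automated evals run
  recoveryMethod: "automatic" | "manual"
}
```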
Alerting on failover events. Configure alerts at two levels:
- Immediate alert when failover activates. This notifies the on-call engineer that the primary provider is down and the system is running on backup. Even though the failover is automatic, human awareness is important in case the backup also starts failing or the failover has unexpected side effects.
- Cost threshold alert if failover cost premium exceeds a threshold. If your backup provider is significantly more expensive than your primary, a long failover event could spike your bill. Set an alert at a dollar threshold (for example, $500 in additional failover cost) to ensure someone is tracking the financial impact.
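A sketch of the cost-threshold check from the second alert level; the $500 figure and the notify callback are placeholders:

```typescript
// Sketch: alert once the cumulative failover cost premium crosses a threshold.
const FAILOVER_COST_ALERT_USD = 500

function checkFailoverCostAlert(
  cumulativePremiumUsd: number,
  notify: (message: string) => void
): void {
  if (cumulativePremiumUsd >= FAILOVER_COST_ALERT_USD) {
    notify(
      `Failover cost premium is $${cumulativePremiumUsd.toFixed(2)} ` +
      `(alert threshold $${FAILOVER_COST_ALERT_USD})`
    )
  }
}
```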
Post-incident analysis. After each failover event, review the data to answer four questions:
- Was the failover necessary? Did the primary truly fail, or did a transient error trigger premature failover? If premature, increase the failure threshold or detection window.
- Was failover fast enough? How many requests failed before failover activated? If too many, reduce the detection threshold or switch from warm fallback to hot standby.
- Was the backup adequate? Did the backup provider handle the traffic without issues? Were there quality complaints from users? If the backup struggled, consider adding a second backup or increasing the backup's rate limit tier.
- Was recovery smooth? Did the system correctly detect the primary's recovery and shift traffic back? Did the gradual recovery process work, or did it cause oscillation between providers?
CostHawk automatically detects failover events from changes in routing on wrapped keys and records them. The dashboard shows a timeline of failover events overlaid with cost data, making it easy to calculate the total cost of each outage — both the direct cost premium and the lost revenue from any requests that failed before failover activated. Over time, this data helps you optimize your failover configuration: tightening detection thresholds if failover is too slow, loosening them if failover triggers too often, or switching backup providers if the cost premium is too high.
Frequently Asked Questions
- How quickly should failover activate after a provider failure?
- Should I fail over to a cheaper or more expensive model?
- How do I test failover without causing a real outage?
- What happens to in-flight requests when failover activates?
- Can failover work with streaming responses?
- How do I handle failover when using provider-specific features?
- What is the cost impact of maintaining failover infrastructure?
- Should I use the same model on the backup provider or a different one?
Related Terms
Load Balancing
Distributing LLM API requests across multiple provider accounts, endpoints, or models to optimize for cost, latency, and availability. Load balancing prevents rate limit exhaustion on any single account and enables cost-aware request distribution.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
LLM Gateway
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
