Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
What Is Model Routing?
Model routing is the practice of choosing which AI model handles each request based on the characteristics of that request. At its simplest, it is a decision function: given a request, which model should process it?
Consider a customer support application that handles three types of requests:
- FAQ lookups (60% of traffic) — Simple questions with known answers. A small, fast model handles these perfectly.
- Troubleshooting guidance (30% of traffic) — Moderate complexity requiring some reasoning. A mid-tier model works well.
- Escalation analysis (10% of traffic) — Complex multi-step reasoning requiring the best available model.
Without routing, all three categories hit the same expensive model. With routing, costs drop dramatically because the majority of requests use the cheapest appropriate model.
The key insight is that model quality is not a single dimension. A model that excels at creative writing may underperform at code generation, and a model that is overkill for binary classification may be exactly the right size for nuanced analysis. Routing exploits these capability differences to minimize cost without sacrificing quality where it matters.
Routing Architectures
There are four primary architectures for model routing, each with different complexity, cost, and effectiveness tradeoffs:
1. Rule-Based Routing
The simplest approach: hardcoded rules map request attributes to models. Rules typically match on task type, endpoint, input length, or user tier.
```typescript
function selectModel(request: RouteRequest): string {
  if (request.taskType === 'classification') return 'gpt-4.1-nano';
  if (request.taskType === 'summarization' && request.inputTokens < 1000)
    return 'gpt-4.1-mini';
  if (request.userTier === 'free') return 'gpt-4.1-mini';
  return 'gpt-4o'; // Default to capable model
}
```

Rule-based routing is easy to implement, has zero latency overhead, and is fully deterministic. The downside is that rules require manual tuning and cannot adapt to the content of individual requests.
2. Classifier-Based Routing
A lightweight ML classifier (logistic regression, small neural network, or even a regex-based heuristic) analyzes the request content and predicts which model tier is needed. The classifier is trained on historical request-response pairs labeled with the minimum model tier that produced acceptable quality.
Classifier-based routing adds 1-5ms of latency and can achieve 85-95% accuracy in selecting the correct model tier. The classifier itself is cheap to run — typically a small CPU-hosted model that incurs no API fees.
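As a sketch of the heuristic end of this spectrum, the router below scores a request with hand-set feature weights and thresholds the score into tiers. The features, weights, and thresholds are illustrative assumptions — in practice you would fit them on labeled historical requests:

```typescript
// Illustrative heuristic "classifier" router. Weights and thresholds below
// are assumptions for the sketch, not trained values.
type Tier = 'nano' | 'mini' | 'full';

interface Features {
  tokenCount: number;    // rough estimate of input tokens
  hasCodeBlock: boolean; // code-related requests tend to need stronger models
  questionDepth: number; // number of questions detected in the request
}

function extractFeatures(text: string): Features {
  return {
    tokenCount: Math.ceil(text.length / 4), // ~4 chars per token heuristic
    hasCodeBlock: /```/.test(text) || /\bfunction\b|\bclass\b/.test(text),
    questionDepth: (text.match(/\?/g) ?? []).length,
  };
}

// Linear score thresholded into tiers (logistic-regression style).
function classifyTier(f: Features): Tier {
  const score =
    0.001 * f.tokenCount +
    1.5 * (f.hasCodeBlock ? 1 : 0) +
    0.3 * f.questionDepth;
  if (score < 0.5) return 'nano';
  if (score < 2.0) return 'mini';
  return 'full';
}
```

A real deployment would replace the hand-set weights with coefficients learned from request-response pairs labeled with the minimum acceptable tier, but the routing shape stays the same: cheap feature extraction, a fast scoring function, a tier decision.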
3. LLM-as-Judge Routing
A small, cheap LLM evaluates each request and decides which model should handle it. For example, GPT-4.1-nano ($0.10/MTok input) reads the request and classifies its complexity as low, medium, or high, then routes accordingly.
This approach is more flexible than rules or classifiers because the judge can reason about novel request types. However, it adds latency (one additional LLM call per request) and cost (the judge call itself). The economics work when the judge call costs less than the savings from routing. A judge call costing $0.0001 that saves $0.01 by routing to a cheaper model is a 100x return.
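A minimal sketch of the judge pattern is below. The one-word judge prompt and the complexity-to-model map are illustrative assumptions; the `ChatClient` interface is a hypothetical minimal surface shaped like the OpenAI SDK's chat API so the sketch stays self-contained:

```typescript
// Minimal client surface, shaped like the OpenAI SDK's chat API (assumption
// for this sketch — substitute your actual client).
interface ChatClient {
  chat: {
    completions: {
      create(req: {
        model: string;
        messages: { role: 'system' | 'user'; content: string }[];
      }): Promise<{ choices: { message?: { content?: string | null } }[] }>;
    };
  };
}

type Complexity = 'low' | 'medium' | 'high';

const modelForComplexity: Record<Complexity, string> = {
  low: 'gpt-4.1-nano',
  medium: 'gpt-4.1-mini',
  high: 'gpt-4o',
};

function parseLabel(raw: string): Complexity {
  const label = raw.trim().toLowerCase();
  // Default to the capable tier when the judge's answer is unparseable.
  return label === 'low' || label === 'medium' || label === 'high'
    ? label
    : 'high';
}

async function judgeRoute(client: ChatClient, userPrompt: string): Promise<string> {
  const judge = await client.chat.completions.create({
    model: 'gpt-4.1-nano', // the judge itself should be the cheapest tier
    messages: [
      {
        role: 'system',
        content: 'Classify the complexity of the user request as exactly one word: low, medium, or high.',
      },
      { role: 'user', content: userPrompt },
    ],
  });
  return modelForComplexity[parseLabel(judge.choices[0]?.message?.content ?? '')];
}
```

Note the fail-safe in `parseLabel`: if the judge returns anything unexpected, the request routes to the capable model rather than risking a quality failure on a cheap one.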
4. Cascade (Fallback) Routing
The cascade approach sends every request to the cheapest model first, evaluates the output quality, and escalates to a more expensive model only if the quality is insufficient. Quality evaluation can be rule-based (check for specific patterns, confidence scores, or format compliance) or LLM-based (a judge model scores the output).
Cascading is the most conservative approach — it guarantees that every request eventually gets an adequate response. The tradeoff is that escalated requests pay for two model calls (the cheap one that was rejected plus the expensive one that was accepted). The cascade pattern works best when the cheap model handles 70%+ of requests successfully, making the double-call cost on the remaining 30% worthwhile.
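The cascade can be sketched as a cheap call gated by a quality check, with escalation only on rejection. The quality checks here (minimum length and a non-answer pattern) are illustrative assumptions — real gates depend on your task:

```typescript
// Sketch of cascade routing: try the cheap model first, escalate only when
// a quality gate rejects the output.
type ModelCall = (prompt: string) => Promise<string>;

// Illustrative quality gate — replace with task-specific checks such as
// format compliance or a judge-model score.
function isAcceptable(output: string): boolean {
  if (output.trim().length < 20) return false;              // suspiciously short
  if (/i (don'?t|cannot) know/i.test(output)) return false; // non-answer
  return true;
}

async function cascade(
  prompt: string,
  cheap: ModelCall,
  expensive: ModelCall
): Promise<{ output: string; escalated: boolean }> {
  const first = await cheap(prompt);
  if (isAcceptable(first)) return { output: first, escalated: false };
  // Only the rejected minority of requests pays for a second call.
  const second = await expensive(prompt);
  return { output: second, escalated: true };
}
```

Tracking the `escalated` flag per request gives you the escalation rate directly, which is the number that determines whether the cascade's double-call cost is worthwhile.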
Cost Impact of Routing
The financial impact of model routing depends on your traffic distribution and the price differential between models. Here is a realistic analysis for a production application processing 1 million requests per month:
Before routing (all traffic to GPT-4o at $2.50/$10.00 per MTok):
| Request Type | % of Traffic | Monthly Requests | Avg Input Tokens | Avg Output Tokens | Model | Monthly Cost |
|---|---|---|---|---|---|---|
| Simple FAQ | 50% | 500,000 | 500 | 150 | GPT-4o | $1,375 |
| Moderate tasks | 35% | 350,000 | 800 | 300 | GPT-4o | $1,750 |
| Complex reasoning | 15% | 150,000 | 1,200 | 500 | GPT-4o | $1,200 |
| Total monthly cost | 100% | 1,000,000 | | | | $4,325 |
After routing (tiered model selection):
| Request Type | % of Traffic | Monthly Requests | Avg Input Tokens | Avg Output Tokens | Model | Monthly Cost |
|---|---|---|---|---|---|---|
| Simple FAQ | 50% | 500,000 | 500 | 150 | GPT-4.1-nano | $55 |
| Moderate tasks | 35% | 350,000 | 800 | 300 | GPT-4.1-mini | $155 → $280 at list prices | 
| Complex reasoning | 15% | 150,000 | 1,200 | 500 | GPT-4o | $1,200 |
| Total monthly cost | 100% | 1,000,000 | | | | $1,535 |
Monthly savings: $2,790 (64.5% reduction)
Annual savings: $33,480
In the most aggressive routing scenarios — where 90%+ of traffic can be handled by nano or mini models — savings can reach 90-94%. The key metric is the percentage of traffic that can be successfully downgraded without quality loss. CostHawk's model comparison dashboard helps you identify this percentage by showing per-request cost and quality metrics across models.
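Projections like the tables above are easy to sanity-check with a small cost calculator. The sketch below uses the per-MTok list prices quoted in this article's code comments ($0.10/$0.40 for nano, $0.40/$1.60 for mini, $2.50/$10.00 for GPT-4o):

```typescript
// Project monthly cost for a traffic mix. Prices are per million tokens.
interface Segment {
  requests: number;
  avgInputTokens: number;
  avgOutputTokens: number;
  inputPricePerMTok: number;
  outputPricePerMTok: number;
}

function monthlyCost(segments: Segment[]): number {
  return segments.reduce((total, s) => {
    const inputCost = (s.requests * s.avgInputTokens / 1e6) * s.inputPricePerMTok;
    const outputCost = (s.requests * s.avgOutputTokens / 1e6) * s.outputPricePerMTok;
    return total + inputCost + outputCost;
  }, 0);
}

// The routed plan: nano for FAQs, mini for moderate tasks, GPT-4o for
// complex reasoning — same traffic mix as the tables above.
const routed = monthlyCost([
  { requests: 500_000, avgInputTokens: 500, avgOutputTokens: 150, inputPricePerMTok: 0.10, outputPricePerMTok: 0.40 },
  { requests: 350_000, avgInputTokens: 800, avgOutputTokens: 300, inputPricePerMTok: 0.40, outputPricePerMTok: 1.60 },
  { requests: 150_000, avgInputTokens: 1200, avgOutputTokens: 500, inputPricePerMTok: 2.50, outputPricePerMTok: 10.00 },
]); // ≈ $1,535/month under these assumptions
```

Swapping in your own traffic distribution and provider pricing turns this into a quick what-if tool for evaluating candidate routing plans before building anything.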
Implementing a Basic Router
Here is a rule-based router with fallback logic that you can adapt for production:
```typescript
import OpenAI from 'openai';

interface RouterConfig {
  rules: RouterRule[];
  defaultModel: string;
  fallbackModel: string;
}

interface RouterRule {
  name: string;
  match: (req: RoutableRequest) => boolean;
  model: string;
}

interface RoutableRequest {
  messages: OpenAI.Chat.ChatCompletionMessageParam[];
  taskType?: string;
  inputTokenEstimate: number;
  qualityRequired: 'low' | 'medium' | 'high';
}

const routerConfig: RouterConfig = {
  rules: [
    {
      name: 'simple-classification',
      match: (req) =>
        req.taskType === 'classification' ||
        req.qualityRequired === 'low',
      model: 'gpt-4.1-nano', // $0.10/$0.40 per MTok
    },
    {
      // Listed before the broader moderate rule so that short but
      // high-stakes requests are never downgraded.
      name: 'complex-reasoning',
      match: (req) => req.qualityRequired === 'high',
      model: 'gpt-4o', // $2.50/$10.00 per MTok
    },
    {
      name: 'moderate-generation',
      match: (req) =>
        req.qualityRequired === 'medium' ||
        req.inputTokenEstimate < 2000,
      model: 'gpt-4.1-mini', // $0.40/$1.60 per MTok
    },
  ],
  defaultModel: 'gpt-4.1-mini',
  fallbackModel: 'gpt-4o',
};

async function routeRequest(
  client: OpenAI,
  request: RoutableRequest,
  config: RouterConfig
): Promise<{ response: OpenAI.Chat.ChatCompletion; model: string; routed: boolean }> {
  // Find the first matching rule; rules are evaluated in order.
  const rule = config.rules.find((r) => r.match(request));
  const selectedModel = rule?.model ?? config.defaultModel;
  try {
    const response = await client.chat.completions.create({
      model: selectedModel,
      messages: request.messages,
    });
    return { response, model: selectedModel, routed: !!rule };
  } catch (error) {
    // Fall back to a more capable model on failure
    console.warn(`Model ${selectedModel} failed, falling back to ${config.fallbackModel}`);
    const response = await client.chat.completions.create({
      model: config.fallbackModel,
      messages: request.messages,
    });
    return { response, model: config.fallbackModel, routed: false };
  }
}
```
This router evaluates rules in order, selects the first matching model, and falls back to a more capable model if the selected model fails. Rule order matters: the `complex-reasoning` rule must precede `moderate-generation`, because a broad condition like `inputTokenEstimate < 2000` would otherwise capture short high-complexity requests. In production, you would also validate response quality (checking for format compliance, confidence scores, or content filters) and escalate to the fallback model when quality is insufficient, not just on errors.
Log every routing decision — the selected model, the rule that matched, and the request characteristics — so you can analyze routing effectiveness over time and tune your rules based on real data.
Quality vs Cost Tradeoffs
The fundamental challenge of model routing is ensuring that cheaper models deliver acceptable quality. "Acceptable" varies by use case, and getting this wrong erodes user trust. Here is a systematic approach to managing the tradeoff:
1. Define quality metrics per task type
Before routing, establish measurable quality criteria for each task type. For classification, accuracy and F1 score are objective. For generation, you may need human evaluation or LLM-as-judge scoring. Without defined metrics, routing decisions become guesswork.
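For classification tasks, these objective metrics are straightforward to compute. A minimal sketch of precision, recall, and F1 over binary predictions:

```typescript
// Precision, recall, and F1 for binary classification — the kind of
// objective quality metric you can compute before routing classification
// traffic to a cheaper model.
function f1Score(predicted: boolean[], actual: boolean[]): number {
  let tp = 0, fp = 0, fn = 0;
  for (let i = 0; i < predicted.length; i++) {
    if (predicted[i] && actual[i]) tp++;
    else if (predicted[i] && !actual[i]) fp++;
    else if (!predicted[i] && actual[i]) fn++;
  }
  const precision = tp / ((tp + fp) || 1); // guard against division by zero
  const recall = tp / ((tp + fn) || 1);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}
```

Running each candidate model's predictions on a held-out labeled set through a function like this gives you the per-model quality numbers that feed the benchmarking matrix in the next step.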
2. Benchmark every candidate model
Run each model against your quality metrics on a representative sample of production requests. A typical benchmarking matrix looks like this:
| Task Type | GPT-4o Quality | GPT-4.1-mini Quality | GPT-4.1-nano Quality | Minimum Acceptable |
|---|---|---|---|---|
| FAQ Classification | 98.2% | 97.1% | 95.8% | 95% |
| Sentiment Analysis | 96.5% | 95.3% | 92.1% | 93% |
| Code Generation | 94.7% | 88.2% | 71.3% | 90% |
| Creative Writing | 4.7/5.0 | 4.2/5.0 | 3.4/5.0 | 4.0/5.0 |
From this matrix, the router can safely route FAQ classification to nano (95.8% > 95% threshold), sentiment analysis to mini (95.3% > 93% threshold), but must keep code generation on GPT-4o (only it exceeds 90%) and creative writing on at least mini (4.2 > 4.0 threshold).
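This threshold rule can be mechanized: given benchmarked candidates for a task, pick the cheapest model that clears the minimum. The scores below mirror the illustrative FAQ row of the matrix above, and `costRank` is an assumed ordering (lower = cheaper):

```typescript
// Pick the cheapest model whose benchmarked quality clears the task's
// minimum acceptable threshold.
interface Candidate {
  model: string;
  quality: number;  // benchmarked quality score for this task type
  costRank: number; // lower rank = cheaper model
}

function cheapestAcceptable(candidates: Candidate[], minQuality: number): string | null {
  const acceptable = candidates
    .filter((c) => c.quality >= minQuality)
    .sort((a, b) => a.costRank - b.costRank);
  return acceptable[0]?.model ?? null; // null: no model meets the bar
}

// FAQ classification scores from the example matrix above.
const faqCandidates: Candidate[] = [
  { model: 'gpt-4.1-nano', quality: 95.8, costRank: 0 },
  { model: 'gpt-4.1-mini', quality: 97.1, costRank: 1 },
  { model: 'gpt-4o', quality: 98.2, costRank: 2 },
];
```

A `null` result is itself useful signal: it means no current model meets the bar and the task should stay on your most capable model (or the threshold needs revisiting).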
3. Monitor quality continuously
Quality can drift as models are updated, input distributions shift, or new task types emerge. Implement ongoing quality monitoring that samples routed requests, evaluates them against your metrics, and alerts when quality drops below thresholds. CostHawk's quality signals (when integrated with your evaluation pipeline) can trigger automatic routing adjustments.
4. Use shadow testing
Before enabling routing for a new task type, run the cheap model in shadow mode: send the request to both the current expensive model and the candidate cheap model, return the expensive model's response to the user, and compare the outputs offline. This lets you validate routing decisions with zero user impact.
5. Implement gradual rollout
Route 5% of traffic for a task type to the cheaper model, monitor quality for a week, then increase to 25%, then 50%, then 100%. This progressive approach catches quality issues early while limiting blast radius.
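A deterministic traffic split makes this rollout reproducible: hash a stable request key (such as a user ID) into a percentage bucket, so the same user always lands on the same side of the split. The FNV-1a-style hash and the model names below are illustrative choices:

```typescript
// Deterministic percentage rollout: hash a stable key into [0, 100) and
// route that slice of traffic to the cheaper candidate model.
function bucket(key: string): number {
  // FNV-1a-style hash — chosen for determinism and speed, not security.
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % 100; // stable value in [0, 100)
}

function pickModel(key: string, rolloutPercent: number): string {
  return bucket(key) < rolloutPercent
    ? 'gpt-4.1-mini' // cheaper candidate under evaluation
    : 'gpt-4o';      // incumbent model
}
```

Raising the rollout from 5% to 25% to 100% is then just a config change to `rolloutPercent`, and because the bucket is stable per key, no user flips back and forth between models mid-rollout.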
Model Routing and CostHawk
CostHawk provides several features that make model routing more effective and measurable:
Per-request cost attribution: CostHawk tracks the model used, tokens consumed, and cost for every request. This data is the foundation for routing analysis — you can see exactly which requests are using expensive models and calculate the savings potential from routing them to cheaper alternatives.
Task type tagging: Using CostHawk wrapped keys and request metadata, you can tag requests by task type. The usage dashboard then shows cost breakdowns by task type, revealing which categories are the best candidates for routing.
Model comparison analytics: CostHawk's model comparison view lets you see cost-per-request distributions across models for the same task type. If GPT-4o and GPT-4.1-mini produce similar quality for a given task type but GPT-4.1-mini costs 95% less, the dashboard makes this visible.
Savings simulation: The CostHawk savings calculator can model the impact of routing rules before you implement them. Input your traffic distribution, proposed routing rules, and model pricing, and see the projected monthly savings. This helps build the business case for investing in routing infrastructure.
Quality monitoring integration: CostHawk can ingest quality scores from your evaluation pipeline and correlate them with routing decisions. This closed-loop system ensures that cost savings do not come at the expense of output quality.
Teams that implement model routing with CostHawk's analytics typically achieve 40-80% cost reduction within the first month, with the exact savings depending on traffic distribution and the percentage of requests that can be safely downgraded to cheaper models.
Frequently Asked Questions
- How much can model routing save?
- Does model routing hurt output quality?
- What is the simplest way to start with model routing?
- Should I use an LLM to route requests to other LLMs?
- How does cascade routing compare to pre-classification routing?
- Can I route between different providers, not just different models?
- How do I measure routing effectiveness?
- What about latency differences between models?
- How often should I update my routing rules?
- Does model routing work with streaming responses?
Related Terms
- Cost Per Query — The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
- Token Pricing — The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
- Cost Per Token — The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
- Batch API — Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
- Cost Anomaly Detection — Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
- Pay-Per-Token — The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
