Glossary · Optimization · Updated 2026-03-16

Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Definition

What is Model Routing?

Model routing is an architectural pattern where an intelligent layer between the client and the AI provider evaluates each incoming request and selects the most cost-effective model that can deliver acceptable quality for that specific task. Instead of sending every request to a single expensive model like GPT-4o or Claude Opus 4, a router can direct simple classification tasks to a $0.10/MTok model, moderate tasks to a $1-3/MTok model, and only send truly complex reasoning tasks to a $15/MTok model. The routing decision is based on signals like task type, input complexity, user tier, latency requirements, and current budget utilization.

Impact

Why It Matters for AI Costs

Model routing is often the single most impactful optimization strategy for teams with diverse AI workloads. Industry benchmarks suggest that 60-80% of typical production requests can be handled by smaller, cheaper models without measurable quality degradation. A team sending 100% of traffic to GPT-4o at $10/MTok output can cut costs by 70-94% by routing the majority of requests to GPT-4o-mini at $0.60/MTok output — with the same or better latency. CostHawk's per-request cost and quality analytics help you identify which requests are candidates for routing to cheaper models, quantify the savings, and monitor quality after migration.

What Is Model Routing?

Model routing is the practice of choosing which AI model handles each request based on the characteristics of that request. At its simplest, it is a decision function: given a request, which model should process it?

Consider a customer support application that handles three types of requests:

  • FAQ lookups (60% of traffic) — Simple questions with known answers. A small, fast model handles these perfectly.
  • Troubleshooting guidance (30% of traffic) — Moderate complexity requiring some reasoning. A mid-tier model works well.
  • Escalation analysis (10% of traffic) — Complex multi-step reasoning requiring the best available model.

Without routing, all three categories hit the same expensive model. With routing, costs drop dramatically because the majority of requests use the cheapest appropriate model.
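To make the drop concrete, here is a small sketch of the blended-cost arithmetic for a traffic mix like the one above. The per-request dollar figures are hypothetical, chosen only to illustrate the effect of the split:

```typescript
// Illustrative per-request costs (hypothetical figures, not quoted pricing).
interface TrafficSegment {
  share: number;          // fraction of total traffic
  costPerRequest: number; // USD per request on the chosen model
}

// Blended cost per request across a traffic mix.
function blendedCost(segments: TrafficSegment[]): number {
  return segments.reduce((sum, s) => sum + s.share * s.costPerRequest, 0);
}

// All traffic on the expensive model vs. the routed mix.
const singleModel = blendedCost([{ share: 1.0, costPerRequest: 0.01 }]);
const routed = blendedCost([
  { share: 0.6, costPerRequest: 0.0005 }, // FAQ lookups -> small model
  { share: 0.3, costPerRequest: 0.002 },  // troubleshooting -> mid-tier
  { share: 0.1, costPerRequest: 0.01 },   // escalation -> top model
]);
console.log(routed / singleModel); // fraction of the original spend
```

With these assumed numbers the routed mix costs roughly a fifth of sending everything to the top model, which is the whole argument for routing in two functions.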

The key insight is that model quality is not a single dimension. A model that excels at creative writing may underperform at code generation, and a model that is overkill for binary classification may be exactly the right size for a simpler task. Routing exploits these capability differences to minimize cost without sacrificing quality where it matters.

Routing Architectures

There are four primary architectures for model routing, each with different complexity, cost, and effectiveness tradeoffs:

1. Rule-Based Routing

The simplest approach: hardcoded rules map request attributes to models. Rules typically match on task type, endpoint, input length, or user tier.

function selectModel(request: RouteRequest): string {
  if (request.taskType === 'classification') return 'gpt-4.1-nano';
  if (request.taskType === 'summarization' && request.inputTokens < 1000)
    return 'gpt-4.1-mini';
  if (request.userTier === 'free') return 'gpt-4.1-mini';
  return 'gpt-4o'; // Default to capable model
}

Rule-based routing is easy to implement, has zero latency overhead, and is fully deterministic. The downside is that rules require manual tuning and cannot adapt to the content of individual requests.

2. Classifier-Based Routing

A lightweight ML classifier (logistic regression, small neural network, or even a regex-based heuristic) analyzes the request content and predicts which model tier is needed. The classifier is trained on historical request-response pairs labeled with the minimum model tier that produced acceptable quality.

Classifier-based routing adds 1-5ms of latency and can achieve 85-95% accuracy in selecting the correct model tier. The classifier itself is cheap to run — typically a small model or even a CPU-based model that costs nothing in API fees.
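As a rough illustration of how cheap such a classifier can be, here is a heuristic sketch that scores complexity from surface features of the prompt. The features, weights, and thresholds are invented for the example and would need tuning against your own labeled request data:

```typescript
// Hypothetical tier names; map these to real model IDs in your router.
type Tier = 'nano' | 'mini' | 'full';

// Scores a prompt's complexity from cheap surface features (runs on CPU,
// costs nothing in API fees). Weights and thresholds are illustrative.
function classifyTier(prompt: string): Tier {
  let score = 0;
  if (prompt.length > 2000) score += 2;                              // long inputs
  if (/\b(why|explain|analyze|compare)\b/i.test(prompt)) score += 2; // reasoning verbs
  if (/```|\bfunction\b|\bclass\b/.test(prompt)) score += 3;         // code content
  if (prompt.split('?').length - 1 > 2) score += 1;                  // many sub-questions
  if (score >= 4) return 'full';
  if (score >= 2) return 'mini';
  return 'nano';
}
```

A real deployment would replace this with a trained model, but even a heuristic like this captures the core idea: the routing signal lives in the request content, and extracting it costs microseconds.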

3. LLM-as-Judge Routing

A small, cheap LLM evaluates each request and decides which model should handle it. For example, GPT-4.1-nano ($0.10/MTok input) reads the request and classifies its complexity as low, medium, or high, then routes accordingly.

This approach is more flexible than rules or classifiers because the judge can reason about novel request types. However, it adds latency (one additional LLM call per request) and cost (the judge call itself). The economics work when the judge call costs less than the savings from routing. A judge call costing $0.0001 that saves $0.01 by routing to a cheaper model is a 100x return.
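A minimal sketch of the judge pattern, with the judge call injected as a function so any small model (or a mock) can fill the role. The prompt wording and the tier-to-model mapping are assumptions for illustration, not a fixed provider API:

```typescript
type Complexity = 'low' | 'medium' | 'high';

// Illustrative tier-to-model mapping.
const TIER_MODELS: Record<Complexity, string> = {
  low: 'gpt-4.1-nano',
  medium: 'gpt-4.1-mini',
  high: 'gpt-4o',
};

// `askJudge` wraps the cheap judge model's API call; injecting it keeps
// this logic provider-agnostic and testable.
async function judgeRoute(
  userPrompt: string,
  askJudge: (judgePrompt: string) => Promise<string>
): Promise<string> {
  const verdict = await askJudge(
    `Classify the complexity of this request as exactly one word ` +
      `(low, medium, or high):\n\n${userPrompt}`
  );
  const normalized = verdict.trim().toLowerCase() as Complexity;
  // Fall back to the capable model if the judge answers off-format.
  return TIER_MODELS[normalized] ?? TIER_MODELS.high;
}
```

Defaulting to the strongest model on an unparseable verdict is the safe failure mode: a confused judge should cost you money, not quality.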

4. Cascade (Fallback) Routing

The cascade approach sends every request to the cheapest model first, evaluates the output quality, and escalates to a more expensive model only if the quality is insufficient. Quality evaluation can be rule-based (check for specific patterns, confidence scores, or format compliance) or LLM-based (a judge model scores the output).

Cascading is the most conservative approach — it guarantees that every request eventually gets an adequate response. The tradeoff is that escalated requests pay for two model calls (the cheap one that was rejected plus the expensive one that was accepted). The cascade pattern works best when the cheap model handles 70%+ of requests successfully, making the double-call cost on the remaining 30% worthwhile.
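A cascade can be sketched as two sequential calls behind a quality gate. Here `generate` and `isAcceptable` are caller-supplied stand-ins, not a specific provider API, and the model names are the illustrative ones used throughout this article:

```typescript
interface CascadeResult {
  output: string;
  model: string;
  escalated: boolean;
}

// Try the cheap model first; escalate to the strong model only when the
// quality check rejects the cheap output.
async function cascade(
  prompt: string,
  generate: (model: string, prompt: string) => Promise<string>,
  isAcceptable: (output: string) => boolean,
  cheapModel = 'gpt-4.1-mini',
  strongModel = 'gpt-4o'
): Promise<CascadeResult> {
  const cheap = await generate(cheapModel, prompt);
  if (isAcceptable(cheap)) {
    return { output: cheap, model: cheapModel, escalated: false };
  }
  // Quality gate failed: pay for a second call to the stronger model.
  const strong = await generate(strongModel, prompt);
  return { output: strong, model: strongModel, escalated: true };
}
```

The expected cost per request under this pattern is the cheap call's cost plus the escalation rate times the strong call's cost, which is why a high cheap-model success rate is essential to the economics.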

Cost Impact of Routing

The financial impact of model routing depends on your traffic distribution and the price differential between models. Here is a realistic analysis for a production application processing 1 million requests per month:

Before routing (all traffic to GPT-4o):

Request Type | % of Traffic | Monthly Requests | Avg Input Tokens | Avg Output Tokens | Model | Monthly Cost
Simple FAQ | 50% | 500,000 | 500 | 150 | GPT-4o | $1,375
Moderate tasks | 35% | 350,000 | 800 | 300 | GPT-4o | $1,750
Complex reasoning | 15% | 150,000 | 1,200 | 500 | GPT-4o | $1,200
Total monthly cost: $4,325

After routing (tiered model selection):

Request Type | % of Traffic | Monthly Requests | Avg Input Tokens | Avg Output Tokens | Model | Monthly Cost
Simple FAQ | 50% | 500,000 | 500 | 150 | GPT-4.1-nano | $55
Moderate tasks | 35% | 350,000 | 800 | 300 | GPT-4.1-mini | $280
Complex reasoning | 15% | 150,000 | 1,200 | 500 | GPT-4o | $1,200
Total monthly cost: $1,535

(Costs computed at $2.50/$10.00 per MTok input/output for GPT-4o, $0.10/$0.40 for GPT-4.1-nano, and $0.40/$1.60 for GPT-4.1-mini.)

Monthly savings: $2,790 (64.5% reduction)

Annual savings: $33,480

In the most aggressive routing scenarios — where 90%+ of traffic can be handled by nano or mini models — savings can reach 90-94%. The key metric is the percentage of traffic that can be successfully downgraded without quality loss. CostHawk's model comparison dashboard helps you identify this percentage by showing per-request cost and quality metrics across models.
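The table figures above can be reproduced with a few lines of arithmetic. This sketch recomputes the monthly totals from the token counts and the per-MTok list prices assumed in this article:

```typescript
interface Segment {
  requests: number;
  inputTokens: number;  // avg per request
  outputTokens: number; // avg per request
  inputPrice: number;   // USD per million input tokens
  outputPrice: number;  // USD per million output tokens
}

// Sum of (tokens in MTok) * (price per MTok) across all segments.
function monthlyCost(segments: Segment[]): number {
  return segments.reduce((sum, s) => {
    const inputCost = (s.requests * s.inputTokens / 1e6) * s.inputPrice;
    const outputCost = (s.requests * s.outputTokens / 1e6) * s.outputPrice;
    return sum + inputCost + outputCost;
  }, 0);
}

// Before: everything on GPT-4o ($2.50 in / $10.00 out per MTok).
const before = monthlyCost([
  { requests: 500_000, inputTokens: 500, outputTokens: 150, inputPrice: 2.5, outputPrice: 10 },
  { requests: 350_000, inputTokens: 800, outputTokens: 300, inputPrice: 2.5, outputPrice: 10 },
  { requests: 150_000, inputTokens: 1200, outputTokens: 500, inputPrice: 2.5, outputPrice: 10 },
]);

// After: tiered selection (nano $0.10/$0.40, mini $0.40/$1.60, GPT-4o unchanged).
const after = monthlyCost([
  { requests: 500_000, inputTokens: 500, outputTokens: 150, inputPrice: 0.1, outputPrice: 0.4 },
  { requests: 350_000, inputTokens: 800, outputTokens: 300, inputPrice: 0.4, outputPrice: 1.6 },
  { requests: 150_000, inputTokens: 1200, outputTokens: 500, inputPrice: 2.5, outputPrice: 10 },
]);
```

Running the same function over your own traffic distribution is the quickest way to estimate savings before building any routing infrastructure.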

Implementing a Basic Router

Here is a rule-based router with fallback logic that you can use as a starting point:

import OpenAI from 'openai';

interface RouterConfig {
  rules: RouterRule[];
  defaultModel: string;
  fallbackModel: string;
}

interface RouterRule {
  name: string;
  match: (req: RoutableRequest) => boolean;
  model: string;
}

interface RoutableRequest {
  messages: OpenAI.Chat.ChatCompletionMessageParam[];
  taskType?: string;
  inputTokenEstimate: number;
  qualityRequired: 'low' | 'medium' | 'high';
}

const routerConfig: RouterConfig = {
  rules: [
    {
      name: 'simple-classification',
      match: (req) =>
        req.taskType === 'classification' ||
        req.qualityRequired === 'low',
      model: 'gpt-4.1-nano',  // $0.10/$0.40 per MTok
    },
    {
      name: 'moderate-generation',
      match: (req) =>
        req.qualityRequired === 'medium' ||
        req.inputTokenEstimate < 2000,
      model: 'gpt-4.1-mini',  // $0.40/$1.60 per MTok
    },
    {
      name: 'complex-reasoning',
      match: (req) => req.qualityRequired === 'high',
      model: 'gpt-4o',        // $2.50/$10.00 per MTok
    },
  ],
  defaultModel: 'gpt-4.1-mini',
  fallbackModel: 'gpt-4o',
};

async function routeRequest(
  client: OpenAI,
  request: RoutableRequest,
  config: RouterConfig
): Promise<{ response: OpenAI.Chat.ChatCompletion; model: string; routed: boolean }> {
  // Find first matching rule
  const rule = config.rules.find(r => r.match(request));
  const selectedModel = rule?.model || config.defaultModel;

  try {
    const response = await client.chat.completions.create({
      model: selectedModel,
      messages: request.messages,
    });

    return { response, model: selectedModel, routed: !!rule };
  } catch (error) {
    // Fallback to more capable model on failure
    console.warn(`Model ${selectedModel} failed, falling back to ${config.fallbackModel}`);
    const response = await client.chat.completions.create({
      model: config.fallbackModel,
      messages: request.messages,
    });

    return { response, model: config.fallbackModel, routed: false };
  }
}

This router evaluates rules in order, selects the first matching model, and falls back to a more capable model if the selected model fails. In production, you would add quality validation on the response (checking for format compliance, confidence scores, or content filters) and escalate to the fallback model if quality is insufficient, not just on errors.

Log every routing decision — the selected model, the rule that matched, and the request characteristics — so you can analyze routing effectiveness over time and tune your rules based on real data.
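If you log each decision as a structured JSON line, that downstream analysis stays simple. A minimal sketch, where the field names are illustrative rather than a CostHawk schema:

```typescript
// Illustrative log-entry shape for one routing decision.
interface RoutingLogEntry {
  timestamp: string;
  ruleName: string | null; // null when the default model was used
  model: string;
  taskType?: string;
  inputTokenEstimate: number;
  latencyMs: number;
  escalated: boolean;
}

// Emits one JSON line per decision; `sink` defaults to stdout but can be
// any log shipper or analytics pipeline.
function logRoutingDecision(
  entry: RoutingLogEntry,
  sink: (line: string) => void = console.log
): void {
  sink(JSON.stringify(entry));
}
```

JSON lines aggregate cleanly: a month of these entries is enough to answer "which rule fires most often?" and "which rule's traffic escalates?" with a one-line query in most log tools.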

Quality vs Cost Tradeoffs

The fundamental challenge of model routing is ensuring that cheaper models deliver acceptable quality. "Acceptable" varies by use case, and getting this wrong erodes user trust. Here is a systematic approach to managing the tradeoff:

1. Define quality metrics per task type

Before routing, establish measurable quality criteria for each task type. For classification, accuracy and F1 score are objective. For generation, you may need human evaluation or LLM-as-judge scoring. Without defined metrics, routing decisions become guesswork.

2. Benchmark every candidate model

Run each model against your quality metrics on a representative sample of production requests. A typical benchmarking matrix looks like this:

Task Type | GPT-4o Quality | GPT-4.1-mini Quality | GPT-4.1-nano Quality | Minimum Acceptable
FAQ Classification | 98.2% | 97.1% | 95.8% | 95%
Sentiment Analysis | 96.5% | 95.3% | 92.1% | 93%
Code Generation | 94.7% | 88.2% | 71.3% | 90%
Creative Writing | 4.7/5.0 | 4.2/5.0 | 3.4/5.0 | 4.0/5.0

From this matrix, the router can safely route FAQ classification to nano (95.8% > 95% threshold), sentiment analysis to mini (95.3% > 93% threshold), but must keep code generation on GPT-4o (only it exceeds 90%) and creative writing on at least mini (4.2 > 4.0 threshold).
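Selecting from such a matrix is mechanical enough to automate. A sketch, using the illustrative scores and prices from this article:

```typescript
interface ModelBenchmark {
  model: string;
  quality: number;      // measured score on this task type
  pricePerMTok: number; // input price, USD per million tokens
}

// Returns the cheapest model whose measured quality meets the task's
// threshold, or null if no candidate clears the bar.
function cheapestAcceptable(
  benchmarks: ModelBenchmark[],
  minQuality: number
): string | null {
  const acceptable = benchmarks
    .filter((b) => b.quality >= minQuality)
    .sort((a, b) => a.pricePerMTok - b.pricePerMTok);
  return acceptable[0]?.model ?? null;
}

// FAQ classification row: nano clears the 95% bar, so it wins on price.
const faqModels = [
  { model: 'gpt-4o', quality: 98.2, pricePerMTok: 2.5 },
  { model: 'gpt-4.1-mini', quality: 97.1, pricePerMTok: 0.4 },
  { model: 'gpt-4.1-nano', quality: 95.8, pricePerMTok: 0.1 },
];
```

Re-running this selection whenever you re-benchmark (or when prices change) keeps the routing table honest without manual spreadsheet work.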

3. Monitor quality continuously

Quality can drift as models are updated, input distributions shift, or new task types emerge. Implement ongoing quality monitoring that samples routed requests, evaluates them against your metrics, and alerts when quality drops below thresholds. CostHawk's quality signals (when integrated with your evaluation pipeline) can trigger automatic routing adjustments.

4. Use shadow testing

Before enabling routing for a new task type, run the cheap model in shadow mode: send the request to both the current expensive model and the candidate cheap model, return the expensive model's response to the user, and compare the outputs offline. This lets you validate routing decisions with zero user impact.
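A shadow test can be as simple as running both models in parallel and returning only the current model's output. In this sketch, `generate` and `record` are caller-supplied stand-ins for your model client and comparison store:

```typescript
// Runs the candidate in shadow: the user always receives the current
// model's answer; both outputs go to `record` for offline comparison.
async function shadowTest(
  prompt: string,
  generate: (model: string, prompt: string) => Promise<string>,
  currentModel: string,
  candidateModel: string,
  record: (entry: {
    prompt: string;
    current: string;
    candidate: string | null;
  }) => void
): Promise<string> {
  const [current, candidate] = await Promise.all([
    generate(currentModel, prompt),
    // Never let a candidate failure affect the user-facing path.
    generate(candidateModel, prompt).catch(() => null),
  ]);
  record({ prompt, current, candidate });
  return current; // the user only ever sees the current model's output
}
```

Note that shadow mode doubles your API spend for the sampled traffic, so most teams shadow-test a small percentage of requests rather than all of them.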

5. Implement gradual rollout

Route 5% of traffic for a task type to the cheaper model, monitor quality for a week, then increase to 25%, then 50%, then 100%. This progressive approach catches quality issues early while limiting blast radius.
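To keep each user in a stable bucket as you ramp from 5% to 100%, hash a stable request key into a percentile rather than rolling a fresh random number per request. A minimal sketch:

```typescript
// Maps a stable key (e.g. a user id) to a deterministic value in [0, 100).
function hashToPercent(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return h % 100;
}

// The same user always lands in the same bucket, so raising
// rolloutPercent only ever adds users to the cheap-model group.
function useCheapModel(userId: string, rolloutPercent: number): boolean {
  return hashToPercent(userId) < rolloutPercent;
}
```

Deterministic bucketing also makes quality comparisons cleaner: the same users see the same model throughout a rollout stage, so week-over-week metrics are not confounded by users bouncing between models.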

Model Routing and CostHawk

CostHawk provides several features that make model routing more effective and measurable:

Per-request cost attribution: CostHawk tracks the model used, tokens consumed, and cost for every request. This data is the foundation for routing analysis — you can see exactly which requests are using expensive models and calculate the savings potential from routing them to cheaper alternatives.

Task type tagging: Using CostHawk wrapped keys and request metadata, you can tag requests by task type. The usage dashboard then shows cost breakdowns by task type, revealing which categories are the best candidates for routing.

Model comparison analytics: CostHawk's model comparison view lets you see cost-per-request distributions across models for the same task type. If GPT-4o and GPT-4.1-mini produce similar quality for a given task type but GPT-4.1-mini costs 95% less, the dashboard makes this visible.

Savings simulation: The CostHawk savings calculator can model the impact of routing rules before you implement them. Input your traffic distribution, proposed routing rules, and model pricing, and see the projected monthly savings. This helps build the business case for investing in routing infrastructure.

Quality monitoring integration: CostHawk can ingest quality scores from your evaluation pipeline and correlate them with routing decisions. This closed-loop system ensures that cost savings do not come at the expense of output quality.

Teams that implement model routing with CostHawk's analytics typically achieve 40-80% cost reduction within the first month, with the exact savings depending on traffic distribution and the percentage of requests that can be safely downgraded to cheaper models.

FAQ

Frequently Asked Questions

How much can model routing save?
Savings depend on your traffic mix, but most teams achieve 40-80% cost reduction with well-implemented model routing. The math is straightforward: if 60% of your requests can be handled by a model that costs 95% less (e.g., GPT-4.1-nano at $0.10/MTok vs GPT-4o at $2.50/MTok input), you save 57% on input costs from that traffic alone. In aggressive routing scenarios where 80-90% of traffic routes to nano or mini models, total savings can reach 70-94%. A team spending $10,000/month on a single expensive model can typically reduce to $2,000-$6,000/month with routing in place. CostHawk's savings simulator can project your specific savings based on your actual traffic patterns and task-type distribution before you write any routing code.
Does model routing hurt output quality?
Not if implemented correctly with proper benchmarking and quality monitoring. The key is that routing does not mean sending complex tasks to weak models — it means sending simple tasks to appropriately-sized models that can handle them well. A binary classification task (spam or not spam) does not need GPT-4o's advanced reasoning capabilities; GPT-4.1-nano handles it with 95%+ accuracy at 25x lower cost. Quality degradation only occurs when you route tasks to models that are genuinely incapable of handling the complexity required. This is why benchmarking is absolutely essential: test every candidate model on every task type against your quality thresholds before enabling routing in production. Implement gradual rollouts starting at 5% of traffic and continuous quality monitoring to catch regressions early before they affect a significant portion of users.
What is the simplest way to start with model routing?
Start with rule-based routing using task type as the primary routing signal. Categorize your existing API calls into 3-5 task types (e.g., classification, simple generation, summarization, complex reasoning), then benchmark 2-3 models on each category to find the cheapest one that meets your quality bar. Hardcode those model assignments in a simple routing function — this can be implemented in 50 lines of code and typically delivers 40-60% savings on day one. You do not need a sophisticated ML classifier or an LLM-as-judge to get meaningful results. Once you have basic routing in place, analyze the outcomes with CostHawk's per-request cost analytics and iteratively refine your rules based on real production data. Many teams find that this straightforward approach captures 80% of the total available savings without any complex infrastructure.
Should I use an LLM to route requests to other LLMs?
Using an LLM-as-judge router makes sense when rule-based routing cannot capture the routing signal. If the complexity of a request depends on the content itself (not just metadata like task type or input length), an LLM judge can analyze the request and classify its difficulty. The economics work when the judge call costs significantly less than the savings from routing. For example, a GPT-4.1-nano judge call costing $0.0001 that routes a request from GPT-4o ($0.01 per request) to GPT-4.1-mini ($0.001 per request) saves $0.0089 per routed request — an 89x return on the judge cost. However, the judge adds latency (200-500ms per call), so it is not suitable for latency-critical paths.
How does cascade routing compare to pre-classification routing?
Cascade routing tries the cheap model first and escalates on failure. Pre-classification routing decides the model before calling any model. Cascade routing is more conservative — it guarantees every request eventually gets a good response — but it wastes tokens on failed cheap model calls that get escalated. If 30% of requests escalate, each of those requests pays for both the cheap call and the expensive call, so the escalated traffic carries an overhead equal to the price of the wasted cheap call (typically a small fraction of the expensive call, but pure waste compared to pre-classification). Pre-classification routing is more efficient when your classifier is accurate, but risks sending complex requests to cheap models if the classifier makes mistakes. In practice, cascade routing is better for high-stakes applications where quality failures are costly, and pre-classification is better for high-volume applications where efficiency matters more.
Can I route between different providers, not just different models?
Yes, cross-provider routing is increasingly common. For example, you might route simple tasks to Google's Gemini 2.0 Flash (very fast and cheap), moderate tasks to Anthropic's Claude Haiku 3.5 (good balance of cost and quality), and complex tasks to OpenAI's GPT-4o or Anthropic's Claude Opus 4. Cross-provider routing adds complexity because each provider has different API formats, rate limits, and billing models. CostHawk's normalized pricing data across providers makes it easy to compare the true cost of each option. However, be aware that switching providers for the same task type can introduce subtle quality differences due to different training data and model architectures.
How do I measure routing effectiveness?
Track three key metrics: cost savings, quality maintenance, and routing accuracy. Cost savings = (pre-routing monthly cost - post-routing monthly cost) / pre-routing monthly cost. Quality maintenance = average quality score of routed requests compared to the baseline (all requests using the most capable model). Routing accuracy = percentage of requests where the selected model was the cheapest one that met the quality threshold (measured retrospectively by evaluating outputs). CostHawk tracks the first two metrics automatically. For routing accuracy, sample 1-5% of routed requests, evaluate them against all candidate models, and check whether a cheaper model could have produced acceptable quality. A routing accuracy above 85% indicates your rules are well-tuned.
What about latency differences between models?
Smaller models are generally significantly faster, which means model routing often improves latency alongside reducing costs — a rare win-win in engineering. GPT-4.1-nano responds in 100-300ms for typical requests, while GPT-4o may take 500-2000ms for the same input. This means routing simple requests to nano-tier models not only saves 90%+ on cost but delivers a noticeably better user experience through faster response times. However, if you use cascade routing, the total latency for escalated requests is higher because you pay for both the cheap model attempt and the expensive model attempt sequentially. For latency-sensitive applications like real-time chat or interactive coding assistants, prefer pre-classification routing over cascade routing to avoid the double-call latency penalty on requests that get escalated to the more capable model.
How often should I update my routing rules?
Review routing rules monthly, or whenever a significant change occurs: new model releases, pricing changes, traffic pattern shifts, or quality regressions. New model releases are the most impactful trigger — when a provider launches a new model tier (like GPT-4.1-nano), it may be cheaper and better than your current low-tier routing target, enabling more aggressive routing. Pricing changes can shift the cost-optimal model for each task type. Traffic pattern shifts (e.g., a new feature driving a new type of request) may require new routing rules. CostHawk's cost dashboard makes it easy to spot when your routing configuration is no longer optimal by showing per-task-type cost trends over time.
Does model routing work with streaming responses?
Yes, model routing works seamlessly with streaming responses for pre-classification routing. The routing decision happens before the API call is made, so the selected model streams its response normally without any special handling. One important consideration arises with cascade routing and streaming: if you stream the cheap model's response tokens to the user in real time and then determine the output is low quality after partial delivery, you cannot un-stream those tokens that the user has already seen. For cascade routing combined with streaming, either evaluate quality before streaming begins (using a non-streaming pre-check call or analyzing the first few tokens in a buffer before flushing to the client) or accept that escalation will require visibly restarting the response. Pre-classification routing avoids this problem entirely because the final model decision is made before any response generation begins.


Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.