Glossary · Observability · Updated 2026-03-16

Evals

Systematic evaluation of LLM output quality using automated metrics, human review, or LLM-as-judge methodologies. Evals are the quality gate that ensures cost optimizations — model downgrades, prompt compression, caching — do not silently degrade the user experience.

Definition

What Are Evals?

An eval (short for evaluation) is a structured process for measuring the quality, accuracy, safety, or usefulness of a large language model's outputs against a defined set of criteria. In production AI systems, evals serve as the quality assurance layer that validates whether the model is performing as expected — and crucially, whether cost optimizations like switching to a cheaper model, compressing prompts, or enabling caching have degraded output quality. Evals range from simple string-matching checks ("does the JSON output parse correctly?") to sophisticated multi-dimensional assessments involving human annotators or secondary LLMs acting as judges.

The term originates from the machine learning tradition of evaluation benchmarks (MMLU, HumanEval, GSM8K), but in the context of production AI applications, evals are far more specific and business-relevant. A generic benchmark might test whether a model can solve calculus problems. A production eval tests whether your model, with your system prompt, produces outputs that meet your users' expectations for your specific use case. This distinction is critical — a model that scores 92% on a public benchmark might score 68% on your custom eval if your domain requires specialized knowledge, specific formatting, or nuanced tone.

For AI cost management, evals are indispensable. Every cost optimization carries a quality risk. Switching from Claude 3.5 Sonnet ($3/$15 per MTok) to Claude 3.5 Haiku ($0.80/$4 per MTok) saves 73% on input costs and 73% on output costs, but does the cheaper model produce acceptable outputs for your use case? Without an eval suite, you are flying blind — saving money today while potentially degrading the product experience in ways that cost far more in user churn, support tickets, and lost revenue. With an eval suite, you can quantify exactly how much quality you trade for each dollar saved, and make informed decisions about where that tradeoff is acceptable.

Impact

Why It Matters for AI Costs

Evals are the bridge between cost optimization and quality assurance. Without them, every cost-saving change is a gamble. With them, every optimization is a measured, reversible experiment. Here is why they matter for teams managing AI spend:

Cost optimizations are meaningless without quality measurement. Suppose you switch from GPT-4o ($2.50/$10.00 per MTok) to GPT-4o mini ($0.15/$0.60 per MTok) and save $18,000/month. If your customer satisfaction score drops from 4.2 to 3.1, you have not "saved" anything — you have traded revenue for reduced API costs. Evals prevent this scenario by quantifying quality before and after the switch, giving you a clear signal: "GPT-4o mini scores 94% on our eval suite compared to 97% for GPT-4o, and the 3-point gap is entirely in creative writing tasks, which represent 8% of our traffic. For the other 92% of requests, quality is equivalent." That is an actionable insight. A blind model swap is a prayer.

The cost of running evals is a fraction of the cost of shipping degraded quality. A comprehensive eval suite that tests 500 examples across 10 quality dimensions using LLM-as-judge costs approximately $5–$25 per run (depending on the judge model and number of examples). Running it daily costs $150–$750/month. Compare that to the cost of shipping a regression that increases support tickets by 20% or reduces conversion rates by 5%. For any team spending more than $5,000/month on AI APIs, the eval investment pays for itself many times over.

Evals enable continuous optimization. The AI cost landscape changes constantly — new models launch monthly, pricing changes quarterly, and prompt engineering techniques evolve weekly. Without evals, each change requires manual quality review: engineers eyeballing outputs and making subjective judgments. With evals, you can run automated quality checks against every change, enabling a continuous optimization loop: hypothesize a savings opportunity, implement it in a test environment, run evals, compare quality scores, promote or reject the change. CostHawk's cost analytics combined with a rigorous eval pipeline creates a complete feedback loop where you can see both the cost savings and the quality impact of every optimization in a single view.

What Are LLM Evals?

LLM evals are systematic, repeatable tests that measure the quality of a language model's outputs against predefined criteria. They serve the same function in AI systems that unit tests and integration tests serve in traditional software engineering — they provide automated confidence that the system behaves correctly, catch regressions before they reach production, and document expected behavior in executable form.

An eval consists of four components:

  1. Test cases (the dataset). A collection of inputs (prompts, conversation histories, documents) paired with expected outputs or quality criteria. A summarization eval might include 200 articles with human-written reference summaries. A classification eval might include 1,000 customer messages with ground-truth labels. A code generation eval might include 150 function specifications with test suites that validate the generated code.
  2. The system under test. The specific configuration you are evaluating — the model, system prompt, temperature, tools, and any preprocessing or postprocessing logic. Changing any one of these parameters can affect output quality, so evals must capture the full configuration, not just the model name.
  3. The evaluation method. How you determine whether an output is good. This ranges from deterministic checks ("does the JSON parse?", "does the output contain required fields?") to subjective assessments ("is this summary accurate and coherent?"). The three primary evaluation methods — automated metrics, human review, and LLM-as-judge — are detailed in the next section.
  4. Scoring and aggregation. How individual test case results are combined into overall quality scores. Common approaches include pass/fail rates, mean scores on a 1–5 scale, percentile distributions ("95% of outputs score 4 or higher"), and dimensional breakdowns (accuracy: 92%, tone: 88%, formatting: 97%).
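To make the scoring-and-aggregation component concrete, here is a minimal sketch in TypeScript. The names (`CaseResult`, `passRate`, `dimensionMeans`) are illustrative, not a standard API:

```typescript
// Illustrative aggregation helpers (hypothetical names, not a standard API).
interface CaseResult {
  dimension: string; // e.g. "accuracy", "tone"
  score: number;     // 1-5 scale from a judge or reviewer
}

// Pass/fail rate: fraction of scores at or above a threshold.
function passRate(results: CaseResult[], threshold: number): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.score >= threshold).length / results.length;
}

// Mean score per dimension (e.g. accuracy: 4.6, tone: 4.1).
function dimensionMeans(results: CaseResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of results) {
    sums[r.dimension] ??= { total: 0, count: 0 };
    sums[r.dimension].total += r.score;
    sums[r.dimension].count += 1;
  }
  return Object.fromEntries(
    Object.entries(sums).map(([dim, s]) => [dim, s.total / s.count]),
  );
}
```

The same raw case results feed both views: pass/fail rates for quality gates, dimensional means for diagnosing where a configuration is weak.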

A well-designed eval suite typically covers multiple quality dimensions relevant to the specific application:

| Dimension | What It Measures | Example Criteria |
| --- | --- | --- |
| Accuracy | Factual correctness of the output | All dates, numbers, and claims match source material |
| Completeness | Whether all required information is present | Summary covers all 5 key points from the document |
| Relevance | Whether the output addresses the actual question | Response directly answers the user's query without tangents |
| Formatting | Structural correctness of the output | Valid JSON, correct schema, required fields present |
| Tone | Appropriateness of language and style | Professional, empathetic, matches brand voice |
| Safety | Absence of harmful, biased, or inappropriate content | No PII leakage, no hallucinated medical advice |
| Conciseness | Output length relative to information content | Under 200 words while preserving all key information |

For cost optimization specifically, conciseness is a dimension that directly impacts the bill — a model that produces equally accurate output in 150 tokens instead of 400 tokens saves you 62% on output costs per request. Evals that track output token counts alongside quality scores reveal which optimizations save money without losing quality and which ones save money because they lose quality (producing shorter but less complete outputs).

Types of Evals

The three primary evaluation methodologies each have different strengths, costs, and appropriate use cases. Most production teams use a combination of all three, applying each method where it delivers the best quality signal per dollar spent.

1. Automated Metrics

Automated evals use deterministic or statistical measures to score outputs without any human or LLM involvement. They are the cheapest and fastest evaluation method, running in milliseconds and costing nothing beyond compute time. Common automated metrics include:

  • Exact match / contains: Does the output contain a required string, match an expected value, or follow a regex pattern? Ideal for classification, extraction, and structured output tasks.
  • BLEU / ROUGE / METEOR: N-gram overlap metrics that compare generated text to reference text. ROUGE-L (longest common subsequence) is widely used for summarization. These metrics correlate moderately with human judgment — a ROUGE-L score above 0.45 typically indicates acceptable summarization quality.
  • Semantic similarity: Embed the output and the reference with an embedding model, then compute cosine similarity. Scores above 0.85 generally indicate semantically equivalent content. Costs ~$0.0001 per comparison using text-embedding-3-small.
  • Code execution: For code generation tasks, run the generated code against a test suite. Pass/fail is the most rigorous eval possible for code — if the tests pass, the code works.
  • JSON schema validation: For structured output, validate against a JSON schema. This catches formatting regressions instantly and costs nothing to run.
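The semantic-similarity check above reduces to a cosine computation over two embedding vectors. Here is a minimal sketch; in practice the vectors would come from an embedding API such as text-embedding-3-small, and the 0.85 threshold is the rule of thumb cited above:

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pass/fail wrapper using the ~0.85 rule of thumb from the text above.
function semanticallyEquivalent(a: number[], b: number[], threshold = 0.85): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```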

2. Human Review

Human evaluation remains the gold standard for subjective quality dimensions like tone, helpfulness, creativity, and nuanced accuracy. Human reviewers can catch subtle errors that automated metrics miss — a factually accurate summary that buries the most important point, a technically correct response that uses an inappropriate tone for the audience, or a code snippet that works but follows anti-patterns.

Human review is expensive and slow. A trained annotator reviewing LLM outputs typically costs $25–$60 per hour and can evaluate 20–40 outputs per hour, depending on complexity. That translates to $0.60–$3.00 per evaluation. At 500 test cases, a single human eval run costs $300–$1,500 and takes 12–25 hours. For this reason, human review is typically reserved for high-stakes evaluations: validating a major model switch, calibrating LLM-as-judge prompts, or auditing safety-critical outputs.

3. LLM-as-Judge

LLM-as-judge uses a secondary language model to evaluate the outputs of the primary model. A strong frontier model (Claude 3.5 Sonnet, GPT-4o) is given the original prompt, the model's output, and a detailed rubric, then asked to score the output on each quality dimension. This approach sits between automated metrics and human review in both cost and quality:

| Method | Cost per Evaluation | Speed | Quality Signal | Best For |
| --- | --- | --- | --- | --- |
| Automated metrics | $0.00–$0.0001 | <100ms | Good for objective criteria | Formatting, extraction, classification |
| LLM-as-judge | $0.003–$0.02 | 2–10 seconds | Good for subjective criteria | Tone, helpfulness, accuracy, coherence |
| Human review | $0.60–$3.00 | 2–5 minutes | Gold standard | Calibration, safety, high-stakes decisions |

The cost comparison is striking. Evaluating 500 test cases across 5 quality dimensions:

  • Automated metrics: $0 + a few seconds of compute = essentially free
  • LLM-as-judge (GPT-4o): 500 cases × 5 dimensions × ~800 tokens per judgment × $10/MTok output = ~$20
  • LLM-as-judge (GPT-4o mini): Same volume × $0.60/MTok output = ~$1.20
  • Human review: 2,500 judgments at $1.50 each = ~$3,750

LLM-as-judge has become the workhorse method for production evals because it provides subjective quality assessment at 100–1,000x lower cost than human review. Research from Anthropic, OpenAI, and academic labs shows that frontier LLM judges agree with human annotators 80–90% of the time on most quality dimensions — comparable to inter-annotator agreement among humans themselves (typically 75–90%). The primary risks are position bias (LLMs tend to favor the first option in pairwise comparisons), self-preference bias (a model tends to rate its own outputs higher), and rubric sensitivity (small changes in the judge prompt can shift scores significantly). Mitigate these by randomizing option order, using a different model family as judge than the model being evaluated, and calibrating judge prompts against human annotations.
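The position-bias mitigation above (randomizing option order in pairwise comparisons) can be sketched as follows. The judge is injected as a plain function here; in production it would wrap an API call, and all names are illustrative:

```typescript
// Position-bias mitigation sketch for pairwise LLM-as-judge comparisons.
type Verdict = "first" | "second";
type Judge = (optionA: string, optionB: string) => Verdict;

// Randomize presentation order, then map the verdict back, so the judge's
// answer cannot systematically favor whichever output happens to come first.
function judgePair(
  baseline: string,
  candidate: string,
  judge: Judge,
  rng: () => number = Math.random,
): "baseline" | "candidate" {
  const swapped = rng() < 0.5;
  const [a, b] = swapped ? [candidate, baseline] : [baseline, candidate];
  const verdict = judge(a, b);
  const firstIsBaseline = !swapped;
  if (verdict === "first") return firstIsBaseline ? "baseline" : "candidate";
  return firstIsBaseline ? "candidate" : "baseline";
}
```

Because the mapping back to "baseline"/"candidate" is deterministic, the same judge preference yields the same result regardless of which order the two outputs were shown in.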

The Cost of Running Evals

Evals are not free. They consume tokens, compute time, and — for human evaluation — significant labor costs. Understanding and budgeting for eval costs is essential for building a sustainable quality assurance practice that does not itself become a runaway expense line.

Eval spend as a percentage of production spend. Industry benchmarks suggest that eval costs should run between 5% and 20% of production AI spend, depending on the risk profile of your application. A consumer chatbot with low stakes per interaction might allocate 5%. A medical information system or financial advisory tool should allocate 15–20% because the cost of a quality regression is much higher. Here is how this breaks down in practice:

| Monthly AI Spend | Eval Budget (5–20%) | What That Buys |
| --- | --- | --- |
| $1,000 | $50–$200 | Daily automated evals + weekly LLM-as-judge on 100 cases |
| $5,000 | $250–$1,000 | Daily LLM-as-judge on 500 cases + monthly human review of 200 cases |
| $25,000 | $1,250–$5,000 | Continuous LLM-as-judge + weekly human review + A/B eval framework |
| $100,000 | $5,000–$20,000 | Full eval pipeline with automated, LLM-judge, and human layers + custom benchmark development |

Hidden costs of evaluation. The direct token cost of running LLM-as-judge evaluations is only part of the picture. Factor in:

  • Dataset creation and maintenance: Building a high-quality eval dataset takes 20–80 hours of expert time, depending on domain complexity. The dataset needs regular updates as your product evolves — new features mean new test cases, changed requirements mean updated rubrics. Budget 2–5 hours per month for dataset maintenance.
  • Judge prompt engineering: Writing evaluation rubrics that produce consistent, calibrated scores is an art. Expect 10–20 hours to develop and calibrate a judge prompt for each quality dimension. Poorly calibrated judges produce noisy scores that do not correlate with real quality differences.
  • Infrastructure: Running evals at scale requires orchestration — parallel API calls, result storage, score aggregation, regression detection, and alerting. Tools like Braintrust, Promptfoo, and Evalkit provide this infrastructure, typically at $100–$500/month. Open-source alternatives (OpenAI Evals, Ragas, DeepEval) are free but require self-hosting.
  • Regeneration costs: Each eval run requires re-generating outputs from your production model configuration. If you are evaluating 500 test cases with a model that averages 1,200 tokens per response at $10/MTok output, that is $6 per regeneration run just for the outputs — before the judge even sees them. For daily evals, that is $180/month in regeneration costs alone.

Strategies to reduce eval costs without reducing quality:

  1. Tiered evaluation. Run cheap automated metrics on every deployment (free). Run LLM-as-judge on daily builds (~$5–$20/run). Run human review monthly or on major changes ($500–$2,000/run). This gives you continuous coverage at low cost with periodic deep validation.
  2. Sampling. You do not need to evaluate every test case every time. A random sample of 100–200 cases provides statistically meaningful quality estimates with 95% confidence intervals of ±3–5%. Running 200 cases instead of 2,000 cuts eval cost by 90%.
  3. Use cheaper judge models. GPT-4o mini and Claude 3.5 Haiku produce LLM-as-judge scores that correlate 85–92% with frontier model judges at 5–15x lower cost. For routine daily evals, a cheaper judge is sufficient. Reserve frontier judges for high-stakes evaluations.
  4. Cache eval inputs. If your eval dataset is stable, cache the model outputs from previous runs. When only the system prompt changes, you only need to regenerate outputs — you can reuse the same judge prompts and scoring infrastructure.
  5. Invest in automated metrics first. Every quality dimension that can be captured by an automated metric (JSON validity, required field presence, output length, regex patterns) should be. These cost nothing to run and catch the most common regressions. LLM-as-judge should focus on dimensions that genuinely require subjective judgment.
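The sampling math behind strategy 2 follows from the normal-approximation confidence interval for a proportion. A sketch, with a hypothetical helper name:

```typescript
// 95% confidence interval (normal approximation) for a pass rate
// measured on a random sample of eval cases.
function passRateCI95(passes: number, sampleSize: number): { rate: number; margin: number } {
  const p = passes / sampleSize;
  // z = 1.96 for a 95% interval; margin shrinks with sqrt(sampleSize).
  const margin = 1.96 * Math.sqrt((p * (1 - p)) / sampleSize);
  return { rate: p, margin };
}
```

For example, 180 passes out of a 200-case sample gives 90% with a margin of roughly ±4 points, consistent with the ±3–5% range cited above; quadrupling the sample only halves the margin, which is why evaluating all 2,000 cases buys little extra precision.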

CostHawk tracks eval-related API costs alongside production costs, so you can monitor your eval budget as a percentage of total spend and ensure it stays within your target range.

Evals and Cost Optimization

The relationship between evals and cost optimization is symbiotic — evals enable safe cost reduction, and cost constraints motivate smarter eval design. Here are the five most common cost optimization scenarios where evals play a critical role:

1. Model downgrade validation. Switching from a frontier model to a cheaper alternative is the highest-leverage cost optimization available — potential savings of 50–95%. But it carries the highest quality risk. The eval workflow:

  1. Run your full eval suite against the current model to establish a quality baseline.
  2. Run the same eval suite against the candidate cheaper model with the same prompt and parameters.
  3. Compare scores dimension by dimension. If the cheaper model scores within 3–5% of the baseline on all critical dimensions, it is likely a safe swap.
  4. Run a shadow deployment where both models process production traffic but only the original model's output is served. Compare quality scores on real traffic for 3–7 days.
  5. If shadow results confirm eval suite findings, promote the cheaper model to production.

Example: A document summarization service migrating from Claude 3.5 Sonnet ($3/$15 per MTok) to Claude 3.5 Haiku ($0.80/$4 per MTok). Eval results show Haiku scores 93% vs Sonnet's 96% on accuracy, 91% vs 95% on completeness, and 97% vs 97% on formatting. The 3-point accuracy gap is acceptable for internal document summaries but might not be for customer-facing legal summaries. The eval data enables a nuanced decision: use Haiku for internal documents (saving 73%) and keep Sonnet for legal documents (maintaining quality). This targeted routing saves 58% overall while preserving quality where it matters most.
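The dimension-by-dimension comparison in step 3 might look like the following sketch (illustrative names; the scores in the usage example are the Sonnet/Haiku numbers from the text above):

```typescript
// Sketch: flag a model swap as safe only if every critical dimension
// stays within a tolerance of the baseline score (percentage points).
type Scores = Record<string, number>;

function isSafeSwap(
  baseline: Scores,
  candidate: Scores,
  criticalDimensions: string[],
  tolerance = 5,
): boolean {
  return criticalDimensions.every(
    (dim) => baseline[dim] - candidate[dim] <= tolerance,
  );
}

const sonnet = { accuracy: 96, completeness: 95, formatting: 97 };
const haiku = { accuracy: 93, completeness: 91, formatting: 97 };
const dims = ["accuracy", "completeness", "formatting"];
```

With the default 5-point tolerance the swap passes; tighten the tolerance to 3 points and the 4-point completeness gap blocks it, which is exactly the kind of per-dimension decision the prose describes.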

2. Prompt compression validation. Shortening system prompts reduces input token costs on every request. A 1,500-token system prompt sent with 50,000 daily requests at $2.50/MTok input costs $187.50/day. Compressing it to 800 tokens saves $87.50/day ($2,625/month). But shorter prompts may lose critical instructions. Evals quantify the impact: run the eval suite with both the original and compressed prompt, compare quality scores, and promote the shorter prompt only if scores are maintained.
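The arithmetic in this scenario can be captured in a small helper (hypothetical function, assuming a flat per-MTok input price):

```typescript
// Daily cost of a system prompt that is sent with every request.
function dailyPromptCost(
  promptTokens: number,
  requestsPerDay: number,
  inputPricePerMTok: number,
): number {
  return (promptTokens * requestsPerDay * inputPricePerMTok) / 1_000_000;
}

// The numbers from the text above: 1,500 -> 800 tokens at $2.50/MTok.
const before = dailyPromptCost(1500, 50_000, 2.5); // 187.5 ($/day)
const after = dailyPromptCost(800, 50_000, 2.5);   // 100 ($/day)
const monthlySavings = (before - after) * 30;      // 2625 ($/month)
```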

3. Prompt caching ROI measurement. Anthropic's prompt caching charges a 25% premium on the first request (to write to cache) but gives a 90% discount on subsequent cache hits. OpenAI's caching gives a 50% discount with no write premium. Whether caching saves money depends on your cache hit rate, which depends on how many requests share the same prompt prefix. Evals confirm that cached responses maintain quality — theoretically they should be identical, but implementation bugs in caching logic can introduce subtle issues.

4. Temperature and parameter tuning. Reducing temperature from 1.0 to 0.2 produces more deterministic outputs that are often shorter (fewer tokens = lower cost) but may sacrifice creativity. Evals measure the tradeoff: at temperature 0.2, does the model still produce sufficiently varied and useful outputs for your use case? For classification and extraction tasks, low temperature almost always maintains quality while reducing output tokens by 10–25%. For creative writing or brainstorming, the quality loss is measurable and significant.

5. Batch API migration. OpenAI's Batch API offers a 50% discount but returns results within 24 hours instead of seconds. Anthropic's Message Batches API offers similar savings. Evals verify that batch-processed outputs are identical in quality to real-time outputs — they should be, since the same model processes them, but operational differences (timeout handling, retry logic, error recovery) can introduce discrepancies. Run your eval suite on a sample of batch outputs before migrating latency-tolerant workloads.

The common thread across all five scenarios is that evals transform cost optimization from guesswork into engineering. Instead of hoping a change does not break quality, you measure it. Instead of debating whether the cheaper model is "good enough," you have data showing exactly where it falls short and by how much. This data-driven approach is what separates teams that successfully reduce AI costs by 40–60% from teams that bounce between models, reverting changes after quality complaints, never confident in their optimization decisions.

Building an Eval Pipeline

A production eval pipeline is infrastructure — it requires the same engineering rigor as your CI/CD pipeline, monitoring stack, or data pipeline. Here is a reference architecture for building an eval pipeline that supports continuous quality monitoring and cost optimization.

Step 1: Define your eval dataset.

Start with 100–500 test cases that represent the distribution of real production traffic. Each test case should include:

  • Input: The full prompt (system message + user message + any context) that will be sent to the model.
  • Expected output (optional): A reference answer for cases where one exists. Not all evals require reference answers — LLM-as-judge can evaluate quality without one.
  • Metadata: Tags for difficulty level, category, edge-case type, and any other dimensions useful for slicing results.
  • Quality criteria: What "good" looks like for this specific test case. This might be a rubric, a set of required elements, or a pass/fail condition.

Source test cases from three places: (1) curate examples from production logs that represent common request types, (2) hand-write edge cases that test known failure modes, and (3) synthetically generate diverse inputs using an LLM. A good dataset overrepresents edge cases and failure modes — production traffic is 80% easy cases that any model handles well, so your eval dataset should be 50%+ hard cases that differentiate model quality.
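One possible shape for these test-case records, with illustrative field names rather than any standard schema:

```typescript
// Hypothetical test-case record mirroring the four fields described above.
interface EvalTestCase {
  id: string;
  input: { system: string; user: string };
  expectedOutput?: string; // optional reference answer
  metadata: { category: string; difficulty: "easy" | "hard"; edgeCase: boolean };
  criteria: string; // what "good" looks like for this case
}

const example: EvalTestCase = {
  id: "summarize-042",
  input: {
    system: "You are a document summarizer. Output at most 200 words.",
    user: "Summarize the attached quarterly report...",
  },
  metadata: { category: "summarization", difficulty: "hard", edgeCase: true },
  criteria: "Covers all 5 key points; no invented figures; under 200 words",
};
```

Note that `expectedOutput` is optional: LLM-as-judge cases carry only `criteria`, while automated-metric cases typically carry a reference answer or schema.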

Step 2: Implement evaluation methods.

Layer three evaluation methods in order of cost and depth:

// Layer 1: Automated metrics (run on every evaluation)
function automatedEval(output: string, testCase: TestCase): AutomatedScore {
  return {
    jsonValid: isValidJSON(output),
    containsRequiredFields: checkRequiredFields(output, testCase.schema),
    outputTokenCount: countTokens(output),
    withinLengthLimit: countTokens(output) <= testCase.maxTokens,
    regexMatch: testCase.pattern ? testCase.pattern.test(output) : null,
  }
}

// Layer 2: LLM-as-judge (run daily or on deployments)
async function llmJudge(input: string, output: string, rubric: string): Promise<JudgeScore> {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: `Evaluate the following AI response.\n\n` +
        `USER QUERY: ${input}\n\nAI RESPONSE: ${output}\n\n` +
        `RUBRIC: ${rubric}\n\n` +
        `Score each dimension 1-5 and explain your reasoning. ` +
        `Return valid JSON: {"accuracy": {"score": N, "reason": "..."}, ...}`
    }]
  })
  const block = response.content[0]
  if (block.type !== "text") throw new Error("unexpected content block type")
  return JSON.parse(block.text)
}

// Layer 3: Human review (run monthly or on major changes)
function queueForHumanReview(testCase: TestCase, output: string): void {
  reviewQueue.push({ testCase, output, assignedTo: getNextReviewer() })
}

Step 3: Build the orchestration layer.

The orchestration layer manages the lifecycle of an eval run: generate outputs, apply evaluation methods, aggregate scores, detect regressions, and report results. Key requirements:

  • Parallel execution: Generate outputs and run LLM-as-judge evaluations in parallel across test cases, respecting API rate limits. A 500-case eval that runs serially takes 30+ minutes; with 50 parallel requests, it finishes in 3–5 minutes.
  • Result storage: Store every eval run's inputs, outputs, scores, and metadata in a database. Historical data enables trend analysis ("is quality improving or degrading over time?") and regression detection ("this deployment dropped accuracy by 4 points").
  • Version tracking: Record the exact model, system prompt, temperature, and tools configuration for each run. You must be able to reproduce any eval run and trace quality changes to specific configuration changes.
  • Cost tracking: Record the token cost of each eval run. CostHawk can tag eval-related API calls separately from production traffic, making it easy to track your eval budget.

Step 4: Integrate with CI/CD.

The highest-value integration is running evals automatically on pull requests that change prompts, model configurations, or LLM-related code. This catches quality regressions before they merge, the same way unit tests catch code bugs:

# .github/workflows/eval.yml
name: LLM Eval Suite
on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/config/models.ts'
      - 'src/lib/llm/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run eval:automated  # Layer 1: free, fast
      - run: npm run eval:llm-judge  # Layer 2: ~$5-20, 5 min
      - run: npm run eval:compare-baseline  # Fails if regression detected

Step 5: Set quality gates.

Define minimum acceptable scores for each quality dimension and block deployments that fall below them. For example:

  • Accuracy: minimum 90% (block deployment if below)
  • Formatting: minimum 98% (block deployment if below)
  • Completeness: minimum 85% (warn but allow deployment)
  • Conciseness: advisory only (track but do not block)

Quality gates should be strict enough to catch real regressions but lenient enough to accommodate normal score variance (±2–3 points between runs due to LLM non-determinism). Use a rolling 3-run average rather than a single run's score to reduce false alarms from statistical noise.
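A rolling-average gate of this kind might be implemented as follows (illustrative sketch):

```typescript
// Gate on the average of the last `window` eval runs rather than a single
// run, so normal run-to-run variance does not trigger false blocks.
function gateDecision(
  recentScores: number[], // newest run last
  minimum: number,
  window = 3,
): "pass" | "block" {
  const recent = recentScores.slice(-window);
  const avg = recent.reduce((sum, x) => sum + x, 0) / recent.length;
  return avg >= minimum ? "pass" : "block";
}
```

A single noisy 89 among 91 and 92 still passes a 90-point gate, while three consecutive sub-90 runs block the deployment.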

Continuous Evaluation

Running evals once before launch is necessary but insufficient. Production AI systems drift — models receive silent updates, user behavior shifts, data distributions change, and accumulated prompt tweaks compound in unexpected ways. Continuous evaluation catches these slow regressions that one-time assessments miss.

Online evaluation vs offline evaluation. Offline evals (the pipeline described above) test the model against a fixed dataset in a controlled environment. Online evals measure quality on live production traffic. Both are essential:

| Dimension | Offline Evals | Online Evals |
| --- | --- | --- |
| When they run | Pre-deployment, on a schedule, or in CI/CD | Continuously on production traffic |
| What they test | Known test cases with controlled inputs | Real user requests with real-world distribution |
| Coverage | Only as good as your test dataset | Covers edge cases you never anticipated |
| Speed of detection | Catches regressions before deployment | Catches regressions that offline evals miss |
| Cost | Fixed per run ($5–$50 typical) | Proportional to traffic (1–5% sampling) |
| Best for | Model switches, prompt changes, configuration updates | Model drift, distribution shift, emergent failures |

Implementing online evaluation. The most practical approach is to sample 1–5% of production requests and run them through LLM-as-judge asynchronously. This does not add latency to the user-facing request and keeps costs proportional to traffic. For a system handling 100,000 requests/day, sampling 2% means evaluating 2,000 requests daily. At $0.01 per LLM-as-judge evaluation, that is $20/day ($600/month) for continuous quality monitoring.

Online eval architecture:

// In your request handler — sample and evaluate asynchronously
async function handleLLMRequest(userInput: string): Promise<string> {
  const response = await callLLM(userInput)
  
  // Sample 2% of requests for online evaluation
  if (Math.random() < 0.02) {
    // Fire-and-forget: do not block the response
    evaluateAsync({
      input: userInput,
      output: response,
      model: currentModelConfig.model,
      timestamp: Date.now(),
      requestId: generateRequestId()
    }).catch(err => logger.warn('Online eval failed', err))
  }
  
  return response
}

Detecting model drift. LLM providers update their models without notice. OpenAI has acknowledged updating GPT-4 and GPT-3.5-turbo between major version releases. These silent updates can change output quality, length, formatting, and behavior in subtle ways. Continuous evaluation is the only reliable way to detect these changes. Track your online eval scores on a 7-day rolling average. If the average drops by more than 2 standard deviations from its 30-day baseline, trigger an alert. This catches both sudden regressions (a model update that breaks a specific behavior) and gradual drift (slowly declining quality over weeks).
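The alert rule described here (recent average more than 2 standard deviations below the baseline) can be sketched as:

```typescript
// Drift alert: compare a recent rolling average (e.g. 7-day) against the
// mean and standard deviation of a longer baseline (e.g. 30 daily scores).
function driftAlert(baselineScores: number[], recentAverage: number): boolean {
  const n = baselineScores.length;
  const mean = baselineScores.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    baselineScores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  return recentAverage < mean - 2 * std;
}
```

On a stable baseline hovering around 90 with sub-point variance, a drop to 88 fires the alert while 89.5 does not, so the rule catches real shifts without paging on ordinary noise.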

User feedback as an eval signal. Implicit and explicit user feedback provides the most direct quality signal available — it measures what actually matters to your users, not what your rubric says should matter. Integrate feedback mechanisms into your eval pipeline:

  • Explicit feedback: Thumbs up/down, 1–5 star ratings, "Was this helpful?" buttons. Even a 5–10% response rate provides valuable signal at scale.
  • Implicit feedback: Regeneration rate (user clicked "try again"), edit rate (user modified the AI output), copy rate (user copied the output — likely indicates satisfaction), session abandonment (user left after receiving the response — possible dissatisfaction).
  • Escalation signals: User contacted support after an AI interaction, user switched from AI-assisted to manual workflow, user explicitly reported an incorrect AI response.

Feed these signals back into your eval pipeline. If users consistently rate responses from a particular model or prompt configuration lower than expected, your offline evals may have a blind spot. Use negative user feedback to generate new test cases that cover the failure modes your existing eval dataset misses.

Eval-driven cost optimization loop. The most mature AI cost management practices combine continuous evaluation with continuous optimization in a closed-loop system:

  1. Monitor: CostHawk tracks per-request costs, model utilization, and spend trends in real time.
  2. Hypothesize: Cost analytics reveal optimization opportunities — "60% of requests go to GPT-4o but 45% of those are simple classification tasks that a cheaper model could handle."
  3. Test: Route a shadow sample of those classification requests to GPT-4o mini. Run evals comparing quality.
  4. Validate: Evals show GPT-4o mini scores 96% vs GPT-4o's 97% on classification accuracy. The 1-point gap is within noise.
  5. Deploy: Route classification requests to GPT-4o mini. Monitor online eval scores for regression.
  6. Measure: CostHawk confirms the optimization saved $4,200/month with no quality degradation in online evals.
  7. Repeat: Move to the next optimization opportunity identified by CostHawk's analytics.

This loop turns cost optimization from a periodic exercise ("let's review our AI costs this quarter") into a continuous engineering practice with measurable, data-backed improvements week over week.

FAQ

Frequently Asked Questions

How many test cases do I need in my eval dataset?
For most production applications, 200–500 test cases provide a statistically meaningful quality signal. With 200 cases, your accuracy estimates have a 95% confidence interval of roughly ±5–7 percentage points — sufficient to detect a model switch that drops accuracy from 95% to 88%, but too coarse to detect a 2-point regression. With 500 cases, the confidence interval narrows to ±3–4 points, catching smaller regressions. Beyond 1,000 cases, the marginal improvement in statistical precision diminishes rapidly, so most teams see diminishing returns past that threshold. The more important consideration is dataset quality rather than quantity — 200 carefully curated cases covering edge cases, failure modes, and representative production traffic patterns will outperform 2,000 randomly sampled cases that are 90% easy examples the model always gets right. Start with 100 hand-curated cases that stress-test known weaknesses, then gradually add production examples. CostHawk can help you identify which request patterns are most expensive, which is a useful heuristic for prioritizing eval coverage — optimizing your highest-cost request types delivers the biggest savings.
How much does it cost to run an LLM-as-judge evaluation?
The cost depends on your judge model, dataset size, and the number of quality dimensions. A typical evaluation of 500 test cases across 5 quality dimensions using GPT-4o as the judge costs approximately $15–$25 per run. Here is the math: each judgment requires the judge to read the input (~300 tokens), the output (~400 tokens), the rubric (~500 tokens), and generate a scored assessment (~300 tokens output). That is roughly 1,200 input tokens and 300 output tokens per judgment. At GPT-4o rates ($2.50/$10.00 per MTok), one judgment costs about $0.006. Multiply by 500 cases and 5 dimensions: 2,500 judgments × $0.006 = $15. Using GPT-4o mini as the judge drops the cost to approximately $1.20 for the same volume. Using Claude 3.5 Haiku costs roughly $2.50. For daily evals, the monthly cost ranges from $36 (GPT-4o mini daily) to $750 (GPT-4o daily). Most teams use a cheaper judge model for daily monitoring and reserve the frontier judge for weekly or pre-deployment validations, keeping monthly eval costs under $200.
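The arithmetic above can be expressed as a small cost model. The rates and token counts are the illustrative figures from the answer, not universal constants:

```python
# Back-of-envelope cost model for an LLM-as-judge run, using the token
# counts and GPT-4o rates quoted above ($2.50 input / $10.00 output per MTok).
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def judgment_cost(input_tokens: int = 1_200, output_tokens: int = 300) -> float:
    """Cost of a single judge call: input + rubric + output tokens."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

cases, dimensions = 500, 5
run_cost = judgment_cost() * cases * dimensions
print(f"${judgment_cost():.4f} per judgment, ${run_cost:.2f} per run")
# → $0.0060 per judgment, $15.00 per run
```

Swapping in the rates for a cheaper judge model is a two-line change, which makes it easy to compare judge options before committing to a daily cadence.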
Should my judge be the same model I am evaluating?
No — using the same model as both the system under test and the evaluator introduces self-preference bias. Research from multiple labs shows that LLMs rate their own outputs 5–15% higher than outputs from other models of comparable quality. GPT-4o rating GPT-4o outputs will be systematically more generous than GPT-4o rating Claude outputs, and vice versa. The best practice is to use a different model family as your judge. If you are evaluating GPT-4o outputs, use Claude 3.5 Sonnet as the judge. If you are evaluating Claude outputs, use GPT-4o as the judge. This cross-family approach minimizes bias and produces scores that correlate more closely with human judgments. If budget constraints require using the same provider, at minimum use a different model within the family (for example, judge GPT-4o outputs with GPT-4o mini — the bias is smaller across model sizes than across identical models). Another mitigation is pairwise comparison rather than absolute scoring: show the judge two outputs (from the old and new configuration) without revealing which is which, and ask it to choose the better one. This relative comparison is less susceptible to self-preference bias than absolute scoring.
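A minimal sketch of the pairwise-comparison mitigation, assuming a hypothetical rubric prompt. The key detail is randomizing which configuration appears in which position, so position bias averages out across the eval set:

```python
import random

# Hypothetical judge prompt template; real rubrics are more detailed.
PAIRWISE_TEMPLATE = """You are judging two responses to the same input.
Input: {input}

Response A: {a}

Response B: {b}

Which response better satisfies the rubric? Answer with exactly "A" or "B"."""

def build_pairwise_prompt(user_input: str, old_output: str, new_output: str,
                          rng=random):
    """Randomly assign the old and new configurations to positions A and B,
    returning the prompt plus a mapping from label back to configuration."""
    if rng.random() < 0.5:
        a, b, mapping = old_output, new_output, {"A": "old", "B": "new"}
    else:
        a, b, mapping = new_output, old_output, {"A": "new", "B": "old"}
    return PAIRWISE_TEMPLATE.format(input=user_input, a=a, b=b), mapping
```

After the judge answers "A" or "B", the mapping translates the verdict back to "old" or "new", and win rates are tallied across the dataset.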
How do evals relate to A/B testing?
Evals and A/B tests are complementary but serve different purposes. Evals measure output quality against a rubric in a controlled setting — they tell you whether the model's outputs meet your quality criteria. A/B tests measure user behavior in production — they tell you whether users prefer one model configuration over another, as measured by engagement, conversion, satisfaction, or other business metrics. A typical optimization workflow uses both: first, run evals to confirm that the new configuration meets minimum quality thresholds (this is cheap and fast, taking minutes). Then, if evals pass, run an A/B test to validate that real users do not notice a quality difference (this is slower, taking days or weeks, but provides the strongest evidence). Evals are a gate that prevents obviously degraded configurations from reaching A/B testing, saving you the cost and risk of exposing users to a bad experience. Think of evals as your staging environment validation and A/B tests as your production validation. CostHawk can segment cost analytics by A/B test variant, showing you not just which variant users prefer but which one costs less per successful user interaction — the combined metric that matters most for optimization decisions.
What is the difference between evals and benchmarks?
Benchmarks are standardized, public evaluation datasets designed to compare models against each other on general capabilities — MMLU tests broad knowledge, HumanEval tests code generation, GSM8K tests math reasoning, and HELM provides a holistic comparison framework. They are created by researchers and used primarily by model developers to demonstrate progress and by consumers to compare models before adoption. Evals, in the production context, are custom evaluation suites designed to test a specific model configuration on a specific application's requirements. While a benchmark might ask "Can this model solve differential equations?" your eval asks "Does this model correctly extract invoice line items from our customers' PDF formats with our current system prompt?" Benchmarks are useful for initial model selection — they help you narrow the field from dozens of models to two or three candidates. Evals take over from there, providing the application-specific quality measurement that benchmarks cannot. A model that tops the MMLU leaderboard may perform poorly on your eval if your use case requires capabilities that MMLU does not measure (domain-specific jargon, particular output formats, nuanced tone requirements). Always treat benchmark scores as a starting point for exploration, not a definitive quality guarantee.
How often should I run my eval suite?
The right cadence depends on how frequently your LLM configuration changes and how sensitive your application is to quality regressions. At minimum, run evals on every change to prompts, model selections, or LLM-related code — integrate automated metrics and LLM-as-judge into your CI/CD pipeline so that pull requests cannot merge without passing quality gates. Beyond change-triggered runs, schedule recurring evals to catch model drift from provider-side updates: daily for high-stakes applications (medical, legal, financial), weekly for standard production workloads, and monthly for low-risk internal tools. The cost of daily LLM-as-judge runs on 200 test cases is approximately $3–$10/day using GPT-4o or $0.30–$1.00/day using GPT-4o mini — negligible relative to the cost of shipping a quality regression. Additionally, implement online evaluation that continuously samples 1–5% of production traffic for async quality assessment. This catches distribution shifts and edge cases that your static eval dataset does not cover. The key principle: the cost of running evals too frequently is dollars; the cost of running them too infrequently is degraded user experience and lost trust. Err on the side of more frequent evaluation.
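The 1–5% online sampling can be done deterministically by hashing the request ID rather than rolling a random number, a sketch with illustrative names:

```python
import hashlib

def in_online_eval_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select a fraction of production traffic for async
    quality assessment. Hashing the request ID (instead of random()) makes
    the decision reproducible across retries and log replays."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < rate

# Over many requests, the sampled share converges on the configured rate.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(in_online_eval_sample(r) for r in ids) / len(ids)
print(f"{share:.1%} of requests sampled")
```

Deterministic sampling also means a request flagged for evaluation stays flagged if it is reprocessed, which keeps the sampled set stable for auditing.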
Can I use evals to decide which model to route each request to?
Yes — this is the foundation of eval-driven model routing, one of the most effective cost optimization strategies available. The approach works in two phases. In the offline phase, you run your eval suite against every candidate model (for example, GPT-4o, GPT-4o mini, Claude 3.5 Haiku, Gemini 2.0 Flash) and categorize test cases by which cheaper models pass quality thresholds. You might discover that GPT-4o mini handles 70% of your request types within 3% of GPT-4o's quality score, Claude Haiku handles another 15%, and only 15% of requests truly require a frontier model. In the online phase, you build a classifier (often itself a lightweight LLM call or a rule-based system) that examines each incoming request and routes it to the cheapest model expected to produce acceptable quality. The eval suite validates the router's decisions: periodically route a sample of requests to all models, evaluate the outputs, and confirm that the routing logic is making correct decisions. If the router is sending requests to GPT-4o mini that score poorly, tighten the routing criteria. CostHawk's per-model cost analytics combined with eval quality scores provide the data needed to continuously tune the routing logic — you can see exactly how much each routing decision saves and whether it maintains quality.
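The two-phase routing idea can be sketched as a lookup table of offline eval scores plus a cheapest-passing-model rule. The model names, scores, and prices below are illustrative, not measured:

```python
# Offline eval pass rates per (request category, model) — illustrative numbers.
EVAL_SCORES = {
    ("classification", "gpt-4o-mini"): 0.96,
    ("classification", "gpt-4o"): 0.97,
    ("synthesis", "gpt-4o-mini"): 0.81,
    ("synthesis", "gpt-4o"): 0.95,
}
COST_PER_MTOK = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}  # input rates, illustrative
QUALITY_THRESHOLD = 0.93

def route(category: str) -> str:
    """Return the cheapest model whose offline eval score clears the
    quality threshold for this category; fall back to the most
    expensive model if none do."""
    candidates = sorted(COST_PER_MTOK, key=COST_PER_MTOK.get)
    for model in candidates:
        if EVAL_SCORES.get((category, model), 0.0) >= QUALITY_THRESHOLD:
            return model
    return candidates[-1]

print(route("classification"))  # → gpt-4o-mini
print(route("synthesis"))       # → gpt-4o
```

In production, the category would come from a lightweight classifier on the incoming request, and the score table would be refreshed from periodic shadow-sampling runs.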
What tools should I use to build an eval pipeline?
The eval tooling landscape has matured significantly. For most teams, the best approach is to combine a dedicated eval framework with your existing CI/CD and monitoring infrastructure. Braintrust is a leading commercial platform that provides dataset management, LLM-as-judge orchestration, score tracking, and CI/CD integration out of the box — plans start free for small teams and scale to $500+/month for enterprise features. Promptfoo is an open-source CLI tool that supports automated metrics, LLM-as-judge, and human review with a configuration-driven approach — excellent for teams that want eval-as-code in their repository. OpenAI Evals is OpenAI's open-source framework, good if you are primarily evaluating OpenAI models but less flexible for multi-provider setups. Ragas specializes in RAG (retrieval-augmented generation) evaluation with metrics like faithfulness, answer relevancy, and context precision. DeepEval is a pytest-based framework that integrates naturally with Python test suites. For teams that prefer to build custom, the core components are: (1) a test runner that calls your LLM API in parallel, (2) an evaluation layer that applies metrics and LLM-as-judge, (3) a storage layer for results (PostgreSQL works fine), and (4) a comparison tool that detects regressions against baselines. Pair any of these with CostHawk to track the cost of both your eval runs and the production workloads they validate.
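As a sketch of the regression-detection component (item 4 above), assuming hypothetical metric names and an illustrative tolerance:

```python
# Compare a new eval run's per-metric scores against a stored baseline and
# flag any metric that dropped by more than a tolerance.
def detect_regressions(baseline: dict, current: dict,
                       tolerance: float = 0.02) -> list:
    """Return (metric, baseline_score, current_score) triples for every
    metric whose score dropped by more than `tolerance`."""
    return [
        (metric, baseline[metric], score)
        for metric, score in current.items()
        if metric in baseline and baseline[metric] - score > tolerance
    ]

baseline = {"accuracy": 0.95, "faithfulness": 0.91, "format_valid": 1.00}
current = {"accuracy": 0.94, "faithfulness": 0.86, "format_valid": 1.00}
print(detect_regressions(baseline, current))
# → [('faithfulness', 0.91, 0.86)]
```

Wiring a check like this into CI, failing the build when the list is non-empty, is what turns an eval suite into a merge gate rather than a dashboard.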

Related Terms

Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Read more

Fine-Tuning

The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.

Read more

Cost Per Query

The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.

Read more

Batch API

Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.

Read more

LLM Observability

The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.

Read more

Prompt Engineering

The practice of designing, structuring, and iterating on the text inputs (prompts) sent to large language models to elicit desired outputs. Prompt engineering directly affects AI API costs through two mechanisms: the token count of the prompt itself (input cost) and the length and quality of the model's response (output cost). A well-engineered prompt can reduce total per-request cost by 40–70% compared to a naive prompt while maintaining or improving output quality.

Read more

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.