LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Why It Matters for AI Costs
AI API spend is one of the fastest-growing line items in modern engineering budgets, and it is uniquely difficult to predict and control. Traditional infrastructure costs are relatively stable: a database server costs the same whether it processes 1,000 or 10,000 queries (up to its capacity limit). LLM costs, by contrast, scale linearly with every token processed — and token consumption is driven by factors that are largely invisible without dedicated observability.
Consider what can go wrong without LLM observability:
- Silent cost explosions. A developer adds a new feature that includes full conversation history in every request. Token consumption doubles overnight. Without per-request token tracking, the team does not notice until the monthly invoice arrives — $15,000 over budget.
- Quality regressions. A prompt change that was tested on 10 examples degrades output quality for an edge case that affects 5% of production traffic. Without quality monitoring, users churn before anyone notices.
- Model routing failures. A load balancer is supposed to route simple requests to GPT-4o mini and complex ones to Claude 3.5 Sonnet, but a configuration error sends everything to Sonnet. The team pays 20x more than necessary for 60% of its traffic.
- Retry storms. A transient API error triggers aggressive retries that quadruple token consumption for 30 minutes. Without real-time spend monitoring, the incident burns through the daily budget before anyone intervenes.
LLM observability transforms AI spend from an opaque monthly bill into a real-time, actionable data stream. Teams with mature observability practices consistently spend 30-50% less than teams without it, simply because they can see where money is going and act on that information. CostHawk provides the observability layer that connects every API call to its cost, latency, and quality impact — giving you the visibility you need to run AI features like a well-managed production system, not a science experiment.
What is LLM Observability?
LLM observability is the comprehensive practice of collecting, correlating, and acting on telemetry data from every layer of an LLM-powered application. It draws from three foundational pillars of traditional observability — metrics, traces, and logs — but extends each with AI-specific dimensions that have no equivalent in conventional software monitoring.
Metrics in LLM observability go beyond request counts and error rates. The core metrics include:
- Token throughput: Input and output tokens per second, minute, and hour — broken down by model, endpoint, and user segment.
- Cost rate: Dollars per hour, day, or month — tracked in real time so teams can detect budget overruns as they happen, not after the invoice arrives.
- Latency distribution: Time-to-first-token (TTFT), end-to-end latency, and tokens-per-second generation speed — segmented by model and request complexity.
- Error rates: Rate limit hits (HTTP 429), server errors (HTTP 500/503), content filtering blocks, and malformed responses.
- Quality scores: Application-specific metrics like classification accuracy, hallucination rate, format compliance, and user satisfaction ratings.
Traces in LLM observability capture the full lifecycle of an AI request. A single user interaction might involve multiple LLM calls: a router call to classify intent, a retrieval step to fetch context from a vector database, a generation call to produce the response, and a guard call to check the output for safety. An LLM trace links all of these steps together, showing the total token consumption, latency, and cost for the entire chain — not just individual API calls.
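As a sketch, the multi-step interaction above could be modeled as a trace of spans, each carrying its own token, latency, and cost data. The dataclasses and field names here are illustrative, not any particular SDK's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in the chain: router, retrieval, generation, or guard."""
    name: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float

@dataclass
class Trace:
    """Links every step of one user interaction into a single record."""
    request_id: str
    spans: list[Span] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

    def total_latency_ms(self) -> int:
        return sum(s.latency_ms for s in self.spans)

trace = Trace("req-123", [
    Span("router",     320,  15,  410, 0.00009),
    Span("retrieval",    0,   0,  120, 0.0),      # vector DB call, no LLM cost
    Span("generate",  2100, 380, 2900, 0.00910),
    Span("guard",      400,   5,  350, 0.00012),
])
print(f"chain cost ${trace.total_cost():.5f} over {trace.total_latency_ms()} ms")
```

The point of the aggregation methods is exactly the claim in the text: the trace reports the cost and latency of the entire chain, not just one API call.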
Logs in LLM observability record the actual prompts and completions flowing through your system. This is essential for debugging quality issues ("why did the model say that?"), auditing for compliance, and building evaluation datasets for prompt optimization. However, logging LLM inputs and outputs introduces privacy and storage considerations that do not exist with traditional application logs — a single day of production LLM logs can easily exceed 100 GB for high-traffic applications.
The synthesis of these three pillars creates a unified observability layer that answers questions no single data source can address alone: Which model is giving us the best quality-per-dollar? Is our cost spike caused by more requests or longer prompts? Did last week's prompt change actually improve accuracy, or just make responses longer?
LLM observability is not optional for production AI applications. It is the foundational capability that makes every other optimization — model routing, prompt engineering, caching, budget enforcement — possible. Without it, you are making decisions based on intuition and monthly invoices. With it, you are making decisions based on real-time, per-request data.
LLM Observability vs Traditional Monitoring
Teams that have invested heavily in traditional APM tools like Datadog, New Relic, or Grafana often assume these tools are sufficient for monitoring LLM-powered applications. They are not. While traditional APM provides valuable infrastructure-level visibility, it is blind to the dimensions that matter most for AI applications. Here is a detailed comparison:
| Dimension | Traditional APM | LLM Observability |
|---|---|---|
| Cost tracking | Infrastructure costs (compute, storage, network) — relatively stable and predictable | Per-request token costs that vary by model, prompt length, and output length — highly variable and directly tied to application behavior |
| Latency analysis | HTTP response time as a single number | Decomposed into TTFT (time-to-first-token), generation speed (tokens/sec), and total completion time — each with different optimization levers |
| Error classification | HTTP status codes (4xx, 5xx) | Rate limits (429), content blocks, malformed JSON, hallucinations, refusals, truncated outputs — each requiring different remediation |
| Quality metrics | Not applicable — traditional APM does not evaluate response content | Accuracy, hallucination rate, format compliance, safety scores, relevance ratings — essential for maintaining user trust |
| Request analysis | URL, method, headers, payload size | Full prompt/completion text, token counts, model parameters (temperature, max_tokens), system prompt versions |
| Attribution | Service, endpoint, user | Model, provider, prompt version, API key, project tag, feature flag, user segment |
| Optimization guidance | Scale up, add caching, optimize queries | Switch models, shorten prompts, enable prompt caching, implement model routing, cap output length |
The fundamental difference is that traditional APM treats the AI API as a black box: it sees that a request went out and a response came back, with some latency and a status code. LLM observability opens that black box and examines what happened inside: how many tokens were consumed, what model processed them, how much it cost, and whether the output was actually good.
Consider a concrete scenario. Your AI-powered search feature starts receiving user complaints about slow responses. Traditional APM shows you that the /api/search endpoint P95 latency increased from 2 seconds to 8 seconds. That is useful, but it does not tell you why. LLM observability shows you that the average input token count increased from 1,200 to 4,800 because a recent feature change started including full document content instead of summaries in the search context. The latency increase is a direct consequence of the token increase — the model takes longer to process 4x more input. The fix is clear: revert to document summaries or implement chunked retrieval. Without LLM-specific observability, the team might have wasted days investigating infrastructure causes (Is the API provider slow? Is our network degraded?) when the root cause was purely application-level.
The most effective approach is to layer LLM observability on top of traditional APM. Use Datadog or Grafana for infrastructure health (CPU, memory, network, container metrics) and a purpose-built LLM observability tool like CostHawk for AI-specific telemetry (tokens, costs, quality, model performance). This gives you complete visibility across both dimensions.
Key Metrics to Track
Effective LLM observability requires tracking a specific set of metrics that collectively describe the health, cost, and quality of your AI features. Below is a comprehensive reference of the metrics every team should track, organized by category:
| Category | Metric | Description | Why It Matters |
|---|---|---|---|
| Cost | Cost per request | Dollar cost of each API call, computed from token counts and model pricing | The unit economic foundation for budgeting and pricing your product |
| Cost | Daily/monthly spend | Aggregate cost over time, broken down by model, endpoint, project, and key | Budget tracking and anomaly detection baseline |
| Cost | Cost per user action | Total AI cost attributed to a single user-facing action (which may involve multiple LLM calls) | Product economics — ensures AI features are profitable |
| Tokens | Input tokens per request | Average, P50, P95, and P99 input token counts | Identifies prompt bloat and context window inefficiency |
| Tokens | Output tokens per request | Average, P50, P95, and P99 output token counts | Detects verbose generation; output tokens cost 4-5x more than input |
| Tokens | Cache hit rate | Percentage of input tokens served from prompt cache vs computed fresh | Prompt caching can save 50-90% on input token costs |
| Latency | Time to first token (TTFT) | Time from request sent to first response token received | Directly impacts perceived responsiveness for streaming UIs |
| Latency | End-to-end latency | Total time from request to final token | Overall user experience metric; correlates with output length |
| Latency | Tokens per second | Output token generation speed | Bottleneck indicator; varies by model and provider load |
| Reliability | Error rate | Percentage of requests that return errors (429, 500, 503, timeouts) | Service availability; rate limits indicate capacity issues |
| Reliability | Retry rate | Percentage of requests that required one or more retries | Retries multiply token consumption and cost |
| Quality | Format compliance | Percentage of responses that match expected output schema (valid JSON, correct fields) | Downstream parsing failures if format compliance drops |
| Quality | Hallucination rate | Percentage of responses containing factually incorrect claims (sampled or automated) | User trust and safety; hardest quality metric to automate |
| Quality | User feedback score | Thumbs up/down or satisfaction rating from end users | Ground truth for whether AI output is actually useful |
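As an illustration of the percentile breakdowns the table calls for, the Python standard library is enough to summarize a batch of logged token counts. The sample values are made up:

```python
import statistics

def token_percentiles(values, ps=(50, 95, 99)):
    """P50/P95/P99 summary of logged token counts, per the table above."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in ps}

# One hour of logged input token counts (illustrative sample).
input_tokens = [900, 1100, 1150, 1200, 1250, 1280, 1300, 1350, 4800, 5000]
print(token_percentiles(input_tokens))
```

Note how the two outlier requests barely move the P50 but dominate the P95 and P99, which is why per-percentile tracking catches prompt bloat that averages hide.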
Not every team needs every metric from day one. The recommended adoption sequence is:
- Week 1: Cost per request, daily spend, input/output token counts, error rate. These give you basic cost visibility and reliability monitoring.
- Month 1: Add latency metrics (TTFT, end-to-end, tokens/sec), retry rate, and cost-per-user-action. These enable performance optimization and product economics analysis.
- Month 3: Add quality metrics (format compliance, hallucination sampling, user feedback). These require more instrumentation but are essential for long-term quality assurance.
CostHawk captures cost, token, latency, and reliability metrics automatically for every request routed through its proxy or synced via the MCP server. Quality metrics can be added through custom tags and the evaluation API.
Observability Tool Landscape
The LLM observability market has matured rapidly since 2024, with several specialized tools competing alongside extensions from traditional APM vendors. Here is a practical overview of the major players and how they compare:
Helicone is an open-source LLM observability platform that works as a proxy layer in front of your LLM API calls. You swap your API base URL to route through Helicone, and it logs every request with token counts, latency, costs, and the full prompt/completion text. Helicone excels at per-request logging and provides a clean UI for browsing individual requests. Its main strengths are the open-source codebase, easy integration (one-line base URL change), and a generous free tier. Its main limitations are less sophisticated cost analytics (no automated anomaly detection or budget enforcement) and a focus on individual request debugging rather than aggregate cost trends.
Langfuse is an open-source observability and analytics platform designed around the concept of "traces" — multi-step LLM workflows that chain multiple API calls together. Langfuse provides SDKs for Python and JavaScript that let you instrument complex agent workflows with nested spans, scoring, and metadata. Its strongest differentiator is trace-level analytics: you can see the total cost and latency of an entire multi-step agent execution, not just individual API calls. Langfuse integrates well with LangChain and LlamaIndex. Its main limitations are a more complex setup compared to proxy-based tools, and its self-hosted deployment requires significant infrastructure management.
Portkey is a commercial AI gateway that combines observability with operational features like load balancing, fallback routing, and caching. Portkey sits as a proxy between your application and LLM providers, providing real-time dashboards for cost, latency, and reliability. Its key differentiator is the operational layer: automatic failover when a provider goes down, intelligent load balancing across models, and built-in prompt caching. Portkey is strongest for teams running multi-provider, multi-model architectures that need both observability and resilience. Its main limitation is that it is a commercial product with per-request pricing that adds to your AI spend.
CostHawk is purpose-built for the cost and financial dimension of LLM observability. While other tools focus primarily on debugging and tracing, CostHawk centers on the question that matters most to engineering leaders and finance teams: how much are we spending, where is it going, and how do we reduce it? CostHawk provides real-time cost dashboards with per-model, per-key, per-project, and per-tag breakdowns. Its anomaly detection system flags unusual spending patterns within minutes, not days. Budget alerts notify you before you exceed thresholds. The wrapped-key proxy captures every request with zero code changes, and the MCP server syncs local development tool usage (Claude Code, Codex) that other observability tools miss entirely. CostHawk is strongest for teams that need to treat AI spend as a first-class financial metric — with budget enforcement, cost attribution, and savings recommendations that directly reduce your bill.
How these tools compare in practice:
| Feature | Helicone | Langfuse | Portkey | CostHawk |
|---|---|---|---|---|
| Integration method | Proxy (URL swap) | SDK instrumentation | Proxy (URL swap) | Proxy + MCP sync |
| Cost analytics depth | Basic (per-request cost) | Moderate (trace-level cost) | Good (real-time dashboards) | Deep (anomaly detection, budgets, forecasting, savings recs) |
| Trace support | Basic | Excellent (nested spans) | Moderate | Moderate |
| Local dev tool tracking | No | No | No | Yes (Claude Code, Codex sync) |
| Budget enforcement | No | No | Partial | Yes (alerts + hard limits) |
| Open source | Yes | Yes | No | MCP server is open source |
| Best for | Individual request debugging | Complex agent workflows | Multi-provider resilience | Cost management and optimization |
Many teams use a combination: CostHawk for cost management and Langfuse or Helicone for detailed request-level debugging. The tools are complementary, not mutually exclusive.
Implementing LLM Observability
Implementing LLM observability does not require a massive upfront investment. The most effective approach is incremental: start with basic cost and token tracking, then add layers of sophistication as your needs grow. Here is a practical implementation roadmap:
Phase 1: Basic telemetry (1-2 hours to set up)
The fastest path to observability is intercepting every LLM API call and recording the response metadata. Every major provider returns a usage object in the API response that includes prompt_tokens and completion_tokens. At minimum, log these fields for every request:
```json
{
  "timestamp": "2026-03-16T14:23:01Z",
  "model": "gpt-4o",
  "provider": "openai",
  "input_tokens": 1247,
  "output_tokens": 382,
  "latency_ms": 2340,
  "status": 200,
  "cost_usd": 0.006937,
  "endpoint": "/api/chat",
  "project": "customer-support"
}
```

If you are using CostHawk, this step is automatic: route your API calls through a CostHawk wrapped key and every field above is captured without any code changes. For custom implementations, wrap your API client with a logging middleware that extracts the usage data from each response.
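A custom middleware of that kind can be small. This sketch wraps any chat-completion-style callable; the `fake_client` stand-in and the pricing rates are assumptions so the example runs without network access, and the record it emits mirrors the fields above:

```python
import json
import time

# Assumed $/MTok (input, output) rates: check your provider's current pricing.
PRICING = {"gpt-4o": (2.50, 10.00)}

def log_llm_call(call_fn, *, model, **kwargs):
    """Call an LLM API function, then emit a telemetry record like the
    one above. Returns the record so callers can inspect or store it."""
    start = time.time()
    response = call_fn(model=model, **kwargs)
    usage = response["usage"]  # every major provider returns a usage object
    in_rate, out_rate = PRICING[model]
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "input_tokens": usage["prompt_tokens"],
        "output_tokens": usage["completion_tokens"],
        "latency_ms": int((time.time() - start) * 1000),
        "cost_usd": round(usage["prompt_tokens"] / 1e6 * in_rate
                          + usage["completion_tokens"] / 1e6 * out_rate, 6),
    }
    print(json.dumps(record))  # in production: ship to your log pipeline
    return record

# Stand-in client so the sketch runs as-is (hypothetical, not a real SDK).
def fake_client(model, **kwargs):
    return {"usage": {"prompt_tokens": 1247, "completion_tokens": 382}}

record = log_llm_call(fake_client, model="gpt-4o")
```

In a real integration you would pass your actual client's completion method as `call_fn` and write the record to durable storage instead of stdout.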
Phase 2: Cost attribution and dashboards (1-2 days)
Raw telemetry data becomes useful when it is aggregated and visualized. Build or configure dashboards that show:
- Total daily spend by model and provider
- Average cost per request by endpoint or feature
- Token consumption trends over the past 7 and 30 days
- Top 10 most expensive endpoints or features
- Cost breakdown by environment (dev, staging, production)
CostHawk provides these dashboards out of the box. If you are building custom dashboards, tools like Grafana with a time-series database (InfluxDB, TimescaleDB) work well, though you will need to compute costs from raw token counts and model pricing — which requires maintaining an up-to-date pricing table for every model.
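If you go the custom route, the aggregation step itself is straightforward. A minimal sketch, assuming a hand-maintained pricing table (the rates shown are examples, not current prices):

```python
from collections import defaultdict

# Hand-maintained $/MTok (input, output) table. Rates are examples only.
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def daily_spend_by_model(records):
    """Roll raw telemetry rows up into the 'daily spend by model' view."""
    totals = defaultdict(float)
    for r in records:
        in_rate, out_rate = PRICING[r["model"]]
        totals[(r["date"], r["model"])] += (
            r["input_tokens"] / 1e6 * in_rate
            + r["output_tokens"] / 1e6 * out_rate
        )
    return dict(totals)

records = [
    {"date": "2026-03-16", "model": "gpt-4o",
     "input_tokens": 1_000_000, "output_tokens": 100_000},
    {"date": "2026-03-16", "model": "gpt-4o-mini",
     "input_tokens": 2_000_000, "output_tokens": 500_000},
]
print(daily_spend_by_model(records))
```

The hard part in practice is not this rollup but keeping `PRICING` current as providers change rates, which is exactly the maintenance burden noted above.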
Phase 3: Alerting and anomaly detection (1 week)
Static budget alerts ("notify me when daily spend exceeds $500") are a good starting point, but they miss gradual cost creep and can be too slow to catch sudden spikes. Implement anomaly detection that compares current spending patterns to historical baselines:
- Alert when hourly spend exceeds 2x the trailing 7-day average for that hour
- Alert when average input token count per request increases by more than 30% (prompt bloat indicator)
- Alert when error rate exceeds 5% (potential retry storm indicator)
- Alert when a specific model's spend exceeds its monthly budget allocation
CostHawk's anomaly detection uses statistical methods (z-score analysis with configurable sensitivity) and integrates with Slack, PagerDuty, and email for notifications. Custom implementations can use a simple rolling-window z-score calculation over your telemetry data.
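A custom version of that rolling-window z-score check can be only a few lines. The threshold and sample figures are illustrative:

```python
import statistics

def is_anomalous(current, history, threshold=3.0):
    """Flag spend that deviates more than `threshold` standard
    deviations from the trailing window for this hour of day."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

# Trailing 7 days of spend for this hour, in dollars (illustrative).
baseline = [41.0, 39.5, 40.2, 42.1, 38.9, 40.8, 41.5]
print(is_anomalous(40.7, baseline))  # a normal hour
print(is_anomalous(95.0, baseline))  # retry-storm territory
```

Comparing each hour against the same hour of previous days, rather than a flat daily average, keeps normal traffic seasonality from triggering false alerts.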
Phase 4: Quality monitoring (2-4 weeks)
Quality monitoring requires more instrumentation than cost monitoring because there is no universal quality metric — quality is application-specific. Common approaches include:
- Format compliance checking: Parse every response against your expected schema and log success/failure rates. For JSON outputs, validate against a JSON schema. For structured text, check for required sections or fields.
- LLM-as-judge evaluation: Use a cheaper model to evaluate the quality of a more expensive model's output. For example, send GPT-4o mini a rubric and the original response, and ask it to score accuracy on a 1-5 scale. This costs pennies per evaluation and can run on a sample of production traffic.
- User feedback loops: Add thumbs up/down or 1-5 star ratings to your AI-generated outputs and log them alongside the request telemetry. This is the gold standard for quality measurement but requires product integration.
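The first approach, format compliance checking, can be as simple as parsing each response and verifying required fields. A stdlib-only sketch with a hypothetical schema:

```python
import json

# Hypothetical expected schema for a classification response.
REQUIRED_FIELDS = {"category": str, "confidence": float, "summary": str}

def check_format(raw_response: str) -> bool:
    """True if the model output parses as JSON and every required
    field is present with the expected type."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in REQUIRED_FIELDS.items())

good = '{"category": "billing", "confidence": 0.92, "summary": "Refund request"}'
bad = '{"category": "billing", "confidence": "high"}'
print(check_format(good), check_format(bad))
```

Logging the boolean result alongside each request's telemetry turns this into the format compliance rate from the metrics table, trackable per model and per prompt version.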
Phase 5: Optimization loop (ongoing)
Observability without action is just monitoring. The final phase is closing the optimization loop: using your observability data to make changes that reduce cost and improve quality, then measuring the impact of those changes. CostHawk's savings recommendations surface the highest-impact optimizations based on your actual usage patterns — such as "switching your /api/summarize endpoint from GPT-4o to GPT-4o mini would save $2,400/month based on last 30 days of traffic."
The Cost of NOT Having Observability
The irony of LLM observability is that the teams who need it most — those spending the most on AI APIs — are often the last to implement it. They are moving fast, shipping features, and treating AI costs as a problem for later. Here is what "later" typically looks like, drawn from real scenarios across hundreds of engineering teams:
Scenario 1: The invisible prompt bloat. A product team adds a "provide detailed explanations" instruction to the system prompt for a customer support chatbot. Average output length increases from 150 tokens to 450 tokens. Output tokens are priced several times higher than input tokens, so the cost per request roughly triples, from $0.003 to $0.009. At 50,000 requests per day, this adds $300/day — $9,000/month — to the bill. Without observability, the team does not connect the prompt change to the cost increase for six weeks. Total wasted spend: $13,500.
Scenario 2: The forgotten dev environment. Developers testing AI features locally route all requests through the production API key. Each developer makes 200-500 API calls per day during active development. A team of 8 developers burns through 1,600-4,000 requests per day in development, at an average cost of $0.01 per request — $16-$40/day, or roughly $480-$1,200/month — with zero production value. Without per-environment cost attribution, this spending is invisible. Multiply across a year and the dev environment has consumed $5,760-$14,400 in pure waste.
Scenario 3: The rate limit retry storm. An application hits a rate limit (HTTP 429) during a traffic spike. The retry logic uses exponential backoff, but a bug in the implementation resets the backoff counter on each new user request, causing hundreds of parallel retry loops. Each retry consumes the same tokens as the original request. Over 45 minutes, the application sends 12x the normal request volume before an engineer manually kills the retry loop. The incident costs $3,800 in excess API spend — more than the team's entire typical daily budget.
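For contrast, a correct backoff keeps the attempt counter local to each request, so one request can never reset another's delay. A minimal sketch, where `RateLimitError` stands in for a provider's HTTP 429 response:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 response."""

def call_with_backoff(call_fn, max_attempts=5, base_delay=1.0, cap=30.0):
    """Exponential backoff with jitter. The attempt counter is local to
    this request, so concurrent requests cannot reset each other's delay."""
    for attempt in range(max_attempts):
        try:
            return call_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries
```

Capping both the number of attempts and the per-request delay bounds the worst-case token spend of any single request, which is exactly the property the buggy implementation lacked.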
Scenario 4: The wrong model for the job. A team builds a document classification pipeline using Claude 3.5 Sonnet ($3.00/$15.00 per MTok) because "we want the best accuracy." The pipeline processes 200,000 documents per month, each requiring ~500 input tokens and ~50 output tokens. Monthly cost: $300 in input + $150 in output = $450. After finally implementing observability and benchmarking, the team discovers that Claude 3.5 Haiku ($0.80/$4.00 per MTok) achieves 97% of Sonnet's accuracy on this specific task. Switching to Haiku ($80 input + $40 output = $120/month) saves $330/month — a 73% reduction. Without observability data to justify the experiment, the team ran Sonnet for 9 months, overspending by roughly $2,970.
Scenario 5: The compliance audit nightmare. A healthcare application uses LLMs to summarize patient notes. During an audit, regulators ask for evidence of what data was sent to the LLM API, what model processed it, and what outputs were generated. Without LLM-specific logging, the team has no record of individual requests — only aggregate API billing data. The audit remediation effort costs 3 engineering-weeks and results in a compliance finding. Proper LLM observability with prompt/completion logging would have provided the audit trail automatically.
The common thread across all these scenarios is that the cost of not having observability always exceeds the cost of implementing it. CostHawk takes less than 30 minutes to set up (swap your API base URL or install the MCP server), and most teams identify optimization opportunities worth 3-10x the subscription cost within the first week. The question is not whether you can afford LLM observability — it is whether you can afford not to have it.
Frequently Asked Questions
How is LLM observability different from logging API requests?
What is the minimum setup needed to get started with LLM observability?
At minimum, you need three things: (1) capturing the usage object from every API response, which contains prompt_tokens and completion_tokens; (2) mapping those token counts to dollar costs using the model's pricing; and (3) storing the results in a queryable format. If you use CostHawk, this entire setup takes under 30 minutes: create an account, generate a wrapped API key, and swap your base URL to route through the CostHawk proxy. Every request is automatically logged with token counts, costs, latency, model, and metadata. If you prefer a custom approach, the simplest implementation is a logging middleware that wraps your LLM API client. For each request, extract the model name and usage data from the response, compute the cost using a pricing lookup table, and write a structured log entry to your preferred storage (a database table, a logging service, or even a CSV file). Even this basic setup gives you dramatically more visibility than the alternative of checking your provider's billing dashboard once a month. From there, you can incrementally add dashboards, alerting, quality metrics, and anomaly detection as your needs grow. The key is to start capturing data immediately — you cannot analyze historical trends if you did not log the data.

How do I track costs across multiple LLM providers?
Every major provider returns a usage object with token counts in its API responses, though the field names differ (OpenAI reports prompt_tokens and completion_tokens; Anthropic reports input_tokens and output_tokens). Log these provider-reported numbers and multiply by the current per-token rate for each model. CostHawk handles multi-provider normalization automatically: it maintains an up-to-date pricing table for every model across OpenAI, Anthropic, Google, Mistral, and other providers, computes costs at the per-request level, and presents a unified dashboard where you can compare spend across providers in the same view. This is particularly valuable for teams that use different providers for different tasks — you can see your total AI spend in one place instead of checking three separate billing dashboards.

Should I log full prompts and completions, or just metadata?
How does LLM observability help with prompt engineering?
What role does tracing play in LLM observability?
How often should I review LLM observability dashboards?
Can LLM observability work with self-hosted or open-source models?
Yes. With API-hosted models, instrumentation is easy: every response includes a usage object with token counts, and the provider's pricing is publicly documented. With self-hosted models (Llama 3, Mistral, DeepSeek running on your own GPUs or via services like vLLM, TGI, or Ollama), you need to instrument the serving layer to emit the same telemetry. Most model serving frameworks provide token count data: vLLM includes prompt_tokens and completion_tokens in its response metadata, TGI provides similar fields, and Ollama returns token counts in its API response. The cost calculation is different for self-hosted models because you are not paying per token — you are paying for GPU compute time. The relevant metric becomes cost per request, which you estimate by dividing your GPU hourly rate by the number of requests processed per hour. For hybrid architectures that mix API-hosted and self-hosted models, CostHawk normalizes both cost models into a unified dashboard. You can compare the effective cost-per-token of your self-hosted Llama 3 deployment against the API price of GPT-4o mini to make informed decisions about which workloads to route where. The key insight is that LLM observability principles are model-agnostic — the metrics (tokens, latency, cost, quality) matter regardless of where the model runs. Only the instrumentation method changes.

Related Terms
Tracing
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency often trades off against cost: options like provisioned throughput reduce latency but increase spend.
Cost Anomaly Detection
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
Dashboards
Visual interfaces for monitoring AI cost, usage, and performance metrics in real-time. The command center for AI cost management — dashboards aggregate token spend, model utilization, latency, and budget health into a single pane of glass.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
