LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Why It Matters for AI Costs
AI API spend is one of the fastest-growing line items in modern engineering budgets, and it is uniquely difficult to predict and control. Traditional infrastructure costs are relatively stable: a database server costs the same whether it processes 1,000 or 10,000 queries (up to its capacity limit). LLM costs, by contrast, scale linearly with every token processed — and token consumption is driven by factors that are largely invisible without dedicated observability.
Consider what can go wrong without LLM observability:
- Silent cost explosions. A developer adds a new feature that includes full conversation history in every request. Token consumption doubles overnight. Without per-request token tracking, the team does not notice until the monthly invoice arrives — $15,000 over budget.
- Quality regressions. A prompt change that was tested on 10 examples degrades output quality for an edge case that affects 5% of production traffic. Without quality monitoring, users churn before anyone notices.
- Model routing failures. A load balancer is supposed to route simple requests to GPT-4o mini and complex ones to Claude 3.5 Sonnet, but a configuration error sends everything to Sonnet. The team pays 20x more than necessary for 60% of its traffic.
- Retry storms. A transient API error triggers aggressive retries that quadruple token consumption for 30 minutes. Without real-time spend monitoring, the incident burns through the daily budget before anyone intervenes.
LLM observability transforms AI spend from an opaque monthly bill into a real-time, actionable data stream. Teams with mature observability practices consistently spend 30-50% less than teams without it, simply because they can see where money is going and act on that information. CostHawk provides the observability layer that connects every API call to its cost, latency, and quality impact — giving you the visibility you need to run AI features like a well-managed production system, not a science experiment.
What is LLM Observability?
LLM observability is the comprehensive practice of collecting, correlating, and acting on telemetry data from every layer of an LLM-powered application. It draws from three foundational pillars of traditional observability — metrics, traces, and logs — but extends each with AI-specific dimensions that have no equivalent in conventional software monitoring.
Metrics in LLM observability go beyond request counts and error rates. The core metrics include:
- Token throughput: Input and output tokens per second, minute, and hour — broken down by model, endpoint, and user segment.
- Cost rate: Dollars per hour, day, or month — tracked in real time so teams can detect budget overruns as they happen, not after the invoice arrives.
- Latency distribution: Time-to-first-token (TTFT), end-to-end latency, and tokens-per-second generation speed — segmented by model and request complexity.
- Error rates: Rate limit hits (HTTP 429), server errors (HTTP 500/503), content filtering blocks, and malformed responses.
- Quality scores: Application-specific metrics like classification accuracy, hallucination rate, format compliance, and user satisfaction ratings.
Traces in LLM observability capture the full lifecycle of an AI request. A single user interaction might involve multiple LLM calls: a router call to classify intent, a retrieval step to fetch context from a vector database, a generation call to produce the response, and a guard call to check the output for safety. An LLM trace links all of these steps together, showing the total token consumption, latency, and cost for the entire chain — not just individual API calls.
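As a sketch, the multi-step interaction above could be modeled as a trace of spans, each carrying its own token, latency, and cost data. The dataclasses and field names here are illustrative, not any particular SDK's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in the chain: router, retrieval, generation, or guard."""
    name: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float

@dataclass
class Trace:
    """Links every step of one user interaction into a single record."""
    request_id: str
    spans: list[Span] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

    def total_latency_ms(self) -> int:
        return sum(s.latency_ms for s in self.spans)

trace = Trace("req-123", [
    Span("router",     320,  15,  410, 0.00009),
    Span("retrieval",    0,   0,  120, 0.0),      # vector DB call, no LLM cost
    Span("generate",  2100, 380, 2900, 0.00910),
    Span("guard",      400,   5,  350, 0.00012),
])
print(f"chain cost ${trace.total_cost():.5f} over {trace.total_latency_ms()} ms")
```

The point of the aggregation methods is exactly the claim in the text: the trace reports the cost and latency of the entire chain, not just one API call.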
Logs in LLM observability record the actual prompts and completions flowing through your system. This is essential for debugging quality issues ("why did the model say that?"), auditing for compliance, and building evaluation datasets for prompt optimization. However, logging LLM inputs and outputs introduces privacy and storage considerations that do not exist with traditional application logs — a single day of production LLM logs can easily exceed 100 GB for high-traffic applications.
The synthesis of these three pillars creates a unified observability layer that answers questions no single data source can address alone: Which model is giving us the best quality-per-dollar? Is our cost spike caused by more requests or longer prompts? Did last week's prompt change actually improve accuracy, or just make responses longer?
LLM observability is not optional for production AI applications. It is the foundational capability that makes every other optimization — model routing, prompt engineering, caching, budget enforcement — possible. Without it, you are making decisions based on intuition and monthly invoices. With it, you are making decisions based on real-time, per-request data.
LLM Observability vs Traditional Monitoring
Teams that have invested heavily in traditional APM tools like Datadog, New Relic, or Grafana often assume these tools are sufficient for monitoring LLM-powered applications. They are not. While traditional APM provides valuable infrastructure-level visibility, it is blind to the dimensions that matter most for AI applications. Here is a detailed comparison:
| Dimension | Traditional APM | LLM Observability |
|---|---|---|
| Cost tracking | Infrastructure costs (compute, storage, network) — relatively stable and predictable | Per-request token costs that vary by model, prompt length, and output length — highly variable and directly tied to application behavior |
| Latency analysis | HTTP response time as a single number | Decomposed into TTFT (time-to-first-token), generation speed (tokens/sec), and total completion time — each with different optimization levers |
| Error classification | HTTP status codes (4xx, 5xx) | Rate limits (429), content blocks, malformed JSON, hallucinations, refusals, truncated outputs — each requiring different remediation |
| Quality metrics | Not applicable — traditional APM does not evaluate response content | Accuracy, hallucination rate, format compliance, safety scores, relevance ratings — essential for maintaining user trust |
| Request analysis | URL, method, headers, payload size | Full prompt/completion text, token counts, model parameters (temperature, max_tokens), system prompt versions |
| Attribution | Service, endpoint, user | Model, provider, prompt version, API key, project tag, feature flag, user segment |
| Optimization guidance | Scale up, add caching, optimize queries | Switch models, shorten prompts, enable prompt caching, implement model routing, cap output length |
The fundamental difference is that traditional APM treats the AI API as a black box: it sees that a request went out and a response came back, with some latency and a status code. LLM observability opens that black box and examines what happened inside: how many tokens were consumed, what model processed them, how much it cost, and whether the output was actually good.
Consider a concrete scenario. Your AI-powered search feature starts receiving user complaints about slow responses. Traditional APM shows you that the /api/search endpoint P95 latency increased from 2 seconds to 8 seconds. That is useful, but it does not tell you why. LLM observability shows you that the average input token count increased from 1,200 to 4,800 because a recent feature change started including full document content instead of summaries in the search context. The latency increase is a direct consequence of the token increase — the model takes longer to process 4x more input. The fix is clear: revert to document summaries or implement chunked retrieval. Without LLM-specific observability, the team might have wasted days investigating infrastructure causes (Is the API provider slow? Is our network degraded?) when the root cause was purely application-level.
The most effective approach is to layer LLM observability on top of traditional APM. Use Datadog or Grafana for infrastructure health (CPU, memory, network, container metrics) and a purpose-built LLM observability tool like CostHawk for AI-specific telemetry (tokens, costs, quality, model performance). This gives you complete visibility across both dimensions.
Key Metrics to Track
Effective LLM observability requires tracking a specific set of metrics that collectively describe the health, cost, and quality of your AI features. Below is a comprehensive reference of the metrics every team should track, organized by category:
| Category | Metric | Description | Why It Matters |
|---|---|---|---|
| Cost | Cost per request | Dollar cost of each API call, computed from token counts and model pricing | The unit economic foundation for budgeting and pricing your product |
| Cost | Daily/monthly spend | Aggregate cost over time, broken down by model, endpoint, project, and key | Budget tracking and anomaly detection baseline |
| Cost | Cost per user action | Total AI cost attributed to a single user-facing action (which may involve multiple LLM calls) | Product economics — ensures AI features are profitable |
| Tokens | Input tokens per request | Average, P50, P95, and P99 input token counts | Identifies prompt bloat and context window inefficiency |
| Tokens | Output tokens per request | Average, P50, P95, and P99 output token counts | Detects verbose generation; output tokens cost 4-5x more than input |
| Tokens | Cache hit rate | Percentage of input tokens served from prompt cache vs computed fresh | Prompt caching can save 50-90% on input token costs |
| Latency | Time to first token (TTFT) | Time from request sent to first response token received | Directly impacts perceived responsiveness for streaming UIs |
| Latency | End-to-end latency | Total time from request to final token | Overall user experience metric; correlates with output length |
| Latency | Tokens per second | Output token generation speed | Bottleneck indicator; varies by model and provider load |
| Reliability | Error rate | Percentage of requests that return errors (429, 500, 503, timeouts) | Service availability; rate limits indicate capacity issues |
| Reliability | Retry rate | Percentage of requests that required one or more retries | Retries multiply token consumption and cost |
| Quality | Format compliance | Percentage of responses that match expected output schema (valid JSON, correct fields) | Downstream parsing failures if format compliance drops |
| Quality | Hallucination rate | Percentage of responses containing factually incorrect claims (sampled or automated) | User trust and safety; hardest quality metric to automate |
| Quality | User feedback score | Thumbs up/down or satisfaction rating from end users | Ground truth for whether AI output is actually useful |
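As an illustration of the percentile breakdowns the table calls for, the Python standard library is enough to summarize a batch of logged token counts. The sample values are made up:

```python
import statistics

def token_percentiles(values, ps=(50, 95, 99)):
    """P50/P95/P99 summary of logged token counts, per the table above."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in ps}

# One hour of logged input token counts (illustrative sample).
input_tokens = [900, 1100, 1150, 1200, 1250, 1280, 1300, 1350, 4800, 5000]
print(token_percentiles(input_tokens))
```

Note how the two outlier requests barely move the P50 but dominate the P95 and P99, which is why per-percentile tracking catches prompt bloat that averages hide.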
Not every team needs every metric from day one. The recommended adoption sequence is:
- Week 1: Cost per request, daily spend, input/output token counts, error rate. These give you basic cost visibility and reliability monitoring.
- Month 1: Add latency metrics (TTFT, end-to-end, tokens/sec), retry rate, and cost-per-user-action. These enable performance optimization and product economics analysis.
- Month 3: Add quality metrics (format compliance, hallucination sampling, user feedback). These require more instrumentation but are essential for long-term quality assurance.
CostHawk captures cost, token, latency, and reliability metrics automatically for every request routed through its proxy or synced via the MCP server. Quality metrics can be added through custom tags and the evaluation API.
Observability Tool Landscape
The LLM observability market has matured rapidly since 2024, with several specialized tools competing alongside extensions from traditional APM vendors. Here is a practical overview of the major players and how they compare:
Helicone is an open-source LLM observability platform that works as a proxy layer in front of your LLM API calls. You swap your API base URL to route through Helicone, and it logs every request with token counts, latency, costs, and the full prompt/completion text. Helicone excels at per-request logging and provides a clean UI for browsing individual requests. Its main strengths are the open-source codebase, easy integration (one-line base URL change), and a generous free tier. Its main limitations are less sophisticated cost analytics (no automated anomaly detection or budget enforcement) and a focus on individual request debugging rather than aggregate cost trends.
Langfuse is an open-source observability and analytics platform designed around the concept of "traces" — multi-step LLM workflows that chain multiple API calls together. Langfuse provides SDKs for Python and JavaScript that let you instrument complex agent workflows with nested spans, scoring, and metadata. Its strongest differentiator is trace-level analytics: you can see the total cost and latency of an entire multi-step agent execution, not just individual API calls. Langfuse integrates well with LangChain and LlamaIndex. Its main limitations are a more complex setup compared to proxy-based tools, and its self-hosted deployment requires significant infrastructure management.
Portkey is a commercial AI gateway that combines observability with operational features like load balancing, fallback routing, and caching. Portkey sits as a proxy between your application and LLM providers, providing real-time dashboards for cost, latency, and reliability. Its key differentiator is the operational layer: automatic failover when a provider goes down, intelligent load balancing across models, and built-in prompt caching. Portkey is strongest for teams running multi-provider, multi-model architectures that need both observability and resilience. Its main limitation is that it is a commercial product with per-request pricing that adds to your AI spend.
CostHawk is purpose-built for the cost and financial dimension of LLM observability. While other tools focus primarily on debugging and tracing, CostHawk centers on the question that matters most to engineering leaders and finance teams: how much are we spending, where is it going, and how do we reduce it? CostHawk provides real-time cost dashboards with per-model, per-key, per-project, and per-tag breakdowns. Its anomaly detection system flags unusual spending patterns within minutes, not days. Budget alerts notify you before you exceed thresholds. The wrapped-key proxy captures every request with zero code changes, and the MCP server syncs local development tool usage (Claude Code, Codex) that other observability tools miss entirely. CostHawk is strongest for teams that need to treat AI spend as a first-class financial metric — with budget enforcement, cost attribution, and savings recommendations that directly reduce your bill.
How these tools compare in practice:
| Feature | Helicone | Langfuse | Portkey | CostHawk |
|---|---|---|---|---|
| Integration method | Proxy (URL swap) | SDK instrumentation | Proxy (URL swap) | Proxy + MCP sync |
| Cost analytics depth | Basic (per-request cost) | Moderate (trace-level cost) | Good (real-time dashboards) | Deep (anomaly detection, budgets, forecasting, savings recs) |
| Trace support | Basic | Excellent (nested spans) | Moderate | Moderate |
| Local dev tool tracking | No | No | No | Yes (Claude Code, Codex sync) |
| Budget enforcement | No | No | Partial | Yes (alerts + hard limits) |
| Open source | Yes | Yes | No | MCP server is open source |
| Best for | Individual request debugging | Complex agent workflows | Multi-provider resilience | Cost management and optimization |
Many teams use a combination: CostHawk for cost management and Langfuse or Helicone for detailed request-level debugging. The tools are complementary, not mutually exclusive.
Implementing LLM Observability
Implementing LLM observability does not require a massive upfront investment. The most effective approach is incremental: start with basic cost and token tracking, then add layers of sophistication as your needs grow. Here is a practical implementation roadmap:
Phase 1: Basic telemetry (1-2 hours to set up)
The fastest path to observability is intercepting every LLM API call and recording the response metadata. Every major provider returns a usage object in the API response that includes prompt_tokens and completion_tokens. At minimum, log these fields for every request:
```json
{
  "timestamp": "2026-03-16T14:23:01Z",
  "model": "gpt-4o",
  "provider": "openai",
  "input_tokens": 1247,
  "output_tokens": 382,
  "latency_ms": 2340,
  "status": 200,
  "cost_usd": 0.006937,
  "endpoint": "/api/chat",
  "project": "customer-support"
}
```

If you are using CostHawk, this step is automatic: route your API calls through a CostHawk wrapped key and every field above is captured without any code changes. For custom implementations, wrap your API client with a logging middleware that extracts the usage data from each response.
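A custom middleware of that kind can be small. This sketch wraps any chat-completion-style callable; the `fake_client` stand-in and the pricing rates are assumptions so the example runs without network access, and the record it emits mirrors the fields above:

```python
import json
import time

# Assumed $/MTok (input, output) rates: check your provider's current pricing.
PRICING = {"gpt-4o": (2.50, 10.00)}

def log_llm_call(call_fn, *, model, **kwargs):
    """Call an LLM API function, then emit a telemetry record like the
    one above. Returns the record so callers can inspect or store it."""
    start = time.time()
    response = call_fn(model=model, **kwargs)
    usage = response["usage"]  # every major provider returns a usage object
    in_rate, out_rate = PRICING[model]
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "input_tokens": usage["prompt_tokens"],
        "output_tokens": usage["completion_tokens"],
        "latency_ms": int((time.time() - start) * 1000),
        "cost_usd": round(usage["prompt_tokens"] / 1e6 * in_rate
                          + usage["completion_tokens"] / 1e6 * out_rate, 6),
    }
    print(json.dumps(record))  # in production: ship to your log pipeline
    return record

# Stand-in client so the sketch runs as-is (hypothetical, not a real SDK).
def fake_client(model, **kwargs):
    return {"usage": {"prompt_tokens": 1247, "completion_tokens": 382}}

record = log_llm_call(fake_client, model="gpt-4o")
```

In a real integration you would pass your actual client's completion method as `call_fn` and write the record to durable storage instead of stdout.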
Phase 2: Cost attribution and dashboards (1-2 days)
Raw telemetry data becomes useful when it is aggregated and visualized. Build or configure dashboards that show:
- Total daily spend by model and provider
- Average cost per request by endpoint or feature
- Token consumption trends over the past 7 and 30 days
- Top 10 most expensive endpoints or features
- Cost breakdown by environment (dev, staging, production)
CostHawk provides these dashboards out of the box. If you are building custom dashboards, tools like Grafana with a time-series database (InfluxDB, TimescaleDB) work well, though you will need to compute costs from raw token counts and model pricing — which requires maintaining an up-to-date pricing table for every model.
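If you go the custom route, the aggregation step itself is straightforward. A minimal sketch, assuming a hand-maintained pricing table (the rates shown are examples, not current prices):

```python
from collections import defaultdict

# Hand-maintained $/MTok (input, output) table. Rates are examples only.
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def daily_spend_by_model(records):
    """Roll raw telemetry rows up into the 'daily spend by model' view."""
    totals = defaultdict(float)
    for r in records:
        in_rate, out_rate = PRICING[r["model"]]
        totals[(r["date"], r["model"])] += (
            r["input_tokens"] / 1e6 * in_rate
            + r["output_tokens"] / 1e6 * out_rate
        )
    return dict(totals)

records = [
    {"date": "2026-03-16", "model": "gpt-4o",
     "input_tokens": 1_000_000, "output_tokens": 100_000},
    {"date": "2026-03-16", "model": "gpt-4o-mini",
     "input_tokens": 2_000_000, "output_tokens": 500_000},
]
print(daily_spend_by_model(records))
```

The hard part in practice is not this rollup but keeping `PRICING` current as providers change rates, which is exactly the maintenance burden noted above.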
Phase 3: Alerting and anomaly detection (1 week)
Static budget alerts ("notify me when daily spend exceeds $500") are a good starting point, but they miss gradual cost creep and can be too slow to catch sudden spikes. Implement anomaly detection that compares current spending patterns to historical baselines:
- Alert when hourly spend exceeds 2x the trailing 7-day average for that hour
- Alert when average input token count per request increases by more than 30% (prompt bloat indicator)
- Alert when error rate exceeds 5% (potential retry storm indicator)
- Alert when a specific model's spend exceeds its monthly budget allocation
CostHawk's anomaly detection uses statistical methods (z-score analysis with configurable sensitivity) and integrates with Slack, PagerDuty, and email for notifications. Custom implementations can use a simple rolling-window z-score calculation over your telemetry data.
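A custom version of that rolling-window z-score check can be only a few lines. The threshold and sample figures are illustrative:

```python
import statistics

def is_anomalous(current, history, threshold=3.0):
    """Flag spend that deviates more than `threshold` standard
    deviations from the trailing window for this hour of day."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

# Trailing 7 days of spend for this hour, in dollars (illustrative).
baseline = [41.0, 39.5, 40.2, 42.1, 38.9, 40.8, 41.5]
print(is_anomalous(40.7, baseline))  # a normal hour
print(is_anomalous(95.0, baseline))  # retry-storm territory
```

Comparing each hour against the same hour of previous days, rather than a flat daily average, keeps normal traffic seasonality from triggering false alerts.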
Phase 4: Quality monitoring (2-4 weeks)
Quality monitoring requires more instrumentation than cost monitoring because there is no universal quality metric — quality is application-specific. Common approaches include:
- Format compliance checking: Parse every response against your expected schema and log success/failure rates. For JSON outputs, validate against a JSON schema. For structured text, check for required sections or fields.
- LLM-as-judge evaluation: Use a cheaper model to evaluate the quality of a more expensive model's output. For example, send GPT-4o mini a rubric and the original response, and ask it to score accuracy on a 1-5 scale. This costs pennies per evaluation and can run on a sample of production traffic.
- User feedback loops: Add thumbs up/down or 1-5 star ratings to your AI-generated outputs and log them alongside the request telemetry. This is the gold standard for quality measurement but requires product integration.
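The first approach, format compliance checking, can be as simple as parsing each response and verifying required fields. A stdlib-only sketch with a hypothetical schema:

```python
import json

# Hypothetical expected schema for a classification response.
REQUIRED_FIELDS = {"category": str, "confidence": float, "summary": str}

def check_format(raw_response: str) -> bool:
    """True if the model output parses as JSON and every required
    field is present with the expected type."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in REQUIRED_FIELDS.items())

good = '{"category": "billing", "confidence": 0.92, "summary": "Refund request"}'
bad = '{"category": "billing", "confidence": "high"}'
print(check_format(good), check_format(bad))
```

Logging the boolean result alongside each request's telemetry turns this into the format compliance rate from the metrics table, trackable per model and per prompt version.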
Phase 5: Optimization loop (ongoing)
Observability without action is just monitoring. The final phase is closing the optimization loop: using your observability data to make changes that reduce cost and improve quality, then measuring the impact of those changes. CostHawk's savings recommendations surface the highest-impact optimizations based on your actual usage patterns — such as "switching your /api/summarize endpoint from GPT-4o to GPT-4o mini would save $2,400/month based on last 30 days of traffic."
The Cost of NOT Having Observability
The irony of LLM observability is that the teams who need it most — those spending the most on AI APIs — are often the last to implement it. They are moving fast, shipping features, and treating AI costs as a problem for later. Here is what "later" typically looks like, drawn from real scenarios across hundreds of engineering teams:
Scenario 1: The invisible prompt bloat. A product team adds a "provide detailed explanations" instruction to the system prompt for a customer support chatbot. Average output length increases from 150 tokens to 450 tokens. Output tokens are priced several times higher than input tokens, so the cost per request roughly triples, from $0.003 to $0.009. At 50,000 requests per day, this adds $300/day — $9,000/month — to the bill. Without observability, the team does not connect the prompt change to the cost increase for six weeks. Total wasted spend: $13,500.
Scenario 2: The forgotten dev environment. Developers testing AI features locally route all requests through the production API key. Each developer makes 200-500 API calls per day during active development. A team of 8 developers burns through 1,600-4,000 requests per day in development, at an average cost of $0.01 per request — $16-$40/day, or roughly $480-$1,200/month — with zero production value. Without per-environment cost attribution, this spending is invisible. Multiply across a year and the dev environment has consumed $5,760-$14,400 in pure waste.
Scenario 3: The rate limit retry storm. An application hits a rate limit (HTTP 429) during a traffic spike. The retry logic uses exponential backoff, but a bug in the implementation resets the backoff counter on each new user request, causing hundreds of parallel retry loops. Each retry consumes the same tokens as the original request. Over 45 minutes, the application sends 12x the normal request volume before an engineer manually kills the retry loop. The incident costs $3,800 in excess API spend — more than the team's entire typical daily budget.
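For contrast, a correct backoff keeps the attempt counter local to each request, so one request can never reset another's delay. A minimal sketch, where `RateLimitError` stands in for a provider's HTTP 429 response:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 response."""

def call_with_backoff(call_fn, max_attempts=5, base_delay=1.0, cap=30.0):
    """Exponential backoff with jitter. The attempt counter is local to
    this request, so concurrent requests cannot reset each other's delay."""
    for attempt in range(max_attempts):
        try:
            return call_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries
```

Capping both the number of attempts and the per-request delay bounds the worst-case token spend of any single request, which is exactly the property the buggy implementation lacked.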
Scenario 4: The wrong model for the job. A team builds a document classification pipeline using Claude 3.5 Sonnet ($3.00/$15.00 per MTok) because "we want the best accuracy." The pipeline processes 200,000 documents per month, each requiring ~500 input tokens and ~50 output tokens. Monthly cost: $300 in input + $150 in output = $450. After finally implementing observability and benchmarking, the team discovers that Claude 3.5 Haiku ($0.80/$4.00 per MTok) achieves 97% of Sonnet's accuracy on this specific task. Switching to Haiku ($80 input + $40 output = $120/month) saves $330/month — a 73% reduction. Without observability data to justify the experiment, the team ran Sonnet for 9 months, overspending by roughly $2,970.
Scenario 5: The compliance audit nightmare. A healthcare application uses LLMs to summarize patient notes. During an audit, regulators ask for evidence of what data was sent to the LLM API, what model processed it, and what outputs were generated. Without LLM-specific logging, the team has no record of individual requests — only aggregate API billing data. The audit remediation effort costs 3 engineering-weeks and results in a compliance finding. Proper LLM observability with prompt/completion logging would have provided the audit trail automatically.
The common thread across all these scenarios is that the cost of not having observability always exceeds the cost of implementing it. CostHawk takes less than 30 minutes to set up (swap your API base URL or install the MCP server), and most teams identify optimization opportunities worth 3-10x the subscription cost within the first week. The question is not whether you can afford LLM observability — it is whether you can afford not to have it.
Frequently Asked Questions
How is LLM observability different from logging API requests?
What is the minimum setup needed to get started with LLM observability?
At minimum, you need three things: (1) capturing the usage object from every API response, which contains prompt_tokens and completion_tokens; (2) mapping those token counts to dollar costs using the model's pricing; and (3) storing the results in a queryable format. If you use CostHawk, this entire setup takes under 30 minutes: create an account, generate a wrapped API key, and swap your base URL to route through the CostHawk proxy. Every request is automatically logged with token counts, costs, latency, model, and metadata. If you prefer a custom approach, the simplest implementation is a logging middleware that wraps your LLM API client. For each request, extract the model name and usage data from the response, compute the cost using a pricing lookup table, and write a structured log entry to your preferred storage (a database table, a logging service, or even a CSV file). Even this basic setup gives you dramatically more visibility than the alternative of checking your provider's billing dashboard once a month. From there, you can incrementally add dashboards, alerting, quality metrics, and anomaly detection as your needs grow. The key is to start capturing data immediately — you cannot analyze historical trends if you did not log the data.

How do I track costs across multiple LLM providers?
Every major provider returns a usage object with token counts in its API responses, though the field names differ (OpenAI reports prompt_tokens and completion_tokens; Anthropic reports input_tokens and output_tokens). Log these provider-reported numbers and multiply by the current per-token rate for each model. CostHawk handles multi-provider normalization automatically: it maintains an up-to-date pricing table for every model across OpenAI, Anthropic, Google, Mistral, and other providers, computes costs at the per-request level, and presents a unified dashboard where you can compare spend across providers in the same view. This is particularly valuable for teams that use different providers for different tasks — you can see your total AI spend in one place instead of checking three separate billing dashboards.

Should I log full prompts and completions, or just metadata?
How does LLM observability help with prompt engineering?
What role does tracing play in LLM observability?
How often should I review LLM observability dashboards?
Can LLM observability work with self-hosted or open-source models?
Yes. With API-hosted models, instrumentation is easy: every response includes a usage object with token counts, and the provider's pricing is publicly documented. With self-hosted models (Llama 3, Mistral, DeepSeek running on your own GPUs or via services like vLLM, TGI, or Ollama), you need to instrument the serving layer to emit the same telemetry. Most model serving frameworks provide token count data: vLLM includes prompt_tokens and completion_tokens in its response metadata, TGI provides similar fields, and Ollama returns token counts in its API response. The cost calculation is different for self-hosted models because you are not paying per token — you are paying for GPU compute time. The relevant metric becomes cost per request, which you estimate by dividing your GPU hourly rate by the number of requests processed per hour. For hybrid architectures that mix API-hosted and self-hosted models, CostHawk normalizes both cost models into a unified dashboard. You can compare the effective cost-per-token of your self-hosted Llama 3 deployment against the API price of GPT-4o mini to make informed decisions about which workloads to route where. The key insight is that LLM observability principles are model-agnostic — the metrics (tokens, latency, cost, quality) matter regardless of where the model runs. Only the instrumentation method changes.

Related Terms
Tracing
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency often trades off against cost: options like provisioned throughput reduce latency but increase spend.
Cost Anomaly Detection
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
Dashboards
Visual interfaces for monitoring AI cost, usage, and performance metrics in real-time. The command center for AI cost management — dashboards aggregate token spend, model utilization, latency, and budget health into a single pane of glass.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
