Tracing

What is Tracing?

The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.

Why It Matters for AI Costs
Modern LLM applications are not simple request-response systems. A single user interaction may trigger multiple model calls, vector database lookups, tool executions, and conditional branching — all of which consume time and money. Without tracing, debugging and optimizing these pipelines is like diagnosing a performance problem in a microservices architecture by looking at aggregate metrics alone: you can see that something is slow or expensive, but you cannot pinpoint where or why.
Consider a typical RAG (Retrieval-Augmented Generation) chatbot pipeline:
- Embedding generation — 50 ms, $0.0001 (embed the user query)
- Vector search — 120 ms, $0 (query Pinecone/Weaviate for relevant documents)
- Context assembly — 5 ms, $0 (format retrieved documents into a prompt)
- LLM inference — 2,200 ms, $0.018 (Claude 3.5 Sonnet, 3,000 input + 800 output tokens)
- Response formatting — 2 ms, $0 (parse and sanitize the response)
Total: 2,377 ms, $0.0181. Without tracing, all you see is a 2.4-second request that cost $0.018. With tracing, you see that 92% of latency and 99% of cost come from step 4. If you want to reduce latency, optimizing steps 1-3 is pointless — you need to reduce context size, switch to a faster model, or cap output length. If you want to reduce cost, the embedding call is negligible; only the LLM call matters.
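The arithmetic above can be reproduced in a few lines. This is a sketch, not a required schema: the step names are shortened labels, and the millisecond and dollar figures are copied from the pipeline example.

```typescript
// Computing each step's share of total latency and cost for the RAG
// pipeline above. Figures copied from the text; names are illustrative.
const steps = [
  { name: 'embedding', ms: 50, usd: 0.0001 },
  { name: 'vector_search', ms: 120, usd: 0 },
  { name: 'context_assembly', ms: 5, usd: 0 },
  { name: 'llm_inference', ms: 2200, usd: 0.018 },
  { name: 'response_formatting', ms: 2, usd: 0 },
]

const totalMs = steps.reduce((sum, s) => sum + s.ms, 0)   // 2377
const totalUsd = steps.reduce((sum, s) => sum + s.usd, 0) // ~0.0181

for (const s of steps) {
  const latPct = ((s.ms / totalMs) * 100).toFixed(1)
  const costPct = ((s.usd / totalUsd) * 100).toFixed(1)
  console.log(`${s.name}: ${latPct}% of latency, ${costPct}% of cost`)
}
```

Running this shows the LLM inference step dominating both dimensions, which is exactly the insight tracing makes routine.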
Now multiply this by the complexity of real production systems. Agentic workflows may chain 5-15 model calls with tool use in between. A single user request might cost $0.50 and take 30 seconds, with costs distributed unevenly across steps. Tracing is the only way to identify which steps are consuming disproportionate resources and prioritize optimization efforts accordingly. CostHawk integrates trace data with cost attribution, so each span in a trace shows not just its duration but its exact token consumption and dollar cost.
What is LLM Tracing?
LLM tracing adapts the concept of distributed tracing — originally developed for microservices architectures — to the specific needs of AI application observability. In traditional distributed tracing (popularized by Google's Dapper paper and implemented in tools like Jaeger and Zipkin), a trace follows an HTTP request as it traverses multiple services, recording a span for each service hop. In LLM tracing, the "services" are the stages of an AI pipeline: prompt construction, model inference, tool execution, memory retrieval, and response processing.
A trace is the top-level container representing a single end-to-end operation — typically one user request or one pipeline invocation. Each trace has a unique trace_id (a 128-bit identifier, rendered as 32 hex characters in the W3C Trace Context format) that correlates all spans belonging to that operation.
A span is a single timed operation within a trace. Spans have:
- span_id: Unique identifier for this span.
- parent_span_id: The ID of the parent span, forming the tree structure. The root span has no parent.
- operation_name: Human-readable label (e.g., "llm.chat", "retriever.query", "tool.execute").
- start_time and end_time: Nanosecond-precision timestamps.
- attributes: Key-value pairs carrying contextual data. For LLM spans, common attributes include llm.model, llm.input_tokens, llm.output_tokens, llm.cost_usd, llm.temperature, and llm.max_tokens.
- status: Success, error, or unset.
- events: Point-in-time annotations within the span (e.g., "first token received" at timestamp T).
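As a sketch, these span fields can be modeled as a plain TypeScript type. This is simplified for illustration: real SDKs such as OpenTelemetry use richer types, production timestamps can exceed Number precision (BigInt would be needed there), and the example event timestamp is invented.

```typescript
// A span modeled as a plain type mirroring the field list above.
type SpanStatus = 'ok' | 'error' | 'unset'

interface Span {
  spanId: string
  parentSpanId: string | null // null for the root span
  operationName: string       // e.g. 'llm.chat', 'retriever.query'
  startTimeNs: number         // nanosecond-precision timestamps
  endTimeNs: number
  attributes: Record<string, string | number | boolean>
  status: SpanStatus
  events: { name: string; timestampNs: number }[]
}

// Example: an LLM-call span (event timestamp is illustrative).
const llmSpan: Span = {
  spanId: 'span-3a',
  parentSpanId: 'span-3',
  operationName: 'llm.chat',
  startTimeNs: 0,
  endTimeNs: 2_970_000_000,
  attributes: {
    'llm.model': 'claude-3-5-sonnet',
    'llm.input_tokens': 3200,
    'llm.output_tokens': 620,
    'llm.cost_usd': 0.0236,
  },
  status: 'ok',
  events: [{ name: 'first token received', timestampNs: 480_000_000 }],
}

console.log((llmSpan.endTimeNs - llmSpan.startTimeNs) / 1e6) // → 2970 (ms)
```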
LLM tracing extends the general tracing model with AI-specific semantics. The OpenTelemetry project has defined Semantic Conventions for GenAI that standardize attribute names for model calls, including gen_ai.system (provider), gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Adopting these conventions ensures your traces are portable across observability backends and compatible with the growing ecosystem of AI monitoring tools.
The value of LLM tracing increases dramatically with pipeline complexity. A single LLM call with no tools or retrieval may not need tracing — a simple log suffices. But the moment you add RAG, tool use, conversation memory, guardrails, or multi-step agents, tracing becomes indispensable for understanding behavior, diagnosing failures, and optimizing cost and latency.
Trace Anatomy
Understanding the anatomy of a trace is essential for instrumenting your LLM application correctly and extracting maximum value from trace data. Let us walk through a concrete example: a customer support agent that retrieves knowledge base articles and answers user questions.
The trace tree for a single user message:
Trace: trace_id=abc123
└─ [root] handle_message (total: 3,420 ms, cost: $0.024)
├─ [span-1] classify_intent (180 ms, cost: $0.0003)
│ └─ [span-1a] llm.chat — gpt-4o-mini (175 ms)
│ attrs: input_tokens=120, output_tokens=8, cost=$0.0003
├─ [span-2] retrieve_context (210 ms, cost: $0.0001)
│ ├─ [span-2a] embed_query (45 ms)
│ │ attrs: model=text-embedding-3-small, tokens=18, cost=$0.0001
│ └─ [span-2b] vector_search (160 ms)
│ attrs: index=knowledge_base, top_k=5, results=5
├─ [span-3] generate_response (2,980 ms, cost: $0.0236)
│ └─ [span-3a] llm.chat — claude-3.5-sonnet (2,970 ms)
│ attrs: input_tokens=3200, output_tokens=620, cost=$0.0236
└─ [span-4] format_and_deliver (50 ms, cost: $0)
attrs: response_length=2480_chars, channel=web_chat

Key structural concepts illustrated:
- Parent-child relationships: The root span handle_message is the parent of classify_intent, retrieve_context, generate_response, and format_and_deliver. The retrieve_context span is itself the parent of embed_query and vector_search. This hierarchy mirrors the call stack of your application.
- Sequential vs. parallel spans: In this example, all child spans of the root are sequential — each starts after the previous one ends. But spans can also be parallel: if you queried two vector databases simultaneously, they would be sibling spans with overlapping time ranges.
- Attributes carry the business data: The raw timing data tells you how long each step took. The attributes tell you why: which model was used, how many tokens were consumed, what the cost was, and what parameters were configured. Attributes are the bridge between observability and cost analytics.
- Cost rolls up through the tree: The root span's total cost ($0.024) is the sum of all descendant span costs. CostHawk computes this roll-up automatically, so you can see cost at any level of aggregation: per-request (root span), per-stage (child spans), or per-operation (leaf spans).
- Timing reveals the critical path: The generate_response span accounts for 2,980 ms of the 3,420 ms total — 87% of latency. This is the critical path. Optimizing any other span has limited impact on total latency unless it enables reducing the critical path (e.g., retrieving less context to reduce input tokens for the generation step).
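The cost roll-up described above can be sketched as a recursive walk over the span tree. The SpanNode shape here is illustrative, not a real SDK type; spans without an LLM cost simply omit costUsd.

```typescript
// Recursive cost roll-up over a span tree (illustrative shape).
interface SpanNode {
  name: string
  costUsd?: number
  children: SpanNode[]
}

// Total cost of a span = its own cost plus the cost of all descendants.
function rollupCost(span: SpanNode): number {
  const own = span.costUsd ?? 0
  return span.children.reduce((sum, child) => sum + rollupCost(child), own)
}

// The handle_message trace above, reduced to its costed spans.
const exampleTrace: SpanNode = {
  name: 'handle_message',
  children: [
    { name: 'classify_intent', costUsd: 0.0003, children: [] },
    { name: 'retrieve_context', costUsd: 0.0001, children: [] },
    { name: 'generate_response', costUsd: 0.0236, children: [] },
    { name: 'format_and_deliver', costUsd: 0, children: [] },
  ],
}

console.log(`$${rollupCost(exampleTrace).toFixed(4)}`) // → $0.0240
```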
A well-instrumented trace answers three questions simultaneously: What happened? (the operation sequence), How long did it take? (span durations), and How much did it cost? (token and cost attributes). This trifecta is what makes tracing the foundation of LLM application observability.
Tracing for Cost Attribution
Tracing transforms cost monitoring from a blunt aggregate number into precise, actionable attribution. Without tracing, you know that your AI pipeline spent $4,200 yesterday. With tracing, you know that $1,680 (40%) went to RAG context assembly across 12,000 requests, $1,470 (35%) went to final response generation, $630 (15%) went to tool-call re-invocations in agentic workflows, and $420 (10%) went to intent classification. Each of these numbers maps to a specific span type in your trace data, enabling targeted optimization.
Cost attribution by span type:
Tag each span that involves an LLM call with llm.cost_usd computed from the model's pricing and the actual token counts reported in the API response. Then aggregate by span operation name to see cost distribution:
| Span Operation | Daily Requests | Avg Input Tokens | Avg Output Tokens | Model | Daily Cost | % of Total |
|---|---|---|---|---|---|---|
| classify_intent | 50,000 | 150 | 12 | GPT-4o mini | $4.73 | 2% |
| embed_query | 50,000 | 25 | n/a | text-embedding-3-small | $0.25 | 0.1% |
| generate_with_context | 50,000 | 4,200 | 650 | Claude 3.5 Sonnet | $1,112.50 | 85% |
| tool_call.search | 18,000 | 800 | 200 | Claude 3.5 Sonnet | $151.20 | 12% |
| summarize_result | 8,000 | 500 | 100 | GPT-4o mini | $0.60 | 0.05% |
This table immediately reveals that generate_with_context dominates cost at 85%, driven by the combination of a large input context (4,200 tokens from RAG retrieval) and an expensive model (Claude 3.5 Sonnet). The optimization path is clear: reduce context size (fewer retrieved documents, shorter document chunks), cap output tokens, or route to a cheaper model for simpler queries.
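A sketch of how such a table is produced: compute llm.cost_usd per span from token counts and per-million-token prices, then sum by operation name. The pricing table below is a placeholder (always look up current rates for your models), and the LlmSpan shape is illustrative.

```typescript
// Per-span cost from token counts, plus aggregation by operation name.
interface LlmSpan {
  opName: string
  model: string
  inputTokens: number
  outputTokens: number
}

// Placeholder pricing: USD per 1M input/output tokens.
const pricing: Record<string, { inPerM: number; outPerM: number }> = {
  'gpt-4o-mini': { inPerM: 0.15, outPerM: 0.6 },
  'claude-3-5-sonnet': { inPerM: 3.0, outPerM: 15.0 },
}

function spanCostUsd(span: LlmSpan): number {
  const p = pricing[span.model]
  if (!p) return 0 // unknown model: treat as free rather than crash
  return (span.inputTokens * p.inPerM + span.outputTokens * p.outPerM) / 1_000_000
}

function costByOperation(spans: LlmSpan[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const s of spans) {
    totals.set(s.opName, (totals.get(s.opName) ?? 0) + spanCostUsd(s))
  }
  return totals
}

const spans: LlmSpan[] = [
  { opName: 'classify_intent', model: 'gpt-4o-mini', inputTokens: 150, outputTokens: 12 },
  { opName: 'generate_with_context', model: 'claude-3-5-sonnet', inputTokens: 4200, outputTokens: 650 },
]

costByOperation(spans).forEach((usd, op) => {
  console.log(`${op}: $${usd.toFixed(4)}`)
})
```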
Cost attribution by user, customer, or feature:
By propagating business-context attributes through the trace (e.g., customer_id, feature_name, environment), you can aggregate costs along any business dimension. This enables:
- Customer-level chargeback: Enterprise B2B products can attribute exact AI costs to each customer, enabling usage-based pricing or identifying customers whose usage patterns are disproportionately expensive.
- Feature-level ROI: Compare the AI cost of each feature against its business value. If the "smart search" feature costs $800/day but drives $200/day in revenue, that is a clear signal to optimize or reconsider.
- Environment-level waste detection: Trace attributes revealing env=development or env=staging let you quantify non-production spend. Many teams discover that 20-40% of their AI API costs come from development and testing environments running against production-grade models.
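A minimal sketch of this kind of attribute propagation, assuming a hand-rolled context object: every span in a request inherits the trace-level business attributes. The helper names (newTraceContext, spanAttributes) are hypothetical; with OpenTelemetry you would use context and baggage propagation instead.

```typescript
// Business-context propagation: span attributes inherit trace attributes.
interface TraceContext {
  traceId: string
  attributes: Record<string, string>
}

let nextTraceId = 0

function newTraceContext(attrs: Record<string, string>): TraceContext {
  return { traceId: `trace-${++nextTraceId}`, attributes: { ...attrs } }
}

// Merge trace-level business attributes with span-local ones;
// span-local values win on conflict.
function spanAttributes(
  ctx: TraceContext,
  local: Record<string, string>,
): Record<string, string> {
  return { ...ctx.attributes, ...local }
}

const ctx = newTraceContext({
  customer_id: 'cust_8832',
  feature_name: 'smart_search',
  environment: 'production',
})

const attrs = spanAttributes(ctx, { 'gen_ai.request.model': 'claude-3-5-sonnet' })
console.log(attrs.customer_id, attrs.environment) // → cust_8832 production
```

Because every span carries customer_id, feature_name, and environment, cost can later be grouped along any of those dimensions.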
CostHawk ingests trace-level cost data through its API and MCP server integrations, automatically aggregating costs by span type, model, project tag, and API key. This means every dollar of spend is attributable to a specific step in a specific pipeline, enabling the kind of surgical cost optimization that aggregate monitoring cannot support.
Implementing Traces
There are three primary approaches to implementing tracing in LLM applications, each with different tradeoffs between standardization, ease of integration, and customization.
1. OpenTelemetry (OTel) — The Standards-Based Approach
OpenTelemetry is the CNCF-backed observability standard that provides vendor-neutral APIs for creating traces, spans, and metrics. For LLM tracing, OTel is the most future-proof choice because it decouples instrumentation from the backend — you can send traces to Jaeger, Grafana Tempo, Datadog, Honeycomb, or any OTel-compatible backend without changing your instrumentation code.
Implementation pattern in a Node.js/TypeScript application:
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('llm-pipeline')

async function handleRequest(userMessage: string) {
  return tracer.startActiveSpan('handle_request', async (rootSpan) => {
    try {
      // Child span for retrieval
      const context = await tracer.startActiveSpan('retrieve_context', async (span) => {
        const docs = await vectorSearch(userMessage) // app-provided helper
        span.setAttribute('retriever.doc_count', docs.length)
        span.setAttribute('retriever.total_chars', totalChars(docs)) // app-provided helper
        span.end()
        return docs
      })

      // Child span for the LLM call, using OTel GenAI attribute names
      const response = await tracer.startActiveSpan('llm.chat', async (span) => {
        const result = await anthropic.messages.create({
          model: 'claude-3-5-sonnet-20241022',
          max_tokens: 1024,
          messages: [{ role: 'user', content: buildPrompt(userMessage, context) }],
        })
        span.setAttribute('gen_ai.system', 'anthropic')
        span.setAttribute('gen_ai.request.model', 'claude-3-5-sonnet')
        span.setAttribute('gen_ai.usage.input_tokens', result.usage.input_tokens)
        span.setAttribute('gen_ai.usage.output_tokens', result.usage.output_tokens)
        span.setAttribute('llm.cost_usd', computeCost(result)) // app-provided pricing helper
        span.end()
        return result
      })

      // Note: OTel spans do not expose their children, so sumChildCosts is
      // an app-level helper that accumulates child costs as they are recorded.
      rootSpan.setAttribute('total_cost_usd', sumChildCosts(rootSpan))
      rootSpan.setStatus({ code: SpanStatusCode.OK })
      return response
    } catch (err) {
      rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message })
      throw err
    } finally {
      rootSpan.end()
    }
  })
}

2. Langfuse — The LLM-Native Approach
Langfuse is an open-source observability platform designed specifically for LLM applications. It provides a higher-level API than OTel with LLM-specific concepts built in: "generations" (LLM calls), "traces" (request-level containers), and "scores" (quality evaluations). The tradeoff is less portability but faster time-to-value for AI-specific use cases.
import { Langfuse } from 'langfuse'
const langfuse = new Langfuse({ publicKey: '...', secretKey: '...' })
async function handleRequest(userMessage: string) {
const trace = langfuse.trace({ name: 'chat_request', userId: user.id })
const retrieval = trace.span({ name: 'retrieve_context' })
const docs = await vectorSearch(userMessage)
retrieval.end({ metadata: { docCount: docs.length } })
const generation = trace.generation({
name: 'generate_response',
model: 'claude-3-5-sonnet',
input: buildPrompt(userMessage, docs),
})
const result = await anthropic.messages.create({ /* ... */ })
generation.end({
output: result.content,
usage: { inputTokens: result.usage.input_tokens,
outputTokens: result.usage.output_tokens },
})
await langfuse.flushAsync()
}

3. Custom Tracing — The Lightweight Approach
For teams that need tracing but want to avoid adding dependencies, a custom tracing implementation built on structured logging is viable. Record start/end timestamps and attributes for each operation, assign trace and span IDs, and emit structured JSON logs that can be queried in your existing log infrastructure.
The key advantage of custom tracing is full control over what is captured and how it is stored. The disadvantage is that you must build your own visualization and aggregation tooling. CostHawk can ingest custom trace data through its API, providing the visualization layer without requiring OTel or Langfuse instrumentation — just send span-level cost and latency data in the expected format.
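A minimal custom tracer might look like the following sketch: each operation is wrapped in a helper that records a structured span record. The field names mirror the span anatomy described earlier but are not a required format, and in production you would JSON.stringify each record to your logger rather than collect records in memory.

```typescript
// Minimal custom tracer built on structured span records.
interface LoggedSpan {
  trace_id: string
  span_id: string
  parent_span_id: string | null
  operation_name: string
  start_ms: number
  end_ms: number
  attributes: Record<string, unknown>
}

const spanLog: LoggedSpan[] = []
let spanCounter = 0

function withSpan<T>(
  traceId: string,
  parentSpanId: string | null,
  operationName: string,
  attributes: Record<string, unknown>,
  fn: () => T,
): T {
  const start = Date.now()
  try {
    return fn()
  } finally {
    // The span is recorded even if fn throws.
    spanLog.push({
      trace_id: traceId,
      span_id: `span-${++spanCounter}`,
      parent_span_id: parentSpanId,
      operation_name: operationName,
      start_ms: start,
      end_ms: Date.now(),
      attributes,
    })
  }
}

const answer = withSpan('trace-1', null, 'llm.chat',
  { 'gen_ai.usage.input_tokens': 120 }, () => 'hello')
console.log(answer, spanLog.length) // → hello 1
```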
Tracing Multi-Step Agents
Agentic AI systems — where an LLM autonomously decides which tools to call, iterates through multiple reasoning steps, and may spawn sub-agents — present the most complex and highest-value tracing challenge. A single user request to an agent might generate 5-15 LLM calls, each with different token counts and costs, interspersed with tool executions that have their own latency and failure modes. Without tracing, debugging a slow or expensive agent run is nearly impossible.
Trace structure for a typical ReAct agent:
Trace: trace_id=agent_001
└─ [root] agent.run (total: 18,400 ms, cost: $0.127)
├─ [step-1] agent.think (1,200 ms, cost: $0.008)
│ └─ llm.chat — claude-3.5-sonnet
│ input_tokens=1800, output_tokens=95
│ decision: call tool "search_knowledge_base"
├─ [step-2] tool.execute: search_knowledge_base (340 ms, cost: $0.0002)
│ attrs: query="refund policy premium accounts", results=3
├─ [step-3] agent.think (2,800 ms, cost: $0.021)
│ └─ llm.chat — claude-3.5-sonnet
│ input_tokens=4200, output_tokens=120
│ decision: call tool "lookup_customer"
├─ [step-4] tool.execute: lookup_customer (180 ms, cost: $0)
│ attrs: customer_id=cust_8832, found=true
├─ [step-5] agent.think (3,400 ms, cost: $0.034)
│ └─ llm.chat — claude-3.5-sonnet
│ input_tokens=5800, output_tokens=180
│ decision: call tool "process_refund"
├─ [step-6] tool.execute: process_refund (2,200 ms, cost: $0)
│ attrs: amount=$49.99, status=approved
└─ [step-7] agent.think (8,280 ms, cost: $0.064)
└─ llm.chat — claude-3.5-sonnet
input_tokens=7200, output_tokens=820
decision: respond to user (final answer)

Key observations from agent traces:
- Context accumulation drives cost growth. Each successive agent.think span has more input tokens than the previous one because the conversation history (including prior tool calls and results) grows with each step. Step 1 has 1,800 input tokens; step 7 has 7,200. This is the primary cost driver in agentic workflows — the later steps are processing all the context from earlier steps.
- The final step often dominates cost. Step 7 costs $0.064 — half the total trace cost — because it has the largest input context and generates the longest output (the final user-facing response). Optimizing the final step (e.g., summarizing earlier context before generating the response) has the highest cost impact.
- Tool execution latency varies dramatically. The search_knowledge_base tool takes 340 ms; process_refund takes 2,200 ms because it involves a synchronous API call to a payment processor. Without tracing, you might blame the LLM for the overall 18-second latency, when in fact 2.2 seconds is spent on a non-LLM operation.
- Loop detection becomes possible. Agents sometimes enter loops — calling the same tool repeatedly or oscillating between two tools. Trace data makes this visible: if you see five consecutive search_knowledge_base spans with similar queries and increasing input token counts, the agent is stuck. Set guardrails: maximum 10 steps per trace, maximum 3 calls to the same tool, maximum $0.50 per trace.
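Those guardrail thresholds can be sketched as a pre-step check. The thresholds below are the example values from the text, not recommended defaults, and the GuardrailState shape is illustrative.

```typescript
// Guardrail check run before each agent step.
interface GuardrailState {
  steps: number
  totalCostUsd: number
  toolCalls: Record<string, number> // calls per tool name so far
}

const MAX_STEPS = 10
const MAX_TRACE_COST_USD = 0.5
const MAX_CALLS_PER_TOOL = 3

// Returns a halt reason, or null if the agent may continue.
function checkGuardrails(state: GuardrailState, nextTool?: string): string | null {
  if (state.steps >= MAX_STEPS) return 'max steps reached'
  if (state.totalCostUsd >= MAX_TRACE_COST_USD) return 'trace cost budget exceeded'
  if (nextTool !== undefined && (state.toolCalls[nextTool] ?? 0) >= MAX_CALLS_PER_TOOL) {
    return `too many calls to ${nextTool}`
  }
  return null
}

const state: GuardrailState = {
  steps: 6,
  totalCostUsd: 0.52, // accumulated from llm.cost_usd span attributes
  toolCalls: { search_knowledge_base: 2 },
}
console.log(checkGuardrails(state)) // → trace cost budget exceeded
```

When the check returns a reason, force the agent to emit a final answer with the context it already has instead of starting another step.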
For multi-agent systems where one agent delegates to another, traces should use a linked trace or span link pattern: the sub-agent's trace is linked to the parent agent's trace via a reference, preserving the full causal chain while allowing each agent to manage its own trace lifecycle. CostHawk aggregates costs across linked traces, so a parent agent's total cost includes all sub-agent costs — critical for understanding the true expense of orchestrated workflows.
Tracing and CostHawk
CostHawk integrates tracing data into its cost attribution and optimization workflows, providing a unified view of cost, latency, and quality across all your LLM operations. Here is how tracing connects to the broader CostHawk platform.
Automatic tracing via wrapped keys:
When you route API calls through CostHawk wrapped keys, each request is automatically traced with a single span capturing the full round-trip: model, input tokens, output tokens, latency, TTFT (for streaming requests), and computed cost. No instrumentation code is required. This provides request-level granularity for simple pipelines where each user request maps to a single LLM call. The wrapped key acts as an instrumented proxy, recording all the data that a manual span would capture.
Rich tracing via MCP server sync:
For complex pipelines with multiple LLM calls, tool use, and agentic workflows, the CostHawk MCP server accepts structured trace data that includes parent-child span relationships and custom attributes. This enables the full trace tree visualization described in earlier sections: you can see the complete execution path of an agent run, with cost and latency attributed to each step. The MCP server normalizes trace data from different sources (OTel, Langfuse, custom) into CostHawk's internal format.
Trace-powered analytics:
With trace data flowing into CostHawk, several powerful analytics become available:
- Cost-per-step breakdown: See which pipeline stages consume the most budget. Identify that your RAG retrieval context assembly accounts for 40% of total input tokens and optimize accordingly.
- Latency-cost correlation: Discover that your slowest requests (P99 latency) are also your most expensive, because they involve the most agent steps and the largest accumulated contexts. This reveals that latency optimization and cost optimization are often the same effort.
- Step-level anomaly detection: CostHawk can alert when a specific span type's average cost or latency deviates from its baseline. If your retrieve_context span suddenly starts returning 10x more tokens (because someone changed the chunk size or top-k parameter), CostHawk flags the regression before it materially impacts your monthly bill.
- Trace sampling and drill-down: For high-volume systems generating millions of traces per day, CostHawk supports head-based and tail-based sampling. Tail-based sampling is particularly valuable: it preferentially retains expensive traces (above a cost threshold) and high-latency traces (above a P99 threshold), ensuring that the most interesting and actionable traces are always available for investigation.
Trace data in the CostHawk dashboard:
The dashboard's usage analytics page surfaces trace-derived insights alongside aggregate cost and usage data. When you click into a specific time period showing elevated costs, you can see the individual traces that contributed to the spike, drill into their span trees, and identify the root cause. This closed-loop workflow — from aggregate anomaly to individual trace to specific span to optimization action — is the core value proposition of integrated tracing. You do not need a separate observability platform for traces and a separate platform for cost analytics; CostHawk unifies both into a single investigation flow.
Getting started:
If you are already using CostHawk wrapped keys, you have request-level tracing out of the box. To add multi-span tracing for complex pipelines, integrate the CostHawk MCP server into your application and emit span data from your pipeline stages. The MCP server accepts spans in OpenTelemetry-compatible format, so if you have existing OTel instrumentation, you can add CostHawk as an additional exporter with a single configuration change. Start by tracing your highest-cost pipeline — the one that accounts for the largest share of your monthly bill — and use the resulting data to identify your first optimization target.
Frequently Asked Questions
What is the difference between tracing and logging for LLM applications?
How much overhead does tracing add to my LLM application?
Should I trace every request or use sampling?
How do I trace across multiple services in a microservices architecture?

Propagate trace context across service boundaries so that all services contribute spans to the same trace. The W3C Trace Context standard defines the traceparent header (format: 00-{trace_id}-{parent_span_id}-{flags}) for this purpose, and OpenTelemetry implements it automatically. When your API gateway receives a user request, it creates a root span and a trace ID. When it calls your retrieval service, the trace ID and current span ID are propagated in the traceparent header. The retrieval service creates a child span under the propagated parent. When the retrieval service calls the LLM API, it creates another child span. All spans share the same trace ID, forming a complete tree across service boundaries. For message queues (Kafka, SQS, RabbitMQ), propagate trace context in message attributes. For serverless functions, propagate via the event payload. The key is ensuring no service in the chain drops the context — one missing link breaks the trace tree and fragments your cost attribution.

What attributes should I record on LLM trace spans?
At minimum, record gen_ai.system (the provider: openai, anthropic, google), gen_ai.request.model (the specific model name), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and llm.cost_usd (computed from token counts and model pricing). These five attributes enable cost attribution, model comparison, and usage analytics. Beyond the minimum, highly valuable attributes include: gen_ai.request.max_tokens (to detect uncapped requests), gen_ai.request.temperature (to correlate quality settings with cost), llm.ttft_ms (time-to-first-token for streaming requests), project or feature (business attribution tags), customer_id (for multi-tenant chargeback), and environment (dev/staging/prod). For tool-call spans, record tool.name, tool.status (success/error), and tool.duration_ms. For retrieval spans, record retriever.source, retriever.doc_count, and retriever.total_tokens (the token count of retrieved context). Follow the OpenTelemetry GenAI semantic conventions for attribute naming to ensure compatibility across tools.

How does tracing help with debugging failed or hallucinated LLM responses?
Can I use tracing to implement cost guardrails for agentic workflows?
Yes. As each agent step's span completes, read its llm.cost_usd attribute and add it to a running total stored on the root span. Before initiating the next agent step, check whether the running total exceeds your per-request budget (e.g., $0.50). If it does, force the agent to emit a final response with whatever context it has accumulated, rather than continuing to iterate. This pattern prevents runaway agent loops — where the model keeps calling tools and accumulating context indefinitely — from generating unbounded costs. You can also set guardrails on step count (maximum 10 LLM calls per trace), individual span cost (no single span above $0.10), and total input token count (halt if accumulated context exceeds 50,000 tokens). CostHawk's alerting can notify you when traces exceed cost thresholds, enabling after-the-fact analysis even if your real-time guardrails did not catch every edge case.

What is the relationship between OpenTelemetry and LLM tracing?
OpenTelemetry provides the general-purpose tracing standard, and its GenAI semantic conventions extend it with LLM-specific attribute names: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and others. By using OTel for your LLM tracing, you gain several advantages: your traces are portable across any OTel-compatible backend (Jaeger, Grafana Tempo, Datadog, Honeycomb, CostHawk), your LLM spans integrate seamlessly with existing infrastructure traces (HTTP requests, database queries, cache lookups), and you benefit from the OTel ecosystem of auto-instrumentation libraries, SDKs in every major language, and the OpenTelemetry Collector for data routing and transformation. The alternative LLM-specific tools like Langfuse and LangSmith offer faster initial setup for AI-only use cases, but OTel provides the most comprehensive and future-proof foundation for organizations that want unified observability across their entire stack.

Related Terms
Spans
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
OpenTelemetry
An open-source observability framework providing a vendor-neutral standard (OTLP) for collecting traces, metrics, and logs from distributed systems. OpenTelemetry is rapidly becoming the standard instrumentation layer for LLM applications, enabling teams to track latency, token usage, cost, and quality across every inference call.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
Agentic AI
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
