Tracing

What is Tracing?

The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.

Why It Matters for AI Costs
Modern LLM applications are not simple request-response systems. A single user interaction may trigger multiple model calls, vector database lookups, tool executions, and conditional branching — all of which consume time and money. Without tracing, debugging and optimizing these pipelines is like diagnosing a performance problem in a microservices architecture by looking at aggregate metrics alone: you can see that something is slow or expensive, but you cannot pinpoint where or why.
Consider a typical RAG (Retrieval-Augmented Generation) chatbot pipeline:
- Embedding generation — 50 ms, $0.0001 (embed the user query)
- Vector search — 120 ms, $0 (query Pinecone/Weaviate for relevant documents)
- Context assembly — 5 ms, $0 (format retrieved documents into a prompt)
- LLM inference — 2,200 ms, $0.018 (Claude 3.5 Sonnet, 3,000 input + 800 output tokens)
- Response formatting — 2 ms, $0 (parse and sanitize the response)
Total: 2,377 ms, $0.0181. Without tracing, all you see is a 2.4-second request that cost $0.018. With tracing, you see that 92% of latency and 99% of cost come from step 4. If you want to reduce latency, optimizing steps 1-3 is pointless — you need to reduce context size, switch to a faster model, or cap output length. If you want to reduce cost, the embedding call is negligible; only the LLM call matters.
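The arithmetic above can be reproduced in a few lines. This is a sketch, not a required schema: the step names are shortened labels, and the millisecond and dollar figures are copied from the pipeline example.

```typescript
// Computing each step's share of total latency and cost for the RAG
// pipeline above. Figures copied from the text; names are illustrative.
const steps = [
  { name: 'embedding', ms: 50, usd: 0.0001 },
  { name: 'vector_search', ms: 120, usd: 0 },
  { name: 'context_assembly', ms: 5, usd: 0 },
  { name: 'llm_inference', ms: 2200, usd: 0.018 },
  { name: 'response_formatting', ms: 2, usd: 0 },
]

const totalMs = steps.reduce((sum, s) => sum + s.ms, 0)   // 2377
const totalUsd = steps.reduce((sum, s) => sum + s.usd, 0) // ~0.0181

for (const s of steps) {
  const latPct = ((s.ms / totalMs) * 100).toFixed(1)
  const costPct = ((s.usd / totalUsd) * 100).toFixed(1)
  console.log(`${s.name}: ${latPct}% of latency, ${costPct}% of cost`)
}
```

Running this shows the LLM inference step dominating both dimensions, which is exactly the insight tracing makes routine.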
Now multiply this by the complexity of real production systems. Agentic workflows may chain 5-15 model calls with tool use in between. A single user request might cost $0.50 and take 30 seconds, with costs distributed unevenly across steps. Tracing is the only way to identify which steps are consuming disproportionate resources and prioritize optimization efforts accordingly. CostHawk integrates trace data with cost attribution, so each span in a trace shows not just its duration but its exact token consumption and dollar cost.
What is LLM Tracing?
LLM tracing adapts the concept of distributed tracing — originally developed for microservices architectures — to the specific needs of AI application observability. In traditional distributed tracing (popularized by Google's Dapper paper and implemented in tools like Jaeger and Zipkin), a trace follows an HTTP request as it traverses multiple services, recording a span for each service hop. In LLM tracing, the "services" are the stages of an AI pipeline: prompt construction, model inference, tool execution, memory retrieval, and response processing.
A trace is the top-level container representing a single end-to-end operation — typically one user request or one pipeline invocation. Each trace has a unique trace_id (a 128-bit identifier, rendered as 32 hex characters in the W3C Trace Context format) that correlates all spans belonging to that operation.
A span is a single timed operation within a trace. Spans have:
- span_id: Unique identifier for this span.
- parent_span_id: The ID of the parent span, forming the tree structure. The root span has no parent.
- operation_name: Human-readable label (e.g., "llm.chat", "retriever.query", "tool.execute").
- start_time and end_time: Nanosecond-precision timestamps.
- attributes: Key-value pairs carrying contextual data. For LLM spans, common attributes include llm.model, llm.input_tokens, llm.output_tokens, llm.cost_usd, llm.temperature, and llm.max_tokens.
- status: Success, error, or unset.
- events: Point-in-time annotations within the span (e.g., "first token received" at timestamp T).
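As a sketch, these span fields can be modeled as a plain TypeScript type. This is simplified for illustration: real SDKs such as OpenTelemetry use richer types, production timestamps can exceed Number precision (BigInt would be needed there), and the example event timestamp is invented.

```typescript
// A span modeled as a plain type mirroring the field list above.
type SpanStatus = 'ok' | 'error' | 'unset'

interface Span {
  spanId: string
  parentSpanId: string | null // null for the root span
  operationName: string       // e.g. 'llm.chat', 'retriever.query'
  startTimeNs: number         // nanosecond-precision timestamps
  endTimeNs: number
  attributes: Record<string, string | number | boolean>
  status: SpanStatus
  events: { name: string; timestampNs: number }[]
}

// Example: an LLM-call span (event timestamp is illustrative).
const llmSpan: Span = {
  spanId: 'span-3a',
  parentSpanId: 'span-3',
  operationName: 'llm.chat',
  startTimeNs: 0,
  endTimeNs: 2_970_000_000,
  attributes: {
    'llm.model': 'claude-3-5-sonnet',
    'llm.input_tokens': 3200,
    'llm.output_tokens': 620,
    'llm.cost_usd': 0.0236,
  },
  status: 'ok',
  events: [{ name: 'first token received', timestampNs: 480_000_000 }],
}

console.log((llmSpan.endTimeNs - llmSpan.startTimeNs) / 1e6) // → 2970 (ms)
```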
LLM tracing extends the general tracing model with AI-specific semantics. The OpenTelemetry project has defined Semantic Conventions for GenAI that standardize attribute names for model calls, including gen_ai.system (provider), gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. Adopting these conventions ensures your traces are portable across observability backends and compatible with the growing ecosystem of AI monitoring tools.
The value of LLM tracing increases dramatically with pipeline complexity. A single LLM call with no tools or retrieval may not need tracing — a simple log suffices. But the moment you add RAG, tool use, conversation memory, guardrails, or multi-step agents, tracing becomes indispensable for understanding behavior, diagnosing failures, and optimizing cost and latency.
Trace Anatomy
Understanding the anatomy of a trace is essential for instrumenting your LLM application correctly and extracting maximum value from trace data. Let us walk through a concrete example: a customer support agent that retrieves knowledge base articles and answers user questions.
The trace tree for a single user message:
Trace: trace_id=abc123
└─ [root] handle_message (total: 3,420 ms, cost: $0.024)
├─ [span-1] classify_intent (180 ms, cost: $0.0003)
│ └─ [span-1a] llm.chat — gpt-4o-mini (175 ms)
│ attrs: input_tokens=120, output_tokens=8, cost=$0.0003
├─ [span-2] retrieve_context (210 ms, cost: $0.0001)
│ ├─ [span-2a] embed_query (45 ms)
│ │ attrs: model=text-embedding-3-small, tokens=18, cost=$0.0001
│ └─ [span-2b] vector_search (160 ms)
│ attrs: index=knowledge_base, top_k=5, results=5
├─ [span-3] generate_response (2,980 ms, cost: $0.0236)
│ └─ [span-3a] llm.chat — claude-3.5-sonnet (2,970 ms)
│ attrs: input_tokens=3200, output_tokens=620, cost=$0.0236
└─ [span-4] format_and_deliver (50 ms, cost: $0)
attrs: response_length=2480_chars, channel=web_chat

Key structural concepts illustrated:
- Parent-child relationships: The root span handle_message is the parent of classify_intent, retrieve_context, generate_response, and format_and_deliver. The retrieve_context span is itself the parent of embed_query and vector_search. This hierarchy mirrors the call stack of your application.
- Sequential vs. parallel spans: In this example, all child spans of the root are sequential — each starts after the previous one ends. But spans can also be parallel: if you queried two vector databases simultaneously, they would be sibling spans with overlapping time ranges.
- Attributes carry the business data: The raw timing data tells you how long each step took. The attributes tell you why: which model was used, how many tokens were consumed, what the cost was, and what parameters were configured. Attributes are the bridge between observability and cost analytics.
- Cost rolls up through the tree: The root span's total cost ($0.024) is the sum of all descendant span costs. CostHawk computes this roll-up automatically, so you can see cost at any level of aggregation: per-request (root span), per-stage (child spans), or per-operation (leaf spans).
- Timing reveals the critical path: The generate_response span accounts for 2,980 ms of the 3,420 ms total — 87% of latency. This is the critical path. Optimizing any other span has limited impact on total latency unless it enables reducing the critical path (e.g., retrieving less context to reduce input tokens for the generation step).
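The cost roll-up described above can be sketched as a recursive walk over the span tree. The SpanNode shape here is illustrative, not a real SDK type; spans without an LLM cost simply omit costUsd.

```typescript
// Recursive cost roll-up over a span tree (illustrative shape).
interface SpanNode {
  name: string
  costUsd?: number
  children: SpanNode[]
}

// Total cost of a span = its own cost plus the cost of all descendants.
function rollupCost(span: SpanNode): number {
  const own = span.costUsd ?? 0
  return span.children.reduce((sum, child) => sum + rollupCost(child), own)
}

// The handle_message trace above, reduced to its costed spans.
const exampleTrace: SpanNode = {
  name: 'handle_message',
  children: [
    { name: 'classify_intent', costUsd: 0.0003, children: [] },
    { name: 'retrieve_context', costUsd: 0.0001, children: [] },
    { name: 'generate_response', costUsd: 0.0236, children: [] },
    { name: 'format_and_deliver', costUsd: 0, children: [] },
  ],
}

console.log(`$${rollupCost(exampleTrace).toFixed(4)}`) // → $0.0240
```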
A well-instrumented trace answers three questions simultaneously: What happened? (the operation sequence), How long did it take? (span durations), and How much did it cost? (token and cost attributes). This trifecta is what makes tracing the foundation of LLM application observability.
Tracing for Cost Attribution
Tracing transforms cost monitoring from a blunt aggregate number into precise, actionable attribution. Without tracing, you know that your AI pipeline spent $4,200 yesterday. With tracing, you know that $1,680 (40%) went to RAG context assembly across 12,000 requests, $1,470 (35%) went to final response generation, $630 (15%) went to tool-call re-invocations in agentic workflows, and $420 (10%) went to intent classification. Each of these numbers maps to a specific span type in your trace data, enabling targeted optimization.
Cost attribution by span type:
Tag each span that involves an LLM call with llm.cost_usd computed from the model's pricing and the actual token counts reported in the API response. Then aggregate by span operation name to see cost distribution:
| Span Operation | Daily Requests | Avg Input Tokens | Avg Output Tokens | Model | Daily Cost | % of Total |
|---|---|---|---|---|---|---|
| classify_intent | 50,000 | 150 | 12 | GPT-4o mini | $4.73 | 2% |
| embed_query | 50,000 | 25 | n/a | text-embedding-3-small | $0.25 | 0.1% |
| generate_with_context | 50,000 | 4,200 | 650 | Claude 3.5 Sonnet | $1,112.50 | 85% |
| tool_call.search | 18,000 | 800 | 200 | Claude 3.5 Sonnet | $151.20 | 12% |
| summarize_result | 8,000 | 500 | 100 | GPT-4o mini | $0.60 | 0.05% |
This table immediately reveals that generate_with_context dominates cost at 85%, driven by the combination of a large input context (4,200 tokens from RAG retrieval) and an expensive model (Claude 3.5 Sonnet). The optimization path is clear: reduce context size (fewer retrieved documents, shorter document chunks), cap output tokens, or route to a cheaper model for simpler queries.
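A sketch of how such a table is produced: compute llm.cost_usd per span from token counts and per-million-token prices, then sum by operation name. The pricing table below is a placeholder (always look up current rates for your models), and the LlmSpan shape is illustrative.

```typescript
// Per-span cost from token counts, plus aggregation by operation name.
interface LlmSpan {
  opName: string
  model: string
  inputTokens: number
  outputTokens: number
}

// Placeholder pricing: USD per 1M input/output tokens.
const pricing: Record<string, { inPerM: number; outPerM: number }> = {
  'gpt-4o-mini': { inPerM: 0.15, outPerM: 0.6 },
  'claude-3-5-sonnet': { inPerM: 3.0, outPerM: 15.0 },
}

function spanCostUsd(span: LlmSpan): number {
  const p = pricing[span.model]
  if (!p) return 0 // unknown model: treat as free rather than crash
  return (span.inputTokens * p.inPerM + span.outputTokens * p.outPerM) / 1_000_000
}

function costByOperation(spans: LlmSpan[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const s of spans) {
    totals.set(s.opName, (totals.get(s.opName) ?? 0) + spanCostUsd(s))
  }
  return totals
}

const spans: LlmSpan[] = [
  { opName: 'classify_intent', model: 'gpt-4o-mini', inputTokens: 150, outputTokens: 12 },
  { opName: 'generate_with_context', model: 'claude-3-5-sonnet', inputTokens: 4200, outputTokens: 650 },
]

costByOperation(spans).forEach((usd, op) => {
  console.log(`${op}: $${usd.toFixed(4)}`)
})
```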
Cost attribution by user, customer, or feature:
By propagating business-context attributes through the trace (e.g., customer_id, feature_name, environment), you can aggregate costs along any business dimension. This enables:
- Customer-level chargeback: Enterprise B2B products can attribute exact AI costs to each customer, enabling usage-based pricing or identifying customers whose usage patterns are disproportionately expensive.
- Feature-level ROI: Compare the AI cost of each feature against its business value. If the "smart search" feature costs $800/day but drives $200/day in revenue, that is a clear signal to optimize or reconsider.
- Environment-level waste detection: Trace attributes revealing env=development or env=staging let you quantify non-production spend. Many teams discover that 20-40% of their AI API costs come from development and testing environments running against production-grade models.
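A minimal sketch of this kind of attribute propagation, assuming a hand-rolled context object: every span in a request inherits the trace-level business attributes. The helper names (newTraceContext, spanAttributes) are hypothetical; with OpenTelemetry you would use context and baggage propagation instead.

```typescript
// Business-context propagation: span attributes inherit trace attributes.
interface TraceContext {
  traceId: string
  attributes: Record<string, string>
}

let nextTraceId = 0

function newTraceContext(attrs: Record<string, string>): TraceContext {
  return { traceId: `trace-${++nextTraceId}`, attributes: { ...attrs } }
}

// Merge trace-level business attributes with span-local ones;
// span-local values win on conflict.
function spanAttributes(
  ctx: TraceContext,
  local: Record<string, string>,
): Record<string, string> {
  return { ...ctx.attributes, ...local }
}

const ctx = newTraceContext({
  customer_id: 'cust_8832',
  feature_name: 'smart_search',
  environment: 'production',
})

const attrs = spanAttributes(ctx, { 'gen_ai.request.model': 'claude-3-5-sonnet' })
console.log(attrs.customer_id, attrs.environment) // → cust_8832 production
```

Because every span carries customer_id, feature_name, and environment, cost can later be grouped along any of those dimensions.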
CostHawk ingests trace-level cost data through its API and MCP server integrations, automatically aggregating costs by span type, model, project tag, and API key. This means every dollar of spend is attributable to a specific step in a specific pipeline, enabling the kind of surgical cost optimization that aggregate monitoring cannot support.
Implementing Traces
There are three primary approaches to implementing tracing in LLM applications, each with different tradeoffs between standardization, ease of integration, and customization.
1. OpenTelemetry (OTel) — The Standards-Based Approach
OpenTelemetry is the CNCF-backed observability standard that provides vendor-neutral APIs for creating traces, spans, and metrics. For LLM tracing, OTel is the most future-proof choice because it decouples instrumentation from the backend — you can send traces to Jaeger, Grafana Tempo, Datadog, Honeycomb, or any OTel-compatible backend without changing your instrumentation code.
Implementation pattern in a Node.js/TypeScript application:
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('llm-pipeline')

async function handleRequest(userMessage: string) {
  return tracer.startActiveSpan('handle_request', async (rootSpan) => {
    try {
      // Child span for retrieval
      const context = await tracer.startActiveSpan('retrieve_context', async (span) => {
        const docs = await vectorSearch(userMessage) // app-provided helper
        span.setAttribute('retriever.doc_count', docs.length)
        span.setAttribute('retriever.total_chars', totalChars(docs)) // app-provided helper
        span.end()
        return docs
      })

      // Child span for the LLM call, using OTel GenAI attribute names
      const response = await tracer.startActiveSpan('llm.chat', async (span) => {
        const result = await anthropic.messages.create({
          model: 'claude-3-5-sonnet-20241022',
          max_tokens: 1024,
          messages: [{ role: 'user', content: buildPrompt(userMessage, context) }],
        })
        span.setAttribute('gen_ai.system', 'anthropic')
        span.setAttribute('gen_ai.request.model', 'claude-3-5-sonnet')
        span.setAttribute('gen_ai.usage.input_tokens', result.usage.input_tokens)
        span.setAttribute('gen_ai.usage.output_tokens', result.usage.output_tokens)
        span.setAttribute('llm.cost_usd', computeCost(result)) // app-provided pricing helper
        span.end()
        return result
      })

      // Note: OTel spans do not expose their children, so sumChildCosts is
      // an app-level helper that accumulates child costs as they are recorded.
      rootSpan.setAttribute('total_cost_usd', sumChildCosts(rootSpan))
      rootSpan.setStatus({ code: SpanStatusCode.OK })
      return response
    } catch (err) {
      rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message })
      throw err
    } finally {
      rootSpan.end()
    }
  })
}

2. Langfuse — The LLM-Native Approach
Langfuse is an open-source observability platform designed specifically for LLM applications. It provides a higher-level API than OTel with LLM-specific concepts built in: "generations" (LLM calls), "traces" (request-level containers), and "scores" (quality evaluations). The tradeoff is less portability but faster time-to-value for AI-specific use cases.
import { Langfuse } from 'langfuse'
const langfuse = new Langfuse({ publicKey: '...', secretKey: '...' })
async function handleRequest(userMessage: string) {
const trace = langfuse.trace({ name: 'chat_request', userId: user.id })
const retrieval = trace.span({ name: 'retrieve_context' })
const docs = await vectorSearch(userMessage)
retrieval.end({ metadata: { docCount: docs.length } })
const generation = trace.generation({
name: 'generate_response',
model: 'claude-3-5-sonnet',
input: buildPrompt(userMessage, docs),
})
const result = await anthropic.messages.create({ /* ... */ })
generation.end({
output: result.content,
usage: { inputTokens: result.usage.input_tokens,
outputTokens: result.usage.output_tokens },
})
await langfuse.flushAsync()
}

3. Custom Tracing — The Lightweight Approach
For teams that need tracing but want to avoid adding dependencies, a custom tracing implementation built on structured logging is viable. Record start/end timestamps and attributes for each operation, assign trace and span IDs, and emit structured JSON logs that can be queried in your existing log infrastructure.
The key advantage of custom tracing is full control over what is captured and how it is stored. The disadvantage is that you must build your own visualization and aggregation tooling. CostHawk can ingest custom trace data through its API, providing the visualization layer without requiring OTel or Langfuse instrumentation — just send span-level cost and latency data in the expected format.
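A minimal custom tracer might look like the following sketch: each operation is wrapped in a helper that records a structured span record. The field names mirror the span anatomy described earlier but are not a required format, and in production you would JSON.stringify each record to your logger rather than collect records in memory.

```typescript
// Minimal custom tracer built on structured span records.
interface LoggedSpan {
  trace_id: string
  span_id: string
  parent_span_id: string | null
  operation_name: string
  start_ms: number
  end_ms: number
  attributes: Record<string, unknown>
}

const spanLog: LoggedSpan[] = []
let spanCounter = 0

function withSpan<T>(
  traceId: string,
  parentSpanId: string | null,
  operationName: string,
  attributes: Record<string, unknown>,
  fn: () => T,
): T {
  const start = Date.now()
  try {
    return fn()
  } finally {
    // The span is recorded even if fn throws.
    spanLog.push({
      trace_id: traceId,
      span_id: `span-${++spanCounter}`,
      parent_span_id: parentSpanId,
      operation_name: operationName,
      start_ms: start,
      end_ms: Date.now(),
      attributes,
    })
  }
}

const answer = withSpan('trace-1', null, 'llm.chat',
  { 'gen_ai.usage.input_tokens': 120 }, () => 'hello')
console.log(answer, spanLog.length) // → hello 1
```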
Tracing Multi-Step Agents
Agentic AI systems — where an LLM autonomously decides which tools to call, iterates through multiple reasoning steps, and may spawn sub-agents — present the most complex and highest-value tracing challenge. A single user request to an agent might generate 5-15 LLM calls, each with different token counts and costs, interspersed with tool executions that have their own latency and failure modes. Without tracing, debugging a slow or expensive agent run is nearly impossible.
Trace structure for a typical ReAct agent:
Trace: trace_id=agent_001
└─ [root] agent.run (total: 18,400 ms, cost: $0.127)
├─ [step-1] agent.think (1,200 ms, cost: $0.008)
│ └─ llm.chat — claude-3.5-sonnet
│ input_tokens=1800, output_tokens=95
│ decision: call tool "search_knowledge_base"
├─ [step-2] tool.execute: search_knowledge_base (340 ms, cost: $0.0002)
│ attrs: query="refund policy premium accounts", results=3
├─ [step-3] agent.think (2,800 ms, cost: $0.021)
│ └─ llm.chat — claude-3.5-sonnet
│ input_tokens=4200, output_tokens=120
│ decision: call tool "lookup_customer"
├─ [step-4] tool.execute: lookup_customer (180 ms, cost: $0)
│ attrs: customer_id=cust_8832, found=true
├─ [step-5] agent.think (3,400 ms, cost: $0.034)
│ └─ llm.chat — claude-3.5-sonnet
│ input_tokens=5800, output_tokens=180
│ decision: call tool "process_refund"
├─ [step-6] tool.execute: process_refund (2,200 ms, cost: $0)
│ attrs: amount=$49.99, status=approved
└─ [step-7] agent.think (8,280 ms, cost: $0.064)
└─ llm.chat — claude-3.5-sonnet
input_tokens=7200, output_tokens=820
decision: respond to user (final answer)

Key observations from agent traces:
- Context accumulation drives cost growth. Each successive agent.think span has more input tokens than the previous one because the conversation history (including prior tool calls and results) grows with each step. Step 1 has 1,800 input tokens; step 7 has 7,200. This is the primary cost driver in agentic workflows — the later steps are processing all the context from earlier steps.
- The final step often dominates cost. Step 7 costs $0.064 — half the total trace cost — because it has the largest input context and generates the longest output (the final user-facing response). Optimizing the final step (e.g., summarizing earlier context before generating the response) has the highest cost impact.
- Tool execution latency varies dramatically. The search_knowledge_base tool takes 340 ms; process_refund takes 2,200 ms because it involves a synchronous API call to a payment processor. Without tracing, you might blame the LLM for the overall 18-second latency, when in fact 2.2 seconds is spent on a non-LLM operation.
- Loop detection becomes possible. Agents sometimes enter loops — calling the same tool repeatedly or oscillating between two tools. Trace data makes this visible: if you see five consecutive search_knowledge_base spans with similar queries and increasing input token counts, the agent is stuck. Set guardrails: maximum 10 steps per trace, maximum 3 calls to the same tool, maximum $0.50 per trace.
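Those guardrail thresholds can be sketched as a pre-step check. The thresholds below are the example values from the text, not recommended defaults, and the GuardrailState shape is illustrative.

```typescript
// Guardrail check run before each agent step.
interface GuardrailState {
  steps: number
  totalCostUsd: number
  toolCalls: Record<string, number> // calls per tool name so far
}

const MAX_STEPS = 10
const MAX_TRACE_COST_USD = 0.5
const MAX_CALLS_PER_TOOL = 3

// Returns a halt reason, or null if the agent may continue.
function checkGuardrails(state: GuardrailState, nextTool?: string): string | null {
  if (state.steps >= MAX_STEPS) return 'max steps reached'
  if (state.totalCostUsd >= MAX_TRACE_COST_USD) return 'trace cost budget exceeded'
  if (nextTool !== undefined && (state.toolCalls[nextTool] ?? 0) >= MAX_CALLS_PER_TOOL) {
    return `too many calls to ${nextTool}`
  }
  return null
}

const state: GuardrailState = {
  steps: 6,
  totalCostUsd: 0.52, // accumulated from llm.cost_usd span attributes
  toolCalls: { search_knowledge_base: 2 },
}
console.log(checkGuardrails(state)) // → trace cost budget exceeded
```

When the check returns a reason, force the agent to emit a final answer with the context it already has instead of starting another step.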
For multi-agent systems where one agent delegates to another, traces should use a linked trace or span link pattern: the sub-agent's trace is linked to the parent agent's trace via a reference, preserving the full causal chain while allowing each agent to manage its own trace lifecycle. CostHawk aggregates costs across linked traces, so a parent agent's total cost includes all sub-agent costs — critical for understanding the true expense of orchestrated workflows.
Tracing and CostHawk
CostHawk integrates tracing data into its cost attribution and optimization workflows, providing a unified view of cost, latency, and quality across all your LLM operations. Here is how tracing connects to the broader CostHawk platform.
Automatic tracing via wrapped keys:
When you route API calls through CostHawk wrapped keys, each request is automatically traced with a single span capturing the full round-trip: model, input tokens, output tokens, latency, TTFT (for streaming requests), and computed cost. No instrumentation code is required. This provides request-level granularity for simple pipelines where each user request maps to a single LLM call. The wrapped key acts as an instrumented proxy, recording all the data that a manual span would capture.
Rich tracing via MCP server sync:
For complex pipelines with multiple LLM calls, tool use, and agentic workflows, the CostHawk MCP server accepts structured trace data that includes parent-child span relationships and custom attributes. This enables the full trace tree visualization described in earlier sections: you can see the complete execution path of an agent run, with cost and latency attributed to each step. The MCP server normalizes trace data from different sources (OTel, Langfuse, custom) into CostHawk's internal format.
Trace-powered analytics:
With trace data flowing into CostHawk, several powerful analytics become available:
- Cost-per-step breakdown: See which pipeline stages consume the most budget. Identify that your RAG retrieval context assembly accounts for 40% of total input tokens and optimize accordingly.
- Latency-cost correlation: Discover that your slowest requests (P99 latency) are also your most expensive, because they involve the most agent steps and the largest accumulated contexts. This reveals that latency optimization and cost optimization are often the same effort.
- Step-level anomaly detection: CostHawk can alert when a specific span type's average cost or latency deviates from its baseline. If your retrieve_context span suddenly starts returning 10x more tokens (because someone changed the chunk size or top-k parameter), CostHawk flags the regression before it materially impacts your monthly bill.
- Trace sampling and drill-down: For high-volume systems generating millions of traces per day, CostHawk supports head-based and tail-based sampling. Tail-based sampling is particularly valuable: it preferentially retains expensive traces (above a cost threshold) and high-latency traces (above a P99 threshold), ensuring that the most interesting and actionable traces are always available for investigation.
Trace data in the CostHawk dashboard:
The dashboard's usage analytics page surfaces trace-derived insights alongside aggregate cost and usage data. When you click into a specific time period showing elevated costs, you can see the individual traces that contributed to the spike, drill into their span trees, and identify the root cause. This closed-loop workflow — from aggregate anomaly to individual trace to specific span to optimization action — is the core value proposition of integrated tracing. You do not need a separate observability platform for traces and a separate platform for cost analytics; CostHawk unifies both into a single investigation flow.
Getting started:
If you are already using CostHawk wrapped keys, you have request-level tracing out of the box. To add multi-span tracing for complex pipelines, integrate the CostHawk MCP server into your application and emit span data from your pipeline stages. The MCP server accepts spans in OpenTelemetry-compatible format, so if you have existing OTel instrumentation, you can add CostHawk as an additional exporter with a single configuration change. Start by tracing your highest-cost pipeline — the one that accounts for the largest share of your monthly bill — and use the resulting data to identify your first optimization target.
Frequently Asked Questions
What is the difference between tracing and logging for LLM applications?
How much overhead does tracing add to my LLM application?
Should I trace every request or use sampling?
How do I trace across multiple services in a microservices architecture?

Propagate trace context across service boundaries so that all services contribute spans to the same trace. The W3C Trace Context standard defines the traceparent header (format: 00-{trace_id}-{parent_span_id}-{flags}) for this purpose, and OpenTelemetry implements it automatically. When your API gateway receives a user request, it creates a root span and a trace ID. When it calls your retrieval service, the trace ID and current span ID are propagated in the traceparent header. The retrieval service creates a child span under the propagated parent. When the retrieval service calls the LLM API, it creates another child span. All spans share the same trace ID, forming a complete tree across service boundaries. For message queues (Kafka, SQS, RabbitMQ), propagate trace context in message attributes. For serverless functions, propagate via the event payload. The key is ensuring no service in the chain drops the context — one missing link breaks the trace tree and fragments your cost attribution.

What attributes should I record on LLM trace spans?
At minimum, record gen_ai.system (the provider: openai, anthropic, google), gen_ai.request.model (the specific model name), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and llm.cost_usd (computed from token counts and model pricing). These five attributes enable cost attribution, model comparison, and usage analytics. Beyond the minimum, highly valuable attributes include: gen_ai.request.max_tokens (to detect uncapped requests), gen_ai.request.temperature (to correlate quality settings with cost), llm.ttft_ms (time-to-first-token for streaming requests), project or feature (business attribution tags), customer_id (for multi-tenant chargeback), and environment (dev/staging/prod). For tool-call spans, record tool.name, tool.status (success/error), and tool.duration_ms. For retrieval spans, record retriever.source, retriever.doc_count, and retriever.total_tokens (the token count of retrieved context). Follow the OpenTelemetry GenAI semantic conventions for attribute naming to ensure compatibility across tools.

How does tracing help with debugging failed or hallucinated LLM responses?
Can I use tracing to implement cost guardrails for agentic workflows?
Yes. As each agent step's span completes, read its llm.cost_usd attribute and add it to a running total stored on the root span. Before initiating the next agent step, check whether the running total exceeds your per-request budget (e.g., $0.50). If it does, force the agent to emit a final response with whatever context it has accumulated, rather than continuing to iterate. This pattern prevents runaway agent loops — where the model keeps calling tools and accumulating context indefinitely — from generating unbounded costs. You can also set guardrails on step count (maximum 10 LLM calls per trace), individual span cost (no single span above $0.10), and total input token count (halt if accumulated context exceeds 50,000 tokens). CostHawk's alerting can notify you when traces exceed cost thresholds, enabling after-the-fact analysis even if your real-time guardrails did not catch every edge case.

What is the relationship between OpenTelemetry and LLM tracing?
OpenTelemetry provides the general-purpose tracing standard, and its GenAI semantic conventions extend it with LLM-specific attribute names: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and others. By using OTel for your LLM tracing, you gain several advantages: your traces are portable across any OTel-compatible backend (Jaeger, Grafana Tempo, Datadog, Honeycomb, CostHawk), your LLM spans integrate seamlessly with existing infrastructure traces (HTTP requests, database queries, cache lookups), and you benefit from the OTel ecosystem of auto-instrumentation libraries, SDKs in every major language, and the OpenTelemetry Collector for data routing and transformation. The alternative LLM-specific tools like Langfuse and LangSmith offer faster initial setup for AI-only use cases, but OTel provides the most comprehensive and future-proof foundation for organizations that want unified observability across their entire stack.

Related Terms
Spans
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
OpenTelemetry
An open-source observability framework providing a vendor-neutral standard (OTLP) for collecting traces, metrics, and logs from distributed systems. OpenTelemetry is rapidly becoming the standard instrumentation layer for LLM applications, enabling teams to track latency, token usage, cost, and quality across every inference call.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
Agentic AI
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
