Spans
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
Definition
What are Spans?
Every span carries the core fields defined by the OpenTelemetry specification: traceId, spanId, parentSpanId, name, kind, startTime, endTime, attributes, events, and status. AI observability platforms like LangSmith, Arize Phoenix, Helicone, and CostHawk extend these standard span attributes with LLM-specific fields: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, gen_ai.response.finish_reason, and computed cost. This extension of the OpenTelemetry semantic conventions for generative AI (sometimes called the GenAI semantic conventions) allows teams to use familiar tracing infrastructure to monitor AI-specific concerns like token consumption, latency per model, and cost per span.
Impact
Why It Matters for AI Costs
Without span-level visibility, AI costs and performance are a black box. You can see your total monthly bill, but you cannot answer questions like: Which step in my RAG pipeline is the most expensive? How much latency does the guardrail check add? Why did this agentic loop cost $0.47 when the average is $0.08?
Spans unlock granular cost attribution. Consider a retrieval-augmented generation (RAG) pipeline with these steps:
| Span | Operation | Duration | Tokens | Cost |
|---|---|---|---|---|
| 1 | Embed user query | 45 ms | 32 input | $0.000003 |
| 2 | Vector search (Pinecone) | 120 ms | — | $0.00002 |
| 3 | Re-rank retrieved chunks | 180 ms | 2,400 input | $0.0006 |
| 4 | Synthesize answer (Claude 3.5 Sonnet) | 1,850 ms | 3,200 in / 680 out | $0.0198 |
| 5 | Guardrail check | 210 ms | 800 in / 50 out | $0.0003 |
The total request cost is $0.0207, and span-level data immediately shows that Span 4 — the synthesis LLM call — accounts for 95.7% of the cost. Without spans, you would know the total but have no idea where to optimize. With spans, you can target the expensive operation: switch to a cheaper model for synthesis, reduce the context sent to the LLM, or cache frequent answers.
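The optimization decision above falls out of simple arithmetic over span costs. A minimal sketch of that calculation (the span names and the costBreakdown helper are illustrative, not a CostHawk API):

```typescript
// Hypothetical sketch: per-span cost attribution for the RAG pipeline above.
interface SpanCost { name: string; costUsd: number }

// Sort spans by cost and compute each span's share of the trace total
function costBreakdown(spans: SpanCost[]): Array<SpanCost & { share: number }> {
  const total = spans.reduce((sum, s) => sum + s.costUsd, 0)
  return spans
    .map((s) => ({ ...s, share: s.costUsd / total }))
    .sort((a, b) => b.costUsd - a.costUsd)
}

const ragSpans: SpanCost[] = [
  { name: "embed.query", costUsd: 0.000003 },
  { name: "vectordb.search", costUsd: 0.00002 },
  { name: "rerank.chunks", costUsd: 0.0006 },
  { name: "llm.synthesize", costUsd: 0.0198 },
  { name: "guardrail.check", costUsd: 0.0003 },
]

const breakdown = costBreakdown(ragSpans)
// breakdown[0] is llm.synthesize, at roughly 95% of total trace cost
```

The same group-and-sort pattern generalizes from a single trace to fleet-wide dashboards.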
Spans also reveal latency bottlenecks. In the example above, total end-to-end latency is ~2,405 ms, and the synthesis call accounts for 77% of it. If your SLA requires responses under 2 seconds, you know exactly which span to optimize.
For agentic systems that execute multi-step plans with loops and branches, spans are indispensable. An agent that decides to call three tools in sequence, then re-plans and calls two more, generates a span tree that shows the full decision path. When that agent occasionally enters a runaway loop and makes 15 LLM calls instead of 3, the span tree makes it immediately visible — and the cost of each span tells you exactly how much the runaway cost.
CostHawk captures span-level telemetry for every traced request, enabling per-span cost breakdowns, latency analysis, and anomaly detection at the operation level rather than just the request level.
Anatomy of a Span
Every span, whether it represents a traditional microservice call or an LLM inference, contains a set of core fields defined by the OpenTelemetry specification. Understanding these fields is essential for configuring tracing, querying traces, and building dashboards.
Core fields:
| Field | Type | Description |
|---|---|---|
| traceId | 128-bit hex string | Globally unique identifier for the entire trace. All spans in a single request share the same traceId. |
| spanId | 64-bit hex string | Unique identifier for this individual span within the trace. |
| parentSpanId | 64-bit hex string (optional) | The spanId of this span's parent. Root spans (the first span in a trace) have no parent. |
| name | string | A human-readable name describing the operation, e.g., "llm.chat.completions" or "vectordb.query". |
| kind | enum | One of CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL. LLM calls are typically CLIENT spans. |
| startTime | timestamp (ns) | When the operation began, in nanoseconds since epoch. |
| endTime | timestamp (ns) | When the operation completed. endTime - startTime = duration. |
| status | object | Contains a code (OK, ERROR, UNSET) and an optional message for error details. |
| attributes | key-value map | Arbitrary metadata attached to the span. This is where LLM-specific data lives. |
| events | array | Timestamped annotations within the span's lifetime, e.g., "first_token_received" at a specific timestamp. |
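The timing fields and events above compose naturally: duration comes from endTime minus startTime, and intra-span milestones like first-token arrival fall out as event offsets. A small self-contained sketch, with invented nanosecond values:

```typescript
// Sketch: deriving duration and time-to-first-token from the core span fields.
// Field names follow the table above; the timestamps are made up for illustration.
interface SpanEvent { name: string; timestampNs: bigint }
interface Span {
  startTimeNs: bigint
  endTimeNs: bigint
  events: SpanEvent[]
}

function durationMs(span: Span): number {
  return Number(span.endTimeNs - span.startTimeNs) / 1_000_000
}

// Offset of the "first_token_received" event from span start, if recorded
function timeToFirstTokenMs(span: Span): number | undefined {
  const evt = span.events.find((e) => e.name === "first_token_received")
  return evt ? Number(evt.timestampNs - span.startTimeNs) / 1_000_000 : undefined
}

const llmSpan: Span = {
  startTimeNs: 1_000_000_000n,
  endTimeNs: 2_850_000_000n, // 1,850 ms after start
  events: [{ name: "first_token_received", timestampNs: 1_420_000_000n }],
}
// durationMs(llmSpan) → 1850; timeToFirstTokenMs(llmSpan) → 420
```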
AI-specific attributes (GenAI semantic conventions):
The OpenTelemetry community has defined semantic conventions specifically for generative AI operations. These attributes are attached to the span's attributes map:
- gen_ai.system — the AI provider ("openai", "anthropic", "google")
- gen_ai.request.model — the model requested (e.g., "claude-3-5-sonnet-20241022")
- gen_ai.response.model — the model that actually served the response (may differ if aliased)
- gen_ai.usage.input_tokens — number of input tokens consumed
- gen_ai.usage.output_tokens — number of output tokens generated
- gen_ai.request.temperature — sampling temperature used
- gen_ai.request.max_tokens — maximum output tokens requested
- gen_ai.response.finish_reason — why generation stopped ("stop", "length", "tool_calls")
Platforms like CostHawk add computed attributes beyond the standard conventions:
- costhawk.cost.input — dollar cost of input tokens based on current model pricing
- costhawk.cost.output — dollar cost of output tokens
- costhawk.cost.total — total span cost (input + output)
- costhawk.cache.hit — whether prompt caching was used (reduces cost)
When you configure tracing in your application, ensuring these attributes are populated on every LLM span is the foundation for all downstream cost analytics.
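One way to enforce that foundation is to check finished spans' attribute maps before export. The sketch below is a hypothetical validator — the required keys follow the GenAI conventions above, but the function shape is an assumption, not a standard API:

```typescript
// Hypothetical sketch: flag LLM spans missing the attributes cost analytics need.
type Attributes = Record<string, string | number | boolean>

const REQUIRED_LLM_ATTRS = [
  "gen_ai.system",
  "gen_ai.request.model",
  "gen_ai.usage.input_tokens",
  "gen_ai.usage.output_tokens",
]

function missingLlmAttributes(attrs: Attributes): string[] {
  // Only check spans that look like LLM calls; skip HTTP/DB spans
  if (!("gen_ai.system" in attrs) && !("gen_ai.request.model" in attrs)) return []
  return REQUIRED_LLM_ATTRS.filter((key) => !(key in attrs))
}

const incomplete: Attributes = {
  "gen_ai.system": "anthropic",
  "gen_ai.request.model": "claude-3-5-sonnet-20241022",
  // token counts were never set — cost cannot be computed for this span
}
// missingLlmAttributes(incomplete) → ["gen_ai.usage.input_tokens", "gen_ai.usage.output_tokens"]
```

In practice a check like this would live in a span processor or exporter hook, logging a warning whenever an LLM span would arrive uncostable.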
Span Hierarchies and Trace Trees
Spans do not exist in isolation. They form hierarchical trees (or more precisely, directed acyclic graphs) that represent the causal structure of a request. The relationships between spans — parent-child, sibling, sequential, concurrent — reveal how your AI system actually executes and where time and money are spent.
Parent-child relationships: When one operation initiates another, the initiating span becomes the parent of the child span. For example, an agentic orchestrator span might be the parent of three child spans: a planning LLM call, a tool execution, and a synthesis LLM call. The parent's duration encompasses all of its children's durations (plus any orchestration overhead).
Root spans: Every trace has exactly one root span with no parent. This represents the entry point of the request — typically an HTTP handler, a queue consumer, or an SDK invocation. The root span's duration is the end-to-end latency of the entire request.
Trace tree example for a multi-step agent:
```
root: agent.execute (3,420 ms, $0.0847)
├── llm.chat: plan (680 ms, $0.0042)
│     model: gpt-4o, input: 1,200 tok, output: 180 tok
├── tool.execute: search_database (340 ms, $0.0000)
│     query: "Q3 revenue by region"
├── llm.chat: analyze (1,120 ms, $0.0285)
│     model: claude-3-5-sonnet, input: 4,800 tok, output: 620 tok
├── tool.execute: generate_chart (180 ms, $0.0000)
│     chart_type: "bar", data_points: 12
└── llm.chat: synthesize (890 ms, $0.0520)
      model: claude-3-5-sonnet, input: 6,200 tok, output: 1,400 tok
```

This trace tree tells a complete story: the agent planned with GPT-4o (cheap, fast), searched a database (no LLM cost), analyzed results with Claude Sonnet, generated a chart (no LLM cost), and synthesized a final answer. The synthesis span is the most expensive ($0.052, 61% of total cost) because it processed the most tokens and generated the longest output.
Concurrent vs sequential spans: Sibling spans can execute concurrently (tool calls that run in parallel) or sequentially (each step depends on the previous). The trace tree captures timing information that distinguishes these patterns. If two child spans overlap in time, they ran concurrently; if one starts after another ends, they ran sequentially. This distinction matters for latency optimization — concurrent spans can be parallelized, while sequential spans define the critical path.
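The overlap test described above is mechanical once you have sibling start and end times. A minimal sketch (span names and timings are invented for illustration):

```typescript
// Sketch: classifying sibling spans as concurrent or sequential from timestamps.
// Times are in milliseconds relative to the parent span for readability.
interface TimedSpan { name: string; start: number; end: number }

// Two spans ran concurrently if their time intervals overlap
function overlap(a: TimedSpan, b: TimedSpan): boolean {
  return a.start < b.end && b.start < a.end
}

function classifySiblings(spans: TimedSpan[]): string[] {
  const notes: string[] = []
  for (let i = 0; i < spans.length; i++) {
    for (let j = i + 1; j < spans.length; j++) {
      const kind = overlap(spans[i], spans[j]) ? "concurrent" : "sequential"
      notes.push(`${spans[i].name} / ${spans[j].name}: ${kind}`)
    }
  }
  return notes
}

// Two parallel tool calls followed by a synthesis call on the critical path
const siblings: TimedSpan[] = [
  { name: "tool.search", start: 0, end: 340 },
  { name: "tool.lookup", start: 10, end: 300 },      // overlaps tool.search
  { name: "llm.synthesize", start: 350, end: 1240 }, // starts after both finish
]
```

Applied to real span data, this classification tells you which siblings are candidates for parallelization and which define the critical path.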
Span depth and complexity: Simple request-response LLM calls produce shallow traces (root span → LLM span). Agentic systems with planning, tool use, and reflection can produce traces with depth 5+ and dozens of spans. CostHawk's trace viewer renders these deep trees as collapsible hierarchies, with cost and duration annotations at every level, making it easy to navigate complex agent executions and identify the expensive paths.
Spans in LLM Pipelines vs Traditional Microservices
Distributed tracing with spans was invented for microservice architectures, where a single user request traverses multiple network services. LLM observability adapts this concept for AI pipelines, but there are important differences that affect how you instrument, analyze, and alert on spans.
Traditional microservice spans:
- Primarily measure latency and error rates across service boundaries
- Attributes focus on HTTP methods, status codes, service names, and infrastructure metadata
- Cost is indirect (compute hours, not per-request billing)
- Span counts per trace are relatively stable (service topology doesn't change per request)
- Tools: Jaeger, Zipkin, Datadog APM, Honeycomb
LLM pipeline spans:
- Measure latency, token counts, and direct dollar cost per operation
- Attributes include model name, token usage, temperature, finish reason, and computed cost
- Cost is direct and per-span — every LLM span has a calculable dollar amount
- Span counts per trace are variable — an agent might make 2 LLM calls or 20, depending on the task
- Tools: LangSmith, Arize Phoenix, Helicone, Braintrust, CostHawk
The key difference is cost attribution. In traditional microservices, you cannot easily assign a dollar cost to a single span — infrastructure costs are shared and amortized. In LLM pipelines, every LLM span has a precise cost: (input_tokens × input_rate) + (output_tokens × output_rate). This makes spans in AI observability uniquely powerful for cost management.
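The per-span cost formula is straightforward to express in code. The sketch below uses illustrative per-million-token rates ($3 input / $15 output, which happens to reproduce the synthesis cost in the RAG table earlier) rather than authoritative provider pricing:

```typescript
// Sketch: (input_tokens × input_rate) + (output_tokens × output_rate).
// Rates are illustrative per-million-token prices, not current provider pricing.
interface ModelRates { inputPerMTok: number; outputPerMTok: number }

function spanCostUsd(inputTokens: number, outputTokens: number, rates: ModelRates): number {
  return (inputTokens / 1_000_000) * rates.inputPerMTok +
         (outputTokens / 1_000_000) * rates.outputPerMTok
}

// Assumed example rates: $3 / 1M input tokens, $15 / 1M output tokens
const rates: ModelRates = { inputPerMTok: 3, outputPerMTok: 15 }
const cost = spanCostUsd(3200, 680, rates)
// 3,200 in / 680 out → $0.0096 + $0.0102 = $0.0198
```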
Variable span counts create new challenges. A microservice trace might always have 8 spans (the same 8 services in the call chain). An agentic AI trace might have 3 spans on one invocation and 30 on another, depending on the agent's planning decisions. This variability means:
- You cannot set fixed latency thresholds — a 5-span agent run will naturally be faster than a 20-span run
- Cost anomaly detection must account for span count variation — a trace that costs $0.30 because the agent made 15 LLM calls is different from one that costs $0.30 because a single call used an expensive model
- Span count itself becomes a metric to monitor — increasing average spans per trace may indicate planning inefficiency or looping behavior
Hybrid instrumentation: Production AI applications typically have both traditional microservice spans (HTTP middleware, database queries, cache lookups) and LLM-specific spans in the same trace. A user request hits your API server (HTTP span), authenticates (auth span), retrieves context from a vector database (DB span), calls an LLM (GenAI span), and returns a response (HTTP span). CostHawk integrates with OpenTelemetry to consume both types, providing unified latency analysis alongside LLM-specific cost attribution.
Instrumenting Spans in Your AI Application
Capturing span data requires instrumentation — the code or middleware that creates, populates, and exports spans. There are three levels of instrumentation, from least to most effort:
Level 1: Auto-instrumentation (zero code changes)
Many AI frameworks and proxy layers automatically create spans for LLM calls. If you use LangChain, LlamaIndex, or the Vercel AI SDK, you can enable tracing with a single configuration change:
```typescript
// LangChain auto-instrumentation via LangSmith
// (tracing is typically switched on with the LANGSMITH_TRACING=true env var)
import { Client } from "langsmith"
const client = new Client() // reads LANGSMITH_API_KEY from the environment

// Vercel AI SDK with OpenTelemetry
import { generateText } from "ai"
import { openai } from "@ai-sdk/openai"

const result = await generateText({
  model: openai("gpt-4o"),
  prompt: "Summarize this document.",
  experimental_telemetry: { isEnabled: true },
})
```

Auto-instrumentation captures the standard GenAI attributes (model, tokens, duration) but may miss custom metadata specific to your application.
Level 2: SDK-based instrumentation (minimal code changes)
AI observability platforms provide SDK wrappers that instrument LLM client libraries. You wrap your existing OpenAI or Anthropic client and spans are captured automatically:
```typescript
// Helicone-style header injection
import OpenAI from "openai"

const openai = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": "Bearer sk-helicone-xxx",
    "Helicone-Property-Feature": "search",
  },
})

// CostHawk wrapped-key approach
const costhawkClient = new OpenAI({
  apiKey: "chk_wrapped_key_xxx", // CostHawk wrapped key
  // All requests are automatically traced with span data
})
```

CostHawk's wrapped-key approach is the lowest-friction option: replace your API key with a CostHawk wrapped key, and every request is automatically traced with full span data — model, tokens, cost, latency, and any custom tags you attach via headers.
Level 3: Manual instrumentation (full control)
For maximum flexibility, use the OpenTelemetry SDK directly to create custom spans with whatever attributes you need:
```typescript
import { trace } from "@opentelemetry/api"

const tracer = trace.getTracer("my-ai-app")

// embedQuery, vectorSearch, and llmCall are your application's own helpers
async function ragPipeline(query: string) {
  return tracer.startActiveSpan("rag.pipeline", async (rootSpan) => {
    // Embedding span
    const embedding = await tracer.startActiveSpan("embed.query", async (span) => {
      const result = await embedQuery(query)
      span.setAttribute("gen_ai.usage.input_tokens", result.tokens)
      span.setAttribute("gen_ai.request.model", "text-embedding-3-small")
      span.end()
      return result
    })

    // Retrieval span
    const docs = await tracer.startActiveSpan("vectordb.search", async (span) => {
      const results = await vectorSearch(embedding, { topK: 10 })
      span.setAttribute("vectordb.top_k", 10)
      span.setAttribute("vectordb.results_count", results.length)
      span.end()
      return results
    })

    // LLM synthesis span
    const answer = await tracer.startActiveSpan("llm.synthesize", async (span) => {
      const response = await llmCall(query, docs)
      span.setAttribute("gen_ai.request.model", "claude-3-5-sonnet")
      span.setAttribute("gen_ai.usage.input_tokens", response.inputTokens)
      span.setAttribute("gen_ai.usage.output_tokens", response.outputTokens)
      span.setAttribute("costhawk.cost.total", response.cost)
      span.end()
      return response
    })

    rootSpan.end()
    return answer
  })
}
```

Manual instrumentation gives you full control over span names, attributes, and nesting. It is the best approach for custom pipelines that are not covered by framework auto-instrumentation. The tradeoff is more code to maintain, but the observability gains — especially for cost attribution — are substantial.
Using Spans for Cost Debugging
Span-level cost data transforms debugging from guesswork into precision analysis. Here are five concrete debugging workflows that spans enable:
1. Finding the most expensive span in a trace. When a single request costs more than expected, open the trace and sort spans by cost. The most expensive span immediately reveals the culprit — perhaps a synthesis call received too much context (high input tokens) or generated an excessively long response (high output tokens). Without spans, you would only see the total cost with no breakdown.
2. Detecting agent loops. Agentic systems sometimes enter loops where the agent repeatedly calls the same tool or makes the same LLM call with slightly different inputs. In a span tree, this appears as an unusually high number of sibling spans of the same type. Set an alert on span count per trace: if a trace exceeds your expected maximum (e.g., 20 LLM spans), investigate. CostHawk flags traces where span count deviates from the rolling average by more than 2 standard deviations.
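The deviation check described above is a simple z-score over recent span counts. A self-contained sketch — the helper names and 2-standard-deviation threshold mirror the description, but this is not CostHawk's actual implementation:

```typescript
// Sketch: flag traces whose LLM-span count deviates from the rolling mean
// by more than 2 standard deviations.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length
}

function stddev(xs: number[]): number {
  const m = mean(xs)
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)))
}

function isRunaway(spanCount: number, recentCounts: number[], zThreshold = 2): boolean {
  const m = mean(recentCounts)
  const sd = stddev(recentCounts)
  if (sd === 0) return spanCount !== m // no variance: any deviation is anomalous
  return (spanCount - m) / sd > zThreshold
}

// Recent traces averaged ~3 LLM spans; a 15-span trace is flagged as a runaway
const recent = [3, 2, 4, 3, 3, 2, 4, 3]
// isRunaway(15, recent) → true; isRunaway(4, recent) → false
```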
3. Identifying context window bloat. Over time, applications that append to conversation history or accumulate retrieved documents will see input token counts grow per span. Query your span data for the trend of gen_ai.usage.input_tokens over time for a specific span name (e.g., llm.chat.synthesize). If the P50 input tokens increased from 2,000 to 8,000 over two weeks, your context management has a leak. The span data pinpoints exactly which operation is receiving the bloated context.
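Detecting that trend amounts to computing the P50 of input tokens per period for one span name. A self-contained sketch with invented sample data:

```typescript
// Sketch: weekly P50 of gen_ai.usage.input_tokens for one span name,
// used to spot context bloat over time.
interface SpanSample { name: string; week: number; inputTokens: number }

function p50(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

function p50ByWeek(samples: SpanSample[], spanName: string): Map<number, number> {
  const byWeek = new Map<number, number[]>()
  for (const s of samples) {
    if (s.name !== spanName) continue
    const bucket = byWeek.get(s.week) ?? []
    bucket.push(s.inputTokens)
    byWeek.set(s.week, bucket)
  }
  const medians = new Map<number, number>()
  byWeek.forEach((toks, week) => medians.set(week, p50(toks)))
  return medians
}

const samples: SpanSample[] = [
  { name: "llm.chat.synthesize", week: 1, inputTokens: 1900 },
  { name: "llm.chat.synthesize", week: 1, inputTokens: 2100 },
  { name: "llm.chat.synthesize", week: 3, inputTokens: 7800 },
  { name: "llm.chat.synthesize", week: 3, inputTokens: 8200 },
]
// P50 jumped from 2,000 (week 1) to 8,000 (week 3): context management has a leak
```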
4. Comparing model costs across the same operation. If you are A/B testing models (e.g., routing 50% of synthesis calls to Claude Sonnet and 50% to GPT-4o), span data lets you compare cost and latency side by side. Filter spans by name, group by gen_ai.request.model, and compare the distributions. You might discover that GPT-4o is 15% cheaper for this specific operation while delivering equivalent quality, justifying a full routing switch.
5. Tracing cost through multi-tenant systems. If your application serves multiple customers, attaching a customer_id attribute to spans enables per-customer cost attribution. Aggregate span costs by customer to identify your most expensive accounts, calculate per-customer margins, and detect usage anomalies. A customer whose average request cost suddenly triples may be sending unusually large inputs or triggering complex agent paths.
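Per-customer attribution reduces to a group-by-sum over the customer_id attribute. A minimal sketch (the customer names and costs are invented):

```typescript
// Sketch: summing span costs by a customer_id span attribute.
interface CostSpan { attributes: Record<string, string>; costUsd: number }

function costByCustomer(spans: CostSpan[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const span of spans) {
    const customer = span.attributes["customer_id"] ?? "unknown"
    totals.set(customer, (totals.get(customer) ?? 0) + span.costUsd)
  }
  return totals
}

// Invented example data: two customers, three spans
const tenantSpans: CostSpan[] = [
  { attributes: { customer_id: "acme" }, costUsd: 0.02 },
  { attributes: { customer_id: "acme" }, costUsd: 0.05 },
  { attributes: { customer_id: "globex" }, costUsd: 0.01 },
]
const totals = costByCustomer(tenantSpans)
// acme ≈ $0.07, globex = $0.01
```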
Building span-based cost dashboards: The most actionable dashboards combine span data across traces:
- Cost by span name — shows which operations drive the most aggregate cost
- P95 cost per span — reveals the long tail of expensive operations
- Span cost as percentage of trace cost — identifies the dominant cost center in your pipeline
- Span count distribution — detects variability in pipeline complexity
CostHawk's span analytics provide all of these views out of the box, with the ability to filter by time range, model, tag, project, and customer.
Span Sampling and Storage Strategies
At scale, capturing and storing every span from every request can become expensive itself. A production application handling 1 million requests per day, each generating 5 spans, produces 5 million spans per day — roughly 150 million per month. At an average of 500 bytes per span, that is 75 GB of span data per month before indexing overhead. Span sampling and retention strategies are essential for keeping observability costs proportional to the value they provide.
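The volume arithmetic above, written out:

```typescript
// Back-of-envelope span volume for the workload described above
const requestsPerDay = 1_000_000
const spansPerRequest = 5
const bytesPerSpan = 500

const spansPerMonth = requestsPerDay * spansPerRequest * 30       // 150,000,000 spans
const gbPerMonth = (spansPerMonth * bytesPerSpan) / 1_000_000_000 // 75 GB, pre-indexing
```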
Head-based sampling: Decide whether to trace a request before it starts, based on a probability or rule. For example, sample 10% of all requests, but always sample requests tagged as priority: high. This is the simplest approach and is supported natively by the OpenTelemetry SDK:
```typescript
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base"

// Sample 10% of traces
const sampler = new TraceIdRatioBasedSampler(0.1)
```

The downside is that you miss interesting traces — if a rare, expensive request is in the unsampled 90%, you will never see it.
Tail-based sampling: Collect all spans initially, then decide which complete traces to keep based on their characteristics. Keep traces that are slow (duration > P95), expensive (cost > $0.10), errored, or anomalous. Discard normal traces. This captures the most valuable data while dramatically reducing storage. Tail-based sampling requires a collection pipeline that buffers traces temporarily (typically 30–60 seconds) before making the keep/discard decision.
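The keep/discard decision at the heart of tail-based sampling can be sketched as a predicate over completed traces. The thresholds mirror the examples above; the trace shape and function are assumptions, not a particular collector's API:

```typescript
// Sketch: tail-based keep/discard decision for a buffered, completed trace.
interface CompletedTrace {
  durationMs: number
  costUsd: number
  hasError: boolean
}

function keepTrace(trace: CompletedTrace, p95DurationMs: number, baseRate = 0.1): boolean {
  if (trace.hasError) return true                    // always keep errored traces
  if (trace.durationMs > p95DurationMs) return true  // always keep slow traces
  if (trace.costUsd > 0.10) return true              // always keep expensive traces
  return Math.random() < baseRate                    // sample normal traffic at baseRate
}
```

A real collector applies this after buffering the trace's spans for 30–60 seconds, once duration and total cost are known.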
Cost-aware sampling: A hybrid strategy particularly suited to AI observability. Always capture spans for LLM calls above a cost threshold (e.g., any span costing more than $0.05). Sample normally for cheap operations. This ensures you never miss expensive operations while reducing data volume from the high-frequency, low-cost operations like embeddings and cache lookups.
Retention tiers: Not all span data needs the same retention period. A practical tiering strategy:
| Tier | Data | Retention | Storage |
|---|---|---|---|
| Hot | Full span details (attributes, events, links) | 7–14 days | Indexed, queryable |
| Warm | Aggregated span metrics (count, P50/P95 duration, total cost) | 90 days | Time-series DB |
| Cold | Sampled raw spans for audit | 1 year | Object storage (S3) |
CostHawk handles sampling and retention automatically for customers using its managed tracing pipeline. For teams exporting spans via OpenTelemetry, CostHawk accepts OTLP-format spans and applies cost-aware tail-based sampling to ensure expensive and anomalous traces are always retained while keeping storage costs manageable.
Cost of observability itself: A common anti-pattern is spending more on observability than the savings it enables. If your AI API bill is $5,000/month and your observability platform costs $3,000/month, the economics do not work. Span sampling, aggregation, and retention tiers keep observability costs to 2–5% of the monitored AI spend, ensuring a strong return on investment.
FAQ
Frequently Asked Questions
What is the difference between a span and a trace?
A trace represents the complete journey of a single request through your system; a span is one individual operation within that journey. All spans in a trace share the same traceId, which is how observability tools stitch them together into a coherent view. A simple request-response LLM call produces a trace with just 2 spans (the HTTP handler and the LLM call). A complex agentic workflow might produce a trace with 30+ spans representing planning, tool use, retrieval, and synthesis steps. The trace gives you the big picture; the spans give you the details. CostHawk computes both trace-level cost (sum of all span costs) and per-span cost to support both high-level and granular analysis.
How do spans help with AI cost optimization?
Spans attach a precise dollar cost to each individual operation, turning a single opaque bill into an itemized breakdown. You can sort spans by cost to find the most expensive step in a pipeline, compare cost and latency across models for the same operation, and aggregate span costs by name, customer, or feature to see where optimization effort will pay off most.
Do I need OpenTelemetry to use spans for AI observability?
Not necessarily. Framework auto-instrumentation (LangChain, the Vercel AI SDK), proxy-based approaches like Helicone's header injection, and CostHawk's wrapped keys all capture span data without writing any OpenTelemetry code. OpenTelemetry matters when you need manual instrumentation for custom pipelines, or when you want a vendor-neutral export format (OTLP) that works across observability backends.
What attributes should I add to LLM spans beyond the defaults?
Beyond the standard GenAI attributes, three categories of custom metadata are worth adding. First, add business context: feature (which product feature triggered this span), customer_id (for per-customer cost attribution), environment (dev/staging/prod), and version (your application version). Second, add pipeline context: pipeline_name (e.g., "rag-v2" or "agent-planner"), step_index (position in the pipeline), and retry_count (whether this is a retry of a failed call). Third, add quality signals: cache_hit (was the prompt cache used), fallback_model (was this a fallback from a primary model), and output_truncated (did the response hit max_tokens). These attributes enable powerful queries: "Show me the P95 cost of the summarization step for customer X in production, excluding cache hits" or "Compare the average latency of the planner span between pipeline v2 and v3." The investment in rich span attributes pays for itself many times over in debugging speed and optimization insights.
How do spans work in streaming LLM responses?
A streaming LLM call is still represented by a single span: the startTime is when the request was sent, and the endTime is when the last token (or the stream-close event) was received. The span's duration therefore reflects the total generation time, not the time to first token. To capture TTFT (time to first token) within a streaming span, most instrumentation libraries record a span event — a timestamped annotation within the span — at the moment the first token arrives. For example: span.addEvent("gen_ai.first_token", { timestamp: firstTokenTime }). The token counts (gen_ai.usage.input_tokens and gen_ai.usage.output_tokens) are typically populated when the stream completes, since most providers send a final usage object in the last chunk of the stream. Cost calculation also happens at stream completion, once the final token counts are known. If the stream is interrupted (client disconnect, timeout), the span should still be ended and marked with an error status, and the token counts should reflect whatever was actually consumed — providers bill for tokens generated even if the client disconnects before receiving them all.
Can spans detect runaway agentic loops?
Yes — runaway loops leave three clear signatures in span data. (1) Span count explodes — a trace that normally contains 5 spans suddenly contains 30. (2) Repeated sibling spans appear — if you see ten llm.chat.plan spans at the same depth, the agent is stuck in a planning loop. (3) Trace cost escalates — each LLM span adds cost, so a runaway trace might cost $0.50 when the average is $0.05. CostHawk's anomaly detection monitors both span count and trace cost, flagging runaway traces in real time. For prevention, many teams implement a max_iterations guard at the orchestrator level and record the iteration count as a span attribute, making it easy to query how often agents approach the limit.
What is the performance overhead of capturing spans?
Very little in practice. Span creation and attribute recording are in-memory operations measured in microseconds, and exporters typically batch spans and send them asynchronously off the request path — negligible next to LLM calls that take hundreds or thousands of milliseconds. The costs to watch are network egress and storage for the exported span data, which is what sampling and retention strategies address.
How do I query and analyze span data effectively?
Effective span analysis comes down to a handful of recurring query patterns. (1) Aggregation by span name — "total cost of all llm.synthesize spans this week" shows which operations drive aggregate spend. (2) Filtered percentiles — "P95 latency of vectordb.search spans where environment=production" reveals whether your vector database is meeting latency SLAs. (3) Time-series of span metrics — track how average input tokens per synthesis span changes over time to detect context bloat. (4) Trace-level roll-ups — sum span costs within each trace, then compute the distribution of trace costs to understand your cost profile. (5) Correlation analysis — correlate span count with trace cost to distinguish traces that are expensive because of many cheap spans versus few expensive spans. CostHawk provides a SQL-like query interface over span data, plus pre-built dashboards for the most common analysis patterns. For teams that export spans to their own data warehouse (BigQuery, Snowflake, ClickHouse), the same query patterns apply using standard SQL.
Related Terms
Tracing
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
OpenTelemetry
An open-source observability framework providing a vendor-neutral standard (OTLP) for collecting traces, metrics, and logs from distributed systems. OpenTelemetry is rapidly becoming the standard instrumentation layer for LLM applications, enabling teams to track latency, token usage, cost, and quality across every inference call.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Agentic AI
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
