Glossary › Observability · Updated 2026-03-16

Spans

Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.

Definition

What is a Span?

A span is the fundamental building block of distributed tracing. It represents a single, named, timed operation within a larger workflow. In the context of AI and LLM observability, a span captures everything that happens during one discrete step of an AI pipeline: an LLM inference call, a vector database retrieval, a tool execution, a guardrail check, or an embedding generation. Each span records a start time, end time (from which duration is derived), status (success or error), attributes (key-value metadata such as model name, token counts, temperature, and cost), and a parent span ID that links it to the broader trace. When a user sends a query to an agentic AI system, that single request may fan out into dozens of spans — a planning span, multiple tool-call spans, retrieval spans, re-ranking spans, and a final synthesis LLM call — all stitched together by trace and span IDs into a directed acyclic graph that represents the full execution path.

Spans originated in distributed systems observability (Google's Dapper paper, 2010) and are now standardized by OpenTelemetry, the CNCF project that defines a vendor-neutral wire format for traces, metrics, and logs. The OpenTelemetry specification defines a span as an object with a TraceId, SpanId, ParentSpanId, name, kind, startTime, endTime, attributes, events, and status.

AI observability platforms like LangSmith, Arize Phoenix, Helicone, and CostHawk extend these standard span attributes with LLM-specific fields: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, gen_ai.response.finish_reason, and computed cost. This extension of the OpenTelemetry semantic conventions for generative AI (sometimes called the GenAI semantic conventions) allows teams to use familiar tracing infrastructure to monitor AI-specific concerns like token consumption, latency per model, and cost per span.
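
Concretely, a single LLM span of the kind described above might look like the following when serialized. This is a hand-written sketch of the shape, not output from any particular SDK; the IDs and timestamps are invented, and the attribute keys follow the GenAI semantic conventions:

```typescript
// Illustrative span object; real spans are produced by a tracing SDK and
// serialized to OTLP, which encodes 64-bit timestamps as strings.
const llmSpan = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",  // shared by every span in the request
  spanId: "00f067aa0ba902b7",
  parentSpanId: "53995c3f42cd8ad8",             // links this span to its parent
  name: "llm.chat.completions",
  kind: "CLIENT",
  startTimeUnixNano: "1710592800000000000",
  endTimeUnixNano: "1710592801850000000",
  status: { code: "OK" },
  attributes: {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-3-5-sonnet-20241022",
    "gen_ai.usage.input_tokens": 3200,
    "gen_ai.usage.output_tokens": 680,
    "gen_ai.response.finish_reason": "stop",
  },
  events: [],
}

// Duration is derived from the timestamps rather than stored directly:
const durationMs = Number(
  (BigInt(llmSpan.endTimeUnixNano) - BigInt(llmSpan.startTimeUnixNano)) / 1_000_000n
)
// durationMs = 1850
```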

Impact

Why It Matters for AI Costs

Without span-level visibility, AI costs and performance are a black box. You can see your total monthly bill, but you cannot answer questions like: Which step in my RAG pipeline is the most expensive? How much latency does the guardrail check add? Why did this agentic loop cost $0.47 when the average is $0.08?

Spans unlock granular cost attribution. Consider a retrieval-augmented generation (RAG) pipeline with these steps:

Span   Operation                               Duration   Tokens               Cost
1      Embed user query                        45 ms      32 input             $0.000003
2      Vector search (Pinecone)                120 ms     n/a                  $0.00002
3      Re-rank retrieved chunks                180 ms     2,400 input          $0.0006
4      Synthesize answer (Claude 3.5 Sonnet)   1,850 ms   3,200 in / 680 out   $0.0198
5      Guardrail check                         210 ms     800 in / 50 out      $0.0003

The total request cost is $0.0207, and span-level data immediately shows that Span 4 — the synthesis LLM call — accounts for 95.5% of the cost. Without spans, you would know the total but have no idea where to optimize. With spans, you can target the expensive operation: switch to a cheaper model for synthesis, reduce the context sent to the LLM, or cache frequent answers.
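
The arithmetic behind this breakdown can be reproduced in a few lines. The span names below are invented labels for the table's rows:

```typescript
// Per-span costs from the RAG pipeline table above (names are illustrative).
const spanCosts: Record<string, number> = {
  "embed.query": 0.000003,
  "vectordb.search": 0.00002,
  "rerank.chunks": 0.0006,
  "llm.synthesize": 0.0198,
  "guardrail.check": 0.0003,
}

const total = Object.values(spanCosts).reduce((sum, c) => sum + c, 0)
// total ≈ 0.0207

const synthesisShare = spanCosts["llm.synthesize"] / total
// ≈ 0.955: the synthesis call dominates the request cost
```

Sorting this map by value is the one-line version of "find the most expensive span"; the same pattern scales to aggregating costs across thousands of traces.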

Spans also reveal latency bottlenecks. In the example above, total end-to-end latency is ~2,405 ms, and the synthesis call accounts for 77% of it. If your SLA requires responses under 2 seconds, you know exactly which span to optimize.

For agentic systems that execute multi-step plans with loops and branches, spans are indispensable. An agent that decides to call three tools in sequence, then re-plans and calls two more, generates a span tree that shows the full decision path. When that agent occasionally enters a runaway loop and makes 15 LLM calls instead of 3, the span tree makes it immediately visible — and the cost of each span tells you exactly how much the runaway cost.

CostHawk captures span-level telemetry for every traced request, enabling per-span cost breakdowns, latency analysis, and anomaly detection at the operation level rather than just the request level.

Anatomy of a Span

Every span, whether it represents a traditional microservice call or an LLM inference, contains a set of core fields defined by the OpenTelemetry specification. Understanding these fields is essential for configuring tracing, querying traces, and building dashboards.

Core fields:

  • traceId (128-bit hex string) — Globally unique identifier for the entire trace. All spans in a single request share the same traceId.
  • spanId (64-bit hex string) — Unique identifier for this individual span within the trace.
  • parentSpanId (64-bit hex string, optional) — The spanId of this span's parent. Root spans (the first span in a trace) have no parent.
  • name (string) — A human-readable name describing the operation, e.g., "llm.chat.completions" or "vectordb.query".
  • kind (enum) — One of CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL. LLM calls are typically CLIENT spans.
  • startTime (timestamp, ns) — When the operation began, in nanoseconds since epoch.
  • endTime (timestamp, ns) — When the operation completed. endTime - startTime = duration.
  • status (object) — Contains a code (OK, ERROR, UNSET) and an optional message for error details.
  • attributes (key-value map) — Arbitrary metadata attached to the span. This is where LLM-specific data lives.
  • events (array) — Timestamped annotations within the span's lifetime, e.g., "first_token_received" at a specific timestamp.

AI-specific attributes (GenAI semantic conventions):

The OpenTelemetry community has defined semantic conventions specifically for generative AI operations. These attributes are attached to the span's attributes map:

  • gen_ai.system — the AI provider ("openai", "anthropic", "google")
  • gen_ai.request.model — the model requested (e.g., "claude-3-5-sonnet-20241022")
  • gen_ai.response.model — the model that actually served the response (may differ if aliased)
  • gen_ai.usage.input_tokens — number of input tokens consumed
  • gen_ai.usage.output_tokens — number of output tokens generated
  • gen_ai.request.temperature — sampling temperature used
  • gen_ai.request.max_tokens — maximum output tokens requested
  • gen_ai.response.finish_reason — why generation stopped ("stop", "length", "tool_calls")

Platforms like CostHawk add computed attributes beyond the standard conventions:

  • costhawk.cost.input — dollar cost of input tokens based on current model pricing
  • costhawk.cost.output — dollar cost of output tokens
  • costhawk.cost.total — total span cost (input + output)
  • costhawk.cache.hit — whether prompt caching was used (reduces cost)

When you configure tracing in your application, ensure these attributes are populated on every LLM span; they are the foundation for all downstream cost analytics.

Span Hierarchies and Trace Trees

Spans do not exist in isolation. They form hierarchical trees (or more precisely, directed acyclic graphs) that represent the causal structure of a request. The relationships between spans — parent-child, sibling, sequential, concurrent — reveal how your AI system actually executes and where time and money are spent.

Parent-child relationships: When one operation initiates another, the initiating span becomes the parent of the child span. For example, an agentic orchestrator span might be the parent of three child spans: a planning LLM call, a tool execution, and a synthesis LLM call. The parent's duration encompasses all of its children's durations (plus any orchestration overhead).

Root spans: Every trace has exactly one root span with no parent. This represents the entry point of the request — typically an HTTP handler, a queue consumer, or an SDK invocation. The root span's duration is the end-to-end latency of the entire request.

Trace tree example for a multi-step agent:

root: agent.execute (3,420 ms, $0.0847)
├── llm.chat: plan (680 ms, $0.0042)
│   model: gpt-4o, input: 1,200 tok, output: 180 tok
├── tool.execute: search_database (340 ms, $0.0000)
│   query: "Q3 revenue by region"
├── llm.chat: analyze (1,120 ms, $0.0285)
│   model: claude-3-5-sonnet, input: 4,800 tok, output: 620 tok
├── tool.execute: generate_chart (180 ms, $0.0000)
│   chart_type: "bar", data_points: 12
└── llm.chat: synthesize (890 ms, $0.0520)
    model: claude-3-5-sonnet, input: 6,200 tok, output: 1,400 tok

This trace tree tells a complete story: the agent planned with GPT-4o (cheap, fast), searched a database (no LLM cost), analyzed results with Claude Sonnet, generated a chart (no LLM cost), and synthesized a final answer. The synthesis span is the most expensive ($0.052, 61% of total cost) because it processed the most tokens and generated the longest output.

Concurrent vs sequential spans: Sibling spans can execute concurrently (tool calls that run in parallel) or sequentially (each step depends on the previous). The trace tree captures timing information that distinguishes these patterns. If two child spans overlap in time, they ran concurrently; if one starts after another ends, they ran sequentially. This distinction matters for latency optimization — concurrent spans can be parallelized, while sequential spans define the critical path.
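
The overlap test described here is a direct timestamp comparison. A minimal sketch, with invented millisecond offsets relative to the trace start:

```typescript
interface SpanTiming {
  name: string
  startMs: number
  endMs: number
}

// Two sibling spans ran concurrently if their time windows overlap.
function ranConcurrently(a: SpanTiming, b: SpanTiming): boolean {
  return a.startMs < b.endMs && b.startMs < a.endMs
}

const search: SpanTiming = { name: "tool.search", startMs: 0, endMs: 340 }
const chart: SpanTiming = { name: "tool.chart", startMs: 100, endMs: 280 }
const plan: SpanTiming = { name: "llm.plan", startMs: 400, endMs: 900 }

ranConcurrently(search, chart)  // true: overlapping windows, parallelizable
ranConcurrently(search, plan)   // false: strictly sequential, on the critical path
```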

Span depth and complexity: Simple request-response LLM calls produce shallow traces (root span → LLM span). Agentic systems with planning, tool use, and reflection can produce traces with depth 5+ and dozens of spans. CostHawk's trace viewer renders these deep trees as collapsible hierarchies, with cost and duration annotations at every level, making it easy to navigate complex agent executions and identify the expensive paths.

Spans in LLM Pipelines vs Traditional Microservices

Distributed tracing with spans was invented for microservice architectures, where a single user request traverses multiple network services. LLM observability adapts this concept for AI pipelines, but there are important differences that affect how you instrument, analyze, and alert on spans.

Traditional microservice spans:

  • Primarily measure latency and error rates across service boundaries
  • Attributes focus on HTTP methods, status codes, service names, and infrastructure metadata
  • Cost is indirect (compute hours, not per-request billing)
  • Span counts per trace are relatively stable (service topology doesn't change per request)
  • Tools: Jaeger, Zipkin, Datadog APM, Honeycomb

LLM pipeline spans:

  • Measure latency, token counts, and direct dollar cost per operation
  • Attributes include model name, token usage, temperature, finish reason, and computed cost
  • Cost is direct and per-span — every LLM span has a calculable dollar amount
  • Span counts per trace are variable — an agent might make 2 LLM calls or 20, depending on the task
  • Tools: LangSmith, Arize Phoenix, Helicone, Braintrust, CostHawk

The key difference is cost attribution. In traditional microservices, you cannot easily assign a dollar cost to a single span — infrastructure costs are shared and amortized. In LLM pipelines, every LLM span has a precise cost: (input_tokens × input_rate) + (output_tokens × output_rate). This makes spans in AI observability uniquely powerful for cost management.
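
That formula translates directly into code. The rates below are per-million-token list prices at the time of writing and should be treated as illustrative, since pricing changes:

```typescript
// Illustrative per-million-token rates; real pricing varies by model and date.
const rates: Record<string, { inputPerM: number; outputPerM: number }> = {
  "claude-3-5-sonnet": { inputPerM: 3.0, outputPerM: 15.0 },
  "gpt-4o-mini": { inputPerM: 0.15, outputPerM: 0.6 },
}

// Cost of one LLM span: (input_tokens × input_rate) + (output_tokens × output_rate)
function spanCost(model: string, inputTokens: number, outputTokens: number): number {
  const r = rates[model]
  return (inputTokens / 1_000_000) * r.inputPerM +
         (outputTokens / 1_000_000) * r.outputPerM
}

spanCost("claude-3-5-sonnet", 3200, 680)  // 0.0096 + 0.0102 = 0.0198
```

Note that this reproduces the $0.0198 synthesis cost from the RAG table earlier in this article.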

Variable span counts create new challenges. A microservice trace might always have 8 spans (the same 8 services in the call chain). An agentic AI trace might have 3 spans on one invocation and 30 on another, depending on the agent's planning decisions. This variability means:

  • You cannot set fixed latency thresholds — a 5-span agent run will naturally be faster than a 20-span run
  • Cost anomaly detection must account for span count variation — a trace that costs $0.30 because the agent made 15 LLM calls is different from one that costs $0.30 because a single call used an expensive model
  • Span count itself becomes a metric to monitor — increasing average spans per trace may indicate planning inefficiency or looping behavior
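
The span-count monitoring described above can be sketched as a rolling statistic. The window and threshold here are arbitrary illustrations, not recommendations:

```typescript
// Flag traces whose span count deviates from the rolling mean by more
// than k standard deviations.
function isSpanCountAnomaly(history: number[], current: number, k = 2): boolean {
  const mean = history.reduce((s, n) => s + n, 0) / history.length
  const variance = history.reduce((s, n) => s + (n - mean) ** 2, 0) / history.length
  const std = Math.sqrt(variance)
  if (std === 0) return current !== mean  // degenerate window: any change is anomalous
  return Math.abs(current - mean) > k * std
}

const recentSpanCounts = [6, 7, 5, 8, 6, 7, 6, 5]  // spans per trace, recent window
isSpanCountAnomaly(recentSpanCounts, 30)  // true: likely a runaway agent loop
isSpanCountAnomaly(recentSpanCounts, 7)   // false: within normal variation
```

In production the history would come from recent traces of the same pipeline, not a hardcoded array.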

Hybrid instrumentation: Production AI applications typically have both traditional microservice spans (HTTP middleware, database queries, cache lookups) and LLM-specific spans in the same trace. A user request hits your API server (HTTP span), authenticates (auth span), retrieves context from a vector database (DB span), calls an LLM (GenAI span), and returns a response (HTTP span). CostHawk integrates with OpenTelemetry to consume both types, providing unified latency analysis alongside LLM-specific cost attribution.

Instrumenting Spans in Your AI Application

Capturing span data requires instrumentation — the code or middleware that creates, populates, and exports spans. There are three levels of instrumentation, from least to most effort:

Level 1: Auto-instrumentation (zero code changes)

Many AI frameworks and proxy layers automatically create spans for LLM calls. If you use LangChain, LlamaIndex, or the Vercel AI SDK, you can enable tracing with a single configuration change:

// LangChain / LangSmith auto-instrumentation is enabled via environment
// variables; no code changes are required:
//   LANGCHAIN_TRACING_V2=true
//   LANGCHAIN_API_KEY=<your LangSmith key>

// Vercel AI SDK with OpenTelemetry
import { generateText } from "ai"
import { openai } from "@ai-sdk/openai"

const result = await generateText({
  model: openai("gpt-4o"),
  prompt: "Summarize this document.",
  experimental_telemetry: { isEnabled: true }
})

Auto-instrumentation captures the standard GenAI attributes (model, tokens, duration) but may miss custom metadata specific to your application.

Level 2: SDK-based instrumentation (minimal code changes)

AI observability platforms provide SDK wrappers that instrument LLM client libraries. You wrap your existing OpenAI or Anthropic client and spans are captured automatically:

// Helicone-style header injection: route requests through the proxy
import OpenAI from "openai"

const viaHelicone = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": "Bearer sk-helicone-xxx",
    "Helicone-Property-Feature": "search",
  },
})

// CostHawk wrapped key approach
const viaCostHawk = new OpenAI({
  apiKey: "chk_wrapped_key_xxx"  // CostHawk wrapped key
  // All requests automatically traced with span data
})

CostHawk's wrapped key approach is the lowest-friction option: replace your API key with a CostHawk wrapped key, and every request is automatically traced with full span data — model, tokens, cost, latency, and any custom tags you attach via headers.

Level 3: Manual instrumentation (full control)

For maximum flexibility, use the OpenTelemetry SDK directly to create custom spans with whatever attributes you need:

import { trace } from "@opentelemetry/api"

const tracer = trace.getTracer("my-ai-app")

async function ragPipeline(query: string) {
  return tracer.startActiveSpan("rag.pipeline", async (rootSpan) => {
    try {
      // Embedding span
      const embedding = await tracer.startActiveSpan("embed.query", async (span) => {
        try {
          const result = await embedQuery(query)
          span.setAttribute("gen_ai.usage.input_tokens", result.tokens)
          span.setAttribute("gen_ai.request.model", "text-embedding-3-small")
          return result
        } finally {
          span.end()  // end the span even if embedQuery throws
        }
      })

      // Retrieval span
      const docs = await tracer.startActiveSpan("vectordb.search", async (span) => {
        try {
          const results = await vectorSearch(embedding, { topK: 10 })
          span.setAttribute("vectordb.top_k", 10)
          span.setAttribute("vectordb.results_count", results.length)
          return results
        } finally {
          span.end()
        }
      })

      // LLM synthesis span
      const answer = await tracer.startActiveSpan("llm.synthesize", async (span) => {
        try {
          const response = await llmCall(query, docs)
          span.setAttribute("gen_ai.request.model", "claude-3-5-sonnet")
          span.setAttribute("gen_ai.usage.input_tokens", response.inputTokens)
          span.setAttribute("gen_ai.usage.output_tokens", response.outputTokens)
          span.setAttribute("costhawk.cost.total", response.cost)
          return response
        } finally {
          span.end()
        }
      })

      return answer
    } finally {
      rootSpan.end()  // always close the root span, even on failure
    }
  })
}

Manual instrumentation gives you full control over span names, attributes, and nesting. It is the best approach for custom pipelines that are not covered by framework auto-instrumentation. The tradeoff is more code to maintain, but the observability gains — especially for cost attribution — are substantial.

Using Spans for Cost Debugging

Span-level cost data transforms debugging from guesswork into precision analysis. Here are five concrete debugging workflows that spans enable:

1. Finding the most expensive span in a trace. When a single request costs more than expected, open the trace and sort spans by cost. The most expensive span immediately reveals the culprit — perhaps a synthesis call received too much context (high input tokens) or generated an excessively long response (high output tokens). Without spans, you would only see the total cost with no breakdown.

2. Detecting agent loops. Agentic systems sometimes enter loops where the agent repeatedly calls the same tool or makes the same LLM call with slightly different inputs. In a span tree, this appears as an unusually high number of sibling spans of the same type. Set an alert on span count per trace: if a trace exceeds your expected maximum (e.g., 20 LLM spans), investigate. CostHawk flags traces where span count deviates from the rolling average by more than 2 standard deviations.

3. Identifying context window bloat. Over time, applications that append to conversation history or accumulate retrieved documents will see input token counts grow per span. Query your span data for the trend of gen_ai.usage.input_tokens over time for a specific span name (e.g., llm.chat.synthesize). If the P50 input tokens increased from 2,000 to 8,000 over two weeks, your context management has a leak. The span data pinpoints exactly which operation is receiving the bloated context.

4. Comparing model costs across the same operation. If you are A/B testing models (e.g., routing 50% of synthesis calls to Claude Sonnet and 50% to GPT-4o), span data lets you compare cost and latency side by side. Filter spans by name, group by gen_ai.request.model, and compare the distributions. You might discover that GPT-4o is 15% cheaper for this specific operation while delivering equivalent quality, justifying a full routing switch.

5. Tracing cost through multi-tenant systems. If your application serves multiple customers, attaching a customer_id attribute to spans enables per-customer cost attribution. Aggregate span costs by customer to identify your most expensive accounts, calculate per-customer margins, and detect usage anomalies. A customer whose average request cost suddenly triples may be sending unusually large inputs or triggering complex agent paths.
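
Workflow 5 reduces to a group-by over a span attribute. A minimal sketch, assuming each span record carries a customer ID and a computed cost:

```typescript
interface CostedSpan {
  customerId: string
  costUsd: number
}

// Sum span costs per customer to find the most expensive accounts.
function costByCustomer(spans: CostedSpan[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const s of spans) {
    totals.set(s.customerId, (totals.get(s.customerId) ?? 0) + s.costUsd)
  }
  return totals
}

const sample: CostedSpan[] = [
  { customerId: "acme", costUsd: 0.021 },
  { customerId: "acme", costUsd: 0.047 },
  { customerId: "globex", costUsd: 0.008 },
]
costByCustomer(sample).get("acme")  // ≈ 0.068
```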

Building span-based cost dashboards: The most actionable dashboards combine span data across traces:

  • Cost by span name — shows which operations drive the most aggregate cost
  • P95 cost per span — reveals the long tail of expensive operations
  • Span cost as percentage of trace cost — identifies the dominant cost center in your pipeline
  • Span count distribution — detects variability in pipeline complexity

CostHawk's span analytics provide all of these views out of the box, with the ability to filter by time range, model, tag, project, and customer.

Span Sampling and Storage Strategies

At scale, capturing and storing every span from every request can become expensive itself. A production application handling 1 million requests per day, each generating 5 spans, produces 5 million spans per day — roughly 150 million per month. At an average of 500 bytes per span, that is 75 GB of span data per month before indexing overhead. Span sampling and retention strategies are essential for keeping observability costs proportional to the value they provide.
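
The back-of-envelope numbers above generalize to a quick estimator:

```typescript
// Rough monthly span-storage volume in GB, before compression and indexing.
function monthlySpanStorageGB(
  requestsPerDay: number,
  spansPerRequest: number,
  bytesPerSpan: number,
  daysPerMonth = 30,
): number {
  return (requestsPerDay * spansPerRequest * daysPerMonth * bytesPerSpan) / 1e9
}

monthlySpanStorageGB(1_000_000, 5, 500)  // 75 GB, the figure used above
```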

Head-based sampling: Decide whether to trace a request before it starts, based on a probability or rule. For example, sample 10% of all requests, but always sample requests tagged as priority: high. This is the simplest approach and is supported natively by the OpenTelemetry SDK:

import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base"

// Sample 10% of traces
const sampler = new TraceIdRatioBasedSampler(0.1)

The downside is that you miss interesting traces — if a rare, expensive request is in the unsampled 90%, you will never see it.

Tail-based sampling: Collect all spans initially, then decide which complete traces to keep based on their characteristics. Keep traces that are slow (duration > P95), expensive (cost > $0.10), errored, or anomalous. Discard normal traces. This captures the most valuable data while dramatically reducing storage. Tail-based sampling requires a collection pipeline that buffers traces temporarily (typically 30–60 seconds) before making the keep/discard decision.

Cost-aware sampling: A hybrid strategy particularly suited to AI observability. Always capture spans for LLM calls above a cost threshold (e.g., any span costing more than $0.05). Sample normally for cheap operations. This ensures you never miss expensive operations while reducing data volume from the high-frequency, low-cost operations like embeddings and cache lookups.
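
A keep/discard rule combining these ideas can be sketched as a predicate evaluated at the tail of the collection pipeline. The thresholds are illustrative, not recommendations:

```typescript
interface TraceSummary {
  totalCostUsd: number
  durationMs: number
  hadError: boolean
}

// Always keep errored, expensive, or slow traces; sample the rest at 10%.
function shouldKeepTrace(
  t: TraceSummary,
  opts = { costThresholdUsd: 0.05, latencyThresholdMs: 5000, sampleRate: 0.1 },
): boolean {
  if (t.hadError) return true
  if (t.totalCostUsd >= opts.costThresholdUsd) return true
  if (t.durationMs >= opts.latencyThresholdMs) return true
  return Math.random() < opts.sampleRate
}

shouldKeepTrace({ totalCostUsd: 0.12, durationMs: 800, hadError: false })  // true: over cost threshold
```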

Retention tiers: Not all span data needs the same retention period. A practical tiering strategy:

Tier   Data                                                            Retention   Storage
Hot    Full span details (attributes, events, links)                   7–14 days   Indexed, queryable
Warm   Aggregated span metrics (count, P50/P95 duration, total cost)   90 days     Time-series DB
Cold   Sampled raw spans for audit                                     1 year      Object storage (S3)

CostHawk handles sampling and retention automatically for customers using its managed tracing pipeline. For teams exporting spans via OpenTelemetry, CostHawk accepts OTLP-format spans and applies cost-aware tail-based sampling to ensure expensive and anomalous traces are always retained while keeping storage costs manageable.

Cost of observability itself: A common anti-pattern is spending more on observability than the savings it enables. If your AI API bill is $5,000/month and your observability platform costs $3,000/month, the economics do not work. Span sampling, aggregation, and retention tiers keep observability costs to 2–5% of the monitored AI spend, ensuring a strong return on investment.

FAQ

Frequently Asked Questions

What is the difference between a span and a trace?
A trace is the complete record of a single end-to-end request as it moves through your system. A span is one individual operation within that trace. Think of a trace as a tree and spans as the nodes in that tree. Every trace contains one or more spans, connected by parent-child relationships. The root span represents the top-level operation (e.g., the HTTP request handler), and child spans represent sub-operations (LLM calls, database queries, tool executions). All spans in a trace share the same traceId, which is how observability tools stitch them together into a coherent view. A simple request-response LLM call produces a trace with just 2 spans (the HTTP handler and the LLM call). A complex agentic workflow might produce a trace with 30+ spans representing planning, tool use, retrieval, and synthesis steps. The trace gives you the big picture; the spans give you the details. CostHawk computes both trace-level cost (sum of all span costs) and per-span cost to support both high-level and granular analysis.
How do spans help with AI cost optimization?
Spans enable cost optimization by providing per-operation cost attribution that reveals exactly where money is being spent within each request. Without spans, you see a total cost per API call but cannot break it down further. With spans, you can identify which step in a multi-step pipeline is the most expensive, which model is being used at each step, and whether any steps are consuming more tokens than necessary. For example, spans might reveal that your guardrail check (which runs on every request) is using Claude Sonnet when Haiku would suffice, costing 3.75x more per check. Or they might show that your retrieval step is sending 8,000 tokens of context to the synthesis LLM when only 2,000 are semantically relevant. Spans also enable cost comparison across A/B tests — you can route 50% of requests through a new pipeline configuration and compare per-span costs to measure the exact savings. Teams that adopt span-level cost monitoring typically find optimization opportunities worth 20-40% of their total AI spend within the first month.
Do I need OpenTelemetry to use spans for AI observability?
No, OpenTelemetry is the industry standard but not the only option. Several AI observability platforms provide their own span instrumentation that does not require the full OpenTelemetry SDK. LangSmith uses LangChain's built-in tracing callbacks. Helicone captures spans through its proxy layer. Braintrust uses its own logging SDK. CostHawk captures span data through wrapped API keys — you replace your provider API key with a CostHawk key and spans are captured automatically without any OpenTelemetry setup. However, OpenTelemetry is the recommended approach for teams that want vendor-neutral tracing, already have OTel infrastructure for traditional microservice observability, or need to export spans to multiple backends simultaneously. The OpenTelemetry GenAI semantic conventions are becoming the standard attribute schema for LLM spans, and most AI observability platforms can ingest OTel-format spans via the OTLP protocol. If you are starting from scratch, CostHawk's wrapped key approach is the fastest path to span-level visibility. If you already have OpenTelemetry deployed, CostHawk accepts OTLP exports and enriches spans with cost data.
What attributes should I add to LLM spans beyond the defaults?
Beyond the standard GenAI semantic conventions (model, tokens, temperature, finish reason), there are several custom attributes that dramatically improve the value of your span data. First, add business context: feature (which product feature triggered this span), customer_id (for per-customer cost attribution), environment (dev/staging/prod), and version (your application version). Second, add pipeline context: pipeline_name (e.g., "rag-v2" or "agent-planner"), step_index (position in the pipeline), and retry_count (whether this is a retry of a failed call). Third, add quality signals: cache_hit (was the prompt cache used), fallback_model (was this a fallback from a primary model), and output_truncated (did the response hit max_tokens). These attributes enable powerful queries: "Show me the P95 cost of the summarization step for customer X in production, excluding cache hits" or "Compare the average latency of the planner span between pipeline v2 and v3." The investment in rich span attributes pays for itself many times over in debugging speed and optimization insights.
How do spans work in streaming LLM responses?
When an LLM response is streamed (using server-sent events or WebSocket), the span still represents the entire operation from request start to the final token. The span's startTime is when the request was sent, and the endTime is when the last token (or the stream-close event) was received. The span's duration therefore reflects the total generation time, not the time to first token. To capture TTFT (time to first token) within a streaming span, most instrumentation libraries record a span event — a timestamped annotation within the span — at the moment the first token arrives. For example: span.addEvent("gen_ai.first_token", { timestamp: firstTokenTime }). The token counts (gen_ai.usage.input_tokens and gen_ai.usage.output_tokens) are typically populated when the stream completes, since most providers send a final usage object in the last chunk of the stream. Cost calculation also happens at stream completion, once the final token counts are known. If the stream is interrupted (client disconnect, timeout), the span should still be ended and marked with an error status, and the token counts should reflect whatever was actually consumed — providers bill for tokens generated even if the client disconnects before receiving them all.
Can spans detect runaway agentic loops?
Yes, and this is one of the most valuable applications of span-level observability for agentic AI systems. A runaway loop occurs when an agent repeatedly calls tools or makes LLM calls without converging on an answer — typically due to ambiguous instructions, tool failures that the agent keeps retrying, or circular reasoning patterns. Spans make loops visible in several ways: (1) Span count per trace increases dramatically — a trace that normally has 5-8 spans suddenly has 40+. Set an alert when span count exceeds a threshold. (2) Repeated span names appear as siblings — if you see 12 consecutive llm.chat.plan spans at the same depth, the agent is stuck in a planning loop. (3) Trace cost escalates — each LLM span adds cost, so a runaway trace might cost $0.50 when the average is $0.05. CostHawk's anomaly detection monitors both span count and trace cost, flagging runaway traces in real time. For prevention, many teams implement a max_iterations guard at the orchestrator level and record the iteration count as a span attribute, making it easy to query how often agents approach the limit.
What is the performance overhead of capturing spans?
The overhead of span instrumentation is minimal relative to the operations being traced. Creating a span object, setting attributes, and exporting it asynchronously typically adds 0.1–0.5 milliseconds of CPU time per span. Given that a single LLM call takes 500–5,000 milliseconds, the tracing overhead is less than 0.1% of the operation duration — effectively invisible to end users. The more meaningful cost is network and storage overhead. Each span is typically 200–800 bytes when serialized to OTLP protobuf format. At 5 spans per request and 100,000 requests per day, that is roughly 200–400 MB of span data per day before compression. With gzip compression (standard for OTLP export), this drops to 40–80 MB. The OpenTelemetry SDK uses asynchronous batch export by default — spans are buffered in memory and sent to the collector in batches every 5 seconds, so network I/O does not block your application's hot path. The only scenario where overhead becomes noticeable is extremely high-throughput applications (millions of requests per minute) with deep trace trees (50+ spans per trace). In these cases, span sampling reduces the volume to manageable levels without sacrificing visibility into important traces.
How do I query and analyze span data effectively?
Effective span analysis requires thinking in terms of aggregations across traces rather than individual span inspection. While drilling into a single trace is useful for debugging one request, the real value comes from querying patterns across thousands of traces. Key query patterns include: (1) Group by span name, aggregate cost — shows which operation types are the most expensive in aggregate (e.g., synthesis LLM calls account for 72% of total spend). (2) Filter by attribute, compute percentiles — for example, "P95 duration of vectordb.search spans where environment=production" reveals whether your vector database is meeting latency SLAs. (3) Time-series of span metrics — track how average input tokens per synthesis span changes over time to detect context bloat. (4) Trace-level roll-ups — sum span costs within each trace, then compute the distribution of trace costs to understand your cost profile. (5) Correlation analysis — correlate span count with trace cost to distinguish traces that are expensive because of many cheap spans versus few expensive spans. CostHawk provides a SQL-like query interface over span data, plus pre-built dashboards for the most common analysis patterns. For teams that export spans to their own data warehouse (BigQuery, Snowflake, ClickHouse), the same query patterns apply using standard SQL.

Related Terms

Tracing

The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.


OpenTelemetry

An open-source observability framework providing a vendor-neutral standard (OTLP) for collecting traces, metrics, and logs from distributed systems. OpenTelemetry is rapidly becoming the standard instrumentation layer for LLM applications, enabling teams to track latency, token usage, cost, and quality across every inference call.


LLM Observability

The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.


Logging

Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.


Cost Per Query

The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.


Agentic AI

AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.

