AI Cost Glossary
Every term, metric, and concept that matters when you're tracking and optimizing AI spend. From tokens to budget alerts, this is the reference your team needs.
A
Agentic AI
Usage & Metering
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
AI Cost Allocation
Billing & Pricing
The practice of attributing AI API costs to specific teams, projects, features, or customers — enabling accountability, budgeting, and optimization at the organizational level.
AI ROI (Return on Investment)
Billing & Pricing
The financial return generated by AI investments relative to their total cost. AI ROI is uniquely challenging to measure because the benefits — productivity gains, quality improvements, faster time-to-market — are often indirect, distributed across teams, and difficult to isolate from other variables. Rigorous ROI measurement requires a framework that captures both hard-dollar savings and soft-value gains.
Alerting
Observability
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
API Gateway
Infrastructure
A centralized entry point for API traffic that handles routing, authentication, rate limiting, and request transformation. For LLM APIs, gateways add cost tracking, policy enforcement, and provider abstraction.
API Key Management
Security & Compliance
Securing, rotating, scoping, and tracking API credentials across AI providers. Effective key management is the foundation of both cost attribution and security — every unmanaged key is a potential source of untracked spend and unauthorized access.
C
Chargeback & Showback
Billing & Pricing
Two complementary FinOps models for assigning AI cost accountability across teams and business units. Showback reports costs to each team for visibility and behavioral nudging without financial consequences. Chargeback bills teams directly from their departmental budgets for the AI resources they consume, creating hard financial accountability. Both models are essential for organizations scaling AI beyond a single team or project.
Context Window
Usage & Metering
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
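A quick pre-flight check against the context window can be sketched in Python. The four-characters-per-token estimate and the window sizes in the example are rough heuristics, not exact provider figures:

```python
def fits_context(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    """Rough pre-flight check that a request fits a model's context window,
    using the ~4 characters per token heuristic for English text. A real
    tokenizer gives exact counts; this only catches obvious overflows."""
    estimated_input_tokens = len(prompt) / 4
    return estimated_input_tokens + max_output_tokens <= context_window

# A 20,000-character prompt is roughly 5,000 input tokens.
fits_context("x" * 20_000, max_output_tokens=4_096, context_window=128_000)  # True
fits_context("x" * 20_000, max_output_tokens=4_096, context_window=8_192)    # False
```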
Cost Anomaly Detection
Observability
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
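A minimal spike detector can be sketched with a z-score over recent daily spend; the three-standard-deviation threshold below is a common starting point, not a universal rule:

```python
from statistics import mean, stdev

def is_spend_anomaly(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's spend when it sits more than `threshold` standard
    deviations from the recent daily baseline (a simple z-score detector)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

daily_spend = [42.0, 45.5, 39.8, 44.2, 41.1, 43.7, 40.9]
is_spend_anomaly(daily_spend, 43.0)   # False: within the normal range
is_spend_anomaly(daily_spend, 180.0)  # True: a 4x spike
```

Production detectors usually add seasonality handling (weekday vs. weekend baselines) and gradual-drift checks on top of this.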
Cost Per Query
Billing & Pricing
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Cost Per Token
Billing & Pricing
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
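The arithmetic behind per-token billing is straightforward; the prices below are illustrative placeholders, not any provider's actual rates:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Illustrative prices only; check your provider's current rate card.
request_cost(1_500, 500, input_price_per_m=3.00, output_price_per_m=15.00)
# 1,500 tokens in at $3/1M plus 500 tokens out at $15/1M = $0.012
```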
E
Embedding
Infrastructure
A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.
Evals
Observability
Systematic evaluation of LLM output quality using automated metrics, human review, or LLM-as-judge methodologies. Evals are the quality gate that ensures cost optimizations — model downgrades, prompt compression, caching — do not silently degrade the user experience.
F
Failover
Infrastructure
Automatically switching to a backup LLM provider when the primary fails or becomes unavailable. Failover prevents user-facing downtime in AI-powered features but introduces cost implications when backup providers have different pricing. A well-designed failover strategy balances reliability against budget impact.
Fine-Tuning
Optimization
The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.
Foundation Model
Infrastructure
A large, general-purpose AI model pre-trained on broad data that serves as the base for downstream applications. Foundation models like GPT-4, Claude, Gemini, and Llama represent enormous upfront training investments whose costs are amortized across millions of API consumers. Choosing the right foundation model determines both baseline capability and baseline cost for every AI-powered feature you build.
I
Inference
Infrastructure
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Input vs. Output Tokens
Billing & Pricing
The two token directions in every LLM API call, each priced differently. Output tokens cost 3–5x more than input tokens across all major providers.
L
Large Language Model (LLM)
Infrastructure
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Latency
Observability
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
LLM Gateway
Infrastructure
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
LLM Observability
Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
LLM Proxy
Infrastructure
A transparent intermediary that sits between your application and LLM providers, forwarding requests while adding tracking, caching, or policy enforcement without code changes. Proxies intercept standard SDK traffic, log usage metadata, and optionally transform requests before relaying them upstream.
Load Balancing
Infrastructure
Distributing LLM API requests across multiple provider accounts, endpoints, or models to optimize for cost, latency, and availability. Load balancing prevents rate limit exhaustion on any single account and enables cost-aware request distribution.
Logging
Observability
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
M
Max Tokens
Usage & Metering
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Model Context Protocol (MCP)
Infrastructure
An open protocol for connecting AI assistants to external tools and data sources via a standardized client-server architecture. MCP enables AI coding assistants like Claude Code and GitHub Copilot to query cost data, run analyses, set budgets, and take actions without leaving the development environment.
Model Routing
Optimization
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
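A minimal router might look like the sketch below; the keyword heuristic, length cutoff, and model names are illustrative stand-ins for a real classifier and real model identifiers:

```python
def route_model(prompt: str) -> str:
    """Toy complexity-based router: short, simple prompts go to a cheap
    model, long or reasoning-heavy prompts to a stronger one. Model names
    and thresholds here are placeholders, not real identifiers."""
    reasoning_markers = ("prove", "analyze", "step by step", "compare")
    needs_reasoning = any(m in prompt.lower() for m in reasoning_markers)
    if needs_reasoning or len(prompt) > 2_000:
        return "frontier-model"
    return "budget-model"

route_model("Summarize this sentence in five words.")         # "budget-model"
route_model("Analyze the tradeoffs of caching step by step")  # "frontier-model"
```

Real routers typically replace the keyword check with a small classifier model and fall back to the stronger model when confidence is low.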
Multi-Modal Model
Billing & Pricing
An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.
P
P95 / P99 Latency
Observability
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
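Percentiles are easy to compute from raw latency samples; the sketch below uses the simple nearest-rank method (production monitoring systems often interpolate between samples instead):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the sample value at or below which
    p% of all samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [120, 135, 140, 150, 155, 160, 170, 180, 400, 2500]
percentile(latencies_ms, 50)  # 155: the typical request
percentile(latencies_ms, 90)  # 400: where the slow tail begins
percentile(latencies_ms, 99)  # 2500: the worst-case outlier
```

Note how the average (~411 ms here) is dragged up by one outlier while P50 stays at 155 ms, which is exactly why percentiles, not means, anchor SLAs.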
Pay-Per-Token
Billing & Pricing
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Prompt Caching
Optimization
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50–90% on subsequent requests.
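The savings can be estimated as a blended price; the 90% discount on cached tokens below is an assumption for illustration, since discounts and minimum-prefix rules vary by provider:

```python
def effective_input_price(base_price_per_m: float, cache_hit_rate: float,
                          cached_discount: float = 0.9) -> float:
    """Blended per-million input price when a fraction of input tokens hit
    the prompt cache. `cached_discount=0.9` models a 90% price cut on
    cached tokens; actual discounts and eligibility rules vary by provider."""
    cached = cache_hit_rate * base_price_per_m * (1 - cached_discount)
    uncached = (1 - cache_hit_rate) * base_price_per_m
    return cached + uncached

effective_input_price(3.00, cache_hit_rate=0.8)  # 0.84 per million: a 72% saving
```

To maximize the hit rate, put stable content (system prompt, few-shot examples, reference documents) at the front of the prompt and variable content at the end.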
Prompt Compression
Optimization
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Prompt Engineering
Optimization
The practice of designing, structuring, and iterating on the text inputs (prompts) sent to large language models to elicit desired outputs. Prompt engineering directly affects AI API costs through two mechanisms: the token count of the prompt itself (input cost) and the length and quality of the model's response (output cost). A well-engineered prompt can reduce total per-request cost by 40–70% compared to a naive prompt while maintaining or improving output quality.
Provisioned Throughput
Infrastructure
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
R
Rate Limiting
Infrastructure
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
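The standard client-side response to a 429 is exponential backoff with jitter. The sketch below assumes a hypothetical `RateLimitError` raised by your SDK wrapper; real provider SDKs each define their own error type:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider SDK's HTTP 429 error type."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on rate-limit errors, sleeping exponentially longer
    between attempts with random jitter to avoid thundering herds.
    Production clients should also honor the Retry-After header when
    the provider sends one."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget of retries exhausted; surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```

Wrapping an API call is then a one-liner, e.g. `with_backoff(lambda: client.complete(prompt))` for a hypothetical client object.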
Retrieval-Augmented Generation (RAG)
Optimization
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
S
Semantic Caching
Optimization
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20–40%.
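The core mechanism can be sketched in a few lines; the `embed` function here is an injected stand-in for a real embedding model, and the 0.95 similarity threshold is a tunable assumption:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    """Minimal semantic cache: return a stored response when a new query's
    embedding is close enough to a cached one. `embed` stands in for a
    real embedding model; a linear scan stands in for a vector index."""
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str):
        vec = self.embed(query)
        for cached_vec, response in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return response  # cache hit: no API call needed
        return None              # cache miss: call the model, then put()

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

Setting the threshold too low serves wrong answers to merely related queries, so the threshold should be tuned against an eval set, not guessed.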
Serverless Inference
Infrastructure
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include Amazon Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.
Spans
Observability
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
T
Temperature
Usage & Metering
A sampling parameter (typically 0–2) that controls the randomness and creativity of LLM outputs. Higher temperature values produce more diverse and unpredictable responses but can increase output length and token consumption, indirectly raising API costs. Temperature tuning is a critical lever for balancing output quality against spend.
Throughput
Observability
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Time to First Token (TTFT)
Observability
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Token
Billing & Pricing
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Budget
Billing & Pricing
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
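A minimal in-memory enforcement sketch follows; a production system would persist counters, reset them each billing period, and alert before the hard cap is hit:

```python
class TokenBudget:
    """Hard spend cap for a project or time period. Requests that would
    exceed the cap are refused rather than billed."""
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a request's cost; return False if it would exceed the cap."""
        if self.spent_usd + cost_usd > self.limit_usd:
            return False
        self.spent_usd += cost_usd
        return True

budget = TokenBudget(limit_usd=10.0)
budget.charge(7.50)  # True: $7.50 of $10.00 used
budget.charge(4.00)  # False: the request is blocked, not billed
```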
Token Pricing
Billing & Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Tokenization
Billing & Pricing
The process of splitting raw text into discrete sub-word units called tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. Tokenization is the invisible first step of every LLM API call and directly determines how many tokens you are billed for — identical text fed through different tokenizers can produce token counts that vary by 10–40%, making tokenizer choice a material cost factor.
Tokens Per Second (TPS)
Observability
The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.
Total Cost of Ownership (TCO) for AI
Billing & Pricing
The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.
Tracing
Observability
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
Transformer
Infrastructure
The foundational neural network architecture behind all modern large language models. Introduced in the 2017 paper 'Attention Is All You Need,' the transformer uses self-attention mechanisms to process sequences in parallel, enabling the scaling breakthroughs that power GPT, Claude, and Gemini. Understanding transformer architecture explains why API costs scale with context length and why inference is computationally expensive.
W
Webhook
Infrastructure
An HTTP callback that pushes real-time notifications when events occur — cost threshold breaches, anomaly detection alerts, usage milestones. Webhooks are the delivery mechanism that turns passive monitoring into active, automated response workflows across Slack, PagerDuty, Discord, and any HTTP endpoint.
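A common pattern is to sign the webhook body so receivers can verify authenticity before acting on it; the event name and payload fields below are illustrative, not any particular product's schema:

```python
import hashlib
import hmac
import json

def build_webhook(event: str, data: dict, secret: bytes):
    """Serialize a webhook event and compute an HMAC-SHA256 signature
    that the receiver can recompute and verify. Event name and field
    layout are illustrative placeholders."""
    body = json.dumps({"event": event, "data": data}, sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature

body, sig = build_webhook(
    "budget.threshold_exceeded",
    {"project": "chat-app", "spend_usd": 512.40, "limit_usd": 500.00},
    secret=b"shared-secret",
)
# The receiver recomputes the HMAC over the raw body and compares digests
# with hmac.compare_digest before trusting the payload.
```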
Wrapped Keys
Security & Compliance
Proxy API keys that route provider SDK traffic through a cost tracking layer. The original provider key never leaves the server, while the wrapped key provides per-key attribution, budget enforcement, and policy controls without requiring application code changes beyond a base URL swap.
Browse by Category
Billing & Pricing
AI Cost Allocation
The practice of attributing AI API costs to specific teams, projects, features, or customers — enabling accountability, budgeting, and optimization at the organizational level.
AI ROI (Return on Investment)
The financial return generated by AI investments relative to their total cost. AI ROI is uniquely challenging to measure because the benefits — productivity gains, quality improvements, faster time-to-market — are often indirect, distributed across teams, and difficult to isolate from other variables. Rigorous ROI measurement requires a framework that captures both hard-dollar savings and soft-value gains.
Chargeback & Showback
Two complementary FinOps models for assigning AI cost accountability across teams and business units. Showback reports costs to each team for visibility and behavioral nudging without financial consequences. Chargeback bills teams directly from their departmental budgets for the AI resources they consume, creating hard financial accountability. Both models are essential for organizations scaling AI beyond a single team or project.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens cost 3–5x more than input tokens across all major providers.
Multi-Modal Model
An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Budget
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Tokenization
The process of splitting raw text into discrete sub-word units called tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. Tokenization is the invisible first step of every LLM API call and directly determines how many tokens you are billed for — identical text fed through different tokenizers can produce token counts that vary by 10–40%, making tokenizer choice a material cost factor.
Total Cost of Ownership (TCO) for AI
The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.
Unit Economics
The cost and revenue associated with a single unit of your AI-powered product — whether that unit is a query, a user session, a transaction, or an API call. Unit economics tell you whether each interaction your product serves is profitable or loss-making, and by how much. For AI features built on LLM APIs, unit economics are uniquely volatile because inference costs vary by model, prompt length, and output complexity, making per-unit cost tracking essential for sustainable growth.
Usage & Metering
Agentic AI
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Max Tokens
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Temperature
A sampling parameter (typically 0–2) that controls the randomness and creativity of LLM outputs. Higher temperature values produce more diverse and unpredictable responses but can increase output length and token consumption, indirectly raising API costs. Temperature tuning is a critical lever for balancing output quality against spend.
Optimization
Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
Fine-Tuning
The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50–90% on subsequent requests.
Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Prompt Engineering
The practice of designing, structuring, and iterating on the text inputs (prompts) sent to large language models to elicit desired outputs. Prompt engineering directly affects AI API costs through two mechanisms: the token count of the prompt itself (input cost) and the length and quality of the model's response (output cost). A well-engineered prompt can reduce total per-request cost by 40–70% compared to a naive prompt while maintaining or improving output quality.
Retrieval-Augmented Generation (RAG)
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
Semantic Caching
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20–40%.
Infrastructure
API Gateway
A centralized entry point for API traffic that handles routing, authentication, rate limiting, and request transformation. For LLM APIs, gateways add cost tracking, policy enforcement, and provider abstraction.
Embedding
A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.
Failover
Automatically switching to a backup LLM provider when the primary fails or becomes unavailable. Failover prevents user-facing downtime in AI-powered features but introduces cost implications when backup providers have different pricing. A well-designed failover strategy balances reliability against budget impact.
Foundation Model
A large, general-purpose AI model pre-trained on broad data that serves as the base for downstream applications. Foundation models like GPT-4, Claude, Gemini, and Llama represent enormous upfront training investments whose costs are amortized across millions of API consumers. Choosing the right foundation model determines both baseline capability and baseline cost for every AI-powered feature you build.
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.
Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
LLM Gateway
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
LLM Proxy
A transparent intermediary that sits between your application and LLM providers, forwarding requests while adding tracking, caching, or policy enforcement without code changes. Proxies intercept standard SDK traffic, log usage metadata, and optionally transform requests before relaying them upstream.
Load Balancing
Distributing LLM API requests across multiple provider accounts, endpoints, or models to optimize for cost, latency, and availability. Load balancing prevents rate limit exhaustion on any single account and enables cost-aware request distribution.
Model Context Protocol (MCP)
An open protocol for connecting AI assistants to external tools and data sources via a standardized client-server architecture. MCP enables AI coding assistants like Claude Code and GitHub Copilot to query cost data, run analyses, set budgets, and take actions without leaving the development environment.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Serverless Inference
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include Amazon Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.
Transformer
The foundational neural network architecture behind all modern large language models. Introduced in the 2017 paper 'Attention Is All You Need,' the transformer uses self-attention mechanisms to process sequences in parallel, enabling the scaling breakthroughs that power GPT, Claude, and Gemini. Understanding transformer architecture explains why API costs scale with context length and why inference is computationally expensive.
Webhook
An HTTP callback that pushes real-time notifications when events occur — cost threshold breaches, anomaly detection alerts, usage milestones. Webhooks are the delivery mechanism that turns passive monitoring into active, automated response workflows across Slack, PagerDuty, Discord, and any HTTP endpoint.
Security & Compliance
API Key Management
Securing, rotating, scoping, and tracking API credentials across AI providers. Effective key management is the foundation of both cost attribution and security — every unmanaged key is a potential source of untracked spend and unauthorized access.
Wrapped Keys
Proxy API keys that route provider SDK traffic through a cost tracking layer. The original provider key never leaves the server, while the wrapped key provides per-key attribution, budget enforcement, and policy controls without requiring application code changes beyond a base URL swap.
Observability
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
Cost Anomaly Detection
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
Dashboards
Visual interfaces for monitoring AI cost, usage, and performance metrics in real-time. The command center for AI cost management — dashboards aggregate token spend, model utilization, latency, and budget health into a single pane of glass.
Evals
Systematic evaluation of LLM output quality using automated metrics, human review, or LLM-as-judge methodologies. Evals are the quality gate that ensures cost optimizations — model downgrades, prompt compression, caching — do not silently degrade the user experience.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
OpenTelemetry
An open-source observability framework providing a vendor-neutral standard (OTLP) for collecting traces, metrics, and logs from distributed systems. OpenTelemetry is rapidly becoming the standard instrumentation layer for LLM applications, enabling teams to track latency, token usage, cost, and quality across every inference call.
P95 / P99 Latency
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
Spans
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Tokens Per Second (TPS)
The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.
Tracing
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
AI Cost Glossary
Stop guessing what your AI costs mean. Start tracking them.
CostHawk gives you one dashboard for every provider, every model, and every dollar. Connect in minutes with MCP telemetry or wrapped keys.
