AI Cost Glossary
Every term, metric, and concept that matters when you're tracking and optimizing AI spend. From tokens to budget alerts, this is the reference your team needs.
A
Agentic AI
Usage & Metering
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
AI Cost Allocation
Billing & Pricing
The practice of attributing AI API costs to specific teams, projects, features, or customers — enabling accountability, budgeting, and optimization at the organizational level.
AI ROI (Return on Investment)
Billing & Pricing
The financial return generated by AI investments relative to their total cost. AI ROI is uniquely challenging to measure because the benefits — productivity gains, quality improvements, faster time-to-market — are often indirect, distributed across teams, and difficult to isolate from other variables. Rigorous ROI measurement requires a framework that captures both hard-dollar savings and soft-value gains.
Alerting
Observability
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
API Gateway
Infrastructure
A centralized entry point for API traffic that handles routing, authentication, rate limiting, and request transformation. For LLM APIs, gateways add cost tracking, policy enforcement, and provider abstraction.
API Key Management
Security & Compliance
Securing, rotating, scoping, and tracking API credentials across AI providers. Effective key management is the foundation of both cost attribution and security — every unmanaged key is a potential source of untracked spend and unauthorized access.
C
Chargeback & Showback
Billing & Pricing
Two complementary FinOps models for assigning AI cost accountability across teams and business units. Showback reports costs to each team for visibility and behavioral nudging without financial consequences. Chargeback bills teams directly from their departmental budgets for the AI resources they consume, creating hard financial accountability. Both models are essential for organizations scaling AI beyond a single team or project.
Context Window
Usage & Metering
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
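A quick pre-flight check against the context window can be sketched in Python. The four-characters-per-token estimate and the window sizes in the example are rough heuristics, not exact provider figures:

```python
def fits_context(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    """Rough pre-flight check that a request fits a model's context window,
    using the ~4 characters per token heuristic for English text. A real
    tokenizer gives exact counts; this only catches obvious overflows."""
    estimated_input_tokens = len(prompt) / 4
    return estimated_input_tokens + max_output_tokens <= context_window

# A 20,000-character prompt is roughly 5,000 input tokens.
fits_context("x" * 20_000, max_output_tokens=4_096, context_window=128_000)  # True
fits_context("x" * 20_000, max_output_tokens=4_096, context_window=8_192)    # False
```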
Cost Anomaly Detection
Observability
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
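A minimal spike detector can be sketched with a z-score over recent daily spend; the three-standard-deviation threshold below is a common starting point, not a universal rule:

```python
from statistics import mean, stdev

def is_spend_anomaly(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's spend when it sits more than `threshold` standard
    deviations from the recent daily baseline (a simple z-score detector)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

daily_spend = [42.0, 45.5, 39.8, 44.2, 41.1, 43.7, 40.9]
is_spend_anomaly(daily_spend, 43.0)   # False: within the normal range
is_spend_anomaly(daily_spend, 180.0)  # True: a 4x spike
```

Production detectors usually add seasonality handling (weekday vs. weekend baselines) and gradual-drift checks on top of this.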
Cost Per Query
Billing & Pricing
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Cost Per Token
Billing & Pricing
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
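The arithmetic behind per-token billing is straightforward; the prices below are illustrative placeholders, not any provider's actual rates:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Illustrative prices only; check your provider's current rate card.
request_cost(1_500, 500, input_price_per_m=3.00, output_price_per_m=15.00)
# 1,500 tokens in at $3/1M plus 500 tokens out at $15/1M = $0.012
```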
E
Embedding
Infrastructure
A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.
Evals
Observability
Systematic evaluation of LLM output quality using automated metrics, human review, or LLM-as-judge methodologies. Evals are the quality gate that ensures cost optimizations — model downgrades, prompt compression, caching — do not silently degrade the user experience.
F
Failover
Infrastructure
Automatically switching to a backup LLM provider when the primary fails or becomes unavailable. Failover prevents user-facing downtime in AI-powered features but introduces cost implications when backup providers have different pricing. A well-designed failover strategy balances reliability against budget impact.
Fine-Tuning
Optimization
The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.
Foundation Model
Infrastructure
A large, general-purpose AI model pre-trained on broad data that serves as the base for downstream applications. Foundation models like GPT-4, Claude, Gemini, and Llama represent enormous upfront training investments whose costs are amortized across millions of API consumers. Choosing the right foundation model determines both baseline capability and baseline cost for every AI-powered feature you build.
I
Inference
Infrastructure
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Input vs. Output Tokens
Billing & Pricing
The two token directions in every LLM API call, each priced differently. Output tokens cost 3–5x more than input tokens across all major providers.
L
Large Language Model (LLM)
Infrastructure
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Latency
Observability
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
LLM Gateway
Infrastructure
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
LLM Observability
Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
LLM Proxy
Infrastructure
A transparent intermediary that sits between your application and LLM providers, forwarding requests while adding tracking, caching, or policy enforcement without code changes. Proxies intercept standard SDK traffic, log usage metadata, and optionally transform requests before relaying them upstream.
Load Balancing
Infrastructure
Distributing LLM API requests across multiple provider accounts, endpoints, or models to optimize for cost, latency, and availability. Load balancing prevents rate limit exhaustion on any single account and enables cost-aware request distribution.
Logging
Observability
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
M
Max Tokens
Usage & Metering
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Model Context Protocol (MCP)
Infrastructure
An open protocol for connecting AI assistants to external tools and data sources via a standardized client-server architecture. MCP enables AI coding assistants like Claude Code and GitHub Copilot to query cost data, run analyses, set budgets, and take actions without leaving the development environment.
Model Routing
Optimization
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
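A minimal router might look like the sketch below; the keyword heuristic, length cutoff, and model names are illustrative stand-ins for a real classifier and real model identifiers:

```python
def route_model(prompt: str) -> str:
    """Toy complexity-based router: short, simple prompts go to a cheap
    model, long or reasoning-heavy prompts to a stronger one. Model names
    and thresholds here are placeholders, not real identifiers."""
    reasoning_markers = ("prove", "analyze", "step by step", "compare")
    needs_reasoning = any(m in prompt.lower() for m in reasoning_markers)
    if needs_reasoning or len(prompt) > 2_000:
        return "frontier-model"
    return "budget-model"

route_model("Summarize this sentence in five words.")         # "budget-model"
route_model("Analyze the tradeoffs of caching step by step")  # "frontier-model"
```

Real routers typically replace the keyword check with a small classifier model and fall back to the stronger model when confidence is low.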
Multi-Modal Model
Billing & Pricing
An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.
P
P95 / P99 Latency
Observability
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
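Percentiles are easy to compute from raw latency samples; the sketch below uses the simple nearest-rank method (production monitoring systems often interpolate between samples instead):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the sample value at or below which
    p% of all samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [120, 135, 140, 150, 155, 160, 170, 180, 400, 2500]
percentile(latencies_ms, 50)  # 155: the typical request
percentile(latencies_ms, 90)  # 400: where the slow tail begins
percentile(latencies_ms, 99)  # 2500: the worst-case outlier
```

Note how the average (~411 ms here) is dragged up by one outlier while P50 stays at 155 ms, which is exactly why percentiles, not means, anchor SLAs.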
Pay-Per-Token
Billing & Pricing
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Prompt Caching
Optimization
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50–90% on subsequent requests.
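The savings can be estimated as a blended price; the 90% discount on cached tokens below is an assumption for illustration, since discounts and minimum-prefix rules vary by provider:

```python
def effective_input_price(base_price_per_m: float, cache_hit_rate: float,
                          cached_discount: float = 0.9) -> float:
    """Blended per-million input price when a fraction of input tokens hit
    the prompt cache. `cached_discount=0.9` models a 90% price cut on
    cached tokens; actual discounts and eligibility rules vary by provider."""
    cached = cache_hit_rate * base_price_per_m * (1 - cached_discount)
    uncached = (1 - cache_hit_rate) * base_price_per_m
    return cached + uncached

effective_input_price(3.00, cache_hit_rate=0.8)  # 0.84 per million: a 72% saving
```

To maximize the hit rate, put stable content (system prompt, few-shot examples, reference documents) at the front of the prompt and variable content at the end.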
Prompt Compression
Optimization
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Prompt Engineering
Optimization
The practice of designing, structuring, and iterating on the text inputs (prompts) sent to large language models to elicit desired outputs. Prompt engineering directly affects AI API costs through two mechanisms: the token count of the prompt itself (input cost) and the length and quality of the model's response (output cost). A well-engineered prompt can reduce total per-request cost by 40–70% compared to a naive prompt while maintaining or improving output quality.
Provisioned Throughput
Infrastructure
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
R
Rate Limiting
Infrastructure
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
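The standard client-side response to a 429 is exponential backoff with jitter. The sketch below assumes a hypothetical `RateLimitError` raised by your SDK wrapper; real provider SDKs each define their own error type:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider SDK's HTTP 429 error type."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on rate-limit errors, sleeping exponentially longer
    between attempts with random jitter to avoid thundering herds.
    Production clients should also honor the Retry-After header when
    the provider sends one."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget of retries exhausted; surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```

Wrapping an API call is then a one-liner, e.g. `with_backoff(lambda: client.complete(prompt))` for a hypothetical client object.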
Retrieval-Augmented Generation (RAG)
Optimization
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
S
Semantic Caching
Optimization
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20–40%.
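The core mechanism can be sketched in a few lines; the `embed` function here is an injected stand-in for a real embedding model, and the 0.95 similarity threshold is a tunable assumption:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    """Minimal semantic cache: return a stored response when a new query's
    embedding is close enough to a cached one. `embed` stands in for a
    real embedding model; a linear scan stands in for a vector index."""
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str):
        vec = self.embed(query)
        for cached_vec, response in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return response  # cache hit: no API call needed
        return None              # cache miss: call the model, then put()

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

Setting the threshold too low serves wrong answers to merely related queries, so the threshold should be tuned against an eval set, not guessed.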
Serverless Inference
Infrastructure
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include Amazon Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.
Spans
Observability
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
T
Temperature
Usage & Metering
A sampling parameter (typically 0–2) that controls the randomness and creativity of LLM outputs. Higher temperature values produce more diverse and unpredictable responses but can increase output length and token consumption, indirectly raising API costs. Temperature tuning is a critical lever for balancing output quality against spend.
Throughput
Observability
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Time to First Token (TTFT)
Observability
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Token
Billing & Pricing
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Budget
Billing & Pricing
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
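A minimal in-memory enforcement sketch follows; a production system would persist counters, reset them each billing period, and alert before the hard cap is hit:

```python
class TokenBudget:
    """Hard spend cap for a project or time period. Requests that would
    exceed the cap are refused rather than billed."""
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a request's cost; return False if it would exceed the cap."""
        if self.spent_usd + cost_usd > self.limit_usd:
            return False
        self.spent_usd += cost_usd
        return True

budget = TokenBudget(limit_usd=10.0)
budget.charge(7.50)  # True: $7.50 of $10.00 used
budget.charge(4.00)  # False: the request is blocked, not billed
```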
Token Pricing
Billing & Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Tokenization
Billing & Pricing
The process of splitting raw text into discrete sub-word units called tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. Tokenization is the invisible first step of every LLM API call and directly determines how many tokens you are billed for — identical text fed through different tokenizers can produce token counts that vary by 10–40%, making tokenizer choice a material cost factor.
Tokens Per Second (TPS)
Observability
The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.
Total Cost of Ownership (TCO) for AI
Billing & Pricing
The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.
Tracing
Observability
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
Transformer
Infrastructure
The foundational neural network architecture behind all modern large language models. Introduced in the 2017 paper 'Attention Is All You Need,' the transformer uses self-attention mechanisms to process sequences in parallel, enabling the scaling breakthroughs that power GPT, Claude, and Gemini. Understanding transformer architecture explains why API costs scale with context length and why inference is computationally expensive.
W
Webhook
Infrastructure
An HTTP callback that pushes real-time notifications when events occur — cost threshold breaches, anomaly detection alerts, usage milestones. Webhooks are the delivery mechanism that turns passive monitoring into active, automated response workflows across Slack, PagerDuty, Discord, and any HTTP endpoint.
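A common pattern is to sign the webhook body so receivers can verify authenticity before acting on it; the event name and payload fields below are illustrative, not any particular product's schema:

```python
import hashlib
import hmac
import json

def build_webhook(event: str, data: dict, secret: bytes):
    """Serialize a webhook event and compute an HMAC-SHA256 signature
    that the receiver can recompute and verify. Event name and field
    layout are illustrative placeholders."""
    body = json.dumps({"event": event, "data": data}, sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature

body, sig = build_webhook(
    "budget.threshold_exceeded",
    {"project": "chat-app", "spend_usd": 512.40, "limit_usd": 500.00},
    secret=b"shared-secret",
)
# The receiver recomputes the HMAC over the raw body and compares digests
# with hmac.compare_digest before trusting the payload.
```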
Wrapped Keys
Security & Compliance
Proxy API keys that route provider SDK traffic through a cost tracking layer. The original provider key never leaves the server, while the wrapped key provides per-key attribution, budget enforcement, and policy controls without requiring application code changes beyond a base URL swap.
Browse by Category
Billing & Pricing
AI Cost Allocation
The practice of attributing AI API costs to specific teams, projects, features, or customers — enabling accountability, budgeting, and optimization at the organizational level.
AI ROI (Return on Investment)
The financial return generated by AI investments relative to their total cost. AI ROI is uniquely challenging to measure because the benefits — productivity gains, quality improvements, faster time-to-market — are often indirect, distributed across teams, and difficult to isolate from other variables. Rigorous ROI measurement requires a framework that captures both hard-dollar savings and soft-value gains.
Chargeback & Showback
Two complementary FinOps models for assigning AI cost accountability across teams and business units. Showback reports costs to each team for visibility and behavioral nudging without financial consequences. Chargeback bills teams directly from their departmental budgets for the AI resources they consume, creating hard financial accountability. Both models are essential for organizations scaling AI beyond a single team or project.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Input vs. Output Tokens
The two token directions in every LLM API call, each priced differently. Output tokens cost 3–5x more than input tokens across all major providers.
Multi-Modal Model
An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Budget
Spending limits applied per project, team, or time period to prevent uncontrolled AI API costs and protect against runaway agents.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Tokenization
The process of splitting raw text into discrete sub-word units called tokens using algorithms like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. Tokenization is the invisible first step of every LLM API call and directly determines how many tokens you are billed for — identical text fed through different tokenizers can produce token counts that vary by 10–40%, making tokenizer choice a material cost factor.
Total Cost of Ownership (TCO) for AI
The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.
Unit Economics
The cost and revenue associated with a single unit of your AI-powered product — whether that unit is a query, a user session, a transaction, or an API call. Unit economics tell you whether each interaction your product serves is profitable or loss-making, and by how much. For AI features built on LLM APIs, unit economics are uniquely volatile because inference costs vary by model, prompt length, and output complexity, making per-unit cost tracking essential for sustainable growth.
Usage & Metering
Agentic AI
AI systems that autonomously plan, reason, and execute multi-step tasks by chaining multiple LLM calls, tool invocations, and decision loops. Agentic workflows generate unpredictable and often enormous token consumption — 10x to 100x more than single-turn queries — making them the highest-cost AI pattern in production. Without per-session monitoring and cost guardrails, agent runs can consume hundreds of dollars in minutes.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Max Tokens
The API parameter that limits the maximum number of output tokens a model can generate in a single response, directly controlling output cost and preventing runaway generation.
Temperature
A sampling parameter (typically 0–2) that controls the randomness and creativity of LLM outputs. Higher temperature values produce more diverse and unpredictable responses but can increase output length and token consumption, indirectly raising API costs. Temperature tuning is a critical lever for balancing output quality against spend.
Optimization
Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
Fine-Tuning
The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Prompt Caching
A provider-side optimization that caches repeated prompt prefixes to reduce input token costs by 50–90% on subsequent requests.
Prompt Compression
Techniques for reducing the token count of prompts while preserving semantic meaning — cutting input costs by 40–70% through manual optimization, algorithmic compression, and selective context strategies.
Prompt Engineering
The practice of designing, structuring, and iterating on the text inputs (prompts) sent to large language models to elicit desired outputs. Prompt engineering directly affects AI API costs through two mechanisms: the token count of the prompt itself (input cost) and the length and quality of the model's response (output cost). A well-engineered prompt can reduce total per-request cost by 40–70% compared to a naive prompt while maintaining or improving output quality.
Retrieval-Augmented Generation (RAG)
An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.
Semantic Caching
An application-level caching strategy that uses embedding similarity to serve previously generated responses for semantically equivalent queries, reducing API calls by 20–40%.
Infrastructure
API Gateway
A centralized entry point for API traffic that handles routing, authentication, rate limiting, and request transformation. For LLM APIs, gateways add cost tracking, policy enforcement, and provider abstraction.
Embedding
A dense vector representation of text (or other data) produced by a specialized neural network model. Embeddings capture semantic meaning as arrays of floating-point numbers, enabling similarity search, retrieval-augmented generation (RAG), classification, and clustering. Embedding models are priced separately from generation models — typically 10–100x cheaper per token — but high-volume pipelines can still accumulate significant embedding costs that require dedicated monitoring and optimization.
Failover
Automatically switching to a backup LLM provider when the primary fails or becomes unavailable. Failover prevents user-facing downtime in AI-powered features but introduces cost implications when backup providers have different pricing. A well-designed failover strategy balances reliability against budget impact.
Foundation Model
A large, general-purpose AI model pre-trained on broad data that serves as the base for downstream applications. Foundation models like GPT-4, Claude, Gemini, and Llama represent enormous upfront training investments whose costs are amortized across millions of API consumers. Choosing the right foundation model determines both baseline capability and baseline cost for every AI-powered feature you build.
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.
Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
LLM Gateway
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
LLM Proxy
A transparent intermediary that sits between your application and LLM providers, forwarding requests while adding tracking, caching, or policy enforcement without code changes. Proxies intercept standard SDK traffic, log usage metadata, and optionally transform requests before relaying them upstream.
Load Balancing
Distributing LLM API requests across multiple provider accounts, endpoints, or models to optimize for cost, latency, and availability. Load balancing prevents rate limit exhaustion on any single account and enables cost-aware request distribution.
Model Context Protocol (MCP)
An open protocol for connecting AI assistants to external tools and data sources via a standardized client-server architecture. MCP enables AI coding assistants like Claude Code and GitHub Copilot to query cost data, run analyses, set budgets, and take actions without leaving the development environment.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Serverless Inference
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include Amazon Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.
Transformer
The foundational neural network architecture behind all modern large language models. Introduced in the 2017 paper 'Attention Is All You Need,' the transformer uses self-attention mechanisms to process sequences in parallel, enabling the scaling breakthroughs that power GPT, Claude, and Gemini. Understanding transformer architecture explains why API costs scale with context length and why inference is computationally expensive.
Webhook
An HTTP callback that pushes real-time notifications when events occur — cost threshold breaches, anomaly detection alerts, usage milestones. Webhooks are the delivery mechanism that turns passive monitoring into active, automated response workflows across Slack, PagerDuty, Discord, and any HTTP endpoint.
Security & Compliance
API Key Management
Securing, rotating, scoping, and tracking API credentials across AI providers. Effective key management is the foundation of both cost attribution and security — every unmanaged key is a potential source of untracked spend and unauthorized access.
Wrapped Keys
Proxy API keys that route provider SDK traffic through a cost tracking layer. The original provider key never leaves the server, while the wrapped key provides per-key attribution, budget enforcement, and policy controls without requiring application code changes beyond a base URL swap.
Observability
Alerting
Automated notifications triggered by cost thresholds, usage anomalies, or performance degradation in AI systems. The first line of defense against budget overruns — alerting ensures no cost spike goes unnoticed.
Cost Anomaly Detection
Automated detection of unusual AI spending patterns — sudden spikes, gradual drift, and per-key anomalies — before they become budget-breaking surprises.
Dashboards
Visual interfaces for monitoring AI cost, usage, and performance metrics in real-time. The command center for AI cost management — dashboards aggregate token spend, model utilization, latency, and budget health into a single pane of glass.
Evals
Systematic evaluation of LLM output quality using automated metrics, human review, or LLM-as-judge methodologies. Evals are the quality gate that ensures cost optimizations — model downgrades, prompt compression, caching — do not silently degrade the user experience.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
LLM Observability
The practice of monitoring, tracing, and analyzing LLM-powered applications in production across every dimension that matters: token consumption, cost, latency, error rates, and output quality. LLM observability goes far beyond traditional APM by tracking AI-specific metrics that determine both the reliability and the economics of your AI features.
Logging
Recording LLM request and response metadata — tokens consumed, model used, latency, cost, and status — for debugging, cost analysis, and compliance. Effective LLM logging captures the operational envelope of every API call without storing sensitive prompt content.
OpenTelemetry
An open-source observability framework providing a vendor-neutral standard (OTLP) for collecting traces, metrics, and logs from distributed systems. OpenTelemetry is rapidly becoming the standard instrumentation layer for LLM applications, enabling teams to track latency, token usage, cost, and quality across every inference call.
P95 / P99 Latency
Percentile latency metrics that capture the tail-end performance of LLM API calls. P95 means 95% of requests complete within this time; P99 means 99% do. Unlike averages, percentiles expose the worst experiences real users encounter and are the standard basis for SLA commitments with AI providers.
Spans
Individual units of work within a distributed trace. Each span records a single operation — such as an LLM call, a retrieval step, or a tool invocation — with its duration, token counts, cost, metadata, and parent-child relationships that reveal the full execution graph of an AI request.
Throughput
The volume of requests or tokens an LLM system processes per unit of time, measured as requests per second (RPS), tokens per second (TPS), or tokens per minute (TPM). Throughput determines how many users your AI features can serve simultaneously and is the key scaling metric that connects infrastructure capacity to cost at scale.
Time to First Token (TTFT)
The latency measured from the moment a client sends an LLM API request to the moment the first token of the response is received. TTFT is the primary UX-facing latency metric for streaming applications, directly determining how fast an AI response feels to the end user.
Tokens Per Second (TPS)
The rate at which an LLM generates output tokens during the decode phase of inference. TPS determines how fast a streaming response flows to the user, the maximum throughput capacity of inference infrastructure, and the economic efficiency of GPU utilization.
Tracing
The practice of recording the full execution path of an LLM request — from prompt construction through model inference to response delivery — with timing and cost attribution at each step. Tracing provides the granular visibility needed to understand where time and money are spent in multi-step AI pipelines.
AI Cost Glossary
Stop guessing what your AI costs mean. Start tracking them.
CostHawk gives you one dashboard for every provider, every model, and every dollar. Connect in minutes with MCP telemetry or wrapped keys.
