Transformer
The foundational neural network architecture behind all modern large language models. Introduced in the 2017 paper 'Attention Is All You Need,' the transformer uses self-attention mechanisms to process sequences in parallel, enabling the scaling breakthroughs that power GPT, Claude, and Gemini. Understanding transformer architecture explains why API costs scale with context length and why inference is computationally expensive.
Definition
What is a Transformer?
Impact
Why It Matters for AI Costs
The transformer architecture is not just an academic concept — it directly determines the pricing structure, performance characteristics, and cost scaling of every AI API you use. Understanding it explains three critical aspects of AI cost management:
1. Why costs scale with input length. The self-attention mechanism computes pairwise interactions between all tokens in a sequence. For a 1,000-token input, the model computes 1,000,000 attention scores; for a 10,000-token input, 100,000,000 — a 100x increase in compute for a 10x increase in tokens. Providers amortize this into linear per-token pricing for simplicity, but the underlying compute cost is superlinear, which is why some providers charge premium rates for inputs that exceed certain context length thresholds.
2. Why output generation is expensive. During output generation, the transformer must attend to all previous tokens (both input and already-generated output) for each new token. The KV cache mechanism avoids recomputing attention for earlier tokens, but the memory required grows linearly with sequence length, consuming expensive GPU VRAM. This is why output tokens cost 4–5x more than input tokens — each one requires a full forward pass through the model with an ever-growing attention context.
3. Why context windows have limits. The quadratic memory and compute requirements of attention mean that doubling the context window requires roughly 4x the GPU memory and compute. A 128K context window requires 16x the attention compute of a 32K window. Providers set context limits based on what their GPU infrastructure can support economically, and they charge accordingly — longer context models tend to have higher per-token rates.
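The scaling in points 1 and 3 above is easy to verify with a few lines of arithmetic. A sketch (illustration only, not a pricing model):

```python
# Illustrative arithmetic: attention compute grows quadratically
# while token counts grow linearly.
def attention_scores(n_tokens: int) -> int:
    """Pairwise attention scores: every token attends to every token."""
    return n_tokens * n_tokens

def relative_cost(n_tokens: int, baseline: int = 1_000) -> float:
    """Attention compute relative to a baseline sequence length."""
    return attention_scores(n_tokens) / attention_scores(baseline)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_scores(n):>14,} scores "
          f"({relative_cost(n):,.0f}x the 1K baseline)")
```

A 10x longer input yields 100x the attention compute; a 100x longer input yields 10,000x.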
CostHawk helps you navigate these architectural realities by tracking context utilization, identifying requests that use more context than necessary, and recommending model routing strategies that match context requirements to the most cost-efficient model.
The Transformer Architecture
The transformer architecture consists of stacked layers, each containing two primary sub-components: a multi-head self-attention mechanism and a feed-forward neural network. Modern LLMs stack 32 to 120+ of these layers to create deep networks with billions of parameters.
Self-attention mechanism: For each token in the input sequence, the model creates three vectors — a Query (Q), a Key (K), and a Value (V) — by multiplying the token's embedding by learned weight matrices. The attention score between any two tokens is computed as the dot product of one token's Query with another token's Key, scaled by the square root of the dimension. These scores are passed through a softmax function to produce attention weights, which are then used to create a weighted sum of Value vectors. Mathematically:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V

The QK^T matrix multiplication produces a matrix of size sequence_length × sequence_length — this is the source of the quadratic scaling with context length.
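The equation can be sketched in plain Python with toy sizes (no batching, masking, or learned projections — an illustration of the math, not production code):

```python
import math

def softmax(xs):
    m = max(xs)                                # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(Q[0])
    out = []
    for q in Q:                                # one output row per query token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                  # N scores per query -> N x N total
        weights = softmax(scores)              # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted sum of Value rows
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-token sequence, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))      # each output row is a blend of the Value rows
```

Note the inner loop over K for every query: that nested iteration is the N × N score matrix in code form.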
Multi-head attention: Rather than computing a single attention function, the model runs multiple attention "heads" in parallel (typically 32–128 heads), each with its own Q, K, V weight matrices. This allows the model to attend to different types of relationships simultaneously — one head might focus on syntactic relationships, another on semantic similarity, another on positional proximity. The outputs of all heads are concatenated and projected through another weight matrix.
Feed-forward network: After attention, each token's representation passes through a position-wise feed-forward network (typically two linear transformations with a nonlinear activation like GELU or SiLU in between). Within a layer, the same feed-forward weights are applied at every position, but each layer has its own parameters. The feed-forward layers typically contain the majority of the model's parameters and perform much of the model's "reasoning" computation.
Layer normalization and residual connections: Each sub-component (attention and feed-forward) is wrapped with a residual connection (adding the input to the output) and layer normalization, which stabilizes training and allows gradients to flow through very deep networks.
Decoder-only architecture: Most modern LLMs (GPT, Claude, Llama, Mistral) use a decoder-only transformer, where each token can only attend to tokens that come before it in the sequence (causal masking). This enables autoregressive generation — the model produces one token at a time, each conditioned on all previous tokens. This is the architecture that powers the API endpoints you use daily.
Self-Attention and Computational Cost
The self-attention mechanism is both the transformer's greatest strength and its primary cost bottleneck. Understanding the computational profile explains why longer contexts cost more and why API providers price the way they do.
Quadratic scaling with sequence length: The core attention operation QK^T produces a matrix of size N × N, where N is the sequence length. The compute required for this operation is O(N² × d), where d is the model dimension (typically 4,096–12,288). For concrete numbers:
| Context Length (N) | Attention Operations (N²) | Relative Cost | GPU Memory (KV Cache, est.) |
|---|---|---|---|
| 1,024 | 1,048,576 | 1x | ~128 MB |
| 4,096 | 16,777,216 | 16x | ~512 MB |
| 16,384 | 268,435,456 | 256x | ~2 GB |
| 65,536 | 4,294,967,296 | 4,096x | ~8 GB |
| 131,072 | 17,179,869,184 | 16,384x | ~16 GB |
Moving from a 4K context to a 128K context increases attention compute by roughly 1,024x. This is why models with 128K+ context windows require significantly more GPU resources per request, and why providers like Anthropic and Google, which charge a flat per-token rate regardless of context length, must implicitly amortize the increased compute cost across their fleet.
Memory bottleneck — the KV cache: During autoregressive generation, the model maintains a Key-Value (KV) cache that stores the K and V vectors for all previous tokens. This avoids recomputing attention from scratch for each new output token. However, the KV cache grows linearly with sequence length and must fit in GPU memory (VRAM). For a model like Llama 3 70B — 80 attention layers and 64 query heads, with 8 KV heads under grouped-query attention — the KV cache for a 128K-token sequence consumes roughly 20–40 GB of VRAM depending on cache precision, on a GPU that might have 80 GB total. This memory pressure limits how many concurrent requests a GPU can serve, which directly impacts the economics of inference and ultimately the per-token price you pay.
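The cache's footprint follows directly from the model shape: a K and a V vector per layer, per KV head, per token. A sketch of the standard sizing formula, using an assumed 70B-class configuration (representative values, not a vendor spec — published estimates vary with cache precision and head layout):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a Key and a Value vector per layer, per KV head,
    per token. bytes_per_elem=2 corresponds to an FP16/BF16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class config: 80 layers, 8 KV heads (GQA), head dimension 128
gb = kv_cache_bytes(131_072, 80, 8, 128) / 2**30
print(f"{gb:.1f} GB")  # 40.0 GB at FP16; an FP8 cache would halve this
```

The formula is linear in sequence length — double the context, double the VRAM consumed per request.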
Practical cost implications: Even though providers charge linearly per token, the underlying cost is superlinear. Providers handle this through fleet optimization, batching, and amortization — but the physics of attention computation mean that a 100K token request is genuinely more expensive to serve per token than a 1K token request. This is why some providers offer lower rates for shorter contexts or charge premium rates for extended context models. When optimizing costs, reducing context length has a disproportionate benefit because it reduces both the linear token cost and the hidden superlinear compute cost.
Encoder vs Decoder Models
The original transformer paper described an encoder-decoder architecture, but the AI API landscape in 2026 is dominated by decoder-only models. Understanding the distinction matters for cost because different architectures have different compute profiles and pricing structures.
Encoder-only models (e.g., BERT, RoBERTa) process the full input bidirectionally — every token can attend to every other token in both directions. These models excel at understanding tasks: classification, entity recognition, semantic similarity, and search ranking. They are fast and cheap to run because they process the entire input in a single forward pass with no autoregressive generation. However, they cannot generate text. Encoder models are rarely accessed via the same APIs as LLMs; they are more commonly deployed as embedding models or classifiers. Pricing is typically per-token for input only (no output cost), at rates much lower than generative models: OpenAI's text-embedding-3-small charges $0.02/MTok versus $2.50/MTok for GPT-4o input.
Decoder-only models (e.g., GPT-4o, Claude, Llama, Mistral) process tokens left-to-right with causal masking — each token can only attend to previous tokens. These are the models that power text generation APIs. Their cost profile includes both input (prefill) and output (decoding) costs, with output costing 4–5x more due to the sequential generation process described above. Virtually all LLM API costs come from decoder-only models.
Encoder-decoder models (e.g., T5, FLAN, BART) use an encoder to process the input and a decoder to generate the output. The encoder processes the full input bidirectionally, then the decoder generates output autoregressively while attending to the encoder's representations via cross-attention. These models can be more efficient for tasks where the output is much shorter than the input (summarization, translation) because the decoder only needs to generate a short sequence while leveraging a rich bidirectional encoding of the long input. Google's T5 family popularized this design, though its published reports describe modern Gemini models as decoder-only.
Cost comparison by architecture:
| Architecture | Examples | Typical Cost Range | Cost Drivers |
|---|---|---|---|
| Encoder-only | BERT, text-embedding-3 | $0.01–$0.13/MTok | Input tokens only |
| Decoder-only | GPT-4o, Claude 3.5, Llama 3 | $0.10–$15.00/MTok | Input + output tokens |
| Encoder-decoder | T5, Gemini (internal) | $0.10–$5.00/MTok | Input + output, but more efficient for short outputs |
For cost optimization, the key insight is: if your task only requires understanding (classification, extraction, embedding), use an encoder-only model at 10–100x lower cost than a generative decoder model. Reserve decoder-only models for tasks that genuinely require text generation.
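As a back-of-the-envelope check on that insight, here is a sketch pricing a pure-classification workload on an embedding model versus a generative model. The rates echo the table above; the volume and output figures are assumptions for illustration:

```python
# Hypothetical workload: 500M input tokens/month of classification traffic.
# Rates are illustrative embedding-model vs. generative-model $/MTok prices.
def monthly_cost(tokens_in: float, rate_in: float,
                 tokens_out: float = 0, rate_out: float = 0.0) -> float:
    """Cost in dollars given token volumes and $/MTok rates."""
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

embedding = monthly_cost(500e6, 0.02)                          # input only
generative = monthly_cost(500e6, 2.50, tokens_out=50e6, rate_out=10.0)
print(f"embedding: ${embedding:,.2f}  generative: ${generative:,.2f}")
```

Under these assumptions the encoder-only route is two orders of magnitude cheaper for the same classification task.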
How Architecture Affects API Pricing
The transformer architecture creates several fundamental constraints that directly shape the pricing of every AI API. Understanding these constraints helps you make better model selection and usage decisions.
Parameter count and model size: The number of parameters in a transformer determines how much GPU memory it requires and how much compute each forward pass consumes. A 70 billion parameter model requires roughly 140 GB of VRAM in FP16 precision just to store the weights — more than the capacity of a single A100 80GB GPU, requiring multi-GPU setups. Larger models generally produce better outputs but cost more to serve per token because they consume more expensive GPU resources per request. This is why GPT-4o (estimated ~200B+ parameters in its MoE architecture) costs more per token than GPT-4o mini (estimated ~8B parameters).
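The weight-memory arithmetic is a one-liner. A sketch (assuming dense weights; MoE routing and quantization change the picture, as the following sections discuss):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """GPU memory to hold the model weights alone.
    2 bytes/param corresponds to FP16/BF16 precision."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))     # 140.0 -- a 70B model needs ~140 GB at FP16
print(weight_memory_gb(70e9, 1))  # 70.0  -- INT8 quantization halves it
```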
Mixture of Experts (MoE): Models like GPT-4, Mixtral, and some Gemini variants use a Mixture of Experts architecture, where only a subset of the model's parameters are activated for each token. A model with 1 trillion total parameters might only activate 200 billion per token, reducing compute costs while maintaining quality. MoE models offer a better cost-to-quality ratio because you get the benefits of a massive parameter space without the per-token compute cost of a fully dense model. This architectural choice is reflected in pricing — MoE models tend to be cheaper per token than equivalently capable dense models.
Quantization and precision: Running transformer models at reduced numerical precision (FP16, INT8, INT4) dramatically reduces memory requirements and increases throughput, allowing providers to serve more requests per GPU-second. Quantized models typically show minimal quality degradation for standard tasks. When a provider lowers prices (as OpenAI has done repeatedly), it often reflects improved quantization and serving efficiency rather than a change in the underlying model. The architecture's tolerance for quantization directly enables the price reductions that benefit API consumers.
KV cache and context pricing: As discussed in the self-attention section, the KV cache is a major cost factor for long-context requests. Some architectural innovations — Multi-Query Attention (MQA), Grouped Query Attention (GQA), and sliding window attention — reduce KV cache sizes by factors from 4–8x (GQA) up to the full head count (MQA), enabling longer contexts at lower cost. Llama 3 uses GQA, and Mistral uses sliding window attention in some layers; closed providers such as Anthropic do not publish architectural details, though similar techniques are widely assumed. These architectural choices explain why some models can offer 128K+ context windows at reasonable prices while others are limited to 8K–32K.
Batch processing efficiency: Providers batch multiple requests onto the same GPU to maximize utilization. The transformer architecture's parallel nature makes it well-suited for batching during the prefill phase (processing inputs), but less so during decoding (generating outputs) because each request may be at a different point in its generation. This batching asymmetry contributes to the price differential between input and output tokens — input tokens benefit more from batching efficiencies than output tokens.
Transformer Efficiency Improvements
Since the original 2017 transformer paper, researchers and engineers have developed numerous improvements to reduce the computational cost of attention while preserving model quality. These innovations directly impact the prices you pay for API inference:
Multi-Query Attention (MQA): Introduced by Shazeer (2019), MQA uses a single shared Key and Value head across all attention heads, while maintaining separate Query heads. This reduces the KV cache size by a factor equal to the number of heads (typically 32–128x reduction), dramatically reducing memory requirements during generation. Models using MQA can serve longer contexts and more concurrent requests per GPU. PaLM and early Gemini models used MQA.
Grouped Query Attention (GQA): A compromise between standard multi-head attention and MQA, GQA groups attention heads and shares K/V within each group. With 8 groups instead of 64 individual heads, KV cache size is reduced 8x while preserving most of the quality advantages of full multi-head attention. Llama 3 and Mistral use GQA, and it is widely assumed in other production models whose architectures are unpublished. This is the dominant efficiency technique in production LLMs as of 2026 and a key reason why 128K–200K context windows are commercially viable.
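The reduction factor is just the ratio of query heads to KV heads. A quick comparison of the three sharing schemes (64 query heads is an illustrative count):

```python
def kv_cache_reduction(n_query_heads: int, n_kv_heads: int) -> float:
    """Factor by which the KV cache shrinks versus full multi-head attention."""
    return n_query_heads / n_kv_heads

print(kv_cache_reduction(64, 64))  # 1.0  -> standard MHA, no sharing
print(kv_cache_reduction(64, 8))   # 8.0  -> GQA with 8 KV groups
print(kv_cache_reduction(64, 1))   # 64.0 -> MQA, one shared KV head
```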
Flash Attention: Developed by Dao et al. (2022), Flash Attention is a hardware-aware algorithm that computes exact attention while minimizing GPU memory reads/writes. It does not change the mathematical output but reduces wall-clock time and memory usage by 2–4x through careful tiling and kernel fusion. Flash Attention has become standard in all major LLM serving frameworks (vLLM, TensorRT-LLM, TGI) and is a major contributor to the continued price reductions from API providers. Flash Attention 2 and 3 provide further improvements.
Sparse Attention: Instead of every token attending to every other token (full attention), sparse attention patterns restrict attention to a subset: local windows (attend only to nearby tokens), strided patterns (attend to every Nth token), or learned patterns (attend to tokens the model predicts are most relevant). Longformer, BigBird, and some Gemini variants use sparse attention to reduce the quadratic cost to linear or near-linear scaling. The tradeoff is that some long-range dependencies may be missed, but for many practical tasks, the quality impact is minimal while the cost improvement is substantial.
KV Cache Compression: Techniques like PagedAttention (used in vLLM), token dropping, and quantized KV caches reduce the memory footprint of the KV cache without changing the model architecture. PagedAttention manages KV cache memory like virtual memory, eliminating waste from pre-allocated buffers. This allows providers to serve 2–4x more concurrent requests per GPU, directly reducing per-token costs. vLLM with PagedAttention is the dominant open-source serving framework and is used by many commercial providers behind their APIs.
Speculative Decoding: This technique uses a small, fast "draft" model to generate candidate tokens, which are then verified in batch by the large target model. Because verification can be parallelized (unlike generation), speculative decoding can increase generation throughput by 2–3x when the draft model's predictions match the target model's. This reduces latency and increases GPU utilization, enabling providers to serve more requests and potentially lower prices.
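A standard approximation from the speculative decoding literature estimates the expected tokens emitted per target-model pass as (1 − α^(γ+1)) / (1 − α), where α is the per-token acceptance probability and γ the number of draft tokens. A sketch (the α and γ values below are illustrative):

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass when a draft
    model proposes gamma tokens, each accepted with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 draft tokens per pass:
print(f"{expected_tokens_per_pass(0.8, 4):.2f} tokens/pass")  # 3.36 vs 1.0 without drafting
```

Higher acceptance rates (a draft model well matched to the target) push the multiplier toward γ + 1.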
The cumulative effect of these improvements is dramatic: a 2026 serving stack can serve the same model at 5–10x lower cost per token than a 2023 stack, through purely engineering and algorithmic improvements. This is why API prices have fallen 80–90% since 2023 even as model capabilities have improved. CostHawk tracks these pricing changes automatically, ensuring your cost reports always reflect current rates.
The Cost Implications of Architecture Choices
As an API consumer, you do not choose the transformer architecture — the model provider does. But understanding architecture informs your usage decisions and helps you predict future pricing trends:
Context length is not free. Marketing materials emphasize ever-larger context windows (128K, 200K, 1M tokens), but the quadratic cost of attention means that filling these windows is expensive. A request that sends 100K tokens of context costs 100x more in input tokens than one that sends 1K tokens — and the underlying compute cost is even higher due to quadratic attention scaling. Before stuffing a context window, ask: would retrieval-augmented generation (RAG) with 2K tokens of targeted context produce equivalent quality at 50x lower cost?
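That comparison is easy to price out. A sketch at an assumed $2.50/MTok input rate (illustrative, not a quote from any provider):

```python
RATE_PER_MTOK = 2.50  # assumed input rate in $/MTok -- illustrative only

def input_cost(tokens: int) -> float:
    """Input-token cost in dollars for a single request."""
    return tokens / 1e6 * RATE_PER_MTOK

full_context = input_cost(100_000)  # stuffing the window
rag_context = input_cost(2_000)     # targeted retrieval
print(f"full: ${full_context:.3f}  rag: ${rag_context:.4f}  "
      f"ratio: {full_context / rag_context:.0f}x")
```

The 50x billed-token gap understates the true serving-cost gap, since attention compute scales quadratically.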
Model size correlates with cost, not always with quality. A 70B parameter model is not necessarily 10x better than a 7B model — but it is roughly 10x more expensive to serve. For many production tasks, a well-prompted 7B–13B model matches a 70B model on quality metrics. The architectural insight here is that parameter count determines compute cost, but task performance depends on how well the model's training data and fine-tuning match your use case. CostHawk's per-model analytics help you identify cases where you are paying for a larger model without getting proportional quality benefits.
Architecture improvements reduce costs over time. The trend in transformer efficiency — from full attention to GQA, from naive serving to Flash Attention + PagedAttention, from FP16 to INT4 quantization — consistently pushes per-token costs down. API prices have dropped 80–90% since GPT-4's launch in March 2023. If you are evaluating whether to build self-hosted infrastructure, factor in that API prices will likely continue falling 30–50% per year, shifting the crossover point where self-hosting becomes cheaper.
Reasoning models have hidden architectural costs. Models like o1, o3, and Claude with extended thinking use chain-of-thought reasoning that generates large numbers of internal "thinking" tokens before producing visible output. These thinking tokens consume the same compute as visible output tokens (they go through the same transformer forward pass) and are billed accordingly. A request that produces 500 visible output tokens might consume 5,000+ thinking tokens, making reasoning models 5–10x more expensive per visible output token than their sticker price suggests. CostHawk tracks thinking tokens separately to give you accurate cost visibility for reasoning model usage.
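The effective rate can be computed directly. A sketch with assumed numbers — a $10/MTok sticker output rate, 500 visible tokens, and 5,000 thinking tokens:

```python
def effective_output_rate(rate_out: float, visible: int, thinking: int) -> float:
    """Billed $/MTok per *visible* output token: you pay for thinking
    tokens too, but only the visible tokens reach your application."""
    return rate_out * (visible + thinking) / visible

# Assumed: $10/MTok sticker rate, 500 visible tokens, 5,000 thinking tokens
print(effective_output_rate(10.0, 500, 5_000))  # 110.0 -> an 11x effective rate
```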
Multimodal models use adapted transformers. Vision transformers (ViT) process image patches as tokens, and audio transformers process audio frames as tokens. These modality-specific tokens go through the same attention mechanism as text tokens, which is why image and audio inputs are billed in token-equivalent units. The cost implication is that multimodal inputs are subject to the same quadratic attention scaling — a high-resolution image that maps to 2,000 tokens adds quadratically to attention cost just like 2,000 text tokens would.
The transformer architecture is the economic engine behind the AI API industry. Every pricing decision, every context limit, and every optimization technique traces back to the fundamental compute characteristics of self-attention. Understanding this architecture gives you a structural advantage in predicting costs, evaluating tradeoffs, and making informed decisions about how to use AI APIs efficiently.
FAQ
Frequently Asked Questions
Why does the transformer architecture make long contexts expensive?
What is the difference between a transformer and an LLM?
How does the number of transformer layers affect API costs?
What is the KV cache and why does it matter for costs?
Why are output tokens more expensive than input tokens from an architecture perspective?
What is Flash Attention and does it reduce API costs?
How do Mixture of Experts transformers reduce costs?
Will transformer architecture improvements continue to reduce API prices?
Related Terms
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
