Transformer
The foundational neural network architecture behind all modern large language models. Introduced in the 2017 paper 'Attention Is All You Need,' the transformer uses self-attention mechanisms to process sequences in parallel, enabling the scaling breakthroughs that power GPT, Claude, and Gemini. Understanding transformer architecture explains why API costs scale with context length and why inference is computationally expensive.
Definition
What is a Transformer?
Impact
Why It Matters for AI Costs
The transformer architecture is not just an academic concept — it directly determines the pricing structure, performance characteristics, and cost scaling of every AI API you use. Understanding it explains three critical aspects of AI cost management:
1. Why costs scale with input length. The self-attention mechanism computes pairwise interactions between all tokens in a sequence. For a 1,000-token input, the model computes 1,000,000 attention scores; for a 10,000-token input, 100,000,000 — a 100x increase in compute for a 10x increase in tokens. Providers amortize this into linear per-token pricing for simplicity, but the underlying compute cost is superlinear, which is why some providers charge premium rates for inputs that exceed certain context length thresholds.
2. Why output generation is expensive. During output generation, the transformer must attend to all previous tokens (both input and already-generated output) for each new token. The KV cache mechanism avoids recomputing attention for earlier tokens, but the memory required grows linearly with sequence length, consuming expensive GPU VRAM. This is why output tokens cost 4–5x more than input tokens — each one requires a full forward pass through the model with an ever-growing attention context.
3. Why context windows have limits. The quadratic memory and compute requirements of attention mean that doubling the context window requires roughly 4x the GPU memory and compute. A 128K context window requires 16x the attention compute of a 32K window. Providers set context limits based on what their GPU infrastructure can support economically, and they charge accordingly — longer context models tend to have higher per-token rates.
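The scaling in points 1 and 3 above is easy to verify with a few lines of arithmetic. A sketch (illustration only, not a pricing model):

```python
# Illustrative arithmetic: attention compute grows quadratically
# while token counts grow linearly.
def attention_scores(n_tokens: int) -> int:
    """Pairwise attention scores: every token attends to every token."""
    return n_tokens * n_tokens

def relative_cost(n_tokens: int, baseline: int = 1_000) -> float:
    """Attention compute relative to a baseline sequence length."""
    return attention_scores(n_tokens) / attention_scores(baseline)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_scores(n):>14,} scores "
          f"({relative_cost(n):,.0f}x the 1K baseline)")
```

A 10x longer input yields 100x the attention compute; a 100x longer input yields 10,000x.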
CostHawk helps you navigate these architectural realities by tracking context utilization, identifying requests that use more context than necessary, and recommending model routing strategies that match context requirements to the most cost-efficient model.
The Transformer Architecture
The transformer architecture consists of stacked layers, each containing two primary sub-components: a multi-head self-attention mechanism and a feed-forward neural network. Modern LLMs stack 32 to 120+ of these layers to create deep networks with billions of parameters.
Self-attention mechanism: For each token in the input sequence, the model creates three vectors — a Query (Q), a Key (K), and a Value (V) — by multiplying the token's embedding by learned weight matrices. The attention score between any two tokens is computed as the dot product of one token's Query with another token's Key, scaled by the square root of the dimension. These scores are passed through a softmax function to produce attention weights, which are then used to create a weighted sum of Value vectors. Mathematically:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V

The QK^T matrix multiplication produces a matrix of size sequence_length × sequence_length — this is the source of the quadratic scaling with context length.
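The equation can be sketched in plain Python with toy sizes (no batching, masking, or learned projections — an illustration of the math, not production code):

```python
import math

def softmax(xs):
    m = max(xs)                                # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(Q[0])
    out = []
    for q in Q:                                # one output row per query token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                  # N scores per query -> N x N total
        weights = softmax(scores)              # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted sum of Value rows
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-token sequence, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))      # each output row is a blend of the Value rows
```

Note the inner loop over K for every query: that nested iteration is the N × N score matrix in code form.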
Multi-head attention: Rather than computing a single attention function, the model runs multiple attention "heads" in parallel (typically 32–128 heads), each with its own Q, K, V weight matrices. This allows the model to attend to different types of relationships simultaneously — one head might focus on syntactic relationships, another on semantic similarity, another on positional proximity. The outputs of all heads are concatenated and projected through another weight matrix.
Feed-forward network: After attention, each token's representation passes through a position-wise feed-forward network (typically two linear transformations with a nonlinear activation like GELU or SiLU in between). Within a layer, the same feed-forward weights are applied at every position, but each layer has its own parameters. The feed-forward layers typically contain the majority of the model's parameters and perform much of the model's "reasoning" computation.
Layer normalization and residual connections: Each sub-component (attention and feed-forward) is wrapped with a residual connection (adding the input to the output) and layer normalization, which stabilizes training and allows gradients to flow through very deep networks.
Decoder-only architecture: Most modern LLMs (GPT, Claude, Llama, Mistral) use a decoder-only transformer, where each token can only attend to tokens that come before it in the sequence (causal masking). This enables autoregressive generation — the model produces one token at a time, each conditioned on all previous tokens. This is the architecture that powers the API endpoints you use daily.
Self-Attention and Computational Cost
The self-attention mechanism is both the transformer's greatest strength and its primary cost bottleneck. Understanding the computational profile explains why longer contexts cost more and why API providers price the way they do.
Quadratic scaling with sequence length: The core attention operation QK^T produces a matrix of size N × N, where N is the sequence length. The compute required for this operation is O(N² × d), where d is the model dimension (typically 4,096–12,288). For concrete numbers:
| Context Length (N) | Attention Operations (N²) | Relative Cost | GPU Memory (KV Cache, est.) |
|---|---|---|---|
| 1,024 | 1,048,576 | 1x | ~128 MB |
| 4,096 | 16,777,216 | 16x | ~512 MB |
| 16,384 | 268,435,456 | 256x | ~2 GB |
| 65,536 | 4,294,967,296 | 4,096x | ~8 GB |
| 131,072 | 17,179,869,184 | 16,384x | ~16 GB |
Moving from a 4K context to a 128K context increases attention compute by roughly 1,024x. This is why models with 128K+ context windows require significantly more GPU resources per request, and why providers like Anthropic and Google, which charge a flat per-token rate regardless of context length, must implicitly amortize the increased compute cost across their fleet.
Memory bottleneck — the KV cache: During autoregressive generation, the model maintains a Key-Value (KV) cache that stores the K and V vectors for all previous tokens. This avoids recomputing attention from scratch for each new output token. However, the KV cache grows linearly with sequence length and must fit in GPU memory (VRAM). For a model like Llama 3 70B — 80 attention layers and 64 query heads, with 8 KV heads under grouped-query attention — the KV cache for a 128K-token sequence consumes roughly 20–40 GB of VRAM depending on cache precision, on a GPU that might have 80 GB total. This memory pressure limits how many concurrent requests a GPU can serve, which directly impacts the economics of inference and ultimately the per-token price you pay.
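The cache's footprint follows directly from the model shape: a K and a V vector per layer, per KV head, per token. A sketch of the standard sizing formula, using an assumed 70B-class configuration (representative values, not a vendor spec — published estimates vary with cache precision and head layout):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a Key and a Value vector per layer, per KV head,
    per token. bytes_per_elem=2 corresponds to an FP16/BF16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class config: 80 layers, 8 KV heads (GQA), head dimension 128
gb = kv_cache_bytes(131_072, 80, 8, 128) / 2**30
print(f"{gb:.1f} GB")  # 40.0 GB at FP16; an FP8 cache would halve this
```

The formula is linear in sequence length — double the context, double the VRAM consumed per request.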
Practical cost implications: Even though providers charge linearly per token, the underlying cost is superlinear. Providers handle this through fleet optimization, batching, and amortization — but the physics of attention computation mean that a 100K token request is genuinely more expensive to serve per token than a 1K token request. This is why some providers offer lower rates for shorter contexts or charge premium rates for extended context models. When optimizing costs, reducing context length has a disproportionate benefit because it reduces both the linear token cost and the hidden superlinear compute cost.
Encoder vs Decoder Models
The original transformer paper described an encoder-decoder architecture, but the AI API landscape in 2026 is dominated by decoder-only models. Understanding the distinction matters for cost because different architectures have different compute profiles and pricing structures.
Encoder-only models (e.g., BERT, RoBERTa) process the full input bidirectionally — every token can attend to every other token in both directions. These models excel at understanding tasks: classification, entity recognition, semantic similarity, and search ranking. They are fast and cheap to run because they process the entire input in a single forward pass with no autoregressive generation. However, they cannot generate text. Encoder models are rarely accessed via the same APIs as LLMs; they are more commonly deployed as embedding models or classifiers. Pricing is typically per-token for input only (no output cost), at rates much lower than generative models: OpenAI's text-embedding-3-small charges $0.02/MTok versus $2.50/MTok for GPT-4o input.
Decoder-only models (e.g., GPT-4o, Claude, Llama, Mistral) process tokens left-to-right with causal masking — each token can only attend to previous tokens. These are the models that power text generation APIs. Their cost profile includes both input (prefill) and output (decoding) costs, with output costing 4–5x more due to the sequential generation process described above. Virtually all LLM API costs come from decoder-only models.
Encoder-decoder models (e.g., T5, FLAN, BART) use an encoder to process the input and a decoder to generate the output. The encoder processes the full input bidirectionally, then the decoder generates output autoregressively while attending to the encoder's representations via cross-attention. These models can be more efficient for tasks where the output is much shorter than the input (summarization, translation) because the decoder only needs to generate a short sequence while leveraging a rich bidirectional encoding of the long input. Google's T5 family popularized this design, though its published reports describe modern Gemini models as decoder-only.
Cost comparison by architecture:
| Architecture | Examples | Typical Cost Range | Cost Drivers |
|---|---|---|---|
| Encoder-only | BERT, text-embedding-3 | $0.01–$0.13/MTok | Input tokens only |
| Decoder-only | GPT-4o, Claude 3.5, Llama 3 | $0.10–$15.00/MTok | Input + output tokens |
| Encoder-decoder | T5, Gemini (internal) | $0.10–$5.00/MTok | Input + output, but more efficient for short outputs |
For cost optimization, the key insight is: if your task only requires understanding (classification, extraction, embedding), use an encoder-only model at 10–100x lower cost than a generative decoder model. Reserve decoder-only models for tasks that genuinely require text generation.
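As a back-of-the-envelope check on that insight, here is a sketch pricing a pure-classification workload on an embedding model versus a generative model. The rates echo the table above; the volume and output figures are assumptions for illustration:

```python
# Hypothetical workload: 500M input tokens/month of classification traffic.
# Rates are illustrative embedding-model vs. generative-model $/MTok prices.
def monthly_cost(tokens_in: float, rate_in: float,
                 tokens_out: float = 0, rate_out: float = 0.0) -> float:
    """Cost in dollars given token volumes and $/MTok rates."""
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

embedding = monthly_cost(500e6, 0.02)                          # input only
generative = monthly_cost(500e6, 2.50, tokens_out=50e6, rate_out=10.0)
print(f"embedding: ${embedding:,.2f}  generative: ${generative:,.2f}")
```

Under these assumptions the encoder-only route is two orders of magnitude cheaper for the same classification task.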
How Architecture Affects API Pricing
The transformer architecture creates several fundamental constraints that directly shape the pricing of every AI API. Understanding these constraints helps you make better model selection and usage decisions.
Parameter count and model size: The number of parameters in a transformer determines how much GPU memory it requires and how much compute each forward pass consumes. A 70 billion parameter model requires roughly 140 GB of VRAM in FP16 precision just to store the weights — more than the capacity of a single A100 80GB GPU, requiring multi-GPU setups. Larger models generally produce better outputs but cost more to serve per token because they consume more expensive GPU resources per request. This is why GPT-4o (estimated ~200B+ parameters in its MoE architecture) costs more per token than GPT-4o mini (estimated ~8B parameters).
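The weight-memory arithmetic is a one-liner. A sketch (assuming dense weights; MoE routing and quantization change the picture, as the following sections discuss):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """GPU memory to hold the model weights alone.
    2 bytes/param corresponds to FP16/BF16 precision."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))     # 140.0 -- a 70B model needs ~140 GB at FP16
print(weight_memory_gb(70e9, 1))  # 70.0  -- INT8 quantization halves it
```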
Mixture of Experts (MoE): Models like GPT-4, Mixtral, and some Gemini variants use a Mixture of Experts architecture, where only a subset of the model's parameters are activated for each token. A model with 1 trillion total parameters might only activate 200 billion per token, reducing compute costs while maintaining quality. MoE models offer a better cost-to-quality ratio because you get the benefits of a massive parameter space without the per-token compute cost of a fully dense model. This architectural choice is reflected in pricing — MoE models tend to be cheaper per token than equivalently capable dense models.
Quantization and precision: Running transformer models at reduced numerical precision (FP16, INT8, INT4) dramatically reduces memory requirements and increases throughput, allowing providers to serve more requests per GPU-second. Quantized models typically show minimal quality degradation for standard tasks. When a provider lowers prices (as OpenAI has done repeatedly), it often reflects improved quantization and serving efficiency rather than a change in the underlying model. The architecture's tolerance for quantization directly enables the price reductions that benefit API consumers.
KV cache and context pricing: As discussed in the self-attention section, the KV cache is a major cost factor for long-context requests. Some architectural innovations — Multi-Query Attention (MQA), Grouped Query Attention (GQA), and sliding window attention — reduce KV cache sizes by factors from 4–8x (GQA) up to the full head count (MQA), enabling longer contexts at lower cost. Llama 3 uses GQA, and Mistral uses sliding window attention in some layers; closed providers such as Anthropic do not publish architectural details, though similar techniques are widely assumed. These architectural choices explain why some models can offer 128K+ context windows at reasonable prices while others are limited to 8K–32K.
Batch processing efficiency: Providers batch multiple requests onto the same GPU to maximize utilization. The transformer architecture's parallel nature makes it well-suited for batching during the prefill phase (processing inputs), but less so during decoding (generating outputs) because each request may be at a different point in its generation. This batching asymmetry contributes to the price differential between input and output tokens — input tokens benefit more from batching efficiencies than output tokens.
Transformer Efficiency Improvements
Since the original 2017 transformer paper, researchers and engineers have developed numerous improvements to reduce the computational cost of attention while preserving model quality. These innovations directly impact the prices you pay for API inference:
Multi-Query Attention (MQA): Introduced by Shazeer (2019), MQA uses a single shared Key and Value head across all attention heads, while maintaining separate Query heads. This reduces the KV cache size by a factor equal to the number of heads (typically 32–128x reduction), dramatically reducing memory requirements during generation. Models using MQA can serve longer contexts and more concurrent requests per GPU. PaLM and early Gemini models used MQA.
Grouped Query Attention (GQA): A compromise between standard multi-head attention and MQA, GQA groups attention heads and shares K/V within each group. With 8 groups instead of 64 individual heads, KV cache size is reduced 8x while preserving most of the quality advantages of full multi-head attention. Llama 3 and Mistral use GQA, and it is widely assumed in other production models whose architectures are unpublished. This is the dominant efficiency technique in production LLMs as of 2026 and a key reason why 128K–200K context windows are commercially viable.
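The reduction factor is just the ratio of query heads to KV heads. A quick comparison of the three sharing schemes (64 query heads is an illustrative count):

```python
def kv_cache_reduction(n_query_heads: int, n_kv_heads: int) -> float:
    """Factor by which the KV cache shrinks versus full multi-head attention."""
    return n_query_heads / n_kv_heads

print(kv_cache_reduction(64, 64))  # 1.0  -> standard MHA, no sharing
print(kv_cache_reduction(64, 8))   # 8.0  -> GQA with 8 KV groups
print(kv_cache_reduction(64, 1))   # 64.0 -> MQA, one shared KV head
```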
Flash Attention: Developed by Dao et al. (2022), Flash Attention is a hardware-aware algorithm that computes exact attention while minimizing GPU memory reads/writes. It does not change the mathematical output but reduces wall-clock time and memory usage by 2–4x through careful tiling and kernel fusion. Flash Attention has become standard in all major LLM serving frameworks (vLLM, TensorRT-LLM, TGI) and is a major contributor to the continued price reductions from API providers. Flash Attention 2 and 3 provide further improvements.
Sparse Attention: Instead of every token attending to every other token (full attention), sparse attention patterns restrict attention to a subset: local windows (attend only to nearby tokens), strided patterns (attend to every Nth token), or learned patterns (attend to tokens the model predicts are most relevant). Longformer, BigBird, and some Gemini variants use sparse attention to reduce the quadratic cost to linear or near-linear scaling. The tradeoff is that some long-range dependencies may be missed, but for many practical tasks, the quality impact is minimal while the cost improvement is substantial.
KV Cache Compression: Techniques like PagedAttention (used in vLLM), token dropping, and quantized KV caches reduce the memory footprint of the KV cache without changing the model architecture. PagedAttention manages KV cache memory like virtual memory, eliminating waste from pre-allocated buffers. This allows providers to serve 2–4x more concurrent requests per GPU, directly reducing per-token costs. vLLM with PagedAttention is the dominant open-source serving framework and is used by many commercial providers behind their APIs.
Speculative Decoding: This technique uses a small, fast "draft" model to generate candidate tokens, which are then verified in batch by the large target model. Because verification can be parallelized (unlike generation), speculative decoding can increase generation throughput by 2–3x when the draft model's predictions match the target model's. This reduces latency and increases GPU utilization, enabling providers to serve more requests and potentially lower prices.
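A standard approximation from the speculative decoding literature estimates the expected tokens emitted per target-model pass as (1 − α^(γ+1)) / (1 − α), where α is the per-token acceptance probability and γ the number of draft tokens. A sketch (the α and γ values below are illustrative):

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass when a draft
    model proposes gamma tokens, each accepted with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 draft tokens per pass:
print(f"{expected_tokens_per_pass(0.8, 4):.2f} tokens/pass")  # 3.36 vs 1.0 without drafting
```

Higher acceptance rates (a draft model well matched to the target) push the multiplier toward γ + 1.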
The cumulative effect of these improvements is dramatic: a 2026 serving stack can serve the same model at 5–10x lower cost per token than a 2023 stack, through purely engineering and algorithmic improvements. This is why API prices have fallen 80–90% since 2023 even as model capabilities have improved. CostHawk tracks these pricing changes automatically, ensuring your cost reports always reflect current rates.
The Cost Implications of Architecture Choices
As an API consumer, you do not choose the transformer architecture — the model provider does. But understanding architecture informs your usage decisions and helps you predict future pricing trends:
Context length is not free. Marketing materials emphasize ever-larger context windows (128K, 200K, 1M tokens), but the quadratic cost of attention means that filling these windows is expensive. A request that sends 100K tokens of context costs 100x more in input tokens than one that sends 1K tokens — and the underlying compute cost is even higher due to quadratic attention scaling. Before stuffing a context window, ask: would retrieval-augmented generation (RAG) with 2K tokens of targeted context produce equivalent quality at 50x lower cost?
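That comparison is easy to price out. A sketch at an assumed $2.50/MTok input rate (illustrative, not a quote from any provider):

```python
RATE_PER_MTOK = 2.50  # assumed input rate in $/MTok -- illustrative only

def input_cost(tokens: int) -> float:
    """Input-token cost in dollars for a single request."""
    return tokens / 1e6 * RATE_PER_MTOK

full_context = input_cost(100_000)  # stuffing the window
rag_context = input_cost(2_000)     # targeted retrieval
print(f"full: ${full_context:.3f}  rag: ${rag_context:.4f}  "
      f"ratio: {full_context / rag_context:.0f}x")
```

The 50x billed-token gap understates the true serving-cost gap, since attention compute scales quadratically.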
Model size correlates with cost, not always with quality. A 70B parameter model is not necessarily 10x better than a 7B model — but it is roughly 10x more expensive to serve. For many production tasks, a well-prompted 7B–13B model matches a 70B model on quality metrics. The architectural insight here is that parameter count determines compute cost, but task performance depends on how well the model's training data and fine-tuning match your use case. CostHawk's per-model analytics help you identify cases where you are paying for a larger model without getting proportional quality benefits.
Architecture improvements reduce costs over time. The trend in transformer efficiency — from full attention to GQA, from naive serving to Flash Attention + PagedAttention, from FP16 to INT4 quantization — consistently pushes per-token costs down. API prices have dropped 80–90% since GPT-4's launch in March 2023. If you are evaluating whether to build self-hosted infrastructure, factor in that API prices will likely continue falling 30–50% per year, shifting the crossover point where self-hosting becomes cheaper.
Reasoning models have hidden architectural costs. Models like o1, o3, and Claude with extended thinking use chain-of-thought reasoning that generates large numbers of internal "thinking" tokens before producing visible output. These thinking tokens consume the same compute as visible output tokens (they go through the same transformer forward pass) and are billed accordingly. A request that produces 500 visible output tokens might consume 5,000+ thinking tokens, making reasoning models 5–10x more expensive per visible output token than their sticker price suggests. CostHawk tracks thinking tokens separately to give you accurate cost visibility for reasoning model usage.
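The effective rate can be computed directly. A sketch with assumed numbers — a $10/MTok sticker output rate, 500 visible tokens, and 5,000 thinking tokens:

```python
def effective_output_rate(rate_out: float, visible: int, thinking: int) -> float:
    """Billed $/MTok per *visible* output token: you pay for thinking
    tokens too, but only the visible tokens reach your application."""
    return rate_out * (visible + thinking) / visible

# Assumed: $10/MTok sticker rate, 500 visible tokens, 5,000 thinking tokens
print(effective_output_rate(10.0, 500, 5_000))  # 110.0 -> an 11x effective rate
```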
Multimodal models use adapted transformers. Vision transformers (ViT) process image patches as tokens, and audio transformers process audio frames as tokens. These modality-specific tokens go through the same attention mechanism as text tokens, which is why image and audio inputs are billed in token-equivalent units. The cost implication is that multimodal inputs are subject to the same quadratic attention scaling — a high-resolution image that maps to 2,000 tokens adds quadratically to attention cost just like 2,000 text tokens would.
The transformer architecture is the economic engine behind the AI API industry. Every pricing decision, every context limit, and every optimization technique traces back to the fundamental compute characteristics of self-attention. Understanding this architecture gives you a structural advantage in predicting costs, evaluating tradeoffs, and making informed decisions about how to use AI APIs efficiently.
FAQ
Frequently Asked Questions
Why does the transformer architecture make long contexts expensive?
What is the difference between a transformer and an LLM?
How does the number of transformer layers affect API costs?
What is the KV cache and why does it matter for costs?
Why are output tokens more expensive than input tokens from an architecture perspective?
What is Flash Attention and does it reduce API costs?
How do Mixture of Experts transformers reduce costs?
Will transformer architecture improvements continue to reduce API prices?
Related Terms
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
