Glossary · Infrastructure · Updated 2026-03-16

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.

Definition

What is a Large Language Model (LLM)?

A large language model (LLM) is a deep neural network, built on the transformer architecture, that has been trained on vast quantities of text data to predict and generate coherent language. Modern LLMs contain billions of parameters: GPT-4 is estimated at over 1 trillion parameters across its mixture-of-experts architecture, while Anthropic and Google do not disclose parameter counts for Claude 3.5 Sonnet and Gemini 1.5 Pro, which are generally estimated in the hundreds of billions. These models learn statistical patterns in language during a computationally expensive training phase, then serve predictions during an inference phase that is far cheaper per request but significant in aggregate. For teams consuming LLMs via API, inference cost is the dominant expense: you pay per token processed, with rates varying dramatically across model families, sizes, and providers. Understanding the landscape of LLMs — their capabilities, cost profiles, and tradeoffs — is essential for deciding which model to use for which task.

Impact

Why It Matters for AI Costs

LLMs are the compute engine behind every AI API call, and the model you choose determines both the quality and cost of your output. The pricing spread across available models is enormous:

Model               Input (per 1M tokens)   Output (per 1M tokens)
GPT-4o mini         $0.15                   $0.60
Gemini 2.0 Flash    $0.10                   $0.40
Claude 3.5 Haiku    $0.80                   $4.00
GPT-4o              $2.50                   $10.00
Claude 3.5 Sonnet   $3.00                   $15.00
Claude 3 Opus       $15.00                  $75.00

That is a 150x difference between the cheapest and most expensive input rates in this table. A team processing 10 million input tokens per day would pay about $1/day with Gemini 2.0 Flash or $150/day with Claude 3 Opus — roughly $30/month versus $4,500/month. The capability gap between these models is real, but for many tasks (classification, extraction, simple Q&A) the cheaper models perform within a few percentage points of the frontier models.

Choosing the right LLM for each task is the highest-leverage cost optimization available. CostHawk's model-level analytics show you exactly which models are consuming your budget and how they compare on cost-per-query, enabling data-driven model selection decisions.

How LLMs Generate Text

Understanding how LLMs generate text explains why inference costs money and why output tokens cost more than input tokens.

The process has two phases:

1. Prefill (processing input): The model reads your entire input — system prompt, user message, conversation history, tool definitions — and computes internal representations (called "activations") for each token. This happens in parallel on the GPU, making it relatively fast and computationally efficient. The model builds a key-value (KV) cache that stores contextual information about each input token for use during generation.

2. Autoregressive decoding (generating output): The model generates output one token at a time. For each new token, it:

  1. Reads the KV cache from all previous tokens (both input and previously generated output)
  2. Computes a probability distribution over the entire vocabulary
  3. Samples or selects the next token based on a temperature setting
  4. Appends the new token to the KV cache
  5. Repeats until a stop condition is met (end-of-sequence token, max_tokens limit, or stop sequence)
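The two phases above can be sketched as a toy loop. This is an illustration, not a real inference engine: `toy_forward` and its hard-coded scores are fabricated stand-ins for a transformer forward pass, chosen only so the prefill/decode structure and KV-cache growth are visible.

```python
EOS = -1  # sentinel end-of-sequence token

def toy_forward(tokens, kv_cache):
    """Stand-in for a transformer forward pass: extends the KV cache with
    the new tokens and returns fake next-token scores for a tiny vocabulary."""
    kv_cache = kv_cache + list(tokens)
    n = len(kv_cache)
    # Fabricated scores: favour EOS once the sequence is long enough.
    scores = {1: 3 - n % 3, 2: 1, 3: 1, EOS: 10 if n >= 6 else 0}
    return scores, kv_cache

def generate(input_tokens, max_tokens):
    # Phase 1 (prefill): process the whole input at once, building the KV cache.
    scores, kv_cache = toy_forward(input_tokens, [])
    output = []
    # Phase 2 (decoding): one forward pass per generated token.
    while len(output) < max_tokens:
        next_token = max(scores, key=scores.get)   # greedy stand-in for sampling
        if next_token == EOS:                      # stop condition
            break
        output.append(next_token)
        scores, kv_cache = toy_forward([next_token], kv_cache)
    return output

print(generate([10, 11, 12], max_tokens=5))
```

Note that the prefill call processes three tokens in a single pass, while the decoding loop pays one full pass per output token — which is exactly why output tokens are priced higher than input tokens.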

This sequential generation is why output tokens cost 3–5x more than input tokens: prefill processes the entire input in one parallel pass, while decoding requires a separate forward pass for every output token and cannot be parallelized within a request. The KV cache also grows linearly with sequence length, consuming GPU memory for the lifetime of the request and limiting how many requests each GPU can serve concurrently.

From a cost perspective, this means:

  • Longer inputs increase cost linearly (more tokens to process in the prefill phase)
  • Longer outputs increase cost linearly and at a higher per-token rate
  • The total cost of a request is: (input_tokens × input_rate) + (output_tokens × output_rate)

Streaming (receiving tokens as they are generated) does not change the total cost — it only changes the delivery pattern. You pay the same whether you receive all tokens at once or one at a time.
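The per-request formula can be wrapped in a small helper. The rates below are taken from the pricing table earlier in this article (per million tokens, USD); treat this as a sketch, not a live price feed.

```python
# Per-million-token rates from the table above: (input_rate, output_rate) in USD.
RATES = {
    "gpt-4o-mini":       (0.15, 0.60),
    "gemini-2.0-flash":  (0.10, 0.40),
    "claude-3.5-haiku":  (0.80, 4.00),
    "gpt-4o":            (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-opus":     (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """(input_tokens x input_rate) + (output_tokens x output_rate)."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 2,000-token prompt with a 500-token reply on GPT-4o:
print(f"${request_cost('gpt-4o', 2_000, 500):.4f}")  # $0.0100
```

Note how the 500 output tokens cost as much as the 2,000 input tokens — the 4x output premium in action.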

Model Families and Pricing Tiers

The LLM landscape in 2026 falls into distinct tiers based on capability and cost. Understanding these tiers is essential for model routing and cost optimization:

Economy ($0.10 – $0.80/MTok input)
  Models: GPT-4o mini, Gemini 2.0 Flash, Claude 3.5 Haiku, Mistral Small
  Best for: classification, extraction, simple Q&A, formatting, translation

Mid-tier ($1.25 – $3.00/MTok input)
  Models: GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Mistral Large
  Best for: complex reasoning, code generation, analysis, creative writing

Frontier ($10.00 – $15.00/MTok input)
  Models: Claude 3 Opus, GPT-4.5, o1-pro
  Best for: research-grade reasoning, novel problem-solving, safety-critical tasks

Reasoning ($1.00 – $15.00/MTok input, plus thinking tokens)
  Models: o1, o3-mini, Claude with extended thinking
  Best for: math, logic, multi-step planning, code debugging

Key observations for cost management:

  • Economy models handle 60–80% of typical production workloads. If you are routing everything to a mid-tier model, you are likely overspending on simple tasks.
  • Reasoning models have hidden costs. Models like o1 and o3-mini consume "thinking tokens" that count toward your bill but don't appear in the output. A request that generates 500 visible output tokens might consume 5,000+ thinking tokens internally.
  • Pricing changes frequently. OpenAI has cut GPT-4o pricing twice since launch. Anthropic adjusts rates with new model releases. CostHawk tracks the latest pricing across all providers so your cost reports are always accurate.
  • Batch APIs offer 50% discounts. Both OpenAI and Anthropic offer batch processing at half the real-time rate, suitable for workloads that can tolerate hours of latency.

Training vs Inference Cost

The cost of an LLM has two fundamentally different components, and which one matters to you depends on your role:

Training cost is the one-time expense of training the model on massive datasets. GPT-4 is estimated to have cost $100+ million to train. Llama 3 405B reportedly cost Meta tens of millions in compute alone. Training costs are borne by the model provider (OpenAI, Anthropic, Google, Meta) and amortized across all customers.

Inference cost is the ongoing expense of running the trained model to process your requests. This is what you pay for as an API customer — every token in, every token out, metered and billed. For API consumers, inference cost is the only cost that matters.

The economics work like this: a provider spends $100 million training a model, then serves billions of requests at $2.50–$15.00 per million tokens. The training cost is fixed and sunk; the inference cost scales linearly with usage. Providers aim to recoup training costs and earn margins through inference pricing.

For your team, this means:

  • You cannot control training costs. These are embedded in the provider's per-token pricing.
  • You can control inference costs. Every optimization — shorter prompts, output capping, model routing, caching — directly reduces your inference bill.
  • Open-source models shift the equation. With models like Llama 3 and Mistral, you can self-host and pay only for the GPU compute, eliminating the provider's margin. But self-hosting introduces operational costs (GPU provisioning, model serving infrastructure, scaling, monitoring) that can exceed API costs for small-to-medium workloads.

The crossover point where self-hosting becomes cheaper than API access typically occurs around $10,000–$30,000/month in API spend, depending on the model and your team's infrastructure capabilities. CostHawk helps you track your spend trajectory so you can evaluate this decision with real data.

Open-Source vs API-Hosted Models

The choice between open-source (self-hosted) and API-hosted models is one of the most consequential infrastructure decisions for AI-powered applications. Each approach has distinct cost characteristics:

API-hosted models (OpenAI, Anthropic, Google):

  • Zero infrastructure management — the provider handles GPUs, scaling, uptime
  • Pay-per-token pricing with no minimum commitment (for most tiers)
  • Access to the most capable frontier models that are not available as open-source
  • Costs scale linearly with usage — great at low volume, expensive at high volume
  • Provider lock-in risk if you build against model-specific features

Open-source models (Llama 3, Mistral, DeepSeek, Qwen):

  • No per-token fees — you pay for GPU compute (typically $1–$4/hour per GPU)
  • Models lag frontier capabilities by 6–12 months but are improving rapidly
  • Full control over model weights, fine-tuning, and deployment configuration
  • Fixed infrastructure costs regardless of token volume — great at high volume
  • Requires MLOps expertise for deployment, scaling, and monitoring

Cost comparison example: Running Llama 3 70B on 4x A100 GPUs costs approximately $6/hour ($4,320/month). At the throughput these GPUs provide (~500 tokens/second), you can process roughly 1.3 billion tokens per month. The equivalent volume on GPT-4o would cost: 1.3B tokens × $2.50/MTok input = $3,250/month for input alone, plus output costs. For this specific scenario, self-hosting is competitive. But add in engineering time for deployment, monitoring, and maintenance, and the picture changes.
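The arithmetic in this comparison can be reproduced directly. The hourly GPU price and throughput are the illustrative figures from the scenario above, not vendor quotes; the article rounds 1.296B tokens to 1.3B, which is why it quotes $3,250 rather than the exact $3,240.

```python
# Self-hosting: 4x A100 at ~$6/hour total, ~500 tokens/second sustained throughput.
gpu_cost_per_month = 6.00 * 24 * 30            # $4,320/month, flat
tokens_per_month = 500 * 3600 * 24 * 30        # 1,296,000,000 (~1.3B) tokens

# API equivalent: GPT-4o input at $2.50 per million tokens (input side only).
api_input_cost = tokens_per_month / 1e6 * 2.50

print(f"self-hosted: ${gpu_cost_per_month:,.0f}/month flat")
print(f"API input:   ${api_input_cost:,.0f}/month for {tokens_per_month / 1e9:.1f}B tokens")
```

The key structural difference: the self-hosted line stays $4,320 whether you push 1 token or 1.3 billion, while the API line scales linearly — which is the whole crossover argument in one comparison.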

Hybrid approach: Many teams use a hybrid strategy — self-host an economy model for high-volume, simple tasks, and use API-hosted frontier models for complex tasks. CostHawk can monitor both self-hosted and API costs in a unified dashboard, helping you optimize the split between the two approaches.

Choosing the Right Model for Your Budget

Model selection is the single highest-leverage decision for AI cost optimization. Here is a practical framework for choosing the right model for each use case:

Step 1: Categorize your requests. Most AI workloads fall into a few task categories, each with different capability requirements:

  • Classification and extraction (sentiment, entity recognition, structured data extraction): Economy models handle these with 95%+ accuracy.
  • Simple generation (FAQ responses, template filling, translation): Economy models perform well; mid-tier if quality requirements are high.
  • Complex reasoning (multi-step analysis, code generation, long-form writing): Mid-tier models are the sweet spot; frontier for critical tasks.
  • Novel problem-solving (research, mathematical proofs, adversarial scenarios): Frontier or reasoning models required.

Step 2: Benchmark quality at each tier. For each task category, test representative prompts against models at each pricing tier. Measure output quality using your application's specific criteria (accuracy, completeness, formatting compliance, etc.). You will often find that the quality difference between tiers is smaller than expected for structured tasks.

Step 3: Calculate the cost at your volume. For each task category, estimate daily request volume, average input tokens, and average output tokens. Multiply by the per-token rate for each candidate model to get daily and monthly costs. This often reveals that moving a high-volume, simple task from a mid-tier to an economy model saves more money than optimizing prompts for a low-volume, complex task.

Step 4: Implement model routing. Configure your application to route each request to the appropriate model based on task category. Start with static routing (hardcoded model per endpoint) and evolve toward dynamic routing (a lightweight classifier that selects the model per request) as you gather more data.

Step 5: Monitor and iterate. Use CostHawk to track cost-per-query and quality metrics by model over time. As providers release new models and adjust pricing, re-evaluate your routing decisions. A model that was the best value six months ago may no longer be optimal.
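The static routing described in Step 4 can be sketched in a few lines. The model names and task categories here are illustrative assumptions, not a recommended production mapping — Step 2's benchmarks should drive the actual table.

```python
# Static routing table: task category -> model (illustrative names only).
ROUTES = {
    "classification": "gpt-4o-mini",   # economy tier
    "extraction":     "gpt-4o-mini",
    "generation":     "gpt-4o",        # mid-tier
    "reasoning":      "o3-mini",       # reasoning tier
}
DEFAULT_MODEL = "gpt-4o"               # conservative fallback for unknown categories

def pick_model(task_category: str) -> str:
    """Route a request to the cheapest model adequate for its category."""
    return ROUTES.get(task_category, DEFAULT_MODEL)

print(pick_model("classification"))    # gpt-4o-mini
print(pick_model("novel-problem"))     # gpt-4o (fallback)
```

Evolving this toward dynamic routing means replacing the dictionary lookup with a lightweight classifier that assigns the category per request.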

LLMs and Cost Monitoring

Monitoring LLM costs requires a different approach than traditional infrastructure monitoring. Unlike CPU or memory, where utilization is the primary metric, LLM costs are driven by token volume, model selection, and request patterns. Here is what effective LLM cost monitoring looks like:

Per-model cost tracking: Break down your total AI spend by model to see which models are consuming the most budget. This reveals model routing inefficiencies — for example, if 70% of your spend goes to GPT-4o and analysis shows that half of those requests could be handled by GPT-4o mini, you have identified an optimization worth nearly 35% of your total spend, since GPT-4o mini costs a small fraction of GPT-4o's rate.

Cost-per-query analysis: Track the average cost per request by endpoint, feature, or use case. A search endpoint averaging $0.002/query is very different from an analysis endpoint averaging $0.15/query. Understanding these unit economics is essential for pricing your own product, forecasting costs as usage grows, and identifying optimization targets.

Token efficiency metrics: Monitor the ratio of useful output tokens to total tokens consumed. If your application generates 500 tokens of output but the user only reads or uses 100 tokens (for example, a chatbot where the user asks for a yes/no answer and gets a paragraph), you are paying for wasted output. Track output utilization to identify opportunities for more concise generation.

Provider-level aggregation: If you use multiple providers (many teams use OpenAI for some tasks and Anthropic for others), aggregate costs across providers for a unified view. CostHawk normalizes pricing across all major providers so you can compare apples-to-apples.

Anomaly detection: LLM costs can spike suddenly due to retry storms, leaked keys, context window bloat, or new feature launches. Set up anomaly detection that alerts when daily or hourly spend deviates significantly from the baseline. CostHawk's anomaly detection uses statistical methods to flag unusual patterns before they become expensive surprises.
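A basic version of the statistical check described above: flag a day's spend if it sits more than a few standard deviations above the trailing baseline. This is a minimal sketch — production anomaly detection typically uses more robust methods (seasonality adjustment, median-based statistics), and the spend figures are made up.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it exceeds the trailing mean by more than
    z_threshold standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu          # flat baseline: any change is notable
    return (today - mu) / sigma > z_threshold

daily_spend = [102.0, 98.0, 105.0, 99.0, 101.0, 97.0, 103.0]  # trailing week, USD
print(is_anomalous(daily_spend, 104.0))   # False: within normal variation
print(is_anomalous(daily_spend, 450.0))   # True: likely a retry storm or leaked key
```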

Forecasting: Use historical token consumption trends to forecast future costs. If your token usage is growing 15% month-over-month, project out 3–6 months to understand your future budget requirements. This enables proactive conversations with finance about AI infrastructure budgets rather than reactive explanations of unexpected bills.
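The compound-growth projection described above, sketched numerically. The $10,000/month baseline and 15% growth rate are illustrative figures, not benchmarks.

```python
def project_spend(current_monthly: float, mom_growth: float, months: int) -> list[float]:
    """Project monthly spend forward assuming compound month-over-month growth."""
    return [current_monthly * (1 + mom_growth) ** m for m in range(1, months + 1)]

forecast = project_spend(10_000, 0.15, 6)
for month, spend in enumerate(forecast, start=1):
    print(f"month +{month}: ${spend:,.0f}")
# 15% MoM compounds to roughly 2.3x over six months — linear extrapolation
# from the last two data points would badly underestimate the budget.
```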

FAQ

Frequently Asked Questions

What is the difference between an LLM and a foundation model?
A foundation model is a broader category that includes any large model trained on diverse data that can be adapted to multiple tasks. An LLM is a specific type of foundation model focused on language. Foundation models also include image-generation models (like DALL-E), multimodal models (like GPT-4o with vision), and audio models (like Whisper). In practice, the terms are often used interchangeably because the most commercially important foundation models are language models or language-centric multimodal models. From a cost perspective, the distinction matters because vision and audio inputs are tokenized differently and may have different pricing. A multimodal request that includes an image might consume 1,000+ tokens just for the image, on top of the text tokens.
How do I estimate the cost of running an LLM-powered feature?
To estimate costs, you need four numbers: (1) average input tokens per request, (2) average output tokens per request, (3) the per-token rate for your chosen model, and (4) expected request volume. Multiply: (avg_input_tokens × input_rate + avg_output_tokens × output_rate) × daily_requests × 30. For example, a customer support chatbot using GPT-4o with 800 input tokens and 300 output tokens per message, handling 5,000 messages/day: (800 × $2.50/MTok + 300 × $10.00/MTok) × 5,000 × 30 = ($0.002 + $0.003) × 150,000 = $750/month. Always add a 20–30% buffer for conversation history growth, retry overhead, and usage spikes. CostHawk's cost calculator and historical analytics help you refine these estimates with real production data.
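The arithmetic in this answer as a reusable sketch (rates per million tokens; the 25% default buffer matches the 20–30% range recommended above):

```python
def monthly_cost(avg_in: int, avg_out: int, in_rate: float, out_rate: float,
                 daily_requests: int, buffer: float = 0.25) -> float:
    """Estimate monthly spend: per-request cost x daily volume x 30 days,
    plus a buffer for history growth, retries, and usage spikes."""
    per_request = (avg_in * in_rate + avg_out * out_rate) / 1_000_000
    return per_request * daily_requests * 30 * (1 + buffer)

# The chatbot example above: GPT-4o, 800 in / 300 out, 5,000 messages/day.
base = monthly_cost(800, 300, 2.50, 10.00, 5_000, buffer=0.0)
print(f"${base:,.0f}/month before buffer")   # $750/month before buffer
print(f"${monthly_cost(800, 300, 2.50, 10.00, 5_000):,.0f}/month with 25% buffer")
```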
Should I use one LLM for everything or different models for different tasks?
Using different models for different tasks (model routing) is almost always more cost-effective than using a single model for everything. The reason is simple: tasks vary enormously in complexity, and paying frontier model prices for simple tasks wastes money. A sentiment classification that works perfectly with GPT-4o mini at $0.15/MTok should not be routed to Claude 3.5 Sonnet at $3.00/MTok. Start by categorizing your workloads into complexity tiers, then benchmark each tier against economy, mid-tier, and frontier models. Most teams find that 50–70% of their requests can be handled by economy models with no quality loss, yielding 60–80% cost savings on those requests. The initial effort to set up model routing is repaid within weeks for any team spending more than $500/month on AI APIs.
What are reasoning models and why do they cost more?
Reasoning models — like OpenAI's o1, o3-mini, and Anthropic's Claude with extended thinking — are LLMs that spend additional compute on internal "thinking" before producing a response. They break complex problems into steps, verify their own logic, and explore multiple solution paths. This produces better results on math, logic, coding, and multi-step reasoning tasks. The cost implication is significant: reasoning models consume "thinking tokens" that are billed but do not appear in the visible output. A request that produces 500 output tokens might consume 5,000–20,000 thinking tokens internally, making the effective cost 10–40x higher than the output token count suggests. OpenAI's o1 charges $15/MTok for input and $60/MTok for output (including thinking), while o3-mini is more economical at $1.10/$4.40. Use reasoning models selectively for tasks that genuinely require multi-step reasoning, not for simple generation.
How fast are LLM prices dropping?
LLM pricing has been declining rapidly. GPT-4 launched in March 2023 at $30/$60 per million tokens. GPT-4 Turbo (November 2023) dropped to $10/$30. GPT-4o (May 2024) dropped further to $5/$15, then to $2.50/$10.00 by late 2024. That represents an approximately 90% price reduction in 18 months for comparable capability. Anthropic has followed a similar trajectory: Claude 3 Opus launched at $15/$75, and Claude 3.5 Sonnet delivers arguably better performance at $3/$15. Google has been the most aggressive on pricing, with Gemini 2.0 Flash at $0.10/$0.40. The trend is driven by hardware improvements (newer GPUs), inference optimization (speculative decoding, quantization), and competitive pressure. Expect another 30–50% reduction over the next 12 months, but do not wait for price drops — optimize now and bank the additional savings when prices fall.
What is the difference between a model's parameter count and its context window?
Parameter count and context window are two independent dimensions of a model. Parameter count (measured in billions) refers to the number of learned weights in the neural network — it determines the model's knowledge capacity and reasoning ability. GPT-4 has an estimated 1.7 trillion parameters; Llama 3 70B has 70 billion. More parameters generally means better quality but higher inference cost per token. Context window (measured in tokens) is the maximum number of tokens the model can process in a single request. It determines how much text you can send and receive. GPT-4o has a 128K context window; Claude 3.5 Sonnet has 200K; Gemini 1.5 Pro has 2 million. A larger context window lets you include more information per request but costs more because you are processing more input tokens. The two are not directly correlated — a smaller model can have a larger context window and vice versa.
Can I fine-tune an LLM to reduce inference costs?
Fine-tuning can reduce inference costs in specific scenarios, but it is not a universal cost-saving strategy. The primary way fine-tuning saves money is by eliminating few-shot examples from your prompts. If your current prompts include 2,000 tokens of examples to teach the model a specific output format, a fine-tuned model that achieves the same quality with zero-shot prompts saves those 2,000 input tokens per request. At 10,000 requests/day on GPT-4o, that is 20 million tokens/day = $50/day saved. However, fine-tuned models have higher per-token inference costs — OpenAI charges roughly 2–6x the base model rate for fine-tuned model inference. You need to calculate whether the input token savings outweigh the inference rate premium at your specific volume. Fine-tuning also requires a training dataset and ongoing maintenance as your requirements evolve. For most teams, prompt optimization and model routing deliver better ROI than fine-tuning.
How does CostHawk track costs across different LLM providers?
CostHawk provides unified cost tracking across all major LLM providers through three mechanisms: (1) Wrapped keys — proxy your API calls through CostHawk to get automatic per-request cost attribution with zero code changes. CostHawk records the model, token counts, and computed cost for every request. (2) MCP telemetry — for AI coding tools like Claude Code and OpenAI Codex CLI, the CostHawk MCP server captures session-level usage data locally and syncs it to your dashboard. (3) Provider API sync — CostHawk can pull usage data directly from provider billing APIs for historical backfill and reconciliation. All costs are normalized to a common format (USD, per-million-token basis) so you can compare across providers in a single dashboard. This unified view is essential for model routing decisions, budget allocation, and identifying which provider offers the best value for each workload.

Related Terms

Token

The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.

Context Window

The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.

Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Retrieval-Augmented Generation (RAG)

An architecture pattern that combines a large language model with an external knowledge retrieval system. Instead of relying solely on the model's trained knowledge, RAG fetches relevant documents at query time and injects them into the prompt, improving accuracy while enabling fine-grained cost control over context size.

Cost Per Query

The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.

Token Pricing

The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.