Glossary · Infrastructure · Updated 2026-03-16

Foundation Model

A large, general-purpose AI model pre-trained on broad data that serves as the base for downstream applications. Foundation models like GPT-4, Claude, Gemini, and Llama represent enormous upfront training investments whose costs are amortized across millions of API consumers. Choosing the right foundation model determines both baseline capability and baseline cost for every AI-powered feature you build.

Definition

What is a Foundation Model?

A foundation model is a large-scale AI model trained on broad, diverse data — typically trillions of tokens of text, code, images, and other media — that can be adapted to a wide range of downstream tasks without task-specific retraining. The term was popularized by Stanford's Center for Research on Foundation Models (CRFM) in 2021 to describe a paradigm shift in AI: rather than training a separate model for each task (sentiment analysis, translation, code generation), a single massive model is pre-trained once at enormous cost and then applied to many tasks through prompting, fine-tuning, or retrieval augmentation. Current foundation models include OpenAI's GPT-4o and GPT-4.5, Anthropic's Claude 3.5 Sonnet and Claude 3 Opus, Google's Gemini 2.0 Flash and Gemini 1.5 Pro, Meta's Llama 3 405B, and Mistral Large. These models cost tens to hundreds of millions of dollars to train — GPT-4 is estimated at $100M+, Gemini Ultra at $200M+ — but the training cost is borne entirely by the provider and amortized across all API consumers. For teams building on these models via API, the relevant cost is inference cost: the per-token price of running the pre-trained model on your inputs. Foundation model selection is the single most impactful cost decision because it sets the per-token rate for every request your application makes.

Impact

Why It Matters for AI Costs

Foundation models are the economic bedrock of the AI API ecosystem. Every dollar you spend on AI inference traces back to a foundation model's architecture, training, and serving infrastructure. Understanding this layer is critical because:

1. Model selection determines unit economics. The foundation model you choose sets the per-token rate that governs your entire cost structure. At 100,000 requests per day with 500 input and 300 output tokens per request, your monthly cost ranges from roughly $510/month on Gemini 2.0 Flash to $90,000/month on Claude 3 Opus — a 176x difference. If your application can meet its quality requirements with an economy-tier foundation model, you save an order of magnitude or more compared to using a frontier model by default.

| Foundation Model | Input $/MTok | Output $/MTok | Monthly Cost (100K req/day) |
|---|---|---|---|
| Gemini 2.0 Flash | $0.10 | $0.40 | $510 |
| GPT-4o mini | $0.15 | $0.60 | $765 |
| Claude 3.5 Haiku | $0.80 | $4.00 | $4,800 |
| GPT-4o | $2.50 | $10.00 | $12,750 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $18,000 |
| Claude 3 Opus | $15.00 | $75.00 | $90,000 |
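The monthly figures follow directly from the per-MTok rates and the stated volume assumptions (100K requests/day, 500 input and 300 output tokens per request, 30-day month). A minimal Python sketch of that arithmetic:

```python
# Estimate monthly inference cost per model from published per-MTok rates.
RATES = {
    "Gemini 2.0 Flash":  (0.10, 0.40),
    "GPT-4o mini":       (0.15, 0.60),
    "Claude 3.5 Haiku":  (0.80, 4.00),
    "GPT-4o":            (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Opus":     (15.00, 75.00),
}

def monthly_cost(input_rate, output_rate,
                 requests_per_day=100_000,
                 input_tokens=500, output_tokens=300, days=30):
    """Monthly cost in dollars, given $/MTok rates."""
    daily = (requests_per_day * input_tokens * input_rate
             + requests_per_day * output_tokens * output_rate) / 1_000_000
    return daily * days

for name, (inp, out) in RATES.items():
    print(f"{name:18s} ${monthly_cost(inp, out):>10,.0f}/month")
# Gemini 2.0 Flash works out to $510/month; Claude 3 Opus to $90,000/month.
```

Swapping in your own request volume and token averages gives a quick first-pass comparison before any benchmarking.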

2. Foundation models evolve rapidly. New models are released every few months with better capability-to-cost ratios. Claude 3.5 Sonnet (June 2024) offered Claude 3 Opus-level quality at 5x lower cost. GPT-4o mini (July 2024) delivered GPT-4-level performance for simple tasks at 16x lower cost than GPT-4o. Teams that lock into a single model and never re-evaluate overpay as cheaper, better alternatives emerge.

3. The foundation model determines what optimization levers are available. Prompt caching discounts vary by provider (90% at Anthropic, 50% at OpenAI). Batch processing availability and pricing differ. Fine-tuning options and costs vary dramatically. The foundation model is not just a cost rate — it is a platform that determines your entire optimization strategy. CostHawk tracks spending across all foundation models and highlights when newer, more cost-efficient models could reduce costs for specific workloads.

What Makes a Model a Foundation Model

Not every AI model qualifies as a "foundation model." The term has specific characteristics that distinguish foundation models from task-specific models, fine-tuned variants, and smaller specialized systems:

1. Scale of training data. Foundation models are trained on trillions of tokens from diverse sources: web text, books, academic papers, code repositories, social media, and increasingly images, audio, and video. GPT-4 was reportedly trained on approximately 13 trillion tokens. Llama 3 405B was trained on 15+ trillion tokens. This breadth gives foundation models general-purpose capabilities across language understanding, reasoning, code generation, creative writing, and more — without any task-specific training signal.

2. Scale of parameters. Foundation models contain billions to trillions of parameters. The smallest models commonly called "foundation models" have around 7 billion parameters (Llama 3 8B, Mistral 7B), while the largest exceed 1 trillion (GPT-4's rumored MoE architecture). Parameter count correlates with capability but also with inference cost — a 70B model costs roughly 10x more to serve per token than a 7B model.

3. General-purpose adaptability. The defining characteristic of a foundation model is that it can be adapted to virtually any text-based task through prompting alone, without retraining. You can use GPT-4o for customer support, code review, legal analysis, creative writing, data extraction, and translation — all from the same model endpoint, just by changing the prompt. This is what makes foundation models "foundational" — they provide a base capability layer that can be specialized through prompting, fine-tuning, or RAG.

4. Emergent capabilities. Foundation models exhibit capabilities that were not explicitly trained but emerge from scale. Chain-of-thought reasoning, in-context learning (improving performance from examples in the prompt), and tool use are all emergent capabilities that appear in sufficiently large foundation models but not in smaller task-specific models. These emergent capabilities are what enable the diverse applications built on foundation model APIs.

5. Enormous training cost. Training a frontier foundation model requires thousands of GPUs running for months at costs of $50M–$500M+. This cost barrier creates an oligopoly of foundation model providers (OpenAI, Anthropic, Google, Meta, Mistral) and means that API consumers benefit from amortized training costs rather than bearing them directly. When you pay $2.50 per million input tokens for GPT-4o, a portion of that covers OpenAI's amortized training investment.

The Foundation Model Landscape in 2026

The foundation model ecosystem has matured significantly, with clear tiers of capability and cost. Understanding the current landscape is essential for informed model selection:

Closed-source frontier models:

  • OpenAI GPT-4o: The workhorse of production AI. Strong across all tasks, 128K context window, multimodal (text + vision). $2.50/$10.00 per MTok. Best for: general-purpose production workloads requiring reliable quality.
  • Anthropic Claude 3.5 Sonnet: Excels at coding, analysis, and instruction following. 200K context window, strong safety alignment. $3.00/$15.00 per MTok. Best for: code generation, long-document analysis, safety-sensitive applications.
  • Google Gemini 1.5 Pro: 1M+ context window, strong multimodal capabilities (text, image, video, audio). $1.25/$5.00 per MTok. Best for: long-context tasks, multimodal applications, Google ecosystem integration.
  • OpenAI GPT-4.5: Enhanced reasoning and factuality over GPT-4o. $10.00/$30.00 per MTok. Best for: tasks requiring exceptional factual accuracy and nuanced reasoning.

Closed-source economy models:

  • GPT-4o mini: 80% of GPT-4o quality at 6% of the cost. $0.15/$0.60 per MTok. Best for: high-volume tasks where good-enough quality suffices.
  • Claude 3.5 Haiku: Fastest Claude model, competitive with GPT-4o mini. $0.80/$4.00 per MTok. Best for: low-latency applications, high-volume classification and extraction.
  • Gemini 2.0 Flash: Extremely cheap, fast, and surprisingly capable. $0.10/$0.40 per MTok. Best for: cost-sensitive workloads, prototyping, high-volume simple tasks.

Open-source foundation models:

  • Meta Llama 3 405B: Competitive with GPT-4o on many benchmarks. Free weights, self-hostable. Best for: teams with ML infrastructure that want to eliminate per-token costs at scale.
  • Mistral Large (123B): Strong multilingual performance, competitive pricing via API ($2.00/$6.00) or self-hostable. Best for: European deployments, multilingual workloads.
  • DeepSeek V3: Strong coding and reasoning capabilities. Aggressively priced via API ($0.27/$1.10) or self-hostable. Best for: code-heavy workloads, cost-sensitive teams.

The key trend is capability compression: each generation of economy models approaches the capability of the previous generation's frontier models at 5–20x lower cost. Teams that re-evaluate model selection quarterly can ride this curve to continuously lower costs without sacrificing quality.

Foundation Model Pricing Economics

Foundation model pricing reflects a complex interplay of training investment, serving costs, competitive dynamics, and market positioning. Understanding these economics helps you predict pricing trends and negotiate better rates:

Training cost amortization: When OpenAI invested $100M+ to train GPT-4, they needed to recoup that investment through inference revenue. If GPT-4 serves 1 trillion tokens per month at an effective average rate of $5/MTok, that is $5M/month in inference revenue. At that rate, training costs are recovered in approximately 20 months. This amortization pressure is why new model pricing starts high and decreases over time — providers lower prices as training costs are recovered and as newer models provide competitive pressure.
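That back-of-the-envelope amortization can be checked in a few lines, using the figures from the paragraph above:

```python
# Rough amortization model: months to recoup training cost from inference revenue.
training_cost = 100_000_000            # $100M+ estimated GPT-4 training cost
tokens_per_month = 1_000_000_000_000   # 1 trillion tokens served per month
avg_rate_per_mtok = 5.0                # effective blended $/MTok

monthly_revenue = tokens_per_month / 1_000_000 * avg_rate_per_mtok
months_to_recoup = training_cost / monthly_revenue
print(f"${monthly_revenue:,.0f}/month -> {months_to_recoup:.0f} months to recoup")
# -> $5,000,000/month -> 20 months to recoup
```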

Serving cost breakdown: The actual cost of serving a foundation model via API includes GPU compute (the dominant cost), GPU memory (VRAM), network bandwidth, orchestration infrastructure, safety monitoring, and engineering overhead. For a typical mid-tier model like GPT-4o, the estimated serving cost per million tokens is roughly $0.50–$1.50 for input and $2.00–$5.00 for output — meaning OpenAI's margins on GPT-4o ($2.50/$10.00 pricing) are approximately 50–60%. Economy models like GPT-4o mini have lower absolute margins but higher percentage margins because the serving cost per token is dramatically lower for smaller models.

Competitive pricing dynamics: The foundation model market is one of the most competitive in technology. OpenAI, Anthropic, Google, and Meta are engaged in an aggressive capability and pricing competition. When one provider drops prices, others follow within weeks. GPT-4o's pricing has been cut multiple times since launch. Claude 3.5 Sonnet launched at roughly the same price as GPT-4o to remain competitive. Gemini Flash consistently undercuts on price to gain market share. This competition benefits API consumers through steadily declining prices — CostHawk tracks these changes and alerts you when a model you use gets a price reduction.

Volume discounts and committed use: Major providers offer volume discounts for committed usage. OpenAI's Reserved Capacity and Anthropic's volume pricing can reduce per-token costs by 20–40% for teams spending $10K+/month. Google Cloud's committed use discounts apply to Vertex AI (Gemini) usage. These discounts stack with optimizations like prompt caching and batch processing. CostHawk's spend tracking helps you identify when your volume qualifies for a discount tier and estimates the potential savings.

The open-source price anchor: Open-source models like Llama 3 create a price floor for the API market. If an API provider charges significantly more than the cost of self-hosting an open-source model of comparable quality, large customers will switch to self-hosting. This competitive pressure from open-source keeps API pricing in check and accelerates the trend toward lower costs. For teams spending $30K+/month on a single model, self-hosting an open-source alternative may be economically viable — CostHawk helps you model this decision with accurate cost data for both API and self-hosted scenarios.

Choosing a Foundation Model for Your Application

Selecting the right foundation model is a multi-dimensional optimization problem. Here is a systematic framework that balances capability, cost, and operational requirements:

Step 1: Define your quality requirements. For each use case in your application, establish a measurable quality threshold. Examples: classification accuracy > 95%, response factual accuracy > 90%, code compilation success rate > 85%, user satisfaction score > 4.2/5. These thresholds determine the minimum model capability you need.

Step 2: Benchmark candidates. Test 3–5 foundation models against your quality requirements using a representative evaluation set of 200–500 examples. Include at least one economy model, one mid-tier model, and one frontier model. Measure quality metrics against your thresholds. Many teams are surprised to find that economy models meet their quality requirements for 60–80% of their use cases.

Step 3: Calculate cost at your volume. For each candidate model that meets your quality threshold, estimate monthly cost: ((daily_requests × avg_input_tokens × input_rate_per_MTok) + (daily_requests × avg_output_tokens × output_rate_per_MTok)) ÷ 1,000,000 × 30, where rates are in dollars per million tokens. Include expected growth — if you expect 3x volume in 6 months, model the cost at that scale too.

Step 4: Evaluate operational factors. Beyond capability and cost, consider: rate limits (can the provider handle your peak traffic?), latency (does time-to-first-token meet your UX requirements?), reliability (what is the provider's uptime track record?), context window (does your use case require long context?), multimodal support (do you need image/audio input?), fine-tuning availability (will you need to customize the model?), and data privacy (does the provider's data handling policy meet your compliance requirements?).

Step 5: Implement model routing. For most production applications, the optimal strategy is not a single foundation model but a routing architecture that directs each request to the cheapest model that meets its quality requirements. Simple classification requests go to Gemini Flash. Customer support responses go to GPT-4o mini. Complex analysis goes to Claude 3.5 Sonnet. CostHawk's per-endpoint analytics provide the data you need to optimize routing rules continuously.
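A static version of this routing can be sketched in a few lines. The task labels and model identifiers below are illustrative, not provider-official IDs:

```python
# Minimal static model routing: map each request type to the cheapest
# model that meets its quality bar (tiers follow the examples in the text).
ROUTES = {
    "classification":   "gemini-2.0-flash",
    "support_reply":    "gpt-4o-mini",
    "complex_analysis": "claude-3-5-sonnet",
}

def select_model(task_type: str) -> str:
    # Fall back to the economy default when the task type is unrecognized.
    return ROUTES.get(task_type, "gpt-4o-mini")

print(select_model("classification"))    # gemini-2.0-flash
print(select_model("unknown_task"))      # gpt-4o-mini
```

In production the routing table typically lives in configuration so rules can be tuned from per-endpoint cost data without a deploy.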

Step 6: Re-evaluate quarterly. The foundation model landscape changes every 3–6 months. New models launch, existing models get price cuts, and quality improvements in economy models make previously impossible routing decisions viable. Set a quarterly calendar reminder to re-benchmark your model choices against new alternatives. CostHawk's model comparison features make this re-evaluation process straightforward by showing how your current spend maps to alternative models.

Fine-Tuning vs Prompting Foundation Models

Foundation models can be adapted to specific tasks through two primary approaches — prompting and fine-tuning — each with distinct cost profiles:

Prompting (zero-shot and few-shot): You provide instructions and examples in the prompt, and the foundation model adapts its behavior accordingly. This is the default approach for most API consumers and has zero upfront cost — you pay only the per-token inference cost. The tradeoff is that instructions and examples consume input tokens on every request, adding to per-request costs. A detailed system prompt with 5 few-shot examples might add 2,000 tokens to every request, costing $500/day at 100K requests on GPT-4o ($2.50/MTok × 2,000 tokens × 100K requests ÷ 1M tokens).

Fine-tuning: You train a modified version of the foundation model on task-specific data, embedding the instructions and patterns directly into the model weights. Fine-tuning has an upfront cost (training on your data) but produces a model that performs the task without needing lengthy system prompts or few-shot examples, reducing per-request token counts and costs. OpenAI charges $25/MTok for GPT-4o mini fine-tuning and then charges 2x the base inference rate for fine-tuned model inference ($0.30/$1.20 vs $0.15/$0.60). Anthropic offers fine-tuning for Claude models on a custom basis.

Cost comparison framework:

| Factor | Prompting | Fine-Tuning |
|---|---|---|
| Upfront cost | $0 | $500–$10,000+ |
| Per-request input tokens | Higher (includes instructions) | Lower (instructions embedded) |
| Per-token rate | Standard rate | 1.5–2x standard rate |
| Time to deploy | Minutes | Hours to days |
| Flexibility to change | Instant (edit prompt) | Requires retraining |
| Quality ceiling | Limited by prompt length | Higher for specialized tasks |

When fine-tuning saves money: Fine-tuning is cost-effective when (a) your system prompt is very long (2,000+ tokens), (b) your request volume is very high (100K+/day), and (c) your task is well-defined and stable. In that scenario, fine-tuning eliminates 2,000 tokens × 100K requests = 200M tokens/day of system prompt cost, saving $500/day on GPT-4o — which quickly recoups a $5,000 fine-tuning cost. For low-volume or rapidly evolving tasks, prompting is almost always more cost-effective because you avoid the upfront training cost and maintain the flexibility to iterate on instructions instantly.
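The break-even arithmetic follows the text's simplified example (which, like the text, ignores the fine-tuned model's higher per-token inference rate):

```python
# Fine-tuning break-even: days until saved system-prompt tokens repay the
# upfront tuning cost. Figures match the worked example in the text.
prompt_tokens_saved = 2_000     # system prompt tokens no longer sent per request
requests_per_day = 100_000
input_rate_per_mtok = 2.50      # GPT-4o input rate, $/MTok
finetune_cost = 5_000           # upfront fine-tuning cost

daily_savings = prompt_tokens_saved * requests_per_day / 1_000_000 * input_rate_per_mtok
breakeven_days = finetune_cost / daily_savings
print(f"${daily_savings:,.0f}/day saved -> break-even in {breakeven_days:.0f} days")
# -> $500/day saved -> break-even in 10 days
```

A fuller model would also add the 1.5–2x rate premium on the fine-tuned model's remaining tokens, which lengthens the break-even period.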

Distillation — the hybrid approach: An increasingly popular strategy is to use a powerful foundation model (Claude 3.5 Sonnet) to generate high-quality training data, then fine-tune a cheaper model (GPT-4o mini) on that data. This "distillation" approach can produce a fine-tuned economy model that approaches the quality of the frontier model at a fraction of the ongoing cost. CostHawk's model comparison analytics help you evaluate whether distillation would reduce costs for your specific workload by comparing the quality and cost of different models on your actual request patterns.

Foundation Models and CostHawk

CostHawk is purpose-built to help teams navigate the foundation model landscape and optimize their model selection decisions with data:

Multi-model cost tracking: Most production applications use multiple foundation models across different endpoints, features, or routing paths. CostHawk aggregates costs across all models and providers into a unified dashboard, showing total spend, spend by model, spend by provider, and spend by endpoint. This unified view is essential because cost optimization often involves shifting traffic between models rather than optimizing a single model's usage.

Model comparison analytics: CostHawk tracks cost-per-query and quality metrics by model, enabling direct comparison. If you are routing 80% of requests to GPT-4o and 20% to GPT-4o mini, CostHawk shows the per-query cost for each model and helps you identify requests currently going to GPT-4o that could be served by GPT-4o mini without quality loss. These model migration opportunities are often the largest single cost optimization available.

New model evaluation support: When a new foundation model launches (which happens every few months), CostHawk's historical data helps you evaluate the potential impact. If Anthropic launches Claude 4.0 at $2.00/$10.00 per MTok with improved quality, CostHawk can estimate your savings from switching by applying the new rates to your historical usage patterns. This eliminates guesswork and accelerates model migration decisions.

Price change tracking: Foundation model prices change frequently — providers regularly cut prices for existing models and launch new models at different price points. CostHawk maintains a current pricing database for all major models and automatically applies the correct rates to your usage data. When a price change occurs, CostHawk shows the impact on your historical and projected spend, and alerts you if a model you use has become significantly cheaper than your current spending rate (which can happen if you are on an older pricing tier).

Foundation model ROI analysis: For teams evaluating whether AI API spend is delivering business value, CostHawk's per-feature and per-endpoint cost breakdowns provide the data needed for ROI calculations. If your AI-powered search feature costs $3,000/month in foundation model inference and drives $50,000/month in attributable revenue, the ROI is clear. If another feature costs $8,000/month and drives minimal measurable value, that is an optimization target — either improve the feature's business impact, reduce its model cost through routing, or reconsider whether it justifies the foundation model spend.

Migration planning: When you decide to switch foundation models (from GPT-4o to Claude 3.5 Sonnet, or from a closed-source model to self-hosted Llama 3), CostHawk provides migration impact analysis: estimated cost change, token count differences (different tokenizers produce different token counts for the same content), and projected savings timeline. This data supports informed, low-risk migration decisions.

FAQ

Frequently Asked Questions

What is the difference between a foundation model and a fine-tuned model?
A foundation model is the base, general-purpose model trained by the provider on broad data — GPT-4o, Claude 3.5 Sonnet, and Llama 3 are all foundation models. A fine-tuned model is a derivative of a foundation model that has been additionally trained on task-specific data to improve performance on a particular use case. Fine-tuning modifies the model's weights to embed domain knowledge and behavioral patterns that would otherwise require lengthy prompt instructions. The cost distinction is important: foundation models are accessed at standard API rates, while fine-tuned models typically cost 1.5–2x more per token for inference but may reduce total request costs by eliminating the need for long system prompts and few-shot examples. Fine-tuning also requires an upfront training cost ($500–$10,000+ depending on dataset size and model). For most teams, prompting the foundation model directly is more cost-effective; fine-tuning becomes worthwhile at very high volumes (100K+ requests/day) with stable, well-defined tasks.
How do I decide between GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro?
The decision should be driven by your specific use case requirements, tested empirically rather than assumed. GPT-4o ($2.50/$10.00 per MTok) is the most broadly capable general-purpose model with the largest ecosystem of tools and integrations. Claude 3.5 Sonnet ($3.00/$15.00) excels at coding, long-document analysis (200K context), and instruction following, with strong safety alignment. Gemini 1.5 Pro ($1.25/$5.00) offers a 1M+ token context window and strong multimodal capabilities at a significantly lower price point. For cost-sensitive workloads, Gemini is often 50–60% cheaper than GPT-4o and Claude at similar quality levels. The practical approach: benchmark all three on 200–500 representative inputs from your production workload, measure quality against your thresholds, and select the cheapest model that passes. Many teams discover that for their specific tasks, the quality difference between models is smaller than expected while the cost difference is larger than expected. CostHawk's model comparison features help you quantify these tradeoffs with production data.
Are open-source foundation models really free?
Open-source foundation models like Llama 3 and Mistral have free model weights — you can download and use them without licensing fees or per-token charges. However, running them is far from free. You need GPU infrastructure to serve the model: a 70B parameter model requires 4x A100 80GB GPUs ($6–$8/hour on cloud providers, or $4,300–$5,800/month). You also need model serving infrastructure (vLLM, TGI, or TensorRT-LLM), monitoring, auto-scaling, and engineering expertise to maintain it. The total cost of self-hosting typically breaks down to: GPU compute (70%), storage and networking (10%), engineering time (15%), and tooling (5%). For small-to-medium workloads (under $10K/month in API costs), self-hosting is almost always more expensive than using API providers because you cannot achieve the same GPU utilization efficiency. The crossover point typically occurs at $15K–$30K/month in API spend, depending on the model and your engineering team's capabilities. CostHawk can model both scenarios to help you make this decision with real numbers.
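Using the cost shares quoted above (GPU compute at roughly 70% of the total), a rough all-in self-hosting estimate can be derived from the GPU bill alone; this is a sketch under those assumptions, not a quote:

```python
# Rough self-hosting total: gross up the GPU bill by its share of total cost.
gpu_monthly_low, gpu_monthly_high = 4_300, 5_800  # 4x A100 80GB, cloud rates
gpu_share = 0.70                                  # GPU compute share of total

total_low = gpu_monthly_low / gpu_share
total_high = gpu_monthly_high / gpu_share
print(f"Estimated self-host total: ${total_low:,.0f}-${total_high:,.0f}/month")
```

Comparing that range against your current API bill gives a first indication of whether you are near the crossover point.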
How often do foundation model prices change?
Foundation model prices change frequently — roughly every 2–4 months across the major providers. The dominant trend is downward: prices have fallen 80–90% since GPT-4's March 2023 launch. Price changes come in several forms: explicit price cuts on existing models (OpenAI has cut GPT-4o pricing twice), new models that offer better cost-performance ratios (GPT-4o mini replaced GPT-3.5 Turbo at similar cost with much better quality), and new pricing tiers (batch processing at 50% discount, prompt caching at 50–90% discount). The pace of change means that a model selection decision made 6 months ago may no longer be optimal. Teams that re-evaluate quarterly and stay informed about pricing changes can save 20–40% annually compared to teams that lock in a model and forget about it. CostHawk automatically tracks pricing changes across all major providers and alerts you when models you use get price reductions or when newer, cheaper alternatives become available.
What is model distillation and how does it reduce costs?
Model distillation is the process of training a smaller, cheaper model to replicate the outputs of a larger, more expensive model on a specific task. You use the expensive foundation model (the "teacher") to generate high-quality outputs for a large set of inputs, then fine-tune a cheaper model (the "student") on those input-output pairs. The result is a student model that approaches the teacher's quality on that specific task at a fraction of the inference cost. For example, you might distill Claude 3.5 Sonnet's customer support responses into a GPT-4o mini fine-tune. If Sonnet achieves a 4.5/5 quality score and the distilled GPT-4o mini achieves 4.3/5, you have preserved 96% of the quality while cutting the output-token rate by roughly 92% ($15.00/MTok for Sonnet vs $1.20/MTok for the fine-tuned mini). The upfront cost is the teacher model inference for generating training data plus the fine-tuning cost, typically $2,000–$10,000 total. This investment pays for itself within weeks for high-volume workloads. Distillation works best for well-defined, stable tasks where the output format is consistent.
Can I use multiple foundation models in the same application?
Yes, and this is strongly recommended for cost optimization. A multi-model architecture (often called model routing) directs each request to the cheapest foundation model that meets its quality requirements. A typical production setup might use Gemini 2.0 Flash for intent classification ($0.10/MTok input), GPT-4o mini for simple response generation ($0.15/$0.60), and Claude 3.5 Sonnet for complex analysis and coding tasks ($3.00/$15.00). The routing decision can be static (hardcoded by endpoint), dynamic (a lightweight classifier selects the model per request), or cascading (start with the cheapest model and escalate to a more expensive one if quality checks fail). Multi-model architectures typically reduce costs by 40–70% compared to using a single mid-tier model for everything. The implementation complexity is modest — most require only a model selection function and separate API client configurations. CostHawk's per-model and per-endpoint analytics provide the data needed to design and optimize routing rules.
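A cascading setup, the third pattern mentioned above, can be sketched as follows; `call_model` and `quality_check` are hypothetical placeholders for your API client and evaluation logic, not real library functions:

```python
# Cascading routing sketch: try the cheapest model first and escalate
# only when a quality check fails.
CASCADE = ["gemini-2.0-flash", "gpt-4o-mini", "claude-3-5-sonnet"]

def call_model(model: str, prompt: str) -> str:
    # Placeholder: in production this wraps the provider's API client.
    return f"[{model}] response to: {prompt}"

def quality_check(response: str) -> bool:
    # Placeholder: e.g. schema validation, a confidence score, or a grader model.
    return len(response) > 0

def answer(prompt: str) -> str:
    response = ""
    for model in CASCADE:
        response = call_model(model, prompt)
        if quality_check(response):
            return response
    return response  # fall through with the frontier model's best effort

print(answer("Classify this support ticket"))
```

The cascade trades extra latency on escalated requests for a lower blended cost, since most traffic stops at the cheapest tier.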
What happens when a foundation model I depend on gets deprecated?
Model deprecation is a real operational risk. OpenAI has deprecated GPT-3.5 Turbo variants, and all providers eventually retire older models as newer ones offer better performance at lower costs. Deprecation timelines vary: OpenAI typically provides 6–12 months notice, Anthropic has not yet deprecated major models but will eventually, and Google has shorter cycles for experimental models. To mitigate deprecation risk: (1) Abstract your model selection behind a configuration layer so switching models requires only a config change, not code changes. (2) Maintain evaluation benchmarks that can quickly validate a replacement model. (3) Avoid depending on model-specific behaviors (particular response formats, specific reasoning patterns) that may not transfer to a successor. (4) Monitor provider announcements through CostHawk's pricing and model tracking, which alerts you when models you use are scheduled for deprecation. When a deprecation is announced, use your evaluation benchmarks to test successor models and migrate before the deadline. The migration usually results in better quality at lower cost because successor models are almost always improvements.
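Mitigation (1) can be as small as reading the model ID from configuration rather than hardcoding it; the `SUPPORT_MODEL` environment variable name here is illustrative:

```python
# Keep model choice behind configuration so a deprecation migration is a
# config change, not a code change.
import os

DEFAULT_MODEL = "gpt-4o-mini"  # fallback when no override is configured

def get_model() -> str:
    return os.environ.get("SUPPORT_MODEL", DEFAULT_MODEL)

print(get_model())
```

When a successor model passes your evaluation benchmarks, migration becomes a one-line environment change that can be rolled back just as quickly.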
How does CostHawk help with foundation model selection?
CostHawk provides four key capabilities for foundation model selection. First, spend-by-model analytics show exactly how much you spend on each foundation model, broken down by endpoint, feature, and time period, revealing where model optimization would have the largest impact. Second, cost-per-query comparison lets you see the average cost per request for each model across your workloads, making it easy to identify endpoints where an expensive model is being used for tasks a cheaper model could handle. Third, pricing intelligence maintains current pricing for all major foundation models and alerts you when prices change or when a new model offers a better cost-performance ratio for your specific usage patterns. Fourth, migration modeling lets you simulate the cost impact of switching models before making any changes — select a target model, and CostHawk applies its rates to your historical usage to show projected savings. Teams using CostHawk for model selection typically identify 30–50% savings opportunities in their first audit by discovering that significant traffic is routed to expensive models unnecessarily.

Related Terms

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.

Read more

Fine-Tuning

The process of further training a pre-trained large language model on a custom dataset to specialize it for a specific task, domain, or output style. Fine-tuning incurs upfront training costs (billed per training token) but can reduce ongoing inference costs by enabling a smaller, cheaper model to match the performance of a larger, more expensive one — making it both a quality tool and a cost optimization strategy.

Read more

Inference

The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.

Read more

Token Pricing

The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.

Read more

Model Routing

Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.

Read more

Multi-Modal Model

An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.

Read more

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.