Glossary · Infrastructure · Updated 2026-03-16

GPU Instance

Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.

Definition

What is a GPU Instance?

A GPU instance is a virtual machine provisioned in a public cloud (AWS, Google Cloud, Azure, Lambda Labs, CoreWeave, or others) that includes one or more Graphics Processing Units (GPUs) alongside standard CPU, memory, and storage. GPUs are essential for LLM workloads because their massively parallel architecture — thousands of cores optimized for matrix multiplication — processes the tensor operations underlying neural network inference and training orders of magnitude faster than CPUs. When you run LLM inference on a GPU instance, you deploy a model (open-weight models like Llama 3, Mistral, Qwen, or DeepSeek) directly onto the GPU hardware and serve requests yourself, rather than calling a third-party API like OpenAI or Anthropic. This self-hosted approach means you pay a fixed hourly rate for the GPU hardware regardless of how many tokens you process, creating a fundamentally different cost model than the per-token pricing of managed APIs. At high request volumes, GPU instances become significantly cheaper per token than API pricing. At low volumes, the fixed hourly cost makes them more expensive. Understanding this breakeven dynamic is critical for optimizing AI infrastructure costs.

Impact

Why It Matters for AI Costs

GPU instance pricing represents one of the most consequential infrastructure decisions for teams running LLM workloads at scale. The financial difference between self-hosting on GPU instances and consuming managed APIs can be enormous — hundreds of thousands of dollars per year for high-volume workloads.

Consider a concrete example. Your application processes 10 million tokens per hour using a model equivalent to Llama 3 70B. Here are the two cost paths:

Path A: Managed API (e.g., a hosted Llama 3 70B endpoint):

  • Input: $0.60 per 1M tokens × 7M input tokens/hr = $4.20/hr
  • Output: $2.40 per 1M tokens × 3M output tokens/hr = $7.20/hr
  • Total: $11.40/hr → $8,208/month (24/7 operation)

Path B: Self-hosted on a single A100 80GB GPU instance:

  • Instance cost: ~$2.00/hr on-demand (A100 80GB class; see the pricing table below for provider-specific rates)
  • Can serve 10M+ tokens/hr with optimized inference (vLLM, TGI)
  • Total: $2.00/hr → $1,440/month (24/7 operation)

That is a 5.7x cost reduction — $6,768 saved per month from a single workload. With reserved instances or spot pricing, the GPU cost drops further: a 1-year reserved A100 runs approximately $1.20/hr ($864/month), yielding a 9.5x reduction.
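
A minimal sketch of this arithmetic in Python, using the example prices above (real rates vary by provider, region, and commitment level):

    HOURS_PER_MONTH = 720  # 30-day month, running 24/7

    def api_monthly_cost(tokens_per_hour, input_frac, in_price_per_mtok, out_price_per_mtok):
        # Cost of a managed API at a steady hourly token volume.
        hourly = (tokens_per_hour * input_frac * in_price_per_mtok
                  + tokens_per_hour * (1 - input_frac) * out_price_per_mtok) / 1e6
        return hourly * HOURS_PER_MONTH

    def gpu_monthly_cost(hourly_rate, num_gpus=1):
        # Fixed cost of self-hosted GPU instances running around the clock.
        return hourly_rate * num_gpus * HOURS_PER_MONTH

    api = api_monthly_cost(10_000_000, 0.70, 0.60, 2.40)  # $8,208
    gpu = gpu_monthly_cost(2.00)                          # $1,440
    print(f"API ${api:,.0f}/mo vs GPU ${gpu:,.0f}/mo -> {api / gpu:.1f}x")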

But GPU instances are not universally cheaper. They carry fixed costs, operational complexity, and utilization risk:

  • Fixed cost regardless of utilization: A GPU instance costs the same whether it processes 10 million tokens or zero. If your workload is bursty (high volume for 4 hours, idle for 20), you pay for 24 hours but only use 4 — making the effective per-token cost much higher.
  • Operational overhead: You manage model deployment, inference optimization, scaling, monitoring, security patches, and failure recovery. This requires engineering time that has its own cost.
  • Capital risk: Reserved instances require upfront commitment. If your workload shrinks, you are locked into paying for capacity you do not need.

CostHawk helps teams navigate this decision by tracking both API-based and GPU-based costs in a unified dashboard, enabling accurate breakeven analysis and identifying workloads where self-hosting would deliver the most savings.

What is a GPU Instance?

A GPU instance is a virtual machine in the cloud that includes specialized graphics processing hardware alongside standard compute resources. Unlike CPU-only instances (like AWS's m-series or c-series), GPU instances are designed for workloads that benefit from massive parallelism: machine learning training and inference, scientific simulation, video rendering, and other compute-intensive tasks.

The GPU hardware in cloud instances comes from two primary manufacturers:

  • NVIDIA: Dominates the LLM landscape with GPUs specifically designed for AI workloads. Key models include the A100 (introduced 2020, still widely used), H100 (introduced 2022, the mainstay for training and high-throughput inference), L40S (introduced 2023, optimized for inference), and the B200 (Blackwell, the newest generation). NVIDIA GPUs use CUDA, a proprietary parallel computing framework, which is deeply integrated into all major ML frameworks (PyTorch, TensorFlow, JAX).
  • AMD: Growing presence with the MI250X and MI300X GPUs, which offer competitive performance at lower prices. AMD GPUs use ROCm, an open-source alternative to CUDA. Library support is improving but not yet at CUDA parity for all workloads.

Key specifications that determine an instance's LLM capability:

  • GPU memory (VRAM): The most critical spec for LLM inference. A model must fit in GPU memory to run. A 70B parameter model in FP16 requires approximately 140 GB of VRAM. An A100 80GB can run it with quantization (reducing to INT8 or INT4), or two A100s can run it in full precision using tensor parallelism.
  • Memory bandwidth: Determines how fast the GPU can read model weights during inference. The H100 offers 3.35 TB/s vs. the A100's 2.0 TB/s — a 67% improvement that translates directly to faster token generation (inference is memory-bandwidth-bound for most LLM workloads).
  • Compute (TFLOPS): Determines training speed and prefill throughput. The H100 delivers 990 TFLOPS (FP16) vs. the A100's 312 TFLOPS.
  • Interconnect: For multi-GPU instances, the interconnect bandwidth between GPUs determines how efficiently models can be split across GPUs. NVLink provides 900 GB/s between H100s, essential for tensor parallelism on large models.

Cloud providers offer GPU instances in various configurations, from single-GPU instances for lightweight inference to 8-GPU instances for training and high-throughput serving. The choice of instance type directly determines your cost, throughput, and which models you can run.
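
As a rough rule of thumb, a model's weights occupy about one gigabyte per billion parameters for each byte of precision. A quick sketch (the GPU pairings in the comment are illustrative, not exhaustive):

    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    def weights_vram_gb(params_billions, precision="fp16"):
        # Weights-only footprint; KV cache and activations need extra headroom on top.
        return params_billions * BYTES_PER_PARAM[precision]

    for precision in ("fp16", "int8", "int4"):
        print(f"70B @ {precision}: ~{weights_vram_gb(70, precision):.0f} GB")
    # fp16 ~140 GB (two A100 80GBs), int8 ~70 GB (one A100 80GB), int4 ~35 GB (one L40S 48GB)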

GPU Instance Pricing

GPU instance pricing varies significantly across cloud providers, GPU types, and commitment levels. Understanding the pricing landscape is essential for cost optimization. Here are the current on-demand rates for the most commonly used GPU instances for LLM workloads (as of March 2026):

GPU | VRAM | AWS ($/hr) | GCP ($/hr) | Azure ($/hr) | Lambda Labs ($/hr) | CoreWeave ($/hr)
--- | --- | --- | --- | --- | --- | ---
A100 40GB | 40 GB | $3.67 (p4d, per GPU) | $2.93 (a2-highgpu) | $3.67 (NC A100 v4) | $1.10 | $1.25
A100 80GB | 80 GB | $4.10 (p4de, per GPU) | $3.67 (a2-ultragpu) | $3.67 (NC A100 v4) | $1.29 | $1.45
H100 80GB | 80 GB | $8.17 (p5.48xlarge, per GPU) | $8.28 (a3-highgpu) | $7.35 (NC H100 v5) | $2.49 | $2.06
L40S | 48 GB | $2.74 (g6e.xlarge) | $2.34 (g2-standard) | $2.72 (NC L40S v3) | $0.99 | $1.14
A10G | 24 GB | $1.21 (g5.xlarge) | n/a | n/a | $0.60 | $0.65
T4 | 16 GB | $0.53 (g4dn.xlarge) | $0.35 (n1-standard + T4) | $0.53 (NC T4 v3) | n/a | n/a

Key observations:

  • Specialized GPU clouds are 50–70% cheaper than hyperscalers. Lambda Labs and CoreWeave offer the same NVIDIA hardware at significantly lower prices because they focus exclusively on GPU workloads and operate with lower overhead than AWS/GCP/Azure.
  • Reserved instances cut costs by 30–60%. AWS 1-year reserved H100 instances cost approximately $5.20/hr (36% savings). 3-year reserved pricing offers even deeper discounts but requires long-term commitment.
  • Spot/preemptible instances offer 60–80% discounts but can be interrupted with short notice. Suitable for fault-tolerant training jobs and batch inference, but not for latency-sensitive production serving.
  • H100 vs. A100 cost-performance: The H100 costs roughly 2x per hour but delivers 2.5–3x the inference throughput for LLM workloads, making it the better value per token for high-throughput applications.
  • L40S is the sweet spot for inference: At $1.00–$2.75/hr with 48 GB VRAM, the L40S can run quantized 70B models and delivers excellent cost-per-token for inference workloads that do not need H100-class compute.

CostHawk tracks GPU instance costs alongside API costs, enabling unified cost reporting across self-hosted and API-based workloads. Configure your GPU instances as a data source in CostHawk to see total AI infrastructure cost in a single dashboard.

GPU vs API Pricing

The fundamental question for any LLM workload is: should you use a managed API or self-host on GPU instances? The answer depends on your volume, consistency, and operational capacity. Here is the breakeven analysis framework:

Breakeven formula:

Breakeven tokens/hour = (GPU instance $/hr) / (API cost per token)

Example with A100 80GB ($1.29/hr on Lambda Labs) vs. hosted Llama 3 70B API ($0.60/$2.40 per MTok):

Assuming 70% input / 30% output ratio:
Blended API cost = 0.70 × $0.60 + 0.30 × $2.40 = $1.14 per MTok

Breakeven = $1.29/hr ÷ ($1.14 / 1,000,000 tokens)
         = 1,131,579 tokens/hour
         ≈ 1.1M tokens/hour

If your workload exceeds 1.1 million tokens per hour (roughly 18,000 tokens per minute), self-hosting on an A100 is cheaper than the managed API. Below that volume, the API is cheaper because you are not fully utilizing the GPU.
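
The same calculation in a few lines of Python, using the example prices above:

    def blended_api_price_per_mtok(input_frac, in_price, out_price):
        # Blend input/output per-MTok prices by the traffic mix.
        return input_frac * in_price + (1 - input_frac) * out_price

    def breakeven_tokens_per_hour(gpu_hourly_cost, price_per_mtok):
        # Volume above which a fully utilized, always-on GPU beats the API on cost.
        return gpu_hourly_cost / (price_per_mtok / 1e6)

    blended = blended_api_price_per_mtok(0.70, 0.60, 2.40)                  # $1.14 per MTok
    print(f"{breakeven_tokens_per_hour(1.29, blended):,.0f} tokens/hour")   # ~1,131,579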

Detailed breakeven comparison:

Scenario | Monthly Volume | API Cost/Month | GPU Cost/Month (A100 80GB) | Savings | Winner
--- | --- | --- | --- | --- | ---
Low volume | 50M tokens | $57 | $929 (1 GPU 24/7) | -$872 | API
Medium volume | 500M tokens | $570 | $929 (1 GPU 24/7) | -$359 | API
Breakeven | 815M tokens | $929 | $929 (1 GPU 24/7) | $0 | Tie
High volume | 5B tokens | $5,700 | $929 (1 GPU 24/7) | $4,771 | GPU
Very high volume | 50B tokens | $57,000 | $2,787 (3 GPUs 24/7) | $54,213 | GPU

Critical factors beyond raw token cost:

  • Utilization rate: GPUs are only cheaper when utilized. A GPU running at 30% utilization has 3.3x the effective cost per token compared to 100% utilization. If your workload is bursty, factor in idle time; the sketch after this list folds utilization and engineering cost into the breakeven.
  • Engineering cost: Self-hosting requires deploying inference servers (vLLM, TGI, TensorRT-LLM), managing infrastructure, implementing auto-scaling, monitoring GPU health, and handling failures. Budget 0.5–1.0 full-time engineer equivalent for a production GPU deployment.
  • Model quality: Open-weight models (Llama, Mistral, Qwen) are competitive with but not identical to proprietary models (GPT-4o, Claude). For tasks where the quality difference matters, the API may be the only option regardless of cost.
  • Latency: Optimized GPU inference with batching can match or beat API latency for throughput-oriented workloads, but API providers invest heavily in latency optimization for individual requests.
  • Compliance: Self-hosting keeps data on your infrastructure, which may be required for HIPAA, SOC 2, or data residency compliance. Some organizations must self-host regardless of cost.
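
The first two factors can be folded directly into the breakeven calculation. A sketch using the same example prices, where the utilization level, the 0.25 FTE, and the $200K salary are illustrative assumptions:

    HOURS_PER_MONTH = 720

    def adjusted_gpu_monthly_cost(hourly_rate, utilization, eng_fte=0.0, fte_annual_cost=200_000):
        # Idle capacity still gets billed, so divide hardware cost by utilization;
        # add a share of an engineer's time for operating the deployment.
        hardware = hourly_rate * HOURS_PER_MONTH / utilization
        engineering = eng_fte * fte_annual_cost / 12
        return hardware + engineering

    def breakeven_tokens_per_month(gpu_monthly_cost, price_per_mtok):
        return gpu_monthly_cost / (price_per_mtok / 1e6)

    cost = adjusted_gpu_monthly_cost(1.29, utilization=0.5, eng_fte=0.25)        # ~$6,024/month
    print(f"{breakeven_tokens_per_month(cost, 1.14) / 1e9:.1f}B tokens/month")   # ~5.3B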

GPU Instance Types for Inference

Choosing the right GPU instance for LLM inference requires matching the model size, throughput requirements, and budget to the available hardware. Here is a right-sizing guide:

Model size to GPU mapping:

Model Size | FP16 VRAM Required | INT8 VRAM Required | INT4 (GPTQ/AWQ) VRAM | Recommended GPU
--- | --- | --- | --- | ---
1–3B params | 2–6 GB | 1–3 GB | 0.5–1.5 GB | T4 (16 GB) — cheapest option
7–8B params | 14–16 GB | 7–8 GB | 4–5 GB | A10G (24 GB) or T4 with quantization
13B params | 26 GB | 13 GB | 7–8 GB | A10G (24 GB) with INT8 or L40S (48 GB)
30–34B params | 60–68 GB | 30–34 GB | 17–20 GB | L40S (48 GB) with INT8 or A100 40GB with INT4
70B params | 140 GB | 70 GB | 35–40 GB | A100 80GB with INT8 or L40S with INT4 or 2× A100 40GB
110–120B params | 220–240 GB | 110–120 GB | 60–70 GB | 2× H100 80GB or 3× A100 80GB
400B+ params (MoE) | 200+ GB (active params) | 100+ GB | 55+ GB | 4× H100 80GB or 8× A100 80GB

Right-sizing principles:

  • Start with quantization. INT8 quantization (via bitsandbytes, GPTQ, or AWQ) halves the VRAM requirement with negligible quality loss for most tasks. INT4 quantization quarters the requirement with a small quality trade-off that is acceptable for many use cases. Always try quantized models before scaling to larger GPUs.
  • Match throughput to demand. A single L40S running a quantized Llama 3 70B can serve approximately 2,000–4,000 tokens per second (combined input/output with batching). If your application needs 10,000 tokens per second, you need 3–5 GPUs behind a load balancer. Benchmark with realistic traffic patterns, not synthetic benchmarks.
  • Consider VRAM headroom. GPU memory must hold the model weights, the KV cache (which grows with sequence length and batch size), and activation memory. A model that barely fits in VRAM with 10% headroom will fail under high concurrency because the KV cache will exceed available memory. Target 20–30% VRAM headroom for production deployments; a rough KV-cache sizing sketch follows this list.
  • Use inference-optimized frameworks. vLLM, TensorRT-LLM, and Text Generation Inference (TGI) implement PagedAttention, continuous batching, and tensor parallelism that can increase throughput 3–10x compared to naive PyTorch inference. The framework choice often matters more than the GPU choice for cost efficiency.
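
As noted in the headroom principle above, a back-of-envelope KV-cache estimate shows why headroom disappears quickly under concurrency. The default dimensions below approximate a Llama-3-70B-class model (80 layers, 8 KV heads via GQA, head dimension 128) with an FP16 cache; substitute your model's actual configuration:

    def kv_cache_gb(seq_len, batch_size, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_val=2):
        # K and V vectors per token, per layer, at the given precision.
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
        return per_token * seq_len * batch_size / 1024**3

    print(f"{kv_cache_gb(4096, 1):.1f} GB")    # ~1.2 GB for a single 4k-token request
    print(f"{kv_cache_gb(4096, 32):.1f} GB")   # ~40 GB at batch size 32: why headroom matters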

CostHawk's model pricing database includes per-token cost estimates for self-hosted configurations, making it easy to compare your actual GPU utilization costs against API alternatives and identify the optimal deployment strategy for each model and workload.

GPU Cost Optimization

GPU instances are expensive — even the cheapest options run hundreds of dollars per month when running 24/7. Here are the most impactful strategies for reducing GPU infrastructure costs:

1. Spot/preemptible instances (60–80% savings): Cloud providers offer unused GPU capacity at steep discounts. AWS spot A100s run approximately $1.20–$1.60/hr vs. $4.10 on-demand — a 60–70% discount. The catch: spot instances can be reclaimed with 2 minutes' notice. For inference workloads, this is manageable if you run multiple instances behind a load balancer. When one instance is reclaimed, traffic shifts to the remaining instances while a replacement spins up. For batch inference (processing a queue of requests where latency is not critical), spot instances are ideal because interrupted work can simply be retried.
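
A minimal sketch of an interruption watcher, assuming the AWS instance metadata endpoint is reachable without an IMDSv2 token (production clusters typically run a node termination handler rather than hand-rolled polling):

    import time
    import requests

    SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def wait_for_interruption(poll_seconds=5):
        # Poll the instance metadata service; the endpoint returns 404 until a reclaim is scheduled.
        while True:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                return resp.json()          # e.g. {"action": "terminate", "time": "..."}
            time.sleep(poll_seconds)

    notice = wait_for_interruption()
    # Roughly two minutes to drain: stop accepting work, finish in-flight requests, deregister from the LB.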

2. Reserved instances and committed use discounts (30–60% savings): If your workload is stable and predictable, 1-year or 3-year commitments offer significant savings. AWS 1-year reserved H100 pricing is approximately $5.20/hr vs. $8.17 on-demand (36% savings). 3-year pricing is approximately $3.50/hr (57% savings). Google Cloud offers Committed Use Discounts (CUDs) with similar savings tiers. The risk: if your workload decreases, you are still paying for the reserved capacity. Mitigate this by reserving only your baseline capacity and using on-demand or spot for peak demand.
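
A simple decision rule falls out of the prices above: a reservation pays off only if you expect to use the capacity more than the ratio of reserved to on-demand rates. A sketch:

    def reservation_breakeven_utilization(reserved_hourly, on_demand_hourly):
        # A reservation bills every hour; on-demand bills only the hours you actually run.
        return reserved_hourly / on_demand_hourly

    print(f"{reservation_breakeven_utilization(5.20, 8.17):.0%}")  # ~64%: reserve only baseline capacity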

3. Auto-scaling based on demand (20–50% savings): Do not run GPUs 24/7 if your workload does not require it. Implement auto-scaling that adds GPU instances when queue depth or latency exceeds thresholds and removes them when demand drops. Many workloads have daily patterns — high volume during business hours, low volume overnight. For example, running 4 GPUs during a 4-hour peak and 1 GPU for the remaining 20 hours uses 36 GPU-hours per day instead of 96, a 62.5% reduction; a longer business-hours peak still saves roughly 40–50%. Kubernetes with KEDA (Kubernetes Event-Driven Autoscaler) or cloud-native autoscaling groups are the standard tools for this.
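
The savings follow directly from the schedule. A sketch using the example above (the 4-hour peak is an assumption; plug in your own traffic pattern):

    def daily_gpu_hours(schedule):
        # schedule: list of (hours, gpu_count) segments covering one day.
        return sum(hours * gpus for hours, gpus in schedule)

    flat = daily_gpu_hours([(24, 4)])               # 96 GPU-hours for a static 4-GPU fleet
    scaled = daily_gpu_hours([(4, 4), (20, 1)])     # 36 GPU-hours: 4-hour peak, 1 GPU off-peak
    print(f"{1 - scaled / flat:.1%} fewer GPU-hours")  # 62.5%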

4. Right-sizing GPU selection (20–40% savings): Not every workload needs an H100. An L40S at $1.00/hr can run quantized 70B models with adequate throughput for many applications. Benchmark your specific model and traffic pattern on multiple GPU types to find the best cost-per-token. Often, an older GPU with lower hourly cost delivers better economics than a newer, faster GPU for inference-bound workloads (as opposed to training-bound workloads where raw compute matters more).

5. Batching and throughput optimization (2–5x efficiency improvement): Inference serving frameworks like vLLM use continuous batching to process multiple requests simultaneously on a single GPU. Without batching, a GPU processes one request at a time, leaving compute resources idle during memory-bound phases. With batching, the GPU processes 16–64 requests concurrently, increasing token throughput 3–5x without additional hardware. This is not a direct cost savings — you pay the same for the GPU — but it dramatically reduces the cost per token and the number of GPUs needed to serve your workload.
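
The cost-per-token effect is easy to quantify: divide the hourly GPU price by hourly token throughput. The throughput figures below are illustrative assumptions, not benchmarks:

    def cost_per_mtok(gpu_hourly_cost, tokens_per_second):
        # Hourly GPU price divided by hourly token throughput, in millions of tokens.
        return gpu_hourly_cost / (tokens_per_second * 3600 / 1e6)

    print(f"${cost_per_mtok(1.29, 500):.2f}/MTok")    # ~$0.72 without batching
    print(f"${cost_per_mtok(1.29, 2500):.2f}/MTok")   # ~$0.14 with continuous batching (5x throughput)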

6. Multi-tenancy (variable savings): If multiple teams or workloads share GPU infrastructure, consolidate them onto shared GPU pools. A single GPU running three small models is more cost-effective than three separate single-GPU instances, because each model individually would not fully utilize its GPU. NVIDIA's MIG (Multi-Instance GPU) technology on A100 and H100 GPUs partitions a single GPU into multiple isolated instances at the hardware level, making multi-tenancy safe.

When to Use GPUs vs APIs

The GPU-vs-API decision is not binary — most organizations benefit from a hybrid approach that uses each option where it is most cost-effective. Here is a decision framework:

Use managed APIs when:

  • Volume is low to moderate (under 500M tokens/month per model). At this volume, GPU instances cost more than APIs because you are paying for idle GPU time.
  • Workload is unpredictable. If your traffic pattern is highly variable (viral product launches, seasonal spikes, experimentation phases), APIs absorb the variability without you provisioning for peak demand.
  • You need frontier model quality. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are only available via API. If your use case requires these specific models, APIs are the only option.
  • Engineering bandwidth is limited. If your team does not have experience operating GPU infrastructure, the operational overhead of self-hosting may exceed the cost savings. APIs let you focus on your product instead of your infrastructure.
  • You are in the experimentation phase. When evaluating different models, prompts, and architectures, APIs provide instant access without deployment overhead. Self-host after you have stabilized on a model and architecture.

Use GPU instances when:

  • Volume is high and stable (over 1B tokens/month per model). At this volume, the per-token savings from self-hosting accumulate to thousands of dollars per month.
  • You can use open-weight models. Llama 3, Mistral, Qwen, and DeepSeek models are competitive with proprietary models for many tasks. If an open model meets your quality bar, self-hosting unlocks significant savings.
  • Data privacy requires it. Self-hosting keeps all data on your infrastructure. No prompts or responses leave your network. This may be a compliance requirement for healthcare (HIPAA), finance (SOX), or government (FedRAMP) workloads.
  • Latency requirements are strict. With dedicated GPUs and optimized inference, you control latency directly. API latency includes network round-trip time and shared infrastructure contention that you cannot control.
  • You need custom models. Fine-tuned or custom-trained models must be self-hosted (or hosted on a platform that supports custom model deployment, like Replicate or Modal).

The hybrid approach: Many mature AI organizations run a hybrid infrastructure. High-volume, stable workloads run on self-hosted GPU instances. Bursty, experimental, or frontier-model workloads use APIs. A model routing layer directs each request to the optimal backend based on the model requested, current load, and cost optimization rules. CostHawk provides unified cost visibility across both environments, tracking API spend and GPU infrastructure costs in a single dashboard. This enables continuous optimization: as a workload grows and stabilizes, CostHawk's analytics identify when it crosses the breakeven threshold for self-hosting, prompting the migration.
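
A model routing layer can be as simple as a lookup plus a saturation check. The sketch below is purely illustrative; the model names, queue threshold, and backend labels are hypothetical and not CostHawk's API:

    SELF_HOSTED_MODELS = {"llama-3-70b", "mistral-7b"}   # models deployed on your own GPU pool

    def route_request(model: str, gpu_queue_depth: int, max_queue: int = 64) -> str:
        # Keep open-weight traffic on owned GPUs unless the pool is saturated; spill to the API.
        if model in SELF_HOSTED_MODELS and gpu_queue_depth < max_queue:
            return "gpu-pool"
        return "managed-api"

    print(route_request("llama-3-70b", gpu_queue_depth=12))   # gpu-pool
    print(route_request("gpt-4o", gpu_queue_depth=12))        # managed-api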

FAQ

Frequently Asked Questions

How much VRAM do I need to run a 70B parameter model?
A 70B parameter model in FP16 precision requires approximately 140 GB of VRAM (each parameter stored as a 2-byte floating point number: 70 billion × 2 bytes = 140 GB). In practice, you also need headroom for the KV cache (which grows with sequence length and batch size) and activation memory, so plan for 150–160 GB total. This exceeds a single A100 or H100's 80 GB of memory, so you have three options: (1) Quantization — reduce precision to INT8 (70 GB) or INT4/GPTQ/AWQ (35–40 GB), fitting the model on a single A100 80GB or L40S 48GB with INT4. Quantization to INT8 has negligible quality impact for most tasks; INT4 has a small but measurable impact. (2) Tensor parallelism — split the model across multiple GPUs connected via NVLink. Two A100 80GBs can run a 70B FP16 model with adequate VRAM headroom. (3) Offloading — store part of the model in CPU RAM and swap layers to GPU as needed. This works but dramatically reduces throughput and is not suitable for production serving. For production inference, the recommended configuration is a single A100 80GB with INT8 quantization (or an L40S 48GB with INT4), providing the best balance of cost, throughput, and quality.
What is the difference between training and inference GPU requirements?
Training and inference place fundamentally different demands on GPU hardware. Training is compute-bound: it processes massive datasets through repeated forward and backward passes, computing gradients and updating model weights. Training requires maximum floating-point throughput (TFLOPS), large memory for optimizer state (which can be 3–4x the model size for Adam optimizer), and high-bandwidth multi-GPU interconnects for distributed training. An H100 with its 990 TFLOPS (FP16) and NVLink interconnect is ideal for training. Inference is memory-bandwidth-bound: the GPU needs to read model weights from VRAM once per generated token, and the speed of this read determines token generation speed. The A100's 2.0 TB/s and H100's 3.35 TB/s memory bandwidth are the bottleneck metrics, not raw compute. Inference also requires much less total memory because there are no optimizer states or gradients — just model weights plus KV cache. This means inference can use smaller, cheaper GPUs, and quantization is particularly effective because it reduces the memory bandwidth required per token. For cost optimization, this distinction matters: do not use expensive H100s for inference unless you need their bandwidth. An L40S at one-third the price may deliver sufficient inference throughput for your workload.
What are spot instances and should I use them for LLM inference?
Spot instances (AWS), preemptible VMs (GCP), and spot VMs (Azure) are unused cloud capacity sold at deep discounts — typically 60–80% below on-demand prices. The trade-off is that the cloud provider can reclaim these instances with short notice (2 minutes on AWS, 30 seconds on GCP) when demand for the capacity increases. For LLM inference, spot instances are viable with the right architecture. Run multiple spot instances behind a load balancer (like a Kubernetes service with multiple replicas). If one instance is reclaimed, traffic automatically shifts to the remaining instances while a replacement spins up. Use a warm pool or fast-scaling configuration so replacements launch within 2–3 minutes. For batch inference (processing a queue rather than serving real-time requests), spot instances are ideal — interrupted work items return to the queue and are processed by another instance. For latency-sensitive production serving, use spot instances for a portion of your capacity (to save costs) while maintaining a baseline of on-demand or reserved instances that cannot be interrupted. A common pattern is 30% reserved (baseline), 50% spot (cost savings), and 20% on-demand (buffer for spot reclamation). This delivers approximately 40–50% savings compared to all on-demand while maintaining service reliability.
How does vLLM improve GPU cost efficiency?
vLLM is an open-source inference serving framework that implements several optimizations that dramatically improve GPU utilization and throughput, reducing the effective cost per token. The key innovation is PagedAttention, which manages the key-value (KV) cache using a paging system similar to virtual memory in operating systems. Traditional inference servers allocate a contiguous block of GPU memory for each request's KV cache, sized for the maximum possible sequence length. This wastes memory because most requests do not use the full context window. PagedAttention allocates KV cache in small, non-contiguous pages, eliminating this waste and allowing 3–5x more requests to be processed concurrently. Continuous batching is the second key optimization: instead of waiting for all requests in a batch to complete before starting new ones, vLLM starts new requests as soon as a slot opens, keeping the GPU constantly busy. Together, these optimizations typically increase throughput 3–8x compared to naive Hugging Face transformers inference. For cost, this means a single GPU serves 3–8x more tokens per hour, reducing your effective cost per token by the same factor. Using vLLM on an A100 80GB at $1.29/hr, you might achieve throughput of 4,000 output tokens per second, yielding an effective cost of $0.09 per million output tokens — far below any managed API price.
Which cloud provider has the cheapest GPU instances?
For raw hourly price, specialized GPU cloud providers like Lambda Labs and CoreWeave consistently offer the lowest rates — typically 50–70% cheaper than AWS, GCP, or Azure for equivalent NVIDIA hardware. Lambda Labs offers A100 80GB at $1.29/hr and H100 at $2.49/hr. CoreWeave offers H100 at approximately $2.06/hr. By comparison, AWS charges $4.10 (A100) and $8.17 (H100) per GPU on-demand. However, cheapest hourly rate is not always the best metric. Consider: (1) Availability — Lambda Labs and CoreWeave have smaller capacity pools. During high-demand periods, you may not be able to provision instances. AWS and GCP have massive capacity. (2) Ecosystem — hyperscalers offer richer surrounding services: managed Kubernetes, object storage, networking, monitoring, IAM, and compliance certifications. Building a production deployment on a specialized GPU cloud may require more custom infrastructure work. (3) Spot pricing — AWS spot A100 prices can drop to $1.20–$1.60/hr, closing the gap with specialized providers. (4) Commitments — AWS and GCP offer reserved pricing that approaches specialized provider rates for 1–3 year terms. For pure inference workloads, specialized GPU clouds are usually the best value. For complex deployments requiring integration with broader cloud services, hyperscalers may justify the premium.
How do I calculate the breakeven point between GPU and API?
The breakeven point is the token volume at which GPU self-hosting costs the same as API usage. Calculate it with this formula: Breakeven tokens/month = (GPU monthly cost) / (API cost per token). To compute API cost per token, use the blended rate: blended_rate = (input_fraction × input_price + output_fraction × output_price). For a typical workload with 70% input tokens and 30% output tokens using a hosted Llama 3 70B endpoint at $0.60/$2.40 per MTok: blended rate = 0.70 × $0.60 + 0.30 × $2.40 = $1.14 per MTok. GPU monthly cost for an A100 80GB on Lambda Labs: $1.29/hr × 730 hours = $941.70/month. Breakeven: $941.70 / ($1.14/1M) = 826 million tokens/month. Below 826M tokens, the API is cheaper. Above, the GPU wins. Important adjustments: (1) Multiply GPU cost by 1/utilization if your GPU is not 100% utilized. At 50% utilization, the effective breakeven doubles to 1.65 billion tokens. (2) Add engineering cost — if managing GPU infrastructure requires 0.25 FTE at $200K/year, add $4,167/month to the GPU cost. (3) Account for quantization — if self-hosting uses INT8 quantization while the API uses FP16, factor in any quality difference. CostHawk's savings analysis tool automates this calculation using your actual usage data and current pricing.
Can I run multiple models on a single GPU instance?
Yes, running multiple models on a single GPU is a common cost optimization strategy, especially for smaller models. There are several approaches: (1) Sequential loading — load different models at different times based on demand. Works for batch processing where you can process all requests for Model A, then switch to Model B. Not suitable for real-time serving of multiple models simultaneously. (2) Concurrent serving — run multiple smaller models simultaneously in the same GPU memory. An A100 80GB can easily host a 7B model (14 GB FP16) and a 13B model (26 GB FP16) with 40 GB remaining for KV caches. In practice this typically means running a separate inference server process (for example, one vLLM instance per model) on the shared GPU. (3) NVIDIA MIG (Multi-Instance GPU) — partition an A100 or H100 into up to 7 isolated GPU instances, each with dedicated memory and compute. This provides hardware-level isolation, ensuring one model's traffic does not affect another's performance. A single A100 80GB can be split into partitions of 10–40 GB each. (4) LoRA serving — if your models share the same base and differ only in fine-tuned LoRA adapters, serve the base model once and swap LoRA adapters per request. vLLM and TGI both support multi-LoRA serving, enabling hundreds of fine-tuned variants from a single GPU. This approach maximizes GPU utilization and is particularly cost-effective for teams running multiple specialized models.
How does CostHawk track GPU instance costs alongside API costs?
CostHawk provides a unified cost dashboard that aggregates both API-based and GPU infrastructure costs. For API costs, CostHawk tracks per-request token usage and computes costs using its pricing database of 200+ models. For GPU costs, CostHawk integrates with cloud provider billing APIs (AWS Cost Explorer, GCP Billing Export, Azure Cost Management) to ingest GPU instance spend, tagged by purpose. You tag your GPU instances with CostHawk labels (project, team, model, environment), and CostHawk associates those costs with the corresponding workload. The dashboard shows total AI cost as the sum of API spend plus GPU infrastructure spend, broken down by any dimension: project, team, model, or time period. This unified view is critical for accurate total cost of ownership (TCO) analysis. A common finding is that teams underestimate their GPU costs because they do not account for associated compute (CPU instances for preprocessing), storage (model weights, logging), and networking (data transfer between services). CostHawk captures all tagged infrastructure costs, not just the GPU instances themselves. The savings analysis tool uses this unified data to recommend optimization opportunities — for example, identifying an API workload that exceeds the GPU breakeven point, or a GPU deployment with low utilization that should be consolidated or migrated back to APIs.

Related Terms

Inference

The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.

Serverless Inference

Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include AWS Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.

Provisioned Throughput

Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.

Large Language Model (LLM)

A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.

Total Cost of Ownership (TCO) for AI

The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.

Cost Per Token

The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.