GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.
Why It Matters for AI Costs
GPU instance pricing represents one of the most consequential infrastructure decisions for teams running LLM workloads at scale. The financial difference between self-hosting on GPU instances and consuming managed APIs can be enormous — hundreds of thousands of dollars per year for high-volume workloads.
Consider a concrete example. Your application processes 10 million tokens per hour using a model equivalent to Llama 3 70B. Here are the two cost paths:
Path A: Managed API (e.g., a hosted Llama 3 70B endpoint):
- Input: $0.60 per 1M tokens × 7M input tokens/hr = $4.20/hr
- Output: $2.40 per 1M tokens × 3M output tokens/hr = $7.20/hr
- Total: $11.40/hr → $8,208/month (24/7 operation)
Path B: Self-hosted on a single A100 80GB GPU instance:
- Instance cost: ~$2.00/hr on-demand (AWS p4d equivalent)
- Can serve 10M+ tokens/hr with optimized inference (vLLM, TGI)
- Total: $2.00/hr → $1,440/month (24/7 operation)
That is a 5.7x cost reduction — $6,768 saved per month from a single workload. With reserved instances or spot pricing, the GPU cost drops further: a 1-year reserved A100 runs approximately $1.20/hr ($864/month), yielding a 9.5x reduction.
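As a sanity check, the two cost paths can be reproduced in a few lines. This is a sketch using the illustrative prices from the example above, not live quotes:

```python
# Monthly-cost sketch for the Path A / Path B example above.
# Prices are the illustrative figures from the text, not live quotes.
HOURS_PER_MONTH = 720  # 24/7 on a 30-day month, as in the example

def api_monthly_cost(input_mtok_hr, output_mtok_hr,
                     input_price=0.60, output_price=2.40):
    """Managed-API cost per month at a steady hourly token volume."""
    hourly = input_mtok_hr * input_price + output_mtok_hr * output_price
    return hourly * HOURS_PER_MONTH

def gpu_monthly_cost(hourly_rate, num_gpus=1):
    """GPU-instance cost per month, running 24/7."""
    return hourly_rate * num_gpus * HOURS_PER_MONTH

api = api_monthly_cost(7, 3)  # 10M tokens/hr at a 70/30 input/output split
gpu = gpu_monthly_cost(2.00)  # one A100 80GB on-demand
print(f"API ${api:,.0f}/mo vs GPU ${gpu:,.0f}/mo -> {api / gpu:.1f}x")
```

Swapping in the reserved rate (`gpu_monthly_cost(1.20)`) reproduces the 9.5x figure.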
But GPU instances are not universally cheaper. They carry fixed costs, operational complexity, and utilization risk:
- Fixed cost regardless of utilization: A GPU instance costs the same whether it processes 10 million tokens or zero. If your workload is bursty (high volume for 4 hours, idle for 20), you pay for 24 hours but only use 4 — making the effective per-token cost much higher.
- Operational overhead: You manage model deployment, inference optimization, scaling, monitoring, security patches, and failure recovery. This requires engineering time that has its own cost.
- Capital risk: Reserved instances require upfront commitment. If your workload shrinks, you are locked into paying for capacity you do not need.
CostHawk helps teams navigate this decision by tracking both API-based and GPU-based costs in a unified dashboard, enabling accurate breakeven analysis and identifying workloads where self-hosting would deliver the most savings.
What is a GPU Instance?
A GPU instance is a virtual machine in the cloud that includes specialized graphics processing hardware alongside standard compute resources. Unlike CPU-only instances (like AWS's m-series or c-series), GPU instances are designed for workloads that benefit from massive parallelism: machine learning training and inference, scientific simulation, video rendering, and other compute-intensive tasks.
The GPU hardware in cloud instances comes from two primary manufacturers:
- NVIDIA: Dominates the LLM landscape with GPUs specifically designed for AI workloads. Key models include the A100 (introduced 2020, still widely used), H100 (introduced 2022, the workhorse generation for training and high-throughput inference), L40S (introduced 2023, optimized for inference), and the Blackwell-generation B200. NVIDIA GPUs use CUDA, a proprietary parallel computing framework, which is deeply integrated into all major ML frameworks (PyTorch, TensorFlow, JAX).
- AMD: Growing presence with the MI250X and MI300X GPUs, which offer competitive performance at lower prices. AMD GPUs use ROCm, an open-source alternative to CUDA. Library support is improving but not yet at CUDA parity for all workloads.
Key specifications that determine an instance's LLM capability:
- GPU memory (VRAM): The most critical spec for LLM inference. A model must fit in GPU memory to run. A 70B parameter model in FP16 requires approximately 140 GB of VRAM. An A100 80GB can run it with quantization (reducing to INT8 or INT4), or two A100s can run it in full precision using tensor parallelism.
- Memory bandwidth: Determines how fast the GPU can read model weights during inference. The H100 offers 3.35 TB/s vs. the A100's 2.0 TB/s — a 67% improvement that translates directly to faster token generation (inference is memory-bandwidth-bound for most LLM workloads).
- Compute (TFLOPS): Determines training speed and prefill throughput. The H100 delivers 990 TFLOPS (FP16) vs. the A100's 312 TFLOPS.
- Interconnect: For multi-GPU instances, the interconnect bandwidth between GPUs determines how efficiently models can be split across GPUs. NVLink provides 900 GB/s between H100s, essential for tensor parallelism on large models.
Cloud providers offer GPU instances in various configurations, from single-GPU instances for lightweight inference to 8-GPU instances for training and high-throughput serving. The choice of instance type directly determines your cost, throughput, and which models you can run.
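The VRAM spec above reduces to a simple rule of thumb: weights alone need roughly (parameter count × bytes per parameter). A minimal sketch, ignoring KV-cache and activation memory, which production sizing must also account for:

```python
# Rule-of-thumb VRAM for model weights at a given precision.
# Treat the result as a lower bound: the KV cache and activations
# need additional headroom on top of this.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(params_billions, precision="fp16"):
    """Approximate GB of VRAM to hold the model weights alone."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weights_vram_gb(70))           # 140.0 -> needs 2x A100 80GB in FP16
print(weights_vram_gb(70, "int8"))   # 70.0  -> fits one A100 80GB
print(weights_vram_gb(70, "int4"))   # 35.0  -> fits one L40S 48GB
```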
GPU Instance Pricing
GPU instance pricing varies significantly across cloud providers, GPU types, and commitment levels. Understanding the pricing landscape is essential for cost optimization. Here are the current on-demand rates for the most commonly used GPU instances for LLM workloads (as of March 2026):
| GPU | VRAM | AWS ($/hr) | GCP ($/hr) | Azure ($/hr) | Lambda Labs ($/hr) | CoreWeave ($/hr) |
|---|---|---|---|---|---|---|
| A100 40GB | 40 GB | $3.67 (p4d, per GPU) | $2.93 (a2-highgpu) | $3.67 (NC A100 v4) | $1.10 | $1.25 |
| A100 80GB | 80 GB | $4.10 (p4de, per GPU) | $3.67 (a2-ultragpu) | $3.67 (NC A100 v4) | $1.29 | $1.45 |
| H100 80GB | 80 GB | $8.17 (p5.48xlarge, per GPU) | $8.28 (a3-highgpu) | $7.35 (NC H100 v5) | $2.49 | $2.06 |
| L40S | 48 GB | $2.74 (g6e.xlarge) | $2.34 (g2-standard) | $2.72 (NC L40S v3) | $0.99 | $1.14 |
| A10G | 24 GB | $1.21 (g5.xlarge) | — | — | $0.60 | $0.65 |
| T4 | 16 GB | $0.53 (g4dn.xlarge) | $0.35 (n1-standard + T4) | $0.53 (NC T4 v3) | — | — |
Key observations:
- Specialized GPU clouds are 50–70% cheaper than hyperscalers. Lambda Labs and CoreWeave offer the same NVIDIA hardware at significantly lower prices because they focus exclusively on GPU workloads and operate with lower overhead than AWS/GCP/Azure.
- Reserved instances cut costs by 30–60%. AWS 1-year reserved H100 instances cost approximately $5.20/hr (36% savings). 3-year reserved pricing offers even deeper discounts but requires long-term commitment.
- Spot/preemptible instances offer 60–80% discounts but can be interrupted with short notice. Suitable for fault-tolerant training jobs and batch inference, but not for latency-sensitive production serving.
- H100 vs. A100 cost-performance: The H100 costs roughly 2x per hour but delivers 2.5–3x the inference throughput for LLM workloads, making it the better value per token for high-throughput applications.
- L40S is the sweet spot for inference: At $1.00–$2.75/hr with 48 GB VRAM, the L40S can run quantized 70B models and delivers excellent cost-per-token for inference workloads that do not need H100-class compute.
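The H100-vs-A100 value claim becomes concrete as a cost-per-token calculation. The throughput figures below are assumptions derived from the rough multipliers quoted above, not benchmarks; substitute your own measured numbers:

```python
# Cost per million tokens for two GPUs at assumed throughputs.
# Rates are Lambda Labs on-demand from the table; the 10M tokens/hr
# A100 capacity and 2.5x H100 multiplier are illustrative assumptions.
def cost_per_mtok(hourly_rate, tokens_per_hour):
    """Dollars per 1M tokens at full utilization."""
    return hourly_rate / (tokens_per_hour / 1_000_000)

a100 = cost_per_mtok(1.29, 10_000_000)        # assumed A100 capacity
h100 = cost_per_mtok(2.49, 10_000_000 * 2.5)  # ~2.5x A100 throughput
print(f"A100 ${a100:.3f}/MTok vs H100 ${h100:.3f}/MTok")
```

Under these assumptions the H100 is roughly 2x the hourly price but still cheaper per token, which is the point of the bullet above.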
CostHawk tracks GPU instance costs alongside API costs, enabling unified cost reporting across self-hosted and API-based workloads. Configure your GPU instances as a data source in CostHawk to see total AI infrastructure cost in a single dashboard.
GPU vs API Pricing
The fundamental question for any LLM workload is: should you use a managed API or self-host on GPU instances? The answer depends on your volume, consistency, and operational capacity. Here is the breakeven analysis framework:
Breakeven formula:
Breakeven tokens/hour = (GPU instance $/hr) / (API cost per token)
Example with A100 80GB ($1.29/hr on Lambda Labs) vs. hosted Llama 3 70B API ($0.60/$2.40 per MTok):
Assuming 70% input / 30% output ratio:
Blended API cost = 0.70 × $0.60 + 0.30 × $2.40 = $1.14 per MTok
Breakeven = $1.29/hr ÷ ($1.14 / 1,000,000 tokens)
= 1,131,579 tokens/hour
≈ 1.1M tokens/hour

If your workload exceeds 1.1 million tokens per hour (roughly 18,900 tokens per minute), self-hosting on an A100 is cheaper than the managed API. Below that volume, the API is cheaper because you are not fully utilizing the GPU.
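The breakeven formula translates directly to code, using the example prices above:

```python
# Breakeven calculation from the worked example above.
def blended_rate_per_mtok(input_price, output_price, input_frac=0.70):
    """Blended $ per 1M tokens at a given input/output mix."""
    return input_frac * input_price + (1 - input_frac) * output_price

def breakeven_tokens_per_hour(gpu_hourly, blended_per_mtok):
    """Tokens/hr above which the GPU instance beats the API."""
    return gpu_hourly / (blended_per_mtok / 1_000_000)

rate = blended_rate_per_mtok(0.60, 2.40)        # $1.14 per MTok
tokens = breakeven_tokens_per_hour(1.29, rate)  # ~1.13M tokens/hr
print(f"${rate:.2f}/MTok, breakeven at {tokens:,.0f} tokens/hr")
```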
Detailed breakeven comparison:
| Scenario | Monthly Volume | API Cost/Month | GPU Cost/Month (A100 80GB) | Savings | Winner |
|---|---|---|---|---|---|
| Low volume | 50M tokens | $57 | $929 (1 GPU 24/7) | -$872 | API |
| Medium volume | 500M tokens | $570 | $929 (1 GPU 24/7) | -$359 | API |
| Breakeven | 815M tokens | $929 | $929 (1 GPU 24/7) | $0 | Tie |
| High volume | 5B tokens | $5,700 | $929 (1 GPU 24/7) | $4,771 | GPU |
| Very high volume | 50B tokens | $57,000 | $2,787 (3 GPUs 24/7) | $54,213 | GPU |
Critical factors beyond raw token cost:
- Utilization rate: GPUs are only cheaper when utilized. A GPU running at 30% utilization has 3.3x the effective cost per token compared to 100% utilization. If your workload is bursty, factor in idle time.
- Engineering cost: Self-hosting requires deploying inference servers (vLLM, TGI, TensorRT-LLM), managing infrastructure, implementing auto-scaling, monitoring GPU health, and handling failures. Budget 0.5–1.0 full-time engineer equivalent for a production GPU deployment.
- Model quality: Open-weight models (Llama, Mistral, Qwen) are competitive with but not identical to proprietary models (GPT-4o, Claude). For tasks where the quality difference matters, the API may be the only option regardless of cost.
- Latency: Optimized GPU inference with batching can match or beat API latency for throughput-oriented workloads, but API providers invest heavily in latency optimization for individual requests.
- Compliance: Self-hosting keeps data on your infrastructure, which may be required for HIPAA, SOC 2, or data residency compliance. Some organizations must self-host regardless of cost.
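The utilization factor is worth quantifying, since it is the most common way GPU economics go wrong in practice. A sketch of effective cost per million tokens; the 10M tokens/hr capacity figure is an assumption carried over from the earlier example:

```python
# Effective GPU cost per 1M tokens at partial utilization.
# Capacity figure is an illustrative assumption, not a benchmark.
def effective_cost_per_mtok(gpu_hourly, capacity_tokens_hr, utilization):
    """$ per 1M tokens when only `utilization` of capacity is used."""
    tokens_served = capacity_tokens_hr * utilization
    return gpu_hourly / (tokens_served / 1_000_000)

full = effective_cost_per_mtok(1.29, 10_000_000, 1.0)   # $0.129/MTok
part = effective_cost_per_mtok(1.29, 10_000_000, 0.30)  # $0.43/MTok
print(f"100% util ${full:.3f}/MTok, 30% util ${part:.2f}/MTok")
```

At 30% utilization the per-token cost is 3.3x higher, matching the bullet above.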
GPU Instance Types for Inference
Choosing the right GPU instance for LLM inference requires matching the model size, throughput requirements, and budget to the available hardware. Here is a right-sizing guide:
Model size to GPU mapping:
| Model Size | FP16 VRAM Required | INT8 VRAM Required | INT4 (GPTQ/AWQ) VRAM | Recommended GPU |
|---|---|---|---|---|
| 1–3B params | 2–6 GB | 1–3 GB | 0.5–1.5 GB | T4 (16 GB) — cheapest option |
| 7–8B params | 14–16 GB | 7–8 GB | 4–5 GB | A10G (24 GB) or T4 with quantization |
| 13B params | 26 GB | 13 GB | 7–8 GB | A10G (24 GB) with INT8 or L40S (48 GB) |
| 30–34B params | 60–68 GB | 30–34 GB | 17–20 GB | L40S (48 GB) with INT8 or A100 40GB with INT4 |
| 70B params | 140 GB | 70 GB | 35–40 GB | A100 80GB with INT8 or L40S with INT4 or 2× A100 40GB |
| 110–120B params | 220–240 GB | 110–120 GB | 60–70 GB | 2× H100 80GB or 3× A100 80GB |
| 400B+ params (MoE) | 200+ GB (active params) | 100+ GB | 55+ GB | 4× H100 80GB or 8× A100 80GB |
Right-sizing principles:
- Start with quantization. INT8 quantization (via bitsandbytes, GPTQ, or AWQ) halves the VRAM requirement with negligible quality loss for most tasks. INT4 quantization quarters the requirement with a small quality trade-off that is acceptable for many use cases. Always try quantized models before scaling to larger GPUs.
- Match throughput to demand. A single L40S running a quantized Llama 3 70B can serve approximately 2,000–4,000 tokens per second (combined input/output with batching). If your application needs 10,000 tokens per second, you need 3–5 GPUs behind a load balancer. Benchmark with realistic traffic patterns, not synthetic benchmarks.
- Consider VRAM headroom. GPU memory must hold the model weights, the KV cache (which grows with sequence length and batch size), and activation memory. A model that barely fits in VRAM with 10% headroom will fail under high concurrency because the KV cache will exceed available memory. Target 20–30% VRAM headroom for production deployments.
- Use inference-optimized frameworks. vLLM, TensorRT-LLM, and Text Generation Inference (TGI) implement PagedAttention, continuous batching, and tensor parallelism that can increase throughput 3–10x compared to naive PyTorch inference. The framework choice often matters more than the GPU choice for cost efficiency.
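The VRAM-headroom point can be made concrete with a standard KV-cache size estimate. The Llama 3 70B shape used here (80 layers, 8 grouped-query KV heads, head dimension 128) is an assumption based on the published architecture; check your model's config before relying on the numbers:

```python
# KV-cache size for a transformer with grouped-query attention.
# Each token stores a K and a V vector per layer per KV head.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch,
                 bytes_per_elem=2):
    """GiB of KV cache for `batch` sequences of length `seq_len`."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 2**30

# 32 concurrent requests at 4K context on an assumed 70B shape:
print(f"{kv_cache_gib(80, 8, 128, 4096, 32):.1f} GiB")
```

At this shape, 32 concurrent 4K-context requests consume about 40 GiB for the KV cache alone, which is why a model that "barely fits" fails under load.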
CostHawk's model pricing database includes per-token cost estimates for self-hosted configurations, making it easy to compare your actual GPU utilization costs against API alternatives and identify the optimal deployment strategy for each model and workload.
GPU Cost Optimization
GPU instances are expensive — even the cheapest options run hundreds of dollars per month when running 24/7. Here are the most impactful strategies for reducing GPU infrastructure costs:
1. Spot/preemptible instances (60–80% savings): Cloud providers offer unused GPU capacity at steep discounts. AWS spot A100s run approximately $1.20–$1.60/hr vs. $4.10 on-demand — a 60–70% discount. The catch: spot instances can be reclaimed with 2 minutes' notice. For inference workloads, this is manageable if you run multiple instances behind a load balancer. When one instance is reclaimed, traffic shifts to the remaining instances while a replacement spins up. For batch inference (processing a queue of requests where latency is not critical), spot instances are ideal because interrupted work can simply be retried.
2. Reserved instances and committed use discounts (30–60% savings): If your workload is stable and predictable, 1-year or 3-year commitments offer significant savings. AWS 1-year reserved H100 pricing is approximately $5.20/hr vs. $8.17 on-demand (36% savings). 3-year pricing is approximately $3.50/hr (57% savings). Google Cloud offers Committed Use Discounts (CUDs) with similar savings tiers. The risk: if your workload decreases, you are still paying for the reserved capacity. Mitigate this by reserving only your baseline capacity and using on-demand or spot for peak demand.
3. Auto-scaling based on demand (20–50% savings): Do not run GPUs 24/7 if your workload does not require it. Implement auto-scaling that adds GPU instances when queue depth or latency exceeds thresholds and removes them when demand drops. Many workloads have daily patterns — high volume during business hours, low volume overnight. Scaling from 4 GPUs during an 8-hour peak to 1 GPU for the remaining 16 hours uses 48 GPU-hours per day instead of 96, a 50% saving. Kubernetes with KEDA (Kubernetes Event-Driven Autoscaler) or cloud-native autoscaling groups are the standard tools for this.
4. Right-sizing GPU selection (20–40% savings): Not every workload needs an H100. An L40S at $1.00/hr can run quantized 70B models with adequate throughput for many applications. Benchmark your specific model and traffic pattern on multiple GPU types to find the best cost-per-token. Often, an older GPU with lower hourly cost delivers better economics than a newer, faster GPU for inference-bound workloads (as opposed to training-bound workloads where raw compute matters more).
5. Batching and throughput optimization (2–5x efficiency improvement): Inference serving frameworks like vLLM use continuous batching to process multiple requests simultaneously on a single GPU. Without batching, a GPU processes one request at a time, leaving compute resources idle during memory-bound phases. With batching, the GPU processes 16–64 requests concurrently, increasing token throughput 3–5x without additional hardware. This is not a direct cost savings — you pay the same for the GPU — but it dramatically reduces the cost per token and the number of GPUs needed to serve your workload.
6. Multi-tenancy (variable savings): If multiple teams or workloads share GPU infrastructure, consolidate them onto shared GPU pools. A single GPU running three small models is more cost-effective than three separate single-GPU instances, because each model individually would not fully utilize its GPU. NVIDIA's MIG (Multi-Instance GPU) technology on A100 and H100 GPUs enables hardware-level partitioning of a single GPU into multiple isolated instances, enabling safe multi-tenancy.
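The autoscaling savings in strategy 3 are easy to estimate for any daily schedule. The 8-hour peak window below is an assumption; savings depend entirely on how long your peak lasts:

```python
# GPU-hours for a daily autoscaling schedule vs. always-on capacity.
def daily_gpu_hours(schedule):
    """schedule: list of (hours, gpu_count) segments summing to 24h."""
    return sum(hours * gpus for hours, gpus in schedule)

always_on = daily_gpu_hours([(24, 4)])          # 4 GPUs, 24/7
scaled = daily_gpu_hours([(8, 4), (16, 1)])     # assumed 8h peak window
savings = 1 - scaled / always_on
print(f"{scaled} vs {always_on} GPU-hours/day, {savings:.0%} saved")
```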
When to Use GPUs vs APIs
The GPU-vs-API decision is not binary — most organizations benefit from a hybrid approach that uses each option where it is most cost-effective. Here is a decision framework:
Use managed APIs when:
- Volume is low to moderate (under 500M tokens/month per model). At this volume, GPU instances cost more than APIs because you are paying for idle GPU time.
- Workload is unpredictable. If your traffic pattern is highly variable (viral product launches, seasonal spikes, experimentation phases), APIs absorb the variability without you provisioning for peak demand.
- You need frontier model quality. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are only available via API. If your use case requires these specific models, APIs are the only option.
- Engineering bandwidth is limited. If your team does not have experience operating GPU infrastructure, the operational overhead of self-hosting may exceed the cost savings. APIs let you focus on your product instead of your infrastructure.
- You are in the experimentation phase. When evaluating different models, prompts, and architectures, APIs provide instant access without deployment overhead. Self-host after you have stabilized on a model and architecture.
Use GPU instances when:
- Volume is high and stable (over 1B tokens/month per model). At this volume, the per-token savings from self-hosting accumulate to thousands of dollars per month.
- You can use open-weight models. Llama 3, Mistral, Qwen, and DeepSeek models are competitive with proprietary models for many tasks. If an open model meets your quality bar, self-hosting unlocks significant savings.
- Data privacy requires it. Self-hosting keeps all data on your infrastructure. No prompts or responses leave your network. This may be a compliance requirement for healthcare (HIPAA), finance (SOX), or government (FedRAMP) workloads.
- Latency requirements are strict. With dedicated GPUs and optimized inference, you control latency directly. API latency includes network round-trip time and shared infrastructure contention that you cannot control.
- You need custom models. Fine-tuned or custom-trained models must be self-hosted (or hosted on a platform that supports custom model deployment, like Replicate or Modal).
The hybrid approach: Many mature AI organizations run a hybrid infrastructure. High-volume, stable workloads run on self-hosted GPU instances. Bursty, experimental, or frontier-model workloads use APIs. A model routing layer directs each request to the optimal backend based on the model requested, current load, and cost optimization rules. CostHawk provides unified cost visibility across both environments, tracking API spend and GPU infrastructure costs in a single dashboard. This enables continuous optimization: as a workload grows and stabilizes, CostHawk's analytics identify when it crosses the breakeven threshold for self-hosting, prompting the migration.
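A routing layer like the one described can start as a single rule. The sketch below is illustrative only — the threshold, model-name prefixes, and backend labels are assumptions, not a real CostHawk API:

```python
# Minimal cost-aware routing rule for a hybrid GPU/API deployment.
# Threshold reuses the ~826M tokens/month breakeven from the FAQ;
# prefixes and backend names are illustrative assumptions.
OPEN_WEIGHT_PREFIXES = ("llama", "mistral", "qwen", "deepseek")

def route(model, monthly_tokens, breakeven_tokens=826_000_000):
    """Send stable, high-volume open-weight workloads to self-hosted GPUs."""
    open_weight = model.startswith(OPEN_WEIGHT_PREFIXES)
    if open_weight and monthly_tokens >= breakeven_tokens:
        return "self-hosted-gpu"
    return "managed-api"  # frontier models or sub-breakeven volume

print(route("llama-3-70b", 5_000_000_000))  # self-hosted-gpu
print(route("gpt-4o", 5_000_000_000))       # managed-api
print(route("llama-3-70b", 50_000_000))     # managed-api
```

A production router would also weigh current load, latency targets, and quality requirements, but the cost rule is the core of it.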
FAQ
Frequently Asked Questions
How much VRAM do I need to run a 70B parameter model?
What is the difference between training and inference GPU requirements?
What are spot instances and should I use them for LLM inference?
How does vLLM improve GPU cost efficiency?
Which cloud provider has the cheapest GPU instances?
How do I calculate the breakeven point between GPU and API?
Breakeven tokens/month = (GPU monthly cost) / (API cost per token). To compute API cost per token, use the blended rate: blended_rate = (input_fraction × input_price + output_fraction × output_price). For a typical workload with 70% input tokens and 30% output tokens using a hosted Llama 3 70B endpoint at $0.60/$2.40 per MTok: blended rate = 0.70 × $0.60 + 0.30 × $2.40 = $1.14 per MTok. GPU monthly cost for an A100 80GB on Lambda Labs: $1.29/hr × 730 hours = $941.70/month. Breakeven: $941.70 / ($1.14/1M) = 826 million tokens/month. Below 826M tokens, the API is cheaper. Above, the GPU wins. Important adjustments: (1) Multiply GPU cost by 1/utilization if your GPU is not 100% utilized. At 50% utilization, the effective breakeven doubles to 1.65 billion tokens. (2) Add engineering cost — if managing GPU infrastructure requires 0.25 FTE at $200K/year, add $4,167/month to the GPU cost. (3) Account for quantization — if self-hosting uses INT8 quantization while the API uses FP16, factor in any quality difference. CostHawk's savings analysis tool automates this calculation using your actual usage data and current pricing.
Can I run multiple models on a single GPU instance?
How does CostHawk track GPU instance costs alongside API costs?
Related Terms
Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Serverless Inference
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include AWS Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Total Cost of Ownership (TCO) for AI
The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Read moreAI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
