Serverless Inference
Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include AWS Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.
Why It Matters for AI Costs
Serverless inference eliminates the two biggest pain points of GPU self-hosting while preserving most of its benefits:
1. Idle cost. A dedicated GPU instance costs the same whether it processes 10 million tokens or zero. If your workload is variable — high during business hours, low overnight, spiking during product launches — you either over-provision (wasting money on idle GPUs) or under-provision (degrading user experience during peaks). Serverless inference scales to zero during idle periods, meaning you pay nothing when there is no traffic. For workloads with utilization below 40%, serverless inference can be cheaper than dedicated GPUs despite a higher per-token rate, simply because you are not paying for idle time.
2. Operational complexity. Running GPU inference in production requires expertise in model deployment (vLLM, TGI, TensorRT-LLM), Kubernetes GPU scheduling, auto-scaling policies, health monitoring, rolling updates, and failure recovery. This operational overhead typically consumes the equivalent of 0.5–1.0 full-time engineers. Serverless platforms abstract all of this — you deploy a model with a configuration file and get an endpoint. The platform handles everything else.
What serverless inference preserves from GPU self-hosting:
- Model choice: Run any open-weight model, including custom fine-tuned models. You are not limited to the models that OpenAI or Anthropic offer.
- Data control: Your prompts and responses are processed on platform infrastructure but not used for training (unlike some free-tier API offerings). Most platforms offer VPC peering and private endpoints for enterprise data requirements.
- Competitive pricing: Open-weight models on serverless platforms typically cost 30–70% less than equivalent proprietary APIs, because you are not paying for model training amortization or the provider's frontier research budget.
The cost comparison at different volume levels tells the story:
| Monthly Volume | Proprietary API (GPT-4o) | Serverless (Llama 3 70B) | Dedicated GPU (A100) | Cheapest Option |
|---|---|---|---|---|
| 50M tokens | $375 | $54 | $929 | Serverless |
| 500M tokens | $3,750 | $540 | $929 | Serverless |
| 2B tokens | $15,000 | $2,160 | $929 | Dedicated GPU |
| 10B tokens | $75,000 | $10,800 | $1,858 (2 GPUs) | Dedicated GPU |
Serverless is the sweet spot for the 50M–2B token range — too much volume for API pricing to be optimal, but not enough to justify dedicated GPU infrastructure. CostHawk tracks serverless inference costs alongside API and GPU costs, providing a complete picture of your AI infrastructure spend and identifying which deployment model is most cost-effective for each workload.
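The breakeven logic behind the table can be sketched as a small calculator. The rates below are illustrative blended figures derived from the table above (a hypothetical $7.50/MTok API rate, $1.08/MTok serverless rate, $929/month per GPU, and an assumed per-GPU monthly capacity of 5B tokens); real numbers vary by provider, model, and input/output mix.

```python
import math

# Illustrative rates from the comparison table above; adjust for your workload.
API_RATE_PER_MTOK = 7.50           # proprietary API, blended input/output
SERVERLESS_RATE_PER_MTOK = 1.08    # open-weight model on a serverless platform
DEDICATED_MONTHLY_USD = 929.0      # one dedicated A100 instance per month
DEDICATED_CAPACITY_TOKENS = 5_000_000_000  # assumed tokens one GPU serves/month

def cheapest_option(monthly_tokens: int) -> tuple[str, float]:
    """Return (cheapest deployment model, its monthly cost in USD)."""
    mtok = monthly_tokens / 1e6
    # Dedicated GPUs are fixed-cost; add instances once volume exceeds capacity.
    gpus = max(1, math.ceil(monthly_tokens / DEDICATED_CAPACITY_TOKENS))
    costs = {
        "api": mtok * API_RATE_PER_MTOK,
        "serverless": mtok * SERVERLESS_RATE_PER_MTOK,
        "dedicated": gpus * DEDICATED_MONTHLY_USD,
    }
    best = min(costs, key=costs.get)
    return best, costs[best]
```

Running this against the table's volume levels reproduces its conclusions: serverless wins at 50M and 500M tokens, dedicated GPUs win at 2B and beyond.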
What is Serverless Inference?
Serverless inference applies the serverless computing paradigm — pioneered by AWS Lambda for general-purpose functions — to machine learning model serving. The core principle is the same: the developer provides the code (or in this case, the model), and the platform handles all infrastructure concerns.
In a serverless inference platform, the lifecycle of a request looks like this:
- Request arrives: Your application sends an inference request (prompt, parameters) to the platform's API endpoint.
- Scheduling: The platform routes the request to a warm GPU that already has the model loaded in memory. If no warm GPUs are available, the platform provisions one (this is the "cold start" scenario).
- Inference: The model processes the request on the GPU, generating tokens. The platform may batch multiple concurrent requests to improve throughput.
- Response: The generated tokens are streamed or returned to your application.
- Scale down: After a configurable idle period (typically 30 seconds to 5 minutes), the platform deallocates the GPU resources. If no more requests arrive, you scale to zero and stop paying.
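The warm/cold lifecycle above can be sketched as a toy state machine. This is purely illustrative: the class, its defaults (a 40-second cold start, a 120-second idle timeout), and its single-replica model are placeholders, not any platform's actual behavior.

```python
class ServerlessEndpoint:
    """Toy model of the serverless request lifecycle (illustrative only)."""

    def __init__(self, cold_start_s: float = 40.0, idle_timeout_s: float = 120.0):
        self.cold_start_s = cold_start_s
        self.idle_timeout_s = idle_timeout_s
        self.warm_until = None  # None means scaled to zero

    def handle(self, now: float, inference_s: float = 1.0) -> float:
        """Return total latency for a request arriving at time `now` (seconds)."""
        # Cold if no replica exists or the idle timeout has elapsed.
        cold = self.warm_until is None or now - self.warm_until > self.idle_timeout_s
        latency = inference_s + (self.cold_start_s if cold else 0.0)
        # Replica stays warm from the moment it finishes serving this request.
        self.warm_until = now + latency
        return latency
```

The first request pays the cold-start penalty, a follow-up within the idle window is served warm, and a request after a long quiet period pays the penalty again.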
The major serverless inference platforms each have distinct characteristics:
- AWS Bedrock: Amazon's managed LLM service. Offers a catalog of proprietary and open-weight models (Claude, Llama, Mistral, Cohere, Titan) on AWS infrastructure. Pricing is per-token with no minimum commitment. Deep integration with AWS services (IAM, VPC, CloudWatch, S3). Best for teams already invested in the AWS ecosystem.
- Google Vertex AI: Google's ML platform. Offers Gemini models natively, plus open models via Model Garden. Per-token pricing for Gemini; per-node-hour for custom model endpoints. Integration with Google Cloud services.
- Replicate: Developer-focused platform for running open-source models. Pay per second of GPU compute. Extensive model catalog with one-click deployment. Popular for experimentation and prototyping.
- Modal: Python-first serverless compute platform. Define your inference function in Python, and Modal handles containerization, GPU provisioning, and scaling. Per-second billing with fast cold starts. Excellent developer experience.
- Together AI: Focuses on open-source LLM inference. Per-token pricing competitive with self-hosting. Offers API-compatible endpoints for Llama, Mistral, Qwen, and other open models. Supports fine-tuned model deployment.
- Fireworks AI: High-performance inference platform. Offers some of the lowest per-token rates for open models, with emphasis on speed (low latency, high throughput). Custom model hosting available.
- Groq: Hardware-differentiated platform using custom LPU (Language Processing Unit) chips instead of NVIDIA GPUs. Offers extremely fast inference at competitive per-token prices. Limited model selection but very fast for supported models.
Each platform makes different trade-offs between price, performance, model selection, and ecosystem integration. CostHawk integrates with all major serverless inference platforms to provide unified cost tracking and comparison.
Serverless vs Dedicated GPU vs API
Choosing between serverless inference, dedicated GPU instances, and managed APIs requires evaluating multiple dimensions beyond just per-token cost. Here is a comprehensive comparison:
| Dimension | Managed API (OpenAI, Anthropic) | Serverless Inference (Replicate, Modal, Together) | Dedicated GPU (Self-managed) |
|---|---|---|---|
| Per-token cost | Highest. Includes model training amortization, research budget, and provider margin. | Medium. Open-model pricing without training costs. Platform margin is lower. | Lowest at high utilization. Fixed hourly cost amortized over tokens served. |
| Idle cost | Zero. Pay only for tokens consumed. | Near-zero. Scale to zero; pay only during active inference. Small cold-start cost. | Full. GPU runs and bills 24/7 regardless of utilization. |
| Model selection | Proprietary frontier models only (GPT-4o, Claude, Gemini). | Open-weight models plus custom fine-tuned. Growing proprietary model availability. | Any model you can run (open-weight, fine-tuned, custom-trained). |
| Latency (warm) | 100–500ms TTFT typical. Optimized by provider. | 150–800ms TTFT. Varies by platform and model size. Generally competitive with APIs. | 50–300ms TTFT. You control optimization. Can be fastest with tuning. |
| Latency (cold) | None. Always warm. | 5–60 seconds depending on model size and platform. Major concern for latency-sensitive apps. | None if running 24/7. Minutes if auto-scaling from zero. |
| Scaling | Automatic, massive scale. Provider handles capacity. | Automatic. Platform scales within its GPU pool. May have capacity limits. | Manual or semi-automatic. You configure and manage auto-scaling. |
| Operational overhead | Near-zero. Just an API call. | Low. Configure model, deploy, consume endpoint. Platform handles ops. | High. Manage deployment, scaling, monitoring, updates, failures. |
| Data privacy | Data processed on provider infrastructure. Usage policies vary by provider. | Data processed on platform infrastructure. Most platforms have data processing agreements. | Full control. Data stays on your infrastructure. |
| Customization | Limited. Some providers offer fine-tuning. No custom architectures. | Moderate. Custom models, LoRA adapters, custom containers on some platforms. | Full. Any model, any framework, any optimization. |
| Best volume range | 0–500M tokens/month (or when frontier quality is required regardless of volume). | 50M–2B tokens/month (sweet spot where idle GPU costs dominate but API markup is unnecessary). | 1B+ tokens/month with stable, predictable demand. |
The optimal strategy evolves as your workload grows. Many teams follow this progression:
- Prototype phase: Use managed APIs for fastest iteration and best model quality.
- Growth phase: Migrate non-frontier workloads to serverless inference for 50–70% cost reduction.
- Scale phase: Move highest-volume, most stable workloads to dedicated GPUs for an additional 40–60% reduction. Keep bursty and low-volume workloads on serverless.
Serverless Inference Pricing Models
Serverless inference platforms use several pricing models, each with distinct economics. Understanding these models is essential for cost prediction and optimization:
1. Per-token pricing: Charged by the number of input and output tokens processed, identical to how OpenAI and Anthropic price their APIs. This is the most transparent model for LLM workloads because cost scales linearly with usage.
| Platform | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Together AI | Llama 3.1 70B | $0.54 | $0.54 |
| Together AI | Llama 3.1 8B | $0.10 | $0.10 |
| Fireworks AI | Llama 3.1 70B | $0.40 | $0.40 |
| Fireworks AI | Llama 3.1 8B | $0.10 | $0.10 |
| Groq | Llama 3.1 70B | $0.59 | $0.79 |
| AWS Bedrock | Llama 3.1 70B | $0.72 | $0.72 |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Google Vertex AI | Gemini 2.0 Flash | $0.10 | $0.40 |
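Per-token cost prediction is a straight multiplication. A minimal sketch, using a few of the illustrative rates from the table above (the rate dictionary and key names are this example's own convention, not any platform's API):

```python
# Illustrative (platform, model) -> (input, output) rates in USD per 1M tokens,
# taken from the pricing table above.
RATES_PER_MTOK = {
    ("together", "llama-3.1-70b"): (0.54, 0.54),
    ("fireworks", "llama-3.1-70b"): (0.40, 0.40),
    ("groq", "llama-3.1-70b"): (0.59, 0.79),
    ("bedrock", "claude-3.5-sonnet"): (3.00, 15.00),
}

def per_token_cost(platform: str, model: str,
                   input_tokens: int, output_tokens: int) -> float:
    """Monthly cost in USD under per-token pricing."""
    rate_in, rate_out = RATES_PER_MTOK[(platform, model)]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out
```

For example, 100M input and 20M output tokens on Groq's Llama 3.1 70B comes to 100 × $0.59 + 20 × $0.79 ≈ $74.80/month.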
2. Per-second GPU billing: Charged by the number of seconds your model runs on GPU hardware. This model is used by Replicate, Modal, and some other platforms. The advantage is full transparency into hardware utilization; the complexity is that cost per token varies depending on model efficiency and batch size.
| Platform | GPU Type | Rate (per second) | Rate (per hour) |
|---|---|---|---|
| Replicate | A40 (48 GB) | $0.000575 | $2.07 |
| Replicate | A100 80GB | $0.001400 | $5.04 |
| Modal | A100 80GB | $0.001092 | $3.93 |
| Modal | H100 | $0.002319 | $8.35 |
3. Hybrid pricing (provisioned + per-token): Some platforms offer a reserved baseline capacity at a lower per-token rate, with on-demand scaling at a higher rate for bursts. AWS Bedrock's Provisioned Throughput is an example: you commit to a fixed throughput level (measured in model units) for a term, getting guaranteed capacity and lower rates. This model suits workloads with a predictable baseline plus occasional spikes.
4. Free tiers and credits: Most platforms offer free tiers or new-user credits. Together AI provides $25 in free credits. Replicate offers free CPU inference and credits for new accounts. Google Vertex AI includes a free tier for Gemini models. These are useful for experimentation but rarely sufficient for production workloads.
Cost prediction tips:
- For per-token platforms, cost prediction is straightforward: multiply your estimated token volume by the per-token rate.
- For per-second platforms, benchmark your model's throughput (tokens/second) on the target GPU, then divide your token volume by throughput to estimate GPU-seconds needed.
- Always account for cold starts in per-second billing — loading a 70B model takes 30–60 seconds of GPU time, which is billed even though no inference is happening.
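The per-second prediction recipe above (benchmark throughput, divide volume by it, add cold-start time) can be written out directly. The throughput and cold-start figures passed in are workload-specific measurements you would benchmark yourself:

```python
def per_second_cost(total_tokens: int, tokens_per_second: float,
                    gpu_rate_per_s: float,
                    cold_starts: int = 0, cold_start_s: float = 0.0) -> float:
    """Estimate monthly cost on a per-second-billed platform.

    tokens_per_second: your benchmarked model throughput on the target GPU.
    Cold-start GPU time is billed even though no tokens are generated, so it
    is added to the inference time.
    """
    inference_seconds = total_tokens / tokens_per_second
    cold_seconds = cold_starts * cold_start_s
    return (inference_seconds + cold_seconds) * gpu_rate_per_s
```

At an assumed 1,000 tokens/second on an A100 at $0.0014/second, 1B tokens costs about $1,400 of inference time, and 100 cold starts of 60 seconds each add roughly $8.40 on top.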
Cold Start and Latency Considerations
Cold start is the single biggest operational challenge in serverless inference. When a request arrives and no GPU has the model loaded in memory, the platform must provision a GPU, load the model from storage, and initialize the inference engine before processing the first token. This delay — the cold start latency — directly impacts user experience and cost.
Cold start durations by model size:
| Model Size | Typical Cold Start | Platform Variation |
|---|---|---|
| 1–3B params | 3–8 seconds | Replicate: ~5s, Modal: ~4s, Together: N/A (always warm) |
| 7–8B params | 5–15 seconds | Replicate: ~10s, Modal: ~8s |
| 13B params | 10–25 seconds | Replicate: ~18s, Modal: ~12s |
| 30–34B params | 20–45 seconds | Replicate: ~35s, Modal: ~25s |
| 70B params | 30–90 seconds | Replicate: ~60s, Modal: ~40s |
Cold start latency comes from three phases:
- GPU provisioning: Allocating a GPU from the platform's pool. Typically 2–10 seconds depending on availability.
- Model loading: Transferring model weights from storage (typically NVMe or network storage) to GPU VRAM. This is the dominant component — a 70B FP16 model is 140 GB, and even at 10 GB/s transfer speed, loading takes 14 seconds. Quantized models load proportionally faster.
- Engine initialization: Starting the inference engine (vLLM, TGI, etc.), compiling CUDA kernels, and warming up caches. Typically 2–5 seconds.
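The three phases above can be combined into a rough lower-bound estimate. The defaults here (5 s provisioning, 10 GB/s transfer, 3 s engine init) are optimistic assumptions for a fast NVMe-backed platform; real cold starts are often longer, as the table above shows.

```python
def cold_start_estimate(params_billions: float, bytes_per_param: float = 2,
                        transfer_gb_per_s: float = 10.0,
                        provision_s: float = 5.0, init_s: float = 3.0) -> float:
    """Rough cold-start estimate (seconds) from the three phases above.

    bytes_per_param = 2 for FP16; quantized models use 1 (INT8) or 0.5 (4-bit)
    and load proportionally faster. Defaults assume a fast-path platform.
    """
    weight_gb = params_billions * bytes_per_param  # 70B FP16 -> 140 GB
    load_s = weight_gb / transfer_gb_per_s         # model loading dominates
    return provision_s + load_s + init_s
```

With these assumptions a 70B FP16 model needs at least 22 seconds (14 of them pure weight loading), while a 4-bit 7B model is under 10 seconds, which is why quantized and smaller variants cold-start so much faster.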
Mitigation strategies:
- Keep-warm configuration: Most platforms let you configure a minimum number of warm replicas — GPUs that stay provisioned with the model loaded, even during idle periods. This eliminates cold starts but reintroduces idle cost. The trade-off is worth it for latency-sensitive applications. Modal, for example, lets you set `keep_warm=1` to maintain one warm GPU at all times.
- Predictive scaling: Some platforms analyze traffic patterns and pre-warm GPUs before anticipated demand increases. If your traffic predictably rises at 9 AM, the platform begins provisioning at 8:55 AM.
- Smaller model variants: A quantized 7B model cold-starts in 5 seconds vs. 60 seconds for a 70B model. If latency is critical and a smaller model meets your quality bar, the cold-start benefit is substantial.
- Request queuing: For non-interactive workloads (batch processing, asynchronous tasks), queue requests and tolerate the cold-start delay. The first request takes 30–60 seconds; subsequent requests on the now-warm GPU are fast.
- Hybrid routing: Route latency-sensitive requests to a keep-warm serverless endpoint or a managed API, and route latency-tolerant requests to a scale-to-zero serverless endpoint. This minimizes both cost and latency.
CostHawk tracks cold-start events in its latency analytics, showing you how frequently cold starts occur, their duration, and their cost impact. This data helps you decide whether to configure keep-warm replicas and how many — balancing the idle cost of warm replicas against the latency cost of cold starts.
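The keep-warm decision reduces to comparing two daily costs: the GPU-seconds burned by cold starts versus the cost of keeping one replica billed around the clock. A minimal sketch of that comparison (on pure cost alone; it deliberately ignores the latency benefit of keep-warm, which often justifies warm replicas even when they cost more):

```python
def keep_warm_break_even(cold_starts_per_day: int, cold_start_s: float,
                         gpu_rate_per_s: float) -> tuple[float, float, bool]:
    """Compare daily cold-start overhead vs. one always-warm replica.

    Returns (cold_start_cost_per_day, keep_warm_cost_per_day, keep_warm_cheaper).
    """
    cold_cost = cold_starts_per_day * cold_start_s * gpu_rate_per_s
    warm_cost = 24 * 3600 * gpu_rate_per_s  # replica billed around the clock
    return cold_cost, warm_cost, warm_cost < cold_cost
```

Using the A100 figures from earlier (100 cold starts of 60 seconds at $0.0014/second), cold starts cost about $8.40/day against roughly $121/day for a warm replica, so on cost alone keep-warm only pays off at very high cold-start frequencies; latency is usually the deciding factor.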
When Serverless Makes Sense
Serverless inference is not the cheapest option at every volume level, and it is not the most performant for every workload. It excels in specific scenarios where its unique characteristics — zero idle cost, automatic scaling, and operational simplicity — provide the most value:
1. Variable and unpredictable workloads: If your inference volume fluctuates significantly — high during business hours, low overnight, spiking during marketing campaigns, quiet on weekends — serverless inference's scale-to-zero capability means you only pay for actual usage. A dedicated GPU running at 20% average utilization effectively costs 5x its per-token rate. Serverless eliminates this waste entirely. Calculate your average utilization honestly: if a dedicated GPU would be utilized less than 40% of the time, serverless is almost certainly cheaper.
2. Early-stage and experimental workloads: When you are evaluating models, testing new features, or building prototypes, serverless inference provides instant access without infrastructure setup. Deploy a model, test it with real traffic for a week, analyze the results, and shut it down — total cost might be $50 instead of the $929/month minimum for a dedicated GPU. This low commitment enables faster experimentation and better model selection decisions.
3. Low to medium volume production workloads: The sweet spot for serverless inference is 50M–2B tokens per month. Below 50M, managed APIs are competitive and operationally simpler (no model selection or deployment). Above 2B, dedicated GPUs become cheaper even accounting for operational overhead. In the 50M–2B range, serverless offers the best combination of cost and simplicity.
4. Multi-model architectures: If your application uses multiple specialized models (a small model for classification, a medium model for summarization, a large model for complex reasoning), serverless inference shines because each model scales independently. A classification model handling 1,000 requests per minute and a reasoning model handling 10 requests per minute can each scale to exactly the resources they need. With dedicated GPUs, you would need at least one GPU per model regardless of volume.
5. Geographic distribution: Serverless platforms with multi-region deployment can serve requests from the nearest data center, reducing latency for global applications. Running dedicated GPUs in multiple regions multiplies your infrastructure cost; serverless distributes this across shared platform capacity.
When serverless does NOT make sense:
- Very high, stable volume: If you consistently process 5B+ tokens per month with predictable demand, dedicated GPUs are significantly cheaper. The serverless per-token premium adds up at scale.
- Ultra-low latency requirements: If every request must respond within 200ms including TTFT, cold starts are unacceptable and you need always-warm infrastructure (dedicated GPUs or keep-warm serverless, which approaches dedicated GPU cost).
- Custom infrastructure requirements: If you need custom CUDA kernels, non-standard model architectures, or hardware configurations that serverless platforms do not support, self-hosting is the only option.
Monitoring Serverless Inference Costs
Serverless inference introduces unique monitoring challenges compared to both APIs and dedicated GPUs. With APIs, cost is purely a function of tokens — easy to predict and track. With dedicated GPUs, cost is fixed hourly — easy to budget. Serverless inference combines variable per-token costs with cold-start costs, keep-warm costs, and platform-specific surcharges that require more nuanced monitoring.
Key metrics to track:
- Cost per request: The total cost of each inference request, including both inference compute and any cold-start overhead. CostHawk computes this by combining the platform's billing data with per-request token counts and latency measurements.
- Cold start frequency: The percentage of requests that trigger a cold start. If this exceeds 5–10% for a production endpoint, consider adding keep-warm replicas. Each cold start costs GPU-seconds (billed) and adds latency (degrading user experience).
- Cold start cost: The total cost attributable to cold starts — GPU time spent loading models rather than processing tokens. For a 70B model with 60-second cold starts at $0.0014/second (A100 rate), each cold start costs $0.084. At 100 cold starts per day, that is $8.40/day or $252/month in pure overhead.
- Effective cost per token: The blended cost including inference time, cold-start overhead, and keep-warm costs, divided by total tokens processed. This is the metric to compare against API and dedicated GPU alternatives.
- Utilization patterns: Time-series view of inference volume showing daily, weekly, and seasonal patterns. This data informs decisions about keep-warm configuration, predictive scaling, and potential migration to dedicated GPUs for stable base-load.
- Per-model cost breakdown: If you run multiple models on serverless, track each independently. A small classification model at $0.10/MTok and a large reasoning model at $0.54/MTok have very different economics, and optimization strategies differ.
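The effective cost per token metric above is just the sum of the three cost components divided by volume. A minimal sketch:

```python
def effective_cost_per_mtok(inference_cost: float, cold_start_cost: float,
                            keep_warm_cost: float, total_tokens: int) -> float:
    """Blended cost in USD per 1M tokens, including all overheads.

    This is the number to compare against per-token API rates and against
    the amortized cost of a dedicated GPU.
    """
    total_cost = inference_cost + cold_start_cost + keep_warm_cost
    return total_cost / (total_tokens / 1e6)
```

For example, $1,400 of inference time plus $252 of cold-start overhead across 1B tokens yields an effective rate of about $1.65/MTok, noticeably above the raw inference-only rate of $1.40/MTok.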
CostHawk serverless monitoring:
CostHawk integrates with serverless inference platforms via API to ingest billing data, request logs, and performance metrics. The dashboard provides:
- Unified cost view across all serverless platforms, APIs, and GPU infrastructure
- Cold-start analysis with frequency, duration, and cost attribution
- Savings recommendations: "Your Together AI Llama 3 70B endpoint processed 3.2B tokens last month at $1,728. A dedicated A100 on Lambda Labs would cost $929/month — a potential 46% saving."
- Anomaly detection tuned for serverless patterns — distinguishing between normal traffic variability and genuine cost anomalies
- Budget alerts that account for the variability of serverless pricing — setting thresholds based on trailing averages rather than fixed amounts
The monitoring strategy should be tailored to your deployment model. For per-token platforms (Together, Fireworks), monitor token volume and cost per token. For per-second platforms (Replicate, Modal), monitor GPU-seconds consumed and effective tokens per GPU-second. CostHawk normalizes these different billing models into a consistent cost-per-token metric, enabling direct comparison across platforms and deployment models. This normalization is essential for identifying the most cost-effective platform for each workload and for tracking the impact of optimization efforts over time.
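The normalization step described above, converting both billing models into a single cost-per-token figure, can be sketched as follows. The function name and argument names are this example's own convention; the per-second conversion relies on a benchmarked throughput number you supply.

```python
def normalize_to_cost_per_mtok(billing_model: str, **kw) -> float:
    """Convert either billing model into USD per 1M tokens for comparison."""
    if billing_model == "per_token":
        return kw["rate_per_mtok"]
    if billing_model == "per_second":
        # Seconds needed to generate 1M tokens at the measured throughput,
        # times the GPU's per-second rate.
        return 1e6 / kw["tokens_per_second"] * kw["gpu_rate_per_s"]
    raise ValueError(f"unknown billing model: {billing_model}")
```

A per-second platform billing $0.0014/second at 1,000 tokens/second normalizes to $1.40/MTok, directly comparable to a per-token platform quoting $0.54/MTok.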
FAQ
Frequently Asked Questions
What is the difference between serverless inference and a managed API like OpenAI?
How do cold starts affect production applications?
Which serverless inference platform is cheapest?
Can I run fine-tuned models on serverless inference platforms?
How does serverless inference handle scaling to zero and back?
Is serverless inference suitable for real-time applications?
How do I migrate from a managed API to serverless inference?
What are the data privacy implications of serverless inference?
Related Terms
GPU Instance
Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.
Inference
The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.
Provisioned Throughput
Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Latency
The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.
Total Cost of Ownership (TCO) for AI
The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
