Glossary · Infrastructure · Updated 2026-03-16

Serverless Inference

Running LLM inference without managing GPU infrastructure. Serverless inference platforms automatically provision hardware, scale to demand, and charge per request or per token — combining the cost structure of APIs with the flexibility of self-hosting open-weight models. Platforms include AWS Bedrock, Google Vertex AI, Replicate, Modal, Together AI, and Fireworks AI.

Definition

What is Serverless Inference?

Serverless inference is a deployment model for machine learning models — including large language models — where the cloud platform manages all underlying GPU infrastructure, scaling, and operations. You deploy a model (or select from a catalog of pre-deployed models), send inference requests, and pay only for the compute consumed by those requests. There are no instances to provision, no GPUs to manage, no auto-scaling policies to configure, and no idle capacity to pay for. The platform handles hardware allocation, model loading, request batching, scaling up during traffic spikes, and scaling to zero during idle periods.

For LLM workloads, serverless inference occupies a middle ground between managed APIs (like OpenAI and Anthropic) and self-managed GPU instances. Like APIs, it abstracts away infrastructure complexity. Like GPU instances, it allows you to run open-weight models (Llama, Mistral, Qwen) and custom fine-tuned models that are not available through major API providers.

The cost model is typically per-token (similar to APIs) or per-second of GPU compute, but rates are often lower than those of proprietary API providers because you are running open-weight models without the provider's margin on model training costs.

Impact

Why It Matters for AI Costs

Serverless inference solves the two biggest pain points of GPU self-hosting while preserving most of the benefits. The pain points it eliminates:

1. Idle cost. A dedicated GPU instance costs the same whether it processes 10 million tokens or zero. If your workload is variable — high during business hours, low overnight, spiking during product launches — you either over-provision (wasting money on idle GPUs) or under-provision (degrading user experience during peaks). Serverless inference scales to zero during idle periods, meaning you pay nothing when there is no traffic. For workloads with utilization below 40%, serverless inference can be cheaper than dedicated GPUs despite a higher per-token rate, simply because you are not paying for idle time.

2. Operational complexity. Running GPU inference in production requires expertise in model deployment (vLLM, TGI, TensorRT-LLM), Kubernetes GPU scheduling, auto-scaling policies, health monitoring, rolling updates, and failure recovery. This operational overhead requires 0.5–1.0 dedicated engineers. Serverless platforms abstract all of this — you deploy a model with a configuration file and get an endpoint. The platform handles everything else.
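The idle-cost arithmetic behind the 40% rule of thumb can be sketched as a breakeven calculation. The capacity and rate figures below are illustrative assumptions chosen to match this article's examples, not platform quotes:

```python
def breakeven_utilization(dedicated_monthly_usd: float,
                          full_util_tokens_per_month: float,
                          serverless_rate_per_mtok: float) -> float:
    """Utilization fraction below which scale-to-zero serverless is cheaper
    than a dedicated GPU of the same capacity."""
    # What serverless would charge if you pushed the GPU's full capacity through it:
    full_util_serverless_usd = full_util_tokens_per_month / 1e6 * serverless_rate_per_mtok
    # The dedicated GPU bills the same at any utilization; below this fraction
    # of capacity, the serverless bill stays under the fixed GPU bill.
    return dedicated_monthly_usd / full_util_serverless_usd

# Illustrative assumptions: an A100 at $929/month serving up to ~2B tokens/month,
# vs. serverless at $1.08 per million tokens.
print(round(breakeven_utilization(929, 2e9, 1.08), 2))  # -> 0.43
```

With these assumptions the breakeven lands at roughly 43% utilization, consistent with the 40% threshold above; plugging in your own GPU cost, benchmarked capacity, and serverless rate gives your actual threshold.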

What serverless inference preserves from GPU self-hosting:

  • Model choice: Run any open-weight model, including custom fine-tuned models. You are not limited to the models that OpenAI or Anthropic offer.
  • Data control: Your prompts and responses are processed on platform infrastructure but not used for training (unlike some free-tier API offerings). Most platforms offer VPC peering and private endpoints for enterprise data requirements.
  • Competitive pricing: Open-weight models on serverless platforms typically cost 30–70% less than equivalent proprietary APIs, because you are not paying for model training amortization or the provider's frontier research budget.

The cost comparison at different volume levels tells the story:

| Monthly Volume | Proprietary API (GPT-4o) | Serverless (Llama 3 70B) | Dedicated GPU (A100) | Cheapest Option |
|---|---|---|---|---|
| 50M tokens | $375 | $54 | $929 | Serverless |
| 500M tokens | $3,750 | $540 | $929 | Serverless |
| 2B tokens | $15,000 | $2,160 | $929 | Dedicated GPU |
| 10B tokens | $75,000 | $10,800 | $1,858 (2 GPUs) | Dedicated GPU |

Serverless is the sweet spot for the 50M–2B token range — too much volume for API pricing to be optimal, but not enough to justify dedicated GPU infrastructure. CostHawk tracks serverless inference costs alongside API and GPU costs, providing a complete picture of your AI infrastructure spend and identifying which deployment model is most cost-effective for each workload.
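The crossover points in the table can be reproduced with a small model. The rates below are back-calculated from the table's own rows ($7.50/MTok blended for the API, $1.08/MTok serverless) and the per-GPU capacity is an assumption chosen to match the 10B-token row:

```python
import math

API_RATE = 7.50      # $/MTok blended for GPT-4o (illustrative, implied by the table)
SVL_RATE = 1.08      # $/MTok for serverless Llama 3 70B (implied by the table)
GPU_MONTHLY = 929.0  # $/month for one dedicated A100 (from the table)
GPU_CAPACITY = 5e9   # tokens/month one GPU can serve (assumption; 10B needs 2 GPUs)

def monthly_costs(tokens: float) -> dict:
    """Rough monthly cost under each deployment model."""
    gpus = max(1, math.ceil(tokens / GPU_CAPACITY))
    return {
        "api": tokens / 1e6 * API_RATE,
        "serverless": tokens / 1e6 * SVL_RATE,
        "dedicated gpu": gpus * GPU_MONTHLY,
    }

costs = monthly_costs(500e6)
print(min(costs, key=costs.get))  # -> "serverless", matching the 500M-token row
```

Running the same function at 2B tokens flips the answer to dedicated GPU, matching the table.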

How Serverless Inference Works

Serverless inference applies the serverless computing paradigm — pioneered by AWS Lambda for general-purpose functions — to machine learning model serving. The core principle is the same: the developer provides the code (or in this case, the model), and the platform handles all infrastructure concerns.

In a serverless inference platform, the lifecycle of a request looks like this:

  1. Request arrives: Your application sends an inference request (prompt, parameters) to the platform's API endpoint.
  2. Scheduling: The platform routes the request to a warm GPU that already has the model loaded in memory. If no warm GPUs are available, the platform provisions one (this is the "cold start" scenario).
  3. Inference: The model processes the request on the GPU, generating tokens. The platform may batch multiple concurrent requests to improve throughput.
  4. Response: The generated tokens are streamed or returned to your application.
  5. Scale down: After a configurable idle period (typically 30 seconds to 5 minutes), the platform deallocates the GPU resources. If no more requests arrive, you scale to zero and stop paying.
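Step 1 of this lifecycle is usually just an OpenAI-compatible HTTP call. A minimal sketch of building the request body — the endpoint URL, model name, and header values are placeholders, not any specific platform's values:

```python
import json

def build_chat_request(prompt: str, model: str, stream: bool = True,
                       max_tokens: int = 512) -> bytes:
    """Build an OpenAI-compatible chat-completion body, the request shape most
    serverless platforms accept."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,  # stream tokens back as they are generated (step 4)
    }
    return json.dumps(body).encode()

# Sending it is platform-specific; with stdlib urllib it would look roughly like:
# req = urllib.request.Request("https://<platform>/v1/chat/completions",
#                              data=build_chat_request("Hi", "llama-3.1-70b"),
#                              headers={"Authorization": "Bearer <key>",
#                                       "Content-Type": "application/json"})
payload = json.loads(build_chat_request("Summarize this.", "llama-3.1-70b"))
print(payload["messages"][0]["role"])  # -> "user"
```

Whether that request hits a warm GPU (step 2) or triggers a cold start is invisible in the request itself; it shows up only in latency and, on per-second platforms, in cost.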

The major serverless inference platforms each have distinct characteristics:

  • AWS Bedrock: Amazon's managed LLM service. Offers a catalog of models from multiple providers (Claude, Llama, Mistral, Cohere, Titan) on AWS infrastructure. Pricing is per-token with no minimum commitment. Deep integration with AWS services (IAM, VPC, CloudWatch, S3). Best for teams already invested in the AWS ecosystem.
  • Google Vertex AI: Google's ML platform. Offers Gemini models natively, plus open models via Model Garden. Per-token pricing for Gemini; per-node-hour for custom model endpoints. Integration with Google Cloud services.
  • Replicate: Developer-focused platform for running open-source models. Pay per second of GPU compute. Extensive model catalog with one-click deployment. Popular for experimentation and prototyping.
  • Modal: Python-first serverless compute platform. Define your inference function in Python, and Modal handles containerization, GPU provisioning, and scaling. Per-second billing with fast cold starts. Excellent developer experience.
  • Together AI: Focuses on open-source LLM inference. Per-token pricing competitive with self-hosting. Offers API-compatible endpoints for Llama, Mistral, Qwen, and other open models. Supports fine-tuned model deployment.
  • Fireworks AI: High-performance inference platform. Offers some of the lowest per-token rates for open models, with emphasis on speed (low latency, high throughput). Custom model hosting available.
  • Groq: Hardware-differentiated platform using custom LPU (Language Processing Unit) chips instead of NVIDIA GPUs. Offers extremely fast inference at competitive per-token prices. Limited model selection but very fast for supported models.

Each platform makes different trade-offs between price, performance, model selection, and ecosystem integration. CostHawk integrates with all major serverless inference platforms to provide unified cost tracking and comparison.

Serverless vs Dedicated GPU vs API

Choosing between serverless inference, dedicated GPU instances, and managed APIs requires evaluating multiple dimensions beyond just per-token cost. Here is a comprehensive comparison:

| Dimension | Managed API (OpenAI, Anthropic) | Serverless Inference (Replicate, Modal, Together) | Dedicated GPU (Self-managed) |
|---|---|---|---|
| Per-token cost | Highest. Includes model training amortization, research budget, and provider margin. | Medium. Open-model pricing without training costs. Platform margin is lower. | Lowest at high utilization. Fixed hourly cost amortized over tokens served. |
| Idle cost | Zero. Pay only for tokens consumed. | Near-zero. Scale to zero; pay only during active inference. Small cold-start cost. | Full. GPU runs and bills 24/7 regardless of utilization. |
| Model selection | Proprietary frontier models only (GPT-4o, Claude, Gemini). | Open-weight models plus custom fine-tuned. Growing proprietary model availability. | Any model you can run (open-weight, fine-tuned, custom-trained). |
| Latency (warm) | 100–500ms TTFT typical. Optimized by provider. | 150–800ms TTFT. Varies by platform and model size. Generally competitive with APIs. | 50–300ms TTFT. You control optimization. Can be fastest with tuning. |
| Latency (cold) | None. Always warm. | 5–60 seconds depending on model size and platform. Major concern for latency-sensitive apps. | None if running 24/7. Minutes if auto-scaling from zero. |
| Scaling | Automatic, massive scale. Provider handles capacity. | Automatic. Platform scales within its GPU pool. May have capacity limits. | Manual or semi-automatic. You configure and manage auto-scaling. |
| Operational overhead | Near-zero. Just an API call. | Low. Configure model, deploy, consume endpoint. Platform handles ops. | High. Manage deployment, scaling, monitoring, updates, failures. |
| Data privacy | Data processed on provider infrastructure. Usage policies vary by provider. | Data processed on platform infrastructure. Most platforms have data processing agreements. | Full control. Data stays on your infrastructure. |
| Customization | Limited. Some providers offer fine-tuning. No custom architectures. | Moderate. Custom models, LoRA adapters, custom containers on some platforms. | Full. Any model, any framework, any optimization. |
| Best volume range | 0–500M tokens/month (or when frontier quality is required regardless of volume). | 50M–2B tokens/month (sweet spot where idle GPU costs dominate but API markup is unnecessary). | 1B+ tokens/month with stable, predictable demand. |

The optimal strategy evolves as your workload grows. Many teams follow this progression:

  1. Prototype phase: Use managed APIs for fastest iteration and best model quality.
  2. Growth phase: Migrate non-frontier workloads to serverless inference for 50–70% cost reduction.
  3. Scale phase: Move highest-volume, most stable workloads to dedicated GPUs for an additional 40–60% reduction. Keep bursty and low-volume workloads on serverless.

Serverless Inference Pricing Models

Serverless inference platforms use several pricing models, each with distinct economics. Understanding these models is essential for cost prediction and optimization:

1. Per-token pricing: Charged by the number of input and output tokens processed, identical to how OpenAI and Anthropic price their APIs. This is the most transparent model for LLM workloads because cost scales linearly with usage.

| Platform | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Together AI | Llama 3.1 70B | $0.54 | $0.54 |
| Together AI | Llama 3.1 8B | $0.10 | $0.10 |
| Fireworks AI | Llama 3.1 70B | $0.40 | $0.40 |
| Fireworks AI | Llama 3.1 8B | $0.10 | $0.10 |
| Groq | Llama 3.1 70B | $0.59 | $0.79 |
| AWS Bedrock | Llama 3.1 70B | $0.72 | $0.72 |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Google Vertex AI | Gemini 2.0 Flash | $0.10 | $0.40 |
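Because per-token cost scales linearly, comparing platforms for a given traffic mix is simple arithmetic. A sketch using the Llama 3.1 70B rates from the table (rates change frequently, so treat these as a snapshot):

```python
RATES = {  # $ per 1M tokens (input, output) for Llama 3.1 70B, from the table
    "Together AI":  (0.54, 0.54),
    "Fireworks AI": (0.40, 0.40),
    "Groq":         (0.59, 0.79),
    "AWS Bedrock":  (0.72, 0.72),
}

def monthly_cost(platform: str, input_mtok: float, output_mtok: float) -> float:
    """Cost of a month's traffic on a per-token platform, in dollars."""
    in_rate, out_rate = RATES[platform]
    return input_mtok * in_rate + output_mtok * out_rate

def cheapest(input_mtok: float, output_mtok: float) -> str:
    return min(RATES, key=lambda p: monthly_cost(p, input_mtok, output_mtok))

# 400M input tokens and 100M output tokens per month:
print(cheapest(400, 100))  # -> "Fireworks AI"
```

Note that this compares list price only; throughput, reliability, and cold-start behavior (discussed below) also affect the real bill.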

2. Per-second GPU billing: Charged by the number of seconds your model runs on GPU hardware. This model is used by Replicate, Modal, and some other platforms. The advantage is full transparency into hardware utilization; the complexity is that cost per token varies depending on model efficiency and batch size.

| Platform | GPU Type | Rate (per second) | Rate (per hour) |
|---|---|---|---|
| Replicate | A40 (48 GB) | $0.000575 | $2.07 |
| Replicate | A100 80GB | $0.001400 | $5.04 |
| Modal | A100 80GB | $0.001092 | $3.93 |
| Modal | H100 | $0.002319 | $8.35 |

3. Hybrid pricing (provisioned + per-token): Some platforms offer a reserved baseline capacity at a lower per-token rate, with on-demand scaling at a higher rate for bursts. AWS Bedrock's Provisioned Throughput is an example: you commit to a fixed throughput level (measured in model units) for a term, getting guaranteed capacity and lower rates. This model suits workloads with a predictable baseline plus occasional spikes.
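The hybrid model's economics reduce to a reservation plus metered overage. A minimal sketch of the pattern; the reservation price, baseline, and overage rate here are illustrative assumptions, not Bedrock's actual quotes:

```python
def hybrid_cost(traffic_mtok: float, baseline_mtok: float,
                reserved_monthly_usd: float, overage_rate_per_mtok: float) -> float:
    """Provisioned-baseline-plus-on-demand pricing.

    You pay the reservation regardless of usage, plus a (typically higher)
    per-token rate for any traffic above the reserved baseline."""
    overage = max(0.0, traffic_mtok - baseline_mtok)
    return reserved_monthly_usd + overage * overage_rate_per_mtok

# 1,200 MTok of traffic against a 1,000-MTok reservation at $400/month,
# with overage billed at $0.72/MTok (all numbers are assumptions):
print(hybrid_cost(1200, 1000, 400.0, 0.72))  # -> 544.0
```

The shape of this function explains when hybrid pricing wins: a predictable baseline fills the reservation cheaply, while spikes pay on-demand rates only for the excess.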

4. Free tiers and credits: Most platforms offer free tiers or new-user credits. Together AI provides $25 in free credits. Replicate offers free CPU inference and credits for new accounts. Google Vertex AI includes a free tier for Gemini models. These are useful for experimentation but rarely sufficient for production workloads.

Cost prediction tips:

  • For per-token platforms, cost prediction is straightforward: multiply your estimated token volume by the per-token rate.
  • For per-second platforms, benchmark your model's throughput (tokens/second) on the target GPU, then divide your token volume by throughput to estimate GPU-seconds needed.
  • Always account for cold starts in per-second billing — loading a 70B model takes 30–60 seconds of GPU time, which is billed even though no inference is happening.
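The per-second prediction tips above can be combined into one estimator. Throughput and cold-start figures are workload-specific assumptions you should replace with your own benchmarks:

```python
def gpu_cost(tokens: float, tokens_per_second: float, rate_per_second: float,
             cold_starts: int = 0, cold_start_seconds: float = 45.0) -> float:
    """Estimate per-second-billed cost: inference time plus billed cold-start time.

    tokens_per_second should come from benchmarking your model on the target GPU;
    cold-start seconds are billed even though no tokens are produced."""
    inference_seconds = tokens / tokens_per_second
    overhead_seconds = cold_starts * cold_start_seconds
    return (inference_seconds + overhead_seconds) * rate_per_second

# 100M tokens at 1,500 tok/s on an A100 billed at $0.0014/s, with 300
# cold starts of ~45s each over the month (throughput is an assumption):
print(round(gpu_cost(100e6, 1500, 0.0014, cold_starts=300), 2))  # -> 112.23
```

In this example the 300 cold starts add about $18.90 of pure loading overhead, roughly 17% of the bill, which is why cold-start frequency is worth monitoring.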

Cold Start and Latency Considerations

Cold start is the single biggest operational challenge in serverless inference. When a request arrives and no GPU has the model loaded in memory, the platform must provision a GPU, load the model from storage, and initialize the inference engine before processing the first token. This delay — the cold start latency — directly impacts user experience and cost.

Cold start durations by model size:

| Model Size | Typical Cold Start | Platform Variation |
|---|---|---|
| 1–3B params | 3–8 seconds | Replicate: ~5s, Modal: ~4s, Together: N/A (always warm) |
| 7–8B params | 5–15 seconds | Replicate: ~10s, Modal: ~8s |
| 13B params | 10–25 seconds | Replicate: ~18s, Modal: ~12s |
| 30–34B params | 20–45 seconds | Replicate: ~35s, Modal: ~25s |
| 70B params | 30–90 seconds | Replicate: ~60s, Modal: ~40s |

Cold start latency comes from three phases:

  1. GPU provisioning: Allocating a GPU from the platform's pool. Typically 2–10 seconds depending on availability.
  2. Model loading: Transferring model weights from storage (typically NVMe or network storage) to GPU VRAM. This is the dominant component — a 70B FP16 model is 140 GB, and even at 10 GB/s transfer speed, loading takes 14 seconds. Quantized models load proportionally faster.
  3. Engine initialization: Starting the inference engine (vLLM, TGI, etc.), compiling CUDA kernels, and warming up caches. Typically 2–5 seconds.
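The three phases above give a back-of-envelope cold-start estimate. The provisioning, bandwidth, and initialization figures are the rough values from the text, treated here as tunable assumptions:

```python
def cold_start_estimate(params_billion: float, bytes_per_param: float = 2.0,
                        transfer_gb_per_s: float = 10.0,
                        provision_s: float = 5.0, init_s: float = 3.0) -> float:
    """Estimated cold start in seconds: provisioning + weight loading + engine init.

    bytes_per_param: 2.0 for FP16, 1.0 for INT8, 0.5 for 4-bit quantization."""
    weight_gb = params_billion * bytes_per_param   # e.g. 70B FP16 -> 140 GB
    load_s = weight_gb / transfer_gb_per_s         # the dominant phase
    return provision_s + load_s + init_s

print(cold_start_estimate(70))       # FP16 70B -> 22.0 seconds
print(cold_start_estimate(70, 0.5))  # 4-bit quantized: weights load ~4x faster
```

This model explains both ends of the observed range: quantization shrinks the dominant loading phase, while slow network storage or scarce GPU capacity stretches it toward the 90-second worst case.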

Mitigation strategies:

  • Keep-warm configuration: Most platforms let you configure a minimum number of warm replicas — GPUs that stay provisioned with the model loaded, even during idle periods. This eliminates cold starts but reintroduces idle cost. The trade-off is worth it for latency-sensitive applications. Modal, for example, lets you set keep_warm=1 to maintain one warm GPU at all times.
  • Predictive scaling: Some platforms analyze traffic patterns and pre-warm GPUs before anticipated demand increases. If your traffic predictably rises at 9 AM, the platform begins provisioning at 8:55 AM.
  • Smaller model variants: A quantized 7B model cold-starts in 5 seconds vs. 60 seconds for a 70B model. If latency is critical and a smaller model meets your quality bar, the cold-start benefit is substantial.
  • Request queuing: For non-interactive workloads (batch processing, asynchronous tasks), queue requests and tolerate the cold-start delay. The first request takes 30–60 seconds; subsequent requests on the now-warm GPU are fast.
  • Hybrid routing: Route latency-sensitive requests to a keep-warm serverless endpoint or a managed API, and route latency-tolerant requests to a scale-to-zero serverless endpoint. This minimizes both cost and latency.
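The hybrid-routing strategy can be sketched as a simple router that guesses whether the scale-to-zero endpoint has gone cold. The idle-timeout heuristic and endpoint names are illustrative, not a real SDK:

```python
import time

def pick_endpoint(latency_sensitive: bool, last_request_ts: float,
                  idle_timeout_s: float = 120.0) -> str:
    """Route a request between a managed-API fallback and a scale-to-zero
    serverless endpoint, assuming the endpoint goes cold after idle_timeout_s."""
    probably_cold = (time.time() - last_request_ts) > idle_timeout_s
    if latency_sensitive and probably_cold:
        return "managed-api"  # instant response while the serverless endpoint warms
    return "serverless"       # cheapest path once warm, or when a delay is tolerable

# A chat request arriving after 10 minutes of idle goes to the API fallback;
# a batch job arriving at the same moment tolerates the cold start:
print(pick_endpoint(True, time.time() - 600))   # -> "managed-api"
print(pick_endpoint(False, time.time() - 600))  # -> "serverless"
```

A production router would also fire a warm-up request to the serverless endpoint when it detects a cold state, so traffic shifts back to the cheap path within one cold-start window.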

CostHawk tracks cold-start events in its latency analytics, showing you how frequently cold starts occur, their duration, and their cost impact. This data helps you decide whether to configure keep-warm replicas and how many — balancing the idle cost of warm replicas against the latency cost of cold starts.

When Serverless Makes Sense

Serverless inference is not the cheapest option at every volume level, and it is not the most performant for every workload. It excels in specific scenarios where its unique characteristics — zero idle cost, automatic scaling, and operational simplicity — provide the most value:

1. Variable and unpredictable workloads: If your inference volume fluctuates significantly — high during business hours, low overnight, spiking during marketing campaigns, quiet on weekends — serverless inference's scale-to-zero capability means you only pay for actual usage. A dedicated GPU running at 20% average utilization effectively costs 5x its per-token rate. Serverless eliminates this waste entirely. Calculate your average utilization honestly: if a dedicated GPU would be utilized less than 40% of the time, serverless is almost certainly cheaper.

2. Early-stage and experimental workloads: When you are evaluating models, testing new features, or building prototypes, serverless inference provides instant access without infrastructure setup. Deploy a model, test it with real traffic for a week, analyze the results, and shut it down — total cost might be $50 instead of the $929/month minimum for a dedicated GPU. This low commitment enables faster experimentation and better model selection decisions.

3. Low to medium volume production workloads: The sweet spot for serverless inference is 50M–2B tokens per month. Below 50M, managed APIs are competitive and operationally simpler (no model selection or deployment). Above 2B, dedicated GPUs become cheaper even accounting for operational overhead. In the 50M–2B range, serverless offers the best combination of cost and simplicity.

4. Multi-model architectures: If your application uses multiple specialized models (a small model for classification, a medium model for summarization, a large model for complex reasoning), serverless inference shines because each model scales independently. A classification model handling 1,000 requests per minute and a reasoning model handling 10 requests per minute can each scale to exactly the resources they need. With dedicated GPUs, you would need at least one GPU per model regardless of volume.

5. Geographic distribution: Serverless platforms with multi-region deployment can serve requests from the nearest data center, reducing latency for global applications. Running dedicated GPUs in multiple regions multiplies your infrastructure cost; serverless distributes this across shared platform capacity.

When serverless does NOT make sense:

  • Very high, stable volume: If you consistently process 5B+ tokens per month with predictable demand, dedicated GPUs are significantly cheaper. The serverless per-token premium adds up at scale.
  • Ultra-low latency requirements: If every request must respond within 200ms including TTFT, cold starts are unacceptable and you need always-warm infrastructure (dedicated GPUs or keep-warm serverless, which approaches dedicated GPU cost).
  • Custom infrastructure requirements: If you need custom CUDA kernels, non-standard model architectures, or hardware configurations that serverless platforms do not support, self-hosting is the only option.

Monitoring Serverless Inference Costs

Serverless inference introduces unique monitoring challenges compared to both APIs and dedicated GPUs. With APIs, cost is purely a function of tokens — easy to predict and track. With dedicated GPUs, cost is fixed hourly — easy to budget. Serverless inference combines variable per-token costs with cold-start costs, keep-warm costs, and platform-specific surcharges that require more nuanced monitoring.

Key metrics to track:

  • Cost per request: The total cost of each inference request, including both inference compute and any cold-start overhead. CostHawk computes this by combining the platform's billing data with per-request token counts and latency measurements.
  • Cold start frequency: The percentage of requests that trigger a cold start. If this exceeds 5–10% for a production endpoint, consider adding keep-warm replicas. Each cold start costs GPU-seconds (billed) and adds latency (degrading user experience).
  • Cold start cost: The total cost attributable to cold starts — GPU time spent loading models rather than processing tokens. For a 70B model with 60-second cold starts at $0.0014/second (A100 rate), each cold start costs $0.084. At 100 cold starts per day, that is $8.40/day or $252/month in pure overhead.
  • Effective cost per token: The blended cost including inference time, cold-start overhead, and keep-warm costs, divided by total tokens processed. This is the metric to compare against API and dedicated GPU alternatives.
  • Utilization patterns: Time-series view of inference volume showing daily, weekly, and seasonal patterns. This data informs decisions about keep-warm configuration, predictive scaling, and potential migration to dedicated GPUs for stable base-load.
  • Per-model cost breakdown: If you run multiple models on serverless, track each independently. A small classification model at $0.10/MTok and a large reasoning model at $0.54/MTok have very different economics, and optimization strategies differ.
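The effective-cost-per-token metric above is a simple blend of the three cost components. A sketch, using the $252/month cold-start figure from the example above:

```python
def effective_cost_per_mtok(inference_usd: float, cold_start_usd: float,
                            keep_warm_usd: float, tokens: float) -> float:
    """Blended $/MTok including all overheads -- the number to compare against
    API and dedicated-GPU alternatives."""
    total = inference_usd + cold_start_usd + keep_warm_usd
    return total / (tokens / 1e6)

# A month with $540 of inference, $252 of cold-start overhead (the 70B example
# above), no keep-warm replicas, and 500M tokens processed:
print(effective_cost_per_mtok(540.0, 252.0, 0.0, 500e6))  # -> 1.584
```

In this example overhead inflates the nominal $1.08/MTok rate to $1.58/MTok, a 47% premium, which is exactly the kind of gap that decides whether keep-warm replicas or a platform switch pays off.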

CostHawk serverless monitoring:

CostHawk integrates with serverless inference platforms via API to ingest billing data, request logs, and performance metrics. The dashboard provides:

  • Unified cost view across all serverless platforms, APIs, and GPU infrastructure
  • Cold-start analysis with frequency, duration, and cost attribution
  • Savings recommendations: "Your Together AI Llama 3 70B endpoint processed 3.2B tokens last month at $1,728. A dedicated A100 on Lambda Labs would cost $929/month — a potential 46% saving."
  • Anomaly detection tuned for serverless patterns — distinguishing between normal traffic variability and genuine cost anomalies
  • Budget alerts that account for the variability of serverless pricing — setting thresholds based on trailing averages rather than fixed amounts

The monitoring strategy should be tailored to your deployment model. For per-token platforms (Together, Fireworks), monitor token volume and cost per token. For per-second platforms (Replicate, Modal), monitor GPU-seconds consumed and effective tokens per GPU-second. CostHawk normalizes these different billing models into a consistent cost-per-token metric, enabling direct comparison across platforms and deployment models. This normalization is essential for identifying the most cost-effective platform for each workload and for tracking the impact of optimization efforts over time.

FAQ

Frequently Asked Questions

What is the difference between serverless inference and a managed API like OpenAI?
Both serverless inference and managed APIs abstract away GPU infrastructure, but they differ in three fundamental ways. First, model selection: managed APIs like OpenAI and Anthropic only offer their proprietary models (GPT-4o, Claude). Serverless inference platforms let you run any open-weight model (Llama, Mistral, Qwen, DeepSeek) as well as your own fine-tuned models. This gives you access to models optimized for your specific use case. Second, pricing: managed APIs include the provider's margin for model training, research, and frontier capabilities. Serverless inference prices reflect only the compute cost of running the model, which is typically 30–70% cheaper for equivalent-quality open models. Third, customization: managed APIs offer limited configurability (temperature, max_tokens, a few other parameters). Serverless platforms let you deploy custom model configurations, quantization levels, LoRA adapters, and on some platforms, custom inference code. The trade-off is that managed APIs generally offer the best model quality for cutting-edge tasks, the lowest operational overhead (just an API call), and the most consistent performance. Serverless inference is best when you need cost savings on high-volume workloads, custom models, or models not available through major API providers.
How do cold starts affect production applications?
Cold starts create a two-tier latency experience: warm requests complete in 100–500 milliseconds (typical LLM inference time), while cold requests take 5–90 seconds depending on model size. For user-facing applications where responsiveness matters — chatbots, search, real-time assistants — cold starts can be unacceptable. Users waiting 30 seconds for a response will likely abandon the interaction. The impact depends on cold start frequency: if only 1 in 100 requests triggers a cold start, 99% of users have a good experience. If 1 in 5 triggers a cold start (common with very low traffic), the user experience degrades significantly. Mitigation strategies include: (1) Configure keep-warm replicas to maintain at least one GPU ready at all times — this adds idle cost but eliminates cold starts. (2) Use a smaller, faster-loading model that cold-starts in 3–5 seconds instead of 30–60 seconds. (3) Implement a loading state in your UI that gracefully handles the delay ('Warming up the model...'). (4) Route latency-sensitive requests to a managed API as a fallback when the serverless endpoint is cold. (5) Send periodic health-check requests to keep the endpoint warm during expected usage hours. CostHawk's cold-start analytics help you make data-driven decisions about how many keep-warm replicas to run based on actual traffic patterns.
Which serverless inference platform is cheapest?
The cheapest platform depends on your specific model, volume, and usage pattern. For per-token pricing on popular open models like Llama 3.1 70B, Fireworks AI and Together AI are typically the cheapest at $0.40–$0.54 per million tokens. Groq offers competitive pricing at $0.59/$0.79 per MTok with extremely fast inference due to their custom LPU hardware. AWS Bedrock charges a premium ($0.72/MTok for Llama 3.1 70B) but provides deeper integration with AWS services. For per-second billing, Modal offers competitive A100 rates ($3.93/hr) with excellent developer experience and fast cold starts. Replicate is slightly more expensive ($5.04/hr for A100) but has the largest model catalog. Important caveat: cheapest per-token rate does not always mean lowest total cost. Factor in: (1) Cold start costs — platforms with more cold starts charge you GPU time for model loading. (2) Throughput — under per-second billing, a platform achieving twice the tokens per GPU-second can be cheaper overall even if its nominal hourly rate is higher. (3) Quality of service — reliability, uptime, and consistent latency have economic value. CostHawk's multi-platform cost comparison helps identify the cheapest option for your specific workload by normalizing costs across different billing models.
Can I run fine-tuned models on serverless inference platforms?
Yes, most serverless inference platforms support custom fine-tuned models, though the process varies by platform. Together AI lets you upload fine-tuned models (based on supported base models) and serve them with the same per-token pricing as the base model. Replicate supports custom models via Cog containers — you package your model and inference code into a Cog container, push it to Replicate, and it becomes a serverless endpoint. Modal is the most flexible — you write a Python class that loads your model and defines an inference method, and Modal handles containerization, GPU provisioning, and scaling. You can run any model from any framework. Fireworks AI offers custom model hosting with LoRA adapter support, enabling multiple fine-tuned variants from a single base model deployment. AWS Bedrock supports fine-tuning natively for some models and custom model import for others. For LoRA-based fine-tuning (the most common approach), some platforms support multi-LoRA serving where the base model is loaded once and different LoRA adapters are swapped per request — this is significantly more cost-effective than deploying separate fine-tuned models. The key consideration for cost is whether the platform charges extra for custom models or serves them at standard rates. Most platforms charge the same per-token or per-second rate regardless of whether the model is from their catalog or custom-uploaded.
How does serverless inference handle scaling to zero and back?
Scale-to-zero is the defining feature that distinguishes serverless from dedicated infrastructure. When no requests have arrived for a configurable idle period (typically 30 seconds to 5 minutes depending on the platform), the platform deallocates the GPU resources. The model is unloaded from VRAM, and the GPU is returned to the platform's shared pool for use by other customers. You stop being billed the moment resources are deallocated. When a new request arrives after scale-to-zero, the platform must scale back up — this is the cold start. The platform provisions a GPU from its pool, loads the model weights from storage into GPU VRAM, initializes the inference engine, and processes the request. The total time for this cycle depends on GPU availability (seconds), model size (seconds to minutes for the weight transfer), and engine initialization (seconds). Platforms implement several optimizations to speed up scale-from-zero: (1) Pre-cached model weights — storing model files on fast local NVMe storage attached to GPU nodes rather than network storage. (2) Snapshot-based restoration — saving a GPU memory snapshot of the loaded model and restoring it directly, bypassing the normal loading process. (3) Pooled warm starts — maintaining a pool of GPUs with popular models pre-loaded, shared across customers. Your request may hit a GPU that already has the model loaded for another customer, eliminating the cold start entirely. The trade-off is clear: scale-to-zero saves money during idle periods but introduces latency when demand resumes.
Is serverless inference suitable for real-time applications?
Serverless inference can be suitable for real-time applications, but it requires careful configuration to manage cold-start latency. For applications with consistent traffic (at least a few requests per minute throughout operating hours), the serverless endpoint stays warm naturally — warm inference latency on serverless platforms is competitive with managed APIs (100–500ms TTFT for most models). The challenge arises during low-traffic periods when the endpoint scales to zero. Three strategies make serverless viable for real-time applications: (1) Keep-warm replicas: Configure the platform to maintain at least one warm GPU at all times. This adds idle cost ($1–$8/hr depending on GPU type) but guarantees fast response times for every request. The cost is still lower than a dedicated GPU if your warm periods are shorter than 24/7. (2) Hybrid routing: Use a managed API (OpenAI, Anthropic) as a fallback when the serverless endpoint is cold. Route the first request to the API for instant response while the serverless endpoint warms up, then route subsequent requests to serverless for cost savings. (3) Traffic shaping: Send synthetic health-check requests during expected usage hours to keep the endpoint warm. This costs a few cents per hour in inference time but prevents cold starts. For applications where every single request must respond within a strict SLA (under 500ms), dedicated GPUs or managed APIs provide more consistent guarantees. For applications where occasional 5–10 second delays are tolerable (background processing, async workflows), serverless without keep-warm is perfect.
How do I migrate from a managed API to serverless inference?
Migrating from a managed API (OpenAI, Anthropic) to serverless inference involves four steps. First, identify candidate workloads: not every API workload should move to serverless. Use CostHawk to identify workloads where you are using capabilities that open models can match (classification, extraction, summarization, simple Q&A) and where volume justifies the migration effort (at least $200/month in API spend). Frontier tasks requiring GPT-4o or Claude-level reasoning should stay on managed APIs. Second, evaluate model alternatives: benchmark open-weight models (Llama 3.1 70B, Mistral Large, Qwen 2.5 72B) against your current API model on your actual prompts and evaluation criteria. Most teams find that open models match proprietary API quality for 60–80% of their workloads. Third, deploy on a serverless platform: choose a platform based on your volume (per-token for predictable costs, per-second for flexibility), ecosystem (AWS Bedrock if you are on AWS), and latency requirements. Most platforms provide OpenAI-compatible API endpoints, meaning your application code only needs to change the base URL and model name — no SDK changes required. Fourth, gradual traffic migration: route 5% of traffic to the serverless endpoint initially, monitor quality and latency, increase to 25%, then 50%, then 100% as confidence builds. CostHawk tracks both the API and serverless costs during migration, showing real-time cost savings and enabling rollback if quality degrades.
What are the data privacy implications of serverless inference?
Serverless inference platforms process your prompts and responses on shared infrastructure, which raises data privacy considerations that differ from both managed APIs and self-hosted GPUs. Most major serverless platforms (Together AI, Fireworks AI, Modal, Replicate) commit to not training on customer data and not retaining prompts or responses beyond the duration of the request, but the specifics vary by platform and should be verified in their data processing agreements (DPAs).

AWS Bedrock provides the strongest privacy guarantees among serverless options: data is processed within your AWS account's VPC, encrypted at rest and in transit, and covered by AWS's extensive compliance certifications (HIPAA, SOC 2, FedRAMP, GDPR), making Bedrock the go-to choice for regulated industries. Google Vertex AI offers similar enterprise-grade data governance within the Google Cloud ecosystem.

For smaller platforms, review the DPA, data retention policy, and compliance certifications carefully. Key questions to ask: Do they log prompts and responses, and for how long? Can data be subpoenaed? Is the infrastructure SOC 2 audited? Do they support customer-managed encryption keys? For maximum data privacy without self-hosting, choose a platform with VPC peering or private endpoints, customer-managed encryption keys, and compliance certifications matching your regulatory requirements. If your data requirements are extremely strict (air-gapped, on-premises only), self-hosted GPU instances are the only option.

Related Terms

GPU Instance

Cloud-hosted GPU hardware used for running LLM inference or training workloads. GPU instances represent the alternative to API-based pricing — you pay for hardware time ($/hour) rather than per-token, making them cost-effective for high-volume, predictable workloads that exceed the breakeven point against API pricing.

Read more

Inference

The process of running a trained machine learning model to generate predictions, classifications, or text output from new input data. For AI API consumers, inference is the dominant cost — every API call is an inference request, and you are billed for the compute resources consumed during the model's forward pass through your input and output tokens. Inference costs dwarf training costs for most organizations because training happens once while inference happens millions of times.

Read more

Provisioned Throughput

Pre-purchased dedicated LLM compute capacity that guarantees consistent performance and can reduce per-token costs at scale.

Read more

Pay-Per-Token

The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.

Read more

Latency

The total elapsed time between sending a request to an LLM API and receiving the complete response. LLM latency decomposes into time-to-first-token (TTFT) — the wait before streaming begins — and generation time — the duration of token-by-token output. Latency directly trades off against cost: faster models and provisioned throughput reduce latency but increase spend.

Read more

Total Cost of Ownership (TCO) for AI

The complete, all-in cost of running AI in production over its full lifecycle. TCO extends far beyond API fees to include infrastructure, engineering, monitoring, data preparation, quality assurance, and operational overhead. Understanding true TCO is essential for accurate budgeting, build-vs-buy decisions, and meaningful ROI calculations.

Read more

AI Cost Glossary

Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.