Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
What Is the Batch API?
The Batch API is an alternative interface to the same large language models you already use in production. Instead of sending one request and waiting for an immediate response, you upload a file containing hundreds or thousands of requests, the provider processes them asynchronously, and you download the results when the job completes.
The key characteristics of batch processing are:
- Same models, same quality — Batch requests use identical model weights and configurations as real-time requests. Output quality is unchanged.
- 50% cost reduction — Both OpenAI and Anthropic offer a flat 50% discount on all token costs (input and output) for batch requests.
- Longer turnaround — OpenAI guarantees completion within 24 hours. Anthropic's Message Batches typically complete in minutes to hours, but Anthropic does not publish a formal SLA.
- Higher throughput limits — Batch endpoints often have separate, higher rate limits than real-time endpoints. OpenAI's Batch API allows up to 50,000 requests per batch and a separate token quota that does not count against your real-time limits.
Batch processing has been a staple of cloud computing for decades. AI providers adopted the pattern because GPU utilization is rarely 100% — batch jobs fill the gaps in compute demand, creating a win-win: providers improve utilization and customers pay less.
OpenAI vs Anthropic Batch Processing
Both major providers offer batch processing, but the implementations differ in important ways. The following table compares the two systems as of early 2026:
| Feature | OpenAI Batch API | Anthropic Message Batches |
|---|---|---|
| Discount | 50% off standard pricing | 50% off standard pricing |
| SLA / Turnaround | 24-hour completion guarantee | No formal SLA; typically minutes to hours |
| Max Requests per Batch | 50,000 requests | 100,000 requests |
| Input Format | JSONL file upload with custom_id per request | JSON array of request objects via API |
| Supported Models | GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o1, o3-mini | Claude Sonnet 4, Claude Haiku 3.5, Claude Opus 4 |
| Status Polling | GET /batches/{id} endpoint | GET /messages/batches/{id} endpoint |
| Partial Results | Available via output file on completion or cancellation | Individual results streamed as they complete |
| Token Limits | Separate batch queue token limit (varies by tier) | Shared with real-time token limits |
| Expiration | Uncompleted batches expire after 24 hours | Results available for 29 days after batch creation |
A significant architectural difference is how results are delivered. OpenAI writes all results to a single output file that you download after the batch completes. Anthropic streams individual results as they finish, which means you can begin processing partial results before the entire batch is done. This makes Anthropic's system better suited for pipelines where early results unlock downstream work.
For cost optimization, the discount is identical — 50% on both platforms. The choice between providers for batch work typically comes down to model preference, existing infrastructure, and whether the 24-hour SLA matters for your workflow.
Implementing Batch Requests
Both providers require you to restructure your code slightly to use batch endpoints instead of real-time endpoints. Below are working examples for both.
OpenAI Batch API Implementation
import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI();

// `prompts` is an array of prompt strings prepared by your pipeline
// Step 1: Prepare JSONL input file
const requests = prompts.map((prompt, i) => ({
  custom_id: `req-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 1000
  }
}));
fs.writeFileSync('batch_input.jsonl',
  requests.map(r => JSON.stringify(r)).join('\n')
);

// Step 2: Upload file
const file = await client.files.create({
  file: fs.createReadStream('batch_input.jsonl'),
  purpose: 'batch'
});

// Step 3: Create batch
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});

// Step 4: Poll until the batch reaches a terminal state
const terminalStates = ['completed', 'failed', 'expired', 'cancelled'];
let status = await client.batches.retrieve(batch.id);
while (!terminalStates.includes(status.status)) {
  await new Promise(r => setTimeout(r, 60000));
  status = await client.batches.retrieve(batch.id);
}

// Step 5: Download results (output_file_id is set once results exist)
const output = await client.files.content(status.output_file_id);
const text = await output.text();
const results = text.split('\n').filter(Boolean).map(JSON.parse);

Anthropic Message Batches Implementation
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Step 1: Create batch with inline requests
const batch = await client.messages.batches.create({
  requests: prompts.map((prompt, i) => ({
    custom_id: `req-${i}`,
    params: {
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }]
    }
  }))
});

// Step 2: Poll for completion
let status = await client.messages.batches.retrieve(batch.id);
while (status.processing_status !== 'ended') {
  await new Promise(r => setTimeout(r, 30000));
  status = await client.messages.batches.retrieve(batch.id);
}

// Step 3: Stream results as they become available
const results = [];
for await (const result of client.messages.batches.results(batch.id)) {
  results.push(result);
}

The key difference in implementation is that OpenAI requires a file upload step, while Anthropic accepts request arrays directly in the API call. For very large batches (10,000+ requests), OpenAI's file-based approach can be more reliable because it separates upload from processing, but Anthropic's inline approach is simpler for smaller batches.
Ideal Use Cases for Batch Processing
Not every workload can tolerate the latency of batch processing, but many high-volume workloads can. The following categories consistently deliver the highest ROI from batch migration:
- Evaluation suites (evals) — Running model evaluations across hundreds or thousands of test cases is a perfect batch workload. Eval suites are typically run nightly or on pull requests, so the 24-hour window is irrelevant. A team running 5,000 eval cases per day at $0.005 per case saves $12.50 daily — $375 per month — just by switching to batch.
- Data labeling and annotation — Using LLMs to label training data, tag content, or annotate documents is inherently asynchronous. A labeling pipeline processing 100,000 documents per month at an average cost of $0.02 per document saves $1,000 monthly with batch pricing.
- Classification and categorization — Spam detection, content moderation, sentiment analysis, and ticket routing are all high-volume, non-interactive tasks. These workloads often process thousands of items per hour and benefit enormously from the 50% discount.
- Content generation — Generating product descriptions, SEO meta tags, email variants, or marketing copy in bulk is a natural fit. A marketing team generating 10,000 product descriptions per month at $0.03 each saves $150 monthly.
- Synthetic data generation — Creating training data, test fixtures, or simulation data using LLMs is entirely offline. Batch processing halves the cost of generating large synthetic datasets.
- Document summarization — Processing backlogs of documents, research papers, or support tickets for summarization can run overnight without impacting user experience.
The common thread is that none of these workloads require sub-second response times. If the consumer of the output is another system, a scheduled job, or a human reviewing results the next morning, batch processing is almost always the right choice.
Error Handling and Retry Logic
Batch processing introduces failure modes that do not exist in real-time API calls. Individual requests within a batch can fail while others succeed, and the entire batch can expire if it does not complete within the provider's window.
Individual request failures: Both OpenAI and Anthropic report per-request success or failure in the output. OpenAI includes an error field in the output JSONL for failed requests. Anthropic returns a result.type of errored with an error object. Your processing pipeline must check each result individually and collect failed request IDs for retry.
Batch-level failures: An entire batch can fail if the input is malformed, the file upload is corrupted, or a systemic provider issue occurs. Always validate your input file format before submission and implement monitoring that alerts when a batch status transitions to failed or expired.
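A pre-submission check catches most malformed-input failures before you pay for an upload. The sketch below validates the OpenAI JSONL shape described above; the validateBatchInput helper is hypothetical, not part of either SDK:

```javascript
// Sketch: validate a JSONL batch input string before uploading.
// Checks that each line parses as JSON and carries the fields the
// OpenAI Batch API expects (custom_id, method, url, body).
function validateBatchInput(jsonl) {
  const errors = [];
  const seen = new Set();
  jsonl.split('\n').forEach((line, i) => {
    if (!line.trim()) return; // tolerate blank lines
    let req;
    try {
      req = JSON.parse(line);
    } catch {
      errors.push(`line ${i + 1}: not valid JSON`);
      return;
    }
    if (!req.custom_id) errors.push(`line ${i + 1}: missing custom_id`);
    else if (seen.has(req.custom_id)) errors.push(`line ${i + 1}: duplicate custom_id`);
    seen.add(req.custom_id);
    if (req.method !== 'POST') errors.push(`line ${i + 1}: method must be POST`);
    if (!req.url || !req.body) errors.push(`line ${i + 1}: missing url or body`);
  });
  return errors;
}
```

Run it on the file contents before the upload step and abort submission if the returned error list is non-empty.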
Retry strategy:
// Collect failed requests and resubmit as a new batch.
// `submitBatch` is your own helper wrapping the submission steps above.
const failedRequests = results
  .filter(r => r.error || r.result?.type === 'errored')
  .map(r => originalRequests.find(req => req.custom_id === r.custom_id))
  .filter(Boolean);

if (failedRequests.length > 0) {
  console.log(`Retrying ${failedRequests.length} failed requests`);
  // Resubmit as a new batch — do NOT resubmit the entire original batch
  const retryBatch = await submitBatch(failedRequests);
}

Idempotency: Use the custom_id field to implement idempotent processing. If a batch partially completes and you need to retry, the custom IDs let you identify which requests have already been processed and avoid duplicate work. Store processed custom IDs in your database and skip them on retry.
Timeout handling: For OpenAI, if a batch does not complete within 24 hours, it transitions to expired; any completed results are still available in the output file. Anthropic similarly expires requests that have not finished within 24 hours, but because typical turnaround is much faster, it is worth enforcing your own tighter timeout (e.g., 6 hours), after which you cancel the batch and resubmit the outstanding requests.
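The timeout decision can be isolated from the polling loop so it is easy to test. The sketch below is a hypothetical helper (the function name and action strings are assumptions) that you would call on each polling iteration against Anthropic's processing_status field:

```javascript
// Sketch: decide the next polling action for a batch, given its current
// processing status, the current clock, and a self-imposed deadline.
function nextAction(processingStatus, nowMs, deadlineMs) {
  if (processingStatus === 'ended') return 'collect_results';
  if (nowMs > deadlineMs) return 'cancel_and_resubmit';
  return 'wait';
}
```

Wire it into the polling loop from the implementation section: on 'cancel_and_resubmit', cancel the batch and feed the outstanding requests back through your retry path.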
Batch API Cost Savings Calculator
To quantify the savings from migrating to batch processing, use this calculation framework with real pricing numbers.
Example: GPT-4o evaluation pipeline
| Metric | Real-Time | Batch API |
|---|---|---|
| Model | GPT-4o | GPT-4o |
| Input price (per 1M tokens) | $2.50 | $1.25 |
| Output price (per 1M tokens) | $10.00 | $5.00 |
| Daily eval requests | 5,000 | 5,000 |
| Avg input tokens per request | 800 | 800 |
| Avg output tokens per request | 200 | 200 |
| Daily input tokens | 4,000,000 | 4,000,000 |
| Daily output tokens | 1,000,000 | 1,000,000 |
| Daily input cost | $10.00 | $5.00 |
| Daily output cost | $10.00 | $5.00 |
| Daily total | $20.00 | $10.00 |
| Monthly total (30 days) | $600.00 | $300.00 |
| Annual total | $7,200.00 | $3,600.00 |
In this example, the team saves $3,600 per year on a single evaluation pipeline by migrating to batch processing. The output is identical — same model, same prompts, same quality — the only difference is that results arrive within 24 hours instead of seconds.
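The table above can be reproduced with a small cost function. This is a sketch that assumes batch pricing is a flat 50% of the real-time rate, as both providers currently advertise; the dailyCost helper is illustrative, not a library API:

```javascript
// Sketch: daily cost of a workload at per-million-token prices.
// `batch: true` applies the assumed flat 50% batch discount.
function dailyCost({ requests, inTokens, outTokens, inPrice, outPrice, batch = false }) {
  const discount = batch ? 0.5 : 1;
  const inputCost = (requests * inTokens / 1e6) * inPrice * discount;
  const outputCost = (requests * outTokens / 1e6) * outPrice * discount;
  return inputCost + outputCost;
}

// GPT-4o eval pipeline from the table above
const evalRun = { requests: 5000, inTokens: 800, outTokens: 200, inPrice: 2.5, outPrice: 10 };
const realTime = dailyCost(evalRun);                    // $20.00 per day
const batched = dailyCost({ ...evalRun, batch: true }); // $10.00 per day
const annualSavings = (realTime - batched) * 360;       // $3,600 (12 x 30-day months)
```

Swapping in your own request volumes and token averages gives a quick estimate of what a batch migration is worth before you commit to the engineering work.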
Example: Claude Sonnet 4 content generation pipeline
| Metric | Real-Time | Batch |
|---|---|---|
| Model | Claude Sonnet 4 | Claude Sonnet 4 |
| Input price (per 1M tokens) | $3.00 | $1.50 |
| Output price (per 1M tokens) | $15.00 | $7.50 |
| Monthly requests | 50,000 | 50,000 |
| Avg input tokens per request | 1,200 | 1,200 |
| Avg output tokens per request | 500 | 500 |
| Monthly input cost | $180.00 | $90.00 |
| Monthly output cost | $375.00 | $187.50 |
| Monthly total | $555.00 | $277.50 |

Annual savings: $3,330.00.
CostHawk's usage dashboard tags each request as real-time or batch, so you can track your actual savings and identify additional workloads that could benefit from batch migration.
Frequently Asked Questions
How much does the Batch API actually save?
What is the turnaround time for batch requests?
Can I use the Batch API for real-time user-facing features?
What happens if some requests in my batch fail?
Is there a minimum batch size to get the discount?
Do batch requests count against my real-time rate limits?
Can I cancel a batch that is in progress?
How does CostHawk track batch vs. real-time spending?
What models support batch processing?
Should I batch embedding requests too?
Related Terms
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
