Batch API
Asynchronous API endpoints that process large volumes of LLM requests at a 50% discount in exchange for longer turnaround times.
What Is the Batch API?
The Batch API is an alternative interface to the same large language models you already use in production. Instead of sending one request and waiting for an immediate response, you upload a file containing hundreds or thousands of requests, the provider processes them asynchronously, and you download the results when the job completes.
The key characteristics of batch processing are:
- Same models, same quality — Batch requests use identical model weights and configurations as real-time requests. Output quality is unchanged.
- 50% cost reduction — Both OpenAI and Anthropic offer a flat 50% discount on all token costs (input and output) for batch requests.
- Longer turnaround — OpenAI guarantees completion within 24 hours. Anthropic's Message Batches typically complete in minutes to hours, but Anthropic does not publish a formal SLA.
- Higher throughput limits — Batch endpoints often have separate, higher rate limits than real-time endpoints. OpenAI's Batch API allows up to 50,000 requests per batch and a separate token quota that does not count against your real-time limits.
Batch processing has been a staple of cloud computing for decades. AI providers adopted the pattern because GPU utilization is rarely 100% — batch jobs fill the gaps in compute demand, creating a win-win: providers improve utilization and customers pay less.
OpenAI vs Anthropic Batch Processing
Both major providers offer batch processing, but the implementations differ in important ways. The following table compares the two systems as of early 2026:
| Feature | OpenAI Batch API | Anthropic Message Batches |
|---|---|---|
| Discount | 50% off standard pricing | 50% off standard pricing |
| SLA / Turnaround | 24-hour completion guarantee | No formal SLA; typically minutes to hours |
| Max Requests per Batch | 50,000 requests | 100,000 requests |
| Input Format | JSONL file upload with custom_id per request | JSON array of request objects via API |
| Supported Models | GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o1, o3-mini | Claude Sonnet 4, Claude Haiku 3.5, Claude Opus 4 |
| Status Polling | GET /batches/{id} endpoint | GET /messages/batches/{id} endpoint |
| Partial Results | Available via output file on completion or cancellation | Individual results streamed as they complete |
| Token Limits | Separate batch queue token limit (varies by tier) | Shared with real-time token limits |
| Expiration | Uncompleted batches expire after 24 hours | Results available for 29 days after batch creation |
A significant architectural difference is how results are delivered. OpenAI writes all results to a single output file that you download after the batch completes. Anthropic streams individual results as they finish, which means you can begin processing partial results before the entire batch is done. This makes Anthropic's system better suited for pipelines where early results unlock downstream work.
For cost optimization, the discount is identical — 50% on both platforms. The choice between providers for batch work typically comes down to model preference, existing infrastructure, and whether the 24-hour SLA matters for your workflow.
Implementing Batch Requests
Both providers require you to restructure your code slightly to use batch endpoints instead of real-time endpoints. Below are working examples for both.
OpenAI Batch API Implementation
import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI();

// `prompts` is an array of prompt strings prepared by your pipeline
// Step 1: Prepare JSONL input file
const requests = prompts.map((prompt, i) => ({
  custom_id: `req-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 1000
  }
}));
fs.writeFileSync('batch_input.jsonl',
  requests.map(r => JSON.stringify(r)).join('\n')
);

// Step 2: Upload file
const file = await client.files.create({
  file: fs.createReadStream('batch_input.jsonl'),
  purpose: 'batch'
});

// Step 3: Create batch
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});

// Step 4: Poll until the batch reaches a terminal state
const terminalStates = ['completed', 'failed', 'expired', 'cancelled'];
let status = await client.batches.retrieve(batch.id);
while (!terminalStates.includes(status.status)) {
  await new Promise(r => setTimeout(r, 60000));
  status = await client.batches.retrieve(batch.id);
}

// Step 5: Download results (output_file_id is set once results exist)
const output = await client.files.content(status.output_file_id);
const text = await output.text();
const results = text.split('\n').filter(Boolean).map(JSON.parse);

Anthropic Message Batches Implementation
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Step 1: Create batch with inline requests
const batch = await client.messages.batches.create({
  requests: prompts.map((prompt, i) => ({
    custom_id: `req-${i}`,
    params: {
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }]
    }
  }))
});

// Step 2: Poll for completion
let status = await client.messages.batches.retrieve(batch.id);
while (status.processing_status !== 'ended') {
  await new Promise(r => setTimeout(r, 30000));
  status = await client.messages.batches.retrieve(batch.id);
}

// Step 3: Stream results as they become available
const results = [];
for await (const result of client.messages.batches.results(batch.id)) {
  results.push(result);
}

The key difference in implementation is that OpenAI requires a file upload step, while Anthropic accepts request arrays directly in the API call. For very large batches (10,000+ requests), OpenAI's file-based approach can be more reliable because it separates upload from processing, but Anthropic's inline approach is simpler for smaller batches.
Ideal Use Cases for Batch Processing
Not every workload can tolerate the latency of batch processing, but many high-volume workloads can. The following categories consistently deliver the highest ROI from batch migration:
- Evaluation suites (evals) — Running model evaluations across hundreds or thousands of test cases is a perfect batch workload. Eval suites are typically run nightly or on pull requests, so the 24-hour window is irrelevant. A team running 5,000 eval cases per day at $0.005 per case saves $12.50 daily — $375 per month — just by switching to batch.
- Data labeling and annotation — Using LLMs to label training data, tag content, or annotate documents is inherently asynchronous. A labeling pipeline processing 100,000 documents per month at an average cost of $0.02 per document saves $1,000 monthly with batch pricing.
- Classification and categorization — Spam detection, content moderation, sentiment analysis, and ticket routing are all high-volume, non-interactive tasks. These workloads often process thousands of items per hour and benefit enormously from the 50% discount.
- Content generation — Generating product descriptions, SEO meta tags, email variants, or marketing copy in bulk is a natural fit. A marketing team generating 10,000 product descriptions per month at $0.03 each saves $150 monthly.
- Synthetic data generation — Creating training data, test fixtures, or simulation data using LLMs is entirely offline. Batch processing halves the cost of generating large synthetic datasets.
- Document summarization — Processing backlogs of documents, research papers, or support tickets for summarization can run overnight without impacting user experience.
The common thread is that none of these workloads require sub-second response times. If the consumer of the output is another system, a scheduled job, or a human reviewing results the next morning, batch processing is almost always the right choice.
Error Handling and Retry Logic
Batch processing introduces failure modes that do not exist in real-time API calls. Individual requests within a batch can fail while others succeed, and the entire batch can expire if it does not complete within the provider's window.
Individual request failures: Both OpenAI and Anthropic report per-request success or failure in the output. OpenAI includes an error field in the output JSONL for failed requests. Anthropic returns a result.type of errored with an error object. Your processing pipeline must check each result individually and collect failed request IDs for retry.
Batch-level failures: An entire batch can fail if the input is malformed, the file upload is corrupted, or a systemic provider issue occurs. Always validate your input file format before submission and implement monitoring that alerts when a batch status transitions to failed or expired.
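A pre-submission check catches most malformed-input failures before you pay for an upload. The sketch below validates the OpenAI JSONL shape described above; the validateBatchInput helper is hypothetical, not part of either SDK:

```javascript
// Sketch: validate a JSONL batch input string before uploading.
// Checks that each line parses as JSON and carries the fields the
// OpenAI Batch API expects (custom_id, method, url, body).
function validateBatchInput(jsonl) {
  const errors = [];
  const seen = new Set();
  jsonl.split('\n').forEach((line, i) => {
    if (!line.trim()) return; // tolerate blank lines
    let req;
    try {
      req = JSON.parse(line);
    } catch {
      errors.push(`line ${i + 1}: not valid JSON`);
      return;
    }
    if (!req.custom_id) errors.push(`line ${i + 1}: missing custom_id`);
    else if (seen.has(req.custom_id)) errors.push(`line ${i + 1}: duplicate custom_id`);
    seen.add(req.custom_id);
    if (req.method !== 'POST') errors.push(`line ${i + 1}: method must be POST`);
    if (!req.url || !req.body) errors.push(`line ${i + 1}: missing url or body`);
  });
  return errors;
}
```

Run it on the file contents before the upload step and abort submission if the returned error list is non-empty.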
Retry strategy:
// Collect failed requests and resubmit as a new batch.
// `submitBatch` is your own helper wrapping the submission steps above.
const failedRequests = results
  .filter(r => r.error || r.result?.type === 'errored')
  .map(r => originalRequests.find(req => req.custom_id === r.custom_id))
  .filter(Boolean);

if (failedRequests.length > 0) {
  console.log(`Retrying ${failedRequests.length} failed requests`);
  // Resubmit as a new batch — do NOT resubmit the entire original batch
  const retryBatch = await submitBatch(failedRequests);
}

Idempotency: Use the custom_id field to implement idempotent processing. If a batch partially completes and you need to retry, the custom IDs let you identify which requests have already been processed and avoid duplicate work. Store processed custom IDs in your database and skip them on retry.
Timeout handling: For OpenAI, if a batch does not complete within 24 hours, it transitions to expired; any completed results are still available in the output file. Anthropic similarly expires requests that have not finished within 24 hours, but because typical turnaround is much faster, it is worth enforcing your own tighter timeout (e.g., 6 hours), after which you cancel the batch and resubmit the outstanding requests.
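The timeout decision can be isolated from the polling loop so it is easy to test. The sketch below is a hypothetical helper (the function name and action strings are assumptions) that you would call on each polling iteration against Anthropic's processing_status field:

```javascript
// Sketch: decide the next polling action for a batch, given its current
// processing status, the current clock, and a self-imposed deadline.
function nextAction(processingStatus, nowMs, deadlineMs) {
  if (processingStatus === 'ended') return 'collect_results';
  if (nowMs > deadlineMs) return 'cancel_and_resubmit';
  return 'wait';
}
```

Wire it into the polling loop from the implementation section: on 'cancel_and_resubmit', cancel the batch and feed the outstanding requests back through your retry path.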
Batch API Cost Savings Calculator
To quantify the savings from migrating to batch processing, use this calculation framework with real pricing numbers.
Example: GPT-4o evaluation pipeline
| Metric | Real-Time | Batch API |
|---|---|---|
| Model | GPT-4o | GPT-4o |
| Input price (per 1M tokens) | $2.50 | $1.25 |
| Output price (per 1M tokens) | $10.00 | $5.00 |
| Daily eval requests | 5,000 | 5,000 |
| Avg input tokens per request | 800 | 800 |
| Avg output tokens per request | 200 | 200 |
| Daily input tokens | 4,000,000 | 4,000,000 |
| Daily output tokens | 1,000,000 | 1,000,000 |
| Daily input cost | $10.00 | $5.00 |
| Daily output cost | $10.00 | $5.00 |
| Daily total | $20.00 | $10.00 |
| Monthly total (30 days) | $600.00 | $300.00 |
| Annual total | $7,200.00 | $3,600.00 |
In this example, the team saves $3,600 per year on a single evaluation pipeline by migrating to batch processing. The output is identical — same model, same prompts, same quality — the only difference is that results arrive within 24 hours instead of seconds.
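The table above can be reproduced with a small cost function. This is a sketch that assumes batch pricing is a flat 50% of the real-time rate, as both providers currently advertise; the dailyCost helper is illustrative, not a library API:

```javascript
// Sketch: daily cost of a workload at per-million-token prices.
// `batch: true` applies the assumed flat 50% batch discount.
function dailyCost({ requests, inTokens, outTokens, inPrice, outPrice, batch = false }) {
  const discount = batch ? 0.5 : 1;
  const inputCost = (requests * inTokens / 1e6) * inPrice * discount;
  const outputCost = (requests * outTokens / 1e6) * outPrice * discount;
  return inputCost + outputCost;
}

// GPT-4o eval pipeline from the table above
const evalRun = { requests: 5000, inTokens: 800, outTokens: 200, inPrice: 2.5, outPrice: 10 };
const realTime = dailyCost(evalRun);                    // $20.00 per day
const batched = dailyCost({ ...evalRun, batch: true }); // $10.00 per day
const annualSavings = (realTime - batched) * 360;       // $3,600 (12 x 30-day months)
```

Swapping in your own request volumes and token averages gives a quick estimate of what a batch migration is worth before you commit to the engineering work.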
Example: Claude Sonnet 4 content generation pipeline
| Metric | Real-Time | Batch |
|---|---|---|
| Model | Claude Sonnet 4 | Claude Sonnet 4 |
| Input price (per 1M tokens) | $3.00 | $1.50 |
| Output price (per 1M tokens) | $15.00 | $7.50 |
| Monthly requests | 50,000 | 50,000 |
| Avg input tokens per request | 1,200 | 1,200 |
| Avg output tokens per request | 500 | 500 |
| Monthly input cost | $180.00 | $90.00 |
| Monthly output cost | $375.00 | $187.50 |
| Monthly total | $555.00 | $277.50 |

Annual savings: $3,330.00.
CostHawk's usage dashboard tags each request as real-time or batch, so you can track your actual savings and identify additional workloads that could benefit from batch migration.
Frequently Asked Questions
How much does the Batch API actually save?
What is the turnaround time for batch requests?
Can I use the Batch API for real-time user-facing features?
What happens if some requests in my batch fail?
Is there a minimum batch size to get the discount?
Do batch requests count against my real-time rate limits?
Can I cancel a batch that is in progress?
How does CostHawk track batch vs. real-time spending?
What models support batch processing?
Should I batch embedding requests too?
Related Terms
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Pay-Per-Token
The dominant usage-based pricing model for AI APIs where you pay only for the tokens you consume, with no upfront commitment or monthly minimum.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Cost Per Token
The unit price an AI provider charges for processing a single token, quoted per million tokens. Ranges from $0.075/1M for budget models to $75.00/1M for frontier reasoning models — a 1,000x spread.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
