Multi-Modal Model
An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.
Definition
What Is a Multi-Modal Model?
Impact
Why It Matters for AI Costs
Multi-modal models represent both the most powerful and the most expensive pattern in AI API consumption. Understanding their pricing is critical because the cost differences between modalities are dramatic and often surprising:
The modality cost multiplier: Processing an image costs 10–100x more than processing an equivalent amount of textual information. A product description in text might be 200 tokens ($0.0005 on GPT-4o). A photo of that same product might be 1,100 tokens ($0.00275 on GPT-4o) — 5.5x more expensive for potentially the same semantic content. At scale, this difference is enormous: 100,000 image analysis requests per day at 1,100 tokens each means 110 million input tokens daily, costing $275/day or $8,250/month on GPT-4o — compared to $50/day (20 million input tokens) if the same information were available as text.
| Modality | Typical Token Count | Cost per Unit (GPT-4o) | Cost per Unit (Claude 3.5 Sonnet) | Cost per Unit (Gemini 2.0 Flash) |
|---|---|---|---|---|
| Text (500 words) | ~650 tokens | $0.0016 | $0.0020 | $0.0001 |
| Image (512x512, low detail) | 85 tokens (GPT-4o) | $0.0002 | $0.0011 | $0.00003 |
| Image (1024x1024, high detail) | 765 tokens (GPT-4o) | $0.0019 | $0.0042 | $0.00003 |
| Image (1920x1080, high detail) | 1,105 tokens (GPT-4o) | $0.0028 | $0.0083 | $0.00003 |
| Audio (30 seconds) | ~3,000 tokens (GPT-4o) | $0.0075 | N/A | $0.0001 |
| Video (10 seconds, 1fps) | ~2,900 tokens (Gemini) | N/A | N/A | $0.0003 |
Token counts vary by provider; each cost column applies that provider's own token formula (Claude counts roughly pixels/750 per image, Gemini charges a flat 258 tokens per image) at its per-token rate.
These costs compound rapidly in production. A document processing pipeline that analyzes 10-page PDFs as images (10 pages × 1,105 tokens each) consumes 11,050 tokens per document — costing $0.028 per document on GPT-4o. At 10,000 documents per day, that is $280/day or $8,400/month just for image input. If the same documents can be processed as extracted text (averaging 3,000 tokens per document), the cost drops to $0.0075 per document — $75/day or $2,250/month. Understanding when to use image input versus text extraction can save 70%+ on document processing workloads.
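The arithmetic above can be sketched as a quick cost model. The token counts and the $2.50/MTok GPT-4o input rate are this article's illustrative figures, not live prices:

```python
# Back-of-envelope daily cost for the document pipeline described above.
GPT4O_INPUT_PER_MTOK = 2.50  # USD per million input tokens (article's figure)

def daily_cost(docs_per_day: int, tokens_per_doc: int,
               rate_per_mtok: float = GPT4O_INPUT_PER_MTOK) -> float:
    """Input-token cost per day in USD."""
    return docs_per_day * tokens_per_doc * rate_per_mtok / 1_000_000

image_cost = daily_cost(10_000, 10 * 1105)  # 10 pages sent as images
text_cost = daily_cost(10_000, 3_000)       # same docs as extracted text

print(f"image: ${image_cost:.2f}/day, text: ${text_cost:.2f}/day")
print(f"savings: {1 - text_cost / image_cost:.0%}")
```

Running this reproduces the roughly 73% savings the text quotes as "70%+".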
CostHawk breaks down token consumption by modality, showing you exactly how much of your spend comes from text versus image versus audio inputs, and identifying opportunities to shift from expensive modalities to cheaper alternatives.
What Are Multi-Modal Models?
Multi-modal models extend the transformer architecture to process data types beyond text. While the underlying self-attention mechanism is the same, each modality requires a specialized encoder to convert raw data into the token-like representations the transformer can process:
Vision encoders: Images are processed through a Vision Transformer (ViT) or similar architecture that divides the image into fixed-size patches (typically 14x14 or 16x16 pixels), projects each patch into the model's embedding space, and adds positional encoding so the model knows the spatial relationship between patches. A 224x224 image divided into 14x14 patches produces 256 image tokens; a 1024x1024 image divided into 16x16 patches produces 4,096 image tokens (though most APIs resize and tile images to manage token counts). These image tokens are then concatenated with text tokens and processed through the same transformer layers.
Audio encoders: Audio is typically processed through a Whisper-like encoder that converts raw waveforms into mel-spectrogram representations, then encodes these into audio tokens at a rate of approximately 50–100 tokens per second of audio. GPT-4o's native audio mode processes audio tokens directly, while other models may transcribe audio to text first (which is cheaper but loses tonal and prosodic information).
Video processing: Video is the most token-expensive modality. Models like Gemini process video by sampling frames at a configurable rate (e.g., 1 frame per second), encoding each frame as image tokens, and concatenating them with any audio track tokens. A 60-second video at 1fps produces approximately 60 × 258 = 15,480 image tokens plus ~1,920 audio tokens — roughly 17,400 tokens for a single minute of video. At Gemini 2.0 Flash rates, this costs approximately $0.0017 per minute; at GPT-4o rates (if video were supported), it would cost approximately $0.04 per minute.
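As a sketch of this estimate, assuming Gemini-style fixed frame tokens (258 per frame) and roughly 32 audio tokens per second — both figures taken from this article, not from live documentation:

```python
# Rough video token and cost estimate for a Gemini-style encoder.
FRAME_TOKENS = 258          # fixed tokens per sampled frame (assumed)
AUDIO_TOKENS_PER_SEC = 32   # approximate audio token rate (assumed)

def video_tokens(duration_sec: float, fps: float = 1.0) -> int:
    frames = int(duration_sec * fps)
    return frames * FRAME_TOKENS + int(duration_sec * AUDIO_TOKENS_PER_SEC)

def video_cost(duration_sec: float, fps: float = 1.0,
               rate_per_mtok: float = 0.10) -> float:
    """Input cost in USD at a Gemini 2.0 Flash-like rate."""
    return video_tokens(duration_sec, fps) * rate_per_mtok / 1_000_000

print(video_tokens(60))          # 15,480 image + 1,920 audio tokens
print(f"${video_cost(60):.4f}")  # cost per minute of video
```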
Cross-modal attention: The key architectural feature of multi-modal models is that attention operates across all modalities simultaneously. When analyzing an image with a text question, each text token can attend to all image tokens and vice versa. This cross-modal attention is what enables capabilities like visual question answering ("What color is the car in the image?"), document understanding ("Extract the table data from this screenshot"), and multimodal reasoning ("Based on this chart and the accompanying text, what is the trend?"). The compute cost of this cross-modal attention follows the same quadratic scaling as text-only attention — the total token count (text + image + audio) determines the attention cost.
Multi-Modal Pricing by Modality
Each modality has a different token conversion formula and effective cost per unit of information. Understanding these pricing structures is essential for estimating and optimizing multi-modal costs:
OpenAI GPT-4o image pricing: GPT-4o processes images at two detail levels. Low detail always costs 85 tokens regardless of image size — the image is resized to fit within a 512x512 box and encoded as a single tile. High detail first scales the image to fit within a 2048x2048 box, scales the shortest side down to at most 768 pixels, then divides the result into 512x512 tiles, with each tile costing 170 tokens plus a base cost of 85 tokens. The formula is: tokens = 170 × number_of_tiles + 85. Examples:
| Image Resolution | Detail | Tiles | Tokens | Cost (at $2.50/MTok) |
|---|---|---|---|---|
| Any size | Low | 1 | 85 | $0.000213 |
| 512x512 | High | 1 | 255 | $0.000638 |
| 1024x1024 | High | 4 | 765 | $0.001913 |
| 2048x1024 | High | 6 | 1,105 | $0.002763 |
| 2048x2048 | High | 4 | 765 | $0.001913 |
Anthropic Claude 3.5 image pricing: Claude calculates image tokens based on pixel count. The formula is approximately tokens = (width × height) / 750, with a minimum of 100 tokens. A 1024x1024 image costs approximately 1,398 tokens ($4.19/1K images at $3.00/MTok input). Claude supports images up to 1568x1568 pixels in a single message; larger images are automatically resized. Claude does not offer a "low detail" option — all images are processed at full resolution.
Google Gemini image pricing: Gemini encodes each image as exactly 258 tokens, regardless of resolution (images are resized internally). At Gemini 2.0 Flash's $0.10/MTok rate, each image costs just $0.0000258 — for a 1024x1024 high-detail image, roughly 74x cheaper than GPT-4o and 160x cheaper than Claude. This extreme price advantage makes Gemini the cost leader for image-heavy workloads by a wide margin.
Audio pricing: GPT-4o's audio mode processes audio at approximately 100 tokens per second. Gemini processes audio at approximately 32 tokens per second. A 60-second audio clip costs approximately $0.015 on GPT-4o and $0.0002 on Gemini 2.0 Flash. For audio-heavy workloads, the 75x cost difference between providers is the most important factor in model selection.
Cross-provider cost comparison for a typical multimodal request (1 image at 1024x1024 + 200 text tokens input + 300 text tokens output):
| Model | Image Tokens | Total Input Tokens | Total Cost |
|---|---|---|---|
| GPT-4o | 765 | 965 | $0.00541 |
| Claude 3.5 Sonnet | 1,398 | 1,598 | $0.00929 |
| Gemini 2.0 Flash | 258 | 458 | $0.00017 |
Gemini is 32x cheaper than GPT-4o and 55x cheaper than Claude for this specific multimodal request. For image-heavy workloads, provider selection alone can reduce costs by 95%+.
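The comparison above can be reproduced with a small estimator. The per-image token counts and per-MTok rates are the article's figures and may drift from current pricing:

```python
# Cost of the example multimodal request: one 1024x1024 image plus
# 200 input and 300 output text tokens, across three providers.
PRICING = {  # model: (input $/MTok, output $/MTok, image tokens @ 1024x1024)
    "gpt-4o":            (2.50, 10.00, 765),
    "claude-3.5-sonnet": (3.00, 15.00, 1398),
    "gemini-2.0-flash":  (0.10,  0.40, 258),
}

def request_cost(model: str, text_in: int = 200, text_out: int = 300) -> float:
    """Total USD cost for the request on the given model."""
    in_rate, out_rate, image_tokens = PRICING[model]
    return ((image_tokens + text_in) * in_rate + text_out * out_rate) / 1_000_000

for model in PRICING:
    print(f"{model}: ${request_cost(model):.5f}")
```

Dividing the GPT-4o and Claude results by the Gemini result recovers the ~32x and ~55x ratios quoted above.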
Image Token Calculations
Understanding exactly how images convert to tokens is essential for accurate cost estimation in multi-modal applications. The conversion formulas differ by provider and can produce surprisingly large token counts for high-resolution images.
OpenAI's tiling system (GPT-4o): For high-detail images, OpenAI uses a tiling algorithm:
- Scale the image so the longest side fits within 2048 pixels (maintaining aspect ratio)
- Scale the shortest side to fit within 768 pixels (maintaining aspect ratio)
- Divide the resulting image into 512x512 tiles, counting partial tiles as full tiles
- Token count = (number of tiles × 170) + 85
Practical examples:
| Original Size | After Scaling | Tiles (H×W) | Total Tiles | Tokens |
|---|---|---|---|---|
| 400x300 | 400x300 | 1×1 | 1 | 255 |
| 800x600 | 800x600 | 2×2 | 4 | 765 |
| 1920x1080 | 1365x768 | 2×3 | 6 | 1,105 |
| 3000x2000 | 1152x768 | 2×3 | 6 | 1,105 |
| 4000x4000 | 768x768 | 2×2 | 4 | 765 |
Key insight: images larger than 2048 pixels in any dimension are scaled down before tiling, so sending a 4000x4000 image does not cost more than sending a 768x768 image. However, sending a 512x512 image costs 255 tokens, while low-detail mode costs only 85 tokens — a 3x savings for use cases where low detail suffices.
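The tiling rules above can be sketched as a token calculator. This is an approximation of OpenAI's documented algorithm, not an official implementation, and may lag behind pricing changes:

```python
import math

def gpt4o_image_tokens(width: int, height: int, low_detail: bool = False) -> int:
    """Approximate GPT-4o image token count using the tiling rules above."""
    if low_detail:
        return 85
    # 1) Scale so the longest side fits within 2048 pixels (only downscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2) Scale so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) Count 512x512 tiles; partial tiles count as full tiles.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

for size in [(400, 300), (800, 600), (1920, 1080), (3000, 2000), (4000, 4000)]:
    print(size, gpt4o_image_tokens(*size))
```

Running this reproduces the examples table: 255, 765, 1,105, 1,105, and 765 tokens respectively.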
Anthropic's pixel-based calculation (Claude): Claude uses a simpler formula based on total pixel count: tokens ≈ (width × height) / 750. This means:
| Resolution | Pixels | Approximate Tokens |
|---|---|---|
| 256x256 | 65,536 | ~87 |
| 512x512 | 262,144 | ~350 |
| 1024x1024 | 1,048,576 | ~1,398 |
| 1568x1568 (max) | 2,458,624 | ~3,278 |
Google's fixed-token approach (Gemini): Gemini standardizes all images to 258 tokens regardless of input resolution. The image is internally resized and encoded at a fixed representation size. This makes cost estimation trivial for Gemini — every image costs the same.
Optimization strategies for image tokens:
- Use low-detail mode when possible. For images where you only need rough content understanding ("Is this an image of a cat or a dog?"), low-detail mode at 85 tokens is 3–16x cheaper than high-detail mode. GPT-4o's low-detail mode is sufficient for classification, content moderation, and basic scene description.
- Resize before sending. Sending a 4000x4000 DSLR photo when a 512x512 version contains all the information the model needs wastes bandwidth and may produce more tokens depending on the provider's scaling algorithm.
- Crop to the region of interest. If you only need to analyze a specific area of an image, crop to that area before sending. A 200x200 crop costs a fraction of the full image.
- Consider text extraction. For documents, receipts, and screenshots, OCR (optical character recognition) followed by text analysis is often cheaper than sending the image directly. A page of text as an image might be 1,000+ tokens; the same text extracted via OCR might be 200–400 text tokens.
Audio and Video Processing Costs
Audio and video are the most token-intensive modalities, and their costs can scale rapidly in production applications. Understanding the pricing structure and optimization opportunities is essential for applications that process spoken content or visual media.
Audio processing: There are two approaches to handling audio in AI applications, each with different cost profiles:
Approach 1: Native audio tokens (GPT-4o audio mode). GPT-4o can process audio natively, encoding audio input at approximately 100 tokens per second. A 60-second audio clip costs 6,000 input tokens ($0.015 at $2.50/MTok). This approach preserves tonal information, speaker identity, emotion, and non-verbal cues — valuable for sentiment analysis, speaker diarization, and audio quality assessment. However, it is expensive for bulk transcription.
Approach 2: Speech-to-text then text processing. Use a dedicated transcription model (OpenAI Whisper at $0.006/minute, or Gemini's audio at ~$0.0002/minute) to convert audio to text, then process the text with a text-only model. A 60-second audio clip transcribes to approximately 150 words (~200 text tokens), costing $0.0005 on GPT-4o for the text processing. Total cost: $0.006 (Whisper) + $0.0005 (text processing) = $0.0065 — compared to $0.015 for native audio processing. The tradeoff: you lose tonal and acoustic information but save 55%+ on processing costs.
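A quick sketch of the two approaches, using the article's assumed figures (Whisper at $0.006/minute, GPT-4o text input at $2.50/MTok, ~100 native audio tokens per second):

```python
# Compare native audio processing vs. transcribe-then-process for a clip.
def native_audio_cost(seconds: float, tokens_per_sec: int = 100,
                      rate_per_mtok: float = 2.50) -> float:
    """GPT-4o-style native audio input cost in USD."""
    return seconds * tokens_per_sec * rate_per_mtok / 1_000_000

def transcribe_then_text_cost(seconds: float, whisper_per_min: float = 0.006,
                              text_tokens: int = 200,
                              rate_per_mtok: float = 2.50) -> float:
    """Whisper transcription plus text-only processing, in USD."""
    return seconds / 60 * whisper_per_min + text_tokens * rate_per_mtok / 1_000_000

print(f"native:     ${native_audio_cost(60):.4f}")
print(f"transcribe: ${transcribe_then_text_cost(60):.4f}")
```

For a 60-second clip this yields $0.0150 versus $0.0065, the 55%+ saving quoted above.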
Video processing: Video is processed as a combination of image frames and audio tokens:
| Video Length | Frame Rate | Image Tokens | Audio Tokens (est.) | Total Tokens | Cost (Gemini 2.0 Flash) |
|---|---|---|---|---|---|
| 10 seconds | 1 fps | 2,580 | 320 | 2,900 | $0.0003 |
| 30 seconds | 1 fps | 7,740 | 960 | 8,700 | $0.0009 |
| 60 seconds | 1 fps | 15,480 | 1,920 | 17,400 | $0.0017 |
| 5 minutes | 1 fps | 77,400 | 9,600 | 87,000 | $0.0087 |
| 30 minutes | 1 fps | 464,400 | 57,600 | 522,000 | $0.0522 |
At 1 fps, a 30-minute video consumes over 500K tokens — approaching the context window limits of most models. Higher frame rates (2fps, 5fps) multiply image token counts proportionally. Currently, only Gemini supports native video input via API; OpenAI and Anthropic require frame extraction and submission as individual images.
Audio/video optimization strategies:
- Minimize frame rate. For video analysis, 1 fps is sufficient for most content understanding tasks. Only increase frame rate when temporal precision matters (action detection, frame-by-frame analysis).
- Use keyframe extraction. Instead of sampling at a fixed rate, extract frames only when the visual content changes significantly. A 5-minute interview with a static camera might only need 10 keyframes instead of 300 frames at 1fps — a 30x token reduction.
- Process audio and video separately. If you need both visual and spoken content, process the audio track through a cheap transcription service and only send keyframes for visual analysis. This avoids double-counting the audio as both native audio tokens and image context.
- Use Gemini for video workloads. Gemini's native video support and low token rates make it 10–50x cheaper than constructing video analysis from individual image frames on GPT-4o or Claude.
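The keyframe idea above can be sketched as pure selection logic. Computing the actual frame-difference scores (e.g., mean pixel deltas from an image library) is assumed to happen elsewhere, and the 0.2 threshold is an arbitrary illustration:

```python
# Keep a frame only once enough visual change has accumulated since the
# last kept frame. diff_scores[i] is the precomputed difference between
# frame i and frame i-1 (assumed input; this function only selects).
def select_keyframes(diff_scores: list[float], threshold: float = 0.2) -> list[int]:
    keyframes = [0]  # always keep the first frame
    accumulated = 0.0
    for i, score in enumerate(diff_scores[1:], start=1):
        accumulated += score
        if accumulated >= threshold:
            keyframes.append(i)
            accumulated = 0.0
    return keyframes

# A mostly static scene with two cuts: 3 of 10 frames survive.
scores = [0.0, 0.01, 0.01, 0.5, 0.02, 0.01, 0.01, 0.6, 0.01, 0.02]
print(select_keyframes(scores))  # [0, 3, 7]
```

Each kept frame then costs its usual image tokens, so a 70% frame reduction translates directly into a 70% image-token reduction.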
Optimizing Multi-Modal Costs
Multi-modal workloads offer some of the largest cost optimization opportunities in AI API usage because the gap between naive and optimized approaches can be 5–20x. Here are the most impactful strategies:
1. Resolution reduction. For many image analysis tasks, the model does not need high-resolution input. Image classification, content moderation, scene description, and basic OCR work well with 512x512 or even 256x256 images. Reducing from high-detail (765+ tokens) to low-detail (85 tokens) on GPT-4o saves 89% of image token costs. Test quality at lower resolutions before defaulting to maximum quality — you will often find no measurable quality difference for your specific task.
2. Image compression and format optimization. Before sending images to the API, compress them to reduce file size without losing visual information that matters for your task. JPEG quality 75 is typically indistinguishable from quality 95 for AI analysis, and WebP format often produces smaller files than JPEG at comparable quality. While token count is determined by pixel dimensions (not file size), smaller files transfer faster and some providers may optimize encoding of lower-quality images.
3. Text extraction before image analysis. For documents, screenshots, receipts, and any image that primarily contains text, running OCR first and sending the extracted text is dramatically cheaper than sending the image. A page of text as an image: 765–1,398 tokens. The same text extracted via Tesseract OCR: 200–400 tokens. For a 10-page document processed 1,000 times per day, this saves:
- Image approach: 10 pages × 1,000 docs × 1,000 tokens = 10M tokens/day = $25/day on GPT-4o
- Text approach: 10 pages × 1,000 docs × 300 tokens = 3M tokens/day = $7.50/day on GPT-4o
- Savings: $17.50/day or $525/month
4. Provider routing by modality. As shown in the pricing comparison tables, Gemini is dramatically cheaper for image and video processing — 10–50x cheaper than GPT-4o or Claude. For image-heavy workloads, routing image analysis to Gemini while keeping text-heavy tasks on your preferred provider can reduce multi-modal costs by 80–95%. A hybrid architecture that uses Gemini 2.0 Flash for image understanding and Claude 3.5 Sonnet for text reasoning combines the best of both models at a fraction of the cost of using either model for everything.
5. Batch processing for non-real-time workloads. If your image or audio analysis does not require real-time results (e.g., nightly batch processing of product images, weekly video content analysis), use batch APIs that offer 50% discounts on per-token rates. OpenAI's batch API and Anthropic's batch processing both apply to multi-modal requests, halving your image token costs.
6. Caching multi-modal results. Image analysis results are highly cacheable because the same image produces the same analysis. If your application processes the same product images, logos, or document templates repeatedly, cache the analysis results and skip the API call entirely. A simple hash-based cache (hash the image bytes, check if analysis exists) can eliminate 30–70% of redundant image processing calls.
When to Use Multi-Modal vs Text-Only
The decision between multi-modal and text-only processing has significant cost implications. Here is a framework for making this decision for common use cases:
Use multi-modal (image input) when:
- Visual information is primary. Tasks like image classification, object detection, visual quality assessment, and art analysis require the model to see the actual image — text descriptions are insufficient.
- Layout and spatial relationships matter. Analyzing charts, diagrams, floor plans, or UI screenshots requires understanding spatial relationships that text descriptions cannot capture accurately.
- Text extraction is unreliable. Handwritten text, stylized fonts, text in images with complex backgrounds, or multilingual documents where OCR accuracy is low benefit from the model's native ability to read text in images.
- Speed of processing matters more than cost. Sending an image directly is faster than running OCR + text processing as a two-step pipeline, even if the latter is cheaper. For real-time applications where latency is critical, the single-step image approach may be worth the extra cost.
Use text-only (extract text first) when:
- The image primarily contains machine-printed text. For standard documents, PDFs, and screenshots of text-heavy pages, OCR produces accurate text at a fraction of the cost of image tokens. Modern OCR (Tesseract, PaddleOCR, cloud OCR services) achieves 99%+ accuracy on clean, printed text.
- You are processing at high volume. At 10,000+ documents per day, the cost difference between image and text processing can be $10,000+/month. The engineering investment in building an OCR pipeline pays for itself within days.
- You need to search or index the content. Extracted text can be stored, searched, and indexed; image analysis results cannot be searched as effectively. If downstream processes need the raw text, extracting it first serves dual purposes.
- The visual layout is irrelevant. If you only need the textual content of a document (not how it is formatted or laid out), text extraction is both cheaper and more reliable than image analysis for extracting facts, names, numbers, and other text-based data.
Hybrid approaches for maximum efficiency:
- OCR + validation with image fallback. Extract text via OCR, process it with a text-only model, and use image analysis only when the OCR confidence is low or the text model's output seems incorrect. This handles 80–90% of documents at text-only cost while falling back to image analysis for the remaining 10–20% that are difficult.
- Text-first, image for tables and charts. For mixed documents, extract and process body text via OCR (cheap) and use image analysis only for tables, charts, and diagrams (where layout matters). This targets image tokens at the content that genuinely needs visual understanding.
- Thumbnail preview + full resolution on demand. For image classification or triage, process a low-resolution thumbnail first (85 tokens on GPT-4o). Only send the full-resolution image if the thumbnail analysis indicates the image requires detailed examination. This reduces average image token cost by 50–80% for workloads where most images are routine.
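The expected savings from this two-tier strategy depend on what fraction of images escalate to full resolution. A quick model, using the GPT-4o token figures above (85 tokens low detail, 765 tokens high detail):

```python
# Expected image tokens per request for thumbnail-first triage: every
# image pays the low-detail cost, and a fraction p_detail also pays the
# high-detail cost on escalation.
def expected_tokens(p_detail: float, thumb_tokens: int = 85,
                    full_tokens: int = 765) -> float:
    return thumb_tokens + p_detail * full_tokens

always_full = 765  # baseline: send every image at high detail
for p in (0.1, 0.3, 0.5):
    avg = expected_tokens(p)
    print(f"p={p}: {avg:.0f} tokens ({1 - avg / always_full:.0%} savings)")
```

With 10–30% of images escalating, average image token cost drops by roughly 60–80%, consistent with the range quoted above.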
CostHawk's modality breakdown analytics show you exactly how much of your spend comes from image versus text tokens, making it easy to identify use cases where switching from image to text input would yield significant savings.
FAQ
Frequently Asked Questions
How much do image inputs cost compared to text inputs?
How are image tokens calculated for different providers?
Does image resolution affect token count and cost?
Is it cheaper to use OCR and send text instead of images?
How do audio tokens get calculated and billed?
What is the cheapest provider for image-heavy workloads?
Can I mix text and image inputs in a single request?
How does CostHawk track multi-modal costs differently from text-only costs?
Related Terms
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
AI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
