Multi-Modal Model
An AI model capable of processing and generating content across multiple modalities — text, images, audio, and video. Each modality carries a different token cost, with image inputs costing substantially more than text per semantic unit. Multi-modal models like GPT-4o, Claude 3.5, and Gemini 2.0 unlock powerful capabilities but introduce complex pricing structures that require careful monitoring to avoid cost surprises.
Definition
What Is a Multi-Modal Model?
Impact
Why It Matters for AI Costs
Multi-modal models represent both the most powerful and the most expensive pattern in AI API consumption. Understanding their pricing is critical because the cost differences between modalities are dramatic and often surprising:
The modality cost multiplier: Processing an image costs 10–100x more than processing an equivalent amount of textual information. A product description in text might be 200 tokens ($0.0005 on GPT-4o). A photo of that same product might be 1,100 tokens ($0.00275 on GPT-4o) — 5.5x more expensive for potentially the same semantic content. At scale, this difference is enormous: 100,000 image analysis requests per day at 1,100 tokens each means 110 million input tokens daily, costing $275/day or $8,250/month on GPT-4o — compared to $50/day (20 million input tokens) if the same information were available as text.
| Modality | Typical Token Count | Cost per Unit (GPT-4o) | Cost per Unit (Claude 3.5 Sonnet) | Cost per Unit (Gemini 2.0 Flash) |
|---|---|---|---|---|
| Text (500 words) | ~650 tokens | $0.0016 | $0.0020 | $0.0001 |
| Image (512x512, low detail) | 85 tokens (GPT-4o) | $0.0002 | $0.0011 | $0.00003 |
| Image (1024x1024, high detail) | 765 tokens (GPT-4o) | $0.0019 | $0.0042 | $0.00003 |
| Image (1920x1080, high detail) | 1,105 tokens (GPT-4o) | $0.0028 | $0.0083 | $0.00003 |
| Audio (30 seconds) | ~3,000 tokens (GPT-4o) | $0.0075 | N/A | $0.0001 |
| Video (10 seconds, 1fps) | ~2,900 tokens (Gemini) | N/A | N/A | $0.0003 |
Token counts vary by provider; each cost column applies that provider's own token formula (Claude counts roughly pixels/750 per image, Gemini charges a flat 258 tokens per image) at its per-token rate.
These costs compound rapidly in production. A document processing pipeline that analyzes 10-page PDFs as images (10 pages × 1,105 tokens each) consumes 11,050 tokens per document — costing $0.028 per document on GPT-4o. At 10,000 documents per day, that is $280/day or $8,400/month just for image input. If the same documents can be processed as extracted text (averaging 3,000 tokens per document), the cost drops to $0.0075 per document — $75/day or $2,250/month. Understanding when to use image input versus text extraction can save 70%+ on document processing workloads.
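The arithmetic above can be sketched as a quick cost model. The token counts and the $2.50/MTok GPT-4o input rate are this article's illustrative figures, not live prices:

```python
# Back-of-envelope daily cost for the document pipeline described above.
GPT4O_INPUT_PER_MTOK = 2.50  # USD per million input tokens (article's figure)

def daily_cost(docs_per_day: int, tokens_per_doc: int,
               rate_per_mtok: float = GPT4O_INPUT_PER_MTOK) -> float:
    """Input-token cost per day in USD."""
    return docs_per_day * tokens_per_doc * rate_per_mtok / 1_000_000

image_cost = daily_cost(10_000, 10 * 1105)  # 10 pages sent as images
text_cost = daily_cost(10_000, 3_000)       # same docs as extracted text

print(f"image: ${image_cost:.2f}/day, text: ${text_cost:.2f}/day")
print(f"savings: {1 - text_cost / image_cost:.0%}")
```

Running this reproduces the roughly 73% savings the text quotes as "70%+".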
CostHawk breaks down token consumption by modality, showing you exactly how much of your spend comes from text versus image versus audio inputs, and identifying opportunities to shift from expensive modalities to cheaper alternatives.
What Are Multi-Modal Models?
Multi-modal models extend the transformer architecture to process data types beyond text. While the underlying self-attention mechanism is the same, each modality requires a specialized encoder to convert raw data into the token-like representations the transformer can process:
Vision encoders: Images are processed through a Vision Transformer (ViT) or similar architecture that divides the image into fixed-size patches (typically 14x14 or 16x16 pixels), projects each patch into the model's embedding space, and adds positional encoding so the model knows the spatial relationship between patches. A 224x224 image divided into 14x14 patches produces 256 image tokens; a 1024x1024 image divided into 16x16 patches produces 4,096 image tokens (though most APIs resize and tile images to manage token counts). These image tokens are then concatenated with text tokens and processed through the same transformer layers.
Audio encoders: Audio is typically processed through a Whisper-like encoder that converts raw waveforms into mel-spectrogram representations, then encodes these into audio tokens at a rate of approximately 50–100 tokens per second of audio. GPT-4o's native audio mode processes audio tokens directly, while other models may transcribe audio to text first (which is cheaper but loses tonal and prosodic information).
Video processing: Video is the most token-expensive modality. Models like Gemini process video by sampling frames at a configurable rate (e.g., 1 frame per second), encoding each frame as image tokens, and concatenating them with any audio track tokens. A 60-second video at 1fps produces approximately 60 × 258 = 15,480 image tokens plus ~1,920 audio tokens — roughly 17,400 tokens for a single minute of video. At Gemini 2.0 Flash rates, this costs approximately $0.0017 per minute; at GPT-4o rates (if video were supported), it would cost approximately $0.04 per minute.
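As a sketch of this estimate, assuming Gemini-style fixed frame tokens (258 per frame) and roughly 32 audio tokens per second — both figures taken from this article, not from live documentation:

```python
# Rough video token and cost estimate for a Gemini-style encoder.
FRAME_TOKENS = 258          # fixed tokens per sampled frame (assumed)
AUDIO_TOKENS_PER_SEC = 32   # approximate audio token rate (assumed)

def video_tokens(duration_sec: float, fps: float = 1.0) -> int:
    frames = int(duration_sec * fps)
    return frames * FRAME_TOKENS + int(duration_sec * AUDIO_TOKENS_PER_SEC)

def video_cost(duration_sec: float, fps: float = 1.0,
               rate_per_mtok: float = 0.10) -> float:
    """Input cost in USD at a Gemini 2.0 Flash-like rate."""
    return video_tokens(duration_sec, fps) * rate_per_mtok / 1_000_000

print(video_tokens(60))          # 15,480 image + 1,920 audio tokens
print(f"${video_cost(60):.4f}")  # cost per minute of video
```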
Cross-modal attention: The key architectural feature of multi-modal models is that attention operates across all modalities simultaneously. When analyzing an image with a text question, each text token can attend to all image tokens and vice versa. This cross-modal attention is what enables capabilities like visual question answering ("What color is the car in the image?"), document understanding ("Extract the table data from this screenshot"), and multimodal reasoning ("Based on this chart and the accompanying text, what is the trend?"). The compute cost of this cross-modal attention follows the same quadratic scaling as text-only attention — the total token count (text + image + audio) determines the attention cost.
Multi-Modal Pricing by Modality
Each modality has a different token conversion formula and effective cost per unit of information. Understanding these pricing structures is essential for estimating and optimizing multi-modal costs:
OpenAI GPT-4o image pricing: GPT-4o processes images at two detail levels. Low detail always costs 85 tokens regardless of image size — the image is resized to fit within a 512x512 box and encoded as a single tile. High detail first scales the image to fit within a 2048x2048 box, scales the shortest side down to at most 768 pixels, then divides the result into 512x512 tiles, with each tile costing 170 tokens plus a base cost of 85 tokens. The formula is: tokens = 170 × number_of_tiles + 85. Examples:
| Image Resolution | Detail | Tiles | Tokens | Cost (at $2.50/MTok) |
|---|---|---|---|---|
| Any size | Low | 1 | 85 | $0.000213 |
| 512x512 | High | 1 | 255 | $0.000638 |
| 1024x1024 | High | 4 | 765 | $0.001913 |
| 2048x1024 | High | 6 | 1,105 | $0.002763 |
| 2048x2048 | High | 4 | 765 | $0.001913 |
Anthropic Claude 3.5 image pricing: Claude calculates image tokens based on pixel count. The formula is approximately tokens = (width × height) / 750, with a minimum of 100 tokens. A 1024x1024 image costs approximately 1,398 tokens ($4.19/1K images at $3.00/MTok input). Claude supports images up to 1568x1568 pixels in a single message; larger images are automatically resized. Claude does not offer a "low detail" option — all images are processed at full resolution.
Google Gemini image pricing: Gemini encodes each image as exactly 258 tokens, regardless of resolution (images are resized internally). At Gemini 2.0 Flash's $0.10/MTok rate, each image costs just $0.0000258 — for a 1024x1024 high-detail image, roughly 74x cheaper than GPT-4o and 160x cheaper than Claude. This extreme price advantage makes Gemini the cost leader for image-heavy workloads by a wide margin.
Audio pricing: GPT-4o's audio mode processes audio at approximately 100 tokens per second. Gemini processes audio at approximately 32 tokens per second. A 60-second audio clip costs approximately $0.015 on GPT-4o and $0.0002 on Gemini 2.0 Flash. For audio-heavy workloads, the 75x cost difference between providers is the most important factor in model selection.
Cross-provider cost comparison for a typical multimodal request (1 image at 1024x1024 + 200 text tokens input + 300 text tokens output):
| Model | Image Tokens | Total Input Tokens | Total Cost |
|---|---|---|---|
| GPT-4o | 765 | 965 | $0.00541 |
| Claude 3.5 Sonnet | 1,398 | 1,598 | $0.00929 |
| Gemini 2.0 Flash | 258 | 458 | $0.00017 |
Gemini is 32x cheaper than GPT-4o and 55x cheaper than Claude for this specific multimodal request. For image-heavy workloads, provider selection alone can reduce costs by 95%+.
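The comparison above can be reproduced with a small estimator. The per-image token counts and per-MTok rates are the article's figures and may drift from current pricing:

```python
# Cost of the example multimodal request: one 1024x1024 image plus
# 200 input and 300 output text tokens, across three providers.
PRICING = {  # model: (input $/MTok, output $/MTok, image tokens @ 1024x1024)
    "gpt-4o":            (2.50, 10.00, 765),
    "claude-3.5-sonnet": (3.00, 15.00, 1398),
    "gemini-2.0-flash":  (0.10,  0.40, 258),
}

def request_cost(model: str, text_in: int = 200, text_out: int = 300) -> float:
    """Total USD cost for the request on the given model."""
    in_rate, out_rate, image_tokens = PRICING[model]
    return ((image_tokens + text_in) * in_rate + text_out * out_rate) / 1_000_000

for model in PRICING:
    print(f"{model}: ${request_cost(model):.5f}")
```

Dividing the GPT-4o and Claude results by the Gemini result recovers the ~32x and ~55x ratios quoted above.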
Image Token Calculations
Understanding exactly how images convert to tokens is essential for accurate cost estimation in multi-modal applications. The conversion formulas differ by provider and can produce surprisingly large token counts for high-resolution images.
OpenAI's tiling system (GPT-4o): For high-detail images, OpenAI uses a tiling algorithm:
- Scale the image so the longest side fits within 2048 pixels (maintaining aspect ratio)
- Scale the shortest side to fit within 768 pixels (maintaining aspect ratio)
- Divide the resulting image into 512x512 tiles, counting partial tiles as full tiles
- Token count = (number of tiles × 170) + 85
Practical examples:
| Original Size | After Scaling | Tiles (H×W) | Total Tiles | Tokens |
|---|---|---|---|---|
| 400x300 | 400x300 | 1×1 | 1 | 255 |
| 800x600 | 800x600 | 2×2 | 4 | 765 |
| 1920x1080 | 1365x768 | 2×3 | 6 | 1,105 |
| 3000x2000 | 1152x768 | 2×3 | 6 | 1,105 |
| 4000x4000 | 768x768 | 2×2 | 4 | 765 |
Key insight: images larger than 2048 pixels in any dimension are scaled down before tiling, so sending a 4000x4000 image does not cost more than sending a 768x768 image. However, sending a 512x512 image costs 255 tokens, while low-detail mode costs only 85 tokens — a 3x savings for use cases where low detail suffices.
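The tiling rules above can be sketched as a token calculator. This is an approximation of OpenAI's documented algorithm, not an official implementation, and may lag behind pricing changes:

```python
import math

def gpt4o_image_tokens(width: int, height: int, low_detail: bool = False) -> int:
    """Approximate GPT-4o image token count using the tiling rules above."""
    if low_detail:
        return 85
    # 1) Scale so the longest side fits within 2048 pixels (only downscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2) Scale so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) Count 512x512 tiles; partial tiles count as full tiles.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

for size in [(400, 300), (800, 600), (1920, 1080), (3000, 2000), (4000, 4000)]:
    print(size, gpt4o_image_tokens(*size))
```

Running this reproduces the examples table: 255, 765, 1,105, 1,105, and 765 tokens respectively.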
Anthropic's pixel-based calculation (Claude): Claude uses a simpler formula based on total pixel count: tokens ≈ (width × height) / 750. This means:
| Resolution | Pixels | Approximate Tokens |
|---|---|---|
| 256x256 | 65,536 | ~87 |
| 512x512 | 262,144 | ~350 |
| 1024x1024 | 1,048,576 | ~1,398 |
| 1568x1568 (max) | 2,458,624 | ~3,278 |
Google's fixed-token approach (Gemini): Gemini standardizes all images to 258 tokens regardless of input resolution. The image is internally resized and encoded at a fixed representation size. This makes cost estimation trivial for Gemini — every image costs the same.
Optimization strategies for image tokens:
- Use low-detail mode when possible. For images where you only need rough content understanding ("Is this an image of a cat or a dog?"), low-detail mode at 85 tokens is 3–16x cheaper than high-detail mode. GPT-4o's low-detail mode is sufficient for classification, content moderation, and basic scene description.
- Resize before sending. Sending a 4000x4000 DSLR photo when a 512x512 version contains all the information the model needs wastes bandwidth and may produce more tokens depending on the provider's scaling algorithm.
- Crop to the region of interest. If you only need to analyze a specific area of an image, crop to that area before sending. A 200x200 crop costs a fraction of the full image.
- Consider text extraction. For documents, receipts, and screenshots, OCR (optical character recognition) followed by text analysis is often cheaper than sending the image directly. A page of text as an image might be 1,000+ tokens; the same text extracted via OCR might be 200–400 text tokens.
Audio and Video Processing Costs
Audio and video are the most token-intensive modalities, and their costs can scale rapidly in production applications. Understanding the pricing structure and optimization opportunities is essential for applications that process spoken content or visual media.
Audio processing: There are two approaches to handling audio in AI applications, each with different cost profiles:
Approach 1: Native audio tokens (GPT-4o audio mode). GPT-4o can process audio natively, encoding audio input at approximately 100 tokens per second. A 60-second audio clip costs 6,000 input tokens ($0.015 at $2.50/MTok). This approach preserves tonal information, speaker identity, emotion, and non-verbal cues — valuable for sentiment analysis, speaker diarization, and audio quality assessment. However, it is expensive for bulk transcription.
Approach 2: Speech-to-text then text processing. Use a dedicated transcription model (OpenAI Whisper at $0.006/minute, or Gemini's audio at ~$0.0002/minute) to convert audio to text, then process the text with a text-only model. A 60-second audio clip transcribes to approximately 150 words (~200 text tokens), costing $0.0005 on GPT-4o for the text processing. Total cost: $0.006 (Whisper) + $0.0005 (text processing) = $0.0065 — compared to $0.015 for native audio processing. The tradeoff: you lose tonal and acoustic information but save 55%+ on processing costs.
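A quick sketch of the two approaches, using the article's assumed figures (Whisper at $0.006/minute, GPT-4o text input at $2.50/MTok, ~100 native audio tokens per second):

```python
# Compare native audio processing vs. transcribe-then-process for a clip.
def native_audio_cost(seconds: float, tokens_per_sec: int = 100,
                      rate_per_mtok: float = 2.50) -> float:
    """GPT-4o-style native audio input cost in USD."""
    return seconds * tokens_per_sec * rate_per_mtok / 1_000_000

def transcribe_then_text_cost(seconds: float, whisper_per_min: float = 0.006,
                              text_tokens: int = 200,
                              rate_per_mtok: float = 2.50) -> float:
    """Whisper transcription plus text-only processing, in USD."""
    return seconds / 60 * whisper_per_min + text_tokens * rate_per_mtok / 1_000_000

print(f"native:     ${native_audio_cost(60):.4f}")
print(f"transcribe: ${transcribe_then_text_cost(60):.4f}")
```

For a 60-second clip this yields $0.0150 versus $0.0065, the 55%+ saving quoted above.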
Video processing: Video is processed as a combination of image frames and audio tokens:
| Video Length | Frame Rate | Image Tokens | Audio Tokens (est.) | Total Tokens | Cost (Gemini 2.0 Flash) |
|---|---|---|---|---|---|
| 10 seconds | 1 fps | 2,580 | 320 | 2,900 | $0.0003 |
| 30 seconds | 1 fps | 7,740 | 960 | 8,700 | $0.0009 |
| 60 seconds | 1 fps | 15,480 | 1,920 | 17,400 | $0.0017 |
| 5 minutes | 1 fps | 77,400 | 9,600 | 87,000 | $0.0087 |
| 30 minutes | 1 fps | 464,400 | 57,600 | 522,000 | $0.0522 |
At 1 fps, a 30-minute video consumes over 500K tokens — approaching the context window limits of most models. Higher frame rates (2fps, 5fps) multiply image token counts proportionally. Currently, only Gemini supports native video input via API; OpenAI and Anthropic require frame extraction and submission as individual images.
Audio/video optimization strategies:
- Minimize frame rate. For video analysis, 1 fps is sufficient for most content understanding tasks. Only increase frame rate when temporal precision matters (action detection, frame-by-frame analysis).
- Use keyframe extraction. Instead of sampling at a fixed rate, extract frames only when the visual content changes significantly. A 5-minute interview with a static camera might only need 10 keyframes instead of 300 frames at 1fps — a 30x token reduction.
- Process audio and video separately. If you need both visual and spoken content, process the audio track through a cheap transcription service and only send keyframes for visual analysis. This avoids double-counting the audio as both native audio tokens and image context.
- Use Gemini for video workloads. Gemini's native video support and low token rates make it 10–50x cheaper than constructing video analysis from individual image frames on GPT-4o or Claude.
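The keyframe idea above can be sketched as pure selection logic. Computing the actual frame-difference scores (e.g., mean pixel deltas from an image library) is assumed to happen elsewhere, and the 0.2 threshold is an arbitrary illustration:

```python
# Keep a frame only once enough visual change has accumulated since the
# last kept frame. diff_scores[i] is the precomputed difference between
# frame i and frame i-1 (assumed input; this function only selects).
def select_keyframes(diff_scores: list[float], threshold: float = 0.2) -> list[int]:
    keyframes = [0]  # always keep the first frame
    accumulated = 0.0
    for i, score in enumerate(diff_scores[1:], start=1):
        accumulated += score
        if accumulated >= threshold:
            keyframes.append(i)
            accumulated = 0.0
    return keyframes

# A mostly static scene with two cuts: 3 of 10 frames survive.
scores = [0.0, 0.01, 0.01, 0.5, 0.02, 0.01, 0.01, 0.6, 0.01, 0.02]
print(select_keyframes(scores))  # [0, 3, 7]
```

Each kept frame then costs its usual image tokens, so a 70% frame reduction translates directly into a 70% image-token reduction.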
Optimizing Multi-Modal Costs
Multi-modal workloads offer some of the largest cost optimization opportunities in AI API usage because the gap between naive and optimized approaches can be 5–20x. Here are the most impactful strategies:
1. Resolution reduction. For many image analysis tasks, the model does not need high-resolution input. Image classification, content moderation, scene description, and basic OCR work well with 512x512 or even 256x256 images. Reducing from high-detail (765+ tokens) to low-detail (85 tokens) on GPT-4o saves 89% of image token costs. Test quality at lower resolutions before defaulting to maximum quality — you will often find no measurable quality difference for your specific task.
2. Image compression and format optimization. Before sending images to the API, compress them to reduce file size without losing visual information that matters for your task. JPEG quality 75 is typically indistinguishable from quality 95 for AI analysis, and WebP format often produces smaller files than JPEG at comparable quality. While token count is determined by pixel dimensions (not file size), smaller files transfer faster and some providers may optimize encoding of lower-quality images.
3. Text extraction before image analysis. For documents, screenshots, receipts, and any image that primarily contains text, running OCR first and sending the extracted text is dramatically cheaper than sending the image. A page of text as an image: 765–1,398 tokens. The same text extracted via Tesseract OCR: 200–400 tokens. For a 10-page document processed 1,000 times per day, this saves:
- Image approach: 10 pages × 1,000 docs × 1,000 tokens = 10M tokens/day = $25/day on GPT-4o
- Text approach: 10 pages × 1,000 docs × 300 tokens = 3M tokens/day = $7.50/day on GPT-4o
- Savings: $17.50/day or $525/month
4. Provider routing by modality. As shown in the pricing comparison tables, Gemini is dramatically cheaper for image and video processing — 10–50x cheaper than GPT-4o or Claude. For image-heavy workloads, routing image analysis to Gemini while keeping text-heavy tasks on your preferred provider can reduce multi-modal costs by 80–95%. A hybrid architecture that uses Gemini 2.0 Flash for image understanding and Claude 3.5 Sonnet for text reasoning combines the best of both models at a fraction of the cost of using either model for everything.
5. Batch processing for non-real-time workloads. If your image or audio analysis does not require real-time results (e.g., nightly batch processing of product images, weekly video content analysis), use batch APIs that offer 50% discounts on per-token rates. OpenAI's batch API and Anthropic's batch processing both apply to multi-modal requests, halving your image token costs.
6. Caching multi-modal results. Image analysis results are highly cacheable because the same image produces the same analysis. If your application processes the same product images, logos, or document templates repeatedly, cache the analysis results and skip the API call entirely. A simple hash-based cache (hash the image bytes, check if analysis exists) can eliminate 30–70% of redundant image processing calls.
When to Use Multi-Modal vs Text-Only
The decision between multi-modal and text-only processing has significant cost implications. Here is a framework for making this decision for common use cases:
Use multi-modal (image input) when:
- Visual information is primary. Tasks like image classification, object detection, visual quality assessment, and art analysis require the model to see the actual image — text descriptions are insufficient.
- Layout and spatial relationships matter. Analyzing charts, diagrams, floor plans, or UI screenshots requires understanding spatial relationships that text descriptions cannot capture accurately.
- Text extraction is unreliable. Handwritten text, stylized fonts, text in images with complex backgrounds, or multilingual documents where OCR accuracy is low benefit from the model's native ability to read text in images.
- Speed of processing matters more than cost. Sending an image directly is faster than running OCR + text processing as a two-step pipeline, even if the latter is cheaper. For real-time applications where latency is critical, the single-step image approach may be worth the extra cost.
Use text-only (extract text first) when:
- The image primarily contains machine-printed text. For standard documents, PDFs, and screenshots of text-heavy pages, OCR produces accurate text at a fraction of the cost of image tokens. Modern OCR (Tesseract, PaddleOCR, cloud OCR services) achieves 99%+ accuracy on clean, printed text.
- You are processing at high volume. At 10,000+ documents per day, the cost difference between image and text processing can be $10,000+/month. The engineering investment in building an OCR pipeline pays for itself within days.
- You need to search or index the content. Extracted text can be stored, searched, and indexed; image analysis results cannot be searched as effectively. If downstream processes need the raw text, extracting it first serves dual purposes.
- The visual layout is irrelevant. If you only need the textual content of a document (not how it is formatted or laid out), text extraction is both cheaper and more reliable than image analysis for extracting facts, names, numbers, and other text-based data.
Hybrid approaches for maximum efficiency:
- OCR + validation with image fallback. Extract text via OCR, process it with a text-only model, and use image analysis only when the OCR confidence is low or the text model's output seems incorrect. This handles 80–90% of documents at text-only cost while falling back to image analysis for the remaining 10–20% that are difficult.
- Text-first, image for tables and charts. For mixed documents, extract and process body text via OCR (cheap) and use image analysis only for tables, charts, and diagrams (where layout matters). This targets image tokens at the content that genuinely needs visual understanding.
- Thumbnail preview + full resolution on demand. For image classification or triage, process a low-resolution thumbnail first (85 tokens on GPT-4o). Only send the full-resolution image if the thumbnail analysis indicates the image requires detailed examination. This reduces average image token cost by 50–80% for workloads where most images are routine.
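The expected savings from this two-tier strategy depend on what fraction of images escalate to full resolution. A quick model, using the GPT-4o token figures above (85 tokens low detail, 765 tokens high detail):

```python
# Expected image tokens per request for thumbnail-first triage: every
# image pays the low-detail cost, and a fraction p_detail also pays the
# high-detail cost on escalation.
def expected_tokens(p_detail: float, thumb_tokens: int = 85,
                    full_tokens: int = 765) -> float:
    return thumb_tokens + p_detail * full_tokens

always_full = 765  # baseline: send every image at high detail
for p in (0.1, 0.3, 0.5):
    avg = expected_tokens(p)
    print(f"p={p}: {avg:.0f} tokens ({1 - avg / always_full:.0%} savings)")
```

With 10–30% of images escalating, average image token cost drops by roughly 60–80%, consistent with the range quoted above.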
CostHawk's modality breakdown analytics show you exactly how much of your spend comes from image versus text tokens, making it easy to identify use cases where switching from image to text input would yield significant savings.
FAQ
Frequently Asked Questions
How much do image inputs cost compared to text inputs?
How are image tokens calculated for different providers?
Does image resolution affect token count and cost?
Is it cheaper to use OCR and send text instead of images?
How do audio tokens get calculated and billed?
What is the cheapest provider for image-heavy workloads?
Can I mix text and image inputs in a single request?
How does CostHawk track multi-modal costs differently from text-only costs?
Related Terms
Token
The fundamental billing unit for large language models. Every API call is metered in tokens, which are sub-word text fragments produced by BPE tokenization. One token averages roughly four characters in English, and providers bill input and output tokens at separate rates.
Token Pricing
The per-token cost model used by AI API providers, with separate rates for input tokens, output tokens, and cached tokens. Token pricing is the fundamental billing mechanism for LLM APIs, typically quoted per million tokens, and varies by model, provider, and usage tier.
Cost Per Query
The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.
Large Language Model (LLM)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. For API consumers, inference cost — the price of running the model on your input — dominates the total cost of ownership.
Context Window
The maximum number of tokens a model can process in a single request, encompassing both the input prompt and the generated output. Context window size varies dramatically across models — from 8K tokens in older models to 2 million in Gemini 1.5 Pro — and directly determines how much information you can include per request and how much you pay.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
AI Cost Glossary
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
