Glossary › Billing & Pricing · Updated 2026-03-16

Cost Per Query

The total cost of a single end-user request to your AI-powered application, including all token consumption, tool calls, and retries.

Definition

What is Cost Per Query?

Cost per query (CPQ) is the fully loaded cost of serving one end-user request through your AI application. The base formula is (input_tokens × input_price) + (output_tokens × output_price), but the true CPQ often includes multiple LLM calls, embedding lookups, tool invocations, and retries. CPQ ranges from $0.0003 for a simple GPT-4o-mini classification to over $1.00 for a complex multi-step agent chain using reasoning models. CPQ is the most important unit economics metric for any AI-powered product because it directly determines your gross margin.

Impact

Why It Matters for AI Costs

CPQ is the metric that connects your AI infrastructure costs to your business model. If you charge users $20/month and they make 500 queries, your revenue per query is $0.04. If your CPQ is $0.05, you are losing money on every interaction. Understanding and optimizing CPQ is essential for building a profitable AI product. CostHawk calculates CPQ automatically by summing all API calls associated with a single user request, giving you the true cost that provider dashboards cannot show.

Calculating Cost Per Query

The simplest form of CPQ is a single LLM call:

CPQ = (input_tokens × input_rate) + (output_tokens × output_rate)

For a GPT-4o request with 1,000 input tokens and 500 output tokens:

CPQ = (1,000 × $2.50/1M) + (500 × $10.00/1M)
    = $0.0025 + $0.005
    = $0.0075
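As a minimal sketch, the single-call formula can be written as a small function. The default rates are the GPT-4o prices used in the example above (USD per million tokens, as of this article's March 2026 figures):

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_rate: float = 2.50,
                   output_rate: float = 10.00) -> float:
    """Single-call CPQ. Rates are USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# The worked example: 1,000 input tokens and 500 output tokens on GPT-4o.
print(cost_per_query(1_000, 500))  # 0.0075
```

Swap in your provider's current rates; the function itself is model-agnostic.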

But most production applications involve multiple steps per user query. A RAG-based Q&A system might include: (1) an embedding call to vectorize the query, (2) a vector database lookup (infrastructure cost), (3) an LLM call with retrieved context, and optionally (4) a follow-up LLM call for answer refinement. The true CPQ is the sum of all these steps:

CPQ_rag = embedding_cost + retrieval_infra_cost + llm_call_cost
        = $0.000002 + $0.0001 + $0.0075
        = ~$0.0076
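The same multi-step sum can be sketched in code, with each component hard-coded from the figures above. The variable names mirror the formula, and the retrieval infrastructure figure is illustrative:

```python
# Multi-step RAG CPQ: sum of every step triggered by one user request.
embedding_cost = 0.000002      # vectorizing the user query
retrieval_infra_cost = 0.0001  # vector DB lookup (infra cost, not tokens)
llm_call_cost = 0.0075         # generation call with retrieved context

cpq_rag = embedding_cost + retrieval_infra_cost + llm_call_cost
print(round(cpq_rag, 4))  # 0.0076
```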

For agent-based applications, CPQ becomes even more complex because the number of LLM calls per query is variable — an agent might make 3 tool calls or 15, depending on the task. In these cases, CPQ must be measured empirically using median and P95 values rather than calculated from a fixed formula.

Cost Benchmarks by Use Case

The following table provides typical CPQ ranges for common AI application patterns, based on GPT-4o pricing as of March 2026. Actual costs vary by prompt design, context length, and output requirements.

Use Case | Typical Input Tokens | Typical Output Tokens | LLM Calls per Query | Median CPQ
Text classification | 200-500 | 5-20 | 1 | $0.0003-$0.002
Sentiment analysis | 300-800 | 10-50 | 1 | $0.001-$0.003
Entity extraction | 500-2,000 | 50-200 | 1 | $0.002-$0.007
RAG Q&A | 2,000-8,000 | 200-800 | 1-2 | $0.007-$0.03
Document summarization | 5,000-30,000 | 500-2,000 | 1-3 | $0.02-$0.10
Code generation | 1,000-5,000 | 500-3,000 | 1-2 | $0.01-$0.05
Conversational chatbot | 1,000-10,000 | 200-1,000 | 1 | $0.005-$0.04
Multi-step agent chain | 3,000-20,000 | 1,000-10,000 | 3-15 | $0.05-$1.00+
Reasoning (o1) | 1,000-5,000 | 5,000-30,000 | 1 | $0.30-$2.00

The 3,000x range between the cheapest classification query ($0.0003) and the most expensive agent chain ($1.00+) shows why CPQ analysis must be done per use case, not as an organization-wide average. A single agent workflow can cost more than 1,000 classification queries.

Hidden Costs in Multi-Step Queries

The most common source of CPQ underestimation is multi-step queries where a single user interaction triggers multiple LLM calls. These hidden costs include:

  • Agent loops: An AI agent that uses tool calling may invoke 5-15 LLM calls per user query. Each call includes the full conversation history plus tool results, so the total input tokens across the chain grow roughly quadratically with the number of steps. The 5th call in a chain might send 10,000 input tokens of accumulated context.
  • Tool call overhead: Each tool call adds tokens for the tool definition (in the system prompt), the tool invocation (in the output), and the tool result (in the next input). A single tool call can add 200-500 tokens of overhead beyond the tool's actual payload.
  • Retries and fallbacks: When a model returns malformed JSON or fails a validation check, the application retries — doubling the cost of that step. If you have a fallback from GPT-4o-mini to GPT-4o on failure, the fallback call costs 16x more than the original attempt.
  • Guardrail and moderation calls: Content moderation, PII detection, and output guardrails each add an LLM call. A system that runs input moderation, generation, and output validation makes 3 LLM calls per query minimum.
  • Conversation history growth: In chatbot applications, each turn sends the full conversation history as input. The 10th message in a conversation sends 10x more input tokens than the first message. CPQ for turn 10 is dramatically higher than CPQ for turn 1.
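The compounding described in the agent-loops bullet can be sketched numerically. The token figures (SYSTEM_AND_QUERY, TOKENS_PER_STEP) are hypothetical assumptions, not measured values:

```python
INPUT_RATE = 2.50 / 1_000_000  # GPT-4o input, USD per token
SYSTEM_AND_QUERY = 1_500       # base prompt resent on every step (assumed)
TOKENS_PER_STEP = 600          # tool call + tool result appended per step (assumed)

def chain_input_cost(steps: int) -> float:
    """Input-side cost of an agent chain that resends accumulated context."""
    total_input = 0
    for step in range(steps):
        # Step N resends the base prompt plus everything appended so far.
        total_input += SYSTEM_AND_QUERY + step * TOKENS_PER_STEP
    return total_input * INPUT_RATE

for steps in (1, 5, 15):
    print(steps, round(chain_input_cost(steps), 4))
```

Even before counting output tokens, the 15-step chain's input cost is dozens of times the 1-step cost, which is why step counts dominate agent CPQ.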

CostHawk's request tracing groups all LLM calls triggered by a single user interaction into one trace, revealing the true multi-step CPQ that provider dashboards cannot show.

Reducing Cost Per Query

CPQ optimization works across four dimensions: model selection, token reduction, caching, and architecture.

Model selection: Route simple tasks to cheaper models. A classification task on GPT-4o-mini costs $0.0003; the same task on GPT-4o costs $0.005 — a 16x difference with minimal quality impact. Use model routing to automatically select the cheapest model that meets your quality threshold for each query type.

Token reduction: Compress prompts, trim conversation history, set strict max_tokens limits, and use structured output. These strategies reduce both input and output costs. A well-optimized prompt can cut CPQ by 30-50% compared to a naive implementation.

Caching: Prompt caching (90% input discount from Anthropic, 50% from OpenAI) reduces the input portion of CPQ for repetitive system prompts. Semantic caching can eliminate the LLM call entirely for repeated queries, reducing CPQ to near zero for cache hits.

Architecture: Reduce the number of LLM calls per query. Replace agent loops with deterministic pipelines where possible. Pre-compute tool results that do not change between requests. Batch guardrail checks instead of running them individually. Each eliminated LLM call removes an entire cost step from your CPQ.

The order of impact is typically: caching (highest) > model routing > token reduction > architecture. Start by measuring your current CPQ distribution, then apply optimizations to the most expensive query types first.
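These levers combine into a blended CPQ that depends on your traffic mix. A sketch under tiered routing, where the traffic split and per-tier costs are illustrative assumptions drawn from the ranges in this article:

```python
# Hypothetical three-tier routing split; shares must sum to 1.0.
tiers = {
    "simple (gpt-4o-mini)": {"share": 0.60, "cpq": 0.0003},
    "moderate (gpt-4o)":    {"share": 0.35, "cpq": 0.008},
    "complex (o1)":         {"share": 0.05, "cpq": 0.50},
}

blended_cpq = sum(t["share"] * t["cpq"] for t in tiers.values())
print(round(blended_cpq, 5))
```

Note how the 5% of complex queries dominates the blend, which is why routing even a small slice of traffic down a tier moves the average noticeably.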

CPQ and Product Pricing Strategy

CPQ is the foundation of pricing strategy for AI-powered products. Your subscription price, per-query price, or usage-based rate must cover CPQ plus infrastructure overhead plus margin. Here is how to model it:

Required revenue per query = CPQ × (1 + infrastructure_overhead) × (1 + target_margin)

Example: CPQ = $0.02, overhead = 20%, margin = 50%
Required = $0.02 × 1.2 × 1.5 = $0.036 per query

For subscription pricing, work backward from usage patterns:

Monthly queries per user = 500 (median)
Median CPQ = $0.02
Monthly AI cost per user = 500 × $0.02 = $10.00
Minimum subscription price = $10.00 × 1.2 × 1.5 = $18.00/month
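Both pricing calculations can be sketched together; the overhead and margin defaults follow the worked examples above:

```python
def required_revenue_per_query(cpq: float,
                               infra_overhead: float = 0.20,
                               target_margin: float = 0.50) -> float:
    """Per-query revenue floor: CPQ grossed up for overhead and margin."""
    return cpq * (1 + infra_overhead) * (1 + target_margin)

def min_subscription_price(median_cpq: float, monthly_queries: int,
                           infra_overhead: float = 0.20,
                           target_margin: float = 0.50) -> float:
    """Subscription floor: monthly AI cost grossed up the same way."""
    monthly_ai_cost = monthly_queries * median_cpq
    return monthly_ai_cost * (1 + infra_overhead) * (1 + target_margin)

print(round(required_revenue_per_query(0.02), 3))  # 0.036
print(round(min_subscription_price(0.02, 500), 2))  # 18.0
```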

But median CPQ is not enough — you must account for the distribution. If your P95 CPQ is $0.15 (from agent-heavy queries), your P95 cost per user is $75/month, which destroys your margin at a $20/month price point. You need usage caps, tiered pricing, or model routing to prevent heavy users from making you unprofitable.

CostHawk's CPQ distribution charts show you the median, P75, P90, and P95 CPQ values so you can price your product with confidence and set usage limits that protect your margin.

Monitoring CPQ with CostHawk

CostHawk provides purpose-built CPQ monitoring that goes beyond raw token counting:

  • Per-endpoint CPQ: Tag API routes with CostHawk labels and see the average, median, and P95 CPQ for each endpoint. Identify which features are expensive and which are efficient.
  • CPQ trends over time: Track how CPQ changes as you modify prompts, switch models, or add new features. A rising CPQ trend is an early warning of context bloat or scope creep in agent behavior.
  • CPQ by user segment: Break down CPQ by customer plan tier, geography, or usage pattern. Power users who trigger complex agent chains may have 10-50x higher CPQ than casual users.
  • CPQ anomaly alerts: Set alerts for individual queries that exceed a CPQ threshold. A single runaway agent query costing $5.00 is a signal to investigate your agent loop termination logic.
  • Multi-step trace grouping: CostHawk groups all LLM calls, embedding calls, and tool invocations triggered by a single user request into one trace. The trace CPQ is the true cost that matters for unit economics, not the per-call cost that provider dashboards show.

Teams using CostHawk's CPQ monitoring typically identify 20-40% cost reduction opportunities within the first week by finding queries where CPQ is far above the median due to missing max_tokens limits, excessive agent steps, or unnecessary model upgrades.

FAQ

Frequently Asked Questions

What is a good cost per query for an AI chatbot?
For a consumer-facing chatbot, target a median CPQ of $0.005-$0.02 using GPT-4o or Claude Sonnet. This allows profitable operation at subscription prices of $10-$20/month assuming 300-500 queries per user per month. Enterprise chatbots with longer conversations and more complex queries typically run $0.02-$0.08 per query. If your chatbot CPQ exceeds $0.10, investigate whether you are sending excessive conversation history, using an over-powered model for simple turns, or missing max_tokens limits. CostHawk's per-conversation cost breakdown helps you identify which conversations are expensive and why.
How do agent chains affect cost per query?
Agent chains are the most expensive query pattern in AI applications. Each tool call in an agent loop triggers a new LLM inference with the full conversation history plus tool results. A 5-step agent chain on GPT-4o might process 50,000 total tokens across all steps (input tokens grow with each step as context accumulates). At GPT-4o pricing, this can cost $0.15-$0.50 per query. Longer chains of 10-15 steps can exceed $1.00. The key risk is unbounded loops — an agent that keeps calling tools without converging can generate unlimited costs. Always set a maximum step count (typically 5-10) and a per-query cost ceiling. CostHawk flags agent queries that exceed configurable step and cost thresholds.
How do I calculate CPQ for a RAG application?
A RAG query has three cost components: (1) Embedding the user query — typically 10-50 tokens at $0.02/MTok, costing less than $0.000001, effectively free. (2) Vector database lookup — infrastructure cost, typically $0.0001-$0.001 per query depending on your provider. (3) LLM generation with retrieved context — this dominates the cost. If you retrieve 3 chunks of 500 tokens each (1,500 tokens) plus a 500-token system prompt and 100-token query, your input is ~2,100 tokens. With a 400-token response on GPT-4o: (2,100 × $2.50/1M) + (400 × $10.00/1M) = $0.0053 + $0.004 = ~$0.0093. Typical RAG CPQ ranges from $0.007 to $0.03 depending on context length and response length.
What is the difference between CPQ and cost per token?
Cost per token is the unit rate charged by the provider (e.g., $2.50 per million input tokens for GPT-4o). It is a fixed rate. Cost per query is the total cost of serving one user request, calculated by multiplying token counts by token rates and summing across all LLM calls in the request. CPQ is a variable metric that depends on prompt length, response length, number of LLM calls, model selection, and whether caching applies. Two queries using the same model with the same cost-per-token rate can have wildly different CPQs: a simple classification at $0.0003 versus a document summarization at $0.05. CPQ is the metric that matters for unit economics; cost per token is the building block used to calculate it.
How can caching reduce cost per query?
Caching reduces CPQ in two ways. Prompt caching (provider-side) discounts the input portion: Anthropic gives 90% off cached input tokens, OpenAI gives 50% off. For a request where 80% of input tokens are cached (a common scenario with large system prompts), this reduces the input cost component by 40-72%. Semantic caching (application-side) can eliminate the LLM call entirely for queries similar to previously answered ones, reducing CPQ to near zero (just the embedding cost for similarity matching, typically $0.00001). For applications with repetitive query patterns — FAQ bots, customer support, documentation Q&A — semantic caching can reduce average CPQ by 20-40% across all traffic, because 20-40% of queries hit the cache. CostHawk tracks cache hit rates alongside CPQ so you can measure the actual savings.
Should I track median or average CPQ?
Track both, but make decisions based on median and percentiles. Average CPQ is skewed by outliers — a single $5.00 agent query in a batch of 1,000 $0.01 queries makes the average $0.015 instead of $0.01. This overstates the typical cost by 50%. Median CPQ tells you what most queries actually cost. P95 CPQ tells you what your expensive queries cost. P99 CPQ reveals your worst-case outliers. For pricing decisions, use median CPQ to estimate typical unit economics and P95 CPQ to set usage caps and safeguards. CostHawk's analytics dashboard provides median, P75, P90, P95, and P99 CPQ breakdowns per endpoint, model, and time period.
How does model routing lower CPQ?
Model routing reduces CPQ by directing each query to the cheapest model that can handle it adequately. GPT-4o-mini costs $0.15/$0.60 per million tokens (input/output) while GPT-4o costs $2.50/$10.00 — a 16x difference. If 60% of your queries are simple enough for GPT-4o-mini, routing them to the cheaper model reduces your blended average CPQ by approximately 55%. A tiered routing system might classify queries into three tiers: simple (GPT-4o-mini at ~$0.0003/query), moderate (GPT-4o at ~$0.008/query), and complex (o1 at ~$0.50/query). The blended CPQ depends on your traffic distribution across tiers. CostHawk's per-model cost analytics help you identify which queries are over-served by expensive models and could be routed down a tier without quality loss.
What is the CPQ impact of conversation history growth?
In multi-turn conversations, CPQ increases with each turn because the full conversation history is sent as input tokens. If each turn adds ~300 tokens (user message + assistant response), by turn 10 you are sending ~3,000 extra input tokens of history per request. On GPT-4o, that is an additional $0.0075 in input cost per query — which may double or triple the CPQ compared to turn 1. For a 20-turn conversation with a 2,000-token system prompt, the final turn sends roughly 8,000 input tokens of system prompt plus accumulated history, costing about $0.02 before the new message is even processed. Mitigation strategies include summarizing older turns (replacing 10 turns with a 200-token summary), sliding window truncation (keeping only the last N turns), and using RAG to retrieve relevant past context instead of sending everything. CostHawk's per-turn cost tracking shows you exactly where conversation cost starts to compound.
How do retries and fallbacks affect CPQ?
Every retry doubles the cost of that inference step, and fallbacks to more expensive models can multiply costs further. If your application retries on JSON parse failures with a 10% failure rate, your effective CPQ increases by 10% on average. If your fallback chain goes from GPT-4o-mini ($0.0003/query) to GPT-4o ($0.008/query) on failure, and 5% of queries fail to the fallback, your blended CPQ increases by approximately 12%. The worst case is a retry loop combined with a fallback: a query that fails twice on GPT-4o-mini, then falls back to GPT-4o and retries once more, costs 2 × $0.0003 + 2 × $0.008 = $0.0166 — 55x the intended cost. Always set a maximum retry count (typically 1-2) and log retry rates. CostHawk flags endpoints with high retry rates as cost optimization opportunities.
Can CPQ be negative? What about free-tier provider credits?
CPQ itself is always positive — it represents actual compute consumed. However, your net cost per query can be effectively zero or negative when accounting for provider credits and free tiers. OpenAI offers $5 in free credits for new accounts. Anthropic offers limited free-tier access through Claude.ai. Google offers $300 in Cloud credits that apply to Vertex AI. These credits reduce your effective CPQ to zero during the credit period, but you should still track and optimize CPQ during this time because credits expire and your true unit economics will be revealed once they do. Teams that ignore CPQ during the free-tier period often face sticker shock when credits run out and their actual costs are 3-5x higher than expected.


Put this knowledge to work. Track your AI spend in one place.

CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.