LLM Proxy
A transparent intermediary that sits between your application and LLM providers, forwarding requests while adding tracking, caching, or policy enforcement without code changes. Proxies intercept standard SDK traffic, log usage metadata, and optionally transform requests before relaying them upstream.
Why It Matters for AI Costs
Without a proxy layer, AI cost visibility is fragmented and reactive. Each team, service, or microservice makes direct calls to provider APIs using its own keys, and the only cost data available is the monthly invoice from the provider — an aggregate number with no breakdown by feature, team, or request type. By the time you see the bill, the money is already spent.
An LLM proxy changes the economics of observability. Because every request flows through a single chokepoint, you gain:
- Real-time cost tracking — every request is logged with its token count, model, latency, and computed cost before the response even reaches your application.
- Per-key attribution — issue different proxy keys to different teams or services, and costs are automatically attributed to the correct owner.
- Caching — identical or semantically similar requests can be served from cache, eliminating redundant provider charges. Teams running batch pipelines or repeated evaluations routinely see 30–60% cache hit rates.
- Policy enforcement — set per-key spending caps, rate limits, or model restrictions at the proxy level. A runaway script cannot exceed its budget because the proxy will reject requests once the cap is hit.
- Model routing — the proxy can transparently redirect requests to cheaper models when appropriate, or fail over to backup providers during outages.
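The cost arithmetic behind real-time tracking is simple: token counts times per-model rates. A minimal sketch — the rate table below is illustrative, so check current provider pricing before relying on the numbers:

```javascript
// Illustrative per-million-token prices; real rates change, look them up.
const PRICING = {
  "gpt-4o":      { inputPerMTok: 2.50, outputPerMTok: 10.00 },
  "gpt-4o-mini": { inputPerMTok: 0.15, outputPerMTok: 0.60 },
};

// Compute the dollar cost of one request from its usage metadata.
function requestCost(model, inputTokens, outputTokens) {
  const p = PRICING[model];
  if (!p) throw new Error(`no pricing for model: ${model}`);
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}

// 1,200 input + 300 output tokens on gpt-4o-mini
const cost = requestCost("gpt-4o-mini", 1200, 300);
```

Because the proxy sees the usage object on every response, this computation runs per request rather than being reconstructed from a monthly invoice.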
Consider a mid-size engineering organization with 8 teams, 15 microservices, and 3 LLM providers. Without a proxy, that is potentially 45+ direct integration points, each with its own key, its own logging (or lack thereof), and its own cost blind spot. With a proxy, it is one integration point per service (swap the base URL), and all cost data flows into a single dashboard. CostHawk customers typically discover 15–25% of their AI spend was previously untracked or misattributed before deploying proxy-based monitoring.
What is an LLM Proxy?
An LLM proxy is a server or service that sits in the network path between your application and an LLM provider's API endpoint. When your application makes an API call — for example, a POST to https://api.openai.com/v1/chat/completions — the request is instead routed to the proxy, which performs some combination of logging, transformation, and policy checks before forwarding the request to the actual provider.
The key architectural property of a proxy is transparency. Your application code does not need to know it is talking to a proxy rather than the provider directly. The proxy accepts requests in the same format as the provider's API, returns responses in the same format, and the only configuration change required is updating the base URL or API key. This is fundamentally different from a wrapper library or SDK plugin, which requires code changes in every service that calls an LLM.
LLM proxies can be deployed in several configurations:
- Cloud-hosted proxy — a managed service like CostHawk that runs in the cloud. Your application sends requests to the proxy's URL, and the proxy forwards them to the provider. This is the simplest deployment model and requires zero infrastructure management.
- Sidecar proxy — a lightweight process running alongside your application (common in Kubernetes environments). Traffic is intercepted at the network level using iptables rules or service mesh configuration, making the proxy completely invisible to application code.
- Reverse proxy / ingress — an Nginx, Envoy, or Caddy server configured with custom middleware that intercepts outbound LLM traffic. This model is popular in organizations that already have a reverse proxy infrastructure.
- SDK middleware — while not a true network proxy, some solutions offer SDK-level middleware that wraps provider client libraries. This requires code changes but offers deeper integration with application-level metadata like user IDs and feature flags.
Regardless of deployment model, the proxy's core responsibilities are the same: intercept the request, extract metadata (model, token counts, timestamps), optionally apply transformations or policies, forward to the provider, capture the response metadata, and return the response to the caller. The entire round trip typically adds 5–20ms of latency — negligible compared to the 200–2,000ms latency of the LLM inference itself.
Modern LLM proxies handle all major provider protocols including OpenAI's chat completions API, Anthropic's messages API, Google's Gemini API, and streaming responses via Server-Sent Events (SSE). The best proxies are protocol-aware, meaning they can parse streaming responses in real time to count output tokens as they arrive rather than waiting for the stream to complete.
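Protocol-aware streaming can be sketched as follows. The parser consumes OpenAI-style SSE data lines and accumulates the streamed content; this is a simplification — a real proxy processes bytes incrementally, and chunk counts only approximate token counts, so exact figures need a tokenizer:

```javascript
// Parse OpenAI-style SSE lines, accumulating streamed content.
// In a real proxy these lines arrive incrementally over HTTP; chunk
// counts only approximate token counts — use a tokenizer for exact numbers.
function parseSSE(lines) {
  let content = "";
  let chunks = 0;
  let done = false;
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;          // SSE data frames only
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") { done = true; break; }  // stream-end sentinel
    const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
    if (delta) { content += delta; chunks += 1; }
  }
  return { content, chunks, done };
}

const result = parseSSE([
  'data: {"choices":[{"delta":{"content":"Hello"}}]}',
  'data: {"choices":[{"delta":{"content":" world"}}]}',
  "data: [DONE]",
]);
```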
Proxy vs Gateway
The terms "proxy" and "gateway" are often used interchangeably in the LLM infrastructure space, but they refer to architecturally distinct components with different responsibilities. Understanding the difference helps you choose the right tool for your needs.
| Characteristic | LLM Proxy | LLM Gateway |
|---|---|---|
| Primary purpose | Transparent request forwarding with observability | Centralized API management and orchestration |
| Code changes required | Minimal — swap base URL or API key | Often requires SDK or client library changes |
| Protocol handling | Pass-through; preserves provider-native format | May normalize to a unified API format |
| Model routing | Simple rules (round-robin, cost-based) | Advanced orchestration (A/B testing, cascading fallbacks) |
| Caching | Exact-match or semantic caching | Full response caching with invalidation policies |
| Authentication | Key substitution (wrapped key → provider key) | Full auth stack (OAuth, JWT, API key management) |
| Rate limiting | Per-key or per-endpoint limits | Tiered rate limiting with quota management |
| Latency overhead | 5–20ms | 20–100ms (more processing per request) |
| Deployment complexity | Low — single service or managed cloud | Medium to high — often requires dedicated infrastructure |
| Best for | Cost tracking, usage attribution, simple policy enforcement | Multi-provider orchestration, complex routing, enterprise API management |
In practice, many organizations start with a proxy for cost visibility and graduate to a gateway as their LLM infrastructure matures. CostHawk's wrapped-key system functions as a proxy — it adds zero friction to your existing provider SDK integration while providing comprehensive cost tracking and attribution. If you later need advanced orchestration features like cascading model fallbacks or A/B testing across providers, you can layer a gateway on top of or in place of the proxy.
Some products blur the line between proxy and gateway. LiteLLM, for example, started as a proxy (unified interface to multiple providers) but has evolved to include gateway-like features such as spend tracking, virtual keys, and budget management. The important thing is not the label but the capabilities: do you need transparent pass-through with logging (proxy), or do you need active request orchestration (gateway)?
How LLM Proxies Work
Understanding the request lifecycle through an LLM proxy helps you reason about latency, failure modes, and the data available for cost tracking. Here is the complete flow for a typical proxied request:
Step 1: Client sends request. Your application makes a standard API call using the provider's SDK or a raw HTTP request. The only difference from a direct call is the base URL (pointing to the proxy) and the API key (a wrapped key issued by the proxy rather than the raw provider key). For example:
```javascript
const client = new OpenAI({
  apiKey: "ch_wrapped_sk_abc123...", // CostHawk wrapped key
  baseURL: "https://proxy.costhawk.com/v1" // Proxy endpoint
})
```

Step 2: Proxy receives and authenticates. The proxy receives the HTTP request, validates the wrapped API key against its database, and retrieves the associated metadata: the real provider key (encrypted at rest), the owning team/project, spending limits, and any routing rules. If the key is invalid, expired, or over budget, the proxy returns a 401 or 429 immediately without contacting the provider.
Step 3: Pre-request processing. Before forwarding, the proxy can perform several optional operations:
- Cache check — hash the request payload and check if an identical (or semantically similar) request was recently served. If so, return the cached response immediately, saving both latency and provider cost.
- Budget check — estimate the request cost (using the input token count and model pricing) and verify it will not exceed the key's remaining budget.
- Request transformation — modify the request payload, such as injecting a system prompt, adjusting `max_tokens`, or changing the target model based on routing rules.
- Rate limit check — verify the key has not exceeded its requests-per-minute or tokens-per-minute limit.
Step 4: Forward to provider. The proxy constructs a new HTTP request to the provider's actual endpoint, substituting the wrapped key with the real provider API key. The request body is forwarded as-is (unless modified in Step 3). For streaming requests, the proxy establishes an SSE connection to the provider.
Step 5: Response relay. The provider's response is streamed back through the proxy to the client. For streaming responses, the proxy forwards each SSE chunk in real time while simultaneously counting output tokens. For non-streaming responses, the proxy reads the complete response, extracts the usage object (containing prompt_tokens and completion_tokens), and forwards the response to the client.
Step 6: Post-request logging. After the response is fully delivered, the proxy asynchronously writes a usage record containing: timestamp, wrapped key ID, model, input tokens, output tokens, computed cost, latency, HTTP status code, and any custom tags. This logging is non-blocking — it does not add latency to the response delivery.
The total latency overhead of steps 2–3 and 5–6 is typically 5–20ms, dominated by the key lookup and cache check. The provider inference itself (Step 4) takes 200–2,000ms+ depending on the model and output length, so the proxy overhead is negligible in practice.
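The six steps condense into a small skeleton handler. Everything here is illustrative — the key store, usage log, and stubbed `forward` callback stand in for a database, an async logging queue, and a real HTTP call to the provider:

```javascript
// Skeleton of the proxy lifecycle with the provider call stubbed out.
// keyStore, usageLog, and `forward` are illustrative names; a real proxy
// does all of this asynchronously over HTTP.
const keyStore = new Map([
  ["ch_wrapped_demo", { providerKey: "sk-real", team: "search", budgetLeft: 5.0 }],
]);
const usageLog = [];

function handleRequest(wrappedKey, payload, forward) {
  const meta = keyStore.get(wrappedKey);             // Step 2: authenticate wrapped key
  if (!meta) return { status: 401 };
  if (meta.budgetLeft <= 0) return { status: 429 };  // Step 3: budget gate
  const t0 = Date.now();
  const resp = forward(meta.providerKey, payload);   // Step 4: forward with real key
  usageLog.push({                                    // Step 6: usage record (async in practice)
    team: meta.team,
    model: payload.model,
    usage: resp.usage,
    latencyMs: Date.now() - t0,
  });
  return { status: 200, body: resp };                // Step 5: relay response
}

// Usage with a stubbed provider call:
const stub = () => ({ usage: { prompt_tokens: 10, completion_tokens: 5 } });
const out = handleRequest("ch_wrapped_demo", { model: "gpt-4o" }, stub);
```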
Cost Benefits of Proxy Architecture
Deploying an LLM proxy delivers cost benefits across four dimensions: visibility, caching, routing, and enforcement. Each dimension compounds on the others, and together they typically reduce AI API spend by 20–45% within the first month of deployment.
1. Centralized cost visibility. The most immediate benefit is knowing where your money goes. Without a proxy, cost data is scattered across provider dashboards, billing emails, and (maybe) application logs. A proxy centralizes every request into a single audit trail with consistent metadata. CostHawk customers routinely discover cost surprises during their first week: development environments consuming 35% of total spend, a single batch job responsible for 40% of monthly tokens, or a chatbot feature sending 8,000-token conversation histories on every turn. Visibility alone does not save money, but it tells you exactly where to look.
2. Semantic and exact-match caching. Many LLM workloads contain significant redundancy. Customer support chatbots answer the same 50 questions repeatedly. Code review tools analyze similar pull requests. Data extraction pipelines process documents with identical structures. An LLM proxy with caching intercepts these duplicate requests and returns cached responses without ever calling the provider. Exact-match caching (same request payload = cache hit) is straightforward and risk-free. Semantic caching (similar requests = cache hit) uses embedding similarity to match requests that are worded differently but have the same intent. Production deployments with caching enabled typically see 15–40% cache hit rates depending on workload characteristics, translating directly to 15–40% cost reduction on cached traffic.
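Exact-match caching decides hits by hashing; semantic caching decides them by embedding similarity. A sketch of the similarity test — the toy vectors and the 0.95 threshold are illustrative, and production systems embed requests with a dedicated embedding model:

```javascript
// Cosine similarity of two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Two requests count as a semantic cache hit when their embeddings
// are sufficiently similar. Threshold is workload-dependent.
const THRESHOLD = 0.95;
function isSemanticHit(queryVec, cachedVec) {
  return cosineSimilarity(queryVec, cachedVec) >= THRESHOLD;
}

const hit = isSemanticHit([1, 0, 0.1], [1, 0, 0.12]); // nearly parallel vectors
const miss = isSemanticHit([1, 0, 0], [0, 1, 0]);     // orthogonal vectors
```

Tuning the threshold is the main operational lever: too low and differently-intended requests share answers, too high and the cache rarely hits.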
3. Intelligent model routing. A proxy can route requests to different models based on rules you define. Simple classification tasks go to GPT-4o mini ($0.15/MTok input) instead of GPT-4o ($2.50/MTok input) — a 16x cost reduction with minimal quality impact. The proxy can implement this routing transparently: your application always requests "gpt-4o," but the proxy downgrades qualifying requests to the cheaper model. More sophisticated routing uses a small classifier model to estimate task complexity and select the cheapest model that meets your quality threshold. Teams that implement model routing typically reduce costs by 25–50% on workloads that include a mix of simple and complex tasks.
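A transparent downgrade rule might look like the sketch below. The complexity heuristic (prompt length plus a keyword check) is a stand-in for the small classifier model described above, and the model names are just the ones used in this article:

```javascript
// Transparent downgrade routing: simple requests go to the cheaper model.
// The heuristic here is a placeholder; production routers often use a
// small classifier model to estimate task complexity.
function routeModel(requestedModel, messages) {
  if (requestedModel !== "gpt-4o") return requestedModel; // only downgrade gpt-4o
  const text = messages.map((m) => m.content).join(" ");
  const looksComplex = text.length > 2000 || /\b(analyze|prove|debug)\b/i.test(text);
  return looksComplex ? "gpt-4o" : "gpt-4o-mini";
}

const simple = routeModel("gpt-4o", [{ role: "user", content: "Classify: spam or not?" }]);
const complex = routeModel("gpt-4o", [{ role: "user", content: "Analyze this codebase..." }]);
```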
4. Budget enforcement and spend caps. Perhaps the most valuable cost benefit is prevention. A proxy can enforce hard spending limits at the per-key, per-team, or per-project level. When a key reaches its daily or monthly budget cap, the proxy returns a 429 Too Many Requests response instead of forwarding the request to the provider. This prevents runaway costs from bugs (infinite loops calling the API), abuse (compromised keys), or honest mistakes (a developer accidentally running a batch job against production). Without a proxy, your only spending protection is the provider's own usage limits — which are per-account, not per-key, and are typically set much higher than you would want. CostHawk wrapped keys support configurable budget caps that stop spending the moment a limit is reached, giving you granular financial control over every integration point.
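A pre-request budget gate can be sketched like this. The prices are illustrative, and the gate is deliberately conservative: it assumes the response consumes the full `max_tokens` allowance:

```javascript
// Illustrative gpt-4o-like rates; check current provider pricing.
const PRICE = { inputPerMTok: 2.50, outputPerMTok: 10.00 };

// Worst-case cost estimate: assume the model emits all max_tokens.
function estimateCost(inputTokens, maxTokens) {
  return (inputTokens * PRICE.inputPerMTok + maxTokens * PRICE.outputPerMTok) / 1e6;
}

// Reject with 429 when the worst case would exceed the remaining budget.
function budgetGate(remainingBudget, inputTokens, maxTokens) {
  const worstCase = estimateCost(inputTokens, maxTokens);
  return worstCase <= remainingBudget
    ? { status: "forward" }
    : { status: 429, error: "budget cap reached" };
}

const ok = budgetGate(1.0, 1000, 500);           // well under a $1.00 budget
const blocked = budgetGate(0.001, 100000, 4096); // would blow past $0.001
```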
The compound effect of these four dimensions is substantial. An organization spending $50,000/month on LLM APIs might save $3,000 from caching, $8,000 from model routing, and prevent $5,000 in waste from development over-usage and runaway scripts — a total savings of $16,000/month (32%) while maintaining the same application quality and user experience.
Setting Up an LLM Proxy
The fastest path to proxy-based cost tracking is CostHawk's wrapped-key system, which requires no infrastructure deployment — just a key swap. Here is a step-by-step guide for the three most common setup paths:
Option 1: CostHawk Wrapped Keys (Managed Proxy)
This is the recommended approach for teams that want cost visibility without managing proxy infrastructure.
- Create a CostHawk account and navigate to the Integrations page in the CostHawk dashboard.
- Add your provider API key. Paste your OpenAI, Anthropic, or Google API key into CostHawk. The key is encrypted with AES-256-CBC and stored securely — it never appears in logs or API responses.
- Generate a wrapped key. CostHawk issues a wrapped key (prefixed with `ch_`) that maps to your provider key. You can generate multiple wrapped keys per provider key for per-team or per-project attribution.
- Update your application. Replace your provider API key and base URL:

```javascript
// Before
const client = new OpenAI({ apiKey: "sk-proj-abc123..." })

// After
const client = new OpenAI({
  apiKey: "ch_wrapped_sk_abc123...",
  baseURL: "https://proxy.costhawk.com/v1"
})
```

- Deploy and monitor. Requests now flow through CostHawk's proxy. Cost data appears in your dashboard within seconds, broken down by key, model, and project.
Total setup time: under 5 minutes per service. No infrastructure to deploy, no containers to manage, no DNS changes required.
Option 2: Self-Hosted Proxy with LiteLLM
For teams that need to keep traffic on their own infrastructure (compliance, data residency, or latency requirements), LiteLLM is the most popular open-source LLM proxy.
- Deploy LiteLLM as a Docker container or Kubernetes service:

```shell
docker run -d \
  -p 4000:4000 \
  -e OPENAI_API_KEY="sk-proj-abc123..." \
  -e ANTHROPIC_API_KEY="sk-ant-abc123..." \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```

- Configure models in `config.yaml` with pricing, rate limits, and routing rules.
- Point your applications at the LiteLLM endpoint (`http://litellm:4000/v1`) and issue virtual keys for each team.
- Forward usage logs to CostHawk using LiteLLM's webhook callbacks or by streaming logs to CostHawk's ingestion API for unified dashboarding.
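For reference, a `config.yaml` along these lines maps model aliases to provider credentials. The field names follow LiteLLM's documented schema at the time of writing; verify against the current LiteLLM docs before deploying:

```yaml
# Sketch of a LiteLLM proxy config — verify field names against current docs.
model_list:
  - model_name: gpt-4o                  # alias your applications request
    litellm_params:
      model: openai/gpt-4o              # actual provider/model
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
```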
Option 3: Reverse Proxy with Nginx/Envoy
For organizations with existing reverse proxy infrastructure, you can add an LLM-aware middleware layer:
- Configure an upstream pointing to the provider endpoint.
- Add a Lua or WASM middleware that extracts the request body, counts tokens using a lightweight tokenizer, and logs usage to your analytics pipeline.
- Substitute the API key in the proxy pass directive so applications use internal keys that map to the real provider key at the proxy layer.
This approach offers maximum control but requires the most engineering investment. It is best suited for organizations with dedicated platform teams and existing proxy expertise.
Proxy Security Considerations
An LLM proxy handles sensitive data at every layer — API keys, user prompts, model responses, and cost metadata. Security must be built into the proxy architecture from day one, not bolted on after deployment.
API key protection. The proxy holds your real provider API keys. If the proxy is compromised, an attacker gains access to your provider accounts. Mitigations include: encrypting keys at rest with AES-256 or better (CostHawk uses AES-256-CBC with unique initialization vectors per key), never logging raw keys in request/response logs, rotating provider keys on a regular schedule (quarterly at minimum), and implementing key access controls so that only the proxy's forwarding module can decrypt and use the real key.
Data in transit. All traffic between your application and the proxy, and between the proxy and the provider, must use TLS 1.2 or higher. Verify that your proxy deployment terminates TLS correctly and does not downgrade connections. For managed proxies like CostHawk, TLS is handled automatically. For self-hosted deployments, configure your proxy server with strong cipher suites and enable HSTS headers.
Prompt and response data. The proxy can see the full content of every request and response. This creates a data handling obligation: if your prompts contain PII, PHI, financial data, or other sensitive information, the proxy must handle that data according to your organization's data classification policies. Key questions to answer: Does the proxy log full request/response bodies, or just metadata (token counts, model, latency)? Where are logs stored, and who has access? What is the data retention policy? Is the proxy provider SOC 2 Type II certified? CostHawk logs only metadata by default — token counts, model identifiers, timestamps, and computed costs. Full request/response logging is available as an opt-in feature for teams that need it, with configurable retention and access controls.
Availability and failure modes. If the proxy goes down, your LLM-powered features stop working. This is the most significant operational risk of a proxy architecture. Mitigations include: deploying the proxy with redundancy (multiple instances behind a load balancer), implementing circuit breakers that fail open (bypass the proxy and call the provider directly if the proxy is unresponsive for more than a configurable threshold), setting aggressive timeouts on the proxy's internal operations (key lookup, cache check) so that proxy-side failures add minimal latency, and monitoring proxy health with synthetic probes that send test requests every 30 seconds.
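A fail-open circuit breaker along these lines (thresholds illustrative) trips after consecutive proxy failures and routes traffic directly to the provider until a cooldown elapses:

```javascript
// Fail-open circuit breaker: after maxFailures consecutive proxy errors,
// bypass the proxy and call the provider directly until the cooldown passes.
class FailOpenBreaker {
  constructor(maxFailures = 3, cooldownMs = 30000) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }
  get bypassing() {
    if (this.openedAt === null) return false;
    if (Date.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // cooldown over: try the proxy again
      this.failures = 0;
      return false;
    }
    return true;
  }
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = Date.now();
  }
  recordSuccess() {
    this.failures = 0;
  }
}

const breaker = new FailOpenBreaker(3, 30000);
breaker.recordFailure();
breaker.recordFailure();
const stillProxying = breaker.bypassing; // false: below failure threshold
breaker.recordFailure();
const bypassNow = breaker.bypassing;     // true: breaker open, fail open
```

Note the fail-open choice: for cost tracking, losing observability briefly is usually preferable to losing LLM features entirely.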
Access control and audit. Not everyone in your organization should be able to create wrapped keys, view usage logs, or modify routing rules. Implement role-based access control (RBAC) for proxy administration: developers can view their own key's usage, team leads can view team-level aggregates, and only platform administrators can create keys, modify budgets, or access raw logs. CostHawk integrates with Clerk for authentication and supports organization-level roles that map to these access patterns. Every administrative action — key creation, budget modification, routing rule change — should be recorded in an immutable audit log for compliance and forensic purposes.
Supply chain and dependency risk. If you use a managed proxy service, you are adding a dependency to your critical path. Evaluate the provider's SLA, uptime history, incident response process, and what happens to your data if they go out of business. For self-hosted proxies using open-source software like LiteLLM, audit the project's security track record, dependency tree, and update cadence. Pin your proxy software to a specific version and test updates in staging before rolling out to production.
Frequently Asked Questions
Does an LLM proxy add latency to my API calls?
Yes, but very little. The proxy's key lookup, cache check, and logging typically add 5–20ms per request, which is negligible next to the 200–2,000ms+ of the LLM inference itself.

Can I use an LLM proxy with streaming responses?
Yes. A protocol-aware proxy forwards each SSE chunk to the client in real time while counting output tokens as they arrive, and handles the [DONE] sentinel event correctly to finalize the usage record and close the connection cleanly. CostHawk's proxy handles all streaming variants including OpenAI's chunked responses, Anthropic's event-based streaming, and Google's server-sent events, with correct token counting across all formats.

What happens if the LLM proxy goes down?
Your LLM-powered features depend on it, so availability matters. Mitigate the risk with redundant proxy instances behind a load balancer and fail-open circuit breakers that bypass the proxy and call the provider directly while the proxy is unresponsive.

How does an LLM proxy handle authentication?
Through key substitution. Your application authenticates to the proxy with a wrapped key; the proxy validates it and swaps in the real provider key when forwarding the request. The provider key never leaves the proxy.

Can I run an LLM proxy on-premises for data compliance?
Yes. Self-hosted options such as LiteLLM, or a reverse proxy built on Nginx or Envoy, keep all traffic on your own infrastructure while still providing centralized logging and policy enforcement.

How is an LLM proxy different from a VPN or HTTP proxy?
A generic HTTP proxy or VPN forwards bytes without understanding them. An LLM proxy is protocol-aware: it parses provider API formats, extracts token counts and model identifiers, computes costs, and applies LLM-specific policies such as budget caps and model routing.

Does an LLM proxy work with function calling and tool use?
Yes. Because the proxy preserves the provider-native request and response format, function-calling and tool-use payloads pass through unchanged, and their token usage is tracked like any other request.

Can I use multiple LLM proxies in a chain?
Yes — each proxy treats the next hop as its upstream, so you can, for example, layer a gateway in front of a cost-tracking proxy. Each hop adds its own latency, so keep chains short.
Related Terms
LLM Gateway
An AI-specific API gateway purpose-built for routing LLM requests across providers. Adds model routing, cost tracking, caching, and fallback capabilities that traditional API gateways lack.
API Gateway
A centralized entry point for API traffic that handles routing, authentication, rate limiting, and request transformation. For LLM APIs, gateways add cost tracking, policy enforcement, and provider abstraction.
Wrapped Keys
Proxy API keys that route provider SDK traffic through a cost tracking layer. The original provider key never leaves the server, while the wrapped key provides per-key attribution, budget enforcement, and policy controls without requiring application code changes beyond a base URL swap.
API Key Management
Securing, rotating, scoping, and tracking API credentials across AI providers. Effective key management is the foundation of both cost attribution and security — every unmanaged key is a potential source of untracked spend and unauthorized access.
Model Routing
Dynamically directing AI requests to different models based on task complexity, cost constraints, and quality requirements to achieve optimal cost efficiency.
Rate Limiting
Provider-enforced caps on API requests and tokens per minute that throttle throughput and return HTTP 429 errors when exceeded.
Put this knowledge to work. Track your AI spend in one place.
CostHawk gives engineering teams real-time visibility into every token, every model, and every dollar across your AI stack.
