LLM Cost Optimisation: Caching, Batching & Model Routing
If you're running AI agents in production, you've probably felt the sting of inference costs. A customer support agent that processes 10,000 queries monthly at $0.03 per request? That's $300 — not terrible. But scale that to 100,000 queries across multiple AI agent specialties, and suddenly you're looking at $3,000/month just on inference.
The good news: most teams are overspending by 60-80% because they're not applying basic cost optimisation patterns. Let's fix that.
Pattern 1: Prompt Caching
Prompt caching is the lowest-hanging fruit. The idea: if you're sending the same system prompt or context repeatedly, cache it so you only pay for it once per TTL window instead of on every request.
How it works:
Modern LLM providers (Anthropic's Claude, OpenAI's GPT-4) now support prompt caching at the API level. When you send a request, the provider checks if your prompt prefix matches a cached version. If it does, you pay dramatically reduced rates for the cached portion.
Real numbers:
- Anthropic Claude: Cached input tokens cost 90% less ($0.30 per 1M tokens vs $3.00)
- OpenAI: Up to 50% cost reduction on repeated prompts with their caching layer
- Gemini: Similar 50-75% savings on cached context
When to use it:
- System prompts that rarely change (every AI agent has one)
- RAG contexts pulled from the same knowledge base
- Few-shot examples in your prompts
- Long conversation histories in chatbots
Implementation pattern:
1. Structure prompts with static content at the start
2. Enable caching at the API level (provider-specific)
3. Set TTL to 5-60 minutes depending on update frequency
4. Monitor cache hit rates — aim for 70%+
We implement caching by default in our AI agents stack. For a customer support agent we shipped last quarter, caching reduced inference costs from $1,847 to $412/month — a 77% drop with zero functionality loss.
Pattern 2: Request Batching
Batching is simple: collect multiple requests, send them together, split the response. Most teams don't bother because it adds latency. But for non-realtime workflows (report generation, batch data processing, overnight enrichment jobs), it's a massive win.
The math:
LLM APIs typically charge per token. When you send 10 separate requests vs 1 batched request with the same total tokens, you often pay less for the batched version because:
- Reduced per-request overhead
- Better token packing efficiency
- Some providers offer batch-specific discounts (OpenAI's Batch API is 50% cheaper)
Best use cases:
- Email classification/routing (collect 100 emails, classify in one shot)
- Product description generation for e-commerce catalogs
- Bulk sentiment analysis on customer feedback
- Invoice data extraction across multiple PDFs
Anti-patterns:
Don't batch:
- Real-time chat responses (users won't wait)
- Critical alerts or fraud detection
- Anything with <100ms latency requirements
TechNova implementation:
For our HotelDesk CRM, we batch-process guest review sentiment analysis overnight. Instead of analyzing reviews as they arrive ($0.02 per review × 500 daily reviews = $10/day), we batch 500 reviews every 6 hours. Cost: $2.50/day using OpenAI's Batch API. That's a 75% reduction.
We also use batching for:
- LegalEase contract clause extraction
- PharmaCare prescription validation checks
- CargoTrack shipment document processing
Pattern 3: Model Routing
Not every task needs GPT-4. Model routing means sending requests to the cheapest model that can handle the job. Think of it as load balancing, but for cost instead of servers.
The cost spread:
Pricing as of Q1 2025 (per 1M tokens):
- GPT-4 Turbo: $10 input / $30 output
- GPT-3.5 Turbo: $0.50 input / $1.50 output
- Claude 3 Haiku: $0.25 input / $1.25 output
- Llama 3 (self-hosted): ~$0.10 all-in
That's a 100x difference between top and bottom.
Routing strategies:
Task-based routing
- Complex reasoning → GPT-4 / Claude Opus
- Simple classification → GPT-3.5 / Haiku
- Bulk text generation → Llama / Mixtral
Confidence-based routing
- Try cheap model first
- If confidence score < threshold, retry with expensive model
- Typically saves 60-70% while maintaining accuracy
Hybrid routing
- Use cheap model for first draft
- Expensive model reviews/refines only when needed
Real-world routing table:
- Customer query classification → Haiku ($0.25/1M)
- Legal document summarisation → GPT-4 ($10/1M)
- Invoice data extraction → GPT-3.5 ($0.50/1M)
- Email reply generation → Haiku ($0.25/1M)
- Complex contract analysis → GPT-4 ($10/1M)
Implementation tip:
Build a routing layer that:
- Accepts standardized input
- Classifies task complexity
- Routes to appropriate model
- Logs cost and performance metrics
- Auto-adjusts routing rules based on accuracy/cost data
We've open-sourced a basic version of our router on GitHub. It's saved clients 50-65% on average across mixed workloads.
Combining All Three
The real magic happens when you stack these patterns:
- Cached + routed: System prompts cached, cheap model handles 80% of requests
- Batched + routed: Overnight batch jobs use self-hosted Llama
- All three: RAG pipeline with cached context, batched inference, task-appropriate models
Case study:
A client running an AI-powered helpdesk (similar to our customer support agent specialty) was spending $4,200/month on inference:
- 200k queries/month
- All using GPT-4
- No caching
- Individual API calls
After implementing our stack:
- Prompt caching: -60% (system prompt + FAQ context)
- Model routing: -45% on remaining costs (80% of queries to GPT-3.5)
- Selective batching: -20% on overnight report generation
Final cost: $987/month. That's a 76.5% reduction.
Where to Start
Audit your current spend. Most providers offer cost breakdowns by model/endpoint. Identify your top 3 cost drivers.
Implement caching first. Easiest win, no architecture changes needed. Enable it in your API client, structure prompts correctly, done.
Add routing for new features. Don't refactor everything. Start fresh projects with a routing layer, migrate old code as you touch it.
Batch where latency allows. Look for overnight jobs, reports, bulk operations. These are gimmes.
Monitor and iterate. Track cost-per-task, accuracy, and latency. Adjust routing rules monthly based on real data.
The Bottom Line
LLM costs aren't fixed. With caching, batching, and smart routing, most teams can cut 60-80% without sacrificing quality. These aren't exotic techniques — they're production patterns we use daily across our 49 AI agent specialties.
If you're building custom AI solutions and want help implementing these patterns, we've done this hundreds of times. The ROI shows up in month one.
Time to stop overpaying for tokens. Your CFO will thank you.