LLM Cost Optimisation: Caching, Batching & Model Routing

If you're running AI agents in production, you've probably felt the sting of inference costs. A customer support agent that processes 10,000 queries monthly at $0.03 per request? That's $300 — not terrible. But scale that to 100,000 queries across multiple AI agent specialties, and suddenly you're looking at $3,000/month just on inference.

The good news: most teams are overspending by 60-80% because they're not applying basic cost optimisation patterns. Let's fix that.

LLM Cost Optimisation: Caching, Batching & Model Routing — illustration 1

Pattern 1: Prompt Caching

Prompt caching is the lowest-hanging fruit. The idea: if you're sending the same system prompt or context repeatedly, cache it so you only pay for it once per TTL window instead of on every request.

How it works:

Modern LLM providers (Anthropic's Claude, OpenAI's GPT-4) now support prompt caching at the API level. When you send a request, the provider checks if your prompt prefix matches a cached version. If it does, you pay dramatically reduced rates for the cached portion.

Real numbers:

Anthropic Claude: Cached input tokens cost 90% less ($0.30 per 1M tokens vs $3.00)
OpenAI: Up to 50% cost reduction on repeated prompts with their caching layer
Gemini: Similar 50-75% savings on cached context

When to use it:

System prompts that rarely change (every AI agent has one)
RAG contexts pulled from the same knowledge base
Few-shot examples in your prompts
Long conversation histories in chatbots

Implementation pattern:

1. Structure prompts with static content at the start
2. Enable caching at the API level (provider-specific)
3. Set TTL to 5-60 minutes depending on update frequency
4. Monitor cache hit rates — aim for 70%+

We implement caching by default in our AI agents stack. For a customer support agent we shipped last quarter, caching reduced inference costs from $1,847 to $412/month — a 77% drop with zero functionality loss.

Pattern 2: Request Batching

Batching is simple: collect multiple requests, send them together, split the response. Most teams don't bother because it adds latency. But for non-realtime workflows (report generation, batch data processing, overnight enrichment jobs), it's a massive win.

The math:

LLM APIs typically charge per token. When you send 10 separate requests vs 1 batched request with the same total tokens, you often pay less for the batched version because:

Reduced per-request overhead
Better token packing efficiency
Some providers offer batch-specific discounts (OpenAI's Batch API is 50% cheaper)

Best use cases:

Email classification/routing (collect 100 emails, classify in one shot)
Product description generation for e-commerce catalogs
Bulk sentiment analysis on customer feedback
Invoice data extraction across multiple PDFs

Anti-patterns:

Don't batch:

Real-time chat responses (users won't wait)
Critical alerts or fraud detection
Anything with <100ms latency requirements

TechNova implementation:

For our HotelDesk CRM, we batch-process guest review sentiment analysis overnight. Instead of analyzing reviews as they arrive ($0.02 per review × 500 daily reviews = $10/day), we batch 500 reviews every 6 hours. Cost: $2.50/day using OpenAI's Batch API. That's a 75% reduction.

We also use batching for:

LegalEase contract clause extraction
PharmaCare prescription validation checks
CargoTrack shipment document processing

Pattern 3: Model Routing

Not every task needs GPT-4. Model routing means sending requests to the cheapest model that can handle the job. Think of it as load balancing, but for cost instead of servers.

The cost spread:

Pricing as of Q1 2025 (per 1M tokens):

GPT-4 Turbo: $10 input / $30 output
GPT-3.5 Turbo: $0.50 input / $1.50 output
Claude 3 Haiku: $0.25 input / $1.25 output
Llama 3 (self-hosted): ~$0.10 all-in

That's a 100x difference between top and bottom.

Routing strategies:

Task-based routing
- Complex reasoning → GPT-4 / Claude Opus
- Simple classification → GPT-3.5 / Haiku
- Bulk text generation → Llama / Mixtral
Confidence-based routing
- Try cheap model first
- If confidence score < threshold, retry with expensive model
- Typically saves 60-70% while maintaining accuracy
Hybrid routing
- Use cheap model for first draft
- Expensive model reviews/refines only when needed

Real-world routing table:

Customer query classification → Haiku ($0.25/1M)
Legal document summarisation → GPT-4 ($10/1M)
Invoice data extraction → GPT-3.5 ($0.50/1M)
Email reply generation → Haiku ($0.25/1M)
Complex contract analysis → GPT-4 ($10/1M)

Implementation tip:

Build a routing layer that:

Accepts standardized input
Classifies task complexity
Routes to appropriate model
Logs cost and performance metrics
Auto-adjusts routing rules based on accuracy/cost data

We've open-sourced a basic version of our router on GitHub. It's saved clients 50-65% on average across mixed workloads.

Combining All Three

The real magic happens when you stack these patterns:

Cached + routed: System prompts cached, cheap model handles 80% of requests
Batched + routed: Overnight batch jobs use self-hosted Llama
All three: RAG pipeline with cached context, batched inference, task-appropriate models

Case study:

A client running an AI-powered helpdesk (similar to our customer support agent specialty) was spending $4,200/month on inference:

200k queries/month
All using GPT-4
No caching
Individual API calls

After implementing our stack:

Prompt caching: -60% (system prompt + FAQ context)
Model routing: -45% on remaining costs (80% of queries to GPT-3.5)
Selective batching: -20% on overnight report generation

Final cost: $987/month. That's a 76.5% reduction.

Where to Start

Audit your current spend. Most providers offer cost breakdowns by model/endpoint. Identify your top 3 cost drivers.
Implement caching first. Easiest win, no architecture changes needed. Enable it in your API client, structure prompts correctly, done.
Add routing for new features. Don't refactor everything. Start fresh projects with a routing layer, migrate old code as you touch it.
Batch where latency allows. Look for overnight jobs, reports, bulk operations. These are gimmes.
Monitor and iterate. Track cost-per-task, accuracy, and latency. Adjust routing rules monthly based on real data.

The Bottom Line

LLM costs aren't fixed. With caching, batching, and smart routing, most teams can cut 60-80% without sacrificing quality. These aren't exotic techniques — they're production patterns we use daily across our 49 AI agent specialties.

If you're building custom AI solutions and want help implementing these patterns, we've done this hundreds of times. The ROI shows up in month one.

Time to stop overpaying for tokens. Your CFO will thank you.

LLM Cost Optimisation: Caching, Batching & Model Routing

The good news: most teams are overspending by 60-80% because they're not applying basic cost optimisation patterns. Let's fix that.

LLM Cost Optimisation: Caching, Batching & Model Routing — illustration 1

Pattern 1: Prompt Caching

Prompt caching is the lowest-hanging fruit. The idea: if you're sending the same system prompt or context repeatedly, cache it so you only pay for it once per TTL window instead of on every request.

How it works:

Real numbers:

Anthropic Claude: Cached input tokens cost 90% less ($0.30 per 1M tokens vs $3.00)
OpenAI: Up to 50% cost reduction on repeated prompts with their caching layer
Gemini: Similar 50-75% savings on cached context

When to use it:

System prompts that rarely change (every AI agent has one)
RAG contexts pulled from the same knowledge base
Few-shot examples in your prompts
Long conversation histories in chatbots

Implementation pattern:

1. Structure prompts with static content at the start
2. Enable caching at the API level (provider-specific)
3. Set TTL to 5-60 minutes depending on update frequency
4. Monitor cache hit rates — aim for 70%+

Pattern 2: Request Batching

The math:

LLM APIs typically charge per token. When you send 10 separate requests vs 1 batched request with the same total tokens, you often pay less for the batched version because:

Reduced per-request overhead
Better token packing efficiency
Some providers offer batch-specific discounts (OpenAI's Batch API is 50% cheaper)

Best use cases:

Email classification/routing (collect 100 emails, classify in one shot)
Product description generation for e-commerce catalogs
Bulk sentiment analysis on customer feedback
Invoice data extraction across multiple PDFs

Anti-patterns:

Don't batch:

Real-time chat responses (users won't wait)
Critical alerts or fraud detection
Anything with <100ms latency requirements

TechNova implementation:

We also use batching for:

LegalEase contract clause extraction
PharmaCare prescription validation checks
CargoTrack shipment document processing

Pattern 3: Model Routing

Not every task needs GPT-4. Model routing means sending requests to the cheapest model that can handle the job. Think of it as load balancing, but for cost instead of servers.

The cost spread:

Pricing as of Q1 2025 (per 1M tokens):

GPT-4 Turbo: $10 input / $30 output
GPT-3.5 Turbo: $0.50 input / $1.50 output
Claude 3 Haiku: $0.25 input / $1.25 output
Llama 3 (self-hosted): ~$0.10 all-in

That's a 100x difference between top and bottom.

Routing strategies:

Task-based routing
- Complex reasoning → GPT-4 / Claude Opus
- Simple classification → GPT-3.5 / Haiku
- Bulk text generation → Llama / Mixtral
Confidence-based routing
- Try cheap model first
- If confidence score < threshold, retry with expensive model
- Typically saves 60-70% while maintaining accuracy
Hybrid routing
- Use cheap model for first draft
- Expensive model reviews/refines only when needed

Real-world routing table:

Customer query classification → Haiku ($0.25/1M)
Legal document summarisation → GPT-4 ($10/1M)
Invoice data extraction → GPT-3.5 ($0.50/1M)
Email reply generation → Haiku ($0.25/1M)
Complex contract analysis → GPT-4 ($10/1M)

Implementation tip:

Build a routing layer that:

Accepts standardized input
Classifies task complexity
Routes to appropriate model
Logs cost and performance metrics
Auto-adjusts routing rules based on accuracy/cost data

We've open-sourced a basic version of our router on GitHub. It's saved clients 50-65% on average across mixed workloads.

Combining All Three

The real magic happens when you stack these patterns:

Cached + routed: System prompts cached, cheap model handles 80% of requests
Batched + routed: Overnight batch jobs use self-hosted Llama
All three: RAG pipeline with cached context, batched inference, task-appropriate models

Case study:

A client running an AI-powered helpdesk (similar to our customer support agent specialty) was spending $4,200/month on inference:

200k queries/month
All using GPT-4
No caching
Individual API calls

After implementing our stack:

Prompt caching: -60% (system prompt + FAQ context)
Model routing: -45% on remaining costs (80% of queries to GPT-3.5)
Selective batching: -20% on overnight report generation

Final cost: $987/month. That's a 76.5% reduction.

Where to Start

Audit your current spend. Most providers offer cost breakdowns by model/endpoint. Identify your top 3 cost drivers.
Implement caching first. Easiest win, no architecture changes needed. Enable it in your API client, structure prompts correctly, done.
Add routing for new features. Don't refactor everything. Start fresh projects with a routing layer, migrate old code as you touch it.
Batch where latency allows. Look for overnight jobs, reports, bulk operations. These are gimmes.
Monitor and iterate. Track cost-per-task, accuracy, and latency. Adjust routing rules monthly based on real data.

The Bottom Line

If you're building custom AI solutions and want help implementing these patterns, we've done this hundreds of times. The ROI shows up in month one.

Time to stop overpaying for tokens. Your CFO will thank you.

LLM Cost Optimisation: Caching, Batching & Model Routing

LLM Cost Optimisation: Caching, Batching & Model Routing

Pattern 1: Prompt Caching

Pattern 2: Request Batching

Pattern 3: Model Routing

Combining All Three

Where to Start

The Bottom Line

TechNova Team

More AI

The Eval Harness Every AI Feature Needs: Promptfoo + Langfuse

Customer Support Agents That Actually Deflect: 50+ Deployments Later

Building Observable AI Agents: Langfuse + Braintrust in Practice

Ready to ship the software your business actually runs on?

LLM Cost Optimisation: Caching, Batching & Model Routing

LLM Cost Optimisation: Caching, Batching & Model Routing

Pattern 1: Prompt Caching

Pattern 2: Request Batching

Pattern 3: Model Routing

Combining All Three

Where to Start

The Bottom Line

TechNova Team

More AI

The Eval Harness Every AI Feature Needs: Promptfoo + Langfuse

Customer Support Agents That Actually Deflect: 50+ Deployments Later

Building Observable AI Agents: Langfuse + Braintrust in Practice