AI Cost Optimization - Cut Your LLM Bills by 50-90%
The real costs of running AI in production and the playbook to slash them. Pricing tables, routing strategies, caching, and self-hosting economics.
A disciplined team can cut AI production costs by 70-90% while maintaining output quality.
LLM API Pricing (April 2026)
| Model | Input $/MTok | Output $/MTok | Cached Input | Batch |
|---|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | $0.50 | $1.00/$4.00 |
| GPT-4.1 mini | $0.40 | $1.60 | $0.10 | $0.20/$0.80 |
| GPT-4.1 nano | $0.10 | $0.40 | $0.025 | $0.05/$0.20 |
| Claude Opus 4.5 | $5.00 | $25.00 | $0.50 | $2.50/$12.50 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $0.30 | $1.50/$7.50 |
| Claude Haiku 3 | $0.25 | $1.25 | $0.03 | $0.125/$0.625 |
| Gemini 2.5 Pro | $1.25 | $10.00 | - | - |
| Gemini 2.5 Flash | $0.30 | $3.00 | - | - |
| Gemini Flash-Lite | $0.10 | $0.40 | - | - |
| DeepSeek V3.2 | $0.27 | $1.10 | $0.07 | - |
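The table translates into per-request dollars with simple arithmetic. A minimal helper, with a few prices hard-coded from the table above (model keys are illustrative names, not official API identifiers):

```python
# Estimate per-request and monthly API cost from $/MTok pricing.
PRICES = {  # (input $/MTok, output $/MTok), from the pricing table
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "claude-sonnet-4.5": (3.00, 15.00),
    "claude-haiku-3": (0.25, 1.25),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

def monthly_cost(model: str, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Dollar cost of a steady daily workload over one month."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days
```

A typical chat request (1K input, 500 output tokens) on Sonnet works out to about a penny, which is where the monthly numbers later in this piece come from.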
Prompt Caching - The Biggest Single Lever
Three Tiers of Caching
| Tier | How It Works | Savings | Complexity |
|---|---|---|---|
| Exact Match | Same prompt → cached response | 30-40% | Low |
| Prefix/Provider | Cache system instructions + few-shot examples. Anthropic: 90% savings on hits. OpenAI: 50%. | 50-70% | Low |
| Semantic | Vectorize queries, match similar ones. "Refund policy?" serves cached "How do I get a refund?" | 70-90% | Medium |
Roughly 31% of LLM queries are semantically similar to earlier ones - without caching, that's massive waste. A customer support system handling 50K queries/day where 30% repeat earlier questions eliminates 15,000 unnecessary API calls daily.
Other Quick Wins
- Batch API: 50% discount on all tokens (OpenAI + Anthropic). Results within 24 hours. Use for classification, summarization, data extraction.
- Prompt compression: Strip verbose instructions, use JSON/YAML instead of prose. 20-40% token reduction with no quality loss.
Model Routing & Cascading
85% of production queries don't need a frontier model. Research using RouteLLM shows you can maintain 95% of frontier quality while routing 85% of queries to cheaper models.
Complexity-Based Router
- Simple (classification, FAQ, extraction) → GPT-4.1 nano ($0.10/$0.40) or Haiku 3 ($0.25/$1.25)
- Medium (summarization, standard generation) → GPT-4.1 mini ($0.40/$1.60) or Haiku 4.5 ($1/$5)
- Complex (multi-step reasoning, creative, code) → GPT-4.1 ($2/$8) or Sonnet 4.5 ($3/$15)
- Hardest (research, novel analysis) → Claude Opus ($5/$25) or GPT-5 ($1.25/$10)
Cost Impact - Real Numbers
100K daily requests, all through Claude Sonnet ($3/$15 per MTok), avg 1K input + 500 output tokens:
| Approach | Monthly Cost | Savings |
|---|---|---|
| All Sonnet (no routing) | $31,500 | - |
| Routed (70% Haiku / 20% Sonnet / 10% Opus) | $13,387 | 57% |
| Routed + semantic caching | ~$5,000 | 84% |
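The first two rows of the table follow directly from the pricing figures; a short script reproduces them (30-day month assumed):

```python
# Reproduce the routing cost table: 100K requests/day,
# 1K input + 500 output tokens, prices in $/MTok.
REQS_PER_DAY, IN_TOK, OUT_TOK, DAYS = 100_000, 1_000, 500, 30

def per_request(in_price: float, out_price: float) -> float:
    return IN_TOK / 1e6 * in_price + OUT_TOK / 1e6 * out_price

sonnet = per_request(3.00, 15.00)   # $0.0105 per request
haiku  = per_request(0.25, 1.25)    # $0.000875 per request
opus   = per_request(5.00, 25.00)   # $0.0175 per request

all_sonnet = sonnet * REQS_PER_DAY * DAYS                              # $31,500
routed = (0.7 * haiku + 0.2 * sonnet + 0.1 * opus) * REQS_PER_DAY * DAYS  # ~$13,388
```

The semantic-caching row is an estimate layered on top (roughly a third of the routed calls never hit the API), so it doesn't reduce to a single formula the same way.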
Self-Hosting Economics
| GPU | Provider | On-Demand $/hr | Spot $/hr | Reserved $/hr |
|---|---|---|---|---|
| 8× H100 (p5.48xlarge) | AWS | $98.32 | $30-$50 | ~$60-$70 |
| 8× A100 (p4d.24xlarge) | AWS | $32.77 | $10-$15 | ~$20-$22 |
| 1× A10G (g5.xlarge) | AWS | $1.006 | $0.30-$0.50 | ~$0.63 |
| 1× H100 | Alt providers | $2.01 | $0.99 | - |
Break-Even: 10M Tokens/Day Through GPT-4o Equivalent
- API (GPT-4o): ~$375,000/year
- Self-hosted Llama 70B on 2× H100 (on-demand): ~$215,000/year
- Self-hosted on spot/reserved: ~$52,000-$105,000/year
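The self-hosted figure can be reproduced from the GPU table above; the API figure depends heavily on your input/output token mix, so the helper below takes a blended $/MTok price as a parameter rather than asserting one. A sketch, assuming 24/7 utilization and the per-GPU rate implied by p5.48xlarge ($98.32 across 8 GPUs):

```python
# Annualize GPU rental vs. API spend for a break-even comparison.
def gpu_cost_per_year(num_gpus: int, price_per_gpu_hour: float) -> float:
    return num_gpus * price_per_gpu_hour * 24 * 365

def api_cost_per_year(tokens_per_day: float, blended_price_per_mtok: float) -> float:
    return tokens_per_day / 1e6 * blended_price_per_mtok * 365

# 2x H100 on-demand: matches the ~$215K/year figure above.
self_hosted = gpu_cost_per_year(2, 98.32 / 8)
```

Note the asymmetry: GPU cost is flat per hour regardless of traffic, while API cost scales linearly with tokens. Below the break-even volume the API always wins; above it, the fixed GPU cost amortizes.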
Quantization - Cheaper Hardware, Same Quality
| Precision | 7B Model | 70B Model | Cost Reduction |
|---|---|---|---|
| FP16 | 14 GB | 140 GB | Baseline |
| INT8 | 7 GB | 70 GB | 50% |
| INT4 | 3.5 GB | 35 GB | 75% |
AWQ (activation-aware) offers the best quality at 4-bit on GPUs. GGUF is best for CPU/edge. At 4-bit, a Llama 70B that normally needs multiple A100s shrinks to ~35 GB of weights, enough to fit a single 48 GB GPU or a pair of RTX 4090s, with ~1-2% quality loss.
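The table's memory figures come from one rule of thumb: weight memory is parameter count times bytes per parameter (weights only; KV cache and activations add more on top). As a quick calculator:

```python
# Weight memory footprint = parameters x bytes per parameter.
# Covers weights only; KV cache and activations need extra headroom.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision]
```

This is how you sanity-check whether a quantized model fits a given card before renting it: a 4-bit 70B needs ~35 GB for weights alone, so a 24 GB consumer GPU is out, but a 48 GB card (or two 24 GB cards with tensor parallelism) works.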
RAG Cost Optimization
Vector DB Pricing - The Real Numbers
Actual bills commonly run 2.5-4× above pricing-page estimates:
| Database | 1M Vectors | 10M Vectors | 100M Vectors |
|---|---|---|---|
| Pinecone | $70-$150/mo | $800-$2,500 | $15,000-$28,000 |
| Weaviate Cloud | $180-$320 | $900-$1,600 | $6,000-$12,000 |
| Qdrant Cloud | $150-$280 | $750-$1,400 | $5,500-$10,000 |
| pgvector (RDS) | $120-$200 | $600-$1,200 | $4,000-$8,000 |
| Qdrant self-hosted | $90-$160 | $450-$850 | $2,800-$5,500 |
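One reason bills exceed estimates: pricing pages quote raw vector storage, while real deployments pay for index overhead, replicas, and metadata on top. The raw baseline is easy to compute (1536 dimensions, the common embedding size, is an assumption here):

```python
# Raw storage for dense float32 vectors: count x dimensions x 4 bytes.
# Real bills run well above this (index structures, replicas, metadata).
def raw_vector_gb(num_vectors: int, dims: int = 1536) -> float:
    return num_vectors * dims * 4 / 1e9
```

1M vectors at 1536 dims is only ~6 GB raw, so when a managed bill for that workload reaches hundreds of dollars a month, you are mostly paying for compute, indexing, and replication, not bytes. That multiplier is what the 2.5-4× gap reflects.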
Fine-Tuning ROI
The distillation pattern is 2026's dominant approach: use a frontier model as "teacher" to generate synthetic training data, then fine-tune a cheap model as "student." Result: 10× cheaper inference with near-frontier quality on your domain.
| When It Saves Money | When It Wastes Money |
|---|---|
| High-volume, narrow-domain (>100K req/month) | Low volume (<10K req/month) |
| Need consistent output format prompting can't achieve | Rapidly changing requirements |
| Can replace $3-$15/MTok model with $0.10-$0.40 | General-purpose tasks where frontier models excel |
| ROI: $5K fine-tuning → saves $2K/month → 2.5 month payback | When RAG can provide the domain knowledge instead |
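The ROI row in the table reduces to a one-line payback formula, worth writing down because it makes the "low volume" column obvious: small monthly savings stretch the payback past the point where requirements change.

```python
# Fine-tuning payback: months until the one-time tuning cost is
# recovered by cheaper per-request inference.
def payback_months(tuning_cost: float, monthly_savings: float) -> float:
    if monthly_savings <= 0:
        raise ValueError("fine-tuning never pays back without inference savings")
    return tuning_cost / monthly_savings
```

$5K of tuning against $2K/month of savings pays back in 2.5 months; the same $5K against $100/month takes over four years, by which time the base models have moved on.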
The Complete Cost Optimization Playbook
🏃 Quick Wins (Week 1 - 30-50% savings)
- Audit current spend - know your cost per query, per feature, per user
- Switch simple tasks to cheap models - GPT-4.1 nano ($0.10) or Gemini Flash-Lite ($0.10)
- Enable prompt caching - Anthropic (90% on hits) and OpenAI (50%) support it natively
- Use Batch API for non-real-time workloads - instant 50% discount
📈 Medium-Term (Weeks 2-4 - additional 20-40%)
- Implement model routing - send 70%+ of queries to cheap models
- Add semantic caching - eliminates 30%+ of redundant API calls
- Compress prompts - structured formats, strip verbose instructions
- Right-size vector DB - reduce replication, clean unused indexes
🎯 Strategic (Months 2-6 - additional 20-50%)
- Fine-tune/distill for high-volume narrow tasks
- Self-host if >50M tokens/day sustained
- Quantize self-hosted models (4-bit AWQ for GPU, GGUF for CPU)
- Switch to open-source embeddings for high-volume RAG
The Bottom Line
A disciplined team applying these techniques can realistically cut AI production costs by 70-90% while maintaining equivalent output quality. The biggest lever is model selection (cheap models for easy tasks), followed by caching, then batching. Self-hosting and fine-tuning deliver the largest absolute savings at scale. And run a quarterly model audit - prices drop ~10× per year for equivalent quality.