How much does prompt caching save?
The short answer
Prompt caching cuts the per-token cost of repeated input by 50% to 90%, depending on the provider and model; most current models in the table below sit at 90%. If your application sends the same long system prompt on every call, as most agents and RAG pipelines do, caching is the single biggest cost lever you have.
Cached-input rates as of April 2026
| Provider | Standard input | Cached input | Discount |
|---|---|---|---|
| OpenAI GPT-5 family | $0.05–$5/M | $0.005–$0.50/M | 90% off |
| OpenAI GPT-4.1 | $2/M | $0.50/M | 75% off |
| OpenAI o3 / o4-mini | $1.10–$2/M | $0.275–$0.55/M | 50–75% off |
| Anthropic Claude Opus 4.8 | $5/M | $0.50/M | 90% off |
| Anthropic Claude Sonnet 4.6 | $3/M | $0.30/M | 90% off |
| Anthropic Claude Haiku 4.5 | $1/M | $0.10/M | 90% off |
| Google Gemini 3.1 Pro Preview | $2/M (≤200k) | $0.20/M | 90% off |
| Google Gemini 2.5 Pro | $1.25/M | $0.125/M | 90% off |
| DeepSeek V3 | $0.27/M | ~$0.027/M | ~90% off |
OpenAI Pro tiers (5.5 Pro, 5.4 Pro, 5.2 Pro, o3-pro) don't qualify for caching as of April 2026. Plan accordingly if you're considering Pro for high-volume workloads.
When caching actually pays back
Caching applies to identical input prefixes: the first N tokens of the prompt must be byte-identical across calls. Practical scenarios:
- Agentic loops with stable system prompts: the system message and tool definitions don't change call-to-call. Cache hit rate: typically 95%+.
- RAG over a fixed document set: same retrieved chunks across many user questions. Cache hit rate: variable, often 60-80%.
- Multi-turn chat: conversation history is the cached prefix; each new message appends to it. Cache hit rate: 90%+ after the first turn.
Scenarios where caching doesn't help:
- Unique prompts per call (every input is different): no cache hits.
- Prompts where dynamic content (user data, timestamps) appears early: anything after the first changed token falls outside the matched prefix.
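To make the prefix rule concrete, here is a minimal sketch that compares two ways of assembling the same prompt. It measures the shared prefix in characters as a rough proxy; providers match on tokens, so real numbers will differ. The `SYSTEM` text, the builder functions, and the timestamps are all hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading characters that are identical in both strings."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


# Hypothetical stable block standing in for a long system prompt.
SYSTEM = "You are a support agent. Follow the policy below.\n" * 20


def prompt_dynamic_first(timestamp: str, question: str) -> str:
    # Timestamp at the top: the prefix diverges almost immediately.
    return f"Time: {timestamp}\n{SYSTEM}User: {question}"


def prompt_stable_first(timestamp: str, question: str) -> str:
    # Stable block first, per-call data last.
    return f"{SYSTEM}Time: {timestamp}\nUser: {question}"


early = shared_prefix_len(
    prompt_dynamic_first("2026-04-01T10:00:00", "Where is my order?"),
    prompt_dynamic_first("2026-04-01T10:00:07", "Can I get a refund?"),
)
late = shared_prefix_len(
    prompt_stable_first("2026-04-01T10:00:00", "Where is my order?"),
    prompt_stable_first("2026-04-01T10:00:07", "Can I get a refund?"),
)

print(f"dynamic-first shares {early} chars; stable-first shares {late} chars")
```

In the dynamic-first layout only the opening of the timestamp line matches, so none of the system block is reusable. In the stable-first layout the entire system block sits inside the shared prefix.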
Cost example
A typical agent: 4,000-token stable system prompt + 200-token user message + 100-token reply, called 1 million times per month on Claude Sonnet 4.6.
Without caching:
- Input: 4,200 × $3/M × 1M = $12,600
- Output: 100 × $15/M × 1M = $1,500
- Total: $14,100/month
With caching (4,000-token system prompt cached, 200-token user message uncached, assuming every call hits the cache and ignoring cache-write charges):
- Cached input: 4,000 × $0.30/M × 1M = $1,200
- Uncached input: 200 × $3/M × 1M = $600
- Output: 100 × $15/M × 1M = $1,500
- Total: $3,300/month
Savings: $10,800/month, a 77% reduction.
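The arithmetic above can be reproduced with a short function, sketched here under the same simplifications (no cache-write charges). The function name and the `hit_rate` parameter are illustrative additions; the rates are the Claude Sonnet 4.6 figures from the example.

```python
def monthly_cost(calls, cached_tokens, uncached_tokens, output_tokens,
                 input_rate, cached_rate, output_rate, hit_rate=1.0):
    """Monthly cost in dollars; all rates are dollars per million tokens.

    Simplification: a cache miss on the cacheable prefix is billed at the
    standard input rate, and cache-write premiums are ignored.
    """
    prefix_rate = hit_rate * cached_rate + (1 - hit_rate) * input_rate
    per_call = (
        cached_tokens * prefix_rate
        + uncached_tokens * input_rate
        + output_tokens * output_rate
    )
    return calls * per_call / 1_000_000


# Figures from the example: 4,000 + 200 input tokens, 100 output tokens, 1M calls.
workload = dict(calls=1_000_000, cached_tokens=4_000, uncached_tokens=200,
                output_tokens=100, input_rate=3.00, cached_rate=0.30,
                output_rate=15.00)

without = monthly_cost(**workload, hit_rate=0.0)  # every prefix billed at $3/M
with_cache = monthly_cost(**workload)             # every call hits the cache
saving = without - with_cache

print(f"without: ${without:,.0f}  with: ${with_cache:,.0f}  "
      f"saving: ${saving:,.0f} ({saving / without:.0%})")
```

Passing a lower `hit_rate`, for example `0.95`, shows how quickly the saving erodes when some calls miss the cache.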
Caveats
- Cache TTL varies: OpenAI's prompt cache lasts roughly 5 minutes; Anthropic supports both 5-minute and 1-hour caches with different write costs; Google's cache TTL is hour-scale with explicit storage pricing ($1-$4.50/M tokens/hour depending on context size). Storage fees accrue for as long as a long-lived cache is kept.
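- Cache writes cost more than cached reads for Anthropic ($3.75/M for 5-minute writes on Sonnet vs $0.30/M for hits). A write costs 25% more than a standard $3/M input call, while each hit saves $2.70/M, so at these rates one hit per write already recovers the premium. Caching loses money only when a written prefix expires before it is read again.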
- Reorder prompts to maximize cache reuse: put stable content first, dynamic content last. This is the single biggest knob you have on cache hit rate.
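The write-versus-read trade-off can be checked with a few lines of arithmetic. This is a sketch under a simple model: one cache write followed by some number of hits before the entry expires, compared with sending the same prefix uncached each time. The function names are illustrative; the rates are the Sonnet figures quoted above ($3/M standard, $3.75/M 5-minute write, $0.30/M hit).

```python
def cost_with_cache(reads: int, write_rate: float, read_rate: float) -> float:
    """Dollars per million prefix tokens: one cache write, then `reads` hits."""
    return write_rate + reads * read_rate


def cost_without_cache(reads: int, input_rate: float) -> float:
    """The same 1 + reads calls, all billed at the standard input rate."""
    return (1 + reads) * input_rate


def break_even_reads(input_rate: float, write_rate: float, read_rate: float) -> float:
    """Hits per write at which caching costs the same as not caching."""
    return (write_rate - input_rate) / (input_rate - read_rate)


INPUT, WRITE, READ = 3.00, 3.75, 0.30  # $/M tokens, Sonnet rates from the text

print(f"break-even hits per write: {break_even_reads(INPUT, WRITE, READ):.2f}")
for reads in (0, 1, 5):
    cached = cost_with_cache(reads, WRITE, READ)
    plain = cost_without_cache(reads, INPUT)
    print(f"{reads} hits: cached ${cached:.2f}/M vs uncached ${plain:.2f}/M")
```

Under this model the premium is recovered within the first hit. The case to watch is the zero-hit row: a prefix that is written and then expires unread costs more than never caching it.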
Get a real estimate for your workload
Paste your prompt into the counter to see the token count, then multiply input tokens by the cached-rate column above to estimate your post-caching cost. The actual savings depend on your cache hit rate, which you measure in production.
Try this on every model (input/output price per million tokens)
- Claude Opus 4.8 $5.00/$25.00
- Claude Opus 4.8 (Fast Mode) $10.00/$50.00
- Claude Sonnet 4.6 $3.00/$15.00
- Claude Haiku 4.5 $1.00/$5.00
- GPT-5.5 $5.00/$30.00
- GPT-5.5 Pro $30.00/$180.00
- GPT-5.4 $2.50/$15.00
- GPT-5.4 Mini $0.75/$4.50
- GPT-5.4 Nano $0.20/$1.25
- GPT-5.4 Pro $30.00/$180.00
- GPT-5.3 $1.75/$14.00
- GPT-5.2 $1.75/$14.00
- GPT-5.2 Pro $21.00/$168.00
- GPT-5.1 $1.25/$10.00
- GPT-5 $1.25/$10.00
- GPT-5 Mini $0.25/$2.00
- GPT-5 Nano $0.05/$0.40
- GPT-5 Pro $15.00/$120.00
- GPT-4.1 $2.00/$8.00
- GPT-4.1 Mini $0.40/$1.60
- GPT-4.1 Nano $0.10/$0.40
- o3 $2.00/$8.00
- o3-mini $1.10/$4.40
- o3-pro $20.00/$80.00
- o4-mini $1.10/$4.40
- GPT-4o $2.50/$10.00
- GPT-4o mini $0.15/$0.60
- GPT-4 Turbo $10.00/$30.00
- Gemini 3.1 Pro $2.00/$12.00
- Gemini 3 Flash $0.50/$3.00
- Gemini 3.1 Flash-Lite $0.25/$1.50
- Gemini 2.5 Pro $1.25/$10.00
- Gemini 2.5 Flash $0.30/$2.50
- Gemini 2.5 Flash-Lite $0.10/$0.40
- Llama 3.3 70B $0.88/$0.88
- Llama 3.1 405B $3.50/$3.50
- Llama 3.1 70B $0.59/$0.79
- Llama 3.1 8B $0.18/$0.18
- Mistral Large $2.00/$6.00
- DeepSeek V3 $0.27/$1.10
- DeepSeek V3.1 $0.60/$1.70
- DeepSeek R1 $3.00/$7.00
- Qwen 2.5 72B $0.90/$0.90
- Qwen 2.5 Coder 32B $0.80/$0.80
- Qwen3 Coder 480B $2.00/$2.00
- GLM-5.1 $1.40/$4.40