How much does prompt caching save?

Prompt caching cuts repeated-input cost 75-90% across OpenAI, Anthropic, Google, and DeepSeek. Per-provider cached vs standard rates, and when savings pay back.

Updated 2026-05-31 · By Clinton Patrick · Methodology

The short answer

Prompt caching cuts the per-token cost of repeated input prompts by 75% to 90% on most models, depending on the provider (the table below lists 50–75% for OpenAI's o3 and o4-mini). If your application sends the same long system prompt repeatedly (most agents, most RAG pipelines), caching is the single biggest cost lever you have.

Cached-input rates as of April 2026

Provider	Standard input	Cached input	Discount
OpenAI GPT-5 family	$0.05–$5/M	$0.005–$0.50/M	90% off
OpenAI GPT-4.1	$2/M	$0.50/M	75% off
OpenAI o3 / o4-mini	$1.10–$2/M	$0.275–$0.55/M	50–75% off
Anthropic Claude Opus 4.8	$5/M	$0.50/M	90% off
Anthropic Claude Sonnet 4.6	$3/M	$0.30/M	90% off
Anthropic Claude Haiku 4.5	$1/M	$0.10/M	90% off
Google Gemini 3.1 Pro Preview	$2/M (≤200k)	$0.20/M	90% off
Google Gemini 2.5 Pro	$1.25/M	$0.125/M	90% off
DeepSeek V3	$0.27/M	~$0.027/M	~90% off

OpenAI Pro tiers (5.5 Pro, 5.4 Pro, 5.2 Pro, o3-pro) don't qualify for caching as of April 2026. Plan accordingly if you're considering Pro for high-volume workloads.

When caching actually pays back

Caching applies only to identical input prefixes: the prompt's first N tokens must be byte-identical across calls. Scenarios where it pays off:

Agentic loops with stable system prompts: the system message and tool definitions don't change call-to-call. Cache hit rate: typically 95%+.
RAG over a fixed document set: same retrieved chunks across many user questions. Cache hit rate: variable, often 60-80%.
Multi-turn chat: conversation history is the cached prefix; each new message appends to it. Cache hit rate: 90%+ after the first turn.
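
To see how hit rates like these translate into price, here is a rough sketch. It assumes the hit rate is the fraction of input tokens billed at the cached rate and ignores cache-write premiums and storage fees; the function name is illustrative, and the Sonnet rates are taken from the table above.

```python
def blended_input_rate(standard_per_m: float, cached_per_m: float, hit_rate: float) -> float:
    """Average $ per 1M input tokens when `hit_rate` of them are billed at the cached rate.

    Assumption: cache-write premiums and storage fees are ignored.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_per_m + (1.0 - hit_rate) * standard_per_m


if __name__ == "__main__":
    # Claude Sonnet 4.6: $3/M standard input, $0.30/M cached input.
    scenarios = [("agentic loop", 0.95), ("RAG, low end", 0.60), ("multi-turn chat", 0.90)]
    for name, hit_rate in scenarios:
        rate = blended_input_rate(3.00, 0.30, hit_rate)
        print(f"{name}: ${rate:.3f}/M blended input")
```

The weighted average makes the sensitivity visible: at these rates, every 10 points of hit rate moves the blended input price by $0.27/M.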

Scenarios where caching doesn't help:

Unique prompts per call (every input is different): no cache hits.
Prompts where dynamic content (user data, timestamps) appears early: it breaks the prefix match for everything that follows.
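
The prefix rule can be illustrated with a small sketch. It compares two prompts byte by byte as a stand-in for the provider's token-level matching (real providers match on tokens and may impose minimum prefix lengths, so treat this as a proxy). The system prompt text and both builder functions are hypothetical.

```python
SYSTEM = "You are a support agent for Acme. Follow the policies below.\n"  # hypothetical


def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading UTF-8 bytes that two prompts have in common."""
    count = 0
    for x, y in zip(a.encode("utf-8"), b.encode("utf-8")):
        if x != y:
            break
        count += 1
    return count


def timestamp_first(ts: str, question: str) -> str:
    # Dynamic content leads, so the prompts diverge almost immediately.
    return f"[{ts}]\n{SYSTEM}{question}"


def stable_first(ts: str, question: str) -> str:
    # Stable content leads, so the whole system prompt is a shared prefix.
    return f"{SYSTEM}[{ts}]\n{question}"


if __name__ == "__main__":
    calls = [("2026-04-01T10:00:00Z", "Where is my order?"),
             ("2026-04-01T10:00:07Z", "Can I get a refund?")]
    for build in (timestamp_first, stable_first):
        a, b = (build(ts, q) for ts, q in calls)
        print(f"{build.__name__}: {shared_prefix_len(a, b)} shared bytes "
              f"(system prompt is {len(SYSTEM)} bytes)")
```

With the timestamp first, the shared prefix ends inside the timestamp and none of the system prompt is reusable; with the stable text first, the whole system prompt is shared.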

Cost example

A typical agent: 4,000-token stable system prompt + 200-token user message + 100-token reply, called 1 million times per month on Claude Sonnet 4.6.

Without caching:

Input: 4,200 × $3/M × 1M = $12,600
Output: 100 × $15/M × 1M = $1,500
Total: $14,100/month

With caching (4,000-token system prompt cached, 200-token user message uncached):

Cached input: 4,000 × $0.30/M × 1M = $1,200
Uncached input: 200 × $3/M × 1M = $600
Output: 100 × $15/M × 1M = $1,500
Total: $3,300/month

Savings: $10,800/month, 77% reduction.
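
The arithmetic above can be reproduced with a short script, which makes it easy to swap in your own token counts and rates. Like the example, it assumes every call hits the cache and does not charge for cache writes; the helper name is illustrative.

```python
PER_MILLION = 1_000_000


def monthly_cost(tokens_per_call: int, rate_per_m: float, calls: int) -> float:
    """Dollar cost of `tokens_per_call` tokens at `rate_per_m` $/M over `calls` calls."""
    return tokens_per_call * calls / PER_MILLION * rate_per_m


CALLS = 1_000_000
STANDARD, CACHED, OUTPUT = 3.00, 0.30, 15.00  # Claude Sonnet 4.6, $ per 1M tokens

# No caching: the full 4,200-token input is billed at the standard rate.
without_cache = monthly_cost(4_200, STANDARD, CALLS) + monthly_cost(100, OUTPUT, CALLS)

# Caching: the 4,000-token system prompt is billed at the cached rate.
with_cache = (
    monthly_cost(4_000, CACHED, CALLS)
    + monthly_cost(200, STANDARD, CALLS)
    + monthly_cost(100, OUTPUT, CALLS)
)

savings = without_cache - with_cache

if __name__ == "__main__":
    print(f"Without caching: ${without_cache:,.0f}/month")
    print(f"With caching:    ${with_cache:,.0f}/month")
    print(f"Savings:         ${savings:,.0f}/month ({savings / without_cache:.0%})")
```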

Caveats

Cache TTL varies: OpenAI's prompt caching is roughly 5 minutes; Anthropic supports both 5-minute and 1-hour caches with different write costs; Google's cache TTL is hour-scale with explicit storage pricing ($1-$4.50/M tokens/hour depending on context size). Long-lived caches mean storage fees apply.
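Cache writes cost more than cached reads for Anthropic ($3.75/M for 5-minute writes on Sonnet vs $0.30/M for hits), and more than the $3/M standard rate. At those rates the write premium is $0.75/M and each hit saves $2.70/M, so a single hit within the TTL already repays the write. Caching loses money only when most cached prefixes expire without being read again.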
Reorder prompts to maximize cache reuse: put stable content first, dynamic content last. This is the single biggest knob you have on cache hit rate.
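
A rough way to check the write-versus-read trade-off for your own rates is sketched below. It assumes the only costs are per-token prices and that every hit arrives before the cache expires, so the result is a lower bound: expired writes and partial prefix matches push the real number up. The function name is illustrative.

```python
import math


def hits_to_break_even(standard_per_m: float, write_per_m: float, read_per_m: float) -> int:
    """Smallest number of cache hits per write at which caching costs no more than not caching.

    Model (an assumption): without caching, the write call and every later call pay the
    standard rate for the prefix; with caching, the first call pays the write rate and
    each hit pays the read rate.
    """
    write_premium = write_per_m - standard_per_m
    saving_per_hit = standard_per_m - read_per_m
    if saving_per_hit <= 0:
        raise ValueError("cached reads must be cheaper than standard input")
    return max(0, math.ceil(write_premium / saving_per_hit))


if __name__ == "__main__":
    # Sonnet figures quoted above: $3/M standard, $3.75/M 5-minute write, $0.30/M read.
    print(hits_to_break_even(3.00, 3.75, 0.30))
```

To fold in expiries, divide the result by the fraction of writes that are actually read again before the TTL runs out.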

Get a real estimate for your workload

Paste your prompt into the counter to see the token count, then multiply input tokens by the cached-rate column above to estimate your post-caching cost. The actual savings depend on your cache hit rate, which you measure in production.

Try this on every model (standard input / output price per 1M tokens)

Claude Opus 4.8 $5.00/$25.00
Claude Opus 4.8 (Fast Mode) $10.00/$50.00
Claude Sonnet 4.6 $3.00/$15.00
Claude Haiku 4.5 $1.00/$5.00
GPT-5.5 $5.00/$30.00
GPT-5.5 Pro $30.00/$180.00
GPT-5.4 $2.50/$15.00
GPT-5.4 Mini $0.75/$4.50
GPT-5.4 Nano $0.20/$1.25
GPT-5.4 Pro $30.00/$180.00
GPT-5.3 $1.75/$14.00
GPT-5.2 $1.75/$14.00
GPT-5.2 Pro $21.00/$168.00
GPT-5.1 $1.25/$10.00
GPT-5 $1.25/$10.00
GPT-5 Mini $0.25/$2.00
GPT-5 Nano $0.05/$0.40
GPT-5 Pro $15.00/$120.00
GPT-4.1 $2.00/$8.00
GPT-4.1 Mini $0.40/$1.60
GPT-4.1 Nano $0.10/$0.40
o3 $2.00/$8.00
o3-mini $1.10/$4.40
o3-pro $20.00/$80.00
o4-mini $1.10/$4.40
GPT-4o $2.50/$10.00
GPT-4o mini $0.15/$0.60
GPT-4 Turbo $10.00/$30.00
Gemini 3.1 Pro $2.00/$12.00
Gemini 3 Flash $0.50/$3.00
Gemini 3.1 Flash-Lite $0.25/$1.50
Gemini 2.5 Pro $1.25/$10.00
Gemini 2.5 Flash $0.30/$2.50
Gemini 2.5 Flash-Lite $0.10/$0.40
Llama 3.3 70B $0.88/$0.88
Llama 3.1 405B $3.50/$3.50
Llama 3.1 70B $0.59/$0.79
Llama 3.1 8B $0.18/$0.18
Mistral Large $2.00/$6.00
DeepSeek V3 $0.27/$1.10
DeepSeek V3.1 $0.60/$1.70
DeepSeek R1 $3.00/$7.00
Qwen 2.5 72B $0.90/$0.90
Qwen 2.5 Coder 32B $0.80/$0.80
Qwen3 Coder 480B $2.00/$2.00
GLM-5.1 $1.40/$4.40
