Prompt Caching: How Anthropic and OpenAI Cut Costs by 90%
If your LLM application sends the same system prompt with every request, and it probably does, you're paying to recompute identical tensors thousands of times a day. Every single API call triggers the same matrix multiplications for the same tokens in the same order. The provider's GPUs dutifully crunch through your 2000-token system prompt again and again, and you dutifully pay for every input token, every time.
Prompt caching fixes this. It's one of those rare optimizations where you change almost nothing about your code and save a staggering amount of money. I'm genuinely surprised how many teams I talk to haven't turned it on yet.
How Prompt Caching Works
When you send a prompt to an LLM, the model computes key-value tensors for every token in the input. These KV tensors are the intermediate representations that attention layers use to understand context (the same tensors that PagedAttention manages so efficiently on the serving side). For a long system prompt, this computation is substantial -- and it's identical every time you send the same prompt prefix.
Prompt caching stores those pre-computed KV tensors on the provider's side. When your next request arrives and shares the same prefix, the provider looks up the cached tensors instead of recomputing them. The model picks up generation from where the cache ends -- your dynamic user query -- and only computes new KV tensors for the uncached portion.
Think of it like this:
WITHOUT CACHING:
Request 1: [system prompt: COMPUTE] [user query: COMPUTE] → response
Request 2: [system prompt: COMPUTE] [user query: COMPUTE] → response
Request 3: [system prompt: COMPUTE] [user query: COMPUTE] → response
WITH CACHING:
Request 1: [system prompt: COMPUTE + CACHE] [user query: COMPUTE] → response
Request 2: [system prompt: HIT CACHE] [user query: COMPUTE] → response
Request 3: [system prompt: HIT CACHE] [user query: COMPUTE] → response
The cache is keyed on exact token-level prefix matching. If even one token in your prefix changes, the cache misses and the full computation runs again. This is not fuzzy matching -- it's byte-exact.
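A toy sketch makes the exact-matching behavior concrete. This is a simplified mental model, not how providers actually store KV blocks internally — the class name and the string "kv-block-0" standing in for real tensors are both invented for illustration:

```python
import hashlib

class PrefixKVCache:
    """Toy model of provider-side prefix caching: keyed on exact prefix bytes."""
    def __init__(self):
        self.store = {}  # prefix hash -> precomputed KV tensors (stubbed as a string)

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def lookup(self, prefix: str):
        return self.store.get(self._key(prefix))  # None on any byte difference

    def insert(self, prefix: str, kv_tensors) -> None:
        self.store[self._key(prefix)] = kv_tensors

cache = PrefixKVCache()
cache.insert("You are a helpful assistant.", "kv-block-0")

print(cache.lookup("You are a helpful assistant."))   # hit: "kv-block-0"
print(cache.lookup("You are a helpful assistant. "))  # one extra byte: None
```

The second lookup differs by a single trailing space and misses — exactly the failure mode that an injected timestamp or request ID causes at scale.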
The Numbers
Here's where it gets exciting.
Anthropic charges cached input tokens at 90% less than regular input tokens. A Claude Sonnet request that normally costs $3 per million input tokens drops to $0.30 per million for the cached portion. That's not a typo. Ninety percent off. (One small catch: writing the cache on the first request carries a 25% premium over the base input rate, but that's amortized away after a single cache hit.)
OpenAI offers a similar structure. Cached input tokens on GPT-4o cost 50% less, and on newer models the discount is even steeper.
Let's do the math on a realistic scenario. Say you have a 2000-token system prompt -- instructions, persona definition, few-shot examples, output formatting rules. A typical production setup. You're handling 10,000 requests per day.
Without caching, that's 20 million system prompt tokens per day billed at full price. With caching, those same tokens are billed at the cached rate after the first request. On Claude Sonnet, that's roughly $54 saved per day on just the system prompt tokens. Over a month, that's $1,600+ back in your pocket, and this scales linearly with traffic.
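The back-of-envelope math is easy to sanity-check in a few lines (Claude Sonnet pricing as quoted above; the first request's uncached cost is ignored for simplicity):

```python
# Assumed pricing: $3.00 per million input tokens, $0.30 per million cached.
PRICE_FULL = 3.00 / 1_000_000    # dollars per input token
PRICE_CACHED = 0.30 / 1_000_000  # dollars per cached input token

system_tokens = 2_000
requests_per_day = 10_000
daily_tokens = system_tokens * requests_per_day   # 20 million tokens/day

cost_without = daily_tokens * PRICE_FULL          # $60.00/day
cost_with = daily_tokens * PRICE_CACHED           # $6.00/day
print(f"saved per day: ${cost_without - cost_with:.2f}")
print(f"saved per month: ${(cost_without - cost_with) * 30:,.0f}")
```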
For high-volume applications, prompt caching can easily save $50-100 per day depending on the model and prompt length. The bigger your system prompt and the higher your traffic, the more dramatic the savings. Pair this with proper token cost tracking and you can see exactly how much caching is saving you per endpoint.
How to Maximize Cache Hits
How you opt in varies by provider. On OpenAI, prompt caching kicks in automatically for prompts above a minimum length (currently 1,024 tokens) -- no switch to flip. Anthropic asks you to mark the cacheable prefix explicitly with a `cache_control` block. Either way, whether you actually get cache hits depends entirely on how you structure your prompts.
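For Anthropic, the request shape looks roughly like this sketch -- a `cache_control` marker on the last static content block tells the API to cache everything up to and including that block. The model name and prompt text here are placeholders:

```python
LONG_SYSTEM_PROMPT = "You are a helpful assistant. <instructions, examples...>"

def build_cached_request(user_query: str) -> dict:
    """Build a Messages API payload with the system prompt marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,  # static prefix, byte-identical every call
                "cache_control": {"type": "ephemeral"},  # cache up to this block
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }

payload = build_cached_request("My dynamic question...")
# Pass with client.messages.create(**payload) using the anthropic SDK; the
# response's usage object then reports cache_read_input_tokens on hits.
```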
Put static content first. Your system instructions, persona definitions, few-shot examples, and output format specifications should all come before any dynamic content. The cache matches on the prefix, so the longer the identical prefix, the more tokens get cached.
Keep the static prefix byte-identical. I cannot emphasize this enough. If you inject a timestamp, request ID, session counter, or any other variable into your system prompt, you destroy the cache. Every request gets a unique prefix, and every request recomputes from scratch.
This is the most common mistake I see:
from datetime import datetime

# BAD: timestamp in system prompt kills caching
system_prompt = f"You are a helpful assistant. Current time: {datetime.now()}"

# GOOD: static system prompt, dynamic content comes after
system_prompt = "You are a helpful assistant."
user_message = f"The current time is {datetime.now()}. My question is..."

Structure your prompt in layers. Think of it as: system instructions first, then few-shot examples, then any context documents, then the user query last. Each layer should be stable across requests whenever possible. Even if the context documents change, the system instructions and few-shot examples still get cached.
Batch similar requests. If you have different prompt templates for different use cases, try to consolidate them. Fewer unique prefixes means higher cache hit rates across your fleet.
Semantic Caching: A Different Beast
Prompt caching as described above is exact prefix matching handled by the LLM provider. But there's a separate concept worth knowing: semantic caching.
Semantic caching stores the full LLM response and serves it again when a sufficiently similar query arrives. Instead of caching KV tensors, you're caching the output itself. The similarity check uses embeddings -- you compute an embedding for the incoming query, compare it against cached query embeddings, and if the cosine similarity exceeds a threshold, you return the cached response without calling the LLM at all.
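Here's a minimal sketch of that lookup. The embedding function is a character-bigram stand-in so the example is self-contained -- a real system would call an actual embedding model -- and the 0.8 threshold is purely illustrative:

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in embedding: hashed character-bigram counts. A real system
    would use an embedding model here instead."""
    dims = 4096
    vec = [0.0] * dims
    for a, b in zip(text.lower(), text.lower()[1:]):
        vec[(ord(a) * 31 + ord(b)) % dims] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached response)

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the LLM call entirely
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.put("How do I reset my password?", "Go to Settings > Security > Reset.")
print(cache.get("How can I reset my password?"))   # near-duplicate: cache hit
print(cache.get("What are your business hours?"))  # unrelated: None, call the LLM
```

A production version would swap the linear scan for a vector index (Redis, FAISS, pgvector) and add TTL-based eviction, but the core hit/miss logic is the same.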
This is a fundamentally different tradeoff. You're trading response freshness and accuracy for zero latency and zero cost on cache hits. For some applications -- FAQ bots, documentation search, repetitive customer queries -- this works brilliantly. For others, it's a bad idea.
Libraries like GPTCache and LangChain's caching layer make this straightforward to implement. A Redis-backed semantic cache with an embedding index can intercept 30-70% of requests in high-repetition environments. That's 30-70% of your LLM bill eliminated entirely. This pattern is especially powerful for RAG applications where users frequently ask similar questions about the same document corpus.
The catch is obvious: if your users ask nuanced or unique questions, semantic similarity thresholds either miss too many genuine duplicates or return stale answers for queries that are similar but not identical. Tuning the threshold is an art, not a science.
The Bottom Line
Prompt caching is free money. Restructure your prompts once, save permanently. Move static content to the front, keep dynamic content at the end, stop injecting variables into your system prompt.
If you want to go further, layer semantic caching on top for high-repetition query patterns. The two approaches are complementary -- prompt caching reduces cost per request, semantic caching eliminates requests entirely.
Most teams are leaving thousands of dollars on the table every month because their prompts are structured with dynamic content scattered throughout. Fix that. It takes an afternoon. The savings start immediately.
Related Posts
Structured Output That Actually Works: JSON Mode vs Function Calling
Getting reliable JSON from LLMs has been a pain point since GPT-3. Here's the current state of the art and what actually works in production.
Tracking Token Costs Before They Blow Up Your Bill
Output tokens cost 4-8x more than input tokens. If you're not tracking usage by query type and user segment, you're flying blind.
Building an LLM Gateway with LiteLLM
One API to call OpenAI, Anthropic, and self-hosted models. LiteLLM handles routing, fallbacks, and cost tracking so you don't have to.