Boosting Claude Code's Cache Hit Rate to 95%: 6 Practical Tips to Cut 400,000 Input Tokens Down to 50,000

"Why does my Claude Code request 400k input tokens every time? Why is my bill so high?"—this is the first reaction many Claude Code users have when checking their usage stats. In reality, the vast majority of those 400k tokens have likely been cached, and the actual cost might only be 1/10th of the surface figure. However, if the cache isn't hit, that bill can definitely be painful.

Core Value: After reading this article, you'll understand Claude Code's automatic caching mechanism, the 8 common reasons for cache misses, and 6 practical tips to slash your input tokens from 400k down to 50k.


A Deep Dive into Claude Code's Automatic Prompt Caching

Does Claude Code automatically hit the cache?

Yes, it does. Claude Code automatically enables Anthropic's Prompt Caching for every API request without requiring any configuration. This is a built-in behavior, not an optional feature.

Every time you send a message in Claude Code, the content sent to the API is assembled in the following order:

| Assembly Order | Content | Estimated Size | Caching Behavior |
|---|---|---|---|
| Layer 1 | Tool definitions (Read/Edit/Bash, etc.) | ~5,000 tokens | Nearly static, high hit rate |
| Layer 2 | System prompt + CLAUDE.md | ~3,000-10,000 tokens | Static within a session, high hit rate |
| Layer 3 | Conversation history (all previous messages) | Constantly growing | Prefix matching, builds up gradually |
| Layer 4 | Current new message | Variable | Never hits the cache |

Key Mechanism: Caching is based on prefix matching—as long as the first N tokens of a request are identical to previously cached content, those N tokens will hit the cache. In a continuous conversation, by the 20th turn, 95%+ of input tokens are typically served from the cache.
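The prefix-matching idea can be sketched in a few lines of Python. This is a simplified illustration, not Anthropic's actual implementation (the real system matches token sequences server-side), but it shows why each new turn reuses almost everything from the previous one:

```python
def cached_prefix_length(request_tokens, cached_tokens):
    """Return how many leading tokens of the request match the cached prefix."""
    n = 0
    for a, b in zip(request_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

# Turn N+1 resends the entire turn-N prompt plus one new message,
# so everything except the new suffix is a cache hit.
previous = ["tools", "system", "msg1", "reply1"]
current  = ["tools", "system", "msg1", "reply1", "msg2"]
hit = cached_prefix_length(current, previous)
print(f"{hit}/{len(current)} tokens served from cache")  # prints: 4/5 tokens served from cache
```

Anything that changes the *front* of this sequence (tools, system prompt) shifts every token after it, which is exactly the cascade-invalidation behavior described later in this article.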

Cache Pricing: Why Cache Hits Matter

| Operation Type | Relative Base Input Price | Sonnet 4 Actual Price/MTok | Opus 4 Actual Price/MTok |
|---|---|---|---|
| Standard Input (No Cache) | 1x | $3.00 | $15.00 |
| 5-min Cache Write | 1.25x | $3.75 | $18.75 |
| 1-hour Cache Write | 2x | $6.00 | $30.00 |
| Cache Hit/Read | 0.1x | $0.30 | $1.50 |
| Output | 5x | $15.00 | $75.00 |

A concrete example: If your request has 400,000 input tokens:

Scenario A: No Caching
├── 400k tokens × $3/MTok (Sonnet) = $1.20 per request

Scenario B: 95% Cache Hit (Typical Claude Code session)
├── Cache Hit 380k tokens × $0.30/MTok = $0.114
├── Cache Write 10k tokens × $3.75/MTok  = $0.0375
├── New Input 10k tokens × $3/MTok       = $0.03
├── Total = $0.18 per request
└── Actual cost is only 15% of the non-cached version
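The arithmetic above is easy to reproduce with a small helper. The defaults below are the Sonnet 4 figures from the pricing table, hard-coded for illustration; check Anthropic's pricing page for current numbers:

```python
def request_cost(hit_tokens, write_tokens, fresh_tokens,
                 base=3.00, hit_mult=0.1, write_mult=1.25):
    """Cost in USD of one request. `base` is the standard input price per MTok."""
    per_tok = base / 1_000_000
    return (hit_tokens * per_tok * hit_mult +
            write_tokens * per_tok * write_mult +
            fresh_tokens * per_tok)

no_cache = request_cost(0, 0, 400_000)            # Scenario A: everything billed at 1x
cached = request_cost(380_000, 10_000, 10_000)    # Scenario B: 95% cache hit
print(f"A: ${no_cache:.3f}  B: ${cached:.4f}  ratio: {cached / no_cache:.0%}")
# A: $1.200  B: $0.1815  ratio: 15%
```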

🎯 Pro Tip: Using Claude API via APIYI (apiyi.com) also supports the Prompt Caching mechanism, reducing input costs by 90% on cache hits. If your project integrates Claude via API, it's recommended to design your prompt structure to maximize the cache hit rate.

Cache TTL: The Hidden Perk for Max Users

| Subscription Plan | Cache TTL | Write Cost | Note |
|---|---|---|---|
| API Pay-as-you-go | 5 minutes | 1.25x | Cache expires after 5 mins of inactivity |
| Pro / Team | 5 minutes | 1.25x | Same as above |
| Max 5x / 20x | 1 hour | 2x | Higher write cost, but 12x longer window |

While Max users pay a 2x write cost (higher than the standard 1.25x), the 1-hour TTL means your cache is still there after you grab a coffee. For developers who work intermittently, this difference is significant.

Every cache hit resets the TTL timer, so as long as you remain active, the cache effectively won't expire.
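In other words, the cache behaves like a sliding-expiry cache: every request restarts the countdown. A toy simulation of that behavior (5-minute TTL as for API users, timestamps in minutes; a sketch of the concept, not the real server logic):

```python
class SlidingCache:
    def __init__(self, ttl_minutes):
        self.ttl = ttl_minutes
        self.expires_at = None  # no cache exists yet

    def request(self, now):
        """Return True on a cache hit; every request (hit or rebuild) resets the TTL."""
        hit = self.expires_at is not None and now <= self.expires_at
        self.expires_at = now + self.ttl  # timer restarts either way
        return hit

cache = SlidingCache(ttl_minutes=5)
print(cache.request(0))   # False — first request writes the cache, expires at minute 5
print(cache.request(4))   # True  — within the window, TTL reset to minute 9
print(cache.request(8))   # True  — still alive thanks to the earlier reset
print(cache.request(20))  # False — idle 12 minutes, cache expired
```

This is why a steady working rhythm keeps an API user's 5-minute cache alive indefinitely, while a coffee break forces a full rebuild.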

Cache Miss? 8 Common Causes and Solutions


The root cause of a cache miss is always the same: the request prefix does not match the cached content. Specifically in Claude Code, the following 8 scenarios will cause cache invalidation:

Category 1: TTL Expiration

| Reason | Trigger Condition | Impact | Solution |
|---|---|---|---|
| 1. Idle Timeout | >5 mins (API users), >1 hour (Max users) | Entire cache invalidated | Stay active or accept rebuild costs |

This is the most common reason for cache misses. If you step away for longer than 5 minutes (API users) or 1 hour (Max users) while coding, your next request will trigger a full cache rebuild.

Category 2: Cascade Invalidation due to Content Changes

Caching follows a strict hierarchical structure: Tool definitions → System prompt → Conversation history. Changes to upper layers invalidate everything below them.

| Reason | Trigger Condition | Impact | Severity |
|---|---|---|---|
| 2. Switching Models | Using the /model command | Entire cache (cache is model-isolated) | ⚠️ High |
| 3. Adding/Removing MCP Tools | Installing/uninstalling an MCP Server | Tool layer + everything below | ⚠️ High |
| 4. Toggling Web Search | Enabling/disabling web search | System layer + everything below | ⚠️ Medium |
| 5. Modifying CLAUDE.md | Editing config and restarting | System layer + everything below | ⚠️ Medium |

Category 3: Operation-Triggered Invalidation

| Reason | Trigger Condition | Impact | Severity |
|---|---|---|---|
| 6. New Conversation | /clear or starting a new session | Entire cache (history cleared) | ⚠️ High |
| 7. Using /compact | Manually compressing history | History layer cache invalidated | ⚠️ Medium |
| 8. Using /rewind | Undoing previous messages | History prefix changed | ⚠️ Medium |

An Overlooked Technical Limit: Minimum Cache Length

If your prompt is shorter than the following token counts, the cache will be silently skipped without any error:

| Model | Minimum Cacheable Length |
|---|---|
| Claude Opus 4.6 / Haiku 4.5 | 4,096 tokens |
| Claude Sonnet 4.6 | 2,048 tokens |
| Claude Sonnet 4.5 / 4 | 1,024 tokens |

For Claude Code, since the Tool definitions + system prompt already exceed 5,000 tokens, this limit is rarely hit. However, if you are building your own application via API, keep this lower bound in mind.
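If you are building your own integration, a pre-flight check against these minimums can catch silently-skipped caching before it costs you money. The model ID strings below are illustrative assumptions (map them to whatever identifiers your client actually uses), and the thresholds are copied from the table above; verify both against Anthropic's current documentation:

```python
# Hypothetical model-ID keys; thresholds from the table above.
MIN_CACHEABLE_TOKENS = {
    "claude-opus-4-6": 4096,
    "claude-haiku-4-5": 4096,
    "claude-sonnet-4-6": 2048,
    "claude-sonnet-4-5": 1024,
    "claude-sonnet-4": 1024,
}

def is_cacheable(model, prompt_tokens):
    """True if the prompt is long enough for the cache to take effect at all."""
    minimum = MIN_CACHEABLE_TOKENS.get(model)
    if minimum is None:
        raise ValueError(f"unknown model: {model}")
    return prompt_tokens >= minimum

print(is_cacheable("claude-sonnet-4-6", 1500))  # False — caching silently skipped
print(is_cacheable("claude-sonnet-4-6", 3000))  # True
```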

💡 Recommendation: If you are building an application via APIYI (apiyi.com) to call the Claude API, ensure your system prompt length exceeds the model's minimum cache threshold; otherwise, caching will not take effect.

Why You're Seeing 400k Input Tokens: The Context Composition of Claude Code

Now that we've covered the caching mechanism, let's break down what actually makes up that staggering "400k input tokens" figure you're seeing.


5 Main Sources of Token Consumption

| Source | Share | Approx. in 400k | Characteristics |
|---|---|---|---|
| Conversation History | ~60% | ~240k | Full history resent every turn |
| Tool Invocation Results | ~20% | ~80k | File reads, grep search results stay in context |
| Extended Chain of Thought | ~10% | ~40k | Previous turns' thinking blocks become input |
| System Prompt + CLAUDE.md | ~5% | ~20k | Included in every message |
| Tool Definitions | ~5% | ~20k | Schema for all available tools |

The Core Truth: The Longer the Conversation, the Larger the Input

Claude Code works by resending the complete conversation history with every request. This means:

  • Turn 1: ~20k tokens input (System prompt + tool definitions + your question)
  • Turn 5: ~100k tokens input (Accumulated 4 turns of history)
  • Turn 15: ~250k tokens input (Includes significant file read results)
  • Turn 30: ~400k+ tokens input (Approaching the automatic compression threshold)

But keep in mind: The vast majority of these inputs are cache hits. In that 400k token count at turn 30, perhaps only 10k–20k are actually new, non-cached content.
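The growth curve above follows directly from resending the full history every turn. A rough back-of-the-envelope model (the base size and per-turn growth are illustrative averages, not measurements):

```python
def input_tokens_per_turn(turns, base=20_000, per_turn_growth=13_000):
    """Approximate input size at each turn when the full history is resent.

    base: tool definitions + system prompt + first question.
    per_turn_growth: average new tokens per exchange (assistant replies,
    file reads, tool results) that join the history.
    """
    return [base + t * per_turn_growth for t in range(turns)]

sizes = input_tokens_per_turn(30)
print(sizes[0], sizes[14], sizes[29])  # 20000 202000 397000
```

Real sessions grow in lumps rather than linearly (a single large file read can add tens of thousands of tokens at once), but even this linear model shows why turn 30 lands near the 400k automatic-compression threshold.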

Special Considerations for Large Codebases

Claude Code does not automatically load your entire codebase into the context. It reads files on demand. However, in large codebases:

  • A single grep search might return massive results, all of which enter the context.
  • Exploratory reading of multiple files means each file's content stays in the conversation history.
  • In Agent mode, autonomous multi-step operations cause tool invocation results to accumulate.

If you're seeing 400k tokens every time, it's likely due to a combination of these factors:

  1. The codebase is large, and Claude Code has read many files for analysis.
  2. There are many conversation turns, leading to history accumulation.
  3. You might not be using /compact or /clear frequently enough.
  4. Your CLAUDE.md file might be quite long.

6 Practical Tips: Reducing Input Tokens from 400k to 50k

Tip 1: Be Precise, Avoid Global Scans

This is the most important and easiest optimization to implement.

❌ Vague instructions (triggers wide-range file scanning):
"Help me optimize the performance of this project"
"Check for bugs in the code"
"Refactor this module"

✅ Precise instructions (reads only necessary files):
"Optimize the response time of the processRequest function in src/api/handler.ts"
"Fix the null pointer exception on line 45 of src/auth/login.ts"
"Migrate the formatDate function in src/utils/format.ts from moment to dayjs"

Vague instructions force Claude Code to use Glob + Grep + Read on a large number of files to "understand" your request, and the content of every file stays in your conversation history permanently. Precise instructions ensure it only reads the 1-2 relevant files.

Token Savings: Reduces tool invocation result tokens by 60-80%.

Tip 2: Use /clear and /compact Promptly

```shell
# Clear the conversation when switching to an unrelated task
/clear

# Compress history when the conversation is long but the task isn't finished
/compact

# Compress with instructions to keep specific information
/compact Keep code examples and API interface definitions; everything else can be summarized
```

| Command | Effect | Best For | Note |
|---|---|---|---|
| /clear | Clears entire conversation history | Switching to a completely different task | All cache is invalidated |
| /compact | AI summarizes history, replaces original text | Mid-stage of long conversations | Partial cache invalidation, but significantly shrinks context |

Actual Impact: A 400k token conversation can typically be compressed to 50k-80k tokens after using /compact.

Tip 3: Optimize Your CLAUDE.md File

CLAUDE.md is loaded with every message. A 10,000-token CLAUDE.md sent over 30 turns adds up (even if cache hits reduce the cost to 0.1x, it still occupies valuable context space).

Optimization Tips:
├── Keep CLAUDE.md under 500 lines (core rules only)
├── Move detailed workflows to Skills (loaded on demand)
├── Put reference documentation in knowledge-base/ (Read when needed)
└── Avoid large code examples in CLAUDE.md

🚀 Pro Tip: Streamlining CLAUDE.md doesn't just save tokens; it helps Claude Code focus on the core rules. If you're building similar AI coding assistants using APIYI (apiyi.com), we recommend keeping your system prompts concise as well.

Tip 4: Use Subagents to Isolate Verbose Output

When performing actions that generate massive output, use a Subagent instead of executing directly:

❌ Executing directly in the main conversation (output enters main context):
"Run the test suite and analyze the failures"
→ Test output could be 50,000+ tokens, staying in history permanently

✅ Letting Claude Code use a Subagent (output isolated in a sub-process):
"Use a sub-task to run the test suite, and only summarize the failed test names and reasons for me"
→ Main context only increases by ~500 tokens for the summary

Token Savings: Can prevent 10,000-50,000 tokens from entering the main context per operation.

Tip 5: Choose the Right Model and Effort Level

| Task Type | Recommended Model | Effort Level | Note |
|---|---|---|---|
| Simple edits/formatting | Sonnet | low | No deep thinking required |
| General development | Sonnet | medium | Best cost-performance ratio |
| Complex architecture design | Opus | high | Requires deep reasoning |
| Code review | Sonnet | medium | Better cost-performance than Opus |

```shell
# Lower thinking depth to reduce thinking tokens (which become input later)
# Set a lower effort for simple tasks
/effort low

# Or control the thinking token limit via environment variables
MAX_THINKING_TOKENS=8000
```

Extended chain-of-thought (thinking) becomes part of the input tokens in subsequent turns. Lowering the effort level significantly reduces cumulative tokens over time.

Tip 6: Monitor Token Distribution with /context

```shell
# View current token usage distribution
/context
```

The /context command displays the token breakdown of your current context, helping you pinpoint exactly what is consuming space. Common findings include:

  • A grep search returned 20,000 tokens, but only 5% were useful.
  • A large file read earlier is no longer needed but remains in context.
  • CLAUDE.md is taking up more space than expected.

Once identified, use /compact or /clear to address the issue.

💰 Cost Tip: For users on pay-as-you-go API plans, these optimizations directly lower your bill.
Through the usage statistics on the APIYI (apiyi.com) platform, you can clearly see the token distribution for every request, helping you identify cost hotspots.

Practical Case: Reducing Daily Costs from $60 to $8

Here is a real-world optimization process:

Before Optimization (Large Python project, heavy Claude Code user)

Daily usage:
├── Conversation turns: ~50 turns/day
├── Average input tokens: 350k-450k/turn
├── Cache hit rate: ~70% (due to frequent /clear and model switching)
├── Daily API cost (Opus 4): ~$60
└── Monthly: ~$1,320

After Optimization (Applying 6 techniques)

Daily usage:
├── Conversation turns: ~40 turns/day (more precise, fewer turns needed)
├── Average input tokens: 80k-120k/turn (precise instructions + periodic compacting)
├── Cache hit rate: ~92% (reduced unnecessary cache interruptions)
├── Daily API cost (mostly Sonnet 4, Opus used only for complex tasks): ~$8
└── Monthly: ~$176

| Optimization Item | Savings Share | Description |
|---|---|---|
| Precise prompts vs. fuzzy scanning | ~35% | Largest contributor |
| Timely /compact and /clear | ~25% | Controls cumulative bloat |
| Sonnet replacing Opus (80% of tasks) | ~20% | Downgrade unnoticeable for most tasks |
| Streamlining CLAUDE.md | ~8% | Reduces fixed overhead per turn |
| Subagent isolation for long outputs | ~7% | Prevents large results from polluting context |
| Lowering effort level | ~5% | Reduces thinking token accumulation |

FAQ

Q1: Is the 400k token count shown in Claude Code what I’m actually billed for?

No. Claude Code automatically enables Prompt Caching. In an active session, 95%+ of input tokens are usually cache hits, billed at only 0.1x the base price. Out of 400k tokens, perhaps only 20k-40k are billed at full price. You can use /context to check your actual cache hit rate. API calls made via APIYI (apiyi.com) also support this caching mechanism.

Q2: Do I still need to worry about token consumption if I have a Max monthly subscription?

Yes, but for a different reason. The Max subscription isn't billed by tokens, but it does have a weekly usage limit. High token consumption will cause you to hit that limit faster. Streamlining your context not only extends your usage time but also helps Claude Code understand your needs more accurately (the more precise the context, the better the response).

Q3: Which is better, /compact or /clear?

It depends on the scenario. If you are about to start a completely different task, /clear is better to wipe the slate clean. If you are still working on the same task but the conversation has become very long, use /compact to keep the essential context while compressing the volume. /compact supports custom instructions, such as /compact keep all code modification history and API interface definitions.

Q4: Will upgrading to the latest version of Claude Code automatically optimize token usage?

Yes, it's recommended to always stay on the latest version. Anthropic continuously optimizes Claude Code's context management strategy, including automatic compression triggers (currently triggered at ~83.5% context occupancy) and lazy loading of MCP tool definitions (loading only tool names, with full schemas loaded only when needed). New versions generally bring better cache hit rates and smarter context management.


Summary: Understanding Caching + Precise Usage = Controlled Costs

Prompt Caching in Claude Code is a powerful automated optimization mechanism—it saves you money without requiring any configuration. However, understanding how it works and what causes it to invalidate can help you boost your savings from "an automatic 70%" to "an active 95%."

Keep these 3 core principles in mind:

  1. Keep the cache active: Avoid unnecessary actions that disrupt the cache (like frequently switching models or using /clear indiscriminately).
  2. Control context bloat: Use precise prompts and regular /compact commands to prevent your conversation history from growing indefinitely.
  3. Choose the right tools and models: Sonnet is sufficient for 80% of tasks; save Opus for scenarios that truly require it.

For users on pay-as-you-go plans, we recommend managing your Claude API calls through APIYI (apiyi.com) to leverage the platform's usage monitoring features for continuous Token consumption optimization. For heavy interactive users, we suggest opting for the Claude Max monthly subscription, combined with the optimization tips in this article, to achieve the best value for your money.


📝 Author: APIYI Technical Team | APIYI apiyi.com – A unified access platform for 300+ AI Large Language Model APIs.

References

  1. Anthropic Prompt Caching Documentation: Detailed explanation of the official caching mechanism.

    • Link: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
    • Note: Covers cache TTL, pricing multipliers, and minimum length requirements.
  2. Claude Code Cost Management Guide: Official Token optimization suggestions.

    • Link: code.claude.com/docs/en/costs
    • Note: Cost control strategies recommended by Anthropic.
  3. Claude Code Best Practices: Context management and efficiency optimization.

    • Link: anthropic.com/engineering/claude-code-best-practices
    • Note: Includes practical advice on precise prompting, using /compact, and more.
