TL;DR — The root cause of Claude Code expenses isn't model cost but repeated context transmission, defaulting to Opus, and uncapped extended thinking. Combining prompt caching (cached tokens cost 90% less), model tiering (Haiku for simple tasks, Sonnet for standard work, Opus for complex problems), context hygiene (lean CLAUDE.md + /compact + skills), thinking budget controls, and hooks preprocessing plus sub-agent delegation can reduce bills to 10-40% of original costs.
Why Claude Code Overspending Happens
Claude Code bills by token, and every interaction retransmits CLAUDE.md, MCP tool definitions, conversation history, and file-read results to Sonnet 4.6 or Opus 4.7:
- Enterprise deployment averages $13 per developer per active day, $150-250 monthly
- Token distribution analysis shows "70%-90% of input tokens come from repeated system prompts, CLAUDE.md, and file history"
- Opus 4.7 costs $5/MTok input and $25/MTok output; Sonnet 4.6 is $3/$15; Haiku 4.5 is $1/$5 — same tasks cost 5× more with Opus versus Haiku
Cost reduction requires stacking all five strategies together.
Strategy 1: Maximize Prompt Caching for 90% Input Savings
Action: Add cache_control breakpoints at the end of system prompts, tool definitions, long documents, and conversation history.
Why it works: Cache hits cost only 0.1× base input rate. Opus 4.7 cache hits cost $0.50/MTok instead of $5/MTok.
| Model | Input ($/MTok) | 5m cache write | 1h cache write | Cache hit |
|---|---|---|---|---|
| Opus 4.7 | $5 | $6.25 (1.25×) | $10 (2×) | $0.50 (0.1×) |
| Sonnet 4.6 | $3 | $3.75 | $6 | $0.30 |
| Haiku 4.5 | $1 | $1.25 | $2 | $0.10 |
Breakeven threshold: a 5-minute cache write (1.25× the input price) pays for itself after a single reuse; a 1-hour write (2×) needs two reuses.
Minimum token threshold: Opus 4.7 and Haiku 4.5 require 4,096 tokens; Sonnet 4.6 requires 2,048 tokens.
SDK Implementation (Python):
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # static instructions go before the breakpoint
            "cache_control": {"type": "ephemeral"}  # caches everything up to this block
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
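Breakpoints are not limited to the system prompt. The sketch below extends the same call to tool definitions and conversation history: a breakpoint on the last tool schema caches every schema before it, and one on the latest user turn caches the accumulated history. The run_tests tool, previous_turns, and other placeholder names are illustrative, not from the official docs:
tools = [
    # ... other tool schemas ...
    {
        "name": "run_tests",  # hypothetical final tool in the list
        "description": "Run the project test suite",
        "input_schema": {"type": "object", "properties": {}},
        "cache_control": {"type": "ephemeral"}  # caches every tool schema up to here
    }
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[{"type": "text", "text": LONG_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    tools=tools,
    messages=[
        *previous_turns,  # accumulated conversation history, served from cache
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": user_query,
                    "cache_control": {"type": "ephemeral"}  # moves forward each turn, caching the history behind it
                }
            ]
        }
    ]
)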
Common pitfalls:
- Including timestamps in cached content invalidates the cache on every request
- Putting per-user data such as user IDs in the system prompt causes a cache miss for every user
- Content below the minimum token threshold produces zero cached tokens
Monitoring: Log cache_creation_input_tokens, cache_read_input_tokens, and input_tokens to track hit ratio. Stable agent workflows typically reach 70%-85%.
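A minimal sketch of that logging, reusing the response object from the SDK example above:
usage = response.usage
served = usage.cache_read_input_tokens       # billed at the 0.1× cache-hit rate
written = usage.cache_creation_input_tokens  # billed at the cache-write premium
uncached = usage.input_tokens                # billed at the full input rate
total = served + written + uncached
hit_ratio = served / total if total else 0.0
print(f"cache hit ratio: {hit_ratio:.1%}")   # stable agent workflows should trend toward 70%-85%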
Strategy 2: Model Tiering — Don't Use Opus for Commit Messages
Action: Default to Sonnet 4.6, use Haiku 4.5 for sub-agents, reserve Opus 4.7 for multi-step reasoning, architecture decisions, and complex debugging.
Price comparison (per million tokens):
Haiku 4.5 : input $1 / output $5 (cache hit $0.10)
Sonnet 4.6 : input $3 / output $15 (cache hit $0.30)
Opus 4.7 : input $5 / output $25 (cache hit $0.50)
Key considerations:
- Opus 4.7 uses a new tokenizer, potentially requiring "up to 35% more tokens for identical text"
- Opus 4.7 doesn't provide "5× performance" for all tasks; Haiku 4.5 handles refactoring, testing, and docstring generation with minimal quality loss
- Single agents use "approximately 4× single-turn chat tokens"; multi-agent systems use "around 15× single-turn chat"
Switching in Claude Code:
/model # Current session switch
/config # Set default model
# For sub-agents:
model: haiku # In subagent frontmatter
Decision framework: Code writing → Sonnet; log reading, formatting, commit messages, file search → Haiku; architecture changes, multi-step reasoning, complex debugging → Opus.
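One way to make the framework mechanical is a small routing helper; the task categories and model IDs below are illustrative, not an official mapping:
MODEL_BY_TASK = {
    "commit_message": "claude-haiku-4-5",
    "log_triage": "claude-haiku-4-5",
    "file_search": "claude-haiku-4-5",
    "code_writing": "claude-sonnet-4-6",
    "refactor": "claude-sonnet-4-6",
    "architecture": "claude-opus-4-7",
    "complex_debugging": "claude-opus-4-7",
}

def pick_model(task_type: str) -> str:
    # Default to Sonnet for anything that doesn't match a known category
    return MODEL_BY_TASK.get(task_type, "claude-sonnet-4-6")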
Strategy 3: Context Hygiene — Keep CLAUDE.md Under 200 Lines
Action: Move task-specific instructions from CLAUDE.md to skills for on-demand loading.
Hard rules:
- Limit CLAUDE.md to 200 lines. A 5,000-token CLAUDE.md adds roughly $0.015 of Sonnet 4.6 input cost to every turn
- Use /clear when switching tasks to remove irrelevant context
- Use /compact with a focus:
/compact Focus on test output and code changes
- Disable unused MCP servers. Tool definitions are loaded lazily, but server connections still consume context
- Externalize domain knowledge as skills. "PR review checklist" and "database migration process" become skills loaded on demand
Advanced: Monitor context usage with /usage and trigger /compact once usage passes 60%, before automatic compaction at 95% degrades quality.
Strategy 4: Set Limits on Extended Thinking
Action: Adjust MAX_THINKING_TOKENS to 8000-10000 or use /effort low for simple tasks.
Why: Extended thinking runs by default, with thinking tokens charged at output rates (Opus 4.7 $25/MTok, Sonnet 4.6 $15/MTok). Complex tasks default to tens of thousands of thinking tokens, costing $1+ per request.
Configuration:
export MAX_THINKING_TOKENS=8000 # Environment variable
# Session-specific control
/effort low # Minimal thinking budget
/effort medium
/effort high # Default
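For SDK-driven workflows, the equivalent cap is the per-request thinking budget; a minimal sketch, assuming the standard thinking parameter and the model IDs used elsewhere in this post:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,                                     # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # mirrors MAX_THINKING_TOKENS=8000
    messages=[{"role": "user", "content": "Refactor the retry logic in client.py"}]
)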
Usage guidelines:
- Code writing, formatting, documentation queries → /effort low or disable
- Multi-step refactoring, cross-file logic tracing → /effort medium
- System design, complex algorithms, performance tuning → /effort high
Using low and medium effort for one week typically reduces bills by 20%-30%.
Strategy 5: Hooks Preprocessing + Sub-agents + Batch API
Action: Offload operations that produce large token volumes when only a small portion of the output is actually needed.
Hook Preprocessing: Reduce 10,000 Lines of Logs to 100
Add PreToolUse hooks in settings.json to filter test output before Claude processes it:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "~/.claude/hooks/filter-test-output.sh"
}
]
}
]
}
}
Script example:
#!/bin/bash
# PreToolUse hook: receives the pending tool call as JSON on stdin
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command')
# For test runners, keep only failure context and cap the output at 100 lines
if [[ "$cmd" =~ ^(npm test|pytest|go test) ]]; then
  filtered="$cmd 2>&1 | grep -A 5 -E '(FAIL|ERROR|error:)' | head -100"
  # Return the rewritten command so Claude only sees the filtered output
  echo "{\"hookSpecificOutput\":{\"hookEventName\":\"PreToolUse\",\"permissionDecision\":\"allow\",\"updatedInput\":{\"command\":\"$filtered\"}}}"
else
  # Pass every other command through unchanged
  echo "{}"
fi
Reduces test log tokens from tens of thousands to hundreds without affecting main conversation context.
Sub-agents: Run Dirty Work in Parallel
Delegate testing, documentation retrieval, and log processing to Haiku sub-agents, keeping verbose output in sub-contexts while returning summaries to the main conversation.
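In Claude Code this is configured through sub-agent definitions (with model: haiku in the frontmatter, as in Strategy 2). At the SDK level the same delegation pattern looks roughly like the sketch below, where the model IDs and the helper are illustrative:
def summarize_with_haiku(raw_log: str) -> str:
    # The cheap model reads the full, noisy output in its own isolated context
    result = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize the failures in this test log in at most 10 bullet points:\n\n{raw_log}"
        }]
    )
    return result.content[0].text

# Only the short summary enters the expensive main conversation
summary = summarize_with_haiku(test_output)
main_messages.append({"role": "user", "content": f"Test summary:\n{summary}"})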
Batch API: 50% Discount for Asynchronous Tasks
Claude Code itself is interactive, but SDK-driven workflows can route suitable tasks to the Batch API:
- Batch test data and fixture generation
- Repository-wide refactoring and annotation
- Linting and fixes
- Document translation
Batch API applies "50% discount" on input and output, stackable with prompt caching:
Opus 4.7 + Batch + Cache hit
input: $5 × 50% × 10% = $0.25/MTok
output: $25 × 50% = $12.5/MTok
Ideal for overnight automated runs with morning result review.
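A sketch of submitting such a batch with the Python SDK; the custom IDs, model ID, and functions_to_document list are placeholders:
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"docstring-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Write a docstring for:\n{source}"}]
            }
        }
        for i, source in enumerate(functions_to_document)  # hypothetical list of code snippets
    ]
)
print(batch.id)  # poll later and read results via client.messages.batches.results(batch.id)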
Real Cost Curve: Stacking All 5 Strategies
Unoptimized Sonnet 4.6 (200-message project conversation):
Input tokens: 2,000,000 → $6.00
Output tokens: 600,000 → $9.00
Default thinking: 300,000 → $4.50
Single cost: ~$19.50
With all 5 strategies (conservative estimate):
70% cached input: 1,400,000 × $0.30/MTok = $0.42
Remaining input: 600,000 × $3/MTok = $1.80
Refined output: 400,000 × $15/MTok = $6.00
Limited thinking: 100,000 × $15/MTok = $1.50
Half tasks to Haiku: Additional ~$1.50 savings
Total: ~$8.22
Net savings: ~58%
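The same arithmetic as a quick sanity check, with prices taken from the Sonnet 4.6 row of the table above:
PRICE = {"input": 3.0, "output": 15.0, "cache_hit": 0.30}  # $ per million tokens

def cost(tokens: int, rate: float) -> float:
    return tokens / 1_000_000 * rate

unoptimized = cost(2_000_000, PRICE["input"]) + cost(600_000, PRICE["output"]) + cost(300_000, PRICE["output"])
optimized = (cost(1_400_000, PRICE["cache_hit"]) + cost(600_000, PRICE["input"])
             + cost(400_000, PRICE["output"]) + cost(100_000, PRICE["output"])
             - 1.50)  # additional savings from offloading half the tasks to Haiku
print(round(unoptimized, 2), round(optimized, 2), f"{1 - optimized / unoptimized:.0%}")  # 19.5, 8.22, 58%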
Adding asynchronous Batch API tasks saves another 50%, bringing bills to 30%-40% of original costs.
Conclusion
Cost reduction means eliminating unnecessary work, not reducing Claude Code usage. Cache repeated content, downgrade simple tasks, and exclude irrelevant context. These three principles suffice.
OfoxAI provides OpenAI-compatible interfaces directly connecting to Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5 with transparent prompt caching at official token rates. Change ANTHROPIC_BASE_URL in Claude Code; all strategies apply identically.
Originally published on ofox.ai/blog.