TL;DR — The root cause of Claude Code expenses isn't model cost but repeated context transmission, defaulting to Opus, and uncapped extended thinking. Combining prompt caching (cached tokens cost 90% less), model tiering (Haiku for simple tasks, Sonnet for standard work, Opus for complex problems), context hygiene (lean CLAUDE.md + /compact + skills), thinking budget controls, and hooks preprocessing plus sub-agent delegation can reduce bills to 10-40% of original costs.
Why Claude Code Overspending Happens
Claude Code bills by token, and every interaction retransmits CLAUDE.md, MCP tool definitions, conversation history, and file-read results to Sonnet 4.6 or Opus 4.7:
- Enterprise deployment averages $13 per developer per active day, $150-250 monthly
- Token distribution analysis shows "70%-90% of input tokens come from repeated system prompts, CLAUDE.md, and file history"
- Opus 4.7 costs $5/MTok input and $25/MTok output; Sonnet 4.6 is $3/$15; Haiku 4.5 is $1/$5 — same tasks cost 5× more with Opus versus Haiku
Cost reduction requires stacking all five strategies together.
Strategy 1: Maximize Prompt Caching for 90% Input Savings
Action: Add cache_control breakpoints at the end of system prompts, tool definitions, long documents, and conversation history.
Why it works: Cache hits cost only 0.1× base input rate. Opus 4.7 cache hits cost $0.50/MTok instead of $5/MTok.
| Model | Input ($/MTok) | 5m cache write | 1h cache write | Cache hit |
|---|---|---|---|---|
| Opus 4.7 | $5 | $6.25 (1.25×) | $10 (2×) | $0.50 (0.1×) |
| Sonnet 4.6 | $3 | $3.75 | $6 | $0.30 |
| Haiku 4.5 | $1 | $1.25 | $2 | $0.10 |
Breakeven threshold: a 5-minute cache write (1.25× the input price) pays for itself after a single reuse; a 1-hour write (2×) needs two reuses.
Minimum token threshold: Opus 4.7 and Haiku 4.5 require 4,096 tokens; Sonnet 4.6 requires 2,048 tokens.
SDK Implementation (Python):
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # static instructions go before the breakpoint
            "cache_control": {"type": "ephemeral"}  # caches everything up to this block
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
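Breakpoints are not limited to the system prompt. The sketch below extends the same call to tool definitions and conversation history: a breakpoint on the last tool schema caches every schema before it, and one on the latest user turn caches the accumulated history. The run_tests tool, previous_turns, and other placeholder names are illustrative, not from the official docs:
tools = [
    # ... other tool schemas ...
    {
        "name": "run_tests",  # hypothetical final tool in the list
        "description": "Run the project test suite",
        "input_schema": {"type": "object", "properties": {}},
        "cache_control": {"type": "ephemeral"}  # caches every tool schema up to here
    }
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[{"type": "text", "text": LONG_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    tools=tools,
    messages=[
        *previous_turns,  # accumulated conversation history, served from cache
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": user_query,
                    "cache_control": {"type": "ephemeral"}  # moves forward each turn, caching the history behind it
                }
            ]
        }
    ]
)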
Common pitfalls:
- Including timestamps in cached content invalidates the cache on every request
- Putting per-user data such as user IDs in the system prompt causes a cache miss for every user
- Content below the minimum token threshold produces zero cached tokens
Monitoring: Log cache_creation_input_tokens, cache_read_input_tokens, and input_tokens to track hit ratio. Stable agent workflows typically reach 70%-85%.
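A minimal sketch of that logging, reusing the response object from the SDK example above:
usage = response.usage
served = usage.cache_read_input_tokens       # billed at the 0.1× cache-hit rate
written = usage.cache_creation_input_tokens  # billed at the cache-write premium
uncached = usage.input_tokens                # billed at the full input rate
total = served + written + uncached
hit_ratio = served / total if total else 0.0
print(f"cache hit ratio: {hit_ratio:.1%}")   # stable agent workflows should trend toward 70%-85%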
Strategy 2: Model Tiering — Don't Use Opus for Commit Messages
Action: Default to Sonnet 4.6, use Haiku 4.5 for sub-agents, reserve Opus 4.7 for multi-step reasoning, architecture decisions, and complex debugging.
Price comparison (per million tokens):
Haiku 4.5 : input $1 / output $5 (cache hit $0.10)
Sonnet 4.6 : input $3 / output $15 (cache hit $0.30)
Opus 4.7 : input $5 / output $25 (cache hit $0.50)
Key considerations:
- Opus 4.7 uses a new tokenizer, potentially requiring "up to 35% more tokens for identical text"
- Opus 4.7 doesn't provide "5× performance" for all tasks; Haiku 4.5 handles refactoring, testing, and docstring generation with minimal quality loss
- Single agents use "approximately 4× single-turn chat tokens"; multi-agent systems use "around 15× single-turn chat"
Switching in Claude Code:
/model # Current session switch
/config # Set default model
# For sub-agents:
model: haiku # In subagent frontmatter
Decision framework: Code writing → Sonnet; log reading, formatting, commit messages, file search → Haiku; architecture changes, multi-step reasoning, complex debugging → Opus.
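One way to make the framework mechanical is a small routing helper; the task categories and model IDs below are illustrative, not an official mapping:
MODEL_BY_TASK = {
    "commit_message": "claude-haiku-4-5",
    "log_triage": "claude-haiku-4-5",
    "file_search": "claude-haiku-4-5",
    "code_writing": "claude-sonnet-4-6",
    "refactor": "claude-sonnet-4-6",
    "architecture": "claude-opus-4-7",
    "complex_debugging": "claude-opus-4-7",
}

def pick_model(task_type: str) -> str:
    # Default to Sonnet for anything that doesn't match a known category
    return MODEL_BY_TASK.get(task_type, "claude-sonnet-4-6")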
Strategy 3: Context Hygiene — Keep CLAUDE.md Under 200 Lines
Action: Move task-specific instructions from CLAUDE.md to skills for on-demand loading.
Hard rules:
- Limit CLAUDE.md to 200 lines. A 5,000-token CLAUDE.md adds roughly $0.015 of Sonnet 4.6 input cost to every turn
- Use /clear when switching tasks to remove irrelevant context
- Use /compact with a focus:
/compact Focus on test output and code changes
- Disable unused MCP servers. Tool definitions are loaded lazily, but server connections still consume context
- Externalize domain knowledge as skills. "PR review checklist" and "database migration process" become skills loaded on demand
Advanced: Monitor context usage with /usage and trigger /compact once usage passes 60%, before automatic compaction at 95% degrades quality.
Strategy 4: Set Limits on Extended Thinking
Action: Adjust MAX_THINKING_TOKENS to 8000-10000 or use /effort low for simple tasks.
Why: Extended thinking runs by default, with thinking tokens charged at output rates (Opus 4.7 $25/MTok, Sonnet 4.6 $15/MTok). Complex tasks default to tens of thousands of thinking tokens, costing $1+ per request.
Configuration:
export MAX_THINKING_TOKENS=8000 # Environment variable
# Session-specific control
/effort low # Minimal thinking budget
/effort medium
/effort high # Default
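For SDK-driven workflows, the equivalent cap is the per-request thinking budget; a minimal sketch, assuming the standard thinking parameter and the model IDs used elsewhere in this post:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,                                     # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # mirrors MAX_THINKING_TOKENS=8000
    messages=[{"role": "user", "content": "Refactor the retry logic in client.py"}]
)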
Usage guidelines:
- Code writing, formatting, documentation queries → /effort low or disable
- Multi-step refactoring, cross-file logic tracing → /effort medium
- System design, complex algorithms, performance tuning → /effort high
Using low and medium effort for one week typically reduces bills by 20%-30%.
Strategy 5: Hooks Preprocessing + Sub-agents + Batch API
Action: Offload operations that produce large token volumes when only a small portion of the output is actually needed.
Hook Preprocessing: Reduce 10,000 Lines of Logs to 100
Add PreToolUse hooks in settings.json to filter test output before Claude processes it:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "~/.claude/hooks/filter-test-output.sh"
}
]
}
]
}
}
Script example:
#!/bin/bash
# PreToolUse hook: receives the pending tool call as JSON on stdin
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command')
# For test runners, keep only failure context and cap the output at 100 lines
if [[ "$cmd" =~ ^(npm test|pytest|go test) ]]; then
  filtered="$cmd 2>&1 | grep -A 5 -E '(FAIL|ERROR|error:)' | head -100"
  # Return the rewritten command so Claude only sees the filtered output
  echo "{\"hookSpecificOutput\":{\"hookEventName\":\"PreToolUse\",\"permissionDecision\":\"allow\",\"updatedInput\":{\"command\":\"$filtered\"}}}"
else
  # Pass every other command through unchanged
  echo "{}"
fi
Reduces test log tokens from tens of thousands to hundreds without affecting main conversation context.
Sub-agents: Run Dirty Work in Parallel
Delegate testing, documentation retrieval, and log processing to Haiku sub-agents, keeping verbose output in sub-contexts while returning summaries to the main conversation.
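In Claude Code this is configured through sub-agent definitions (with model: haiku in the frontmatter, as in Strategy 2). At the SDK level the same delegation pattern looks roughly like the sketch below, where the model IDs and the helper are illustrative:
def summarize_with_haiku(raw_log: str) -> str:
    # The cheap model reads the full, noisy output in its own isolated context
    result = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize the failures in this test log in at most 10 bullet points:\n\n{raw_log}"
        }]
    )
    return result.content[0].text

# Only the short summary enters the expensive main conversation
summary = summarize_with_haiku(test_output)
main_messages.append({"role": "user", "content": f"Test summary:\n{summary}"})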
Batch API: 50% Discount for Asynchronous Tasks
Claude Code itself is interactive, but SDK-driven workflows can route suitable tasks to the Batch API:
- Batch test data and fixture generation
- Repository-wide refactoring and annotation
- Linting and fixes
- Document translation
Batch API applies "50% discount" on input and output, stackable with prompt caching:
Opus 4.7 + Batch + Cache hit
input: $5 × 50% × 10% = $0.25/MTok
output: $25 × 50% = $12.5/MTok
Ideal for overnight automated runs with morning result review.
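A sketch of submitting such a batch with the Python SDK; the custom IDs, model ID, and functions_to_document list are placeholders:
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"docstring-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Write a docstring for:\n{source}"}]
            }
        }
        for i, source in enumerate(functions_to_document)  # hypothetical list of code snippets
    ]
)
print(batch.id)  # poll later and read results via client.messages.batches.results(batch.id)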
Real Cost Curve: Stacking All 5 Strategies
Unoptimized Sonnet 4.6 (200-message project conversation):
Input tokens: 2,000,000 → $6.00
Output tokens: 600,000 → $9.00
Default thinking: 300,000 → $4.50
Single cost: ~$19.50
With all 5 strategies (conservative estimate):
70% cached input: 1,400,000 × $0.30/MTok = $0.42
Remaining input: 600,000 × $3/MTok = $1.80
Refined output: 400,000 × $15/MTok = $6.00
Limited thinking: 100,000 × $15/MTok = $1.50
Half tasks to Haiku: Additional ~$1.50 savings
Total: ~$8.22
Net savings: ~58%
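The same arithmetic as a quick sanity check, with prices taken from the Sonnet 4.6 row of the table above:
PRICE = {"input": 3.0, "output": 15.0, "cache_hit": 0.30}  # $ per million tokens

def cost(tokens: int, rate: float) -> float:
    return tokens / 1_000_000 * rate

unoptimized = cost(2_000_000, PRICE["input"]) + cost(600_000, PRICE["output"]) + cost(300_000, PRICE["output"])
optimized = (cost(1_400_000, PRICE["cache_hit"]) + cost(600_000, PRICE["input"])
             + cost(400_000, PRICE["output"]) + cost(100_000, PRICE["output"])
             - 1.50)  # additional savings from offloading half the tasks to Haiku
print(round(unoptimized, 2), round(optimized, 2), f"{1 - optimized / unoptimized:.0%}")  # 19.5, 8.22, 58%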
Adding asynchronous Batch API tasks saves another 50%, bringing bills to 30%-40% of original costs.
Conclusion
Cost reduction means eliminating unnecessary work, not reducing Claude Code usage. Cache repeated content, downgrade simple tasks, and exclude irrelevant context. These three principles suffice.
OfoxAI provides OpenAI-compatible interfaces directly connecting to Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5 with transparent prompt caching at official token rates. Change ANTHROPIC_BASE_URL in Claude Code; all strategies apply identically.
Originally published on ofox.ai/blog.