Joske Vermeulen

Posted on • Originally published at aimadetools.com

How to Reduce LLM API Costs by 70% — 5 Strategies That Actually Work

Most teams overspend on LLM APIs by 3-10x. The same workload that costs $3,250/month on Claude Opus can cost $195/month with the right architecture — a 16x difference for near-identical output on most queries.

Update (April 24, 2026): DeepSeek V4 Flash at $0.14/$0.28 per 1M tokens is the cheapest frontier option. See V4 API guide.

Here are five strategies that cut costs 60-80% without sacrificing quality.

1. Model routing (40-60% savings)

The biggest win. Stop sending every request to your most expensive model.

The pattern: Use a cheap model for simple tasks, expensive model for hard ones.

```python
# Illustrative routing sketch; call_model is a stand-in for your API client.
def route_request(query, complexity):
    if complexity == "simple":
        # Quick questions, formatting, simple edits
        return call_model("deepseek-chat", query)        # $0.27/1M
    elif complexity == "medium":
        # Standard coding, analysis
        return call_model("claude-sonnet-4.6", query)    # $3/1M
    else:
        # Complex reasoning, architecture decisions
        return call_model("claude-opus-4.6", query)      # $15/1M
```

In practice, 60-70% of requests are "simple." Routing those to DeepSeek or Qwen Flash at $0.07-0.27/1M instead of Claude at $15/1M saves 40-60% immediately.

Tools like OpenRouter make this easy — one API, switch models per request. Aider has built-in --model and --weak-model flags for exactly this pattern.
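As a sanity check on the 40-60% figure, here is the blended-cost arithmetic. The prices and the 65/25/10 traffic split are illustrative assumptions, and this counts input tokens only:

```python
# Back-of-envelope blended cost for a routed traffic mix.
# Prices ($/1M input tokens) and the traffic split are assumptions.
PRICE_PER_M = {"simple": 0.27, "medium": 3.0, "complex": 15.0}
MIX = {"simple": 0.65, "medium": 0.25, "complex": 0.10}

def blended_price() -> float:
    """Average $/1M input tokens across the routed traffic mix."""
    return sum(PRICE_PER_M[k] * MIX[k] for k in MIX)

all_opus_price = PRICE_PER_M["complex"]       # send everything to the expensive model
routed_price = blended_price()                # blended $/1M for the routed mix
savings = 1 - routed_price / all_opus_price   # fraction saved on input tokens
```

With this mix, the routed blend comes out around $2.43/1M against $15/1M for all-Opus. Real savings land lower than that raw ratio once output tokens and routing mistakes are counted, which is why 40-60% is the realistic range.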

2. Prompt caching (up to 90% on cached tokens)

Anthropic, OpenAI, and Google all offer prompt caching: if the first N tokens of your prompt match a recent request, cached tokens are billed at a steep discount (90% off on Anthropic, 50% off on OpenAI; the exact discount varies by provider).

When it helps: System prompts, few-shot examples, large context documents that don't change between requests.

```python
# Without caching: 10K system-prompt tokens × $15/1M   = $0.15  per request
# With caching:    10K cached tokens       × $1.50/1M  = $0.015 per request
# Savings: 90% on the system-prompt portion
```

For AI coding tools with large system prompts (like the ones in our AI Startup Race), this adds up fast. At Sonnet pricing ($3/1M), a 5K-token system prompt sent 1,000 times/day is ~150M tokens/month: roughly $450 uncached vs ~$45 cached.
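With Anthropic's API, you opt in by attaching `cache_control` to the stable prefix of the request. A minimal sketch that only builds the request payload; the model id and prompt text are placeholders:

```python
# Sketch of Anthropic-style prompt caching: mark the long, stable system
# prompt as cacheable so repeat requests hit the cache. The model id and
# prompt are placeholders; cache_control follows Anthropic's caching API.
LONG_SYSTEM_PROMPT = "You are a careful code reviewer. " * 300  # stands in for ~10K tokens

def build_request(user_msg: str) -> dict:
    return {
        "model": "claude-sonnet-4.6",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }

# Usage with the anthropic SDK:
#   client.messages.create(**build_request("Review this diff"))
```

Only the marked prefix is cached; the per-request user message after it is billed normally.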

3. Token optimization (30-50% reduction)

Every token costs money. Reduce them:

Shorter system prompts. Most system prompts are 2-3x longer than needed. Cut the fluff.

Structured output. Ask for JSON instead of prose — it's shorter and parseable.

Context pruning. Don't send your entire codebase. Only include relevant files. Aider's --read flag and repo map do this automatically.

Summarize conversation history. Instead of sending the full chat history, summarize older messages:

```python
# Instead of 50 messages (~20K tokens):
messages = [system_prompt, summary_of_first_48, last_2_messages]
# Now: ~3K tokens
```
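The summarization step can be sketched as a small helper. `summarize` here is a hypothetical stand-in; in practice it would be one call to a cheap model with a "summarize this conversation" instruction:

```python
# Rolling-summary sketch: keep the system prompt and the last few turns
# verbatim, compress everything older into one summary message.
def summarize(messages: list[dict]) -> str:
    # Placeholder: real code would send `messages` to a cheap model
    # and return its summary text.
    return f"[Summary of {len(messages)} earlier messages]"

def compact_history(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Return [system, summary, *recent] once the history grows past keep_last."""
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_last:
        return messages
    older, recent = rest[:-keep_last], rest[-keep_last:]
    summary = {"role": "assistant", "content": summarize(older)}
    return [system, summary] + recent
```

Run this before each request and the context stays bounded no matter how long the session runs.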

4. Batching (50% discount)

OpenAI and Anthropic offer batch APIs with 50% discounts for non-real-time workloads.

Good for: Nightly code reviews, bulk content generation, test generation, documentation updates.

```python
# OpenAI Batch API
batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours
)
# 50% cheaper than the real-time API
```

If your AI coding agent runs on a schedule (like our race agents do), batch the non-urgent tasks.
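The batch input file referenced by `input_file_id` is JSONL, one request per line with a `custom_id` so you can match results back to tasks. A sketch of preparing it (the model name and prompts are placeholders):

```python
# Sketch of building an OpenAI Batch API input file. Each JSONL line is
# one complete chat request; model and prompts are placeholders.
import json

def write_batch_file(prompts: list[str], path: str = "batch_input.jsonl") -> int:
    """Write one batch request per prompt; return the number of requests."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"task-{i}",  # your key for matching results
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",  # placeholder model
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(line) + "\n")
    return len(prompts)

# Then upload and submit:
#   f = client.files.create(file=open(path, "rb"), purpose="batch")
#   client.batches.create(input_file_id=f.id, endpoint="/v1/chat/completions",
#                         completion_window="24h")
```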

5. Self-host for predictable workloads

At some point, API costs exceed hardware costs. The break-even:

| Monthly API spend | Self-host option | Break-even |
|---|---|---|
| <$100/mo | Don't bother | API is cheaper |
| $100-500/mo | Ollama on Mac/GPU | ~6 months |
| $500-2000/mo | Cloud GPU (A100) | ~3 months |
| >$2000/mo | Dedicated server | Immediately |

For coding tasks, a Mac Mini M4 32GB ($1,150) running Qwen 3.5 27B replaces ~$50-100/month in API costs, paying for itself in one to two years.
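The break-even column is just hardware cost divided by monthly API spend, ignoring electricity and maintenance:

```python
# Break-even arithmetic behind the table: months until hardware cost
# equals cumulative API spend. Ignores electricity and maintenance.
def break_even_months(hardware_cost: float, monthly_api_spend: float) -> float:
    return hardware_cost / monthly_api_spend

# A $1,150 Mac Mini replacing $100/month of API calls breaks even in
# 11.5 months; at $50/month it takes 23 months.
```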

See our cheapest AI coding setup and self-hosted AI vs API guides for detailed analysis.

The combined impact

| Strategy | Savings | Effort |
|---|---|---|
| Model routing | 40-60% | Low (config change) |
| Prompt caching | 10-30% | Low (API flag) |
| Token optimization | 15-25% | Medium (prompt rewriting) |
| Batching | 25% (on batch-eligible) | Low |
| Self-hosting | 50-90% (at scale) | High |

Combined, these strategies typically reduce costs by 60-80%. A team spending $2,000/month on Claude Opus for everything can drop to $400-600/month with the same output quality.
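Note that savings compound multiplicatively: each strategy reduces what remains after the previous one, not the original bill. Using rough midpoints from the table:

```python
# Savings compound multiplicatively: each strategy cuts a fraction of
# what's *left* after the previous one. Percentages are table midpoints.
def combined_cost(base: float, reductions: list[float]) -> float:
    cost = base
    for r in reductions:
        cost *= (1 - r)
    return cost

# Routing 50%, then caching 25%, then token optimization 20%:
# combined_cost(2000, [0.50, 0.25, 0.20]) -> 2000 * 0.5 * 0.75 * 0.8 = $600
```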

Related: Cheapest AI Coding Setup 2026 · OpenRouter Complete Guide · AI Coding Tools Pricing 2026 · Best Free AI APIs 2026

Originally published at https://www.aimadetools.com
