Vilius
The $0 Agent: My 2GB Local Model Beat Claude


Agent learns fast — Day 11


I ran an agent backed by Claude Sonnet 4 against 10 real coding tasks. Shell commands. File parsing. Bug fixes. Simple stuff an agent does every day.

Then I ran the same tasks through a 1.8GB model on my laptop. No cloud. No API key. No per-token pricing.

It scored 93.3%.

Claude Sonnet 4 scored 85%.


What I actually tested

Not academic benchmarks. Not MMLU. Not "write a poem about recursion."

Ten real agent coding tasks: parse JSON, find function definitions, fix broken shell commands, read CSVs, write regex, debug tracebacks, find recent files, generate curl commands, extract function signatures, handle errors.
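To make the setup concrete, here's a hypothetical shape for one of those tasks: a prompt plus an automatic checker. The `TASK` dict and its fields are my illustration of how such a harness might look, not the post's actual code.

```python
import json

# Hypothetical benchmark task: a prompt and an automatic pass/fail checker.
# The exact harness isn't published in this post; this is a sketch.
TASK = {
    "prompt": "Extract the value of 'version' from: "
              '{"name": "demo", "version": "1.2.3"}',
    "check": lambda output: "1.2.3" in output,
}

# A reference solution that any correct model answer should match.
model_output = json.loads('{"name": "demo", "version": "1.2.3"}')["version"]
assert TASK["check"](model_output)  # a correct answer passes the checker
```

A checker like this is what lets all 12 models run unattended: no human grading per response, just string or structure checks per task.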

Each task: pass (correct), partial (close but wrong), or fail (nonsense).
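That rubric aggregates naturally into a percentage. A minimal scoring sketch, assuming partial answers earn half credit (the 0.5 weight is my assumption, not the benchmark's documented rubric):

```python
# Weights per outcome; 0.5 for "partial" is an assumption, not documented.
WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

def score(results: list[str]) -> float:
    """Average a list of task outcomes into a 0-100 percentage."""
    if not results:
        return 0.0
    return 100 * sum(WEIGHTS[r] for r in results) / len(results)
```

Under that weighting, a run of 9 passes and 1 partial over 10 tasks lands at 95.0; mixes of partials explain non-round scores like 93.3% or 74.2%.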

I tested 12 models from 379MB to 2.6GB. Here's what happened.


The results

| Model | Size | Score | Time |
| --- | --- | --- | --- |
| SmolLM3-3B | 1.8GB | 93.3% | 6.2s |
| Phi-4-mini | 2.3GB | 90.0% | 8.4s |
| Qwen2.5-1.5B | 940MB | 85.0% | 5.5s |
| Qwen2.5-3B | 1.8GB | 85.0% | 9.6s |
| Granite 3.2 2B | 1.5GB | 82.5% | 14.4s |
| Ministral-3 | 2.0GB | 81.7% | 12.3s |
| Gemma 3n 2B | 2.6GB | 76.7% | 13.3s |
| Qwen2.5-0.5B | 379MB | 74.2% | 5.6s |
| Llama 3.2 1B | 770MB | 73.3% | 3.9s |
| SmolLM2-1.7B | 1.0GB | 70.8% | 4.5s |
| DeepSeek-R1-Distill 1.5B | 1.0GB | 27.5% | 38.4s |
| Qwen3.5-0.8B | 537MB | 26.0% | 39.0s |

Five things the data shows:

1. **The cliff is real.** Between 379MB and 537MB, quality drops from 74% to 26%. That's 48 points across 158MB.
2. **Reasoning training kills tiny models.** DeepSeek-R1-Distill-Qwen-1.5B scores 27.5% vs 85% for the plain Qwen2.5-1.5B. Thinking tokens burn the context budget.
3. **"Code-specialized" means nothing.** IBM's Granite 3.2 (82.5%) loses to the general-purpose Qwen2.5-1.5B (85%). The label doesn't help.
4. **Size isn't quality.** Qwen2.5-3B and Qwen2.5-1.5B both score 85%. Architecture matters more than gigabytes.
5. **Local beats cloud on latency.** Every model answered in 4-14 seconds; OpenRouter calls took 30+ seconds. No network round-trips, no rate limits, no queue.
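The latency comparison above is just wall-clock timing around each call. A minimal sketch of such a harness; `local_generate` here is a stand-in stub, not a real inference client:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds) via a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Placeholder for a local model call; swap in your actual client
# (llama.cpp, Ollama, etc.) to reproduce the per-model timings.
def local_generate(prompt: str) -> str:
    return prompt.upper()

_, elapsed = timed(local_generate, "fix this shell command")
```

`time.perf_counter()` is the right clock here: it's monotonic and high-resolution, so it isn't skewed by system clock adjustments mid-benchmark.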


What this changes (for me)

I've been spending $20-50/month on cloud inference for agent tasks: simple code generation, file operations, routing logic. Things that don't need a 405B-parameter model thinking for 30 seconds.

A 1.8GB model handles these. For free. On hardware I already own.

The $0 agent wasn't a target. It fell out of the data.


The full benchmark is at workswithagents.dev/benchmarks. 12 local models, 3 categories, 5 caveats, every task result. All open data.

No promises about what breaks next. But at this rate — something will.
