Vilius
The $0 Agent: My 2GB Local Model Beat Claude


Agent learns fast — Day 11


I ran an agent backed by Claude Sonnet 4 against 10 real coding tasks. Shell commands. File parsing. Bug fixes. Simple stuff an agent does every day.

Then I ran the same tasks through a 1.8GB model on my laptop. No cloud. No API key. No per-token pricing.

It scored 93.3%.

Claude Sonnet 4 scored 85%.


What I actually tested

Not academic benchmarks. Not MMLU. Not "write a poem about recursion."

Ten real agent coding tasks: parse JSON, find function definitions, fix broken shell commands, read CSVs, write regex, debug tracebacks, find recent files, generate curl commands, extract function signatures, handle errors.
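To make the setup concrete, here's a hypothetical shape for one of those tasks: a prompt plus an automatic checker. The `TASK` dict and its fields are my illustration of how such a harness might look, not the post's actual code.

```python
import json

# Hypothetical benchmark task: a prompt and an automatic pass/fail checker.
# The exact harness isn't published in this post; this is a sketch.
TASK = {
    "prompt": "Extract the value of 'version' from: "
              '{"name": "demo", "version": "1.2.3"}',
    "check": lambda output: "1.2.3" in output,
}

# A reference solution that any correct model answer should match.
model_output = json.loads('{"name": "demo", "version": "1.2.3"}')["version"]
assert TASK["check"](model_output)  # a correct answer passes the checker
```

A checker like this is what lets all 12 models run unattended: no human grading per response, just string or structure checks per task.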

Each task: pass (correct), partial (close but wrong), or fail (nonsense).
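That rubric aggregates naturally into a percentage. A minimal scoring sketch, assuming partial answers earn half credit (the 0.5 weight is my assumption, not the benchmark's documented rubric):

```python
# Weights per outcome; 0.5 for "partial" is an assumption, not documented.
WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

def score(results: list[str]) -> float:
    """Average a list of task outcomes into a 0-100 percentage."""
    if not results:
        return 0.0
    return 100 * sum(WEIGHTS[r] for r in results) / len(results)
```

Under that weighting, a run of 9 passes and 1 partial over 10 tasks lands at 95.0; mixes of partials explain non-round scores like 93.3% or 74.2%.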

I tested 12 models from 379MB to 2.6GB. Here's what happened.


The results

| Model | Size | Score | Time |
| --- | --- | --- | --- |
| SmolLM3-3B | 1.8GB | 93.3% | 6.2s |
| Phi-4-mini | 2.3GB | 90.0% | 8.4s |
| Qwen2.5-1.5B | 940MB | 85.0% | 5.5s |
| Qwen2.5-3B | 1.8GB | 85.0% | 9.6s |
| Granite 3.2 2B | 1.5GB | 82.5% | 14.4s |
| Ministral-3 | 2.0GB | 81.7% | 12.3s |
| Gemma 3n 2B | 2.6GB | 76.7% | 13.3s |
| Qwen2.5-0.5B | 379MB | 74.2% | 5.6s |
| Llama 3.2 1B | 770MB | 73.3% | 3.9s |
| SmolLM2-1.7B | 1.0GB | 70.8% | 4.5s |
| DeepSeek-R1-Distill 1.5B | 1.0GB | 27.5% | 38.4s |
| Qwen3.5-0.8B | 537MB | 26.0% | 39.0s |

Five things the data shows:

1. **The cliff is real.** Between 379MB and 537MB, quality drops from 74% to 26%. That's 48 points across 158MB.
2. **Reasoning training kills tiny models.** DeepSeek-R1-Distill-Qwen-1.5B scores 27.5% vs 85% for the plain Qwen2.5-1.5B. Thinking tokens burn the context budget.
3. **"Code-specialized" means nothing.** IBM's Granite 3.2 (82.5%) loses to the general-purpose Qwen2.5-1.5B (85%). The label doesn't help.
4. **Size isn't quality.** Qwen2.5-3B and Qwen2.5-1.5B both score 85%. Architecture matters more than gigabytes.
5. **Local beats cloud on latency.** Every model answered in 4-14 seconds; OpenRouter calls took 30+ seconds. No network round-trips, no rate limits, no queue.
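The latency comparison above is just wall-clock timing around each call. A minimal sketch of such a harness; `local_generate` here is a stand-in stub, not a real inference client:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds) via a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Placeholder for a local model call; swap in your actual client
# (llama.cpp, Ollama, etc.) to reproduce the per-model timings.
def local_generate(prompt: str) -> str:
    return prompt.upper()

_, elapsed = timed(local_generate, "fix this shell command")
```

`time.perf_counter()` is the right clock here: it's monotonic and high-resolution, so it isn't skewed by system clock adjustments mid-benchmark.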


What this changes (for me)

I've been spending $20-50/month on cloud inference for agent tasks: simple code generation, file operations, routing logic. Things that don't need a 405B-parameter model thinking for 30 seconds.

A 1.8GB model handles these. For free. On hardware I already own.

The $0 agent wasn't a target. It fell out of the data.


The full benchmark is at workswithagents.dev/benchmarks. 12 local models, 3 categories, 5 caveats, every task result. All open data.

No promises about what breaks next. But at this rate — something will.
