This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
gemma-coder — a single-file Python CLI that turns Gemma 4 into an agentic coding assistant. It reads your CLAUDE.md or AGENTS.md rulebook, uses a model-agnostic XML tool protocol, and ships the 12-rule claude-code-pro-pack baseline as the default behavior file.
The interesting part isn't the loop — it's that the whole thing works against Gemma 4 E2B (2 billion effective parameters) running locally. The same file runs against 31B in the cloud for power users, E4B on a phone, and E2B on a Raspberry Pi 5. Same protocol, same rulebook, different scale.
Why Gemma 4 E2B specifically
The obvious submission path is to reach for 31B and flex. I went the other way. Three reasons:
1. E2B is the one that demonstrates the Gemma 4 story. Running server-grade models in the cloud is boring — OpenAI and Anthropic do that better. Running a 2B-effective model on hardware that sits in your living room is the unique capability unlock Google shipped this month. If a submission doesn't exercise that, it's a different model's submission wearing Gemma's name.
2. It forces better engineering. A 31B model tolerates sloppy prompts. E2B doesn't. Every line of the system prompt has to earn its place. That's a better stress test for the agent architecture, and the fixes you make for E2B make the 31B path faster and cheaper too.
3. Privacy and offline. Coding agents handle codebases with credentials, client IP, unreleased features. An agent that runs fully local is the only agent my legal team hasn't twitched at. E2B makes "fully local" practical.
Demo
Smallest clean run I captured:
$ gemma-coder "Add a one-line docstring to every function in src.py"
━━━ step 1/12 ━━━
Read src.py → identify fns.
→ tool: read_file({"path": "src.py"})
← {"content": "def add(a, b):\n return a + b\n\ndef sub(a, b):\n return a - b\n", ...}
━━━ step 2/12 ━━━
Add docstrings → rewrite src.py.
→ tool: write_file({"path": "src.py", "content": "def add(...):\n    \"\"\"Add two numbers.\"\"\"..."})
← {"bytes": 124}
━━━ step 3/12 ━━━
Done.
→ tool: done({"summary": "Added one-line docstrings to add() and sub() in src.py"})
Three steps. No re-reads, no wasted calls. That's what "narrow tool scope + rulebook baseline" buys you on a 2B model.
How it works
Tool protocol
Gemma 4 doesn't have native OpenAI-style function calling. Instead of fighting that, I treated it as a feature: the CLI uses a simple XML-framed JSON contract that every capable LLM can follow.
<tool>
{"name": "read_file", "args": {"path": "src/main.py"}}
</tool>
Results come back as <tool_result>...</tool_result> in the next user turn. Six tools total: read_file, write_file, search, run, patch, done. That's it.
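For illustration, parsing that frame takes a few lines of stdlib Python. This is a sketch, not the repo's literal code (the helper name parse_tool_call is mine):

import json
import re

# Matches the first <tool>...</tool> frame in a model completion.
TOOL_RE = re.compile(r"<tool>\s*(\{.*?\})\s*</tool>", re.DOTALL)

def parse_tool_call(completion: str):
    """Return (name, args) from the first tool frame, or None if there isn't one."""
    m = TOOL_RE.search(completion)
    if m is None:
        return None  # prose-only reply; the caller decides how to nudge the model
    call = json.loads(m.group(1))
    return call["name"], call.get("args", {})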
Benefit: the same loop runs against any LLM that can obey the format. I tested the same file against Gemma 4 31B, Qwen 2.5 Coder 32B, and Llama 3.3. All three worked. That portability is a byproduct of respecting Gemma 4's actual capabilities instead of bolting on an abstraction.
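And the loop that consumes it is roughly this shape. Again a sketch: chat and tools are stand-ins for the provider call and the six-tool dispatch table, and parse_tool_call is the helper from the sketch above:

import json

MAX_STEPS = 12  # matches the "step N/12" counter in the demo

def run_agent(task: str, system_prompt: str, chat, tools) -> str:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    for _step in range(1, MAX_STEPS + 1):
        reply = chat(messages)              # provider call (OpenRouter or Ollama)
        messages.append({"role": "assistant", "content": reply})
        parsed = parse_tool_call(reply)
        if parsed is None:
            continue                        # no tool frame; give the model another turn
        name, args = parsed
        if name == "done":
            return args.get("summary", "")
        result = tools[name](**args)        # dispatch: read_file, write_file, search, run, patch
        messages.append({"role": "user",
                         "content": "<tool_result>" + json.dumps(result) + "</tool_result>"})
    raise RuntimeError("step budget exhausted")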
Rulebook-first system prompt
The system prompt is short by design: tool schema + the project's CLAUDE.md (or AGENTS.md) dropped in verbatim. No framework prose, no chain-of-thought incantations, no "you are a helpful assistant."
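In code, that assembly is a few lines of concatenation. A minimal sketch with names of my own (TOOL_SCHEMA and DEFAULT_RULES stand in for whatever the real file calls its constants):

from pathlib import Path

TOOL_SCHEMA = "..."    # the six-tool contract shown above
DEFAULT_RULES = "..."  # bundled 12-rule claude-code-pro-pack text

def build_system_prompt(project_dir: str) -> str:
    """Tool schema + the project's rulebook, verbatim. Nothing else."""
    for name in ("CLAUDE.md", "AGENTS.md"):
        rulebook = Path(project_dir) / name
        if rulebook.exists():
            return TOOL_SCHEMA + "\n\n" + rulebook.read_text()
    return TOOL_SCHEMA + "\n\n" + DEFAULT_RULES  # fall back to the pro-pack baseline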
The 12-rule pack that ships as the default rulebook closes the four most common Gemma 4 failure modes I saw in testing:
- Token spirals — rule 6 caps the per-task token budget so the model doesn't loop on the same 4KB of context (a budget-cap sketch follows this list)
- Silent partial failures — rule 12 requires visible fail states; no more "migration completed" when it skipped rows
- Two-pattern pollution — rule 7 forces the agent to surface conflicts between codebase patterns instead of averaging
- Adjacent-code blindness — rule 8 mandates reading surrounding code before writing; fixes duplicate-function drift
These aren't abstract. Each rule earned its place from a specific failure in actual runs.
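To make one concrete: rule 6's cap is enforceable mechanically inside the loop. A rough sketch with my own numbers (the 32k figure and the chars/4 estimate are illustrative, not the pack's):

TOKEN_BUDGET = 32_000  # hypothetical per-task cap

def check_budget(messages):
    # Crude chars/4 estimate stands in for real tokenizer counting.
    used = sum(len(m["content"]) for m in messages) // 4
    if used > TOKEN_BUDGET:
        raise RuntimeError(f"rule 6: token budget exceeded (~{used} tokens)")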
Retry-with-backoff
Cloud gateways can return transient 5xx mid-session. call_openrouter wraps the HTTP call with 3-attempt exponential backoff (3s / 9s / 27s). Not glamorous, but it's the difference between a flaky demo and a shippable tool.
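The wrapper's shape, as a stdlib-only sketch (I'm reading "3-attempt" as three retries after the initial call; endpoint and payload handling are elided):

import json
import time
import urllib.error
import urllib.request

def call_openrouter(req: urllib.request.Request):
    delays = (3, 9, 27)  # the 3s / 9s / 27s schedule
    for attempt in range(len(delays) + 1):
        try:
            with urllib.request.urlopen(req, timeout=120) as resp:
                return json.loads(resp.read())
        except urllib.error.HTTPError as e:
            if e.code < 500 or attempt == len(delays):
                raise                     # 4xx is a caller bug; otherwise out of retries
            time.sleep(delays[attempt])   # transient 5xx: back off and retry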
What Gemma 4 E2B actually can and can't do
What it handles cleanly:
- Rename a function across 2-3 files
- Add docstrings and type hints
- Fix a failing unit test when the fix is local
- Draft a README section from existing code
- Apply a lint-style pattern fix consistently
What makes it struggle:
- Multi-file refactors with cross-file dependency tracking (context pressure kills it around 50k tokens)
- Novel architecture decisions (it's 2B params, not 100B — manage expectations)
- Long-running debugging where each step depends on the last
For the "boring 80%" of coding agent work, E2B is remarkable. For the exciting 20%, use a bigger model. Now there's a CLI that lets you pick.
Install
# OpenRouter free tier, no local setup
curl -fsSL https://raw.githubusercontent.com/sisyphusse1-ops/gemma-coder/main/gemma_coder.py -o gemma_coder.py
export OPENROUTER_API_KEY=sk-or-...
python3 gemma_coder.py "your task here"
# or local Ollama
ollama pull gemma4:e2b
python3 gemma_coder.py --provider ollama --model gemma4:e2b "your task here"
One file. Python stdlib only. No framework.
Code
Repo: github.com/sisyphusse1-ops/gemma-coder
Companion projects referenced:
- claude-code-pro-pack — the 12-rule baseline it loads
- cc-audit — lints any CLAUDE.md against those rules
All three are MIT.
How I Used Gemma 4
I chose Gemma 4 E2B because the submission is fundamentally about answering: can a 2B-effective-parameter model actually drive a useful coding agent? Using 31B would have sidestepped the question. The value of the project is precisely that it exercises the smallest Gemma 4 variant and finds the envelope where it succeeds.
What E2B unlocked:
- Runs on a Raspberry Pi 5. 5W of power, $75 of hardware, no cloud dependency, no API keys, no rate limits.
- Privacy by default. Credentials, client code, unreleased features stay on the machine. "Fully local" stops being a wish-list item.
- Forces rulebook discipline. The constraint of a small model made every part of the system prompt earn its place. Result: a cleaner tool protocol and a rulebook that transfers directly to larger models too.
Model selection was not "which is biggest." It was "which Gemma 4 variant makes the strongest argument for the unique capability the family ships."
Thanks for reading. If you try it, open an issue with your model + task + result — I'm collecting real-world envelope data for a follow-up post on where each Gemma 4 variant tops out.