DEV Community

Soleiman Mansouri


I Built a Debugging Memory for AI Coding Agents — Here's the System Behind It

Here's a question that changed how I debug with AI agents:

What if the agent checked "have I seen this before?" before investigating every bug?

I started logging my debugging sessions — every root cause, every false lead, every fix. After 100+ production bugs across voice pipelines, API integrations, and distributed systems, a clear pattern emerged: the same ~22 root causes explain nearly everything. Config chain gaps. Stale caches. Silent fallbacks. Observer multipliers. Retry/timeout mismatches.

The bugs repeat. The agents don't remember. That's the gap.

So I built Debug Bank — a pattern-first debugging memory system that teaches AI agents to remember.

The Problem: AI Agents Learn Nothing

Here's what happens today:

  1. Bug appears
  2. Agent investigates from scratch
  3. Agent finds root cause, fixes it
  4. Session ends
  5. Same bug appears in a different file
  6. Agent investigates from scratch again

Stack Overflow data shows AI-generated code has 2.66x more formatting problems and 1.5-2x more security bugs than human code. Much of that comes from agents never learning from past mistakes.

Google's ReasoningBank research (arxiv.org/abs/2504.09762) showed that distilling failures into reusable patterns yields +8.3% on WebArena and +4.6% on SWE-Bench. But that research stopped at benchmarks. I needed a production system.

The Insight: Patterns Repeat

After documenting my own debugging trajectories, I extracted 22 root cause patterns that account for ~95% of the bugs I encounter:

  • P03 (Observer/Hook Multiplier) — Event listeners registered multiple times, causing duplicate processing per trigger
  • P07 (Partial Rollback) — A multi-step deploy succeeds partway, leaving the system in an inconsistent state
  • P11 (Retry/Timeout Mismatch) — Retry interval exceeds the timeout window, so retries never actually execute
  • P15 (Async Fire-and-Forget) — A background task fails silently because no one awaits its result or checks its status

These aren't theoretical. Each has 2-3 real-world examples from production, a 30-second checklist, and a fix strategy.

The key insight: If an agent checks "have I seen this before?" at the start of every debugging session, 70% of investigations end at step 1.

How Debug Bank Works

It's three layers stacked on top of each other:

Layer 1: Pattern Bank (P01-P22)

When a bug is reported, the agent reads the pattern bank and asks: "Is this a known pattern?"

If yes, it pulls the checklist and verification strategy. If the known fix applies, we're done in 30 seconds. If not, we know why it doesn't match and adjust.

If no pattern matches, we proceed to the 7-step protocol.
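To make the Layer 1 check concrete, here is a minimal sketch of the lookup in Python. The real bank is plain markdown files; the symptom keywords and checklist strings below are illustrative stand-ins, not the repo's actual entries:

```python
# Hypothetical in-memory view of two pattern-bank entries.
# In Debug Bank itself these live as markdown files under patterns/.
PATTERN_BANK = {
    "P03": {
        "name": "Observer/Hook Multiplier",
        "symptoms": ["duplicate", "fires twice", "multiple handlers"],
        "checklist": "Deduplicate by event/frame ID",
    },
    "P08": {
        "name": "Config Resolution Chain Gap",
        "symptoms": ["stale value", "fallback", "wrong config"],
        "checklist": "Trace the full fallback chain",
    },
}

def pattern_check(bug_report: str) -> list[str]:
    """Return IDs of known patterns whose symptoms appear in the report."""
    report = bug_report.lower()
    return [
        pid for pid, p in PATTERN_BANK.items()
        if any(s in report for s in p["symptoms"])
    ]

matches = pattern_check(
    "Transfer routing uses a stale value from the fallback chain"
)
```

If `matches` is non-empty, the agent pulls the checklist for each hit before doing any investigation; an empty list is the signal to drop into the 7-step protocol.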

Layer 2: The 7-Step Protocol

I call it the "Debug Trajectory" — it's the exact sequence I follow on every production incident:

  1. Reproduce — Get the exact error with full output (logs, stack trace, HTTP status)
  2. Hypothesize — State 2-3 ranked, falsifiable root causes
  3. Isolate — Test hypotheses one at a time using binary search
  4. Diagnose — Identify the single root cause by tracing the full call chain
  5. Fix — Make a minimal change addressing the root cause, not a symptom
  6. Record — Document the trajectory in a domain catalog
  7. Capture feedback — When corrected, turn the correction into a persistent rule

Every step produces evidence for the next. You never skip.
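Step 6 only works if the record has a consistent shape. One possible shape for a recorded trajectory, sketched in Python; the field names are invented, and the repo stores these as markdown catalog entries rather than objects:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Hypothetical record of one debugging session (protocol step 6)."""
    symptom: str
    hypotheses: list        # ranked, falsifiable (step 2)
    root_cause: str         # the single cause identified in step 4
    fix: str                # minimal change made in step 5
    pattern_id: str = ""    # link back to the bank, e.g. "P08"
    evidence: list = field(default_factory=list)

t = Trajectory(
    symptom="Transfer routing to wrong department",
    hypotheses=["empty DB table", "stale cache"],
    root_cause="department table empty, fell through to stale YAML",
    fix="populate table; monitor for empty entries",
    pattern_id="P08",
)
```

The `pattern_id` link is what makes the catalog searchable: the next time a symptom matches, the agent lands on this entry instead of starting at step 1.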

Layer 3: The 3-Exchange Stop Rule

This is the single most impactful rule in the system.

If 3 rounds of iterative fixing show no progress, STOP. Don't continue the same approach. Re-plan from scratch, add logging, or switch strategy entirely.

That's it. That rule alone prevents the #1 failure mode of AI agents — circular debugging that wastes tokens and produces nothing. Most agents loop 5-10 times on hard problems. This forces a strategy pivot.
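The rule is simple enough to enforce mechanically. A toy Python sketch of the guard; the progress predicate here is a stand-in for human or agent judgment, and the round cap is arbitrary:

```python
MAX_STALLED_EXCHANGES = 3  # the 3-exchange stop rule

def debug_loop(attempt_fix, made_progress):
    """Run fix rounds, forcing a strategy pivot after 3 stalled exchanges."""
    stalled = 0
    for round_no in range(1, 20):  # arbitrary hard cap for the sketch
        result = attempt_fix(round_no)
        if made_progress(result):
            stalled = 0
            if result == "fixed":
                return "fixed"
        else:
            stalled += 1
        if stalled >= MAX_STALLED_EXCHANGES:
            return "STOP: re-plan, add logging, or switch strategy"
    return "gave up"
```

The point is that the counter lives outside the fixing logic: the loop cannot talk itself into a fourth identical attempt.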

Pre-Deploy: Catching Bugs Before They Ship

The pattern bank is reactive — it kicks in after a bug appears. But the cheapest time to catch a bug is before it ships.

So I built a pre-deploy scanner. Here's how it works:

```bash
bash integrations/pre-deploy-check.sh
```

The scanner:

  1. Reads your git diff (staged changes)
  2. Greps for keywords linked to each pattern (e.g., "observer" for P03, "fallback" for P08)
  3. Prints a ranked list of flagged patterns with their quick-check
  4. Exits non-zero if matches are found (so it blocks your deploy pipeline)

Example output:

```text
[debug-bank] Pre-Deploy Pattern Scan
Scanning git diff for known failure patterns...

  FLAGGED  P03 Observer/Hook Multiplier
           keyword: subscribe
           Check: Deduplicate by event/frame ID

  FLAGGED  P08 Config Resolution Chain Gap
           keyword: fallback
           Check: Trace the full fallback chain

2 pattern(s) flagged. Review before deploying.
Exit code: 1
```

You fix the flagged issues, run the scanner again, and when there are no matches, the deploy proceeds.

This runs before human review — a catch-all safety net that uses the same pattern bank for prevention, not just diagnosis.
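The shipped scanner is a bash script built on git and grep, but the core logic fits in a few lines of any language. Here it is sketched in Python; the keyword-to-pattern table below reuses the examples from this article and is not the script's actual table:

```python
# Hypothetical keyword table: maps trigger words to pattern IDs.
KEYWORDS = {
    "subscribe": ("P03", "Observer/Hook Multiplier"),
    "fallback": ("P08", "Config Resolution Chain Gap"),
    "retry": ("P11", "Retry/Timeout Mismatch"),
}

def scan_diff(diff_text: str) -> dict:
    """Flag patterns whose keywords appear in added lines of a git diff."""
    added = [line for line in diff_text.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    flagged = {}
    for line in added:
        for kw, (pid, name) in KEYWORDS.items():
            if kw in line.lower():
                flagged[pid] = (name, kw)
    return flagged

diff = "+    bus.subscribe(on_frame)\n+    value = cfg.get('x') or fallback()"
flags = scan_diff(diff)
exit_code = 1 if flags else 0  # non-zero blocks the deploy pipeline
```

Grep over a diff is crude on purpose: false positives cost one glance at a checklist, while a missed pattern costs a production incident.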

How Feedback Becomes Rules

When you tell Claude "don't do that" or correct its approach, that correction becomes a permanent rule:

```markdown
---
name: no-mocking-database
type: feedback
---
Integration tests must hit a real database, not mocks.

**Why:** Prior incident where mock/prod divergence masked a broken migration.
**How to apply:** Any test file touching database operations.
```

The Why field lets the agent judge edge cases instead of blindly following rules. After 30+ feedback rules, the agent rarely needs the same correction twice.
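Because rules are plain frontmatter files, loading one takes only a few lines. This is a sketch of a parser for the format shown above, not the repo's actual loader:

```python
def parse_rule(text: str) -> dict:
    """Split a feedback-rule file into frontmatter fields and a body."""
    _, front, body = text.split("---", 2)
    meta = dict(line.split(":", 1) for line in front.strip().splitlines())
    meta = {k.strip(): v.strip() for k, v in meta.items()}
    meta["body"] = body.strip()
    return meta

rule = parse_rule("""---
name: no-mocking-database
type: feedback
---
Integration tests must hit a real database, not mocks.""")
```

Keeping the format this simple is deliberate: any agent that can read markdown can consume the rules without a library.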

Zero Dependencies, Just Markdown

Here's the radical part: It's all markdown files.

No database. No API. No external dependencies. No installation.

Setup takes 30 seconds:

```bash
curl -O https://raw.githubusercontent.com/soleimanmansouri/debug-bank/main/CLAUDE.md
```

Drop it in your project. Your agent reads it. That's it.

Works with:

  • Claude Code (drop-in via CLAUDE.md)
  • Codex CLI, Gemini CLI, Cursor (via AGENTS.md)
  • Custom agents (copy patterns, follow the protocol)

The pre-deploy scanner is a bash script — no external dependencies beyond git and grep.

Real-World Example: P08 in Production

Let me show you how this works with a real pattern.

The bug: Transfer requests were routing to wrong department numbers.

Step 1: Pattern Check — I read P08 (Config Resolution Chain Gap) and recognized the exact symptom: "System falls through to stale data when a link in the fallback chain is missing."

Step 2: Verify the fix applies — Config was resolved via: API response → database table → YAML → hardcoded fallback. The department table was empty. System fell through to YAML with outdated numbers.

Step 3: Fix — Populate the database table. Add monitoring for empty entries.

Result: 30 seconds. No investigation needed.

Without the pattern bank, I would have:

  • Checked the API response (working)
  • Checked the database query (returns no results, but why?)
  • Checked the YAML file (contains old numbers, but why is it using those?)
  • Spent an hour on a config chain I didn't fully understand

The pattern bank cut that hour to 30 seconds. And it applies to every subsequent bug of this type, in this project and in any future project.
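The chain this bug lived in is easy to reproduce in miniature. A toy Python reconstruction of the resolution order, with all names and values invented to match the story; each source returns `None` on a miss, so an empty table silently drops resolution down to the stale YAML:

```python
def resolve_department(api, db_table, yaml_cfg, default="0000"):
    """Resolve config through API -> DB -> YAML -> hardcoded default."""
    for source in (api.get, db_table.get, yaml_cfg.get):
        number = source("transfer_department")
        if number is not None:
            return number
    return default

# Reproducing the incident: API had no field, the table was never
# populated, so the outdated YAML value wins without any error.
api = {}
db_table = {}
yaml_cfg = {"transfer_department": "15"}  # stale number
```

This is exactly what P08 warns about: every link in the chain behaves correctly in isolation, and the bug only exists in the fall-through between them.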

Building the Pattern Bank

The 22 patterns came from documenting my own debugging work across diverse systems:

  • Voice pipelines (Pipecat, ElevenLabs)
  • API integrations (Twilio, Odoo, Supabase)
  • Configuration management (database → YAML fallback chains)
  • Distributed systems (cache invalidation, retry storms)

Each pattern includes:

  • A clear description of the failure mode
  • A 30-second checklist
  • 2-3 real-world examples
  • A fix strategy
  • Prevention guidance

The patterns transfer across projects. P02 (Multiple Writers) appears in voice pipelines, web APIs, databases, and infrastructure. Once learned, it's learned forever.

Scenarios and Postmortems

The repo includes higher-tier debugging challenges:

  • Scenarios (S01-S03) — Multi-service L3-L4 problems where the symptom is in one place and the root cause is somewhere else entirely
  • Postmortems (PM01-PM03) — Anonymized production incidents with full timelines, blast radius analysis, and systemic mitigation

These are for learning. They teach you to think like a senior engineer tracking distributed failures.

Why This Works

Compound learning — Every bug fix teaches the system. After 50 bugs, most issues resolve at step 1 (pattern match).

Transfers across projects — You build the pattern bank once. It moves with you. P02 (Multiple Writers) is P02 everywhere.

User-driven self-improvement — Feedback rules capture corrections with context. The agent gets better at matching your expectations.

Evidence-based — Every pattern has a checklist. Every catalog entry links to a pattern ID. Nothing is "just trust me."

Stops circular debugging — The 3-exchange rule forces strategy pivots. No more looping endlessly on the wrong approach.

Update: Debug Bank v3 — From Knowledge to Runtime

Shipped May 2. The pattern bank just crossed from reference material into live debugging.

v3 adds four capabilities that collapse investigation time:

Debugger Strategies (Built Into Each Pattern)

Every pattern now includes targeted breakpoint placement, watch expressions, and isolation techniques. Instead of "set 8-15 random breakpoints and hope," you get 2-4 high-confidence breakpoints tied directly to the root cause.

Example: P02 (Multiple Writers) in a Python service:

```text
Breakpoint 1: database.py:47 (write operation entry point)
  Watch: [current_value, lock_status, caller_id]

Breakpoint 2: cache_invalidator.py:23 (where writes notify observers)
  Watch: [event_id, registered_handlers, call_stack]

Isolation: Disable observer notifications, confirm writes are idempotent

Expected evidence: First breakpoint shows 2+ distinct callers
  writing same key within 50ms
```

5-12 steps instead of 20+. The agent hits the exact line, not a guessing game.

Symptom Classifier — The New Step 0

Before pattern-matching, feed the error into a keyword-driven symptom classifier. It reads your error message and returns ranked pattern matches with confidence scores.

Example input:

```text
ERROR: Transfer routing to wrong department (expected 42, got stale value 15)
Multiple concurrent requests from same contact
```

Example output:

```text
1. P08 (Config Resolution Chain Gap)     [95% confidence]
2. P02 (Multiple Writers)                [72% confidence]
3. P15 (Async Fire-and-Forget)           [41% confidence]
```

Tested at 100% accuracy on 8 real production bugs. No more reading 22 patterns manually.

Debug Subagent Protocol

A 4-tool interface for pattern-guided debug subagents:

  • debug_start_session(pattern_id, breakpoint_config) — boots with starting breakpoints from the matched pattern
  • debug_control(action, params) — step, continue, reload, detach
  • debug_inspect(target) — read memory, locals, call stack
  • debug_breakpoint(location, condition) — add conditional breakpoints on the fly

Max 25 steps to avoid token burn. The subagent gets pattern-matched breakpoints at startup — no warm-up period.
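The four tools map naturally onto a thin interface. Here is an interface-level sketch with stubbed bodies, since driving a real debugger underneath is transport-specific; only the 25-step cap mirrors the protocol described above:

```python
MAX_STEPS = 25  # hard budget to avoid token burn

class DebugSubagent:
    """Illustrative stub of the 4-tool debug subagent interface."""

    def __init__(self):
        self.steps = 0
        self.session = None

    def debug_start_session(self, pattern_id, breakpoint_config):
        """Boot with the matched pattern's starting breakpoints."""
        self.session = {"pattern": pattern_id,
                        "breakpoints": list(breakpoint_config)}
        return self.session

    def debug_control(self, action, params=None):
        """step / continue / reload / detach; each call consumes budget."""
        self.steps += 1
        if self.steps > MAX_STEPS:
            raise RuntimeError("step budget exhausted: detach and re-plan")
        return {"action": action, "step": self.steps}

    def debug_inspect(self, target):
        """Read memory, locals, or the call stack (stubbed here)."""
        return {"target": target}

    def debug_breakpoint(self, location, condition=None):
        """Add a conditional breakpoint to the running session."""
        self.session["breakpoints"].append((location, condition))
        return self.session["breakpoints"]
```

The important design choice is that the budget lives in the interface, not in the agent's reasoning: exhausting it raises, which forces the same kind of pivot as the 3-exchange rule.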

Pattern Compositions (C01-C05)

5 multi-pattern cascades showing how patterns amplify each other:

  • C01: P02 + P08 — Multiple Writers + Config Chain Gap = self-healing/re-breaking cycle
  • C02: P03 + P15 — Observer Multiplier + Async Fire-and-Forget = silent notification failure
  • C03: P07 + P11 — Partial Rollback + Retry Timeout Mismatch = stuck deploy state

These teach you to trace across systems instead of debugging in isolation.

Research Backing

  • Debug2Fix (arxiv.org/abs/2602.18571): +12-22% fix rate with debugger access, but 34% of fixes were wrong despite correct debugging. Pattern knowledge addresses that gap directly.
  • Google ReasoningBank (arxiv.org/abs/2504.09762): +8.3% from distilled failure reasoning. The symptom classifier is Debug Bank's version.

What's Next

v3 shipped the runtime bridge. Patterns are wired into debuggers. The subagent protocol means agents start debugging at the exact line.

What's left: runnable Docker scenarios where you can practice debugging L3-L4 problems autonomously. A broken voice pipeline, a misconfigured database fallback chain, a timing race condition — all in a sandbox you can experiment in. The goal is to turn Debug Bank from a knowledge base into a practice environment.

Getting Started

  1. Clone or download — git clone https://github.com/soleimanmansouri/debug-bank
  2. Copy CLAUDE.md — curl -O https://raw.githubusercontent.com/soleimanmansouri/debug-bank/main/CLAUDE.md
  3. Read the patterns — Skim patterns/ to learn what's available
  4. On your first bug — Pattern check first, then trajectory protocol
  5. Set up pre-deploy — Add bash integrations/pre-deploy-check.sh to your deploy pipeline

The full setup guide lives in the repo. You can start using patterns in under 5 minutes.

Why I'm Open-Sourcing This

I built Debug Bank from months of production debugging across 3 major projects and 100+ real incidents. Every pattern has been verified on actual bugs. Every rule has been tested under pressure.

But patterns are only as useful as they are shared. A pattern bank gains value when developers from different domains contribute new patterns, challenge existing assumptions, and improve the checklists.

If you've debugged a production system and found a repeating failure pattern, Debug Bank wants it. Submit a PR with a real example, and the pattern becomes part of the shared knowledge base.

The Bet

My bet is simple: If you give your AI agent access to a well-organized pattern bank, the agent will solve 70% of future bugs at step 1 instead of re-investigating every time.

That's a 10-100x speedup depending on the complexity. That's fewer wasted tokens, faster fixes, and debugging that feels less like Sisyphus pushing a boulder uphill.


Debug Bank is MIT licensed. Zero dependencies. Just markdown and a bash script.

Start here: github.com/soleimanmansouri/debug-bank

Have a bug that doesn't fit the 22 patterns? Open an issue or PR. Let's grow this together.

Top comments (1)

PEACEBINFLOW

The 3-exchange stop rule is the kind of thing that sounds like a small procedural tweak but is actually the difference between a tool that helps and a tool that quietly wastes an afternoon. Most of the AI debugging horror stories I've heard follow the same shape: the agent finds something plausible, the human says "no, try again," and then they loop — each iteration slightly different but all orbiting the same wrong assumption. The stop rule draws a hard line through that loop before it has time to become a time sink.

What interests me is that the rule works because it doesn't trust the agent to recognize its own circling. If the agent could reliably detect "I'm not making progress," you wouldn't need an external rule. The fact that you do suggests something about the nature of the problem: the same reasoning capability that lets the agent debug also lets it construct convincing justifications for why the current approach is almost working. It's not that it's failing — it's that it's failing persuasively.

Makes me wonder whether other AI-assisted workflows need similar circuit-breakers that aren't built into the model's reasoning but imposed structurally from outside. Things that don't try to make the AI smarter, just harder for it to waste resources gracefully. The pre-deploy scanner seems like another version of the same instinct — catching patterns before they require debugging at all, using grep instead of reasoning. Maybe the most useful AI tools are the ones that don't ask the AI to do the thing it's bad at, and instead give it a structure where its strengths have guardrails.