We migrated because our AI orchestrator became something that needed tests, trust boundaries, and deterministic behavior.
ForgeFlow is a fully local, multi-agent TDD system — a planning agent (Claude) designs a spec document set, and a local 45GB LLM executes TDD cycles overnight on Apple Silicon. During our initial development iterations, n8n was the orchestrator. It felt like the right call. Visual workflow, no-code glue, easy HTTP nodes, built-in error routing. What's not to like?
Turns out, quite a lot.
What n8n Got Right
To be fair, n8n earned its place early on. When we were validating the core loop — call LLM → apply files → run pytest → branch on result — the visual canvas was genuinely useful. You could see the entire decision tree at a glance. Non-engineers could follow it. Debugging a single node failure was fast.
And for a prototype, that matters. We got to a working TDD loop in far fewer iterations than a pure Python approach would have required.
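That loop is small enough to sketch in ordinary Python. The names below are illustrative rather than ForgeFlow's actual API, and the LLM call and file writer are injected as parameters so the sketch stays self-contained:

```python
import subprocess
from typing import Callable

def run_pytest() -> bool:
    """Run the test suite; a zero exit code means the cycle went green."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def tdd_cycle(propose: Callable[[str], dict[str, str]],
              apply_files: Callable[[dict[str, str]], None],
              spec: str, max_retries: int = 3) -> str:
    """call LLM -> apply files -> run pytest -> branch on result."""
    for _ in range(max_retries):
        files = propose(spec)    # the LLM proposes {path: content}
        apply_files(files)       # only the orchestrator writes to disk
        if run_pytest():
            return "COMMIT"
    return "DEADLOCK"
```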
Where It Started Breaking
Friction mounted as the system's requirements hardened: more determinism, stricter trust boundaries, real testability.
Problem 1: The fetch() wall
n8n's Code node runs in a sandboxed Node.js environment. When we tried to read existing source files before passing them as context to the LLM — a critical step for our TDD approach — fetch() to our local exec server failed silently. In our setup, $helpers.httpRequest() didn't give us the control we needed either. Whether this was configuration-specific or sandbox-related, the practical result was the same: file reads became workflow plumbing instead of ordinary code. The workaround was to spin up a separate HTTP Request node just for file reads.
This wasn't a bug. It was the architecture saying: "Code nodes are for data transformation, not I/O." Which is correct! But we needed I/O.
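In Python, the same step collapses back into ordinary code. A minimal sketch, with an illustrative function name and an arbitrary truncation limit:

```python
from pathlib import Path

def gather_context(paths: list[str], max_chars: int = 8000) -> str:
    """Read existing source files into a single LLM context block."""
    chunks = []
    for p in paths:
        chunks.append(f"# --- {p} ---\n{Path(p).read_text()}")
    return "\n\n".join(chunks)[:max_chars]
```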
Problem 2: The trust boundary was wrong
In our system, the rule is absolute: only the orchestrator may modify files or run git commands. The LLM proposes. The orchestrator decides and acts.
In n8n, enforcing this required careful node discipline. Nothing in the framework prevents you from firing off a git commit in any arbitrary Code node. The constraint was cultural, not structural. And cultural constraints erode.
In Python, the constraint is structural: apply_files(), git_commit(), gate_decision() are functions in a specific layer (L1/L2). Nothing in L3 (the LLM layer) can call them directly. The layer separation is enforced by code organization, not discipline.
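One way to keep that boundary structural rather than cultural is to test the import graph itself: a sketch, assuming hypothetical module names and file paths, that fails CI the moment the LLM layer imports a trusted layer.

```python
import ast
from pathlib import Path

FORBIDDEN = {"l1_trust", "l2_verify"}  # hypothetical module names

def test_llm_layer_never_imports_trusted_layers():
    # Hypothetical path to the LLM layer's source file.
    tree = ast.parse(Path("forgeflow/l3_doubt.py").read_text())
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported |= {alias.name for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module)
    # Fail if any dotted import path contains a trusted-layer module.
    assert not any(part in FORBIDDEN
                   for name in imported
                   for part in name.split("."))
```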
Problem 3: Deterministic logic deserves deterministic code
The most critical parts of our system — syntax validation, failure signature hashing, gate decision logic — are 100% deterministic. Not a single LLM call. Pure logic.
Expressing that logic in n8n's Code nodes felt like writing a chess engine in PowerPoint. It worked, technically. But every if/elif chain lived in a text field, untestable and unversionable in any meaningful way.
We wanted to write this:
```python
import hashlib

def compute_failure_signature(failure_type: str, target_file: str, stderr_text: str) -> str:
    payload = f"{failure_type}:{target_file}:{stderr_text[:200]}"
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```
And then test it directly:
```python
def test_failure_signature_is_stable():
    sig1 = compute_failure_signature("patch", "auth.py", "IndentationError line 42")
    sig2 = compute_failure_signature("patch", "auth.py", "IndentationError line 42")
    assert sig1 == sig2

def test_failure_signature_differs_by_file():
    sig1 = compute_failure_signature("patch", "auth.py", "IndentationError line 42")
    sig2 = compute_failure_signature("patch", "models.py", "IndentationError line 42")
    assert sig1 != sig2
```
You can approximate that with a Code node, but not as a first-class, importable, directly testable artifact. The difference matters when this logic is what determines whether a 45GB model gets another retry or hits deadlock.
Problem 4: Cross-model verification became essential
By our 21st development iteration, we had brought in an independent reasoning model (different vendor, no shared context) to write test_forgeflow.py — 110 tests for our orchestrator's behavior, written solely from our spec documents.
It found 7 real mismatches between our spec and our implementation.
To run this process, we needed: a well-defined function contract, stable function signatures, and a test suite that could run in CI. None of that is natural in n8n. All of it is natural in Python.
The Migration Decision
The turning point was a single question: "Can we write a unit test for this logic?"
If the answer was no — and with n8n, it often was — that was a design smell. Critical, deterministic logic that can't be tested independently is a liability.
We rewrote the orchestrator as forgeflow.py with a three-layer trust model:
- L1: TRUST (immutable) — File I/O, Git, Docker, State, Cycle Control
- L2: VERIFY (deterministic) — Syntax Gate, correction engine, gate_decision
- L3: DOUBT (always suspect) — LLM backends, JSON extraction, prompt assembly
The design invariant: L1 and L2 never call an LLM. All LLM interaction is isolated in L3.
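In code, the invariant looks roughly like this. A minimal sketch with illustrative names, not ForgeFlow's actual contracts: L3 hands back plain data, L2 is a pure check, and only L1 touches the working tree.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class FileProposal:
    """All L3 may produce: data describing a change, never the change itself."""
    path: str
    content: str

# L2 (VERIFY): deterministic, no I/O, no LLM.
def syntax_gate(proposal: FileProposal) -> bool:
    try:
        compile(proposal.content, proposal.path, "exec")
        return True
    except SyntaxError:
        return False

# L1 (TRUST): the only layer that mutates the working tree.
def apply_files(proposals: list[FileProposal], root: Path) -> list[Path]:
    written = []
    for p in proposals:
        if syntax_gate(p):  # L1 may call down into L2, never up into L3
            target = root / p.path
            target.write_text(p.content)
            written.append(target)
    return written
```

Calls only flow downward: L1 may invoke L2, and nothing imports L3 except the thin boundary that receives its proposals.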
This made the migration feel less like a rewrite and more like making implicit structure explicit.
What We'd Tell Our Past Selves
Use n8n for: Prototyping agent loops, visualizing complex branching logic for stakeholders, integrating heterogeneous external APIs with minimal glue code.
Don't use n8n for: Any logic you'll want to unit-test. Any constraint that needs to be structurally enforced rather than culturally agreed upon. Any system where the orchestrator's behavior is itself the thing being verified.
The moment your AI orchestrator has enough opinions to deserve a test suite, it's time to write real code.
The Result
The important change wasn't "Python instead of n8n." It was that the orchestrator became verifiable.
- 106 behavioral tests pass against stable function contracts
- Three execution paths (COMMIT / RETRY / DEADLOCK) are validated independently via dry-run (sketched below)
- Trust boundaries are structural: the LLM layer has no reference to file mutation or git operations
- Cross-model verification is repeatable: any external model can re-run tests against the same spec
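To illustrate what validating those paths independently looks like, here is gate logic of that shape as a pure function (an illustrative signature, not the real contract), trivially testable for all three outcomes:

```python
from enum import Enum

class Gate(Enum):
    COMMIT = "commit"
    RETRY = "retry"
    DEADLOCK = "deadlock"

def gate_decision(tests_passed: bool, failure_signature: str,
                  seen_signatures: set[str], retries_left: int) -> Gate:
    if tests_passed:
        return Gate.COMMIT
    # Seeing the same failure signature again means the model is looping.
    if failure_signature in seen_signatures or retries_left <= 0:
        return Gate.DEADLOCK
    return Gate.RETRY
```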
n8n helped us discover the loop. Python let us verify it.
ForgeFlow runs entirely on local hardware — Apple Silicon M5 Max with 128GB, a 45GB model, no cloud inference during execution. If you're building AI agent systems and hitting the same orchestration walls — whether with n8n, LangGraph, or CrewAI — what was your inflection point? The trust boundary question is one most teams hit eventually.
About
I'm Joseph YEO, a solo builder from Seoul, Korea. ForgeFlow is my experiment in pushing local AI agents toward full autonomy — no cloud inference during execution, no hand-holding mid-cycle.
This post is about one architectural decision that turned out to matter more than I expected: moving the orchestrator from n8n to Python, and why testability was the forcing function.
Follow along:
- 𝕏: @josephyeo_dev
- GitHub: joseph-yeo
- Site: projectjoseph.dev
Built over ~22 sessions, May 2026. All models run locally via Ollama 0.23.0 on Apple Silicon. No cloud APIs were used during autonomous execution.
This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.