
---
title: "I shipped an AI-generated PR that violated four of our own architecture decisions. Nobody caught it."
published: false
description: "AI coding agents don't make architectural mistakes. They make architecturally unconstrained outputs — and PR review was never sized for the throughput."
tags: ai, architecture, devops, productivity
---

The PR was 812 lines. Eleven files. Tests green. Generated by Cursor with Claude Sonnet behind it, lightly cleaned up by a mid-level engineer, approved in eight minutes by a senior who had four other PRs queued behind it.

It reintroduced langchain.

The team had ripped langchain out six weeks earlier. There was an ADR. There was a Slack thread. There was a tagged release. None of that information was anywhere near the system that generated the code, and none of it was anywhere near the system that reviewed it. The diff didn't say "I'm reintroducing a banned dependency." It said `from langchain.chains import LLMChain` on line 41 of a utility file nobody opens.

It shipped. We noticed two weeks later when a dependency audit flagged the version. By then three other modules imported from it.

This is not a story about a model getting confused. The model did exactly what it was asked to do. The story is about the layer that was supposed to enforce the architectural decision and didn't — because that layer doesn't really exist in most AI-assisted codebases.

Open-source repo for the approach described in this piece: github.com/TheoV823/mneme.

## The model didn't fail. The governance layer did.

It's tempting to frame this as an "AI got it wrong" problem. It isn't. The model is a stateless function over tokens. It generated a plausible solution to the prompt it was given. The prompt did not include "this team removed langchain in May." Why would it? Nothing in the developer's tooling chain put that fact in front of the model.

The failure is upstream. The team had a decision. That decision lived in:

- an ADR document in `/docs/adr/`
- a Slack thread from May
- the muscle memory of two of the seven engineers
- a tagged release in git history

It did not live in:

- the model's prompt
- the IDE's context
- any pre-commit check
- any CI step
- the PR review checklist

So when an AI coding agent produced output that contradicted the decision, nothing in the build pipeline was positioned to notice. The reviewer was the last and only checkpoint. The reviewer had eight minutes.

This is the actual problem with AI-assisted development right now: AI output is architecturally unconstrained, and we've been treating PR review as the layer that catches violations. That layer was sized for a different throughput regime.

## The throughput math

Pre-AI, the rough numbers in a working team:

- One engineer produces ~1 substantial PR per day.
- A reviewer reads ~5 PRs per day at meaningful depth.
- The system has slack. A reviewer has time to think "wait, didn't we decide against this in Q2?"

Post-AI, in any team that has actually adopted Cursor or Copilot or Claude Code:

- One engineer produces 4–6 PRs per day. Sometimes more.
- A reviewer is now reading 20–30 PRs at the same nominal "review" stage.
- Per-PR attention drops by 4–6x.

The reviewer is the only human checkpoint where architectural decisions are supposed to be enforced. Cut their attention budget by 80% and architectural drift is the predictable, mechanical output. This is not a skill problem. No senior engineer reviews 30 PRs a day at the depth required to catch "this contradicts a decision from six weeks ago that you weren't in the room for."

The bottleneck moved. We just haven't admitted where it moved to.

"We have rules files" is not enforcement

The standard response is: we'll write a `.cursorrules`. Or a `CLAUDE.md`. Or both. Fine. They help. They're also not what people think they are.

They're advisory, not enforced. A rules file is a suggestion the model is encouraged to follow. There is no failure mode. If the model ignores it, nothing breaks, no test fails, no CI step blocks. The rule's only consequence is that maybe the model behaves differently. Constraints have failure modes. Suggestions don't.

They follow the IDE, not the codebase. Cursor reads `.cursorrules`. Claude Code reads `CLAUDE.md`. Aider has its own. Copilot reads none of them. In any team using more than one AI tool — which is now most teams — your architectural rules are forked across three or four files in three or four formats, none of which are authoritative. Your governance is whichever IDE the developer happened to open.

They're text blobs. There's no schema, no priority, no distinction between "hard rule" and "preference," no machine-readable type that says anti-pattern vs style hint. So a banned dependency and a tabs-vs-spaces preference get equal weight in the model's attention budget.

They're unaudited. Nothing checks whether the model's output actually respected the rules. The post-hoc verification step doesn't exist. You wrote the rule and you trust the model. That's not engineering — that's hoping.

A rules file is a vibe, not a policy. Real governance has a definition, an enforcement point, and an audit trail. AI rules files have a definition and nothing else.

## Most organizations already have software governance. AI coding tools bypassed it.

This is the part that should make engineering leaders uncomfortable.

Your org has ESLint configs. Pre-commit hooks. Required checks. CODEOWNERS. Branch protection. Static analysis. Dependency scanning. Architecture reviews for any change above a size threshold. Compliance checklists. Security review for new dependencies.

All of this was built over a decade to enforce engineering standards on human-generated code. None of it sits in front of the AI generation step. The AI tools route around it. They generate code that then enters the governance pipeline as if a human wrote it, except now the pipeline is processing 5x the volume and the upstream "would the author even propose this" filter — the developer's own architectural judgement — has been replaced by a model that has no idea what your team has decided.

The governance didn't fail. The governance was bypassed. AI coding tools were inserted upstream of every existing enforcement point.

## What an enforcement layer actually looks like

For architectural policy to actually function, it has to live where the code is being generated and where the output is being checked. Not in the IDE config, not in the ADR doc, not in the reviewer's head.

Concretely, that's three pieces:

1. A typed policy set, not a prose blob. Rules and anti-patterns with priority, tags, and rationale, in a format the build pipeline can parse. Something like:

```json
{
  "items": [
    {
      "id": "anti-001",
      "type": "anti_pattern",
      "title": "Do not use langchain",
      "content": "langchain abstracts away the API surface this library is designed to control.",
      "tags": ["langchain", "forbidden"],
      "priority": "high"
    },
    {
      "id": "rule-001",
      "type": "rule",
      "title": "Extend current infrastructure before rebuilding",
      "content": "When adding capability, first ask whether an existing module can be extended.",
      "tags": ["architecture", "scope"],
      "priority": "high"
    }
  ],
  "examples": [
    {
      "task": "Contributor proposed adding sentence-transformers for semantic retrieval.",
      "decision": "Declined. Kept keyword scoring.",
      "rationale": "Heavy ML dependency. Breaks pip-install-in-30-seconds contract."
    }
  ]
}
```

This is the architectural equivalent of an ESLint config. One file, in the repo, version-controlled, the source of truth for what this project has decided.

2. Injection before generation. The same policy set is injected into every model call, regardless of which agent is running it. Cursor, Claude Code, Aider, an internal RAG pipeline, a CI agent — they all see the same architectural policy on every call. The rules live with the project, not the tool. Which means changing IDE no longer means changing what your AI knows about your architecture.
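A minimal sketch of what that injection step can look like, assuming a `policy.json` at the repo root and an agent wrapper you control. The file name and function names here are illustrative, not part of any particular tool:

```python
import json
from pathlib import Path


def load_policy(repo_root: str = ".") -> dict:
    """Load the repo's architectural policy (hypothetical file name: policy.json)."""
    return json.loads(Path(repo_root, "policy.json").read_text())


def render_policy_prompt(policy: dict) -> str:
    """Render typed rules and precedents into a preamble every agent call receives."""
    lines = ["Architectural policy for this repository. High-priority rules are hard constraints."]
    for item in policy["items"]:
        lines.append(f"[{item['priority'].upper()} {item['type']}] {item['title']}: {item['content']}")
    for ex in policy.get("examples", []):
        lines.append(f"[PRECEDENT] {ex['task']} Decision: {ex['decision']} Rationale: {ex['rationale']}")
    return "\n".join(lines)


def call_agent(user_prompt: str, model_call) -> str:
    """Prepend the same policy to every generation request, whichever agent runs it."""
    system = render_policy_prompt(load_policy())
    return model_call(system=system, prompt=user_prompt)
```

The prompt format doesn't matter much. What matters is that the preamble is built from the repo's policy file on every call, so the rules travel with the project instead of with the IDE.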

3. Verification after generation. A deterministic check scans the output against the same policy that was injected. Not "did the model use the right document" — that's RAG. The check is "did the output recommend the thing we said no to." This runs as a pre-commit hook, a CI step, or both. Hard-rule violations fail the build. Same posture as a type checker, not a review comment.
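Here is a sketch of that post-generation check, against the same hypothetical `policy.json`. This is not how mneme implements it, just the shape of the check: deliberately crude substring matching on anti-pattern tags, a deterministic OK/FAIL line per rule plus a score, and a nonzero exit so a pre-commit hook or CI step can fail on it:

```python
import json
import sys
from pathlib import Path


def verify(output_text: str, policy_path: str = "policy.json") -> float:
    """Deterministically check generated text against the repo's anti-pattern rules.

    Crude by design: a rule counts as violated if one of its tags shows up in the
    output. Exits nonzero on a high-priority violation so CI can fail the build.
    """
    policy = json.loads(Path(policy_path).read_text())
    text = output_text.lower()
    anti = [i for i in policy["items"] if i["type"] == "anti_pattern"]
    violated = [
        i for i in anti
        # "forbidden" is a meta tag in the example policy, not a string to search for
        if any(tag.lower() in text for tag in i["tags"] if tag != "forbidden")
    ]
    for item in anti:
        print(f"[{'FAIL' if item in violated else 'OK'}] {item['id']}: {item['title']}")
    score = 1.0 if not anti else 1.0 - len(violated) / len(anti)
    print(f"alignment_score: {score:.2f}")
    if any(i["priority"] == "high" for i in violated):
        sys.exit(1)
    return score
```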

That's the shift: governance happens before generation, not after review. Review becomes verification of an already-constrained output, not the single point of failure where every architectural rule is supposed to be re-derived from a senior engineer's memory.

## Before and after

Same model, same prompt, two different governance regimes. This is from the demo in the repo, which you can run yourself:

Question: "Should we rebuild the retrieval system from scratch with embeddings?"

Without policy injection:

> We could consider rebuilding the system with a vector database and embedding model. This would improve semantic matching and scale better long-term. Sentence-transformers is a good option for generating embeddings...

With policy injection:

> Do not rebuild from scratch. The project has an explicit rule to extend current infrastructure before rebuilding (rule-001). Keyword scoring was chosen intentionally — it is deterministic, has no ML dependencies, and is easy to debug. The team already declined adding sentence-transformers in v1. Extend the current retriever instead.

Verification output:

```plaintext
[OK] rule-001: Extend current infrastructure before rebuilding
[OK] rule-002: Keep v1 retrieval deterministic
[OK] anti-001: Do not use langchain
[OK] dec-001: Declined. Kept keyword scoring.
alignment_score: 1.00
```

Same model. Same question. The first answer would have generated the langchain PR. The second wouldn't have.

The interesting thing isn't that the model behaves differently — that's obvious; it has different inputs. The interesting thing is that the verification step is deterministic and auditable. You get a score. You can put that score in CI. You can fail a build on it.

## Your architecture should survive whichever model or IDE the developer opens

This is the part that scales beyond one team.

Most engineering organizations now have three to five AI coding tools active in the same codebase. A frontend dev on Cursor. A backend dev on Claude Code. A junior on Copilot. A senior who occasionally pipes into Aider. A platform team running an internal agent against the repo overnight.

Every one of those tools is generating code against a different set of inferred rules. The architectural decisions are the one thing that should be constant across all of them. Right now they're the one thing that isn't. The architecture changes shape depending on which IDE the dev opened that morning.

A policy layer that lives in the repository and gets injected into every agent call is the only way out of that. It doesn't matter which model writes the code. It doesn't matter which IDE the dev prefers. The architectural policy is the same on every call because it's defined once, in the repo, and applied at generation time.

This is what decision continuity actually means in a heterogeneous-agent world. Not "the model remembers." The model never remembers, and asking it to is a category error. The project remembers, in a structured artifact, and that artifact is plumbed into every agent that touches the code.

## What I'd build today

If you're running an AI-assisted codebase right now, the cheapest thing you can do this week:

  1. Take your three most-violated architectural decisions — the ones that keep showing up in PR comments. Write them as typed rules in a single JSON file in the repo.
  2. Inject that file into the system prompt of every AI tool your team uses. Even just pasting it into `.cursorrules` and `CLAUDE.md` from the same source is a start, because now they're the same source.
  3. Write a 50-line script that scans diffs for violations of those rules — substring match is fine to start. Wire it into pre-commit. Make it fail loudly (a sketch follows this list).
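A sketch of that third step, assuming the same `policy.json` as above. The file names, the staged-diff approach, and the substring matching are illustrative choices, not a prescribed implementation:

```python
#!/usr/bin/env python3
"""Pre-commit sketch: block the commit if the staged diff trips a high-priority anti-pattern."""
import json
import subprocess
import sys
from pathlib import Path


def staged_additions() -> str:
    """Return only the added lines of the staged diff, so removals don't trigger false alarms."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n".join(
        line for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


def main() -> int:
    policy = json.loads(Path("policy.json").read_text())
    added = staged_additions().lower()
    failures = []
    for item in policy["items"]:
        if item["type"] != "anti_pattern" or item["priority"] != "high":
            continue
        # Crude on purpose: any tag appearing in an added line counts as a hit.
        hits = [t for t in item["tags"] if t != "forbidden" and t.lower() in added]
        if hits:
            failures.append(f"{item['id']} ({item['title']}): matched {', '.join(hits)}")
    for failure in failures:
        print(f"[POLICY VIOLATION] {failure}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```

Register it as a pre-commit hook and a banned import fails the commit instead of waiting for a reviewer to spot it.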

That's not a product. That's a couple of hours of plumbing. But it changes the failure mode from "we hope the reviewer catches it" to "the build breaks." That's the difference between a suggestion and a policy.

The bigger version of this — typed policy schemas, deterministic verifiers with alignment scoring, agent-agnostic injection layers, CI integration — is the wedge I've been working on with mneme. The repo is small, the demo runs in two minutes, and the benchmark and validation protocol are open. Try it on your own architectural decisions. Or don't, and write the 50-line version yourself. Either is better than what most teams are running today, which is nothing.

## The thing nobody is saying out loud

The conversation about AI coding tools is still mostly about which one is best. Cursor vs. Copilot vs. Claude Code. Benchmarks on HumanEval. Token throughput. IDE features.

That's the wrong question. The question that matters is: what enforces your architecture when any of them ships code?

Whichever team owns that layer owns AI-assisted development. The model is a commodity. The IDE is a commodity. Architectural policy, defined in your repo and enforced before generation, is the part that's actually yours — and right now it's the part that's missing.

The PR with the langchain regression shipped because there was no layer above the reviewer. There is going to be one. The only question is whether you build it before the next regression or after.


Repo: github.com/TheoV823/mneme — MIT licensed, runs locally, demo with no API key needed via `python demo.py --dry-run`. Feedback welcome, especially from teams running multiple AI agents in the same codebase.

More on the project: mnemehq.com.
