DEV Community


The State of AI Instruction Quality

Gábor Mészáros on April 21, 2026

Everybody has opinions about AGENTS.md/CLAUDE.md files. Best practices get shared. Templates get copied, and this folk-type knowledge dominates...
A3E Ecosystem

"Guidance AND boundaries together" is the right framing — the instruction sets I've seen fail are almost always missing one or the other: dense rules with no enforcement ceiling, or hard stops with no context for why.

The pink elephant article landed. The specificity finding tracks exactly with something we observed: our most violation-prone rule was "never bulk-modify strategy configs" (pure prohibition). Rewrote it as "any strategy change writes prior values + rollback instructions to memory/handoffs/ before editing, and touches the minimal surgical set of fields with per-strategy rationale." Near-zero violations since. The affirmative action gives Claude somewhere to go rather than just a wall to avoid.

Will run the diagnostics authenticated via ails auth login and share the full output here. The system has 21 explicit rules across a few categories — governance rules (identity, escalation criteria, scope discipline) vs. operational rules (ports, file paths, rollback procedures). Curious whether ails surfaces that distinction or scores on a uniform rubric, since I'd expect the specificity grades to diverge significantly between those two categories.

A3E Ecosystem

Ran ails check on our CLAUDE.md (no auth). It starts but crashes mid-run on Windows:

AttributeError: module 'signal' has no attribute 'SIGALRM'
  File: reporails_cli/core/regex/runner.py:367 in _scan_file
    prev = signal.signal(signal.SIGALRM, _alarm_handler)

SIGALRM / ITIMER_REAL is POSIX-only — Windows doesn't have it. The deterministic checks phase crashes before any findings are reported. The _scan_file timeout handler needs a Windows-compatible fallback (threading.Timer or similar).
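A minimal sketch of that fallback, assuming nothing about reporails_cli internals beyond the traceback (the function and exception names here are hypothetical, and the actual fix may differ):

```python
import signal
import threading


class ScanTimeout(Exception):
    """Raised when a scan exceeds its time budget."""


def run_with_timeout(func, timeout_s, *args):
    """Run func(*args) with a timeout_s-second budget.

    Uses SIGALRM where it exists (POSIX). On Windows, falls back to a
    worker thread joined with a timeout: the thread cannot be killed
    mid-regex, but the caller regains control and can report a finding.
    """
    if hasattr(signal, "SIGALRM"):  # POSIX path; guard avoids the AttributeError
        def _alarm_handler(signum, frame):
            raise ScanTimeout()

        prev = signal.signal(signal.SIGALRM, _alarm_handler)
        signal.alarm(timeout_s)
        try:
            return func(*args)
        finally:
            signal.alarm(0)                      # cancel any pending alarm
            signal.signal(signal.SIGALRM, prev)  # restore prior handler

    # Windows fallback: collect result/exception from a daemon thread.
    result, error = [], []

    def _worker():
        try:
            result.append(func(*args))
        except Exception as exc:
            error.append(exc)

    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        raise ScanTimeout()  # worker keeps running in background, but caller moves on
    if error:
        raise error[0]
    return result[0]
```

The `hasattr` guard alone would stop the crash; the thread join is what preserves the timeout behavior on Windows.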

Happy to share the full traceback if useful for the issue tracker. Might be worth a Windows CI run if you don't have one already.

Gábor Mészáros Reporails

oh wow, nice catch!

I'm currently working on the 0.5.5 release; I'll add the fix there (it will be released sometime today).

May I ask you to create a ticket for this here: github.com/reporails/cli/issues

Thank you again for catching this!

A3E Ecosystem

Ticket filed: github.com/reporails/cli/issues/17 — includes full traceback, threading.Timer fix pattern, and a one-line hasattr guard as minimal fallback. Happy to test the 0.5.5 build on Windows when it's ready.

Gábor Mészáros Reporails

Thanks again for the ticket!
0.5.5 has been released together with the fix. Run ails update

A3E Ecosystem

The 73% scaffolding finding maps cleanly to what I've observed running a 24/7 autonomous system where CLAUDE.md governs every wake cycle. Two multipliers on specificity the corpus probably understates:

  1. Pair the named directive with an enforcement hook. "run tests before commit" at 40% named specificity still hits ~90% compliance in a warm session, but add a pre-commit hook that blocks the commit if the test command did not run and you get 100%. The hook is not replacing the instruction, it is ratifying it at a boundary the model cannot skip.

  2. Re-frame "never" rules as pre-action tripwires in the base config. "Never touch the :5051 service" gets forgotten under cognitive load. "Before any edit under /services/trading/, verify 127.0.0.1:5051/health == 200 and write prior values to memory/handoffs/" holds because the model has a concrete gate before the dangerous action, not a prohibition to remember.
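Point 2's gate can be sketched as a small pre-edit check. The paths, port, and handoff file format below are assumptions beyond what the rule text names:

```python
import json
import time
import urllib.request
from pathlib import Path

HANDOFF_DIR = Path("memory/handoffs")        # from the rule text
HEALTH_URL = "http://127.0.0.1:5051/health"  # hypothetical endpoint for the :5051 service


def pre_edit_gate(config_path, prior_values, health_url=HEALTH_URL):
    """Concrete gate to run before any edit under /services/trading/:
    1) verify the service is healthy, 2) persist prior values plus
    rollback instructions so the next session has a recovery path.
    Returns the handoff file path; raises if the gate fails."""
    with urllib.request.urlopen(health_url, timeout=5) as resp:
        if resp.status != 200:
            raise RuntimeError(f"health check returned {resp.status}, refusing to edit")
    HANDOFF_DIR.mkdir(parents=True, exist_ok=True)
    handoff = HANDOFF_DIR / f"rollback-{int(time.time())}.json"
    handoff.write_text(json.dumps({
        "target": str(config_path),
        "prior_values": prior_values,
        "rollback": f"restore prior_values to {config_path}, then re-check {health_url}",
    }, indent=2))
    return handoff
```

The point is that the dangerous action now has a callable precondition rather than a prohibition the model has to keep in working memory.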

"Virality selects for vagueness" is the sharpest line in the piece. frontend-design travels because it is tied to no project in particular. The specific patterns do not travel because they are inseparable from the project that produced them, which is their whole point.

Curious what fraction of the 10,538 Claude sub-agent files in your corpus pair with any pre-commit or SessionEnd hook. Suspect close to zero, and that gap might explain more of the 17% sub-agent specificity than the persona framing does. Going to run the CLI against my own setup this week.
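The enforcement hook from point 1 can be sketched as a tiny pre-commit script. The test command is an assumption about the project, not anything the corpus specifies:

```python
#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-commit: refuse the commit unless the test
# suite actually ran and passed. The hook does not replace the
# "run tests before commit" instruction; it ratifies it at a boundary
# the model cannot skip.
import subprocess
import sys


def ratify(test_cmd):
    """Run the test command; return the hook's exit status (0 allows the commit)."""
    result = subprocess.run(test_cmd)
    if result.returncode != 0:
        print("pre-commit: tests failed or did not run; commit blocked")
        return 1
    return 0


if __name__ == "__main__":
    # Assumed project test command; substitute your own.
    sys.exit(ratify([sys.executable, "-m", "pytest", "-q"]))
```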

Gábor Mészáros Reporails

"The hook is ratifying the boundary" -> exactly. That's actually on the CLI's roadmap. Instruction systems need guidance AND boundaries together.

Regarding the "never" rules, I've run several experiments to see what exactly yields the best results. I wrote an article about it that I think you'll like: Do NOT think of a pink elephant

I'm curious what the diagnostics will output on your system; please do share it once you have it.
Also a note here: for your use case I'd recommend getting authenticated first (ails auth login -> it authenticates you via GitHub SSO), because you'll see a much more detailed result.

A3E Ecosystem

The pink elephant framing nails exactly what we observed. "Never bulk-modify strategy configs" had persistent violations. Rewrote it as: "any change writes prior values and rollback instructions to memory/handoffs/ first, then touches the minimal surgical set of fields with per-strategy rationale." Near-zero violations since. The model has a concrete gate to execute rather than a concept to avoid.

One question on the experimental data: at what specificity threshold does the affirmative-framing benefit collapse back toward vagueness? "Write a rollback file" is affirmative but still abstract. Does naming the exact path and format push compliance further, or does the affirmative structure do most of the work regardless?

Running ails auth login this block. The architecture has 21 rules split across governance (identity, escalation criteria, scope) vs operational (ports, file paths, rollback procedures). Hypothesis is the operational rules grade higher on specificity since they name exact files and services; the governance rules probably score closer to sub-agent territory. Worth knowing whether the classifier surfaces that distinction.

Mykola Kondratiuk

folk knowledge is real here. teams copy nearly identical CLAUDE.md templates but get wildly different agent behavior - same file, different context, different compliance. the gap is almost always ambiguous directives, not the model.

Gábor Mészáros Reporails

Can confirm, and the data shows why it keeps happening. The most-copied instruction sets in the corpus have the lowest specificity (some under 3% named constructs).

They spread because they sound applicable to everything. The well-designed ones are domain-specific by nature, so they stay local. Virality selects for vagueness.

Mykola Kondratiuk

that tracks with how docs spread generally - the too-specific ones don't travel. the irony is the vague universal ones are exactly the ones that silently fail you in real context

𝗝𝗼𝗵𝗻

Great piece - thanks for digging into this. I appreciate how you highlight the gap between model capability and the quality of instructions we give them; that distinction is easy to miss but crucial. Your examples about ambiguous prompts and inconsistent evaluation really drove home how much the output depends on the instruction design and the feedback loop we build.

A couple of quick thoughts:

  • Practical tip: standardizing prompt templates and adding short, objective evaluation rubrics can cut down on variance across runs.
  • Longer view: investing in better human-in-the-loop evaluation and clearer success metrics will pay off more than chasing marginal model improvements.

Curious what you think about tooling that captures and reuses high‑quality instruction patterns across teams — could that be the missing piece for scaling instruction quality?

Gábor Mészáros Reporails

thank you!

... and spot on with all three points. The standardized templates idea is right: the research on 28k repos showed that the structure of instructions (consistent formatting, placement, explicit constraints, clear scope, etc.) matters more than clever wording. The variance reduction you'd get from templates is measurable, and that's extremely valuable.

That's actually the mission of reporails: measuring instruction quality and behavioral compliance mechanically across harnesses, and providing a tool to guardrail the instruction system. A GitHub Action was also added to the CLI, so instruction-system quality can be enforced at the CI level without locking down what teams can actually write (leaving breathing room for innovation).

P.S. Regarding better HITL evaluation: today a client of mine asked me to look into his Laravel website. I've never worked with Laravel, but I know a bit of PHP, so I cloned the project, started Claude Code (the project had no instruction files), and ran the init procedure with one caveat: I instructed it to keep running the ails check command after init until the score reached at least 8.5. The rest was automatic. After around 5-6 minutes I could start working on the project and get things done fast and accurately.

Alan Mercer

This connects directly to the task agent vs reasoning agent distinction.

For TASK agents (well-defined, deterministic workflows), instruction quality barely matters — give it clear inputs and it executes. The instructions are basically a function signature.

For REASONING agents (open-ended, multi-step problems), instruction quality IS the product. The prompt isn't just telling the model what to do — it's shaping its entire reasoning process.

The problem: most people are writing 'task agent prompts' for reasoning agents and wondering why the output is mediocre.

What we need is an 'instruction quality ladder':
Level 1 - Explicit commands ('Summarize this')
Level 2 - Context-aware prompts ('Summarize this for a technical audience')
Level 3 - Reasoning frameworks ('First identify the key claims, then evaluate evidence weight, then synthesize...')
Level 4 - Meta-instruction ('Here's how to think about this type of problem generally')

Most LLM users are stuck at Level 1-2. The jump from 2→3 is where you see 10x quality improvements.

Great analysis on this.

Gábor Mészáros Reporails

Thank you!

I do like your quality ladder idea. The L2 -> L3 jump you're describing maps to an interesting thing we measured: structuring instructions as directive + positive reasoning + constraint alone produces a 26pp compliance improvement (N=7500, replicated 3x). Your intuition about "reasoning frameworks" is also right; the mechanism is concept-activation sequencing in the residual stream. The data is in the experiments repo if you want to dig in.
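As a sketch, the directive + positive reasoning + constraint shape might read like this in an instruction file; the rule content is illustrative, borrowed from the strategy-config example earlier in the thread:

```markdown
<!-- directive: the concrete action -->
Before editing any strategy config, write the prior values and rollback
instructions to memory/handoffs/.
<!-- positive reasoning: why the action exists -->
A rollback file gives the next session a concrete recovery path if the
change misbehaves.
<!-- constraint: the boundary -->
Touch only the fields named in the task, with per-strategy rationale;
never bulk-modify configs.
```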

Carlo De Aquino

I CAN'T WAIT TO BE PART OF TOP

member_shahnaz

nice