Everybody has opinions about AGENTS.md/CLAUDE.md files.
Best practices get shared. Templates get copied, and this folk knowledge dominates the industry. Last year, GitHub analyzed 2,500 repos and published best-practice advice. We wanted to go further: measure at scale, publish the data, and let anyone verify.
When the agent doesn't follow instructions and does something contradictory, the usual suspects are: the model is inconsistent, LLMs are not deterministic, you need better guardrails, you need retries.
The failures almost always get attributed to the model.
So we decided to measure. We built a diagnostic tool that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge. Then we pointed it at GitHub repositories with instruction files for five agents - Claude, Codex, Copilot, Cursor, and Gemini.
28,721 repositories. 165,063 files. 3.3 million instructions.
... and one question:
What if the instructions are the problem?
The dataset
28,721 projects. Sourced from GitHub via API search, cloned, and deterministically analyzed. Each project was scanned for instruction files across five coding agents — then deduplicated to remove false positives from agent detection overlap.
| Agent | Projects | % of corpus |
|---|---|---|
| Claude | 12,356 | 43.0% |
| Codex | 11,206 | 39.0% |
| Copilot | 7,755 | 27.0% |
| Cursor | 7,291 | 25.4% |
| Gemini | 5,942 | 20.7% |
The percentages add up to more than 100% because 37% of projects configure multiple agents. More on that later.
Key distributions stabilized early. A 9,582-repo sub-sample produced identical tier shares (±0.2pp) and the same mean scores as the full 12,076-repo intermediate sample. The final 28,721-repo corpus moved nothing. The patterns reported below are not small-sample artifacts.
All classifications are deterministic — the same file produces the same result every time. No LLM-as-judge. Sample classifications are published for inspection (methodology below). The tool is source-available.
How we measured
The analyzer parses each instruction file into atoms — the smallest semantically distinct units of content. A heading is one atom. A bullet point is one atom. A paragraph is one atom. Each atom gets classified along a few dimensions, all deterministic, no LLM involved:
Charge classification. A three-phase pipeline determines whether an atom is a directive ("use X"), a constraint ("do not use Y"), neutral content (context, explanation, structure), or ambiguous (could be read either way). Phase 1 detects negation and prohibition patterns. Phase 2 detects modal auxiliaries and direct commands. Phase 3 uses syntactic dependency parsing to catch imperatives that the first two phases missed. First definitive match wins. Atoms that partially match but don't clear any phase are marked ambiguous. Everything else is neutral.
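To make the shape of that pipeline concrete, here is a minimal Python sketch. The pattern lists, function names, and the phase-3 stub are illustrative assumptions, not the tool's actual rules — the real classifier uses full dependency parsing in phase 3.

```python
import re

# Illustrative patterns only -- the real classifier's rule set is larger.
PROHIBITION = re.compile(r"\b(never|do not|don't|avoid|must not)\b", re.I)
MODAL_COMMAND = re.compile(r"^(always\s+)?(use|run|prefer|format|write|add)\b|\b(must|should|always)\b", re.I)

def classify_charge(atom: str) -> str:
    """Three-phase charge classification; first definitive match wins."""
    # Phase 1: negation / prohibition patterns -> constraint
    if PROHIBITION.search(atom):
        return "constraint"
    # Phase 2: modal auxiliaries and direct commands -> directive
    if MODAL_COMMAND.search(atom):
        return "directive"
    # Phase 3: catch imperatives the first two phases missed
    # (stubbed -- the real tool uses syntactic dependency parsing here)
    if looks_imperative(atom):
        return "directive"
    # Atoms that partially match but clear no phase would be "ambiguous";
    # everything else is neutral content.
    return "neutral"

def looks_imperative(atom: str) -> bool:
    # Crude stand-in: sentence starts with a bare verb-like token.
    first = atom.strip().split(" ")[0].lower() if atom.strip() else ""
    return first in {"keep", "ensure", "document", "test"}
```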
Specificity. Binary: does the instruction name a specific construct — a tool, file, command, flag, function, or config key — or does it stay at the category level? "Use consistent formatting" is abstract. "Format with ruff format" is named. This is a text property, not a judgment call.
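A sketch of what such a binary check could look like. The surface signals here (backticked identifiers, CLI flags, file paths, known tool names) are my guesses at the kind of features a detector like this keys on, not reporails' actual rule set:

```python
import re

# Hypothetical surface signals for "names a specific construct".
NAMED_SIGNALS = [
    re.compile(r"`[^`]+`"),                                        # backticked identifier
    re.compile(r"\B--?[a-z][\w-]*"),                               # CLI flag like --fix or -v
    re.compile(r"\b[\w./-]+\.(md|py|ts|json|yml|yaml|toml)\b"),    # file path
    re.compile(r"\b(npx|npm|pip|uv|cargo|pytest|ruff|eslint)\b"),  # known tools
]

def is_named(instruction: str) -> bool:
    """True if the instruction references a concrete construct by name."""
    return any(p.search(instruction) for p in NAMED_SIGNALS)

assert is_named("Format with ruff format before committing")
assert not is_named("Use consistent code formatting")
```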
File categorization. Each file is classified as base config (your main CLAUDE.md or .cursorrules), a rule file, a skill definition, or a sub-agent definition — based on file path conventions for each agent.
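Path-convention categorization can be a small lookup. This sketch uses illustrative conventions; each agent's real path rules differ:

```python
from pathlib import PurePosixPath

def categorize(path: str) -> str:
    """Categorize an instruction file by path convention (illustrative subset)."""
    p = PurePosixPath(path.lower())
    parts = set(p.parts)
    if "skills" in parts:
        return "skill"
    if "agents" in parts and p.name != "agents.md":
        return "sub-agent"
    if "rules" in parts or p.suffix == ".mdc":
        return "rule"
    return "base"  # CLAUDE.md, AGENTS.md, .cursorrules, etc.

print(categorize("CLAUDE.md"))                             # base
print(categorize(".cursor/rules/style.mdc"))               # rule
print(categorize(".claude/skills/pest-testing/skill.md"))  # skill
print(categorize(".claude/agents/code-reviewer.md"))       # sub-agent
```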
Content type. Charge classification separates behavioral content (directives and constraints) from structural content (headings, context paragraphs, examples). That's how we know what fraction of your file is actually doing work.
The full tool is source-available (BUSL-1.1). You can run npx @reporails/cli check on your own project and inspect every finding. More on that at the end.
Finding 1: Most of your instruction file isn't instructions
Here's what the median instruction file actually contains:
- 50 content items total
- 12 of those are actual directives
- The rest is headings, context paragraphs, examples, structure
Only 27% of your instruction file is doing what you think it does.
The other 73% is scaffolding. Headings that organize but don't instruct. Explanation paragraphs that compete for the model's attention without adding behavioral weight. Example blocks. Context-setting prose.
That's not inherently bad. Structure matters. But if you're writing a 200-line CLAUDE.md and only 54 lines are actual instructions, you should probably know that.
The average instruction is 8.9 words long. That's a sentence fragment.
Finding 2: 90% of configs contain instructions that don't name what they mean
This is the big one.
We measured whether each instruction references specific tools, files, commands, or constructs by name — or whether it stays at the category level.
Two-thirds of all instructions are abstract.
| Agent | Names specific constructs | Uses category language |
|---|---|---|
| Gemini | 39.3% | 60.7% |
| Codex | 38.3% | 61.7% |
| Copilot | 33.3% | 66.7% |
| Cursor | 30.8% | 69.2% |
| Claude | 30.6% | 69.4% |
What does this look like in practice?
Abstract: "Use consistent code formatting"
Specific: "Format with ruff format before committing"
Abstract: "Avoid using mocks in tests"
Specific: "Do not use unittest.mock — use the real database via test_db fixture"
In previous controlled experiments, specificity produced a 10.9x odds ratio in compliance (N=1000, p<10⁻³⁰). The instruction that names the exact construct gets followed. The one that describes it abstractly... mostly doesn't. This is consistent with independent findings from RuleArena (Zhou et al., ACL 2025), where LLMs struggled systematically with complex rule-following tasks — even strong models fail when the rules themselves are ambiguous or underspecified.
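For readers less familiar with the metric: the odds ratio compares the odds of compliance for specific instructions against the odds for abstract ones. A toy calculation with deliberately made-up counts (not the experiment's actual data) shows the arithmetic:

```python
# Toy 2x2 table with MADE-UP counts -- illustrates the metric,
# not the actual N=1000 experiment data.
named_followed, named_ignored = 450, 50         # specific instructions
abstract_followed, abstract_ignored = 225, 275  # abstract instructions

odds_named = named_followed / named_ignored           # 9.0
odds_abstract = abstract_followed / abstract_ignored  # ~0.82
print(f"odds ratio: {odds_named / odds_abstract:.1f}x")  # ~11.0x
```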
89.9% of all agent configurations contain at least one instruction that doesn't name what it means. It's not a few projects. It's nearly everyone.
Finding 3: agents.md is the most common instruction file
Before we get into quality, let's look at what people are actually naming their files:
| # | File | Count |
|---|---|---|
| 1 | `agents.md` / `AGENTS.md` | 20,654 |
| 2 | `claude.md` / `CLAUDE.md` | 14,014 |
| 3 | `gemini.md` / `GEMINI.md` | 5,703 |
| 4 | `.github/copilot-instructions.md` | 5,647 |
| 5 | `.cursorrules` | 2,415 |
49,071 unique file paths across the corpus. That's not a typo. The format fragmentation is real.
A few things jumped out:
- `claude.md` (lowercase, 10,642) is 3x more common than `CLAUDE.md` (3,372). Both work. The community clearly prefers lowercase.
- `agents.md` dominates — the Codex/generic format is the single most popular instruction file name.
- Skills and rules are already showing up in meaningful numbers: `.claude/rules/testing.md` (422), `.agents/skills/tailwindcss-development/skill.md` (334).
Finding 4: Different agents, completely different config philosophies
Not all agents are configured the same way. Not even close.
We categorized every file into four types: base config (your main CLAUDE.md, .cursorrules, etc.), rules (scoped rule files), skills (task-specific skill definitions), and sub-agents (role-based agent definitions).
| Agent | Base | Rules | Skills | Sub-agents | Total files |
|---|---|---|---|---|---|
| Claude | 18,733 | 4,638 | 10,692 | 10,538 | 44,601 |
| Cursor | 5,903 | 19,843 | 6,237 | 1,716 | 33,699 |
| Copilot | 16,026 | 4,486 | 10,352 | 3,012 | 33,876 |
| Codex | 19,001 | 81 | 8,911 | 165 | 28,158 |
| Gemini | 10,253 | 74 | 3,039 | 53 | 13,419 |
Cursor is 60% rules files. The .cursor/rules/ system dominates its configuration surface. One agent's config looks nothing like another's.
Claude is the only agent with a roughly balanced architecture across all four config types. Codex and Gemini are almost entirely base config — single-file setups.
The median Cursor project has 3 instruction files. The median Codex project has 1. These aren't just different tools. They're different configuration philosophies.
Finding 5: 37% of projects configure multiple agents
10,620 projects in the corpus target two or more agents. That's not a niche pattern — it's over a third of all projects.
| Agents | Projects |
|---|---|
| 1 | 18,101 |
| 2 | 6,776 |
| 3 | 2,687 |
| 4 | 949 |
| 5 | 208 |
The dominant pair is Claude + Codex (5,038 projects). Makes sense — CLAUDE.md + AGENTS.md is the most natural multi-agent starting point.
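The pair counts are reproducible from the published dataset. A minimal sketch, assuming repos.jsonl has been downloaded locally and that canonical_agents (the field used in the verification snippet at the end of this article) is a list per record:

```python
import json
from collections import Counter
from itertools import combinations

# Count agent co-occurrence pairs across the corpus.
pairs = Counter()
with open("repos.jsonl") as f:
    for line in f:
        agents = sorted(json.loads(line)["canonical_agents"])
        pairs.update(combinations(agents, 2))

for pair, count in pairs.most_common(3):
    print(pair, count)
```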
Here's what's interesting about multi-agent repos: the same developer, writing instructions at the same time, for the same project, produces measurably different instruction quality across agents. The person didn't change. The project didn't change. The instruction format did.
Some of that is structural. Cursor's .mdc rules enforce a different format than Claude's markdown. Codex's AGENTS.md invites a different writing style than Copilot's copilot-instructions.md. The format shapes the content.
Finding 6: The most-copied skills are the vaguest
This is where it gets interesting.
13,309 unique skills across the corpus. Some of them appear in hundreds of repos — clearly copied from shared templates or community sources. So we measured them.
Named% = what fraction of a skill's instructions name a specific tool, file, or command (instead of using category language).
| Skill | Repos | Named% | What it means |
|---|---|---|---|
| `frontend-design` | 271 | 2.8% | Almost entirely abstract advice |
| `web-design-guidelines` | 197 | 10.2% | Generic design principles |
| `vercel-react-best-practices` | 315 | 30.7% | Mix of specific and vague |
| `pest-testing` | 216 | 55.1% | Names actual test constructs |
| `livewire-development` | 87 | 75.5% | Names specific Livewire components |
| `next-best-practices` | 76 | 92.6% | Names almost everything |
`frontend-design` is in 271 repos with 2.8% specificity. It's a wall of "follow responsive design principles" and "ensure accessibility compliance." That reads well. It sounds professional. It gives the model almost nothing concrete to act on.
`next-best-practices` is in 76 repos with 92.6% specificity. It says things like "use next/image for all images" and "prefer server components over client." It reads like a checklist. It tells the model exactly what to do.
One is shared 3.5x more than the other.
The most popular skills are the most decorative. The well-written ones barely spread.
The best and worst skills (>50 repos)
Most specific:
| Skill | Repos | Named% |
|---|---|---|
| `next-best-practices` | 76 | 92.6% |
| `shadcn` | 74 | 82.6% |
| `livewire-development` | 87 | 75.5% |
| `pest-testing` | 216 | 55.1% |
| `laravel-best-practices` | 94 | 49.7% |
Most vague:
| Skill | Repos | Named% |
|---|---|---|
| `openspec-explore` | 110 | 2.5% |
| `frontend-design` | 271 | 2.8% |
| `web-design-guidelines` | 197 | 10.2% |
| `vercel-composition-patterns` | 131 | 10.7% |
| `find-skills` | 113 | 18.9% |
Notice a pattern? The Laravel/Livewire ecosystem produces specific skills. The generic frontend/design ones stay abstract. Domain-specific communities write better instructions than cross-cutting ones.
Finding 7: Sub-agents are almost entirely persona prompts
5,526 unique sub-agent roles in the corpus. Developers are building agent teams: code reviewers, architects, debuggers, testers, security auditors.
The problem? Sub-agents are the most abstract config type in the entire corpus. Only 17% of sub-agent instructions name specific constructs.
| Role | Repos | Named% |
|---|---|---|
| `code-reviewer.md` | 236 | 14.4% |
| `architect.md` | 89 | 18.2% |
| `debugger.md` | 66 | 9.4% |
| `security-auditor.md` | 57 | 14.8% |
| `test-runner.md` | 54 | 10.5% |
| `frontend-developer.md` | 47 | 9.0% |
Most of these are persona prompts. "You are a senior code reviewer. You care about code quality, security, and maintainability." That's a role description, not an instruction set. It tells the model who to be, not what to do.
Compare this to a base config that says "run `uv run pytest tests/ -v` before suggesting any commit" — that's 100% named, and the model knows exactly what action to take.
The anatomy chart: more directives, worse quality
Here's where it all comes together.
We measured three things for each config type: how big the files are, how many directives they contain, and what fraction of those directives actually name something specific.
Sub-agents have the largest files (61 items median), the most directives (17), and the worst specificity (17%). They're the wordiest config type in the corpus and the least effective.
Base configs are the opposite. Fewer directives (11), but 40% of them name specific constructs. The developer writing their own CLAUDE.md by hand, for their own project, produces the most actionable instructions.
| Config type | Files | Median size | Median directives | Specificity |
|---|---|---|---|---|
| Base configs | 69,916 | 50 items | 11 | 39.8% |
| Rules files | 29,122 | 34 items | 9 | 31.2% |
| Skills | 39,231 | 59 items | 14 | 30.8% |
| Sub-agents | 15,484 | 61 items | 17 | 17.0% |
The pattern is clear: what developers write by hand is the most specific. What gets templated and shared gets progressively vaguer. And what tries hardest to sound authoritative — sub-agent persona prompts — is the most hollow.
More instructions is not better instructions.
Independent research supports the structural angle: FlowBench (Xiao et al., 2024) found that presenting workflow knowledge in structured formats (flowcharts, numbered steps) improved LLM agent planning by 5-6 percentage points over prose — across GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo. Structure is not decoration. It changes what the model retrieves.
Limitations
Five things to know about these numbers.
Sampling bias. GitHub API search, public repos only, English-skewed. Enterprise configurations, private repos, and non-English projects are not represented. This is not a random sample of all instruction files in production.
Classification accuracy. The charge classifier is deterministic but not perfect. Edge cases exist: mixed-charge sentences, implicit constructs, domain jargon that looks like a category term but is actually a named tool. Specificity detection (named vs abstract) is simpler and more robust. Sample classifications are published for inspection.
Association, not causation. "More directives correlate with lower specificity" is an observed pattern. We do not claim that adding directives causes quality to drop.
Snapshot. Collected March–April 2026. Instruction practices are changing fast — agents.md didn't exist six months ago. These numbers describe the ecosystem at collection time.
No popularity weighting. A 10-star hobby project counts the same as a 50K-star production repo. The distribution of instruction quality in production agent work may differ.
What this means
This isn't an article about AI models being bad at following instructions. The models are fine.
This is an article about what we actually give them to work with.
Most instruction files are three-quarters scaffolding. Two-thirds of the actual instructions don't name what they're talking about. The most popular community skills are the most decorative. Sub-agent definitions are the wordiest files in the corpus and the least specific.
None of that is obvious from reading your own files. It wasn't obvious to us before we measured it. A well-structured CLAUDE.md feels thorough. A shared skill with 271 repos feels battle-tested. A sub-agent with 17 directives feels comprehensive.
Measurement shows something different.
In The Undiagnosed Input Problem, I argued that the industry is great at inspecting outputs and weak at inspecting inputs. This corpus analysis is the evidence for that claim.
The instruction files are there. The developers wrote them. They just have no way to know which parts are working and which parts are wallpaper.
Try it yourself
The analyzer we used for this corpus analysis is available as a CLI you can run against your own instruction files.
Reporails — instruction diagnostics for coding agents. Deterministic. No LLM-as-judge. 97 rules across structure, content, efficiency, maintenance, and governance.
```
npx @reporails/cli check
```
That scans your project, detects which agents are configured, and reports findings with specific line numbers and rule IDs. Here's what the output looks like:
```
Reporails — Diagnostics

┌─ Main (1)
│  CLAUDE.md
│  ⚠ Missing directory layout                     CORE:C:0035
│  ⚠ L9 7 of 7 instruction(s) lack reinfor…       CORE:C:0053
│  ... and 16 more
│
└─ 21 findings

Score: 7.9 / 10  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░
21 findings · 4 warnings · 1 info
Compliance: HIGH
```
The corpus analysis used the same classification pipeline at scale. Fix the findings, run again, watch your score improve.
The dataset
The full corpus is published at reporails/30k-corpus. Three files:
| File | Records | What it contains |
|---|---|---|
| `repos.jsonl` | 28,721 | Per-project record: agents configured, stars, language, license, topics |
| `stats_public.json` | 1 | Every aggregate statistic in this article |
| `validation_key.csv` | 2,814 | Sample classifications with source text for inspection |
Verify any claim:
# "28,721 repositories"
cat repos.jsonl | wc -l
# "43% Claude"
cat repos.jsonl | python3 -c "
import sys, json
repos = [json.loads(l) for l in sys.stdin]
claude = sum(1 for r in repos if 'claude' in r['canonical_agents'])
print(f'{claude}/{len(repos)} = {claude/len(repos)*100:.1f}%')
"
Every number in every table traces to that dataset. If you disagree with a finding, count the rows.
This is part of the Instruction Quality series. Previous: The Undiagnosed Input Problem. Related: Precision Beats Clarity · Do Not Think of a Pink Elephant · 7 Formatting Rules for the Machine.






Top comments (20)
"Guidance AND boundaries together" is the right framing — the instruction sets I've seen fail are almost always missing one or the other: dense rules with no enforcement ceiling, or hard stops with no context for why.
The pink elephant article landed. The specificity finding tracks exactly with something we observed: our most violation-prone rule was "never bulk-modify strategy configs" (pure prohibition). Rewrote it as "any strategy change writes prior values + rollback instructions to memory/handoffs/ before editing, and touches the minimal surgical set of fields with per-strategy rationale." Near-zero violations since. The affirmative action gives Claude somewhere to go rather than just a wall to avoid.
Will run the diagnostics authenticated via `ails auth login` and share the full output here. The system has 21 explicit rules across a few categories — governance rules (identity, escalation criteria, scope discipline) vs. operational rules (ports, file paths, rollback procedures). Curious whether ails surfaces that distinction or scores on a uniform rubric, since I'd expect the specificity grades to diverge significantly between those two categories.

Ran `ails check` on our CLAUDE.md (no auth). It crashes mid-run on Windows: `SIGALRM`/`ITIMER_REAL` is POSIX-only — Windows doesn't have it. The deterministic checks phase crashes before any findings are reported. The `_scan_file` timeout handler needs a Windows-compatible fallback (`threading.Timer` or similar).

Happy to share the full traceback if useful for the issue tracker. Might be worth a Windows CI run if you don't have one already.
oh wow, nice catch!
I'm currently working on the 0.5.5 release and will add the fix there (it will be released sometime today).
May I ask you to create a ticket for this here: github.com/reporails/cli/issues
Thank you again for catching this!
Ticket filed: github.com/reporails/cli/issues/17 — includes full traceback, threading.Timer fix pattern, and a one-line hasattr guard as minimal fallback. Happy to test the 0.5.5 build on Windows when it's ready.
Thanks again for the ticket. 0.5.5 has been released together with the fix. Run `ails update`.

The 73% scaffolding finding maps cleanly to what I've observed running a 24/7 autonomous system where CLAUDE.md governs every wake cycle. Two multipliers on specificity the corpus probably understates:
1. Pair the named directive with an enforcement hook. "run tests before commit" at 40% named specificity still hits ~90% compliance in a warm session, but add a pre-commit hook that blocks the commit if the test command did not run and you get 100%. The hook is not replacing the instruction, it is ratifying it at a boundary the model cannot skip (see the sketch after this list).
2. Re-frame "never" rules as pre-action tripwires in the base config. "Never touch the :5051 service" gets forgotten under cognitive load. "Before any edit under /services/trading/, verify 127.0.0.1:5051/health == 200 and write prior values to memory/handoffs/" holds because the model has a concrete gate before the dangerous action, not a prohibition to remember.
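A minimal sketch of that enforcement-hook pattern as a Python pre-commit hook. The `.test-ran` marker convention and the ten-minute freshness window are hypothetical; the test command reuses the `uv run pytest tests/` example from the article:

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit -- sketch of "the hook ratifies the instruction".
# Hypothetical convention: a successful test run touches .test-ran.
import os
import subprocess
import sys
import time

MARKER = ".test-ran"
MAX_AGE_SECONDS = 600  # tests must have passed within the last 10 minutes

stale = (not os.path.exists(MARKER)
         or time.time() - os.path.getmtime(MARKER) > MAX_AGE_SECONDS)
if stale:
    # Boundary the model cannot skip: re-run the suite before allowing commit.
    result = subprocess.run(["uv", "run", "pytest", "tests/", "-q"])
    if result.returncode != 0:
        sys.exit("pre-commit: tests failed; commit blocked")
    open(MARKER, "w").close()
```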
"Virality selects for vagueness" is the sharpest line in the piece. frontend-design travels because it is inseparable from no project in particular. The specific patterns do not travel because they are inseparable from the project that produced them, which is their whole point.
Curious what fraction of the 10,538 Claude sub-agent files in your corpus pair with any pre-commit or SessionEnd hook. Suspect close to zero, and that gap might explain more of the 17% sub-agent specificity than the persona framing does. Going to run the CLI against my own setup this week.
"The hook is ratifying the boundary" -> exactly. That's actually on the roadmap of the CLI. The instruction systems need guidance AND boundaries together.
Regarding the "never" rules, I've run several experiments to see what exactly would yield the best results. I wrote an article about it that I think you'll like: Do NOT think of a pink elephant
I'm curious what will be the output of the diagnostics on your system, please do share it once you have it.
Also a note here: for your use-case I'd recommend getting authenticated first (`ails auth login` -> it authenticates you via GitHub SSO), because you'll see a much more detailed result.

The pink elephant framing nails exactly what we observed. "Never bulk-modify strategy configs" had persistent violations. Rewrote it as: "any change writes prior values and rollback instructions to memory/handoffs/ first, then touches the minimal surgical set of fields with per-strategy rationale." Near-zero violations since. The model has a concrete gate to execute rather than a concept to avoid.
One question on the experimental data: at what specificity threshold does the affirmative-framing benefit collapse back toward vagueness? "Write a rollback file" is affirmative but still abstract. Does naming the exact path and format push compliance further, or does the affirmative structure do most of the work regardless?
Running `ails auth login` this block. The architecture has 21 rules split across governance (identity, escalation criteria, scope) vs operational (ports, file paths, rollback procedures). Hypothesis is the operational rules grade higher on specificity since they name exact files and services; the governance rules probably score closer to sub-agent territory. Worth knowing whether the classifier surfaces that distinction.
folk knowledge is real here. teams copy nearly identical CLAUDE.md templates but get wildly different agent behavior - same file, different context, different compliance. the gap is almost always ambiguous directives, not the model.
Can confirm, and the data shows why it keeps happening. The most-copied instruction sets in the corpus have the lowest specificity (some under 3% named constructs).
They spread because they sound applicable to everything. The well-designed ones are domain-specific by nature, so they stay local. Virality selects for vagueness.
that tracks with how docs spread generally - the too-specific ones don't travel. the irony is the vague universal ones are exactly the ones that silently fail you in real context
Great piece - thanks for digging into this. I appreciate how you highlight the gap between model capability and the quality of instructions we give them; that distinction is easy to miss but crucial. Your examples about ambiguous prompts and inconsistent evaluation really drove home how much the output depends on the instruction design and the feedback loop we build.
A couple of quick thoughts:
Curious what you think about tooling that captures and reuses high‑quality instruction patterns across teams — could that be the missing piece for scaling instruction quality?
thank you!
... and spot on with all three points. The standardized templates idea is right, the research on 28k repos showed that the structure of instructions (consistent formatting, placement, explicit constraints, clear scope etc.) matters more than clever wording. The variance reduction you'd get from templates is measurable and that's extremely valuable.
Actually that's the mission of reporails. Measuring instruction quality and behavior compliance mechanically across harnesses and providing a tool to guardrail the instruction system. To solve it, a GitHub action was also added to the CLI, so the quality of the instruction system can be enforced on CI level, without locking down what teams can actually write (leaving breathing space for innovation).
ps.: Regarding the better HITL evaluation - today a client of mine asked me to look into his Laravel website. I had never worked with Laravel, but I know a bit of PHP, so I cloned the project, started Claude Code (the project had no instruction files) and started the init procedure with one caveat: I gave the instruction to keep running the `ails check` command after the init is done, until the score reaches at least 8.5. The rest was automatic. After around 5-6 minutes I could start working on the project and get things done fast and accurately.

This connects directly to the task agent vs reasoning agent distinction.
For TASK agents (well-defined, deterministic workflows), instruction quality barely matters — give it clear inputs and it executes. The instructions are basically a function signature.
For REASONING agents (open-ended, multi-step problems), instruction quality IS the product. The prompt isn't just telling the model what to do — it's shaping its entire reasoning process.
The problem: most people are writing 'task agent prompts' for reasoning agents and wondering why the output is mediocre.
What we need is an 'instruction quality ladder':
Level 1 - Explicit commands ('Summarize this')
Level 2 - Context-aware prompts ('Summarize this for a technical audience')
Level 3 - Reasoning frameworks ('First identify the key claims, then evaluate evidence weight, then synthesize...')
Level 4 - Meta-instruction ('Here's how to think about this type of problem generally')
Most LLM users are stuck at Level 1-2. The jump from 2→3 is where you see 10x quality improvements.
Great analysis on this.
Thank you!
I do like your quality ladder idea; the L2 -> L3 jump you're describing maps to an interesting thing we measured: structuring instructions as directive + positive reasoning + constraint alone produces a 26pp compliance improvement (N=7500, replicated 3x). The intuition about "reasoning frameworks" is also right; the mechanism is concept-activation sequencing in the residual stream. The data is in the experiments repo if you want to dig in.