Everybody has opinions about AGENTS.md/CLAUDE.md files.
Best practices get shared. Templates get copied, and this folk knowledge dominates the industry. Last year, GitHub analyzed 2,500 repos and published best-practice advice. We wanted to go further: measure at scale, publish the data, and let anyone verify.
When the agent doesn't follow instructions and does something contradictory, the usual suspects are: the model is inconsistent, LLMs are not deterministic, you need better guardrails, you need retries.
The failures almost always get attributed to the model.
So we decided to measure. We built a diagnostic tool that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge. Then we pointed it at GitHub repositories with instruction files for five agents - Claude, Codex, Copilot, Cursor, and Gemini.
28,721 repositories. 165,063 files. 3.3 million instructions.
... and one question:
What if the instructions are the problem?
The dataset
28,721 projects. Sourced from GitHub via API search, cloned, and deterministically analyzed. Each project was scanned for instruction files across five coding agents — then deduplicated to remove false positives from agent detection overlap.
| Agent | Projects | % of corpus |
|---|---|---|
| Claude | 12,356 | 43.0% |
| Codex | 11,206 | 39.0% |
| Copilot | 7,755 | 27.0% |
| Cursor | 7,291 | 25.4% |
| Gemini | 5,942 | 20.7% |
The percentages add up to more than 100% because 37% of projects configure multiple agents. More on that later.
Key distributions stabilized early. A 9,582-repo sub-sample produced identical tier shares (±0.2pp) and the same mean scores as the full 12,076-repo intermediate sample. The final 28,721-repo corpus moved nothing. The patterns reported below are not small-sample artifacts.
All classifications are deterministic — the same file produces the same result every time. No LLM-as-judge. Sample classifications are published for inspection (methodology below). The tool is source-available.
How we measured
The analyzer parses each instruction file into atoms — the smallest semantically distinct units of content. A heading is one atom. A bullet point is one atom. A paragraph is one atom. Each atom gets classified along a few dimensions, all deterministic, no LLM involved:
Charge classification. A three-phase pipeline determines whether an atom is a directive ("use X"), a constraint ("do not use Y"), neutral content (context, explanation, structure), or ambiguous (could be read either way). Phase 1 detects negation and prohibition patterns. Phase 2 detects modal auxiliaries and direct commands. Phase 3 uses syntactic dependency parsing to catch imperatives that the first two phases missed. First definitive match wins. Atoms that partially match but don't clear any phase are marked ambiguous. Everything else is neutral.
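To make the shape of that pipeline concrete, here is a minimal Python sketch. The pattern lists, function names, and the phase-3 stub are illustrative assumptions, not the tool's actual rules — the real classifier uses full dependency parsing in phase 3.

```python
import re

# Illustrative patterns only -- the real classifier's rule set is larger.
PROHIBITION = re.compile(r"\b(never|do not|don't|avoid|must not)\b", re.I)
MODAL_COMMAND = re.compile(r"^(always\s+)?(use|run|prefer|format|write|add)\b|\b(must|should|always)\b", re.I)

def classify_charge(atom: str) -> str:
    """Three-phase charge classification; first definitive match wins."""
    # Phase 1: negation / prohibition patterns -> constraint
    if PROHIBITION.search(atom):
        return "constraint"
    # Phase 2: modal auxiliaries and direct commands -> directive
    if MODAL_COMMAND.search(atom):
        return "directive"
    # Phase 3: catch imperatives the first two phases missed
    # (stubbed -- the real tool uses syntactic dependency parsing here)
    if looks_imperative(atom):
        return "directive"
    # Atoms that partially match but clear no phase would be "ambiguous";
    # everything else is neutral content.
    return "neutral"

def looks_imperative(atom: str) -> bool:
    # Crude stand-in: sentence starts with a bare verb-like token.
    first = atom.strip().split(" ")[0].lower() if atom.strip() else ""
    return first in {"keep", "ensure", "document", "test"}
```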
Specificity. Binary: does the instruction name a specific construct — a tool, file, command, flag, function, or config key — or does it stay at the category level? "Use consistent formatting" is abstract. "Format with ruff format" is named. This is a text property, not a judgment call.
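A sketch of what such a binary check could look like. The surface signals here (backticked identifiers, CLI flags, file paths, known tool names) are my guesses at the kind of features a detector like this keys on, not reporails' actual rule set:

```python
import re

# Hypothetical surface signals for "names a specific construct".
NAMED_SIGNALS = [
    re.compile(r"`[^`]+`"),                                        # backticked identifier
    re.compile(r"\B--?[a-z][\w-]*"),                               # CLI flag like --fix or -v
    re.compile(r"\b[\w./-]+\.(md|py|ts|json|yml|yaml|toml)\b"),    # file path
    re.compile(r"\b(npx|npm|pip|uv|cargo|pytest|ruff|eslint)\b"),  # known tools
]

def is_named(instruction: str) -> bool:
    """True if the instruction references a concrete construct by name."""
    return any(p.search(instruction) for p in NAMED_SIGNALS)

assert is_named("Format with ruff format before committing")
assert not is_named("Use consistent code formatting")
```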
File categorization. Each file is classified as base config (your main CLAUDE.md or .cursorrules), a rule file, a skill definition, or a sub-agent definition — based on file path conventions for each agent.
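Path-convention categorization can be a small lookup. This sketch uses illustrative conventions; each agent's real path rules differ:

```python
from pathlib import PurePosixPath

def categorize(path: str) -> str:
    """Categorize an instruction file by path convention (illustrative subset)."""
    p = PurePosixPath(path.lower())
    parts = set(p.parts)
    if "skills" in parts:
        return "skill"
    if "agents" in parts and p.name != "agents.md":
        return "sub-agent"
    if "rules" in parts or p.suffix == ".mdc":
        return "rule"
    return "base"  # CLAUDE.md, AGENTS.md, .cursorrules, etc.

print(categorize("CLAUDE.md"))                             # base
print(categorize(".cursor/rules/style.mdc"))               # rule
print(categorize(".claude/skills/pest-testing/skill.md"))  # skill
print(categorize(".claude/agents/code-reviewer.md"))       # sub-agent
```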
Content type. Charge classification separates behavioral content (directives and constraints) from structural content (headings, context paragraphs, examples). That's how we know what fraction of your file is actually doing work.
The full tool is source-available (BUSL-1.1). You can run npx @reporails/cli check on your own project and inspect every finding. More on that at the end.
Finding 1: Most of your instruction file isn't instructions
Here's what the median instruction file actually contains:
- 50 content items total
- 12 of those are actual directives
- The rest is headings, context paragraphs, examples, structure
Only 27% of your instruction file is doing what you think it does.
The other 73% is scaffolding. Headings that organize but don't instruct. Explanation paragraphs that compete for the model's attention without adding behavioral weight. Example blocks. Context-setting prose.
That's not inherently bad. Structure matters. But if you're writing a 200-line CLAUDE.md and only 54 lines are actual instructions, you should probably know that.
The average instruction is 8.9 words long. That's a sentence fragment.
Finding 2: 90% of configs contain instructions that don't name what they mean
This is the big one.
We measured whether each instruction references specific tools, files, commands, or constructs by name — or whether it stays at the category level.
Two-thirds of all instructions are abstract.
| Agent | Names specific constructs | Uses category language |
|---|---|---|
| Gemini | 39.3% | 60.7% |
| Codex | 38.3% | 61.7% |
| Copilot | 33.3% | 66.7% |
| Cursor | 30.8% | 69.2% |
| Claude | 30.6% | 69.4% |
What does this look like in practice?
Abstract: "Use consistent code formatting"
Specific: "Format with ruff format before committing"
Abstract: "Avoid using mocks in tests"
Specific: "Do not use unittest.mock — use the real database via test_db fixture"
In previous controlled experiments, specificity produced a 10.9x odds ratio in compliance (N=1000, p<10⁻³⁰). The instruction that names the exact construct gets followed. The one that describes it abstractly... mostly doesn't. This is consistent with independent findings from RuleArena (Zhou et al., ACL 2025), where LLMs struggled systematically with complex rule-following tasks — even strong models fail when the rules themselves are ambiguous or underspecified.
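For readers less familiar with the metric: the odds ratio compares the odds of compliance for specific instructions against the odds for abstract ones. A toy calculation with deliberately made-up counts (not the experiment's actual data) shows the arithmetic:

```python
# Toy 2x2 table with MADE-UP counts -- illustrates the metric,
# not the actual N=1000 experiment data.
named_followed, named_ignored = 450, 50         # specific instructions
abstract_followed, abstract_ignored = 225, 275  # abstract instructions

odds_named = named_followed / named_ignored           # 9.0
odds_abstract = abstract_followed / abstract_ignored  # ~0.82
print(f"odds ratio: {odds_named / odds_abstract:.1f}x")  # ~11.0x
```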
89.9% of all agent configurations contain at least one instruction that doesn't name what it means. It's not a few projects. It's nearly everyone.
Finding 3: agents.md is the most common instruction file
Before we get into quality, let's look at what people are actually naming their files:
| # | File | Count |
|---|---|---|
| 1 | `agents.md` / `AGENTS.md` | 20,654 |
| 2 | `claude.md` / `CLAUDE.md` | 14,014 |
| 3 | `gemini.md` / `GEMINI.md` | 5,703 |
| 4 | `.github/copilot-instructions.md` | 5,647 |
| 5 | `.cursorrules` | 2,415 |
49,071 unique file paths across the corpus. That's not a typo. The format fragmentation is real.
A few things jumped out:
- `claude.md` (lowercase, 10,642) is 3x more common than `CLAUDE.md` (3,372). Both work. The community clearly prefers lowercase.
- `agents.md` dominates — the Codex/generic format is the single most popular instruction file name.
- Skills and rules are already showing up in meaningful numbers: `.claude/rules/testing.md` (422), `.agents/skills/tailwindcss-development/skill.md` (334).
Finding 4: Different agents, completely different config philosophies
Not all agents are configured the same way. Not even close.
We categorized every file into four types: base config (your main CLAUDE.md, .cursorrules, etc.), rules (scoped rule files), skills (task-specific skill definitions), and sub-agents (role-based agent definitions).
| Agent | Base | Rules | Skills | Sub-agents | Total files |
|---|---|---|---|---|---|
| Claude | 18,733 | 4,638 | 10,692 | 10,538 | 44,601 |
| Cursor | 5,903 | 19,843 | 6,237 | 1,716 | 33,699 |
| Copilot | 16,026 | 4,486 | 10,352 | 3,012 | 33,876 |
| Codex | 19,001 | 81 | 8,911 | 165 | 28,158 |
| Gemini | 10,253 | 74 | 3,039 | 53 | 13,419 |
Cursor is 60% rules files. The .cursor/rules/ system dominates its configuration surface. One agent's config looks nothing like another's.
Claude is the only agent with a roughly balanced architecture across all four config types. Codex and Gemini are almost entirely base config — single-file setups.
The median Cursor project has 3 instruction files. The median Codex project has 1. These aren't just different tools. They're different configuration philosophies.
Finding 5: 37% of projects configure multiple agents
10,620 projects in the corpus target two or more agents. That's not a niche pattern — it's over a third of all projects.
| Agents | Projects |
|---|---|
| 1 | 18,101 |
| 2 | 6,776 |
| 3 | 2,687 |
| 4 | 949 |
| 5 | 208 |
The dominant pair is Claude + Codex (5,038 projects). Makes sense — CLAUDE.md + AGENTS.md is the most natural multi-agent starting point.
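The pair counts are reproducible from the published dataset. A minimal sketch, assuming repos.jsonl has been downloaded locally and that canonical_agents (the field used in the verification snippet at the end of this article) is a list per record:

```python
import json
from collections import Counter
from itertools import combinations

# Count agent co-occurrence pairs across the corpus.
pairs = Counter()
with open("repos.jsonl") as f:
    for line in f:
        agents = sorted(json.loads(line)["canonical_agents"])
        pairs.update(combinations(agents, 2))

for pair, count in pairs.most_common(3):
    print(pair, count)
```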
Here's what's interesting about multi-agent repos: the same developer, writing instructions at the same time, for the same project, produces measurably different instruction quality across agents. The person didn't change. The project didn't change. The instruction format did.
Some of that is structural. Cursor's .mdc rules enforce a different format than Claude's markdown. Codex's AGENTS.md invites a different writing style than Copilot's copilot-instructions.md. The format shapes the content.
Finding 6: The most-copied skills are the vaguest
This is where it gets interesting.
13,309 unique skills across the corpus. Some of them appear in hundreds of repos — clearly copied from shared templates or community sources. So we measured them.
Named% = what fraction of a skill's instructions name a specific tool, file, or command (instead of using category language).
| Skill | Repos | Named% | What it means |
|---|---|---|---|
| `frontend-design` | 271 | 2.8% | Almost entirely abstract advice |
| `web-design-guidelines` | 197 | 10.2% | Generic design principles |
| `vercel-react-best-practices` | 315 | 30.7% | Mix of specific and vague |
| `pest-testing` | 216 | 55.1% | Names actual test constructs |
| `livewire-development` | 87 | 75.5% | Names specific Livewire components |
| `next-best-practices` | 76 | 92.6% | Names almost everything |
`frontend-design` is in 271 repos with 2.8% specificity. It's a wall of "follow responsive design principles" and "ensure accessibility compliance." That reads well. It sounds professional. It gives the model almost nothing concrete to act on.
`next-best-practices` is in 76 repos with 92.6% specificity. It says things like "use next/image for all images" and "prefer server components over client." It reads like a checklist. It tells the model exactly what to do.
One is shared 3.5x more than the other.
The most popular skills are the most decorative. The well-written ones barely spread.
The best and worst skills (>50 repos)
Most specific:
| Skill | Repos | Named% |
|---|---|---|
| `next-best-practices` | 76 | 92.6% |
| `shadcn` | 74 | 82.6% |
| `livewire-development` | 87 | 75.5% |
| `pest-testing` | 216 | 55.1% |
| `laravel-best-practices` | 94 | 49.7% |
Most vague:
| Skill | Repos | Named% |
|---|---|---|
| `openspec-explore` | 110 | 2.5% |
| `frontend-design` | 271 | 2.8% |
| `web-design-guidelines` | 197 | 10.2% |
| `vercel-composition-patterns` | 131 | 10.7% |
| `find-skills` | 113 | 18.9% |
Notice a pattern? The Laravel/Livewire ecosystem produces specific skills. The generic frontend/design ones stay abstract. Domain-specific communities write better instructions than cross-cutting ones.
Finding 7: Sub-agents are almost entirely persona prompts
5,526 unique sub-agent roles in the corpus. Developers are building agent teams: code reviewers, architects, debuggers, testers, security auditors.
The problem? Sub-agents are the most abstract config type in the entire corpus. Only 17% of sub-agent instructions name specific constructs.
| Role | Repos | Named% |
|---|---|---|
| `code-reviewer.md` | 236 | 14.4% |
| `architect.md` | 89 | 18.2% |
| `debugger.md` | 66 | 9.4% |
| `security-auditor.md` | 57 | 14.8% |
| `test-runner.md` | 54 | 10.5% |
| `frontend-developer.md` | 47 | 9.0% |
Most of these are persona prompts. "You are a senior code reviewer. You care about code quality, security, and maintainability." That's a role description, not an instruction set. It tells the model who to be, not what to do.
Compare this to a base config that says "run `uv run pytest tests/ -v` before suggesting any commit" — that's 100% named, and the model knows exactly what action to take.
The anatomy chart: more directives, worse quality
Here's where it all comes together.
We measured three things for each config type: how big the files are, how many directives they contain, and what fraction of those directives actually name something specific.
Sub-agents have the largest files (61 items median), the most directives (17), and the worst specificity (17%). They're the wordiest config type in the corpus and the least effective.
Base configs are the opposite. Fewer directives (11), but 40% of them name specific constructs. The developer writing their own CLAUDE.md by hand, for their own project, produces the most actionable instructions.
| Config type | Files | Median size | Median directives | Specificity |
|---|---|---|---|---|
| Base configs | 69,916 | 50 items | 11 | 39.8% |
| Rules files | 29,122 | 34 items | 9 | 31.2% |
| Skills | 39,231 | 59 items | 14 | 30.8% |
| Sub-agents | 15,484 | 61 items | 17 | 17.0% |
The pattern is clear: what developers write by hand is the most specific. What gets templated and shared gets progressively vaguer. And what tries hardest to sound authoritative — sub-agent persona prompts — is the most hollow.
More instructions is not better instructions.
Independent research supports the structural angle: FlowBench (Xiao et al., 2024) found that presenting workflow knowledge in structured formats (flowcharts, numbered steps) improved LLM agent planning by 5-6 percentage points over prose — across GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo. Structure is not decoration. It changes what the model retrieves.
Limitations
Five things to know about these numbers.
Sampling bias. GitHub API search, public repos only, English-skewed. Enterprise configurations, private repos, and non-English projects are not represented. This is not a random sample of all instruction files in production.
Classification accuracy. The charge classifier is deterministic but not perfect. Edge cases exist: mixed-charge sentences, implicit constructs, domain jargon that looks like a category term but is actually a named tool. Specificity detection (named vs abstract) is simpler and more robust. Sample classifications are published for inspection.
Association, not causation. "More directives correlate with lower specificity" is an observed pattern. We do not claim that adding directives causes quality to drop.
Snapshot. Collected March–April 2026. Instruction practices are changing fast — agents.md didn't exist six months ago. These numbers describe the ecosystem at collection time.
No popularity weighting. A 10-star hobby project counts the same as a 50K-star production repo. The distribution of instruction quality in production agent work may differ.
What this means
This isn't an article about AI models being bad at following instructions. The models are fine.
This is an article about what we actually give them to work with.
Most instruction files are three-quarters scaffolding. Two-thirds of the actual instructions don't name what they're talking about. The most popular community skills are the most decorative. Sub-agent definitions are the wordiest files in the corpus and the least specific.
None of that is obvious from reading your own files. It wasn't obvious to us before we measured it. A well-structured CLAUDE.md feels thorough. A shared skill with 271 repos feels battle-tested. A sub-agent with 17 directives feels comprehensive.
Measurement shows something different.
In The Undiagnosed Input Problem, I argued that the industry is great at inspecting outputs and weak at inspecting inputs. This corpus analysis is the evidence for that claim.
The instruction files are there. The developers wrote them. They just have no way to know which parts are working and which parts are wallpaper.
Try it yourself
The analyzer we used for this corpus analysis is available as a CLI you can run against your own instruction files.
Reporails — instruction diagnostics for coding agents. Deterministic. No LLM-as-judge. 97 rules across structure, content, efficiency, maintenance, and governance.
```
npx @reporails/cli check
```
That scans your project, detects which agents are configured, and reports findings with specific line numbers and rule IDs. Here's what the output looks like:
```
Reporails — Diagnostics

┌─ Main (1)
│  CLAUDE.md
│  ⚠ Missing directory layout                     CORE:C:0035
│  ⚠ L9 7 of 7 instruction(s) lack reinfor…       CORE:C:0053
│  ... and 16 more
│
└─ 21 findings

Score: 7.9 / 10  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░
21 findings · 4 warnings · 1 info
Compliance: HIGH
```
The corpus analysis used the same classification pipeline at scale. Fix the findings, run again, watch your score improve.
The dataset
The full corpus is published at reporails/30k-corpus. Three files:
| File | Records | What it contains |
|---|---|---|
| `repos.jsonl` | 28,721 | Per-project record: agents configured, stars, language, license, topics |
| `stats_public.json` | 1 | Every aggregate statistic in this article |
| `validation_key.csv` | 2,814 | Sample classifications with source text for inspection |
Verify any claim:
# "28,721 repositories"
cat repos.jsonl | wc -l
# "43% Claude"
cat repos.jsonl | python3 -c "
import sys, json
repos = [json.loads(l) for l in sys.stdin]
claude = sum(1 for r in repos if 'claude' in r['canonical_agents'])
print(f'{claude}/{len(repos)} = {claude/len(repos)*100:.1f}%')
"
Every number in every table traces to that dataset. If you disagree with a finding, count the rows.
This is part of the Instruction Quality series. Previous: The Undiagnosed Input Problem. Related: Precision Beats Clarity · Do Not Think of a Pink Elephant · 7 Formatting Rules for the Machine.






Top comments (20)
"Guidance AND boundaries together" is the right framing — the instruction sets I've seen fail are almost always missing one or the other: dense rules with no enforcement ceiling, or hard stops with no context for why.
The pink elephant article landed. The specificity finding tracks exactly with something we observed: our most violation-prone rule was "never bulk-modify strategy configs" (pure prohibition). Rewrote it as "any strategy change writes prior values + rollback instructions to memory/handoffs/ before editing, and touches the minimal surgical set of fields with per-strategy rationale." Near-zero violations since. The affirmative action gives Claude somewhere to go rather than just a wall to avoid.
Will run the diagnostics authenticated via `ails auth login` and share the full output here. The system has 21 explicit rules across a few categories — governance rules (identity, escalation criteria, scope discipline) vs. operational rules (ports, file paths, rollback procedures). Curious whether ails surfaces that distinction or scores on a uniform rubric, since I'd expect the specificity grades to diverge significantly between those two categories.

Ran `ails check` on our CLAUDE.md (no auth). It crashes mid-run on Windows: `SIGALRM`/`ITIMER_REAL` is POSIX-only — Windows doesn't have it. The deterministic checks phase crashes before any findings are reported. The `_scan_file` timeout handler needs a Windows-compatible fallback (`threading.Timer` or similar).

Happy to share the full traceback if useful for the issue tracker. Might be worth a Windows CI run if you don't have one already.
oh wow, nice catch!
I'm currently working on the 0.5.5 release and will add the fix there (it will be released sometime today).
May I ask you to create a ticket for this here: github.com/reporails/cli/issues
Thank you again for catching this!
Ticket filed: github.com/reporails/cli/issues/17 — includes full traceback, threading.Timer fix pattern, and a one-line hasattr guard as minimal fallback. Happy to test the 0.5.5 build on Windows when it's ready.
Thanks again for the ticket. 0.5.5 has been released together with the fix. Run `ails update`.

The 73% scaffolding finding maps cleanly to what I've observed running a 24/7 autonomous system where CLAUDE.md governs every wake cycle. Two multipliers on specificity the corpus probably understates:
1. Pair the named directive with an enforcement hook. "run tests before commit" at 40% named specificity still hits ~90% compliance in a warm session, but add a pre-commit hook that blocks the commit if the test command did not run and you get 100%. The hook is not replacing the instruction, it is ratifying it at a boundary the model cannot skip (see the sketch after this list).
2. Re-frame "never" rules as pre-action tripwires in the base config. "Never touch the :5051 service" gets forgotten under cognitive load. "Before any edit under /services/trading/, verify 127.0.0.1:5051/health == 200 and write prior values to memory/handoffs/" holds because the model has a concrete gate before the dangerous action, not a prohibition to remember.
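A minimal sketch of that enforcement-hook pattern as a Python pre-commit hook. The `.test-ran` marker convention and the ten-minute freshness window are hypothetical; the test command reuses the `uv run pytest tests/` example from the article:

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit -- sketch of "the hook ratifies the instruction".
# Hypothetical convention: a successful test run touches .test-ran.
import os
import subprocess
import sys
import time

MARKER = ".test-ran"
MAX_AGE_SECONDS = 600  # tests must have passed within the last 10 minutes

stale = (not os.path.exists(MARKER)
         or time.time() - os.path.getmtime(MARKER) > MAX_AGE_SECONDS)
if stale:
    # Boundary the model cannot skip: re-run the suite before allowing commit.
    result = subprocess.run(["uv", "run", "pytest", "tests/", "-q"])
    if result.returncode != 0:
        sys.exit("pre-commit: tests failed; commit blocked")
    open(MARKER, "w").close()
```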
"Virality selects for vagueness" is the sharpest line in the piece. frontend-design travels because it is inseparable from no project in particular. The specific patterns do not travel because they are inseparable from the project that produced them, which is their whole point.
Curious what fraction of the 10,538 Claude sub-agent files in your corpus pair with any pre-commit or SessionEnd hook. Suspect close to zero, and that gap might explain more of the 17% sub-agent specificity than the persona framing does. Going to run the CLI against my own setup this week.
"The hook is ratifying the boundary" -> exactly. That's actually on the roadmap of the CLI. The instruction systems need guidance AND boundaries together.
Regarding the "never" rules, I've run several experiments to see what exactly would yield the best results. I wrote an article about it that I think you'll like: Do NOT think of a pink elephant
I'm curious what will be the output of the diagnostics on your system, please do share it once you have it.
Also a note here: for your use-case I'd recommend getting authenticated first (`ails auth login` -> it authenticates you via GitHub SSO), because you'll see a much more detailed result.

The pink elephant framing nails exactly what we observed. "Never bulk-modify strategy configs" had persistent violations. Rewrote it as: "any change writes prior values and rollback instructions to memory/handoffs/ first, then touches the minimal surgical set of fields with per-strategy rationale." Near-zero violations since. The model has a concrete gate to execute rather than a concept to avoid.
One question on the experimental data: at what specificity threshold does the affirmative-framing benefit collapse back toward vagueness? "Write a rollback file" is affirmative but still abstract. Does naming the exact path and format push compliance further, or does the affirmative structure do most of the work regardless?
Running `ails auth login` this block. The architecture has 21 rules split across governance (identity, escalation criteria, scope) vs operational (ports, file paths, rollback procedures). Hypothesis is the operational rules grade higher on specificity since they name exact files and services; the governance rules probably score closer to sub-agent territory. Worth knowing whether the classifier surfaces that distinction.
folk knowledge is real here. teams copy nearly identical CLAUDE.md templates but get wildly different agent behavior - same file, different context, different compliance. the gap is almost always ambiguous directives, not the model.
Can confirm, and the data shows why it keeps happening. The most-copied instruction sets in the corpus have the lowest specificity (some under 3% named constructs).
They spread because they sound applicable to everything. The well-designed ones are domain-specific by nature, so they stay local. Virality selects for vagueness.
that tracks with how docs spread generally - the too-specific ones don't travel. the irony is the vague universal ones are exactly the ones that silently fail you in real context
Great piece - thanks for digging into this. I appreciate how you highlight the gap between model capability and the quality of instructions we give them; that distinction is easy to miss but crucial. Your examples about ambiguous prompts and inconsistent evaluation really drove home how much the output depends on the instruction design and the feedback loop we build.
A couple of quick thoughts:
Curious what you think about tooling that captures and reuses high‑quality instruction patterns across teams — could that be the missing piece for scaling instruction quality?
thank you!
... and spot on with all three points. The standardized templates idea is right, the research on 28k repos showed that the structure of instructions (consistent formatting, placement, explicit constraints, clear scope etc.) matters more than clever wording. The variance reduction you'd get from templates is measurable and that's extremely valuable.
Actually that's the mission of reporails. Measuring instruction quality and behavior compliance mechanically across harnesses and providing a tool to guardrail the instruction system. To solve it, a GitHub action was also added to the CLI, so the quality of the instruction system can be enforced on CI level, without locking down what teams can actually write (leaving breathing space for innovation).
ps.: Regarding the better HITL evaluation - today a client of mine asked me to look into his Laravel website. I had never worked with Laravel, but I know a bit of PHP, so I cloned the project, started Claude Code (the project had no instruction files) and started the init procedure with one caveat: I gave the instruction to keep running the `ails check` command after the init is done, until the score reaches at least 8.5. The rest was automatic. After around 5-6 minutes I could start working on the project and get things done fast and accurately.

This connects directly to the task agent vs reasoning agent distinction.
For TASK agents (well-defined, deterministic workflows), instruction quality barely matters — give it clear inputs and it executes. The instructions are basically a function signature.
For REASONING agents (open-ended, multi-step problems), instruction quality IS the product. The prompt isn't just telling the model what to do — it's shaping its entire reasoning process.
The problem: most people are writing 'task agent prompts' for reasoning agents and wondering why the output is mediocre.
What we need is an 'instruction quality ladder':
Level 1 - Explicit commands ('Summarize this')
Level 2 - Context-aware prompts ('Summarize this for a technical audience')
Level 3 - Reasoning frameworks ('First identify the key claims, then evaluate evidence weight, then synthesize...')
Level 4 - Meta-instruction ('Here's how to think about this type of problem generally')
Most LLM users are stuck at Level 1-2. The jump from 2→3 is where you see 10x quality improvements.
Great analysis on this.
Thank you!
I do like your quality ladder idea; the L2 -> L3 jump you're describing maps to an interesting thing we measured: structuring instructions as directive + positive reasoning + constraint alone produces a 26pp compliance improvement (N=7500, replicated 3x). The intuition about "reasoning frameworks" is also right; the mechanism is concept-activation sequencing in the residual stream. The data is in the experiments repo if you want to dig in.