The latest discourse I hear usually sounds something like, "I tried [insert agent flavor of the week] and it gave me garbage. AI is overrated."
My response: "No. You asked your mechanic to build a house and forgot to provide blueprints." š¦
The agent isn't the problemāthe setup is. Here's the workflow that actually works. None of it is clever and all of it took me longer to learn than I'd care to admit.
1. Pick the model that fits the task. Specs beat vibes. šŖ
Haiku is a sprinter. It'll absolutely take a swing at your distributed system architectureāthe answer just won't be one you can ship. Your job is to match the model to the work.
If the problem is well-definedāclear specs, acceptance criteria, edge cases enumeratedāSonnet handles it fine. You'll spend more time in review, but you'll save real money. You'll also catch your own bad specs faster, which is its own gift.
If the feature is a tangled mess and you can't (or won't) break it down, that's also fine. Hand the whole thing to Opus instead. You don't have to scope every subproblem, but you DO have to define the whole solution. "Make it work" is not a valid requirementāit's a desperate wish the agent will not understand.
A cheap model with great specs beats an expensive model with vibes and feelings, every single time.
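As a rough sketch of how that matching plays out day to day (the tiers below are illustrative, not a prescription):

```text
clear specs + acceptance criteria, bounded change   ->  cheaper model (Sonnet-class)
classification, lookups, boilerplate                ->  fastest model (Haiku-class)
tangled, cross-cutting work you won't decompose     ->  most capable model (Opus-class)
no defined outcome at all                           ->  no model; go write the spec first
```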
2. Plan in chat. Touch the codebase last.
I spend hours (many hours) talking through a problem before a single character lands in the codebase. AI is my rubber duck and research assistant with attitude; yes, I code that in, because annoying accolades distract me from the goal: a solid game plan.
The language? Does not matter. I can read them all (I probably won't). Package manager? I care even less; drop a Makefile in the root and the commands stay the same regardless. Timeline? Sometimes, but the answer is usually "yesterday." What does matter:
- Meaningful tech stack
- Desired outcome
- Acceptance criteria
- Test scenarios: positive, negative, error, edge, and the weird ones you've seen before
- Explicit non-goals (the things you are NOT building, so they don't get sneakily built anyway)
Skip these and start prompting with "build me a thing"? You will indeed get a thing. It just won't be your thing.
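For illustration only (the feature, stack, and numbers below are invented), a planning hand-off that covers that checklist might look something like this:

```markdown
## Feature: CSV import for contacts (hypothetical)
Stack: TypeScript, Node 20, Postgres
Outcome: authenticated users upload a CSV and see the imported rows at /contacts
Acceptance criteria:
- Files over 5 MB are rejected with a 413 and a human-readable error
- Duplicate emails are skipped and reported, never overwritten
Test scenarios: happy path, empty file, malformed row, duplicate-only file, 10k-row file
Non-goals: no XLSX support, no background job queue, no redesign of the contacts page
```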
3. One source of truth. Stop copying instructions.
AGENTS.md, copilot-instructions, CLAUDE.md, GEMINI.md: pick one. I use AGENTS.md as the source of truth, then drop one-line markdown links to it from the others. That gives you one file to manage instead of four.
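For example (the exact stub file names depend on which tools you run), each secondary file can be nothing but a pointer:

```markdown
<!-- CLAUDE.md, GEMINI.md, .github/copilot-instructions.md: the entire file -->
All project instructions live in [AGENTS.md](./AGENTS.md). Read and follow that file.
```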
If a rule is true everywhere (for you as the operator, or across an entire project), it doesn't belong in a skill. Skills get called when triggered. Instructions get loaded always. Know which one you actually need and use accordingly. I wrote another post dedicated solely to this concept, if you want a deeper dive.
The model should maintain AGENTS.md as it works; you do not need a separate MEMORY.md to muddy the waters. When it keeps violating the same rule, don't add another to the pile. Edit instead. Your agent knows exactly where it tripped if you ask, and it already knows how to fix it.
4. Write for the agent. Not the audience.
Left to its defaults, the model will write your instructions like a detailed onboarding doc. Section headers. Friendly intros. "This document outlines..." Polished prose for a human reader who is never supposed to show up.
Instructions load into context every turn. Every word costs tokens and burns clarity. So optimize for the actual audience: your agent.
Tell it explicitly:
- Edit for AI consumption only: no human-friendly framing, no narrative flow.
- Preserve every meaningful detail. Compress the prose, never drop the intent.
- Strip duplicates. If two rules say the same thing differently, merge them.
- Strip ambiguity. "Try to" and "consider" are noise; say what's required.
- Strip anything inferable from a reasonable code edit. If grep would answer it, cut it.
A polished onboarding doc is a tax on every prompt you ever send. Pay it once at write time, not every turn.
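A hedged before/after (the rule itself is invented) to show what that compression buys:

```markdown
<!-- Before: written for a human reader -->
This document outlines our error-handling conventions. When working in this
codebase, you should generally try to consider wrapping external calls so that
failures are surfaced to the user in a friendly way.

<!-- After: written for the agent -->
Wrap every external call. Map failures to typed errors. Never swallow exceptions.
```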
ProTip: These instructions should be a skill, because the agent only ever uses them when updating AGENTS.md.
5. Skills aren't magical. Explicitly call them.
Skills are designed to be auto-invoked, yes. In theory... or if the description matches the prompt closely enough and the planets align on a Tuesday. If you NEED a skill used, then name it explicitly in the prompt. Otherwise you're gambling.
And please stop installing every skill from the marketplace just because the name sounded interesting. If you don't know the exact name of it already, delete it (with a backup). Use a skill builder to document the workflows you actually run. Leave the rest alone. You load trash in, you get trash out.
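To make the contrast concrete (the skill name here is hypothetical), the entire difference is one explicit mention:

```text
Gambling:  "Clean up the release notes before we ship."
Explicit:  "Use the release-notes skill to clean up the release notes before we ship."
```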
6. Install MCPs locally. Globals tax every prompt.
Having 20 MCPs globally enabled is convenient for you and a context-pollution nightmare for your agent. Every connected MCP eats tokens just by existing.
The question is simple: do I use this everywhere, all the time? If yes, then global is accurate. If not (and the honest answer is usually not), then install it only in the five projects where it actually matters. Symlinks and absolute paths can handle the duplication. Just make sure the agent has access to the directory.
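As one possible shape (assuming a Claude Code-style project-level .mcp.json; the server name, package, and connection string are placeholders), a per-project config keeps the server out of every other repo's context:

```json
{
  "mcpServers": {
    "postgres-dev": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/devdb"]
    }
  }
}
```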
7. Don't review. Test. Then test again.
I stopped reviewing AI-written code line by line. I was doing it badly, doing it slowly, and my eyes glazed over by the third file. The answer is to test it: extensively, often, and the moment it stops spinning. Not three days later when you open a PR.
Unit. Integration. E2E. Performance. A11y (accessibility). Sonar. Semgrep. Et cetera. Then automate and run with GitHub Actions. Make the model cover positive paths, negative paths, error paths, edge cases, and the acceptance criteria you defined back in the planning phase. (You did define them, right?) Add in anything you uncover during testing explicitly, so it doesn't happen again.
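As a sketch of that automation step (the job layout and npm scripts are placeholders for whatever your stack actually runs), a single workflow can gate every push on the layered suite:

```yaml
# .github/workflows/ci.yml: run the layered test suite on every push and pull request
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npm run test:unit         # positive, negative, error, and edge paths
      - run: npm run test:integration
      - run: npm run test:e2e
      - run: npm run lint              # add Semgrep/Sonar as separate jobs if you use them
```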
Edited: Thanks to @txdesk for calling out that automated tests are not enough. My testing always includes manual verification for whatever I'm building. You need a manual validation loop, well outside the AI's control, in order to prove it works.
Then cross-check across models. Have Codex review Claude. Have Copilot review Codex. Each model has different blind spots and different obsessions; running them against each other in controlled doses IS the review. One LLM is a single point of failure. Three are a quorum.
8. Ban the shortcuts. Temporary is never temporary.
In my AGENTS.md files for personal projects: backwards compatibility is strictly forbidden. Quick fixes are forbidden. Temporary solutions are not a viable path at any point. If the model wants to slap on a band-aid, it has to defend that choice. It can't, because my rule says it can't.
Now keep in mind, this is a personal-project rule and is harsh for live production code. If you're running production daily with real users, then you should probably nix the "no backwards compatibility" rule. But for your own stuff? Stop letting the model leave you with technical debt it threw around your codebase like confetti.
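For what it's worth, that section of my AGENTS.md is just a few blunt lines; here's the general shape (wording illustrative, tune the severity to your project):

```markdown
## Non-negotiables
- Backwards compatibility is strictly forbidden; migrate callers instead of shimming them.
- No quick fixes, no TODO-later patches, no "temporary" solutions.
- If you believe a band-aid is genuinely required, stop and ask; do not implement it.
```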
9. Clear the context. Don't iterate on broken.
If you've told the model the same thing three times and it's still wrong, then assume your conversation is poisoned. Too much wrong-direction is already baked in. Open a new chat. Start fresh with what you've learned.
A clean context with a sharper prompt beats six more rounds of "NO! I already said..."
10. The lesson. It was never the agent.
The agent is fine. The tooling is fine. What's not fine is treating a multi-thousand-dollar reasoning system like a Magic 8-Ball: shaking it harder every time the answer comes back wrong, hoping round fifteen is the one. It won't be.
Pick the right model. Plan first. One source of truth. Test ruthlessly. Cross-check across models. Forbid the shortcuts. Clean up your skill folder and your MCPs. Clear the context when things go sideways and start over.
This setup? It works. Try it for yourself.
Behind the Curtain
I wrote this post. Claude helped with the structure pass and the snark calibration so I'm not an accidental asshole. The opinions, the rules, and the AGENTS.md philosophy are mine, hardened over a year of letting AI drive and ruthlessly analyzing all the crashes.
Top comments (74)
Multi-thousand should be multi-billion dollar system. If it was in the thousands I would just buy a system.
Jokes aside, a reasoning system with that much power and memory should have more common sense than it has. An LLM is not smart, it just has a lot of knowledge and connections.
A general agent is not much more than a prompt with more context in a loop. So it is a bit more accurate because it prompts an LLM multiple times and it can run tools to get context.
That doesn't make AI smart because it has no good judgement.
So for me AI is stupid. But that doesn't mean it isn't a good tool.
I agree with selecting the right model for the task; the problem is that you need to be able to code to wire in alternative models. That is vendor lock-in for the people who can't code.
Sure, you can use AI as part of your planning, but the plan is yours to own. I'd rather talk to a person because of the judgement problem with AI.
We all know why every AI provider has their own config file, more vendor lock-in.
While the idea of skills was to create less friction, they created more friction.
Why explicitly call skills? Just add a list of extra context files to the prompt. Then you can create the context file structure you prefer, not the one that skills force you to use.
Staying on the explicit path: instead of adding an MCP, most of the time it can be replaced with CLI commands.
Isn't that treating AI as a magic 8-ball?
True, multi-file reviews are mentally draining, but who takes the responsibility when things go wrong? Not AI.
While I agree the use of AI got better with agents, it is far from an intelligent tool. And that is not the users fault.
Thanks for the thoughtful post. I'll try to address all of your topics one at a time, too.
Agreed on CLI commands like git. Many MCPs often come documented well enough that the tool itself is an extra instruction. Thanks for the feedback!
Why are their agents then called Claude Code and Codex? These names give you the impression they are trained for coding while they connect to all-round models. The bulk of the knowledge is not in the agents.
The different LLMs are called by the skill or a custom-made subagent. The overseeing agent has some knowledge, but the main job is to handle the tasks until the done message appears.
That is a `while(true)` loop.
That sentence doesn't make much sense. If the model is not important, you could pick any model.
It looks like you didn't read that part well. I'm not mentioning CLI as a tool; I'm mentioning commands. So it is very specific.
Are you suggesting you look all day at agent output? That seems a waste of time.
What if there are multiple agents running in parallel? That would be mentally draining.
How do you know AI followed the guardrails without looking at the code?
You let AI test. You write the intent, but how are you sure AI generated the correct tests?
How are you sure different LLMs are going to detect tests with no value?
This feels a lot like a hype sentence. It could lead to "maybe your thought process is the bottleneck, let's use AI to make it faster," and it could end with you no longer being needed.
Even with the speed, maybe AI can be the bottleneck. Have you thought about that?
The main thing I want to communicate is that people matter as much as AI, even more in my opinion. The sentiment of the post from the title to the conclusion is looking down on people for not using the tool correctly. But there is no such thing as correct in a new field. We are all learning as things evolve. What can be true today can be wrong tomorrow.
Thanks for the insights! It's definitely not my intent to communicate that people do not matter. AI is a tool and people are the ones using it. While I agree that correctness evolves over time and agentic coding is a rapidly evolving field, that doesn't mean there's not a right and a wrong way to approach things today. These are just some of the things that I've found helpful in my day to day that I wanted to share.
Showing what works for you is good. But there are alternative ways to use the tool.
And because LLMs are trained differently, there is no single definitive answer.
I see the common approach more as a best practice.
I thought skills were great because of the discovery and contextual enhancement. But like you I discovered that to be sure the right information is added it is best to be explicit.
Basically skills don't deliver on their promise.
The part about clearing context instead of iterating on broken (point 9) feels like one of those things that's obvious in retrospect but surprisingly hard to actually do in the moment.
There's a sunk cost instinct that kicks in after you've spent twenty minutes refining a prompt. You've invested in that conversation. Starting fresh feels like throwing away progress, even when the "progress" is just six increasingly frustrated rounds of the model confidently missing the point.
I've started treating it like a compiler error threshold. If I've corrected the same thing twice and it's still veering off, I don't argue; I just kill the session and start over with whatever I learned about what didn't work. It's faster, but it also keeps me from slipping into a dynamic where I'm essentially debugging the model's reasoning in real time, which is a bottomless pit.
What I'm curious about is whether anyone's found a reliable signal for when a conversation is starting to go bad, before it's obviously poisoned. Sometimes it's not three wrong answers; sometimes it's the first answer being subtly misaligned in a way you dismiss because it's close enough. Those are the ones that compound quietly.
I do everything with a single prompt. It might not be the best use of tokens, but I find it works for me.
I use a planning agent to help refine what I want to do, and might iterate over that a few times.
Then I clear the context and give the agent one prompt.
If it is perfect, great! If it is almost perfect, then I'll make the final touches myself.
If it didn't get it right, then I'll explain what was wrong and ask it to help refine the original prompt.
I then undo the code changes, delete the context and fire the newly refined prompt again.
The reason I do this method is exactly because of what you describe, you've sunk effort into iterations and don't feel like starting again, but really you'll never win because somewhere at the start the AI misunderstood something and will never know how to get the code right.
I know I am just iterating in a different way, but I find it fixes the problems quicker, and frees up my time more to focus on something else (do something else whilst waiting for a big change, rather than sit watching the agent knowing I'm going to do another small iteration in a minute).
I can definitely see where this approach would come in handy, especially for complex tasks. Thanks for sharing!
A reliable signal for this sort of misguided direction would be a goldmine I have yet to discover. I can't pinpoint any specific thing that tells me when something starts to go sideways; it's in the pattern, when something like the wrong file gets edited, or something as small as the model taking too long to complete the job it's supposed to be doing. I usually start by restating the goal with explicit non-goals for what the outcome should look like, not by trying to fix the original prompt. Oftentimes I just didn't explain it well enough the first time, and that does a lot to fix it.
AI "is stupid" from conception because it has that marketing virus in it that says: Always give an answer (even if you hallucinate).
This is one reason I set up personal instructions giving it a specific goal to challenge bad ideas and research/ask if anything seems ambiguous or unclear. Some models are better than others with this, but it's usually enough to not counter the system instructions and still get real answers.
The positive AI brings is paid for with the user's energy. I, for example, get very tired after interacting with AI, because I need to be like on the battlefield, always on alert. And we all know what happens when you lose focus. You are "killed".
I'm the opposite: I love the battlefield. At least, I do when it's operating fairly and consistently. Knowing when to strike with preemptive "killing" is key.
Treating an LLM like a Magic 8-Ball is exactly why people get frustrated. The planning-first approach is a lifesaver.
hi
Others have already said similar things, but I love this tip. My favorite trick is just to have every model review every other model's work : )
Great post overall. Thanks for sharing it!
Thank you! Glad you enjoyed it. I usually run the models in a circle until they agree on the solution.
right??? I saw a guy who posted something about how AI refactored his entire codebase, rewrote features, etc., and in the end nothing worked. My question to him was: "What was your prompt? Let me see your prompt, mate."
the prompt? " please refactor this "
that's it.
Classic. Make no mistakes.
My approach here is to use one of the more "simple" models like Haiku, and really be the human in the loop. Sure it's not "pls fix", but you're getting a good understanding of what's going on, and you can spot a breaking change before it spits out 10k LOC.
But this isn't something a new vibe coder would do, at least not yet.
This is true if you're able to take the time to walk the LLM through the solution. The way I see things, though, speed to delivery will be expected to increase naturally as the cost of LLM use continues to rise. That's a whole other exponential problem, but even Sonnet has trouble delivering accurately without granular details.
And I'm positive that "refactoring" was exactly what was accomplished in the end, too.
I've always found that AI performance is a mirror of the system design. As this article suggests, if the setup is right, the AI becomes an extension of your professional personality rather than just a script runner.
Very true! Especially if you add in a couple of personality tweaks to the AI itself. Things become much more fun.
This resonates deeply, especially point #2 (plan in chat, touch the codebase last). I've been building AI-powered data tools at my startup and the biggest productivity gains came from forcing myself to do thorough planning in conversation before writing a single line of code. The temptation to just "start building" is real, but the cleanup cost is brutal.
The cross-model review tip (#7) is gold. Running Claude's output past Codex (and vice versa) catches blind spots neither model would catch solo. Treating one LLM as a single point of failure is exactly the right mental model.
Thanks for writing this up. Sharing it with my team today.
Thank you! I'm glad it's useful. I define a global user instruction that says something like, "Do not blindly agree with the user. Your job is to push back, especially on bad ideas." That helps a lot with the planning phase. Also, Codex is one of the best code reviewers out there!
That "push back" instruction is a game changer ā turns the model from a yes-man into an actual thought partner. I've been using a similar rule and it genuinely saved me from shipping a bad data schema last week. Also 100% on Codex for review. Running Claude's output past it catches edge cases neither model would surface on its own.
Agreed! Using Copilot reviews on top of them both surfaces even more.
The multi-model stack is exactly this: each model has different blind spots, so Claude + Codex + Copilot ends up covering complementary surface areas. Claude tends to reason well about ambiguous business logic; Codex catches low-level correctness issues; Copilot adds codebase context. Running them in sequence rather than picking one has been genuinely better in practice. Thanks for the great discussion!
You're very welcome. I've found the same thing from each of the models. Each has its own downsides, too. Claude, while great at implementation, will frequently overbuild things you do not need. GPT 5.5 is leaning this way, too. Both I end up reining in with "don't over-engineer simple solutions" sorts of instructions. Copilot does a much better job of staying aligned, but misses the big picture. So sometimes it helps to swap them out at an implementation level, too, though that requires a very well-defined set of stories to make it work.
The 'swap at implementation level' pattern makes a lot of sense; it's essentially treating each model like a specialist you route to based on task shape. What I've found is that Claude's overbuilding tendency correlates strongly with prompt ambiguity. Tight, bounded specs (story-level scope, like you mentioned) cut it down significantly. At 13F Insight we've internalized something similar: complex parsing tasks where thoroughness matters get routed one way, while simpler transformations go a different path. The real overhead isn't the swap itself, it's the handoff spec discipline you need to make it not feel chaotic.
The global instruction to push back is a smart architectural choice; you're essentially codifying the adversarial reviewer role that most devs skip. I've found domain-specific constraints work even better alongside it: for financial data pipelines (my domain at 13F Insight), I'll add something like 'flag whenever you're making assumptions about SEC filing formats or fiscal quarter boundaries.' Codex really shines when correctness criteria are unambiguous. For the fuzzier, context-heavy work, your push-back instruction is exactly the right layer.
The cross-model review idea is interesting. Treating one LLM as a single point of failure feels like the right mental model.
"A cheap model with great specs beats an expensive model with vibes and feelings" is the whole post in one line. I run this exact pattern in production. Haiku classifies intent and picks the tier in under 2 seconds. Simple queries ("what's the gas price on Base?") stay on Haiku. Transaction decoding routes to Sonnet. Complex questions like "simulate what happens to my Compound V3 position if ETH drops 20% and compute the exact repayment to reach HF 1.5" go to Opus. The router itself costs almost nothing and the expensive model only fires when the question needs it.
Point 7 is where I'd push back slightly. Testing is necessary but not sufficient. I had 87 green unit tests for blockchain security tools. Then I ran 4 curl commands against live mainnet and found three features were calling APIs that don't exist. The tests passed because the AI wrote mocks based on the same wrong assumptions I had. Unit tests prove your logic works. Smoke tests against real external systems prove your assumptions are real. Both matter. The mocks alone will fool you.
I should have probably expanded more on the testing section, which also includes manual validations. If I'm building a web page then I know it works because I opened it, used it, and ran metrics outside the control of AI. Thanks for the feedback!
Exactly. Manual validation against real systems is the part that closes the loop. The AI can write the test, run the test, and report the test passed. But opening the browser, hitting the endpoint, and checking the response with your own eyes is the step that catches the lies the test suite was too polite to surface. The tests are necessary. The manual check against reality is what makes them honest.