
Mykola Kondratiuk

I Graded My Agent Deployment Doc Against LangChain Interrupt - Here Are the 5 Gaps


I do a lot of deployment authorization docs. That's the PM version of what an SRE would call a launch checklist. It lists the agent, the scopes, the secrets it touches, the kill switch, the cost ceiling, the rollback path.

For the last two years, the audience for that doc has been exactly one team: ours. Security signs it. Compliance reviews it. Engineering builds against it. Nobody outside the company ever reads a line.

Today I pulled up the LangChain Interrupt Day 1 schedule and that audience doubled.

The thing that changed at 9:30 PT today

Harrison Chase keynoted Interrupt at 9:30 Pacific. The headline was tame on paper: a synthesis of what 1,000+ teams shipped in production over the past 12 months. The substance was less tame. Clay, Rippling, Workday, plus the long tail of teams running smaller agent fleets, surfaced concrete production patterns. The talk wasn't aspirational. It read like a postmortem of an entire industry's first serious year of agent deployment.

That synthesis is now public. Same week, SAP Sapphire closed with 200+ agents under a single stated design rule (governance first), with Claude as the reasoning layer and NVIDIA OpenShell as the execution wrapper. Two completely different sources. Same structural artifact: a public reference for what production-ready agent deployment looks like.

My internal doc has been graded against one rubric. As of today, it gets a second one whether I asked for it or not.

I sat down and graded mine

I gave myself 45 minutes. Three columns:

  • Pattern named in the public production literature this week
  • Our position: adopted, diverged, or gap
  • One sentence on why

I expected to find maybe 1 gap, possibly 2. I found 5.

Here they are, with what I'm doing about each one.

Gap 1: per-action cost ceiling, not per-month budget

Our cost guardrail was a monthly budget per agent. Easy to set, easy to forget, and it fires the alert about a week after the damage is done.

The production pattern that kept surfacing in the public synthesis is per-action ceiling with auto-pause. If a single tool invocation projects to cost more than $N (or more than $N over the rolling 60 seconds), the agent stops and pings a human.

Our fix is small: a wrapper around the LLM client that estimates token cost in advance, compares against a per-action ceiling defined in the agent's config, and routes to a pause queue when over. About 40 lines. The harder part was deciding the ceiling, which is now a PM call, not an SRE call.
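
Here's roughly the shape of that wrapper. This is a minimal sketch, not our production code: `CostGuard`, `check`, and `record` are hypothetical names, and it assumes you can project a call's cost up front (tokenizer count times your model's per-token price).

```python
import time
from collections import deque


class CostCeilingExceeded(Exception):
    """Projected spend would breach a ceiling; route to the pause queue."""


class CostGuard:
    """Pre-flight cost check for a single agent, driven by its config:
    a per-action ceiling plus a rolling-window ceiling (60s by default)."""

    def __init__(self, per_action_usd: float, window_usd: float,
                 window_seconds: float = 60.0):
        self.per_action_usd = per_action_usd
        self.window_usd = window_usd
        self.window_seconds = window_seconds
        self._spend = deque()  # (timestamp, cost) pairs inside the window

    def _window_spend(self) -> float:
        # Drop entries that have aged out of the rolling window.
        cutoff = time.monotonic() - self.window_seconds
        while self._spend and self._spend[0][0] < cutoff:
            self._spend.popleft()
        return sum(cost for _, cost in self._spend)

    def check(self, projected_usd: float) -> None:
        """Call before issuing the LLM request: raise instead of spending."""
        if projected_usd > self.per_action_usd:
            raise CostCeilingExceeded(
                f"projected ${projected_usd:.4f} exceeds per-action "
                f"ceiling ${self.per_action_usd:.4f}")
        if self._window_spend() + projected_usd > self.window_usd:
            raise CostCeilingExceeded("rolling-window ceiling would be breached")

    def record(self, actual_usd: float) -> None:
        """Call after the response returns, with the real metered cost."""
        self._spend.append((time.monotonic(), actual_usd))
```

The harness catches `CostCeilingExceeded`, parks the action in the pause queue, and pings a human. The projection step is the weak point; the comment at the bottom of this post gets into where it breaks.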

Gap 2: scoped credentials per agent, not per agent family

We had one service account per agent family (e.g., "research agents", "support agents"). Audit logs showed activity by family, not by individual agent. Fine for accounting. Bad for blast-radius reasoning.

The production pattern is one credential per logical agent instance, with the scope narrowed to the specific tables, endpoints, or namespaces that agent legitimately touches. If a single agent goes rogue, you revoke one credential without taking down the family.

This is a one-day migration in our system because the agent identity already exists in our config. We just weren't projecting it down into the credential layer.
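
To make "projecting it down" concrete, here's an illustrative sketch. It assumes your agent config already carries a per-instance ID and an explicit allowlist; `AgentIdentity` and `mint_credential` are made-up names, and the credential shape will depend on your secrets manager.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentIdentity:
    """One logical agent instance, as it already exists in our config."""
    agent_id: str                      # e.g. "research-07", not just "research"
    family: str                        # kept for cost accounting only
    allowed_tables: frozenset = field(default_factory=frozenset)
    allowed_endpoints: frozenset = field(default_factory=frozenset)


def mint_credential(identity: AgentIdentity) -> dict:
    """Project the identity down into the credential layer: one revocable
    credential per agent_id, scoped to exactly what that agent touches."""
    return {
        "subject": identity.agent_id,  # revoke this without touching the family
        "scopes": sorted(identity.allowed_tables | identity.allowed_endpoints),
        "labels": {"family": identity.family},  # audit logs key on subject now
    }
```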

Gap 3: the production review document does not yet exist

This one's about the artifact, not the runtime. Our internal review covers our policy: does this satisfy our compliance posture? It does not cover the production-floor reading: where do we sit relative to what 1,000+ teams already found works?

I'm adding a new section to every deployment doc going forward. Three subheads:

  • Adopted: patterns we took straight from the public practitioner record
  • Diverged: patterns we considered and chose against, with the reason
  • Gaps: patterns we don't have an answer for yet

The "Gaps" subhead is the high-leverage one. Gaps documented in your own voice are gaps you control the conversation about. Gaps surfaced by a stakeholder in a meeting are not.

Gap 4: agent-writes-its-own-retries is a flagged pattern

We let one agent class write its own retry policy when a tool call failed. It's an obvious productivity win until the agent invents a retry pattern that compounds against a rate-limited downstream service.

The published practitioner consensus this week was clear: retries belong in the agent harness, not in the agent's reasoning loop. The agent should not be the entity deciding when to try again.

Our fix is to replace the self-retry behavior with a queued reissue: the harness owns the policy, the agent owns the request. About a day's work, including writing the test cases. Most of the time was migrating the existing retry-policy state out of the prompt and into config.
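
For what "queued reissue" means in practice, here's a minimal sketch. It assumes tool failures are classified into retryable and non-retryable up front; every name here is illustrative, and the backoff parameters live in the agent's config, not the prompt.

```python
import queue
import random
import time


class TransientToolError(Exception):
    """A failure class the harness is allowed to reissue (e.g. 429, 503)."""


def run_with_reissue(tool_call, request, *, max_attempts=3, base_delay=1.0):
    """Harness-owned retries: the agent emits the request once; whether and
    when it gets reissued is this function's decision, driven by config."""
    reissue = queue.Queue()
    reissue.put((request, 1))  # (payload, attempt number)
    while True:
        payload, attempt = reissue.get()
        try:
            return tool_call(payload)
        except TransientToolError:
            if attempt >= max_attempts:
                raise  # escalate to a human; the agent never improvises here
            # Exponential backoff with jitter, kind to rate-limited services.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
            reissue.put((payload, attempt + 1))
```

The agent owns the request payload; everything after the first failure is the harness's problem, which is exactly the boundary the practitioner consensus drew.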

Gap 5: blast radius is named in our spec, but not in our review template

Anthropic's Deputy CISO ran a webinar Monday framing agent governance around "specific actions, scopes, and blast radii." That phrase is going to be the lingua franca of agent risk for at least the next 12 months.

We use blast radius informally in our deployment specs. We do not have a column for it in our review template. So the conversation we want to have (what's the worst this agent can do before someone catches it?) sometimes doesn't happen because the document doesn't prompt it.

The fix is a column. Each agent's row gets: "Blast radius at maximum scope, if every guardrail fails." One sentence. The act of writing the sentence is the audit.

What I'm shipping by Friday

  • The per-action cost wrapper, with one ceiling per agent class, tunable in config.
  • One credential per logical agent; the audit-log change merges with it.
  • A new section in the deployment doc template: Production-Floor Reading.
  • The retry policy migrated from the prompt into the harness.
  • A blast radius column on the review template, populated retroactively for every active agent.

None of these are large. The total work is something like 3-4 engineer-days across the team. The reason I wasn't doing them last week is not that they were hard. It's that I didn't have the second reader yet. The internal reader was satisfied. Without the production floor, none of these gaps surfaced as gaps.

That's the part worth saying out loud. The second reader is what made the gaps visible. The work was always small. The doc was the bottleneck, and the doc didn't know it was the bottleneck.

What I'd ask if you ship agents

If you ran the same 45-minute exercise on your deployment doc this afternoon, which gap would you find first - cost ceiling, credential scope, retries, blast radius, or the review template itself?

I'm collecting answers through Friday. The Day 2 Interrupt content will sharpen the production-floor reading, and I'd rather refine the template against five teams' gap rows than against my own.


Tags: projectmanagement, ai, agents, productivity, discuss

Top comments (1)

Mykola Kondratiuk

Honestly, the cost-ceiling fix in Gap 1 assumes you can estimate token cost in advance, which breaks the moment your agent loops on a multi-turn tool call with variable input length. Our wrapper double-counts on those paths and I don't have a clean answer yet - curious how teams handle the variable-length case.