measuring ai-assisted velocity without lying to yourself

Every engineering leader I talk to these days has the same question:

"Are our AI tools actually making us faster?"

And every single one of them has an answer they sort of believe but cannot quite prove.

The CTO points at the PR count chart going up. The team lead points at the growing backlog of half-finished AI-generated features. The IC points at the three refactors they had to do last week because an agent built the wrong abstraction.

Someone is right. Someone is wrong. And the data, as it exists today, helps nobody.

I have been thinking about why most "AI productivity" metrics are useless, and what actually works instead.

the trap everyone falls into first

The standard move is to measure output.

PRs per week. Commits per day. Lines of code changed. Cycle time from first commit to merge. Story points completed.

These numbers look great once people start using AI tools. Of course they do. An agent can open a PR in five minutes that would have taken a human an afternoon. Your PR count doubles. Your cycle time drops. Your chart looks like the team just discovered amphetamines.

The problem is not that the numbers are wrong. The problem is that they measure the cheap part.

Opening a PR is easy. Writing code that does not create hidden problems is the actual work.

When a junior engineer generates thirty PRs in a week and six get rolled back, twelve need significant fixes after review, three break something downstream, and only two are actually clean, "PRs per week" still makes the junior look like a hero. Any dashboard that rewards this is not measuring productivity. It is measuring the production of future work.

The same pattern applies to AI-generated code. The agent produces output fast. The human reviews it fast. The PR merges fast. And then, two weeks later, someone discovers the agent never considered the edge case that the original design document spent three paragraphs describing.

Nobody measured that cost. Nobody knows how.

velocity is not the same as throughput

This is where I think most engineering orgs get stuck.

Throughput is easy to measure. Velocity is not.

Throughput says "we shipped X features." Velocity says "we shipped X features and the system is still maintainable, the team is not burned out, and the next change will not be harder than this one."

AI tools clearly increase throughput. The question is whether they increase or decrease velocity.

The difference shows up in the hidden work:

  • How much time is spent fixing bugs introduced by AI-generated code?
  • How many PRs need significant rework after the agent is done?
  • How often does the agent produce code that passes tests but violates architectural conventions?
  • How many features are shipped with worse test coverage because the reviewer trusted the agent's output too much?
  • How much undocumented complexity lands in the codebase because nobody reads every line of an agent diff?

If you measure PR count, none of these show up.

If you measure cycle time, none of these show up.

If you measure story point completion, none of these show up.

The dashboard is lying and nobody installed the truth.

what to measure instead

I have been experimenting with a different set of metrics in the teams I work with. They are not perfect. But they surface the signal that gets buried by throughput numbers.

rework ratio

Track the percentage of code that is modified within 30 days of being written. High rework suggests the initial output was low quality, whether by human or agent.

Compare rework ratios between AI-assisted and non-AI-assisted changes. If the AI code has a significantly higher rework ratio, the "productivity gain" is a mirage. You are just moving the work from writing to fixing.

If the rework ratio is similar or lower, the AI is probably adding genuine leverage.
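Here is a minimal sketch of the comparison, assuming you already export per-change records from your git host with an AI-assisted flag and a count of lines later reworked. The `Change` shape below is illustrative, not any vendor's schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical per-change record. The ai_assisted flag and the
# reworked-lines count both have to come from your own tooling;
# neither is something git tracks for you.
@dataclass
class Change:
    sha: str
    authored_at: datetime
    ai_assisted: bool
    lines_added: int
    lines_reworked_within_30d: int  # lines from this change modified again

def rework_ratio(changes: list[Change], ai_assisted: bool) -> float:
    """Share of added lines that were modified again within 30 days."""
    pool = [c for c in changes if c.ai_assisted == ai_assisted]
    added = sum(c.lines_added for c in pool)
    reworked = sum(c.lines_reworked_within_30d for c in pool)
    return reworked / added if added else 0.0
```

Compare `rework_ratio(changes, True)` against `rework_ratio(changes, False)`. The gap between the two matters more than either absolute number.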

review depth ratio

Track how many review comments per line of changed code the team produces. If this number drops significantly after AI adoption, it may mean reviews are shallower, not that the code is better. Agents are convincing writers. They look correct. The reviewer needs to push harder.

If the review depth stays steady or increases, the team is maintaining healthy skepticism.
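If you are on GitHub, the raw inputs are already in the REST API: the pull request object exposes `review_comments`, `additions`, and `deletions`. A sketch, with owner, repo, and token as placeholders:

```python
import requests

def review_depth(owner: str, repo: str, number: int, token: str) -> float:
    """Review comments per changed line for a single PR."""
    pr = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    ).json()
    changed = pr["additions"] + pr["deletions"]
    return pr["review_comments"] / changed if changed else 0.0
```

Other hosts expose equivalents. Average this per quarter, split by whether the PR was AI-assisted, and watch the trend rather than the absolute value.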

incident attribution (with a grain of salt)

Track whether AI-generated changes are overrepresented in incident postmortems. Not as blame. As a hygiene signal.

If AI code causes 50% of incidents but represents only 30% of changes, that is a signal worth investigating. Maybe the agents need better constraints. Maybe the review process needs more guardrails. Maybe the agents should not touch certain parts of the codebase.
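A back-of-the-envelope way to express this is an overrepresentation index: the AI share of incidents divided by the AI share of changes. Anything meaningfully above 1.0 deserves a look. The counts are assumed to come from your postmortem tracker:

```python
def overrepresentation_index(ai_incidents: int, total_incidents: int,
                             ai_changes: int, total_changes: int) -> float:
    """>1.0 means AI changes appear in incidents more than their share predicts."""
    incident_share = ai_incidents / total_incidents
    change_share = ai_changes / total_changes
    return incident_share / change_share

# The example above: 50% of incidents from 30% of changes.
print(overrepresentation_index(5, 10, 30, 100))  # ~1.67
```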

feature completion vs. feature health

Track not just whether a feature shipped, but whether it needed significant repair in the first month.

A feature that ships in two hours, then requires three days of bug fixes and two rollbacks, is not a win. It is a time bomb with optimistic labeling.

If AI-assisted features require disproportionate post-ship maintenance, the velocity equation changes dramatically.
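One way to make this concrete, assuming you log build time and first-month repair time per feature (the field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    ai_assisted: bool
    build_hours: float
    repair_hours_30d: float  # bug fixes and rollbacks in the first month

def is_time_bomb(feature: Feature, threshold: float = 1.0) -> bool:
    """Flag features whose first-month repair cost rivals the build cost."""
    return feature.repair_hours_30d > threshold * feature.build_hours
```

The two-hour feature with three days of fixes scores around 12 on this ratio. A healthy feature should stay well under 1.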

context switching cost

Track how much time engineers spend switching between agent output and review.

If the pattern is "generate, review, generate more, review more, merge, fix," the context switching cost is real. Some teams report spending more time reviewing agent output than they would have spent writing the code themselves. That is not a productivity gain. That is management theater with extra steps.

what organizations get wrong about the numbers

There is a deeper problem here.

Most orgs want a single number. "Are we faster? Give me the percentage."

But AI-assisted velocity is not a ratio. It is a system property. It depends on:

  • the type of work being done (greenfield vs. maintenance vs. incident response)
  • the maturity of the codebase (well-factored vs. spaghetti)
  • the quality of the agent setup (tool access, context availability, guardrails)
  • the review culture (thorough vs. fast)
  • the deployment pipeline (automated tests + canary deploys vs. manual approvals)
  • the team's domain knowledge (agents cannot know what the team has not documented)

A single number flattens all of this. It passes judgment on a complex system based on one noisy signal.

The teams that get this right do not ask "are we faster?"

They ask:

  • "Are our agents producing less rework than last quarter?"
  • "Are our reviews staying thorough despite faster output?"
  • "Is the post-ship maintenance burden going down or up?"
  • "Are we catching agent mistakes before they reach production, or after?"

Those are actionable questions. The answers tell you where to invest next.

the danger of optimizing the wrong thing

Here is the part that makes me nervous.

Once organizations start measuring "AI velocity," they will inevitably optimize whatever the dashboard shows.

If the dashboard shows PR count, teams will generate more PRs, quality be damned. If it shows cycle time, they will merge faster, review be damned. If it shows feature count, they will ship more features, maintenance be damned.

This is not hypothetical. It is what every productivity measurement does. Goodhart's Law is older than AI.

The only way around this is to measure things that are harder to game.

  • Mean time to repair after a release? Hard to fake.
  • Percentage of code that survives 90 days without modification? Hard to fake.
  • Number of incidents caused by changes in the last quarter? Hard to fake.
  • Percentage of features that reach adoption targets without major rework? Hard to fake.

These metrics are not perfect. But they resist the incentive to make the number go up by making the system worse.
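The survival number in particular can be approximated from plain git. A rough sketch: for a commit at least 90 days old, compare the lines it added against the lines still attributed to it in a blame of HEAD (renames and binary files are ignored here for brevity):

```python
import subprocess

def lines_added(repo: str, sha: str) -> int:
    """Total lines the commit added, from --numstat (binary files show '-')."""
    out = subprocess.run(
        ["git", "-C", repo, "show", "--numstat", "--pretty=format:", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit():
            total += int(parts[0])
    return total

def lines_surviving(repo: str, sha: str, path: str) -> int:
    """Lines in HEAD's version of `path` still blamed on the commit.

    Pass the full 40-char sha: --line-porcelain starts each line's
    record with it, so a prefix match is unambiguous.
    """
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", "HEAD", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.startswith(sha))
```

Sum `lines_surviving` over the files the commit touched, divide by `lines_added`, and you have a survival rate that is genuinely painful to game.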

what a healthy setup looks like

The teams I have seen actually benefit from AI tools share a few patterns:

  1. They measure before and after, not just after. Baseline data makes the comparison honest.

  2. They separate throughput from quality in their dashboards. Two separate views. No averaging them together into a single "velocity" score.

  3. They review agent output as critically as junior engineer output. Maybe more critically, because agents are more confident and less likely to ask clarifying questions.

  4. They track maintenance burden explicitly. If the AI produces code that needs more bug fixes, they adjust their expectations and their tooling.

  5. They stop measuring things that lie. If PR count goes up but quality goes down, they drop PR count from their metrics and dig into what drives quality.

  6. They accept that some kinds of work do not benefit from AI assistance. Trying to measure "AI velocity" for incident debugging is silly. Trying to measure it for well-scoped greenfield features is useful. Different metrics for different contexts.

the real goal

I do not actually care whether the AI productivity number goes up or down.

I care whether the engineering organization is making better decisions about where to invest.

If the data says agents save more time than they cost, it is worth investing in better context, better guardrails, better integration with the platform.

If the data says agents create as much work as they save, it is worth investing in better prompts, better review processes, better tool selection.

If the data says agents are a net negative in certain domains, it is worth documenting that so nobody wastes time forcing a square peg into a round hole.

The value is not in the number. The value is in the decision the number enables.

So stop asking "are we faster?" and start asking "do we know what is happening?"

The first question produces a dashboard that lies.

The second question produces an organization that learns.
