
How to Review Code Your AI Agent Wrote While You Were Sleeping

This article was originally published on LucidShark Blog.


You come in Monday morning, open your terminal, and run git log. There are 47 commits from the weekend. Your AI agent was busy.

This scenario is no longer hypothetical. Agentic coding systems running overnight tasks, fixing issues from a backlog, refactoring modules, and implementing feature branches from spec files have become part of how serious engineering teams operate in 2026. The question is not whether your agent will write code while you sleep. The question is what you do with it when you wake up.

The answer most teams give is: they do a light pass, check that tests pass, and merge. This is a mistake.

Simon Willison put it clearly when he distinguished between throwaway code and production code. Vibe coding works fine when you are building a one-off script or prototyping something you will throw away. The danger is when that same relaxed posture carries over into production systems. Overnight agents are almost always writing production code. The review bar should match.

Why Overnight Agent Code Is Different from Live Agent Code

When you are coding interactively with an AI agent, you see the changes in real time. You notice when the agent goes sideways. You correct it mid-flight. The review is continuous and contextual.

Overnight agent code has none of these properties. The agent made dozens of decisions in sequence, each building on the last, without any human feedback loop. By the time you see the result, the context that led to each individual choice is gone. What you have is a compressed artifact of a long, unobserved reasoning chain.

This creates specific failure modes that do not appear in interactive work:

  • Cascading assumptions. The agent made a reasonable guess at step 3, and every subsequent step built on that guess. If the guess was wrong, the damage is not local. It is distributed across the entire changeset.
  • Silent scope creep. Agents tasked with "fix the auth bug" often also refactor the surrounding module, update type signatures, and touch files that were not in the original scope. The refactor might be sensible. It might also break something unrelated.
  • Plausible but incorrect logic. LLM-generated code is optimized for looking correct. It tends to pass syntax checks, follow conventions, and produce code that reads cleanly. Logic errors are harder to spot because the surrounding code is well-formed (see the example after this list).
  • Missing context for edge cases. The agent did not attend the meeting where you discussed the edge case in the payment flow. It does not know about the legacy customer segment that still uses the old API format. It will write code that is correct for the nominal case and wrong for the case that matters.
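
Here is a hypothetical illustration of the third failure mode: a snippet that is well-formed, conventionally named, and wrong. The function and the bug are invented for illustration.

// Hypothetical agent output: reads cleanly, passes type checks, and is wrong
function isWithinTrialPeriod(signupDate: Date, trialDays = 14): boolean {
  const elapsedMs = Date.now() - signupDate.getTime();
  // Bug: 1000 * 60 * 60 * 12 is twelve hours, not one day, so the trial
  // effectively lasts half as long as intended. The arithmetic looks
  // plausible enough to survive a quick skim.
  const elapsedDays = elapsedMs / (1000 * 60 * 60 * 12);
  return elapsedDays <= trialDays;
}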

The Overnight Review Checklist

Before you look at any code, run this command:

git log --oneline --since="yesterday" --author="agent" | wc -l

If the number is above 20, block off two hours. Seriously. Reviewing 47 agent commits in 20 minutes is not a review, it is a rubber stamp.

Step 1: Get the Diff in a Reviewable Form

Do not review commit by commit. Get the full aggregate diff since the agent started working:

git diff main...agent/overnight-batch-2026-05-06 --stat
git diff main...agent/overnight-batch-2026-05-06 -- '*.ts' '*.py' '*.go'

The --stat output tells you the scope immediately. If you see files you did not expect the agent to touch, that is your first red flag. Investigate those files first, not last.

Step 2: Check for Security-Sensitive Changes

Before reading any logic, scan for patterns that warrant immediate scrutiny:

# Look for authentication and authorization changes
git diff main...agent/overnight-batch-2026-05-06 | grep -E "(auth|token|secret|key|password|permission|role|session)" -i -A 5 -B 5

# Look for SQL and query construction
git diff main...agent/overnight-batch-2026-05-06 | grep -E "(query|execute|prepare|cursor\.)" -i -A 3 -B 3

# Look for file system operations
git diff main...agent/overnight-batch-2026-05-06 | grep -E "(readFile|writeFile|unlink|fs\.|open\(|Path\.join)" -A 3 -B 3

You are not doing a full security audit here. You are triaging where to spend your review time. Any diff that touches auth, SQL construction, or file system operations should get deep review before anything else.

Step 3: Look for the Agent's Reasoning Artifacts

Well-configured agents leave reasoning traces. Check commit messages carefully:

git log main..agent/overnight-batch-2026-05-06 --format="%H %s%n%b"

Good agent commit messages include the reasoning: "Fixed null check in payment handler because downstream consumers expected non-null user object per types.ts line 34." Bad agent commit messages say "fix bug" or "update code." If your agent is writing poor commit messages, fix the prompt before fixing the code.

The reasoning trace matters because it tells you what assumptions the agent made. A commit message that says "assumes legacy users always have billing.v2 flag set" is now something you can verify. Without that trace, you have no way to know the assumption existed.
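
One way to get this consistently is to specify it in the agent's configuration. A sketch of what that instruction might look like in an AGENTS.md file (the wording is illustrative, not a standard):

# Example AGENTS.md commit message rules
## Commit Messages
- State what changed, why, and which assumption or evidence motivated it
- When an assumption comes from code, reference the file and line
- Never write bare messages like "fix bug" or "update code"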

Step 4: Semantic Diff Review, Not Line-by-Line

Line-by-line diff review on agent code is a trap. You will spend time reading code that looks correct and miss the structural issue three files over. Do semantic review instead.

For each modified module, answer these questions:

  1. What did this module do before? What does it do now?
  2. What is the new surface area for bugs? (New branches, new error paths, new external calls)
  3. What invariants did the old code maintain that the new code might violate?

Here is a concrete example. Suppose the agent refactored a retry handler:

// Agent's version: looks correct (assumes a sleep(ms) backoff helper is in scope)
async function withRetry(fn: () => Promise<void>, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await fn();
      return;
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      await sleep(100 * Math.pow(2, attempt));
    }
  }
}

This looks fine. It implements exponential backoff and rethrows on the last attempt. But if the original code had a circuit breaker pattern, or tracked failure counts externally, this new implementation silently removes that behavior. The diff is clean. The semantic change is significant.
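
For contrast, here is a hypothetical sketch of what the original handler might have looked like. This is invented for illustration; the point is that none of the breaker state survives the rewrite, and nothing in the diff draws attention to its removal.

// Hypothetical original: failure tracking the refactor silently drops
let consecutiveFailures = 0;
const CIRCUIT_THRESHOLD = 5;

async function withRetry(fn: () => Promise<void>, maxAttempts = 3) {
  if (consecutiveFailures >= CIRCUIT_THRESHOLD) {
    throw new Error("Circuit open: too many consecutive failures");
  }
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await fn();
      consecutiveFailures = 0; // success closes the breaker
      return;
    } catch (err) {
      consecutiveFailures++;
      if (attempt === maxAttempts - 1) throw err;
      await sleep(100 * Math.pow(2, attempt)); // same assumed sleep(ms) helper
    }
  }
}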

Step 5: Test Coverage Gap Analysis

Run your test suite, but also check whether new code paths have coverage:

# For TypeScript projects using Jest
npx jest --coverage --coverageReporters=text-summary 2>&1 | tail -20

# Check which changed files lack test references (skip deleted files)
git diff --name-only --diff-filter=ACM main...agent/overnight-batch-2026-05-06 | xargs -I{} sh -c 'echo "=== {} ===" && grep -c "it\|test\|describe" {} 2>/dev/null || echo "No tests found"'

Agents frequently write tests for the happy path and skip error handling tests. The coverage percentage can look fine because the happy path is covered. Specifically check for test cases that cover the error conditions you identified in step 2.
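
As a concrete target, here is a minimal Jest sketch of the kind of error-path test to look for, using the withRetry helper from step 4. The module path is hypothetical.

// Error-path test: the retry helper must rethrow after exhausting attempts
import { withRetry } from "./retry"; // hypothetical module path

test("withRetry rethrows after exhausting all attempts", async () => {
  const failing = jest.fn().mockRejectedValue(new Error("boom"));
  await expect(withRetry(failing, 3)).rejects.toThrow("boom");
  expect(failing).toHaveBeenCalledTimes(3); // every attempt was actually made
});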

Step 6: Run Static Analysis Before Merging

Do not skip this step because the agent wrote the code. Static analysis tools are calibrated for exactly the kind of plausible-but-incorrect patterns that LLMs produce. Run your usual SAST tools with higher sensitivity on the agent diff:

# Run Semgrep on just the changed files (skip deleted files)
git diff --name-only --diff-filter=ACM main...agent/overnight-batch-2026-05-06 | xargs semgrep --config=auto

# Run ESLint on changed TypeScript files
git diff --name-only --diff-filter=ACM main...agent/overnight-batch-2026-05-06 -- '*.ts' '*.tsx' | xargs npx eslint --max-warnings 0

Zero-warning tolerance is appropriate for agent code. Warnings in LLM-generated code tend to cluster around the actual bugs, not around stylistic choices.

The Meta-Problem: Review at Scale

Here is the uncomfortable truth. If your agent committed 47 changes overnight, doing the above process thoroughly will take longer than the agent spent generating the changes. This is expected and correct. Code review is slower than code generation, and it should be.

The problem is that many teams have not adjusted their review process for the new volume baseline. They apply the same 15-minute review they used to give a five-commit PR to a 47-commit overnight batch, and they wonder why agent-introduced bugs are reaching production.

There are two structural responses to this problem.

Constrain Agent Scope

Configure your agent to work in smaller batches with tighter scope. An agent that makes 5 focused commits to a single module is much easier to review than one that touches 12 modules in 47 commits.

# Example AGENTS.md constraint
## Batch Size
- Maximum 10 commits per overnight run
- Each commit touches at most 3 files
- Do not touch files outside the specified module unless explicitly required
- Create a summary commit at the end describing all changes made

Automate the Triage Layer

Use automated tools to do the triage work before human review starts. A tool that can scan the overnight diff, flag security-sensitive changes, identify missing test coverage, and run static analysis gives your reviewers a prioritized reading list instead of a raw diff.

This is the pattern that separates teams that ship agent code safely from teams that are accumulating hidden debt. The automated gate is not a replacement for human review. It is a filter that makes human review tractable at the volume agents produce.
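
As a rough sketch of what that triage layer can look like, here is a small TypeScript script that flags security-sensitive diffs and changed source files with no new test references. It reuses the branch naming from earlier; everything else is illustrative, not a specific tool's API.

// triage.ts — hypothetical pre-review triage pass over an agent branch
import { execSync } from "node:child_process";

const branch = process.argv[2] ?? "agent/overnight-batch-2026-05-06";
const sh = (cmd: string) => execSync(cmd, { encoding: "utf8" });

// Files the agent added or modified (deletions are skipped)
const files = sh(`git diff --name-only --diff-filter=ACM main...${branch}`)
  .trim()
  .split("\n")
  .filter(Boolean);

const securityPattern = /(auth|token|secret|key|password|permission|role|session)/i;
const testPattern = /\b(it|test|describe)\s*\(/;

for (const file of files) {
  const diff = sh(`git diff main...${branch} -- "${file}"`);
  const flags: string[] = [];
  if (securityPattern.test(diff)) flags.push("security-sensitive");
  if (/\.(ts|tsx|py|go)$/.test(file) && !testPattern.test(diff)) {
    flags.push("no new test references");
  }
  if (flags.length > 0) console.log(`${file}: ${flags.join(", ")}`);
}

The output is a prioritized reading list: reviewers open the flagged files first instead of paging through the raw diff.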

What Passes Review vs. What Gets Rejected

After doing overnight agent reviews for several months, you develop a feel for what fails. The patterns are consistent:

Reject if: The agent touched auth or session handling and there are no corresponding tests for the modified paths.

Reject if: The diff includes a refactor that was not in the original task scope. Scope creep in agents is usually the agent over-generalizing.

Reject if: Static analysis produces new warnings in agent-modified files (pre-existing warnings do not count).

Approve conditionally if: The logic is correct but commit messages lack reasoning traces. Approve the code, fix the agent prompting for next time.

Approve if: The diff is focused, tests cover the new paths, static analysis is clean, and commit messages explain the reasoning. This is what good overnight agent output looks like. It happens more often than you might expect once you constrain the agent's scope properly.

Building the Review Habit

The teams that use overnight agents effectively treat the morning review as a first-class engineering activity, not as a formality before merging. They block calendar time. They use structured checklists. They track the ratio of approved-to-rejected agent commits as a signal of agent quality over time.

The right mental model: your overnight agent is a very fast junior engineer who works in isolation, never asks clarifying questions, and cannot escalate when something is ambiguous. The code quality is often impressive. The judgment calls are often wrong. Review accordingly.


LucidShark gives you automated, local-first code quality analysis that catches the issues your AI agent introduces before they reach production.
