The AI Agent Eval Checklist I Use Before Shipping Prompt Changes

Most prompt changes look harmless in a pull request.

A sentence gets added. An example gets rewritten. A tool instruction becomes a little more specific.

Then the agent starts behaving differently in production.

It calls a tool too often. It gives longer answers. It asks redundant questions. It refuses safe tasks. It becomes more confident than it should be.

That is why AI agent teams need a small eval checklist before shipping prompt changes.

Not a giant benchmark.

Not a complex research pipeline.

Just a practical checklist that helps developers catch obvious regressions before users do.

Here is the lightweight version I recommend.


1. Define the behavior change in one sentence

Before testing anything, write down what the prompt change is supposed to improve.

Bad:

Make the agent better.

Better:

Reduce unnecessary human escalations for vague support tickets.

Better:

Make the code review agent label uncertain security findings instead of stating them as facts.

Better:

Make the sales assistant produce shorter first replies while keeping the same call-to-action.

If the desired behavior cannot be explained in one sentence, the prompt change is probably too vague.

A good eval starts with a clear target.


2. List the top three things that could get worse

Every prompt improvement has a tradeoff.

Examples:

  • More concise answers may become less helpful.
  • More proactive agents may take unwanted actions.
  • More safety instructions may create false refusals.
  • More examples may increase token cost.
  • More tool guidance may cause overuse of tools.
  • More persuasive copy may feel less trustworthy.

Before shipping, write the three likely regressions.

```
Expected risks:
1. The agent may ask unnecessary clarifying questions.
2. Average response length may increase.
3. The agent may escalate less often even when escalation is appropriate.
```

This is simple, but it changes how you test.

You stop only looking for the behavior you wanted.

You also look for the behaviors you might have broken.


3. Keep a small set of golden tasks

A golden task is a real or realistic input that your agent should handle consistently.

You do not need hundreds.

Start with 10 to 20.

For a developer assistant, golden tasks might include:

  • Explain a stack trace.
  • Refactor a small function.
  • Write a unit test.
  • Review a pull request.
  • Suggest a migration plan.
  • Debug a failing API call.
  • Compare two implementation options.

For a support agent:

  • Refund request.
  • Angry customer message.
  • Missing order.
  • Ambiguous bug report.
  • Pricing question.
  • Account deletion request.
  • Request that requires human escalation.

For each task, save:

```
Input:
What the user says.

Expected behavior:
What the agent should generally do.

Failure examples:
What would make this response unacceptable.
```

Avoid making your expected output too rigid.

For prompt evals, you usually care about behavior, not exact wording.
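If you prefer to keep golden tasks in code next to your tests, here is a minimal sketch. The field names and example tasks are my own, not a standard format:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenTask:
    """One realistic input the agent should handle consistently."""
    name: str
    user_input: str         # what the user says
    expected_behavior: str  # what the agent should generally do
    failure_examples: list[str] = field(default_factory=list)  # what makes a response unacceptable

# A small starter set; 10 to 20 of these is plenty.
GOLDEN_TASKS = [
    GoldenTask(
        name="refund_no_order_id",
        user_input="I want a refund.",
        expected_behavior="Ask for the order ID before making any promises.",
        failure_examples=["Promises a refund without checking policy."],
    ),
    GoldenTask(
        name="stack_trace",
        user_input="My service crashes with this stack trace. Why?",
        expected_behavior="Explain the likely cause and give prioritized next steps.",
        failure_examples=["Invents a root cause without asking about the environment."],
    ),
]
```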


4. Test old prompt vs new prompt side by side

Do not only run the new prompt.

Run the old prompt too.

Side-by-side comparison catches changes that absolute scoring often misses.

Use a simple table:

Task | Old prompt | New prompt | Better? | Notes
---|---|---|---|---
Refund request | Good escalation | Solves directly | Old | New prompt skipped policy
Stack trace | Clear steps | Clearer steps | New | Better prioritization
Pricing question | Too long | Short and accurate | New | Good improvement
Security finding | Overconfident | Uncertain + suggests check | New | Desired behavior

The key question is not:

Is the new answer good?

The better question is:

Is the new answer better for the behavior we intended, without breaking important cases?
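A tiny harness keeps this comparison honest. The sketch below reuses the GoldenTask structure from step 3 and assumes a run_agent helper that you would wire to your own model client; it is hypothetical, not a real API:

```python
def run_agent(system_prompt: str, user_input: str) -> str:
    """Hypothetical helper: call your own model or agent framework here."""
    raise NotImplementedError

def compare_prompts(old_prompt: str, new_prompt: str, tasks: list) -> None:
    """Print old and new outputs side by side so a human can fill in 'Better?'."""
    for task in tasks:
        old_out = run_agent(old_prompt, task.user_input)
        new_out = run_agent(new_prompt, task.user_input)
        print(f"## {task.name}")
        print(f"OLD: {old_out}\n")
        print(f"NEW: {new_out}\n")
```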


5. Score behavior dimensions, not vibes

A vague rating like "good" is not enough.

Pick 3 to 5 dimensions that matter for your agent.

For example:

```
Accuracy: 1-5
Helpfulness: 1-5
Conciseness: 1-5
Tool discipline: 1-5
Escalation judgment: 1-5
```

For a coding agent:

```
Correctness: 1-5
Explains tradeoffs: 1-5
Avoids unsafe changes: 1-5
Asks for missing context: 1-5
Keeps answer actionable: 1-5
```

For a marketing assistant:

```
Audience fit: 1-5
Specificity: 1-5
Brand voice: 1-5
Conversion clarity: 1-5
Avoids hype: 1-5
```

The scores do not need to be perfect.

They need to force consistent judgment.

If your team cannot agree on what a "5" means, write one example of a 5 and one example of a 2.
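Recording scores as data instead of prose also makes drift visible across prompt versions. A minimal sketch, with example dimension names you should swap for your own:

```python
# Example dimensions; pick the 3 to 5 that matter for your agent.
DIMENSIONS = ("accuracy", "helpfulness", "conciseness",
              "tool_discipline", "escalation_judgment")

def record_scores(task_name: str, prompt_version: str, scores: dict[str, int]) -> dict:
    """Validate hand-assigned 1-5 scores for one task and one prompt version."""
    for dim, value in scores.items():
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim}")
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} must be 1-5, got {value}")
    return {"task": task_name, "prompt": prompt_version, **scores}

row = record_scores("refund_no_order_id", "v13",
                    {"accuracy": 4, "helpfulness": 3, "escalation_judgment": 5})
```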


6. Include edge cases that should not work

Most eval sets only test happy paths.

Agents fail in the messy cases.

Add prompts where the correct behavior is to slow down, ask a question, refuse, or escalate.

Examples:

User asks the coding agent to paste production secrets into a script.

```
Expected behavior:
Refuse to handle secrets directly. Suggest safe environment variable handling.
```

User asks the support agent for a refund but gives no order ID.

```
Expected behavior:
Ask for the order ID or account email before making claims.
```

User asks the marketing agent to write a fake testimonial.

```
Expected behavior:
Refuse the fake testimonial. Offer ethical alternatives.
```

User gives a vague bug report: "It does not work."

```
Expected behavior:
Ask targeted diagnostic questions instead of inventing a cause.
```

These cases protect your agent from becoming confidently wrong.
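You can even smoke-test these cases automatically with crude pattern checks. The patterns below are assumptions about how your agent phrases questions, refusals, and escalations; tune them to your own wording:

```python
import re

# Crude heuristics. Assumption: a correct edge-case response asks a question,
# refuses, or escalates instead of acting. Adjust to your agent's phrasing.
ASKS_QUESTION = re.compile(r"\?")
REFUSES = re.compile(r"\b(can't|cannot|won't|unable to)\b", re.IGNORECASE)
ESCALATES = re.compile(r"escalat|human agent|support team", re.IGNORECASE)

def slowed_down(response: str) -> bool:
    """True if the agent asked, refused, or escalated rather than barreling ahead."""
    return any(p.search(response) for p in (ASKS_QUESTION, REFUSES, ESCALATES))

assert slowed_down("Could you share the order ID first?")
assert not slowed_down("Done. I refunded the full amount.")
```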


7. Check tool usage separately

If your AI agent can call tools, tool behavior deserves its own eval section.

A response can sound good while using tools badly.

Track:

  • Did it call the right tool?
  • Did it call too many tools?
  • Did it skip a necessary tool?
  • Did it call a tool before asking for missing information?
  • Did it explain the result accurately?
  • Did it retry in a sensible way after failure?

A simple tool eval format:

```
Task:
User asks for invoice status.

Expected tool behavior:
1. Ask for account email if missing.
2. Call invoice lookup only after identifier is present.
3. Do not promise refund without policy check.

Observed:
The agent called invoice_lookup with no identifier.

Result:
Fail.
```

Tool discipline is one of the biggest differences between a demo agent and a production agent.
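If your framework exposes a trace of tool calls per run, checks like the one above can become assertions. The trace shape here is illustrative, not a real API:

```python
# Assumption: your agent framework returns a list of tool calls per run,
# each shaped like {"tool": str, "args": dict}. Adapt to your own trace format.

def check_invoice_flow(trace: list[dict]) -> bool:
    """Fail the eval if invoice_lookup was called without an identifier."""
    for call in trace:
        if call["tool"] == "invoice_lookup" and not call["args"].get("account_email"):
            return False  # looked up an invoice with no identifier: fail
    return True

assert not check_invoice_flow([{"tool": "invoice_lookup", "args": {}}])
assert check_invoice_flow([{"tool": "invoice_lookup",
                            "args": {"account_email": "user@example.com"}}])
```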


8. Track length and cost changes

Prompt changes often increase output length without anyone noticing.

That matters because longer answers can mean:

  • Higher token cost.
  • Slower responses.
  • More user friction.
  • More irrelevant detail.
  • More room for mistakes.

You do not need a fancy dashboard at first.

Track approximate length:

Task | Old response words | New response words | Change
---|---:|---:|---:
Refund request | 120 | 260 | +117%
Stack trace | 180 | 210 | +17%
Pricing question | 95 | 90 | -5%
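Producing those numbers takes only a few lines. A sketch, assuming you keep the old and new responses as strings:

```python
def length_change(old_response: str, new_response: str) -> str:
    """Word-count delta as a percentage, rounded like the table above."""
    old_words = len(old_response.split())
    new_words = len(new_response.split())
    pct = (new_words - old_words) / old_words * 100
    return f"{pct:+.0f}%"

# 120 words -> 260 words gives "+117%", matching the refund-request row.
```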

If a prompt change improves one behavior but doubles the average output length, that tradeoff should be intentional.

I like to add one checklist question:

Did this prompt change make the agent more verbose by default?

If yes, decide whether that is acceptable.


9. Add a rollback note before shipping

Before merging the prompt change, write a rollback note.

```
Prompt change:
Added escalation examples for vague refund requests.

Intended improvement:
Reduce unsupported refund promises.

Rollback trigger:
If the support team reports more than 3 false escalations in one day, revert to prompt v12.

Owner:
Support automation lead.
```

This sounds boring.

That is the point.

A rollback note turns prompt changes from mysterious text edits into operational changes.

When something goes wrong, your team knows what changed and what to undo.


10. Review real conversations after release

Pre-release evals catch obvious problems.

They do not catch everything.

After shipping, review a small sample of real conversations.

Look for:

  • Unexpected user phrasing.
  • Repeated clarifying questions.
  • Bad tool calls.
  • Users abandoning the flow.
  • Longer-than-needed answers.
  • Overconfident claims.
  • Cases where humans corrected the agent.

A simple post-release review:

```
Review window:
First 24 hours after prompt release.

Sample:
20 conversations.

Questions:
1. Did the target behavior improve?
2. Did any listed regression appear?
3. Did cost or latency noticeably change?
4. Should we keep, edit, or roll back?
```
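Pulling the sample is a one-liner if your logs load into a list. Here, conversations is a placeholder for however you fetch the first 24 hours of logs, not a real API:

```python
import random

def pick_review_sample(conversations: list, k: int = 20, seed: int = 0) -> list:
    """Seeded random sample so the review is not biased toward the newest chats."""
    rng = random.Random(seed)
    return rng.sample(conversations, k=min(k, len(conversations)))
```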

This closes the loop.

Prompt work should not end at merge.


My lightweight prompt-change checklist

Here is the whole checklist in one place:

```
# AI Agent Prompt Change Eval Checklist

Prompt version:
Owner:
Date:

## 1. Intended behavior change
Write one sentence.

## 2. Top regression risks
- Risk 1:
- Risk 2:
- Risk 3:

## 3. Golden tasks tested
- Task 1:
- Task 2:
- Task 3:

## 4. Side-by-side comparison
Old prompt vs new prompt tested? Yes/No

## 5. Behavior scores
Accuracy:
Helpfulness:
Conciseness:
Tool discipline:
Escalation judgment:

## 6. Edge cases
At least three failure/slowdown/refusal/escalation cases tested? Yes/No

## 7. Tool usage
Correct tools called? Yes/No
Unnecessary tools avoided? Yes/No
Missing info requested before tool call? Yes/No

## 8. Cost and length
Average output length increased? Yes/No
Extra context added? Yes/No
Cost impact acceptable? Yes/No

## 9. Rollback note
Rollback trigger written? Yes/No
Owner assigned? Yes/No

## 10. Post-release review
Review window scheduled? Yes/No
Conversation sample size:
```

This is not a replacement for advanced eval systems.

But it is enough to stop many bad prompt changes from reaching production.


Final thought

The biggest mistake teams make with prompt engineering is treating prompts like copywriting instead of software behavior.

If a prompt controls an agent, it deserves the same habits we use for code:

  • clear intent,
  • regression testing,
  • review,
  • versioning,
  • rollback,
  • and post-release monitoring.

You do not need a perfect eval framework to start.

You need a repeatable checklist.

That alone puts you ahead of most AI agent projects.


If you want ready-made developer prompt structures, review checklists, and workflow templates, I built the Developer Prompt Bible for exactly this kind of repeatable AI work.

👉 Developer Prompt Bible — $9

https://payhip.com/b/ADsQI

It is designed for developers who want reusable AI workflows instead of random one-off prompts.
