The pattern I keep seeing on teams chasing AI cost reductions: someone swaps a workload from GPT-4o to DeepSeek-V3, eyeballs a handful of outputs, calls it good, ships it. The cost graph drops the next day. Three weeks later a customer surfaces a regression — the cheaper model hallucinates a date format 6% more often, breaks the downstream invoice generator, and the rollback erases most of the savings plus a week of engineering time.
The fix isn't "don't swap models." Swapping is mostly the right move — DeepSeek-V3 at $0.07/$0.28 per million tokens vs GPT-4o at $2.50/$10 is too much money to leave on the table when the workload tolerates it.
The fix is: build the eval set before the swap, not after.
## What a useful eval set looks like
You don't need fancy infrastructure for this. Five steps:
### 1. Pull 100-300 real prompts from production logs
Cover the long tail of inputs, not just the happy path. Include the weird ones:
- The customer ticket in Spanish when your system was designed for English
- The PR diff with binary files mixed in
- The malformed JSON the user pasted instead of describing the issue in words
- The prompt that hit a 30-second timeout last Tuesday
These are the inputs where models actually differ. A 50-prompt happy-path eval will tell you both models are 99% accurate, and you'll learn nothing.
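The sampling step can be sketched in a few lines. A minimal sketch, assuming production requests are logged as JSONL with a `prompt` field — the log format and the `build_eval_set` helper are hypothetical, not from any particular logging stack:

```python
import json
import random

def build_eval_set(log_path, n=300, seed=42):
    """Randomly sample n prompts from a JSONL production log."""
    with open(log_path) as f:
        prompts = [json.loads(line)["prompt"] for line in f]
    random.seed(seed)  # fixed seed so the eval set is reproducible
    sample = random.sample(prompts, min(n, len(prompts)))
    # Key by a stable id so baseline and candidate outputs line up later.
    return {f"p{i:03d}": p for i, p in enumerate(sample)}
```

Random sampling gets you the long tail for free if the log is big enough; you can also hand-pick known-weird prompts and merge them in.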
### 2. Get the current model's outputs on those prompts
Save them with a timestamp. This is your baseline. Don't skip this — you'll need it for the comparison and you can't reconstruct it later if the model gets deprecated.
```python
import json

from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://your-gateway/v1")

baseline_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    baseline_outputs[prompt_id] = resp.choices[0].message.content

with open("baseline_gpt4o.json", "w") as f:
    json.dump(baseline_outputs, f)
```
### 3. Get the candidate model's outputs on the same prompts
Same code, different model name. If your application is wired through an OpenAI-compatible gateway, this is one config change:
```python
candidate_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="deepseek-chat",  # the only change
        messages=[{"role": "user", "content": prompt}],
    )
    candidate_outputs[prompt_id] = resp.choices[0].message.content
```
The cost of running 300 prompts through DeepSeek-V3 is roughly $0.20. Don't optimize this step.
### 4. Compare programmatically where you can, human-review the rest
For structured outputs (JSON, tool calls, field extraction), programmatic comparison covers most ground:
- Schema validity: does the output parse?
- Field match: do the extracted fields match the baseline?
- Edit distance for short strings
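Those three checks fit in one small function. A sketch for a JSON-extraction workload, using `difflib.SequenceMatcher` from the standard library as a cheap stand-in for edit distance — the `compare_structured` helper and its result shape are my own choices, not a standard:

```python
import json
from difflib import SequenceMatcher

def compare_structured(baseline, candidate):
    """Compare one baseline/candidate output pair for a JSON workload."""
    result = {"parses": True, "field_match": None, "similarity": None}
    try:
        b, c = json.loads(baseline), json.loads(candidate)
    except json.JSONDecodeError:
        result["parses"] = False  # schema check failed; nothing else to compare
        return result
    # Field match: identical keys and values (dict comparison ignores key order).
    result["field_match"] = b == c
    # Similarity ratio on the raw strings, for eyeballing near misses.
    result["similarity"] = SequenceMatcher(None, baseline, candidate).ratio()
    return result
```

Run it over every prompt pair and you get the structured-output verdicts for free; only the free-form outputs still need eyes.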
For free-form outputs (summaries, explanations, agent responses), human review is the bottleneck. Three minutes per prompt × 300 prompts = 15 hours, which sounds bad but is a one-time cost for a decision that affects every production call going forward.
Use an LLM-as-judge to triage: have a stronger model (Claude 3.5, GPT-4o) rate each candidate output against the baseline as better / equivalent / worse / different-but-acceptable. Then human-review only the worse and different buckets. That cuts human time by ~70% in my experience.
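The judge loop can be sketched against the same OpenAI-compatible client used above. The prompt wording, the four-label rubric, and the `parse_verdict` fallback (unparseable replies get routed to human review, not quietly passed) are my own assumptions, not an established recipe:

```python
JUDGE_PROMPT = """Compare two model outputs for the same task.

Task prompt:
{prompt}

Output A (current production model):
{baseline}

Output B (candidate model):
{candidate}

Answer with exactly one word, judging B against A:
better, equivalent, worse, or different-but-acceptable."""

VALID_LABELS = {"better", "equivalent", "worse", "different-but-acceptable"}

def parse_verdict(raw):
    # Anything the judge says that isn't a known label lands in the
    # human-review pile by being treated as "worse".
    label = raw.strip().lower().rstrip(".")
    return label if label in VALID_LABELS else "worse"

def judge(client, prompt, baseline, candidate, judge_model="gpt-4o"):
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, baseline=baseline, candidate=candidate)}],
        temperature=0,  # deterministic-ish verdicts
    )
    return parse_verdict(resp.choices[0].message.content)
```

Then human-review only the prompts whose verdict is `worse` or `different-but-acceptable`.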
### 5. Set a threshold before you ship
"Candidate model has to match baseline on at least 95% of evals to ship" is a reasonable default. The exact number depends on the workload:
- Safety-critical (legal, medical, financial): 99%+
- User-facing high-stakes (customer-facing summaries): 97%+
- Internal tooling (Slack summaries, dev tools): 92%+
- Background tasks (data cleanup, tagging): 85%+
Pick the threshold before you see the numbers. Picking after is how you talk yourself into shipping a model that's slightly worse on the dimension you care about.
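Mechanically, the gate is a few lines over the final verdicts. A sketch — which labels count as passing (`better`, `equivalent`, `different-but-acceptable`) is my reading of the rubric above, and `ship_decision` is a hypothetical helper:

```python
ACCEPTABLE = {"better", "equivalent", "different-but-acceptable"}

def ship_decision(verdicts, threshold=0.95):
    """verdicts: dict of prompt_id -> label, after human review of flagged cases."""
    passed = sum(1 for v in verdicts.values() if v in ACCEPTABLE)
    rate = passed / len(verdicts)
    return {"pass_rate": rate, "ship": rate >= threshold}
```

Commit the threshold (in code or in the runbook) before running this, so the number can't drift to fit the result.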
## The architectural prerequisite
This whole loop only works cheaply if swapping the model for the eval is a config change, not an integration project. Wire your application code through the OpenAI Python SDK with a configurable base_url and let a gateway handle the provider-specific bits.
```python
client = OpenAI(
    api_key="th-...",
    base_url="https://your-gateway/v1",
)

# Same client, different model per call
client.chat.completions.create(model="gpt-4o", ...)
client.chat.completions.create(model="deepseek-chat", ...)
client.chat.completions.create(model="claude-3-5-sonnet", ...)
```
I use TokenHub for the gateway — 40+ models behind one API key, route per call. LiteLLM self-hosted gets you the same shape if you'd rather run it yourself.
Without that wiring, every eval is a custom integration project, which is why most teams don't run evals.
## TL;DR
- Pull 100-300 real prompts from logs (include weird ones)
- Run baseline model, save outputs
- Run candidate model, save outputs
- Compare (programmatic for structured, LLM-judge + human for free-form)
- Threshold before shipping
The whole exercise takes a day. It saves you the rollback story.