The pattern I keep seeing on teams chasing AI cost reductions: someone swaps a workload from GPT-4o to DeepSeek-V3, eyeballs a handful of outputs, calls it good, ships it. The cost graph drops the next day. Three weeks later a customer surfaces a regression — the cheaper model hallucinates a date format 6% more often, breaks the downstream invoice generator, and the rollback erases most of the savings plus a week of engineering time.
The fix isn't "don't swap models." Swapping is mostly the right move — DeepSeek-V3 at $0.07/$0.28 per million tokens vs GPT-4o at $2.50/$10 is too much money to leave on the table when the workload tolerates it.
The fix is: build the eval set before the swap, not after.
## What a useful eval set looks like
You don't need fancy infrastructure for this. Five steps:
### 1. Pull 100-300 real prompts from production logs
Cover the long tail of inputs, not just the happy path. Include the weird ones:
- The customer ticket in Spanish when your system was designed for English
- The PR diff with binary files mixed in
- The malformed JSON the user pasted instead of describing the issue in words
- The prompt that hit a 30-second timeout last Tuesday
These are the inputs where models actually differ. A 50-prompt happy-path eval will tell you both models are 99% accurate, and you'll learn nothing.
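The sampling step can be sketched in a few lines. A minimal sketch, assuming production requests are logged as JSONL with a `prompt` field — the log format and the `build_eval_set` helper are hypothetical, not from any particular logging stack:

```python
import json
import random

def build_eval_set(log_path, n=300, seed=42):
    """Randomly sample n prompts from a JSONL production log."""
    with open(log_path) as f:
        prompts = [json.loads(line)["prompt"] for line in f]
    random.seed(seed)  # fixed seed so the eval set is reproducible
    sample = random.sample(prompts, min(n, len(prompts)))
    # Key by a stable id so baseline and candidate outputs line up later.
    return {f"p{i:03d}": p for i, p in enumerate(sample)}
```

Random sampling gets you the long tail for free if the log is big enough; you can also hand-pick known-weird prompts and merge them in.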
### 2. Get the current model's outputs on those prompts
Save them with a timestamp. This is your baseline. Don't skip this — you'll need it for the comparison and you can't reconstruct it later if the model gets deprecated.
```python
import json

from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://your-gateway/v1")

baseline_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    baseline_outputs[prompt_id] = resp.choices[0].message.content

with open("baseline_gpt4o.json", "w") as f:
    json.dump(baseline_outputs, f)
```
### 3. Get the candidate model's outputs on the same prompts
Same code, different model name. If your application is wired through an OpenAI-compatible gateway, this is one config change:
```python
candidate_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="deepseek-chat",  # the only change
        messages=[{"role": "user", "content": prompt}],
    )
    candidate_outputs[prompt_id] = resp.choices[0].message.content
```
The cost of running 300 prompts through DeepSeek-V3 is roughly $0.20. Don't optimize this step.
### 4. Compare programmatically where you can, human-review the rest
For structured outputs (JSON, tool calls, field extraction), programmatic comparison covers most ground:
- Schema validity: does the output parse?
- Field match: do the extracted fields match the baseline?
- Edit distance for short strings
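Those three checks fit in one small function. A sketch for a JSON-extraction workload, using `difflib.SequenceMatcher` from the standard library as a cheap stand-in for edit distance — the `compare_structured` helper and its result shape are my own choices, not a standard:

```python
import json
from difflib import SequenceMatcher

def compare_structured(baseline, candidate):
    """Compare one baseline/candidate output pair for a JSON workload."""
    result = {"parses": True, "field_match": None, "similarity": None}
    try:
        b, c = json.loads(baseline), json.loads(candidate)
    except json.JSONDecodeError:
        result["parses"] = False  # schema check failed; nothing else to compare
        return result
    # Field match: identical keys and values (dict comparison ignores key order).
    result["field_match"] = b == c
    # Similarity ratio on the raw strings, for eyeballing near misses.
    result["similarity"] = SequenceMatcher(None, baseline, candidate).ratio()
    return result
```

Run it over every prompt pair and you get the structured-output verdicts for free; only the free-form outputs still need eyes.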
For free-form outputs (summaries, explanations, agent responses), human review is the bottleneck. Three minutes per prompt × 300 prompts = 15 hours, which sounds bad but is a one-time cost for a decision that affects every production call going forward.
Use an LLM-as-judge to triage: have a stronger model (Claude 3.5, GPT-4o) rate each candidate output against the baseline as better / equivalent / worse / different-but-acceptable. Then human-review only the worse and different buckets. That cuts human time by ~70% in my experience.
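The judge loop can be sketched against the same OpenAI-compatible client used above. The prompt wording, the four-label rubric, and the `parse_verdict` fallback (unparseable replies get routed to human review, not quietly passed) are my own assumptions, not an established recipe:

```python
JUDGE_PROMPT = """Compare two model outputs for the same task.

Task prompt:
{prompt}

Output A (current production model):
{baseline}

Output B (candidate model):
{candidate}

Answer with exactly one word, judging B against A:
better, equivalent, worse, or different-but-acceptable."""

VALID_LABELS = {"better", "equivalent", "worse", "different-but-acceptable"}

def parse_verdict(raw):
    # Anything the judge says that isn't a known label lands in the
    # human-review pile by being treated as "worse".
    label = raw.strip().lower().rstrip(".")
    return label if label in VALID_LABELS else "worse"

def judge(client, prompt, baseline, candidate, judge_model="gpt-4o"):
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, baseline=baseline, candidate=candidate)}],
        temperature=0,  # deterministic-ish verdicts
    )
    return parse_verdict(resp.choices[0].message.content)
```

Then human-review only the prompts whose verdict is `worse` or `different-but-acceptable`.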
### 5. Set a threshold before you ship
"Candidate model has to match baseline on at least 95% of evals to ship" is a reasonable default. The exact number depends on the workload:
- Safety-critical (legal, medical, financial): 99%+
- User-facing high-stakes (customer-facing summaries): 97%+
- Internal tooling (Slack summaries, dev tools): 92%+
- Background tasks (data cleanup, tagging): 85%+
Pick the threshold before you see the numbers. Picking after is how you talk yourself into shipping a model that's slightly worse on the dimension you care about.
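Mechanically, the gate is a few lines over the final verdicts. A sketch — which labels count as passing (`better`, `equivalent`, `different-but-acceptable`) is my reading of the rubric above, and `ship_decision` is a hypothetical helper:

```python
ACCEPTABLE = {"better", "equivalent", "different-but-acceptable"}

def ship_decision(verdicts, threshold=0.95):
    """verdicts: dict of prompt_id -> label, after human review of flagged cases."""
    passed = sum(1 for v in verdicts.values() if v in ACCEPTABLE)
    rate = passed / len(verdicts)
    return {"pass_rate": rate, "ship": rate >= threshold}
```

Commit the threshold (in code or in the runbook) before running this, so the number can't drift to fit the result.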
## The architectural prerequisite
This whole loop only works cheaply if swapping the model for the eval is a config change, not an integration project. Wire your application code through the OpenAI Python SDK with a configurable base_url and let a gateway handle the provider-specific bits.
```python
client = OpenAI(
    api_key="th-...",
    base_url="https://your-gateway/v1",
)

# Same client, different model per call
client.chat.completions.create(model="gpt-4o", ...)
client.chat.completions.create(model="deepseek-chat", ...)
client.chat.completions.create(model="claude-3-5-sonnet", ...)
```
I use TokenHub for the gateway — 40+ models behind one API key, route per call. LiteLLM self-hosted gets you the same shape if you'd rather run it yourself.
Without that wiring, every eval is a custom integration project, which is why most teams don't run evals.
## TL;DR
- Pull 100-300 real prompts from logs (include weird ones)
- Run baseline model, save outputs
- Run candidate model, save outputs
- Compare (programmatic for structured, LLM-judge + human for free-form)
- Threshold before shipping
The whole exercise takes a day. It saves you the rollback story.