There's a failure mode in production AI systems that nobody talks about enough.
It's not a crash. It's not an error log. It's not a spike in latency that triggers your existing alerts. It's quieter than all of those.
Your model just... gets slightly worse. The provider updates something silently. The output quality drifts. Your users notice before you do, if they notice at all. By the time you catch it, weeks of degraded output have already gone out.
I hit this problem while building with RealDataAgentBench, my LLM evaluation benchmark. I had benchmark data telling me which model to use. I had cost estimates telling me what it would cost. What I didn't have was any way to know if those numbers were still true next month.
So I built the observability layer into CostGuard. Here's what it does and why it matters.
## The problem with "benchmark once, deploy forever"
Most teams pick a model by running some evaluation — formal or informal — and then committing to it. The evaluation happens at a point in time. The deployment runs indefinitely.
The gap between those two things is where silent degradation lives.
LLM providers update their models constantly. Sometimes they announce it. Often they don't. A model that scored 0.823 on your evaluation harness in January might score 0.741 in April — same model name, different behavior, no changelog entry you'd ever find.
If you're not re-evaluating continuously, you have no way to know this happened. You're flying blind on a system that's making decisions your users depend on.
Every time CostGuard runs an evaluation, it logs the result to a local SQLite database:
evaluations table:
- model name
- RDAB score (correctness, code quality, efficiency, stat validity)
- cost estimate
- timestamp
- dataset fingerprint (SHA-256 hash of your file)
Your actual data is never stored — only the fingerprint and the scores. The file is processed entirely in memory and discarded immediately after scoring.
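To make that concrete: the fingerprint is just a hash of the raw bytes, so it can be computed without the contents ever being persisted. A minimal sketch, with an illustrative function name rather than CostGuard's actual internals:

```python
import hashlib

def dataset_fingerprint(file_bytes: bytes) -> str:
    """Hash the uploaded file in memory; only the digest is kept."""
    return hashlib.sha256(file_bytes).hexdigest()

# The raw bytes are scored and discarded; only this 64-character digest
# and the resulting RDAB score end up in SQLite.
with open("churn.csv", "rb") as f:
    print(dataset_fingerprint(f.read()))
```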
Over time, CostGuard builds a historical average for each model on your data type. After a few runs, it knows what "normal" looks like for GPT-4.1 on your customer churn CSV, or for Claude Sonnet on your financial modeling dataset.
Then it watches for drift.
## The drift detection logic
The threshold is simple: if a model's current RDAB score drops more than 10% below its historical average, CostGuard records a drift event.
```python
# Simplified drift detection
historical_avg = get_model_average(model_id, dataset_fingerprint)
current_score = evaluation_result.rdab_score

if historical_avg and current_score < historical_avg * 0.90:
    record_drift_event(
        model=model_id,
        expected=historical_avg,
        actual=current_score,
        drop_pct=(historical_avg - current_score) / historical_avg,
    )
```
A 10% drop sounds small. In practice it's significant. On RDAB's 0–1 scale, dropping from 0.82 to 0.74 on the same task type means the model is materially less reliable on your specific workload. If that happens silently and you don't catch it, you're making decisions on degraded output without knowing it.
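For reference, the `get_model_average` call in the snippet above can be as simple as an aggregate query over the evaluations table. This is a hedged sketch with assumed table and column names, not CostGuard's exact implementation:

```python
import sqlite3

def get_model_average(model_id: str, dataset_fingerprint: str,
                      db_path: str = "/tmp/costguard_history.db") -> float | None:
    """Average historical RDAB score for one model on one dataset fingerprint."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT AVG(rdab_score) FROM evaluations "
            "WHERE model = ? AND dataset_fingerprint = ?",
            (model_id, dataset_fingerprint),
        ).fetchone()
    # None when there is no history yet, which skips the drift check entirely
    return row[0]
```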
## The Slack alert
If you've configured a webhook, CostGuard fires a Slack notification the moment drift is detected:
```bash
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...
```
No webhook configured?
It logs silently to the database and surfaces in the History & Alerts tab. The alert fires only when drift actually occurs, not on every run and not on a schedule. You get signal, not noise.
The message tells you: which model drifted, what its historical average was, what it scored on the current run, and the percentage drop. Enough context to decide whether to investigate or switch models immediately.
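Posting to a Slack incoming webhook needs nothing beyond an HTTP call with a JSON payload. Something along these lines works with the standard library; the message format here is mine, not necessarily what CostGuard sends:

```python
import json
import os
import urllib.request

def send_drift_alert(model: str, expected: float, actual: float, drop_pct: float) -> None:
    """Post a drift notification to the configured Slack incoming webhook."""
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook:
        return  # no webhook configured: the event still lands in SQLite

    text = (
        f":warning: Drift detected for {model}\n"
        f"Historical average: {expected:.3f} | Current run: {actual:.3f} "
        f"({drop_pct:.1%} drop)"
    )
    req = urllib.request.Request(
        webhook,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```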
## Why SQLite specifically
I deliberately didn't build this on Postgres, Redis, or any external service.
The goal was zero infrastructure overhead. CostGuard is already running on Railway with a FastAPI backend and a Streamlit frontend.
Adding a managed database service for observability data would mean another dependency, another cost, another thing to break.
SQLite writes to a single local file:
```bash
# Default path
/tmp/costguard_history.db

# Override if you need persistence across deploys
export COSTGUARD_DB_PATH=/var/data/costguard.db
```
Two tables. One for evaluation history, one for drift events. The whole thing works out of the box on Railway, on Render, and on your laptop. No setup beyond setting the environment variable if you want a custom path.
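If you're curious what "two tables" amounts to in practice, the shape is roughly this. The schema below is an illustrative sketch under the same assumed column names as earlier, not the exact DDL CostGuard ships:

```python
import os
import sqlite3

DB_PATH = os.environ.get("COSTGUARD_DB_PATH", "/tmp/costguard_history.db")

SCHEMA = """
CREATE TABLE IF NOT EXISTS evaluations (
    id                  INTEGER PRIMARY KEY AUTOINCREMENT,
    model               TEXT NOT NULL,
    rdab_score          REAL NOT NULL,
    cost_estimate       REAL,
    dataset_fingerprint TEXT NOT NULL,
    created_at          TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS drift_events (
    id                  INTEGER PRIMARY KEY AUTOINCREMENT,
    model               TEXT NOT NULL,
    expected_score      REAL NOT NULL,
    actual_score        REAL NOT NULL,
    drop_pct            REAL NOT NULL,
    dataset_fingerprint TEXT,
    created_at          TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

with sqlite3.connect(DB_PATH) as conn:
    conn.executescript(SCHEMA)
```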
## The audit trail problem
There's a second benefit to logging every evaluation that's less obvious than drift detection.
When a model recommendation turns out to be wrong, and someone asks "why did we pick GPT-4o-mini for this workload in February?", you need to be able to answer that question.
Without logging, the answer is "I don't remember" or "the benchmark said so" with no supporting evidence. With logging, the answer is a row in the evaluations table: here's the dataset fingerprint, here's the score at the time, here's the cost estimate, here's the timestamp.
Every recommendation is auditable. You can replay it, dispute it, or use it to demonstrate that the decision was reasonable given what the data showed at the time. In any organization where AI decisions get scrutinized — which is most organizations now — that audit trail is not optional.
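Answering the February question then becomes a single query against that history. A hedged example, reusing the schema sketched above:

```python
import sqlite3

DB_PATH = "/tmp/costguard_history.db"

# "Why did we pick GPT-4o-mini for this workload in February?"
with sqlite3.connect(DB_PATH) as conn:
    rows = conn.execute(
        "SELECT created_at, model, rdab_score, cost_estimate, dataset_fingerprint "
        "FROM evaluations "
        "WHERE model = ? AND created_at BETWEEN ? AND ? "
        "ORDER BY created_at",
        ("gpt-4o-mini", "2025-02-01", "2025-03-01"),
    ).fetchall()

for row in rows:
    print(row)  # the evidence behind the recommendation, with timestamps
```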
## The Model Averages tab
The History & Alerts tab in the CostGuard dashboard shows three things:
- Recent evaluations — a table of your last N runs, sortable by model, score, cost, and timestamp.
- Drift events — flagged runs where a model dropped below its historical baseline, with the percentage drop and a timestamp.
- Model averages — per-model average RDAB scores across all your runs, broken down by the type of data you've been evaluating.
The averages view is the one I find most useful in practice. It shows you which models are consistently strong on your data type over time — not just on a single run. A model that scores 0.85 once and 0.71 twice has a different story than a model that scores 0.78 three times. The average catches the difference.
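Computing that view is again one aggregate over the same table, grouped by model and dataset fingerprint. Roughly, under the same assumed schema as before:

```python
import sqlite3

with sqlite3.connect("/tmp/costguard_history.db") as conn:
    averages = conn.execute(
        "SELECT model, dataset_fingerprint, "
        "       COUNT(*) AS runs, AVG(rdab_score) AS avg_score "
        "FROM evaluations "
        "GROUP BY model, dataset_fingerprint "
        "ORDER BY avg_score DESC"
    ).fetchall()

for model, fingerprint, runs, avg_score in averages:
    print(f"{model}  runs={runs}  avg RDAB={avg_score:.3f}")
```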
## What this means for how you build
The standard workflow for LLM model selection is: benchmark, pick, deploy, move on.
CostGuard's observability layer replaces "move on" with a fourth step: monitor.
Not continuously re-running full benchmark suites. Not manually checking model changelogs. Just logging every evaluation you run anyway, watching for the 10% drift signal, and getting a Slack message if something changes.
The cost of doing this is essentially zero — SQLite is free, the logging is automatic, the drift check runs at the end of every evaluation you were already running. The cost of not doing it is finding out your model degraded three weeks ago when a user complains or a metric you track externally finally moves enough to notice.
## Try it
CostGuard is live at costguard.up.railway.app — no account, no API keys needed for Simulation Mode. Upload any CSV or Parquet file and get a model recommendation with exact cost estimates in under 15 seconds.
To enable the observability layer locally:
```bash
git clone https://github.com/patibandlavenkatamanideep/CostGuard.git
cd CostGuard
cp .env.example .env

# Optional: add your API keys for Live Mode
# Optional: add Slack webhook for drift alerts
# SLACK_WEBHOOK_URL=https://hooks.slack.com/...

docker compose up

# Dashboard → http://localhost:8501
# API docs  → http://localhost:8000/docs
```
The History & Alerts tab appears automatically after your first evaluation run.
