Aayush kumarsingh

TraceMind v3 — I built an AI agent that diagnoses why your LLM quality dropped

Uses ReAct loops and semantic failure search

Previous posts: v2 — hallucination detection + A/B testing


The most common question I got after v2 was this:

"The hallucination score spiked. Now what?"

TraceMind told you that something broke. It didn't tell you why. And it definitely didn't help you fix it.

That gap is what v3 closes.


If TraceMind is useful to you, a ⭐ on GitHub helps others find it.
GitHub: https://github.com/Aayush-engineer/TraceMind

What's new

Three things shipped in v3:

  1. EvalAgent — a ReAct agent that diagnoses quality regressions
  2. Response Control Hooks — block or retry hallucinated responses automatically
  3. Prompt Version Registry — track which prompt is deployed where

The EvalAgent

This is the main feature. When quality drops, instead of staring at a dashboard, you ask the agent:

"Why is quality dropping on the support dataset?"

The agent runs a loop:

THINK → What do I need to know?
ACT   → Use a tool to get it
OBSERVE → What did the tool show?
REPEAT until I have enough to answer

It has 6 tools: fetch recent traces, run targeted evals, search past failures (semantic search via ChromaDB), generate new test cases, analyze failure patterns, and send alerts.
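
In code, that tool set amounts to a dispatch table the loop can call into. A minimal sketch with stub bodies — the tool names come from this post and the session below, but the callables are placeholders, and run_targeted_evals / send_alert are my naming guesses for the two tools not shown in the session:

# Illustrative dispatch table — not the real tool implementations
TOOLS = {
    "fetch_recent_traces":     lambda inp: "14 low-quality traces in last 24h",
    "run_targeted_evals":      lambda inp: "eval scores for the selected traces",
    "search_similar_failures": lambda inp: "3 similar past failures (ChromaDB)",
    "generate_test_cases":     lambda inp: "5 adversarial test cases",
    "analyze_failure_pattern": lambda inp: "root cause + suggested fix",
    "send_alert":              lambda inp: "alert sent",
}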

A real session looks like this:

Step 1: search_similar_failures
→ Found 3 similar past failures (82% match). Last seen 4 days ago.

Step 2: fetch_recent_traces
→ 14 low-quality traces in last 24h. Lowest score: 3.2.

Step 3: analyze_failure_pattern
→ Pattern: multi-step refund questions with policy constraints
  Root cause: prompt doesn't specify what to do when policy is ambiguous
  Fix: add explicit fallback instruction for edge cases

Step 4: generate_test_cases
→ Generated 5 adversarial cases covering this failure mode

ANSWER: Quality dropped because the prompt has no fallback for ambiguous
policy questions. Generated 5 test cases to cover this. Recommended fix:
add "If policy is unclear, say: I'll check and follow up" to your prompt.

That's the complete investigation — 4 tool calls, 45 seconds, specific root cause, specific fix, new test cases already added to the dataset.


The architecture decision: text-based ReAct, not native tool calling

I had two options for the agent loop.

Option A — Anthropic/OpenAI native tool calling: cleaner, more reliable JSON, the model calls tools directly.

Option B — Text-based ReAct: model outputs TOOL: name\nINPUT: {...}, I parse it.

I went with Option B because I'm running on Groq's free tier (llama-3.1-8b-instant), and native tool calling on smaller open models is unreliable — the model frequently hallucinates tool names or produces malformed schemas. Text-based ReAct is more forgiving and easier to debug when something goes wrong.

The tradeoff: I have to parse the output myself, and occasionally the model produces text that doesn't match the TOOL: / ANSWER: pattern. I handle that with a fallback that appends the raw response to context and retries.
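
Concretely, the parse-and-fallback loop looks roughly like this. It's a simplified sketch under my own assumptions about function names and prompt plumbing — not the actual TraceMind source:

import json
import re

MAX_ITERATIONS = 8  # hard ceiling; runs usually answer well before this

def run_react_loop(question, complete, tools):
    """complete(prompt) -> str is the LLM call; tools maps name -> callable."""
    context = ""
    for _ in range(MAX_ITERATIONS):
        output = complete(build_prompt(question, context, tools))      # THINK

        if "ANSWER:" in output:
            return output.split("ANSWER:", 1)[1].strip()

        match = re.search(r"TOOL:\s*(\w+)\s+INPUT:\s*(\{.*\})", output, re.S)
        if not match or match.group(1) not in tools:
            # Model didn't follow the TOOL:/ANSWER: pattern — append the raw
            # response to context and let the next iteration retry.
            context += f"\n[unparsed model output]\n{output}\n"
            continue

        name = match.group(1)
        try:
            tool_input = json.loads(match.group(2))
        except json.JSONDecodeError:
            context += f"\n[invalid JSON input for {name}]\n{output}\n"
            continue

        result = tools[name](tool_input)                                # ACT
        context += f"\nOBSERVATION from {name}: {result}\n"             # OBSERVE

    return "Analysis incomplete after 8 steps.\n" + context

def build_prompt(question, context, tools):
    return (
        f"Question: {question}\n"
        f"Available tools: {', '.join(tools)}\n"
        "Respond with either 'TOOL: <name>\\nINPUT: <json>' or 'ANSWER: <answer>'.\n"
        f"Scratchpad so far:\n{context}"
    )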


Memory: 4 types

The agent isn't stateless. Between runs it maintains:

Semantic memory — ChromaDB stores embeddings of every past failure. When a new failure arrives, the agent searches for similar past failures and their resolutions. If this exact problem was solved 3 weeks ago, the agent finds it.

Episodic memory — The last 5 agent runs for each project are stored in Postgres. New runs start with context from previous investigations.

Project context — Loaded at agent init. The agent knows what kind of system it's investigating.

In-context working memory — The scratchpad of tool results that accumulates during a single run.

Most agents only have the last one. The semantic + episodic layers are what make investigations get faster over time.
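
For the semantic layer, the shape with ChromaDB is roughly the following. The collection name matches the failure_collection mentioned in the comments below; the metadata fields and document text are illustrative, not the exact TraceMind schema:

import chromadb

client = chromadb.PersistentClient(path="./memory")
failures = client.get_or_create_collection("failure_collection")

# Index a resolved failure so future investigations can find it
failures.add(
    ids=["failure-001"],
    documents=["Multi-step refund questions with ambiguous policy; fixed by adding a fallback instruction to the prompt."],
    metadatas=[{"project_id": "my-app", "resolution": "prompt fallback added"}],
)

# Later, when a new failure arrives: search for similar past failures
hits = failures.query(
    query_texts=["refund request the policy doesn't clearly cover"],
    n_results=3,
)
print(hits["documents"][0], hits["metadatas"][0])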


Response Control Hooks

This closes the loop on hallucination detection.

Before v3: TraceMind detected a high-risk response. You logged it. Nothing happened.

Now:

from tracemind import TraceMind, HallucinationPolicy

tm = TraceMind(api_key="...", project="my-app")

# Built-in policies — safe defaults out of the box
tm.response_control.set_policy("critical", HallucinationPolicy.BLOCK)
tm.response_control.set_policy("high",     HallucinationPolicy.BLOCK)
tm.response_control.set_policy("medium",   HallucinationPolicy.FLAG)

# Or custom callback for your specific logic
@tm.response_control.on("critical")
def handle_critical(event):
    alert_oncall(f"Critical hallucination in {event.span_name}")
    return "I'm not confident in this answer. Please contact support."

# Your existing code, unchanged
@tm.trace("support_handler")
def handle_ticket(ticket: str) -> str:
    return your_llm.complete(ticket)
# If response is critical-risk → HallucinationBlocked raised automatically

The design principle here came from a comment on my v2 post from @sunychoudhary: teams that get full flexibility usually implement no policy at all. So the defaults ship with something safe, and you override what you need.
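
If you'd rather degrade gracefully than let the exception propagate, the call site looks roughly like this — a sketch assuming HallucinationBlocked is importable from the tracemind package (the exception name comes from the comment in the snippet above; the import path is my assumption):

from tracemind import HallucinationBlocked  # import path assumed

def answer_ticket(ticket: str) -> str:
    try:
        return handle_ticket(ticket)  # the @tm.trace-wrapped function above
    except HallucinationBlocked:
        # Blocked by the "critical"/"high" policy — serve a safe fallback instead
        return "I'm not confident in this answer. Please contact support."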


Prompt Version Registry

Every deployed prompt is now versioned:

POST /api/prompts/{prompt_name}/versions
{
  "content": "You are a professional support agent. Be empathetic and precise.",
  "tags": ["production", "v2.3"]
}
# → { "version_id": "support:v3" }

When quality drops, you can correlate it with which prompt version was deployed at that timestamp. This answers "did the regression start when we changed the prompt?" without manually digging through git history.
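
From Python, registering a version is just an HTTP call. A minimal sketch of the POST above using requests — the base URL is the hosted demo, and the Authorization header format is my assumption:

import requests

BASE_URL = "https://tracemind.onrender.com"   # or your self-hosted instance
API_KEY = "ef_live_..."                        # auth header format below is an assumption

resp = requests.post(
    f"{BASE_URL}/api/prompts/support/versions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "content": "You are a professional support agent. Be empathetic and precise.",
        "tags": ["production", "v2.3"],
    },
    timeout=10,
)
version_id = resp.json()["version_id"]  # e.g. "support:v3"
# Record version_id next to your deploy timestamp so a later quality drop
# can be lined up against the prompt change that preceded it.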


What I got wrong in v2 (and fixed)

The inputs["project_id"] bug — The agent would call fetch_recent_traces but the LLM sometimes omitted project_id from the tool input JSON. The function did inputs["project_id"] — hard key access — so it crashed with a KeyError instead of falling back to the agent's own project ID.

The fix: pid = inputs.get("project_id") or project_id and pass project_id through the call chain. Obvious in hindsight. The pattern for all tool inputs is now .get() with fallbacks throughout.
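
The pattern in a nutshell — a sketch of the defensive tool-input handling, with the signature and the hours parameter simplified for illustration:

def fetch_recent_traces(inputs: dict, project_id: str):
    # The LLM sometimes drops keys from the tool-input JSON, so never index it directly.
    pid = inputs.get("project_id") or project_id   # fall back to the agent's own project
    hours = inputs.get("hours", 24)                # default instead of a KeyError
    ...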

The float parse crash — The worker that auto-scores spans sent max_tokens=5 to get a single number back. Sometimes the model returned "3\n\nThe response is...". The code did float(result.strip()) and crashed.

The fix: float(result.strip().split()[0].rstrip('.')) — take only the first token.
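
With a guard around it, the scoring parse looks like this — the exception fallback is my addition for illustration, not something the post says is in the worker:

def parse_score(result: str) -> float | None:
    # The model is asked for a single number but may append prose:
    # "3\n\nThe response is..." — keep only the first token.
    try:
        return float(result.strip().split()[0].rstrip("."))
    except (ValueError, IndexError):
        return None  # caller treats None as "could not auto-score this span"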

Both bugs were caught by the verify suite (verify_all.py) before I noticed them in logs.


Numbers

44/44 verification checks passing
76 unit tests
8-iteration cap per agent run
~45 seconds for a complete investigation
<1ms SDK overhead (batched, non-blocking)
$0 — runs entirely on Groq free tier

Try it

git clone https://github.com/Aayush-engineer/tracemind
cd tracemind && cp .env.example .env
# Add GROQ_API_KEY (free at console.groq.com)
docker-compose up

Or hit the hosted demo: tracemind.onrender.com/docs (free tier, ~30s cold start)

pip install tracemind-sdk

from tracemind import TraceMind
tm = TraceMind(
    api_key  = "ef_live_...",
    project  = "my-app",
    base_url = "https://tracemind.onrender.com"
)

@tm.trace("llm_call")
def your_function(msg):
    return your_llm.complete(msg)  # unchanged

What I'd still do differently

The agent uses text-based ReAct, which occasionally misfires on smaller models. Native tool calling with a model that supports it reliably (Llama 3.3 70B, Mixtral) would be more robust — but that's beyond Groq's free tier limits for my use case.

The semantic memory searches all past failures globally across projects. It should be scoped per project first. On a shared instance with many projects, cross-project signal is mostly noise.


What's next

  • Ollama integration — run entirely local, no API key
  • Hosted cloud version — 1 project, 1000 spans/month free
  • LlamaIndex callback

If you're building with LLMs and something breaks in a way that doesn't show up in your error logs — that's exactly the problem TraceMind is for. Would genuinely value feedback on whether the agent investigations are useful in practice, or just interesting in theory.

Top comments (5)

Nate Voss

the "cross-project signal is mostly noise" admission is the honest beat. semantic memory across projects sounds powerful in theory, but in practice each project has its own embedded assumptions that the agent treats as transferable when they're not. the text-based ReAct over JSON tool calling is also the right call for smaller models; i'd be curious whether the 8-iteration ceiling is binding in practice or if the agent usually self-terminates earlier. on the failure mode side, the memory layer also creates an ordering issue: if past investigations are loaded into context first, the agent can read "we already saw this pattern" before the fresh trace evidence has a chance to suggest it's actually a different bug class. seen that show up?

Aayush kumarsingh

The cross-project noise point is exactly right and I've just fixed it.

The failure_collection was storing failures from all projects in the same ChromaDB collection with no project_id filter on retrieval. When the agent ran search_similar_failures, it was pulling in failures from completely unrelated project domains and treating them as signal. I've now added project_id to every failure record's metadata and scope the search to the current project first, with a global fallback only when the project has zero past failures indexed.

The memory ordering issue is the more subtle one and you're right that I had it wrong. Past episodes were loaded into the system prompt before the agent saw any current evidence — so the agent's first cognitive frame was "we've seen this before" before it examined the actual traces. I've moved episodic memory injection to after the first tool call completes. Fresh evidence first, historical pattern matching second. The anchoring risk is real and your framing of it as an "ordering issue" is the precise diagnosis.

On the ceiling question: in practice the agent self-terminates between 3 and 6 iterations in every run I've logged. The 8-iteration ceiling has never been the binding constraint — it's always been model confidence (ANSWER: fires) or tool result completeness. The ceiling is there for pathological cases, not normal investigations.

On text-based ReAct — the tradeoff you've seen too, I assume: more parsing fragility on your end, but significantly more robust actual tool selection on the model's end when you're below GPT-4 capability. The parse failures are recoverable. Malformed JSON tool schemas from native calling on 8B models are not — they cascade into worse failures.

Both architectural fixes are in the repo now.

Nate Voss

that's the right ordering. honestly the second order risk is whether the model still treats "previous similar issue" as too strong a prior once it's in context, even when fresh evidence loads first. LLMs anchor on stated patterns even when the trace suggests a different bug class. probably more context-engineering than ordering at that point. and the 3-6 self-terminate range is a useful signal. it means 8 is the safety net, not where the agent actually lives.

Andy Nian

"The agent runs a loop: THINK → ACT → OBSERVE → REPEAT until I have enough to answer."

The ReAct loop in your EvalAgent is intriguing, but isn't there a risk of it getting stuck in an infinite loop if it continually finds data that doesn't fully resolve the issue? How do you cap the number of iterations to prevent it from spiraling out of control? It seems like that could be a potential snag, especially when working with ambiguous or partially complete data. Having run into similar issues, I know that setting a sensible upper limit can save a lot of headache.

Aayush kumarsingh

Good question — max_iterations is the primary guard.

The loop has a hard ceiling of 8 iterations. After 8 tool calls with no ANSWER:, the agent returns "Analysis incomplete after 8 steps" and saves whatever it found so far. The investigation doesn't spiral — it terminates and reports partial findings.

The more interesting failure mode you're pointing at is getting stuck in a reasoning rut — the agent keeps calling the same tool with slightly different inputs because each result gives enough signal to continue but not enough to conclude.

I handle this with two mechanisms:

  1. Context accumulation — every tool result is appended to the working context. The LLM can see its own prior calls, which prevents pure repetition (calling search_similar_failures twice with identical inputs gives identical output — the model learns this after 1-2 tries).

  2. Tool diversity pressure — the system prompt instructs the agent to use different tools to gather diverse signal rather than repeating the same one. In practice, 8 iterations is more than enough for any investigation I've run — the average is 4-5 tool calls to reach a specific root cause.

What I'd do for production at scale: add a tool-call deduplication check (if tool+input_hash was called before, skip it) and a confidence threshold (if analyze_failure_pattern returns high confidence, exit early). Neither is implemented yet — worth adding.