Anil Prasad

Originally published at open.substack.com

Building Production RAG: From 52% to 89% Accuracy with a 6-Stage Pipeline

Two hard problems in production AI:

  1. Accuracy: RAG systems giving wrong answers 48% of the time
  2. Cost: LLM API bills hitting $47K/month

We solved both. Here's how.


Part 1: RAG Accuracy (52% → 89%)

Our RAG system was confidently wrong. Users asked "What were Q2 healthcare results?" and got Q1 data, footnotes, and chapter titles with zero content.

High similarity scores. Completely useless context.

The LLM wasn't the problem. Retrieval was broken.

[Image: RAG pipeline architecture diagram]

The 6-Stage Pipeline

Stage 1: Query Processing

Problem: "Show me Q2 results" has no semantic information.

Solution: Query expansion + metadata extraction

def process_query(raw_query: str) -> ProcessedQuery:
    metadata = extract_metadata(raw_query)  # dates, entities
    expanded = expand_query(raw_query, metadata)
    embedding = embed_with_context(expanded, metadata)
    return ProcessedQuery(expanded, metadata, embedding)

Transformation:
Input: "Show me Q2 results"
Output: "quarterly financial results Q2 2024 revenue profit earnings second quarter"

Stage 2: Vector Database Search

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("knowledge-base")

results = index.query(
    vector=query_embedding,
    top_k=5,  # not 10, not 20
    filter={
        # Range operators compare numbers, so store dates as sortable
        # integers (e.g. 20240401) or Unix timestamps in metadata
        "date_range": {"$gte": 20240401},
        "department": {"$eq": "healthcare"}
    },
    include_metadata=True
)

Key: Cosine similarity threshold 0.85. Anything lower retrieves noise.
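In code, the threshold is just a filter over the matches (a sketch; assumes the index uses the cosine metric, so scores are directly comparable):

# Drop anything below the similarity floor before it reaches the LLM
matches = [m for m in results["matches"] if m["score"] >= 0.85]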

Stage 3: Hybrid Search (Semantic + Keyword)

def hybrid_search(query: str, top_k=50):
    # Semantic (70%) + BM25 keyword (30%); assumes both helpers return
    # {chunk_id: score} dicts with scores on a comparable scale
    vector_results = vector_search(query, top_k)
    bm25_results = keyword_search(query, top_k)

    combined = []
    for chunk_id in set(vector_results) | set(bm25_results):
        score = (vector_results.get(chunk_id, 0) * 0.7 +
                 bm25_results.get(chunk_id, 0) * 0.3)
        combined.append((chunk_id, score))

    return sorted(combined, key=lambda x: x[1], reverse=True)[:top_k]

Why: Patent queries like "US-2847291" need exact match, not semantic.
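For reference, keyword_search can be a thin wrapper over a BM25 library. This sketch uses rank_bm25 and assumes a corpus dict of {chunk_id: text} built at index time; it normalizes scores so the 70/30 weighting above is meaningful:

from rank_bm25 import BM25Okapi

# Assumption: corpus is {chunk_id: text}, built once at index time
chunk_ids = list(corpus.keys())
bm25 = BM25Okapi([text.lower().split() for text in corpus.values()])

def keyword_search(query: str, top_k=50) -> dict:
    scores = bm25.get_scores(query.lower().split())
    top = max(scores.max(), 1e-9)  # avoid division by zero
    ranked = sorted(zip(chunk_ids, scores / top), key=lambda x: x[1], reverse=True)
    return dict(ranked[:top_k])  # {chunk_id: score normalized to [0, 1]}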

Stage 4: Re-ranking (23% Accuracy Boost)

from typing import List

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, chunks: List[str], top_k=5):
    pairs = [[query, chunk] for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

Strategy: Fast bi-encoder for 50 candidates → slow cross-encoder for final 5.
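Wired together, the retrieve-then-rerank path is short (get_text, which maps a chunk ID back to its text, is an assumed helper):

def retrieve(query: str) -> List[str]:
    candidates = hybrid_search(query, top_k=50)           # cheap, wide net
    texts = [get_text(chunk_id) for chunk_id, _ in candidates]
    return rerank(query, texts, top_k=5)                  # expensive, precise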

Stage 5: Context Assembly

def create_chunks(doc: Document, size=512, overlap=50):
    chunks = []
    tokens = tokenize(doc.text)

    for i in range(0, len(tokens), size - overlap):
        chunk = tokens[i:i + size]
        chunks.append(Chunk(
            text=detokenize(chunk),
            metadata={
                'source': doc.title,
                'date': doc.date,
                'section': extract_section(chunk)
            }
        ))
    return chunks

Why overlap: a sentence like "Revenue increased 23% vs previous quarter" can straddle a chunk boundary; the 50-token overlap keeps its surrounding context intact in at least one chunk.

Stage 6: LLM Generation

from typing import List

def generate_answer(query: str, chunks: List[Chunk]):
    context = "\n\n".join([
        f"<document>\n<source>{c.metadata['source']}</source>\n"
        f"<content>{c.text}</content>\n</document>"
        for c in chunks
    ])

    prompt = f"""Use ONLY the provided context.

Context:
{context}

Query: {query}

Instructions:
1. Answer using ONLY provided context
2. Cite sources
3. Say "I don't know" if insufficient

Answer:"""

    return llm.complete(prompt)

RAG Results

Before:

  • 52% accuracy
  • 31% hallucination rate
  • 3.8s latency

After:

  • 89% accuracy (+71%)
  • 4% hallucination rate (-87%)
  • 1.2s latency (-67%)

Key Insight: GPT-4 with naive retrieval = 54% accuracy. Haiku with 6-stage pipeline = 87% accuracy.

Optimize retrieval, not the LLM.


Part 2: Cost Reduction ($47K → $2.8K)

Same product. Same UX. 94% cost reduction.

The secret: 3-layer caching + intelligent routing.

Layer 1: Prompt Caching (68% hit rate)

Problem: Every request pays for the same system prompt.

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are a helpful AI assistant...",
            "cache_control": {"type": "ephemeral"}  # cached reads are 10x cheaper
        }
    ],
    messages=[{"role": "user", "content": query}]
)

Economics:

  • Normal: $3.00/1M tokens
  • Cached: $0.30/1M tokens (10x cheaper)

Example:
5K-token system prompt × 100 requests:
Without caching: 500K tokens × $3.00/1M = $1.50
With caching: one cache write (5K × $3.75/1M ≈ $0.02) + 99 cached reads (495K × $0.30/1M ≈ $0.15) ≈ $0.17
~89% savings

Layer 2: Semantic Caching (15% hit rate)

import time

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.threshold = threshold

    def _embed(self, query: str):
        # Normalize so the dot product below is true cosine similarity
        return self.model.encode(query, normalize_embeddings=True)

    def get(self, query: str):
        query_emb = self._embed(query)

        # Linear scan is fine for small caches; use a vector index at scale
        for cached_emb, (cached_query, response, _) in self.cache.items():
            if np.dot(query_emb, np.array(cached_emb)) >= self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        embedding = self._embed(query)
        self.cache[tuple(embedding)] = (query, response, time.time())

Catches: "How do I reset password?" ≈ "Password reset help?"

Layer 3: Result Caching (10% hit rate)

import hashlib
import json

import redis

class ResultCache:
    def __init__(self):
        self.redis = redis.Redis(decode_responses=True)

    def _key(self, query: str, context: dict) -> str:
        # sort_keys keeps the hash stable across dict orderings
        payload = json.dumps({'query': query, 'context': context}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, query: str, context: dict):
        return self.redis.get(self._key(query, context))

    def set(self, query: str, context: dict, response: str, ttl=3600):
        self.redis.setex(self._key(query, context), ttl, response)

TTL strategy (see the sketch after this list):

  • Stable content: 24 hours
  • Dynamic content: 1 hour
  • Real-time: 5 minutes
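One minimal way to encode that policy (the content-type labels here are illustrative assumptions, not our exact schema):

# Hypothetical TTL policy keyed by content type
TTL_SECONDS = {
    "stable": 24 * 3600,  # docs, policies
    "dynamic": 3600,      # reports, dashboards
    "realtime": 5 * 60,   # live metrics
}

def ttl_for(content_type: str) -> int:
    return TTL_SECONDS.get(content_type, 3600)  # default: 1 hour

# result_cache.set(query, context, answer, ttl=ttl_for("dynamic"))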

Intelligent Model Routing

67% of queries work fine on Haiku ($0.25/1M), which is 60x cheaper than Opus ($15/1M).

from enum import Enum

class Model(Enum):
    HAIKU = "claude-haiku-4-20250514"    # $0.25/1M
    SONNET = "claude-sonnet-4-20250514"  # $3/1M
    OPUS = "claude-opus-4-20250514"      # $15/1M

def route(query: str) -> Model:
    words = len(query.split())  # word count as a rough token proxy

    # Simple → Haiku
    if words < 50:
        return Model.HAIKU

    # Analysis → Sonnet
    if any(w in query.lower() for w in ['analyze', 'compare']):
        return Model.SONNET

    # Complex → Opus
    if any(w in query.lower() for w in ['design', 'architect']):
        return Model.OPUS

    return Model.SONNET

Distribution:

  • 67% Haiku
  • 28% Sonnet
  • 5% Opus

The Complete System

class OptimizedLLM:
    def __init__(self):
        self.semantic_cache = SemanticCache()
        self.result_cache = ResultCache()
        self.client = anthropic.Anthropic()

    def complete(self, query: str, context: dict):
        # Layer 3: exact-match result cache
        cached = self.result_cache.get(query, context)
        if cached:
            return cached

        # Layer 2: semantic cache for paraphrased queries
        semantic = self.semantic_cache.get(query)
        if semantic:
            return semantic

        # Layer 1: prompt cache + model routing
        model = route(query)

        response = self.client.messages.create(
            model=model.value,
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": context['system_prompt'],
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[{"role": "user", "content": query}]
        )
        answer = response.content[0].text  # SDK returns a list of content blocks

        # Populate both caches for future requests
        self.result_cache.set(query, context, answer)
        self.semantic_cache.set(query, answer)

        return answer
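A quick usage sketch (the query and system prompt strings are placeholders):

llm = OptimizedLLM()
answer = llm.complete(
    "Summarize Q2 healthcare results",
    context={"system_prompt": "You are a helpful AI assistant..."}
)
print(answer)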

Cost Results

Before:

  • $47K/month
  • P95 latency: 2.1s

After:

  • $2.8K/month (-94%)
  • P95 latency: 340ms (-84%)
  • 73% combined cache hit rate

Implementation Checklist

RAG:

  • [ ] Implement query processing (expand + extract metadata)
  • [ ] Set up vector DB with metadata filtering
  • [ ] Add hybrid search (semantic + keyword)
  • [ ] Deploy cross-encoder re-ranking
  • [ ] Build chunking with 50-token overlap
  • [ ] Force grounded prompts (no hallucinations)

Cost:

  • [ ] Enable prompt caching (10x savings)
  • [ ] Add semantic similarity cache
  • [ ] Implement result cache with smart TTL
  • [ ] Route to appropriate model tier
  • [ ] Monitor cache hit rates weekly

Key Insights

  1. Retrieval > LLM: Haiku + perfect context beats GPT-4 + bad context
  2. Re-ranking = 23% boost: Single highest-ROI optimization
  3. Caching = 73% hit rate: Most requests never touch the LLM
  4. Model routing = 60x savings: Haiku for 67% of queries

What We're Open-Sourcing

Next month:

  • 6-stage RAG pipeline (code + docs)
  • Cost optimization framework
  • Re-ranking models
  • Monitoring dashboards
  • Evaluation datasets

Follow @anilsprasad or Ambharii Labs for release.


Your Turn

For RAG: What's your accuracy? Drop it in comments.

For Costs: What's your monthly LLM bill? I'll tell you which optimization has highest ROI.

Common wins:

  • Prompt caching: 10x savings
  • Re-ranking: 23% accuracy boost
  • Model routing: 60x price difference

Let's make production AI work. 🚀


Tags: #ai #machinelearning #python #tutorial
