Anil Prasad

Originally published at open.substack.com

Building Production RAG: From 52% to 89% Accuracy with a 6-Stage Pipeline

Two hard problems in production AI:

  1. Accuracy: RAG systems giving wrong answers 48% of the time
  2. Cost: LLM API bills hitting $47K/month

We solved both. Here's how.


Part 1: RAG Accuracy (52% → 89%)

Our RAG system was confidently wrong. Users asked "What were Q2 healthcare results?" and got Q1 data, footnotes, and chapter titles with zero content.

High similarity scores. Completely useless context.

The LLM wasn't the problem. Retrieval was broken.

[Image: RAG pipeline architecture diagram]

The 6-Stage Pipeline

Stage 1: Query Processing

Problem: "Show me Q2 results" has no semantic information.

Solution: Query expansion + metadata extraction

def process_query(raw_query: str) -> ProcessedQuery:
    metadata = extract_metadata(raw_query)  # dates, entities
    expanded = expand_query(raw_query, metadata)
    embedding = embed_with_context(expanded, metadata)
    return ProcessedQuery(expanded, metadata, embedding)

Transformation:
Input: "Show me Q2 results"
Output: "quarterly financial results Q2 2024 revenue profit earnings second quarter"

Stage 2: Vector Database Search

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("knowledge-base")

results = index.query(
    vector=query_embedding,
    top_k=5,  # not 10, not 20
    filter={
        # Range operators compare numbers, so store dates as sortable
        # integers (e.g. 20240401) or Unix timestamps in metadata
        "date_range": {"$gte": 20240401},
        "department": {"$eq": "healthcare"}
    },
    include_metadata=True
)

Key: Cosine similarity threshold 0.85. Anything lower retrieves noise.
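In code, the threshold is just a filter over the matches (a sketch; assumes the index uses the cosine metric, so scores are directly comparable):

# Drop anything below the similarity floor before it reaches the LLM
matches = [m for m in results["matches"] if m["score"] >= 0.85]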

Stage 3: Hybrid Search (Semantic + Keyword)

def hybrid_search(query: str, top_k=50):
    # Semantic (70%) + BM25 keyword (30%); assumes both helpers return
    # {chunk_id: score} dicts with scores on a comparable scale
    vector_results = vector_search(query, top_k)
    bm25_results = keyword_search(query, top_k)

    combined = []
    for chunk_id in set(vector_results) | set(bm25_results):
        score = (vector_results.get(chunk_id, 0) * 0.7 +
                 bm25_results.get(chunk_id, 0) * 0.3)
        combined.append((chunk_id, score))

    return sorted(combined, key=lambda x: x[1], reverse=True)[:top_k]

Why: Patent queries like "US-2847291" need exact match, not semantic.
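For reference, keyword_search can be a thin wrapper over a BM25 library. This sketch uses rank_bm25 and assumes a corpus dict of {chunk_id: text} built at index time; it normalizes scores so the 70/30 weighting above is meaningful:

from rank_bm25 import BM25Okapi

# Assumption: corpus is {chunk_id: text}, built once at index time
chunk_ids = list(corpus.keys())
bm25 = BM25Okapi([text.lower().split() for text in corpus.values()])

def keyword_search(query: str, top_k=50) -> dict:
    scores = bm25.get_scores(query.lower().split())
    top = max(scores.max(), 1e-9)  # avoid division by zero
    ranked = sorted(zip(chunk_ids, scores / top), key=lambda x: x[1], reverse=True)
    return dict(ranked[:top_k])  # {chunk_id: score normalized to [0, 1]}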

Stage 4: Re-ranking (23% Accuracy Boost)

from typing import List

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, chunks: List[str], top_k=5):
    pairs = [[query, chunk] for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

Strategy: Fast bi-encoder for 50 candidates → slow cross-encoder for final 5.
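Wired together, the retrieve-then-rerank path is short (get_text, which maps a chunk ID back to its text, is an assumed helper):

def retrieve(query: str) -> List[str]:
    candidates = hybrid_search(query, top_k=50)           # cheap, wide net
    texts = [get_text(chunk_id) for chunk_id, _ in candidates]
    return rerank(query, texts, top_k=5)                  # expensive, precise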

Stage 5: Context Assembly

def create_chunks(doc: Document, size=512, overlap=50):
    chunks = []
    tokens = tokenize(doc.text)

    for i in range(0, len(tokens), size - overlap):
        chunk = tokens[i:i + size]
        chunks.append(Chunk(
            text=detokenize(chunk),
            metadata={
                'source': doc.title,
                'date': doc.date,
                'section': extract_section(chunk)
            }
        ))
    return chunks

Why overlap: a sentence like "Revenue increased 23% vs previous quarter" can straddle a chunk boundary; the 50-token overlap keeps its surrounding context intact in at least one chunk.

Stage 6: LLM Generation

from typing import List

def generate_answer(query: str, chunks: List[Chunk]):
    context = "\n\n".join([
        f"<document>\n<source>{c.metadata['source']}</source>\n"
        f"<content>{c.text}</content>\n</document>"
        for c in chunks
    ])

    prompt = f"""Use ONLY the provided context.

Context:
{context}

Query: {query}

Instructions:
1. Answer using ONLY provided context
2. Cite sources
3. Say "I don't know" if insufficient

Answer:"""

    return llm.complete(prompt)

RAG Results

Before:

  • 52% accuracy
  • 31% hallucination rate
  • 3.8s latency

After:

  • 89% accuracy (+71%)
  • 4% hallucination rate (-87%)
  • 1.2s latency (-67%)

Key Insight: GPT-4 with naive retrieval = 54% accuracy. Haiku with 6-stage pipeline = 87% accuracy.

Optimize retrieval, not the LLM.


Part 2: Cost Reduction ($47K → $2.8K)

Same product. Same UX. 94% cost reduction.

The secret: 3-layer caching + intelligent routing.

Layer 1: Prompt Caching (68% hit rate)

Problem: Every request pays for the same system prompt.

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are a helpful AI assistant...",
            "cache_control": {"type": "ephemeral"}  # cached reads are 10x cheaper
        }
    ],
    messages=[{"role": "user", "content": query}]
)

Economics:

  • Normal: $3.00/1M tokens
  • Cached: $0.30/1M tokens (10x cheaper)

Example:
5K-token system prompt × 100 requests:
Without caching: 500K tokens × $3.00/1M = $1.50
With caching: one cache write (5K × $3.75/1M ≈ $0.02) + 99 cached reads (495K × $0.30/1M ≈ $0.15) ≈ $0.17
~89% savings

Layer 2: Semantic Caching (15% hit rate)

import time

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.threshold = threshold

    def _embed(self, query: str):
        # Normalize so the dot product below is true cosine similarity
        return self.model.encode(query, normalize_embeddings=True)

    def get(self, query: str):
        query_emb = self._embed(query)

        # Linear scan is fine for small caches; use a vector index at scale
        for cached_emb, (cached_query, response, _) in self.cache.items():
            if np.dot(query_emb, np.array(cached_emb)) >= self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        embedding = self._embed(query)
        self.cache[tuple(embedding)] = (query, response, time.time())

Catches: "How do I reset password?" ≈ "Password reset help?"

Layer 3: Result Caching (10% hit rate)

import hashlib
import json

import redis

class ResultCache:
    def __init__(self):
        self.redis = redis.Redis(decode_responses=True)

    def _key(self, query: str, context: dict) -> str:
        # sort_keys keeps the hash stable across dict orderings
        payload = json.dumps({'query': query, 'context': context}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, query: str, context: dict):
        return self.redis.get(self._key(query, context))

    def set(self, query: str, context: dict, response: str, ttl=3600):
        self.redis.setex(self._key(query, context), ttl, response)

TTL strategy (see the sketch after this list):

  • Stable content: 24 hours
  • Dynamic content: 1 hour
  • Real-time: 5 minutes
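One minimal way to encode that policy (the content-type labels here are illustrative assumptions, not our exact schema):

# Hypothetical TTL policy keyed by content type
TTL_SECONDS = {
    "stable": 24 * 3600,  # docs, policies
    "dynamic": 3600,      # reports, dashboards
    "realtime": 5 * 60,   # live metrics
}

def ttl_for(content_type: str) -> int:
    return TTL_SECONDS.get(content_type, 3600)  # default: 1 hour

# result_cache.set(query, context, answer, ttl=ttl_for("dynamic"))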

Intelligent Model Routing

67% of queries work fine on Haiku ($0.25/1M), which is 60x cheaper than Opus ($15/1M).

from enum import Enum

class Model(Enum):
    HAIKU = "claude-haiku-4-20250514"    # $0.25/1M
    SONNET = "claude-sonnet-4-20250514"  # $3/1M
    OPUS = "claude-opus-4-20250514"      # $15/1M

def route(query: str) -> Model:
    words = len(query.split())  # word count as a rough token proxy

    # Simple → Haiku
    if words < 50:
        return Model.HAIKU

    # Analysis → Sonnet
    if any(w in query.lower() for w in ['analyze', 'compare']):
        return Model.SONNET

    # Complex → Opus
    if any(w in query.lower() for w in ['design', 'architect']):
        return Model.OPUS

    return Model.SONNET

Distribution:

  • 67% Haiku
  • 28% Sonnet
  • 5% Opus

The Complete System

class OptimizedLLM:
    def __init__(self):
        self.semantic_cache = SemanticCache()
        self.result_cache = ResultCache()
        self.client = anthropic.Anthropic()

    def complete(self, query: str, context: dict):
        # Layer 3: exact-match result cache
        cached = self.result_cache.get(query, context)
        if cached:
            return cached

        # Layer 2: semantic cache for paraphrased queries
        semantic = self.semantic_cache.get(query)
        if semantic:
            return semantic

        # Layer 1: prompt cache + model routing
        model = route(query)

        response = self.client.messages.create(
            model=model.value,
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": context['system_prompt'],
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[{"role": "user", "content": query}]
        )
        answer = response.content[0].text  # SDK returns a list of content blocks

        # Populate both caches for future requests
        self.result_cache.set(query, context, answer)
        self.semantic_cache.set(query, answer)

        return answer
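A quick usage sketch (the query and system prompt strings are placeholders):

llm = OptimizedLLM()
answer = llm.complete(
    "Summarize Q2 healthcare results",
    context={"system_prompt": "You are a helpful AI assistant..."}
)
print(answer)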

Cost Results

Before:

  • $47K/month
  • P95 latency: 2.1s

After:

  • $2.8K/month (-94%)
  • P95 latency: 340ms (-84%)
  • 73% combined cache hit rate

Implementation Checklist

RAG:

  • [ ] Implement query processing (expand + extract metadata)
  • [ ] Set up vector DB with metadata filtering
  • [ ] Add hybrid search (semantic + keyword)
  • [ ] Deploy cross-encoder re-ranking
  • [ ] Build chunking with 50-token overlap
  • [ ] Force grounded prompts (no hallucinations)

Cost:

  • [ ] Enable prompt caching (10x savings)
  • [ ] Add semantic similarity cache
  • [ ] Implement result cache with smart TTL
  • [ ] Route to appropriate model tier
  • [ ] Monitor cache hit rates weekly

Key Insights

  1. Retrieval > LLM: Haiku + perfect context beats GPT-4 + bad context
  2. Re-ranking = 23% boost: Single highest-ROI optimization
  3. Caching = 73% hit rate: Most requests never touch the LLM
  4. Model routing = 60x savings: Haiku for 67% of queries

What We're Open-Sourcing

Next month:

  • 6-stage RAG pipeline (code + docs)
  • Cost optimization framework
  • Re-ranking models
  • Monitoring dashboards
  • Evaluation datasets

Follow @anilsprasad or Ambharii Labs for release.


Your Turn

For RAG: What's your accuracy? Drop it in comments.

For Costs: What's your monthly LLM bill? I'll tell you which optimization has highest ROI.

Common wins:

  • Prompt caching: 10x savings
  • Re-ranking: 23% accuracy boost
  • Model routing: 60x price difference

Let's make production AI work. 🚀


Tags: #ai #machinelearning #python #tutorial
