Sharath Kurup

Understanding RAG by Building a ChatPDF App: Smarter Chunking & Context Optimization (Part 3)

[Image: RAG pipeline]


In Part 1, we made it work.
In Part 2, we made it fast.
In Part 3, things got… interesting 😅


📌 Recap from Parts 1 & 2

In the previous parts:

👉 Part 1

  • Built a basic RAG pipeline using NumPy
  • Understood embeddings + similarity search

👉 Part 2

  • Switched to FAISS for fast retrieval ⚡
  • Added persistence + re-ranking

At this point, everything looked solid.


😅 But Something Still Felt Off

I started testing with real questions…

query = "What is FAISS indexing?"

And sometimes the answer would:

  • Talk about embeddings instead
  • Miss key details
  • Or feel… slightly off

🤔 The weird part?

The answer was actually present in the document.

But we weren't retrieving the right chunk.


🧠 The Real Problem Was Not Search

FAISS was doing its job.

The issue was earlier in the pipeline:

We were feeding it bad chunks.


๐Ÿ” Letโ€™s Look at the Old Chunking Logic

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]

        # Back up to the last space so we don't cut a word in half
        if end < len(text):
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]

        chunks.append({"text": chunk, "page": page_num})
        if end >= len(text):
            break  # last chunk reached; stepping back would loop forever
        i = end - OVERLAP_SIZE  # step back to create overlap
    return chunks

🧠 What This Was Doing

  • Split text using fixed size
  • Avoid breaking words
  • Add overlap

Looks reasonable… right?


🚨 Where It Breaks

Let's take a simple example:

Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."

Now imagine this gets chunked like:

Chunk 1:
"FAISS is a library for efficient similarity"

Chunk 2:
"search. It is widely used in RAG systems"

💥 What just happened?

  • The sentence got split
  • Meaning got split
  • Embeddings lost context

Embeddings don't understand fragments.
They understand complete ideas.


๐Ÿ” Letโ€™s Visualize This

Letโ€™s visualize what was actually happening ๐Ÿ‘‡

Fixed vs Recursive Chunking


👉 Notice how sentences are broken across chunks:
this is exactly what degrades retrieval quality.
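You can see the damage in miniature with a toy score. Here, simple word overlap stands in for embedding similarity (real embeddings are denser, but the effect is the same): the full sentence matches the query well, while each fragment matches it worse.

```python
# Toy illustration only: word overlap stands in for vector similarity.
def overlap_score(query, chunk):
    q = set(query.lower().split())
    c = set(chunk.lower().replace(".", "").split())
    return len(q & c)  # how many query words the chunk contains

query = "efficient similarity search"
whole = "FAISS is a library for efficient similarity search."
frag1 = "FAISS is a library for efficient similarity"
frag2 = "search. It is widely used in RAG systems"

print(overlap_score(query, whole))  # full sentence: matches all 3 query words
print(overlap_score(query, frag1))  # fragment: matches only 2
print(overlap_score(query, frag2))  # fragment: matches only 1
```

Neither fragment alone scores as well as the intact sentence, so neither gets retrieved reliably.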


💡 The Shift in Thinking

Instead of:

"Split text by size"

We need:

"Split text by meaning"


🚀 Step 1: Recursive Chunking (Respect Structure)


✅ New Approach

def generate_chunks_recursive(text, page_num, chunk_size, overlap_size):
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk_slice = text[start:end]

        # Try separators from coarsest to finest:
        # paragraph -> line -> sentence -> word
        last_break = len(chunk_slice)  # default: keep the whole slice
        if end < len(text):
            for separator in ["\n\n", "\n", ". ", " "]:
                pos = chunk_slice.rfind(separator)
                if pos != -1:
                    # Keep the period with the sentence it ends
                    last_break = pos + (1 if separator == ". " else 0)
                    break

        actual_end = start + last_break
        final_chunk = text[start:actual_end].strip()
        chunks.append({"text": final_chunk, "page": page_num})

        if actual_end >= len(text):
            break  # reached the end; don't step back into an infinite loop
        start = actual_end - overlap_size

    return chunks

🧠 What Changed Here?

Instead of blindly splitting, we now:

for separator in ["\n\n", "\n", ". ", " "]:

We try:

  1. Paragraph
  2. Line
  3. Sentence
  4. Word

👉 This is a priority-based splitting strategy
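The priority idea can be isolated in a few lines. This sketch (a hypothetical helper, `best_break`, not part of the app's code) picks the break point at the coarsest separator that appears in a window of text:

```python
# Hypothetical helper: find the best break point in a window of text,
# preferring coarser separators (paragraph > line > sentence > word).
def best_break(window):
    for separator in ["\n\n", "\n", ". ", " "]:
        pos = window.rfind(separator)
        if pos != -1:
            # Keep the period with the sentence it ends
            return pos + (1 if separator == ". " else 0), separator
    return len(window), None  # no separator at all: take the whole window

window = "FAISS is a library for efficient similarity search. It is widely"
pos, sep = best_break(window)
print(window[:pos])  # cuts cleanly at the sentence boundary
print(repr(sep))
```

Because ". " appears in the window, the sentence boundary wins over any mere word boundary, and the chunk ends at "…search." instead of mid-sentence.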


💡 Why This Works Better

  • Paragraphs stay intact
  • Sentences stay intact
  • Meaning stays intact

✅ Micro Summary

  • What changed: Structure-aware chunking
  • Why it matters: Better embeddings → better retrieval

๐Ÿ” Step 2: Overlap Still Matters

We still keep overlap:

[Chunk 1]  "RAG systems work by retrieving relevant context"
[Chunk 2]         "retrieving relevant context from documents"

🧠 Why This Is Important

  • Prevents context gaps
  • Keeps continuity between chunks

🔥 Step 3: Storing Full Context (Big Upgrade)

def generate_advanced_chunks(page_content, page_num):
    search_chunks = generate_chunks_recursive(...)

    for chunk in search_chunks:
        # Prefix the page number and keep the whole page alongside each chunk
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        chunk["full_context"] = page_content

    return search_chunks

🧠 Why This Matters

Earlier:

👉 We only stored chunk text

Now:

👉 We also store the entire page


💡 What This Enables

  • Better answer generation
  • Flexibility for later processing
  • Smarter context selection

🚨 New Problem Introduced

Now that we store full pages…

We started sending too much data to the LLM.


โŒ Problem

  • Large token usage
  • Slower responses
  • Irrelevant information

🚀 Step 4: Context Compression

import re

def compress_context(query, full_text):
    # Split into sentences, score each by query-word overlap,
    # and keep only the MAX_SENTENCES highest-scoring ones
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())

    scored_sentences = []
    for s in sentences:
        score = sum(1 for word in s.lower().split() if word in query_words)
        scored_sentences.append((score, s))

    top_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join(s for _, s in top_sentences)

🧠 What's Happening Here?

  1. Break into sentences
  2. Score relevance using query
  3. Keep only top sentences
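Step 1 is worth a closer look: the regex uses a lookbehind, `(?<=[.!?]) +`, so the punctuation stays attached to its sentence instead of being consumed as a delimiter. A quick standalone check:

```python
import re

# The lookbehind matches spaces *preceded by* ., !, or ? without
# consuming the punctuation, so each sentence keeps its ending.
text = "FAISS is fast. Is it exact? It can be! Depends on the index."
print(re.split(r'(?<=[.!?]) +', text))
# → ['FAISS is fast.', 'Is it exact?', 'It can be!', 'Depends on the index.']
```

A plain split on ". " would strip the periods and also miss sentences ending in "?" or "!".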

๐Ÿ” Visualize It

Before:
Full page โ†’ 1000+ tokens โŒ

After:
Relevant sentences โ†’ smaller context โœ…

💡 Why This Is Powerful

  • Faster responses
  • Better relevance
  • Lower token usage

✅ Micro Summary

  • What changed: Context filtering
  • Why it matters: Less noise → better answers

๐Ÿ” Retrieval Still Uses FAISS + Re-ranking

distances, indices = index.search(query_vector.reshape(1, -1), k=10)  # top-10 candidates
results = ranker.rerank(rerank_request)  # re-score the candidates for relevance

🧠 Flow

  1. FAISS → fast retrieval
  2. Re-ranker → improves relevance
  3. Top results → passed to LLM
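The two-stage flow can be sketched end to end in plain Python. This is a toy stand-in, not the app's code: brute-force L2 search plays the role of the FAISS index, and a keyword-overlap scorer plays the role of the re-ranker from Part 2.

```python
# Toy two-stage retrieval: coarse vector search, then a sharper re-score.
def l2_search(vectors, query_vector, k):
    # Stage 1: k nearest vectors by squared L2 distance (what IndexFlatL2 does)
    dists = [(sum((a - b) ** 2 for a, b in zip(v, query_vector)), i)
             for i, v in enumerate(vectors)]
    return [i for _, i in sorted(dists)[:k]]

def keyword_rerank(query, docs, candidate_ids):
    # Stage 2: re-score only the shortlist with a different relevance signal
    q = set(query.lower().split())
    scored = sorted(((len(q & set(docs[i].lower().split())), i)
                     for i in candidate_ids), reverse=True)
    return [i for _, i in scored]

docs = ["faiss indexing structures", "weather report", "faiss indexing for rag"]
vecs = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1)]
shortlist = l2_search(vecs, (1.0, 0.0), k=2)           # coarse shortlist
print(keyword_rerank("faiss indexing", docs, shortlist))  # → [2, 0]
```

The key property: the expensive-but-coarse stage narrows millions of candidates to a handful, and the precise stage only ever sees that handful.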

💬 Smarter Answer Generation

compressed_text = compress_context(query, res['full_context'])

👉 Instead of raw chunks, we now send:

Focused, relevant context


๐Ÿ” Final System

Final System


🧠 What Changed Overall

Feature  | Part 1 | Part 2 | Part 3
---------|--------|--------|---------------
Search   | NumPy  | FAISS  | FAISS + Rerank
Chunking | Basic  | Basic  | Recursive 🧠
Context  | Raw    | Raw    | Compressed 🔥
Accuracy | Low    | Medium | High

🧠 Final Thought

This is where things clicked for me.

I kept thinking better models would fix my system…

But the real issue was:

Bad context in → bad answers out


Most people focus on:

  • Models ❌
  • Vector DB ❌

But the real gains came from:

👉 Chunking
👉 Context handling


🔜 What's Next?

Now things get even more interesting.

In Part 4:

👉 We'll move beyond basic retrieval and make the system smarter

  • Token-aware chunking
  • Better query understanding
  • More intelligent retrieval

💬 Let's Connect

If you're building something similar or experimenting with local LLMs, I'd love to hear your thoughts 👇

