Sharath Kurup

Understanding RAG by Building a ChatPDF App: Smarter Chunking & Context Optimization (Part 3)

[Image: RAG pipeline]


In Part 1, we made it work.
In Part 2, we made it fast.
In Part 3, things got… interesting 😅


📌 Recap from Parts 1 & 2

In the previous parts:

👉 Part 1

  • Built a basic RAG pipeline using NumPy
  • Understood embeddings + similarity search

👉 Part 2

  • Switched to FAISS for fast retrieval ⚡
  • Added persistence + re-ranking

At this point, everything looked solid.


😅 But Something Still Felt Off

I started testing with real questions…

query = "What is FAISS indexing?"

And sometimes the answer would:

  • Talk about embeddings instead
  • Miss key details
  • Or feel… slightly off

🤔 The weird part?

The answer was actually present in the document.

But we weren't retrieving the right chunk.


🧠 The Real Problem Was Not Search

FAISS was doing its job.

The issue was earlier in the pipeline:

We were feeding it bad chunks.


๐Ÿ” Letโ€™s Look at the Old Chunking Logic

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]

        # Back up to the last space so we don't cut a word in half
        if end < len(text):
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]

        chunks.append({"text": chunk, "page": page_num})
        if end >= len(text):
            break  # last chunk reached; stepping back would loop forever
        i = end - OVERLAP_SIZE  # step back to create overlap
    return chunks

🧠 What This Was Doing

  • Split text using fixed size
  • Avoid breaking words
  • Add overlap

Looks reasonable… right?


🚨 Where It Breaks

Let's take a simple example:

Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."

Now imagine this gets chunked like:

Chunk 1:
"FAISS is a library for efficient similarity"

Chunk 2:
"search. It is widely used in RAG systems"

💥 What just happened?

  • The sentence got split
  • Meaning got split
  • Embeddings lost context

Embeddings don't understand fragments.
They understand complete ideas.


๐Ÿ” Letโ€™s Visualize This

Letโ€™s visualize what was actually happening ๐Ÿ‘‡

Fixed vs Recursive Chunking


👉 Notice how sentences are broken across chunks:
this is exactly what degrades retrieval quality.
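You can see the damage in miniature with a toy score. Here, simple word overlap stands in for embedding similarity (real embeddings are denser, but the effect is the same): the full sentence matches the query well, while each fragment matches it worse.

```python
# Toy illustration only: word overlap stands in for vector similarity.
def overlap_score(query, chunk):
    q = set(query.lower().split())
    c = set(chunk.lower().replace(".", "").split())
    return len(q & c)  # how many query words the chunk contains

query = "efficient similarity search"
whole = "FAISS is a library for efficient similarity search."
frag1 = "FAISS is a library for efficient similarity"
frag2 = "search. It is widely used in RAG systems"

print(overlap_score(query, whole))  # full sentence: matches all 3 query words
print(overlap_score(query, frag1))  # fragment: matches only 2
print(overlap_score(query, frag2))  # fragment: matches only 1
```

Neither fragment alone scores as well as the intact sentence, so neither gets retrieved reliably.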


💡 The Shift in Thinking

Instead of:

"Split text by size"

We need:

"Split text by meaning"


🚀 Step 1: Recursive Chunking (Respect Structure)


✅ New Approach

def generate_chunks_recursive(text, page_num, chunk_size, overlap_size):
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk_slice = text[start:end]

        # Try separators from coarsest to finest:
        # paragraph -> line -> sentence -> word
        last_break = len(chunk_slice)  # default: keep the whole slice
        if end < len(text):
            for separator in ["\n\n", "\n", ". ", " "]:
                pos = chunk_slice.rfind(separator)
                if pos != -1:
                    # Keep the period with the sentence it ends
                    last_break = pos + (1 if separator == ". " else 0)
                    break

        actual_end = start + last_break
        final_chunk = text[start:actual_end].strip()
        chunks.append({"text": final_chunk, "page": page_num})

        if actual_end >= len(text):
            break  # reached the end; don't step back into an infinite loop
        start = actual_end - overlap_size

    return chunks

🧠 What Changed Here?

Instead of blindly splitting, we now:

for separator in ["\n\n", "\n", ". ", " "]:

We try:

  1. Paragraph
  2. Line
  3. Sentence
  4. Word

👉 This is a priority-based splitting strategy
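The priority idea can be isolated in a few lines. This sketch (a hypothetical helper, `best_break`, not part of the app's code) picks the break point at the coarsest separator that appears in a window of text:

```python
# Hypothetical helper: find the best break point in a window of text,
# preferring coarser separators (paragraph > line > sentence > word).
def best_break(window):
    for separator in ["\n\n", "\n", ". ", " "]:
        pos = window.rfind(separator)
        if pos != -1:
            # Keep the period with the sentence it ends
            return pos + (1 if separator == ". " else 0), separator
    return len(window), None  # no separator at all: take the whole window

window = "FAISS is a library for efficient similarity search. It is widely"
pos, sep = best_break(window)
print(window[:pos])  # cuts cleanly at the sentence boundary
print(repr(sep))
```

Because ". " appears in the window, the sentence boundary wins over any mere word boundary, and the chunk ends at "…search." instead of mid-sentence.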


💡 Why This Works Better

  • Paragraphs stay intact
  • Sentences stay intact
  • Meaning stays intact

✅ Micro Summary

  • What changed: Structure-aware chunking
  • Why it matters: Better embeddings → better retrieval

๐Ÿ” Step 2: Overlap Still Matters

We still keep overlap:

[Chunk 1]  "RAG systems work by retrieving relevant context"
[Chunk 2]         "retrieving relevant context from documents"

🧠 Why This Is Important

  • Prevents context gaps
  • Keeps continuity between chunks

🔥 Step 3: Storing Full Context (Big Upgrade)

def generate_advanced_chunks(page_content, page_num):
    search_chunks = generate_chunks_recursive(...)

    for chunk in search_chunks:
        # Prefix the page number and keep the whole page alongside each chunk
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        chunk["full_context"] = page_content

    return search_chunks

🧠 Why This Matters

Earlier:

👉 We only stored chunk text

Now:

👉 We also store the entire page


💡 What This Enables

  • Better answer generation
  • Flexibility for later processing
  • Smarter context selection

🚨 New Problem Introduced

Now that we store full pages…

We started sending too much data to the LLM.


โŒ Problem

  • Large token usage
  • Slower responses
  • Irrelevant information

🚀 Step 4: Context Compression

import re

def compress_context(query, full_text):
    # Split into sentences, score each by query-word overlap,
    # and keep only the MAX_SENTENCES highest-scoring ones
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())

    scored_sentences = []
    for s in sentences:
        score = sum(1 for word in s.lower().split() if word in query_words)
        scored_sentences.append((score, s))

    top_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join(s for _, s in top_sentences)

🧠 What's Happening Here?

  1. Break into sentences
  2. Score relevance using query
  3. Keep only top sentences
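Step 1 is worth a closer look: the regex uses a lookbehind, `(?<=[.!?]) +`, so the punctuation stays attached to its sentence instead of being consumed as a delimiter. A quick standalone check:

```python
import re

# The lookbehind matches spaces *preceded by* ., !, or ? without
# consuming the punctuation, so each sentence keeps its ending.
text = "FAISS is fast. Is it exact? It can be! Depends on the index."
print(re.split(r'(?<=[.!?]) +', text))
# → ['FAISS is fast.', 'Is it exact?', 'It can be!', 'Depends on the index.']
```

A plain split on ". " would strip the periods and also miss sentences ending in "?" or "!".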

๐Ÿ” Visualize It

Before:
Full page โ†’ 1000+ tokens โŒ

After:
Relevant sentences โ†’ smaller context โœ…

💡 Why This Is Powerful

  • Faster responses
  • Better relevance
  • Lower token usage

✅ Micro Summary

  • What changed: Context filtering
  • Why it matters: Less noise → better answers

๐Ÿ” Retrieval Still Uses FAISS + Re-ranking

distances, indices = index.search(query_vector.reshape(1, -1), k=10)  # top-10 candidates
results = ranker.rerank(rerank_request)  # re-score the candidates for relevance

🧠 Flow

  1. FAISS → fast retrieval
  2. Re-ranker → improves relevance
  3. Top results → passed to LLM
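The two-stage flow can be sketched end to end in plain Python. This is a toy stand-in, not the app's code: brute-force L2 search plays the role of the FAISS index, and a keyword-overlap scorer plays the role of the re-ranker from Part 2.

```python
# Toy two-stage retrieval: coarse vector search, then a sharper re-score.
def l2_search(vectors, query_vector, k):
    # Stage 1: k nearest vectors by squared L2 distance (what IndexFlatL2 does)
    dists = [(sum((a - b) ** 2 for a, b in zip(v, query_vector)), i)
             for i, v in enumerate(vectors)]
    return [i for _, i in sorted(dists)[:k]]

def keyword_rerank(query, docs, candidate_ids):
    # Stage 2: re-score only the shortlist with a different relevance signal
    q = set(query.lower().split())
    scored = sorted(((len(q & set(docs[i].lower().split())), i)
                     for i in candidate_ids), reverse=True)
    return [i for _, i in scored]

docs = ["faiss indexing structures", "weather report", "faiss indexing for rag"]
vecs = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1)]
shortlist = l2_search(vecs, (1.0, 0.0), k=2)           # coarse shortlist
print(keyword_rerank("faiss indexing", docs, shortlist))  # → [2, 0]
```

The key property: the expensive-but-coarse stage narrows millions of candidates to a handful, and the precise stage only ever sees that handful.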

💬 Smarter Answer Generation

compressed_text = compress_context(query, res['full_context'])

👉 Instead of raw chunks, we now send:

Focused, relevant context


๐Ÿ” Final System

Final System


🧠 What Changed Overall

Feature  | Part 1 | Part 2 | Part 3
---------|--------|--------|---------------
Search   | NumPy  | FAISS  | FAISS + Rerank
Chunking | Basic  | Basic  | Recursive 🧠
Context  | Raw    | Raw    | Compressed 🔥
Accuracy | Low    | Medium | High

🧠 Final Thought

This is where things clicked for me.

I kept thinking better models would fix my system…

But the real issue was:

Bad context in → bad answers out


Most people focus on:

  • Models ❌
  • Vector DB ❌

But the real gains came from:

👉 Chunking
👉 Context handling


🔜 What's Next?

Now things get even more interesting.

In Part 4:

👉 We'll move beyond basic retrieval and make the system smarter

  • Token-aware chunking
  • Better query understanding
  • More intelligent retrieval

💬 Let's Connect

If you're building something similar or experimenting with local LLMs, I'd love to hear your thoughts 👇

