⚡ Understanding RAG by Building a ChatPDF App: Better Chunking & Smarter Context
In Part 1, we made it work.
In Part 2, we made it fast.
In Part 3, things got… interesting.
Recap from Parts 1 & 2
In the previous parts:
Part 1
- Built a basic RAG pipeline using NumPy
- Understood embeddings + similarity search
Part 2
- Switched to FAISS for fast retrieval ⚡
- Added persistence + re-ranking
At this point, everything looked solid.
But Something Still Felt Off
I started testing with real questions…
query = "What is FAISS indexing?"
And sometimes the answer would:
- Talk about embeddings instead
- Miss key details
- Or feel… slightly off
🤔 The weird part?
The answer was actually present in the document.
But we weren't retrieving the right chunk.
🧠 The Real Problem Was Not Search
FAISS was doing its job.
The issue was earlier in the pipeline:
We were feeding it bad chunks.
Let's Look at the Old Chunking Logic
def generate_chunks(text, page_num):
    # CHUNK_SIZE / OVERLAP_SIZE are module-level settings defined elsewhere
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]
        # Avoid cutting a word in half: back up to the last space
        if end < len(text):
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]
        chunks.append({"text": chunk, "page": page_num})
        # Stop once we've reached the end of the text
        if end >= len(text):
            break
        # Step back so consecutive chunks share some characters
        i = end - OVERLAP_SIZE
    return chunks
🧠 What This Was Doing
- Split the text at a fixed size
- Avoided breaking words in half
- Added overlap between chunks
Looks reasonable… right?
🚨 Where It Breaks
Let's take a simple example:
Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."
Now imagine this gets chunked like:
Chunk 1:
"FAISS is a library for efficient similarity"
Chunk 2:
"search. It is widely used in RAG systems"
💥 What just happened?
- The sentence got split
- Meaning got split
- Embeddings lost context
Embeddings don't understand fragments
They understand complete ideas
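To make this concrete, here is a tiny sketch that runs the example sentence through the old chunker above. The CHUNK_SIZE and OVERLAP_SIZE values are illustrative, not the app's real settings:

# Illustrative settings, not the real app config
CHUNK_SIZE = 50
OVERLAP_SIZE = 10

sample = ("FAISS is a library for efficient similarity search. "
          "It is widely used in RAG systems.")

for c in generate_chunks(sample, page_num=1):
    print(repr(c["text"]))

# Prints something like:
#   'FAISS is a library for efficient similarity'
#   'similarity search. It is widely used in RAG'
#   'sed in RAG systems.'
# The sentence boundary is ignored, so every embedding only ever sees a fragment.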
Let's Visualize This
Notice how sentences are broken across chunks: this is exactly what degrades retrieval quality.
💡 The Shift in Thinking
Instead of:
“Split text by size”
We need:
“Split text by meaning”
Step 1: Recursive Chunking (Respect Structure)
New Approach
def generate_chunks_recursive(text, page_num, chunk_size, overlap_size):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk_slice = text[start:end]
        # Try separators from most to least meaningful:
        # paragraph -> line -> sentence -> word
        for separator in ["\n\n", "\n", ". ", " "]:
            last_break = chunk_slice.rfind(separator)
            if last_break != -1:
                if separator == ". ":
                    last_break += 1  # keep the period with this chunk
                break
        else:
            # No separator found at all: fall back to a hard cut
            last_break = chunk_size
        actual_end = start + last_break
        final_chunk = text[start:actual_end].strip()
        chunks.append({"text": final_chunk, "page": page_num})
        if actual_end >= len(text):
            break
        # Step back by the overlap, but always move forward
        start = max(actual_end - overlap_size, start + 1)
    return chunks
🧠 What Changed Here?
Instead of blindly splitting, we now:
for separator in ["\n\n", "\n", ". ", " "]:
We try:
- Paragraph
- Line
- Sentence
- Word
This is a priority-based splitting strategy.
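Running the same example sentence through the new chunker (again with illustrative chunk_size and overlap_size values) shows the difference:

sample = ("FAISS is a library for efficient similarity search. "
          "It is widely used in RAG systems.")

chunks = generate_chunks_recursive(sample, page_num=1, chunk_size=60, overlap_size=10)
print(repr(chunks[0]["text"]))
# 'FAISS is a library for efficient similarity search.'
# The first chunk now ends exactly at the sentence boundary instead of mid-sentence.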
💡 Why This Works Better
- Paragraphs stay intact
- Sentences stay intact
- Meaning stays intact
Micro Summary
- What changed: Structure-aware chunking
- Why it matters: Better embeddings → better retrieval
Step 2: Overlap Still Matters
We still keep overlap:
[Chunk 1] "RAG systems work by retrieving relevant context"
[Chunk 2] "retrieving relevant context from documents"
🧠 Why This Is Important
- Prevents context gaps
- Keeps continuity between chunks
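In code terms, the overlap in the example above is just a shared slice of the original string (the offsets here are chosen to match that example):

text = "RAG systems work by retrieving relevant context from documents"

chunk_1 = text[:47]   # "RAG systems work by retrieving relevant context"
chunk_2 = text[20:]   # "retrieving relevant context from documents"

# "retrieving relevant context" appears at the end of chunk_1 and the start of
# chunk_2, so an idea that straddles the boundary is still seen whole by one chunk.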
🔥 Step 3: Storing Full Context (Big Upgrade)
def generate_advanced_chunks(page_content, page_num):
    # Small search-optimized chunks (arguments elided here)
    search_chunks = generate_chunks_recursive(...)
    for chunk in search_chunks:
        # Prefix the page number so answers can cite it
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        # Keep the entire page alongside the chunk for later use
        chunk["full_context"] = page_content
    return search_chunks
🧠 Why This Matters
Earlier:
We only stored chunk text
Now:
We also store the entire page
💡 What This Enables
- Better answer generation
- Flexibility for later processing
- Smarter context selection
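Concretely, each stored entry now looks roughly like this (the values are made up for illustration):

{
    "text": "[Page 3] FAISS is a library for efficient similarity search.",
    "page": 3,
    "full_context": "...the entire text of page 3..."
}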
🚨 New Problem Introduced
Now that we store full pages…
We started sending too much data to the LLM.
Problem:
- Large token usage
- Slower responses
- Irrelevant information
Step 4: Context Compression
import re

def compress_context(query, full_text):
    # Split the page into sentences
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())
    # Score each sentence by how many query words it contains
    scored_sentences = []
    for s in sentences:
        score = sum(1 for word in s.lower().split() if word in query_words)
        scored_sentences.append((score, s))
    # Keep only the MAX_SENTENCES highest-scoring sentences
    top_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join([s for _, s in top_sentences])
🧠 What's Happening Here?
- Break into sentences
- Score relevance using query
- Keep only top sentences
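As a quick sanity check, here is a small illustrative run (MAX_SENTENCES and the sample page text are made up for this example):

MAX_SENTENCES = 2  # illustrative value

page = ("FAISS is a library for efficient similarity search. "
        "It was open-sourced by Meta. "
        "Embeddings turn text into vectors. "
        "FAISS builds an index over those vectors for fast lookup.")

print(compress_context("What is FAISS indexing?", page))
# FAISS is a library for efficient similarity search. FAISS builds an index over those vectors for fast lookup.
# Only the sentences that share words with the query survive.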
Visualize It
Before:
Full page → 1000+ tokens
After:
Relevant sentences → smaller context
💡 Why This Is Powerful
- Faster responses
- Better relevance
- Lower token usage
Micro Summary
- What changed: Context filtering
- Why it matters: Less noise → better answers
Retrieval Still Uses FAISS + Re-ranking
# FAISS: fast vector search over all chunk embeddings (top 10 candidates)
distances, indices = index.search(query_vector.reshape(1, -1), k=10)
# Re-ranker: re-orders those candidates by relevance to the query text
results = ranker.rerank(rerank_request)
🧠 Flow
- FAISS → fast retrieval
- Re-ranker → improves relevance
- Top results → passed to LLM
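Putting those pieces together, the retrieval step looks roughly like this. It is a sketch, not the exact app code: it assumes a flashrank-style re-ranker (whose Ranker / RerankRequest API matches the ranker.rerank(rerank_request) call above), and the k / top_n values are illustrative:

from flashrank import Ranker, RerankRequest

ranker = Ranker()  # small default re-ranking model

def retrieve(query, query_vector, index, chunks, k=10, top_n=3):
    # 1. FAISS: fast vector search for the k nearest chunks
    distances, indices = index.search(query_vector.reshape(1, -1), k=k)
    candidates = [chunks[i] for i in indices[0]]

    # 2. Re-ranker: re-score those candidates against the raw query text
    rerank_request = RerankRequest(
        query=query,
        passages=[{"id": i, "text": c["text"]} for i, c in enumerate(candidates)],
    )
    results = ranker.rerank(rerank_request)  # passages come back sorted by relevance

    # 3. Map the best-ranked ids back to the original chunk dicts (with full_context)
    return [candidates[r["id"]] for r in results[:top_n]]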
💬 Smarter Answer Generation
compressed_text = compress_context(query, res['full_context'])
Instead of raw chunks, we now send:
Focused, relevant context
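A minimal sketch of how that final step might be wired up, assuming top_results is the list of retrieved chunk dicts from the step above (the prompt wording here is made up, not the app's real template):

context_blocks = []
for res in top_results:
    # Compress each retrieved page down to its query-relevant sentences
    context_blocks.append(compress_context(query, res["full_context"]))

prompt = (
    "Answer the question using only the context below.\n\n"
    + "\n\n".join(context_blocks)
    + f"\n\nQuestion: {query}"
)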
Final System
🧠 What Changed Overall
| Feature | Part 1 | Part 2 | Part 3 |
|---|---|---|---|
| Search | NumPy | FAISS | FAISS + Rerank |
| Chunking | Basic | Basic | Recursive |
| Context | Raw | Raw | Compressed |
| Accuracy | Low | Medium | High |
🧠 Final Thought
This is where things clicked for me.
I kept thinking better models would fix my system…
But the real issue was:
Bad context in → bad answers out
Most people focus on:
- Models
- Vector DBs
But the real gains came from:
- Chunking
- Context handling
What's Next?
Now things get even more interesting.
In Part 4:
We'll move beyond basic retrieval and make the system smarter:
- Token-aware chunking
- Better query understanding
- More intelligent retrieval
💬 Let's Connect
If you're building something similar or experimenting with local LLMs, I'd love to hear your thoughts.