Indumathi R
Day 4 - Chunking continued - RAG

Semantic Chunking
Consider two paragraphs, A and B, both about strings in Python: paragraph A focuses on typecasting and paragraph B on accessing characters. The two paragraphs are not closely related, but if we chunk with overlap, parts of A and B end up in the same chunk, forcing them together. We do not want to forcefully bring the two paragraphs together. To solve this problem, semantic chunking can be used.

Semantic chunking keeps adding sentences to a chunk as long as they remain relevant to it. The first sentence starts a chunk, since there is nothing to compare it against. The second sentence is then compared with the previous one; if the relevancy score is greater than a threshold (say 0.75), it is added to the same chunk. Each subsequent sentence is compared with the previous sentence in the same way: if the score falls below the threshold, it starts a new chunk; otherwise it joins the current one. Sentence splitting for semantic chunking can be done with the nltk package.
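The loop above can be sketched as follows. This is a minimal illustration, not a production implementation: the `similarity` function here is a toy word-overlap score standing in for a real relevancy measure, and the regex sentence splitter stands in for `nltk.sent_tokenize`.

```python
import re

def similarity(a: str, b: str) -> float:
    # Toy stand-in for a real relevancy score (e.g. embedding cosine
    # similarity): Jaccard overlap of the two sentences' word sets.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunk(text: str, threshold: float = 0.75) -> list[list[str]]:
    # Naive sentence split on end-of-sentence punctuation;
    # nltk.sent_tokenize is the usual choice in practice.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[list[str]] = []
    for sent in sentences:
        if chunks and similarity(chunks[-1][-1], sent) >= threshold:
            chunks[-1].append(sent)   # relevant enough: extend current chunk
        else:
            chunks.append([sent])     # not relevant: start a new chunk
    return chunks
```

With a real embedding-based score, a threshold around 0.75 is reasonable; the toy word-overlap score here needs a lower threshold (e.g. 0.5) to group even closely related sentences.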

Embedding Chunking
To measure the relationship between the previous and current sentence, an LLM (embedding model) is used: it produces a number that quantifies how closely the two sentences are related.
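The usual way to turn two sentences into such a number is cosine similarity between their embedding vectors. The sketch below uses a toy bag-of-words "embedding" as a stand-in for a real model (e.g. a sentence-transformers model or an LLM embedding endpoint); only the cosine formula carries over unchanged.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: sparse
    # bag-of-words counts keyed by lowercased word.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine of the angle between the two vectors:
    # 1.0 = same direction, 0.0 = nothing in common.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two sentences about typecasting strings would score high against each other and near zero against a sentence about accessing characters, which is exactly the signal the chunker thresholds on.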

There is no single best chunking methodology; the right choice depends on the dataset. Trial and error helps determine which methodology suits our use case.
