What is Chunking?
Chunking is the process of breaking data into smaller pieces called chunks. Chunking happens before the data is fed into an embedding model, which converts each chunk into a vector (point) and stores the converted vectors in a vector database.
Why Chunking Matters in RAG
Data can contain different types of context while still relating to the same topic.
For example, a single paragraph related to the Redis database can contain multiple contexts. An embedding model like nomic-embed-text converts the entire paragraph into a single vector point and stores it in the database, so all of those contexts end up blended into one vector.
This is where chunking plays a major role. Proper chunking helps retrieve only the most relevant information and avoids unrelated content.
For example, if a chunk contains information about both Python and Java, a query about Python may also retrieve Java-related information because both topics exist in the same chunk. Effective chunking helps prevent unrelated data from being retrieved.
Even an entire document can be stored as a single chunk. However, the purpose of chunking is to split the data into smaller meaningful sections so that only relevant data is retrieved for the user query while avoiding irrelevant information.
Chunking Methods
- Fixed Chunking
Fixed chunking is the most common chunking method. In this approach, a fixed character or token limit is assigned to every chunk.
There is no single best chunking strategy for all datasets. Choosing the right chunk size usually requires a trial-and-error approach.
Disadvantage
A chunk may break in the middle of a sentence, resulting in incomplete context. This can reduce retrieval quality and may lead to irrelevant results.
Solution
One way to overcome this issue is to let the chunk continue until the current sentence ends, by checking for sentence-ending punctuation such as ".".
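The two variants above can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter; the function names and the 200-character default are assumptions for the example.

```python
def fixed_chunks(text, size=200):
    """Plain fixed chunking: split text into equal-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_aware_chunks(text, size=200):
    """Fixed-size chunks, but each chunk is extended until the next
    sentence-ending "." so sentences are not cut in the middle."""
    chunks, start = [], 0
    while start < len(text):
        end = start + size
        # Keep extending until the previous character is a period
        # (or we run out of text), so the sentence finishes cleanly.
        while end < len(text) and text[end - 1] != ".":
            end += 1
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```

With `size=30`, the sentence-aware version keeps each sentence intact instead of cutting it at character 30, at the cost of chunks that are slightly larger than the configured limit.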
- Overlapping chunking
In some cases, related information may be split across the boundary between two adjacent chunks. Each chunk's vector then captures only part of the context, and the LLM may miss relevant information during retrieval.
To overcome this issue, overlapping chunking is used.
In overlapping chunking, each chunk includes a portion of the previous chunk’s ending content. This helps the embedding model place related chunks closer together in the vector database.
The purpose of overlapping is to improve retrieval by making semantically related chunks easier to find.
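A minimal sketch of overlapping chunking, assuming character-based chunks (token-based works the same way); the `size` and `overlap` defaults are illustrative:

```python
def overlapping_chunks(text, size=100, overlap=20):
    """Fixed-size chunks where each chunk starts `overlap` characters
    before the previous chunk ended, so boundary context is repeated."""
    assert 0 <= overlap < size, "overlap must be smaller than chunk size"
    chunks, start = [], 0
    step = size - overlap  # advance by less than the chunk size
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

For example, with `size=4` and `overlap=2`, the string `"abcdefghij"` becomes `"abcd"`, `"cdef"`, `"efgh"`, and so on: the last two characters of each chunk reappear at the start of the next one.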
Disadvantage
There is a possibility that irrelevant information may also be retrieved because of the overlap.
Example
Suppose:
Paragraph 1 is related to Topic A
Paragraph 2 is related to Topic B
If overlapping is applied, a query about Topic B may also retrieve some information from Topic A because part of Paragraph 1 overlaps with Paragraph 2.
In such scenarios, storing these chunks closer together may not be necessary. This is where semantic chunking becomes useful.
- Semantic Chunking
Another scenario is when two paragraphs discuss the same topic but are not strongly related to each other. Normally, these paragraphs may still be stored nearby in the vector database. In such cases, overlapping chunking may not be necessary.
Semantic chunking solves this problem by grouping content based on meaning rather than fixed size.
In this method, each sentence is compared with the previous chunk using a similarity threshold value.
If the similarity score is below the threshold value, the sentence becomes a separate chunk.
If the similarity score is above the threshold value, it is added to the current chunk.
Libraries such as NLTK can be used for the sentence-splitting step of semantic chunking, with a separate similarity measure for the comparison. The threshold value is configurable based on the use case.
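The threshold logic can be sketched as follows. This sketch uses a regex sentence splitter and a simple word-overlap (Jaccard) similarity as a stand-in for a real similarity measure; NLTK's `sent_tokenize` could replace the regex split, and the 0.3 threshold is an assumed example value:

```python
import re

def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(text, threshold=0.3):
    """Group sentences into chunks by similarity: a sentence joins the
    current chunk if it scores above the threshold, otherwise it
    starts a new chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = [sentences[0]]
    for sent in sentences[1:]:
        if jaccard(chunks[-1], sent) >= threshold:
            chunks[-1] += " " + sent   # similar enough: extend chunk
        else:
            chunks.append(sent)        # similarity dropped: new chunk
    return chunks
```

Here the two Redis sentences in `"Redis is a database. Redis is fast. Java is a programming language."` end up in one chunk, while the Java sentence starts a new one.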
- Embedded Chunking
In embedding-based chunking, embedding models are used instead of libraries like NLTK.
This method works by calculating cosine similarity between sentences and grouping semantically similar sentences into chunks.
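The shape of this approach can be sketched as below. To keep the example self-contained, a toy bag-of-words vectorizer stands in for a real embedding model (such as nomic-embed-text); in practice `embed` would call the model's API. The function names and the 0.3 threshold are assumptions for the sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def embed(sentence, vocab):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def embedding_chunks(sentences, threshold=0.3):
    """Start a new chunk wherever the cosine similarity between
    consecutive sentence vectors falls below the threshold."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vectors = [embed(s, vocab) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) >= threshold:
            chunks[-1].append(sent)   # adjacent sentences are similar
        else:
            chunks.append([sent])     # similarity dropped: new chunk
    return [" ".join(c) for c in chunks]
```

With real embeddings the grouping is driven by meaning rather than shared words, which is where the better semantic understanding (and the extra model cost) comes from.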
Advantage
Better semantic understanding
More accurate chunk boundaries
Disadvantage
Higher computational cost
Additional embedding model usage cost
Choosing the Right Chunking Method
Choosing a chunking method always involves trade-offs. There is no single chunking strategy that works for all datasets.
The best chunking method depends on:
Dataset type
Cost
Time
Retrieval accuracy requirements
Embedding model behavior
Different applications may require different chunking strategies to achieve the best RAG performance.