Have you ever wondered why a single chunking method works well for one dataset but performs poorly on another? Chunking plays a major role in how effectively a RAG system retrieves and uses information, but different data formats, like tables, code, or long paragraphs, require different approaches. Research such as the RAPTOR method also shows how the structure of chunks can impact the quality of retrieval in multi-layered documents.
In this blog, we’ll explore chunking strategies tailored to specific data types. We’ll look at how to split tables, break down code, and create hierarchical structures for long text. You’ll also learn how topic-based models like LDA group related content, and how hybrid approaches combine multiple techniques to improve flexibility. If you want a clearer way to chunk complex data, keep reading the full blog.
Table chunking is a method used to break large tables into smaller, more manageable sections. Instead of processing hundreds or thousands of rows at once, you divide the table into row-based chunks that retain important structure such as headers, column order, and row indices.
This makes it easier to summarize data, run analysis on specific portions, or feed the chunks into a RAG system without overwhelming the model. Table chunking also allows you to control chunk size and overlap, ensuring each piece contains enough context for accurate retrieval and categorization while still keeping the overall dataset organized and efficient to work with.
To see how this works in practice, here’s a simple Python example that chunks a table into smaller row groups while preserving the original headers and index positions:
import numpy as np
from collections import defaultdict

def chunk_table_numpy(data, headers, chunk_size=10, overlap=2):
    chunks = []
    num_rows = data.shape[0]
    # Step through the rows so that consecutive chunks share `overlap` rows
    for start in range(0, num_rows, chunk_size - overlap):
        end = min(start + chunk_size, num_rows)
        rows_chunk = data[start:end]
        chunk = {
            "headers": headers,
            "rows": rows_chunk,
            "row_indices": (start, end - 1)
        }
        chunks.append(chunk)
        if end == num_rows:
            break
    return chunks

# Assumes the table below has been loaded into a NumPy object array, e.g.:
# data = np.array([[1, 'AlphaPhone', 699, 'Electronics', '...'], ...], dtype=object)
headers = ['id', 'name', 'price', 'categories', 'review']
chunks = chunk_table_numpy(data, headers, chunk_size=10, overlap=2)

# Build an inverted index that maps each category to the chunks containing it
category_index = defaultdict(list)
for chunk_id, chunk in enumerate(chunks):
    rows = chunk["rows"]
    category_col_idx = headers.index('categories')
    categories_in_chunk = set(rows[:, category_col_idx])
    for category in categories_in_chunk:
        category_index[category].append(chunk_id)

# Retrieve chunks by category:
for category, chunk_ids in category_index.items():
    print(f"Category: {category}, Chunks: {chunk_ids}")

INPUT:
| id | name | price | categories | review |
| 1 | AlphaPhone | 699 | Electronics | Excellent phone with great battery life |
| 2 | BravoLaptop | 1200 | Computers | High performance and sleek design |
| 3 | CharlieWatch | 199 | Wearables | Stylish and feature-rich smartwatch |
| ... | ... | ... | ... | ... |
| 16 | PapaRouter | 130 | Networking | Strong and stable connection |
| 17 | QuebecSmartLight | 60 | Smart Home | Easy to control and bright |
| 18 | RomeoDoorbell | 250 | Smart Home | Clear video and alerts |
OUTPUT:
[{'headers': ['id', 'name', 'price', 'categories', 'review'],
'rows': array([[1, 'AlphaPhone', 699, 'Electronics',
'Excellent phone with great battery life'],
[2, 'BravoLaptop', 1200, 'Computers',
'High performance and sleek design'],
[3, 'CharlieWatch', 199, 'Wearables',
'Stylish and feature-rich smartwatch']],
dtype=object),
'row_indices': (0, 9)},
{'headers': ['id', 'name', 'price', 'categories', 'review'],
'rows': array([ [16, 'PapaRouter', 130, 'Networking',
'Strong and stable connection'],
[17, 'QuebecSmartLight', 60, 'Smart Home',
'Easy to control and bright'],
[18, 'RomeoDoorbell', 250, 'Smart Home', 'Clear video and alerts']],
dtype=object),
  'row_indices': (8, 17)}]

This example shows how chunking enables efficient filtering, categorization, and retrieval of specific subsets inside a large table, which is critical for RAG pipelines that need only the relevant table slices.
Code chunking is the process of breaking a long codebase into smaller, meaningful blocks such as functions, classes, or logical sections. Instead of analyzing a large file line by line, chunking lets you work with cleaner, well-defined pieces that are easier for both humans and models to understand. This approach improves readability, helps isolate bugs faster, and makes documentation more structured. In RAG workflows, chunking source code ensures the model retrieves only the relevant part of the logic, rather than scanning the entire file. It’s a simple but powerful way to manage large and complex codebases more efficiently.
To see how this works in action, here’s a simple Python example that splits code into chunks based on functions or classes:
import re

def chunk_python_code(code, chunk_type="function"):
    if chunk_type == "function":
        # Start of each top-level function definition
        pattern = r"(?=^def \w+\([^)]*\):)"
    elif chunk_type == "class":
        # Start of each top-level class definition
        pattern = r"(?=^class \w+.*:)"
    else:
        # Any other value splits on every top-level definition
        pattern = r"(?=^(?:def|class)\s)"
    # The zero-width lookahead split keeps each 'def'/'class' header in its chunk
    chunks = re.split(pattern, code, flags=re.MULTILINE)
    return [c.strip() for c in chunks if c.strip()]

sample_code = """
def foo():
    print('Hi')

class Bar:
    def baz(self):
        pass
"""

chunks = chunk_python_code(sample_code, chunk_type="all")
print(chunks)

Hierarchical chunking is a technique that breaks content into multiple layers of structure, making it easier to analyze or retrieve information at different levels of detail. Instead of splitting text into only one type of segment, this method creates broader chunks first, such as paragraphs, and then divides each paragraph further into smaller units like sentences. This nested approach preserves context across levels and is especially useful for documents with natural structure, such as articles, reports, documentation, or long-form text. In RAG systems, hierarchical chunking allows models to retrieve information with more precision by targeting the right paragraph or sentence depending on the query.
To understand how this works, here’s a simple Python example that splits text into a paragraph → sentence hierarchy:
import re

def hierarchical_chunking(text):
    # Split text into paragraphs (using blank lines as delimiters)
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    hierarchy = []
    for para in paragraphs:
        # Split each paragraph into sentences (on '.', '!', or '?' followed by whitespace)
        sentences = re.split(r'(?<=[.!?])\s+', para.strip())
        sentences = [s for s in sentences if s.strip()]
        hierarchy.append(sentences)
    return hierarchy

# Example input text: two paragraphs separated by a blank line
sample_text = """Artificial Intelligence is the simulation of human intelligence by machines. It includes learning, reasoning, and self-correction!

Natural Language Processing enables computers to understand and generate human language. Machine Learning is a subset of AI that uses statistical techniques. Deep Learning uses neural networks with many layers."""

chunks = hierarchical_chunking(sample_text)

# Print the paragraph -> sentence hierarchy
for p_idx, para in enumerate(chunks):
    print(f"Paragraph {p_idx+1}:")
    for s_idx, sentence in enumerate(para):
        print(f"  Sentence {s_idx+1}: {sentence}")
    print()
Topic-based chunking is a method that groups documents or text segments based on the underlying themes they discuss rather than their position or structure. Instead of splitting content by rows, paragraphs, or functions, this approach uses statistical modeling, most commonly Latent Dirichlet Allocation (LDA), to automatically identify hidden topics within large collections of text.
Each document is treated as a mixture of topics, and each topic is represented by a distribution of keywords. This makes topic-based chunking especially useful when dealing with unstructured text, large corpora, or datasets where semantic similarity matters more than formatting. In RAG systems, it helps retrieve more relevant and contextually aligned chunks for a given query.
To see how this works, here’s a simple Python example that groups sample documents into two topics using LDA:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample documents from two topics: Tech/AI and Animals
texts = [
    "Artificial intelligence is transforming industries.",
    "Deep learning models are a subset of machine learning.",
    "Cats love napping in the sun.",
    "Kittens often play with yarn balls.",
    "AI applications include robotics and automation.",
    "Dogs bark and chase cats.",
    "Machine learning enables prediction from big data.",
    "Birds sing beautifully at dawn.",
    "Natural language processing is part of AI.",
    "Many animals communicate in complex ways."
]

# Vectorize text
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Fit LDA model with 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Get topic assignments for each document
topic_assignments = lda.transform(X).argmax(axis=1)

# Group texts by topic
topic_chunks = {topic: [] for topic in range(2)}
for idx, topic in enumerate(topic_assignments):
    topic_chunks[topic].append(texts[idx])

# Print topic chunks
for topic, docs in topic_chunks.items():
    print(f"\nTopic {topic+1}:")
    for doc in docs:
        print(" -", doc)

# Optional: interpret topics by showing top words per topic
def print_top_words(model, feature_names, n_top_words=7):
    for topic_idx, topic in enumerate(model.components_):
        print(f"\nTopic {topic_idx+1} top words:")
        top_features = [feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]
        print("  ", ", ".join(top_features))

print_top_words(lda, vectorizer.get_feature_names_out())

OUTPUT:
Topic 1:
- Artificial intelligence is transforming industries.
- Deep learning models are a subset of machine learning.
- AI applications include robotics and automation.
- Machine learning enables prediction from big data.
- Natural language processing is part of AI.
Topic 2:
- Cats love napping in the sun.
- Kittens often play with yarn balls.
- Dogs bark and chase cats.
- Birds sing beautifully at dawn.
- Many animals communicate in complex ways.
Topic 1 top words:
ai, machine, learning, intelligence, natural, applications, deep
Topic 2 top words:
cats, animals, dogs, kittens, birds, sun, balls
Hybrid chunking is a flexible approach that combines two or more chunking strategies to handle complex or multi-layered documents more effectively. Instead of relying on a single method, like structural, semantic, or sliding-window chunking, this technique blends them to preserve both the document’s organization and its deeper meaning. Hybrid chunking is especially useful when dealing with mixed content, such as technical manuals, recipes, research papers, or documents with structured sections and long narrative explanations. In RAG systems, this approach allows the model to retrieve the right level of detail by leveraging structure where it exists and semantics where it matters.
For example, imagine a recipe with sections like Ingredients, Instructions, and Tips. Structural chunking would create primary chunks for each section. If the Instructions section is very long, semantic chunking can break it further into cooking phases such as preparation, mixing, and baking. A sliding window can then add light overlap to maintain context. This ensures that when someone searches for “How long to bake?”, the system retrieves the correct baking step along with useful surrounding context.
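To make this concrete, here is a minimal sketch of that recipe example in Python. It is an illustration under stated assumptions, not a standard API: the section names, the sentence-window size, and the overlap are invented for this demo, and a simple sentence-based sliding window stands in for a full semantic pass.

import re

# Toy recipe with clear structural sections (assumed format for this demo)
recipe = """Ingredients:
2 cups flour. 1 cup sugar. 3 eggs. 1 tsp vanilla.

Instructions:
Preheat the oven to 350F. Grease the baking pan. Mix the dry ingredients in a large bowl. Beat in the eggs and vanilla until smooth. Pour the batter into the pan. Bake for 35 minutes until golden. Cool before serving.

Tips:
Use room-temperature eggs. Store leftovers in an airtight container."""

def hybrid_chunk(text, max_sentences=3, overlap=1):
    chunks = []
    # Pass 1 (structural): split on section headers like "Ingredients:"
    sections = re.split(r'\n\n(?=\w+:)', text)
    for section in sections:
        title, _, body = section.partition(':')
        sentences = [s.strip() for s in re.split(r'(?<=\.)\s+', body.strip()) if s.strip()]
        if len(sentences) <= max_sentences:
            # Short sections stay whole as a single chunk
            chunks.append({"section": title, "text": " ".join(sentences)})
        else:
            # Pass 2 (sliding window): sub-chunk long sections with light overlap
            step = max_sentences - overlap
            for start in range(0, len(sentences), step):
                window = sentences[start:start + max_sentences]
                chunks.append({"section": title, "text": " ".join(window)})
                if start + max_sentences >= len(sentences):
                    break
    return chunks

for chunk in hybrid_chunk(recipe):
    print(f"[{chunk['section']}] {chunk['text']}")

With this setup, a query like "How long to bake?" can match the Instructions window that contains the baking step, and the overlap carries the neighboring steps along as context.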
Chunking isn’t just a technical step; it’s a strategic way to make different types of data easier to understand, retrieve, and use within RAG systems. In this article, we explored how chunking needs to adapt depending on the data you’re working with, whether it’s structured tables, code blocks, long-form text, topic-based collections, or documents that benefit from a hybrid approach.
Each method comes with its own strengths and trade-offs, and no single technique works perfectly for every scenario. The real value comes from choosing the right approach, or combining several, to match your specific goals. By applying these advanced chunking strategies thoughtfully, you can improve retrieval quality, strengthen contextual understanding, and build more reliable, efficient AI workflows that scale with your data.