
Have you ever wondered why a chunking strategy that works perfectly on one dataset completely falls apart on another? I ran into this exact problem while working with RAG systems that had to handle everything from structured tables to messy long-form text and source code. The issue wasn’t the retrieval itself; it was how the data was being chunked before retrieval even began.
Chunking plays a critical role in how a RAG system understands and retrieves information, but I’ve learned that no single method works universally. Tables, code, and narrative text behave very differently once they enter a retrieval pipeline. Research such as the RAPTOR method reinforced what I was already seeing in practice: chunk structure directly affects retrieval quality, especially in layered or complex documents.
In this blog, I’m breaking down chunking strategies based on the type of data they work best with. I’ll walk through table chunking, code chunking, hierarchical approaches for long text, topic-based grouping with LDA, and hybrid methods that combine multiple techniques. This isn’t about theory alone; it’s about choosing chunking methods that actually hold up when your data gets complex.
When I started testing different RAG pipelines, it became clear that chunking decisions couldn’t be treated as an afterthought. Each data type introduced its own failure modes, and fixing retrieval quality often meant rethinking how the data was split in the first place. The following chunking strategies are the ones I’ve found most effective when working with complex, mixed data.

Table chunking is the approach I rely on when large tables start becoming a bottleneck in retrieval. Instead of forcing a model to reason over hundreds or thousands of rows at once, the table is split into smaller, row-based chunks while preserving critical structure like headers, column order, and row indices.
This makes it easier to summarize data, run analysis on specific portions, or feed the chunks into a RAG system without overwhelming the model. Table chunking also allows you to control chunk size and overlap, ensuring each piece contains enough context for accurate retrieval and categorization while still keeping the overall dataset organized and efficient to work with.
To see how this works in practice, here’s a simple Python example that chunks a table into smaller row groups while preserving the original headers and index positions:
def chunk_table_numpy(data, headers, chunk_size=10, overlap=2):
    """Split a 2-D NumPy table into row-based chunks, preserving headers
    and each chunk's original row index range."""
    chunks = []
    num_rows = data.shape[0]
    # Step by chunk_size - overlap so consecutive chunks can share rows
    for start in range(0, num_rows, chunk_size - overlap):
        end = min(start + chunk_size, num_rows)
        rows_chunk = data[start:end]
        chunk = {
            "headers": headers,
            "rows": rows_chunk,
            "row_indices": (start, end - 1)
        }
        chunks.append(chunk)
        if end == num_rows:
            break
    return chunks
chunks = chunk_table_numpy(data, headers, chunk_size=3, overlap=0)

Here, data is a NumPy object array holding the table rows shown below, and headers is the matching list of column names (['id', 'name', 'price', 'categories', 'review']).

INPUT:

| ID | Product Name     | Price | Category    | Description                              |
| 1  | AlphaPhone       | 699   | Electronics | Excellent phone with great battery life  |
| 2  | BravoLaptop      | 1200  | Computers   | High performance and sleek design        |
| 3  | CharlieWatch     | 199   | Wearables   | Stylish and feature-rich smartwatch      |
| ...| ...              | ...   | ...         | ...                                      |
| 16 | PapaRouter       | 130   | Networking  | Strong and stable connection             |
| 17 | QuebecSmartLight | 60    | Smart Home  | Easy to control and bright               |
| 18 | RomeoDoorbell    | 250   | Smart Home  | Clear video and alerts                   |

OUTPUT (first and last of the six chunks shown):

[{'headers': ['id', 'name', 'price', 'categories', 'review'],
  'rows': array([[1, 'AlphaPhone', 699, 'Electronics',
                  'Excellent phone with great battery life'],
                 [2, 'BravoLaptop', 1200, 'Computers',
                  'High performance and sleek design'],
                 [3, 'CharlieWatch', 199, 'Wearables',
                  'Stylish and feature-rich smartwatch']],
                dtype=object),
  'row_indices': (0, 2)},
 ...,
 {'headers': ['id', 'name', 'price', 'categories', 'review'],
  'rows': array([[16, 'PapaRouter', 130, 'Networking',
                  'Strong and stable connection'],
                 [17, 'QuebecSmartLight', 60, 'Smart Home',
                  'Easy to control and bright'],
                 [18, 'RomeoDoorbell', 250, 'Smart Home',
                  'Clear video and alerts']],
                dtype=object),
  'row_indices': (15, 17)}]

With the table chunked, a small inverted index maps each category to the chunks that contain it:

from collections import defaultdict

# Map each category value to the ids of the chunks containing it
category_index = defaultdict(list)
category_col_idx = headers.index('categories')
for chunk_id, chunk in enumerate(chunks):
    rows = chunk["rows"]
    categories_in_chunk = set(rows[:, category_col_idx])
    for category in categories_in_chunk:
        category_index[category].append(chunk_id)

Retrieve chunks by category:

for category, chunk_ids in category_index.items():
    print(f"Category: {category}, Chunks: {chunk_ids}")

This example shows how chunking enables efficient filtering, categorization, and retrieval of specific subsets inside a large table, which is critical for RAG pipelines that need only the relevant table slices.
Code chunking became essential for me once I started using RAG systems to retrieve logic from real codebases instead of toy examples. Rather than treating a file as a single block of text, code chunking breaks it into meaningful units such as functions, classes, or logical sections that the model can reason about more accurately. Instead of analyzing a large file line by line, chunking lets you work with cleaner, well-defined pieces that are easier for both humans and models to understand. This approach improves readability, helps isolate bugs faster, and makes documentation more structured. In RAG workflows, chunking source code ensures the model retrieves only the relevant part of the logic, rather than scanning the entire file. It’s a simple but powerful way to manage large and complex codebases more efficiently.
To see how this works in action, here’s a simple Python example that splits code into chunks based on functions or classes:
import re

def chunk_python_code(code, chunk_type="function"):
    """Split Python source into top-level units using a simple regex
    (a sketch; robust pipelines often use the ast module instead)."""
    # Every top-level def or class starts a new chunk
    boundaries = [m.start() for m in
                  re.finditer(r"^(?:def|class)\s+\w+", code, flags=re.MULTILINE)]
    if not boundaries:
        return [code]
    boundaries.append(len(code))
    units = [code[boundaries[i]:boundaries[i + 1]].rstrip()
             for i in range(len(boundaries) - 1)]
    # Keep only the kind of unit the caller asked for
    if chunk_type == "function":
        return [u for u in units if u.startswith("def ")]
    if chunk_type == "class":
        return [u for u in units if u.startswith("class ")]
    return units

sample_code = """
def foo():
    print('Hi')

class Bar:
    def baz(self):
        pass
"""

print(chunk_python_code(sample_code, chunk_type="function"))
print(chunk_python_code(sample_code, chunk_type="class"))
Hierarchical chunking is the strategy I turn to when flat chunking starts losing context. Instead of splitting text in a single pass, this approach breaks content into multiple layers of structure, allowing retrieval at different levels of detail depending on what the query actually needs. It creates broader chunks first, such as paragraphs, and then divides each paragraph into smaller units like sentences. This nested approach preserves context across levels and is especially useful for documents with natural structure, such as articles, reports, documentation, or long-form text. In RAG systems, hierarchical chunking allows models to retrieve information with more precision by targeting the right paragraph or sentence depending on the query.
To understand how this works, here’s a simple Python example that splits text into a paragraph → sentence hierarchy:
import re

def hierarchical_chunking(text):
    # Split text into paragraphs (using double newlines as delimiters)
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    hierarchy = []
    for para in paragraphs:
        # Split each paragraph into sentences (period, exclamation, question mark)
        sentences = re.split(r'(?<=[.!?])\s+', para.strip())
        sentences = [s for s in sentences if s.strip()]
        hierarchy.append(sentences)
    return hierarchy

# Example input text
sample_text = """Artificial Intelligence is the simulation of human intelligence by machines. It includes learning, reasoning, and self-correction!

Natural Language Processing enables computers to understand and generate human language. Machine Learning is a subset of AI that uses statistical techniques. Deep Learning uses neural networks with many layers."""

chunks = hierarchical_chunking(sample_text)

# Print the paragraph -> sentence hierarchy
for p_idx, para in enumerate(chunks):
    print(f"Paragraph {p_idx+1}:")
    for s_idx, sentence in enumerate(para):
        print(f"  Sentence {s_idx+1}: {sentence}")
    print()

Topic-based chunking is particularly useful in situations where structure alone isn’t enough. I’ve used this approach when working with large, unstructured text collections where semantic similarity matters more than where the content appears in a document. Instead of splitting content by rows, paragraphs, or functions, this approach uses statistical topic modeling, most commonly Latent Dirichlet Allocation (LDA), to automatically identify hidden topics within large collections of text.
Each document is treated as a mixture of topics, and each topic is represented by a distribution of keywords. This makes topic-based chunking especially useful when dealing with unstructured text, large corpora, or datasets where semantic similarity matters more than formatting. In RAG systems, it helps retrieve more relevant and contextually aligned chunks for a given query.
To see how this works, here’s a simple Python example that groups sample documents into two topics using LDA:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Sample documents from two topics: Tech/AI and Animals
texts = [
"Artificial intelligence is transforming industries.",
"Deep learning models are a subset of machine learning.",
"Cats love napping in the sun.",
"Kittens often play with yarn balls.",
"AI applications include robotics and automation.",
"Dogs bark and chase cats.",
"Machine learning enables prediction from big data.",
"Birds sing beautifully at dawn.",
"Natural language processing is part of AI.",
"Many animals communicate in complex ways."
]
# Vectorize text
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
# Fit LDA model with 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)
# Get topic assignments for each document
topic_assignments = lda.transform(X).argmax(axis=1)
# Group texts by topic
topic_chunks = {topic: [] for topic in range(2)}
for idx, topic in enumerate(topic_assignments):
    topic_chunks[topic].append(texts[idx])

# Print topic chunks
for topic, docs in topic_chunks.items():
    print(f"\nTopic {topic+1}:")
    for doc in docs:
        print("  -", doc)

# Optional: Interpret topics by showing top words per topic
def print_top_words(model, feature_names, n_top_words=7):
    for topic_idx, topic in enumerate(model.components_):
        print(f"\nTopic {topic_idx+1} top words:")
        top_features = [feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]
        print("  ", ", ".join(top_features))

print_top_words(lda, vectorizer.get_feature_names_out())

OUTPUT:
Topic 1:
- Artificial intelligence is transforming industries.
- Deep learning models are a subset of machine learning.
- AI applications include robotics and automation.
- Machine learning enables prediction from big data.
- Natural language processing is part of AI.
Topic 2:
- Cats love napping in the sun.
- Kittens often play with yarn balls.
- Dogs bark and chase cats.
- Birds sing beautifully at dawn.
- Many animals communicate in complex ways.
Topic 1 top words:
  ai, learning, machine, intelligence, natural, applications, deep

Topic 2 top words:
  cats, animals, dogs, kittens, birds, sun, balls

Hybrid chunking is what I end up using most often in real-world RAG systems. When documents mix structure, narrative text, and procedural steps, relying on a single chunking method usually isn’t enough. Combining strategies keeps the system flexible without sacrificing context: instead of committing to one method, like structural, semantic, or sliding-window chunking, this technique blends them to preserve both the document’s organization and its deeper meaning. Hybrid chunking is especially useful when dealing with mixed content, such as technical manuals, recipes, research papers, or documents with structured sections and long narrative explanations. In RAG systems, this approach allows the model to retrieve the right level of detail by leveraging structure where it exists and semantics where it matters.
For example, imagine a recipe with sections like Ingredients, Instructions, and Tips. Structural chunking would create primary chunks for each section. If the Instructions section is very long, semantic chunking can break it further into cooking phases such as preparation, mixing, and baking. A sliding window can then add light overlap to maintain context. This ensures that when someone searches for “How long to bake?”, the system retrieves the correct baking step along with useful surrounding context.
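To make that flow concrete, here’s a minimal sketch of the recipe example. It assumes sections are marked by headings ending in a colon; the helper name, heading format, and window sizes are illustrative, not a fixed recipe-parsing API:

import re

def hybrid_chunk(text, max_sentences=3, overlap=1):
    """A minimal hybrid chunking sketch: structural split by headings,
    sentence-level split inside long sections, light sliding-window overlap.
    Assumes each section starts with a line like 'Ingredients:'."""
    chunks = []
    # Structural pass: split the document into (heading, body) sections
    parts = re.split(r"^([A-Za-z][\w ]*):\s*$", text, flags=re.MULTILINE)
    sections = [(parts[i], parts[i + 1].strip())
                for i in range(1, len(parts) - 1, 2)]
    for heading, body in sections:
        # Semantic-style pass: break the section body into sentences
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
        if len(sentences) <= max_sentences:
            # Short section: keep it as one structural chunk
            chunks.append({"section": heading, "text": " ".join(sentences)})
            continue
        # Sliding-window pass: overlap windows so context carries over
        step = max_sentences - overlap
        for start in range(0, len(sentences), step):
            window = sentences[start:start + max_sentences]
            chunks.append({"section": heading, "text": " ".join(window)})
            if start + max_sentences >= len(sentences):
                break
    return chunks

recipe = """Ingredients:
Flour, sugar, eggs, and butter.

Instructions:
Preheat the oven to 180C. Mix the dry ingredients. Beat in the eggs. Fold in the butter. Pour into a tin. Bake for 35 minutes.

Tips:
Check doneness with a skewer before removing."""

for chunk in hybrid_chunk(recipe):
    print(f"[{chunk['section']}] {chunk['text']}")

With this split, a query like “How long to bake?” lands on the final Instructions window, which carries the baking step along with the preceding pouring step for context.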
Chunking isn’t just a technical step; it’s one of the biggest levers for improving how a RAG system behaves in practice. What I’ve learned through experimentation is that chunking has to adapt to the data itself, whether that data comes in the form of tables, code, long-form text, or loosely structured documents.
No single strategy works everywhere, and that’s okay. The real improvement comes from understanding why a particular chunking method fits a specific data type and where its limits are. By applying these advanced chunking strategies intentionally, you can improve retrieval accuracy, preserve context more reliably, and build RAG systems that scale as your data grows more complex.