
Large language models hallucinate. They also have knowledge cutoffs and no access to your private data. Retrieval-Augmented Generation (RAG) addresses both problems by grounding model responses in documents you control.
LlamaIndex has become the go-to framework for building RAG pipelines, offering everything from basic document ingestion to advanced multi-step retrieval. This guide walks through implementation from setup to production-ready patterns.
What is RAG and Why Does It Work?
RAG (Retrieval-Augmented Generation) is a technique that grounds LLM responses in external knowledge by retrieving relevant context at query time before generating an answer.
Instead of relying purely on training data, the model answers based on documents you control, which means fewer hallucinations, up-to-date responses, and full auditability over what the model is working from.
Why LlamaIndex?
LlamaIndex is an open-source framework for building production-grade RAG pipelines. It handles the full data stack: document parsing, chunking, embedding, indexing, retrieval, and query orchestration.
Its modular architecture, overhauled in v0.10, lets you swap any component including the LLM, embedding model, or vector store without rewriting your pipeline.
Installation
LlamaIndex v0.10+ uses a modular package structure. Install only what you need:
```bash
pip install llama-index-core
pip install llama-index-llms-openai
pip install llama-index-embeddings-openai
```

For local models:

```bash
pip install llama-index-llms-ollama
pip install llama-index-embeddings-huggingface
```

Key Components You Need to Understand First
Documents are the raw inputs: PDFs, text files, web pages, database records. LlamaIndex ingests them through readers.
Nodes are chunks of documents with metadata and relationships preserved. Chunking strategy directly affects retrieval quality.
Index stores embedded nodes for fast similarity search. VectorStoreIndex is the default.
Retriever fetches the most relevant nodes for a given query.
Query Engine wraps the retriever and LLM into a single interface that takes a question and returns an answer.
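To make the document-to-node step concrete, here is a minimal sketch (the ./data folder and chunk sizes are placeholder choices) of how a node parser turns documents into nodes before indexing:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Documents: raw files loaded through a reader
documents = SimpleDirectoryReader("./data").load_data()

# Nodes: chunks with metadata and relationships preserved
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

print(len(nodes), nodes[0].metadata)
```

In the full pipeline below, VectorStoreIndex.from_documents applies a default splitter for you, so this step stays implicit.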
Building Your First RAG Pipeline
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure globally
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the key findings?")
print(response)
```

How to Persist Your Index
Re-embedding documents on every run is expensive. Persist the index to disk or a vector store:
```python
# Save
index.storage_context.persist(persist_dir="./storage")

# Load
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```

For production, use a dedicated vector store like Pinecone, Qdrant, or Chroma:
```python
# Requires: pip install llama-index-vector-stores-qdrant
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

Advanced Retrieval Techniques
Basic top-k retrieval works for simple use cases. These patterns handle harder ones.
1. Sentence Window Retrieval
Embeds individual sentences but retrieves surrounding context. Improves precision without losing context:
```python
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

node_parser = SentenceWindowNodeParser.from_defaults(window_size=3)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```

2. Auto-Merging Retrieval
Chunks documents hierarchically. If enough child chunks are retrieved, merges them back into the parent for richer context:
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# The docstore must hold the full hierarchy so parent chunks can be merged back in
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(base_retriever, storage_context)
```

3. Reranking
Initial retrieval casts a wide net. A reranker re-scores results for relevance before passing to the LLM:
```python
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3,
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)
```

4. Query Rewriting with HyDE
HyDE (Hypothetical Document Embeddings) generates a hypothetical answer first, then uses that to retrieve. Significantly improves recall for complex queries:
```python
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

base_query_engine = index.as_query_engine(similarity_top_k=5)
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde)
```

Document Parsing with LlamaParse
For complex PDFs with tables, charts, or non-standard layouts, LlamaIndex's native reader often misses structure. LlamaParse handles this significantly better:
```bash
pip install llama-parse
```

```python
from llama_parse import LlamaParse

# LlamaParse is a hosted service; it expects a LlamaCloud API key (LLAMA_CLOUD_API_KEY)
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("./report.pdf")
```

LlamaParse preserves table structure and layout context that generic PDF readers destroy, which matters when downstream retrieval depends on that structure.
Multi-Document RAG with Routing
When your knowledge base spans multiple domains or document types, route queries to the right index rather than searching everything:
```python
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

finance_tool = QueryEngineTool.from_defaults(
    query_engine=finance_engine,
    description="Financial reports and earnings data",
)
legal_tool = QueryEngineTool.from_defaults(
    query_engine=legal_engine,
    description="Contracts and legal documentation",
)

router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[finance_tool, legal_tool],
)
```

Evaluation
Do not ship a RAG pipeline without measuring it. LlamaIndex integrates with RAGAS, a widely used evaluation framework for RAG:
```bash
pip install ragas
```

Key metrics to track:
- Faithfulness: Does the answer stay grounded in retrieved context?
- Answer relevancy: Does the answer actually address the question?
- Context precision: Are the retrieved chunks relevant to the query?
- Context recall: Did retrieval surface everything needed to answer?
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```

Common Mistakes That Break RAG in Production
1. Chunk size mismatches
Chunks that are too small lose context. Chunks that are too large dilute relevance. Start with 512 tokens and 50-token overlap, then tune based on your retrieval metrics, not intuition.
2. Skipping reranking
Top-k retrieval by embedding similarity is a starting point, not a final answer. A reranker consistently improves response quality with minimal added latency.
3. Not persisting indexes
Re-embedding on every run is slow and expensive. Always persist to a vector store in any environment beyond a local prototype.
4. Ignoring metadata
LlamaIndex supports metadata filtering at retrieval time. Tagging documents with source, date, or category and filtering on those fields dramatically improves precision for structured knowledge bases.
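A minimal sketch of metadata filtering, assuming your documents were ingested with a category field in their metadata (the field name and value are made up for illustration):

```python
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Only retrieve chunks whose source document was tagged category="finance"
filters = MetadataFilters(filters=[ExactMatchFilter(key="category", value="finance")])

query_engine = index.as_query_engine(similarity_top_k=5, filters=filters)
response = query_engine.query("What were the key drivers of Q3 revenue?")
```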
5. No evaluation loop
Tuning chunk size, top-k, and retrieval strategy without measuring the effect is guesswork. RAGAS scores give you a feedback loop that makes iteration meaningful.
Conclusion
RAG with LlamaIndex gives you a practical path from raw documents to reliable, grounded LLM responses. The basic pipeline gets you running in minutes. The advanced techniques (reranking, sentence windows, HyDE, and routing) close the gap between a prototype and something production-worthy.
Start simple, measure with RAGAS, and layer in complexity only where your evaluation scores justify it. The framework handles the infrastructure. Your job is knowing your data and knowing what good retrieval looks like for your use case.
Frequently Asked Questions
1. What is the difference between LlamaIndex and LangChain for RAG?
Both can build RAG pipelines. LlamaIndex is more focused on data ingestion, indexing, and retrieval. LangChain is more focused on chaining LLM calls and agent workflows. For document-heavy RAG, LlamaIndex's retrieval primitives are more mature.
2. Which vector store should I use?
Chroma for local development. Qdrant or Weaviate for self-hosted production. Pinecone for managed, serverless production. All integrate with LlamaIndex through official connectors.
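For instance, a local Chroma setup for development might look like this (assuming pip install llama-index-vector-stores-chroma chromadb; the path and collection name are placeholders):

```python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local collection; no server process needed for development
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```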
3. How do I handle very large document collections?
Use a persistent vector store, batch your ingestion, and consider async indexing. LlamaIndex supports ingestion pipelines with built-in deduplication, so you only embed new or changed documents.
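A sketch of that pattern, assuming an existing vector_store and the same OpenAI embedding model used earlier (details will vary with your setup):

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding

# The docstore tracks document hashes, so unchanged documents are skipped on re-runs
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding(model="text-embedding-3-small"),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)
nodes = pipeline.run(documents=documents, show_progress=True)
```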
4. Does LlamaIndex support local models?
Yes. Ollama, vLLM, LM Studio, and HuggingFace models all work through their respective integration packages. Swap Settings.llm and Settings.embed_model and the rest of the pipeline is unchanged.
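For example, with Ollama for generation and a HuggingFace embedding model (the model names here are common defaults, not requirements):

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local LLM served by Ollama, local embeddings from HuggingFace
Settings.llm = Ollama(model="llama3", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```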
5. When should I use agentic RAG instead of a basic pipeline?
When a single retrieval step is not enough. If answering a question requires multiple lookups, comparisons across documents, or dynamic tool use, LlamaIndex's Workflows API handles agentic patterns with full control over each step.
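As a rough sketch of the shape of a workflow (not a full agentic pipeline), assuming an existing query_engine; the event and class names are made up for illustration:

```python
from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step

class RewrittenQuery(Event):
    query: str

class TwoStepRAG(Workflow):
    @step
    async def rewrite(self, ev: StartEvent) -> RewrittenQuery:
        # First step: reformulate the user's question (trivial rewrite for illustration)
        return RewrittenQuery(query=f"{ev.query} Cite the source documents.")

    @step
    async def answer(self, ev: RewrittenQuery) -> StopEvent:
        # Second step: run retrieval and synthesis with an existing query engine
        response = query_engine.query(ev.query)
        return StopEvent(result=str(response))

# Usage (inside an async context):
# result = await TwoStepRAG(timeout=60).run(query="Compare findings across both reports")
```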



