
Retrieval-Augmented Generation (RAG) powers most modern LLM applications, but production systems often reveal the same problems: broken context from chunking, embedding mismatches, and important information that never gets retrieved.
PageIndex takes a different approach.
Instead of relying on embeddings and vector databases, it lets the LLM reason through a document’s structure to find relevant information. Documents are transformed into a hierarchical semantic tree, allowing the model to navigate summaries first and drill deeper only where needed.
In this article, we’ll explore how PageIndex works, why traditional RAG fails in certain scenarios, and how reasoning-based retrieval can improve accuracy without using embeddings.
What is PageIndex?
PageIndex transforms long documents into a hierarchical semantic tree, essentially a Table of Contents designed for LLMs instead of humans.
Rather than splitting documents into arbitrary chunks, PageIndex preserves the document’s natural structure. Each section becomes a node containing:
- A title
- A concise summary
- The original content
This allows the LLM to first understand the document at a high level, then navigate deeper into only the most relevant sections.
Traditional RAG systems retrieve information using embedding similarity.
PageIndex retrieves information through structured reasoning over summaries, making retrieval more interpretable, efficient, and context-aware.
The Problem With Traditional RAG
Most RAG systems do a vector search loop: split documents into chunks → embed chunks → store in a vector DB → embed the query → retrieve the “closest” chunks.
It works when the query wording matches the document. In practice, three issues show up:
- Chunking breaks context (ideas get split across chunks that never get retrieved together).
- Similarity ≠ intent (different phrasing can hide the right passage).
- Weak recall for small details (a key sentence buried in a larger section may never score high enough).
Result: the answer is in the source, but retrieval fails to surface it.
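To see why, here is a minimal sketch of that loop in Python. The chunk size, embedding model, and top-k are illustrative choices, not anything PageIndex ships; the point is that a single key sentence buried in one chunk has to out-score every other chunk to surface at all:

```python
# A minimal sketch of the traditional RAG loop described above.
# Chunk size, model name, and top-k are illustrative, not PageIndex code.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(document: str, query: str, chunk_size: int = 500, k: int = 3) -> list[str]:
    # 1. Split into fixed-size chunks: this is where context gets broken.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    # 2-3. Embed the chunks and the query.
    chunk_vecs, query_vec = embed(chunks), embed([query])[0]
    # 4. Rank by cosine similarity: "closest" is not the same as "relevant".
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```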
How PageIndex Solves This
PageIndex replaces similarity search with a structure-first retrieval loop:
- Index as a semantic tree (sections → nodes with titles, summaries, and content).
- Reason over summaries to choose the right branch for a query.
- Retrieve full text only for selected nodes, then generate the answer grounded in that content.
In short: summaries → selection → deep read → answer.
No Chunking and No Vector Database
PageIndex drops two RAG defaults (chunking and a vector database) by keeping the document intact as a structured tree.
Benefits:
- Preserves context (no arbitrary splits).
- Avoids embedding mismatch (retrieval is reasoning-driven).
- More debuggable retrieval (you can see which sections were chosen).
How Data Is Stored in PageIndex
Instead of storing text chunks with embeddings, PageIndex stores structured nodes.
A simplified version of the stored JSON looks like this:
{
  "title": "Sports Interests",
  "summary": "This section discusses hobbies and activities the person is interested in.",
  "content": "The person expressed interest in learning football and outdoor sports."
}
Multiple nodes form a tree structure, representing the entire document hierarchy.
This structure allows the LLM to quickly scan summaries and decide where to retrieve information.
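As a rough illustration, flattening the tree into an indented outline of titles and summaries gives the model a compact map to reason over. The traversal helper and file name below are ours, not part of PageIndex; the field names follow the JSON above:

```python
import json

def collect_summaries(node: dict, depth: int = 0) -> list[str]:
    """Flatten the tree into an indented list of titles and summaries."""
    lines = [f"{'  ' * depth}{node['title']}: {node['summary']}"]
    for child in node.get("nodes", []):
        lines.extend(collect_summaries(child, depth + 1))
    return lines

tree = json.load(open("document_tree.json"))
# The LLM sees only this compact outline, not the full document.
print("\n".join(collect_summaries(tree)))
```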
Implementing PageIndex: From Setup to Retrieval
Enough theory. Here is how you actually run it.
PageIndex is open source and self-hostable. The full pipeline, from indexing a PDF to querying it with reasoning-based retrieval, takes fewer than ten lines of setup.
Step 1: Install Dependencies
Clone the repo and install requirements.
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip install -r requirements.txt

The requirements are intentionally minimal. LiteLLM handles multi-provider LLM support, so you can use OpenAI, Anthropic, Gemini, or any compatible provider. There is no vector database client to install because there is no vector database.
Step 2: Set Your LLM API Key
Create a .env file in the root directory.
OPENAI_API_KEY=your_openai_key_here
PageIndex routes through LiteLLM, so the same setup works if you swap OpenAI for another provider. Just change the key and the model flag.
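For example, assuming you have an Anthropic key, the swap might look like this (the model string is illustrative; check LiteLLM's docs for supported values):

ANTHROPIC_API_KEY=your_anthropic_key_here

python run_pageindex.py --pdf_path /path/to/your/document.pdf --model claude-3-5-sonnet-20240620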
Step 3: Index Your Document
Point it at a PDF and run.
python run_pageindex.py --pdf_path /path/to/your/document.pdf

That is it. PageIndex parses the document, builds the semantic tree, and writes the structured JSON output. No embedding step. No database write. Just a file on disk that the LLM can reason over.
Optional flags let you tune the behavior:
- --model: LLM model to use (default: gpt-4o-2024-11-20)
- --toc-check-pages: pages to scan for an existing table of contents
- --max-pages-per-node: max pages per node before it splits (default: 10)
- --max-tokens-per-node: max tokens per node (default: 20000)
- --if-add-node-summary: include summaries in each node (yes/no, default: yes)

For Markdown files, use --md_path instead of --pdf_path. PageIndex uses heading levels (##, ###) to determine hierarchy, so the file needs clean formatting.
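Putting that together, indexing a Markdown file with tighter nodes might look like this (the path and flag values are illustrative):

python run_pageindex.py --md_path /path/to/notes.md --max-tokens-per-node 8000 --if-add-node-summary yes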
What the Output Looks Like
After indexing, you get a JSON file representing the full document tree. Each node captures a section with its title, summary, page range, and nested children.
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "Covers the Federal Reserve's approach to monitoring financial vulnerabilities and coordinating with international bodies.",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "Details the indicators and frameworks used to assess systemic financial risk."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "Describes collaboration with regulators and foreign central banks in 2023."
    }
  ]
}

Notice what is not here: no embedding vector, no chunk ID, no similarity score. Just structure, summaries, and page references.
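Those page references are enough to ground an answer. As a hedged sketch, once a node is selected, pulling its text is a slice over the parsed pages. The helper is ours, and it assumes start_index/end_index are 1-based, inclusive page numbers, which you should verify against the repo:

```python
def node_text(node: dict, pages: list[str]) -> str:
    """Pull the raw text for a selected node from its page range."""
    # Assumes start_index/end_index are 1-based, inclusive page numbers.
    return "\n".join(pages[node["start_index"] - 1 : node["end_index"]])

# e.g. node_text(selected_node, pdf_pages) feeds the generation step
```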
Step 4: Query the Document (Agentic RAG)
Install the optional dependency:
pip install openai-agents

Run the demo:
python examples/agentic_vectorless_rag_demo.py

In a simple experiment, a 500+ line document contained a single line mentioning football as an interest.

The query was:
“Generate tags listing things they were interested in learning about or doing.”
The document was indexed in two ways:
- Traditional RAG using embeddings
- PageIndex reasoning-based retrieval
Even when the entire document was embedded without chunking, the traditional RAG system failed to identify football as a relevant interest.
However, the PageIndex system correctly retrieved the node containing that information.
Why?
Because the LLM reasoned through the node summaries first, rather than relying purely on embedding similarity.
How the Retrieval Loop Actually Works
When a query comes in, the LLM first scans only the top-level node summaries, not the full document. It picks the relevant branches, drills deeper through sub-node summaries, and reads the full content only once it reaches the right section.
Tree search, not similarity search. The model navigates a document the way a person would: scan the headings, go deep only where it matters.
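A hedged sketch of that loop, assuming the node JSON shown earlier. The prompt, helper names, and selection format are ours; the agentic demo in the repo implements its own version:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def pick_children(query: str, node: dict) -> list[dict]:
    """Ask the LLM which child sections are worth exploring for this query."""
    menu = "\n".join(
        f"{c['node_id']}: {c['title']} - {c['summary']}" for c in node["nodes"]
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Query: {query}\nSections:\n{menu}\n"
                       "Reply with only the relevant node_ids, comma-separated.",
        }],
    )
    chosen = {s.strip() for s in resp.choices[0].message.content.split(",")}
    return [c for c in node["nodes"] if c["node_id"] in chosen]

def tree_search(query: str, node: dict) -> list[dict]:
    """Descend the tree; full content is read only at the selected leaves."""
    if not node.get("nodes"):  # leaf node: return it for a deep read
        return [node]
    hits = []
    for child in pick_children(query, node):
        hits.extend(tree_search(query, child))
    return hits

tree = json.load(open("document_tree.json"))
relevant = tree_search("What interests did they mention?", tree)
```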
Self-Hosted vs Cloud
The open-source version uses standard PDF parsing, which works well for clean, text-based documents. For complex PDFs with tables, multi-column layouts, or scanned pages, the PageIndex Cloud API uses an enhanced OCR and tree-building pipeline that produces higher-quality node structures.
Both expose the same interface. If you are prototyping, start local. If you are moving to production with difficult documents, the cloud API is a drop-in upgrade.
Accuracy vs Retrieval Speed
PageIndex is more precise, but slower by design. Traditional RAG is ~2.6× faster, because PageIndex runs three steps instead of one: reasoning over summaries, selecting the right nodes, then reading the full content to generate the answer.
Each step adds latency. Each step is also what makes it accurate. If missing one detail has real consequences, the extra time is worth it.
When PageIndex Works Best
PageIndex works best when the right answer matters more than a fast one.
- Long structured documents: research papers, legal text, and technical docs already have a natural hierarchy. PageIndex uses that structure instead of fighting it.
- Complex queries: when intent matters more than keyword matching, reasoning-based retrieval wins over embedding similarity.
- Knowledge exploration: open-ended questions feel natural here. The LLM orients at the summary level first, then goes deep, just like a person would.
When Traditional RAG May Still Be Better
PageIndex isn't for every use case. Vector-based RAG still wins when:
- Latency is critical - sub-second responses leave no room for multiple reasoning steps.
- Scale is massive - searching millions of documents is where vector databases are purpose-built to excel.
- Real-time search - live search, autocomplete, and high-throughput pipelines need vector speed.
Conclusion
RAG isn't broken, but the assumption that embeddings and vector databases are the only way to do retrieval is worth questioning.
PageIndex shows that when you let the LLM reason about where to look, not just what to retrieve, accuracy improves dramatically. The trade-off is speed, and that's a fair one to make in domains where a missed detail or a wrong answer has real consequences.
As LLMs get faster and cheaper, the case for reasoning-based retrieval only gets stronger. PageIndex is a working proof that this approach isn't just theoretical: it's practical, deployable, and already outperforming traditional RAG where it counts.
If your current pipeline is fast but not accurate enough, it might be time to rethink how retrieval works. → Try PageIndex on GitHub
Frequently Asked Questions
1. Does PageIndex completely remove embeddings?
Yes. PageIndex doesn't use embedding models or vector similarity at any stage. The LLM reasons directly over node summaries to find what's relevant, no vectors involved.
2. How is PageIndex different from traditional RAG?
Traditional RAG converts your document into chunks, embeds them, and retrieves by similarity score. PageIndex skips all of that. Instead, the LLM reads structured summaries and reasons about which sections actually answer the question. It's the difference between pattern matching and understanding.
3. Why does PageIndex take more time?
Because it thinks before it retrieves. Traditional RAG runs one similarity search and returns results. PageIndex reasons over summaries, selects the right nodes, then reads the content, three steps instead of one. That adds latency, but also accuracy.
4. Is PageIndex scalable for large datasets?
It works best with structured, long-form documents rather than massive document collections. If you're indexing millions of files and need real-time retrieval, a vector database is still the right tool. PageIndex is built for depth, not scale.
5. When should I use PageIndex over traditional RAG?
When being wrong is not an option. If your use case involves complex documents, nuanced queries, or domains where a missed detail has real consequences (legal, financial, medical, technical), PageIndex is worth the extra latency.