Skip to main content

Command Palette

Search for a command to run...

Mastering Hierarchical Indexing for Agentic RAG

Published
6 min readView as Markdown
Mastering Hierarchical Indexing for Agentic RAG
R

Learning Data Science and Sharing the journey through Hashnode.

Introduction

In Part 1, we gave our RAG system a brain. We built an agent that can analyze queries, ask for clarification, and fix its own mistakes using LangGraph.

But even the smartest agent is useless if it’s reading from a messy library.

Today, we are fixing the "Memory" of our system. We are going to solve the biggest trade-off in Retrieval-Augmented Generation: The Chunking Dilemma.

The Conflict: Precision vs. Context

When you build a standard RAG pipeline, you have to make a difficult choice during the data preparation phase: How large should your chunks be?

This decision usually leads to one of two failures:

  1. Small Chunks (e.g., 200 tokens):

    • Pro: Excellent for vector search. The embedding is dense and specific, so "Error 503" matches "Error 503" perfectly.

    • Con: Zero context. The LLM retrieves a sentence saying "Error 503 is a timeout," but misses the paragraph above it that explains how to fix it.

  2. Large Chunks (e.g., 1000+ tokens):

    • Pro: Great context. The LLM gets the full explanation.

    • Con: Terrible search. A large chunk contains so many different topics that the vector embedding gets "diluted." It becomes a jack-of-all-trades, master of none.

The Solution: Hierarchical Indexing

We don't have to choose. We can have the best of both worlds by separating what we search from what we read.

This strategy is called Hierarchical Indexing (or Parent-Child Retrieval).

  • Child Chunks: We split our documents into tiny, highly specific snippets. We embed these into our Vector Database for laser-precise search.

  • Parent Chunks: We keep the larger sections (like full pages or chapters) in a separate store.

When the user asks a question, our agent searches the Children to find the needle in the haystack. But before generating an answer, it grabs the Parent to understand the whole bale of hay.

In this tutorial, we will implement this architecture using the agentic-rag-for-dummies repository, using Markdown Headers to create intelligent boundaries for our data.

Technical Deep Dive: Markdown Header Splitting

Most RAG tutorials use a "blind" approach to splitting documents. They take a PDF, convert it to plain text, and chop it every 500 characters.

This destroys the semantic structure of your document. A sentence gets cut in half; a table gets separated from its header.

The agentic-rag-for-dummies repository takes a smarter approach. It first converts PDFs into Markdown.

Why Markdown? Because Markdown has built-in hierarchy (# H1, ## H2, ### H3). By splitting text based on these headers, we ensure that every chunk is a logical section of the document, not just a random string of words.

Code Walkthrough: Implementing the Splitter

Let's look at how the repository implements this. We don't just split the text; we create a relationship between the specific details (Children) and the broad context (Parents).

Step 1: The "Parent" Split (Markdown Headers)

First, we use Markdown Headers to respect the document's structure.

The code scans the file and breaks it apart whenever it sees H1, H2, or H3. These become our Parent Chunks—comprehensive sections of text that cover a full topic.

from langchain_text_splitters import MarkdownHeaderTextSplitter

# 1. Define the hierarchy
headers_to_split_on = [("#", "H1"), ("##", "H2"), ("###", "H3")]
parent_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# 2. Create Parent Chunks (Context)
parent_chunks = parent_splitter.split_text(md_text)

Next, we iterate through every Parent Chunk and split it further into small Child Chunks (e.g., 500 characters).

Crucially, we tag every child with the parent_id. This is the link that ties the system together.

# (Simplified from the repo's index_documents function)
for i, parent_chunk in enumerate(parent_chunks):
    # Generate a unique ID for the parent
    parent_id = f"{doc_name}_parent_{i}"

    # Tag the parent
    parent_chunk.metadata["parent_id"] = parent_id

    # Split parent into smaller children for vector search
    children = child_splitter.split_documents([parent_chunk])

    for child in children:
        # THE MAGIC LINK: Attach parent_id to child metadata
        child.metadata["parent_id"] = parent_id
        all_child_chunks.append(child)

Code Walkthrough: The Hybrid Storage & Retrieval Tools

Here is where this repository differs from standard tutorials. It uses a Hybrid Storage strategy to keep costs low and retrieval fast.

  • Child Chunks: Stored in Qdrant (Vector DB) for similarity search.

  • Parent Chunks: Stored as JSON files on disk.

Why? Because Parent Chunks are huge. Storing massive text blobs in a Vector DB is expensive and unnecessary. We only need the Vector DB to find the ID; we can read the text from a file.

Defining the Tools

In an Agentic system, we don't write a hardcoded retrieval function. We give the LLM Tools and let it decide when to use them.

Tool 1: The Scout (search_child_chunks) This tool searches Qdrant. It finds the "Needle" (keywords) but returns the "Thread" (the parent_id).

@tool
def search_child_chunks(query: str, k: int = 5) -> List[dict]:
    """Search for the top K most relevant child chunks."""
    # Search the Vector DB
    results = child_vector_store.similarity_search(query, k=k)

    # Return snippets + The Parent ID
    return [
        {
            "content": doc.page_content,
            "parent_id": doc.metadata.get("parent_id"), 
            "source": doc.metadata.get("source")
        }
        for doc in results
    ]

Tool 2: The Reader (retrieve_parent_chunks) This tool takes the IDs found by the scout and fetches the full context from the disk.

@tool
def retrieve_parent_chunks(parent_ids: List[str]) -> List[dict]:
    """Retrieve full parent chunks by their IDs."""
    results = []
    for parent_id in set(parent_ids):
        # Fetch the full JSON file from disk (Hybrid Storage)
        file_path = os.path.join(PARENT_STORE_PATH, f"{parent_id}.json")
        if os.path.exists(file_path):
            with open(file_path, "r") as f:
                doc = json.load(f)
                results.append(doc)
    return results

Connecting to LangGraph

How does the agent use these tools? We bind them to the LLM using LangChain's .bind_tools() method.

The agent node in our graph doesn't just "chat"; it "reasons."

  1. It receives the user's question.

  2. It decides to call search_child_chunks.

  3. It looks at the results.

  4. It decides to call retrieve_parent_chunks to get the full story.

# Bind the tools to the LLM
llm_with_tools = llm.bind_tools([search_child_chunks, retrieve_parent_chunks])

def agent_node(state: AgentState):
    """Main agent node that processes queries using tools"""
    sys_msg = SystemMessage(content=get_rag_agent_prompt())

    # The LLM decides whether to answer or call a tool
    return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])]}

The data flows through the State via the messages list. The tool outputs are appended as messages, allowing the LLM to read the retrieved parent chunks as if they were part of the conversation history.

Conclusion

We have now solved two of the three biggest problems in RAG.

  1. Bad Queries: Solved in [Part 1] with Agentic Query Rewriting.

  2. Missing Context: Solved here in [Part 2] with Hierarchical Indexing.

Your system can now understand what the user meant to ask, and it can find the complete answer in your documents.

But there is one final boss level.

What happens when a user asks a question that requires searching for two different things at once?

  • "Compare the return policy of Stripe vs. PayPal."

  • "How does the implementation of OAuth differ between Python and Node.js?"

A single search query—no matter how smart—often fails to capture both sides of a comparison.

In Part 3, we will build a Multi-Agent Map-Reduce System. We will spawn multiple agents in parallel to research different parts of a question and merge their findings.