learnwithparam/vectorless-rag

Vectorless RAG with Hierarchical Document Trees

learnwithparam.com

Retrieve without embeddings. Parse PDFs into a hierarchical document tree, then let a LangGraph agent reason through the structure to find answers with page-level citations. Zero vector database costs, zero chunking artifacts.

Start learning at learnwithparam.com. Regional pricing is available, with discounts of up to 60%.

What You'll Learn

  • Build a layout-aware PDF parser that produces a hierarchical TreeNode structure
  • Replace vector similarity search with LLM reasoning over a document tree
  • Design a LangGraph state machine that analyzes, routes, retrieves, and generates
  • Cache document trees to disk so repeated questions skip expensive parsing
  • Return grounded answers with page ranges and section titles as citations
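The hierarchical structure above can be sketched as a small dataclass. This is an illustrative shape, not the repo's actual `TreeNode` schema; field names like `page_start` and `summary` are assumptions chosen to match the features described (page-level citations, cheap relevance checks):

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """Hypothetical sketch of one node in the hierarchical document tree."""
    title: str                       # section heading recovered from the PDF layout
    page_start: int                  # first page the section spans
    page_end: int                    # last page (enables page-level citations)
    text: str = ""                   # full text of the section body
    summary: str = ""                # short summary for cheap relevance judgments
    children: list["TreeNode"] = field(default_factory=list)

    def citation(self) -> str:
        """Render a page-range citation like 'Data Model (pp. 2-4)'."""
        return f"{self.title} (pp. {self.page_start}-{self.page_end})"
```

Because each node carries its own page range and summary, the agent can judge a whole section's relevance from a few tokens before ever reading its full text.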

Tech Stack

  • Python 3.11+ with uv for dependency management
  • LangGraph for agent state graphs and conditional routing
  • PyMuPDF and pymupdf4llm for layout-aware PDF parsing
  • OpenAI (or any OpenAI-compatible client) for reasoning calls
  • Pydantic for typed state validation
  • Docker for reproducible runs

Getting Started

Prerequisites

  • Python 3.11+
  • uv (installed automatically by make setup)
  • An OpenAI API key

Quick Start

# One command to set up and run
make dev

# Or step by step:
make setup          # Create .env and install dependencies
# Edit .env with your API key
make run            # Parse the PDF, build the tree, answer the sample questions
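`make setup` generates a `.env` template; at minimum it needs your OpenAI key. The exact variable names come from the repo's template, so treat this as an assumed shape (`OPENAI_API_KEY` is the conventional name used by the OpenAI SDK):

```
# .env — assumed shape; check the template that `make setup` generates
OPENAI_API_KEY=sk-...
```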

With Docker

make build          # Build the Docker image
make up             # Run the container
make logs           # View logs
make down           # Stop the container

How It Runs

  1. Downloads the Google Bigtable paper on first run
  2. Parses the PDF into a DocumentTree and caches it at results/document_tree.json
  3. Renders the LangGraph workflow as results/workflow.png
  4. Answers the questions in questions.py, printing reasoning, confidence, path, and sources

Edit questions.py to ask your own questions, or point main.py at a different PDF to index a new document.
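The caching behavior in step 2 can be sketched as a load-or-build helper. `parse_fn` stands in for the repo's PDF-to-tree parser (a hypothetical callable here), and the default path matches the cache location named above:

```python
import json
from pathlib import Path

# Default cache location, matching results/document_tree.json from the run output.
DEFAULT_CACHE = Path("results/document_tree.json")

def load_or_build_tree(pdf_path: str, parse_fn, cache_path: Path = DEFAULT_CACHE) -> dict:
    """Return the cached document tree if present; otherwise parse and cache it.

    `parse_fn` is a placeholder for the expensive layout-aware PDF parse,
    so repeated questions against the same document skip it entirely.
    """
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    tree = parse_fn(pdf_path)                      # expensive: only on first run
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(tree))        # serialize for subsequent runs
    return tree
```

Deleting the cached JSON (or running `make clean`) forces a fresh parse on the next run.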

Challenges

Work through these incrementally to build the full system:

  1. PDF to Tree - Convert a raw PDF into a TreeNode hierarchy using pymupdf4llm
  2. Tree Caching - Serialize the tree to JSON and reload it on subsequent runs
  3. Section Summaries - Generate a short summary per node so the agent can judge relevance cheaply
  4. Analyze Node - LLM call that scores how relevant the current node is to the query
  5. Conditional Routing - Decide between descending, retrieving, or backtracking based on confidence
  6. Retrieve and Generate - Collect the full text of selected nodes and synthesize the final answer
  7. Workflow Visualization - Render the state graph to a PNG for debugging
  8. Multi-Document Trees - Extend the system to search across many documents in one tree
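The conditional routing in challenge 5 boils down to a pure function over the agent state, which in the real system would drive LangGraph's conditional edges. The thresholds and state keys below are illustrative assumptions, not the repo's actual schema:

```python
def route(state: dict) -> str:
    """Decide the next move from the analyze step's relevance score.

    Keys ('confidence', 'node') and thresholds are assumptions for
    illustration; the repo defines its own state schema.
    """
    confidence = state["confidence"]      # relevance score from the analyze node
    node = state["node"]                  # current tree node being inspected

    if confidence >= 0.8:
        return "retrieve"                 # confident match: collect full text
    if confidence >= 0.4 and node.get("children"):
        return "descend"                  # promising: inspect child sections
    return "backtrack"                    # dead end: return to the parent
```

A function with this shape is what you would register via LangGraph's `add_conditional_edges`, mapping each returned label to the next node in the state graph.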

Makefile Targets

make help           Show all available commands
make setup          Initial setup (create .env, install deps)
make dev            Setup and run (one command!)
make run            Run the vectorless RAG pipeline
make build          Build Docker image
make up             Start container
make down           Stop container
make logs           View container logs
make clean          Remove venv, caches, and generated results
