PDF RAG Assistant

A document-grounded Retrieval-Augmented Generation (RAG) application for asking questions over PDFs using LangChain, ChromaDB, Hugging Face embeddings, and Google Gemini.

Overview

PDF RAG Assistant solves a common problem in research and learning workflows: long documents are difficult to search semantically, summarize, and question interactively. This project turns PDFs into searchable vector indexes, retrieves the most relevant chunks for a user query, and sends only that context to an LLM so answers stay grounded in the document.

The repository is useful for students, AI engineers, and portfolio reviewers who want to see a practical RAG pipeline end to end: document loading, chunking, embedding generation, vector storage, retrieval strategies, and a conversational interface.

Features

Feature	Description
Streamlit Chat UI	Provides an interactive browser interface for PDF question answering.
PDF Uploads	Lets users upload a PDF and build an in-memory vector index during the session.
Default Knowledge Base	Loads a persisted ChromaDB index generated from `document_loaders/deeplearning.pdf`.
PDF Loader	Uses LangChain's `PyPDFLoader` to extract page-level content from PDF files.
Text Chunking	Splits documents with `RecursiveCharacterTextSplitter` using overlapping chunks for better context continuity.
Hugging Face Embeddings	Uses `sentence-transformers/all-MiniLM-L6-v2` for compact semantic embeddings.
Chroma Vector Store	Stores and searches embeddings locally with ChromaDB.
MMR Retrieval	Uses Maximal Marginal Relevance to improve diversity and reduce repetitive retrieved chunks.
Gemini Answer Generation	Uses `gemini-2.5-flash-lite` through `langchain-google-genai` for grounded responses.
Source Snippets	Displays the top retrieved document snippets and page metadata in the UI.
CLI Chat Loop	Includes a terminal-based RAG workflow in `main.py`.
Retriever Experiments	Includes small demos for similarity search, MMR, multi-query retrieval, and arXiv search.
Loader Experiments	Includes sample PDF, text, and web page loaders for learning LangChain ingestion patterns.

Architecture

graph TD
    A[PDF Documents] --> B[PyPDFLoader]
    B --> C[RecursiveCharacterTextSplitter]
    C --> D[Hugging Face Embeddings]
    D --> E[ChromaDB Vector Store]
    E --> F[MMR Retriever]
    F --> G[Retrieved Context]
    H[User Question] --> F
    G --> I[Prompt Template]
    H --> I
    I --> J[Google Gemini]
    J --> K[Grounded Answer]
    F --> L[Source Snippets]

Data Flow

1. A PDF is loaded from either document_loaders/deeplearning.pdf or a user upload in the Streamlit sidebar.
2. The document is split into chunks of about 1,000 characters with overlap.
3. Each chunk is embedded with sentence-transformers/all-MiniLM-L6-v2.
4. Embeddings are stored in ChromaDB, either persisted in chroma-db/ or kept in memory for uploaded files.
5. A user question is sent to the retriever.
6. The retriever performs MMR search with k=12, fetch_k=50, and lambda_mult=0.5.
7. Retrieved chunks are inserted into a strict prompt that instructs the model to answer only from provided context.
8. Gemini generates the final answer, and the UI shows the answer plus source snippets.
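
To make the chunking step concrete, here is a minimal pure-Python sketch of overlapping chunks. It is not the project's code: the project uses LangChain's `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks, while this sketch uses a plain fixed-size sliding window. The function name `chunk_text` and the overlap of 200 characters are assumptions chosen for illustration; the source only states "about 1,000 characters with overlap".

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; neighbours share chunk_overlap characters.

    Simplified stand-in for a separator-aware splitter. The default overlap
    of 200 is an assumed value, not taken from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.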

Tech Stack

Layer	Technology
Language	Python
UI	Streamlit
LLM Framework	LangChain
LLM Provider	Google Gemini via `langchain-google-genai`
Embeddings	Hugging Face Sentence Transformers
Vector Database	ChromaDB
Document Loading	LangChain Community document loaders
Environment Management	`python-dotenv`
Optional Experiments	Web loaders, arXiv search, multi-query retrieval, FAISS/Qdrant/Pinecone/Weaviate dependencies

Project Structure

RAG_Project/
|-- app.py                         # Streamlit PDF chat application
|-- main.py                        # Terminal-based RAG chat loop
|-- create_database.py             # Builds the persisted ChromaDB index
|-- requirements.txt               # Python dependencies
|-- chroma-db/                     # Local persisted Chroma vector database
|-- document_loaders/
|   |-- deeplearning.pdf           # Default document used for the persisted index
|   |-- GRU.pdf                    # Sample PDF for loader experiments
|   |-- notes.txt                  # Sample text file
|   |-- pdf.py                     # PDF loading and chunking demo
|   |-- page.py                    # Web page loader demo
|   `-- test.py                    # Text loader and splitter demo
|-- retrievers/
|   |-- arixv.py                   # arXiv search demo
|   |-- mmr.py                     # Similarity vs MMR retrieval demo
|   `-- multiquery.py              # Multi-query retriever demo
|-- vector_store/
|   `-- DB.py                      # ChromaDB creation script variant
`-- images/
    |-- app_ui.png                 # Streamlit UI screenshot
    `-- image.png                  # Additional project image

Getting Started

Prerequisites

Python 3.10 or later
A Google AI API key for Gemini
Recommended: a virtual environment

Installation

Clone the repository and install the dependencies:

git clone <your-repository-url>
cd RAG_Project
python -m venv .venv

Activate the environment:

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

Install packages:

pip install -r requirements.txt

Note: create_database.py and vector_store/DB.py import langchain_chroma, and retrievers/arixv.py imports arxiv. If your environment does not already include them, install them with:

pip install langchain-chroma arxiv

Environment Variables

Create a .env file in the project root:

GOOGLE_API_KEY=your_google_api_key_here

The project calls load_dotenv() to load this file into the process environment, and LangChain's Google integration reads GOOGLE_API_KEY from the environment when calling Gemini.
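
A missing or placeholder key otherwise only surfaces when the first Gemini call fails, so an early check can give a clearer error. The sketch below is a hypothetical helper, not part of the repository; `require_api_key` is an assumed name, and it uses only the standard library. In the real app it would run after `load_dotenv()`.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from env, or raise a readable error if it is unusable.

    Hypothetical helper for illustration; call it after load_dotenv() so that
    values from .env are already present in os.environ.
    """
    value = env.get(name, "").strip()
    # Treat the README's placeholder value as "not configured".
    if not value or value == "your_google_api_key_here":
        raise RuntimeError(
            f"{name} is not set. Create a .env file in the project root "
            f"containing {name}=<your key>."
        )
    return value
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.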

Usage

1. Build or Rebuild the Default Vector Database

The repository already contains a chroma-db/ directory, but you can regenerate it from the default PDF:

python create_database.py

This loads document_loaders/deeplearning.pdf, chunks it, embeds it, and persists the ChromaDB index under ./chroma-db.

2. Run the Streamlit App

streamlit run app.py

Then open the local URL printed by Streamlit. The app starts with the default indexed document and also supports uploading a new PDF from the sidebar.

3. Run the CLI RAG Assistant

python main.py

Ask questions in the terminal. Enter 0 to exit.
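
The terminal loop described above can be sketched as follows. This is an illustrative outline, not the contents of main.py: `chat_loop` and its parameters are assumed names, and `answer` stands in for whatever function runs retrieval and calls Gemini.

```python
from typing import Callable


def chat_loop(answer: Callable[[str], str],
              read: Callable[[str], str] = input,
              write: Callable[[str], None] = print) -> int:
    """Read questions until the user enters 0; return the number answered.

    `answer` is a placeholder for the RAG pipeline (retrieve, then generate).
    """
    turns = 0
    while True:
        try:
            question = read("Ask a question (0 to exit): ").strip()
        except EOFError:
            break  # input stream closed, e.g. Ctrl-D
        if question == "0":
            break
        if not question:
            continue  # ignore blank lines instead of sending an empty query
        write(answer(question))
        turns += 1
    return turns
```

Injecting `read` and `write` lets the loop be exercised in tests with canned input instead of a live terminal.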

4. Try the Experiment Scripts

python document_loaders/pdf.py
python document_loaders/test.py
python document_loaders/page.py
python retrievers/mmr.py
python retrievers/multiquery.py
python retrievers/arixv.py

These scripts demonstrate individual LangChain concepts used by the main app.

Key Implementation Details

Prompt Grounding

The application uses a strict system prompt:

Use ONLY the provided context to answer the question.
If the answer is not present in the context, say:
"I could not find the answer in the provided documents."

This design reduces unsupported answers by instructing the LLM to rely only on the retrieved chunks and to say so explicitly when the context does not contain the answer.
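
As a rough illustration of how retrieved chunks and the question might be combined with this prompt, here is a sketch using plain string formatting. The project builds its prompt with LangChain; the function name `build_prompt`, the chunk separator, and the "Context:" / "Question:" labels are assumptions made for this example.

```python
SYSTEM_PROMPT = (
    "Use ONLY the provided context to answer the question.\n"
    "If the answer is not present in the context, say:\n"
    '"I could not find the answer in the provided documents."'
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble one prompt string from the instructions, context, and question.

    Layout and separator are illustrative choices, not the project's template.
    """
    # Drop empty chunks and separate the rest so boundaries stay visible.
    context = "\n\n---\n\n".join(c.strip() for c in chunks if c.strip())
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )
```

Keeping the instructions ahead of the context and the question last makes it easy to see exactly what text the model was allowed to use.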

Retrieval Configuration

The main retriever uses Maximal Marginal Relevance:

search_type="mmr"
search_kwargs={"k": 12, "fetch_k": 50, "lambda_mult": 0.5}

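With these settings, the retriever first fetches the 50 chunks most similar to the query (fetch_k), then selects 12 of them (k) one at a time, each time weighing relevance to the query against similarity to the chunks already selected. lambda_mult sets that trade-off: values near 1 favor relevance, values near 0 favor diversity, and 0.5 weights the two equally.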
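
The selection logic behind MMR can be sketched in pure Python. This is a simplified illustration of the technique, not the code LangChain or ChromaDB runs; `cosine` and `mmr_select` are assumed names, and vectors are plain lists of floats.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query: list[float], candidates: list[list[float]],
               k: int = 12, fetch_k: int = 50,
               lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k candidates chosen by greedy MMR."""
    # Stage 1: keep only the fetch_k candidates most similar to the query.
    pool = sorted(range(len(candidates)),
                  key=lambda i: cosine(query, candidates[i]),
                  reverse=True)[:fetch_k]

    selected: list[int] = []
    while pool and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the closest match among chunks already chosen.
            redundancy = max((cosine(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is an ordinary similarity ranking; lowering it increasingly penalizes chunks that repeat what has already been selected.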

Default vs Uploaded Documents

Source	Storage Mode	Behavior
Default PDF	Persisted ChromaDB	Loaded from `./chroma-db` for repeat use.
Uploaded PDF	In-memory ChromaDB	Built at upload time and discarded after the session ends.

Dependencies

The project includes a broad AI/RAG-focused requirements.txt. The primary runtime dependencies for the main workflow are:

streamlit
python-dotenv
langchain
langchain-community
langchain-core
langchain-google-genai
langchain-huggingface
chromadb
sentence-transformers
pypdf

Additional packages support experimentation with other vector databases, document formats, web scraping, OCR, notebooks, and testing.

Roadmap

Add .env.example with documented configuration keys.
Add automated tests for document ingestion and retrieval behavior.
Add a configurable model selector for different Gemini or local models.
Add support for multiple uploaded PDFs in a single session.
Add document metadata filters for page range, filename, or source type.
Move shared RAG pipeline logic into reusable modules.

Contributing

Contributions are welcome. Good first improvements include dependency cleanup, tests, better error handling for missing API keys, and support for additional document types.

Before opening a pull request:

Create a focused branch.
Keep changes small and well described.
Run the relevant script or Streamlit app locally.
Avoid committing secrets, .env files, virtual environments, or generated cache files.

License

No license file is currently included. Add a license before distributing or accepting external contributions.
