AyushGarg76/PDF_Assistant_project

PDF RAG Assistant

A document-grounded Retrieval-Augmented Generation (RAG) application for asking questions over PDFs using LangChain, ChromaDB, Hugging Face embeddings, and Google Gemini.

[Screenshot: PDF RAG Assistant UI]

Overview

PDF RAG Assistant solves a common problem in research and learning workflows: long documents are difficult to search semantically, summarize, and question interactively. This project turns PDFs into searchable vector indexes, retrieves the most relevant chunks for a user query, and sends only that context to an LLM so answers stay grounded in the document.

The repository is useful for students, AI engineers, and portfolio reviewers who want to see a practical RAG pipeline end to end: document loading, chunking, embedding generation, vector storage, retrieval strategies, and a conversational interface.

Features

| Feature | Description |
| --- | --- |
| Streamlit Chat UI | Provides an interactive browser interface for PDF question answering. |
| PDF Uploads | Lets users upload a PDF and build an in-memory vector index during the session. |
| Default Knowledge Base | Loads a persisted ChromaDB index generated from document_loaders/deeplearning.pdf. |
| PDF Loader | Uses LangChain's PyPDFLoader to extract page-level content from PDF files. |
| Text Chunking | Splits documents with RecursiveCharacterTextSplitter using overlapping chunks for better context continuity. |
| Hugging Face Embeddings | Uses sentence-transformers/all-MiniLM-L6-v2 for compact semantic embeddings. |
| Chroma Vector Store | Stores and searches embeddings locally with ChromaDB. |
| MMR Retrieval | Uses Maximal Marginal Relevance to improve diversity and reduce repetitive retrieved chunks. |
| Gemini Answer Generation | Uses gemini-2.5-flash-lite through langchain-google-genai for grounded responses. |
| Source Snippets | Displays the top retrieved document snippets and page metadata in the UI. |
| CLI Chat Loop | Includes a terminal-based RAG workflow in main.py. |
| Retriever Experiments | Includes small demos for similarity search, MMR, multi-query retrieval, and arXiv search. |
| Loader Experiments | Includes sample PDF, text, and web page loaders for learning LangChain ingestion patterns. |

Architecture

graph TD
    A[PDF Documents] --> B[PyPDFLoader]
    B --> C[RecursiveCharacterTextSplitter]
    C --> D[Hugging Face Embeddings]
    D --> E[ChromaDB Vector Store]
    E --> F[MMR Retriever]
    F --> G[Retrieved Context]
    H[User Question] --> F
    G --> I[Prompt Template]
    H --> I
    I --> J[Google Gemini]
    J --> K[Grounded Answer]
    F --> L[Source Snippets]

Data Flow

  1. A PDF is loaded from either document_loaders/deeplearning.pdf or a user upload in the Streamlit sidebar.
  2. The document is split into chunks of about 1,000 characters with overlap.
  3. Each chunk is embedded with sentence-transformers/all-MiniLM-L6-v2.
  4. Embeddings are stored in ChromaDB, either persisted in chroma-db/ or kept in memory for uploaded files.
  5. A user question is sent to the retriever.
  6. The retriever performs MMR search with k=12, fetch_k=50, and lambda_mult=0.5.
  7. Retrieved chunks are inserted into a strict prompt that instructs the model to answer only from provided context.
  8. Gemini generates the final answer, and the UI shows the answer plus source snippets.
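
Step 2 above can be illustrated with a minimal pure-Python sketch of overlapping chunking. This is a simplified fixed-window splitter, not RecursiveCharacterTextSplitter (which also tries to split on separators such as paragraph and line breaks). The function name `chunk_text` and the overlap of 200 characters are assumptions for illustration; the README only states that chunks are about 1,000 characters with overlap.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Hypothetical helper for illustration; the overlap value is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    for piece in chunk_text(sample, chunk_size=4, overlap=2):
        print(piece)
```

Because consecutive windows share characters, a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.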

Tech Stack

| Layer | Technology |
| --- | --- |
| Language | Python |
| UI | Streamlit |
| LLM Framework | LangChain |
| LLM Provider | Google Gemini via langchain-google-genai |
| Embeddings | Hugging Face Sentence Transformers |
| Vector Database | ChromaDB |
| Document Loading | LangChain Community document loaders |
| Environment Management | python-dotenv |
| Optional Experiments | Web loaders, arXiv search, multi-query retrieval, FAISS/Qdrant/Pinecone/Weaviate dependencies |

Project Structure

RAG_Project/
|-- app.py                         # Streamlit PDF chat application
|-- main.py                        # Terminal-based RAG chat loop
|-- create_database.py             # Builds the persisted ChromaDB index
|-- requirements.txt               # Python dependencies
|-- chroma-db/                     # Local persisted Chroma vector database
|-- document_loaders/
|   |-- deeplearning.pdf           # Default document used for the persisted index
|   |-- GRU.pdf                    # Sample PDF for loader experiments
|   |-- notes.txt                  # Sample text file
|   |-- pdf.py                     # PDF loading and chunking demo
|   |-- page.py                    # Web page loader demo
|   `-- test.py                    # Text loader and splitter demo
|-- retrievers/
|   |-- arixv.py                   # arXiv search demo
|   |-- mmr.py                     # Similarity vs MMR retrieval demo
|   `-- multiquery.py              # Multi-query retriever demo
|-- vector_store/
|   `-- DB.py                      # ChromaDB creation script variant
`-- images/
    |-- app_ui.png                 # Streamlit UI screenshot
    `-- image.png                  # Additional project image

Getting Started

Prerequisites

  • Python 3.10 or later
  • A Google AI API key for Gemini
  • Recommended: a virtual environment

Installation

Clone the repository and install the dependencies:

git clone <your-repository-url>
cd RAG_Project   # use the directory name created by git clone if it differs
python -m venv .venv

Activate the environment:

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

Install packages:

pip install -r requirements.txt

Note: create_database.py and vector_store/DB.py import langchain_chroma, and retrievers/arixv.py imports arxiv. If your environment does not already include them, install them with:

pip install langchain-chroma arxiv

Environment Variables

Create a .env file in the project root:

GOOGLE_API_KEY=your_google_api_key_here

The project calls load_dotenv() to load this file into the process environment, and langchain-google-genai reads GOOGLE_API_KEY from the environment when calling Gemini.

Usage

1. Build or Rebuild the Default Vector Database

The repository already contains a chroma-db/ directory, but you can regenerate it from the default PDF:

python create_database.py

This loads document_loaders/deeplearning.pdf, chunks it, embeds it, and persists the ChromaDB index under ./chroma-db.

2. Run the Streamlit App

streamlit run app.py

Then open the local URL printed by Streamlit. The app starts with the default indexed document and also supports uploading a new PDF from the sidebar.

3. Run the CLI RAG Assistant

python main.py

Ask questions in the terminal. Enter 0 to exit.
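
The loop behavior described above (read a question, answer it, stop on 0) can be sketched as follows. This is an assumption about the general shape of such a loop, not the code in main.py. `chat_loop` and `answer_fn` are hypothetical names, and `answer_fn` stands in for the retrieval and generation step.

```python
from collections.abc import Callable


def chat_loop(answer_fn: Callable[[str], str],
              read: Callable[[str], str] = input,
              write: Callable[[str], None] = print) -> int:
    """Answer questions until the user enters 0; return how many were answered.

    `read` and `write` default to input/print and are parameters only so the
    loop can be driven by scripted input.
    """
    answered = 0
    while True:
        question = read("Ask a question (0 to exit): ").strip()
        if question == "0":
            break
        if not question:
            continue  # ignore empty lines instead of querying the model
        write(answer_fn(question))
        answered += 1
    return answered


if __name__ == "__main__":
    # Stand-in answer function; a real one would run retrieval and call the LLM.
    chat_loop(lambda q: f"(no model attached) You asked: {q}")
```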

4. Try the Experiment Scripts

python document_loaders/pdf.py
python document_loaders/test.py
python document_loaders/page.py
python retrievers/mmr.py
python retrievers/multiquery.py
python retrievers/arixv.py

These scripts demonstrate individual LangChain concepts used by the main app.

Key Implementation Details

Prompt Grounding

The application uses a strict system prompt:

Use ONLY the provided context to answer the question.
If the answer is not present in the context, say:
"I could not find the answer in the provided documents."

This design reduces unsupported answers by instructing the LLM to rely only on the retrieved chunks and to say so explicitly when the context does not contain the answer.
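
To make the grounding step concrete, here is a minimal sketch of how retrieved chunks and a question might be combined with the system prompt above. It uses plain string formatting rather than a LangChain prompt template, and the names `SYSTEM_PROMPT` and `build_prompt` as well as the "Context:" / "Question:" layout are assumptions for illustration.

```python
SYSTEM_PROMPT = (
    "Use ONLY the provided context to answer the question.\n"
    "If the answer is not present in the context, say:\n"
    '"I could not find the answer in the provided documents."'
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble one prompt string from the instructions, context, and question.

    Hypothetical helper; empty or whitespace-only chunks are dropped.
    """
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt(
        "What is backpropagation?",
        ["Backpropagation computes gradients layer by layer.", ""],
    ))
```

Keeping the instructions, context, and question in clearly separated sections makes it easier to check which text the model was actually given when an answer looks unsupported.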

Retrieval Configuration

The main retriever uses Maximal Marginal Relevance:

search_type="mmr"
search_kwargs={"k": 12, "fetch_k": 50, "lambda_mult": 0.5}

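With these settings the retriever first fetches the 50 chunks most similar to the query (fetch_k), then selects 12 of them (k) so that the final set balances relevance against diversity. lambda_mult ranges from 0 (maximum diversity) to 1 (pure relevance), so 0.5 weights the two equally.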
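
The selection step can be illustrated with a small pure-Python sketch of greedy MMR over toy vectors. This is not the LangChain or ChromaDB implementation. `cosine` and `mmr_select` are hypothetical names, and the parameter names simply mirror the retriever settings above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query: list[float], candidates: list[list[float]],
               k: int = 12, fetch_k: int = 50,
               lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k candidates chosen by greedy MMR."""
    relevance = [cosine(query, c) for c in candidates]
    # Stage 1: keep only the fetch_k candidates most similar to the query.
    pool = sorted(range(len(candidates)),
                  key=lambda i: relevance[i], reverse=True)[:fetch_k]

    selected: list[int] = []
    while pool and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        # Stage 2: repeatedly take the candidate with the best trade-off.
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
    print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
    print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # favors diversity
```

In the toy data, the second document is nearly identical to the first, so lowering lambda_mult makes the selection prefer the third, less similar document as its second pick.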

Default vs Uploaded Documents

| Source | Storage Mode | Behavior |
| --- | --- | --- |
| Default PDF | Persisted ChromaDB | Loaded from ./chroma-db for repeat use. |
| Uploaded PDF | In-memory ChromaDB | Built at upload time and discarded after the session ends. |

Dependencies

The project includes a broad AI/RAG-focused requirements.txt. The primary runtime dependencies for the main workflow are:

  • streamlit
  • python-dotenv
  • langchain
  • langchain-community
  • langchain-core
  • langchain-google-genai
  • langchain-huggingface
  • chromadb
  • sentence-transformers
  • pypdf

Additional packages support experimentation with other vector databases, document formats, web scraping, OCR, notebooks, and testing.

Roadmap

  • Add .env.example with documented configuration keys.
  • Add automated tests for document ingestion and retrieval behavior.
  • Add a configurable model selector for different Gemini or local models.
  • Add support for multiple uploaded PDFs in a single session.
  • Add document metadata filters for page range, filename, or source type.
  • Move shared RAG pipeline logic into reusable modules.

Contributing

Contributions are welcome. Good first improvements include dependency cleanup, tests, better error handling for missing API keys, and support for additional document types.

Before opening a pull request:

  1. Create a focused branch.
  2. Keep changes small and well described.
  3. Run the relevant script or Streamlit app locally.
  4. Avoid committing secrets, .env files, virtual environments, or generated cache files.

License

No license file is currently included. Add a license before distributing or accepting external contributions.
