A document-grounded Retrieval-Augmented Generation (RAG) application for asking questions over PDFs using LangChain, ChromaDB, Hugging Face embeddings, and Google Gemini.
PDF RAG Assistant solves a common problem in research and learning workflows: long documents are difficult to search semantically, summarize, and question interactively. This project turns PDFs into searchable vector indexes, retrieves the most relevant chunks for a user query, and sends only that context to an LLM so answers stay grounded in the document.
The repository is useful for students, AI engineers, and portfolio reviewers who want to see a practical RAG pipeline end to end: document loading, chunking, embedding generation, vector storage, retrieval strategies, and a conversational interface.
| Feature | Description |
|---|---|
| Streamlit Chat UI | Provides an interactive browser interface for PDF question answering. |
| PDF Uploads | Lets users upload a PDF and build an in-memory vector index during the session. |
| Default Knowledge Base | Loads a persisted ChromaDB index generated from document_loaders/deeplearning.pdf. |
| PDF Loader | Uses LangChain's PyPDFLoader to extract page-level content from PDF files. |
| Text Chunking | Splits documents with RecursiveCharacterTextSplitter using overlapping chunks for better context continuity. |
| Hugging Face Embeddings | Uses sentence-transformers/all-MiniLM-L6-v2 for compact semantic embeddings. |
| Chroma Vector Store | Stores and searches embeddings locally with ChromaDB. |
| MMR Retrieval | Uses Maximal Marginal Relevance to improve diversity and reduce repetitive retrieved chunks. |
| Gemini Answer Generation | Uses gemini-2.5-flash-lite through langchain-google-genai for grounded responses. |
| Source Snippets | Displays the top retrieved document snippets and page metadata in the UI. |
| CLI Chat Loop | Includes a terminal-based RAG workflow in main.py. |
| Retriever Experiments | Includes small demos for similarity search, MMR, multi-query retrieval, and arXiv search. |
| Loader Experiments | Includes sample PDF, text, and web page loaders for learning LangChain ingestion patterns. |
graph TD
A[PDF Documents] --> B[PyPDFLoader]
B --> C[RecursiveCharacterTextSplitter]
C --> D[Hugging Face Embeddings]
D --> E[ChromaDB Vector Store]
E --> F[MMR Retriever]
F --> G[Retrieved Context]
H[User Question] --> F
G --> I[Prompt Template]
H --> I
I --> J[Google Gemini]
J --> K[Grounded Answer]
F --> L[Source Snippets]
- A PDF is loaded from either `document_loaders/deeplearning.pdf` or a user upload in the Streamlit sidebar.
- The document is split into chunks of about 1,000 characters with overlap.
- Each chunk is embedded with `sentence-transformers/all-MiniLM-L6-v2`.
- Embeddings are stored in ChromaDB, either persisted in `chroma-db/` or kept in memory for uploaded files.
- A user question is sent to the retriever.
- The retriever performs MMR search with `k=12`, `fetch_k=50`, and `lambda_mult=0.5`.
- Retrieved chunks are inserted into a strict prompt that instructs the model to answer only from provided context.
- Gemini generates the final answer, and the UI shows the answer plus source snippets.
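The chunking step can be illustrated with a minimal pure-Python sketch. It is not the project's code: `split_with_overlap` is a hypothetical helper, and the 200-character overlap is an assumed value, since the exact overlap is not stated here. LangChain's `RecursiveCharacterTextSplitter` is more careful than this, because it prefers to break on separators such as paragraphs and line breaks before falling back to raw character positions.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters.

    Hypothetical helper for illustration; the overlap default is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between the starts of adjacent chunks
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    for index, chunk in enumerate(split_with_overlap(sample)):
        print(index, len(chunk))
```

The shared tail of one chunk and head of the next is what keeps a sentence that straddles a boundary retrievable from either side.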
| Layer | Technology |
|---|---|
| Language | Python |
| UI | Streamlit |
| LLM Framework | LangChain |
| LLM Provider | Google Gemini via langchain-google-genai |
| Embeddings | Hugging Face Sentence Transformers |
| Vector Database | ChromaDB |
| Document Loading | LangChain Community document loaders |
| Environment Management | python-dotenv |
| Optional Experiments | Web loaders, arXiv search, multi-query retrieval, FAISS/Qdrant/Pinecone/Weaviate dependencies |
RAG_Project/
|-- app.py # Streamlit PDF chat application
|-- main.py # Terminal-based RAG chat loop
|-- create_database.py # Builds the persisted ChromaDB index
|-- requirements.txt # Python dependencies
|-- chroma-db/ # Local persisted Chroma vector database
|-- document_loaders/
| |-- deeplearning.pdf # Default document used for the persisted index
| |-- GRU.pdf # Sample PDF for loader experiments
| |-- notes.txt # Sample text file
| |-- pdf.py # PDF loading and chunking demo
| |-- page.py # Web page loader demo
| `-- test.py # Text loader and splitter demo
|-- retrievers/
| |-- arixv.py # arXiv search demo
| |-- mmr.py # Similarity vs MMR retrieval demo
| `-- multiquery.py # Multi-query retriever demo
|-- vector_store/
| `-- DB.py # ChromaDB creation script variant
`-- images/
    |-- app_ui.png # Streamlit UI screenshot
    `-- image.png # Additional project image
- Python 3.10 or later
- A Google AI API key for Gemini
- Recommended: a virtual environment
Clone the repository and create a virtual environment:
git clone <your-repository-url>
cd RAG_Project
python -m venv .venv
Activate the environment:
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# macOS/Linux
source .venv/bin/activate
Install packages:
pip install -r requirements.txt
Note: `create_database.py` and `vector_store/DB.py` import `langchain_chroma`, and `retrievers/arixv.py` imports `arxiv`. If your environment does not already include them, install them with:
pip install langchain-chroma arxiv
Create a .env file in the project root:
GOOGLE_API_KEY=your_google_api_key_here
The project uses load_dotenv(), and LangChain's Google integration will read this key when calling Gemini.
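A missing key usually only surfaces when the first Gemini call fails. As a sketch of an earlier check, the hypothetical helper below (`require_google_api_key` is not part of the repository) reads the variable and raises a readable error if it is absent or blank. It assumes it is called after `load_dotenv()` has run, so values from `.env` are already in the process environment.

```python
import os


def require_google_api_key(env=None) -> str:
    """Return GOOGLE_API_KEY, or raise a clear error if it is missing or blank.

    Hypothetical helper for illustration. `env` defaults to os.environ and
    accepts any mapping, which keeps the function easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set. Add it to a .env file in the project "
            "root or export it in your shell before starting the app."
        )
    return key


if __name__ == "__main__":
    try:
        require_google_api_key()
        print("GOOGLE_API_KEY found.")
    except RuntimeError as exc:
        print(exc)
```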
The repository already contains a chroma-db/ directory, but you can regenerate it from the default PDF:
python create_database.py
This loads document_loaders/deeplearning.pdf, chunks it, embeds it, and persists the ChromaDB index under ./chroma-db.
Start the Streamlit app:
streamlit run app.py
Then open the local URL printed by Streamlit. The app starts with the default indexed document and also supports uploading a new PDF from the sidebar.
Run the terminal chat loop:
python main.py
Ask questions in the terminal. Enter 0 to exit.
Run the experiment scripts:
python document_loaders/pdf.py
python document_loaders/test.py
python document_loaders/page.py
python retrievers/mmr.py
python retrievers/multiquery.py
python retrievers/arixv.py
These scripts demonstrate individual LangChain concepts used by the main app.
The application uses a strict system prompt:
Use ONLY the provided context to answer the question.
If the answer is not present in the context, say:
"I could not find the answer in the provided documents."
This design reduces unsupported answers by instructing the LLM to rely only on the retrieved chunks and to say so when the context does not contain the answer.
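How retrieved chunks and the question might be combined with that prompt can be sketched in plain Python. This is an illustration only: `build_prompt`, the `Context:` and `Question:` labels, and the numbered chunk markers are assumptions, not the template the application uses, and the sample chunks are made up.

```python
SYSTEM_PROMPT = (
    "Use ONLY the provided context to answer the question.\n"
    "If the answer is not present in the context, say:\n"
    '"I could not find the answer in the provided documents."'
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble one prompt string from the system rules, chunks, and question.

    Hypothetical layout for illustration; the real template may differ.
    """
    # Number each chunk so an answer can be traced back to a specific snippet.
    context = "\n\n".join(
        f"[{number}] {chunk.strip()}" for number, chunk in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question.strip()}"


if __name__ == "__main__":
    # Made-up chunks, standing in for retriever output.
    example_chunks = [
        "A GRU is a gated recurrent unit. ",
        "LSTMs use three gates.",
    ]
    print(build_prompt("What is a GRU?", example_chunks))
```

Keeping the refusal sentence verbatim in the prompt makes it easy to detect "not found" answers in the UI with a simple string comparison.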
The main retriever uses Maximal Marginal Relevance:
search_type="mmr"
search_kwargs={"k": 12, "fetch_k": 50, "lambda_mult": 0.5}
This first fetches the 50 candidates most similar to the query, then selects 12 of them by balancing relevance against diversity before sending context to the LLM.
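To show what these three parameters control, here is a small pure-Python sketch of MMR selection over plain lists of floats. It is a teaching aid under simplifying assumptions, not LangChain's implementation: `cosine` and `mmr_select` are hypothetical names, and the vectors in the example are made up. A `lambda_mult` of 1.0 ranks by relevance alone, while values closer to 0 penalize chunks that resemble ones already selected.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, docs, k=12, fetch_k=50, lambda_mult=0.5) -> list[int]:
    """Return indices into `docs` chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query, doc) for doc in docs]
    # Stage 1: keep only the fetch_k candidates most similar to the query.
    candidates = sorted(range(len(docs)), key=lambda i: relevance[i], reverse=True)[:fetch_k]

    selected: list[int] = []
    # Stage 2: greedily add the candidate with the best relevance/redundancy trade-off.
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[4.0, 3.0], [4.0, 3.0], [4.0, -3.0]]  # docs 0 and 1 are duplicates
    print(mmr_select(query, docs, k=2))                   # balanced: skips the duplicate
    print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
```

In the example all three vectors are equally relevant to the query, so the balanced setting skips the duplicate and picks the third vector second, whereas the relevance-only setting returns both duplicates.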
| Source | Storage Mode | Behavior |
|---|---|---|
| Default PDF | Persisted ChromaDB | Loaded from ./chroma-db for repeat use. |
| Uploaded PDF | In-memory ChromaDB | Built at upload time and discarded after the session ends. |
The project includes a broad AI/RAG-focused requirements.txt. The primary runtime dependencies for the main workflow are:
- streamlit
- python-dotenv
- langchain
- langchain-community
- langchain-core
- langchain-google-genai
- langchain-huggingface
- chromadb
- sentence-transformers
- pypdf
Additional packages support experimentation with other vector databases, document formats, web scraping, OCR, notebooks, and testing.
- Add `.env.example` with documented configuration keys.
- Add automated tests for document ingestion and retrieval behavior.
- Add a configurable model selector for different Gemini or local models.
- Add support for multiple uploaded PDFs in a single session.
- Add document metadata filters for page range, filename, or source type.
- Move shared RAG pipeline logic into reusable modules.
Contributions are welcome. Good first improvements include dependency cleanup, tests, better error handling for missing API keys, and support for additional document types.
Before opening a pull request:
- Create a focused branch.
- Keep changes small and well described.
- Run the relevant script or Streamlit app locally.
- Avoid committing secrets, `.env` files, virtual environments, or generated cache files.
No license file is currently included. Add a license before distributing or accepting external contributions.
