A Python-based PDF Question-Answering system powered by Retrieval-Augmented Generation (RAG).
This project extracts text from PDFs, chunks it, generates embeddings (using Hugging Face or OpenAI models), stores them in Qdrant (an open-source vector database), and performs efficient similarity search to retrieve relevant context. The retrieved context is then used to augment prompts for a Large Language Model, delivering accurate and document-grounded answers.
Perfect for building custom document-specific QA applications.
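The chunking step of the pipeline above can be sketched in a few lines of plain Python. This is an illustrative fixed-size splitter with overlap, not the project's actual implementation (function name and defaults are assumptions):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so a sentence cut at
    one chunk boundary is still fully visible in the next chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk is then embedded and stored in the vector database for later retrieval.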
- PDF text extraction and intelligent chunking
- Support for Hugging Face or OpenAI embeddings
- Vector storage and fast similarity search with Qdrant
- Full RAG pipeline for reliable question answering
- Interactive Streamlit web interface
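Under the hood, the similarity search amounts to ranking stored chunk embeddings by closeness to the query embedding. Qdrant does this at scale; a toy pure-Python version (names and the tiny index are illustrative) looks like:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec: list[float], index: list[dict], k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda e: cosine_similarity(query_vec, e["vector"]),
                    reverse=True)
    return [e["text"] for e in ranked[:k]]
```

The retrieved texts are what get stuffed into the LLM prompt as context.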
- Python
- Qdrant (vector database)
- LangChain (RAG orchestration)
- Streamlit (UI)
- Hugging Face Transformers / OpenAI API
- PyPDF2 or similar for PDF processing
- Clone the repository:

```bash
git clone https://github.com/1rishu0/PDF-QuestionAnswer.git
cd PDF-QuestionAnswer
```

- Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables (required for OpenAI usage). Create a `.env` file in the root directory:

```
OPENAI_API_KEY=your_openai_api_key_here
```

- Run Qdrant using Docker:

```bash
docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
```
Launch the interactive interface:

```bash
streamlit run streamlit_app.py
```

- Upload your PDF document
- Wait for the document to be processed and indexed
- Start asking questions about the content
Run the CLI version:

```bash
python main.py --pdf path/to/your/document.pdf
```

Steps:
- The PDF will be loaded and indexed
- Follow the console prompts to ask questions
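A CLI entry point like this typically boils down to `argparse` plus a question loop. A hedged sketch of the argument handling (the real `main.py` may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the CLI shape shown above: a single required --pdf argument."""
    parser = argparse.ArgumentParser(description="Ask questions about a PDF")
    parser.add_argument("--pdf", required=True, help="Path to the PDF to index")
    return parser
```

After parsing, the script would load and index the given PDF, then read questions from stdin in a loop.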
- For local/free usage: The project supports Hugging Face embedding models (no API key needed)
- For better performance: Use OpenAI embeddings/LLM by setting OPENAI_API_KEY
- Persistent Qdrant storage: The Docker command above mounts a local folder for data persistence
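The Hugging Face / OpenAI switch can be as simple as checking for the key at startup. A hypothetical selector (the model identifiers below are common examples, not the project's configured defaults):

```python
import os

def pick_embedding_backend() -> str:
    """Choose OpenAI embeddings when an API key is configured,
    otherwise fall back to a local Hugging Face model (no key needed)."""
    if os.environ.get("OPENAI_API_KEY"):
        return "openai:text-embedding-3-small"
    return "huggingface:sentence-transformers/all-MiniLM-L6-v2"
```

Loading the `.env` file (e.g. with `python-dotenv`) before this check makes the key visible to `os.environ`.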
```
.
├── streamlit_app.py   # Streamlit web interface
├── main.py            # CLI entry point
├── data_loader.py     # PDF loading, text extraction & chunking
├── vector_db.py       # Qdrant integration (indexing & retrieval)
├── custom_types.py    # Custom type definitions
├── pyproject.toml     # Dependencies (Poetry)
└── README.md          # This file
```
Contributions are welcome! Feel free to:
- Open issues for bugs or feature requests
- Fork the repo and submit pull requests
- Improve documentation or add examples
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
