An intelligent document-processing and question-answering system that extracts text from several document formats (PDF, Excel, PowerPoint, Word, PNG) and provides AI-powered answers to questions about the content.
- Multi-format Document Support: Extracts text from PDF, Excel (.xlsx/.xls), PowerPoint (.pptx), Word (.docx), and PNG files
- Intelligent Text Processing: Splits documents into manageable chunks for processing
- Advanced Embeddings: Uses sentence transformers for semantic text encoding
- Similarity Search: Implements FAISS for efficient similarity search
- Question Classification: Automatically classifies questions as math, code, document, or general
- Context-Aware Responses: Generates answers using Retrieval-Augmented Generation (RAG)
- RESTful API: Exposes functionality through a FastAPI endpoint
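
As an illustration of the question classification feature, here is a minimal sketch that sorts questions into the four categories named above. The keyword rules and the name `classify_question` are assumptions made for this example; the project may classify questions differently, for instance by asking the language model.

```python
import re

# Hypothetical keyword rules, checked in order: math, then code, then document.
MATH_PATTERN = re.compile(
    r"\d\s*[-+*/^=]\s*\d|\b(calculate|solve|sum of|percent(age)?)\b", re.I
)
CODE_PATTERN = re.compile(
    r"\b(python|javascript|function|code|script|compile|bug)\b", re.I
)
DOC_PATTERN = re.compile(
    r"\b(document|policy|clause|section|according to|page)\b", re.I
)


def classify_question(question: str) -> str:
    """Return one of 'math', 'code', 'document', or 'general'."""
    if MATH_PATTERN.search(question):
        return "math"
    if CODE_PATTERN.search(question):
        return "code"
    if DOC_PATTERN.search(question):
        return "document"
    # Nothing matched, so fall back to the general category.
    return "general"
```

Keyword rules are cheap and predictable, but they misfire on inputs such as dates or version numbers, so treat the output as a routing hint rather than a guarantee.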
- Python 3.10
- FastAPI: Web framework for the REST API
- FAISS: Efficient similarity search and clustering
- Sentence Transformers: Sentence embeddings for semantic text encoding
- OpenAI GPT-4: Advanced language model for question answering
- PyMuPDF: PDF text extraction
- Pandas: Excel file processing
- python-docx: Word document processing
- python-pptx: PowerPoint presentation processing
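
The libraries above each handle one file type, so the extraction step has to choose a backend from the document URL. The sketch below shows one way to do that by file extension. The function name `pick_extractor`, the table `EXTRACTORS`, and the backend labels are assumptions for this example, and the OCR backend is left generic because the README does not name one.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-backend table. The values are labels for this sketch only.
EXTRACTORS = {
    ".pdf": "pymupdf",
    ".xlsx": "pandas",
    ".xls": "pandas",
    ".pptx": "python-pptx",
    ".docx": "python-docx",
    ".png": "ocr",
}


def pick_extractor(document_url: str) -> str:
    """Return the backend label for the file that document_url points to."""
    # Use only the URL path, so query strings such as signed-URL
    # parameters do not hide the file extension.
    path = urlparse(document_url).path
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported document format: {suffix or 'none'}"
        ) from None
```

For example, `pick_extractor("https://example.com/files/Report.PDF?sig=abc")` returns `"pymupdf"`, and an unsupported extension raises `ValueError`.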
- Clone the repository:
    git clone <repository-url>
    cd HackRx_6.0
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables: create a .env file with your OpenAI API key:
    OPENAI_API_KEY=your_openai_api_key_here
Start the server:
    python main.py
The API will be available at http://localhost:8000
POST /hackrx/run
Headers:
Authorization: Bearer <token> (use the token specified in the code)
Body:
{
"documents": "URL_to_document",
"questions": ["Question 1", "Question 2", ...]
}

Supported document formats:
- PDF (.pdf)
- Excel (.xlsx, .xls)
- PowerPoint (.pptx)
- Word Documents (.docx)
- PNG Images (processed with OCR)
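
A request to POST /hackrx/run can be assembled with the standard library alone. The sketch below builds the request without sending it. The helper name `build_run_request`, the token value, and the document URL are placeholders, and the response format is not described in this README, so the example does not assume one.

```python
import json
import urllib.request


def build_run_request(
    base_url: str, token: str, document_url: str, questions: list[str]
) -> urllib.request.Request:
    """Build a POST request for the /hackrx/run endpoint."""
    payload = {"documents": document_url, "questions": questions}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/hackrx/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_request(
    "http://localhost:8000",
    "YOUR_TOKEN",  # placeholder: use the token specified in the code
    "https://example.com/sample.pdf",  # placeholder document URL
    ["What is the document about?"],
)

# To send the request against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```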
- Document Ingestion: The system accepts document URLs and extracts text based on file type
- Text Processing: Documents are split into chunks for efficient processing
- Embedding Generation: Chunks are converted to vector embeddings using sentence transformers
- Index Creation: Embeddings are indexed using FAISS for fast similarity search
- Question Processing: Questions are classified and processed appropriately
- Information Retrieval: Relevant document chunks are retrieved based on question semantics
- Answer Generation: Context-aware answers are generated using OpenAI's GPT-4
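
The steps above can be sketched end to end with the standard library. This is a simplified stand-in, not the project's implementation: word-count vectors replace the sentence-transformer embeddings, a sorted linear scan replaces the FAISS index, and the final GPT-4 call is omitted. All function names, chunk sizes, and the prompt wording are assumptions for this example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence transformer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question (stand-in for FAISS)."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query, embed(chunk)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    "The grace period for premium payment is thirty days.",
    "Claims must be filed within ninety days of discharge.",
    "The policy covers maternity expenses after two years.",
]
question = "What is the grace period for premium payment?"
prompt = build_prompt(question, retrieve(question, chunks, top_k=1))
```

In the real pipeline the prompt would then be sent to the language model. Overlapping chunks reduce the chance that an answer is split across a chunk boundary, at the cost of indexing some text twice.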
├── main.py # Main application file with FastAPI implementation
├── requirements.txt # Python dependencies
├── .env # Environment variables (not included in repo)
├── pdfs/ # Sample PDF documents
├── chunks.pkl # Processed document chunks
├── faiss_index.bin # FAISS vector index
└── test.py # GPU availability test script
- API access is protected with a bearer token
- Environment variables are used for sensitive information like API keys
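
A bearer-token check along these lines could back the first point. This is a sketch under assumptions: the function name `is_authorized` and the environment variable `API_BEARER_TOKEN` are hypothetical, and the README indicates that the actual token is specified in the code rather than read from the environment.

```python
import hmac
import os
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only for an 'Authorization: Bearer <token>' value that matches."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a guessed token were correct.
    return hmac.compare_digest(
        token.strip().encode("utf-8"), expected_token.encode("utf-8")
    )


# Hypothetical variable name; an empty value rejects every request.
expected = os.environ.get("API_BEARER_TOKEN", "")
```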
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing the GPT-4 API
- Facebook AI Research for FAISS
- SBERT.net for sentence transformers