HackRx Document Processing System

An intelligent document processing and question answering system that extracts text from various document formats (PDF, Excel, PowerPoint, Word, PNG) and provides AI-powered answers to questions about the content.

Features

Multi-format Document Support: Extracts text from PDF, Excel (.xlsx/.xls), PowerPoint (.pptx), and Word (.docx) files, and from PNG images via OCR
Intelligent Text Processing: Splits documents into manageable chunks for processing
Advanced Embeddings: Uses sentence transformers for semantic text encoding
Similarity Search: Implements FAISS for efficient similarity search
Question Classification: Automatically classifies questions as math, code, document, or general
Context-Aware Responses: Generates answers using Retrieval-Augmented Generation (RAG)
RESTful API: Exposes functionality through a FastAPI endpoint
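
The README does not say how questions are classified, so the following is only a minimal sketch of one possible approach, assuming simple keyword rules. The patterns and the `classify_question` name are hypothetical; the classifier in main.py may work quite differently (for example, by asking the language model).

```python
import re

# Hypothetical keyword rules, checked in priority order: math, code, document.
MATH_PATTERN = re.compile(
    r"\d\s*[-+*/^=]\s*\d|\b(calculate|solve|sum of|integral|derivative)\b", re.I
)
CODE_PATTERN = re.compile(
    r"\b(python|javascript|function|code|script|compile|regex)\b", re.I
)
DOCUMENT_PATTERN = re.compile(
    r"\b(document|policy|clause|section|according to|report|page)\b", re.I
)


def classify_question(question: str) -> str:
    """Return one of 'math', 'code', 'document', or 'general'."""
    if MATH_PATTERN.search(question):
        return "math"
    if CODE_PATTERN.search(question):
        return "code"
    if DOCUMENT_PATTERN.search(question):
        return "document"
    # Nothing matched, so fall back to the catch-all category.
    return "general"
```

Keyword rules are cheap and easy to reason about, but they misfire on inputs such as dates or page ranges, so treat this as an illustration of the four categories rather than a production classifier.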

Technologies Used

Python 3.10
FastAPI: Web framework for the REST API
FAISS: Efficient similarity search and clustering
Sentence Transformers: Sentence embeddings for semantic text encoding
OpenAI GPT-4: Language model used for answer generation
PyMuPDF: PDF text extraction
Pandas: Excel file processing
python-docx: Word document processing
python-pptx: PowerPoint presentation processing

Installation

Clone the repository:

git clone <repository-url>
cd HackRx_6.0

Install dependencies:
```
pip install -r requirements.txt
```
Set up environment variables: Create a .env file with your OpenAI API key:
```
OPENAI_API_KEY=your_openai_api_key_here
```
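
As a rough illustration of how the key in `.env` can reach the application, the sketch below parses `KEY=VALUE` lines using only the standard library. The helper names `load_env_file` and `require_api_key` are hypothetical; the project itself may rely on a dedicated package for this.

```python
import os


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an '=' sign.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip().strip("'\"")
            loaded[key] = value
            # Variables already set in the real environment take precedence.
            os.environ.setdefault(key, value)
    return loaded


def require_api_key() -> str:
    """Return OPENAI_API_KEY, or fail early with a clear message."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the environment")
    return api_key
```

Failing at startup when the key is missing tends to be easier to debug than a rejected request later on.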

Usage

Start the server:

python main.py

The API will be available at http://localhost:8000

API Endpoint

POST /hackrx/run

Headers:

Authorization: Bearer <token> (Use the token specified in the code)

Body:

{
  "documents": "URL_to_document",
  "questions": ["Question 1", "Question 2", ...]
}
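
A client call to this endpoint might look like the sketch below, which uses only the standard library. The helper names, the example document URL, and the token value are placeholders. The response format is not documented here, so the sketch simply returns whatever JSON the server sends back.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/hackrx/run"


def build_request(document_url: str, questions: list[str], token: str) -> urllib.request.Request:
    """Build the POST request described above without sending it."""
    payload = {"documents": document_url, "questions": questions}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request: urllib.request.Request):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (placeholder URL and token):
# request = build_request("https://example.com/policy.pdf", ["What is the grace period?"], "<token>")
# print(send_request(request))
```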

Supported document formats:

PDF (.pdf)
Excel (.xlsx, .xls)
PowerPoint (.pptx)
Word Documents (.docx)
PNG Images (processed with OCR)
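
Because documents arrive as URLs, one plausible way to choose an extractor is to look at the file extension in the URL path. The sketch below assumes exactly that. The `detect_format` helper and the format labels are hypothetical, and the real code may inspect content types or file contents instead.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions taken from the supported formats listed above.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".pptx": "powerpoint",
    ".docx": "word",
    ".png": "png",
}


def detect_format(document_url: str) -> str:
    """Map a document URL to a format label based on its file extension."""
    # Use only the URL path so query strings (e.g. signed URLs) are ignored.
    path = urlparse(document_url).path
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported document format: {suffix or 'no extension'}"
        ) from None
```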

How It Works

Document Ingestion: The system accepts document URLs and extracts text based on file type
Text Processing: Documents are split into chunks for efficient processing
Embedding Generation: Chunks are converted to vector embeddings using sentence transformers
Index Creation: Embeddings are indexed using FAISS for fast similarity search
Question Processing: Each question is classified as math, code, document, or general and handled according to its type
Information Retrieval: Relevant document chunks are retrieved based on question semantics
Answer Generation: Context-aware answers are generated using OpenAI's GPT-4
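
The chunking, retrieval, and prompt-building steps above can be sketched with the standard library alone. This is a simplified stand-in, not the project's implementation: word-count vectors replace the sentence-transformer embeddings, a brute-force cosine ranking replaces the FAISS index, and all function names and default sizes are assumptions.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (NOT a sentence transformer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Brute-force ranking; the real system delegates this search to FAISS."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query, embed(chunk)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a RAG-style prompt; the actual prompt wording is an assumption."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

In a full pipeline, the string returned by `build_prompt` would be sent to the language model, which generates the final answer from the retrieved context.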

Project Structure

├── main.py              # Main application file with FastAPI implementation
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (not included in repo)
├── pdfs/                # Sample PDF documents
├── chunks.pkl           # Processed document chunks
├── faiss_index.bin      # FAISS vector index
└── test.py              # GPU availability test script

Security

API access is protected with a bearer token
Environment variables are used for sensitive information like API keys
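
As an illustration of the bearer-token check, the sketch below validates an `Authorization` header value with a constant-time comparison. The `is_authorized` helper is hypothetical and independent of any web framework; how main.py actually performs the check is not shown in this README.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Return True only for a header of the form 'Bearer <expected_token>'."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        token.strip().encode("utf-8"), expected_token.encode("utf-8")
    )
```

Reading the expected token from an environment variable, as is already done for the API key, would keep it out of the source code.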

Contributing

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Create a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

OpenAI for providing the GPT-4 API
Facebook AI Research for FAISS
SBERT.net for sentence transformers
