kartiks15/Document-Query-System

HackRx Document Processing System

An intelligent document processing and question answering system that extracts text from various document formats (PDF, Excel, PowerPoint, Word, PNG) and provides AI-powered answers to questions about the content.

Features

  • Multi-format Document Support: Extracts text from PDF, Excel (.xlsx/.xls), PowerPoint (.pptx), Word (.docx), and PNG files
  • Intelligent Text Processing: Splits extracted text into chunks that are embedded and retrieved individually
  • Advanced Embeddings: Encodes each chunk as a vector using sentence transformers
  • Similarity Search: Uses a FAISS index to find the chunks most similar to a question
  • Question Classification: Automatically classifies questions as math, code, document, or general
  • Context-Aware Responses: Generates answers using Retrieval-Augmented Generation (RAG)
  • RESTful API: Exposes functionality through a FastAPI endpoint
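The README does not say how question classification is implemented. As a rough illustration only, a keyword-based classifier for the four categories could look like the sketch below. The patterns and the `classify_question` name are hypothetical and are not taken from main.py.

```python
import re

# Hypothetical keyword rules; the actual classifier in main.py may differ
# (for example, it could delegate classification to the language model).
MATH_PATTERN = re.compile(
    r"\d+\s*[-+*/^]\s*\d+|\b(calculate|sum|percent|percentage)\b", re.IGNORECASE
)
CODE_PATTERN = re.compile(r"\b(python|javascript|function|code|script)\b", re.IGNORECASE)
DOCUMENT_PATTERN = re.compile(
    r"\b(document|policy|clause|section|according to)\b", re.IGNORECASE
)


def classify_question(question: str) -> str:
    """Return one of 'math', 'code', 'document', or 'general'."""
    # Rules are checked in a fixed order, so the first match wins.
    if MATH_PATTERN.search(question):
        return "math"
    if CODE_PATTERN.search(question):
        return "code"
    if DOCUMENT_PATTERN.search(question):
        return "document"
    return "general"
```

Because the first matching rule wins, a question that mentions both arithmetic and a document is labeled math in this sketch. A real system would likely need a more careful policy.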

Technologies Used

  • Python 3.10
  • FastAPI: Web framework for the REST API
  • FAISS: Efficient similarity search and clustering
  • Sentence Transformers: Sentence embeddings for document chunks and questions
  • OpenAI GPT-4: Language model used for answer generation
  • PyMuPDF: PDF text extraction
  • Pandas: Excel file processing
  • python-docx: Word document processing
  • python-pptx: PowerPoint presentation processing

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd HackRx_6.0
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables: Create a .env file with your OpenAI API key:

    OPENAI_API_KEY=your_openai_api_key_here
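How main.py reads the .env file is not documented here. A minimal sketch using only the standard library is shown below. The `load_env_file` helper is hypothetical, and the project may well use a dedicated package for this instead.

```python
import os


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Trim whitespace and optional surrounding quotes from the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A caller could then read `load_env_file().get("OPENAI_API_KEY")` and fail early with a clear message if the key is missing.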
    

Usage

Start the server:

python main.py

The API will be available at http://localhost:8000

API Endpoint

POST /hackrx/run

Headers:

  • Authorization: Bearer <token> (Use the token specified in the code)

Body:

{
  "documents": "URL_to_document",
  "questions": ["Question 1", "Question 2", ...]
}
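As an illustration of the request described above, the following sketch builds and sends the POST using only the Python standard library. The `build_request` and `send` helpers are hypothetical names introduced here. The response format is not documented in this README, so the sketch simply returns the decoded JSON.

```python
import json
import urllib.request

# Endpoint from this README; adjust the host and port if the server runs elsewhere.
API_URL = "http://localhost:8000/hackrx/run"


def build_request(document_url, questions, token):
    """Build a POST request carrying the documented body and bearer token."""
    payload = {"documents": document_url, "questions": list(questions)}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, calling `send(build_request(document_url, questions, token))` submits the questions for one document.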

Supported document formats:

  • PDF (.pdf)
  • Excel (.xlsx, .xls)
  • PowerPoint (.pptx)
  • Word Documents (.docx)
  • PNG Images (processed with OCR)
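Since documents arrive as URLs, the format has to be determined before extraction. One plausible approach, shown here as a sketch, is to map the file extension in the URL path to an extractor. The mapping labels and the `detect_format` name are assumptions, and the real code may inspect content types instead.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions listed in this README, mapped to hypothetical extractor labels.
EXTRACTOR_BY_EXTENSION = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".pptx": "powerpoint",
    ".docx": "word",
    ".png": "image_ocr",
}


def detect_format(document_url: str) -> str:
    """Return the extractor label for a document URL, ignoring any query string."""
    path = urlparse(document_url).path
    extension = PurePosixPath(path).suffix.lower()
    try:
        return EXTRACTOR_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(
            f"Unsupported document format: {extension or 'none'}"
        ) from None
```

Parsing the URL first keeps signed-URL query parameters from being mistaken for part of the file name.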

How It Works

  1. Document Ingestion: The system accepts document URLs and extracts text based on file type
  2. Text Processing: Documents are split into chunks for efficient processing
  3. Embedding Generation: Chunks are converted to vector embeddings using sentence transformers
  4. Index Creation: Embeddings are indexed using FAISS for fast similarity search
  5. Question Processing: Each question is classified as math, code, document, or general and handled according to its type
  6. Information Retrieval: Relevant document chunks are retrieved based on question semantics
  7. Answer Generation: Context-aware answers are generated using OpenAI's GPT-4
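The chunking, retrieval, and prompt-assembly steps above can be sketched in a few lines. To stay dependency-free, this sketch swaps in a toy bag-of-words vector and a linear cosine-similarity scan where the project uses sentence transformers and FAISS. The chunk size, overlap, function names, and prompt wording are all assumptions.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size=50, overlap=10):
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    tokens = (word.strip(".,?!:;").lower() for word in text.split())
    return Counter(token for token in tokens if token)


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=3):
    """Return the top_k chunks most similar to the question (linear scan)."""
    query = embed(question)
    ranked = sorted(
        chunks, key=lambda chunk: cosine_similarity(query, embed(chunk)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In the real pipeline, the resulting prompt would be sent to GPT-4. Overlapping windows are a common way to avoid cutting a relevant sentence in half at a chunk boundary.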

Project Structure

├── main.py              # Main application file with FastAPI implementation
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (not included in repo)
├── pdfs/                # Sample PDF documents
├── chunks.pkl           # Processed document chunks
├── faiss_index.bin      # FAISS vector index
└── test.py              # GPU availability test script

Security

  • API access is protected with a bearer token
  • Environment variables are used for sensitive information like API keys
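A bearer-token check of the kind described above might look like the following sketch. The `is_authorized` helper is hypothetical and framework-independent, and how main.py actually validates the header is not shown in this README.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True only for a header of the form 'Bearer <expected_token>'."""
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest runs in constant time, which avoids leaking how much
    # of the token matched through response timing.
    return hmac.compare_digest(
        token.strip().encode("utf-8"), expected_token.encode("utf-8")
    )
```

Reading `expected_token` from an environment variable rather than hard-coding it would keep it alongside the other sensitive values mentioned above.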

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI for providing the GPT-4 API
  • Facebook AI Research for FAISS
  • SBERT.net for sentence transformers
