A high-performance, modular search engine core built from scratch in Java. This project implements fundamental Information Retrieval (IR) concepts, including an Inverted Index, custom Tokenization, and a scalable architecture designed for high-speed document indexing.
- Inverted Index Architecture: Maps each term to the documents that contain it, giving average-case O(1) term lookup through a hash map.
- TF-IDF Ranking: Sorts search results by relevance, weighting each term by its frequency within a document (TF) and its rarity across the collection (IDF).
- Search Query Processor: Interactive console interface for real-time querying and ranked result retrieval.
- Custom Tokenizer: Text processing with lowercase normalization, punctuation removal, and stopword filtering.
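As a rough illustration of the TF-IDF ranking listed above, the sketch below scores and orders documents for a single term. The class name `TfIdfSketch`, its method names, and the exact weighting (raw term count multiplied by `log(N / df)`) are assumptions made for this example; the project's actual formula may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the project's actual ranking code.
class TfIdfSketch {

    // tf: occurrences of the term in one document.
    // df: number of documents that contain the term.
    // n:  total number of documents in the collection.
    static double score(int tf, int df, int n) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) n / df);
    }

    // Orders document IDs by descending TF-IDF score for one term.
    // termCounts maps docId -> occurrences of the term (assumed > 0).
    static List<Integer> rank(Map<Integer, Integer> termCounts, int totalDocs) {
        int df = termCounts.size();
        List<Integer> docIds = new ArrayList<>(termCounts.keySet());
        docIds.sort(Comparator.comparingDouble(
                (Integer id) -> score(termCounts.get(id), df, totalDocs)).reversed());
        return docIds;
    }
}
```

With this weighting, a term that appears in every document scores zero because `log(N / N)` is 0, which is how very common words are kept from dominating the ranking.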
The system processes raw text documents through a structured pipeline to build a searchable index.
graph TD
A[Data Folder] -->|Read .txt| B(DocumentReader)
B -->|List of Documents| C(IndexBuilder)
C -->|Raw Content| D[Tokenizer]
D -->|Filtered Tokens| C
C -->|Add Token + DocID| E[InvertedIndex]
E -->|Store| F[(HashMap: Word -> Postings)]
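The first stage of this pipeline could look roughly like the sketch below, which reads every `.txt` file in a folder. Only the `readDocuments` method name and its `String` parameter come from this README; the `Document` record, its fields, the `DocumentReaderSketch` class name, and the ID scheme (sequential, in file-name order) are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical document type; the project's own Document class may differ.
record Document(int id, String name, String content) {}

class DocumentReaderSketch {

    // Reads all regular .txt files directly inside the folder (no recursion).
    List<Document> readDocuments(String path) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(Path.of(path))) {
            files = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted() // fixed order so document IDs are reproducible
                    .collect(Collectors.toList());
        }
        List<Document> docs = new ArrayList<>();
        int nextId = 0;
        for (Path file : files) {
            // Files.readString decodes the file as UTF-8.
            docs.add(new Document(nextId++, file.getFileName().toString(),
                    Files.readString(file)));
        }
        return docs;
    }
}
```

Sorting the paths before assigning IDs is a design choice in this sketch: it keeps document IDs identical between runs as long as the folder contents do not change.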
The components are designed to be modular, so additional ranking algorithms and query processors can be integrated easily.
classDiagram
class DocumentReader {
+readDocuments(path String) List~Document~
}
class Tokenizer {
+tokenize(text String) List~String~
}
class InvertedIndex {
+addToken(token String, docId int)
+getPostings(token String) List
}
class IndexBuilder {
+buildIndex(docs List)
}
IndexBuilder --> Tokenizer
IndexBuilder --> InvertedIndex
InvertedIndex "1" *-- "many" PostingEntry
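A naive search scans every document for each query (linear search), so query time grows with the size of the collection. An inverted index pre-processes the data into a "dictionary" of words, where each word points to the documents in which it occurs, so a query only touches the entries for its own terms. The same fundamental data structure underlies search systems such as Elasticsearch (through Apache Lucene) and web search engines like Google.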
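The dictionary idea can be sketched in a few lines. The method names `addToken` and `getPostings` follow the class diagram in this README, but the fields of `PostingEntry`, the frequency counting, and the class name `InvertedIndexSketch` are assumptions rather than the project's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical posting: one document plus the term's frequency in it.
class PostingEntry {
    final int docId;
    int frequency;

    PostingEntry(int docId) {
        this.docId = docId;
        this.frequency = 1;
    }
}

class InvertedIndexSketch {
    // term -> postings list, one entry per document containing the term.
    private final Map<String, List<PostingEntry>> index = new HashMap<>();

    // Records one occurrence of token in docId.
    // Assumes documents are indexed one at a time, so a repeated token for
    // the same document always matches the last posting in the list.
    void addToken(String token, int docId) {
        List<PostingEntry> postings =
                index.computeIfAbsent(token, k -> new ArrayList<>());
        int last = postings.size() - 1;
        if (last >= 0 && postings.get(last).docId == docId) {
            postings.get(last).frequency++;
        } else {
            postings.add(new PostingEntry(docId));
        }
    }

    // Returns a read-only view of the postings, or an empty list for unknown terms.
    List<PostingEntry> getPostings(String token) {
        return Collections.unmodifiableList(index.getOrDefault(token, List.of()));
    }
}
```

A lookup is a single hash map access followed by a walk over the postings for that term only, instead of a scan over every document.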
Preprocessing text is critical for search quality. This engine:
- Normalizes case (converts all to lowercase).
- Cleans noise (removes non-alphanumeric characters).
- Filters stopwords (removes high-frequency words such as 'the' and 'is', which degrade search relevance).
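The three steps above can be sketched as follows. The stopword list here is a tiny illustrative sample, and two behaviours are assumptions of this sketch rather than confirmed details of the project: non-alphanumeric characters are replaced by spaces (so `quick-brown` becomes two tokens), and only ASCII letters and digits are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the tokenizer; not the project's actual class.
class TokenizerSketch {

    // Small sample list; a real stopword list would be longer.
    private static final Set<String> STOPWORDS =
            Set.of("the", "is", "a", "an", "and", "of", "to");

    List<String> tokenize(String text) {
        // 1. Normalize case, 2. replace runs of non-alphanumerics with a space.
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ");
        List<String> tokens = new ArrayList<>();
        for (String word : cleaned.trim().split("\\s+")) {
            // 3. Drop empty fragments and stopwords.
            if (!word.isEmpty() && !STOPWORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

For example, `new TokenizerSketch().tokenize("The quick-brown Fox is FAST!")` returns `[quick, brown, fox, fast]`.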
- Java 17+
- Gradle
- Clone the repository.
- Place your documents in the data/documents/ folder.
- Run the application:
./gradlew run
- Boolean Query Support: Enabling complex searches using AND, OR, and NOT operators.
- Disk Persistence: Serializing the Inverted Index to disk for persistence across sessions.
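Boolean query support is still on the roadmap, so the following is only one possible approach, not the project's design. It treats the postings for each term as a set of document IDs and maps AND, OR, and NOT to set intersection, union, and difference. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of Boolean operators over sets of document IDs.
class BooleanOpsSketch {

    // Documents that contain both terms.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Documents that contain at least one of the terms.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Documents in the collection that do not contain the term.
    static Set<Integer> not(Set<Integer> allDocs, Set<Integer> a) {
        Set<Integer> result = new TreeSet<>(allDocs);
        result.removeAll(a);
        return result;
    }
}
```

NOT needs the full set of document IDs as an extra argument, because an inverted index only records where a term occurs, not where it is absent.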
This project is part of a deep dive into Information Retrieval. Feel free to open issues or submit PRs if you see room for performance optimizations!