
A full-text search engine implemented in Java using a custom tokenizer, inverted index, and TF-IDF ranking for fast and relevant text retrieval.


RanaVinit/full-text-search-engine-java


Java Full-Text Search Engine

A modular search engine core built from scratch in Java. This project implements fundamental Information Retrieval (IR) concepts, including an inverted index, custom tokenization, and TF-IDF ranking, with an architecture designed for fast document indexing.


Key Features

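  • Inverted Index Architecture: Maps each term to the documents that contain it. Because the index is backed by a HashMap, term lookup is O(1) on average.
  • TF-IDF Ranking: Sorts search results by relevance, weighting how often a term appears in a document (term frequency) against how rare the term is across the collection (inverse document frequency).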
  • Search Query Processor: Interactive console interface for real-time querying and ranked result retrieval.
  • Custom Tokenizer: Text processing with lowercase normalization, punctuation removal, and stopword filtering.
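
To make the TF-IDF feature above concrete, here is a minimal sketch of one common formulation: tf is the term count divided by document length, and idf is log(N / df). The README does not say which TF-IDF variant the engine uses, so the class name `TfIdfSketch`, its methods, and the exact formula are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not the repository's code): basic TF-IDF scoring. */
class TfIdfSketch {

    /** Term frequency: occurrences of the term divided by the document length. */
    static double tf(String term, List<String> docTokens) {
        if (docTokens.isEmpty()) return 0.0;
        long count = docTokens.stream().filter(term::equals).count();
        return (double) count / docTokens.size();
    }

    /** Inverse document frequency: log(N / df), or 0 when no document contains the term. */
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    /** Sums tf * idf over the query terms; the map key is the document's position in the corpus. */
    static Map<Integer, Double> score(List<String> queryTerms, List<List<String>> corpus) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : queryTerms) {
            double idf = idf(term, corpus);
            for (int docId = 0; docId < corpus.size(); docId++) {
                double contribution = tf(term, corpus.get(docId)) * idf;
                // Only documents that actually match a query term get an entry.
                if (contribution > 0) scores.merge(docId, contribution, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("java", "search", "engine"),
                List.of("java", "index"),
                List.of("cooking", "recipes"));
        // Print matching documents, highest score first.
        score(List.of("java", "search"), corpus).entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .forEach(e -> System.out.println("doc " + e.getKey() + " -> " + e.getValue()));
    }
}
```

In this toy corpus, "search" occurs in fewer documents than "java", so it carries a higher idf and the document containing both terms ranks first. Documents matching no query term are simply absent from the result map.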

System Architecture

Data Processing Pipeline

The system processes raw text documents through a structured pipeline to build a searchable index.

graph TD
    A[Data Folder] -->|Read .txt| B(DocumentReader)
    B -->|List of Documents| C(IndexBuilder)
    C -->|Raw Content| D[Tokenizer]
    D -->|Filtered Tokens| C
    C -->|Add Token + DocID| E[InvertedIndex]
    E -->|Store| F[(HashMap: Word -> Postings)]

Class Relationships

Designed with modularity in mind so that additional ranking algorithms and query processors can be integrated easily.

classDiagram
    class DocumentReader {
        +readDocuments(path String) List~Document~
    }
    class Tokenizer {
        +tokenize(text String) List~String~
    }
    class InvertedIndex {
        +addToken(token String, docId int)
        +getPostings(token String) List
    }
    class IndexBuilder {
        +buildIndex(docs List)
    }
    
    IndexBuilder --> Tokenizer
    IndexBuilder --> InvertedIndex
    InvertedIndex "1" *-- "many" PostingEntry

Core Concepts

Why an Inverted Index?

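A naive search scans every document for the query terms (a linear scan), so query time grows with the size of the collection. An Inverted Index pre-processes the data into a "dictionary" of words, where each word points to the documents that contain it, so a query only touches the entries for its own terms. This is the same fundamental data structure behind search systems such as Elasticsearch (through Apache Lucene) and web search engines like Google.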
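
As a rough illustration of the idea, the sketch below keeps a HashMap from each term to the documents that contain it. The method names `addToken` and `getPostings` mirror the class diagram above, but the class name `InvertedIndexSketch`, the nested-map layout, and the `termFrequency` helper are assumptions. The repository's actual `InvertedIndex` and `PostingEntry` may be structured differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an in-memory inverted index (not the repository's code). */
class InvertedIndexSketch {

    // term -> (docId -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** Records one occurrence of the token in the given document. */
    void addToken(String token, int docId) {
        postings.computeIfAbsent(token, t -> new TreeMap<>())
                .merge(docId, 1, Integer::sum);
    }

    /** Returns the IDs of documents containing the token, in ascending order. */
    List<Integer> getPostings(String token) {
        Map<Integer, Integer> entry = postings.get(token);
        return entry == null ? List.of() : new ArrayList<>(entry.keySet());
    }

    /** Assumed helper: how many times the token occurs in one document. */
    int termFrequency(String token, int docId) {
        return postings.getOrDefault(token, Map.of()).getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addToken("java", 1);
        index.addToken("search", 1);
        index.addToken("java", 2);
        index.addToken("java", 2);

        // One hash lookup finds every document for a term; no document is rescanned.
        System.out.println(index.getPostings("java"));
        System.out.println(index.termFrequency("java", 2));
        System.out.println(index.getPostings("python"));
    }
}
```

Storing a per-document count alongside each posting is one way to have term frequencies available for TF-IDF ranking without re-reading the documents at query time.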

Tokenization Strategy

Preprocessing text is critical for search quality. This engine:

  1. Normalizes case (converts all to lowercase).
  2. Cleans noise (removes non-alphanumeric characters).
  3. Filters stopwords (removes high-frequency words like 'the' and 'is', which carry little meaning and degrade search relevance).
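
The three steps above can be sketched as follows. This is a minimal illustration rather than the repository's `Tokenizer`: the class name `TokenizerSketch`, the short stopword list, and the choice to replace punctuation with a space (instead of deleting it) are all assumptions, and the pattern only keeps ASCII letters and digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the three-step tokenization strategy (not the repository's code). */
class TokenizerSketch {

    // Illustrative stopword list; the engine's real list is not shown in this README.
    private static final Set<String> STOPWORDS =
            Set.of("the", "is", "a", "an", "and", "of", "to", "in");

    static List<String> tokenize(String text) {
        // 1. Normalize case (Locale.ROOT avoids locale-specific case mappings).
        String normalized = text.toLowerCase(Locale.ROOT);

        // 2. Clean noise: every run of non-alphanumeric characters becomes one space.
        String cleaned = normalized.replaceAll("[^a-z0-9]+", " ");

        // 3. Split on whitespace and drop empty strings and stopwords.
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [index, fast, search, fast]
        System.out.println(tokenize("The index is FAST, and the search is fast!"));
    }
}
```

Duplicate tokens are kept on purpose, since a ranking step such as TF-IDF needs to know how many times each term occurs in a document.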

Getting Started

Prerequisites

  • Java 17+
  • Gradle

Build and Run

  1. Clone the repository.
  2. Place your documents in the data/documents/ folder.
  3. Run the application:
    ./gradlew run

Future Roadmap

  • Boolean Query Support: Enabling complex searches using AND, OR, and NOT operators.
  • Disk Persistence: Serializing the Inverted Index to disk so it can be reloaded across sessions without re-indexing the documents.

Contributing

This project is part of a deep dive into Information Retrieval. Feel free to open issues or submit PRs if you see room for performance optimizations!
