Java Full-Text Search Engine

A high-performance, modular search engine core built from scratch in Java. This project implements fundamental Information Retrieval (IR) concepts, including an Inverted Index, custom Tokenization, and a scalable architecture designed for high-speed document indexing.

Key Features

Inverted Index Architecture: Maps each term to the documents that contain it, backed by a HashMap for average-case O(1) term lookup.
TF-IDF Ranking: Sorts search results by term frequency weighted by inverse document frequency, so terms that are rare across the collection count for more.
Search Query Processor: Interactive console interface for real-time querying and ranked result retrieval.
Custom Tokenizer: Text processing with lowercase normalization, punctuation removal, and stopword filtering.
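
As a rough illustration of the TF-IDF ranking listed above, the sketch below scores one term against one document using the common `tf * log(N / df)` form. The class name `TfIdfSketch`, its method names, and the exact weighting variant are assumptions made for this example; the project's own scoring code may differ.

```java
import java.util.List;

/** Illustrative TF-IDF scorer; names and weighting variant are hypothetical. */
class TfIdfSketch {

    /** Raw term frequency: number of times the term occurs in the document's tokens. */
    static int termFrequency(String term, List<String> docTokens) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Inverse document frequency: log(N / df), or 0 when no document contains the term. */
    static double inverseDocumentFrequency(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    /** TF-IDF score of a term for one document within the given corpus. */
    static double score(String term, List<String> docTokens, List<List<String>> corpus) {
        return termFrequency(term, docTokens) * inverseDocumentFrequency(term, corpus);
    }
}
```

With this form, a term that appears in every document gets an IDF of `log(1) = 0`, so it contributes nothing to the ranking.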

System Architecture

Data Processing Pipeline

The system processes raw text documents through a structured pipeline to build a searchable index.

graph TD
    A[Data Folder] -->|Read .txt| B(DocumentReader)
    B -->|List of Documents| C(IndexBuilder)
    C -->|Raw Content| D[Tokenizer]
    D -->|Filtered Tokens| C
    C -->|Add Token + DocID| E[InvertedIndex]
    E -->|Store| F[(HashMap: Word -> Postings)]
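
The first stage of the pipeline (reading `.txt` files into documents) might look roughly like the sketch below. `DocumentSketch`, `DocumentReaderSketch`, and the choice to assign IDs in sorted filename order are assumptions for illustration only; the README does not show the real `DocumentReader` or `Document` classes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical document holder: a numeric ID plus the raw file content. */
class DocumentSketch {
    final int id;
    final String content;

    DocumentSketch(int id, String content) {
        this.id = id;
        this.content = content;
    }
}

/** Hypothetical reader: loads every .txt file directly inside a folder. */
class DocumentReaderSketch {

    static List<DocumentSketch> readDocuments(String path) throws IOException {
        List<Path> files;
        // Files.list returns a lazy stream that must be closed after use.
        try (Stream<Path> stream = Files.list(Path.of(path))) {
            files = stream
                .filter(p -> p.toString().endsWith(".txt"))
                .sorted() // stable ordering so document IDs are reproducible
                .collect(Collectors.toList());
        }

        List<DocumentSketch> docs = new ArrayList<>();
        int nextId = 0;
        for (Path file : files) {
            docs.add(new DocumentSketch(nextId++, Files.readString(file)));
        }
        return docs;
    }
}
```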

Class Relationships

The classes are kept modular so that future ranking algorithms and query processors can be integrated easily.

classDiagram
    class DocumentReader {
        +readDocuments(path String) List~Document~
    }
    class Tokenizer {
        +tokenize(text String) List~String~
    }
    class InvertedIndex {
        +addToken(token String, docId int)
        +getPostings(token String) List
    }
    class IndexBuilder {
        +buildIndex(docs List)
    }
    
    IndexBuilder --> Tokenizer
    IndexBuilder --> InvertedIndex
    InvertedIndex "1" *-- "many" PostingEntry
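
A minimal sketch of how `InvertedIndex` and `PostingEntry` from the diagram could fit together is shown below. The method names `addToken` and `getPostings` come from the diagram; the fields of `PostingEntry` (a document ID and a per-document frequency) and the assumption that documents are indexed one at a time are guesses for illustration, not the project's confirmed design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One posting: a document ID plus how often the term occurs in that document (assumed fields). */
class PostingEntry {
    final int docId;
    int frequency;

    PostingEntry(int docId) {
        this.docId = docId;
        this.frequency = 1;
    }
}

/** Maps each token to the list of documents that contain it. */
class InvertedIndex {
    private final Map<String, List<PostingEntry>> postings = new HashMap<>();

    void addToken(String token, int docId) {
        List<PostingEntry> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
        // Assumes documents are indexed one at a time, so a repeat of the
        // same docId can only be the most recently added entry.
        if (!list.isEmpty() && list.get(list.size() - 1).docId == docId) {
            list.get(list.size() - 1).frequency++;
        } else {
            list.add(new PostingEntry(docId));
        }
    }

    /** Returns the postings for a token, or an empty list if the token was never indexed. */
    List<PostingEntry> getPostings(String token) {
        return postings.getOrDefault(token, Collections.emptyList());
    }
}
```

Returning an empty list for unknown tokens, rather than `null`, lets callers iterate over the result without a null check.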

Core Concepts

Why an Inverted Index?

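A naive search scans every document from start to finish for each query (linear search), so query time grows with the total size of the collection. An Inverted Index pre-processes the data into a "dictionary" of words, where each word points to the documents that contain it, so a query only has to look up its own terms. This is the same fundamental data structure used by search systems such as Elasticsearch and Google.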

Tokenization Strategy

Preprocessing text is critical for search quality. This engine:

Normalizes case (converts all to lowercase).
Cleans noise (removes non-alphanumeric characters).
Filters stopwords (removes high-frequency words such as 'the' and 'is', which carry little meaning and degrade search relevance).
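
The three steps above can be sketched as follows. The class name `TokenizerSketch`, the short stopword list, and the ASCII-only character class are assumptions for this example; the project's actual `Tokenizer` and stopword list are not shown in this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative tokenizer: lowercase, strip non-alphanumerics, drop stopwords. */
class TokenizerSketch {

    // A tiny example stopword list (assumed, not the project's real list).
    private static final Set<String> STOPWORDS = Set.of("the", "is", "a", "an", "and", "of");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, then replace each run of non-alphanumeric (ASCII) characters with a space.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ");
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For example, `TokenizerSketch.tokenize("The index is FAST, and small!")` yields `[index, fast, small]`. Replacing punctuation with a space instead of deleting it keeps words such as "full-text" from being fused into a single token, which is one possible design choice among several.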

Getting Started

Prerequisites

Java 17+
Gradle

Build and Run

Clone the repository.
Place your documents in the data/documents/ folder.
Run the application:
```
./gradlew run
```

Future Roadmap

Boolean Query Support: Enabling complex searches using AND, OR, and NOT operators.
Disk Persistence: Serializing the Inverted Index to disk so it does not have to be rebuilt on every run.
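
To give a sense of what the planned Boolean Query Support could involve, the sketch below implements an AND between two terms as the intersection of their document-ID lists. This is not existing project code: the class `BooleanAndSketch` is hypothetical, and it assumes each list of document IDs is sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical AND operator: intersects two ascending lists of document IDs. */
class BooleanAndSketch {

    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        // Walk both lists in step, advancing whichever side has the smaller ID.
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                result.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
}
```

Because both lists are sorted, the merge visits each entry at most once, so the cost is linear in the combined length of the two lists.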

Contributing

This project is part of a deep dive into Information Retrieval. Feel free to open issues or submit PRs if you see room for performance optimizations!
