feat(index): add More Like This (MLT) query support #111

Open
poyrazK wants to merge 3 commits into main from feature/mlt-query

@poyrazK poyrazK commented May 24, 2026

Summary

  • Add MLT query type for "find similar documents" functionality
  • Extract significant terms from reference document using TF*IDF scoring
  • Build boosted BoolQuery with must_not exclusion of source doc
  • Add boost field to TermQuery for per-term scoring control
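The term-extraction step in the summary can be sketched as follows. This is a minimal illustration, not the PR's `build_mlt_bool_query`: the `ScoredTerm` type, the `significant_terms` function, its parameter layout, and the smoothed IDF formula `ln(1 + N / df)` are all assumptions made for the example. Only the parameter names `min_term_freq`, `min_doc_freq`, and `max_query_terms` come from the PR text.

```rust
use std::collections::HashMap;

/// Hypothetical scored candidate term; not a type from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoredTerm {
    pub term: String,
    pub score: f64,
}

/// Pick up to `max_query_terms` terms from a reference document by TF*IDF.
///
/// `term_freqs`: term -> occurrences in the reference document.
/// `doc_freqs`:  term -> number of indexed documents containing the term.
pub fn significant_terms(
    term_freqs: &HashMap<String, usize>,
    doc_freqs: &HashMap<String, usize>,
    total_docs: usize,
    min_term_freq: usize,
    min_doc_freq: usize,
    max_query_terms: usize,
) -> Vec<ScoredTerm> {
    let mut scored = Vec::new();
    for (term, &tf) in term_freqs {
        // Drop terms that are too rare inside the reference document.
        if tf < min_term_freq {
            continue;
        }
        // Drop terms the index has never seen or that are too rare overall.
        let df = match doc_freqs.get(term) {
            Some(&df) if df >= min_doc_freq && df > 0 => df,
            _ => continue,
        };
        // Assumed smoothed IDF: ln(1 + N / df).
        let idf = (1.0 + total_docs as f64 / df as f64).ln();
        scored.push(ScoredTerm {
            term: term.clone(),
            score: tf as f64 * idf,
        });
    }
    // Highest score first; ties broken by term for a deterministic order.
    scored.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.term.cmp(&b.term))
    });
    scored.truncate(max_query_terms);
    scored
}
```

A frequent but common term (high `tf`, high `df`) can rank below a less frequent but rarer one, which is the point of weighting by IDF.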

Test plan

  • All existing tests pass
  • 4 new MLT tests covering doc_id exclusion, like source, validation

Notes

MLT scoring currently depends on flushed positions data: positions_readers only contain data once segments have been flushed, because the per-doc inverted index built during indexing is not yet integrated with positions_readers.

poyrazK added 2 commits May 24, 2026 16:49
- Add MltQuery struct and Mlt variant to SearchQuery enum
- Add boost field to TermQuery for per-term scoring control
- Implement build_mlt_bool_query to extract significant terms
  from reference doc using TF*IDF significance scoring
- MLT transforms to BoolQuery before scoring; excludes source doc
- Add validation for doc_id xor like (mutually exclusive)
- Update all TermQuery constructors with new boost field
- Add get_query_terms and score_query handling for Mlt variant
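To illustrate how a per-term `boost` can carry significance into a `BoolQuery`, here is a rough sketch. The struct layouts below are simplified stand-ins, not the definitions in `cloudsearch-common`, and `boosted_bool_query` is a hypothetical helper. Normalising each boost by the top score is an assumption chosen for the example, and source-document exclusion is left out of the sketch.

```rust
/// Simplified stand-in for a term query with a per-term boost.
#[derive(Debug, Clone, PartialEq)]
pub struct TermQuery {
    pub field: String,
    pub term: String,
    pub boost: f32,
}

/// Simplified stand-in for a boolean query.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct BoolQuery {
    pub must: Vec<TermQuery>,
    pub should: Vec<TermQuery>,
    pub must_not: Vec<TermQuery>,
}

/// Turn (term, significance score) pairs into boosted `should` clauses.
/// Boosts are scaled so the most significant term gets 1.0 (an assumption).
pub fn boosted_bool_query(field: &str, scored_terms: &[(&str, f32)]) -> BoolQuery {
    let max = scored_terms
        .iter()
        .map(|(_, score)| *score)
        .fold(0.0_f32, f32::max);
    let should = scored_terms
        .iter()
        .map(|(term, score)| TermQuery {
            field: field.to_string(),
            term: term.to_string(),
            // Fall back to a neutral boost if every score is zero.
            boost: if max > 0.0 { *score / max } else { 1.0 },
        })
        .collect();
    BoolQuery {
        should,
        ..Default::default()
    }
}
```

Using only `should` clauses means a candidate document needs to match some, not all, of the extracted terms, and matches on higher-boost terms contribute more to its score.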

Tests verify MLT query validation and behavior with current constraints:
- doc_id source exclusion works
- like parameter works with raw JSON
- min_term_freq filtering logic works
- Neither doc_id nor like returns empty
- min_doc_freq and max_query_terms constraints work

Note: MLT scoring requires flushed positions data, so tests verify
the query returns empty until segment flushing integrates the
per-doc inverted index with positions_readers.
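The `doc_id` xor `like` rule described in these commits could be expressed roughly as below. This is a sketch under assumptions: the `MltQuery` fields are reduced to the two sources, `like` is modelled as a plain string rather than raw JSON, and `MltSource`, `MltError`, and `resolve_source` are hypothetical names. Treating "neither given" as `Ok(None)` mirrors the test note that such a query returns empty results; the real code may handle that case differently.

```rust
/// Reduced, hypothetical view of the MLT request: only the two sources.
#[derive(Debug, Default, Clone)]
pub struct MltQuery {
    pub doc_id: Option<String>,
    pub like: Option<String>,
}

/// Where the reference text comes from.
#[derive(Debug, PartialEq)]
pub enum MltSource<'a> {
    DocId(&'a str),
    Like(&'a str),
}

#[derive(Debug, PartialEq)]
pub enum MltError {
    /// `doc_id` and `like` are mutually exclusive.
    BothSourcesGiven,
}

/// Resolve the reference source, rejecting queries that set both fields.
/// `Ok(None)` means no source was given, so the caller can return no hits.
pub fn resolve_source(q: &MltQuery) -> Result<Option<MltSource<'_>>, MltError> {
    match (q.doc_id.as_deref(), q.like.as_deref()) {
        (Some(_), Some(_)) => Err(MltError::BothSourcesGiven),
        (Some(id), None) => Ok(Some(MltSource::DocId(id))),
        (None, Some(text)) => Ok(Some(MltSource::Like(text))),
        (None, None) => Ok(None),
    }
}
```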

coderabbitai Bot commented May 24, 2026

Warning

Review limit reached

@poyrazK, we couldn't start this review because you've used your available PR reviews for now.

📥 Commits

Reviewing files that changed from the base of the PR and between 9f7775b and 5df34b4.

📒 Files selected for processing (6)
  • rust/crates/cloudsearch-api/src/lib.rs
  • rust/crates/cloudsearch-api/src/query_string.rs
  • rust/crates/cloudsearch-common/src/lib.rs
  • rust/crates/cloudsearch-common/tests/round_trip.rs
  • rust/crates/cloudsearch-index/src/lib.rs
  • rust/crates/cloudsearch-index/tests/coverage.rs

1. Track term frequencies per-field, not globally
   - Changed term_freqs from HashMap<String, usize> to
     HashMap<String, HashMap<String, usize>> (field -> term -> tf)
   - Use original field when building should_clauses

2. Move source doc exclusion to search level
   - Removed must_not from BoolQuery (field _id may not be indexed)
   - Filter out mlt.doc_id directly in search() before scoring
   - Return doc_id_to_exclude alongside transformed query

3. Applied cargo fmt fixes
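
The two behavioural changes in this commit can be sketched together. The nested map type matches the one named in the commit message (field -> term -> tf); everything else is assumed for illustration: the `count_terms` and `exclude_source` helpers, whitespace tokenisation with lowercasing, and representing candidates as plain document-id strings.

```rust
use std::collections::HashMap;

/// field -> term -> term frequency, as described in the commit message.
pub type FieldTermFreqs = HashMap<String, HashMap<String, usize>>;

/// Count terms separately for each field of the reference document.
/// Tokenisation here (whitespace split + lowercase) is an assumption.
pub fn count_terms(fields: &[(&str, &str)]) -> FieldTermFreqs {
    let mut freqs = FieldTermFreqs::new();
    for (field, text) in fields {
        let per_field = freqs.entry(field.to_string()).or_default();
        for token in text.split_whitespace() {
            *per_field.entry(token.to_lowercase()).or_insert(0) += 1;
        }
    }
    freqs
}

/// Drop the source document from the candidate set before scoring,
/// instead of relying on a `must_not` clause over an `_id` field.
pub fn exclude_source(candidates: Vec<String>, doc_id_to_exclude: Option<&str>) -> Vec<String> {
    match doc_id_to_exclude {
        Some(id) => candidates.into_iter().filter(|c| c != id).collect(),
        None => candidates,
    }
}
```

Keeping counts per field means a term that appears in `title` produces a clause against `title` rather than being merged with occurrences from `body`. Filtering by id at the search level works even when `_id` is not an indexed field.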