feat(index): add More Like This (MLT) query support #111
- Add MltQuery struct and Mlt variant to SearchQuery enum
- Add boost field to TermQuery for per-term scoring control
- Implement build_mlt_bool_query to extract significant terms from the reference doc using TF*IDF significance scoring
- MLT transforms to BoolQuery before scoring; excludes source doc
- Add validation for doc_id xor like (mutually exclusive)
- Update all TermQuery constructors with new boost field
- Add get_query_terms and score_query handling for Mlt variant
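The term-selection step can be illustrated with a minimal sketch. It is not the code from build_mlt_bool_query: the `MltParams` struct, the `select_terms` function, and the smoothed IDF formula are assumptions made for illustration, and only the parameter names min_term_freq, min_doc_freq, and max_query_terms come from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical container for the MLT constraints named in this PR.
struct MltParams {
    min_term_freq: usize,
    min_doc_freq: usize,
    max_query_terms: usize,
}

/// Picks the most significant terms of a reference document.
/// `term_freqs`: term -> frequency inside the reference document.
/// `doc_freqs`:  term -> number of indexed documents containing the term.
/// Returns (term, score) pairs, highest score first.
fn select_terms(
    term_freqs: &HashMap<String, usize>,
    doc_freqs: &HashMap<String, usize>,
    total_docs: usize,
    params: &MltParams,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = Vec::new();
    for (term, &tf) in term_freqs {
        // Skip terms that are too rare inside the reference document.
        if tf < params.min_term_freq {
            continue;
        }
        // Skip terms that are unknown to the index or too rare across it.
        let df = doc_freqs.get(term).copied().unwrap_or(0);
        if df == 0 || df < params.min_doc_freq {
            continue;
        }
        // Smoothed IDF; the exact formula is an assumption of this sketch.
        let idf = (1.0 + total_docs as f32 / df as f32).ln();
        scored.push((term.clone(), tf as f32 * idf));
    }
    // Highest score first; ties are broken by term so the order is deterministic.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.truncate(params.max_query_terms);
    scored
}
```

Each selected pair could then become one should clause, with the score carried in the new TermQuery boost field.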
Tests verify MLT query validation and behavior with current constraints:
- doc_id source exclusion works
- like parameter works with raw JSON
- min_term_freq filtering logic works
- Neither doc_id nor like returns empty
- min_doc_freq and max_query_terms constraints work

Note: MLT scoring requires flushed positions data, so tests verify the query returns empty until segment flushing integrates the per-doc inverted index with positions_readers.
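The doc_id/like cases covered by these tests might be modeled as in the sketch below. The field types, the `MltSource` enum, and the `resolve_source` function are hypothetical. The sketch also assumes that supplying both doc_id and like is rejected, while supplying neither yields an empty result rather than an error.

```rust
/// Hypothetical, reduced shape of the MLT query; the real MltQuery has more fields.
#[derive(Debug, Default)]
struct MltQuery {
    doc_id: Option<String>,
    like: Option<String>, // raw JSON of an ad-hoc document
}

/// Where the reference terms come from.
#[derive(Debug, PartialEq)]
enum MltSource<'a> {
    DocId(&'a str),
    Like(&'a str),
    Empty, // nothing to compare against: the caller returns no hits
}

/// Resolves the reference source, rejecting queries that set both options.
fn resolve_source(q: &MltQuery) -> Result<MltSource<'_>, String> {
    match (q.doc_id.as_deref(), q.like.as_deref()) {
        (Some(_), Some(_)) => Err("doc_id and like are mutually exclusive".to_string()),
        (Some(id), None) => Ok(MltSource::DocId(id)),
        (None, Some(doc)) => Ok(MltSource::Like(doc)),
        (None, None) => Ok(MltSource::Empty),
    }
}
```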
1. Track term frequencies per-field, not globally
- Changed term_freqs from HashMap<String, usize> to
HashMap<String, HashMap<String, usize>> (field -> term -> tf)
- Use original field when building should_clauses
2. Move source doc exclusion to search level
- Removed must_not from BoolQuery (field _id may not be indexed)
- Filter out mlt.doc_id directly in search() before scoring
- Return doc_id_to_exclude alongside transformed query
3. Applied cargo fmt fixes
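Items 1 and 2 can be sketched as follows. The whitespace tokenizer and the helper names `count_terms` and `exclude_source` are assumptions made for illustration. Only the field -> term -> tf map shape and the doc_id_to_exclude idea come from this commit.

```rust
use std::collections::HashMap;

/// field -> term -> term frequency, matching the map shape in item 1.
type FieldTermFreqs = HashMap<String, HashMap<String, usize>>;

/// Counts lowercased, whitespace-separated tokens separately for each field.
/// The tokenizer is a stand-in; the real analyzer is not shown in this PR.
fn count_terms(fields: &[(&str, &str)]) -> FieldTermFreqs {
    let mut freqs = FieldTermFreqs::new();
    for (field, text) in fields {
        let per_field = freqs.entry(field.to_string()).or_default();
        for token in text.split_whitespace() {
            *per_field.entry(token.to_lowercase()).or_insert(0) += 1;
        }
    }
    freqs
}

/// Drops the source document from the candidate ids before scoring (item 2),
/// instead of relying on a must_not clause over a possibly unindexed _id field.
fn exclude_source(candidates: Vec<String>, doc_id_to_exclude: Option<&str>) -> Vec<String> {
    candidates
        .into_iter()
        .filter(|id| Some(id.as_str()) != doc_id_to_exclude)
        .collect()
}
```

Keeping the counts per field means a term that appears only in one field produces a should clause against that field alone, rather than against every field.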
Summary
- must_not exclusion of source doc
- boost field to TermQuery for per-term scoring control
Test plan
Notes
MLT scoring requires flushed positions data: positions_readers contains data only after segments have been flushed. The per-doc inverted index built during indexing is not yet integrated with positions_readers.