feat(datafusion): tantivy-datafusion connector for SQL over tantivy splits#1

Draft
alexanderbianchi wants to merge 6 commits into bianchi/df-4-serve from bianchi/quickwit-tantivy-datafusion

Conversation

@alexanderbianchi
Owner

Summary

  • Adds TantivyDataSource implementing QuickwitDataSource for SQL queries over tantivy-indexed log data
  • Full-text search via full_text() UDF pushed down to tantivy's inverted index
  • Fast field filter pushdown to tantivy range queries
  • Tantivy native aggregation pushdown (AggDataSource)
  • Distributed execution with TantivyCodec for plan serialization across workers
  • Async document retrieval via Searcher::doc_async() (per-block, no full-file preload)
  • REST POST /api/v1/_sql endpoint with pretty-printed table output
  • 8 unit tests + 5 full cluster sandbox integration tests
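To illustrate the two pushdown bullets above, here is a minimal sketch of how a scan might separate filters the index can evaluate from filters that stay in the SQL engine. `Filter`, `IndexQuery`, and `split_pushdown` are hypothetical, simplified stand-ins; they are not DataFusion's `Expr` type or this PR's actual code.

```rust
/// Hypothetical, simplified filter type (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Filter {
    /// A `full_text(field, query)` UDF call.
    FullText { field: String, query: String },
    /// `field >= lo AND field < hi`.
    Range { field: String, lo: i64, hi: i64 },
    /// Any other expression, kept as opaque text here.
    Other(String),
}

/// Hypothetical description of what gets pushed into the index.
#[derive(Debug, Clone, PartialEq)]
pub enum IndexQuery {
    InvertedIndex { field: String, query: String },
    FastFieldRange { field: String, lo: i64, hi: i64 },
}

/// Splits `filters` into (pushed down to the index, left for the SQL engine).
/// A range filter is pushed down only if its column is a fast field.
pub fn split_pushdown(
    filters: Vec<Filter>,
    fast_fields: &[&str],
) -> (Vec<IndexQuery>, Vec<Filter>) {
    let mut pushed = Vec::new();
    let mut residual = Vec::new();
    for filter in filters {
        match filter {
            Filter::FullText { field, query } => {
                pushed.push(IndexQuery::InvertedIndex { field, query })
            }
            Filter::Range { field, lo, hi } if fast_fields.contains(&field.as_str()) => {
                pushed.push(IndexQuery::FastFieldRange { field, lo, hi })
            }
            other => residual.push(other),
        }
    }
    (pushed, residual)
}
```

Anything left in the residual list would still be evaluated by the engine on the rows the index returns, so classifying a filter as residual costs performance but not correctness.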

Architecture

SQL query → DataFusionService::execute_sql
  → QuickwitSchemaProvider resolves index
  → TantivyDataSource::create_default_table_provider (opens 1 split for schema)
  → TantivyTableProvider::scan (lists all splits, creates QuickwitSplitOpener per split)
  → SingleTableProvider::from_splits (partition-per-segment parallelism)
  → DistributedExec (splits work across searcher nodes)
  → AggDataSource (native tantivy aggregation when applicable)
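
As a rough illustration of the "splits work across searcher nodes" step, the sketch below assigns split metadata to nodes. Round-robin is an assumption chosen for brevity and may not match how `DistributedExec` places work; `SplitMeta` and `assign_splits` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical split descriptor: metadata only, no opened index.
#[derive(Debug, Clone, PartialEq)]
pub struct SplitMeta {
    pub split_id: String,
    pub num_docs: u64,
}

/// Assigns splits to nodes round-robin. Splits are sorted by id first so
/// the same inputs always produce the same assignment.
pub fn assign_splits(
    mut splits: Vec<SplitMeta>,
    nodes: &[String],
) -> BTreeMap<String, Vec<SplitMeta>> {
    let mut assignment: BTreeMap<String, Vec<SplitMeta>> =
        nodes.iter().map(|node| (node.clone(), Vec::new())).collect();
    if nodes.is_empty() {
        return assignment;
    }
    splits.sort_by(|a, b| a.split_id.cmp(&b.split_id));
    for (i, split) in splits.into_iter().enumerate() {
        let node = &nodes[i % nodes.len()];
        assignment
            .get_mut(node)
            .expect("every node was inserted above")
            .push(split);
    }
    assignment
}
```

Only metadata moves at this stage; each node would open its own splits when it executes its part of the plan.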

Depends on

Test plan

  • cargo test -p quickwit-datafusion -- tantivy::tests (8 unit tests)
  • cargo test -p quickwit-integration-tests -- tantivy_datafusion_tests (5 integration tests)
  • Manual: distributed SQL queries with 4 searcher nodes
  • Manual: full-text search + document retrieval + aggregations via REST endpoint
  • Load test with hundreds of splits

🤖 Generated with Claude Code

alexanderbianchi and others added 2 commits April 11, 2026 00:57
Documents all changes made to tantivy-datafusion during the quickwit
integration, upstream tantivy API requests, and architectural improvements
needed in tantivy-datafusion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexanderbianchi
Owner Author

Working on some changes locally. As of writing, we partition by segment, which is information local to the split, but splits should not be opened until the plan reaches the worker node. Splits are like fragments, and this architecture currently uses the Tantivy library as the "reader" node (basically bolt). So what I'm heading toward is effectively the "logs-event-store-api" or "quickwit" level being distributed over splits, while tantivy in the table provider does the "bolt" / "logs-event-store-reader" work. Long term, sure, we could do two-step planning and push DataFusion closer to the data, further decomposing tantivy, but I'm not sure we need that at first.
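
A minimal sketch of that direction, assuming one plan partition per split and a handle that is only opened on the worker. `LazySplit`, `OpenedSplit`, `plan_partitions`, and `execute_partition` are hypothetical names, and the open counter exists only to make the laziness observable.

```rust
use std::cell::Cell;

/// Hypothetical opened split: here just the list of its segment ids.
pub struct OpenedSplit {
    pub segments: Vec<u32>,
}

/// Hypothetical lazy handle. It carries only metadata until `open` is called.
pub struct LazySplit {
    pub split_id: String,
    num_segments: u32,
    opens: Cell<u32>,
}

impl LazySplit {
    pub fn new(split_id: &str, num_segments: u32) -> Self {
        LazySplit { split_id: split_id.to_string(), num_segments, opens: Cell::new(0) }
    }

    /// Stands in for the expensive step (fetching and opening the index).
    pub fn open(&self) -> OpenedSplit {
        self.opens.set(self.opens.get() + 1);
        OpenedSplit { segments: (0..self.num_segments).collect() }
    }

    pub fn open_count(&self) -> u32 {
        self.opens.get()
    }
}

/// Planning side: one partition per split, without opening anything.
pub fn plan_partitions(splits: &[LazySplit]) -> Vec<usize> {
    (0..splits.len()).collect()
}

/// Worker side: open the assigned split, then fan out over its segments
/// locally. Returns the number of segments that would be scanned.
pub fn execute_partition(split: &LazySplit) -> usize {
    let opened = split.open();
    opened.segments.len()
}
```

With this shape, segment-level parallelism becomes a decision made on the worker after the split is opened, rather than something the planner needs to know.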

@alexanderbianchi
Owner Author

Full-text search is broken on the latest commit.

@alexanderbianchi
Owner Author

Next up: clean up some of the IndexOpener code, then start looking into quickwit API behaviors like pagination or deterministic ordering and see if we can express them in the plan. Inter-split work is still done in tantivy.
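
For the pagination and deterministic ordering point, one possible shape is a total order over hits with a tie-break, applied before offset and limit. This is only a sketch under assumptions: the sort key (timestamp descending, then split id, then doc id) and the `Hit` and `paginate` names are hypothetical, not quickwit's actual behavior.

```rust
/// Hypothetical hit returned by one split.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Hit {
    pub timestamp: i64,
    pub split_id: String,
    pub doc_id: u32,
}

impl Hit {
    pub fn new(timestamp: i64, split_id: &str, doc_id: u32) -> Self {
        Hit { timestamp, split_id: split_id.to_string(), doc_id }
    }
}

/// Merges per-split hits into one total order, then applies offset/limit.
/// Ties on timestamp are broken by (split_id, doc_id), so the result does
/// not depend on the order in which splits respond.
pub fn paginate(per_split_hits: Vec<Vec<Hit>>, offset: usize, limit: usize) -> Vec<Hit> {
    let mut all: Vec<Hit> = per_split_hits.into_iter().flatten().collect();
    all.sort_by(|a, b| {
        b.timestamp
            .cmp(&a.timestamp) // newest first
            .then_with(|| a.split_id.cmp(&b.split_id))
            .then_with(|| a.doc_id.cmp(&b.doc_id))
    });
    all.into_iter().skip(offset).take(limit).collect()
}
```

A real plan would more likely keep a bounded top-K per split instead of sorting every hit; the full sort here just keeps the tie-break rule easy to see.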


**What changed**: Removed sync document store reads from the `spawn_blocking`
batch generation path entirely. Document retrieval now happens asynchronously
after each batch exits `spawn_blocking`:

I don't think there is any need for spawn_blocking.

Prefer using one of the thread pools in quickwit_common.

spawn_blocking is really just for blocking IO.
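
To make the suggestion concrete, here is a minimal std-only sketch of a bounded pool for CPU-bound work. It is not the quickwit_common API, whose exact signatures are not shown in this thread; `CpuPool` and `run` are hypothetical names. The point is that batch generation is CPU-bound, so it fits a small fixed set of threads better than tokio's blocking pool, which is sized for tasks that spend their time waiting on IO.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool for CPU-bound tasks.
pub struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuPool {
    pub fn new(num_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads.max(1))
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is released at the end of this statement,
                    // so it is not held while the job runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        CpuPool { sender: Some(sender), workers }
    }

    /// Queues `task` and returns a receiver that yields its result.
    pub fn run<T, F>(&self, task: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have dropped the receiver; ignore that case.
            let _ = tx.send(task());
        });
        self.sender
            .as_ref()
            .expect("pool is running")
            .send(job)
            .expect("worker threads are alive");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel lets each worker exit its loop.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

In async code the result would typically come back through an async-aware oneshot channel instead of a blocking `recv`, so the calling task yields while the pool works.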

2 participants