feat(datafusion): tantivy-datafusion connector for SQL over tantivy splits #1
alexanderbianchi wants to merge 6 commits into bianchi/df-4-serve
Documents all changes made to tantivy-datafusion during the quickwit integration, upstream tantivy API requests, and architectural improvements needed in tantivy-datafusion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Working on some stuff locally. As of writing, we partition by segment, which is information local to the split, but splits should not be opened until the worker node. Splits are like fragments, and this architecture currently uses the Tantivy library as the "reader" node (basically: bolt). So what I'm heading for now is effectively the "logs-event-store-api" or "quickwit" level being distributed over splits, while tantivy in the table provider does the "bolt" / "logs-event-store-reader" work. Long term, sure, we could do two-step planning and push DataFusion closer to the data, further decomposing tantivy, but I'm not sure we need that at first.
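To make the split-level partitioning idea concrete, here is a minimal std-only sketch. It assumes the planner sees only split metadata and hands each worker a list of split references, so no split is opened before it reaches a worker. `SplitRef` and `assign_splits` are hypothetical names for illustration, not types from quickwit or tantivy-datafusion, and round-robin is just one possible assignment policy.

```rust
use std::collections::BTreeMap;

/// Hypothetical split descriptor: metadata only, no open index handle.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SplitRef {
    pub split_id: String,
}

/// Assigns splits to workers round-robin using metadata alone.
/// Opening the split (and enumerating its segments) is left to the worker.
pub fn assign_splits(
    splits: &[SplitRef],
    num_workers: usize,
) -> BTreeMap<usize, Vec<SplitRef>> {
    assert!(num_workers > 0, "need at least one worker");
    let mut plan: BTreeMap<usize, Vec<SplitRef>> = BTreeMap::new();
    for (i, split) in splits.iter().enumerate() {
        // Worker index is derived from the split's position, not its contents.
        plan.entry(i % num_workers).or_default().push(split.clone());
    }
    plan
}
```

With this shape, segment-level parallelism stays a worker-local concern, which matches the "tantivy does the reader work" division described above.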
Full-text search is broken on the latest commit.
Next up: clean up some of the IndexOpener stuff, then start looking into quickwit API behaviors like pagination or deterministic ordering and see if we can put them in the plan. Inter-split stuff is still done in tantivy.
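As a rough illustration of what deterministic ordering plus pagination could look like once per-split results are combined, here is a std-only sketch. The `Hit` struct, its fields, and the newest-first sort key are assumptions made for the example. They are not quickwit's actual ordering rules or types.

```rust
/// Hypothetical hit: a sort value plus a stable identity within the index.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Hit {
    pub timestamp: i64,
    pub split_id: String,
    pub doc_id: u32,
}

/// Combines per-split hits into one total order and returns a single page.
/// Ties on `timestamp` are broken by (split_id, doc_id), so the result does
/// not depend on the order in which splits report their hits.
pub fn paginate(per_split: Vec<Vec<Hit>>, offset: usize, limit: usize) -> Vec<Hit> {
    let mut all: Vec<Hit> = per_split.into_iter().flatten().collect();
    all.sort_by(|a, b| {
        b.timestamp
            .cmp(&a.timestamp) // newest first
            .then_with(|| a.split_id.cmp(&b.split_id))
            .then_with(|| a.doc_id.cmp(&b.doc_id))
    });
    all.into_iter().skip(offset).take(limit).collect()
}
```

A full sort is the simplest way to show the tie-breaking rule. A real plan would more likely use a top-k or sort-preserving merge so that each split only returns `offset + limit` rows.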
**What changed**: Removed sync document store reads from the `spawn_blocking`
batch generation path entirely. Document retrieval now happens asynchronously
after each batch exits `spawn_blocking`:
I don't think there is any need for spawn_blocking.
Prefer using one of the thread pools in quickwit_common.
spawn_blocking is really just for blocking IO.
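For readers less familiar with the distinction, the sketch below shows the general idea of running CPU-bound work on a small, fixed-size pool instead of a blocking-IO pool. It uses only the standard library. `CpuPool` is a hypothetical stand-in and is not the API of the thread pools in quickwit_common.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool for CPU-bound tasks (illustration only).
pub struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl CpuPool {
    pub fn new(num_threads: usize) -> Self {
        assert!(num_threads > 0, "need at least one thread");
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock guard is a temporary, so it is released
                    // before the job itself runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        CpuPool { sender: Some(sender), workers }
    }

    /// Submits a task and returns a channel that will carry its result.
    pub fn run<T, F>(&self, task: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have dropped the receiver; ignore that case.
            let _ = tx.send(task());
        });
        self.sender
            .as_ref()
            .expect("pool is running")
            .send(job)
            .expect("worker threads are alive");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel makes every worker's recv() fail and exit.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

The point of the bounded pool is that CPU-heavy batch generation cannot fan out into an unbounded number of threads, which is the risk when it shares a pool sized for blocking IO.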
Summary
- `TantivyDataSource` implementing `QuickwitDataSource` for SQL queries over tantivy-indexed log data
- `full_text()` UDF pushed down to tantivy's inverted index
- … `AggDataSource`)
- `TantivyCodec` for plan serialization across workers
- `Searcher::doc_async()` (per-block, no full-file preload)
- `POST /api/v1/_sql` endpoint with pretty-printed table output

Architecture
Depends on
Test plan
- `cargo test -p quickwit-datafusion -- tantivy::tests` (8 unit tests)
- `cargo test -p quickwit-integration-tests -- tantivy_datafusion_tests` (5 integration tests)

🤖 Generated with Claude Code