
feat: add KatanaDbExtractor for direct database sync#18

Draft
kariy wants to merge 6 commits into main from feat/katana-db-extractor

Conversation

@kariy (Member) commented Mar 4, 2026

This adds a new KatanaDbExtractor that implements the Extractor trait by reading block data directly from Katana's MDBX database via DbProviderFactory, instead of going through JSON-RPC. When torii runs alongside Katana on the same machine, this eliminates network overhead entirely and enables significantly faster block syncing.

The extractor opens the database in read-only mode (Db::open_ro) so it can safely run concurrently with a live Katana instance. Per extraction batch, it creates a fresh read-only provider and queries headers, transactions, receipts, declared classes, and deployed contracts for each block in the range. The cursor format (block:N) and persistence mechanism are consistent with the existing BlockRangeExtractor.

A max_events_per_batch config (default 100,000) caps the number of events per batch to bound memory usage. When the limit is reached mid-range, the batch is returned early with fewer blocks than batch_size, and the cursor points to the last fully processed block so the next call resumes correctly.

Benchmark

Benchmarked against a real Katana Sepolia database (492 GB MDBX, 884,993 blocks) using --release builds with a batch size of 1,000 blocks and a max_events_per_batch of 100,000. Rates are calculated as total_count / wall_time where wall time is measured from the first extract() call to the last using std::time::Instant, covering the full extraction loop including provider queries and batch construction — but excluding sink processing, since this benchmark only exercises the extractor.

Machine: AMD EPYC 9124 (16-core, 32 threads) · 124 GB RAM · 3.5 TB ext4 SSD

| Metric | Value |
| --- | --- |
| Database | Katana Sepolia (492 GB MDBX) |
| Blocks processed | 884,993 |
| Total events | 94,874,059 |
| Total transactions | 13,548,558 |
| Total declared classes | 45,704 |
| Total deployed contracts | 863,389 |
| Batches | 1,357 |
| Wall time | 885.5 s (~14.8 min) |
| Events/sec | 107,144 |
| Transactions/sec | 15,301 |
| Blocks/sec | 999 |
| Avg batch time | 652.5 ms |

Rust Version Compatibility

There is a Rust toolchain compatibility issue worth noting. Katana's transitive dependency on starknet-types-core v0.1.x pulls in the size-of v0.1.5 crate, which uses platform-specific ABIs (aapcs, sysv64, stdcall, fastcall) that became hard errors on aarch64 targets starting with Rust 1.85 (the unsupported_calling_conventions lint). At the same time, Katana's alloy dependencies (resolved to ^1.2) require Rust 1.88+. This creates a narrow compatibility window in which only Rust 1.88.0 successfully compiles the combined dependency tree: older versions fail the alloy MSRV check, and newer versions (1.90+) reject the size-of ABIs. The lockfile has been pinned with compatible alloy versions (1.2.1) accordingly. This will resolve itself once Katana's upstream dependencies move to starknet-types-core v0.2.x, which drops the size-of dependency.
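If the repository pins its toolchain, the working version can be locked with a rust-toolchain.toml like the following. This is a hypothetical example — the PR only mentions pinning the lockfile, not the toolchain:

```toml
# Hypothetical pin: only 1.88.0 compiles the current dependency tree.
# Older toolchains fail alloy's MSRV check; 1.90+ rejects size-of's ABIs.
[toolchain]
channel = "1.88.0"
```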

🤖 Generated with Claude Code

kariy and others added 4 commits March 4, 2026 14:36
Add a new extractor that reads block data directly from Katana's MDBX
database via DbProviderFactory, bypassing JSON-RPC entirely. This
enables significantly faster block syncing with zero network latency
when running alongside Katana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch katana-db, katana-provider, and katana-primitives from local
path dependencies to git dependencies pinned at rev 7e6fef8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests use the spawn_and_move fixture database from katana to verify:
- First batch extraction with correct block count and cursor
- Full chain extraction covering all blocks
- Event context integrity (block/tx references match)
- Transaction field validity
- Batch boundary behavior when beyond chain head
- Full Extractor trait loop with cursor commit
- Cursor resume correctly skips already-processed blocks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs the extractor through the entire database with timing metrics,
reporting blocks/sec, transactions/sec, and events/sec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kariy kariy changed the title feat: add KatanaDbExtractor for direct database sync feat: add KatanaDbExtractor for direct database sync Mar 6, 2026
kariy and others added 2 commits March 6, 2026 15:12
When blocks contain dense event data (e.g., 1M+ events per 1000 blocks),
the extractor now yields a partial batch early once the event count
threshold is reached (default: 100,000). The cursor points to the last
fully processed block so the next call resumes correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The benchmark loop previously terminated only when is_finished() returned true,
which never happens in follow-chain-head mode (to_block=None). Now it breaks on any empty batch.
Also adds probe examples for debugging and from_block CLI arg.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>