perf(bulker) snowflake: sort dedup CTAS by timestamp by absorbb · Pull Request #1347 · jitsucom/jitsu

absorbb · 2026-06-04T17:12:27Z

Follow-up to #1346. The question was whether the dedup TEMPORARY table is already presorted by timestamp, and whether presorting would help. It is not presorted: QUALIFY emits rows in window-function-output order, roughly PK-grouped. Presorting does help, but only for INSERT's write path.

Summary

Add an optional `ORDER BY {timestamp_col}` clause to the dedup CTAS template (gated on a new `QueryPayload.DedupOrderBy` field; set in `copyOrMergeSplit` only when `targetTable.TimestampColumn != ""`).
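A minimal sketch of how such a gated clause could look, assuming the dedup template is a Go `text/template`. The `queryPayload` struct, its `TempTable` and `DedupSelect` fields, and the template text are hypothetical stand-ins; only the `DedupOrderBy` field name comes from this PR.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// queryPayload is a hypothetical stand-in for bulker's QueryPayload.
// Only DedupOrderBy is named in the PR; the other fields are illustrative.
type queryPayload struct {
	TempTable    string
	DedupSelect  string // the existing SELECT ... QUALIFY statement
	DedupOrderBy string // empty means: render without ORDER BY
}

// dedupCTAS is an illustrative template, not the actual bulker SQL.
// The {{if}} guard keeps the output unchanged when DedupOrderBy is empty.
var dedupCTAS = template.Must(template.New("dedup").Parse(
	`CREATE TEMPORARY TABLE {{.TempTable}} AS {{.DedupSelect}}` +
		`{{if .DedupOrderBy}} ORDER BY {{.DedupOrderBy}}{{end}}`))

// render executes the template and returns the resulting SQL.
func render(p queryPayload) string {
	var sb strings.Builder
	if err := dedupCTAS.Execute(&sb, p); err != nil {
		panic(err)
	}
	return sb.String()
}

func main() {
	sel := `SELECT * FROM "src" QUALIFY ROW_NUMBER() OVER (PARTITION BY "id" ORDER BY "_seq" DESC) = 1`
	// No timestamp column: statement ends at the QUALIFY clause.
	fmt.Println(render(queryPayload{TempTable: `"dedup_tmp"`, DedupSelect: sel}))
	// Timestamp column present: ORDER BY is appended.
	fmt.Println(render(queryPayload{TempTable: `"dedup_tmp"`, DedupSelect: sel, DedupOrderBy: `"timestamp"`}))
}
```

Keeping the clause inside the template guard means callers that never set `DedupOrderBy` get byte-identical SQL to before.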

Why

Target tables created by bulker use `CLUSTER BY (TO_DATE(timestamp))` (`sfAlterClusteringKeyTemplate`). When the INSERT stage writes new rows into T, Snowflake creates new micro-partitions whose `TO_DATE(ts)` min/max are determined by the input row order:

Without sort (status quo): dedup output is roughly PK-grouped, so within a single new micro-partition rows span the full date range of the batch. Auto-clustering then has to repartition that data to restore the cluster key — billable warehouse work.
With sort: dedup rows hit storage in timestamp order, so new T micro-partitions have tight `TO_DATE(ts)` ranges along the cluster key from the start. Auto-clustering has almost nothing to do.

UPDATE is unaffected — rewriting a micro-partition preserves T's existing clustering layout regardless of source row order. The join itself is unaffected — hash-join doesn't care about input order on either side.

Cost: O(N log N) sort on the already-deduped row set during the CTAS. Sub-second for typical batch sizes; trivial compared to even one auto-clustering pass on T.

Test plan

Build green (`go build ./bulkerlib/...` in `bulker/`) — done locally.
Run a Snowflake sync on an INSERT-heavy table and observe the auto-clustering credits used by the target table over the next few hours — should drop vs. the pre-PR baseline.
Run a sync where the target has no timestamp column — `DedupOrderBy` stays empty, dedup template renders unchanged.
Compare `dedup` stage time before/after — small bump expected (sort cost), should be sub-second.

🤖 Generated with Claude Code

The dedup CTAS used to emit rows in window-function-output order (roughly PK-grouped), which meant INSERT later wrote new T micro-partitions whose TO_DATE(ts) ranges spanned the whole batch. T is clustered by TO_DATE(timestamp) (sfAlterClusteringKeyTemplate), so those wide new micro-partitions force auto-clustering to re-sort them later, which is billable warehouse work.

Add ORDER BY {ts} to the dedup template (gated on a new DedupOrderBy QueryPayload field) so the dedup rows hit storage in timestamp order. INSERT then writes new micro-partitions whose TO_DATE(ts) min/max are already tight along the cluster key; auto-clustering has almost nothing to do.

UPDATE is unaffected (rewriting micro-partitions preserves T's existing clustering); the join cost is unaffected (PK-keyed, not ts-keyed). ORDER BY adds an O(N log N) sort on the already-deduped row set, which is sub-second for typical batch sizes.

Only enabled when targetTable.TimestampColumn != ""; templates that don't set DedupOrderBy keep the old behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jitsu-code-review

Reviewed the Snowflake split-merge changes in snowflake.go plus the QueryPayload update. I found one correctness risk worth addressing: the new dedup CTAS ORDER BY can reference a timestamp column that is present on the destination table metadata but absent from the current source batch schema, which would fail the dedup stage at runtime.

jitsu-code-review · 2026-06-04T17:17:51Z

+	// strictly an INSERT-side optimisation; no benefit (and no harm)
+	// when there is no timestamp column.
+	var dedupOrderBy string
+	if targetTable.TimestampColumn != "" {


Possible runtime regression: this uses targetTable.TimestampColumn unconditionally when the target has one, but the dedup CTAS reads from sourceTable columns for the current batch. If a batch doesn’t carry that timestamp field (while destination metadata still has it), the generated ORDER BY references a missing column and dedup fails with an invalid identifier. Should we gate this with sourceTable.Columns.Get(targetTable.TimestampColumn) before setting DedupOrderBy?
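The suggested gate could be sketched roughly as follows. The `columns` and `table` types here are hypothetical simplifications (a plain map lookup stands in for `sourceTable.Columns.Get`), and the helper name `dedupOrderBy` is made up for illustration; identifier quoting is left out.

```go
package main

import "fmt"

// columns is a hypothetical stand-in for a table schema
// (column name -> SQL type); bulker's real Columns type differs.
type columns map[string]string

// table carries only the fields this sketch needs.
type table struct {
	Columns         columns
	TimestampColumn string
}

// dedupOrderBy returns the column to sort the dedup CTAS by, or "" when
// sorting should be skipped: either the target has no timestamp column,
// or the current source batch does not carry that column.
func dedupOrderBy(source, target table) string {
	ts := target.TimestampColumn
	if ts == "" {
		return ""
	}
	if _, ok := source.Columns[ts]; !ok {
		return "" // column missing from this batch: fall back to unsorted CTAS
	}
	return ts
}

func main() {
	target := table{TimestampColumn: "timestamp"}
	withTS := table{Columns: columns{"id": "TEXT", "timestamp": "TIMESTAMP_TZ"}}
	withoutTS := table{Columns: columns{"id": "TEXT"}}

	fmt.Printf("%q\n", dedupOrderBy(withTS, target))
	fmt.Printf("%q\n", dedupOrderBy(withoutTS, target))
	fmt.Printf("%q\n", dedupOrderBy(withTS, table{}))
}
```

With this shape, a batch that lacks the timestamp field degrades to the pre-PR behaviour instead of failing the dedup stage.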

jitsu-code-review Bot approved these changes Jun 4, 2026

absorbb wants to merge 1 commit into newjitsu from snowflake-merge-split
