perf(bulker) snowflake: sort dedup CTAS by timestamp#1347

Open
absorbb wants to merge 1 commit into newjitsu from snowflake-merge-split

Contributor

absorbb commented Jun 4, 2026

Follow-up to #1346. Asked whether the dedup TEMPORARY table is already presorted by timestamp — it isn't (QUALIFY emits rows in window-function-output order, roughly PK-grouped) — and whether presorting would help. It does, but only for INSERT's write path.

Summary

Add an optional `ORDER BY {timestamp_col}` clause to the dedup CTAS template (gated on a new `QueryPayload.DedupOrderBy` field; set in `copyOrMergeSplit` only when `targetTable.TimestampColumn != ""`).

Why

Target tables created by bulker use `CLUSTER BY (TO_DATE(timestamp))` (`sfAlterClusteringKeyTemplate`). When the INSERT stage writes new rows into T, Snowflake creates new micro-partitions whose `TO_DATE(ts)` min/max are determined by the input row order:

  • Without sort (status quo): dedup output is roughly PK-grouped, so within a single new micro-partition rows span the full date range of the batch. Auto-clustering then has to repartition that data to restore the cluster key — billable warehouse work.
  • With sort: dedup rows hit storage in timestamp order, so new T micro-partitions have tight `TO_DATE(ts)` ranges along the cluster key from the start. Auto-clustering has almost nothing to do.
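
The effect of row order on per-partition date ranges can be illustrated with a toy model. The sketch below is not Snowflake behaviour: it assumes fixed-size "micro-partitions" of 100 rows and a synthetic batch in which neighbouring PKs carry unrelated dates. It only shows why the min/max range per partition collapses once rows are sorted by date before being chunked.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a toy record: a primary key and a day number standing in for TO_DATE(ts).
type row struct {
	pk  int
	day int
}

// maxDaySpan chunks rows into fixed-size partitions in their current order
// and returns the widest (max day - min day) found in any partition.
func maxDaySpan(rows []row, partSize int) int {
	widest := 0
	for start := 0; start < len(rows); start += partSize {
		end := start + partSize
		if end > len(rows) {
			end = len(rows)
		}
		lo, hi := rows[start].day, rows[start].day
		for _, r := range rows[start:end] {
			if r.day < lo {
				lo = r.day
			}
			if r.day > hi {
				hi = r.day
			}
		}
		if hi-lo > widest {
			widest = hi - lo
		}
	}
	return widest
}

// makeBatch builds n rows in PK order whose days cycle through 0..days-1,
// mimicking PK-grouped output where adjacent rows have unrelated dates.
func makeBatch(n, days int) []row {
	rows := make([]row, n)
	for i := range rows {
		rows[i] = row{pk: i, day: (i * 7) % days}
	}
	return rows
}

// sortByDay returns a copy of rows ordered by day, leaving the input untouched.
func sortByDay(rows []row) []row {
	out := append([]row(nil), rows...)
	sort.SliceStable(out, func(a, b int) bool { return out[a].day < out[b].day })
	return out
}

func main() {
	batch := makeBatch(1000, 10)
	fmt.Println("PK-grouped, widest day span per partition:", maxDaySpan(batch, 100))
	fmt.Println("ts-sorted,  widest day span per partition:", maxDaySpan(sortByDay(batch), 100))
}
```

In this model every PK-ordered partition covers the whole 10-day batch, while the sorted partitions each hold a single day. Real micro-partitions are sized by bytes rather than row count, so the numbers are illustrative only.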

UPDATE is unaffected — rewriting a micro-partition preserves T's existing clustering layout regardless of source row order. The join itself is unaffected — hash-join doesn't care about input order on either side.

Cost: O(N log N) sort on the already-deduped row set during the CTAS. Sub-second for typical batch sizes; trivial compared to even one auto-clustering pass on T.

Test plan

  • Build green (`go build ./bulkerlib/...` in `bulker/`) — done locally.
  • Run a Snowflake sync on an INSERT-heavy table and observe the auto-clustering credits used by the target table over the next few hours — should drop vs. the pre-PR baseline.
  • Run a sync where the target has no timestamp column — `DedupOrderBy` stays empty, dedup template renders unchanged.
  • Compare `dedup` stage time before/after — small bump expected (sort cost), should be sub-second.

🤖 Generated with Claude Code

The dedup CTAS used to emit rows in window-function-output order
(roughly PK-grouped), which meant INSERT later wrote new T
micro-partitions whose TO_DATE(ts) ranges spanned the whole batch.
T is clustered by TO_DATE(timestamp) (sfAlterClusteringKeyTemplate),
so those wide new micro-partitions force auto-clustering to re-sort
them later — billable warehouse work.

Add ORDER BY {ts} to the dedup template (gated on a new DedupOrderBy
QueryPayload field) so the dedup rows hit storage in timestamp order.
INSERT then writes new micro-partitions whose TO_DATE(ts) min/max are
already tight along the cluster key; auto-clustering has almost
nothing to do.

UPDATE is unaffected (rewriting micro-partitions preserves T's existing
clustering); the join cost is unaffected (PK-keyed, not ts-keyed).
ORDER BY adds an O(N log N) sort on the already-deduped row set —
sub-second for typical batch sizes.

Only enabled when targetTable.TimestampColumn != ""; templates that
don't set DedupOrderBy keep the old behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jitsu-code-review (bot) left a comment

Reviewed the Snowflake split-merge changes in snowflake.go plus the QueryPayload update. I found one correctness risk worth addressing: the new dedup CTAS ORDER BY can reference a timestamp column that is present on the destination table metadata but absent from the current source batch schema, which would fail the dedup stage at runtime.

// strictly an INSERT-side optimisation; no benefit (and no harm)
// when there is no timestamp column.
var dedupOrderBy string
if targetTable.TimestampColumn != "" {

Possible runtime regression: this uses targetTable.TimestampColumn unconditionally when the target has one, but the dedup CTAS reads from sourceTable columns for the current batch. If a batch doesn’t carry that timestamp field (while destination metadata still has it), the generated ORDER BY references a missing column and dedup fails with an invalid identifier. Should we gate this with sourceTable.Columns.Get(targetTable.TimestampColumn) before setting DedupOrderBy?
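
One way the suggested guard could look, as a sketch only: the `table` and `columns` types below are simplified, hypothetical stand-ins (a plain map lookup replaces `sourceTable.Columns.Get`, whose exact signature is not shown in this thread), and `dedupOrderBy` is an illustrative helper name.

```go
package main

import "fmt"

// columns is a hypothetical stand-in for a table's column set (name -> SQL type).
type columns map[string]string

// table is a simplified stand-in for the bulker table metadata used here.
type table struct {
	Name            string
	TimestampColumn string
	Columns         columns
}

// dedupOrderBy returns the column to sort the dedup CTAS by, or "" when the
// sort should be skipped: either the target has no timestamp column, or the
// current source batch does not carry that column.
func dedupOrderBy(source, target table) string {
	ts := target.TimestampColumn
	if ts == "" {
		return ""
	}
	if _, ok := source.Columns[ts]; !ok {
		return ""
	}
	return ts
}

func main() {
	target := table{
		Name:            "events",
		TimestampColumn: "timestamp",
		Columns:         columns{"id": "TEXT", "timestamp": "TIMESTAMP_TZ"},
	}
	batchWithoutTs := table{Name: "events_tmp", Columns: columns{"id": "TEXT"}}
	batchWithTs := table{Name: "events_tmp", Columns: columns{"id": "TEXT", "timestamp": "TIMESTAMP_TZ"}}

	fmt.Printf("batch without ts: %q\n", dedupOrderBy(batchWithoutTs, target))
	fmt.Printf("batch with ts:    %q\n", dedupOrderBy(batchWithTs, target))
}
```

Skipping the sort for such a batch only forgoes the clustering benefit for that one load, whereas referencing the missing column would fail the dedup stage outright.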
