
sql/colexec: add multi-level spill join with robust file lifecycle #23915

Open
aunjgr wants to merge 11 commits into matrixorigin:main from aunjgr:multi_spill

@aunjgr commented Mar 20, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #3433 #23353

What this PR does / why we need it:

Implements recursive (multi-level) spill for hash join when a single
spill pass is insufficient, along with comprehensive fixes to the spill
file lifecycle to prevent orphaned files under all cancellation paths.

Multi-level spill (hashjoin/spill.go)

  • rebuildHashmapForBucket: after reading a build bucket, if memory
    still exceeds the threshold and depth < spillMaxPass, re-spills the
    bucket to the next depth instead of OOM-ing.
  • reSpillBucket: scatters both the build and probe sides of a bucket
    into sub-buckets and enqueues them for the next pass.
  • Sub-bucket naming follows join_<uuid>_<i0>_<i1>_..._<iN>_build/probe
    so each level's ancestry is encoded in the filename.
  • Seed-based XXHash (computeXXHash) uses a per-depth seed to avoid
    degenerate distributions when the same keys re-spill at deeper levels.
  • shouldReSpill checks live memory against the threshold to decide
    whether another spill pass is needed.
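
The per-depth seed matters because every key in a re-spilled bucket already agreed on one bucket index at the previous level; hashing with the same function again would send them all to the same sub-bucket. A minimal sketch of the idea follows. It uses a splitmix64-style mixer as a stand-in for XXHash, and the function names, the fan-out of 8, and the seed derivation are illustrative assumptions rather than the PR's code.

```go
package main

import "fmt"

// numBuckets is an assumed fan-out; the PR description does not state the real value.
const numBuckets = 8

// mix64 is the splitmix64 finalizer, used here only as a stand-in for XXHash.
func mix64(z uint64) uint64 {
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
	z = (z ^ (z >> 27)) * 0x94d049bb133111eb
	return z ^ (z >> 31)
}

// bucketFor maps a key to a bucket using a seed derived from the spill depth
// (hypothetical derivation: any scheme that changes the seed per depth works).
func bucketFor(key uint64, depth int) int {
	seed := uint64(depth+1) * 0x9e3779b97f4a7c15
	return int(mix64(key^seed) % numBuckets)
}

// subBucketsUsed counts how many distinct depth-1 buckets are hit by the keys
// that all landed in the same parent bucket at depth 0.
func subBucketsUsed(parent, nKeys int) int {
	seen := map[int]bool{}
	for k := uint64(0); k < uint64(nKeys); k++ {
		if bucketFor(k, 0) == parent {
			seen[bucketFor(k, 1)] = true
		}
	}
	return len(seen)
}

func main() {
	fmt.Println("sub-buckets used at depth 1:", subBucketsUsed(0, 10000))
}
```

If the depth were ignored when deriving the seed, subBucketsUsed would return 1 for every parent, which is the degenerate case the PR avoids.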

Spill file lifecycle hardening

context.Background() for all cleanup deletions

cleanupSpillFiles in both hashbuild and hashjoin previously used
proc.Ctx, which is already cancelled by the time cleanup runs on
abnormal client exit. Changed to context.Background().

hashbuild only cleans build files when JoinMap was never sent

Reset() in hashbuild/types.go now calls cleanupSpillFiles only when
!mapSucceed. When mapSucceed is true, hashjoin owns the files.

spillQueue pre-populated before probe loop (Gap 1)

build() in hashjoin/join.go now pre-populates spillQueue with build
file names before starting the probe loop. Previously the queue was
populated after the loop, so a mid-loop cancellation left build files
untracked.

defer + ownsBuildFile in rebuildHashmapForBucket (Gap 2)

The build file for a bucket is removed from spillQueue (popped) before
processing begins. A deferred cleanup with an ownsBuildFile flag and
context.Background() ensures the named build file is always deleted,
even on early return or cancellation.

Build file cleanup in reSpillBucket defer (Gap 3)

Moved inline RemoveFile(proc.Ctx, ...) into the existing deferred
cleanup with context.Background().

JoinMap.spillCleanup for cancel-before-receive (Gap 4)

Added a spillCleanup func() field to JoinMap (message/joinMapMsg.go).
hashbuild sets it via SetSpillCleanup() with a clone of the bucket
list. FreeMemory(), called via MessageBoard.Reset() → Destroy()
when a pipeline tears down, invokes the cleanup, deleting build files
even when hashjoin cancels before calling ReceiveJoinMap.
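
The callback arrangement can be sketched as follows. The joinMap type here is a simplified stand-in for the real JoinMap, attachCleanup and its remove parameter are hypothetical, and clearing the callback before running it (so repeated teardown calls are harmless) is an assumption of this sketch rather than something the description states.

```go
package main

import "fmt"

// joinMap is a simplified stand-in for the real JoinMap message.
type joinMap struct {
	spillCleanup func()
}

func (jm *joinMap) SetSpillCleanup(f func()) { jm.spillCleanup = f }

// FreeMemory runs the registered cleanup at most once (sketch assumption).
func (jm *joinMap) FreeMemory() {
	if f := jm.spillCleanup; f != nil {
		jm.spillCleanup = nil
		f()
	}
}

// attachCleanup registers a cleanup that captures a clone of the bucket list,
// so later changes to the builder's own slice cannot affect what gets deleted.
func attachCleanup(jm *joinMap, buckets []string, remove func(string)) {
	owned := append([]string(nil), buckets...)
	jm.SetSpillCleanup(func() {
		for _, name := range owned {
			remove(name)
		}
	})
}

func main() {
	jm := &joinMap{}
	buckets := []string{"join_u_0_build", "join_u_1_build"}
	attachCleanup(jm, buckets, func(name string) { fmt.Println("delete", name) })

	buckets[0] = "reused" // the builder resets its state; the clone is unaffected
	jm.FreeMemory()       // pipeline teardown without the map ever being received
}
```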

Early stop for empty buckets (hashjoin/spill.go)

  • Skip a bucket entirely when the build side is empty and the join type
    is not left outer / left single / left anti (which require probe rows
    to pass through regardless).
  • Skip a bucket entirely when the probe side is empty and the join type
    is not right outer / right single / right anti.
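
The two rules above reduce to a small predicate. The joinType enum and canSkipBucket below are hypothetical names for illustration; the real operator has its own join-type representation.

```go
package main

import "fmt"

type joinType int

const (
	inner joinType = iota
	leftOuter
	leftSingle
	leftAnti
	rightOuter
	rightSingle
	rightAnti
)

// canSkipBucket reports whether a spilled bucket can produce no output
// and may therefore be skipped without rebuilding or probing.
func canSkipBucket(jt joinType, buildEmpty, probeEmpty bool) bool {
	if buildEmpty {
		switch jt {
		case leftOuter, leftSingle, leftAnti:
			// probe rows must still pass through
		default:
			return true
		}
	}
	if probeEmpty {
		switch jt {
		case rightOuter, rightSingle, rightAnti:
			// build rows must still be emitted
		default:
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canSkipBucket(inner, true, false))     // true
	fmt.Println(canSkipBucket(leftOuter, true, false)) // false
	fmt.Println(canSkipBucket(rightAnti, false, true)) // false
}
```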

IO optimizations

  • CreateAndRemoveFile (unlinking the directory entry immediately on
    open) is used for all probe bucket files and re-spill build files so
    the OS reclaims them automatically when the fd is closed, regardless
    of whether explicit cleanup runs.
  • Spill expression executors (spillExprExecs) are initialized once per
    build phase and reused across all batches, avoiding repeated
    initialization overhead.
  • acquireSpillBuffers reuses pre-allocated batch buffers for spill.

Naming scheme

  • hashbuild root build files: join_<uuid>_<i>_build
  • hashjoin root probe files: join_<uuid>_<i>_probe
  • Sub-bucket files at depth N: join_<uuid>_<i0>_..._<iN>_build/probe
  • makeSpillBucketWriters(uid, suffix) generates the full set of bucket
    writers for a given parent base name and build/probe suffix.
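
As an illustration of the scheme, the sketch below builds names from the chain of bucket indexes that leads to a bucket. bucketName is a hypothetical helper and does not reflect the signature of makeSpillBucketWriters.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bucketName builds join_<uuid>_<i0>_..._<iN>_<suffix>, where path holds the
// bucket index chosen at each spill level and suffix is "build" or "probe".
func bucketName(uuid string, path []int, suffix string) string {
	var b strings.Builder
	b.WriteString("join_")
	b.WriteString(uuid)
	for _, i := range path {
		b.WriteByte('_')
		b.WriteString(strconv.Itoa(i))
	}
	b.WriteByte('_')
	b.WriteString(suffix)
	return b.String()
}

func main() {
	fmt.Println(bucketName("ab12", []int{3}, "build"))    // join_ab12_3_build
	fmt.Println(bucketName("ab12", []int{3, 5}, "probe")) // join_ab12_3_5_probe
}
```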

Metrics and logging

  • Fixed SpillSize / SpillRows metrics to account for re-spill passes.
  • Added logutil.Infof lines in hashjoin/spill.go for bucket rebuild and
    re-spill events, reporting bucket name and 1-based depth.

Dead code removal

  • Deleted unused ClearHashmap() method from hashbuild/hashmap.go.
  • Removed unused vecs [][]*vector.Vector and delVecs fields from
    HashmapBuilder; replaced with curVecs []*vector.Vector.

Tests

  • Added hashjoin/spill_integration_test.go with end-to-end spill and
    multi-level re-spill scenarios.
  • Updated hashjoin/spill_test.go and hashbuild/spill_test.go to
    match the new APIs and naming scheme.

When hash join spills build data to disk during memory pressure, the
rebuilt hashmap may itself need to spill. This change adds multi-level
spill support by replacing the simple bucket index with a spillQueue
that supports FIFO processing with prepend for re-spilled sub-buckets.

- Add spillQueue (slice with front pop/prepend) replacing spilledBuildBuckets
- Add spillMaxPass constant (3) to limit re-spill recursion depth
- Refactor getSpilledInputBatch to use spillQueue and support re-spill
- Add spill_integration_test.go for rebuild and re-spill flow tests
- Minor cleanup: remove unused logutil import, fix probe file cleanup
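
The queue behaviour described above can be sketched as follows. Only spillMaxPass = 3 comes from this description; the bucket struct, the 0-based depth of root buckets, the fan-out, and the drain helper are assumptions for illustration.

```go
package main

import "fmt"

const spillMaxPass = 3 // value stated in the description

// bucket is a hypothetical queue entry: a base file name plus its spill depth.
type bucket struct {
	name  string
	depth int
}

type spillQueue struct{ items []bucket }

func (q *spillQueue) push(b ...bucket) { q.items = append(q.items, b...) }

func (q *spillQueue) popFront() (bucket, bool) {
	if len(q.items) == 0 {
		return bucket{}, false
	}
	b := q.items[0]
	q.items = q.items[1:]
	return b, true
}

// prepend places sub-buckets ahead of the remaining entries, so a re-spilled
// bucket is fully processed before its siblings.
func (q *spillQueue) prepend(subs ...bucket) {
	q.items = append(append([]bucket(nil), subs...), q.items...)
}

// drain pops buckets in order; a bucket that is still too big is split into
// fanout sub-buckets unless the depth limit has been reached. It returns the
// names of the buckets that were actually joined, in processing order.
func drain(q *spillQueue, fanout int, tooBig func(bucket) bool) []string {
	var order []string
	for {
		b, ok := q.popFront()
		if !ok {
			return order
		}
		if tooBig(b) && b.depth < spillMaxPass {
			subs := make([]bucket, fanout)
			for i := range subs {
				subs[i] = bucket{name: fmt.Sprintf("%s_%d", b.name, i), depth: b.depth + 1}
			}
			q.prepend(subs...)
			continue
		}
		order = append(order, b.name)
	}
}

func main() {
	q := &spillQueue{}
	q.push(bucket{name: "join_u_0"}, bucket{name: "join_u_1"})
	order := drain(q, 2, func(b bucket) bool { return b.name == "join_u_0" })
	fmt.Println(order) // [join_u_0_0 join_u_0_1 join_u_1]
}
```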

@matrix-meow added the size/XXL label (PR changes 2000+ lines) and removed the size/XL label (PR changes [1000, 1999] lines) on Mar 23, 2026

Labels

kind/feature, size/XXL
