Skip to content

Conversation

@dojiong
Copy link
Contributor

@dojiong dojiong commented Dec 9, 2025

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

A stack overflow occurs when processing data files containing a large number of equality deletes (e.g., > 6000 rows).
This happens because parse_equality_deletes_record_batch_stream previously constructed the final predicate by linearly calling .and() in a loop:

result_predicate = result_predicate.and(row_predicate.not());

This resulted in a deeply nested, left-skewed tree structure with a depth equal to the number of rows (N). When rewrite_not() (which uses a recursive visitor
pattern) was subsequently called on this structure, or when the structure was dropped, the call stack limit was exceeded.

Changes

  1. Balanced Tree Construction: Refactored the predicate combination logic. Instead of linear accumulation, row predicates are collected and combined using a
    pairwise combination approach to build a balanced tree. This reduces the tree depth from O(N) to O(log N).
  2. Early Rewrite: rewrite_not() is now called immediately on each individual row predicate before they are combined. This ensures we are combining simplified
    predicates and avoids traversing a massive unoptimized tree later.
  3. Regression Test: Added test_large_equality_delete_batch_stack_overflow, which processes 20,000 equality delete rows to verify the fix.

Are these changes tested?

  • New regression test test_large_equality_delete_batch_stack_overflow passed.
  • All existing tests in arrow::caching_delete_file_loader passed.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dojiong for this pr, LGTM! It would be nice to add a comment to explain why we need to do this optimization.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dojiong for this fix.

@liurenjie1024 liurenjie1024 merged commit 16906c1 into apache:main Dec 11, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants