feat: add `cleanup_non_empty_nulls` kernel by rluvaton · Pull Request #9970 · apache/arrow-rs

rluvaton · 2026-05-13T19:45:30Z

Which issue does this PR close?

N/A

Rationale for this change

When working with lists or other variable-length arrays, you can't operate on the underlying values/bytes as-is, because null entries might point to non-empty values.

Cases when this is useful:

lambda function on lists - since we need to remove the values that sit behind nulls
explode sql function - list values behind nulls cannot be kept
kernels that use the list values without needing to check whether each value should be processed - for example implementing array_distinct, which keeps the unique items in each list

Nulls that point to non-empty values can occur, for example, when using the nullif kernel.
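To make the situation concrete, here is a minimal sketch using plain vectors rather than real arrow-rs arrays. `ListModel`, `example_after_nullif`, and `non_empty_null_slots` are hypothetical names invented for this illustration; they are not part of the PR or of arrow-rs. The sketch assumes the usual list layout of `len + 1` offsets plus a validity flag per slot.

```rust
/// Simplified stand-in for a list array (an assumption for illustration,
/// not an arrow-rs type). `offsets` has `len + 1` entries and
/// `validity[i] == false` marks slot `i` as null.
struct ListModel {
    offsets: Vec<usize>,
    validity: Vec<bool>,
    values: Vec<i32>,
}

/// Models the state after a null mask was applied on top of
/// [[1, 2], [3, 4], [5]]: slot 1 became null, but its offsets still
/// cover the child values 3 and 4.
fn example_after_nullif() -> ListModel {
    ListModel {
        offsets: vec![0, 2, 4, 5],
        validity: vec![true, false, true],
        values: vec![1, 2, 3, 4, 5],
    }
}

/// Returns the indices of null slots whose offset range is non-empty.
fn non_empty_null_slots(list: &ListModel) -> Vec<usize> {
    (0..list.validity.len())
        .filter(|&i| !list.validity[i] && list.offsets[i + 1] > list.offsets[i])
        .collect()
}
```

In this model, any code that walks `values` directly would still see 3 and 4, even though the slot that owns them is null.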

What changes are included in this PR?

Added a cleanup_non_empty_nulls module to arrow-select, which includes 2 functions:

cleanup_non_empty_nulls, which contains the logic for removing the non-empty values behind nulls
has_non_empty_nulls, which can be called before cleanup_non_empty_nulls to check whether the expensive work is even needed
Added benchmarks for cleanup
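The check-then-clean flow described above can be sketched on plain slices. The function names mirror the PR, but the signatures below are assumptions made for this illustration and are not the PR's actual API, which operates on arrow arrays. Returning `None` models the case where no work is needed.

```rust
/// Sketch only: returns true if any null slot covers a non-empty range.
/// `offsets` is expected to have `validity.len() + 1` entries.
fn has_non_empty_nulls(offsets: &[usize], validity: &[bool]) -> bool {
    validity
        .iter()
        .zip(offsets.windows(2))
        .any(|(valid, w)| !*valid && w[1] > w[0])
}

/// Sketch only: rebuilds offsets and values so that every null slot has an
/// empty range. Returns `None` when the input is already clean, so the
/// caller can keep using the original buffers without copying.
fn cleanup_non_empty_nulls(
    offsets: &[usize],
    validity: &[bool],
    values: &[i32],
) -> Option<(Vec<usize>, Vec<i32>)> {
    if !has_non_empty_nulls(offsets, validity) {
        return None;
    }
    let mut new_offsets = Vec::with_capacity(offsets.len());
    let mut new_values = Vec::with_capacity(values.len());
    new_offsets.push(0);
    for (valid, w) in validity.iter().zip(offsets.windows(2)) {
        if *valid {
            // Only values owned by valid slots are copied.
            new_values.extend_from_slice(&values[w[0]..w[1]]);
        }
        // Null slots get an empty range: start == end.
        new_offsets.push(new_values.len());
    }
    Some((new_offsets, new_values))
}
```

The validity mask is left untouched in this sketch; only the offsets and the child values change.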

Originally I wanted to add the function on ListArray, StringArray, and so on, but because the implementation relies on the take and interleave kernels from arrow-select, we cannot do that.

Are these changes tested?

Yes

Are there any user-facing changes?

yes, new kernel

Needed also for:

Support MapArray in lambda clear null values datafusion#22846

rluvaton · 2026-05-14T17:15:37Z

Cc @gstvg @comphead (for lambda support)

gstvg · 2026-05-14T23:43:04Z

+        assert_eq!(cleaned.nulls(), Some(&input_nulls));
+    }
+
+    // ===== Underlying child array is sliced =====


Should the children arrays below be sliced? The first test below looks very similar to list_cleanup_nulls_with_null_pointing_to_non_empty_list_and_have_empty_list at line 464

you are right, fixing

gstvg · 2026-05-14T23:44:24Z

+                unsafe { Buffer::from_trusted_len_iter(iter) }
+            };
+
+            let cleanup_array = crate::take::take(


non-blocking: for lists/maps with big chunks of valid or empty sublists, is it possible that using MutableArrayData directly would be faster? We could copy a given chunk in a single memcpy for some data types, and perform fewer dynamic dispatches compared to take_list.

For string/binary I'm not sure, since it is static
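A rough sketch of the chunk-copy idea, using plain slices instead of MutableArrayData: adjacent slots whose values should be kept are merged into one contiguous range, so each range can be copied with a single slice copy. `copy_ranges` and `copy_values` are hypothetical helpers written for this illustration and do not correspond to any arrow-rs API.

```rust
/// Sketch only: merges the value ranges of consecutive kept slots into
/// maximal contiguous runs. Null slots and empty slots contribute nothing.
fn copy_ranges(offsets: &[usize], validity: &[bool]) -> Vec<(usize, usize)> {
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for (valid, w) in validity.iter().zip(offsets.windows(2)) {
        if !*valid || w[0] == w[1] {
            continue; // nothing to copy for this slot
        }
        // Extend the previous run if this slot starts where it ended.
        let extended = match ranges.last_mut() {
            Some(last) if last.1 == w[0] => {
                last.1 = w[1];
                true
            }
            _ => false,
        };
        if !extended {
            ranges.push((w[0], w[1]));
        }
    }
    ranges
}

/// Sketch only: copies each run with one bulk slice copy.
fn copy_values(values: &[i32], ranges: &[(usize, usize)]) -> Vec<i32> {
    let mut out = Vec::new();
    for &(start, end) in ranges {
        out.extend_from_slice(&values[start..end]);
    }
    out
}
```

With long stretches of valid sublists, this produces a few large copies instead of one copy per slot, which is the effect the comment above is hoping for.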

Yes, it is possible.
We could optimize the code further with a dedicated implementation for each type, but for now I think this is OK.

the next optimization possible is using filter on values and only updating list offsets.

I also think that is ok, that's for sure, I just wanted to comment instead of forgetting about it later. Using filter is a great idea, nice

I will try to create an issue after this is merged, or update the implementation in the next PR
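The filter-based optimization mentioned in this thread could look roughly like the sketch below, again on plain slices. The three helper functions are hypothetical names for this illustration, not arrow-rs functions: one builds a keep mask over the child values, one applies it, and one recomputes the offsets without touching the values at all.

```rust
/// Sketch only: marks child values that belong to valid slots.
fn values_keep_mask(offsets: &[usize], validity: &[bool], values_len: usize) -> Vec<bool> {
    let mut keep = vec![false; values_len];
    for (valid, w) in validity.iter().zip(offsets.windows(2)) {
        if *valid {
            keep[w[0]..w[1]].fill(true);
        }
    }
    keep
}

/// Sketch only: keeps the values whose mask entry is true.
fn filter_values<T: Clone>(values: &[T], keep: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(keep)
        .filter(|(_, k)| **k)
        .map(|(v, _)| v.clone())
        .collect()
}

/// Sketch only: recomputes offsets from the lengths of valid slots.
/// Null slots get length zero.
fn rebuild_offsets(offsets: &[usize], validity: &[bool]) -> Vec<usize> {
    let mut out = Vec::with_capacity(offsets.len());
    let mut end = 0;
    out.push(end);
    for (valid, w) in validity.iter().zip(offsets.windows(2)) {
        if *valid {
            end += w[1] - w[0];
        }
        out.push(end);
    }
    out
}
```

The appeal of this shape is that the child values only need a generic filter, while the list-specific work is limited to the offsets.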

gstvg · 2026-05-14T23:48:13Z

+            return Ok(Arc::new(self.clone()));
+        };
+
+        // Find an empty value so we can use the `take` kernel


I'm not sure if we can take that for granted, but take should already clean up nulls for lists/maps and bytes

why does it matter? having empty value will allow us to use the optimized take version rather than the fallback

Because I believe we may not need the empty value and can simplify this to take(self, &UInt32Array::from_iter(0..self.len() as u32), None), including in the interleave fallback path, since take doesn't copy the underlying values of nulls. Within the take kernel, the version used would be the same as today.

Kernels try to reuse data as much as possible and avoid allocating when they can, so even if the current implementation of take removes the values behind nulls when called as take(self, &UInt32Array::from_iter(0..self.len() as u32), None), that behavior should not be relied upon.

See take kernel comment here:

arrow-rs/arrow-select/src/take.rs

Lines 56 to 58 in 6fae4ea

/// Note that this kernel, similar to other kernels in this crate,
/// will avoid allocating where not necessary. Consequently
/// the returned array may share buffers with the inputs

comphead

Thanks @rluvaton @gstvg before expanding arrow-rs lets try to scope the concern.

Ideally, the PR description should explain why this problem happens and why an ArrayRef requires cleaning, because under usual circumstances it should not. When we discussed the change in apache/datafusion#22158 (comment), I felt the scope was to explore whether BooleanBuffer can calculate has_false without adding another BooleanBuilder wrapper on top of it.

rluvaton · 2026-05-17T09:50:05Z

This came before the discussion about moving has_false, and it also supports map and byte-based arrays, not only lists.

This can happen under normal circumstances, for example when using the nullif kernel. It is not a problem; it is a valid case. I updated the description.

alamb · 2026-05-21T18:27:16Z

Recently a similar usecase came up in DataFusion for "garbage collection" -- see

Fix: compact view buffers in ScalarValue::compact for all container t… datafusion#21934

What would you think about adding gc for all array types (and part of its contract would be to clear out unused / null slots)?

alamb · 2026-05-21T18:27:34Z

That could potentially keep the API surface of arrow-rs from growing too much

rluvaton · 2026-05-26T11:07:44Z

I want gc functionality as well but they are not the same.
cleanup_non_empty_nulls avoids copying the data if there are no non-empty lists behind nulls. gc, as I see it, would also call gc on the children, which is not the use case for this function; the use case is to prepare arrays for other kernels to work on.

There are also open questions about gc. If there are still references to the underlying buffers, will it copy the data? If so, gc can increase memory usage rather than decrease it. And if it does nothing, then it won't clean up the non-empty lists behind nulls.

rluvaton · 2026-06-09T10:18:23Z

@comphead / @alamb could you please review so I can fix:

Support MapArray in lambda clear null values datafusion#22846

and the relevant pr that need it for map:

Add transform_values UDF datafusion#22689 (comment)

gstvg

LGTM, there's only #9970 (review) which I believe is worth working on; all my other comments are resolved

Jefffrey · 2026-06-22T16:15:30Z

if we're worried about API surface, this could fit into what's described here 🤔

Minify Kernel #7186

alamb · 2026-06-22T19:21:06Z

Yes, I think minify would be much better than a bunch of type specific kernels

We would just have to be careful about defining exactly what minify is allowed to do and what is a breaking change or not

rluvaton added 5 commits May 13, 2026 21:55

rename and add tests

b2b67c1

add bench and test

0d18b97

update

d0f5c3a

rename

8b64caf

github-actions Bot added the arrow Changes to the arrow crate label May 13, 2026

format and lint

a433e16

rluvaton marked this pull request as ready for review May 13, 2026 19:49

add license

2684203

gstvg reviewed May 15, 2026

comphead reviewed May 15, 2026

rluvaton added 2 commits May 17, 2026 23:33

Merge branch 'main' into add-cleanup-non-empty-nulls-kernel

79afb38

Merge branch 'main' into add-cleanup-non-empty-nulls-kernel

6b233f1

Merge branch 'main' into add-cleanup-non-empty-nulls-kernel

99d5a78

rluvaton mentioned this pull request Jun 9, 2026

feat(lambda): support map non empty nulls cleanup for HigherOrderFunctionExpr apache/datafusion#22847

Draft

gstvg approved these changes Jun 22, 2026

alamb mentioned this pull request Jun 22, 2026

Minify Kernel #7186

Open

rluvaton added 4 commits June 23, 2026 19:58

remove tests

fb64ef1

fixing the sliced tests

382ef26

Merge branch 'main' into add-cleanup-non-empty-nulls-kernel

b939778

Merge branch 'main' into add-cleanup-non-empty-nulls-kernel

f3e9c95
