feat: add cleanup_non_empty_nulls kernel#9970
Conversation
This is useful in: 1. lambda function on lists - since we need to remove the values that are not null 2. `explode` sql function 3. have kernels that use the list values without need to check if the value should be processed or not - for example implementing `array_distinct` which is keeping in each list the unique items
| assert_eq!(cleaned.nulls(), Some(&input_nulls)); | ||
| } | ||
|
|
||
| // ===== Underlying child array is sliced ===== |
There was a problem hiding this comment.
Should the children arrays below be sliced? The first test below looks very similiar as list_cleanup_nulls_with_null_pointing_to_non_empty_list_and_have_empty_list in line 464
| unsafe { Buffer::from_trusted_len_iter(iter) } | ||
| }; | ||
|
|
||
| let cleanup_array = crate::take::take( |
There was a problem hiding this comment.
non-blocking: for lists/maps with big chunks of valid or empty sublists, is possible that using MutableArrayData directly be faster, since we can copy the given chunk in a single memcpy for some data types, and perform less dynamic dispatches compared to take_list?
Now for string/binary I'm not sure since is static
There was a problem hiding this comment.
yes it is possible
but we can optimize the code further to have dedicated impl for each type but for now it is ok I think.
the next optimization possible is using filter on values and only updating list offsets.
There was a problem hiding this comment.
I also think that is ok, that's for sure, I just wanted to comment instead of forgetting about it later. Using filter is a great idea, nice
There was a problem hiding this comment.
I will try create an issue after this is merged or update the implementation in the next pr
| return Ok(Arc::new(self.clone())); | ||
| }; | ||
|
|
||
| // Find an empty value so we can use the `take` kernel |
There was a problem hiding this comment.
why does it matter? having empty value will allow us to use the optimized take version rather than the fallback
There was a problem hiding this comment.
Because I believe we may not need the the empty value and can simplify this to take(self, &UInt32Array::from_iter(0..self.len() as u32), None), including in the interleave fallback path, since take doesn't copy underlying values of nulls. Within the take kernel the version used would be the same as today
There was a problem hiding this comment.
Kernels try to reuse data as much as possible and try to avoid allocating when possible, so even if the current implementation of take will remove nulls when using take(self, &UInt32Array::from_iter(0..self.len() as u32), None) it should not be dependent upon
See take kernel comment here:
arrow-rs/arrow-select/src/take.rs
Lines 56 to 58 in 6fae4ea
comphead
left a comment
There was a problem hiding this comment.
Thanks @rluvaton @gstvg before expanding arrow-rs lets try to scope the concern.
Ideally to include in the PR description the explanation why this problem is happening, what is the reason of ArrayRef require cleaning, because under usual circumstances it should not. When we discussed the change in apache/datafusion#22158 (comment) I feel the scope was to explore if BooleanBuffer can calculate has_false without adding another BooleanBuilder wrapper on top of it
This was before the discussion to move the this can happen under normal circumstances, for example using |
|
Recently a similar usecase came up in DataFusion for "garbage collection" -- see What would you think about adding gc for all array types (and part of its contract would be to clear out unused / null slots)? |
|
That could potentially keep the API surface of arrow-rs from growing too much |
|
I want gc functionality as well but they are not the same. there are also open questions about gc. if there are still references to the underlying buffers will it copy the data? because then gc can increase memory usage rather than decrease it. and if it will do nothing, than it won't cleanup the non empty lists behind nulls. |
gstvg
left a comment
There was a problem hiding this comment.
LGTM, there's only #9970 (review) which I believe is worth working on, all my others comments are resolved
|
if we're worried about api surface, this could fit into whats described here 🤔 |
|
Yes, I think minify would be much better than a bunch of type specific kernels We would just have to be careful about defining exactly what minify is allowed to do and what is a breaking change or not |
Which issue does this PR close?
N/A
Rationale for this change
when working with lists or variable size arrays you cant operate on the underlying values/bytes of variable length array as is as nulls might point to non empty values
Cases when this is useful:
are not null
explodesql function - list values behind nulls cannot be keptvalue should be processed or not - for example implementing
array_distinctwhich is keeping in each list the unique itemsthe cases where having nulls for non empty values can happen for example when using the
nullifkernelWhat changes are included in this PR?
added to
arrow-selectcleanup_non_empty_nullsmodule which include 2 functionscleanup_non_empty_nullswhich is the logic for removing non empty nulls valueshas_non_empty_nullswhich can be called before calling thecleanup_non_empty_nullsfunction to check if the expensive work is even neededOriginally I wanted to add the function on
ListArrayandStringArrayand so on, but because the use of take and interleave we cannot do thatAre these changes tested?
Yes
Are there any user-facing changes?
yes, new kernel
Needed also for: