Support for 32- and 64-bit cuco::experimental::roaring_bitmap lookups#741
Support for 32- and 64-bit cuco::experimental::roaring_bitmap lookups#741sleeepyjack merged 28 commits intoNVIDIA:devfrom
Conversation
c8eb1a2 to
a56e3a9
Compare
PointKernel
left a comment
There was a problem hiding this comment.
First pass
The logic is straightforward, so there’s not much to review from a technical standpoint.
ci/build.sh
Outdated
| --arch) CUDA_ARCHS="${args[1]}"; args=("${args[@]:2}");; | ||
| --std) CXX_STANDARD="${args[1]}"; args=("${args[@]:2}");; | ||
| -v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");; | ||
| --) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;; |
There was a problem hiding this comment.
why do we need this change?
There was a problem hiding this comment.
This allows for passing arbitrary cmake build options through our ci/build.sh script. For instance, this PR introduces the -DCUCO_DOWNLOAD_ROARING_DATA option, Which I can now pass as ci/build.sh — -DCUCO_DOWNLOAD_ROARING_DATA=OFF.
There was a problem hiding this comment.
#704 allows for specifying the cmake binary while this change here allows for passing build arguments directly to cmake as a pass-through.
There was a problem hiding this comment.
Very optional but do we instead want to rename this as CMAKE_ARGS which is what libcudf uses to add extra CMAKE_ARGS to the build
There was a problem hiding this comment.
I've outsourced the changes to the build script in #749
| private: | ||
| metadata_type metadata_; | ||
| cuda::std::byte const* data_; | ||
| cuda::std::uint8_t const* run_container_bitmap_; |
There was a problem hiding this comment.
what does run_container_bitmap do
There was a problem hiding this comment.
There are three types of containers in the bitmap. (1) array containers, (2) bitset containers, and (3) "run containers" aka array that encode consecutive sequences of elements in RLE format. To distinguish the first two types we can look at the cardinality (if it's >4096 then it is stored as a bitset, otherwise it is an array container). We cannot apply this trick for run containers, thus we have to store a bitset with one bit per container, indicating if it's a run container or not.
There was a problem hiding this comment.
Can we document this somewhere for posterity?
There was a problem hiding this comment.
This is taken directly from the RoaringBitmapFormatSpec which I reference a couple of times in the docs. I you read throught the cookie header definition you'll find the following explanation:
Let size be the number of containers. Then we store (size + 7) / 8 bytes, following the initial 32 bits, as a bitset to indicate whether each of the containers is a run container (bit set to 1) or not (bit set to 0). The first (least significant) bit of the first byte corresponds to the first stored container and so forth. In this scenario, the cookie header uses 32 bits followed by (size + 7) / 8 bytes.
| template <class U /* = T */, | ||
| class /* = cuda::std::enable_if_t<cuda::std::is_same_v<U, cuda::std::uint32_t>> */> |
There was a problem hiding this comment.
is this some cleanup leftover?
There was a problem hiding this comment.
It's a hint mirroring what is there in the definition of the function to help understand why these tparams are there. In this case it means this function is only enabled for the 32-bit roaring bitmap format.
| if (this->empty()) { | ||
| thrust::fill( | ||
| nosync_exec_policy, contained, contained + cuda::std::distance(first, last), false); | ||
| } else { | ||
| thrust::transform(nosync_exec_policy, | ||
| first, | ||
| last, | ||
| contained, | ||
| cuda::proclaim_return_type<bool>( | ||
| [*this] __device__(auto key) { return this->contains(key); })); | ||
| } |
There was a problem hiding this comment.
As discussed offline, we should use cub algorithms here as thrust doesn't provide control for potential temporary data allocation/deallocation
There was a problem hiding this comment.
My last commit should have switched to cub but somehow the diff is not showing it. I'll verify this.
| card = 1u + misaligned_load<cuda::std::uint16_t>( | ||
| storage_ref_.key_cards() + (index * 2 + 1) * sizeof(cuda::std::uint16_t)); | ||
| } | ||
| if (card <= 4096) { |
There was a problem hiding this comment.
let's get rid of magic number and give it a name
| cuda::std::uint32_t cookie; | ||
| cuda::std::memcpy(&cookie, buf, sizeof(cuda::std::uint32_t)); | ||
| buf += sizeof(cuda::std::uint32_t); | ||
| if ((cookie & 0xFFFF) != serial_cookie && cookie != serial_cookie_no_runcontainer) { |
There was a problem hiding this comment.
Yep, cookie_mask
|
|
||
| thrust::device_vector<T> items(num_items); | ||
|
|
||
| key_generator gen{}; |
There was a problem hiding this comment.
Just caffeinating the engine ;)
| key_generator gen{}; | |
| key_generator gen{0xc0ffee}; |
There was a problem hiding this comment.
Ultimately, I would like to use the same default seed for all cuco benchmarks and add an option to adjust it via nvbench. We currently use std::time(nullptr) as the default seed which is another problem in itself. So yeah, this will be a separate PR to fix this for the entire library.
ci/build.sh
Outdated
| --arch) CUDA_ARCHS="${args[1]}"; args=("${args[@]:2}");; | ||
| --std) CXX_STANDARD="${args[1]}"; args=("${args[@]:2}");; | ||
| -v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");; | ||
| --) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;; |
There was a problem hiding this comment.
Very optional but do we instead want to rename this as CMAKE_ARGS which is what libcudf uses to add extra CMAKE_ARGS to the build
| // Create query keys for the portable_bitmap64.bin file: | ||
| // https://github.com/RoaringBitmap/RoaringFormatSpec/blob/5177ad9/testdata64/README.md#portable_bitmap64bin | ||
| std::vector<cuda::std::uint64_t> keys; | ||
| for (cuda::std::uint64_t k = 0x00000ull; k < 0x09000ull; ++k) { |
There was a problem hiding this comment.
std::iota at a few places here
There was a problem hiding this comment.
I would lean towards keeping the original logic as it is a 1:1 copy of how the bitmap was serialize (see link in line 70). Just makes it clearer for the reader where these numbers come from.
| return static_cast<cuda::std::uint8_t>(container[lower / 8]) & | ||
| (cuda::std::uint8_t(1) << (lower % 8)); |
There was a problem hiding this comment.
May a comment about the logic and any ref.
There was a problem hiding this comment.
Added some comments and also added a helper function check_bit() to reduce code duplication.
| namespace cuco::detail { | ||
|
|
||
| // primary template | ||
| template <class T> |
There was a problem hiding this comment.
Looks like we are re-implementing binary search quite a few times in this file. Perhaps we could add a couple of functions in anonymous namespace and reuse. Future PR to do that is also ok.
There was a problem hiding this comment.
Yeah, that's definitely a candidate for a refactoring. However, there a some details that are slightly different between the implementations which I'd have to abstract from somehow. I would say this is something we can do in a future PR.
| private: | ||
| metadata_type metadata_; | ||
| cuda::std::byte const* data_; | ||
| cuda::std::uint8_t const* run_container_bitmap_; |
There was a problem hiding this comment.
Can we document this somewhere for posterity?
| else { | ||
| cuda::std::memcpy(&num_containers, buf, sizeof(cuda::std::uint32_t)); | ||
| buf += sizeof(cuda::std::uint32_t); | ||
| } |
There was a problem hiding this comment.
Don't have a better suggestion but quite a few magic numbers in here 🪄
There was a problem hiding this comment.
I cleaned up and simplified some of the logic and added some variable names for these constants. Let me know if I missed something.
|
|
||
| #include <fstream> | ||
| #include <string> | ||
| #include <vector> |
There was a problem hiding this comment.
Same comments from host_bulk_example.cu about Use of std::iota and std::filesystem::file_size
| cub::DeviceTransform::Transform( | ||
| thrust::constant_iterator<bool>(false), | ||
| contained, | ||
| cuda::std::distance(first, last), | ||
| cuda::proclaim_return_type<bool>([] __device__(auto /* dummy */) { return false; }), |
There was a problem hiding this comment.
Can anyone think of a better way of doing a fill operation using cub that works with arbitrary input iterators? We want to avoid thrust::fill as it might introduce some unexpected host syncs.
PointKernel
left a comment
There was a problem hiding this comment.
Looks good to me. One last question needs some discussions. Do we want to get this merged in cudf or cuco?
|
|
||
| private: | ||
| allocator_type allocator_; | ||
| typename ref_type::metadata_type metadata_; |
There was a problem hiding this comment.
Are there cases where users need to access or manage the metadata when working with a roaring bitmap? If so, yes.
@PointKernel The container as is does pass the cuco vibecheck I'd say. The current implementation is limited to lookups but could be extended to support mutable operations as well (although it would require some significant effort). That said, I'm not at all opposed to moving it into cudf instead - just saying that it could fit into cuco from my standpoint. If you have strong oppinions against including it in cuco I'm open to discuss them. If we move it into cudf I would need someone to lead the effort who is more versed with libcudf details than me. CC @mhaseeb123 what do you think? |
Agreed with this statement. My vote would be to keep this container in cuco especially considering we may want to expand the set of features in the future and then would need to move it back to cuco. That said, if y'all think it's better to move this to libcudf, I am happy to lead the effort. |
PointKernel
left a comment
There was a problem hiding this comment.
I’m concerned that this is mainly addressing Spark’s requirements rather than providing a general-purpose bitmap solution on GPUs. Once we decide to support insertion, the existing lookups may also need to change. That said, it’s fine for now—let’s merge the PR as is and revisit the full integration later once we have a clearer picture of how cudf/Spark will use this data structure.
Agreed. I don't see any requirement from Spark or libcudf, in the near term at least, to add insertions, deletions etc.
Yup and we may need to move it back to cuco.
Agreed with this. We can always move it to libcudf if needed. |
PointKernel
left a comment
There was a problem hiding this comment.
Looks good. @sleeepyjack Thanks!
Closes #725