@tisonkun
Member

@tisonkun tisonkun commented Feb 4, 2026

This closes #37

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun marked this pull request as draft February 4, 2026 09:53
@tisonkun tisonkun marked this pull request as ready for review February 4, 2026 13:05
@tisonkun
Member Author

tisonkun commented Feb 4, 2026

cc @PsiACE @Xuanwo @ZENOTME

@tisonkun tisonkun requested review from leerho and notfilippo February 4, 2026 13:08
/// Computes and checks the 16-bit seed hash from the given long seed.
///
/// The seed hash may not be zero in order to maintain compatibility with older serialized
/// versions that did not have this concept.
Contributor

Does this mean that we should check the return value to prevent 0?

Member Author

I suppose so.

cc @leerho I can't find a similar requirement from the Rust code alone. Could you provide more context on why 0 is not allowed here?

Member Author

BTW this comment is copied from computeSeedHash's Java version.

Member

@leerho leerho Feb 7, 2026

It is exactly what the comment says. It is to remain compatible with older sketch versions (in other languages) that did not have the concept of the seedHash. Once you have serialized a sketch, it no longer retains any information about what language generated the serialized image. That is the whole idea and quite powerful! Once you have properly created this sketch in Rust, you will be able to import sketch images created years ago from Java, C++, or whatever.

The fact that "older versions of Rust" don't have this problem is irrelevant. :)

And yes, the method that generates the seed must check for 0, as it does in Java.

And, hmmm, it looks like C++ doesn't check for zero either. Which is a bug.
The likely reason this has not been noticed before is that we always use the DEFAULT_UPDATE_SEED, which has a non-zero seed_hash.

Contributor

@ZENOTME ZENOTME Feb 8, 2026

Looks like we need to return an error here when the result is 0, as the Java version does:

  public static short computeSeedHash(final long seed) {
    final long[] seedArr = {seed};
    final short seedHash = (short)(hash(seedArr, 0L)[0] & 0xFFFFL);
    if (seedHash == 0) {
      throw new SketchesArgumentException(
          "The given seed: " + seed + " produced a seedHash of zero. "
              + "You must choose a different seed.");
    }
    return seedHash;
  }
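
A minimal Rust sketch of the same zero check, for illustration only. It assumes an unsigned 64-bit seed, and the error type `ZeroSeedHash` and the `hash64` parameter are hypothetical names, not taken from this patch. `hash64` stands in for the first 64-bit word of the MurmurHash3 result so that the sketch stays self-contained:

```rust
use std::fmt;

/// Error returned when a seed hashes to zero (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct ZeroSeedHash {
    pub seed: u64,
}

impl fmt::Display for ZeroSeedHash {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "the given seed {} produced a seed hash of zero; choose a different seed",
            self.seed
        )
    }
}

impl std::error::Error for ZeroSeedHash {}

/// Computes the 16-bit seed hash and rejects zero.
///
/// `hash64` is an assumption of this sketch: it stands in for the first
/// 64-bit word of the MurmurHash3 hash of the seed.
pub fn compute_seed_hash(
    seed: u64,
    hash64: impl Fn(u64) -> u64,
) -> Result<u16, ZeroSeedHash> {
    // Keep only the low 16 bits, as the Java version does with `& 0xFFFFL`.
    let seed_hash = (hash64(seed) & 0xFFFF) as u16;
    if seed_hash == 0 {
        // Zero is reserved for older serialized images without a seed hash.
        return Err(ZeroSeedHash { seed });
    }
    Ok(seed_hash)
}
```

Returning a `Result` instead of panicking leaves it to the caller to pick another seed, which roughly mirrors the `SketchesArgumentException` thrown in Java.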

@tisonkun
Member Author

tisonkun commented Feb 5, 2026

I'm going to do the following tasks after this patch is merged:

  1. Add CpcWrapper to read fields without fully deserializing the sketch. The Java implementation has this as well.
  2. Investigate whether we need introspective_insertion_sort. Rust's slice sort should already take advantage of items that are already in order.

For this patch, one open question is whether to include the decoding tables as static values, or to build them on first access (for example, with OnceLock).

I tend to keep the static decoding tables. They should not increase the binary size too much.
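
To make the trade-off concrete, here is a small sketch of both options. The four-entry table and the names `STATIC_TABLE`, `build_table`, and `lazy_table` are hypothetical; the real decoding tables are much larger and have different contents:

```rust
use std::sync::OnceLock;

// Option 1: the table is embedded in the binary as a static value.
// (Hypothetical contents, chosen only for illustration.)
static STATIC_TABLE: [u16; 4] = [0, 1, 4, 9];

// Stand-in for whatever computation would derive the table at runtime.
fn build_table() -> [u16; 4] {
    let mut table = [0u16; 4];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = (i * i) as u16;
    }
    table
}

// Option 2: the table is built once, on first access, and then shared.
fn lazy_table() -> &'static [u16; 4] {
    static TABLE: OnceLock<[u16; 4]> = OnceLock::new();
    TABLE.get_or_init(build_table)
}
```

The static variant costs binary size but no startup work or synchronization on access. The `OnceLock` variant keeps the binary smaller at the price of a one-time build and an initialization check on each call.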

@tisonkun
Member Author

tisonkun commented Feb 6, 2026

I'm going to merge this patch now. Review after commit is welcome.

To reduce binary size, we'd follow #32 to exclude CpcSketch's code when users don't need it.

@tisonkun tisonkun merged commit 309b134 into main Feb 6, 2026
9 checks passed
@tisonkun tisonkun deleted the compression branch February 6, 2026 02:21
Development

Successfully merging this pull request may close these issues.

Implement CpcSketch

3 participants