Skip to content

VarOpt heapify: add validation for totalRWeight in full preamble (align with C++) #721

@Fengzdadi

Description

@Fengzdadi

Description

In Java VarOptItemsSketch heapify, full‑preamble deserialization currently only checks that rCount > 0, but does not validate the totalRWeight field. C++ (var_opt_sketch_impl.hpp) rejects NaN or non‑positive total_wt_r, and Go was updated to follow C++ behavior. This creates a cross‑language inconsistency and allows corrupted or malicious bytes to silently produce NaN/negative tau and invalid estimates.

Current Java code

File: src/main/java/org/apache/datasketches/sampling/VarOptItemsSketch.java
Method: heapify (around where numPreLongs == Family.VAROPT.getMaxPreLongs())

Current logic:

double totalRWeight = 0.0;
if (numPreLongs == Family.VAROPT.getMaxPreLongs()) {
  if (rCount > 0) {
    totalRWeight = extractTotalRWeight(srcSeg);
  } else {
    throw new SketchesArgumentException(
        "Possible Corruption: " + Family.VAROPT.getMaxPreLongs()
        + " preLongs but no items in R region");
  }
}

Behavior in other languages

C++ (sampling/include/var_opt_sketch_impl.hpp):

  • Full preamble → if r == 0 or total_wt_r is NaN or <= 0, throw
    "Possible corruption: deserializing in full mode but r = 0 or invalid R weight.
     Found r = ..., R region weight = ..."
    

Go (sampling/varopt_items_sketch.go):

  • Full preamble → reject NaN or <= 0 totalRWeight (aligned with C++)

Proposed change

Add validation after extractTotalRWeight in Java:

double totalRWeight = 0.0;
if (numPreLongs == Family.VAROPT.getMaxPreLongs()) {
  if (rCount > 0) {
    totalRWeight = extractTotalRWeight(srcSeg);
    if (Double.isNaN(totalRWeight) || totalRWeight <= 0.0) {
      throw new SketchesArgumentException(
          "Possible Corruption: deserializing in full mode but r = 0 or invalid R weight. "
          + "Found r = " + rCount + ", R region weight = " + totalRWeight);
    }
  } else {
    throw new SketchesArgumentException(
        "Possible Corruption: " + Family.VAROPT.getMaxPreLongs()
        + " preLongs but no items in R region");
  }
}

Why this matters

  • Prevents NaN/negative tau from silently propagating
  • Aligns Java behavior with C++
  • Only rejects invalid/corrupted bytes

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions