Skip to content

fix: Preserve integer values in round() for large Int64 and UInt64 inputs#22697

Open
pchintar wants to merge 1 commit into
apache:mainfrom
pchintar:round-large-integer-handling
Open

fix: Preserve integer values in round() for large Int64 and UInt64 inputs#22697
pchintar wants to merge 1 commit into
apache:mainfrom
pchintar:round-large-integer-handling

Conversation

@pchintar

@pchintar pchintar commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

round() should not change integer values when the scale is non-negative, since no fractional digits need to be rounded.

Currently, core round() coerces large Int64 values through Float64, causing precision loss:

SELECT round(arrow_cast(9007199254740993, 'Int64'));

Before/Current Buggy Output:

9007199254740992.0

Expected:

9007199254740993

The Spark-compatible round() also fails for UInt64 values above i64::MAX even when the scale is non-negative:

SELECT round(arrow_cast(18446744073709551615, 'UInt64'));

Before/Current Buggy Output:

round: UInt64 value 18446744073709551615 exceeds i64::MAX and cannot be rounded

What changes are included in this PR?

  • Preserved integer inputs in core round() for non-negative scales instead of routing them through Float64.

  • Preserved UInt64 values in Spark-compatible round() when the scale is non-negative, avoiding unnecessary UInt64 -> i64 conversion.

  • Added SQLLogicTest coverage for:

    • Int64 values above 2^53 in core round().
    • UInt64::MAX in Spark-compatible round().
    • Both one-argument and two-argument forms.

Are these changes tested?

Yes.

cargo fmt --all
git diff --check
cargo test -p datafusion-sqllogictest --test sqllogictests -- spark/math/round.slt
cargo test -p datafusion-functions round
cargo test -p datafusion-spark round

I also verified the core regression queries manually:

SELECT arrow_typeof(round(arrow_cast(9007199254740993, 'Int64'))),
       round(arrow_cast(9007199254740993, 'Int64'));

Result:

Int64 9007199254740993
SELECT arrow_typeof(round(arrow_cast(9007199254740993, 'Int64'), 2)),
       round(arrow_cast(9007199254740993, 'Int64'), 2);

Result:

Int64 9007199254740993

Are there any user-facing changes?

No.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation spark labels Jun 1, 2026
@pchintar pchintar force-pushed the round-large-integer-handling branch from e565975 to 7832c0c Compare June 1, 2026 13:58
Comment thread datafusion/functions/src/math/round.rs Outdated
Comment thread datafusion/functions/src/math/round.rs Outdated
Comment thread datafusion/functions/src/math/round.rs
Comment thread datafusion/sqllogictest/test_files/scalar.slt
Comment thread datafusion/functions/src/math/round.rs Outdated
@kumarUjjawal

Copy link
Copy Markdown
Contributor

cc @neilconway

@pchintar pchintar force-pushed the round-large-integer-handling branch from 7832c0c to 97a7294 Compare June 2, 2026 09:13
@pchintar

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal Thnx for the review!

I've updated the implementation so that integer inputs now stay as integer types instead of being converted through Float64. This preserves large Int64 values that cannot be represented exactly as Float64.

I also added tests for the cases you mentioned:

  • negative decimal_places values (for example, round(125, -1))
  • column-valued decimal_places
  • large Int64 values above the exact Float64 range

Additionally, I updated the affected sqllogictest expectations since the plan output changed after removing the unnecessary cast to Float64.

I've also rerun the relevant tests and manually verified all these cases.

@pchintar

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal could you pls re-run the CI checks? thnx again

@pchintar

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal This failure seems to be unrelated to my code.

It failed before running DataFusion tests because GitHub Actions could not pull the Docker image:

docker pull amd64/rust
Get "https://registry-1.docker.io/v2/": context deadline exceeded

So this is a CI infrastructure/network failure, not a test failure perhaps?

@kumarUjjawal

Copy link
Copy Markdown
Contributor

@kumarUjjawal This failure seems to be unrelated to my code.

It failed before running DataFusion tests because GitHub Actions could not pull the Docker image:

docker pull amd64/rust
Get "https://registry-1.docker.io/v2/": context deadline exceeded

So this is a CI infrastructure/network failure, not a test failure perhaps?

Yeah ignore this, it's Github issue

@pchintar

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Yeah ignore this, it's Github issue

So wdyt of my PR? is it good enough to be approved? or is it too early to tell?

@kumarUjjawal

Copy link
Copy Markdown
Contributor

Yeah ignore this, it's Github issue

So wdyt of my PR? is it good enough to be approved? or is it too early to tell?

Will review tomorrow!

@pchintar

pchintar commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Will review tomorrow!

Sure, thank you!

@@ -468,6 +487,11 @@ fn round_columnar(
let decimal_places_is_array = matches!(decimal_places, ColumnarValue::Array(_));

let arr: ArrayRef = match (value_array.data_type(), return_type) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please restore a fast path, when decimal_places is a non-negative scalar literal and input_type == return_type, return value_array directly, only fall into round_integer_array for negative/array decimal_places.

@pchintar pchintar Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kumarUjjawal I've restored the fast path for non-negative scalar decimal_places values while still using round_integer_array for negative and column-valued decimal_places.

Before:

(input_type, return_type)
    if input_type == return_type && is_integer_data_type(return_type) =>
{
    round_integer_array(value_array.as_ref(), decimal_places, return_type)?
}

Now:

(input_type, return_type)
    if input_type == return_type && is_integer_data_type(return_type) =>
{
    match decimal_places {
        ColumnarValue::Scalar(ScalarValue::Int32(Some(dp))) if *dp >= 0 => {
            value_array
        }
        _ => round_integer_array(
            value_array.as_ref(),
            decimal_places,
            return_type,
        )?,
    }
}

@pchintar pchintar force-pushed the round-large-integer-handling branch from 97a7294 to 744cc36 Compare June 3, 2026 10:16
@pchintar

pchintar commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal could you kindly pls re-run the checks now? thnx

@pchintar

pchintar commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@kumarUjjawal all the checks seem to have passed

@kumarUjjawal kumarUjjawal requested a review from Jefffrey June 3, 2026 11:38
@pchintar

pchintar commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey could you pls check this PR and lmk if possible? thnx

@kumarUjjawal

Copy link
Copy Markdown
Contributor

@pchintar let's add some tests for:

  • overflow-on-negative-scale case
  • unsigned 64-bit values
  • Rounding to a place larger than the number itself

@pchintar pchintar force-pushed the round-large-integer-handling branch from 744cc36 to d73d6c3 Compare June 8, 2026 08:47
@pchintar

pchintar commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@pchintar let's add some tests for:

  • overflow-on-negative-scale case
  • unsigned 64-bit values
  • Rounding to a place larger than the number itself

Hi @kumarUjjawal , I've added tests for all three cases in scalar.slt :

  • overflow with Int64::MAX and negative scale
  • UInt64::MAX with one-arg and two-arg round
  • rounding 125 to a larger place (-5)

I also verified the queries manually.

Comment thread datafusion/functions/src/math/round.rs Outdated
Comment thread datafusion/spark/src/function/math/round.rs
Comment thread datafusion/functions/src/math/round.rs Outdated
@@ -630,6 +663,180 @@ fn round_columnar(
}
}

fn round_integer_value(value: i128, decimal_places: i32) -> Result<i128, ArrowError> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is converting all ints to i128 strictly needed for correctness, or its for ergonomics?

Comment thread datafusion/functions/src/math/round.rs Outdated

let scalar = match return_type {
Int8 => ScalarValue::Int8(Some(i8::try_from(rounded).map_err(|_| {
ArrowError::ComputeError("Overflow while rounding Int8".to_string())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these dont need to be ArrowErrors since we're return a datafusion error in this function

Comment thread datafusion/functions/src/math/round.rs Outdated
@pchintar pchintar force-pushed the round-large-integer-handling branch from d73d6c3 to 29a59b9 Compare June 9, 2026 07:01
@pchintar

pchintar commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey I replaced the custom integer-type helper with DataType::is_integer() and restored the original UInt64 handling in the Spark implementation, since Spark doesn't expose a UInt64 type. I also reran the existing tests and manually verified the Int64/UInt64 rounding cases to make sure the behavior is unchanged.

@pchintar

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey just a kind reminder, could you pls approve this PR if everything looks fine, if not pls lmk, thanks again.

@Jefffrey

Copy link
Copy Markdown
Contributor

Some of my comments haven't been addressed, could you please check them?

@pchintar pchintar force-pushed the round-large-integer-handling branch 2 times, most recently from 6d96f9d to 5966609 Compare June 21, 2026 18:29
@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 21, 2026
@pchintar pchintar force-pushed the round-large-integer-handling branch from 5966609 to c7d003c Compare June 21, 2026 19:45
@github-actions github-actions Bot removed the auto detected api change Auto detected API change label Jun 21, 2026
@pchintar

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey I'm v.sorry for this late reply as I was on vacation last week actually.

So you had provided me 5 suggestions earlier, I've addressed the 3 cleanup suggestions out of those 5:

  • replaced the custom integer-type helper with DataType::is_integer()
  • restored the original UInt64 path in the Spark implementation, since Spark doesn't expose a UInt64 type
  • simplified round_integer_scalar() to match on (value, return_type) directly, removing the separate input/output matches while keeping the same checked conversions and overflow behavior

For the remaining two points though:

  • I kept the shared i128 intermediate intentionally. It allows all integer types to use the same negative-scale rounding logic and then perform a checked conversion back to the original type, so overflow is detected consistently without duplicating the implementation for every integer width.

  • I also kept ArrowError in the rounding helpers. The array path goes through calculate_binary_math, whose callback is defined in terms of ArrowError, so keeping the same error type there avoids adding a separate conversion layer. The scalar path follows the same pattern for consistency.

I hope this reply helps clarify, thnx again!

@pchintar

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey could you kindly provide any update on this?

@Jefffrey

Copy link
Copy Markdown
Contributor

I feel the i128 conversion could be avoided with some generic trickery 🤔

For reference, I updated the benchmarks to have an i64 bench with scale 2 and with scale -2, and ran on this branch:

round size=1024/round_f64_array
                        time:   [813.88 ns 819.69 ns 823.61 ns]
                        change: [−2.9895% −2.2717% −1.7347%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) high mild
round size=1024/round_i64_array
                        time:   [104.98 ns 105.65 ns 106.55 ns]
                        change: [−2.5935% −1.3028% −0.0545%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
round size=1024/round_i64_array_neg_scale
                        time:   [4.4552 µs 4.4583 µs 4.4623 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

round size=4096/round_f64_array
                        time:   [2.7175 µs 2.7288 µs 2.7357 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low severe
round size=4096/round_i64_array
                        time:   [105.02 ns 105.41 ns 105.90 ns]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
round size=4096/round_i64_array_neg_scale
                        time:   [17.140 µs 17.144 µs 17.149 µs]

round size=8192/round_f64_array
                        time:   [5.6102 µs 5.6347 µs 5.6583 µs]
round size=8192/round_i64_array
                        time:   [104.78 ns 105.08 ns 105.37 ns]
round size=8192/round_i64_array_neg_scale
                        time:   [34.582 µs 34.604 µs 34.627 µs]
  • round_i64_array = scale 2, expected pass through fast since no work needed
  • round_i64_array_neg_scale = scale -2, needs to go through i128 path

And to test, I modified the code to avoid the i128 conversion and just operate directly on i64:

round size=1024/round_f64_array
                        time:   [810.04 ns 814.87 ns 817.61 ns]
                        change: [−1.4862% −0.5891% +0.2634%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low severe
round size=1024/round_i64_array
                        time:   [105.65 ns 105.78 ns 105.92 ns]
                        change: [−0.7381% +0.1219% +0.7794%] (p = 0.79 > 0.05)
                        No change in performance detected.
round size=1024/round_i64_array_neg_scale
                        time:   [2.3741 µs 2.3780 µs 2.3824 µs]
                        change: [−46.758% −46.661% −46.557%] (p = 0.00 < 0.05)
                        Performance has improved.

round size=4096/round_f64_array
                        time:   [2.6846 µs 2.6904 µs 2.6962 µs]
                        change: [−1.7566% −1.4088% −0.9277%] (p = 0.00 < 0.05)
                        Change within noise threshold.
round size=4096/round_i64_array
                        time:   [106.03 ns 106.31 ns 106.67 ns]
                        change: [+0.3251% +0.8590% +1.4019%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
round size=4096/round_i64_array_neg_scale
                        time:   [8.9223 µs 8.9313 µs 8.9400 µs]
                        change: [−47.965% −47.905% −47.852%] (p = 0.00 < 0.05)
                        Performance has improved.

round size=8192/round_f64_array
                        time:   [5.5491 µs 5.5718 µs 5.5949 µs]
                        change: [−1.6684% −1.1164% −0.5111%] (p = 0.00 < 0.05)
                        Change within noise threshold.
round size=8192/round_i64_array
                        time:   [109.97 ns 110.49 ns 111.02 ns]
                        change: [+4.5883% +5.1438% +5.7191%] (p = 0.00 < 0.05)
                        Performance has regressed.
round size=8192/round_i64_array_neg_scale
                        time:   [17.872 µs 17.946 µs 18.034 µs]
                        change: [−48.360% −48.138% −47.920%] (p = 0.00 < 0.05)
                        Performance has improved.

So the i128 conversion does seem to have a noticeable performance impact.

We can always try to optimize in a followup if that is preference, just want to highlight this

@pchintar pchintar force-pushed the round-large-integer-handling branch from c7d003c to 8d15cdf Compare June 27, 2026 12:49
@pchintar

pchintar commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Hi @Jefffrey thnx for the benchmark comparison. So, I updated my integer rounding implementation to remove the i128 conversion in the integer path. Instead of converting each integer value to i128 before rounding and then converting it back, the implementation now performs the rounding directly on the native integer type using PrimInt together with checked_pow. I also split the implementation into separate signed and unsigned paths so that the existing overflow handling and rounding behavior for each integer type are preserved, while avoiding the extra conversion overhead in the hot path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

round() mishandles large Int64 and UInt64 values

3 participants