perf: test vectorized varint algo by anthony-swirldslabs · Pull Request #811 · hashgraph/pbj

anthony-swirldslabs · 2026-05-01T23:18:10Z

Description:
Introducing a vectorized LEB128 algo for reading varint values that uses a fully unrolled loop and employs a "negative limit" trick to avoid explicit limit checks. It's 4x faster for 1-byte varints than our current implementation. It's consistently and equally fast for 2, 3, 4, and 5-byte varints as well: 2.4x faster for 2-byte and 2x-9x faster for longer encodings.
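For readers unfamiliar with LEB128, here is a minimal sketch of what "fully unrolled" means for varints of up to 5 bytes, next to a plain loop-based decoder. This is an illustration only, not the code from this PR: the class and method names are hypothetical, and the "negative limit" trick is not reproduced because the description above doesn't spell it out. The sketch simply relies on Java's own array bounds checks.

```java
// Hypothetical illustration, not the PR's implementation.
final class VarintSketch {

    /** Loop-based LEB128 decode of an unsigned varint of up to 10 bytes. */
    static long readVarintLoop(byte[] buf, int pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift; // low 7 bits carry payload
            if (b >= 0) {
                return result; // high bit clear: this was the last byte
            }
        }
        throw new IllegalArgumentException("Malformed varint: more than 10 bytes");
    }

    /** Unrolled decode of the first 5 bytes; longer encodings fall back to the loop. */
    static long readVarintUnrolled(byte[] buf, int pos) {
        byte b = buf[pos];
        if (b >= 0) return b;
        long result = b & 0x7F;

        b = buf[pos + 1];
        if (b >= 0) return result | ((long) b << 7);
        result |= (long) (b & 0x7F) << 7;

        b = buf[pos + 2];
        if (b >= 0) return result | ((long) b << 14);
        result |= (long) (b & 0x7F) << 14;

        b = buf[pos + 3];
        if (b >= 0) return result | ((long) b << 21);
        result |= (long) (b & 0x7F) << 21;

        b = buf[pos + 4];
        if (b >= 0) return result | ((long) b << 28);

        // 6 to 10 byte encodings: restart with the generic loop.
        return readVarintLoop(buf, pos);
    }
}
```

The unrolled version has no loop counter and no variable shift, so each encoding length follows a fixed, short branch sequence.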

~~A varint.md is added to describe the algorithm, so that we don't have to repeat the lengthy doc in every implementation. The core PBJ implementations will be replaced in a separate PR in the future.~~

Also, a unit test is added to verify the correctness of the algorithm.

UPDATE ON 5/4/2026 morning: A bug in the benchmark implementation has been discovered where the position wasn't properly updated after finishing reading a varint. The first table below is updated with the new results, which look rather disappointing now. The second table is removed as the old results are no longer relevant.
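To illustrate the kind of bug described above, here is a minimal sketch of a cursor-style reader that commits its position after each varint and checks the limit against the buffer length. The `VarintCursor` class is hypothetical and is not the benchmark code from this PR. If the `pos = p` assignment is left out, every call decodes the same bytes again, which is the sort of mistake that can make a benchmark look faster than it really is.

```java
// Hypothetical illustration, not the benchmark code from this PR.
final class VarintCursor {
    private final byte[] buf;
    private int pos;

    VarintCursor(byte[] buf) {
        this.buf = buf;
    }

    int position() {
        return pos;
    }

    /** Reads one unsigned LEB128 varint and advances the cursor past it. */
    long readVarint() {
        long result = 0;
        int p = pos;
        for (int shift = 0; shift < 64; shift += 7) {
            if (p >= buf.length) {
                throw new IllegalStateException("Truncated varint");
            }
            byte b = buf[p++];
            result |= (long) (b & 0x7F) << shift;
            if (b >= 0) {
                pos = p; // commit the new position; without this the next call re-reads the same varint
                return result;
            }
        }
        throw new IllegalStateException("Malformed varint: more than 10 bytes");
    }
}
```

Reading the bytes `01 AC 02 7F` with three consecutive `readVarint()` calls yields 1, 300, and 127, and leaves the position at 4.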

UPDATE ON 5/4/2026 afternoon: After more tweaking, and after restoring fair conditions between the new algo and the existing pbj implementation (both now use the long type and actually check the limit against the length of the buffer), here are the updated results on Mac aarch64:

Benchmark results:

Benchmark                               (range)   Mode  Cnt     Score    Error   Units
VarIntByteArrayReadBench.pbj                  1  thrpt   15  1352.686 ±  5.331  ops/us
VarIntByteArrayReadBench.pbj                  2  thrpt   15   355.533 ±  2.242  ops/us
VarIntByteArrayReadBench.pbj                  3  thrpt   15   413.225 ±  1.534  ops/us
VarIntByteArrayReadBench.pbj                  4  thrpt   15   293.223 ±  1.596  ops/us
VarIntByteArrayReadBench.pbj                  5  thrpt   15   320.100 ±  6.319  ops/us
VarIntByteArrayReadBench.vector_zigZag        1  thrpt   15  1529.252 ±  9.868  ops/us
VarIntByteArrayReadBench.vector_zigZag        2  thrpt   15   972.393 ± 12.976  ops/us
VarIntByteArrayReadBench.vector_zigZag        3  thrpt   15   596.073 ±  2.460  ops/us
VarIntByteArrayReadBench.vector_zigZag        4  thrpt   15   581.045 ±  3.294  ops/us
VarIntByteArrayReadBench.vector_zigZag        5  thrpt   15   442.047 ±  1.256  ops/us

The numbers may not look as impressive as those from the old, broken implementation, but we still get a performance boost of about 13% for 1-byte varints. For 2-byte varints the boost is actually 2.7x, which looks pretty good. Longer varints improve by roughly 38% to 98% (about 44% for 3-byte, 98% for 4-byte, and 38% for 5-byte), which again isn't bad at all.

~~I'll share results on an AMD once I get them.~~
Results on an AMD Linux (we don't have Intel Linux readily available):
https://github.com/swirldslabs/performance-analysis-automation/actions/runs/25448397833/job/74658708248

Benchmark                               (range)   Mode  Cnt     Score    Error   Units
VarIntByteArrayReadBench.pbj                  1  thrpt   15  1102.281 ±  6.512  ops/us
VarIntByteArrayReadBench.pbj                  2  thrpt   15   224.119 ±  3.151  ops/us
VarIntByteArrayReadBench.pbj                  3  thrpt   15   205.939 ±  4.638  ops/us
VarIntByteArrayReadBench.pbj                  4  thrpt   15   162.483 ±  2.497  ops/us
VarIntByteArrayReadBench.pbj                  5  thrpt   15   171.661 ±  1.825  ops/us
VarIntByteArrayReadBench.vector_zigZag        1  thrpt   15  1219.005 ± 17.063  ops/us
VarIntByteArrayReadBench.vector_zigZag        2  thrpt   15   637.920 ±  8.852  ops/us
VarIntByteArrayReadBench.vector_zigZag        3  thrpt   15   354.285 ±  9.974  ops/us
VarIntByteArrayReadBench.vector_zigZag        4  thrpt   15   302.984 ±  1.620  ops/us
VarIntByteArrayReadBench.vector_zigZag        5  thrpt   15   239.761 ±  1.405  ops/us

While the absolute numbers are slightly different, relative numbers show approximately the same gains as above when testing on a Mac.
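For anyone who wants to check the relative gains, here is a small sketch that computes the vector_zigZag / pbj throughput ratio per varint length. The scores are copied from the AMD table above; the `SpeedupTable` class itself is just a hypothetical helper, not part of this PR.

```java
// Hypothetical helper; scores copied from the AMD Linux table (ops/us).
final class SpeedupTable {
    static final double[] PBJ = {1102.281, 224.119, 205.939, 162.483, 171.661};
    static final double[] VECTOR = {1219.005, 637.920, 354.285, 302.984, 239.761};

    /** Throughput ratio of vector_zigZag over pbj for a varint of the given length (1..5). */
    static double speedup(int varintBytes) {
        return VECTOR[varintBytes - 1] / PBJ[varintBytes - 1];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.printf("%d-byte: %.2fx%n", n, speedup(n));
        }
    }
}
```

On these numbers the ratios come out at roughly 1.1x, 2.8x, 1.7x, 1.9x, and 1.4x for 1 to 5 bytes, which is in the same ballpark as the Mac results.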

Related issue(s):

Fixes #810

Notes for reviewer:
All tests should pass.

Checklist

Documented (Code comments, README, etc.)
Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>

github-actions · 2026-05-01T23:19:17Z

JUnit Test Report

81 files ±0 81 suites ±0 3m 16s ⏱️ ±0s
1 519 tests ±0 1 515 ✅ ±0 4 💤 ±0 0 ❌ ±0
10 407 runs ±0 10 379 ✅ ±0 28 💤 ±0 0 ❌ ±0

Results for commit ff9a81c. ± Comparison against base commit b629795.

♻️ This comment has been updated with latest results.

github-actions · 2026-05-01T23:24:56Z

Integration Test Report

420 files +1 420 suites +1 17m 7s ⏱️ - 10m 47s
114 984 tests +2 114 984 ✅ +2 0 💤 ±0 0 ❌ ±0
115 226 runs +2 115 226 ✅ +2 0 💤 ±0 0 ❌ ±0

Results for commit ff9a81c. ± Comparison against base commit b629795.

This pull request removes 3 and adds 5 tests. Note that renamed tests count towards both.

com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [1] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000000009b935f0@2a8f10c8
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [2] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000000009b93838@61a537ae
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [3] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000000009b93a80@376e7549

com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [1] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x000000007cc5eaf0@2bbec358
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [2] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x000000007cc5ed38@58f5f9ca
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [3] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x000000007cc5ef80@23e54450
com.hedera.pbj.integration.test.VectorVarIntTest ‑ [1] true
com.hedera.pbj.integration.test.VectorVarIntTest ‑ [2] false

♻️ This comment has been updated with latest results.

jasperpotts · 2026-05-07T17:00:23Z

It would be very interesting to know what google numbers are as well. Not sure if PBJ is as fast or faster than google.

anthony-swirldslabs · 2026-05-07T19:05:52Z

@jasperpotts:

> It would be very interesting to know what google numbers are as well. Not sure if PBJ is as fast or faster than google.

It was slower in prior runs, which is why I commented it out. I uncommented it locally and changed an internal var from int to long to match the actual implementation ( https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L1323 ), and here are the results on my Mac:

Benchmark                               (range)   Mode  Cnt     Score    Error   Units
VarIntByteArrayReadBench.google               1  thrpt   15  1550.680 ± 15.346  ops/us
VarIntByteArrayReadBench.google               2  thrpt   15   403.662 ±  3.259  ops/us
VarIntByteArrayReadBench.google               3  thrpt   15   488.655 ±  4.323  ops/us
VarIntByteArrayReadBench.google               4  thrpt   15   325.786 ±  2.124  ops/us
VarIntByteArrayReadBench.google               5  thrpt   15   392.850 ±  1.542  ops/us
VarIntByteArrayReadBench.pbj                  1  thrpt   15  1354.360 ±  6.827  ops/us
VarIntByteArrayReadBench.pbj                  2  thrpt   15   356.162 ±  0.904  ops/us
VarIntByteArrayReadBench.pbj                  3  thrpt   15   405.432 ±  6.666  ops/us
VarIntByteArrayReadBench.pbj                  4  thrpt   15   286.206 ±  1.054  ops/us
VarIntByteArrayReadBench.pbj                  5  thrpt   15   314.905 ±  1.486  ops/us
VarIntByteArrayReadBench.vector_zigZag        1  thrpt   15  1496.003 ±  3.532  ops/us
VarIntByteArrayReadBench.vector_zigZag        2  thrpt   15   954.625 ±  2.824  ops/us
VarIntByteArrayReadBench.vector_zigZag        3  thrpt   15   580.596 ±  3.386  ops/us
VarIntByteArrayReadBench.vector_zigZag        4  thrpt   15   564.084 ±  2.740  ops/us
VarIntByteArrayReadBench.vector_zigZag        5  thrpt   15   429.190 ±  1.940  ops/us

I will re-enable it in a future fix and test on Alex's real hardware. However, as you can see from the results, the new algo is faster overall.

perf: test vectorized varint algo

8594095

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>

anthony-swirldslabs self-assigned this May 1, 2026

anthony-swirldslabs requested review from a team as code owners May 1, 2026 23:18

anthony-swirldslabs requested a review from rbarker-dev May 1, 2026 23:18

github-advanced-security AI found potential problems May 1, 2026

increment the position

5ef5c72

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>

github-advanced-security AI found potential problems May 4, 2026

update the vector algo

ff9a81c

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>

jasperpotts approved these changes May 7, 2026

anthony-swirldslabs merged commit 3eab3fe into main May 8, 2026
22 checks passed

anthony-swirldslabs deleted the 810-vectorVarInt branch May 8, 2026 17:23
