perf: test vectorized varint algo #811

Merged
anthony-swirldslabs merged 3 commits into main from 810-vectorVarInt
May 8, 2026

@anthony-swirldslabs
Contributor

@anthony-swirldslabs anthony-swirldslabs commented May 1, 2026

Description:
Introduces a vectorized LEB128 algo for reading varint values. It uses a fully unrolled loop and a "negative limit" trick to avoid explicit limit checks. In the initial measurements (revised in the updates below), it is 4x faster than our current implementation for 1-byte varints. It is also consistently and equally fast for 2-, 3-, 4-, and 5-byte varints: 2.4x faster for 2-byte and 2x-9x faster for longer encodings.
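
For readers unfamiliar with LEB128, here is a minimal sketch of what "unrolling" a varint read means. It is not the algorithm from this PR: the class and method names are hypothetical, it does not reproduce the "negative limit" trick, and it only unrolls the first three bytes before falling back to a plain loop. The fast path is taken only when at least three bytes remain, which is one simple way to skip per-byte limit checks.

```java
public final class VarIntSketch {
    private VarIntSketch() {}

    /** Baseline: loop-based LEB128 decode of an unsigned varint (at most 10 bytes). */
    public static long readVarIntLoop(byte[] buf, int pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos >= buf.length) {
                throw new IllegalArgumentException("Truncated varint");
            }
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift; // low 7 bits carry the payload
            if (b >= 0) {
                return result; // high bit clear: this was the last byte
            }
        }
        throw new IllegalArgumentException("Malformed varint");
    }

    /** Illustrative unrolled variant: the first three bytes are handled without a loop. */
    public static long readVarIntUnrolled(byte[] buf, int pos) {
        // One up-front check replaces the per-byte limit checks on the fast path.
        if (buf.length - pos < 3) {
            return readVarIntLoop(buf, pos);
        }
        byte b0 = buf[pos];
        if (b0 >= 0) {
            return b0;
        }
        long result = b0 & 0x7F;
        byte b1 = buf[pos + 1];
        if (b1 >= 0) {
            return result | ((long) b1 << 7);
        }
        result |= (long) (b1 & 0x7F) << 7;
        byte b2 = buf[pos + 2];
        if (b2 >= 0) {
            return result | ((long) b2 << 14);
        }
        // Longer encodings: restart with the general loop for simplicity.
        return readVarIntLoop(buf, pos);
    }
}
```

Both methods return the same value for any well-formed input; the unrolled one simply avoids loop overhead for the short encodings that dominate typical protobuf data.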

A varint.md is added to describe the algorithm, so that we don't have to repeat the lengthy doc in every implementation. The core PBJ implementations will be replaced in a separate PR in the future.

Also, a unit test is added to verify the correctness of the algorithm.

UPDATE ON 5/4/2026 morning: A bug was discovered in the benchmark implementation: the position wasn't properly updated after reading a varint. The first table below is updated with the new results, which now look rather disappointing. The second table is removed because the old results are no longer relevant.
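
To illustrate the class of bug described above, here is a small sketch of a reader that must publish its new position after each varint. The `VarIntCursor` class is hypothetical and is not the benchmark code from this PR; it only shows why the position update matters. If the marked assignment were omitted, every call would decode the same bytes again instead of walking through the buffer.

```java
public final class VarIntCursor {
    private final byte[] buf;
    private int pos;

    public VarIntCursor(byte[] buf) {
        this.buf = buf;
    }

    public int position() {
        return pos;
    }

    /** Reads one unsigned LEB128 varint and advances the cursor past it. */
    public long readVarInt() {
        long result = 0;
        int p = pos; // work on a local copy of the position
        for (int shift = 0; shift < 64; shift += 7) {
            // Throws ArrayIndexOutOfBoundsException if the varint is truncated.
            byte b = buf[p++];
            result |= (long) (b & 0x7F) << shift;
            if (b >= 0) {
                pos = p; // publish the new position; without this, the next call re-reads the same varint
                return result;
            }
        }
        throw new IllegalArgumentException("Malformed varint");
    }
}
```

For example, a cursor over the bytes `01 AC 02 7F` yields 1, 300, and 127 on three consecutive calls, and ends at position 4.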

UPDATE ON 5/4/2026 afternoon: After more tweaking, and after restoring fair conditions between the new algo and the existing PBJ implementation (both now use the long type and actually check the limit against the length of the buffer), here are the updated results on Mac aarch64:

Benchmark results:

Benchmark                               (range)   Mode  Cnt     Score    Error   Units
VarIntByteArrayReadBench.pbj                  1  thrpt   15  1352.686 ±  5.331  ops/us
VarIntByteArrayReadBench.pbj                  2  thrpt   15   355.533 ±  2.242  ops/us
VarIntByteArrayReadBench.pbj                  3  thrpt   15   413.225 ±  1.534  ops/us
VarIntByteArrayReadBench.pbj                  4  thrpt   15   293.223 ±  1.596  ops/us
VarIntByteArrayReadBench.pbj                  5  thrpt   15   320.100 ±  6.319  ops/us
VarIntByteArrayReadBench.vector_zigZag        1  thrpt   15  1529.252 ±  9.868  ops/us
VarIntByteArrayReadBench.vector_zigZag        2  thrpt   15   972.393 ± 12.976  ops/us
VarIntByteArrayReadBench.vector_zigZag        3  thrpt   15   596.073 ±  2.460  ops/us
VarIntByteArrayReadBench.vector_zigZag        4  thrpt   15   581.045 ±  3.294  ops/us
VarIntByteArrayReadBench.vector_zigZag        5  thrpt   15   442.047 ±  1.256  ops/us

The numbers may not look as impressive as those from the old, broken benchmark, but we still get a performance boost of about 13% for 1-byte varints. For 2-byte varints the boost is 2.7x, which looks pretty good. Longer varints improve by roughly 38% to 98% (about 1.4x for 3- and 5-byte varints, and nearly 2x for 4-byte varints), which again isn't bad at all.
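
The ratios quoted above can be recomputed directly from the Score column of the Mac aarch64 table. The sketch below does that; the `SpeedupTable` class is hypothetical, and it assumes the `(range)` column is the encoded varint length in bytes, as the discussion implies.

```java
public final class SpeedupTable {
    // Throughput in ops/us, copied from the Mac aarch64 table above (index 0 = 1-byte varint).
    static final double[] PBJ = {1352.686, 355.533, 413.225, 293.223, 320.100};
    static final double[] VECTOR = {1529.252, 972.393, 596.073, 581.045, 442.047};

    private SpeedupTable() {}

    /** Ratio of vector_zigZag throughput to pbj throughput for the given varint length. */
    static double speedup(int bytes) {
        return VECTOR[bytes - 1] / PBJ[bytes - 1];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.printf("%d-byte: %.2fx%n", n, speedup(n));
        }
    }
}
```

Error margins are ignored here; with the reported errors at around 1% of the score or less, they do not change the picture.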

I'll share results on an AMD machine once I get them.
Results on an AMD Linux machine (we don't have an Intel Linux machine readily available):
https://github.com/swirldslabs/performance-analysis-automation/actions/runs/25448397833/job/74658708248

Benchmark                               (range)   Mode  Cnt     Score    Error   Units
VarIntByteArrayReadBench.pbj                  1  thrpt   15  1102.281 ±  6.512  ops/us
VarIntByteArrayReadBench.pbj                  2  thrpt   15   224.119 ±  3.151  ops/us
VarIntByteArrayReadBench.pbj                  3  thrpt   15   205.939 ±  4.638  ops/us
VarIntByteArrayReadBench.pbj                  4  thrpt   15   162.483 ±  2.497  ops/us
VarIntByteArrayReadBench.pbj                  5  thrpt   15   171.661 ±  1.825  ops/us
VarIntByteArrayReadBench.vector_zigZag        1  thrpt   15  1219.005 ± 17.063  ops/us
VarIntByteArrayReadBench.vector_zigZag        2  thrpt   15   637.920 ±  8.852  ops/us
VarIntByteArrayReadBench.vector_zigZag        3  thrpt   15   354.285 ±  9.974  ops/us
VarIntByteArrayReadBench.vector_zigZag        4  thrpt   15   302.984 ±  1.620  ops/us
VarIntByteArrayReadBench.vector_zigZag        5  thrpt   15   239.761 ±  1.405  ops/us

While the absolute numbers are lower, the relative gains are approximately the same as on the Mac: about 11% for 1-byte varints, 2.8x for 2-byte varints, and 1.4x to 1.9x for longer varints.

Related issue(s):

Fixes #810

Notes for reviewer:
All tests should pass.

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@anthony-swirldslabs anthony-swirldslabs self-assigned this May 1, 2026
@anthony-swirldslabs anthony-swirldslabs requested review from a team as code owners May 1, 2026 23:18
@github-actions

github-actions Bot commented May 1, 2026

JUnit Test Report

    81 files  ±0      81 suites  ±0   3m 16s ⏱️ ±0s
 1 519 tests ±0   1 515 ✅ ±0   4 💤 ±0  0 ❌ ±0 
10 407 runs  ±0  10 379 ✅ ±0  28 💤 ±0  0 ❌ ±0 

Results for commit ff9a81c. ± Comparison against base commit b629795.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented May 1, 2026

Integration Test Report

    420 files  +1      420 suites  +1   17m 7s ⏱️ - 10m 47s
114 984 tests +2  114 984 ✅ +2  0 💤 ±0  0 ❌ ±0 
115 226 runs  +2  115 226 ✅ +2  0 💤 ±0  0 ❌ ±0 

Results for commit ff9a81c. ± Comparison against base commit b629795.

This pull request removes 3 and adds 5 tests. Note that renamed tests count towards both.
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [1] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000000009b935f0@2a8f10c8
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [2] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000000009b93838@61a537ae
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [3] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000000009b93a80@376e7549
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [1] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x000000007cc5eaf0@2bbec358
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [2] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x000000007cc5ed38@58f5f9ca
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [3] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x000000007cc5ef80@23e54450
com.hedera.pbj.integration.test.VectorVarIntTest ‑ [1] true
com.hedera.pbj.integration.test.VectorVarIntTest ‑ [2] false

♻️ This comment has been updated with latest results.

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@jasperpotts
Member

It would be very interesting to know what google numbers are as well. Not sure if PBJ is as fast or faster than google.

@anthony-swirldslabs
Contributor Author

@jasperpotts :

It would be very interesting to know what google numbers are as well. Not sure if PBJ is as fast or faster than google.

It was slower in prior runs, which is why I commented it out. I uncommented it locally and changed an internal var from int to long to match the actual implementation ( https://github.com/protocolbuffers/protobuf/blob/main/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L1323 ), and here are the results on my Mac:

Benchmark                               (range)   Mode  Cnt     Score    Error   Units
VarIntByteArrayReadBench.google               1  thrpt   15  1550.680 ± 15.346  ops/us
VarIntByteArrayReadBench.google               2  thrpt   15   403.662 ±  3.259  ops/us
VarIntByteArrayReadBench.google               3  thrpt   15   488.655 ±  4.323  ops/us
VarIntByteArrayReadBench.google               4  thrpt   15   325.786 ±  2.124  ops/us
VarIntByteArrayReadBench.google               5  thrpt   15   392.850 ±  1.542  ops/us
VarIntByteArrayReadBench.pbj                  1  thrpt   15  1354.360 ±  6.827  ops/us
VarIntByteArrayReadBench.pbj                  2  thrpt   15   356.162 ±  0.904  ops/us
VarIntByteArrayReadBench.pbj                  3  thrpt   15   405.432 ±  6.666  ops/us
VarIntByteArrayReadBench.pbj                  4  thrpt   15   286.206 ±  1.054  ops/us
VarIntByteArrayReadBench.pbj                  5  thrpt   15   314.905 ±  1.486  ops/us
VarIntByteArrayReadBench.vector_zigZag        1  thrpt   15  1496.003 ±  3.532  ops/us
VarIntByteArrayReadBench.vector_zigZag        2  thrpt   15   954.625 ±  2.824  ops/us
VarIntByteArrayReadBench.vector_zigZag        3  thrpt   15   580.596 ±  3.386  ops/us
VarIntByteArrayReadBench.vector_zigZag        4  thrpt   15   564.084 ±  2.740  ops/us
VarIntByteArrayReadBench.vector_zigZag        5  thrpt   15   429.190 ±  1.940  ops/us

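I will re-enable it in a future fix and test on Alex's real hardware. However, as you can see from the results, the new algo is faster overall: it beats the Google implementation for 2- to 5-byte varints and trails it by about 4% only for 1-byte varints.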

@anthony-swirldslabs anthony-swirldslabs merged commit 3eab3fe into main May 8, 2026
22 checks passed
@anthony-swirldslabs anthony-swirldslabs deleted the 810-vectorVarInt branch May 8, 2026 17:23