perf: test vectorized varint algo #811
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Integration Test Report: 420 files (+1), 420 suites (+1), 17m 7s ⏱️ (- 10m 47s). Results for commit ff9a81c. ± Comparison against base commit b629795. This pull request removes 3 and adds 5 tests. Note that renamed tests count towards both. ♻️ This comment has been updated with latest results.
It would be very interesting to know what Google's numbers are as well. Not sure whether PBJ is as fast as or faster than Google.
It was slower in prior runs, that's why I commented it out. I uncommented it locally and changed an internal var from I will re-enable it in a future fix and test on Alex's real hardware. However, as you can see from the results, the new algo is faster overall.
Description:
Introducing a vectorized LEB128 algo for reading varint values that uses a fully unrolled loop and employs a "negative limit" trick to avoid explicit limit checks. It's 4x faster for 1-byte varints than our current implementation. It's consistently and equally fast for 2, 3, 4, and 5-byte varints as well: 2.4x faster for 2-byte and 2x-9x faster for longer encodings.

A `varint.md` is added to describe the algorithm, so that we don't have to repeat the lengthy doc in every implementation. The core PBJ implementations will be replaced in a separate PR in the future.

Also, a unit test is added to verify the correctness of the algorithm.
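For readers unfamiliar with the technique, here is a minimal Java sketch contrasting a loop-based LEB128 decoder with a fully unrolled one for 32-bit varints. It is not the code from this PR: the class and method names are hypothetical, and the "negative limit" trick is deliberately left out because its details live in `varint.md` rather than in this description. Bounds checking is simply delegated to Java's array access here.

```java
// Hypothetical sketch, not the PBJ implementation.
final class VarintSketch {

    private VarintSketch() {}

    /** Baseline: decode an unsigned 32-bit LEB128 varint with a loop. */
    static int readVarintLoop(byte[] buf, int pos) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            byte b = buf[pos++];
            result |= (b & 0x7F) << shift;   // low 7 bits carry the payload
            if (b >= 0) {                    // high bit clear: last byte
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint: more than 5 bytes");
    }

    /** Same decoding with the loop fully unrolled (at most 5 bytes). */
    static int readVarintUnrolled(byte[] buf, int pos) {
        byte b = buf[pos];
        if (b >= 0) return b;                          // 1-byte fast path
        int result = b & 0x7F;

        b = buf[pos + 1];
        if (b >= 0) return result | (b << 7);
        result |= (b & 0x7F) << 7;

        b = buf[pos + 2];
        if (b >= 0) return result | (b << 14);
        result |= (b & 0x7F) << 14;

        b = buf[pos + 3];
        if (b >= 0) return result | (b << 21);
        result |= (b & 0x7F) << 21;

        b = buf[pos + 4];
        if (b >= 0) return result | (b << 28);         // bits above 32 are dropped
        throw new IllegalArgumentException("malformed varint: more than 5 bytes");
    }
}
```

Both methods return the decoded value only; a real reader also has to report how many bytes were consumed so the caller can advance its position.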
UPDATE ON 5/4/2026 morning: A bug in the benchmark implementation has been discovered where the position wasn't properly updated after finishing reading a varint. The first table below is updated with the new results, which look rather disappointing now. The second table is removed as the old results are no longer relevant.
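To illustrate the kind of bug described in this update, here is a small hypothetical cursor in Java that commits the new position once a varint has been fully read. The benchmark code itself is not shown in this description, so the class name, fields, and methods below are assumptions made purely for illustration.

```java
// Hypothetical sketch of a varint reader that tracks its own position.
final class VarintCursor {
    private final byte[] buf;
    private int pos;

    VarintCursor(byte[] buf) {
        this.buf = buf;
    }

    int position() {
        return pos;
    }

    /** Reads one unsigned 32-bit varint and advances the cursor past it. */
    int readVarint() {
        int p = pos;          // work on a local copy of the position
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            if (p >= buf.length) {
                throw new IndexOutOfBoundsException("varint runs past end of buffer");
            }
            byte b = buf[p++];
            result |= (b & 0x7F) << shift;
            if (b >= 0) {
                pos = p;      // the step that is easy to forget: commit the new position
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint: more than 5 bytes");
    }
}
```

If the `pos = p` assignment is dropped, every call decodes the same bytes again, which can make a benchmark measure something other than a sequential read of the buffer.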
UPDATE ON 5/4/2026 afternoon: After more tweaking, as well as restoring the fair conditions between the new algo and the existing pbj implementation in terms of using the `long` type and actually checking the limit against the length of the buffer, here are the updated results on Mac aarch64:

Benchmark results:
The numbers may not look as impressive as the ones the old broken implementation showed, but we still get a performance boost of about 13% for 1-byte varints. For 2-byte varints the performance boost is actually 2.7x, which looks pretty good. Longer varints improve by roughly 40%, which again isn't bad at all.
I'll share results on an AMD once I get them.

Results on an AMD Linux (we don't have Intel Linux readily available):
https://github.com/swirldslabs/performance-analysis-automation/actions/runs/25448397833/job/74658708248
While the absolute numbers are slightly different, relative numbers show approximately the same gains as above when testing on a Mac.
Related issue(s):
Fixes #810
Notes for reviewer:
All tests should pass.
Checklist