This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Replace modulus operations in std::seed_seq::generate with conditional checks.
ClosedPublic

Authored by laramiel on May 10 2022, 11:09 AM.

Details

Summary

Abseil benchmarks suggest that the conditional checks result in faster code (4-5x)
as they are compiled into conditional move instructions (cmov on x86).

Diff Detail

Event Timeline

laramiel created this revision.May 10 2022, 11:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2022, 11:09 AM
laramiel updated this revision to Diff 428441.May 10 2022, 11:20 AM

Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.

laramiel published this revision for review.May 10 2022, 11:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2022, 11:38 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
philnik requested changes to this revision.May 10 2022, 11:43 AM
philnik added a subscriber: philnik.

Could you provide some benchmarks in libcxx/benchmarks and their results?

This revision now requires changes to proceed.May 10 2022, 11:43 AM
laramiel updated this revision to Diff 428459.May 10 2022, 12:07 PM

Add benchmark file.

laramiel updated this revision to Diff 428463.May 10 2022, 12:13 PM

Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.

laramiel updated this revision to Diff 428464.May 10 2022, 12:21 PM

Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.

This looks pretty good, but I'd like to see some benchmark results before approving. BTW you can ignore clang-format currently. (I assume it's the reason for your last two diffs). If you don't have commit access please provide "Your Name" <your.email@address.com> for attribution.

libcxx/benchmarks/random.bench.cpp
1 ↗(On Diff #428464)

This shouldn't be in a .cpp file, only in .h files.

10–14 ↗(On Diff #428459)

Could you order these alphabetically?

libcxx/include/__random/seed_seq.h
121–135

This would be just perfect for a lambda. Why do we have to support C++03?

laramiel updated this revision to Diff 428472.May 10 2022, 12:58 PM
laramiel marked 2 inline comments as done.

Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.

Resolve some comments

libcxx/benchmarks/random.bench.cpp
1 ↗(On Diff #428464)

I can do that, but it's in a lot of the *.bench.cpp files. Though not all.

philnik added inline comments.May 10 2022, 1:01 PM
libcxx/benchmarks/random.bench.cpp
1 ↗(On Diff #428464)

Sorry for the miscommunication. I only meant the // -*- C++ -*-.

laramiel added a comment.EditedMay 10 2022, 4:01 PM

Ok, I managed to run the benchmarks with the before and after on both my x86 workstation and my mac M1 laptop. For the M1, the new code is about the same, but the new code is dramatically faster on x86.

Mac LAPTOP

CPU Caches:
  L1 Data 64 KiB (x10)
  L1 Instruction 128 KiB (x10)
  L2 Unified 4096 KiB (x5)
Load Average: 3.74, 4.25, 3.59

OLD


Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_SeedSeq_Generate/1/1          16.9 ns         16.9 ns     39431288
BM_SeedSeq_Generate/8/1          50.1 ns         50.0 ns     13943668
BM_SeedSeq_Generate/16/1         88.4 ns         87.9 ns      7974663
BM_SeedSeq_Generate/1/8          56.1 ns         56.0 ns     12601941
BM_SeedSeq_Generate/8/8          60.7 ns         60.7 ns     11631385
BM_SeedSeq_Generate/16/8         97.0 ns         93.6 ns      7510568
BM_SeedSeq_Generate/1/64          564 ns          561 ns      1259672
BM_SeedSeq_Generate/8/64          549 ns          549 ns      1241113
BM_SeedSeq_Generate/16/64         541 ns          540 ns      1294139
BM_SeedSeq_Generate/1/256        2299 ns         2296 ns       304877
BM_SeedSeq_Generate/8/256        2298 ns         2297 ns       305722
BM_SeedSeq_Generate/16/256       2300 ns         2299 ns       305262

NEW


Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_SeedSeq_Generate/1/1          17.0 ns         17.0 ns     38130515
BM_SeedSeq_Generate/8/1          49.5 ns         49.5 ns     14140414
BM_SeedSeq_Generate/16/1         86.6 ns         86.1 ns      8184933
BM_SeedSeq_Generate/1/8          47.3 ns         47.2 ns     14856161
BM_SeedSeq_Generate/8/8          50.7 ns         50.7 ns     13919268
BM_SeedSeq_Generate/16/8         79.8 ns         79.6 ns      8695436
BM_SeedSeq_Generate/1/64          520 ns          520 ns      1361550
BM_SeedSeq_Generate/8/64          519 ns          519 ns      1355355
BM_SeedSeq_Generate/16/64         525 ns          525 ns      1319261
BM_SeedSeq_Generate/1/256        2198 ns         2194 ns       317959
BM_SeedSeq_Generate/8/256        2196 ns         2194 ns       320623
BM_SeedSeq_Generate/16/256       2215 ns         2201 ns       317943

X86

Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Run on (12 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 15360 KiB (x1)
Load Average: 0.34, 0.38, 0.36

OLD


Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_SeedSeq_Generate/1/1          51.9 ns         51.9 ns     12867935
BM_SeedSeq_Generate/8/1           179 ns          179 ns      3915093
BM_SeedSeq_Generate/16/1          325 ns          325 ns      2158696
BM_SeedSeq_Generate/1/8           361 ns          361 ns      1960768
BM_SeedSeq_Generate/8/8           342 ns          342 ns      1982211
BM_SeedSeq_Generate/16/8          485 ns          484 ns      1449114
BM_SeedSeq_Generate/1/64         3010 ns         3008 ns       234187
BM_SeedSeq_Generate/8/64         2966 ns         2964 ns       237480
BM_SeedSeq_Generate/16/64        2910 ns         2909 ns       240558
BM_SeedSeq_Generate/1/256       12132 ns        12127 ns        58263
BM_SeedSeq_Generate/8/256       12051 ns        12046 ns        58368
BM_SeedSeq_Generate/16/256      12169 ns        12163 ns        58210

NEW


Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_SeedSeq_Generate/1/1          26.1 ns         26.1 ns     25494346
BM_SeedSeq_Generate/8/1          45.5 ns         45.5 ns     15384561
BM_SeedSeq_Generate/16/1         78.0 ns         77.9 ns      9019345
BM_SeedSeq_Generate/1/8          65.0 ns         65.0 ns     10775171
BM_SeedSeq_Generate/8/8          69.4 ns         69.4 ns     10152010
BM_SeedSeq_Generate/16/8          100 ns        100.0 ns      7041379
BM_SeedSeq_Generate/1/64          488 ns          488 ns      1426010
BM_SeedSeq_Generate/8/64          488 ns          488 ns      1434429
BM_SeedSeq_Generate/16/64         489 ns          488 ns      1322369
BM_SeedSeq_Generate/1/256        1938 ns         1938 ns       362517
BM_SeedSeq_Generate/8/256        1934 ns         1934 ns       361980
BM_SeedSeq_Generate/16/256       1936 ns         1936 ns       361482
philnik accepted this revision as: philnik.May 10 2022, 4:11 PM

Wow, these numbers look great! LGTM with the license header added back.

laramiel updated this revision to Diff 428529.May 10 2022, 4:12 PM

Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.

laramiel marked an inline comment as done.May 10 2022, 4:18 PM
laramiel added inline comments.May 10 2022, 4:21 PM
libcxx/benchmarks/random.bench.cpp
20 ↗(On Diff #428529)

Does the benchmark need to support C++03? because std::array

philnik added inline comments.May 10 2022, 4:24 PM
libcxx/benchmarks/random.bench.cpp
20 ↗(On Diff #428529)

No. We normally don't have any divergence of performance critical code between C++ versions, so we only run the benchmarks for C++20(?).

What next? The AIX 32-bit build fails with an unrelated error, all the rest pass.

FAILED: libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o
/opt/IBM/openxlC/17.1.0/bin/ibm-clang++_r -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LIBCPP_BUILDING_LIBRARY -D_LIBCPP_DISABLE_NEW_DELETE_DEFINITIONS -D_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER -DSTDC_CONSTANT_MACROS -DSTDC_FORMAT_MACROS -DSTDC_LIMIT_MACROS -Ilibcxx/include/c++build -I/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxx/src -I/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxxabi/include -DLIBC_NO_CPP_MATH_OVERLOADS__ -mcmodel=large -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O2 -g -fPIC -DLIBCXX_BUILDING_LIBCXXABI -faligned-allocation -nostdinc++ -fvisibility-inlines-hidden -Wall -Wextra -W -Wwrite-strings -Wno-unused-parameter -Wno-long-long -Werror=return-type -Wextra-semi -Wundef -Wformat-nonliteral -Wno-user-defined-literals -Wno-covered-switch-default -Wno-suggest-override -Werror -I/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1 -std=c++2a -MD -MT libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o -MF libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o.d -o libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o -c /scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxx/src/atomic.cpp
In file included from /scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxx/src/atomic.cpp:12:
/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:1005:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]

return __c11_atomic_fetch_add(&__a->__a_value, __delta, static_cast<__memory_order_underlying_t>(__order));
       ^

/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:948:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]

return __c11_atomic_load(const_cast<__ptr_type>(&__a->__a_value), static_cast<__memory_order_underlying_t>(__order));
       ^

/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:1000:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]

return __c11_atomic_fetch_add(&__a->__a_value, __delta, static_cast<__memory_order_underlying_t>(__order));
       ^

/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:1022:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]

return __c11_atomic_fetch_sub(&__a->__a_value, __delta, static_cast<__memory_order_underlying_t>(__order));
       ^

4 errors generated.

@laramiel The AIX build is fixed now, so you can rebase to fix the CI. You have to wait for a second libc++ approver to review and then this can be landed.

laramiel updated this revision to Diff 429300.May 13 2022, 11:14 AM

Updating D125329: Rebase

Abseil benchmarks suggest that the conditional checks result in faster code (4-5x)
as they can be compiled into conditional move instructions.

Nice improvement! Did state that the cmov is the big win?

In general quite happy with the changes, but I really like a bit more comment explaining the cleverness.

libcxx/benchmarks/random.bench.cpp
20 ↗(On Diff #428529)

Correct the benchmarks are indeed only tested with C++20.

libcxx/include/__random/seed_seq.h
88

I saw libcxx/test/std/numerics/rand/rand.util/rand.util.seedseq/generate.pass.cpp tests the generated output of this function.

112

This is a kind of clever code which isn't really obvious. Can you add a bit more comment what it exactly does. After you did that the __k % __n comment becomes clear.

153

I think here we can add some comment too. The modulo in this loop works when __k == __m, which is true when __s + 1 <= __m.

laramiel added inline comments.May 16 2022, 9:50 AM
libcxx/include/__random/seed_seq.h
112

The comments that I added worked to make it clear to me while reading [rand.util.seedseq].
What specific comment would make it more clear to you?

153

What comment do you think helps? I think that the modulo pattern is pretty clear, particularly since it mirrors the first loop.

laramiel updated this revision to Diff 429755.May 16 2022, 10:01 AM

Adjust comments & rename km1modn to k1modn.

laramiel edited the summary of this revision. (Show Details)May 16 2022, 10:02 AM
laramiel updated this revision to Diff 429756.May 16 2022, 10:05 AM

Sort benchmark consts

laramiel marked 2 inline comments as done.May 16 2022, 10:08 AM
Mordante requested changes to this revision.May 16 2022, 11:01 AM

Something went wrong with your patch, the changes to libcxx/include/__random/seed_seq.h have disappeared. Can you upload a fixed version?

This revision now requires changes to proceed.May 16 2022, 11:01 AM
laramiel updated this revision to Diff 429789.May 16 2022, 11:06 AM

Rebase. Does this fix it.

Mordante requested changes to this revision.May 16 2022, 11:14 AM

Rebase. Does this fix it.

Unfortunately no. Do you have multiple commits in your local branch? Arc only uses one commit.
I guess it is related to arc making the new review D125702.

This revision now requires changes to proceed.May 16 2022, 11:14 AM
laramiel updated this revision to Diff 429792.EditedMay 16 2022, 11:15 AM

Try arc diff origin/main, which seems to work after the rebase.

It's all back and the buildbot has completed a green build.

Mordante accepted this revision.May 17 2022, 10:46 AM

LGTM after adding the comment. Do you need somebody to land to patch for you?
If so please provide your name and e-mail address.

libcxx/include/__random/seed_seq.h
112

I think it would be nice to explain a bit more what happens and more importantly why the code is written in the current way. That would help future readers to understand the code without looking in the git history. I propose something along the lines of:

As an optimization the code in the loop doesn't calculate modulo n every iteration.
Instead it's initialized here allowing the loop to use an if statement instead of a modulo.

If you have a link to the claim of abseil it would be great to add it too.

153

I see the precondition __s + 1 <= __m is always true. Might be nice to add a comment to that effect, but not required.

This revision is now accepted and ready to land.May 17 2022, 10:46 AM

The abseil code is internal, and we opted to upstream the main optimization rather than push another std:: compatible type, so I don't have a public reference there.

laramiel updated this revision to Diff 431005.May 20 2022, 10:26 AM

Updating D125329: Update comment on loop.

laramiel marked an inline comment as done.May 20 2022, 10:27 AM
laramiel added inline comments.
libcxx/include/__random/seed_seq.h
112

Ok, I have a very similar comment to that.

Please take another look.

laramiel marked an inline comment as done.May 20 2022, 10:30 AM

I would prefer help landing a patch; lar@google.com (Laramie Leavitt)

laramiel updated this revision to Diff 431007.May 20 2022, 10:34 AM

Updating D125329: Rebase

philnik retitled this revision from Replace modulus operations in std::seed_seq::generate with conditional checks. to [libc++] Replace modulus operations in std::seed_seq::generate with conditional checks..May 24 2022, 1:27 AM