Abseil benchmarks suggest that the conditional checks result in faster code (4-5x)
as they are compiled into conditional move instructions (cmov on x86).
Details
- Reviewers
- ldionne - STL_MSFT - philnik - Mordante 
- Group Reviewers
- Restricted Project 
- Commits
- rGb07880454ba3: [libc++] Replace modulus operations in std::seed_seq::generate with conditional…
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.
Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.
Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.
This looks pretty good, but I'd like to see some benchmark results before approving. BTW you can ignore clang-format currently. (I assume it's the reason for your last two diffs). If you don't have commit access please provide "Your Name" <your.email@address.com> for attribution.
| libcxx/benchmarks/random.bench.cpp | ||
|---|---|---|
| 2 | This shouldn't be in a .cpp file, only in .h files. | |
| 11–15 | Could you order these alphabetically? | |
| libcxx/include/__random/seed_seq.h | ||
| 121–135 | This would be just perfect for a lambda. Why do we have to support C++03? | |
| libcxx/benchmarks/random.bench.cpp | ||
|---|---|---|
| 2 | Sorry for the miscommunication. I only meant the // -*- C++ -*-. | |
Ok, I managed to run the benchmarks with the before and after on both my x86 workstation and my mac M1 laptop. For the M1, the new code is about the same, but the new code is dramatically faster on x86.
Mac LAPTOP
CPU Caches: L1 Data 64 KiB (x10) L1 Instruction 128 KiB (x10) L2 Unified 4096 KiB (x5) Load Average: 3.74, 4.25, 3.59
OLD
Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_SeedSeq_Generate/1/1 16.9 ns 16.9 ns 39431288 BM_SeedSeq_Generate/8/1 50.1 ns 50.0 ns 13943668 BM_SeedSeq_Generate/16/1 88.4 ns 87.9 ns 7974663 BM_SeedSeq_Generate/1/8 56.1 ns 56.0 ns 12601941 BM_SeedSeq_Generate/8/8 60.7 ns 60.7 ns 11631385 BM_SeedSeq_Generate/16/8 97.0 ns 93.6 ns 7510568 BM_SeedSeq_Generate/1/64 564 ns 561 ns 1259672 BM_SeedSeq_Generate/8/64 549 ns 549 ns 1241113 BM_SeedSeq_Generate/16/64 541 ns 540 ns 1294139 BM_SeedSeq_Generate/1/256 2299 ns 2296 ns 304877 BM_SeedSeq_Generate/8/256 2298 ns 2297 ns 305722 BM_SeedSeq_Generate/16/256 2300 ns 2299 ns 305262
NEW
Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_SeedSeq_Generate/1/1 17.0 ns 17.0 ns 38130515 BM_SeedSeq_Generate/8/1 49.5 ns 49.5 ns 14140414 BM_SeedSeq_Generate/16/1 86.6 ns 86.1 ns 8184933 BM_SeedSeq_Generate/1/8 47.3 ns 47.2 ns 14856161 BM_SeedSeq_Generate/8/8 50.7 ns 50.7 ns 13919268 BM_SeedSeq_Generate/16/8 79.8 ns 79.6 ns 8695436 BM_SeedSeq_Generate/1/64 520 ns 520 ns 1361550 BM_SeedSeq_Generate/8/64 519 ns 519 ns 1355355 BM_SeedSeq_Generate/16/64 525 ns 525 ns 1319261 BM_SeedSeq_Generate/1/256 2198 ns 2194 ns 317959 BM_SeedSeq_Generate/8/256 2196 ns 2194 ns 320623 BM_SeedSeq_Generate/16/256 2215 ns 2201 ns 317943
X86
Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz Run on (12 X 4000 MHz CPU s) CPU Caches: L1 Data 32 KiB (x6) L1 Instruction 32 KiB (x6) L2 Unified 256 KiB (x6) L3 Unified 15360 KiB (x1) Load Average: 0.34, 0.38, 0.36
OLD
Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_SeedSeq_Generate/1/1 51.9 ns 51.9 ns 12867935 BM_SeedSeq_Generate/8/1 179 ns 179 ns 3915093 BM_SeedSeq_Generate/16/1 325 ns 325 ns 2158696 BM_SeedSeq_Generate/1/8 361 ns 361 ns 1960768 BM_SeedSeq_Generate/8/8 342 ns 342 ns 1982211 BM_SeedSeq_Generate/16/8 485 ns 484 ns 1449114 BM_SeedSeq_Generate/1/64 3010 ns 3008 ns 234187 BM_SeedSeq_Generate/8/64 2966 ns 2964 ns 237480 BM_SeedSeq_Generate/16/64 2910 ns 2909 ns 240558 BM_SeedSeq_Generate/1/256 12132 ns 12127 ns 58263 BM_SeedSeq_Generate/8/256 12051 ns 12046 ns 58368 BM_SeedSeq_Generate/16/256 12169 ns 12163 ns 58210
NEW
Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_SeedSeq_Generate/1/1 26.1 ns 26.1 ns 25494346 BM_SeedSeq_Generate/8/1 45.5 ns 45.5 ns 15384561 BM_SeedSeq_Generate/16/1 78.0 ns 77.9 ns 9019345 BM_SeedSeq_Generate/1/8 65.0 ns 65.0 ns 10775171 BM_SeedSeq_Generate/8/8 69.4 ns 69.4 ns 10152010 BM_SeedSeq_Generate/16/8 100 ns 100.0 ns 7041379 BM_SeedSeq_Generate/1/64 488 ns 488 ns 1426010 BM_SeedSeq_Generate/8/64 488 ns 488 ns 1434429 BM_SeedSeq_Generate/16/64 489 ns 488 ns 1322369 BM_SeedSeq_Generate/1/256 1938 ns 1938 ns 362517 BM_SeedSeq_Generate/8/256 1934 ns 1934 ns 361980 BM_SeedSeq_Generate/16/256 1936 ns 1936 ns 361482
Updating D125329: Replace modulus operations in std::seed_seq::generate with conditional checks.
| libcxx/benchmarks/random.bench.cpp | ||
|---|---|---|
| 21 | Does the benchmark need to support C++03? because std::array | |
| libcxx/benchmarks/random.bench.cpp | ||
|---|---|---|
| 21 | No. We normally don't have any divergence of performance critical code between C++ versions, so we only run the benchmarks for C++20(?). | |
What next? The AIX 32-bit build fails with an unrelated error, all the rest pass.
FAILED: libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o
/opt/IBM/openxlC/17.1.0/bin/ibm-clang++_r  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LIBCPP_BUILDING_LIBRARY -D_LIBCPP_DISABLE_NEW_DELETE_DEFINITIONS -D_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER -DSTDC_CONSTANT_MACROS -DSTDC_FORMAT_MACROS -DSTDC_LIMIT_MACROS -Ilibcxx/include/c++build -I/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxx/src -I/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxxabi/include -DLIBC_NO_CPP_MATH_OVERLOADS__ -mcmodel=large -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O2 -g  -fPIC   -DLIBCXX_BUILDING_LIBCXXABI -faligned-allocation -nostdinc++ -fvisibility-inlines-hidden -Wall -Wextra -W -Wwrite-strings -Wno-unused-parameter -Wno-long-long -Werror=return-type -Wextra-semi -Wundef -Wformat-nonliteral -Wno-user-defined-literals -Wno-covered-switch-default -Wno-suggest-override -Werror -I/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1 -std=c++2a -MD -MT libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o -MF libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o.d -o libcxx/src/CMakeFiles/cxx_shared.dir/atomic.cpp.o -c /scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxx/src/atomic.cpp
In file included from /scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/libcxx/src/atomic.cpp:12:
/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:1005:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4  bytes) [-Werror,-Watomic-alignment]
return __c11_atomic_fetch_add(&__a->__a_value, __delta, static_cast<__memory_order_underlying_t>(__order));
       ^/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:948:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]
return __c11_atomic_load(const_cast<__ptr_type>(&__a->__a_value), static_cast<__memory_order_underlying_t>(__order));
       ^/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:1000:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]
return __c11_atomic_fetch_add(&__a->__a_value, __delta, static_cast<__memory_order_underlying_t>(__order));
       ^/scratch/powerllvm/cpap8008/llvm-project/libcxx-ci/build/aix/include/c++/v1/atomic:1022:12: error: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Werror,-Watomic-alignment]
return __c11_atomic_fetch_sub(&__a->__a_value, __delta, static_cast<__memory_order_underlying_t>(__order));
       ^4 errors generated.
@laramiel The AIX build is fixed now, so you can rebase to fix the CI. You have to wait for a second libc++ approver to review and then this can be landed.
Abseil benchmarks suggest that the conditional checks result in faster code (4-5x)
as they can be compiled into conditional move instructions.
Nice improvement! Did state that the cmov is the big win?
In general quite happy with the changes, but I really like a bit more comment explaining the cleverness.
| libcxx/benchmarks/random.bench.cpp | ||
|---|---|---|
| 21 | Correct the benchmarks are indeed only tested with C++20. | |
| libcxx/include/__random/seed_seq.h | ||
| 88 | I saw libcxx/test/std/numerics/rand/rand.util/rand.util.seedseq/generate.pass.cpp tests the generated output of this function. | |
| 112 | This is a kind of clever code which isn't really obvious. Can you add a bit more comment what it exactly does. After you did that the __k % __n comment becomes clear. | |
| 153 | I think here we can add some comment too. The modulo in this loop works when __k == __m, which is true when __s + 1 <= __m. | |
| libcxx/include/__random/seed_seq.h | ||
|---|---|---|
| 112 | The comments that I added worked to make it clear to me while reading [rand.util.seedseq]. | |
| 153 | What comment do you think helps? I think that the modulo pattern is pretty clear, particularly since it mirrors the first loop. | |
Something went wrong with your patch, the changes to libcxx/include/__random/seed_seq.h have disappeared. Can you upload a fixed version?
Unfortunately no. Do you have multiple commits in your local branch? Arc only uses one commit.
I guess it is related to arc making the new review D125702.
LGTM after adding the comment. Do you need somebody to land to patch for you?
If so please provide your name and e-mail address.
| libcxx/include/__random/seed_seq.h | ||
|---|---|---|
| 112 | I think it would be nice to explain a bit more what happens and more importantly why the code is written in the current way. That would help future readers to understand the code without looking in the git history. I propose something along the lines of: As an optimization the code in the loop doesn't calculate modulo n every iteration. Instead it's initialized here allowing the loop to use an if statement instead of a modulo. If you have a link to the claim of abseil it would be great to add it too. | |
| 153 | I see the precondition __s + 1 <= __m is always true. Might be nice to add a comment to that effect, but not required. | |
The abseil code is internal, and we opted to upstream the main optimization rather than push another std:: compatible type, so I don't have a public reference there.
| libcxx/include/__random/seed_seq.h | ||
|---|---|---|
| 112 | Ok, I have a very similar comment to that. Please take another look. | |
This shouldn't be in a .cpp file, only in .h files.