This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize std::rotate
Needs Review · Public

Authored by philnik on Apr 20 2022, 1:37 PM.

Details

Reviewers
Mordante
var-const
Group Reviewers
Restricted Project
Summary

This removes the random access std::rotate "optimization".

Fixes https://github.com/llvm/llvm-project/issues/54949
Fixes https://github.com/llvm/llvm-project/issues/39644

The bug report proposes to use a different rotate algorithm, but that should be done in a different patch.

Diff Detail

Event Timeline

philnik created this revision. · Apr 20 2022, 1:37 PM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 20 2022, 1:37 PM
philnik requested review of this revision. · Apr 20 2022, 1:37 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript

Here are the numbers:

-------------------------------------------------------------------------------
Benchmark                                                  old Time    new Time
-------------------------------------------------------------------------------
BM_Rotate_uint32_Random_1                                   10.6 ns     10.7 ns
BM_Rotate_uint32_Random_4                                   24.3 ns     23.3 ns
BM_Rotate_uint32_Random_16                                  43.7 ns     32.9 ns
BM_Rotate_uint32_Random_64                                   116 ns     64.4 ns
BM_Rotate_uint32_Random_256                                  423 ns      181 ns
BM_Rotate_uint32_Random_1024                                1574 ns      612 ns
BM_Rotate_uint32_Random_16384                              26318 ns     8900 ns
BM_Rotate_uint32_Random_262144                            450038 ns   142255 ns
BM_Rotate_uint64_Random_1                                   10.6 ns     10.6 ns
BM_Rotate_uint64_Random_4                                   24.3 ns     23.2 ns
BM_Rotate_uint64_Random_16                                  44.3 ns     34.4 ns
BM_Rotate_uint64_Random_64                                   117 ns     67.7 ns
BM_Rotate_uint64_Random_256                                  424 ns      193 ns
BM_Rotate_uint64_Random_1024                                1589 ns      653 ns
BM_Rotate_uint64_Random_16384                              26920 ns     9357 ns
BM_Rotate_uint64_Random_262144                            460469 ns   149440 ns
BM_Rotate_pair<uint32, uint32>_Random_1                     10.6 ns     10.7 ns
BM_Rotate_pair<uint32, uint32>_Random_4                     23.2 ns     23.2 ns
BM_Rotate_pair<uint32, uint32>_Random_16                    36.9 ns     35.2 ns
BM_Rotate_pair<uint32, uint32>_Random_64                    80.7 ns     80.3 ns
BM_Rotate_pair<uint32, uint32>_Random_256                    268 ns      268 ns
BM_Rotate_pair<uint32, uint32>_Random_1024                  1010 ns     1009 ns
BM_Rotate_pair<uint32, uint32>_Random_16384                15728 ns    15734 ns
BM_Rotate_pair<uint32, uint32>_Random_262144              251816 ns   251948 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_1            10.6 ns     10.6 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_4            24.1 ns     23.7 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_16           40.1 ns     40.7 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_64            115 ns      113 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_256           412 ns      405 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_1024         1548 ns     1532 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_16384       24555 ns    23936 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_262144     395888 ns   386210 ns
BM_Rotate_string_Random_1                                   10.6 ns     10.6 ns
BM_Rotate_string_Random_4                                   48.7 ns     47.6 ns
BM_Rotate_string_Random_16                                   124 ns      124 ns
BM_Rotate_string_Random_64                                   288 ns      287 ns
BM_Rotate_string_Random_256                                  756 ns      758 ns
BM_Rotate_string_Random_1024                                2278 ns     2266 ns
BM_Rotate_string_Random_16384                              33052 ns    32758 ns
BM_Rotate_string_Random_262144                            533664 ns   533456 ns
BM_Rotate_float_Random_1                                    10.6 ns     10.6 ns
BM_Rotate_float_Random_4                                    25.2 ns     23.9 ns
BM_Rotate_float_Random_16                                   45.3 ns     33.4 ns
BM_Rotate_float_Random_64                                    115 ns     61.2 ns
BM_Rotate_float_Random_256                                   424 ns      167 ns
BM_Rotate_float_Random_1024                                 1578 ns      561 ns
BM_Rotate_float_Random_16384                               24589 ns     8015 ns
BM_Rotate_float_Random_262144                             409517 ns   125895 ns
philnik edited the summary of this revision. · Apr 20 2022, 1:39 PM

@philnik Note that the issue also contains a comment (https://github.com/llvm/llvm-project/issues/54949#issuecomment-1101619447) indicating that the current optimization can be faster for types that aren't cheap to move. Can we try to detect that and still use the current optimization in that case? Also, can you please try a benchmark with a non-trivially-copyable type and see what the numbers look like?

philnik edited the summary of this revision. · Apr 23 2022, 11:03 AM

It looks like it is indeed a bit faster for structs that are expensive to move. I'll enable the optimization for non-trivially-move-constructible types and for types larger than 32 bytes.

------------------------------------------------------------------
Benchmark                                    old Time     new Time
------------------------------------------------------------------
BM_Rotate_ExpensiveToMove_Random_1            10.4 ns      10.5 ns
BM_Rotate_ExpensiveToMove_Random_4             300 ns       301 ns
BM_Rotate_ExpensiveToMove_Random_16           1299 ns      1363 ns
BM_Rotate_ExpensiveToMove_Random_64           5401 ns      6023 ns
BM_Rotate_ExpensiveToMove_Random_256         21964 ns     24717 ns
BM_Rotate_ExpensiveToMove_Random_1024        89849 ns    101480 ns
BM_Rotate_ExpensiveToMove_Random_16384     1550789 ns   1732624 ns
BM_Rotate_ExpensiveToMove_Random_262144   30205994 ns  35608461 ns

@var-const The non-trivially-copyable part should be covered by string, or did you have something more specific in mind?

philnik updated this revision to Diff 424741. · Apr 23 2022, 11:59 AM

I noticed that it wasn't actually enabled for non-trivial types, so I only restricted it to types larger than 32 bytes. I'm not sure we want to keep this in, though: it's a large part of the code while only being enabled for a very small set of types, and it only makes a relatively small difference performance-wise. If we could find a good heuristic for enabling it for non-trivial types I would be happier to keep it in, although I don't think we can have a good heuristic for this kind of thing.

I'd be fine with this since it addresses the underwhelming performance for simple types like int, which is super important. However, I would prefer if we instead went for a better algorithm directly, like the swap/grail rotate algorithm mentioned in https://github.com/llvm/llvm-project/issues/54949#issue-1206295098.

libcxx/include/__algorithm/rotate.h
118–119

I don't think constexpr adds a lot of value here since the compiler will definitely fold this anyway, and it's kind of weird to have _LIBCPP_CONSTEXPR_AFTER_CXX14 in that location.

118–119

I would also do something like

const bool __is_expensive_to_move = sizeof(value_type) > 32;
if (is_trivially_foo<...> && __is_expensive_to_move) { ... }

That way, the code is somewhat self-documenting.

I'd be fine with this since it addresses the underwhelming performance for simple types like int, which is super important. However, I would prefer if we instead went for a better algorithm directly, like the swap/grail rotate algorithm mentioned in https://github.com/llvm/llvm-project/issues/54949#issue-1206295098.

I planned to do a follow-up for that anyway. Would it be OK if I just nuke the current implementation and add the ranges API in the same patch?

I'd be fine with this since it addresses the underwhelming performance for simple types like int, which is super important. However, I would prefer if we instead went for a better algorithm directly, like the swap/grail rotate algorithm mentioned in https://github.com/llvm/llvm-project/issues/54949#issue-1206295098.

I planned to do a follow-up for that anyway. Would it be OK if I just nuke the current implementation and add the ranges API in the same patch?

Yes, if we can have perf benchmarks for before and after. If that's easy to do, you could also first nuke the current implementation and replace it with one that is ranges::-friendly, and then just add the tiny ranges::rotate overlay on top (+ tests) in a separate patch. Both ways are acceptable, though.

philnik updated this revision to Diff 447827. · Jul 26 2022, 2:10 PM
  • Upload current status for Konstantin