This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
libcxx/
-
benchmarks/
6/7
algorithms.bench.cpp
-
include/__algorithm/
-
__algorithm/
20/33
sort.h

Differential D118029

Introduce branchless sorting functions for sort3, sort4 and sort5.
ClosedPublic

Authored by marcogelmi on Jan 24 2022, 3:50 AM.

Download Raw Diff

Details

Reviewers

jdoerfert
minjaehwang
ldionne
philnik

Group Reviewers

Restricted Project

Commits

rG194d1965d2c8: Introduce branchless sorting functions for sort3, sort4 and sort5.

Summary

We are introducing branchless variants for sort3, sort4 and sort5.
These sorting functions have been generated using Reinforcement
Learning and aim to replace sort3, sort4 and __sort5 variants
for integral types.

The libc++ benchmarks were run on isolated machines for Skylake, ARM and
AMD architectures and achieve statistically significant improvement in
sorting random integers on test cases from sort1 to sort262144 for
uint32 and uint64.

A full performance overview for Intel Skylake, AMD and Arm can be
found here: https://bit.ly/3AtesYf

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	7,160 ms	libcxx CI C++11 > llvm-libc++-shared-cfg-in.libcxx/gdb::gdb_pretty_printer_test.sh.cpp
	6,800 ms	libcxx CI C++14 > llvm-libc++-shared-cfg-in.libcxx/gdb::gdb_pretty_printer_test.sh.cpp
	8,640 ms	libcxx CI C++17 > llvm-libc++-shared-cfg-in.libcxx/gdb::gdb_pretty_printer_test.sh.cpp
	7,400 ms	libcxx CI C++20 > llvm-libc++-shared-cfg-in.libcxx/gdb::gdb_pretty_printer_test.sh.cpp
	7,550 ms	libcxx CI C++2b > llvm-libc++-shared-cfg-in.libcxx/gdb::gdb_pretty_printer_test.sh.cpp

Event Timeline

marcogelmi created this revision.Jan 24 2022, 3:50 AM

Herald added a subscriber: kristof.beyls. · View Herald TranscriptJan 24 2022, 3:50 AM

marcogelmi requested review of this revision.Jan 24 2022, 3:50 AM

Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2022, 3:50 AM

Herald added a reviewer: Restricted Project. · View Herald Transcript

Herald added a subscriber: libcxx-commits. · View Herald Transcript

Harbormaster completed remote builds in B145197: Diff 402459.Jan 24 2022, 4:06 AM

A few points right now:
(1) Please use using x = . It's a clang extension in C++03, so we can use it.
(2) __magic_swap is a really bad name. Please change that to something more meaningful.
(3) Don't qualify any type traits with _VSTD::.
(4) You can't use enable_if_t. Use __enable_if_t instead.
I'll have a closer look later.

This revision now requires changes to proceed.Jan 24 2022, 4:12 AM

fix initial style issues

Harbormaster completed remote builds in B145215: Diff 402485.Jan 24 2022, 7:52 AM

andreamichi added a subscriber: andreamichi.Jan 26 2022, 4:16 AM

@philnik just wondering whether you had a chance to look at the new changes we have made to the patch?

I'm guessing all the ternaries are making the compiler generate cmovs?
I noticed that the benchmarks for size 4 are generally pretty bad. Do you know if anything can be done there?
I haven't yet looked at the logic itself very closely.

libcxx/include/__algorithm/sort.h
152	Ditto below
160	Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.

I noticed that the benchmarks for size 4 are generally pretty bad. Do you know if anything can be done there?

I noticed this also, but assume that we don't really care about the performance of std::sort on arrays of size 4 (or less). ;) If we're trading some perf on unrealistically small arrays, for better perf on large arrays, that sounds like a good tradeoff to me.
But I, also, have not particularly reviewed this code. Just glanced briefly.

libcxx/include/__algorithm/sort.h
152	And vice versa, `_VSTD::__sort3_internal`. All function calls (that take arguments of user-defined types, anyway) need qualification specifically so that you don't get ADL. Type/variable/concept/... names don't need qualification because they never do ADL. Also, almost certainly `_VSTD::__sort3_internal<_Compare>`, right? We should have a test for "no unnecessary copies of predicates" that catches this, and if it's not failing on this PR then it needs a bit of investigation why not.
160	`is_scalar` might be relevant.

use is_arithmetic instead of is_integral

I'm guessing all the ternaries are making the compiler generate cmovs?

yes that's correct

I noticed that the benchmarks for size 4 are generally pretty bad. Do you know if anything can be done there?

Random 4 is actually very good. It's bad for deterministic orderings because the benchmark just keeps on calling sort4 with the same ordering over and over again, so the branch predictor has a very easy job. I suspect this type of data pattern is not very realistic.

marcogelmi marked an inline comment as done.Feb 1 2022, 6:15 AM

marcogelmi added inline comments.

libcxx/include/__algorithm/sort.h
152	sorry what line of code is this referring to? I thought I used _VSTD::__sort3_internal<_Compare> everywhere I'm calling it.
160	this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
160	switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9

Random 4 is actually very good. It's bad for deterministic orderings because the benchmark just keeps on calling sort4 with the same ordering over and over again, so the branch predictor has a very easy job. I suspect this type of data pattern is not very realistic.

Ok, that makes sense.
I checked the sorting logic now and that part LGTM. Now the only question is which types should be allowed.

libcxx/include/__algorithm/sort.h
160	Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.

Harbormaster completed remote builds in B146881: Diff 404904.Feb 1 2022, 10:15 AM

marcogelmi added inline comments.Feb 2 2022, 9:33 AM

libcxx/include/__algorithm/sort.h
160	so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below). I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1 I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated... Results on ARM: name old cpu/op new cpu/op delta BM_Sort_optional_Random_1 4.02ns ± 0% 4.01ns ± 1% -0.36% (p=0.000 n=99+98) BM_Sort_optional_Random_3 5.33ns ± 0% 5.46ns ± 0% +2.42% (p=0.000 n=94+93) BM_Sort_optional_Random_4 5.39ns ± 0% 6.17ns ± 0% +14.45% (p=0.000 n=99+99) BM_Sort_optional_Random_5 6.47ns ± 0% 7.35ns ± 0% +13.54% (p=0.000 n=99+98) BM_Sort_optional_Random_16 10.2ns ± 1% 10.8ns ± 2% +5.74% (p=0.000 n=95+100)

I don't see a way to force clang to generate the cmovs, so I'm OK with the patch % nit. I'd like @ldionne to have a look too, since I don't know if the want the new functions to be part of the ABI or not. I'd say we don't, but that's not my decision to make.

libcxx/benchmarks/algorithms.bench.cpp
30–32

fix whitespace

marcogelmi marked an inline comment as done.Feb 3 2022, 2:36 AM

Harbormaster completed remote builds in B147330: Diff 405556.Feb 3 2022, 4:58 AM

hi @ldionne I was wondering if you had a chance to look at this?

Thanks for the patch. So, to summarize, these ternary conditionals generate cmov instead of normal branches when the types involved are arithmetic, right? And that's basically where the improvements come from.

This looks sensible to me, the benchmarks show a fairly consistent improvement for the sizes that we care most about.

libcxx/include/__algorithm/sort.h
128	Also, does the `inline` make a difference in your benchmarks? If not, I'd drop it, because they are semantically equivalent (templates have inline semantics by default). This comment applies below in several places too.
201	Instead, would it be possible to put the `enable_if_t` on `__sort5` directly, and then only add one overload of `__sort5` for when `is_arithmetic` is true? This would avoid introducing a new layer of complexity like `__sort5_internal`. This applies elsewhere too.

This revision now requires changes to proceed.Feb 7 2022, 12:58 PM

thanks for the review!

libcxx/include/__algorithm/sort.h
128	Removing the inline seems to negatively affect the performance for some combinations of tests and architectures. Would it be ok to keep the inline for extra safety?
201	unfortunately the original version of __sortN also returns the number of swaps, while the new one doesn't, so we need both versions

ldionne added inline comments.Feb 10 2022, 7:26 AM

libcxx/include/__algorithm/sort.h
128	Can you share those numbers? IMO that's a LLVM bug, since we like to say that `inline` doesn't have any impact on the codegen. But yeah, at the end of the day, if it makes a positive difference we should keep it.
201	Right -- that actually highlights the problem I have with the name chosen (`__sort5_internal`). There's nothing "internal" about that version of `__sort5`. Instead, we could rename it to `__sort5_maybe_branchless`? It's not great, but at least it explains what it does.

rename sortN_internal to sortN_maybe_branchless

rename sortN_internal to sortN_maybe_branchless
use _LIBCPP_HIDE_FROM_ABI

marcogelmi marked an inline comment as done.Feb 10 2022, 9:51 AM

marcogelmi added inline comments.

libcxx/include/__algorithm/sort.h
128	sure, here they are: https://bit.ly/34Q6z3q

• Quuxplusone added inline comments.Feb 10 2022, 10:53 AM

libcxx/include/__algorithm/sort.h
134	Either call `_VSTD::__cond_swap<_Compare>(...)` below, or (less good IMHO) change it up here to take `_Compare&` instead of `_Compare`. Otherwise we end up making copies of the predicate — and I would have expected that to fail `libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp`! Could you also find out why that test isn't failing, and add a regression test to it? (I imagine it's just not taking the offending codepath because of the shape of the input data values, so adding a few extra test cases of the correct shape will solve the problem.) Ditto `__partially_sorted_swap`.
187–191	Nit: It's more libc++ style to use `enable_if_t<>* = nullptr` than return type SFINAE; I'd say the advantages are readability and that return type SFINAE greatly increases the length of the mangled name while an extra `void*` type parameter doesn't so much. Drive-by `s/typename/class/` and whitespace. Likewise below.

Harbormaster completed remote builds in B148782: Diff 407588.Feb 10 2022, 1:36 PM

add regression test for comparator copies of ints

Herald added a subscriber: mgrang. · View Herald TranscriptFeb 11 2022, 4:59 AM

Harbormaster completed remote builds in B148960: Diff 407847.Feb 11 2022, 5:14 AM

fix template issues

additional fixes for missing temmplates

Harbormaster completed remote builds in B148988: Diff 407883.Feb 11 2022, 8:18 AM

fix c++03 test

Harbormaster completed remote builds in B149008: Diff 407919.Feb 11 2022, 9:55 AM

marcogelmi marked an inline comment as done.Feb 11 2022, 11:19 AM

marcogelmi added inline comments.

libcxx/include/__algorithm/sort.h

187–191

I'm not entirely sure why, but this seems to be failing for the c++03 build (it works fine for the others):, e.g. https://reviews.llvm.org/harbormaster/unit/view/2520185/

candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,

revert small, unintended test change

Harbormaster completed remote builds in B149040: Diff 407968.Feb 11 2022, 1:19 PM

• Quuxplusone added inline comments.Feb 11 2022, 1:22 PM

libcxx/benchmarks/algorithms.bench.cpp
30–32	It might be (long past) time to change this to simply using Types = std::tuple< uint32_t, uint64_t, std::pair<uint32_t, uint64_t>, std::tuple<uint32_t, uint64_t, uint32_t>, std::string, float >; template<class V> using Value = std::tuple_element_t<(int)V(), Types>;
libcxx/include/__algorithm/sort.h
152	sorry what line of code is this referring to? My best guess is that I misread line 153 __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, as a function call. :) No action required.
160	FWIW, I think it would be worth pulling out the condition into a named trait, like template<class _Tp> using __use_branchless_sort = bool_constant< sizeof(_Tp) <= sizeof(void) && is_scalar<_Tp>::value >; All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.) Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to contiguous iterators? template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type> using __use_branchless_sort = bool_constant< __is_cpp17_contiguous_iterator<_Iter> && sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value >; Also note I keep trying to expand `arithmetic` to `scalar`. :)
187–191	Yuck, I see that on Godbolt (C++03 is unhappy with giving a non-type template parameter a default value). So I guess the return type SFINAE is fine.
libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp
177 ↗	(On Diff #407968)	IIUC, what you've done here is just add a call to `std::sort` with `value_type=int`, which triggers your new thing, specifically because your new thing is constrained to `arithmetic` value_types instead of (as I keep trying to drive-by it into) `scalar` value_types. (Pointer types are scalar but not arithmetic.) Is this the only change that was needed in order to trigger your comparator-copying codepath? If so, I'd slightly rather see you just expand your new thing to include `scalar` types! I don't really object to the idea of templatizing the whole thing on `T` and testing with different `T`s; but it feels like a lot of churn if the only culprit was `arithmetic/scalar`, and also I'm skeptical that that was the only culprit here. I had thought that the problem was more about the shape of the input data — I mean, notice that `is_sorted(first, last)` is already true going into this call. In particular, if we now know that there are different codepaths taken for arrays of length 3, 4, 5, then shouldn't we be testing here (void)std::sort(first, first+3, Less(&copies)); assert(copies == 0); (void)std::sort(first, first+4, Less(&copies)); assert(copies == 0); (void)std::sort(first, first+5, Less(&copies)); assert(copies == 0); (void)std::sort(first, last, Less(&copies)); assert(copies == 0); ?

address style comments, use named trait

Harbormaster completed remote builds in B149407: Diff 408449.Feb 14 2022, 9:39 AM

Thanks for the suggestions.

Regarding is_scalar vs is_arithmetic: expanding to is_scalar would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator. This is also not trivial to do with the current test setup.

Can we therefore first submit this CL and then look into the feasibility of incorporating those additional types into our setup?

libcxx/benchmarks/algorithms.bench.cpp
30–32	note that I'm using clang-format and it's changing the format slightly. Also, I had to change to a plain enum
libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp
177 ↗	(On Diff #407968)	yes this is triggered by `int` specifically because `void*` is not included by `is_arithmetic` the reason why this is triggered even if the shape of the data is of length 10 is because there is a base case (`insertion_sort_3`) which calls `__sort3` once and is invoked when 5 < size <= 30. Anyway, I've added the test cases you recommended.

marcogelmi marked 2 inline comments as done.Feb 15 2022, 2:06 AM

marcogelmi added inline comments.

libcxx/include/__algorithm/sort.h
127	bool_constant seems to need >= c++14, any suggestion on how to make this work for c++03?

philnik added inline comments.Feb 15 2022, 2:09 AM

libcxx/include/__algorithm/sort.h
127	use `_LIBCPP_BOOL_CONSTANT`

use _LIBCPP_BOOL_CONSTANT instead of bool_constant

Harbormaster completed remote builds in B149634: Diff 408777.Feb 15 2022, 2:28 AM

rebase to head

Harbormaster completed remote builds in B150222: Diff 409628.Feb 17 2022, 7:32 AM

hi @Quuxplusone, @ldionne just pinging this thread: what do you think of the current state of the diff?

FWIW, I have no strong desire to be obstructionist, and I definitely don't think we should abandon this direction, but I have one bad and one good(?) reason I think we should hold this PR in suspended animation for a while longer:

Bad: I'm still uncomfortable with the invasiveness of the changes to robust_against_copying_comparators.pass.cpp.
Good(?): Very soon now we're going to have to implement ranges::sort, which innovates in a bunch of different directions — we're going to have to deal with sentinels as well as iterators, and we're going to have to replace __c(*x, *y) with invoke(__c, invoke(__p, *x), invoke(__p, *y), and so on. I feel like it would be a good idea to get that stuff out of the way first, and then revisit algorithmic changes like this PR, rather than vice versa. Unfortunately I have no particular estimate of when ranges::sort will get done; @philnik might have a long-range plan to share, but being as we're still down in the foothills of ranges::min_element etc., personally I'd predict it'll be a couple months at least.

Anyway, @ldionne will be the ultimate arbiter (but IIRC he's on vacation this week).

I'll also keep recommending that you replace is_arithmetic with is_scalar, until you actually do it! or explain why not. :)

libcxx/benchmarks/algorithms.bench.cpp
30–32	For my curiosity, why did you have to change to a plain enum? (All else being equal, we should prefer `enum class`, but if it really technically doesn't work, then I don't imagine it's a big deal.)

In D118029#3335871, @Quuxplusone wrote:

FWIW, I have no strong desire to be obstructionist, and I definitely don't think we should abandon this direction, but I have one bad and one good(?) reason I think we should hold this PR in suspended animation for a while longer:

Bad: I'm still uncomfortable with the invasiveness of the changes to robust_against_copying_comparators.pass.cpp.

Good(?): Very soon now we're going to have to implement ranges::sort, which innovates in a bunch of different directions — we're going to have to deal with sentinels as well as iterators, and we're going to have to replace __c(*x, *y) with invoke(__c, invoke(__p, *x), invoke(__p, *y), and so on. I feel like it would be a good idea to get that stuff out of the way first, and then revisit algorithmic changes like this PR, rather than vice versa. Unfortunately I have no particular estimate of when ranges::sort will get done; @philnik might have a long-range plan to share, but being as we're still down in the foothills of ranges::min_element etc., personally I'd predict it'll be a couple months at least.

Anyway, @ldionne will be the ultimate arbiter (but IIRC he's on vacation this week).

I'll also keep recommending that you replace is_arithmetic with is_scalar, until you actually do it! or explain why not. :)

I think ranges::sort is one of the last algorithms on the list. Unlike the ones I'm working on currently, the most naive implementation isn't the fastest. So it will be much more work, and even the ones currently under review take a long time, since we don't have many copy/paste-able tests currently. Assuming we will get faster I'd guess the earliest I would start to implement sort is in 4-6 months. I wouldn't be surprised if sort alone will be under review for a few months.

In D118029#3335871, @Quuxplusone wrote:

FWIW, I have no strong desire to be obstructionist, and I definitely don't think we should abandon this direction, but I have one bad and one good(?) reason I think we should hold this PR in suspended animation for a while longer:

Bad: I'm still uncomfortable with the invasiveness of the changes to robust_against_copying_comparators.pass.cpp.

Good(?): Very soon now we're going to have to implement ranges::sort, which innovates in a bunch of different directions — we're going to have to deal with sentinels as well as iterators, and we're going to have to replace __c(*x, *y) with invoke(__c, invoke(__p, *x), invoke(__p, *y), and so on. I feel like it would be a good idea to get that stuff out of the way first, and then revisit algorithmic changes like this PR, rather than vice versa. Unfortunately I have no particular estimate of when ranges::sort will get done; @philnik might have a long-range plan to share, but being as we're still down in the foothills of ranges::min_element etc., personally I'd predict it'll be a couple months at least.

Anyway, @ldionne will be the ultimate arbiter (but IIRC he's on vacation this week).

I'll also keep recommending that you replace is_arithmetic with is_scalar, until you actually do it! or explain why not. :)

Regarding is_arithmetic vs is_scalar: expanding to is_scalar would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator. This is also not trivial to do with the current test setup. Can we therefore first submit this CL and then look into the feasibility of incorporating those additional types into our setup?

libcxx/benchmarks/algorithms.bench.cpp
30–32	this is just because we need to cast it to `int` to index into the `Types` tuple and that's more tricky with an `enum class`.

• Quuxplusone added inline comments.Feb 24 2022, 4:02 PM

libcxx/benchmarks/algorithms.bench.cpp
30–32	Ahhh, I see now. There's a pair of parens missing from my original code: using Value = std::tuple_element_t<(int)V(), Types>; should have been either using Value = std::tuple_element_t<(int)V()(), Types>; // or, better, using Value = std::tuple_element_t<(int)V::value, Types>; Use either of those instead, please. (Preferably the latter.)
libcxx/include/__algorithm/sort.h
160	Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator. Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like int is[1000]; Widget ps[1000]; std::sort(is, is+1000, std::less<>); // A std::sort(ps, ps+1000, std::less<>); // B but poorly on complicated predicates like Widget ws[10]; std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; }); // C std::sort(ps, ps+1000, [](auto pi, auto pj) { return pi < pj; }); // D That is, you expect that this optimization, applied across the board, would measurably speed up* A and B, but measurably slow down C and D. And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much less common than D. So, your patch intentionally applies only to A and C, and excludes B and D. Is that all correct? I do still think it'd be interesting to find out whether this patch slows down C and D.

daniel.mankowitz added a subscriber: daniel.mankowitz.Mar 2 2022, 11:10 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 11:10 AM

revert to using enum class in tests

marcogelmi marked an inline comment as done.Mar 4 2022, 7:33 AM

marcogelmi added inline comments.

libcxx/include/__algorithm/sort.h
160	You are correct. Our patch applies to A and B and we are pretty certain that our changes will slow down C and D, because our branchless version calls the comparator more times in the average case. This is why we have limited ourselves to A and B. Note that our patch not only applies to ints but also has similar performance improvements for floats. @ldionne it would be great to get your thoughts on the patch as to whether we can submit as is? Thanks in advance.

Harbormaster completed remote builds in B152589: Diff 413004.Mar 4 2022, 10:02 AM

hi @ldionne just wondering if you had time to review the latest changes? do you think it can be submitted as is? thanks again!

Thanks for the ping! I still have some comments, I'm still trying to make sure I understand all the implications of this change since std::sort is such an important (and non-trivial as we can see) algorithm!

libcxx/benchmarks/algorithms.bench.cpp
30–31
libcxx/include/__algorithm/sort.h
118	Commenting here for lack of better place: With this patch, I think we need to add tests for `std::sort` for a random access but non-contiguous range, otherwise we may end up testing. And for sorting on non arithmetic types, too. It looks like our `std::sort` tests might be lacking quite a bit, now that I look at them.
127	I would just go for `integral_constant<bool, EXPR>`. I actually had to look up `_LIBCPP_BOOL_CONSTANT`, and IMO it does not pull its weight. I'll upload a patch to remove it and we can discuss its fate there.
154–155	We usually use the following idiom for SFINAE (this is supported in C++03 as a Clang extension): template <class _Compare, class _RandomAccessIterator, __enable_if_t< __use_branchless_sort<_RandomAccessIterator>::value >* = 0> inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(...) { ... } It has the benefit that the return type is not hidden in the `__enable_if_t`.
160	Let me just try to state this in the most naive way possible to make sure I'm on the same page as you folks: Basically, you assume that sorts on arithmetic types over contiguous ranges will not be using complicated comparators, and that you'll save more by avoiding branches and performing more calls to `compare` than reducing the number of calls to `compare` but having more branches. However, if we have a contiguous range of arithmetic types but the `compare` is extremely expensive, this optimization will actually be a pessimization over the naive approach. Is that correct? If that is correct, then basically in an ideal world, what you'd want would be to enable this optimization only when you know that the cost of the comparator is negligible compared to the cost of branching, since that's what you're reducing with your approach. Of course, that is difficult if not impossible to tell. One approach would be to enable this optimization only when the comparator is something trivial like `std::less` and friends, but IDK if that would prevent this optimization in too many cases.

Enable branchless sort for simple comparators only. Increase test coverage.

Harbormaster completed remote builds in B156956: Diff 419140.Mar 30 2022, 8:08 AM

fix test errors for c++03

Harbormaster completed remote builds in B156971: Diff 419160.Mar 30 2022, 10:49 AM

Thanks for your in-depth feedback!

libcxx/include/__algorithm/sort.h
118	I've added tests for non-contiguous iterators (using a `std::deque<int>`) and non-arithmetic types (using a `std::pair<int, int>`). Let me know if there are other tests I'm missing. I've also added more checks using `std::is_permutation` because the current tests only check for `std::is_sorted`
154–155	this was mentioned by another reviewer but if I try using this idiom I get failures for c++03: candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,
160	Yes your understanding is correct. After discussing your feedback amongst ourselves, we think that you had a great suggestion to enable our patch for simple comparators. Note that I've had to enable it for `__less<_Tp>&` because that seems to be the default when nothing is passed in. Also, the comparator seems to be transformed into a reference using `__comp_ref_type` so that's why it's templated on the reference type.

philnik mentioned this in D122780: Modify std::sort: add BlockQuickSort partitioning algorithm for arithmetic types.Mar 31 2022, 11:12 AM

Thanks a lot! I think this is now a no-brainer since we've made this more conservative (it only triggers on a few comparators that we know are going to benefit from it).

Please ship this once you've addressed my comments and the CI is passing. If you need help committing this, please provide Author Name <email@domain> and we can commit it on your behalf. Thanks again for the patch and for your patience+collaboration going through the review!

libcxx/test/std/algorithms/alg.sorting/alg.sort/sort/sort.pass.cpp
52–56 ↗	(On Diff #419160)	Instead of specializing the function template, can you instead add an overload? inline void set_value(std::pair<int, int>& dest, int value) { dest.first = value; dest.second = value; }
74 ↗	(On Diff #419160)	Please don't use reserved identifiers in the tests. This should be `class Constainer` instead of `class _Container`.
98 ↗	(On Diff #419160)	Same comment, overload instead of specialization please :-)
113 ↗	(On Diff #419160)	Should be `Iter`, not `_Iter`. We only use `_Foo` and `__foo` in the source code of the library itself because it ensures that no users can possibly define a macro with a name that conflicts with our identifiers. Inside our test code, this isn't needed (or desirable since it obscures everything).
137 ↗	(On Diff #419160)	I like that we're adding this coverage!

This revision is now accepted and ready to land.Apr 6 2022, 10:45 AM

address review comments on tests style

Harbormaster completed remote builds in B158448: Diff 421157.Apr 7 2022, 7:18 AM

Thanks everyone for your comments and improvements to this patch!

I don't think I have committer rights so if you could commit this on my behalf that would be great. The attribution should be: Marco Gelmi <marcogelmi@deepmind.com>

Thanks again!

Closed by commit rG194d1965d2c8: Introduce branchless sorting functions for sort3, sort4 and sort5. (authored by marcogelmi, committed by philnik). · Explain WhyApr 8 2022, 12:05 AM

This revision was automatically updated to reflect the committed changes.

philnik added a commit: rG194d1965d2c8: Introduce branchless sorting functions for sort3, sort4 and sort5..

itypedef added a subscriber: itypedef.May 11 2022, 7:13 AM

broshear added a subscriber: broshear.Oct 12 2022, 8:09 AM

m4ph3w added a subscriber: m4ph3w.Jun 15 2023, 5:55 AM

Sdopn added a subscriber: Sdopn.Jul 14 2023, 2:48 AM

Revision Contents

Path

Size

libcxx/

benchmarks/

algorithms.bench.cpp

14 lines

include/

__algorithm/

sort.h

114 lines

Diff 407588

libcxx/benchmarks/algorithms.bench.cpp

#include <algorithm> #include <algorithm>

#include <cstdint> #include <cstdint>

#include <map> #include <map>

#include <random> #include <random>

#include <string> #include <string>

#include <utility> #include <utility>

#include <vector> #include <vector>

#include "CartesianBenchmarks.h" #include "CartesianBenchmarks.h"

#include "GenerateInput.h" #include "GenerateInput.h"

#include "benchmark/benchmark.h" #include "benchmark/benchmark.h"

#include "test_macros.h" #include "test_macros.h"

namespace { namespace {

enum class ValueType { Uint32, Uint64, Pair, Tuple, String }; enum class ValueType { Uint32, Uint64, Pair, Tuple, String, Float };

struct AllValueTypes : EnumValuesAsTuple<AllValueTypes, ValueType, 5> { struct AllValueTypes : EnumValuesAsTuple<AllValueTypes, ValueType, 6> {

static constexpr const char* Names[] = { static constexpr const char* Names[] = {"uint32", "uint64", "pair<uint32, uint32>", "tuple<uint32, uint64, uint32>",

"uint32", "uint64", "pair<uint32, uint32>", "string", "float"};

"tuple<uint32, uint64, uint32>", "string"};

}; };

template <class V> template <class V>

using Value = std::conditional_t< using Value = std::conditional_t<

V() == ValueType::Uint32, uint32_t, V() == ValueType::Uint32, uint32_t,

std::conditional_t< std::conditional_t<

V() == ValueType::Uint64, uint64_t, V() == ValueType::Uint64, uint64_t,

std::conditional_t< std::conditional_t<

V() == ValueType::Pair, std::pair<uint32_t, uint32_t>, V() == ValueType::Pair, std::pair<uint32_t, uint32_t>,

std::conditional_t<V() == ValueType::Tuple, std::conditional_t<V() == ValueType::Tuple, std::tuple<uint32_t, uint64_t, uint32_t>,

std::tuple<uint32_t, uint64_t, uint32_t>, std::conditional_t< V() == ValueType::String, std::string, float> > > > >;

ldionneUnsubmitted

Not Done

template <class V>

- using Value = std::tuple_element_t<(int)V::value, Types>;

+ using Value = std::tuple_element_t<static_cast<int>(V::value), Types>;

enum class Order {

ldionne:

std::string> > > >;

philnikUnsubmitted

Done

V() == ValueType::Pair, std::pair<uint32_t, uint32_t>,

- std::conditional_t< V() == ValueType::Tuple, std::tuple<uint32_t, uint64_t, uint32_t>,

- std::conditional_t< V() == ValueType::String, std::string, float> > > > >;

+ std::conditional_t<V() == ValueType::Tuple, std::tuple<uint32_t, uint64_t, uint32_t>,

+ std::conditional_t< V() == ValueType::String, std::string, float> > > > >;

enum class Order {

philnik:

QuuxplusoneUnsubmitted

Done

It might be (long past) time to change this to simply

using Types = std::tuple<
    uint32_t, uint64_t, std::pair<uint32_t, uint64_t>, std::tuple<uint32_t, uint64_t, uint32_t>,
    std::string, float
>;

template<class V>
using Value = std::tuple_element_t<(int)V(), Types>;

Quuxplusone: It might be (long past) time to change this to simply ``` using Types = std::tuple<…

marcogelmiAuthorUnsubmitted

Done

note that I'm using clang-format and it's changing the format slightly. Also, I had to change to a plain enum

marcogelmi: note that I'm using clang-format and it's changing the format slightly. Also, I had to change…

QuuxplusoneUnsubmitted

Done

For my curiosity, why did you have to change to a plain enum? (All else being equal, we should prefer enum class, but if it really technically doesn't work, then I don't imagine it's a big deal.)

Quuxplusone: For my curiosity, why did you have to change to a plain enum? (All else being equal, we should…

marcogelmiAuthorUnsubmitted

Done

this is just because we need to cast it to int to index into the Types tuple and that's more tricky with an enum class.

marcogelmi: this is just because we need to cast it to `int` to index into the `Types` tuple and that's…

QuuxplusoneUnsubmitted

Done

Ahhh, I see now. There's a pair of parens missing from my original code:

using Value = std::tuple_element_t<(int)V(), Types>;

should have been either

using Value = std::tuple_element_t<(int)V()(), Types>;  // or, better,
using Value = std::tuple_element_t<(int)V::value, Types>;

Use either of those instead, please. (Preferably the latter.)

Quuxplusone: Ahhh, I see now. There's a pair of parens missing from my original code: ``` using Value = std…

enum class Order { enum class Order {

Random, Random,

Ascending, Ascending,

Descending, Descending,

SingleElement, SingleElement,

PipeOrgan, PipeOrgan,

Heap, Heap,

QuickSortAdversary, QuickSortAdversary,

▲ Show 20 Lines • Show All 345 Lines • Show Last 20 Lines

libcxx/include/__algorithm/sort.h

Show First 20 Lines • Show All 109 Lines • ▼ Show 20 Lines if (__c(*__x5, *__x4))

++__r; ++__r;

if (__c(*__x3, *__x2)) if (__c(*__x3, *__x2))

{ {

swap(*__x2, *__x3); swap(*__x2, *__x3);

++__r; ++__r;

if (__c(*__x2, *__x1)) if (__c(*__x2, *__x1))

{ {

swap(*__x1, *__x2); swap(*__x1, *__x2);

++__r; ++__r;

ldionneUnsubmitted

Not Done

Commenting here for lack of better place: With this patch, I think we need to add tests for std::sort for a random access but non-contiguous range, otherwise we may end up testing. And for sorting on non arithmetic types, too. It looks like our std::sort tests might be lacking quite a bit, now that I look at them.

ldionne: Commenting here for lack of better place: With this patch, I think we need to add tests for…

marcogelmiAuthorUnsubmitted

Done

I've added tests for non-contiguous iterators (using a std::deque<int>) and non-arithmetic types (using a std::pair<int, int>). Let me know if there are other tests I'm missing. I've also added more checks using std::is_permutation because the current tests only check for std::is_sorted

marcogelmi: I've added tests for non-contiguous iterators (using a `std::deque<int>`) and non-arithmetic…

} }

return __r; return __r;

} }

// Ensures that __c(*__x, *__y) is true by swapping *__x and *__y if necessary.

template <typename _Compare, typename _RandomAccessIterator>

marcogelmiAuthorUnsubmitted

Done

bool_constant seems to need >= c++14, any suggestion on how to make this work for c++03?

marcogelmi: bool_constant seems to need >= c++14, any suggestion on how to make this work for c++03?

philnikUnsubmitted

Not Done

use _LIBCPP_BOOL_CONSTANT

philnik: use `_LIBCPP_BOOL_CONSTANT`

ldionneUnsubmitted

Done

I would just go for integral_constant<bool, EXPR>. I actually had to look up _LIBCPP_BOOL_CONSTANT, and IMO it does not pull its weight. I'll upload a patch to remove it and we can discuss its fate there.

ldionne: I would just go for `integral_constant<bool, EXPR>`. I actually had to look up…

inline _LIBCPP_HIDE_FROM_ABI void __cond_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _Compare __c) {

ldionneUnsubmitted

Not Done

template <typename _Compare, typename _RandomAccessIterator>

- inline _LIBCPP_HIDDEN void __cond_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _Compare __c) {

+ inline _LIBCPP_HIDE_FROM_ABI void __cond_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _Compare __c) {

using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;

Also, does the inline make a difference in your benchmarks? If not, I'd drop it, because they are semantically equivalent (templates have inline semantics by default).

This comment applies below in several places too.

ldionne: Also, does the `inline` make a difference in your benchmarks? If not, I'd drop it, because they…

marcogelmiAuthorUnsubmitted

Done

Removing the inline seems to negatively affect the performance for some combinations of tests and architectures. Would it be ok to keep the inline for extra safety?

marcogelmi: Removing the inline seems to negatively affect the performance for some combinations of tests…

ldionneUnsubmitted

Not Done

Can you share those numbers? IMO that's a LLVM bug, since we like to say that inline doesn't have any impact on the codegen. But yeah, at the end of the day, if it makes a positive difference we should keep it.

ldionne: Can you share those numbers? IMO that's a LLVM bug, since we like to say that `inline` doesn't…

marcogelmiAuthorUnsubmitted

Done

sure, here they are: https://bit.ly/34Q6z3q

marcogelmi: sure, here they are: https://bit.ly/34Q6z3q

using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;

bool __r = __c(*__x, *__y);

value_type __tmp = __r ? *__x : *__y;

*__y = __r ? *__y : *__x;

*__x = __tmp;

}

QuuxplusoneUnsubmitted

Done

Either call _VSTD::__cond_swap<_Compare>(...) below, or (less good IMHO) change it up here to take _Compare& instead of _Compare. Otherwise we end up making copies of the predicate — and I would have expected that to fail libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp! Could you also find out why that test isn't failing, and add a regression test to it? (I imagine it's just not taking the offending codepath because of the shape of the input data values, so adding a few extra test cases of the correct shape will solve the problem.)

Ditto __partially_sorted_swap.

Quuxplusone: Either call `_VSTD::__cond_swap<_Compare>(...)` below, or (less good IMHO) change it up here to…

// Ensures that *__x, *__y and *__z are ordered according to the comparator __c,

// under the assumption that *__y and *__z are already ordered.

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI void __partially_sorted_swap(_RandomAccessIterator __x, _RandomAccessIterator __y,

_RandomAccessIterator __z, _Compare __c) {

using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;

bool __r = __c(*__z, *__x);

value_type __tmp = __r ? *__z : *__x;

*__z = __r ? *__x : *__z;

__r = __c(__tmp, *__y);

*__x = __r ? *__x : *__y;

*__y = __r ? *__y : __tmp;

}

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI

__enable_if_t<is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

philnikUnsubmitted

Done

inline _LIBCPP_HIDDEN

- __enable_if_t<_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

+ __enable_if_t<is_integral<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

__sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {

Ditto below

philnik: Ditto below

QuuxplusoneUnsubmitted

Not Done

And vice versa, _VSTD::__sort3_internal. All function calls (that take arguments of user-defined types, anyway) need qualification specifically so that you don't get ADL. Type/variable/concept/... names don't need qualification because they never do ADL.
Also, almost certainly _VSTD::__sort3_internal<_Compare>, right? We should have a test for "no unnecessary copies of predicates" that catches this, and if it's not failing on this PR then it needs a bit of investigation why not.

Quuxplusone: And vice versa, `_VSTD::__sort3_internal`. All function calls (that take arguments of user…

marcogelmiAuthorUnsubmitted

Done

sorry what line of code is this referring to? I thought I used _VSTD::__sort3_internal<_Compare> everywhere I'm calling it.

marcogelmi: sorry what line of code is this referring to? I thought I used _VSTD…

QuuxplusoneUnsubmitted

Done

sorry what line of code is this referring to?

My best guess is that I misread line 153

__sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

as a function call. :) No action required.

Quuxplusone: > sorry what line of code is this referring to? My best guess is that I misread line 153 ```…

__sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

_Compare __c) {

_VSTD::__cond_swap(__x2, __x3, __c);

ldionneUnsubmitted

Not Done

We usually use the following idiom for SFINAE (this is supported in C++03 as a Clang extension):

template <class _Compare, class _RandomAccessIterator, __enable_if_t<
    __use_branchless_sort<_RandomAccessIterator>::value
>* = 0>
inline _LIBCPP_HIDE_FROM_ABI
void __sort3_maybe_branchless(...) { ... }

It has the benefit that the return type is not hidden in the __enable_if_t.

ldionne: We usually use the following idiom for SFINAE (this is supported in C++03 as a Clang extension)…

marcogelmiAuthorUnsubmitted

Done

this was mentioned by another reviewer but if I try using this idiom I get failures for c++03:

candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,

marcogelmi: this was mentioned by another reviewer but if I try using this idiom I get failures for c++03…

_VSTD::__partially_sorted_swap(__x1, __x2, __x3, __c);

}

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI

philnikUnsubmitted

Not Done

Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.

philnik: Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at…

QuuxplusoneUnsubmitted

Not Done

is_scalar might be relevant.

Quuxplusone: `is_scalar` might be relevant.

marcogelmiAuthorUnsubmitted

Done

switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9

marcogelmi: switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9

marcogelmiAuthorUnsubmitted

Done

this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.

marcogelmi: this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra…

philnikUnsubmitted

Not Done

Could we generalize it to something like sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>? I'm assuming here that sizeof(size_t) is less than or equal to the word size. This would allow this optimization to be performed on something like std::optional<int>. Could you check if that still yields a performance improvement? std::optional doesn't have a trivial comparison operator, so it might be interesting.

philnik: Could we generalize it to something like `sizeof(typename…

marcogelmiAuthorUnsubmitted

Done

so I tried testing it on optional<int> and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below).

I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1

I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...

Results on ARM:

name                            old cpu/op  new cpu/op  delta
BM_Sort_optional_Random_1       4.02ns ± 0%  4.01ns ± 1%   -0.36%   (p=0.000 n=99+98)
BM_Sort_optional_Random_3       5.33ns ± 0%  5.46ns ± 0%   +2.42%   (p=0.000 n=94+93)
BM_Sort_optional_Random_4       5.39ns ± 0%  6.17ns ± 0%  +14.45%   (p=0.000 n=99+99)
BM_Sort_optional_Random_5       6.47ns ± 0%  7.35ns ± 0%  +13.54%   (p=0.000 n=99+98)
BM_Sort_optional_Random_16      10.2ns ± 1%  10.8ns ± 2%   +5.74%  (p=0.000 n=95+100)

marcogelmi: so I tried testing it on `optional<int>` and the results are underwhelming (pretty much…

QuuxplusoneUnsubmitted

Done

FWIW, I think it would be worth pulling out the condition into a named trait, like

template<class _Tp>
using __use_branchless_sort = bool_constant<
  sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
>;

All this focus on value_type has recently made me paranoid about vector<bool> and proxy iterators in general. I see your __partially_sorted_swap doing like 13 calls to operator*; are we sure that's still going to be efficient when the _RandomAccessIterator in question is an arbitrarily complicated Ranges thing? (We don't support ranges::sort yet, but it would be unfortunate if we poured all this effort into std::sort and then had to re-evaluate the entire thing for ranges::sort.)

Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to contiguous iterators?

template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
using __use_branchless_sort = bool_constant<
  __is_cpp17_contiguous_iterator<_Iter> &&
  sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
>;

Also note I keep trying to expand arithmetic to scalar. :)

Quuxplusone: FWIW, I think it would be worth pulling out the condition into a named trait, like ```…

QuuxplusoneUnsubmitted

Not Done

Regarding is_arithmetic vs is_scalar: expanding to is_scalar would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.

Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like

int is[1000];
Widget *ps[1000];
std::sort(is, is+1000, std::less<>);  // A
std::sort(ps, ps+1000, std::less<>);  // B

but poorly on complicated predicates like

Widget ws[10];
std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; });  // C
std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; });  // D

That is, you expect that this optimization, applied across the board, would measurably speed up A and B, but measurably slow down C and D. And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much less common than D. So, your patch intentionally applies only to A and C, and excludes B and D. Is that all correct?
I do still think it'd be interesting to find out whether this patch slows down C and D.

Quuxplusone: > Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer…

marcogelmiAuthorUnsubmitted

Done

You are correct. Our patch applies to A and B and we are pretty certain that our changes will slow down C and D, because our branchless version calls the comparator more times in the average case. This is why we have limited ourselves to A and B. Note that our patch not only applies to ints but also has similar performance improvements for floats. @ldionne it would be great to get your thoughts on the patch as to whether we can submit as is? Thanks in advance.

marcogelmi: You are correct. Our patch applies to A and B and we are pretty certain that our changes will…

ldionneUnsubmitted

Not Done

Let me just try to state this in the most naive way possible to make sure I'm on the same page as you folks:

Basically, you assume that sorts on arithmetic types over contiguous ranges will not be using complicated comparators, and that you'll save more by avoiding branches and performing more calls to compare than reducing the number of calls to compare but having more branches. However, if we have a contiguous range of arithmetic types but the compare is extremely expensive, this optimization will actually be a pessimization over the naive approach. Is that correct?

If that is correct, then basically in an ideal world, what you'd want would be to enable this optimization only when you know that the cost of the comparator is negligible compared to the cost of branching, since that's what you're reducing with your approach. Of course, that is difficult if not impossible to tell. One approach would be to enable this optimization only when the comparator is something trivial like std::less and friends, but IDK if that would prevent this optimization in too many cases.

ldionne: Let me just try to state this in the most naive way possible to make sure I'm on the same page…

marcogelmiAuthorUnsubmitted

Done

Yes your understanding is correct. After discussing your feedback amongst ourselves, we think that you had a great suggestion to enable our patch for simple comparators.

Note that I've had to enable it for __less<_Tp>& because that seems to be the default when nothing is passed in. Also, the comparator seems to be transformed into a reference using __comp_ref_type so that's why it's templated on the reference type.

marcogelmi: Yes your understanding is correct. After discussing your feedback amongst ourselves, we think…

__enable_if_t<!is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

__sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

_Compare __c) {

_VSTD::__sort3<_Compare>(__x1, __x2, __x3, __c);

}

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI

__enable_if_t<is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

__sort4_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

_RandomAccessIterator __x4, _Compare __c) {

_VSTD::__cond_swap(__x1, __x3, __c);

_VSTD::__cond_swap(__x2, __x4, __c);

_VSTD::__cond_swap(__x1, __x2, __c);

_VSTD::__cond_swap(__x3, __x4, __c);

_VSTD::__cond_swap(__x2, __x3, __c);

}

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI

__enable_if_t<!is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

__sort4_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

_RandomAccessIterator __x4, _Compare __c) {

_VSTD::__sort4(__x1, __x2, __x3, __x4, __c);

}

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI

__enable_if_t<is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

__sort5_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

_RandomAccessIterator __x4, _RandomAccessIterator __x5, _Compare __c) {

QuuxplusoneUnsubmitted

Not Done

_VSTD::__sort4(__x1, __x2, __x3, __x4, __c);

}

- template <typename _Compare, typename _RandomAccessIterator>

+ template <class _Compare, class _RandomAccessIterator,

+ __enable_if_t<is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value>* = nullptr>

inline _LIBCPP_HIDE_FROM_ABI

- __enable_if_t<is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

- __sort5_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

- _RandomAccessIterator __x4, _RandomAccessIterator __x5, _Compare __c) {

+ void __sort5_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

+ _RandomAccessIterator __x4, _RandomAccessIterator __x5, _Compare __c) {

_VSTD::__cond_swap(__x1, __x2, __c);

Nit: It's more libc++ style to use enable_if_t<>* = nullptr than return type SFINAE; I'd say the advantages are readability and that return type SFINAE greatly increases the length of the mangled name while an extra void* type parameter doesn't so much.
Drive-by s/typename/class/ and whitespace.
Likewise below.

Quuxplusone: Nit: It's more libc++ style to use `enable_if_t<>* = nullptr` than return type SFINAE; I'd say…

marcogelmiAuthorUnsubmitted

Done

I'm not entirely sure why, but this seems to be failing for the c++03 build (it works fine for the others):, e.g. https://reviews.llvm.org/harbormaster/unit/view/2520185/

candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,

marcogelmi: I'm not entirely sure why, but this seems to be failing for the c++03 build (it works fine for…

QuuxplusoneUnsubmitted

Done

Yuck, I see that on Godbolt (C++03 is unhappy with giving a non-type template parameter a default value). So I guess the return type SFINAE is fine.

Quuxplusone: Yuck, I see that on Godbolt (C++03 is unhappy with giving a non-type template parameter a…

_VSTD::__cond_swap(__x1, __x2, __c);

_VSTD::__cond_swap(__x4, __x5, __c);

_VSTD::__partially_sorted_swap(__x3, __x4, __x5, __c);

_VSTD::__cond_swap(__x2, __x5, __c);

_VSTD::__partially_sorted_swap(__x1, __x3, __x4, __c);

_VSTD::__partially_sorted_swap(__x2, __x3, __x4, __c);

}

template <typename _Compare, typename _RandomAccessIterator>

inline _LIBCPP_HIDE_FROM_ABI

ldionneUnsubmitted

Not Done

Instead, would it be possible to put the enable_if_t on __sort5 directly, and then only add one overload of __sort5 for when is_arithmetic is true? This would avoid introducing a new layer of complexity like __sort5_internal. This applies elsewhere too.

ldionne: Instead, would it be possible to put the `enable_if_t` on `__sort5` directly, and then only add…

marcogelmiAuthorUnsubmitted

Done

unfortunately the original version of __sortN also returns the number of swaps, while the new one doesn't, so we need both versions

marcogelmi: unfortunately the original version of __sortN also returns the number of swaps, while the new…

ldionneUnsubmitted

Done

Right -- that actually highlights the problem I have with the name chosen (__sort5_internal). There's nothing "internal" about that version of __sort5. Instead, we could rename it to __sort5_maybe_branchless? It's not great, but at least it explains what it does.

ldionne: Right -- that actually highlights the problem I have with the name chosen (`__sort5_internal`).

__enable_if_t<!is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>

__sort5_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

_RandomAccessIterator __x4, _RandomAccessIterator __x5, _Compare __c) {

_VSTD::__sort5(__x1, __x2, __x3, __x4, __x5, __c);

}

// Assumes size > 0 // Assumes size > 0

template <class _Compare, class _BidirectionalIterator> template <class _Compare, class _BidirectionalIterator>

_LIBCPP_CONSTEXPR_AFTER_CXX11 void _LIBCPP_CONSTEXPR_AFTER_CXX11 void

__selection_sort(_BidirectionalIterator __first, _BidirectionalIterator __last, _Compare __comp) __selection_sort(_BidirectionalIterator __first, _BidirectionalIterator __last, _Compare __comp)

{ {

_BidirectionalIterator __lm1 = __last; _BidirectionalIterator __lm1 = __last;

for (--__lm1; __first != __lm1; ++__first) for (--__lm1; __first != __lm1; ++__first)

{ {

Show All 24 Lines

template <class _Compare, class _RandomAccessIterator> template <class _Compare, class _RandomAccessIterator>

void void

__insertion_sort_3(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp) __insertion_sort_3(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp)

{ {

typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type; typedef typename iterator_traits<_RandomAccessIterator>::difference_type difference_type;

typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type; typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;

_RandomAccessIterator __j = __first+difference_type(2); _RandomAccessIterator __j = __first+difference_type(2);

_VSTD::__sort3<_Compare>(__first, __first+difference_type(1), __j, __comp); _VSTD::__sort3_maybe_branchless<_Compare>(__first, __first + difference_type(1), __j, __comp);

for (_RandomAccessIterator __i = __j+difference_type(1); __i != __last; ++__i) for (_RandomAccessIterator __i = __j+difference_type(1); __i != __last; ++__i)

{ {

if (__comp(*__i, *__j)) if (__comp(*__i, *__j))

{ {

value_type __t(_VSTD::move(*__i)); value_type __t(_VSTD::move(*__i));

_RandomAccessIterator __k = __j; _RandomAccessIterator __k = __j;

__j = __i; __j = __i;

do do

Show All 17 Lines __insertion_sort_incomplete(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp)

case 0: case 0:

case 1: case 1:

return true; return true;

case 2: case 2:

if (__comp(*--__last, *__first)) if (__comp(*--__last, *__first))

swap(*__first, *__last); swap(*__first, *__last);

return true; return true;

case 3: case 3:

_VSTD::__sort3<_Compare>(__first, __first+difference_type(1), --__last, __comp); _VSTD::__sort3_maybe_branchless<_Compare>(__first, __first + difference_type(1), --__last, __comp);

return true; return true;

case 4: case 4:

_VSTD::__sort4<_Compare>(__first, __first+difference_type(1), __first+difference_type(2), --__last, __comp); _VSTD::__sort4_maybe_branchless<_Compare>(__first, __first + difference_type(1), __first + difference_type(2),

--__last, __comp);

return true; return true;

case 5: case 5:

_VSTD::__sort5<_Compare>(__first, __first+difference_type(1), __first+difference_type(2), __first+difference_type(3), --__last, __comp); _VSTD::__sort5_maybe_branchless<_Compare>(__first, __first + difference_type(1), __first + difference_type(2),

__first + difference_type(3), --__last, __comp);

return true; return true;

} }

typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type; typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;

_RandomAccessIterator __j = __first+difference_type(2); _RandomAccessIterator __j = __first+difference_type(2);

_VSTD::__sort3<_Compare>(__first, __first+difference_type(1), __j, __comp); _VSTD::__sort3_maybe_branchless<_Compare>(__first, __first + difference_type(1), __j, __comp);

const unsigned __limit = 8; const unsigned __limit = 8;

unsigned __count = 0; unsigned __count = 0;

for (_RandomAccessIterator __i = __j+difference_type(1); __i != __last; ++__i) for (_RandomAccessIterator __i = __j+difference_type(1); __i != __last; ++__i)

{ {

if (__comp(*__i, *__j)) if (__comp(*__i, *__j))

{ {

value_type __t(_VSTD::move(*__i)); value_type __t(_VSTD::move(*__i));

_RandomAccessIterator __k = __j; _RandomAccessIterator __k = __j;

▲ Show 20 Lines • Show All 65 Lines • ▼ Show 20 Lines __restart:

case 0: case 0:

case 1: case 1:

return; return;

case 2: case 2:

if (__comp(*--__last, *__first)) if (__comp(*--__last, *__first))

swap(*__first, *__last); swap(*__first, *__last);

return; return;

case 3: case 3:

_VSTD::__sort3<_Compare>(__first, __first+difference_type(1), --__last, __comp); _VSTD::__sort3_maybe_branchless<_Compare>(__first, __first + difference_type(1), --__last, __comp);

return; return;

case 4: case 4:

--__last, __comp);

return; return;

case 5: case 5:

__first + difference_type(3), --__last, __comp);

return; return;

} }

if (__len <= __limit) if (__len <= __limit)

{ {

_VSTD::__insertion_sort_3<_Compare>(__first, __last, __comp); _VSTD::__insertion_sort_3<_Compare>(__first, __last, __comp);

return; return;

} }

// __len > 5 // __len > 5

if (__depth == 0) if (__depth == 0)

▲ Show 20 Lines • Show All 248 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

Introduce branchless sorting functions for sort3, sort4 and sort5.ClosedPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 407588

libcxx/benchmarks/algorithms.bench.cpp

libcxx/include/__algorithm/sort.h

Introduce branchless sorting functions for sort3, sort4 and sort5.
ClosedPublic