Page MenuHomePhabricator

Introduce branchless sorting functions for sort3, sort4 and sort5.
ClosedPublic

Authored by marcogelmi on Jan 24 2022, 3:50 AM.

Details

Summary

We are introducing branchless variants for sort3, sort4 and sort5.
These sorting functions have been generated using Reinforcement
Learning and aim to replace sort3, sort4 and __sort5 variants
for integral types.

The libc++ benchmarks were run on isolated machines for Skylake, ARM and
AMD architectures and achieve statistically significant improvement in
sorting random integers on test cases from sort1 to sort262144 for
uint32 and uint64.

A full performance overview for Intel Skylake, AMD and Arm can be
found here: https://bit.ly/3AtesYf

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
  • rename sortN_internal to sortN_maybe_branchless
  • use _LIBCPP_HIDE_FROM_ABI
marcogelmi marked an inline comment as done.Feb 10 2022, 9:51 AM
marcogelmi added inline comments.
libcxx/include/__algorithm/sort.h
128

sure, here they are: https://bit.ly/34Q6z3q

libcxx/include/__algorithm/sort.h
134

Either call _VSTD::__cond_swap<_Compare>(...) below, or (less good IMHO) change it up here to take _Compare& instead of _Compare. Otherwise we end up making copies of the predicate — and I would have expected that to fail libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp! Could you also find out why that test isn't failing, and add a regression test to it? (I imagine it's just not taking the offending codepath because of the shape of the input data values, so adding a few extra test cases of the correct shape will solve the problem.)

Ditto __partially_sorted_swap.

187–191

Nit: It's more libc++ style to use enable_if_t<>* = nullptr than return type SFINAE; I'd say the advantages are readability and that return type SFINAE greatly increases the length of the mangled name while an extra void* type parameter doesn't so much.
Drive-by s/typename/class/ and whitespace.
Likewise below.

add regression test for comparator copies of ints

fix template issues

additional fixes for missing temmplates

fix c++03 test

marcogelmi marked an inline comment as done.Feb 11 2022, 11:19 AM
marcogelmi added inline comments.
libcxx/include/__algorithm/sort.h
187–191

I'm not entirely sure why, but this seems to be failing for the c++03 build (it works fine for the others):, e.g. https://reviews.llvm.org/harbormaster/unit/view/2520185/

candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,

revert small, unintended test change

libcxx/benchmarks/algorithms.bench.cpp
27–28

It might be (long past) time to change this to simply

using Types = std::tuple<
    uint32_t, uint64_t, std::pair<uint32_t, uint64_t>, std::tuple<uint32_t, uint64_t, uint32_t>,
    std::string, float
>;

template<class V>
using Value = std::tuple_element_t<(int)V(), Types>;
libcxx/include/__algorithm/sort.h
152

sorry what line of code is this referring to?

My best guess is that I misread line 153

__sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,

as a function call. :) No action required.

160

FWIW, I think it would be worth pulling out the condition into a named trait, like

template<class _Tp>
using __use_branchless_sort = bool_constant<
  sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
>;

All this focus on value_type has recently made me paranoid about vector<bool> and proxy iterators in general. I see your __partially_sorted_swap doing like 13 calls to operator*; are we sure that's still going to be efficient when the _RandomAccessIterator in question is an arbitrarily complicated Ranges thing? (We don't support ranges::sort yet, but it would be unfortunate if we poured all this effort into std::sort and then had to re-evaluate the entire thing for ranges::sort.)

Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to contiguous iterators?

template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
using __use_branchless_sort = bool_constant<
  __is_cpp17_contiguous_iterator<_Iter> &&
  sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
>;

Also note I keep trying to expand arithmetic to scalar. :)

187–191

Yuck, I see that on Godbolt (C++03 is unhappy with giving a non-type template parameter a default value). So I guess the return type SFINAE is fine.

libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp
192

IIUC, what you've done here is just add a call to std::sort with value_type=int, which triggers your new thing, specifically because your new thing is constrained to arithmetic value_types instead of (as I keep trying to drive-by it into) scalar value_types. (Pointer types are scalar but not arithmetic.)

Is this the only change that was needed in order to trigger your comparator-copying codepath?
If so, I'd slightly rather see you just expand your new thing to include scalar types!

I don't really object to the idea of templatizing the whole thing on T and testing with different Ts; but it feels like a lot of churn if the only culprit was arithmetic/scalar, and also I'm skeptical that that was the only culprit here.

I had thought that the problem was more about the shape of the input data — I mean, notice that is_sorted(first, last) is already true going into this call.

In particular, if we now know that there are different codepaths taken for arrays of length 3, 4, 5, then shouldn't we be testing here

(void)std::sort(first, first+3, Less(&copies)); assert(copies == 0);
(void)std::sort(first, first+4, Less(&copies)); assert(copies == 0);
(void)std::sort(first, first+5, Less(&copies)); assert(copies == 0);
(void)std::sort(first, last, Less(&copies)); assert(copies == 0);

?

address style comments, use named trait

marcogelmi marked 5 inline comments as done.Feb 14 2022, 9:40 AM

Thanks for the suggestions.

Regarding is_scalar vs is_arithmetic: expanding to is_scalar would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator. This is also not trivial to do with the current test setup.

Can we therefore first submit this CL and then look into the feasibility of incorporating those additional types into our setup?

libcxx/benchmarks/algorithms.bench.cpp
27–28

note that I'm using clang-format and it's changing the format slightly. Also, I had to change to a plain enum

libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp
192

yes this is triggered by int specifically because void* is not included by is_arithmetic

the reason why this is triggered even if the shape of the data is of length 10 is because there is a base case (insertion_sort_3) which calls __sort3 once and is invoked when 5 < size <= 30.

Anyway, I've added the test cases you recommended.

marcogelmi marked 2 inline comments as done.Feb 15 2022, 2:06 AM
marcogelmi added inline comments.
libcxx/include/__algorithm/sort.h
127

bool_constant seems to need >= c++14, any suggestion on how to make this work for c++03?

philnik added inline comments.Feb 15 2022, 2:09 AM
libcxx/include/__algorithm/sort.h
127

use _LIBCPP_BOOL_CONSTANT

use _LIBCPP_BOOL_CONSTANT instead of bool_constant

rebase to head

hi @Quuxplusone, @ldionne just pinging this thread: what do you think of the current state of the diff?

FWIW, I have no strong desire to be obstructionist, and I definitely don't think we should abandon this direction, but I have one bad and one good(?) reason I think we should hold this PR in suspended animation for a while longer:

  • Bad: I'm still uncomfortable with the invasiveness of the changes to robust_against_copying_comparators.pass.cpp.
  • Good(?): Very soon now we're going to have to implement ranges::sort, which innovates in a bunch of different directions — we're going to have to deal with sentinels as well as iterators, and we're going to have to replace __c(*x, *y) with invoke(__c, invoke(__p, *x), invoke(__p, *y), and so on. I feel like it would be a good idea to get that stuff out of the way first, and then revisit algorithmic changes like this PR, rather than vice versa. Unfortunately I have no particular estimate of when ranges::sort will get done; @philnik might have a long-range plan to share, but being as we're still down in the foothills of ranges::min_element etc., personally I'd predict it'll be a couple months at least.

Anyway, @ldionne will be the ultimate arbiter (but IIRC he's on vacation this week).

I'll also keep recommending that you replace is_arithmetic with is_scalar, until you actually do it! or explain why not. :)

libcxx/benchmarks/algorithms.bench.cpp
27–28

For my curiosity, why did you have to change to a plain enum? (All else being equal, we should prefer enum class, but if it really technically doesn't work, then I don't imagine it's a big deal.)

FWIW, I have no strong desire to be obstructionist, and I definitely don't think we should abandon this direction, but I have one bad and one good(?) reason I think we should hold this PR in suspended animation for a while longer:

  • Bad: I'm still uncomfortable with the invasiveness of the changes to robust_against_copying_comparators.pass.cpp.
  • Good(?): Very soon now we're going to have to implement ranges::sort, which innovates in a bunch of different directions — we're going to have to deal with sentinels as well as iterators, and we're going to have to replace __c(*x, *y) with invoke(__c, invoke(__p, *x), invoke(__p, *y), and so on. I feel like it would be a good idea to get that stuff out of the way first, and then revisit algorithmic changes like this PR, rather than vice versa. Unfortunately I have no particular estimate of when ranges::sort will get done; @philnik might have a long-range plan to share, but being as we're still down in the foothills of ranges::min_element etc., personally I'd predict it'll be a couple months at least.

Anyway, @ldionne will be the ultimate arbiter (but IIRC he's on vacation this week).

I'll also keep recommending that you replace is_arithmetic with is_scalar, until you actually do it! or explain why not. :)

I think ranges::sort is one of the last algorithms on the list. Unlike the ones I'm working on currently, the most naive implementation isn't the fastest. So it will be much more work, and even the ones currently under review take a long time, since we don't have many copy/paste-able tests currently. Assuming we will get faster I'd guess the earliest I would start to implement sort is in 4-6 months. I wouldn't be surprised if sort alone will be under review for a few months.

marcogelmi marked an inline comment as done.Feb 22 2022, 2:53 AM

FWIW, I have no strong desire to be obstructionist, and I definitely don't think we should abandon this direction, but I have one bad and one good(?) reason I think we should hold this PR in suspended animation for a while longer:

  • Bad: I'm still uncomfortable with the invasiveness of the changes to robust_against_copying_comparators.pass.cpp.
  • Good(?): Very soon now we're going to have to implement ranges::sort, which innovates in a bunch of different directions — we're going to have to deal with sentinels as well as iterators, and we're going to have to replace __c(*x, *y) with invoke(__c, invoke(__p, *x), invoke(__p, *y), and so on. I feel like it would be a good idea to get that stuff out of the way first, and then revisit algorithmic changes like this PR, rather than vice versa. Unfortunately I have no particular estimate of when ranges::sort will get done; @philnik might have a long-range plan to share, but being as we're still down in the foothills of ranges::min_element etc., personally I'd predict it'll be a couple months at least.

Anyway, @ldionne will be the ultimate arbiter (but IIRC he's on vacation this week).

I'll also keep recommending that you replace is_arithmetic with is_scalar, until you actually do it! or explain why not. :)

Regarding is_arithmetic vs is_scalar: expanding to is_scalar would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator. This is also not trivial to do with the current test setup. Can we therefore first submit this CL and then look into the feasibility of incorporating those additional types into our setup?

libcxx/benchmarks/algorithms.bench.cpp
27–28

this is just because we need to cast it to int to index into the Types tuple and that's more tricky with an enum class.

libcxx/benchmarks/algorithms.bench.cpp
27–28

Ahhh, I see now. There's a pair of parens missing from my original code:

using Value = std::tuple_element_t<(int)V(), Types>;

should have been either

using Value = std::tuple_element_t<(int)V()(), Types>;  // or, better,
using Value = std::tuple_element_t<(int)V::value, Types>;

Use either of those instead, please. (Preferably the latter.)

libcxx/include/__algorithm/sort.h
160

Regarding is_arithmetic vs is_scalar: expanding to is_scalar would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.

Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like

int is[1000];
Widget *ps[1000];
std::sort(is, is+1000, std::less<>);  // A
std::sort(ps, ps+1000, std::less<>);  // B

but poorly on complicated predicates like

Widget ws[10];
std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; });  // C
std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; });  // D

That is, you expect that this optimization, applied across the board, would measurably speed up A and B, but measurably slow down C and D. And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much less common than D. So, your patch intentionally applies only to A and C, and excludes B and D. Is that all correct?
I do still think it'd be interesting to find out whether this patch slows down C and D.

Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 11:10 AM
marcogelmi updated this revision to Diff 413004.Mar 4 2022, 7:25 AM
marcogelmi marked an inline comment as done.

revert to using enum class in tests

marcogelmi marked an inline comment as done.Mar 4 2022, 7:33 AM
marcogelmi added inline comments.
libcxx/include/__algorithm/sort.h
160

You are correct. Our patch applies to A and B and we are pretty certain that our changes will slow down C and D, because our branchless version calls the comparator more times in the average case. This is why we have limited ourselves to A and B. Note that our patch not only applies to ints but also has similar performance improvements for floats. @ldionne it would be great to get your thoughts on the patch as to whether we can submit as is? Thanks in advance.

hi @ldionne just wondering if you had time to review the latest changes? do you think it can be submitted as is? thanks again!

Thanks for the ping! I still have some comments, I'm still trying to make sure I understand all the implications of this change since std::sort is such an important (and non-trivial as we can see) algorithm!

libcxx/benchmarks/algorithms.bench.cpp
27
libcxx/include/__algorithm/sort.h
118

Commenting here for lack of better place: With this patch, I think we need to add tests for std::sort for a random access but non-contiguous range, otherwise we may end up testing. And for sorting on non arithmetic types, too. It looks like our std::sort tests might be lacking quite a bit, now that I look at them.

127

I would just go for integral_constant<bool, EXPR>. I actually had to look up _LIBCPP_BOOL_CONSTANT, and IMO it does not pull its weight. I'll upload a patch to remove it and we can discuss its fate there.

154–155

We usually use the following idiom for SFINAE (this is supported in C++03 as a Clang extension):

template <class _Compare, class _RandomAccessIterator, __enable_if_t<
    __use_branchless_sort<_RandomAccessIterator>::value
>* = 0>
inline _LIBCPP_HIDE_FROM_ABI
void __sort3_maybe_branchless(...) { ... }

It has the benefit that the return type is not hidden in the __enable_if_t.

160

Let me just try to state this in the most naive way possible to make sure I'm on the same page as you folks:

Basically, you assume that sorts on arithmetic types over contiguous ranges will not be using complicated comparators, and that you'll save more by avoiding branches and performing more calls to compare than reducing the number of calls to compare but having more branches. However, if we have a contiguous range of arithmetic types but the compare is extremely expensive, this optimization will actually be a pessimization over the naive approach. Is that correct?

If that is correct, then basically in an ideal world, what you'd want would be to enable this optimization only when you know that the cost of the comparator is negligible compared to the cost of branching, since that's what you're reducing with your approach. Of course, that is difficult if not impossible to tell. One approach would be to enable this optimization only when the comparator is something trivial like std::less and friends, but IDK if that would prevent this optimization in too many cases.

Enable branchless sort for simple comparators only. Increase test coverage.

fix test errors for c++03

marcogelmi marked an inline comment as done.Mar 31 2022, 5:01 AM

Thanks for your in-depth feedback!

libcxx/include/__algorithm/sort.h
118

I've added tests for non-contiguous iterators (using a std::deque<int>) and non-arithmetic types (using a std::pair<int, int>). Let me know if there are other tests I'm missing. I've also added more checks using std::is_permutation because the current tests only check for std::is_sorted

154–155

this was mentioned by another reviewer but if I try using this idiom I get failures for c++03:

candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,
160

Yes your understanding is correct. After discussing your feedback amongst ourselves, we think that you had a great suggestion to enable our patch for simple comparators.

Note that I've had to enable it for __less<_Tp>& because that seems to be the default when nothing is passed in. Also, the comparator seems to be transformed into a reference using __comp_ref_type so that's why it's templated on the reference type.

ldionne accepted this revision.Apr 6 2022, 10:45 AM

Thanks a lot! I think this is now a no-brainer since we've made this more conservative (it only triggers on a few comparators that we know are going to benefit from it).

Please ship this once you've addressed my comments and the CI is passing. If you need help committing this, please provide Author Name <email@domain> and we can commit it on your behalf. Thanks again for the patch and for your patience+collaboration going through the review!

libcxx/test/std/algorithms/alg.sorting/alg.sort/sort/sort.pass.cpp
52–56

Instead of specializing the function template, can you instead add an overload?

inline void set_value(std::pair<int, int>& dest, int value)
{
    dest.first = value;
    dest.second = value;
}
74

Please don't use reserved identifiers in the tests. This should be class Constainer instead of class _Container.

98

Same comment, overload instead of specialization please :-)

113

Should be Iter, not _Iter.

We only use _Foo and __foo in the source code of the library itself because it ensures that no users can possibly define a macro with a name that conflicts with our identifiers. Inside our test code, this isn't needed (or desirable since it obscures everything).

137

I like that we're adding this coverage!

This revision is now accepted and ready to land.Apr 6 2022, 10:45 AM
marcogelmi updated this revision to Diff 421157.Apr 7 2022, 4:22 AM

address review comments on tests style

marcogelmi marked 4 inline comments as done.Apr 7 2022, 7:22 AM

Thanks everyone for your comments and improvements to this patch!

I don't think I have committer rights so if you could commit this on my behalf that would be great. The attribution should be: Marco Gelmi <marcogelmi@deepmind.com>

Thanks again!