This is an archive of the discontinued LLVM Phabricator instance.

Bug 39129: Speeding up partition_point/lower_bound/upper_bound/ by using unsigned division by 2 when possible.
ClosedPublic

Authored by dyaroshev on Sep 29 2018, 2:35 PM.

Details

Summary

The rational and measurements can be found in the bug description: https://bugs.llvm.org/show_bug.cgi?id=39129

Diff Detail

Event Timeline

dyaroshev created this revision.Sep 29 2018, 2:35 PM

Unfortunately didn't manage to run a clang-format on my changes - it kept ignoring the library settings - so probably the formatting needs to be reviewed too.

Offtopic: compiling and running benchmarks is non-trivial - I had to lurk through Makefiles to realize how to compile them and then there was a link error with file system stuff. Updating the docs would be great.

include/algorithm
3216

If you know a better metaprogramming trick to do this, would be great.

4106

We are not allowed (by default) to use C++11 for this file? A simple lambda would do this much better.

4119

Code duplication between partition_point, lower_bound and upper_bound doesn't seem to have any practical value.

4167

Do you know of a good way to remove code duplication? Move common parts into a base class?

In general, this is the kind of optimization that I would rather see the compiler do (everywhere, automatically) than libc++ (here and there, manually).

benchmarks/algorithms.bench.cpp
90

Seems like we need more than 32-bit ints here.

include/algorithm
4162

Why constexpr after C++11?

test/std/algorithms/alg.modifying.operations/alg.partitions/partition_point.pass.cpp
92 ↗(On Diff #167628)

You can't test the functionality of __difference_type_to_size_t_cast (a libc++-specific type) in the test/std hierarchy.

That's what test/libc++ tests are for.

102 ↗(On Diff #167628)

This needs to be wrapped in a conditional checking to see if we have __int128_t

Thanks for the quick review! I will address you comments tomorrow.

  • In general, this is the kind of optimization that I would rather see the compiler do (everywhere, automatically) than libc++ (here and there, manually).

I don't think it can, can it? The optimization comes from knowing the last - first will never be < 0. How would the compiler know that?

include/algorithm
4162

Mostly to be constexpr in c++17. Do I say 17 instead? Can be contstexpr in 11 no problem

dyaroshev updated this revision to Diff 167651.Sep 30 2018, 9:57 AM
dyaroshev retitled this revision from Bug 39129: Speeding up partition_point/lower_bound/upper_bound by using unsigned difference_type when possible. to Bug 39129: Speeding up partition_point/lower_bound/upper_bound/ by using unsigned division by 2 when possible..
  • Addressed review comments.
  • Changed (I measured - the result performs identically) cast to __half_unsigned function
  • Added similar optimization and benchmarks to equal_range.

(updated measurements can be found in the bug description).

Minor comments typos.

  • In general, this is the kind of optimization that I would rather see the compiler do (everywhere, automatically) than libc++ (here and there, manually).

I don't think it can, can it? The optimization comes from knowing the last - first will never be < 0. How would the compiler know that?

We can tell the compiler that std::distance always returns >= 0.
The optimizer already knows a lot about various standard library calls.

We can tell the compiler that std::distance always returns >= 0.
The optimizer already knows a lot about various standard library calls.

Oh, that could be interesting. Or this could be an example of the contracts based optimization.

You don't think that it's right to merge this patch? The performance wins are pretty nice. There are plenty of other things for optimizer guys to work on, this is a quite specific case.

You don't think that it's right to merge this patch? The performance wins are pretty nice. There are plenty of other things for optimizer guys to work on, this is a quite specific case.

I don't think there's a hurry to merge this.
The performance wins are nice - on a micro scale.
Do you have an example in a real app where this makes a difference?

Note: I'm not disparaging the work you've done.
Nor do I think this isn't worth pursuing.
I'm just looking at other approaches.

I don't think there's a hurry to merge this.
The performance wins are nice - on a micro scale.
Do you have an example in a real app where this makes a difference?

Note: I'm not disparaging the work you've done.
Nor do I think this isn't worth pursuing.
I'm just looking at other approaches.

At my current job I don't run libc++ in production, so I can't give you bigger measurements.

This is the examples I know:
Chromium actively using flat containers which rely on standard binary search: https://cs.chromium.org/chromium/src/base/containers/flat_tree.h?q=flat_tr&sq=package:chromium&g=0&l=904
FB does it too: https://github.com/facebook/folly/blob/master/folly/sorted_vector_types.h#L448
There are some implementations of hash maps that rely on std::lower_bound to find the correct bucket (I just talked to people who do that, I don't have a github link)
There was a few talks about flat_hash_map - they use the lower_bound too: https://github.com/skarupke/flat_hash_map/blob/master/flat_hash_map.hpp#L1218

I, too some extend, don't know what your concern is. It's 30 lines of code that makes code much faster at least on some benchmarks and smaller (I looked at the assembly).
These are highly used and important functions and if not to push for every bit of performance in the standard library, where to?

On the Note:
No worries, I know that you mean well. You have to maintain this staff and libc++ is your responsibility not mine.
I am completely OK with you making the decisions and appreciate you taking the time to review this stuff.

mclow.lists added inline comments.Oct 1 2018, 9:09 AM
include/algorithm
3213

Does make_unsigned do what you want here?

dyaroshev added inline comments.Oct 1 2018, 9:26 AM
include/algorithm
3213

I don't think so. Seems like size_t if faster than unsigned: http://quick-bench.com/hATsToKHx8UP3legquGgvT-4qpU
(this is not only this benchmark, I've seen other cases when it happens.

These are highly used and important functions and if not to push for every bit of performance in the standard library, where to?

Oh, I want the speedup. I'm just exploring other ways of getting it.

These are highly used and important functions and if not to push for every bit of performance in the standard library, where to?

Oh, I want the speedup. I'm just exploring other ways of getting it.

FWTW just adding an assume is *NOT* enough, sadly.
https://godbolt.org/z/v8fjnG

FWTW just adding an assume is *NOT* enough, sadly.
https://godbolt.org/z/v8fjnG

Good thinking though. I checked in libstdc++ (they don't have this performance issue).
They just do right shift instead of casting to unsigned: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L972
I'm not too sure how >> works, but if their way works for you, it looks simpler. (I don't know the concept requirements for difference type, but libstdc++ guys seem to require >> operation - which probably means it's OK).
http://quick-bench.com/9tsZ0ZTUn0bUTy1gPBvFnpYD4oA

My bad, size_t cast is faster: http://quick-bench.com/klty7gya6LMs5p8YVNSQ3bnUXcA

size_t still produces less code: https://godbolt.org/z/OCuCtN

mclow.lists added inline comments.Oct 1 2018, 11:01 AM
include/algorithm
3213

I think we may be talking about different things.

I was talking about the template metafunction std::make_unsigned, which gives you an unsigned type with the same size as the parameter.

From your comment, I get the feeling you are talking about the literal type unsigned.

mclow.lists added inline comments.Oct 1 2018, 11:07 AM
include/algorithm
3213

If I change line 9 in your quickbench example to:

using len_type = std::make_unsigned_t<typename std::iterator_traits<I>::difference_type>;

then the two chunks of code run at the same speed.

dyaroshev added inline comments.Oct 1 2018, 11:34 AM
include/algorithm
3213

Sorry, wasn't clear enough.

Not all unsigned types are created equal: for int - unsigned type would unsigned(32 bit) not size_t - 64 bit. If you use size_t - it's faster, so if we would get int difference type - I'd still like to use size_t.

Realistically, the only difference_type we care about is std::ptrdiff_t, which is 64 bit, so either way the important case works. But, if this is really the only case we care about, we can just specialise for it.

Hi!

Just wanted to know what is the current status of this PR.

EricWF added a subscriber: EricWF.
EricWF added inline comments.
include/algorithm
3199

__value_ with the trailing underscore should only be used for members. And pass by value.

You'll have to forgive the existing code around you that's doing it wrong.

3205

This is always true. It's a quirk of _LIBCPP_STD_VER that it's never less than 11. What you want here is #if !defined(_LIBCPP_CXX03_LANG)

3210

If you really want to make this work in C++03, You should be able to use the super-secret member numeric_limits<Integral>::__max. But this only works because you know you're dealing with a builtin type in this specialization.

3223

I think this comment can be reduced to something like:

// Perform division by two quickly for positive integers (llvm.org/PR39129)

We want to keep our header sizes small

3227

s/Bug 39129/llvm.org/PR39129

3233

s/__value_/__value. And I think we can pass by value here.

3233

Nit on the name:

I would rather it reflected (A) that the number must be positive, and (B) it's fast. The fact that it does it using unsigned numbers is an implementation detail.

3245

Needs _VSTD:: qualifier to defeat ADL.

4105

I would prefer __less_than_t and __less_than had more specific names which reflect how the types/functions bind to comparators/iterators.

That being said, I'm also OK if we scrap this optimization in C++03 and then implement it using C++11 lambdas.

4107

This seems like a job for __compressed_pair.

4110

There is no reason the functor is guaranteed to be compiled out completely, is there?

4115

We could move here.

4235

Needs _VSTD:: qualifier to defeat ADL.

EricWF requested changes to this revision.Oct 12 2018, 9:22 PM

Changing the status to "Changes Requested"

This revision now requires changes to proceed.Oct 12 2018, 9:22 PM

Simplifying the patch after review.

Forgotten _VSTD::

Seems like you are not comfortable with some additional changes that I did. I removed all of them and just left what was 100% necessary - casting to size_t for positive ptrdiff_t.

EricWF added inline comments.Oct 13 2018, 1:46 PM
include/algorithm
754

I think we can use _LIBCPP_CONSTEXPR here and get better behavior in C++03.

760

While this version is certainly simplier, I would still like to have this work for all of the builtin signed integral types.

test/libcxx/algorithms/half_positive.pass.cpp
27

Please make sure this test compiles in C++03, C++11, ect.

You'll want to use TEST_STD_VER from test_macros.h to guard the constexpr tests.

dyaroshev added inline comments.Oct 13 2018, 2:22 PM
include/algorithm
754

Sure, will do.

Why _LIBCPP_CONSTEXPR helps in C++03?

760

Will do.
I tried experimenting with this a bit more, but it's quite hard to tell what divisions are actually faster.

The problem with my previous benchmark is that I cast from 64bit pointer to 32bit to do division.
While the question that I really want to answer is - is it a better to cast 32 bit thingy to 32 unsigned or to 64 bit unsigned to do the division.

Measuring it in isolation seems shady: http://quick-bench.com/Dn1QDyTLbYjH68qDydX3rM6Azko

I tried to write an implementation of binary search that would reliably use 32bit arithmetics, but it doesn't really work out, since the pointer dereference is still 64 bit.

It's also important (I've seen a similar issue) that the unsigned division doesn't occupy an extra register, but definetely not when you measure this in isolation (see line seven in godbolt: https://gcc.godbolt.org/z/Fx5M6l)

My conclusion is:
Are you OK if I just do make_unsigned? Seems to be better than signed all the time but not 100% that it's the best cast.

test/libcxx/algorithms/half_positive.pass.cpp
27

Will do. Can you please point me to how do I compile in C++03 mode?

dyaroshev updated this revision to Diff 169601.Oct 14 2018, 7:22 AM

__half_positve now just uses a cast to make_unsigned and other review issues.

dyaroshev marked 3 inline comments as done.Oct 14 2018, 7:24 AM
dyaroshev added inline comments.
include/algorithm
760

Just used make_unsigned.

test/libcxx/algorithms/half_positive.pass.cpp
27

found in the docs.

Instead of feature tests, used internal macroses.

@EricWF can we maybe merge it now?

Hi again! What is the situation with this PR?

EricWF requested changes to this revision.Oct 25 2018, 11:10 AM
EricWF added inline comments.
include/algorithm
754

Woops. I meant to write "C++11".

My mistake.

test/libcxx/algorithms/half_positive.pass.cpp
32

Other libraries run our test suite, so we can't use internal macros.

Could you please write one set of tests that run at runtime, and another set that uses constexpr and static assert.
And then guard the constexpr tests under #if TEST_STD_VER >= 11.

This revision now requires changes to proceed.Oct 25 2018, 11:10 AM
dyaroshev added inline comments.Oct 25 2018, 11:13 AM
test/libcxx/algorithms/half_positive.pass.cpp
32

I can but how they can run these tests? There are public tests right? And these are private tests.
Other libraries cannot run this test because they do not have __half_positive not because they do not have a macro? Or am I missing smth?

Other than the comment about the test, this LGTM.

test/libcxx/algorithms/half_positive.pass.cpp
32

Good point. These tests are specific to libc++. My mistake.

We still don't use _LIBCPP_CONSTEXPR to declare constexpr variables in tests. It's just not how that's supposed to be used.
Plus, for the constexpr tests you want to use static_assert and not assert.

Testing both runtime and compile time behavior will require a little extra duplication, but it's probably worth it.

(If you're curious here's how I would have written it: https://gist.github.com/EricWF/df7e02ca931075b62caac587b7c0cf27)

dyaroshev updated this revision to Diff 171401.Oct 27 2018, 9:11 AM

Updating tests according to review comments.

dyaroshev added inline comments.Oct 27 2018, 9:12 AM
benchmarks/ordered_set.bench.cpp
86 ↗(On Diff #171401)

Had a compilation error here:

/space/llvm/llvm/projects/libcxx/benchmarks/ordered_set.bench.cpp:37:6: note: candidate function not viable: no known conversion from

'std::vector<size_t>' (aka 'vector<unsigned long>') to 'std::vector<uint64_t> &' (aka 'vector<unsigned long long> &') for 1st argument

I'm much happier with this patch now. It's a lot simpler, and still accomplishes what you want.

EricWF accepted this revision.Oct 27 2018, 10:31 AM

LGTM. Thanks so much for this change.

I'm still quite surprised about how much of a performance difference this make.

This revision is now accepted and ready to land.Oct 27 2018, 10:31 AM

LGTM. Thanks so much for this change.

I'm still quite surprised about how much of a performance difference this make.

My pleasure, thanks for the review.

Yeah, this in my limited experience performance involves a lot of REALY??? moments. Which is kinda fun though)

I'm sorry, how do I merge this patch?

I'm sorry, how do I merge this patch?

Do you have commit access? If not I can go ahead and merge it for you with your permission.

I'm sorry, how do I merge this patch?

Do you have commit access? If not I can go ahead and merge it for you with your permission.

I don't! Please merge it.

EricWF closed this revision.Oct 29 2018, 12:27 PM

Committed as r345525.

Thanks again.

Hi again.

Do you remeber if the difference was this big (5 times): http://quick-bench.com/kguTLHndCQWFuzlxJQT4tJQa_p8 ?
This seems insane

I don't believe that quickbench already has the latest version of the library, because this is the assembly generated
3.69% mov %rdx,%rsi
0.12% mov %rcx,%rdx

shr    $0x3f,%rdx

2.89% add %rcx,%rdx
5.42% sar %rdx
5.06% add $0xffffffffffffffff,%rcx

The last 0xffff ... was exactly what this patch removed.
Can someone please take a look? It doesn't feel right.

FOUND IT! ! difference_type len = _VSTD::distance(first, __last); - should be size_t on the left
I don't know what happened - I ran benchmarks locally before merging!
I'm sorry - no idea why it didn't reproduce on my machine. Maybe I messed something up
I can do the patch tommorow. It will take a while to merge it (judging by this merge experience).
Do you want to maybe roll this back?
http://quick-bench.com/fcV9gaAWPeqB_3-AePOw9H2wxXU

Again, I don't know how this happened, I ran the benchmarks. I'm really sorry.