The rational and measurements can be found in the bug description: https://bugs.llvm.org/show_bug.cgi?id=39129
Details
Diff Detail
Event Timeline
Unfortunately didn't manage to run a clang-format on my changes - it kept ignoring the library settings - so probably the formatting needs to be reviewed too.
Offtopic: compiling and running benchmarks is non-trivial - I had to lurk through Makefiles to realize how to compile them and then there was a link error with file system stuff. Updating the docs would be great.
include/algorithm | ||
---|---|---|
3216 | If you know a better metaprogramming trick to do this, would be great. | |
4106 | We are not allowed (by default) to use C++11 for this file? A simple lambda would do this much better. | |
4119 | Code duplication between partition_point, lower_bound and upper_bound doesn't seem to have any practical value. | |
4167 | Do you know of a good way to remove code duplication? Move common parts into a base class? |
In general, this is the kind of optimization that I would rather see the compiler do (everywhere, automatically) than libc++ (here and there, manually).
benchmarks/algorithms.bench.cpp | ||
---|---|---|
90 | Seems like we need more than 32-bit ints here. | |
include/algorithm | ||
4162 | Why constexpr after C++11? | |
test/std/algorithms/alg.modifying.operations/alg.partitions/partition_point.pass.cpp | ||
92 ↗ | (On Diff #167628) | You can't test the functionality of __difference_type_to_size_t_cast (a libc++-specific type) in the test/std hierarchy. That's what test/libc++ tests are for. |
102 ↗ | (On Diff #167628) | This needs to be wrapped in a conditional checking to see if we have __int128_t |
Thanks for the quick review! I will address you comments tomorrow.
- In general, this is the kind of optimization that I would rather see the compiler do (everywhere, automatically) than libc++ (here and there, manually).
I don't think it can, can it? The optimization comes from knowing the last - first will never be < 0. How would the compiler know that?
include/algorithm | ||
---|---|---|
4162 | Mostly to be constexpr in c++17. Do I say 17 instead? Can be contstexpr in 11 no problem |
- Addressed review comments.
- Changed (I measured - the result performs identically) cast to __half_unsigned function
- Added similar optimization and benchmarks to equal_range.
(updated measurements can be found in the bug description).
We can tell the compiler that std::distance always returns >= 0.
The optimizer already knows a lot about various standard library calls.
Oh, that could be interesting. Or this could be an example of the contracts based optimization.
You don't think that it's right to merge this patch? The performance wins are pretty nice. There are plenty of other things for optimizer guys to work on, this is a quite specific case.
I don't think there's a hurry to merge this.
The performance wins are nice - on a micro scale.
Do you have an example in a real app where this makes a difference?
Note: I'm not disparaging the work you've done.
Nor do I think this isn't worth pursuing.
I'm just looking at other approaches.
At my current job I don't run libc++ in production, so I can't give you bigger measurements.
This is the examples I know:
Chromium actively using flat containers which rely on standard binary search: https://cs.chromium.org/chromium/src/base/containers/flat_tree.h?q=flat_tr&sq=package:chromium&g=0&l=904
FB does it too: https://github.com/facebook/folly/blob/master/folly/sorted_vector_types.h#L448
There are some implementations of hash maps that rely on std::lower_bound to find the correct bucket (I just talked to people who do that, I don't have a github link)
There was a few talks about flat_hash_map - they use the lower_bound too: https://github.com/skarupke/flat_hash_map/blob/master/flat_hash_map.hpp#L1218
I, too some extend, don't know what your concern is. It's 30 lines of code that makes code much faster at least on some benchmarks and smaller (I looked at the assembly).
These are highly used and important functions and if not to push for every bit of performance in the standard library, where to?
On the Note:
No worries, I know that you mean well. You have to maintain this staff and libc++ is your responsibility not mine.
I am completely OK with you making the decisions and appreciate you taking the time to review this stuff.
include/algorithm | ||
---|---|---|
3213 | Does make_unsigned do what you want here? |
include/algorithm | ||
---|---|---|
3213 | I don't think so. Seems like size_t if faster than unsigned: http://quick-bench.com/hATsToKHx8UP3legquGgvT-4qpU |
These are highly used and important functions and if not to push for every bit of performance in the standard library, where to?
Oh, I want the speedup. I'm just exploring other ways of getting it.
Good thinking though. I checked in libstdc++ (they don't have this performance issue).
They just do right shift instead of casting to unsigned: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L972
I'm not too sure how >> works, but if their way works for you, it looks simpler. (I don't know the concept requirements for difference type, but libstdc++ guys seem to require >> operation - which probably means it's OK).
http://quick-bench.com/9tsZ0ZTUn0bUTy1gPBvFnpYD4oA
My bad, size_t cast is faster: http://quick-bench.com/klty7gya6LMs5p8YVNSQ3bnUXcA
size_t still produces less code: https://godbolt.org/z/OCuCtN
include/algorithm | ||
---|---|---|
3213 | I think we may be talking about different things. I was talking about the template metafunction std::make_unsigned, which gives you an unsigned type with the same size as the parameter. From your comment, I get the feeling you are talking about the literal type unsigned. |
include/algorithm | ||
---|---|---|
3213 | If I change line 9 in your quickbench example to: using len_type = std::make_unsigned_t<typename std::iterator_traits<I>::difference_type>; then the two chunks of code run at the same speed. |
include/algorithm | ||
---|---|---|
3213 | Sorry, wasn't clear enough. Not all unsigned types are created equal: for int - unsigned type would unsigned(32 bit) not size_t - 64 bit. If you use size_t - it's faster, so if we would get int difference type - I'd still like to use size_t. Realistically, the only difference_type we care about is std::ptrdiff_t, which is 64 bit, so either way the important case works. But, if this is really the only case we care about, we can just specialise for it. |
include/algorithm | ||
---|---|---|
3199 | __value_ with the trailing underscore should only be used for members. And pass by value. You'll have to forgive the existing code around you that's doing it wrong. | |
3205 | This is always true. It's a quirk of _LIBCPP_STD_VER that it's never less than 11. What you want here is #if !defined(_LIBCPP_CXX03_LANG) | |
3210 | If you really want to make this work in C++03, You should be able to use the super-secret member numeric_limits<Integral>::__max. But this only works because you know you're dealing with a builtin type in this specialization. | |
3223 | I think this comment can be reduced to something like: // Perform division by two quickly for positive integers (llvm.org/PR39129) We want to keep our header sizes small | |
3227 | s/Bug 39129/llvm.org/PR39129 | |
3233 | s/__value_/__value. And I think we can pass by value here. | |
3233 | Nit on the name: I would rather it reflected (A) that the number must be positive, and (B) it's fast. The fact that it does it using unsigned numbers is an implementation detail. | |
3245 | Needs _VSTD:: qualifier to defeat ADL. | |
4105 | I would prefer __less_than_t and __less_than had more specific names which reflect how the types/functions bind to comparators/iterators. That being said, I'm also OK if we scrap this optimization in C++03 and then implement it using C++11 lambdas. | |
4107 | This seems like a job for __compressed_pair. | |
4110 | There is no reason the functor is guaranteed to be compiled out completely, is there? | |
4115 | We could move here. | |
4235 | Needs _VSTD:: qualifier to defeat ADL. |
Seems like you are not comfortable with some additional changes that I did. I removed all of them and just left what was 100% necessary - casting to size_t for positive ptrdiff_t.
include/algorithm | ||
---|---|---|
754 | I think we can use _LIBCPP_CONSTEXPR here and get better behavior in C++03. | |
760 | While this version is certainly simplier, I would still like to have this work for all of the builtin signed integral types. | |
test/libcxx/algorithms/half_positive.pass.cpp | ||
28 | Please make sure this test compiles in C++03, C++11, ect. You'll want to use TEST_STD_VER from test_macros.h to guard the constexpr tests. |
include/algorithm | ||
---|---|---|
754 | Sure, will do. Why _LIBCPP_CONSTEXPR helps in C++03? | |
760 | Will do. The problem with my previous benchmark is that I cast from 64bit pointer to 32bit to do division. Measuring it in isolation seems shady: http://quick-bench.com/Dn1QDyTLbYjH68qDydX3rM6Azko I tried to write an implementation of binary search that would reliably use 32bit arithmetics, but it doesn't really work out, since the pointer dereference is still 64 bit. It's also important (I've seen a similar issue) that the unsigned division doesn't occupy an extra register, but definetely not when you measure this in isolation (see line seven in godbolt: https://gcc.godbolt.org/z/Fx5M6l) My conclusion is: | |
test/libcxx/algorithms/half_positive.pass.cpp | ||
28 | Will do. Can you please point me to how do I compile in C++03 mode? |
include/algorithm | ||
---|---|---|
754 | Woops. I meant to write "C++11". My mistake. | |
test/libcxx/algorithms/half_positive.pass.cpp | ||
31 | Other libraries run our test suite, so we can't use internal macros. Could you please write one set of tests that run at runtime, and another set that uses constexpr and static assert. |
test/libcxx/algorithms/half_positive.pass.cpp | ||
---|---|---|
31 | I can but how they can run these tests? There are public tests right? And these are private tests. |
Other than the comment about the test, this LGTM.
test/libcxx/algorithms/half_positive.pass.cpp | ||
---|---|---|
31 | Good point. These tests are specific to libc++. My mistake. We still don't use _LIBCPP_CONSTEXPR to declare constexpr variables in tests. It's just not how that's supposed to be used. Testing both runtime and compile time behavior will require a little extra duplication, but it's probably worth it. (If you're curious here's how I would have written it: https://gist.github.com/EricWF/df7e02ca931075b62caac587b7c0cf27) |
benchmarks/ordered_set.bench.cpp | ||
---|---|---|
86 ↗ | (On Diff #171401) | Had a compilation error here: /space/llvm/llvm/projects/libcxx/benchmarks/ordered_set.bench.cpp:37:6: note: candidate function not viable: no known conversion from 'std::vector<size_t>' (aka 'vector<unsigned long>') to 'std::vector<uint64_t> &' (aka 'vector<unsigned long long> &') for 1st argument |
I'm much happier with this patch now. It's a lot simpler, and still accomplishes what you want.
LGTM. Thanks so much for this change.
I'm still quite surprised about how much of a performance difference this make.
My pleasure, thanks for the review.
Yeah, this in my limited experience performance involves a lot of REALY??? moments. Which is kinda fun though)
Do you have commit access? If not I can go ahead and merge it for you with your permission.
Hi again.
Do you remeber if the difference was this big (5 times): http://quick-bench.com/kguTLHndCQWFuzlxJQT4tJQa_p8 ?
This seems insane
I don't believe that quickbench already has the latest version of the library, because this is the assembly generated
3.69% mov %rdx,%rsi
0.12% mov %rcx,%rdx
shr $0x3f,%rdx
2.89% add %rcx,%rdx
5.42% sar %rdx
5.06% add $0xffffffffffffffff,%rcx
The last 0xffff ... was exactly what this patch removed.
Can someone please take a look? It doesn't feel right.
FOUND IT! ! difference_type len = _VSTD::distance(first, __last); - should be size_t on the left
I don't know what happened - I ran benchmarks locally before merging!
I'm sorry - no idea why it didn't reproduce on my machine. Maybe I messed something up
I can do the patch tommorow. It will take a while to merge it (judging by this merge experience).
Do you want to maybe roll this back?
http://quick-bench.com/fcV9gaAWPeqB_3-AePOw9H2wxXU
Again, I don't know how this happened, I ran the benchmarks. I'm really sorry.
Seems like we need more than 32-bit ints here.