
Fixing lower bound regression in certain situations.
ClosedPublic

Authored by dyaroshev on Nov 1 2018, 12:52 PM.

Details

Summary

Patch https://reviews.llvm.org/D52697 introduced a regression on certain platforms (for example, on quick-bench: http://quick-bench.com/kguTLHndCQWFuzlxJQT4tJQa_p8).

This patch should fix that regression: http://quick-bench.com/fcV9gaAWPeqB_3-AePOw9H2wxXU

I do not yet know why it behaves this way.
I do think these measurements are correct, since I've tried multiple variations on quick-bench and never stumbled into the opposite result.
On my machine this regression does not reproduce.

As part of the review, please check that you do not run into this issue on your machine.

Related bug: https://bugs.llvm.org/show_bug.cgi?id=39129

UPD: I have a strong suspicion that this is an issue with quick-bench using clang 6: https://gcc.godbolt.org/z/WqhWpj
The same problem does not reproduce with gcc on quick-bench.

UPD2: OK, I did the investigation and I think this was a false alarm - we can recommit the previous patch.

This issue does not arise as a result of miscompilation, though clang 7.0 has significantly better codegen.

Here are reruns of the benchmarks on my machine for different clang versions (3.9, 5.0, 6.0, 7.0):
https://gist.github.com/DenisYaroshevskiy/eac461b1f8c2f80ce52790a056e460a8

The thing that I didn't test when I raised this issue was reimplementing what went into the standard library from scratch and seeing how it performs compared to the std::lower_bound available on quick-bench. Here is what it looks like: http://quick-bench.com/prKysK3i2idH483CrKDJ45CsGuo
(Don't mind that in this measurement Sean Parent's version is slightly faster - it's an artefact of benchmarking, most likely a code-alignment issue. You can remove the std::lower_bound benchmark and see that the new version of lower_bound is faster: http://quick-bench.com/hor6m5_f0Jy2ZCQPcwc94Mj-Ga0)

Actually, what I now think is that the standard library on the quick-bench website is not updated from master. It seemed to me that the difference between std::lower_bound and Sean Parent's version was not that big - but since more than a week has passed, the version with the reverted patch would already have been picked up, so I was wrong about that.
I emailed Frederic Tingaud (the quick-bench author) to confirm this and am waiting for an answer.

Note: we could tweak the code to compile better on clang 6.0 and earlier - but it would make it less readable. I don't think it's worth it: the new patch is still a win, and I'd expect users to move to the new compiler/libc++ roughly in lockstep. Though this is up to you.

Diff Detail

Event Timeline

dyaroshev created this revision.Nov 1 2018, 12:52 PM

I did not participate in the original review, but my preference would be for https://reviews.llvm.org/D52697 to be reverted until the cause of the regression is understood and fixed. Unless other maintainers disagree.

I'm a bit uneasy with checking-in a fix to an issue that is not well understood -- this is the standard library, after all, and we run on all kinds of platforms in all kinds of environments. "It works on my machine" usually does not mean much.

dyaroshev added a comment.EditedNov 1 2018, 2:06 PM

I did not participate in the original review, but my preference would be for https://reviews.llvm.org/D52697 to be reverted until the cause of the regression is understood and fixed. Unless other maintainers disagree.

I'm a bit uneasy with checking-in a fix to an issue that is not well understood -- this is the standard library, after all, and we run on all kinds of platforms in all kinds of environments. "It works on my machine" usually does not mean much.

I'm very much behind you.

Is there some standard way that llvm project measures performance? Could we use that?

dyaroshev edited the summary of this revision. (Show Details)Nov 1 2018, 2:26 PM

I did not participate in the original review, but my preference would be for https://reviews.llvm.org/D52697 to be reverted until the cause of the regression is understood and fixed. Unless other maintainers disagree.

I'm a bit uneasy with checking-in a fix to an issue that is not well understood -- this is the standard library, after all, and we run on all kinds of platforms in all kinds of environments. "It works on my machine" usually does not mean much.

I'm very much behind you.

Is there some standard way that llvm project measures performance? Could we use that?

We have the benchmarks in libcxx, but I don't know whether we have any kind of performance testing infrastructure that would e.g. run those benchmarks under a wide array of configurations and report regressions / improvements. I don't think we do, otherwise I might have heard of it by now.

For simplicity, I did a wholesale revert of https://reviews.llvm.org/D52697 (as r345893) even though that patch included some benchmarks that are probably useful. Feel free to re-submit bits that do not impact the library on their own, but I think the algorithm should only be modified once the regression you found is understood and worked around.

For simplicity, I did a wholesale revert of https://reviews.llvm.org/D52697 (as r345893) even though that patch included some benchmarks that are probably useful. Feel free to re-submit bits that do not impact the library on their own, but I think the algorithm should only be modified once the regression you found is understood and worked around.

Thank you. Now I'll get my sleep back)

dyaroshev updated this revision to Diff 173597.Nov 11 2018, 3:21 PM
dyaroshev edited the summary of this revision. (Show Details)Nov 11 2018, 3:48 PM
dyaroshev updated this revision to Diff 173598.Nov 11 2018, 3:50 PM

Recreated the original patch - as far as I understand, this was a false alarm.
I listed everything I've done to track down and understand the issue in the description of the PR.

dyaroshev edited the summary of this revision. (Show Details)Nov 11 2018, 4:03 PM
ldionne requested changes to this revision.Nov 11 2018, 7:08 PM

I think the previous patch also contained tests for some of the functions you added. Did they disappear?

This revision now requires changes to proceed.Nov 11 2018, 7:08 PM
dyaroshev updated this revision to Diff 173782.Nov 12 2018, 4:19 PM

@ldionne:

Yes, thank you - I missed re-adding the test file.

BTW: Frederic responded - he hasn't updated quick-bench in a while, so my change wasn't picked up there.

Thank you for the quick response.

ldionne requested changes to this revision.Nov 13 2018, 8:29 AM
ldionne added inline comments.
half_positive.pass.cpp
1 ↗(On Diff #173782)

This should not be at the top-level of the repository. It should be in test/libcxx/somewhere

12 ↗(On Diff #173782)

Please remove this line -- it makes it look as though this is a standard-provided algorithm.

This revision now requires changes to proceed.Nov 13 2018, 8:29 AM
dyaroshev updated this revision to Diff 173957.Nov 13 2018, 4:26 PM

@ldionne:

Yes, you are right - another artefact of the patch not applying cleanly, thanks.
FYI: I updated the description in the test a bit too.

Is there a known reason why git doesn't list the 'test' prefix for libcxx/algorithms/half_positive.pass.cpp?

Git diff says:

--- /dev/null
+++ test/libcxx/algorithms/half_positive.pass.cpp

dyaroshev updated this revision to Diff 173960.Nov 13 2018, 4:33 PM

OK - generating the diff without context in git helped.
It is unpleasant to read, though.

ldionne added a reviewer: jfb.Nov 14 2018, 5:12 AM
ldionne added a subscriber: jfb.
ldionne added inline comments.
include/algorithm
3231

I'll admit to being extremely surprised this makes any difference whatsoever. In my experience, Clang always emits the same instruction no matter what: shr. Adding @jfb to take a look. If he's OK with this, I'm OK with this.

dyaroshev added inline comments.Nov 14 2018, 5:48 AM
include/algorithm
3231

That's a great idea; I was as surprised as you are.

I don't think you can just shr - you don't know that the distance is positive: (l - f) doesn't have to be (though std::distance could assert that as a postcondition).

Here are a few ways to find the middle: https://godbolt.org/z/csAbsx
signed, unsigned, and right shift (the last is what libstdc++ does: https://github.com/gcc-mirror/gcc/blob/58fddb25d239007c7a9bff0d0969322d9c1709a8/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L972)

They all produce different codegen.
The unsigned one was the fastest (slightly faster than >>), and I wasn't 100% sure that using >> was OK, while the way this patch does it should be.
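To make the three variants concrete, here is a minimal sketch (illustrative only - not the actual libc++ or libstdc++ source). For a non-negative length all three return the same value, but they historically compiled differently: signed division must allow for negative operands, while the unsigned cast and the right shift reduce to a single shift.

```cpp
#include <cassert>
#include <cstddef>

// Three ways to halve an iterator distance, as discussed above.
// Sketch only - function names are mine, not the library's.

std::ptrdiff_t half_signed(std::ptrdiff_t len) {
    return len / 2;  // compiler must handle len < 0: shift plus a fix-up
}

std::ptrdiff_t half_unsigned(std::ptrdiff_t len) {
    // Cast to the unsigned counterpart first (the approach this patch takes);
    // the division then compiles to a plain logical shift.
    return static_cast<std::size_t>(len) / 2;
}

std::ptrdiff_t half_shift(std::ptrdiff_t len) {
    return len >> 1;  // the libstdc++ approach; only well-behaved for len >= 0
}
```

All three agree whenever len >= 0, which holds for a valid iterator range; the discussion above is purely about which form the optimizer handles best.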

Hi. Just wanted to know what's the status of this change.

Hi. Just wanted to know what's the status of this change.

I just want @jfb to take a quick look and if he's ok with the change, I'll commit it.

Hi. Just wanted to know what's the status of this change.

I just want @jfb to take a quick look and if he's ok with the change, I'll commit it.

Also note that some people in the US are going to be out this week due to Thanksgiving, so it may wait until next week.

Hi. Just wanted to know what's the status of this change.

I just want @jfb to take a quick look and if he's ok with the change, I'll commit it.

Also note that some people in the US are going to be out this week due to Thanksgiving, so it may wait until next week.

Oh, yeah, you're right. Thanks.

Hi again. Just reminding about this review.

jfb added a comment.Dec 4 2018, 10:51 AM

I'm surprised the codegen is so much worse; even with __builtin_assume(l >= f) it's bad. We'll look at improving LLVM's codegen for this; in the meantime, I guess this patch is fine.

ldionne accepted this revision.Dec 4 2018, 11:06 AM

I'll apply the patch.

This revision is now accepted and ready to land.Dec 4 2018, 11:06 AM

@dyaroshev Can you please rebase on top of master? There have been significant changes in algorithms.bench.cpp, and the patch does not apply cleanly anymore.

@dyaroshev Can you please rebase on top of master? There have been significant changes in algorithms.bench.cpp, and the patch does not apply cleanly anymore.

It seems like I need to redo my benchmarks from scratch to match what is now used. It will take me a couple of hours, so probably over the weekend.
Do you mind if I create a separate file, "algorithms.search.bench", and rename this one to "algorithms.sort.bench"?
The input/benchmark generation in algorithms.bench does not seem to be a good fit for binary search.

Also, please take a look: https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp#L45
I'm not sure if it matters, but measuring on 0...n seems like a weird idea - not very representative data.
I would also suggest rewriting it like:
I would also suggest rewriting it like:

V.resize(N);
std::iota(V.begin(), V.end(), 0);

The last time I measured, this was much faster (and less code).
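For context, the rewrite suggested above amounts to something like the following self-contained sketch (the function name is mine, not from the libc++ benchmarks; note that std::iota takes the starting value as a third argument):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Build the sorted sequence 0, 1, ..., n - 1 in one allocation,
// instead of growing the vector with repeated push_back calls.
std::vector<int> make_sorted_sequence(std::size_t n) {
    std::vector<int> v;
    v.resize(n);                       // single allocation
    std::iota(v.begin(), v.end(), 0);  // fill with 0, 1, 2, ...
    return v;
}
```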

ldionne added a subscriber: sbenza.Dec 4 2018, 2:16 PM

@dyaroshev Can you please rebase on top of master? There have been significant changes in algorithms.bench.cpp, and the patch does not apply cleanly anymore.

It seems like I need to redo my benchmarks from scratch to match what is now used. It will take me a couple of hours, so probably on the weekend.

Ugh, sorry about that.

Do you mind if I create a separate file: "algorithms.search.bench" and rename this one in "algorithms.sort.bench"?

No, I think that's a fine idea.

It seems like the input/benchmark generation in algorithms.bench is not a good fit for binary search.

Also, please take a look: https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp#L45
I'm not sure if it matters, but seems like a weird idea to measure on 0...n - not very representative data.
I would also suggest rewriting it like:

V.resize(N);
std::iota(V.begin(), V.end(), 0);

The last time I measured - much faster (and less code).

@sbenza wrote those benchmarks, there might be a reason why he used 0...n.

sbenza added a comment.Dec 4 2018, 2:34 PM

It seems like the input/benchmark generation in algorithms.bench is not a good fit for binary search.

Also, please take a look: https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp#L45
I'm not sure if it matters, but seems like a weird idea to measure on 0...n - not very representative data.
I would also suggest rewriting it like:

V.resize(N);
std::iota(V.begin(), V.end(), 0);

The last time I measured - much faster (and less code).

@sbenza wrote those benchmarks, there might be a reason why he used 0...n.

re resize+iota, I don't see why not. That code is not part of the measurements.

As for why 0...n: for integers it doesn't really matter what the values are, only that there are N distinct values. The initial ordering of those values is changed depending on the benchmark.
Note that for strings we don't use consecutive values, as the value itself matters for operator<. That is, the first mismatched char is the one that matters, and "sequential" strings would not be good for that.

For lower_bound you might want a different sequence, though. You might also want to check for values missing from the list. The way we do it for std::set is to have 0, 2, ..., N; we then search for 2*i for hits and 2*i+1 for misses.
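A minimal sketch of that hit/miss pattern, assuming a sorted vector of even values (names are illustrative, not taken from the libc++ benchmarks):

```cpp
#include <algorithm>
#include <vector>

// Store the even values 0, 2, ..., 2*(n-1): searching for 2*i is always
// a hit, and searching for 2*i + 1 is always a miss.
std::vector<int> make_even_values(int n) {
    std::vector<int> v(n);
    std::generate(v.begin(), v.end(), [k = 0]() mutable { return 2 * k++; });
    return v;
}

// Hit/miss check on the sorted sequence via lower_bound.
bool contains(const std::vector<int>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key);
    return it != v.end() && *it == key;
}
```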

re resize+iota, I don't see why not. That code is not part of the measurements.

As for why 0...n: for integers it doesn't really matter what the values are, only that there are N distinct values. The initial ordering of those values is changed depending on the benchmark.
Note that for strings we don't use consecutive values as the value itself matters for operator<. That is, the first mismatched char is the one that matters and "sequential" strings would not be good for that.

For lower_bound you might want a different sequence, though. You might also want to check for values missing in the list. The way we do it for std::set is to have 0,2,...,N. We then search for 2*i for hits and 2*i+1 for misses.

I see your point about N distinct values, but I'm not entirely sold.
There is a famous bug in quicksort implementations, for example - not partitioning the duplicates.
BTW, I just ran a small benchmark; check out how much duplicates affect the performance of std::sort: http://quick-bench.com/mq8UiBPBT5vmmS7FG-RmY11VriQ

I also suspect that median-of-5-style algorithms and other similar "find a value to partition on" techniques could depend on the distribution - but maybe not.

Anyway, this is minor - I was just a bit confused.

Why do you think it is beneficial for lower_bound to look for "not found" elements? It never checks for equality (it could just call partition_point, but that gets ugly).
I can run these benchmarks, but I'd suggest not committing them if they don't show anything.

Do you have any other suggestions? Popular sizes, 32/64-bit integers? I usually just pick 1000 and something big, but I don't know - maybe there is research on this.

PS:
For strings I would suggest maybe also using URLs of some kind.
Random string comparisons are resolved at the first character, right? But with URLs you often hit a long common prefix.

dyaroshev updated this revision to Diff 177433.Dec 9 2018, 8:24 AM

Recreated the benchmarks in the spirit of the sorting benchmarks.

Seems like a few more results than we actually need, but maybe that's OK.

I haven't tried "not found" elements - fairly confident it doesn't matter, since we never check for equality.

Measurement results (for lower_bound):
1048576 on the previous version seems to trigger a Google Benchmark bug. It happens. (It doesn't reproduce with equal_range.)

-----------------------------------------------------------------------------------------------
                                                                 Before             After
-----------------------------------------------------------------------------------------------
PartitionPointBench_LowerBoundAlg_TestInt32/256                  144 ns            62 ns    
PartitionPointBench_LowerBoundAlg_TestInt32/1024                 193 ns            77 ns    
PartitionPointBench_LowerBoundAlg_TestInt32/1048576              1352 ns          186 ns    
                                                                                            
PartitionPointBench_LowerBoundAlg_TestInt64/256                   143 ns           63 ns    
PartitionPointBench_LowerBoundAlg_TestInt64/1024                  205 ns           77 ns    
PartitionPointBench_LowerBoundAlg_TestInt64/1048576              1440 ns          197 ns    
                                                                                            
PartitionPointBench_LowerBoundAlg_TestUint32/256                  149 ns           63 ns    
PartitionPointBench_LowerBoundAlg_TestUint32/1024                 203 ns           93 ns    
PartitionPointBench_LowerBoundAlg_TestUint32/1048576             1418 ns          186 ns    
                                                                                            
PartitionPointBench_LowerBoundAlg_TestMediumString/256            395 ns          407 ns    
PartitionPointBench_LowerBoundAlg_TestMediumString/1024           498 ns          519 ns    
PartitionPointBench_LowerBoundAlg_TestMediumString/1048576       1370 ns         1286 ns    
-----------------------------------------------------------------------------------------------

re resize+iota, I don't see why not. That code is not part of the measurements.

As for why 0...n: for integers it doesn't really matter what the values are, only that there are N distinct values. The initial ordering of those values is changed depending on the benchmark.
Note that for strings we don't use consecutive values as the value itself matters for operator<. That is, the first mismatched char is the one that matters and "sequential" strings would not be good for that.

For lower_bound you might want a different sequence, though. You might also want to check for values missing in the list. The way we do it for std::set is to have 0,2,...,N. We then search for 2*i for hits and 2*i+1 for misses.

I see your point about N distinct values, but I'm not entirely sold.
There is a famous bug in quicksort implementations, for example - not partitioning the duplicates.
BTW, I just ran a small benchmark; check out how much duplicates affect the performance of std::sort: http://quick-bench.com/mq8UiBPBT5vmmS7FG-RmY11VriQ

There are various input sequences. One of them is all duplicates.
We could have one that is N% duplicates if you think it is useful.

I also suspect that median-of-5-style algorithms and other similar "find a value to partition on" techniques could depend on the distribution - but maybe not.

Anyway, this is minor - I was just a bit confused.

Why do you think it is beneficial for lower_bound to look for "not found" elements?

lower_bound, just like std::set/map will stop the search when they find the element, but will generally continue all the way if it is not there.
The access pattern is a little different when it is all misses.

It never checks for equality (it could just call partition_point, but that gets ugly).
I can run these benchmarks, but I'd suggest not committing them if they don't show anything.

Do you have any other suggestions? Popular sizes, 32/64-bit integers? I usually just pick 1000 and something big, but I don't know - maybe there is research on this.

No idea about recommended sizes. I would assume lower_bound is mostly bound by L1 cache and branch prediction.
It should have behavior similar to std::set, in that the higher "nodes" will be hotter than the leaves.
sort has interesting behavior because it changes algorithms depending on size. Not sure if lower_bound does this. Maybe it switches to linear search for small sizes?

PS:
For strings I would suggest maybe also using URLs of some kind.
Random string comparisons are resolved at the first character, right? But with URLs you often hit a long common prefix.

For strings I just kept the logic that was there before my change.
I agree that strings with common prefixes are a known (common) bad case for lexicographical sorting. URLs, file names, YYYYMMDD dates, etc. tend to have largish common prefixes.
IMO, the point of the string benchmarks is to compare a cheap-to-compare-and-swap type vs. an expensive-to-compare-and-swap type. Extra compares/swaps have a larger effect in the string benchmark.

re resize+iota, I don't see why not. That code is not part of the measurements.

Why do you think it is beneficial for lower_bound to look for "not found" elements?

lower_bound, just like std::set/map will stop the search when they find the element, but will generally continue all the way if it is not there.
The access pattern is a little different when it is all misses.

There is no check for a found element in lower_bound. It always goes to N == 0 regardless (which is a good thing, since we do just one predicate invocation per iteration of the loop).
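The textbook lower_bound loop makes that point visible: the search always narrows the length down to 0, with exactly one comparison per iteration and no early-exit equality check (a sketch only, not the libc++ source):

```cpp
#include <iterator>

// Classic binary-search lower bound: returns the first position whose
// element is not less than value.
template <class ForwardIt, class T>
ForwardIt my_lower_bound(ForwardIt first, ForwardIt last, const T& value) {
    auto len = std::distance(first, last);
    while (len != 0) {                        // runs until len == 0, hit or miss
        auto half = len / 2;
        ForwardIt mid = std::next(first, half);
        if (*mid < value) {                   // the single predicate invocation
            first = std::next(mid);
            len -= half + 1;
        } else {
            len = half;
        }
    }
    return first;
}
```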

It never checks for equality (it could just call partition_point, but that gets ugly).
I can run these benchmarks, but I'd suggest not committing them if they don't show anything.

Do you have any other suggestions? Popular sizes, 32/64-bit integers? I usually just pick 1000 and something big, but I don't know - maybe there is research on this.

No idea about recommended sizes. I would assume lower_bound is mostly bound by L1 cache and branch prediction.
It should have a behavior similar to std::set in that the higher "nodes" will be hotter than the leaves.
sort has interesting behavior because it changes algorithms depending on size. Not sure if lower_bound does this. Maybe it switches to linear search for small sizes?

PS:
For strings I would suggest maybe also using URLs of some kind.
Random string comparisons are resolved at the first character, right? But with URLs you often hit a long common prefix.

For strings I just kept the logic that was there before my change.
I agree that strings with common prefixes are a known (common) bad case for lexicographical sorting. URLs, file names, YYYYMMDD dates, etc. tend to have largish common prefixes.
IMO, the point of the string benchmarks is to compare a cheap-to-compare-and-swap type vs. an expensive-to-compare-and-swap type. Extra compares/swaps have a larger effect in the string benchmark.

OK - to be honest, I don't know enough about writing benchmarks to have an opinion.

@ldionne - can we please review/merge this? Christmas is coming and I won't have a computer where I can make changes for quite some time.

@ldionne this weekend is an ideal time for me to fix any review comments you might have.

EricWF accepted this revision.Dec 14 2018, 10:09 PM

LGTM.

Do you have commit access?

LGTM.

Do you have commit access?

No, I don't. Can you merge this for me?

ldionne accepted this revision.Dec 17 2018, 8:07 AM

Sorry for the silence. LGTM too. I applied it as r349358.

ldionne closed this revision.Dec 17 2018, 8:08 AM

Hooray! Thanks!