This is an archive of the discontinued LLVM Phabricator instance.

Bug 42208: speeding up std::merge
Needs Review · Public

Authored by dyaroshev on Jun 9 2019, 12:59 PM.

Details

Summary

Benchmarking results:

Before:

----------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations
----------------------------------------------------------------------------------------
MergeBench_MergeAlg_TestInt32/512                  1848 ns         1846 ns       378311
MergeBench_MergeAlg_TestInt32/2048                27005 ns        26989 ns        25576
MergeBench_MergeAlg_TestInt32/2097152          33327173 ns     33314619 ns           21
MergeBench_MergeAlg_TestInt64/512                  1702 ns         1701 ns       407486
MergeBench_MergeAlg_TestInt64/2048                26351 ns        26340 ns        26949
MergeBench_MergeAlg_TestInt64/2097152          34492582 ns     34484900 ns           20
MergeBench_MergeAlg_TestUint32/512                 1755 ns         1755 ns       379830
MergeBench_MergeAlg_TestUint32/2048               25971 ns        25963 ns        26655
MergeBench_MergeAlg_TestUint32/2097152         37003864 ns     35490619 ns           21
MergeBench_MergeAlg_TestMediumString/512         234988 ns       234489 ns         2641
MergeBench_MergeAlg_TestMediumString/2048       1120958 ns      1062598 ns          615
MergeBench_MergeAlg_TestMediumString/2097152 2595634493 ns   2590478000 ns            1

After:

----------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations
----------------------------------------------------------------------------------------
MergeBench_MergeAlg_TestInt32/512                  1396 ns         1395 ns       485652  (25% speedup)
MergeBench_MergeAlg_TestInt32/2048                15691 ns        15682 ns        43672  (42% speedup)
MergeBench_MergeAlg_TestInt32/2097152          30340879 ns     30329130 ns           23  (9% speedup)
MergeBench_MergeAlg_TestInt64/512                  1567 ns         1566 ns       440998  (9% speedup)
MergeBench_MergeAlg_TestInt64/2048                25090 ns        25076 ns        27286  (5% speedup)
MergeBench_MergeAlg_TestInt64/2097152          32398209 ns     32394000 ns           22  (7% speedup)
MergeBench_MergeAlg_TestUint32/512                 1366 ns         1366 ns       507957  (23% speedup)
MergeBench_MergeAlg_TestUint32/2048               15713 ns        15706 ns        43127  (40% speedup)
MergeBench_MergeAlg_TestUint32/2097152         30373730 ns     30366261 ns           23  (18% speedup)
MergeBench_MergeAlg_TestMediumString/512         213092 ns       212974 ns         3253  (10% speedup)
MergeBench_MergeAlg_TestMediumString/2048        879484 ns       879021 ns          752  (22% speedup)
MergeBench_MergeAlg_TestMediumString/2097152 2156054708 ns   2155483000 ns            1  (17% speedup)

There are two issues with the current implementation of std::merge:
https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L4353

  1. The algorithm does two boundary checks on every iteration, even though only one of the iterators moves.
  2. If one of the left-boundary checks is unrolled, we get better loop structures both when range 1 is bigger and when range 2 is bigger.

In my measurements the speedup gets up to 1.7x when range 1 dominates, and about 1.4x when range 2 dominates.
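To make the restructuring concrete, here is a minimal sketch (not the exact patch - the patch additionally unrolls one left-boundary check to bias the loop layout) of a merge that does one boundary check per copied element:

  #include <algorithm>

  // Sketch: each iteration checks only the boundary of the iterator that
  // just advanced; an exhausted range jumps straight to a plain copy.
  template <class _InIter1, class _InIter2, class _OutIter, class _Compare>
  _OutIter __merge_one_check(_InIter1 __f1, _InIter1 __l1,
                             _InIter2 __f2, _InIter2 __l2,
                             _OutIter __out, _Compare __comp) {
      if (__f1 == __l1) goto __copy_second;
      if (__f2 == __l2) goto __copy_first;
      for (;;) {
          if (__comp(*__f2, *__f1)) {        // *__f2 < *__f1: take from range 2
              *__out = *__f2;
              ++__out; ++__f2;
              if (__f2 == __l2) goto __copy_first;   // only range 2's boundary checked
          } else {                           // stable: ties take from range 1
              *__out = *__f1;
              ++__out; ++__f1;
              if (__f1 == __l1) goto __copy_second;  // only range 1's boundary checked
          }
      }
  __copy_first:
      return std::copy(__f1, __l1, __out);
  __copy_second:
      return std::copy(__f2, __l2, __out);
  }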

If you want to play with the algorithms/parameters, you can do that on quick-bench.
Watch out for code alignment issues! Unfortunately, including all three benchmarks
in the binary will produce incorrect results.
Link: http://quick-bench.com/kWbYdPDFnrovXWnuF6xw5wK27B8

Binary size increase (godbolt: https://godbolt.org/z/R-mmti):
Binary size, std::string: 125 vs 184 (40%) - much less than std::merge on libstdc++, which is 384; int: 54 vs 64 (18%).
Considering other places in libcxx that specialize algorithms for specific sizes, this seems acceptable
(for example sort: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L3703
and rotate: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L2388,
if I understand the rotate algorithm correctly).

Potential follow-ups:
std::stable_sort and std::inplace_merge rely on merging but currently reimplement it from scratch;
building them on this merge could be a useful improvement.
std::set_union/std::set_difference have problems similar to std::merge and could be improved in a similar manner.

Diff Detail

Event Timeline

dyaroshev created this revision.Jun 9 2019, 12:59 PM
dyaroshev updated this revision to Diff 203753.Jun 9 2019, 1:02 PM
dyaroshev added a reviewer: mclow.lists.

Updated to add a missing '\n'
Not sure who to put as a reviewer.

mclow.lists requested changes to this revision.Jun 9 2019, 7:57 PM

Thanks, but no.
Because of the gotos, we can't make this constexpr, and we need to do that.

This revision now requires changes to proceed.Jun 9 2019, 7:57 PM

Thank you, @mclow.lists
Please do not close this revision - I will try to make the optimiser produce the same code with ifs/whiles/switches.

If I fail, how about using is_constant_evaluated?

Have you analyzed how much of this is a problem of the actual implementation (in libc++), and how much of the LLVM optimization passes?
I.e., are there some obvious failures of the LLVM opts?

I do not know how to get those. What do I look at?

lebedev.ri added a comment. Edited Jun 10 2019, 1:02 PM

Well, if only "hey tool, give me all the optimizations that could be done here" were that easy :/

  • Compile the code to assembly (-c -S -o test.s), analyze the produced assembly, interpret it "with your mind", and think about whether some particular assembly instruction sequences can be optimized into other, better/shorter assembly sequences. This one is CPU-architecture and CPU-model/version specific, obviously.
  • Compile the code to LLVM IR (-c -emit-llvm -S -o test.ll), and do the same at the IR level.

Oh - I see what you mean; I thought I could do it optimisation by optimisation or something like that.
I did look at the assembly.
The only thing that seems like an optimiser bug is that it doesn't collapse multiple calls to memmove into one for std::merge, while for my version it does.
Other than that, in this case the code is one-to-one with what is written.

There are two wins here:

  1. I only check one boundary per iteration instead of two.
  2. I restructure the loop so that taking two elements in a row from the first range looks like a nicely unrolled loop, and the loop for the second range looks like a 'do while' loop.

It is purely a code layout trick that proves to yield very nice results, but I don't think the optimiser's not doing it constitutes a bug.

I also played a bit with the switch statement - so far it seems the optimiser refuses to transform my switch into jumps and insists on keeping it.
That might be a bug.
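For reference, this is roughly the shape of the switch-based variant (a hypothetical sketch, not the exact code I tried): the state records which range, if any, is exhausted, and the hope is that the optimiser threads the switch into direct jumps.

  #include <algorithm>

  // Hypothetical switch-driven merge state machine (illustrative only).
  template <class _InIter1, class _InIter2, class _OutIter, class _Compare>
  _OutIter __merge_switch(_InIter1 __f1, _InIter1 __l1,
                          _InIter2 __f2, _InIter2 __l2,
                          _OutIter __out, _Compare __comp) {
      enum _State { __merging, __first_done, __second_done };
      _State __st = __f1 == __l1 ? __first_done
                  : __f2 == __l2 ? __second_done : __merging;
      for (;;) {
          switch (__st) {
          case __merging:
              if (__comp(*__f2, *__f1)) {
                  *__out = *__f2; ++__out; ++__f2;
                  if (__f2 == __l2) __st = __second_done;
              } else {
                  *__out = *__f1; ++__out; ++__f1;
                  if (__f1 == __l1) __st = __first_done;
              }
              break;
          case __first_done:                 // range 1 exhausted
              return std::copy(__f2, __l2, __out);
          case __second_done:                // range 2 exhausted
              return std::copy(__f1, __l1, __out);
          }
      }
  }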

There is a godbolt link in the pr description.

If you take the libcxx code for std::merge and compile it with GCC, is it faster? If not by much, or even worse than Clang, I see no reason not to improve it in libcxx's codebase.

lebedev.ri requested changes to this revision.Jun 10 2019, 2:39 PM

Binary size increase (godbolt: https://godbolt.org/z/b1ZFTA):
For std::string the size grows from 394 assembly instructions to 465 instructions (18%).
For int - from 62 to 64 (3%).

Uhm, I have a question:
you did notice that you are looking at the libstdc++ implementation there?
https://godbolt.org/z/WGSQ6r

dyaroshev added a comment. Edited Jun 11 2019, 4:46 PM

My bad - I keep forgetting that libstdc++ is used by default.
Performance measurements were done using the libc++ benchmarks, and those results are correct.
The quick-bench link is also good.

It seems libstdc++ does something weird when copying strings, which leads to the doubling of the size.

Binary size, std::string: 125 vs 184 (40%) - (much less than std::merge on libstdc++, which is 384), int : 54 vs 64 (18%)

It seems that, since the copy produces less code, my collapsing of two memmoves into one brings less of a size decrease.

dyaroshev edited the summary of this revision. Jun 11 2019, 4:48 PM
EricWF added a subscriber: EricWF. Jun 13 2019, 9:58 PM

Binary size isn't

benchmarks/algorithms.merge.bench.cpp
2

Missing license header.

88

I know this doesn't affect the timing (or shouldn't), but it's useful to split out the prologue of a benchmark so it's easier to inspect the assembly the loop generates.

Also, could you provide some snippets or analysis of what changed in the assembly that's giving the performance win?

lebedev.ri resigned from this revision.Jun 14 2019, 1:09 AM

K.
I've messed around with that code; this *seems* like the right direction,
but it really screams optimizer failure. Regardless of whether or not
it should be worked around in the library's code, it should be reported as such first.

After multiple attempts I could not make the if/switch-based version produce a similar result.
Since there doesn't seem to be an is_constant_evaluated in clang, I cannot write a constexpr-enabled version of this function.
I can report this as an optimiser bug, if that would be useful.

@EricWF
I did 2 things:

  1. I only check the boundary for the iterator that was just adjusted. This gives me up to a 30% speedup in some cases.
  2. I restructured the loop so that we do fewer jumps: std::merge jumps back for each element, mine doesn't. Plus, the loop for when the second array dominates is a lot like a do-while loop, which might be helpful.

I do not know enough about the CPU to know how correct this explanation is, but I do know that this speedup reproduces on multiple different machines.

Reported the switch not being removed as a bug: https://bugs.llvm.org/show_bug.cgi?id=42313 - this one should be doable.

Reported this as an optimiser bug as well - maybe they'll have a suggestion: https://bugs.llvm.org/show_bug.cgi?id=42334

dyaroshev added a subscriber: david. Edited Jun 19 2019, 5:00 PM

@DavidBolvansky solved the puzzle!
http://quick-bench.com/W5kTrvhSkufQQleyYsY9KTOmDrc

Almost as good as the goto solution! I will double- and triple-check with everything I have over the weekend and update the patch!

Unfortunately, the switch-based solution didn't work out: the 10% degradation is not evenly distributed and, in fact, the results are significantly worse than the current implementation for the case when the right-hand side dominates.

I tested for this benchmark: https://github.com/DenisYaroshevskiy/merge_biased_blog_post/blob/c8fb78cf17db9ac181d4c6f80cfda32731c5a043/merge_benchmark.cc#L160

The benchmark works like this: I have 2000 64-bit integers. I move the split point by 50 elements at a time and split them into 2 arrays (starting with the first array holding all 2000 integers and finishing with the second one holding all 2000 integers).
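One plausible reading of that setup as code (a hypothetical reconstruction; the real generator lives in the linked repository):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <numeric>
  #include <random>
  #include <utility>
  #include <vector>

  // Hypothetical reconstruction: 2000 shuffled 64-bit ints, distributed
  // into two arrays at a moving split point, each array then sorted.
  std::pair<std::vector<std::int64_t>, std::vector<std::int64_t>>
  make_input(std::size_t split) {            // split = 0, 50, 100, ..., 2000
      std::vector<std::int64_t> data(2000);
      std::iota(data.begin(), data.end(), 0);
      std::shuffle(data.begin(), data.end(), std::mt19937_64{42}); // fixed seed: assumption
      std::vector<std::int64_t> a(data.begin(), data.begin() + split);
      std::vector<std::int64_t> b(data.begin() + split, data.end());
      std::sort(a.begin(), a.end());
      std::sort(b.begin(), b.end());
      return {std::move(a), std::move(b)};
  }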

Here are screenshots of my results on the plot:

  1. https://pasteboard.co/IkCkqzy.png
  2. https://pasteboard.co/IkCnttt.png

(pointing out 2 different data points).

When the second array dominates, we get a 1.9x loss in performance.

Reproducing the same effect on quick-bench: http://quick-bench.com/KHWe2sC-XlYky4oX-g7YaTV7cog (same benchmark that I posted previously, I just adjusted the array sizes) - a 1.5x difference.

So, unfortunately, I don't think the proposed switch-based solution is good enough.
It would be really great to generate the 'goto' version of the loop somehow.

Update:

From the comments on the optimiser bugs I created, I take it that this type of code transformation is not feasible in the foreseeable future.
The reason is that this code has a non-canonical loop structure, and the optimiser really likes its single-entry loops.

With this, I suggest using if (is_constant_evaluated), because this is a good win; since merge doesn't have to be constexpr until C++20, I think it should be OK.
I was putting this solution together - clang already supports this (at least the function is in libc++) - but then I had to work on other stuff.
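A sketch of the dispatch I have in mind (assuming C++20 std::is_constant_evaluated() and constexpr std::copy; __merge_one_check stands for the goto-based sketch from the summary, and __merge_simple is a placeholder name):

  #include <algorithm>
  #include <type_traits>

  // Plain constexpr-friendly merge loop for constant evaluation.
  template <class _InIter1, class _InIter2, class _OutIter, class _Compare>
  constexpr _OutIter __merge_simple(_InIter1 __f1, _InIter1 __l1,
                                    _InIter2 __f2, _InIter2 __l2,
                                    _OutIter __out, _Compare __comp) {
      for (; __f1 != __l1; ++__out) {
          if (__f2 == __l2)
              return std::copy(__f1, __l1, __out);
          if (__comp(*__f2, *__f1)) { *__out = *__f2; ++__f2; }
          else                      { *__out = *__f1; ++__f1; }
      }
      return std::copy(__f2, __l2, __out);
  }

  // Dispatch: the simple loop when constant-evaluated, the goto-based
  // version (sketched in the summary above) at runtime.
  template <class _InIter1, class _InIter2, class _OutIter, class _Compare>
  constexpr _OutIter __merge(_InIter1 __f1, _InIter1 __l1,
                             _InIter2 __f2, _InIter2 __l2,
                             _OutIter __out, _Compare __comp) {
      if (std::is_constant_evaluated())
          return __merge_simple(__f1, __l1, __f2, __l2, __out, __comp);
      return __merge_one_check(__f1, __l1, __f2, __l2, __out, __comp);
  }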

What do you think about this approach?

If you are OK with it, I can most likely come back to working on this in 2 weeks' time - or if somebody else wants to finish this, I'm also happy with that.

dyaroshev updated this revision to Diff 212117.Jul 28 2019, 2:15 PM

Sorry for taking a long time - I was busy.

Updated std::merge/std::copy to be constexpr-friendly.
Used std::is_constant_evaluated (available in trunk) to speed up std::merge at runtime.
Added tests for both constexpr and non-constexpr merge.

EricWF added inline comments.Jul 28 2019, 2:37 PM
include/algorithm
1717

__builtin_memmove is constexpr, so I think using that is a better approach than branching on is_constant_evaluated.
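Something like this, presumably (a sketch assuming a clang where __builtin_memmove is usable in constant expressions for trivially copyable types; the name is illustrative, not the actual patch):

  #include <cstddef>

  // Constexpr-friendly copy for trivially copyable types via the builtin,
  // with no is_constant_evaluated branch (illustrative sketch).
  template <class _Tp>
  constexpr _Tp* __copy_trivial(const _Tp* __first, const _Tp* __last,
                                _Tp* __result) {
      std::size_t __n = static_cast<std::size_t>(__last - __first);
      if (__n > 0)
          __builtin_memmove(__result, __first, __n * sizeof(_Tp));
      return __result + __n;
  }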

4381

Every time I've seen a Duff's-device-style optimization, it's a win in some cases and a loss in others. That makes me skeptical that it's the library's job to perform the loop unrolling.
Do you know why LLVM is failing to generate comparable code here?

dyaroshev marked 2 inline comments as done.Jul 28 2019, 2:58 PM
dyaroshev added inline comments.
include/algorithm
1717

Will do, thanks.

4381
  1. Though you have a valid concern here, I have benchmarked this code front and center and have not seen a pessimization. Do you have something you want me to try?

This is at most a 40% increase in binary size (still significantly less than with libstdc++) in the cases I tried, so I would not expect sudden instruction-cache spills or the like.

I would also point out that sometimes it's a 1.7x win.

  2. I do know - see https://bugs.llvm.org/show_bug.cgi?id=42313

This optimization requires jump-threading across the loop header, which the optimizer cannot comfortably do at the moment.

Eli Friedman 2019-06-19 14:27:29 PDT

Is this a fundamental LLVM problem, or is it solvable?

The reason we generally avoid jump-threading across a loop header is that irreducible CFGs (like the "goto" version of your function) generally don't optimize very well; we have a bunch of optimizations that only recognize proper loops. But certain loops really benefit from being transformed into irreducible CFGs; I think we've had similar reports before about state machines. So the challenge is figuring out a good heuristic for when the transform is actually profitable.

dyaroshev updated this revision to Diff 212127.Jul 28 2019, 3:17 PM

Addressed Eric's comment.

I don't know - I had to modify some changes from https://github.com/llvm-mirror/libcxx/commit/c005c7e34c3d22d1dba2cfe62c79f9e8be2d60de

Why are those 'constexpr if nodebug', and why does that make sense?

I could not find the Phabricator review for that commit or anything.
Alternatively, I can disable the memmove optimization in debug mode completely - what do you think about that?

jdoerfert resigned from this revision.Jul 29 2019, 11:28 AM

A reminder about this PR.

dyaroshev added a comment. Edited Aug 31 2019, 4:53 PM

Those numbers didn't make sense to me. I tested my sort with std::merge - the suggested version gives a 10% win over the standard one.

Where the rest comes from I don't know, but it seems like std::stable_sort has serious issues.

Talked to @mclow.lists about this PR - he said that he's interested.

I have figured out what happened with the previous stable_sort measurements - there was an issue with incorrect data.
With correct data the wins are not as good, but merge proves to perform well consistently.

What do you think we need in order to merge this?