This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
benchmarks/
2
algorithms.merge.bench.cpp
-
include/
2/4
algorithm

Differential D63063

Bug 42208: speeding up std::merge
Needs ReviewPublic

Authored by dyaroshev on Jun 9 2019, 12:59 PM.

Download Raw Diff

Details

Reviewers

mclow.lists
lebedev.ri
jdoerfert

Summary

Benchmarking results:

Before:

----------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations
----------------------------------------------------------------------------------------
MergeBench_MergeAlg_TestInt32/512                  1848 ns         1846 ns       378311
MergeBench_MergeAlg_TestInt32/2048                27005 ns        26989 ns        25576
MergeBench_MergeAlg_TestInt32/2097152          33327173 ns     33314619 ns           21
MergeBench_MergeAlg_TestInt64/512                  1702 ns         1701 ns       407486
MergeBench_MergeAlg_TestInt64/2048                26351 ns        26340 ns        26949
MergeBench_MergeAlg_TestInt64/2097152          34492582 ns     34484900 ns           20
MergeBench_MergeAlg_TestUint32/512                 1755 ns         1755 ns       379830
MergeBench_MergeAlg_TestUint32/2048               25971 ns        25963 ns        26655
MergeBench_MergeAlg_TestUint32/2097152         37003864 ns     35490619 ns           21
MergeBench_MergeAlg_TestMediumString/512         234988 ns       234489 ns         2641
MergeBench_MergeAlg_TestMediumString/2048       1120958 ns      1062598 ns          615
MergeBench_MergeAlg_TestMediumString/2097152 2595634493 ns   2590478000 ns            1

After:

----------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations
----------------------------------------------------------------------------------------
MergeBench_MergeAlg_TestInt32/512                  1396 ns         1395 ns       485652  (25% speedup)
MergeBench_MergeAlg_TestInt32/2048                15691 ns        15682 ns        43672  (42% speedup)
MergeBench_MergeAlg_TestInt32/2097152          30340879 ns     30329130 ns           23  (9% speedup)
MergeBench_MergeAlg_TestInt64/512                  1567 ns         1566 ns       440998  (9% speedup)
MergeBench_MergeAlg_TestInt64/2048                25090 ns        25076 ns        27286  (5% speedup)
MergeBench_MergeAlg_TestInt64/2097152          32398209 ns     32394000 ns           22  (7% speedup)
MergeBench_MergeAlg_TestUint32/512                 1366 ns         1366 ns       507957  (23% speedup)
MergeBench_MergeAlg_TestUint32/2048               15713 ns        15706 ns        43127  (40% speedup)
MergeBench_MergeAlg_TestUint32/2097152         30373730 ns     30366261 ns           23  (18% speedup)
MergeBench_MergeAlg_TestMediumString/512         213092 ns       212974 ns         3253  (10% speedup)
MergeBench_MergeAlg_TestMediumString/2048        879484 ns       879021 ns          752  (22% speedup)
MergeBench_MergeAlg_TestMediumString/2097152 2156054708 ns   2155483000 ns            1  (17% speedup)

There are two issues with current implementation of std::merge:
https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L4353

The algorithm does two checks for boundary on every iteration, even though we only move one of the iterators
If one of the checks for left boundary is unrolled we get better loop structures for both 1 and 2 ranges being bigger.

The speed up for the 1 range dominating on some measurements I did gets up to 1.7 times, while the 2 - about 1.4

If you want to play with algorithms/parameters - you can do that on quick-bench.
Watch out for code alignment issues!! - unfortunately including all three benchmarks
in the binary will result in incorrect result.
Link: http://quick-bench.com/kWbYdPDFnrovXWnuF6xw5wK27B8

Binary size increase (godbolt: https://godbolt.org/z/R-mmti):
Binary size, std::string: 125 vs 184 (40%) - (much less than std::merge on libstdc++, which is 384), int : 54 vs 64 (18%)
Considering other places in libcxx that specialize algorithms for specific sizes seems acceptable.
(

for example sort: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L3703
rotate: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L2388
(if I understand the rotate algorithm correctly)

)

Potential followups:
std::stable_sort, std::inplace_merge are relying on merge - but reimplement it currently from scratch.
This could be a useful improvement.
std::set_union/std::set_difference have some similar problems to std::merge and could be improved in a similar maner.

Diff Detail

Event Timeline

dyaroshev created this revision.Jun 9 2019, 12:59 PM

Herald added subscribers: libcxx-commits, ldionne, dmgreen, mgrang. · View Herald TranscriptJun 9 2019, 12:59 PM

Updated to add a missing '\n'
Not sure who to put as a reviewer.

Thanks, but no.
Because of the gotos, we can't make this constexpr, and we need to do that.

This revision now requires changes to proceed.Jun 9 2019, 7:57 PM

Thank you, @mclow.lists
Please do not close this revision - I will try to make the optimiser produce the same code with ifs/whiles/switches.

How about is_constant_evaluated if I fail?

Have you analyzed, how much is this a problem of the actual implementation (in libc++), and how much of the llvm optimization passes?
I.e. are there some obvious failures of the llvm opts?

In D63063#1535924, @lebedev.ri wrote:

Have you analyzed, how much is this a problem of the actual implementation (in libc++), and how much of the llvm optimization passes?
I.e. are there some obvious failures of the llvm opts?

I do not know how to get those. What do I look at?

In D63063#1536710, @dyaroshev wrote:

In D63063#1535924, @lebedev.ri wrote:

Have you analyzed, how much is this a problem of the actual implementation (in libc++), and how much of the llvm optimization passes?
I.e. are there some obvious failures of the llvm opts?

I do not know how to get those. What do I look at?

Well, if "hey tool, give me all the optimizations that could be done here" was that easy :/

Compile the code to .s (-c -S -o test.s), and analyze the produced assembly, interpret it "with your mind" and think if some particular assembly instruction sequences can be optimized into other, better/shorter/etc, assembly instruction sequences. This one is cpu architecture, cpu model/version specific, obviously.
Compile the code to LLVM IR (-c -emit-llvm -S -o test.ll), and do the same at IR level.

In D63063#1536732, @lebedev.ri wrote:

In D63063#1536710, @dyaroshev wrote:

In D63063#1535924, @lebedev.ri wrote:

Have you analyzed, how much is this a problem of the actual implementation (in libc++), and how much of the llvm optimization passes?
I.e. are there some obvious failures of the llvm opts?

I do not know how to get those. What do I look at?

Well, if "hey tool, give me all the optimizations that could be done here" was that easy :/

Compile the code to .s (-c -S -o test.s), and analyze the produced assembly, interpret it "with your mind" and think if some particular assembly instruction sequences can be optimized into other, better/shorter/etc, assembly instruction sequences. This one is cpu architecture, cpu model/version specific, obviously.

Compile the code to LLVM IR (-c -emit-llvm -S -o test.ll), and do the same at IR level.

Oh - I see what you mean - i thought I could do it optimisation by optimisation or smth like that.
I did look at the assembly.
The only thing that seems like an optimiser bug is that it doesn't collapse multiple calls to memmoves into one for std::merge and for my version it does.
Other then that - in this case the code is 1 to 1 what is written.

There are two wins here:

I only check one boundary per iteration instead of two.
I restructure the loop so that it looks like a nice unrolled loop for the case when there are two elements from the first range and that it looks like a 'do while' loop for the second range.

It is purely a code layout trick that proves to yield very nice results. But I don't think that it constitutes a bug that optimiser doesn't do it.

I also played a bit with the switch statement - so far seems like the optimiser refuses transform my switch into jumps and insists on keeping it in.
That might be a bug.

There is a godbolt link in the pr description.

If you take libcxx code for std::merge and compile it with GCC, is it faster? If not much, or even worse than Clang, I see no reason why not to improve it in libcxx's codebase.

Binary size increase (godbolt: https://godbolt.org/z/b1ZFTA):
For std::string the size grows from 394 assembly instructions to 465 instructions (18%).
For int - from 62 to 64 (3%).

Uhm.
I have a question:
you did notice that you are looking at libstdc++ implementation there?
https://godbolt.org/z/WGSQ6r

In D63063#1536895, @lebedev.ri wrote:

Binary size increase (godbolt: https://godbolt.org/z/b1ZFTA):
For std::string the size grows from 394 assembly instructions to 465 instructions (18%).
For int - from 62 to 64 (3%).

Uhm.
I have a question:
you did notice that you are looking at libstdc++ implementation there?
https://godbolt.org/z/WGSQ6r

My bad - keep forgetting about that by default libstdc++ is used.
Performance measurements were done using libc++ benchmarks and the results are correct.
Quickbench link is also good

Seems like libstdc++ does smth weird for coping strings, which leads to doubling of the size.

Binary size, std::string: 125 vs 184 (40%) - (much less than std::merge on libstdc++, which is 384), int : 54 vs 64 (18%)

Seems like since the copy produces less code, my collapsing of two memmoves into one brings less size decrease.

dyaroshev edited the summary of this revision. (Show Details)Jun 11 2019, 4:48 PM

Binary size isn't

benchmarks/algorithms.merge.bench.cpp
1	Missing license header.
87	I know this doesn't affect the timing (or shouldn't), but it's useful to split out the prologue of a benchmark so it's easier to inspect the assembly the loop generates. Also, if you could provide some snippits or analysis about what changed in the assembly that's giving the performance win.

In D63063#1539131, @dyaroshev wrote:

In D63063#1536895, @lebedev.ri wrote:

Binary size increase (godbolt: https://godbolt.org/z/b1ZFTA):
For std::string the size grows from 394 assembly instructions to 465 instructions (18%).
For int - from 62 to 64 (3%).

Uhm.
I have a question:
you did notice that you are looking at libstdc++ implementation there?
https://godbolt.org/z/WGSQ6r

My bad - keep forgetting about that by default libstdc++ is used.

K.
I've messed around with that code, this *seems* like the right direction,
but this really screams optimizer failure. Regardless of whether or not
it should be workarounded in the library's code, it should be reported as such first

Performance measurements were done using libc++ benchmarks and the results are correct.
Quickbench link is also good

Seems like libstdc++ does smth weird for coping strings, which leads to doubling of the size.

Binary size, std::string: 125 vs 184 (40%) - (much less than std::merge on libstdc++, which is 384), int : 54 vs 64 (18%)

Seems like since the copy produces less code, my collapsing of two memmoves into one brings less size decrease.

After multiple attempts I could not make the "if/switch" based version to produce a similar result.
Since there doesn't seem to be "is_constant_evaluated" in clang, I cannot write a constexpr enabled version of this function.
I can report this as an optimiser bug, if that would be useful.

@EricWF
I did 2 things:

I only check the boundary for just adjusted iterator. This gives me up to 30% speedup for some cases.
I restructured the loop that we do less jumps: std::merge jumps back for each element, I don't. + The loop for the second array dominating is a lot like a do while loop, this might be helpful.

I do not know enough about the CPU to know how correct this explanation is but I do know that I have reproduction of this speed up on multiple different machines.

Reported switch not being removed as a bug: https://bugs.llvm.org/show_bug.cgi?id=42313 - this should be doable.

Reported this as an optimiser bug - maybe they'll have a suggestion: https://bugs.llvm.org/show_bug.cgi?id=42334

@DavidBolvansky solved the puzzle!
http://quick-bench.com/W5kTrvhSkufQQleyYsY9KTOmDrc

Almost as good as a goto solution! Will double/tripple check with everything I have over the weekend and update the patch!

Unfortunately switch based solution didn't work out - the degradation of 10% is not evenly distributed and, in fact, the results are even significantly worse than current implementation for the case when right hand side.

I tested for this benchmark: https://github.com/DenisYaroshevskiy/merge_biased_blog_post/blob/c8fb78cf17db9ac181d4c6f80cfda32731c5a043/merge_benchmark.cc#L160

Benchmark works like this: I have 2000 64 bit integers. I go every 50 elements and split them in 2 arrays. (starting with 1 one having 2000 integers, finishing with the second one having 2000 integers).

Here are screenshots of my results on the plot:

(pointing out 2 different data points).

When the second array is dominating, we get a 1.9 loss in performance.

Reproducing same effect on quick-bench: http://quick-bench.com/KHWe2sC-XlYky4oX-g7YaTV7cog (same benchmark that I posted previously, I just adjusted array sizes) - 1.5 time difference.

So, unfortunately, I don't think that proposed switch based solution is good enough.
Would be really great to generate 'goto' version of the loop somehow.

Update:

From the comments on the optimiser bugs created I take that this type of code transformation is not feasible in a foreseeable future.
The reason is that this code structure has a non-canonical loop structure and optimiser really likes it's single entry loops.

With this I suggest to use if (is_constant_evaluated) because this is a good win. Since it doesn't have to be constexpr until C++20, I think it should be OK.
I was putting this solution together - since clang already supports this (at least the function is in libc++) but then I had to work on other staff.

What do you think about this approach?

If you OK with it, I can most likely come back to working on this in 2 weeks time - or if somebody else wants to finish this- I'm also happy with that.

Sorry for taking a long time - was busy.

Updated std::merge/std::copy to be constexpr friendly.
Used std::is_constant_evaluated (available in the trunk) to do speed up std::merge for runtime.
Added tests for both constexpr/non-constexpr merge.

Herald added a reviewer: jdoerfert. · View Herald TranscriptJul 28 2019, 2:15 PM

Herald added a subscriber: christof. · View Herald Transcript

EricWF added inline comments.Jul 28 2019, 2:37 PM

include/algorithm
1717	`__builtin_memmove` is constexpr, so I think using that is a better approach that branching on `is_constant_evaluated`.
4391	Everytime I've seen a duff's device optimization, it's a win is some cases and a loss in others. That makes me skeptical that it's the libraries job to perform the loop unrolling. Do you know why LLVM is failing to generate comparable code here?

dyaroshev marked 2 inline comments as done.Jul 28 2019, 2:58 PM

dyaroshev added inline comments.

include/algorithm
1717	Will do, thanks.
4391	Though you have a valid concern here - I have benchmarked this code front and center - I have not seen a pessimization. Do you have something you want me to try? This is at most a 40% increase in binary size (still significantly less then with libstdc++) where I tried - so I would not expect sudden instruction cache spills or things like that. I would point out that sometimes it's a 1.7 times win. I do know - see https://bugs.llvm.org/show_bug.cgi?id=42313 This optimization requires jumping through the loop header - which optimizer cannot comfortably work at the moment. Eli Friedman 2019-06-19 14:27:29 PDT Is this a fundamental llvm problem or it is solvable? The reason we generally avoid jump-threading across a loop header is that irreducible CFGs (like the "goto" version of your >function) generally don't optimize very well; we have a bunch of optimizations that only recognize proper loops. But certain >loops really benefit from being transformed into irreducible CFGs; I think we've had similar reports before about state >machines before. So the challenge is figuring out a good heuristic for when the transform is actually profitable.

Eric's comment.

I don't know - I had to modify some changes from: https://github.com/llvm-mirror/libcxx/commit/c005c7e34c3d22d1dba2cfe62c79f9e8be2d60de

Why are those 'constexpr if nodebug' and why does it make sense?

Could not find the 'phabricator review' or anything.
Alternatively - I can disable the mmemove optimization in debug completely - what do you think about that?

jdoerfert resigned from this revision.Jul 29 2019, 11:28 AM

Reminding about this PR.

Ping.

Ok, last attempt to push this.

I built a very dumb stable_sort, using this merge:
https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/1c1ae696d1cef7772f78041b326b1e3d1bcf6346/src/algo/stable_sort.h#L38

It's 1.5 - 3 times faster than stable sort, that comes with my libc++

Here are the benchmarking results:
https://denisyaroshevskiy.github.io/algorithm_dumpster/#sorts

Here is the code for the benchmark:
Template: https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/1c1ae696d1cef7772f78041b326b1e3d1bcf6346/src/bench_generic/sort.h#L44
Runner: https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/1c1ae696d1cef7772f78041b326b1e3d1bcf6346/src/bench_runnable/sort.cc#L26

Those numbers didn't make sense to me. Tested my sort with std::merge - the suggested version gives a 10% win over the standard one.

The rest comes from I don't know what, but seems like std::stable_sort has serious issues.

Talked to @mclow.lists about this PR - he said that he's interested.

I have figured out what happened with the previous stable sort measurements - there was an issue with incorrect data.
With correct data the wins are not as good, but merge proves to be performing well consistently.

What do you think we need to merge this?

Revision Contents

Path

Size

benchmarks/

algorithms.merge.bench.cpp

127 lines

include/

algorithm

36 lines

Diff 203753

benchmarks/algorithms.merge.bench.cpp

This file was added.

				#include <array>
				EricWFUnsubmitted Not Done Reply Inline Actions Missing license header. EricWF: Missing license header.
				#include <string>

				#include "benchmark/benchmark.h"

				#include "CartesianBenchmarks.hpp"
				#include "GenerateInput.hpp"

				namespace {

				constexpr size_t kSplitPointCount = 5;

				template <class I>
				std::array<I, kSplitPointCount> pickSplitPoints(I f, I l) {
				std::ptrdiff_t N = std::distance(f, l);

				return {
				f + N / 4,
				f + N / 3,
				f + N / 2,
				f + 2 * N / 3,
				f + 3 * N / 4
				};
				}

				template <class IntT>
				struct TestIntBase {
				static std::vector<IntT> generateInput(size_t size) {
				std::vector<IntT> Res(size);
				std::generate(Res.begin(), Res.end(),
				[] { return getRandomInteger<IntT>(); });
				return Res;
				}
				};

				struct TestInt32 : TestIntBase<std::int32_t> {
				static constexpr const char* Name = "TestInt32";
				};

				struct TestInt64 : TestIntBase<std::int64_t> {
				static constexpr const char* Name = "TestInt64";
				};

				struct TestUint32 : TestIntBase<std::uint32_t> {
				static constexpr const char* Name = "TestUint32";
				};

				struct TestMediumString {
				static constexpr const char* Name = "TestMediumString";
				static constexpr size_t StringSize = 32;

				static std::vector<std::string> generateInput(size_t size) {
				std::vector<std::string> Res(size);
				std::generate(Res.begin(), Res.end(),
				[] { return getRandomString(StringSize); });
				return Res;
				}
				};

				using AllTestTypes =
				std::tuple<TestInt32, TestInt64, TestUint32, TestMediumString>;

				struct MergeAlg {
				template <class I1, class I2, class O>
				O operator()(I1 first1, I1 last1, I2 first2, I2 last2, O out) {
				return std::merge(first1, last1, first2, last2, out);
				}

				static constexpr const char* Name = "MergeAlg";
				};

				using AllAlgs = std::tuple<MergeAlg>;

				} // namespace

				template <class Alg, class TestType>
				struct MergeBench {
				size_t Quantity;

				std::string name() const {
				return std::string("MergeBench_") + Alg::Name + "_" + TestType::Name + '/' +
				std::to_string(Quantity);
				}

				void run(benchmark::State& state) const {
				auto Data = TestType::generateInput(Quantity);
				using VecT = decltype(Data);
				EricWFUnsubmitted Not Done Reply Inline Actions I know this doesn't affect the timing (or shouldn't), but it's useful to split out the prologue of a benchmark so it's easier to inspect the assembly the loop generates. Also, if you could provide some snippits or analysis about what changed in the assembly that's giving the performance win. EricWF: I know this doesn't affect the timing (or shouldn't), but it's useful to split out the prologue…

				const auto SplitPoints = pickSplitPoints(Data.begin(), Data.end());

				std::array<VecT, kSplitPointCount> LhsArray{};
				std::array<VecT, kSplitPointCount> RhsArray{};

				{
				VecT* LhsIt = LhsArray.begin();
				VecT* RhsIt = RhsArray.begin();
				for (auto Split : SplitPoints) {
				*LhsIt = VecT{Data.begin(), Split};
				*RhsIt = VecT{Split, Data.end()};
				std::sort(LhsIt->begin(), LhsIt->end());
				std::sort(RhsIt->begin(), RhsIt->end());
				++LhsIt;
				++RhsIt;
				}
				}

				VecT output(Data.size());

				for (auto _ : state) {
				for (size_t i = 0; i < LhsArray.size(); ++i) {
				Alg{}(LhsArray[i].begin(), LhsArray[i].end(), RhsArray[i].begin(),
				RhsArray[i].end(), output.begin());
				benchmark::DoNotOptimize(output);
				}
				}
				}
				};

				int main(int argc, char** argv) {
				benchmark::Initialize(&argc, argv);
				if (benchmark::ReportUnrecognizedArguments(argc, argv))
				return 1;

				const std::vector<size_t> Quantities = {1 << 9, 1 << 11, 1 << 21};
				makeCartesianProductBenchmark<MergeBench, AllAlgs, AllTestTypes>(Quantities);
				benchmark::RunSpecifiedBenchmarks();
				}

include/algorithm

Show First 20 Lines • Show All 1,708 Lines • ▼ Show 20 Lines	<
is_same<typename remove_const<_Tp>::type, _Up>::value &&		is_same<typename remove_const<_Tp>::type, _Up>::value &&
is_trivially_copy_assignable<_Up>::value,		is_trivially_copy_assignable<_Up>::value,
_Up*		_Up*
>::type		>::type
__copy(_Tp* __first, _Tp* __last, _Up* __result)		__copy(_Tp* __first, _Tp* __last, _Up* __result)
{		{
const size_t __n = static_cast<size_t>(__last - __first);		const size_t __n = static_cast<size_t>(__last - __first);
if (__n > 0)		if (__n > 0)
_VSTD::memmove(__result, __first, __n * sizeof(_Up));		_VSTD::memmove(__result, __first, __n * sizeof(_Up));
		EricWFUnsubmitted Not Done Reply Inline Actions `__builtin_memmove` is constexpr, so I think using that is a better approach that branching on `is_constant_evaluated`. EricWF: `__builtin_memmove` is constexpr, so I think using that is a better approach that branching on…
		dyaroshevAuthorUnsubmitted Done Reply Inline Actions Will do, thanks. dyaroshev: Will do, thanks.
return __result + __n;		return __result + __n;
}		}

template <class _InputIterator, class _OutputIterator>		template <class _InputIterator, class _OutputIterator>
inline _LIBCPP_INLINE_VISIBILITY		inline _LIBCPP_INLINE_VISIBILITY
_OutputIterator		_OutputIterator
copy(_InputIterator __first, _InputIterator __last, _OutputIterator __result)		copy(_InputIterator __first, _InputIterator __last, _OutputIterator __result)
{		{
▲ Show 20 Lines • Show All 2,619 Lines • ▼ Show 20 Lines

// merge		// merge

template <class _Compare, class _InputIterator1, class _InputIterator2, class _OutputIterator>		template <class _Compare, class _InputIterator1, class _InputIterator2, class _OutputIterator>
_OutputIterator		_OutputIterator
__merge(_InputIterator1 __first1, _InputIterator1 __last1,		__merge(_InputIterator1 __first1, _InputIterator1 __last1,
_InputIterator2 __first2, _InputIterator2 __last2, _OutputIterator __result, _Compare __comp)		_InputIterator2 __first2, _InputIterator2 __last2, _OutputIterator __result, _Compare __comp)
{		{
for (; __first1 != __last1; ++__result)		if (__first1 == __last1) goto __copySecond;
{		if (__first2 == __last2) goto __copyFirst;
if (__first2 == __last2)
return _VSTD::copy(__first1, __last1, __result);		while (true) {
if (__comp(__first2, __first1))		if (__comp(__first2, __first1)) goto __takeSecond;
{		__result = __first1;
		++__first1, (void)++__result;
		if (__first1 == __last1) goto __copySecond;
		goto __unrolledCheck;
		__takeSecond:
__result = __first2;		__result = __first2;
++__first2;		++__first2, (void)++__result;
}		if (__first2 == __last2) goto __copyFirst;
else		__unrolledCheck:
{		if (__comp(__first2, __first1)) goto __takeSecond;
__result = __first1;		__result = __first1;
++__first1;		++__first1, (void)++__result;
}		if (__first1 == __last1) goto __copySecond;
}		}

		__copySecond:
return _VSTD::copy(__first2, __last2, __result);		return _VSTD::copy(__first2, __last2, __result);
		__copyFirst:
		return _VSTD::copy(__first1, __last1, __result);
}		}

template <class _InputIterator1, class _InputIterator2, class _OutputIterator, class _Compare>		template <class _InputIterator1, class _InputIterator2, class _OutputIterator, class _Compare>
inline _LIBCPP_INLINE_VISIBILITY		inline _LIBCPP_INLINE_VISIBILITY
_OutputIterator		_OutputIterator
merge(_InputIterator1 __first1, _InputIterator1 __last1,		merge(_InputIterator1 __first1, _InputIterator1 __last1,
_InputIterator2 __first2, _InputIterator2 __last2, _OutputIterator __result, _Compare __comp)		_InputIterator2 __first2, _InputIterator2 __last2, _OutputIterator __result, _Compare __comp)
{		{
typedef typename __comp_ref_type<_Compare>::type _Comp_ref;		typedef typename __comp_ref_type<_Compare>::type _Comp_ref;
return _VSTD::__merge<_Comp_ref>(__first1, __last1, __first2, __last2, __result, __comp);		return _VSTD::__merge<_Comp_ref>(__first1, __last1, __first2, __last2, __result, __comp);
}		}

template <class _InputIterator1, class _InputIterator2, class _OutputIterator>		template <class _InputIterator1, class _InputIterator2, class _OutputIterator>
inline _LIBCPP_INLINE_VISIBILITY		inline _LIBCPP_INLINE_VISIBILITY
_OutputIterator		_OutputIterator
		EricWFUnsubmitted Not Done Reply Inline Actions Everytime I've seen a duff's device optimization, it's a win is some cases and a loss in others. That makes me skeptical that it's the libraries job to perform the loop unrolling. Do you know why LLVM is failing to generate comparable code here? EricWF: Everytime I've seen a duff's device optimization, it's a win is some cases and a loss in others.
		dyaroshevAuthorUnsubmitted Done Reply Inline Actions Though you have a valid concern here - I have benchmarked this code front and center - I have not seen a pessimization. Do you have something you want me to try? This is at most a 40% increase in binary size (still significantly less then with libstdc++) where I tried - so I would not expect sudden instruction cache spills or things like that. I would point out that sometimes it's a 1.7 times win. I do know - see https://bugs.llvm.org/show_bug.cgi?id=42313 This optimization requires jumping through the loop header - which optimizer cannot comfortably work at the moment. Eli Friedman 2019-06-19 14:27:29 PDT Is this a fundamental llvm problem or it is solvable? The reason we generally avoid jump-threading across a loop header is that irreducible CFGs (like the "goto" version of your >function) generally don't optimize very well; we have a bunch of optimizations that only recognize proper loops. But certain >loops really benefit from being transformed into irreducible CFGs; I think we've had similar reports before about state >machines before. So the challenge is figuring out a good heuristic for when the transform is actually profitable. dyaroshev: 1) Though you have a valid concern here - I have benchmarked this code front and center - I…
merge(_InputIterator1 __first1, _InputIterator1 __last1,		merge(_InputIterator1 __first1, _InputIterator1 __last1,
_InputIterator2 __first2, _InputIterator2 __last2, _OutputIterator __result)		_InputIterator2 __first2, _InputIterator2 __last2, _OutputIterator __result)
{		{
typedef typename iterator_traits<_InputIterator1>::value_type __v1;		typedef typename iterator_traits<_InputIterator1>::value_type __v1;
typedef typename iterator_traits<_InputIterator2>::value_type __v2;		typedef typename iterator_traits<_InputIterator2>::value_type __v2;
return merge(__first1, __last1, __first2, __last2, __result, __less<__v1, __v2>());		return merge(__first1, __last1, __first2, __last2, __result, __less<__v1, __v2>());
}		}

▲ Show 20 Lines • Show All 1,294 Lines • Show Last 20 Lines