Benchmarking results:
Before:
---------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------------------- MergeBench_MergeAlg_TestInt32/512 1848 ns 1846 ns 378311 MergeBench_MergeAlg_TestInt32/2048 27005 ns 26989 ns 25576 MergeBench_MergeAlg_TestInt32/2097152 33327173 ns 33314619 ns 21 MergeBench_MergeAlg_TestInt64/512 1702 ns 1701 ns 407486 MergeBench_MergeAlg_TestInt64/2048 26351 ns 26340 ns 26949 MergeBench_MergeAlg_TestInt64/2097152 34492582 ns 34484900 ns 20 MergeBench_MergeAlg_TestUint32/512 1755 ns 1755 ns 379830 MergeBench_MergeAlg_TestUint32/2048 25971 ns 25963 ns 26655 MergeBench_MergeAlg_TestUint32/2097152 37003864 ns 35490619 ns 21 MergeBench_MergeAlg_TestMediumString/512 234988 ns 234489 ns 2641 MergeBench_MergeAlg_TestMediumString/2048 1120958 ns 1062598 ns 615 MergeBench_MergeAlg_TestMediumString/2097152 2595634493 ns 2590478000 ns 1
After:
---------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------------------- MergeBench_MergeAlg_TestInt32/512 1396 ns 1395 ns 485652 (25% speedup) MergeBench_MergeAlg_TestInt32/2048 15691 ns 15682 ns 43672 (42% speedup) MergeBench_MergeAlg_TestInt32/2097152 30340879 ns 30329130 ns 23 (9% speedup) MergeBench_MergeAlg_TestInt64/512 1567 ns 1566 ns 440998 (9% speedup) MergeBench_MergeAlg_TestInt64/2048 25090 ns 25076 ns 27286 (5% speedup) MergeBench_MergeAlg_TestInt64/2097152 32398209 ns 32394000 ns 22 (7% speedup) MergeBench_MergeAlg_TestUint32/512 1366 ns 1366 ns 507957 (23% speedup) MergeBench_MergeAlg_TestUint32/2048 15713 ns 15706 ns 43127 (40% speedup) MergeBench_MergeAlg_TestUint32/2097152 30373730 ns 30366261 ns 23 (18% speedup) MergeBench_MergeAlg_TestMediumString/512 213092 ns 212974 ns 3253 (10% speedup) MergeBench_MergeAlg_TestMediumString/2048 879484 ns 879021 ns 752 (22% speedup) MergeBench_MergeAlg_TestMediumString/2097152 2156054708 ns 2155483000 ns 1 (17% speedup)
There are two issues with current implementation of std::merge:
https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L4353
- The algorithm does two checks for boundary on every iteration, even though we only move one of the iterators
- If one of the checks for left boundary is unrolled we get better loop structures for both 1 and 2 ranges being bigger.
The speed up for the 1 range dominating on some measurements I did gets up to 1.7 times, while the 2 - about 1.4
If you want to play with algorithms/parameters - you can do that on quick-bench.
Watch out for code alignment issues!! - unfortunately including all three benchmarks
in the binary will result in incorrect result.
Link: http://quick-bench.com/kWbYdPDFnrovXWnuF6xw5wK27B8
Binary size increase (godbolt: https://godbolt.org/z/R-mmti):
Binary size, std::string: 125 vs 184 (40%) - (much less than std::merge on libstdc++, which is 384), int : 54 vs 64 (18%)
Considering other places in libcxx that specialize algorithms for specific sizes seems acceptable.
(
for example sort: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L3703 rotate: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L2388 (if I understand the rotate algorithm correctly)
)
Potential followups:
std::stable_sort, std::inplace_merge are relying on merge - but reimplement it currently from scratch.
This could be a useful improvement.
std::set_union/std::set_difference have some similar problems to std::merge and could be improved in a similar maner.
Missing license header.