This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Replaces std::sort by Bitset sorting algorithm.
AbandonedPublic

Authored by nilayvaish on Dec 14 2020, 9:27 AM.

Details

Reviewers
jdoerfert
minjaehwang
ldionne
Group Reviewers
Restricted Project
Summary

Bitset Sort

Bitset Sort is a variant of quick sort, specifically BlockQuickSort. Bitset Sort uses a carefully written partition function to let the compiler generates SIMD instructions without actually writing SIMD intrinsics in the loop.

Bitset Sort is interface compatible with std::sort and meant to replace std::sort in libc++.

Bitset Sort is 3.4x faster (or spends 71% less time) than libc++ std::sort when sorting uint64s and 1.58x faster (or spends 37% less time) when sorting std::string.

Bitset Sort uses multiple techniques to improve runtime performance of sort. This includes sorting networks, a variant of merge sort called Bitonic Order Merge Sort that is faster for small N, and pattern recognitions.

Please see Github page for more detailed documentations and full results.

Diff Detail

Event Timeline

minjaehwang created this revision.Dec 14 2020, 9:27 AM
minjaehwang requested review of this revision.Dec 14 2020, 9:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2020, 9:27 AM
minjaehwang edited the summary of this revision. (Show Details)Dec 14 2020, 9:29 AM
minjaehwang edited the summary of this revision. (Show Details)
minjaehwang edited the summary of this revision. (Show Details)
lebedev.ri retitled this revision from Replaces std::sort by Bitset sorting algorithm. to [TableGen] Replaces std::sort by Bitset sorting algorithm..Dec 14 2020, 9:34 AM
lebedev.ri edited reviewers, added: Paul-C-Anagnostopoulos; removed: EricWF.

Did you upload the right diff? The code does not show sort.

Made the incorrect patch previously.

Herald added a project: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript

Did you upload the right diff? The code does not show sort.

I am very sorry. Just updated the review with the sort.

lebedev.ri retitled this revision from [TableGen] Replaces std::sort by Bitset sorting algorithm. to [libc++] Replaces std::sort by Bitset sorting algorithm..Dec 14 2020, 10:02 AM
lebedev.ri removed a reviewer: Paul-C-Anagnostopoulos.
lebedev.ri added a subscriber: lebedev.ri.

What are the perf changes on CPU's without support for those instructions?

IMHO bitset sort is not a standard term, so it needs some clarification. From a glance this uses a bitset partition and needs temporary storage (heap memory allocation?). Does the original algorithm have heap memory allocation?

IMHO bitset sort is not a standard term, so it needs some clarification. From a glance this uses a bitset partition and needs temporary storage (heap memory allocation?). Does the original algorithm have heap memory allocation?

There's a list of standard algorithms that are allowed to use temporary heap storage.
That list does not include std::sort (it consists of stable_sort, stable_partition and inplace_merge)

abidh added a subscriber: abidh.Dec 14 2020, 11:06 AM

This is an interesting contribution. I'm trying to wrap my head around it.

From a glance this uses a bitset partition and needs temporary storage (heap memory allocation?). Does the original algorithm have heap memory allocation?

Unless I missed a subtlety, it doesn't require heap allocation. The "bitsets" used are actually 64 (or 32) bit integers. The current algorithm also doesn't have heap allocation, so keeping that property is something we want (and something this patch does AFAICT).

@minjaehwang Is there documentation for this algorithm? I believe it would help us (at least myself) understand the algorithm and the potential tradeoffs involved. Also, are you aware of Timsort? IIRC there had been an analysis conducted that concluded that we should be using Timsort (as it was faster than our sort and respected the complexity requirements of the Standard).

libcxx/test/libcxx/algorithms/sort.pass.cpp
21 ↗(On Diff #311640)

Shouldn't this test work with all implementations? Why not?

This is an interesting contribution. I'm trying to wrap my head around it.

Thanks!

From a glance this uses a bitset partition and needs temporary storage (heap memory allocation?). Does the original algorithm have heap memory allocation?

Unless I missed a subtlety, it doesn't require heap allocation. The "bitsets" used are actually 64 (or 32) bit integers. The current algorithm also doesn't have heap allocation, so keeping that property is something we want (and something this patch does AFAICT).

You are right. It only requires two 64 (or 32) bit integers to represent bitsets on left partition and right partition.

@minjaehwang Is there documentation for this algorithm? I believe it would help us (at least myself) understand the algorithm and the potential tradeoffs involved. Also, are you aware of Timsort? IIRC there had been an analysis conducted that concluded that we should be using Timsort (as it was faster than our sort and respected the complexity requirements of the Standard).

I have the documentation which was intended to present in my employer. I will work with the company to publish the document.

I am aware of Timsort. Timsort is probably good for replacing std::stable_sort. It is a variant of merge sort and could be generally slower compared to unstable sorts.

What are the perf changes on CPU's without support for those instructions?

This is definitely something that I would like to do. Could you suggest which platforms or architectures should I try?

This implementation benefits from bsr or lzcnt instructions. I am guessing bitset will speed up even without specialized instructions because lzcnt and bsr are not terribly slow to implement with general instructions.

IMHO bitset sort is not a standard term, so it needs some clarification. From a glance this uses a bitset partition and needs temporary storage (heap memory allocation?). Does the original algorithm have heap memory allocation?

You are right. I coined bitset sort terminology and have not published anything publicly yet. I will work on that.

Bitset partition works with stack variables without heap memory allocation.

minjaehwang added inline comments.Dec 14 2020, 12:25 PM
libcxx/test/libcxx/algorithms/sort.pass.cpp
21 ↗(On Diff #311640)

You are right. This is a general sort test. Will fix it.

Hi, @ldionne asked me whether it was possible to review the algorithm, and so did I over the last three days. I eventually decided to create an account here and share again what I already sent him by mail, with additional information.

First, it looks like the main new trick is an implementation of the idea described in *BlockQuicksort: How Branch Mispredictions don't affect Quicksort* by Edelkamp and Weiß: basically, the partitioning algorithm performs comparisons over blocks of 32~64 elements, stores their results as booleans (here in a bitset), then performs the swaps to put the misplaced elements into place. It nicely avoids issues linked to branch misprediction, and thus is kind of a silver bullet for quicksort as long as the comparisons and projections passed to the algorithm are themselves branchless. As we can see for the strings benchmark, this partitioning scheme can be slower than a more traditional one when there are branches in the comparisons, which is likely due to the overhead from the additional logic involved.

On the good side, I tweaked bitset_sort so that I could inject it in my cpp-sort library, and ran the full test suite: the algorithm doesn't seem to have obvious bugs, and ubsan and asan didn't fid any issue either. Also the benchmarks show good results (here for a std::vector<double>: https://i.imgur.com/Nu6l8lI.png

On the not-so-good side, bitset_sort seems to be O(n²), just like the current implementation of std::sort in C++. I reran the quicksort killer implementation of Orson Peters from this issue, and got the same results. The quicksort killer is based on *A Killer Adversary for Quicksort* by M. D. McIlroy. Here is a graph showing the O(n log n) vs. O(n²) behaviour of the different algorithms: https://i.imgur.com/IWa2teO.png

So far my overall conclusion is that the idea of reimplementing the BlockQuicksort logic over bitsets is nice, but pdqsort otherwise seems better in every regard: it is truly O(n log n) - and even O(n log k) for k unique elements when k is small -, is efficient at breaking common quicksort-adverse patterns such as the pipe organ pattern in my first benchmark, and it actually switches over to a simpler non-branchless partitioning algorithm when it can't prove at compile time that the comparison will be branchless, avoiding the regression on types such as std::string.

If std::sort has to be changed, then I'd definitely reconsider using pdqsort instead. It should be possible to reimplement it over bitsets à la bitset_sort if you care about the reduced impact on stack memory.

minjaehwang added a comment.EditedDec 21 2020, 1:27 PM

Thank @Morwenn very much for looking into this code.

@Morwenn, you are right that the idea is largely based on Block Quicksort except that Bitset sort uses a 32/64 bit integer instead of an array. A bitset will be kept within CPU registers and won't require store instructions like Block Quicksort does.

As you know, Pdqsort is a variation of Block Quicksort with a pattern recognition and a different pivot selection. It recognizes ascending ranges, descending ranges, and O(n ^ 2) cases. Pdqsort shows O(n) performance on these known patterns and guarantees O(n lg n) in the worse case.

The bitset trick can be applied to Pdqsort's partition function just like this code does for the existing std::sort's partition function. I have implemented Bitset trick on top of pdqsort as well but chose the current implementation over pdqsort-based Bitset sort. The current implementation introduces the minimal change to std::sort. It replaces the partition function but keeps most of the outer layer of std::sort.

The reason for keeping the outer layer of std::sort is that it will keep most of performance characteristics unchanged and adds a little overhead when there is a branch in the comparison function.

@Morwenn mentioned the O(n ^ 2) case for quicksort which also happens for the existing std::sort. As you might know, a simple change can avoid O(n ^ 2). Introsort avoids O(n ^ 2) by calling heap sort when a quicksort depth goes above O(lg n). There has been efforts to introduce introsort into libc++ (Kumar's introsort). For unknown reason to me, it has not been submitted. Pdqsort avoids O(n ^ 2) by calling heap sort when partitions are highly unbalanced.

I understand that Pdqsort is faster in many known patterns over std::sort. If we are not afraid of changing the entire sort implementation, I can certainly bring Bitset-on-pdqsort as another code review. @Morwenn and @ldionne, could you give your thoughts on the future direction?

@Morwenn Bitset sort is faster for std::strings than pdqsort which turns off Block sort technique for non-arithmetic types. See the following comparison. Block sort technique can be faster even with comparison functions with branches.

BM_PdqSort_string_Random_1                                                   4.65 ns         4.64 ns    150470656
BM_PdqSort_string_Random_4                                                   19.0 ns         19.0 ns     36175872
BM_PdqSort_string_Random_16                                                  46.4 ns         46.4 ns     14942208
BM_PdqSort_string_Random_64                                                  64.7 ns         64.7 ns     10485760
BM_PdqSort_string_Random_256                                                 84.1 ns         84.1 ns      8126464
BM_PdqSort_string_Random_1024                                                 103 ns          103 ns      6553600
BM_PdqSort_string_Random_16384                                                160 ns          160 ns      4456448
BM_PdqSort_string_Random_262144                                               242 ns          242 ns      2359296
BM_BitsetSort_string_Random_1                                                3.85 ns         3.85 ns    181665792
BM_BitsetSort_string_Random_4                                                18.1 ns         18.1 ns     38535168
BM_BitsetSort_string_Random_16                                               45.5 ns         45.5 ns     14942208
BM_BitsetSort_string_Random_64                                               62.6 ns         62.6 ns     10747904
BM_BitsetSort_string_Random_256                                              74.4 ns         74.4 ns      9175040
BM_BitsetSort_string_Random_1024                                             85.9 ns         85.9 ns      7864320
BM_BitsetSort_string_Random_16384                                             121 ns          121 ns      5767168
BM_BitsetSort_string_Random_262144                                            179 ns          179 ns      3407872
BM_Sort_string_Random_1                                                      4.05 ns         4.05 ns    173277184
BM_Sort_string_Random_4                                                      17.9 ns         17.9 ns     38010880
BM_Sort_string_Random_16                                                     50.4 ns         50.3 ns     13369344
BM_Sort_string_Random_64                                                     73.3 ns         73.3 ns      9437184
BM_Sort_string_Random_256                                                    94.6 ns         94.4 ns      7077888
BM_Sort_string_Random_1024                                                    115 ns          115 ns      6029312
BM_Sort_string_Random_16384                                                   177 ns          177 ns      3932160
BM_Sort_string_Random_262144                                                  279 ns          279 ns      2097152

The bitset trick can be applied to Pdqsort's partition function just like this code does for the existing std::sort's partition function. I have implemented Bitset trick on top of pdqsort as well but chose the current implementation over pdqsort-based Bitset sort. The current implementation introduces the minimal change to std::sort. It replaces the partition function but keeps most of the outer layer of std::sort.

The reason for keeping the outer layer of std::sort is that it will keep most of performance characteristics unchanged and adds a little overhead when there is a branch in the comparison function.

That's surely a fair concern.

@Morwenn mentioned the O(n ^ 2) case for quicksort which also happens for the existing std::sort. As you might know, a simple change can avoid O(n ^ 2). Introsort avoids O(n ^ 2) by calling heap sort when a quicksort depth goes above O(lg n). There has been efforts to introduce introsort into libc++ (Kumar's introsort). For unknown reason to me, it has not been submitted. Pdqsort avoids O(n ^ 2) by calling heap sort when partitions are highly unbalanced.

Yeah, I'm still rather surprised that the libc++ implementation of std::sort isn't some flavour of introsort yet. It's not excessively hard to implement, and it would make the algorithm standard-compliant. I know that a few years ago they wanted proper sorting benchmarks before giving the "go" to a new sort implementation, but it seems like something that could have been fixed regardless.

I understand that Pdqsort is faster in many known patterns over std::sort. If we are not afraid of changing the entire sort implementation, I can certainly bring Bitset-on-pdqsort as another code review. @Morwenn and @ldionne, could you give your thoughts on the future direction?

@Morwenn Bitset sort is faster for std::strings than pdqsort which turns off Block sort technique for non-arithmetic types. See the following comparison. Block sort technique can be faster even with comparison functions with branches.

Oh, I thought that deactivating the branchless partitioning algorithm was done because it introduced regressions for non-arithmetic types. I guess I blindly believed that it had been compared against std::string of all things to make this decision, but never ran benchmarks to back that claim. That's new info to me.

As for the future direction I do believe that pdqsort has some interesting ideas that would be nice to have, but I'm not part of the libc++ project in the first place, so it's not really my call to make. I'm glad to see that things are moving in this area nonetheless, and bitset_sort would be a step in the right direction compared to the status quo anyway :)

I worked on performance improvements and wrote a documentation what bitset sort is and why it is faster. Thanks to LLVM's ability to generate SIMD instructions, bitset sort lets LLVM to generate SIMD instructions for the critical path (bitset partition).

The documentation and the newer implementation is here - https://github.com/minjaehwang/bitsetsort. This implementation is faster than pdqsort for all randomized inputs. For example, in the case of uint64 256k set, bitset sort only takes 15ns per element while pdqsort takes 24ns per element.

Bitset sort shows a little regression on patterns such as ascending or descending. This is something I can work on but I do believe that speed-ups in randomized inputs triumphs the regression in these patterns because these are so small in difference.

minjaehwang edited the summary of this revision. (Show Details)Jun 16 2021, 11:37 PM

Also, are you aware of Timsort? IIRC there had been an analysis conducted that concluded that we should be using Timsort (as it was faster than our sort and respected the complexity requirements of the Standard).

Swift went for the standard library from introsort to timsort: https://github.com/apple/swift/pull/19717

Improved randomized sort performance for all types.
Added introsort feature.

minjaehwang edited the summary of this revision. (Show Details)Jun 17 2021, 12:49 AM
ldionne commandeered this revision.Oct 6 2021, 2:19 PM
ldionne added a reviewer: minjaehwang.

I stumbled upon this again while looking at the review queue and realized I had never followed up. Please rest assured that the lack of response here is only a measure of how backed up we are with reviews and not how interested we are with this change. I also want to give special thanks to @Morwenn for responding to my request to review this and for the detailed explanation and interesting discussion that followed. For a non-sort-expert like me, that is essential in shedding light on the whole thing.

So my take away from the discussion is that Bitsetsort and pdqsort are comparable in efficiency, however Bitsetsort still suffers from O(n^2) worst case. Is that correct? If so, is it possible to modify Bitsetsort to avoid this O(n^2) worst case? That's a pretty bad offender in our conformance right now.

I'm going to commandeer this so I can rebase it onto main and apply a few changes necessary for libc++. I'm sorry this slept for so long. In particular, this patch as-is breaks the ABI because we stop exporting the internal __sort_whatever symbols from the dylib. We can't do that -- we're stuck with these symbols forever unfortunately. However, nothing prevents us from introducing new ones (even though for the time being I would try to avoid baking parts of the sort implementation in the dylib at all).

ldionne updated this revision to Diff 377704.Oct 6 2021, 3:08 PM
  • Rebase onto main
  • Keep the old std::sort implementation around for ABI compatibility
  • Do not explicitly instantiate bitset sort in the library -- that ties our hands too much

Still left to do:

  • Some tests are failing, they need to be investigated
  • Do basic code size testing since we're not instantiating the sorts in the dylib anymore -- does it cause the size of user programs to blow up badly?
  • Make sure I applied the diff correctly - I had to do it manually but the diff was generated weirdly, so I'm not sure the diff I applied is correct.
  • We need to decide on the visibility of implementation detail functions. We probably want to sprinkle _LIBCPP_HIDE_FROM_ABI all around.

Ideally @minjaehwang if you could commandeer this again and make those fixes, that would be awesome. From here on, it should be fairly straightforward to make changes until we can ship this, since I should have handled most of the annoying libc++ specific things.

  • Rebase onto main
  • Keep the old std::sort implementation around for ABI compatibility
  • Do not explicitly instantiate bitset sort in the library -- that ties our hands too much

Still left to do:

  • Some tests are failing, they need to be investigated
  • Do basic code size testing since we're not instantiating the sorts in the dylib anymore -- does it cause the size of user programs to blow up badly?
  • Make sure I applied the diff correctly - I had to do it manually but the diff was generated weirdly, so I'm not sure the diff I applied is correct.
  • We need to decide on the visibility of implementation detail functions. We probably want to sprinkle _LIBCPP_HIDE_FROM_ABI all around.

Ideally @minjaehwang if you could commandeer this again and make those fixes, that would be awesome. From here on, it should be fairly straightforward to make changes until we can ship this, since I should have handled most of the annoying libc++ specific things.

ldionne@, I'll work on this on MinJae's behalf. Would post a new diff sometime soon.

  • Rebase onto main
  • Keep the old std::sort implementation around for ABI compatibility
  • Do not explicitly instantiate bitset sort in the library -- that ties our hands too much

Still left to do:

  • Some tests are failing, they need to be investigated
  • Do basic code size testing since we're not instantiating the sorts in the dylib anymore -- does it cause the size of user programs to blow up badly?
  • Make sure I applied the diff correctly - I had to do it manually but the diff was generated weirdly, so I'm not sure the diff I applied is correct.
  • We need to decide on the visibility of implementation detail functions. We probably want to sprinkle _LIBCPP_HIDE_FROM_ABI all around.

Ideally @minjaehwang if you could commandeer this again and make those fixes, that would be awesome. From here on, it should be fairly straightforward to make changes until we can ship this, since I should have handled most of the annoying libc++ specific things.

ldionne@, I'll work on this on MinJae's behalf. Would post a new diff sometime soon.

Awesome, thanks! I saw your post on libcxx-dev just now. To reproduce CI failures locally, you can use libcxx/utils/ci/run-buildbot-container to spawn a Docker container matching our CI nodes, and then libcxx/utils/ci/run-buildbot <config you want> to run the actual build/test. There's documentation in run-buildbot-container and also the Dockerfile in the same directory.

nilayvaish commandeered this revision.Oct 12 2021, 5:24 PM
nilayvaish added a reviewer: ldionne.

Questions for ldionne@

  • How do we test code size?
  • Do we need to hide member functions in classes from the ABI?

It seems that we can use _LIBCPP_HIDE_FROM_ABI with __bitsetsort_loop. Causes the compiler to report errors like the following:

589 | bitsetsort_loop(_RandomAccessIterator first, _RandomAccessIterator last, _Compare comp,

| ^~~~~~~~~~~~~~~~~

/home/libcxx-builder/.buildkite-agent/builds/75f0779fb064-1/llvm-project/libcxx-ci/build/generic-gcc/include/c++/v1/__algorithm/sort.h:642:34: note: called from here

642 |       __bitsetsort_loop<_Compare>(__first, __ret.first, __comp, __buff, __limit);
    |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

/home/libcxx-builder/.buildkite-agent/builds/75f0779fb064-1/llvm-project/libcxx-ci/build/generic-gcc/include/c++/v1/__algorithm/sort.h:589:1: error: inlining failed in call to 'always_inline' 'void std::1::bitsetsort::bitsetsort_loop(_RandomAccessIterator, _RandomAccessIterator, _Compare, typename std::1::iterator_traits<_RandomAccessIterator1>::value_type*, typename std::1::iterator_traits<_RandomAccessIterator1>::difference_type) [with _Compare = std::1::__less<double, double>; _RandomAccessIterator = double*]': recursive inlining

589 | __bitsetsort_loop(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp,
    | ^~~~~~~~~~~~~~~~~

/home/libcxx-builder/.buildkite-agent/builds/75f0779fb064-1/llvm-project/libcxx-ci/build/generic-gcc/include/c++/v1/__algorithm/sort.h:645:34: note: called from here

645 |       __bitsetsort_loop<_Compare>(__ret.first + 1, __last, __comp, __buff, __limit);
    |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nilayvaish added inline comments.Oct 13 2021, 11:34 AM
libcxx/src/legacy-sort.cpp
339–393

ldionne@, I am wondering if these symbols need to be in this file. Can we continue with the setup from before i.e. these symbols have extern declarations in sort.h and are defined in algorithm.cpp? Further is there a need for retaining the existing sorting algorithm? It seems to me we need to retain __insertion_sort_incomplete only. What do you think?

ldionne requested changes to this revision.Oct 27 2021, 2:56 PM

Sorry for the delay, see my answers below.

Questions for ldionne@

  • How do we test code size?

I think we have benchmarks for std::sort -- one thing we could do is look at the size of that executable before and after the change.

Can you also provide the runtime output of these benchmarks before and after the change?

  • Do we need to hide member functions in classes from the ABI?

Yes, but I think you have figured that out already.

libcxx/include/__algorithm/sort.h
34

Why is this called __sorting_network? Sounds like a weird name.

libcxx/src/legacy-sort.cpp
339–393

I think we do need to retain all those functions since they were previously exported from the shared library. Removing these functions would be an ABI break.

I've put it in legacy-sort.cpp because we don't ever want to deal with these functions again anymore -- they are only there for ABI compatibility purpose. I thought it was better to separate them into their own little file than keeping them around in the headers that we actually use. If you have a strong reason to keep them around in sort.h, let me know and we can discuss.

340

This should be guarded by _LIBCPP_HAS_NO_WIDE_CHARACTERS. I think this must have happened when you (or I) rebased the patch.

363–364

Same.

This revision now requires changes to proceed.Oct 27 2021, 2:56 PM

Hi ldionne@, I have question for you. Do the tests we have for libc++ ensure that the new implementation works well across all versions of the C++ standard? I want to make sure that I am not going to unintentionally break someone's standard-compliant code out there.

Fixed some compilation issues and reduced the number of changes in ordering of equivalent elements

nilayvaish marked 2 inline comments as done.Nov 8 2021, 4:02 PM

ldionne@, I am going to break this change into few parts and provide the benchmark results for each of the part separately. To begin with, I have posted: https://reviews.llvm.org/D113413 that makes the introsort change to the std::sort implementation.

libcxx/include/__algorithm/sort.h
34

I think the name has been prevalent in the theory community for a long time. More info here: https://en.wikipedia.org/wiki/Sorting_network. The code in functions __sort[3...8] mimics what the sorting networks for those lengths would look like. The name is strange from namespace perspective. Willing to name it something else that you prefer.

nilayvaish added inline comments.Nov 8 2021, 4:04 PM
libcxx/src/legacy-sort.cpp
339–393

I do not have a preference for where these symbols are kept. Is there a way to avoid having sort implementation in legacy-sort.cpp file?

Hi ldionne@, I have question for you. Do the tests we have for libc++ ensure that the new implementation works well across all versions of the C++ standard? I want to make sure that I am not going to unintentionally break someone's standard-compliant code out there.

They should, however you're welcome to look at them - our tests are sometimes not up to our quality standards, so you should make sure they will catch all issues you can think about.

We do run those tests in C++03, C++11, C++14, C++17, C++20 and C++23 modes, if that was the sole intent of your question.

ldionne@, I am going to break this change into few parts and provide the benchmark results for each of the part separately. To begin with, I have posted: https://reviews.llvm.org/D113413 that makes the introsort change to the std::sort implementation.

That's fine by me if you prefer doing it that way, however at the moment I am failing to understand where https://reviews.llvm.org/D113413 fits into the picture. Can you explain what's the plan? Basically introduce introsort in D113413 and then replace it by this bitset sort? Sorry if that's a stupid question.

libcxx/include/__algorithm/sort.h
34

Oh, that's all good. If there's prior art with this name, ignore my comment -- I was not aware of it.

Hi ldionne@, I have question for you. Do the tests we have for libc++ ensure that the new implementation works well across all versions of the C++ standard? I want to make sure that I am not going to unintentionally break someone's standard-compliant code out there.

They should, however you're welcome to look at them - our tests are sometimes not up to our quality standards, so you should make sure they will catch all issues you can think about.

We do run those tests in C++03, C++11, C++14, C++17, C++20 and C++23 modes, if that was the sole intent of your question.

I think the current tests do not cover some of the ways in which folks use sort. I did some internal testing and that revealed few compilation issues with one of the previous versions of the diff. I'll try to reproduce those issues and add those to tests.

ldionne@, I am going to break this change into few parts and provide the benchmark results for each of the part separately. To begin with, I have posted: https://reviews.llvm.org/D113413 that makes the introsort change to the std::sort implementation.

That's fine by me if you prefer doing it that way, however at the moment I am failing to understand where https://reviews.llvm.org/D113413 fits into the picture. Can you explain what's the plan? Basically introduce introsort in D113413 and then replace it by this bitset sort? Sorry if that's a stupid question.

D93233 has three different changes in it which I would like to break into parts. So the plan is to do the following:

  • D113413: introduce introsort (improves worst case from O(n^2) to O(n log n).
  • D93233 or another commit: introduce bitset sort (improves the core of the quicksort implementation to avoid branch mispredictions)
  • D93233 or another commit: improve small sized sort (improves small sized sorting so that it is possibly vectorized by the compiler and the number of comparisons in use is optimal).

Gentle ping here -- we have now fixed the O(N^2) behavior with introsort, but do we still want to make these improvements? (I suspect we do)

nilayvaish abandoned this revision.Nov 11 2022, 4:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 11 2022, 4:37 PM
libcxx/src/CMakeLists.txt