There are two commits in this request. The first one adds a benchmark that tests std::sort on an adversarial inputs. The second commit adds intro sort to the std::sort. Inputs where partitions are unbalanced even after 2 log(n) pivots have been selected, the algorithm switches to heap sort to avoid the possibility of spending O(n^2) time on sorting the input. Benchmark results show that the intro sort implementation does significantly better. Benchmarking results before this change. Time represents the sorting time required per element. ---------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------------------------------------- BM_Sort_uint32_QuickSortAdversary_1 3.75 ns 3.74 ns 187432960 BM_Sort_uint32_QuickSortAdversary_4 3.05 ns 3.05 ns 231211008 BM_Sort_uint32_QuickSortAdversary_16 2.45 ns 2.45 ns 288096256 BM_Sort_uint32_QuickSortAdversary_64 32.8 ns 32.8 ns 21495808 BM_Sort_uint32_QuickSortAdversary_256 132 ns 132 ns 5505024 BM_Sort_uint32_QuickSortAdversary_1024 498 ns 497 ns 1572864 BM_Sort_uint32_QuickSortAdversary_16384 3846 ns 3845 ns 262144 BM_Sort_uint32_QuickSortAdversary_262144 61431 ns 61400 ns 262144 BM_Sort_uint64_QuickSortAdversary_1 3.93 ns 3.92 ns 181141504 BM_Sort_uint64_QuickSortAdversary_4 3.10 ns 3.09 ns 222560256 BM_Sort_uint64_QuickSortAdversary_16 2.50 ns 2.50 ns 283639808 BM_Sort_uint64_QuickSortAdversary_64 33.2 ns 33.2 ns 21757952 BM_Sort_uint64_QuickSortAdversary_256 132 ns 132 ns 5505024 BM_Sort_uint64_QuickSortAdversary_1024 478 ns 477 ns 1572864 BM_Sort_uint64_QuickSortAdversary_16384 3932 ns 3930 ns 262144 BM_Sort_uint64_QuickSortAdversary_262144 61646 ns 61615 ns 262144 Benchmarking results after this change ---------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------------------------------------------------------- BM_Sort_uint32_QuickSortAdversary_1 6.31 ns 6.30 ns 107741184 BM_Sort_uint32_QuickSortAdversary_4 4.51 ns 4.50 ns 158859264 BM_Sort_uint32_QuickSortAdversary_16 3.00 ns 3.00 ns 223608832 BM_Sort_uint32_QuickSortAdversary_64 44.8 ns 44.8 ns 15990784 BM_Sort_uint32_QuickSortAdversary_256 69.0 ns 68.9 ns 9961472 BM_Sort_uint32_QuickSortAdversary_1024 118 ns 118 ns 6029312 BM_Sort_uint32_QuickSortAdversary_16384 175 ns 175 ns 4194304 BM_Sort_uint32_QuickSortAdversary_262144 210 ns 210 ns 3407872 BM_Sort_uint64_QuickSortAdversary_1 6.75 ns 6.73 ns 103809024 BM_Sort_uint64_QuickSortAdversary_4 4.53 ns 4.53 ns 160432128 BM_Sort_uint64_QuickSortAdversary_16 2.98 ns 2.97 ns 234356736 BM_Sort_uint64_QuickSortAdversary_64 44.3 ns 44.3 ns 15990784 BM_Sort_uint64_QuickSortAdversary_256 69.2 ns 69.2 ns 10223616 BM_Sort_uint64_QuickSortAdversary_1024 119 ns 119 ns 6029312 BM_Sort_uint64_QuickSortAdversary_16384 173 ns 173 ns 4194304 BM_Sort_uint64_QuickSortAdversary_262144 212 ns 212 ns 3407872
Details
- Reviewers
jdoerfert ldionne - Group Reviewers
Restricted Project - Commits
- rG7f287390d78d: [libc++] Add introsort to avoid O(n^2) behavior
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
libcxx/include/__algorithm/sort.h | ||
---|---|---|
264 | It seems to me like you reformatted a lot of this code, making the change itself a lot more difficult to understand. Please do the re-formatting either before or after your functional change, in a separate "no functionality change" patch. |
Remove formatting diffs from the diff. It seems Phabricator is unable to show the diffs between different commits.
Move the depth check after the initial switch so that small sized sorts are not affected by the depth check.
libcxx/include/__algorithm/sort.h | ||
---|---|---|
264 | I have dropped the formatting changes. Those were done by arc and I just accepted them. |
Do you need someone to commit this for you? If so, please provide Author Name <email@domain> for attribution. Thanks!
libcxx/include/__algorithm/sort.h | ||
---|---|---|
264–265 |
Yes, if you can commit this for me that would be great. You can use 'Nilay Vaish <nilayvaish@google.com>' for attribution. Thanks for all the help!
We are seeing build breakages on Fuchsia builders after this change. It complains definition of implicit copy constructor is deprecated because it has a user-provided copy assignment operator. Example of the failure can be found at https://ci.chromium.org/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.arm64-release/b8830264610876104385/overview . It is confirmed through bisection. Code that failed to build can be found at https://fuchsia.googlesource.com/third_party/flatbuffers/+/f22618113aea6d04ed10bcaf28cc3621eea146d2/include/flatbuffers/flatbuffers.h#1154
Would you mind taking a look? We can suppress the error on our end but we still haven't figure out why this change could trigger this error(warning).
libcxx/include/__algorithm/sort.h | ||
---|---|---|
468 | This line needs to use _VSTD::__introsort (ADL-proofing — hmm, how did this pass robust_against_adl.pass.cpp? I should investigate). I will investigate adding a new test along the lines of algorithms/robust_*.cpp to verify that we never accidentally copy comparators or predicates. |
libcxx/include/__algorithm/sort.h | ||
---|---|---|
468 | Noted. Will send a diff later today. |
We see that this patch is causing the size of the Chrome Android APK to increase by ~120 KB. This is confirmed via bisection. A breakdown of the symbols contributing to the size increase can be found at https://chrome-supersize.firebaseapp.com/viewer.html?load_url=https%3A%2F%2Fstorage.googleapis.com%2Fchromium-binary-size-trybot-results%2Fandroid-binary-size%2F2022%2F05%2F10%2F1112911%2Fsupersize_diff.sizediff&group_by=template .
At this point in time, I'm not exactly sure why this patch would increase the size by such a large amount. Does anyone have any ideas?
@From the symbol diffs on e.g. ng_grid_layout_algorithm.cc, it looks like we have lots of negative deltas for sort and larger positive deltas for introsort, which suggests that the difference is the cumulative effect of inlining many instances of __introsort with its slightly longer (len > 5) case in the new code.
Perhaps this code is being built with excessive inlining and needs -Os or similar? Perhaps Clang itself needs some inliner heuristic tuning? Perhaps this is an unavoidable size hit unless you want to take a speed hit?
For documentation purposes:
I chatted offline with pkasting@ and we found out that adding __attribute__((noinline)) to __introsort actually made the binary size even worse - it added an additional ~40KB.
re: binary size - std::sort-related symbols make up more than 1% of Chrome's binary size (for arm32 Android at least). Any work to reduce the size overhead of std::sort() would be very welcome. From what I can tell, the main contributing factor is the templated nature of the function. It gets stamped out a *lot* of times, and in a way that identical-code-folding is often not applicable.
glad to see this land. we had a very similar patch almost 5 years ago but got lost in the review process: https://reviews.llvm.org/D36423
We enable identical code folding, so I don't see how that would help.
If we could identify at compile time when sort is being called with aggregates / PODs and route all such calls through qsort(), I think that might help.
I'm working on upgrading Meta's Android apps from libc++ 13 to libc++ 15, and seeing large size regressions caused by this change (similar to the Chrome case above). Was there any progress on figuring out how to offset the binary size increase?
(I recognize it's pretty annoying on my part to be raising the issue 9 months after the code was committed, and I apologize for that. We're working on being able to catch these issues earlier in the future.)
It seems to me like you reformatted a lot of this code, making the change itself a lot more difficult to understand. Please do the re-formatting either before or after your functional change, in a separate "no functionality change" patch.