This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Implement `lexicographical_compare_three_way`

Authored by avogelsgesang on Aug 8 2022, 6:17 AM.



The implementation makes use of the freedom added by LWG 3410. We have
two variants of this algorithm:

  • a fast path for random access iterators: This fast path computes the maximum number of loop iterations up-front and does not compare the iterators against their limits on every loop iteration.
  • A basic implementation for all other iterators: This implementation compares the iterators against their limits in every loop iteration. However, it still takes advantage of the freedom added by LWG 3410 to avoid unnecessary additional iterator comparisons, as originally specified by P1614R2. shows the benefit of the fast path:
The hot loop generated of lexicographical_compare_three_way3 is
more tight than for lexicographical_compare_three_way1. The added
benchmark illustrates how this leads to a 30% performance improvement on
integer vectors.

Implements part of P1614R2 "The Mothership has Landed"

Fixes LWG 3410 and LWG 3350

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

fix the easy comments by @Mordante. Leaving the more involved ones for another day


I just realized that there is already a pattern how to deal with overly long names, see uniform_random_bit_generator_adaptor. Did the same here now...


any particular reason? Would it be fine if I used a function-local using std::array?


do we already have such a test utility somewhere? I couldn't find anything useful in test_iterators.h

253 ↗(On Diff #454246)

because for strong_ordering you can simply use a plain int. Or should I still add it for completeness?

mumbleskates added inline comments.Aug 21 2022, 12:52 PM

The point is that, while we are still relying on the exact same semantic under the hood, we are making an effort to spell out what we expect to happen, making it explicit that we thought of this case and fully intend this outcome. We can also ensure that future readers might also understand this happens when they might not have thought of it, whether in understanding what our code is doing or in making their own implementation some time in the future.

I like @Mordante's most recent version with the using declarations.

253 ↗(On Diff #454246)

honestly you could probably just write using StrongComp = int;.

in the tests i've written so far i have used integral types for strong, floating points for partial, and only used structs for weak orderings.

to that end, it would be useful if PartialComp had an avenue to actually return partial_ordering::unordered. you could keep its member typed as an int and use INT_MIN as a sentinel for the unordered value, which could even allow us to test heterogenous orderable/unorderable values constexpr in gcc (which currently(?) does not allow comparing infinities and NaNs against different values in constant evaluation).

For additional completeness here we would add a UserComp struct whose operator<=> returns a UserOrdering typed value that implements the appropriate operators against literal zero; such types are useful for SFINAE testing and types that utilize synth-three-way.

huixie90 accepted this revision as: huixie90.Aug 21 2022, 1:03 PM

LGTM with nits. I will leave the final approval to others who also commented

35 ↗(On Diff #454330)

nit: you can use random_access_iterator<int*> from the test_iterators.h so that both tests are using some iterator wrappers (to be more fair). It also saves you to explain that int* is a random access iterator


in other tests, we usually do

template <class It1, class Int2>
void test();

template <class It2>
void testAllIterator1(){
  test<InputIteartor, It2>();
  test<ForwardIteartor, It2>();
  // ...
  test<ContiguousIteartor, It2>();

void testAllIt1It2(){
  // ...

This saves you manually writing down the all the cartesian products.

Some people also argue that we only need to test the weakest iterator and the strongest iterator to save the combinatorial code bloat. It also has a point but I am fine with testing all of the combinations.

avogelsgesang marked 4 inline comments as done.

address a couple more comments


Can you test whether that's true here too, by using integral instead of is_integral_v?

Given that we implemented the integral as

concept integral = is_integral_v<_Tp>;

using a concept here doesn't improve the error message. It only adds a more complicated backtrace to the error message which does not really provide much value.


[...] maybe we want to add the exposition only concepts for Cpp17Iterators from the standard somewhere and static_assert that here?

You mean

static_assert(__cpp17_iterator<_InputIterator1>, "calling lexicographical_compare_three_way with a non-standard-compliant iterators is undefined behavior");
static_assert(__cpp17_iterator<_InputIterator2>, "calling lexicographical_compare_three_way with a non-standard-compliant iterators is undefined behavior");



I personally don't have an opinion here. Please come to an agreement here (or agree to disagree, and nevertheless agree with merging this code).

I am happy to update this patch in whichever way you agree on, but I want to avoid changing it forth and back (an earlier version of this review was actually written in terms of the difference type and changed based on feedback in this review)


not sure how much that would actually give us. Instead of

test_lexicographical_compare<Iter1, Iter2>(array{0, 1}, array{0, 2}, std::strong_ordering::less);

I could write

test_lexicographical_compare<Iter1, Iter2>(array{0, 1}, array{0, 2});

but I would still have to enumerate all the different input arrays, iterator types etc. Given this would only save few lines, I prefer to keep the test cases as they currently are.

Added a benchmark

Based on the below results, we can see that the fast path for random access iterators gives us around 30% peformance compared to the basic implementation

Run on (40 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 14080 KiB (x2)
Load Average: 8.64, 6.10, 3.51
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Benchmark                                                                         Time             CPU   Iterations
BM_lexicographical_compare_three_way<int*>/1                                   1.38 ns         1.38 ns    507820129
BM_lexicographical_compare_three_way<int*>/2                                   2.06 ns         2.06 ns    339714887
BM_lexicographical_compare_three_way<int*>/4                                   3.44 ns         3.44 ns    203747061
BM_lexicographical_compare_three_way<int*>/8                                   6.20 ns         6.20 ns    111056521
BM_lexicographical_compare_three_way<int*>/16                                  12.0 ns         12.0 ns     58183199
BM_lexicographical_compare_three_way<int*>/32                                  22.9 ns         22.9 ns     30549045
BM_lexicographical_compare_three_way<int*>/64                                  44.8 ns         44.8 ns     15615457
BM_lexicographical_compare_three_way<int*>/128                                 99.0 ns         99.0 ns      7056236
BM_lexicographical_compare_three_way<int*>/256                                  187 ns          187 ns      3743238
BM_lexicographical_compare_three_way<int*>/512                                  363 ns          363 ns      1927025
BM_lexicographical_compare_three_way<int*>/1024                                 715 ns          715 ns       969796
BM_lexicographical_compare_three_way<int*>/2048                                1418 ns         1418 ns       493273
BM_lexicographical_compare_three_way<int*>/4096                                2838 ns         2838 ns       246123
BM_lexicographical_compare_three_way<int*>/8192                                5647 ns         5647 ns       123272
BM_lexicographical_compare_three_way<int*>/16384                              11318 ns        11318 ns        61772
BM_lexicographical_compare_three_way<int*>/32768                              22565 ns        22563 ns        31000
BM_lexicographical_compare_three_way<int*>/65536                              45231 ns        45229 ns        15513
BM_lexicographical_compare_three_way<int*>/131072                             91047 ns        91042 ns         7678
BM_lexicographical_compare_three_way<int*>/262144                            181619 ns       181601 ns         3867
BM_lexicographical_compare_three_way<int*>/524288                            361487 ns       361452 ns         1936
BM_lexicographical_compare_three_way<int*>/1048576                           736236 ns       736225 ns          953
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1             1.03 ns         1.03 ns    675808965
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/2             2.41 ns         2.41 ns    290934998
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4             4.48 ns         4.48 ns    156648871
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/8             8.60 ns         8.60 ns     81462544
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16            16.9 ns         16.9 ns     41431608
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/32            33.3 ns         33.3 ns     20989390
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/64            66.3 ns         66.3 ns     10542387
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/128            142 ns          142 ns      4949055
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/256            274 ns          274 ns      2555910
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/512            537 ns          537 ns      1296173
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1024          1066 ns         1066 ns       656525
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/2048          2121 ns         2121 ns       329789
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4096          4237 ns         4237 ns       165380
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/8192          8452 ns         8452 ns        82559
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16384        16897 ns        16897 ns        41435
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/32768        33800 ns        33798 ns        20716
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/65536        67609 ns        67608 ns        10340
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/131072      135777 ns       135770 ns         5156
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/262144      270691 ns       270686 ns         2588
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/524288      542588 ns       542550 ns         1290
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1048576    1088975 ns      1088880 ns          642

Can you please redo the benchmark in a different way by doing some local (not to be committed) modifications? Rather than comparing the results for int* versus cpp17_input_iterator<int*>, remove the input iterator tests completely, run the int* benchmark against the current implementation, then remove the optimization and rerun the benchmark. That way, it would be a proper "apples-to-apples" comparison: int* vs. int* with or without the optimization. (I don't expect results to change much, but better be sure)


Is this header still used?


Optional: consider refactoring both branches into helper functions.


Nit: I think this would read a little better with a verb, e.g. Using a non-integral difference_type is undefined behavior.

Also, this seems a little overkill -- I'm pretty sure there are many places in algorithms where we subtract random access iterators and expect to get an integral type, without checking. I don't object to the static_asserts, but the comment (starting from We rely on the fact...) seems a little unnecessary to me (the part about undefined behavior is already captured in the static assertion).


Nit: some empty lines will help separate this algorithm into logical blocks and make it more readable. I'd suggest adding blank lines before and after the for loop and before the else branch.


Nit: increment __i in the iteration expression? Otherwise, it seems more like a while loop.

var-const requested changes to this revision.Aug 25 2022, 1:48 PM
This revision now requires changes to proceed.Aug 25 2022, 1:48 PM
avogelsgesang marked 6 inline comments as done.

address more comments, but now gcc11 is crashing.
updating so that CI can hopefully give me some insight into whether it's fixed in gcc12.

avogelsgesang added inline comments.Aug 26 2022, 2:19 PM
253 ↗(On Diff #454246)

complete the set with std::strong_ordering


PartialComp had an avenue to actually return partial_ordering::unordered


UserComp struct whose operator<=> returns a UserOrdering

Added. But now gcc-11 crashes

var-const added inline comments.Aug 26 2022, 6:05 PM

As noted in another comment, can you please rerun the benchmark so that it compares int* with optimizations vs. int* without optimizations?

avogelsgesang added inline comments.Aug 28 2022, 2:00 PM

yes, rerunning the benchmark is definitely still on my todo list. It takes longer than expected, because it seems that one of the refactorings during this review destroyed the optimization. I can still reproduce the numbers on the old commit, but on the current review the fast path is no longer faster than the default path. I will need some time to figure out what exactly lead to the regression here...

Mordante added inline comments.Aug 29 2022, 9:32 AM

It helps since the rules of the type of the conditional expression are not simple. I had to verify with the standard it does the right thing. So instead of spending a few seconds to validate these 3 lines I had to spend several minutes.

I don't see how common_type helps with the type of the conditional not being simple. That's exactly what common_type uses to get the type in this case AFAICT ( Plus is has a heap of other conditionals that are hard to get through.

It's at least clear that it's intended to use common_type and not one of the other rules of the conditional expression. IMO auto should never be used with the conditional expression.

Code should be optimized for understanding by humans, auto quite often saves the writer from typing a few characters. (The compiler doesn't care either way it does its thing.)

I agree that code should be written to make it easier to read. IMO littering the code with types I don't really care about makes it harder to read. i.e. I don't really care that a - b returns a value of type __iter_diff_t<_InputIterator1>, but now I have to check that you actually named the correct type. Thinking about it, integral auto __len1 = __last1 - __first1 would be great. Not sure how much compile time overhead that would incur though. WDYT?

I care about these types when I try to understand the code and validate whether the author wrote the code correctly.
It takes me as reader a lot longer to validate the code. Auto has its uses but it shouldn't be used everywhere just to make it easy for the writer. I still like the explicit type better either directly or by using a typedef.
For the operator- I don't dislike this too strongly; but as said above for the conditional expression I do.

The verbose code helps to communicate what the author of the code intended to happen. [Relying] on some (not always well understood) language rules means it's less clear for the reader to understand what the writer intended. Both may have a different understanding of these rules.

But you are still relying on these rules through common_type AFAICT.

Yes but as said above it makes it clear that common_type is intended to be used. (I agree common_type has it's complexity too.)


To make it clear which array is used. Sometimes we use similar names as standard names. In general we don't use using globally, sometimes in a function. There usually the nested namespace being tested like std::chrono or the literals namespaces.


I think we don't. almost_satisfies_types.h seems to be the better place for such an iterator.

253 ↗(On Diff #454246)

GCC-12 too? We don't officially support GCC-11 anymore.

@avogelsgesang Do you need anything from us to make progress on this patch? I'd like to make sure you're unblocked.

avogelsgesang planned changes to this revision.Nov 4 2022, 4:39 AM

@avogelsgesang Do you need anything from us to make progress on this patch? I'd like to make sure you're unblocked.

Thanks for checking in! No, I currently don't need any support. I was on PTO for the last couple of weeks, I am currently ramping back up, and hope to find some time to work on this next week.

The main work item currently is: I am a bit confused by my own benchmark, and I have a suspicion that clang had a regression in the meantime so my benchmark numbers look different now. I want to figure out why my benchmark changed, but in the end I will likely just remove the optimization and the benchmark, and open a request on clang/LLVM to improve their optimizations

avogelsgesang marked 3 inline comments as done.EditedNov 19 2022, 5:30 PM

@var-const As requested, I updated the benchmark to benchmark int* on both the fast and the slow path.

Previously, you proposed

Can you please redo the benchmark in a different way by doing some local (not to be committed) modifications? Rather than comparing the results for int* versus cpp17_input_iterator<int*>, remove the input iterator tests completely, run the int* benchmark against the current implementation, then remove the optimization and rerun the benchmark. That way, it would be a proper "apples-to-apples" comparison: int* vs. int* with or without the optimization. (I don't expect results to change much, but better be sure)

Instead, I now factored out the slow and fast path into two separate functions (as proposed in another comment) and call both of them directly from the benchmark. That way, we can also rerun the benchmark in the future, and e.g. remove the optimization from libc++ again, as soon as clang's optimizations are smart enough to figure this out by themselves. Do you agree with the updated benchmark?

The results on my local machine (compiled using clang commit aae08b1d372a45b8bef95e86e5fe9110045eb78d):

Running ./libcxx/benchmarks/lexicographical_compare_three_way.libcxx.out
Run on (40 X 800.102 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 14080 KiB (x2)
Load Average: 0.00, 0.48, 0.67
Benchmark                                                                           Time             CPU   Iterations
BM_lexicographical_compare_three_way_slow_path/1                                 1.67 ns         1.67 ns    415947081
BM_lexicographical_compare_three_way_slow_path/4                                 4.68 ns         4.68 ns    149519461
BM_lexicographical_compare_three_way_slow_path/16                                16.7 ns         16.7 ns     41824440
BM_lexicographical_compare_three_way_slow_path/64                                64.9 ns         64.9 ns     10711142
BM_lexicographical_compare_three_way_slow_path/256                                266 ns          266 ns      2630928
BM_lexicographical_compare_three_way_slow_path/1024                              1036 ns         1036 ns       676167
BM_lexicographical_compare_three_way_slow_path/4096                              4122 ns         4121 ns       169817
BM_lexicographical_compare_three_way_slow_path/16384                            16453 ns        16453 ns        42559
BM_lexicographical_compare_three_way_slow_path/65536                            65827 ns        65826 ns        10619
BM_lexicographical_compare_three_way_slow_path/262144                          263147 ns       263140 ns         2658
BM_lexicographical_compare_three_way_slow_path/1048576                        1060909 ns      1060813 ns          659
BM_lexicographical_compare_three_way_fast_path/1                                 2.34 ns         2.34 ns    298968436
BM_lexicographical_compare_three_way_fast_path/4                                 4.76 ns         4.76 ns    148425440
BM_lexicographical_compare_three_way_fast_path/16                                10.1 ns         10.1 ns     68735401
BM_lexicographical_compare_three_way_fast_path/64                                28.8 ns         28.8 ns     24262243
BM_lexicographical_compare_three_way_fast_path/256                                123 ns          123 ns      5685144
BM_lexicographical_compare_three_way_fast_path/1024                               444 ns          444 ns      1575722
BM_lexicographical_compare_three_way_fast_path/4096                              1749 ns         1749 ns       400100
BM_lexicographical_compare_three_way_fast_path/16384                             6898 ns         6897 ns       101189
BM_lexicographical_compare_three_way_fast_path/65536                            28537 ns        28536 ns        24520
BM_lexicographical_compare_three_way_fast_path/262144                          125793 ns       125791 ns         5503
BM_lexicographical_compare_three_way_fast_path/1048576                         506020 ns       505976 ns         1379
BM_lexicographical_compare_three_way<int*>/1                                     1.67 ns         1.67 ns    418741799
BM_lexicographical_compare_three_way<int*>/4                                     3.68 ns         3.68 ns    190328759
BM_lexicographical_compare_three_way<int*>/16                                    8.70 ns         8.70 ns     80464881
BM_lexicographical_compare_three_way<int*>/64                                    28.9 ns         28.9 ns     24261824
BM_lexicographical_compare_three_way<int*>/256                                    125 ns          125 ns      5593129
BM_lexicographical_compare_three_way<int*>/1024                                   448 ns          448 ns      1562365
BM_lexicographical_compare_three_way<int*>/4096                                  1751 ns         1750 ns       399745
BM_lexicographical_compare_three_way<int*>/16384                                 6895 ns         6895 ns       101193
BM_lexicographical_compare_three_way<int*>/65536                                28741 ns        28740 ns        24319
BM_lexicographical_compare_three_way<int*>/262144                              125823 ns       125822 ns         5523
BM_lexicographical_compare_three_way<int*>/1048576                             508282 ns       508215 ns         1385
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1             2.68 ns         2.68 ns    261609292
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/4             5.37 ns         5.37 ns    132805153
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/16            12.4 ns         12.4 ns     56251434
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/64            44.5 ns         44.5 ns     15732328
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/256            182 ns          182 ns      3848123
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1024           696 ns          696 ns      1003714
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/4096          2753 ns         2753 ns       254392
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/16384        10966 ns        10966 ns        63844
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/65536        43846 ns        43844 ns        15953
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/262144      175599 ns       175590 ns         3989
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1048576     705291 ns       705245 ns          991
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1               1.67 ns         1.67 ns    418471062
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4               4.68 ns         4.68 ns    149512136
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16              16.7 ns         16.7 ns     41727011
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/64              65.0 ns         65.0 ns     10776363
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/256              266 ns          266 ns      2635392
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1024            1035 ns         1035 ns       674817
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4096            4121 ns         4121 ns       169881
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16384          16461 ns        16461 ns        42545
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/65536          65776 ns        65774 ns        10630
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/262144        263308 ns       263295 ns         2659
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1048576      1060793 ns      1060722 ns          660

A couple of observations:

  • The results for {fast,slow}_path/1 vary from run to run. The slow_path is not faster than the fast_path consistently across runs. I think this is primarily variance
  • fast_path/1048576 is twice as fast as slow_path/1048576
  • The results for three_way<int*>/1048576 and three_way<int*>/1048576 are inline with the results for {fast,slow}_path/1048576. The dispatching based on whether we are passing in a random access iterator or not works as expected
  • three_way<random_access_iterator<int*>>/1048576 is less efficient than three_way<int*>/1048576. As shown below, the assembly code is exactly identical between the two, though. So this seems to be due to some micro-architectural shenanigans (maybe code alignment?)
  • Compared to the previously posted results, three_way<int*> is now faster. I think compared to last time, I was just more lucky this time, and this improvement is thanks to the same micro-architectural shenanigans as the previous point

I recorded the benchmark with perf record -e cycles:pp ./libcxx/benchmarks/lexicographical_compare_three_way.libcxx.out.
Below the hot loops of the individual benchmarks as reported by perf report:

Hot loop for BM_lexicographical_compare_three_way_fast_path and BM_lexicographical_compare_three_way<int*>. Both have the exact same assembly and the reported cycles:pp perf numbers are pretty identical.

15.10 │210:┌─→cmp    %rdi,%r10 
      │    │↓ je     25a
17.76 │    │  mov    0x4(%rsi,%rdi,4),%r8d
16.45 │    │  mov    0x4(%r15,%rdi,4),%r9d
14.85 │    │  inc    %rdi
 0.30 │    ├──cmp    %r9d,%r8d
15.98 │    └──je     210

Hot loop for BM_lexicographical_compare_three_way_slow_path and BM_lexicographical_compare_three_way<cpp17_input_iterator<int*, int*>>. Both have the exact same assembly and the reported perf numbers are pretty identical.

 0.03 │210:┌─→mov    (%r15,%r10,1),%edi
20.82 │    │  cmp    %edi,(%rsi,%r10,1) 
 2.75 │    │↓ jne    240 
 7.61 │    │  cmp    %r10,%rax
      │    │  sete   %r8b
 0.01 │    │  cmp    %r10,%r9
20.71 │    │  sete   %dil
 7.54 │    │↑ je     1d0
      │    │  lea    0x4(%r10),%r11
 0.01 │    │  cmp    %r10,%rax
21.13 │    │  mov    %r11,%r10
 7.63 │    └──jne    210

Hot loop for BM_lexicographical_compare_three_way<random_access_iterator<int*> >. Interestingly, the generated assembly is identitcal with

 2.10 │210:┌─→cmp    %rdi,%r10 
      │    │↓ je     25a
35.84 │    │  mov    0x4(%rsi,%rdi,4),%r8d
 3.58 │    │  mov    0x4(%r15,%rdi,4),%r9d
 4.02 │    │  inc    %rdi
 0.29 │    ├──cmp    %r9d,%r8d
33.22 │    └──je     210

As you can see, the assembly is identical with BM_lexicographical_compare_three_way_fast_path. However, the codes executes 40%, and the perf profile has different hot instructions.
My guess is that this is due to some micro-architectural shenanigans.

I am not sure how to further pinpoint this difference between int* and random_access_iterator<int*>.
Either way, I think we can say for sure: The assembly for the fast path is more efficient than the assembly for the slow path.
There seems to be some performance variability, though, due to some aspects which I don't understand fully.

Are those benchmark results as well as the updated benchmark code satisfactory to you, @var-const?


Done. In particular, because this allows me to address your other benchmarking request more easily: I can now directly call the slow and fast paths from the benchmark

avogelsgesang marked an inline comment as done.

Updated benchmark code to directly compare the slow and the fast path

avogelsgesang marked 6 inline comments as done.

Address mordante's comments about test cases:

  • remove using std::array and qualify each usage explicitly
  • add test case for non-integer difference_type
253 ↗(On Diff #454246)

Didn't try gcc-12, yet. But I am currently struggling to even get it working for clang. shows my current progress. The problem is that I can't get it to satisfy the std::three_way_comparable. three_way_comparable requires same_as<common_comparison_category_t<user_ordering, partial_ordering>, partial_ordering>, but looking at the implementation of common_comparison_category_t, it seems to be hardcoded to {partial,weak,strong}_ordering and I don't see a way how to extend it.

As such, I would say: Any type of user-defined ordering is currently not implementable.

I think I addressed all actionable review comments now.

There is one unresolved, controversial comments remaining, though: auto vs explicit types:

Given that there are good arguments in both directions and I don't see one side as objectively better, I would propose to merge the commit as currently in review, in the interest of making progress.
Can you live with that, @mumbleskates, @philnik, @Mordante?


To me, the goal of the

static_assert(is_integral_v<typename iterator_traits<_InputIterator1>::difference_type>,
              "Using a non-integral difference_type is undefined behavior");
static_assert(is_integral_v<typename iterator_traits<_InputIterator2>::difference_type>,
              "Using a non-integral difference_type is undefined behavior");

was to remind the reader: the difference_type is an integer, keep that in mind while reading the following code. Replacing this by a __cpp17_input_iterator assert would no longer fulfil this purpose

fix CI... maybe...

Another try at fixing the CI; this time the benchmarks

avogelsgesang marked 6 inline comments as done and an inline comment as not done.
  • use explicit types instead of auto
  • make benchmark runnable on non-libcxx
  • sort includes correctly
  • work around NASTY_MACRO
ldionne added inline comments.Nov 25 2022, 6:57 AM
13–14 ↗(On Diff #477852)

I wouldn't bother with that. We test internal stuff as well in other benchmarks.

If we want to make them usable with other stdlibs, I think we'd need to spend some time making that work.

Mordante accepted this revision as: Mordante.Nov 25 2022, 11:01 AM

LGTM, I leave the final approval to @var-const


Is forward intended? __comp is passed by value.

ldionne accepted this revision as: ldionne.Nov 29 2022, 2:16 PM

Removing my "request for changes". Please update the benchmarks as suggested and ship it once @var-const gives you the LGTM.


73–79 ↗(On Diff #477852)

You seem to be benchmarking not only the call to std::lexicographical_compare_three_way, but all of:

auto b1 = IteratorT{};
auto e1 = IteratorT{ + v1.size()};
auto b2 = IteratorT{};
auto e2 = IteratorT{ + v2.size()};
benchmark::DoNotOptimize(std::lexicographical_compare_three_way(b1, e1, b2, e2, std::compare_three_way()));

I suspect this is where the difference for random_access_iterator<int*> vs int* comes from. I think it would make sense to instead create the various iterators outside of the timer. And then perhaps regenerate the benchmarks and hopefully the ones for random_access_iterator<int*> and int* would be within noise.

Also, we did confirm that the assembly was the same for both:

avogelsgesang marked 4 inline comments as done.
  • forward -> move
  • move iterator initialization out of benchmarking loop
  • remove #if _LIB_CPP_VERSION check from benchmark
73–79 ↗(On Diff #477852)

Good catch! I moved the initialization of the iterators out of the benchmark loop. In particular for the large vector sizes, I don't think that this influenced the benchmark results, but still good to make the benchmarks as focused as possible.

I suspect this is where the difference for random_access_iterator<int*> vs int* comes from.

The change did not help, unfortunately (see my local results below). This time, the performance of BM_lexicographical_compare_three_way_fast_path/1048576 also dropped to 700ms compared to last time. The assembly code for the different ways to trigger the fast path (direct call to the fast path; int*, random_access_iterator) is still identical.

Benchmark                                                                             Time             CPU   Iterations
BM_lexicographical_compare_three_way_slow_path/1                                   2.01 ns         2.01 ns    347897440
BM_lexicographical_compare_three_way_slow_path/4                                   5.00 ns         5.00 ns    137864992
BM_lexicographical_compare_three_way_slow_path/16                                  16.3 ns         16.3 ns     42347282
BM_lexicographical_compare_three_way_slow_path/64                                  64.8 ns         64.8 ns     10882499
BM_lexicographical_compare_three_way_slow_path/256                                  269 ns          269 ns      2602744
BM_lexicographical_compare_three_way_slow_path/1024                                1039 ns         1039 ns       672699
BM_lexicographical_compare_three_way_slow_path/4096                                4131 ns         4131 ns       169469
BM_lexicographical_compare_three_way_slow_path/16384                              16563 ns        16562 ns        42373
BM_lexicographical_compare_three_way_slow_path/65536                              65853 ns        65850 ns        10631
BM_lexicographical_compare_three_way_slow_path/262144                            263375 ns       263356 ns         2656
BM_lexicographical_compare_three_way_slow_path/1048576                          1060461 ns      1060399 ns          658
BM_lexicographical_compare_three_way_fast_path/1                                   1.35 ns         1.35 ns    523207011
BM_lexicographical_compare_three_way_fast_path/4                                   3.38 ns         3.38 ns    207207415
BM_lexicographical_compare_three_way_fast_path/16                                  11.4 ns         11.4 ns     60817686
BM_lexicographical_compare_three_way_fast_path/64                                  43.6 ns         43.6 ns     16014340
BM_lexicographical_compare_three_way_fast_path/256                                  181 ns          181 ns      3877235
BM_lexicographical_compare_three_way_fast_path/1024                                 700 ns          700 ns      1008340
BM_lexicographical_compare_three_way_fast_path/4096                                2758 ns         2758 ns       253557
BM_lexicographical_compare_three_way_fast_path/16384                              11054 ns        11054 ns        62963
BM_lexicographical_compare_three_way_fast_path/65536                              44180 ns        44176 ns        15921
BM_lexicographical_compare_three_way_fast_path/262144                            175889 ns       175876 ns         3986
BM_lexicographical_compare_three_way_fast_path/1048576                           707635 ns       707576 ns          990
BM_lexicographical_compare_three_way<IntPtr>/1                                     1.01 ns         1.01 ns    690305420
BM_lexicographical_compare_three_way<IntPtr>/4                                     3.70 ns         3.70 ns    189157419
BM_lexicographical_compare_three_way<IntPtr>/16                                    7.70 ns         7.70 ns     90905923
BM_lexicographical_compare_three_way<IntPtr>/64                                    29.2 ns         29.2 ns     23966259
BM_lexicographical_compare_three_way<IntPtr>/256                                    124 ns          124 ns      5635669
BM_lexicographical_compare_three_way<IntPtr>/1024                                   467 ns          467 ns      1497982
BM_lexicographical_compare_three_way<IntPtr>/4096                                  1848 ns         1848 ns       378813
BM_lexicographical_compare_three_way<IntPtr>/16384                                 7710 ns         7710 ns        89978
BM_lexicographical_compare_three_way<IntPtr>/65536                                29735 ns        29734 ns        23509
BM_lexicographical_compare_three_way<IntPtr>/262144                              122694 ns       122680 ns         5585
BM_lexicographical_compare_three_way<IntPtr>/1048576                             500608 ns       500572 ns         1000
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1             1.34 ns         1.34 ns    515760180
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4             3.42 ns         3.42 ns    205376850
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16            11.8 ns         11.8 ns     59125669
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/64            44.0 ns         44.0 ns     15869035
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/256            181 ns          180 ns      3880819
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1024           695 ns          695 ns      1003722
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4096          2754 ns         2754 ns       254042
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16384        11040 ns        11040 ns        63761
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/65536        43912 ns        43911 ns        15908
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/262144      177597 ns       177587 ns         3965
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1048576     706671 ns       706623 ns          993
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1               1.12 ns         1.12 ns    627628655
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4               4.13 ns         4.13 ns    169518923
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16              16.3 ns         16.3 ns     43030558
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/64              64.4 ns         64.4 ns     10875232
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/256              268 ns          268 ns      2611255
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1024            1040 ns         1040 ns       674434
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4096            4133 ns         4133 ns       169644
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16384          16523 ns        16522 ns        42524
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/65536          65819 ns        65814 ns        10645
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/262144        263771 ns       263765 ns         2656
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1048576      1059997 ns      1059932 ns          660

good catch! Changed to move

JohelEGP added inline comments.Nov 29 2022, 4:14 PM

I think ref(__comp) is the preferred way.

var-const requested changes to this revision.Nov 30 2022, 1:43 AM

Thanks a lot for the updated benchmarks! I really appreciate the effort you put into this.

All in all, the results look pretty good, but I really hope we can get to the bottom of the difference between int* and random_access_iterator<int*>.

The results for {fast,slow}_path/1 vary from run to run. The slow_path is not faster than the fast_path consistently across runs. I think this is primarily variance

I suspect that this might be more than just variance. I haven't tried plotting this, but from looking at the numbers, it seems that the fast path starts _slower_ for the smallest input of 1, becomes roughly equal in time for a slightly larger input of 4, and becomes faster with larger inputs, with the difference becoming larger and larger as the input grows. So if we were to plot this (with the x axis being input size and y axis being time), it seems like we would get two curves where the fast path curve starts out slightly above the slow path curve, but then they intersect almost immediately and from that point the slow curve goes up at a sharper angle, so to say, than the fast curve. I could also imagine how the implementation for the fast path could end up doing slightly more work for tiny inputs. So in short, I think we might be seeing a real issue where the optimization is actually slightly less efficient for very small inputs.

I doubt, however, that this is an issue worth fixing. Adding a runtime check for small inputs would introduce branching that is likely to do way more harm than good, and optimizing for large inputs is a lot more important than for small ones. In short, this seems like a good (and probably unavoidable) tradeoff.

three_way<random_access_iterator<int*>>/1048576 is less efficient than three_way<int*>/1048576. As shown below, the assembly code is exactly identical between the two, though. So this seems to be due to some micro-architectural shenanigans (maybe code alignment?)
I am not sure how to further pinpoint this difference between int* and random_access_iterator<int*>.
Either way, I think we can say for sure: The assembly for the fast path is more efficient than the assembly for the slow path.
There seems to be some performance variability, though, due to some aspects which I don't understand fully.

Here, too, I suspect we're seeing a real thing and not just variance. I say that because the difference tends to stay pretty consistent, with random_access_iterator being ~40% (30-50%) slower. This seems to point to the issue being real.

I don't immediately see how the same assembly could produce consistently different timings, so I'd start with the hypothesis that the difference happens somewhere else. On the other hand, as the input grows, any factors such as inlining or not inlining the function or copying the iterators should be completely drowned out by the body of the algorithm. But perhaps looking at a larger piece of generated assembly would help pinpoint the issue. I'll try to find the time this week to see if I can reproduce the numbers on my side.

I would expect the results for int* and random_access_iterator<int*> to be within the margin of error in an optimized build. If they differ, it might imply we accidentally do something inefficient with iterators, and it would be great to get to the bottom of this because it's such a common use case. I certainly don't want to block this patch forever on this, though. Basically, I think we should dig more into the issue, but if we run out of ideas, I'd be okay to ship it as is and deal with it later.

73 ↗(On Diff #478741)

Nit: passing this is unnecessary because it's the default value, right?

73 ↗(On Diff #478741)

Nit: move?


FWIW, I don't have a strong preference here, but for me, one of the most important and informative aspects of auto is "guaranteed no conversion". This is as relevant for the author as it is for the reader. While it's true that auto can be misused by a writer to avoid thinking about which types are returned (IMO that's a bigger time "saver" than the literal typing which can often be autocompleted), I think it has legitimate uses where it makes the intention clearer.


Would it make sense to pass __comp by reference? It could, in theory, be e.g. a lambda that captures a lot of state. In fact, we generally commit to avoid copying user-provided functors. However, see also the other comment about __comp_ref_type (which would make this a reference but also have an additional effect in the debug mode).


As a libc++-specific extension, we mark the return value of lexicographical_compare with _LIBCPP_NODISCARD_EXT. We should do the same here (and add a check to test/libcxx/diagnostics/nodiscard_extensions.verify.cpp).


@JohenEGP I presume you referring to the __comp_ref_type, right? It's a great suggestion.

@avogelsgesang For context, there is an existing pattern in algorithms where the internal implementation of the function wraps the comparator in a typedef that is defined differently in debug and non-debug modes (see include/__algorithm/comp_ref_type.h):

template <class _Comp>
using __comp_ref_type = __debug_less<_Comp>;
template <class _Comp>
using __comp_ref_type = _Comp&;

So in non-debug modes, this resolves to simply a reference. In debug mode, however, it creates a temporary of a helper type __debug_less that additionally checks that the comparator in fact does induce a strict weak order.

I think we want to continue using this pattern going forward (e.g. lexicographical_compare uses it). The only issue is that the existing __debug_less only supports comparators returning a boolean. You would probably need to create a separate __three_way_comp_ref_type typedef and a separate __debug_three_way_comp helper struct (names are subject to change).

Please let me know if you'd like any additional context on this (this can be kinda confusing). Many existing algorithms can be used as examples of this pattern; a caveat is that more recent code uses a (very) slightly different approach:

  • older code tends to call have _Compare as a template parameter of the internal function, declare the parameter as _Compare __comp, and have the caller specify the template argument explicitly, like std::__foo<_comp_ref_type<_Comp>>(__first, __last, __comp). See e.g. lexicographical_compare;
  • in newer code, the pattern was to declare the parameter as a forwarding reference _Compare&& __comp and have the caller do a static_cast, like std::__foo(__first, __last, static_cast<__comp_ref_type(__comp)> (see e.g. inplace_merge).

If you prefer to, I'm fine with doing this in a follow up since this patch has been open for a while now.


Move the iterators? (in the other function as well)

You might additionally static_assert that the given iterators are copyable, to prevent users from accidentally passing move-only iterators (that our implementation would happen to accept due to the optimization but which isn't guaranteed by the Standard). A few of the existing algorithms do that, see e.g. upper_bound.


We have a few tests that check for a certain behavior across a wide range of algorithms; those have to be updated to include lexicographical_compare_three_way now:

  • test/libcxx/diagnostics/nodiscard_extensions.verify.cpp (potentially);
  • test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp;
  • test/libcxx/algorithms/robust_against_cpp20_hostile_iterators.compile.pass.cpp;
  • test/std/algorithms/robust_re_difference_type.compile.pass.cpp;
  • test/std/algorithms/robust_against_adl.compile.pass.cpp.

Please let me know if you need any help with those.


Nit: s/auto/decltype(auto)/. That way we always check the _exact_ returned type, without potentially stripping away references. While it's very unlikely that this algorithm could plausibly return a reference, I think the main advantage of using decltype(auto) is just not having to think about it and have it guaranteed to always catch the exact type.


Question: what is the purpose of comparing just the last digit instead of just the two given numbers? It makes the implementation slightly more complicated (and the variable names longer), but all the inputs are single-digit anyway.


Is this branch ever taken?


Nit: I'd put expected last so that all the algorithm inputs are next to each other.


Optional: create a local constant with a shorter name for the comparator to cut down on the boilerplate a little?


Nit: s/both/the/.


Nit: please also check that passing RandomAccessIteratorBadDifferenceType iterators for the second range fails as well (i.e., calling std::lexicographical_compare_three_way(c, d, a, b, std::compare_three_way())).

425 ↗(On Diff #478741)

Ultranit: there's one extraneous blank line.

263 ↗(On Diff #478741)

Is this branch ever taken?

This revision now requires changes to proceed.Nov 30 2022, 1:43 AM
philnik added inline comments.Nov 30 2022, 4:13 AM

IMO we should make the debug stuff a separate effort. AFAIK we don't test it anywhere and because of that I'm pretty sure it regressed in some places. Instead of adding another untested branch, I'd suggest creating a patch that adds tests to all algorithms and, depending on scope, update the algorithms in follow-up patches or as part of the test-patch.

var-const added inline comments.Nov 30 2022, 11:34 AM

I would agree to that if someone volunteers to do that follow-up in the near future -- in fact, it would be great. Would you or @avogelsgesang be willing to take this on?

avogelsgesang marked 9 inline comments as done.

address varconst's commments

avogelsgesang added inline comments.Nov 30 2022, 7:05 PM
73 ↗(On Diff #478741)

Nit: passing this is unnecessary because it's the default value, right?

correct. removed

Nit: move?

move what? the input iterators? I don't think that is possible without a "use-after-move". I need them for each iteration of the for loop.


passing as _Cmp& now


added the __three_way_comp_ref_type as requested


Is adding the additional calls to std::move standard compliant? defines lexicographical_compare_three_way as not moving the iterators


I added lexicographical_compare_three_way to all of them, except for robust_re_difference_type.compile.pass.cpp.
Afaict this test case relies on undefined behavior. The difference_type must be an integer type (, but this test case violates that requirement.
As requested by @philnik I added a static_assert to lexicographical_compare_three_way which guards against non-integer different_types. Hence, adding lexicographical_compare_three_way triggers this static_assert.
I see two ways forward: Not adding lexicographical_compare_three_way to robust_re_difference_type.compile.pass.cpp or removing the static_assert. Which one do you prefer?


the idea is to test a "non-default" comparator. If I just used normal integer comparisons here, the test cases wouldn't catch it if lexicographical_compare_three_way just ignored the comparator and used the standard comparator instead


see "Check for a partial_ordering::unordered result" inside test_comparison_categories

263 ↗(On Diff #478741)

See "Check for a partial_ordering::unordered result" in lexicographical_compare_three_way.pass.cpp

huixie90 added inline comments.Dec 3 2022, 9:55 AM

I would Not use common_type_t as it might not exist a common_type as common_type does not work for types that need implicit conversions.
The relevant spec here:
In short, integral types are only "explicitly" convertible to each other, not implicitly.
There are related discussion here:

I think we want to find which difference_type is wider (or we need another type to cover both) and then we need to static_cast to that type

avogelsgesang added inline comments.Dec 5 2022, 4:47 AM

I cannot quite follow. Which part of common_type_t does not work for this use case?

I think we want to find which difference_type is wider (or we need another type to cover both)

My understanding is that common_type_t does exactly that. Note that guarantees that difference_type is a signed integral. As such, there can be no mismatches on signedness, and common_type_t should always give the wider `difference_type

then we need to static_cast to that type

Why do we need static_casts instead of relying on implicit conversions? states that casting to a wider integer type is an implicit conversion.

@avogelsgesang I tried out the benchmarks, and it looks like you were probably accidentally running the benchmarks on the debug version of the build (unfortunately, it's a very easy mistake to make since it's the default). Using the debug build, I get timings very similar to what you saw earlier:

Benchmark                                                                             Time             CPU   Iterations
BM_lexicographical_compare_three_way<IntPtr>/1                                     25.7 ns         25.7 ns     27179505
BM_lexicographical_compare_three_way<IntPtr>/4                                     64.1 ns         64.0 ns     10934595
BM_lexicographical_compare_three_way<IntPtr>/16                                     316 ns          249 ns      2924673
BM_lexicographical_compare_three_way<IntPtr>/64                                     880 ns          864 ns       820595
BM_lexicographical_compare_three_way<IntPtr>/256                                   3328 ns         3324 ns       211011
BM_lexicographical_compare_three_way<IntPtr>/1024                                 13151 ns        13145 ns        53256
BM_lexicographical_compare_three_way<IntPtr>/4096                                 52541 ns        52531 ns        13321
BM_lexicographical_compare_three_way<IntPtr>/16384                               210420 ns       210398 ns         3342
BM_lexicographical_compare_three_way<IntPtr>/65536                               847864 ns       847372 ns          825
BM_lexicographical_compare_three_way<IntPtr>/262144                             3394918 ns      3393657 ns          207
BM_lexicographical_compare_three_way<IntPtr>/1048576                           13569716 ns     13563941 ns           51
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1             37.7 ns         37.7 ns     18632581
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4             85.6 ns         85.6 ns      8158698
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16             286 ns          285 ns      2474206
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/64            1036 ns         1035 ns       676002
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/256           4018 ns         4017 ns       167247
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1024         16138 ns        15993 ns        43945
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4096         63604 ns        63602 ns        10994
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16384       255664 ns       255574 ns         2749
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/65536      1025497 ns      1025498 ns          683
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/262144     4113879 ns      4111688 ns          170
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1048576   16455950 ns     16454558 ns           43
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1               29.5 ns         29.5 ns     23748940
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4               99.7 ns         99.7 ns      6798031
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16               300 ns          300 ns      2339940
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/64              1116 ns         1115 ns       632037
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/256             4347 ns         4344 ns       161197
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1024           17271 ns        17270 ns        40467
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4096           68959 ns        68958 ns        10135
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16384         276437 ns       276348 ns         2533
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/65536        1111640 ns      1111475 ns          629
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/262144       4451482 ns      4450650 ns          157
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1048576     17807151 ns     17807154 ns           39

With a similar observation where random_access_iterator<int*> is consistently slower than int* (which makes sense in an unoptimized build).

Rerunning the build with -DCMAKE_BUILD_TYPE=Release, however, makes the difference go away -- now the timings are within the margin of error (not to mention many times faster):

BM_lexicographical_compare_three_way<IntPtr>/1                                    0.447 ns        0.444 ns   1000000000
BM_lexicographical_compare_three_way<IntPtr>/4                                     1.99 ns         1.96 ns    361161702
BM_lexicographical_compare_three_way<IntPtr>/16                                    5.86 ns         5.81 ns    121777252
BM_lexicographical_compare_three_way<IntPtr>/64                                    21.6 ns         21.4 ns     32736439
BM_lexicographical_compare_three_way<IntPtr>/256                                   97.6 ns         96.2 ns      7379296
BM_lexicographical_compare_three_way<IntPtr>/1024                                   346 ns          343 ns      2028815
BM_lexicographical_compare_three_way<IntPtr>/4096                                  1352 ns         1339 ns       515878
BM_lexicographical_compare_three_way<IntPtr>/16384                                 5356 ns         5312 ns       132878
BM_lexicographical_compare_three_way<IntPtr>/65536                                21379 ns        21179 ns        32955
BM_lexicographical_compare_three_way<IntPtr>/262144                               85883 ns        84892 ns         8327
BM_lexicographical_compare_three_way<IntPtr>/1048576                             359011 ns       355722 ns         1956
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1            0.448 ns        0.444 ns   1000000000
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4             1.96 ns         1.94 ns    359261768
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16            5.89 ns         5.83 ns    121641817
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/64            21.5 ns         21.3 ns     32731847
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/256           97.2 ns         96.2 ns      7337449
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1024           346 ns          344 ns      2040876
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4096          1347 ns         1336 ns       516457
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16384         5378 ns         5329 ns       131757
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/65536        21296 ns        21168 ns        32939
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/262144       85516 ns        84770 ns         8164
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1048576     362302 ns       358839 ns         1979
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1              0.576 ns        0.568 ns   1000000000
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4               2.27 ns         2.26 ns    309139488
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16              7.85 ns         7.80 ns     87860227
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/64              37.9 ns         37.7 ns     18627871
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/256              131 ns          130 ns      5375353
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1024             503 ns          501 ns      1386276
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4096            1995 ns         1986 ns       352332
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16384           7947 ns         7919 ns        88986
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/65536          31716 ns        31642 ns        22141
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/262144        127179 ns       126551 ns         5454
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1048576       517207 ns       514653 ns         1360

I think that explains it -- we were assuming we're seeing optimized results which wasn't actually the case. It also means the code is doing the right thing, so there's no actual issue, which is great!

var-const added a comment.EditedDec 8 2022, 10:02 PM

Edit: never mind my original comment, I got incorrect results. Update below:

@avogelsgesang Thanks for pointing out you were using RelWithDebInfo. Running in that mode, I get similar results to Release, with no divergence between int* and random_access_iterator<int*>:

Benchmark                                                                             Time             CPU   Iterations
BM_lexicographical_compare_three_way<IntPtr>/1                                    0.590 ns        0.568 ns   1000000000
BM_lexicographical_compare_three_way<IntPtr>/4                                     2.24 ns         2.24 ns    312197559
BM_lexicographical_compare_three_way<IntPtr>/16                                    6.05 ns         6.05 ns    116194144
BM_lexicographical_compare_three_way<IntPtr>/64                                    21.4 ns         21.4 ns     32794105
BM_lexicographical_compare_three_way<IntPtr>/256                                   94.8 ns         94.8 ns      7386226
BM_lexicographical_compare_three_way<IntPtr>/1024                                   340 ns          340 ns      2051468
BM_lexicographical_compare_three_way<IntPtr>/4096                                  1319 ns         1319 ns       530854
BM_lexicographical_compare_three_way<IntPtr>/16384                                 5332 ns         5284 ns       132817
BM_lexicographical_compare_three_way<IntPtr>/65536                                20946 ns        20945 ns        33471
BM_lexicographical_compare_three_way<IntPtr>/262144                               83706 ns        83701 ns         8370
BM_lexicographical_compare_three_way<IntPtr>/1048576                             348663 ns       348648 ns         1981
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1            0.557 ns        0.557 ns   1000000000
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4             2.20 ns         2.20 ns    313624287
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16            6.04 ns         6.04 ns    117445723
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/64            27.8 ns         27.8 ns     25139524
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/256           95.5 ns         94.7 ns      7396059
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1024           339 ns          339 ns      2072201
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/4096          1319 ns         1319 ns       530504
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/16384         5245 ns         5245 ns       133212
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/65536        20917 ns        20917 ns        33457
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/262144       83482 ns        83481 ns         8363
BM_lexicographical_compare_three_way<random_access_iterator<IntPtr>>/1048576     346843 ns       346843 ns         2018
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1              0.627 ns        0.627 ns   1000000000
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4               2.92 ns         2.92 ns    239622629
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16              12.4 ns         12.4 ns     55910543
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/64              50.1 ns         50.1 ns     13777900
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/256              206 ns          206 ns      3390586
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1024             807 ns          807 ns       861390
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/4096            3286 ns         3286 ns       214230
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/16384          12841 ns        12841 ns        54509
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/65536          52191 ns        51713 ns        13388
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/262144        204980 ns       204978 ns         3411
BM_lexicographical_compare_three_way<cpp17_input_iterator<IntPtr>>/1048576       824384 ns       824375 ns          847

So my original guess about Debug was probably incorrect (though it did reproduce the relative difference in numbers pretty well). I'm not sure what happened, but since I'm seeing the expected results locally, I think everything's good with the patch, and it's good to proceed.

var-const added inline comments.Dec 8 2022, 10:46 PM
73 ↗(On Diff #478741)

Right, please disregard (I might have been looking at a previous iteration of this code where iterators were initialized inside the loop).

28 ↗(On Diff #479137)

Hmm, I think the slow/fast_path functions still need the comparator passed explicitly.


I think we are allowed to do that (as long as the difference is not observable), and it is done in a few places. I can't think of a valid way a user could observe the difference -- since we are allowed to copy or move the iterators in the implementation, they can't rely on move constructor not being called or a copy constructor only being called once.

9 ↗(On Diff #479137)

You might need to run something like ninja -C build libcxx-generate-files to take the new header into account.

27 ↗(On Diff #479137)

Functions in this file should be marked _LIBCPP_HIDE_FROM_ABI.

30 ↗(On Diff #479137)

Hmm, looks like the return type can be just auto? (if I'm reading this correctly, it can never return a reference)

37 ↗(On Diff #479137)

Note: _LIBCPP_INLINE_VISIBILITY was used in old code, in new code we use _LIBCPP_HIDE_FROM_ABI for this (they do the same thing, though).

39 ↗(On Diff #479137)

Question: do we need this requires clause? We only use the class in our internal code, so it seems like we shouldn't need this. If it's to validate the comparator given by the user, then it shouldn't only be done in the debug mode.

46 ↗(On Diff #479137)

Hmm, shouldn't this be == __expected? IIUC, __expected is already "reversed".

53 ↗(On Diff #479137)

Since this class's definition is only available in C++20 and above, this can be just constexpr (throughout the file).


Hmm, it's a great question. I'm not sure why this wasn't just static_asserted before. I'll check -- in the meantime, I think it's best to leave the static_assert in place and add a comment to the test file to explain that lexicographical_compare_three_way is omitted intentionally.

109 ↗(On Diff #479137)

This looks like it was commented out accidentally?

112 ↗(On Diff #479137)

Can this be uncommented?

avogelsgesang marked 18 inline comments as done.

Adress a couple of varconst's comments. Not completely done, yet. Want to get feedback from the CI

9 ↗(On Diff #479137)

That seems like a no-op to me. At least, git does not mark any files as changed after running this

ldionne added inline comments.Dec 14 2022, 11:20 AM

@avogelsgesang @philnik @var-const

Is there a reason why we're testing for this *here* specifically? Isn't this a general property that we require from iterators?

Also, why don't we check that the difference type is signed? That's what says.

FWIW, I believe it would be preferable to add this assertion in a more principled way across the library, and also (especially) to do it in a separate patch. We probably need to add a temporary escape hatch for this one cause I would assume that it may break a lot of users.

philnik added inline comments.Dec 14 2022, 11:28 AM

@avogelsgesang @philnik @var-const

Is there a reason why we're testing for this *here* specifically? Isn't this a general property that we require from iterators?

We're adding it here because it's a new algorithm, so we can't break people.

Also, why don't we check that the difference type is signed? That's what says.

I don't remember why we only check that it's integral; I assume it was just an oversight.

FWIW, I believe it would be preferable to add this assertion in a more principled way across the library, and also (especially) to do it in a separate patch. We probably need to add a temporary escape hatch for this one cause I would assume that it may break a lot of users.

I agree that it would make sense to add it to all the algorithms, but I don't think it makes a lot of sense to add an escape hatch to a new algorithm.

ldionne added inline comments.Dec 14 2022, 4:09 PM

Good point about the new algorithm. Let's add it here unconditionally. And then we can pursue lifting this into something like static_assert(__iterator_requirements_whatever<_It>) that we can perhaps sprinkle in a few places (with a escape hatch for existing algorithms).

Basically I'd like to avoid this being just a one-off. I think this assert is a good idea and would like to expand its use.

Let's address the signed integral issue before we ship, though.

avogelsgesang marked 6 inline comments as done.
  • static_assert for signed integral
  • add missing includes
ldionne accepted this revision.Dec 20 2022, 2:59 PM

LGTM w/ green CI and current review comments addressed. Some of us will be out for the holidays and I don't want this patch to be blocked from making progress artificially since it seems like everything has been addressed or at least discussed.

Is this blocked on anything? It would be awesome to finish it up and merge it in time for LLVM 17. I think this had pretty much all the approvals it needed to go ahead and only a few review comments had to be addressed.

265 ↗(On Diff #483771)
avogelsgesang marked 3 inline comments as done.Feb 12 2023, 11:02 AM

Is this blocked on anything?

I am unfortunately still running into linking issues with libc++ (latest update here) which means I am unable to run any tests locally. So far, my strategy was to give at a bit of time since maybe someone else with more experience than me might be running into the same problem and fix it for me. I just rebased on latest main, but the problems still persist. I guess, I will have to dig deeper on what is broken there.

Rebase on main; update status pages: 16.0 -> 17.0

Add test for __debug_three_way_comp

avogelsgesang added inline comments.Feb 12 2023, 1:02 PM
39 ↗(On Diff #479137)

The idea is: For comparators which can be called as cmp(Type1, Type2) but not as cmp(Type2, Type1), we want to skip this check, and fallback to the other __do_compare_assert further down.

I copied this from __algorithm/comp_ref_type.h, which had

    std::declval<_LHS &>(), std::declval<_RHS &>()))

to check whether the parameters can be swapped

fix clang-tidy by replacing declval with std::declval

Landing despite red CI, because:

  • clang-format is red because it is complaining about incorrectly formatting in a pre-existing file
  • "Apple back-deployment macosx-10.15" is failing due to some unrelated infrastructure error
  • clang-cl failed due to std/thread/thread.mutex/thread.mutex.requirements/thread.timedmutex.requirements/thread.timedmutex.recursive/try_lock_for.pass.cpp which is unrelated to this commit

Build error from Apple back-deployment test:

+ tar -xz --strip-components=1 -C /Users/libcxx-buildkite-agent/libcxx.buildkite-agent/builds/y10-8-macminivault-com/llvm-project/libcxx-ci/build/apple-system-backdeployment-10.15/macos-roots
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   934  100   934    0     0     31      0  0:00:30  0:00:30 --:--:--   246
tar: Error opening archive: Unrecognized archive format
This revision was not accepted when it landed; it landed in state Needs Review.Feb 12 2023, 2:51 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
h-vetinari added inline comments.
264–265 ↗(On Diff #496806)

This line is now duplicated.

148–153 ↗(On Diff #496806)

Is it intentional that these tests are commented out? If so, the comment doesn't really elucidate why, or what would be necessary to enable them.

avogelsgesang added inline comments.Feb 13 2023, 12:48 AM
148–153 ↗(On Diff #496806)

Yes, commenting them out was intentional, see discussion in

I hoped that the comment

lexicographical_compare_three_way static_asserts that the difference type is an integer, as required by

would explain why lexicographical_compare_three_way would reject the difference_type used in this test here, but now I realize that the comment is missing the *signed* integer requirement...

On 2nd thought: maybe it would have been sufficient to use PickyIterator<void**, unsigned long>(a); instead of PickyIterator<void**, long>(a); ...

avogelsgesang marked an inline comment as done.Feb 13 2023, 4:34 PM
avogelsgesang added inline comments.
264–265 ↗(On Diff #496806)

Thanks! Fixed :)