This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize unordered_{multiset,multimap} equality comparison
AbandonedPublic

Authored by ldionne on May 13 2019, 4:49 PM.

Details

Reviewers
mclow.lists
EricWF
zoecarver
Group Reviewers
Restricted Project
Summary

This patch simplifies the comparison of unordered multiset and multimap by first
trying to find a common prefix between the two containers, in case they
happen to be ordered the same.

This also adds some tests for corner cases of unordered_{multimap,multiset}
comparison that wasn't tested previously.

Diff Detail

Event Timeline

zoecarver created this revision.May 13 2019, 4:49 PM

I am quite surprised by the idea that a naked is_permutation can outperform the existing implementation.
However, the test that you linked to is definitely faster on my machine.

I think that result is something that needs to go into our benchmark suite.

include/unordered_map
1654

This seems odd to me.

The old code cached both __x.end() and __y.end().
The new code caches only __x.end().

Why?
Either it's worth caching end() or it is not.

include/unordered_set
960–963

Same comment as in unordered_map

zoecarver marked an inline comment as done.May 13 2019, 6:09 PM

I can add a benchmark for this. I have a few other tests (using map, etc.). I will refer to this, but I might have a few questions :)

include/unordered_map
1654

The old code compares the found value to a stored variable whereas here I compare it to __y.end(). I don't think it makes a difference. I think it's just a thing I did when re-writing these. I am happy to use either.

zoecarver updated this revision to Diff 199370.May 13 2019, 9:41 PM
  • add benchmark

Let me know if there is anything I should add to the benchmark (or if I should write it differently). Here is the output from my machine:

----------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
----------------------------------------------------------------
initialize_data           0.000 ns        0.000 ns            0
test_small_old           166805 ns       166343 ns         4356
test_small_new            64089 ns        63915 ns        10408
test_small_range_old     287885 ns       287437 ns         2648
test_small_range_new      50265 ns        50077 ns        10000
test_big_old           95000392 ns     94921500 ns            6
test_big_new           75199528 ns     75144333 ns            9

Unfortunately nowhere near what I was getting with the other test, but I suspect this is more accurate, and there is a bit of improvement (especially for sets with limited ranges).

What I would like here instead of just a benchmark is a test that ensures that the new implementation is faster than the old one (with some fuzz of course). Is that something reasonable to ask for? @mclow.lists Do you agree this would be better? I'm concerned that we won't be looking at benchmark results and might not notice if the new algorithm is slower under some circumstances.

@ldionne I can add a test that asserts one function takes less time than another, but it might fail if tests are run in parallel or if the computer's resources are being used more heavily while one of the tests is being run. I think it is unlikely that such circumstances arise, but it is possible.

While it is harder for me to show why is_permutation is faster, it follows logic that it would be at least as fast. Essentially, the old function looped through every element of the set. For each element, it would find all elements containing the same value as that element. Then, it would check is_permutation for that set of elements (who were required to both be the equal values and have the same size!). Assuming is_permutation works as specified in the standard, it does the same thing. However, it cuts out unnecessary calls to is_permutation, does not have to loop through every element, and has other optimizations. In other words, the standard specifies that is_permutation must have a maximum of O(n^2) complexity, the current (old) function is more complex than that, therefore, even if is_permutation is updated, it should still be faster.

mclow.lists requested changes to this revision.May 15 2019, 6:59 PM

I have found cases where the existing code is significantly faster than this new code.
Zoe is investigating.

This revision now requires changes to proceed.May 15 2019, 6:59 PM
zoecarver updated this revision to Diff 199745.EditedMay 15 2019, 10:00 PM

I played around with this a bit and (thanks to @mclow.lists) discovered that there are issues with using is_permutation. I tried to take the best of both of these implementations by adding is_permutation's loop which chops off any identical parts and by removing the unneeded call to is_permutation while keeping the equal_range check. Here is the result:

-------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
-------------------------------------------------------------------------------------
initialize_data                                0.000 ns        0.000 ns            0
test_old/OLD_compare_small_set                 62502 ns        62483 ns        11055
test_new/NEW_compare_small_set                 26598 ns        26591 ns        27834
test_old/OLD_compare_similar_set               97684 ns        97492 ns         8014
test_new/NEW_compare_similar_set               22088 ns        22080 ns        33696
test_old/OLD_compare_big_set                79731796 ns     79664100 ns           10
test_new/NEW_compare_big_set                74296876 ns     74245222 ns            9
test_old/OLD_compare_big_set_different      71916394 ns     71878300 ns           10
test_new/NEW_compare_big_set_different     147169839 ns    147085200 ns            5
test_old/OLD_compare_oposite_order_set     306415935 ns    305943000 ns            2
test_new/NEW_compare_oposite_order_set     329123963 ns    329029500 ns            2
test_old/OLD_compare_random_set             12786696 ns     12763356 ns           45
test_new/NEW_compare_random_set              3610982 ns      3605559 ns          211
test_old/OLD_compare_random_set_different       2888 ns         2883 ns       250692
test_new/NEW_compare_random_set_different       2882 ns         2868 ns       234480
test_old/OLD_compare_short_string             241644 ns       241291 ns         2902
test_new/NEW_compare_short_string              61157 ns        60933 ns         8343
test_old/OLD_compare_long_string             1369126 ns      1366185 ns          509
test_new/NEW_compare_long_string              142631 ns       142461 ns         6213
test_old/OLD_compare_different_strings           311 ns          310 ns      2278497
test_new/NEW_compare_different_strings           319 ns          318 ns      2245367

Let's break this down. You will notice that in every one of these bencharks, NEW_* is faster or equally as fast as OLD_* with two exceptions: compare_different_strings and compare_big_set_different. The first one is similar enough that I won't worry about it, but the second one is what worries me. For some reason, the new implementation is twice as slow as the old one. However, when I look at the number of times values are compared the new implementation compares 4999507 values while the old one compares 4999505 values. Two comparisons should not make that big a difference. Additionally, when I run this outside of the benchmark enviorment the comparisons take almost the same amount of time. Any thoughts on why this might be?

Note: careful when running these benchmarks, it takes almost 5GB of memory.

zoecarver retitled this revision from Use is_permutation when comparing containers to Change how containers are compared .May 15 2019, 10:01 PM
zoecarver edited the summary of this revision. (Show Details)
zoecarver set the repository for this revision to rCXX libc++.
ldionne requested changes to this revision.May 16 2019, 11:04 AM
ldionne added inline comments.
benchmarks/unordered_set_comp.bench.cpp
83

This should be extracted into an implementation-detail function and used in both the headers and this benchmark, instead of copy-pasting.

include/unordered_set
1534

That's basically equivalent to:

bool res = std::equal(__first1, __last1, __first2);
if (res)
  return true;

// rest of the code after __not_done label

Or did I miss something? If they're equivalent, I suggest we remove the raw loop.

This revision now requires changes to proceed.May 16 2019, 11:04 AM
zoecarver marked 2 inline comments as done.May 16 2019, 11:25 AM
zoecarver added inline comments.
benchmarks/unordered_set_comp.bench.cpp
83

Actually, I could probably just use operator== here.

include/unordered_set
1534

No, that would not be the same. This function does not check if the sets are the same; it removes any parts of the sets that are the same. Importantly, if only three elements are the same, it will "chop off" those elements and continue from the third element below. Does that make sense?

zoecarver marked an inline comment as done.May 16 2019, 11:36 AM
zoecarver added inline comments.
include/unordered_set
1532–1547

Just realizing now that this should be __first1. That might make it faster :)

I made a mistake and did not update the comparison function correctly, so most of the optimizations were not applied. No wonder it was giving such a weird benchmark score. Now that I have updated the comparison function, it is faster every time (with the exception of the last string test which goes back and forth):

Load Average: 6.94, 4.27, 3.07
-------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
-------------------------------------------------------------------------------------
initialize_data                                0.000 ns        0.000 ns            0
test_old/OLD_compare_small_set                 64835 ns        64759 ns         9022
test_new/NEW_compare_small_set                 25215 ns        25209 ns        27516
test_old/OLD_compare_similar_set               85983 ns        85929 ns         7419
test_new/NEW_compare_similar_set               23424 ns        23408 ns        31177
test_old/OLD_compare_big_set                88360371 ns     88124500 ns            6
test_new/NEW_compare_big_set                67877017 ns     67856889 ns            9
test_old/OLD_compare_big_set_different      76333837 ns     76093818 ns           11
test_new/NEW_compare_big_set_different      75138186 ns     74695111 ns            9
test_old/OLD_compare_oposite_order_set     324600832 ns    324167000 ns            2
test_new/NEW_compare_oposite_order_set     276399596 ns    275194333 ns            3
test_old/OLD_compare_random_set             12322272 ns     12318583 ns           72
test_new/NEW_compare_random_set              3413009 ns      3411407 ns          216
test_old/OLD_compare_random_set_different       3004 ns         3002 ns       233761
test_new/NEW_compare_random_set_different       2958 ns         2951 ns       225822
test_old/OLD_compare_short_string             237232 ns       237151 ns         3006
test_new/NEW_compare_short_string              62076 ns        61986 ns        11166
test_old/OLD_compare_long_string             1310500 ns      1310243 ns          519
test_new/NEW_compare_long_string              108740 ns       108724 ns         6146
test_old/OLD_compare_different_strings           345 ns          345 ns      1963061
test_new/NEW_compare_different_strings           364 ns          362 ns      1948509

Sorry about that mistake, now it should work properly.

Any more thoughts on this? Friendly ping if you have time to look over it again.

EricWF added inline comments.Jun 9 2019, 12:12 AM
benchmarks/unordered_set_comp.bench.cpp
218

I would be very surprised if the compiler wasn't optimizing your benchmark to:

bool __cached_val = new_comp(u1, u2);
while (st.KeepRunning()) {
  bool val = __cached_val;
  DoNotOptimize(val);
}
include/unordered_map
2291

You can't use auto.

2294

Use explicit braces to make the nesting here more visible.

2296

You can write this without goto.

2303

There's no need to quality pair here.

zoecarver marked an inline comment as done.Jun 10 2019, 8:35 PM
zoecarver added inline comments.
benchmarks/unordered_set_comp.bench.cpp
218

Hopefully passing the function directly to DoNotOptimize will fix this. Any other suggestions?

zoecarver updated this revision to Diff 203971.Jun 10 2019, 8:41 PM
  • updated based on Eric's suggestions
zoecarver marked an inline comment as done.EditedJun 10 2019, 8:46 PM

Also, here are some more benchmarks because I technically changed the comparison function:

-------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
-------------------------------------------------------------------------------------
initialize_data                                0.000 ns        0.000 ns            0
test_old/OLD_compare_small_set                 72746 ns        72391 ns        10212
test_new/NEW_compare_small_set                 27201 ns        27132 ns        24270
test_old/OLD_compare_similar_set              100853 ns       100354 ns         7101
test_new/NEW_compare_similar_set               22151 ns        22101 ns        28880
test_old/OLD_compare_big_set                85456872 ns     85059167 ns            6
test_new/NEW_compare_big_set                78148398 ns     78032222 ns            9
test_old/OLD_compare_big_set_different      80389034 ns     80301000 ns            9
test_new/NEW_compare_big_set_different      76252136 ns     76169200 ns           10
test_old/OLD_compare_oposite_order_set     338906270 ns    338359500 ns            2
test_new/NEW_compare_oposite_order_set     308761215 ns    308473500 ns            2
test_old/OLD_compare_random_set             11021038 ns     11011500 ns           56
test_new/NEW_compare_random_set              3717716 ns      3710818 ns          192
test_old/OLD_compare_random_set_different       2720 ns         2718 ns       261663
test_new/NEW_compare_random_set_different       2674 ns         2671 ns       260192
test_old/OLD_compare_short_string            2381223 ns      2377644 ns          295
test_new/NEW_compare_short_string             680146 ns       679617 ns          991
test_old/OLD_compare_long_string            19782950 ns     19765727 ns           33
test_new/NEW_compare_long_string             2370050 ns      2367704 ns          301
test_old/OLD_compare_different_strings          2068 ns         2064 ns       343178
test_new/NEW_compare_different_strings          2046 ns         2043 ns       346945

and (to be fair to the times this function was slower):

-------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
-------------------------------------------------------------------------------------
initialize_data                                0.000 ns        0.000 ns            0
test_old/OLD_compare_small_set                 72221 ns        71934 ns         8892
test_new/NEW_compare_small_set                 30014 ns        29660 ns        25674
test_old/OLD_compare_similar_set              107336 ns       106814 ns         6611
test_new/NEW_compare_similar_set               23795 ns        23757 ns        29510
test_old/OLD_compare_big_set                87814006 ns     87564750 ns            8
test_new/NEW_compare_big_set                77258477 ns     77124000 ns            9
test_old/OLD_compare_big_set_different      86371487 ns     86087222 ns            9
test_new/NEW_compare_big_set_different      79709931 ns     79506000 ns            8
test_old/OLD_compare_oposite_order_set     351100203 ns    349060500 ns            2
test_new/NEW_compare_oposite_order_set     312774516 ns    312366000 ns            2
test_old/OLD_compare_random_set             11675568 ns     11655017 ns           58
test_new/NEW_compare_random_set              3614458 ns      3611199 ns          196
test_old/OLD_compare_random_set_different       2807 ns         2802 ns       229585
test_new/NEW_compare_random_set_different       3031 ns         3022 ns       241292
test_old/OLD_compare_short_string            2557198 ns      2545343 ns          274
test_new/NEW_compare_short_string             739160 ns       736574 ns          976
test_old/OLD_compare_long_string            17619450 ns     17561594 ns           32
test_new/NEW_compare_long_string             2347907 ns      2343063 ns          284
test_old/OLD_compare_different_strings           134 ns          133 ns      5265256
test_new/NEW_compare_different_strings           147 ns          146 ns      4736707
include/unordered_map
2296

Yep. It requires one more comparison, but I think it is worth it. We should probably also update is_permutation so that it can eventually be a constexpr.

Would it be better to make a patch with just the benchmarks?

ldionne added a reviewer: Restricted Project.Nov 2 2020, 2:14 PM
ldionne commandeered this revision.Sep 12 2023, 7:56 AM
ldionne edited reviewers, added: zoecarver; removed: ldionne.

[Github PR transition cleanup]

This patch doesn't seem correct to me: unordered containers are not ordered, so I don't think we can do a linear walk of __first1 and __first2, compare elements side by side and draw any conclusion from that. Abandoning.

Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2023, 7:56 AM
ldionne abandoned this revision.Sep 12 2023, 7:56 AM

[Github PR transition cleanup]

This patch doesn't seem correct to me: unordered containers are not ordered, so I don't think we can do a linear walk of __first1 and __first2, compare elements side by side and draw any conclusion from that. Abandoning.

Looks to me like this patch is correct. It uses the linear walk as a "fast path" to ignore the prefix of the two containers that are both in the same order and equal, then reverts to the pre-existing slow path for the out-of-order (or unequal) suffix.

ldionne reclaimed this revision.Sep 12 2023, 3:13 PM

I think Duncan is right here, actually. There is a potentially useful optimization that we can do by looking at the matching prefix of the container and skipping repeated lookups based on that. Re-opening.

ldionne updated this revision to Diff 556612.Sep 12 2023, 3:14 PM
ldionne retitled this revision from Change how containers are compared to [libc++] Optimize unordered_{set,map} equality comparison.
ldionne edited the summary of this revision. (Show Details)

Rebase and update the approach. Some of the previous patch was not correct (e.g. we can't get rid of is_permutation for unordered_map).

I'm unable to get decisive benchmark results yet though.

Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2023, 3:14 PM
ldionne retitled this revision from [libc++] Optimize unordered_{set,map} equality comparison to [libc++] Optimize unordered_{multiset,multimap} equality comparison.Sep 12 2023, 3:17 PM
ldionne edited the summary of this revision. (Show Details)
ldionne added inline comments.
libcxx/include/unordered_map
1989 ↗(On Diff #556612)

@dexonsmith Do you agree that we should be able to apply the same mismatch optimization here and in std::unordered_set?

Now I know why I'm unable to get decisive benchmark results: I added benchmarks for unordered_set but I changed unordered_multiset. What a shame. I'll update tomorrow.

ldionne updated this revision to Diff 556969.Sep 18 2023, 12:44 PM

Apply the optimization to non-multi containers too. Long story short, I am not convinced that these changes are a real improvement.

Indeed, map lookup starts by computing the hash of the key. If computing the hash is cheaper than fully comparing the key for equality, it can happen that using std::mismatch actually pessimizes the algorithm, since it always checks for full key equality.

Consequently, I am going to drop this patch and only cherry-pick some of the changes in this patch (e.g. testing improvements).

Before patch:

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_Compare_same_container/unordered_set_string/1024           155440 ns       155422 ns         4495
BM_Compare_same_container/unordered_set_int/1024                2501 ns         2501 ns       269041
BM_Compare_different_containers/unordered_set_string/1024       67.7 ns         67.7 ns     10428771
BM_Compare_different_containers/unordered_set_int/1024          1.33 ns         1.33 ns    528206212
BM_Compare_shared_prefix/unordered_set_string/1024              68.1 ns         68.1 ns     10355336
BM_Compare_shared_prefix/unordered_set_int/1024                 1.33 ns         1.33 ns    527314912

After patch:

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_Compare_same_container/unordered_set_string/1024           153609 ns       153589 ns         4515
BM_Compare_same_container/unordered_set_int/1024                2390 ns         2390 ns       279170
BM_Compare_different_containers/unordered_set_string/1024       70.7 ns         70.7 ns      9918807
BM_Compare_different_containers/unordered_set_int/1024          3.11 ns         3.11 ns    173664520
BM_Compare_shared_prefix/unordered_set_string/1024              69.7 ns         69.7 ns     10042753
BM_Compare_shared_prefix/unordered_set_int/1024                 1.44 ns         1.44 ns    486236047
ldionne abandoned this revision.Sep 18 2023, 12:45 PM