This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Fix performance inconsistency between map/set copy-assignment and copy-constructor
Needs ReviewPublic

Authored by ldionne on Mar 11 2022, 12:06 PM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Summary

Before this patch, std::map's copy constructor and copy assignment operators
would have significantly different performance characteristics. This was
caused by tree::operator= using assign_multi even in the case of a
container with unique keys like std::map. Furthermore, upon investigation,
it appears that __assign_unique was actually slower in some cases than
insert(first, last), because we were not using a hint to insert in the
loop. This patch fixes that issue which provides a small additional speedup.

In the future, we could instead reintroduce a copy assignment operator in
tree and copy the whole tree without performing any comparison. This
isn't done in this patch because such a patch should ensure the copy
assignment and copy constructors for
tree are consistent, which goes
beyond the targeted fix I'm trying to do here. See https://llvm.org/PR62571.

This patch also adds a benchmark for std::map copy assignment, which
highlights the performance inconsistency we had.


Benchmark Time CPU Iterations

(before) BM_ConstructorCopy_MapSize=1000000 60.2 ns 60.2 ns 11000000
(before) BM_CopyAssignment_MapSize=1000000 68.7 ns 68.7 ns 11000000
(before) BM_CopyAssignmentPrepopulated_MapSize=1000000 42.1 ns 42.1 ns 17000000

(after) BM_ConstructorCopy_MapSize=1000000 59.9 ns 59.9 ns 12000000
(after) BM_CopyAssignment_MapSize=1000000 59.9 ns 59.9 ns 12000000
(after) BM_CopyAssignmentPrepopulated_MapSize=1000000 40.7 ns 40.7 ns 18000000

rdar://89335215

Diff Detail

Event Timeline

ldionne created this revision.Mar 11 2022, 12:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2022, 12:06 PM
ldionne requested review of this revision.Mar 11 2022, 12:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2022, 12:06 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
ldionne updated this revision to Diff 414745.Mar 11 2022, 1:58 PM

Add regression test.

var-const added inline comments.
libcxx/benchmarks/map.bench.cpp
175

Question: why are we using a different naming convention here (with capitalized variable Names)?

1032

Question: is a similar benchmark for set needed? And perhaps multimap, unless it's covered by this test?

libcxx/include/set
594

Question: it's _VSTD and not std for consistency within the file, right?

libcxx/test/libcxx/containers/associative/perf.assignment_vs_construction.pass.cpp
3

A couple of questions:

  • can/should we use Google Benchmark here?
  • is this file compiled with optimizations enabled?
21

Nit: include <vector>.

27

If I understand correctly, all the generated strings will be short enough for SSO to kick in. Is this desirable (I presume it effectively turns copies into moves and minimizes the time difference between the old and new implementations, thus testing for the "worst" case scenario)? If yes, it might be worth adding a comment to that effect.

Also, now that I think of it, what's the reason to use strings and not integers as the key type?

33

Alternatively, this code could use an unordered_map to store generated keys and not add them to keys if they were already seen before. Though I don't think efficiency matters for this function.

43

Very optional / question: using <chrono> always makes me want to create a short alias like using namespace chr = std::chrono. What's your preference on this?

50

Note: this could overflow if iterations were unbounded, though if we're always using a small integer, it doesn't matter.

57–59

Optional: test_map and test_set could probably be merged into a single function by moving the difference out into a helper. The best I could come up with is to have a functor that takes care of the difference in value_type (to avoid specializing for all 4 map/set variations):

template <typename Container, typename Func>
void test(std::vector<std::string> const& keys, Func const& value_type_func) {
...
tmp.insert(value_type_func(key));

...

test<std::map<std::string, int>(keys, [](const auto& s) { return std::make_pair(s, 1); });

Though it might be not worth it.

62

I'm a bit concerned about the compiler seeing that foo is never used and potentially optimizing it away. Is it known that it's not the case / cannot be the case?

Note: it looks like the double underscore in front of the tree name causes incorrect formatting in the patch description. Perhaps putting __tree inside backpacks would help (if you think it's worth fixing).

libcxx/include/__tree
1098

From the patch description: "In the future, we could instead reintroduce a copy assignment operator in
tree and copy the whole tree without performing any comparison". Would it be worthwhile adding a TODO to that effect here?

Nice that this issue has been found!

libcxx/include/__tree
1098

+1 for the TODO.

libcxx/include/set
594

These changes are quite unexpected when reading the patch description. Can you update the description?
(Same for the multi(map|set).)

libcxx/test/libcxx/containers/associative/perf.assignment_vs_construction.pass.cpp
10

When this can't be done in Google Benchmark please add the rationale here.

71

When this is the reason to not do it in Google Benchmark I think it's worthwhile to consider how we can test for timing regressions in the benchmarks. (This would even be good to look into regardless of this patch.)

EricWF added a subscriber: EricWF.Mar 16 2022, 10:45 AM

I don't think a custom benchmarking test is appropriate. Please rewrite your benchmarks using Google benchmark.

libcxx/benchmarks/map.bench.cpp
175

Probably because I started it because I was feeling funky. Not sure it matters much.

177

I don't think this is the correct usage of KeepRunningBatch.

179

What about the case where the map is already populated? And with different sizes?

libcxx/include/__tree
1098

On another note: a copy assignment operator gets implicitly declared by the compiler. For readability lets make that implicit declaration explicit.

1620

Could we make this faster for non-multi containers by using a different assign f ml

libcxx/include/set
598

What's the performance like if you call __assign_unique(__s.begin(), __s.end()) instead?

libcxx/test/libcxx/containers/associative/perf.assignment_vs_construction.pass.cpp
3

Yes, this should be google benchmark.

PS. This is removing an optimization where:

(A) the node allocations are reused during assignment
(B) For reused nodes, the types assignment operator is called.

So this could be a significant regression when either

(A) Memory allocation is slow.
(B) The element types assignment operator is faster than its copy constructor (imagine a case where it too is reusing resources).

So I think we need more benchmarks that attempt to measure the performance in these cases.

EricWF added inline comments.Mar 16 2022, 10:56 AM
libcxx/include/__tree
1620

Ignore this comment.

ldionne updated this revision to Diff 519947.May 5 2023, 12:40 PM
ldionne marked 23 inline comments as done.
ldionne retitled this revision from [libc++] Fix performance inconsistency between map copy-assignment and copy-constructor to [libc++] Fix performance inconsistency between map/set copy-assignment and copy-constructor.
ldionne edited the summary of this revision. (Show Details)

Rebase, address comments and improve performance more.

ldionne added inline comments.May 5 2023, 12:46 PM
libcxx/benchmarks/map.bench.cpp
175

Yeah it's weird but I am staying consistent with the rest of the file.

177

This is what I see in GoogleBenchmark:

// Returns true iff the benchmark should run n more iterations.
// REQUIRES: 'n' > 0.
// NOTE: A benchmark must not return from the test until KeepRunningBatch()
// has returned false.
// NOTE: KeepRunningBatch() may overshoot by up to 'n' iterations.
//
// Intended usage:
//   while (state.KeepRunningBatch(1000)) {
//     // process 1000 elements
//   }
bool KeepRunningBatch(IterationCount n);

You're right that it's a bit weird since we're only "processing" one element here (aka we're doing the assignment of a single map). However, if I look at how all these benchmarks were written, I think the intent is that we pretend that copying each element in the map is "one iteration". I guess that's done so that the output of Google Benchmark can show the time taken on average for each element given a map with N elements in it. That's what we do in all the benchmarks in this file.

Anyway, I switched all the constructor-related benchmarks to use the pattern used in ConstructorMove, which gave me the most sensical and the most consistent results.

179

Added a benchmark.

libcxx/include/__tree
1098
1098

Since there's a move constructor, it won't get declared. But I explicitly marked it as deleted.

libcxx/include/set
594

This is taken from the __tree::operator= code we were using before, so there are no changes really.

Question: it's _VSTD and not std for consistency within the file, right?

Hmm, I'll use std anyway.

598

It's better in case there's already elements in the tree, so I changed it to that instead. Good catch!

libcxx/test/libcxx/containers/associative/perf.assignment_vs_construction.pass.cpp
3

I basically removed this test. The goal was to have a regression test that would catch when copy assignment and copy construction are not within a reasonable margin of each other, which we don't achieve with our current benchmarks. But I think it's fine to instead have just the benchmarks under libcxx/benchmarks.

71

Yes, I agree, however let's not block this patch on figuring out how to do performance regression testing.

ldionne updated this revision to Diff 520371.May 8 2023, 7:33 AM

Rebase onto D150119 to fix C++03 issue

ldionne updated this revision to Diff 520666.May 9 2023, 5:39 AM

Rebase onto main

Mordante added inline comments.May 9 2023, 10:36 AM
libcxx/benchmarks/map.bench.cpp
124

Why remove the validation?

131

Why regenerating the data?

libcxx/include/__tree
1634

I really wished somebody would have named this function __insert_unique_hint.

libcxx/include/map
1870

Actually I think this is needed.

  • we allocated nodes with allocator foo
  • we change the allocator to bar in __tree_.__copy_assign_alloc(__m.__tree_);
  • the __tree_.__assign_multi will remove the exiting nodes using allocator bar instead of foo

Am I missing something?

When I'm correct please fix the other 3 assignments too. And then we probably need tests too.