This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize vector push_back to avoid continuous load and store of end pointer
ClosedPublic

Authored by ldionne on May 26 2020, 1:57 PM.

Details

Summary

Credits: this change is based on analysis and a proof of concept by
gerbens@google.com.

Before, the compiler loses track of end as 'this' and other references
possibly escape beyond the compiler's scope. This can be see in the
generated assembly:

16.28 │200c80:   mov     %r15d,(%rax)
60.87 │200c83:   add     $0x4,%rax
      │200c87:   mov     %rax,-0x38(%rbp)
 0.03 │200c8b: → jmpq    200d4e
 ...
 ...
 1.69 │200d4e:   cmp     %r15d,%r12d
      │200d51: → je      200c40
16.34 │200d57:   inc     %r15d
 0.05 │200d5a:   mov     -0x38(%rbp),%rax
 3.27 │200d5e:   mov     -0x30(%rbp),%r13
 1.47 │200d62:   cmp     %r13,%rax
      │200d65: → jne     200c80

We fix this by always explicitly storing the loaded local and pointer
back at the end of push back. This generates some slight source 'noise',
but creates nice and compact fast path code, i.e.:

32.64 │200760:   mov    %r14d,(%r12)
 9.97 │200764:   add    $0x4,%r12
 6.97 │200768:   mov    %r12,-0x38(%rbp)
32.17 │20076c:   add    $0x1,%r14d
 2.36 │200770:   cmp    %r14d,%ebx
      │200773: → je     200730
 8.98 │200775:   mov    -0x30(%rbp),%r13
 6.75 │200779:   cmp    %r13,%r12
      │20077c: → jne    200760

Now there is a single store for the push_back value (as before), and a
single store for the end without a reload (dependency).

For fully local vectors, (i.e., not referenced elsewhere), the capacity
load and store inside the loop could also be removed, but this requires
more substantial refactoring inside vector.

Diff Detail

Event Timeline

mvels created this revision.May 26 2020, 1:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2020, 1:57 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
mvels edited the summary of this revision. (Show Details)
miscco added a subscriber: miscco.May 26 2020, 11:38 PM

That are great numbers!

If I understand it correctly this patch has two parts:

  1. Pass this->__end_ to the helper functions
  2. Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

Finally, your __construct_one_at helper function diverges from the previous code.

  • The ASAN annotation is new and not done in the slow path.
  • __construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor
mvels added a comment.May 27 2020, 9:47 AM
  1. Pass this->__end_ to the helper functions
  2. Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

The story is somewhat complicated. The compiler will optimize local types and keep them register allocated if possible. The 'if possible' here is not absolute, but more an 'if the compiler deems it possible".
As the main loop here is mostly trivial, and the vector has 3 words for state, we can easily 'see' it is possible to keep the vector state in 3 registers (if we include begin_).
For a compiler this is harder, as the following factors come in:

  • the logic involves some non inlined code that is beyond the compilers view and may affect this / state
  • state is modified at an inlining depth the compiler no longer tracks in full (tracking state is hard)
  • 'this' or other state 'escapes', i.e., some code path could escape into globals, functions, etc, and the compiler can't proof that the state of the vector is not externally observable
  • compiler heuristics required to determine that the slow path is unlikely and/or register allocation is easily preserved
  • etc....

For an example of possible variants of making life 'as easy as possible' for the compiler, see https://pastebin.com/5YjHbSaC

Running these benchmarks:

BM_Pushback<Vector<1>>/4k         1.67ns ± 8%
BM_Pushback<Vector<2>>/4k         0.60ns ± 7%
BM_Pushback<Vector<3>>/4k         0.35ns ± 7%

You'll see the top 2 are basically the numbers I posted earlier. The 3rd option is rewriting the logic such that the slow path is executed without 'this' state, but purely as inputs / outputs:

void push_back(int value) {
    pointer end = end_;
    if (end == end_cap_) {
      size_type sz = end_ - begin_;
      size_type n = sz ? sz * 2 : 2;
      begin_ = __push_back_slow_path(begin_, sz, n, value);
      end_ = end = begin_ + sz;
      end_cap_ = begin_ + n;
    } else {
      *end = value;
      end_ = end + 1;
    }
}

Which now has the fast path completely register allocated:

20.20 │ ca:   cmp   %r13,%r14
           │     ↑ je    90
20.16 │       mov   %r15d,(%r14)
19.26 │       add   $0x4,%r14
21.36 │       add   $0x1,%r15d
           │       cmp   %r15d,%r12d
18.63 │     ↑ jne   ca

I cheated here in 2 ways: I elided allocator state. The default std::allocator is stateless, however, for a generic implementation we do need to pass allocator references along the call paths. This is part where refactoring this is harder, as this needs to be factored out in 'default allocator' and 'stateful allocator' code where the latter basically will have option 2 performance (only caching end_ state in register). Additionally, there is -fno-exception which makes optimizing this easier (there are no thousands of early exits) especially when it comes to setting / swapping state on grow events.

Finally, your __construct_one_at helper function diverges from the previous code.

  • The ASAN annotation is new and not done in the slow path.
  • __construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

The Transaction class purpose is to track the 'end' pointer (in the split buffer in these cases) for the added element, and to do some accounting (size grew) used in ASAN compilation which defines how many items are readable.
The 'construct one at end' case is easy, as there is only one failure point -> in place constructing the last element, so there is no need for any complicated state. Thus we can simply construct in place, and do the 'grow count for asan' which happens in the tx dtor.

mvels updated this revision to Diff 266660.May 27 2020, 2:08 PM

Fixed AsanTransaction for single 'construct_at' use case

mvels updated this revision to Diff 266669.May 27 2020, 2:15 PM

White space / comment clean up

EricWF requested changes to this revision.May 29 2020, 7:45 AM

@mvels Do we have macro-benchmark results for this change?

libcxx/include/vector
949

If this is the only place _AsanTransaction is used, does it need to be this complicated?

1701–1702

Could we rename __end here, because it only sometimes stores the end iterator. Maybe __last or __back?

This revision now requires changes to proceed.May 29 2020, 7:45 AM
mvels updated this revision to Diff 267320.May 29 2020, 11:56 AM
mvels marked 2 inline comments as done.

renamed end --> pos

libcxx/include/vector
949

An alternative (imho more elegant) way to do this is in https://reviews.llvm.org/D80827

I like this better as emptying out the ASAN to empty thunks could remove much more #ifndef crud, and it evaporates entirely (guaranteed) when compiling with exceptions. (RAII has more cost then except if not part of a 'final' execution path)

Is there still interest in pursuing this? If not, could you abandon this?

Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2023, 5:26 PM
Herald added a subscriber: sunshaoce. · View Herald Transcript
ldionne commandeered this revision.Sep 13 2023, 3:20 PM
ldionne added a reviewer: mvels.
ldionne added a subscriber: ldionne.

[Github PR transition cleanup]

Commandeering to finish.

ldionne updated this revision to Diff 556726.Sep 13 2023, 3:21 PM
ldionne retitled this revision from Optimize vector push_back to avoid continuous load and store of end pointer. to [libc++] Optimize vector push_back to avoid continuous load and store of end pointer.
ldionne edited the summary of this revision. (Show Details)

Rebase and change a few things in the patch. We're still getting a pretty awesome speedup:

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_Pushback/vector_int/1024       2.07 ns         2.07 ns    337961984    BEFORE
BM_Pushback/vector_int/1024      0.549 ns        0.549 ns   1000000512    AFTER
ldionne updated this revision to Diff 556774.Sep 14 2023, 4:57 AM

Fix formatting.

A little more digging into this assembly can be found here: https://godbolt.org/z/TrWY7YMWW (or with LLVM IR: https://godbolt.org/z/69xzf7r74)

I think this change is good and safe. Still pondering the "how to test this" question.

ldionne accepted this revision as: Restricted Project.Sep 25 2023, 3:07 PM

A little more digging into this assembly can be found here: https://godbolt.org/z/TrWY7YMWW (or with LLVM IR: https://godbolt.org/z/69xzf7r74)

I think this change is good and safe. Still pondering the "how to test this" question.

The correctness tests are already handled by our test suite. I agree we don't have a good way of testing performance changes right now, and that's a problem. I think I'd rather not block this patch on that issue since a lot of patches are in the same boat and we both agree this is a good change.

This revision was not accepted when it landed; it landed in state Needs Review.Oct 2 2023, 6:13 AM
This revision was automatically updated to reflect the committed changes.