Page MenuHomePhabricator

Optimize vector push_back to avoid continuous load and store of end pointer.
Needs ReviewPublic

Authored by mvels on May 26 2020, 1:57 PM.

Details

Reviewers
EricWF
Group Reviewers
Restricted Project
Summary

credits: this change is based on analysis and a proof of concept by gerbens@google.com.

Before, the compiler loses track of end as 'this' and other references possibly escape beyond the compiler's scope. This can be see in the generated assembly:

16.28 │200c80:   mov     %r15d,(%rax)
60.87 │200c83:   add     $0x4,%rax
      │200c87:   mov     %rax,-0x38(%rbp)
 0.03 │200c8b: → jmpq    200d4e
 ...
 ...
 1.69 │200d4e:   cmp     %r15d,%r12d
      │200d51: → je      200c40
16.34 │200d57:   inc     %r15d
 0.05 │200d5a:   mov     -0x38(%rbp),%rax
 3.27 │200d5e:   mov     -0x30(%rbp),%r13
 1.47 │200d62:   cmp     %r13,%rax
      │200d65: → jne     200c80

We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise' in the slow paths, but creates nice and compact fast path code, i.e.:

32.64 │200760:   mov    %r14d,(%r12)
 9.97 │200764:   add    $0x4,%r12
 6.97 │200768:   mov    %r12,-0x38(%rbp)
32.17 │20076c:   add    $0x1,%r14d
 2.36 │200770:   cmp    %r14d,%ebx
      │200773: → je     200730
 8.98 │200775:   mov    -0x30(%rbp),%r13
 6.75 │200779:   cmp    %r13,%r12
      │20077c: → jne    200760

Now there is a single store for the pushback value (as before), and a single store for the end without a reload (dependency).
For full local vectors, (i.e., not referenced elsewhere), the capacity load and store inside the loop could also be removed, but this requires more substantial refactoring inside vector.

Benchmark based on:

template <typename Vector>
void BM_Pushback(benchmark::State& state) {
  int count = state.range(0);
  Vector vec;
  vec.reserve(count);
  while (state.KeepRunningBatch(count)) {
    vec.clear();
    for (int i = 0; i < count; ++i) {
      vec.push_back(i);
    }
    testing::DoNotOptimize(vec.data());
  }
}

BENCHMARK(BM_Pushback<std::vector<int>>)->Arg(4000);

Results:

CPU: Intel Haswell with HyperThreading (32 cores) dL1:32KB dL2:256KB dL3:45MB
name                                         old time/op             new time/op   delta
-----------------------------------------------------------------------------------------
BM_Pushback<std::vector<int>>/4k             1.66ns ± 7%             0.60ns ± 7%  -63.76%

Diff Detail

Unit TestsFailed

TimeTest
210 msLLVM.CodeGen/AMDGPU::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; c:\ws\prod\llvm-project\build\bin\opt.exe -mtriple=amdgcn-- -S -amdgpu-unify-divergent-exit-nodes -verify -structurizecfg -verify -si-annotate-control-flow C:\ws\prod\llvm-project\llvm\test\CodeGen\AMDGPU\multi-divergent-exit-region.ll | c:\ws\prod\llvm-project\build\bin\filecheck.exe -check-prefix=IR C:\ws\prod\llvm-project\llvm\test\CodeGen\AMDGPU\multi-divergent-exit-region.ll
790 msLLVM.CodeGen/AMDGPU/GlobalISel::Unknown Unit Message ("")
Script: -- : 'RUN: at line 2'; c:\ws\prod\llvm-project\build\bin\llc.exe -global-isel -amdgpu-codegenprepare-disable-idiv-expansion=1 -mtriple=amdgcn-amd-amdhsa -denormal-fp-math-f32=preserve-sign < C:\ws\prod\llvm-project\llvm\test\CodeGen\AMDGPU\GlobalISel\udiv.i64.ll | c:\ws\prod\llvm-project\build\bin\filecheck.exe -check-prefixes=CHECK,GISEL C:\ws\prod\llvm-project\llvm\test\CodeGen\AMDGPU\GlobalISel\udiv.i64.ll
660 msLLVM.CodeGen/AMDGPU/GlobalISel::Unknown Unit Message ("")
Script: -- : 'RUN: at line 2'; c:\ws\prod\llvm-project\build\bin\llc.exe -global-isel -amdgpu-codegenprepare-disable-idiv-expansion=1 -mtriple=amdgcn-amd-amdhsa -denormal-fp-math-f32=preserve-sign < C:\ws\prod\llvm-project\llvm\test\CodeGen\AMDGPU\GlobalISel\urem.i64.ll | c:\ws\prod\llvm-project\build\bin\filecheck.exe -check-prefixes=CHECK,GISEL C:\ws\prod\llvm-project\llvm\test\CodeGen\AMDGPU\GlobalISel\urem.i64.ll

Event Timeline

mvels created this revision.May 26 2020, 1:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2020, 1:57 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
mvels edited the summary of this revision. (Show Details)
miscco added a subscriber: miscco.May 26 2020, 11:38 PM

That are great numbers!

If I understand it correctly this patch has two parts:

  1. Pass this->__end_ to the helper functions
  2. Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

Finally, your __construct_one_at helper function diverges from the previous code.

  • The ASAN annotation is new and not done in the slow path.
  • __construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor
mvels added a comment.May 27 2020, 9:47 AM
  1. Pass this->__end_ to the helper functions
  2. Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

The story is somewhat complicated. The compiler will optimize local types and keep them register allocated if possible. The 'if possible' here is not absolute, but more an 'if the compiler deems it possible".
As the main loop here is mostly trivial, and the vector has 3 words for state, we can easily 'see' it is possible to keep the vector state in 3 registers (if we include begin_).
For a compiler this is harder, as the following factors come in:

  • the logic involves some non inlined code that is beyond the compilers view and may affect this / state
  • state is modified at an inlining depth the compiler no longer tracks in full (tracking state is hard)
  • 'this' or other state 'escapes', i.e., some code path could escape into globals, functions, etc, and the compiler can't proof that the state of the vector is not externally observable
  • compiler heuristics required to determine that the slow path is unlikely and/or register allocation is easily preserved
  • etc....

For an example of possible variants of making life 'as easy as possible' for the compiler, see https://pastebin.com/5YjHbSaC

Running these benchmarks:

BM_Pushback<Vector<1>>/4k         1.67ns ± 8%
BM_Pushback<Vector<2>>/4k         0.60ns ± 7%
BM_Pushback<Vector<3>>/4k         0.35ns ± 7%

You'll see the top 2 are basically the numbers I posted earlier. The 3rd option is rewriting the logic such that the slow path is executed without 'this' state, but purely as inputs / outputs:

void push_back(int value) {
    pointer end = end_;
    if (end == end_cap_) {
      size_type sz = end_ - begin_;
      size_type n = sz ? sz * 2 : 2;
      begin_ = __push_back_slow_path(begin_, sz, n, value);
      end_ = end = begin_ + sz;
      end_cap_ = begin_ + n;
    } else {
      *end = value;
      end_ = end + 1;
    }
}

Which now has the fast path completely register allocated:

20.20 │ ca:   cmp   %r13,%r14
           │     ↑ je    90
20.16 │       mov   %r15d,(%r14)
19.26 │       add   $0x4,%r14
21.36 │       add   $0x1,%r15d
           │       cmp   %r15d,%r12d
18.63 │     ↑ jne   ca

I cheated here in 2 ways: I elided allocator state. The default std::allocator is stateless, however, for a generic implementation we do need to pass allocator references along the call paths. This is part where refactoring this is harder, as this needs to be factored out in 'default allocator' and 'stateful allocator' code where the latter basically will have option 2 performance (only caching end_ state in register). Additionally, there is -fno-exception which makes optimizing this easier (there are no thousands of early exits) especially when it comes to setting / swapping state on grow events.

Finally, your __construct_one_at helper function diverges from the previous code.

  • The ASAN annotation is new and not done in the slow path.
  • __construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

The Transaction class purpose is to track the 'end' pointer (in the split buffer in these cases) for the added element, and to do some accounting (size grew) used in ASAN compilation which defines how many items are readable.
The 'construct one at end' case is easy, as there is only one failure point -> in place constructing the last element, so there is no need for any complicated state. Thus we can simply construct in place, and do the 'grow count for asan' which happens in the tx dtor.

mvels updated this revision to Diff 266660.May 27 2020, 2:08 PM

Fixed AsanTransaction for single 'construct_at' use case

mvels updated this revision to Diff 266669.May 27 2020, 2:15 PM

White space / comment clean up

EricWF requested changes to this revision.May 29 2020, 7:45 AM

@mvels Do we have macro-benchmark results for this change?

libcxx/include/vector
949

If this is the only place _AsanTransaction is used, does it need to be this complicated?

1724–1725

Could we rename __end here, because it only sometimes stores the end iterator. Maybe __last or __back?

This revision now requires changes to proceed.May 29 2020, 7:45 AM
mvels updated this revision to Diff 267320.May 29 2020, 11:56 AM
mvels marked 2 inline comments as done.

renamed end --> pos

libcxx/include/vector
949

An alternative (imho more elegant) way to do this is in https://reviews.llvm.org/D80827

I like this better as emptying out the ASAN to empty thunks could remove much more #ifndef crud, and it evaporates entirely (guaranteed) when compiling with exceptions. (RAII has more cost then except if not part of a 'final' execution path)