Page MenuHomePhabricator

Optimize vector push_back for hot loops / push_back fill invocations
Needs ReviewPublic

Authored by mvels on Jun 29 2020, 2:34 PM.

Details

Reviewers
EricWF
Group Reviewers
Restricted Project
Summary

credits: this change is based on analysis and a proof of concept by gerbens@google.com.

With the old code, the compiler does not generate optimized code for inlined / compact code, i.e.:

vector<int> myvec;
for (...) {
  myvec.push_back(<Some Inlined Expression>);
}

Before, the compiler loses track of the internal end and end capacity pointers as 'this' and other references possibly escape beyond the compiler's scope. This can be seen in the generated assembly:

16.28 │200c80:   mov     %r15d,(%rax)
60.87 │200c83:   add     $0x4,%rax
      │200c87:   mov     %rax,-0x38(%rbp)
 0.03 │200c8b: → jmpq    200d4e
 ...
 ...
 1.69 │200d4e:   cmp     %r15d,%r12d
      │200d51: → je      200c40
16.34 │200d57:   inc     %r15d
 0.05 │200d5a:   mov     -0x38(%rbp),%rax
 3.27 │200d5e:   mov     -0x30(%rbp),%r13
 1.47 │200d62:   cmp     %r13,%rax
      │200d65: → jne     200c80

The compiler stores and reloads the end and end_cap pointers.

We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise' in the slow paths, but creates nice and compact fast path code where the compiler now cashes the new end values, i.e.:

32.64 │200760:   mov    %r14d,(%r12)
 9.97 │200764:   add    $0x4,%r12
 6.97 │200768:   mov    %r12,-0x38(%rbp)
32.17 │20076c:   add    $0x1,%r14d
 2.36 │200770:   cmp    %r14d,%ebx
      │200773: → je     200730
 8.98 │200775:   mov    -0x30(%rbp),%r13
 6.75 │200779:   cmp    %r13,%r12
      │20077c: → jne    200760

Now there is a single store for the pushback value (as before), and a single store for the end without a reload (dependency).

For full local vectors, (i.e., not referenced elsewhere nor 'escaped`), the capacity load and store inside the loop may also be fully elided depending on the compiler and usage of the vector as per the below benchmark with a local vector, where a vector on some pod is then as fast as filling a plain array, i.e., one store per push_back:

20.93 │ a0:  →mov    %r14d,0x0(%r13)
19.29 │       add    $0x4,%r13
18.34 │       add    $0x1,%r14d
18.86 │       cmp    %r14d,%r15d
      │     ↓ je     190
21.78 │ b5:   cmp    %rsi,%r13
      │       jne    a0

Benchmark based on:

void BM_FillVectorPushback(benchmark::State& state) {
  int count = state.range(0);
  std::vector<int> vec;
  vec.reserve(count);
  while (state.KeepRunningBatch(count)) {
    vec.clear();
    for (int i = 0; i < count; ++i) {
      vec.push_back(i);
    }
    testing::DoNotOptimize(vec.data());
  }
}

BENCHMARK(BM_FillVectorPushback)->Arg(4000);

Results:

name                                       old time/op             new time/op    delta
---------------------------------------------------------------------------------------
BM_FillVectorPushback/1000                 1.74ns ± 6%             0.39ns ± 1%  -77.75%

Diff Detail

Event Timeline

mvels created this revision.Jun 29 2020, 2:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2020, 2:34 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript

General comment: would it not be better to make the compiler smarter to solve this (and reap the benefits for vector and other code)?