This is an archive of the discontinued LLVM Phabricator instance.

Optimize vector push_back to avoid continuous load and store of end pointer. (alt 2)
AbandonedPublic

Authored by mvels on May 29 2020, 11:49 AM.

Details

Reviewers
EricWF
Group Reviewers
Restricted Project
Summary

credits: this change is based on analysis and a proof of concept by gerbens@google.com.

Before, the compiler loses track of end as 'this' and other references possibly escape beyond the compiler's scope. This can be see in the generated assembly:

16.28 │200c80:   mov     %r15d,(%rax)
60.87 │200c83:   add     $0x4,%rax
      │200c87:   mov     %rax,-0x38(%rbp)
 0.03 │200c8b: → jmpq    200d4e
 ...
 ...
 1.69 │200d4e:   cmp     %r15d,%r12d
      │200d51: → je      200c40
16.34 │200d57:   inc     %r15d
 0.05 │200d5a:   mov     -0x38(%rbp),%rax
 3.27 │200d5e:   mov     -0x30(%rbp),%r13
 1.47 │200d62:   cmp     %r13,%rax
      │200d65: → jne     200c80

We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise' in the slow paths, but creates nice and compact fast path code, i.e.:

32.64 │200760:   mov    %r14d,(%r12)
 9.97 │200764:   add    $0x4,%r12
 6.97 │200768:   mov    %r12,-0x38(%rbp)
32.17 │20076c:   add    $0x1,%r14d
 2.36 │200770:   cmp    %r14d,%ebx
      │200773: → je     200730
 8.98 │200775:   mov    -0x30(%rbp),%r13
 6.75 │200779:   cmp    %r13,%r12
      │20077c: → jne    200760

Now there is a single store for the pushback value (as before), and a single store for the end without a reload (dependency).
For full local vectors, (i.e., not referenced elsewhere), the capacity load and store inside the loop could also be removed, but this requires more substantial refactoring inside vector.

Benchmark based on:

template <typename Vector>
void BM_Pushback(benchmark::State& state) {
  int count = state.range(0);
  Vector vec;
  vec.reserve(count);
  while (state.KeepRunningBatch(count)) {
    vec.clear();
    for (int i = 0; i < count; ++i) {
      vec.push_back(i);
    }
    testing::DoNotOptimize(vec.data());
  }
}

BENCHMARK(BM_Pushback<std::vector<int>>)->Arg(4000);

Results:

CPU: Intel Haswell with HyperThreading (32 cores) dL1:32KB dL2:256KB dL3:45MB
name                                         old time/op             new time/op   delta
-----------------------------------------------------------------------------------------
BM_Pushback<std::vector<int>>/4k             1.61ns ± 0%             0.61ns ± 3%  -62.36%

Diff Detail

Event Timeline

mvels created this revision.May 29 2020, 11:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 29 2020, 11:49 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
mvels retitled this revision from credits: this change is based on analysis and a proof of concept by gerbens@google.com. to Optimize vector push_back to avoid continuous load and store of end pointer. (alt 2).May 29 2020, 11:58 AM
mvels edited the summary of this revision. (Show Details)
mvels added a reviewer: EricWF.
mvels updated this revision to Diff 267322.May 29 2020, 12:00 PM

Fixed NOEXCEPT comments

mvels updated this revision to Diff 269664.Jun 9 2020, 2:09 PM

Revamped

mvels updated this revision to Diff 269665.Jun 9 2020, 2:11 PM

Removed __config changes

mvels updated this revision to Diff 269666.Jun 9 2020, 2:12 PM

Fixed __config comment

Harbormaster failed remote builds in B59696: Diff 269665!
Harbormaster failed remote builds in B59697: Diff 269666!
mvels abandoned this revision.Jun 18 2020, 12:05 PM