credits: this change is based on analysis and a proof of concept by gerbens@google.com.
With the old code, the compiler does not generate optimized code for inlined / compact code, i.e.:
vector<int> myvec; for (...) { myvec.push_back(<Some Inlined Expression>); }
Before, the compiler loses track of the internal end and end capacity pointers as 'this' and other references possibly escape beyond the compiler's scope. This can be seen in the generated assembly:
16.28 │200c80: mov %r15d,(%rax) 60.87 │200c83: add $0x4,%rax │200c87: mov %rax,-0x38(%rbp) 0.03 │200c8b: → jmpq 200d4e ... ... 1.69 │200d4e: cmp %r15d,%r12d │200d51: → je 200c40 16.34 │200d57: inc %r15d 0.05 │200d5a: mov -0x38(%rbp),%rax 3.27 │200d5e: mov -0x30(%rbp),%r13 1.47 │200d62: cmp %r13,%rax │200d65: → jne 200c80
The compiler stores and reloads the end and end_cap pointers.
We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise' in the slow paths, but creates nice and compact fast path code where the compiler now cashes the new end values, i.e.:
32.64 │200760: mov %r14d,(%r12) 9.97 │200764: add $0x4,%r12 6.97 │200768: mov %r12,-0x38(%rbp) 32.17 │20076c: add $0x1,%r14d 2.36 │200770: cmp %r14d,%ebx │200773: → je 200730 8.98 │200775: mov -0x30(%rbp),%r13 6.75 │200779: cmp %r13,%r12 │20077c: → jne 200760
Now there is a single store for the pushback value (as before), and a single store for the end without a reload (dependency).
For full local vectors, (i.e., not referenced elsewhere nor 'escaped`), the capacity load and store inside the loop may also be fully elided depending on the compiler and usage of the vector as per the below benchmark with a local vector, where a vector on some pod is then as fast as filling a plain array, i.e., one store per push_back:
20.93 │ a0: →mov %r14d,0x0(%r13) 19.29 │ add $0x4,%r13 18.34 │ add $0x1,%r14d 18.86 │ cmp %r14d,%r15d │ ↓ je 190 21.78 │ b5: cmp %rsi,%r13 │ jne a0
Benchmark based on:
void BM_FillVectorPushback(benchmark::State& state) { int count = state.range(0); std::vector<int> vec; vec.reserve(count); while (state.KeepRunningBatch(count)) { vec.clear(); for (int i = 0; i < count; ++i) { vec.push_back(i); } testing::DoNotOptimize(vec.data()); } } BENCHMARK(BM_FillVectorPushback)->Arg(4000);
Results:
name old time/op new time/op delta --------------------------------------------------------------------------------------- BM_FillVectorPushback/1000 1.74ns ± 6% 0.39ns ± 1% -77.75%