
Details

Reviewers

EricWF
mvels

Group Reviewers

Restricted Project

Commits

rG6fe4e033f07d: [libc++] Optimize vector push_back to avoid continuous load and store of end…

Summary

Credits: this change is based on analysis and a proof of concept by
gerbens@google.com.

Before this change, the compiler loses track of end, because 'this' and
other references may escape beyond the compiler's scope. This can be seen
in the generated assembly:

16.28 │200c80:   mov     %r15d,(%rax)
60.87 │200c83:   add     $0x4,%rax
      │200c87:   mov     %rax,-0x38(%rbp)
 0.03 │200c8b: → jmpq    200d4e
 ...
 ...
 1.69 │200d4e:   cmp     %r15d,%r12d
      │200d51: → je      200c40
16.34 │200d57:   inc     %r15d
 0.05 │200d5a:   mov     -0x38(%rbp),%rax
 3.27 │200d5e:   mov     -0x30(%rbp),%r13
 1.47 │200d62:   cmp     %r13,%rax
      │200d65: → jne     200c80

We fix this by loading end into a local pointer and always explicitly
storing it back at the end of push_back. This generates some slight source
'noise', but creates nice and compact fast path code, i.e.:

32.64 │200760:   mov    %r14d,(%r12)
 9.97 │200764:   add    $0x4,%r12
 6.97 │200768:   mov    %r12,-0x38(%rbp)
32.17 │20076c:   add    $0x1,%r14d
 2.36 │200770:   cmp    %r14d,%ebx
      │200773: → je     200730
 8.98 │200775:   mov    -0x30(%rbp),%r13
 6.75 │200779:   cmp    %r13,%r12
      │20077c: → jne    200760

Now there is a single store for the push_back value (as before), and a
single store for the end without a reload (dependency).

For fully local vectors (i.e., not referenced elsewhere), the capacity
load and store inside the loop could also be removed, but this requires
more substantial refactoring inside vector.
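
As a rough illustration of the pattern described above, the following is a minimal sketch on a hypothetical int-only container. `ToyVector`, `push_back_reload`, `push_back_cached` and `grow_and_append` are names invented for this example and are not part of libc++; the real change also has to deal with allocators, exceptions and ASAN annotations, which are ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical int-only container for illustration; not libc++'s vector.
// No allocator and no exception handling; allocation failure aborts.
class ToyVector {
public:
    ToyVector() = default;
    ToyVector(const ToyVector&) = delete;
    ToyVector& operator=(const ToyVector&) = delete;
    ~ToyVector() { std::free(begin_); }

    // "Before" shape: every access goes through the member end_.
    void push_back_reload(int value) {
        if (end_ != end_cap_) {
            *end_ = value;
            ++end_;
        } else {
            grow_and_append(value);
        }
    }

    // "After" shape: load end_ once, work on the local, store it back once.
    void push_back_cached(int value) {
        int* end = end_;
        if (end < end_cap_) {
            *end = value;
            ++end;
        } else {
            end = grow_and_append(value);  // slow path returns the new end
        }
        end_ = end;
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
    int operator[](std::size_t i) const { return begin_[i]; }

private:
    // Out-of-line growth path: reallocates, appends, and returns the new end.
    int* grow_and_append(int value) {
        const std::size_t sz = size();
        const std::size_t cap = sz ? sz * 2 : 2;
        int* fresh = static_cast<int*>(std::malloc(cap * sizeof(int)));
        if (fresh == nullptr) std::abort();
        if (sz != 0) std::memcpy(fresh, begin_, sz * sizeof(int));
        fresh[sz] = value;
        std::free(begin_);
        begin_ = fresh;
        end_ = fresh + sz + 1;
        end_cap_ = fresh + cap;
        return end_;
    }

    int* begin_ = nullptr;
    int* end_ = nullptr;
    int* end_cap_ = nullptr;
};
```

Both member functions produce the same container contents; whether the cached variant actually yields tighter code depends on the compiler and on what it can prove about aliasing, so this should be checked in the generated assembly rather than assumed.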

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mvels created this revision. May 26 2020, 1:57 PM

Herald added a project: Restricted Project. May 26 2020, 1:57 PM

Herald added a reviewer: Restricted Project.

Herald added a subscriber: libcxx-commits.

mvels added a reviewer: EricWF. May 26 2020, 1:58 PM

mvels edited the summary of this revision.

Harbormaster failed remote builds in B57938: Diff 266328! May 26 2020, 3:17 PM

Those are great numbers!

If I understand it correctly this patch has two parts:

1. Pass this->__end_ to the helper functions
2. Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

Finally, your __construct_one_at helper function diverges from the previous code.

The ASAN annotation is new and not done in the slow path.
__construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

Pass this->__end_ to the helper functions

Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

The story is somewhat complicated. The compiler will optimize local types and keep them register allocated if possible. The 'if possible' here is not absolute, but more an 'if the compiler deems it possible'.
As the main loop here is mostly trivial, and the vector has 3 words of state, we can easily 'see' that it is possible to keep the vector state in 3 registers (if we include begin_).
For a compiler this is harder, as the following factors come in:

the logic involves some non-inlined code that is beyond the compiler's view and may affect 'this' / state
state is modified at an inlining depth the compiler no longer tracks in full (tracking state is hard)
'this' or other state 'escapes', i.e., some code path could escape into globals, functions, etc., and the compiler can't prove that the state of the vector is not externally observable
compiler heuristics are required to determine that the slow path is unlikely and/or register allocation is easily preserved
etc.

For an example of possible variants of making life 'as easy as possible' for the compiler, see https://pastebin.com/5YjHbSaC

Running these benchmarks:

BM_Pushback<Vector<1>>/4k         1.67ns ± 8%
BM_Pushback<Vector<2>>/4k         0.60ns ± 7%
BM_Pushback<Vector<3>>/4k         0.35ns ± 7%

You'll see the top 2 are basically the numbers I posted earlier. The 3rd option is rewriting the logic such that the slow path is executed without 'this' state, but purely as inputs / outputs:

void push_back(int value) {
    pointer end = end_;
    if (end == end_cap_) {
        size_type sz = end_ - begin_;
        size_type n = sz ? sz * 2 : 2;
        begin_ = __push_back_slow_path(begin_, sz, n, value);
        end_ = end = begin_ + sz;
        end_cap_ = begin_ + n;
    } else {
        *end = value;
        end_ = end + 1;
    }
}

Which now has the fast path completely register allocated:

20.20 │ ca:   cmp   %r13,%r14
      │     ↑ je    90
20.16 │       mov   %r15d,(%r14)
19.26 │       add   $0x4,%r14
21.36 │       add   $0x1,%r15d
      │       cmp   %r15d,%r12d
18.63 │     ↑ jne   ca

I cheated here in 2 ways. First, I elided allocator state. The default std::allocator is stateless; however, a generic implementation does need to pass allocator references along the call paths. This is the part where the refactoring is harder, as it needs to be factored out into 'default allocator' and 'stateful allocator' code, where the latter will basically have option 2 performance (only caching end_ state in a register). Second, this uses -fno-exceptions, which makes optimizing this easier (there are no thousands of early exits), especially when it comes to setting / swapping state on grow events.
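
To make the 'inputs / outputs only' idea concrete, here is a self-contained sketch under the same two simplifications (no allocator state, no exception handling). `IntBuffer` and `grow_and_append` are hypothetical names for this example, not the pastebin code and not libc++. The sketch assumes the helper stores the new value at index `sz`, so the new end is taken to be `begin + sz + 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical slow path: a free function that receives everything by value
// and returns the new buffer. It never sees the container object, so the
// caller's state cannot escape through it.
inline int* grow_and_append(int* old_begin, std::size_t sz, std::size_t new_cap,
                            int value) {
    int* fresh = static_cast<int*>(std::malloc(new_cap * sizeof(int)));
    if (fresh == nullptr) std::abort();
    if (sz != 0) std::memcpy(fresh, old_begin, sz * sizeof(int));
    fresh[sz] = value;  // assumption: the helper also stores the new element
    std::free(old_begin);
    return fresh;
}

// Hypothetical container holding the three words of state discussed above.
struct IntBuffer {
    int* begin_ = nullptr;
    int* end_ = nullptr;
    int* end_cap_ = nullptr;

    IntBuffer() = default;
    IntBuffer(const IntBuffer&) = delete;
    IntBuffer& operator=(const IntBuffer&) = delete;
    ~IntBuffer() { std::free(begin_); }

    void push_back(int value) {
        int* end = end_;
        if (end == end_cap_) {
            const std::size_t sz = static_cast<std::size_t>(end - begin_);
            const std::size_t n = sz ? sz * 2 : 2;
            begin_ = grow_and_append(begin_, sz, n, value);
            end_ = begin_ + sz + 1;  // one past the element just appended
            end_cap_ = begin_ + n;
        } else {
            *end = value;
            end_ = end + 1;
        }
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
};
```

The design point is that all state updates happen in `push_back` itself, from values returned by the helper, which gives the optimizer the best chance of keeping the three pointers in registers across loop iterations.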

Finally, your __construct_one_at helper function diverges from the previous code.

The ASAN annotation is new and not done in the slow path.

__construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

The Transaction class's purpose is to track the 'end' pointer (in the split buffer in these cases) for the added element, and to do some accounting (size grew) used in ASAN compilation, which defines how many items are readable.
The 'construct one at end' case is easy, as there is only one failure point (constructing the last element in place), so there is no need for any complicated state. Thus we can simply construct in place, and do the 'grow count for asan', which happens in the tx dtor.
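
The single-failure-point argument can be sketched with a small RAII guard. This is an illustrative assumption of how such a transaction behaves, not libc++'s actual `_ConstructTransaction`: `GuardedBuf`, its nested `Transaction`, and `MaybeThrows` are hypothetical names, and the place where an ASAN annotation would go is only marked by a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <stdexcept>
#include <utility>

// Hypothetical fixed-capacity buffer used only to demonstrate the guard.
template <class T>
class GuardedBuf {
public:
    explicit GuardedBuf(std::size_t cap)
        : begin_(alloc_.allocate(cap)), end_(begin_), cap_(cap) {}
    GuardedBuf(const GuardedBuf&) = delete;
    GuardedBuf& operator=(const GuardedBuf&) = delete;
    ~GuardedBuf() {
        while (end_ != begin_) (--end_)->~T();
        alloc_.deallocate(begin_, cap_);
    }

    // Precondition: size() < capacity (only checked by the assert).
    template <class... Args>
    void construct_one_at_end(Args&&... args) {
        assert(size() < cap_);
        Transaction tx(*this);
        ::new (static_cast<void*>(tx.pos)) T(std::forward<Args>(args)...);
        ++tx.pos;  // reached only if the constructor did not throw
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }

private:
    // Publishes the position reached so far when it goes out of scope. If the
    // constructor above throws, pos was never advanced and end_ is unchanged.
    struct Transaction {
        explicit Transaction(GuardedBuf& b) : buf(b), pos(b.end_) {}
        ~Transaction() {
            buf.end_ = pos;
            // An ASAN "size grew" annotation would be issued here.
        }
        GuardedBuf& buf;
        T* pos;
    };

    std::allocator<T> alloc_;
    T* begin_;
    T* end_;
    std::size_t cap_;
};

// Hypothetical element type whose constructor can be made to throw.
struct MaybeThrows {
    explicit MaybeThrows(bool fail) {
        if (fail) throw std::runtime_error("constructor failed");
    }
};
```

Because there is exactly one operation that can throw, the guard needs no rollback logic: either the element exists and the end moves by one, or nothing changed.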

Fixed AsanTransaction for single 'construct_at' use case

White space / comment clean up

Harbormaster failed remote builds in B58116: Diff 266669! May 27 2020, 2:43 PM

Harbormaster failed remote builds in B58112: Diff 266660! May 28 2020, 2:08 AM

@mvels Do we have macro-benchmark results for this change?

libcxx/include/vector
942	If this is the only place `_AsanTransaction` is used, does it need to be this complicated?
1681–1682	Could we rename `__end` here, because it only sometimes stores the end iterator. Maybe `__last` or `__back`?

This revision now requires changes to proceed. May 29 2020, 7:45 AM

renamed end --> pos

libcxx/include/vector
942	An alternative (imho more elegant) way to do this is in https://reviews.llvm.org/D80827. I like this better, as emptying out the ASAN to empty thunks could remove much more #ifndef crud, and it evaporates entirely (guaranteed) when compiling with exceptions. (RAII has more cost than except if not part of a 'final' execution path)

Harbormaster failed remote builds in B58471: Diff 267320! May 29 2020, 12:33 PM

Is there still interest in pursuing this? If not, could you abandon this?

Herald added a project: Restricted Project. Aug 31 2023, 5:26 PM

Herald added a subscriber: sunshaoce.

[Github PR transition cleanup]

Commandeering to finish.

Rebase and change a few things in the patch. We're still getting a pretty awesome speedup:

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_Pushback/vector_int/1024       2.07 ns         2.07 ns    337961984    BEFORE
BM_Pushback/vector_int/1024      0.549 ns        0.549 ns   1000000512    AFTER

Harbormaster completed remote builds in B257191: Diff 556726. Sep 13 2023, 4:55 PM

Fix formatting.

Harbormaster completed remote builds in B257219: Diff 556774. Sep 14 2023, 9:30 AM

A little more digging into this assembly can be found here: https://godbolt.org/z/TrWY7YMWW (or with LLVM IR: https://godbolt.org/z/69xzf7r74)

I think this change is good and safe. Still pondering the "how to test this" question.

In D80588#4648327, @EricWF wrote:

A little more digging into this assembly can be found here: https://godbolt.org/z/TrWY7YMWW (or with LLVM IR: https://godbolt.org/z/69xzf7r74)

I think this change is good and safe. Still pondering the "how to test this" question.

The correctness tests are already handled by our test suite. I agree we don't have a good way of testing performance changes right now, and that's a problem. I think I'd rather not block this patch on that issue since a lot of patches are in the same boat and we both agree this is a good change.

This revision was not accepted when it landed; it landed in state Needs Review. Oct 2 2023, 6:13 AM

Closed by commit rG6fe4e033f07d: [libc++] Optimize vector push_back to avoid continuous load and store of end… (authored by mvels, committed by ldionne).

This revision was automatically updated to reflect the committed changes.

ldionne added a commit: rG6fe4e033f07d: [libc++] Optimize vector push_back to avoid continuous load and store of end….

Diff 557529

libcxx/benchmarks/ContainerBenchmarks.h

Show First 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	void BM_ConstructFromRange(benchmark::State& st, Container, GenInputs gen) {
auto in = gen(st.range(0));		auto in = gen(st.range(0));
benchmark::DoNotOptimize(&in);		benchmark::DoNotOptimize(&in);
while (st.KeepRunning()) {		while (st.KeepRunning()) {
Container c(std::from_range, in);		Container c(std::from_range, in);
DoNotOptimizeData(c);		DoNotOptimizeData(c);
}		}
}		}

		template <class Container>
		void BM_Pushback(benchmark::State& state, Container c) {
		int count = state.range(0);
		c.reserve(count);
		while (state.KeepRunningBatch(count)) {
		c.clear();
		for (int i = 0; i != count; ++i) {
		c.push_back(i);
		}
		benchmark::DoNotOptimize(c.data());
		}
		}

template <class Container, class GenInputs>		template <class Container, class GenInputs>
void BM_InsertValue(benchmark::State& st, Container c, GenInputs gen) {		void BM_InsertValue(benchmark::State& st, Container c, GenInputs gen) {
auto in = gen(st.range(0));		auto in = gen(st.range(0));
const auto end = in.end();		const auto end = in.end();
while (st.KeepRunning()) {		while (st.KeepRunning()) {
c.clear();		c.clear();
for (auto it = in.begin(); it != end; ++it) {		for (auto it = in.begin(); it != end; ++it) {
benchmark::DoNotOptimize(&(c.insert(it).first));		benchmark::DoNotOptimize(&(c.insert(it).first));

libcxx/benchmarks/vector_operations.bench.cpp

Show All 33 Lines	BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_char, std::vector<char>{}, getRandomIntegerInputs<char>)
->Arg(TestNumInputs);		->Arg(TestNumInputs);

BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_size_t, std::vector<size_t>{}, getRandomIntegerInputs<size_t>)		BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_size_t, std::vector<size_t>{}, getRandomIntegerInputs<size_t>)
->Arg(TestNumInputs);		->Arg(TestNumInputs);

BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_string, std::vector<std::string>{}, getRandomStringInputs)		BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_string, std::vector<std::string>{}, getRandomStringInputs)
->Arg(TestNumInputs);		->Arg(TestNumInputs);

		BENCHMARK_CAPTURE(BM_Pushback, vector_int, std::vector<int>{})->Arg(TestNumInputs);

BENCHMARK_MAIN();		BENCHMARK_MAIN();

libcxx/include/vector

Show First 20 Lines • Show All 827 Lines • ▼ Show 20 Lines	private:
{		{
size_type __old_size = size();		size_type __old_size = size();
__base_destruct_at_end(__new_last);		__base_destruct_at_end(__new_last);
__annotate_shrink(__old_size);		__annotate_shrink(__old_size);
}		}

 template <class _Up>
 _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
-inline void __push_back_slow_path(_Up&& __x);
+inline pointer __push_back_slow_path(_Up&& __x);

 template <class... _Args>
 _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
-inline void __emplace_back_slow_path(_Args&&... __args);
+inline pointer __emplace_back_slow_path(_Args&&... __args);

// The following functions are no-ops outside of AddressSanitizer mode.		// The following functions are no-ops outside of AddressSanitizer mode.
// We call annotations for every allocator, unless explicitly disabled.		// We call annotations for every allocator, unless explicitly disabled.
//		//
// To disable annotations for a particular allocator, change value of		// To disable annotations for a particular allocator, change value of
// __asan_annotate_container_with_allocator to false.		// __asan_annotate_container_with_allocator to false.
// For more details, see the "Using libc++" documentation page or		// For more details, see the "Using libc++" documentation page or
// the documentation for __sanitizer_annotate_contiguous_container.		// the documentation for __sanitizer_annotate_contiguous_container.
▲ Show 20 Lines • Show All 85 Lines • ▼ Show 20 Lines	#endif
const pointer& __end_cap() const _NOEXCEPT		const pointer& __end_cap() const _NOEXCEPT
{return this->__end_cap_.first();}		{return this->__end_cap_.first();}

_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI		_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
void __clear() _NOEXCEPT {__base_destruct_at_end(this->__begin_);}		void __clear() _NOEXCEPT {__base_destruct_at_end(this->__begin_);}

_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI		_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
void __base_destruct_at_end(pointer __new_last) _NOEXCEPT {		void __base_destruct_at_end(pointer __new_last) _NOEXCEPT {
pointer __soon_to_be_end = this->__end_;		pointer __soon_to_be_end = this->__end_;
while (__new_last != __soon_to_be_end)		while (__new_last != __soon_to_be_end)
__alloc_traits::destroy(__alloc(), std::__to_address(--__soon_to_be_end));		__alloc_traits::destroy(__alloc(), std::__to_address(--__soon_to_be_end));
this->__end_ = __new_last;		this->__end_ = __new_last;
}		}

_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI		_LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
void __copy_assign_alloc(const vector& __c)		void __copy_assign_alloc(const vector& __c)
{__copy_assign_alloc(__c, integral_constant<bool,		{__copy_assign_alloc(__c, integral_constant<bool,
▲ Show 20 Lines • Show All 653 Lines • ▼ Show 20 Lines	#ifndef _LIBCPP_HAS_NO_EXCEPTIONS
}		}
#endif // _LIBCPP_HAS_NO_EXCEPTIONS		#endif // _LIBCPP_HAS_NO_EXCEPTIONS
}		}
}		}

 template <class _Tp, class _Allocator>
 template <class _Up>
 _LIBCPP_CONSTEXPR_SINCE_CXX20
-void
+typename vector<_Tp, _Allocator>::pointer
 vector<_Tp, _Allocator>::__push_back_slow_path(_Up&& __x)
 {
     allocator_type& __a = this->__alloc();
     __split_buffer<value_type, allocator_type&> __v(__recommend(size() + 1), size(), __a);
     // __v.push_back(std::forward<_Up>(__x));
     __alloc_traits::construct(__a, std::__to_address(__v.__end_), std::forward<_Up>(__x));
     __v.__end_++;
     __swap_out_circular_buffer(__v);
+    return this->__end_;
 }

 template <class _Tp, class _Allocator>
 _LIBCPP_CONSTEXPR_SINCE_CXX20
 inline _LIBCPP_HIDE_FROM_ABI
 void
 vector<_Tp, _Allocator>::push_back(const_reference __x)
 {
-    if (this->__end_ != this->__end_cap())
-    {
-        __construct_one_at_end(__x);
-    }
-    else
-        __push_back_slow_path(__x);
+    pointer __end = this->__end_;
+    if (__end < this->__end_cap()) {
+        __construct_one_at_end(__x);
+        ++__end;
+    } else {
+        __end = __push_back_slow_path(__x);
+    }
+    this->__end_ = __end;
 }

 template <class _Tp, class _Allocator>
 _LIBCPP_CONSTEXPR_SINCE_CXX20
 inline _LIBCPP_HIDE_FROM_ABI
 void
 vector<_Tp, _Allocator>::push_back(value_type&& __x)
 {
-    if (this->__end_ < this->__end_cap())
-    {
-        __construct_one_at_end(std::move(__x));
-    }
-    else
-        __push_back_slow_path(std::move(__x));
+    pointer __end = this->__end_;
+    if (__end < this->__end_cap()) {
+        __construct_one_at_end(std::move(__x));
+        ++__end;
+    } else {
+        __end = __push_back_slow_path(std::move(__x));
+    }
+    this->__end_ = __end;
 }

 template <class _Tp, class _Allocator>
 template <class... _Args>
 _LIBCPP_CONSTEXPR_SINCE_CXX20
-void
+typename vector<_Tp, _Allocator>::pointer
 vector<_Tp, _Allocator>::__emplace_back_slow_path(_Args&&... __args)
 {
     allocator_type& __a = this->__alloc();
     __split_buffer<value_type, allocator_type&> __v(__recommend(size() + 1), size(), __a);
     // __v.emplace_back(std::forward<_Args>(__args)...);
     __alloc_traits::construct(__a, std::__to_address(__v.__end_), std::forward<_Args>(__args)...);
     __v.__end_++;
     __swap_out_circular_buffer(__v);
+    return this->__end_;
 }

 template <class _Tp, class _Allocator>
 template <class... _Args>
 _LIBCPP_CONSTEXPR_SINCE_CXX20
 inline
 #if _LIBCPP_STD_VER >= 17
 typename vector<_Tp, _Allocator>::reference
 #else
 void
 #endif
 vector<_Tp, _Allocator>::emplace_back(_Args&&... __args)
 {
-    if (this->__end_ < this->__end_cap())
-    {
-        __construct_one_at_end(std::forward<_Args>(__args)...);
-    }
-    else
-        __emplace_back_slow_path(std::forward<_Args>(__args)...);
+    pointer __end = this->__end_;
+    if (__end < this->__end_cap()) {
+        __construct_one_at_end(std::forward<_Args>(__args)...);
+        ++__end;
+    } else {
+        __end = __emplace_back_slow_path(std::forward<_Args>(__args)...);
+    }
+    this->__end_ = __end;
 #if _LIBCPP_STD_VER >= 17
-    return this->back();
+    return *(__end - 1);
 #endif
 }

template <class _Tp, class _Allocator>		template <class _Tp, class _Allocator>
_LIBCPP_CONSTEXPR_SINCE_CXX20		_LIBCPP_CONSTEXPR_SINCE_CXX20
inline		inline
void		void
vector<_Tp, _Allocator>::pop_back()		vector<_Tp, _Allocator>::pop_back()

This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize vector push_back to avoid continuous load and store of end pointer
ClosedPublic
