Details

Reviewers

EricWF
mvels

Group Reviewers

Restricted Project

Commits

rG6fe4e033f07d: [libc++] Optimize vector push_back to avoid continuous load and store of end…

Summary

Credits: this change is based on analysis and a proof of concept by
gerbens@google.com.

Previously, the compiler lost track of end, because 'this' and other
references may escape beyond the compiler's scope. This can be seen in the
generated assembly:

16.28 │200c80:   mov     %r15d,(%rax)
60.87 │200c83:   add     $0x4,%rax
      │200c87:   mov     %rax,-0x38(%rbp)
 0.03 │200c8b: → jmpq    200d4e
 ...
 ...
 1.69 │200d4e:   cmp     %r15d,%r12d
      │200d51: → je      200c40
16.34 │200d57:   inc     %r15d
 0.05 │200d5a:   mov     -0x38(%rbp),%rax
 3.27 │200d5e:   mov     -0x30(%rbp),%r13
 1.47 │200d62:   cmp     %r13,%rax
      │200d65: → jne     200c80

We fix this by always explicitly storing the locally loaded end pointer
back at the end of push_back. This generates some slight source 'noise',
but creates nice and compact fast-path code, i.e.:

32.64 │200760:   mov    %r14d,(%r12)
 9.97 │200764:   add    $0x4,%r12
 6.97 │200768:   mov    %r12,-0x38(%rbp)
32.17 │20076c:   add    $0x1,%r14d
 2.36 │200770:   cmp    %r14d,%ebx
      │200773: → je     200730
 8.98 │200775:   mov    -0x30(%rbp),%r13
 6.75 │200779:   cmp    %r13,%r12
      │20077c: → jne    200760

Now there is a single store for the push_back value (as before), and a
single store for the end without a reload (dependency).

For fully local vectors (i.e., not referenced elsewhere), the capacity
load and store inside the loop could also be removed, but this requires
more substantial refactoring inside vector.
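
The idea in the summary can be sketched in isolation. The following is a minimal illustration, not the libc++ implementation: `ToyVector`, `grow_and_append`, and the int-only, realloc-based storage are all assumptions made for brevity. It shows push_back loading the end pointer into a local once and writing it back with one explicit store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical toy container (not libc++'s vector): int-only, no exceptions.
class ToyVector {
public:
    ToyVector() = default;
    ToyVector(const ToyVector&) = delete;
    ToyVector& operator=(const ToyVector&) = delete;
    ~ToyVector() { std::free(begin_); }

    void push_back(int value) {
        int* end = end_;                   // load the end pointer once into a local
        if (end != end_cap_) {
            *end = value;                  // fast path: store the value in place
            ++end;
        } else {
            end = grow_and_append(value);  // slow path hands back the new end
        }
        end_ = end;                        // single explicit store back to the member
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }

    int operator[](std::size_t i) const {
        assert(i < size());
        return begin_[i];
    }

private:
    // Slow path: reallocates, appends the value, and returns the new end.
    // It deliberately leaves end_ alone; push_back performs that store.
    int* grow_and_append(int value) {
        std::size_t sz = size();
        std::size_t cap = sz ? sz * 2 : 2;
        int* p = static_cast<int*>(std::realloc(begin_, cap * sizeof(int)));
        if (!p) std::abort();
        begin_ = p;
        end_cap_ = p + cap;
        p[sz] = value;
        return p + sz + 1;
    }

    int* begin_ = nullptr;
    int* end_ = nullptr;
    int* end_cap_ = nullptr;
};
```

Pushing values in a loop and then reading them back through `size()` and `operator[]` exercises both paths. Whether a given compiler actually keeps `end` in a register here depends on inlining and escape analysis, so the assembly should be inspected rather than assumed.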

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mvels created this revision. May 26 2020, 1:57 PM

Herald added a project: Restricted Project. May 26 2020, 1:57 PM

Herald added a reviewer: Restricted Project.

Herald added a subscriber: libcxx-commits.

mvels added a reviewer: EricWF. May 26 2020, 1:58 PM

mvels edited the summary of this revision.

Harbormaster failed remote builds in B57938: Diff 266328! May 26 2020, 3:17 PM

Those are great numbers!

If I understand it correctly this patch has two parts:

1. Pass this->__end_ to the helper functions
2. Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

Finally, your __construct_one_at helper function diverges from the previous code.

The ASAN annotation is new and not done in the slow path.
__construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

Pass this->__end_ to the helper functions

Return the new end pointer from those helpers

I am skeptical why step 2 is needed at all. You never remove setting of this->__end_. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?

If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.

I would greatly prefer it if you would directly return in both the fast and the slow path.

The story is somewhat complicated. The compiler will optimize local types and keep them register allocated if possible. The 'if possible' here is not absolute, but more an 'if the compiler deems it possible'.
As the main loop here is mostly trivial, and the vector has 3 words for state, we can easily 'see' it is possible to keep the vector state in 3 registers (if we include begin_).
For a compiler this is harder, as the following factors come in:

the logic involves some non-inlined code that is beyond the compiler's view and may affect this / state
state is modified at an inlining depth the compiler no longer tracks in full (tracking state is hard)
'this' or other state 'escapes', i.e., some code path could escape into globals, functions, etc., and the compiler can't prove that the state of the vector is not externally observable
compiler heuristics are required to determine that the slow path is unlikely and/or register allocation is easily preserved
etc.

For an example of possible variants of making life 'as easy as possible' for the compiler, see https://pastebin.com/5YjHbSaC

Running these benchmarks:

BM_Pushback<Vector<1>>/4k         1.67ns ± 8%
BM_Pushback<Vector<2>>/4k         0.60ns ± 7%
BM_Pushback<Vector<3>>/4k         0.35ns ± 7%

You'll see the top 2 are basically the numbers I posted earlier. The 3rd option is rewriting the logic such that the slow path is executed without 'this' state, but purely as inputs / outputs:

void push_back(int value) {
  pointer end = end_;
  if (end == end_cap_) {
    size_type sz = end_ - begin_;
    size_type n = sz ? sz * 2 : 2;
    begin_ = __push_back_slow_path(begin_, sz, n, value);
    end_ = end = begin_ + sz;
    end_cap_ = begin_ + n;
  } else {
    *end = value;
    end_ = end + 1;
  }
}

Which now has the fast path completely register allocated:

20.20 │ ca:   cmp   %r13,%r14
      │     ↑ je    90
20.16 │       mov   %r15d,(%r14)
19.26 │       add   $0x4,%r14
21.36 │       add   $0x1,%r15d
      │       cmp   %r15d,%r12d
18.63 │     ↑ jne   ca

I cheated here in two ways. First, I elided allocator state. The default std::allocator is stateless; however, a generic implementation needs to pass allocator references along the call paths. This is the part where the refactoring gets harder, as the code needs to be split into 'default allocator' and 'stateful allocator' paths, where the latter will basically have option 2 performance (only caching end_ state in a register). Second, I used -fno-exceptions, which makes optimizing this easier (there are no thousands of early exits), especially when it comes to setting / swapping state on grow events.
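
For readers who want to try the third variant, here is a self-contained sketch under the same two simplifications (no allocator state, no exceptions). `IntVector` and this body of `push_back_slow_path` are hypothetical; the actual helper lives in the pastebin linked above and is not reproduced in this thread. The sketch assumes the slow path also writes the new value into the fresh buffer, so the caller sets the end to one past index `sz`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical slow path: it receives the vector state purely as inputs and
// returns the new buffer. It never sees the vector object itself.
inline int* push_back_slow_path(int* begin, std::size_t sz, std::size_t n, int value) {
    assert(n > sz);
    int* fresh = static_cast<int*>(std::malloc(n * sizeof(int)));
    if (!fresh) std::abort();
    if (sz) std::memcpy(fresh, begin, sz * sizeof(int));  // move old elements
    fresh[sz] = value;                                     // append the new one
    std::free(begin);
    return fresh;
}

class IntVector {
public:
    IntVector() = default;
    IntVector(const IntVector&) = delete;
    IntVector& operator=(const IntVector&) = delete;
    ~IntVector() { std::free(begin_); }

    void push_back(int value) {
        int* end = end_;
        if (end == end_cap_) {
            std::size_t sz = static_cast<std::size_t>(end - begin_);
            std::size_t n = sz ? sz * 2 : 2;
            begin_ = push_back_slow_path(begin_, sz, n, value);
            end_ = begin_ + sz + 1;  // assumption: slow path stored the value at index sz
            end_cap_ = begin_ + n;
        } else {
            *end = value;
            end_ = end + 1;
        }
    }

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
    int operator[](std::size_t i) const { return begin_[i]; }

private:
    int* begin_ = nullptr;
    int* end_ = nullptr;
    int* end_cap_ = nullptr;
};
```

Because the helper takes and returns plain values, no pointer to the vector is passed across the call, which is the property the comment above relies on. A stateful allocator would have to be threaded through as an extra parameter, which is where the refactoring cost mentioned above comes from.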

Finally, your __construct_one_at helper function diverges from the previous code.

The ASAN annotation is new and not done in the slow path.

__construct_at_end has the additional _ConstructTransaction __tx(*this, 1); I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

The purpose of the Transaction class is to track the 'end' pointer (in the split buffer in these cases) for the added element, and to do some accounting (size grew) used in ASAN builds, which defines how many items are readable.
The 'construct one at end' case is easy, as there is only one failure point: constructing the last element in place. There is no need for any complicated state, so we can simply construct in place and do the 'grow count for asan', which happens in the tx dtor.
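
The transaction pattern described here can be illustrated with a small stand-alone sketch. This is a simplified stand-in, not libc++'s `_ConstructTransaction`: `Buffer`, its `annotated` counter (a placeholder for the ASAN accounting), and the `fail` flag that simulates a throwing constructor are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical fixed-capacity storage standing in for the vector's buffer.
struct Buffer {
    static constexpr std::size_t kCapacity = 8;
    Buffer() = default;
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    int storage[kCapacity];
    int* end = storage;         // one past the last constructed element
    std::size_t annotated = 0;  // placeholder for "how many items are readable"
};

// RAII guard: works on a local position and commits it to the buffer in the
// destructor, so the buffer's end only covers elements that were constructed.
struct ConstructTransaction {
    ConstructTransaction(Buffer& b, std::size_t n)
        : buf(b), pos(b.end), new_end(b.end + n) {
        assert(new_end <= b.storage + Buffer::kCapacity);
    }
    ~ConstructTransaction() {
        buf.annotated += static_cast<std::size_t>(pos - buf.end);
        buf.end = pos;  // commit whatever was actually constructed
    }
    ConstructTransaction(const ConstructTransaction&) = delete;
    ConstructTransaction& operator=(const ConstructTransaction&) = delete;

    Buffer& buf;
    int* pos;
    int* const new_end;
};

// Single-element case: the only failure point is the construction itself.
inline void construct_one_at_end(Buffer& b, int value, bool fail) {
    ConstructTransaction tx(b, 1);
    if (fail) throw std::runtime_error("simulated throwing constructor");
    *tx.pos = value;
    ++tx.pos;  // only advanced after construction succeeded
}
```

If construction throws, `pos` was never advanced, so the destructor commits the old end and the buffer is left unchanged. With a single failure point the same guarantee can be had without the guard by constructing first and updating the end afterwards, which is the simplification argued for above.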

Fixed AsanTransaction for single 'construct_at' use case

White space / comment clean up

Harbormaster failed remote builds in B58116: Diff 266669! May 27 2020, 2:43 PM

Harbormaster failed remote builds in B58112: Diff 266660! May 28 2020, 2:08 AM

@mvels Do we have macro-benchmark results for this change?

libcxx/include/vector
949	If this is the only place `_AsanTransaction` is used, does it need to be this complicated?
1701–1702	Could we rename `__end` here, because it only sometimes stores the end iterator. Maybe `__last` or `__back`?

This revision now requires changes to proceed. May 29 2020, 7:45 AM

renamed end --> pos

libcxx/include/vector
949	An alternative (imho more elegant) way to do this is in https://reviews.llvm.org/D80827. I like this better, as reducing the ASAN hooks to empty thunks could remove much more #ifndef crud, and it evaporates entirely (guaranteed) when compiling with exceptions. (RAII has more cost than except if not part of a 'final' execution path)

Harbormaster failed remote builds in B58471: Diff 267320! May 29 2020, 12:33 PM

Is there still interest in pursuing this? If not, could you abandon this?

Herald added a project: Restricted Project. Aug 31 2023, 5:26 PM

Herald added a subscriber: sunshaoce.

[Github PR transition cleanup]

Commandeering to finish.

Rebase and change a few things in the patch. We're still getting a pretty awesome speedup:

----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_Pushback/vector_int/1024       2.07 ns         2.07 ns    337961984    BEFORE
BM_Pushback/vector_int/1024      0.549 ns        0.549 ns   1000000512    AFTER
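
The benchmark source is not included in this thread. As a rough way to reproduce a per-push_back figure without a benchmarking library, the sketch below times repeated fills of a `std::vector<int>` with `std::chrono`. The function name `ns_per_push_back` and the choice to construct a fresh vector each round are assumptions, so its numbers include reallocation cost and will not match the table above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: average nanoseconds per push_back when filling a
// vector with n ints, repeated `rounds` times. Includes growth/reallocation.
inline double ns_per_push_back(std::size_t n, std::size_t rounds) {
    assert(n > 0 && rounds > 0);
    using clock = std::chrono::steady_clock;

    std::size_t checksum = 0;  // keeps the loop's result observable
    const auto start = clock::now();
    for (std::size_t r = 0; r < rounds; ++r) {
        std::vector<int> v;
        for (std::size_t i = 0; i < n; ++i) {
            v.push_back(static_cast<int>(i));
        }
        checksum += v.size();
    }
    const auto stop = clock::now();
    assert(checksum == n * rounds);

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(n * rounds);
}
```

Calling `ns_per_push_back(1024, 100000)` in an optimized build against the library before and after the patch gives a coarse comparison. A dedicated benchmark framework handles warm-up, variance, and compiler elision more carefully and is the better tool for publishable numbers.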

Harbormaster completed remote builds in B257191: Diff 556726. Sep 13 2023, 4:55 PM

Fix formatting.

Harbormaster completed remote builds in B257219: Diff 556774. Sep 14 2023, 9:30 AM

A little more digging into this assembly can be found here: https://godbolt.org/z/TrWY7YMWW (or with LLVM IR: https://godbolt.org/z/69xzf7r74)

I think this change is good and safe. Still pondering the "how to test this" question.

In D80588#4648327, @EricWF wrote:

A little more digging into this assembly can be found here: https://godbolt.org/z/TrWY7YMWW (or with LLVM IR: https://godbolt.org/z/69xzf7r74)

I think this change is good and safe. Still pondering the "how to test this" question.

The correctness tests are already handled by our test suite. I agree we don't have a good way of testing performance changes right now, and that's a problem. I think I'd rather not block this patch on that issue since a lot of patches are in the same boat and we both agree this is a good change.

This revision was not accepted when it landed; it landed in state Needs Review. Oct 2 2023, 6:13 AM

Closed by commit rG6fe4e033f07d: [libc++] Optimize vector push_back to avoid continuous load and store of end… (authored by mvels, committed by ldionne).

This revision was automatically updated to reflect the committed changes.

ldionne added a commit: rG6fe4e033f07d: [libc++] Optimize vector push_back to avoid continuous load and store of end….

Diff 266328

libcxx/include/vector

[... 833 lines not shown; enclosing context: void __destruct_at_end(pointer __new_last) _NOEXCEPT ...]
         size_type __old_size = size();
         __base::__destruct_at_end(__new_last);
         __annotate_shrink(__old_size);
     }

 #ifndef _LIBCPP_CXX03_LANG
     template <class _Up>
     _LIBCPP_INLINE_VISIBILITY
-    inline void __push_back_slow_path(_Up&& __x);
+    inline pointer __push_back_slow_path(_Up&& __x);

     template <class... _Args>
     _LIBCPP_INLINE_VISIBILITY
-    inline void __emplace_back_slow_path(_Args&&... __args);
+    inline pointer __emplace_back_slow_path(_Args&&... __args);
 #else
     template <class _Up>
     _LIBCPP_INLINE_VISIBILITY
-    inline void __push_back_slow_path(_Up& __x);
+    inline pointer __push_back_slow_path(_Up& __x);
 #endif

     // The following functions are no-ops outside of AddressSanitizer mode.
     // We call annotatations only for the default Allocator because other allocators
     // may not meet the AddressSanitizer alignment constraints.
     // See the documentation for __sanitizer_annotate_contiguous_container for more details.
 #ifndef _LIBCPP_HAS_NO_ASAN
     void __annotate_contiguous_container(const void *__beg, const void *__end,

[... 57 lines not shown ...]
 #endif

   private:
     _ConstructTransaction(_ConstructTransaction const&) = delete;
     _ConstructTransaction& operator=(_ConstructTransaction const&) = delete;
   };

   template <class ..._Args>
   _LIBCPP_INLINE_VISIBILITY
+  void __construct_one_at(pointer __pos, _Args&& ...__args) {
+    __alloc_traits::construct(this->__alloc(), _VSTD::__to_address(__pos),
+        _VSTD::forward<_Args>(__args)...);
+#ifndef _LIBCPP_HAS_NO_ASAN
+    __annotate_increase(__n);
+#endif
+  }
+
+  template <class ..._Args>
+  _LIBCPP_INLINE_VISIBILITY
   void __construct_one_at_end(_Args&& ...__args) {
     _ConstructTransaction __tx(*this, 1);
     __alloc_traits::construct(this->__alloc(), _VSTD::__to_address(__tx.__pos_),
         _VSTD::forward<_Args>(__args)...);
     ++__tx.__pos_;
   }
 };

 #ifndef _LIBCPP_HAS_NO_DEDUCTION_GUIDES
 template<class _InputIterator,
          class _Alloc = typename std::allocator<typename iterator_traits<_InputIterator>::value_type>,
          class = typename enable_if<__is_allocator<_Alloc>::value, void>::type
          >
 vector(_InputIterator, _InputIterator)
   -> vector<typename iterator_traits<_InputIterator>::value_type, _Alloc>;

 template<class _InputIterator,
          class _Alloc,
          class = typename enable_if<__is_allocator<_Alloc>::value, void>::type
          >
 vector(_InputIterator, _InputIterator, _Alloc)
   -> vector<typename iterator_traits<_InputIterator>::value_type, _Alloc>;
 #endif

[... 661 lines not shown; enclosing context: #ifndef _LIBCPP_NO_EXCEPTIONS ...]
         {
         }
 #endif // _LIBCPP_NO_EXCEPTIONS
     }
 }

 template <class _Tp, class _Allocator>
 template <class _Up>
-void
+typename vector<_Tp, _Allocator>::pointer
 #ifndef _LIBCPP_CXX03_LANG
 vector<_Tp, _Allocator>::__push_back_slow_path(_Up&& __x)
 #else
 vector<_Tp, _Allocator>::__push_back_slow_path(_Up& __x)
 #endif
 {
     allocator_type& __a = this->__alloc();
     __split_buffer<value_type, allocator_type&> __v(__recommend(size() + 1), size(), __a);
     // __v.push_back(_VSTD::forward<_Up>(__x));
     __alloc_traits::construct(__a, _VSTD::__to_address(__v.__end_), _VSTD::forward<_Up>(__x));
     __v.__end_++;
     __swap_out_circular_buffer(__v);
+    return this->__end_;
 }

 template <class _Tp, class _Allocator>
 inline _LIBCPP_INLINE_VISIBILITY
 void
 vector<_Tp, _Allocator>::push_back(const_reference __x)
 {
-    if (this->__end_ != this->__end_cap())
+    pointer __end = this->__end_;
+    if (__end != this->__end_cap())
     {
-        __construct_one_at_end(__x);
+        __construct_one_at(__end, __x);
     }
-    else
-        __push_back_slow_path(__x);
+    else {
+        __end = __push_back_slow_path(__x) - 1;
+    }
+    this->__end_ = __end + 1;
 }

 #ifndef _LIBCPP_CXX03_LANG

 template <class _Tp, class _Allocator>
 inline _LIBCPP_INLINE_VISIBILITY
 void
 vector<_Tp, _Allocator>::push_back(value_type&& __x)
 {
-    if (this->__end_ < this->__end_cap())
+    pointer __end = this->__end_;
+    if (__end != this->__end_cap())
     {
-        __construct_one_at_end(_VSTD::move(__x));
+        __construct_one_at(__end, _VSTD::move(__x));
     }
-    else
-        __push_back_slow_path(_VSTD::move(__x));
+    else {
+        __end = __push_back_slow_path(std::move(__x)) - 1;
+    }
+    this->__end_ = __end + 1;
 }

 template <class _Tp, class _Allocator>
 template <class... _Args>
-void
+typename vector<_Tp, _Allocator>::pointer
 vector<_Tp, _Allocator>::__emplace_back_slow_path(_Args&&... __args)
 {
     allocator_type& __a = this->__alloc();
     __split_buffer<value_type, allocator_type&> __v(__recommend(size() + 1), size(), __a);
     // __v.emplace_back(_VSTD::forward<_Args>(__args)...);
     __alloc_traits::construct(__a, _VSTD::__to_address(__v.__end_), _VSTD::forward<_Args>(__args)...);
     __v.__end_++;
     __swap_out_circular_buffer(__v);
+    return this->__end_;
 }

 template <class _Tp, class _Allocator>
 template <class... _Args>
 inline
 #if _LIBCPP_STD_VER > 14
 typename vector<_Tp, _Allocator>::reference
 #else
 void
 #endif
 vector<_Tp, _Allocator>::emplace_back(_Args&&... __args)
 {
-    if (this->__end_ < this->__end_cap())
+    pointer __end = this->__end_;
+    if (__end < this->__end_cap())
     {
-        __construct_one_at_end(_VSTD::forward<_Args>(__args)...);
+        __construct_one_at(__end, _VSTD::forward<_Args>(__args)...);
     }
-    else
-        __emplace_back_slow_path(_VSTD::forward<_Args>(__args)...);
+    else {
+        __end = __emplace_back_slow_path(_VSTD::forward<_Args>(__args)...) - 1;
+    }
+    this->__end_ = __end + 1;
 #if _LIBCPP_STD_VER > 14
-    return this->back();
+    return *__end;
 #endif
 }

 #endif // !_LIBCPP_CXX03_LANG

 template <class _Tp, class _Allocator>
 inline
 void
[... 1,716 lines not shown ...]

This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize vector push_back to avoid continuous load and store of end pointer
ClosedPublic
