This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize std::rotate
Needs Review · Public

Authored by philnik on Apr 20 2022, 1:37 PM.

Details

Reviewers
Mordante
var-const
Group Reviewers
Restricted Project
Summary

This removes the random access std::rotate "optimization".

Fixes https://github.com/llvm/llvm-project/issues/54949
Fixes https://github.com/llvm/llvm-project/issues/39644

The bug report proposes to use a different rotate algorithm, but that should be done in a different patch.

Diff Detail

Event Timeline

philnik created this revision. · Apr 20 2022, 1:37 PM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 20 2022, 1:37 PM
philnik requested review of this revision. · Apr 20 2022, 1:37 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript

Here are the numbers:

-------------------------------------------------------------------------------
Benchmark                                                  old Time    new Time
-------------------------------------------------------------------------------
BM_Rotate_uint32_Random_1                                   10.6 ns     10.7 ns
BM_Rotate_uint32_Random_4                                   24.3 ns     23.3 ns
BM_Rotate_uint32_Random_16                                  43.7 ns     32.9 ns
BM_Rotate_uint32_Random_64                                   116 ns     64.4 ns
BM_Rotate_uint32_Random_256                                  423 ns      181 ns
BM_Rotate_uint32_Random_1024                                1574 ns      612 ns
BM_Rotate_uint32_Random_16384                              26318 ns     8900 ns
BM_Rotate_uint32_Random_262144                            450038 ns   142255 ns
BM_Rotate_uint64_Random_1                                   10.6 ns     10.6 ns
BM_Rotate_uint64_Random_4                                   24.3 ns     23.2 ns
BM_Rotate_uint64_Random_16                                  44.3 ns     34.4 ns
BM_Rotate_uint64_Random_64                                   117 ns     67.7 ns
BM_Rotate_uint64_Random_256                                  424 ns      193 ns
BM_Rotate_uint64_Random_1024                                1589 ns      653 ns
BM_Rotate_uint64_Random_16384                              26920 ns     9357 ns
BM_Rotate_uint64_Random_262144                            460469 ns   149440 ns
BM_Rotate_pair<uint32, uint32>_Random_1                     10.6 ns     10.7 ns
BM_Rotate_pair<uint32, uint32>_Random_4                     23.2 ns     23.2 ns
BM_Rotate_pair<uint32, uint32>_Random_16                    36.9 ns     35.2 ns
BM_Rotate_pair<uint32, uint32>_Random_64                    80.7 ns     80.3 ns
BM_Rotate_pair<uint32, uint32>_Random_256                    268 ns      268 ns
BM_Rotate_pair<uint32, uint32>_Random_1024                  1010 ns     1009 ns
BM_Rotate_pair<uint32, uint32>_Random_16384                15728 ns    15734 ns
BM_Rotate_pair<uint32, uint32>_Random_262144              251816 ns   251948 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_1            10.6 ns     10.6 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_4            24.1 ns     23.7 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_16           40.1 ns     40.7 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_64            115 ns      113 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_256           412 ns      405 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_1024         1548 ns     1532 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_16384       24555 ns    23936 ns
BM_Rotate_tuple<uint32, uint64, uint32>_Random_262144     395888 ns   386210 ns
BM_Rotate_string_Random_1                                   10.6 ns     10.6 ns
BM_Rotate_string_Random_4                                   48.7 ns     47.6 ns
BM_Rotate_string_Random_16                                   124 ns      124 ns
BM_Rotate_string_Random_64                                   288 ns      287 ns
BM_Rotate_string_Random_256                                  756 ns      758 ns
BM_Rotate_string_Random_1024                                2278 ns     2266 ns
BM_Rotate_string_Random_16384                              33052 ns    32758 ns
BM_Rotate_string_Random_262144                            533664 ns   533456 ns
BM_Rotate_float_Random_1                                    10.6 ns     10.6 ns
BM_Rotate_float_Random_4                                    25.2 ns     23.9 ns
BM_Rotate_float_Random_16                                   45.3 ns     33.4 ns
BM_Rotate_float_Random_64                                    115 ns     61.2 ns
BM_Rotate_float_Random_256                                   424 ns      167 ns
BM_Rotate_float_Random_1024                                 1578 ns      561 ns
BM_Rotate_float_Random_16384                               24589 ns     8015 ns
BM_Rotate_float_Random_262144                             409517 ns   125895 ns
philnik edited the summary of this revision. · Apr 20 2022, 1:39 PM

@philnik Note that the issue also contains a comment (https://github.com/llvm/llvm-project/issues/54949#issuecomment-1101619447) indicating that the current optimization can be faster for types that aren't cheap to move. Can we try to detect that and still use the current optimization in that case? Also, can you please try a benchmark with a non-trivially-copyable type and see what the numbers look like?

philnik edited the summary of this revision. · Apr 23 2022, 11:03 AM

It looks like it is indeed a bit faster for structs that are expensive to move. I'll enable the optimization for non-trivially-move-constructible types and for types larger than 32 bytes.

------------------------------------------------------------------
Benchmark                                    old Time     new Time
------------------------------------------------------------------
BM_Rotate_ExpensiveToMove_Random_1            10.4 ns      10.5 ns
BM_Rotate_ExpensiveToMove_Random_4             300 ns       301 ns
BM_Rotate_ExpensiveToMove_Random_16           1299 ns      1363 ns
BM_Rotate_ExpensiveToMove_Random_64           5401 ns      6023 ns
BM_Rotate_ExpensiveToMove_Random_256         21964 ns     24717 ns
BM_Rotate_ExpensiveToMove_Random_1024        89849 ns    101480 ns
BM_Rotate_ExpensiveToMove_Random_16384     1550789 ns   1732624 ns
BM_Rotate_ExpensiveToMove_Random_262144   30205994 ns  35608461 ns

@var-const The non-trivially-copyable part should be covered by string, or did you have something more specific in mind?

philnik updated this revision to Diff 424741. · Apr 23 2022, 11:59 AM

I noticed that it wasn't actually enabled for non-trivial types, so I only restricted it to types larger than 32 bytes. I'm not sure we want to keep this in, though: it's a large part of the code while only being enabled for a very small set of types, and it only makes a relatively small difference performance-wise. If we could find a good heuristic for enabling it for non-trivial types I would be happier to keep it in, although I don't think we can have a good heuristic for this kind of thing.

I'd be fine with this since it addresses the underwhelming performance for simple types like int, which is super important. However, I would prefer if we instead went for a better algorithm directly, like the swap/grail rotate algorithm mentioned in https://github.com/llvm/llvm-project/issues/54949#issue-1206295098.

libcxx/include/__algorithm/rotate.h
118–119

I don't think constexpr adds a lot of value here since the compiler will definitely fold this anyway, and it's kind of weird to have _LIBCPP_CONSTEXPR_AFTER_CXX14 in that location.

118–119

I would also do something like

const bool __is_expensive_to_move = sizeof(value_type) > 32;
if (is_trivially_foo<...> && __is_expensive_to_move) { ... }

That way, the code is somewhat self-documenting.

I'd be fine with this since it addresses the underwhelming performance for simple types like int, which is super important. However, I would prefer if we instead went for a better algorithm directly, like the swap/grail rotate algorithm mentioned in https://github.com/llvm/llvm-project/issues/54949#issue-1206295098.

I planned to do a follow-up for that anyway. Would it be OK if I just nuke the current implementation and add the ranges API in the same patch?

I'd be fine with this since it addresses the underwhelming performance for simple types like int, which is super important. However, I would prefer if we instead went for a better algorithm directly, like the swap/grail rotate algorithm mentioned in https://github.com/llvm/llvm-project/issues/54949#issue-1206295098.

I planned to do a follow-up for that anyway. Would it be OK if I just nuke the current implementation and add the ranges API in the same patch?

Yes, if we can have perf benchmarks for before and after. If that's easy to do, you could also first nuke the current implementation and replace it with one that is ranges::-friendly, and then just add the tiny ranges::rotate overlay on top (+ tests) in a separate patch. Both ways are acceptable, though.

philnik updated this revision to Diff 447827. · Jul 26 2022, 2:10 PM
  • Upload current status for Konstantin