This is an archive of the discontinued LLVM Phabricator instance.

pstl/include/pstl/internal/parallel_backend_omp.h
556–558	As long as you're changing these anyway, might as well simplify.
569	New lines 577, 570, 573: seems like you should undo these whitespace diffs (manually if necessary). I don't suppose it would make sense to use a heap instead of linearly scanning the vector in `__update_largest_item` every time one is inserted? I mean, that's the whole point of non-parallel `partial_sort`... but I guess the idea here is that rearranging the elements once would be less cache-friendly than linearly scanning them multiple times?
600–601	if you wanted to keep this simple
619	That this ever worked indicates that we could use some more "evil" test cases with overloaded/deleted `operator,` and so on. The iterators in `libcxx/test/support/test_iterators.h` are already "evil" in this way; we'd just need some new tests that use those iterators with parallel algorithms. My ADL senses are tingling with those ADL calls to `begin` and `end`. (Are there tests for parallel `partial_sort`? If it's this buggy, why aren't those tests failing now?)
630	Gratuitous whitespace diff here lost the indentation of the trailing return type.
760	I think this was meant to call `std::iter_swap(__current_item, __swap_item)`, actually. Calling `std::swap` qualified seems like a bug, but calling `std::iter_swap` qualified would be normal practice.
859	Technically, all of these ADL calls are unsafe and should be qualified, like `__omp_backend::__parallel_stable_sort_body(...)`. You're changing so much in the PR currently that I feel like you should go ahead and ADL-proof it too; but I wouldn't make ADL the main point of the PR. ;)

nadiasvertex added inline comments. Apr 3 2021, 4:49 PM

pstl/include/pstl/internal/parallel_backend_omp.h
569	Yes, exactly. I didn't do timed tests, but when I look at managing a heap, operationally it's hard to see how it could be faster than the linear scan, when K is small. If K starts to get large, one wonders why a partial sort is even being done. If I maintain a heap there's a lot of copying and moving going on. This approach lets me overwrite existing items in place and enjoy the benefits of the cache for the search. Also, during the merge operation used to reduce the chunks I have to update the largest item, so I would be doing a lot of heap maintenance. But if you think it would be better, I can look into doing that instead.
600–601	In general I thought it better to use `std::distance`, but if the preference is to use explicit + and - operators, I can do that.
859	Actually, this is all "new" code. I have another parent revision where most of this code is introduced. However, the parallel partial sort is buggy, so I am working on fixing it here. I will make the ADL fixes you suggest. The partial sort now succeeds for most of the collections I try, but it fails to produce correct results in some cases, I think due to some asymmetry in the reduction. I thought I had fixed it (which is why I posted this) until I did more testing.

This work will continue on https://reviews.llvm.org/D99836

Revision Contents

Path: pstl/include/pstl/internal/parallel_backend_omp.h
Size: 116 lines

Diff 335110

pstl/include/pstl/internal/parallel_backend_omp.h

Lint: clang-format not found in user's PATH; not linting file.

  // -*- C++ -*-===----------------------------------------------------------------------===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//

  (534 lines not shown)
  __parallel_transform_scan(_ExecutionPolicy&& __exec, _Index __n, _Up __u, _Tp __init, _Cp __combine, _Rp __brick_reduce,
      return __serial_backend::__parallel_transform_scan(std::forward<_ExecutionPolicy>(__exec), __n, __u, __init,
                                                         __combine, __brick_reduce, __scan);
  }

  //------------------------------------------------------------------------
  // parallel_stable_sort
  //------------------------------------------------------------------------

- template <typename _RandomAccessIterator>
- struct _MinKItems
- {
+ template <typename _RandomAccessIterator> struct _MinKItems {
      using _MinKVector = std::vector<_RandomAccessIterator>;
      _MinKVector __smallest_k_items;
-     typename _MinKVector::iterator __largest_item;
+     std::size_t __largest_item;
-     bool
-     __empty()
-     {
-         return __smallest_k_items.empty();
-     }
+     bool __empty() { return __smallest_k_items.empty(); }
-     auto
+     auto __size() { return std::size(__smallest_k_items); }

Quuxplusone (Unsubmitted, Not Done)

std::size_t __largest_item;

- bool __empty() { return __smallest_k_items.empty(); }

+ bool __empty() const { return __smallest_k_items.empty(); }

- auto __size() { return std::size(__smallest_k_items); }

+ std::size_t __size() const { return __smallest_k_items.size(); }

void __resize(std::size_t new_size) { __smallest_k_items.resize(new_size); }

As long as you're changing these anyway, might as well simplify.


-     __size()
-     {
-         return std::size(__smallest_k_items);
-     }
-     void
-     __resize(std::size_t new_size)
-     {
-         __smallest_k_items.resize(new_size);
-     }
+     void __resize(std::size_t new_size) { __smallest_k_items.resize(new_size); }
+     auto __get_largest_item() { return __smallest_k_items[__largest_item]; }
+     auto __set_largest_item(_RandomAccessIterator it) {
+         __smallest_k_items[__largest_item] = it;
+     }
  };

- template <typename _RandomAccessIterator, typename _Compare>
+ template <typename _RandomAccessIterator, typename _Compare> struct _MinKOp {

Quuxplusone (Unsubmitted, Not Done)

}

};

- template <typename _RandomAccessIterator, typename _Compare> struct _MinKOp {

+ template <typename _RandomAccessIterator, typename _Compare>

+ struct _MinKOp {

_MinKItems<_RandomAccessIterator> &__items;

New lines 577, 570, 573: seems like you should undo these whitespace diffs (manually if necessary).

I don't suppose it would make sense to use a heap instead of linearly scanning the vector in __update_largest_item every time one is inserted? I mean, that's the whole point of non-parallel partial_sort... but I guess the idea here is that rearranging the elements once would be less cache-friendly than linearly scanning them multiple times?


nadiasvertex (Author, Unsubmitted, Done)

Yes, exactly. I didn't do timed tests, but when I look at managing a heap, operationally it's hard to see how it could be faster than the linear scan, when K is small. If K starts to get large, one wonders why a partial sort is even being done. If I maintain a heap there's a lot of copying and moving going on. This approach lets me overwrite existing items in place and enjoy the benefits of the cache for the search.

Also, during the merge operation used to reduce the chunks I have to update the largest item, so I would be doing a lot of heap maintenance. But if you think it would be better, I can look into doing that instead.
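
The trade-off in this thread can be made concrete with a minimal sketch. It assumes plain `int` values rather than the iterators the patch stores, and `smallest_k_scan` and `smallest_k_heap` are hypothetical names that do not appear in pstl. Neither version has been timed here, so this illustrates the two strategies without settling which is faster.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patch): keep the k smallest values by
// overwriting the current largest in place, then rescanning for the new
// largest. Each replacement costs O(k) but touches one contiguous buffer.
inline std::vector<int> smallest_k_scan(const std::vector<int>& in, std::size_t k) {
    assert(k > 0 && k <= in.size());
    std::vector<int> kept(in.begin(), in.begin() + static_cast<std::ptrdiff_t>(k));
    auto largest = std::max_element(kept.begin(), kept.end());
    for (std::size_t i = k; i < in.size(); ++i) {
        if (in[i] < *largest) {
            *largest = in[i];  // overwrite in place
            largest = std::max_element(kept.begin(), kept.end());
        }
    }
    return kept;
}

// Hypothetical alternative: the same result set maintained as a max-heap,
// so each replacement costs O(log k) at the price of moving elements around.
inline std::vector<int> smallest_k_heap(const std::vector<int>& in, std::size_t k) {
    assert(k > 0 && k <= in.size());
    std::vector<int> kept(in.begin(), in.begin() + static_cast<std::ptrdiff_t>(k));
    std::make_heap(kept.begin(), kept.end());
    for (std::size_t i = k; i < in.size(); ++i) {
        if (in[i] < kept.front()) {
            std::pop_heap(kept.begin(), kept.end());   // largest moves to the back
            kept.back() = in[i];                       // overwrite it
            std::push_heap(kept.begin(), kept.end());  // restore the heap property
        }
    }
    return kept;
}
```

Both functions return the same set of values in unspecified order; like the patch, they discard a candidate that compares equal to the current largest.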


- struct _MinKOp
- {
-     _MinKItems<_RandomAccessIterator>& __items;
+     _MinKItems<_RandomAccessIterator> &__items;
      _Compare __comp;
-     _MinKOp(_MinKItems<_RandomAccessIterator>& __items_, _Compare __comp_) : __items(__items_), __comp(__comp_) {}
+     _MinKOp(_MinKItems<_RandomAccessIterator> &__items_, _Compare __comp_)
+         : __items(__items_), __comp(__comp_) {}
-     void
-     __keep_smallest_k_items(_RandomAccessIterator __item)
-     {
+     void __keep_smallest_k_items(_RandomAccessIterator __item) {
          // If the new item is larger than the largest item in the list, discard it.
-         if (__comp(**__items.__largest_item, *__item))
-         {
+         if (__comp(*__items.__get_largest_item(), *__item)) {
              return;
          }
          // If thew new item is equal to the largest item in the list, discard it.
-         if (!__comp(*__item, **__items.__largest_item))
-         {
+         if (!__comp(*__item, *__items.__get_largest_item())) {
              return;
          }
          // The new item is smaller than the largest item. Replace the largest item
          // with the new item.
-         *__items.__largest_item = __item;
+         __items.__set_largest_item(__item);
          // Find the new largest item.
          __update_largest_item();
      };
-     void
-     __update_largest_item()
-     {
-         __items.__largest_item =
-             std::max_element(std::begin(__items.__smallest_k_items), std::end(__items.__smallest_k_items),
-                              [this](const auto& l, const auto& r) { return __comp(*l, *r); });
+     void __update_largest_item() {
+         auto pos = std::max_element(
+             std::begin(__items.__smallest_k_items),
+             std::end(__items.__smallest_k_items),
+             [this](const auto &l, const auto &r) { return __comp(*l, *r); });
+         __items.__largest_item =
+             std::distance(std::begin(__items.__smallest_k_items), pos);

Quuxplusone (Unsubmitted, Not Done)

[this](const auto &l, const auto &r) { return __comp(*l, *r); });

- __items.__largest_item =

- std::distance(std::begin(__items.__smallest_k_items), pos);

+ __items.__largest_item = pos - __items.__smallest_k_items.begin();

}

void __merge(_MinKItems<_RandomAccessIterator> &__other) {

if you wanted to keep this simple


nadiasvertex (Author, Unsubmitted, Done)

In general I thought it better to use `std::distance`, but if the preference is to use explicit + and - operators, I can do that.
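
For a `std::vector` the two spellings compute the same index, as this small sketch shows. It assumes a vector of `int`, and `index_of_max_minus` and `index_of_max_distance` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index of the largest element via iterator subtraction,
// which is only available for random-access iterators.
inline std::size_t index_of_max_minus(const std::vector<int>& v) {
    assert(!v.empty());
    auto pos = std::max_element(v.begin(), v.end());
    return static_cast<std::size_t>(pos - v.begin());
}

// Same computation through std::distance, which also accepts weaker iterator
// categories (at linear cost) but is constant-time for vector iterators.
inline std::size_t index_of_max_distance(const std::vector<int>& v) {
    assert(!v.empty());
    auto pos = std::max_element(v.begin(), v.end());
    return static_cast<std::size_t>(std::distance(v.begin(), pos));
}
```

Since `__smallest_k_items` is always a `std::vector`, the choice here is about brevity rather than behavior.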


      }
-     void
-     __merge(_MinKItems<_RandomAccessIterator>& __other)
-     {
-         for (auto __it = std::begin(__other.__smallest_k_items); __it != std::end(__other.__smallest_k_items); ++__it)
-         {
+     void __merge(_MinKItems<_RandomAccessIterator> &__other) {
+         for (auto __it = std::begin(__other.__smallest_k_items);
+              __it != std::end(__other.__smallest_k_items); ++__it) {
              __keep_smallest_k_items(*__it);
          }
+         __update_largest_item();
      }
-     void
-     __initialize(_RandomAccessIterator __first, _RandomAccessIterator __last, std::size_t __k)
-     {
+     void __initialize(_RandomAccessIterator __first, _RandomAccessIterator __last,
+                       std::size_t __k) {
          __items.__resize(__k);
          auto __item_it = __first;
          for (auto __tracking_it = begin(__items.__smallest_k_items);
-              __item_it != __last && __tracking_it != end(__items.__smallest_k_items); ++__item_it, ++__tracking_it)
-         {
+              __item_it != __last &&
+              __tracking_it != end(__items.__smallest_k_items);
+              ++__item_it, ++__tracking_it) {

Quuxplusone (Unsubmitted, Not Done)

__tracking_it != end(__items.__smallest_k_items);

- ++__item_it, ++__tracking_it) {

+ ++__item_it, void(), ++__tracking_it) {

*__tracking_it = __item_it;

That this ever worked indicates that we could use some more "evil" test cases with overloaded/deleted operator, and so on. The iterators in libcxx/test/support/test_iterators.h are already "evil" in this way; we'd just need some new tests that use those iterators with parallel algorithms.

My ADL senses are tingling with those ADL calls to begin and end.

(Are there tests for parallel partial_sort? If it's this buggy, why aren't those tests failing now?)
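
To illustrate why the `void()` in the suggested change matters, here is a minimal sketch of an "evil" type. `EvilCursor` and `advance_both` are hypothetical names; the real test iterators in `libcxx/test/support/test_iterators.h` are more elaborate, and this only imitates their deleted comma operator.

```cpp
#include <cstddef>

// Hypothetical "evil" cursor: it increments like an iterator but deletes the
// comma operator, so any `x, y` with an EvilCursor on the left is ill-formed.
struct EvilCursor {
    std::size_t pos = 0;
    EvilCursor& operator++() {
        ++pos;
        return *this;
    }
    template <class T>
    void operator,(const T&) = delete;
};

// Advance two cursors in lockstep n times. Writing `++a, ++b` would select
// the deleted overload and fail to compile; a void operand in between forces
// the built-in comma operator, because no overload can accept a void argument.
inline void advance_both(EvilCursor& a, EvilCursor& b, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i, ++a, void(), ++b) {
    }
}
```

A parallel-algorithm test that instantiated the loop in `__initialize` with such a type would have rejected the original `++__item_it, ++__tracking_it` at compile time.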


              *__tracking_it = __item_it;
          }
          __update_largest_item();
-         for (; __item_it != __last; ++__item_it)
-         {
+         for (; __item_it != __last; ++__item_it) {
              __keep_smallest_k_items(__item_it);
          }
-     static auto
-     __reduce(_MinKItems<_RandomAccessIterator>& __v1, _MinKItems<_RandomAccessIterator>& __v2, _Compare __comp)
-         -> _MinKItems<_RandomAccessIterator>
+     static auto __reduce(_MinKItems<_RandomAccessIterator> &__v1,
+                          _MinKItems<_RandomAccessIterator> &__v2, _Compare __comp)
+     -> _MinKItems<_RandomAccessIterator> {

Quuxplusone (Unsubmitted, Not Done)

Gratuitous whitespace diff here lost the indentation of the trailing return type.


-     {
-         if (__v1.__empty())
-         {
+         if (__v1.__empty()) {
              return __v2;
          }
-         if (__v2.__empty())
-         {
+         if (__v2.__empty()) {
              return __v1;
          }
-         if (__v1.__size() >= __v2.__size())
-         {
+         if (__v1.__size() >= __v2.__size()) {
              _MinKOp<_RandomAccessIterator, _Compare> __op(__v1, __comp);
              __op.__merge(__v2);
              return __v1;
          }
          _MinKOp<_RandomAccessIterator, _Compare> __op(__v2, __comp);
          __op.__merge(__v1);
          return __v2;

  (49 lines not shown)
  auto __reduce_chunk = [&](std::uint32_t __chunk) {
          auto __end = __begin + __this_chunk_size;
          return __find_min_k(__begin, __end, __nsort, __comp);
      };
      auto __reduce_value = [&](auto& __v1, auto& __v2) { return _Op::__reduce(__v1, __v2, __comp); };
      auto __result = __parallel_reduce_chunks<_Value>(0, __n_chunks, __reduce_chunk, __reduce_value);
-     return *__result.__largest_item;
+     return __result.__get_largest_item();
  }

  template <typename _RandomAccessIterator, typename _Compare>
  void
  __parallel_partition(_RandomAccessIterator __xs, _RandomAccessIterator __xe, _RandomAccessIterator __pivot,
                       _Compare __comp, std::size_t __nsort)
  {
      auto __size = static_cast<std::size_t>(std::distance(__xs, __xe));

  (38 lines not shown)
  for (std::size_t __index = 0U; __index < __nsort; ++__index)
      {
          // Try to capture this slot by using compare and exchange. If we
          // are able to capture the slot then perform a swap and exit this
          // loop.
          if (__status[__swap_index].load() == false && __status[__swap_index].exchange(true) == false)
          {
              auto __current_item = std::next(__xs, __index);
              auto __swap_item = std::next(__xs, __swap_index);
-             std::swap(__current_item, __swap_item);
+             std::swap(*__current_item, *__swap_item);

Quuxplusone (Unsubmitted, Not Done)

I think this was meant to call std::iter_swap(__current_item, __swap_item), actually. Calling std::swap qualified seems like a bug, but calling std::iter_swap qualified would be normal practice.
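
The difference between the two calls can be shown with a minimal sketch, assuming a `std::vector<int>` in place of the generic range; `swap_iterators_only` and `swap_elements` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Mirrors the original line: std::swap on the iterators exchanges the two
// local variables and leaves the elements of v untouched.
inline void swap_iterators_only(std::vector<int>& v, std::size_t i, std::size_t j) {
    assert(i < v.size() && j < v.size());
    auto a = std::next(v.begin(), static_cast<std::ptrdiff_t>(i));
    auto b = std::next(v.begin(), static_cast<std::ptrdiff_t>(j));
    std::swap(a, b);
}

// Mirrors the suggestion: std::iter_swap exchanges the elements that the
// two iterators point to.
inline void swap_elements(std::vector<int>& v, std::size_t i, std::size_t j) {
    assert(i < v.size() && j < v.size());
    auto a = std::next(v.begin(), static_cast<std::ptrdiff_t>(i));
    auto b = std::next(v.begin(), static_cast<std::ptrdiff_t>(j));
    std::iter_swap(a, b);
}
```

Calling `swap_iterators_only(v, 0, 2)` on `{1, 2, 3}` leaves the vector unchanged, while `swap_elements(v, 0, 2)` yields `{3, 2, 1}`.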


              break;
          }
      delete[] __status;
  }

  (74 lines not shown)
  template <typename _RandomAccessIterator, typename _Compare, typename _LeafSort>
  void
  __parallel_stable_partial_sort(_RandomAccessIterator __xs, _RandomAccessIterator __xe, _Compare __comp,
                                 _LeafSort __leaf_sort, std::size_t __nsort)
  {
      auto __pivot = __parallel_find_pivot(__xs, __xe, __comp, __nsort);
      __parallel_partition(__xs, __xe, __pivot, __comp, __nsort);
+     auto __part_end = std::next(__xs, __nsort);
      if (__nsort <= __default_chunk_size)
      {
-         __leaf_sort(__xs, __pivot, __comp);
+         __leaf_sort(__xs, __part_end, __comp);
      }
      else
      {
-         __parallel_stable_sort_body(__xs, __pivot, __comp);
+         __parallel_stable_sort_body(__xs, __part_end, __comp);

Quuxplusone (Unsubmitted, Not Done)

Technically, all of these ADL calls are unsafe and should be qualified, like __omp_backend::__parallel_stable_sort_body(...). You're changing so much in the PR currently that I feel like you should go ahead and ADL-proof it too; but I wouldn't make ADL the main point of the PR. ;)


nadiasvertex (Author, Unsubmitted, Done)

Actually, this is all "new" code. I have another parent revision where most of this code is introduced. However, the parallel partial sort is buggy, so I am working on fixing it here. I will make the ADL fixes you suggest.

The partial sort now succeeds for most of the collections I try, but it fails to produce correct results in some cases, I think due to some asymmetry in the reduction. I thought I had fixed it (which is why I posted this) until I did more testing.
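
On the ADL point raised earlier in this thread, a minimal sketch of what can go wrong with an unqualified call. The namespaces `backend` and `user` and every name inside them are hypothetical stand-ins, not the real `__omp_backend` code.

```cpp
namespace backend {

// Stand-in for an internal helper such as __parallel_stable_sort_body.
template <class It>
int sort_body(It, It) {
    return 1;
}

// Unqualified call: argument-dependent lookup also searches the namespaces
// associated with It, so a user-provided sort_body can be selected instead.
template <class It>
int call_unqualified(It first, It last) {
    return sort_body(first, last);
}

// Qualified call: only backend::sort_body is considered.
template <class It>
int call_qualified(It first, It last) {
    return backend::sort_body(first, last);
}

} // namespace backend

namespace user {

struct Widget {};

// An unrelated user function that happens to share the helper's name. As a
// non-template exact match it beats the backend template in overload resolution.
inline int sort_body(Widget*, Widget*) {
    return 2;
}

} // namespace user
```

With a `user::Widget` array `w`, `backend::call_unqualified(w, w + 2)` returns 2 (the user's function), while `backend::call_qualified(w, w + 2)` returns 1.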


      }

  template <class _ExecutionPolicy, typename _RandomAccessIterator, typename _Compare, typename _LeafSort>
  void
- __parallel_stable_sort(_ExecutionPolicy&&, _RandomAccessIterator __xs, _RandomAccessIterator __xe, _Compare __comp,
-                        _LeafSort __leaf_sort, std::size_t __nsort = 0)
+ __parallel_stable_sort(_ExecutionPolicy&& __exec, _RandomAccessIterator __xs, _RandomAccessIterator __xe,
+                        _Compare __comp, _LeafSort __leaf_sort, std::size_t __nsort = 0)
  {
      if (__xs >= __xe)
      {
          return;
      }
-     if (__nsort < __default_chunk_size)
+     if (__nsort <= __default_chunk_size)
      {
-         __leaf_sort(__xs, __xe, __comp);
+         __serial_backend::__parallel_stable_sort(std::forward<_ExecutionPolicy>(__exec), __xs, __xe, __comp,
+                                                  __leaf_sort, __nsort);
          return;
      }
      std::size_t __count = static_cast<std::size_t>(std::distance(__xs, __xe));
      if (omp_in_parallel())
      {
          if (__count <= __nsort)
  (43 lines not shown)


[pstl] Fix a number of bugs with parallel partial sort (Abandoned, Public)
