This is an archive of the discontinued LLVM Phabricator instance.

[libc++][PSTL] Add a GCD backend

Authored by philnik on May 30 2023, 8:23 AM.


Group Reviewers
Restricted Project
rG2b2e7f6e5727: [libc++][PSTL] Add a GCD backend

Diff Detail

Event Timeline

philnik created this revision.May 30 2023, 8:23 AM
Herald added a project: Restricted Project. · View Herald Transcript
ldionne added inline comments.May 30 2023, 8:30 AM

HIDE_FROM_ABI here and below


If this isn't part of our API right now, I wouldn't provide it. I'd add it once we need it.

philnik updated this revision to Diff 527088.May 31 2023, 9:33 AM
philnik marked 2 inline comments as done.

Address comments

ldionne added inline comments.Jun 1 2023, 10:08 AM

If it does.


This seems a bit convoluted to me. Either set __partitions.__x directly or use local variables and then return __chunk_partitions{_vars...}?


This variable name is super confusing with the above!

Also I would probably name it something like __n, which makes it clear that this is the total number of elements in the range.


I would rename this to avoid __parallel in the name since it makes it look like a CPU backend customization point, when it's not.


Otherwise this code doesn't work as-is.

We also discussed using unique_ptr<optional<_Value>[]> which would destroy objects correctly in case an exception is thrown while constructing values below. This potentially has a non-negligible memory overhead. No conclusion yet, but I think I would like to either make this code locally correct or have a nice way of expressing the fact that we don't care about running the destructors of _Value in case of an exception because we call terminate anyway.

philnik updated this revision to Diff 527581.Jun 1 2023, 1:06 PM
philnik marked 5 inline comments as done.

Address comments

h-vetinari added a subscriber: h-vetinari.

Previous effort for this for cross-reference:

ldionne added inline comments.Jun 2 2023, 8:32 AM

If the only difference between __parallel_merge and __parallel_merge_body is that we're passing the size, then __parallel_merge could be implemented as a simple call to __parallel_merge_body(__xe - __xs, __ye - __ys).


I think we should introduce a helper class for this. It seems like something we might have other uses for. Something like:

template <class _Tp, class _Alloc = std::allocator<_Tp>>
struct __uninitialized_buffer {
  explicit __uninitialized_buffer(size_t __n);
  _Tp* __get();
  ~__uninitialized_buffer(); // destroys the elements and frees the memory
};
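A more complete, self-contained sketch of such a helper (hypothetical; a real implementation would also need to track how many elements were actually constructed so a mid-construction exception stays safe):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical sketch of the helper suggested above: owns raw storage for
// __n objects of _Tp. The caller constructs the elements; the destructor
// destroys them and frees the memory. Assumes all __n elements were
// constructed by the time the buffer is destroyed.
template <class _Tp, class _Alloc = std::allocator<_Tp>>
struct __uninitialized_buffer {
  explicit __uninitialized_buffer(std::size_t __n)
      : __size_(__n), __data_(_Alloc().allocate(__n)) {}

  __uninitialized_buffer(const __uninitialized_buffer&)            = delete;
  __uninitialized_buffer& operator=(const __uninitialized_buffer&) = delete;

  _Tp* __get() { return __data_; }

  ~__uninitialized_buffer() {
    std::destroy_n(__data_, __size_);
    _Alloc().deallocate(__data_, __size_);
  }

private:
  std::size_t __size_;
  _Tp* __data_;
};
```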

This totally deserves a comment explaining why we're terminating here, when normally we always terminate from terminate_on_exception. Actually, we could also use terminate_on_exception here instead of exception_guard.


We need to figure out how we're going to test the backend. Tests should be added in this patch.

philnik updated this revision to Diff 529678.Jun 8 2023, 10:51 AM
philnik marked 3 inline comments as done.

Address comments

ldionne published this revision for review.Jun 15 2023, 2:02 PM
ldionne added inline comments.

I would like us to test specifically the various functions we use in the backend, because otherwise we don't have any testing for those. So for example, let's create a libc++ specific test that checks for the implementation of __partition_chunks.


This comment seems not to be relevant anymore, we don't "create an executor".


There is some non-trivial logic in here that we could get wrong. For example, if we were not to use upper_bound below to figure out the end of the __xm range to merge, we could get correctness issues -- this is clearly not only about performance.

We could either test this by adding GCD-specific tests for the backend, or by adding std/ tests for e.g. std::merge(std::par, ...) that are specially crafted to trigger these issues. I don't care very strongly which one we go for, but we should have something because as it stands, all of our tests would pass if we had incorrect logic in here (our tests always use something smaller than the chunk size).

@philnik mentioned some sort of "fuzzing" test during live review, and I think I agree this would make sense.


I don't think we need to pass the __size_foo variables around, we can compute them from the iterators instead since we have random access iterators.


Maybe those are more descriptive names? Especially if we drop __size_x and __size_y, x and y will lose their meaning.


Is there any reason why we use lower_bound here?

@nadiasvertex Would you happen to remember why you did it that way in the original GCD backend implementation?


This implementation is quite interesting. At a high level, we basically figure out how to chunk up the ranges as we're going, and we spawn tasks two-by-two every time. My understanding is that this kind of "tree of computation" pattern is not what libdispatch excels at -- instead I think it is better to give it a "finalized" number of tasks you want to execute all at once.

If that is correct, then going for a different algorithm where we instead figure out the chunking upfront *and then* dispatch it all at once could be a win. One way to do this could be to accumulate the work items (aka the arguments you pass to __leaf_merge above) in a std::vector<_WorkItem>, and then dispatch_apply_f on that. We're allowed to allocate in these algorithms so that should be an acceptable approach. Since the number of work items is roughly n / __default_chunk_size and that's linear, the number of work items we might have to spawn could become quite large, perhaps making it important to use dispatch_apply_f only once. Unfortunately this also means that it's difficult to determine the number of work items up front so we would likely need to allocate with our std::vector almost all the time (i.e. tricks like llvm::SmallVector likely don't apply here).

I think the first step to resolve this comment is to figure out whether the premise that libdispatch doesn't handle those trees well is true or not. If not, then the current approach is probably the right one.
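For illustration, the "chunk upfront, then dispatch all at once" idea could be sketched as follows. All names here are illustrative stand-ins, and the final loop stands in for what would be a single dispatch_apply_f call in the real backend:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Portable sketch of the "flatten, then dispatch once" approach discussed
// above. _WorkItem and __chunk_then_apply are hypothetical names.
struct _WorkItem {
  std::size_t __begin;
  std::size_t __end;
};

template <class _Fn>
void __chunk_then_apply(std::size_t __n, std::size_t __chunk_size, _Fn __leaf) {
  std::vector<_WorkItem> __items;
  __items.reserve(__n / __chunk_size + 1); // roughly n / chunk_size items

  // Phase 1: compute the chunking upfront instead of spawning tasks two-by-two.
  for (std::size_t __i = 0; __i < __n; __i += __chunk_size)
    __items.push_back({__i, __i + __chunk_size < __n ? __i + __chunk_size : __n});

  // Phase 2: hand the whole batch to the scheduler at once. With libdispatch
  // this loop would be a single call along the lines of
  //   dispatch_apply_f(__items.size(), __queue, &__items, __process_one);
  for (const _WorkItem& __item : __items)
    __leaf(__item.__begin, __item.__end);
}
```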


We need a test for this case -- i.e. we want to make sure that we throw an exception (and the right one) if we fail to allocate memory from the implementation of the GCD backend.


Can we add a test that fails here? This is a double-move issue that was present in the code before this patch. I think it would make sense to split this one off.


Let's capture by name since this was *so* not obvious.


This should be in libc++experimental.a for now instead.

Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2023, 2:02 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
nadiasvertex added inline comments.Jun 16 2023, 6:13 AM

It's been so long, I don't remember the exact details, sorry. I think it had something to do with calculating the correct offset. We do the search in the smallest range, and the constraints change because the place we want to compute is different in first1:last1 than it is in first2:last2.

philnik updated this revision to Diff 532960.Jun 20 2023, 9:16 AM
philnik marked 4 inline comments as done.

Address some comments

philnik updated this revision to Diff 534786.Jun 26 2023, 4:48 PM
philnik marked 3 inline comments as done.

Address some more comments

ldionne added inline comments.Jun 27 2023, 8:32 AM

Otherwise this is a bit confusing.


Seems more fitting given the names of the iterators?


IMO this captures better the fact that we are only catching exceptions in __libdispatch::__calculate_merge_ranges:

pmr::vector<...> __ranges;
try {
  __ranges = __calculate(...);
} catch (...) {
  throw __pstl_bad_alloc();
}

This makes it more obvious that we let __dispatch_apply terminate if an exception is thrown.

Edit: This might not work if the allocator doesn't propagate.

ldionne added inline comments.Jun 27 2023, 12:04 PM
pmr::vector<...> __ranges = [&] {
  try {
    return __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
  } catch (...) {
    throw ...;
  }
}();
Still debating whether that's overkill, but it works around the allocator propagation issue.

I think we should leave the code as-is after consideration.


I'm not sure this is correct as-is. Let's say we have a predicate to find_if that happens to use __pstl_merge in its implementation, and let's say the "setup" code for __pstl_merge fails to allocate and throws __pstl_bad_alloc. It will be caught by find_if's __terminate_on_exception wrapper and be rethrown. Right? If so, then we'd be incorrectly propagating the exception.

Instead, I think we might want to use __terminate_on_exception in the backend implementation *only* around user code (not setup code), so we'd remove it from the various functions that take __cpu_backend_tag. It would also not need to handle __pstl_bad_alloc anymore because it would never wrap "setup" code.

This should be done in a separate patch, and we'll likely need some tests as well.


We might want to move this system header below the library's own includes, I think that's what we usually do.


Let's say we want 3x to 100x as many "tasks" as we have cores. Let's say for now that we always want 50x as many, just to pick something. Then we could also do:

const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
const auto chunk_size = element_count / target_number_of_chunks; // roughly

Then we could also add some logic like not having chunks smaller than X elements or something like that. Or we could make the 50x scale from 3x to 100x based on the number of elements we're processing.

At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on __default_chunk_size (which itself is 100% thin air).

You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores, and starts "somewhere". I think I agree.

Another observation is that we should probably err in favour of spawning more tasks than spawning fewer tasks. At the end of the day, the user did request the parallel version of the algorithm. If they use std::for_each(std::execution::par) with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores, and in the worst case those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently and the user made a poor decision to try and parallelize that part of their code.

In summary, I would suggest this scheme:

We spawn at least min(element_count, thread::hardware_concurrency()) tasks always.
When element_count > thread::hardware_concurrency(), we increase logarithmically as a function of element_count with an asymptote located at roughly 100 * thread::hardware_concurrency().

philnik updated this revision to Diff 535165.Jun 27 2023, 3:43 PM
philnik marked 5 inline comments as done.

Address some comments

ldionne added inline comments.Jun 29 2023, 10:46 AM

For small numbers of elements (< cores), we could create one task for each element. That's the only way to nicely handle the case where processing each element is really heavy and we really want to parallelize. Also it should be rare that users use std::execution::par to sort a vector of 3 ints. If they do, we have no way to tell and I'd argue it's not unreasonable for us to spawn 3 tasks.
For medium numbers of elements (say < 500): cores + ((n-cores) / cores). This gives us a smooth transition from the previous size and then it basically grows linearly with n.
For larger numbers: Assuming 8 cores, solving log_1.01(sqrt(n)) = 800 (equivalently 100.499 * ln(sqrt(n)), according to Wolfram Alpha) requires n to be roughly 8.2 million elements. That means that we'd create 100 * 8 tasks at 8.2 million elements, and then it grows really slowly. That doesn't seem unreasonable.

This is not very scientific, I'm afraid, but this seems to be somewhat reasonable while playing around with a couple of values.

size_t cores = thread::hardware_concurrency();

auto small  = [](ptrdiff_t n) { return n; };
auto medium = [&](ptrdiff_t n) { return cores + ((n - cores) / cores); };

// explain where this comes from, in particular that this is an approximation of `log(1.01, sqrt(n))` which seemed to be reasonable for `n` larger than 500 and tops at 800 tasks for n ~ 8 million
auto large = [](ptrdiff_t n) { return 100.499 * std::log(std::sqrt(n)); };

if (n < cores)    return small(n);
else if (n < 500) return medium(n);
else              return std::min(medium(n), large(n)); // provide a "smooth" transition

The above assumes that we have 8 cores, but everything kind of needs to be adjusted for different numbers of cores. I suggest we start with this just to unblock the patch, but leave a big fat comment explaining this needs to be revisited. We can take a look with the dispatch folks, or if anyone observing this review has an idea, please feel free to chime in.
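Putting the scheme together, a compilable sketch might look like the following. The function name is hypothetical, and the constants (500, 1.01, hence 100.499) are the thin-air values from the discussion above, not a tuned policy:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>

// Sketch of the task-count heuristic discussed above: one task per element
// for tiny inputs, near-linear growth for medium inputs, and logarithmic
// growth topping out around 100x the core count for large inputs.
inline std::ptrdiff_t __task_count(std::ptrdiff_t __n) {
  const std::ptrdiff_t __cores =
      std::max<std::ptrdiff_t>(1, std::thread::hardware_concurrency());

  auto __small  = [](std::ptrdiff_t __v) { return __v; };
  auto __medium = [&](std::ptrdiff_t __v) { return __cores + (__v - __cores) / __cores; };
  // Approximation of log base 1.01 of sqrt(n), i.e. 100.499 * ln(sqrt(n));
  // reaches ~800 tasks at roughly 8.2 million elements.
  auto __large  = [](std::ptrdiff_t __v) {
    return static_cast<std::ptrdiff_t>(100.499 * std::log(std::sqrt(static_cast<double>(__v))));
  };

  if (__n < __cores) return __small(__n);
  if (__n < 500)     return __medium(__n);
  return std::min(__medium(__n), __large(__n)); // "smooth" transition
}
```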

philnik updated this revision to Diff 536011.Jun 29 2023, 2:49 PM
philnik marked 6 inline comments as done.

Address more comments

ldionne added inline comments.Jul 5 2023, 9:53 AM

We should ensure that this emplace_back() doesn't need to reallocate, otherwise it gets messy with exceptions (and it's a bit inefficient). We should add an assertion here _LIBCPP_ASSERT_UNCATEGORIZED(__ranges.capacity() >= __ranges.size() + 1);.


Here let's reserve() in advance. The upper bound on the number of ranges we'll ever allocate should be the number of leaves in the "call tree of __calculate_merge_ranges". In the worst case we only divide the biggest range by two and we don't touch the other range at every step, so the number of levels in that call tree is bounded by the number of times you can divide each range by two. You want to sum the worst case for each range here. The number of leaves will be bounded by 2^levels. Since levels grows logarithmically, this shouldn't be as bad as it seems, basically this would end up being linear. So I think this gives us:

2^(log_2(size1/default_chunk_size)) == size1/default_chunk_size

If we add both of the ranges, it means we'd be bounded by:

size1/default_chunk_size + size2/default_chunk_size == (size1+size2)/default_chunk_size

But actually, we might be able to ditch this altogether and instead base our chunking on __partition_chunks by chunking the largest range (or a combination of both) iteratively. I'm not seeing the solution clearly right now but I think we might be able to come up with something better than what we have.

Potential alternative

We could instead walk the largest range by increments of chunk_size and compute the matching location to merge from in the small range using lower_bound or upper_bound, iteratively. We would pre-reserve large-range / chunk_size in the vector. That seems simpler and while there are adversarial inputs with that method, the general case is better behaved.
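A portable sketch of that alternative (all names illustrative; the real code would also carry the output iterator per chunk):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <utility>
#include <vector>

// Sketch of the alternative above: walk the larger range in __chunk_size
// increments and use lower_bound on the smaller range to find the matching
// split point. Returns one (split1, split2) pair per chunk.
template <class _It, class _Comp>
std::vector<std::pair<_It, _It>>
__calculate_merge_ranges(_It __first1, _It __last1, _It __first2, _It __last2,
                         std::ptrdiff_t __chunk_size, _Comp __comp) {
  std::vector<std::pair<_It, _It>> __splits;
  const bool __walk_first = (__last1 - __first1) >= (__last2 - __first2);
  _It __big       = __walk_first ? __first1 : __first2;
  _It __big_end   = __walk_first ? __last1 : __last2;
  _It __small     = __walk_first ? __first2 : __first1;
  _It __small_end = __walk_first ? __last2 : __last1;

  __splits.reserve((__big_end - __big) / __chunk_size + 2); // pre-reserve

  while (__big_end - __big > __chunk_size) {
    __big += __chunk_size;
    // Everything in the small range strictly less than the last element of
    // the current big chunk belongs to the chunks emitted so far.
    __small = std::lower_bound(__small, __small_end, __big[-1], __comp);
    __splits.emplace_back(__walk_first ? __big : __small,
                          __walk_first ? __small : __big);
  }
  __splits.emplace_back(__last1, __last2); // final chunk covers the remainder
  return __splits;
}
```

Merging each chunk pair in order then yields the same result as a single full merge.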


What is that number? Can you document it with a comment and the rationale for using it? (e.g. "we want this to contain at least N + M elements so we can merge large-ish ranges")


Let's do this as a separate patch.

philnik updated this revision to Diff 537548.Jul 5 2023, 5:01 PM
philnik marked 8 inline comments as done.

Address more comments


Not applicable anymore.


See D154238.


See D154546.

ldionne added inline comments.Jul 6 2023, 10:35 AM

This isn't used anymore.


Future optimization: We should allocate only one temporary value per worker thread, not one per chunk. Then we should pad the storage to make sure they all fall on different cache lines to avoid false sharing.

Can you add a TODO comment mentioning that?
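For reference, the padding part of that TODO could look something like this sketch. The 64-byte cache line is an assumption; std::hardware_destructive_interference_size would be the portable choice where available:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the optimization in the TODO above: one temporary per worker,
// padded so adjacent workers' values never share a cache line (avoiding
// false sharing). 64 is an assumed cache-line size.
inline constexpr std::size_t __assumed_cache_line = 64;

template <class _Value>
struct alignas(__assumed_cache_line) __padded_value {
  _Value __v_;
};

static_assert(alignof(__padded_value<int>) == __assumed_cache_line, "");
static_assert(sizeof(__padded_value<int>) % __assumed_cache_line == 0, "");
```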


Can you add a TODO comment to use the number of cores that libdispatch will actually use instead of the total number of cores on the system?

ldionne added inline comments.Jul 10 2023, 9:03 AM

Until we re-introduce __uninitialized_buffer, we can do this instead:

// TODO: Use __uninitialized_buffer
auto __destroy = [=](_Value* __ptr) {
  std::destroy_n(__ptr, __partitions.__chunk_count_);
  std::allocator<_Value>().deallocate(__ptr, __partitions.__chunk_count_);
};
unique_ptr<_Value, decltype(__destroy)> __values(std::allocator<_Value>().allocate(__partitions.__chunk_count_), __destroy);

// use __dispatch_apply(...)

auto __tmp = std::transform_reduce(__values.get(), __values.get() + __partitions.__chunk_count_, __init, __reduction, __transform);
return __tmp;
philnik updated this revision to Diff 538701.Jul 10 2023, 9:39 AM
philnik marked 5 inline comments as done.

Address comments

ldionne added inline comments.Jul 10 2023, 2:46 PM

Not used anymore.


As discussed, this seems to be a bit simplified:

auto __compute_chunk = [&](size_t __chunk_size) -> __merge_range_t {
  auto [__mid1, __mid2] = [&] {
    if (__iterate_first_range) {
      auto __m1 = __first1 + __chunk_size;
      auto __m2 = std::lower_bound(__first2, __last2, __m1[-1], __comp);
      return std::make_pair(__m1, __m2);
    } else {
      auto __m2 = __first2 + __chunk_size;
      auto __m1 = std::lower_bound(__first1, __last1, __m2[-1], __comp);
      return std::make_pair(__m1, __m2);
    }
  }();

  __merge_range_t __ret{__mid1, __mid2, __result};
  __result += (__mid1 - __first1) + (__mid2 - __first2);
  __first1 = __mid1;
  __first2 = __mid2;
  return __ret;
};

As discussed live, you could emplace_back the first iterators of the entire range manually into the vector, and then you'd be able to remove the special casing for __index == 0. You'd always look at __ranges[__index] and __ranges[__index + 1] here.


This doesn't work if you have only 1 element in a chunk (or if that's the case for the __first_chunk_size). We should have a test that exercises that. I guess the expected behavior in that case is that we'd just use std::construct_at(__values + __index, __transform(*__first));


This should only be a reduce. The bug would be caught if the transform returned a different type than the reduction operates on; test coverage should be added for that.


Can we instead run the sort in parallel by chunking and then run a serial merge? That's not too hard to implement and it's much better than doing everything serial. We can leave a TODO for parallelizing the merge since we don't have a clear way to do it.
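A compilable sketch of that scheme (names and chunk size illustrative; the phase-1 sorts are what would run as parallel tasks, while phase 2 stays serial):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch of the scheme suggested above: sort fixed-size chunks, then merge
// the sorted chunks serially by doubling the merged width each pass.
template <class _It, class _Comp>
void __chunked_sort(_It __first, _It __last, std::ptrdiff_t __chunk_size, _Comp __comp) {
  // Phase 1: sort each chunk. In the real backend each iteration would be a
  // libdispatch task (e.g. via dispatch_apply).
  for (_It __it = __first; __it < __last;) {
    _It __end = (__last - __it > __chunk_size) ? __it + __chunk_size : __last;
    std::sort(__it, __end, __comp);
    __it = __end;
  }

  // Phase 2: serial pairwise merge of adjacent sorted runs.
  // TODO (per the comment above): parallelize this merge.
  for (std::ptrdiff_t __width = __chunk_size; __width < __last - __first; __width *= 2) {
    for (_It __it = __first; __last - __it > __width;
         __it += std::min<std::ptrdiff_t>(2 * __width, __last - __it)) {
      _It __mid = __it + __width;
      _It __end = (__last - __it > 2 * __width) ? __it + 2 * __width : __last;
      std::inplace_merge(__it, __mid, __end, __comp);
    }
  }
}
```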


Those tests should be moved to the fuzz patch.

philnik updated this revision to Diff 538889.Jul 10 2023, 6:09 PM
philnik marked 8 inline comments as done.

Address comments


I'd rather do that in a follow-up. It's not like we have to make this implementation perfect from the start.

ldionne requested changes to this revision.Jul 11 2023, 8:31 AM
ldionne added inline comments.

We need set(LIBCXX_PSTL_CPU_BACKEND libdispatch) inside libcxx/cmake/caches/Apple.cmake


Missing include for move_iterator.


If you use __first_iters to store the result, this becomes:

__result += (__mid1 - __first1) + (__mid2 - __first2);
__first1 = __mid1;
__first2 = __mid2;
return __merge_range_t{__first1, __first2, __result};

And above becomes

__ranges.emplace_back(__first1, __first2, __result);

instead of

__ranges.emplace_back(__first1, __first2, _RandomAccessIterator3{});

It makes it clearer that we do N-1 iterations, and also that we require __partitions.__chunk_count_ > 1 (which is the case, see above).


IMO that's easier to read, and it's equivalent.


I don't think this was done.

This revision now requires changes to proceed.Jul 11 2023, 8:31 AM
philnik updated this revision to Diff 539136.Jul 11 2023, 8:51 AM
philnik marked 6 inline comments as done.

Address comments

philnik updated this revision to Diff 539172.Jul 11 2023, 10:02 AM

Try to fix CI

philnik updated this revision to Diff 539284.Jul 11 2023, 1:46 PM

Try to fix CI

philnik updated this revision to Diff 539560.Jul 12 2023, 7:52 AM

Try to fix CI

ldionne accepted this revision.Jul 12 2023, 8:32 AM

LGTM once CI passes. Thanks a lot for all the work on this! It's not the final implementation, but this is a really good start and we can then improve on it iteratively.

This revision is now accepted and ready to land.Jul 12 2023, 8:32 AM
This revision was landed with ongoing or failed builds.Jul 12 2023, 1:27 PM
This revision was automatically updated to reflect the committed changes.