
[libc++][PSTL] Add a GCD backend

Authored by philnik on May 30 2023, 8:23 AM.


rG2b2e7f6e5727: [libc++][PSTL] Add a GCD backend


Event Timeline

nadiasvertex added inline comments.Jun 16 2023, 6:13 AM

It's been so long that I don't remember the exact details, sorry. I think it had something to do with calculating the correct offset. We do the search in the smallest range, and the constraints change because the place we want to compute is different in first1:last1 than it is in first2:last2.

philnik updated this revision to Diff 532960.Jun 20 2023, 9:16 AM
philnik marked 4 inline comments as done.

Address some comments

philnik updated this revision to Diff 534786.Jun 26 2023, 4:48 PM
philnik marked 3 inline comments as done.

Address some more comments

ldionne added inline comments.Jun 27 2023, 8:32 AM

Otherwise this is a bit confusing.


Seems more fitting given the names of the iterators?


IMO this captures better the fact that we are only catching exceptions in __libdispatch::__calculate_merge_ranges:

pmr::vector<...> __ranges;
try {
  __ranges = __calculate(...);
} catch (...) {
  throw __pstl_bad_alloc();
}


This makes it more obvious that we let __dispatch_apply terminate if an exception is thrown.

Edit: This might not work if the allocator doesn't propagate.

ldionne added inline comments.Jun 27 2023, 12:04 PM
pmr::vector<...> __ranges = [&] {
  try {
    return __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
  } catch (...) {
    throw ...;
  }
}();
Still debating whether that's overkill, but it works around the allocator propagation issue.

I think we should leave the code as-is after consideration.


I'm not sure this is correct as-is. Let's say we have a predicate to find_if that happens to use __pstl_merge in its implementation, and let's say the "setup" code for __pstl_merge fails to allocate and throws __pstl_bad_alloc. It will be caught by find_if's __terminate_on_exception wrapper and be rethrown. Right? If so, then we'd be incorrectly propagating the exception.

Instead, I think we might want to use __terminate_on_exception in the backend implementation *only* around user code (not setup code), so we'd remove it from the various functions that take __cpu_backend_tag. It would also not need to handle __pstl_bad_alloc anymore because it would never wrap "setup" code.

This should be done in a separate patch, and we'll likely need some tests as well.
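To illustrate the restructuring suggested above, here is a minimal sketch of a wrapper that only guards user callbacks and terminates if one throws, while letting setup-time allocation failures propagate as __pstl_bad_alloc. The name __terminate_on_user_exception is hypothetical, not the actual libc++ helper:

```cpp
#include <exception>

// Hypothetical wrapper: terminate if *user* code (predicates, projections,
// comparators) throws out of the parallel region. Setup code that allocates
// is intentionally NOT wrapped, so its exceptions can propagate.
template <class _Func>
auto __terminate_on_user_exception(_Func __f) {
  try {
    return __f();
  } catch (...) {
    std::terminate(); // user callbacks must not throw across the backend boundary
  }
}
```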


We might want to move this system header below the library's own includes, I think that's what we usually do.


Let's say we want 3x to 100x as many "tasks" as we have cores. Let's say for now that we always want 50x as many, just to pick something. Then we could also do:

const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
const auto chunk_size = element_count / target_number_of_chunks; // roughly

Then we could also add some logic like not having chunks smaller than X elements or something like that. Or we could make the 50x scale from 3x to 100x based on the number of elements we're processing.

At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on __default_chunk_size (which itself is 100% thin air).

You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores, and starts "somewhere". I think I agree.

Another observation is that we should probably err in favour of spawning more tasks rather than fewer. At the end of the day, the user did request the parallel version of the algorithm. If they use std::for_each(std::execution::par) with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh, 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores; in the worst case, those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently, and the user made a poor decision to try to parallelize that part of their code.

In summary, I would suggest this scheme:

We spawn at least min(element_count, thread::hardware_concurrency()) tasks always.
When element_count > thread::hardware_concurrency(), we increase logarithmically as a function of element_count with an asymptote located at roughly 100 * thread::hardware_concurrency().

philnik updated this revision to Diff 535165.Jun 27 2023, 3:43 PM
philnik marked 5 inline comments as done.

Address some comments

ldionne added inline comments.Jun 29 2023, 10:46 AM

For small numbers of elements (< cores), we could create one task for each element. That's the only way to nicely handle the case where processing each element is really heavy and we really want to parallelize. Also it should be rare that users use std::execution::par to sort a vector of 3 ints. If they do, we have no way to tell and I'd argue it's not unreasonable for us to spawn 3 tasks.
For medium numbers of elements (say < 500): cores + ((n-cores) / cores). This gives us a smooth transition from the previous size and then it basically grows linearly with n.
For larger numbers: Assuming 8 cores, we have: log(1.01, sqrt(n)) = 800 (aka 100.499 * log(sqrt(n)) according to Wolfram Alpha) requires n to be roughly 8.2 million elements. That means that we'd create 100 * 8 tasks at 8.2 million elements, and then it grows really slowly. That doesn't seem unreasonable.

This is not very scientific, I'm afraid, but this seems to be somewhat reasonable while playing around with a couple of values.

size_t cores = thread::hardware_concurrency();

auto small  = [](ptrdiff_t n) { return n; };
auto medium = [&](ptrdiff_t n) { return cores + ((n - cores) / cores); };

// explain where this comes from, in particular that this is an approximation of `log(1.01, sqrt(n))` which seemed to be reasonable for `n` larger than 500 and tops out at 800 tasks for n ~ 8 million
auto large = [](ptrdiff_t n) { return 100.499 * std::log(std::sqrt(n)); };

if (n < cores)    return small(n);
else if (n < 500) return medium(n);
else              return std::min<double>(medium(n), large(n)); // provide a "smooth" transition

The above assumes that we have 8 cores, but everything kind of needs to be adjusted for different numbers of cores. I suggest we start with this just to unblock the patch, but leave a big fat comment explaining this needs to be revisited. We can take a look with the dispatch folks, or if anyone observing this review has an idea, please feel free to chime in.
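Pulling the pieces above together, a self-contained sketch of the heuristic might look like the following. The function name __chunk_count_for and the fixed constants (500 cutoff, 100.499 factor) are taken from the discussion above, but this is an illustration, not the code that landed:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Sketch of the task-count heuristic discussed above. `__cores` is passed in
// explicitly so the behavior is testable; the real code would query
// thread::hardware_concurrency().
std::ptrdiff_t __chunk_count_for(std::ptrdiff_t __n, std::ptrdiff_t __cores) {
  auto __small  = [](std::ptrdiff_t __x) { return __x; };
  auto __medium = [&](std::ptrdiff_t __x) { return __cores + (__x - __cores) / __cores; };
  // Approximation of log(1.01, sqrt(n)); tops out around 800 tasks for n ~ 8 million.
  auto __large  = [](std::ptrdiff_t __x) {
    return static_cast<std::ptrdiff_t>(100.499 * std::log(std::sqrt(static_cast<double>(__x))));
  };
  if (__n < __cores) return __small(__n);   // one task per element
  if (__n < 500)     return __medium(__n);  // smooth transition, then ~linear in n
  return std::min(__medium(__n), __large(__n)); // logarithmic growth for large n
}
```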

philnik updated this revision to Diff 536011.Jun 29 2023, 2:49 PM
philnik marked 6 inline comments as done.

Address more comments

ldionne added inline comments.Jul 5 2023, 9:53 AM

We should ensure that this `emplace_back()` doesn't need to reallocate, otherwise it gets messy with exceptions (and it's a bit inefficient). We should add an assertion here: `_LIBCPP_ASSERT_UNCATEGORIZED(__ranges.capacity() >= __ranges.size() + 1);`.


Here let's reserve() in advance. The upper bound on the number of ranges we'll ever allocate should be the number of leaves in the "call tree of __calculate_merge_ranges". In the worst case we only divide the biggest range by two and we don't touch the other range at every step, so the number of levels in that call tree is bounded by the number of times you can divide each range by two. You want to sum the worst case for each range here. The number of leaves will be bounded by 2^levels. Since levels grows logarithmically, this shouldn't be as bad as it seems, basically this would end up being linear. So I think this gives us:

2^ (log_2(size1/default_chunk_size)) == size1/default_chunk_size

If we add both of the ranges, it means we'd be bounded by:

size1/default_chunk_size + size2/default_chunk_size == (size1+size2)/default_chunk_size

But actually, we might be able to ditch this altogether and instead base our chunking on __partition_chunks by chunking the largest range (or a combination of both) iteratively. I'm not seeing the solution clearly right now but I think we might be able to come up with something better than what we have.

Potential alternative

We could instead walk the largest range by increments of chunk_size and compute the matching location to merge from in the small range using lower_bound or upper_bound, iteratively. We would pre-reserve large-range / chunk_size in the vector. That seems simpler and while there are adversarial inputs with that method, the general case is better behaved.
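The iterative alternative described above can be sketched as follows. This assumes (for illustration) that the first range is the larger one; all names here are hypothetical, not actual libc++ identifiers:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Walk the larger range in __chunk_size steps; for each step, find the matching
// split point in the smaller range with std::lower_bound. The vector is
// pre-reserved with roughly large-range / chunk_size entries, so no
// reallocation happens mid-computation.
template <class _It1, class _It2>
std::vector<std::pair<_It1, _It2>>
__compute_merge_splits(_It1 __first1, _It1 __last1, _It2 __first2, _It2 __last2,
                       std::ptrdiff_t __chunk_size) {
  std::vector<std::pair<_It1, _It2>> __splits;
  __splits.reserve((__last1 - __first1) / __chunk_size + 2);
  __splits.emplace_back(__first1, __first2); // start of the whole merge
  while (__last1 - __first1 > __chunk_size) {
    __first1 += __chunk_size;
    // everything in [__first2, boundary) is < __first1[-1], so the chunks
    // [prev1, __first1) x [prev2, boundary) can be merged independently
    __first2 = std::lower_bound(__first2, __last2, __first1[-1]);
    __splits.emplace_back(__first1, __first2);
  }
  __splits.emplace_back(__last1, __last2); // tail chunk
  return __splits;
}
```

Merging each adjacent pair of split points independently and concatenating the outputs then yields the full merged sequence.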


What is that number? Can you document it with a comment and the rationale for using it? (e.g. "we want this to contain at least N + M elements so we can merge large-ish ranges")


Let's do this as a separate patch.

philnik updated this revision to Diff 537548.Jul 5 2023, 5:01 PM
philnik marked 8 inline comments as done.

Address more comments


Not applicable anymore.


See D154238.


See D154546.

ldionne added inline comments.Jul 6 2023, 10:35 AM

This isn't used anymore.


Future optimization: We should allocate only one temporary value per worker thread, not one per chunk. Then we should pad the storage to make sure they all fall on different cache lines to avoid false sharing.

Can you add a TODO comment mentioning that?
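For illustration, the padding idea mentioned above might look something like this (names and the 64-byte constant are assumptions for the sketch; std::hardware_destructive_interference_size would be the portable choice):

```cpp
#include <thread>
#include <vector>

// One scratch slot per worker thread, aligned so each slot occupies its own
// cache line and concurrent writes don't cause false sharing. 64 bytes is a
// common cache-line size; this is a sketch, not libc++ code.
struct alignas(64) __padded_value {
  long __value = 0;
};

std::vector<__padded_value> __make_per_thread_scratch() {
  unsigned __n = std::thread::hardware_concurrency();
  return std::vector<__padded_value>(__n == 0 ? 1 : __n); // 0 means "unknown"
}
```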


Can you add a TODO comment to use the number of cores that libdispatch will actually use instead of the total number of cores on the system?

ldionne added inline comments.Jul 10 2023, 9:03 AM

Until we re-introduce __uninitialized_buffer, we can do this instead:

// TODO: Use __uninitialized_buffer
auto __destroy = [=](_Value* __ptr) {
  std::destroy_n(__ptr, __partitions.__chunk_count_);
  std::allocator<_Value>().deallocate(__ptr, __partitions.__chunk_count_);
};
unique_ptr<_Value, decltype(__destroy)> __values(std::allocator<_Value>().allocate(__partitions.__chunk_count_), __destroy);

// use __dispatch_apply(...)

auto __tmp = std::transform_reduce(__values.get(), __values.get() + __partitions.__chunk_count_, __init, __transform, __reduction);
return __tmp;
philnik updated this revision to Diff 538701.Jul 10 2023, 9:39 AM
philnik marked 5 inline comments as done.

Address comments

ldionne added inline comments.Jul 10 2023, 2:46 PM

Not used anymore.


As discussed, this seems to be a bit simplified:

auto __compute_chunk = [&](size_t __chunk_size) -> __merge_range_t {
  auto [__mid1, __mid2] = [&] {
    if (__iterate_first_range) {
      auto __m1 = __first1 + __chunk_size;
      auto __m2 = std::lower_bound(__first2, __last2, __m1[-1], __comp);
      return std::make_pair(__m1, __m2);
    } else {
      auto __m2 = __first2 + __chunk_size;
      auto __m1 = std::lower_bound(__first1, __last1, __m2[-1], __comp);
      return std::make_pair(__m1, __m2);
    }
  }();

  __merge_range_t __ret{__mid1, __mid2, __result};
  __result += (__mid1 - __first1) + (__mid2 - __first2);
  __first1 = __mid1;
  __first2 = __mid2;
  return __ret;
};

As discussed live, you could emplace_back the first iterators of the entire range manually into the vector, and then you'd be able to remove the special casing for __index == 0. You'd always look at __ranges[__index] and __ranges[__index + 1] here.


This doesn't work if you have only 1 element in a chunk (or if that's the case for the __first_chunk_size). We should have a test that exercises that. I guess the expected behavior in that case is that we'd just use std::construct_at(__values + __index, __transform(*__first));


This should only be a reduce. The bug would be caught by a transform that returns a type different from its input; test coverage should be added for that.
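To make the point concrete: once each chunk has already been transformed into the intermediate buffer, the final combine must be a plain reduce over the transformed values; applying the transform again only happens to type-check when the transform maps a type to itself. A sketch with hypothetical names, using int -> string as the type-changing transform:

```cpp
#include <numeric>
#include <string>
#include <vector>

// `__values` holds the already-transformed per-chunk partial results (here:
// strings produced from ints). The final combine is a plain left fold with the
// reduction operation; re-applying an int -> string transform here would not
// even compile, which is exactly the test coverage being asked for.
std::string __final_combine(const std::vector<std::string>& __values, std::string __init) {
  return std::accumulate(__values.begin(), __values.end(), std::move(__init),
                         [](std::string __acc, const std::string& __s) { return __acc + __s; });
}
```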


Can we instead run the sort in parallel by chunking and then run a serial merge? That's not too hard to implement and it's much better than doing everything serial. We can leave a TODO for parallelizing the merge since we don't have a clear way to do it.
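A rough sketch of that scheme (sort chunks in parallel, then merge them serially) could look like the following; this uses std::thread for illustration rather than libdispatch, and all names are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sort fixed-size chunks of __v concurrently, then combine the sorted chunks
// with a serial bottom-up pass of std::inplace_merge. Parallelizing the merge
// itself would remain a TODO, as discussed above.
void __chunked_sort(std::vector<int>& __v, std::size_t __chunk_size) {
  std::vector<std::thread> __workers;
  for (std::size_t __i = 0; __i < __v.size(); __i += __chunk_size) {
    std::size_t __end = std::min(__i + __chunk_size, __v.size());
    // each worker sorts a disjoint subrange, so no synchronization is needed
    __workers.emplace_back([&__v, __i, __end] { std::sort(__v.begin() + __i, __v.begin() + __end); });
  }
  for (auto& __t : __workers)
    __t.join();
  // serial merge: double the merged-run size each pass
  for (std::size_t __sz = __chunk_size; __sz < __v.size(); __sz *= 2)
    for (std::size_t __i = 0; __i + __sz < __v.size(); __i += 2 * __sz)
      std::inplace_merge(__v.begin() + __i, __v.begin() + __i + __sz,
                         __v.begin() + std::min(__i + 2 * __sz, __v.size()));
}
```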


Those tests should be moved to the fuzz patch.

philnik updated this revision to Diff 538889.Jul 10 2023, 6:09 PM
philnik marked 8 inline comments as done.

Address comments


I'd rather do that in a follow-up. We don't have to make this implementation perfect from the start.

ldionne requested changes to this revision.Jul 11 2023, 8:31 AM
ldionne added inline comments.

We need `set(LIBCXX_PSTL_CPU_BACKEND libdispatch)` inside `libcxx/cmake/caches/Apple.cmake`.


Missing include for move_iterator.


If you use __first_iters to store the result, this becomes:

__result += (__mid1 - __first1) + (__mid2 - __first2);
__first1 = __mid1;
__first2 = __mid2;
return __merge_range_t{__first1, __first2, __result};

And above becomes

__ranges.emplace_back(__first1, __first2, __result);

instead of

__ranges.emplace_back(__first1, __first2, _RandomAccessIterator3{});

It makes it clearer that we do N-1 iterations, and also that we require __partitions.__chunk_count_ > 1 (which is the case, see above).


IMO that's easier to read, and it's equivalent.


I don't think this was done.

This revision now requires changes to proceed.Jul 11 2023, 8:31 AM
philnik updated this revision to Diff 539136.Jul 11 2023, 8:51 AM
philnik marked 6 inline comments as done.

Address comments

philnik updated this revision to Diff 539172.Jul 11 2023, 10:02 AM

Try to fix CI

philnik updated this revision to Diff 539284.Jul 11 2023, 1:46 PM

Try to fix CI

philnik updated this revision to Diff 539560.Jul 12 2023, 7:52 AM

Try to fix CI

ldionne accepted this revision.Jul 12 2023, 8:32 AM

LGTM once CI passes. Thanks a lot for all the work on this! It's not the final implementation, but this is a really good start and we can then improve on it iteratively.

This revision is now accepted and ready to land.Jul 12 2023, 8:32 AM
This revision was landed with ongoing or failed builds.Jul 12 2023, 1:27 PM
This revision was automatically updated to reflect the committed changes.