This is an archive of the discontinued LLVM Phabricator instance.

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
44	If it does.
55–57	This seems a bit convoluted to me. Either set `__partitions.__x` directly or use local variables and then return `__chunk_partitions{_vars...}`?
60	This variable name is super confusing with the above! Also I would probably name it something like `__n`, which makes it clear that this is the total number of elements in the range.
108	I would rename this to avoid `__parallel` in the name since it makes it look like a CPU backend customization point, when it's not.
213	Otherwise this code doesn't work as-is. We also discussed using `unique_ptr<optional<_Value>[]>` which would destroy objects correctly in case an exception is thrown while constructing values below. This potentially has a non-negligible memory overhead. No conclusion yet, but I think I would like to either make this code locally correct or have a nice way of expressing the fact that we don't care about running the destructors of `_Value` in case of an exception because we call terminate anyway.

Address comments

Harbormaster completed remote builds in B235957: Diff 527581. Jun 1 2023, 5:39 PM

Previous effort for this for cross-reference: https://reviews.llvm.org/D120186

ldionne added inline comments. Jun 2 2023, 8:32 AM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
163	If the only difference between `__parallel_merge` and `__parallel_merge_body` is that we're passing the size, then `__parallel_merge` could be implemented as a simple call to `__parallel_merge_body(__xe - __xs, __ye - __ys)`.
212–213	I think we should introduce a helper class for this. It seems like something we might have other uses for. Something like:
template <class _Tp, class _Alloc = std::allocator<_Tp>>
struct __uninitialized_buffer {
  explicit __uninitialized_buffer(size_t __n);
  _Tp* __get();
  ~__uninitialized_buffer(); // destroys the elements and frees the memory
};
219	This totally deserves a comment explaining why we're terminating here, when normally we always terminate from `terminate_on_exception`. Actually, we could also use `terminate_on_exception` here instead of `exception_guard`.
libcxx/src/pstl/libdispatch.cpp
2	We need to figure out how we're going to test the backend. Tests should be added in this patch.

philnik added a parent revision: D152208: [libc++] Introduce __make_uninitialized_buffer and use it instead of get_temporary_buffer. Jun 8 2023, 10:51 AM

Address comments

Harbormaster completed remote builds in B237556: Diff 529678. Jun 8 2023, 12:53 PM

Rebased

Harbormaster completed remote builds in B239152: Diff 531780. Jun 15 2023, 9:29 AM

ldionne published this revision for review. Jun 15 2023, 2:02 PM

ldionne added inline comments.

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
34–35
55	I would like us to test specifically the various functions we use in the backend, because otherwise we don't have any testing for those. So for example, let's create a libc++ specific test that checks for the implementation of `__partition_chunks`.
90–91	This comment seems not to be relevant anymore, we don't "create an executor".
121	There is some non-trivial logic in here that we could get wrong. For example, if we were not to use `upper_bound` below to figure out the end of the `__xm` range to merge, we could get correctness issues -- this is clearly not only about performance. We could either test this by adding GCD-specific tests for the backend, or by adding `std/` tests for e.g. `std::merge(std::par, ...)` that are specially crafted to trigger these issues. I don't care very strongly which one we go for, but we should have something because as it stands, all of our tests would pass if we had incorrect logic in here (our tests always use something smaller than the chunk size). @philnik mentioned some sort of "fuzzing" test during live review, and I think I agree this would make sense.
122–123	I don't think we need to pass the `__size_foo` variables around, we can compute them from the iterators instead since we have random access iterators.
136–137	Maybe those are more descriptive names? Especially if we drop `__size_x` and `__size_y`, `x` and `y` will lose their meaning.
143–144	Is there any reason why we use `lower_bound` here? @nadiasvertex Would you happen to remember why you did it that way in the original GCD backend implementation?
149	This implementation is quite interesting. At a high level, we basically figure out how to chunk up the ranges as we're going, and we spawn tasks two-by-two every time. My understanding is that this kind of "tree of computation" pattern is not what libdispatch excels at -- instead I think it is better to give it a "finalized" number of tasks you want to execute all at once. If that is correct, then going for a different algorithm where we instead figure out the chunking upfront and then dispatch it all at once could be a win. One way to do this could be to accumulate the work items (aka the arguments you pass to `__leaf_merge` above) in a `std::vector<_WorkItem>`, and then `dispatch_apply_f` on that. We're allowed to allocate in these algorithms so that should be an acceptable approach. Since the number of work items is roughly `n / __default_chunk_size` and that's linear, the number of work items we might have to spawn could become quite large, perhaps making it important to use `dispatch_apply_f` only once. Unfortunately this also means that it's difficult to determine the number of work items up front so we would likely need to allocate with our `std::vector` almost all the time (i.e. tricks like `llvm::SmallVector` likely don't apply here). I think the first step to resolve this comment is to figure out whether the premise that libdispatch doesn't handle those trees well is true or not. If not, then the current approach is probably the right one.
192–193	We need a test for this case -- i.e. we want to make sure that we throw an exception (and the right one) if we fail to allocate memory from the implementation of the GCD backend.
libcxx/include/__algorithm/pstl_backends/cpu_backends/transform_reduce.h
167	Can we add a test that fails here? This is a double-move issue that was present in the code before this patch. I think it would make sense to split this one off.
168	Let's capture by name since this was so not obvious.
libcxx/src/new.cpp
51–53 ↗	(On Diff #531780)	This should be in `libc++experimental.a` for now instead.

Herald added a project: Restricted Project. Jun 15 2023, 2:02 PM

Herald added a reviewer: Restricted Project.

Herald added a subscriber: libcxx-commits.

nadiasvertex added inline comments. Jun 16 2023, 6:13 AM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
143–144	It's been so long, I don't remember the exact details, sorry. I think it had something to do with calculating the correct offset. We do the search in the smallest range, and the constraints change because the place we want to compute is different in first1:last1 than it is in first2:last2.

philnik removed a parent revision: D152208: [libc++] Introduce __make_uninitialized_buffer and use it instead of get_temporary_buffer. Jun 16 2023, 8:13 AM

Address some comments

Harbormaster completed remote builds in B240043: Diff 532960. Jun 20 2023, 11:05 AM

Address some more comments

Herald added a subscriber: krytarowski. Jun 26 2023, 4:48 PM

Harbormaster completed remote builds in B241334: Diff 534786. Jun 26 2023, 5:14 PM

ldionne added inline comments. Jun 27 2023, 8:32 AM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
38	Otherwise this is a bit confusing.
102–103	Seems more fitting given the names of the iterators?
159–173	IMO this captures better the fact that we are only catching exceptions in `__libdispatch::__calculate_merge_ranges`:
pmr::vector<...> __ranges;
try {
  __ranges = __calculate(...);
} catch (...) {
  throw __pstl_bad_alloc();
}
__libdispatch::__dispatch_apply(...);
This makes it more obvious that we let `__dispatch_apply` terminate if an exception is thrown. Edit: This might not work if the allocator doesn't propagate.

ldionne added inline comments. Jun 27 2023, 12:04 PM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
159–173	pmr::vector<...> __ranges = [&] {
  try {
    return __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
  } catch (...) {
    throw ...;
  }
}();
Still debating whether that's overkill, but it works around the allocator propagation issue. I think we should leave the code as-is after consideration.
libcxx/include/__utility/terminate_on_exception.h
30–31	I'm not sure this is correct as-is. Let's say we have a predicate to `find_if` that happens to use `__pstl_merge` in its implementation, and let's say the "setup" code for `__pstl_merge` fails to allocate and throws `__pstl_bad_alloc`. It will be caught by `find_if`'s `__terminate_on_exception` wrapper and be rethrown. Right? If so, then we'd be incorrectly propagating the exception. Instead, I think we might want to use `__terminate_on_exception` in the backend implementation only around user code (not setup code), so we'd remove it from the various functions that take `__cpu_backend_tag`. It would also not need to handle `__pstl_bad_alloc` anymore because it would never wrap "setup" code. This should be done in a separate patch, and we'll likely need some tests as well.
libcxx/src/CMakeLists.txt
322–323
libcxx/src/pstl/libdispatch.cpp
12	We might want to move this system header below the library's own includes, I think that's what we usually do.
29–30	Let's say we want 3x to 100x as many "tasks" as we have cores. Let's say for now that we always want 50x as many, just to pick something. Then we could also do:
const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
const auto chunk_size = element_count / target_number_of_chunks; // roughly
Then we could also add some logic like not having chunks smaller than X elements or something like that. Or we could make the 50x scale from 3x to 100x based on the number of elements we're processing. At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on `__default_chunk_size` (which itself is 100% thin air).
You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores, and starts "somewhere". I think I agree.
Another observation is that we should probably err in favour of spawning more tasks than spawning fewer tasks. At the end of the day, the user did request the parallel version of the algorithm. If they use `std::for_each(std::execution::par)` with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores, and in the worst case those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently and the user made a poor decision to try and parallelize that part of their code.
In summary, I would suggest this scheme:
We spawn at least `min(element_count, thread::hardware_concurrency())` tasks always.
When `element_count > thread::hardware_concurrency()`, we increase logarithmically as a function of `element_count` with an asymptote located at roughly `100 * thread::hardware_concurrency()`.
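The fixed "50x as many tasks as cores" idea from the comment on lines 29–30 can be sketched as below. This is only an illustration, not the code that landed: `pick_chunk_size`, `tasks_per_core` and `min_chunk` are hypothetical names, and the core count is passed in explicitly (instead of calling `thread::hardware_concurrency()`) so the arithmetic is easy to check.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not libc++ code: choose a chunk size so that roughly
// `cores * tasks_per_core` chunks are produced, never going below `min_chunk`.
std::size_t pick_chunk_size(std::size_t element_count, std::size_t cores,
                            std::size_t tasks_per_core = 50, std::size_t min_chunk = 1) {
  // Guard against a zero core count or multiplier so the division below is safe.
  std::size_t target_chunks = std::max<std::size_t>(1, cores * tasks_per_core);
  std::size_t chunk_size    = element_count / target_chunks; // rounds down ("roughly")
  // Small inputs would otherwise get a chunk size of 0.
  return std::max(chunk_size, std::max<std::size_t>(1, min_chunk));
}
```

With 8 cores and 4000 elements this gives 400 target chunks of 10 elements each; with only 10 elements the chunk size bottoms out at 1, i.e. one task per element.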

Address some comments

Harbormaster completed remote builds in B241630: Diff 535165. Jun 27 2023, 3:50 PM

ldionne added inline comments. Jun 29 2023, 10:46 AM

libcxx/src/pstl/libdispatch.cpp
29–30	For small numbers of elements (`< cores`), we could create one task for each element. That's the only way to nicely handle the case where processing each element is really heavy and we really want to parallelize. Also it should be rare that users use `std::execution::par` to sort a vector of 3 ints. If they do, we have no way to tell and I'd argue it's not unreasonable for us to spawn 3 tasks.
For medium numbers of elements (say `< 500`): `cores + ((n-cores) / cores)`. This gives us a smooth transition from the previous size and then it basically grows linearly with `n`.
For larger numbers: Assuming 8 cores, we have: `log(1.01, sqrt(n)) = 800` (aka `100.499 * log(sqrt(n))` according to Wolfram Alpha) requires `n` to be roughly 8.2 million elements. That means that we'd create `100 * 8` tasks at 8.2 million elements, and then it grows really slowly. That doesn't seem unreasonable.
This is not very scientific, I'm afraid, but this seems to be somewhat reasonable while playing around with a couple of values.
size_t cores = thread::hardware_concurrency();
auto small = [](ptrdiff_t n) { return n; };
auto medium = [](ptrdiff_t n) { return cores + ((n-cores) / cores); };
// explain where this comes from, in particular that this is an approximation of `log(1.01, sqrt(n))` which seemed to be reasonable for `n` larger than 500 and tops at 800 tasks for n ~ 8 million
auto large = [](ptrdiff_t n) { return 100.499 * std::log(std::sqrt(n)); };
if (n < cores) return small(n);
else if (n < 500) return medium(n);
else return std::min(medium(n), large(n)); // provide a "smooth" transition
The above assumes that we have 8 cores, but everything kind of needs to be adjusted for different numbers of cores. I suggest we start with this just to unblock the patch, but leave a big fat comment explaining this needs to be revisited. We can take a look with the dispatch folks, or if anyone observing this review has an idea, please feel free to chime in.
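The three regimes in the comment above are written as a quick sketch that would not compile as-is (the lambdas do not capture `cores`, and `std::min` mixes integer and floating-point results). A compilable reading of the same heuristic might look like the following; `task_count` is a hypothetical name, and `cores` is a parameter here rather than `thread::hardware_concurrency()` so the values can be checked by hand.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Sketch of the small / medium / large task-count heuristic from the review.
// Not the landed implementation; constants (500, 100.499) are taken from the comment.
std::size_t task_count(std::size_t n, std::size_t cores) {
  if (cores == 0)
    cores = 1; // hardware_concurrency() may report 0 when unknown
  if (n < cores)
    return n; // small: one task per element

  // medium: starts at `cores` and then grows linearly with n
  const std::size_t medium = cores + (n - cores) / cores;
  if (n < 500)
    return medium;

  // large: 100.499 * ln(sqrt(n)) approximates log base 1.01 of sqrt(n)
  const auto large =
      static_cast<std::size_t>(100.499 * std::log(std::sqrt(static_cast<double>(n))));
  return std::min(medium, large); // keeps the transition at n == 500 smooth
}
```

For 8 cores this yields 3 tasks for 3 elements, 19 tasks for 100 elements, 69 tasks for 500 elements, and roughly 800 tasks at about 8.2 million elements, which matches the numbers quoted in the comment.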

Address more comments

Herald added a subscriber: mgrang. Jun 29 2023, 2:49 PM

Harbormaster completed remote builds in B242251: Diff 536011. Jun 29 2023, 4:41 PM

ldionne added inline comments. Jul 5 2023, 9:53 AM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
113–117	We should ensure that this `emplace_back()` doesn't need to reallocate, otherwise it gets messy with exceptions (and it's a bit inefficient). We should add an assertion here `_LIBCPP_ASSERT_UNCATEGORIZED(__ranges.capacity() >= __ranges.size() + 1);`.
146–147	Here let's `reserve()` in advance. The upper bound on the number of ranges we'll ever allocate should be the number of leaves in the "call tree of `__calculate_merge_ranges`". In the worst case we only divide the biggest range by two and we don't touch the other range at every step, so the number of levels in that call tree is bounded by the number of times you can divide each range by two. You want to sum the worst case for each range here. The number of leaves will be bounded by 2^levels. Since `levels` grows logarithmically, this shouldn't be as bad as it seems, basically this would end up being linear. So I think this gives us:
2^(log_2(size1/default_chunk_size)) == size1/default_chunk_size
If we add both of the ranges, it means we'd be bounded by:
size1/default_chunk_size + size2/default_chunk_size == (size1+size2)/default_chunk_size
But actually, we might be able to ditch this altogether and instead base our chunking on `__partition_chunks` by chunking the largest range (or a combination of both) iteratively. I'm not seeing the solution clearly right now but I think we might be able to come up with something better than what we have.
Potential alternative: We could instead walk the largest range by increments of `chunk_size` and compute the matching location to merge from in the small range using `lower_bound` or `upper_bound`, iteratively. We would pre-reserve `large-range / chunk_size` in the vector. That seems simpler and while there are adversarial inputs with that method, the general case is better behaved.
libcxx/test/libcxx/algorithms/pstl.libdispatch.chunk_partitions.pass.cpp
17
20
libcxx/test/std/algorithms/alg.modifying.operations/alg.copy/fuzz.pstl.copy.sh.cpp
42–46 ↗	(On Diff #536011)
libcxx/test/std/algorithms/alg.sorting/alg.merge/fuzz.pstl.merge.sh.cpp
15 ↗	(On Diff #536011)	What is that number? Can you document it with a comment and the rationale for using it? (e.g. "we want this to contain at least N + M elements so we can merge large-ish ranges")
libcxx/test/std/algorithms/alg.sorting/alg.merge/pstl.merge.pass.cpp
106–111	Let's do this as a separate patch.
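The "potential alternative" described in the comment on lines 146–147 (walk one range in steps of `chunk_size`, find the matching split point in the other range with `lower_bound`, and reserve the vector up front) can be sketched on plain `std::vector<int>` as follows. This is a simplified illustration under stated assumptions, not the libc++ implementation: `MergeRange`, `compute_merge_ranges` and `chunked_merge` are hypothetical names, `chunk_size` is assumed to be positive, the first range `a` is always the one being walked (the review suggests walking the larger one), and the per-chunk merges run serially where a backend would dispatch them in parallel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Start offsets of one chunk: into the first range, the second range, and the output.
struct MergeRange {
  std::ptrdiff_t first1, first2, out;
};

// Walk `a` in steps of `chunk_size`; for each step, find where the last element of
// the step would be inserted in `b`. Both inputs must be sorted, chunk_size > 0.
std::vector<MergeRange> compute_merge_ranges(const std::vector<int>& a, const std::vector<int>& b,
                                             std::ptrdiff_t chunk_size) {
  const auto n1 = static_cast<std::ptrdiff_t>(a.size());
  const auto n2 = static_cast<std::ptrdiff_t>(b.size());
  std::vector<MergeRange> ranges;
  // Upper bound on the number of entries, so push_back never reallocates.
  ranges.reserve(static_cast<std::size_t>(n1 / chunk_size) + 2);

  std::ptrdiff_t i = 0, j = 0;
  ranges.push_back({0, 0, 0});
  while (n1 - i > chunk_size) {
    i += chunk_size;
    // Elements of `b` strictly less than the last element of this chunk of `a`
    // belong to this chunk; equal elements go to the next one.
    j = std::lower_bound(b.begin() + j, b.end(), a[static_cast<std::size_t>(i - 1)]) - b.begin();
    ranges.push_back({i, j, i + j});
  }
  ranges.push_back({n1, n2, n1 + n2}); // sentinel marking the end of everything
  return ranges;
}

// Merge chunk k using ranges[k] and ranges[k + 1]. Every iteration writes to a
// disjoint part of `out`, so the loop body could run as an independent task.
std::vector<int> chunked_merge(const std::vector<int>& a, const std::vector<int>& b,
                               std::ptrdiff_t chunk_size) {
  const auto ranges = compute_merge_ranges(a, b, chunk_size);
  std::vector<int> out(a.size() + b.size());
  for (std::size_t k = 0; k + 1 < ranges.size(); ++k) {
    const MergeRange& cur  = ranges[k];
    const MergeRange& next = ranges[k + 1];
    std::merge(a.begin() + cur.first1, a.begin() + next.first1,
               b.begin() + cur.first2, b.begin() + next.first2,
               out.begin() + cur.out);
  }
  return out;
}
```

Storing the sentinel as the last entry means chunk `k` is always described by `ranges[k]` and `ranges[k + 1]`, with no special case for the first or last chunk. The adversarial case mentioned in the comment shows up when most of `b` falls between two neighbouring elements of `a`: one chunk then receives nearly all of `b`.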

Address more comments

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
113–117	Not applicable anymore.
libcxx/include/__utility/terminate_on_exception.h
30–31	See D154238.
libcxx/test/std/algorithms/alg.sorting/alg.merge/pstl.merge.pass.cpp
106–111	See D154546.

Harbormaster completed remote builds in B243346: Diff 537548. Jul 5 2023, 6:31 PM

ldionne added inline comments. Jul 6 2023, 10:35 AM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
75–83	This isn't used anymore.
189–190	Future optimization: We should allocate only one temporary value per worker thread, not one per chunk. Then we should pad the storage to make sure they all fall on different cache lines to avoid false sharing. Can you add a TODO comment mentioning that?
libcxx/src/pstl/libdispatch.cpp
33	Can you add a TODO comment to use the number of cores that libdispatch will actually use instead of the total number of cores on the system?
37–38
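The future optimization requested in the comment on lines 189–190 (one temporary per worker thread, padded so that no two temporaries share a cache line) could look roughly like this. It is a sketch under assumptions: `kAssumedCacheLine` is a guessed 64-byte line size rather than a value queried from the platform, `PaddedSlot` and `sum_with_per_worker_slots` are hypothetical names, `workers` must be non-zero, and the loop is a serial stand-in for the parallel work.

```cpp
#include <cstddef>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
inline constexpr std::size_t kAssumedCacheLine = 64;

// One accumulator per worker, aligned (and therefore padded) to a full cache line.
template <class T>
struct alignas(kAssumedCacheLine) PaddedSlot {
  T value{};
};

// Worker w only ever writes slots[w], so two workers never touch the same line.
long sum_with_per_worker_slots(const std::vector<long>& data, std::size_t workers) {
  std::vector<PaddedSlot<long>> slots(workers);
  for (std::size_t i = 0; i != data.size(); ++i)
    slots[i % workers].value += data[i]; // serial stand-in for the parallel loop
  long total = 0;
  for (const auto& slot : slots)
    total += slot.value;
  return total;
}
```

The trade-off is memory: each slot occupies a whole cache line regardless of `sizeof(T)`, which is why allocating one slot per worker rather than one per chunk matters.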

ldionne added inline comments. Jul 10 2023, 9:03 AM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h

189

Until we re-introduce __uninitialized_buffer, we can do this instead:

// TODO: Use __uninitialized_buffer
auto __destroy = [=](_Value* __ptr){
  std::destroy_n(__ptr, __partitions.__chunk_count_);
  std::allocator<_Value>().deallocate(__ptr, __partitions.__chunk_count_);
};
unique_ptr<_Value, decltype(__destroy)> __values(std::allocator<_Value>().allocate(__partitions.__chunk_count_), __destroy);

// use __dispatch_apply(...)

auto __tmp = std::transform_reduce(__values.get(), __values.get() + __partitions.__chunk_count_, __init, __transform, __reduction);
return __tmp;
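A stand-alone version of the `unique_ptr`-with-custom-deleter pattern above, written against the public standard library only, may make the ownership rules easier to see. It is a sketch, not the libc++ code: `Tracked` and `sum_of_chunk_results` are hypothetical, a plain loop stands in for `__dispatch_apply`, and `std::accumulate` stands in for the final reduction. As in the review discussion, the deleter destroys all `chunk_count` slots, so it is only correct if every slot has been constructed by the time it runs.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <numeric>

// Value type that counts how many instances are currently alive.
struct Tracked {
  static inline int live = 0;
  long value;
  explicit Tracked(long v) : value(v) { ++live; }
  Tracked(const Tracked& other) : value(other.value) { ++live; }
  ~Tracked() { --live; }
};

long sum_of_chunk_results(std::size_t chunk_count) {
  // The deleter runs the destructors and then releases the raw storage.
  auto destroy = [chunk_count](Tracked* ptr) {
    std::destroy_n(ptr, chunk_count);
    std::allocator<Tracked>().deallocate(ptr, chunk_count);
  };
  std::unique_ptr<Tracked, decltype(destroy)> values(
      std::allocator<Tracked>().allocate(chunk_count), destroy);

  // Stand-in for the parallel step: each "chunk" constructs its result in place.
  for (std::size_t i = 0; i != chunk_count; ++i)
    ::new (static_cast<void*>(values.get() + i)) Tracked(static_cast<long>(i) + 1);

  // unique_ptr has no iterator arithmetic, so the raw pointer is used via get().
  return std::accumulate(values.get(), values.get() + chunk_count, 0L,
                         [](long acc, const Tracked& t) { return acc + t.value; });
}
```

After `sum_of_chunk_results(4)` returns 10, `Tracked::live` is back to 0, showing that the deleter destroyed every element before freeing the buffer.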

Address comments

Harbormaster completed remote builds in B244188: Diff 538701. Jul 10 2023, 11:53 AM

Fix diff

ldionne added inline comments. Jul 10 2023, 2:46 PM

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
48–49	Not used anymore.
121–133	As discussed, this seems to be a bit simplified:
auto __compute_chunk = [&](size_t __chunk_size) -> __merge_range_t {
  auto [__mid1, __mid2] = [&] {
    if (__iterate_first_range) {
      auto __m1 = __first1 + __chunk_size;
      auto __m2 = std::lower_bound(__first2, __last2, __m1[-1], __comp);
      return std::make_pair(__m1, __m2);
    } else {
      auto __m2 = __first2 + __chunk_size;
      auto __m1 = std::lower_bound(__first1, __last1, __m2[-1], __comp);
      return std::make_pair(__m1, __m2);
    }
  }();
  __merge_range_t __ret{__mid1, __mid2, __result};
  __result += (__mid1 - __first1) + (__mid2 - __first2);
  __first1 = __mid1;
  __first2 = __mid2;
  return __ret;
};
151	As discussed live, you could emplace_back the first iterators of the entire range manually into the vector, and then you'd be able to remove the special casing for `__index == 0`. You'd always look at `__ranges[__index]` and `__ranges[__index + 1]` here.
199–201	This doesn't work if you have only 1 element in a chunk (or if that's the case for the `__first_chunk_size`). We should have a test that exercises that. I guess the expected behavior in that case is that we'd just use `std::construct_at(__values + __index, __transform(*__first));`
205	This should only be a `reduce`. This should be catchable if the transform returns a different type. Test coverage should be added for that.
213	Can we instead run the `sort` in parallel by chunking and then run a serial merge? That's not too hard to implement and it's much better than doing everything serial. We can leave a TODO for parallelizing the `merge` since we don't have a clear way to do it.
libcxx/test/std/algorithms/alg.modifying.operations/alg.copy/fuzz.pstl.copy.sh.cpp
1 ↗	(On Diff #538792)	Those tests should be moved to the `fuzz` patch.
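The single-element-chunk problem raised in the comment on lines 199–201 comes from seeding the per-chunk reduction with the first two elements. One way to avoid the special case, shown here only as an illustration, is to seed with the transformed first element and fold in the rest; `transform_reduce_chunk` and `sum_of_squares` are hypothetical names, the chunk is assumed to contain at least one element, and this is not necessarily how the landed patch handles it.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Reduce one non-empty chunk [first, first + size) without needing an identity value.
template <class It, class Transform, class Combine>
auto transform_reduce_chunk(It first, std::size_t size, Transform transform, Combine combine) {
  auto acc = transform(*first); // a chunk of exactly one element ends here
  for (std::size_t i = 1; i != size; ++i)
    acc = combine(std::move(acc), transform(first[i]));
  return acc;
}

// Usage example: treat the whole vector as a single chunk.
int sum_of_squares(const std::vector<int>& v) {
  return transform_reduce_chunk(
      v.begin(), v.size(), [](int x) { return x * x; }, [](int a, int b) { return a + b; });
}
```

A test in the spirit of the review comment would call this with a one-element chunk, for example `sum_of_squares({3})`, alongside a larger input.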

Address comments

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
213	I'd rather do that in a follow-up. It's not like we have to make this implementation perfect from the start.

philnik added a parent revision: D154913: [libc++][PSTL] Fix double-move in std::transform_reduce. Jul 10 2023, 6:09 PM

Harbormaster completed remote builds in B244326: Diff 538889. Jul 11 2023, 12:24 AM

ldionne requested changes to this revision. Jul 11 2023, 8:31 AM

ldionne added inline comments.

libcxx/CMakeLists.txt
800	We need `set(LIBCXX_PSTL_CPU_BACKEND libdispatch)` inside `libcxx/cmake/caches/Apple.cmake`
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
13	Missing include for `move_iterator`.
131–135	If you use `__first_iters` to store the result, this becomes:
__result += (__mid1 - __first1) + (__mid2 - __first2);
__first1 = __mid1;
__first2 = __mid2;
return __merge_range_t{__first1, __first2, __result};
And above becomes
__ranges.emplace_back(__first1, __first2, __result);
instead of
__ranges.emplace_back(__first1, __first2, _RandomAccessIterator3{});
142–143	It makes it clearer that we do `N-1` iterations, and also that we require `__partitions.__chunk_count_ > 1` (which is the case, see above).
148	IMO that's easier to read, and it's equivalent.
libcxx/src/CMakeLists.txt
322–323	I don't think this was done.

This revision now requires changes to proceed. Jul 11 2023, 8:31 AM

Address comments

Herald added a subscriber: arichardson. Jul 11 2023, 8:51 AM

Harbormaster completed remote builds in B244501: Diff 539136. Jul 11 2023, 9:35 AM

Try to fix CI

Harbormaster completed remote builds in B244524: Diff 539172. Jul 11 2023, 1:39 PM

Try to fix CI

Harbormaster completed remote builds in B244601: Diff 539284. Jul 11 2023, 5:24 PM

Try to fix CI

LGTM once CI passes. Thanks a lot for all the work on this! It's not the final implementation, but this is a really good start and we can then improve on it iteratively.

This revision is now accepted and ready to land. Jul 12 2023, 8:32 AM

Harbormaster completed remote builds in B244793: Diff 539560. Jul 12 2023, 12:28 PM

philnik removed a parent revision: D154913: [libc++][PSTL] Fix double-move in std::transform_reduce. Jul 12 2023, 1:22 PM

This revision was landed with ongoing or failed builds. Jul 12 2023, 1:27 PM

Closed by commit rG2b2e7f6e5727: [libc++][PSTL] Add a GCD backend (authored by philnik).

This revision was automatically updated to reflect the committed changes.

philnik added a commit: rG2b2e7f6e5727: [libc++][PSTL] Add a GCD backend.

Revision Contents

Path	Size
libcxx/CMakeLists.txt	4 lines
libcxx/cmake/caches/Apple.cmake	1 line
libcxx/include/CMakeLists.txt	1 line
libcxx/include/__algorithm/pstl_backend.h	3 lines
libcxx/include/__algorithm/pstl_backends/cpu_backends/	2 lines; 226 lines; 2 lines; 1 line
libcxx/include/__numeric/reduce.h	3 lines
libcxx/include/__utility/terminate_on_exception.h	1 line
libcxx/include/module.modulemap.in	3 lines
libcxx/src/CMakeLists.txt	4 lines
libcxx/src/pstl/libdispatch.cpp	71 lines
libcxx/test/libcxx/algorithms/pstl.libdispatch.chunk_partitions.pass.cpp	28 lines
libcxx/test/std/algorithms/alg.sorting/alg.merge/pstl.merge.pass.cpp	7 lines
libcxx/test/std/numeric.ops/transform.reduce/pstl.transform_reduce.binary.pass.cpp	35 lines
libcxx/utils/data/ignore_format.txt	1 line
libcxx/utils/libcxx/test/features.py	1 line

Diff 539708

libcxx/CMakeLists.txt

  elseif (LIBCXX_HARDENING_MODE STREQUAL "unchecked")
    config_define(0 _LIBCPP_ENABLE_HARDENED_MODE_DEFAULT)
    config_define(0 _LIBCPP_ENABLE_DEBUG_MODE_DEFAULT)
  endif()

  if (LIBCXX_PSTL_CPU_BACKEND STREQUAL "serial")
    config_define(1 _LIBCPP_PSTL_CPU_BACKEND_SERIAL)
  elseif(LIBCXX_PSTL_CPU_BACKEND STREQUAL "std_thread")
    config_define(1 _LIBCPP_PSTL_CPU_BACKEND_THREAD)
+ elseif(LIBCXX_PSTL_CPU_BACKEND STREQUAL "libdispatch")
ldionne (Done): We need `set(LIBCXX_PSTL_CPU_BACKEND libdispatch)` inside `libcxx/cmake/caches/Apple.cmake`
+   config_define(1 _LIBCPP_PSTL_CPU_BACKEND_LIBDISPATCH)
  else()
    message(FATAL_ERROR "LIBCXX_PSTL_CPU_BACKEND is set to ${LIBCXX_PSTL_CPU_BACKEND}, which is not a valid backend.
-   Valid backends are: serial, std_thread")
+   Valid backends are: serial, std_thread and libdispatch")
  endif()

  if (LIBCXX_ABI_DEFINES)
    set(abi_defines)
    foreach (abi_define ${LIBCXX_ABI_DEFINES})
      if (NOT abi_define MATCHES "^_LIBCPP_ABI_")
        message(SEND_ERROR "Invalid ABI macro ${abi_define} in LIBCXX_ABI_DEFINES")
      endif()

libcxx/cmake/caches/Apple.cmake

  set(CMAKE_BUILD_TYPE MinSizeRel CACHE STRING "")
  set(CMAKE_POSITION_INDEPENDENT_CODE OFF CACHE BOOL "")

  set(LIBCXX_USE_COMPILER_RT ON CACHE BOOL "")
  set(LIBCXX_ENABLE_ASSERTIONS OFF CACHE BOOL "")
  set(LIBCXX_ABI_VERSION "1" CACHE STRING "")
  set(LIBCXX_ENABLE_STATIC ON CACHE BOOL "")
  set(LIBCXX_ENABLE_SHARED ON CACHE BOOL "")
  set(LIBCXX_CXX_ABI libcxxabi CACHE STRING "")
  set(LIBCXX_ENABLE_VENDOR_AVAILABILITY_ANNOTATIONS ON CACHE BOOL "")
+ set(LIBCXX_PSTL_CPU_BACKEND libdispatch)

  set(LIBCXX_HERMETIC_STATIC_LIBRARY ON CACHE BOOL "")
  set(LIBCXXABI_HERMETIC_STATIC_LIBRARY ON CACHE BOOL "")

  set(LIBCXXABI_ENABLE_ASSERTIONS OFF CACHE BOOL "")
  set(LIBCXXABI_ENABLE_FORGIVING_DYNAMIC_CAST ON CACHE BOOL "")

  set(LIBCXX_TEST_CONFIG "apple-libc++-shared.cfg.in" CACHE STRING "")
  set(LIBCXXABI_TEST_CONFIG "apple-libc++abi-shared.cfg.in" CACHE STRING "")

libcxx/include/CMakeLists.txt

  set(files
    __algorithm/pstl_any_all_none_of.h
    __algorithm/pstl_backend.h
    __algorithm/pstl_backends/cpu_backend.h
    __algorithm/pstl_backends/cpu_backends/any_of.h
    __algorithm/pstl_backends/cpu_backends/backend.h
    __algorithm/pstl_backends/cpu_backends/fill.h
    __algorithm/pstl_backends/cpu_backends/find_if.h
    __algorithm/pstl_backends/cpu_backends/for_each.h
+   __algorithm/pstl_backends/cpu_backends/libdispatch.h
    __algorithm/pstl_backends/cpu_backends/merge.h
    __algorithm/pstl_backends/cpu_backends/serial.h
    __algorithm/pstl_backends/cpu_backends/stable_sort.h
    __algorithm/pstl_backends/cpu_backends/thread.h
    __algorithm/pstl_backends/cpu_backends/transform.h
    __algorithm/pstl_backends/cpu_backends/transform_reduce.h
    __algorithm/pstl_copy.h
    __algorithm/pstl_count.h

libcxx/include/__algorithm/pstl_backend.h

  # if _LIBCPP_STD_VER >= 20
  template <>
  struct __select_backend<std::execution::unsequenced_policy> {
    using type = __cpu_backend_tag;
  };
  # endif

- # if defined(_LIBCPP_PSTL_CPU_BACKEND_SERIAL) || defined(_LIBCPP_PSTL_CPU_BACKEND_THREAD)
+ # if defined(_LIBCPP_PSTL_CPU_BACKEND_SERIAL) || defined(_LIBCPP_PSTL_CPU_BACKEND_THREAD) || \
+     defined(_LIBCPP_PSTL_CPU_BACKEND_LIBDISPATCH)
  template <>
  struct __select_backend<std::execution::parallel_policy> {
    using type = __cpu_backend_tag;
  };

  template <>
  struct __select_backend<std::execution::parallel_unsequenced_policy> {
    using type = __cpu_backend_tag;

libcxx/include/__algorithm/pstl_backends/cpu_backends/backend.h

  #include <__config>
  #include <cstddef>

  #if defined(_LIBCPP_PSTL_CPU_BACKEND_SERIAL)
  # include <__algorithm/pstl_backends/cpu_backends/serial.h>
  #elif defined(_LIBCPP_PSTL_CPU_BACKEND_THREAD)
  # include <__algorithm/pstl_backends/cpu_backends/thread.h>
+ #elif defined(_LIBCPP_PSTL_CPU_BACKEND_LIBDISPATCH)
+ # include <__algorithm/pstl_backends/cpu_backends/libdispatch.h>
  #else
  # error "Invalid CPU backend choice"
  #endif

  #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
  # pragma GCC system_header
  #endif

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h

This file was added.

//===----------------------------------------------------------------------===//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//===----------------------------------------------------------------------===//

#ifndef _LIBCPP___ALGORITHM_PSTL_BACKENDS_CPU_BACKENDS_LIBDISPATCH_H
#define _LIBCPP___ALGORITHM_PSTL_BACKENDS_CPU_BACKENDS_LIBDISPATCH_H

#include <__algorithm/lower_bound.h>
#include <__algorithm/upper_bound.h>

ldionne (Done): Missing include for `move_iterator`.

#include <__atomic/atomic.h>
#include <__config>
#include <__exception/terminate.h>
#include <__iterator/iterator_traits.h>
#include <__iterator/move_iterator.h>
#include <__memory/construct_at.h>
#include <__memory/unique_ptr.h>
#include <__memory_resource/memory_resource.h>
#include <__numeric/reduce.h>
#include <__utility/exception_guard.h>
#include <__utility/move.h>
#include <__utility/terminate_on_exception.h>
#include <cstddef>
#include <new>
#include <vector>

#if !defined(_LIBCPP_HAS_NO_INCOMPLETE_PSTL) && _LIBCPP_STD_VER >= 17

_LIBCPP_BEGIN_NAMESPACE_STD

namespace __par_backend {
inline namespace __libdispatch {

ldionneUnsubmitted

Done

inline namespace __libdispatch {

+ // ::dispatch_apply is marked as __attribute__((nothrow)) because it doesn't let exceptions propagate, and

+ // neither do we.

_LIBCPP_FUNC_VIS void

__dispatch_apply(size_t __chunk_count, void* __context, void (*__func)(void* __context, size_t __chunk)) noexcept;

template <class _Func>

ldionne:

// ::dispatch_apply is marked as __attribute__((nothrow)) because it doesn't let exceptions propagate, and neither do

// we.

ldionneUnsubmitted

Done

[[_Clang::__callback__(__func, __context, __)]] _LIBCPP_EXPORTED_FROM_ABI void

- __dispatch_apply(size_t __chunk_count, void* __context, void (*__func)(void* __context, size_t __chunk)) noexcept;

+ __dispatch_apply(size_t __chunk_count, void* __context, void (*__func)(void* __ctx, size_t __chunk)) noexcept;

template <class _Func>

Otherwise this is a bit confusing.


// TODO: Do we want to add [[_Clang::__callback__(__func, __context, __)]]?

_LIBCPP_EXPORTED_FROM_ABI void

__dispatch_apply(size_t __chunk_count, void* __context, void (*__func)(void* __context, size_t __chunk)) noexcept;

template <class _Func>

_LIBCPP_HIDE_FROM_ABI void __dispatch_apply(size_t __chunk_count, _Func __func) noexcept {

ldionneUnsubmitted

Done

struct __chunk_partitions {

- size_t __chunk_count_;

+ size_t __chunk_count_; // includes the first chunk

size_t __chunk_size_;

If it does.


__libdispatch::__dispatch_apply(__chunk_count, &__func, [](void* __context, size_t __chunk) {

(*static_cast<_Func*>(__context))(__chunk);

});

}
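The templated wrapper above adapts an arbitrary callable to a C-style `(context pointer, function pointer)` interface. A minimal sketch of that type-erasure pattern follows. It assumes a hypothetical serial `apply_serial` standing in for the real libdispatch entry point; `apply_serial`, `apply_chunks`, and `visited_chunks` are illustrative names, not part of libc++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for the C-style entry point: it only knows
// about an opaque context pointer and a plain function pointer.
void apply_serial(std::size_t chunk_count, void* context,
                  void (*func)(void* context, std::size_t chunk)) noexcept {
  for (std::size_t chunk = 0; chunk != chunk_count; ++chunk)
    func(context, chunk);
}

// Adapter: passes the address of the callable as the context and uses a
// captureless lambda (convertible to a function pointer) to cast it back.
template <class Func>
void apply_chunks(std::size_t chunk_count, Func func) noexcept {
  apply_serial(chunk_count, &func, [](void* context, std::size_t chunk) {
    (*static_cast<Func*>(context))(chunk);
  });
}

// Records the chunk indices in the order the callable is invoked.
std::vector<std::size_t> visited_chunks(std::size_t chunk_count) {
  std::vector<std::size_t> visited;
  apply_chunks(chunk_count, [&](std::size_t chunk) { visited.push_back(chunk); });
  return visited;
}
```

Calling `visited_chunks(3)` runs the capturing lambda once per chunk index through the function-pointer interface. Because the callable outlives the call, passing its address as the context is safe in this sketch.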

ldionneUnsubmitted

Done

`HIDE_FROM_ABI` here and below

ldionneUnsubmitted

Done

Not used anymore.


struct __chunk_partitions {

ptrdiff_t __chunk_count_; // includes the first chunk

ptrdiff_t __chunk_size_;

ptrdiff_t __first_chunk_size_;

};

ldionneUnsubmitted

Done

I would like us to test specifically the various functions we use in the backend, because otherwise we don't have any testing for those. So for example, let's create a libc++ specific test that checks for the implementation of `__partition_chunks`.
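One property such a test could check is that the chunks exactly tile the input range. The sketch below assumes a hypothetical partitioning policy (a fixed target chunk size, with the remainder folded into the first chunk); the real `__partition_chunks` may choose sizes differently. Only the offset formula in `chunk_bounds` is taken from `__parallel_for` in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

struct chunk_partitions {
  std::ptrdiff_t chunk_count;      // includes the first chunk
  std::ptrdiff_t chunk_size;
  std::ptrdiff_t first_chunk_size;
};

// Hypothetical policy, for illustration only: chunks of `target` elements,
// with the remainder added to the first chunk.
chunk_partitions partition_chunks(std::ptrdiff_t n, std::ptrdiff_t target) {
  if (n == 0)
    return {0, 0, 0};
  if (n <= target)
    return {1, 0, n};
  return {n / target, target, target + n % target};
}

// Same index arithmetic as the lambda in __parallel_for: returns [begin, end).
std::pair<std::ptrdiff_t, std::ptrdiff_t>
chunk_bounds(const chunk_partitions& p, std::ptrdiff_t chunk) {
  std::ptrdiff_t size  = chunk == 0 ? p.first_chunk_size : p.chunk_size;
  std::ptrdiff_t index = chunk == 0 ? 0 : chunk * p.chunk_size + (p.first_chunk_size - p.chunk_size);
  return {index, index + size};
}

// True if the chunks are non-empty, contiguous, and cover exactly [0, n).
bool chunks_tile_range(std::ptrdiff_t n, std::ptrdiff_t target) {
  chunk_partitions p = partition_chunks(n, target);
  std::ptrdiff_t expected_begin = 0;
  for (std::ptrdiff_t chunk = 0; chunk != p.chunk_count; ++chunk) {
    auto [begin, end] = chunk_bounds(p, chunk);
    if (begin != expected_begin || end <= begin)
      return false;
    expected_begin = end;
  }
  return expected_begin == n;
}
```

A backend-specific test could run a check like `chunks_tile_range` over many sizes, including sizes just below, at, and just above the chunk size.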


[[__gnu__::__const__]] _LIBCPP_EXPORTED_FROM_ABI pmr::memory_resource* __get_memory_resource();

[[__gnu__::__const__]] _LIBCPP_EXPORTED_FROM_ABI __chunk_partitions __partition_chunks(ptrdiff_t __size);

ldionneUnsubmitted

Done

This seems a bit convoluted to me. Either set `__partitions.__x` directly or use local variables and then return `__chunk_partitions{_vars...}`?

template <class _RandomAccessIterator, class _Functor>

_LIBCPP_HIDE_FROM_ABI void

ldionneUnsubmitted

Done

This variable name is super confusing with the above!

Also I would probably name it something like `__n`, which makes it clear that this is the total number of elements in the range.

__parallel_for(_RandomAccessIterator __first, _RandomAccessIterator __last, _Functor __func) {

auto __partitions = __libdispatch::__partition_chunks(__last - __first);

// Perform the chunked execution.

__libdispatch::__dispatch_apply(__partitions.__chunk_count_, [&](size_t __chunk) {

auto __this_chunk_size = __chunk == 0 ? __partitions.__first_chunk_size_ : __partitions.__chunk_size_;

auto __index =

__chunk == 0

? 0

: (__chunk * __partitions.__chunk_size_) + (__partitions.__first_chunk_size_ - __partitions.__chunk_size_);

__func(__first + __index, __first + __index + __this_chunk_size);

});

}

template <class _RandomAccessIterator1, class _RandomAccessIterator2, class _RandomAccessIteratorOut>

struct __merge_range {

__merge_range(_RandomAccessIterator1 __mid1, _RandomAccessIterator2 __mid2, _RandomAccessIteratorOut __result)

: __mid1_(__mid1), __mid2_(__mid2), __result_(__result) {}

_RandomAccessIterator1 __mid1_;

_RandomAccessIterator2 __mid2_;

_RandomAccessIteratorOut __result_;

};

ldionneUnsubmitted

Done

This isn't used anymore.


template <typename _RandomAccessIterator1,

typename _RandomAccessIterator2,

typename _RandomAccessIterator3,

typename _Compare,

typename _LeafMerge>

_LIBCPP_HIDE_FROM_ABI void __parallel_merge(

_RandomAccessIterator1 __first1,

ldionneUnsubmitted

Done

This comment seems not to be relevant anymore, we don't "create an executor".


_RandomAccessIterator1 __last1,

_RandomAccessIterator2 __first2,

_RandomAccessIterator2 __last2,

_RandomAccessIterator3 __result,

_Compare __comp,

_LeafMerge __leaf_merge) {

__chunk_partitions __partitions =

__libdispatch::__partition_chunks(std::max<ptrdiff_t>(__last1 - __first1, __last2 - __first2));

if (__partitions.__chunk_count_ == 0)

return;

ldionneUnsubmitted

Done

pmr::vector<__merge_range<_RandomAccessIterator1, _RandomAccessIterator2, _RandomAccessIteratorOut>>& __ranges) {

- std::size_t __size_x = __last1 - __first1;

- std::size_t __size_y = __last2 - __first2;

+ std::size_t __size1 = __last1 - __first1;

+ std::size_t __size2 = __last2 - __first2;

if (__size_x + __size_y <= __default_chunk_size) {

Seems more fitting given the names of the iterators?


if (__partitions.__chunk_count_ == 1) {

__leaf_merge(__first1, __last1, __first2, __last2, __result, __comp);

return;

}

ldionneUnsubmitted

Done

I would rename this to avoid `__parallel` in the name since it makes it look like a CPU backend customization point, when it's not.

using __merge_range_t = __merge_range<_RandomAccessIterator1, _RandomAccessIterator2, _RandomAccessIterator3>;

vector<__merge_range_t> __ranges;

__ranges.reserve(__partitions.__chunk_count_ + 1);

ldionneUnsubmitted

Done

If this isn't part of our API right now, I wouldn't provide it. I'd add it once we need it.


// TODO: Improve the case where the smaller range is merged into just a few (or even one) chunks of the larger case

std::__terminate_on_exception([&] {

__ranges.emplace_back(__first1, __first2, __result);

ldionneUnsubmitted

Done

We should ensure that this `emplace_back()` doesn't need to reallocate, otherwise it gets messy with exceptions (and it's a bit inefficient). We should add an assertion here: `_LIBCPP_ASSERT_UNCATEGORIZED(__ranges.capacity() >= __ranges.size() + 1);`.

philnikAuthorUnsubmitted

Done

Not applicable anymore.


bool __iterate_first_range = __last1 - __first1 > __last2 - __first2;

auto __compute_chunk = [&](size_t __chunk_size) -> __merge_range_t {

auto [__mid1, __mid2] = [&] {

ldionneUnsubmitted

Done

There is some non-trivial logic in here that we could get wrong. For example, if we were not to use `upper_bound` below to figure out the end of the `__xm` range to merge, we could get correctness issues -- this is clearly not only about performance.

We could either test this by adding GCD-specific tests for the backend, or by adding `std/` tests for e.g. `std::merge(std::par, ...)` that are specially crafted to trigger these issues. I don't care very strongly which one we go for, but we should have something because as it stands, all of our tests would pass if we had incorrect logic in here (our tests always use something smaller than the chunk size).

@philnik mentioned some sort of "fuzzing" test during live review, and I think I agree this would make sense.

if (__iterate_first_range) {

auto __m1 = __first1 + __chunk_size;

ldionneUnsubmitted

Done

I don't think we need to pass the `__size_foo` variables around, we can compute them from the iterators instead since we have random access iterators.

auto __m2 = std::lower_bound(__first2, __last2, __m1[-1], __comp);

return std::make_pair(__m1, __m2);

} else {

auto __m2 = __first2 + __chunk_size;

auto __m1 = std::lower_bound(__first1, __last1, __m2[-1], __comp);

return std::make_pair(__m1, __m2);

}

}();

__result += (__mid1 - __first1) + (__mid2 - __first2);

ldionneUnsubmitted

Done

As discussed, this seems to be a bit simplified:

auto __compute_chunk = [&](size_t __chunk_size) -> __merge_range_t {
  auto [__mid1, __mid2] = [&] {
    if (__iterate_first_range) {
      auto __m1 = __first1 + __chunk_size;
      auto __m2 = std::lower_bound(__first2, __last2, __m1[-1], __comp);
      return std::make_pair(__m1, __m2);
    } else {
      auto __m2 = __first2 + __chunk_size;
      auto __m1 = std::lower_bound(__first1, __last1, __m2[-1], __comp);
      return std::make_pair(__m1, __m2);
    }
  }();

  __merge_range_t __ret{__mid1, __mid2, __result};
  __result += (__mid1 - __first1) + (__mid2 - __first2);
  __first1 = __mid1;
  __first2 = __mid2;
  return __ret;
};


__first1 = __mid1;

__first2 = __mid2;

ldionneUnsubmitted

Done

If you use `__first_iters` to store the result, this becomes:

__result += (__mid1 - __first1) + (__mid2 - __first2);
__first1 = __mid1;
__first2 = __mid2;
return __merge_range_t{__first1, __first2, __result};

And above becomes

__ranges.emplace_back(__first1, __first2, __result);

instead of

__ranges.emplace_back(__first1, __first2, _RandomAccessIterator3{});


return {std::move(__mid1), std::move(__mid2), __result};

};

ldionneUnsubmitted

Done

return;

}

- _RandomAccessIterator1 __xm;

- _RandomAccessIterator2 __ym;

+ _RandomAccessIterator1 __middle1;

+ _RandomAccessIterator2 __middle2;

if (__size_x < __size_y) {

Maybe those are more descriptive names? Especially if we drop `__size_x` and `__size_y`, `x` and `y` will lose their meaning.

// handle first chunk

__ranges.emplace_back(__compute_chunk(__partitions.__first_chunk_size_));

// handle 2 -> N - 1 chunks

for (ptrdiff_t __i = 0; __i != __partitions.__chunk_count_ - 2; ++__i)

ldionneUnsubmitted

Done

// handle 2 -> N - 1 chunks

- for (ptrdiff_t __i = 1; __i != __partitions.__chunk_count_ - 1; ++__i)

+ for (ptrdiff_t __i = 0; __i != __partitions.__chunk_count_ - 2; ++__i)

__ranges.emplace_back(__compute_chunk(__partitions.__chunk_size_));

// handle last chunk

It makes it clearer that we do `N-1` iterations, and also that we require `__partitions.__chunk_count_ > 1` (which is the case, see above).

__ranges.emplace_back(__compute_chunk(__partitions.__chunk_size_));

ldionneUnsubmitted

Done

Is there any reason why we use `lower_bound` here?

@nadiasvertex Would you happen to remember why you did it that way in the original GCD backend implementation?

nadiasvertexUnsubmitted

Done

It's been so long, I don't remember the exact details, sorry. I think it had something to do with calculating the correct offset. We do the search in the smallest range, and the constraints change because the place we want to compute is different in first1:last1 than it is in first2:last2.


// handle last chunk

__ranges.emplace_back(__last1, __last2, __result);

ldionneUnsubmitted

Done

Here let's `reserve()` in advance. The upper bound on the number of ranges we'll ever allocate should be the number of leaves in the "call tree of `__calculate_merge_ranges`". In the worst case we only divide the biggest range by two and we don't touch the other range at every step, so the number of levels in that call tree is bounded by the number of times you can divide each range by two. You want to sum the worst case for each range here. The number of leaves will be bounded by 2^levels. Since levels grows logarithmically, this shouldn't be as bad as it seems, basically this would end up being linear. So I think this gives us:

2^(log_2(size1/default_chunk_size)) == size1/default_chunk_size

If we add both of the ranges, it means we'd be bounded by:

size1/default_chunk_size + size2/default_chunk_size == (size1+size2)/default_chunk_size

But actually, we might be able to ditch this altogether and instead base our chunking on `__partition_chunks` by chunking the largest range (or a combination of both) iteratively. I'm not seeing the solution clearly right now but I think we might be able to come up with something better than what we have.

Potential alternative

We could instead walk the largest range by increments of `chunk_size` and compute the matching location to merge from in the small range using `lower_bound` or `upper_bound`, iteratively. We would pre-reserve `large-range / chunk_size` in the vector. That seems simpler and while there are adversarial inputs with that method, the general case is better behaved.
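The iterative alternative can be sketched on plain `std::vector<int>` inputs. This is an illustration of the idea under stated assumptions, not the patch's implementation: all names (`merge_point`, `compute_merge_points`, `chunked_merge`, `matches_std_merge`) are hypothetical, the chunks are merged serially where a backend would dispatch them, and `int` elements are used so that the choice between `lower_bound` and `upper_bound` does not affect the observable result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// A split position in each input; the output position is pos1 + pos2.
struct merge_point {
  std::ptrdiff_t pos1;
  std::ptrdiff_t pos2;
};

// Walk the longer input in steps of chunk_size and locate the matching split
// in the shorter input with lower_bound. Both inputs must be sorted.
std::vector<merge_point> compute_merge_points(const std::vector<int>& a, const std::vector<int>& b,
                                              std::ptrdiff_t chunk_size) {
  assert(chunk_size > 0);
  const bool walk_a = a.size() > b.size();
  const std::vector<int>& longer  = walk_a ? a : b;
  const std::vector<int>& shorter = walk_a ? b : a;
  const auto longer_size = static_cast<std::ptrdiff_t>(longer.size());

  std::vector<merge_point> points;
  points.reserve(static_cast<std::size_t>(longer_size / chunk_size) + 2); // reserve up front
  points.push_back({0, 0});
  for (std::ptrdiff_t pos = chunk_size; pos < longer_size; pos += chunk_size) {
    const int pivot = longer[static_cast<std::size_t>(pos - 1)]; // last element of this chunk
    const std::ptrdiff_t other = std::lower_bound(shorter.begin(), shorter.end(), pivot) - shorter.begin();
    points.push_back(walk_a ? merge_point{pos, other} : merge_point{other, pos});
  }
  points.push_back({static_cast<std::ptrdiff_t>(a.size()), static_cast<std::ptrdiff_t>(b.size())});
  return points;
}

// Merge each pair of adjacent split points independently. The chunks write to
// disjoint output ranges, so a parallel backend could run them concurrently.
std::vector<int> chunked_merge(const std::vector<int>& a, const std::vector<int>& b, std::ptrdiff_t chunk_size) {
  const std::vector<merge_point> points = compute_merge_points(a, b, chunk_size);
  std::vector<int> result(a.size() + b.size());
  for (std::size_t i = 0; i + 1 < points.size(); ++i) {
    const merge_point& lo = points[i];
    const merge_point& hi = points[i + 1];
    std::merge(a.begin() + lo.pos1, a.begin() + hi.pos1,
               b.begin() + lo.pos2, b.begin() + hi.pos2,
               result.begin() + (lo.pos1 + lo.pos2));
  }
  return result;
}

// Compares the chunked result against a single std::merge over the whole input.
bool matches_std_merge(const std::vector<int>& a, const std::vector<int>& b, std::ptrdiff_t chunk_size) {
  std::vector<int> expected;
  std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(expected));
  return chunked_merge(a, b, chunk_size) == expected;
}
```

A helper like `matches_std_merge` is also the shape a randomized test could take: generate sorted inputs larger than the chunk size and compare against the serial merge. Checking stability would need elements that compare equal but are distinguishable, which this `int`-based sketch does not cover.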


ldionneUnsubmitted

Done

__ranges.emplace_back(__last1, __last2, __result);

- __libdispatch::__dispatch_apply(__ranges.size() - 1, [&](size_t __index) {

+ __libdispatch::__dispatch_apply(__partitions.__chunk_count_, [&](size_t __index) {

auto __first_iters = __ranges[__index];

IMO that's easier to read, and it's equivalent.


__libdispatch::__dispatch_apply(__partitions.__chunk_count_, [&](size_t __index) {

ldionneUnsubmitted

Done

This implementation is quite interesting. At a high level, we basically figure out how to chunk up the ranges as we're going, and we spawn tasks two-by-two every time. My understanding is that this kind of "tree of computation" pattern is not what libdispatch excels at -- instead I think it is better to give it a "finalized" number of tasks you want to execute all at once.

If that is correct, then going for a different algorithm where we instead figure out the chunking upfront *and then* dispatch it all at once could be a win. One way to do this could be to accumulate the work items (aka the arguments you pass to `__leaf_merge` above) in a `std::vector<_WorkItem>`, and then `dispatch_apply_f` on that. We're allowed to allocate in these algorithms so that should be an acceptable approach. Since the number of work items is roughly `n / __default_chunk_size` and that's linear, the number of work items we might have to spawn could become quite large, perhaps making it important to use `dispatch_apply_f` only once. Unfortunately this also means that it's difficult to determine the number of work items up front so we would likely need to allocate with our `std::vector` almost all the time (i.e. tricks like `llvm::SmallVector` likely don't apply here).

I think the first step to resolve this comment is to figure out whether the premise that libdispatch doesn't handle those trees well is true or not. If not, then the current approach is probably the right one.

auto __first_iters = __ranges[__index];

auto __last_iters = __ranges[__index + 1];

ldionneUnsubmitted

Done

As discussed live, you could emplace_back the first iterators of the entire range manually into the vector, and then you'd be able to remove the special casing for __index == 0. You'd always look at __ranges[__index] and __ranges[__index + 1] here.


__leaf_merge(

__first_iters.__mid1_,

__last_iters.__mid1_,

__first_iters.__mid2_,

__last_iters.__mid2_,

__first_iters.__result_,

__comp);

});

}

template <class _RandomAccessIterator, class _Transform, class _Value, class _Combiner, class _Reduction>

ldionneUnsubmitted

Done

If the only difference between `__parallel_merge` and `__parallel_merge_body` is that we're passing the size, then `__parallel_merge` could be implemented as a simple call to `__parallel_merge_body(__xe - __xs, __ye - __ys)`.

_LIBCPP_HIDE_FROM_ABI _Value __parallel_transform_reduce(

_RandomAccessIterator __first,

_RandomAccessIterator __last,

_Transform __transform,

_Value __init,

_Combiner __combiner,

_Reduction __reduction) {

auto __partitions = __libdispatch::__partition_chunks(__last - __first);

auto __destroy = [__count = __partitions.__chunk_count_](_Value* __ptr) {

ldionneUnsubmitted

Done

IMO this captures better the fact that we are only catching exceptions in `__libdispatch::__calculate_merge_ranges`:

pmr::vector<...> __ranges;
try {
  __ranges = __calculate(...);
} catch (...) {
  throw __pstl_bad_alloc();
}

__libdispatch::__dispatch_apply(...);

This makes it more obvious that we let `__dispatch_apply` terminate if an exception is thrown.

Edit: This might not work if the allocator doesn't propagate.

ldionneUnsubmitted

Done

pmr::vector<...> __ranges = [&]{
  try {
    return __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
  } catch (...) {
    throw ...;
  }
}();

Still debating whether that's overkill, but it works around the allocator propagation issue.

I think we should leave the code as-is after consideration.
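For reference, the immediately-invoked-lambda pattern discussed above can be sketched in isolation. The names below (`setup_failed`, `build_ranges`, `build_ranges_translated`) are hypothetical stand-ins: `setup_failed` plays the role of the translated exception type and `build_ranges` the role of the setup step that may fail to allocate.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <new>
#include <vector>

// Hypothetical exception type used to report a failure of the setup phase.
struct setup_failed : std::exception {
  const char* what() const noexcept override { return "setup failed"; }
};

// Hypothetical setup step; `simulate_failure` stands in for an allocation failure.
std::vector<int> build_ranges(std::size_t count, bool simulate_failure) {
  if (simulate_failure)
    throw std::bad_alloc();
  return std::vector<int>(count);
}

// Initializes the result directly from an immediately-invoked lambda, so the
// try/catch covers only the setup step and no assignment between containers
// (and therefore no allocator propagation) is involved.
std::vector<int> build_ranges_translated(std::size_t count, bool simulate_failure) {
  return [&] {
    try {
      return build_ranges(count, simulate_failure);
    } catch (const std::bad_alloc&) {
      throw setup_failed();
    }
  }();
}
```

Anything that runs after this initialization, such as the dispatch itself, stays outside the `try` block and keeps its own exception behavior.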


std::destroy_n(__ptr, __count);

std::allocator<_Value>().deallocate(__ptr, __count);

};

// TODO: use __uninitialized_buffer

// TODO: allocate one element per worker instead of one element per chunk

unique_ptr<_Value[], decltype(__destroy)> __values(

std::allocator<_Value>().allocate(__partitions.__chunk_count_), __destroy);

// __dispatch_apply is noexcept

__libdispatch::__dispatch_apply(__partitions.__chunk_count_, [&](size_t __chunk) {

auto __this_chunk_size = __chunk == 0 ? __partitions.__first_chunk_size_ : __partitions.__chunk_size_;

auto __index =

__chunk == 0

? 0

: (__chunk * __partitions.__chunk_size_) + (__partitions.__first_chunk_size_ - __partitions.__chunk_size_);

ldionneUnsubmitted

Done

Until we re-introduce `__uninitialized_buffer`, we can do this instead:

// TODO: Use __uninitialized_buffer
auto __destroy = [=](_Value* __ptr){
  std::destroy_n(__ptr, __partitions.__chunk_count_);
  std::allocator<_Value>().deallocate(__ptr, __partitions.__chunk_count_);
};
unique_ptr<_Value, decltype(__destroy)> __values(std::allocator<_Value>().allocate(__partitions.__chunk_count_), __destroy);

// use __dispatch_apply(...)

auto __tmp = std::transform_reduce(__values, __values + __partitions.__chunk_count_, __init, __transform, __reduction);
return __tmp;


if (__this_chunk_size != 1) {

ldionneUnsubmitted

Done

Future optimization: We should allocate only one temporary value per worker thread, not one per chunk. Then we should pad the storage to make sure they all fall on different cache lines to avoid false sharing.

Can you add a TODO comment mentioning that?

std::__construct_at(

__values.get() + __chunk,

__reduction(__first + __index + 2,

ldionneUnsubmitted

Not Done

We need a test for this case -- i.e. we want to make sure that we throw an exception (and the right one) if we fail to allocate memory from the implementation of the GCD backend.


__first + __index + __this_chunk_size,

__combiner(__transform(__first + __index), __transform(__first + __index + 1))));

} else {

std::__construct_at(__values.get() + __chunk, __transform(__first + __index));

}

});

return std::__terminate_on_exception([&] {

ldionneUnsubmitted

Done

This doesn't work if you have only 1 element in a chunk (or if that's the case for the `__first_chunk_size`). We should have a test that exercises that. I guess the expected behavior in that case is that we'd just use `std::construct_at(__values + __index, __transform(*__first));`
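The single-element case can be illustrated with a serial sketch on `int` values. The helper names are hypothetical and the chunking policy (fixed-size chunks with a shorter last chunk) is an assumption made for brevity; the patch instead folds the remainder into the first chunk. The point is only that a chunk with one element must yield `transform(x)` directly, because there is no second element to combine with.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Reduces one non-empty chunk [first, first + size) of `in`.
template <class Transform, class Reduce>
int reduce_chunk(const std::vector<int>& in, std::size_t first, std::size_t size,
                 Transform transform, Reduce reduce) {
  assert(size >= 1);
  if (size == 1)
    return transform(in[first]); // nothing to combine with
  // Seed the accumulator from the first two elements, then fold in the rest.
  int acc = reduce(transform(in[first]), transform(in[first + 1]));
  for (std::size_t i = first + 2; i != first + size; ++i)
    acc = reduce(acc, transform(in[i]));
  return acc;
}

// Computes one partial result per chunk, then reduces the partials with `init`.
template <class Transform, class Reduce>
int chunked_transform_reduce(const std::vector<int>& in, std::size_t chunk_size, int init,
                             Transform transform, Reduce reduce) {
  assert(chunk_size > 0);
  std::vector<int> partials;
  for (std::size_t first = 0; first < in.size(); first += chunk_size) {
    const std::size_t size = std::min(chunk_size, in.size() - first);
    partials.push_back(reduce_chunk(in, first, size, transform, reduce));
  }
  return std::accumulate(partials.begin(), partials.end(), init, reduce);
}

int sum_of_squares(const std::vector<int>& in, std::size_t chunk_size) {
  return chunked_transform_reduce(
      in, chunk_size, 0, [](int x) { return x * x; }, [](int a, int b) { return a + b; });
}
```

With five elements and a chunk size of two, the last chunk holds a single element, which is the situation the comment asks a test to exercise.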


return std::reduce(

std::make_move_iterator(__values.get()),

std::make_move_iterator(__values.get() + __partitions.__chunk_count_),

std::move(__init),

ldionneUnsubmitted

Done

This should only be a `reduce`. This should be catchable if the transform returns a different type. Test coverage should be added for that.

__combiner);

});

}

// TODO: parallelize this

template <class _RandomAccessIterator, class _Comp, class _LeafSort>

_LIBCPP_HIDE_FROM_ABI void __parallel_stable_sort(

_RandomAccessIterator __first, _RandomAccessIterator __last, _Comp __comp, _LeafSort __leaf_sort) {

ldionneUnsubmitted

Done

auto __partitions = __libdispatch::__partition_chunks(__first, __last);

- unique_ptr<_Value> __values(::operator new(__partitions.__chunk_count_, align_val_t{alignof(_Value)}, nothrow));

+ unique_ptr<_Value[]> __values = std::make_unique_for_overwrite<_Value[]>(__partitions.__chunk_count_);

if (__values == nullptr)

Otherwise this code doesn't work as-is.

We also discussed using `unique_ptr<optional<_Value>[]>` which would destroy objects correctly in case an exception is thrown while constructing values below. This potentially has a non-negligible memory overhead. No conclusion yet, but I think I would like to either make this code locally correct or have a nice way of expressing the fact that we don't care about running the destructors of `_Value` in case of an exception because we call terminate anyway.

ldionneUnsubmitted

Done

I think we should introduce a helper class for this. It seems like something we might have other uses for. Something like:

template <class _Tp, class _Alloc = std::allocator<_Tp>>
struct __uninitialized_buffer {
  explicit __uninitialized_buffer(size_t __n);
  _Tp* __get();
  ~__uninitialized_buffer(); // destroys the elements and frees the memory
};
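One possible shape for such a helper is sketched below. It is a variation on the interface proposed above, not the class that was eventually added: it hard-codes `std::allocator`, and it tracks how many elements have been constructed (via a hypothetical `emplace_next`) so that the destructor only destroys live objects. All names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// Owns raw storage for `capacity` objects of type T. Elements are constructed
// one at a time; the destructor destroys only those that were constructed.
template <class T>
class uninitialized_buffer {
public:
  explicit uninitialized_buffer(std::size_t capacity)
      : data_(std::allocator<T>().allocate(capacity)), capacity_(capacity) {}

  uninitialized_buffer(const uninitialized_buffer&)            = delete;
  uninitialized_buffer& operator=(const uninitialized_buffer&) = delete;

  ~uninitialized_buffer() {
    std::destroy_n(data_, size_);                      // destroy constructed elements only
    std::allocator<T>().deallocate(data_, capacity_);  // then free the storage
  }

  // Constructs the next element in place. If T's constructor throws, size_ is
  // not incremented, so the destructor never touches an unconstructed slot.
  template <class... Args>
  T& emplace_next(Args&&... args) {
    assert(size_ < capacity_);
    T* slot = ::new (static_cast<void*>(data_ + size_)) T(std::forward<Args>(args)...);
    ++size_;
    return *slot;
  }

  T* get() { return data_; }
  std::size_t size() const { return size_; }

private:
  T* data_;
  std::size_t capacity_;
  std::size_t size_ = 0;
};

// Self-check: copies of a shared_ptr stored in the buffer are released when
// the buffer goes out of scope, even though only 2 of 4 slots were used.
bool buffer_destroys_constructed_elements() {
  auto token = std::make_shared<int>(7);
  {
    uninitialized_buffer<std::shared_ptr<int>> buffer(4);
    buffer.emplace_next(token);
    buffer.emplace_next(token);
    if (buffer.size() != 2 || token.use_count() != 3)
      return false;
  }
  return token.use_count() == 1;
}
```

Sequential construction keeps the bookkeeping trivial. A parallel backend that constructs elements out of order would need a different scheme, or the "we terminate anyway" argument from the discussion above.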


ldionneUnsubmitted

Done

Can we instead run the `sort` in parallel by chunking and then run a serial merge? That's not too hard to implement and it's much better than doing everything serial. We can leave a TODO for parallelizing the merge since we don't have a clear way to do it.

philnikAuthorUnsubmitted

Done

I'd rather do that in a follow-up. It's not like we have to make this implementation perfect from the start.


__leaf_sort(__first, __last, __comp);

}

_LIBCPP_HIDE_FROM_ABI inline void __cancel_execution() {}

} // namespace __libdispatch

ldionneUnsubmitted

Done

This totally deserves a comment explaining why we're terminating here, when normally we always terminate from `terminate_on_exception`. Actually, we could also use `terminate_on_exception` here instead of `exception_guard`.

} // namespace __par_backend

_LIBCPP_END_NAMESPACE_STD

#endif // !defined(_LIBCPP_HAS_NO_INCOMPLETE_PSTL) && _LIBCPP_STD_VER >= 17

#endif // _LIBCPP___ALGORITHM_PSTL_BACKENDS_CPU_BACKENDS_LIBDISPATCH_H

libcxx/include/__algorithm/pstl_backends/cpu_backends/transform_reduce.h

Show First 20 Lines • Show All 158 Lines • ▼ Show 20 Lines

if constexpr (__is_parallel_execution_policy_v<_ExecutionPolicy> &&

__has_random_access_iterator_category<_ForwardIterator>::value) {

return std::__terminate_on_exception([&] {

return __par_backend::__parallel_transform_reduce(

std::move(__first),

std::move(__last),

[__transform](_ForwardIterator __iter) { return __transform(*__iter); },

std::move(__init),

__reduce,

[=](_ForwardIterator __brick_first, _ForwardIterator __brick_last, _Tp __brick_init) {

[__transform, __reduce](auto __brick_first, auto __brick_last, _Tp __brick_init) {

ldionneUnsubmitted

Done

Can we add a test that fails here? This is a double-move issue that was present in the code before this patch. I think it would make sense to split this one off.


return std::__pstl_transform_reduce<__remove_parallel_policy_t<_ExecutionPolicy>>(

ldionneUnsubmitted

Done

__reduce,

- [=](auto __brick_first, auto __brick_last, _Tp __brick_init) {

+ [__reduce, __transform](auto __brick_first, auto __brick_last, _Tp __brick_init) {

return std::__pstl_transform_reduce<__remove_parallel_policy_t<_ExecutionPolicy>>(

Let's capture by name since this was *so* not obvious.


__cpu_backend_tag{},

std::move(__brick_first),

std::move(__brick_last),

std::move(__brick_init),

std::move(__reduce),

std::move(__transform));

});

Show All 18 Lines

libcxx/include/__config_site.in

	Show All 27 Lines
#cmakedefine _LIBCPP_HAS_NO_RANDOM_DEVICE
#cmakedefine _LIBCPP_HAS_NO_LOCALIZATION
#cmakedefine _LIBCPP_HAS_NO_WIDE_CHARACTERS
#cmakedefine01 _LIBCPP_ENABLE_ASSERTIONS_DEFAULT

// PSTL backends
#cmakedefine _LIBCPP_PSTL_CPU_BACKEND_SERIAL
#cmakedefine _LIBCPP_PSTL_CPU_BACKEND_THREAD
+ #cmakedefine _LIBCPP_PSTL_CPU_BACKEND_LIBDISPATCH

// Hardening.
#cmakedefine01 _LIBCPP_ENABLE_HARDENED_MODE_DEFAULT
#cmakedefine01 _LIBCPP_ENABLE_DEBUG_MODE_DEFAULT

// __USE_MINGW_ANSI_STDIO gets redefined on MinGW
#ifdef __clang__
# pragma clang diagnostic push
	Show All 11 Lines

libcxx/include/__numeric/reduce.h

// -*- C++ -*-
//===----------------------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef _LIBCPP___NUMERIC_REDUCE_H
#define _LIBCPP___NUMERIC_REDUCE_H

#include <__config>
#include <__functional/operations.h>
#include <__iterator/iterator_traits.h>
+ #include <__utility/move.h>

#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
# pragma GCC system_header
#endif

_LIBCPP_BEGIN_NAMESPACE_STD

#if _LIBCPP_STD_VER >= 17
template <class _InputIterator, class _Tp, class _BinaryOp>
_LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR_SINCE_CXX20 _Tp reduce(_InputIterator __first, _InputIterator __last,
_Tp __init, _BinaryOp __b) {
for (; __first != __last; ++__first)
- __init = __b(__init, *__first);
+ __init = __b(std::move(__init), *__first);
return __init;
}

template <class _InputIterator, class _Tp>
_LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR_SINCE_CXX20 _Tp reduce(_InputIterator __first, _InputIterator __last,
_Tp __init) {
return _VSTD::reduce(__first, __last, __init, _VSTD::plus<>());
}
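The change to `reduce` above moves the accumulator into the binary operation on every step. A small stand-alone sketch of the same loop, using hypothetical names (`fold_moving`, `join_words`), shows the effect with `std::string`, where moving lets the operation reuse the accumulator's buffer instead of copying it:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Same loop shape as the patched reduce: the accumulator is moved into the
// operation and then assigned the operation's result.
template <class Iterator, class T, class BinaryOp>
T fold_moving(Iterator first, Iterator last, T init, BinaryOp op) {
  for (; first != last; ++first)
    init = op(std::move(init), *first);
  return init;
}

// Concatenates all words. The lambda takes the accumulator by value, so the
// moved-in string is extended in place and moved back out on return.
std::string join_words(const std::vector<std::string>& words) {
  return fold_moving(words.begin(), words.end(), std::string{},
                     [](std::string acc, const std::string& word) {
                       acc += word;
                       return acc;
                     });
}
```

Assigning to `init` right after moving from it is well defined: a moved-from `std::string` is in a valid state and is immediately overwritten by the result of the operation.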
	Show All 11 Lines

libcxx/include/__utility/terminate_on_exception.h

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef _LIBCPP___UTILITY_TERMINATE_ON_EXCEPTION_H			#ifndef _LIBCPP___UTILITY_TERMINATE_ON_EXCEPTION_H
	#define _LIBCPP___UTILITY_TERMINATE_ON_EXCEPTION_H			#define _LIBCPP___UTILITY_TERMINATE_ON_EXCEPTION_H

	#include <__config>			#include <__config>
	#include <__exception/terminate.h>			#include <__exception/terminate.h>
				#include <new>

	#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)			#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
	# pragma GCC system_header			# pragma GCC system_header
	#endif			#endif

	#if _LIBCPP_STD_VER >= 17			#if _LIBCPP_STD_VER >= 17

	_LIBCPP_BEGIN_NAMESPACE_STD			_LIBCPP_BEGIN_NAMESPACE_STD

	# ifndef _LIBCPP_HAS_NO_EXCEPTIONS			# ifndef _LIBCPP_HAS_NO_EXCEPTIONS

	template <class _Func>			template <class _Func>
	_LIBCPP_HIDE_FROM_ABI auto __terminate_on_exception(_Func __func) {			_LIBCPP_HIDE_FROM_ABI auto __terminate_on_exception(_Func __func) {
	try {			try {
	return __func();			return __func();
	} catch (...) {			} catch (...) {
	std::terminate();			std::terminate();
				ldionne (Done): I'm not sure this is correct as-is. Let's say we have a predicate to `find_if` that happens to use `__pstl_merge` in its implementation, and let's say the "setup" code for `__pstl_merge` fails to allocate and throws `__pstl_bad_alloc`. It will be caught by `find_if`'s `__terminate_on_exception` wrapper and be rethrown. Right? If so, then we'd be incorrectly propagating the exception. Instead, I think we might want to use `__terminate_on_exception` in the backend implementation only around user code (not setup code), so we'd remove it from the various functions that take `__cpu_backend_tag`. It would also not need to handle `__pstl_bad_alloc` anymore because it would never wrap "setup" code. This should be done in a separate patch, and we'll likely need some tests as well.
				philnik (Author, Done): See D154238.
	}			}
	}			}

	# else // _LIBCPP_HAS_NO_EXCEPTIONS			# else // _LIBCPP_HAS_NO_EXCEPTIONS

	template <class _Func>			template <class _Func>
	_LIBCPP_HIDE_FROM_ABI auto __terminate_on_exception(_Func __func) {			_LIBCPP_HIDE_FROM_ABI auto __terminate_on_exception(_Func __func) {
	return __func();			return __func();
	Show All 9 Lines
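The suggestion in the inline comment above (wrap only user code, not setup code) can be illustrated with a small standalone sketch. This is not the libc++ implementation: `sketch_terminate_on_exception` and `sketch_copy_if` are hypothetical names invented for illustration, and a plain `std::vector` allocation stands in for the backend's setup step.

```cpp
#include <cassert>
#include <exception>
#include <vector>

// Hypothetical stand-in for the wrapper under review: runs `func` and
// terminates if it exits via an exception.
template <class Func>
auto sketch_terminate_on_exception(Func func) {
  try {
    return func();
  } catch (...) {
    std::terminate();
  }
}

// Hypothetical algorithm. The setup step (reserving the output buffer) is NOT
// wrapped, so an allocation failure propagates to the caller. Only the call
// into the user-provided predicate is wrapped.
template <class T, class Pred>
std::vector<T> sketch_copy_if(const std::vector<T>& in, Pred pred) {
  std::vector<T> out;
  out.reserve(in.size()); // setup code: may throw std::bad_alloc, left unwrapped

  for (const T& v : in) {
    // user code: an exception escaping `pred` terminates the program
    const bool keep = sketch_terminate_on_exception([&] { return pred(v); });
    if (keep)
      out.push_back(v);
  }
  return out;
}
```

With this split, a nested algorithm call inside `pred` that fails during its own setup would surface as an exception from `pred` and terminate, while a setup failure in the outer algorithm would still reach the caller.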

libcxx/include/module.modulemap.in

Show First 20 Lines • Show All 338 Lines • ▼ Show 20 Lines	module __algorithm {
}		}
module pstl_backends_cpu_backends_fill { private header "__algorithm/pstl_backends/cpu_backends/fill.h" }		module pstl_backends_cpu_backends_fill { private header "__algorithm/pstl_backends/cpu_backends/fill.h" }
module pstl_backends_cpu_backends_find_if {		module pstl_backends_cpu_backends_find_if {
private header "__algorithm/pstl_backends/cpu_backends/find_if.h"		private header "__algorithm/pstl_backends/cpu_backends/find_if.h"
}		}
module pstl_backends_cpu_backends_for_each {		module pstl_backends_cpu_backends_for_each {
private header "__algorithm/pstl_backends/cpu_backends/for_each.h"		private header "__algorithm/pstl_backends/cpu_backends/for_each.h"
}		}
		module pstl_backends_cpu_backends_libdispatch {
		private header "__algorithm/pstl_backends/cpu_backends/libdispatch.h"
		}
module pstl_backends_cpu_backends_merge {		module pstl_backends_cpu_backends_merge {
private header "__algorithm/pstl_backends/cpu_backends/merge.h"		private header "__algorithm/pstl_backends/cpu_backends/merge.h"
}		}
module pstl_backends_cpu_backends_serial {		module pstl_backends_cpu_backends_serial {
private textual header "__algorithm/pstl_backends/cpu_backends/serial.h"		private textual header "__algorithm/pstl_backends/cpu_backends/serial.h"
}		}
module pstl_backends_cpu_backends_stable_sort {		module pstl_backends_cpu_backends_stable_sort {
private header "__algorithm/pstl_backends/cpu_backends/stable_sort.h"		private header "__algorithm/pstl_backends/cpu_backends/stable_sort.h"
▲ Show 20 Lines • Show All 1,718 Lines • Show Last 20 Lines

libcxx/src/CMakeLists.txt

Show First 20 Lines • Show All 312 Lines • ▼ Show 20 Lines

# Add a meta-target for both libraries.

add_custom_target(cxx DEPENDS ${LIBCXX_BUILD_TARGETS})

set(LIBCXX_EXPERIMENTAL_SOURCES

experimental/memory_resource.cpp

)

if (LIBCXX_PSTL_CPU_BACKEND STREQUAL "libdispatch")

list(APPEND LIBCXX_EXPERIMENTAL_SOURCES pstl/libdispatch.cpp)

endif()

ldionneUnsubmitted

Done

if (LIBCXX_PSTL_CPU_BACKEND STREQUAL "libdispatch")

- set(LIBCXX_EXPERIMENTAL_SOURCES ${LIBCXX_EXPERIMENTAL_SOURCES}

- pstl/libdispatch.cpp)

+ list(APPEND LIBCXX_EXPERIMENTAL_SOURCES pstl/libdispatch.cpp)

endif()

ldionneUnsubmitted

Done

I don't think this was done.

add_library(cxx_experimental STATIC ${LIBCXX_EXPERIMENTAL_SOURCES})

target_link_libraries(cxx_experimental PUBLIC cxx-headers)

if (LIBCXX_ENABLE_SHARED)

target_link_libraries(cxx_experimental PRIVATE cxx_shared)

else()

target_link_libraries(cxx_experimental PRIVATE cxx_static)

endif()

▲ Show 20 Lines • Show All 87 Lines • Show Last 20 Lines

libcxx/src/pstl/libdispatch.cpp

This file was added.

//===----------------------------------------------------------------------===//

ldionneUnsubmitted

Not Done

We need to figure out how we're going to test the backend. Tests should be added in this patch.

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#include <__algorithm/min.h>

#include <__algorithm/pstl_backends/cpu_backends/libdispatch.h>

#include <__config>

#include <dispatch/dispatch.h>

ldionneUnsubmitted

Done

We might want to move this system header below the library's own includes, I think that's what we usually do.

#include <memory_resource>

#include <thread>

_LIBCPP_BEGIN_NAMESPACE_STD

namespace __par_backend::inline __libdispatch {

pmr::memory_resource* __get_memory_resource() {

static std::pmr::synchronized_pool_resource pool{pmr::new_delete_resource()};

return &pool;

}

void __dispatch_apply(size_t chunk_count, void* context, void (*func)(void* context, size_t chunk)) noexcept {

::dispatch_apply_f(chunk_count, DISPATCH_APPLY_AUTO, context, func);

}

__chunk_partitions __partition_chunks(ptrdiff_t element_count) {

__chunk_partitions partitions;

ldionneUnsubmitted

Done

Let's say we want 3x to 100x as many "tasks" as we have cores. Let's say for now that we always want 50x as many, just to pick something. Then we could also do:

const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
const auto chunk_size = element_count / target_number_of_chunks; // roughly

Then we could also add some logic like not having chunks smaller than X elements or something like that. Or we could make the 50x scale from 3x to 100x based on the number of elements we're processing.

At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on __default_chunk_size (which itself is 100% thin air).

You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores, and starts "somewhere". I think I agree.

Another observation is that we should probably err in favour of spawning more tasks than spawning fewer tasks. At the end of the day, the user did request the parallel version of the algorithm. If they use std::for_each(std::execution::par) with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores, and in the worst case those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently and the user made a poor decision to try and parallelize that part of their code.

In summary, I would suggest this scheme:

We spawn at least min(element_count, thread::hardware_concurrency()) tasks always.
When element_count > thread::hardware_concurrency(), we increase logarithmically as a function of element_count with an asymptote located at roughly 100 * thread::hardware_concurrency().

ldionneUnsubmitted

Done

For small numbers of elements (< cores), we could create one task for each element. That's the only way to nicely handle the case where processing each element is really heavy and we really want to parallelize. Also it should be rare that users use std::execution::par to sort a vector of 3 ints. If they do, we have no way to tell and I'd argue it's not unreasonable for us to spawn 3 tasks.
For medium numbers of elements (say < 500): cores + ((n-cores) / cores). This gives us a smooth transition from the previous size and then it basically grows linearly with n.
For larger numbers: Assuming 8 cores, we have: log(1.01, sqrt(n)) = 800 (aka 100.499 * log(sqrt(n)) according to Wolfram Alpha) requires n to be roughly 8.2 million elements. That means that we'd create 100 * 8 tasks at 8.2 million elements, and then it grows really slowly. That doesn't seem unreasonable.

This is not very scientific, I'm afraid, but this seems to be somewhat reasonable while playing around with a couple of values.

const ptrdiff_t cores = std::max(1u, thread::hardware_concurrency());

auto small = [](ptrdiff_t n) { return n; };
auto medium = [&](ptrdiff_t n) { return cores + ((n - cores) / cores); };

// explain where this comes from, in particular that this is an approximation of `log(1.01, sqrt(n))` which seemed to be reasonable for `n` larger than 500 and tops at 800 tasks for n ~ 8 million
auto large = [](ptrdiff_t n) { return static_cast<ptrdiff_t>(100.499 * std::log(std::sqrt(n))); };

if (n < cores)    return small(n);
else if (n < 500) return medium(n);
else              return std::min(medium(n), large(n)); // provide a "smooth" transition

The above assumes that we have 8 cores, but everything kind of needs to be adjusted for different numbers of cores. I suggest we start with this just to unblock the patch, but leave a big fat comment explaining this needs to be revisited. We can take a look with the dispatch folks, or if anyone observing this review has an idea, please feel free to chime in.
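To make the three regimes easier to experiment with, here is a minimal standalone sketch of the heuristic described above. It is an illustration only: `sketch_chunk_count` is a hypothetical name, and the core count is passed in as a parameter (an assumption made here so the function can be evaluated for any core count) rather than queried from `thread::hardware_concurrency()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper mirroring the small/medium/large scheme from the comment.
std::ptrdiff_t sketch_chunk_count(std::ptrdiff_t n, std::ptrdiff_t cores) {
  assert(n >= 0 && cores >= 1);

  // Small inputs: one task per element.
  if (n < cores)
    return n;

  // Medium inputs: equals `cores` at n == cores, then grows roughly as n / cores.
  const std::ptrdiff_t medium = cores + (n - cores) / cores;
  if (n < 500)
    return medium;

  // Large inputs: 100.499 * ln(sqrt(n)) approximates the base-1.01 logarithm
  // of sqrt(n), since 1 / ln(1.01) is about 100.499.
  const auto large =
      static_cast<std::ptrdiff_t>(100.499 * std::log(std::sqrt(static_cast<double>(n))));

  // Taking the minimum gives the "smooth" transition between the two curves.
  return std::min(medium, large);
}
```

Under these assumptions and with 8 cores, 100 elements map to 19 chunks and 500 elements to 69, and the count only approaches 800 at around 8.2 million elements, consistent with the estimate in the comment.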

partitions.__chunk_count_ = [&] {

ptrdiff_t cores = std::max(1u, thread::hardware_concurrency());

ldionneUnsubmitted

Done

Can you add a TODO comment to use the number of cores that libdispatch will actually use instead of the total number of cores on the system?

auto medium = [&](ptrdiff_t n) { return cores + ((n - cores) / cores); };

// This is an approximation of `log(1.01, sqrt(n))` which seemes to be reasonable for `n` larger than 500 and tops

// at 800 tasks for n ~ 8 million

auto large = [](ptrdiff_t n) { return static_cast<ptrdiff_t>(100.499 * std::log(std::sqrt(n))); };

ldionneUnsubmitted

Done

auto medium = [&](ptrdiff_t n) { return cores + ((n - cores) / cores); };

- // This is an approximation of `log(1.01, sqrt(n))` which seemes to be reasonable for `n` larger than 500 and tops

+ // This is an approximation of `log(1.01, sqrt(n))` which seems to be reasonable for `n` larger than 500 and tops

// at 800 tasks for n ~ 8 million

auto large = [](ptrdiff_t n) { return static_cast<ptrdiff_t>(100.499 * std::log(std::sqrt(n))); };

if (element_count < cores)

return element_count;

else if (element_count < 500)

return medium(element_count);

else

return std::min(medium(element_count), large(element_count)); // provide a "smooth" transition

}();

partitions.__chunk_size_ = element_count / partitions.__chunk_count_;

partitions.__first_chunk_size_ = partitions.__chunk_size_;

const ptrdiff_t leftover_item_count = element_count - (partitions.__chunk_count_ * partitions.__chunk_size_);

if (leftover_item_count == 0)

return partitions;

if (leftover_item_count == partitions.__chunk_size_) {

partitions.__chunk_count_ += 1;

return partitions;

}

const ptrdiff_t n_extra_items_per_chunk = leftover_item_count / partitions.__chunk_count_;

const ptrdiff_t n_final_leftover_items = leftover_item_count - (n_extra_items_per_chunk * partitions.__chunk_count_);

partitions.__chunk_size_ += n_extra_items_per_chunk;

partitions.__first_chunk_size_ = partitions.__chunk_size_ + n_final_leftover_items;

return partitions;

}

// NOLINTNEXTLINE(llvm-namespace-comment) // This is https://llvm.org/PR56804

} // namespace __par_backend::inline __libdispatch

_LIBCPP_END_NAMESPACE_STD
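The size arithmetic at the end of `__partition_chunks` can be checked in isolation. The following is a simplified sketch, not the code under review: `sketch_partitions` and `sketch_partition` are hypothetical names, the chunk count is taken as a parameter, and the special case that adds an extra chunk when the leftover equals the chunk size is deliberately left out. It relies on the fact that the remainder of an integer division is always smaller than the divisor, so the whole remainder can be folded into the first chunk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the three fields used by the review's partition type.
struct sketch_partitions {
  std::ptrdiff_t chunk_count;
  std::ptrdiff_t first_chunk_size;
  std::ptrdiff_t chunk_size;
};

// Splits `n` elements into `chunk_count` chunks. All chunks but the first have
// `chunk_size` elements; the first chunk also absorbs the remainder.
sketch_partitions sketch_partition(std::ptrdiff_t n, std::ptrdiff_t chunk_count) {
  assert(n >= 1 && chunk_count >= 1 && chunk_count <= n);

  sketch_partitions p;
  p.chunk_count      = chunk_count;
  p.chunk_size       = n / chunk_count;
  p.first_chunk_size = p.chunk_size + n % chunk_count;

  // Same invariant that the chunk_partitions test asserts: the chunks cover
  // every element exactly once.
  assert((p.chunk_count - 1) * p.chunk_size + p.first_chunk_size == n);
  return p;
}
```

For example, splitting 10 elements into 3 chunks yields a first chunk of 4 elements followed by two chunks of 3.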

libcxx/test/libcxx/algorithms/pstl.libdispatch.chunk_partitions.pass.cpp

This file was added.

//===----------------------------------------------------------------------===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// <algorithm>

// REQUIRES: libcpp-pstl-cpu-backend-libdispatch

// ADDITIONAL_COMPILE_FLAGS: -Wno-private-header

// __chunk_partitions __partition_chunks(ptrdiff_t);

#include <__algorithm/pstl_backends/cpu_backends/libdispatch.h>

ldionneUnsubmitted

Done

#include <__algorithm/pstl_backends/cpu_backends/libdispatch.h>

#include <cassert>

+ #include <cstddef>

int main(int, char**) {

#include <cassert>

#include <cstddef>

ldionneUnsubmitted

Done

int main(int, char**) {

- for (ptrdiff_t i = 0; i != 2ll << 20; ++i) {

+ for (std::ptrdiff_t i = 0; i != 2ll << 20; ++i) {

auto chunks = std::__par_backend::__libdispatch::__partition_chunks(i);

int main(int, char**) {

for (std::ptrdiff_t i = 0; i != 2ll << 20; ++i) {

auto chunks = std::__par_backend::__libdispatch::__partition_chunks(i);

assert(chunks.__chunk_count_ <= i);

assert((chunks.__chunk_count_ - 1) * chunks.__chunk_size_ + chunks.__first_chunk_size_ == i);

}

return 0;

}

libcxx/test/std/algorithms/alg.sorting/alg.merge/pstl.merge.pass.cpp

Show First 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	void operator()(Policy&& policy) {
int a[] = {1, 3, 5, 7, 9};		int a[] = {1, 3, 5, 7, 9};
int b[] = {2, 4, 6, 8, 10};		int b[] = {2, 4, 6, 8, 10};
std::array<int, std::size(a) + std::size(b)> out;		std::array<int, std::size(a) + std::size(b)> out;
std::merge(		std::merge(
policy, Iter1(std::begin(a)), Iter1(std::end(a)), Iter2(std::begin(b)), Iter2(std::end(b)), std::begin(out));		policy, Iter1(std::begin(a)), Iter1(std::end(a)), Iter2(std::begin(b)), Iter2(std::end(b)), std::begin(out));
assert((out == std::array{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}));		assert((out == std::array{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}));
}		}

		{ // check that it works with both ranges being empty
		std::array<int, 0> a;
		std::array<int, 0> b;
		std::array<int, std::size(a) + std::size(b)> out;
		std::merge(
		policy, Iter1(std::begin(a)), Iter1(std::end(a)), Iter2(std::begin(b)), Iter2(std::end(b)), std::begin(out));
		}
{ // check that it works with the first range being empty		{ // check that it works with the first range being empty
std::array<int, 0> a;		std::array<int, 0> a;
int b[] = {2, 4, 6, 8, 10};		int b[] = {2, 4, 6, 8, 10};
std::array<int, std::size(a) + std::size(b)> out;		std::array<int, std::size(a) + std::size(b)> out;
std::merge(		std::merge(
policy, Iter1(std::begin(a)), Iter1(std::end(a)), Iter2(std::begin(b)), Iter2(std::end(b)), std::begin(out));		policy, Iter1(std::begin(a)), Iter1(std::end(a)), Iter2(std::begin(b)), Iter2(std::end(b)), std::begin(out));
assert((out == std::array{2, 4, 6, 8, 10}));		assert((out == std::array{2, 4, 6, 8, 10}));
}		}
Show All 31 Lines	void operator()(Policy&& policy) {
int i = 1;		int i = 1;
for (auto& e : b) {		for (auto& e : b) {
e = i;		e = i;
i += 2;		i += 2;
}		}
}		}

std::vector<int> out(std::size(a) + std::size(b));		std::vector<int> out(std::size(a) + std::size(b));
std::merge(policy,		std::merge(policy,
Iter1(a.data()),		Iter1(a.data()),
Iter1(a.data() + a.size()),		Iter1(a.data() + a.size()),
Iter2(b.data()),		Iter2(b.data()),
Iter2(b.data() + b.size()),		Iter2(b.data() + b.size()),
std::begin(out));		std::begin(out));
		ldionne (Done): Let's do this as a separate patch.
		philnik (Author, Done): See D154546.
std::vector<int> expected(200);		std::vector<int> expected(200);
std::iota(expected.begin(), expected.end(), 0);		std::iota(expected.begin(), expected.end(), 0);
assert(std::equal(out.begin(), out.end(), expected.begin()));		assert(std::equal(out.begin(), out.end(), expected.begin()));
}		}

{ // check that the predicate is used		{ // check that the predicate is used
int a[] = {10, 9, 8, 7};		int a[] = {10, 9, 8, 7};
int b[] = {8, 4, 3};		int b[] = {8, 4, 3};
Show All 24 Lines

libcxx/test/std/algorithms/numeric.ops/transform.reduce/pstl.transform_reduce.binary.pass.cpp

	Show All 32 Lines
	#include <vector>			#include <vector>

	#include "MoveOnly.h"			#include "MoveOnly.h"
	#include "test_execution_policies.h"			#include "test_execution_policies.h"
	#include "test_iterators.h"			#include "test_iterators.h"
	#include "test_macros.h"			#include "test_macros.h"
	#include "type_algorithms.h"			#include "type_algorithms.h"

				template <class T>
				struct constructible_from {
				T v_;

				explicit constructible_from(T v) : v_(v) {}

				friend constructible_from operator+(constructible_from lhs, constructible_from rhs) {
				return constructible_from{lhs.get() + rhs.get()};
				}

				T get() const { return v_; }
				};

	template <class Iter1, class Iter2, class ValueT>			template <class Iter1, class Iter2, class ValueT>
	struct Test {			struct Test {
	template <class Policy>			template <class Policy>
	void operator()(Policy&& policy) {			void operator()(Policy&& policy) {
	for (const auto& pair : {std::pair{0, 34}, {1, 33}, {2, 30}, {100, 313434}, {350, 14046934}}) {			for (const auto& pair : {std::pair{0, 34}, {1, 40}, {2, 48}, {100, 10534}, {350, 124284}}) {
	auto [size, expected] = pair;			auto [size, expected] = pair;
	std::vector<int> a(size);			std::vector<int> a(size);
	std::vector<int> b(size);			std::vector<int> b(size);
	for (int i = 0; i != size; ++i) {			for (int i = 0; i != size; ++i) {
	a[i] = i + 1;			a[i] = i + 1;
	b[i] = i - 4;			b[i] = i + 4;
	}			}

	decltype(auto) ret = std::transform_reduce(			decltype(auto) ret = std::transform_reduce(
	policy,			policy,
	Iter1(std::data(a)),			Iter1(std::data(a)),
	Iter1(std::data(a) + std::size(a)),			Iter1(std::data(a) + std::size(a)),
	Iter2(std::data(b)),			Iter2(std::data(b)),
	ValueT(34),			ValueT(34),
	[](ValueT i, ValueT j) { return i + j + 3; },			std::plus{},
	[](ValueT i, ValueT j) { return i * j; });			[](ValueT i, ValueT j) { return i + j + 1; });
	static_assert(std::is_same_v<decltype(ret), ValueT>);			static_assert(std::is_same_v<decltype(ret), ValueT>);
	assert(ret == expected);			assert(ret == expected);
	}			}

	for (const auto& pair : {std::pair{0, 34}, {1, 30}, {2, 24}, {100, 313134}, {350, 14045884}}) {			for (const auto& pair : {std::pair{0, 34}, {1, 30}, {2, 24}, {100, 313134}, {350, 14045884}}) {
	auto [size, expected] = pair;			auto [size, expected] = pair;
	std::vector<int> a(size);			std::vector<int> a(size);
	std::vector<int> b(size);			std::vector<int> b(size);
	for (int i = 0; i != size; ++i) {			for (int i = 0; i != size; ++i) {
	a[i] = i + 1;			a[i] = i + 1;
	b[i] = i - 4;			b[i] = i - 4;
	}			}

	decltype(auto) ret = std::transform_reduce(			decltype(auto) ret = std::transform_reduce(
	policy, Iter1(std::data(a)), Iter1(std::data(a) + std::size(a)), Iter2(std::data(b)), 34);			policy, Iter1(std::data(a)), Iter1(std::data(a) + std::size(a)), Iter2(std::data(b)), 34);
	static_assert(std::is_same_v<decltype(ret), int>);			static_assert(std::is_same_v<decltype(ret), int>);
	assert(ret == expected);			assert(ret == expected);
	}			}
				{
				int a[] = {1, 2, 3, 4, 5, 6, 7, 8};
				int b[] = {8, 7, 6, 5, 4, 3, 2, 1};

				auto ret = std::transform_reduce(
				policy,
				Iter1(std::begin(a)),
				Iter1(std::end(a)),
				Iter2(std::begin(b)),
				constructible_from<int>{0},
				std::plus{},
				[](int i, int j) { return constructible_from<int>{i + j}; });
				assert(ret.get() == 72);
				}
	}			}
	};			};

	int main(int, char**) {			int main(int, char**) {
	types::for_each(			types::for_each(
	types::forward_iterator_list<int*>{}, types::apply_type_identity{[](auto v) {			types::forward_iterator_list<int*>{}, types::apply_type_identity{[](auto v) {
	using Iter2 = typename decltype(v)::type;			using Iter2 = typename decltype(v)::type;
	types::for_each(			types::for_each(
	Show All 10 Lines
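The updated expected values in the first loop of this test (34, 40, 48, 10534, 124284) can be cross-checked against a sequential reference. The sketch below is an illustration with hypothetical helper names (`sketch_reference`, `sketch_closed_form`). It uses the non-policy overload of `std::transform_reduce` and the same inputs as the right-hand side of the diff: `a[i] = i + 1`, `b[i] = i + 4`, `std::plus{}` as the reduction, and `i + j + 1` as the transformation.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sequential reference for the updated test inputs.
int sketch_reference(int size) {
  std::vector<int> a(size);
  std::vector<int> b(size);
  for (int i = 0; i != size; ++i) {
    a[i] = i + 1;
    b[i] = i + 4;
  }
  return std::transform_reduce(
      a.begin(), a.end(), b.begin(), 34, std::plus{}, [](int i, int j) { return i + j + 1; });
}

// Closed form: each transformed term is (i + 1) + (i + 4) + 1 = 2i + 6, so the
// sum over i in [0, size) is size * (size - 1) + 6 * size = size^2 + 5 * size.
int sketch_closed_form(int size) { return 34 + size * size + 5 * size; }
```

Both helpers agree with the table in the test for sizes 0, 1, 2, 100 and 350.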

libcxx/utils/data/ignore_format.txt

	Show First 20 Lines • Show All 547 Lines • ▼ Show 20 Lines
	libcxx/src/iostream.cpp			libcxx/src/iostream.cpp
	libcxx/src/legacy_debug_handler.cpp			libcxx/src/legacy_debug_handler.cpp
	libcxx/src/locale.cpp			libcxx/src/locale.cpp
	libcxx/src/memory.cpp			libcxx/src/memory.cpp
	libcxx/src/mutex.cpp			libcxx/src/mutex.cpp
	libcxx/src/mutex_destructor.cpp			libcxx/src/mutex_destructor.cpp
	libcxx/src/new.cpp			libcxx/src/new.cpp
	libcxx/src/optional.cpp			libcxx/src/optional.cpp
				libcxx/src/pstl/libdispatch.cpp
	libcxx/src/random.cpp			libcxx/src/random.cpp
	libcxx/src/random_shuffle.cpp			libcxx/src/random_shuffle.cpp
	libcxx/src/regex.cpp			libcxx/src/regex.cpp
	libcxx/src/stdexcept.cpp			libcxx/src/stdexcept.cpp
	libcxx/src/std_stream.h			libcxx/src/std_stream.h
	libcxx/src/string.cpp			libcxx/src/string.cpp
	libcxx/src/strstream.cpp			libcxx/src/strstream.cpp
	libcxx/src/support/ibm/mbsnrtowcs.cpp			libcxx/src/support/ibm/mbsnrtowcs.cpp
	Show All 23 Lines

libcxx/utils/libcxx/test/features.py

Show First 20 Lines • Show All 301 Lines • ▼ Show 20 Lines	macros = {
"_LIBCPP_HAS_THREAD_API_PTHREAD": "libcpp-has-thread-api-pthread",		"_LIBCPP_HAS_THREAD_API_PTHREAD": "libcpp-has-thread-api-pthread",
"_LIBCPP_NO_VCRUNTIME": "libcpp-no-vcruntime",		"_LIBCPP_NO_VCRUNTIME": "libcpp-no-vcruntime",
"_LIBCPP_ABI_VERSION": "libcpp-abi-version",		"_LIBCPP_ABI_VERSION": "libcpp-abi-version",
"_LIBCPP_HAS_NO_FILESYSTEM": "no-filesystem",		"_LIBCPP_HAS_NO_FILESYSTEM": "no-filesystem",
"_LIBCPP_HAS_NO_RANDOM_DEVICE": "no-random-device",		"_LIBCPP_HAS_NO_RANDOM_DEVICE": "no-random-device",
"_LIBCPP_HAS_NO_LOCALIZATION": "no-localization",		"_LIBCPP_HAS_NO_LOCALIZATION": "no-localization",
"_LIBCPP_HAS_NO_WIDE_CHARACTERS": "no-wide-characters",		"_LIBCPP_HAS_NO_WIDE_CHARACTERS": "no-wide-characters",
"_LIBCPP_HAS_NO_UNICODE": "libcpp-has-no-unicode",		"_LIBCPP_HAS_NO_UNICODE": "libcpp-has-no-unicode",
		"_LIBCPP_PSTL_CPU_BACKEND_LIBDISPATCH": "libcpp-pstl-cpu-backend-libdispatch",
}		}
for macro, feature in macros.items():		for macro, feature in macros.items():
DEFAULT_FEATURES.append(		DEFAULT_FEATURES.append(
Feature(		Feature(
name=lambda cfg, m=macro, f=feature: f + ("={}".format(compilerMacros(cfg)[m]) if compilerMacros(cfg)[m] else ""),		name=lambda cfg, m=macro, f=feature: f + ("={}".format(compilerMacros(cfg)[m]) if compilerMacros(cfg)[m] else ""),
when=lambda cfg, m=macro: m in compilerMacros(cfg),		when=lambda cfg, m=macro: m in compilerMacros(cfg),
)		)
)		)
▲ Show 20 Lines • Show All 256 Lines • Show Last 20 Lines

[libc++][PSTL] Add a GCD backend (Closed, Public)

Revision Contents

Diff 539708

libcxx/CMakeLists.txt

libcxx/cmake/caches/Apple.cmake

libcxx/include/CMakeLists.txt

libcxx/include/__algorithm/pstl_backend.h

libcxx/include/__algorithm/pstl_backends/cpu_backends/backend.h

libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h

libcxx/include/__algorithm/pstl_backends/cpu_backends/transform_reduce.h

libcxx/include/__config_site.in

libcxx/include/__numeric/reduce.h

libcxx/include/__utility/terminate_on_exception.h

libcxx/include/module.modulemap.in

libcxx/src/CMakeLists.txt

libcxx/src/pstl/libdispatch.cpp

libcxx/test/libcxx/algorithms/pstl.libdispatch.chunk_partitions.pass.cpp

libcxx/test/std/algorithms/alg.sorting/alg.merge/pstl.merge.pass.cpp

libcxx/test/std/algorithms/numeric.ops/transform.reduce/pstl.transform_reduce.binary.pass.cpp

libcxx/utils/data/ignore_format.txt

libcxx/utils/libcxx/test/features.py
