Details
- Reviewers: ldionne, nadiasvertex
- Group Reviewers: Restricted Project
- Commits: rG2b2e7f6e5727: [libc++][PSTL] Add a GCD backend

Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
143–144: It's been so long, I don't remember the exact details, sorry. I think it had something to do with calculating the correct offset. We do the search in the smaller of the two ranges, and the constraints change because the position we want to compute is different in first1:last1 than it is in first2:last2.
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
38: Otherwise this is a bit confusing.
102–103: Seems more fitting given the names of the iterators?
159–173: IMO this captures better the fact that we are only catching exceptions in __libdispatch::__calculate_merge_ranges:

```c++
pmr::vector<...> __ranges;
try {
  __ranges = __calculate(...);
} catch (...) {
  throw __pstl_bad_alloc();
}
__libdispatch::__dispatch_apply(...);
```

This makes it more obvious that we let __dispatch_apply terminate if an exception is thrown.

Edit: This might not work if the allocator doesn't propagate.
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
159–173:

```c++
pmr::vector<...> __ranges = [&] {
  try {
    return __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
  } catch (...) {
    throw ...;
  }
}();
```

Still debating whether that's overkill, but it works around the allocator propagation issue. I think we should leave the code as-is after consideration.
libcxx/include/__utility/terminate_on_exception.h
30–31: I'm not sure this is correct as-is. Let's say we have a predicate to find_if that happens to use __pstl_merge in its implementation, and let's say the "setup" code for __pstl_merge fails to allocate and throws __pstl_bad_alloc. It will be caught by find_if's __terminate_on_exception wrapper and be rethrown, right? If so, then we'd be incorrectly propagating the exception.

Instead, I think we might want to use __terminate_on_exception in the backend implementation *only* around user code (not setup code), so we'd remove it from the various functions that take __cpu_backend_tag. It would also not need to handle __pstl_bad_alloc anymore, because it would never wrap "setup" code. This should be done in a separate patch, and we'll likely need some tests as well.
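To make the intent concrete, here is a minimal sketch of the "only around user code" idea. The helper name __terminate_on_user_exception and its placement are hypothetical, not something that exists in libc++ today:

```c++
// Hypothetical sketch: a wrapper placed solely around calls into user code
// (predicates, comparators), never around the backend's own setup/allocation code.
// Because it never wraps setup code, it doesn't need to know about __pstl_bad_alloc.
#include <exception>
#include <utility>

template <class _Func>
decltype(auto) __terminate_on_user_exception(_Func&& __func) noexcept {
  try {
    return std::forward<_Func>(__func)();
  } catch (...) {
    std::terminate(); // any exception escaping user code terminates the program
  }
}

// In a backend loop body (sketch):
//   __terminate_on_user_exception([&] { return __pred(*__it); });
```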
libcxx/src/CMakeLists.txt
322–323
libcxx/src/pstl/libdispatch.cpp
12: We might want to move this system header below the library's own includes; I think that's what we usually do.
29–30: Let's say we want 3x to 100x as many "tasks" as we have cores, and let's say for now that we always want 50x as many, just to pick something. Then we could also do:

```c++
const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
const auto chunk_size              = element_count / target_number_of_chunks; // roughly
```

Then we could also add some logic like not having chunks smaller than X elements, or we could make the 50x scale from 3x to 100x based on the number of elements we're processing. At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on __default_chunk_size (which itself is 100% thin air).

You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores and starts "somewhere". I think I agree.

Another observation is that we should probably err in favour of spawning more tasks rather than fewer. At the end of the day, the user did request the parallel version of the algorithm. If they use std::for_each(std::execution::par) with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh, 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores; in the worst case those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently, and the user made a poor decision to try and parallelize that part of their code.

In summary, I would suggest this scheme: we always spawn at least min(element_count, thread::hardware_concurrency()) tasks.
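As an illustration only, the two ideas above could combine roughly as follows; the function name choose_chunk_count is made up, and the 50x multiplier is just the placeholder value from this comment:

```c++
#include <algorithm>
#include <cstddef>
#include <thread>

// Sketch: aim for ~50x as many chunks as cores, but never spawn fewer than
// min(element_count, hardware_concurrency()) tasks and never more than one
// task per element. All constants are placeholders to be tuned.
std::size_t choose_chunk_count(std::size_t element_count) {
  const std::size_t cores  = std::max<std::size_t>(1, std::thread::hardware_concurrency());
  const std::size_t target = cores * 50;                     // "50x" picked out of thin air
  const std::size_t floor  = std::min(element_count, cores); // always spawn at least this many
  return std::max(floor, std::min(target, element_count));
}
```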
libcxx/src/pstl/libdispatch.cpp
29–30: For small numbers of elements (fewer than the number of cores), we could create one task for each element. That's the only way to nicely handle the case where processing each element is really heavy and we really want to parallelize. It should also be rare that users use std::execution::par to sort a vector of 3 ints; if they do, we have no way to tell, and I'd argue it's not unreasonable for us to spawn 3 tasks.

This is not very scientific, I'm afraid, but it seems to be somewhat reasonable while playing around with a couple of values:

```c++
const ptrdiff_t cores = std::thread::hardware_concurrency();

auto small  = [](ptrdiff_t n) { return n; };
auto medium = [=](ptrdiff_t n) { return cores + (n - cores) / cores; };
// Approximation of `log(1.01, sqrt(n))`, which seemed to be reasonable for `n` larger
// than 500 and tops off at ~800 tasks for n ~ 8 million.
auto large  = [](ptrdiff_t n) { return static_cast<ptrdiff_t>(100.499 * std::log(std::sqrt(n))); };

if (n < cores)    return small(n);
else if (n < 500) return medium(n);
else              return std::min(medium(n), large(n)); // provides a "smooth" transition
```

The above assumes that we have 8 cores; everything kind of needs to be adjusted for different numbers of cores. I suggest we start with this just to unblock the patch, but leave a big fat comment explaining that this needs to be revisited. We can take a look with the dispatch folks, or if anyone observing this review has an idea, please feel free to chime in.
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
113–117: We should ensure that this emplace_back() doesn't need to reallocate, otherwise it gets messy with exceptions (and it's a bit inefficient). We should add an assertion here: `_LIBCPP_ASSERT_UNCATEGORIZED(__ranges.capacity() >= __ranges.size() + 1);`
146–147: Here, let's reserve() in advance. The upper bound on the number of ranges we'll ever allocate should be the number of leaves in the "call tree" of __calculate_merge_ranges. In the worst case we only divide the biggest range by two and we don't touch the other range at every step, so the number of levels in that call tree is bounded by the number of times you can divide each range by two (summing the worst case for each range). The number of leaves is then bounded by 2^levels. Since levels grows logarithmically, this isn't as bad as it seems; it basically ends up being linear. So I think this gives us:

2^(log_2(size1/default_chunk_size)) == size1/default_chunk_size

If we add both of the ranges, it means we'd be bounded by:

size1/default_chunk_size + size2/default_chunk_size == (size1 + size2)/default_chunk_size

But actually, we might be able to ditch this altogether and instead base our chunking on __partition_chunks by chunking the largest range (or a combination of both) iteratively. I'm not seeing the solution clearly right now, but I think we might be able to come up with something better than what we have.

Potential alternative: we could instead walk the largest range by increments of chunk_size and compute the matching location to merge from in the small range using lower_bound or upper_bound, iteratively. We would pre-reserve large-range / chunk_size entries in the vector. That seems simpler, and while there are adversarial inputs with that method, the general case is better behaved.
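To illustrate that alternative (outside libc++ naming conventions, and assuming for brevity that the first range is the larger one), the iterative walk could look roughly like this:

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: walk the larger range in chunk_size steps and, for each step, find
// the matching split point in the smaller range with lower_bound. The vector can be
// reserved up front with (size of larger range) / chunk_size entries.
template <class It1, class It2, class Comp>
auto calculate_merge_ranges(It1 first1, It1 last1, It2 first2, It2 last2, Comp comp,
                            std::ptrdiff_t chunk_size) {
  struct merge_range { It1 mid1; It2 mid2; };
  std::vector<merge_range> ranges;
  ranges.reserve((last1 - first1) / chunk_size + 1);

  while (last1 - first1 > chunk_size) {
    It1 mid1 = first1 + chunk_size;
    It2 mid2 = std::lower_bound(first2, last2, mid1[-1], comp); // matching split in the smaller range
    ranges.push_back({mid1, mid2});
    first1 = mid1;
    first2 = mid2;
  }
  ranges.push_back({last1, last2}); // final (possibly short) chunk
  return ranges;
}
```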
libcxx/test/std/algorithms/alg.sorting/alg.merge/fuzz.pstl.merge.sh.cpp
15 (On Diff #536011): What is that number? Can you document it with a comment and the rationale for using it? (e.g. "we want this to contain at least N + M elements so we can merge large-ish ranges")
libcxx/test/std/algorithms/alg.sorting/alg.merge/pstl.merge.pass.cpp
106–111: Let's do this as a separate patch.
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
75–83: This isn't used anymore.
189–190: Future optimization: we should allocate only one temporary value per worker thread, not one per chunk. Then we should pad the storage to make sure they all fall on different cache lines to avoid false sharing. Can you add a TODO comment mentioning that?
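For context, the padding part of that TODO could look something like the sketch below; the names are illustrative and the 64-byte fallback is an assumption, not part of the patch:

```c++
#include <cstddef>
#include <new>
#include <vector>

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line_size = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t cache_line_size = 64; // common assumption when the constant is unavailable
#endif

// One slot per worker thread, each on its own cache line, so concurrent updates
// by different workers don't false-share.
template <class T>
struct alignas(cache_line_size) padded_slot {
  T value;
};

// Usage sketch: one accumulator per worker, reduced serially at the end.
//   std::vector<padded_slot<int>> partial(worker_count);
//   ... each worker updates partial[worker_index].value ...
```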
libcxx/src/pstl/libdispatch.cpp
33: Can you add a TODO comment to use the number of cores that libdispatch will actually use, instead of the total number of cores on the system?
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
189: Until we re-introduce __uninitialized_buffer, we can do this instead:

```c++
// TODO: Use __uninitialized_buffer
auto __destroy = [=](_Value* __ptr) {
  std::destroy_n(__ptr, __partitions.__chunk_count_);
  std::allocator<_Value>().deallocate(__ptr, __partitions.__chunk_count_);
};
unique_ptr<_Value, decltype(__destroy)> __values(
    std::allocator<_Value>().allocate(__partitions.__chunk_count_), __destroy);

// use __dispatch_apply(...)

auto __tmp = std::transform_reduce(__values.get(), __values.get() + __partitions.__chunk_count_,
                                   __init, __transform, __reduction);
return __tmp;
```
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
48–49: Not used anymore.
121–133: As discussed, this can be simplified a bit:

```c++
auto __compute_chunk = [&](size_t __chunk_size) -> __merge_range_t {
  auto [__mid1, __mid2] = [&] {
    if (__iterate_first_range) {
      auto __m1 = __first1 + __chunk_size;
      auto __m2 = std::lower_bound(__first2, __last2, __m1[-1], __comp);
      return std::make_pair(__m1, __m2);
    } else {
      auto __m2 = __first2 + __chunk_size;
      auto __m1 = std::lower_bound(__first1, __last1, __m2[-1], __comp);
      return std::make_pair(__m1, __m2);
    }
  }();
  __merge_range_t __ret{__mid1, __mid2, __result};
  __result += (__mid1 - __first1) + (__mid2 - __first2);
  __first1 = __mid1;
  __first2 = __mid2;
  return __ret;
};
```
151: As discussed live, you could emplace_back the first iterators of the entire range manually into the vector, and then you'd be able to remove the special-casing for __index == 0. You'd always look at __ranges[__index] and __ranges[__index + 1] here.
199–201: This doesn't work if you have only 1 element in a chunk (or if that's the case for the __first_chunk_size). We should have a test that exercises that. I guess the expected behavior in that case is that we'd just use std::construct_at(__values + __index, __transform(*__first));
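A sketch of the kind of coverage being asked for (illustrative only, not the test actually added in this patch): very small inputs make single-element and empty chunks likely under the chunking scheme discussed earlier.

```c++
#include <cassert>
#include <cstddef>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
  // Exercise tiny inputs, including sizes that force one-element (and empty) chunks.
  for (std::size_t n = 0; n != 8; ++n) {
    std::vector<int> v(n, 1);
    auto doubled = [](int x) { return x * 2; };
    int par = std::transform_reduce(std::execution::par, v.begin(), v.end(), 0, std::plus<>{}, doubled);
    int ser = std::transform_reduce(v.begin(), v.end(), 0, std::plus<>{}, doubled);
    assert(par == ser);
  }
}
```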
205: This should only be a reduce. The mistake would be catchable if the transform returned a different type; test coverage should be added for that.
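Again as a sketch of the suggested coverage (not the patch's actual test): a transform whose result type differs from the element type makes an accidental extra application of the transform observable.

```c++
#include <cassert>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> v(1000, 1);
  // The transform maps int -> double, so the final combination over the per-chunk
  // results must be a plain reduce over doubles; re-applying the int transform
  // there would truncate the intermediate values and change the result.
  double result = std::transform_reduce(std::execution::par, v.begin(), v.end(), 0.0,
                                        std::plus<>{}, [](int x) { return x / 2.0; });
  assert(result == 500.0);
}
```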
213: Can we instead run the sort in parallel by chunking and then run a serial merge? That's not too hard to implement, and it's much better than doing everything serially. We can leave a TODO for parallelizing the merge, since we don't have a clear way to do it.
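For reference, a rough illustration of the "parallel chunked sort, serial merge" shape using plain std::thread; the real backend would hand the chunks to libdispatch instead, and all names here are illustrative:

```c++
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

template <class It, class Comp>
void chunked_parallel_sort(It first, It last, Comp comp, std::size_t chunk_count) {
  const std::size_t n = last - first;
  if (chunk_count < 2 || n < 2) {
    std::sort(first, last, comp);
    return;
  }

  const std::size_t chunk = (n + chunk_count - 1) / chunk_count;
  std::vector<std::thread> workers;
  std::vector<It> bounds{first};
  for (std::size_t begin = 0; begin < n; begin += chunk) {
    It b = first + begin;
    It e = first + std::min(begin + chunk, n);
    bounds.push_back(e);
    workers.emplace_back([=] { std::sort(b, e, comp); }); // sort each chunk concurrently
  }
  for (auto& w : workers)
    w.join();

  // Serial merge of the sorted chunks (the TODO above: parallelize this step).
  for (std::size_t i = 2; i < bounds.size(); ++i)
    std::inplace_merge(bounds[0], bounds[i - 1], bounds[i], comp);
}
```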
libcxx/test/std/algorithms/alg.modifying.operations/alg.copy/fuzz.pstl.copy.sh.cpp
1 (On Diff #538792): Those tests should be moved to the fuzz patch.
Address comments
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
213: I'd rather do that in a follow-up. It's not like we have to make this implementation perfect from the start.
libcxx/CMakeLists.txt
800: We need `set(LIBCXX_PSTL_CPU_BACKEND libdispatch)` inside libcxx/cmake/caches/Apple.cmake.
libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
13: Missing include for move_iterator.
131–135: If you use __first_iters to store the result, this becomes:

```c++
__result += (__mid1 - __first1) + (__mid2 - __first2);
__first1 = __mid1;
__first2 = __mid2;
return __merge_range_t{__first1, __first2, __result};
```

And the code above becomes __ranges.emplace_back(__first1, __first2, __result); instead of __ranges.emplace_back(__first1, __first2, _RandomAccessIterator3{});
142–143: It makes it clearer that we do N-1 iterations, and also that we require __partitions.__chunk_count_ > 1 (which is the case, see above).
148: IMO that's easier to read, and it's equivalent.
libcxx/src/CMakeLists.txt
322–323: I don't think this was done.
LGTM once CI passes. Thanks a lot for all the work on this! It's not the final implementation, but this is a really good start and we can then improve on it iteratively.