This allows backends to customize arbitrary parallel algorithms, which was requested pretty often.
Details
- Reviewers
ldionne - Group Reviewers
Restricted Project - Commits
- rG8e2d09c33938: [libc++][PSTL] Add more specialized backend customization points
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
libcxx/include/__algorithm/pstl_for_each.h | ||
---|---|---|
48 | I guess we could do an ADL call here and if that resolves, we use that, otherwise we use this implementation. There's still difficulties with the fact that we have both a par and an unseq backend, though. |
I generally feel that we are better served with a overload set based on internal execution policy types. One problem I see is that we very, very fast are gonna get asked how to for example choose either SIMD or OpenMP based parallelism on a per invocation basis. The overload set based approach makes it easy to support this. Note we ARE allowed to ship implementation defined execution policies: "The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined." This is not undefined behavior, we just need to say what it does. That means for example we could ship std::omp_par std::omp_par_simd, std::gcd etc.. Or rather the LLVM-OpenMP project could ship std::omp_par together with a customization implementation of the algorithms. And AMD in their ROCm toolchain could add std::par_hip or something like that.
Then the only configuration decision is essentially what to map the mandated execution policies too.
Here is some code from our std::linalg prototype which does something like this:
c++ template<class .........> void matrix_vector_product( ExecutionPolicy&& exec, mdspan<...> A, mdspan<...> x, mdspan<...> y) { constexpr bool use_custom = is_custom_mat_vec_product_avail< decltype(execpolicy_mapper(exec)), decltype(A), decltype(x), decltype(y)>::value; if constexpr(use_custom) { matrix_vector_product(execpolicy_mapper(exec), A, x, y); } else { matrix_vector_product(std::experimental::linalg::impl::inline_exec_t(), A, x, y); } }
Basically there are no real implementations which take any of the standard std::execution policies. We implement overloads with internal execution policies. The execpolicy_mapper(exec) will by default return exec, except for the official std::execution. policies which map to an internal one (right now for us its all mapping to
std::experimental::linalg::impl::inline_exec_t ). That actually has the advantage that you know what the internal impl does.
The is_custom_mat_vec_product_avail will check whether an overload is visible for the provided args. In linalg that allows vendor plugins to for example only provide implementations for the scalar types they got (like the fortran BLAS). That latter point probably doesn't matter to PSTL.
So a vendor shipping LLVM would be able to modify the mapper to let say std::execution::par_unseq map to hip::gpu_exec or whatever. They then can still only provide the overloads their customers asked for, while the other stuff will fallback to the default impl.
A non-llvm-shipper like Kokkos can also provide their own overloads, but they would only be called if you call std::linalg::matrix_vector_product(Kokkos::exec_policy, ...);
The last little piece not up there is that one would probably want to not just call inline_exec_t() in the else branch (if more than one internal implementation exists) but maybe check what public thing the handed exec policy is convertible too and then call the corresponding internal thing of that. I.e. you get Kokkos::exec_policy but no overload Kokkos::sort exists, however Kokkos::exec_policy is convertible to std::execution::par_unseq, so you could call impl::par_unseq overload or so.
Oh if you like to I am happy to draft a different revision sketching out the above approach similar to the one here.
I don't understand what distinction you are making here. You can use OpenMP to generate SIMD instructions or you can use it to multi-thread your code, but there is no SIMD or OpenMP.
The overload set based approach makes it easy to support this. Note we ARE allowed to ship implementation defined execution policies: "The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined." This is not undefined behavior, we just need to say what it does. That means for example we could ship std::omp_par std::omp_par_simd, std::gcd etc.. Or rather the LLVM-OpenMP project could ship std::omp_par together with a customization implementation of the algorithms. And AMD in their ROCm toolchain could add std::par_hip or something like that.
That's a lot more complicated than it seems to be at first. There are a few problems I see:
- is_execution_policy_v<std::omp_par> has to return true for implementation-defined execution policies, so we wouldn't be able to use it inside the implementation of the algorithms
- We would have to push/pop macros everywhere to avoid users #defineing our implementation non-conforming
- using OpenMP, std::threads, GCD and whatever other mechanism together seems to defeat the purpose of the interface. At least as I understand it, the idea is to have an interface which brings you >90% of the way compared to writing it by hand.
- I don't really see the purpose of having the choice of selecting the backend. If there is a significant performance improvement, the implementation should be tuned instead of having the user try out X different backends to find the best one.
- This kind of customization is there in the PSTL, but it never materialized, so I don't think it's actually asked for that much
BTW the "implementation-defined" only forces us to document this. If it weren't mentioned, it would still be allowed for an implementation to support other execution policies, it just wouldn't have to be documented.
I don't understand what distinction you are making here. You can use OpenMP to generate SIMD instructions or you can use it to multi-thread your code, but there is no SIMD or OpenMP.
The current draft thingy has macros to at configure time decide whether to use PSTL_UNSEQ_BACKEND_SIMD or PSTL_UNSEQ_BACKEND_SERIAL my example was just riffing off that.
But even if we just talk OpenMP there is a difference between #pragma omp parallel for and #pragma omp parallel for simd. Now we could just have that map to std::par and std::par_unseq if OpenMP is detected as enabled,
but I certainly don't want folks to need to install to versions of libcxx - one with OpenMP enabled and one with it disabled. We don't install to versions of clang for this right now.
Furthermore, in our use cases one would enable for example in ROCM HIP and OpenMP at the same time. So now users want to decide on a case by case basis whether a for_each runs with OpenMP "parallel for" "parallel for simd" or "hip_gpu". And both the "omp parallel for simd" and the "hip_gpu" presumably have the semantic meaning of "par_unseq".
That's a lot more complicated than it seems to be at first. There are a few problems I see:
- is_execution_policy_v<std::omp_par> has to return true for implementation-defined execution policies, so we wouldn't be able to use it inside the implementation of the algorithms
That is what the mapper to fully internal exec policies is for, and there would never be actual implementations using std::execution::par etc.. Those ones would always hit the "dispatch" overload, not an implementation overload.
- We would have to push/pop macros everywhere to avoid users #defineing our implementation non-conforming
How so? At most we would need to have a check function in each dispatch function which tests whether the execution policy is an allowed one in "no-implementation-defined-behavior" mode, and so that one check function needs an ifdef.
- using OpenMP, std::threads, GCD and whatever other mechanism together seems to defeat the purpose of the interface. At least as I understand it, the idea is to have an interface which brings you >90% of the way compared to writing it by hand.
- I don't really see the purpose of having the choice of selecting the backend. If there is a significant performance improvement, the implementation should be tuned instead of having the user try out X different backends to find the best one.
But for different uses of the same algorithm (i.e. different callable, different number of iterations) a different backend will be optimal. There is no way to automatically choose that optimally on the backend side.
- This kind of customization is there in the PSTL, but it never materialized, so I don't think it's actually asked for that much
Nobody really worked on this, partly because we didn't have the right people or time to do this. We (as in the HPC community and explicitly DOE) had to get ready for Exascale and there wasn't a clear path for us how to do that with pstl. Now however we are shipping implementations of pstl (in our own namespaces). Kokkos, Intel's OneAPI and NVIDIA are all doing their own version. One common thing across all of them is that we pass in stateful execution policies, because users need to provide some information in an important subset of cases. On our side probably 75% of use cases get away with defaults and then there are some which don't. More importantly we got more than one par_unseq equivalent users choose from.
BTW the "implementation-defined" only forces us to document this. If it weren't mentioned, it would still be allowed for an implementation to support other execution policies, it just wouldn't have to be documented.
Agreed, however this wording means we are allowed to have other execution policies for which is_execution_policy is true. I don't think users are allowed to provide a specialization of is_execution_policy.
I like this. I think this answers all the constraints we had determined yesterday.
libcxx/include/__algorithm/pstl_backend.h | ||
---|---|---|
30–42 | ||
67–91 | # if defined(_PSTL_PAR_BACKEND_STD_THREAD) || defined(_PSTL_PAR_BACKEND_GCD) || defined(_PSTL_PAR_BACKEND_TBB) || defined(_PSTL_PAR_BACKEND_SERIAL) # include <__algorithm/pstl_backends/cpu_backend.h> template <> struct __select_backend<std::parallel_policy> { using type = __cpu_backend; }; template <> struct __select_backend<std::parallel_unsequenced_policy> { using type = __cpu_backend; }; #elif defined(_PSTL_PAR_BACKEND_SOME_FUNKY_GPU) # include <__algorithm/pstl_backends/funky_gpu_backend.h> template <> struct __select_backend<std::parallel_policy> { using type = __funky_gpu_backend; }; template <> struct __select_backend<std::parallel_unsequenced_policy> { using type = __funky_gpu_backend; }; # else // ...New vendors can add parallel backends here... # error "Invalid choice of a PSTL parallel backend" # endif | |
libcxx/include/__algorithm/pstl_backends/cpu_backend.h | ||
18–22 ↗ | (On Diff #520740) | #ifdef _LIBCPP_HAS_NO_THREADS # include <__algorithm/pstl_backends/cpu_backends/serial.h> #elif defined(_PSTL_PAR_BACKEND_STD_THREAD) # include <__algorithm/pstl_backends/cpu_backends/thread.h> #elif defined(_PSTL_PAR_BACKEND_GCD) # include <__algorithm/pstl_backends/cpu_backends/gcd.h> #elif defined(_PSTL_PAR_BACKEND_TBB) # include <__algorithm/pstl_backends/cpu_backends/tbb.h> #elif defined(_PSTL_PAR_BACKEND_SERIAL) # include <__algorithm/pstl_backends/cpu_backends/serial.h> #else # error "Invalid backend choice for a CPU backend" #endif |
libcxx/include/__algorithm/pstl_for_each.h | ||
50 |
libcxx/include/__algorithm/pstl_for_each.h | ||
---|---|---|
49 | We could do this in C++17: auto __for_each_n_test = [](auto&& ...args) -> void_t<decltype(std::__pstl_for_each_n<_RawPolicy>(args...))> {}; if constexpr (__is_valid(__for_each_n_test, _Backend{}, __first, __size, __func)) { // ... } else { // ... } Where: template <typename _Func, typename ..._Args, typename = decltype( std::declval<_Func&&>()(std::declval<_Args&&>()...) )> constexpr bool __is_valid_impl(int) { return true; } template <typename _Func, typename ..._Args> constexpr bool __is_valid_impl(...) { return false; } template <typename _Func, typename ..._Args> constexpr bool __is_valid(_Func&&, _Args&& ...) { return __is_valid_impl<_Func&&, _Args&&...>(int{}); } You might run into issues with __is_valid(__for_each_n_test, _Backend{}, __first, __size, __func) not being a constant expression because you are passing references to function arguments. Not sure if it'll be a problem cause you never actually read them. But if it is, then you can switch to returning -> auto std::true_c{} and -> auto std::false_c{} from your __is_valid function, and then you call it like: if constexpr (decltype(__is_valid(as-before)){}) { // ... } OK, not quite as nice, but it works around the constexpr issue. If you don't like this, you can also try to pass the argument types directly instead of the arguments themselves, like __is_valid<decltype(__for_each_n_test), args...>(). I'm not sure I quite like this, but it's an option on the table. |
I think this is an excellent start. Then we can clean up some stuff, improve comments and move all the existing algorithms to this approach.
It turns out that once we get to the CPU backend, we basically do what the original PSTL did -- we're really just adding an additional layer of customizability on top for backends where the par/non-par split might not make sense. LGTM w/ green CI and comments addressed.
libcxx/include/__algorithm/pstl_backend.h | ||
---|---|---|
46–47 | Let's move this to pstl_for_each.h, it seems to belong there more than here. We can also add // declaration needed for the frontend dispatch below. | |
libcxx/include/__algorithm/pstl_backends/cpu_backends/for_each.h | ||
35 ↗ | (On Diff #520995) | _HIDE_FROM_ABI |
libcxx/include/__algorithm/pstl_for_each.h | ||
62 | Nit but I don't think you can move(__first) here since you are then using __first + __size. | |
libcxx/include/__type_traits/is_execution_policy.h | ||
47 ↗ | (On Diff #520995) | // TODO: Remove default argument once algorithms are using the new backend dispatching |
I did not look it over super careful, but I agree that this looks like it can do the things I was most interested in having available. So I am good with merging this.