This is an archive of the discontinued LLVM Phabricator instance.

[libc++][PSTL] Add more specialized backend customization points
Closed, Public

Authored by philnik on May 2 2023, 1:31 PM.

Details

Summary

This allows backends to customize arbitrary parallel algorithms, which has been requested quite often.

Diff Detail

Event Timeline

ldionne created this revision.May 2 2023, 1:31 PM
Herald added a project: Restricted Project.May 2 2023, 1:31 PM
ldionne requested review of this revision.May 2 2023, 1:31 PM
Herald added a reviewer: Restricted Project.
ldionne added inline comments.May 2 2023, 1:38 PM
libcxx/include/__algorithm/pstl_for_each.h
48

I guess we could do an ADL call here and, if that resolves, use that; otherwise we use this implementation. There are still difficulties with the fact that we have both a par and an unseq backend, though.
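
To make the idea concrete, here is a minimal sketch of what such an ADL-with-fallback dispatch could look like. All names here (__dispatch_for_each, and __pstl_for_each as the ADL hook) are invented for illustration and are not part of the patch:

// If an ADL-visible __pstl_for_each(backend, ...) exists, the first overload is
// selected via the int/long trick; otherwise the serial fallback is used.
template <class _Backend, class _Iterator, class _Func>
auto __dispatch_for_each(_Backend __backend, _Iterator __first, _Iterator __last, _Func __func, int)
    -> decltype(__pstl_for_each(__backend, __first, __last, __func)) {
  return __pstl_for_each(__backend, __first, __last, __func); // found via ADL
}

template <class _Backend, class _Iterator, class _Func>
void __dispatch_for_each(_Backend, _Iterator __first, _Iterator __last, _Func __func, long) {
  for (; __first != __last; ++__first) // default (serial) implementation
    __func(*__first);
}

// Callers pass 0 so the customized overload is preferred whenever it exists:
//   __dispatch_for_each(_Backend{}, __first, __last, __func, 0);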

I generally feel that we are better served with an overload set based on internal execution policy types. One problem I see is that we are very quickly going to get asked how to choose, for example, either SIMD- or OpenMP-based parallelism on a per-invocation basis. The overload-set-based approach makes it easy to support this. Note that we ARE allowed to ship implementation-defined execution policies: "The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined." This is not undefined behavior; we just need to say what it does. That means, for example, we could ship std::omp_par, std::omp_par_simd, std::gcd, etc. Or rather, the LLVM OpenMP project could ship std::omp_par together with a customization implementation of the algorithms. And AMD could add std::par_hip or something like that in their ROCm toolchain.

Then the only configuration decision is essentially what to map the mandated execution policies to.

Here is some code from our std::linalg prototype which does something like this:

template<class .........>
void matrix_vector_product(
  ExecutionPolicy&& exec, mdspan<...> A, mdspan<...> x, mdspan<...> y)
{
  constexpr bool use_custom = is_custom_mat_vec_product_avail<
    decltype(execpolicy_mapper(exec)), decltype(A), decltype(x), decltype(y)>::value;

  if constexpr(use_custom) {
    matrix_vector_product(execpolicy_mapper(exec), A, x, y);
  } else {
    matrix_vector_product(std::experimental::linalg::impl::inline_exec_t(), A, x, y);
  }
}

Basically, there are no real implementations which take any of the standard std::execution policies. We implement overloads with internal execution policies. The execpolicy_mapper(exec) will by default return exec, except for the official std::execution policies, which map to an internal one (right now for us it's all mapping to std::experimental::linalg::impl::inline_exec_t). That actually has the advantage that you know what the internal impl does.
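
As a rough illustration of that mapper idea (a minimal sketch under the assumption that all standard policies map to a single internal policy; the actual prototype may differ):

#include <execution>
#include <utility>

// Hypothetical sketch: unknown (vendor) policies pass through unchanged, while
// the standard std::execution policies all map to one internal policy type.
struct inline_exec_t {};

template <class _Policy>
auto execpolicy_mapper(_Policy&& __policy) { return std::forward<_Policy>(__policy); }

inline auto execpolicy_mapper(std::execution::sequenced_policy) { return inline_exec_t{}; }
inline auto execpolicy_mapper(std::execution::parallel_policy) { return inline_exec_t{}; }
inline auto execpolicy_mapper(std::execution::parallel_unsequenced_policy) { return inline_exec_t{}; }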

The is_custom_mat_vec_product_avail trait will check whether an overload is visible for the provided args. In linalg that allows vendor plugins to, for example, only provide implementations for the scalar types they support (like the Fortran BLAS). That latter point probably doesn't matter to the PSTL.
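
Such an availability trait can be written with the usual detection idiom; a sketch (assuming the vendor overload of matrix_vector_product is found via ADL or an earlier declaration, and that the dispatching overload above is constrained so it does not match here):

#include <type_traits>
#include <utility>

// True exactly when a matrix_vector_product overload is visible for the mapped
// execution policy and argument types.
template <class _Exec, class _A, class _X, class _Y, class = void>
struct is_custom_mat_vec_product_avail : std::false_type {};

template <class _Exec, class _A, class _X, class _Y>
struct is_custom_mat_vec_product_avail<
    _Exec, _A, _X, _Y,
    std::void_t<decltype(matrix_vector_product(
        std::declval<_Exec>(), std::declval<_A>(), std::declval<_X>(), std::declval<_Y>()))>>
    : std::true_type {};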

So a vendor shipping LLVM would be able to modify the mapper so that, say, std::execution::par_unseq maps to hip::gpu_exec or whatever. They can then still provide only the overloads their customers asked for, while everything else falls back to the default impl.

A non-LLVM shipper like Kokkos can also provide their own overloads, but they would only be called if you call std::linalg::matrix_vector_product(Kokkos::exec_policy, ...);

The last little piece not shown up there is that one would probably not want to just call inline_exec_t() in the else branch (if more than one internal implementation exists), but instead check which public policy the handed-in exec policy is convertible to, and then call the corresponding internal implementation. I.e. you get Kokkos::exec_policy but no Kokkos::sort overload exists; however, Kokkos::exec_policy is convertible to std::execution::par_unseq, so you could call the impl::par_unseq overload or so.

Oh, if you like, I am happy to draft a different revision sketching out the above approach, similar to the one here.

I generally feel that we are better served with an overload set based on internal execution policy types. One problem I see is that we are very quickly going to get asked how to choose, for example, either SIMD- or OpenMP-based parallelism on a per-invocation basis.

I don't understand what distinction you are making here. You can use OpenMP to generate SIMD instructions or you can use it to multi-thread your code, but there is no "SIMD vs. OpenMP" choice.

The overload-set-based approach makes it easy to support this. Note that we ARE allowed to ship implementation-defined execution policies: "The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined." This is not undefined behavior; we just need to say what it does. That means, for example, we could ship std::omp_par, std::omp_par_simd, std::gcd, etc. Or rather, the LLVM OpenMP project could ship std::omp_par together with a customization implementation of the algorithms. And AMD could add std::par_hip or something like that in their ROCm toolchain.

That's a lot more complicated than it seems to be at first. There are a few problems I see:

  • is_execution_policy_v<std::omp_par> has to return true for implementation-defined execution policies, so we wouldn't be able to use it inside the implementation of the algorithms
  • We would have to push/pop macros everywhere to prevent user #defines from rendering our implementation non-conforming
  • using OpenMP, std::thread, GCD, and whatever other mechanism together seems to defeat the purpose of the interface. At least as I understand it, the idea is to have an interface which brings you >90% of the way compared to writing it by hand.
  • I don't really see the purpose of having the choice of selecting the backend. If there is a significant performance improvement, the implementation should be tuned instead of having the user try out X different backends to find the best one.
  • This kind of customization is there in the PSTL, but it never materialized, so I don't think it's actually asked for that much

BTW, the "implementation-defined" wording only forces us to document this. If it weren't mentioned, it would still be allowed for an implementation to support other execution policies; it just wouldn't have to be documented.

I don't understand what distinction you are making here. You can use OpenMP to generate SIMD instructions or you can use it to multi-thread your code, but there is no "SIMD vs. OpenMP" choice.

The current draft has macros to decide at configure time whether to use PSTL_UNSEQ_BACKEND_SIMD or PSTL_UNSEQ_BACKEND_SERIAL; my example was just riffing off that.
But even if we just talk about OpenMP, there is a difference between #pragma omp parallel for and #pragma omp parallel for simd. Now we could just have those map to std::par and std::par_unseq if OpenMP is detected as enabled,
but I certainly don't want folks to need to install two versions of libcxx - one with OpenMP enabled and one with it disabled. We don't install two versions of clang for this right now.
Furthermore, in our use cases one would enable, for example in ROCm, both HIP and OpenMP at the same time. So now users want to decide on a case-by-case basis whether a for_each runs with OpenMP "parallel for", "parallel for simd", or "hip_gpu". And both "omp parallel for simd" and "hip_gpu" presumably have the semantic meaning of "par_unseq".
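
Just to illustrate the difference being discussed, a sketch of how an OpenMP backend could offer both flavors (the tag and function names are invented, and _Index is assumed to be an integer type; this is not a proposed interface):

struct __omp_par {};      // hypothetical tag: "parallel for"
struct __omp_par_simd {}; // hypothetical tag: "parallel for simd"

template <class _Index, class _Func>
void __omp_for_each_n(__omp_par, _Index __n, _Func __func) {
#pragma omp parallel for
  for (_Index __i = 0; __i < __n; ++__i)
    __func(__i); // multi-threaded, no vectorization requested
}

template <class _Index, class _Func>
void __omp_for_each_n(__omp_par_simd, _Index __n, _Func __func) {
#pragma omp parallel for simd
  for (_Index __i = 0; __i < __n; ++__i)
    __func(__i); // multi-threaded and vectorized, i.e. "par_unseq"-like semantics
}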

That's a lot more complicated than it seems to be at first. There are a few problems I see:

  • is_execution_policy_v<std::omp_par> has to return true for implementation-defined execution policies, so we wouldn't be able to use it inside the implementation of the algorithms

That is what the mapper to fully internal exec policies is for; there would never be actual implementations using std::execution::par etc. Those would always hit the "dispatch" overload, not an implementation overload.

  • We would have to push/pop macros everywhere to prevent user #defines from rendering our implementation non-conforming

How so? At most we would need a check function in each dispatch function which tests whether the execution policy is an allowed one in "no-implementation-defined-behavior" mode, so only that one check function needs an #ifdef.
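
Something along these lines, for example (the configuration macro is invented; this only sketches the single #ifdef that would be needed):

#include <execution>
#include <type_traits>

template <class _Policy>
constexpr void __check_policy_allowed() {
#ifdef _LIBCPP_PSTL_STANDARD_POLICIES_ONLY // hypothetical configuration macro
  using _Raw = std::decay_t<_Policy>;
  static_assert(std::is_same_v<_Raw, std::execution::sequenced_policy> ||
                    std::is_same_v<_Raw, std::execution::parallel_policy> ||
                    std::is_same_v<_Raw, std::execution::parallel_unsequenced_policy>,
                "only the standard execution policies are allowed in this configuration");
#endif
}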

  • using OpenMP, std::thread, GCD, and whatever other mechanism together seems to defeat the purpose of the interface. At least as I understand it, the idea is to have an interface which brings you >90% of the way compared to writing it by hand.
  • I don't really see the purpose of having the choice of selecting the backend. If there is a significant performance improvement, the implementation should be tuned instead of having the user try out X different backends to find the best one.

But for different uses of the same algorithm (e.g. a different callable, a different number of iterations) a different backend will be optimal. There is no way to choose that automatically on the backend side.

  • This kind of customization is there in the PSTL, but it never materialized, so I don't think it's actually asked for that much

Nobody really worked on this, partly because we didn't have the right people or the time to do it. We (as in the HPC community, and the DOE specifically) had to get ready for Exascale, and there wasn't a clear path for how to do that with the PSTL. Now, however, we are shipping implementations of the PSTL (in our own namespaces). Kokkos, Intel's oneAPI, and NVIDIA are all doing their own version. One thing common across all of them is that we pass in stateful execution policies, because users need to provide some information in an important subset of cases. On our side, probably 75% of use cases get away with the defaults, and then there are some which don't. More importantly, we have more than one par_unseq equivalent for users to choose from.
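
To illustrate what "stateful execution policy" means here (purely hypothetical names and members, not any of the vendor APIs mentioned above):

#include <cstddef>

// A policy object that carries per-invocation information for the backend.
struct par_with_chunk_size {
  std::size_t chunk_size = 1024;
};

template <class _Iterator, class _Func>
void for_each(par_with_chunk_size __policy, _Iterator __first, _Iterator __last, _Func __func) {
  // A real backend would partition the range into __policy.chunk_size-sized
  // chunks and run them in parallel; shown serially here to keep the sketch small.
  (void)__policy.chunk_size;
  for (; __first != __last; ++__first)
    __func(*__first);
}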

BTW, the "implementation-defined" wording only forces us to document this. If it weren't mentioned, it would still be allowed for an implementation to support other execution policies; it just wouldn't have to be documented.

Agreed; however, this wording means we are allowed to have other execution policies for which is_execution_policy is true. I don't think users are allowed to provide a specialization of is_execution_policy.

ldionne updated this revision to Diff 520470.May 8 2023, 1:10 PM

Update after today's discussion

ldionne updated this revision to Diff 520482.May 8 2023, 2:13 PM

Update per discussion with @philnik

ldionne added inline comments.May 8 2023, 2:14 PM
libcxx/include/__algorithm/pstl_for_each.h
38

As @philnik pointed out, we actually can't use ADL here because our code needs to be "robust-against-adl" (tm).
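
For context, the ADL problem is roughly the following (an illustration, not code from the patch): an unqualified call whose arguments involve user-defined types also searches the user's associated namespaces, which can inject unintended candidates or trigger unwanted instantiations.

namespace user {
  struct Widget {};
  // A function with this name in the user's namespace would be considered by ADL
  // for an unqualified call like __pstl_for_each(backend, widget_ptr, widget_ptr + n, func).
  template <class... _Args>
  void __pstl_for_each(_Args&&...);
}
// This is why the dispatch goes through qualified calls and tag types that are
// robust against ADL instead of an unqualified customization-point call.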

philnik commandeered this revision.May 9 2023, 10:18 AM
philnik added a reviewer: ldionne.
philnik updated this revision to Diff 520740.May 9 2023, 10:18 AM

Updated approach

I like this. I think this addresses all the constraints we had determined yesterday.

libcxx/include/__algorithm/pstl_backend.h
30–42
67–91
#  if defined(_PSTL_PAR_BACKEND_STD_THREAD) || defined(_PSTL_PAR_BACKEND_GCD) || defined(_PSTL_PAR_BACKEND_TBB) || defined(_PSTL_PAR_BACKEND_SERIAL)
#    include <__algorithm/pstl_backends/cpu_backend.h>
template <> struct __select_backend<std::parallel_policy> { using type = __cpu_backend; };
template <> struct __select_backend<std::parallel_unsequenced_policy> { using type = __cpu_backend; };

#  elif defined(_PSTL_PAR_BACKEND_SOME_FUNKY_GPU)
#    include <__algorithm/pstl_backends/funky_gpu_backend.h>
template <> struct __select_backend<std::parallel_policy> { using type = __funky_gpu_backend; };
template <> struct __select_backend<std::parallel_unsequenced_policy> { using type = __funky_gpu_backend; };

#  else

// ...New vendors can add parallel backends here...

#    error "Invalid choice of a PSTL parallel backend"
#  endif
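
As an aside, the way a frontend algorithm would consume such a trait is roughly the following (names are invented for illustration; the actual dispatch in this diff may differ in its details):

#include <type_traits>
#include <utility>

template <class _ExecutionPolicy, class _Iterator, class _Func>
void __example_for_each(_ExecutionPolicy&& __policy, _Iterator __first, _Iterator __last, _Func __func) {
  using _Backend = typename __select_backend<std::decay_t<_ExecutionPolicy>>::type;
  // Tag-dispatch to whichever backend was selected at configuration time;
  // __example_for_each_impl stands in for the backend's entry point.
  __example_for_each_impl(_Backend{}, std::forward<_ExecutionPolicy>(__policy), __first, __last, __func);
}
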
libcxx/include/__algorithm/pstl_backends/cpu_backend.h
18–22 ↗(On Diff #520740)
#ifdef _LIBCPP_HAS_NO_THREADS
#  include <__algorithm/pstl_backends/cpu_backends/serial.h>
#elif defined(_PSTL_PAR_BACKEND_STD_THREAD)
#  include <__algorithm/pstl_backends/cpu_backends/thread.h>
#elif defined(_PSTL_PAR_BACKEND_GCD)
#  include <__algorithm/pstl_backends/cpu_backends/gcd.h>
#elif defined(_PSTL_PAR_BACKEND_TBB)
#  include <__algorithm/pstl_backends/cpu_backends/tbb.h>
#elif defined(_PSTL_PAR_BACKEND_SERIAL)
#  include <__algorithm/pstl_backends/cpu_backends/serial.h>
#else
#  error "Invalid backend choice for a CPU backend"
#endif
libcxx/include/__algorithm/pstl_for_each.h
50
ldionne added inline comments.May 9 2023, 10:53 AM
libcxx/include/__algorithm/pstl_for_each.h
49

We could do this in C++17:

auto __for_each_n_test = [](auto&& ...args) -> void_t<decltype(std::__pstl_for_each_n<_RawPolicy>(args...))> {};
if constexpr (__is_valid(__for_each_n_test, _Backend{}, __first, __size, __func)) {
  // ...
} else {
  // ...
}

Where:

template <typename _Func, typename ..._Args, typename = decltype(
  std::declval<_Func&&>()(std::declval<_Args&&>()...)
)>
constexpr bool __is_valid_impl(int) { return true; }

template <typename _Func, typename ..._Args>
constexpr bool __is_valid_impl(...) { return false; }

template <typename _Func, typename ..._Args>
constexpr bool __is_valid(_Func&&, _Args&& ...) {
  return __is_valid_impl<_Func&&, _Args&&...>(int{});
}

You might run into issues with __is_valid(__for_each_n_test, _Backend{}, __first, __size, __func) not being a constant expression because you are passing references to function arguments. I'm not sure if it'll be a problem because you never actually read them. But if it is, then you can switch to returning std::true_type{} and std::false_type{} from your __is_valid function, and then you call it like:

if constexpr (decltype(__is_valid(as-before)){}) {
  // ...
}

OK, not quite as nice, but it works around the constexpr issue.
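
For completeness, one way to write that type-returning variant (a sketch reusing the __is_valid_impl helpers above; std::bool_constant comes from <type_traits>):

template <typename _Func, typename ..._Args>
constexpr auto __is_valid(_Func&&, _Args&& ...)
    -> std::bool_constant<__is_valid_impl<_Func&&, _Args&&...>(int{})> {
  return {};
}

// The result is now carried in the type, so decltype(__is_valid(...)){} works even
// if the arguments themselves are not usable in a constant expression.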

If you don't like this, you can also try to pass the argument types directly instead of the arguments themselves, like __is_valid<decltype(__for_each_n_test), decltype(__first), decltype(__size), decltype(__func)>(). I'm not sure I quite like this, but it's an option on the table.

philnik updated this revision to Diff 520782.May 9 2023, 11:59 AM
philnik marked 5 inline comments as done.

Refactor to make this a proper patch (and not a draft anymore)

philnik retitled this revision from [libc++][DISCUSSION] Exploring PSTL backend customization points to [libc++][PSTL] Add more specialized backend customization points.May 9 2023, 1:03 PM
philnik edited the summary of this revision. (Show Details)
philnik updated this revision to Diff 520801.May 9 2023, 1:13 PM

Try to fix CI

philnik updated this revision to Diff 520803.May 9 2023, 1:18 PM

Update wording

philnik updated this revision to Diff 520840.May 9 2023, 3:18 PM

Design updated after trying to implement a few more algorithms

philnik updated this revision to Diff 520842.May 9 2023, 3:27 PM

Fix the implementation

philnik updated this revision to Diff 520988.May 10 2023, 7:23 AM

Generate files

philnik updated this revision to Diff 520995.May 10 2023, 7:58 AM

Try to fix CI

ldionne accepted this revision.May 10 2023, 8:45 AM

I think this is an excellent start. Then we can clean up some stuff, improve comments and move all the existing algorithms to this approach.

It turns out that once we get to the CPU backend, we basically do what the original PSTL did -- we're really just adding an additional layer of customizability on top for backends where the par/non-par split might not make sense. LGTM w/ green CI and comments addressed.

libcxx/include/__algorithm/pstl_backend.h
46–47

Let's move this to pstl_for_each.h; it seems to belong there more than here. We can also add // declaration needed for the frontend dispatch below.

libcxx/include/__algorithm/pstl_backends/cpu_backends/for_each.h
35 ↗(On Diff #520995)

_HIDE_FROM_ABI

libcxx/include/__algorithm/pstl_for_each.h
62

Nit, but I don't think you can move(__first) here, since you are then using __first + __size.

libcxx/include/__type_traits/is_execution_policy.h
47 ↗(On Diff #520995)

// TODO: Remove default argument once algorithms are using the new backend dispatching

This revision is now accepted and ready to land.May 10 2023, 8:45 AM
philnik updated this revision to Diff 521013.May 10 2023, 8:57 AM
philnik marked 4 inline comments as done.

Try to fix CI

philnik updated this revision to Diff 521036.May 10 2023, 10:47 AM

Try to fix CI

I did not look it over super carefully, but I agree that this looks like it can do the things I was most interested in having available. So I am good with merging this.

philnik updated this revision to Diff 521128.May 10 2023, 3:45 PM

Try to fix CI

philnik updated this revision to Diff 521320.May 11 2023, 8:17 AM

Try to fix CI

philnik updated this revision to Diff 521338.May 11 2023, 9:01 AM

Fix formatting

This revision was landed with ongoing or failed builds.May 11 2023, 1:54 PM
This revision was automatically updated to reflect the committed changes.