This is an archive of the discontinued LLVM Phabricator instance.

[pstl] Add a serial backend for the PSTL
ClosedPublic

Authored by ldionne on Mar 25 2019, 12:13 PM.

Details

Summary

The serial backend performs all tasks serially and does not require
threads. It does not have any dependencies beyond normal C++, but
it is not very efficient either.

Event Timeline

ldionne created this revision.Mar 25 2019, 12:13 PM

This is a WIP patch for adding a serial backend. This is a work in progress because I strongly suspect I'm not implementing the backend correctly, and I'm looking for guidance on how to do it correctly. One of the difficulties in implementing this was the lack of documentation for the parameters of each function constituting the backend. For example, I have no idea what __real_body is in

parallel_reduce(_ExecutionPolicy&&, _Index __first, _Index __last, const _Value& __identity, const _RealBody& __real_body, const _Reduction&)

so I tried guessing from the existing code. I feel it would be useful to document the backend API properly and I'm willing to do it, but I need help from @MikeDvorskiy and others to achieve that.

MikeDvorskiy added a comment.EditedMar 26 2019, 2:27 AM

Hi Louis,

Actually, there is a draft of the Parallel STL back-end API documentation. It's probably not very detailed, but it exists. I sent it to Thomas several months ago, and I've just sent it to you as well.
Maybe it makes sense to put it into the repo?
And of course, don't hesitate to write to me directly about Parallel STL first, to avoid wasting time on guessing.

In essence, I think we don't need to write a special serial back-end if we want serial execution under a parallel policy. According to the PSTL design, we can just add compile-time dispatching at the "patterns of bricks" design level, which will redirect to the serial patterns. See "is_parallelization_preferred".
Furthermore, we can do it with just one change within "is_parallelization_preferred", or even within the parallel policy traits. But I lean towards "is_parallelization_preferred", to avoid modifying the policy traits.

For example,

template <typename _ExecutionPolicy, typename... _IteratorTypes>
auto is_parallelization_preferred(_ExecutionPolicy&& __exec)
    -> decltype(lazy_and(__exec.__allow_parallel(), typename is_random_access_iterator<_IteratorTypes...>::type()))
{
#if __PSTL_PAR_BACKEND_SERIAL
    return std::false_type();
#else
    return internal::lazy_and(__exec.__allow_parallel(), typename is_random_access_iterator<_IteratorTypes...>::type());
#endif
}

> Actually, there is a draft of Parallel STL back-end API documentation. [...]
> In essence. I think we don't need to write a special serial back-end if we want a serial execution by a parallel policy. [...] See "is_parallelization_preferred". [...]

Per today's discussion, there IS interest for having a serial backend from multiple vendors. I think we all agreed that this was worth implementing, now I have to figure out how to implement it properly.

> I think we all agreed that this was worth implementing

Yes, we agreed. I'm ready to explain any details if you need.

ldionne marked 3 inline comments as done.Apr 9 2019, 10:32 AM

> I think we all agreed that this was worth implementing
>
> Yes, we agreed. I'm ready to explain any details if you need.

See the questions in this review.

pstl/include/pstl/internal/parallel_backend_serial.h
86

So, I strongly suspect this implementation is incorrect. Can you draft what a correct serial implementation would be?

94

Is __nsort the number of elements that should be sorted in the resulting range? In other words, this is a partial sort where you want to get the n smallest elements of the whole range in sorted order.

It's not clear to me how to achieve this without a __leaf_sort that itself accepts an n to only partially sort the first n elements. I mean, I could probably find a way to do it iteratively by sorting a bit more every time, but what I really want is to just call std::partial_sort. And actually, while we're at it, calling std::partial_sort isn't enough, since this sort needs to be stable, but we don't have a std::stable_partial_sort.

106

Note that the obvious implementation doesn't work:

std::merge(__first1, __last1, __first2, __last2, __out, __comp);

because it requires the elements to be copyable, but pstl apparently expects the merge to only move elements around without copying them.

MikeDvorskiy added inline comments.Apr 10 2019, 4:38 AM
pstl/include/pstl/internal/parallel_backend_serial.h
94

Actually, __leaf_sort is not only std::sort or std::partial_sort.
__leaf_sort (as a lambda) has already captured the information about "sort" vs. "partial sort": the __n variable.
See the code of the lambda passed into __parallel_stable_sort:

[__n](_RandomAccessIterator __begin, _RandomAccessIterator __end, _Compare __comp) {
                if (__n < __end - __begin)
                    std::partial_sort(__begin, __begin + __n, __end, __comp);
                else
                    std::sort(__begin, __end, __comp);
            },

So in the serial case you don't use the __nsort parameter here, because the special lambda already takes the "sort" vs. "partial sort" logic into account.

In a parallel implementation the n parameter may be useful, e.g. for merging just the first n elements of two sorted sub-ranges.

106

Why doesn't it work?

In the serial special case,
...

if (__n <= __merge_cut_off)
{
    // Fall back on serial merge
    __leaf_merge(__xs, __xe, __ys, __ye, __zs, __comp);
}

...
where __leaf_merge is std::merge.

ldionne marked 3 inline comments as done.Apr 10 2019, 1:55 PM
ldionne added a subscriber: wash.

I'd love to get concrete feedback on whether the implementation of the algorithms is wrong, and if so, why. This would be really helpful in understanding the backend API better and would allow me to make progress on implementing this backend, which is a pre-requisite for many other things.

Also note that I don't expect this backend implementation to be final after this commit. I just want us to get something "correct" so as to nail down the semantics of the backend functions and make progress on other tasks. I strongly suspect that we'll have to make changes to the backend API to get a more straightforward serial implementation, but now's not the time to tackle this.

pstl/include/pstl/internal/parallel_backend_serial.h
86

This is the only thing I strongly suspect is incorrectly implemented. Can you please check this?

94

Ah, I missed the bit in the lambda. So this implementation is correct, then. Thanks!

106

Like I said, std::merge requires elements to be copyable, but the PSTL tests call parallel_merge with elements that are move-only. It also looks like you jumped through some hoops to make this work; see __serial_move_merge in parallel_backend_utils.h.

However, since it looks like calling __leaf_merge is a valid implementation, let's revisit this copyability issue at another point. I want to land this patch ASAP because other stuff depends on it.

In case you're curious, the error message for copyability looks like this:

<toolchain>/usr/include/c++/v1/algorithm:4392:23: error: object of type 'LocalWrapper<float>' cannot be assigned because its copy assignment operator is implicitly deleted
            *__result = *__first2;
                      ^
<toolchain>/usr/include/c++/v1/algorithm:4416:19: note: in instantiation of function template specialization 'std::__1::__merge<std::__1::less<LocalWrapper<float> > &, std::__1::__wrap_iter<LocalWrapper<float> *>, std::__1::__wrap_iter<LocalWrapper<float> *>, LocalWrapper<float> *>' requested here
    return _VSTD::__merge<_Comp_ref>(__first1, __last1, __first2, __last2, __result, __comp);
                  ^
<pstl-root>/include/pstl/internal/parallel_backend_serial.h:117:10: note: in instantiation of function template specialization 'std::__1::merge<std::__1::__wrap_iter<LocalWrapper<float> *>, std::__1::__wrap_iter<LocalWrapper<float> *>, LocalWrapper<float> *, std::__1::less<LocalWrapper<float> > >' requested here
    std::merge(__first1, __last1, __first2, __last2, __out, __comp);
         ^
<pstl-root>/include/pstl/internal/algorithm_impl.h:2667:24: note: in instantiation of function template specialization '__pstl::__serial::__parallel_merge<const __pstl::execution::v1::parallel_policy &, std::__1::__wrap_iter<LocalWrapper<float> *>, std::__1::__wrap_iter<LocalWrapper<float> *>, LocalWrapper<float> *, std::__1::less<LocalWrapper<float> >, (lambda at <pstl-root>/include/pstl/internal/algorithm_impl.h:2669:13)>' requested here
        __par_backend::__parallel_merge(
                       ^
<pstl-root>/include/pstl/internal/glue_algorithm_impl.h:899:17: note: in instantiation of function template specialization '__pstl::__internal::__pattern_inplace_merge<const __pstl::execution::v1::parallel_policy &, std::__1::__wrap_iter<LocalWrapper<float> *>, std::__1::less<LocalWrapper<float> >, std::__1::integral_constant<bool, false> >' requested here
    __internal::__pattern_inplace_merge(
                ^
<pstl-root>/test/std/algorithms/alg.merge/inplace_merge.pass.cpp:63:14: note: in instantiation of function template specialization 'std::inplace_merge<const __pstl::execution::v1::parallel_policy &, std::__1::__wrap_iter<LocalWrapper<float> *>, std::__1::less<LocalWrapper<float> > >' requested here
        std::inplace_merge(exec, first2, mid2, last2, comp);
             ^
<pstl-root>/test/support/utils.h:757:9: note: (skipping 2 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
        op(std::forward<Rest>(rest)...);
        ^
...
ldionne updated this revision to Diff 194585.Apr 10 2019, 1:55 PM
ldionne removed a subscriber: wash.

Rebase on top of master

Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2019, 1:55 PM
MikeDvorskiy added inline comments.Apr 11 2019, 5:25 AM
pstl/include/pstl/internal/parallel_backend_serial.h
106
  1. Yes, std::merge requires copyable elements.

Actually, all of the "move" operations in "__serial_move_merge" reduce to "copy" operations if move semantics are not provided.

But you are right: if we pass a non-const iterator into std::merge and the value type has non-trivial move semantics, we get the wrong effect. It can be quickly fixed in "__parallel_merge". I'll do it.

  2. But the compiler diagnostic shown here is about the "std::inplace_merge" algorithm. That algorithm requires only move semantics, and the test checks exactly that.

The problem is that "__pattern_inplace_merge" re-uses "__par_backend::__parallel_merge" because there is no dedicated "__par_backend::__parallel_inplace_merge" API. I think it makes sense to add a "__par_backend::__parallel_inplace_merge" API (and move the relevant code from "__pattern_inplace_merge" into "__par_backend"). I'll do it. After that, your serial back-end can just call the serial "std::inplace_merge".

Thanks for raising the issues.

MikeDvorskiy added inline comments.Apr 11 2019, 9:01 AM
pstl/include/pstl/internal/parallel_backend_serial.h
86

I saw your question about __parallel_strict_scan as well.
I need additional time to work out the right serial code here.
I will try to answer today or tomorrow.

MikeDvorskiy added inline comments.Apr 12 2019, 9:14 AM
pstl/include/pstl/internal/parallel_backend_serial.h
86

Yes, this serial version is right for __parallel_strict_scan.

Is there anything to fix in this patch? If not, and if that implementation is the correct one given today's backend API, I'd like to check this in so we can do followup cleanup. It'll also help me as I try to see if a more general backend API is possible because I'll have both the TBB and the serial (trivial) examples to work off of.

MikeDvorskiy accepted this revision.Apr 18 2019, 10:34 AM

If the problem with the failing test "inplace_merge.pass.cpp" is still present, I'll investigate and fix it in a new patch on top of your changes.

This revision is now accepted and ready to land.Apr 18 2019, 10:34 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2019, 11:19 AM