Page MenuHomePhabricator

[OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant
ClosedPublic

Authored by jdoerfert on Jan 30 2021, 5:01 PM.

Details

Summary

The shuffle idiom is differently implemented in our supported targets.
To reduce the "target_impl" file we now move the shuffle idiom in it's
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.

Diff Detail

Event Timeline

jdoerfert created this revision.Jan 30 2021, 5:01 PM
jdoerfert requested review of this revision.Jan 30 2021, 5:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 30 2021, 5:01 PM
Herald added a subscriber: sstefan1. · View Herald Transcript
tianshilei1992 accepted this revision.Jan 30 2021, 5:11 PM

LGTM. This will be the first example to merge different implementations into one file.

openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu
17

One nit: you might have shuffle.h before target_impl.h if using clang-format.

This revision is now accepted and ready to land.Jan 30 2021, 5:11 PM

I can see many warnings emitted, like the following one:

In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17:
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_d
own_sync' is not defined [-Wundefined-inline]
inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var,                                                                                                   ^

Comments inline. Not totally sure this is better, code seems longer and more complicated than it was before. Aim is merging the two GPU implementations?

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
27 ↗(On Diff #320332)

Extern C? Also, why forward declare instead of include header?

40 ↗(On Diff #320332)

uint64_t for mask. Sign bit is just one of the lanes.

88 ↗(On Diff #320332)

Seems bad, both because it's a macro instead of variant, and because I thought we'd already got rid of that macro

117 ↗(On Diff #320332)

Not sure about int constants near places where the 32/64 bit distinction is important

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
66 ↗(On Diff #320332)

Wonder if this compiles without a forward declare of the intrinsic

I can see many warnings emitted, like the following one:

In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17:
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_d
own_sync' is not defined [-Wundefined-inline]
inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var,                                                                                                   ^

I will fix this in clang though before.

jdoerfert added inline comments.Jan 30 2021, 7:21 PM
openmp/libomptarget/deviceRTLs/common/include/shuffle.h
27 ↗(On Diff #320332)

because our headers are a mess. These are C++ functions, the interface is a mess too

117 ↗(On Diff #320332)

That's why it's int64_t above, -1 sign extended stays -1.

Addressed comments. uint64, and static to avoid warning, once we have a default
impl we can actually remove the first decl variant and the static if we want.

jdoerfert planned changes to this revision.Jan 31 2021, 12:20 AM

I'll update this tomorrow, certain parts are not great.

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
88 ↗(On Diff #320332)

we will, ptx selection. I'll actually update this patch.

Replace the CUDA_VERSION macro, provide a mock default impl for shuffle

This revision is now accepted and ready to land.Jan 31 2021, 10:14 AM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131

There is no shuffle.cpp

tianshilei1992 accepted this revision.Jan 31 2021, 10:54 AM

^ shuffle.cpp doesn't exist. Except that, I tested locally and it works properly. LGTM.

Add shuffle.cpp

Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.

If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
105 ↗(On Diff #320368)

I like this a lot. Way better that CUDA_VERSION macros.

Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.

I'm hoping for a structure where we put a feature into a file, rather than an architecture per file. If we do both, we get 4 files per feature. This is not intrinsically bad but makes updates less convenient. At the end of the day, it doesn't matter much what we do since we perform LTO on this. I was hoping the approach with one header per feature and one cpp file for externally used functions would spur reuse. That is, helpers are defined only once, opportunities to share code are easier to spot during updates, etc.

If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.

So this is what's happening, at least once we have a "fallback" target model for which we can define what a "shuffle" means.

Header per target-specific-function as opposed to a list of things like __kmpc_impl_syncwarp seems reasonable. Not the way I'd have sliced it but clearly equivalent.

Minor request, let it be target/shuffle.h, so that we build up a list of functions (stuff in that directory) that new architectures need to implement in one place, distinct from code that will hopefully work out of the box for a new target.

This starts us down a path towards:

devicertl
  - src
     - parallel.cpp
     - target
        - shuffle.h

as opposed to the current nvptx/amdgcn/common split. That has the attraction of being a more conventional build system (it's weird treating one source file as openmp for one target and hip for another, the semantics don't really match) and, given we're building with clang, even using a single cmake file to build for all targets.

ronlieb added a subscriber: ronlieb.Feb 1 2021, 3:24 PM
ronlieb added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131

is there an amdgcn CMakeLists.txt equivalent change ? should there be ?

We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.

jdoerfert added inline comments.Feb 1 2021, 3:26 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131

probably, I'll look and add it. FWIW, if we had tests and CI for this, e.g., AMD CI that builds the runtime for AMDGPU, that would expose such a mistake right away ;)

We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.

yes, it's case by case though. We should check what ptx version, or other criterion, is a good selector and replace them. Here it was rather easy in the end.

ronlieb added inline comments.Feb 1 2021, 5:01 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131

i agree we really do need an AMD CI, and to get there we also need to be upstreaming our clang support. so in the spirit of making more progress on this, could you do another review of Singh's patch https://reviews.llvm.org/D94961

jdoerfert updated this revision to Diff 320922.Feb 2 2021, 3:09 PM

Address comments

This revision was landed with ongoing or failed builds.Mar 11 2021, 9:31 PM
This revision was automatically updated to reflect the committed changes.
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131

Note to self, amdgcn does indeed need shuffle.cpp added to the cmake list (plus include path)