The shuffle idiom is differently implemented in our supported targets.
To reduce the "target_impl" file we now move the shuffle idiom in it's
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
LGTM. This will be the first example to merge different implementations into one file.
openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu | ||
---|---|---|
17 | One nit: you might have shuffle.h before target_impl.h if using clang-format. |
I can see many warnings emitted, like the following one:
In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17: /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_d own_sync' is not defined [-Wundefined-inline] inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var, ^
Comments inline. Not totally sure this is better, code seems longer and more complicated than it was before. Aim is merging the two GPU implementations?
openmp/libomptarget/deviceRTLs/common/include/shuffle.h | ||
---|---|---|
27 ↗ | (On Diff #320332) | Extern C? Also, why forward declare instead of include header? |
40 ↗ | (On Diff #320332) | uint64_t for mask. Sign bit is just one of the lanes. |
88 ↗ | (On Diff #320332) | Seems bad, both because it's a macro instead of variant, and because I thought we'd already got rid of that macro |
117 ↗ | (On Diff #320332) | Not sure about int constants near places where the 32/64 bit distinction is important |
openmp/libomptarget/deviceRTLs/common/include/shuffle.h | ||
---|---|---|
66 ↗ | (On Diff #320332) | Wonder if this compiles without a forward declare of the intrinsic |
Addressed comments. uint64, and static to avoid warning, once we have a default
impl we can actually remove the first decl variant and the static if we want.
I'll update this tomorrow, certain parts are not great.
openmp/libomptarget/deviceRTLs/common/include/shuffle.h | ||
---|---|---|
88 ↗ | (On Diff #320332) | we will, ptx selection. I'll actually update this patch. |
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt | ||
---|---|---|
131 | There is no shuffle.cpp |
^ shuffle.cpp doesn't exist. Except that, I tested locally and it works properly. LGTM.
Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.
If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.
openmp/libomptarget/deviceRTLs/common/include/shuffle.h | ||
---|---|---|
105 ↗ | (On Diff #320368) | I like this a lot. Way better that CUDA_VERSION macros. |
I'm hoping for a structure where we put a feature into a file, rather than an architecture per file. If we do both, we get 4 files per feature. This is not intrinsically bad but makes updates less convenient. At the end of the day, it doesn't matter much what we do since we perform LTO on this. I was hoping the approach with one header per feature and one cpp file for externally used functions would spur reuse. That is, helpers are defined only once, opportunities to share code are easier to spot during updates, etc.
If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.
So this is what's happening, at least once we have a "fallback" target model for which we can define what a "shuffle" means.
Header per target-specific-function as opposed to a list of things like __kmpc_impl_syncwarp seems reasonable. Not the way I'd have sliced it but clearly equivalent.
Minor request, let it be target/shuffle.h, so that we build up a list of functions (stuff in that directory) that new architectures need to implement in one place, distinct from code that will hopefully work out of the box for a new target.
This starts us down a path towards:
devicertl - src - parallel.cpp - target - shuffle.h
as opposed to the current nvptx/amdgcn/common split. That has the attraction of being a more conventional build system (it's weird treating one source file as openmp for one target and hip for another, the semantics don't really match) and, given we're building with clang, even using a single cmake file to build for all targets.
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt | ||
---|---|---|
131 | is there an amdgcn CMakeLists.txt equivalent change ? should there be ? |
We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt | ||
---|---|---|
131 | probably, I'll look and add it. FWIW, if we had tests and CI for this, e.g., AMD CI that builds the runtime for AMDGPU, that would expose such a mistake right away ;) |
yes, it's case by case though. We should check what ptx version, or other criterion, is a good selector and replace them. Here it was rather easy in the end.
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt | ||
---|---|---|
131 | i agree we really do need an AMD CI, and to get there we also need to be upstreaming our clang support. so in the spirit of making more progress on this, could you do another review of Singh's patch https://reviews.llvm.org/D94961 |
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt | ||
---|---|---|
131 | Note to self, amdgcn does indeed need shuffle.cpp added to the cmake list (plus include path) |
One nit: you might have shuffle.h before target_impl.h if using clang-format.