This patch implements codegen for the reduction clause on
any teams construct for elementary data types. An efficient
implementation requires hierarchical reduction: within a warp,
within a threadblock, and across threadblocks. It is complicated
by the fact that variables declared on the stack of a CUDA
thread cannot be shared with other threads.
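As an illustration of the warp-level step (a generic CUDA idiom,
not code emitted by this patch; warpReduceSum is a hypothetical
name):

  // Each lane adds the value held by the lane 'offset' positions
  // above it; after log2(warpSize) = 5 steps, lane 0 holds the
  // sum over all 32 lanes.
  __device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2)
      val += __shfl_down_sync(0xffffffff, val, offset);
    return val; // result is meaningful only in lane 0
  }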
The patch creates a struct to hold reduction variables and
a number of helper functions. The OpenMP runtime on the GPU
implements reduction algorithms that use these helper
functions to perform reductions across teams. Variables are
shared between CUDA threads using shuffle intrinsics.
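Sketching the idea with hypothetical names (ReduceData and
shuffleAndReduce are illustrative, not the identifiers this
patch emits), codegen for reduction(+:a,b) with double a and
float b produces roughly:

  // One field per reduction variable; threads exchange a pointer
  // to this struct because a CUDA thread's stack variables are
  // not visible to other threads.
  struct ReduceData {
    double a;
    float b;
  };

  // Compiler-emitted helper: fetch each field from the lane
  // 'Offset' positions away via a shuffle and fold it into the
  // local copy with the reduction operator (here '+').
  __device__ void shuffleAndReduce(ReduceData *Local, int Offset) {
    Local->a += __shfl_down_sync(0xffffffff, Local->a, Offset);
    Local->b += __shfl_down_sync(0xffffffff, Local->b, Offset);
  }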
An implementation of reductions on the NVPTX device is
substantially different from that on CPUs. However, this patch
is written so that there are minimal changes to the rest of
OpenMP codegen.
The implemented design allows the compiler and runtime to be
decoupled, i.e., the runtime does not need to know of the
reduction operation(s), the type of the reduction variable(s),
or the number of reductions. The design also allows reuse of
host codegen, with appropriate specialization for the NVPTX
device.
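One way to picture this decoupling (hypothetical signatures; the
runtime entry points added by the patch are named differently):

  #include <cstddef>
  #include <cstdint>

  // The runtime sees only an opaque pointer, its size, and two
  // compiler-emitted callbacks; the reduction operators, element
  // types, and variable count never cross this interface.
  typedef void (*ShuffleReduceFn)(void *ReduceData, int16_t LaneId,
                                  int16_t Offset);
  typedef void (*InterWarpCopyFn)(void *ReduceData, int32_t NumWarps);

  extern "C" int32_t gpu_reduce_nowait(void *ReduceData, size_t Size,
                                       ShuffleReduceFn ShflFn,
                                       InterWarpCopyFn CpyFn);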
While the patch does introduce a number of abstractions, the
expected use case calls for inlining of the GPU OpenMP runtime.
After inlining and optimizations in LLVM, these abstractions
are unwound and the performance of OpenMP reductions is
comparable to that of canonical CUDA code.
Patch by Tian Jin in collaboration with Arpith Jacob.
A cast to the NVPTX runtime is required in a static function, as
follows:

  CGOpenMPRuntimeNVPTX &RT = cast<CGOpenMPRuntimeNVPTX>(CGM.getOpenMPRuntime());