This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Teams reduction on the NVPTX device.
Abandoned, Public

Authored by arpith-jacob on Feb 3 2017, 11:57 AM.

Details

Summary

This patch implements codegen for the reduction clause on
any teams construct for elementary data types. An efficient
implementation requires hierarchical reduction within a
warp, a threadblock, and across threadblocks. It is
complicated by the fact that variables declared on the stack
of a CUDA thread cannot be shared with other threads.

The patch creates a struct to hold reduction variables and
a number of helper functions. The OpenMP runtime on the GPU
implements reduction algorithms that use these helper
functions to perform reductions across teams. Variables are
shared between CUDA threads using shuffle intrinsics.
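
As an illustration of the shuffle-based sharing described above, here is a minimal host-side sketch of a warp-level tree reduction. It is not code from the patch: the names `Warp`, `shuffleDown`, and `warpReduceSum` are hypothetical, the warp size of 32 is assumed, and an array stands in for the per-lane registers that a device shuffle intrinsic would exchange.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of one warp: each array slot stands for the
// private register of one lane. Device code would use a shuffle intrinsic
// instead of indexing into a shared array.
constexpr std::size_t kWarpSize = 32; // assumed warp size
using Warp = std::array<int, kWarpSize>;

// Models a "shuffle down": lane i reads the value held by lane i + offset.
// Lanes whose source would fall outside the warp keep their own value.
inline Warp shuffleDown(const Warp &vals, std::size_t offset) {
  assert(offset < kWarpSize);
  Warp out = vals;
  for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane)
    out[lane] = vals[lane + offset];
  return out;
}

// Tree reduction: the offset halves each step, so after log2(kWarpSize)
// steps lane 0 holds the sum of all lanes. Other lanes hold partial values
// that are simply ignored.
inline int warpReduceSum(Warp vals) {
  for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
    Warp remote = shuffleDown(vals, offset);
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
      vals[lane] += remote[lane];
  }
  return vals[0];
}
```

In this model no lane ever reads another lane's stack. Values move only through the shuffle step, which is the property the summary relies on.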

An implementation of reductions on the NVPTX device is
substantially different from that on CPUs. However, this patch
is written so that there are minimal changes to the rest of
OpenMP codegen.

The implemented design allows the compiler and runtime to be
decoupled, i.e., the runtime does not need to know of the
reduction operation(s), the type of the reduction variable(s),
or the number of reductions. The design also allows reuse of
host codegen, with appropriate specialization for the NVPTX
device.
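
The decoupling described above can be sketched roughly as follows. This is an assumption-laden illustration, not the patch's actual interface: `ReduceFn`, `runtimeReduce`, `ReduceData`, and `combineReduceData` are hypothetical names, and the real runtime entry points and generated helpers may look quite different. The point is only that the runtime side handles opaque pointers and a callback, while the compiler-generated side knows the types and operators.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "runtime" side: it sees only opaque pointers and a callback,
// never the element types, the operators, or the number of reductions.
using ReduceFn = void (*)(void *lhs, const void *rhs);

// Folds every per-thread reduction record into records[0].
inline void runtimeReduce(const std::vector<void *> &records,
                          ReduceFn combine) {
  assert(!records.empty() && combine != nullptr);
  for (std::size_t i = 1; i < records.size(); ++i)
    combine(records[0], records[i]);
}

// Hypothetical "compiler-generated" side: one struct holding all reduction
// variables of the construct, plus a helper that combines two such structs.
struct ReduceData {
  int sum;       // models reduction(+ : sum)
  double maxVal; // models reduction(max : maxVal)
};

inline void combineReduceData(void *lhs, const void *rhs) {
  auto *l = static_cast<ReduceData *>(lhs);
  const auto *r = static_cast<const ReduceData *>(rhs);
  l->sum += r->sum;
  if (r->maxVal > l->maxVal)
    l->maxVal = r->maxVal;
}
```

Adding a third reduction variable in this sketch changes only `ReduceData` and `combineReduceData`, and `runtimeReduce` stays untouched. Once such a callback is inlined, the indirection can in principle be optimized away.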

While the patch does introduce a number of abstractions, the
expected use case calls for inlining of the GPU OpenMP runtime.
After inlining and optimizations in LLVM, these abstractions
are unwound and performance of OpenMP reductions is comparable
to CUDA-canonical code.

Patch by Tian Jin in collaboration with Arpith Jacob

Event Timeline

arpith-jacob created this revision. Feb 3 2017, 11:57 AM
arpith-jacob added inline comments. Feb 3 2017, 12:00 PM
lib/CodeGen/CGOpenMPRuntime.h
524

This is required to cast to the NVPTX runtime in a static function, as follows:

CGOpenMPRuntimeNVPTX &RT = cast<CGOpenMPRuntimeNVPTX>(CGM.getOpenMPRuntime());

975

NVPTX requires this parameter because the reduction implementations for 'parallel', 'simd', and 'teams' are different. This is required for correctness and performance.

ABataev edited edge metadata. Feb 7 2017, 11:00 AM

The patch is too big and quite hard to review. Could you split it into several smaller parts?

lib/CodeGen/CGOpenMPRuntime.cpp
4280–4282

That is not quite so: at the least, we have a reference in the list of reductions in the reduction clause, which may be considered a debug position.

lib/CodeGen/CGOpenMPRuntime.h
524

Are you going to make calls to isa(), dyn_cast() functions? If not, just use static_cast<>() instead.

lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp
972–973

I'd prefer you to use CGF.EmitScalarConversion() rather than Builder casts.

975–978

Do you really need a SmallVector<>, or can you just use an array here?

1140–1141

Enclose in braces

1153–1156

Braces

lib/CodeGen/CGStmtOpenMP.cpp
2249

OMPD_parallel or OMPD_parallel_for?

2424

OMPD_parallel_for?

arpith-jacob marked 2 inline comments as done. Feb 9 2017, 6:42 AM

The patch is too big and quite hard to review. Could you split it into several smaller parts?

Alexey, thank you for your time. I have addressed your comments and split out a smaller patch, which I will post shortly.

The new patch is to implement parallel reductions on the GPU. This is the smallest patch I can get such that the codegen is correct and fully functional (runs correctly on the GPU).

Thanks,
Arpith

lib/CodeGen/CGOpenMPRuntime.cpp
4280–4282

Ok, removed.

lib/CodeGen/CGOpenMPRuntime.h
524

Yes, I will use a static_cast and remove the OpenMPRuntimeKind functionality.

lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp
972–973

Ok. I have used CGF.EmitScalarConversion() when I can, for example for WarpSize. In other cases I want simple bitcasts and truncs so I have used the Builder.

975–978

Used an array here and other places.

lib/CodeGen/CGStmtOpenMP.cpp
2249

OMPD_parallel is fine. I just need to know that it is a 'parallel' type reduction.

2424

This is the reduction codegen for the 'sections' directive. So a reduction type of 'OMPD_parallel' works well.

arpith-jacob abandoned this revision. Feb 12 2017, 4:27 PM