This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] [CUDA plugin] Add support for teams reduction via scratchpad
Abandoned, Public

Authored by grokos on Apr 5 2018, 8:21 AM.

Details

Reviewers
Hahnfeld
ABataev
Summary

This patch adds support for teams reduction to the CUDA plugin. The number of variables to be reduced and their size are passed from the compiler to the plugin via a struct of kernel computation properties (which also includes the execution mode). Before a kernel is launched, the plugin allocates space for the scratchpad to be used for the reduction. A pointer to the allocated scratchpad is passed to the kernel as its last parameter at launch.
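
As a rough illustration of the flow described in the summary, the sketch below models a properties struct, a scratchpad size computation, and the step that appends the scratchpad pointer as the last kernel parameter. The struct layout, the field names, the helper names and the one-record-per-team sizing rule are all assumptions made for this example; the diff itself is not reproduced in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: the field names in the actual patch are not shown here.
struct KernelComputationProperties {
  int8_t ExecutionMode;     // execution mode of the kernel
  int32_t NumReductionVars; // number of variables to be reduced
  int64_t ReductionSize;    // assumed: total size in bytes of those variables
};

// Assumption: the scratchpad holds one reduction record per team.
inline std::size_t scratchpadBytes(const KernelComputationProperties &P,
                                   int NumTeams) {
  if (P.NumReductionVars <= 0 || P.ReductionSize <= 0 || NumTeams <= 0)
    return 0; // nothing to reduce, no scratchpad needed
  return static_cast<std::size_t>(P.ReductionSize) *
         static_cast<std::size_t>(NumTeams);
}

// Appends the scratchpad as the last kernel parameter. CUDA's launch call
// takes an array of pointers to the argument values, so the entry stored is
// the address of the variable that holds the scratchpad pointer.
inline std::vector<void *> buildKernelArgs(std::vector<void *> UserArgs,
                                           void **ScratchpadSlot) {
  UserArgs.push_back(static_cast<void *>(ScratchpadSlot));
  return UserArgs;
}
```

With these assumptions, two reduction variables totalling 16 bytes and 4 teams would need a 64-byte scratchpad.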

Diff Detail

Repository
rOMP OpenMP

Event Timeline

grokos created this revision. Apr 5 2018, 8:21 AM

Some comments inline, mostly minor things.

libomptarget/plugins/cuda/src/rtl.cpp
64–67

Shouldn't you be explicitly assigning values to this enum? Currently, it's not obvious which values they will hold.

(Also, I think the names should not be all upper case, except SPMD; only the first character should be capitalized.)

496

You should be using SPMD and >= None here.
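
One possible reading of the two comments above, as a sketch: the enumerators get explicit values and the reviewer's suggested casing, and the range check uses the enumerator names instead of literals. The enum name, the numeric values and the helper function are assumptions for illustration, not the code under review.

```cpp
#include <cstdint>

// Explicit values make it obvious what each enumerator holds.
// The concrete numbers are assumed for this sketch.
enum ExecutionModeType : int8_t {
  SPMD = 0,
  Generic = 1,
  None = 2
};

// Hypothetical helper: a raw mode value is rejected when it is
// below SPMD or at or above None.
inline bool isValidExecutionMode(int8_t Raw) {
  return !(Raw < SPMD || Raw >= None);
}
```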

638–641

I think this shouldn't be in this patch?

709–712

Maybe error out completely?

ABataev added inline comments. Apr 5 2018, 9:07 AM
libomptarget/plugins/cuda/src/rtl.cpp
75–76

Why do you need all that data before starting the outlined function? Can we allocate the memory during execution of the outlined function by some runtime function call?
Like this:

__omp_offloading....
<master>
  %Scratchpad = call i8* @__kmpc_allocate_scratchpad(<Size_of_the_reductions>)
  ....
  call void @__kmpc_deallocate_scratchpad(i8* %Scratchpad)
<end_master>
grokos updated this revision to Diff 141184. Apr 5 2018, 11:05 AM
grokos marked 3 inline comments as done.
grokos added inline comments.
libomptarget/plugins/cuda/src/rtl.cpp
75–76

We can go down that route if you prefer. I haven't been able to find official documentation about which type of memory allocation is faster (cudaMalloc on the host vs. malloc on the device), so I assume they perform equally fast.

Any thoughts on that?

638–641

Removed.

709–712

Correct, if allocating the scratchpad fails the kernel cannot be executed, so we'll return OFFLOAD_FAIL.
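
The agreed behaviour can be sketched as follows. The allocator is passed in as a callback so the example stays independent of the CUDA driver API; that callback type, the function name and the two constants (which mirror libomptarget's OFFLOAD_SUCCESS and OFFLOAD_FAIL values) are stand-ins defined for this sketch only.

```cpp
#include <cstddef>
#include <functional>

// Local stand-ins mirroring libomptarget's return codes.
constexpr int OffloadSuccess = 0;
constexpr int OffloadFail = ~0;

// Hypothetical abstraction over the device allocation call: returns true on
// success and writes the device pointer to *Out.
using DeviceAlloc = std::function<bool(std::size_t Bytes, void **Out)>;

inline int prepareScratchpad(std::size_t Bytes, const DeviceAlloc &Alloc,
                             void **Scratchpad) {
  *Scratchpad = nullptr;
  if (Bytes == 0)
    return OffloadSuccess; // kernel has no teams reduction
  if (!Alloc(Bytes, Scratchpad) || *Scratchpad == nullptr)
    return OffloadFail; // the kernel cannot run without its scratchpad
  return OffloadSuccess;
}
```

The caller would skip the kernel launch and propagate the failure whenever this returns the failure code.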

One caveat regarding Alexey's proposal: According to the CUDA programming guide, malloc on the device allocates space from a fixed-size heap. The default size of this heap is 8MB. If we run into a scenario where more than 8MB will be required for the reduction scratchpad, allocating the scratchpad from the device will fail. The heap size can be user-defined from the host, but for that to happen the host must know how large the scratchpad needs to be, which defeats the purpose of moving scratchpad allocation from the plugin to the nvptx runtime.

But you can change the limit using cudaThreadSetLimit

libomptarget/plugins/cuda/src/rtl.cpp
75–76

I'd prefer this solution rather than the original one.

That's what I'm saying. You can increase the limit, but how large will you set it? How will you know how many bytes are needed for the scratchpad if the compiler doesn't provide this information?

We are already using global memory allocation, so I don't see any reason why we can't use it for the scratchpad. We just need to set an initial amount that is big enough and, probably, add an option that allows increasing this size.
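
A minimal sketch of that idea, assuming the option arrives as a decimal byte count in a string (for example from an environment variable; no such option is named in this thread). The function name and the policy of only ever growing the heap are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

// Default device malloc heap size cited from the CUDA programming guide above.
constexpr std::size_t DefaultDeviceHeapBytes = 8u * 1024u * 1024u;

// Text is the value of a hypothetical user option, in bytes; it may be null.
// Malformed or smaller values fall back to the initial amount.
inline std::size_t
chooseDeviceHeapBytes(const char *Text,
                      std::size_t InitialBytes = DefaultDeviceHeapBytes) {
  if (Text == nullptr || *Text < '0' || *Text > '9')
    return InitialBytes; // unset, empty, signed, or not a number
  char *End = nullptr;
  unsigned long long Parsed = std::strtoull(Text, &End, 10);
  if (*End != '\0')
    return InitialBytes; // trailing characters such as a unit suffix
  // Only grow the heap, never shrink it below the initial amount.
  return Parsed > InitialBytes ? static_cast<std::size_t>(Parsed)
                               : InitialBytes;
}
```

The chosen value would then be applied on the host before any kernel that calls device malloc runs, for instance through cudaDeviceSetLimit with cudaLimitMallocHeapSize, the current replacement for the deprecated cudaThreadSetLimit.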

I think reductions are already implemented differently, can we close this?

grokos abandoned this revision. Jul 9 2019, 3:43 PM

Right, this patch is now obsolete.