This patch adds support for teams reduction into the CUDA plugin. The number of variables to be reduced as well as their size are passed from the compiler to the plugin via a struct of kernel computation properties (which also includes the execution mode). Before a kernel is launched, the plugin allocates space for the scratchpad to be used for the reduction. A pointer to the allocated scratchpad is passed as the last parameter to the kernel at launch.
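As a rough illustration of the data flow described above, the sketch below models a properties struct and the step that appends the scratchpad pointer as the final kernel parameter. All names (`KernelComputationProperties`, `scratchpadBytes`, `buildKernelArgs`) and the field layout are hypothetical, and the sizing rule (one copy of the reduction data per team) is an assumption, not something the patch description states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout of the properties the compiler hands to the plugin.
struct KernelComputationProperties {
  int8_t ExecutionMode;      // execution mode of the kernel
  int32_t NumReductionVars;  // number of variables to be reduced
  int32_t ReductionVarsSize; // combined size in bytes of those variables
};

// Assumed sizing rule: each team stores one partial result per variable.
inline size_t scratchpadBytes(const KernelComputationProperties &P,
                              int32_t NumTeams) {
  if (P.NumReductionVars <= 0 || P.ReductionVarsSize <= 0 || NumTeams <= 0)
    return 0; // nothing to reduce, so no scratchpad is needed
  return static_cast<size_t>(P.ReductionVarsSize) *
         static_cast<size_t>(NumTeams);
}

// Append the scratchpad argument as the last kernel parameter.
inline std::vector<void *> buildKernelArgs(std::vector<void *> Args,
                                           void *ScratchpadArg) {
  Args.push_back(ScratchpadArg);
  return Args;
}
```

In a real launch the argument array holds the address of each parameter, so the value appended would be the address of the device pointer variable rather than the device pointer itself.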
Diff Detail
- Repository
- rOMP OpenMP
Event Timeline
Some comments inline, mostly minor things.
libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
62–81 | Shouldn't you be explicitly assigning values to this enum? Currently, it's not obvious which values they will hold. (And I think the names should not all be upper case (except SPMD), but only the first character...) | |
496–498 | You should be using SPMD and >= None here. | |
638–641 | I think this shouldn't be in this patch? | |
705–708 | Maybe error out completely? |
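
To make the first two comments concrete, here is a minimal sketch of an enum with explicit values and the suggested capitalization, plus a range check written in terms of the enumerators. The type name `ExecutionModeType`, the specific values, and the helper `isValidExecutionMode` are assumptions for illustration; the check reflects one reading of "use SPMD and >= None", namely that a mode is rejected when it is below `SPMD` or at or above `None`.

```cpp
#include <cassert>
#include <cstdint>

// Explicit values make it obvious what the compiler is expected to emit.
enum ExecutionModeType : int8_t {
  SPMD = 0,    // acronym, kept upper case
  Generic = 1, // only the first character capitalized
  None = 2     // sentinel: not a mode a kernel can run in
};

// Valid modes lie in [SPMD, None); comparing against the enumerators
// avoids hard-coded numeric bounds in the check.
inline bool isValidExecutionMode(int8_t Mode) {
  return Mode >= SPMD && Mode < None;
}
```
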
libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
89–90 | Why do you need all that data before starting the outlined function? Can we allocate the memory during execution of the outlined function by some runtime function call? __omp_offloading.... <master> %Scratchpad = call i8 *__kmpc_allocate_scratchpad(<Size_of_the_reductions>); .... __kmpc_deallocate_scratchpad(i8 *%Scratchpad); <end_master> |
libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
89–90 | We can go down that route if you prefer. I haven't been able to find official documentation about which type of memory allocation is faster (cudaMalloc on the host vs malloc on the device), so I assume they perform equally fast. Any thoughts on that? | |
638–641 | Removed. | |
705–708 | Correct, if allocating the scratchpad fails the kernel cannot be executed, so we'll return OFFLOAD_FAIL. |
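
The failure path agreed on here could look roughly like the sketch below. The allocator is passed in as a callback so the example stays independent of the CUDA API; `DeviceAlloc` and `prepareScratchpad` are hypothetical names, and the two constants are local stand-ins for libomptarget's `OFFLOAD_SUCCESS` and `OFFLOAD_FAIL` with assumed values.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Local stand-ins for the libomptarget return codes (values assumed).
constexpr int OffloadSuccess = 0;
constexpr int OffloadFail = ~0;

// Hypothetical allocator: returns nullptr when the allocation fails.
using DeviceAlloc = std::function<void *(size_t)>;

// Allocate the scratchpad before launch; a failed allocation aborts the
// offload because the kernel cannot run without it.
inline int prepareScratchpad(size_t Bytes, const DeviceAlloc &Alloc,
                             void **Out) {
  assert(Out != nullptr);
  *Out = nullptr;
  if (Bytes == 0)
    return OffloadSuccess; // no reduction, nothing to allocate
  void *Ptr = Alloc(Bytes);
  if (Ptr == nullptr)
    return OffloadFail; // error out completely instead of launching
  *Out = Ptr;
  return OffloadSuccess;
}
```
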
One caveat regarding Alexey's proposal: According to the CUDA programming guide, malloc on the device allocates space from a fixed-size heap. The default size of this heap is 8MB. If we run into a scenario where more than 8MB will be required for the reduction scratchpad, allocating the scratchpad from the device will fail. The heap size can be user-defined from the host, but for that to happen the host must know how large the scratchpad needs to be, which defeats the purpose of moving scratchpad allocation from the plugin to the nvptx runtime.
But you can change the limit using cudaThreadSetLimit
libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
89–90 | I'd prefer this solution rather than the original one. |
That's what I'm saying. You can increase the limit, but how large will you set it? How will you know how many bytes are needed for the scratchpad if the compiler doesn't provide this information?
We are already using global memory allocation, so I don't see any reason why we can't use it for the scratchpad. We just need to set an initial amount that is big enough and, probably, add an option that allows increasing this size.
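A minimal sketch of the "initial amount plus an override option" idea follows. The function names are hypothetical, and where the option string comes from (for example an environment variable read with `std::getenv`) is an assumption; the 8 MB default is the figure quoted from the CUDA programming guide earlier in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Default device malloc heap size quoted in the discussion (8 MB).
constexpr size_t DefaultDeviceHeapBytes = 8u * 1024u * 1024u;

// Parse a user-supplied heap size in bytes. Falls back to the default
// when the option is unset, empty, non-numeric, or zero.
inline size_t heapBytesFromOption(const char *Value,
                                  size_t Fallback = DefaultDeviceHeapBytes) {
  if (Value == nullptr || Value[0] < '0' || Value[0] > '9')
    return Fallback;
  char *End = nullptr;
  unsigned long long Parsed = std::strtoull(Value, &End, 10);
  if (*End != '\0' || Parsed == 0)
    return Fallback; // trailing garbage or a zero size
  return static_cast<size_t>(Parsed);
}

// True when a scratchpad of the given size fits in the configured heap.
inline bool fitsInHeap(size_t ScratchpadBytes, size_t HeapBytes) {
  return ScratchpadBytes <= HeapBytes;
}
```

The resulting value would then be handed to the heap-limit call on the host before any kernel that uses device-side malloc is launched. In current CUDA releases `cudaThreadSetLimit` is deprecated in favor of `cudaDeviceSetLimit` with `cudaLimitMallocHeapSize`.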