Page MenuHomePhabricator

[OpenMP] Add support for dynamic shared memory in new RTL
ClosedPublic

Authored by jhuber6 on Sep 17 2021, 2:40 PM.

Details

Summary

This patch adds support for using dynamic shared memory in the new
device runtime. The new function __kmpc_get_dynamic_shared will return a
pointer to the buffer of dynamic shared memory. Currently the amount of memory
allocated is set by an environment variable.

In the future this amount will be added to the amount used for the smart stack
which will be configured in a similar way.

Diff Detail

Event Timeline

jhuber6 created this revision.Sep 17 2021, 2:40 PM
jhuber6 requested review of this revision.Sep 17 2021, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2021, 2:40 PM
tianshilei1992 accepted this revision.Sep 17 2021, 4:34 PM

LGTM with some nits.

openmp/libomptarget/DeviceRTL/include/Configuration.h
35

getDynamicMemorySize? I just feel since other words like "dynamic" or "size" we all use the full version, we could also use "memory" here?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
547

LIBOMPTARGET_SHARED_MEMORY_SIZE?

This revision is now accepted and ready to land.Sep 17 2021, 4:34 PM
jhuber6 updated this revision to Diff 373373.Sep 17 2021, 4:49 PM

Addressing nits and adding some documentation.

This revision was landed with ongoing or failed builds.Sep 17 2021, 6:26 PM
This revision was automatically updated to reflect the committed changes.
JonChesterfield added inline comments.Oct 4 2021, 6:29 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

This struct is also defined in deviceRTLs/common/device_environment.h, which doesn't seem to have the extra field present. Any objection to moving that header up a couple of levels then including it from the various plugins and device runtimes?

uint64_t seems overly ambitious for shared size. The amdgpu hardware is limited to ~ 64k, can cuda go over 32 bits for this?

jhuber6 added inline comments.Oct 4 2021, 6:33 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

It would probably be a good idea to keep the device environment stuff consistent. Currently there might be some difference between the new and old runtime so we should make sure that they all share the same size at least.

And you're probably right, the argument in cuLaunchKernel that it's passed to only uses an unsigned int which is most likely 32 bits. Not likely someone is going to allocate a few billion bytes of shared memory.

JonChesterfield added inline comments.Oct 4 2021, 6:43 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

Cool. I've wanted this for a while (rocm and trunk haven't always matched either) so will put up a patch

JonChesterfield added inline comments.Oct 4 2021, 8:20 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

The dynamic_shared_size field is currently unused (and not defined in the devicertl), but the cuda plugin raises an error when the size of the global in the devicertl is different to the size declared in this file. I don't understand how that is working at present.

JonChesterfield added inline comments.Oct 4 2021, 8:22 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

Specifically, this patch missed the deviceRTL definition:
libomptarget/deviceRTLs/common/device_environment.h
struct omptarget_device_environmentTy {

int32_t debug_level;
uint32_t num_devices;
uint32_t device_num;

};

tianshilei1992 added inline comments.Oct 4 2021, 8:47 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

This patch only adds the support in the new device runtime, as described in the title.

JonChesterfield added inline comments.Oct 4 2021, 8:59 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
102

Right, but both old and new device runtime use the same host plugin, and the host plugin raises an error if this struct has the wrong size, and the old device runtime has one without the extra uint64_t. Just put a comment by the size check further down this file

930

size mismatch here with old runtime, as far as I can tell