This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Simplify GPU memory globalization
ClosedPublic

Authored by jhuber6 on Mar 1 2021, 5:54 AM.

Details

Summary

Memory globalization is required to maintain OpenMP standard semantics for data sharing between
worker and master threads. The GPU cannot share data between its threads so must allocate global or
shared memory to store the data in. Currently this is implemented fully in the frontend using the
__kmpc_data_sharing_push_stack and __kmpc_data_sharing_pop_stack functions to emulate standard
CPU stack sharing. The front-end scans the target region for variables that escape the region and
must be shared between the threads. Each variable then has a field created for it in a global record
type.

This patch replaces this functionality with a single allocation command, effectively mimicking an
alloca instruction for the variables that must be shared between the threads. This will be much
slower than the current solution, but makes it much easier to optimize as we can analyze each
variable independently and determine if it is not captured. In the future, we can replace these
calls with an alloca and small allocations can be pushed to shared memory.

This patch is based on D90670.

Depends on D102532

Diff Detail

Event Timeline

jhuber6 created this revision.Mar 1 2021, 5:54 AM
jhuber6 requested review of this revision.Mar 1 2021, 5:54 AM

Fixing tests is WIP

jhuber6 edited the summary of this revision. (Show Details)Mar 1 2021, 5:55 AM
jdoerfert added inline comments.Mar 2 2021, 8:13 AM
openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu
34

Add a TODO:

  1. Use a small shared buffer
  2. emit a user note that results in a INFO message once we have the communication capability.
jhuber6 edited the summary of this revision. (Show Details)Mar 2 2021, 4:41 PM
jhuber6 updated this revision to Diff 330352.Mar 12 2021, 1:21 PM

Changed the RTL to have an argument that indicates if there is only one active caller for a team. This makes it easier to optimize.

jhuber6 updated this revision to Diff 331374.Mar 17 2021, 2:19 PM

Fixing tests and changing function interface back.

jhuber6 updated this revision to Diff 331565.Mar 18 2021, 7:58 AM

Fixing test and formatting

josemonsalve2 added inline comments.Mar 19 2021, 10:10 AM
clang/test/OpenMP/nvptx_parallel_codegen.cpp
4 ↗(On Diff #331565)

Is this flag -fopenmp-cuda-parallel-target-regions useful after this change? I know it was used to determine something in globalization, and I believe this was removed. But is it used for anything else somewhere else or could it be removed?

jhuber6 updated this revision to Diff 331961.Mar 19 2021, 11:44 AM

Remove command line argument and more unused runtime functions from clang.

jhuber6 updated this revision to Diff 343469.May 6 2021, 12:05 PM

Rebasing. The previous change to the test files required massive changes that made the file size too big to upload, so context is limited.

Is it better to split the patch? For example, the new alloc/free shared can be in a separate patch, and the globalization can be in another one.

jhuber6 updated this revision to Diff 344538.May 11 2021, 12:37 PM

Fixing tests again.

jhuber6 updated this revision to Diff 345557.May 14 2021, 2:05 PM
jhuber6 edited the summary of this revision. (Show Details)

Splitting patches into D97680 and D102532.

Generally LGTM. Some nits.

clang/include/clang/Driver/Options.td
2331 ↗(On Diff #345557)

Can we make these removal of options in another patch if it doesn't affect remaining functionality?

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1731

unrelated

3885

unrelated change

3907

Not sure that's the reason you need the explicit constructor. If yes, can we get around it somehow?

clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
442

Constructor is not needed here I suppose?

jhuber6 updated this revision to Diff 347097.May 21 2021, 12:11 PM

Fixing nits and splitting command line removal to a new patch.

This revision is now accepted and ready to land.May 22 2021, 6:12 PM
This revision was automatically updated to reflect the committed changes.