[OpenMP] Simplify GPU memory globalization
Needs Review, Public

Authored by jhuber6 on Mar 1 2021, 5:54 AM.

Details

Reviewers
jdoerfert
Summary

Memory globalization is required to maintain OpenMP standard semantics for data sharing between
worker and master threads. A GPU thread's local memory is not visible to other threads, so the data
must be stored in global or shared memory instead. Currently this is implemented entirely in the
frontend, which uses the __kmpc_data_sharing_push_stack and __kmpc_data_sharing_pop_stack functions
to emulate standard CPU stack sharing. The frontend scans the target region for variables that
escape the region and must be shared between the threads. Each such variable then has a field
created for it in a global record type.
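
As a rough host-side illustration of the record-based scheme described above, the sketch below
packs two escaping variables into one record and obtains storage for it from a push/pop stack. The
names sketch::push_stack, sketch::pop_stack, GlobalizedRecord and region_with_record are
hypothetical stand-ins, not the actual runtime interface or the code clang emits.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the runtime's data-sharing stack. The real
// __kmpc_data_sharing_push_stack / __kmpc_data_sharing_pop_stack functions
// are NOT reproduced here; this only mimics the push/pop discipline.
namespace sketch {
alignas(std::max_align_t) static unsigned char Stack[1024];
static std::size_t Top = 0;

void *push_stack(std::size_t Size) {
  // Round the size up so that every frame stays suitably aligned.
  constexpr std::size_t Align = alignof(std::max_align_t);
  Size = (Size + Align - 1) / Align * Align;
  if (Top + Size > sizeof(Stack))
    return nullptr;
  void *Frame = Stack + Top;
  Top += Size;
  return Frame;
}

void pop_stack(void *Frame) {
  // Frames are released in LIFO order, like a CPU stack.
  Top = static_cast<std::size_t>(static_cast<unsigned char *>(Frame) - Stack);
}
} // namespace sketch

// One record per target region, with one field per escaping variable.
struct GlobalizedRecord {
  int X;
  double Y;
};

// Emulates a region in which both X and Y must be visible to other threads.
double region_with_record() {
  auto *Rec = static_cast<GlobalizedRecord *>(
      sketch::push_stack(sizeof(GlobalizedRecord)));
  assert(Rec && "sketch stack exhausted");
  Rec->X = 2;
  Rec->Y = 1.5;
  double Result = Rec->X * Rec->Y; // other threads would read through Rec
  sketch::pop_stack(Rec);
  return Result;
}
```

Because all variables live in one record, deciding that a single variable does not need to be
shared means reshaping the whole record, which is part of what makes this form hard to optimize.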

This patch replaces this functionality with a single allocation call per variable, effectively
mimicking an alloca instruction for each variable that must be shared between the threads. This
will be much slower than the current solution, but it is much easier to optimize because each
variable can be analyzed independently to determine whether it is actually captured. In the future,
we can replace these calls with an alloca, and small allocations can be pushed to shared memory.
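
A minimal host-side sketch of the per-variable idea follows, assuming a pair of allocation and
deallocation entry points. The names sketch::alloc_shared, sketch::free_shared and
region_per_variable are hypothetical and chosen for illustration only; malloc stands in for
whatever global or shared memory the device runtime would actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical per-variable allocation interface. The names and the use of
// malloc/free are assumptions made for this sketch, not the runtime's API.
namespace sketch {
static int LiveAllocations = 0;

void *alloc_shared(std::size_t Size) {
  ++LiveAllocations;
  return std::malloc(Size);
}

void free_shared(void *Ptr) {
  --LiveAllocations;
  std::free(Ptr);
}
} // namespace sketch

// Emulates a region in which X and Y may be shared with other threads.
double region_per_variable() {
  // Each variable gets its own allocation, much like a separate alloca.
  auto *X = static_cast<int *>(sketch::alloc_shared(sizeof(int)));
  auto *Y = static_cast<double *>(sketch::alloc_shared(sizeof(double)));
  assert(X && Y && "allocation failed");
  *X = 2;
  *Y = 1.5;
  double Result = *X * *Y;
  // Each allocation is released independently of the other.
  sketch::free_shared(Y);
  sketch::free_shared(X);
  return Result;
}
```

Since every variable has its own allocation and release, an optimizer that proves one variable is
never captured could rewrite just that pair of calls into a plain local, leaving the others
untouched.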

This patch is based on D90670.

Event Timeline

jhuber6 created this revision. Mar 1 2021, 5:54 AM
jhuber6 requested review of this revision. Mar 1 2021, 5:54 AM

Fixing tests is WIP

jhuber6 edited the summary of this revision. Mar 1 2021, 5:55 AM
jdoerfert added inline comments. Mar 2 2021, 8:13 AM
openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu, line 34

Add a TODO:

  1. Use a small shared buffer.
  2. Emit a user note that results in an INFO message once we have the communication capability.
jhuber6 edited the summary of this revision. Mar 2 2021, 4:41 PM
jhuber6 updated this revision to Diff 330352. Mar 12 2021, 1:21 PM

Changed the RTL interface to take an argument that indicates whether there is only one active caller for a team. This makes it easier to optimize.

jhuber6 updated this revision to Diff 331374. Mar 17 2021, 2:19 PM

Fixing tests and changing function interface back.

jhuber6 updated this revision to Diff 331565. Thu, Mar 18, 7:58 AM

Fixing test and formatting

josemonsalve2 added inline comments. Fri, Mar 19, 10:10 AM
clang/test/OpenMP/nvptx_parallel_codegen.cpp, line 4

Is the flag -fopenmp-cuda-parallel-target-regions still useful after this change? I know it was used to determine something in globalization, and I believe that use was removed. Is it used anywhere else, or could it be removed?

jhuber6 updated this revision to Diff 331961. Fri, Mar 19, 11:44 AM

Remove command line argument and more unused runtime functions from clang.