This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Enable the lowering of implicitly shared variables in OpenMP GPU-offloaded target regions to the GPU shared memory
AbandonedPublic

Authored by gtbercea on Oct 16 2017, 2:29 PM.

Details

Summary

This patch is part of the development effort to add support in the current OpenMP GPU offloading implementation for implicitly sharing variables between a target region executed by the team master thread and the worker threads within that team.

This patch is the second of three required for successfully performing the implicit sharing of master thread variables with the worker threads within a team:
-Patch D38976 extends the CLANG code generation with code that handles shared variables.
-Patch (coming soon) extends the functionality of libomptarget to maintain a list of references to shared variables.

This patch adds a shared memory stack to the prolog of the kernel function representing the device offloaded OpenMP target region. The new passes along with the changes to existing ones, ensure that any OpenMP variable which needs to be shared across several threads will be allocated in this new stack, in the shared memory of the device. This patch covers the case of sharing variables from the master thread to the worker threads:

#pragma omp target
{
   // master thread only
   int v;
   #pragma omp parallel
   {
      // worker threads
      // use v
   }
}

Diff Detail

Repository
rL LLVM

Event Timeline

gtbercea created this revision.Oct 16 2017, 2:29 PM
tra added a subscriber: tra.Oct 16 2017, 5:06 PM

Please add tests for the cases where such local->shaed conversion should and should not happen.
I would appreciate if you could add details on what exactly your passes are supposed to move to shared memory.

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX. I.e. can ptxas successfully optimize unused shared memory away?

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1749

Nit: the name should end with S as the L in SPL was for 'local' address space. which then gets converted to generic AS. In your case it will be in shared space, hence S would be more appropriate.

lib/Target/NVPTX/NVPTXAssignValidGlobalNames.cpp
68

The name cleanup changes in this file should probably be committed by themselves as they have nothing to do with the rest of the patch.

lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp
9

Please add details about what the pass is supposed to do.

gtbercea updated this revision to Diff 119327.Oct 17 2017, 8:27 AM

Eliminate variable and function name clean-up. That has been moved into a separate patch: D39005

gtbercea marked an inline comment as done.Oct 17 2017, 8:28 AM
gtbercea updated this revision to Diff 124243.Nov 24 2017, 3:22 PM

Add regression tests and allow for shared memory lowering to be disabled at function level.

gtbercea marked 2 inline comments as done.Nov 24 2017, 3:24 PM
Hahnfeld edited subscribers, added: llvm-commits; removed: cfe-commits.Nov 26 2017, 6:45 AM
yaxunl added a subscriber: yaxunl.Dec 6 2017, 5:14 PM

Here is a question, do we require that the alloca size to be compile time constant?

hfinkel added inline comments.Dec 11 2017, 7:53 PM
lib/Target/NVPTX/NVPTXAsmPrinter.cpp
1734

Line too long.

lib/Target/NVPTX/NVPTXFrameLowering.cpp
71

In other places in this patch you refer explicitly to OpenMP, so it probably makes sense to say "the OpenMP runtime" here as well (but just saying "the runtime" seems potentially confusing).

85

Line too long.

lib/Target/NVPTX/NVPTXLowerSharedFrameIndicesPass.cpp
13

Can you be more specific? I believe that we fixed PEI to handle virtual registers, so if that's the only motivation, can we use the regular PEI now?

lib/Target/NVPTX/NVPTXRegisterInfo.cpp
134

Line too long.

lib/Target/NVPTX/NVPTXUtilities.cpp
321

Can't you use PointerMayBeCaptured (include/llvm/Analysis/CaptureTracking.h) instead of this function? If so, please do.

gtbercea updated this revision to Diff 127492.Dec 19 2017, 4:22 AM

Use LLVM function for checking if pointer is stored.

gtbercea marked 5 inline comments as done.Dec 19 2017, 4:23 AM
gtbercea marked an inline comment as done.Dec 19 2017, 9:31 AM
tra added a comment.Jan 4 2018, 10:30 AM

Dotting the 'i's on the questions that were not replied to directly.

In D38978#899205, @tra wrote:

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX. I.e. can ptxas successfully optimize unused shared memory away?

This may have been addressed by the no-shared-depot.ll test. It would be nice to add few comments in the tests explaining what they do.

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

AFAICT this functionality only applies to functions with has-nvptx-shared-depot attribute. Works for me.

lib/Target/NVPTX/NVPTXFunctionDataSharing.cpp
99

Nit: return false would match the intent better.

lib/Target/NVPTX/NVPTXRegisterInfo.td
75

Line too long.

test/CodeGen/NVPTX/insert-shared-depot.ll
4–5 ↗(On Diff #127492)

You could put common checks under the same label (e.g. CHECK) and run tests with -check-prefixes=PTX32,CHECK

29 ↗(On Diff #127492)

'LABEL' is not a check-prefix and @linsert_shared_depot is not this function's name, so I'm puzzled what this line is supposed to do. Did you intend <prefix>-LABEL: @kernel ?

This appears in all the test cases in the patch.

gtbercea updated this revision to Diff 128725.Jan 5 2018, 2:54 AM
gtbercea marked 3 inline comments as done.

Address comments.

In D38978#967485, @tra wrote:

Dotting the 'i's on the questions that were not replied to directly.

In D38978#899205, @tra wrote:

Considering that device-side code tends to be heavily inlined, it may be prudent to add an option to control the total size of shared memory we allow to be used for this purpose.

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That is a scheme which will be covered by a future patch.

In case your passes are not executed (or didn't move anything to shared memory), is there any impact on the generated PTX. I.e. can ptxas successfully optimize unused shared memory away?

This may have been addressed by the no-shared-depot.ll test. It would be nice to add few comments in the tests explaining what they do.

Done.

If the code intentionally wants to allocate something in local memory, would the allocation ever be moved to shared memory by your pass? If so, how would I prevent that?

AFAICT this functionality only applies to functions with has-nvptx-shared-depot attribute. Works for me.

That's right.

test/CodeGen/NVPTX/insert-shared-depot.ll
29 ↗(On Diff #127492)

This is modeled after the lower-alloca.ll test which has a similar label. The label is always equal to the name of the test file. In this particular case there is a typo, it should be "insert_shared_depot" not "linsert_shared_depot"

tra added a comment.Jan 5 2018, 10:09 AM

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That is a scheme which will be covered by a future patch.

Good luck with that. IMO if your kernel requires all shared memory available per multiprocessor, you are almost guaranteed suboptimal performance because you will not have enough threads running -- neither for peak compute, nor to hide global memory access latency. My bet that you will eventually end up limiting shared memory use to a fairly small fraction of it.

Given that impact is limited to explicitly annotated functions only, this lack of tune-ability is OK with me for now. I'd add a TODO item somewhere to describe that tuning specific limits is WIP.

test/CodeGen/NVPTX/insert-shared-depot.ll
29 ↗(On Diff #127492)

This is modeled after the lower-alloca.ll test which has a similar label.

lower-alloca.ll indeed has the same problem.

The label is always equal to the name of the test file.

I don't think FileCheck has such a feature. Nor do I see anything matching this description in the FileCheck documentation. Nor does it work. See below.

In this particular case there is a typo, it should be "insert_shared_depot" not "linsert_shared_depot"

The line does not check *anything* right now. In this test FileCheck only pays attention to lines that have CHECK or PTX64/PTX32. This line contains neither and is ignored. You can do an experiment -- replace the line with ; LABEL: this should never match and run the test.

I've tried that on lower-alloca.ll and the test, as expected, passes regardless of the nonsense I put after the LABEL:.

In D38978#968565, @tra wrote:

I'm still curious to hear what do you plan to do when your depot use grows beyond certain limit. At the very least there's the physical limit on shared memory size. Shared memory use also affects how many threads can be launched which has large impact on performance. IMO having some sort of user-controllable threshold would be very desirable.

When shared memory isn't enough to hold the shared depot, global memory will be used instead. That is a scheme which will be covered by a future patch.

Good luck with that. IMO if your kernel requires all shared memory available per multiprocessor, you are almost guaranteed suboptimal performance because you will not have enough threads running -- neither for peak compute, nor to hide global memory access latency. My bet that you will eventually end up limiting shared memory use to a fairly small fraction of it.

I completely agree, this scheme will be efficient only when modest amounts of shared memory are required, for larger memory footprints, a global memory scheme will be used instead.

Given that impact is limited to explicitly annotated functions only, this lack of tune-ability is OK with me for now. I'd add a TODO item somewhere to describe that tuning specific limits is WIP.

I'll choose a sensible default for the cut-off point/condition and make it tune-able by the user once we have the global memory scheme in place.

gtbercea updated this revision to Diff 129093.EditedJan 9 2018, 8:29 AM

Remove LABEL from tests and add TODO comment for shared memory limit.

Not my area of expertise

gtbercea abandoned this revision.Jun 12 2019, 10:13 AM

Alternative solution was implemented.