Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Avoid migrating existing patches. Phabricator shutdown timeline

[OpenMP] Simplify offloading parallel call codegen

Authored by ggeorgakoudis on Feb 3 2021, 1:51 PM.



This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single kmpc_parallel_51 runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.

Diff Detail

Event Timeline

ggeorgakoudis created this revision.Feb 3 2021, 1:51 PM
ggeorgakoudis requested review of this revision.Feb 3 2021, 1:51 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald Transcript
ggeorgakoudis edited the summary of this revision. (Show Details)Feb 3 2021, 1:58 PM

Fix type for IfCond, formatting

Add tests, update OpenMPOpt, rebase to main

ggeorgakoudis edited the summary of this revision. (Show Details)Apr 13 2021, 7:27 AM

Add aux-triple to one test, check unit test builder on windows

Fix llvm test

Hi @Meinersbur (got word you are a windows user), @jdoerfert, could I ask your help in detecting why the clang tests on windows are failing? There are two failures I'm spotting, one is that calls to llvm.nvvm intrinsics seem transposed ( and another that attribute regexes are not recognized ( at nvptx_target_codegen.cpp:723:17). Maybe there is something else I'm missing and I'd appreciate the extra eyeballing on the problem.

I have only minor remarks but I'd like you to check if my hunch is correct and the proposed modifications will fix fix PR49777 *and* fix PR49779.
Also, the number of arguments need to be increased, let's go big and automatic here.

Other than that I think this looks good.


Can we remove SeqGen while we are here please. We need to check in the runtime anyway. That check is later folded, no need to make things more complicated here.


This should allow us to remove the SeqGen in the Clang CodeGen *and* fix PR49777 *and* fix PR49779, a win-win-win situation.


FWIW, The implementation here is a stopgap until we move to the new runtime. The codegen and interface are the important parts.


Not a return but a __builtin_trap(), please.
We also need this for more than 16 unfortunately, I've seen 20 in miniqmc.
We might want to create a script to print the cases, and then generate 128 or something like that in a file we include. The script can be in the utils folder too.

Meinersbur added a comment.EditedApr 15 2021, 9:20 PM

The transposition problem arises from:

static llvm::Value *getThreadLimit(CodeGenFunction &CGF,
                                   bool IsInSPMDExecutionMode = false) {
  CGBuilderTy &Bld = CGF.Builder;
  auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
  return IsInSPMDExecutionMode
             ? RT.getGPUNumThreads(CGF)
             : Bld.CreateNUWSub(RT.getGPUNumThreads(CGF),
                                RT.getGPUWarpSize(CGF), "thread_limit");

The order in which getGPUNumThreads(), getGPUNumThreads(), getGPUWarpSize() is called is undefined, only has to have happened at a sequence point. The idea is that it would depend on the order in which the function arguments are put on the stack.

Turns out, clang/gcc evaluate the left argument first, msvc starts with the right one.

ggeorgakoudis updated this revision to Diff 338102.EditedApr 16 2021, 7:12 AM

Fix for getThreadLimit

Meinersbur requested changes to this revision.Apr 16 2021, 11:35 AM
Meinersbur added inline comments.

getGPUNumThreads and getGPUWarpSize still have undefined call order.

This revision now requires changes to proceed.Apr 16 2021, 11:35 AM

Update for comments, fix for windows fix

ggeorgakoudis marked 4 inline comments as done.Apr 16 2021, 2:45 PM
ggeorgakoudis added inline comments.



Please check

jdoerfert accepted this revision.Apr 16 2021, 3:41 PM

With the nit to add the two reproducers, LGTM. (please make sure to run FAROS or some benchmarks we have before commiting).


Check? Can we add the two reproducers as tests, please. One should be a clang test, the other maybe a runtime test, though clang test might suffice.

31 ↗(On Diff #338246)

Great. The output is not pretty but that was not the objective ;)

Meinersbur added a comment.EditedApr 16 2021, 3:47 PM

I have not looked at the other mentioned problem yet:

another that attribute regexes are not recognized ( at nvptx_target_codegen.cpp:723:17)

Which might still be there.

I would like to wait for Harbormaster to complete the pre-merge check.

Meinersbur added inline comments.Apr 16 2021, 7:41 PM

There seem to be more unordered codegen calls, such as this one.


There seem to be more unordered codegen calls, such as this one.

ggeorgakoudis marked 2 inline comments as done.

Update for comments, fixes

ggeorgakoudis marked 4 inline comments as done.Apr 19 2021, 12:55 AM
ggeorgakoudis added inline comments.

Some previous emitted values can be re-used, e.g., GPUThreadID in line 1150 can re-use the value from line 1140 , instead of re-emitted. I've kept emitting them as it was previously done. What is the preferred way to handle those?


Ack, will do

Meinersbur added inline comments.Apr 19 2021, 8:58 AM

This is another undefined codegen call order which causes the current pre-merge checks to fail.

ggeorgakoudis marked 2 inline comments as done.


Add tests, reduce microtask cases to avoid stack problems

ggeorgakoudis marked an inline comment as done.Apr 21 2021, 9:07 AM
Meinersbur accepted this revision.Apr 21 2021, 10:54 AM

This test seem to pass on Windows now. Please still fix the clang-format remarks, such as going over 80 characters on a line.


getGPUThreadID/getMasterThreadID could cache the value if used multiple times., but it would also require to put them into the entry block to be available anywhere in the function.

Otherwise, use a best-effort to minimize overhead even if the optimizer cannot unify them or in debug builds.

This revision is now accepted and ready to land.Apr 21 2021, 10:54 AM

Fix clang-format

This revision was landed with ongoing or failed builds.Apr 21 2021, 6:46 PM
This revision was automatically updated to reflect the committed changes.

Please update the test with a NFC commit.

1–5 ↗(On Diff #339441)
29–36 ↗(On Diff #339441)

Since the output goes to Filecheck anyways, I think we should avoid asserts, but let Filecheck test for expected results.
The output for failing tests has more information with this approach.

Please update the test with a NFC commit.

Thanks, @protze.joachim. The changes look good. I'll get that NFC commit in soon-ish, unless you would like to take over.

1 ↗(On Diff #339441)

This is not very pretty. Why do we need runtime dispatch to a function pointer?

jdoerfert added inline comments.May 12 2021, 8:06 AM
1 ↗(On Diff #339441)

because we have variadic functions right now. A patch to remove this is already underway: