Page MenuHomePhabricator

[OpenMP] Simplify offloading parallel call codegen
ClosedPublic

Authored by ggeorgakoudis on Feb 3 2021, 1:51 PM.

Details

Summary

This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single kmpc_parallel_51 runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using update_cc_test_checks.py. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.

Diff Detail

Unit TestsFailed

TimeTest
60 msx64 debian > LLVM.Transforms/OpenMP::gpu_state_machine_function_ptr_replacement.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -S -passes=openmpopt -pass-remarks=openmp-opt -openmp-print-gpu-kernels < /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
170 msx64 windows > Clang.OpenMP::nvptx_data_sharing.cpp
Script: -- : 'RUN: at line 5'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16c2-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc -verify -fopenmp -x c++ -triple powerpc64le-unknown-unknown -fopenmp-targets=nvptx64-nvidia-cuda -emit-llvm-bc C:\ws\w16c2-1\llvm-project\premerge-checks\clang\test\OpenMP\nvptx_data_sharing.cpp -o C:\ws\w16c2-1\llvm-project\premerge-checks\build\tools\clang\test\OpenMP\Output\nvptx_data_sharing.cpp.tmp-ppc-host.bc
580 msx64 windows > Clang.OpenMP::nvptx_lambda_capturing.cpp
Script: -- : 'RUN: at line 5'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16c2-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc -fopenmp -x c++ -std=c++11 -triple powerpc64le-unknown-unknown -fopenmp-targets=nvptx64-nvidia-cuda -emit-llvm C:\ws\w16c2-1\llvm-project\premerge-checks\clang\test\OpenMP\nvptx_lambda_capturing.cpp -o - | c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\filecheck.exe --allow-unused-prefixes C:\ws\w16c2-1\llvm-project\premerge-checks\clang\test\OpenMP\nvptx_lambda_capturing.cpp --check-prefix CHECK1
200 msx64 windows > Clang.OpenMP::nvptx_parallel_codegen.cpp
Script: -- : 'RUN: at line 3'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16c2-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc -verify -fopenmp -x c++ -triple powerpc64le-unknown-unknown -fopenmp-targets=nvptx64-nvidia-cuda -emit-llvm-bc C:\ws\w16c2-1\llvm-project\premerge-checks\clang\test\OpenMP\nvptx_parallel_codegen.cpp -o C:\ws\w16c2-1\llvm-project\premerge-checks\build\tools\clang\test\OpenMP\Output\nvptx_parallel_codegen.cpp.tmp-ppc-host.bc
190 msx64 windows > Clang.OpenMP::nvptx_parallel_for_codegen.cpp
Script: -- : 'RUN: at line 3'; c:\ws\w16c2-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16c2-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc -verify -fopenmp -x c++ -triple powerpc64le-unknown-unknown -fopenmp-targets=nvptx64-nvidia-cuda -emit-llvm-bc C:\ws\w16c2-1\llvm-project\premerge-checks\clang\test\OpenMP\nvptx_parallel_for_codegen.cpp -o C:\ws\w16c2-1\llvm-project\premerge-checks\build\tools\clang\test\OpenMP\Output\nvptx_parallel_for_codegen.cpp.tmp-ppc-host.bc
View Full Test Results (10 Failed)

Event Timeline

ggeorgakoudis created this revision.Feb 3 2021, 1:51 PM
ggeorgakoudis requested review of this revision.Feb 3 2021, 1:51 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald Transcript
ggeorgakoudis edited the summary of this revision. (Show Details)Feb 3 2021, 1:58 PM

Fix type for IfCond, formatting

Add tests, update OpenMPOpt, rebase to main

ggeorgakoudis edited the summary of this revision. (Show Details)Tue, Apr 13, 7:27 AM

Add aux-triple to one test, check unit test builder on windows

Fix llvm test

Hi @Meinersbur (got word you are a windows user), @jdoerfert, could I ask your help in detecting why the clang tests on windows are failing? There are two failures I'm spotting, one is that calls to llvm.nvvm intrinsics seem transposed (https://reviews.llvm.org/harbormaster/unit/view/552591/) and another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17). Maybe there is something else I'm missing and I'd appreciate the extra eyeballing on the problem.

I have only minor remarks but I'd like you to check if my hunch is correct and the proposed modifications will fix fix PR49777 *and* fix PR49779.
Also, the number of arguments need to be increased, let's go big and automatic here.

Other than that I think this looks good.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
2192

Can we remove SeqGen while we are here please. We need to check in the runtime anyway. That check is later folded, no need to make things more complicated here.

openmp/libomptarget/deviceRTLs/common/src/parallel.cu
294

This should allow us to remove the SeqGen in the Clang CodeGen *and* fix PR49777 *and* fix PR49779, a win-win-win situation.

368

FWIW, The implementation here is a stopgap until we move to the new runtime. The codegen and interface are the important parts.

openmp/libomptarget/deviceRTLs/common/src/support.cu
370

Not a return but a __builtin_trap(), please.
We also need this for more than 16 unfortunately, I've seen 20 in miniqmc.
We might want to create a script to print the cases, and then generate 128 or something like that in a file we include. The script can be in the utils folder too.

Meinersbur added a comment.EditedThu, Apr 15, 9:20 PM

The transposition problem arises from:

static llvm::Value *getThreadLimit(CodeGenFunction &CGF,
                                   bool IsInSPMDExecutionMode = false) {
  CGBuilderTy &Bld = CGF.Builder;
  auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
  return IsInSPMDExecutionMode
             ? RT.getGPUNumThreads(CGF)
             : Bld.CreateNUWSub(RT.getGPUNumThreads(CGF),
                                RT.getGPUWarpSize(CGF), "thread_limit");
}

The order in which getGPUNumThreads(), getGPUNumThreads(), getGPUWarpSize() is called is undefined, only has to have happened at a sequence point. The idea is that it would depend on the order in which the function arguments are put on the stack.

Turns out, clang/gcc evaluate the left argument first, msvc starts with the right one.

ggeorgakoudis updated this revision to Diff 338102.EditedFri, Apr 16, 7:12 AM

Fix for getThreadLimit

Meinersbur requested changes to this revision.Fri, Apr 16, 11:35 AM
Meinersbur added inline comments.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
572–573

getGPUNumThreads and getGPUWarpSize still have undefined call order.

This revision now requires changes to proceed.Fri, Apr 16, 11:35 AM

Update for comments, fix for windows fix

ggeorgakoudis marked 4 inline comments as done.Fri, Apr 16, 2:45 PM
ggeorgakoudis added inline comments.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
2192

Done

openmp/libomptarget/deviceRTLs/common/src/parallel.cu
294

Please check

jdoerfert accepted this revision.Fri, Apr 16, 3:41 PM

With the nit to add the two reproducers, LGTM. (please make sure to run FAROS or some benchmarks we have before commiting).

openmp/libomptarget/deviceRTLs/common/src/parallel.cu
294

Check? Can we add the two reproducers as tests, please. One should be a clang test, the other maybe a runtime test, though clang test might suffice.

openmp/libomptarget/utils/generate_microtask_cases.py
31 ↗(On Diff #338246)

Great. The output is not pretty but that was not the objective ;)

Meinersbur added a comment.EditedFri, Apr 16, 3:47 PM

I have not looked at the other mentioned problem yet:

another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17)

Which might still be there.

I would like to wait for Harbormaster to complete the pre-merge check.

Meinersbur added inline comments.Fri, Apr 16, 7:41 PM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1135

There seem to be more unordered codegen calls, such as this one.

1144

There seem to be more unordered codegen calls, such as this one.

ggeorgakoudis marked 2 inline comments as done.

Update for comments, fixes

ggeorgakoudis marked 4 inline comments as done.Mon, Apr 19, 12:55 AM
ggeorgakoudis added inline comments.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1144

Some previous emitted values can be re-used, e.g., GPUThreadID in line 1150 can re-use the value from line 1140 , instead of re-emitted. I've kept emitting them as it was previously done. What is the preferred way to handle those?

openmp/libomptarget/deviceRTLs/common/src/parallel.cu
294

Ack, will do

Meinersbur added inline comments.Mon, Apr 19, 8:58 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
585–586

This is another undefined codegen call order which causes the current pre-merge checks to fail.

ggeorgakoudis marked 2 inline comments as done.

Fix

Add tests, reduce microtask cases to avoid stack problems

ggeorgakoudis marked an inline comment as done.Wed, Apr 21, 9:07 AM
Meinersbur accepted this revision.Wed, Apr 21, 10:54 AM

This test seem to pass on Windows now. Please still fix the clang-format remarks, such as going over 80 characters on a line.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1144

getGPUThreadID/getMasterThreadID could cache the value if used multiple times., but it would also require to put them into the entry block to be available anywhere in the function.

Otherwise, use a best-effort to minimize overhead even if the optimizer cannot unify them or in debug builds.

This revision is now accepted and ready to land.Wed, Apr 21, 10:54 AM

Fix clang-format

This revision was landed with ongoing or failed builds.Wed, Apr 21, 6:46 PM
This revision was automatically updated to reflect the committed changes.

Please update the test with a NFC commit.

openmp/libomptarget/test/offloading/bug49779.cpp
1–5 ↗(On Diff #339441)
29–36 ↗(On Diff #339441)

Since the output goes to Filecheck anyways, I think we should avoid asserts, but let Filecheck test for expected results.
The output for failing tests has more information with this approach.

Please update the test with a NFC commit.

Thanks, @protze.joachim. The changes look good. I'll get that NFC commit in soon-ish, unless you would like to take over.

openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen
1 ↗(On Diff #339441)

This is not very pretty. Why do we need runtime dispatch to a function pointer?

jdoerfert added inline comments.Wed, May 12, 8:06 AM
openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen
1 ↗(On Diff #339441)

because we have variadic functions right now. A patch to remove this is already underway:
https://reviews.llvm.org/D102107