This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single kmpc_parallel_51 runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using update_cc_test_checks.py. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi @Meinersbur (got word you are a windows user), @jdoerfert, could I ask your help in detecting why the clang tests on windows are failing? There are two failures I'm spotting, one is that calls to llvm.nvvm intrinsics seem transposed (https://reviews.llvm.org/harbormaster/unit/view/552591/) and another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17). Maybe there is something else I'm missing and I'd appreciate the extra eyeballing on the problem.
I have only minor remarks but I'd like you to check if my hunch is correct and the proposed modifications will fix fix PR49777 *and* fix PR49779.
Also, the number of arguments need to be increased, let's go big and automatic here.
Other than that I think this looks good.
| clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
|---|---|---|
| 2155–2156 | Can we remove SeqGen while we are here please. We need to check in the runtime anyway. That check is later folded, no need to make things more complicated here. | |
| openmp/libomptarget/deviceRTLs/common/src/parallel.cu | ||
| 294 | This should allow us to remove the SeqGen in the Clang CodeGen *and* fix PR49777 *and* fix PR49779, a win-win-win situation. | |
| 368 | FWIW, The implementation here is a stopgap until we move to the new runtime. The codegen and interface are the important parts. | |
| openmp/libomptarget/deviceRTLs/common/src/support.cu | ||
| 370 | Not a return but a __builtin_trap(), please. | |
The transposition problem arises from:
static llvm::Value *getThreadLimit(CodeGenFunction &CGF,
                                   bool IsInSPMDExecutionMode = false) {
  CGBuilderTy &Bld = CGF.Builder;
  auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime());
  return IsInSPMDExecutionMode
             ? RT.getGPUNumThreads(CGF)
             : Bld.CreateNUWSub(RT.getGPUNumThreads(CGF),
                                RT.getGPUWarpSize(CGF), "thread_limit");
}The order in which getGPUNumThreads(), getGPUNumThreads(), getGPUWarpSize() is called is undefined, only has to have happened at a sequence point. The idea is that it would depend on the order in which the function arguments are put on the stack.
Turns out, clang/gcc evaluate the left argument first, msvc starts with the right one.
| clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
|---|---|---|
| 569–570 | getGPUNumThreads and getGPUWarpSize still have undefined call order. | |
With the nit to add the two reproducers, LGTM. (please make sure to run FAROS or some benchmarks we have before commiting).
| openmp/libomptarget/deviceRTLs/common/src/parallel.cu | ||
|---|---|---|
| 294 | Check? Can we add the two reproducers as tests, please. One should be a clang test, the other maybe a runtime test, though clang test might suffice. | |
| openmp/libomptarget/utils/generate_microtask_cases.py | ||
| 32 | Great. The output is not pretty but that was not the objective ;) | |
I have not looked at the other mentioned problem yet:
another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17)
Which might still be there.
I would like to wait for Harbormaster to complete the pre-merge check.
| clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
|---|---|---|
| 1150–1152 | Some previous emitted values can be re-used, e.g., GPUThreadID in line 1150 can re-use the value from line 1140 , instead of re-emitted. I've kept emitting them as it was previously done. What is the preferred way to handle those? | |
| openmp/libomptarget/deviceRTLs/common/src/parallel.cu | ||
| 294 | Ack, will do | |
| clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
|---|---|---|
| 590–592 | This is another undefined codegen call order which causes the current pre-merge checks to fail. | |
This test seem to pass on Windows now. Please still fix the clang-format remarks, such as going over 80 characters on a line.
| clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
|---|---|---|
| 1150–1152 | getGPUThreadID/getMasterThreadID could cache the value if used multiple times., but it would also require to put them into the entry block to be available anywhere in the function. Otherwise, use a best-effort to minimize overhead even if the optimizer cannot unify them or in debug builds. | |
Please update the test with a NFC commit.
| openmp/libomptarget/test/offloading/bug49779.cpp | ||
|---|---|---|
| 1–5 | See D101326 | |
| 29–36 | Since the output goes to Filecheck anyways, I think we should avoid asserts, but let Filecheck test for expected results.  | |
Thanks, @protze.joachim. The changes look good. I'll get that NFC commit in soon-ish, unless you would like to take over.
| openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen | ||
|---|---|---|
| 1 | This is not very pretty. Why do we need runtime dispatch to a function pointer? | |
| openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen | ||
|---|---|---|
| 1 | because we have variadic functions right now. A patch to remove this is already underway: | |
getGPUNumThreads and getGPUWarpSize still have undefined call order.