This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading and corresponding changes in libomptarget: SPMD/non-SPMD parallel calls are unified under a single kmpc_parallel_51 runtime entry point for parallel regions (which will be commonized between target, host-side parallel regions), data sharing is internalized to the runtime. Tests have been auto-generated using update_cc_test_checks.py. Also, the revision contains changes to OpenMPOpt for remark creation on target offloading regions.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi @Meinersbur (got word you are a windows user), @jdoerfert, could I ask your help in detecting why the clang tests on windows are failing? There are two failures I'm spotting, one is that calls to llvm.nvvm intrinsics seem transposed (https://reviews.llvm.org/harbormaster/unit/view/552591/) and another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17). Maybe there is something else I'm missing and I'd appreciate the extra eyeballing on the problem.
I have only minor remarks but I'd like you to check if my hunch is correct and the proposed modifications will fix fix PR49777 *and* fix PR49779.
Also, the number of arguments need to be increased, let's go big and automatic here.
Other than that I think this looks good.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
2155–2156 | Can we remove SeqGen while we are here please. We need to check in the runtime anyway. That check is later folded, no need to make things more complicated here. | |
openmp/libomptarget/deviceRTLs/common/src/parallel.cu | ||
294 | This should allow us to remove the SeqGen in the Clang CodeGen *and* fix PR49777 *and* fix PR49779, a win-win-win situation. | |
368 | FWIW, The implementation here is a stopgap until we move to the new runtime. The codegen and interface are the important parts. | |
openmp/libomptarget/deviceRTLs/common/src/support.cu | ||
370 | Not a return but a __builtin_trap(), please. |
The transposition problem arises from:
static llvm::Value *getThreadLimit(CodeGenFunction &CGF, bool IsInSPMDExecutionMode = false) { CGBuilderTy &Bld = CGF.Builder; auto &RT = static_cast<CGOpenMPRuntimeGPU &>(CGF.CGM.getOpenMPRuntime()); return IsInSPMDExecutionMode ? RT.getGPUNumThreads(CGF) : Bld.CreateNUWSub(RT.getGPUNumThreads(CGF), RT.getGPUWarpSize(CGF), "thread_limit"); }
The order in which getGPUNumThreads(), getGPUNumThreads(), getGPUWarpSize() is called is undefined, only has to have happened at a sequence point. The idea is that it would depend on the order in which the function arguments are put on the stack.
Turns out, clang/gcc evaluate the left argument first, msvc starts with the right one.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
569–570 | getGPUNumThreads and getGPUWarpSize still have undefined call order. |
With the nit to add the two reproducers, LGTM. (please make sure to run FAROS or some benchmarks we have before commiting).
openmp/libomptarget/deviceRTLs/common/src/parallel.cu | ||
---|---|---|
294 | Check? Can we add the two reproducers as tests, please. One should be a clang test, the other maybe a runtime test, though clang test might suffice. | |
openmp/libomptarget/utils/generate_microtask_cases.py | ||
32 | Great. The output is not pretty but that was not the objective ;) |
I have not looked at the other mentioned problem yet:
another that attribute regexes are not recognized (https://reviews.llvm.org/harbormaster/unit/view/552593/ at nvptx_target_codegen.cpp:723:17)
Which might still be there.
I would like to wait for Harbormaster to complete the pre-merge check.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
1150–1152 | Some previous emitted values can be re-used, e.g., GPUThreadID in line 1150 can re-use the value from line 1140 , instead of re-emitted. I've kept emitting them as it was previously done. What is the preferred way to handle those? | |
openmp/libomptarget/deviceRTLs/common/src/parallel.cu | ||
294 | Ack, will do |
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
590–592 | This is another undefined codegen call order which causes the current pre-merge checks to fail. |
This test seem to pass on Windows now. Please still fix the clang-format remarks, such as going over 80 characters on a line.
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp | ||
---|---|---|
1150–1152 | getGPUThreadID/getMasterThreadID could cache the value if used multiple times., but it would also require to put them into the entry block to be available anywhere in the function. Otherwise, use a best-effort to minimize overhead even if the optimizer cannot unify them or in debug builds. |
Please update the test with a NFC commit.
openmp/libomptarget/test/offloading/bug49779.cpp | ||
---|---|---|
1–5 | See D101326 | |
29–36 | Since the output goes to Filecheck anyways, I think we should avoid asserts, but let Filecheck test for expected results. |
Thanks, @protze.joachim. The changes look good. I'll get that NFC commit in soon-ish, unless you would like to take over.
openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen | ||
---|---|---|
1 | This is not very pretty. Why do we need runtime dispatch to a function pointer? |
openmp/libomptarget/deviceRTLs/common/generated_microtask_cases.gen | ||
---|---|---|
1 | because we have variadic functions right now. A patch to remove this is already underway: |
getGPUNumThreads and getGPUWarpSize still have undefined call order.