- This patch adds support for thread_limit clause on target directive according to OpenMP 51 [2.14.5]
- The idea is to create an outer task for target region, when there is a thread_limit clause, and manipulate the thread_limit of task instead. This way, thread_limit will be applied to all the relevant constructs enclosed by the target region.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| clang/lib/CodeGen/CGOpenMPRuntime.cpp | ||
|---|---|---|
| 9428 | What if D is combined target directive, i.e. D.getDirectiveKind() is something like OMPD_target_teams, etc.? | |
| clang/lib/CodeGen/CGStmtOpenMP.cpp | ||
| 5148–5153 | Avoid double call of S.getSingleClause<OMPThreadLimitClause>(), store in local variable call result. | |
| clang/test/OpenMP/target_codegen.cpp | ||
| 849–850 | It requires extra resource consumption, can you try to avoid creating outer task, if possible? | |
| openmp/runtime/src/kmp_ftn_entry.h | ||
| 818 | No need for else here | |
| clang/lib/CodeGen/CGOpenMPRuntime.cpp | ||
|---|---|---|
| 9428 | I will fix that, thanks for noticing. | |
| clang/lib/CodeGen/CGStmtOpenMP.cpp | ||
| 5148–5153 | sure. | |
| clang/test/OpenMP/target_codegen.cpp | ||
| 849–850 | I tried different ideas for making thread_limit work on target. I tried to reuse the existing implementation by replacing the directive to target teams(1) thread_limit(x) at parsing , sema and IR stages. I couldn't successfully implement any of them. So, I tried adding num_threads for all the parallel directives within target, and there were corner cases like parallel directives in a function which is called in target region, which were becoming tedious to handle. This method seem to encompass the idea of thread limit on target pretty well and also works... So I proceeded with this idea. | |
| openmp/runtime/src/kmp_ftn_entry.h | ||
| 818 | oops, I will fix that | |
- Added support for thread_limit clause on relevant combined directives which begin with target as per @ABataev 's comments.
- Added additional LIT test cases to check codegen of the thread_limit on the newly supported directives.
- Updated the runtime LIT as per @jdoerfert 's comments.
| clang/lib/CodeGen/CGOpenMPRuntime.cpp | ||
|---|---|---|
| 9654 | I think you don't need needsTaskBasedThreadLimit call here, the emitTargetCall function itself can be called only for target-based directives | |
| clang/lib/CodeGen/CGStmtOpenMP.cpp | ||
| 5148 | Same regarding needsTaskBasedThreadLimit(S.getDirectiveKind()) , the function EmitOMPTargetTaskBasedDirective is called only for target-based directives | |
| clang/lib/CodeGen/CGOpenMPRuntime.cpp | ||
|---|---|---|
| 9654 | emitTargetCall() is called for all the target based directives, even target - team based directives, which already have a thread limit implementation in place. So, I need needsTaskBasedThreadLimit to select applicable directives only. | |
| clang/lib/CodeGen/CGStmtOpenMP.cpp | ||
| 5148 | Similarly here as well. | |
- Updated SemaOpenMP.cpp to support thread_limit clause on the newly allowed directives.
- This update is to fix the newly added LIT tests' failures (which were occurring only on debug build)
Used the python script update_cc_test_checks.py to generate the checks for the newly added tests.
| clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp | ||
|---|---|---|
| 5 | Add PCH serialization/desrialization checks in your tests | |
| clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp | ||
|---|---|---|
| 31 | I did not remove any check lines in this function. Same for all the other test cases. | |
| clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp | ||
|---|---|---|
| 31 | Better to restore it to be able to use the script in future without many changes | |
| clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp | ||
|---|---|---|
| 31 | But a few check lines are failing on windows, while passing on debian. | |
| clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp | ||
|---|---|---|
| 31 | It must be investigated, it is bad idea just to remove these checks. Most probably related to the order of the expressions emission. | |
This new test is failing on Windows, due to __kmpc_set_thread_limit not being exported - see e.g. https://github.com/mstorsjo/llvm-mingw/actions/runs/5994183421/job/16264501555. Can someone add it to openmp/runtime/src/dllexports?
CC @vadikp-intel @natgla about this. What's the procedure for allocating new ordinal numbers for new exported functions here?
Windows importing is now done by name, and new exports do not need to have an ordinal specified for them i.e. you can add a line with just the API name to dllexports.
| openmp/runtime/test/target/target_thread_limit.cpp | ||
|---|---|---|
| 29 | This test fails when running (on Windows) on GitHub Actions runners - see https://github.com/mstorsjo/llvm-mingw/actions/runs/6019088705/job/16342540379. I believe that this bit of the test has got a hidden assumption that it is running in an environment with 4 or more cores. By setting #pragma omp target thread_limit(tl) (with tl=4) and running a line in parallel with #pragma omp parallel, it expects that we'll get 4 printouts - while in practice, we'll get anywhere between 1 and 4 printouts depending on the number of cores. Is there something that can be done to make this test work in such an environment too? | |
| openmp/runtime/test/target/target_thread_limit.cpp | ||
|---|---|---|
| 29 | Can someone involved in this patch take on fixing it so that it works on machines with fewer than 4 cores? I'm not sure what's the most appropriate path forward here, as it breaks clearly in such configs (even if it might not be hit by one of the official llvm buildbots, but it shows up as breakage in my nightly builds every day now) - reverting seems a bit harsh. I guess I could just rip out this part of the test? | |
| openmp/runtime/test/target/target_thread_limit.cpp | ||
|---|---|---|
| 29 | @mstorsjo , I noticed that you have committed this https://github.com/llvm/llvm-project/commit/c2019c416c8d7ec50aec6ac6b82c9aa4e99b0f6f Does this solve your problem ? | |
| openmp/runtime/test/target/target_thread_limit.cpp | ||
|---|---|---|
| 29 | Yes, that commit fixed the issue. | |
I cannot write inline comment, so I'm leaving message here.
openmp/runtime/test/target/target_thread_limit.cpp:
// checking consecutive target regions with different thread_limits
#pragma omp target thread_limit(3){ printf("\nsecond target: thread_limit = %d", omp_get_thread_limit());// OMP51: second target: thread_limit = 3
Our VE architecture supports only OpenMP runtime. It doesn't support libomptarget. If I run check-openmp on our machine, this omp_get_thread_limit() returns default thread limit 2147483647 (=0x7fffffff). I guess it is OK because this pragma specifies only target and our VE doensn't support target. Other pragmas are containing not only target but also other keyword like parallel, so I guess others are running well. I'm not clear about omptarget, so my assumptions here may be wrong, though.
My question is what is the best way to correct the behavior of this test? I appreciate any comments or suggestions. Thank you!
I figure out the reason of strange behavior I mentioned last night. However, I'm still not sure what is the good way to solve this problem. Can someone check the behavior with -O2 option please? Thanks!
| openmp/runtime/test/target/target_thread_limit.cpp | ||
|---|---|---|
| 74 | Last night, I cannot use inline comment, but now I can. So, I'm writing down my comments here. The problem I mentioned last night was caused by optimization. This works fine with optimization level by default, but this doesn't work with -O2 even on x86_64. I'm using -O2 for tests, so I come across this problem. I'll write command lines to reproduce my problem below. As you see, there is a warning message at -O2. So, this behavior may be expected, though. In addition, I'm using f8efa65 to produce following results. $ cd build $ ninja ... $ ./bin/clang++ -fopenmp -I runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -I ../llvm-project/openmp/runtime/test -L runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -fno-omit-frame-pointer -I ../llvm-project/openmp/runtime/test/ompt -std=c++17 ../llvm-project/openmp/runtime/test/target/target_thread_limit.cpp -o ok -lm -latomic -fopenmp-version=51 $ LD_LIBRARY_PATH=runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src ./ok | grep "second target: thread_limit" second target: thread_limit = 3 $ ./bin/clang++ -fopenmp -I runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -I ../llvm-project/openmp/runtime/test -L runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -fno-omit-frame-pointer -I ../llvm-project/openmp/runtime/test/ompt -std=c++17 ../llvm-project/openmp/runtime/test/target/target_thread_limit.cpp -o bad -lm -latomic -fopenmp-version=51 -O2 warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning] 1 warning generated. $ LD_LIBRARY_PATH=runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src ./bad | grep "second target: thread_limit" second target: thread_limit = 2147483647 | |
@kaz7, it seems that the thread_limit is being set properly, but the omp_get_thread_limit() is giving a wrong output when you enable anything more than -O1. I will fix it as soon as I can. Meanwhile, if you absolutely want the test case to work right now, remove the printf causing the issue or do not run that test case with a higher optimization level than -O1.
Thank you for investigating that. I'm fine with -O0 result. I just came across this issue and informed that. Thank you for your efforts!
What if D is combined target directive, i.e. D.getDirectiveKind() is something like OMPD_target_teams, etc.?