This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Codegen support for thread_limit on target directive
ClosedPublic

Authored by sandeepkosuri on Jun 3 2023, 12:46 AM.

Details

Summary
  • This patch adds support for thread_limit clause on target directive according to OpenMP 51 [2.14.5]
  • The idea is to create an outer task for target region, when there is a thread_limit clause, and manipulate the thread_limit of task instead. This way, thread_limit will be applied to all the relevant constructs enclosed by the target region.

Diff Detail

Event Timeline

sandeepkosuri created this revision.Jun 3 2023, 12:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2023, 12:46 AM
sandeepkosuri requested review of this revision.Jun 3 2023, 12:46 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald Transcript

Updated target_codegen.cpp test case to incorporate my changes

jdenny added a subscriber: jdenny.Jun 12 2023, 11:46 AM
ABataev added inline comments.Jun 28 2023, 5:31 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
9640

What if D is combined target directive, i.e. D.getDirectiveKind() is something like OMPD_target_teams, etc.?

clang/lib/CodeGen/CGStmtOpenMP.cpp
5143–5148

Avoid double call of S.getSingleClause<OMPThreadLimitClause>(), store in local variable call result.

clang/test/OpenMP/target_codegen.cpp
849–850

It requires extra resource consumption, can you try to avoid creating outer task, if possible?

openmp/runtime/src/kmp_ftn_entry.h
813

No need for else here

sandeepkosuri added inline comments.Jun 30 2023, 10:13 PM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
9640

I will fix that, thanks for noticing.

clang/lib/CodeGen/CGStmtOpenMP.cpp
5143–5148

sure.

clang/test/OpenMP/target_codegen.cpp
849–850

I tried different ideas for making thread_limit work on target.

I tried to reuse the existing implementation by replacing the directive to target teams(1) thread_limit(x) at parsing , sema and IR stages. I couldn't successfully implement any of them. So, I tried adding num_threads for all the parallel directives within target, and there were corner cases like parallel directives in a function which is called in target region, which were becoming tedious to handle.

This method seem to encompass the idea of thread limit on target pretty well and also works... So I proceeded with this idea.

openmp/runtime/src/kmp_ftn_entry.h
813

oops, I will fix that

jdoerfert added inline comments.Jul 1 2023, 11:20 AM
openmp/runtime/test/target/target_thread_limit.cpp
29

This doesn't check much. You need to verify a 5th or print the team size. Same for the rest

36

Verify that the user can use the omp_set_ functions and see consistent behavior.

  • Added support for thread_limit clause on relevant combined directives which begin with target as per @ABataev 's comments.
  • Added additional LIT test cases to check codegen of the thread_limit on the newly supported directives.
  • Updated the runtime LIT as per @jdoerfert 's comments.

Explicitly mentioned -fopenmp-version=51 in LIT test cases

ABataev added inline comments.Jul 26 2023, 7:11 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
9866

I think you don't need needsTaskBasedThreadLimit call here, the emitTargetCall function itself can be called only for target-based directives

clang/lib/CodeGen/CGStmtOpenMP.cpp
5143

Same regarding needsTaskBasedThreadLimit(S.getDirectiveKind()) , the function EmitOMPTargetTaskBasedDirective is called only for target-based directives

sandeepkosuri added inline comments.Aug 4 2023, 12:02 AM
clang/lib/CodeGen/CGOpenMPRuntime.cpp
9866

emitTargetCall() is called for all the target based directives, even target - team based directives, which already have a thread limit implementation in place. So, I need needsTaskBasedThreadLimit to select applicable directives only.

clang/lib/CodeGen/CGStmtOpenMP.cpp
5143

Similarly here as well.

Is this patch to support thread_limit on target directive on the host?

Is this patch to support thread_limit on target directive on the host?

Yes @tianshilei1992 , It is for host only

This revision is now accepted and ready to land.Aug 7 2023, 5:00 AM
  • Updated SemaOpenMP.cpp to support thread_limit clause on the newly allowed directives.
  • This update is to fix the newly added LIT tests' failures (which were occurring only on debug build)

Please, use the script to generate the checks for the newly added tests

Used the python script update_cc_test_checks.py to generate the checks for the newly added tests.

ABataev added inline comments.Aug 9 2023, 5:24 AM
clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp
5

Add PCH serialization/desrialization checks in your tests

Added PCH options to the RUN lines in LIT tests

ABataev added inline comments.Aug 23 2023, 10:19 AM
clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp
31

Why removed these checks?

clang/test/OpenMP/target_parallel_for_tl_codegen.cpp
31

Same, why it was removed?

sandeepkosuri marked an inline comment as done.Aug 23 2023, 10:36 PM
sandeepkosuri added inline comments.
clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp
31

I did not remove any check lines in this function.
But I removed checks in omp_task_entry function that were not related to my changes, to avoid failures. I only wanted to check whether __kmpc_set_thread_limit() is called.

Same for all the other test cases.

ABataev added inline comments.Aug 24 2023, 5:15 AM
clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp
31

Better to restore it to be able to use the script in future without many changes

sandeepkosuri marked an inline comment as done.Aug 24 2023, 6:35 AM
sandeepkosuri added inline comments.
clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp
31

But a few check lines are failing on windows, while passing on debian.

ABataev added inline comments.Aug 24 2023, 6:42 AM
clang/test/OpenMP/target_parallel_for_simd_tl_codegen.cpp
31

It must be investigated, it is bad idea just to remove these checks. Most probably related to the order of the expressions emission.

Edited the LIT test cases to use more script generated check lines.

Used CHECK-DAG s to avoid LIT test failures on Windows system

Made LIT test cases more robust to check lines ordering problem

made new LIT test cases target specific to linux

This new test is failing on Windows, due to __kmpc_set_thread_limit not being exported - see e.g. https://github.com/mstorsjo/llvm-mingw/actions/runs/5994183421/job/16264501555. Can someone add it to openmp/runtime/src/dllexports?

This new test is failing on Windows, due to __kmpc_set_thread_limit not being exported - see e.g. https://github.com/mstorsjo/llvm-mingw/actions/runs/5994183421/job/16264501555. Can someone add it to openmp/runtime/src/dllexports?

CC @vadikp-intel @natgla about this. What's the procedure for allocating new ordinal numbers for new exported functions here?

Windows importing is now done by name, and new exports do not need to have an ordinal specified for them i.e. you can add a line with just the API name to dllexports.

Windows importing is now done by name, and new exports do not need to have an ordinal specified for them i.e. you can add a line with just the API name to dllexports.

Oh, right, thanks. Will do.

mstorsjo added inline comments.Aug 30 2023, 5:16 AM
openmp/runtime/test/target/target_thread_limit.cpp
29

This test fails when running (on Windows) on GitHub Actions runners - see https://github.com/mstorsjo/llvm-mingw/actions/runs/6019088705/job/16342540379.

I believe that this bit of the test has got a hidden assumption that it is running in an environment with 4 or more cores. By setting #pragma omp target thread_limit(tl) (with tl=4) and running a line in parallel with #pragma omp parallel, it expects that we'll get 4 printouts - while in practice, we'll get anywhere between 1 and 4 printouts depending on the number of cores.

Is there something that can be done to make this test work in such an environment too?

mstorsjo added inline comments.Aug 31 2023, 2:08 PM
openmp/runtime/test/target/target_thread_limit.cpp
29

Can someone involved in this patch take on fixing it so that it works on machines with fewer than 4 cores? I'm not sure what's the most appropriate path forward here, as it breaks clearly in such configs (even if it might not be hit by one of the official llvm buildbots, but it shows up as breakage in my nightly builds every day now) - reverting seems a bit harsh. I guess I could just rip out this part of the test?

sandeepkosuri added inline comments.Sep 7 2023, 7:50 AM
openmp/runtime/test/target/target_thread_limit.cpp
29

@mstorsjo , I noticed that you have committed this https://github.com/llvm/llvm-project/commit/c2019c416c8d7ec50aec6ac6b82c9aa4e99b0f6f

Does this solve your problem ?

mstorsjo added inline comments.Sep 7 2023, 1:55 PM
openmp/runtime/test/target/target_thread_limit.cpp
29

Yes, that commit fixed the issue.

kaz7 added a subscriber: kaz7.Sep 10 2023, 3:08 AM

I cannot write inline comment, so I'm leaving message here.

openmp/runtime/test/target/target_thread_limit.cpp:

// checking consecutive target regions with different thread_limits
#pragma omp target thread_limit(3)

{
  printf("\nsecond target: thread_limit = %d", omp_get_thread_limit());

// OMP51: second target: thread_limit = 3

Our VE architecture supports only OpenMP runtime. It doesn't support libomptarget. If I run check-openmp on our machine, this omp_get_thread_limit() returns default thread limit 2147483647 (=0x7fffffff). I guess it is OK because this pragma specifies only target and our VE doensn't support target. Other pragmas are containing not only target but also other keyword like parallel, so I guess others are running well. I'm not clear about omptarget, so my assumptions here may be wrong, though.

My question is what is the best way to correct the behavior of this test? I appreciate any comments or suggestions. Thank you!

I run check-openmp on our machine, this omp_get_thread_limit() returns default thread limit 2147483647 (=0x7fffffff).

That is something wrong because this patch is about host instead of offloading.

kaz7 added a comment.Sep 10 2023, 9:13 PM

I run check-openmp on our machine, this omp_get_thread_limit() returns default thread limit 2147483647 (=0x7fffffff).

That is something wrong because this patch is about host instead of offloading.

Thank you for your comment. I try to inspect this.

kaz7 added a comment.Sep 11 2023, 6:53 AM

I figure out the reason of strange behavior I mentioned last night. However, I'm still not sure what is the good way to solve this problem. Can someone check the behavior with -O2 option please? Thanks!

openmp/runtime/test/target/target_thread_limit.cpp
74

Last night, I cannot use inline comment, but now I can. So, I'm writing down my comments here.

The problem I mentioned last night was caused by optimization. This works fine with optimization level by default, but this doesn't work with -O2 even on x86_64. I'm using -O2 for tests, so I come across this problem. I'll write command lines to reproduce my problem below. As you see, there is a warning message at -O2. So, this behavior may be expected, though. In addition, I'm using f8efa65 to produce following results.

$ cd build
$ ninja
...
$ ./bin/clang++  -fopenmp -I runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -I ../llvm-project/openmp/runtime/test -L runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -fno-omit-frame-pointer -I ../llvm-project/openmp/runtime/test/ompt -std=c++17 ../llvm-project/openmp/runtime/test/target/target_thread_limit.cpp -o ok -lm -latomic -fopenmp-version=51
$ LD_LIBRARY_PATH=runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src ./ok | grep "second target: thread_limit"
second target: thread_limit = 3
$ ./bin/clang++ -fopenmp -I runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -I ../llvm-project/openmp/runtime/test -L runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src -fno-omit-frame-pointer -I ../llvm-project/openmp/runtime/test/ompt -std=c++17 ../llvm-project/openmp/runtime/test/target/target_thread_limit.cpp -o bad -lm -latomic -fopenmp-version=51 -O2
warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
1 warning generated.
$ LD_LIBRARY_PATH=runtimes/runtimes-x86_64-unknown-linux-gnu-bins/openmp/runtime/src ./bad | grep "second target: thread_limit"
second target: thread_limit = 2147483647

@kaz7, it seems that the thread_limit is being set properly, but the omp_get_thread_limit() is giving a wrong output when you enable anything more than -O1. I will fix it as soon as I can. Meanwhile, if you absolutely want the test case to work right now, remove the printf causing the issue or do not run that test case with a higher optimization level than -O1.

kaz7 added a comment.Sep 20 2023, 1:52 AM

@kaz7, it seems that the thread_limit is being set properly, but the omp_get_thread_limit() is giving a wrong output when you enable anything more than -O1. I will fix it as soon as I can. Meanwhile, if you absolutely want the test case to work right now, remove the printf causing the issue or do not run that test case with a higher optimization level than -O1.

Thank you for investigating that. I'm fine with -O0 result. I just came across this issue and informed that. Thank you for your efforts!