This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Added the support for hidden helper task in RTL
ClosedPublic

Authored by tianshilei1992 on Apr 6 2020, 4:54 PM.

Details

Summary

The basic design is to create an outermost parallel team. It is not a regular team, because it is only created when the first hidden helper task is encountered, and it is only responsible for the execution of hidden helper tasks. We first use pthread_create to create a new thread; call it the initial thread, which also serves as the main thread of the hidden helper team. This initial thread then initializes a new root, just as the RTL does during initialization. After that, it directly calls __kmpc_fork_call, as if the initial thread had encountered a parallel region. In the wrapped function for this team, the main thread (the initial thread that we create via pthread_create on Linux) waits on a condition variable that can only be signaled when the RTL is being destroyed; the other worker threads simply do nothing. The reason the main thread needs to wait there is that, in the current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team, which is not what we want.
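The lifetime scheme described above (an initial helper thread that blocks on a condition variable until the RTL is destroyed) can be sketched in plain pthreads. All names here are illustrative, not the runtime's actual symbols:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal sketch with made-up names: the "initial" thread of the hidden
 * helper team blocks on a condition variable and is released only when the
 * runtime is being destroyed. */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t g_destroy_cv = PTHREAD_COND_INITIALIZER;
static bool g_destroying = false;
static bool g_helper_done = false;
static pthread_t g_initial_thread;

static void *helper_initial_main(void *arg) {
  (void)arg;
  /* Stands in for the wrapped parallel function: wait until shutdown so the
   * team is not freed while hidden helper tasks may still arrive. */
  pthread_mutex_lock(&g_lock);
  while (!g_destroying)
    pthread_cond_wait(&g_destroy_cv, &g_lock);
  g_helper_done = true;
  pthread_mutex_unlock(&g_lock);
  return NULL;
}

void helper_team_start(void) {
  pthread_create(&g_initial_thread, NULL, helper_initial_main, NULL);
}

void helper_team_destroy(void) {
  /* Called when the RTL is being destroyed: signal and join the helper. */
  pthread_mutex_lock(&g_lock);
  g_destroying = true;
  pthread_cond_signal(&g_destroy_cv);
  pthread_mutex_unlock(&g_lock);
  pthread_join(g_initial_thread, NULL);
}

bool helper_team_done(void) { return g_helper_done; }
```

The key point the sketch captures is that the helper team's main thread must not return from its wrapped function until shutdown, because returning would trigger the team teardown path.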

Two environment variables, LIBOMP_NUM_HIDDEN_HELPER_THREADS and LIBOMP_USE_HIDDEN_HELPER_TASK, are also introduced to configure the number of threads and to enable/disable the feature. By default, the number of hidden helper threads is 8.
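A minimal sketch of how such a knob could be read; the environment variable name and the default of 8 come from the description above, while the function itself is hypothetical, not the runtime's actual parser:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: read the hidden helper thread count from the
 * environment, falling back to the documented default of 8. */
int hidden_helper_threads_from_env(void) {
  const char *val = getenv("LIBOMP_NUM_HIDDEN_HELPER_THREADS");
  if (val == NULL || *val == '\0')
    return 8; /* documented default */
  int n = atoi(val);
  return n > 0 ? n : 8; /* fall back to the default on bad input */
}
```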

Here are some open issues to be discussed:

  1. The main thread goes to sleep when initialization is finished. As Andrey mentioned, we might need it to be woken from time to time to do some work. What kind of update/check should be put here?

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.

Fixed the issue in must_wait

tianshilei1992 marked an inline comment as done.Aug 31 2020, 12:06 PM
adurang added inline comments.Sep 1 2020, 4:57 AM
openmp/runtime/src/kmp_tasking.cpp
3727

For the code below:

#pragma omp parallel num_threads(2)
{
#pragma omp target nowait
   blah()
#pragma omp taskwait
}

With your current code (because you're using a shared counter for the whole team), both thread 1 and thread 2 wait for each other's target regions (so, for example, even if target-th1 had finished, thread1 would be blocked until target-th2 completed). Each taskwait should only wait for its own child target tasks.

Hope this helps.
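The distinction being made in this exchange can be sketched with a per-task child counter, so that a taskwait issued by one thread never observes another thread's outstanding target tasks. The struct and function names are hypothetical, not the runtime's actual fields:

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch: the counter of incomplete child tasks lives per task, not per
 * team, so a taskwait issued by one thread never blocks on another thread's
 * target tasks. */
typedef struct task_state {
  atomic_int incomplete_children; /* this task's still-running children */
} task_state;

void task_spawn_child(task_state *parent) {
  atomic_fetch_add(&parent->incomplete_children, 1);
}

void task_child_finished(task_state *parent) {
  atomic_fetch_sub(&parent->incomplete_children, 1);
}

/* taskwait only consults the issuing task's own counter. */
int taskwait_would_block(task_state *t) {
  return atomic_load(&t->incomplete_children) > 0;
}
```

A team-wide shared counter, by contrast, would make `taskwait_would_block` return true for every thread until all threads' children had finished, which is exactly the over-waiting described above.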

tianshilei1992 added inline comments.Sep 1 2020, 9:54 AM
openmp/runtime/src/kmp_tasking.cpp
3727

Thanks for the explanation. But these lines of code are in the function __kmp_task_team_wait, which is not called by __kmpc_omp_taskwait. If I understand correctly, __kmp_task_team_wait is called by the master thread of a team to wait for all tasks created in the team to finish before it proceeds. So here we do need to wait for all unshackled tasks encountered/created in the task team.

ye-luo added a subscriber: ye-luo.Sep 1 2020, 2:00 PM
ye-luo added inline comments.
openmp/runtime/src/kmp_tasking.cpp
3727

@adurang your example demonstrates exactly the code pattern I use. taskwait should only wait for the child tasks.

adurang added inline comments.Sep 1 2020, 3:00 PM
openmp/runtime/src/kmp_tasking.cpp
3727

Ah sorry, I didn't notice the patch changing functions. I should really look at the whole file!

But if taskwait is working correctly, the flag.wait call in __kmp_task_team_wait should already make sure that no outstanding unshackled tasks are left, so I don't think the extra check should be needed. Do you have any tests with taskgroup/taskwait?

In any case, could you move it inside the same if statement as the other checks? Also, you need to set tt_unfinished_unshackled_tasks to FALSE in case the same task_team structure is reused (note that the same is done for various fields just above).

Added another condition to see whether we need to wait in the task team

A couple of tests are needed to check that the implementation works: one with an unshackled task encountered before a parallel region, and another with an unshackled task encountered after/between parallel regions.

openmp/runtime/src/kmp_runtime.cpp
4335

This condition is false if new_gtid starts at (__kmp_unshackled_threads_num + 1), that is, for a regular thread. Thus all regular threads will mistakenly get the same gtid.
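The gtid-partitioning bug described here can be illustrated with a toy free-slot search. NUM_HIDDEN_HELPERS and the bool array are stand-ins for the runtime's actual thread table, purely for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy illustration: gtid 0 is the initial thread and gtids
 * 1..NUM_HIDDEN_HELPERS are reserved for hidden helper threads, so the
 * search for a free gtid for a regular thread must start past that range.
 * If the scan starts (or resets) at the wrong place, every regular thread
 * resolves to the same slot. */
#define NUM_HIDDEN_HELPERS 8

int next_regular_gtid(const bool *in_use, int capacity) {
  for (int gtid = NUM_HIDDEN_HELPERS + 1; gtid < capacity; ++gtid)
    if (!in_use[gtid])
      return gtid;
  return -1; /* table full */
}
```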

Fixed some problems and added the first test case. More cases are on the way.

Added another test case to test dependence processing

Fixed a potential race condition

Still trying to fix a race problem

Added the missing part for the team destroy

Refactored code in z_Linux_util.cpp

Updated some tests

tianshilei1992 marked 5 inline comments as done.Oct 16 2020, 1:38 PM

Enabled unshackled thread by default

ye-luo added a comment.EditedOct 16 2020, 4:31 PM

Enabled unshackled thread by default

What is the currently supported way of turning off the unshackled thread team with zero side effects?

Currently it can be disabled by setting LIBOMP_USE_UNSHACKLED_TASK to OFF at the CMake stage. It is also feasible to do it at runtime, but that would bring in extra overhead.

Added a new test case for taskgroup

Fixed test case description

Disabled unshackled task on macOS as well

tianshilei1992 retitled this revision from [OpenMP][WIP] Added the support for unshackled task in RTL to [OpenMP] Added the support for unshackled task in RTL.Oct 19 2020, 12:58 PM

I left some comments. Generally, I would prefer we minimize the use of the macro to elide declarations. I'd also prefer to use the macro as part of the conditions to avoid duplication.
Instead of

#ifdef MACRO
foo(X);
#else
foo(Y);
#endif

we do

v = MACRO ? X : Y;
foo(v);

which is really helpful if foo is complex code and just as fast.
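The suggestion above can be made concrete; a small compilable sketch with illustrative names:

```c
#include <assert.h>

/* Compilable form of the suggestion: define the knob to 0 or 1 and fold it
 * into the condition, keeping a single call site instead of two #ifdef arms
 * duplicating the call. */
#ifndef USE_HIDDEN_HELPER
#define USE_HIDDEN_HELPER 0
#endif

int pick_thread_count(int helper_threads, int regular_threads) {
  /* One expression; the compiler folds the constant away. */
  return USE_HIDDEN_HELPER ? helper_threads : regular_threads;
}
```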

@adurang @AndreyChurbanov have your concerns been addressed?

openmp/runtime/src/kmp.h
2244–2245

Do we really need the USE_UNSHACKLED_TASK flag? Even if we want it, we don't need it here in the struct. Let's waste one bit on Windows until we catch up, and remove complexity for everyone.

2299

Grammar:
"The task team of its parent task team"
and
"therefore we it when this task is created"

2309

Similarly, I don't think the byte savings here are worth it.

openmp/runtime/src/kmp_runtime.cpp
3644

Is the code in the #else case the same as in the } else { case? If so, make the conditional if (USE_UNSHACKLED_TASK && ...) and avoid the duplication of bugs.

4334

Nit: can we move the initialization out of the loop? It is hard to read. A comment might help as well.

Looking at the code more generally, is this the same code as below with different bounds? If so, please avoid the duplication altogether, the same way as suggested above.

@adurang @AndreyChurbanov have your concerns been addressed?

I didn't see the problem with release_deps being solved (maybe I missed it). And I think we should really have a mechanism to set the number of threads instead of a hardcoded '8', and not have the threads created until necessary.

Also, given efforts in OpenMP to remove "master" and similar terms, maybe we should think about renaming "unshackled" to something else like "helper" or "auxiliary"? I know it is a bit of a pain to do, so I won't press for this, but I thought I should mention it.

tianshilei1992 added inline comments.Oct 28 2020, 7:41 AM
openmp/runtime/src/kmp_taskdeps.h
126

@adurang The problem of release_deps was fixed here.

Enhanced one test case and fixed some comments

tianshilei1992 marked 2 inline comments as done.Oct 28 2020, 8:04 PM

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

Changed some code to make it more readable and less complex.

The failing case is because the gtid is not offset. What is the right way to detect whether a CMake variable or macro is defined?

jdoerfert added a comment.EditedOct 29 2020, 1:08 PM

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.

What is a right way to detect whether a CMake variable or macro is defined?

In C/C++ (#ifdef) or in CMake (idk)?

Added support for setting number of unshackled threads via environment variable

The point is, our test cases are not run by CMake, so they cannot detect whether we define any CMake variable.

Then make USE_UNSHACKLED_TASK default and remove all the uses that elide declarations and definitions.

ye-luo added a comment.EditedNov 10 2020, 10:05 AM

It would be better to have a way to elide unshackled thread team creation at runtime before turning LIBOMP_USE_UNSHACKLED_TASK on by default.

It's already included in this patch.

Added the missing variable initialization

Removed the macro USE_UNSHACKLED_TASK

tianshilei1992 edited the summary of this revision. (Show Details)Nov 11 2020, 8:58 AM
tianshilei1992 marked 3 inline comments as done.Nov 11 2020, 12:51 PM
jdoerfert accepted this revision.Dec 18 2020, 7:21 PM

As far as I can tell the issues have been addressed. This has been sitting here a while, let's get it in so we get more exposure. LGTM

If you go over your comments once more, add punctuation to make all of them sentences. If you want to change "unshackled" to "hidden_helper" or similar, that might be good.

This revision is now accepted and ready to land.Dec 18 2020, 7:21 PM

Updated the patch to use more inclusive words

tianshilei1992 retitled this revision from [OpenMP] Added the support for unshackled task in RTL to [OpenMP] Added the support for hidden helper task in RTL.Dec 19 2020, 6:01 PM
tianshilei1992 edited the summary of this revision. (Show Details)

Fixed one remaining part

Still something left...

Fixed a bug in __kmp_release_deps

Refined test cases and rebased

This revision was landed with ongoing or failed builds.Jan 16 2021, 11:13 AM
This revision was automatically updated to reflect the committed changes.

This broke building OpenMP for Windows; all the new helper functions, like __kmp_hidden_helper_threads_initz_wait, that are added in z_Linux_util.cpp would need to be added similarly in z_Windows_NT_util.cpp. What do you propose doing: revert the patch for now until that's in place?

tianshilei1992 reopened this revision.Jan 18 2021, 3:58 AM

Reopened as the change was reverted

This revision is now accepted and ready to land.Jan 18 2021, 3:58 AM
tianshilei1992 planned changes to this revision.Jan 18 2021, 4:02 AM

This broke building OpenMP for windows; all the new helper functions, like __kmp_hidden_helper_threads_initz_wait, that are added in z_Linux_util.cpp would need to be added similarly to z_Windows_NT_util.cpp. What do you propose doing - revert the patch for now until that's in place?

Thanks for the report. We previously had a macro controlling whether the feature is enabled. On Windows the macro was not defined, so the corresponding parts in common files were not built there. Later we decided to remove the macro and turn the feature ON by default, but I forgot to add the logic in the Windows files, and I didn't have a Windows machine then.

I’ve reverted the change and will fix the issue.

@ronlieb tells me an out-of-tree offloading test (aomp/test/smoke/devices) started crashing (hangs/segv/fp exception) with this patch applied. That doesn't make sense to me, since this doesn't appear to change the target offloading logic, but it might be a smoking gun for a lifetime management error somewhere in the above. Does anyone know if the host OpenMP runtime is expected to be clean under things like valgrind or thread sanitizer?

Added missing functions on Windows but forced __kmp_enable_hidden_helper to
FALSE on all non-Linux platforms

This revision is now accepted and ready to land.Jan 20 2021, 5:51 PM

@mstorsjo Would you mind giving it a shot on Windows?

tianshilei1992 requested review of this revision.Jan 20 2021, 5:53 PM

@mstorsjo Would you mind giving it a shot on Windows?

Looks like it builds correctly now, thanks!

I haven't tested it practically (except for a very trivial smoke test) to see if it breaks anything at runtime, but at least it no longer regresses the build.

JonChesterfield added a comment.EditedJan 22 2021, 8:33 AM

The information I've got on the possible race is:
When this patch is applied (by git's automerge, I think) to the rocm stack, a test located at:
https://github.com/ROCm-Developer-Tools/aomp/blob/master/test/smoke/devices/devices.c
fails in unpredictable fashion.

I've reproduced the test here as it's fairly short, but it uses some functions on the device that the trunk implementation returns zero for. I adjusted it so it builds on trunk. Run it as:

export LD_LIBRARY_PATH=$HOME/llvm-install/lib/ ; $HOME/llvm-install/bin/clang  -O2  -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_50   devices.c -o devices -L/usr/local/cuda/targets/x86_64-linux/lib -lcudart && valgrind --fair-sched=yes ./devices
// devices.c
#include <stdio.h>
#include <omp.h>

int main() {
  int num_devs = omp_get_num_devices();
  for (int device_num = 0; device_num < num_devs ; device_num++) {
#pragma omp target device(device_num) nowait
#pragma omp teams num_teams(2) thread_limit(4)
#pragma omp parallel num_threads(2)
    {
      // need to pass the total device number to all devices, per module load
      int num_threads = omp_get_num_threads();
      int num_teams   = omp_get_num_teams();
      int num_devices = omp_get_num_devices(); // not legal in 4.5

      // need to pass the device id to the device starting the kernel
      int thread_id   = omp_get_thread_num();
      int team_id     = omp_get_team_num();
      int device_id   = 0; // omp_get_device_num();  // no API in omp 4.5

      // assume we have homogeneous devices
      int total_threads = num_devices * num_teams * num_threads;
      int gthread_id    = (device_id * num_teams * num_threads) + (team_id * num_threads) + thread_id;

      // print out id
      printf("Hello OpenMP 5 from \n");
      printf(" Device num  %d of %d devices\n", device_id, num_devices);
      printf(" Team num    %d of %d teams  \n", team_id,   num_teams);
      printf(" Thread num  %d of %d threads\n", thread_id, num_threads);
      printf(" Global thread %d of %d total threads\n", gthread_id, total_threads);
    };
  };
#pragma omp taskwait
  printf("The host device num is %d\n", omp_get_device_num());
  printf("The initial device num is %d\n", omp_get_initial_device());
  printf("The number of devices are %d\n", num_devs);
}

Trunk before this patch makes use of uninitialized memory, but the test succeeds (prints a lot of output).

==27099== Conditional jump or move depends on uninitialised value(s)
==27099==    at 0x4C36DC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
==27099==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
==27099==    by 0x4B5B688: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1562)
==27099==    by 0x4B5B8BB: __kmp_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1679)
==27099==    by 0x4B5BB7E: __kmpc_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1739)
==27099==    by 0x401309: main

With this patch applied, most of the print output is lost, and the uninitialized-data error changes:

The host device num is 1
The initial device num is 1
==20091== Thread 9:
==20091== Conditional jump or move depends on uninitialised value(s)
==20091==    at 0x4C3ADC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
==20091==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
==20091==    by 0x4B5C399: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1633)
==20091==    by 0x4B60012: int __kmp_execute_tasks_template<kmp_flag_64<false, true> >(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3012)
==20091==    by 0x4B6AE91: int __kmp_execute_tasks_64<false, true>(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3111)
==20091==    by 0x4B79901: kmp_flag_64<false, true>::execute_tasks(kmp_info*, int, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_wait_release.h:915)
==20091==    by 0x4B7497C: bool __kmp_wait_template<kmp_flag_64<false, true>, true, false, true>(kmp_info*, kmp_flag_64<false, true>*, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:345)
==20091==    by 0x4B797D9: kmp_flag_64<false, true>::wait(kmp_info*, int, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:922)
==20091==    by 0x4B70559: __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:672)
==20091==    by 0x4B7401D: __kmp_fork_barrier(int, int) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:1982)
==20091==    by 0x4B3B701: __kmp_launch_thread (llvm-project/openmp/runtime/src/kmp_runtime.cpp:5776)
==20091==    by 0x4BB976D: __kmp_launch_worker(void*) (llvm-project/openmp/runtime/src/z_Linux_util.cpp:591)
==20091== 
The number of devices are 1
CUDA error: Error returned from cuDeviceGet

This is more obvious in the AMD implementation because it segfaults on a null pointer dereference.

If you take a look at the code around interface.cpp:470, it is:

EXTERN int __tgt_target_teams_nowait_mapper(
    ident_t *loc, int64_t device_id, void *host_ptr, int32_t arg_num,
    void **args_base, void **args, int64_t *arg_sizes, int64_t *arg_types,
    map_var_info_t *arg_names, void **arg_mappers, int32_t team_num,
    int32_t thread_limit, int32_t depNum, void *depList, int32_t noAliasDepNum,
    void *noAliasDepList) {
  TIMESCOPE();
  if (depNum + noAliasDepNum > 0)
    __kmpc_omp_taskwait(loc, __kmpc_global_thread_num(loc));

  return __tgt_target_teams_mapper(loc, device_id, host_ptr, arg_num, args_base,
                                   args, arg_sizes, arg_types, arg_names,
                                   arg_mappers, team_num, thread_limit);
}

Line 470 is if (depNum + noAliasDepNum > 0). The reason it raises an error is that depNum and noAliasDepNum are not passed to the function call at all, due to a known issue we have in clang. Actually, depNum, depList, noAliasDepNum, and noAliasDepList are all unpassed at the call site. So the issue you encountered probably has nothing to do with this part.

I did try on my local systems with NVIDIA GPUs and didn't encounter any crash/hang in 1000 runs. The only potential problem is that printf in the target region doesn't work at all, which I believe has nothing to do with this patch.

Fixed some issues in miniQMC

The only potential problem is printf in the target region doesn't work at all, which I believe has nothing to do with this patch.

Do you see the print statements within the region without this patch applied? On sm_50, I see all the print output on trunk, but the ones inside target are missing with this patch.

I suspect there is a race condition in the library that this patch has exposed.

Did you run the test under valgrind? The fair scheduler setting does a reasonable job of perturbing thread order, though I suppose one should use an actual race detector instead.

rebased and remove unnecessary struct data member

I tested the latest version of this patch, and it can print out all information. Can you give it a shot on your side with AMD GPUs?

The test still doesn't work ideally on amdgpu, but it no longer crashes, and some of the print statements within the target region are seen.

jdoerfert accepted this revision.Jan 25 2021, 7:12 PM

Known issues resolved; AMDGPU is not yet a supported target and is hard to test right now. LG

This revision is now accepted and ready to land.Jan 25 2021, 7:12 PM
This revision was landed with ongoing or failed builds.Jan 25 2021, 7:16 PM
This revision was automatically updated to reflect the committed changes.

I'm getting a segfault when running code with target nowait compiled for x86 offloading. The segfault is in __kmp_push_task, for a task marked as hidden_task.

I tried to find the thread with __kmp_gtid = 2 (assuming that's still the thread identified as gtid=2):

(gdb) t 11
[Switching to thread 11 (Thread 0x2aab18000800 (LWP 16111))]
(gdb) p __kmp_gtid
$34 = 2
(gdb) bt
#0  0x00002aaabddea9cc in .omp_outlined._debug__ (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:109
openmp/runtime/src/kmp_tasking.cpp
363

I'm getting the segfault here. When I look at task_team, it is 0x0.

taskdata->td_flags.hidden_helper = 1
gtid = 2
__kmp_threads[gtid]->th.th_task_team = 0x0

I'm getting a segfault when running code with target nowait compiled for x86 offloading. The segfault is in __kmp_push_task for a task marked as a hidden helper task.

I tried to find the thread with __kmp_gtid = 2 (assuming that's still the thread identified as gtid=2):

(gdb) t 11
[Switching to thread 11 (Thread 0x2aab18000800 (LWP 16111))]
(gdb) p __kmp_gtid
$34 = 2
(gdb) bt
#0  0x00002aaabddea9cc in .omp_outlined._debug__ (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:109
#1  0x00002aaabddeaa95 in .omp_outlined. (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:105
#2  0x00002aaaab584803 in __kmp_invoke_microtask () at llvm-project/openmp/runtime/src/z_Linux_asm.S:1166
#3  0x00002aaaab51741c in __kmp_fork_call (loc=0x2aaabdfeada0, gtid=<optimized out>, call_context=fork_context_intel, argc=3, microtask=<optimized out>, invoker=0x2aaaab51c020 <__kmp_invoke_task_func>, 
    ap=0x2aab17fff1d0) at llvm-project/openmp/runtime/src/kmp_runtime.cpp:1906
#4  0x00002aaaab509048 in __kmpc_fork_call (loc=0x2aaabdfeada0, argc=<optimized out>, microtask=0x2aaabddeaa60 <.omp_outlined.>) at llvm-project/openmp/runtime/src/kmp_csupport.cpp:307
#5  0x00002aaabddea8aa in __omp_offloading_3b_1502eaf5__Z24BlockMatMul_TargetNowaitR11BlockMatrixS0_S0__l101_debug__ (BlockC=0x2aab20000d30, BlockA=0x2aab3c010da0, BlockB=0x2aab40010da0) at targetnowait.cpp:105
#6  0x00002aaabddeaac5 in __omp_offloading_3b_1502eaf5__Z24BlockMatMul_TargetNowaitR11BlockMatrixS0_S0__l101 (BlockC=0x2aab20000d30, BlockA=0x2aab3c010da0, BlockB=0x2aab40010da0) at targetnowait.cpp:101
#7  0x00002aaaadccce2c in ffi_call_unix64 () from /lib64/libffi.so.6
#8  0x00002aaaadccc755 in ffi_call () from /lib64/libffi.so.6
#9  0x00002aaaadac4a56 in __tgt_rtl_run_target_team_region () from /home/x/sw/UTIL/clang//12.0-release/lib/../lib/libomptarget.rtl.x86_64.so
#10 0x00002aaaab7c0be0 in DeviceTy::runTeamRegion(void*, void**, long*, int, int, int, unsigned long, __tgt_async_info*) () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12
#11 0x00002aaaab7d02f2 in target(ident_t*, long, void*, int, void**, void**, long*, long*, void**, void**, int, int, int) () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12
#12 0x00002aaaab7c5d96 in __tgt_target_teams_mapper () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12

We also got a report of this issue on the openmp-dev mailing list. I'll investigate it.

Is it intended that the threads executing the host offloading use the same gtid as the hidden threads?

It is because the task needs to be executed by a hidden helper thread.

Post-commit issue:
Our downstream testing of our release branch revealed an assertion failure in kmp_runtime.cpp while running our rocFFT application.
The rocFFT application does not use OpenMP offload; rather, it uses HIP and host OpenMP threads.
When we reverted this patch locally, it allowed the application to compile and run successfully.

root@ixt-sjc2-13:/root/Staging/MathLibs/rocFFT/build/release/clients/staging# cd /root/Staging/MathLibs/rocFFT/build/release/clients/staging; ./rocfft-test --gtest_filter=rocfft_UnitTest.simple_multithread_1D
rocFFT version: 1.0.9.a07759d-dirty
Note: Google Test filter = rocfft_UnitTest.simple_multithread_1D
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from rocfft_UnitTest
[ RUN ] rocfft_UnitTest.simple_multithread_1D
OMP: Error #13: Assertion failure at kmp_runtime.cpp(3691).

The patch we reverted to get us back to our happy place:

commit 9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08
Author: Shilei Tian <tianshilei1992@gmail.com>
Date: Mon Jan 25 22:14:52 2021 -0500

[OpenMP] Added the support for hidden helper task in RTL

Without a reproducer, I cannot tell what went wrong. Also, your code is out of date: what is the assertion at line 3691 in kmp_runtime.cpp?

[AMD Official Use Only - Internal Distribution Only]

I was requested to add the information about the failure observed in our product testing, so I added a comment to your patch so you would be aware of it. Perhaps down the road someone will encounter it with a simpler upstream test case.

I don't have a small reproducer; however, the release engineer (David, cc'ed) says it might take him a few hours or more. He is willing to try, so maybe we will be able to provide one.... might be today or tomorrow.

Ron

Hi Ron,

even without a reproducer, it would certainly help if you could map your
line 3691 to a line of code we can find in the upstream repository.
Neither main nor the release branch has an assertion on that line:

https://github.com/llvm/llvm-project/blob/main/openmp/runtime/src/kmp_runtime.cpp#L3691

https://github.com/llvm/llvm-project/blob/release/12.x/openmp/runtime/src/kmp_runtime.cpp#L3691

Best
Joachim

On 15.03.21 at 20:53, Lieberman, Ron wrote:


latest trunk has the assert in question at line 3651:

3638    } else {
3639      /* find an available thread slot */
3640      // Don't reassign the zero slot since we need that to only be used by
3641      // initial thread. Slots for hidden helper threads should also be skipped.
3642      if (initial_thread && __kmp_threads[0] == NULL) {
3643        gtid = 0;
3644      } else {
3645        for (gtid = __kmp_hidden_helper_threads_num + 1;
3646             TCR_PTR(__kmp_threads[gtid]) != NULL; gtid++)
3647          ;
3648      }
3649      KA_TRACE(
3650          1, ("__kmp_register_root: found slot in threads array: T#%d\n", gtid));
3651      KMP_ASSERT(gtid < __kmp_threads_capacity);
3652    }

our Jan 27th internal merge has this at line 3691 of kmp_runtime.cpp:

3678    } else {
3679      /* find an available thread slot */
3680      // Don't reassign the zero slot since we need that to only be used by
3681      // initial thread. Slots for hidden helper threads should also be skipped.
3682      if (initial_thread && __kmp_threads[0] == NULL) {
3683        gtid = 0;
3684      } else {
3685        for (gtid = __kmp_hidden_helper_threads_num + 1;
3686             TCR_PTR(__kmp_threads[gtid]) != NULL; gtid++)
3687          ;
3688      }
3689      KA_TRACE(
3690          1, ("__kmp_register_root: found slot in threads array: T#%d\n", gtid));
3691      KMP_ASSERT(gtid < __kmp_threads_capacity);
3692    }

I'm also hitting asserts with this change (as I've told @jdoerfert on IRC previously).

I'm reliably hitting "Assertion failure at kmp_runtime.cpp(4314): new_gtid < __kmp_threads_capacity."
unless I specify LIBOMP_NUM_HIDDEN_HELPER_THREADS=0.
It's both easy to repro and not: I don't have a standalone repro.

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

I would suggest reverting this.

Seems like the two assertions mentioned above are caused by the same problem: __kmp_threads is somehow touched and all elements are non-NULL. I'd appreciate it if someone could provide a reproducer.

That's three independent reports of stuff breaking after this patch. There are a bunch of locks and condition variables involved, and it looks suspicious to me that the introduced variables are volatile but not atomic.

I don't think we have robust enough in tree testing to say this patch is sound, and multiple independent reports suggest it is not. I think we have to pull it until we can work out what's gone wrong, or rewrite it to be simple enough to reliably audit the concurrency.

How about setting LIBOMP_USE_HIDDEN_HELPER_TASK=OFF by default? That way we can keep this commit but keep user code happy while we investigate more.

I'm starting to have doubts about the thread safety of this library in general so would lean towards removing the commit entirely such that the remainder is easier to reason about. That way we can be fairly sure we've removed whatever bug this introduced so have ~ one fewer race to try to pin down.

The failure we saw in our downstream fork testing is pure host OpenMP code, no device code of any kind.

The pragmas all looked like this

#pragma omp parallel for reduction(max : linf) reduction(+ : l2) num_threads(partitions.size())

Going forward, if the pre-commit testing does not already include SPEC CPU speed runs with OpenMP enabled and SPEC OMP2012, I would like to ask that they be added, plus any other OpenMP applications that folks think might help stress the task helper patch.

Ron

Again, it doesn't help if we don't have a way to reproduce it. We can disable it, we can revert it, sure, but it will NEVER be re-enabled, because without a reproducer we can't tell what is wrong, and nobody will use it if it is disabled. We can't guarantee that rewriting the whole thing in a "simpler" way will work if we don't have a way to test it.

One of the drawbacks of limited trunk testing of OpenMP is that we're reliant on out-of-trunk people noticing that something looks odd. I don't want to set a precedent of downstream forks reverting patches that fail local testing, as that would remove a bunch of the ad hoc testing we do have.

Closely related, we really do need CI. I'm told that's a work in progress for amdgpu. Even without a live GPU box, it should be possible to exercise some runtime testing of the host code, which would have sufficed to raise awareness of this patch.

Also, the reproducer doesn't need to be a small piece of code. It can be steps to reproduce it as long as I can access the source code.

Also, the reproducer doesn't need to be a small piece of code. It can be steps to reproduce it as long as I can access the source code.

I will state repro steps once this is reverted.

FYI, another aspect of reverting this one: it is part of the 12.x release branch too (which is drawing very close to the actual release), so if it needs to be reverted, maybe it needs to be reverted there too.

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

Linking two OpenMP runtime libraries into one application is guaranteed to break things. You have basically no way to guarantee that calls to API functions go to the right runtime.

I tried using
export LIBOMP_USE_HIDDEN_HELPER_TASK=0 and rebuilding/rerunning SPEC CPU2017 fpspeed base,

and still see the performance issues in 619.lbm and all the other fpspeed benchmarks.
The GeoMean dropped approx. 30%.

@tianshilei1992 please review my comments, they might explain why the assertion triggers.

openmp/runtime/src/kmp_runtime.cpp
3632

This check is not aware of the reserved hidden helper threads. __kmp_expand_threads will only be called if __kmp_all_nth exceeds the capacity limit. Even if the hidden threads are included in __kmp_all_nth, this check does not consider the hole in the thread array.

3663

This load used to be TCR_PTR

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

Linking two OpenMP runtime libraries into one application is guaranteed to break things. You have basically no way to guarantee that calls to API functions go to the right runtime.

Presumably you have read my comment in its entirety, and saw that both the libgomp and libomp used are from LLVM?
What's the point of LLVM's libgomp then?

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

Linking two OpenMP runtime libraries into one application is guaranteed to break things. You have basically no way to guarantee that calls to API functions go to the right runtime.

Presumably you have read my comment in its entirety, and saw that both the libgomp and libomp used are from LLVM?
What's the point of LLVM's libgomp then?

My point was on this specific part:

whether or not the libgomp used is provided at runtime by gcc or llvm

You explicitly listed the case where the libgomp from gcc is loaded at execution time. The most tedious issues I had were with a third-party library that was statically linked against libgomp. You won't spot the gcc libgomp with ldd.

As long as you make sure that only the LLVM OpenMP runtime is loaded during execution, it should work, yes.
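A few commands that can help verify which OpenMP runtimes an application really pulls in (a sketch; `./app` and `<pid>` are placeholders for the binary and process under test):

```shell
# Dynamically linked runtimes show up in ldd output:
ldd ./app | grep -E 'libomp|libgomp' || echo "no dynamically linked OpenMP runtime"

# A statically linked libgomp will NOT appear above; if the binary is not
# stripped, its GOMP_* entry points are still visible in the symbol table:
nm ./app 2>/dev/null | grep -c ' GOMP_'

# At execution time, inspect what a running process actually mapped:
grep -E 'libomp|libgomp' /proc/<pid>/maps | awk '{print $6}' | sort -u
```

The last check is the decisive one for the dlopen case, since a runtime loaded lazily by a plugin never shows up in ldd output.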

I tried using
export LIBOMP_USE_HIDDEN_HELPER_TASK=0 and rebuilding/rerunning SPEC CPU2017 fpspeed base,

and still see the performance issues in 619.lbm and all the other fpspeed benchmarks.
The GeoMean dropped approx. 30%.

Wasn't SPEC CPU meant to measure single-core performance? I can see how -fopenmp or -fopenmp-simd might help to turn on vectorization. But none of these flags should turn on OpenMP directives present in the code and make the code multi-threaded (in SPEC CPU 2006 there were actually #if !defined(SPEC_CPU) guards around all OpenMP directives and includes). Are the resulting binaries really linked against libomp? Without any OpenMP symbols in the application, the linker should just drop libomp.

Originally, yes, SPEC CPU was intended for single core/CPU or rate runs.
With the advent of SPEC CPU 2017 and the explosion of multicore, SPEC decided to add OpenMP to the speed benchmarks so that compilers could utilize more cores.
The benchmarks have OpenMP pragmas/directives.

Who knows, in 10 years maybe SPEC CPU will want to treat GPUs as an extension of the CPU (OpenMP offload).

I found a stable way to reproduce the assertion. Let's say the default __kmp_threads_capacity is N. If the hidden helper thread feature is enabled, __kmp_threads_capacity is offset to N+8 by default. If the number of threads we need exceeds N+8, e.g. via the num_threads clause, we need to expand __kmp_threads. In __kmp_expand_threads, the expansion starts from __kmp_threads_capacity and repeatedly doubles it until the new capacity meets the requirement. Let's assume the new requirement is Y. If Y happens to satisfy the constraint (N+8)*2^X = Y, where X is the number of doubling iterations, then the new capacity is not enough, because 8 of the slots are reserved for hidden helper threads.

#include <vector>

int main(int argc, char *argv[]) {
  constexpr const size_t N = 1344;
  std::vector<int> data(N);

#pragma omp parallel for
  for (unsigned i = 0; i < N; ++i) {
    data[i] = i;
  }

#pragma omp parallel for num_threads(N)
  for (unsigned i = 0; i < N; ++i) {
    data[i] += i;
  }

  return 0;
}

Here is an example. My CPU is 20C/40T, so __kmp_threads_capacity is 160. After the offset, __kmp_threads_capacity becomes 168. 1344 = (160+8)*2^3, and then the assertion hits.

I'll fix it right away.

Try to fix the crash in D98838

I think the fundamental issue of this patch is that it broke the implicit assumption that entries in __kmp_threads are handed out contiguously. After spending quite some effort trying to identify locations where this implicit assumption is now broken, I think much more effort is needed to identify all the places which rely on this assumption and are now broken.

I think the fundamental issue of this patch is that it broke the implicit assumption that entries in __kmp_threads are handed out contiguously. After spending quite some effort trying to identify locations where this implicit assumption is now broken, I think much more effort is needed to identify all the places which rely on this assumption and are now broken.

If we set the CMake flag to false and 0, we don't break those assumptions, right? Let's do that.

Herald added a project: Restricted Project. · View Herald Transcript · Feb 14 2023, 5:28 AM
Munesanz removed a subscriber: Munesanz. · Feb 14 2023, 5:28 AM