This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Added the support for hidden helper task in RTL
ClosedPublic

Authored by tianshilei1992 on Apr 6 2020, 4:54 PM.

Details

Summary

The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use pthread_create to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls __kmpc_fork_call. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via pthread_create on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.

Two environment variables, LIBOMP_NUM_HIDDEN_HELPER_THREADS and LIBOMP_USE_HIDDEN_HELPER_TASK, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.

Here are some open issues to be discussed:

  1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Fixed the issue in must_wait

tianshilei1992 marked an inline comment as done.Aug 31 2020, 12:06 PM
adurang added inline comments.Sep 1 2020, 4:57 AM
openmp/runtime/src/kmp_tasking.cpp
3726

For the code below:

#pragma omp parallel num_threads(2)
{
#pragma omp target nowait
   blah()
#pragma omp taskwait
}

With your current code (because you're using a shared counter for the whole team), both thread 1 and 2 are waiting for each others target regions ( so for example, even if target-th1 was finished thread1 would be blocked until target-th2 was completed). Each taskwait should only be waiting for their own child target tasks.

Hope this helps.

tianshilei1992 added inline comments.Sep 1 2020, 9:54 AM
openmp/runtime/src/kmp_tasking.cpp
3726

Thanks for the explanation. But this lines of code are in the function __kmp_task_team_wait that is not called by __kmpc_omp_taskwait. If I understand correctly, __kmp_task_team_wait is called by the master thread of a team to wait for all tasks created in the team to finish so that it can proceed. So we need to wait for all unshackled tasks encountered/created in the task team.

ye-luo added a subscriber: ye-luo.Sep 1 2020, 2:00 PM
ye-luo added inline comments.
openmp/runtime/src/kmp_tasking.cpp
3726

@adurang your example demonstrates exactly the code pattern I use. taskwait should only wait for the child tasks.

adurang added inline comments.Sep 1 2020, 3:00 PM
openmp/runtime/src/kmp_tasking.cpp
3726

Ah sorry, I didn't notice the patch changing functions. I should really look at the whole file!

But if taskwait is working correctly the flag.wait call in __kmp_task_team_wait should also make sure that no outstanding unshackled tasks are left and then I don't think the extra check shouldn't be needed. Do you have any tests with taskgroup/taskwait?

In any case, could you move it inside the same if statement as the other checks? Also, you need to set tt_unfinished_unshackled_tasks to FALSE in case the same task_team structure is reused (Note that is done the same for various fields just above).

Added another condition to see whether we need to wait in the task team

Couple of tests needed to check if the implementation works - one with unshackled task encountered before parallel, and another with unshackled task encountered after / between parallels.

openmp/runtime/src/kmp_runtime.cpp
4333

This condition is false if new_gtid started with (__kmp_unshackled_threads_num + 1), that is for regular thread. Thus all threads will mistakenly get the same gtid.

Fixed some problems and added the first test case. More cases are on the way.

Added another test case to test dependence process

Fixed a potental race condition

Still trying to fix a race problem

Added the missing part for the team destroy

Refactored code in z_Linux_util.cpp

Updated some tests

tianshilei1992 marked 5 inline comments as done.Oct 16 2020, 1:38 PM

Enabled unshackled thread by default

ye-luo added a comment.EditedOct 16 2020, 4:31 PM

Enabled unshackled thread by default

What is the current supported way of turning off unshackled thread team with zero side effect?

Enabled unshackled thread by default

What is the current supported way of turning off unshackled thread team with zero side effect?

Currently it can be disabled by setting LIBOMP_USE_UNSHACKLED_TASK to OFF at the CMake stage. It is also feasible to make it during runtime but that would bring in extra overhead.

Added a new test case for taskgroup

Fixed test case description

Disabled unshackled task on macOS as well

tianshilei1992 retitled this revision from [OpenMP][WIP] Added the support for unshackled task in RTL to [OpenMP] Added the support for unshackled task in RTL.Oct 19 2020, 12:58 PM

I left some comments. Generally, I would prefer we minimize the use of the macro to elide declarations. I'd also prefer to use the macro as part of the conditions to avoid duplication.
Instead of

#ifdef MACRO
foo(X)
#else 
foo(Y)
#endif

we do

v = MACRO ? X : Y;
foo(v);

which is really helpful if foo is complex code and just as fast.

@adurang @AndreyChurbanov have your concerns been addressed?

openmp/runtime/src/kmp.h
2337–2338

Do we really need the USE_UNSHACKLED_TASK flag? Even if we want it, we don't need it here in the struct. Let's waste one bit on windows until we catch up and remove complexity for everyone.

2392

Grammar:
"The task team of its parent task team"
and
"therefore we it when this task is created"

2402

Similarly, I don't think the byte savings here are worth it.

openmp/runtime/src/kmp_runtime.cpp
3648

Is the code in the #else case the same as in the } else { case? If so, make the conditional if (USE_UNSHACKLED_TASK && ...) and avoid the duplication of bugs.

4332

Nit: can we move the initialization out of the loop, hard to read. A comment might help as well.

Looking at the code more generally, is this the same code as below with different bounds? If so, avoid the duplication all together please, same way as suggested above.

@adurang @AndreyChurbanov have your concerns been addressed?

I didn't see the problem with release_deps being solved (maybe I missed). And I think we should really have a mechanism to set the number of threads instead of a hardcoded '8' and not have the threads created until is necessary.

Also, given efforts in OpenMP to remove master and similar terms maybe we should think about renaming "unshackled" to something else like "helper" or "auxilary"? I know is a bit of a pain to do that so I won't press for this but thought that I should mention it.

tianshilei1992 added inline comments.Oct 28 2020, 7:41 AM
openmp/runtime/src/kmp_taskdeps.h
136

@adurang The problem of release deeps was fixed here.

Enhanced one test case and fixed some comments

tianshilei1992 marked 2 inline comments as done.Oct 28 2020, 8:04 PM

I left some comments. Generally, I would prefer we minimize the use of the macro to elide declarations. I'd also prefer to use the macro as part of the conditions to avoid duplication.
Instead of

#ifdef MACRO
foo(X)
#else 
foo(Y)
#endif

we do

v = MACRO ? X : Y;
foo(v);

which is really helpful if foo is complex code and just as fast.

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

Changed some code to make it more readable and less complex.

The failed case is because the gtid is not offset. What is a right way to detect whether a CMake variable or macro is defined?

jdoerfert added a comment.EditedOct 29 2020, 1:08 PM

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.

What is a right way to detect whether a CMake variable or macro is defined?

In C/C++ (#ifdef) or in CMake (idk)?

Added support for setting number of unshackled threads via environment variable

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.

What is a right way to detect whether a CMake variable or macro is defined?

In C/C++ (#ifdef) or in CMake (idk)?

The point is, our test cases are not run by CMake, so it cannot detect whether we define any variable.

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.

What is a right way to detect whether a CMake variable or macro is defined?

In C/C++ (#ifdef) or in CMake (idk)?

The point is, our test cases are not run by CMake, so it cannot detect whether we define any variable.

Then make USE_UNSHACKLED_TASK default and remove all the uses that elide declarations and definitions.

ye-luo added a comment.EditedNov 10 2020, 10:05 AM

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.

What is a right way to detect whether a CMake variable or macro is defined?

In C/C++ (#ifdef) or in CMake (idk)?

The point is, our test cases are not run by CMake, so it cannot detect whether we define any variable.

Then make USE_UNSHACKLED_TASK default and remove all the uses that elide declarations and definitions.

Better to have a way to elide unshackled thread team creation at runtime before putting LIBOMP_USE_UNSHACKLED_TASK by default.

Some variables are only defined when the MACRO is enabled. I have changed some code to make it more readable and less complex.

As I said before, I don't see the point in omitting declarations. It just increases our testing surface for no real benefit. If you don't use this but have two more functions and a few declarations, all of which you don't use, you really don't pay a price in the big scheme of things.

What is a right way to detect whether a CMake variable or macro is defined?

In C/C++ (#ifdef) or in CMake (idk)?

The point is, our test cases are not run by CMake, so it cannot detect whether we define any variable.

Then make USE_UNSHACKLED_TASK default and remove all the uses that elide declarations and definitions.

Better to have a way to elide unshackled thread team creation at runtime before putting LIBOMP_USE_UNSHACKLED_TASK by default.

It's already included in this patch.

Added the missing variable initialization

Removed the marcro USE_UNSHACKLED_TASK

tianshilei1992 edited the summary of this revision. (Show Details)Nov 11 2020, 8:58 AM
tianshilei1992 marked 3 inline comments as done.Nov 11 2020, 12:51 PM
jdoerfert accepted this revision.Dec 18 2020, 7:21 PM

As far as I can tell the issues have been addressed. This has been sitting here a while, let's get it in so we get more exposure. LGTM

If you go over your comments once more, add punctuation to make all of them sentences. If you want to change "unshackled" to "hidden_helper" or similar, that might be good.

This revision is now accepted and ready to land.Dec 18 2020, 7:21 PM

Updated the patch to use more inclusive words

tianshilei1992 retitled this revision from [OpenMP] Added the support for unshackled task in RTL to [OpenMP] Added the support for hidden helper task in RTL.Dec 19 2020, 6:01 PM
tianshilei1992 edited the summary of this revision. (Show Details)

Fixed one remained part

Still something left...

Fixed a bug in __kmp_release_deps

Refined test cases and rebased

This revision was landed with ongoing or failed builds.Jan 16 2021, 11:13 AM
This revision was automatically updated to reflect the committed changes.

This broke building OpenMP for windows; all the new helper functions, like __kmp_hidden_helper_threads_initz_wait, that are added in z_Linux_util.cpp would need to be added similarly to z_Windows_NT_util.cpp. What do you propose doing - revert the patch for now until that's in place?

tianshilei1992 reopened this revision.Jan 18 2021, 3:58 AM

reopen as the change was reverted

This revision is now accepted and ready to land.Jan 18 2021, 3:58 AM
tianshilei1992 planned changes to this revision.Jan 18 2021, 4:02 AM

This broke building OpenMP for windows; all the new helper functions, like __kmp_hidden_helper_threads_initz_wait, that are added in z_Linux_util.cpp would need to be added similarly to z_Windows_NT_util.cpp. What do you propose doing - revert the patch for now until that's in place?

Thanks for the report. We had a macro controlling whether the feature is enabled before. On Windows the macro is not defined so that corresponding parts in common files will not be built on Windows. Later we decided to remove the macro and turn the feature ON by default but I forgot to add the logic in Windows files, and I didn’t have Windows machines then.......

I’ve reverted the change and will fix the issue.

@ronlieb tells me an out of tree offloading test (aomp/test/smoke/devices) started crashing (hangs/segv/fp exception) with this patch applied. That doesn't make sense to me since this doesn't appear to change the target offloading logic, but it might be a smoking gun for a lifetime management error somewhere in the above. Does anyone know if the host openmp runtime is expected to be clean under things like valgrind or thread sanitizer?

Added missing functions on Windows but forced __kmp_enable_hidden_helper to
FALSE on all non-Linux platforms

This revision is now accepted and ready to land.Jan 20 2021, 5:51 PM

@mstorsjo Would you mind giving it a shot on Windows?

tianshilei1992 requested review of this revision.Jan 20 2021, 5:53 PM

@mstorsjo Would you mind giving it a shot on Windows?

Looks like it builds correctly now, thanks!

Haven't tested it practically (except for a very trivial smoke test) to see if it breaks anything at runtime, but it doesn't at least regress the build any longer.

JonChesterfield added a comment.EditedJan 22 2021, 8:33 AM

The information I've got on the possible race is:
When this patch is applied (by git's automerge, I think) to the rocm stack, a test located at:
https://github.com/ROCm-Developer-Tools/aomp/blob/master/test/smoke/devices/devices.c
fails in unpredictable fashion.

I've reproduced the test here as it's fairly short, but it uses some functions on the device that the trunk implementation returns zero for. Adjusted so it builds on trunk. Run as

export LD_LIBRARY_PATH=$HOME/llvm-install/lib/ ; $HOME/llvm-install/bin/clang  -O2  -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_50   devices.c -o devices -L/usr/local/cuda/targets/x86_64-linux/lib -lcudart && valgrind --fair-sched=yes ./devices
// devices.c
#include <stdio.h>
#include <omp.h>

int main() {
  int num_devs = omp_get_num_devices();
  for (int device_num = 0; device_num < num_devs ; device_num++) {
#pragma omp target device(device_num) nowait
#pragma omp teams num_teams(2) thread_limit(4)
#pragma omp parallel num_threads(2)
    {
      // need to pass the total device number to all devices, per module load
      int num_threads = omp_get_num_threads();
      int num_teams   = omp_get_num_teams();
      int num_devices = omp_get_num_devices(); // not legal in 4.5

      // need to pass the device id to the device starting the kernel
      int thread_id   = omp_get_thread_num();
      int team_id     = omp_get_team_num();
      int device_id   = 0; // omp_get_device_num();  // no API in omp 4.5

      // assume we have homogeneous devices
      int total_threads = num_devices * num_teams * num_threads;
      int gthread_id    = (device_id * num_teams * num_threads) + (team_id * num_threads) + thread_id;

      // print out id
      printf("Hello OpenMP 5 from \n");
      printf(" Device num  %d of %d devices\n", device_id, num_devices);
      printf(" Team num    %d of %d teams  \n", team_id,   num_teams);
      printf(" Thread num  %d of %d threads\n", thread_id, num_threads);
      printf(" Global thread %d of %d total threads\n", gthread_id, total_threads);
    };
  };
#pragma omp taskwait
  printf("The host device num is %d\n", omp_get_device_num());
  printf("The initial device num is %d\n", omp_get_initial_device());
  printf("The number of devices are %d\n", num_devs);
}

Trunk before this patch makes a use of uninitialized memory but the test succeeds (prints a lot of stuff).

==27099== Conditional jump or move depends on uninitialised value(s)
==27099==    at 0x4C36DC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
==27099==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
==27099==    by 0x4B5B688: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1562)
==27099==    by 0x4B5B8BB: __kmp_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1679)
==27099==    by 0x4B5BB7E: __kmpc_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1739)
==27099==    by 0x401309: main

With this patch applied, most of the print output is lost, and the uninitialized data error changes

The host device num is 1
The initial device num is 1
==20091== Thread 9:
==20091== Conditional jump or move depends on uninitialised value(s)
==20091==    at 0x4C3ADC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
==20091==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
==20091==    by 0x4B5C399: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1633)
==20091==    by 0x4B60012: int __kmp_execute_tasks_template<kmp_flag_64<false, true> >(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3012)
==20091==    by 0x4B6AE91: int __kmp_execute_tasks_64<false, true>(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3111)
==20091==    by 0x4B79901: kmp_flag_64<false, true>::execute_tasks(kmp_info*, int, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_wait_release.h:915)
==20091==    by 0x4B7497C: bool __kmp_wait_template<kmp_flag_64<false, true>, true, false, true>(kmp_info*, kmp_flag_64<false, true>*, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:345)
==20091==    by 0x4B797D9: kmp_flag_64<false, true>::wait(kmp_info*, int, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:922)
==20091==    by 0x4B70559: __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:672)
==20091==    by 0x4B7401D: __kmp_fork_barrier(int, int) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:1982)
==20091==    by 0x4B3B701: __kmp_launch_thread (llvm-project/openmp/runtime/src/kmp_runtime.cpp:5776)
==20091==    by 0x4BB976D: __kmp_launch_worker(void*) (llvm-project/openmp/runtime/src/z_Linux_util.cpp:591)
==20091== 
The number of devices are 1
CUDA error: Error returned from cuDeviceGet

This is more obvious on the amd implementation because it segfaults on a null pointer dereference.

The information I've got on the possible race is:
When this patch is applied (by git's automerge, I think) to the rocm stack, a test located at:
https://github.com/ROCm-Developer-Tools/aomp/blob/master/test/smoke/devices/devices.c
fails in unpredictable fashion.

I've reproduced the test here as it's fairly short, but it uses some functions on the device that the trunk implementation returns zero for. Adjusted so it builds on trunk. Run as

export LD_LIBRARY_PATH=$HOME/llvm-install/lib/ ; $HOME/llvm-install/bin/clang  -O2  -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_50   devices.c -o devices -L/usr/local/cuda/targets/x86_64-linux/lib -lcudart && valgrind --fair-sched=yes ./devices
// devices.c
#include <stdio.h>
#include <omp.h>

int main() {
  int num_devs = omp_get_num_devices();
  for (int device_num = 0; device_num < num_devs ; device_num++) {
#pragma omp target device(device_num) nowait
#pragma omp teams num_teams(2) thread_limit(4)
#pragma omp parallel num_threads(2)
    {
      // need to pass the total device number to all devices, per module load
      int num_threads = omp_get_num_threads();
      int num_teams   = omp_get_num_teams();
      int num_devices = omp_get_num_devices(); // not legal in 4.5

      // need to pass the device id to the device starting the kernel
      int thread_id   = omp_get_thread_num();
      int team_id     = omp_get_team_num();
      int device_id   = 0; // omp_get_device_num();  // no API in omp 4.5

      // assume we have homogeneous devices
      int total_threads = num_devices * num_teams * num_threads;
      int gthread_id    = (device_id * num_teams * num_threads) + (team_id * num_threads) + thread_id;

      // print out id
      printf("Hello OpenMP 5 from \n");
      printf(" Device num  %d of %d devices\n", device_id, num_devices);
      printf(" Team num    %d of %d teams  \n", team_id,   num_teams);
      printf(" Thread num  %d of %d threads\n", thread_id, num_threads);
      printf(" Global thread %d of %d total threads\n", gthread_id, total_threads);
    };
  };
#pragma omp taskwait
  printf("The host device num is %d\n", omp_get_device_num());
  printf("The initial device num is %d\n", omp_get_initial_device());
  printf("The number of devices are %d\n", num_devs);
}

Trunk before this patch makes a use of uninitialized memory but the test succeeds (prints a lot of stuff).

==27099== Conditional jump or move depends on uninitialised value(s)
==27099==    at 0x4C36DC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
==27099==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
==27099==    by 0x4B5B688: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1562)
==27099==    by 0x4B5B8BB: __kmp_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1679)
==27099==    by 0x4B5BB7E: __kmpc_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1739)
==27099==    by 0x401309: main

With this patch applied, most of the print output is lost, and the uninitialized data error changes

The host device num is 1
The initial device num is 1
==20091== Thread 9:
==20091== Conditional jump or move depends on uninitialised value(s)
==20091==    at 0x4C3ADC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
==20091==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
==20091==    by 0x4B5C399: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1633)
==20091==    by 0x4B60012: int __kmp_execute_tasks_template<kmp_flag_64<false, true> >(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3012)
==20091==    by 0x4B6AE91: int __kmp_execute_tasks_64<false, true>(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3111)
==20091==    by 0x4B79901: kmp_flag_64<false, true>::execute_tasks(kmp_info*, int, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_wait_release.h:915)
==20091==    by 0x4B7497C: bool __kmp_wait_template<kmp_flag_64<false, true>, true, false, true>(kmp_info*, kmp_flag_64<false, true>*, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:345)
==20091==    by 0x4B797D9: kmp_flag_64<false, true>::wait(kmp_info*, int, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:922)
==20091==    by 0x4B70559: __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:672)
==20091==    by 0x4B7401D: __kmp_fork_barrier(int, int) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:1982)
==20091==    by 0x4B3B701: __kmp_launch_thread (llvm-project/openmp/runtime/src/kmp_runtime.cpp:5776)
==20091==    by 0x4BB976D: __kmp_launch_worker(void*) (llvm-project/openmp/runtime/src/z_Linux_util.cpp:591)
==20091== 
The number of devices are 1
CUDA error: Error returned from cuDeviceGet

This is more obvious on the amd implementation because it segfaults on a null pointer dereference.

If you take a look at the code around interface.cpp:470, it is:

EXTERN int __tgt_target_teams_nowait_mapper(
    ident_t *loc, int64_t device_id, void *host_ptr, int32_t arg_num,
    void **args_base, void **args, int64_t *arg_sizes, int64_t *arg_types,
    map_var_info_t *arg_names, void **arg_mappers, int32_t team_num,
    int32_t thread_limit, int32_t depNum, void *depList, int32_t noAliasDepNum,
    void *noAliasDepList) {
  TIMESCOPE();
  if (depNum + noAliasDepNum > 0)
    __kmpc_omp_taskwait(loc, __kmpc_global_thread_num(loc));

  return __tgt_target_teams_mapper(loc, device_id, host_ptr, arg_num, args_base,
                                   args, arg_sizes, arg_types, arg_names,
                                   arg_mappers, team_num, thread_limit);
}

Line 470 is if (depNum + noAliasDepNum > 0). The reason it raises an error is, depNum and noAliasDepNum are not passed to the function call at all due to the known issue we have in clang. Actually, depNum, depList, noAliasDepNum, and noAliasDepList are all not passed on the callsite. So your issue encountered probably has nothing to do with this part.

I did try on my local systems with NVIDIA GPUs. I didn't encounter any crash/hang with 1000 runs. The only potential problem is printf in the target region doesn't work at all, which I believe has nothing to do with this patch.

Fixed some issues in miniQMC

The only potential problem is printf in the target region doesn't work at all, which I believe has nothing to do with this patch.

Do you see the print statements within the region without this patch applied? On sm_50, I see all the print output on trunk and the ones inside target missing with this patch.

I suspect there is a race condition in the library that this patch has exposed.

Did you run the test under valgrind? The fair scheduler setting does a reasonable job of perturbing thread order, though I suppose one should use an actual race detector instead.

rebased and remove unnecessary struct data member

The only potential problem is printf in the target region doesn't work at all, which I believe has nothing to do with this patch.

Do you see the print statements within the region without this patch applied? On sm_50, I see all the print output on trunk and the ones inside target missing with this patch.

I suspect there is a race condition in the library that this patch has exposed.

Did you run the test under valgrind? The fair scheduler setting does a reasonable job of perturbing thread order, though I suppose one should use an actual race detector instead.

I tested the latest version of this patch, and it can print out all information. Can you give it a shot on your side with AMD GPUs?

The test still doesn't work ideally on amdgpu, but it no longer crashes, and some of the print statements within the target region are seen.

jdoerfert accepted this revision.Jan 25 2021, 7:12 PM

Known issues resolved, AMDGPU is not yet a supported target and hard to test right now. LG

This revision is now accepted and ready to land.Jan 25 2021, 7:12 PM
This revision was landed with ongoing or failed builds.Jan 25 2021, 7:16 PM
This revision was automatically updated to reflect the committed changes.

I'm getting a segfault, when running code with target nowait compiled for x86 offloading. The segfault is in __kmp_push_task for a task marked as hidden_task.

I tried to find the thread with __kmp_gtid = 2 (assuming that's still the task identified as gtid=2) :

(gdb) t 11
[Switching to thread 11 (Thread 0x2aab18000800 (LWP 16111))]
(gdb) p __kmp_gtid
$34 = 2
(gdb) bt
#0  0x00002aaabddea9cc in .omp_outlined._debug__ (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:109
#1  0x00002aaabddeaa95 in .omp_outlined. (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:105
#2  0x00002aaaab584803 in __kmp_invoke_microtask () at llvm-project/openmp/runtime/src/z_Linux_asm.S:1166
#3  0x00002aaaab51741c in __kmp_fork_call (loc=0x2aaabdfeada0, gtid=<optimized out>, call_context=fork_context_intel, argc=3, microtask=<optimized out>, invoker=0x2aaaab51c020 <__kmp_invoke_task_func>, 
    ap=0x2aab17fff1d0) at llvm-project/openmp/runtime/src/kmp_runtime.cpp:1906
#4  0x00002aaaab509048 in __kmpc_fork_call (loc=0x2aaabdfeada0, argc=<optimized out>, microtask=0x2aaabddeaa60 <.omp_outlined.>) at llvm-project/openmp/runtime/src/kmp_csupport.cpp:307
#5  0x00002aaabddea8aa in __omp_offloading_3b_1502eaf5__Z24BlockMatMul_TargetNowaitR11BlockMatrixS0_S0__l101_debug__ (BlockC=0x2aab20000d30, BlockA=0x2aab3c010da0, BlockB=0x2aab40010da0) at targetnowait.cpp:105
#6  0x00002aaabddeaac5 in __omp_offloading_3b_1502eaf5__Z24BlockMatMul_TargetNowaitR11BlockMatrixS0_S0__l101 (BlockC=0x2aab20000d30, BlockA=0x2aab3c010da0, BlockB=0x2aab40010da0) at targetnowait.cpp:101
#7  0x00002aaaadccce2c in ffi_call_unix64 () from /lib64/libffi.so.6
#8  0x00002aaaadccc755 in ffi_call () from /lib64/libffi.so.6
#9  0x00002aaaadac4a56 in __tgt_rtl_run_target_team_region () from /home/x/sw/UTIL/clang//12.0-release/lib/../lib/libomptarget.rtl.x86_64.so
#10 0x00002aaaab7c0be0 in DeviceTy::runTeamRegion(void*, void**, long*, int, int, int, unsigned long, __tgt_async_info*) () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12
#11 0x00002aaaab7d02f2 in target(ident_t*, long, void*, int, void**, void**, long*, long*, void**, void**, int, int, int) () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12
#12 0x00002aaaab7c5d96 in __tgt_target_teams_mapper () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12

Is it intended, that the threads executing the host offloading use the same gtid as the hidden threads?

openmp/runtime/src/kmp_tasking.cpp
363

I'm getting the segfault here. When I look at task_team, it is 0x0.

taskdata->td_flags.hidden_helper = 1
gtid = 2
__kmp_threads[gtid]->th.th_task_team = 0x0

I'm getting a segfault, when running code with target nowait compiled for x86 offloading. The segfault is in __kmp_push_task for a task marked as hidden_task.

I tried to find the thread with __kmp_gtid = 2 (assuming that's still the task identified as gtid=2) :

(gdb) t 11
[Switching to thread 11 (Thread 0x2aab18000800 (LWP 16111))]
(gdb) p __kmp_gtid
$34 = 2
(gdb) bt
#0  0x00002aaabddea9cc in .omp_outlined._debug__ (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:109
#1  0x00002aaabddeaa95 in .omp_outlined. (.global_tid.=0x2aab17ffef00, .bound_tid.=0x2aab17ffeef8, BlockC=@0x2aab17fff238: 0x2aab20000d30, BlockA=@0x2aab17fff230: 0x2aab3c010da0, 
    BlockB=@0x2aab17fff228: 0x2aab40010da0) at targetnowait.cpp:105
#2  0x00002aaaab584803 in __kmp_invoke_microtask () at llvm-project/openmp/runtime/src/z_Linux_asm.S:1166
#3  0x00002aaaab51741c in __kmp_fork_call (loc=0x2aaabdfeada0, gtid=<optimized out>, call_context=fork_context_intel, argc=3, microtask=<optimized out>, invoker=0x2aaaab51c020 <__kmp_invoke_task_func>, 
    ap=0x2aab17fff1d0) at llvm-project/openmp/runtime/src/kmp_runtime.cpp:1906
#4  0x00002aaaab509048 in __kmpc_fork_call (loc=0x2aaabdfeada0, argc=<optimized out>, microtask=0x2aaabddeaa60 <.omp_outlined.>) at llvm-project/openmp/runtime/src/kmp_csupport.cpp:307
#5  0x00002aaabddea8aa in __omp_offloading_3b_1502eaf5__Z24BlockMatMul_TargetNowaitR11BlockMatrixS0_S0__l101_debug__ (BlockC=0x2aab20000d30, BlockA=0x2aab3c010da0, BlockB=0x2aab40010da0) at targetnowait.cpp:105
#6  0x00002aaabddeaac5 in __omp_offloading_3b_1502eaf5__Z24BlockMatMul_TargetNowaitR11BlockMatrixS0_S0__l101 (BlockC=0x2aab20000d30, BlockA=0x2aab3c010da0, BlockB=0x2aab40010da0) at targetnowait.cpp:101
#7  0x00002aaaadccce2c in ffi_call_unix64 () from /lib64/libffi.so.6
#8  0x00002aaaadccc755 in ffi_call () from /lib64/libffi.so.6
#9  0x00002aaaadac4a56 in __tgt_rtl_run_target_team_region () from /home/x/sw/UTIL/clang//12.0-release/lib/../lib/libomptarget.rtl.x86_64.so
#10 0x00002aaaab7c0be0 in DeviceTy::runTeamRegion(void*, void**, long*, int, int, int, unsigned long, __tgt_async_info*) () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12
#11 0x00002aaaab7d02f2 in target(ident_t*, long, void*, int, void**, void**, long*, long*, void**, void**, int, int, int) () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12
#12 0x00002aaaab7c5d96 in __tgt_target_teams_mapper () from /home/x/sw/UTIL/clang//12.0-release/lib/libomptarget.so.12

We also got report in openmp-dev mail list of this issue. I'll investigate it.

Is it intended, that the threads executing the host offloading use the same gtid as the hidden threads?

It is because the task needs to be executed by a hidden helper thread.

Post commit issue:
Our downstream testing of our release branch revealed an assertion in kmp_runtime.cpp while compiling our rocFFT application
The rocFFT application does not use openmp offload, rather it uses HIP, and host openmp threads.
When we reverted this patch locally it allowed the application to compile and run succesfullly.

root@ixt-sjc2-13:/root/Staging/MathLibs/rocFFT/build/release/clients/staging# cd /root/Staging/MathLibs/rocFFT/build/release/clients/staging; ./rocfft-test --gtest_filter=rocfft_UnitTest.simple_multithread_1D
rocFFT version: 1.0.9.a07759d-dirty
Note: Google Test filter = rocfft_UnitTest.simple_multithread_1D
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from rocfft_UnitTest
[ RUN ] rocfft_UnitTest.simple_multithread_1D
OMP: Error #13: Assertion failure at kmp_runtime.cpp(3691).

Patch we reverted to get us back to our happy-place.

commit 9d64275ae08fbdeeca0ce9c2f3951a2de6f38a08
Author: Shilei Tian <tianshilei1992@gmail.com>
Date: Mon Jan 25 22:14:52 2021 -0500

[OpenMP] Added the support for hidden helper task in RTL

Without a reproducer, I cannot tell what was going wrong. And your code is out of date. What is the assertion at line 3691 in kmp_runtime.cpp?

[AMD Official Use Only - Internal Distribution Only]

I was requested to add the information about the failure observed in our product testing, so I added comment to your patch so you would be aware of it. Perhaps down the road someone will encounter it with a simpler upstream testcase.

I don't have a small reproducer, however the release engineer (David cc'ed) says it might take him a few hours or more. He is willing to try, so maybe you will be able to provide one.... might be today or tomorrow.

Ron

Hi Ron,

even without a reproducer, it would certainly help, if you can map your
line 3691 to a line of code we can find in the upstream repository.
Neither main nor the release branch have an assertion on that line:

https://github.com/llvm/llvm-project/blob/main/openmp/runtime/src/kmp_runtime.cpp#L3691

https://github.com/llvm/llvm-project/blob/release/12.x/openmp/runtime/src/kmp_runtime.cpp#L3691

Best
Joachim

Am 15.03.21 um 20:53 schrieb Lieberman, Ron:

[AMD Official Use Only - Internal Distribution Only]

I was requested to add the information about the failure observed in our product testing, so I added comment to your patch so you would be aware of it. Perhaps down the road someone will encounter it with a simpler upstream testcase.

I don't have a small reproducer, however the release engineer (David cc'ed) says it might take him a few hours or more. He is willing to try, so maybe you will be able to provide one.... might be today or tomorrow.

Ron

latest trunk has the assert in question at line 3651
3638 } else {

3639      /* find an available thread slot */
3640      // Don't reassign the zero slot since we need that to only be used by
3641      // initial thread. Slots for hidden helper threads should also be skipped.
3642      if (initial_thread && __kmp_threads[0] == NULL) {
3643        gtid = 0;
3644      } else {
3645        for (gtid = __kmp_hidden_helper_threads_num + 1;
3646             TCR_PTR(__kmp_threads[gtid]) != NULL; gtid++)
3647          ;
3648      }
3649      KA_TRACE(
3650          1, ("__kmp_register_root: found slot in threads array: T#%d\n", gtid));
3651      KMP_ASSERT(gtid < __kmp_threads_capacity);
3652    }

our jan 27th internal merge has this at line 3691 of kmp_runtime.cpp

3678 } else {

3679      /* find an available thread slot */
3680      // Don't reassign the zero slot since we need that to only be used b

y

3681      // initial thread. Slots for hidden helper threads should also be sk

ipped.

3682      if (initial_thread && __kmp_threads[0] == NULL) {
3683        gtid = 0;
3684      } else {
3685        for (gtid = __kmp_hidden_helper_threads_num + 1;
3686             TCR_PTR(__kmp_threads[gtid]) != NULL; gtid++)
3687          ;
3688      }
3689      KA_TRACE(
3690          1, ("__kmp_register_root: found slot in threads array: T#%d\n",

gtid));

3691      KMP_ASSERT(gtid < __kmp_threads_capacity);
3692    }

I'm also hitting asserts with with this change (as i've told @jdoerfert in irc previously)

I'm reliably hitting "Assertion failure at kmp_runtime.cpp(4314): new_gtid < __kmp_threads_capacity."
unless i specify LIBOMP_NUM_HIDDEN_HELPER_THREADS=0.
It's both easy to repro, and not, i don't have a standalone repro.

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

I would suggest reverting this.

Seems like the two assertions mentioned above are caused by a same problem that __kmp_threads is somehow touched and all elements are not NULL. I'd appreciate if someone could provide a reproducer.

That's three independent reports of stuff breaking after this patch. There are a bunch of locks and condition variables involved, and it looks suspicious to me that the introduced variables are volatile but not atomic.

I don't think we have robust enough in tree testing to say this patch is sound, and multiple independent reports suggest it is not. I think we have to pull it until we can work out what's gone wrong, or rewrite it to be simple enough to reliably audit the concurrency.

How about set LIBOMP_USE_HIDDEN_HELPER_TASK=OFF by default? So we can keep this commit but make user codes happy as we investigate more?

I'm starting to have doubts about the thread safety of this library in general so would lean towards removing the commit entirely such that the remainder is easier to reason about. That way we can be fairly sure we've removed whatever bug this introduced so have ~ one fewer race to try to pin down.

[AMD Official Use Only - Internal Distribution Only]

The failure we saw in our down fork testing , is pure host openmp code, no device code of any kind.

The pragmas all looked like this

#pragma omp parallel for reduction(max : linf) reduction(+ : l2) num_threads(partitions.size())

Going forward, I would like to ask if the pre commit testing does not already include SPEC CPU speed runs with openmp enabled , and SPEC OMP2012 , that it be added. Plus any other OpenMP applications that folks think of that might help stress the task helper patch.

Ron

Again, it doesn't help if we don't have a way to reproduce it. We can disable it, we can revert it, sure, but it will NEVER be enabled back because we don't have a reproducer to tell what is wrong, and nobody will use it if it is disabled. We can't guarantee that rewriting the whole thing in a "simpler" way can work if we don't have a way to test it.

One of the drawbacks of limited trunk testing of openmp is that we're reliant on out of trunk people noticing something looks odd. I don't want to set a precedent of downstream forks reverting patches that fail local testing, as that'll remove a bunch of the ad hoc testing we do have.

Closely related, we really do need CI. I'm told that's a work in progress for amdgpu. Even without a live GPU box, it should be possible to exercise some runtime testing of the host code, which would have sufficed to raise awareness of this patch.

Also, the reproducer doesn't need to be a small piece of code. It can be steps to reproduce it as long as I can access the source code.

Also, the reproducer doesn't need to be a small piece of code. It can be steps to reproduce it as long as I can access the source code.

I will state repro steps once this is reverted.

FYI another aspect of reverting this one; this is part of the 12.x release branch too (which is drawing very close to the actual release), so if it needs to be reverted, maybe it needs to be reverted there too.

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

Linking two OpenMP runtime libraries into one application is guaranteed to break things. You have basically no way to guarantee that calls to API functions go to the right runtime.

i tried using
export LIBOMP_USE_HIDDEN_HELPER_TASK=0 and rebuilding/rerunning spec cpu2017 fpspeed base.

and still see the performance issues in 619.lbm and all the other fpspeed benchmarks.
The GeoMean dropped aprox 30%

@tianshilei1992 please review my comments, they might explain why the assertion triggers.

openmp/runtime/src/kmp_runtime.cpp
3636

This check is not aware of reserved hidden threads. __kmp_expand_threads will only be called, if __kmp_all_nth exeeds the capacity limit. Even if the hidden threads are included in __kmp_all_nth, this check does not consider the hole in the thread array.

3667

This load used to be TCR_PTR

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

Linking two OpenMP runtime libraries into one application is guaranteed to break things. You have basically no way to guarantee that calls to API functions go to the right runtime.

Presumably you have read my comment in it's entirety, and did see that both libgomp and libomp used are from llvm?
What's the point of llvm's libgomp then?

I believe this involves compiling a program that uses omp with clang, linking it to llvm's libomp,
and linking it to some library that is compiled with gcc and linked to libgomp.
The issue appears to happen regardless of whether or not the libgomp used is provided at runtime by gcc or llvm.

Linking two OpenMP runtime libraries into one application is guaranteed to break things. You have basically no way to guarantee that calls to API functions go to the right runtime.

Presumably you have read my comment in it's entirety, and did see that both libgomp and libomp used are from llvm?
What's the point of llvm's libgomp then?

My point was on this specific part:

whether or not the libgomp used is provided at runtime by gcc or llvm

You explicitly listed the case that libgomp from gcc is loaded at execution time. The most tedious issues I had with a third-party library, which is statically linked against libgomp. You won't spot the gcc libgomp with ldd.

As long as you make sure that only the LLVM OpenMP runtime is loaded during execution, it should work, yes.

i tried using
export LIBOMP_USE_HIDDEN_HELPER_TASK=0 and rebuilding/rerunning spec cpu2017 fpspeed base.

and still see the performance issues in 619.lbm and all the other fpspeed benchmarks.
The GeoMean dropped aprox 30%

Wasn't spec cpu meant to measure single-core performance? I can see how -fopenmp or -fopenmp-simd might help to turn on vectorization. But none of these flags should turn on OpenMP directives present in the code and make the code multi-threaded (in spec cpu 2006 there were actually #if !defined(SPEC_CPU) arround all OpenMP directives and includes). Are the resulting binaries really linked against libomp? Without any OpenMP symbols in the application, the linker should just drop libomp.

Originally, yes SPEC CPU was intended to be single core/cpu or rate runs.
with the advent of spec cpu 2017 and the explosion of multicore, spec cpu decided to add openmp to speed benchmarks so that compilers could utilize more cores.
The benchmark have openmp pragmas/directives

who knows in 10 years, maybe spec cpu will want to treat gpus as an extension of the cpu (openmp offload)

I find a stable way to reproduce the assertion. Let's say the default __kmp_threads_capacity is N. If hidden helper thread is enabled, __kmp_threads_capacity will be offset to N+8 by default. If the number of threads we need exceeds N+8, e.g. via num_threads clause, we need to expand __kmp_threads. In __kmp_expand_threads, the expansion starts from __kmp_threads_capacity, and repeatedly doubling it until the new capacity meets the requirement. Let's assume the new requirement is Y. If Y happens to meet the constraint (N+8)*2^X=Y where X is the number of iterations, then the new capacity is not enough because we have 8 slots for hidden helper threads.

#include <vector>

int main(int argc, char *argv[]) {
  constexpr const size_t N = 1344;
  std::vector<int> data(N);

#pragma omp parallel for
  for (unsigned i = 0; i < N; ++i) {
    data[i] = i;
  }

#pragma omp parallel for num_threads(N)
  for (unsigned i = 0; i < N; ++i) {
    data[i] += i;
  }

  return 0;
}

Here is an example. My CPU is 20C40T, then __kmp_threads_capacity is 160. After offset, __kmp_threads_capacity becomes 168. 1344 = (160+8)*2^3, and then the assertions hit.

I'll fix it right away.

Try to fix the crash in D98838

I think, the fundamental issue of this patch is, that it broke the implicit assumption, that entries in __kmp_threads are handed out contiguously. After spending quite some effort into trying to identify locations, where this implicit assumption is now broken, I think, much more effort is needed to identify all places which rely on this assumption and are now broken.

I think, the fundamental issue of this patch is, that it broke the implicit assumption, that entries in __kmp_threads are handed out contiguously. After spending quite some effort into trying to identify locations, where this implicit assumption is now broken, I think, much more effort is needed to identify all places which rely on this assumption and are now broken.

If we set the cmake flag to false and 0, we don't break those assumptions, right? Let's do that.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 14 2023, 5:28 AM
Munesanz removed a subscriber: Munesanz.Feb 14 2023, 5:28 AM