This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Fix performance regression reported in bug #51235
Closed · Public

Authored by tianshilei1992 on Jul 29 2021, 5:27 PM.

Details

Summary

This patch fixes the "performance regression" reported in https://bugs.llvm.org/show_bug.cgi?id=51235. In fact it has nothing to do with performance. The root cause is that the stolen task is not allowed to be executed by another thread because, by default, it is a tied task. Since hidden helper tasks are always executed by hidden helper threads, they should be untied.
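A minimal sketch of the kind of change (names follow the existing runtime; the hidden helper/target entry point in openmp/runtime/src/kmp_tasking.cpp is assumed here — see the landed diff for the authoritative code):

kmp_task_t *__kmpc_omp_target_task_alloc(ident_t *loc_ref, kmp_int32 gtid,
                                         kmp_int32 flags,
                                         size_t sizeof_kmp_task_t,
                                         size_t sizeof_shareds,
                                         kmp_routine_entry_t task_entry,
                                         kmp_int64 device_id) {
  auto &input_flags = reinterpret_cast<kmp_tasking_flags_t &>(flags);
  // A tied task, once suspended, may only be resumed by the thread that
  // first executed it. Hidden helper tasks are always run by hidden helper
  // threads, so force them to be untied so that any of those threads may
  // pick up a stolen task.
  input_flags.tiedness = TASK_UNTIED;
  if (__kmp_enable_hidden_helper)
    input_flags.hidden_helper = TRUE;
  return __kmpc_omp_task_alloc(loc_ref, gtid, flags, sizeof_kmp_task_t,
                               sizeof_shareds, task_entry);
}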

Diff Detail

Event Timeline

tianshilei1992 created this revision. · Jul 29 2021, 5:27 PM
tianshilei1992 requested review of this revision. · Jul 29 2021, 5:27 PM
Herald added a project: Restricted Project. · Jul 29 2021, 5:27 PM

@tianshilei1992 when flaky tests fail or block on the build bots, this is often triggered by unusual thread interleaving due to over-subscription.
To reproduce on my own system, it has sometimes helped to oversubscribe the system by running multiple instances of the flaky test at the same time:

for i in $(seq 20); do ./gtid.cpp.out & done

or to repeat until failing/hanging:

while for i in $(seq 20); do ./gtid.cpp.out & done; wait; do true; done

Can you reproduce the hang on your system and verify that this change fixes the issue?

Yup. On my machine, it can be reproduced almost 100% of the time by using numactl to restrict the process to 6 cores, the same as the test machine.

$ numactl -C 0,1,2,3,4,5 ./a.out

After I applied the patch, I didn't see it again. Could you try with numactl on your side to check whether it can be easily reproduced and whether this patch fixes it? Thanks.

openmp/runtime/src/kmp_tasking.cpp
2790

A side question: what does "TSC" mean here?

Running the following command, I get the assertion below from the test:

for i in $(seq 4); do numactl -C 0,1,24,25 env LD_PRELOAD=openmp/runtime/src/libomp.so openmp/runtime/test/tasking/hidden_helper_task/Output/gtid.cpp.tmp & done; wait

gtid.cpp.tmp: llvm-project/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp:58: void assert_gtid(int) [hidden_helper_task = false]: Assertion `v == 0 || v > __kmp_hidden_helper_threads_num' failed.
openmp/runtime/src/kmp_tasking.cpp
2790

TSC = task stealing constraint (see OpenMP spec)

tianshilei1992 marked an inline comment as done. · Jul 31 2021, 11:33 AM

Thanks. I'll see what I should do.

openmp/runtime/src/kmp_tasking.cpp
2790

I guess it is "Task Scheduling Constraints", isn't it?

openmp/runtime/src/kmp_tasking.cpp
2790

I meant task scheduling constraint
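(For illustration: under the task scheduling constraints, a suspended tied task may only be resumed by the thread that started it, whereas an untied task may be resumed by any thread at a task scheduling point. A minimal, standalone OpenMP example of that difference — not code from this patch:)

#include <omp.h>
#include <cstdio>

int main() {
#pragma omp parallel
#pragma omp single
  {
#pragma omp task untied // a tied task would have to resume on the same thread
    {
      std::printf("started on thread %d\n", omp_get_thread_num());
#pragma omp taskyield // task scheduling point: an untied task may migrate here
      std::printf("resumed on thread %d\n", omp_get_thread_num());
    }
  }
  return 0;
}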

tianshilei1992 marked an inline comment as done. (edited) · Jul 31 2021, 12:00 PM

I cannot reproduce your failure. Can you double-check that the right libomp.so is being used?

➜ clang++ -fopenmp -L $HOME/Documents/deploy/openmp/release/lib $HOME/Documents/vscode/llvm-project/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp -o gtid
➜ for i in $(seq 20); do numactl -C 0,1,24,25 ./gtid & done; wait
[2] 1304258
[3] 1304259
[4] 1304260
[5] 1304261
[6] 1304262
[7] 1304263
[8] 1304264
[9] 1304265
[10] 1304266
[11] 1304267
[12] 1304268
[13] 1304269
[14] 1304270
[15] 1304271
[16] 1304272
[17] 1304273
[18] 1304274
[19] 1304275
[20] 1304276
[21] 1304277
PASS
[8]    1304264 done       numactl -C 0,1,24,25 ./gtid
PASS
[9]    1304265 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
PASS
[2]    1304258 done       numactl -C 0,1,24,25 ./gtid
[7]    1304263 done       numactl -C 0,1,24,25 ./gtid
[10]    1304266 done       numactl -C 0,1,24,25 ./gtid
PASS
[13]    1304269 done       numactl -C 0,1,24,25 ./gtid
PASS
[5]    1304261 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[20]  - 1304276 done       numactl -C 0,1,24,25 ./gtid
[11]    1304267 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[17]    1304273 done       numactl -C 0,1,24,25 ./gtid
[18]    1304274 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[19]  - 1304275 done       numactl -C 0,1,24,25 ./gtid
PASS
[16]  - 1304272 done       numactl -C 0,1,24,25 ./gtid
PASS
[21]  + 1304277 done       numactl -C 0,1,24,25 ./gtid
[12]    1304268 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[15]  + 1304271 done       numactl -C 0,1,24,25 ./gtid
[4]    1304260 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[6]  - 1304262 done       numactl -C 0,1,24,25 ./gtid
[14]  + 1304270 done       numactl -C 0,1,24,25 ./gtid
PASS
[3]  + 1304259 done       numactl -C 0,1,24,25 ./gtid
➜ ldd gtid
        linux-vdso.so.1 (0x00007fffa74ad000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f20e2952000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f20e280d000)
        libomp.so => /home/shiltian/Documents/deploy/openmp/release/lib/libomp.so (0x00007f20e272a000)
        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f20e270f000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f20e26ed000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f20e2523000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f20e2b44000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f20e2516000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f20e250f000)

I execute in BUILD/runtime/runtime-bin and use LD_PRELOAD to make sure that I use the right libomp.
The system I use for testing has two sockets with 24 cores each. So, by using cores 24 and 25, I make sure to get the extra latency from communication between the sockets.

I added a print statement before the assertion and found that v=-1.
I added an atomic counter to count the executed tasks for each iteration of the for loop. If the assertion triggers, the task count is 2 instead of 3.
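(A sketch of that instrumentation — the counter below is hypothetical, since the exact test change is not included in this thread:)

#include <atomic>
#include <cstdio>

// Hypothetical counter: each task body increments it, and the check after
// __kmpc_omp_taskwait reports how many of the three dependent tasks in an
// iteration actually ran.
std::atomic<int> executed_tasks{0};

void on_task_executed() {
  executed_tasks.fetch_add(1, std::memory_order_relaxed);
}

void check_after_taskwait() {
  int n = executed_tasks.load(std::memory_order_relaxed);
  if (n != 3)
    std::printf("Fail: only %d of 3 tasks executed\n", n);
}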

tianshilei1992 marked an inline comment as done. · Jul 31 2021, 6:58 PM

Okay, thanks. Sadly, I still cannot reproduce it, even with the same configuration as yours:
for i in $(seq 1000); do numactl -C 0,1,29,30 ./gtid & done; wait
It ran 1000 times and nothing happened…

Looking at your ldd output, I realize that we use different C++ runtime libraries. I always build against libc++. Could this make a difference?

I reran the tests with different clang compiler versions (after building libomp standalone) and can reproduce the assertion with clang 11 and clang 12.
For depend.cpp, the assertion triggers with data = 7.

Thanks for the info. I tried with libc++, but still had no luck.

➜ ldd gtid
        linux-vdso.so.1 (0x00007ffd799ba000)
        libc++.so.1 => /home/shiltian/Documents/deploy/llvm/release/lib/libc++.so.1 (0x00007f05e1261000)
        libc++abi.so.1 => /home/shiltian/Documents/deploy/llvm/release/lib/libc++abi.so.1 (0x00007f05e1226000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f05e10e1000)
        libomp.so => /home/shiltian/Documents/deploy/openmp/release/lib/libomp.so (0x00007f05e0ffe000)
        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f05e0fe3000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f05e0fbf000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f05e0df5000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f05e0dea000)
        libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00007f05e0de0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f05e134f000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f05e0dd9000)

Can you try with GDB to see if the extra 8 threads are created?

fix it in another way

@protze.joachim Could you please give it a shot?

protze.joachim added a comment. (edited) · Aug 3 2021, 3:01 PM

I still get the same assertions.

It looks like the last task is not guaranteed to finish before the taskwait. The taskwait function spins until td_incomplete_child_tasks becomes 0. Is it possible that this variable is not properly maintained for hidden helper tasks?
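(A self-contained model of that spin — an assumption about its shape, not the verbatim kmp_tasking.cpp code:)

#include <atomic>

// The thread in __kmpc_omp_taskwait loops until its task has no incomplete
// children. If the counter is never incremented for a hidden helper child
// (e.g. because the team is serialized) or never decremented, the waiter
// exits too early or spins forever.
std::atomic<int> td_incomplete_child_tasks{0};

void model_taskwait() {
  while (td_incomplete_child_tasks.load(std::memory_order_acquire) != 0) {
    // The real runtime tries to execute queued tasks here
    // (flag.execute_tasks) instead of merely busy-waiting.
  }
}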

(PS: I applied the patches from D106977, D107121, and D107316 for my tests)

This is how I tried to refine the tests to see what is going wrong:

diff --git a/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp b/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp
index 4bc27c1..a6e8cbb 100644
--- a/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp
+++ b/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp
@@ -44,6 +44,10 @@ template <int I>
 kmp_int32 omp_task_entry(kmp_int32 gtid, kmp_task_t_with_privates *task) {
   auto shareds = reinterpret_cast<anon *>(task->task.shareds);
   auto p = shareds->data;
+  auto v = *p;
+  if(v!=I-1)
+    printf("Fail: %i expect %i\n", v, I-1);
+  assert(v == I-1);
   *p += I;
   return 0;
 }
@@ -118,6 +122,8 @@ int main(int argc, char *argv[]) {
     // Wait for all tasks
     __kmpc_omp_taskwait(nullptr, gtid);
 
+    if(data!=15)
+      printf("Fail: %i\n",data);
     assert(data == 15);
   }
 
diff --git a/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp b/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp
index 8cec95b..8cab6c9 100644
--- a/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp
+++ b/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp
@@ -50,11 +50,15 @@ kmp_int32 omp_task_entry(kmp_int32 gtid, kmp_task_t_with_privates *task) {
   return 0;
 }
 
-template <bool hidden_helper_task> void assert_gtid(int v) {
+template <bool hidden_helper_task> void assert_gtid(int v, int I) {
   if (__kmp_hidden_helper_threads_num) {
     if (hidden_helper_task) {
+      if(!(v > 0 && v <= __kmp_hidden_helper_threads_num))
+        printf("Fail(%i, %i): v = %i\n", hidden_helper_task, I, v);
       assert(v > 0 && v <= __kmp_hidden_helper_threads_num);
     } else {
+      if(!(v == 0 || v > __kmp_hidden_helper_threads_num))
+        printf("Fail(%i, %i): v = %i\n", hidden_helper_task, I, v);
       assert(v == 0 || v > __kmp_hidden_helper_threads_num);
     }
   } else {
@@ -68,7 +72,7 @@ int main(int argc, char *argv[]) {
   constexpr const int N = 1024;
 #pragma omp parallel for
   for (int i = 0; i < N; ++i) {
-    int32_t data1 = -1, data2 = -1, data3 = -1;
+    int32_t data1 = -4, data2 = -4, data3 = -4;
     int depvar;
     int32_t gtid = __kmpc_global_thread_num(nullptr);
 
@@ -117,9 +121,9 @@ int main(int argc, char *argv[]) {
     __kmpc_omp_taskwait(nullptr, gtid);
 
     // FIXME: 8 here is not accurate
-    assert_gtid<false>(data1);
-    assert_gtid<true>(data2);
-    assert_gtid<false>(data3);
+    assert_gtid<false>(data1, 1);
+    assert_gtid<true>(data2, 2);
+    assert_gtid<false>(data3, 3);
   }
 
   std::cout << "PASS\n";

I didn't see an issue regarding the child task counter.
Increment:

if (flags->proxy == TASK_PROXY || flags->detachable == TASK_DETACHABLE ||
    flags->hidden_helper ||
    !(taskdata->td_flags.team_serial || taskdata->td_flags.tasking_ser)) {
  KMP_ATOMIC_INC(&parent_task->td_incomplete_child_tasks);
...

Decrement:

if (!(taskdata->td_flags.team_serial || taskdata->td_flags.tasking_ser) ||
    taskdata->td_flags.detachable == TASK_DETACHABLE ||
    taskdata->td_flags.hidden_helper) {
  __kmp_release_deps(gtid, taskdata);
  // Predecrement simulated by "- 1" calculation
#if KMP_DEBUG
  children = -1 +
#endif
      KMP_ATOMIC_DEC(&taskdata->td_parent->td_incomplete_child_tasks);
...

Another increment of the counter is in __kmp_task_dup_alloc, which is for taskloop and thus unrelated here. There is no other increment/decrement.
I checked the counter for the gtid case: after the three tasks are created, the counter is 3, as expected.

https://bugs.llvm.org/show_bug.cgi?id=49816 also reports that one task is missing, so I'm thinking the issue is not introduced by this patch. Can you double-check whether you can still observe the failure without the patch? If it is not introduced by this patch, I feel we could first get this patch in and then see if we can reproduce the issue in another way.

I think I found the issue with the two tests:
On our system, the default environment sets OMP_NUM_THREADS=1. When I set OMP_NUM_THREADS=5 before running the tests, the tests succeed. I guess you should be able to reproduce the assertion by exporting OMP_NUM_THREADS=1 before running the tests.
Since the tests do not specify an explicit number of threads (using the num_threads clause or setting the environment variable), the team will run serially.

The taskwait synchronization fails in that case because the last task has td_flags.team_serial set. I think td_incomplete_child_tasks needs to be maintained for serialized teams as soon as a hidden/detached task is created in a task. @AndreyChurbanov how should we handle the synchronization for such tasks?

For better test coverage, we should add two run lines to the tests: one setting OMP_NUM_THREADS=1 and one setting OMP_NUM_THREADS=5, e.g.:
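(Sketch only; the usual libomp lit substitutions are assumed:)

// RUN: %libomp-compile && env OMP_NUM_THREADS=1 %libomp-run
// RUN: %libomp-compile && env OMP_NUM_THREADS=5 %libomp-run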

protze.joachim accepted this revision. · Aug 4 2021, 12:48 AM

I can confirm that this patch fixes the deadlock issue in the runtime.

Please rephrase the subject of the patch before committing. It's not about performance; it really fixes a possible deadlock in the runtime.

This revision is now accepted and ready to land. · Aug 4 2021, 12:48 AM
tianshilei1992 edited the summary of this revision. · Aug 4 2021, 9:34 AM
This revision was landed with ongoing or failed builds. · Aug 4 2021, 9:34 AM
This revision was automatically updated to reflect the committed changes.
tianshilei1992 added a comment. (edited) · Aug 4 2021, 9:52 AM

Right. In D106519 I tried to reorder the code so that dependences are released before the counter is decremented, which makes sure the last task is pushed to the queue first. At that moment, if the initial thread happens to have just finished flag.execute_tasks(...), taskdata->td_incomplete_child_tasks becomes 0 and it exits the loop. If the initial thread is still spinning, trying to execute tasks, it will pass.

A detached task doesn't have this problem because the thread executing the bottom half can execute the task itself. But for a hidden helper task, if a regular task depends on a hidden helper task and happens to be the last one, it has to be pushed to its "encountering team", and whether it will actually be executed depends on timing.

protze.joachim added a comment. (edited) · Aug 4 2021, 11:42 AM

The issue you linked is somewhat different.

Using detached tasks, the gtid/depend test is similar to:

#include <omp.h>
#include <thread>
#include <atomic>
#include <unistd.h>
#include <assert.h>

std::atomic<int> a{0};
std::thread t;

void async(omp_event_handle_t e){
  sleep(1);
  assert(a==2);
  omp_fulfill_event(e);
}

int main(int argc, char *argv[]) {
#pragma omp parallel master
  {
#pragma omp task depend(out: a)
    { a = 1; }
    omp_event_handle_t event;
#pragma omp task depend(out: a) detach(event)
    { a = 2; t = std::thread(async, event);}
#pragma omp task depend(out: a)
    { a = 4; }
#pragma omp taskwait
    assert(a==4);
  }
  t.join();
  return 0;
}

Then compile/execute:

$ clang++ -fopenmp test.cpp
$ for i in $(seq 10); do numactl -C 0 env OMP_NUM_THREADS=1 ./a.out & done

For me, roughly 6 out of 10 processes print Assertion 'a==4' failed.

tianshilei1992 added a comment. (edited) · Aug 4 2021, 11:44 AM

Yeah, thanks for that. I updated my previous comment. It surprised me that even detached tasks fail as well... After I reordered the code, the counter can only be decremented once the chain of dependent tasks has finished.
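(Sketched from the decrement snippet quoted earlier in this thread: releasing dependences before the decrement means a waiter can only observe zero after the successor task has been enqueued:)

__kmp_release_deps(gtid, taskdata); // may push the dependent (last) task
KMP_ATOMIC_DEC(&taskdata->td_parent->td_incomplete_child_tasks);
// A thread spinning in taskwait can only see the counter reach 0 after the
// successor task already exists in a queue, so it cannot exit before then.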