This patch fixes the "performance regression" reported in https://bugs.llvm.org/show_bug.cgi?id=51235. In fact, the issue has nothing to do with performance. The root cause is that the stolen task cannot be executed by another thread because, by default, it is a tied task. Since a hidden helper task will always be executed by hidden helper threads, it should be untied.
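For reference, the tied/untied distinction the summary relies on can be seen at the user level with a small, self-contained illustration (my own example, not part of the patch): a tied task, once started, must be resumed by the thread that started it, while an untied task may be resumed by a different thread after a task scheduling point such as taskyield.

#include <omp.h>
#include <cstdio>

int main() {
#pragma omp parallel
#pragma omp single
  {
#pragma omp task untied
    {
      int before = omp_get_thread_num();
#pragma omp taskyield // scheduling point: an untied task may migrate to another thread here
      int after = omp_get_thread_num();
      std::printf("task started on thread %d, resumed on thread %d\n", before, after);
    }
  }
  return 0;
}

A hidden helper task that stays tied is pinned to whichever thread first touched it, which is what keeps another hidden helper thread from picking it up.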
Event Timeline
@tianshilei1992 When flaky tests fail or hang on the build bots, this is often triggered by unusual thread interleaving due to over-subscription.
To reproduce on my own system, it sometimes helped to oversubscribe the system by running multiple instances of the flaky test at the same time:
for i in $(seq 20); do ./gtid.cpp.out & done
or to repeat until failing/hanging:
while for i in $(seq 20); do ./gtid.cpp.out & done; wait; do true; done
Can you reproduce the hang on your system and verify that this change fixes the issue?
Yup. On my machine, it can be reproduced almost 100% of the time by using numactl to limit the run to 6 cores, the same as the test machine:
$ numactl -C 0,1,2,3,4,5 ./a.out
After applying the patch, I didn't see the failure again. Could you try with numactl on your side to see whether it can be easily reproduced and whether this patch fixes it? Thanks.
openmp/runtime/src/kmp_tasking.cpp, line 2793:
A side question here, what does "TSC" mean here?
Running the following command, I get the assertion below from the test:
for i in $(seq 4); do numactl -C 0,1,24,25 env LD_PRELOAD=openmp/runtime/src/libomp.so openmp/runtime/test/tasking/hidden_helper_task/Output/gtid.cpp.tmp & done; wait
gtid.cpp.tmp: llvm-project/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp:58: void assert_gtid(int) [hidden_helper_task = false]: Assertion `v == 0 || v > __kmp_hidden_helper_threads_num' failed.
openmp/runtime/src/kmp_tasking.cpp, line 2793:
TSC = task stealing constraint (see OpenMP spec)
Thanks. I'll see what I should do.
openmp/runtime/src/kmp_tasking.cpp, line 2793:
I guess it is "Task Scheduling Constraints", isn't it?
openmp/runtime/src/kmp_tasking.cpp, line 2793:
I meant task scheduling constraint
I cannot reproduce your failure. Can you double-check that the right libomp.so is being used?
➜ clang++ -fopenmp -L $HOME/Documents/deploy/openmp/release/lib $HOME/Documents/vscode/llvm-project/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp -o gtid
➜ for i in $(seq 20); do numactl -C 0,1,24,25 ./gtid & done; wait
[2] 1304258
[3] 1304259
[4] 1304260
[5] 1304261
[6] 1304262
[7] 1304263
[8] 1304264
[9] 1304265
[10] 1304266
[11] 1304267
[12] 1304268
[13] 1304269
[14] 1304270
[15] 1304271
[16] 1304272
[17] 1304273
[18] 1304274
[19] 1304275
[20] 1304276
[21] 1304277
PASS
[8]  1304264 done       numactl -C 0,1,24,25 ./gtid
PASS
[9]  1304265 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
PASS
[2]  1304258 done       numactl -C 0,1,24,25 ./gtid
[7]  1304263 done       numactl -C 0,1,24,25 ./gtid
[10]  1304266 done       numactl -C 0,1,24,25 ./gtid
PASS
[13]  1304269 done       numactl -C 0,1,24,25 ./gtid
PASS
[5]  1304261 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[20]  - 1304276 done       numactl -C 0,1,24,25 ./gtid
[11]  1304267 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[17]  1304273 done       numactl -C 0,1,24,25 ./gtid
[18]  1304274 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[19]  - 1304275 done       numactl -C 0,1,24,25 ./gtid
PASS
[16]  - 1304272 done       numactl -C 0,1,24,25 ./gtid
PASS
[21]  + 1304277 done       numactl -C 0,1,24,25 ./gtid
[12]  1304268 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[15]  + 1304271 done       numactl -C 0,1,24,25 ./gtid
[4]  1304260 done       numactl -C 0,1,24,25 ./gtid
PASS
PASS
[6]  - 1304262 done       numactl -C 0,1,24,25 ./gtid
[14]  + 1304270 done       numactl -C 0,1,24,25 ./gtid
PASS
[3]  + 1304259 done       numactl -C 0,1,24,25 ./gtid
➜ ldd gtid
        linux-vdso.so.1 (0x00007fffa74ad000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f20e2952000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f20e280d000)
        libomp.so => /home/shiltian/Documents/deploy/openmp/release/lib/libomp.so (0x00007f20e272a000)
        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f20e270f000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f20e26ed000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f20e2523000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f20e2b44000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f20e2516000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f20e250f000)
I execute in BUILD/runtime/runtime-bin and use LD_PRELOAD to make sure that I use the right libomp.
The system I use for testing has two sockets with 24 cores each. So, by using cores 24 and 25, I make sure to get the extra latency of communication between the sockets.
I added a print statement before the assertion and found that v=-1.
I added an atomic counter to count the executed tasks for each iteration of the for loop. If the assertion triggers, the task count is 2 instead of 3.
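A minimal sketch of that instrumentation, with assumed helper names (note_task_executed, check_task_count) rather than the actual test code:

#include <atomic>
#include <cstdio>

static std::atomic<int> executed_tasks{0};

// Would be called at the top of each omp_task_entry<I> in the test.
static void note_task_executed() {
  executed_tasks.fetch_add(1, std::memory_order_relaxed);
}

// Would be called right after __kmpc_omp_taskwait in each loop iteration.
static void check_task_count(int expected) {
  int got = executed_tasks.exchange(0, std::memory_order_relaxed);
  if (got != expected)
    std::printf("only %d of %d tasks ran before taskwait\n", got, expected);
}

int main() {
  // Tiny self-contained check of the helpers themselves.
  note_task_executed();
  note_task_executed();
  check_task_count(3); // prints "only 2 of 3 tasks ran before taskwait"
  return 0;
}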
Okay, thanks. Sadly, I still cannot reproduce it, even with the same configuration as yours:
for i in $(seq 1000); do numactl -C 0,1,29,30 ./gtid & done; wait
It ran 1000 times and nothing happened…
Looking at your ldd output, I realize that we use different C++ runtime libraries. I always build against libc++. Could this make a difference?
I reran the tests with different clang compiler versions (after building libomp standalone) and can reproduce the assertion with clang 11 and clang 12.
For depend.cpp, the assertion triggers with data = 7.
Thanks for the info. I tried with libc++, but still had no luck.
➜ ldd gtid
        linux-vdso.so.1 (0x00007ffd799ba000)
        libc++.so.1 => /home/shiltian/Documents/deploy/llvm/release/lib/libc++.so.1 (0x00007f05e1261000)
        libc++abi.so.1 => /home/shiltian/Documents/deploy/llvm/release/lib/libc++abi.so.1 (0x00007f05e1226000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f05e10e1000)
        libomp.so => /home/shiltian/Documents/deploy/openmp/release/lib/libomp.so (0x00007f05e0ffe000)
        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f05e0fe3000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f05e0fbf000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f05e0df5000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f05e0dea000)
        libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00007f05e0de0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f05e134f000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f05e0dd9000)
Can you try with GDB to see if the extra 8 threads are created?
I still get the same assertions.
It looks like the last task is not guaranteed to finish before the taskwait. The taskwait function spins until td_incomplete_child_tasks becomes 0. Is it possible that this variable is not properly maintained for hidden helper tasks?
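To make the synchronization in question concrete, here is a stripped-down, standalone model of that spin (my own simplification under the assumption that __kmpc_omp_taskwait behaves this way; it is not the libomp code): the waiting thread loops until the atomic child-task counter reaches zero, so any child that finishes without updating the counter leaves the wait hanging, and a counter that drops too early lets the wait return too soon.

#include <atomic>
#include <cassert>
#include <thread>

int main() {
  std::atomic<int> td_incomplete_child_tasks{3}; // three child tasks created

  std::thread children([&] {
    for (int i = 0; i < 3; ++i) // each finishing child decrements the counter
      td_incomplete_child_tasks.fetch_sub(1, std::memory_order_acq_rel);
  });

  // "taskwait": spin until all counted children have reported completion
  while (td_incomplete_child_tasks.load(std::memory_order_acquire) != 0)
    std::this_thread::yield(); // the real runtime tries to execute queued tasks here

  children.join();
  assert(td_incomplete_child_tasks == 0);
  return 0;
}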
(PS: I applied the patches from D106977, D107121, and D107316 for my tests)
This is how I tried to refine the tests to see what is going wrong:
diff --git a/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp b/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp
index 4bc27c1..a6e8cbb 100644
--- a/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp
+++ b/openmp/runtime/test/tasking/hidden_helper_task/depend.cpp
@@ -44,6 +44,10 @@ template <int I>
 kmp_int32 omp_task_entry(kmp_int32 gtid, kmp_task_t_with_privates *task) {
   auto shareds = reinterpret_cast<anon *>(task->task.shareds);
   auto p = shareds->data;
+  auto v = *p;
+  if(v!=I-1)
+    printf("Fail: %i expect %i\n", v, I-1);
+  assert(v == I-1);
   *p += I;
   return 0;
 }
@@ -118,6 +122,8 @@ int main(int argc, char *argv[]) {
 
   // Wait for all tasks
   __kmpc_omp_taskwait(nullptr, gtid);
+  if(data!=15)
+    printf("Fail: %i\n",data);
   assert(data == 15);
 }
diff --git a/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp b/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp
index 8cec95b..8cab6c9 100644
--- a/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp
+++ b/openmp/runtime/test/tasking/hidden_helper_task/gtid.cpp
@@ -50,11 +50,15 @@ kmp_int32 omp_task_entry(kmp_int32 gtid, kmp_task_t_with_privates *task) {
   return 0;
 }
 
-template <bool hidden_helper_task> void assert_gtid(int v) {
+template <bool hidden_helper_task> void assert_gtid(int v, int I) {
   if (__kmp_hidden_helper_threads_num) {
     if (hidden_helper_task) {
+      if(!(v > 0 && v <= __kmp_hidden_helper_threads_num))
+        printf("Fail(%i, %i): v = %i\n", hidden_helper_task, I, v);
       assert(v > 0 && v <= __kmp_hidden_helper_threads_num);
     } else {
+      if(!(v == 0 || v > __kmp_hidden_helper_threads_num))
+        printf("Fail(%i, %i): v = %i\n", hidden_helper_task, I, v);
       assert(v == 0 || v > __kmp_hidden_helper_threads_num);
     }
   } else {
@@ -68,7 +72,7 @@ int main(int argc, char *argv[]) {
   constexpr const int N = 1024;
 #pragma omp parallel for
   for (int i = 0; i < N; ++i) {
-    int32_t data1 = -1, data2 = -1, data3 = -1;
+    int32_t data1 = -4, data2 = -4, data3 = -4;
     int depvar;
     int32_t gtid = __kmpc_global_thread_num(nullptr);
@@ -117,9 +121,9 @@ int main(int argc, char *argv[]) {
   __kmpc_omp_taskwait(nullptr, gtid);
 
   // FIXME: 8 here is not accurate
-  assert_gtid<false>(data1);
-  assert_gtid<true>(data2);
-  assert_gtid<false>(data3);
+  assert_gtid<false>(data1, 1);
+  assert_gtid<true>(data2, 2);
+  assert_gtid<false>(data3, 3);
 }
 
 std::cout << "PASS\n";
I didn't see an issue regarding the child task counter.
Increment:
if (flags->proxy == TASK_PROXY || flags->detachable == TASK_DETACHABLE ||
    flags->hidden_helper ||
    !(taskdata->td_flags.team_serial || taskdata->td_flags.tasking_ser)) {
  KMP_ATOMIC_INC(&parent_task->td_incomplete_child_tasks);
  ...
Decrement:
if (!(taskdata->td_flags.team_serial || taskdata->td_flags.tasking_ser) ||
    taskdata->td_flags.detachable == TASK_DETACHABLE ||
    taskdata->td_flags.hidden_helper) {
  __kmp_release_deps(gtid, taskdata);
// Predecrement simulated by "- 1" calculation
#if KMP_DEBUG
  children = -1 +
#endif
      KMP_ATOMIC_DEC(&taskdata->td_parent->td_incomplete_child_tasks);
  ...
Another increment of the counter is in __kmp_task_dup_alloc, which is for taskloop and is unrelated here. There is no other increment/decrement.
I did check the counter for the gtid case, and after the three tasks are created, the counter is 3, as expected.
https://bugs.llvm.org/show_bug.cgi?id=49816 also reports that one task is missing, so I'm thinking the issue is not introduced by this patch. Can you help double-check whether you can still observe the failure without the patch? If it is not introduced by this patch, I feel we could first get this patch in and then see if we can reproduce the issue in another way.
I think, I found the issue with the two tests:
On our system, the default env sets OMP_NUM_THREADS=1. When I set OMP_NUM_THREADS=5 before running the tests, the tests succeed. I guess you should be able to reproduce the assertion by exporting OMP_NUM_THREADS=1 before running the tests.
Since the tests do not specify an explicit number of threads (using the num_threads clause or setting the env var), the team will run serially.
The taskwait synchronization fails in that case, because the last task has td_flags.team_serial set. I think td_incomplete_child_tasks needs to be maintained for serialized teams as soon as a hidden helper or detached task is created in a task. @AndreyChurbanov how should we handle the synchronization for such tasks?
For better test coverage, we should add two run lines to the tests, one setting OMP_NUM_THREADS=1, one setting OMP_NUM_THREADS=5.
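Something along these lines, assuming the usual libomp lit substitutions (%libomp-compile / %libomp-run) apply to these tests; the exact spelling would need to match the existing RUN lines:

// RUN: %libomp-compile && env OMP_NUM_THREADS=1 %libomp-run
// RUN: %libomp-compile && env OMP_NUM_THREADS=5 %libomp-run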
I can confirm that this patch fixes the deadlock issue in the runtime.
Please rephrase the subject of the patch before committing. It's not about performance, but really fixes a possible deadlock in the runtime.
Right. In D106519 I tried to reorder the code to make the dependence release happen before decrementing the counter, so that we can make sure the last task is pushed to the queue. At that moment, if the initial thread happens to finish the call to flag.execute_tasks(...) and sees that taskdata->td_incomplete_child_tasks has become 0, it exits the loop. If the initial thread is still spinning and trying to execute tasks, the test will pass.
A detached task doesn't have this problem because the thread executing the bottom half can execute the task itself. But for a hidden helper task, if a regular task depends on a hidden helper task and happens to be the last one, it has to be pushed to its "encountering team", and whether it will actually be executed depends on the timing.
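As a standalone simplification of that scenario (my own model, under the assumption that in a serialized team the regular successor task is not counted in td_incomplete_child_tasks while the hidden helper task is): once the helper finishes and decrements the counter, the wait stops even though the just-released successor has not run.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> counted_children{1};        // only the hidden helper task is counted
std::atomic<bool> successor_released{false}; // the dependent regular task
std::atomic<bool> successor_executed{false};

int main() {
  std::thread hidden_helper([] {
    successor_released.store(true, std::memory_order_release); // push successor to the team's queue
    counted_children.fetch_sub(1, std::memory_order_acq_rel);  // finish the helper task
  });

  // "taskwait" in the serialized encountering team only watches the counter
  while (counted_children.load(std::memory_order_acquire) != 0) {
  }

  // The successor was released but nobody executed it, which matches the
  // observation above (task count 2 instead of 3, data3 left untouched).
  std::printf("released=%d executed=%d\n", (int)successor_released.load(),
              (int)successor_executed.load());

  hidden_helper.join();
  return 0;
}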
The issue you linked is somewhat different.
Using detached tasks, the gtid/depend test is similar to:
#include <omp.h>
#include <thread>
#include <atomic> // needed for std::atomic
#include <unistd.h>
#include <assert.h>

std::atomic<int> a{0};
std::thread t;

void async(omp_event_handle_t e) {
  sleep(1);
  assert(a == 2);
  omp_fulfill_event(e);
}

int main(int argc, char *argv[]) {
#pragma omp parallel master
  {
#pragma omp task depend(out: a)
    { a = 1; }

    omp_event_handle_t event;
#pragma omp task depend(out: a) detach(event)
    {
      a = 2;
      t = std::thread(async, event);
    }

#pragma omp task depend(out: a)
    { a = 4; }

#pragma omp taskwait
    assert(a == 4);
  }
  t.join();
  return 0;
}
Then compile/execute:
$ clang++ -fopenmp test.cpp
$ for i in $(seq 10); do numactl -C 0 env OMP_NUM_THREADS=1 ./a.out & done
For me, roughly 6 out of 10 processes print Assertion 'a==4' failed.
Yeah, thanks for that. I updated my previous comment. It surprised me that even detached tasks also fail... After I reordered the code, the counter can only be decremented once the chain of dependent tasks has finished.