PR#49334 reports a crash when offloading to x86_64 with target nowait, caused by dereferencing a nullptr. The root cause is that when a hidden helper task is pushed in __kmp_push_task, the gtid is also mapped to its shadow gtid, which is wrong.
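For context, a minimal sketch of the shape of such a guard, assuming the UNLIKELY, KMP_HIDDEN_HELPER_THREAD, and KMP_GTID_TO_SHADOW_GTID macros from kmp.h; this is an illustration of the idea, not the exact diff:

  // Inside __kmp_push_task: remap the gtid for a hidden helper task only if
  // the pushing thread is not itself a hidden helper thread. If the caller
  // already runs on a hidden helper thread, its gtid is already a shadow
  // gtid, and remapping it again yields a bogus thread index, which leads
  // to the nullptr dereference.
  if (UNLIKELY(taskdata->td_flags.hidden_helper &&
               !KMP_HIDDEN_HELPER_THREAD(gtid))) {
    gtid = KMP_GTID_TO_SHADOW_GTID(gtid);
    thread = __kmp_threads[gtid];
  }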
Event Timeline
I didn't include the reproducer because it still fails due to a computation error; the same code passes on the NVPTX target.
This patch fixes the segfault in __kmp_push_task when executing the code with OMP_NUM_THREADS>1.
I accidentally ran the test case in this patch with OMP_NUM_THREADS=1 (which happens to be the default on our cluster) and could not even get a stack trace after the crash.
This is another regression from the 11 release.
By including openmp/runtime/test/ompt/callback.h, I could identify that the segfault seems to occur inside the OpenMP for loop: the last OMPT event printed by the crashing thread is ompt_event_loop_begin.
If you build the OpenMP runtime with debugging symbols and include callback.h to dump the OMPT events (I just add -include .../llvm-project/openmp/runtime/test/ompt/callback.h to the compile line), you can get the OMPT thread id with:
(gdb) print __kmp_threads[__kmp_gtid].th.ompt_thread_info.thread_data.value
This is the same ID that appears at the beginning of every line of callback.h output from that thread.
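For anyone who hasn't used callback.h: it registers an OMPT tool that assigns each OpenMP thread an id in thread_data.value, which is exactly what the gdb expression above reads back. A minimal sketch of that mechanism, using only the standard OMPT entry points from omp-tools.h (the real header registers many more callbacks, and the helper names here are mine):

  #include <omp-tools.h>
  #include <atomic>
  #include <cinttypes>
  #include <cstdio>

  static std::atomic<uint64_t> next_id{1};

  // Assign every starting thread a small id in thread_data->value; the
  // runtime keeps this ompt_data_t around, so gdb can read it back later.
  static void on_thread_begin(ompt_thread_t type, ompt_data_t *thread_data) {
    thread_data->value = next_id++;
    std::printf("%" PRIu64 ": ompt_event_thread_begin\n", thread_data->value);
  }

  static int my_initialize(ompt_function_lookup_t lookup,
                           int initial_device_num, ompt_data_t *tool_data) {
    auto set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_thread_begin, (ompt_callback_t)on_thread_begin);
    return 1; // non-zero keeps the tool active
  }

  static void my_finalize(ompt_data_t *tool_data) {}

  // The runtime looks up this symbol at startup to activate the tool.
  extern "C" ompt_start_tool_result_t *
  ompt_start_tool(unsigned int omp_version, const char *runtime_version) {
    static ompt_start_tool_result_t result = {my_initialize, my_finalize, {0}};
    return &result;
  }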
Another observation:
clang/11 would execute all code in the test on the initial thread, avoiding overloading the system. The current implementation will execute an instance of the target region on each of the hidden threads, effectively oversubscribing the system when running on a dual-core desktop. Is this the intended behavior?
I'll take a look and fix it in another patch.
Gotcha. Will take a look. Thanks!
> Another observation:
> clang/11 would execute all code in the test on the initial thread, avoiding overloading the system. The current implementation will execute an instance of the target region on each of the hidden threads, effectively oversubscribing the system when running on a dual-core desktop. Is this the intended behavior?
Yes. Actually, most of the time those hidden helper threads are sleeping (either waiting for a job or waiting for a job to finish), so it should be fine. After all, there are so many threads in the system already. :-) However, I have to admit that is probably not the case for x86_64 target offloading. In that case, reducing the number of hidden helper threads to just one could help. (There is currently a bug when setting the number to 1, but I'll fix it accordingly.)
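For context on why the helper threads show up at all: each target nowait region is wrapped in a hidden helper task that one of the hidden helper threads executes asynchronously. A minimal sketch of the pattern (illustrative sizes, not the test from this review); assuming the LLVM OpenMP runtime's LIBOMP_NUM_HIDDEN_HELPER_THREADS environment variable, running with LIBOMP_NUM_HIDDEN_HELPER_THREADS=1 should shrink the helper team on small machines:

  #include <cstdio>

  int main() {
    const int N = 256, BS = 64; // illustrative sizes only
    int a[N] = {0};

  #pragma omp parallel
  #pragma omp single
    for (int i = 0; i < N; i += BS) {
      // Offloaded asynchronously: a hidden helper thread drives the data
      // transfer and kernel launch while this thread keeps creating tasks.
  #pragma omp target map(tofrom : a[i:BS]) nowait
      for (int j = i; j < i + BS; ++j)
        a[j] += j;
    }
  #pragma omp taskwait // wait for all deferred target tasks

    std::printf("a[%d] = %d\n", N - 1, a[N - 1]);
    return 0;
  }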
The test tends to fail for x86_64 offloading. When I reduce the BS to 64, the test also fails for nvptx offloading.
Hangs on amdgpu offloading with these parameters; the suggestion in D102017 to use BS = 16; / N = 256; seems to replace the hang with a segv. Valgrind says:
==392314== Conditional jump or move depends on uninitialised value(s)
==392314==    at 0x4BCE15E: __kmp_push_task(int, kmp_task*) (in /home/amd/llvm-build/llvm/runtimes/runtimes-bins/openmp/runtime/src/libomp.so)
A little while afterwards it dereferences 0x50 and faults. Not sure I have the time to set up a debug build and go race chasing right now.