This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Fixed a crash when offloading to x86_64 with target nowait
ClosedPublic

Authored by tianshilei1992 on Feb 23 2021, 12:45 PM.

Details

Summary

PR#49334 reports a crash when offloading to x86_64 with target nowait,
which is caused by dereferencing a nullptr. The root cause of the issue is
that, when pushing a hidden helper task in __kmp_push_task, the runtime also
maps the gtid to its shadow gtid, which is wrong.

Diff Detail

Event Timeline

tianshilei1992 requested review of this revision.Feb 23 2021, 12:45 PM
Herald added a project: Restricted Project.

I didn't include the reproducer because it does not pass due to a computation error; the same code passes on the NVPTX target.
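
For reference, the pattern being exercised is a batch of concurrent target nowait regions, roughly like the sketch below (BS/N are placeholder sizes; this is not the exact reproducer from the PR nor the test added here):

#include <cassert>

// Placeholder sizes; the values under discussion later (BS, N) differ.
constexpr int N = 1024;
constexpr int BS = 256;

int main() {
  int a[N], b[N], c[N];
  for (int i = 0; i < N; ++i) {
    a[i] = i;
    b[i] = 2 * i;
    c[i] = 0;
  }

  // Launch one asynchronous target region per block; the implicit barrier at
  // the end of the parallel region waits for the deferred target tasks.
#pragma omp parallel for
  for (int i = 0; i < N; i += BS) {
#pragma omp target map(to : a[i : BS], b[i : BS]) map(from : c[i : BS]) nowait
    for (int j = i; j < i + BS; ++j)
      c[j] = a[j] + b[j];
  }

  // Verify the offloaded computation on the host.
  for (int i = 0; i < N; ++i)
    assert(c[i] == a[i] + b[i]);
  return 0;
}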

jdoerfert accepted this revision.Feb 23 2021, 1:00 PM

LG. Backport if possible and add the test (if we don't have one).

This revision is now accepted and ready to land.Feb 23 2021, 1:00 PM

Added the test, although it is expected to fail on x86_64.

update test case

This patch fixes the segfault in __kmp_push_task when executing the code with OMP_NUM_THREADS>1.
I accidentally ran the test case from this patch with OMP_NUM_THREADS=1 (which happens to be the default on our cluster) and could not even get a stack trace after the crash.

This is another regression from the 11 release.

By including openmp/runtime/test/ompt/callback.h, I could identify that the segfault seems to occur inside the OpenMP for loop. The last OMPT event printed by the crashing thread is ompt_event_loop_begin.

If you build the OpenMP runtime with debugging symbols and include callback.h to dump the OMPT events (I just add -include .../llvm-project/openmp/runtime/test/ompt/callback.h to the compile line), you can get the OMPT thread-id with:

(gdb) print __kmp_threads[__kmp_gtid].th.ompt_thread_info.thread_data.value

This is the same ID that appears at the beginning of all callback.h output from that thread.
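
For example, a compile line in that spirit could look like the following (illustrative only; adjust the paths, flags, and offload target for your setup):

clang++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu -g -include .../llvm-project/openmp/runtime/test/ompt/callback.h test.cpp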

Another observation:
clang/11 would execute all code in the test on the initial thread, avoiding overloading the system. The current implementation will execute an instance of the target region on each of the hidden helper threads, effectively oversubscribing the system when running on a dual-core desktop. Is this the intended behavior?

> This patch fixes the segfault in __kmp_push_task when executing the code with OMP_NUM_THREADS>1.
> I accidentally ran the test case from this patch with OMP_NUM_THREADS=1 (which happens to be the default on our cluster) and could not even get a stack trace after the crash.

I'll take a look and fix it in another patch.

> By including openmp/runtime/test/ompt/callback.h, I could identify that the segfault seems to occur inside the OpenMP for loop. The last OMPT event printed by the crashing thread is ompt_event_loop_begin.
>
> If you build the OpenMP runtime with debugging symbols and include callback.h to dump the OMPT events (I just add -include .../llvm-project/openmp/runtime/test/ompt/callback.h to the compile line), you can get the OMPT thread-id with:
>
> (gdb) print __kmp_threads[__kmp_gtid].th.ompt_thread_info.thread_data.value
>
> This is the same ID that appears at the beginning of all callback.h output from that thread.

Gotcha. Will take a look. Thanks!

> Another observation:
> clang/11 would execute all code in the test on the initial thread, avoiding overloading the system. The current implementation will execute an instance of the target region on each of the hidden helper threads, effectively oversubscribing the system when running on a dual-core desktop. Is this the intended behavior?

Yes. Most of the time those hidden helper threads are sleeping (either waiting for a job or waiting for a job to finish), so it should be fine; after all, there are already plenty of threads in the system. :-) However, I have to admit that this is probably not the case for x86_64 target offloading. In that case, reducing the number of hidden helper threads to just one could help. (There is currently a bug when setting the number to 1, but I'll fix it accordingly.)
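
(For reference, and if I remember the knobs correctly, the number of hidden helper threads can be controlled via the LIBOMP_NUM_HIDDEN_HELPER_THREADS environment variable, and the feature can be disabled entirely with LIBOMP_USE_HIDDEN_HELPER_TASK=0.)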

The test tends to fail for x86_64 offloading. When I reduce BS to 64, it also fails for nvptx offloading.

The test hangs on amdgpu offloading with these parameters; the suggestion in D102017 to use BS = 16; / N = 256; seems to replace the hang with a segfault. Valgrind says:

==392314== Conditional jump or move depends on uninitialised value(s)
==392314==    at 0x4BCE15E: __kmp_push_task(int, kmp_task*) (in /home/amd/llvm-build/llvm/runtimes/runtimes-bins/openmp/runtime/src/libomp.so)

A little while afterwards it dereferences 0x50 and faults. I'm not sure I have the time to set up a debug build and go race-chasing right now.