This is an archive of the discontinued LLVM Phabricator instance.

tsan: fix deadlock during race reporting
ClosedPublic

Authored by dvyukov on Dec 20 2021, 7:14 AM.

Details

Summary

SlotPairLocker calls SlotLock under ctx->multi_slot_mtx.
SlotLock can invoke global reset DoReset if we are out of slots/epochs.
But DoReset locks ctx->multi_slot_mtx as well, which leads to deadlock.

Resolve the deadlock by removing SlotPairLocker/multi_slot_mtx
and only lock one slot for which we will do RestoreStack.
We need to lock that slot because RestoreStack accesses the slot journal.
But it's unclear why we need to lock the current slot.
Initially I did it just to be on the safer side (but at that time
we dit not lock the second slot, so it was easy just to lock the current slot).

Diff Detail

Event Timeline

dvyukov requested review of this revision.Dec 20 2021, 7:14 AM
dvyukov created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptDec 20 2021, 7:14 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
melver accepted this revision.Dec 20 2021, 8:28 AM
This revision is now accepted and ready to land.Dec 20 2021, 8:28 AM
This revision was landed with ongoing or failed builds.Dec 20 2021, 9:52 AM
This revision was automatically updated to reflect the committed changes.
yln added a subscriber: yln.EditedDec 21 2021, 11:00 AM

This commit causes compiler-rt/test/tsan/libdispatch/dispatch_once_deadlock.c to hang for a debug (but not release) build of the TSan runtime (COMPILER_RT_DEBUG=ON). Not resolved by D116091.
@vitalybuka @dvyukov

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * #0: 0x00007ff810f48ab2 libsystem_kernel.dylib
    #1: 0x00007ff810f82b81 libsystem_pthread.dylib`sched_yield + 11 at .../libpthread/libpthread-486.100.10/src/pthread.c:2254
    #2: 0x0000000100639729 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::FutexWait(p=<unavailable>, cmp=<unavailable>) + 9 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mac.cpp:516
    #3: 0x000000010063bb9a libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Semaphore::Wait(this=0x0000000101aa6e08) + 58 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mutex.cpp:35
    #4: 0x00000001006389d7 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Mutex::Lock(this=0x0000000101aa6e00) + 183 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mutex.h:194
    #5: 0x000000010064120d libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::ThreadRegistry::Lock(this=<unavailable>) + 13 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_thread_registry.h:98
    #6: 0x00000001006411ef libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::GenericScopedLock<__sanitizer::ThreadRegistry>::GenericScopedLock(this=<unavailable>, mu=<unavailable>) + 15 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mutex.h:367
    #7: 0x000000010063feb9 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::GenericScopedLock<__sanitizer::ThreadRegistry>::GenericScopedLock(this=<unavailable>, mu=<unavailable>) + 9 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mutex.h:366
    #8: 0x000000010067f586 libclang_rt.tsan_osx_dynamic.dylib`__tsan::SlotLock(thr=0x0000000103abd100) + 54 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:343
    #9: 0x00000001006967e9 libclang_rt.tsan_osx_dynamic.dylib`__tsan::Acquire(__tsan::ThreadState*, unsigned long, unsigned long) [inlined] __tsan::SlotLocker::SlotLocker(thr=0x0000000103abd100, recursive=false) + 57 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:616
    #10: 0x00000001006967e1 libclang_rt.tsan_osx_dynamic.dylib`__tsan::Acquire(__tsan::ThreadState*, unsigned long, unsigned long) [inlined] __tsan::SlotLocker::SlotLocker(thr=0x0000000103abd100, recursive=false) at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:614
    #11: 0x00000001006967e1 libclang_rt.tsan_osx_dynamic.dylib`__tsan::Acquire(thr=0x0000000103abd100, pc=<unavailable>, addr=<unavailable>) + 49 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp:447
    #12: 0x00000001006a25a8 libclang_rt.tsan_osx_dynamic.dylib`::wrap_dispatch_once(predicate=0x00000001000040c8, block=0x0000000100003ce0) + 216 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_libdispatch.cpp:306
    #13: 0x0000000100003c9f dispatch_once_deadlock.c.tmp`f [inlined] _dispatch_once + 40 at /Applications/XcodeSunriverE.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.3.Internal.sdk/usr/include/dispatch/once.h:84
    #14: 0x0000000100003c77 dispatch_once_deadlock.c.tmp`f + 39 at .../llvm-project/compiler-rt/test/tsan/libdispatch/dispatch_once_deadlock.c:15
    #15: 0x0000000100003d76 dispatch_once_deadlock.c.tmp`__tsan_on_report + 54 at .../llvm-project/compiler-rt/test/tsan/libdispatch/dispatch_once_deadlock.c:23
    #16: 0x0000000100698286 libclang_rt.tsan_osx_dynamic.dylib`__tsan::OutputReport(thr=0x0000000103abd100, srep=<unavailable>) + 390 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl_report.cpp:692
    #17: 0x0000000100695bd7 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ReportMutexMisuse(thr=0x0000000103abd100, pc=<unavailable>, typ=ReportTypeMutexBadUnlock, addr=140702053824072, creation_stack_id=<unavailable>) + 183 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp:65
    #18: 0x0000000100695dd4 libclang_rt.tsan_osx_dynamic.dylib`__tsan::MutexUnlock(thr=0x0000000103abd100, pc=<unavailable>, addr=<unavailable>, flagz=<unavailable>) + 468 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp:256
    #19: 0x00000001006658fa libclang_rt.tsan_osx_dynamic.dylib`::wrap_pthread_mutex_unlock(m=0x00007ff7bfeff648) + 90 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_common_interceptors.inc:4339
    #20: 0x0000000100003de9 dispatch_once_deadlock.c.tmp`main + 105 at .../llvm-project/compiler-rt/test/tsan/libdispatch/dispatch_once_deadlock.c:32
    #21: 0x000000010001150e dyld`start(kernArgs=<unavailable>) + 462 at .../dyld/dyld-951/dyld/dyldMain.cpp:879
#15: 0x0000000100003d76 dispatch_once_deadlock.c.tmp`__tsan_on_report + 54 at .../llvm-project/compiler-rt/test/tsan/libdispatch/dispatch_once_deadlock.c:23

void __tsan_on_report() {

fprintf(stderr, "Report.\n");
f();

}

Calling instrumented code from runtime callbacks is not supported.

The test just says:
// Check that calling dispatch_once from a report callback works.
but it does not explain why this is needed.

yln added a comment.EditedDec 21 2021, 4:57 PM

Calling instrumented code from runtime callbacks is not supported.

Acknowledged. I update the test: https://reviews.llvm.org/rG63ddf0baf37edbfdf482853870238f9f88bfeb0b

The test just says:
// Check that calling dispatch_once from a report callback works.
but it does not explain why this is needed.

For example, if we want to write to a file or os_log() from __tsan_on_report() and we get unlucky and encounter an uninitialized dispatch_once() in a system library.

When I comment out this line at compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:343 we don't deadlock:

void SlotLock(ThreadState* thr) NO_THREAD_SAFETY_ANALYSIS {
  DCHECK(!thr->slot_locked);
#if SANITIZER_DEBUG
  // Check these mutexes are not locked.
  // We can call DoReset from SlotAttachAndLock, which will lock
  // these mutexes, but it happens only every once in a while.
  { ThreadRegistryLock lock(&ctx->thread_registry); }  <<<<< line 343
  { Lock lock(&ctx->slot_mtx); }
#endif

Does that help? Any ideas?

For example, if we want to write to a file or os_log() from __tsan_on_report() and we get unlucky and encounter an uninitialized dispatch_once() in a system library.

System calls will work in this context, or calling sanitizer routines (e.g. Printf). Anything that sanitizers don't intercept.

Allowing arbitrary code in the callbacks would require moving them out of the scope of all runtime mutexes. But it's non-trivial to do.

Adding ignoring of interceptors/synchronization/memory accesses may hide a particular problem, but it may re-appear later as we need to handle some interceptors even when interceptors are ignored.

When I comment out this line at compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:343 we don't deadlock:

void SlotLock(ThreadState* thr) NO_THREAD_SAFETY_ANALYSIS {
  DCHECK(!thr->slot_locked);
#if SANITIZER_DEBUG
  // Check these mutexes are not locked.
  // We can call DoReset from SlotAttachAndLock, which will lock
  // these mutexes, but it happens only every once in a while.
  { ThreadRegistryLock lock(&ctx->thread_registry); }  <<<<< line 343
  { Lock lock(&ctx->slot_mtx); }
#endif

Does that help? Any ideas?

The comment there explains what happens. If we remove this, it may make the toy test work, but the code that does this in real life will still deadlock from time to time.

yln added a comment.Dec 22 2021, 12:14 PM

System calls will work in this context, or calling sanitizer routines (e.g. Printf). Anything that sanitizers don't intercept.

Thank you for your explanation, Dmitry! I appreciate it.

I updated my mental model to: TSan runtime callbacks should not execute any code that might call back into the runtime; TSan runtime does not support reentrancy from callbacks.

I will look into changing this test.