Page MenuHomePhabricator

tsan: new runtime (v3)
ClosedPublic

Authored by dvyukov on Oct 27 2021, 2:02 AM.

Details

Summary

This change switches tsan to the new runtime which features:

  • 2x smaller shadow memory (2x of app memory)
  • faster fully vectorized race detection
  • small fixed-size vector clocks (512b)
  • fast vectorized vector clock operations
  • unlimited number of alive threads/goroutimes

Depends on D112602.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
yln added a comment.Nov 19 2021, 9:47 AM

fix Mac fork deadlock

Will test, please rebase (patch does not apply cleanly).

fix Mac fork deadlock

Will test, please rebase (patch does not apply cleanly).

It was freshly rebased onto main:

$ git log --oneline
17871bb55d11 tsan: new runtime (v3)
d1b89d9c5743 tsan: add another fork test
6774cc33f75f (origin/main, main) [NFC][llvm] Inclusive language: remove instance of master in IntrinsicsNVVM.td

compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp
115–116

I rebased these series. Ping.

115–116

Builds successfully, but at least these tests deadlock on my Intel Mac:

Thanks for testing.

I've added a test that reproduces this deadlock on Linux:
https://reviews.llvm.org/D114250

And folded a fix into this change:
https://reviews.llvm.org/D112603?vs=387678&id=388509#toc
(the ForkBefore/After part of the diff)

I would appreciate another test. Thanks.

It was freshly rebased onto main:

$ git log --oneline
17871bb55d11 tsan: new runtime (v3)
d1b89d9c5743 tsan: add another fork test
6774cc33f75f (origin/main, main) [NFC][llvm] Inclusive language: remove instance of master in IntrinsicsNVVM.td

Just in case, I also pushed the tree here:
https://github.com/dvyukov/llvm-project/commits/tsanv3

yln added a comment.Nov 19 2021, 11:35 AM

The following 2 tests spinlock:

ThreadSanitizer-x86_64 :: Darwin/debug_external.cpp
ThreadSanitizer-x86_64 :: Darwin/external-swift-debugging.cpp

And a few unit tests fail:

ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicRwMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicSpinMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ReportRace
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.SimpleWrite
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ThreadSync

I hope that I can contribute some cycles to help investigate the issues with the unit tests after the break.

The following 2 tests spinlock:

ThreadSanitizer-x86_64 :: Darwin/debug_external.cpp
ThreadSanitizer-x86_64 :: Darwin/external-swift-debugging.cpp

Mailed https://reviews.llvm.org/D114348 to fix this.

And a few unit tests fail:

ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicRwMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicSpinMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ReportRace
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.SimpleWrite
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ThreadSync

I hope that I can contribute some cycles to help investigate the issues with the unit tests after the break.

This one is not immediately obvious to me.
The InitializeShadowMemory/dlsym part of the stack was changed in already landed https://github.com/llvm/llvm-project/commit/20ee72d4ccb17
And even before that commit we called dlsym from __tsan::Initialize.
Is it passing on the current HEAD?
I also noticed that TsanRtlTest is built with stale runtime sometimes and I need to run 'ninja check-tsan' 3 times before it gets built with the current runtime. Maybe that's what happened?

There are some bugs related to big-endian arches. ppc64 and s390x bots showed some failures, both are big-endian.
There are a number of sloppy bit manipulations in FastState/Shadow types. Trying to figure out exact bugs and testing story.

I've fixed a number of endian-ness bugs in tsan_shadow.h and added a unit test for shadow methods.
Also fixed a bug in MemoryRangeSet, which was uncovered by test/tsan/ignored-interceptors-mmap.cpp on powerpc (due to different page size) and extended test/tsan/ignored-interceptors-mmap.cpp to detect such bugs on x86_64 as well (D114366).

I am going to do another presubmit since all known bugs were fixed.

This revision was landed with ongoing or failed builds.Nov 22 2021, 6:55 AM
This revision was automatically updated to reflect the committed changes.

Hi, it seems that this is causing failures on ppc64le-rhel, https://lab.llvm.org/buildbot/#/builders/57/builds/12365.

dvyukov reopened this revision.Nov 22 2021, 10:22 AM
This revision is now accepted and ready to land.Nov 22 2021, 10:22 AM
dvyukov updated this revision to Diff 388964.Nov 22 2021, 10:22 AM

rebase and reopen

Fixed 2 easy bugs pointed by buildbots:

https://lab.llvm.org/buildbot#builders/57/builds/12365

/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/runtimes/../llvm/utils/unittest/googletest/include/gtest/gtest.h:1527:11: error: comparison of integers of different signs: 'const unsigned long' and 'const int' [-Werror,-Wsign-compare]
  if (lhs == rhs) {
      ~~~ ^  ~~~
/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/runtimes/../llvm/utils/unittest/googletest/include/gtest/gtest.h:1554:12: note: in instantiation of function template specialization 'testing::internal::CmpHelperEQ<unsigned long, int>' requested here
    return CmpHelperEQ(lhs_expression, rhs_expression, lhs, rhs);
           ^
/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/compiler-rt/lib/tsan/tests/unit/tsan_flags_test.cpp:125:3: note: in instantiation of function template specialization 'testing::internal::EqHelper::Compare<unsigned long, int, nullptr>' requested here
  EXPECT_EQ(f->history_size, 6);

and:
https://lab.llvm.org/buildbot/#/builders/197/builds/588

XPASS: ThreadSanitizer-aarch64:: bench_threads.cpp

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

222: SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) deadlock_detector_stress_test.cpp:96 in PthreadRWLock::rdlock() 
           230: ==6031==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x000109ec7aef bp 0x7e8000103e10 sp 0x7e8000103e10 T7632574) 
           231: ==6031==The signal is caused by a READ memory access. 
           232: ==6031==Hint: address points to the zero page. 
           233:  #0 __tsan::TraceSwitchPart(__tsan::ThreadState*) tsan_rtl.cpp:854 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x68aef) 
           234:  #1 __tsan::TraceRestartFuncExit(__tsan::ThreadState*) tsan_rtl_access.cpp:101 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x6b090) 
           235:  #2 PthreadRWLock::rdlock() deadlock_detector_stress_test.cpp:96 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x100005006) 
           236:  #3 LockTest::RL(unsigned long) deadlock_detector_stress_test.cpp:131 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x100004e70) 
           237:  #4 LockTest::Test12_Thread() deadlock_detector_stress_test.cpp:376 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x1000054ff) 
           238:  #5 LockTest::Thread(void*) deadlock_detector_stress_test.cpp:578 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x1000049d5) 
           239:  #6 __tsan_thread_start_func tsan_interceptors_posix.cpp:993 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x2e9ea) 

/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/fork_deadlock.cpp:37:11: error: CHECK: expected string not found in input
            1: .ThreadSanitizer:DEADLYSIGNAL 
            2: ==6224==ERROR: ThreadSanitizer: SEGV on unknown address 0x000000012b88 (pc 0x00010b2da9c5 bp 0x7e8000103eb0 sp 0x7e8000103e90 T7633409) 
            3: ==6224==The signal is caused by a READ memory access. 
            4: . #0 __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator32<__sanitizer::AP32>, __sanitizer::LargeMmapAllocatorPtrArrayStatic>::Allocate(__sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__sanitizer::AP32> >*, unsigned long, unsigned long) sanitizer_allocator_combined.h:69 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x29c5) 
check:37'0                 5:  #1 __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__sanitizer::AP32> >*, unsigned long) sanitizer_allocator.cpp:148 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x239b) 
check:37'0                 6:  #2 __tsan::VectorClock::ReleaseAcquire(__tsan::VectorClock**) tsan_vector_clock.cpp:107 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x80ff3) 
            7:  #3 __tsan_atomic32_fetch_add tsan_interface_atomic.cpp:590 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x51838) 
            
            

/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/stress.cpp:77:11: error: CHECK: expected string not found in input
// CHECK: DONE
          ^
<stdin>:1:1: note: scanning from here
            1: ThreadSanitizerThreadSanitizer:DEADLYSIGNAL 
            2: :DEADLYSIGNAL 
            3: ==8027==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x0001087d4aef bp 0x7e8000727ec0 sp 0x7e8000727ec0 T7644989) 
            4: ==8027==The signal is caused by a READ memory access. 
            5: ==8027==Hint: address points to the zero page. 
            6: ThreadSanitizer:DEADLYSIGNAL 


/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/flush_memory.cpp:35:11: error: CHECK: expected string not found in input
// CHECK: WARNING: ThreadSanitizer: data race
          ^
<stdin>:1:1: note: scanning from here
ThreadSanitizer:DEADLYSIGNAL
<stdin>:2:7: note: possible intended match here
==6211==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x000106a98aef bp 0x7e8000103ef0 sp 0x7e8000103ef0 T7633393)

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Oh, I see, Mac stores ThreadState* in the shadow and we wipe shadow during global reset. Then next access to ThreadState happily mmaps a new ThreadState.
But this also means nobody ever enabled flush_memory_ms/memory_limit_mb on Mac as it would have the same effect with the current runtime...

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Oh, I see, Mac stores ThreadState* in the shadow and we wipe shadow during global reset. Then next access to ThreadState happily mmaps a new ThreadState.
But this also means nobody ever enabled flush_memory_ms/memory_limit_mb on Mac as it would have the same effect with the current runtime...

Asked @yln to merge D110236

This revision was landed with ongoing or failed builds.Nov 23 2021, 2:45 AM
This revision was automatically updated to reflect the committed changes.

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Oh, I see, Mac stores ThreadState* in the shadow and we wipe shadow during global reset. Then next access to ThreadState happily mmaps a new ThreadState.
But this also means nobody ever enabled flush_memory_ms/memory_limit_mb on Mac as it would have the same effect with the current runtime...

Asked @yln to merge D110236

I've re-landed this as https://github.com/llvm/llvm-project/commit/ebd47b0fb78fa11758da6ffcd3e6b415cbb8fa28
with MemToMeta instead of MemToShadow for Mac thread state location. It may help with passing testing. Subject to the same issues as MemToShadow, D110236 is a better solution.

flush_memory.cpp failed on one of the bots:

https://green.lab.llvm.org/green//job/clang-stage1-RA/25681/consoleFull#-3188605978254eaf0-7326-4999-85b0-388101f2d404

+ env TSAN_OPTIONS=atexit_sleep_ms=0:abort_on_error=0:ignore_noninstrumented_modules=0:ignore_interceptors_accesses=0:flush_memory_ms=1:flush_symbolizer_ms=1:memory_limit_mb=1 /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/deflake.bash 10 /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/tools/clang/runtime/compiler-rt-bins/test/tsan/X86_64Config/Output/flush_memory.cpp.tmp
+ FileCheck /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/flush_memory.cpp
FileCheck error: '<stdin>' is empty.
FileCheck command line:  FileCheck /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/flush_memory.cpp

But it does not give any glues. It seems that the test crashed with no output at all.
Unclear what OS/arch this is. Is it reproducible on that OS/arch?

Another failure:
http://green.lab.llvm.org/green/job/lldb-cmake/38666/

======================================================================FAIL: test_dsym (TestTsanBasic.TsanBasicTestCase)----------------------------------------------------------------------Traceback (most recent call last): File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1715, in test_method return attrvalue(self) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 149, in wrapper return func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 25, in test self.tsan_tests() File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 131, in tsan_tests substrs=['stopped', 'stop reason = signal SIGABRT']) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2434, in expect self.fail(log_msg)AssertionError: Ran command:"thread list"

It seems that test is written in an unreliable way. Strange it wasn't flaky before.

Another failure:
http://green.lab.llvm.org/green/job/lldb-cmake/38666/

======================================================================FAIL: test_dsym (TestTsanBasic.TsanBasicTestCase)----------------------------------------------------------------------Traceback (most recent call last): File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1715, in test_method return attrvalue(self) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 149, in wrapper return func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 25, in test self.tsan_tests() File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 131, in tsan_tests substrs=['stopped', 'stop reason = signal SIGABRT']) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2434, in expect self.fail(log_msg)AssertionError: Ran command:"thread list"

It seems that test is written in an unreliable way. Strange it wasn't flaky before.

Sent https://reviews.llvm.org/D114444 to fix this.

rtrieu added a subscriber: rtrieu.Nov 23 2021, 6:41 PM

This was causing some issues in tests.
I've reverted this in 1150f02c77b81adca4d0c67afdef23321e93db89
and sent @dvyukov details to reproduce the problem I saw.

dvyukov reopened this revision.Nov 24 2021, 5:44 AM
This revision is now accepted and ready to land.Nov 24 2021, 5:44 AM
This revision was landed with ongoing or failed builds.Nov 25 2021, 9:32 AM
This revision was automatically updated to reflect the committed changes.

This breaks TestTsanBasic.py: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38898/testReport/lldb-api/functionalities_tsan_basic/TestTsanBasic_py/

I suspect the new runtime will require lldb/source/Plugins/InstrumentationRuntime/TSan to be updated.

This breaks TestTsanBasic.py: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38898/testReport/lldb-api/functionalities_tsan_basic/TestTsanBasic_py/

I suspect the new runtime will require lldb/source/Plugins/InstrumentationRuntime/TSan to be updated.

Hi Jonas,

I think the issue is not in the test, but in the tsan runtime.

Somehow the test is in __abort function but the debugger is getting EXC_BAD_INSTRUCTION rather than the expected SIGABRT:

runCmd: continue
output: Process 71919 resuming
Process 71919 stopped
* thread #3, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169
libsystem_c.dylib`:
->  0x7fff7370c8c0 <+169>: ud2    

libsystem_c.dylib`abort_report_np:
    0x7fff7370c8c2 <+0>:   pushq  %rbp
    0x7fff7370c8c3 <+1>:   movq   %rsp, %rbp
    0x7fff7370c8c6 <+4>:   pushq  %r14


runCmd: thread list
output: Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Ran command:
"thread list"

Got output:
Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Expecting sub string: "stopped" (was found)
Expecting sub string: "stop reason = signal SIGABRT" (was not found)
We should be stopped due a SIGABRT

I tried to reproduce this on Linux, but failed. Running a racy program with TSAN_OPTIONS="abort_on_error=1" properly aborts for me and gdb sees SIGABRT.

Mac has some different code paths, I stared at the code for a while, but don't see what happens.
abort() interceptor runs after the tsan thread descriptor was destroyed, which is a tricky context, but somehow it worked before.

(1) is it possible to extend the test to dump full stacks of the threads? In particular I am interested in the thread #3 in __abort.

(2) does the following help?

--- a/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
@@ -1857,7 +1857,6 @@ static void FlushStreams() {
 }

 TSAN_INTERCEPTOR(void, abort, int fake) {
-  SCOPED_TSAN_INTERCEPTOR(abort, fake);
   FlushStreams();
   REAL(abort)(fake);
 }

or the following:

--- a/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
@@ -1856,12 +1856,6 @@ static void FlushStreams() {
   REAL(fflush)(stderr);
 }
 
-TSAN_INTERCEPTOR(void, abort, int fake) {
-  SCOPED_TSAN_INTERCEPTOR(abort, fake);
-  FlushStreams();
-  REAL(abort)(fake);
-}
-
 TSAN_INTERCEPTOR(int, rmdir, char *path) {
   SCOPED_TSAN_INTERCEPTOR(rmdir, path);
   Release(thr, pc, Dir2addr(path));
@@ -2872,7 +2866,6 @@ void InitializeInterceptors() {
   TSAN_INTERCEPT(unlink);
   TSAN_INTERCEPT(tmpfile);
   TSAN_MAYBE_INTERCEPT_TMPFILE64;
-  TSAN_INTERCEPT(abort);
   TSAN_INTERCEPT(rmdir);
   TSAN_INTERCEPT(closedir);

?

(3) or does applying https://reviews.llvm.org/D110236 help?

@yln maybe you can help to debug this on Mac?

This breaks TestTsanBasic.py: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38898/testReport/lldb-api/functionalities_tsan_basic/TestTsanBasic_py/

I suspect the new runtime will require lldb/source/Plugins/InstrumentationRuntime/TSan to be updated.

Hi Jonas,

I think the issue is not in the test, but in the tsan runtime.

Somehow the test is in __abort function but the debugger is getting EXC_BAD_INSTRUCTION rather than the expected SIGABRT:

runCmd: continue
output: Process 71919 resuming
Process 71919 stopped
* thread #3, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169
libsystem_c.dylib`:
->  0x7fff7370c8c0 <+169>: ud2    

libsystem_c.dylib`abort_report_np:
    0x7fff7370c8c2 <+0>:   pushq  %rbp
    0x7fff7370c8c3 <+1>:   movq   %rsp, %rbp
    0x7fff7370c8c6 <+4>:   pushq  %r14


runCmd: thread list
output: Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Ran command:
"thread list"

Got output:
Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Expecting sub string: "stopped" (was found)
Expecting sub string: "stop reason = signal SIGABRT" (was not found)
We should be stopped due a SIGABRT

Hey Dmitry, thanks for looking into this. I tried suggestions (1) and (2) but neither of them got the test to pass. I pinged Julian about (3) and he kindly landed his patch, but that didn't fix the test either.

Here's the backtrace:

* thread #3, stop reason = EXC_BREAKPOINT (code=1, subcode=0x1c06023c0)
  * frame #0: 0x00000001c06023c0 libsystem_c.dylib`__abort + 160
    frame #1: 0x00000001c0602320 libsystem_c.dylib`abort + 176
    frame #2: 0x000000010056c844 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Abort() at sanitizer_posix_libcdep.cpp:143:3 [opt]
    frame #3: 0x000000010056bf8c libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Die() at sanitizer_termination.cpp:58:5 [opt]
    frame #4: 0x00000001005d2230 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ReportIgnoresEnabled(tctx=<unavailable>, set=<unavailable>) at tsan_rtl_thread.cpp:69:3 [opt]
    frame #5: 0x00000001005d1948 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(__tsan::ThreadState*) [inlined] __tsan::ThreadCheckIgnore(thr=0x0000000100550000) at tsan_rtl_thread.cpp:76:5 [opt]
    frame #6: 0x00000001005d193c libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(thr=0x0000000100550000) at tsan_rtl_thread.cpp:214:3 [opt]
    frame #7: 0x0000000100582544 libclang_rt.tsan_osx_dynamic.dylib`__tsan::DestroyThreadState() at tsan_interceptors_posix.cpp:944:3 [opt]
    frame #8: 0x00000001005d96e0 libclang_rt.tsan_osx_dynamic.dylib`__tsan::my_pthread_introspection_hook(event=3, thread=0x000000016ff9f000, addr=0x000000016ff18000, size=557056) at tsan_platform_mac.cpp:235:7 [opt]
    frame #9: 0x00000001c06c6fe0 libsystem_pthread.dylib`_pthread_introspection_hook_callout_thread_terminate + 132
    frame #10: 0x00000001c06c2c68 libsystem_pthread.dylib`_pthread_terminate + 60
    frame #11: 0x00000001c06c2c2c libsystem_pthread.dylib`_pthread_terminate_invoke + 56
    frame #12: 0x00000001c06c5950 libsystem_pthread.dylib`_pthread_exit + 112
    frame #13: 0x00000001c06c524c libsystem_pthread.dylib`_pthread_start + 160

Given that our bot has been red for a while, I'm going to temporarily skip this test.

Hey Dmitry, thanks for looking into this. I tried suggestions (1) and (2) but neither of them got the test to pass. I pinged Julian about (3) and he kindly landed his patch, but that didn't fix the test either.

Here's the backtrace:

* thread #3, stop reason = EXC_BREAKPOINT (code=1, subcode=0x1c06023c0)
  * frame #0: 0x00000001c06023c0 libsystem_c.dylib`__abort + 160
    frame #1: 0x00000001c0602320 libsystem_c.dylib`abort + 176
    frame #2: 0x000000010056c844 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Abort() at sanitizer_posix_libcdep.cpp:143:3 [opt]
    frame #3: 0x000000010056bf8c libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Die() at sanitizer_termination.cpp:58:5 [opt]
    frame #4: 0x00000001005d2230 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ReportIgnoresEnabled(tctx=<unavailable>, set=<unavailable>) at tsan_rtl_thread.cpp:69:3 [opt]
    frame #5: 0x00000001005d1948 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(__tsan::ThreadState*) [inlined] __tsan::ThreadCheckIgnore(thr=0x0000000100550000) at tsan_rtl_thread.cpp:76:5 [opt]
    frame #6: 0x00000001005d193c libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(thr=0x0000000100550000) at tsan_rtl_thread.cpp:214:3 [opt]
    frame #7: 0x0000000100582544 libclang_rt.tsan_osx_dynamic.dylib`__tsan::DestroyThreadState() at tsan_interceptors_posix.cpp:944:3 [opt]
    frame #8: 0x00000001005d96e0 libclang_rt.tsan_osx_dynamic.dylib`__tsan::my_pthread_introspection_hook(event=3, thread=0x000000016ff9f000, addr=0x000000016ff18000, size=557056) at tsan_platform_mac.cpp:235:7 [opt]
    frame #9: 0x00000001c06c6fe0 libsystem_pthread.dylib`_pthread_introspection_hook_callout_thread_terminate + 132
    frame #10: 0x00000001c06c2c68 libsystem_pthread.dylib`_pthread_terminate + 60
    frame #11: 0x00000001c06c2c2c libsystem_pthread.dylib`_pthread_terminate_invoke + 56
    frame #12: 0x00000001c06c5950 libsystem_pthread.dylib`_pthread_exit + 112
    frame #13: 0x00000001c06c524c libsystem_pthread.dylib`_pthread_start + 160

Given that our bot has been red for a while, I'm going to temporarily skip this test.

Humm... this sheds some light, but I am still puzzled...

First, it calls an abort() from libsystem (even if not the one the test wants), so why it's EXC_BREAKPOINT rather than SIGABRT as the test expects...

Then, I don't understand thread termination sequence on Mac. Should my_pthread_introspection_hook be used for pthread_create-created threads? Why all other tsan tests don't fail in the same way on Mac? Does presence of the debugger somehow changes behavior of libsystem_pthread.dylib? What sets thr->ignore_sync/ignore_reads_and_writes that ReportIgnoresEnabled does not like?...
@yln @kubamracek

yln added a comment.Dec 1 2021, 11:40 AM

Should my_pthread_introspection_hook be used for pthread_create-created threads?

Yes, this is intentional, but I am unaware of the underlying reasons and whether it still holds with the new runtime.

Why all other tsan tests don't fail in the same way on Mac? Does presence of the debugger somehow changes behavior of libsystem_pthread.dylib? What sets thr->ignore_sync/ignore_reads_and_writes that ReportIgnoresEnabled does not like?...

I will try to investigate this and report back as soon as I can.

This revision is now accepted and ready to land.Dec 1 2021, 11:39 PM
This revision was automatically updated to reflect the committed changes.

Dmitry, this is still breaking the LLDB bot. Last time I XFAILed the LLDB test so you could continue investigate without having to revert the whole patch. It got reverted anyway for the chrome test, but without removing the XFAIL again, breaking our bots again. Now the patch got landed again, but still without a fix for the macOS/LLDB issue (or XFAILing the test).

I've reverted your change in 396113c19fd1dc3bb9771d04ffb3a0319e018558. Please investigate the issue and keep an eye on the LLDB job when landing this again.

Dmitry, this is still breaking the LLDB bot. Last time I XFAILed the LLDB test so you could continue investigate without having to revert the whole patch. It got reverted anyway for the chrome test, but without removing the XFAIL again, breaking our bots again. Now the patch got landed again, but still without a fix for the macOS/LLDB issue (or XFAILing the test).

I've reverted your change in 396113c19fd1dc3bb9771d04ffb3a0319e018558. Please investigate the issue and keep an eye on the LLDB job when landing this again.

I've got access to a Mac machine and spent several hours trying to reproduce this, but failed.
ninja check-lldb all fails with "this is a non-interactive debug session, cannot get permission to debug processes" (I can only run the system lldb on this machine).
So I built main.c manually and run it under lldb and it halts with the expected "stop reason = signal SIGABRT":

Process 51022 stopped
* thread #3, stop reason = Data race detected
    frame #0: 0x00000001001903f0 libclang_rt.tsan_osx_dynamic.dylib`::__tsan_on_report(rep=0x00000001062029f0) at tsan_rtl_report.cpp:47 [opt]
   44  	#endif
   45  	
   46  	SANITIZER_WEAK_DEFAULT_IMPL
-> 47  	void __tsan_on_report(const ReportDesc *rep) {
   48  	  (void)rep;
   49  	}
   50  	
Target 0: (race) stopped.
(lldb) c
Process 51022 resuming
ThreadSanitizer: reported 1 warnings
Process 51022 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff203d792e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff203d792e <+10>: jae    0x7fff203d7938            ; <+20>
    0x7fff203d7930 <+12>: movq   %rax, %rdi
    0x7fff203d7933 <+15>: jmp    0x7fff203d1ad9            ; cerror_nocancel
    0x7fff203d7938 <+20>: retq   
Target 0: (race) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff203d792e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff204065bd libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x0000000100158649 libclang_rt.tsan_osx_dynamic.dylib`::wrap_pthread_kill(tid=0x00000001000e7e00, sig=6) at tsan_interceptors_posix.cpp:2155:13 [opt]
    frame #3: 0x00007fff2035b406 libsystem_c.dylib`abort + 125
    frame #4: 0x000000010013c6d6 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Abort() at sanitizer_posix_libcdep.cpp:143:3 [opt]
    frame #5: 0x000000010013bfd5 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Die() at sanitizer_termination.cpp:58:5 [opt]
    frame #6: 0x0000000100165dfa libclang_rt.tsan_osx_dynamic.dylib`__tsan::finalize(arg=<unavailable>) at tsan_interceptors_posix.cpp:2747:5 [opt]
    frame #7: 0x00007fff20335d25 libsystem_c.dylib`__cxa_finalize_ranges + 316
    frame #8: 0x00007fff20336010 libsystem_c.dylib`exit + 53
    frame #9: 0x00007fff20421f44 libdyld.dylib`start + 8
    frame #10: 0x00007fff20421f3d libdyld.dylib`start + 1

I also checked if any of existing tsan tests hit this __tsan::ReportIgnoresEnabled, but none of them are (except for 3 tests that test this function).

I will try to disable ReportIgnoresEnabled functionality on Mac, maybe it will help to unbreak the test.

dvyukov reopened this revision.Dec 13 2021, 3:42 AM
This revision is now accepted and ready to land.Dec 13 2021, 3:42 AM
dvyukov updated this revision to Diff 393839.Dec 13 2021, 3:44 AM

rebased + deflaked tests (D115613, D115612) + disabled ReportIgnoresEnabled on Mac

dvyukov updated this revision to Diff 393840.Dec 13 2021, 3:45 AM

Default TSAN_USE_OLD_RUNTIME to ON.
The plan is to submit with TSAN_USE_OLD_RUNTIME=ON
and then flip it separately so that people revert just the flip.

This revision was landed with ongoing or failed builds.Dec 13 2021, 3:48 AM
This revision was automatically updated to reflect the committed changes.
yln added a comment.Dec 14 2021, 4:44 PM

Landing this (even with TSAN_USE_OLD_RUNTIME=ON) did break the tests on the iOS simulator. Fixed here: D115767

Landing this (even with TSAN_USE_OLD_RUNTIME=ON) did break the tests on the iOS simulator. Fixed here: D115767

Thanks for fixing!
I am not sure what's changed wrt symbolization since last submit of this change..