This is an archive of the discontinued LLVM Phabricator instance.

tsan: new runtime (v3)
ClosedPublic

Authored by dvyukov on Oct 27 2021, 2:02 AM.

Details

Summary

This change switches tsan to the new runtime which features:

  • 2x smaller shadow memory (2x of app memory)
  • faster fully vectorized race detection
  • small fixed-size vector clocks (512b)
  • fast vectorized vector clock operations
  • unlimited number of alive threads/goroutimes

Depends on D112602.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
melver added inline comments.Oct 28 2021, 6:33 AM
compiler-rt/lib/tsan/rtl/tsan_flags.inc
61–62

This comment is out-of-date I think. There's no more "SANITIZER_GO ? ...".

64

does "unspecified units" mean "implementation-defined units"?

I.e. only someone who knows the implementation can make the right choice here. I guess it's hard to quantify what 1 unit of this size actually translates to, but "unspecified" is probably not right.

I'd just drop "(in unspecified units)" because it's confusing.

compiler-rt/lib/tsan/rtl/tsan_mman.cpp
255
256
compiler-rt/lib/tsan/rtl/tsan_platform.h
88–89

Will the new runtime work with the larger shadow?
If so, does it make sense to maybe move adjusting shadow to a subsequent patch?

Because I can imagine that shadow-adjustment might itself have a high chance of breaking things.

compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
78–261

static

or somehow group the internal helpers and place them in a "namespace {" (I think most of them are already grouped).

279

static

289

static

294

Why the new nested scope? I don't see a scoped lock, it seems to start with for-loop right away.

301

static

384

static

404

the 'continue' looks redundant:

if (slot->epoch() != kEpochLast) {
  ctx->slot_queue.PushBack(slot);
  break;
}
compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp
245
#else /* !TSAN_VECTORIZE */

instead of

#endif

#if TSAN_VECTORIZE

Could also switch them around so that the condition isn't negated, but up to you what's easier to read.

455

move cond from line after onto this line -- unless you intend to put more things into the "if (!traced)" branch only.

517–518

Will these ever be needed? If not, can it be removed from args?

compiler-rt/lib/tsan/rtl/tsan_shadow.h
78
143

remove?

compiler-rt/lib/tsan/rtl/tsan_sync.h
101

Is there a way to DCHECK() this precondition?

melver added inline comments.Oct 28 2021, 6:35 AM
compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
423
if (slot->epoch() != kEpochLast)
  return slot;
melver added inline comments.Oct 28 2021, 9:03 AM
compiler-rt/lib/tsan/rtl/tsan_rtl.h
588–589

This should go into the previous patch moving this function.

dvyukov updated this revision to Diff 383334.Oct 29 2021, 6:48 AM

fix shadow remapping that did wrong thing wrt huge pages

dvyukov updated this revision to Diff 386245.Nov 10 2021, 10:49 AM
dvyukov marked 18 inline comments as done.

Adderessed most comments, except for moving shadow layout changes to a separate commit
(will look into this tomorrow).

dvyukov added inline comments.Nov 10 2021, 10:51 AM
compiler-rt/lib/tsan/rtl/tsan_flags.inc
61–62

Right. I reduced size of TracePart struct for Go instead.

compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
584

The namespace does not increase indentation level, so removing the namespace does not introduce a large diff on it's own.

compiler-rt/lib/tsan/rtl/tsan_rtl_access.cpp
245

Could also switch them around so that the condition isn't negated, but up to you what's easier to read.

The idea was to place non-vectorized code first as it should be easier to read/understand and it serves as a baseline implementation. And then the vectorized version is a more complex optimization of this baseline implementation.

compiler-rt/lib/tsan/rtl/tsan_sync.h
101

Good question. I've added ctx->resetting flag which is set in DoReset and added CHECK(!ctx->resetting) in SyncVar::Reset.

I've now fixed all comments and all issues I was aware of.
PTAL

compiler-rt/lib/tsan/rtl/tsan_platform.h
88–89

Done.
I moved reduction of shadow into https://reviews.llvm.org/D113751
Start of the shadow still changes in some cases b/c it depends on the app:shadow ration which changed from 4 to 2, so in some cases shadow moved from, say, 0x20000... to 0x10000....

What do you think if I submit just this change and then revert it back 4 hours later? Most likely there will be some feedback from bots and we won't get it until try to submit.

What do you think if I submit just this change and then revert it back 4 hours later? Most likely there will be some feedback from bots and we won't get it until try to submit.

Yeah, that's fine with me. It sucks that LLVM doesn't have something like linux-next, but at this point I think we've exhausted all the options of testing (Chrome, internal testing, etc.).

melver accepted this revision.Nov 12 2021, 4:43 AM
This revision is now accepted and ready to land.Nov 12 2021, 4:43 AM

What do you think if I submit just this change and then revert it back 4 hours later? Most likely there will be some feedback from bots and we won't get it until try to submit.

Yeah, that's fine with me. It sucks that LLVM doesn't have something like linux-next, but at this point I think we've exhausted all the options of testing (Chrome, internal testing, etc.).

I've run check-tsan/check-sanitizers in debug/release configs + Go std lib tests + some of our internal tests with just this commit. Everything passed.
🤞

This revision was automatically updated to reflect the committed changes.
shafik added a subscriber: shafik.Nov 12 2021, 8:36 AM

It looks like this PR broke green dragon see: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38135/consoleFull#-148886289949ba4694-19c4-4d7e-bec5-911270d8a58c

Please let me know if you are looking at fixing this soon.

thakis added a subscriber: thakis.Nov 12 2021, 8:37 AM

This doesn't build on Mac: http://45.33.8.238/mac/38755/step_4.txt

Please take a look and revert for now if it takes a while to fix.

I've committed revert in https://github.com/llvm/llvm-project/commit/79fbba9b7948d217ad34a81de205a9eaf53b05f9
(arc diff does not allow me to send it w/o re-formatting code)

vitalybuka reopened this revision.Nov 12 2021, 10:09 AM
vitalybuka added 1 blocking reviewer(s): vitalybuka.

I assume now it's reopen.
I had no time to look at that. Don't consider this strictly blocking, I just want my dashboard nagging me about that.

@dvyukov can you upload the most recent snapshot here, I guess something was changed after revert?

dvyukov updated this revision to Diff 386891.Nov 12 2021, 10:50 AM

rebase and reopen the revision

@dvyukov can you upload the most recent snapshot here, I guess something was changed after revert?

I've reopened it (did I?) but I don't yet have fixes for any errors reported by bots.

dvyukov updated this revision to Diff 386894.Nov 12 2021, 11:05 AM

fix a bunch of compilation issues uncovered by bots

@dvyukov can you upload the most recent snapshot here, I guess something was changed after revert?

I've reopened it (did I?) but I don't yet have fixes for any errors reported by bots.

The latest push contains a bunch of fixes for issues reported by bots:
https://reviews.llvm.org/D112603?vs=386891&id=386894#toc

There are also some mysterious failures on ppc64/s390x bots where we failed to detect a race in random tests (test is different on each bot run). I don't have any ideas with respect to these failures yet.

vitalybuka removed 1 blocking reviewer(s): vitalybuka.Nov 14 2021, 8:36 PM

I spent some time looking through, but it's hard to to review such huge patch. Incremental approach would be nice.
There are some nits, but nothing important.

compiler-rt/lib/tsan/rtl/tsan_flags.cpp
113

separate patch?

compiler-rt/lib/tsan/rtl/tsan_mman.cpp
273

no DPrintf here?

compiler-rt/lib/tsan/rtl/tsan_platform_linux.cpp
139–140

separate patch?

compiler-rt/lib/tsan/rtl/tsan_report.h
94–96 ↗(On Diff #386894)

is this possible to do in a separate patch? even after the main one?

compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
458

all this block a separate patch?

compiler-rt/lib/tsan/rtl/tsan_rtl.h
566

maybe it should be a SlotLocker::IncrementEpoch ?
and other state that requires Locker

compiler-rt/lib/tsan/rtl/tsan_trace.h
47

Could all changes of this file below be a separate patch?

compiler-rt/test/tsan/memcmp_race.cpp
37

why it's size 3 now? I see only 1 byte overlap.

compiler-rt/test/tsan/memcpy_race.cpp
35–36

same

This revision is now accepted and ready to land.Nov 14 2021, 8:36 PM
yln added a subscriber: yln.Nov 15 2021, 5:09 PM
yln added inline comments.
compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp
115–116

error: use of undeclared identifier 'TraceMemBeg'

dvyukov updated this revision to Diff 387557.Nov 16 2021, 3:43 AM
dvyukov marked 5 inline comments as done.

move parts to separate commits, fix mac build error

dvyukov added inline comments.Nov 16 2021, 3:46 AM
compiler-rt/lib/tsan/rtl/tsan_flags.cpp
113

New runtime changes meaning of this flag. The new runtime does not work with the old flag, the old runtime does not work with the new flag. This change is part of the new runtime.

compiler-rt/lib/tsan/rtl/tsan_platform_linux.cpp
139–140

I don't see how this can be untied from the new runtime. ctx->global_epoch and all changes related to the trace are integral part of the new runtime.

compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp
115–116

Thanks for testing. Fixed. Does it work otherwise? Or you have not get past this build error?

compiler-rt/lib/tsan/rtl/tsan_rtl.h
566

SlotLocker is a RAII utility helper, it looks strange to group logic in this class. It's like making functions that require holding a particular mutex members of that mutex type. What if some function requires holding 2 mutexes at the same time? In some cases we call SlotLock directly w/o using SlotLocker.
IncrementEpoch has a check that the slow is locked, it should provide equal bug detection capability.

I see only 1 other function that has DCHECK(thr->slot_locked), which is MemoryRangeFreed. But it looks even stranger to have it as a method of SlotLocker. All other memory access functions don't need the slot to be locked, so such grouping will be inconsistent.

compiler-rt/test/tsan/memcmp_race.cpp
37

It's not about overlap, it's about the access itself, to make it possible to identify it in the source code. Overlap follows from the address/size of each of the accesses (if interesting).
But for unaligned/range accesses, the size is what we happened to remember in the shadow. It may be strange. But it was strange before, just differently strange.

compiler-rt/test/tsan/memcpy_race.cpp
35–36

See reply above.

yln added inline comments.Nov 16 2021, 9:03 AM
compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp
115–116

Currently the patch doesn't seem to apply cleanly on ToT for me (arc patch fails, offender: compiler-rt/lib/tsan/rtl/tsan_rtl_thread.cpp); I can test again. Ping me after a rebase.

yln added inline comments.Nov 16 2021, 12:13 PM
compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp
115–116

Builds successfully, but at least these tests deadlock on my Intel Mac:

fork_deadlock.cpp.tmp
fork_multithreaded.cpp.tmp

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * #0: 0x00007ff80fd78bc2 libsystem_kernel.dylib
    #1: 0x00007ff80fdb2c72 libsystem_pthread.dylib`sched_yield + 11 at .../libpthread/libpthread-485.40.4/src/pthread.c:2182
    #2: 0x000000010148fa69 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::FutexWait(p=<unavailable>, cmp=<unavailable>) + 9 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mac.cpp:514
    #3: 0x0000000101491eda libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Semaphore::Wait(this=0x0000000104927fc8) + 58 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mutex.cpp:35
    #4: 0x000000010148ed17 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Mutex::Lock(this=0x0000000104927fc0) + 183 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_mutex.h:194
    #5: 0x00000001014d2aa5 libclang_rt.tsan_osx_dynamic.dylib`__tsan::SlotLock(thr=0x000000010493e140) + 37 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.cpp:338
    #6: 0x00000001014cd4b5 libclang_rt.tsan_osx_dynamic.dylib`__tsan::OnUserFree(__tsan::ThreadState*, unsigned long, unsigned long, bool) [inlined] __tsan::SlotLocker::SlotLocker(thr=0x000000010493e140, recursive=false) + 53 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:628
    #7: 0x00000001014cd4ad libclang_rt.tsan_osx_dynamic.dylib`__tsan::OnUserFree(__tsan::ThreadState*, unsigned long, unsigned long, bool) [inlined] __tsan::SlotLocker::SlotLocker(thr=0x000000010493e140, recursive=false) at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_rtl.h:626
    #8: 0x00000001014cd4ad libclang_rt.tsan_osx_dynamic.dylib`__tsan::OnUserFree(thr=0x000000010493e140, pc=4316772291, p=135308649693312, write=<unavailable>) + 45 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_mman.cpp:245
    #9: 0x00000001014cd443 libclang_rt.tsan_osx_dynamic.dylib`__tsan::user_free(thr=0x000000010493e140, pc=4316772291, p=0x00007b1000000080, signal=<unavailable>) + 67 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_mman.cpp:187
    #10: 0x00000001014cb7d3 libclang_rt.tsan_osx_dynamic.dylib`::wrap_free(ptr=0x00007b1000000080) + 99 at .../llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_malloc_mac.inc:144
    #11: 0x00007ff80faf20d5 libxpc.dylib`-[OS_xpc_object dealloc](self=0x00007b1000000080, _cmd=<unavailable>) + 47 at .../libxpc/libxpc-2235.40.11/src/object.m:27
    #12: 0x00007ff80fb0cd42 libxpc.dylib`xpc_atfork_child + 125 at .../libxpc/libxpc-2235.40.11/src/init.c:769
    #13: 0x00007ff81abcac95 libSystem.B.dylib`libSystem_atfork_child(flags=0) + 63 at .../Libsystem/Libsystem-1311/init.c:508
    #14: 0x00007ff80fcaa6c4 libsystem_c.dylib`fork [inlined] _do_fork(libsystem_atfork_handlers_only=<unavailable>) + 84 at .../Libc/Libc-1506.40.4/sys/fork.c:0
    #15: 0x00007ff80fcaa674 libsystem_c.dylib`fork + 4 at .../Libc/Libc-1506.40.4/sys/fork.c:105
    #16: 0x00000001014b0a90 libclang_rt.tsan_osx_dynamic.dylib`::wrap_fork(fake=76800344) + 64 at .../llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:2171
    #17: 0x0000000100f3eca5 fork_deadlock.cpp.tmp`main + 85 at .../llvm-project/compiler-rt/test/tsan/fork_deadlock.cpp:20
    #18: 0x0000000106ce04fe dyld`start(kernArgs=<unavailable>) + 462 at .../dyld/dyld-940/dyld/dyldMain.cpp:879
dvyukov updated this revision to Diff 388509.Nov 19 2021, 8:03 AM
dvyukov marked an inline comment as done.

fix Mac fork deadlock

yln added a comment.Nov 19 2021, 9:47 AM

fix Mac fork deadlock

Will test, please rebase (patch does not apply cleanly).

fix Mac fork deadlock

Will test, please rebase (patch does not apply cleanly).

It was freshly rebased onto main:

$ git log --oneline
17871bb55d11 tsan: new runtime (v3)
d1b89d9c5743 tsan: add another fork test
6774cc33f75f (origin/main, main) [NFC][llvm] Inclusive language: remove instance of master in IntrinsicsNVVM.td

compiler-rt/lib/tsan/rtl/tsan_platform_posix.cpp
115–116

I rebased these series. Ping.

115–116

Builds successfully, but at least these tests deadlock on my Intel Mac:

Thanks for testing.

I've added a test that reproduces this deadlock on Linux:
https://reviews.llvm.org/D114250

And folded a fix into this change:
https://reviews.llvm.org/D112603?vs=387678&id=388509#toc
(the ForkBefore/After part of the diff)

I would appreciate another test. Thanks.

It was freshly rebased onto main:

$ git log --oneline
17871bb55d11 tsan: new runtime (v3)
d1b89d9c5743 tsan: add another fork test
6774cc33f75f (origin/main, main) [NFC][llvm] Inclusive language: remove instance of master in IntrinsicsNVVM.td

Just in case, I also pushed the tree here:
https://github.com/dvyukov/llvm-project/commits/tsanv3

yln added a comment.Nov 19 2021, 11:35 AM

The following 2 tests spinlock:

ThreadSanitizer-x86_64 :: Darwin/debug_external.cpp
ThreadSanitizer-x86_64 :: Darwin/external-swift-debugging.cpp

And a few unit tests fail:

ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicRwMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicSpinMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ReportRace
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.SimpleWrite
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ThreadSync

I hope that I can contribute some cycles to help investigate the issues with the unit tests after the break.

The following 2 tests spinlock:

ThreadSanitizer-x86_64 :: Darwin/debug_external.cpp
ThreadSanitizer-x86_64 :: Darwin/external-swift-debugging.cpp

Mailed https://reviews.llvm.org/D114348 to fix this.

And a few unit tests fail:

ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicRwMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.BasicSpinMutex
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ReportRace
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.SimpleWrite
ThreadSanitizer-Unit :: rtl/./TsanRtlTest-x86_64-Test/ThreadSanitizer.ThreadSync

I hope that I can contribute some cycles to help investigate the issues with the unit tests after the break.

This one is not immediately obvious to me.
The InitializeShadowMemory/dlsym part of the stack was changed in already landed https://github.com/llvm/llvm-project/commit/20ee72d4ccb17
And even before that commit we called dlsym from __tsan::Initialize.
Is it passing on the current HEAD?
I also noticed that TsanRtlTest is built with stale runtime sometimes and I need to run 'ninja check-tsan' 3 times before it gets built with the current runtime. Maybe that's what happened?

There are some bugs related to big-endian arches. ppc64 and s390x bots showed some failures, both are big-endian.
There are a number of sloppy bit manipulations in FastState/Shadow types. Trying to figure out exact bugs and testing story.

I've fixed a number of endian-ness bugs in tsan_shadow.h and added a unit test for shadow methods.
Also fixed a bug in MemoryRangeSet, which was uncovered by test/tsan/ignored-interceptors-mmap.cpp on powerpc (due to different page size) and extended test/tsan/ignored-interceptors-mmap.cpp to detect such bugs on x86_64 as well (D114366).

I am going to do another presubmit since all known bugs were fixed.

This revision was landed with ongoing or failed builds.Nov 22 2021, 6:55 AM
This revision was automatically updated to reflect the committed changes.

Hi, it seems that this is causing failures on ppc64le-rhel, https://lab.llvm.org/buildbot/#/builders/57/builds/12365.

dvyukov reopened this revision.Nov 22 2021, 10:22 AM
This revision is now accepted and ready to land.Nov 22 2021, 10:22 AM
dvyukov updated this revision to Diff 388964.Nov 22 2021, 10:22 AM

rebase and reopen

Fixed 2 easy bugs pointed by buildbots:

https://lab.llvm.org/buildbot#builders/57/builds/12365

/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/runtimes/../llvm/utils/unittest/googletest/include/gtest/gtest.h:1527:11: error: comparison of integers of different signs: 'const unsigned long' and 'const int' [-Werror,-Wsign-compare]
  if (lhs == rhs) {
      ~~~ ^  ~~~
/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/runtimes/../llvm/utils/unittest/googletest/include/gtest/gtest.h:1554:12: note: in instantiation of function template specialization 'testing::internal::CmpHelperEQ<unsigned long, int>' requested here
    return CmpHelperEQ(lhs_expression, rhs_expression, lhs, rhs);
           ^
/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/compiler-rt/lib/tsan/tests/unit/tsan_flags_test.cpp:125:3: note: in instantiation of function template specialization 'testing::internal::EqHelper::Compare<unsigned long, int, nullptr>' requested here
  EXPECT_EQ(f->history_size, 6);

and:
https://lab.llvm.org/buildbot/#/builders/197/builds/588

XPASS: ThreadSanitizer-aarch64:: bench_threads.cpp

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

222: SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) deadlock_detector_stress_test.cpp:96 in PthreadRWLock::rdlock() 
           230: ==6031==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x000109ec7aef bp 0x7e8000103e10 sp 0x7e8000103e10 T7632574) 
           231: ==6031==The signal is caused by a READ memory access. 
           232: ==6031==Hint: address points to the zero page. 
           233:  #0 __tsan::TraceSwitchPart(__tsan::ThreadState*) tsan_rtl.cpp:854 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x68aef) 
           234:  #1 __tsan::TraceRestartFuncExit(__tsan::ThreadState*) tsan_rtl_access.cpp:101 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x6b090) 
           235:  #2 PthreadRWLock::rdlock() deadlock_detector_stress_test.cpp:96 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x100005006) 
           236:  #3 LockTest::RL(unsigned long) deadlock_detector_stress_test.cpp:131 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x100004e70) 
           237:  #4 LockTest::Test12_Thread() deadlock_detector_stress_test.cpp:376 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x1000054ff) 
           238:  #5 LockTest::Thread(void*) deadlock_detector_stress_test.cpp:578 (deadlock_detector_stress_test.cpp.tmp:x86_64+0x1000049d5) 
           239:  #6 __tsan_thread_start_func tsan_interceptors_posix.cpp:993 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x2e9ea) 

/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/fork_deadlock.cpp:37:11: error: CHECK: expected string not found in input
            1: .ThreadSanitizer:DEADLYSIGNAL 
            2: ==6224==ERROR: ThreadSanitizer: SEGV on unknown address 0x000000012b88 (pc 0x00010b2da9c5 bp 0x7e8000103eb0 sp 0x7e8000103e90 T7633409) 
            3: ==6224==The signal is caused by a READ memory access. 
            4: . #0 __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator32<__sanitizer::AP32>, __sanitizer::LargeMmapAllocatorPtrArrayStatic>::Allocate(__sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__sanitizer::AP32> >*, unsigned long, unsigned long) sanitizer_allocator_combined.h:69 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x29c5) 
check:37'0                 5:  #1 __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__sanitizer::AP32> >*, unsigned long) sanitizer_allocator.cpp:148 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x239b) 
check:37'0                 6:  #2 __tsan::VectorClock::ReleaseAcquire(__tsan::VectorClock**) tsan_vector_clock.cpp:107 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x80ff3) 
            7:  #3 __tsan_atomic32_fetch_add tsan_interface_atomic.cpp:590 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x51838) 
            
            

/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/stress.cpp:77:11: error: CHECK: expected string not found in input
// CHECK: DONE
          ^
<stdin>:1:1: note: scanning from here
            1: ThreadSanitizerThreadSanitizer:DEADLYSIGNAL 
            2: :DEADLYSIGNAL 
            3: ==8027==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x0001087d4aef bp 0x7e8000727ec0 sp 0x7e8000727ec0 T7644989) 
            4: ==8027==The signal is caused by a READ memory access. 
            5: ==8027==Hint: address points to the zero page. 
            6: ThreadSanitizer:DEADLYSIGNAL 


/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/flush_memory.cpp:35:11: error: CHECK: expected string not found in input
// CHECK: WARNING: ThreadSanitizer: data race
          ^
<stdin>:1:1: note: scanning from here
ThreadSanitizer:DEADLYSIGNAL
<stdin>:2:7: note: possible intended match here
==6211==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000d0 (pc 0x000106a98aef bp 0x7e8000103ef0 sp 0x7e8000103ef0 T7633393)

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Oh, I see, Mac stores ThreadState* in the shadow and we wipe shadow during global reset. Then next access to ThreadState happily mmaps a new ThreadState.
But this also means nobody ever enabled flush_memory_ms/memory_limit_mb on Mac as it would have the same effect with the current runtime...

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Oh, I see, Mac stores ThreadState* in the shadow and we wipe shadow during global reset. Then next access to ThreadState happily mmaps a new ThreadState.
But this also means nobody ever enabled flush_memory_ms/memory_limit_mb on Mac as it would have the same effect with the current runtime...

Asked @yln to merge D110236

This revision was landed with ongoing or failed builds.Nov 23 2021, 2:45 AM
This revision was automatically updated to reflect the committed changes.

Harder one is some NULL derefs on a Mac buildbot (not fixed yet):
http://green.lab.llvm.org/green/job/clang-stage1-RA/25658/

There is something strange going on.
3 tests crashes on access to 0x0000000000d0, that's &thr->tctx.trace.parts.size which means thr->tctx is NULL.
1 test crashed on access to 0x000000012b88, that's an offset into internal allocator cache in Processor, which means thr->proc1 is NULL.
So it looks like ThreadState is being randomly wiped with 0's...

Oh, I see, Mac stores ThreadState* in the shadow and we wipe shadow during global reset. Then next access to ThreadState happily mmaps a new ThreadState.
But this also means nobody ever enabled flush_memory_ms/memory_limit_mb on Mac as it would have the same effect with the current runtime...

Asked @yln to merge D110236

I've re-landed this as https://github.com/llvm/llvm-project/commit/ebd47b0fb78fa11758da6ffcd3e6b415cbb8fa28
with MemToMeta instead of MemToShadow for Mac thread state location. It may help with passing testing. Subject to the same issues as MemToShadow, D110236 is a better solution.

flush_memory.cpp failed on one of the bots:

https://green.lab.llvm.org/green//job/clang-stage1-RA/25681/consoleFull#-3188605978254eaf0-7326-4999-85b0-388101f2d404

+ env TSAN_OPTIONS=atexit_sleep_ms=0:abort_on_error=0:ignore_noninstrumented_modules=0:ignore_interceptors_accesses=0:flush_memory_ms=1:flush_symbolizer_ms=1:memory_limit_mb=1 /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/deflake.bash 10 /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/tools/clang/runtime/compiler-rt-bins/test/tsan/X86_64Config/Output/flush_memory.cpp.tmp
+ FileCheck /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/flush_memory.cpp
FileCheck error: '<stdin>' is empty.
FileCheck command line:  FileCheck /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/flush_memory.cpp

But it does not give any glues. It seems that the test crashed with no output at all.
Unclear what OS/arch this is. Is it reproducible on that OS/arch?

Another failure:
http://green.lab.llvm.org/green/job/lldb-cmake/38666/

======================================================================FAIL: test_dsym (TestTsanBasic.TsanBasicTestCase)----------------------------------------------------------------------Traceback (most recent call last): File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1715, in test_method return attrvalue(self) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 149, in wrapper return func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 25, in test self.tsan_tests() File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 131, in tsan_tests substrs=['stopped', 'stop reason = signal SIGABRT']) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2434, in expect self.fail(log_msg)AssertionError: Ran command:"thread list"

It seems that test is written in an unreliable way. Strange it wasn't flaky before.

Another failure:
http://green.lab.llvm.org/green/job/lldb-cmake/38666/

======================================================================FAIL: test_dsym (TestTsanBasic.TsanBasicTestCase)----------------------------------------------------------------------Traceback (most recent call last): File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1715, in test_method return attrvalue(self) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 119, in wrapper func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 149, in wrapper return func(*args, **kwargs) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 25, in test self.tsan_tests() File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/test/API/functionalities/tsan/basic/TestTsanBasic.py", line 131, in tsan_tests substrs=['stopped', 'stop reason = signal SIGABRT']) File "/Users/buildslave/jenkins/workspace/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2434, in expect self.fail(log_msg)AssertionError: Ran command:"thread list"

It seems that test is written in an unreliable way. Strange it wasn't flaky before.

Sent https://reviews.llvm.org/D114444 to fix this.

rtrieu added a subscriber: rtrieu.Nov 23 2021, 6:41 PM

This was causing some issues in tests.
I've reverted this in 1150f02c77b81adca4d0c67afdef23321e93db89
and sent @dvyukov details to reproduce the problem I saw.

dvyukov reopened this revision.Nov 24 2021, 5:44 AM
This revision is now accepted and ready to land.Nov 24 2021, 5:44 AM
This revision was landed with ongoing or failed builds.Nov 25 2021, 9:32 AM
This revision was automatically updated to reflect the committed changes.

This breaks TestTsanBasic.py: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38898/testReport/lldb-api/functionalities_tsan_basic/TestTsanBasic_py/

I suspect the new runtime will require lldb/source/Plugins/InstrumentationRuntime/TSan to be updated.

This breaks TestTsanBasic.py: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38898/testReport/lldb-api/functionalities_tsan_basic/TestTsanBasic_py/

I suspect the new runtime will require lldb/source/Plugins/InstrumentationRuntime/TSan to be updated.

Hi Jonas,

I think the issue is not in the test, but in the tsan runtime.

Somehow the test is in __abort function but the debugger is getting EXC_BAD_INSTRUCTION rather than the expected SIGABRT:

runCmd: continue
output: Process 71919 resuming
Process 71919 stopped
* thread #3, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169
libsystem_c.dylib`:
->  0x7fff7370c8c0 <+169>: ud2    

libsystem_c.dylib`abort_report_np:
    0x7fff7370c8c2 <+0>:   pushq  %rbp
    0x7fff7370c8c3 <+1>:   movq   %rsp, %rbp
    0x7fff7370c8c6 <+4>:   pushq  %r14


runCmd: thread list
output: Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Ran command:
"thread list"

Got output:
Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Expecting sub string: "stopped" (was found)
Expecting sub string: "stop reason = signal SIGABRT" (was not found)
We should be stopped due a SIGABRT

I tried to reproduce this on Linux, but failed. Running a racy program with TSAN_OPTIONS="abort_on_error=1" properly aborts for me and gdb sees SIGABRT.

Mac has some different code paths, I stared at the code for a while, but don't see what happens.
abort() interceptor runs after the tsan thread descriptor was destroyed, which is a tricky context, but somehow it worked before.

(1) is it possible to extend the test to dump full stacks of the threads? In particular I am interested in the thread #3 in __abort.

(2) does the following help?

--- a/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
@@ -1857,7 +1857,6 @@ static void FlushStreams() {
 }

 TSAN_INTERCEPTOR(void, abort, int fake) {
-  SCOPED_TSAN_INTERCEPTOR(abort, fake);
   FlushStreams();
   REAL(abort)(fake);
 }

or the following:

--- a/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
+++ b/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp
@@ -1856,12 +1856,6 @@ static void FlushStreams() {
   REAL(fflush)(stderr);
 }
 
-TSAN_INTERCEPTOR(void, abort, int fake) {
-  SCOPED_TSAN_INTERCEPTOR(abort, fake);
-  FlushStreams();
-  REAL(abort)(fake);
-}
-
 TSAN_INTERCEPTOR(int, rmdir, char *path) {
   SCOPED_TSAN_INTERCEPTOR(rmdir, path);
   Release(thr, pc, Dir2addr(path));
@@ -2872,7 +2866,6 @@ void InitializeInterceptors() {
   TSAN_INTERCEPT(unlink);
   TSAN_INTERCEPT(tmpfile);
   TSAN_MAYBE_INTERCEPT_TMPFILE64;
-  TSAN_INTERCEPT(abort);
   TSAN_INTERCEPT(rmdir);
   TSAN_INTERCEPT(closedir);

?

(3) or does applying https://reviews.llvm.org/D110236 help?

@yln maybe you can help to debug this on Mac?

This breaks TestTsanBasic.py: https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/38898/testReport/lldb-api/functionalities_tsan_basic/TestTsanBasic_py/

I suspect the new runtime will require lldb/source/Plugins/InstrumentationRuntime/TSan to be updated.

Hi Jonas,

I think the issue is not in the test, but in the tsan runtime.

Somehow the test is in __abort function but the debugger is getting EXC_BAD_INSTRUCTION rather than the expected SIGABRT:

runCmd: continue
output: Process 71919 resuming
Process 71919 stopped
* thread #3, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169
libsystem_c.dylib`:
->  0x7fff7370c8c0 <+169>: ud2    

libsystem_c.dylib`abort_report_np:
    0x7fff7370c8c2 <+0>:   pushq  %rbp
    0x7fff7370c8c3 <+1>:   movq   %rsp, %rbp
    0x7fff7370c8c6 <+4>:   pushq  %r14


runCmd: thread list
output: Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Ran command:
"thread list"

Got output:
Process 71919 stopped
  thread #1: tid = 0x3a3ceb, 0x00007fff7378055e libsystem_kernel.dylib`__ulock_wait + 10, queue = 'com.apple.main-thread'
  thread #2: tid = 0x3a3d7d, 0x00007fff73781756 libsystem_kernel.dylib`__semwait_signal + 10
* thread #3: tid = 0x3a3d7f, 0x00007fff7370c8c0 libsystem_c.dylib`__abort + 169, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)

Expecting sub string: "stopped" (was found)
Expecting sub string: "stop reason = signal SIGABRT" (was not found)
We should be stopped due a SIGABRT

Hey Dmitry, thanks for looking into this. I tried suggestions (1) and (2) but neither of them got the test to pass. I pinged Julian about (3) and he kindly landed his patch, but that didn't fix the test either.

Here's the backtrace:

* thread #3, stop reason = EXC_BREAKPOINT (code=1, subcode=0x1c06023c0)
  * frame #0: 0x00000001c06023c0 libsystem_c.dylib`__abort + 160
    frame #1: 0x00000001c0602320 libsystem_c.dylib`abort + 176
    frame #2: 0x000000010056c844 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Abort() at sanitizer_posix_libcdep.cpp:143:3 [opt]
    frame #3: 0x000000010056bf8c libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Die() at sanitizer_termination.cpp:58:5 [opt]
    frame #4: 0x00000001005d2230 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ReportIgnoresEnabled(tctx=<unavailable>, set=<unavailable>) at tsan_rtl_thread.cpp:69:3 [opt]
    frame #5: 0x00000001005d1948 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(__tsan::ThreadState*) [inlined] __tsan::ThreadCheckIgnore(thr=0x0000000100550000) at tsan_rtl_thread.cpp:76:5 [opt]
    frame #6: 0x00000001005d193c libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(thr=0x0000000100550000) at tsan_rtl_thread.cpp:214:3 [opt]
    frame #7: 0x0000000100582544 libclang_rt.tsan_osx_dynamic.dylib`__tsan::DestroyThreadState() at tsan_interceptors_posix.cpp:944:3 [opt]
    frame #8: 0x00000001005d96e0 libclang_rt.tsan_osx_dynamic.dylib`__tsan::my_pthread_introspection_hook(event=3, thread=0x000000016ff9f000, addr=0x000000016ff18000, size=557056) at tsan_platform_mac.cpp:235:7 [opt]
    frame #9: 0x00000001c06c6fe0 libsystem_pthread.dylib`_pthread_introspection_hook_callout_thread_terminate + 132
    frame #10: 0x00000001c06c2c68 libsystem_pthread.dylib`_pthread_terminate + 60
    frame #11: 0x00000001c06c2c2c libsystem_pthread.dylib`_pthread_terminate_invoke + 56
    frame #12: 0x00000001c06c5950 libsystem_pthread.dylib`_pthread_exit + 112
    frame #13: 0x00000001c06c524c libsystem_pthread.dylib`_pthread_start + 160

Given that our bot has been red for a while, I'm going to temporarily skip this test.

Hey Dmitry, thanks for looking into this. I tried suggestions (1) and (2) but neither of them got the test to pass. I pinged Julian about (3) and he kindly landed his patch, but that didn't fix the test either.

Here's the backtrace:

* thread #3, stop reason = EXC_BREAKPOINT (code=1, subcode=0x1c06023c0)
  * frame #0: 0x00000001c06023c0 libsystem_c.dylib`__abort + 160
    frame #1: 0x00000001c0602320 libsystem_c.dylib`abort + 176
    frame #2: 0x000000010056c844 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Abort() at sanitizer_posix_libcdep.cpp:143:3 [opt]
    frame #3: 0x000000010056bf8c libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Die() at sanitizer_termination.cpp:58:5 [opt]
    frame #4: 0x00000001005d2230 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ReportIgnoresEnabled(tctx=<unavailable>, set=<unavailable>) at tsan_rtl_thread.cpp:69:3 [opt]
    frame #5: 0x00000001005d1948 libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(__tsan::ThreadState*) [inlined] __tsan::ThreadCheckIgnore(thr=0x0000000100550000) at tsan_rtl_thread.cpp:76:5 [opt]
    frame #6: 0x00000001005d193c libclang_rt.tsan_osx_dynamic.dylib`__tsan::ThreadFinish(thr=0x0000000100550000) at tsan_rtl_thread.cpp:214:3 [opt]
    frame #7: 0x0000000100582544 libclang_rt.tsan_osx_dynamic.dylib`__tsan::DestroyThreadState() at tsan_interceptors_posix.cpp:944:3 [opt]
    frame #8: 0x00000001005d96e0 libclang_rt.tsan_osx_dynamic.dylib`__tsan::my_pthread_introspection_hook(event=3, thread=0x000000016ff9f000, addr=0x000000016ff18000, size=557056) at tsan_platform_mac.cpp:235:7 [opt]
    frame #9: 0x00000001c06c6fe0 libsystem_pthread.dylib`_pthread_introspection_hook_callout_thread_terminate + 132
    frame #10: 0x00000001c06c2c68 libsystem_pthread.dylib`_pthread_terminate + 60
    frame #11: 0x00000001c06c2c2c libsystem_pthread.dylib`_pthread_terminate_invoke + 56
    frame #12: 0x00000001c06c5950 libsystem_pthread.dylib`_pthread_exit + 112
    frame #13: 0x00000001c06c524c libsystem_pthread.dylib`_pthread_start + 160

Given that our bot has been red for a while, I'm going to temporarily skip this test.

Humm... this sheds some light, but I am still puzzled...

First, it calls an abort() from libsystem (even if not the one the test wants), so why it's EXC_BREAKPOINT rather than SIGABRT as the test expects...

Then, I don't understand thread termination sequence on Mac. Should my_pthread_introspection_hook be used for pthread_create-created threads? Why all other tsan tests don't fail in the same way on Mac? Does presence of the debugger somehow changes behavior of libsystem_pthread.dylib? What sets thr->ignore_sync/ignore_reads_and_writes that ReportIgnoresEnabled does not like?...
@yln @kubamracek

yln added a comment.Dec 1 2021, 11:40 AM

Should my_pthread_introspection_hook be used for pthread_create-created threads?

Yes, this is intentional, but I am unaware of the underlying reasons and whether it still holds with the new runtime.

Why all other tsan tests don't fail in the same way on Mac? Does presence of the debugger somehow changes behavior of libsystem_pthread.dylib? What sets thr->ignore_sync/ignore_reads_and_writes that ReportIgnoresEnabled does not like?...

I will try to investigate this and report back as soon as I can.

This revision is now accepted and ready to land.Dec 1 2021, 11:39 PM
This revision was automatically updated to reflect the committed changes.

Dmitry, this is still breaking the LLDB bot. Last time I XFAILed the LLDB test so you could continue investigate without having to revert the whole patch. It got reverted anyway for the chrome test, but without removing the XFAIL again, breaking our bots again. Now the patch got landed again, but still without a fix for the macOS/LLDB issue (or XFAILing the test).

I've reverted your change in 396113c19fd1dc3bb9771d04ffb3a0319e018558. Please investigate the issue and keep an eye on the LLDB job when landing this again.

Dmitry, this is still breaking the LLDB bot. Last time I XFAILed the LLDB test so you could continue investigate without having to revert the whole patch. It got reverted anyway for the chrome test, but without removing the XFAIL again, breaking our bots again. Now the patch got landed again, but still without a fix for the macOS/LLDB issue (or XFAILing the test).

I've reverted your change in 396113c19fd1dc3bb9771d04ffb3a0319e018558. Please investigate the issue and keep an eye on the LLDB job when landing this again.

I've got access to a Mac machine and spent several hours trying to reproduce this, but failed.
ninja check-lldb all fails with "this is a non-interactive debug session, cannot get permission to debug processes" (I can only run the system lldb on this machine).
So I built main.c manually and run it under lldb and it halts with the expected "stop reason = signal SIGABRT":

Process 51022 stopped
* thread #3, stop reason = Data race detected
    frame #0: 0x00000001001903f0 libclang_rt.tsan_osx_dynamic.dylib`::__tsan_on_report(rep=0x00000001062029f0) at tsan_rtl_report.cpp:47 [opt]
   44  	#endif
   45  	
   46  	SANITIZER_WEAK_DEFAULT_IMPL
-> 47  	void __tsan_on_report(const ReportDesc *rep) {
   48  	  (void)rep;
   49  	}
   50  	
Target 0: (race) stopped.
(lldb) c
Process 51022 resuming
ThreadSanitizer: reported 1 warnings
Process 51022 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff203d792e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff203d792e <+10>: jae    0x7fff203d7938            ; <+20>
    0x7fff203d7930 <+12>: movq   %rax, %rdi
    0x7fff203d7933 <+15>: jmp    0x7fff203d1ad9            ; cerror_nocancel
    0x7fff203d7938 <+20>: retq   
Target 0: (race) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff203d792e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff204065bd libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x0000000100158649 libclang_rt.tsan_osx_dynamic.dylib`::wrap_pthread_kill(tid=0x00000001000e7e00, sig=6) at tsan_interceptors_posix.cpp:2155:13 [opt]
    frame #3: 0x00007fff2035b406 libsystem_c.dylib`abort + 125
    frame #4: 0x000000010013c6d6 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Abort() at sanitizer_posix_libcdep.cpp:143:3 [opt]
    frame #5: 0x000000010013bfd5 libclang_rt.tsan_osx_dynamic.dylib`__sanitizer::Die() at sanitizer_termination.cpp:58:5 [opt]
    frame #6: 0x0000000100165dfa libclang_rt.tsan_osx_dynamic.dylib`__tsan::finalize(arg=<unavailable>) at tsan_interceptors_posix.cpp:2747:5 [opt]
    frame #7: 0x00007fff20335d25 libsystem_c.dylib`__cxa_finalize_ranges + 316
    frame #8: 0x00007fff20336010 libsystem_c.dylib`exit + 53
    frame #9: 0x00007fff20421f44 libdyld.dylib`start + 8
    frame #10: 0x00007fff20421f3d libdyld.dylib`start + 1

I also checked if any of existing tsan tests hit this __tsan::ReportIgnoresEnabled, but none of them are (except for 3 tests that test this function).

I will try to disable ReportIgnoresEnabled functionality on Mac, maybe it will help to unbreak the test.

dvyukov reopened this revision.Dec 13 2021, 3:42 AM
This revision is now accepted and ready to land.Dec 13 2021, 3:42 AM
dvyukov updated this revision to Diff 393839.Dec 13 2021, 3:44 AM

rebased + deflaked tests (D115613, D115612) + disabled ReportIgnoresEnabled on Mac

dvyukov updated this revision to Diff 393840.Dec 13 2021, 3:45 AM

Default TSAN_USE_OLD_RUNTIME to ON.
The plan is to submit with TSAN_USE_OLD_RUNTIME=ON
and then flip it separately so that people revert just the flip.

This revision was landed with ongoing or failed builds.Dec 13 2021, 3:48 AM
This revision was automatically updated to reflect the committed changes.
yln added a comment.Dec 14 2021, 4:44 PM

Landing this (even with TSAN_USE_OLD_RUNTIME=ON) did break the tests on the iOS simulator. Fixed here: D115767

Landing this (even with TSAN_USE_OLD_RUNTIME=ON) did break the tests on the iOS simulator. Fixed here: D115767

Thanks for fixing!
I am not sure what's changed wrt symbolization since last submit of this change..

compiler-rt/lib/tsan/rtl/tsan_trace.h