This is an archive of the discontinued LLVM Phabricator instance.

tsan: fix false positives in AcquireGlobal
ClosedPublic

Authored by dvyukov on May 23 2020, 8:18 AM.

Details

Summary

Add ThreadClock::global_acquire_, which records the last time another thread
did a global acquire of this thread's clock.

It helps to avoid the problem described in:
https://github.com/golang/go/issues/39186
See test/tsan/java_finalizer2.cpp for a regression test.
Note the failure is _extremely_ hard to hit, so if you are trying
to reproduce it, you may want to run something like:
$ go get golang.org/x/tools/cmd/stress
$ stress -p=64 ./a.out

The crux of the problem is roughly as follows.
A number of O(1) optimizations in the clocks algorithm assume proper
transitive cumulative propagation of clock values. The AcquireGlobal
operation may produce an inconsistent, non-linearizable view of
thread clocks. Namely, it may acquire a later value from a thread
with a higher ID, but fail to acquire an earlier value from a thread
with a lower ID. If a thread that executed AcquireGlobal then releases
to a sync clock, it will spoil the sync clock with the inconsistent
values. If another thread later releases to the sync clock, the optimized
algorithm may break.

The exact sequence of events that leads to the failure:

  • thread 1 executes AcquireGlobal
  • thread 1 acquires value 1 for thread 2
  • thread 2 increments clock to 2
  • thread 2 releases to sync object 1
  • thread 3 is at time 1
  • thread 3 acquires from sync object 1
  • thread 1 acquires value 1 for thread 3
  • thread 1 releases to sync object 2
  • sync object 2 clock has 1 for thread 2 and 1 for thread 3
  • thread 3 releases to sync object 2
  • thread 3 sees value 1 in the clock for itself and concludes that it has already released to this clock and has not acquired anything from other threads since then (the last_acquire_ check in the release operation)
  • thread 3 does not update the value for thread 2 in the clock from 1 to 2
  • thread 4 acquires from sync object 2
  • thread 4 detects a false race with thread 2 as it should have been synchronized with thread 2 up to time 2, but because of the broken clock it is now synchronized only up to time 1

The global_acquire_ value helps to prevent this scenario.
Namely, thread 3 will not trust any of its own clock values up to global_acquire_
for the purposes of the last_acquire_ optimization.
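
For illustration, a minimal sketch of how the release fast path can honor this (names and structure are simplified and hypothetical; the real check lives in tsan_clock.cpp and involves more state such as reuse counts and dirty entries):

    typedef unsigned long long u64;  // tsan's u64, for a self-contained sketch

    // Hypothetical, simplified model of the relevant ThreadClock fields.
    struct ThreadClockSketch {
      u64 last_acquire_;    // last time this thread acquired from a sync clock
      u64 global_acquire_;  // last time another thread ran AcquireGlobal on us
      // The "I have already released to dst and acquired nothing since"
      // shortcut is safe only if dst's entry for this thread is newer than
      // BOTH values; otherwise dst may contain an inconsistent snapshot
      // produced by AcquireGlobal.
      bool CanTakeReleaseFastPath(u64 dst_epoch_for_this_thread) const {
        return dst_epoch_for_this_thread > last_acquire_ &&
               dst_epoch_for_this_thread > global_acquire_;
      }
    };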

Diff Detail

Event Timeline

dvyukov created this revision. May 23 2020, 8:18 AM
dvyukov added a subscriber: dfava. May 23 2020, 8:21 AM

Daniel, I am not asking to review (but if you want, you are welcome!). It's just a very interesting exercise in vector clock reasoning and a super tricky bug, so I thought you may be interested.

The approach here LGTM. Thanks for the quick fix! I'm glad you were able to find an approach that didn't require any invasive changes external to tsan itself (e.g. forcing __tsan_finalizer_goroutine into a STW pause).

My one concern is that the change revives the last_acquire_ optimization while still allowing vector clocks to reflect non-linearizable global states. It's my understanding that the last_acquire_ optimization is the only place in tsan that relies on the property, so detecting such violations is sufficient. Still, it seems unfortunate that we'll now need to concern ourselves with the existence of such clock states in the future. Does this make the library harder to understand? Will it impede future optimization? For instance, maybe we'll want to exploit the transitive propagation of clock values to optimize for intra-node synchronization in NUMA architectures (almost certainly a bad idea).

I'll defer to your judgment on all of this, but I figured I'd raise the question.

compiler-rt/lib/tsan/rtl/tsan_clock.cpp
405

Consider expanding on this comment to indicate that "releasing" here doesn't only refer to directly releasing into the provided SyncClock, but also indirectly releasing into it through some transitive clock propagation. Even just "releasing to dst (directly or indirectly)." would be an improvement.

Unless you think this is obvious given the rest of the context in this file.

compiler-rt/lib/tsan/rtl/tsan_clock.h
186

nit: I believe the last_acquire_ optimization required the value in the sync clock to be greater than the thread's last_acquire_ value, not equal to or greater. So this example may need to have thread 3 acquire at 1 and then tick to 2 before the AcquireGlobal call observes its state.

compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp
420

There's no synchronization here and NoteGlobalAcquire is performing a blind write (with relaxed memory ordering) so I think we'd be able to get ourselves in trouble with concurrent calls to AcquireGlobal. Specifically, it appears possible to construct a history where global_acquire_ regresses, which would undermine the rest of this protection.

Do we need a CAS loop in NoteGlobalAcquire to ensure monotonicity? Do we have an implicit guarantee that AcquireGlobal will only be called on a single thread?
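
For concreteness, a monotonic update of that shape could be written as a CAS loop along these lines (a sketch using std::atomic for illustration; tsan has its own atomic primitives, and NoteGlobalAcquireMonotonic is a made-up name):

    #include <atomic>

    typedef unsigned long long u64;

    // Only ever moves global_acquire forward, so racing callers cannot make
    // it regress; the loop exits once we have either installed v or observed
    // a value that is already >= v.
    void NoteGlobalAcquireMonotonic(std::atomic<u64>& global_acquire, u64 v) {
      u64 cur = global_acquire.load(std::memory_order_relaxed);
      while (cur < v &&
             !global_acquire.compare_exchange_weak(cur, v,
                                                   std::memory_order_relaxed)) {
        // compare_exchange_weak reloaded cur on failure; retry.
      }
    }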

dvyukov marked 3 inline comments as done. May 25 2020, 10:14 PM

My one concern is that the change revives the last_acquire_ optimization while still allowing vector clocks to reflect non-linearizable global states.

True.
I would get rid of the non-linearizable clock, but I did not find an elegant way to do it.
The existing STW we have wasn't designed for such operations: it's ptrace-based, ptrace is highly OS-specific, it stops threads at completely arbitrary points, and it may conflict with the program's own uses of ptrace.
We could add a mutex per thread and acquire the mutexes of all threads, but that adds overhead even when GlobalAcquire is not used.
Or GlobalAcquire could do some kind of multi-pass O(N^2) operation that considers all values in all thread clocks, but this looks too expensive and very hard to prove correct.

This last_acquire_ optimization is the only one that I can think of that assumed transitive cumulative clocks.

compiler-rt/lib/tsan/rtl/tsan_clock.cpp
405

Added "(directly or indirectly)" part.

compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp
420

AcquireGlobal is protected by ThreadRegistryLock. I added a comment about this to NoteGlobalAcquire.

dvyukov updated this revision to Diff 266102. May 25 2020, 10:14 PM
nvanbenschoten accepted this revision. May 26 2020, 1:27 PM

Or GlobalAcquire could do some kind of multi-pass O(N^2) operation that considers all values in all thread clocks, but this looks too expensive and very hard to prove correct.

Taking inspiration from the distributed systems world, I think we could also do a two-pass O(n) solution by using an augmented copy-on-write scheme if we are willing to propagate a clock "version" through SyncClock operations (I'd use "epoch", but that's already taken). Imagine that each clock had an associated version. When releasing into a SyncClock, we set its version to sync_clock.version = max(sync_clock.version, thread_clock.version). When acquiring from a SyncClock, we set our ThreadClock to thread_clock.version = max(sync_clock.version, thread_clock.version). Whenever a ThreadClock's version changes, it snapshots its current clock state before making the corresponding modification.

We could then implement GlobalAcquire as:

last_version = cur_version;
next_version = cur_version + 1;
// Pass 1: advance every thread to the new version; each thread snapshots
// its own clock element at the moment it advances.
for th in threads:
    th.setVersionAndSnapshotIfNeeded(next_version);
// Pass 2: read each thread's value as of the previous version.
for th in threads:
    acquiring_thr->clock.set(th, th.getClockSnapshotForVersion(last_version));

Writing this all out reveals that a ThreadClock's snapshot doesn't even need to be of the entire vector clock, just its own element at the time of the snapshot. Also, since AcquireGlobal is never run concurrently, a ThreadClock only needs to store its last snapshot. So I think this all could be accomplished with only 2 new u64s per ThreadClock (cur_version_ and last_snapshot_) and 1 new u64 per SyncClock (cur_version_).
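
A minimal single-threaded sketch of that scheme (all names here are illustrative, not tsan's API; the concurrency between the owning thread and AcquireGlobal that is discussed later in this thread is deliberately ignored):

    typedef unsigned long long u64;

    // Hypothetical versioned-snapshot clock. This relies on at most two
    // versions ever being active at once: AcquireGlobal runs serialized,
    // so each thread only needs the snapshot for the previous version.
    struct VersionedThreadClock {
      u64 clk_;            // this thread's own clock element
      u64 cur_version_;    // last version this clock has observed
      u64 last_snapshot_;  // value of clk_ when cur_version_ last advanced

      void setVersionAndSnapshotIfNeeded(u64 next_version) {
        if (cur_version_ < next_version) {
          last_snapshot_ = clk_;  // snapshot before any further ticks
          cur_version_ = next_version;
        }
      }

      u64 getClockSnapshotForVersion(u64 last_version) const {
        // If we already advanced past last_version, the snapshot holds our
        // value as of that version; otherwise the live value still does.
        return cur_version_ > last_version ? last_snapshot_ : clk_;
      }
    };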

This is still more complicated than what we have here though. I'm still fine with merging the global_acquire_ approach.

This revision is now accepted and ready to land. May 26 2020, 1:27 PM

Interesting. But even if we store just 1 element per thread, won't it require storing a potentially infinite number of snapshots for all previous versions? If I am reading this correctly, a thread can advance an arbitrary number of versions since last_version.
I am not sure how much overhead this is, but fwiw I tried as much as possible to avoid additional overhead for the case when GlobalAcquire is not used.

I've merged the existing fix for now. I probably won't have lots of time to work on this more.

But even if we store just 1 element per thread, won't it require storing a potentially infinite number of snapshots for all previous versions?

We only run a single GlobalAcquire at a time, so we should only need to store a snapshot for a single version in each thread. Once the GlobalAcquire has collected the snapshot from each thread, it is no longer necessary.

If I am reading this correctly, a thread can advance an arbitrary number of versions since last_version.

In practice, I don't think a thread will ever advance more than a single version at a time, because all threads are advanced in lockstep.

Ah, I see now. The version is only incremented by GlobalAcquire, so we have at most 2 active versions. Initially I wrongly assumed that the version is incremented by every clock operation (à la Lamport timestamps).
I think generally it can work. The implementations of setVersionAndSnapshotIfNeeded/getClockSnapshotForVersion may need to be careful about concurrent updates from the owning thread.

Looks like the test added here is failing on Darwin. Can you please look into it?

******************** TEST 'ThreadSanitizer-x86_64 :: java_finalizer2.cpp' FAILED ********************
Script:
--
: 'RUN: at line 1';      /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/./bin/clang  --driver-mode=g++ -fsanitize=thread -Wall  -arch x86_64 -stdlib=libc++ -mmacosx-version-min=10.9 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk   -gline-tables-only -I/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/../ -std=c++11 -I/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/../ -O1 /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp -o /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/tools/clang/runtime/compiler-rt-bins/test/tsan/X86_64Config/Output/java_finalizer2.cpp.tmp &&  /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/tools/clang/runtime/compiler-rt-bins/test/tsan/X86_64Config/Output/java_finalizer2.cpp.tmp 2>&1 | FileCheck /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp
--
Exit Code: 1

Command Output (stderr):
--
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:11:3: error: unknown type name 'pthread_barrier_t'
  pthread_barrier_t barrier_finalizer;
  ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:12:3: error: unknown type name 'pthread_barrier_t'
  pthread_barrier_t barrier_ballast;
  ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:36:5: error: use of undeclared identifier 'pthread_yield'
    pthread_yield();
    ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:38:5: error: use of undeclared identifier 'pthread_yield'
    pthread_yield();
    ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:69:5: error: use of undeclared identifier 'pthread_yield'
    pthread_yield();
    ^
5 errors generated.

--

********************

Refer to http://green.lab.llvm.org/green/job/clang-stage1-RA/10675/consoleFull.

Looks like the test added here is failing on Darwin. Can you please look into it?

Hi,

Should be fixed with: https://github.com/llvm/llvm-project/commit/0969541ffcb24ae1af59fcb8778063becf17dbca

Thanks