
tsan: fix false positives in AcquireGlobal
ClosedPublic

Authored by dvyukov on May 23 2020, 8:18 AM.

Details

Summary

Add ThreadClock::global_acquire_, which records the last time another thread
performed a global acquire of this thread's clock.

It helps to avoid the problem described in:
https://github.com/golang/go/issues/39186
See test/tsan/java_finalizer2.cpp for a regression test.
Note the failure is _extremely_ hard to hit, so if you are trying
to reproduce it, you may want to run something like:
$ go get golang.org/x/tools/cmd/stress
$ stress -p=64 ./a.out

The crux of the problem is roughly as follows.
A number of O(1) optimizations in the clocks algorithm assume proper
transitive cumulative propagation of clock values. The AcquireGlobal
operation may produce an inconsistent, non-linearizable view of
thread clocks. Namely, it may acquire a later value from a thread
with a higher ID, but fail to acquire an earlier value from a thread
with a lower ID. If a thread that executed AcquireGlobal then releases
to a sync clock, it will spoil the sync clock with the inconsistent
values. If another thread later releases to the sync clock, the optimized
algorithm may break.

The exact sequence of events that leads to the failure:

  • thread 1 executes AcquireGlobal
  • thread 1 acquires value 1 for thread 2
  • thread 2 increments clock to 2
  • thread 2 releases to sync object 1
  • thread 3 is at time 1
  • thread 3 acquires from sync object 1
  • thread 1 acquires value 1 for thread 3
  • thread 1 releases to sync object 2
  • sync object 2 clock has 1 for thread 2 and 1 for thread 3
  • thread 3 releases to sync object 2
  • thread 3 sees value 1 in the clock for itself and decides that it has already released to the clock and did not acquire anything from other threads after that (the last_acquire_ check in release operation)
  • thread 3 does not update the value for thread 2 in the clock from 1 to 2
  • thread 4 acquires from sync object 2
  • thread 4 detects a false race with thread 2 as it should have been synchronized with thread 2 up to time 2, but because of the broken clock it is now synchronized only up to time 1

The global_acquire_ value helps to prevent this scenario.
Namely, thread 3 will not trust any own clock values up to global_acquire_
for the purposes of the last_acquire_ optimization.

Diff Detail

Event Timeline

dvyukov created this revision. May 23 2020, 8:18 AM
dvyukov added a subscriber: dfava. May 23 2020, 8:21 AM

Daniel, I am not asking to review (but if you want, you are welcome!). It's just a very interesting exercise in vector clock reasoning and a super tricky bug, so I thought you may be interested.

The approach here LGTM. Thanks for the quick fix! I'm glad you were able to find an approach that didn't require any invasive changes external to tsan itself (e.g. forcing __tsan_finalizer_goroutine into a STW pause).

My one concern is that the change revives the last_acquire_ optimization while still allowing vector clocks to reflect non-linearizable global states. It's my understanding that the last_acquire_ optimization is the only place in tsan that relies on the property, so detecting such violations is sufficient. Still, it seems unfortunate that we'll now need to concern ourselves with the existence of such clock states in the future. Does this make the library harder to understand? Will it impede future optimization? For instance, maybe we'll want to exploit the transitive propagation of clock values to optimize for intra-node synchronization in NUMA architectures (almost certainly a bad idea).

I'll defer to your judgment on all of this, but I figured I'd raise the question.

compiler-rt/lib/tsan/rtl/tsan_clock.cpp
405

Consider expanding on this comment to indicate that "releasing" here doesn't only refer to directly releasing into the provided SyncClock, but also indirectly releasing into it through some transitive clock propagation. Even just "releasing to dst (directly or indirectly)." would be an improvement.

Unless you think this is obvious given the rest of the context in this file.

compiler-rt/lib/tsan/rtl/tsan_clock.h
186

nit: I believe the last_acquire_ optimization required the value in the sync clock to be greater than the thread's last_acquire_ value, not equal to or greater. So this example may need to have thread 3 acquire at 1 and then tick to 2 before the AcquireGlobal call observes its state.

compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp
420

There's no synchronization here and NoteGlobalAcquire is performing a blind write (with relaxed memory ordering) so I think we'd be able to get ourselves in trouble with concurrent calls to AcquireGlobal. Specifically, it appears possible to construct a history where global_acquire_ regresses, which would undermine the rest of this protection.

Do we need a CAS loop in NoteGlobalAcquire to ensure monotonicity? Do we have an implicit guarantee that AcquireGlobal will only be called on a single thread?

dvyukov marked 3 inline comments as done. May 25 2020, 10:14 PM

My one concern is that the change revives the last_acquire_ optimization while still allowing vector clocks to reflect non-linearizable global states.

True.
I would get rid of the non-linearizable clocks, but I did not find an elegant way to do it.
The existing STW mechanism we have wasn't designed for such operations. It's ptrace-based, and ptrace is highly OS-specific, stops threads at completely arbitrary points, and may conflict with ptrace uses by the program itself.
We could add a mutex per thread and acquire the mutexes of all threads, but that adds overhead even when GlobalAcquire is not used.
Or GlobalAcquire could do some kind of multi-pass O(N^2) operation that considers all values in all thread clocks, but this looks too expensive and very hard to prove correct.

This last_acquire_ optimization is the only one that I can think of that assumed transitive cumulative clocks.

compiler-rt/lib/tsan/rtl/tsan_clock.cpp
405

Added "(directly or indirectly)" part.

compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp
420

AcquireGlobal is protected by ThreadRegistryLock. I added a comment about this to NoteGlobalAcquire.

dvyukov updated this revision to Diff 266102. May 25 2020, 10:14 PM
nvanbenschoten accepted this revision. May 26 2020, 1:27 PM

Or GlobalAcquire could do some kind of multi-pass O(N^2) operation that considers all values in all thread clocks, but this looks too expensive and very hard to prove correct.

Taking inspiration from the distributed systems world, I think we could also do a two-pass O(n) solution by using an augmented copy-on-write scheme if we are willing to propagate a clock "version" through SyncClock operations (I'd use "epoch", but that's already taken). Imagine that each clock had an associated version. When releasing into a SyncClock, we set its version to sync_clock.version = max(sync_clock.version, thread_clock.version). When acquiring from a SyncClock, we set our ThreadClock to thread_clock.version = max(sync_clock.version, thread_clock.version). Whenever a ThreadClock's version changes, it snapshots its current clock state before making the corresponding modification.

We could then implement GlobalAcquire as:

last_version = cur_version;
next_version = cur_version + 1;
for th in threads:
    th.setVersionAndSnapshotIfNeeded(next_version);
for th in threads:
    acquiring_thr->clock.set(th, th.getClockSnapshotForVersion(last_version));

Writing this all out reveals that a ThreadClock's snapshot doesn't even need to be of the entire vector clock, just its own element at the time of the snapshot. Also, since AcquireGlobal is never run concurrently, a ThreadClock only needs to store its last snapshot. So I think this all could be accomplished with only 2 new u64s per ThreadClock (cur_version_ and last_snapshot_) and 1 new u64 per SyncClock (cur_version_).

This is still more complicated than what we have here though. I'm still fine with merging the global_acquire_ approach.

This revision is now accepted and ready to land. May 26 2020, 1:27 PM


Interesting. But even if we store just 1 element per thread, won't it require storing a potentially unbounded number of snapshots for all previous versions? If I am reading this correctly, a thread can advance an arbitrary number of versions since last_version.
I am not sure how much overhead this is, but FWIW I tried as much as possible to avoid additional overhead for the case when GlobalAcquire is not used.

I've merged the existing fix for now. I probably won't have lots of time to work on this more.

But even if we store just 1 element per thread, won't it require storing a potentially unbounded number of snapshots for all previous versions?

We only run a single GlobalAcquire at a time, so we should only need to store a snapshot for a single version in each thread. Once the GlobalAcquire has collected the snapshot from each thread, it is no longer necessary.

If I am reading this correctly, a thread can advance an arbitrary number of versions since last_version.

In practice, I don't think a thread will ever advance more than a single version at a time, because all threads are advanced in lockstep.


Ah, I see now. The version is only incremented by GlobalAcquire, so we have at most 2 active versions. Initially I wrongly assumed that the version is incremented by every clock operation (à la Lamport timestamps).
I think generally it can work. The implementation of setVersionAndSnapshotIfNeeded/getClockSnapshotForVersion may need to be careful in case of concurrent updates from the thread.

Looks like the test added here is failing on Darwin. Can you please look into it?

******************** TEST 'ThreadSanitizer-x86_64 :: java_finalizer2.cpp' FAILED ********************
Script:
--
: 'RUN: at line 1';      /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/./bin/clang  --driver-mode=g++ -fsanitize=thread -Wall  -arch x86_64 -stdlib=libc++ -mmacosx-version-min=10.9 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk   -gline-tables-only -I/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/../ -std=c++11 -I/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/../ -O1 /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp -o /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/tools/clang/runtime/compiler-rt-bins/test/tsan/X86_64Config/Output/java_finalizer2.cpp.tmp &&  /Users/buildslave/jenkins/workspace/clang-stage1-RA/clang-build/tools/clang/runtime/compiler-rt-bins/test/tsan/X86_64Config/Output/java_finalizer2.cpp.tmp 2>&1 | FileCheck /Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp
--
Exit Code: 1

Command Output (stderr):
--
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:11:3: error: unknown type name 'pthread_barrier_t'
  pthread_barrier_t barrier_finalizer;
  ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:12:3: error: unknown type name 'pthread_barrier_t'
  pthread_barrier_t barrier_ballast;
  ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:36:5: error: use of undeclared identifier 'pthread_yield'
    pthread_yield();
    ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:38:5: error: use of undeclared identifier 'pthread_yield'
    pthread_yield();
    ^
/Users/buildslave/jenkins/workspace/clang-stage1-RA/llvm-project/compiler-rt/test/tsan/java_finalizer2.cpp:69:5: error: use of undeclared identifier 'pthread_yield'
    pthread_yield();
    ^
5 errors generated.

--

********************

Refer to http://green.lab.llvm.org/green/job/clang-stage1-RA/10675/consoleFull.

Looks like the test added here is failing on Darwin. Can you please look into it?

Hi,

Should be fixed with: https://github.com/llvm/llvm-project/commit/0969541ffcb24ae1af59fcb8778063becf17dbca

Thanks