This is an archive of the discontinued LLVM Phabricator instance.

tsan: Add new annotation functions to promote all accesses to atomic
Needs Review · Public

Authored by protze.joachim on Aug 13 2021, 11:33 AM.

Details

Summary

The idea of the new annotation functions is to work very similarly to AnnotateIgnoreWritesBegin/End.
The difference is that the memory accesses are not ignored; instead, they are treated like atomic accesses.
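
A minimal usage sketch (the function names AnnotateAtomicAccessesBegin/End are placeholders modeled on the existing annotation interface; the patch's actual entry points may differ):

// Placeholder declarations following the AnnotateIgnoreWritesBegin/End shape.
extern "C" void AnnotateAtomicAccessesBegin(const char *file, int line);
extern "C" void AnnotateAtomicAccessesEnd(const char *file, int line);

void reduce_into(long *target, long contribution) {
  AnnotateAtomicAccessesBegin(__FILE__, __LINE__);
  // Accesses in this region are not ignored; tsan treats them as atomic.
  *target += contribution;
  AnnotateAtomicAccessesEnd(__FILE__, __LINE__);
}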

I want to use the new annotation to improve the modeling of OpenMP reduction semantics. With this patch, I can detect the data race between the increment of var by the master thread and the other threads performing the reduction.
Without this patch, I miss the data race, because the current strategy for OpenMP reductions is to ignore the accesses performed by the OpenMP runtime library (see the changes in ompt-tsan.cpp).
The presence of the reduction does not introduce any synchronization other than ensuring that the consistently reduced result is available at the next synchronization point.

Does this patch make sense to you, or do you think it goes in a completely wrong direction?
I must admit that this patch as it is causes a ~10% runtime increase for one of my benchmarks that doesn't even use this annotation. If the patch is OK in general, we can optimize the performance.

I can split the OpenMP-specific changes out into a separate patch.

Diff Detail

Event Timeline

protze.joachim requested review of this revision. Aug 13 2021, 11:33 AM
Herald added a project: Restricted Project.
Herald added subscribers: openmp-commits, Restricted Project, sstefan1.

I must admit that this patch as it is causes a ~10% runtime increase for one of my benchmarks that doesn't even use this annotation. If the patch is OK in general, we can optimize the performance.

This is quite unfortunate. And it seems that the slowdown is inherent to this approach. Or do you have some ideas for how to remove the overhead from the hot path?

What I wonder is: OpenMP should compile reductions (all? most?) to actual atomic operations, no? Or it should insert some kind of mutex lock/unlock around the reduction code.
Either way, if we just expose the actual situation to tsan, it should work the way you want.
This side-channel turning of non-atomic accesses into atomics looks strange. If the actual accesses are atomic, then we should just not lie to tsan that they are non-atomic in the first place. And if the actual accesses are non-atomic, then we are masking real bugs.
What am I missing?

protze.joachim added a comment (edited). Aug 13 2021, 12:15 PM

I must admit that this patch as it is causes a ~10% runtime increase for one of my benchmarks that doesn't even use this annotation. If the patch is OK in general, we can optimize the performance.

This is quite unfortunate. And it seems that the slowdown is inherent to this approach. Or do you have some ideas for how to remove the overhead from the hot path?

I just realized that I did not really compare against the same base, but rather used a build from 8/3 for comparison. After rebuilding my base version from the same main commit, the performance is consistent.
This also means that within the last 10 days a 10% performance regression was introduced into main.

I'll run some bisecting to find the commit.

What I wonder is: OpenMP should compile reductions (all? most?) to actual atomic operations, no? Or it should insert some kind of mutex lock/unlock around the reduction code.
Either way, if we just expose the actual situation to tsan, it should work the way you want.
This side-channel turning of non-atomic accesses into atomics looks strange. If the actual accesses are atomic, then we should just not lie to tsan that they are non-atomic in the first place. And if the actual accesses are non-atomic, then we are masking real bugs.
What am I missing?

The OpenMP runtime / compiler codegen use different strategies to implement the reduction. In some cases atomics are actually used; in other cases, the reduction is implemented as part of a barrier. Wearing my OpenMP application developer hat, I trust the OpenMP implementation. From this perspective, casting the memory accesses that implement the reduction to atomic makes sense to me.
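
For context, a minimal OpenMP reduction; at this level it is invisible whether the runtime combines the partial results with atomics or inside the barrier:

#include <cstdio>

int main() {
  long sum = 0;
  // Each thread accumulates into a private copy of sum; the runtime
  // combines the copies with atomics or as part of the closing barrier.
#pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < 1000; ++i)
    sum += i;
  std::printf("%ld\n", sum);  // always 499500, regardless of strategy
}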

Semantically, I could also annotate Mutex semantics. But from my understanding, this introduces HappensBefore edges into the analysis where I don't want to have them. They would hide even more data races than the ones I am trying to expose.

I successfully removed the performance regression from the patch.

I couldn't reproduce the performance regression on main that I had seen in intermediate tests.

Going back to the branch instead of the multiplication doesn't seem to have a performance impact.

Semantically, I could also annotate Mutex semantics. But from my understanding, this introduces HappensBefore edges into the analysis where I don't want to have them. They would hide even more data races than the ones I am trying to expose.

I see. This makes sense.

Re performance: what matters more now is the performance of the new runtime that is slowly being upstreamed in pieces:
https://github.com/dvyukov/llvm-project/pull/3

This change itself can be ported to the new runtime, but it still adds instructions to the fast paths. More importantly, it turns what is currently a compile-time constant into a runtime value (we rely heavily on inlining, so the compiler statically knows kAtomic to be 0; any branches are eliminated and any derived values are known). I think it will have an even more negative impact on the new runtime, where we also trace the atomic flag:

ev->isAtomic = !!(typ & kAccessAtomic);

and use it to create vector consts:

const m128 access_read_atomic = _mm_set1_epi32((typ & (kAccessRead | kAccessAtomic)) << 30);

If kAccessAtomic is not a constant, the compiler will need to create several globals and load one or the other. Also, for non-atomic writes this value is 0, so the compiler can simply clear the register.
I am also afraid of the long-term effect. Say we add "if (kAtomic)" (some slow path for atomics only, which does not affect normal accesses at all); with this change that becomes problematic (not only does it add a branch, it also affects register allocation, code layout, etc.).
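
An illustration of that point (not the actual tsan sources): with kAtomic as a compile-time parameter, the branch below folds away entirely; as a runtime flag, every access pays for the test and the derived values stop being constants.

using uptr = unsigned long;

void HandleFastPath(uptr addr);        // stand-ins for the real handlers
void HandleAtomicSlowPath(uptr addr);

template <bool kAtomic>
void MemoryAccess(uptr addr) {
  if (kAtomic) {  // dead code when instantiated with kAtomic == false
    HandleAtomicSlowPath(addr);
    return;
  }
  HandleFastPath(addr);
}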

So I would like to consider the alternatives as much as possible first.

I see that OpenMP reductions are always tied to a single variable (?):
https://www.openmp.org/spec-html/5.0/openmpsu107.html
If at least the modified variables are known, then I think it's possible to do the following: ignore all reduction accesses as is done now, plus emit additional (fake) tsan atomic store annotations for the target variables/memory locations.
I think it should have the same effect without the long-term tsan runtime tax. Unfortunately, the tsan atomic hooks do the actual atomic operation as well, but for starters you can try __atomic_fetch_add(0).
Will this work? Am I missing anything?
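
A sketch of that suggestion (the wrapper and variable names are made up for illustration): the real reduction writes stay ignored, and a no-op atomic read-modify-write afterwards makes tsan record an atomic access to the reduced location:

#include <cstdint>

// Fetch-add of 0 leaves the value unchanged, but under tsan
// instrumentation it is handled as an atomic access to *var.
// Relaxed ordering avoids introducing extra happens-before edges.
void mark_reduced(std::int64_t *var) {
  __atomic_fetch_add(var, 0, __ATOMIC_RELAXED);
}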

Semantically, I could also annotate Mutex semantics. But from my understanding, this introduces HappensBefore edges into the analysis where I don't want to have them. They would hide even more data races than the ones I am trying to expose.

I see. This makes sense.

Re performance: what matters more now is the performance of the new runtime that is slowly being upstreamed in pieces:
https://github.com/dvyukov/llvm-project/pull/3

This is interesting. I'm curious to see whether the new runtime is similarly sensitive to contended shared read accesses by the application.
With the current runtime, I observed 100-1000x runtime overhead if all threads on a node concurrently read the same data (like the vector in a vector-matrix multiplication).
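
The pattern in question, for concreteness (an assumed shape of the workload, not the author's actual benchmark):

// Every thread reads the shared vector x concurrently; no one writes it.
void matvec(const double *A, const double *x, double *y, int n) {
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    double acc = 0.0;
    for (int j = 0; j < n; ++j)
      acc += A[i * n + j] * x[j];  // contended shared reads of x
    y[i] = acc;
  }
}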

This change itself can be ported to the new runtime, but it still adds instructions to the fast paths. More importantly, it turns what is currently a compile-time constant into a runtime value (we rely heavily on inlining, so the compiler statically knows kAtomic to be 0; any branches are eliminated and any derived values are known). I think it will have an even more negative impact on the new runtime, where we also trace the atomic flag:

ev->isAtomic = !!(typ & kAccessAtomic);

and use it to create vector consts:

const m128 access_read_atomic = _mm_set1_epi32((typ & (kAccessRead | kAccessAtomic)) << 30);

If kAccessAtomic is not a constant, the compiler will need to create several globals and load one or the other. Also, for non-atomic writes this value is 0, so the compiler can simply clear the register.
I am also afraid of the long-term effect. Say we add "if (kAtomic)" (some slow path for atomics only, which does not affect normal accesses at all); with this change that becomes problematic (not only does it add a branch, it also affects register allocation, code layout, etc.).

So I would like to consider the alternatives as much as possible first.

I see. So I'll do some further experiments. Casting the memory accesses and treating them like atomics just came up as an idea yesterday.
In fact, we are looking for a solution that works the same way with gcc-compiled code, to cover Fortran codes as well. So we prefer not to modify the compiler.

I see that OpenMP reductions are always tied to a single variable (?):
https://www.openmp.org/spec-html/5.0/openmpsu107.html

The OpenMP reduction clause can take a list of variables ("one or more list items") and always reduces the thread-local copies of the variables into a single copy. Besides the predefined reduction operations, the application can also define its own reduction operations.
A user-defined reduction could be maxloc(struct{double,size_t}), which finds the maximum value and its index. Each thread would first find its local maximum and index, and the reduction just combines the results of all threads. The OpenMP runtime takes care of the synchronization between the reduction steps, but again, this synchronization should not impact the analysis of the application.
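
A sketch of what such a user-defined maxloc reduction could look like (illustrative, not from the patch; the identifier maxloc_red is made up):

#include <cfloat>
#include <cstddef>

struct maxloc {
  double val;
  std::size_t idx;
};

// The combiner merges two partial results; the initializer gives each
// thread's private copy a neutral starting value.
#pragma omp declare reduction(maxloc_red : maxloc :                    \
    omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out))           \
    initializer(omp_priv = maxloc{-DBL_MAX, 0})

maxloc find_max(const double *a, std::size_t n) {
  maxloc m = {-DBL_MAX, 0};
#pragma omp parallel for reduction(maxloc_red : m)
  for (std::size_t i = 0; i < n; ++i)
    if (a[i] > m.val) { m.val = a[i]; m.idx = i; }
  return m;
}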

If at least the modified variables are known, then I think it's possible to do the following: ignore all reduction accesses as is done now, plus emit additional (fake) tsan atomic store annotations for the target variables/memory locations.

For a portable implementation, libarcher performs the HB annotations and ignore-write annotations in OpenMP tool callbacks defined in the OpenMP standard. At that point, we don't know the memory range of the reduction variables.
The semantics of the reduction callbacks are: "all reduction-related memory accesses are performed between the reduction-begin/end callbacks". These callbacks do not even reflect the synchronization semantics and can occur outside of the critical path, i.e., outside of the locked region.
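
Roughly, the callback-side view (a sketch along the lines of ompt-tsan.cpp; registration boilerplate is omitted and the real handler differs in detail):

#include <omp-tools.h>

// Existing tsan annotation entry points, declared here for the sketch.
extern "C" void AnnotateIgnoreWritesBegin(const char *file, int line);
extern "C" void AnnotateIgnoreWritesEnd(const char *file, int line);

// The tool only learns that reduction-related accesses happen between
// the begin and end callbacks; it never sees the variables' addresses,
// so it cannot emit per-location fake atomic stores.
static void ompt_tsan_reduction(ompt_sync_region_t kind,
                                ompt_scope_endpoint_t endpoint,
                                ompt_data_t *parallel_data,
                                ompt_data_t *task_data,
                                const void *codeptr_ra) {
  if (endpoint == ompt_scope_begin)
    AnnotateIgnoreWritesBegin(__FILE__, __LINE__);
  else
    AnnotateIgnoreWritesEnd(__FILE__, __LINE__);
}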

I think it should have the same effect without the long-term tsan runtime tax. Unfortunately, the tsan atomic hooks do the actual atomic operation as well, but for starters you can try __atomic_fetch_add(0).
Will this work? Am I missing anything?

I agree that it should not impact the critical path for the majority of cases.
Another possible use case that I see for this kind of annotation is a heuristic "lockset" analysis approach: assume that no pair of memory accesses under a lock races, but don't derive HB from the locking. This would make it possible to find data races that are currently hidden by the HB edges introduced by locked regions.

Another possible use case that I see for this kind of annotation is a heuristic "lockset" analysis approach: assume that no pair of memory accesses under a lock races, but don't derive HB from the locking. This would make it possible to find data races that are currently hidden by the HB edges introduced by locked regions.

Interesting, and I think it may be possible.
Do I understand correctly that such mutexes would synchronize memory, but only for the duration of the critical section? Complete support for this could be implemented by creating a shadow copy of the thread's vector clock on mutex lock and then restoring that shadow copy on mutex unlock.
I think there is also a somewhat hacky way to achieve the same thing differently. When we report a race, we restore the MutexSet for the previous access, and we know the current MutexSet. If there is any intersection on such "lockset" mutexes (or maybe just on write-locked mutexes), then we stop reporting the race.
Or this idea could actually be combined with this change as follows. We keep most of this change but don't alter the race detection logic itself; a thread still knows when it is inside an "atomic section". But we also add atomic-section entry/exit events to the trace, so that we can restore them for previous memory accesses. Then, in ReportRace, we check whether the current thread is inside an atomic section and whether the previous memory access was inside one as well.
But this involves replaying the trace for the other thread, so I am not sure which will be faster: creating a shadow vector clock or this... For the new tsan runtime, creating a shadow vector clock may be faster, since all vector clocks are fixed size (512 bytes).
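
A rough sketch of the shadow-clock variant (all names hypothetical, not the actual runtime structures; nesting and multiple held mutexes are ignored):

// Hypothetical stand-ins for tsan's thread state and sync variable;
// only the shape of the idea matters here.
struct VectorClock { /* fixed-size (512-byte) clock in the new runtime */ };
struct ThreadState { VectorClock clock, saved_clock; };
struct SyncVar     { VectorClock clock; };

void Acquire(VectorClock &dst, const VectorClock &src);  // dst = max(dst, src)
void Release(VectorClock &dst, const VectorClock &src);

void LocksetMutexLock(ThreadState *thr, SyncVar *s) {
  thr->saved_clock = thr->clock;  // snapshot before synchronizing
  Acquire(thr->clock, s->clock);  // normal HB acquire for the region
}

void LocksetMutexUnlock(ThreadState *thr, SyncVar *s) {
  Release(s->clock, thr->clock);  // publish as a normal unlock would
  thr->clock = thr->saved_clock;  // drop the HB edges gained at lock()
}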

vitalybuka resigned from this revision. Aug 17 2021, 4:44 PM

Leaving to @dvyukov