
sanitizer_common: optimize Mutex for high contention
ClosedPublic

Authored by dvyukov on Aug 10 2021, 6:55 AM.

Details

Summary

After switching tsan from the old mutex to the new sanitizer_common mutex,
we've observed a significant degradation of performance on a test.
The test effectively stresses a lock-free stack with 4 threads
with a mix of atomic_compare_exchange and atomic_load operations.
The former takes the write lock, while the latter takes the read lock.
It turned out the new mutex performs worse because readers don't
use active spinning, which results in a significant amount of thread
blocking/unblocking. The old tsan mutex used active spinning
for both writers and readers.

Add active spinning for readers.
Don't hand off the mutex to readers; instead make them
compete for the mutex again after wake-up.
This makes readers and writers almost symmetric.
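
For illustration, a minimal sketch of what active spinning for readers can look like, assuming a hypothetical state word with a writer bit and a reader counter (names such as RWSpinSketch, kSpinIterations, kWriterLock, and ReadLockSlow are invented for this sketch and are not the actual sanitizer_common code):

  #include <atomic>
  #include <cstdint>
  #include <thread>

  // Illustrative sketch only (not the actual sanitizer_common Mutex):
  // readers spin for a bounded number of iterations before falling back
  // to a blocking slow path.
  class RWSpinSketch {
    static constexpr int kSpinIterations = 1500;           // hypothetical budget
    static constexpr uint64_t kWriterLock = 1;             // hypothetical bit layout
    static constexpr uint64_t kReaderIncrement = 1 << 1;
    std::atomic<uint64_t> state_{0};

   public:
    void ReadLock() {
      for (int i = 0; i < kSpinIterations; i++) {
        uint64_t s = state_.load(std::memory_order_relaxed);
        if (!(s & kWriterLock) &&
            state_.compare_exchange_weak(s, s + kReaderIncrement,
                                         std::memory_order_acquire))
          return;  // acquired the read lock without blocking
      }
      ReadLockSlow();
    }

    void ReadUnlock() {
      state_.fetch_sub(kReaderIncrement, std::memory_order_release);
    }

   private:
    void ReadLockSlow() {
      // The real mutex would register as a waiter and block on a semaphore;
      // this sketch just yields and retries.
      for (;;) {
        uint64_t s = state_.load(std::memory_order_relaxed);
        if (!(s & kWriterLock) &&
            state_.compare_exchange_weak(s, s + kReaderIncrement,
                                         std::memory_order_acquire))
          return;
        std::this_thread::yield();
      }
    }
  };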

Diff Detail

Event Timeline

dvyukov created this revision. Aug 10 2021, 6:55 AM
dvyukov requested review of this revision. Aug 10 2021, 6:55 AM
Herald added a project: Restricted Project. Aug 10 2021, 6:55 AM
Herald added a subscriber: Restricted Project.
dvyukov updated this revision to Diff 365468. Aug 10 2021, 7:09 AM

fixed some DCHECK checks

melver added inline comments. Aug 10 2021, 8:02 AM
compiler-rt/lib/sanitizer_common/sanitizer_mutex.h
194

I think we need a 'pause' hint in the active spinning. Do we have an arch-independent pause-hint wrapper already?

216–217

Could that unnecessarily shift fairness towards readers? E.g. if there are lots of readers, the probability that the kReaderSpinWait bit is set is high, which would mean wake_writer upon Unlock() is false.

Before, it would always prioritize writers, but now it prioritizes any spinner. If the other writer is Wait()'ing, it would prioritize the read-spinners (+ any Wait()'ing reader as a side-effect).

Is that ok?

I guess it's fine to prefer to let the spinners through first, but if there are read-spinners + read-waiters it'll turn the Wait()'ing readers into spinners as well, without waking the waiting writers, and therefore prioritize readers.

.. or I missed something that's currently preventing that, or perhaps it's not as bad as I think.

248

Is it ok that the kReaderSpinWait bit is imprecise, given that it's just an optimization?
Multiple readers may keep flipping this bit on and off, and a waiting reader may unset the bit while there is still an in-flight spinning reader (which would immediately set it back).

264

Same here, I think we need a 'pause' hint for the CPU.

dvyukov added inline comments. Aug 10 2021, 8:09 AM
compiler-rt/lib/sanitizer_common/sanitizer_mutex.h
194

We have, but the problem with PAUSE on x86 is that its latency recently increased from 1 cycle to 100+. I am not sure how it's supposed to be used now. Also, 100+ cycles looks like too much for a backoff.
We used to do 10 PAUSEs in a row, but now I did what the abseil mutex does (just 1500 active spin iterations w/o PAUSE):
https://github.com/abseil/abseil-cpp/blob/master/absl/synchronization/mutex.cc#L144
(what abseil does can't be too bad)
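
For reference, a hedged sketch of the two backoff styles being compared here; the pause intrinsic is the GCC/Clang x86 builtin, the 1500-iteration figure is the one from the abseil link above, and the helper names are invented for this sketch:

  // Style 1: issue an explicit CPU pause hint per iteration. On recent x86
  // microarchitectures a single PAUSE can cost 100+ cycles, which makes it
  // questionable as a fine-grained backoff.
  inline void PauseBackoff(int iterations) {
  #if defined(__x86_64__) || defined(__i386__)
    for (int i = 0; i < iterations; i++) __builtin_ia32_pause();
  #else
    for (volatile int i = 0; i < iterations; i++) {}  // no portable pause hint
  #endif
  }

  // Style 2 (the shape of what abseil does): a plain bounded spin of ~1500
  // iterations with no pause hint, then give up and block.
  template <typename TryAcquire>
  bool SpinThenGiveUp(TryAcquire try_acquire, int iterations = 1500) {
    for (int i = 0; i < iterations; i++)
      if (try_acquire()) return true;
    return false;  // caller should fall back to blocking
  }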

dvyukov added inline comments. Aug 10 2021, 8:10 AM
compiler-rt/lib/sanitizer_common/sanitizer_mutex.h
248

Looks OK to me.
It would not be OK to set kReaderSpinWait when there are no spinners, but it's fine the other way around.

melver accepted this revision. Aug 10 2021, 8:11 AM
melver added inline comments.
compiler-rt/lib/sanitizer_common/sanitizer_mutex.h
194

Ok, that's bad. I guess some recent microarch broke it then :-)
That's fine then.

This revision is now accepted and ready to land. Aug 10 2021, 8:11 AM
dvyukov added inline comments. Aug 10 2021, 8:20 AM
compiler-rt/lib/sanitizer_common/sanitizer_mutex.h
216–217

Yes, it's more unfair now, with more bias towards readers.
I don't know what's better. There is an inherent conflict between fairness and throughput. If we make it more fair, we get these 100x slowdown fallouts. I don't think there is even a single right answer.
We have already seen that too much fairness hits us badly. We have not yet seen how too little fairness hits us :)
I don't think we want too-strong fairness guarantees (like a production mutex could have). For us throughput is quite important since we slow down execution a lot and tend to create high contention in some cases. As long as the program makes forward progress, I think throughput is preferable.
If this version proves to be problematically unfair, maybe we could add the starvation prevention logic I used in Go's sync.Mutex:
https://github.com/golang/go/commit/0556e26273f704db73df9e7c4c3d2e8434dec7be
It ensures there is no pathological unfairness w/o penalizing normal execution.
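
For context, a much-simplified, self-contained approximation of that starvation-prevention idea: a waiter that has been waiting "too long" suppresses barging acquisitions until it gets the lock, so normal execution stays fast but no thread can be starved indefinitely. Go's real sync.Mutex uses a wait queue with direct handoff and a ~1ms threshold; the class and member names here are invented for this sketch and are unrelated to the sanitizer code.

  #include <atomic>
  #include <chrono>
  #include <thread>

  // Rough sketch only: not Go's actual implementation (which hands the mutex
  // off to the oldest queued waiter) and not the sanitizer_common mutex.
  class AntiStarvationLockSketch {
    std::atomic<bool> locked_{false};
    std::atomic<bool> starving_{false};

   public:
    void Lock() {
      const auto start = std::chrono::steady_clock::now();
      for (;;) {
        bool i_am_starving = std::chrono::steady_clock::now() - start >
                             std::chrono::milliseconds(1);
        if (i_am_starving) starving_.store(true, std::memory_order_relaxed);
        bool expected = false;
        // Barging is allowed only while no waiter is starving (or we are the
        // starving waiter ourselves).
        if ((i_am_starving || !starving_.load(std::memory_order_relaxed)) &&
            locked_.compare_exchange_weak(expected, true,
                                          std::memory_order_acquire)) {
          if (i_am_starving) starving_.store(false, std::memory_order_relaxed);
          return;
        }
        std::this_thread::yield();
      }
    }

    void Unlock() { locked_.store(false, std::memory_order_release); }
  };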

melver added inline comments. Aug 10 2021, 8:25 AM
compiler-rt/lib/sanitizer_common/sanitizer_mutex.h
216–217

Ok makes sense.
Just wanted to check; and thanks for the explanation!

LGTM, thanks!

This revision was automatically updated to reflect the committed changes.