This is an archive of the discontinued LLVM Phabricator instance.

Differential D156951

[scudo] Fine tune busy-waiting in HybridMutex
Closed, Public

Authored by Chia-hungDuan on Aug 2 2023, 5:02 PM.

Details

Reviewers

cferris
hboehm

Commits

rGcde307e46577: [scudo] Fine tune busy-waiting in HybridMutex

Summary

Instead of using hardware-specific instructions, a simple loop over a
volatile variable gives a similar and more predictable waiting time. Also
fine-tune the waiting time to fit the average time spent in malloc/free
operations.
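
The idea in the summary can be sketched outside of Scudo. The following is a minimal, hypothetical illustration, not the code from this revision: `ToyHybridMutex`, its members, and the yield-based slow path are assumptions made for the example, and the constants simply mirror the values in the final diff of this revision (`SpinTimes = 16`, `NumberOfTries = 32`).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a hybrid (spin-then-block) mutex. Names and the
// slow path are illustrative assumptions, not Scudo's implementation.
class ToyHybridMutex {
public:
  // Returns true if the lock was free and is now owned by the caller.
  bool tryLock() { return !Locked.exchange(true, std::memory_order_acquire); }

  void lock() {
    if (tryLock())
      return;
    // Bounded busy-wait phase: short delay, then retry.
    for (uint32_t I = 0; I < NumberOfTries; ++I) {
      delayLoop();
      if (tryLock())
        return;
    }
    lockSlow();
  }

  void unlock() {
    assert(Locked.load(std::memory_order_relaxed) && "unlock of a free mutex");
    Locked.store(false, std::memory_order_release);
  }

private:
  // Busy-wait on a volatile local. Every access to V must be emitted, so the
  // compiler cannot delete the loop. `V = V + 1` is used instead of `++V`
  // only to avoid the C++20 deprecation warning for volatile increments.
  static void delayLoop() {
    constexpr uint32_t SpinTimes = 16;
    volatile uint32_t V = 0;
    for (uint32_t I = 0; I < SpinTimes; ++I)
      V = V + 1;
  }

  // Assumption: a real slow path would block in the kernel. This sketch just
  // yields to the scheduler between attempts.
  void lockSlow() {
    while (!tryLock())
      std::this_thread::yield();
  }

  static constexpr uint32_t NumberOfTries = 32;
  std::atomic<bool> Locked{false};
};
```

The fast path is a single `tryLock()`. Only contended callers pay for the delay loop, and only callers that exhaust the retry budget reach the slow path.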

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Chia-hungDuan created this revision. Aug 2 2023, 5:02 PM

Herald added a project: Restricted Project. Aug 2 2023, 5:02 PM

Herald added subscribers: yaneury, Enna1.

Chia-hungDuan requested review of this revision. Aug 2 2023, 5:02 PM

Herald added a subscriber: Restricted Project.

Chia-hungDuan added reviewers: cferris, hboehm. Aug 2 2023, 5:03 PM

Mostly comment nits.

compiler-rt/lib/scudo/standalone/mutex.h
38	It's probably better to name this something like delay or delayLoop since it's not really doing a yield any more.
57–58	that are guarded by the mutex. Although, I'm not sure exactly what you mean by minimum time spent among operations. Do you mean the average time spent during operations while the mutex is held?
69	times

This revision now requires changes to proceed. Aug 2 2023, 5:13 PM

Harbormaster completed remote builds in B249913: Diff 546647. Aug 2 2023, 5:16 PM

Address review comment

Chia-hungDuan marked an inline comment as done. Aug 2 2023, 5:50 PM

Chia-hungDuan added inline comments.

compiler-rt/lib/scudo/standalone/mutex.h
57–58	It's the shortest time while doing alloc/free. Use it as the query interval (or tryLock interval) to avoid waiting too long for the completion of an operation. Rephrased a little bit.

Harbormaster completed remote builds in B249922: Diff 546658. Aug 2 2023, 6:06 PM

LGTM.

This revision is now accepted and ready to land. Aug 2 2023, 6:14 PM

Chia-hungDuan planned changes to this revision. Aug 4 2023, 9:46 AM

Chia-hungDuan added inline comments.

compiler-rt/lib/scudo/standalone/mutex.h
60	There was some measurement bias here. Will update with a new default value.

melver added a subscriber: melver. Aug 10 2023, 12:50 AM

melver added inline comments.

compiler-rt/lib/scudo/standalone/common.h
117	What's the real objective of this change? What's the performance improvement?

> gives similar and more predictable waiting time

The CPU pause/yield instructions are not just about the waiting time itself, but also about making sure the CPU doesn't waste energy and isn't hostile to neighbouring cores/HW threads (such as hyperthreads).

One problem here is that if the locking thread starts to spin too aggressively, it may take resources from the very thread that has to release the mutex before the waiter can make forward progress in the first place. This is a real problem when the waiting thread and the thread that needs to release the mutex are on the same core on sibling hyperthreads.

"Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. [...] An additional function of the PAUSE instruction is to reduce the power consumed by a processor while executing a spin loop." https://www.felixcloutier.com/x86/pause.html
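
For comparison, the hint-based approach described in this comment can be sketched as follows. This is an illustrative assumption, not code from the revision: `cpuRelax` and `spinUntilSet` are hypothetical names, the inline assembly is GCC/Clang syntax, and 32-bit Arm is deliberately left on the fallback path because `yield` is not available on every Arm variant.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: issue the architecture's spin-wait hint when one is
// known, otherwise fall back to a compiler-only barrier.
inline void cpuRelax() {
#if defined(__i386__) || defined(__x86_64__)
  __asm__ __volatile__("pause" ::: "memory");
#elif defined(__aarch64__)
  __asm__ __volatile__("yield" ::: "memory");
#else
  std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Spin until Flag becomes true or the budget of MaxSpins polls is used up.
// Returns the last observed value of Flag.
inline bool spinUntilSet(const std::atomic<bool> &Flag, uint32_t MaxSpins) {
  assert(MaxSpins > 0 && "spin budget must be positive");
  for (uint32_t I = 0; I < MaxSpins; ++I) {
    if (Flag.load(std::memory_order_acquire))
      return true;
    cpuRelax(); // Tell the core this is a spin-wait loop.
  }
  return Flag.load(std::memory_order_acquire);
}
```

The hint goes between polls of the shared variable, so a sibling hardware thread gets a chance to run while the waiter backs off.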

Update the length of busy-waiting loop

This revision is now accepted and ready to land. Aug 28 2023, 10:12 PM

Chia-hungDuan added inline comments. Aug 28 2023, 10:12 PM

compiler-rt/lib/scudo/standalone/common.h
117	I took several busy-waiting implementations from allocators (like jemalloc and tcmalloc), JVMs (Android ART), libraries (libc++, abseil), etc. as references. Most of them use volatile as well. The Linux kernel uses `pause` or `nop` depending on the version.

The main argument is that this is not a general-purpose mutex. It's specific to Scudo's operations, and most of those operations are expected to be pretty quick. Besides, a program is unlikely to stress the user-space memory allocator heavily enough to cause heavy contention on Scudo's region mutex. Those instructions may suggest a context switch, which may take longer than finishing most operations. For example, on a Pixel 7 Pro (arm64), it shows a 7~8% performance difference.

As a result, the inline assembly does not always show a benefit in our case, but it does increase the complexity of maintaining this kind of inline assembly across different platforms (and architectures). In addition, given the short and expected length of the waiting time, `volatile` is preferred in Scudo.
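
Claims like the one above depend on the device, so it can help to measure the delay directly. The following is a rough timing sketch under stated assumptions: `delayLoop` here takes the spin count as a parameter and `averageDelayNs` is a hypothetical helper. Neither is part of Scudo, and the numbers it reports are not comparable to the allocator benchmark quoted in the comment.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Same shape as the volatile loop discussed in this review, with the spin
// count passed in so that different values can be compared.
inline void delayLoop(uint32_t SpinTimes) {
  volatile uint32_t V = 0;
  for (uint32_t I = 0; I < SpinTimes; ++I)
    V = V + 1;
}

// Average wall-clock nanoseconds per delayLoop(SpinTimes) call, measured over
// Iterations back-to-back calls. Includes loop and call overhead.
inline double averageDelayNs(uint32_t SpinTimes, uint32_t Iterations) {
  assert(Iterations > 0);
  const auto Start = std::chrono::steady_clock::now();
  for (uint32_t I = 0; I < Iterations; ++I)
    delayLoop(SpinTimes);
  const auto End = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> Elapsed = End - Start;
  return Elapsed.count() / Iterations;
}
```

Calling `averageDelayNs(16, 1000000)` on the target hardware gives a first estimate of the retry interval that can be compared with the duration of the operations the mutex guards.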

Chia-hungDuan mentioned this in D139600: [scudo][standalone] Only use yield on ARMv6K and newer. Aug 28 2023, 10:13 PM

Harbormaster completed remote builds in B255411: Diff 554176. Aug 28 2023, 10:26 PM

Closed by commit rGcde307e46577: [scudo] Fine tune busy-waiting in HybridMutex (authored by Chia-hungDuan). Sep 21 2023, 2:03 PM

This revision was automatically updated to reflect the committed changes.

Chia-hungDuan added a commit: rGcde307e46577: [scudo] Fine tune busy-waiting in HybridMutex.

Revision Contents

Path	Size
compiler-rt/lib/scudo/standalone/common.h	15 lines
compiler-rt/lib/scudo/standalone/mutex.h	17 lines

Diff 557205

compiler-rt/lib/scudo/standalone/common.h

@@ 106 lines hidden @@ template <typename T> inline void shuffle(T *A, u32 N, u32 *RandState) {
   if (N <= 1)
     return;
   u32 State = *RandState;
   for (u32 I = N - 1; I > 0; I--)
     Swap(A[I], A[getRandomModN(&State, I + 1)]);
   *RandState = State;
 }

-// Hardware specific inlinable functions.
-
-inline void yieldProcessor(UNUSED u8 Count) {
-#if defined(__i386__) || defined(__x86_64__)
-  __asm__ __volatile__("" ::: "memory");
-  for (u8 I = 0; I < Count; I++)
-    __asm__ __volatile__("pause");
-#elif defined(__aarch64__) || defined(__arm__)
-  __asm__ __volatile__("" ::: "memory");
-  for (u8 I = 0; I < Count; I++)
-    __asm__ __volatile__("yield");
-#endif
-  __asm__ __volatile__("" ::: "memory");
-}
-
 // Platform specific functions.

 extern uptr PageSizeCached;
 uptr getPageSizeSlow();
 inline uptr getPageSizeCached() {
 #if SCUDO_ANDROID && defined(PAGE_SIZE)
   // Most Android builds have a build-time constant page size.
   return PAGE_SIZE;
@@ 99 lines hidden @@

compiler-rt/lib/scudo/standalone/mutex.h

@@ 29 lines hidden @@ if (LIKELY(tryLock()))
     // The compiler may try to fully unroll the loop, ending up in a
     // NumberOfTries*NumberOfYields block of pauses mixed with tryLocks. This
     // is large, ugly and unneeded, a compact loop is better for our purpose
     // here. Use a pragma to tell the compiler not to unroll the loop.
 #ifdef __clang__
 #pragma nounroll
 #endif
     for (u8 I = 0U; I < NumberOfTries; I++) {
-      yieldProcessor(NumberOfYields);
+      delayLoop();
       if (tryLock())
         return;
     }
     lockSlow();
   }
   void unlock() RELEASE();

   // TODO(chiahungduan): In general, we may want to assert the owner of lock as
   // well. Given the current uses of HybridMutex, it's acceptable without
   // asserting the owner. Re-evaluate this when we have certain scenarios which
   // requires a more fine-grained lock granularity.
   ALWAYS_INLINE void assertHeld() ASSERT_CAPABILITY(this) {
     if (SCUDO_DEBUG)
       assertHeldImpl();
   }

 private:
+  void delayLoop() {
+    // The value comes from the average time spent in accessing caches (which
+    // are the fastest operations) so that we are unlikely to wait too long for
+    // fast operations.
+    constexpr u32 SpinTimes = 16;
+    volatile u32 V = 0;
+    for (u32 I = 0; I < SpinTimes; ++I)
+      ++V;
+  }
+
   void assertHeldImpl();

-  static constexpr u8 NumberOfTries = 8U;
-  static constexpr u8 NumberOfYields = 8U;
+  // TODO(chiahungduan): Adapt this value based on scenarios. E.g., primary and
+  // secondary allocator have different allocation times.
+  static constexpr u8 NumberOfTries = 32U;

 #if SCUDO_LINUX
   atomic_u32 M = {};
 #elif SCUDO_FUCHSIA
   sync_mutex_t M = {};
 #endif

   void lockSlow() ACQUIRE();
@@ 17 lines hidden @@
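
As a worked example of what the constants in this diff imply, the spin budget before falling back to `lockSlow()` can be computed at compile time. The helper `spinBudget` and the variable names below are hypothetical. Only the four numeric values are taken from the diff, and the two totals count different kinds of work (hint instructions versus volatile increments), so they are not directly comparable in time.

```cpp
#include <cstdint>

// Total units of busy-wait work if every retry fails:
// (number of tryLock retries) x (work done between two retries).
constexpr uint32_t spinBudget(uint32_t Tries, uint32_t WorkPerTry) {
  return Tries * WorkPerTry;
}

// Before: NumberOfTries = 8, NumberOfYields = 8 pause/yield hints per retry.
constexpr uint32_t OldHintInstructions = spinBudget(8, 8);
// After: NumberOfTries = 32, SpinTimes = 16 volatile increments per retry.
constexpr uint32_t NewVolatileIncrements = spinBudget(32, 16);

static_assert(OldHintInstructions == 64, "8 retries x 8 hints");
static_assert(NewVolatileIncrements == 512, "32 retries x 16 increments");
```

So the lock is now polled up to four times as often (32 retries instead of 8) before the slow path is taken, with a fixed 16-iteration delay between polls.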