This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Fine tune busy-waiting in HybridMutex
ClosedPublic

Authored by Chia-hungDuan on Aug 2 2023, 5:02 PM.

Details

Summary

Instead of using hardware specific instruction, using simple loop over
volatile variable gives similar and more predicatable waiting time. Also
fine tune the waiting time to fit with the average time in malloc/free
operations.

Diff Detail

Event Timeline

Chia-hungDuan created this revision.Aug 2 2023, 5:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2023, 5:02 PM
Herald added subscribers: yaneury, Enna1. · View Herald Transcript
Chia-hungDuan requested review of this revision.Aug 2 2023, 5:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2023, 5:02 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript
cferris requested changes to this revision.Aug 2 2023, 5:13 PM

Mostly comment nits.

compiler-rt/lib/scudo/standalone/mutex.h
38

It's probably better to name this something like delay or delayLoop since it's not really doing a yield any more.

57–58

that are guarded by the mutex.

Although, I'm not sure exactly what you mean by minimum time spent among operations. Do you mean the average time spent during operations while the mutex is held?

69

times

This revision now requires changes to proceed.Aug 2 2023, 5:13 PM
Chia-hungDuan marked an inline comment as done.

Address review comment

Chia-hungDuan marked an inline comment as done.Aug 2 2023, 5:50 PM
Chia-hungDuan added inline comments.
compiler-rt/lib/scudo/standalone/mutex.h
57–58

It's the shortest time while doing alloc/free. Use it as query interval (or tryLock interval) to avoid waiting too long for the completion of an operation.

Rephrased a little bit

cferris accepted this revision.Aug 2 2023, 6:14 PM

LGTM.

This revision is now accepted and ready to land.Aug 2 2023, 6:14 PM
Chia-hungDuan planned changes to this revision.Aug 4 2023, 9:46 AM
Chia-hungDuan added inline comments.
compiler-rt/lib/scudo/standalone/mutex.h
60

There was some measurement bias here. Will update a new default value

melver added a subscriber: melver.Aug 10 2023, 12:50 AM
melver added inline comments.
compiler-rt/lib/scudo/standalone/common.h
117

What's the real objective of this change?

What's the performance improvement?

gives similar and more predicatable waiting time

The CPU pause/yield instructions are not just about the waiting time itself, but about making sure the CPU doesn't waste energy and is potentially hostile to neighbouring cores/HW-threads (such as hyperthreads). One problem here is that if the locking thread starts to spin too aggressively, it may take resources from the thread that it wants to release the mutex to make forward progress in the first place. This is a real problem when the waiting thread and the thread that needs to release the mutex are on the same core on sibling hyperthreads.

"Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. [...] An additional function of the PAUSE instruction is to reduce the power consumed by a processor while executing a spin loop."

https://www.felixcloutier.com/x86/pause.html

Update the length of busy-waiting loop

This revision is now accepted and ready to land.Aug 28 2023, 10:12 PM
Chia-hungDuan added inline comments.Aug 28 2023, 10:12 PM
compiler-rt/lib/scudo/standalone/common.h
117

I take several busy-waiting implementations from allocators (like jemalloc, tcmalloc), jvm (Android ART), libraries (libc++, abseil), .etc as references. Most of them use volatile as well. In Linux kernel, it uses pause or nop depends on versions.

The main argument behind this is that this is not a general used mutex. It's specific to Scudo's operations and most operations are expected to be pretty quick. Besides, a program is unlikely to heavily stress the user space memory allocator and therefore causes heavy contention on the Scudo's region mutex.' Those instructions may suggest context-switch which may take longer time than finishing most operations. For example, on pixel 7 pro (arm64), it shows 7~8% performance difference.

As a result, using those inline assembly seems not showing the benefit as always in our case but increase the complexity of maintaining these kind of inline assembly across different platforms (and architectures). In addition, given that the short and expected length of waiting time, volatile is more preferred in Scudo.

This revision was automatically updated to reflect the committed changes.