This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Improve the scalability of the shared TSD model
ClosedPublic

Authored by cryptoad on May 23 2018, 3:05 PM.

Details

Summary

The shared TSD model in its current form doesn't scale. Here is an example of
rpc2-benchmark (with default parameters, which is threading-heavy) on a 72-core
machine (defaulting to a CompactSizeClassMap and no Quarantine):

  • with tcmalloc: 337K reqs/sec, peak RSS of 338MB;
  • with scudo (exclusive): 321K reqs/sec, peak RSS of 637MB;
  • with scudo (shared): 241K reqs/sec, peak RSS of 324MB.

This isn't great, since the exclusive model uses a lot of memory, while the
shared model doesn't even come close to being competitive.

This is mostly because we always scan the TSD pool starting at index 0 for an
available TSD, which can result in a lot of failed lock attempts and in touching
memory that need not be touched.

This CL attempts to make things better in most situations:

  • first, use a thread-local variable on Linux (instead of pthread APIs) to store the current TSD in the shared model;
  • move the locking boolean out of the TSD: this allows the compiler to use a register and potentially optimize out a branch instead of reading it from the TSD every time (we also save a tiny bit of memory per TSD);
  • 64-bit atomic operations on 32-bit ARM platforms happen to be expensive, so store the Precedence in a uptr instead of a u64. We lose some nanoseconds of precision and we'll wrap around at some point, but the benefit is worth it;
  • change a CHECK to a DCHECK: this should never happen, but if something is ever terribly wrong, we'll crash on a near null AV if the TSD happens to be null;
  • based on an idea by dvyukov@, we are implementing a bounded random scan for an available TSD (see the sketch after this list). This requires computing the coprimes for the number of TSDs, and attempting to lock up to 4 TSDs in a random order before falling back to the current one. This is obviously slightly more expensive when we have just 2 TSDs (barely noticeable) but is otherwise beneficial. The Precedence still basically corresponds to the moment of the first contention on a TSD. To seed the random choice, we use the precedence of the current TSD since it is very likely to be non-zero (as we are in the slow path after a failed tryLock).
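
For illustration, here is a heavily simplified, self-contained sketch of this kind of bounded random scan. This is not the code from the CL: the TSD structure, the pool size, the PRNG, and the helper names are all made up for the example; scudo's actual TSD holds an allocator cache and uses its own primitives.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical TSD: a mutex guarding a per-thread cache, plus a uptr-sized
// Precedence recording the moment of the first failed tryLock on this TSD.
struct TSD {
  std::mutex Mutex;
  std::atomic<uintptr_t> Precedence{0};

  static uintptr_t monotonicTime() {
    return static_cast<uintptr_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
  }
  bool tryLock() {
    if (Mutex.try_lock())
      return true;
    if (Precedence.load(std::memory_order_relaxed) == 0)
      Precedence.store(monotonicTime(), std::memory_order_relaxed);
    return false;
  }
  void lock() {
    Mutex.lock();
    Precedence.store(0, std::memory_order_relaxed);
  }
};

constexpr uint32_t NumTSDs = 32;  // pool size, chosen arbitrarily for the example
TSD TSDs[NumTSDs];
uint32_t CoPrimes[NumTSDs];       // numbers coprime with NumTSDs, filled at init
uint32_t NumCoPrimes = 0;

// On Linux, the current TSD is cached in an ELF TLS variable rather than
// retrieved through pthread_getspecific.
thread_local TSD *CurrentTSD = nullptr;

void initCoPrimes() {             // must run once before any slow-path call
  for (uint32_t I = 1; I <= NumTSDs; I++) {
    uint32_t A = I, B = NumTSDs;
    while (B) { const uint32_t T = A % B; A = B; B = T; }
    if (A == 1)                   // gcd(I, NumTSDs) == 1
      CoPrimes[NumCoPrimes++] = I;
  }
}

uint32_t rand32(uint32_t *State) {  // stand-in xorshift PRNG
  *State ^= *State << 13; *State ^= *State >> 17; *State ^= *State << 5;
  return *State;
}

// Slow path, entered after tryLock on CurrentTSD failed on the fast path.
TSD *getTSDAndLockSlow() {
  // Seed with the current TSD's Precedence: since we just failed a tryLock,
  // it is very likely to be non-zero already.
  uint32_t State = static_cast<uint32_t>(
      CurrentTSD->Precedence.load(std::memory_order_relaxed));
  if (State == 0) State = 1;
  const uint32_t R = rand32(&State);
  uint32_t Index = R % NumTSDs;                    // random starting point...
  const uint32_t Inc = CoPrimes[R % NumCoPrimes];  // ...and a coprime stride
  // Try up to 4 TSDs in pseudo-random order; the coprime stride guarantees
  // successive indices visit distinct slots before wrapping around.
  for (uint32_t I = 0; I < 4; I++) {
    TSD *Candidate = &TSDs[Index];
    if (Candidate->tryLock()) {
      CurrentTSD = Candidate;
      return Candidate;
    }
    Index = (Index + Inc) % NumTSDs;
  }
  // All attempts failed: block on the current TSD.
  CurrentTSD->lock();
  return CurrentTSD;
}
```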

With those modifications, the benchmark yields:

  • with scudo (shared): 330K reqs/sec, peak RSS of 327MB.

So the shared model for this specific situation not only becomes competitive but
outperforms the exclusive model. I experimented with some values greater than 4
for the number of TSDs to attempt to lock and it yielded a decrease in QPS. Just
sticking with the current TSD is also a tad slower. Numbers on platforms with
fewer cores (eg: Android) remain similar.

Diff Detail

Event Timeline

cryptoad created this revision.May 23 2018, 3:05 PM
alekseyshl accepted this revision.May 25 2018, 3:26 PM
This revision is now accepted and ready to land.May 25 2018, 3:26 PM
dvyukov added inline comments.May 26 2018, 1:28 AM
lib/scudo/scudo_tsd_shared.cpp
70–83

The comment from the change description as to why we use precedence as rand state should be duplicated as a comment here. This is confusing.

72

Do you see getTSDAndLockSlow in profiles? Here we have 2 divisions. Divisions are expensive. We could replace them with multiplication+shift if they show up in profiles.
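
(For reference, the multiplication+shift trick mentioned here can look like the following. This is a generic sketch, not code from the CL, for a divisor that is only known at init time, such as the number of TSDs; for compile-time-constant divisors the compiler already performs this strength reduction on its own.)

```cpp
#include <cstdint>

// Precompute once at init, when the divisor N (e.g. the number of TSDs) is known.
struct FastMod {
  uint32_t N;
  uint64_t M;  // floor(2^32 / N)
  explicit FastMod(uint32_t Divisor)
      : N(Divisor), M((uint64_t(1) << 32) / Divisor) {}
  uint32_t mod(uint32_t X) const {
    // Estimate the quotient with a multiply and a shift instead of a divide.
    const uint32_t Q = static_cast<uint32_t>((uint64_t(X) * M) >> 32);
    uint32_t R = X - Q * N;
    // The estimate can be low by one, so a single correction is enough.
    if (R >= N)
      R -= N;
    return R;
  }
};
```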

77

Have you tried dropping precedence entirely and then just going over all TSDs pseudo-randomly once (or maybe twice) and tryLock'ing them? If all tryLock's fail then we could lock the old one, or maybe a random one.
That would shave a few cycles from the malloc/free fast paths. The benefit of the precedence scheme is still unproven, right?
Also, if we have a hundred cores/caches, then giving up after just 4 can lead to unnecessary blocking (there can still be dozens of available caches).

cryptoad planned changes to this revision.May 28 2018, 6:11 PM

I will experiment with Dmitry's suggestions and report back with results.

Here are some answers to Dmitry's requests:

  • Regarding getTSDAndLockSlow and the division: pprof shows no significant time spent in the function outside of the tryLock & lock so I think we are good here;
  • Regarding the precedence: I tested a version where I dropped it entirely, results are mixed:
    • For Android's "improved" memory_replay: it is faster in all cases, but we only have 2 caches for that specific platform (due to memory constraints compared to the default allocator);
    • For rpc2-benchmark: mostly similar numbers;
    • For t-test1: the version with precedence shows better performance in almost all situations; this benchmark also demonstrates a slowdown as the number of TSDs scanned in the slow path grows, eg: scanning 4 and slow locking if they all failed to tryLock performs better overall than scanning 32. And this can be a significant slowdown, for example with t-test1 800 40 800000 100000, it's 900s spent in allocation functions vs 1150s. The argument here is that this benchmark only does {de}allocations (& memset) and as such isn't very representative of "real" programs, but it's exercising the most contention on the caches.

I can't seem to get a definitive answer overall, as both the with-precedence and without-precedence versions have win & lose situations.
The only sure thing so far is that both are better than the current version.

I am open to suggestions or potential improvements; otherwise I'd keep the current version of the CL (and will address the review comments).

Exactly what versions did you test? Current vs removing precedence and tryLocking all caches? If so, it's unclear which factor affects performance, as we change 2 things at the same time. So I would also try:

  1. Remove precedence, but scan only 4 random caches, then lock the current cache.
  2. Remove precedence, but scan only 4 random caches, then lock a random cache.
  3. Keep precedence but scan all caches.

I would not expect 3 to be faster, but who knows.
Looking at your results I think it mostly needs to be tested on t-test1, because it should not affect android with 2/4 caches.

Here are more detailed numbers for t-test1.

The machine has 72 cores. We are using the shared TSD version with 32 caches (to exercise some contention).
The numbers are the total time (averaged and rounded over 3 consecutive runs) spent in allocation functions only with 40, then 80 concurrent threads:

  • current upstream: 960s, 3315s
  • with precedence, max 4 caches scanned, lock current: 810s, 3200s (current CL proposed)
  • with precedence, max 4 caches scanned, lock random: 815s, 3125s
  • with precedence, all caches scanned, lock current: 880s, 3940s
  • with precedence, all caches scanned, lock random: 890s, 3755s
  • no precedence, max 4 caches scanned, lock current: 900s, 3365s
  • no precedence, max 4 caches scanned, lock random: 840s, 3300s
  • no precedence, all caches scanned, lock current: 1025s, 3600s
  • no precedence, all caches scanned, lock random: 890s, 3785s

Locking a random cache seems to be beneficial under heavier contention, but not necessarily under lighter contention.
Since I am more interested in striking a middle ground than in targeting heavily contended applications, it looks like the precedence matters, as does not scanning all the caches but limiting ourselves to 4.

dvyukov accepted this revision.Jun 7 2018, 10:47 PM

So your current version is actually the best one.
Thanks for bearing with me.

cryptoad updated this revision to Diff 150521.Jun 8 2018, 8:22 AM
cryptoad marked 3 inline comments as done.

Adding a comment to explain why we use the precedence as the random seed.
Pass the current TSD to getTSDAndLockSlow to avoid an extra getCurrentTSD
call.

This revision is now accepted and ready to land.Jun 8 2018, 8:22 AM
cryptoad updated this revision to Diff 150541.Jun 8 2018, 10:48 AM

Remove the continue clause from the loop when the TSD is the current TSD.
This is greatly beneficial to scenarios (Android for example) where we only
have a small number of TSDs. This doesn't impact larger pools of TSDs.

cryptoad requested review of this revision.Jun 8 2018, 10:50 AM

Thanks for the review Aleksey & Dmitry.
If I could get a final LGTM after the last couple of changes that would be great.

This revision is now accepted and ready to land.Jun 8 2018, 12:23 PM
This revision was automatically updated to reflect the committed changes.