This is an archive of the discontinued LLVM Phabricator instance.


Differential D82031

[scudo][standalone] Release smaller blocks less often
Closed · Public

Authored by cryptoad on Jun 17 2020, 10:39 AM.


Details

Reviewers

cferris
pcc
hctim
eugenis

Commits

rG79de8f8441d1: [scudo][standalone] Release smaller blocks less often

Summary

Releasing smaller blocks is costly and only yields significant
results when there is a large percentage of free bytes for a given
size class (see numbers below).

This CL introduces a couple of additional checks for sizes lower
than 256. First we want to make sure that there are enough free bytes,
relative to the amount of allocated bytes. We are looking at 8X% to
9X% (smaller blocks require a higher percentage). We also want to make
sure there has been enough activity with the freelist to make it
worth the time, so we now check that the bytes pushed to the freelist
are at least 1/16th of the allocated bytes for those classes.
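
The two checks described above can be sketched as a standalone predicate. This is a minimal sketch, not the committed code: the function name `passesSmallBlockChecks` and the zero-allocation guard are assumptions made for illustration, while the conditions themselves follow the primary32.h and primary64.h changes in this revision. With 4096-byte pages (an assumption), `PageSize / 16U` is the 256-byte cutoff mentioned in the summary.

```cpp
#include <cstdint>

using uptr = std::uintptr_t;

// Hypothetical helper, not part of Scudo: the patch performs these checks
// inline in releaseToOSMaybe(). Returns true if a release attempt may
// proceed as far as the small-block checks are concerned.
constexpr bool passesSmallBlockChecks(uptr BlockSize, uptr PageSize,
                                      uptr BytesInFreeList, uptr BytesPushed,
                                      uptr AllocatedUser, bool Force) {
  // Only blocks smaller than 1/16th of a page get the extra checks
  // (below 256 bytes when pages are 4096 bytes).
  if (BlockSize >= PageSize / 16U)
    return true;
  // Guard added for this sketch only, to avoid dividing by zero.
  if (AllocatedUser == 0)
    return false;
  // Unless the release is forced, require enough freelist activity.
  if (!Force && BytesPushed < AllocatedUser / 16U)
    return false;
  // Require a high percentage of free bytes; smaller blocks need more.
  return (BytesInFreeList * 100U) / AllocatedUser >=
         (100U - 1U - BlockSize / 16U);
}
```

For example, under these assumptions a 32-byte class with 1,000,000 allocated bytes would need at least 97% of those bytes free, plus at least 62,500 bytes pushed since the last release (unless the release is forced), before a release is attempted.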

Additionally, we now clear batches before destroying them - not doing
so could have prevented some releases from occurring (class id 0 rarely
releases anyway).

Here are the numbers, for about 1M allocations in multiple threads:

Size: 16
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 0% released
93% freed -> 0% released
94% freed -> 0% released
95% freed -> 0% released
96% freed -> 0% released
97% freed -> 2% released
98% freed -> 7% released
99% freed -> 27% released
Size: 32
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 0% released
93% freed -> 0% released
94% freed -> 0% released
95% freed -> 1% released
96% freed -> 3% released
97% freed -> 7% released
98% freed -> 17% released
99% freed -> 41% released
Size: 48
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 0% released
93% freed -> 0% released
94% freed -> 1% released
95% freed -> 3% released
96% freed -> 7% released
97% freed -> 13% released
98% freed -> 27% released
99% freed -> 52% released
Size: 64
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 1% released
93% freed -> 2% released
94% freed -> 3% released
95% freed -> 6% released
96% freed -> 11% released
97% freed -> 20% released
98% freed -> 35% released
99% freed -> 59% released
Size: 80
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 1% released
91% freed -> 1% released
92% freed -> 2% released
93% freed -> 4% released
94% freed -> 6% released
95% freed -> 10% released
96% freed -> 17% released
97% freed -> 26% released
98% freed -> 41% released
99% freed -> 64% released
Size: 96
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 1% released
90% freed -> 1% released
91% freed -> 3% released
92% freed -> 4% released
93% freed -> 6% released
94% freed -> 10% released
95% freed -> 14% released
96% freed -> 21% released
97% freed -> 31% released
98% freed -> 47% released
99% freed -> 68% released
Size: 112
85% freed -> 0% released
86% freed -> 1% released
87% freed -> 1% released
88% freed -> 2% released
89% freed -> 3% released
90% freed -> 4% released
91% freed -> 6% released
92% freed -> 8% released
93% freed -> 11% released
94% freed -> 16% released
95% freed -> 22% released
96% freed -> 30% released
97% freed -> 40% released
98% freed -> 55% released
99% freed -> 74% released
Size: 128
85% freed -> 0% released
86% freed -> 1% released
87% freed -> 1% released
88% freed -> 2% released
89% freed -> 3% released
90% freed -> 4% released
91% freed -> 6% released
92% freed -> 8% released
93% freed -> 11% released
94% freed -> 16% released
95% freed -> 22% released
96% freed -> 30% released
97% freed -> 40% released
98% freed -> 55% released
99% freed -> 74% released
Size: 144
85% freed -> 1% released
86% freed -> 2% released
87% freed -> 3% released
88% freed -> 4% released
89% freed -> 6% released
90% freed -> 7% released
91% freed -> 10% released
92% freed -> 13% released
93% freed -> 17% released
94% freed -> 22% released
95% freed -> 28% released
96% freed -> 37% released
97% freed -> 47% released
98% freed -> 61% released
99% freed -> 78% released
Size: 160
85% freed -> 1% released
86% freed -> 2% released
87% freed -> 3% released
88% freed -> 4% released
89% freed -> 5% released
90% freed -> 7% released
91% freed -> 10% released
92% freed -> 13% released
93% freed -> 17% released
94% freed -> 22% released
95% freed -> 28% released
96% freed -> 37% released
97% freed -> 47% released
98% freed -> 61% released
99% freed -> 78% released
Size: 176
85% freed -> 2% released
86% freed -> 3% released
87% freed -> 4% released
88% freed -> 6% released
89% freed -> 7% released
90% freed -> 9% released
91% freed -> 12% released
92% freed -> 15% released
93% freed -> 20% released
94% freed -> 25% released
95% freed -> 32% released
96% freed -> 40% released
97% freed -> 51% released
98% freed -> 64% released
99% freed -> 80% released
Size: 192
85% freed -> 4% released
86% freed -> 5% released
87% freed -> 6% released
88% freed -> 8% released
89% freed -> 10% released
90% freed -> 13% released
91% freed -> 16% released
92% freed -> 20% released
93% freed -> 24% released
94% freed -> 30% released
95% freed -> 37% released
96% freed -> 45% released
97% freed -> 55% released
98% freed -> 68% released
99% freed -> 82% released
Size: 224
85% freed -> 8% released
86% freed -> 10% released
87% freed -> 12% released
88% freed -> 14% released
89% freed -> 17% released
90% freed -> 20% released
91% freed -> 23% released
92% freed -> 28% released
93% freed -> 33% released
94% freed -> 39% released
95% freed -> 46% released
96% freed -> 53% released
97% freed -> 63% released
98% freed -> 73% released
99% freed -> 85% released
Size: 240
85% freed -> 8% released
86% freed -> 10% released
87% freed -> 12% released
88% freed -> 14% released
89% freed -> 17% released
90% freed -> 20% released
91% freed -> 23% released
92% freed -> 28% released
93% freed -> 33% released
94% freed -> 39% released
95% freed -> 46% released
96% freed -> 54% released
97% freed -> 63% released
98% freed -> 73% released
99% freed -> 85% released
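
For reading the numbers above against the new check, the minimum free percentage per size class can be computed from the expression used in the primary32.h and primary64.h changes of this revision. A small sketch follows; the helper name `minFreePercent` is hypothetical.

```cpp
#include <cstdint>

using uptr = std::uintptr_t;

// Hypothetical helper for illustration: the minimum percentage of free bytes
// (relative to allocated bytes) required before a release is attempted for a
// size class below the small-block cutoff.
constexpr uptr minFreePercent(uptr BlockSize) {
  return 100U - 1U - BlockSize / 16U;
}
```

By this formula a 16-byte class needs 98% free bytes and a 224-byte class needs 85%, which in the table above correspond to 7% and 8% released respectively.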

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cryptoad created this revision. Jun 17 2020, 10:39 AM

Herald added a project: Restricted Project. Jun 17 2020, 10:39 AM

Herald added a subscriber: Restricted Project.

Harbormaster failed remote builds in B60665: Diff 271411! Jun 17 2020, 11:18 AM

Only do the BytesPushed check if not doing a forced release.

How does this change look w.r.t. a memory replay test (https://android.googlesource.com/platform/bionic/+/master/docs/native_allocator.md#memory-replay-benchmarks)?

Is it possible to get a before/after of RSS/USS and CPU? This change seems like a very reasonable strategy to me, but I feel like I'm flying blind.

In D82031#2101385, @hctim wrote:

How does this change look w.r.t. a memory replay test (https://android.googlesource.com/platform/bionic/+/master/docs/native_allocator.md#memory-replay-benchmarks)?

Is it possible to get a before/after of RSS/USS and CPU? This change seems like a very reasonable strategy to me, but I feel like I'm flying blind.

I'm planning to run perf tests on this. Unfortunately, I lost the device I was using last night, so I couldn't run the data properly. I can put up the data here when I'm done.

Harbormaster failed remote builds in B60845: Diff 271751! Jun 18 2020, 10:54 AM

In D82031#2101508, @cferris wrote:

In D82031#2101385, @hctim wrote:

How does this change look w.r.t. a memory replay test (https://android.googlesource.com/platform/bionic/+/master/docs/native_allocator.md#memory-replay-benchmarks)?

Is it possible to get a before/after of RSS/USS and CPU? This change seems like a very reasonable strategy to me, but I feel like I'm flying blind.

I'm planning to run perf tests on this. Unfortunately, I lost the device I was using last night, so I couldn't run the data properly. I can put up the data here when I'm done.

Thanks!

Fix the PackedCounter static buffer check.
We were using the count of elements instead of the size of the buffer :(
Also make the count public so that release.cpp can use it.
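
To illustrate the unit mismatch described in this comment, here is a small sketch comparing the old and new conditions. The function names are hypothetical and the `Mutex.tryLock()` part of the real condition is left out; `StaticBufferCount` and the byte-size comparison come from the release.h change in this revision.

```cpp
#include <cstdint>

using uptr = std::uintptr_t;

// Number of uptr elements in the static buffer (value from the patch).
constexpr uptr StaticBufferCount = 1024U;

// Old condition (hypothetical name): compares a size in bytes against an
// element count, so the static buffer is only used for tiny counter arrays.
constexpr bool fitsStaticBufferOld(uptr BufferSizeBytes) {
  return BufferSizeBytes <= StaticBufferCount;
}

// Fixed condition (hypothetical name): compares bytes against bytes.
constexpr bool fitsStaticBufferNew(uptr BufferSizeBytes) {
  return BufferSizeBytes <= StaticBufferCount * sizeof(uptr);
}
```

With an 8-byte `uptr` the static buffer holds 8192 bytes, so under the old condition any counter array between 1025 and 8192 bytes fell back to `map()` even though it would have fit.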

Harbormaster failed remote builds in B61308: Diff 272555! Jun 22 2020, 4:40 PM

I ran the full set of perf/memory benchmarks and nearly everything has the same memory (RSS) values. The performance is nearly the same too.

There are a couple of outliers:

For 32 bit runs of the std::map and std::unordered_map benchmarks of certain sizes (8, 16, 32, 64, 128), the performance is a little worse, but the RSS decreases dramatically. I'm not sure why this might happen though; maybe there is a bug in these benchmarks. But it does seem to track: the extra time is because the memory is actually being released.

For 64 bit, these same benchmarks are the same performance/RSS.

Also, for the memory replays on 32 bit, the time it takes is also a bit higher (worse performance) and the RSS is slightly higher in some runs, but the RSS is within normal variations. For the slight performance loss, I'm not sure why this might be happening, since I would expect the opposite, unless something else is triggering more calls to release the memory on 32 bit.

Overall, this seems to be reasonable, but I am a bit worried about the odd 32 bit performance. Is there an explanation for what could be causing the 32 bit calls to increase?

This revision is now accepted and ready to land. Jun 24 2020, 12:30 PM

In D82031#2112360, @cferris wrote:

Overall, this seems to be reasonable, but I am a bit worried about the odd 32 bit performance. Is there an explanation for what could be causing the 32 bit calls to increase?

Hmm this shouldn't be the case:

The primary changes will release the memory less often, so there shouldn't be an impact on runtime.
The packed counter array size change is a bug that would make us use the static buffer less often, so we are saving some mmap calls there.
The clear() call added before destroying a batch only matters when trying to reclaim transfer batches, which happens only on a forced release.

The change in primary32.h is what is causing the std_map and std_unordered_map RSS and speed changes. My theory is that the new change in releaseToOSMaybe cuts off most calls before any release occurs. So when there is finally a call that will actually do a release, there is a lot of memory to release. This release takes extra time, so the total runtime increases and the RSS goes way down. I think this is a good thing, and is a strong benefit of this change.

For the 32 bit memory replay runs, the run time is highly variable because the device I use seems extremely susceptible to throttling. I can't get the same numbers from run to run if I don't change anything, and I see huge swings in the numbers. Theoretically, since I run everything the same way, each run should have the same amount of throttling so the numbers would be similar, but I think these differences are not real. It's possible a similar mechanism is occurring as for the std_map tests, there are fewer calls that do a release, but each release is longer. I think variability is the more logical reason for the change.

I'm going to find a different device for the memory replay tests, but the more important number is that the RSS is the same.

I figured out what is causing the performance difference in the 32 bit std::map benchmarks. It's not the allocations that are causing the difference but the initialization of the memory returned by the allocation. In the regressed case, there is no release to the OS and the initialization of the memory takes longer. In the original case, there is some memory released, and the initialization of the memory takes less time. My guess is that the kernel in 32 bit processes only keeps so many cached pages, and by releasing some of the memory, the kernel is better at figuring out what to evict. This might also be a purely chip based regression and other chips don't exhibit the same issue if they have larger caches.

At any rate, this slight regression is not a regression in the allocation speed itself. It's possible this is also causing an issue with the memory replays in 32 bit processes, but I think that's less likely.

Therefore, I think this change is fine as is.

Closed by commit rG79de8f8441d1: [scudo][standalone] Release smaller blocks less often (authored by cryptoad). Jul 16 2020, 9:45 AM

This revision was automatically updated to reflect the committed changes.

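Revision Contents

Path                                              Size
compiler-rt/lib/scudo/standalone/local_cache.h    1 line
compiler-rt/lib/scudo/standalone/primary32.h      12 lines
compiler-rt/lib/scudo/standalone/primary64.h      16 lines
compiler-rt/lib/scudo/standalone/release.h        8 lines
compiler-rt/lib/scudo/standalone/release.cpp      2 lines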

Diff 278524

compiler-rt/lib/scudo/standalone/local_cache.h

[153 lines not shown] private:
   NOINLINE bool refill(PerClass *C, uptr ClassId) {
     initCacheMaybe(C);
     TransferBatch *B = Allocator->popBatch(this, ClassId);
     if (UNLIKELY(!B))
       return false;
     DCHECK_GT(B->getCount(), 0);
     C->Count = B->getCount();
     B->copyToArray(C->Chunks);
+    B->clear();
     destroyBatch(ClassId, B);
     return true;
   }
 
   NOINLINE void drain(PerClass *C, uptr ClassId) {
     const u32 Count = Min(C->MaxCount / 2, C->Count);
     TransferBatch *B = createBatch(ClassId, C->Chunks[0]);
     if (UNLIKELY(!B))
[13 lines not shown]

compiler-rt/lib/scudo/standalone/primary32.h

[438 lines not shown] NOINLINE uptr releaseToOSMaybe(SizeClassInfo *Sci, uptr ClassId,
     if (BytesInFreeList < PageSize)
       return 0; // No chance to release anything.
     const uptr BytesPushed =
         (Sci->Stats.PushedBlocks - Sci->ReleaseInfo.PushedBlocksAtLastRelease) *
         BlockSize;
     if (BytesPushed < PageSize)
       return 0; // Nothing new to release.
 
+    // Releasing smaller blocks is expensive, so we want to make sure that a
+    // significant amount of bytes are free, and that there has been a good
+    // amount of batches pushed to the freelist before attempting to release.
+    if (BlockSize < PageSize / 16U) {
+      if (!Force && BytesPushed < Sci->AllocatedUser / 16U)
+        return 0;
+      // We want 8x% to 9x% free bytes (the larger the bock, the lower the %).
+      if ((BytesInFreeList * 100U) / Sci->AllocatedUser <
+          (100U - 1U - BlockSize / 16U))
+        return 0;
+    }
+
     if (!Force) {
       const s32 IntervalMs = getReleaseToOsIntervalMs();
       if (IntervalMs < 0)
         return 0;
       if (Sci->ReleaseInfo.LastReleaseAtNs +
               static_cast<u64>(IntervalMs) * 1000000 >
           getMonotonicTime()) {
         return 0; // Memory was returned recently.
[51 lines not shown]

compiler-rt/lib/scudo/standalone/primary64.h

[207 lines not shown] bool useMemoryTagging() const {
     return SupportsMemoryTagging && UseMemoryTagging;
   }
   void disableMemoryTagging() { UseMemoryTagging = false; }
 
   const char *getRegionInfoArrayAddress() const {
     return reinterpret_cast<const char *>(RegionInfoArray);
   }
 
-  static uptr getRegionInfoArraySize() {
-    return sizeof(RegionInfoArray);
-  }
+  static uptr getRegionInfoArraySize() { return sizeof(RegionInfoArray); }
 
   static BlockInfo findNearestBlock(const char *RegionInfoData, uptr Ptr) {
     const RegionInfo *RegionInfoArray =
         reinterpret_cast<const RegionInfo *>(RegionInfoData);
     uptr ClassId;
     uptr MinDistance = -1UL;
     for (uptr I = 0; I != NumClasses; ++I) {
       if (I == SizeClassMap::BatchClassId)
[226 lines not shown] NOINLINE uptr releaseToOSMaybe(RegionInfo *Region, uptr ClassId,
     if (BytesInFreeList < PageSize)
       return 0; // No chance to release anything.
     const uptr BytesPushed = (Region->Stats.PushedBlocks -
                               Region->ReleaseInfo.PushedBlocksAtLastRelease) *
                              BlockSize;
     if (BytesPushed < PageSize)
       return 0; // Nothing new to release.
 
+    // Releasing smaller blocks is expensive, so we want to make sure that a
+    // significant amount of bytes are free, and that there has been a good
+    // amount of batches pushed to the freelist before attempting to release.
+    if (BlockSize < PageSize / 16U) {
+      if (!Force && BytesPushed < Region->AllocatedUser / 16U)
+        return 0;
+      // We want 8x% to 9x% free bytes (the larger the bock, the lower the %).
+      if ((BytesInFreeList * 100U) / Region->AllocatedUser <
+          (100U - 1U - BlockSize / 16U))
+        return 0;
+    }
+
     if (!Force) {
       const s32 IntervalMs = getReleaseToOsIntervalMs();
       if (IntervalMs < 0)
         return 0;
       if (Region->ReleaseInfo.LastReleaseAtNs +
               static_cast<u64>(IntervalMs) * 1000000 >
           getMonotonicTime()) {
         return 0; // Memory was returned recently.
[21 lines not shown]

compiler-rt/lib/scudo/standalone/release.h

[63 lines not shown] PackedCounterArray(uptr NumCounters, uptr MaxValue) : N(NumCounters) {
     const uptr PackingRatio = MaxCounterBits >> CounterSizeBitsLog;
     CHECK_GT(PackingRatio, 0);
     PackingRatioLog = getLog2(PackingRatio);
     BitOffsetMask = PackingRatio - 1;
 
     BufferSize = (roundUpTo(N, static_cast<uptr>(1U) << PackingRatioLog) >>
                   PackingRatioLog) *
                  sizeof(*Buffer);
-    if (BufferSize <= StaticBufferSize && Mutex.tryLock()) {
+    if (BufferSize <= (StaticBufferCount * sizeof(Buffer[0])) &&
+        Mutex.tryLock()) {
       Buffer = &StaticBuffer[0];
       memset(Buffer, 0, BufferSize);
     } else {
       Buffer = reinterpret_cast<uptr *>(
           map(nullptr, BufferSize, "scudo:counters", MAP_ALLOWNOMEM));
     }
   }
   ~PackedCounterArray() {
[28 lines not shown] void incRange(uptr From, uptr To) const {
     DCHECK_LE(From, To);
     const uptr Top = Min(To + 1, N);
     for (uptr I = From; I < Top; I++)
       inc(I);
   }
 
   uptr getBufferSize() const { return BufferSize; }
 
+  static const uptr StaticBufferCount = 1024U;
+
 private:
   const uptr N;
   uptr CounterSizeBitsLog;
   uptr CounterMask;
   uptr PackingRatioLog;
   uptr BitOffsetMask;
 
   uptr BufferSize;
   uptr *Buffer;
 
   static HybridMutex Mutex;
-  static const uptr StaticBufferSize = 1024U;
-  static uptr StaticBuffer[StaticBufferSize];
+  static uptr StaticBuffer[StaticBufferCount];
 };
 
 template <class ReleaseRecorderT> class FreePagesRangeTracker {
 public:
   explicit FreePagesRangeTracker(ReleaseRecorderT *Recorder)
       : Recorder(Recorder), PageSizeLog(getLog2(getPageSizeCached())) {}
 
   void processNextPage(bool Freed) {
[158 lines not shown]

compiler-rt/lib/scudo/standalone/release.cpp

 //===-- release.cpp ---------------------------------------------- C++ --===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 
 #include "release.h"
 
 namespace scudo {
 
 HybridMutex PackedCounterArray::Mutex = {};
-uptr PackedCounterArray::StaticBuffer[1024];
+uptr PackedCounterArray::StaticBuffer[PackedCounterArray::StaticBufferCount];
 
 } // namespace scudo