This is an archive of the discontinued LLVM Phabricator instance.

[scudo][standalone] Release smaller blocks less often
ClosedPublic

Authored by cryptoad on Jun 17 2020, 10:39 AM.

Details

Summary

Releasing smaller blocks is costly and only yields significant
results when there is a large percentage of free bytes for a given
size class (see numbers below).

This CL introduces a couple of additional checks for sizes smaller
than 256. First, we want to make sure that there are enough free
bytes relative to the amount of allocated bytes; we are looking at
8X% to 9X% (smaller blocks require a higher percentage). We also
want to make sure there has been enough activity on the freelist to
make the release worth the time, so we now check that the number of
bytes pushed to the freelist is at least 1/16th of the allocated
bytes for those classes.

Additionally, we now clear batches before destroying them - this
could have prevented some releases from occurring (class id 0 rarely
releases anyway).
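
To make the heuristic concrete, here is a minimal sketch of the shape of
those checks; the function name, types and exact thresholds are illustrative
assumptions, not the code from the patch:

```cpp
// Illustrative sketch of the additional gating for small size classes.
// The real Scudo code differs in detail; this only shows the heuristic
// described above.
#include <cstdint>
typedef uintptr_t uptr;

bool worthReleasingSmallBlocks(uptr BlockSize, uptr AllocatedBytes,
                               uptr FreeBytes, uptr BytesPushed) {
  if (BlockSize >= 256)
    return true; // Larger classes keep the previous behavior.
  // Check 1: a large fraction of the class must be free, with smaller
  // blocks requiring a higher percentage (roughly 8X% to 9X%).
  const uptr MinFreePercent =
      85 + 10 * (256 - BlockSize) / (256 - 16); // ~95% at 16, ~85% near 256
  if (FreeBytes * 100 < MinFreePercent * AllocatedBytes)
    return false;
  // Check 2: enough freelist activity to make the scan worthwhile - at
  // least 1/16th of the allocated bytes pushed back to the freelist.
  if (BytesPushed < AllocatedBytes / 16)
    return false;
  return true;
}
```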

Here are the numbers, for about 1M allocations in multiple threads:

Size: 16
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 0% released
93% freed -> 0% released
94% freed -> 0% released
95% freed -> 0% released
96% freed -> 0% released
97% freed -> 2% released
98% freed -> 7% released
99% freed -> 27% released
Size: 32
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 0% released
93% freed -> 0% released
94% freed -> 0% released
95% freed -> 1% released
96% freed -> 3% released
97% freed -> 7% released
98% freed -> 17% released
99% freed -> 41% released
Size: 48
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 0% released
93% freed -> 0% released
94% freed -> 1% released
95% freed -> 3% released
96% freed -> 7% released
97% freed -> 13% released
98% freed -> 27% released
99% freed -> 52% released
Size: 64
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 0% released
91% freed -> 0% released
92% freed -> 1% released
93% freed -> 2% released
94% freed -> 3% released
95% freed -> 6% released
96% freed -> 11% released
97% freed -> 20% released
98% freed -> 35% released
99% freed -> 59% released
Size: 80
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 0% released
90% freed -> 1% released
91% freed -> 1% released
92% freed -> 2% released
93% freed -> 4% released
94% freed -> 6% released
95% freed -> 10% released
96% freed -> 17% released
97% freed -> 26% released
98% freed -> 41% released
99% freed -> 64% released
Size: 96
85% freed -> 0% released
86% freed -> 0% released
87% freed -> 0% released
88% freed -> 0% released
89% freed -> 1% released
90% freed -> 1% released
91% freed -> 3% released
92% freed -> 4% released
93% freed -> 6% released
94% freed -> 10% released
95% freed -> 14% released
96% freed -> 21% released
97% freed -> 31% released
98% freed -> 47% released
99% freed -> 68% released
Size: 112
85% freed -> 0% released
86% freed -> 1% released
87% freed -> 1% released
88% freed -> 2% released
89% freed -> 3% released
90% freed -> 4% released
91% freed -> 6% released
92% freed -> 8% released
93% freed -> 11% released
94% freed -> 16% released
95% freed -> 22% released
96% freed -> 30% released
97% freed -> 40% released
98% freed -> 55% released
99% freed -> 74% released
Size: 128
85% freed -> 0% released
86% freed -> 1% released
87% freed -> 1% released
88% freed -> 2% released
89% freed -> 3% released
90% freed -> 4% released
91% freed -> 6% released
92% freed -> 8% released
93% freed -> 11% released
94% freed -> 16% released
95% freed -> 22% released
96% freed -> 30% released
97% freed -> 40% released
98% freed -> 55% released
99% freed -> 74% released
Size: 144
85% freed -> 1% released
86% freed -> 2% released
87% freed -> 3% released
88% freed -> 4% released
89% freed -> 6% released
90% freed -> 7% released
91% freed -> 10% released
92% freed -> 13% released
93% freed -> 17% released
94% freed -> 22% released
95% freed -> 28% released
96% freed -> 37% released
97% freed -> 47% released
98% freed -> 61% released
99% freed -> 78% released
Size: 160
85% freed -> 1% released
86% freed -> 2% released
87% freed -> 3% released
88% freed -> 4% released
89% freed -> 5% released
90% freed -> 7% released
91% freed -> 10% released
92% freed -> 13% released
93% freed -> 17% released
94% freed -> 22% released
95% freed -> 28% released
96% freed -> 37% released
97% freed -> 47% released
98% freed -> 61% released
99% freed -> 78% released
Size: 176
85% freed -> 2% released
86% freed -> 3% released
87% freed -> 4% released
88% freed -> 6% released
89% freed -> 7% released
90% freed -> 9% released
91% freed -> 12% released
92% freed -> 15% released
93% freed -> 20% released
94% freed -> 25% released
95% freed -> 32% released
96% freed -> 40% released
97% freed -> 51% released
98% freed -> 64% released
99% freed -> 80% released
Size: 192
85% freed -> 4% released
86% freed -> 5% released
87% freed -> 6% released
88% freed -> 8% released
89% freed -> 10% released
90% freed -> 13% released
91% freed -> 16% released
92% freed -> 20% released
93% freed -> 24% released
94% freed -> 30% released
95% freed -> 37% released
96% freed -> 45% released
97% freed -> 55% released
98% freed -> 68% released
99% freed -> 82% released
Size: 224
85% freed -> 8% released
86% freed -> 10% released
87% freed -> 12% released
88% freed -> 14% released
89% freed -> 17% released
90% freed -> 20% released
91% freed -> 23% released
92% freed -> 28% released
93% freed -> 33% released
94% freed -> 39% released
95% freed -> 46% released
96% freed -> 53% released
97% freed -> 63% released
98% freed -> 73% released
99% freed -> 85% released
Size: 240
85% freed -> 8% released
86% freed -> 10% released
87% freed -> 12% released
88% freed -> 14% released
89% freed -> 17% released
90% freed -> 20% released
91% freed -> 23% released
92% freed -> 28% released
93% freed -> 33% released
94% freed -> 39% released
95% freed -> 46% released
96% freed -> 54% released
97% freed -> 63% released
98% freed -> 73% released
99% freed -> 85% released

Diff Detail

Event Timeline

cryptoad created this revision.Jun 17 2020, 10:39 AM
Herald added a project: Restricted Project. Jun 17 2020, 10:39 AM
Herald added a subscriber: Restricted Project.
cryptoad updated this revision to Diff 271751.Jun 18 2020, 9:33 AM

Only do the BytesPushed check if not doing a forced release.
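
In other words, something along these lines (a sketch only; the Force flag
and function name are assumptions for illustration, not the patch's exact
wording):

```cpp
// Sketch: the freelist-activity check is bypassed when the release is
// forced (e.g. an explicit purge), so a forced release is never deferred
// for lack of recent freelist churn.
#include <cstdint>
typedef uintptr_t uptr;

bool passesBytesPushedCheck(bool Force, uptr BytesPushed, uptr AllocatedBytes) {
  if (Force)
    return true; // Forced releases always proceed to the page scan.
  return BytesPushed >= AllocatedBytes / 16;
}
```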

hctim added a comment.Jun 18 2020, 9:56 AM

How does this change look w.r.t. a memory replay test (https://android.googlesource.com/platform/bionic/+/master/docs/native_allocator.md#memory-replay-benchmarks)?

Is it possible to get a before/after of RSS/USS and CPU? This change seems like a very reasonable strategy to me, but I feel like I'm flying blind.

> How does this change look w.r.t. a memory replay test (https://android.googlesource.com/platform/bionic/+/master/docs/native_allocator.md#memory-replay-benchmarks)?
>
> Is it possible to get a before/after of RSS/USS and CPU? This change seems like a very reasonable strategy to me, but I feel like I'm flying blind.

I'm planning to run perf tests on this. Unfortunately, I lost the device I was using last night, so I couldn't run the data properly. I can put up the data here when I'm done.

>> How does this change look w.r.t. a memory replay test (https://android.googlesource.com/platform/bionic/+/master/docs/native_allocator.md#memory-replay-benchmarks)?
>>
>> Is it possible to get a before/after of RSS/USS and CPU? This change seems like a very reasonable strategy to me, but I feel like I'm flying blind.
>
> I'm planning to run perf tests on this. Unfortunately, I lost the device I was using last night, so I couldn't run the data properly. I can put up the data here when I'm done.

Thanks!

cryptoad updated this revision to Diff 272555.Jun 22 2020, 3:15 PM

Fix the PackedCounter static buffer check.
We were using the count of elements instead of the size of the buffer :(
Also make the count public so that release.cpp can use it.
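
The kind of mistake being fixed looks roughly like this (an illustrative
reconstruction; the names and sizes are assumptions and the real
PackedCounterArray in release.h differs in detail):

```cpp
// Illustrative reconstruction of the fixed check, not the actual code.
#include <cstdint>
#include <cstring>
typedef uintptr_t uptr;

struct PackedCounterArraySketch {
  // Made public so other files (e.g. release.cpp) can refer to the count.
  static const uptr StaticBufferCount = 1024U;
  uptr StaticBuffer[StaticBufferCount];
  uptr *Buffer;

  void initBuffer(uptr BufferSizeInBytes) {
    // Buggy check: compared a size in bytes against a count of elements,
    //   if (BufferSizeInBytes <= StaticBufferCount)
    // which made the static buffer path almost never taken.
    // Fixed check: compare against the buffer's size in bytes.
    if (BufferSizeInBytes <= sizeof(StaticBuffer)) {
      Buffer = StaticBuffer;
      memset(Buffer, 0, BufferSizeInBytes);
    } else {
      Buffer = nullptr; // the real code falls back to a mapped buffer here
    }
  }
};
```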

cferris accepted this revision.Jun 24 2020, 12:30 PM

I ran the full set of perf/memory benchmarks and nearly everything has the same memory (RSS) values. The performance is nearly the same too.

There are a couple of outliers:

For 32-bit runs of the std::map and std::unordered_map benchmarks at certain sizes (8, 16, 32, 64, 128), the performance is a little worse, but the RSS decreases dramatically. I'm not sure why this happens; maybe there is a bug in these benchmarks. But it does seem to track: the extra time is spent actually releasing the memory.

For 64-bit, these same benchmarks show the same performance/RSS.

Also, for the memory replays on 32-bit, the run time is a bit higher (worse performance) and the RSS is slightly higher in some runs, but the RSS differences are within normal variation. I'm not sure why there is a slight performance loss, since I would expect the opposite, unless something else is triggering more calls to release memory on 32-bit.

Overall, this seems to be reasonable, but I am a bit worried about the odd 32-bit performance. Is there an explanation for what could be causing the 32-bit calls to increase?

This revision is now accepted and ready to land.Jun 24 2020, 12:30 PM

> Overall, this seems to be reasonable, but I am a bit worried about the odd 32-bit performance. Is there an explanation for what could be causing the 32-bit calls to increase?

Hmm, this shouldn't be the case:

  • The primary changes will release the memory less often, so there shouldn't be an impact on runtime.
  • The packed counter array size change fixes a bug that made us use the static buffer less often than intended, so we are now saving some mmap calls there.
  • The clear() addition before destroying batches only matters when trying to reclaim transfer batches, which happens only on a forced release.

The change in primary32.h is what is causing the std_map and std_unordered_map RSS and speed changes. My theory is that the new check in releaseToOSMaybe cuts off most calls before any release occurs, so when a call finally does do a release, there is a lot of memory to release. That release takes extra time, so the total runtime increases and the RSS goes way down. I think this is a good thing, and is a strong benefit of this change.

For the 32-bit memory replay runs, the run time is highly variable because the device I use seems extremely susceptible to throttling. I can't get the same numbers from run to run even if I don't change anything, and I see huge swings in the numbers. Theoretically, since I run everything the same way, each run should have the same amount of throttling and the numbers should be similar, but I think these differences are not real. It's possible a similar mechanism is at work as in the std_map tests: there are fewer calls that do a release, but each release takes longer. Still, I think variability is the more likely explanation for the change.

I'm going to find a different device for the memory replay tests, but the more important number is that the RSS is the same.

I figured out what is causing the performance difference in the 32-bit std::map benchmarks. It's not the allocations themselves that cause the difference, but the initialization of the memory returned by the allocation. In the regressed case, there are no releases to the OS and initializing the memory takes longer. In the original case, some memory is released and initializing the memory takes less time. My guess is that the kernel only keeps so many cached pages for 32-bit processes, and by releasing some of the memory, the kernel is better at figuring out what to evict. This might also be a purely chip-based regression, and other chips with larger caches might not exhibit the same issue.

At any rate, this slight regression is not a regression in the allocation speed itself. It's possible this is also causing an issue with the memory replays in 32-bit processes, but I think that's less likely.

Therefore, I think this change is fine as is.

This revision was automatically updated to reflect the committed changes.