This is an archive of the discontinued LLVM Phabricator instance.

scudo: Support memory tagging in the secondary allocator.
ClosedPublic

Authored by pcc on Dec 22 2020, 1:49 PM.

Details

Summary

This patch enhances the secondary allocator to be able to detect
use-after-free, buffer overflow and (on hardware supporting
memory tagging) buffer underflow.

Use-after-free detection is implemented by setting memory page
protection to PROT_NONE on free. Because this must be done immediately
rather than after the memory has been quarantined, we no longer use the
combined allocator quarantine for secondary allocations. Instead, a
quarantine has been added to the secondary allocator cache.

Buffer overflow detection is implemented by aligning the allocation
to the right of the writable pages, so that any overflows will
spill into the guard page to the right of the allocation, which
will have PROT_NONE page protection. Because this would require the
secondary allocator to produce a header at the correct position,
the responsibility for ensuring chunk alignment has been moved to
the secondary allocator.

Buffer underflow detection has been implemented on hardware supporting
memory tagging by tagging the memory region between the start of the
mapping and the start of the allocation with a non-zero tag. Due to
the cost of pre-tagging secondary allocations, the allocation itself
uses a tag of 0.

Diff Detail

Event Timeline

pcc created this revision.Dec 22 2020, 1:49 PM
pcc requested review of this revision.Dec 22 2020, 1:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2020, 1:49 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript
eugenis added inline comments.Dec 22 2020, 4:00 PM
compiler-rt/lib/scudo/standalone/combined.h
260

Do you think it will be too expensive to make the tag random and use LDG to read it in the allocator?

1080

Rename the argument to "headerTaggedPtr" or something like that

compiler-rt/lib/scudo/standalone/secondary.h
133

Do you expect this to be faster than the other branch? If we are doing an mm syscall anyway, why not release the memory immediately?

520

is this an unrelated bugfix?

530

I don't know if it is worth the cost of the extra syscall, but we only need MAP_MEMTAG on the first page of the mapping.

pcc added inline comments.Dec 22 2020, 4:34 PM
compiler-rt/lib/scudo/standalone/combined.h
260

I didn't see a huge advantage in making it random. I think we need to be careful about loads which follow the pattern of LDG followed by a load as that effectively makes MTE ineffective for those loads. It seemed better to use a fixed tag since that provides at least a little more protection than none at all.

1080

Will do.

compiler-rt/lib/scudo/standalone/secondary.h
133

The advantage of not releasing the memory is that we don't have to pay the cost of faulting the memory once it is made writable again. (Or at least that's the theory. It turns out that the Linux kernel currently faults the accesses anyway, and I have a patch to get rid of the faults.) With that patch in place I did continue to measure a significant performance improvement for decay=1 versus decay=0.

520

No, this is a consequence of the fact that we are now passing the original size into this function rather than the size padded for alignment.

530

It's the first four pages I think, because the cache may end up reusing an allocation and throwing away the first four pages. I'm inclined to wait for data before trying to optimize this.

That being said, one simple thing that we can do here is to make the flag conditional on whether memory tagging is enabled, similar to what we're already doing in the primary allocator. I'll do that.

pcc updated this revision to Diff 313591.Dec 23 2020, 12:11 PM
  • Set MAP_MEMTAG only if tagging enabled
eugenis accepted this revision.Jan 5 2021, 4:09 PM

LGTM

This revision is now accepted and ready to land.Jan 5 2021, 4:09 PM

Is this still WIP like the title says?

pcc added a comment.Jan 6 2021, 1:07 PM

It's ready to go in if you don't mind me breaking Fuchsia. Otherwise I can wait for a Fuchsia implementation of setMemoryPermission() from you.

In D93731#2482877, @pcc wrote:

It's ready to go in if you don't mind me breaking Fuchsia. Otherwise I can wait for a Fuchsia implementation of setMemoryPermission() from you.

Can we double check with cferris@ about any potential perf impact? I am going to have a look at the Fuchsia part.

In D93731#2482877, @pcc wrote:

It's ready to go in if you don't mind me breaking Fuchsia. Otherwise I can wait for a Fuchsia implementation of setMemoryPermission() from you.

Trying the following:

void setMemoryPermission(uptr Addr, uptr Size, uptr Flags,
                         MapPlatformData *Data) {
  const zx_vm_option_t Prot =
      (Flags & MAP_NOACCESS) ? 0 : (ZX_VM_PERM_READ | ZX_VM_PERM_WRITE);
  DCHECK(Data);
  DCHECK_NE(Data->Vmar, ZX_HANDLE_INVALID);
  if (_zx_vmar_protect(Data->Vmar, Prot, Addr, Size) != ZX_OK)
    dieOnMapUnmapError();
}

There seems to be something wrong on Fuchsia with it, regular processes are crashing. Looking into this.

cryptoad added inline comments.Jan 7 2021, 8:54 AM
compiler-rt/lib/scudo/standalone/secondary.h
60

The Fuchsia error is here: Fuchsia unmap tries to write to &H->Data post unmapping, so it results in an access violation.
The solution is to use a local Data variable that would store the header's Data.

pcc updated this revision to Diff 315297.Jan 7 2021, 9:43 PM
pcc retitled this revision from [wip] scudo: Support memory tagging in the secondary allocator. to scudo: Support memory tagging in the secondary allocator..
pcc edited the summary of this revision. (Show Details)

Fixes for Fuchsia

pcc marked an inline comment as done.Jan 7 2021, 9:43 PM
cryptoad added inline comments.Jan 8 2021, 7:59 AM
compiler-rt/lib/scudo/standalone/fuchsia.cpp
140

So this ended up not working for me, because the protection must have at least 1 flag (https://fuchsia.dev/fuchsia-src/reference/syscalls/vmar_protect#description)
Did it work for you?

pcc added inline comments.Jan 8 2021, 11:15 AM
compiler-rt/lib/scudo/standalone/fuchsia.cpp
140

It seems to work for me. I don't think that documentation is accurate. I verified that this function is being called like so:

diff --git a/compiler-rt/lib/scudo/standalone/fuchsia.cpp b/compiler-rt/lib/scudo/standalone/fuchsia.cpp
index bf063e11bb3d..5e90373a7fc3 100644
--- a/compiler-rt/lib/scudo/standalone/fuchsia.cpp
+++ b/compiler-rt/lib/scudo/standalone/fuchsia.cpp
@@ -137,6 +137,8 @@ void unmap(void *Addr, uptr Size, uptr Flags, MapPlatformData *Data) {
 
 void setMemoryPermission(UNUSED uptr Addr, UNUSED uptr Size, UNUSED uptr Flags,
                          UNUSED MapPlatformData *Data) {
+  if (Flags & MAP_NOACCESS)
+    dieOnMapUnmapError();
   const zx_vm_option_t Prot =
       (Flags & MAP_NOACCESS) ? 0 : (ZX_VM_PERM_READ | ZX_VM_PERM_WRITE);
   DCHECK(Data);

With that I observed a crash in the unit tests (I don't think this code is ever actually called in production on Fuchsia because the Fuchsia allocator does not enable the cache).

I also made this change to verify that the permissions were being applied properly:

diff --git a/compiler-rt/lib/scudo/standalone/fuchsia.cpp b/compiler-rt/lib/scudo/standalone/fuchsia.cpp
index bf063e11bb3d..0d4f8d9bcddd 100644
--- a/compiler-rt/lib/scudo/standalone/fuchsia.cpp
+++ b/compiler-rt/lib/scudo/standalone/fuchsia.cpp
@@ -143,6 +143,10 @@ void setMemoryPermission(UNUSED uptr Addr, UNUSED uptr Size, UNUSED uptr Flags,
   DCHECK_NE(Data->Vmar, ZX_HANDLE_INVALID);
   if (_zx_vmar_protect(Data->Vmar, Prot, Addr, Size) != ZX_OK)
     dieOnMapUnmapError();
+  if (Flags & MAP_NOACCESS) {
+    volatile char C = *(char *)Addr;
+    (void)C;
+  }
 }
 
 void releasePagesToOS(UNUSED uptr BaseAddress, uptr Offset, uptr Size,

And again I observed a crash.

cryptoad accepted this revision.Jan 8 2021, 1:08 PM

It all seems to be working.
I'll have to work on the EXPECT_DEATH aspect as a follow up.

pcc updated this revision to Diff 316296.Jan 12 2021, 6:23 PM
  • Disable memory tagging in the test allocator if disabled via prctl()
cferris requested changes to this revision.Jan 22 2021, 1:46 PM
cferris added a subscriber: cferris.

Sorry it took a long time to get back to this. Unfortunately, not only does this cause relatively large performance regressions, it also causes test failures. The test failures are related to malloc_iterate, but I haven't looked too closely at them yet. Is it possible that something needs to be updated in the code related to tracking the total allocated memory?

Benchmarks show as high as 10%, but many closer to 5% regression in other benchmarks.

As you mentioned, there are some large size calloc/mallocs that are 10x slower, for the 128KB sizes. These are all of the microbenchmarks, but even the trace benchmarks show large performance issues. I see 15% to 20% regressions on a few of these.

Also, this increases the RSS significantly on some of the traces. For example, the 32 bit camera trace that does a lot of large allocations went from ~50MB of RSS to ~55MB. It also increased the VA space on a few of the traces. For example, the maps 32 bit trace goes from ~38MB to ~46.5MB. We already have issues in 32 bit land with taking too much VA space, this would likely cause additional issues.

If there is a way to enable/disable this that would be great, otherwise, the regression is too large on almost every axis to take. Even with a possible kernel patch, I don't know if that will fix the RSS and VA space issues.

This revision now requires changes to proceed.Jan 22 2021, 1:46 PM
pcc updated this revision to Diff 320616.Feb 1 2021, 3:07 PM
  • Disable secondary quarantine/UAF detection if MTE is disabled; fix tests
pcc added a comment.Feb 1 2021, 3:10 PM

Sorry it took a long time to get back to this. Unfortunately, not only does this cause relatively large performance regressions, it also causes test failures. The test failures are related to malloc_iterate, but I haven't looked too closely at them yet. Is it possible that something needs to be updated in the code related to tracking the total allocated memory?

Benchmarks show as high as 10%, but many closer to 5% regression in other benchmarks.

As you mentioned, there are some large size calloc/mallocs that are 10x slower, for the 128KB sizes. These are all of the microbenchmarks, but even the trace benchmarks show large performance issues. I see 15% to 20% regressions on a few of these.

Also, this increases the RSS significantly on some of the traces. For example, the 32 bit camera trace that does a lot of large allocations went from ~50MB of RSS to ~55MB. It also increased the VA space on a few of the traces. For example, the maps 32 bit trace goes from ~38MB to ~46.5MB. We already have issues in 32 bit land with taking too much VA space, this would likely cause additional issues.

If there is a way to enable/disable this that would be great, otherwise, the regression is too large on almost every axis to take. Even with a possible kernel patch, I don't know if that will fix the RSS and VA space issues.

I've changed the patch so that the secondary quarantine and UAF detection (the two main sources of overhead) are only enabled when MTE is enabled. I used the memory replay and performance trace benchmarks listed at https://source.corp.google.com/android/bionic/docs/native_allocator.md;bpv=0 and verified no regression on camera trace. PTAL.

cferris requested changes to this revision.Feb 2 2021, 4:03 PM

Unfortunately, 32 bit tests are failing with this abort:

[ RUN ] ScudoCombinedTest.BasicCombined
Scudo ERROR: CHECK failed @ external/scudo/standalone/memtag.h:235 ((0 && "memory tagging not supported")) != (0) (0, 0)
Aborted

It looks like a number of the ScudoCombined tests all fail with this abort, but only on 32 bit. The 64 bit versions pass without any errors.

It looks like the test is trying to do something memtag related that isn't supported on 32 bit. Should the tests be disabled on 32 bit?

This revision now requires changes to proceed.Feb 2 2021, 4:03 PM
pcc updated this revision to Diff 320958.Feb 2 2021, 5:46 PM
  • Fix tests on non-arm64
pcc added a comment.Feb 2 2021, 5:47 PM

Unfortunately, 32 bit tests are failing with this abort:

[ RUN ] ScudoCombinedTest.BasicCombined
Scudo ERROR: CHECK failed @ external/scudo/standalone/memtag.h:235 ((0 && "memory tagging not supported")) != (0) (0, 0)
Aborted

It looks like a number of the ScudoCombined tests all fail with this abort, but only on 32 bit. The 64 bit versions pass without any errors.

It looks like the test is trying to do something memtag related that isn't supported on 32 bit. Should the tests be disabled on 32 bit?

Sorry, I forgot to test on 32-bit. The problem with the tests should be fixed now.

pcc updated this revision to Diff 321278.Feb 3 2021, 5:49 PM
  • Bug fix
hctim added inline comments.Feb 8 2021, 3:50 PM
compiler-rt/lib/scudo/standalone/combined.h
833–841

Yeah, this unfortunately means that some of the sizeclasses get mapped as PROT_MTE even when MTE is disabled. Maybe we should pull this change out of this patch and follow up separately, as it sounds like we might need to pull the MTE options out of the primary options bitset so it doesn't get clobbered at init-time.

$ adb shell process/with/no/mte
$ adb shell cat /proc/<pid>/smaps | egrep 'VmFlags.*mt' -B23
7ca588f000-7ca58cf000 rw-p 00000000 00:00 0                              [anon:scudo:primary]
VmFlags: rd wr mr mw me ac mt
7cd588c000-7cd58cc000 rw-p 00000000 00:00 0                              [anon:scudo:primary]
VmFlags: rd wr mr mw me ac mt
pcc added inline comments.Feb 8 2021, 4:33 PM
compiler-rt/lib/scudo/standalone/combined.h
833–841

Pulling this change out of the patch would lead to MTE never being disabled (from the allocator's perspective), and consequent crashes due to an incorrect include mask. I'll see what can be done about these mappings.

pcc updated this revision to Diff 322266.Feb 8 2021, 6:47 PM
  • Avoid thread init, which may allocate via post-init callback
pcc added inline comments.Feb 8 2021, 6:50 PM
compiler-rt/lib/scudo/standalone/combined.h
833–841

It turned out that the mappings were created because of the post-init callback. I've changed the code to avoid calling the post-init callback from disableMemoryTagging().

I reran the benchmarks and everything looks much better. There were only a few micro-benchmarks that showed decreases that are worrying. For example:

32 bit:

BM_stdlib_calloc_free_decay1/131072 decreased by about 18%
BM_stdlib_calloc_free_default/131072 decreased by about 10%

Various multi-alloc (malloc_forty, malloc_multiple_8192 ones) benchmarks decreased by about 10%.

64 bit:

BM_stdlib_malloc_free_decay1/16384 decreased by about 7%.
BM_stdlib_malloc_free_default/16384 decreased by about 7%.
BM_stdlib_calloc_free_decay1/16384 decreased by about 7%.
BM_stdlib_calloc_free_default/131072 decreased by about 5%.

And then a few of malloc_forty ones decreased by about 12%.

The malloc_forty and malloc_multiple do not really ever correlate with real world code, so I don't think they matter. The larger allocations performance decreases are worrying. Does it make sense that we would see this decrease?

Nothing else showed any really big issues, which probably means that there aren't that many traces that do large allocations that would cause large performance degradation.

I'm inclined to say this is fine if the degradation makes sense.

pcc updated this revision to Diff 323515.Feb 12 2021, 8:50 PM
  • Keep track of the oldest cache entry's time in order to reduce the number of times we need to call releaseOlderThan()
pcc added a comment.Feb 12 2021, 8:52 PM

I reran the benchmarks and everything looks much better. There were only a few micro-benchmarks that showed decreases that are worrying. For example:

32 bit:

BM_stdlib_calloc_free_decay1/131072 decreased by about 18%
BM_stdlib_calloc_free_default/131072 decreased by about 10%

Various multi-alloc (malloc_forty, malloc_multiple_8192 ones) benchmarks decreased by about 10%.

64 bit:

BM_stdlib_malloc_free_decay1/16384 decreased by about 7%.
BM_stdlib_malloc_free_default/16384 decreased by about 7%.
BM_stdlib_calloc_free_decay1/16384 decreased by about 7%.
BM_stdlib_calloc_free_default/131072 decreased by about 5%.

And then a few of malloc_forty ones decreased by about 12%.

The malloc_forty and malloc_multiple do not really ever correlate with real world code, so I don't think they matter. The larger allocations performance decreases are worrying. Does it make sense that we would see this decrease?

Nothing else showed any really big issues, which probably means that there aren't that many traces that do large allocations that would cause large performance degradation.

I'm inclined to say this is fine if the degradation makes sense.

I think I've managed to fix the performance regressions with the latest update.

pcc updated this revision to Diff 324453.Feb 17 2021, 3:54 PM
  • Create two maps for each MTE enabled secondary allocation in order to improve memory bandwidth utilization
cferris accepted this revision.Feb 22 2021, 1:29 PM

Everything is better than before. There is still some benchmarks that were around 6% worse:

32
BM_stdlib_calloc_free_decay1/131072 about 6% worse
BM_stdlib_calloc_free_decay1/65536 about 6% worse
BM_stdlib_calloc_free_default/65536 about 6.6% worse

64 bit
BM_stdlib_calloc_free_default/131072 about 5% worse

However, some larger numbers also improved with this change, and all trace based were about the same. I think the calloc issue is related to zeroing out the memory (which is on by default on Android) and might not be related to this change, or made worse by this change.

Given this, I think this is good enough, but I'm still a bit worried about a gradual degradation of performance if it happens a bit at a time.

This revision is now accepted and ready to land.Feb 22 2021, 1:29 PM
This revision was landed with ongoing or failed builds.Feb 22 2021, 2:35 PM
This revision was automatically updated to reflect the committed changes.
pcc reopened this revision.Mar 1 2021, 10:43 PM
This revision is now accepted and ready to land.Mar 1 2021, 10:43 PM
pcc updated this revision to Diff 327360.Mar 1 2021, 10:43 PM
  • Compute the correct value for BlockSize
pcc added a comment.Mar 1 2021, 10:47 PM

The perf regression on Android appears to have been caused by an incorrect computation of BlockSize in the secondary allocator, which made its way into the statistics returned by mallinfo, which caused ART to enter a slow path. With this fixed I was unable to measure a regression.

cryptoad added inline comments.Mar 2 2021, 7:49 AM
compiler-rt/lib/scudo/standalone/secondary.h
371

Since this is up again for review, could you please find a workaround for the 0-sized array warning?

pcc updated this revision to Diff 327643.Mar 2 2021, 6:14 PM
  • Add -Wno-pedantic to silence GCC warning
pcc marked an inline comment as done.Mar 2 2021, 6:15 PM
pcc added inline comments.
compiler-rt/lib/scudo/standalone/secondary.h
371

Done; I just silenced the warning.

pcc updated this revision to Diff 328636.Mar 5 2021, 1:40 PM
pcc marked an inline comment as done.
  • Optionally add some slack at the end of secondary allocations
This revision was landed with ongoing or failed builds.Mar 8 2021, 2:39 PM
This revision was automatically updated to reflect the committed changes.