This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Android support groundwork
AbandonedPublic

Authored by cryptoad on Apr 11 2017, 9:58 AM.

Details

Summary

This makes the required changes to the Scudo code to be able to work on
Android, both for ARM and AArch64. Note that Android is not enabled yet in
the cmake configs, it will be the object of a later commit.

The following constraints had to be taken into account:

  • Android doesn't support thread_local, so things had to be changed to take this into account, using pthread_{set,get}specific as the alternative;
  • Android can't necessarily afford a cache per thread, due to memory limitations (there are incentives to low memort devices). The current Android allocator, jemalloc, uses a fixed set of arenas, which is the solution I adopted as well, assigning them to threads in a round-robin fashion.

As a result, the thread specific data was moved into a context structure, that
can either be thread_local or global (in this case, allocated during
initialization and lasting the lifetime of the program). This accounts for the
new files: scudo_thread.h, scudo_thread_linux.cpp, scudo_thread_android.cpp.
The number of contexts for Android is modifiable as a define. This also means
that we must be able to tell if a thread cache requires locking now, as for
Android it can be shared between multiple threads. Hence the addition of a
mutex per context, that is locked only when dealing with non thread_local
scenarios.

Another change is the ability to bypass the Quarantine if the cache size is
set to 0. Android watches closely its PSS, and 1MB of Quarantine ends up in
about 6MB of extra PSS for Android 32-bit. So the Quarantine might not be an
option. With no quarantine, the old execution flow ended up with the Scudo
header being checked twice, once before Put, once in Recyle, without any added
value to it.

Other changes involve a bit of reordering of lines here and there, and some
variable renaming to make things a bit more explicit.

For this review, I would also appreciate potential optimization feedback, if
someone sees a way to do things better, that would be great! Thanks for reading
that long blob of text.

Event Timeline

cryptoad created this revision.Apr 11 2017, 9:58 AM
alekseyshl accepted this revision.Apr 11 2017, 12:26 PM
alekseyshl added inline comments.
lib/scudo/scudo_allocator.cpp
176

Does UNLIKELY really improve the generated code?

This revision is now accepted and ready to land.Apr 11 2017, 12:26 PM
cryptoad added inline comments.Apr 11 2017, 1:00 PM
lib/scudo/scudo_allocator.cpp
176

I do like it as it will make the error condition the target of the conditional jump, and the rest of the code follows the regular flow.
For example for that particular check, on x64:

.text:0000000000053B72 E8 99 9A 00 00                    call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
.text:0000000000053B77 4C 8B 44 24 08                    mov     r8, [rsp+78h+var_70]
.text:0000000000053B7C 66 41 39 C0                       cmp     r8w, ax
.text:0000000000053B80 0F 85 CA 02 00 00                 jnz     loc_53E50       ; error
.text:0000000000053B86 49 C1 EC 20                       shr     r12, 20h

The error is the target of the long jump.
I changed all the die conditions to be UNLIKELY as if they are hit, we terminate anyway.
The performance gain is not necessarily significant, but the assembly does look cleaner.

I could use some more eyes and comments on this please!

kcc edited edge metadata.Apr 17 2017, 1:03 PM

Android doesn't support thread_local

eugenis@, Is that still true?
Anyway, would you like to look at this from the Android POV?

In D31947#728470, @kcc wrote:

Android doesn't support thread_local

eugenis@, Is that still true?
Anyway, would you like to look at this from the Android POV?

That isn't true. Android does support thread_local. This patch should not have a different Android path.

In D31947#728470, @kcc wrote:

Android doesn't support thread_local

eugenis@, Is that still true?
Anyway, would you like to look at this from the Android POV?

That isn't true. Android does support thread_local. This patch should not have a different Android path.

To be more specific, thread_local is supported via emutls (llvm/projects/compiler-rt/lib/builtins/emutls.c), which itself calls malloc to allocate the TLS.
Which unfortunately doesn't work well for an allocator that replaces malloc.

eugenis added inline comments.Apr 17 2017, 1:32 PM
lib/scudo/scudo_allocator.cpp
317

Calling initThreadMaybe() from all entry points looks error-prone. Is it ok (performance-wise) to do lazy initialization in getCurrentThreadContext?

lib/scudo/scudo_thread.h
11

an actual description would be nice

cryptoad marked an inline comment as done.Apr 17 2017, 1:37 PM
cryptoad added inline comments.
lib/scudo/scudo_allocator.cpp
317

Things like BackendAllocator.ReturnNullOrDieOnBadRequest will require initialization, but are called prior to getCurrentThreadContext.
I am also actually not super happy about the way this turned out. The alternative would be a preinit or equivalent?

lib/scudo/scudo_thread.h
11

Sorry, user failure. I will address that.

eugenis added inline comments.Apr 17 2017, 2:19 PM
lib/scudo/scudo_allocator.cpp
317

Ideally access to the allocator cache should go through ThreadLocalContext as well. Maybe.

What do you mean by preinit?

cryptoad marked an inline comment as done.Apr 17 2017, 2:24 PM
cryptoad added inline comments.
lib/scudo/scudo_allocator.cpp
317

Access to the local cache & quarantine cache go through the thread context.
The issue I think is more about the global init, which is currently called via a pthread_once in the thread init.
So realistically, the global init doesn't happen until a thread has been initialized.
Unfortunately, that means all the options for the backend haven't been parsed yet if we wait to get the thread context (unless I move it up the code).
I was thinking like in ASan, where it could be initialized once and for all via the preinit_array?

eugenis edited edge metadata.Apr 17 2017, 2:47 PM

preinit_array can not be used in a shared library. A library can use -z initfirst, but we probably don't want to go there, and it's not supported on Android.

Let's leave it like this.

dvyukov edited edge metadata.Apr 18 2017, 2:34 AM

Android doesn't support thread_local

We have a (real) TLS slot for tsan (TLS_SLOT_TSAN). I wonder if we can re-use it for other sanitizers and scudo. Sanitizers are mutually exclusive, so there should be no problems using TLS_SLOT_TSAN is asan/msan. Scudo is not supposed to be used with sanitizers, right? So it seems we can just (ab)use TLS_SLOT_TSAN as well. Accessing it is just few machine instructions (see TLS_SLOT_TSAN in tsan_platform_linux.cc).

To be more specific, thread_local is supported via emutls (llvm/projects/compiler-rt/lib/builtins/emutls.c), which itself calls malloc to allocate the TLS.
Which unfortunately doesn't work well for an allocator that replaces malloc.

I thought we had fixed this one by asking the Android devs to implement proper TLS on Android. They said it was simple and wasn't there because no one needed it.

@srhines, were you at the meeting with @jasonk, @kcc and some Android devs at Google, maybe 2 years ago? We discussed how emutls could be completely avoided.

Or maybe I remember it wrong...

--renato

First of all, do you have any benchmarks? Single-threaded, multi-threaded with/without contention. How do profiles look like?

lib/scudo/scudo_allocator.cpp
317

Agree, this does not look good if we care about performance. We could use preinit_array (or some other android alternative), but we also need to init thread (do we?). So we could piggy-back on thread init check. Namely:

ThreadContext* GetThreadContext() {
   ThreadContext* ctx = __get_tls()[TLS_TSAN_SLOT];
   if (LIKELY(ctx))
     return ctx; // implies that thread and global context were initialized
   return GetThreadContextSlow();
}

ThreadContext* GetThreadContextSlow() {
   init global context once;
   create and init thread context;
   store it into TLS_TSAN_SLOT and return;
}

If we call GetThreadContext() once in each interface function and then explicitly propagate the object pointer, we don't do any excessive work at all.
This needs special care if some interface functions can be called from signal handlers.

lib/scudo/scudo_thread_android.cpp
36

Humm... is the idea that you build scudo for each particular device knowing number of cores?
We can get severe performance drop if 2 or more active threads share the same context. We need to avoid contention at all costs. I am thinking of the following scheme.

  • obtain number of cores at runtime
  • create NCORES (or 2*NCORES) contexts
  • allow threads to dynamically re-associate with contexts to resolve contention

The re-association algorithm can be along the following lines. Remember the last used contenxt. Try to TryLock it. If TryLock succeeds, use it. If TryLock fails, try to find another unlocked context. If all contexts are locked, find the one that was unused for the longest time. The usage accounting can be done with only 1 additional check/store on fast path. Namely:

struct Context {
  Mutex mu;
  atomic_uint64_t unused_since;
  ...
};

bool Context::TryLock() {
  if (mu.TryLock()) {
    atomic_store_relaxed(&unused_since, 0);
    return true;
  }
  if (atomic_load_relaxed(&unused_since) == 0)
    atomic_store_relaxed(&unused_since, now());
  return false;
}

I think this should provide pretty good contention resolution with minimal cost and reasonable complexity.

I thought we had fixed this one by asking the Android devs to implement proper TLS on Android. They said it was simple and wasn't there because no one needed it.

I had a short talk yesterday with some Android Bionic people, and they said they would put it back on the to-do list, but without a commitment as to when it would land.

We have a (real) TLS slot for tsan (TLS_SLOT_TSAN). I wonder if we can re-use it for other sanitizers and scudo. Sanitizers are mutually exclusive, so there should be no problems using TLS_SLOT_TSAN is asan/msan. Scudo is not supposed to be used with sanitizers, right? So it seems we can just (ab)use TLS_SLOT_TSAN as well. Accessing it is just few machine instructions (see TLS_SLOT_TSAN in tsan_platform_linux.cc).

Scudo indeed is mutually exclusive with the other Sanitizers, so this option sounds like it could work!

First of all, do you have any benchmarks? Single-threaded, multi-threaded with/without contention. How do profiles look like?

I have shared with Aleksey some Flame profiles, I can send that to you as well. This ended up with the couple of optimizations he recently did.

First of all, do you have any benchmarks? Single-threaded, multi-threaded with/without contention. How do profiles look like?

I have shared with Aleksey some Flame profiles, I can send that to you as well. This ended up with the couple of optimizations he recently did.

Yes, please.

To be more specific, thread_local is supported via emutls (llvm/projects/compiler-rt/lib/builtins/emutls.c), which itself calls malloc to allocate the TLS.
Which unfortunately doesn't work well for an allocator that replaces malloc.

I thought we had fixed this one by asking the Android devs to implement proper TLS on Android. They said it was simple and wasn't there because no one needed it.

@srhines, were you at the meeting with @jasonk, @kcc and some Android devs at Google, maybe 2 years ago? We discussed how emutls could be completely avoided.

It's unfortunately been too long. There was no progress there IIRC, so we didn't continue pursuing anything related to it. The bionic folks were also concerned about the direction that most of the patches were taking. Yabin added TSan support, but it isn't widely used in Android (it remains harder to set up than ASan).

Or maybe I remember it wrong...

--renato

cryptoad updated this revision to Diff 95768.Apr 19 2017, 9:06 AM

Added descriptions to scudo_thread*.
One of the additional change that didn't make it to the initial commit was
to move the software CRC32 into scudo_utils.h for inlining purposes.

@dvyukov would it be acceptable to get this patch out in its current form and use this as a base for the further improvement you suggested?
Namely: improve contention, potential use of the tsan TLS slot, rework thread/global initialization.

@dvyukov would it be acceptable to get this patch out in its current form and use this as a base for the further improvement you suggested?
Namely: improve contention, potential use of the tsan TLS slot, rework thread/global initialization.

Is there a pressing need?
I think the main problem with this change is that it's very large. You can get both if you split it. For example:

  1. Minor, unrelated changes/refactorings.
  2. Preparation for Android support, but no actual Android code (only Linux as now).
  3. Rework initialization (but again no Android code).
  4. Actual Android support (will not need to touch any common code, just provide platform-specific parts).

@dvyukov would it be acceptable to get this patch out in its current form and use this as a base for the further improvement you suggested?
Namely: improve contention, potential use of the tsan TLS slot, rework thread/global initialization.

Is there a pressing need?
I think the main problem with this change is that it's very large. You can get both if you split it. For example:

  1. Minor, unrelated changes/refactorings.
  2. Preparation for Android support, but no actual Android code (only Linux as now).
  3. Rework initialization (but again no Android code).
  4. Actual Android support (will not need to touch any common code, just provide platform-specific parts).

Sounds good.

cryptoad abandoned this revision.Apr 19 2017, 12:11 PM