Page MenuHomePhabricator

[scudo][standalone] Introduce the thread specific data structures

Authored by cryptoad on May 22 2019, 8:18 AM.



This CL adds the structures dealing with thread specific data for the
allocator. This includes the thread specific data structure itself and
two registries for said structures: an exclusive one, where each thread
will have its own TSD struct, and a shared one, where a pool of TSD
structs will be shared by all threads, with dynamic reassignment at
runtime based on contention.

This departs from the current Scudo implementation: we intend to make
the Registry a template parameter of the allocator (as opposed to a
single global entity), allowing various allocators to coexist with
different TSD registry models. As a result, TSD registry and Allocator
are tightly coupled.

This also corrects a couple of things in other files that I noticed
while adding this.

Diff Detail


Event Timeline

cryptoad created this revision.May 22 2019, 8:18 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 22 2019, 8:18 AM
Herald added subscribers: Restricted Project, jfb, delcypher and 2 others. · View Herald Transcript
cryptoad updated this revision to Diff 200779.May 22 2019, 9:55 AM

Correct a test and some formatting.

morehouse added inline comments.May 23 2019, 5:08 PM
34 ↗(On Diff #200779)

I assume allocators will always be linker-initialized, which is why we have reset for testing instead of an init?

49 ↗(On Diff #200779)

Allocator-> is cleaner than Allocator.get()->, unless there's a good reason to get the raw pointer.

27 ↗(On Diff #200779)


28 ↗(On Diff #200779)

Shouldn't this also call initLinkerInitialized?

60 ↗(On Diff #200779)

Will threads have different allocators?

64 ↗(On Diff #200779)

What is the use-case for MinimalInit initialization?

74 ↗(On Diff #200779)

Any reason this needs to be a pointer?

94 ↗(On Diff #200779)

Assigning to N seems pointless. Can we simplify to:

if (TSDRegistryT::ThreadTSD.DestructorIterations > 1) {
21 ↗(On Diff #200779)

Shouldn't this also call initLinkerInitialized?

23 ↗(On Diff #200779)

Should Allocator initialization just be done in initLinkerInitialized?

61 ↗(On Diff #200779)

What are the CoPrimes used for, and what is the strange calculation above doing?

74 ↗(On Diff #200779)

Do we really need three ways to do TLS? Can we just use pthreads for all?

97 ↗(On Diff #200779)

Is it sufficient to just check NumberOfTSDs?

123 ↗(On Diff #200779)

Index %= NumberOfTSDs

144 ↗(On Diff #200779)

Why do we need these guards here but not in tsd_exclusive.h?

cryptoad marked 12 inline comments as done.May 24 2019, 8:48 AM

Matt, thank you for all the reviews you are doing. Very insightful points.

34 ↗(On Diff #200779)

There is indeed a subtlety here, the initOnce function of a TSD will call the initLinkerInitialized function of the allocator.
Calling this init would (I feel), indicate the usual construct of init calling initLinkedInitialized, which shouldn't be the case here, we just want a nulled out structure but not call the initLinkedInitialized wanna-be-constructor.
Hence using reset.

60 ↗(On Diff #200779)

Multiple allocators can coexist in a process, hence a need to know which allocator a registry belongs to.

64 ↗(On Diff #200779)

Right, this is documented in combined.h that hasn't landed yet, here is the comment:

// For a deallocation, we only ensure minimal initialization, meaning thread
// local data will be left uninitialized for now (when using ELF TLS). The
// fallback cache will be used instead. This is a workaround for a situation
// where the only heap operation performed in a thread would be a free past
// the TLS destructors, ending up in initialized thread specific data never
// being destroyed properly. Any other heap operation will do a full init.

There is test case in the current Scudo for that situation that will be in as well.

74 ↗(On Diff #200779)

The main reason is for an uninitialized allocator to take as little memory space as possible.
A TSD is usually around 8kB (varying based on the number of classes and pointers cached).
It costs an extra map() on init, but allows several allocators to coexist with using extra memory (the code footprint cost still being there).

23 ↗(On Diff #200779)

This question made me realize that something is a bit sketchy.
initOnce is initializing the allocator, which in turn is calling initLinkerInitializedof the registry: at this point we are in the spinmutex, and call the initLinkerInitialized method of that same mutex. It all works out because 1) everything is zeroinitialized 2) the initLinkerInitialized of the spinmutex is a no-op.
But it is definitely logically skewed.
The original version was using pthread_once which was alleviating the need for the mutex, but I needed the Instance parameter for initOnce (pthread_one doesn't allow parameters to the once function).
I am going to have to rethink that.

61 ↗(On Diff #200779)

I added a comment. The original idea was from Dmitry, but there are online reference such as:

74 ↗(On Diff #200779)

The ELF TLS would be the fastest as it's only a couple of instructions to access the data relative to %fs.
The pthread implementations vary widely. They require a call, which is not ideal, but also they range in terms of efficiency.
The getAndroidTlsPtr ends up being a few asm instructions as well.

Now with the introduction of ELF TLS in Android (beginning of this year), we might be able to get rid of the Android TLS ptr trick.
I haven't tested it yet. Also I am not sure the ELF TLS works from within the libc (it doesn't on Fuchsia for example).

97 ↗(On Diff #200779)

This is to help the compiler optimize the block out.
If MaxTSDCount is 1 (likely the svelte Android config case), the compiler doesn't know that NumberOfTSDs is 1 and leaves the block in.
The MaxTSDCount check will be optimized at compile time, and for the 1 case, the whole block goes away.

123 ↗(On Diff #200779)

Here the division is potentially costly, hence going for a comparison and subtraction.
I am not sure the performance difference is significant in this situation (slow path, 4 iterations at most) but I tried to avoid divisions everywhere as a general rule of thumb.

144 ↗(On Diff #200779)

The exclusive version only uses ELF TLS. So if opting for an exclusive TSD it assumes the platform supports it.
For the shared version, it can work without ELF TLS, hence having THREADLOCAL variables within defines.
A couple of things to consider:

  • Fuchsia supports THREADLOCAL, but not within the libc, and we are in the libc.
  • Android recently got support for ELF TLS, but 1) I haven't tested it yet 2) I don't know if THREADLOCAL works from within the C library

I will revisit this when things settle, but as of now, I am sure this version works.

cryptoad updated this revision to Diff 201263.May 24 2019, 8:51 AM
cryptoad marked 3 inline comments as done.

Addressing several of Matt's review points:

  • adding comments to obscure code snippets
  • calling initLinkerInitialized when needed
  • changing unique_ptr's get()-> to ->
  • simplifying some code constructs
cryptoad updated this revision to Diff 201333.May 24 2019, 2:03 PM

Switching from a spin mutex to a blocking mutex for initOnce.

morehouse added inline comments.May 29 2019, 10:40 AM
28 ↗(On Diff #200779)

I think switching to BlockingMutex introduced a new bug. BlockingMutex has a default constructor that will run at global ctor time. So the following sequence could happen:

  1. TSDRegistryExT::initLInkerInitialized()
  2. Mutex.lock()
  3. TSDRegistryExT() implicit constructor runs, calling Mutex's ctor.

So now Mutex is unlocked even though unlock was never called.

(also applies to the other registry)

60 ↗(On Diff #200779)

Ok, but will a single registry ever have multiple allocators? If not, we don't need to pass the allocator to the therad-specific initThread and should instead pass it to the global init.

64 ↗(On Diff #200779)

Thanks for the explanation. Would it make sense to put that comment here, or is the combined a better place?

cryptoad marked an inline comment as done.May 29 2019, 11:49 AM
cryptoad added inline comments.
28 ↗(On Diff #200779)

Damn, I failed.
The other option was to implement a call_once type function, that would be specific to the registry, using an atomic cas (like llvm::call_once) and a little spin, and not have a mutex at all.
I can't use std::call_once and pthread_once, but if you have another idea, I am happy to oblige.

60 ↗(On Diff #200779)

Ah, I see what you mean. That is indeed the case, I'll reorganize that.

64 ↗(On Diff #200779)

I'll add a short version of it here.

morehouse added inline comments.May 29 2019, 12:54 PM
28 ↗(On Diff #200779)

Is there a reason we need the init loop to begin with? It would be nice to have a one-way initialization instead.

23 ↗(On Diff #200779)

Seems this is a consequence of the tight coupling between registry and allocator.

I think ideally we'd have a one-way dependency so either allocator-has-a-registry, or registry-has-an-allocator. Then the initialization only ever makes sense one-way.

Loose coupling in general would also make it easier to understand each piece in isolation, which makes future code maintenance much easier.

cryptoad marked 2 inline comments as not done.May 29 2019, 3:12 PM
cryptoad added inline comments.
28 ↗(On Diff #200779)

The loop isn't required, it's sort of a residue at my attempt at lazy initializing everything, while keeping the init/initLinkerInitialized construct.
The initialization flow should be as follows:

  1. we have a zero-initialized Combined structure
  2. someone calls allocate() on that Combined
  3. this does initThread, which calls initThread in the registry (since it's really a registry thing)
  4. thread isn't initialized, neither is the allocator, so call the combined's initOnce
  5. parse the flag, initialize the internal structures
cryptoad updated this revision to Diff 202742.Jun 3 2019, 9:44 AM

This, hopefully, detangles a bit the initialization process for
the registry. In order to do that, we get rid of the mutex in favor
of our own call_once type construct (losely inspired by
llvm::call_once). We move the initialization code into

This gets rid of the potential init loop, and hopefully makes the
code structure more coherent.

Additional, I had some local flakes while testing this, and realized
that I messed up some region log sizes in the tests. Since we are
modifyin that file here, unflake the test as well.

cryptoad updated this revision to Diff 202744.Jun 3 2019, 9:57 AM

Add a comment for MinimalInit.

cryptoad updated this revision to Diff 202747.Jun 3 2019, 10:12 AM
cryptoad marked 6 inline comments as done.

Change the Primary test again to be more forgiving to OOM.

cryptoad updated this revision to Diff 202752.Jun 3 2019, 10:31 AM

Disable a test on 32-bit for now: we are running out of address space.
This will be fixed in a subsequent CL, but affects the tests of this one.

Ping pretty please! I think all the comments were addressed one way or another. The new structure doesn't "cycle" anymore.

morehouse added inline comments.Jun 6 2019, 12:56 PM
29 ↗(On Diff #201333)

Don't we need TSDRegistry.initLinkerInitialized()?

62 ↗(On Diff #202752)

With this change, we now expect initialization to only happen through initThreadMaybe, right? Should we even keep initLinkerInitialized around then?

67 ↗(On Diff #202752)

Is the fence necessary? Not an atomic expert, but doesn't the load achieve the same thing?

77 ↗(On Diff #202752)

I think things would be much simpler if we can use a StaticSpinMutex. We're assuming we're zero-initialized anyway, so the mutex will be initialized before use as well.

cryptoad added inline comments.Jun 6 2019, 1:17 PM
62 ↗(On Diff #202752)

If I did things correctly, the Registry can also now be initialized with initLinkerInitialized, which will carry out the "once" initialization. Then initThreadMaybe will do the thread initialization skipping the "once".
My current plan is to have everything lazy initialized the first time someone calls malloc (or whatever else), meaning the first initThreadMaybe will carry the "once" initialization, but the alternative of pre-initializing is now offered.

67 ↗(On Diff #202752)

Not an atomic expert either, it's inspired from from llvm::call_once which does this.
I figured it had a purpose and that I should keep it.

77 ↗(On Diff #202752)

I agree. But the issue that would come up using StaticSpinMutex, is that we would land in initLinkerInitializedwhere we should call the Mutex's initLinkerInitialized (to be consistent), while holding it. It works because it's a noop, but doesn't make sense from a logical perspective.
Implementing it this way (with our wannabe call_once) allows us to not have such a loophole.

morehouse added inline comments.Jun 6 2019, 3:37 PM
77 ↗(On Diff #202752)

Can we simply remove initLinkerInitialized from StaticSpinMutex, as we have in sanitizer_common? We could possibly add a comment that calls to init are only necessary if the mutex is not linker initialized.

cryptoad added inline comments.Jun 6 2019, 5:02 PM
77 ↗(On Diff #202752)

Definitely can!

cryptoad updated this revision to Diff 203569.Jun 7 2019, 9:15 AM

As discussed through comments, re-introduce a StaticSpinMutex to the
"once" initialization of the TSD registry, and remove its no-op
initLinkerInitialized in the various places it was used.
Also adds a test to exercise the path of "direct" initialization
via calling initLinkerInitialized directly on the registry.

morehouse added inline comments.Jun 7 2019, 1:00 PM
69 ↗(On Diff #203569)

This is unguarded by the mutex. I think what we need to do is call initOnce unconditionally.

99 ↗(On Diff #203569)

This is also unguarded by mutex.

cryptoad added inline comments.Jun 7 2019, 1:13 PM
69 ↗(On Diff #203569)

My understanding is that it's not needed.
If it's true then initialization has been carried, if false, then we lock the mutex in initOnce.
It's assuming that the bool can't be partially modified, but I guess I should enforce that with an atomic_u8 instead.

cryptoad updated this revision to Diff 203611.Jun 7 2019, 1:19 PM

Use an atomic_u8 for Initialized instead of a bool.

cryptoad updated this revision to Diff 203613.Jun 7 2019, 1:21 PM

clang-format'ing the code.

morehouse accepted this revision.Jun 7 2019, 1:50 PM
morehouse added inline comments.
69 ↗(On Diff #203569)

I think another issue is that the compiler is allowed to reorder instructions within a locking context. So it is possible that Initialized = true but initialization hasn't fully finished. If we always acquire the lock before accessing Initialized, we guarantee that we're fully initialized since compiler can't reorder instructions outside of the locking context.

This revision is now accepted and ready to land.Jun 7 2019, 1:50 PM
cryptoad updated this revision to Diff 203621.Jun 7 2019, 2:00 PM

Always call initOnce, and rename it initOnceMaybe to reflect that
initialization might not necessarily occur if it already happened.

This revision was automatically updated to reflect the committed changes.