This is an archive of the discontinued LLVM Phabricator instance.

[sanitizer] Random shuffling of chunks for the 32-bit Primary Allocator
ClosedPublic

Authored by cryptoad on Oct 24 2017, 8:57 AM.

Details

Summary

The 64-bit primary has had random shuffling of chunks for a while, this
implements it for the 32-bit primary. Scudo is currently the only user of
kRandomShuffleChunks.

This change consists of a few modifications:

  • move the random shuffling functions out of the 64-bit primary to sanitizer_common.h. Alternatively I could move them to sanitizer_allocator.h as they are only used in the allocator, I don't feel strongly either way;
  • small change in the 64-bit primary to make the rand_state initialization UNLIKELY;
  • addition of a rand_state in the 32-bit primary's SizeClassInfo and shuffling of chunks when populating the free list.
  • enabling the random_shuffle.cpp test on platforms using the 32-bit primary for Scudo.

Some comments on why the shuffling is done that way. Initially I just
implemented a Shuffle function in the TransferBatch which was simpler but I
came to realize this wasn't good enough: for chunks of 10000 bytes for example,
with a CompactSizeClassMap, a batch holds only 1 chunk, meaning shuffling the
batch has no effect, while a region is usually 1MB, eg: 104 chunks of that size.
So I decided to "stage" the newly gathered chunks in a temporary array that
would be shuffled prior to placing the chunks in batches.
The result is looping twice through n_chunks even if shuffling is not enabled,
but I didn't notice any significant significant performance impact.

Event Timeline

cryptoad created this revision.Oct 24 2017, 8:57 AM
cryptoad updated this revision to Diff 120083.Oct 24 2017, 9:20 AM

Updating some indentation & a comment.

alekseyshl added inline comments.Oct 24 2017, 10:51 AM
lib/sanitizer_common/sanitizer_allocator_primary32.h
273

How about using [kCacheLineSize - offsetof(SizeClassInfo, padding)] instead?

340

How about using the newly allocated region for the array? You can fill and shuffle all the pointers at once, create the batches and then zero reg out. Still two passes, but less stack and chunks will be farther apart. The code will be simpler too and I have a feeling that we can even avoid double copying for !kShuffleArraySize.

lib/sanitizer_common/sanitizer_common.h
933

Yep, you're right, let's not pollute this file more than it is necessary and move it to sanitizer_allocator.h.

cryptoad added inline comments.Oct 24 2017, 11:34 AM
lib/sanitizer_common/sanitizer_allocator_primary32.h
340

This is tricky because if kRandomShuffleChunks is not used with kUseSeparateSizeClassForBatch , then the batches can also use the region itself.
Meaning while we iterate through the chunks, we could create a batch that would overlap and corrupt the array we are iterating on.
We can't really assume any part of the region is "safe" to store the array either due to the randomness, so using the end of the region or the beginning doesn't change that.
If kUseSeparateSizeClassForBatch is used, then we are golden and that can work.
Currently Scudo uses both, but I feel they should be treated distinctly nonetheless.

cryptoad updated this revision to Diff 120108.Oct 24 2017, 11:44 AM
cryptoad marked an inline comment as done.

Moving Rand* to sanitizer_allocator.h instead.

cryptoad added inline comments.Oct 24 2017, 12:14 PM
lib/sanitizer_common/sanitizer_allocator_primary32.h
273

I can't seem to make that general idea to work one way or another.
The other possibility that some sanitizer parts use is just char padding[kCacheLinesize].

cryptoad added inline comments.Oct 24 2017, 12:34 PM
lib/sanitizer_common/sanitizer_allocator_primary32.h
340

Rectification: even with kUseSeparateSizeClassforBatch, there is the special case of kBatchClassID for which the batches are stored in the same region.

alekseyshl added inline comments.Oct 24 2017, 1:50 PM
lib/sanitizer_common/sanitizer_allocator_primary32.h
273

Ah, right, that should not work there, sorry. Why not make it explicit then:

kCacheLineSize - sizeof(SpinMutex) - sizeof(IntrusiveList<TransferBatch> - sizeof(u32))

or

struct SizeClassInfoData {
    SpinMutex mutex;
    IntrusiveList<TransferBatch> free_list;
    u32 rand_state;
}
template <typename T, uptr N>
struct Padded : public T {
private:
  char padding[N - sizeof(T)];
};
typedef Padded<SizeClassInfoData, kCacheLineSize> SizeClassInfo;

but yeah, it's too much for the cause :)

340

Doesn't this latter case create a loop? We're in PopulateFreeList because AllocateBatch figured that sci->free_list.empty(), and when PopulateFreeList calls CreateBatch, don't we get back to AllocateBatch with free list still empty?

Sorry for too many questions, just want to understand what's going on here.

cryptoad added inline comments.Oct 24 2017, 2:37 PM
lib/sanitizer_common/sanitizer_allocator_primary32.h
273

Regarding the first part, it actually raises an interesting point: if using the 32-bit primary on 64-bit, then fields are uptr aligned.
So - sizeof(u32) doesn't work as opposed to sizeof(uptr).
Which raises the question of: should the structures be packed tighter?

340

No worries, all questions are welcome!

So the first PopulateFreeList will be with the class_id that we requested with malloc, then it will go down the again in PopulateFreeList except that this time the class_id will be kBatchClassId, which is serviced from its own region (eg: it doesn't go down into PopulateFreeList again).
So for the first invocation (where it needs to create a region for the size requested and then for the batch), the stack looks like:

#0  __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__scudo::AP32> >::CreateBatch
#1  0x00000000005095f1 in __sanitizer::SizeClassAllocator32<__scudo::AP32>::PopulateBatches
#2  0x0000000000509123 in __sanitizer::SizeClassAllocator32<__scudo::AP32>::PopulateFreeList
#3  0x0000000000508c4e in __sanitizer::SizeClassAllocator32<__scudo::AP32>::AllocateBatch
#4  0x0000000000508a26 in __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__scudo::AP32> >::Refill
#5  0x0000000000508942 in __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__scudo::AP32> >::Allocate
#6  0x00000000005085f2 in __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__scudo::AP32> >::CreateBatch
#7  0x00000000005095f1 in __sanitizer::SizeClassAllocator32<__scudo::AP32>::PopulateBatches
#8  0x0000000000509123 in __sanitizer::SizeClassAllocator32<__scudo::AP32>::PopulateFreeList
#9  0x0000000000508c4e in __sanitizer::SizeClassAllocator32<__scudo::AP32>::AllocateBatch
#10 0x0000000000508a26 in __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__scudo::AP32> >::Refill
#11 0x0000000000508942 in __sanitizer::SizeClassAllocator32LocalCache<__sanitizer::SizeClassAllocator32<__scudo::AP32> >::Allocate

Frames 11 to 6 are wrt to the requested class_id, 5 to 0 are for kBatchClassId.
When reaching 0, we just return b and everything unrolls back up.
This is only for the first allocation, afterwards the Region for the TransferBatch is created and the associated free list is populated so we won't go there again unless we need to re-populate them.

alekseyshl added inline comments.Oct 24 2017, 3:34 PM
lib/sanitizer_common/sanitizer_allocator_primary32.h
340

Thanks for the stack!

So, you're saying it never happens, we never allocate batches from the same bucket (and thus, the same region)? Then why my proposal of using the new region for the temp buffer is not viable? Otherwise, if I got your explanation wrong and there are cases when batches are allocated from the same region (the region we're populating free list for), how come it does not get into a loop I mentioned earlier? I am a bit confused...

cryptoad added inline comments.Oct 24 2017, 4:02 PM
lib/sanitizer_common/sanitizer_allocator_primary32.h
340

I am going to try and take a step back to make sure I am on the same wave length as you.

The proposal as I understand it is to put all the possible pointers in the newly allocated region, and shuffle them, there, thus allowing for more than a 48 shuffle at a time. Then create and fill batches by iterating through the shuffled array, potentially zero out the region.
If this is not the actual proposal, please let me know.

With the proposal, there are 2 cases to take into account: kUseSeparateSizeClassforBatch is true or false.

If kUseSeparateSizeClassforBatch is false, then if a chunk in a class is large enough to hold a transfer batch, it will be used as a TranferBatch.
https://github.com/llvm-mirror/compiler-rt/blob/master/lib/sanitizer_common/sanitizer_allocator_local_cache.h#L146
Meaning that when we do a b->Add(xxx) we actually overwrite some of region memory itself.
In this case, it sounds like it would be tough to not overlap an in-region transfer batch with the pointers we are iterating through.
As such, I am not sure the proposal is viable (or at least without significant code complexity).

If kUseSeparateSizeClassforBatch is true, then we allocate transfer batches from a separate class size, so we could reuse the region for all other classes except that one. Because for that particular one, we fall back into an equivalent of the preceding case: batches for the kBatchClassId class are allocated within the class region itself. Now we could make this work by ensuring there is no randomization in that specific kBatchClassId class: indeed we do not need batches to be randomized.

A couple of other points to consider:

  • For class 1 (16 bytes) with a 1MB region, we get 65536 chunks in the region. If we are using the region to hold the array, with 64-bit pointers, it's 524288 bytes that we are dirtying in the region (as well as the transfer batch region really);
  • Zeroing that memory (to avoid potential disclosure of valid pointers) will be somewhat costly as well.

By using a local array, we reduce the randomness of our chunk shuffle (at most 48 consecutive chunks get shuffled at a time), but we do not touch the newly allocated region (yet) and save on some complications.

I hope this clarifies my point of view (if my understanding of what you are suggesting is correct).

alekseyshl accepted this revision.Oct 24 2017, 4:54 PM
alekseyshl added inline comments.
lib/sanitizer_common/sanitizer_allocator_primary32.h
273

I guess they should not, uptr alignment is faster, right? So, yep, leave it as it is.

340

Yes, you understanding is correct, that was exactly what I proposed.

So, when batches are allocated from the same region as regular chunks, the batch contain a pointer to itself, right? Ok, now it makes more sense. Considering your other points, yes, local array seem like a proper (and safer) solution.

341

Can you rename j to something less Fortran-y?

346

Move the shuffling into PopulateBatches.

This revision is now accepted and ready to land.Oct 24 2017, 4:54 PM
cryptoad updated this revision to Diff 120272.Oct 25 2017, 9:37 AM
cryptoad marked 12 inline comments as done.

Addressing comments from the review:

  • moving shuffling into PopulateBatches
  • renaming an ambiguous variable

I also changed a couple more things:

  • if using a separate class for batches, do not shuffle it (there is no user controlled allocations in the region so randomizing isn't useful);
  • 32-bit primary rand_state initialized with sci rather than this so that 2 different classes initialized at the same time would have different seeds.
alekseyshl accepted this revision.Oct 25 2017, 10:19 AM
cryptoad closed this revision.Oct 25 2017, 10:25 AM