Page MenuHomePhabricator

[XRay][profiler] Part 1: XRay Allocator and Array Implementations

Authored by dberris on Apr 18 2018, 12:53 AM.



This change is part of the larger XRay Profiling Mode effort.

Here we implement an arena allocator, for fixed sized buffers used in a
segmented array implementation. This change adds the segmented array
data structure, which relies on the allocator to provide and maintain
the storage for the segmented array.

Key features of the Allocator type:

  • It uses cache-aligned blocks, intended to host the actual data. These blocks are cache-line-size multiples of contiguous bytes.
  • The Allocator has a maximum memory budget, set at construction time. This allows us to cap the amount of data each specific Allocator instance is responsible for.
  • Upon destruction, the Allocator will clean up the storage it's used, handing it back to the internal allocator used in sanitizer_common.

Key features of the Array type:

  • Each segmented array is always backed by an Allocator, which is either user-provided or uses a global allocator.
  • When an Array grows, it grows by appending a segment that's fixed-sized. The size of each segment is computed by the number of elements of type T that can fit into cache line multiples.
  • An Array does not return memory to the Allocator, but it can keep track of the current number of "live" objects it stores.
  • When an Array is destroyed, it will not return memory to the Allocator. Users should clean up the Allocator independently of the Array.

These basic data structures are used by the XRay Profiling Mode
implementation to implement efficient and cache-aware storage for data
that's typically read-and-write heavy for tracking latency information.
We're relying on the cache line characteristics of the architecture to
provide us good data isolation and cache friendliness, when we're
performing operations like searching for elements and/or updating data
hosted in these cache lines.

Diff Detail


Event Timeline

dberris created this revision.Apr 18 2018, 12:53 AM
kpw added a comment.Apr 18 2018, 6:17 PM

Haven't looked at segmented array yet, but I had a good look at the allocator. I think I caught one serious bug, and the rest is just suggestions.

21–24 ↗(On Diff #142886)

Didn't like Martin's attempt to alphabetize? ;)

30 ↗(On Diff #142886)

Expect not null? I don't grok this syntax.

35 ↗(On Diff #142886)

Can you document the template parameter in the class level comment? Minimum block size?

36–38 ↗(On Diff #142886)

If you comment the template parameter in the class comment. You can replace this with something to the effect of "A block is a a piece of memory handed out by the Allocator"

52 ↗(On Diff #142886)

This name collides with the cryptographic immutable content-address data structure. I would try to avoid that. BlockListNode?

57–59 ↗(On Diff #142886)

Could you make this more clear? Maybe even name it BlockPtrsPerCacheLine instead of Size. I had to read it a few times.

66 ↗(On Diff #142886)

sizeof(Block*)? Should be equivalent since Block just wraps void*, but could be more clear. I see this type of punning show up in a few other places, so maybe it's alright to be consistent.

71–73 ↗(On Diff #142886)

If you're punting on the freelist you might be able to squeeze some more data into here.

UsedBits can actually be disposed of if you rely on always advancing the Chain when Counter % BlockChain::Size == 0 instead of wrap around detection.

Similarly, *Next is never actually used.

86 ↗(On Diff #142886)

Maybe call this Tail to communicate that it is updated to refer to the node to pull new blocks from and not the head?

101–103 ↗(On Diff #142886)

Did you consider having this function accept the tail of the list? Then it could write forward links in addition to backward links. It also seems fishy (at least if you want to do a free list) that this always points back to the Chain member.

123 ↗(On Diff #142886)

I prefer 'auto*' in cases like this.

130 ↗(On Diff #142886)

NullLink is static right? This seems messy, although I know you don't intend to install multiple allocators.

138 ↗(On Diff #142886)

Pretty sure you meant & instead of |.

145–147 ↗(On Diff #142886)

DCHECK that the ChainOffset == 0 and BitMask == 1?

159 ↗(On Diff #142886)

auto* C please.

dberris updated this revision to Diff 143041.Apr 18 2018, 9:44 PM
dberris marked 14 inline comments as done.
  • fixup: Address comments by kpw@
21–24 ↗(On Diff #142886)

Oh, I just thought it was harder to track which files where in which bucket by having them all in the same line. Visually spelling these out this way allows me to determine whether there are missing files quickly.

52 ↗(On Diff #142886)

Renamed to BlockLink.

71–73 ↗(On Diff #142886)

Good call! Yeah, I kind-of stopped short of cleaning this up further.

Definitely something to think about if we ever need the freelist, but not really required at this time.

101–103 ↗(On Diff #142886)

It could be, but now that we're not actually using the forward links (thanks for catching that), we can simplify this a lot now.

130 ↗(On Diff #142886)

Yeah, NullLink is used as a sentinel value, which simplifies a lot of the operations for determining whether it's the end. It could as easily have been a nullptr though.

kpw added inline comments.Apr 19 2018, 5:26 PM
64–65 ↗(On Diff #143041)

This comment is out of date since you've cleaned up the Next ptr and the bitmap.

71–72 ↗(On Diff #143041)

Bitmap doesn't exist any more.

122 ↗(On Diff #143041)

I don't think this is a special case now that you've made changes. You should only need two arms in your conditional.

In one case, ChainOffset == 0 and you have to allocate a new BlockLink.
In the other arm, you can return a Block from the existing BlockLink arena.

134–138 ↗(On Diff #143041)

It looks like both branches of this conditional allocate a new BlockLink.

52 ↗(On Diff #142886)

BlockLink is a good name. Sorry to be a stickler. ;)

130 ↗(On Diff #142886)

Now that you're not keeping a forward pointer, having a static as the first link isn't problematic.

34–35 ↗(On Diff #143041)

For the N default value, you can't pack an entire T if the remaining bytes are non-zero. I think it can be simplified to kCacheLineSize / sizeof(T)

46–47 ↗(On Diff #143041)

Did you consider the tradeoffs of these being external to Block versus invasive to the Block?

104 ↗(On Diff #143041)

Can you call out what iterator category this can be used like even though it doesn't satisfy the complete concept.

109 ↗(On Diff #143041)

O -> Offset otherwise it kind of looked like a default zero in the second parameter.

124 ↗(On Diff #143041)

Shouldn't you check against the sentinel chunk instead (here and above).

125–131 ↗(On Diff #143041)

This doesn't seem right. There is still a value at offset 0 within the current chunk when going backward. What you want to check is whether Offset ends up being in the N - 1 congruence class after decrement.

138–140 ↗(On Diff #143041)

Can't you just invoke operator++() here?

146–148 ↗(On Diff #143041)

Similarly reuse operator--() code?

153 ↗(On Diff #143041)

This is fine., but will a compiler generate memcmp equivalent?

171–172 ↗(On Diff #143041)

Haven't looked past here yet. Leaving a note so I know where to come back to.

dberris updated this revision to Diff 143230.Apr 19 2018, 11:36 PM
dberris marked 11 inline comments as done.
  • fixup: Address comments by kpw@.

Thank you for the thorough review, @kpw!

34–35 ↗(On Diff #143041)

Here's the problem with that -- if T > kCacheLineSize, then N must not be 0.

Essentially we want to figure out how many T's will fit in blocks that are cache-line size *multiples*. The computation I wanted to express is something like this:

Given sizeof(T), how many T's can I fit in cache-line _multiple_ sized blocks if each cache line is kCacheLineSize bytes?

For a cache line of 64 bytes, and T being 16 bytes for example, it seems that we can fit 4 T's just fine. But for something with 24 bytes, we can only fit 2 in one cache line, but if we asked for 8 then we can put them in three cache lines. It might not be great without alignment information (i.e. we really should be treating the 24-byte object as a 32-byte object) but that's something that can be fixed by the definition of T.

Thinking this through, we really should be using the least-common multiple of sizeof(T) and kCacheLineSize, divide that by sizeof(T) to figure out how many elements to store in cache-line-multiple-sized blocks for the default N.

I've updated the implementation to do just that. :)

46–47 ↗(On Diff #143041)

I did think about using the Block storage as an opaque set of bytes and intrusively putting the pointers in there too (ala an intrusive list). I *think* that makes the implementation much more bug prone and it made my head hurt just trying to keep the aliasing and alignment rules in my head of where the pointers should be, whether the addresses we're getting are properly aligned, etc. -- so I punted and left it this way instead. :)

Putting a TODO to explore that possibility in the future.

125–131 ↗(On Diff #143041)

Oh my, yes you're right! Good catch, thanks.

153 ↗(On Diff #143041)

I don't think memcmp is actually what we want, but I'm happy for the compiler to do it for me if it can.

dberris updated this revision to Diff 143236.Apr 19 2018, 11:50 PM
  • fixup: Use iterators in find_element implementation
kpw added a comment.Apr 22 2018, 7:27 PM

Finally reached the end. Phew!

I think there is some substantial complexity here, and you'll probably want to think about how to get some test coverage of the implementation.

One edge case I didn't quite reason through is trimming down to size zero might trigger special case sentinel checks being invalid for instance.

283–290 ↗(On Diff #143236)

You want to reuse the allocated chunks when growing the list rather than allocate new ones right?

295–296 ↗(On Diff #143236)

Isn't this a peculiar that these don't return a const type? Will this be a problem in practice?

299–300 ↗(On Diff #143236)

Can you explain this statement please? My template foo is a bit rusty. Is this a templated member definition and which version of the standard does it require? LLVM is still on 11 IIUC.

dberris updated this revision to Diff 143500.Apr 22 2018, 9:58 PM
dberris marked 4 inline comments as done.
  • Rebase
    • squash: Add a freelist to the Array
dberris added inline comments.Apr 22 2018, 9:58 PM
283–290 ↗(On Diff #143236)

Yes, that's correct. Good call, fixed.

295–296 ↗(On Diff #143236)

In practice, for our use-case, it shouldn't be a problem, although ideally the type of the Iterator should be different for the non-const and const case. For us here, it shouldn't matter (yet). The way to fix this would be to make the Iterator type itself a template, so that we can determine what the result of the dereference and arrow operators are, whether they are const T or T.

I've made the change to make it more "technically correct", which is the best kind of correct. :)

299–300 ↗(On Diff #143236)

So, what happens here is we're telling the compiler that if the type Array<T, N> is ever instantiated in a translation unit, that we're going to get storage for the static data member. This is because SentinelChunk is a static data member of the Array<T,N> type, and it needs static (program duration) storage to be defined across translation units. This is how you spell this in C++98, AFAICT.

kpw accepted this revision.Apr 28 2018, 8:24 PM

This looks better now, although there were a few issues that slipped through the tests and I wonder if there are others lurking.

If we're going to maintain this cache-line aware storage structure, do you have a plan to measure the performance impact on the GWP implementation versus using a more simple container like sanitizer/vector.h?
This seems important to me because certain kinds of mistakes might still leave the interface operating correctly, but screw up the performance details.

299–300 ↗(On Diff #143236)

Oh. I see. This is outside of the Array class, so it's an out of line static member definition with default initialization. The templates obscur that a bit, but I get it now.

This revision is now accepted and ready to land.Apr 28 2018, 8:24 PM
In D45756#1082146, @kpw wrote:

This looks better now, although there were a few issues that slipped through the tests and I wonder if there are others lurking.

If we're going to maintain this cache-line aware storage structure, do you have a plan to measure the performance impact on the GWP implementation versus using a more simple container like sanitizer/vector.h?
This seems important to me because certain kinds of mistakes might still leave the interface operating correctly, but screw up the performance details.

Yes, I have that on the list of things to do (add microbenchmarks to the test-suite).

I had some initial measurements on the cost of an early implementation, which basically show that the cost of reaching for additional memory and copying elements dominate. Those are important things to measure and make sure we can track going forward. It's also the whole point of us going through this route, to mitigate the problems we've encountered in the early prototype of this. :)

This revision was automatically updated to reflect the committed changes.
rnk added subscribers: thakis, rnk.May 31 2018, 10:38 AM

Your usage of constexpr appears to crash GCC 4.8.5:

[871/917] Building CXX object lib/xray/CMakeFiles/RTXrayPROFILER.x86_64.dir/
FAILED: lib/xray/CMakeFiles/RTXrayPROFILER.x86_64.dir/ 
/b/c/b/ToTLinux/src/third_party/llvm-build-tools/gcc485precise/bin/g++  ...  /b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/
In file included from /b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_segmented_array.h:20:0,
                 from /b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_function_call_trie.h:19,
                 from /b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_profile_collector.h:21,
                 from /b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/
/b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_allocator.h: In instantiation of ‘__xray::Allocator<320ul>::BlockLink __xray::Allocator<320ul>::NullLink’:
/b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_allocator.h:133:31:   required from ‘__xray::Allocator<N>::~Allocator() [with long unsigned int N = 320ul]’
/b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_function_call_trie.h:202:43:   required from here
/b/c/b/ToTLinux/src/third_party/llvm/compiler-rt/lib/xray/xray_allocator.h:57:10: internal compiler error: Segmentation fault
   struct BlockLink {
Please submit a full bug report,
with preprocessed source if appropriate.
See <> for instructions.

I will try to find a workaround. @thakis

rnk added a comment.May 31 2018, 11:31 AM

I committed rL333678, which should start the process of healing the bots on

Wow, that's... interesting.

Thanks Reid!