Page MenuHomePhabricator

[XRay][profiler] Part 4: Profiler Mode Wiring

Authored by dberris on Mar 18 2018, 10:24 PM.



This is part of the larger XRay Profiling Mode effort.

This patch implements the wiring required to enable us to actually
select the xray-profiling mode, and install the handlers to start
measuring the time and frequency of the function calls in call stacks.
The current way to get the profile information is by working with the
XRay API to __xray_process_buffers(...).

In subsequent changes we'll implement profile saving to files, similar
to how the FDR and basic modes operate, as well as means for converting
this format into those that can be loaded/visualised as flame graphs. We
will also be extending the accounting tool in LLVM to support
stack-based function call accounting.

We also continue with the implementation to support building small
histograms of latencies for the FunctionCallTrie::Node type, to allow
us to actually approximate the distribution of latencies per function.

Depends on D45758 and D46998.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
pelikan added inline comments.Apr 11 2018, 12:33 PM
91–92 ↗(On Diff #138879)

f0 → f1 → setjmp ↓(f1→f0) longjmp ↓(f1↓f0) longjmp ↓(f1↓f0) should generate lots of exits but only one entry. Or when the profiling starts in a signal handler.

So it shouldn't be "impossible", just "infrequent" :-)

106–110 ↗(On Diff #138879)

Test name says "MissingIntermediaryEntry" but this is missing an intermediary *exit*.

145–150 ↗(On Diff #138879)

Same, please use multiplies of 100 for TSCs. I wonder whether we should test the TSC time series not being non-decreasing due to TSC mismatches when rescheduling among poorly synchronized CPU packages.

21 ↗(On Diff #138879)

did you mean: "the only one we actually care about"?

If we "only care" about it, what more can we do about it? :-)

47 ↗(On Diff #138879)

I would at least assert the buffer's size is within some reasonable bounds - has the trailing bit and a function list. Maybe also the zero sentinels in places.

77 ↗(On Diff #138879)

Again, some assertions to make sure both threads are reflected in that buffer would be nice. Doesn't have to be too strict.

71 ↗(On Diff #138879)

Please also test what happens when you do

Array <TestData> data;
auto it = data.begin();

Because I think you'll find Offset will be SIZE_MAX. Not sure we want that.

94 ↗(On Diff #138879)

assume NewChain == nullptr. (or BackingStore for that matter)

50 ↗(On Diff #138879)

Why do we need the volatile? It's a global, there's very little optimization the compiler can do anyway... I'd like to see what I missed, thinking it'd be OK without it.

42 ↗(On Diff #138879)

So, actually, I never liked linked lists where the prev/next pointers are in a separate region of memory, because that tends to worsen the cache miss rate when you walk through the list, and when the points at which these Chunk things are allocated are reasonably randomized along with the actual data allocations to confuse the CPU prefetcher. Which is why I've always been using LIST/TAILQ versions from queue(3) as they embed these to the structures they're listing. I'm not saying you should rewrite all of this now, but have you thought about putting the prev/next into the T somehow? Is that even possible to do with C++ templates?

44 ↗(On Diff #138879)

Why is this necessary, and we can't just use N?

61 ↗(On Diff #138879)

InternalAlloc can return nullptr.

153 ↗(On Diff #138879)

I suppose these were your debugging statements which can go away (and below).

242 ↗(On Diff #138879)


dberris updated this revision to Diff 142590.Apr 15 2018, 8:12 PM
dberris marked 29 inline comments as done.
  • fixup: Address comments
36 ↗(On Diff #138879)

This is not a valid assertion, since Counter is not relevant to the allocator's observable properties.

39 ↗(On Diff #138879)


39–40 ↗(On Diff #138879)

Can you explain better why using 100, 200, 300 as opposed to 1, 2, 3 is better?

91–92 ↗(On Diff #138879)

Right, changed to "rare".

For now, we've not made special support for setjmp/longjmp instrumentation. While there's a possibility we can do that in the future, we're not counting on being able to differentiate that for now.

The signal handler case is precisely the one we're looking to support here, but that's just one case.

145–150 ↗(On Diff #138879)

Good point. I'll leave a TODO for that. In particular we actually need to keep track of the CPU ID instead of just the TSC when we're building the shadow stack. That will let us track the migration of the thread(s).

It's still not clear to me why using multiples of 100 is important. These are just arbitrary numbers anyway, it shouldn't matter what order of magnitude they are.

21 ↗(On Diff #138879)

I don't see the difference. English is hard.

"the only one we care about" == "the one we actually only care about"

There are other use-cases for this collection API, some of which we don't cover in this unit test (yet).

In particular, we could be collecting snapshots of the function call trie for a function every so often, and associating a timestamp to that, so we can show profiles over time instead of a single profile.

47 ↗(On Diff #138879)

Some of these details aren't really relevant to the unit test. For example:

  • The size of the block is dependent on how we've decided to serialise the data. Yes we can assert that the size is not zero (doing that now).
  • The function list could be in any order. It's not a relevant feature of the API. What we care about is that we're able to get the data. The concern of parsing this data is not really at this level of the unit test (I'd rather we have an actual end-to-end test that would get this information).
  • We're testing that we can get a buffer that's not the empty buffer, which tells us enough information to say that this API in particular is holding its promises based on the preconditions and postconditions.
71 ↗(On Diff #138879)

That's technically testing for undefined behaviour -- i.e. outside of the contract of the container. :)

12–13 ↗(On Diff #138879)

We already do that, because we're relying on the underlying allocator for sanitizer_common. There's already a way to provide alternate implementations of those in that regard. All the backing store we have is gotten from sanitizer_common as opposed to using our own calls to mmap directly.

116–117 ↗(On Diff #138879)

There are some potentially subtle issues with using unsigned in these variables. Some of them are:

  • Forcing a value to be unsigned causes the compiler to implement modular arithmetic, even if we don't ever expect that values will wrap-around.
  • Doing zero-sign extension is not cheap. We want to make these values as cheap as possible to update.

Also, we cannot make the function ID unsigned, because the value we're getting from XRay is a signed number (int32_t). The conversion will not be faithful, and we've avoided unsigned in those cases for similar reasons.

If we decide in the future that we can actually get away with unsigned values for function ids (which I think we can) then we can change all the XRay implementations to take unsigned values for the function id, etc. -- most of which is not really worth the cost.

131 ↗(On Diff #138879)

This is an optimisation, so that we don't actually need to reach into the pointer just to get the function id of the function at the top of the stack.

152 ↗(On Diff #138879)

This is necessary because users of this nested type need to access these exported types. In this case, because NodeRefAllocatorType is part of the FunctionCallTrie type, we're re-exporting this type through Allocators which is a public type.

255–256 ↗(On Diff #138879)

Because the comment is an implementation detail, it's explaining what it's doing rather than what users need to expect (i.e. it's not documentation of the contract, it's documentation of the implementation detail).

315–319 ↗(On Diff #138879)

Good point. Updated the comment and the implementation to make it clear that we're *not* destroying the state of the FunctionCallTrie in O.

31 ↗(On Diff #138879)

Note, the intent here is to use the GlobalMutex to lock operations on the ThreadTries vector. We shouldn't be holding a lock on the GlobalMutex while in the process of copying the FunctionCallTrie.

31–32 ↗(On Diff #138879)

Good point. It started as a class that had member variables, until it evolved to a global implementation, which is better just as a namespace.

42 ↗(On Diff #138879)

What you're talking about is intrusive lists. These work only if you're doing a linked list of elements, but in this case it's the chunks we're linking together. Each chunk will have a block, which is what we're managing. All the chunks come from the same region of memory.

44 ↗(On Diff #138879)

Usability -- because you can do:

Chunk C;
assert(C.Size > 0);

Yeah, unfortunately without these clang-format gets confused. :D

Can this possibly be split up? It's way too long to easily review and seems to at least have two sets of functionality.

dberris updated this revision to Diff 142893.Apr 18 2018, 1:18 AM

Split into smaller parts.

dberris updated this revision to Diff 142895.Apr 18 2018, 1:33 AM
dberris retitled this revision from [XRay][compiler-rt] XRay Profiling Mode to [XRay][profiler] Part 4: Profiler Mode Wiring.
dberris edited the summary of this revision. (Show Details)
dberris removed a subscriber: llvm-commits.

Retitled, and updated to reflect breakup into smaller parts.

dberris updated this revision to Diff 142896.Apr 18 2018, 1:34 AM
dberris added a subscriber: llvm-commits.

Adding back llvm-commits.

kpw added a comment.May 23 2018, 6:05 PM

This one is more straightforward than the previous in the change. The main ideas all fall into place nicely, but I have pointed out some details that could use some attention.


Can you expand the initialism to thread-local data in a comment? I always think TopLevelDomain when I see this and after jumping to definition that would help refresh my memory.


Don't you need pthread_create first for ProfilingKey?


Seems to me this should be memory_order_acquire_release if we want mutual exclusion of profileCollectorService::reset() from another thread.


Maybe check verbosity and Report?


I think this should be wrapped in an "if (TLD.FCT)" block.

It's definitely an edge case, but If ProfileLogStatus is INITIALIZED, and if the atomic load happens and before the next statement a context switch and update to FINALIZING happens, then the TLD can be unitialized, but the status check won't know about the FINALIZING statement and we'll deference a null TLD.FCT.

I think there is a similar problem with moving the GetTLD() before the atomic_load for the transition from FINALIZED to INITIALIZING. You could solve it by having GetTLD return the TLD reference and status from the load.

It also might be worth documenting that InternalAlloc won't return null (as opposed to FCT allocator, so we're relying on that.


Not scoped to profiling mode: This stragegy still makes me uncomfortable for cases where not much is instrumented (e.g. event tracing), but I don't have a better solution fleshed out.

It kind of feels like a cooperative scheduling problem where threads should check periodically if they're cancelled. Maybe we could have an option to add sleds that just do a finalizing check without instrumenting so that sparse instrumentation is able to respect the grace period.


Is this protected from simultaneous calls to postCurrentThreadFCT if other threads take a while to see they're finalized.


Do you have that graph of valid state transitions? I though it was OK to go from FINALIZED back to INITIALIZED without going back to UNITIALIZED.


Can these error for bad/missing flags?


Ahh. Can you just make a comment near the pthread_set_specific that INITIALIZE is responsible for calling pthread_key_create.


Is it xray-profiler or xray-profiling? Does the flag not match the mode string in code? You assert on xray-profiling and set the mode to that in code below.


Do you need a thing to exclude windows since you're calling readtsc()


Yes please. ;)


Similar flag confusion.


Could be illustrative and increase coverage to have a test case that verifies that profiling mode can turn back on after a "round."

kpw added a comment.May 23 2018, 6:05 PM

in the chain* I meant.

dberris updated this revision to Diff 149413.Jun 1 2018, 2:19 AM
dberris marked 9 inline comments as done.
  • fixup: Use updated name for flags
  • fixup: address comments by kpw@, rename to use profiling instead of profiler
dberris planned changes to this revision.Jun 1 2018, 2:19 AM
dberris added inline comments.

Renamed to ProfilingData instead.


Yeah, we'll need to refactor this to instead use the same reentrance guard across all the implementations (FDR and Basic). I'll add a dependency to the C++ ABI changes which has those changes.


Yep, we talked about this offline and we'll need to fix this across the implementations anyway. Let's fix that later.


serialize() has internal synchronisation, so we're relying on that synchronisation to do the right thing.


Yeah, unfortunately it seems that we're going to need to make it so that once an implementation has flushed, it should go back to UNINITIALIZED. Either that, or we're going to have to be a bit more clever about this.


Yes, but we're really just ignoring them here -- the parser will already report if the verbosity is high enough.


There's two parts here -- there's the mode, which in this case is a name for the implementation. We're using "profiler" to be consistent with "flight data recorder" and "basic". We could just make this "xray-profiling" all throughout, which I think would make it much simpler -- and pretend that "FDR" is "flight data recording" instead. ;)


Good point. Yes, need to require Linux for now.


Good call, let me do that in the next round.

dberris edited the summary of this revision. (Show Details)Jun 1 2018, 2:22 AM
dberris removed a reviewer: echristo.
dberris updated this revision to Diff 150449.Jun 7 2018, 10:56 PM

Rebase after removing ABI reliance and refactoring of RecursionGuard.

  • fixup: s/__sanitizer:://g + clang-format
  • fixup: use the common recursion guard
  • fixup: remove C++ ABI dependency
  • fixup: do two rounds of profiling

This is now ready for a look @kpw.

kpw added a comment.Jun 8 2018, 6:16 PM

I'm going to have to download the renamed files to diff locally. Is there a way to do this in Differential that I'm missing?


I think this still must be done before you submit.

dberris marked 5 inline comments as done.Jun 11 2018, 4:55 PM
In D44620#1127115, @kpw wrote:

I'm going to have to download the renamed files to diff locally. Is there a way to do this in Differential that I'm missing?

I'm not sure, but it looks like Differential already sees the rename/merge -- is there something else you're looking for?


I looked into this and we already only enable all the tests for Linux from the lit configurations.

kpw accepted this revision.Jun 11 2018, 6:35 PM
This revision is now accepted and ready to land.Jun 11 2018, 6:35 PM
This revision was automatically updated to reflect the committed changes.
dberris marked an inline comment as done.