This is an archive of the discontinued LLVM Phabricator instance.

[XRay][profiler] Part 3: Profile Collector Service
ClosedPublic

Authored by dberris on Apr 18 2018, 1:05 AM.

Details

Summary

This is part of the larger XRay Profiling Mode effort.

This patch implements a centralised collector for FunctionCallTrie
instances, associated per thread. It maintains a global set of trie
instances which can be retrieved through the XRay API for processing
in-memory buffers (when registered). Future changes will include the
wiring to implement the actual profiling mode implementation.

This central service provides the following functionality:

  • Posting a FunctionCallTrie associated with a thread, to the central list of tries.
  • Serializing all the posted FunctionCallTrie instances into in-memory buffers.
  • Resetting the global state of the serialized buffers and tries.

Depends on D45757.

Diff Detail

Event Timeline

dberris created this revision.Apr 18 2018, 1:05 AM
dberris updated this revision to Diff 146732.May 14 2018, 6:33 PM
  • Rebase
    • fixup: Adjust names to use non-prefixed flag names
kpw added a comment.May 17 2018, 12:54 AM

My overall feedback about this code is that after reading it, it probably deserves another pass over just to focus on what allocates and frees and which state transitions are responsible for closing the loops.

This is an easy thing to get wrong or underspecify, so it might be worth putting your skeptic hat on and picking apart the contracts as well.

Otherwise, a few questions but I think it makes sense for the most part.

compiler-rt/lib/xray/tests/unit/profile_collector_test.cc
72 ↗(On Diff #146732)

Would be better (and more difficult) to validate the contents of the block in addition to the header. This doesn't verify that any of the profiling data is correct and consistent with the Trie.

compiler-rt/lib/xray/xray_profile_collector.cc
71 ↗(On Diff #146732)

Why should we be using InternalAlloc instead of GlobalAllocators for these? Seems counterintuitive but this is a lot of stuff to keep track of so I could be thinking about it wrong.

85 ↗(On Diff #146732)

Reverse order is not as clear as "ordered from the deepest callee's funcId to the root caller's funcId" or something like that.

Is the funcId for the Node also also in the array or is it elided?

104 ↗(On Diff #146732)

Nit: Could be stated "We walk a depth first traversal of the FunctionCallTrie from the root to generate ..."

109–112 ↗(On Diff #146732)

This works equally well with the stack moved out of the loop, saving reallocation for each root without consequence for understanding.

144–148 ↗(On Diff #146732)

Does PathArray have a constant time size method? I've never regretted a length-prefix encoding compared to sentinel termination where possible.

Edit: Just checked xray_segmented_array.h and size is constant time. I think we can design a seekable format (seek by record) without space or time overhead and I'd encourage a seekable format as it's more flexible.

160–164 ↗(On Diff #146732)

Does having a record delimiter actually get us anything meaningful? The format is not seekable by record as is. If we could make it seekable without adding overhead that would be best, otherwise, what do you think about removing the delimiter since it's implied by the position of the end of function ids sentinel?

Edit: Block header has total size. That might be sufficient and we can decide whether to remove record delimiter or keep it as sanity check framing.

218 ↗(On Diff #146732)

Looks like this is InternalFreed on the next call to serialize. It seems more simple to scope the allocation and free-ing to a single call to serialize. Why are we doing it the other way?

Edit: This is the exfiltration mechanism via the nextBuffer() function.

258–259 ↗(On Diff #146732)

I don't understand what's meant here. :(

compiler-rt/lib/xray/xray_profile_collector.h
33 ↗(On Diff #146732)

assocaited ->associated

48 ↗(On Diff #146732)

This pushes the need to synchronize into the caller that controls when the state can change from FINALIZED to INITIALIZED.

It's too big of a change to introduce a new state, but we should either.

  1. Point out that this synchronization is necessary by the caller in this documentation or
  2. Have the __xray_init or whatever calls fail when this processing is occuring.
67 ↗(On Diff #146732)

"reverse call order" is unclear for a Trie. Do functions with multiple callees have duplicate entries? There are no parent pointers described here.

Edit: After reading through the CL, I realized I misunderstood the rest of the data (such as call count, etc.) to be associated with each entry in the PathArray, while in reality I think it's just the function ids. I think maybe this could be cleared up by using proto like language while keeping the indentation. E.g.

repeated record:

  • repeated function_id ordered from callee to root:
  • call count
  • cumulative ...

There's nothing wrong with the current comment, but I misunderstood it and looking to clarify.

dberris updated this revision to Diff 147732.May 20 2018, 10:22 PM
dberris marked 8 inline comments as done.
  • fixup: Update to address comments by kpw@
dberris added inline comments.May 20 2018, 10:22 PM
compiler-rt/lib/xray/tests/unit/profile_collector_test.cc
72 ↗(On Diff #146732)

Yeah, unfortunately that's too much coupling to be testing at this point -- my thought was to do that testing on a end-to-end basis, so we can update the format and only determine the necessary parts here (i.e. since we only need the block size, number, and thread-id). For instance, when we do have the profile saving mechanism (in Part 5) and a loader (forthcoming) we can tweak the details of what's actually in the profile more rigorously.

compiler-rt/lib/xray/xray_profile_collector.cc
71 ↗(On Diff #146732)

We're using a Vector here, instead of a segmented array for two reasons:

  1. We cannot put non-trivally destructible things in the segmented array.
  2. We leverage the contiguous nature of the Vector implementation here for when we iterate through the FunctionCallTrie elements.

We're using InternalAlloc(...) here as well to not be dependent on the restrictions on the managed allocator we're using for the data structures.

Documenting some of these to keep the context.

85 ↗(On Diff #146732)

The path is just the function id's of the function in a stack trace. This allows us to identify each call stack uniquely, in leaf-to-root order.

144–148 ↗(On Diff #146732)

Yes, PathArray is an __xray::Array<...> which stores the size as a member, so O(1) check on the size.

I thought about always having size prefixes, but that makes it much harder to write -- i.e. reserving space for the size of the data before writing it. The value for a seekable format isn't as high at this time, we just need a way to store this in a reasonably compact format, and assume it will be read in a streaming fashion.

160–164 ↗(On Diff #146732)

Good point. Removing the end-of-record sentinel gets us more bytes.

258–259 ↗(On Diff #146732)

Oops -- yeah, that's fixed. We can now tell which block number the buffer represents. :)

compiler-rt/lib/xray/xray_profile_collector.h
48 ↗(On Diff #146732)

We already do this -- the "caller" in this case will be the XRay implementation that's using this service. In particular, the only user of this is the profiling mode implementation.

In particular, this function doesn't interfere with any of the higher level states -- all the synchronisation it needs is internal. The documentation is saying that callers of this function only need to worry about synchronising the FunctionCallTrie it has access to, if it needs to. The implementation we have in the profiling mode uses thread-local FunctionCallTrie objects, which means from the thread that has access to that trie, it doesn't need to synchronise anything.

The next patch implements the finalization routine, which deals directly with this service's APIs "correctly".

Is there something else that will make this documentation more clear from that perspective?

67 ↗(On Diff #146732)

This essentially represents the data associated per unique stack we've encountered. I think you got it, but it's unclear how to change this comment to make it clearer. A "path" is essentially the stack trace in leaf-to-root order, which is guaranteed unique in a prefix tree.

echristo resigned from this revision.May 21 2018, 12:13 AM

Outside of my vague worry that this is really depending on a lot of things that shouldn't be in compiler-rt and probably in it's own library I don't see a problem with you handling what's in the xray library yourself. You and kpw probably want to review each other's code here, but unless you're touching the rest of compiler-rt I don't think I need to be involved.

dberris updated this revision to Diff 148852.May 29 2018, 12:28 AM
dberris marked 6 inline comments as done.
  • fixup: Also test the path serialisation
  • fixup: Use PTHREAD_ONCE_INIT for the pthread_once_t
  • fixup: update top matter
kpw accepted this revision.May 30 2018, 7:03 PM

Aside from the discussion about a more invasive change to the way cooperative state transitions, this is fine. Thanks for writing the test for serialization.

As discussed that's a big enough patch it should live on its own. If you want to open a bug for it, that would also be great.

compiler-rt/lib/xray/xray_profile_collector.cc
85 ↗(On Diff #146732)

Yep. Makes sense. You can mark this as done.

144–148 ↗(On Diff #146732)

I find the value in prefix length encodings that it's easier to find bugs with a codec that gets out of sync, but agree it's not high value at this time.

compiler-rt/lib/xray/xray_profile_collector.h
48 ↗(On Diff #146732)

I phrased my question poorly. I'm not worried about thread synchronization. I was trying to reason through whether there are illegal sequences of possible state transitions within a thread.

As we discussed over chat, each generation of xray state transitions manages thread local state.
xray_init sets up state.
xray_flush tears it down and calls post().

Since these state transitions only run when threads are intercepted and the handlers are invoked, we can have problems where no instrumented function is called for a thread when in XRAY_FINALIZED or similar.

As a follow on patch, we discussed having thread local generation data that tracks the state transitions and cooperative progress across the per-thread state machines that each handler function participates in.

This revision is now accepted and ready to land.May 30 2018, 7:03 PM
dberris marked 8 inline comments as done.May 30 2018, 8:31 PM

Thanks, Keith!

Landing soon -- I think using generation IDs will apply both to the profiling mode and FDR mode. Let me file a bug that will capture some of the state and allow us to track the work as well.

Cheers

This revision was automatically updated to reflect the committed changes.
Herald added a subscriber: Restricted Project. · View Herald TranscriptMay 30 2018, 9:38 PM