This is an archive of the discontinued LLVM Phabricator instance.

[XRay][profiler] Part 2: XRay Function Call Trie
ClosedPublic

Authored by dberris on Apr 18 2018, 1:03 AM.

Details

Summary

This is part of the larger XRay Profiling Mode effort.

This patch implements a central data structure for capturing statistics
about XRay instrumented function call stacks. The FunctionCallTrie
type does the following things:

  • It keeps track of a shadow function call stack of XRay-instrumented functions as they are entered (function enter event) and as they are exited (function exit event).
  • When a function is entered, we push an entry containing the entry TSC onto the shadow stack and update the trie (or prefix tree) representing the current function call stack. If we haven't encountered this function call before, this creates a unique node for the function at this position in the stack. We also update the parent function's list of callees to reflect this newly found path.
  • When a function is exited, we compute statistics (TSC deltas, function call count frequency) for the associated function(s) up the stack as we unwind to find the matching entry event (a rough sketch of this bookkeeping follows below).
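
For illustration only, here is a rough sketch of that bookkeeping using standard containers. This is not the code under review (which builds on the custom Allocator and Array types from Part 1); all names here are invented for the example, and nodes are intentionally leaked for brevity:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of a single trie node: call count and local time, plus children.
struct Node {
  int32_t FuncId = 0;
  uint64_t CallCount = 0;
  uint64_t CumulativeLocalTime = 0;             // TSC ticks spent in this node only.
  std::unordered_map<int32_t, Node *> Callees;  // Children, keyed by function id.
};

class FunctionCallTrieSketch {
  struct ShadowEntry {
    Node *N;
    uint64_t EntryTSC;
    uint64_t CalleeTime;  // TSC ticks already attributed to direct callees.
  };
  std::vector<Node *> Roots;
  std::vector<ShadowEntry> ShadowStack;

public:
  void enterFunction(int32_t FuncId, uint64_t TSC) {
    Node *Child = nullptr;
    if (ShadowStack.empty()) {
      // Top-level call: reuse an existing root for this function, or make one.
      for (Node *R : Roots)
        if (R->FuncId == FuncId)
          Child = R;
      if (Child == nullptr) {
        Child = new Node;
        Child->FuncId = FuncId;
        Roots.push_back(Child);
      }
    } else {
      // Nested call: reuse or create the callee node under the current top.
      auto &Callees = ShadowStack.back().N->Callees;
      auto It = Callees.find(FuncId);
      if (It != Callees.end()) {
        Child = It->second;
      } else {
        Child = new Node;
        Child->FuncId = FuncId;
        Callees[FuncId] = Child;
      }
    }
    ++Child->CallCount;
    ShadowStack.push_back({Child, TSC, 0});
  }

  void exitFunction(int32_t FuncId, uint64_t TSC) {
    // Unwind until the matching entry; a deeper function that never saw its
    // own exit event is treated as if it exited at this same TSC.
    while (!ShadowStack.empty()) {
      ShadowEntry Top = ShadowStack.back();
      ShadowStack.pop_back();
      uint64_t Elapsed = TSC - Top.EntryTSC;
      Top.N->CumulativeLocalTime += Elapsed - Top.CalleeTime;
      if (!ShadowStack.empty())
        ShadowStack.back().CalleeTime += Elapsed;  // Credit the caller's callee time.
      if (Top.N->FuncId == FuncId)
        break;
    }
  }
};
```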

This builds upon the XRay Allocator and Array types in Part 1 of
this series of patches.

Depends on D45756.

Event Timeline

dberris created this revision.Apr 18 2018, 1:03 AM
kpw added a comment.May 13 2018, 10:34 PM

This was easier to review than I expected. I found it easier to follow than the allocator/array CL. Sorry for the long delay!

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
56

Might as well test MissingFunctionExit as well for symmetry.

83–91

The way this test is written, the two roots look exactly identical, so a possible error case where the trie returns the same root twice would go undetected.

You might have one function spend more time so that this is detected. Then you might want the test to assert ignoring order. Up to you whether it's worth it.

140–142

Imho, this would be easier to interpret if the nodes captured CumulativeTreeTime instead of CumulativeLocalTime.

Then it would be
F3 -> 100
F2 -> 300
F1 -> 400

190–192

Is sharing an initialized allocator expected to be OK? Why not use the same allocator for FunctionCallTrie Merged below then?

218–220

Only check the root? There are only two other nodes to check; might as well verify them.

compiler-rt/lib/xray/xray_function_call_trie.h
37

visulaise -> visualize

95

There are no reference members here. Maybe call it NodeIdPair?

217

Don't think so. This looks like the end of "struct Allocators {" to me.

318

To me, it is less intuitive to capture CumulativeLocalTime than CumulativeTreeTime in each Node. With one you can compute the other given the callees, but if I'm analyzing function latencies, I care about time spent in the callees just as much.

Why did you make this choice?

336–337

Should this be called with non-empty destinations? Can we just CHECK that it is not? Having duplicate roots stinks.

357

Do you have to handle the Allocator failing here with a null check? Are you just relying on those cases artificially pruning the traversal?

380

What's the thread safety of this and deepCopy? It seems like they shouldn't be called when functions are being intercepted. How can we make sure that invariant is preserved?

393–401

Should this be pulled out of the root loop so we can use a single stack and array instead of 1 per root node?

408–409

You have some TODOs elsewhere to update the histograms. Might put one here as well.

compiler-rt/lib/xray/xray_profiler_flags.cc
35

Is the "#Name" intentional? I'm not very fluent in macros, but that popped out.

dberris updated this revision to Diff 146562.May 14 2018, 2:32 AM
dberris marked 14 inline comments as done.
  • fixup: Address comments by kpw@
compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
83–91

Good point -- I've instead asserted that R0 and R1 don't have the same function id.

140–142

I actually thought about counting both, but realised that there's a property here that allows us to record just one of them and derive the other.

I initially thought about it this way:

CTT(N) = CLT(N) + sigma(i = 1 -> |callees(N)|) CTT(callee(N)[i])

is equivalent to:

CLT(N) = CTT(N) - sigma(i = 1 -> |callees(N)|) CTT(callee(N)[i])

Where CTT is "Cumulative Tree Time" and CLT is "Cumulative Local Time". Can we prove that this property holds, and that measuring just one of them is sufficient to derive the other?

To do this properly I'll introduce a notation:

(f -> f') @ t[n+0]
(f <- f') @ t[n+1]

Where f and f' are function IDs, '->' represents a "calls" relationship, '<-' represents an "exits" relationship, and t is a timestamp (n denotes the order of timestamps by appearance). Given the following sequence of events:

(f1 -> f2) @ t[0]
(f2 -> f3) @ t[1]
(f2 <- f3) @ t[2]
(f2 -> f3) @ t[3]
(f2 <- f3) @ t[4]
(f1 <- f2) @ t[5]

Here, if we think about the cumulative local times, we might think that:

CTT(f1) = CTT(f2) + CLT(f1)
CTT(f2) = CTT(f3) + CLT(f2)
CTT(f3) = 0 + CLT(f3)

As per the formula above.

If we expand this:

CTT(f3) = 0 + (t[4] - t[3]) + (t[2] - t[1])
CTT(f2) = CTT(f3) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
CTT(f1) = CTT(f2) + 0

Let's expand it further:

CTT(f2) = 0 + (t[4] - t[3]) + (t[2] - t[1]) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])

CTT(f2) = (t[5] - t[4]) + (t[4] - t[3]) + (t[3] - t[2]) + (t[2] - t[1]) + (t[1] - t[0]) + 0

CTT(f2) = t[5] - t[0]

This is what we'd expect for computing CTT from CLT. Can we do the reverse though?

CLT(f2) = CTT(f2) - CTT(f3)    (by the formula above; f3 is f2's only callee)
CLT(f2) = (t[5] - t[0]) - (0 + (t[4] - t[3]) + (t[2] - t[1]))
CLT(f2) = (t[5] - t[0]) - ((t[4] - t[3]) + (t[2] - t[1]))
CLT(f2) = (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])

QED

There's an argument for doing either, but we make the trade-off of capturing Cumulative Local Time at runtime, for the following reasons:

  • Using CLT reduces the risk of us overflowing counters.
  • We need CLT anyway for generating the histogram of latency for a particular function in the stack context.
  • CLT allows us to better account for time when we're unwinding the stack, in case we find an exit for a function that was entered "higher up".

Does that make sense?

Now I kind-of want to write up that formula somewhere more persistent, rather than just in a review thread. :)
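
To make that concrete, here's a minimal sketch (ordinary C++, not the patch itself; the node shape is assumed just for this example) of reconstructing CTT offline from the recorded per-node CLT with a post-order walk, which is why recording only CLT at runtime is enough:

```cpp
#include <cstdint>
#include <vector>

// Assumed node shape for this example only.
struct Node {
  uint64_t CumulativeLocalTime = 0;  // CLT(N), recorded at runtime.
  std::vector<Node *> Callees;
};

// CTT(N) = CLT(N) + sum over callees c of CTT(c), so a post-order traversal
// recovers tree time from the recorded local times.
uint64_t cumulativeTreeTime(const Node &N) {
  uint64_t Total = N.CumulativeLocalTime;
  for (const Node *Callee : N.Callees)
    Total += cumulativeTreeTime(*Callee);
  return Total;
}
```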

190–192

Yes. The reason we're not using the same allocator for the FunctionCallTrie merging test is that we want to make sure that functionality works -- we do this when transferring a FunctionCallTrie from a thread to the central service (in the next patch). We want to have thread-local allocators, and the central storage service will then use a single FunctionCallTrie for the "merged" version of the FunctionCallTries from all the threads.

compiler-rt/lib/xray/xray_function_call_trie.h
318

Explained in the comment above... with some math/proof. :)

336–337

Yes, fixed with a DCHECK.

357

Good question. I'm definitely relying on the artificial pruning. Writing a comment as a TODO to figure out what to do in case of failure.

380

Good question. They are thread-compatible (will need external synchronisation).

compiler-rt/lib/xray/xray_profiler_flags.cc
35

Yes, this turns the argument into a string.
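
For reference, a minimal, generic example of the stringizing operator (this is not the actual flag-registration macro from xray_profiler_flags.cc; the names are made up):

```cpp
#include <cstdio>

// In a macro body, '#' followed by a parameter name expands to a string
// literal containing the spelling of the argument.
#define FLAG_NAME(Name) #Name

int main() {
  std::printf("%s\n", FLAG_NAME(my_flag));  // Prints: my_flag
  return 0;
}
```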

kpw accepted this revision.May 14 2018, 8:39 AM
kpw added inline comments.
compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
140–142

Thanks for the detailed response. I agree with all of your points except the claim that CLT is the right choice for latency histograms.

Responding to your points in turn:

  1. Yes, CLT and CTT can each be derived from the other.
  2. Overflow is less likely with CLT. I haven't done the math for a typical CPU to approximate the risk for a 32 bit uint.
  3. For either CLT or CTT, the "higher up" exit is handled the same way: as if there were simultaneous exits for all the functions until the matching function id.

And finally, latency histograms of CTT can't be used to derive CLT and vice-versa. This is where we actually have to make a choice between the two or compute both.

Imagine a function "processRpc(const MyRequest&, MyResponse*)". When I want to know the 95th percentile of its latency, I am interested in CTT, not CLT. All the time spent in lower levels of the stack affects the observed latency. Callers care about how long the function blocks their execution. The person trying to optimize a function may care about CLT when choosing which code path to target.

Let's cross this bridge when we add histograms.

compiler-rt/lib/xray/xray_function_call_trie.h
336–337

You should update the comment to say that the operation should not be called with a non-empty destination. You've tightened the contract.

380

I'm curious how this will work for the data collector service ensuring that function traces won't interfere with flushing.

This revision is now accepted and ready to land.May 14 2018, 8:39 AM
dberris updated this revision to Diff 146723.May 14 2018, 5:26 PM
dberris marked 2 inline comments as done.
  • fixup: Address comments by kpw@
  • fixup: Rename flags to remove xray_profiling_ prefix.
compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
140–142

> And finally, latency histograms of CTT can't be used to derive CLT and vice-versa. This is where we actually have to make a choice between the two or compute both.

Yes!

> Let's cross this bridge when we add histograms.

Agreed.

Thanks!

compiler-rt/lib/xray/xray_function_call_trie.h
380

Yeah, Part 3 actually does external synchronisation to ensure that when posting, we're making a copy and merging to a global. There's some interesting work happening there, but not as interesting as the stuff we're doing here. ;)
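
Roughly the shape of that, as a hedged sketch (the type and method names below are placeholders, not the actual Part 3 code): enter/exit handling keeps touching only the thread-local trie, and the posting path serialises the copy-and-merge into the global trie with a lock.

```cpp
#include <mutex>

// Placeholder type: only the synchronisation pattern is the point here.
struct FunctionCallTrie {
  // Hypothetical operations with bodies elided.
  void deepCopyInto(FunctionCallTrie & /*Target*/) {}
  void mergeInto(FunctionCallTrie & /*Target*/) {}
};

FunctionCallTrie GlobalTrie;  // Central, merged storage for all threads.
std::mutex GlobalTrieMutex;   // External synchronisation for posting.

// Called when a thread is done recording (e.g. thread exit or a flush);
// the function enter/exit handlers must no longer be mutating this trie.
void postThreadTrie(FunctionCallTrie &ThreadLocalTrie) {
  FunctionCallTrie Copy;
  ThreadLocalTrie.deepCopyInto(Copy);  // Copy without holding the global lock.
  std::lock_guard<std::mutex> Lock(GlobalTrieMutex);
  Copy.mergeInto(GlobalTrie);  // Merge into the global trie under the lock.
}
```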

This revision was automatically updated to reflect the committed changes.