This is an archive of the discontinued LLVM Phabricator instance.

[XRay] [compiler-rt] Unwriting FDR mode buffers when functions are short.
ClosedPublic

Authored by kpw on Mar 24 2017, 11:23 AM.

Details

Summary

"short" is defined as an xray flag, and buffer rewinding happens for both exits
and tail exits.

I've made the choice to seek backwards finding pairs of FunctionEntry, TailExit
record pairs and erasing them if the FunctionEntry occurred before exit from the
currently exiting function. This is a compromise so that we don't skip logging
tail calls if the function that they call into takes longer our duration.

This works by counting the consecutive function and function entry, tail exit
pairs that proceed the current point in the buffer. The buffer is rewound to
check whether these entry points happened recently enough to be erased.

It is still possible we will omit them if they call into a child function that
is not instrumented which calls a fast grandchild that is instrumented before
doing other processing.

Event Timeline

kpw created this revision.Mar 24 2017, 11:23 AM
kpw added a comment.EditedMar 24 2017, 12:45 PM

D'oh. Just realized that I need to do cycle math to convert from TSC ticks to us.

I've also got a program that I'm trying to see output from, and there may be some issues for me to iron out. I'm not seeing any records after the fdr_header.
I may be doing the setup incorrectly. I'm going to recompile compiler-rt without my change to see the difference in output.

#include <stdlib.h>
#include <xray/xray_fdr_logging.h>
#include <xray/xray_interface.h>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr auto kBufferSize = 16384;
constexpr auto kBufferMax = 10;

[[clang::xray_always_instrument]] void __attribute__((noinline)) fC() {
std::printf("In fC()\n");
}

[[clang::xray_always_instrument]] void __attribute__((noinline)) fB() {
std::printf("In fB()\n");
fC();
}

[[clang::xray_always_instrument]] void __attribute__((noinline)) fA() {
std::printf("In fA()\n");
fB();
}

[[clang::xray_always_instrument]] void __attribute__((noinline)) fD() {
std::printf("In fD()\n");
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}



int main(int argc, char* argv[]) {
using namespace __xray;
FDRLoggingOptions Options;
Options.ReportErrors = true;
char TmpFilename[] = "fdr-logging-test.XXXXXX";
Options.Fd = mkstemp(TmpFilename);
fdrLoggingInit(kBufferSize, kBufferMax, &Options, sizeof(FDRLoggingOptions));
__xray_set_handler(fdrLoggingHandleArg0);
printf("Patching...\n");
__xray_patch();
fA();
fC();
fC();
fD();
printf("Unpatching...\n");
__xray_unpatch();
fdrLoggingFinalize();
fdrLoggingFlush();
return 0;
}

kpw added a comment.Mar 24 2017, 2:35 PM

I found a few problems with my test driver program.

__xray_set_handler(fdrLoggingHandleArg0) doesn't need to be called because initLogging calls it.
Also finally and flush need to be called before unpatch for there to be any chance of the handler releasing the buffer back to the buffer queue.

I made some changes, but I'm still looking into it. The handler never realizes that the buffer has been finalized, so flush doesn't have anything to work with. This is independent of my change. I think there may be a timing issue I should be paying attention to.

Once I understand the BufferQueue code a bit better, I may discover that the problem exists between the keyboard and chair.

kpw added a comment.Mar 24 2017, 6:14 PM

I am making some headway. My understanding is that FDR logging short circuits when the log is finalizing or finalized without returning the buffer and is currently depending on the thread exit destructor to release the buffer.

This does not seem ideal, and I have a fix for that, but I'm going to iterate on it a bit to see if I can simplify the logic and make sure I account for edge cases so that nothing is logged after the EOB is written and the buffer is closed.

dberris edited edge metadata.Mar 26 2017, 4:20 PM
In D31345#710496, @kpw wrote:

I am making some headway. My understanding is that FDR logging short circuits when the log is finalizing or finalized without returning the buffer and is currently depending on the thread exit destructor to release the buffer.

There's two parts to this -- we should be returning the buffer as soon as we realize that the log is finalizing in the implementation. On the user-side, there should be a short grace period if you want to get more of the buffer before flushing. So there's two stages to this, there's the finalizing at the FDR level and there's the finalizing of the buffer queue.

This does not seem ideal, and I have a fix for that, but I'm going to iterate on it a bit to see if I can simplify the logic and make sure I account for edge cases so that nothing is logged after the EOB is written and the buffer is closed.

SGTM

lib/xray/xray_fdr_logging_impl.h
110–112

We should start by making this a flag right away, so we can test effectively.

509

I think this is what you meant that you need to do some math with. When do you so, consider not having to use floating point math -- if we can get away with approximations and division/masking with power of twos instead, that would really help with the efficiency. Something to keep in mind but probably not worth worrying about at this point.

511–513

Why are we copying from the log buffer into the local aligned record buffer? This seems unused now.

kpw added a comment.Mar 26 2017, 5:28 PM

This doesn't handle Tail calls correctly. I made a bad assumption about the trampoline sequence.

with fA() calling fB() calling fC() the records will be
Enter fA()
Tail Exit fA()
Enter fB()
Tail Exit fB()
Enter fC()
Exit fC()

I was expecting records to come out
Enter fA()
Enter fB()
Enter fC()
Tail Exit fA()

I'm changing my code to handle the truth. Hold off on review until I post an update.

kpw updated this revision to Diff 93102.Mar 27 2017, 1:08 AM

Changed handling of Tail call records, which I've exercised by modifying the trampoline
assembly to emit them.

Now the threshold is computed for a new buffer based on the cpu frequency and tail calls
calling into instrumented functions can be unwritten. The topmost tail exit can be replaced
with a normal exit record that records the time that the child returns multiple levels up the
stack. I think I may revert this logic because it seems better to potentially lose track of
a short amount of time in analysis of when the tail calling function is finished than perhaps
incorrectly link the tail call to information from the wrong function if its child is
uninstrumented.

kpw marked an inline comment as done.Mar 27 2017, 1:25 AM
kpw added inline comments.
lib/xray/xray_fdr_logging_impl.h
110–112

Is there a typical style of writing flags in compiler-rt I should follow or by flag do you just mean a global that can be set via a call to a header?

511–513

I should probably reinterpret cast directly from the buffer. I can make that fix easily enough.

kpw updated this revision to Diff 93169.Mar 27 2017, 1:01 PM

Saw xray-flags.h and added a flag for the fdr function time threshold.

kpw marked 2 inline comments as done.Mar 27 2017, 1:08 PM

I've taken Dean's suggestion to use a flag and work directly with the buffer. Seeing expected behavior when setting the flag in a test program.

I've also made the change to read directly from the buffer instead of creating new aligned storage.

Considering non-instrumented functions, I think I should remove the code that rewrites childless tail exits into normal exits. I'll post that in another revision.

kpw updated this revision to Diff 93182.Mar 27 2017, 3:05 PM

Removing rewrite of TailExit to Exit functions. This doesn't make sense if the children are
not instrumented and by definition, any removed instrumented child functions must have been
faster than the profiling threshold.

dberris accepted this revision.Mar 27 2017, 7:21 PM

LGTM -- but please see one important comment on correctness.

lib/xray/xray_fdr_logging_impl.h
511–513

Actually, now that I understand what you were doing, I think it's actually better that you copy to a local aligned buffer. This incarnation is actually running into undefined behaviour, since it's technically unsafe to alias a void* as something other than a char*. For correctness and efficiency, I recommend going back to the aligned buffer local.

Sorry I missed the intent the first time around. :(

This revision is now accepted and ready to land.Mar 27 2017, 7:21 PM

Oh, also, please update the description with the details of this current implementation.

kpw added a comment.Mar 27 2017, 9:01 PM

Is the undefined behavior you mention a strict aliasing violation?

I will use the copy to aligned storage to stay legal, but for the sake of argument and/or education, I'm curious whether to optimize for efficiency, we could sidestep the rule with language lawyer loopholes.

According to the standard[1], aliased access is explicitly undefined for glvalues, but the output of "*reinterpret_cast<...>()" is a temporary prvalue. I wonder if there is another clause to clarify what would happen if
instead of defining a name, the code just invoked reinterpret cast to get a temporary each time a value needed to be accessed.

1 - Standard quote:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

the dynamic type of the object,
a cv-qualified version of the dynamic type of the object,
a type similar (as defined in 4.4) to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
a char or unsigned char type.

kpw edited the summary of this revision. (Show Details)Mar 28 2017, 2:41 PM
kpw updated this revision to Diff 93304.Mar 28 2017, 2:44 PM
kpw edited the summary of this revision. (Show Details)

Using separate AlignedStorage<FunctionRecord> buffers when rewinding over records.

This doesn't violate the strict aliasing rule and since there are separate buffers for
the FunctionEntry and TailExit records, the compiler should be free to optimize for parallelism.

Still LGTM, just a couple more nits.

Can you clarify the order of when these changes can/should be landed? Are there dependencies amongst the changes?

lib/xray/xray_fdr_logging_impl.h
507–508

This one you can also probably invert and early return.

565

I think I understand what this means, but it might look better if you invert the test case and early return. So:

if (NumConsecutiveFnEnters > 0) {
  LastFunctionEntryTSC = LastFunctionEntryTSC - FuncRecord.TSCDelta;
  return;
}
// If we've rewound the stack of all function entries, ...

Now that we have ways to test this, it would be a good idea to rebase and then see that it's actually omitting records.

kpw updated this revision to Diff 93440.Mar 29 2017, 7:08 PM

Adding test cases for end to end FDR and unwinding FDR.

dberris added inline comments.Mar 29 2017, 7:23 PM
test/xray/TestCases/Linux/fdr-mode.cc
3 ↗(On Diff #93440)

Did you need to do this after a rebase? I've landed the change that already makes this line actually work, I think?

62–63 ↗(On Diff #93440)

The change upstream improves this a bit so that we're actually checking that the stuff is showing up in different threads and matching enters/exits. Consider a rebase?

kpw updated this revision to Diff 93540.Mar 30 2017, 2:10 PM

Rebased onto the lit test changes from Dean. Thanks Dean!

This revision was automatically updated to reflect the committed changes.