
[XRay] Custom event logging intrinsic
Closed, Public

Authored by dberris on Dec 6 2016, 10:21 PM.

Details

Summary

This patch introduces an LLVM intrinsic and a target opcode for custom event logging in XRay. Initially, its use case will be to allow users of XRay to log some type of string ("poor man's printf"). The target opcode compiles to a noop sled large enough to hold a relative function call to a runtime-determined target. At runtime, when XRay is enabled, compiler-rt replaces the sled with a call to a trampoline that implements the logic for creating the custom log entries.

Future patches will implement the compiler-rt parts and clang-side support for emitting the IR corresponding to this intrinsic.
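For orientation, a minimal sketch of the intended user-facing form; the __xray_customevent spelling and the clang-side builtin are assumptions here (they arrive with those later patches), since this patch only adds the LLVM-side pieces:

// Sketch only: assumes clang exposes a __xray_customevent builtin taking a
// byte buffer and its length; this patch only adds the LLVM-side intrinsic.
#include <cstddef>
#include <cstdio>

[[clang::xray_always_instrument]] void handleQuery(int QueryId) {
  char Buf[64];
  // "Poor man's printf": format a small message and hand the bytes to XRay.
  int Len = std::snprintf(Buf, sizeof(Buf), "query %d started", QueryId);
  if (Len > 0 && static_cast<std::size_t>(Len) < sizeof(Buf))
    __xray_customevent(Buf, static_cast<std::size_t>(Len));
}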

Diff Detail

Repository
rL LLVM

Event Timeline


As I understand it, this patch adds the ability to pass custom events to the XRay logging mechanism, so isn't it more logical to call it "custom event" rather than "custom log"? Another reason: I have seen terms similar to "custom log" in some of @dberris's commits, perhaps in FDR, so it's desirable to avoid confusion.
And a general question: how are you going to pass these custom events to XRay logging when the handler callback is currently not designed to take a pointer argument? It currently takes the function id and call reason code as arguments. Are you going to add a third parameter, like void* Data, to the handler function? Trampolines may need adjustment on some platforms in that case.

timshen added inline comments.Jan 24 2017, 1:11 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5748 ↗(On Diff #84347)

NodeTys seems wrong to me; it should be MVT::Other. It indicates the return type (i.e. void), not the parameter types.

5752 ↗(On Diff #84347)

You need to call DAG.setRoot(patchableNode); otherwise the succeeding nodes are not reached from patchableNode, and patchableNode ends up getting garbage-collected.

As I understand it, this patch adds the ability to pass custom events to the XRay logging mechanism, so isn't it more logical to call it "custom event" rather than "custom log"? Another reason: I have seen terms similar to "custom log" in some of @dberris's commits, perhaps in FDR, so it's desirable to avoid confusion.

Indeed, it would be confusing given that the FDR change allows for a custom logging mechanism. In this case, though, we're really talking about a user-driven event being inserted into the XRay log. I think "custom events" makes a bit more sense for this functionality.

And a general question: how are you going to pass these custom events to XRay logging when the handler callback is currently not designed to take a pointer argument? It currently takes the function id and call reason code as arguments. Are you going to add a third parameter, like void* Data, to the handler function? Trampolines may need adjustment on some platforms in that case.

Yes, this is currently an exploratory patch -- we will need to handle these differently in the runtime. That should come later.

The intent here is to have a different trampoline for these custom events, with a different extension point in the log implementation. The API being introduced in the FDR change allows us to add different kinds of handlers too.
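A hypothetical sketch of what such an extension point might look like; the names and signatures below are assumptions for illustration, not the actual compiler-rt API (which arrives in later patches):

// Illustrative only: the real registration API lands with the compiler-rt
// changes referenced above; names and signatures here are assumptions.
#include <cstddef>
#include <cstdint>

extern "C" {
// Existing XRay handler shape: function id plus entry/exit reason code.
using XRayEntryHandler = void (*)(int32_t FuncId, int32_t EntryType);
// A custom-event handler would instead receive the user's buffer and its size.
using XRayCustomEventHandler = void (*)(void *Event, std::size_t EventSize);
// A separate registration point, so the existing entry/exit handler keeps its
// current two-argument signature.
int __xray_set_customevent_handler(XRayCustomEventHandler Handler);
}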

There's also work being done by @pelikan that should implement first-argument logging, which will also require changes to the runtime. We're doing it incrementally, though, to unblock progress and to keep XRay usable even in its current form.

timshen added inline comments.Jan 25 2017, 5:23 PM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

Is PATCHABLE_LOG_CALL allowed to be inserted at an arbitrary place inside the function?

If so, the callee must not clobber any registers at all. This is unlike __xray_FunctionEntry, which only needs to save and restore the callee-saved and parameter-passing registers, or __xray_FunctionExit, which only needs to save and restore the return value register.

dberris added inline comments.Jan 26 2017, 11:27 PM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

Yes, we're going to have to treat this as if it were a normal function call, on a cross-platform basis. The only difference is that the call will be guarded by a jump-over that can be patched at runtime.

timshen added inline comments.Jan 27 2017, 12:43 PM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

That requires larger instrumentation pass changes, right? It's going to have to be moved before at least frame lowering, or even earlier -- or even changed to an IR pass. :P

dberris added inline comments.Jan 29 2017, 4:39 PM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

Not really, at least I don't think so -- this is something that users write in their code (something like __xray_custom_log(...)) that turns into an intrinsic call in the IR. We're lowering that intrinsic call, which by definition really happens in the function body (after stack adjustments, after the prologue). The intent is to guard the function call with a "jump over", just like the normal sleds (and have it compute the offset to the trampoline at runtime).

timshen added inline comments.Feb 1 2017, 4:58 PM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

which by definition really happens in the function body (after stack adjustments, after the prologue).

That is true. I wasn't worried about that, and I'd overlooked that we do have an IR intrinsic for it. During lowering, the call sequence is there, reserving the proper amount of space for saving live, callee-saved registers. These are good. The problem is that currently the sled doesn't save any live registers (according to LowerPATCHABLE_LOG_CALL, the function we are commenting on).

Either the caller or the callee needs to save the live registers. If it's implemented so that registers are saved by the callee (__xray_custom_log), the problem is that the callee doesn't know which registers are live; since custom logging may happen at any point in the function, __xray_custom_log had better save all callee-saved registers.

If it's implemented so that registers are saved by the caller, great, the caller knows which registers to save -- but it only knows at compile time, as opposed to patch time. The current LowerPATCHABLE_LOG_CALL, which runs at compile time, doesn't emit register saving and restoring code.

Ideally, I'd prefer just lowering the intrinsic to a normal call to a C function __xray_custom_log, instead of to a pseudo instruction PATCHABLE_LOG_CALL. In the instrumentation pass, the pass scans for calls to __xray_custom_log and adds branch-overs before those calls. The advantage is that all register saving/restoring code is handled by common LLVM infrastructure. In this approach, adding an IR intrinsic is optional, because the user can directly call the C function __xray_custom_log, knowing that the compiler is going to add the branch-over around it.
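For illustration, a rough sketch of this alternative (names taken from the discussion above; this is not the approach the patch ultimately takes):

// Rough sketch of the alternative described above: the user calls a plain C
// entry point, and a later instrumentation pass wraps each call site in a
// patchable branch-over.
#include <cstddef>

extern "C" void __xray_custom_log(const char *Data, std::size_t Size);

void businessLogic() {
  static constexpr char kMsg[] = "reached checkpoint A";
  // An ordinary call: register saving/restoring is handled by the usual
  // calling-convention machinery; the pass only adds the jump over it.
  __xray_custom_log(kMsg, sizeof(kMsg) - 1);
}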

dberris added inline comments.Feb 2 2017, 11:00 PM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

Ideally, I'd prefer just lowering the intrinsic to a normal call to a C function __xray_custom_log, instead of to a pseudo instruction PATCHABLE_LOG_CALL. In the instrumentation pass, the pass scans for calls to __xray_custom_log and adds branch-overs before those calls.

That makes sense, although I haven't quite worked out how to do this from the MachineFunction pass, since the register restoration code after the call might be arbitrarily long. At the IR level, I don't think we yet have a means of marking sections of code that could be statically branched over but still preserved in codegen (i.e. whether it's possible to inhibit dead-code removal).

Even if it's possible to do caller-saved register stashing and have the back-end generate the jumps (similar to how we have the sleds now), the differences amongst the intrinsic call-sites could be substantial (i.e. register allocation could get messed up by treating the intrinsic call as a "normal function call"). The advantage of making the trampoline do the caller- and callee-saved register stashing/restoring is that the patch points all look the same (which matters especially on x86), and the trampoline also does the same thing every time and has a fixed cost.

timshen added inline comments.Feb 3 2017, 12:01 AM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

Even if it's possible to do caller-saved register stashing and have the back-end generate the jumps (similar to how we have the sleds now), the differences amongst the intrinsic call-sites could be substantial (i.e. register allocation could get messed up by treating the intrinsic call as a "normal function call"). The advantage of making the trampoline do the caller- and callee-saved register stashing/restoring is that the patch points all look the same (which matters especially on x86), and the trampoline also does the same thing every time and has a fixed cost.

That's a good point, since adding normal function calls may significantly change the shape of the surrounding code. Maybe we should just save and restore *all* registers and see how slow it is -- I'm not too optimistic, since XRay logs hot functions.

While we shouldn't worry too much about performance until we have numbers, a random thought on making this faster: when calling __xray_set_handler, require the passed-in handler function to be marked with a special ABI attribute that essentially says all registers are callee-saved.
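For illustration, Clang's preserve_all calling convention is roughly this idea; requiring it (or any particular attribute) here is only a hypothetical, not something decided in this thread:

// Illustrative only: preserve_all makes the callee preserve (nearly) all
// registers, so a call to the handler clobbers almost nothing in the caller.
#include <cstdint>

// A handler built with such a convention could be called from a sled without
// the caller having to spill its live registers first.
extern "C" __attribute__((preserve_all)) void
myXRayHandler(int32_t FuncId, int32_t EntryType) {
  // ... record the event; keep this path cheap since it runs in hot code ...
}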

dberris added inline comments.Feb 6 2017, 2:04 AM
lib/Target/X86/X86MCInstLower.cpp
1041 ↗(On Diff #84347)

This has so far been the assumption for all the XRay sleds -- the trampolines save all callee- and caller-saved registers. This one is going to be a bit different too (the trampoline for the custom logging), since it has to store even the scratch registers on x86. Note that this only happens when the user asks for custom event logging -- the logging implementation is going to be the bottleneck anyway, and our fixed costs in the trampolines will be "constant" modulo interference effects.

So far we haven't had to add special ABI annotations either, since we do the patching at runtime anyway.

dberris requested changes to this revision.Feb 9 2017, 12:35 AM

@lrl Please let us know when this is ready for another look (given the feedback and recent discussions).

This revision now requires changes to proceed.Feb 9 2017, 12:35 AM
lrl updated this revision to Diff 87946.Feb 9 2017, 7:31 PM
lrl edited edge metadata.
lrl updated this revision to Diff 87948.Feb 9 2017, 7:33 PM

Hullo! I've now updated the patch.

  • renamed things to custom event
  • added a single unit test -- taking suggestions for more (due to lack of domain knowledge, I'd love suggestions about what types of cases need to be tested)
  • followed x86 calling conventions before/after the trampoline

Cool stuff, thanks @lrl -- just one thing and I think we can land soon after unless @timshen or @echristo have other thoughts.

lib/Target/X86/X86MCInstLower.cpp
1042–1051 ↗(On Diff #87946)

We can reduce this a bit, now that we don't really need to put the function id in %r10 anymore.

lrl updated this revision to Diff 87953.Feb 9 2017, 7:52 PM

Cool stuff, thanks @lrl -- just one thing and I think we can land soon after unless @timshen or @echristo have other thoughts.

done

dberris accepted this revision.Feb 9 2017, 11:47 PM

Couple more things you might want to address. If @timshen or @echristo don't get to land it or have more comments, I'll land it tomorrow. :)

lib/Target/X86/X86MCInstLower.cpp
1046 ↗(On Diff #87953)

We don't need the function id here anymore, so really the jump will just become a 2-byte nop.

1073–1080 ↗(On Diff #87953)

You can call MI.getNumOperands() and use that to do a reverse index via MI.getOperand(...)?
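A minimal sketch of that suggestion, assuming the operands of interest sit at the end of the instruction:

// Sketch only: index operands from the back of the MachineInstr rather than
// hard-coding absolute positions.
#include "llvm/CodeGen/MachineInstr.h"

static void visitTrailingOperands(const llvm::MachineInstr &MI) {
  if (MI.getNumOperands() < 2)
    return;
  unsigned NumOps = MI.getNumOperands();
  const llvm::MachineOperand &Last = MI.getOperand(NumOps - 1);
  const llvm::MachineOperand &SecondToLast = MI.getOperand(NumOps - 2);
  (void)Last;
  (void)SecondToLast;
}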

test/CodeGen/X86/xray-custom-log.ll
22 ↗(On Diff #87953)

You can probably look for the entry in the instrumentation map that will contain the correct reference to the sled, with the correct type associated with it. See other adjacent tests that do similar things?

This revision is now accepted and ready to land.Feb 9 2017, 11:47 PM
timshen added inline comments.Feb 10 2017, 1:12 PM
lib/Target/X86/X86MCInstLower.cpp
1065 ↗(On Diff #87953)

I thought we wanted to save & restore all registers on the callee side?

Besides, this might be in the middle of a function, so any register could be live, not just the caller-saved ones. That includes all 16 general-purpose registers, 16 xmm registers, 8 mmx registers, 8 "k" registers, 8 st registers, mxcsr, the x87 status word, the x87 control word, and 4 bnd registers. And I don't even know what some of these weirdo registers are. :)

timshen added inline comments.Feb 10 2017, 8:31 PM
lib/Target/X86/X86MCInstLower.cpp
1065 ↗(On Diff #87953)

Sorry, I was wrong about this. Only caller-saved registers need to be saved.

However, there are more registers than these three. See "SAVE_REGISTERS" in https://github.com/llvm-mirror/compiler-rt/blob/master/lib/xray/xray_trampoline_x86_64.S.

Still, I think it's better to save and restore them on the callee side, to minimize the change on the caller side.

dberris added inline comments.Feb 13 2017, 9:16 AM
lib/Target/X86/X86MCInstLower.cpp
1065 ↗(On Diff #87953)

The implementation of the trampoline will do more of the register saving; that will be on the compiler-rt side of things. This patch first introduces the intrinsic, and the compiler-rt changes will happen later. I think we're doing the right things at this point, and we can do more in the trampoline.

timshen added inline comments.Feb 13 2017, 11:26 AM
lib/Target/X86/X86MCInstLower.cpp
1065 ↗(On Diff #87953)

I was confused by only some of the register saving happening on the caller side (naturally it should be all or none).

If that's the right thing, then go ahead and check it in.

dberris retitled this revision from Add custom logging support for xray to [XRay] Custom event logging intrinsic.Feb 14 2017, 9:43 AM
dberris edited the summary of this revision. (Show Details)

Thanks @timshen for having a look. I'll land this now and improve the tests while @lrl is out on vacation.

lib/Target/X86/X86MCInstLower.cpp
1065 ↗(On Diff #87953)

I was confused by only some of the register saving happening on the caller side (naturally it should be all or none).

The register saving is really there because the trampoline needs those values as arguments -- so the caller side needs to put the data on the stack anyway, and restore those registers on the caller side after the function call (in this case, the trampoline call) returns. This is just normal x86 function-calling semantics. Since this can occur in the body of the function, we don't necessarily know what the registers hold at that point, and therefore we need to stash them caller-side so we can use them for the function call.

The trampoline will then pass the arguments on to the logging implementation, setting up the stack appropriately there too. To do that correctly, the trampoline has to store the state of all registers that the logging implementation might use in its execution (including the scratch registers), call the logging implementation, then restore the state of the registers before returning to the caller.

The caller (the function from which the custom event logging was invoked) will then restore the caller-saved registers so the function can proceed as if the instrumentation never happened in the first place. :)

Adding llvm-commits, will upload an updated patch for posterity.

dberris commandeered this revision.Feb 14 2017, 10:06 AM
dberris edited reviewers, added: lrl; removed: dberris.

Taking this on while @lrl is on vacation.

This revision now requires review to proceed.Feb 14 2017, 10:06 AM

One inline nit; I'd like to take a closer look at this, though. Can you give an example of a use case and the lowering?

Thanks :)

lib/Target/X86/X86MCInstLower.cpp
1086 ↗(On Diff #88394)

No named fixme.

dberris updated this revision to Diff 88433.Feb 14 2017, 2:07 PM
dberris marked 2 inline comments as done.
  • Fixups

Can you give an example of a use case and the lowering?

The high-level idea is to translate something like this in C++:

[[clang::xray_always_instrument]] void foo() {
  static constexpr char kPhase[] = "Foo phase.";
  __xray_customevent(kPhase, strlen(kPhase));
}

Into the lowered form in LLVM IR, calling llvm.experimental.customevent(...). The intent is to emit the sled such that, at runtime, we can embed the correct relative call to the trampoline that will invoke the custom event handler.

Does that help?

dberris planned changes to this revision.Feb 28 2017, 7:59 PM

Actually, now that I think about it more, it seems we can get away with just always pushing precisely the argument values (the pointer and the size) and not worrying about the x86 calling convention -- since in theory we control what the trampoline does in compiler-rt anyway. I'm working on some local changes to reflect this. We also want to change the type of the size argument so that it reflects what 'size_t' might be for the platform.

dberris updated this revision to Diff 90126.Feb 28 2017, 11:47 PM
  • fixup: simplify and shrink sled code

This is now ready for another look @timshen / @echristo

dberris planned changes to this revision.Mar 2 2017, 2:50 PM

Actually, while working on the compiler-rt side of things, it turns out the sleds aren't quite that simple to nail down. I have some local changes and am iterating, mostly based on trying things out and hitting different roadblocks.

However, if you have feedback on the following issues, it would be most appreciated:

  • Fixing up the size of the sleds in all situations. It seems the initial assumptions in the comments are off (it largely depends on which registers are being pushed, and the encoding of the push operations will vary on x86 if the arguments are passed following a specific ABI).
  • Whether we can enforce a specific ABI for an intrinsic so that, say, the first and second arguments to the trampoline are always in %rdi and %rsi (as in the System V ABI).

Cheers

dberris updated this revision to Diff 91274.Mar 10 2017, 12:19 AM

Redid this to make lowering of the arguments simpler, and to have the sled be jumped over (no impact on caller code when off).

Much simpler now, although I'm not sure why calls to the intrinsic don't automatically lower the arguments according to the SysV64 calling convention -- perhaps I'm missing an explicit calling convention spec in the front-end in D30018? Will that help?

Hi @rSerge -- do you have time to have a look at this?

rSerge accepted this revision.Mar 22 2017, 11:04 AM

@dberris, I've reviewed it, but I'm not sufficiently qualified for this, as I'm not familiar with some parts of LLVM touched here...
To the extent of my knowledge, this looks good.

This revision is now accepted and ready to land.Mar 22 2017, 11:04 AM
dberris added a reviewer: rnk.Apr 18 2017, 5:27 PM

thanks @rSerge

ping @echristo on the new intrinsics, apis, etc.

also @rnk in case there's anything that jumps out here.

timshen edited edge metadata.May 2 2017, 4:57 PM

Other than some nits and questions, looks good. :)

include/llvm/IR/Intrinsics.td
779 ↗(On Diff #91274)

Is the length going to be i8 forever? What if the strings get longer? Would it be better to use a word-sized length variable?

lib/CodeGen/SelectionDAG/FastISel.cpp
885 ↗(On Diff #91274)

I thought it's not necessary to have a FastISel implementation for this, because if FastISel doesn't work, execution falls back to SelectionDAG.

Now that you already have it, I don't object to keeping it, though. :)

lib/Target/X86/X86MCInstLower.cpp
1052 ↗(On Diff #91274)

I believe that before at symbol, all caller-saved registers are actually saved?

1075 ↗(On Diff #91274)
dberris updated this revision to Diff 97564.May 3 2017, 12:42 AM
dberris marked 3 inline comments as done.

Update test and address comments

dberris added inline comments.May 3 2017, 12:42 AM
include/llvm/IR/Intrinsics.td
779 ↗(On Diff #91274)

Good point -- i32 is definitely better for future-proofing.

lib/CodeGen/SelectionDAG/FastISel.cpp
885 ↗(On Diff #91274)

Ah, cool. That's good to know. :)

lib/Target/X86/X86MCInstLower.cpp
1052 ↗(On Diff #91274)

I thought that too, except the calling convention for the intrinsic seems to be the C calling convention (the default) rather than the SysV64 calling convention that I'd like it to be. This is where the question comes up of whether it's possible (or whether it makes sense) for us to actually specify the calling convention for the intrinsic, or whether the front-end has to state it explicitly.

Any suggestions on how to force the calling convention here?

1075 ↗(On Diff #91274)

That doesn't quite work for x86 because the jump instruction emitted can vary (i.e. it's not always 2 bytes). The runtime requires us to be able to write two bytes atomically to enable/disable the jump. Since there's no way yet for us to force a specific jump instruction that will be preserved when the relaxation pass runs, for now we need to hard-code this static relative jump ourselves.
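For context, a conceptual sketch of why two atomically written bytes matter -- the real patching logic lives in compiler-rt and differs in detail (page permissions, alignment, and the exact opcodes are handled there; the offset below is an illustrative assumption):

// Conceptual sketch only: flip the first two bytes of a sled between a short
// "jmp over" (disabled) and a 2-byte nop (enabled, falling through into the
// patched-in call sequence). Assumes the sled start is 2-byte aligned and the
// code page has already been made writable.
#include <atomic>
#include <cstdint>

void setSledEnabled(void *SledAddress, bool Enable) {
  auto *FirstWord = reinterpret_cast<std::atomic<uint16_t> *>(SledAddress);
  const uint16_t JmpOver = 0x0EEB;    // eb 0e: jmp +14 (illustrative offset)
  const uint16_t TwoByteNop = 0x9066; // 66 90: 2-byte nop
  FirstWord->store(Enable ? TwoByteNop : JmpOver, std::memory_order_release);
}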

timshen added inline comments.May 3 2017, 9:52 AM
lib/Target/X86/X86MCInstLower.cpp
1052 ↗(On Diff #91274)

Why do you want SysV64 calling convention?

I was trying to confirm that no registers will be clobbered unexpectedly, causing correctness issues.

1075 ↗(On Diff #91274)

I'm not quite following -- this is the AsmPrinter, and there shouldn't be any lowering after that, right?

What is the "relaxation pass" you mentioned? Is it BranchRelaxation pass? But I think it runs before AsmPrinter?

dberris added inline comments.May 3 2017, 6:13 PM
lib/Target/X86/X86MCInstLower.cpp
1052 ↗(On Diff #91274)

I thought that was the default calling convention for x86_64 on Linux, and it would greatly simplify the code in the sled to not have to move values into the registers expected by the function we are calling (__xray_CustomEvent is a trampoline). In my testing, for some reason the arguments we need might be placed in arbitrary registers. As in the test, you will see that we need to move arguments around, because the C calling convention uses different registers for the arguments than the SysV64 calling convention (which the __xray_CustomEvent trampoline requires).

I was trying to confirm that no registers will be clobbered unexpectedly, causing correctness issues.

Ah, yes -- AFAICT, the lowering of MachineInstrs that are considered "call" instructions will preserve caller-saved registers before the call. Is there a way for me to verify that more explicitly in the test?

1075 ↗(On Diff #91274)

So the integrated assembler has an option (-mrelax-all) which will look for jmp instructions and attempt to relax them. The integrated assembler runs after all assembly is emitted. This works around that until we can flag certain jump instructions to never be relaxed.

timshen accepted this revision.May 4 2017, 9:59 AM
timshen added inline comments.
lib/Target/X86/X86MCInstLower.cpp
1052 ↗(On Diff #91274)

As long as it's correct, I'm fine.

1075 ↗(On Diff #91274)

*facepalm* That's good to know. Thanks!

This revision was automatically updated to reflect the committed changes.