This is an archive of the discontinued LLVM Phabricator instance.

[XRay] Custom event logging intrinsic
ClosedPublic

Authored by dberris on Dec 6 2016, 10:21 PM.

Details

Summary

This patch introduces an LLVM intrinsic and a target opcode for custom event logging in XRay. Initially, its use case will be to allow users of XRay to log some type of string ("poor man's printf"). The target opcode compiles to a noop sled large enough to enable calling through to a runtime-determined relative function call. At runtime, when X-Ray is enabled, the sled is replaced by compiler-rt with a trampoline to the logic for creating the custom log entries.

Future patches will implement the compiler-rt parts and clang-side support for emitting the IR corresponding to this intrinsic.

Event Timeline

lrl updated this revision to Diff 80543.Dec 6 2016, 10:21 PM
lrl retitled this revision from to Add custom logging support for xray.
lrl updated this object.
dberris edited edge metadata.Dec 8 2016, 1:05 AM
dberris added a subscriber: echristo.

Thanks for this louis!

I think you can start getting a bit more advice from @echristo about the handling that's appropriate here. I've added some thoughts below but I'm not very familiar yet with the higher level LLVM IR functionality that would facilitate what we're trying to achieve here.

lib/Target/X86/X86ISelLowering.cpp
18649–18654

At this point I think you can check first whether the arg is a constant in the first place (and handle that appropriately). Here you can emit the pseudo instruction and use virtual registers (or other mechanisms for the operands) to pass the information through.

lrl updated this revision to Diff 81801.Dec 16 2016, 2:51 PM
lrl edited edge metadata.
lrl updated this revision to Diff 81831.Dec 16 2016, 5:37 PM

I think I might be a little closer. Now it doesn't get optimized out but it fails on the second round of combine for legalize types. I replaced uses of the original intrinsic operand with the target opcode instruction.

Other examples seem to be doing chian/glue though so I figure I still need to do this for it to work out

Can you post the legalization error somewhere, or on this patch as well?

I have to admit to not knowing very much about the SelectionDAG business... probably need to read more to understand it myself. Maybe @echristo can redirect to someone who might know more there?

lrl added a comment.Dec 19 2016, 10:03 PM
This comment was removed by lrl.
lrl added a comment.Dec 19 2016, 10:04 PM

Note: I'm planning to hack on it some more this week, probably Thursday. I haven't had time to play around too much after the legalization error. I suspect the current way of explicitly setting the use might be wrong because all the other lowerings don't explicitly set the use. But I'm not really sure...

➜  llvmninja bin/llc -debug -view-dag-combine2-dags -filetype=asm -o - -mtriple=x86_64-unknown-linux-gnu playground.ll
Args: bin/llc -debug -view-dag-combine2-dags -filetype=asm -o - -mtriple=x86_64-unknown-linux-gnu playground.ll
	.text
	.file	"playground.ll"

Features:+64bit,+sse2
CPU:generic

Subtarget features: SSELevel 2, 3DNowLevel 0, 64bit 1
********** Begin Constant Hoisting **********
********** Function: caller
********** End Constant Hoisting **********
*** Interleaved Access Pass: caller
[SafeStack] Function: caller
[SafeStack]     safestack is not requested for this function
---- Branch Probability Info : caller ----

Computing probabilities for



=== caller
Initial selection DAG: BB#0 'caller:'
SelectionDAG has 8 nodes:
  t1: i64 = FrameIndex<0>
    t0: ch = EntryToken
  t6: ch,glue = CopyToReg t0, Register:i32 %EAX, Constant:i32<0>
  t7: ch = X86ISD::RET_FLAG t6, TargetConstant:i32<0>, Register:i32 %EAX, t6:1



Combining: t7: ch = X86ISD::RET_FLAG t6, TargetConstant:i32<0>, Register:i32 %EAX, t6:1

Combining: t6: ch,glue = CopyToReg t0, Register:i32 %EAX, Constant:i32<0>

Combining: t5: i32 = Register %EAX

Combining: t4: i32 = TargetConstant<0>

Combining: t3: i32 = Constant<0>

Combining: t2: ch = PATCHABLE_LOG_CALL t2

Combining: t0: ch = EntryToken
Optimized lowered selection DAG: BB#0 'caller:'
SelectionDAG has 7 nodes:
    t0: ch = EntryToken
  t6: ch,glue = CopyToReg t0, Register:i32 %EAX, Constant:i32<0>
  t7: ch = X86ISD::RET_FLAG t6, TargetConstant:i32<0>, Register:i32 %EAX, t6:1


Legally typed node: t5: i32 = Register %EAX

Legally typed node: t4: i32 = TargetConstant<0>

Legally typed node: t3: i32 = Constant<0>

Legally typed node: t0: ch = EntryToken

Legally typed node: t6: ch,glue = CopyToReg t0, Register:i32 %EAX, Constant:i32<0>

Legally typed node: t7: ch = X86ISD::RET_FLAG t6, TargetConstant:i32<0>, Register:i32 %EAX, t6:1

Legally typed node: t65535: ch = handlenode t7

Unanalyzed node not noticed?
t2: ch = PATCHABLE_LOG_CALL t2

UNREACHABLE executed at /Users/louisli/branches/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.cpp:456!
0  libLLVMSupport.dylib     0x000000010da30275 llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 101
1  libLLVMSupport.dylib     0x000000010da308b9 PrintStackTraceSignalHandler(void*) + 25
2  libLLVMSupport.dylib     0x000000010da2c8b9 llvm::sys::RunSignalHandlers() + 425
3  libLLVMSupport.dylib     0x000000010da30ce2 SignalHandler(int) + 354
4  libsystem_platform.dylib 0x00007fff9d84852a _sigtramp + 26
Stack dump:
0.	Program arguments: bin/llc -debug -view-dag-combine2-dags -filetype=asm -o - -mtriple=x86_64-unknown-linux-gnu playground.ll
1.	Running pass 'Function Pass Manager' on module 'playground.ll'.
2.	Running pass 'X86 DAG->DAG Instruction Selection' on function '@caller'
[1]    6623 abort      bin/llc -debug -view-dag-combine2-dags -filetype=asm -o -  playground.ll

Today I spent a bit of time having a read through the CodeGenerator docs (http://llvm.org/docs/CodeGenerator.html) which allowed me a bit of insight into how we might want to think about this or at least ask the right questions (hopefully). :)

Please see below. :D

include/llvm/IR/Intrinsics.td
765

You might also consider marking the argument as explicitly ReadOnly`. So this might look like:

[IntrReadMem, IntrArgMemOnly, ReadOnly<1>]

As far as naming goes, I'm not sure how this looks like in practice... but it seems we'd want to give this an "experimental" namespace of sorts, so something like llvm.experimental.xray_customlog(...). I understand that's a bit verbose, but that's fine we can work with that.

include/llvm/Target/Target.td
1003

Having read a little about this, we probably want to provide a specific type here for the register. For custom logging, we may have to make sure that we preserve the value of %r10d before the instruction, and restore it afterwards so we might need to indicate that somehow (that we're implicitly using that register, so the code generator can work around that). Note though that this only makes sense if we ever want to associate custom log entries to function ids (which we do currently), but we can change our minds on that at any point.

Since we're doing this as a generic target-agnostic instruction, we're probably unable to enforce that at this level. So we may have to do something in the SelectionDAG time to say that we're using the R10D register implicitly. Something to think about.

One potential thing we might want to think about is whether this should behave as if it was a "call" instruction anyway, so we may also need to add:

let isCall = 1;

This introduces a wrinkle, since the custom log instruction/intrinsic may inhibit some ordering optimisations that might happen (since it already has hasSideEffects = 1), so it may act as a barrier for code moving before it. I don't know enough of the re-orderings that happen post-SelectionDAG scheduling. Whether that's a concern in practice is less of an issue I think, since we assume that the front-end generated the call to the intrinsic anyway, which means we can treat it as if it were a function call.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5741–5752

I think most of this code is unnecessary (or a red herring) -- what I think might be worth doing at this point is just looking at the LogEntry value, which is an llvm::Value. We want to make sure this is the correct type, and when selecting the operands of the PATCHABLE_LOG_CALL, and provide that list of operands (which should contain a single operand really). So I found that you might be able to do this transparently this way:

SDLoc dl = getCurSDLoc();
Value *LogEntry = I.getArgOperand(0);
// assert(LogEntry->getType()->isPointerTy() && "I need a pointer!");
setValue(&I, DAG.getMachineNode(TargetOpcode::..., dl, NodeTys, getValue(LogEntry)));
return nullptr;

At this point the lowering code should take care of the thing we actually want to see.

There's still the question of how to make sure that before the call the value of the %r10d register is preserved somewhere, and after the call gets restored. I don't have an answer to this just yet (maybe there's a way to explicitly define a "use these registers" here), but that's likely something that's more important in the lowering bits (i.e. we can work around that ourselves there if we can't find a good way of expressing that here).

lib/Target/X86/X86MCInstLower.cpp
1039–1041

I think we can do better here, that instead of pushing the pointer onto the stack, I think we're allowed by the X86 calling conventions to use %edi or similar. If we're going to lower this differently though, say if we took the length of the string as an argument to the intrinsic then we ought to think about this a little more -- i.e. we may need a bit more bytes in the sled for custom log calls. But it's a little too early for us to think about these things. :)

test/CodeGen/X86/xray-custom-log.ll
5

So to make this less weird, you can probably make it so that you have a string that you allocate either as a global (recommended for a test) or taken from a function that will have an i8* pointer argument.

To inform our design of the intrinsic, we ought to think about whether we also need the length of the data on the other side of the pointer. If we know this statically, this is fine -- but if this was something that was taken as an argument to the function, then we ought to be able to handle that too. Sample code might look like:

define i32 @caller2(i8* %str, i8 %len) {
  call void @llvm.xray.customlog_str(i8* %str, i8 %len)
  ret i32 0
}

Note that this allows us to make sure that the length fits within 8 bits (so that limits the amount of data we're willing to write to the log).

lrl added inline comments.Dec 22 2016, 5:09 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5741–5752

This is roughly what it looked like in the previous patch I think -- the assertion does pass, so it seems like the type works. With the above approach, the node gets optimized out, so I believe the next step is to replace the uses. The example I'm trying to use is the PATCHPOINT intrinsic which seems to follow roughly what we've done above by calling setValue() and then replaceUsesWith (I don't understand every line of its implementation -- still looking)

lrl updated this revision to Diff 84175.Jan 12 2017, 2:02 PM
lrl updated this revision to Diff 84347.Jan 13 2017, 11:56 AM
lrl updated this object.

I cannot reproduce the playground.ll crash with close-to-trunk revision. Have you rebased your patch recently?

Also, it might be helpful to add a "; RUN:" line for the .ll files. See http://llvm.org/docs/CommandGuide/FileCheck.html

lrl added a comment.Jan 17 2017, 11:38 AM

I cannot reproduce the playground.ll crash with close-to-trunk revision. Have you rebased your patch recently?

Also, it might be helpful to add a "; RUN:" line for the .ll files. See http://llvm.org/docs/CommandGuide/FileCheck.html

Let me add the ;RUN: line. Thanks.

The playground.ll crash was for an earlier version of the patch I think. Now it's on the revision where it just gets optimized out. The playground.ll patch will happen if I try to replace uses of the old intrinsic with the new MachineSDNode -- let me know if you want to see that version and I can upload a commit.

In D27503#648524, @lrl wrote:

The playground.ll patch will happen if I try to replace uses of the old intrinsic with the new MachineSDNode -- let me know if you want to see that version and I can upload a commit.

I'm not sure I understand this?

rSerge added a subscriber: rSerge.Jan 23 2017, 8:39 AM

As I understood, this patch starts adding a possibility to pass custom events to the XRay logging mechanism. So isn't it more logical to call it "custom event" rather than "custom log"? Another reason, I saw some terms similar to "custom log" somewhere in @dberris 's commits, perhaps in FDR. So I think it's desirable to avoid confusion.
And a general question: how are you going to pass these custom events to XRay logging, if currently the handler callback is not designed to take a pointer argument? It currently takes the function id and call reason code as arguments. Are you going to add the 3rd parameter like void* Data to the handler function? Trampolines may need adjustment in that case on some platforms.

timshen added inline comments.Jan 24 2017, 1:11 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5748

NodeTys seems wrong to me, it should be MVT::Other. It indicates the return type, aka void, not the parameter types.

5752

You need to call DAG.setRoot(patchableNode), otherwise the succeeding nodes are not reached from patchableNode, ending up with patchableNode gets garbage-collected.

As I understood, this patch starts adding a possibility to pass custom events to the XRay logging mechanism. So isn't it more logical to call it "custom event" rather than "custom log"? Another reason, I saw some terms similar to "custom log" somewhere in @dberris 's commits, perhaps in FDR. So I think it's desirable to avoid confusion.

Indeed, it will be confusing when we think about how in the FDR change we're allowing for a custom logging mechanism. In this case though we're really talking about a user-driven event being inserted into the XRay log. I think custom events seem to make a bit more sense for this functionality.

And a general question: how are you going to pass these custom events to XRay logging, if currently the handler callback is not designed to take a pointer argument? It currently takes the function id and call reason code as arguments. Are you going to add the 3rd parameter like void* Data to the handler function? Trampolines may need adjustment in that case on some platforms.

Yes, this is currently an exploratory patch -- we will need to handle these differently in the runtime. That should come later.

The intent here is to have a different trampoline for these custom events, with a different extension point in the log implementation. The API being introduced in the FDR change allows us to add different kinds of handlers too.

There's also work being done by @pelikan that should implement the first argument logging implementation, which will also require changes to the runtime. We're doing it incrementally though to unblock progress, and allow XRay to be usable even in its current form.

timshen added inline comments.Jan 25 2017, 5:23 PM
lib/Target/X86/X86MCInstLower.cpp
1041

Is PATCHABLE_LOG_CALL allowed to be inserted in arbitrary place inside the function?

If so, the callee must not clobber any register at all. This is unlike \_\_xray_FunctionEntry, which just need to save and restore callee-saved, parameter passing registers; or unlike \_\_xray_FunctionExit, which just need to save and restore return value register.

dberris added inline comments.Jan 26 2017, 11:27 PM
lib/Target/X86/X86MCInstLower.cpp
1041

Yes, we're going to have to treat this as if it's a normal function call, on a cross-platform basis. The only difference would be that the call will be guarded by a jump over that could be patched at runtime.

timshen added inline comments.Jan 27 2017, 12:43 PM
lib/Target/X86/X86MCInstLower.cpp
1041

That requires larger instrumentation pass changes, right? It's going to be moved before at least frame lowering, Or even earlier. Or even changed to an IR pass. :P

dberris added inline comments.Jan 29 2017, 4:39 PM
lib/Target/X86/X86MCInstLower.cpp
1041

Not really, at least I don't think so -- this is something that users write in their code (something like __xray_custom_log(...)) that turns into an intrinsic instruction function call in the IR. We're lowering that intrinsic call, which by definition really happens in the function body (after stack adjustments, after the prologue). The intent is to guard the function call with a "jump over" just like normal sleds (and have it compute the offset to the trampoline at runtime).

timshen added inline comments.Feb 1 2017, 4:58 PM
lib/Target/X86/X86MCInstLower.cpp
1041

which by definition really happens in the function body (after stack adjustments, after the prologue).

That is true. I was not worrying about that, and I ignored that we do have an IR intrinsic for it. During the lowering, the call sequence is there, providing proper amount of space reserved for saving live,callee-saved registers. These are good. The problem is that currently the sled doesn't save any live registers (according to LowerPATCHABLE_LOG_CALL, the function we are commenting on).

Either the caller or the callee needs to save the live registers. If it's implemented in a way that registers saved by callee (__xray_custom_log), the problem is it doesn't know which registers are alive; since custom logging may happen at any point in the function, __xray_custom_log'd better save all callee-saved registers.

If it's implemented in the way that registers are saved by the caller, great, the caller knows which registers to save - but it knows only at compile time, as opposed to patch time. The current LowerPATCHABLE_LOG_CALL, which is at compile time, doesn't emit register saving and restoring code.

Ideally, I prefer just lowering the intrinsic to a normal function call C function __xray_custom_log, instead of a pseudo instruction PATCHABLE_LOG_CALL. In the instrumentation pass, the pass scans for calls to __xray_custom_log, and add branch overs before those calls. The advantage is that all register saving/restoring code is handled by common llvm infrastructure. In this approach, adding an IR intrinsic is optional, because the user can directly call the C function __xray_custom_log, knowing that the compiler is going to add the branch over around it.

dberris added inline comments.Feb 2 2017, 11:00 PM
lib/Target/X86/X86MCInstLower.cpp
1041

Ideally, I prefer just lowering the intrinsic to a normal function call C function xray_custom_log, instead of a pseudo instruction PATCHABLE_LOG_CALL. In the instrumentation pass, the pass scans for calls to xray_custom_log, and add branch overs before those calls.

That makes sense, although I haven't quite worked out how to do this from the MachineFunction pass, since the register restoration code after the call might be arbitrarily long. From the IR level I don't think we have a means of marking sections of code yet that could be explicitly branched over statically but preserved in codegen (i.e. if it's possible to inhibit dead-code removal).

Even if it's possible to do caller-saved register stashing and have the back-end generate the jumps (similar to how we have the sleds now), the differences amongst the intrinsic call-sites could be substantial (i.e. register allocation could get messed up by treating the intrinsic call to be a "normal function call"). The advantage of making the trampoline do the caller- and callee-saved register stashing/restoring is that the patch points all look the same (matters especially in x86) and the trampoline also does the same thing and has a fixed cost.

timshen added inline comments.Feb 3 2017, 12:01 AM
lib/Target/X86/X86MCInstLower.cpp
1041

Even if it's possible to do caller-saved register stashing and have the back-end generate the jumps (similar to how we have the sleds now), the differences amongst the intrinsic call-sites could be substantial (i.e. register allocation could get messed up by treating the intrinsic call to be a "normal function call"). The advantage of making the trampoline do the caller- and callee-saved register stashing/restoring is that the patch points all look the same (matters especially in x86) and the trampoline also does the same thing and has a fixed cost.

That's a good point, since adding normal function calls may significantly change the shape of the normal code. Maybe we should just save and restore *all* registers and see how slow it is - I'm not quite optimistic, since xray logs hot functions.

While we shouldn't worry too much about the performance until we have numbers, a random thought on making this faster can be: when calling __xray_set_handler, require the passed in handler function to be marked with a special ABI attribute, that's essentially saying all registers are callee-saved.

dberris added inline comments.Feb 6 2017, 2:04 AM
lib/Target/X86/X86MCInstLower.cpp
1041

This has so far been the assumption for all the XRay sleds -- that the trampolines do save all callee and caller-saved registers. This is going to be a bit different too (the trampoline for the custom logging) since it has to store even the scratch registers in x86. Note that this only happens though when the user asks for custom event logging -- the logging implementation is going to be the bottleneck anyway, and our fixed costs in the trampolines will be "constant" modulo interference effects.

So far too we haven't had to do special annotations of the ABI, since we do the patching at runtime anyway.

dberris requested changes to this revision.Feb 9 2017, 12:35 AM

@lrl Please let us know when this is ready again for another look (given the feedback and recent discussions).

This revision now requires changes to proceed.Feb 9 2017, 12:35 AM
lrl updated this revision to Diff 87946.Feb 9 2017, 7:31 PM
lrl edited edge metadata.
lrl updated this revision to Diff 87948.Feb 9 2017, 7:33 PM

Hullo! I've now updated the patch.

  • renamed things to custom event
  • add a single unit test -- taking suggestions for more (due to lack of domain knowledge, would love suggestions about what types of cases need to be tested)
  • followed x86 calling conventions before/after the trampoline

Cool stuff, thanks @lrl -- just one thing and I think we can land soon after unless @timshen or @echristo have other thoughts.

lib/Target/X86/X86MCInstLower.cpp
1042–1051

We can reduce this a bit, now that we don't really need to put the function id in %r10 anymore.

lrl updated this revision to Diff 87953.Feb 9 2017, 7:52 PM

Cool stuff, thanks @lrl -- just one thing and I think we can land soon after unless @timshen or @echristo have other thoughts.

done

dberris accepted this revision.Feb 9 2017, 11:47 PM

Couple more things you might want to address. If @timshen or @echristo don't get to land it or have more comments, I'll land it tomorrow. :)

lib/Target/X86/X86MCInstLower.cpp
1046

We don't need the function id here anymore, so really the jump will just become a 2-byte nop.

1073–1080

You can MI.getNumOperands() and use that to do a reverse index, using MI.getOperand(...)?

test/CodeGen/X86/xray-custom-log.ll
23

You can probably look for the entry in the instrumentation map that will contain the correct reference to the sled, with the correct type associated with it. See other adjacent tests that do similar things?

This revision is now accepted and ready to land.Feb 9 2017, 11:47 PM
timshen added inline comments.Feb 10 2017, 1:12 PM
lib/Target/X86/X86MCInstLower.cpp
1065

I thought we want save & restore all registers at the callee side?

Besides, this might be in a middle of a function, so any register could be alive, not just the caller-saved ones. That include all 16 generic registers, 16 xmm registers, 8 mmx registers, 8 "k" registers, 8 st registers, mxcsr, x87 SW, x87 CW, 4 bnd registers. And I don't even know what are some these weirdo registers. :)

timshen added inline comments.Feb 10 2017, 8:31 PM
lib/Target/X86/X86MCInstLower.cpp
1065

Sorry, I was wrong about this. Only caller-saved registers need to be saved.

However, there are more registers then these three. See https://github.com/llvm-mirror/compiler-rt/blob/master/lib/xray/xray_trampoline_x86_64.S "SAVE_REGISTERS".

Still, I think it's better to save and restore them on the callee side, to minimize the change on the caller sidde.

dberris added inline comments.Feb 13 2017, 9:16 AM
lib/Target/X86/X86MCInstLower.cpp
1065

The implementation of the trampoline will do more of the register saving. That will be in the compiler-rt side of things. This is first introducing the intrinsic, and the compiler-rt changes will happen later. I think we're doing the right things at this point, and can do more in the trampoline.

timshen added inline comments.Feb 13 2017, 11:26 AM
lib/Target/X86/X86MCInstLower.cpp
1065

I was confused by doing some of the register saving at the caller side (naturally it should be all or none).

If that's the right thing then go ahead and check in.

dberris retitled this revision from Add custom logging support for xray to [XRay] Custom event logging intrinsic.Feb 14 2017, 9:43 AM
dberris edited the summary of this revision. (Show Details)

Thanks @timshen for having a look. I'll land this now and improve the tests while @lrl is out on vacation.

lib/Target/X86/X86MCInstLower.cpp
1065

I was confused by doing some of the register saving at the caller side (naturally it should be all or none).

The register saving is really because the trampoline needs those things as arguments -- so the caller side needs to put the data in the stack anyway, and restore those registers on the caller side after the function (in this case the trampoline) call returns. This is just normal x86 function calling semantics. Since this can occur in the body of the function, we don't necessarily know what the registers hold at that point and therefore need to stash them caller-side so we can use those for the function call.

The trampoline will then use the arguments to be passed into the logging implementation, setting the stack up appropriately there too. To do that correctly, the trampoline has to store the state of all registers that the logging implementation might use in its execution (including the scratch registers), call the logging implementation, then restore the state of the registers before returning to the caller.

The caller (the function from which the custom event logging was called) will then restore the caller-saved registers so the function can proceed as-if the instrumentation never happened in the first place. :)

Adding llvm-commits, will upload an updated patch for posterity.

dberris commandeered this revision.Feb 14 2017, 10:06 AM
dberris edited reviewers, added: lrl; removed: dberris.

Taking this on while @lrl is on vacation.

This revision now requires review to proceed.Feb 14 2017, 10:06 AM

One inline nit, I'd like to take a closer look at this though. Can you give an example of use case and lowering?

Thanks :)

lib/Target/X86/X86MCInstLower.cpp
1091

No named fixme.

dberris updated this revision to Diff 88433.Feb 14 2017, 2:07 PM
dberris marked 2 inline comments as done.
  • Fixups

Can you give an example of use case and lowering?

The high-level idea is to translate something like this in C++:

[[clang::xray_always_instrument]] void foo() {
  static constexpr char kPhase[] = "Foo phase."
  __xray_customevent(kPhase, strlen(kPhase));
}

Into the lowered form in llvm IR calling llvm.experimental.customevent(...). The intent is to emit the sled, such that we can at runtime embed the correct call to a relative offset, to the trampoline that will invoke the custom event handler

Does that help?

dberris planned changes to this revision.Feb 28 2017, 7:59 PM

Actually, now that I think about it more, it seems that we can get away with just always precisely pushing the values in the arguments (the pointer value and the size) and not worrying about the x86 calling convention -- since we in theory control what the trampoline does anyway in compiler-rt. I'm working on some local changes to reflect this. We also want to change the type of the size argument so that it reflects what 'size_t' might be for the platform.

dberris updated this revision to Diff 90126.Feb 28 2017, 11:47 PM
  • fixup: simplify and shrink sled code

This is now ready for another look @timshen / @echristo

dberris planned changes to this revision.Mar 2 2017, 2:50 PM

Actually, while working on the compiler-rt side of things it turns out the sleds aren't quite that simple to nail down. I have some local changes and am iterating, mostly based on trying it out and hitting different roadblocks.

However, if you have feedback on the following issues, it would be most appreciated:

  • Fixing up the size of the sleds in all situations. It seems the initial assumptions in the comments are different (it largely depends on which registers are being pushed in, and the encoding of the push operations will vary in x86 if the arguments are passed in following a specific ABI).
  • Whether we can enforce a specific ABI for an intrinsic so that, say, we always have the first and second arguments to the trampoline be in %rdi and %rsi (as in the System V ABI).

Cheers

dberris updated this revision to Diff 91274.Mar 10 2017, 12:19 AM

Re-did to make lowering of arguments simpler, and be jumped over (no impact to caller code when off).

Much simpler now, although I'm not sure why calls to the intrinsic don't automatically lower the arguments to the call according to the sysv64 calling convention -- perhaps I'm missing an explicit calling convention spec in the front-end in D30018? Will that help?

Hi @rSerge -- do you have time to have a look at this?

rSerge accepted this revision.Mar 22 2017, 11:04 AM

@dberris , I've reviewed it, but I'm not sufficiently qualified for this as I am not familiar with some parts of LLVM touched...
To the extent of my knowledge, this looks good.

This revision is now accepted and ready to land.Mar 22 2017, 11:04 AM
dberris added a reviewer: rnk.Apr 18 2017, 5:27 PM

thanks @rSerge

ping @echristo on the new intrinsics, apis, etc.

also @rnk in case there's anything that jumps out here.

timshen edited edge metadata.May 2 2017, 4:57 PM

Other than some nits and questions, looks good. :)

include/llvm/IR/Intrinsics.td
779

Is the length going to be i8 forever? What if the strings are getting longer? Would it be better to use a word-size length variable?

lib/CodeGen/SelectionDAG/FastISel.cpp
885

I thought It's not necessary to have a FastISel implementation for this, because if FastISel doesn't work, the execution falls back to SelectionDAG.

Now that you already have this, I'm not object to keeping it though. :)

lib/Target/X86/X86MCInstLower.cpp
1052

I believe that before at symbol, all caller-saved registers are actually saved?

1075
dberris updated this revision to Diff 97564.May 3 2017, 12:42 AM
dberris marked 3 inline comments as done.

Update test and address comments

dberris added inline comments.May 3 2017, 12:42 AM
include/llvm/IR/Intrinsics.td
779

good point -- i32 is definitely better for future proofing.

lib/CodeGen/SelectionDAG/FastISel.cpp
885

Ah, cool. That's good to know. :)

lib/Target/X86/X86MCInstLower.cpp
1052

I thought that too, except the calling convention for the intrinsic seems to be the C calling convention (default) as opposed to the SysV64 calling convention that I'd like it to be. This is where the question of whether it's possible (or whether it makes sense) for us to actually provide the calling convention for the intrinsic (or whether the front-end has to explicitly state this).

Any suggestions on how to force the calling convention here?

1075

That doesn't quite work for x86 because the jump instruction emitted can vary (i.e. it's not always 2 bytes). The runtime requires us to be able to write two bytes atomically to enable/disable the jump. Since there's no way yet for us to force a specific jump instruction that will be preserved when the relaxation pass runs, for now we need to hard-code this static relative jump ourselves.

timshen added inline comments.May 3 2017, 9:52 AM
lib/Target/X86/X86MCInstLower.cpp
1052

Why do you want SysV64 calling convention?

I was trying to confirm that no registers will be clobbered unexpectedly, causing correctness issues.

1075

I'm not quite following - it's AsmPrinter, and there shouldn't be any lowering after that, is it right?

What is the "relaxation pass" you mentioned? Is it BranchRelaxation pass? But I think it runs before AsmPrinter?

dberris added inline comments.May 3 2017, 6:13 PM
lib/Target/X86/X86MCInstLower.cpp
1052

I thought that was the default calling convention for x86_64 on Linux, and it would greatly simplify the code in the sled to not have to put the registers in the correct registers expected by the function we are calling (__xray_CustomEvent is a trampoline). In my testing, for some reason the arguments we need might be placed in random registers. Like in the test, you will see that we need to move arguments around from the C calling convention, using different registers for the arguments, than the SysV64 calling convention (which the __xray_CustomEvent trampoline requires).

I was trying to confirm that no registers will be clobbered unexpectedly, causing correctness issues.

Ah, yes -- AFAICT, the lowering of MachineInst's that are considered "call" instructions will preserve caller-saved registers before the call. Is there a way for me to verify that in the test more explicitly?

1075

So the integrated assembler has an option (-mrelax-all) which will look for jmp instructions and attempt to relax those. The integrated assembler will run after all assembly is emitted. This works around that until we can flag certain jump instructions to never be relaxed.

timshen accepted this revision.May 4 2017, 9:59 AM
timshen added inline comments.
lib/Target/X86/X86MCInstLower.cpp
1052

As long as it's correct, I'm fine.

1075

\facepalm that's good to know. Thanks!

This revision was automatically updated to reflect the committed changes.