This is an archive of the discontinued LLVM Phabricator instance.

XRay: Add entry and exit sleds

Authored by dberris on May 4 2016, 12:01 AM.



In this patch we implement the following parts of XRay:

  • Supporting a function attribute named 'function-instrument' which currently only supports 'xray-always'. We should be able to use this attribute for other instrumentation approaches.
  • Supporting a function attribute named 'xray-instruction-threshold' used to determine whether a function is instrumented with a minimum number of instructions (IR instruction counts).
  • X86-specific nop sleds as described in the white paper.
  • A machine function pass that adds the different instrumentation marker instructions at a very late stage.
  • A way of identifying which return opcode is considered "normal" for each architecture.

There are some caveats here:

  1. We don't handle PATCHABLE_RET in platforms other than x86_64 yet -- this means if IR used PATCHABLE_RET directly instead of a normal ret, instruction lowering for that platform might do the wrong thing. We think this should be handled at instruction selection time to by default be unpacked for platforms where XRay is not availble yet.
  1. The generated section for X86 is different from what is described from the white paper for the sole reason that LLVM allows us to do this neatly. We're taking the opportunity to deviate from the white paper from this perspective to allow us to get richer information from the runtime library.

Diff Detail

Event Timeline

dberris updated this revision to Diff 56097.May 4 2016, 12:01 AM
dberris retitled this revision from to XRay: Add entry and exit sleds.
dberris updated this object.
dberris added reviewers: echristo, sanjoy.
dberris set the repository for this revision to rL LLVM.
dberris added a subscriber: llvm-commits.

BTW, It's worth mentioning that this is a prototype implementation and we're fine with making large changes for this patch.

mcrosier added a subscriber: bmakam.May 4 2016, 6:09 AM
emaste added a subscriber: emaste.May 4 2016, 2:16 PM
sanjoy edited edge metadata.May 5 2016, 3:39 PM

Some minor comments inline. I'm not familiar with all of the intricacies of the backend, so I'll CC some other people who are likely to have more perspective on this.


Looks like this isn't used yet? If so, I'd suggest dropping it completely for now.


Why do you care only about register operands here?


Why not

AlwaysInstrument = !Attr.hasAttribute(Attribute::None) && Attr.isStringAttribute() &&
      Attr.getValueAsString() == "xray-always";



I don't quite grok the Attributes API, but I think the !Attr.hasAttribute(Attribute::None) bit is redundant -- if isStringAttribute return true, then Attr.hasAttribute(Attribute::None) will be false.


emplace_back ?


Indent is off.

sanjoy edited edge metadata.May 5 2016, 3:40 PM
sanjoy added subscribers: rnk, atrick, majnemer.
majnemer added inline comments.May 5 2016, 4:27 PM

I don't think you want to replace whacky constructs like EH_RETURN and CLEANUPRET with PATCHABLE_RET, do you? Just "normal" returns, right?

dberris updated this revision to Diff 56381.May 5 2016, 10:15 PM
dberris marked 6 inline comments as done.

Address comments from sanjoy and majnemer


Yes, that's right -- does this version exclude those appropriately?


Oops, I had assumed that the RET instructions only ever had register operands. That's a faulty assumption, good catch.

majnemer added inline comments.May 6 2016, 10:37 AM

I would recommend white-listing the opcodes you know are safe to replace instead of blacklisting the ones you know which are unsafe.
This would make the pass conservatively correct in the face of future changes to LLVM.

dberris added inline comments.May 8 2016, 9:19 PM

That makes sense, thanks majnemer -- however it's not clear to me how I'd refer to the target-specific opcodes from here. Pardon the newbie question, but how do I say "I just want return instructions here"? The only options I can see are:

  • Look at the original LLVM IR to see if it's a ReturnInst.
  • Figure out a way for including the X86 opcodes here.

Maybe I'm missing something simpler here?

dberris updated this revision to Diff 56524.May 8 2016, 9:36 PM

Update operand unpacking properly.

dberris updated this revision to Diff 56525.May 8 2016, 9:50 PM
  • clang-format
majnemer added inline comments.May 8 2016, 10:25 PM

The typical way of doing this sort of thing, AFAICT, is to teach TargetInstrInfo about it.

For example, it implements getCatchReturnOpcode and getCallFrameSetupOpcode.

dberris updated this revision to Diff 56527.May 8 2016, 11:05 PM

Add a check in TargetInstrInfo for whether a return is a "normal" return.

We also implement this somewhat correctly in X86, but use the default
implementation for other platforms.

dberris marked an inline comment as done.May 8 2016, 11:06 PM
dberris added inline comments.


I've updated the patch to do this. PTAL?

dberris updated this revision to Diff 56531.May 9 2016, 12:29 AM
dberris marked an inline comment as done.

Fix the uploaded diff.

majnemer added inline comments.May 9 2016, 7:45 AM
155–166 ↗(On Diff #56531)

I was thinking more of a whilelist-oriented solution, something like:

`unsigned getNormalReturnOpcode() const { return NormalRetOpcode; }`
bmakam added inline comments.May 9 2016, 9:56 AM

Could you please expand on why you need 9 bytes of noops here? I am not quite familiar with x86_64 but was under the impression that on x86_64 the jmp instruction is 1 byte for opcode and 4 bytes for signed relative displacement, so shouldn't 5 bytes worth of nops be sufficient?

dberris added inline comments.May 9 2016, 7:10 PM
155–166 ↗(On Diff #56531)

But how does this work on platforms that can spell 'return' many different normal ways (like in X86)? There's RETL, RETQ, and all other versions of RET which are considered "normal"?


Good question, thanks.

I have to check whether we're using the right version of JMP, but I'm specifically looking for the version that's one byte for the JMP instruction, and 8 bits (1 byte) for the relative offset. So far I haven't been able to spell jmp +0x09 correctly and have it work, without having an additional symbol as a target for the jump instruction. If we get that right, we can then add the 9 byte nops we need to get exactly 11 bytes for the function entry.

Is there a fool-proof way of spelling "JMP +0x09" with the builder interface? Or should I add another JMP instruction in X86 that supports the 8-bit displacement immediate operand?

The reason why I can't use a JMP that isn't 2 bytes, is because it's really hard to write just 5 bytes atomically. I can probably do something with an 8-byte atomic write, but enforcing that 8-byte write doesn't span cache lines is also very tricky to make safe.

dberris added inline comments.May 9 2016, 9:49 PM

Actually now that I've had a look at the generated object file, I can confirm that we're using the two-byte version of JMP with this construct. Quick test:

#include <cstdio>

[[clang::xray_always_instrument]] void foo() { std::printf("Hello, XRay!\n"); }

int main(int argc, char* argv[]) { foo(); }

Compiled with (modified clang to emit IR that has annotated functions for XRay):

./bin/clang -fxray-instrument -fxray-instruction-threshold=1 -std=c++11 -x c++ -S -emit-llvm

Creates this IR listing:

; ModuleID = ''
source_filename = ""
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@.str = private unnamed_addr constant [14 x i8] c"Hello, XRay!\0A\00", align 1

; Function Attrs: uwtable
define void @_Z3foov() #0 {
  %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str, i32 0, i32 0))
  ret void

declare i32 @printf(i8*, ...) #1

; Function Attrs: norecurse uwtable
define i32 @main(i32 %argc, i8** %argv) #2 {
  %argc.addr = alloca i32, align 4
  %argv.addr = alloca i8**, align 8
  store i32 %argc, i32* %argc.addr, align 4
  store i8** %argv, i8*** %argv.addr, align 8
  call void @_Z3foov()
  ret i32 0

attributes #0 = { uwtable "disable-tail-calls"="false" "function-instrument"="xray-always" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { norecurse uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" "xray-instruction-threshold"="1" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.9.0 ( 2b9ed9227330789f24e56fe9e800e7be2111073b) ( 1f3e353113ff9e2f835955bf005a8a5e25f16ad1)"}

Then compiled+disassembled this way:

./bin/llc -filetype=obj -o - < test.ll | ./bin/llvm-objdump -disassemble -

Produces the following output:

<stdin>:        file format ELF64-x86-64

Disassembly of section .text:
       0:       eb 09   jmp     9 <_Z3foov+0xB>
       2:       66 0f 1f 84 00 00 02 00 00      nopw    512(%rax,%rax)
       b:       55      pushq   %rbp
       c:       48 89 e5        movq    %rsp, %rbp
       f:       bf 00 00 00 00  movl    $0, %edi
      14:       31 c0   xorl    %eax, %eax
      16:       e8 00 00 00 00  callq   0 <_Z3foov+0x1B>
      1b:       5d      popq    %rbp
      1c:       c3      retq
      1d:       2e 66 0f 1f 84 00 00 02 00 00   nopw    %cs:512(%rax,%rax)
      27:       66 0f 1f 84 00 00 00 00 00      nopw    (%rax,%rax)

      30:       eb 09   jmp     9 <main+0xB>
      32:       66 0f 1f 84 00 00 02 00 00      nopw    512(%rax,%rax)
      3b:       55      pushq   %rbp
      3c:       48 89 e5        movq    %rsp, %rbp
      3f:       48 83 ec 10     subq    $16, %rsp
      43:       89 7d fc        movl    %edi, -4(%rbp)
      46:       48 89 75 f0     movq    %rsi, -16(%rbp)
      4a:       e8 b1 ff ff ff  callq   -79 <_Z3foov>
      4f:       31 c0   xorl    %eax, %eax
      51:       48 83 c4 10     addq    $16, %rsp
      55:       5d      popq    %rbp
      56:       c3      retq
      57:       2e 66 0f 1f 84 00 00 02 00 00   nopw    %cs:512(%rax,%rax)
dberris updated this revision to Diff 56850.May 10 2016, 8:55 PM
  • Add documentation on why we need 9 bytes of nops
dberris marked an inline comment as done.May 10 2016, 9:02 PM
dberris added inline comments.

Updated the comments now to make it clear and specific.


majnemer added inline comments.May 10 2016, 9:03 PM
155–166 ↗(On Diff #56850)

The same way it is handled for X86::ADJCALLSTACKDOWN64 vs X86::ADJCALLSTACKDOWN32:

dberris added inline comments.May 10 2016, 9:08 PM
155–166 ↗(On Diff #56850)

That makes a lot of sense, thanks! I'll rework this a bit to make it more of a targeted specific OpCode.

I suppose it's fine to have multiples of these functions, each for relevant flavours of fairly common instruction classes, yes? So I suppose, for tail call exits/returns, we would have something similar for that?

dberris updated this revision to Diff 56851.May 10 2016, 10:13 PM
  • Make ReturnOpcode an argument to TargetInstrInfo constructor and whitelist one ReturnOpcode instead
dberris marked an inline comment as done.May 10 2016, 10:15 PM
dberris added inline comments.
157–168 ↗(On Diff #56851)

Updated now to white-list a specific return OpCode.

dberris updated this object.May 12 2016, 2:27 AM
dberris added reviewers: rnk, eugenis, kcc, pcc.
dberris marked an inline comment as not done.May 16 2016, 8:36 PM

Ping -- is there something else missing for this change?

dberris marked 14 inline comments as done.May 16 2016, 8:38 PM

Marked more inline comments as "Done".

echristo edited edge metadata.May 17 2016, 5:32 PM

Drive by comment.


Remove the XRay specific stuff here and rewrite in a more generic fashion (and below).

dberris updated this revision to Diff 57556.May 17 2016, 7:49 PM
dberris edited edge metadata.
  • Update documentation to remove XRay references
dberris marked an inline comment as done.May 17 2016, 7:50 PM
dberris added inline comments.


dberris marked an inline comment as done.May 23 2016, 7:08 PM


dberris updated this revision to Diff 60652.Jun 13 2016, 11:27 PM

Sync to master again.

dberris updated this revision to Diff 61491.Jun 21 2016, 7:49 PM
  • Fixes after merge
  • Emit a common symbol for the instrumentation map
  • Improve table creation and allow access to the table via xray_instr_map and xray_instr_map_end
  • Fix the sled type encoding
niravd added a subscriber: niravd.Jun 22 2016, 6:17 AM
dberris updated this revision to Diff 61625.Jun 22 2016, 4:30 PM
  • Use an explicit byte sequence for the jmp
dberris updated this revision to Diff 61929.Jun 26 2016, 8:59 PM
  • Make XRay write out tables per function
dberris updated this revision to Diff 62471.Jul 1 2016, 12:07 AM
  • Clean up local history, squashing to a single revision (rebased too)
dberris updated this revision to Diff 62651.Jul 4 2016, 12:36 AM
  • Rebase
  • Add ELF::SHF_MERGE to flags for XRay section
  • Update tests to reflect adjustments in implementation
dberris updated this revision to Diff 62716.Jul 4 2016, 10:30 PM
  • Un-break XRay section creation.
rnk accepted this revision.Jul 12 2016, 9:18 AM
rnk edited edge metadata.



Does XRay support multiple DSOs in the process? If so, the XRay runtime may be more than 2GB away from the code being patched, and this offset will overflow.


I guess we can deal with this separately. IMO we should have a 'jmpb' instruction or something that forces a short jump or assembler error if the displacement is too large.

This revision is now accepted and ready to land.Jul 12 2016, 9:18 AM
dberris marked 2 inline comments as done.Jul 12 2016, 7:00 PM

Thanks Reid -- do you or echristo@ mind landing this for me? I don't think I have commit powers yet.


That's true -- the runtime patching code will fail to patch certain sleds that are farther than 32-bits away. In case we need more, in the future we can tweak this so that we can try to force the XRay runtime trampoline in the first 2GB of the process' memory and use an "absolute" address (which may need more bytes in the sled), or have a mode to support larger binaries (a flag that changes the emitted sleds, and have the runtime support both kinds of sleds).

I suspect if it comes up in practice that we can deal with this more effectively.


I'm happy to do this as a refactoring or a follow-up change. A new instruction sounds like the right action to take anyway.

dberris updated this revision to Diff 63770.Jul 12 2016, 7:33 PM
dberris marked 2 inline comments as done.
dberris edited edge metadata.


This revision was automatically updated to reflect the committed changes.