This is an archive of the discontinued LLVM Phabricator instance.

[XRay][compiler-rt] Fix up CFI annotations and stack alignment
ClosedPublic

Authored by dberris on Apr 18 2017, 5:10 PM.

Details

Summary

Previously, we had been very undisciplined about CFI annotations with
the XRay trampolines. This leads to runtime crashes due to mis-alined
stack pointers that some function implementations may run into (i.e.
those using instructions that require properly aligned addresses coming
from the stack). This patch attempts to clean that up, as well as more
accurately use the correct amounts of space on the stack for stashing
and un-stashing registers.

Diff Detail

Repository
rL LLVM

Event Timeline

dberris created this revision.Apr 18 2017, 5:10 PM

Adding a couple more reviewers to get more eyes on this.

pelikan accepted this revision.Apr 18 2017, 10:35 PM

While it's obvious to me this won't hurt, I'm not convinced the compiler can freely expect the stack to be 16-byte aligned because it doesn't know before link-time what other objects will go into the resulting binary (and corrupt this assumption). The amd64 ABI only requires 8-byte alignments.

This revision is now accepted and ready to land.Apr 18 2017, 10:35 PM
eugenis requested changes to this revision.Apr 18 2017, 10:47 PM

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

This revision now requires changes to proceed.Apr 18 2017, 10:47 PM
This revision was automatically updated to reflect the committed changes.

Whoops, I'll make a follow-up change, didn't see the comments before landing.

I stand corrected, I was reading it wrong.

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

So adjusting this just another 8 bytes artificially would be fine, yes?

As discussed, the first CFA offset should be 8 when the trampoline is entered using JMP (16 is for CALL insns which save RIP onto the stack). The other number does NOT include the first number because the directive represents the amount of bytes from RSP. Therefore it will NOT be a multiple of 16 because the last set of registers we save are 16B each, and the number will end with that final 8 caused by the initial PUSH RBP. And these things should really go into the SAVE_REGISTERS macro since in the case of stack overflow, we want the precise information in gdb.

Math is hard, let's go shopping.

kpw edited edge metadata.Apr 18 2017, 11:40 PM

This may fix our alignment issues. It might be worth mentioning that the stack alignment bug we was orthogonal to the buggy stack alignment code, which I think is mostly used for debugging and exception purposes.

I still think that we're missing a big piece of the puzzle to have stack unwinding support code with exception handling rather than just produce a stacktrace.

It seems to me that the entry trampolines should have directives in the SAVE_REGISTERS macro that specify how an unwinder restores registers when carrying an exception up the stack.

An example directive, from https://sourceware.org/binutils/docs/as/CFI-directives.html

.cfi_offset register, offset --- Previous value of register is saved at offset offset from CFA.

CFI is all a part of the ABI that I've never dug into before, and I'm still relatively poor at x86 comprehension, so I'd like to make sure more experienced eyes give this at least a spot check.

lib/xray/xray_trampoline_x86_64.S
19 ↗(On Diff #95658)

We're trying to maintain 16 byte stack alignment IIUC. Are we expecting the stack pointer to be unaligned by an 8 byte offset when this is invoked? Is this expectation due to a callq instruction in the entry sled?

70 ↗(On Diff #95658)

I think you can offload your math onto the assembler with .cfi_adjust_cfa_offset

kpw added a comment.Apr 18 2017, 11:47 PM

If we decide to add additional directives for an unwinder, let's do it in a different CL so that the more immediate alignment issues don't block on it. :)

kpw added inline comments.Apr 18 2017, 11:56 PM
compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S
66–68

Does Martin's comment explain this? Quadwords are 8 bytes on x86. Why is the offset 16 bytes here?

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

So adjusting this just another 8 bytes artificially would be fine, yes?

Looked into this a bit more and because rbp + 184 + address is really just 208, we're fine (it's 16-byte aligned ).

I have a change coming taking into account the comments by @kpw and @pelikan as well.

dberris added inline comments.Apr 18 2017, 11:58 PM
compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S
66–68

Yes, Martin's comment explains that as the function is entered, the offset is already at 8 because the 'call' instruction would have already put the return address onto the stack. So the push of 8 bytes would adjust the offset to 16.

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

So adjusting this just another 8 bytes artificially would be fine, yes?

Looked into this a bit more and because rbp + 184 + address is really just 208, we're fine (it's 16-byte aligned ).

I think we are both wrong: rbp + 184 + address is 200, and it's not 16-byte aligned.
Does it work because the instrumentation calls this trampoline with 16+8 aligned stack?

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

So adjusting this just another 8 bytes artificially would be fine, yes?

Looked into this a bit more and because rbp + 184 + address is really just 208, we're fine (it's 16-byte aligned ).

I think we are both wrong: rbp + 184 + address is 200, and it's not 16-byte aligned.
Does it work because the instrumentation calls this trampoline with 16+8 aligned stack?

Ah, well, what happens is this:

  • On function entry, offset is already at 8 (frame of the instrumented function). At runtime when we patch, we turn the entry sled into some instructions then a call, which will add another 8 bytes onto the stack (this is the return instruction pointer).
  • When we enter this trampoline, we push 8 bytes for rbp (8 + 8 = 16)
  • We then use another 184 bytes onto the stack (184 + 16 = 200)
  • Then we call the installed handler, it will add 8 bytes onto the stack (return instruction pointer) (200 + 8 = 208)

So by the time the handler function is called we're already on a 16-byte aligned address on the stack.

Does this make more sense?

It does. So it's a custom calling convention. You may want to write it down in a comment, as in "at this point %rsp must be 16n+8" to avoid future confusion.

Good idea -- done, updated in D32214.