Previously, we had been very undisciplined about CFI annotations with
the XRay trampolines. This leads to runtime crashes due to mis-alined
stack pointers that some function implementations may run into (i.e.
those using instructions that require properly aligned addresses coming
from the stack). This patch attempts to clean that up, as well as more
accurately use the correct amounts of space on the stack for stashing
and un-stashing registers.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
While it's obvious to me this won't hurt, I'm not convinced the compiler can freely expect the stack to be 16-byte aligned because it doesn't know before link-time what other objects will go into the resulting binary (and corrupt this assumption). The amd64 ABI only requires 8-byte alignments.
FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.
As discussed, the first CFA offset should be 8 when the trampoline is entered using JMP (16 is for CALL insns which save RIP onto the stack). The other number does NOT include the first number because the directive represents the amount of bytes from RSP. Therefore it will NOT be a multiple of 16 because the last set of registers we save are 16B each, and the number will end with that final 8 caused by the initial PUSH RBP. And these things should really go into the SAVE_REGISTERS macro since in the case of stack overflow, we want the precise information in gdb.
Math is hard, let's go shopping.
This may fix our alignment issues. It might be worth mentioning that the stack alignment bug we was orthogonal to the buggy stack alignment code, which I think is mostly used for debugging and exception purposes.
I still think that we're missing a big piece of the puzzle to have stack unwinding support code with exception handling rather than just produce a stacktrace.
It seems to me that the entry trampolines should have directives in the SAVE_REGISTERS macro that specify how an unwinder restores registers when carrying an exception up the stack.
An example directive, from https://sourceware.org/binutils/docs/as/CFI-directives.html
.cfi_offset register, offset --- Previous value of register is saved at offset offset from CFA.
CFI is all a part of the ABI that I've never dug into before, and I'm still relatively poor at x86 comprehension, so I'd like to make sure more experienced eyes give this at least a spot check.
lib/xray/xray_trampoline_x86_64.S | ||
---|---|---|
19 ↗ | (On Diff #95658) | We're trying to maintain 16 byte stack alignment IIUC. Are we expecting the stack pointer to be unaligned by an 8 byte offset when this is invoked? Is this expectation due to a callq instruction in the entry sled? |
70 ↗ | (On Diff #95658) | I think you can offload your math onto the assembler with .cfi_adjust_cfa_offset |
If we decide to add additional directives for an unwinder, let's do it in a different CL so that the more immediate alignment issues don't block on it. :)
compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S | ||
---|---|---|
66–68 | Does Martin's comment explain this? Quadwords are 8 bytes on x86. Why is the offset 16 bytes here? |
compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S | ||
---|---|---|
66–68 | Yes, Martin's comment explains that as the function is entered, the offset is already at 8 because the 'call' instruction would have already put the return address onto the stack. So the push of 8 bytes would adjust the offset to 16. |
I think we are both wrong: rbp + 184 + address is 200, and it's not 16-byte aligned.
Does it work because the instrumentation calls this trampoline with 16+8 aligned stack?
Ah, well, what happens is this:
- On function entry, offset is already at 8 (frame of the instrumented function). At runtime when we patch, we turn the entry sled into some instructions then a call, which will add another 8 bytes onto the stack (this is the return instruction pointer).
- When we enter this trampoline, we push 8 bytes for rbp (8 + 8 = 16)
- We then use another 184 bytes onto the stack (184 + 16 = 200)
- Then we call the installed handler, it will add 8 bytes onto the stack (return instruction pointer) (200 + 8 = 208)
So by the time the handler function is called we're already on a 16-byte aligned address on the stack.
Does this make more sense?
It does. So it's a custom calling convention. You may want to write it down in a comment, as in "at this point %rsp must be 16n+8" to avoid future confusion.
Does Martin's comment explain this? Quadwords are 8 bytes on x86. Why is the offset 16 bytes here?