This is an archive of the discontinued LLVM Phabricator instance.

[XRay][compiler-rt] Fix up CFI annotations and stack alignment
Closed, Public

Authored by dberris on Apr 18 2017, 5:10 PM.

Details

Reviewers

kcc
kpw
pelikan
eugenis

Commits

rG9404497acddc: [XRay][compiler-rt] Fix up CFI annotations and stack alignment
rCRT300660: [XRay][compiler-rt] Fix up CFI annotations and stack alignment
rL300660: [XRay][compiler-rt] Fix up CFI annotations and stack alignment

Summary

Previously, we had been very undisciplined about CFI annotations with
the XRay trampolines. This leads to runtime crashes due to misaligned
stack pointers that some function implementations may run into (i.e.
those using instructions that require properly aligned addresses coming
from the stack). This patch attempts to clean that up, as well as more
accurately use the correct amounts of space on the stack for stashing
and un-stashing registers.

Diff Detail

Repository: rL LLVM

Event Timeline

dberris created this revision. Apr 18 2017, 5:10 PM

Adding a couple more reviewers to get more eyes on this.

While it's obvious to me this won't hurt, I'm not convinced the compiler can freely expect the stack to be 16-byte aligned because it doesn't know before link-time what other objects will go into the resulting binary (and corrupt this assumption). The amd64 ABI only requires 8-byte alignments.

This revision is now accepted and ready to land. Apr 18 2017, 10:35 PM

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

This revision now requires changes to proceed. Apr 18 2017, 10:47 PM

https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
section 3.2.2

Closed by commit rL300660: [XRay][compiler-rt] Fix up CFI annotations and stack alignment (authored by dberris). Apr 18 2017, 10:50 PM

This revision was automatically updated to reflect the committed changes.

Whoops, I'll make a follow-up change, didn't see the comments before landing.

I stand corrected, I was reading it wrong.

In D32202#730089, @eugenis wrote:

FunctionEntry does not look right. It has the frame of 216 bytes (return address + rbp + 184), which is not a multiple of 16. The same with ArgLoggerEntry.

So adjusting this just another 8 bytes artificially would be fine, yes?

As discussed, the first CFA offset should be 8 when the trampoline is entered using JMP (16 is for CALL insns which save RIP onto the stack). The other number does NOT include the first number because the directive represents the amount of bytes from RSP. Therefore it will NOT be a multiple of 16 because the last set of registers we save are 16B each, and the number will end with that final 8 caused by the initial PUSH RBP. And these things should really go into the SAVE_REGISTERS macro since in the case of stack overflow, we want the precise information in gdb.

Math is hard, let's go shopping.

This may fix our alignment issues. It might be worth mentioning that the stack alignment bug we saw was orthogonal to the buggy stack alignment code, which I think is mostly used for debugging and exception purposes.

I still think that we're missing a big piece of the puzzle if we want the stack unwinding support to work with exception handling rather than just produce a stack trace.

It seems to me that the entry trampolines should have directives in the SAVE_REGISTERS macro that specify how an unwinder restores registers when carrying an exception up the stack.

An example directive, from https://sourceware.org/binutils/docs/as/CFI-directives.html

.cfi_offset register, offset --- Previous value of register is saved at offset offset from CFA.
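To make the suggestion concrete, here is a rough sketch of how the `.cfi_offset` operands could be derived for the general-purpose registers in the new SAVE_REGISTERS layout. It assumes the CFA is `%rsp + 200` once the macro has run (the value this diff annotates), so a register stored at `k(%rsp)` sits at `k - 200` relative to the CFA. The helper names `cfi_offset_operand` and `cfi_offset_lines` are made up for illustration and are not part of the patch.

```python
# Assumption: after `pushq %rbp` and SAVE_REGISTERS, CFA = %rsp + 200,
# matching the `.cfi_def_cfa_offset 200` annotation in this diff.
CFA_OFFSET = 200

# %rsp-relative slots of the general-purpose registers in the new layout.
GPR_SLOTS = {"rdi": 48, "rax": 40, "rdx": 32, "rsi": 24, "rcx": 16, "r8": 8, "r9": 0}


def cfi_offset_operand(rsp_slot, cfa_offset=CFA_OFFSET):
    """Offset of a save slot relative to the CFA (hypothetical helper)."""
    return rsp_slot - cfa_offset


def cfi_offset_lines(slots=GPR_SLOTS):
    """Format one `.cfi_offset` directive per saved register."""
    return [
        f".cfi_offset %{reg}, {cfi_offset_operand(slot)}"
        for reg, slot in slots.items()
    ]


if __name__ == "__main__":
    print("\n".join(cfi_offset_lines()))
```

Whether an unwinder actually needs all of these registers described is a separate question; the sketch only shows the arithmetic.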

CFI is all a part of the ABI that I've never dug into before, and I'm still relatively poor at x86 comprehension, so I'd like to make sure more experienced eyes give this at least a spot check.

lib/xray/xray_trampoline_x86_64.S
19 ↗	(On Diff #95658)	We're trying to maintain 16 byte stack alignment IIUC. Are we expecting the stack pointer to be unaligned by an 8 byte offset when this is invoked? Is this expectation due to a callq instruction in the entry sled?
70 ↗	(On Diff #95658)	I think you can offload your math onto the assembler with .cfi_adjust_cfa_offset
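For readers less familiar with the two directives, the toy model below contrasts them: `.cfi_def_cfa_offset` sets an absolute offset, while `.cfi_adjust_cfa_offset` adds a delta to whatever the previous offset was. It assumes the CFA starts at `%rsp + 8` on x86-64 (the return address) and uses the 8-byte push and 184-byte save area from this diff. The `CfaTracker` class is purely illustrative and does not model a real assembler.

```python
class CfaTracker:
    """Toy model of the CFA offset an assembler tracks (illustrative only)."""

    def __init__(self, initial=8):
        # Assumption: on x86-64 the CFA starts at %rsp + 8 (return address).
        self.offset = initial

    def def_cfa_offset(self, value):
        # Models `.cfi_def_cfa_offset value`: set an absolute offset.
        self.offset = value

    def adjust_cfa_offset(self, delta):
        # Models `.cfi_adjust_cfa_offset delta`: add to the previous offset.
        self.offset += delta


def entry_trampoline_offset():
    """CFA offset after `pushq %rbp` and `subq $184, %rsp`."""
    cfa = CfaTracker()
    cfa.adjust_cfa_offset(8)    # after `pushq %rbp`
    cfa.adjust_cfa_offset(184)  # after `subq $184, %rsp` in SAVE_REGISTERS
    return cfa.offset


if __name__ == "__main__":
    print(entry_trampoline_offset())
```

With the relative form, each stack adjustment is annotated with its own size, so nobody has to add up the running total by hand.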

If we decide to add additional directives for an unwinder, let's do it in a different CL so that the more immediate alignment issues don't block on it. :)

kpw added inline comments. Apr 18 2017, 11:56 PM

compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S
66–68	Does Martin's comment explain this? Quadwords are 8 bytes on x86. Why is the offset 16 bytes here?

In D32202#730098, @dberris wrote:

So adjusting this just another 8 bytes artificially would be fine, yes?

Looked into this a bit more and because rbp + 184 + address is really just 208, we're fine (it's 16-byte aligned).

I have a change coming taking into account the comments by @kpw and @pelikan as well.

dberris added inline comments. Apr 18 2017, 11:58 PM

compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S
66–68	Yes, Martin's comment explains that as the function is entered, the offset is already at 8 because the 'call' instruction would have already put the return address onto the stack. So the push of 8 bytes would adjust the offset to 16.

dberris mentioned this in D32214: [XRay][compiler-rt] Cleanup CFI/CFA annotations on trampolines. Apr 19 2017, 12:02 AM

In D32202#730125, @dberris wrote:

Looked into this a bit more and because rbp + 184 + address is really just 208, we're fine (it's 16-byte aligned).

I think we are both wrong: rbp + 184 + address is 200, and it's not 16-byte aligned.
Does it work because the instrumentation calls this trampoline with 16+8 aligned stack?

In D32202#731056, @eugenis wrote:

I think we are both wrong: rbp + 184 + address is 200, and it's not 16-byte aligned.
Does it work because the instrumentation calls this trampoline with 16+8 aligned stack?

Ah, well, what happens is this:

On function entry, offset is already at 8 (frame of the instrumented function). At runtime when we patch, we turn the entry sled into some instructions then a call, which will add another 8 bytes onto the stack (this is the return instruction pointer).
When we enter this trampoline, we push 8 bytes for rbp (8 + 8 = 16)
We then reserve another 184 bytes on the stack (184 + 16 = 200)
Then we call the installed handler, it will add 8 bytes onto the stack (return instruction pointer) (200 + 8 = 208)

So by the time the handler function is called we're already on a 16-byte aligned address on the stack.

Does this make more sense?
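The running total above can be replayed as a small sketch. It takes the step sizes exactly as listed in the comment (offset of 8 on trampoline entry, then `pushq %rbp`, the 184-byte save area, and the return address pushed when the handler is called) and checks the result modulo 16. The names `STEPS` and `running_totals` are illustrative, and the totals are relative to the starting point described in the comment rather than absolute addresses.

```python
# Each step is (description, bytes added), taken from the comment above.
STEPS = [
    ("offset already at 8 when the trampoline is entered", 8),
    ("pushq %rbp", 8),
    ("subq $184, %rsp (SAVE_REGISTERS)", 184),
    ("return address pushed by the call to the handler", 8),
]


def running_totals(steps=STEPS):
    """Return the cumulative byte count after each step."""
    totals, total = [], 0
    for _, size in steps:
        total += size
        totals.append(total)
    return totals


if __name__ == "__main__":
    for (desc, _), total in zip(STEPS, running_totals()):
        print(f"{total:4d}  (mod 16 = {total % 16:2d})  {desc}")
```

The total is 200 (16n + 8) just before the handler is called and 208 (a multiple of 16) once the call has pushed its return address, which is the property the comment relies on.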

It does. So it's a custom calling convention. You may want to write it down in a comment, as in "at this point %rsp must be 16n+8" to avoid future confusion.

Good idea -- done, updated in D32214.

dberris mentioned this in rL300815: [XRay][compiler-rt] Cleanup CFI/CFA annotations on trampolines. Apr 19 2017, 8:38 PM

Revision Contents

Path	Size
compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S	89 lines

Diff 95687

compiler-rt/trunk/lib/xray/xray_trampoline_x86_64.S

(10 lines not shown)
//		//
// This implements the X86-specific assembler for the trampolines.		// This implements the X86-specific assembler for the trampolines.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "../builtins/assembly.h"		#include "../builtins/assembly.h"

.macro SAVE_REGISTERS		.macro SAVE_REGISTERS
subq $200, %rsp		subq $184, %rsp
movupd %xmm0, 184(%rsp)		movupd %xmm0, 168(%rsp)
movupd %xmm1, 168(%rsp)		movupd %xmm1, 152(%rsp)
movupd %xmm2, 152(%rsp)		movupd %xmm2, 136(%rsp)
movupd %xmm3, 136(%rsp)		movupd %xmm3, 120(%rsp)
movupd %xmm4, 120(%rsp)		movupd %xmm4, 104(%rsp)
movupd %xmm5, 104(%rsp)		movupd %xmm5, 88(%rsp)
movupd %xmm6, 88(%rsp)		movupd %xmm6, 72(%rsp)
movupd %xmm7, 72(%rsp)		movupd %xmm7, 56(%rsp)
movq %rdi, 64(%rsp)		movq %rdi, 48(%rsp)
movq %rax, 56(%rsp)		movq %rax, 40(%rsp)
movq %rdx, 48(%rsp)		movq %rdx, 32(%rsp)
movq %rsi, 40(%rsp)		movq %rsi, 24(%rsp)
movq %rcx, 32(%rsp)		movq %rcx, 16(%rsp)
movq %r8, 24(%rsp)		movq %r8, 8(%rsp)
movq %r9, 16(%rsp)		movq %r9, 0(%rsp)
.endm		.endm

.macro RESTORE_REGISTERS		.macro RESTORE_REGISTERS
movupd 184(%rsp), %xmm0		movupd 168(%rsp), %xmm0
movupd 168(%rsp), %xmm1		movupd 152(%rsp), %xmm1
movupd 152(%rsp), %xmm2		movupd 136(%rsp), %xmm2
movupd 136(%rsp), %xmm3		movupd 120(%rsp), %xmm3
movupd 120(%rsp), %xmm4		movupd 104(%rsp), %xmm4
movupd 104(%rsp), %xmm5		movupd 88(%rsp), %xmm5
movupd 88(%rsp) , %xmm6		movupd 72(%rsp) , %xmm6
movupd 72(%rsp) , %xmm7		movupd 56(%rsp) , %xmm7
movq 64(%rsp), %rdi		movq 48(%rsp), %rdi
movq 56(%rsp), %rax		movq 40(%rsp), %rax
movq 48(%rsp), %rdx		movq 32(%rsp), %rdx
movq 40(%rsp), %rsi		movq 24(%rsp), %rsi
movq 32(%rsp), %rcx		movq 16(%rsp), %rcx
movq 24(%rsp), %r8		movq 8(%rsp), %r8
movq 16(%rsp), %r9		movq 0(%rsp), %r9
addq $200, %rsp		addq $184, %rsp
.endm		.endm

.text		.text
.file "xray_trampoline_x86.S"		.file "xray_trampoline_x86.S"

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

.globl __xray_FunctionEntry		.globl __xray_FunctionEntry
.align 16, 0x90		.align 16, 0x90
.type __xray_FunctionEntry,@function		.type __xray_FunctionEntry,@function

__xray_FunctionEntry:		__xray_FunctionEntry:
.cfi_startproc		.cfi_startproc
pushq %rbp		pushq %rbp
.cfi_def_cfa_offset 16		.cfi_def_cfa_offset 16
		kpw: Does Martin's comment explain this? Quadwords are 8 bytes on x86. Why is the offset 16 bytes here?
		dberris (author): Yes, Martin's comment explains that as the function is entered, the offset is already at 8 because the 'call' instruction would have already put the return address onto the stack. So the push of 8 bytes would adjust the offset to 16.
SAVE_REGISTERS		SAVE_REGISTERS
		.cfi_def_cfa_offset 200

// This load has to be atomic, it's concurrent with __xray_patch().		// This load has to be atomic, it's concurrent with __xray_patch().
// On x86/amd64, a simple (type-aligned) MOV instruction is enough.		// On x86/amd64, a simple (type-aligned) MOV instruction is enough.
movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax		movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax
testq %rax, %rax		testq %rax, %rax
je .Ltmp0		je .Ltmp0

// The patched function prolog puts its xray_instr_map index into %r10d.		// The patched function prolog puts its xray_instr_map index into %r10d.
(15 lines not shown)	//===----------------------------------------------------------------------===//
.type __xray_FunctionExit,@function		.type __xray_FunctionExit,@function
__xray_FunctionExit:		__xray_FunctionExit:
.cfi_startproc		.cfi_startproc
// Save the important registers first. Since we're assuming that this		// Save the important registers first. Since we're assuming that this
// function is only jumped into, we only preserve the registers for		// function is only jumped into, we only preserve the registers for
// returning.		// returning.
pushq %rbp		pushq %rbp
.cfi_def_cfa_offset 16		.cfi_def_cfa_offset 16
subq $56, %rsp		subq $48, %rsp
.cfi_def_cfa_offset 32		.cfi_def_cfa_offset 64
movupd %xmm0, 40(%rsp)		movupd %xmm0, 32(%rsp)
movupd %xmm1, 24(%rsp)		movupd %xmm1, 16(%rsp)
movq %rax, 16(%rsp)		movq %rax, 8(%rsp)
movq %rdx, 8(%rsp)		movq %rdx, 0(%rsp)
movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax		movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax
testq %rax,%rax		testq %rax,%rax
je .Ltmp2		je .Ltmp2

movl %r10d, %edi		movl %r10d, %edi
movl $1, %esi		movl $1, %esi
callq *%rax		callq *%rax
.Ltmp2:		.Ltmp2:
// Restore the important registers.		// Restore the important registers.
movupd 40(%rsp), %xmm0		movupd 32(%rsp), %xmm0
movupd 24(%rsp), %xmm1		movupd 16(%rsp), %xmm1
movq 16(%rsp), %rax		movq 8(%rsp), %rax
movq 8(%rsp), %rdx		movq 0(%rsp), %rdx
addq $56, %rsp		addq $48, %rsp
popq %rbp		popq %rbp
retq		retq
.Ltmp3:		.Ltmp3:
.size __xray_FunctionExit, .Ltmp3-__xray_FunctionExit		.size __xray_FunctionExit, .Ltmp3-__xray_FunctionExit
.cfi_endproc		.cfi_endproc

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

.global __xray_FunctionTailExit		.global __xray_FunctionTailExit
.align 16, 0x90		.align 16, 0x90
.type __xray_FunctionTailExit,@function		.type __xray_FunctionTailExit,@function
__xray_FunctionTailExit:		__xray_FunctionTailExit:
.cfi_startproc		.cfi_startproc
// Save the important registers as in the entry trampoline, but indicate that		// Save the important registers as in the entry trampoline, but indicate that
// this is an exit. In the future, we will introduce a new entry type that		// this is an exit. In the future, we will introduce a new entry type that
// differentiates between a normal exit and a tail exit, but we'd have to do		// differentiates between a normal exit and a tail exit, but we'd have to do
// this and increment the version number for the header.		// this and increment the version number for the header.
pushq %rbp		pushq %rbp
.cfi_def_cfa_offset 16		.cfi_def_cfa_offset 16
SAVE_REGISTERS		SAVE_REGISTERS
		.cfi_def_cfa_offset 200

movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax		movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax
testq %rax,%rax		testq %rax,%rax
je .Ltmp4		je .Ltmp4

movl %r10d, %edi		movl %r10d, %edi
movl $1, %esi		movl $1, %esi
callq *%rax		callq *%rax
(11 lines not shown)	//===----------------------------------------------------------------------===//
.globl __xray_ArgLoggerEntry		.globl __xray_ArgLoggerEntry
.align 16, 0x90		.align 16, 0x90
.type __xray_ArgLoggerEntry,@function		.type __xray_ArgLoggerEntry,@function
__xray_ArgLoggerEntry:		__xray_ArgLoggerEntry:
.cfi_startproc		.cfi_startproc
pushq %rbp		pushq %rbp
.cfi_def_cfa_offset 16		.cfi_def_cfa_offset 16
SAVE_REGISTERS		SAVE_REGISTERS
		.cfi_def_cfa_offset 200

// Again, these function pointer loads must be atomic; MOV is fine.		// Again, these function pointer loads must be atomic; MOV is fine.
movq _ZN6__xray13XRayArgLoggerE(%rip), %rax		movq _ZN6__xray13XRayArgLoggerE(%rip), %rax
testq %rax, %rax		testq %rax, %rax
jne .Larg1entryLog		jne .Larg1entryLog

// If [arg1 logging handler] not set, defer to no-arg logging.		// If [arg1 logging handler] not set, defer to no-arg logging.
movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax		movq _ZN6__xray19XRayPatchedFunctionE(%rip), %rax
(19 lines not shown)
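As a sanity check on the frame sizes in this diff, the sketch below recomputes the save-area sizes from the number of registers stored (16 bytes per `movupd` of an XMM register, 8 bytes per `movq` of a general-purpose register) and the resulting CFA offsets. It assumes the CFA starts at 8 on entry and that `pushq %rbp` adds another 8. The function names are illustrative only.

```python
XMM_BYTES = 16  # movupd stores a full 128-bit XMM register
GPR_BYTES = 8   # movq stores a 64-bit general-purpose register


def save_area_bytes(num_xmm, num_gpr):
    """Bytes needed to stash the given number of XMM and GPR registers."""
    return num_xmm * XMM_BYTES + num_gpr * GPR_BYTES


def cfa_offset_after_save(save_bytes, entry_offset=8, pushed_rbp=8):
    """CFA offset after `pushq %rbp` and `subq $save_bytes, %rsp`.

    Assumption: the CFA is at %rsp + 8 when the trampoline is entered.
    """
    return entry_offset + pushed_rbp + save_bytes


if __name__ == "__main__":
    entry = save_area_bytes(num_xmm=8, num_gpr=7)  # SAVE_REGISTERS
    exit_ = save_area_bytes(num_xmm=2, num_gpr=2)  # __xray_FunctionExit
    print(entry, cfa_offset_after_save(entry))
    print(exit_, cfa_offset_after_save(exit_))
```

This matches the new `subq $184` / `.cfi_def_cfa_offset 200` and `subq $48` / `.cfi_def_cfa_offset 64` values. The old sizes of 200 and 56 left 16 and 8 bytes unused below the lowest save slot.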