This is an archive of the discontinued LLVM Phabricator instance.

fix stack probe lowering for x86_intrcc
ClosedPublic

Authored by Freax13 on May 6 2023, 3:56 AM.

Details

Summary

The x86_intrcc calling convention will build two STACKALLOC_W_PROBING machine instructions if the function takes an error code. This is caused by an additional call to emitSPUpdate in llvm/lib/Target/X86/X86FrameLowering.cpp:1650. Previously only the first STACKALLOC_W_PROBING machine instruction was properly handled, the second one was simply ignored. This lead to miscompilations where the stack pointer wasn't properly updated (see https://github.com/rust-lang/rust/issues/109918). This patch fixes this by handling all STACKALLOC_W_PROBING machine instructions.

To be honest I don't quite understand why this didn't lead to more noticeable miscompilations previously.

This is my first time contributing to LLVM.

Diff Detail

Event Timeline

Freax13 created this revision.May 6 2023, 3:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2023, 3:56 AM
Freax13 requested review of this revision.May 6 2023, 3:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2023, 3:56 AM
RKSimon edited reviewers, added: pengfei, RKSimon; removed: llvm-commits.May 6 2023, 4:00 AM

To be honest I don't quite understand why this didn't lead to more noticeable miscompilations previously.

I don't know much about STACKALLOC_W_PROBING, just curious what's special to x86_intrcc? My real concern is if it's still well defined behavior if the function takes an error code.

llvm/lib/Target/X86/X86FrameLowering.cpp
658

Use for (MachineInstr &MI : PrologMBB) { to iterate rather than loop find_if.

llvm/test/CodeGen/X86/x86-64-intrcc.ll
181

Maybe better to use ;; for comments. The same below.

187–188

This can be in one line. There's no strict column limitation for test case.

To be honest I don't quite understand why this didn't lead to more noticeable miscompilations previously.

I don't know much about STACKALLOC_W_PROBING, just curious what's special to x86_intrcc? My real concern is if it's still well defined behavior if the function takes an error code.

x86_intrcc is used for exception/interrupt handlers on x86. Before pushing anything onto the stack the cpu will align the stack to 16-bytes. For exceptions without an error code it will push 5 8-byte values onto the stack and for exceptions with an error code it will push 6 8-byte values. This means that for exceptions without an error code the stack will be misaligned by 8 bytes and for exceptions with an error code it will be correctly 16-byte aligned.

Usually the x86-64 ABI requires that the stack will be 16-byte aligned before a function call meaning that it will by misaligned by 8 bytes after the call. The function will then push an additional value to the stack to align the stack to 16 bytes. IIUC for exceptions without an error code no special handling is needed because the stack is 8-byte misaligned just like it would be for a normal call, whereas it's off by 8 for an exception with an error code which is why the stack pointer is updated to bring it back to that misaligned state.

I also don't know much about STACKALLOC_W_PROBING but from I can gather it's used to allocate space on the stack while also probing it to ensure that that space isn't part of a guard page. The probing is done by writing to the memory to see if it will trigger a fault. IIUC in the context of x86_intrcc the 8 byte stack pointer change should never cause any stack probing to be emitted because it's smaller than the stack probe size (4096 bytes). Even if probing were to occur it should probe the memory at the opposite end of the allocation nowhere near the interrupt stack frame pushed by the cpu, so corruption of the interrupt stack frame shouldn't happen. I'm not sure if updating the stack pointer by 8 breaks any assumptions made for stack probing (previously the called function knew that (%rsp) contained the return address i.e. is used and writeable memory, this is no longer the case if we update the stack pointer by 8 when the function is first entered).

llvm/lib/Target/X86/X86FrameLowering.cpp
658

Hmm, this doesn't work and I think my whole approach is flawed: Initially I was afraid to use a for loop because I was afraid that the iterator would get invalidated by the call to emitStackProbeInline and eraseFromParent. The invalidation caused eraseFromParent isn't a huge problem and could probably be bypassed by using make_early_inc_range. The real problem is that emitStackProbeInline will split the basic block when it actually inserts stack probe if it chooses to insert a loop. Crucially this means that if the first STACKALLOC_W_PROBING machine instruction causes one of these stack probe loops to be created all other STACKALLOC_W_PROBINGs will be moved into a new basic block after the loop. This means that inlineStackProbe will no longer find them and we're back to where we started with STACKALLOC_W_PROBING not being inlined.
I'm not yet sure how to fix this, I'd welcome suggestions.

IIUC in the context of x86_intrcc the 8 byte stack pointer change should never cause any stack probing to be emitted because it's smaller than the stack probe size (4096 bytes)

I have the same conclusion but for a different reason. It is not because 8 bytes < 4096. We still have chance to enter a guard page if the current RSP is near the end a a page, e.g., 4095. The reason is no matter what the size we want for stack realignment, the rest space we used to align is always in the same page unless the alignment requirement > 4096 bytes, which I don't think existing in reality.
Given that, I think should prevent to build the first STACKALLOC_W_PROBING rather than patch it later.

x86_intrcc is used for exception/interrupt handlers on x86... The probing is done by writing to the memory to see if it will trigger a fault.

IIUC, the exception/interrupt handlers are under ring 0 mode, do we really have a chance to trigger fault? Maybe we shouldn't do probe-stack with x86_intrcc?

llvm/lib/Target/X86/X86FrameLowering.cpp
658

I think this can be solved by recording MI in a loop first and then emitting stack probe later.

What's more, faulty during exception/interrupt seems UB because we may have disabled interrupt before entering the handlers.

IIUC in the context of x86_intrcc the 8 byte stack pointer change should never cause any stack probing to be emitted because it's smaller than the stack probe size (4096 bytes)

I have the same conclusion but for a different reason. It is not because 8 bytes < 4096. We still have chance to enter a guard page if the current RSP is near the end a a page, e.g., 4095. The reason is no matter what the size we want for stack realignment, the rest space we used to align is always in the same page unless the alignment requirement > 4096 bytes, which I don't think existing in reality.
Given that, I think should prevent to build the first STACKALLOC_W_PROBING rather than patch it later.

So it turns out that we currently emit push rax for the stack update. This instruction obviously writes to the page and would trigger faults if it landed on a guard page. We can probably get away with just hard-coding the stack update to push rax and avoid any bad interactions with stack probing that way.

x86_intrcc is used for exception/interrupt handlers on x86... The probing is done by writing to the memory to see if it will trigger a fault.

IIUC, the exception/interrupt handlers are under ring 0 mode, do we really have a chance to trigger fault? Maybe we shouldn't do probe-stack with x86_intrcc?

Exceptions handlers typically run in ring 0, but this is not a constraint by the architecture, it's also possible to have exception handlers in rings 1, 2 and 3. In general, faults happen like normal in ring 0 (with few exceptions). One relevant exception to this is that the write-bit is ignored in ring 0 if the WriteProtect bit is not set in CR0, meaning that ring 0 can write to read-only pages (but not unmapped pages). Kernels that want to disable write-protection in ring 0 and also want to use stack probing should use unmapped pages as guard pages. In practice the WriteProtect bit is often enabled e.g. by some (all?) UEFI implementations and the Linux kernel, so stack probing should work just fine.

What's more, faulty during exception/interrupt seems UB because we may have disabled interrupt before entering the handlers.

Exceptions/faults can't be disabled.

Freax13 updated this revision to Diff 520157.May 7 2023, 2:11 AM

Instead of handling multiple STACKALLOC_W_PROBING machine instructions, we now just hard-code the stack update for x86_intrcc with an error code to push rax.

pengfei accepted this revision.May 7 2023, 3:36 AM

This approach look great.

IIUC in the context of x86_intrcc the 8 byte stack pointer change should never cause any stack probing to be emitted because it's smaller than the stack probe size (4096 bytes)

I have the same conclusion but for a different reason. It is not because 8 bytes < 4096. We still have chance to enter a guard page if the current RSP is near the end a a page, e.g., 4095. The reason is no matter what the size we want for stack realignment, the rest space we used to align is always in the same page unless the alignment requirement > 4096 bytes, which I don't think existing in reality.
Given that, I think should prevent to build the first STACKALLOC_W_PROBING rather than patch it later.

So it turns out that we currently emit push rax for the stack update. This instruction obviously writes to the page and would trigger faults if it landed on a guard page. We can probably get away with just hard-coding the stack update to push rax and avoid any bad interactions with stack probing that way.

You missed my point. push rax won't trigger a page fault if it's used for stack alignment, because the current value of RSP must be 0xXXXXXXX8, the instruction cannot cross the page boundary.

x86_intrcc is used for exception/interrupt handlers on x86... The probing is done by writing to the memory to see if it will trigger a fault.

IIUC, the exception/interrupt handlers are under ring 0 mode, do we really have a chance to trigger fault? Maybe we shouldn't do probe-stack with x86_intrcc?

Exceptions handlers typically run in ring 0, but this is not a constraint by the architecture, it's also possible to have exception handlers in rings 1, 2 and 3. In general, faults happen like normal in ring 0 (with few exceptions). One relevant exception to this is that the write-bit is ignored in ring 0 if the WriteProtect bit is not set in CR0, meaning that ring 0 can write to read-only pages (but not unmapped pages). Kernels that want to disable write-protection in ring 0 and also want to use stack probing should use unmapped pages as guard pages. In practice the WriteProtect bit is often enabled e.g. by some (all?) UEFI implementations and the Linux kernel, so stack probing should work just fine.

What's more, faulty during exception/interrupt seems UB because we may have disabled interrupt before entering the handlers.

Exceptions/faults can't be disabled.

Thanks! I didn't touch kernel things for a long time. Only leaving an impression we should be cautious to avoid page fault in kernel mode. With a qucik review of IST (Interrupt Stack Table) and NMI (non-maskable interrupts), I think you are correct, we still have chance to handle these cases.

This revision is now accepted and ready to land.May 7 2023, 3:36 AM

This approach look great.

IIUC in the context of x86_intrcc the 8 byte stack pointer change should never cause any stack probing to be emitted because it's smaller than the stack probe size (4096 bytes)

I have the same conclusion but for a different reason. It is not because 8 bytes < 4096. We still have chance to enter a guard page if the current RSP is near the end a a page, e.g., 4095. The reason is no matter what the size we want for stack realignment, the rest space we used to align is always in the same page unless the alignment requirement > 4096 bytes, which I don't think existing in reality.
Given that, I think should prevent to build the first STACKALLOC_W_PROBING rather than patch it later.

So it turns out that we currently emit push rax for the stack update. This instruction obviously writes to the page and would trigger faults if it landed on a guard page. We can probably get away with just hard-coding the stack update to push rax and avoid any bad interactions with stack probing that way.

You missed my point. push rax won't trigger a page fault if it's used for stack alignment, because the current value of RSP must be 0xXXXXXXX8, the instruction cannot cross the page boundary.

RSP won't be 0xXXXXXXX8, it will be 16-byte aligned. When handling an exception with an error code the CPU aligns the stack to 16 bytes and then pushes 6 8-bytes values, which will result in the stack still being 16-byte aligned.

This approach look great.

IIUC in the context of x86_intrcc the 8 byte stack pointer change should never cause any stack probing to be emitted because it's smaller than the stack probe size (4096 bytes)

I have the same conclusion but for a different reason. It is not because 8 bytes < 4096. We still have chance to enter a guard page if the current RSP is near the end a a page, e.g., 4095. The reason is no matter what the size we want for stack realignment, the rest space we used to align is always in the same page unless the alignment requirement > 4096 bytes, which I don't think existing in reality.
Given that, I think should prevent to build the first STACKALLOC_W_PROBING rather than patch it later.

So it turns out that we currently emit push rax for the stack update. This instruction obviously writes to the page and would trigger faults if it landed on a guard page. We can probably get away with just hard-coding the stack update to push rax and avoid any bad interactions with stack probing that way.

You missed my point. push rax won't trigger a page fault if it's used for stack alignment, because the current value of RSP must be 0xXXXXXXX8, the instruction cannot cross the page boundary.

RSP won't be 0xXXXXXXX8, it will be 16-byte aligned. When handling an exception with an error code the CPU aligns the stack to 16 bytes and then pushes 6 8-bytes values, which will result in the stack still being 16-byte aligned.

Thanks! I got the point now. This is used to make RSP be 0xXXXXXXX8 again to match with normal functions. I was confused it with stack realignment. So yeah, there might be a chance to trigger faults. It looks to me we have confidence to use push in prologue without prob, so I'm inclined the reason is they are small than 4096.

Thanks for the review!

This is my first time contributing to LLVM, so I'm not sure about the workflow. Do we need to do anything else to get this merged?

Thanks for the review!

This is my first time contributing to LLVM, so I'm not sure about the workflow. Do we need to do anything else to get this merged?

I can commit it for you. Please provide your preferred name and email address.

Thanks for the review!

This is my first time contributing to LLVM, so I'm not sure about the workflow. Do we need to do anything else to get this merged?

I can commit it for you. Please provide your preferred name and email address.

Perfect, thanks! My preferred name would be "Tom Dohrmann" and my preferred email would be "erbse.13@gmx.de".

This revision was automatically updated to reflect the committed changes.