This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Stack probing for function prologues
AbandonedPublic

Authored by chill on Aug 16 2023, 7:38 AM.

Details

Summary

This adds code to AArch64 function prologues to protect against stack
clash attacks by probing (writing to) the stack at regular enough
intervals to ensure that the guard page cannot be skipped over.

The patch depends on and maintains the following invariants:

  • Upon function entry the caller guarantees that it has probed the stack (e.g. performed a store) at some address [sp, #N], where`0 <= N <= 1024`. This invariant comes from a requirement for compatibility with GCC.
  • Any address range in the allocated stack, no smaller than stack-probe-size bytes contains at least one probe
  • At any time the stack pointer is above or in the guard page
  • Probes are performed in descreasing address order

The stack-probe-size is a function attribute that can be set by
a platform to correspond to the guard page size.

By default, the stack probe size is 4KiB, which is a safe default as
this is the smallest possible page size for AArch64. Linux uses a
64KiB guard for AArch64, so this can be overridden by the
stack-probe-size function attribute.

For small frames without a frame pointer (<= 240 bytes), no probes are
needed.

For larger frame sizes, LLVM always stores x29 to the stack. This
serves as an implicit stack probe. Thus, while allocating stack
objects the compiler assumes that the stack has been probed at [sp].

There are multiple probing sequences that can be emitted, depending on
the size of the stack allocation:

  • A straight-line sequence of subtracts and stores, used when the allocation size is smaller than 5 guard pages.
  • A loop allocating and probing one page size per iteration, plus at most a single probe to deal with the remainder, used when the allocation size is larger but still known at compile time.
  • A loop which moves the SP down to the target value held in a register (or a loop, moving a scratch register to the target value help in SP), used when the allocation size is not known at compile-time, such as when allocating space for SVE values, or when over-aligning the stack. This is emitted in AArch64InstrInfo because it will also be used for dynamic allocas in a future patch.
  • A single probe where the amount of stack adjustment is unknown, but is known to be less than or equal to a page size.

Co-authored-by: Oliver Stannard <oliver.stannard@linaro.org>

Diff Detail

Event Timeline

chill created this revision.Aug 16 2023, 7:38 AM
chill requested review of this revision.Aug 16 2023, 7:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 16 2023, 7:38 AM
chill added a comment.Aug 16 2023, 7:47 AM

Continuation of https://reviews.llvm.org/D96004

I've added emission of asynchronous unwind directives and some other tweaks here and there.

There're is a known issue with the CFI fixup pass - that pass assumes a function prologue is contained in a single basic block. With stack probes this is no longer true
as we may create loop(s) while still setting up the frame and issuing CFI directives. I'll think about solving it.

@tnfchris I've read your comments in the original review and I think nevertheless this patch is compatible with GCC even though it does not
behave exactly the same. Main point is that since we store x29 for frames > 240 bytes, we can assume and maintain for the duration of the function
prologue that the stack has been probed at [sp].

Can we integrate Windows and non-Windows stack probing, to the extent that makes sense? (I mean, the actual code sequences and register usage are obviously going to be different, but using the same overall logic for when to trigger probing and the amount to probe would make the code easier to understand.)

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1888

This comment doesn't sound right. Probing isn't just about making sure everything is probed before we leave the prologue; we also need to make sure probes are triggered in the correct order.

chill added inline comments.Aug 17 2023, 12:18 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1888

Yes, I'm planning to look at it.

chill updated this revision to Diff 551214.Aug 17 2023, 11:47 AM

In this update:

  • save SVE CSRs in descending address order, so they serve as implicit stack probes.
chill marked an inline comment as done.Aug 17 2023, 11:47 AM

For the issue with the CFIFixup pass where we are trying to figure out the place where the prologue is done, so we can issue
the first .cfi_remember_state directive, I'm planning mark the spot with a .cfi_escape X.

chill updated this revision to Diff 551478.Aug 18 2023, 5:51 AM

Update:

  • added a few missing MachineInstr::FrameSetup flags.

save SVE CSRs in descending address order, so they serve as implicit stack probes.

This only works if the stores happen immediately after stack pointer is decremented, which I don't think is what your code does. And even then I'm not sure it's completely reliable; if an asynchronous signal is delivered to the thread between the subtraction and saving the CSRs, you can still end up jumping over the guard page.

I'd suggest as an initial implementation, just probe SVE CSRs. We can consider optimizations in a followup.

Can we integrate Windows and non-Windows stack probing, to the extent that makes sense? (I mean, the actual code sequences and register usage are obviously going to be different, but using the same overall logic for when to trigger probing and the amount to probe would make the code easier to understand.)

Did you have a reply to this?

zatrazz added inline comments.Aug 24 2023, 11:18 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1899

I am not sure if using getStackProbeMaxUnprobedStack as the threshold for emiting static probes is the more optimized value. GCC uses getStackProbeSize - 1024, since with defined invariants (the stack has been probed at [sp, #N] with 0 <= N <= 1024) rest of the guard page can be used without further probing. Afaiu the 1024 was chosen as value to minimize as possible the required probes based on common function stack requirements.

4235

Should we follow GCC as well here (it defines the the unroll value as 5)? And does it worth to add a tunable for this?

llvm/lib/Target/AArch64/AArch64ISelLowering.h
962

Does it need to be a function? This value should not vary based on function, so maybe it would be better to move a constant.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
8532

I don't think this does what the comments above specifies, since on static_4096 testcase the produced code does 'str xzr, [sp]' where it should be 'str xzr, [sp, #1024]'. I think ou should 'TLI->getStackProbeMaxUnprobedStack(MF) /8'.

chill added inline comments.Sep 13 2023, 7:32 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1899

tl;dr We are concerned here with the unprobed stack we leave for the callees.

GCC uses a different frame layout where the unprobed stack space at the moment of the call is adjacent to the space for local variables and the frame record is on top of the stack.

In LLVM we can have:

  • leaf function and local frame size <= 240 bytes. We don't issue probes in this case (NeedsStackProbe above is set to false) and our SP cannot skip over the guard page (1024 unprobed + 240 locals < 4k).
  • non-leaf function (including one calling noreturn functions) or local frame size >= 256 bytes. In this case LLVM saves at least x29, which serves as an implicit probe at [sp]. Allocation of space for SVE callee-saved registers and local SVE objects similarly end with probes at [sp]. Then we come here to allocate space for fixed-size locals. We cannot allocate more than 1024 bytes without a probe, because that area is adjacent to callee stack frames and we may perform a call without any access to the area, thus violating the invariant.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
8532

I don't think this does what the comments above specifies, since on static_4096 testcase the produced code does 'str xzr, [sp]'

The comment is incorrect here, I'll fix it.

where it should be 'str xzr, [sp, #1024]'

It *could* be, in the particular case of static_4096, but it's not necessary.

Issuing probes at [sp, #1024] is problematic, because we don't always know how many bytes we've just allocated and we may be writing to some other area. Also our own internal invariant is that while constructing the frame the stack is probed at [sp], because we start with that invariant anyway (by saving x29).

chill updated this revision to Diff 556669.Sep 13 2023, 8:28 AM
chill edited the summary of this revision. (Show Details)
chill marked 4 inline comments as done.
chill added a comment.Sep 13 2023, 8:31 AM

save SVE CSRs in descending address order, so they serve as implicit stack probes.

This only works if the stores happen immediately after stack pointer is decremented, which I don't think is what your code does. And even then I'm not sure it's completely reliable; if an asynchronous signal is delivered to the thread between the subtraction and saving the CSRs, you can still end up jumping over the guard page.

I'd suggest as an initial implementation, just probe SVE CSRs. We can consider optimizations in a followup.

Done. Now we maintain SP is never below the guard page.

Can we integrate Windows and non-Windows stack probing, to the extent that makes sense? (I mean, the actual code sequences and register usage are obviously going to be different, but using the same overall logic for when to trigger probing and the amount to probe would make the code easier to understand.)

Did you have a reply to this?

No, not yet.

chill updated this revision to Diff 556681.Sep 13 2023, 8:56 AM
chill abandoned this revision.Oct 23 2023, 10:41 AM

This is currently active on GitHub, no outstanding discussions on Phabricator.