This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Stack probing for function prologues
Needs ReviewPublic

Authored by ostannard on Feb 4 2021, 2:27 AM.



This adds code to AArch64 function prologues to protect against stack
clash attacks by probing (writing to) the stack at regular enough
intervals to ensure that the guard page cannot be skipped over.

There are multiple probing sequences that can be emitted, depending on
the size of the stack allocation:

  • A straight-line sequence of subtracts and stores, used when the allocation size is smaller than 3 guard pages.
  • A loop allocating and probing one page size per iteration, plus a single probe to deal with the remainder, used when the allocation size is larger but still known at compile time.
  • A loop which moves the SP down to the target value held in a register, used when the allocation size is not known at compile-time, such as when allocating space for SVE values, or when over-aligning the stack. This is emitted in AArch64InstrInfo because it will also be used for dynamic allocas in a future patch.

By default, the stack guard size is 4KiB, which is a safe default as this is
the smallest possible page size for AArch64. Linux uses a 64KiB guard for
AArch64, so this can be overridden by the stack-probe-size function attribute.

Diff Detail

Event Timeline

ostannard created this revision.Feb 4 2021, 2:27 AM
ostannard requested review of this revision.Feb 4 2021, 2:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2021, 2:27 AM
lkail added a subscriber: lkail.
lkail added a reviewer: jonpa.Feb 5 2021, 2:18 AM
ostannard updated this revision to Diff 321686.Feb 5 2021, 2:37 AM
alex added a subscriber: alex.Feb 22 2021, 4:32 PM

@ostannard I've done the best I can for this review, but I have no ARM background so I can't be sure for the arch-specific parts @kristof.beyls can you please have a look?


Why did you set that one virtual?


Should you warn at some point if another (unsupported) scheme is used?


Can you give any hint about why 1024 was chosen, and why it's independent from getStackProbeSize ?

ostannard updated this revision to Diff 329317.Mar 9 2021, 6:31 AM
ostannard marked an inline comment as done.
  • Rebase
  • Check for unsupported stack probing methods
ostannard planned changes to this revision.Mar 9 2021, 9:00 AM
ostannard added inline comments.

It's an override of a virtual function called by the target-independent part of the prologue/epilogue pass.


This matches GCC's ABI, which doesn't appear to be explicitly documented anywhere, other then comments in the GCC code:

ostannard requested review of this revision.Mar 9 2021, 9:01 AM

Thanks for this patch Oliver!
I've started reviewing, but am not very far yet.
I thought I'd share my thoughts so far already, as I'm not sure exactly when I'll be able to do a complete review.


I know this is not your original code, but I'd be tempted to go for more descriptive names than AllocateBefore and AllocateAfter. It's been a little while since I looked at the FrameLowering code, and from the names alone, I could not guess what these stack offsets were meant to represent.
Maybe I'll have a suggestion for a better name later on after I've read more of the patch.

I think it also makes it easier to read this code if there was a comment above this that describes that this part of the code handles SVE CSR and SVE locals (i.e. the SVE area of the stack frame).


I'm a bit confused by this code change (probably partly due to the not really understanding what these variables are meant to represent).
Does the change on this line indicate that there was a bug here before? If so, would it be better to commit that separately?
If not, does the code change result in the meaning of AllocateBefore and AllocateAfter to change?


This line of code reminds me that frame setup can also happen in non-entry basic blocks, when shrink wrapping has happened.
Assuming I checked correctly, there are no test cases in the tests in this patch verifying correct behavior in case shrink wrapping happens. I'm not sure if it's worthwhile to have such tests, but I thought I'd check if you thought about this.


This seems to be a key heuristic/logic on when to generate stack probes and when not.
I think it would be useful to have some light design documentation/rationale on why this is the right heuristic.
Depending on the length of that documentation, it could go either here, in the large comment at the top of this file or somewhere else (potentially as a target-independent design doc in the docs directory)? I don't think the design doc has to be particularly long, but a small doc probably will bring a lot of value for future developers needing to touch this code.


I think the keyword virtual is not needed when override is specified.
There are other examples of override methods in this class; they do not explicitly write virtual. So I guess it's more consistent to not use virtual here?

Hi @ostannard ,

I don't know enough about LLVM to comment on the actual code so I will only comment on the output I see generated from the testcases.

From the testcases (like static_1024) I can see that you probe when there is more than 1k of incoming stack arguments.
For GCC this is guard-page - 1k. The reasoning is that with any outgoing argument larger than 1k we would probe such that we maintain the invariant, but probing that 1k means we have a whole guard-size -1k left that we can use without probing. These sizes were chose as they cover about 99% of all programs (for a subset of all :)).

So the idea is to minimize the number of probes required.

As such for this is what GCC generates for these cases:

int probe (int x)
  char arr[64 * 1028];
  return arr[x];

        sub     sp, sp, #65536
        str     xzr, [sp, 1024]
        sub     sp, sp, #256
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 256
        add     sp, sp, 65536
int no_probe (int x)
  char arr[1028];
  return arr[x];

        sub     sp, sp, #1040
        add     x1, sp, 8
        ldrb    w0, [x1, w0, sxtw]
        add     sp, sp, 1040

For 64k probe sizes.

The other difference is where we probe as well. You seem to be probing at SP but we probe at SP + 1k.

This is because say you were at the 1k boundary and you allocate 1 guard size worth of incoming args you could a page. So we probe the 1k up to ensure you touch the pages as you go.

For the alloca cases,

I noticed you don't have a testcase for alloc(n) where n is a variable. Also how does it handle alloca(0)?

I also notice that you only set the CFA after the loop has finished.

In GCC we temporarily change the CFI to a different register and set it to the final expected value after the loop.
After the loop we switch it back, so we say which value it's going to be before hand.


        sub     x12, sp, #1310720
        .cfi_def_cfa 12, 1310720
        sub     sp, sp, 65536
        str     xzr, [sp, 1024]
        cmp     sp, x12    .LPSRL0
        .cfi_def_cfa_register 31
        sub     sp, sp, #512
        .cfi_def_cfa_offset 1311232
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 512
        .cfi_def_cfa_offset 1310720
        add     sp, sp, 1310720
        .cfi_def_cfa_offset 0
nagisa added a subscriber: nagisa.Mar 21 2021, 6:50 AM

Can you please pre-commit the tests so that it is easier to see how the codegen changes? E.g. I suspect the CFI directives in prologue are already broken before your changes and so my comments aren't super relevant.


This should come directly after the sub I believe. Otherwise the stack offsets would be incorrect while the str x29, [sp, #256] is being executed.

(Not sure if preexisting, though)


Is this right? We spilled 8 bytes, but specify the offset for the 32-bit view of the register.


Similarly, this should come immediately after the sub instruction, otherwise the CFI won't describe the stack accurately during the str above.

nagisa added inline comments.Mar 21 2021, 6:58 AM

I believe this would fail or produce incorrect results for functions that both have the no_caller_saved_registers attribute and request the inline stack probes, right?

For GCC this is guard-page - 1k. The reasoning is that with any outgoing argument larger than 1k we would probe such that we maintain the invariant, but probing that 1k means we have a whole guard-size -1k left that we can use without probing. These sizes were chose as they cover about 99% of all programs (for a subset of all :)).

As an addendum, as I mentioned above, wrt the outgoing argument not being larger than 1k. That's the buffer we guarantee. We're able to do so because during a function call the storing of LR counts as an implicit probe. So in order for this scheme to be secure you'd need to check that LLVM (like GCC) always stores LR, even for no-return leaf functions.

emaste added a subscriber: emaste.Aug 23 2021, 12:02 PM
cuviper added inline comments.

This line of code reminds me that frame setup can also happen in non-entry basic blocks, when shrink wrapping has happened.

FWIW, TargetFrameLowering::canUseAsPrologue can be used to block that if needed.
(I just used that to fix an X86 bug where stack probes clobbered EFLAGS, D134494.)

Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2022, 9:48 AM
Matt added a subscriber: Matt.Sep 27 2022, 11:36 AM

What's the status of this patch? Were the review comments addressed, or is this still waiting for someone to address them?

chill added a subscriber: chill.Jan 3 2023, 2:52 AM

What's the status of this patch? Were the review comments addressed, or is this still waiting for someone to address them?

Current plans are I will be picking this up.

zatrazz added a subscriber: zatrazz.Apr 7 2023, 9:44 AM

I am resuming Oliver's work, I will address the current reviewers remarks and send a newer version.

I've spent some time myself working rebasing this stack and getting it green. Currently running into a runtime crash in libunwind where parseFDEInstructions is being called over and over and exhausting stack memory. Current best guess is that perhaps the stack probing is messing up the call frame information in such a way that causes a loop in the libunwind. If anyone has any suggestions of where to look or how to debug this I would appreciate it :)

I can think of the following possibilities:

  • The stack is getting adjusted by the wrong amount (i.e. the code with stack probes is subtracting a different total offset from the stack pointer compared to the code without probes).
  • The code is trying to asynchronously unwind the stack (i.e. unwind while the stack probing loop is running), and there isn't a frame pointer. This patch might need to be reworked a bit to support async unuwind.
chill added a comment.Aug 16 2023, 7:39 AM

I've started a new review of this patch here:
I believe I've addressed most of the comments in this review (except one?).


I've added more descriptive names in the new patch.

As far as I can tell, there's no bug, just the code is a little convoluted. I've rearranged it is the new patch (could just as well be a separate NFC/refactoring patch).


In the new patch I've added a check for an available scratch register in canUseAsPrologue.


TBH, the main reason for this logic is compatibility with GCC (let's call it an informal ABI).
GCC *expects* at most 1024 unprobed bytes above the stack, hence if we allocate more than that we need to issue probe.

The choice of the number 1024 comes from an analysis of SPEC frame size distributions and was chosen such that a great number of stack frames (of size < 3k or so, for 4k pages) won't need to perform a probe at all (for GCC).

As far as I can understand, these considerations are not really relevant for LLVM, since it stores x29 for frames, greater than 240 bytes anyway.


Keyword virtual removed.


This patch predates the support for asynchronous unwinding. In the new patch it's handled correctly, I believe.


Yes, it's correct, the DWARF encoding of register names does not encode separately xN and wN.


Also fixed.