Page MenuHomePhabricator

[AArch64] Stack probing for function prologues
Needs ReviewPublic

Authored by ostannard on Feb 4 2021, 2:27 AM.

Details

Summary

This adds code to AArch64 function prologues to protect against stack
clash attacks by probing (writing to) the stack at regular enough
intervals to ensure that the guard page cannot be skipped over.

There are multiple probing sequences that can be emitted, depending on
the size of the stack allocation:

  • A straight-line sequence of subtracts and stores, used when the allocation size is smaller than 3 guard pages.
  • A loop allocating and probing one page size per iteration, plus a single probe to deal with the remainder, used when the allocation size is larger but still known at compile time.
  • A loop which moves the SP down to the target value held in a register, used when the allocation size is not known at compile-time, such as when allocating space for SVE values, or when over-aligning the stack. This is emitted in AArch64InstrInfo because it will also be used for dynamic allocas in a future patch.

By default, the stack guard size is 4KiB, which is a safe default as this is
the smallest possible page size for AArch64. Linux uses a 64KiB guard for
AArch64, so this can be overridden by the stack-probe-size function attribute.

Diff Detail

Event Timeline

ostannard created this revision.Feb 4 2021, 2:27 AM
ostannard requested review of this revision.Feb 4 2021, 2:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2021, 2:27 AM
lkail added a subscriber: lkail.
lkail added a reviewer: jonpa.Feb 5 2021, 2:18 AM
ostannard updated this revision to Diff 321686.Feb 5 2021, 2:37 AM
alex added a subscriber: alex.Feb 22 2021, 4:32 PM

@ostannard I've done the best I can for this review, but I have no ARM background so I can't be sure for the arch-specific parts @kristof.beyls can you please have a look?

llvm/lib/Target/AArch64/AArch64FrameLowering.h
153

Why did you set that one virtual?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16629

Should you warn at some point if another (unsupported) scheme is used?

16657

Can you give any hint about why 1024 was chosen, and why it's independent from getStackProbeSize ?

ostannard updated this revision to Diff 329317.Mar 9 2021, 6:31 AM
ostannard marked an inline comment as done.
  • Rebase
  • Check for unsupported stack probing methods
ostannard planned changes to this revision.Mar 9 2021, 9:00 AM
ostannard added inline comments.
llvm/lib/Target/AArch64/AArch64FrameLowering.h
153

It's an override of a virtual function called by the target-independent part of the prologue/epilogue pass.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16657

This matches GCC's ABI, which doesn't appear to be explicitly documented anywhere, other then comments in the GCC code: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/aarch64.c#L8190

ostannard requested review of this revision.Mar 9 2021, 9:01 AM

Thanks for this patch Oliver!
I've started reviewing, but am not very far yet.
I thought I'd share my thoughts so far already, as I'm not sure exactly when I'll be able to do a complete review.

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1351

I know this is not your original code, but I'd be tempted to go for more descriptive names than AllocateBefore and AllocateAfter. It's been a little while since I looked at the FrameLowering code, and from the names alone, I could not guess what these stack offsets were meant to represent.
Maybe I'll have a suggestion for a better name later on after I've read more of the patch.

I think it also makes it easier to read this code if there was a comment above this that describes that this part of the code handles SVE CSR and SVE locals (i.e. the SVE area of the stack frame).

1351

I'm a bit confused by this code change (probably partly due to the not really understanding what these variables are meant to represent).
Does the change on this line indicate that there was a bug here before? If so, would it be better to commit that separately?
If not, does the code change result in the meaning of AllocateBefore and AllocateAfter to change?

1382

This line of code reminds me that frame setup can also happen in non-entry basic blocks, when shrink wrapping has happened.
Assuming I checked correctly, there are no test cases in the tests in this patch verifying correct behavior in case shrink wrapping happens. I'm not sure if it's worthwhile to have such tests, but I thought I'd check if you thought about this.

1398–1400

This seems to be a key heuristic/logic on when to generate stack probes and when not.
I think it would be useful to have some light design documentation/rationale on why this is the right heuristic.
Depending on the length of that documentation, it could go either here, in the large comment at the top of this file or somewhere else (potentially as a target-independent design doc in the docs directory)? I don't think the design doc has to be particularly long, but a small doc probably will bring a lot of value for future developers needing to touch this code.

llvm/lib/Target/AArch64/AArch64FrameLowering.h
153

I think the keyword virtual is not needed when override is specified.
There are other examples of override methods in this class; they do not explicitly write virtual. So I guess it's more consistent to not use virtual here?

Hi @ostannard ,

I don't know enough about LLVM to comment on the actual code so I will only comment on the output I see generated from the testcases.

From the testcases (like static_1024) I can see that you probe when there is more than 1k of incoming stack arguments.
For GCC this is guard-page - 1k. The reasoning is that with any outgoing argument larger than 1k we would probe such that we maintain the invariant, but probing that 1k means we have a whole guard-size -1k left that we can use without probing. These sizes were chose as they cover about 99% of all programs (for a subset of all :)).

So the idea is to minimize the number of probes required.

As such for this is what GCC generates for these cases:

int probe (int x)
{
  char arr[64 * 1028];
  return arr[x];
}

probe:
        sub     sp, sp, #65536
        str     xzr, [sp, 1024]
        sub     sp, sp, #256
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 256
        add     sp, sp, 65536
        ret
int no_probe (int x)
{
  char arr[1028];
  return arr[x];
}

no_probe:
        sub     sp, sp, #1040
        add     x1, sp, 8
        ldrb    w0, [x1, w0, sxtw]
        add     sp, sp, 1040
        ret

For 64k probe sizes.

The other difference is where we probe as well. You seem to be probing at SP but we probe at SP + 1k.

This is because say you were at the 1k boundary and you allocate 1 guard size worth of incoming args you could a page. So we probe the 1k up to ensure you touch the pages as you go.

For the alloca cases,

I noticed you don't have a testcase for alloc(n) where n is a variable. Also how does it handle alloca(0)?

I also notice that you only set the CFA after the loop has finished.

In GCC we temporarily change the CFI to a different register and set it to the final expected value after the loop.
After the loop we switch it back, so we say which value it's going to be before hand.

i.e.

.LFB0:
        .cfi_startproc
        sub     x12, sp, #1310720
        .cfi_def_cfa 12, 1310720
.LPSRL0:
        sub     sp, sp, 65536
        str     xzr, [sp, 1024]
        cmp     sp, x12
        b.ne    .LPSRL0
        .cfi_def_cfa_register 31
        sub     sp, sp, #512
        .cfi_def_cfa_offset 1311232
        ldrb    w0, [sp, w0, sxtw]
        add     sp, sp, 512
        .cfi_def_cfa_offset 1310720
        add     sp, sp, 1310720
        .cfi_def_cfa_offset 0
        ret
        .cfi_endproc
.LFE0:
nagisa added a subscriber: nagisa.Sun, Mar 21, 6:50 AM

Can you please pre-commit the tests so that it is easier to see how the codegen changes? E.g. I suspect the CFI directives in prologue are already broken before your changes and so my comments aren't super relevant.

llvm/test/CodeGen/AArch64/stack-probing-64k.ll
31

This should come directly after the sub I believe. Otherwise the stack offsets would be incorrect while the str x29, [sp, #256] is being executed.

(Not sure if preexisting, though)

32

Is this right? We spilled 8 bytes, but specify the offset for the 32-bit view of the register.

71

Similarly, this should come immediately after the sub instruction, otherwise the CFI won't describe the stack accurately during the str above.

nagisa added inline comments.Sun, Mar 21, 6:58 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1382

I believe this would fail or produce incorrect results for functions that both have the no_caller_saved_registers attribute and request the inline stack probes, right?

For GCC this is guard-page - 1k. The reasoning is that with any outgoing argument larger than 1k we would probe such that we maintain the invariant, but probing that 1k means we have a whole guard-size -1k left that we can use without probing. These sizes were chose as they cover about 99% of all programs (for a subset of all :)).

As an addendum, as I mentioned above, wrt the outgoing argument not being larger than 1k. That's the buffer we guarantee. We're able to do so because during a function call the storing of LR counts as an implicit probe. So in order for this scheme to be secure you'd need to check that LLVM (like GCC) always stores LR, even for no-return leaf functions.