This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][Darwin] Implement stack probing for static and dynamic stack objects
AbandonedPublic

Authored by aemerson on Dec 5 2017, 2:58 PM.

Details

Summary

[AArch64][Darwin] Implement stack probing for static and dynamic stack objects.

This feature generates calls to a stack probing function on Darwin with a custom calling convention. The probe function uses some temp registers to probe the SP at 4k intervals. For static stack objects, we can emit the probe call in the function prolog. For dynamic objects like allocas, we create a call to the same function with a custom calling convention.

Diff Detail

Repository
rL LLVM

Event Timeline

aemerson created this revision.Dec 5 2017, 2:58 PM
MatzeB edited edge metadata.Dec 5 2017, 3:16 PM

It would be helpful to describe somewhere what exactly stack probing is. Maybe add more comments to emitStackProbe() and the commit message.

Is it one of those situations where you have to touch the OS pages backing stack memory one after the other (instead of accidentally jumping across pages for big arrays)?

It would be helpful to describe somewhere what exactly stack probing is. Maybe add more comments to emitStackProbe() and the commit message.

Is it one of those situations where you have to touch the OS pages backing stack memory one after the other (instead of accidentally jumping across pages for big arrays)?

Yes, that's correct. The default implementation I've created in D40857 in compiler-rt loads data in 4096 (=page size) byte intervals. This ensures that the stack guard page is always hit. I'll add some comments to the code as you suggest. However the implementation in compiler-rt has a weakly defined symbol and may be overridden at link time so I'm reticent to define what the probe function does *exactly* in the compiler.

MatzeB added a comment.Dec 5 2017, 4:59 PM

Looks generally good to me, with nitpicks below. But I'd like to hear a comment on how this is supposed to be enabled (see comment below).

lib/Target/AArch64/AArch64CallingConvention.td
348–349

From the perspective of the caller I'd call it "clobbered" instead of "used".

lib/Target/AArch64/AArch64FrameLowering.cpp
462

I think DebugLoc will only ever be a default constructed one, which you can just do inside the function instead of passing it as a parameter.

465–466

Isn't a TargetInstrInfo enough here so you can get away without a static_cast?

471–487

Generally it feels sketchy to save/extend/restore the stackframe without MachineFrameInfo knowing about it. (For example you will hide this extra stuff from the WarnStackSize functionality in PrologueEpilogueINserter. That's probably not a big deal but if we can somehow find a better way that would be nice.

"We don't bother updating SP...", this is problematic, AFAIK a unix signal can come in at any time and will use your stack frame. It will probably work on platforms with stack red zones defined that the signal handlers have to respect.

lib/Target/AArch64/AArch64ISelLowering.cpp
11040–11042

I assume this will be a correctness problem (on newer darwins?). So maybe we should not leave the decision to the frontend to set Options, but instead always make our decision based on the target triple (being darwin and newer than some version). Otherwise I can easily see non-clang frontends such as the various JITs we have around to miss the setting and fail.

You could then additionally provide a cl::opt then for cases where users want to explicitely disable the probe code generation for some reason.

test/CodeGen/AArch64/arm64-stack-probing.ll
68–69

Did you try removing all the function attributes to make the test simpler?

aemerson added inline comments.
lib/Target/AArch64/AArch64FrameLowering.cpp
471–487

I think I should update MFI to ensure hasCalls = true here and in the dynamic case. However for the stack size, since the SP is only ever adjusted by 16 bytes, and then restored after probing, I don't think it's necessary to tell MFI about it since it's purely a local change. Since we must have at least 4k of stack objects in order to be attempting to probe, the 16 bytes are only using stack space that would have been re-used later anyway.

"We don't bother updating SP...", this is problematic, AFAIK a unix signal can come in at any time and will use your stack frame. It will probably work on platforms with stack red zones defined that the signal handlers have to respect.

I think that's a mis-worded comment. I meant to say that we don't bother saving SP since we know the probe function won't clobber it. However that might not be a good idea to restrict it so much in future. The ABI isn't yet 100% finalized.

lib/Target/AArch64/AArch64ISelLowering.cpp
11040–11042

Akira suggested on D40864 that we use function attributes for this instead of a codegen option. That will still require front-ends to handle it. If we to unconditionally enable this in the backend based on target triple perhaps that would break cases where JITs etc don't automatically link in compiler-rt?

MatzeB added inline comments.Dec 6 2017, 12:59 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
471–487

Ah, I didn't realize this happens before the stackframe is even setup so it doesn't effect the maximum or the layout during the main part of the function.

lib/Target/AArch64/AArch64ISelLowering.cpp
11040–11042

He's right that for users explicitely enabling/disabling the feature a function attribute would be best.

I think the question whether we want the logic in llvm or just in the frontend depends a bit on why we need/want stack probing.

efriedma added inline comments.Dec 6 2017, 1:31 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
560

Needs a comment to explain why NumBytes < StackProbeSize doesn't need a stack probe. (On x86, the "CALL" instruction stores the return pointer onto the stack, which simplifies the logic, but there isn't any equivalent on AArch64.)

It would be helpful to describe somewhere what exactly stack probing is. Maybe add more comments to emitStackProbe() and the commit message.

Is it one of those situations where you have to touch the OS pages backing stack memory one after the other (instead of accidentally jumping across pages for big arrays)?

Yes, that's correct. The default implementation I've created in D40857 in compiler-rt loads data in 4096 (=page size) byte intervals. This ensures that the stack guard page is always hit

Since aarch64 on darwin isn't exactly new, how did expanding the stack by more than 4096 bytes work prior to this patchset? Functions allocating more than 4096 bytes of stack space isn't exactly uncommon. Will darwin start requiring this if it detects an executable built by a new enough toolchain to support it?

lib/Target/AArch64/AArch64FrameLowering.cpp
502

In the implementation in D41131 (based on the one for Windows on ARM by @compnerd in rL207615), I check for CodeModel::Large as well, and do the function call via MOVaddrEXT and BLR for that case.

mstorsjo added inline comments.Dec 12 2017, 11:55 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
478

Instead of manually saving LR here if it wasn't already saved, in D41131 I instead tried to estimate whether the stack probe would end up being necessary in determineCalleeSaves, and made sure LR is saved already there if a stack probe will be necessary.

It would be helpful to describe somewhere what exactly stack probing is. Maybe add more comments to emitStackProbe() and the commit message.

Is it one of those situations where you have to touch the OS pages backing stack memory one after the other (instead of accidentally jumping across pages for big arrays)?

Yes, that's correct. The default implementation I've created in D40857 in compiler-rt loads data in 4096 (=page size) byte intervals. This ensures that the stack guard page is always hit

Since aarch64 on darwin isn't exactly new, how did expanding the stack by more than 4096 bytes work prior to this patchset? Functions allocating more than 4096 bytes of stack space isn't exactly uncommon. Will darwin start requiring this if it detects an executable built by a new enough toolchain to support it?

The motivation for stack probing here is different to the Windows case, although the mechanism is the same. For Darwin this is a security feature to protect against stack clash based attacks.

lib/Target/AArch64/AArch64FrameLowering.cpp
478

Is that estimate always conservative? I.e. does it always at least over-estimate the stack size?

502

This function is specific to the Darwin stack probing ABI under consideration, so I don't think we'll be able to share this code. No objections to your patch though.

The motivation for stack probing here is different to the Windows case, although the mechanism is the same. For Darwin this is a security feature to protect against stack clash based attacks.

Ah, thanks for explaining!

lib/Target/AArch64/AArch64FrameLowering.cpp
478

I think it should be possible to make it conservative. I found an issue with the estimate in my patch, but I'll update it.

502

Yes, the ABI for the function call themselves is different so this particular piece of code can't be shared, but ideally all the logic for when to emit it could be shared though.

aemerson abandoned this revision.May 16 2018, 3:48 AM

I am looking to add stack probing for linux AArch64 using this diff as a basis for my work. Are there any plans to re-introduce this diff for Darwin? Is that something I should take on as well? Context is I am working on mobile apps and we want to add -fstack-clash-protection support :)

Herald added a project: Restricted Project. · View Herald TranscriptApr 19 2023, 11:30 AM

I am looking to add stack probing for linux AArch64 using this diff as a basis for my work. Are there any plans to re-introduce this diff for Darwin? Is that something I should take on as well? Context is I am working on mobile apps and we want to add -fstack-clash-protection support :)

For Darwin, we have this implemented downstream in the xcode compiler on by default. It’s downstream because originally it was closely tied to platform routines on Darwin for fast stack bounds checking. I think we can upstream that implementation now (it uses -fstack-check instead).