This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix aligning the stack after calling __chkstk
ClosedPublic

Authored by mstorsjo on Oct 11 2022, 9:03 AM.

Details

Summary

Whenever a call to __chkstk was made, the frame lowering previously
omitted the aligning (as NumBytes was reset to zero before doing
alignment).

This fixes https://github.com/llvm/llvm-project/issues/56182.

Diff Detail

Event Timeline

mstorsjo created this revision.Oct 11 2022, 9:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2022, 9:03 AM
mstorsjo requested review of this revision.Oct 11 2022, 9:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2022, 9:03 AM

In general, I think we need to pass the number of bytes required for realignment to __chkstk, instead of directly realigning the stack. As discussed on the bug.

In general, I think we need to pass the number of bytes required for realignment to __chkstk, instead of directly realigning the stack. As discussed on the bug.

Oh, right, I didn't realize that we could fix that part of the requirement that easily.

In general, I think we need to pass the number of bytes required for realignment to __chkstk, instead of directly realigning the stack. As discussed on the bug.

Oh, right, I didn't realize that we could fix that part of the requirement that easily.

Actually, on second thought, either this requires a bit more changes than that (deviating more from the pattern for how __chkstk is used) or it doesn't actually fulfill the intent you have here (or I'm still overlooking something).

Currently, the call to __chkstk looks like this:

mov x15, #(NumBytes/16)
bl __chkstk
sub sp, sp, x15, lsl #4

If we'd include the alignment in the probing, it'd look like this:

mov x15, #((NumBytes+Align)/16)
bl __chkstk
sub x15, sp, x15, lsl #4
and sp, x15, #(AlignMask)

Here, the call to __chkstk does cover the extra align amount, but we also decrement the stack pointer by that amount, so when doing the alignment masking, we can still decrement the stack pointer further past what's been probed. To avoid this, I guess we'd need to add an sub x15, x15, #(Align/16) after the call, before decrementing the stack. Is that how you meant it?

I was thinking you could do something like this:

mov x15, #(NumBytes/16) ; Stack allocation size
mov x9, sp
and x9, x9, #AlignMask ; compute aligned stack
sub x9, sp, x9 ; compute number of bytes required to align stack
add x15, x15, x9, lsr #4 ; add bytes to chkstk input
bl __chkstk
sub sp, sp, x15, lsl #4

There's probably some slightly more efficient way to write this, but that's the idea.

I was thinking you could do something like this:

mov x15, #(NumBytes/16) ; Stack allocation size
mov x9, sp
and x9, x9, #AlignMask ; compute aligned stack
sub x9, sp, x9 ; compute number of bytes required to align stack
add x15, x15, x9, lsr #4 ; add bytes to chkstk input
bl __chkstk
sub sp, sp, x15, lsl #4

There's probably some slightly more efficient way to write this, but that's the idea.

Hmm, ok, I see. I guess that'd work, but it feels a bit roundabout IMO.

It looks like MSVC does something like this:

mov x15, #(NumBytes+Align)/16
bl __chkstk
sub sp, sp, x15, lsl #4
mov x8, sp
add x9, x8, #(Align-1)
and x0, x9, #AlignMask

(The mov x8, sp feels superfluous here?) It doesn't align the stack pointer itself, but I think we should be able to do the same (while wasting fewer registers) by doing the last and into sp.

I think I'd prefer something like this:

mov x15, #(NumBytes+Align)/16
bl __chkstk
sub sp, sp, x15, lsl #4
add x9, sp, #(Align -1)
and sp, x9, #AlignMask

After staring at my code sequence a bit more, it simplifies to:

mov x15, #(NumBytes/16)
mov x9, sp
ubfx x9, x9, #4, #AlignBits
add x15, x15, x9
bl __chkstk
sub sp, sp, x15, lsl #4

Your suggested sequence also works, I guess, but it feels a little weird to overallocate, then deallocate the extra memory.

MSVC doesn't actually realign the stack in the sense LLVM does; it just overallocates stack memory, then uses masking to align the addresses of variables.

After staring at my code sequence a bit more, it simplifies to:

mov x15, #(NumBytes/16)
mov x9, sp
ubfx x9, x9, #4, #AlignBits
add x15, x15, x9
bl __chkstk
sub sp, sp, x15, lsl #4

Your suggested sequence also works, I guess, but it feels a little weird to overallocate, then deallocate the extra memory.

Yeah, that's a bit clumsy (but the amount of extra instructions is quite small).

I find your suggestion very neat, but it assumes that NumBytes itself is aligned to the alignment, otherwise we break the realignment we calculated from the actual stack pointer. In practice it isn't - the total allocation of the stack frame (including CSRs) is aligned, but NumBytes isn't. And trying to calculate how much is spent on the other bits and adjusting it here feels very brittle too. To counter that, we'd have to overallocate the chkstk amount to make sure that it is aligned on its own, and that's not very pretty either.

mstorsjo updated this revision to Diff 467100.Oct 12 2022, 5:22 AM

Updated to overallocate the chkstk amount based on the alignment, and then reducing the overallocation to align the stack pointer in the end.

I'm not very fond of this solution (your version would be neater), but I'm updating the patch to bring it a bit forward for discussion at least.

Oh, to account for unaligned NumBytes, my solution would have to be something more like:

sub x15, sp, #NumBytes
and x15, x15, #AlignMask
sub x15, sp, x15
lsr x15, x15, lsr #4
bl __chkstk
sub sp, sp, x15, lsl #4

(If NumBytes is large, add 1 or 2 instructions to materialize it.) Maybe emitting extra instructions to avoid overallocating isn't worth it, though...

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1654

This is allocating more than necessary? We only really need to add RealignmentBytes ? RealignmentBytes - 16 : 0 bytes.

1760

We have a helper for this: AArch64_AM::encodeLogicalImmediate.

mstorsjo updated this revision to Diff 467262.Oct 12 2022, 2:34 PM

Reduced the amount of overallocation; moving back by (RealignmentBytes-16) aka RealignmentPadding in the new revision instead of (RealignmentBytes-1) when aligning the pointer at the end.

Using encodeLogicalImmediate to encode the immediate, added a missed setStackRealigned(true), added an assert that NeedsWinCFI is false at this point.

This revision is now accepted and ready to land.Oct 12 2022, 3:17 PM
This revision was landed with ongoing or failed builds.Oct 12 2022, 11:54 PM
This revision was automatically updated to reflect the committed changes.
mstorsjo reopened this revision.Oct 14 2022, 2:11 AM
This revision is now accepted and ready to land.Oct 14 2022, 2:11 AM
mstorsjo updated this revision to Diff 467717.Oct 14 2022, 2:16 AM

Added a testcase for one case where things broke in practice with the previous form.

Since the previous version, I moved the setting of NeedsRealignment further down in the function. Previously I initialized this variable early in the function, before the first call to windowsRequiresStackProbe. However, NumBytes may be nonzero at this point (making NeedsRealignment true), even if NumBytes would be zero at the point when realignment is done later in the function, prior to the patch.

We don't need to include RealignmentPadding in the first initial check with windowsRequiresStackProbe, since if we're actually going to realign the stack, we do have a stack frame anyway.

Additionally, don't set RealignmentPadding to a nonzero value, if the requested alignment is less than 16. In the added testcase, MFI.getMaxAlign() is 8. (Curiously, for non-Windows OSes, this input actually does result in generating code for ANDing the stack pointer to align it to 8 bytes though.)

mstorsjo requested review of this revision.Oct 14 2022, 4:54 AM
efriedma added inline comments.Oct 14 2022, 10:42 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1655

I'm not sure it actually makes sense to check NumBytes here. If we're forcing realignment with an attribute, we want to make sure it happens even if there aren't any local variables.

1658

I suggested subtracting "16" here on the assumption the stack is actually 16-byte aligned on entry. If we're not assuming the incoming stack is 16-byte aligned, we can't do that. (Theoretically, we could subtract 1, but we can't actually allocate in increments of less than 16.)

Normally, we shouldn't be using realignment in the first place if we're only trying to align the stack to 16 bytes, but I guess the "stackrealign" attribute forces us to realign the stack even if we think it's already aligned. (By default, the CPU faults if the alignment of sp is less than 16, so that seems unlikely, but I guess there could be some environment which disables that flag.)

mstorsjo added inline comments.Oct 14 2022, 11:55 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1655

Right, I guess that’s true. However, in the current codepath for non-windows, all alignment is being done within an if (NumBytes) condition, so unless there’s further local allocations, it actually won’t realign the stack.

Also, the alignment to apply on the stack (from -mstack-alignment) doesn’t seem to be taken into account for actually aligning the stack, only the alignment of the objects in the current stack frame. So this does seem like a preexisting bug.

It seems like the x86 backend does the right thing here (aligning to the maximum of the objects in the stack frame and the stack’s own alignment, while this code here only looks at the objects.

1658

I don’t really think we need to worry about the case when the incoming stack isn’t aligned to 16 bytes, I guess thus is just a degenerate case - while it’s only expected to do anything if the requested alignment is higher than that.

efriedma added inline comments.Oct 14 2022, 1:07 PM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1655

Not sure I understand the different sources of alignment, but we can leave that for a followup in any case.

1658

In that case, should we just pretend the user didn't request stack realignment if they don't request alignment higher than 16?

mstorsjo added inline comments.Oct 14 2022, 1:53 PM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1655

The two different cases are these:

$ cat align1.c
void other(char *buf);

void func(void) {
    char buf[100] __attribute__((aligned(64)));
    other(buf);
}
$ clang -target aarch64-linux-gnu -S -o - align1.c -O2
$ cat align2.c 
void other(void);

void func(void) {
    other();
}
$ clang -target aarch64-linux-gnu -S -o - -O2 align2.c -mstack-alignment=64 -mstackrealign

For the first case, we want to allocate a 64-byte aligned object on the stack (which implicitly aligns the stack entirely to 64 bytes at this point). For the second case, we want to realign the stack to 64 bytes (even though the function itself doesn't allocate anything) before calling another function, so that the second function can assume that the incoming stack is aligned to 64 bytes and not bother with more similar realignments, while maintaining 64 byte alignment.

The second case isn't taken into account at all by the aarch64 target; it optimizes the call into a tail call, and even if it tweaked so that it doesn't do that, it still doesn't align the stack to 64 bytes. For x86_64, both cases are handled as intended.

(I guess there's seldom need for much more alignment than 16 bytes on aarch64 anyway, so there probably hasn't been much need or desire to fix this yet. I guess SVE might benefit from it at some point though?)

So since the aarch64 target essentially doesn't really handle the explicit stack realignment requests properly right now, I think the patch in the current form is as good as it gets on its own - handling the alignment of stack objects correctly, and no longer blowing up on the realignment requests (but doing as little as we do for other targets about it).

1658

I guess we could. Currently this patch only needs to care about whether the alignment is higher than 16 for the windows codepath; we could extend that into the rest of the cases, but I'd leave that to a separate patch.

efriedma accepted this revision.Oct 14 2022, 1:59 PM

This seems like progress in the right direction; LGTM

This revision is now accepted and ready to land.Oct 14 2022, 1:59 PM
This revision was landed with ongoing or failed builds.Oct 14 2022, 2:43 PM
This revision was automatically updated to reflect the committed changes.