This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Combine callee-save and local stack SP adjustment instructions.

Authored by gberry on Mar 30 2016, 1:01 PM.



If a function needs to allocate both callee-save stack memory and local
stack memory, we currently decrement/increment the SP in two steps:
first for the callee-save area, and then for the local stack area. This
changes the code to allocate them both at once at the very beginning/end
of the function. This has a few benefits:

  1. there is one fewer sub/add micro-op in the prologue/epilogue
  1. stack usage is reduced in cases where both the callee-save area and

local stack area size are equal to 8 mod 16

  1. the stack adjustment instructions act as a scheduling barrier, so

moving them to the very beginning/end of the function increases post-RA
scheduler's ability to move instructions (that only depend on argument
registers) before any of the callee-save stores

This change can cause an increase in instructions if the original local
stack SP decrement could be folded into the first store to the stack.
This occurs when the first local stack store is to stack offset 0. In
this case we are trading off one more sub instruction for one fewer sub
micro-op (along with benefits (2) and (3) above).

Diff Detail


Event Timeline

gberry updated this revision to Diff 52109.Mar 30 2016, 1:01 PM
gberry retitled this revision from to [AArch64] Combine callee-save and local stack SP adjustment instructions..
gberry updated this object.
gberry added a reviewer: t.p.northover.
gberry added a subscriber: llvm-commits.

Hi Geoff,

I'm not familiar with this part of the code, so I'll leave the full review for Tim, but I think the idea is a good one, and the prologue/epilogue changes in the test look mostly good. Some inline comments, too.


695 ↗(On Diff #52109)

Can't you just make:

if (ForceSP)
  UseFP = false;


814 ↗(On Diff #52109)

Would it be possible to get here and (size % 16 != 0)? If so, keeping the assert would be good.

gberry added inline comments.Apr 15 2016, 11:21 AM
695 ↗(On Diff #52109)

I don't think that works in the case where isFixed is true.

814 ↗(On Diff #52109)

These asserts have just moved to the BumpSP == true part of spillCalleeSavedRegisters/restoreCalleeSavedRegisters

rengolin added inline comments.Apr 15 2016, 12:40 PM
695 ↗(On Diff #53924)

Right, than maybe moving the assert inside "if (isFixed)" would guide people when it does hit?

823–824 ↗(On Diff #53924)


gberry added a comment.May 2 2016, 9:40 AM

Ping (#3)?

t.p.northover edited edge metadata.May 2 2016, 1:56 PM
t.p.northover added a subscriber: t.p.northover.

Hi Geoff,

Sorry it took so long to get back to you, I needed to do some serious
thinking about this one. Mostly it's OK, but I think this is

unsigned EstNonCSStackSize = MFI->estimateStackSize(MF);
unsigned StackSize = EstNonCSStackSize + 8 * NumRegsSpilled;
if (EstNonCSStackSize != 0 &&

 // 512 is the maximum immediate for stp/ldp that will be used for
// callee-save save/restores
 StackSize < 512 &&

Using an estimate here leaves us open for a change later on, which is
catastrophic if it pushes the amount over the 512-byte limit. It's
rare that the estimate differs from the actual amount by more than 8
(the correctness threshold since the STP offset is actually where you
want the first register to go) but it can happen.

I've attached an example where I've tweaked this discrepancy into a
real failure.


gberry added a comment.May 2 2016, 2:02 PM

Hi Tim,

Thanks for the review, it was worth the wait :)
I recall thinking about this issue and convincing myself that it wasn't a problem, but I clearly don't know all the dark corners of this code. Do you think it is reasonable to try to enumerate the cases where this estimation might go wrong (even conservatively) to avoid this problem?

gberry added a comment.May 3 2016, 8:17 AM

Hi Tim,

I believe the below change fixes the problem you identified. I'll upload a new version with this fix. I'm still looking into your suggestion to refactor the code to avoid the stack size estimation in determinCalleeSaves() by doing the stack bump combining in emitPrologue/emitEpilogue instead to isolate the logic a bit more.

  • a/lib/Target/AArch64/AArch64FrameLowering.cpp

+++ b/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -1116,7 +1116,8 @@ void AArch64FrameLowering::determineCalleeSaves(MachineFunction &MF,

// Check to see if we can combine the callee-save and local stack pointer
// adjustment into a single decrement/increment.
unsigned EstNonCSStackSize = MFI->estimateStackSize(MF);
  • unsigned StackSize = EstNonCSStackSize + 8 * NumRegsSpilled;

+ unsigned StackSize =
+ alignTo(EstNonCSStackSize + 8 * NumRegsSpilled, getStackAlignment());

if (EstNonCSStackSize != 0 &&
    // 512 is the maximum immediate for stp/ldp that will be used for
    // callee-save save/restores
gberry updated this revision to Diff 56003.May 3 2016, 8:19 AM
gberry edited edge metadata.

Update to fix bug Tim noticed in review

I don't think that fixes it. I'm seeing discrepancies of up to 32 in the estimate even with that fix in place. Unfortunately this one's not playing well with llc so I don't have a reduced example, but I can help you reproduce it if you want (it's also from povray in the test-suite-externals).

Really, I think this shows just how misguided it is to try and use estimateStackSize in this way: neither of us really understands what it's doing and even if *we* did, future readers won't.




Thanks. Don't worry about debugging that failure further, I mostly just wanted to capture that fix in case the alternate strategy of doing the SP bump combining in emitPrologue/emitEpilogue didn't work out. If that turns out to be the case, we can revisit, but hopefully it will work out and be an easier to understand change.

gberry updated this revision to Diff 56174.May 4 2016, 10:46 AM

Update change to take the approach suggested by Tim: wait until
emitPrologue/emitEpilogue to combine SP stack bumps since we know the
final stack size then.

t.p.northover added inline comments.May 4 2016, 10:59 AM
336 ↗(On Diff #56174)

I think we're duplicating effort here. If the spill/restore functions always produce non-incremented versions we have fewer cases to cover here and those functions will be simpler too.

412 ↗(On Diff #56174)

According to AAPCS, SP must equal 0 (mod 16) whenever it's used as the base of a load/store. Given the risk that there are adjacent loads/stores, LocalStackSize should probably satisfy that too. There seems to be no benefit in allowing 0x...8 since a following ADD/SUB would be needed anyway.

gberry updated this revision to Diff 56176.May 4 2016, 11:01 AM

Add back assert to emitFrameOffset that I accidentally removed from last update.

gberry added inline comments.May 4 2016, 11:15 AM
336 ↗(On Diff #56176)

Yes, that's true with the following caveat. We would be relying on AArch64LoadStoreOptimizer to combine the first/last store/load and sp decrement/increment in the case where we decide not to combine the SP bumps (e.g. if the local stack size is 0). I think that is generally okay, though when I have tried simplifying the code in this way in the past I did run into one phase ordering issue. If I recall correctly, occasionally tail duplication of the return block would no longer get done because it appears to the tail merger that the ret block had 3 instructions (add, ldp, ret) instead of 2 (ldp post inc, ret), which caused it to make a different decision. This was due to the fact that tail duplication was being done before AArch64LoadStoreOptimizer. Do you think this simplifcation of the code is worth it in light of this extra complication? I will also say I didn't see any performance degradations because of the tail duplication difference, I just noticed it looking over code diffs.

412 ↗(On Diff #56176)

I agree, though I'm not sure this is the place to check for that. I mainly added this assert because of the (LocalStackSize / 8) on the next line.
I can add asserts that the SP bumps are always 16-byte aligned when they are generated. Does that seem reasonable to you?

t.p.northover added inline comments.May 4 2016, 11:24 AM
336 ↗(On Diff #56176)

I'm not opposed to using the load/store optimizer for that if it works (and not particularly worried about the tail-duplication change -- if it's ever important we can tweak its heuristic).

But unless I'm missing something performing that transformation here would still be simpler than the proposed patch. We'd just have a makeMemOpPrePost(MBBI, TotalOffset) call if emitPrologue/emitEpilogue decide it would be useful, wouldn't we?

412 ↗(On Diff #56176)

Yep, that sounds fine.

gberry added inline comments.May 4 2016, 11:36 AM
336 ↗(On Diff #56176)

Ah, I think I misunderstood your original comment. What you are proposing is:

  • have spillCalleeSavedRegisters/restoreCalleeSavedRegisters emit stores/loads without any pre/post decrement/increment
  • have emitPrologue/emitEpilogue either insert an add/sub or change the first callee-save store/load into a pre/post decrement/increment depending on which is better

If that is correct, I agree that is cleaner. If not, let's try again :)

412 ↗(On Diff #56176)

Great, I'll do that.

t.p.northover added inline comments.May 4 2016, 11:39 AM
336 ↗(On Diff #56176)

Yep, that's the one. Sorry about the confusion.

gberry updated this revision to Diff 56203.May 4 2016, 2:11 PM

Update to simplify code based on Tim's feedback

t.p.northover added inline comments.May 4 2016, 3:21 PM
324 ↗(On Diff #56203)

Perhaps use MBBI->isStore()? No obvious way to infer NewIsUnscaled, unfortunately.

395–397 ↗(On Diff #56203)

I think this function could just be

static void fixupCalleeSaveRestoreStackOffsetIncAndDec(MachineInstr *MI, unsigned LocalStackSize) {
  MachineOperand &MO = MI->getOperand(MI->getNumExplicitOperands() - 1);
  assert(LocalStackSize % 8 == 0);
  MO.setImm(MO.getImm() + LocalStackSize/8);

But you probably made the right call in convertCalleeSaveRestoreToSPPrePostIncDec: the pre/post versions have extra operands that would get messy.

391–394 ↗(On Diff #56203)

Is this left-over from when you tried to convert spill/restoreCalleeSavedRegisters to use "addFrameIndex"? I can't see the need for it in the present scheme (no FrameSetup/FrameDestroy instructions seem to be created with a FrameIndex operand).

gberry added a comment.May 5 2016, 7:13 AM


Thanks, we're almost there :)

324 ↗(On Diff #56203)

Yeah, that seems better.

395–397 ↗(On Diff #56203)

Yep, this was just my lack of grep skills/knowledge of the MI API. I looked for a setOperand to do just what you are suggesting, without realizing I could just change the immediate value.

391–394 ↗(On Diff #56203)

Yep, I was right to get rid of it the first time :)

gberry updated this revision to Diff 56291.May 5 2016, 8:14 AM

Update to address Tim's latest comments

mcrosier added inline comments.May 5 2016, 10:21 AM
354 ↗(On Diff #56291)

The default case goes at the top of the switch.

401 ↗(On Diff #56291)

default at top.

t.p.northover accepted this revision.May 5 2016, 1:44 PM
t.p.northover edited edge metadata.

Hi Geoff,

I think this looks basically good now (just trivia left), thanks for all the updates and feel free to commit without uploading another version here.


402 ↗(On Diff #56291)

llvm_unreachable is probably better here. Though I'd probably write the whole thing as an assertion (it's using up a lot of vertical space for something quite trivial).

This revision is now accepted and ready to land.May 5 2016, 1:44 PM
This revision was automatically updated to reflect the committed changes.