This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Reduce number of callee-save save/restores.
ClosedPublic

Authored by gberry on Feb 8 2016, 2:37 PM.

Details

Summary

Before this change, callee-save registers would be rounded up to even
pairs of GPRs and FPRs. This change eliminates these extra padding
load/stores, though it does keep the stack allocation the same size
unless both the GPR and FPR sets have an odd size, in which case one
full pair stack slot (16 bytes) is saved.

Diff Detail

Event Timeline

gberry updated this revision to Diff 47248.Feb 8 2016, 2:37 PM
gberry retitled this revision from to [AArch64] Reduce number of callee-save save/restores..
gberry updated this object.
gberry added a subscriber: llvm-commits.
t.p.northover edited edge metadata.Feb 9 2016, 2:25 PM

Hi Geoff,

Unfortunately this has implications for exception unwinding on iOS. We have fast-path .debug_frame equivalents that only have 32 bits to store the prologue's format, and the format assumes the registers are stored in pairs.

On the other hand, we really don't want frame lowering to be any more complex than it absolutely has to be. For now, we probably have to go with the existing format on Darwin but I think we can probably come up with a better determineCalleeSaves that serves both needs.

To start with, some observations (none of them your fault, the existing code did it):

  • UnspilledCSFPRs is completely unused.
  • Only one value is really needed in UnspilledCSGPRs: we need one to scavenge, but that's it.
  • Only the sum of NumGPRSpilled and NumFPRSpilled is used.

So, I think we can rewrite the top loop as:

// Figure out which callee-saved registers to save/restore.
for (unsigned i = 0; CSRegs[i]; ++i) {
  const unsigned Reg = CSRegs[i];
  bool RegUsed = SavedRegs.test(Reg);

  if (!RegUsed && AArch64::GPR64RegClass.contains(Reg) && !RegInfo->isReservedReg(MF, Reg)) {
    // Save this one for later in case we need a scratch register.
    UnspilledGPR = Reg;
    continue;
  }

  // MachO's compact unwind format relies on all registers being stored in pairs.
  // FIXME: the usual format is actually better if unwinding isn't needed.
  unsigned PairedReg = CSRegs[i ^ 1];
  if (MF.getSubtarget().isTargetMachO() && !SavedRegs.test(PairedReg)) {
    SavedRegs.set(PairedReg);
    ExtraCSSpill = true;
  }
}

NumRegsSpilled = SavedRegs.size();
CanEliminateFrame = NumRegsSpilled == 0;

Do you think you could make something like that work, or have I overlooked a key use that brings my whole edifice crumbling down?

Cheers.

Tim.

lib/Target/AArch64/AArch64FrameLowering.cpp
746

I think this should be asserting 16: it matches the comment a few lines above, the corresponding set call, and the ABI requirement.

Oh, and the debug printing can probably be a lambda'd-up std::for_each or similar to avoid complicating the control flow of the loop further.

NumRegsSpilled = SavedRegs.size();

Sorry, that would definitely be .count() if the loop's at all viable.

Tim,

Tim,

Thanks, this is the kind of thing I suspected was probably hiding out there. I assume these MachO fast unwind restrictions also would prevent doing non-leaf frame pointer elimination and redzone optimization? Do you think these restrictions are currently captured in the tests that I modified? Perhaps some of these tests need another non apple triple run/check so we cover both paths?

Hi Geoff,

I assume these MachO fast unwind restrictions also would prevent doing non-leaf frame pointer elimination and redzone optimization.

We'd probably want to keep non-leaf frame pointers independently of unwinding (people are very keen on getting good backtraces here). Similarly, we'd have to check our libc (etc) for red zone violations before it could be turned on.

Are you thinking of using the red zone for callee-saved registers too? That might have unwinding implications.

Do you think these restrictions are currently captured in the tests that I modified?

I think I should probably specifically write a few basic compact unwind codegen functions which would have broken with your patch. Just a sort of smokescreen to catch obvious things that should be using the compact format. I'll do that today (hopefully).

Cheers.

Tim.

gberry updated this revision to Diff 47675.Feb 11 2016, 10:09 AM
gberry edited edge metadata.

Updated revision to deal with MachO unwind info issues.

Hi Tim,

I've uploaded a new version which should hopefully address the MachO unwind issues. I've tried to add more test coverage to the existing lit tests, but I think adding the MachO specific smoke tests would also be a good thing.

Thanks for the update regarding non-leaf frame pointer omission and red-zone usage. I'm still exploring the options at this point and just wanted to get any iOS specific requirements in my head early. The first step I'll take in this direction is probably just getting -fomit-frame-pointer to actually to work for non-leaf functions.

Thanks for rewriting it to accommodate MachO. I've got a couple of minor suggestions, but only saw one likely bug (the unconditional register increment).

Cheers.

Tim.

lib/Target/AArch64/AArch64FrameLowering.cpp
697–702

Possibly:

assert(AArch64::GPR64RegClass.contains(RPI.Reg1) ||
       AArch64::FPR64RegClass.contains(RPI.Reg1));
RPI.IsGPR =  AArch64::GPR64RegClass.contains(RPI.Reg1);
746

Did you decide a stack size == 8 (mod 16) was allowed in some circumstances?

943

The logic would be marginally simpler if you set this unconditionally and then used the MachO test when actually using it.

981

This will be incremented even if UnspilledCSGPRPaired was already spilled.

gberry updated this revision to Diff 47706.Feb 11 2016, 12:54 PM

Address Tim's comments.

gberry marked 2 inline comments as done.Feb 11 2016, 12:57 PM
gberry added inline comments.
lib/Target/AArch64/AArch64FrameLowering.cpp
746

Not currently, no. Maybe in a future change that combines the callee-save stack bump with the local variable stack bump.

981

Good catch. I just recalculated NumRegsSpilled at the end here so SavedRegs is always the canonical representation.

t.p.northover added inline comments.Feb 11 2016, 2:02 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
746

I'm not sure that would be allowed. The AAPCS says "Additionally, at any point at which memory is accessed via SP, the hardware requires that SP mod 16 = 0. The stack must be quad-word aligned."

It's a bit more complicated than that (it's configurable via SCTLR), but I think we should force anyone who wants to weaken that to think very hard about it.

981

Ah, neat!

Thanks Tim. Are you okay with this going in, or should I wait for your additional test cases? I'm fine with either.

lib/Target/AArch64/AArch64FrameLowering.cpp
746

I don't think the hypothetical future change that I'm thinking of would violate that rule.
If there was only a single instruction bumping the SP (for both the callee-save area and the local variable area), then it would be okay if the callee-save area was not 16-byte aligned as long as the "hole" was filled by local variable data so the single SP bump was 16-byte aligned. There may be other reasons why this isn't workable though. I'm also not sure it's worth the effort.

t.p.northover accepted this revision.Feb 11 2016, 3:52 PM
t.p.northover edited edge metadata.

Yep, I think this is OK now. Thanks again for the updates!

Tim.

This revision is now accepted and ready to land.Feb 11 2016, 3:52 PM
This revision was automatically updated to reflect the committed changes.