This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Don't combine callee-save and local stack adjustment when optimizing for size
ClosedPublic

Authored by Nikolai on Oct 4 2019, 5:42 PM.

Details

Summary

For arm64, D18619 introduced the ability to combine bumping the stack pointer upfront in case it needs to be bumped for both the callee-save area as well as the lo\
cal stack area.

That diff already remarks that "This change can cause an increase in instructions", but argues that even when that happens, it should be still be a performance benefit because the number o\
f micro-ops is reduced.

We have observed that this code-size increase can be significant in practice. This diff disables combining stack bumping for methods that are marked as optimize-for-size.

Example of a prologue with the behavior before this diff (combining stack bumping when possible):

sub        sp, sp, #0x40
stp        d9, d8, [sp, #0x10]
stp        x20, x19, [sp, #0x20]
stp        x29, x30, [sp, #0x30]
add        x29, sp, #0x30
[... compute x8 somehow ...]
stp        x0, x8, [sp]

And after this diff, if the method is marked as optimize-for-size:

stp        d9, d8, [sp, #-0x30]!
stp        x20, x19, [sp, #0x10]
stp        x29, x30, [sp, #0x20]
add        x29, sp, #0x20
[... compute x8 somehow ...]
stp        x0, x8, [sp, #-0x10]!

Note that without combining the stack bump there are two auto-decrements, nicely folded into the stp instructions, whereas otherwise there is a single sub sp, ... instruction, but not \
folded.

Event Timeline

Nikolai created this revision.Oct 4 2019, 5:42 PM
kyulee added a subscriber: kyulee.Oct 4 2019, 5:51 PM
dmgreen added a subscriber: dmgreen.Oct 6 2019, 3:33 AM

Hello. Seems like a sensible change.

Do we need the option? Or should we just be doing this whenever we are Optsize?

Do we need the option? Or should we just be doing this whenever we are Optsize?

We don't really need it. I mainly introduced this option for "maximal backwards compatibility", in case anyone has taken a dependency on the code generation pattern outside of the LLVM test suite.
If you prefer, I can update the diff, removing the option.

My understanding is that D18619 was an optimisation, and that optimisation increases codesize in order to decrease micro-ops and improve scheduling?

If so then I don't think that anyone should be relying on it, exactly. Under minsize it should be fine to just not do the optimisation. It sounds like a fairly limited gain to me, increasing instruction count to gain on micro ops and scheduling, so the codesize benefit should be bigger win.

Nikolai updated this revision to Diff 224535.Oct 10 2019, 8:01 PM

Addressing feedback by removing option, now always changing behavior when optimizing for size.

Nikolai retitled this revision from [AArch64] Make combining of callee-save and local stack adjustment optional to [AArch64] Don't combine callee-save and local stack adjustment when optimizing for size.Oct 10 2019, 8:02 PM
Nikolai edited the summary of this revision. (Show Details)
dmgreen accepted this revision.Oct 13 2019, 1:26 AM
dmgreen added a reviewer: dmgreen.

LGTM, Thanks!

Do you want me to commit this, or do you have commit access already?

This revision is now accepted and ready to land.Oct 13 2019, 1:26 AM

Do you want me to commit this, or do you have commit access already?

I don't have commit access. So if you can commit, that would be great!

This revision was automatically updated to reflect the committed changes.

This change appears to cause an assertion failure in clang during a Chromium for Windows on Arm (AArch64). We suspect that it is also the cause of a mis-compilation when clang does not have assertions enabled, and causes a crash in some test cases. See https://crbug.com/1029385 for details.

Oh boo. This is presumably some knock-on bug, that this is just exposing. Only on windows you say?

Do you have a reproducer?

hans added a subscriber: hans.Nov 30 2019, 5:29 AM

Oh boo. This is presumably some knock-on bug, that this is just exposing. Only on windows you say?

Do you have a reproducer?

Here's a short repro:

$ cat /tmp/a.cc
template <typename T> void foo(T t);
struct S {
  virtual bool bar();
};
void baz() { foo(&S::bar); }
$ bin/clang -cc1 -triple arm64-unknown-windows-msvc19.16.0 -emit-obj -Os /tmp/a.cc
clang: /work/llvm.monorepo/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp:706: MachineBasicBlock::iterator convertCalleeSaveRestoreToSPPrePostIncDec(llvm::MachineBasicBlock &, MachineBasicBlock::iterator, const llvm::DebugLoc &, const llvm::TargetInstrInfo *, int, bool, bool *, bool): Assertion `MBBI->getOperand(OpndIdx - 1).getReg() == AArch64::SP && "Unexpected base register in callee-save save/restore instruction!"' failed.

I've reverted in c2443155a0f to unbreak the builds while this gets investigated.

TomTan added a subscriber: TomTan.Dec 1 2019, 9:49 PM

Probably D18619 doesn't handle vcall thunk as below correctly when stack bump is not combined (in convertCalleeSaveRestoreToSPPrePostIncDec). The produced code has the last second add instruction removed which leaves unbalanced stack.

base_unittests!base::sequence_manager::TaskQueue::`vcall'{0}':
00007ff6`cc9660bc d10103ff sub         sp,sp,#0x40
00007ff6`cc9660c0 a9008be1 stp         x1,x2,[sp,#8]
00007ff6`cc9660c4 a90193e3 stp         x3,x4,[sp,#0x18]
00007ff6`cc9660c8 a9029be5 stp         x5,x6,[sp,#0x28]
00007ff6`cc9660cc f9001fe7 str         x7,[sp,#0x38]
00007ff6`cc9660d0 f9400009 ldr         x9,[x0]
00007ff6`cc9660d4 f8440529 ldr         x9,[x9],#0x40
                           add         sp, #0x40
00007ff6`cc9660d8 d61f0120 br          x9