This is an archive of the discontinued LLVM Phabricator instance.

AArch64: Omit callframe setup/destroy when not necessary
ClosedPublic

Authored by MatzeB on Jan 12 2018, 12:05 PM.

Details

Summary

Do not create CALLSEQ_START/CALLSEQ_END when there is no callframe to
setup and the callframe size is 0.

  • Fixes an invalid callframe nesting for byval arguments. This would fail the machine verifier as it looked like this before this patch (as in big-byval.ll):
...
ADJCALLSTACKDOWN 32768, 0, ...   # Setup for extfunc
...
ADJCALLSTACKDOWN 0, 0, ...  # setup for memcpy
...
BL &memcpy ...
ADJCALLSTACKUP 0, 0, ...    # destroy for memcpy
...
BL &extfunc
ADJCALLSTACKUP 32768, 0, ...   # destroy for extfunc
  • Saves us two machine instructions in the common case of zero-sized stackframes.
  • Remove an unnecessary scheduling barrier (hence the small unittest changes).

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB created this revision.Jan 12 2018, 12:05 PM
MatzeB edited the summary of this revision. (Show Details)Jan 16 2018, 6:38 PM

Various backends handle this situation different ways... ARM emits an inline loop, x86 and x86 emit an inline loop rather than a call. PowerPC has a utility createMemcpyOutsideCallSeq which avoids nesting. And this would introduce, essentially, a third way to handle it. We should probably settle on one solution, if possible.

Some target-independent code has special handling for callframe operations... but I guess most of the handling isn't necessary? Does a call without call frame setup instructions correctly disable the red zone?

Various backends handle this situation different ways... ARM emits an inline loop, x86 and x86 emit an inline loop rather than a call. PowerPC has a utility createMemcpyOutsideCallSeq which avoids nesting. And this would introduce, essentially, a third way to handle it. We should probably settle on one solution, if possible.

True, I could look into whether this simple solution would avoid the special case on PowerPC as well; but let's do that in a separate patch.
I'm not sure I want to change ARM/X86 from an inline loop to memcpy as it changes the performance characteristics (slightly slower for small sizes, faster for big sizes).

In any case I found this change desirable independently of it fixing the nesting issue.

Some target-independent code has special handling for callframe operations... but I guess most of the handling isn't necessary? Does a call without call frame setup instructions correctly disable the red zone?

AArch64FrameLowering::canUseRedZone() checks for MachineFrameInfo::hasCalls() which is set based on MCInstructions having a Call flag set not based of the stackframe setup (you can grep for setHasCalls(true) if you are interested).

efriedma accepted this revision.Jan 17 2018, 4:07 PM

LGTM

In general the idea makes sense: adjusting the stack by zero bytes is a no-op, so we don't need to mark it with a MachineInstr. I'm a little worried we're going to trip over some edge case where we depend on the call frame opcodes for some other reason, but I can't think of anything specific, so I guess we'll see. Please make sure you've run some basic correctness tests before you merge.

This revision is now accepted and ready to land.Jan 17 2018, 4:07 PM

Please make sure you've run some basic correctness tests before you merge.

I just tested this with the llvm test-suite with an asserts enabled compiler and the benchmarks compiled and ran fine.

This revision was automatically updated to reflect the committed changes.

Hi Matthias,

With this specific change, I found several performance regressions in spec benchmarks on AArch64
In -O3 :

Spec2006/astar -3.25%
Spec2006/povray -5.28%
Spec2017/povray -6.08%

In LTO :

Spec2006/astar -4.20%
Spec2006/h264ref -2.15%

For me, it appears that different value was picked for spill, resulting in different spill/reloads in different blocks. Have you run any performance test and observed any reproducible gain or regression?

Thanks,
Jun

I didn't do in-depth performance tests. In principle this just remove a few "nop" instructions so I didn't expect big changes. The change gives the scheduler a bit more freedom though as there is not instruction redefining SP anymore... Did you only get regressions and no improvements from this?

If you have some conrete differences in assembly that would be intersting to look at. I should have time for detailed analysis at the beginning of next week.

Hi Matthias,

With this specific change, I found several performance regressions in spec benchmarks on AArch64
In -O3 :

Spec2006/astar -3.25%
Spec2006/povray -5.28%
Spec2017/povray -6.08%

In LTO :

Spec2006/astar -4.20%
Spec2006/h264ref -2.15%

For me, it appears that different value was picked for spill, resulting in different spill/reloads in different blocks. Have you run any performance test and observed any reproducible gain or regression?

Thanks,
Jun

I didn't do in-depth performance tests. In principle this just remove a few "nop" instructions so I didn't expect big changes. The change gives the scheduler a bit more freedom though as there is not instruction redefining SP anymore... Did you only get regressions and no improvements from this?

If you have some conrete differences in assembly that would be intersting to look at. I should have time for detailed analysis at the beginning of next week.

I didn't see any gain in our weekily performance run, but regressions which specifically after this change unfortunately. I looked at one of the hot function (_ZN7way2obj12releasepointEii) in spec2006/astar where this change made changes in spilling widely. I beleive it should be reproducible with -O3. Please let me know if you cannot reproduce the regression; I will be happy to support fixing the regressions.

Would it be possible to revert r322917 while we investigate the regressions? We also identified a 3.61% regression in SPEC2006/bzip2, so here's to complete list of regressions we are currently seeing due to this change:

With -O3 -fno-math-errno -ffp-contract=fast -fomit-frame-pointer -mcpu=falkor:

Spec2006/astar -3.25%
Spec2006/bzip2 -3.61%
Spec2006/povray -5.28%
Spec2017/povray -6.08%

With -O3 -flto -fuse-ld=gold -fno-math-errno -ffp-contract=fast -fwhole-program-vtables -fvisibility=hidden -fomit-frame-pointer -mcpu=falkor:

Spec2006/astar -4.20%
Spec2006/h264ref -2.15%

All tests were run on Falkor, but hopefully these issues can be reproduced on other targets. Please let us know if you need any assistance reproducing, Matthias.

Chad

Of course reverting is fine. This whole commit was mostly motivated by llvm/test/CodeGen/AArch64/big-callframe.ll but you could simply XFAIL that test together with the revert.

I will revert later if you don't beat me to it.

Would it be possible to revert r322917 while we investigate the regressions? We also identified a 3.61% regression in SPEC2006/bzip2, so here's to complete list of regressions we are currently seeing due to this change:

With -O3 -fno-math-errno -ffp-contract=fast -fomit-frame-pointer -mcpu=falkor:

Spec2006/astar -3.25%
Spec2006/bzip2 -3.61%
Spec2006/povray -5.28%
Spec2017/povray -6.08%

With -O3 -flto -fuse-ld=gold -fno-math-errno -ffp-contract=fast -fwhole-program-vtables -fvisibility=hidden -fomit-frame-pointer -mcpu=falkor:

Spec2006/astar -4.20%
Spec2006/h264ref -2.15%

All tests were run on Falkor, but hopefully these issues can be reproduced on other targets. Please let us know if you need any assistance reproducing, Matthias.

Chad

Of course reverting is fine. This whole commit was mostly motivated by llvm/test/CodeGen/AArch64/big-callframe.ll but you could simply XFAIL that test together with the revert.
I will revert later if you don't beat me to it.

Okay. then I will revert this for now.

Thanks,
Jun

In r323683 , revert this and XFAIL big-callframe.ll.

Thanks, Matthias/Jun!