This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add some missing fusion subtarget features
ClosedPublic

Authored by porglezomp on Jan 11 2021, 2:55 PM.

Details

Summary

Referencing ARM's software optimization guides:

A65 - 4.8 Instruction fusion

Address, AES, and MOVZ/MOVK literals

A72 - 4.11 Fast literal generation

  • 4.12 PC-relative address calculation

A76 - 4.6. AES Encryption/Decryption
A77/A78/A78C/X1 - 4.13 Instruction fusion

CMP/CMN, TST, BICS + B.cond fusion
AES fusion

[AArch64] Make Cortex B.cc fusions more precise

The ArithmeticBccFusion feature expects to be able to fuse general
flag-updating arithmetic with a B.cc, for example an arbitrary SUBS
instructions and not just a CMP.

Since the Cortex cores are documented as fusing CMP/CMN/TST, and the A77
optimization guide specifies that BICS fusion must have a destination of
XZR or WZR, these cores should use a separate subtarget feature for
specifically fusing only comparisons with B.cc.

Diff Detail

Event Timeline

porglezomp created this revision.Jan 11 2021, 2:55 PM
porglezomp requested review of this revision.Jan 11 2021, 2:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 11 2021, 2:55 PM

The existing fusion tests seem a little bit weak—the machine scheduler with the base Cortex-A57 machine models that most of those use seemed to pass the address and literal fusion on its own (by coincidence?) without the explicit features requested. I was unable to make that test case stronger to see the A57 machine model not exploiting those fusion pairs when the feature was disabled. Is that worth worrying about?

SjoerdMeijer accepted this revision.Jan 21 2021, 6:10 AM
SjoerdMeijer added a subscriber: SjoerdMeijer.

Thank, LGTM.

The existing fusion tests seem a little bit weak—the machine scheduler with the base Cortex-A57 machine models that most of those use seemed to pass the address and literal fusion on its own (by coincidence?) without the explicit features requested. I was unable to make that test case stronger to see the A57 machine model not exploiting those fusion pairs when the feature was disabled. Is that worth worrying about?

Yeah, perhaps there's room for improvement, it's not something that worries me very much.

llvm/lib/Target/AArch64/AArch64.td
221

Yep, makes sense I think.

624

Check

637

Check

669

Check

683

Check

694

Check

711

Check

739

Check

This revision is now accepted and ready to land.Jan 21 2021, 6:10 AM

I don't have commit access, so could someone commit the change for me? Thanks.

Thanks for contributing this. I have committed this on your behalf in rG815dd4b29208.
If you plan to do more LLVM work, you can always consider requesting an account so you can commit patches yourself (but am of course happy to do it on your behalf).

@SjoerdMeijer, @porglezomp - many thanks in advance,
I'd like to clarify one aspect related to this diff- is it intentional that currently only pairs of cmp + bcc get fused (but e.g. not cmp + cbz or cmp + cbnz) (unless I'm missing something)
and another question - after looking at https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/aarch64.cc#L1326 - would it make sense to enable CmpBccFusion by default.

cc: @dmgreen

Herald added a project: Restricted Project. · View Herald TranscriptAug 12 2022, 4:57 PM

I have for a while thought that we are missing a number of tuning features for AArch64 cores, but the (very brief) look into enabling more fusion on more cores ended up showing worse performance. The changes were not very large IIRC, but I worry that the more aggressive fusion was forcing instruction into worse positions. It may have just been that it was unlucky with noise. We get a lot of fusion naturally by the way the scheduler positions cmp/br and cmp/csel.

The optimization guides usually list the instructions that can be merged. If you have some evidence that any of them are producing better performance then that sounds like a useful patch. The performance number I got were done too quickly to draw any strong conclusions from.

As for the questions you actually asked - cbz and cbnz take a reg under aarch64 so don't have a separate cmp. For adding to -mcpu=generic, the general principle is that is needs to generally improve performance without hurting any other cores. Especially around the big-little cpus found in android phones. (Or it enables other optimizations like the linker relaxation from FeatureFuseAdrpAdd). GCC believes it is a benefit or benign, but they may have a slightly different algorithm for deciding when to fuse, I'm not sure. So long as we have some results that show it's improving things or flat for a selection of cores (big and little), then it should be OK.