This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] LSLFast to fold onto base address by default
Needs Review · Public

Authored by harviniriawan on Jul 17 2023, 8:14 AM.

Details

Summary
  • Most CPUs have a dedicated adder and shifter to compute the base address of loads and stores, so folding the shift into the addressing mode is effectively free (see the example below).

  • Older CPUs take an extra cycle on loads whose offset is left-shifted by 2; don't fold the LSL into the base address in those cases, and add a new target feature to model this.
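
For illustration, a hedged sketch of the fold being discussed (not taken from the patch or its tests; the exact codegen depends on the subtarget and the fold heuristics):

  // Indexing an int array scales the index by 4, i.e. LSL #2.
  int load_scaled(const int *base, long idx) {
    return base[idx];
    // (a) shift folded into the load's addressing mode (what this patch
    //     prefers by default):
    //       ldr w0, [x0, x1, lsl #2]
    // (b) shift materialised by a separate add, then a plain load:
    //       add x8, x0, x1, lsl #2
    //       ldr w0, [x8]
  }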

Diff Detail

Event Timeline

harviniriawan created this revision. Jul 17 2023, 8:14 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 17 2023, 8:14 AM
harviniriawan requested review of this revision. Jul 17 2023, 8:14 AM

From looking at some of the optimization manuals, I agree that it makes sense to do this more aggressively. (LSLFast has taken an odd route: it was originally added to represent when shifts folded into addressing operands were quick, but it has now come to just mean that an add-with-shift is cheap.) Some of the older optimization manuals mention shifts of 2 being slower; is that something we need to take into account? I'm not sure about non-Arm cores either. Presumably there was a reason why people originally believed that operands with shifts would be better as separate instructions.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
660

Subtarget->hasLSLFast() || FoldToBaseAddr
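
A rough sketch of the shape being suggested, assuming the LLVM AArch64 backend context; isWorthFoldingShift and FoldToBaseAddr are illustrative names, not the actual code at this line:

  static bool isWorthFoldingShift(const AArch64Subtarget *Subtarget,
                                  bool FoldToBaseAddr) {
    // Fold the LSL into the load/store addressing mode either when the
    // subtarget reports shifted operands as fast, or when folding onto the
    // base address is treated as free by default.
    return Subtarget->hasLSLFast() || FoldToBaseAddr;
  }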

llvm/test/CodeGen/AArch64/aarch64-fold-lslfast.ll
173–174

These CHECK prefixes can be removed now, if they are all expected to be the same.

llvm/test/CodeGen/AArch64/arm64-addr-mode-folding.ll
195

Make sure to remove the old CHECK lines. It is probably worth updating the check lines in a separate patch, so that just the differences can be shown here.

llvm/test/CodeGen/AArch64/arm64-fold-address.ll
62

This looks like it hasn't been generated properly. It might mean the update script doesn't like the triple.

Allen added a comment. Jul 19 2023, 4:11 AM

Agree with @dmgreen: folding a small shift into the addressing operands doesn't add instruction latency and it reduces register usage. GCC already does this folding: https://gcc.godbolt.org/z/68zEq8x81
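
As a hedged illustration of the register-usage point (the godbolt link above is the authoritative example; the register numbers below are illustrative), consider a shifted index with more than one use:

  // The shifted index (i << 2) feeds two different loads.
  int sum2(const int *a, const int *b, long i) {
    return a[i] + b[i];
    // Folded: the shifted index is simply repeated in each addressing mode,
    // with no extra adds or temporaries:
    //   ldr w8, [x0, x2, lsl #2]
    //   ldr w9, [x1, x2, lsl #2]
    //   add w0, w8, w9
    // Unfolded: one add and one temporary register per base address:
    //   add x8, x0, x2, lsl #2
    //   add x9, x1, x2, lsl #2
    //   ldr w8, [x8]
    //   ldr w9, [x9]
    //   add w0, w8, w9
  }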

harviniriawan marked 4 inline comments as done.
harviniriawan edited the summary of this revision.

Sorry for the delay. I've been looking at something very related recently (folding extends into address operands), so I may have become overly opinionated.

I think it makes sense to split LSLFast into an address part and an ALU part, but not as part of this patch. From the optimization guides it then looks like we have four cases where it is or isn't better to fold into the addressing operands (when the operand has multiple uses):

  • None. The current default: if the shift has multiple uses, don't fold it.
  • LSLFast. The current LSLFast, which folds LSLs (not extends) and is used for the original LSLFast cases on Kryo and Falkor.
  • Ext23Fast. Shifts of 2 and 3 (scales of 4 and 8) are cheap, the other two are not; all extends are cheap. Used for all Arm OoO cores.
  • AllFast. Everything is cheap: scales of 2 and 16 along with those above. Used for in-order cores, I think.

It looks like Ext23Fast should be the default for -mcpu=generic, as a conservative compromise between the current default and AllFast. (There is a chance that AllFast is better on all cores, but it looks like it takes the load/store pipe for an extra cycle, and the ones I tested had an extra cycle of latency.) This patch seems to essentially switch the default to AllFast, with a target feature that makes LSL1 slow again (but not LSL4)? I like changing the default, but I'm not sure we can change it to AllFast for all CPUs.
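
A hypothetical C++ sketch of the four tiers above (the names and signature are illustrative, not part of this patch or of D157982); Shift is the shift amount, so scale = 1 << Shift:

  enum class AddrFoldTier { None, LSLFast, Ext23Fast, AllFast };

  // Returns true if folding a multiple-use shift/extend into the addressing
  // mode is considered cheap for the given tier.
  static bool isCheapAddrFold(AddrFoldTier Tier, unsigned Shift, bool IsExtend) {
    switch (Tier) {
    case AddrFoldTier::None:
      return false;                                // never fold multi-use operands
    case AddrFoldTier::LSLFast:
      return !IsExtend;                            // plain LSLs only (Kryo/Falkor)
    case AddrFoldTier::Ext23Fast:
      return IsExtend || Shift == 2 || Shift == 3; // scales 4 and 8, all extends
    case AddrFoldTier::AllFast:
      return true;                                 // scales 2 and 16 as well
    }
    return false;
  }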

llvm/lib/Target/AArch64/AArch64.td
386 ↗(On Diff #542804)

Can we add Addr to the name of this feature, to explain that it is about address operands, not add+lsl? Should we also use Scale2 or Shift1?
From looking at the optimization guides and what we model in the scheduling model (https://github.com/llvm/llvm-project/blob/f6bdfb0b92690403ceef8c1d58adf7a3a733b543/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td#L490), it looks like this should be slower for scale2 and scale16. Scale4 and scale8 (and scale1, which is easy) are fast.

774 ↗(On Diff #542804)

Should the A55 really have FeatureShiftBy2Slow, i.e. is the shift-by-2 case actually slower there? I don't see that in the optimization guide.

786 ↗(On Diff #542804)

From what I can see in the optimization guides, almost all Arm AArch64 CPUs (except for the Cortex-A510) say that the latency of "Load vector reg, register offset, scale, S/D-form" is a cycle lower than that of "Load vector reg, register offset, scale, H/Q-form".
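
For reference, a hedged example of the two forms being compared (the instruction choices are what one would typically expect; the actual latency difference is whatever each core's software optimization guide lists):

  #include <arm_neon.h>

  // S-form, scale 4 (LSL #2): usually the faster entry in the guides.
  float load_s(const float *p, long i) {
    return p[i];            // e.g. ldr s0, [x0, x1, lsl #2]
  }

  // Q-form, scale 16 (LSL #4): typically listed one cycle slower.
  float32x4_t load_q(const float32x4_t *p, long i) {
    return p[i];            // e.g. ldr q0, [x0, x1, lsl #4]
  }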

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1221

Should this and the one below be true too?

I've put up a patch to split LSLFast into two parts in https://reviews.llvm.org/D157982.