This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Disable FP LD1RX instructions generation for Neoverse-V1
ClosedPublic

Authored by igor.kirillov on Aug 7 2023, 6:14 AM.

Details

Summary

These instructions show worse performance on Neoverse-V1 compared
to pair of LDR(LDP)/MOV instructions.
This patch adds no-sve-fp-ld1r sub-target feature, which is enabled
only on Neoverse-V1.

Fixes https://github.com/llvm/llvm-project/issues/64498

Diff Detail

Event Timeline

igor.kirillov created this revision.Aug 7 2023, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2023, 6:14 AM
igor.kirillov requested review of this revision.Aug 7 2023, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2023, 6:14 AM
paulwalker-arm added inline comments.Aug 7 2023, 7:11 AM
llvm/lib/Target/AArch64/AArch64.td
151

This doesn't accurately represent the "feature" you want to model.

Firstly the effect is specific to SVE rather than wanting to avoid all uses of ld1r (plus below I reason that we also might want to keep the integer variants as is).

Secondly the issue is that on Neoverse V1 there are fewer LS pipelines for SVE than NEON. This is not normally a problem because the SVE registers are twice the size of NEON and so the overall bandwidth is greater. However, when loading 128-bit or smaller datatypes the bandwidth switches in favour of NEON with its extra LS pipe. (Noting that on V1 the latency of LD1R is the same are LDR+DUP)

The choice is yours but as a minimum, and based on agreement regarding the integer variants, I'd be happy with "sve-avoid-fp-ld1r" but if there's a nice way to sum up the second point above then that'll be perfect.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
2358–2366

I believe this should be restricted to only the floating point patterns. The issue relates to instruction bandwidth and the GPR variants of DUP on V1 go down a single pipe and thus are likely worse than the LS bandwidth issue you're trying to prevent.

igor.kirillov retitled this revision from [CodeGen] Disable LD1RX instructions generation for Neoverse-V1 to [CodeGen] Disable FP LD1RX instructions generation for Neoverse-V1.
igor.kirillov edited the summary of this revision. (Show Details)
  • Rename new feature avoid-ldr1 to no-sve-fp-ld1r
  • Now change is applied only to FP LD1R instructions
  • Some other renaming inside code
igor.kirillov marked an inline comment as done.Aug 7 2023, 7:59 AM
igor.kirillov added inline comments.
llvm/lib/Target/AArch64/AArch64.td
151

I've decided to use no- prefix rather than avoid as it seems to be the way it is done.

Update CHECK lines

paulwalker-arm added inline comments.Aug 7 2023, 10:53 AM
llvm/lib/Target/AArch64/AArch64.td
152

Should this be HasNoSVEFPLD1R? Otherwise UseSVEFPLD1R = !Subtarget->hasSVEFPLD1R code looks weird.

igor.kirillov marked an inline comment as done.Aug 7 2023, 12:11 PM
paulwalker-arm accepted this revision.Aug 7 2023, 12:25 PM

A couple of late requests but otherwise looks good.

llvm/lib/Target/AArch64/AArch64.td
152

The LDP is not relevant here. It was just a quirk of the code where you observed the issue.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
2358–2366

Sorry I missed this before but this'll override the existing value for Predicates and so you'll need to add HasSVEorSME to be correct.

This revision is now accepted and ready to land.Aug 7 2023, 12:25 PM
Matt added a subscriber: Matt.Aug 7 2023, 3:41 PM
igor.kirillov marked an inline comment as done.
  • Add RUN line to sve-ld1r.ll test with -mcpu=neoverse-v1
  • Update feature description
  • Add missing Predicate

Remove RUN with -mcpu=neoverse-v1 due to too many changes

igor.kirillov edited the summary of this revision. (Show Details)

Added a test for neoverse-v1

igor.kirillov marked 2 inline comments as done.Aug 8 2023, 6:24 AM

This sounds like an interesting one.. We have certainly seen cases before where instructions are worth splitting out into multiple parts, but it often helps in one case and hurts in others. It seems like the idea here is that the throughput of sve loads is limited to 2, but with scalar fp loads that can go up to 3? So in load-throughput limited situations the expanded nodes win out (especially if they can use ldp). Wouldn't the opposite be true too though? If it was vector-instruction limited or frontend limited then multiple instructions will be worse? You could imagine it being done in the load/store optimizer if it could detect cases where it could use ldp.

If this is better (and I imagine it might be in many situations), then it can equally apply to integer too. It would just need to be changed to an fp load, to make sure it didn't pay the cost of crossing between register banks.

llvm/test/CodeGen/AArch64/sve-ld1r.ll
1254–1255

These with multiple extra instructions look quite a bit worse. It might not apply for predicated instructions with zeros.

I agree and this is very much a last minute fix for a new regression observed within LLVM 17. My feeling is that for these sorts of issues we should be using the scheduling models to tell us about bandwidth limitations.