These instructions show worse performance on Neoverse-V1 compared to a pair of LDR (or LDP) and MOV instructions.
This patch adds the no-sve-fp-ld1r subtarget feature, which is enabled only on Neoverse-V1.
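For context, the feature and its gating predicate might be sketched in TableGen roughly as below. The exact names (FeatureNoSVEFPLD1R, UseSVEFPLD1R, the accessor, and the description string) are assumptions pieced together from the review comments, not necessarily what the final patch uses:

```tablegen
// Hypothetical sketch of the subtarget feature in AArch64.td.
// Neoverse V1 would list this feature in its processor definition.
def FeatureNoSVEFPLD1R : SubtargetFeature<"no-sve-fp-ld1r", "NoSVEFPLD1R",
    "true", "Avoid using SVE LD1RX instructions for FP splats">;

// Matching predicate used to guard the FP LD1R patterns in
// AArch64SVEInstrInfo.td; when it is false, selection falls back
// to a scalar LDR followed by a DUP/MOV splat.
def UseSVEFPLD1R : Predicate<"!Subtarget->hasNoSVEFPLD1R()">;
```

With this shape, the predicate reads naturally at the use sites, which is the naming concern raised later in the review.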
Repository: rG LLVM Github Monorepo

Event Timeline
llvm/lib/Target/AArch64/AArch64.td, line 151:

This doesn't accurately represent the "feature" you want to model. Firstly, the effect is specific to SVE, rather than wanting to avoid all uses of LD1R (plus, below I reason that we might also want to keep the integer variants as is). Secondly, the issue is that on Neoverse V1 there are fewer LS pipelines for SVE than for NEON. This is not normally a problem because the SVE registers are twice the size of the NEON ones, so the overall bandwidth is greater. However, when loading 128-bit or smaller datatypes, the bandwidth switches in favour of NEON with its extra LS pipe. (Note that on V1 the latency of LD1R is the same as LDR+DUP.)

The choice is yours but, as a minimum, and based on agreement regarding the integer variants, I'd be happy with "sve-avoid-fp-ld1r". If there's a nice way to sum up the second point above, then that'll be perfect.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td, lines 2358–2366:

I believe this should be restricted to only the floating-point patterns. The issue relates to instruction bandwidth, and the GPR variants of DUP on V1 go down a single pipe, so they are likely worse than the LS bandwidth issue you're trying to prevent.
- Rename the new feature from avoid-ld1r to no-sve-fp-ld1r
- The change is now applied only to FP LD1R instructions
- Some other renaming inside the code
llvm/lib/Target/AArch64/AArch64.td, line 151:

I've decided to use the no- prefix rather than avoid, as that seems to be the way it is done.
llvm/lib/Target/AArch64/AArch64.td, line 152:

Should this be HasNoSVEFPLD1R? Otherwise the code UseSVEFPLD1R = !Subtarget->hasSVEFPLD1R looks weird.
A couple of late requests but otherwise looks good.
llvm/lib/Target/AArch64/AArch64.td, line 152:

The LDP is not relevant here. It was just a quirk of the code where you observed the issue.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td, lines 2358–2366:

Sorry, I missed this before, but this will override the existing value for Predicates, so you'll need to add HasSVEorSME to be correct.
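The requested fix can be sketched as follows (the pattern bodies are elided and the UseSVEFPLD1R name is an assumption; the point is only the Predicates list):

```tablegen
// Before: assigning Predicates here clobbers the value inherited from
// the surrounding block, silently dropping HasSVEorSME.
let Predicates = [UseSVEFPLD1R] in {
  // ... FP LD1R-related patterns ...
}

// After: restate the inherited predicate alongside the new one.
let Predicates = [HasSVEorSME, UseSVEFPLD1R] in {
  // ... FP LD1R-related patterns ...
}
```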
- Add RUN line to sve-ld1r.ll test with -mcpu=neoverse-v1
- Update feature description
- Add missing Predicate
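The added RUN line presumably follows the shape of the existing ones in sve-ld1r.ll, something like the sketch below (the check-prefix names are assumptions, not taken from the patch):

```llvm
; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=neoverse-v1 < %s \
; RUN:   | FileCheck %s --check-prefixes=CHECK,CHECK-NO-LD1R
```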
This sounds like an interesting one. We have certainly seen cases before where instructions are worth splitting into multiple parts, but it often helps in one case and hurts in others. It seems the idea here is that the throughput of SVE loads is limited to 2, but with scalar FP loads it can go up to 3? So in load-throughput-limited situations the expanded nodes win out (especially if they can use LDP). Wouldn't the opposite be true too, though? If it was vector-instruction limited or frontend limited, then multiple instructions would be worse. You could imagine it being done in the load/store optimizer if it could detect cases where it could use LDP.

If this is better (and I imagine it might be in many situations), then it can equally apply to integer too. It would just need to be changed to an FP load, to make sure it didn't pay the cost of crossing between register banks.
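To illustrate the trade-off being discussed, a splatted double load could be selected either way. The assembly below is illustrative only, not taken from the test diff:

```asm
// SVE LD1R form: one instruction, but it competes for the (fewer)
// SVE-capable LS pipes on Neoverse V1.
ld1rd   { z0.d }, p0/z, [x0]

// Expanded form: a scalar FP load plus a splat. The load can use the
// extra LS pipe (and potentially pair into an LDP), at the cost of an
// extra vector instruction for the splat.
ldr     d0, [x0]
mov     z0.d, d0
```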
llvm/test/CodeGen/AArch64/sve-ld1r.ll, lines 1254–1255:

These, with multiple extra instructions, look quite a bit worse. It might not apply for predicated instructions with zeros.
I agree, and this is very much a last-minute fix for a new regression observed within LLVM 17. My feeling is that for these sorts of issues we should be using the scheduling models to tell us about bandwidth limitations.