Lower aarch64_sve_dupq_lane with a constant floating-point vector to ld1rq to avoid the use of an unnecessary neon vector unit
Details
Diff Detail
Event Timeline
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10525–10526 | As it stands, tryLowerLD1RQ is returning the BUILD_VECTOR as an in-band signalling mechanism, reusing the 'slot' that would ordinarily be used to communicate the result of the lowering to instead communicate that further lowering should stop for now. This seems hazardous to me. I'd be interested if other reviewers have an opinion on what is a better way to achieve this. tryLowerLD1RQ wants its hands on the address of the load, which begins life as an ISD::BUILD_VECTOR and the address is not available until after BUILD_VECTOR has been lowered. However, LowerDUPQLane after this point can match Op and replace it with something else. The logic for converting the BUILD_VECTOR to a load cannot be directly called out to since it is within the lowering of BUILD_VECTOR, and reimplementing or refactoring that doesn't seem the right way either. On balance I think the current approach is OK, although I'd prefer to see the signalling be out-of-band, perhaps via a pass-by-pointer bool. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10525–10526 | I believe the problem here is that the current implementation of LowerDUPQLane is kicking out a MachineNode (i.e. AArch64::DUP_ZZI_Q) which means you cannot implement the fix by looking for the expected splatq(load(q)) pattern. I recommend first creating an AArch64ISD node for DUP_ZZI_Q or something akin to it if you think of something more generic. Then LowerDUPQLane will only emit normal DAG nodes and thus you can implement a simpler DAG combine. With that said, I believe once the AArch64ISD node exists you should be able to implement the optimisation within AArch64SVEInstrInfo.td much like how we target the other variants of LD1R albeit with a different pattern to match against. It's probably worth breaking this into two patches. The first to create and use the new ISD node and the second to add the patterns to target LD1RQ. | |
10535 | Not your fault but I believe this BITCAST is a bug waiting to hit big endian targets. I recommend ensuring your new ISD node can be selected for all packed legal scalable vector types so the need for the bitcast is removed from the DUPQ path. |
Abandoning in favour of https://reviews.llvm.org/D128902, which should result in a simpler combine across 2 patches
As it stands, tryLowerLD1RQ is returning the BUILD_VECTOR as an in-band signalling mechanism, reusing the 'slot' that would ordinarily be used to communicate the result of the lowering to instead communicate that further lowering should stop for now. This seems hazardous to me.
I'd be interested if other reviewers have an opinion on what is a better way to achieve this. tryLowerLD1RQ wants its hands on the address of the load, which begins life as an ISD::BUILD_VECTOR and the address is not available until after BUILD_VECTOR has been lowered. However, LowerDUPQLane after this point can match Op and replace it with something else. The logic for converting the BUILD_VECTOR to a load cannot be directly called out to since it is within the lowering of BUILD_VECTOR, and reimplementing or refactoring that doesn't seem the right way either.
On balance I think the current approach is OK, although I'd prefer to see the signalling be out-of-band, perhaps via a pass-by-pointer bool.