Example pattern before:
%1:gpr64 = LDRXui ...
%2:fpr64 = COPY %1:gpr64
Pattern after:
%1:gpr64 = LDRDui
The pattern shows up when elements are read from arrays and moved into floating-point registers for calculation.
Differential D130100
[AArch64] Combine a load into GPR followed by a copy to FPR to a load into FPR directly through MIPeepholeOpt
Abandoned · Public. Authored by mingmingl on Jul 19 2022, 9:59 AM.
Event Timeline

mingmingl retitled this revision from [aarch64-mi-peepholeopt] Combine a gpr->fp copy of load gpr into load gp directly. to [AArch64] Combine a load into GPR followed by a copy to FPR to a load into FPR directly through MIPeepholeOpt.

Comment: Implementation-wise, use MachineRegisterInfo::replaceRegWith to replace all uses of Reg; this way, debug uses of the register are updated as well. Besides, add a dedicated MIR test for this optimization, so that the machine-instruction updates (including MachineInstr::memoperands, etc.) are visible.
Comment: Do you have details on the motivating case for this? I was under the impression that this is usually handled by ISel, but the tests certainly suggest that isn't always the case. There will be cases that cross basic blocks, which DAG ISel won't be able to match. A number of the other test changes look like artifacts of how the tablegen patterns are defined. The AArch64MIPeepholeOptimizer seems to be a rich source of bugs; every time we make a change in it, it takes several attempts to get it correct.
Comment: This is a good point. I understand that DAG ISel is probably a better place to handle this combine, to avoid fiddling with the various kinds of load/store operations. As shown in the .mir test case, the machine instructions are in the same basic block (they do not cross basic blocks, so code sinking won't help). The existing UIntToFPROLoadPat and VecROLoadPat patterns in AArch64InstrInfo.td won't catch this; in particular, the test case uses a load with an unsigned offset (not a register offset), so it is not caught by VecROLoadPat. The debug log of instruction selection shows it's very likely due to …
By the way, for simplicity this patch only handles loads, but I believe the same optimization opportunity exists for store operations.
Thanks for pointing out a different place to solve the same problem. I share the concern that general memory operations are hard to rewrite correctly, and even when the rewrites are correct, it's better to make them happen earlier in the pipeline.
Comment: An update:
For 2), peephole-opt should have folded the load (code) given that AArch64InstrInfo.cpp considers this pattern (mentioned in a comment). Going to look into why it isn't executed and report back.

mingmingl marked 6 inline comments as done and an inline comment as not done. Edited · Aug 1 2022, 11:12 AM
Comment: It's true that most of the cross-register-bank copies are introduced in the (multi-step) ISel pass. May I get some feedback on whether the general idea of combining {ldr + copy} into {ldr} is good to go, given that 1) fixing ISel case-by-case could be laborious, and 2) peephole-opt does more cleanup (say, when the {ldr + copy} -> {ldr} opportunity is only exposed after ISel, as a result of multiple passes executing)? Also updated the implementation to bail out on illegal conditions.
mingmingl marked an inline comment as not done.

Comment: Updated the implementation to bail out on illegal conditions; the major ones include
Comment: I worry about the potential for bugs in any patch to the AArch64MIPeepholeOptimizer, and loads make that more complex. The altered tests look like they involve either intrinsics like the pmull64 that was recently changed, or strange types (<32 x i1>, <16 x i12>) that don't come up a lot in practice. The global-isel change in dp1.ll is a missed fold that is probably better handled in global-isel. The irregular-type tests also show it only handling straight loads, not instructions like ld1 { v1.s }[1], [x12]. The current model is to try to handle almost all combines during ISel so that they can all work together in one place. If this comes up in more places it might be worth it; from the examples in the tests I'm not so sure, though. Do you have any other examples where this helps?
Comment: I totally agree that loads/stores would make things more complex here, and that global-isel should be able to handle dp1.ll; the strange types (<16 x i2>, <32 x i1>) look more interesting, though I'm not sure whether they are the result of multiple issues.
I'm convinced it's better to handle the affected test cases in ISel (I still made changes here, hoping to get them closer to correct). I don't observe other examples where this helps; the problem hunting started from the pmull test case (llvm/test/CodeGen/AArch64/aarch64-pmull2.ll in D131047). I'll abandon this patch for now; thanks for your timely feedback on this series of changes!
Revision Contents
Diff 449080 llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
llvm/test/CodeGen/AArch64/arm64-subvector-extend.ll
llvm/test/CodeGen/AArch64/arm64-vmul.ll
llvm/test/CodeGen/AArch64/dp1.ll
llvm/test/CodeGen/AArch64/neon-extadd.ll
Declare this with const. Alternatively, turn the definition into a function in the anonymous namespace outside the class.