This patch adds DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll | ||
---|---|---|
140 | can you add a MachineIR test case for those transforms? The tests probably should also go into one of the machine-combiner* files or a new one. For more details, please see https://llvm.org/docs/MIRLangRef.html#mir-testing-guide |
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll | ||
---|---|---|
140 | Thanks Florian! I added a new MIR test. |
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp | ||
---|---|---|
5824 | This doesn't work because of the two lines below: DelInstrs.push_back(MUL); DelInstrs.push_back(&Root); In case of a DUP+FMUL pattern, Root is the FMUL instruction we need to delete. If we also assign it to the MUL variable, it will be deleted twice. | |
5827–5828 | This looks like a bug. I think we should at least return from the function without updating InsInstrs and DelInstrs if processLogicalImmediate function fails. (gdb) p Root.dump() %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv (gdb) p Root.Parent->dump() bb.0 (%ir-block.0): liveins: $w0 %0:gpr32 = COPY $w0 %1:gpr32 = MOVi32imm -1431655765 %2:gpr32common = MADDWrrr %0:gpr32, killed %1:gpr32, $wzr %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv %4:gpr32 = exact EXTRWrri %3:gpr32, %3:gpr32, 1 %5:gpr32 = MOVi32imm 715827883 %6:gpr32 = SUBSWrr killed %4:gpr32, killed %5:gpr32, implicit-def $nzcv %7:gpr32 = CSINCWr $wzr, $wzr, 2, implicit $nzcv $w0 = COPY %7:gpr32 RET_ReallyLR implicit $w0 (gdb) p Pattern $3 = llvm::MachineCombinerPattern::MULSUBWI_OP1 |
I've uploaded a separate patch for the FIXME issue: https://reviews.llvm.org/D100047
Let me know if anything should be fixed or improved for this one.
Sorry for the delay, mostly nits inlined, one question about missing f16 tests.
But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp | ||
---|---|---|
4553 | Nit: this assert can be hoisted out of this switch? | |
4797 | Nit: I think you can get the MF via a few getParent() calls on Root, so you don't have to pass it. From memory I can't remember if that is also true fro MRI and TII, but I don't have strong opinions on this. | |
5788 | Nit: perhaps pass RC and Opc directly in here? Don't really need these assignments? | |
5819 | Are we missing tests for these cases? | |
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll | ||
164 | Nit: can we just do a ret %add here? | |
179 | Same? |
But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.
I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users. But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.
If VDUP is moved into the same BB as its user, then VMUL_indexed is selected without these changes in Machine InstCombiner.
However, this does not happen for some cases (like the one in the LIT tests) with -O3 optimization; maybe I'm missing some option.
This code is in AArch64TargetLowering::shouldSinkOperands, right?
But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.
So should we keep this patch?
I also worry that if we try to use shouldSinkOperands, we will have to handle all LLVM IR patterns that may
form a VDUP.
I think Dave also argued that this patch makes a lot of sense. Thus, I think left to do is addressing the previous nits.
- Removed extra assert
- Removed arguments that can be queried from Root
- Removed assignments to RC and Opc
- Changed tests to ensure that basic blocks are not merged
- Added fp16 cases to arm64-fma-combines.ll
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp | ||
---|---|---|
4553 | I think we can remove remove the assert completely. The Match function checks the same condition already. | |
4797 | Thanks! We can get all three arguments from Root. | |
5788 | Done. | |
5819 | Thanks a lot! I didn't know about -mattr=+fullfp16. | |
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll | ||
164 | This an attempt to keep shuffle and mul in different basic blocks. If we return in the second basic block, the optimizer merges them into one before we reach instruction selection. |
Thanks, LGTM
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll | ||
---|---|---|
1 | I don't know if the cyclone CPU supports FP16, I guess not, but just for testing purposes having it here I think is fine. | |
164 | Ok, thanks. | |
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir | ||
216 | For consistency probably best to add an indexed_8h test here too.If you're confident doing this, you can just do this without another review and commit it, but otherwise happy to look again. Similarly, you will probably need -mattr=+fullfp16 on the RUN line. |
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir | ||
---|---|---|
216 | Thanks a lot! I added indexed_4h and 8h cases to the test. |
There were two issues with the patch, so I reverted it:
- The register operand of VDUP was used for VMUL_indexed. This does not work correctly if the register has a killed state.
- VMUL_indexed requires the second operand to have FPR128_lo register class instead of FPR128.
I've fixed this by adding a COPY instruction before the VDUP, but I'm not sure if it is legal to do so in genAlternativeCodeSequence function. If I understand correctly, genAlternativeCodeSequence is supposed to add new instructions to the InsInstrs, not modify the MIR function directly. On the other hand, I assume that the COPY will be removed if Machine InstCombiner decides the discard the result.
Used MRI.constrainRegClass to fix the issue with FPR128 vs FPR128_lo register class for i16 variants.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp | ||
---|---|---|
4801 | This needs some brackets around the || I think. | |
4810 | What does machine combiner usually do about killed registers? What happens if there is, for example: %1 = dup %0 somethingelse killed %0 fmul %1 | |
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir | ||
2 | It's probably worth adding verify-machineinstrs to tests like this. There are expensive bots that will check anyway, but it's good to be deliberate on tests like this. I might also use the update_mir_test_checks script, but that's up to you depending on how useful the check lines it adds are. |
- Added -verify-machineinstrs.
- Used MRI.clearKillFlags() to extend lifetime of DUP operand.
- Added update_mir_test_checks.py checks to the MIR LIT test.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp | ||
---|---|---|
4801 | Thank you! Done. | |
4810 | Usually the new instruction is replacing the old one, so the kill state is just copied. Added this case to machine-combiner-fmul-dup.mir. | |
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir | ||
2 | They can be useful to see details like kill states. |
Nit: this assert can be hoisted out of this switch?