Page MenuHomePhabricator

[AArch64] Add Machine InstCombiner patterns for FMUL indexed variant
ClosedPublic

Authored by asavonic on Mar 31 2021, 7:35 AM.

Details

Summary

This patch adds DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.

Diff Detail

Event Timeline

asavonic created this revision.Mar 31 2021, 7:35 AM
asavonic requested review of this revision.Mar 31 2021, 7:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2021, 7:35 AM
fhahn added a subscriber: fhahn.Mar 31 2021, 8:59 AM
fhahn added inline comments.
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
140

can you add a MachineIR test case for those transforms? The tests probably should also go into one of the machine-combiner* files or a new one. For more details, please see https://llvm.org/docs/MIRLangRef.html#mir-testing-guide

asavonic updated this revision to Diff 334502.Mar 31 2021, 11:34 AM

Added test/CodeGen/AArch64/machine-combiner-fmul-dup.mir

asavonic added inline comments.Mar 31 2021, 11:38 AM
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
140

Thanks Florian! I added a new MIR test.

SjoerdMeijer added inline comments.Mar 31 2021, 11:45 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
5835

Should this be setting MUL = genIndexedMultiply(..)?

5838–5839

Was wondering because of the added if here.
Unrelated but that FIXME looks a bit dodgy. Any idea what that could be while we are at it?

asavonic added inline comments.Mar 31 2021, 1:22 PM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
5835

This doesn't work because of the two lines below:

DelInstrs.push_back(MUL);
DelInstrs.push_back(&Root);

In case of a DUP+FMUL pattern, Root is the FMUL instruction we need to delete. If we also assign it to the MUL variable, it will be deleted twice.

5838–5839

This looks like a bug.
In t32_6_3 test case, processLogicalImmediate function fails, so MUL is not updated. We also don't generate any replacement, and the Root gets deleted.

I think we should at least return from the function without updating InsInstrs and DelInstrs if processLogicalImmediate function fails.

(gdb) p Root.dump()
  %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv

(gdb) p Root.Parent->dump()
bb.0 (%ir-block.0):
  liveins: $w0
  %0:gpr32 = COPY $w0
  %1:gpr32 = MOVi32imm -1431655765
  %2:gpr32common = MADDWrrr %0:gpr32, killed %1:gpr32, $wzr
  %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv
  %4:gpr32 = exact EXTRWrri %3:gpr32, %3:gpr32, 1
  %5:gpr32 = MOVi32imm 715827883
  %6:gpr32 = SUBSWrr killed %4:gpr32, killed %5:gpr32, implicit-def $nzcv
  %7:gpr32 = CSINCWr $wzr, $wzr, 2, implicit $nzcv
  $w0 = COPY %7:gpr32
  RET_ReallyLR implicit $w0

(gdb) p Pattern
$3 = llvm::MachineCombinerPattern::MULSUBWI_OP1

I've uploaded a separate patch for the FIXME issue: https://reviews.llvm.org/D100047
Let me know if anything should be fixed or improved for this one.

Sorry for the delay, mostly nits inlined, one question about missing f16 tests.

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4553

Nit: this assert can be hoisted out of this switch?

4797

Nit: I think you can get the MF via a few getParent() calls on Root, so you don't have to pass it. From memory I can't remember if that is also true fro MRI and TII, but I don't have strong opinions on this.

5799

Nit: perhaps pass RC and Opc directly in here? Don't really need these assignments?

5830

Are we missing tests for these cases?
For this you'll probably need to add -mattr=+fullfp16 to a RUN line.

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
164

Nit: can we just do a ret %add here?

179

Same?

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users. But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

If VDUP is moved into the same BB as its user, then VMUL_indexed is selected without these changes in Machine InstCombiner.
However, this does not happen for some cases (like the one in the LIT tests) with -O3 optimization; maybe I'm missing some option.

I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users.

This code is in AArch64TargetLowering::shouldSinkOperands, right?

But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.

So should we keep this patch?
I also worry that if we try to use shouldSinkOperands, we will have to handle all LLVM IR patterns that may
form a VDUP.

I think Dave also argued that this patch makes a lot of sense. Thus, I think left to do is addressing the previous nits.

asavonic updated this revision to Diff 336446.Apr 9 2021, 7:21 AM
  • Removed extra assert
  • Removed arguments that can be queried from Root
  • Removed assignments to RC and Opc
  • Changed tests to ensure that basic blocks are not merged
  • Added fp16 cases to arm64-fma-combines.ll
asavonic added inline comments.Apr 9 2021, 7:26 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4553

I think we can remove remove the assert completely. The Match function checks the same condition already.

4797

Thanks! We can get all three arguments from Root.

5799

Done.

5830

Thanks a lot! I didn't know about -mattr=+fullfp16.

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
164

This an attempt to keep shuffle and mul in different basic blocks. If we return in the second basic block, the optimizer merges them into one before we reach instruction selection.
In the latest revision of the patch I changed the tests to ensure that this never happens.

SjoerdMeijer accepted this revision.Apr 9 2021, 10:53 AM

Thanks, LGTM

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
1

I don't know if the cyclone CPU supports FP16, I guess not, but just for testing purposes having it here I think is fine.

164

Ok, thanks.

llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
216

For consistency probably best to add an indexed_8h test here too.If you're confident doing this, you can just do this without another review and commit it, but otherwise happy to look again. Similarly, you will probably need -mattr=+fullfp16 on the RUN line.

This revision is now accepted and ready to land.Apr 9 2021, 10:53 AM
asavonic updated this revision to Diff 336562.Apr 9 2021, 2:17 PM

Added test cases for fp16 to the MIR test.

asavonic added inline comments.Apr 9 2021, 2:26 PM
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
216

Thanks a lot! I added indexed_4h and 8h cases to the test.

SjoerdMeijer accepted this revision.Apr 12 2021, 12:45 AM

Thanks, LGTM

asavonic updated this revision to Diff 336950.Apr 12 2021, 1:41 PM

There were two issues with the patch, so I reverted it:

  1. The register operand of VDUP was used for VMUL_indexed. This does not work correctly if the register has a killed state.
  1. VMUL_indexed requires the second operand to have FPR128_lo register class instead of FPR128.

I've fixed this by adding a COPY instruction before the VDUP, but I'm not sure if it is legal to do so in genAlternativeCodeSequence function. If I understand correctly, genAlternativeCodeSequence is supposed to add new instructions to the InsInstrs, not modify the MIR function directly. On the other hand, I assume that the COPY will be removed if Machine InstCombiner decides the discard the result.