This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add Machine InstCombiner patterns for FMUL indexed variant
ClosedPublic

Authored by asavonic on Mar 31 2021, 7:35 AM.

Details

Summary

This patch adds DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.

Diff Detail

Event Timeline

asavonic created this revision.Mar 31 2021, 7:35 AM
asavonic requested review of this revision.Mar 31 2021, 7:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2021, 7:35 AM
fhahn added a subscriber: fhahn.Mar 31 2021, 8:59 AM
fhahn added inline comments.
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
141

can you add a MachineIR test case for those transforms? The tests probably should also go into one of the machine-combiner* files or a new one. For more details, please see https://llvm.org/docs/MIRLangRef.html#mir-testing-guide

asavonic updated this revision to Diff 334502.Mar 31 2021, 11:34 AM

Added test/CodeGen/AArch64/machine-combiner-fmul-dup.mir

asavonic added inline comments.Mar 31 2021, 11:38 AM
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
141

Thanks Florian! I added a new MIR test.

SjoerdMeijer added inline comments.Mar 31 2021, 11:45 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
6224

Should this be setting MUL = genIndexedMultiply(..)?

6227–6228

Was wondering because of the added if here.
Unrelated but that FIXME looks a bit dodgy. Any idea what that could be while we are at it?

asavonic added inline comments.Mar 31 2021, 1:22 PM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
6224

This doesn't work because of the two lines below:

DelInstrs.push_back(MUL);
DelInstrs.push_back(&Root);

In case of a DUP+FMUL pattern, Root is the FMUL instruction we need to delete. If we also assign it to the MUL variable, it will be deleted twice.

6227–6228

This looks like a bug.
In t32_6_3 test case, processLogicalImmediate function fails, so MUL is not updated. We also don't generate any replacement, and the Root gets deleted.

I think we should at least return from the function without updating InsInstrs and DelInstrs if processLogicalImmediate function fails.

(gdb) p Root.dump()
  %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv

(gdb) p Root.Parent->dump()
bb.0 (%ir-block.0):
  liveins: $w0
  %0:gpr32 = COPY $w0
  %1:gpr32 = MOVi32imm -1431655765
  %2:gpr32common = MADDWrrr %0:gpr32, killed %1:gpr32, $wzr
  %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv
  %4:gpr32 = exact EXTRWrri %3:gpr32, %3:gpr32, 1
  %5:gpr32 = MOVi32imm 715827883
  %6:gpr32 = SUBSWrr killed %4:gpr32, killed %5:gpr32, implicit-def $nzcv
  %7:gpr32 = CSINCWr $wzr, $wzr, 2, implicit $nzcv
  $w0 = COPY %7:gpr32
  RET_ReallyLR implicit $w0

(gdb) p Pattern
$3 = llvm::MachineCombinerPattern::MULSUBWI_OP1

I've uploaded a separate patch for the FIXME issue: https://reviews.llvm.org/D100047
Let me know if anything should be fixed or improved for this one.

Sorry for the delay, mostly nits inlined, one question about missing f16 tests.

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4945

Nit: this assert can be hoisted out of this switch?

5189

Nit: I think you can get the MF via a few getParent() calls on Root, so you don't have to pass it. From memory I can't remember if that is also true fro MRI and TII, but I don't have strong opinions on this.

6188

Nit: perhaps pass RC and Opc directly in here? Don't really need these assignments?

6219

Are we missing tests for these cases?
For this you'll probably need to add -mattr=+fullfp16 to a RUN line.

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
165

Nit: can we just do a ret %add here?

180

Same?

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users. But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

If VDUP is moved into the same BB as its user, then VMUL_indexed is selected without these changes in Machine InstCombiner.
However, this does not happen for some cases (like the one in the LIT tests) with -O3 optimization; maybe I'm missing some option.

I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users.

This code is in AArch64TargetLowering::shouldSinkOperands, right?

But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.

So should we keep this patch?
I also worry that if we try to use shouldSinkOperands, we will have to handle all LLVM IR patterns that may
form a VDUP.

I think Dave also argued that this patch makes a lot of sense. Thus, I think left to do is addressing the previous nits.

asavonic updated this revision to Diff 336446.Apr 9 2021, 7:21 AM
  • Removed extra assert
  • Removed arguments that can be queried from Root
  • Removed assignments to RC and Opc
  • Changed tests to ensure that basic blocks are not merged
  • Added fp16 cases to arm64-fma-combines.ll
asavonic added inline comments.Apr 9 2021, 7:26 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4945

I think we can remove remove the assert completely. The Match function checks the same condition already.

5189

Thanks! We can get all three arguments from Root.

6188

Done.

6219

Thanks a lot! I didn't know about -mattr=+fullfp16.

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
165

This an attempt to keep shuffle and mul in different basic blocks. If we return in the second basic block, the optimizer merges them into one before we reach instruction selection.
In the latest revision of the patch I changed the tests to ensure that this never happens.

SjoerdMeijer accepted this revision.Apr 9 2021, 10:53 AM

Thanks, LGTM

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
1–2

I don't know if the cyclone CPU supports FP16, I guess not, but just for testing purposes having it here I think is fine.

165

Ok, thanks.

llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
216

For consistency probably best to add an indexed_8h test here too.If you're confident doing this, you can just do this without another review and commit it, but otherwise happy to look again. Similarly, you will probably need -mattr=+fullfp16 on the RUN line.

This revision is now accepted and ready to land.Apr 9 2021, 10:53 AM
asavonic updated this revision to Diff 336562.Apr 9 2021, 2:17 PM

Added test cases for fp16 to the MIR test.

asavonic added inline comments.Apr 9 2021, 2:26 PM
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
216

Thanks a lot! I added indexed_4h and 8h cases to the test.

SjoerdMeijer accepted this revision.Apr 12 2021, 12:45 AM

Thanks, LGTM

asavonic updated this revision to Diff 336950.Apr 12 2021, 1:41 PM

There were two issues with the patch, so I reverted it:

  1. The register operand of VDUP was used for VMUL_indexed. This does not work correctly if the register has a killed state.
  1. VMUL_indexed requires the second operand to have FPR128_lo register class instead of FPR128.

I've fixed this by adding a COPY instruction before the VDUP, but I'm not sure if it is legal to do so in genAlternativeCodeSequence function. If I understand correctly, genAlternativeCodeSequence is supposed to add new instructions to the InsInstrs, not modify the MIR function directly. On the other hand, I assume that the COPY will be removed if Machine InstCombiner decides the discard the result.

asavonic reopened this revision.May 18 2021, 2:17 PM

@SjoerdMeijer, can you please check if the new revision of the patch is OK?

This revision is now accepted and ready to land.May 18 2021, 2:17 PM
asavonic requested review of this revision.May 18 2021, 2:19 PM
asavonic updated this revision to Diff 381935.Oct 25 2021, 4:33 AM

Used MRI.constrainRegClass to fix the issue with FPR128 vs FPR128_lo register class for i16 variants.

Hi, I've reworked the patch, can you please check if it is acceptable now?

dmgreen added inline comments.Nov 4 2021, 6:12 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
5193

This needs some brackets around the || I think.

5202

What does machine combiner usually do about killed registers? What happens if there is, for example:

%1 = dup %0
somethingelse killed %0
fmul %1
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
2

It's probably worth adding verify-machineinstrs to tests like this. There are expensive bots that will check anyway, but it's good to be deliberate on tests like this. I might also use the update_mir_test_checks script, but that's up to you depending on how useful the check lines it adds are.

asavonic updated this revision to Diff 385455.Nov 8 2021, 4:36 AM
  • Added -verify-machineinstrs.
  • Used MRI.clearKillFlags() to extend lifetime of DUP operand.
  • Added update_mir_test_checks.py checks to the MIR LIT test.
asavonic added inline comments.Nov 8 2021, 5:10 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
5193

Thank you! Done.

5202

Usually the new instruction is replacing the old one, so the kill state is just copied.
Here we need to extend lifetime of the register, and your example confirms that setIsKill(0) is not enough. We can use MRI.clearKillFlags instead to remove the kill state from all uses of the register.

Added this case to machine-combiner-fmul-dup.mir.

llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
2

They can be useful to see details like kill states.
Added them to the test.

dmgreen accepted this revision.Nov 9 2021, 2:22 AM

Thanks. LGTM. Lets give this another try.

This revision is now accepted and ready to land.Nov 9 2021, 2:22 AM