This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
include/llvm/CodeGen/
-
llvm/
-
CodeGen/
-
MachineCombinerPattern.h
-
lib/Target/AArch64/
-
Target/
-
AArch64/
8/16
AArch64InstrInfo.cpp
-
test/CodeGen/AArch64/
-
CodeGen/
-
AArch64/
2/7
arm64-fma-combines.ll
2/4
machine-combiner-fmul-dup.mir

Differential D99662

[AArch64] Add Machine InstCombiner patterns for FMUL indexed variant
ClosedPublic

Authored by asavonic on Mar 31 2021, 7:35 AM.

Download Raw Diff

Details

Reviewers

Gerolf
SjoerdMeijer
sebpop
dmgreen

Commits

rGb702276ad0d6: [AArch64] Add Machine InstCombiner patterns for FMUL indexed variant
rGcca9b5985c0c: [AArch64] Add Machine InstCombiner patterns for FMUL indexed variant

Summary

This patch adds DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

asavonic created this revision.Mar 31 2021, 7:35 AM

Herald added subscribers: danielkiss, arphaman, hiraditya, kristof.beyls. · View Herald TranscriptMar 31 2021, 7:35 AM

asavonic requested review of this revision.Mar 31 2021, 7:35 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2021, 7:35 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B96528: Diff 334437.Mar 31 2021, 8:35 AM

fhahn added a subscriber: fhahn.Mar 31 2021, 8:59 AM

fhahn added inline comments.

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
140	can you add a MachineIR test case for those transforms? The tests probably should also go into one of the machine-combiner* files or a new one. For more details, please see https://llvm.org/docs/MIRLangRef.html#mir-testing-guide

Added test/CodeGen/AArch64/machine-combiner-fmul-dup.mir

asavonic added inline comments.Mar 31 2021, 11:38 AM

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
140	Thanks Florian! I added a new MIR test.

SjoerdMeijer added inline comments.Mar 31 2021, 11:45 AM

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
5835	Should this be setting `MUL = genIndexedMultiply(..)`?
5838–5839	Was wondering because of the added if here. Unrelated but that FIXME looks a bit dodgy. Any idea what that could be while we are at it?

Harbormaster completed remote builds in B96569: Diff 334502.Mar 31 2021, 12:18 PM

asavonic added inline comments.Mar 31 2021, 1:22 PM

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

5835

This doesn't work because of the two lines below:

DelInstrs.push_back(MUL);
DelInstrs.push_back(&Root);

In case of a DUP+FMUL pattern, Root is the FMUL instruction we need to delete. If we also assign it to the MUL variable, it will be deleted twice.

5838–5839

This looks like a bug.
In t32_6_3 test case, processLogicalImmediate function fails, so MUL is not updated. We also don't generate any replacement, and the Root gets deleted.

I think we should at least return from the function without updating InsInstrs and DelInstrs if processLogicalImmediate function fails.

(gdb) p Root.dump()
  %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv

(gdb) p Root.Parent->dump()
bb.0 (%ir-block.0):
  liveins: $w0
  %0:gpr32 = COPY $w0
  %1:gpr32 = MOVi32imm -1431655765
  %2:gpr32common = MADDWrrr %0:gpr32, killed %1:gpr32, $wzr
  %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv
  %4:gpr32 = exact EXTRWrri %3:gpr32, %3:gpr32, 1
  %5:gpr32 = MOVi32imm 715827883
  %6:gpr32 = SUBSWrr killed %4:gpr32, killed %5:gpr32, implicit-def $nzcv
  %7:gpr32 = CSINCWr $wzr, $wzr, 2, implicit $nzcv
  $w0 = COPY %7:gpr32
  RET_ReallyLR implicit $w0

(gdb) p Pattern
$3 = llvm::MachineCombinerPattern::MULSUBWI_OP1

asavonic mentioned this in D100047: [NFC][AArch64] Handle processLogicalImmediate error.Apr 7 2021, 9:49 AM

I've uploaded a separate patch for the FIXME issue: https://reviews.llvm.org/D100047
Let me know if anything should be fixed or improved for this one.

Sorry for the delay, mostly nits inlined, one question about missing f16 tests.

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4553	Nit: this assert can be hoisted out of this switch?
4797	Nit: I think you can get the MF via a few `getParent()` calls on `Root`, so you don't have to pass it. From memory I can't remember if that is also true fro MRI and TII, but I don't have strong opinions on this.
5799	Nit: perhaps pass `RC` and `Opc` directly in here? Don't really need these assignments?
5830	Are we missing tests for these cases? For this you'll probably need to add `-mattr=+fullfp16` to a RUN line.
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
164	Nit: can we just do a `ret %add` here?
179	Same?

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users. But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.

In D99662#2674645, @SjoerdMeijer wrote:

But the other thing I was just wondering, not that I mind these patterns here, but are we not expecting that the VDUP is sunk to its user? I think that's probably what I would expect, but don't know if that is a fair expectation.

If VDUP is moved into the same BB as its user, then VMUL_indexed is selected without these changes in Machine InstCombiner.
However, this does not happen for some cases (like the one in the LIT tests) with -O3 optimization; maybe I'm missing some option.

In D99662#2674812, @dmgreen wrote:

I was going to put a comment on saying something similar, that this can be done in both ways. There is code in CGP that can sink dup's to users.

This code is in AArch64TargetLowering::shouldSinkOperands, right?

But I think this makes a lot of sense though, due to the other benefits of machine combiner. Like the fact that it's shared across both ISels and can take things like critical path lengths into account.

So should we keep this patch?
I also worry that if we try to use shouldSinkOperands, we will have to handle all LLVM IR patterns that may
form a VDUP.

I think Dave also argued that this patch makes a lot of sense. Thus, I think left to do is addressing the previous nits.

Removed extra assert
Removed arguments that can be queried from Root
Removed assignments to RC and Opc
Changed tests to ensure that basic blocks are not merged
Added fp16 cases to arm64-fma-combines.ll

asavonic added inline comments.Apr 9 2021, 7:26 AM

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4553	I think we can remove remove the assert completely. The `Match` function checks the same condition already.
4797	Thanks! We can get all three arguments from `Root`.
5799	Done.
5830	Thanks a lot! I didn't know about `-mattr=+fullfp16`.
llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
164	This an attempt to keep `shuffle` and `mul` in different basic blocks. If we return in the second basic block, the optimizer merges them into one before we reach instruction selection. In the latest revision of the patch I changed the tests to ensure that this never happens.

Harbormaster completed remote builds in B97980: Diff 336446.Apr 9 2021, 8:07 AM

Thanks, LGTM

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll
1	I don't know if the cyclone CPU supports FP16, I guess not, but just for testing purposes having it here I think is fine.
164	Ok, thanks.
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
216	For consistency probably best to add an indexed_8h test here too.If you're confident doing this, you can just do this without another review and commit it, but otherwise happy to look again. Similarly, you will probably need -mattr=+fullfp16 on the RUN line.

This revision is now accepted and ready to land.Apr 9 2021, 10:53 AM

Added test cases for fp16 to the MIR test.

asavonic added inline comments.Apr 9 2021, 2:26 PM

llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
216	Thanks a lot! I added indexed_4h and 8h cases to the test.

Harbormaster completed remote builds in B98069: Diff 336562.Apr 9 2021, 2:55 PM

Thanks, LGTM

Closed by commit rGcca9b5985c0c: [AArch64] Add Machine InstCombiner patterns for FMUL indexed variant (authored by asavonic). · Explain WhyApr 12 2021, 6:15 AM

This revision was automatically updated to reflect the committed changes.

asavonic added a commit: rGcca9b5985c0c: [AArch64] Add Machine InstCombiner patterns for FMUL indexed variant.

asavonic added a reverting change: rGf037b07b5c2e: Revert "[AArch64] Add Machine InstCombiner patterns for FMUL indexed variant".Apr 12 2021, 6:30 AM

There were two issues with the patch, so I reverted it:

The register operand of VDUP was used for VMUL_indexed. This does not work correctly if the register has a killed state.

VMUL_indexed requires the second operand to have FPR128_lo register class instead of FPR128.

I've fixed this by adding a COPY instruction before the VDUP, but I'm not sure if it is legal to do so in genAlternativeCodeSequence function. If I understand correctly, genAlternativeCodeSequence is supposed to add new instructions to the InsInstrs, not modify the MIR function directly. On the other hand, I assume that the COPY will be removed if Machine InstCombiner decides the discard the result.

Harbormaster completed remote builds in B98350: Diff 336950.Apr 12 2021, 3:21 PM

@SjoerdMeijer, can you please check if the new revision of the patch is OK?

This revision is now accepted and ready to land.May 18 2021, 2:17 PM

asavonic requested review of this revision.May 18 2021, 2:19 PM

ping

Used MRI.constrainRegClass to fix the issue with FPR128 vs FPR128_lo register class for i16 variants.

Harbormaster completed remote builds in B130428: Diff 381935.Oct 25 2021, 5:36 AM

asavonic added a reviewer: dmgreen.Nov 3 2021, 12:46 PM

Hi, I've reworked the patch, can you please check if it is acceptable now?

dmgreen added inline comments.Nov 4 2021, 6:12 AM

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4801	This needs some brackets around the \|\| I think.
4810	What does machine combiner usually do about killed registers? What happens if there is, for example: %1 = dup %0 somethingelse killed %0 fmul %1
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
2	It's probably worth adding verify-machineinstrs to tests like this. There are expensive bots that will check anyway, but it's good to be deliberate on tests like this. I might also use the update_mir_test_checks script, but that's up to you depending on how useful the check lines it adds are.

Added -verify-machineinstrs.
Used MRI.clearKillFlags() to extend lifetime of DUP operand.
Added update_mir_test_checks.py checks to the MIR LIT test.

asavonic added inline comments.Nov 8 2021, 5:10 AM

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4801	Thank you! Done.
4810	Usually the new instruction is replacing the old one, so the kill state is just copied. Here we need to extend lifetime of the register, and your example confirms that `setIsKill(0)` is not enough. We can use `MRI.clearKillFlags` instead to remove the kill state from all uses of the register. Added this case to machine-combiner-fmul-dup.mir.
llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir
2	They can be useful to see details like kill states. Added them to the test.

Harbormaster completed remote builds in B132988: Diff 385455.Nov 8 2021, 5:35 AM

Thanks. LGTM. Lets give this another try.

This revision is now accepted and ready to land.Nov 9 2021, 2:22 AM

Closed by commit rGb702276ad0d6: [AArch64] Add Machine InstCombiner patterns for FMUL indexed variant (authored by asavonic). · Explain WhyNov 9 2021, 4:31 AM

This revision was automatically updated to reflect the committed changes.

asavonic added a commit: rGb702276ad0d6: [AArch64] Add Machine InstCombiner patterns for FMUL indexed variant.

asavonic mentioned this in rGe201232ececb: [NFC][AArch64] Handle processLogicalImmediate error.Nov 10 2021, 5:57 AM

dmgreen mentioned this in D126632: [AArch64] Look through copy in MachineCombiner FMUL patterns..May 30 2022, 12:20 AM

dmgreen mentioned this in rG5cb14dc5a3a6: [AArch64] Look through copy in MachineCombiner FMUL patterns..May 31 2022, 1:28 AM

Revision Contents

Path

Size

llvm/

include/

llvm/

CodeGen/

MachineCombinerPattern.h

13 lines

lib/

Target/

AArch64/

AArch64InstrInfo.cpp

143 lines

test/

CodeGen/

AArch64/

arm64-fma-combines.ll

127 lines

machine-combiner-fmul-dup.mir

384 lines

Diff 336950

llvm/include/llvm/CodeGen/MachineCombinerPattern.h

Show First 20 Lines • Show All 147 Lines • ▼ Show 20 Lines	enum class MachineCombinerPattern {
FMLSv8i16_indexed_OP2,		FMLSv8i16_indexed_OP2,
FMLSv2i32_indexed_OP1,		FMLSv2i32_indexed_OP1,
FMLSv2i32_indexed_OP2,		FMLSv2i32_indexed_OP2,
FMLSv2i64_indexed_OP1,		FMLSv2i64_indexed_OP1,
FMLSv2i64_indexed_OP2,		FMLSv2i64_indexed_OP2,
FMLSv4f32_OP1,		FMLSv4f32_OP1,
FMLSv4f32_OP2,		FMLSv4f32_OP2,
FMLSv4i32_indexed_OP1,		FMLSv4i32_indexed_OP1,
FMLSv4i32_indexed_OP2		FMLSv4i32_indexed_OP2,

		FMULv2i32_indexed_OP1,
		FMULv2i32_indexed_OP2,
		FMULv2i64_indexed_OP1,
		FMULv2i64_indexed_OP2,
		FMULv4i16_indexed_OP1,
		FMULv4i16_indexed_OP2,
		FMULv4i32_indexed_OP1,
		FMULv4i32_indexed_OP2,
		FMULv8i16_indexed_OP1,
		FMULv8i16_indexed_OP2,
};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 4,519 Lines • ▼ Show 20 Lines	case AArch64::FSUBv4f32:

Found \|= Match(AArch64::FMULv4i32_indexed, 1, MCP::FMLSv4i32_indexed_OP1) \|\|		Found \|= Match(AArch64::FMULv4i32_indexed, 1, MCP::FMLSv4i32_indexed_OP1) \|\|
Match(AArch64::FMULv4f32, 1, MCP::FMLSv4f32_OP1);		Match(AArch64::FMULv4f32, 1, MCP::FMLSv4f32_OP1);
break;		break;
}		}
return Found;		return Found;
}		}

		static bool getFMULPatterns(MachineInstr &Root,
		SmallVectorImpl<MachineCombinerPattern> &Patterns) {
		MachineBasicBlock &MBB = *Root.getParent();
		bool Found = false;

		auto Match = [&](unsigned Opcode, int Operand,
		MachineCombinerPattern Pattern) -> bool {
		MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
		MachineOperand &MO = Root.getOperand(Operand);
		MachineInstr *MI = nullptr;
		if (MO.isReg() && Register::isVirtualRegister(MO.getReg()))
		MI = MRI.getUniqueVRegDef(MO.getReg());
		if (MI && MI->getOpcode() == Opcode) {
		Patterns.push_back(Pattern);
		return true;
		}
		return false;
		};

		typedef MachineCombinerPattern MCP;

		switch (Root.getOpcode()) {
		default:
		return false;
		case AArch64::FMULv2f32:
		Found = Match(AArch64::DUPv2i32lane, 1, MCP::FMULv2i32_indexed_OP1);
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Nit: this assert can be hoisted out of this switch? SjoerdMeijer: Nit: this assert can be hoisted out of this switch?
		asavonicAuthorUnsubmitted Done Reply Inline Actions I think we can remove remove the assert completely. The `Match` function checks the same condition already. asavonic: I think we can remove remove the assert completely. The `Match` function checks the same…
		Found \|= Match(AArch64::DUPv2i32lane, 2, MCP::FMULv2i32_indexed_OP2);
		break;
		case AArch64::FMULv2f64:
		Found = Match(AArch64::DUPv2i64lane, 1, MCP::FMULv2i64_indexed_OP1);
		Found \|= Match(AArch64::DUPv2i64lane, 2, MCP::FMULv2i64_indexed_OP2);
		break;
		case AArch64::FMULv4f16:
		Found = Match(AArch64::DUPv4i16lane, 1, MCP::FMULv4i16_indexed_OP1);
		Found \|= Match(AArch64::DUPv4i16lane, 2, MCP::FMULv4i16_indexed_OP2);
		break;
		case AArch64::FMULv4f32:
		Found = Match(AArch64::DUPv4i32lane, 1, MCP::FMULv4i32_indexed_OP1);
		Found \|= Match(AArch64::DUPv4i32lane, 2, MCP::FMULv4i32_indexed_OP2);
		break;
		case AArch64::FMULv8f16:
		Found = Match(AArch64::DUPv8i16lane, 1, MCP::FMULv8i16_indexed_OP1);
		Found \|= Match(AArch64::DUPv8i16lane, 2, MCP::FMULv8i16_indexed_OP2);
		break;
		}

		return Found;
		}

/// Return true when a code sequence can improve throughput. It		/// Return true when a code sequence can improve throughput. It
/// should be called only for instructions in loops.		/// should be called only for instructions in loops.
/// \param Pattern - combiner pattern		/// \param Pattern - combiner pattern
bool AArch64InstrInfo::isThroughputPattern(		bool AArch64InstrInfo::isThroughputPattern(
MachineCombinerPattern Pattern) const {		MachineCombinerPattern Pattern) const {
switch (Pattern) {		switch (Pattern) {
default:		default:
break;		break;
▲ Show 20 Lines • Show All 47 Lines • ▼ Show 20 Lines	bool AArch64InstrInfo::isThroughputPattern(
case MachineCombinerPattern::FMLSv4f16_OP1:		case MachineCombinerPattern::FMLSv4f16_OP1:
case MachineCombinerPattern::FMLSv4f16_OP2:		case MachineCombinerPattern::FMLSv4f16_OP2:
case MachineCombinerPattern::FMLSv8f16_OP1:		case MachineCombinerPattern::FMLSv8f16_OP1:
case MachineCombinerPattern::FMLSv8f16_OP2:		case MachineCombinerPattern::FMLSv8f16_OP2:
case MachineCombinerPattern::FMLSv2f32_OP2:		case MachineCombinerPattern::FMLSv2f32_OP2:
case MachineCombinerPattern::FMLSv2f64_OP2:		case MachineCombinerPattern::FMLSv2f64_OP2:
case MachineCombinerPattern::FMLSv4i32_indexed_OP2:		case MachineCombinerPattern::FMLSv4i32_indexed_OP2:
case MachineCombinerPattern::FMLSv4f32_OP2:		case MachineCombinerPattern::FMLSv4f32_OP2:
		case MachineCombinerPattern::FMULv2i32_indexed_OP1:
		case MachineCombinerPattern::FMULv2i32_indexed_OP2:
		case MachineCombinerPattern::FMULv2i64_indexed_OP1:
		case MachineCombinerPattern::FMULv2i64_indexed_OP2:
		case MachineCombinerPattern::FMULv4i16_indexed_OP1:
		case MachineCombinerPattern::FMULv4i16_indexed_OP2:
		case MachineCombinerPattern::FMULv4i32_indexed_OP1:
		case MachineCombinerPattern::FMULv4i32_indexed_OP2:
		case MachineCombinerPattern::FMULv8i16_indexed_OP1:
		case MachineCombinerPattern::FMULv8i16_indexed_OP2:
case MachineCombinerPattern::MULADDv8i8_OP1:		case MachineCombinerPattern::MULADDv8i8_OP1:
case MachineCombinerPattern::MULADDv8i8_OP2:		case MachineCombinerPattern::MULADDv8i8_OP2:
case MachineCombinerPattern::MULADDv16i8_OP1:		case MachineCombinerPattern::MULADDv16i8_OP1:
case MachineCombinerPattern::MULADDv16i8_OP2:		case MachineCombinerPattern::MULADDv16i8_OP2:
case MachineCombinerPattern::MULADDv4i16_OP1:		case MachineCombinerPattern::MULADDv4i16_OP1:
case MachineCombinerPattern::MULADDv4i16_OP2:		case MachineCombinerPattern::MULADDv4i16_OP2:
case MachineCombinerPattern::MULADDv8i16_OP1:		case MachineCombinerPattern::MULADDv8i16_OP1:
case MachineCombinerPattern::MULADDv8i16_OP2:		case MachineCombinerPattern::MULADDv8i16_OP2:
Show All 40 Lines

bool AArch64InstrInfo::getMachineCombinerPatterns(		bool AArch64InstrInfo::getMachineCombinerPatterns(
MachineInstr &Root, SmallVectorImpl<MachineCombinerPattern> &Patterns,		MachineInstr &Root, SmallVectorImpl<MachineCombinerPattern> &Patterns,
bool DoRegPressureReduce) const {		bool DoRegPressureReduce) const {
// Integer patterns		// Integer patterns
if (getMaddPatterns(Root, Patterns))		if (getMaddPatterns(Root, Patterns))
return true;		return true;
// Floating point patterns		// Floating point patterns
		if (getFMULPatterns(Root, Patterns))
		return true;
if (getFMAPatterns(Root, Patterns))		if (getFMAPatterns(Root, Patterns))
return true;		return true;

return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,		return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns,
DoRegPressureReduce);		DoRegPressureReduce);
}		}

enum class FMAInstKind { Default, Indexed, Accumulator };		enum class FMAInstKind { Default, Indexed, Accumulator };
▲ Show 20 Lines • Show All 72 Lines • ▼ Show 20 Lines	MIB = BuildMI(MF, Root.getDebugLoc(), TII->get(MaddOpc), ResultReg)
.addReg(SrcReg1, getKillRegState(Src1IsKill));		.addReg(SrcReg1, getKillRegState(Src1IsKill));
else		else
assert(false && "Invalid FMA instruction kind \n");		assert(false && "Invalid FMA instruction kind \n");
// Insert the MADD (MADD, FMA, FMS, FMLA, FMSL)		// Insert the MADD (MADD, FMA, FMS, FMLA, FMSL)
InsInstrs.push_back(MIB);		InsInstrs.push_back(MIB);
return MUL;		return MUL;
}		}

		static MachineInstr *genIndexedMultiply(
		MachineInstr &Root, SmallVectorImpl<MachineInstr *> &InsInstrs,
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Nit: I think you can get the MF via a few `getParent()` calls on `Root`, so you don't have to pass it. From memory I can't remember if that is also true fro MRI and TII, but I don't have strong opinions on this. SjoerdMeijer: Nit: I think you can get the MF via a few `getParent()` calls on `Root`, so you don't have to…
		asavonicAuthorUnsubmitted Done Reply Inline Actions Thanks! We can get all three arguments from `Root`. asavonic: Thanks! We can get all three arguments from `Root`.
		unsigned IdxDupOp, unsigned MulOpc, const TargetRegisterClass *RC) {
		assert(IdxDupOp == 1 \|\| IdxDupOp == 2);

		MachineFunction &MF = *Root.getMF();
		dmgreenUnsubmitted Not Done Reply Inline Actions This needs some brackets around the \|\| I think. dmgreen: This needs some brackets around the \|\| I think.
		asavonicAuthorUnsubmitted Done Reply Inline Actions Thank you! Done. asavonic: Thank you! Done.
		const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
		MachineRegisterInfo &MRI = MF.getRegInfo();

		MachineInstr *Dup =
		MF.getRegInfo().getUniqueVRegDef(Root.getOperand(IdxDupOp).getReg());
		Register DupSrcReg = Dup->getOperand(1).getReg();
		unsigned DupSrcLane = Dup->getOperand(2).getImm();

		unsigned IdxMulOp = IdxDupOp == 1 ? 2 : 1;
		dmgreenUnsubmitted Not Done Reply Inline Actions What does machine combiner usually do about killed registers? What happens if there is, for example: %1 = dup %0 somethingelse killed %0 fmul %1 dmgreen: What does machine combiner usually do about killed registers? What happens if there is, for…
		asavonicAuthorUnsubmitted Done Reply Inline Actions Usually the new instruction is replacing the old one, so the kill state is just copied. Here we need to extend lifetime of the register, and your example confirms that `setIsKill(0)` is not enough. We can use `MRI.clearKillFlags` instead to remove the kill state from all uses of the register. Added this case to machine-combiner-fmul-dup.mir. asavonic: Usually the new instruction is replacing the old one, so the kill state is just copied. Here we…
		MachineOperand &MulOp1 = Root.getOperand(IdxMulOp);

		unsigned MulOp2Lane = DupSrcLane;
		Register MulOp2Reg = DupSrcReg;
		if (MRI.getRegClass(DupSrcReg) == &AArch64::FPR128RegClass) {
		// FMUL_indexed requires FPR128_lo scalar operand
		MulOp2Reg = MRI.createVirtualRegister(&AArch64::FPR128_loRegClass);
		BuildMI(*Dup->getParent(), MachineBasicBlock::iterator(Dup), DebugLoc(),
		TII->get(TargetOpcode::COPY), MulOp2Reg)
		.addReg(DupSrcReg);
		}

		Register ResultReg = Root.getOperand(0).getReg();

		MachineInstrBuilder MIB;
		MIB = BuildMI(MF, Root.getDebugLoc(), TII->get(MulOpc), ResultReg)
		.add(MulOp1)
		.addReg(MulOp2Reg)
		.addImm(MulOp2Lane);

		InsInstrs.push_back(MIB);
		return &Root;
		}

/// genFusedMultiplyAcc - Helper to generate fused multiply accumulate		/// genFusedMultiplyAcc - Helper to generate fused multiply accumulate
/// instructions.		/// instructions.
///		///
/// \see genFusedMultiply		/// \see genFusedMultiply
static MachineInstr *genFusedMultiplyAcc(		static MachineInstr *genFusedMultiplyAcc(
MachineFunction &MF, MachineRegisterInfo &MRI, const TargetInstrInfo *TII,		MachineFunction &MF, MachineRegisterInfo &MRI, const TargetInstrInfo *TII,
MachineInstr &Root, SmallVectorImpl<MachineInstr *> &InsInstrs,		MachineInstr &Root, SmallVectorImpl<MachineInstr *> &InsInstrs,
unsigned IdxMulOpd, unsigned MaddOpc, const TargetRegisterClass *RC) {		unsigned IdxMulOpd, unsigned MaddOpc, const TargetRegisterClass *RC) {
▲ Show 20 Lines • Show All 942 Lines • ▼ Show 20 Lines	if (Pattern == MachineCombinerPattern::FMLSv2i64_indexed_OP1) {
FMAInstKind::Indexed, &NewVR);		FMAInstKind::Indexed, &NewVR);
} else {		} else {
Opc = AArch64::FMLAv2f64;		Opc = AArch64::FMLAv2f64;
MUL = genFusedMultiply(MF, MRI, TII, Root, InsInstrs, 1, Opc, RC,		MUL = genFusedMultiply(MF, MRI, TII, Root, InsInstrs, 1, Opc, RC,
FMAInstKind::Accumulator, &NewVR);		FMAInstKind::Accumulator, &NewVR);
}		}
break;		break;
}		}
		case MachineCombinerPattern::FMULv2i32_indexed_OP1:
		case MachineCombinerPattern::FMULv2i32_indexed_OP2: {
		unsigned IdxDupOp =
		(Pattern == MachineCombinerPattern::FMULv2i32_indexed_OP1) ? 1 : 2;
		genIndexedMultiply(Root, InsInstrs, IdxDupOp, AArch64::FMULv2i32_indexed,
		&AArch64::FPR64RegClass);
		break;
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Nit: perhaps pass `RC` and `Opc` directly in here? Don't really need these assignments? SjoerdMeijer: Nit: perhaps pass `RC` and `Opc` directly in here? Don't really need these assignments?
		asavonicAuthorUnsubmitted Done Reply Inline Actions Done. asavonic: Done.
		}
		case MachineCombinerPattern::FMULv2i64_indexed_OP1:
		case MachineCombinerPattern::FMULv2i64_indexed_OP2: {
		unsigned IdxDupOp =
		(Pattern == MachineCombinerPattern::FMULv2i64_indexed_OP1) ? 1 : 2;
		genIndexedMultiply(Root, InsInstrs, IdxDupOp, AArch64::FMULv2i64_indexed,
		&AArch64::FPR128RegClass);
		break;
		}
		case MachineCombinerPattern::FMULv4i16_indexed_OP1:
		case MachineCombinerPattern::FMULv4i16_indexed_OP2: {
		unsigned IdxDupOp =
		(Pattern == MachineCombinerPattern::FMULv4i16_indexed_OP1) ? 1 : 2;
		genIndexedMultiply(Root, InsInstrs, IdxDupOp, AArch64::FMULv4i16_indexed,
		&AArch64::FPR64RegClass);
		break;
		}
		case MachineCombinerPattern::FMULv4i32_indexed_OP1:
		case MachineCombinerPattern::FMULv4i32_indexed_OP2: {
		unsigned IdxDupOp =
		(Pattern == MachineCombinerPattern::FMULv4i32_indexed_OP1) ? 1 : 2;
		genIndexedMultiply(Root, InsInstrs, IdxDupOp, AArch64::FMULv4i32_indexed,
		&AArch64::FPR128RegClass);
		break;
		}
		case MachineCombinerPattern::FMULv8i16_indexed_OP1:
		case MachineCombinerPattern::FMULv8i16_indexed_OP2: {
		unsigned IdxDupOp =
		(Pattern == MachineCombinerPattern::FMULv8i16_indexed_OP1) ? 1 : 2;
		genIndexedMultiply(Root, InsInstrs, IdxDupOp, AArch64::FMULv8i16_indexed,
		&AArch64::FPR128RegClass);
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Are we missing tests for these cases? For this you'll probably need to add `-mattr=+fullfp16` to a RUN line. SjoerdMeijer: Are we missing tests for these cases? For this you'll probably need to add `-mattr=+fullfp16`…
		asavonicAuthorUnsubmitted Done Reply Inline Actions Thanks a lot! I didn't know about `-mattr=+fullfp16`. asavonic: Thanks a lot! I didn't know about `-mattr=+fullfp16`.
		break;
		}
} // end switch (Pattern)		} // end switch (Pattern)
// Record MUL and ADD/SUB for deletion		// Record MUL and ADD/SUB for deletion
// FIXME: This assertion fails in CodeGen/AArch64/tailmerging_in_mbp.ll and		// FIXME: This assertion fails in CodeGen/AArch64/tailmerging_in_mbp.ll and
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Should this be setting `MUL = genIndexedMultiply(..)`? SjoerdMeijer: Should this be setting `MUL = genIndexedMultiply(..)`?
		asavonicAuthorUnsubmitted Done Reply Inline Actions This doesn't work because of the two lines below: DelInstrs.push_back(MUL); DelInstrs.push_back(&Root); In case of a `DUP+FMUL` pattern, `Root` is the `FMUL` instruction we need to delete. If we also assign it to the `MUL` variable, it will be deleted twice. asavonic: This doesn't work because of the two lines below: ``` DelInstrs.push_back(MUL); DelInstrs.
// CodeGen/AArch64/urem-seteq-nonzero.ll.		// CodeGen/AArch64/urem-seteq-nonzero.ll.
// assert(MUL && "MUL was never set");		// assert(MUL && "MUL was never set");
		if (MUL)
DelInstrs.push_back(MUL);		DelInstrs.push_back(MUL);
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Was wondering because of the added if here. Unrelated but that FIXME looks a bit dodgy. Any idea what that could be while we are at it? SjoerdMeijer: Was wondering because of the added if here. Unrelated but that FIXME looks a bit dodgy. Any…
		asavonicAuthorUnsubmitted Done Reply Inline Actions This looks like a bug. In `t32_6_3` test case, `processLogicalImmediate` function fails, so `MUL` is not updated. We also don't generate any replacement, and the `Root` gets deleted. I think we should at least return from the function without updating `InsInstrs` and `DelInstrs` if `processLogicalImmediate` function fails. (gdb) p Root.dump() %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv (gdb) p Root.Parent->dump() bb.0 (%ir-block.0): liveins: $w0 %0:gpr32 = COPY $w0 %1:gpr32 = MOVi32imm -1431655765 %2:gpr32common = MADDWrrr %0:gpr32, killed %1:gpr32, $wzr %3:gpr32 = SUBSWri killed %2:gpr32common, 1, 0, implicit-def dead $nzcv %4:gpr32 = exact EXTRWrri %3:gpr32, %3:gpr32, 1 %5:gpr32 = MOVi32imm 715827883 %6:gpr32 = SUBSWrr killed %4:gpr32, killed %5:gpr32, implicit-def $nzcv %7:gpr32 = CSINCWr $wzr, $wzr, 2, implicit $nzcv $w0 = COPY %7:gpr32 RET_ReallyLR implicit $w0 (gdb) p Pattern $3 = llvm::MachineCombinerPattern::MULSUBWI_OP1 asavonic: This looks like a bug. In `t32_6_3` test case, `processLogicalImmediate` function fails, so…
DelInstrs.push_back(&Root);		DelInstrs.push_back(&Root);
}		}

/// Replace csincr-branch sequence by simple conditional branch		/// Replace csincr-branch sequence by simple conditional branch
///		///
/// Examples:		/// Examples:
/// 1. \code		/// 1. \code
/// csinc w9, wzr, wzr, <condition code>		/// csinc w9, wzr, wzr, <condition code>
▲ Show 20 Lines • Show All 1,514 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll

; RUN: llc < %s -O=3 -mtriple=arm64-apple-ios -mcpu=cyclone -enable-unsafe-fp-math \| FileCheck %s		; RUN: llc < %s -O=3 -mtriple=arm64-apple-ios -mcpu=cyclone -mattr=+fullfp16 -enable-unsafe-fp-math \| FileCheck %s
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions I don't know if the cyclone CPU supports FP16, I guess not, but just for testing purposes having it here I think is fine. SjoerdMeijer: I don't know if the cyclone CPU supports FP16, I guess not, but just for testing purposes…
define void @foo_2d(double* %src) {		define void @foo_2d(double* %src) {
; CHECK-LABEL: %entry		; CHECK-LABEL: %entry
; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}		; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}
; CHECK: fmadd {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}		; CHECK: fmadd {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}
entry:		entry:
%arrayidx1 = getelementptr inbounds double, double* %src, i64 5		%arrayidx1 = getelementptr inbounds double, double* %src, i64 5
%arrayidx2 = getelementptr inbounds double, double* %src, i64 11		%arrayidx2 = getelementptr inbounds double, double* %src, i64 11
%tmp = bitcast double* %arrayidx1 to <2 x double>*		%tmp = bitcast double* %arrayidx1 to <2 x double>*
▲ Show 20 Lines • Show All 119 Lines • ▼ Show 20 Lines	for.body: ; preds = %for.body, %entry
%mul3 = fmul fast float %mul, %e10		%mul3 = fmul fast float %mul, %e10
store float %mul3, float* %arrayidx2, align 8		store float %mul3, float* %arrayidx2, align 8
%exitcond = icmp eq i64 %indvars.iv.next, 25		%exitcond = icmp eq i64 %indvars.iv.next, 25
br i1 %exitcond, label %for.end, label %for.body		br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body		for.end: ; preds = %for.body
ret void		ret void
}		}

		define void @indexed_2s(<2 x float> %shuf, <2 x float> %add,
		<2 x float>* %pmul, <2 x float>* %pret) {
		; CHECK-LABEL: %entry
		fhahnUnsubmitted Not Done Reply Inline Actions can you add a MachineIR test case for those transforms? The tests probably should also go into one of the machine-combiner* files or a new one. For more details, please see https://llvm.org/docs/MIRLangRef.html#mir-testing-guide fhahn: can you add a MachineIR test case for those transforms? The tests probably should also go into…
		asavonicAuthorUnsubmitted Done Reply Inline Actions Thanks Florian! I added a new MIR test. asavonic: Thanks Florian! I added a new MIR test.
		; CHECK: for.body
		; CHECK: fmla.2s {{v[0-9]+}}, {{v[0-9]+}}, {{v[0-9]+}}[0]
		;
		entry:
		%shuffle = shufflevector <2 x float> %shuf, <2 x float> undef, <2 x i32> zeroinitializer
		br label %for.body

		for.body:
		%i = phi i64 [ 0, %entry ], [ %inext, %for.body ]
		%pmul_i = getelementptr inbounds <2 x float>, <2 x float>* %pmul, i64 %i
		%pret_i = getelementptr inbounds <2 x float>, <2 x float>* %pret, i64 %i

		%mul_i = load <2 x float>, <2 x float>* %pmul_i

		%mul = fmul fast <2 x float> %mul_i, %shuffle
		%muladd = fadd fast <2 x float> %mul, %add

		store <2 x float> %muladd, <2 x float>* %pret_i, align 16
		%inext = add i64 %i, 1
		br label %for.body
		}

		define void @indexed_2d(<2 x double> %shuf, <2 x double> %add,
		<2 x double>* %pmul, <2 x double>* %pret) {
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Nit: can we just do a `ret %add` here? SjoerdMeijer: Nit: can we just do a `ret %add` here?
		asavonicAuthorUnsubmitted Done Reply Inline Actions This an attempt to keep `shuffle` and `mul` in different basic blocks. If we return in the second basic block, the optimizer merges them into one before we reach instruction selection. In the latest revision of the patch I changed the tests to ensure that this never happens. asavonic: This an attempt to keep `shuffle` and `mul` in different basic blocks. If we return in the…
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Ok, thanks. SjoerdMeijer: Ok, thanks.
		; CHECK-LABEL: %entry
		; CHECK: for.body
		; CHECK: fmla.2d {{v[0-9]+}}, {{v[0-9]+}}, {{v[0-9]+}}[0]
		;
		entry:
		%shuffle = shufflevector <2 x double> %shuf, <2 x double> undef, <2 x i32> zeroinitializer
		br label %for.body

		for.body:
		%i = phi i64 [ 0, %entry ], [ %inext, %for.body ]
		%pmul_i = getelementptr inbounds <2 x double>, <2 x double>* %pmul, i64 %i
		%pret_i = getelementptr inbounds <2 x double>, <2 x double>* %pret, i64 %i

		%mul_i = load <2 x double>, <2 x double>* %pmul_i

		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Same? SjoerdMeijer: Same?
		%mul = fmul fast <2 x double> %mul_i, %shuffle
		%muladd = fadd fast <2 x double> %mul, %add

		store <2 x double> %muladd, <2 x double>* %pret_i, align 16
		%inext = add i64 %i, 1
		br label %for.body
		}

		define void @indexed_4s(<4 x float> %shuf, <4 x float> %add,
		<4 x float>* %pmul, <4 x float>* %pret) {
		; CHECK-LABEL: %entry
		; CHECK: for.body
		; CHECK: fmla.4s {{v[0-9]+}}, {{v[0-9]+}}, {{v[0-9]+}}[0]
		;
		entry:
		%shuffle = shufflevector <4 x float> %shuf, <4 x float> undef, <4 x i32> zeroinitializer
		br label %for.body

		for.body:
		%i = phi i64 [ 0, %entry ], [ %inext, %for.body ]
		%pmul_i = getelementptr inbounds <4 x float>, <4 x float>* %pmul, i64 %i
		%pret_i = getelementptr inbounds <4 x float>, <4 x float>* %pret, i64 %i

		%mul_i = load <4 x float>, <4 x float>* %pmul_i

		%mul = fmul fast <4 x float> %mul_i, %shuffle
		%muladd = fadd fast <4 x float> %mul, %add

		store <4 x float> %muladd, <4 x float>* %pret_i, align 16
		%inext = add i64 %i, 1
		br label %for.body
		}

		define void @indexed_4h(<4 x half> %shuf, <4 x half> %add,
		<4 x half>* %pmul, <4 x half>* %pret) {
		; CHECK-LABEL: %entry
		; CHECK: for.body
		; CHECK: fmla.4h {{v[0-9]+}}, {{v[0-9]+}}, {{v[0-9]+}}[0]
		;
		entry:
		%shuffle = shufflevector <4 x half> %shuf, <4 x half> undef, <4 x i32> zeroinitializer
		br label %for.body

		for.body:
		%i = phi i64 [ 0, %entry ], [ %inext, %for.body ]
		%pmul_i = getelementptr inbounds <4 x half>, <4 x half>* %pmul, i64 %i
		%pret_i = getelementptr inbounds <4 x half>, <4 x half>* %pret, i64 %i

		%mul_i = load <4 x half>, <4 x half>* %pmul_i

		%mul = fmul fast <4 x half> %mul_i, %shuffle
		%muladd = fadd fast <4 x half> %mul, %add

		store <4 x half> %muladd, <4 x half>* %pret_i, align 16
		%inext = add i64 %i, 1
		br label %for.body
		}

		define void @indexed_8h(<8 x half> %shuf, <8 x half> %add,
		<8 x half>* %pmul, <8 x half>* %pret) {
		; CHECK-LABEL: %entry
		; CHECK: for.body
		; CHECK: fmla.8h {{v[0-9]+}}, {{v[0-9]+}}, {{v[0-9]+}}[0]
		;
		entry:
		%shuffle = shufflevector <8 x half> %shuf, <8 x half> undef, <8 x i32> zeroinitializer
		br label %for.body

		for.body:
		%i = phi i64 [ 0, %entry ], [ %inext, %for.body ]
		%pmul_i = getelementptr inbounds <8 x half>, <8 x half>* %pmul, i64 %i
		%pret_i = getelementptr inbounds <8 x half>, <8 x half>* %pret, i64 %i

		%mul_i = load <8 x half>, <8 x half>* %pmul_i

		%mul = fmul fast <8 x half> %mul_i, %shuffle
		%muladd = fadd fast <8 x half> %mul, %add

		store <8 x half> %muladd, <8 x half>* %pret_i, align 16
		%inext = add i64 %i, 1
		br label %for.body
		}

llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir

This file was added.

				# RUN: llc -run-pass=machine-combiner -o - -simplify-mir -mtriple=aarch64-unknown-linux-gnu -mattr=+fullfp16 %s \| FileCheck %s
				--- \|
				dmgreenUnsubmitted Not Done Reply Inline Actions It's probably worth adding verify-machineinstrs to tests like this. There are expensive bots that will check anyway, but it's good to be deliberate on tests like this. I might also use the update_mir_test_checks script, but that's up to you depending on how useful the check lines it adds are. dmgreen: It's probably worth adding verify-machineinstrs to tests like this. There are expensive bots…
				asavonicAuthorUnsubmitted Done Reply Inline Actions They can be useful to see details like kill states. Added them to the test. asavonic: They can be useful to see details like kill states. Added them to the test.
				; ModuleID = 'lit.ll'
				source_filename = "lit.ll"
				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-unknown-linux-gnu"

				define void @indexed_2s(<2 x float> %shuf, <2 x float> %mu, <2 x float> %ad, <2 x float>* %ret) #0 {
				entry:
				%shuffle = shufflevector <2 x float> %shuf, <2 x float> undef, <2 x i32> zeroinitializer
				br label %for.cond

				for.cond: ; preds = %for.cond, %entry
				%mul = fmul <2 x float> %mu, %shuffle
				%add = fadd <2 x float> %mul, %ad
				store <2 x float> %add, <2 x float>* %ret, align 16
				br label %for.cond
				}

				define void @indexed_2s_rev(<2 x float> %shuf, <2 x float> %mu, <2 x float> %ad, <2 x float>* %ret) #0 {
				entry:
				%shuffle = shufflevector <2 x float> %shuf, <2 x float> undef, <2 x i32> zeroinitializer
				br label %for.cond

				for.cond: ; preds = %for.cond, %entry
				%mul = fmul <2 x float> %shuffle, %mu
				%add = fadd <2 x float> %mul, %ad
				store <2 x float> %add, <2 x float>* %ret, align 16
				br label %for.cond
				}

				define void @indexed_2d(<2 x double> %shuf, <2 x double> %mu, <2 x double> %ad, <2 x double>* %ret) #0 {
				entry:
				%shuffle = shufflevector <2 x double> %shuf, <2 x double> undef, <2 x i32> zeroinitializer
				br label %for.cond

				for.cond: ; preds = %for.cond, %entry
				%mul = fmul <2 x double> %mu, %shuffle
				%add = fadd <2 x double> %mul, %ad
				store <2 x double> %add, <2 x double>* %ret, align 16
				br label %for.cond
				}

				define void @indexed_4s(<4 x float> %shuf, <4 x float> %mu, <4 x float> %ad, <4 x float>* %ret) #0 {
				entry:
				%shuffle = shufflevector <4 x float> %shuf, <4 x float> undef, <4 x i32> zeroinitializer
				br label %for.cond

				for.cond: ; preds = %for.cond, %entry
				%mul = fmul <4 x float> %mu, %shuffle
				%add = fadd <4 x float> %mul, %ad
				store <4 x float> %add, <4 x float>* %ret, align 16
				br label %for.cond
				}

				define void @indexed_4h(<4 x half> %shuf, <4 x half> %mu, <4 x half> %ad, <4 x half>* %ret) #0 {
				entry:
				%shuffle = shufflevector <4 x half> %shuf, <4 x half> undef, <4 x i32> zeroinitializer
				br label %for.cond

				for.cond:
				%mul = fmul <4 x half> %mu, %shuffle
				%add = fadd <4 x half> %mul, %ad
				store <4 x half> %add, <4 x half>* %ret, align 16
				br label %for.cond
				}

				define void @indexed_8h(<8 x half> %shuf, <8 x half> %mu, <8 x half> %ad, <8 x half>* %ret) #0 {
				entry:
				%shuffle = shufflevector <8 x half> %shuf, <8 x half> undef, <8 x i32> zeroinitializer
				br label %for.cond

				for.cond:
				%mul = fmul <8 x half> %mu, %shuffle
				%add = fadd <8 x half> %mul, %ad
				store <8 x half> %add, <8 x half>* %ret, align 16
				br label %for.cond
				}

				attributes #0 = { "target-cpu"="cortex-a57" }

				...
				# CHECK-LABEL: name: indexed_2s
				# CHECK: [[OP1COPY:%.*]]:fpr64 = COPY $d1
				# CHECK: [[OP2COPY:%.*]]:fpr64 = COPY $d0
				# CHECK: [[UNDEF:%.*]]:fpr128 = IMPLICIT_DEF
				# CHECK: [[OP2INS:%.*]]:fpr128 = INSERT_SUBREG [[UNDEF]], [[OP2COPY]], %subreg.dsub
				# CHECK: [[OP1:%.*]]:fpr64 = COPY [[OP1COPY]]
				# CHECK: [[OP2:%.*]]:fpr128_lo = COPY [[OP2INS]]
				# CHECK-NOT: FMULv2f32
				# CHECK: :fpr64 = FMULv2i32_indexed [[OP1]], [[OP2]], 0
				---
				name: indexed_2s
				alignment: 16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: fpr64 }
				- { id: 1, class: fpr64 }
				- { id: 2, class: fpr64 }
				- { id: 3, class: fpr64 }
				- { id: 4, class: gpr64common }
				- { id: 5, class: fpr64 }
				- { id: 6, class: fpr64 }
				- { id: 7, class: fpr128 }
				- { id: 8, class: fpr128 }
				- { id: 9, class: fpr64 }
				- { id: 10, class: fpr64 }
				liveins:
				- { reg: '$d0', virtual-reg: '%1' }
				- { reg: '$d1', virtual-reg: '%2' }
				- { reg: '$d2', virtual-reg: '%3' }
				- { reg: '$x0', virtual-reg: '%4' }
				frameInfo:
				maxAlignment: 1
				maxCallFrameSize: 0
				machineFunctionInfo: {}
				body: \|
				bb.0.entry:
				liveins: $d0, $d1, $d2, $x0

				%4:gpr64common = COPY $x0
				%3:fpr64 = COPY $d2
				%2:fpr64 = COPY $d1
				%1:fpr64 = COPY $d0
				%8:fpr128 = IMPLICIT_DEF
				%7:fpr128 = INSERT_SUBREG %8, %1, %subreg.dsub
				%6:fpr64 = COPY %3
				%5:fpr64 = COPY %2
				%0:fpr64 = DUPv2i32lane killed %7, 0

				bb.1.for.cond:
				%9:fpr64 = FMULv2f32 %5, %0
				%10:fpr64 = FADDv2f32 killed %9, %6
				STRDui killed %10, %4, 0 :: (store 8 into %ir.ret, align 16)
				B %bb.1

				...
				# CHECK-LABEL: name: indexed_2s_rev
				# CHECK: [[OP2COPY:%.*]]:fpr64 = COPY $d1
				# CHECK: [[OP1COPY:%.*]]:fpr64 = COPY $d0
				# CHECK: [[UNDEF:%.*]]:fpr128 = IMPLICIT_DEF
				# CHECK: [[OP1INS:%.*]]:fpr128 = INSERT_SUBREG [[UNDEF]], [[OP1COPY]], %subreg.dsub
				# CHECK: [[OP2:%.*]]:fpr64 = COPY [[OP2COPY]]
				# CHECK: [[OP1:%.*]]:fpr128_lo = COPY [[OP1INS]]
				# CHECK-NOT: FMULv2f32
				# CHECK: :fpr64 = FMULv2i32_indexed [[OP2]], [[OP1]], 0
				---
				name: indexed_2s_rev
				alignment: 16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: fpr64 }
				- { id: 1, class: fpr64 }
				- { id: 2, class: fpr64 }
				- { id: 3, class: fpr64 }
				- { id: 4, class: gpr64common }
				- { id: 5, class: fpr64 }
				- { id: 6, class: fpr64 }
				- { id: 7, class: fpr128 }
				- { id: 8, class: fpr128 }
				- { id: 9, class: fpr64 }
				- { id: 10, class: fpr64 }
				liveins:
				- { reg: '$d0', virtual-reg: '%1' }
				- { reg: '$d1', virtual-reg: '%2' }
				- { reg: '$d2', virtual-reg: '%3' }
				- { reg: '$x0', virtual-reg: '%4' }
				frameInfo:
				maxAlignment: 1
				maxCallFrameSize: 0
				machineFunctionInfo: {}
				body: \|
				bb.0.entry:
				liveins: $d0, $d1, $d2, $x0

				%4:gpr64common = COPY $x0
				%3:fpr64 = COPY $d2
				%2:fpr64 = COPY $d1
				%1:fpr64 = COPY $d0
				%8:fpr128 = IMPLICIT_DEF
				%7:fpr128 = INSERT_SUBREG %8, %1, %subreg.dsub
				%6:fpr64 = COPY %3
				%5:fpr64 = COPY %2
				%0:fpr64 = DUPv2i32lane killed %7, 0

				bb.1.for.cond:
				%9:fpr64 = FMULv2f32 %0, %5
				%10:fpr64 = FADDv2f32 killed %9, %6
				STRDui killed %10, %4, 0 :: (store 8 into %ir.ret, align 16)
				B %bb.1

				...
				# CHECK-LABEL: name: indexed_2d
				# CHECK: [[OP1COPY:%.*]]:fpr128 = COPY $q1
				# CHECK: [[OP2COPY:%.*]]:fpr128 = COPY $q0
				# CHECK: [[OP1:%.*]]:fpr128 = COPY [[OP1COPY]]
				# CHECK: [[OP2:%.*]]:fpr128_lo = COPY [[OP2COPY]]
				# CHECK-NOT: FMULv2f64
				# CHECK: :fpr128 = FMULv2i64_indexed [[OP1]], [[OP2]], 0
				---
				name: indexed_2d
				alignment: 16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: fpr128 }
				- { id: 1, class: fpr128 }
				- { id: 2, class: fpr128 }
				- { id: 3, class: fpr128 }
				- { id: 4, class: gpr64common }
				- { id: 5, class: fpr128 }
				- { id: 6, class: fpr128 }
				- { id: 7, class: fpr128 }
				- { id: 8, class: fpr128 }
				liveins:
				- { reg: '$q0', virtual-reg: '%1' }
				- { reg: '$q1', virtual-reg: '%2' }
				SjoerdMeijerUnsubmitted Not Done Reply Inline Actions For consistency probably best to add an indexed_8h test here too.If you're confident doing this, you can just do this without another review and commit it, but otherwise happy to look again. Similarly, you will probably need -mattr=+fullfp16 on the RUN line. SjoerdMeijer: For consistency probably best to add an indexed_8h test here too.If you're confident doing this…
				asavonicAuthorUnsubmitted Done Reply Inline Actions Thanks a lot! I added indexed_4h and 8h cases to the test. asavonic: Thanks a lot! I added indexed_4h and 8h cases to the test.
				- { reg: '$q2', virtual-reg: '%3' }
				- { reg: '$x0', virtual-reg: '%4' }
				frameInfo:
				maxAlignment: 1
				maxCallFrameSize: 0
				machineFunctionInfo: {}
				body: \|
				bb.0.entry:
				liveins: $q0, $q1, $q2, $x0

				%4:gpr64common = COPY $x0
				%3:fpr128 = COPY $q2
				%2:fpr128 = COPY $q1
				%1:fpr128 = COPY $q0
				%6:fpr128 = COPY %3
				%5:fpr128 = COPY %2
				%0:fpr128 = DUPv2i64lane %1, 0

				bb.1.for.cond:
				%7:fpr128 = FMULv2f64 %5, %0
				%8:fpr128 = FADDv2f64 killed %7, %6
				STRQui killed %8, %4, 0 :: (store 16 into %ir.ret)
				B %bb.1

				...
				# CHECK-LABEL: name: indexed_4s
				# CHECK: [[OP1COPY:%.*]]:fpr128 = COPY $q1
				# CHECK: [[OP2COPY:%.*]]:fpr128 = COPY $q0
				# CHECK: [[OP1:%.*]]:fpr128 = COPY [[OP1COPY]]
				# CHECK: [[OP2:%.*]]:fpr128_lo = COPY [[OP2COPY]]
				# CHECK-NOT: FMULv4f32
				# CHECK: :fpr128 = FMULv4i32_indexed [[OP1]], [[OP2]], 0
				---
				name: indexed_4s
				alignment: 16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: fpr128 }
				- { id: 1, class: fpr128 }
				- { id: 2, class: fpr128 }
				- { id: 3, class: fpr128 }
				- { id: 4, class: gpr64common }
				- { id: 5, class: fpr128 }
				- { id: 6, class: fpr128 }
				- { id: 7, class: fpr128 }
				- { id: 8, class: fpr128 }
				liveins:
				- { reg: '$q0', virtual-reg: '%1' }
				- { reg: '$q1', virtual-reg: '%2' }
				- { reg: '$q2', virtual-reg: '%3' }
				- { reg: '$x0', virtual-reg: '%4' }
				frameInfo:
				maxAlignment: 1
				maxCallFrameSize: 0
				machineFunctionInfo: {}
				body: \|
				bb.0.entry:
				liveins: $q0, $q1, $q2, $x0

				%4:gpr64common = COPY $x0
				%3:fpr128 = COPY $q2
				%2:fpr128 = COPY $q1
				%1:fpr128 = COPY $q0
				%6:fpr128 = COPY %3
				%5:fpr128 = COPY %2
				%0:fpr128 = DUPv4i32lane %1, 0

				bb.1.for.cond:
				%7:fpr128 = FMULv4f32 %5, %0
				%8:fpr128 = FADDv4f32 killed %7, %6
				STRQui killed %8, %4, 0 :: (store 16 into %ir.ret)
				B %bb.1

				...
				# CHECK-LABEL: name: indexed_4h
				# CHECK: [[OP1:%.*]]:fpr64 = COPY $d1
				# CHECK: [[OP2COPY:%.*]]:fpr64 = COPY $d0
				# CHECK: [[UNDEF:%.*]]:fpr128 = IMPLICIT_DEF
				# CHECK: [[OP2INS:%.*]]:fpr128 = INSERT_SUBREG [[UNDEF]], [[OP2COPY]], %subreg.dsub
				# CHECK: [[OP2:%.*]]:fpr128_lo = COPY [[OP2INS]]
				# CHECK-NOT: FMULv4f16
				# CHECK: :fpr64 = FMULv4i16_indexed [[OP1]], [[OP2]], 0
				---
				name: indexed_4h
				alignment: 16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: fpr64 }
				- { id: 1, class: fpr64 }
				- { id: 2, class: fpr64 }
				- { id: 3, class: fpr64 }
				- { id: 4, class: gpr64common }
				- { id: 5, class: fpr128 }
				- { id: 6, class: fpr128 }
				- { id: 7, class: fpr64 }
				- { id: 8, class: fpr64 }
				liveins:
				- { reg: '$d0', virtual-reg: '%1' }
				- { reg: '$d1', virtual-reg: '%2' }
				- { reg: '$d2', virtual-reg: '%3' }
				- { reg: '$x0', virtual-reg: '%4' }
				frameInfo:
				maxAlignment: 1
				maxCallFrameSize: 0
				machineFunctionInfo: {}
				body: \|
				bb.0.entry:
				liveins: $d0, $d1, $d2, $x0

				%4:gpr64common = COPY $x0
				%3:fpr64 = COPY $d2
				%2:fpr64 = COPY $d1
				%1:fpr64 = COPY $d0
				%6:fpr128 = IMPLICIT_DEF
				%5:fpr128 = INSERT_SUBREG %6, %1, %subreg.dsub
				%0:fpr64 = DUPv4i16lane killed %5, 0

				bb.1.for.cond:
				%7:fpr64 = FMULv4f16 %2, %0
				%8:fpr64 = FADDv4f16 killed %7, %3
				STRDui killed %8, %4, 0 :: (store 8 into %ir.ret, align 16)
				B %bb.1

				...
				# CHECK-LABEL: name: indexed_8h
				# CHECK: [[OP1:%.*]]:fpr128 = COPY $q1
				# CHECK: [[OP2COPY:%.*]]:fpr128 = COPY $q0
				# CHECK: [[OP2:%.*]]:fpr128_lo = COPY [[OP2COPY]]
				# CHECK-NOT: FMULv8f16
				# CHECK: :fpr128 = FMULv8i16_indexed [[OP1]], [[OP2]], 0
				---
				name: indexed_8h
				alignment: 16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: fpr128 }
				- { id: 1, class: fpr128 }
				- { id: 2, class: fpr128 }
				- { id: 3, class: fpr128 }
				- { id: 4, class: gpr64common }
				- { id: 5, class: fpr128 }
				- { id: 6, class: fpr128 }
				liveins:
				- { reg: '$q0', virtual-reg: '%1' }
				- { reg: '$q1', virtual-reg: '%2' }
				- { reg: '$q2', virtual-reg: '%3' }
				- { reg: '$x0', virtual-reg: '%4' }
				frameInfo:
				maxAlignment: 1
				maxCallFrameSize: 0
				machineFunctionInfo: {}
				body: \|
				bb.0.entry:
				liveins: $q0, $q1, $q2, $x0

				%4:gpr64common = COPY $x0
				%3:fpr128 = COPY $q2
				%2:fpr128 = COPY $q1
				%1:fpr128 = COPY $q0
				%0:fpr128 = DUPv8i16lane %1, 0

				bb.1.for.cond:
				%5:fpr128 = FMULv8f16 %2, %0
				%6:fpr128 = FADDv8f16 killed %5, %3
				STRQui killed %6, %4, 0 :: (store 16 into %ir.ret)
				B %bb.1

				...

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add Machine InstCombiner patterns for FMUL indexed variantClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 336950

llvm/include/llvm/CodeGen/MachineCombinerPattern.h

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

llvm/test/CodeGen/AArch64/arm64-fma-combines.ll

llvm/test/CodeGen/AArch64/machine-combiner-fmul-dup.mir

[AArch64] Add Machine InstCombiner patterns for FMUL indexed variant
ClosedPublic