This commit implements an IR-level optimization to eliminate idempotent
SVE mul/fmul intrinsic calls. Currently, the following patterns are
captured:
  fmul pg (dup_x 1.0) V    => V
  mul  pg (dup_x 1)   V    => V
  fmul pg V (dup_x 1.0)    => V
  mul  pg V (dup_x 1)      => V
  fmul pg V (dup v pg 1.0) => V
  mul  pg V (dup v pg 1)   => V
The result of this commit is that code such as:
  #include <arm_sve.h>

  svfloat64_t foo(svfloat64_t a) {
    svbool_t t = svptrue_b64();
    svfloat64_t b = svdup_f64(1.0);
    return svmul_m(t, a, b);
  }

will lower to a nop.
This commit does not capture all possibilities; only the simple cases
described above. There is still room for further optimization.
nit: Would this be better named Pg instead of Op0, so it's more obvious that Op1 and Op2 are the integer/FP vector inputs?