This is an archive of the discontinued LLVM Phabricator instance.


Differential D133116

[AArch64][SVE] Optimise repeated floating-point complex patterns to splat
Abandoned, Public

Authored by MattDevereau on Sep 1 2022, 8:21 AM.

Details

Reviewers

paulwalker-arm
c-rhodes
peterwaller-arm
efriedma
benmxwl-arm

Summary

Repeated floating-point complex patterns in BUILD_VECTOR such as (f32 a, f32 b,
f32 a, f32 b) can be simplified to SPLAT_VECTOR(f64(a, b))
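
The equivalence the summary relies on can be checked outside LLVM. The following is a minimal sketch, assuming a 128-bit block of four 32-bit floats; the helper names `buildRepeatedPair` and `splatWidePair` are hypothetical and do not appear in the patch. It lays out (a, b, a, b) as four lanes, then builds the same 16 bytes by replicating one 64-bit element formed from the pair (a, b).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

// Hypothetical helper (not from the patch): the bytes of the four-lane
// vector (a, b, a, b), as the original BUILD_VECTOR would lay them out.
std::array<unsigned char, 16> buildRepeatedPair(float a, float b) {
  const float lanes[4] = {a, b, a, b};
  std::array<unsigned char, 16> bytes{};
  std::memcpy(bytes.data(), lanes, sizeof(lanes));
  return bytes;
}

// Hypothetical helper (not from the patch): pack (a, b) into one 64-bit
// element and replicate that element across the 128-bit block, which is
// what a splat of the wide element produces.
std::array<unsigned char, 16> splatWidePair(float a, float b) {
  std::uint64_t wide = 0;
  unsigned char *widePtr = reinterpret_cast<unsigned char *>(&wide);
  std::memcpy(widePtr, &a, sizeof(float));                  // first lane
  std::memcpy(widePtr + sizeof(float), &b, sizeof(float));  // second lane
  std::array<unsigned char, 16> bytes{};
  for (std::size_t i = 0; i < 2; ++i)
    std::memcpy(bytes.data() + i * sizeof(wide), &wide, sizeof(wide));
  return bytes;
}
```

Copying the two floats with `memcpy` instead of shifting keeps the sketch independent of byte order; `buildRepeatedPair(a, b) == splatWidePair(a, b)` is expected to hold for any pair of values.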

Event Timeline

MattDevereau created this revision. Sep 1 2022, 8:21 AM

Herald added a reviewer: efriedma. Sep 1 2022, 8:21 AM

Herald added a project: Restricted Project.

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

MattDevereau requested review of this revision. Sep 1 2022, 8:21 AM

Herald added a project: Restricted Project. Sep 1 2022, 8:21 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B184589: Diff 457282. Sep 1 2022, 9:56 AM

Matt added a subscriber: Matt. Sep 1 2022, 11:13 AM

paulwalker-arm added inline comments. Sep 23 2022, 10:17 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11340–11341	As mentioned below, I don't believe this transform is safe outside of the context of how it's used in the `AArch64ISD::DUPLANE128` case you care about. Hence my suggestion to do it as a DAG combine.
11356–11362	`ISD::BUILD_VECTOR` has its own class `BuildVectorSDNode` that provides `getRepeatedSequence()`, which can probably help here and perhaps mean we can cover more cases?
11361–11363	This is not a good place for this transformation as it's really an optimisation rather than anything related to lowering. I think you want a DAG combine for `AArch64ISD::DUPLANE128` because `useWideSplatForBuildVectorRepeatedComplexPattern` looks like a helper function rather than a correct optimisation in its own right.
11371–11382	Not sure if I'm sending you down a dead end but I'm wondering if instead of this logic we've reached the stage where it's preferable to enable the other `AArch64ISD::DUPLANE##` nodes for scalable vectors. That way you're essentially doing: AArch64ISD::DUPLANE128(INSERT_SUBVECTOR(BUILD_VECTOR(f32 a, f32 b, f32 a, f32 b))) -> AArch64ISD::DUPLANE64(INSERT_SUBVECTOR(BUILD_VECTOR(f32 a, f32 b, f32 a, f32 b))) And then, if necessary, we add combines to remove the unused parts of the subvector. The problem with `ISD::SPLAT_VECTOR` is that it does what you want for floating-point types, but if you had integer tests you may well see FPR<->GPR transitions you don't want. Of course you can use a bitcast to ensure everything looks like a floating-point value, but that might not be the cleanest approach.
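
To illustrate the repeated-sequence idea raised in the comment on lines 11356–11362, here is a stand-alone sketch. It is not the LLVM API and does not model undef elements; the function name `shortestRepeatedSequence` and the use of plain `int` elements are assumptions made for the example. It returns the shortest power-of-two-length prefix that, when repeated, reproduces the whole vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-alone analogue (an assumption for illustration, not LLVM code):
// find the shortest prefix of power-of-two length whose repetition
// reproduces every element of Elts. Returns an empty vector when the
// element count is zero or not a power of two.
std::vector<int> shortestRepeatedSequence(const std::vector<int> &Elts) {
  const std::size_t N = Elts.size();
  if (N == 0 || (N & (N - 1)) != 0)
    return {};

  for (std::size_t Len = 1; Len <= N; Len *= 2) {
    bool Repeats = true;
    for (std::size_t I = Len; I < N && Repeats; ++I)
      Repeats = (Elts[I] == Elts[I % Len]);
    if (Repeats)
      return std::vector<int>(Elts.begin(), Elts.begin() + Len);
  }
  return {}; // Not reached: Len == N trivially repeats.
}
```

Under this model, a repeated complex pattern is the case where the returned sequence has length two, and a length of one corresponds to an ordinary splat.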

MattDevereau updated this revision to Diff 462955. Sep 26 2022, 10:05 AM

@paulwalker-arm I've moved the new function to DAG combine, and have somewhat side-graded the current complex pattern matching logic to work with the nodes present after DAG combine, which are different from the nodes previously in ISel lowering. I'll focus on implementing your other suggestion of enabling scalable vectors for AArch64ISD::DUPLANE## nodes, feel free to ignore this review for the time being

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11340–11341	I've moved this to be a DAG combine inside `performDupLane128Combine`
11356–11362	This worked perfectly when I tried it, although moving the new function inside DAGCombine meant that the `ISD::BUILD_VECTOR` node has already been lowered to an `ISD::INSERT_SUBVECTOR` node. I've changed the code to something similar here but with ISD::INSERT_SUBVECTOR, if you also have a suggestion similar to the buildvector improvement then that would be optimal. Regardless, this might not be a big deal as I'm focusing on your other suggestion of exposing scalable vector types for `AArch64ISD::DUPLANE##` for now

Harbormaster completed remote builds in B188741: Diff 462955. Sep 26 2022, 11:48 AM

MattDevereau requested review of this revision. Sep 28 2022, 6:51 AM

MattDevereau marked 2 inline comments as done.

MattDevereau added a reviewer: benmxwl-arm. Oct 12 2022, 6:53 AM

georges added a subscriber: georges. Oct 15 2022, 8:33 AM

peterwaller-arm requested changes to this revision. Nov 10 2022, 4:08 AM

peterwaller-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20704	Nothing here considers the insertion index of the insert_vector_elt, and although there could be a canonicalization which puts them in order, I'm not certain we can rely on the order of the operations to give you the full sequence (e.g. what happens if some insertions are missing?).
20715	This condition doesn't look right: it doesn't consider all of the elements of the sequence, only N/2+2 of them.
20729	This considers Sequence[0] but not Sequence[1], so I believe it's getting the right codegen by coincidence. Also worth highlighting in a comment that Sequence is in reversed order.
llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll
590	This kill indicates that q0 contains live data, and it's lost after the change. This could allow the compiler to reorder instructions ignorant of the fact that the data going into the mov via v0 is live. Looking at the code I can see you're not passing %x along, so that needs to be fixed.

This revision now requires changes to proceed. Nov 10 2022, 4:08 AM

Added tests dupq_f32_repeat_complex_unordered and dupq_f32_repeat_complex_rev_unordered, which assert that the simplification does not trigger when the insert_vector_elt indices are not incremental.
Added test dupq_f32_repeat_complex_rev to assert that the simplification takes place when starting from an insertelement into the final vector index.
Added test dupq_f16_repeat_complex_omit_pair to assert that the simplification does not trigger when some vector elements are unknown.
Added tests dupq_f16_repeat_complex_mismatched_front, dupq_f16_repeat_complex_mismatched_end and dupq_f16_repeat_complex_mismatched_middle to verify that the simplification considers the full range of elements.

MattDevereau added inline comments. Nov 10 2022, 7:53 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20704	I've added logic to check the ordering with `NumElements`. I've added tests `dupq_f32_repeat_complex_unordered`, `dupq_f32_repeat_complex_rev_unordered` and `dupq_f16_repeat_complex_omit_pair` to assert this.
20715	I've changed this to `RSequence.size() - 2;`. I've added tests `dupq_f16_repeat_complex_mismatched_front`, `dupq_f16_repeat_complex_mismatched_end` and `dupq_f16_repeat_complex_mismatched_middle` to assert this.
20729	I think the updated tests now reflect this with 2 kill messages: `; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0` `; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1`
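
The checks discussed in this exchange (insertion indices in order, no missing insertions, every element compared against its partner two lanes away) can be modelled outside the DAG. The sketch below is an assumption-laden simplification and not the patch's code: `Insert` and `matchRepeatedPair` are hypothetical names, integers stand in for `SDValue`s, and the initial SCALAR_TO_VECTOR is treated as an insert at index 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one insertion: write Value into lane Index.
struct Insert {
  std::uint64_t Index;
  int Value;
};

// Chain lists the insertions in the order they were applied. Returns true
// and sets X (even lanes) and Y (odd lanes) only if every lane is written
// exactly once, in index order, and the values repeat with period two.
bool matchRepeatedPair(const std::vector<Insert> &Chain, std::size_t NumLanes,
                       int &X, int &Y) {
  if (NumLanes < 4 || NumLanes % 2 != 0 || Chain.size() != NumLanes)
    return false;

  // Walk from the last insertion back to the first, so the collected
  // sequence is reversed: (y, x, y, x, ...).
  std::vector<int> RSequence;
  for (std::size_t I = 0; I < NumLanes; ++I) {
    const std::size_t Lane = NumLanes - 1 - I;
    if (Chain[Lane].Index != Lane)
      return false; // out-of-order or missing insertion
    RSequence.push_back(Chain[Lane].Value);
  }

  // Compare every element with the one two positions later, covering the
  // full sequence rather than only part of it.
  for (std::size_t I = 0; I + 2 < RSequence.size(); ++I)
    if (RSequence[I] != RSequence[I + 2])
      return false;

  // Because the sequence is reversed, [0] is the odd-lane value and [1]
  // is the even-lane value.
  X = RSequence[1];
  Y = RSequence[0];
  return true;
}
```

With this model, the inputs corresponding to the unordered, omit_pair and mismatched tests are all rejected, while the plain (x, y, x, y) chain is accepted.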

One suggestion inline.

Accepting to remove my block and encourage other reviewers to weigh in if they have problems with the approach. Please hold off submission until late Monday as I would like to do one more pass over it before we push.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20700	Would it be better to express this condition in terms of `if (VecElTy not in (set of supported types))` so that it's impossible for new types to arise which would be broken? (Also makes it read as an explicit gate of supported types).
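
One way to read this suggestion is as an allow-list. The sketch below is only an illustration: `ElemTy` is a hypothetical stand-in for the element types, not LLVM's `MVT`, and the supported set is assumed from the four types the patch later maps to a subregister index (f32, i32, f16, i16).

```cpp
#include <cassert>

// Hypothetical stand-in for the element types involved; not LLVM's MVT.
enum class ElemTy { i8, i16, i32, i64, f16, bf16, f32, f64 };

// Allow-list gate: only the types the rest of the transform knows how to
// handle are accepted, so any other type is rejected by default.
bool isSupportedComplexElemTy(ElemTy Ty) {
  switch (Ty) {
  case ElemTy::f16:
  case ElemTy::i16:
  case ElemTy::f32:
  case ElemTy::i32:
    return true;
  default:
    return false;
  }
}
```

The difference from a deny-list of f64, i64 and i8 shows up for a type such as bf16, which a deny-list in that form would let through but which this gate rejects.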

This revision is now accepted and ready to land. Nov 10 2022, 8:30 AM

Harbormaster completed remote builds in B197081: Diff 474547. Nov 10 2022, 8:40 AM

Hi @MattDevereau, Sorry if we've discussed this before but is there a reason this cannot be solved at the IR level as an instcombine?

@paulwalker-arm I've moved this patch to an instcombine here: https://reviews.llvm.org/D138203. I'll mark this as changes planned for now and abandon it after a short period of time.

MattDevereau abandoned this revision. Dec 5 2022, 6:48 AM

Revision Contents

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp: 89 lines

llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll: 171 lines

Diff 474547

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[... 11,331 lines not shown ...]
if (VT == MVT::nxv1i1)
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::nxv1i1,		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::nxv1i1,
DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::nxv2i1, ID,		DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::nxv2i1, ID,
Zero, SplatVal),		Zero, SplatVal),
Zero);		Zero);
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, VT, ID, Zero, SplatVal);		return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, VT, ID, Zero, SplatVal);
}		}

SDValue AArch64TargetLowering::LowerDUPQLane(SDValue Op,		SDValue AArch64TargetLowering::LowerDUPQLane(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
SDLoc DL(Op);		SDLoc DL(Op);

EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
if (!isTypeLegal(VT) || !VT.isScalableVector())		if (!isTypeLegal(VT) || !VT.isScalableVector())
return SDValue();		return SDValue();

// Current lowering only supports the SVE-ACLE types.		// Current lowering only supports the SVE-ACLE types.
if (VT.getSizeInBits().getKnownMinSize() != AArch64::SVEBitsPerBlock)		if (VT.getSizeInBits().getKnownMinSize() != AArch64::SVEBitsPerBlock)
return SDValue();		return SDValue();

// The DUPQ operation is indepedent of element type so normalise to i64s.		// The DUPQ operation is indepedent of element type so normalise to i64s.
SDValue Idx128 = Op.getOperand(2);		SDValue Idx128 = Op.getOperand(2);

// DUPQ can be used when idx is in range.		// DUPQ can be used when idx is in range.
auto *CIdx = dyn_cast<ConstantSDNode>(Idx128);		auto *CIdx = dyn_cast<ConstantSDNode>(Idx128);
if (CIdx && (CIdx->getZExtValue() <= 3)) {		if (CIdx && (CIdx->getZExtValue() <= 3)) {
SDValue CI = DAG.getTargetConstant(CIdx->getZExtValue(), DL, MVT::i64);		SDValue CI = DAG.getTargetConstant(CIdx->getZExtValue(), DL, MVT::i64);
return DAG.getNode(AArch64ISD::DUPLANE128, DL, VT, Op.getOperand(1), CI);		return DAG.getNode(AArch64ISD::DUPLANE128, DL, VT, Op.getOperand(1), CI);
}		}

SDValue V = DAG.getNode(ISD::BITCAST, DL, MVT::nxv2i64, Op.getOperand(1));		SDValue V = DAG.getNode(ISD::BITCAST, DL, MVT::nxv2i64, Op.getOperand(1));

// The ACLE says this must produce the same result as:		// The ACLE says this must produce the same result as:
// svtbl(data, svadd_x(svptrue_b64(),		// svtbl(data, svadd_x(svptrue_b64(),
// svand_x(svptrue_b64(), svindex_u64(0, 1), 1),		// svand_x(svptrue_b64(), svindex_u64(0, 1), 1),
// index * 2))		// index * 2))
SDValue One = DAG.getConstant(1, DL, MVT::i64);		SDValue One = DAG.getConstant(1, DL, MVT::i64);
SDValue SplatOne = DAG.getNode(ISD::SPLAT_VECTOR, DL, MVT::nxv2i64, One);		SDValue SplatOne = DAG.getNode(ISD::SPLAT_VECTOR, DL, MVT::nxv2i64, One);

// create the vector 0,1,0,1,...		// create the vector 0,1,0,1,...
SDValue SV = DAG.getStepVector(DL, MVT::nxv2i64);		SDValue SV = DAG.getStepVector(DL, MVT::nxv2i64);
SV = DAG.getNode(ISD::AND, DL, MVT::nxv2i64, SV, SplatOne);		SV = DAG.getNode(ISD::AND, DL, MVT::nxv2i64, SV, SplatOne);

// create the vector idx64,idx64+1,idx64,idx64+1,...		// create the vector idx64,idx64+1,idx64,idx64+1,...
SDValue Idx64 = DAG.getNode(ISD::ADD, DL, MVT::i64, Idx128, Idx128);		SDValue Idx64 = DAG.getNode(ISD::ADD, DL, MVT::i64, Idx128, Idx128);
SDValue SplatIdx64 = DAG.getNode(ISD::SPLAT_VECTOR, DL, MVT::nxv2i64, Idx64);		SDValue SplatIdx64 = DAG.getNode(ISD::SPLAT_VECTOR, DL, MVT::nxv2i64, Idx64);
SDValue ShuffleMask = DAG.getNode(ISD::ADD, DL, MVT::nxv2i64, SV, SplatIdx64);		SDValue ShuffleMask = DAG.getNode(ISD::ADD, DL, MVT::nxv2i64, SV, SplatIdx64);

// create the vector Val[idx64],Val[idx64+1],Val[idx64],Val[idx64+1],...		// create the vector Val[idx64],Val[idx64+1],Val[idx64],Val[idx64+1],...
SDValue TBL = DAG.getNode(AArch64ISD::TBL, DL, MVT::nxv2i64, V, ShuffleMask);		SDValue TBL = DAG.getNode(AArch64ISD::TBL, DL, MVT::nxv2i64, V, ShuffleMask);
return DAG.getNode(ISD::BITCAST, DL, VT, TBL);		return DAG.getNode(ISD::BITCAST, DL, VT, TBL);
}		}


static bool resolveBuildVector(BuildVectorSDNode *BVN, APInt &CnstBits,		static bool resolveBuildVector(BuildVectorSDNode *BVN, APInt &CnstBits,
APInt &UndefBits) {		APInt &UndefBits) {
EVT VT = BVN->getValueType(0);		EVT VT = BVN->getValueType(0);
APInt SplatBits, SplatUndef;		APInt SplatBits, SplatUndef;
unsigned SplatBitSize;		unsigned SplatBitSize;
bool HasAnyUndefs;		bool HasAnyUndefs;
[... 9,282 lines not shown; the following lines are from static SDValue performBSPExpandForSVE(SDNode *N, SelectionDAG &DAG, ...]
SDValue In2 = N->getOperand(2);		SDValue In2 = N->getOperand(2);

SDValue InvMask = DAG.getNOT(DL, Mask, VT);		SDValue InvMask = DAG.getNOT(DL, Mask, VT);
SDValue Sel = DAG.getNode(ISD::AND, DL, VT, Mask, In1);		SDValue Sel = DAG.getNode(ISD::AND, DL, VT, Mask, In1);
SDValue SelInv = DAG.getNode(ISD::AND, DL, VT, InvMask, In2);		SDValue SelInv = DAG.getNode(ISD::AND, DL, VT, InvMask, In2);
return DAG.getNode(ISD::OR, DL, VT, Sel, SelInv);		return DAG.getNode(ISD::OR, DL, VT, Sel, SelInv);
}		}

		// Simplify a repeating complex pattern from vector insert inside DUPLANE128
		// to a splat of one complex number, i.e.:
		// nxv4f32 DUPLANE128(f32 x, f32 y, f32 x, f32 y) -> nx2f64 SPLAT_VECTOR(x <<
		// 32 | y))
		SDValue simplifyDupLane128RepeatedComplexPattern(SDNode *Op,
		SelectionDAG &DAG) {
		SDValue InsertSubvector = Op->getOperand(0);
		if (InsertSubvector.getOpcode() != ISD::INSERT_SUBVECTOR)
		return SDValue();
		if (!InsertSubvector.getOperand(0).isUndef())
		return SDValue();

		SDValue InsertVecElt = InsertSubvector.getOperand(1);
		if (InsertVecElt.getOpcode() != ISD::INSERT_VECTOR_ELT)
		return SDValue();
		EVT VecTy = InsertVecElt.getValueType();
		if (!VecTy.is128BitVector())
		return SDValue();
		EVT VecElTy = VecTy.getScalarType();
		if (VecElTy == MVT::f64 || VecElTy == MVT::i64 || VecElTy == MVT::i8)
		return SDValue();

		// Walk the chain of INSERT_VECTOR_ELT's until a SCALAR_TO_VECTOR is hit,
		// Capturing the scalar registers being inserted. This will result in a
		// reversed sequence (e.g. y, x, y, x) or (3, 2, 1, 0)
		unsigned int NumElements = VecTy.getVectorNumElements() - 1;
		SmallVector<SDValue, 8> RSequence;
		do {
		if (InsertVecElt.getOpcode() == ISD::INSERT_VECTOR_ELT) {
		RSequence.push_back(InsertVecElt.getOperand(1));
		uint64_t Index = InsertVecElt.getConstantOperandVal(2);
		if (Index != NumElements)
		return SDValue();
		NumElements--;
		InsertVecElt = InsertVecElt.getOperand(0);
		} else if (InsertVecElt.getOpcode() == ISD::SCALAR_TO_VECTOR) {
		RSequence.push_back(InsertVecElt.getOperand(0));
		break;
		} else
		return SDValue();
		} while (true);

		// We can't simplify a repeat complex pattern if there's less than 4 elements
		if (RSequence.size() < 4)
		return SDValue();
		// We can't simplify a repeat complex pattern if there's an odd number of
		// elements
		if (RSequence.size() % 2 != 0)
		return SDValue();

		// Check the "real" and "imaginary" components for equality across each
		// complex number
		for (unsigned i = 0; i < RSequence.size() - 2; i++)
		if (RSequence[i] != RSequence[i + 2])
		return SDValue();

		MVT WideElTy = MVT::getFloatingPointVT(VecElTy.getScalarSizeInBits() * 2);
		MVT WideVecTy =
		MVT::getVectorVT(WideElTy, VecTy.getVectorNumElements() / 2, false);
		MVT ScalableWideVecTy =
		MVT::getVectorVT(WideElTy, VecTy.getVectorNumElements() / 2, true);

		int SRIdx = 0;
		if (VecElTy == MVT::f32 || VecElTy == MVT::i32)
		SRIdx = AArch64::ssub;
		else if (VecElTy == MVT::f16 || VecElTy == MVT::i16)
		SRIdx = AArch64::hsub;
		SDLoc DL(Op);

		SDValue X = RSequence[1], Y = RSequence[0];
		SDValue InsertSubreg =
		DAG.getTargetInsertSubreg(SRIdx, DL, VecTy, DAG.getUNDEF(VecTy), X);
		InsertVecElt = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VecTy, InsertSubreg, Y,
		DAG.getConstant(1, DL, MVT::i64));
		SDValue WideVecBitcast =
		DAG.getNode(ISD::BITCAST, DL, WideVecTy, InsertVecElt);
		SDValue ExtractWideVectorElt =
		DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, WideElTy, WideVecBitcast,
		DAG.getConstant(0, DL, MVT::i64));
		SDValue Splat = DAG.getNode(ISD::SPLAT_VECTOR, DL, ScalableWideVecTy,
		ExtractWideVectorElt);
		return DAG.getNode(ISD::BITCAST, DL, Op->getValueType(0), Splat);
		}

static SDValue performDupLane128Combine(SDNode *N, SelectionDAG &DAG) {		static SDValue performDupLane128Combine(SDNode *N, SelectionDAG &DAG) {
EVT VT = N->getValueType(0);		SDValue WideSplat = simplifyDupLane128RepeatedComplexPattern(N, DAG);
		if (WideSplat)
		return WideSplat;

		EVT VT = N->getValueType(0);
SDValue Insert = N->getOperand(0);		SDValue Insert = N->getOperand(0);
if (Insert.getOpcode() != ISD::INSERT_SUBVECTOR)		if (Insert.getOpcode() != ISD::INSERT_SUBVECTOR)
return SDValue();		return SDValue();

if (!Insert.getOperand(0).isUndef())		if (!Insert.getOperand(0).isUndef())
return SDValue();		return SDValue();

uint64_t IdxInsert = Insert.getConstantOperandVal(2);		uint64_t IdxInsert = Insert.getConstantOperandVal(2);
[... 2,482 lines not shown ...]

llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll

	[... 578 lines not shown ...]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%out = call <vscale x 2 x i64> @llvm.aarch64.sve.dupq.lane.nxv2i64(<vscale x 2 x i64> %a, i64 4)			%out = call <vscale x 2 x i64> @llvm.aarch64.sve.dupq.lane.nxv2i64(<vscale x 2 x i64> %a, i64 4)
	ret <vscale x 2 x i64> %out			ret <vscale x 2 x i64> %out
	}			}
	;			;
	; EXT			; EXT
	;			;

	define dso_local <vscale x 4 x float> @dupq_f32_repeat_complex(float %x, float %y) {			define <vscale x 4 x float> @dupq_f32_repeat_complex(float %x, float %y) {
	; CHECK-LABEL: dupq_f32_repeat_complex:			; CHECK-LABEL: dupq_f32_repeat_complex:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0			; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0
	; CHECK-NEXT: mov v2.16b, v0.16b
	; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1			; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1
	; CHECK-NEXT: mov v2.s[1], v1.s[0]			; CHECK-NEXT: mov v0.s[1], v1.s[0]
	; CHECK-NEXT: mov v2.s[2], v0.s[0]			; CHECK-NEXT: mov z0.d, d0
	; CHECK-NEXT: mov v2.s[3], v1.s[0]
	; CHECK-NEXT: mov z0.q, q2
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%1 = insertelement <4 x float> undef, float %x, i64 0			%1 = insertelement <4 x float> undef, float %x, i64 0
	%2 = insertelement <4 x float> %1, float %y, i64 1			%2 = insertelement <4 x float> %1, float %y, i64 1
	%3 = insertelement <4 x float> %2, float %x, i64 2			%3 = insertelement <4 x float> %2, float %x, i64 2
	%4 = insertelement <4 x float> %3, float %y, i64 3			%4 = insertelement <4 x float> %3, float %y, i64 3
	%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %4, i64 0)			%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %4, i64 0)
	%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)			%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)
	ret <vscale x 4 x float> %6			ret <vscale x 4 x float> %6
	}			}

	define dso_local <vscale x 8 x half> @dupq_f16_repeat_complex(half %a, half %b) {			define <vscale x 4 x float> @dupq_f32_repeat_complex_unordered(float %x, float %y) {
				; CHECK-LABEL: dupq_f32_repeat_complex_unordered:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0
				; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1
				; CHECK-NEXT: mov v0.s[1], v0.s[0]
				; CHECK-NEXT: mov v0.s[2], v1.s[0]
				; CHECK-NEXT: mov v0.s[3], v1.s[0]
				; CHECK-NEXT: mov z0.q, q0
				; CHECK-NEXT: ret
				%1 = insertelement <4 x float> undef, float %x, i64 0
				%2 = insertelement <4 x float> %1, float %y, i64 3
				%3 = insertelement <4 x float> %2, float %x, i64 1
				%4 = insertelement <4 x float> %3, float %y, i64 2
				%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %4, i64 0)
				%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)
				ret <vscale x 4 x float> %6
				}

				define <vscale x 4 x float> @dupq_f32_repeat_complex_rev(float %x, float %y) {
				; CHECK-LABEL: dupq_f32_repeat_complex_rev:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $s1 killed $s1 def $z1
				; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
				; CHECK-NEXT: mov v1.s[1], v0.s[0]
				; CHECK-NEXT: mov z0.d, d1
				; CHECK-NEXT: ret
				%1 = insertelement <4 x float> undef, float %x, i64 3
				%2 = insertelement <4 x float> %1, float %y, i64 2
				%3 = insertelement <4 x float> %2, float %x, i64 1
				%4 = insertelement <4 x float> %3, float %y, i64 0
				%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %4, i64 0)
				%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)
				ret <vscale x 4 x float> %6
				}

				define <vscale x 4 x float> @dupq_f32_repeat_complex_rev_unordered(float %x, float %y) {
				; CHECK-LABEL: dupq_f32_repeat_complex_rev_unordered:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $s1 killed $s1 def $z1
				; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
				; CHECK-NEXT: mov v1.s[1], v1.s[0]
				; CHECK-NEXT: mov v1.s[2], v0.s[0]
				; CHECK-NEXT: mov v1.s[3], v0.s[0]
				; CHECK-NEXT: mov z0.q, q1
				; CHECK-NEXT: ret
				%1 = insertelement <4 x float> undef, float %x, i64 3
				%2 = insertelement <4 x float> %1, float %y, i64 0
				%3 = insertelement <4 x float> %2, float %x, i64 2
				%4 = insertelement <4 x float> %3, float %y, i64 1
				%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %4, i64 0)
				%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)
				ret <vscale x 4 x float> %6
				}

				define <vscale x 8 x half> @dupq_f16_repeat_complex(half %a, half %b) {
	; CHECK-LABEL: dupq_f16_repeat_complex:			; CHECK-LABEL: dupq_f16_repeat_complex:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $h0 killed $h0 def $z0
				; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1
				; CHECK-NEXT: mov v0.h[1], v1.h[0]
				; CHECK-NEXT: mov z0.s, s0
				; CHECK-NEXT: ret
				%1 = insertelement <8 x half> undef, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %b, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %a, i64 6
				%8 = insertelement <8 x half> %7, half %b, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				define <vscale x 8 x half> @dupq_f16_repeat_complex_omit_pair(half %a, half %b) {
				; CHECK-LABEL: dupq_f16_repeat_complex_omit_pair:
				; CHECK: // %bb.0:
	; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0			; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
	; CHECK-NEXT: mov v2.16b, v0.16b			; CHECK-NEXT: mov v2.16b, v0.16b
	; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1			; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1
	; CHECK-NEXT: mov v2.h[1], v1.h[0]			; CHECK-NEXT: mov v2.h[1], v1.h[0]
	; CHECK-NEXT: mov v2.h[2], v0.h[0]			; CHECK-NEXT: mov v2.h[2], v0.h[0]
	; CHECK-NEXT: mov v2.h[3], v1.h[0]			; CHECK-NEXT: mov v2.h[3], v1.h[0]
				; CHECK-NEXT: mov v2.h[6], v0.h[0]
				; CHECK-NEXT: mov v2.h[7], v1.h[0]
				; CHECK-NEXT: mov z0.q, q2
				; CHECK-NEXT: ret
				%1 = insertelement <8 x half> undef, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %b, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 6
				%6 = insertelement <8 x half> %5, half %b, i64 7
				%7 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %6, i64 0)
				%8 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %7, i64 0)
				ret <vscale x 8 x half> %8
				}

				define <vscale x 8 x half> @dupq_f16_repeat_complex_mismatched_front(half %a, half %b, half %c) {
				; CHECK-LABEL: dupq_f16_repeat_complex_mismatched_front:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $h2 killed $h2 def $z2
				; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
				; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1
				; CHECK-NEXT: mov v2.h[1], v2.h[0]
				; CHECK-NEXT: mov v2.h[2], v0.h[0]
				; CHECK-NEXT: mov v2.h[3], v1.h[0]
	; CHECK-NEXT: mov v2.h[4], v0.h[0]			; CHECK-NEXT: mov v2.h[4], v0.h[0]
	; CHECK-NEXT: mov v2.h[5], v1.h[0]			; CHECK-NEXT: mov v2.h[5], v1.h[0]
	; CHECK-NEXT: mov v2.h[6], v0.h[0]			; CHECK-NEXT: mov v2.h[6], v0.h[0]
	; CHECK-NEXT: mov v2.h[7], v1.h[0]			; CHECK-NEXT: mov v2.h[7], v1.h[0]
	; CHECK-NEXT: mov z0.q, q2			; CHECK-NEXT: mov z0.q, q2
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
				%1 = insertelement <8 x half> undef, half %c, i64 0
				%2 = insertelement <8 x half> %1, half %c, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %b, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %a, i64 6
				%8 = insertelement <8 x half> %7, half %b, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				define <vscale x 8 x half> @dupq_f16_repeat_complex_mismatched_end(half %a, half %b, half %c) {
				; CHECK-LABEL: dupq_f16_repeat_complex_mismatched_end:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
				; CHECK-NEXT: mov v3.16b, v0.16b
				; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1
				; CHECK-NEXT: // kill: def $h2 killed $h2 def $q2
				; CHECK-NEXT: mov v3.h[1], v1.h[0]
				; CHECK-NEXT: mov v3.h[2], v0.h[0]
				; CHECK-NEXT: mov v3.h[3], v1.h[0]
				; CHECK-NEXT: mov v3.h[4], v0.h[0]
				; CHECK-NEXT: mov v3.h[5], v1.h[0]
				; CHECK-NEXT: mov v3.h[6], v2.h[0]
				; CHECK-NEXT: mov v3.h[7], v2.h[0]
				; CHECK-NEXT: mov z0.q, q3
				; CHECK-NEXT: ret
	%1 = insertelement <8 x half> undef, half %a, i64 0			%1 = insertelement <8 x half> undef, half %a, i64 0
	%2 = insertelement <8 x half> %1, half %b, i64 1			%2 = insertelement <8 x half> %1, half %b, i64 1
	%3 = insertelement <8 x half> %2, half %a, i64 2			%3 = insertelement <8 x half> %2, half %a, i64 2
	%4 = insertelement <8 x half> %3, half %b, i64 3			%4 = insertelement <8 x half> %3, half %b, i64 3
	%5 = insertelement <8 x half> %4, half %a, i64 4			%5 = insertelement <8 x half> %4, half %a, i64 4
	%6 = insertelement <8 x half> %5, half %b, i64 5			%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %c, i64 6
				%8 = insertelement <8 x half> %7, half %c, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				define <vscale x 8 x half> @dupq_f16_repeat_complex_mismatched_middle(half %a, half %b, half %c) {
				; CHECK-LABEL: dupq_f16_repeat_complex_mismatched_middle:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
				; CHECK-NEXT: mov v3.16b, v0.16b
				; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1
				; CHECK-NEXT: // kill: def $h2 killed $h2 def $q2
				; CHECK-NEXT: mov v3.h[1], v1.h[0]
				; CHECK-NEXT: mov v3.h[2], v0.h[0]
				; CHECK-NEXT: mov v3.h[3], v1.h[0]
				; CHECK-NEXT: mov v3.h[4], v2.h[0]
				; CHECK-NEXT: mov v3.h[5], v2.h[0]
				; CHECK-NEXT: mov v3.h[6], v0.h[0]
				; CHECK-NEXT: mov v3.h[7], v1.h[0]
				; CHECK-NEXT: mov z0.q, q3
				; CHECK-NEXT: ret
				%1 = insertelement <8 x half> undef, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %b, i64 3
				%5 = insertelement <8 x half> %4, half %c, i64 4
				%6 = insertelement <8 x half> %5, half %c, i64 5
	%7 = insertelement <8 x half> %6, half %a, i64 6			%7 = insertelement <8 x half> %6, half %a, i64 6
	%8 = insertelement <8 x half> %7, half %b, i64 7			%8 = insertelement <8 x half> %7, half %b, i64 7
	%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %8, i64 0)			%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %8, i64 0)
	%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)			%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
	ret <vscale x 8 x half> %10			ret <vscale x 8 x half> %10
	}			}

	define <vscale x 16 x i8> @ext_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {			define <vscale x 16 x i8> @ext_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {