This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Zero-overhead transfer between Neon and SVE registers
Abandoned · Public

Authored by peterwaller-arm on Jul 19 2021, 5:00 AM.

Details

Reviewers

paulwalker-arm
efriedma
bsmith
DavidTruby
sdesmalen
Matt
david-arm

Summary

A pattern for moving data from a Neon ACLE type into an SVE ACLE type
involves extracting the two double-lanes of the Neon register and
inserting them into an SVE register using two DUPs with VL1 and VL2.

This must compile to a NOP.

To achieve this, this patch extends DAGCombine to handle scalable vectors in the
INSERT_VECTOR_ELT => BUILD_VECTOR combine. Since BUILD_VECTOR does not
support scalable vectors, the insertions are pushed into a fixed
BUILD_VECTOR through an INSERT_SUBVECTOR to make it scalable again.
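
As a rough illustration of the combine described above, the following standalone C++ sketch models it outside of LLVM. Everything in it is an assumption made for illustration: `BuildVector`, `ScalableValue`, `Insert`, and `foldInserts` are hypothetical names, not SelectionDAG APIs, and the model only covers constant-index inserts into an undef vector.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a fixed-width BUILD_VECTOR: one optional operand
// per lane, where std::nullopt plays the role of an undef lane.
using BuildVector = std::vector<std::optional<double>>;

// Hypothetical stand-in for (insert_subvector undef, (build_vector ...), 0)
// on a scalable type whose minimum element count is Fixed.size().
struct ScalableValue {
  BuildVector Fixed; // lanes [0, MinNumElts); all higher lanes stay undef
};

// One constant-index insert_vector_elt in a chain.
struct Insert {
  std::size_t Index;
  double Value;
};

// Folds a chain of inserts into undef. Returns std::nullopt when an index
// falls outside the fixed part; giving up there is a choice of this model.
std::optional<ScalableValue>
foldInserts(std::size_t MinNumElts, const std::vector<Insert> &Chain) {
  BuildVector Ops(MinNumElts); // starts as all-undef
  for (const Insert &I : Chain) {
    if (I.Index >= MinNumElts)
      return std::nullopt;
    Ops[I.Index] = I.Value;
  }
  return ScalableValue{Ops};
}
```

In this model, folding the chain "insert 2.0 at lane 1, then 1.0 at lane 0" with a minimum element count of 2 yields a fully populated fixed part, which is the shape the existing BUILD_VECTOR combines can then work on.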

With this DAGCombine in place, existing BUILD_VECTOR combines are able to
neatly optimize away bitcast/extractelement/shuffle etc.

Since not all Scalable vector types are supported for INSERT_SUBVECTOR,
I introduce a TargetLoweringInfo::isInsertSubvectorLegal to query
whether to perform the combine.

Two dup => insertelement patterns are added in instCombineSVEDup:

(dup vec VL1 elem0)
=> (insertelement vec elem0 0)

(dup (dup vec VL2 elem1) VL1 elem0)
=> (insertelement (insertelement vec elem1 1) elem0 0)

... which enable the BUILD_VECTOR optimization to work.
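
The equivalence behind these two patterns can be checked with a small standalone model. This is only a sketch and not LLVM code: `Vec`, `Pred`, `ptrueVL`, `dupMerging`, `insertElement`, and `patternsAgree` are hypothetical helpers, and a fixed count of 4 lanes stands in for one possible runtime vector length.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model: a "scalable" vector of doubles frozen at 4 lanes.
constexpr std::size_t Lanes = 4;
using Vec = std::array<double, Lanes>;
using Pred = std::array<bool, Lanes>;

// Models ptrue with pattern VLn for n <= Lanes: the first N lanes are active.
Pred ptrueVL(std::size_t N) {
  assert(N <= Lanes && "model only covers patterns that fit the vector");
  Pred P{};
  for (std::size_t I = 0; I < N; ++I)
    P[I] = true;
  return P;
}

// Models the merging dup: active lanes take Elem, inactive lanes keep PassThru.
Vec dupMerging(Vec PassThru, const Pred &Pg, double Elem) {
  for (std::size_t I = 0; I < Lanes; ++I)
    if (Pg[I])
      PassThru[I] = Elem;
  return PassThru;
}

// Models insertelement with a constant index.
Vec insertElement(Vec V, double Elem, std::size_t Idx) {
  V[Idx] = Elem;
  return V;
}

// Returns true if both rewrites agree lane by lane for the given input.
bool patternsAgree(const Vec &V, double Elem0, double Elem1) {
  // (dup vec VL1 elem0) == (insertelement vec elem0 0)
  bool Single = dupMerging(V, ptrueVL(1), Elem0) == insertElement(V, Elem0, 0);
  // (dup (dup vec VL2 elem1) VL1 elem0)
  //   == (insertelement (insertelement vec elem1 1) elem0 0)
  Vec Dups = dupMerging(dupMerging(V, ptrueVL(2), Elem1), ptrueVL(1), Elem0);
  Vec Inserts = insertElement(insertElement(V, Elem1, 1), Elem0, 0);
  return Single && Dups == Inserts;
}
```

Note that in this model the inner VL2 dup also writes elem1 into lane 0; the outer VL1 dup then overwrites that lane with elem0, which is why the nested form matches two inserts.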

Reference:

"Move data between Advanced SIMD (Neon) and SVE ACLE types"
https://developer.arm.com/documentation/ka004612/latest
KA004612

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

peterwaller-arm created this revision. Jul 19 2021, 5:00 AM

Herald added subscribers: ecnelises, steven.zhang, psnobl and 3 others. Jul 19 2021, 5:00 AM

peterwaller-arm requested review of this revision. Jul 19 2021, 5:00 AM

Herald added a project: Restricted Project. Jul 19 2021, 5:00 AM

Herald added a subscriber: llvm-commits.

peterwaller-arm mentioned this in D101820: [AArch64][SVE] Extend svdup->insertelement instcombine pattern to support .... Jul 19 2021, 5:01 AM

BRB later today with fixes. Early comments welcomed on the approach.

llvm/test/CodeGen/AArch64/dag-combine-insert-elt.ll
13	Accidental double-insert-into-undef here.
26	And here.

Fix dag-combine-insert-elt.ll tests
Remove extraneous blankline

Marking inlines done.

Harbormaster completed remote builds in B114835: Diff 359757. Jul 19 2021, 7:56 AM

Can you split the instcombine changes into a separate commit? Or are they tied together in some way I'm missing?

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-opts-dup.ll
2	Please regenerate in a separate commit.

david-arm added inline comments. Jul 20 2021, 12:12 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.h
647	Can use getVectorMinNumElements here I think.

In D106265#2887638, @efriedma wrote:

Can you split the instcombine changes into a separate commit? Or are they tied together in some way I'm missing?

Happy to split them as appropriate. However, I seek feedback on the overall approach. I believe that the instcombine changes on their own constitute a regression, which is why I kept the changes in a single patch. I will hold off splitting for a moment since I want to make sure the approach is as good as we can get. Any advice would be appreciated.

One criticism I have received surrounds the introduced TLI query isInsertSubvectorLegal, which seems inelegant. The problem is that BUILD_VECTOR creation is a generic DAGCombine, so I don't know what INSERT_SUBVECTORs can be lowered by the backend. Unfortunately <vscale x 2 x half> is marked legal, so an isTypeLegal or isOperationLegalOrCustom returns 'true', but the backend is incapable of lowering it. So one alternative solution might be to improve the INSERT_SUBVECTOR lowering.
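
The size condition that the AArch64 override in this diff applies can be sketched in standalone C++ as follows. This is a simplified model rather than the LLVM implementation: `ScalableType` and `fixedIntoScalableInsertIsLegal` are hypothetical names, and the constant 128 is assumed to correspond to `AArch64::SVEBitsPerBlock`.

```cpp
#include <cassert>

// Assumption: stands in for AArch64::SVEBitsPerBlock, the 128-bit SVE granule.
constexpr unsigned SVEBitsPerBlock = 128;

// Hypothetical description of a scalable vector type <vscale x N x T>,
// where T is EltBits wide.
struct ScalableType {
  unsigned MinNumElts; // N
  unsigned EltBits;    // width of T in bits
  unsigned knownMinSizeInBits() const { return MinNumElts * EltBits; }
};

// Model of the query for a fixed-width subvector inserted into a scalable
// result: accepted only when the result fills the whole 128-bit granule.
bool fixedIntoScalableInsertIsLegal(const ScalableType &ResVT) {
  return ResVT.knownMinSizeInBits() == SVEBitsPerBlock;
}
```

Under this model <vscale x 2 x double> (2 x 64 bits) is accepted, while <vscale x 2 x half> (2 x 16 bits) is rejected even though the type itself is marked legal.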

The method as it is produces good patterns [such as (insert_subvector undef (scalar_to_vector))] but it's not possible to match them in TableGen AIUI because of the mix of scalable/fixed types, and I wasn't yet able to find something to match those patterns on as a DAGCombine which didn't result in an infinite loop. So I guess the next stop is to improve the behaviour in LowerOperation.

llvm/lib/Target/AArch64/AArch64ISelLowering.h
647	Discussed offline -- getVectorMinNumElements doesn't make sense as it would depend on the input type whereas the test is meant to match the assert in DAGToDAG insertSubReg.

An alternative approach is being considered. Feel free to resign as reviewers if you want to remove this from your queue and I will re-request in the future as appropriate.

mnadeem added a subscriber: mnadeem. Aug 31 2021, 2:31 PM

Herald added a subscriber: ctetreau. Aug 31 2021, 2:31 PM

Thanks everyone who helped with this.

Instead of the approach in this review, the new intent is to add intrinsics to the Arm C Language Extensions (ACLE) that let the user perform this conversion explicitly and that resolve directly to insert/extract subvector.

The proposal for that can be found here: https://github.com/ARM-software/acle/pull/72

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	5 lines
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	60 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.h	11 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	53 lines
llvm/test/CodeGen/AArch64/dag-combine-insert-elt.ll	56 lines
llvm/test/CodeGen/AArch64/sve-insert-element.ll	60 lines
llvm/test/CodeGen/AArch64/sve-ld-post-inc.ll	7 lines
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-opts-dup.ll	25 lines

Diff 359751

llvm/include/llvm/CodeGen/TargetLowering.h

[... 2,760 lines not shown ...]	public:
/// from this source type with this index. This is needed because		/// from this source type with this index. This is needed because
/// EXTRACT_SUBVECTOR usually has custom lowering that depends on the index of		/// EXTRACT_SUBVECTOR usually has custom lowering that depends on the index of
/// the first element, and only the target knows which lowering is cheap.		/// the first element, and only the target knows which lowering is cheap.
virtual bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,		virtual bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
unsigned Index) const {		unsigned Index) const {
return false;		return false;
}		}

		virtual bool isInsertSubvectorLegal(EVT ResVT, EVT SrcVT,
		unsigned Index) const {
		return true;
		}

/// Try to convert an extract element of a vector binary operation into an		/// Try to convert an extract element of a vector binary operation into an
/// extract element followed by a scalar operation.		/// extract element followed by a scalar operation.
virtual bool shouldScalarizeBinop(SDValue VecOp) const {		virtual bool shouldScalarizeBinop(SDValue VecOp) const {
return false;		return false;
}		}

/// Return true if extraction of a scalar element from the given vector type		/// Return true if extraction of a scalar element from the given vector type
/// at the given index is cheap. For example, if scalar operations occur on		/// at the given index is cheap. For example, if scalar operations occur on
[... 1,903 lines not shown ...]

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[... 18,300 lines not shown ...]	SDValue St1 = DAG.getStore(
St0, DL, Hi, Ptr, ST->getPointerInfo().getWithOffset(HalfValBitSize / 8),		St0, DL, Hi, Ptr, ST->getPointerInfo().getWithOffset(HalfValBitSize / 8),
ST->getOriginalAlign(), MMOFlags, AAInfo);		ST->getOriginalAlign(), MMOFlags, AAInfo);
return St1;		return St1;
}		}

/// Convert a disguised subvector insertion into a shuffle:		/// Convert a disguised subvector insertion into a shuffle:
SDValue DAGCombiner::combineInsertEltToShuffle(SDNode *N, unsigned InsIndex) {		SDValue DAGCombiner::combineInsertEltToShuffle(SDNode *N, unsigned InsIndex) {
assert(N->getOpcode() == ISD::INSERT_VECTOR_ELT &&		assert(N->getOpcode() == ISD::INSERT_VECTOR_ELT &&
"Expected extract_vector_elt");		"Expected insert_vector_elt");
SDValue InsertVal = N->getOperand(1);		SDValue InsertVal = N->getOperand(1);
SDValue Vec = N->getOperand(0);		SDValue Vec = N->getOperand(0);

// (insert_vector_elt (vector_shuffle X, Y), (extract_vector_elt X, N),		// (insert_vector_elt (vector_shuffle X, Y), (extract_vector_elt X, N),
// InsIndex)		// InsIndex)
// --> (vector_shuffle X, Y) and variations where shuffle operands may be		// --> (vector_shuffle X, Y) and variations where shuffle operands may be
// CONCAT_VECTORS.		// CONCAT_VECTORS.
if (Vec.getOpcode() == ISD::VECTOR_SHUFFLE && Vec.hasOneUse() &&		if (Vec.getOpcode() == ISD::VECTOR_SHUFFLE && Vec.hasOneUse() &&
[... 145 lines not shown ...]	if (InVec.isUndef() && TLI.shouldSplatInsEltVarIndex(VT)) {
else {		else {
SmallVector<SDValue, 8> Ops(VT.getVectorNumElements(), InVal);		SmallVector<SDValue, 8> Ops(VT.getVectorNumElements(), InVal);
return DAG.getBuildVector(VT, DL, Ops);		return DAG.getBuildVector(VT, DL, Ops);
}		}
}		}
return SDValue();		return SDValue();
}		}

if (VT.isScalableVector())
return SDValue();

unsigned NumElts = VT.getVectorNumElements();

// We must know which element is being inserted for folds below here.		// We must know which element is being inserted for folds below here.
unsigned Elt = IndexC->getZExtValue();		unsigned Elt = IndexC->getZExtValue();
		if (VT.isFixedLengthVector())
if (SDValue Shuf = combineInsertEltToShuffle(N, Elt))		if (SDValue Shuf = combineInsertEltToShuffle(N, Elt))
return Shuf;		return Shuf;

// Canonicalize insert_vector_elt dag nodes.		// Canonicalize insert_vector_elt dag nodes.
// Example:		// Example:
// (insert_vector_elt (insert_vector_elt A, Idx0), Idx1)		// (insert_vector_elt (insert_vector_elt A, Idx0), Idx1)
// -> (insert_vector_elt (insert_vector_elt A, Idx1), Idx0)		// -> (insert_vector_elt (insert_vector_elt A, Idx1), Idx0)
//		//
// Do this only if the child insert_vector node has one use; also		// Do this only if the child insert_vector node has one use; also
// do this only if indices are both constants and Idx1 < Idx0.		// do this only if indices are both constants and Idx1 < Idx0.
[... 9 lines not shown ...]	if (Elt < OtherElt) {
VT, NewOp, InVec.getOperand(1), InVec.getOperand(2));		VT, NewOp, InVec.getOperand(1), InVec.getOperand(2));
}		}
}		}

// If we can't generate a legal BUILD_VECTOR, exit		// If we can't generate a legal BUILD_VECTOR, exit
if (LegalOperations && !TLI.isOperationLegal(ISD::BUILD_VECTOR, VT))		if (LegalOperations && !TLI.isOperationLegal(ISD::BUILD_VECTOR, VT))
return SDValue();		return SDValue();

		unsigned NumElts = VT.getVectorMinNumElements();
		bool ScalableOut = false;
		EVT FixedVT = VT;
		if (VT.isScalableVector()) {
		// BUILD_VECTOR does not currently support scalable vectors. Insert into
		// BUILD_VECTOR through an INSERT_SUBVECTOR. The motivation for this is to
		// allow conversions from <M x Ty> to a <vscale x N x Ty> to become a no-op
		// where the fixed-vector originates from subregister of the scalable
		// register.
		FixedVT = EVT::getVectorVT(*DAG.getContext(), VT.getScalarType(), NumElts);

		// Can't make an insert_subvector of this.
		if (!TLI.isInsertSubvectorLegal(VT, FixedVT, 0) ||
		!TLI.isTypeLegal(FixedVT) || !TLI.isTypeLegal(VT) ||
		!TLI.isOperationLegalOrCustom(ISD::INSERT_SUBVECTOR, VT))
		return SDValue();

		if (InVec.isUndef()) {
		// InVec was undef.
		InVec = DAG.getUNDEF(FixedVT);
		} else if (InVec.getOpcode() == ISD::INSERT_SUBVECTOR &&
		InVec.getOperand(0).isUndef() &&
		InVec.getOperand(1).getOpcode() == ISD::BUILD_VECTOR &&
		InVec.getOperand(1).hasOneUse() &&
		InVec.getConstantOperandVal(2) == 0 &&
		Elt < InVec.getOperand(1)
		.getValueType()
		.getVectorMinNumElements()) {
		// InVec was (insert_subvector undef (build_vector {...}) 0).
		InVec = InVec.getOperand(1);
		// InVec now (build_vector {...}).
		}
		// Code following is as-if insertion is against FixedVectorTy BUILD_VECTOR.
		ScalableOut = true;
		}

// Check that the operand is a BUILD_VECTOR (or UNDEF, which can essentially		// Check that the operand is a BUILD_VECTOR (or UNDEF, which can essentially
// be converted to a BUILD_VECTOR). Fill in the Ops vector with the		// be converted to a BUILD_VECTOR). Fill in the Ops vector with the
// vector elements.		// vector elements.
SmallVector<SDValue, 8> Ops;		SmallVector<SDValue, 8> Ops;
// Do not combine these two vectors if the output vector will not replace		// Do not combine these two vectors if the output vector will not replace
// the input vector.		// the input vector.
if (InVec.getOpcode() == ISD::BUILD_VECTOR && InVec.hasOneUse()) {		if (InVec.getOpcode() == ISD::BUILD_VECTOR && InVec.hasOneUse()) {
Ops.append(InVec.getNode()->op_begin(),		Ops.append(InVec.getNode()->op_begin(),
InVec.getNode()->op_end());		InVec.getNode()->op_end());
} else if (InVec.isUndef()) {		} else if (InVec.isUndef()) {
Ops.append(NumElts, DAG.getUNDEF(InVal.getValueType()));		Ops.append(NumElts, DAG.getUNDEF(InVal.getValueType()));
} else {		} else {
return SDValue();		return SDValue();
}		}
assert(Ops.size() == NumElts && "Unexpected vector size");		assert(Ops.size() == NumElts && "Unexpected vector size");

// Insert the element		// Insert the element
if (Elt < Ops.size()) {		if (Elt < Ops.size()) {
// All the operands of BUILD_VECTOR must have the same type;		// All the operands of BUILD_VECTOR must have the same type;
// we enforce that here.		// we enforce that here.
EVT OpVT = Ops[0].getValueType();		EVT OpVT = Ops[0].getValueType();
Ops[Elt] = OpVT.isInteger() ? DAG.getAnyExtOrTrunc(InVal, DL, OpVT) : InVal;		Ops[Elt] = OpVT.isInteger() ? DAG.getAnyExtOrTrunc(InVal, DL, OpVT) : InVal;
}		}

// Return the new vector		SDValue Ret = DAG.getBuildVector(FixedVT, DL, Ops);
return DAG.getBuildVector(VT, DL, Ops);		if (ScalableOut) {
		// There is no scalable build_vector, so use (insert_subvector vec
		// (build_vector {...}) 0).
		SDValue Zero =
		DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout()));
		Ret =
		DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT), Ret, Zero);
		}
		return Ret;
}		}

SDValue DAGCombiner::scalarizeExtractedVectorLoad(SDNode *EVE, EVT InVecVT,		SDValue DAGCombiner::scalarizeExtractedVectorLoad(SDNode *EVE, EVT InVecVT,
SDValue EltNo,		SDValue EltNo,
LoadSDNode *OriginalLoad) {		LoadSDNode *OriginalLoad) {
assert(OriginalLoad->isSimple());		assert(OriginalLoad->isSimple());

EVT ResultVT = EVE->getValueType(0);		EVT ResultVT = EVE->getValueType(0);
[... 4,842 lines not shown ...]

llvm/lib/Target/AArch64/AArch64ISelLowering.h

[... 633 lines not shown ...]	public:
bool shouldConvertConstantLoadToIntImm(const APInt &Imm,		bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
Type *Ty) const override;		Type *Ty) const override;

/// Return true if EXTRACT_SUBVECTOR is cheap for this result type		/// Return true if EXTRACT_SUBVECTOR is cheap for this result type
/// with this index.		/// with this index.
bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,		bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
unsigned Index) const override;		unsigned Index) const override;

		bool isInsertSubvectorLegal(EVT ResVT, EVT SrcVT,
		unsigned Index) const override {
		if (ResVT.isScalableVector() && SrcVT.isFixedLengthVector()) {
		// Fixed insert into Scalable only legal if the scalable result occupies
		// the full vector granule.
		return ResVT.getSizeInBits().getKnownMinSize() ==
		AArch64::SVEBitsPerBlock;
		}
		return true;
		}

bool shouldFormOverflowOp(unsigned Opcode, EVT VT,		bool shouldFormOverflowOp(unsigned Opcode, EVT VT,
bool MathUsed) const override {		bool MathUsed) const override {
// Using overflow ops for overflow checks only should beneficial on		// Using overflow ops for overflow checks only should beneficial on
// AArch64.		// AArch64.
return TargetLowering::shouldFormOverflowOp(Opcode, VT, true);		return TargetLowering::shouldFormOverflowOp(Opcode, VT, true);
}		}

Value *emitLoadLinked(IRBuilderBase &Builder, Type *ValueTy, Value *Addr,		Value *emitLoadLinked(IRBuilderBase &Builder, Type *ValueTy, Value *Addr,
[... 472 lines not shown ...]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[... 412 lines not shown ...]	static Optional<Instruction *> instCombineConvertFromSVBool(InstCombiner &IC,
if (!EarliestReplacement)		if (!EarliestReplacement)
return None;		return None;

return IC.replaceInstUsesWith(II, EarliestReplacement);		return IC.replaceInstUsesWith(II, EarliestReplacement);
}		}

static Optional<Instruction *> instCombineSVEDup(InstCombiner &IC,		static Optional<Instruction *> instCombineSVEDup(InstCombiner &IC,
IntrinsicInst &II) {		IntrinsicInst &II) {
IntrinsicInst *Pg = dyn_cast<IntrinsicInst>(II.getArgOperand(1));		assert(II.getIntrinsicID() == Intrinsic::aarch64_sve_dup &&
if (!Pg)		"Expected SVE DUP!");
return None;

if (Pg->getIntrinsicID() != Intrinsic::aarch64_sve_ptrue)		auto *Ty = cast<ScalableVectorType>(II.getType());
return None;		assert(Ty && "non-scalable result of scalable intrinsic");

const auto PTruePattern =		// Only consider dup producing <2 x double>.
cast<ConstantInt>(Pg->getOperand(0))->getZExtValue();		if (!Ty || !Ty->getScalarType()->isDoubleTy() || Ty->getMinNumElements() != 2)
if (PTruePattern != AArch64SVEPredPattern::vl1)
return None;		return None;

// The intrinsic is inserting into lane zero so use an insert instead.		LLVMContext &Ctx = II.getContext();
		IRBuilder<> Builder(Ctx);
		Builder.SetInsertPoint(&II);
auto *IdxTy = Type::getInt64Ty(II.getContext());		auto *IdxTy = Type::getInt64Ty(II.getContext());
auto *Insert = InsertElementInst::Create(
II.getArgOperand(0), II.getArgOperand(2), ConstantInt::get(IdxTy, 0));
Insert->insertBefore(&II);
Insert->takeName(&II);

return IC.replaceInstUsesWith(II, Insert);		auto m_PTrue = [](auto Pat) {
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'm_PTrue' [readability-identifier-naming]
		return m_Intrinsic<Intrinsic::aarch64_sve_ptrue>(Pat);
		};
		auto m_Dup = [](auto PassThru, auto Pg, auto Val) {
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'm_Dup' [readability-identifier-naming]
		return m_Intrinsic<Intrinsic::aarch64_sve_dup>(PassThru, Pg, Val);
		};

		auto VL2 = m_PTrue(m_SpecificInt(AArch64SVEPredPattern::vl2));
		auto VL1 = m_PTrue(m_SpecificInt(AArch64SVEPredPattern::vl1));
		Value *Out = nullptr, *PassThru, *Elem0, *Elem1;
		if (match(&II, m_Dup(m_Value(PassThru), VL1, m_Value(Elem0)))) {
		// (dup vec VL1 elem0) => (insertelement vec elem0 0)
		Out = PassThru;
		Out = Builder.CreateInsertElement(PassThru, Elem0,
		ConstantInt::get(IdxTy, 0));
		} else if (match(&II, m_Dup(m_Dup(m_Value(PassThru), VL2, m_Value(Elem1)),
		VL1, m_Value(Elem0)))) {
		// (dup (dup vec VL2 elem1) VL1 elem0) => (insertelement (insertelement vec
		// elem1 1) elem0 0)
		Out = PassThru;
		Out = Builder.CreateInsertElement(PassThru, Elem0,
		ConstantInt::get(IdxTy, 0));
		Out = Builder.CreateInsertElement(PassThru, Elem1,
		ConstantInt::get(IdxTy, 0));
		}

		if (!Out)
		return None;
		Out->takeName(&II);
		return IC.replaceInstUsesWith(II, Out);
}		}

static Optional<Instruction *> instCombineSVECmpNE(InstCombiner &IC,		static Optional<Instruction *> instCombineSVECmpNE(InstCombiner &IC,
IntrinsicInst &II) {		IntrinsicInst &II) {
LLVMContext &Ctx = II.getContext();		LLVMContext &Ctx = II.getContext();
IRBuilder<> Builder(Ctx);		IRBuilder<> Builder(Ctx);
Builder.SetInsertPoint(&II);		Builder.SetInsertPoint(&II);

[... 1,614 lines not shown ...]

llvm/test/CodeGen/AArch64/dag-combine-insert-elt.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"
				attributes #0 = {"target-features"="+sve"}

				define <vscale x 2 x double> @two_inserts(<vscale x 2 x double> %in) #0 {
				; CHECK-LABEL: two_inserts:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmov v0.2d, #1.00000000
				; CHECK-NEXT: ret
				%b = insertelement <vscale x 2 x double> undef, double 0.0, i32 0
				%a = insertelement <vscale x 2 x double> undef, double 1.0, i32 1
				ret <vscale x 2 x double> %a
				}


				define <vscale x 2 x double> @neon_to_sve(<2 x double> %in) #0 {
				; CHECK-LABEL: neon_to_sve:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ret
				%e0 = extractelement <2 x double> %in, i32 0
				%e1 = extractelement <2 x double> %in, i32 1
				%b = insertelement <vscale x 2 x double> undef, double %e0, i32 0
				%a = insertelement <vscale x 2 x double> undef, double %e1, i32 1
				ret <vscale x 2 x double> %a
				}

				; Function Attrs: nofree norecurse nosync nounwind readnone uwtable willreturn mustprogress vscale_range(4,4)
				define dso_local <vscale x 4 x float> @float32x4_t_to_svfloat32_t(<4 x float> %nv) local_unnamed_addr #0 {
				; CHECK-LABEL: float32x4_t_to_svfloat32_t:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ret
				entry:
				%lane0_floats = shufflevector <4 x float> %nv, <4 x float> undef, <2 x i32> <i32 0, i32 1>
				%lane1_floats = shufflevector <4 x float> %nv, <4 x float> undef, <2 x i32> <i32 2, i32 3>

				%0 = bitcast <2 x float> %lane0_floats to <1 x double>
				%1 = bitcast <2 x float> %lane1_floats to <1 x double>

				%lane0 = extractelement <1 x double> %0, i32 0
				%lane1 = extractelement <1 x double> %1, i32 0

				%b = insertelement <vscale x 2 x double> undef, double %lane1, i32 1
				%a = insertelement <vscale x 2 x double> %b, double %lane0, i32 0

				%out = bitcast <vscale x 2 x double> %a to <vscale x 4 x float>
				ret <vscale x 4 x float> %out
				}

				declare <vscale x 2 x double> @llvm.experimental.vector.insert.nxv2f64.v2f64(<vscale x 2 x double>, <2 x double>, i64 immarg)

				declare <vscale x 2 x double> @llvm.aarch64.sve.dup.nxv2f64(<vscale x 2 x double>, <vscale x 2 x i1>, double)
				declare <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 immarg)

llvm/test/CodeGen/AArch64/sve-insert-element.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s		; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s

define <vscale x 16 x i8> @test_lane0_16xi8(<vscale x 16 x i8> %a) {		define <vscale x 16 x i8> @test_lane0_16xi8(<vscale x 16 x i8> %a) {
; CHECK-LABEL: test_lane0_16xi8:		; CHECK-LABEL: test_lane0_16xi8:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ptrue p0.b, vl1		; CHECK-NEXT: ptrue p0.b, vl1
; CHECK-NEXT: mov w8, #30		; CHECK-NEXT: mov w8, #30
; CHECK-NEXT: mov z0.b, p0/m, w8		; CHECK-NEXT: mov z0.b, p0/m, w8
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%b = insertelement <vscale x 16 x i8> %a, i8 30, i32 0		%b = insertelement <vscale x 16 x i8> %a, i8 30, i32 0
ret <vscale x 16 x i8> %b		ret <vscale x 16 x i8> %b
}		}


define <vscale x 8 x i16> @test_lane0_8xi16(<vscale x 8 x i16> %a) {		define <vscale x 8 x i16> @test_lane0_8xi16(<vscale x 8 x i16> %a) {
; CHECK-LABEL: test_lane0_8xi16:		; CHECK-LABEL: test_lane0_8xi16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ptrue p0.h, vl1		; CHECK-NEXT: ptrue p0.h, vl1
; CHECK-NEXT: mov w8, #30		; CHECK-NEXT: mov w8, #30
; CHECK-NEXT: mov z0.h, p0/m, w8		; CHECK-NEXT: mov z0.h, p0/m, w8
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%b = insertelement <vscale x 8 x i16> %a, i16 30, i32 0		%b = insertelement <vscale x 8 x i16> %a, i16 30, i32 0
[... 28 lines not shown ...]
; CHECK-NEXT: fmov d1, #1.00000000		; CHECK-NEXT: fmov d1, #1.00000000
; CHECK-NEXT: ptrue p0.d, vl1		; CHECK-NEXT: ptrue p0.d, vl1
; CHECK-NEXT: mov z0.d, p0/m, z1.d		; CHECK-NEXT: mov z0.d, p0/m, z1.d
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%b = insertelement <vscale x 2 x double> %a, double 1.0, i32 0		%b = insertelement <vscale x 2 x double> %a, double 1.0, i32 0
ret <vscale x 2 x double> %b		ret <vscale x 2 x double> %b
}		}

		define <vscale x 2 x double> @test_lane01_2xf64(<vscale x 2 x double> %a) {
		; CHECK-LABEL: test_lane01_2xf64:
		; CHECK: // %bb.0:
		; CHECK-NEXT: mov w8, #1
		; CHECK-NEXT: mov x9, #4636737291354636288
		; CHECK-NEXT: mov x10, #70368744177664
		; CHECK-NEXT: index z1.d, #0, #1
		; CHECK-NEXT: ptrue p0.d
		; CHECK-NEXT: ptrue p1.d, vl1
		; CHECK-NEXT: movk x10, #16473, lsl #48
		; CHECK-NEXT: mov z2.d, x8
		; CHECK-NEXT: fmov d3, x9
		; CHECK-NEXT: cmpeq p0.d, p0/z, z1.d, z2.d
		; CHECK-NEXT: mov z0.d, p1/m, z3.d
		; CHECK-NEXT: fmov d1, x10
		; CHECK-NEXT: mov z0.d, p0/m, d1
		; CHECK-NEXT: ret
		%b = insertelement <vscale x 2 x double> %a, double 101.0, i32 1
		%c = insertelement <vscale x 2 x double> %b, double 100.0, i32 0
		ret <vscale x 2 x double> %c
		}

		define <vscale x 2 x double> @test_lane012_2xf64(<vscale x 2 x double> %a) {
		; CHECK-LABEL: test_lane012_2xf64:
		; CHECK: // %bb.0:
		; CHECK-NEXT: mov x8, #4636737291354636288
		; CHECK-NEXT: fmov d1, x8
		; CHECK-NEXT: mov w8, #2
		; CHECK-NEXT: ptrue p1.d, vl1
		; CHECK-NEXT: index z2.d, #0, #1
		; CHECK-NEXT: ptrue p0.d
		; CHECK-NEXT: mov z0.d, p1/m, z1.d
		; CHECK-NEXT: mov z1.d, x8
		; CHECK-NEXT: mov w8, #1
		; CHECK-NEXT: cmpeq p1.d, p0/z, z2.d, z1.d
		; CHECK-NEXT: mov z1.d, x8
		; CHECK-NEXT: mov x8, #70368744177664
		; CHECK-NEXT: movk x8, #16473, lsl #48
		; CHECK-NEXT: cmpeq p0.d, p0/z, z2.d, z1.d
		; CHECK-NEXT: fmov d1, x8
		; CHECK-NEXT: mov x8, #140737488355328
		; CHECK-NEXT: movk x8, #16473, lsl #48
		; CHECK-NEXT: mov z0.d, p0/m, d1
		; CHECK-NEXT: fmov d1, x8
		; CHECK-NEXT: mov z0.d, p1/m, d1
		; CHECK-NEXT: ret
		%b = insertelement <vscale x 2 x double> %a, double 102.0, i32 2
		%c = insertelement <vscale x 2 x double> %b, double 101.0, i32 1
		%d = insertelement <vscale x 2 x double> %c, double 100.0, i32 0
		ret <vscale x 2 x double> %d
		}

define <vscale x 4 x float> @test_lane0_4xf32(<vscale x 4 x float> %a) {		define <vscale x 4 x float> @test_lane0_4xf32(<vscale x 4 x float> %a) {
; CHECK-LABEL: test_lane0_4xf32:		; CHECK-LABEL: test_lane0_4xf32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: fmov s1, #1.00000000		; CHECK-NEXT: fmov s1, #1.00000000
; CHECK-NEXT: ptrue p0.s, vl1		; CHECK-NEXT: ptrue p0.s, vl1
; CHECK-NEXT: mov z0.s, p0/m, z1.s		; CHECK-NEXT: mov z0.s, p0/m, z1.s
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%b = insertelement <vscale x 4 x float> %a, float 1.0, i32 0		%b = insertelement <vscale x 4 x float> %a, float 1.0, i32 0
[... 83 lines not shown ...]	; CHECK-NEXT: ret
%b = extractelement <vscale x 4 x i32> %a, i32 2		%b = extractelement <vscale x 4 x i32> %a, i32 2
%c = insertelement <vscale x 4 x i32> %a, i32 %b, i32 2		%c = insertelement <vscale x 4 x i32> %a, i32 %b, i32 2
ret <vscale x 4 x i32> %c		ret <vscale x 4 x i32> %c
}		}

define <vscale x 8 x i16> @test_lane6_undef_8xi16(i16 %a) {		define <vscale x 8 x i16> @test_lane6_undef_8xi16(i16 %a) {
; CHECK-LABEL: test_lane6_undef_8xi16:		; CHECK-LABEL: test_lane6_undef_8xi16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: mov w8, #6		; CHECK-NEXT: dup v0.8h, w0
; CHECK-NEXT: index z0.h, #0, #1
; CHECK-NEXT: mov z1.h, w8
; CHECK-NEXT: ptrue p0.h
; CHECK-NEXT: cmpeq p0.h, p0/z, z0.h, z1.h
; CHECK-NEXT: mov z0.h, p0/m, w0
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%b = insertelement <vscale x 8 x i16> undef, i16 %a, i32 6		%b = insertelement <vscale x 8 x i16> undef, i16 %a, i32 6
ret <vscale x 8 x i16> %b		ret <vscale x 8 x i16> %b
}		}

define <vscale x 16 x i8> @test_lane0_undef_16xi8(i8 %a) {		define <vscale x 16 x i8> @test_lane0_undef_16xi8(i8 %a) {
; CHECK-LABEL: test_lane0_undef_16xi8:		; CHECK-LABEL: test_lane0_undef_16xi8:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
[... 359 lines not shown ...]

llvm/test/CodeGen/AArch64/sve-ld-post-inc.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s			; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s

	; These tests are here to ensure we don't get a selection error caused			; These tests are here to ensure we don't get a selection error caused
	; by performPostLD1Combine, which should bail out if the return			; by performPostLD1Combine, which should bail out if the return
	; type is a scalable vector			; type is a scalable vector

	define <vscale x 4 x i32> @test_post_ld1_insert(i32* %a, i32** %ptr, i64 %inc) {			define <vscale x 4 x i32> @test_post_ld1_insert(i32* %a, i32** %ptr, i64 %inc) {
	; CHECK-LABEL: test_post_ld1_insert:			; CHECK-LABEL: test_post_ld1_insert:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ldr w8, [x0]			; CHECK-NEXT: ldr s0, [x0]
	; CHECK-NEXT: add x9, x0, x2, lsl #2			; CHECK-NEXT: add x8, x0, x2, lsl #2
	; CHECK-NEXT: str x9, [x1]			; CHECK-NEXT: str x8, [x1]
	; CHECK-NEXT: fmov s0, w8
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = load i32, i32* %a			%load = load i32, i32* %a
	%ins = insertelement <vscale x 4 x i32> undef, i32 %load, i32 0			%ins = insertelement <vscale x 4 x i32> undef, i32 %load, i32 0
	%gep = getelementptr i32, i32* %a, i64 %inc			%gep = getelementptr i32, i32* %a, i64 %inc
	store i32* %gep, i32** %ptr			store i32* %gep, i32** %ptr
	ret <vscale x 4 x i32> %ins			ret <vscale x 4 x i32> %ins
	}			}

	[... 16 lines not shown ...]

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-opts-dup.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -instcombine < %s | FileCheck %s			; RUN: opt -S -instcombine < %s | FileCheck %s

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	define <vscale x 16 x i8> @dup_insertelement_0(<vscale x 16 x i8> %v, i8 %s) #0 {			define <vscale x 16 x i8> @dup_insertelement_0(<vscale x 16 x i8> %v, i8 %s) #0 {
	; CHECK-LABEL: @dup_insertelement_0(			; CHECK-LABEL: @dup_insertelement_0(
	; CHECK: %insert = insertelement <vscale x 16 x i8> %v, i8 %s, i64 0			; CHECK-NEXT: [[PG:%.*]] = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
	; CHECK-NEXT: ret <vscale x 16 x i8> %insert			; CHECK-NEXT: [[INSERT:%.*]] = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> [[V:%.*]], <vscale x 16 x i1> [[PG]], i8 [[S:%.*]])
				; CHECK-NEXT: ret <vscale x 16 x i8> [[INSERT]]
				;
	%pg = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)			%pg = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
	%insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)			%insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)
	ret <vscale x 16 x i8> %insert			ret <vscale x 16 x i8> %insert
	}			}

	define <vscale x 16 x i8> @dup_insertelement_1(<vscale x 16 x i8> %v, i8 %s) #0 {			define <vscale x 16 x i8> @dup_insertelement_1(<vscale x 16 x i8> %v, i8 %s) #0 {
	; CHECK-LABEL: @dup_insertelement_1(			; CHECK-LABEL: @dup_insertelement_1(
	; CHECK: %pg = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)			; CHECK-NEXT: [[PG:%.*]] = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
	; CHECK-NEXT: %insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)			; CHECK-NEXT: [[INSERT:%.*]] = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> [[V:%.*]], <vscale x 16 x i1> [[PG]], i8 [[S:%.*]])
	; CHECK-NEXT: ret <vscale x 16 x i8> %insert			; CHECK-NEXT: ret <vscale x 16 x i8> [[INSERT]]
				;
	%pg = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)			%pg = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
	%insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)			%insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)
	ret <vscale x 16 x i8> %insert			ret <vscale x 16 x i8> %insert
	}			}

	define <vscale x 16 x i8> @dup_insertelement_x(<vscale x 16 x i8> %v, i8 %s, <vscale x 16 x i1> %pg) #0 {			define <vscale x 16 x i8> @dup_insertelement_x(<vscale x 16 x i8> %v, i8 %s, <vscale x 16 x i1> %pg) #0 {
	; CHECK-LABEL: @dup_insertelement_x(			; CHECK-LABEL: @dup_insertelement_x(
	; CHECK: %insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)			; CHECK-NEXT: [[INSERT:%.*]] = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> [[V:%.*]], <vscale x 16 x i1> [[PG:%.*]], i8 [[S:%.*]])
	; CHECK-NEXT: ret <vscale x 16 x i8> %insert			; CHECK-NEXT: ret <vscale x 16 x i8> [[INSERT]]
				;
	%insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)			%insert = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg, i8 %s)
	ret <vscale x 16 x i8> %insert			ret <vscale x 16 x i8> %insert
	}			}

	define <vscale x 8 x i16> @dup_insertelement_0_convert(<vscale x 8 x i16> %v, i16 %s) #0 {			define <vscale x 8 x i16> @dup_insertelement_0_convert(<vscale x 8 x i16> %v, i16 %s) #0 {
	; CHECK-LABEL: @dup_insertelement_0_convert(			; CHECK-LABEL: @dup_insertelement_0_convert(
	; CHECK: %insert = insertelement <vscale x 8 x i16> %v, i16 %s, i64 0			; CHECK-NEXT: [[PG:%.*]] = tail call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 1)
	; CHECK-NEXT: ret <vscale x 8 x i16> %insert			; CHECK-NEXT: [[INSERT:%.*]] = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> [[V:%.*]], <vscale x 8 x i1> [[PG]], i16 [[S:%.*]])
				; CHECK-NEXT: ret <vscale x 8 x i16> [[INSERT]]
				;
	%pg = tail call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 1)			%pg = tail call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 1)
	%1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %pg)			%1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %pg)
	%2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %1)			%2 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %1)
	%insert = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> %v, <vscale x 8 x i1> %2, i16 %s)			%insert = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> %v, <vscale x 8 x i1> %2, i16 %s)
	ret <vscale x 8 x i16> %insert			ret <vscale x 8 x i16> %insert
	}			}

	declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)			declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)
