This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Fix cost estimation for buildvectors with extracts and/or constants.
Closed · Public

Authored by ABataev on Apr 14 2023, 12:00 PM.

Details

Reviewers

dmgreen
vdmitrie
RKSimon

Commits

rG8cf0290c4a47: [SLP]Fix cost estimation for buildvectors with extracts and/or constants.

Summary

If a partial match is found and some other scalars must still be
inserted, account for the cost of the extractelements (transformed
to shuffles) and/or the reused entries, and correctly calculate the cost
of inserting constants into non-poison vectors.
Also fix the cost calculation for the final gather/buildvector sequence.
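
To make the summary concrete, here is a minimal, self-contained sketch of the gather-cost rule, modelled loosely on the reworked BoUpSLP::getGatherCost in this revision. The Scalar struct, the unit costs, and gatherCost itself are hypothetical stand-ins, not LLVM APIs; the real code asks TTI for insert, scalarization, and shuffle costs instead of using fixed constants.

```cpp
#include <set>
#include <vector>

// Hypothetical description of one scalar lane (not an LLVM type).
struct Scalar {
  int Id;          // equal Ids mean "the same value"
  bool IsConstant; // lane holds a constant
  bool IsUndef;    // lane holds undef/poison
};

// Assumed unit costs, chosen only for illustration.
constexpr int ToyInsertCost = 1;
constexpr int ToyPermuteCost = 1;

// Toy gather cost. ForPoisonSrc is true when the buildvector starts from a
// poison vector, false when scalars are inserted into a defined vector.
int gatherCost(const std::vector<Scalar> &VL, bool ForPoisonSrc) {
  std::set<int> Unique;
  bool DuplicateNonConst = false;
  int Cost = 0;
  for (const Scalar &S : VL) {
    // Undef lanes are always free. Constants are free only when the
    // source is poison; otherwise they need a real insert.
    if (S.IsUndef || (ForPoisonSrc && S.IsConstant))
      continue;
    // A repeated value is not inserted again; it is covered by one
    // extra single-source permute added below.
    if (!Unique.insert(S.Id).second) {
      DuplicateNonConst = true;
      continue;
    }
    Cost += ToyInsertCost;
  }
  if (DuplicateNonConst)
    Cost += ToyPermuteCost;
  return Cost;
}
```

For VL = {a, b, 0, a}, this sketch charges 3 when the source is poison (two inserts plus one permute, the constant is free) and 4 when the source is a defined vector (the constant now needs its own insert).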

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. Apr 14 2023, 12:00 PM

Herald added a project: Restricted Project. Apr 14 2023, 12:00 PM

Herald added subscribers: vporpo, hiraditya.

ABataev requested review of this revision. Apr 14 2023, 12:00 PM

Herald added a subscriber: pcwang-thead.

Harbormaster completed remote builds in B225693: Diff 513704. Apr 14 2023, 12:44 PM

Thanks for looking at this. The performance certainly seems to have improved.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6984	Is this to work around the AArch64 cost model?
8799–8800	This comment looks out of date now.

ABataev added inline comments. Apr 17 2023, 12:38 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6984	Yes. The AArch64 cost model has a strange estimation: if HasRealUse == false, the cost of an insertelement is 0. The original estimation costs the deleted extractelement instruction at 3 but the insertelement instruction at 0. It would actually be good to fix this in the AArch64 cost model: the insert should be considered free only if operand 0 is undef/poison; otherwise its cost is not zero. I'm working on another solution that should generate better shuffles; I hope it will fix the regression for AArch64 and improve the final emission for other targets.
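
The rule described in this comment can be sketched as a small standalone model. Everything here is hypothetical (the enums, hasRealUse, vectorInstrCost, and the base cost of 3, which merely mirrors the -aarch64-insert-extract-base-cost=3 flag used in the test); it ignores the element type and is not the actual AArch64TTIImpl code.

```cpp
// Hypothetical stand-ins for the opcode and for operand 0 of an insert.
enum class Opcode { InsertElement, ExtractElement };
enum class VecOperand { None, UndefOrPoison, Defined };

// Assumed base cost for a lane insert/extract in this toy model.
constexpr int ToyBaseCost = 3;

// An insert counts as a "real use" only when it writes into a vector
// that already holds defined data.
bool hasRealUse(Opcode Op, VecOperand Op0) {
  return Op == Opcode::InsertElement && Op0 == VecOperand::Defined;
}

// Toy lane cost: lane 0 is treated as free unless the instruction is a
// real use; every other lane pays the base cost.
int vectorInstrCost(Opcode Op, unsigned Index, VecOperand Op0) {
  if (Index == 0 && !hasRealUse(Op, Op0))
    return 0;
  return ToyBaseCost;
}
```

Under these assumptions an insert into lane 0 of a poison vector costs 0, while the same insert into a defined vector costs 3, so replacing a cost-3 extractelement with an insert into a defined vector no longer looks like a pure gain.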

Can you please rebase after D148279?

Restored original extractcost calculation, reworked estimation of the buildvector with non-undef initial vector.

Harbormaster completed remote builds in B226389: Diff 514658. Apr 18 2023, 9:21 AM

LGTM with one optional minor

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2275–2277	Pull out the HasRealUse computation: bool HasRealUse = Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0); return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);

This revision is now accepted and ready to land. Apr 19 2023, 2:46 AM

This revision was landed with ongoing or failed builds. Apr 19 2023, 5:57 AM

Closed by commit rG8cf0290c4a47: [SLP]Fix cost estimation for buildvectors with extracts and/or constants. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rG8cf0290c4a47: [SLP]Fix cost estimation for buildvectors with extracts and/or constants.

dmgreen added inline comments. Apr 19 2023, 7:53 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6984	Yeah that code has always been a bit off. I think once upon a time someone accidentally applied the "zero-lane insert/extract cost 0" to integers as well as floats, and since then it has happened to give better performance in many cases to keep the inaccuracy around. I will look into removing it if I can.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	4 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	82 lines
llvm/test/Transforms/SLPVectorizer/AArch64/extractelements-to-shuffle.ll	36 lines
llvm/test/Transforms/SLPVectorizer/AArch64/fshl.ll	27 lines
llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll	43 lines

Diff 514925

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

...	InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
// All other insert/extracts cost this much.		// All other insert/extracts cost this much.
return ST->getVectorInsertExtractBaseCost();		return ST->getVectorInsertExtractBaseCost();
}		}

InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,		InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
unsigned Index, Value *Op0,		unsigned Index, Value *Op0,
Value *Op1) {		Value *Op1) {
return getVectorInstrCostHelper(nullptr, Val, Index, false /* HasRealUse */);		bool HasRealUse =
		Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
		return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
}		}

InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,		InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
Type *Val,		Type *Val,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
unsigned Index) {		unsigned Index) {
return getVectorInstrCostHelper(&I, Val, Index, true /* HasRealUse */);		return getVectorInstrCostHelper(&I, Val, Index, true /* HasRealUse */);
}		}

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

...	private:
/// \p E.		/// \p E.
Value *vectorizeOperand(TreeEntry *E, unsigned NodeIdx);		Value *vectorizeOperand(TreeEntry *E, unsigned NodeIdx);

/// Create a new vector from a list of scalar values. Produces a sequence		/// Create a new vector from a list of scalar values. Produces a sequence
/// which exploits values reused across lanes, and arranges the inserts		/// which exploits values reused across lanes, and arranges the inserts
/// for ease of later optimization.		/// for ease of later optimization.
Value *createBuildVector(const TreeEntry *E);		Value *createBuildVector(const TreeEntry *E);

/// \returns the scalarization cost for this type. Scalarization in this
/// context means the creation of vectors from a group of scalars. If \p
/// NeedToShuffle is true, need to add a cost of reshuffling some of the
/// vector elements.
InstructionCost getGatherCost(FixedVectorType *Ty,
const APInt &ShuffledIndices,
bool NeedToShuffle) const;

/// Returns the instruction in the bundle, which can be used as a base point		/// Returns the instruction in the bundle, which can be used as a base point
/// for scheduling. Usually it is the last instruction in the bundle, except		/// for scheduling. Usually it is the last instruction in the bundle, except
/// for the case when all operands are external (in this case, it is the first		/// for the case when all operands are external (in this case, it is the first
/// instruction in the list).		/// instruction in the list).
Instruction &getLastInstructionInBundle(const TreeEntry *E);		Instruction &getLastInstructionInBundle(const TreeEntry *E);

/// Checks if the gathered \p VL can be represented as shuffle(s) of previous		/// Checks if the gathered \p VL can be represented as shuffle(s) of previous
/// tree entries.		/// tree entries.
/// \param TE Tree entry checked for permutation.		/// \param TE Tree entry checked for permutation.
/// \param VL List of scalars (a subset of the TE scalar), checked for		/// \param VL List of scalars (a subset of the TE scalar), checked for
/// permutations.		/// permutations.
/// \returns ShuffleKind, if gathered values can be represented as shuffles of		/// \returns ShuffleKind, if gathered values can be represented as shuffles of
/// previous tree entries. \p Mask is filled with the shuffle mask.		/// previous tree entries. \p Mask is filled with the shuffle mask.
std::optional<TargetTransformInfo::ShuffleKind>		std::optional<TargetTransformInfo::ShuffleKind>
isGatherShuffledEntry(const TreeEntry *TE, ArrayRef<Value *> VL,		isGatherShuffledEntry(const TreeEntry *TE, ArrayRef<Value *> VL,
SmallVectorImpl<int> &Mask,		SmallVectorImpl<int> &Mask,
SmallVectorImpl<const TreeEntry *> &Entries);		SmallVectorImpl<const TreeEntry *> &Entries);

/// \returns the scalarization cost for this list of values. Assuming that		/// \returns the scalarization cost for this list of values. Assuming that
/// this subtree gets vectorized, we may need to extract the values from the		/// this subtree gets vectorized, we may need to extract the values from the
/// roots. This method calculates the cost of extracting the values.		/// roots. This method calculates the cost of extracting the values.
InstructionCost getGatherCost(ArrayRef<Value *> VL) const;		/// \param ForPoisonSrc true if initial vector is poison, false otherwise.
		InstructionCost getGatherCost(ArrayRef<Value *> VL, bool ForPoisonSrc) const;

/// Set the Builder insert point to one after the last instruction in		/// Set the Builder insert point to one after the last instruction in
/// the bundle		/// the bundle
void setInsertPointAfterBundle(const TreeEntry *E);		void setInsertPointAfterBundle(const TreeEntry *E);

/// \returns a vector from a collection of scalars in \p VL. if \p Root is not		/// \returns a vector from a collection of scalars in \p VL. if \p Root is not
/// specified, the starting vector value is poison.		/// specified, the starting vector value is poison.
Value *gather(ArrayRef<Value *> VL, Value *Root = nullptr);		Value *gather(ArrayRef<Value *> VL, Value *Root = nullptr);
...	if (VL.size() > 2 && S.getOpcode() == Instruction::Load &&
PoisonValue::get(VecTy), *It);		PoisonValue::get(VecTy), *It);
return InsertCost +		return InsertCost +
(NeedShuffle ? TTI.getShuffleCost(		(NeedShuffle ? TTI.getShuffleCost(
TargetTransformInfo::SK_Broadcast, VecTy,		TargetTransformInfo::SK_Broadcast, VecTy,
/*Mask=*/std::nullopt, CostKind, /*Index=*/0,		/*Mask=*/std::nullopt, CostKind, /*Index=*/0,
/*SubTp=*/nullptr, /*Args=*/*It)		/*SubTp=*/nullptr, /*Args=*/*It)
: TTI::TCC_Free);		: TTI::TCC_Free);
}		}
return GatherCost + (all_of(Gathers, UndefValue::classof)		return GatherCost +
		(all_of(Gathers, UndefValue::classof)
? TTI::TCC_Free		? TTI::TCC_Free
: R.getGatherCost(Gathers));		: R.getGatherCost(Gathers, !Root && VL.equals(Gathers)));
};		};

public:		public:
ShuffleCostEstimator(TargetTransformInfo &TTI,		ShuffleCostEstimator(TargetTransformInfo &TTI,
ArrayRef<Value *> VectorizedVals, BoUpSLP &R)		ArrayRef<Value *> VectorizedVals, BoUpSLP &R)
: TTI(TTI), VectorizedVals(VectorizedVals), R(R) {}		: TTI(TTI), VectorizedVals(VectorizedVals), R(R) {}
Value *adjustExtracts(const TreeEntry *E, ArrayRef<int> Mask) {		Value *adjustExtracts(const TreeEntry *E, ArrayRef<int> Mask) {
if (Mask.empty())		if (Mask.empty())
...	for (auto [I, V] : enumerate(VL)) {
// Add back the cost of s|zext which is subtracted separately.		// Add back the cost of s|zext which is subtracted separately.
Cost += TTI.getCastInstrCost(		Cost += TTI.getCastInstrCost(
Ext->getOpcode(), Ext->getType(), EE->getType(),		Ext->getOpcode(), Ext->getType(), EE->getType(),
TTI::getCastContextHint(Ext), CostKind, Ext);		TTI::getCastContextHint(Ext), CostKind, Ext);
continue;		continue;
}		}
}		}
Cost -= TTI.getVectorInstrCost(*EE, EE->getVectorOperandType(), CostKind,		Cost -= TTI.getVectorInstrCost(*EE, EE->getVectorOperandType(), CostKind,
Idx);		Idx);
}		}
// Add a cost for subvector extracts/inserts if required.		// Add a cost for subvector extracts/inserts if required.
for (const auto &Data : ExtractVectorsTys) {		for (const auto &Data : ExtractVectorsTys) {
auto *EEVTy = cast<FixedVectorType>(Data.first->getType());		auto *EEVTy = cast<FixedVectorType>(Data.first->getType());
unsigned NumElts = VecTy->getNumElements();		unsigned NumElts = VecTy->getNumElements();
if (Data.second % NumElts == 0)		if (Data.second % NumElts == 0)
continue;		continue;
if (TTI.getNumberOfParts(EEVTy) > VecNumParts) {		if (TTI.getNumberOfParts(EEVTy) > VecNumParts) {
...	if (GatherShuffle) {
else		else
Estimator.add(Entries.front(), Entries.back(), Mask);		Estimator.add(Entries.front(), Entries.back(), Mask);
Estimator.gather(		Estimator.gather(
GatheredScalars,		GatheredScalars,
Constant::getNullValue(FixedVectorType::get(		Constant::getNullValue(FixedVectorType::get(
GatheredScalars.front()->getType(), GatheredScalars.size())));		GatheredScalars.front()->getType(), GatheredScalars.size())));
return Estimator.finalize(E->ReuseShuffleIndices);		return Estimator.finalize(E->ReuseShuffleIndices);
}		}
if (ExtractShuffle && all_of(GatheredScalars, PoisonValue::classof)) {		InstructionCost Cost = 0;
		if (ExtractShuffle) {
// Check that gather of extractelements can be represented as just a		// Check that gather of extractelements can be represented as just a
// shuffle of a single/two vectors the scalars are extracted from.		// shuffle of a single/two vectors the scalars are extracted from.
// Found the bunch of extractelement instructions that must be gathered		// Found the bunch of extractelement instructions that must be gathered
// into a vector and can be represented as a permutation elements in a		// into a vector and can be represented as a permutation elements in a
// single input vector or of 2 input vectors.		// single input vector or of 2 input vectors.
InstructionCost Cost =		Cost += computeExtractCost(VL, VecTy, ExtractShuffle, ExtractMask, TTI);
computeExtractCost(VL, VecTy, ExtractShuffle, ExtractMask, TTI);
return Cost + Estimator.finalize(E->ReuseShuffleIndices);
}		}
Estimator.gather(		Estimator.gather(
GatheredScalars,		GatheredScalars,
(ExtractShuffle || GatherShuffle)		VL.equals(GatheredScalars)
? Constant::getNullValue(FixedVectorType::get(		? nullptr
GatheredScalars.front()->getType(), GatheredScalars.size()))		: Constant::getNullValue(FixedVectorType::get(
: nullptr);		GatheredScalars.front()->getType(), GatheredScalars.size())));
return Estimator.finalize(E->ReuseShuffleIndices);		return Cost + Estimator.finalize(E->ReuseShuffleIndices);
}		}
InstructionCost CommonCost = 0;		InstructionCost CommonCost = 0;
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!E->ReorderIndices.empty()) {		if (!E->ReorderIndices.empty()) {
SmallVector<int> NewMask;		SmallVector<int> NewMask;
if (E->getOpcode() == Instruction::Store) {		if (E->getOpcode() == Instruction::Store) {
// For stores the order is actually a mask.		// For stores the order is actually a mask.
NewMask.resize(E->ReorderIndices.size());		NewMask.resize(E->ReorderIndices.size());
...	case 2:
break;		break;
default:		default:
break;		break;
}		}
Entries.clear();		Entries.clear();
return std::nullopt;		return std::nullopt;
}		}

InstructionCost BoUpSLP::getGatherCost(FixedVectorType *Ty,		InstructionCost BoUpSLP::getGatherCost(ArrayRef<Value *> VL,
const APInt &ShuffledIndices,		bool ForPoisonSrc) const {
bool NeedToShuffle) const {
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
InstructionCost Cost =
TTI->getScalarizationOverhead(Ty, ~ShuffledIndices, /*Insert*/ true,
/*Extract*/ false, CostKind);
if (NeedToShuffle)
Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, Ty);
return Cost;
}

InstructionCost BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const {
// Find the type of the operands in VL.		// Find the type of the operands in VL.
Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());		auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());
bool DuplicateNonConst = false;		bool DuplicateNonConst = false;
// Find the cost of inserting/extracting values from the vector.		// Find the cost of inserting/extracting values from the vector.
// Check if the same elements are inserted several times and count them as		// Check if the same elements are inserted several times and count them as
// shuffle candidates.		// shuffle candidates.
APInt ShuffledElements = APInt::getZero(VL.size());		APInt ShuffledElements = APInt::getZero(VL.size());
DenseSet<Value *> UniqueElements;		DenseSet<Value *> UniqueElements;
// Iterate in reverse order to consider insert elements with the high cost.		constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
for (unsigned I = VL.size(); I > 0; --I) {		InstructionCost Cost;
unsigned Idx = I - 1;		auto EstimateInsertCost = [&](unsigned I, Value *V) {
		if (!ForPoisonSrc)
		Cost +=
		TTI->getVectorInstrCost(Instruction::InsertElement, VecTy, CostKind,
		I, Constant::getNullValue(VecTy), V);
		};
		for (unsigned I = 0, E = VL.size(); I < E; ++I) {
		Value *V = VL[I];
// No need to shuffle duplicates for constants.		// No need to shuffle duplicates for constants.
if (isConstant(VL[Idx])) {		if ((ForPoisonSrc && isConstant(V)) || isa<UndefValue>(V)) {
ShuffledElements.setBit(Idx);		ShuffledElements.setBit(I);
continue;		continue;
}		}
if (!UniqueElements.insert(VL[Idx]).second) {		if (!UniqueElements.insert(V).second) {
DuplicateNonConst = true;		DuplicateNonConst = true;
ShuffledElements.setBit(Idx);		ShuffledElements.setBit(I);
		continue;
}		}
		EstimateInsertCost(I, V);
}		}
return getGatherCost(VecTy, ShuffledElements, DuplicateNonConst);		if (ForPoisonSrc)
		Cost =
		TTI->getScalarizationOverhead(VecTy, ~ShuffledElements, /*Insert*/ true,
		/*Extract*/ false, CostKind);
		if (DuplicateNonConst)
		Cost +=
		TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
		return Cost;
}		}

// Perform operand reordering on the instructions in VL and return the reordered		// Perform operand reordering on the instructions in VL and return the reordered
// operands in Left and Right.		// operands in Left and Right.
void BoUpSLP::reorderInputsAccordingToOpcode(		void BoUpSLP::reorderInputsAccordingToOpcode(
ArrayRef<Value *> VL, SmallVectorImpl<Value *> &Left,		ArrayRef<Value *> VL, SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right, const TargetLibraryInfo &TLI,		SmallVectorImpl<Value *> &Right, const TargetLibraryInfo &TLI,
const DataLayout &DL, ScalarEvolution &SE, const BoUpSLP &R) {		const DataLayout &DL, ScalarEvolution &SE, const BoUpSLP &R) {

llvm/test/Transforms/SLPVectorizer/AArch64/extractelements-to-shuffle.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=aarch64 -aarch64-insert-extract-base-cost=3 | FileCheck %s			; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=aarch64 -aarch64-insert-extract-base-cost=3 | FileCheck %s

	define void @test(<2 x i64> %0, <2 x i64> %1, <2 x i64> %2) {			define void @test(<2 x i64> %0, <2 x i64> %1, <2 x i64> %2) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[TMP2:%.*]], i64 0			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[TMP1:%.*]], i64 0
	; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP1:%.*]], <2 x i64> [[TMP0:%.*]], <4 x i32> <i32 0, i32 2, i32 undef, i32 2>			; CHECK-NEXT: [[TMP5:%.*]] = or i64 [[TMP4]], 0
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i64> [[TMP5]], i64 [[TMP4]], i32 2			; CHECK-NEXT: [[TMP6:%.*]] = trunc i64 [[TMP5]] to i32
	; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <2 x i64> [[TMP2]], <2 x i64> poison, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>			; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i64> [[TMP0:%.*]], i64 0
	; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i64> [[TMP7]], <4 x i64> <i64 0, i64 0, i64 poison, i64 0>, <4 x i32> <i32 4, i32 5, i32 2, i32 7>			; CHECK-NEXT: [[TMP8:%.*]] = or i64 [[TMP7]], 0
	; CHECK-NEXT: [[TMP9:%.*]] = or <4 x i64> [[TMP6]], [[TMP8]]			; CHECK-NEXT: [[TMP9:%.*]] = trunc i64 [[TMP8]] to i32
	; CHECK-NEXT: [[TMP10:%.*]] = trunc <4 x i64> [[TMP9]] to <4 x i32>			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i64> [[TMP2:%.*]], i64 0
	; CHECK-NEXT: br label [[TMP11:%.*]]			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[TMP2]], i64 1
	; CHECK: 11:			; CHECK-NEXT: [[TMP12:%.*]] = or i64 [[TMP10]], [[TMP11]]
	; CHECK-NEXT: [[TMP12:%.*]] = phi <4 x i32> [ [[TMP16:%.*]], [[TMP11]] ], [ [[TMP10]], [[TMP3:%.*]] ]			; CHECK-NEXT: [[TMP13:%.*]] = trunc i64 [[TMP12]] to i32
	; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <4 x i32> [[TMP12]], <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, <4 x i32> <i32 0, i32 5, i32 6, i32 7>			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i64> [[TMP0]], i64 0
	; CHECK-NEXT: [[TMP14:%.*]] = or <4 x i32> zeroinitializer, [[TMP13]]			; CHECK-NEXT: [[TMP15:%.*]] = or i64 [[TMP14]], 0
	; CHECK-NEXT: [[TMP15:%.*]] = add <4 x i32> zeroinitializer, [[TMP13]]			; CHECK-NEXT: [[TMP16:%.*]] = trunc i64 [[TMP15]] to i32
	; CHECK-NEXT: [[TMP16]] = shufflevector <4 x i32> [[TMP14]], <4 x i32> [[TMP15]], <4 x i32> <i32 0, i32 5, i32 6, i32 7>			; CHECK-NEXT: br label [[TMP17:%.*]]
	; CHECK-NEXT: br label [[TMP11]]			; CHECK: 17:
				; CHECK-NEXT: [[TMP18:%.*]] = phi i32 [ [[TMP22:%.*]], [[TMP17]] ], [ [[TMP6]], [[TMP3:%.*]] ]
				; CHECK-NEXT: [[TMP19:%.*]] = phi i32 [ 0, [[TMP17]] ], [ [[TMP9]], [[TMP3]] ]
				; CHECK-NEXT: [[TMP20:%.*]] = phi i32 [ 0, [[TMP17]] ], [ [[TMP13]], [[TMP3]] ]
				; CHECK-NEXT: [[TMP21:%.*]] = phi i32 [ 0, [[TMP17]] ], [ [[TMP16]], [[TMP3]] ]
				; CHECK-NEXT: [[TMP22]] = or i32 [[TMP18]], 0
				; CHECK-NEXT: br label [[TMP17]]
	;			;
	%4 = extractelement <2 x i64> %1, i64 0			%4 = extractelement <2 x i64> %1, i64 0
	%5 = or i64 %4, 0			%5 = or i64 %4, 0
	%6 = trunc i64 %5 to i32			%6 = trunc i64 %5 to i32
	%7 = extractelement <2 x i64> %0, i64 0			%7 = extractelement <2 x i64> %0, i64 0
	%8 = or i64 %7, 0			%8 = or i64 %7, 0
	%9 = trunc i64 %8 to i32			%9 = trunc i64 %8 to i32
	%10 = extractelement <2 x i64> %2, i64 0			%10 = extractelement <2 x i64> %2, i64 0

llvm/test/Transforms/SLPVectorizer/AArch64/fshl.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
	; RUN: opt -mtriple=arm64-apple-ios -S -passes=slp-vectorizer < %s | FileCheck %s			; RUN: opt -mtriple=arm64-apple-ios -S -passes=slp-vectorizer < %s | FileCheck %s

	; fshl instruction cost model is an overestimate causing this test to vectorize when it is not beneficial to do so.			; fshl instruction cost model is an overestimate causing this test to vectorize when it is not beneficial to do so.
	define i64 @fshl(i64 %or1, i64 %or2, i64 %or3 ) {			define i64 @fshl(i64 %or1, i64 %or2, i64 %or3 ) {
	; CHECK-LABEL: define i64 @fshl			; CHECK-LABEL: define i64 @fshl
	; CHECK-SAME: (i64 [[OR1:%.*]], i64 [[OR2:%.*]], i64 [[OR3:%.*]]) {			; CHECK-SAME: (i64 [[OR1:%.*]], i64 [[OR2:%.*]], i64 [[OR3:%.*]]) {
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x i64> poison, i64 [[OR2]], i32 0			; CHECK-NEXT: [[OR4:%.*]] = tail call i64 @llvm.fshl.i64(i64 [[OR2]], i64 0, i64 1)
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i64> [[TMP0]], i64 [[OR3]], i32 1			; CHECK-NEXT: [[XOR1:%.*]] = xor i64 [[OR4]], 0
	; CHECK-NEXT: [[TMP2:%.*]] = call <2 x i64> @llvm.fshl.v2i64(<2 x i64> [[TMP1]], <2 x i64> zeroinitializer, <2 x i64> <i64 1, i64 2>)			; CHECK-NEXT: [[OR5:%.*]] = tail call i64 @llvm.fshl.i64(i64 [[OR3]], i64 0, i64 2)
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x i64> <i64 poison, i64 0>, i64 [[OR1]], i32 0			; CHECK-NEXT: [[XOR2:%.*]] = xor i64 [[OR5]], [[OR1]]
	; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> <i64 poison, i64 0>, <2 x i32> <i32 0, i32 3>			; CHECK-NEXT: [[ADD1:%.*]] = add i64 [[XOR1]], [[OR1]]
	; CHECK-NEXT: [[TMP5:%.*]] = call <2 x i64> @llvm.fshl.v2i64(<2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i64> <i64 17, i64 21>)			; CHECK-NEXT: [[ADD2:%.*]] = add i64 0, [[XOR2]]
	; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> <i64 0, i64 poison>, <2 x i32> <i32 2, i32 0>			; CHECK-NEXT: [[OR6:%.*]] = tail call i64 @llvm.fshl.i64(i64 [[OR1]], i64 [[OR2]], i64 17)
	; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i64> [[TMP2]], [[TMP6]]			; CHECK-NEXT: [[XOR3:%.*]] = xor i64 [[OR6]], [[ADD1]]
	; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i64> [[TMP7]], [[TMP3]]			; CHECK-NEXT: [[OR7:%.*]] = tail call i64 @llvm.fshl.i64(i64 0, i64 0, i64 21)
	; CHECK-NEXT: [[TMP9:%.*]] = xor <2 x i64> [[TMP5]], [[TMP8]]			; CHECK-NEXT: [[XOR4:%.*]] = xor i64 [[OR7]], [[ADD2]]
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0			; CHECK-NEXT: [[ADD3:%.*]] = or i64 [[XOR3]], [[ADD2]]
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[TMP8]], i32 1			; CHECK-NEXT: [[XOR5:%.*]] = xor i64 [[ADD3]], [[XOR4]]
	; CHECK-NEXT: [[ADD3:%.*]] = or i64 [[TMP10]], [[TMP11]]
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
	; CHECK-NEXT: [[XOR5:%.*]] = xor i64 [[ADD3]], [[TMP12]]
	; CHECK-NEXT: ret i64 [[XOR5]]			; CHECK-NEXT: ret i64 [[XOR5]]
	;			;
	entry:			entry:
	%or4 = tail call i64 @llvm.fshl.i64(i64 %or2, i64 0, i64 1)			%or4 = tail call i64 @llvm.fshl.i64(i64 %or2, i64 0, i64 1)
	%xor1 = xor i64 %or4, 0			%xor1 = xor i64 %or4, 0
	%or5 = tail call i64 @llvm.fshl.i64(i64 %or3, i64 0, i64 2)			%or5 = tail call i64 @llvm.fshl.i64(i64 %or3, i64 0, i64 2)
	%xor2 = xor i64 %or5, %or1			%xor2 = xor i64 %or5, %or1
	%add1 = add i64 %xor1, %or1			%add1 = add i64 %xor1, %or1

llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll

...	;
%s2 = select i1 %s1, i1 true, i1 %c2		%s2 = select i1 %s1, i1 true, i1 %c2
%s3 = select i1 %s2, i1 true, i1 %c3		%s3 = select i1 %s2, i1 true, i1 %c3
ret i1 %s3		ret i1 %s3
}		}

define i1 @logical_and_icmp_diff_preds(<4 x i32> %x) {		define i1 @logical_and_icmp_diff_preds(<4 x i32> %x) {
; SSE-LABEL: @logical_and_icmp_diff_preds(		; SSE-LABEL: @logical_and_icmp_diff_preds(
; SSE-NEXT: [[X0:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 0		; SSE-NEXT: [[X0:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 0
; SSE-NEXT: [[X3:%.*]] = extractelement <4 x i32> [[X]], i32 3		; SSE-NEXT: [[X2:%.*]] = extractelement <4 x i32> [[X]], i32 2
; SSE-NEXT: [[C0:%.*]] = icmp ult i32 [[X0]], 0		; SSE-NEXT: [[C0:%.*]] = icmp ult i32 [[X0]], 0
; SSE-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[X]], <4 x i32> poison, <2 x i32> <i32 1, i32 2>		; SSE-NEXT: [[C2:%.*]] = icmp sgt i32 [[X2]], 0
; SSE-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> <i32 poison, i32 0>, <2 x i32> <i32 0, i32 3>		; SSE-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[X]], <4 x i32> poison, <2 x i32> <i32 3, i32 1>
; SSE-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> <i32 0, i32 poison>, <2 x i32> <i32 2, i32 1>		; SSE-NEXT: [[TMP2:%.*]] = icmp slt <2 x i32> [[TMP1]], zeroinitializer
; SSE-NEXT: [[TMP4:%.*]] = icmp slt <2 x i32> [[TMP2]], [[TMP3]]		; SSE-NEXT: [[TMP3:%.*]] = extractelement <2 x i1> [[TMP2]], i32 1
; SSE-NEXT: [[C3:%.*]] = icmp slt i32 [[X3]], 0		; SSE-NEXT: [[S1:%.*]] = select i1 [[C0]], i1 [[TMP3]], i1 false
; SSE-NEXT: [[TMP5:%.*]] = extractelement <2 x i1> [[TMP4]], i32 0		; SSE-NEXT: [[S2:%.*]] = select i1 [[S1]], i1 [[C2]], i1 false
; SSE-NEXT: [[S1:%.*]] = select i1 [[C0]], i1 [[TMP5]], i1 false		; SSE-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[TMP2]], i32 0
; SSE-NEXT: [[TMP6:%.*]] = extractelement <2 x i1> [[TMP4]], i32 1		; SSE-NEXT: [[S3:%.*]] = select i1 [[S2]], i1 [[TMP4]], i1 false
; SSE-NEXT: [[S2:%.*]] = select i1 [[S1]], i1 [[TMP6]], i1 false
; SSE-NEXT: [[S3:%.*]] = select i1 [[S2]], i1 [[C3]], i1 false
; SSE-NEXT: ret i1 [[S3]]		; SSE-NEXT: ret i1 [[S3]]
;		;
; AVX-LABEL: @logical_and_icmp_diff_preds(		; AVX-LABEL: @logical_and_icmp_diff_preds(
; AVX-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> <i32 poison, i32 poison, i32 poison, i32 0>, <4 x i32> <i32 0, i32 3, i32 1, i32 7>		; AVX-NEXT: [[X0:%.*]] = extractelement <4 x i32> [[X:%.*]], i32 0
; AVX-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[X]], <4 x i32> <i32 0, i32 0, i32 0, i32 poison>, <4 x i32> <i32 4, i32 5, i32 6, i32 2>		; AVX-NEXT: [[X1:%.*]] = extractelement <4 x i32> [[X]], i32 1
; AVX-NEXT: [[TMP3:%.*]] = icmp ult <4 x i32> [[TMP1]], [[TMP2]]		; AVX-NEXT: [[X2:%.*]] = extractelement <4 x i32> [[X]], i32 2
; AVX-NEXT: [[TMP4:%.*]] = icmp slt <4 x i32> [[TMP1]], [[TMP2]]		; AVX-NEXT: [[X3:%.*]] = extractelement <4 x i32> [[X]], i32 3
; AVX-NEXT: [[TMP5:%.*]] = shufflevector <4 x i1> [[TMP3]], <4 x i1> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 7>		; AVX-NEXT: [[C0:%.*]] = icmp ult i32 [[X0]], 0
; AVX-NEXT: [[TMP6:%.*]] = extractelement <4 x i1> [[TMP5]], i32 0		; AVX-NEXT: [[C1:%.*]] = icmp slt i32 [[X1]], 0
; AVX-NEXT: [[TMP7:%.*]] = extractelement <4 x i1> [[TMP5]], i32 2		; AVX-NEXT: [[C2:%.*]] = icmp sgt i32 [[X2]], 0
; AVX-NEXT: [[S1:%.*]] = select i1 [[TMP6]], i1 [[TMP7]], i1 false		; AVX-NEXT: [[C3:%.*]] = icmp slt i32 [[X3]], 0
; AVX-NEXT: [[TMP8:%.*]] = extractelement <4 x i1> [[TMP5]], i32 3		; AVX-NEXT: [[S1:%.*]] = select i1 [[C0]], i1 [[C1]], i1 false
; AVX-NEXT: [[S2:%.*]] = select i1 [[S1]], i1 [[TMP8]], i1 false		; AVX-NEXT: [[S2:%.*]] = select i1 [[S1]], i1 [[C2]], i1 false
; AVX-NEXT: [[TMP9:%.*]] = extractelement <4 x i1> [[TMP5]], i32 1		; AVX-NEXT: [[S3:%.*]] = select i1 [[S2]], i1 [[C3]], i1 false
; AVX-NEXT: [[S3:%.*]] = select i1 [[S2]], i1 [[TMP9]], i1 false
; AVX-NEXT: ret i1 [[S3]]		; AVX-NEXT: ret i1 [[S3]]
;		;
%x0 = extractelement <4 x i32> %x, i32 0		%x0 = extractelement <4 x i32> %x, i32 0
%x1 = extractelement <4 x i32> %x, i32 1		%x1 = extractelement <4 x i32> %x, i32 1
%x2 = extractelement <4 x i32> %x, i32 2		%x2 = extractelement <4 x i32> %x, i32 2
%x3 = extractelement <4 x i32> %x, i32 3		%x3 = extractelement <4 x i32> %x, i32 3
%c0 = icmp ult i32 %x0, 0		%c0 = icmp ult i32 %x0, 0
%c1 = icmp slt i32 %x1, 0		%c1 = icmp slt i32 %x1, 0