This is an archive of the discontinued LLVM Phabricator instance.

Differential D37737

[SLPVectorizer] Merge subsequent gather loads.
AbandonedPublic

Authored by fhahn on Sep 12 2017, 5:23 AM.

Details

Reviewers

RKSimon
ABataev
dtemirbulatov
spatel
efriedma

Summary

This patch updates SLPVectorizer to try to combine subsequent scalar gather
loads into vector loads. I think this change makes the IR simpler
(after instcombine is run); it replaces a chain of insertelement
instructions by a shufflevector instruction using the result of the
vector load.
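The rewrite described above can be illustrated outside of LLVM. The following is only a sketch in plain C++ that models the IR operations with `std::array`; the names `broadcastFromScalarLoad`, `loadMerged` and `shuffle` are made up for illustration and do not appear in the patch. It shows that a broadcast built from a scalar load plus an insertelement chain equals a shuffle of one merged vector load.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec2 = std::array<std::int32_t, 2>;

// Before the patch: a broadcast vector is gathered from one scalar load with a
// chain of insertelement instructions (modelled here as lane assignments).
Vec2 broadcastFromScalarLoad(const std::int32_t *Ptr) {
  std::int32_t Scalar = *Ptr; // load i32
  Vec2 V{};                   // stands in for undef
  V[0] = Scalar;              // insertelement ..., i32 0
  V[1] = Scalar;              // insertelement ..., i32 1
  return V;
}

// After the patch: one vector load covers both adjacent scalars.
Vec2 loadMerged(const std::int32_t *Ptr) { return {Ptr[0], Ptr[1]}; }

// Models a shufflevector whose mask picks lanes only from its first operand.
Vec2 shuffle(const Vec2 &V, int Lane0, int Lane1) {
  assert(Lane0 >= 0 && Lane0 < 2 && Lane1 >= 0 && Lane1 < 2);
  return {V[Lane0], V[Lane1]};
}
```

With `B` pointing at two adjacent values, `shuffle(loadMerged(B), 0, 0)` plays the role of the `zeroinitializer` mask in test1 and `shuffle(loadMerged(B), 1, 1)` the role of the `<i32 1, i32 1>` mask.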

The specific case I want to optimize is function test1 in
/test/Transforms/SLPVectorizer/AArch64/merge-gather-loads.ll. Code
like that is generated for some SGEMM kernels.
Combining the scalar loads into a vector load is beneficial in this case,
as the users of the scalar values (mul) support indexed vector
operands on AArch64 and there is no need to duplicate the loaded scalar
values in separate vector registers. For instructions that do not
support indexed vector operands (like add in test_add), this makes
things worse, as we have to do a vector load + 2 dups.

In addition to that, for architectures with complex instruction sets
(e.g. X86) this could also make things worse, if the users of the
scalar value support scalar memory operands (e.g. the assembly generated
for some functions in test/Transforms/SLPVectorizer/X86/operandorder.ll
uses memory operands for some scalar values).

It is my first patch in that area and I am not sure how to address the
issues mentioned above properly. Whether vectorizing the loads is beneficial
depends on the vector instructions available on the architecture. Would
it be better to have this as part of a target specific pass? There is a
LoadStoreVectorizer which may act as a base for that. Or should
backends provide information for which instructions this transformation
is beneficial as part of TargetTransformInfo?
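To make the last question more concrete, here is a rough sketch of what such a backend query could look like. Everything in it is hypothetical: `TargetHooks`, `supportsIndexedVectorOperand` and `shouldMergeGatherLoads` are invented names, not existing TargetTransformInfo APIs, and the AArch64-like answers only encode what is stated above (mul takes indexed operands, add does not).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Opcode { Add, Mul };

// Hypothetical stand-in for a TargetTransformInfo-style interface; this is
// NOT an existing LLVM API.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // True if a vector instruction with this opcode can read a single lane of
  // a vector register as an operand. Conservative default: no.
  virtual bool supportsIndexedVectorOperand(Opcode) const { return false; }
};

// Assumed answers for an AArch64-like target, taken from the summary only.
struct AArch64LikeHooks : TargetHooks {
  bool supportsIndexedVectorOperand(Opcode Op) const override {
    return Op == Opcode::Mul;
  }
};

// Merge the gather loads only if every user can consume an indexed lane, so
// that no extra dup instructions are needed after the vector load.
bool shouldMergeGatherLoads(const TargetHooks &Target,
                            const std::vector<Opcode> &Users) {
  return !Users.empty() &&
         std::all_of(Users.begin(), Users.end(), [&](Opcode Op) {
           return Target.supportsIndexedVectorOperand(Op);
         });
}
```

Under these assumptions the users in test1 (all mul) would allow the merge, while those in test_add would not, and a target that does not override the hook would never merge.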

Event Timeline

fhahn created this revision. Sep 12 2017, 5:23 AM

Herald added subscribers: kristof.beyls, javed.absar, rengolin, aemerson. Sep 12 2017, 5:23 AM

fhahn mentioned this in D37738: [SLPVectorizer] Generalize vectorizeStores to support loads as well NFC. Sep 12 2017, 5:24 AM

fhahn added a parent revision: D37738: [SLPVectorizer] Generalize vectorizeStores to support loads as well NFC.

Hi Florian,

There are a lot of assumptions in there that I'm not sure hold in all cases (so they should be guarded), as well as flag-passing that is at least unnecessary (a static isGatherLoad or something could have done the job), plus a few other comments here and there.

I don't think it's worth splitting this into a separate pass (you're really just collecting the loads and vectorising them), but you do need the standard contract for performance changes:

An appropriate guard that limits the (sub)architectures this will run on
Logic that accounts for the cases in which it is beneficial on them (TargetInfo hook)
Benchmark results on all affected targets (whatever SGEMM benchmark you used)
Evidence that the test-suite (benchmark mode) shows no noticeable change because of it
If possible, a run of SPEC or some other benchmark, showing a similar lack of negative impact

cheers,
--renato

lib/Transforms/Vectorize/SLPVectorizer.cpp
558	This looks like a very specific change to such a generic function
2354	Why is an argument flag inside the loop, and after two other non-constant checks? This doesn't make sense.
3509	Is this always correct? Can we merge the loads, and will this be profitable for all architectures?
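The comment on line 2354 can be illustrated with a simplified model of the extract-cost loop. This is a sketch under stated assumptions, not the actual `getTreeCost` code: `ExternalUse`, `extractCostFlagInLoop` and `extractCostHoisted` are hypothetical names, and the real loop also tracks already-costed scalars. The point is only that a loop-invariant flag can be tested once, before the loop.

```cpp
#include <cassert>
#include <vector>

// Toy model of one externally used scalar.
struct ExternalUse {
  int ExtractCost;  // cost of the extractelement this use would need
  bool IsEphemeral; // ephemeral users are free
};

// Shape used in the patch: the flag is re-tested on every iteration, after
// the per-use checks, although it never changes inside the loop.
int extractCostFlagInLoop(const std::vector<ExternalUse> &Uses,
                          bool GatherLoads) {
  int Cost = 0;
  for (const ExternalUse &EU : Uses) {
    if (EU.IsEphemeral)
      continue;
    if (GatherLoads)
      continue;
    Cost += EU.ExtractCost;
  }
  return Cost;
}

// Same result in this model, with the invariant check hoisted out.
int extractCostHoisted(const std::vector<ExternalUse> &Uses,
                       bool GatherLoads) {
  if (GatherLoads)
    return 0; // no extract cost is charged for gather loads
  int Cost = 0;
  for (const ExternalUse &EU : Uses)
    if (!EU.IsEphemeral)
      Cost += EU.ExtractCost;
  return Cost;
}
```

Whether hoisting is safe in the real function depends on the side effects of the checks that precede the flag there, which this model leaves out.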

Sorry for not responding sooner. Thanks for the feedback. I am currently swamped with other work, but I hope I can revisit this patch soon!

Not needed anymore, fixed by D36130

fhahn abandoned this revision. Mar 22 2018, 10:52 AM

Revision Contents

Path	Size
lib/Transforms/Vectorize/SLPVectorizer.cpp	33 lines
test/Transforms/SLPVectorizer/AArch64/merge-gather-loads.ll	67 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	29 lines
test/Transforms/SLPVectorizer/X86/operandorder.ll	28 lines

Diff 114803

lib/Transforms/Vectorize/SLPVectorizer.cpp

[549 lines not shown]	public:
Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);		Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);

/// \returns the cost incurred by unwanted spills and fills, caused by		/// \returns the cost incurred by unwanted spills and fills, caused by
/// holding live values over call sites.		/// holding live values over call sites.
int getSpillCost();		int getSpillCost();

/// \returns the vectorization cost of the subtree that starts at \p VL.		/// \returns the vectorization cost of the subtree that starts at \p VL.
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
int getTreeCost();		int getTreeCost(bool GatherLoad=false);
		rengolin: This looks like a very specific change to such a generic function

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking		/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
[1,749 lines not shown]	for (const auto &N : VectorizableTree) {
}		}

PrevInst = Inst;		PrevInst = Inst;
}		}

return Cost;		return Cost;
}		}

int BoUpSLP::getTreeCost() {		int BoUpSLP::getTreeCost(bool GatherLoads) {
int Cost = 0;		int Cost = 0;
DEBUG(dbgs() << "SLP: Calculating cost for tree of size " <<		DEBUG(dbgs() << "SLP: Calculating cost for tree of size " <<
VectorizableTree.size() << ".\n");		VectorizableTree.size() << ".\n");

unsigned BundleWidth = VectorizableTree[0].Scalars.size();		unsigned BundleWidth = VectorizableTree[0].Scalars.size();

for (TreeEntry &TE : VectorizableTree) {		for (TreeEntry &TE : VectorizableTree) {
int C = getEntryCost(&TE);		int C = getEntryCost(&TE);
[10 lines not shown]	if (!ExtractCostCalculated.insert(EU.Scalar).second)
continue;		continue;

// Uses by ephemeral values are free (because the ephemeral value will be		// Uses by ephemeral values are free (because the ephemeral value will be
// removed prior to code generation, and so the extraction will be		// removed prior to code generation, and so the extraction will be
// removed as well).		// removed as well).
if (EphValues.count(EU.User))		if (EphValues.count(EU.User))
continue;		continue;

		// Users of the roots have been vectorized already, the new extractelement
		// instructions are canceled out by already added insertelement
		// instructions.
		if (GatherLoads)
		rengolin: Why is an argument flag inside the loop, and after two other non-constant checks? This doesn't make sense.
		continue;

// If we plan to rewrite the tree in a smaller type, we will need to sign		// If we plan to rewrite the tree in a smaller type, we will need to sign
// extend the extracted value back to the original type. Here, we account		// extend the extracted value back to the original type. Here, we account
// for the extract and the added cost of the sign extend if needed.		// for the extract and the added cost of the sign extend if needed.
auto *VecTy = VectorType::get(EU.Scalar->getType(), BundleWidth);		auto *VecTy = VectorType::get(EU.Scalar->getType(), BundleWidth);
auto *ScalarRoot = VectorizableTree[0].Scalars[0];		auto *ScalarRoot = VectorizableTree[0].Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);		auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
auto Extend =		auto Extend =
[951 lines not shown]	static bool hasValueBeenRAUWed(ArrayRef<Value *> VL,
ArrayRef<WeakTrackingVH> VH, unsigned SliceBegin,		ArrayRef<WeakTrackingVH> VH, unsigned SliceBegin,
unsigned SliceSize) {		unsigned SliceSize) {
VL = VL.slice(SliceBegin, SliceSize);		VL = VL.slice(SliceBegin, SliceSize);
VH = VH.slice(SliceBegin, SliceSize);		VH = VH.slice(SliceBegin, SliceSize);
return !std::equal(VL.begin(), VL.end(), VH.begin());		return !std::equal(VL.begin(), VL.end(), VH.begin());
}		}

bool vectorizeAccessChain(ArrayRef<Value *> Chain, BoUpSLP &R,		bool vectorizeAccessChain(ArrayRef<Value *> Chain, BoUpSLP &R,
unsigned VecRegSize) {		unsigned VecRegSize, bool GatherOpt) {
assert(!Chain.empty() &&		assert(!Chain.empty() &&
(isa<LoadInst>(Chain[0]) || isa<StoreInst>(Chain[0])) &&		(isa<LoadInst>(Chain[0]) || isa<StoreInst>(Chain[0])) &&
"Chain has to be non-empty and contain load or store instructions");		"Chain has to be non-empty and contain load or store instructions");

unsigned ChainLen = Chain.size();		unsigned ChainLen = Chain.size();
DEBUG(dbgs() << "SLP: Analyzing a "		DEBUG(dbgs() << "SLP: Analyzing a "
<< (isa<StoreInst>(Chain[0]) ? "store" : "load")		<< (isa<StoreInst>(Chain[0]) ? "store" : "load")
<< " chain of length " << ChainLen << "\n");		<< " chain of length " << ChainLen << "\n");
[22 lines not shown]	for (unsigned i = 0, e = ChainLen; i < e; ++i) {
ArrayRef<Value *> Operands = Chain.slice(i, VF);		ArrayRef<Value *> Operands = Chain.slice(i, VF);

R.buildTree(Operands);		R.buildTree(Operands);
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();

int Cost = R.getTreeCost();		int Cost = R.getTreeCost(GatherOpt);

DEBUG(dbgs() << "SLP: Found cost=" << Cost << " for VF=" << VF << "\n");		DEBUG(dbgs() << "SLP: Found cost=" << Cost << " for VF=" << VF << "\n");
if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
DEBUG(dbgs() << "SLP: Decided to vectorize cost=" << Cost << "\n");		DEBUG(dbgs() << "SLP: Decided to vectorize cost=" << Cost << "\n");

using namespace ore;		using namespace ore;
auto *ORE = R.getORE();		auto *ORE = R.getORE();
if (isa<StoreInst>(Chain[i]))		if (isa<StoreInst>(Chain[i]))
[16 lines not shown]	bool vectorizeAccessChain(ArrayRef<Value *> Chain, BoUpSLP &R,
}		}

return Changed;		return Changed;
}		}


static bool		static bool
vectorizeAccesses(ArrayRef<Instruction *> Accesses, BoUpSLP &R,		vectorizeAccesses(ArrayRef<Instruction *> Accesses, BoUpSLP &R,
const DataLayout *DL, ScalarEvolution *SE) {		const DataLayout *DL, ScalarEvolution *SE,
		bool GatherOpt) {
SetVector<Instruction *> Heads, Tails;		SetVector<Instruction *> Heads, Tails;
SmallDenseMap<Instruction *, Instruction *> ConsecutiveChain;		SmallDenseMap<Instruction *, Instruction *> ConsecutiveChain;

// We may run into multiple chains that merge into a single chain. We mark the		// We may run into multiple chains that merge into a single chain. We mark the
// stores that we vectorized so that we don't visit the same store twice.		// stores that we vectorized so that we don't visit the same store twice.
BoUpSLP::ValueSet VectorizedAccesses;		BoUpSLP::ValueSet VectorizedAccesses;

// Do a quadratic search on all of the given loads or stores and find		// Do a quadratic search on all of the given loads or stores and find
[41 lines not shown]	while (Tails.count(I) || Heads.count(I)) {
// Move to the next value in the chain.		// Move to the next value in the chain.
I = ConsecutiveChain[I];		I = ConsecutiveChain[I];
}		}
//		//
// FIXME: Is division-by-2 the correct step? Should we assert that the		// FIXME: Is division-by-2 the correct step? Should we assert that the
// register size is a power-of-2?		// register size is a power-of-2?
for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize();		for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize();
Size /= 2) {		Size /= 2) {
if (vectorizeAccessChain(Operands, R, Size)) {		if (vectorizeAccessChain(Operands, R, Size, GatherOpt)) {
// Mark the vectorized stores so that we don't vectorize them again.		// Mark the vectorized stores so that we don't vectorize them again.
VectorizedAccesses.insert(Operands.begin(), Operands.end());		VectorizedAccesses.insert(Operands.begin(), Operands.end());
Changed = true;		Changed = true;
break;		break;
}		}
}		}
}		}

return Changed;		return Changed;
}		}


void BoUpSLP::optimizeGatherSequence() {		void BoUpSLP::optimizeGatherSequence() {
DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()		DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()
<< " gather sequences instructions.\n");		<< " gather sequences instructions.\n");
// LICM InsertElementInst sequences.		// LICM InsertElementInst sequences.
		SmallPtrSet<LoadInst *, 4> GatherLoads;

for (Instruction *it : GatherSeq) {		for (Instruction *it : GatherSeq) {
InsertElementInst *Insert = dyn_cast<InsertElementInst>(it);		InsertElementInst *Insert = dyn_cast<InsertElementInst>(it);

if (!Insert)		if (!Insert)
continue;		continue;

		// Collect scalar loads, which can potentially be combined to vector
		// loads.
		if (LoadInst *LI = dyn_cast<LoadInst>(Insert->getOperand(1)))
		GatherLoads.insert(LI);

// Check if this block is inside a loop.		// Check if this block is inside a loop.
Loop *L = LI->getLoopFor(Insert->getParent());		Loop *L = LI->getLoopFor(Insert->getParent());
if (!L)		if (!L)
continue;		continue;

// Check if it has a preheader.		// Check if it has a preheader.
BasicBlock *PreHeader = L->getLoopPreheader();		BasicBlock *PreHeader = L->getLoopPreheader();
if (!PreHeader)		if (!PreHeader)
continue;		continue;

// If the vector or the element that we insert into it are		// If the vector or the element that we insert into it are
// instructions that are defined in this basic block then we can't		// instructions that are defined in this basic block then we can't
// hoist this instruction.		// hoist this instruction.
Instruction *CurrVec = dyn_cast<Instruction>(Insert->getOperand(0));		Instruction *CurrVec = dyn_cast<Instruction>(Insert->getOperand(0));
Instruction *NewElem = dyn_cast<Instruction>(Insert->getOperand(1));		Instruction *NewElem = dyn_cast<Instruction>(Insert->getOperand(1));
if (CurrVec && L->contains(CurrVec))		if (CurrVec && L->contains(CurrVec))
continue;		continue;
if (NewElem && L->contains(NewElem))		if (NewElem && L->contains(NewElem))
continue;		continue;

// We can hoist this instruction. Move it to the pre-header.		// We can hoist this instruction. Move it to the pre-header.
Insert->moveBefore(PreHeader->getTerminator());		Insert->moveBefore(PreHeader->getTerminator());
}		}

		if (!GatherLoads.empty()) {
		rengolin: Is this always correct? Can we merge the loads, and will this be profitable for all architectures?
		SmallVector<Instruction *, 4> LoadV(GatherLoads.begin(), GatherLoads.end());
		vectorizeAccesses(LoadV, *this, DL, SE, true);
		}

// Make a list of all reachable blocks in our CSE queue.		// Make a list of all reachable blocks in our CSE queue.
SmallVector<const DomTreeNode *, 8> CSEWorkList;		SmallVector<const DomTreeNode *, 8> CSEWorkList;
CSEWorkList.reserve(CSEBlocks.size());		CSEWorkList.reserve(CSEBlocks.size());
for (BasicBlock *BB : CSEBlocks)		for (BasicBlock *BB : CSEBlocks)
if (DomTreeNode *N = DT->getNode(BB)) {		if (DomTreeNode *N = DT->getNode(BB)) {
assert(DT->isReachableFromEntry(N));		assert(DT->isReachableFromEntry(N));
CSEWorkList.push_back(N);		CSEWorkList.push_back(N);
}		}
[880 lines not shown]	for (auto BB : post_order(&F.getEntryBlock())) {
if (!GEPs.empty()) {		if (!GEPs.empty()) {
DEBUG(dbgs() << "SLP: Found GEPs for " << GEPs.size()		DEBUG(dbgs() << "SLP: Found GEPs for " << GEPs.size()
<< " underlying objects.\n");		<< " underlying objects.\n");
Changed |= vectorizeGEPIndices(BB, R);		Changed |= vectorizeGEPIndices(BB, R);
}		}
}		}

if (Changed) {		if (Changed) {
		R.deleteTree();
R.optimizeGatherSequence();		R.optimizeGatherSequence();
DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");		DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");
DEBUG(verifyFunction(F));		DEBUG(verifyFunction(F));
}		}
return Changed;		return Changed;
}		}


[1,477 lines not shown]	for (StoreListMap::iterator it = Stores.begin(), e = Stores.end(); it != e;

// Process the stores in chunks of 16.		// Process the stores in chunks of 16.
// TODO: The limit of 16 inhibits greater vectorization factors.		// TODO: The limit of 16 inhibits greater vectorization factors.
// For example, AVX2 supports v32i8. Increasing this limit, however,		// For example, AVX2 supports v32i8. Increasing this limit, however,
// may cause a significant compile-time increase.		// may cause a significant compile-time increase.
for (unsigned CI = 0, CE = it->second.size(); CI < CE; CI+=16) {		for (unsigned CI = 0, CE = it->second.size(); CI < CE; CI+=16) {
unsigned Len = std::min<unsigned>(CE - CI, 16);		unsigned Len = std::min<unsigned>(CE - CI, 16);
Changed |= vectorizeAccesses(makeArrayRef(&it->second[CI], Len), R, DL,		Changed |= vectorizeAccesses(makeArrayRef(&it->second[CI], Len), R, DL,
SE);		SE, false);
}		}
}		}
return Changed;		return Changed;
}		}

char SLPVectorizer::ID = 0;		char SLPVectorizer::ID = 0;

static const char lv_name[] = "SLP Vectorizer";		static const char lv_name[] = "SLP Vectorizer";
[12 lines not shown]
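The chain search that vectorizeAccesses performs (a quadratic scan that records heads, tails and a successor map, then walks each chain from a head that is not also a tail) can be sketched in isolation. This is a simplified model, not the patch's code: accesses are reduced to distinct integer element offsets from a common base, "consecutive" means "offset + 1", and `buildChains` is a hypothetical name.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Groups element offsets into maximal runs of consecutive accesses.
// Assumes the offsets are distinct and relative to the same base pointer.
std::vector<std::vector<int>> buildChains(const std::vector<int> &Offsets) {
  std::set<int> Heads, Tails;
  std::map<int, int> ConsecutiveChain; // access -> the access right after it

  // Quadratic search for pairs (A, B) where B directly follows A.
  for (int A : Offsets)
    for (int B : Offsets)
      if (B == A + 1) {
        Heads.insert(A);
        Tails.insert(B);
        ConsecutiveChain[A] = B;
      }

  std::vector<std::vector<int>> Chains;
  for (int Head : Heads) {
    if (Tails.count(Head))
      continue; // in the middle of a chain, not its start

    // Walk the successor map until the chain ends.
    std::vector<int> Chain;
    int I = Head;
    while (true) {
      Chain.push_back(I);
      auto It = ConsecutiveChain.find(I);
      if (It == ConsecutiveChain.end())
        break;
      I = It->second;
    }
    Chains.push_back(Chain);
  }
  return Chains;
}
```

In the patch, each chain found this way is then handed to vectorizeAccessChain for register sizes from getMaxVecRegSize() down to getMinVecRegSize(), halving each time; that cost-driven part is not modelled here.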

test/Transforms/SLPVectorizer/AArch64/merge-gather-loads.ll

This file was added.

				; RUN: opt < %s -basicaa -slp-vectorizer -instcombine -S -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 | FileCheck %s
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; CHECK-LABEL: @test1
				; CHECK: [[BC:[a-z0-9]+]] = bitcast i32* %b to <2 x i32>*
				; CHECK: [[MERGED_LOAD:[a-z0-9]+]] = load <2 x i32>, <2 x i32>* %[[BC]], align 4
				; CHECK: = shufflevector <2 x i32> %[[MERGED_LOAD]], <2 x i32> undef, <2 x i32> zeroinitializer
				; CHECK: = shufflevector <2 x i32> %[[MERGED_LOAD]], <2 x i32> undef, <2 x i32> <i32 1, i32 1>

				define void @test1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c) {
				entry:
				%scalar.1 = load i32, i32* %b, align 4
				%v1.1 = load i32, i32* %c, align 4
				%add = mul nsw i32 %v1.1, %scalar.1
				store i32 %add, i32* %a, align 4
				%arrayidx3 = getelementptr inbounds i32, i32* %c, i64 1
				%v1.2 = load i32, i32* %arrayidx3, align 4
				%add5 = mul nsw i32 %v1.2, %scalar.1
				%arrayidx7 = getelementptr inbounds i32, i32* %a, i64 1
				store i32 %add5, i32* %arrayidx7, align 4

				%si.2 = getelementptr inbounds i32, i32* %b, i64 1
				%scalar.2 = load i32, i32* %si.2, align 4

				%arrayidx8 = getelementptr inbounds i32, i32* %c, i64 2
				%v1.3 = load i32, i32* %arrayidx8, align 4
				%add10 = mul nsw i32 %v1.3, %scalar.2
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 2
				store i32 %add10, i32* %arrayidx12, align 4
				%arrayidx13 = getelementptr inbounds i32, i32* %c, i64 3
				%v1.4 = load i32, i32* %arrayidx13, align 4
				%add15 = mul nsw i32 %v1.4, %scalar.2
				%arrayidx17 = getelementptr inbounds i32, i32* %a, i64 3
				store i32 %add15, i32* %arrayidx17, align 4

				ret void
				}

				define void @test_add(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c) {
				entry:
				%scalar.1 = load i32, i32* %b, align 4
				%v1.1 = load i32, i32* %c, align 4
				%add = add nsw i32 %v1.1, %scalar.1
				store i32 %add, i32* %a, align 4
				%arrayidx3 = getelementptr inbounds i32, i32* %c, i64 1
				%v1.2 = load i32, i32* %arrayidx3, align 4
				%add5 = add nsw i32 %v1.2, %scalar.1
				%arrayidx7 = getelementptr inbounds i32, i32* %a, i64 1
				store i32 %add5, i32* %arrayidx7, align 4

				%si.2 = getelementptr inbounds i32, i32* %b, i64 1
				%scalar.2 = load i32, i32* %si.2, align 4

				%arrayidx8 = getelementptr inbounds i32, i32* %c, i64 2
				%v1.3 = load i32, i32* %arrayidx8, align 4
				%add10 = add nsw i32 %v1.3, %scalar.2
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 2
				store i32 %add10, i32* %arrayidx12, align 4
				%arrayidx13 = getelementptr inbounds i32, i32* %c, i64 3
				%v1.4 = load i32, i32* %arrayidx13, align 4
				%add15 = add nsw i32 %v1.4, %scalar.2
				%arrayidx17 = getelementptr inbounds i32, i32* %a, i64 3
				store i32 %add15, i32* %arrayidx17, align 4

				ret void
				}

test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 2), align 4			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[TMP3]], i32 1
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP2]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP3]], i32 2			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP0]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = icmp sgt <4 x i32> [[TMP7]], zeroinitializer			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 3
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1			; CHECK-NEXT: [[TMP9:%.*]] = icmp sgt <4 x i32> [[TMP8]], zeroinitializer
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP2]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 8, i32 3			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP12:%.*]] = select <4 x i1> [[TMP8]], <4 x i32> [[TMP11]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>			; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 8, i32 3
	; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4			; CHECK-NEXT: [[TMP13:%.*]] = select <4 x i1> [[TMP9]], <4 x i32> [[TMP12]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
				; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4
	[13 lines not shown]

test/Transforms/SLPVectorizer/X86/operandorder.ll

[19 lines not shown]	define void @shuffle_operands1(double * noalias %from, double * noalias %to,
%v1_2 = fadd double %v2, %v0_2		%v1_2 = fadd double %v2, %v0_2
%to_2 = getelementptr double, double * %to, i64 1		%to_2 = getelementptr double, double * %to, i64 1
store double %v1_1, double *%to		store double %v1_1, double *%to
store double %v1_2, double *%to_2		store double %v1_2, double *%to_2
ret void		ret void
}		}

; CHECK-LABEL: shuffle_preserve_broadcast		; CHECK-LABEL: shuffle_preserve_broadcast
; CHECK: %[[BCAST:[a-z0-9]+]] = insertelement <2 x double> undef, double %v0_1		; CHECK: bitcast double* %from to <2 x double>*
; CHECK: = shufflevector <2 x double> %[[BCAST]], <2 x double> undef, <2 x i32> zeroinitializer		; CHECK: %[[MERGED_LOAD:[a-z0-9]+]] = load <2 x double>, <2 x double>* %0, align 4
		; CHECK: shufflevector <2 x double> %[[MERGED_LOAD]], <2 x double> undef, <2 x i32> zeroinitializer
define void @shuffle_preserve_broadcast(double * noalias %from,		define void @shuffle_preserve_broadcast(double * noalias %from,
double * noalias %to,		double * noalias %to,
double %v1, double %v2) {		double %v1, double %v2) {
entry:		entry:
br label %lp		br label %lp

lp:		lp:
%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]		%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]
%from_1 = getelementptr double, double *%from, i64 1		%from_1 = getelementptr double, double *%from, i64 1
%v0_1 = load double , double * %from		%v0_1 = load double , double * %from
%v0_2 = load double , double * %from_1		%v0_2 = load double , double * %from_1
%v1_1 = fadd double %v0_1, %p		%v1_1 = fadd double %v0_1, %p
%v1_2 = fadd double %v0_1, %v0_2		%v1_2 = fadd double %v0_1, %v0_2
%to_2 = getelementptr double, double * %to, i64 1		%to_2 = getelementptr double, double * %to, i64 1
store double %v1_1, double *%to		store double %v1_1, double *%to
store double %v1_2, double *%to_2		store double %v1_2, double *%to_2
br i1 undef, label %lp, label %ext		br i1 undef, label %lp, label %ext

ext:		ext:
ret void		ret void
}		}

; CHECK-LABEL: shuffle_preserve_broadcast2		; CHECK-LABEL: shuffle_preserve_broadcast2
; CHECK: %[[BCAST:[a-z0-9]+]] = insertelement <2 x double> undef, double %v0_1		; CHECK: bitcast double* %from to <2 x double>*
; CHECK: = shufflevector <2 x double> %[[BCAST]], <2 x double> undef, <2 x i32> zeroinitializer		; CHECK: %[[MERGED_LOAD:[a-z0-9]+]] = load <2 x double>, <2 x double>* %0, align 4
		; CHECK: %[[P_INSERT:[a-z0-9]+]] = insertelement <2 x double> undef, double %p, i32 0
		; CHECK: = shufflevector <2 x double> %[[P_INSERT]], <2 x double> %[[MERGED_LOAD]], <2 x i32> <i32 0, i32 3>
		; CHECK: = shufflevector <2 x double> %[[MERGED_LOAD]], <2 x double> undef, <2 x i32> zeroinitializer
define void @shuffle_preserve_broadcast2(double * noalias %from,		define void @shuffle_preserve_broadcast2(double * noalias %from,
double * noalias %to,		double * noalias %to,
double %v1, double %v2) {		double %v1, double %v2) {
entry:		entry:
br label %lp		br label %lp

lp:		lp:
%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]		%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]
%from_1 = getelementptr double, double *%from, i64 1		%from_1 = getelementptr double, double *%from, i64 1
%v0_1 = load double , double * %from		%v0_1 = load double , double * %from
%v0_2 = load double , double * %from_1		%v0_2 = load double , double * %from_1
%v1_1 = fadd double %p, %v0_1		%v1_1 = fadd double %p, %v0_1
%v1_2 = fadd double %v0_2, %v0_1		%v1_2 = fadd double %v0_2, %v0_1
%to_2 = getelementptr double, double * %to, i64 1		%to_2 = getelementptr double, double * %to, i64 1
store double %v1_1, double *%to		store double %v1_1, double *%to
store double %v1_2, double *%to_2		store double %v1_2, double *%to_2
br i1 undef, label %lp, label %ext		br i1 undef, label %lp, label %ext

ext:		ext:
ret void		ret void
}		}

; CHECK-LABEL: shuffle_preserve_broadcast3		; CHECK-LABEL: shuffle_preserve_broadcast3
; CHECK: %[[BCAST:[a-z0-9]+]] = insertelement <2 x double> undef, double %v0_1		; CHECK: %[[MERGED_LOAD:[a-z0-9]+]] = load <2 x double>, <2 x double>* %0, align 4
; CHECK: = shufflevector <2 x double> %[[BCAST]], <2 x double> undef, <2 x i32> zeroinitializer		; CHECK: = shufflevector <2 x double> %[[MERGED_LOAD]], <2 x double> undef, <2 x i32> zeroinitializer
define void @shuffle_preserve_broadcast3(double * noalias %from,		define void @shuffle_preserve_broadcast3(double * noalias %from,
double * noalias %to,		double * noalias %to,
double %v1, double %v2) {		double %v1, double %v2) {
entry:		entry:
br label %lp		br label %lp

lp:		lp:
%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]		%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]
%from_1 = getelementptr double, double *%from, i64 1		%from_1 = getelementptr double, double *%from, i64 1
%v0_1 = load double , double * %from		%v0_1 = load double , double * %from
%v0_2 = load double , double * %from_1		%v0_2 = load double , double * %from_1
%v1_1 = fadd double %p, %v0_1		%v1_1 = fadd double %p, %v0_1
%v1_2 = fadd double %v0_1, %v0_2		%v1_2 = fadd double %v0_1, %v0_2
%to_2 = getelementptr double, double * %to, i64 1		%to_2 = getelementptr double, double * %to, i64 1
store double %v1_1, double *%to		store double %v1_1, double *%to
store double %v1_2, double *%to_2		store double %v1_2, double *%to_2
br i1 undef, label %lp, label %ext		br i1 undef, label %lp, label %ext

ext:		ext:
ret void		ret void
}		}


; CHECK-LABEL: shuffle_preserve_broadcast4		; CHECK-LABEL: shuffle_preserve_broadcast4
; CHECK: %[[BCAST:[a-z0-9]+]] = insertelement <2 x double> undef, double %v0_1		; CHECK: %[[MERGED_LOAD:[a-z0-9]+]] = load <2 x double>, <2 x double>* %0, align 4
; CHECK: = shufflevector <2 x double> %[[BCAST]], <2 x double> undef, <2 x i32> zeroinitializer		; CHECK: = shufflevector <2 x double> %[[MERGED_LOAD]], <2 x double> undef, <2 x i32> zeroinitializer
define void @shuffle_preserve_broadcast4(double * noalias %from,		define void @shuffle_preserve_broadcast4(double * noalias %from,
double * noalias %to,		double * noalias %to,
double %v1, double %v2) {		double %v1, double %v2) {
entry:		entry:
br label %lp		br label %lp

lp:		lp:
%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]		%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]
%from_1 = getelementptr double, double *%from, i64 1		%from_1 = getelementptr double, double *%from, i64 1
%v0_1 = load double , double * %from		%v0_1 = load double , double * %from
%v0_2 = load double , double * %from_1		%v0_2 = load double , double * %from_1
%v1_1 = fadd double %v0_2, %v0_1		%v1_1 = fadd double %v0_2, %v0_1
%v1_2 = fadd double %p, %v0_1		%v1_2 = fadd double %p, %v0_1
%to_2 = getelementptr double, double * %to, i64 1		%to_2 = getelementptr double, double * %to, i64 1
store double %v1_1, double *%to		store double %v1_1, double *%to
store double %v1_2, double *%to_2		store double %v1_2, double *%to_2
br i1 undef, label %lp, label %ext		br i1 undef, label %lp, label %ext

ext:		ext:
ret void		ret void
}		}

; CHECK-LABEL: shuffle_preserve_broadcast5		; CHECK-LABEL: shuffle_preserve_broadcast5
; CHECK: %[[BCAST:[a-z0-9]+]] = insertelement <2 x double> undef, double %v0_1		; CHECK: %[[MERGED_LOAD:[a-z0-9]+]] = load <2 x double>, <2 x double>* %0, align 4
; CHECK: = shufflevector <2 x double> %[[BCAST]], <2 x double> undef, <2 x i32> zeroinitializer		; CHECK: = shufflevector <2 x double> %[[MERGED_LOAD]], <2 x double> undef, <2 x i32> zeroinitializer
define void @shuffle_preserve_broadcast5(double * noalias %from,		define void @shuffle_preserve_broadcast5(double * noalias %from,
double * noalias %to,		double * noalias %to,
double %v1, double %v2) {		double %v1, double %v2) {
entry:		entry:
br label %lp		br label %lp

lp:		lp:
%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]		%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]
%from_1 = getelementptr double, double *%from, i64 1		%from_1 = getelementptr double, double *%from, i64 1
%v0_1 = load double , double * %from		%v0_1 = load double , double * %from
%v0_2 = load double , double * %from_1		%v0_2 = load double , double * %from_1
%v1_1 = fadd double %v0_1, %v0_2		%v1_1 = fadd double %v0_1, %v0_2
%v1_2 = fadd double %p, %v0_1		%v1_2 = fadd double %p, %v0_1
%to_2 = getelementptr double, double * %to, i64 1		%to_2 = getelementptr double, double * %to, i64 1
store double %v1_1, double *%to		store double %v1_1, double *%to
store double %v1_2, double *%to_2		store double %v1_2, double *%to_2
br i1 undef, label %lp, label %ext		br i1 undef, label %lp, label %ext

ext:		ext:
ret void		ret void
}		}


; CHECK-LABEL: shuffle_preserve_broadcast6		; CHECK-LABEL: shuffle_preserve_broadcast6
; CHECK: %[[BCAST:[a-z0-9]+]] = insertelement <2 x double> undef, double %v0_1		; CHECK: %[[MERGED_LOAD:[a-z0-9]+]] = load <2 x double>, <2 x double>* %0, align 4
; CHECK: = shufflevector <2 x double> %[[BCAST]], <2 x double> undef, <2 x i32> zeroinitializer		; CHECK: = shufflevector <2 x double> %[[MERGED_LOAD]], <2 x double> undef, <2 x i32> zeroinitializer
define void @shuffle_preserve_broadcast6(double * noalias %from,		define void @shuffle_preserve_broadcast6(double * noalias %from,
double * noalias %to,		double * noalias %to,
double %v1, double %v2) {		double %v1, double %v2) {
entry:		entry:
br label %lp		br label %lp

lp:		lp:
%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]		%p = phi double [ 1.000000e+00, %lp ], [ 0.000000e+00, %entry ]
[180 lines not shown]