This is an archive of the discontinued LLVM Phabricator instance.

Differential D20474

when calculating RegUsages, ignore instructions which are uniformed after vectorization
ClosedPublic

Authored by wmi on May 20 2016, 9:43 AM.


Details

Reviewers

jmolloy
mkuper
hfinkel

Commits

rG79997a24d750: Recommit the patch "Use uniforms set to populate VecValuesToIgnore".
rG1fd25726afce: Use uniforms set to populate VecValuesToIgnore.
rL275912: Use uniforms set to populate VecValuesToIgnore.

Summary

This is a follow-up patch to http://reviews.llvm.org/D15177.

As Hal pointed out in http://reviews.llvm.org/D15177?id=41809#inline-126094, D15177 knows that an induction variable used only by GetElementPtr and ICmp instructions will not have a vectorized version, so it can be added to VecValuesToIgnore. However, D15177 does not handle the case where the induction variable is first used by an add/sub before it reaches the GetElementPtr.

For a loop like the one below:

    char a[1000];
    char b[1000];
    for (long i = 0; i < N; i++)
      a[i] = b[i] * 6 + (b[i] + b[i + 1]) * 4 + b[i - 2] + b[i + 2];

When computing RegUsages for VF == 8 and VF == 16, it is important for the register-usage estimator to know that array index expressions such as i, i + 1, i + 2 and i - 2 will not have vectorized versions after the loop is vectorized. Their live ranges should not be counted as vector register usage; otherwise, the estimate is likely to exaggerate vector register pressure.
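
As a rough illustration of why these index expressions matter, the sketch below models register-usage estimation as counting overlapping live ranges. It is a simplified, hypothetical model rather than LLVM's actual estimator: the LiveRange struct, the half-open [Def, LastUse) intervals over instruction indices, and maxLiveVectorValues() are illustrative names, and it assumes every counted value occupies exactly one vector register.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical live range of one loop value, in instruction indices:
// the value is live on the half-open interval [Def, LastUse).
struct LiveRange {
  std::string Name;
  unsigned Def;
  unsigned LastUse;
};

// Returns the peak number of simultaneously live values, skipping every
// value named in Ignore (e.g. values that stay scalar after vectorization).
// Assumes each counted value occupies exactly one vector register.
unsigned maxLiveVectorValues(const std::vector<LiveRange> &Ranges,
                             const std::set<std::string> &Ignore) {
  unsigned End = 0;
  for (const LiveRange &R : Ranges)
    End = std::max(End, R.LastUse);

  unsigned Peak = 0;
  for (unsigned Idx = 0; Idx < End; ++Idx) {
    unsigned Live = 0;
    for (const LiveRange &R : Ranges)
      if (!Ignore.count(R.Name) && R.Def <= Idx && Idx < R.LastUse)
        ++Live;
    Peak = std::max(Peak, Live);
  }
  return Peak;
}
```

For example, with ranges i [0,6), i1 [1,3), i2 [2,4), b0 [3,5) and b1 [4,6), the function returns 3 when nothing is ignored and 2 when i, i1 and i2 are ignored.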

The patch adds instructions for which isUniformAfterVectorization returns true to the VecValuesToIgnore set. PHI instructions need special handling because collectLoopUniforms() never includes them in the Uniforms set: when the result of a PHI is used only by GetElementPtr or uniform instructions, the PHI is added to the VecValuesToIgnore set as well.
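
The PHI special case can be sketched in isolation. This is a minimal model, assuming a name-based def-use map rather than LLVM's Instruction and User classes; UserMap, allUsersCovered() and addInductionIfScalar() are hypothetical names. It follows the generalized rule used for induction PHIs in getDependentClosure() in the final diff: the PHI and its latch update are added together only when each is used solely by the other, by instructions outside the loop, or by instructions already in the set.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical def-use model: Users[V] lists the instructions that use V.
// Names not in InLoop are treated as outside the loop.
using UserMap = std::map<std::string, std::vector<std::string>>;

// True if every user of V is Partner, lies outside the loop, or is
// already in Set.
static bool allUsersCovered(const UserMap &Users,
                            const std::set<std::string> &InLoop,
                            const std::set<std::string> &Set,
                            const std::string &V, const std::string &Partner) {
  auto It = Users.find(V);
  if (It == Users.end())
    return true;
  for (const std::string &U : It->second)
    if (U != Partner && InLoop.count(U) && !Set.count(U))
      return false;
  return true;
}

// Adds the induction PHI and its latch update to Set together, but only
// when each is used solely by the other, by out-of-loop instructions, or
// by instructions already in Set.
void addInductionIfScalar(const UserMap &Users,
                          const std::set<std::string> &InLoop,
                          std::set<std::string> &Set, const std::string &Phi,
                          const std::string &Update) {
  if (allUsersCovered(Users, InLoop, Set, Phi, Update) &&
      allUsersCovered(Users, InLoop, Set, Update, Phi)) {
    Set.insert(Phi);
    Set.insert(Update);
  }
}
```

The pair is handled together because each is a user of the other, so a rule that waits for all users to be in the set first would never admit either one.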

Another follow-up patch is planned: if the estimated vector register usage for a certain VF exceeds the number of available hardware vector registers, don't simply give up on that VF. Instead, add the extra spill cost to its VectorCost. If the total VectorCost of the VF is still the lowest, the VF is still worth trying.
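
The planned follow-up could look roughly like the sketch below. Everything here is hypothetical: the VFEstimate struct, pickVF(), and in particular the linear per-register spill penalty are assumptions for illustration, not the cost model LLVM uses. Candidates are compared by cost per lane (total cost divided by VF).

```cpp
#include <vector>

// Hypothetical per-VF estimate produced by a cost model.
struct VFEstimate {
  unsigned VF;          // vectorization factor
  double VectorCost;    // estimated cost of one vector iteration
  unsigned MaxRegUsage; // estimated peak vector register usage
};

// Picks the VF with the lowest cost per lane. Instead of discarding a VF
// whose register usage exceeds NumRegs, it charges SpillCostPerReg for each
// register beyond the budget. The linear penalty is an assumption.
unsigned pickVF(const std::vector<VFEstimate> &Candidates, unsigned NumRegs,
                double SpillCostPerReg) {
  unsigned BestVF = 1;
  double BestCost = -1.0; // negative means "no candidate seen yet"
  for (const VFEstimate &E : Candidates) {
    double Cost = E.VectorCost;
    if (E.MaxRegUsage > NumRegs)
      Cost += (E.MaxRegUsage - NumRegs) * SpillCostPerReg;
    double PerLane = Cost / E.VF;
    if (BestCost < 0 || PerLane < BestCost) {
      BestCost = PerLane;
      BestVF = E.VF;
    }
  }
  return BestVF;
}
```

For instance, with 16 registers and candidates VF = 8 (cost 10, 12 registers) and VF = 16 (cost 12, 18 registers), pickVF returns 16 when each extra register costs 1.0 (0.875 per lane versus 1.25) and 8 when it costs 10.0 (2.0 per lane).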

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 57942. May 20 2016, 9:43 AM

wmi retitled this revision to "when calculating RegUsages, ignore instructions which are uniformed after vectorization".

wmi updated this object.

wmi added reviewers: hfinkel, jmolloy.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: llvm-commits, davidxl, congh, mkuper.

Herald added a subscriber: mzolotukhin. May 20 2016, 9:43 AM

Ping.

mkuper added inline comments. Jun 6 2016, 12:50 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6493–6494	This isn't directly related to this patch - but wouldn't this be true only for consecutive GEPs? (e.g. see D20789)

mkuper added inline comments. Jun 7 2016, 6:30 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5669	Another side point - should we remove this? Looking at http://reviews.llvm.org/rL172178, the reason that we only look at loads, stores, and PHIs is that "We don't have a detailed analysis on which values are vectorized and which stay scalars in the vectorized loop so we use another method. We look at reduction variables, loads and stores, which are the only ways to get information in and out of loop iterations". That was true at the time, but since then we've gained a precise way of knowing which instructions are uniform, and with this patch will actually use that for ValuesToIgnore. So this check will now only miss instructions that ought to be taken into account, right?

wmi added inline comments. Jun 8 2016, 2:39 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5669	Yes, it makes sense to remove it.
6493–6494	Every GEP (whether consecutive or not) will be scalarized, regardless of the load/store that uses it. If the induction variable is only used in GEPs, it will not be vectorized, right?

mkuper added inline comments. Jun 8 2016, 3:00 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6493–6494	I don't think so - as far as I know, we should be creating vector GEPs for scatter/gather when it's profitable on the target. (I think the only target that supports it right now is AVX-512.)

wmi added inline comments. Jun 8 2016, 5:09 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6493–6494	I see. If a[3i] is vectorized using gather/scatter, it needs a vectorized version of 3i, so it is probably better to generate a vectorized version of i. In that case, i shouldn't be added to VecValuesToIgnore. Thanks.

Add non-consecutive pointer values and their dependencies to the uniform-after-vectorization set in collectLoopUniforms when gather/scatter is not supported, and remove the related code from collectValuesToIgnore.

The new test hoo ensures that the pointer of a load that can be lowered to a gather is not added to the VecValuesToIgnore set.
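
The seeding rule for the set of values without vector versions can be sketched as follows. This is a hypothetical model: MemAccess and seedNonVectorValues() are illustrative names, and the two flags stand in for the consecutive-pointer and gather/scatter legality queries. The loop-exit compare is always a seed; a pointer is a seed if it is consecutive or cannot be lowered to a gather/scatter; a pointer that can feed a gather/scatter, as in the hoo test, is left out because it needs a vector of addresses.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical summary of one load/store in the loop.
struct MemAccess {
  std::string Ptr;         // name of the pointer operand
  bool Consecutive;        // unit-stride access
  bool GatherScatterLegal; // target could use a gather/scatter instead
};

// Returns the seed values that will not need a vector version. An empty
// LoopCmp models "no in-loop compare feeds the latch branch".
std::vector<std::string>
seedNonVectorValues(const std::string &LoopCmp,
                    const std::vector<MemAccess> &Accesses) {
  std::vector<std::string> Seeds;
  if (!LoopCmp.empty())
    Seeds.push_back(LoopCmp);
  for (const MemAccess &A : Accesses) {
    bool NoVectorAddress = A.Consecutive || !A.GatherScatterLegal;
    // Keep the seed list duplicate-free, like a SetVector.
    if (NoVectorAddress &&
        std::find(Seeds.begin(), Seeds.end(), A.Ptr) == Seeds.end())
      Seeds.push_back(A.Ptr);
  }
  return Seeds;
}
```

For accesses through p.a (consecutive), p.b (gather-legal) and p.c (neither), the seeds are the compare, p.a and p.c.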

mkuper added inline comments. Jun 22 2016, 4:43 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5057	Are you sure about this? Nonconsecutive pointer values when there is no gather/scatter will be scalarized, but they aren't uniform. So I'm not sure we should be counting them as uniform. This will work correctly for your new use of isUniformAfterVectorization() (since we really don't need vector registers in either case). But I think it may do the wrong thing for the existing use, in getInstructionCost(). We shouldn't be evaluating the cost of non-consecutive loads/stores as if they are a single scalar load/store. Am I confused?
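
The distinction drawn here can be illustrated with a toy cost function. This is a hypothetical, simplified model: the Lowering enum, instructionCost() and needsVectorRegister() are illustrative names, and the model ignores insert/extract overhead. A uniform instruction is emitted once, a scalarized one VF times, and a widened one as a single vector instruction. Neither uniform nor scalarized values need a vector register, but their costs differ by a factor of VF.

```cpp
// Hypothetical classification of how an instruction is lowered for a VF.
enum class Lowering {
  Uniform,    // one scalar instance serves all VF lanes
  Scalarized, // VF independent scalar instances, one per lane
  Widened     // a single vector instruction
};

// Rough cost of one instruction at vectorization factor VF. Treating a
// scalarized access as uniform would undercount it by a factor of VF.
unsigned instructionCost(Lowering L, unsigned VF, unsigned ScalarCost,
                         unsigned VectorCost) {
  switch (L) {
  case Lowering::Uniform:
    return ScalarCost;
  case Lowering::Scalarized:
    return VF * ScalarCost;
  case Lowering::Widened:
    return VectorCost;
  }
  return VectorCost; // not reached; keeps compilers from warning
}

// Only widened values occupy vector registers.
bool needsVectorRegister(Lowering L) { return L == Lowering::Widened; }
```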

wmi added inline comments. Jun 22 2016, 5:45 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5057	Ah, you are right. I misunderstood what uniform means here. Will fix it.

mssimpso added a subscriber: mssimpso. Jun 23 2016, 12:20 PM

mssimpso added inline comments. Jun 23 2016, 2:38 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6502–6503	Hi Wei, I'm joining this review a bit late, so I apologize if I'm not quite up-to-speed yet. But I'm not sure I follow this. Please correct me if I'm missing something! If I take the following test case:

define void @test(i32* %a, i64 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
  %0 = trunc i64 %i to i32
  %1 = getelementptr inbounds i32, i32* %a, i32 %0
  store i32 %0, i32* %1, align 4
  %i.next = add nuw nsw i64 %i, 1
  %cond = icmp eq i64 %i.next, %n
  br i1 %cond, label %for.end, label %for.body

for.end:
  ret void
}

We generate vectorized induction variables so that for the store, we have:

  store <4 x i32> %vec.ind1, <4 x i32>* %3, align 4

However, with your change, it looks to me like we will add the induction variable to VecValuesToIgnore where previously we wouldn't have. Is this right?

wmi added inline comments. Jun 23 2016, 3:07 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6502–6503	Thanks for pointing out the problem. In your test case, %i is only used by instruction %0 = ..., which is a uniform instruction, so %i will be put into VecValuesToIgnore. However, it may be wrong to treat %0 as uniform, because it actually has both a scalar and a vector version after vectorization. I will take your test case into account. Thanks.

mkuper added inline comments. Jun 23 2016, 4:32 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6502–6503	So the problem isn't here, it's in collectLoopUniforms(), right? That is, the problem is that we're tracing back from the consecutive pointer, and adding all of the operands of the store to the worklist, instead of just the pointer operand?

wmi added inline comments. Jun 23 2016, 4:52 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6502–6503	It looks like all of the operands of the store should be added to the worklist, but only operands that are not used by any non-uniform instruction should be regarded as uniform. That requires the collectLoopUniforms algorithm to check the values in the worklist in some topological order. Do you think that is the right way to go for collectLoopUniforms?

mkuper added inline comments. Jun 23 2016, 5:16 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6502–6503	You're right, I got confused. We're not tracing back from the store anyway, it's from the use of %0 in the GEP, I didn't read your original comment correctly. The real problem, as you say, is that we're assuming that every operand of a uniform instruction must be uniform, but it simply isn't true in our current definition of "uniform".

Moved the change to collectLoopUniforms into a separate patch: http://reviews.llvm.org/D21755

Extract the main part of collectLoopUniforms into a helper function, getDependentClosure, so that it can be reused by collectValuesToIgnore. For collectLoopUniforms, the only seed uniform instructions in the Worklist are the loop compare and the consecutive pointers of loads/stores.
For collectValuesToIgnore, the seed non-vector instructions in the Worklist are the loop compare, the consecutive pointers, and the pointers that are both non-consecutive and not eligible for gather/scatter.
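
The dependent-closure expansion can be modeled outside LLVM as below. This is a minimal sketch, assuming a name-based def-use model; Graph, LoopModel, lookup() and dependentClosure() are hypothetical names, and a std::vector plus a std::set stand in for SetVector. Like the worklist loop in getDependentClosure(), it scans the worklist in insertion order and admits an in-loop operand only once every in-loop user of that operand is already in the set. The induction-PHI special case is not included.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Hypothetical def-use model of a loop body. Names missing from InLoop
// (arguments, constants, out-of-loop values) are out of scope.
struct LoopModel {
  std::set<std::string> InLoop;
  Graph Operands; // Operands[I]: values used by I
  Graph Users;    // Users[V]: instructions that use V
};

static std::vector<std::string> lookup(const Graph &G, const std::string &K) {
  auto It = G.find(K);
  return It == G.end() ? std::vector<std::string>() : It->second;
}

// Grows the seed Worklist in insertion order: an in-loop operand is added
// only once every one of its in-loop users is already in the set.
std::vector<std::string> dependentClosure(const LoopModel &M,
                                          std::vector<std::string> Worklist) {
  std::set<std::string> InSet(Worklist.begin(), Worklist.end());
  for (std::size_t Idx = 0; Idx < Worklist.size(); ++Idx) {
    const std::string I = Worklist[Idx]; // copy: push_back may reallocate
    for (const std::string &Op : lookup(M.Operands, I)) {
      if (!M.InLoop.count(Op) || InSet.count(Op))
        continue;
      std::vector<std::string> Us = lookup(M.Users, Op);
      bool AllCovered =
          std::all_of(Us.begin(), Us.end(), [&](const std::string &U) {
            return !M.InLoop.count(U) || InSet.count(U);
          });
      if (AllCovered) {
        Worklist.push_back(Op);
        InSet.insert(Op);
      }
    }
  }
  return Worklist;
}
```

On mssimpso's test case, seeding with the loop compare and the consecutive GEP leaves %0 = trunc out, because it also feeds the stored value. In a loop where i + 1 is used only by a GEP, that add is pulled in, while i itself still waits on its update and needs the PHI special case.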

Herald added a subscriber: nemanjai. Jul 6 2016, 9:27 PM

wmi added a reviewer: mkuper. Jul 6 2016, 9:29 PM

mssimpso added inline comments. Jul 7 2016, 10:06 AM

test/Transforms/LoopVectorize/reverse_iter.ll
38–41	Hi Wei, the change to this test doesn't look right to me. Since indvars.iv feeds into the shl, why is it added to VecValuesToIgnore? The shift remains a vector computation. Am I missing something?

wmi added inline comments. Jul 7 2016, 10:57 AM

test/Transforms/LoopVectorize/reverse_iter.ll
38–41	Thanks for catching the problem. My assumption that the chain feeding into a "non-gather/scatter && non-consecutive" getelementptr will only have a scalar version is wrong. Will update the patch.

Fix the problem pointed out by Matthew. When the induction variable is only used by uniform instructions or by non-consecutive, non-gather/scatter pointer instructions, the related PHI and its update are added to the VecValuesToIgnore set.

Sorry, I lost track of this patch.

LGTM, modulo a couple of nits.

lib/Transforms/Vectorize/LoopVectorize.cpp
6500	I was sure we already had a common LI/SI getPointerOperand() helper. Turns out we have (at least!) 6, in: LoopAccessAnalysis, EarlyCSE, DependenceAnalysis, PPCLoopPreIncPrep, Delinearization, and LoadStoreVectorizer. I'm going to refactor this into a common helper somewhere in utils, but can you hoist this into another local helper? It'll be easier for me to keep track of, in case I land after you do. (If you prefer to do the refactoring yourself, let me know. :-) )
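
A local helper of the kind requested here can be modeled without LLVM. The sketch below is hypothetical: Load, Store, Other and the Inst variant stand in for LLVM's LoadInst and StoreInst, and std::get_if plays the role of dyn_cast. It returns the pointer operand for a load or store and nullptr for anything else.

```cpp
#include <string>
#include <variant>

// Hypothetical stand-ins for load/store instructions; only the pointer
// operand matters here.
struct Load { std::string Ptr; };
struct Store { std::string Val; std::string Ptr; };
struct Other { std::string Name; };
using Inst = std::variant<Load, Store, Other>;

// Returns the pointer operand of a load or store, or nullptr otherwise.
const std::string *getPointerOperand(const Inst &I) {
  if (const auto *L = std::get_if<Load>(&I))
    return &L->Ptr;
  if (const auto *S = std::get_if<Store>(&I))
    return &S->Ptr;
  return nullptr;
}
```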
test/Transforms/LoopVectorize/X86/reg-usage.ll
46	Could you please document what each of the two new tests actually tries to check?

This revision is now accepted and ready to land. Jul 15 2016, 4:50 PM

Michael, thanks for the review.

lib/Transforms/Vectorize/LoopVectorize.cpp
6500	Added a local helper for it.
test/Transforms/LoopVectorize/X86/reg-usage.ll
46	Comments added.

Addressed Michael's comments.

LGTM

Closed by commit rL275912: Use uniforms set to populate VecValuesToIgnore. (authored by wmi). Jul 18 2016, 2:07 PM

This revision was automatically updated to reflect the committed changes.

samparker mentioned this in D23509: [LoopVectorize] Query TTI when deciding to splat IV. Aug 15 2016, 6:46 AM

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	156 lines
test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll	2 lines
test/Transforms/LoopVectorize/X86/reg-usage.ll	70 lines
test/Transforms/LoopVectorize/reverse_induction.ll	43 lines
test/Transforms/LoopVectorize/reverse_iter.ll	16 lines

Diff 63026

lib/Transforms/Vectorize/LoopVectorize.cpp


(5,034 unchanged lines not shown)	bool LoopVectorizationLegality::canVectorizeInstrs() {
// is the same size. If it's not, unset it here and InnerLoopVectorizer		// is the same size. If it's not, unset it here and InnerLoopVectorizer
// will create another.		// will create another.
if (Induction && WidestIndTy != Induction->getType())		if (Induction && WidestIndTy != Induction->getType())
Induction = nullptr;		Induction = nullptr;

return true;		return true;
}		}

void LoopVectorizationLegality::collectLoopUniforms() {		/// Given some seed instructions in \p Worklist, find out the dependent
// We now know that the loop is vectorizable!		/// closure set and return it in \p Worklist. The dependent closure set
// Collect variables that will remain uniform after vectorization.		/// contains the seed instructions, and all the instructions in the
		/// \p Loop which are either used by other instructions in the set, or
		/// by instructions outside of the \p loop.
		static void getDependentClosure(SetVector<Instruction *> &Worklist, Loop *Loop,
		LoopVectorizationLegality *Legal) {
// If V is not an instruction inside the current loop, it is a Value		// If V is not an instruction inside the current loop, it is a Value
// outside of the scope which we are interesting in.		// outside of the scope which we are interesting in.
auto isOutOfScope = [&](Value *V) -> bool {		auto isOutOfScope = [&](Value *V) -> bool {
Instruction *I = dyn_cast<Instruction>(V);		Instruction *I = dyn_cast<Instruction>(V);
return (!I || !TheLoop->contains(I));		return (!I || !Loop->contains(I));
};		};

SetVector<Instruction *> Worklist;
BasicBlock *Latch = TheLoop->getLoopLatch();
// Start with the conditional branch.
if (!isOutOfScope(Latch->getTerminator()->getOperand(0))) {
Instruction *Cmp = cast<Instruction>(Latch->getTerminator()->getOperand(0));
Worklist.insert(Cmp);
DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n");
}

// Also add all consecutive pointer values; these values will be uniform
// after vectorization (and subsequent cleanup).
for (auto *BB : TheLoop->getBlocks()) {
for (auto &I : *BB) {
if (I.getType()->isPointerTy() && isConsecutivePtr(&I)) {
Worklist.insert(&I);
DEBUG(dbgs() << "LV: Found uniform instruction: " << I << "\n");
}
}
}

// Expand Worklist in topological order: whenever a new instruction		// Expand Worklist in topological order: whenever a new instruction
// is added , its users should be either already inside Worklist, or		// is added , its users should be either already inside Worklist, or
// out of scope. It ensures a uniform instruction will only be used		// out of scope.
// by uniform instructions or out of scope instructions.
unsigned idx = 0;		unsigned idx = 0;
do {		do {
Instruction *I = Worklist[idx++];		Instruction *I = Worklist[idx++];

for (auto OV : I->operand_values()) {		for (auto OV : I->operand_values()) {
if (isOutOfScope(OV))		if (isOutOfScope(OV))
continue;		continue;
Instruction *OI = cast<Instruction>(OV);		Instruction *OI = cast<Instruction>(OV);
if (std::all_of(OI->user_begin(), OI->user_end(), [&](User *U) -> bool {		if (std::all_of(OI->user_begin(), OI->user_end(), [&](User *U) -> bool {
return isOutOfScope(U) || Worklist.count(cast<Instruction>(U));		return isOutOfScope(U) || Worklist.count(cast<Instruction>(U));
})) {		}))
Worklist.insert(OI);		Worklist.insert(OI);
DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n");
}
}		}
} while (idx != Worklist.size());		} while (idx != Worklist.size());

// For an instruction to be added into Worklist above, all its users inside		// For an instruction to be added into Worklist above, all its users inside
// the current loop should be already added into Worklist. This condition		// the current loop should be already added into Worklist. This condition
// cannot be true for phi instructions which is always in a dependence loop.		// cannot be true for phi instructions which is always in a dependence loop.
// Because any instruction in the dependence cycle always depends on others		// Because any instruction in the dependence cycle always depends on others
// in the cycle to be added into Worklist first, the result is no ones in		// in the cycle to be added into Worklist first, the result is no ones in
// the cycle will be added into Worklist in the end.		// the cycle will be added into Worklist in the end.
// That is why we process PHI separately.		// That is why we process PHI separately.
for (auto &Induction : *getInductionVars()) {		for (auto &Induction : *Legal->getInductionVars()) {
auto *PN = Induction.first;		auto *PN = Induction.first;
auto *UpdateV = PN->getIncomingValueForBlock(TheLoop->getLoopLatch());		auto *UpdateV = PN->getIncomingValueForBlock(Loop->getLoopLatch());
if (std::all_of(PN->user_begin(), PN->user_end(),		if (std::all_of(PN->user_begin(), PN->user_end(),
[&](User *U) -> bool {		[&](User *U) -> bool {
return U == UpdateV || isOutOfScope(U) ||		return U == UpdateV || isOutOfScope(U) ||
Worklist.count(cast<Instruction>(U));		Worklist.count(cast<Instruction>(U));
}) &&		}) &&
std::all_of(UpdateV->user_begin(), UpdateV->user_end(),		std::all_of(UpdateV->user_begin(), UpdateV->user_end(),
[&](User *U) -> bool {		[&](User *U) -> bool {
return U == PN || isOutOfScope(U) ||		return U == PN || isOutOfScope(U) ||
Worklist.count(cast<Instruction>(U));		Worklist.count(cast<Instruction>(U));
})) {		})) {
Worklist.insert(cast<Instruction>(PN));		Worklist.insert(cast<Instruction>(PN));
Worklist.insert(cast<Instruction>(UpdateV));		Worklist.insert(cast<Instruction>(UpdateV));
DEBUG(dbgs() << "LV: Found uniform instruction: " << *PN << "\n");		}
DEBUG(dbgs() << "LV: Found uniform instruction: " << *UpdateV << "\n");		}
		}

		void LoopVectorizationLegality::collectLoopUniforms() {
		// We now know that the loop is vectorizable!
		// Collect variables that will remain uniform after vectorization.

		SetVector<Instruction *> Worklist;
		BasicBlock *Latch = TheLoop->getLoopLatch();
		// Start with the conditional branch.
		Instruction *Cmp =
		dyn_cast<Instruction>(Latch->getTerminator()->getOperand(0));
		if (Cmp && TheLoop->contains(Cmp))
		Worklist.insert(Cmp);

		// Also add all consecutive pointer values; these values will be uniform
		// after vectorization (and subsequent cleanup).
		for (auto *BB : TheLoop->getBlocks()) {
		for (auto &I : *BB) {
		if (I.getType()->isPointerTy() && isConsecutivePtr(&I))
		Worklist.insert(&I);
}		}
}		}

Uniforms.insert(Worklist.begin(), Worklist.end());		// Find dependent closure for the seed uniform instructions in Worklist.
		getDependentClosure(Worklist, TheLoop, this);

		for (auto &I : Worklist) {
		Uniforms.insert(I);
		DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
		}
}		}

bool LoopVectorizationLegality::canVectorizeMemory() {		bool LoopVectorizationLegality::canVectorizeMemory() {
LAI = &LAA->getInfo(TheLoop);		LAI = &LAA->getInfo(TheLoop);
InterleaveInfo.setLAI(LAI);		InterleaveInfo.setLAI(LAI);
auto &OptionalReport = LAI->getReport();		auto &OptionalReport = LAI->getReport();
if (OptionalReport)		if (OptionalReport)
emitAnalysis(VectorizationReport(*OptionalReport));		emitAnalysis(VectorizationReport(*OptionalReport));
(524 unchanged lines not shown)	for (Loop::block_iterator bb = TheLoop->block_begin(),
for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {		for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {
Type *T = it->getType();		Type *T = it->getType();

// Skip ignored values.		// Skip ignored values.
if (ValuesToIgnore.count(&*it))		if (ValuesToIgnore.count(&*it))
continue;		continue;

// Only examine Loads, Stores and PHINodes.		// Only examine Loads, Stores and PHINodes.
if (!isa<LoadInst>(it) && !isa<StoreInst>(it) && !isa<PHINode>(it))		if (!isa<LoadInst>(it) && !isa<StoreInst>(it) && !isa<PHINode>(it))
continue;		continue;

// Examine PHI nodes that are reduction variables. Update the type to		// Examine PHI nodes that are reduction variables. Update the type to
// account for the recurrence type.		// account for the recurrence type.
if (PHINode *PN = dyn_cast<PHINode>(it)) {		if (PHINode *PN = dyn_cast<PHINode>(it)) {
if (!Legal->isReductionVariable(PN))		if (!Legal->isReductionVariable(PN))
continue;		continue;
RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN];		RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN];
(807 unchanged lines not shown)	void LoopVectorizationCostModel::collectValuesToIgnore() {

// Ignore type-promoting instructions we identified during reduction		// Ignore type-promoting instructions we identified during reduction
// detection.		// detection.
for (auto &Reduction : *Legal->getReductionVars()) {		for (auto &Reduction : *Legal->getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}

// Ignore induction phis that are only used in either GetElementPtr or ICmp		SetVector<Instruction *> Worklist;
// instruction to exit loop. Induction variables usually have large types and		BasicBlock *Latch = TheLoop->getLoopLatch();
// can have big impact when estimating register usage.		// Loop compare instruction will not have vector version. Add it into
// This is for when VF > 1.		// Worklist as a seed instruction.
for (auto &Induction : *Legal->getInductionVars()) {		Instruction *Cmp =
auto *PN = Induction.first;		dyn_cast<Instruction>(Latch->getTerminator()->getOperand(0));
auto *UpdateV = PN->getIncomingValueForBlock(TheLoop->getLoopLatch());		if (Cmp && TheLoop->contains(Cmp))
		Worklist.insert(Cmp);

// Check that the PHI is only used by the induction increment (UpdateV) or		// Ptr Instructions used by consecutive load/stores or used by
// by GEPs. Then check that UpdateV is only used by a compare instruction,		// non-consecutive && non-gather/scatter load/stores will not have
// the loop header PHI, or by GEPs.		// vector versions. Add them into Worklist as seed instructions.
// FIXME: Need precise def-use analysis to determine if this instruction		for (auto *BB : TheLoop->getBlocks()) {
// variable will be vectorized.		for (auto &I : *BB) {
if (std::all_of(PN->user_begin(), PN->user_end(),		LoadInst *LI = dyn_cast<LoadInst>(&I);
[&](const User *U) -> bool {		StoreInst *SI = dyn_cast<StoreInst>(&I);
return U == UpdateV \|\| isa<GetElementPtrInst>(U);		if (!LI && !SI)
}) &&		continue;
std::all_of(UpdateV->user_begin(), UpdateV->user_end(),		Value *Ptr = SI ? SI->getPointerOperand() : LI->getPointerOperand();
[&](const User *U) -> bool {		Instruction *PI = dyn_cast<Instruction>(Ptr);
return U == PN \|\| isa<ICmpInst>(U) \|\|		if (PI && (Legal->isConsecutivePtr(PI) \|\|
isa<GetElementPtrInst>(U);		!isGatherOrScatterLegal(&I, PI, Legal)))
})) {		Worklist.insert(PI);
VecValuesToIgnore.insert(PN);
VecValuesToIgnore.insert(UpdateV);
}		}
}		}

// Ignore instructions that will not be vectorized.		// Find the dependent closure which will not have vector versions.
// This is for when VF > 1.		getDependentClosure(Worklist, TheLoop, Legal);
for (auto bb = TheLoop->block_begin(), be = TheLoop->block_end(); bb != be;
++bb) {
for (auto &Inst : **bb) {
switch (Inst.getOpcode())
case Instruction::GetElementPtr: {
// Ignore GEP if its last operand is an induction variable so that it is
// a consecutive load/store and won't be vectorized as scatter/gather
// pattern.

GetElementPtrInst *Gep = cast<GetElementPtrInst>(&Inst);		VecValuesToIgnore.insert(Worklist.begin(), Worklist.end());
unsigned NumOperands = Gep->getNumOperands();
unsigned InductionOperand = getGEPInductionOperand(Gep);
bool GepToIgnore = true;

// Check that all of the gep indices are uniform except for the
// induction operand.
for (unsigned i = 0; i != NumOperands; ++i) {
if (i != InductionOperand &&
!PSE.getSE()->isLoopInvariant(PSE.getSCEV(Gep->getOperand(i)),
TheLoop)) {
GepToIgnore = false;
break;
}
}

if (GepToIgnore)
VecValuesToIgnore.insert(&Inst);
break;
}
}
}
}		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore) {		bool IfPredicateStore) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

(107 unchanged lines not shown)

test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll

(37 unchanged lines not shown)	for.end: ; preds = %for.body3
%cmp = icmp slt i32 %inc11, %mul		%cmp = icmp slt i32 %inc11, %mul
br i1 %cmp, label %for.cond1.preheader, label %for.end12		br i1 %cmp, label %for.cond1.preheader, label %for.end12

for.end12: ; preds = %for.end, %entry		for.end12: ; preds = %for.end, %entry
ret i32 0		ret i32 0

; CHECK-LABEL: @s173		; CHECK-LABEL: @s173
; CHECK: load <4 x float>, <4 x float>*		; CHECK: load <4 x float>, <4 x float>*
; CHECK: add nsw i64 %1, 16000		; CHECK: add nsw i64 %index, 16000
; CHECK: ret i32 0		; CHECK: ret i32 0
}		}

attributes #0 = { nounwind }		attributes #0 = { nounwind }

test/Transforms/LoopVectorize/X86/reg-usage.ll

-; RUN: opt < %s -debug-only=loop-vectorize -loop-vectorize -vectorizer-maximize-bandwidth -O2 -S 2>&1 | FileCheck %s
+; RUN: opt < %s -debug-only=loop-vectorize -loop-vectorize -vectorizer-maximize-bandwidth -O2 -mtriple=x86_64-unknown-linux -S 2>&1 | FileCheck %s
+; RUN: opt < %s -debug-only=loop-vectorize -loop-vectorize -vectorizer-maximize-bandwidth -O2 -mtriple=x86_64-unknown-linux -mattr=+avx512f -S 2>&1 | FileCheck %s --check-prefix=AVX512F
 ; REQUIRES: asserts

-target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
-target triple = "x86_64-unknown-linux-gnu"

 @a = global [1024 x i8] zeroinitializer, align 16
 @b = global [1024 x i8] zeroinitializer, align 16

 define i32 @foo() {
 ; This function has a loop of SAD pattern. Here we check when VF = 16 the
 ; register usage doesn't exceed 16.
 ;
 ; CHECK-LABEL: foo
 for.body:
   %neg = sub nsw i32 0, %sub
   %2 = select i1 %ispos, i32 %sub, i32 %neg
   %add = add nsw i32 %2, %s.015
   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
   %exitcond = icmp eq i64 %indvars.iv.next, 1024
   br i1 %exitcond, label %for.cond.cleanup, label %for.body
 }

+define i32 @goo() {
mkuper: Could you please document what each of the two new tests actually tries to check?
wmi (Author): Comments added.
+; CHECK-LABEL: goo
+; CHECK: LV(REG): VF = 4
+; CHECK-NEXT: LV(REG): Found max usage: 4
+; CHECK: LV(REG): VF = 8
+; CHECK-NEXT: LV(REG): Found max usage: 7
+; CHECK: LV(REG): VF = 16
+; CHECK-NEXT: LV(REG): Found max usage: 13
+entry:
+  br label %for.body
+
+for.cond.cleanup: ; preds = %for.body
+  %add.lcssa = phi i32 [ %add, %for.body ]
+  ret i32 %add.lcssa
+
+for.body: ; preds = %for.body, %entry
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %s.015 = phi i32 [ 0, %entry ], [ %add, %for.body ]
+  %tmp1 = add nsw i64 %indvars.iv, 3
+  %arrayidx = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %tmp1
+  %tmp = load i8, i8* %arrayidx, align 1
+  %conv = zext i8 %tmp to i32
+  %tmp2 = add nsw i64 %indvars.iv, 2
+  %arrayidx2 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %tmp2
+  %tmp3 = load i8, i8* %arrayidx2, align 1
+  %conv3 = zext i8 %tmp3 to i32
+  %sub = sub nsw i32 %conv, %conv3
+  %ispos = icmp sgt i32 %sub, -1
+  %neg = sub nsw i32 0, %sub
+  %tmp4 = select i1 %ispos, i32 %sub, i32 %neg
+  %add = add nsw i32 %tmp4, %s.015
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp eq i64 %indvars.iv.next, 1024
+  br i1 %exitcond, label %for.cond.cleanup, label %for.body
+}

 define i64 @bar(i64* nocapture %a) {
 ; CHECK-LABEL: bar
 ; CHECK: LV(REG): VF = 2
 ; CHECK: LV(REG): Found max usage: 4
 ;
 entry:
   br label %for.body

 for.cond.cleanup:
   %add2.lcssa = phi i64 [ %add2, %for.body ]
   ret i64 %add2.lcssa

 for.body:
   %i.012 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
   %s.011 = phi i64 [ 0, %entry ], [ %add2, %for.body ]
   %arrayidx = getelementptr inbounds i64, i64* %a, i64 %i.012
   %0 = load i64, i64* %arrayidx, align 8
   %add = add nsw i64 %0, %i.012
   store i64 %add, i64* %arrayidx, align 8
   %add2 = add nsw i64 %add, %s.011
   %inc = add nuw nsw i64 %i.012, 1
   %exitcond = icmp eq i64 %inc, 1024
   br i1 %exitcond, label %for.cond.cleanup, label %for.body
 }

+@d = external global [0 x i64], align 8
+@e = external global [0 x i32], align 4
+@c = external global [0 x i32], align 4
+
+define void @hoo(i32 %n) {
+; AVX512F-LABEL: bar
+; AVX512F: LV(REG): VF = 16
+; AVX512F: LV(REG): Found max usage: 2
+;
+entry:
+  br label %for.body
+
+for.body: ; preds = %for.body, %entry
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds [0 x i64], [0 x i64]* @d, i64 0, i64 %indvars.iv
+  %tmp = load i64, i64* %arrayidx, align 8
+  %arrayidx1 = getelementptr inbounds [0 x i32], [0 x i32]* @e, i64 0, i64 %tmp
+  %tmp1 = load i32, i32* %arrayidx1, align 4
+  %arrayidx3 = getelementptr inbounds [0 x i32], [0 x i32]* @c, i64 0, i64 %indvars.iv
+  store i32 %tmp1, i32* %arrayidx3, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp eq i64 %indvars.iv.next, 10000
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end: ; preds = %for.body
+  ret void
+}
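The `LV(REG): Found max usage` lines checked in this test come from the vectorizer's register-pressure estimate. The sketch below is a simplified model of such an estimate, not LLVM's implementation: it assumes a live value with element width W needs ceil(VF * W / RegBits) vector registers (plain splitting, no type promotion), takes the maximum over program points of the summed usage, and skips values in an ignore set, as the patch does for values that stay scalar. `LiveValue`, `partsFor`, and `maxVectorRegUsage` are hypothetical names. For instance, a `<16 x i64>` value needs 16 * 64 / 512 = 2 registers at 512 bits, while a `<16 x i32>` value needs 4 registers at 128 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a loop value: scalar element width in bits and the
// half-open range [Start, End) of program points where it is live.
struct LiveValue {
  std::string Name;
  unsigned ElemBits;
  unsigned Start;
  unsigned End;
};

// Registers needed for a <VF x iElemBits> value, assuming it is simply split
// into RegBits-wide pieces (promotion and widening are not modelled).
unsigned partsFor(unsigned VF, unsigned ElemBits, unsigned RegBits) {
  return (VF * ElemBits + RegBits - 1) / RegBits;
}

// Maximum number of vector registers live at any program point. Values in
// Ignore are expected to stay scalar after vectorization, so they are skipped.
unsigned maxVectorRegUsage(const std::vector<LiveValue> &Values,
                           const std::set<std::string> &Ignore, unsigned VF,
                           unsigned RegBits) {
  unsigned LastPoint = 0;
  for (const LiveValue &V : Values)
    LastPoint = std::max(LastPoint, V.End);

  unsigned MaxUsage = 0;
  for (unsigned P = 0; P < LastPoint; ++P) {
    unsigned Usage = 0;
    for (const LiveValue &V : Values)
      if (V.Start <= P && P < V.End && Ignore.count(V.Name) == 0)
        Usage += partsFor(VF, V.ElemBits, RegBits);
    MaxUsage = std::max(MaxUsage, Usage);
  }
  return MaxUsage;
}
```

With the toy values `{{"i", 64, 0, 4}, {"i.plus.1", 64, 1, 3}, {"x", 32, 2, 4}}` at VF = 16 and 128-bit registers, the estimate is 20 registers; putting `i` and `i.plus.1` into the ignore set lowers it to 4. That gap is the kind of overestimate for index expressions that the summary describes.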

test/Transforms/LoopVectorize/reverse_induction.ll

 ; while ((reverse_induction) >= 0) {
 ; forward_induction++;
 ; a[reverse_induction] = forward_induction;
 ; --reverse_induction;
 ; }
 ; }

 ; CHECK-LABEL: @reverse_forward_induction_i64_i8(
-; CHECK: vector.body
 ; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
-; CHECK: %vec.ind = phi <4 x i64> [ <i64 1023, i64 1022, i64 1021, i64 1020>, %vector.ph ]
-; CHECK: %step.add = add <4 x i64> %vec.ind, <i64 -4, i64 -4, i64 -4, i64 -4>
-; CHECK: trunc i64 %index to i8
+; CHECK: %offset.idx = sub i64 1023, %index
+; CHECK: %[[a0:.+]] = add i64 %offset.idx, 0
+; CHECK: %[[v0:.+]] = insertelement <4 x i64> undef, i64 %[[a0]], i64 0
+; CHECK: %[[a1:.+]] = add i64 %offset.idx, -1
+; CHECK: %[[v1:.+]] = insertelement <4 x i64> %[[v0]], i64 %[[a1]], i64 1
+; CHECK: %[[a2:.+]] = add i64 %offset.idx, -2
+; CHECK: %[[v2:.+]] = insertelement <4 x i64> %[[v1]], i64 %[[a2]], i64 2
+; CHECK: %[[a3:.+]] = add i64 %offset.idx, -3
+; CHECK: %[[v3:.+]] = insertelement <4 x i64> %[[v2]], i64 %[[a3]], i64 3
+; CHECK: %[[a4:.+]] = add i64 %offset.idx, -4
+; CHECK: %[[v4:.+]] = insertelement <4 x i64> undef, i64 %[[a4]], i64 0
+; CHECK: %[[a5:.+]] = add i64 %offset.idx, -5
+; CHECK: %[[v5:.+]] = insertelement <4 x i64> %[[v4]], i64 %[[a5]], i64 1
+; CHECK: %[[a6:.+]] = add i64 %offset.idx, -6
+; CHECK: %[[v6:.+]] = insertelement <4 x i64> %[[v5]], i64 %[[a6]], i64 2
+; CHECK: %[[a7:.+]] = add i64 %offset.idx, -7
+; CHECK: %[[v7:.+]] = insertelement <4 x i64> %[[v6]], i64 %[[a7]], i64 3

 define void @reverse_forward_induction_i64_i8() {
 entry:
   br label %while.body

 while.body:
   %indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %while.body ]
   %forward_induction.05 = phi i8 [ 0, %entry ], [ %inc, %while.body ]
   %inc = add i8 %forward_induction.05, 1
   %conv = zext i8 %inc to i32
   %arrayidx = getelementptr inbounds [1024 x i32], [1024 x i32]* @a, i64 0, i64 %indvars.iv
   store i32 %conv, i32* %arrayidx, align 4
   %indvars.iv.next = add i64 %indvars.iv, -1
   %0 = trunc i64 %indvars.iv to i32
   %cmp = icmp sgt i32 %0, 0
   br i1 %cmp, label %while.body, label %while.end

 while.end:
   ret void
 }

 ; CHECK-LABEL: @reverse_forward_induction_i64_i8_signed(
-; CHECK: vector.body:
 ; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
-; CHECK: %vec.ind = phi <4 x i64> [ <i64 1023, i64 1022, i64 1021, i64 1020>, %vector.ph ]
-; CHECK: %step.add = add <4 x i64> %vec.ind, <i64 -4, i64 -4, i64 -4, i64 -4>
+; CHECK: %offset.idx = sub i64 1023, %index
+; CHECK: %[[a0:.+]] = add i64 %offset.idx, 0
+; CHECK: %[[v0:.+]] = insertelement <4 x i64> undef, i64 %[[a0]], i64 0
+; CHECK: %[[a1:.+]] = add i64 %offset.idx, -1
+; CHECK: %[[v1:.+]] = insertelement <4 x i64> %[[v0]], i64 %[[a1]], i64 1
+; CHECK: %[[a2:.+]] = add i64 %offset.idx, -2
+; CHECK: %[[v2:.+]] = insertelement <4 x i64> %[[v1]], i64 %[[a2]], i64 2
+; CHECK: %[[a3:.+]] = add i64 %offset.idx, -3
+; CHECK: %[[v3:.+]] = insertelement <4 x i64> %[[v2]], i64 %[[a3]], i64 3
+; CHECK: %[[a4:.+]] = add i64 %offset.idx, -4
+; CHECK: %[[v4:.+]] = insertelement <4 x i64> undef, i64 %[[a4]], i64 0
+; CHECK: %[[a5:.+]] = add i64 %offset.idx, -5
+; CHECK: %[[v5:.+]] = insertelement <4 x i64> %[[v4]], i64 %[[a5]], i64 1
+; CHECK: %[[a6:.+]] = add i64 %offset.idx, -6
+; CHECK: %[[v6:.+]] = insertelement <4 x i64> %[[v5]], i64 %[[a6]], i64 2
+; CHECK: %[[a7:.+]] = add i64 %offset.idx, -7
+; CHECK: %[[v7:.+]] = insertelement <4 x i64> %[[v6]], i64 %[[a7]], i64 3

 define void @reverse_forward_induction_i64_i8_signed() {
 entry:
   br label %while.body

 while.body:
   %indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %while.body ]
   %forward_induction.05 = phi i8 [ -127, %entry ], [ %inc, %while.body ]
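The updated CHECK lines in `reverse_induction.ll` expect the reverse induction to be rebuilt from the scalar `%index`: first `%offset.idx = sub i64 1023, %index`, then one scalar `add` and one `insertelement` per lane, across two `<4 x i64>` chains. Reading those two chains as two unrolled parts of VF 4 is an inference from the CHECK lines, since the RUN line is not part of this excerpt. The sketch below computes the lane values those chains encode; `reverseLanes` is a hypothetical helper, not code from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lane values of a reverse induction that starts at Start with step -1, for
// the vector loop counter Index. OffsetIdx mirrors
// "%offset.idx = sub i64 1023, %index"; part P, lane L then holds
// OffsetIdx - (P * VF + L). The split into Parts chunks of VF lanes is an
// inference from the CHECK lines, not something the test states.
template <std::size_t VF, std::size_t Parts>
std::array<std::array<std::int64_t, VF>, Parts>
reverseLanes(std::int64_t Start, std::int64_t Index) {
  const std::int64_t OffsetIdx = Start - Index;
  std::array<std::array<std::int64_t, VF>, Parts> Lanes{};
  for (std::size_t P = 0; P < Parts; ++P)
    for (std::size_t L = 0; L < VF; ++L)
      Lanes[P][L] = OffsetIdx - static_cast<std::int64_t>(P * VF + L);
  return Lanes;
}
```

For the first vector iteration (`%index` = 0), `reverseLanes<4, 2>(1023, 0)` yields lanes 1023 to 1020 and 1019 to 1016, the same values the removed CHECK lines expected from the old `%vec.ind` start vector `<i64 1023, i64 1022, i64 1021, i64 1020>` and its `-4` step.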

test/Transforms/LoopVectorize/reverse_iter.ll

 ; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s

 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
 target triple = "x86_64-apple-macosx10.8.0"

 ; Make sure that the reverse iterators are calculated using 64bit arithmetic, not 32.
 ;
 ; int foo(int n, int *A) {
 ; int sum;
 ; for (int i=n; i > 0; i--)
 ; sum += A[i*2];
 ; return sum;
 ; }
 ;

-;CHECK-LABEL: @foo(
-;CHECK: <i64 0, i64 -1, i64 -2, i64 -3>
+; CHECK-LABEL: @foo(
+; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+; CHECK: %offset.idx = sub i64 {{.*}}, %index
+; CHECK: %[[v0:.+]] = insertelement <4 x i64> undef, i64 %offset.idx, i64 0
+; CHECK: %[[a1:.+]] = add i64 %offset.idx, -1
+; CHECK: %[[v1:.+]] = insertelement <4 x i64> %[[v0]], i64 %[[a1]], i64 1
+; CHECK: %[[a2:.+]] = add i64 %offset.idx, -2
+; CHECK: %[[v2:.+]] = insertelement <4 x i64> %[[v1]], i64 %[[a2]], i64 2
+; CHECK: %[[a3:.+]] = add i64 %offset.idx, -3
+; CHECK: %[[v3:.+]] = insertelement <4 x i64> %[[v2]], i64 %[[a3]], i64 3
+; CHECK: %[[tv:.+]] = trunc <4 x i64> %[[v3]] to <4 x i32>
+; CHECK: %[[sv:.+]] = shl nsw <4 x i32> %[[tv]], <i32 1, i32 1, i32 1, i32 1>
-;CHECK: ret
+; CHECK: ret
 define i32 @foo(i32 %n, i32* nocapture %A) {
   %1 = icmp sgt i32 %n, 0
   br i1 %1, label %.lr.ph, label %._crit_edge

 .lr.ph: ; preds = %0
   %2 = sext i32 %n to i64
   br label %3

 ; <label>:3 ; preds = %.lr.ph, %3
   %indvars.iv = phi i64 [ %2, %.lr.ph ], [ %indvars.iv.next, %3 ]
   %sum.01 = phi i32 [ undef, %.lr.ph ], [ %9, %3 ]
   %4 = trunc i64 %indvars.iv to i32
   %5 = shl nsw i32 %4, 1
mssimpso: Hi Wei, the change to this test doesn't look right to me. Since indvars.iv feeds into the shl, why is it added to VecValuesToIgnore? The shift remains a vector computation. Am I missing something?
wmi (Author): Thanks for catching the problem. My assumption that the chain feeding into a "non-gather/scatter && non-consecutive" getelementptr will only have a scalar version is wrong. Will update the patch.
   %6 = sext i32 %5 to i64
   %7 = getelementptr inbounds i32, i32* %A, i64 %6
   %8 = load i32, i32* %7, align 4
   %9 = add nsw i32 %8, %sum.01
   %indvars.iv.next = add i64 %indvars.iv, -1
   %10 = trunc i64 %indvars.iv.next to i32
   %11 = icmp sgt i32 %10, 0
   br i1 %11, label %3, label %._crit_edge

 ._crit_edge: ; preds = %3, %0
   %sum.0.lcssa = phi i32 [ undef, %0 ], [ %9, %3 ]
   ret i32 %sum.0.lcssa
 }
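mssimpso's objection can be made concrete. In `@foo` of `reverse_iter.ll`, `%indvars.iv` reaches the GEP only through `trunc`, `shl`, and `sext`, and the updated CHECK lines show the shift as `shl nsw <4 x i32>`, so the induction variable has a vector user and should not go into `VecValuesToIgnore`. One conservative rule is to let a value become uniform only when every in-loop user is already uniform, iterating to a fixpoint from known seeds such as the exit compare. The sketch below applies that rule to a hypothetical string-keyed def-use graph of this loop; `Graph`, `collectUniforms`, and `fooLoop` are illustrative names, and this is not presented as the fix that landed in LLVM.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical def-use graph: maps each in-loop value to the values it uses.
using Graph = std::map<std::string, std::vector<std::string>>;

// Starting from Seeds, mark a value uniform only when every one of its
// in-loop users is already uniform; repeat until nothing changes.
std::set<std::string> collectUniforms(const Graph &Operands,
                                      const std::set<std::string> &Seeds) {
  // Invert the operand lists to get each value's users.
  std::map<std::string, std::vector<std::string>> Users;
  for (const auto &Entry : Operands)
    for (const std::string &Operand : Entry.second)
      Users[Operand].push_back(Entry.first);

  std::set<std::string> Uniform = Seeds;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &Entry : Users) {
      if (Uniform.count(Entry.first) != 0)
        continue;
      bool AllUsersUniform = true;
      for (const std::string &User : Entry.second)
        if (Uniform.count(User) == 0) {
          AllUsersUniform = false;
          break;
        }
      if (AllUsersUniform) {
        Uniform.insert(Entry.first);
        Changed = true;
      }
    }
  }
  return Uniform;
}

// The loop body of @foo above, with abbreviated value names.
Graph fooLoop() {
  return {
      {"iv", {"start", "iv.next"}},  // %indvars.iv (PHI)
      {"iv.next", {"iv"}},           // %indvars.iv.next
      {"t4", {"iv"}},                // %4 = trunc
      {"t5", {"t4"}},                // %5 = shl, stays a vector op
      {"t6", {"t5"}},                // %6 = sext
      {"gep", {"A", "t6"}},          // %7, stride-2 (non-consecutive) address
      {"load", {"gep"}},             // %8
      {"sum.next", {"load", "sum"}}, // %9
      {"t10", {"iv.next"}},          // %10 = trunc
      {"cmp", {"t10"}},              // %11, exit compare (seed)
  };
}
```

Seeded with `{"cmp"}`, the rule marks `t10` uniform but leaves `iv` and `t4` out, because the `shl` chain has non-uniform users. Note that the PHI `iv` and its increment `iv.next` use each other, so this naive rule never marks either of them uniform; a scheme like this one needs separate PHI handling, similar in spirit to the special case described in the summary.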