This is an archive of the discontinued LLVM Phabricator instance.

Differential D20474

when calculating RegUsages, ignore instructions which are uniformed after vectorization
ClosedPublic

Authored by wmi on May 20 2016, 9:43 AM.


Details

Reviewers

jmolloy
mkuper
hfinkel

Commits

rG79997a24d750: Recommit the patch "Use uniforms set to populate VecValuesToIgnore".
rG1fd25726afce: Use uniforms set to populate VecValuesToIgnore.
rL275912: Use uniforms set to populate VecValuesToIgnore.

Summary

This is a follow-up patch to http://reviews.llvm.org/D15177.

As Hal noted in http://reviews.llvm.org/D15177?id=41809#inline-126094, D15177 knows that an induction variable used only in GetElementPtr and ICmp instructions will not have a vectorized version and can be added to VecValuesToIgnore, but it does not consider the case where the induction variable is used by an add/sub before being used in a GetElementPtr.

For a loop like the one below:

char a[1000];
char b[1000];
for (long i = 0; i < N; i++)
  a[i] = b[i] * 6 + (b[i] + b[i + 1]) * 4 + b[i - 2] + b[i + 2];

When computing RegUsages for VF == 8 and VF == 16, the register usage estimation needs to know that array index expressions such as i, i + 1, i - 2 and i + 2 will not have vectorized versions after the loop is vectorized. Their live ranges should not be counted as vector register usage; otherwise the vector register pressure is likely to be overestimated.
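The effect of ignoring scalar index values can be sketched with a toy model. This is not the LoopVectorize implementation: `LiveRange` and `maxVectorRegUsage` are hypothetical names, and live ranges are given by hand as positions in a linearized loop body rather than derived from IR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model, not LLVM code: a value is live from the position that defines
// it (Def) through the position of its last use (LastUse).
struct LiveRange {
  std::string Name;
  std::size_t Def;
  std::size_t LastUse;
};

// Returns the maximum number of simultaneously live values, skipping every
// value named in Ignore (values assumed to stay scalar after vectorization).
std::size_t maxVectorRegUsage(const std::vector<LiveRange> &Ranges,
                              const std::set<std::string> &Ignore) {
  std::size_t End = 0;
  for (const LiveRange &R : Ranges) {
    assert(R.Def <= R.LastUse && "live range must not end before it starts");
    End = std::max(End, R.LastUse);
  }
  std::size_t Max = 0;
  for (std::size_t Point = 0; Point <= End; ++Point) {
    std::size_t Live = 0;
    for (const LiveRange &R : Ranges)
      if (!Ignore.count(R.Name) && R.Def <= Point && Point <= R.LastUse)
        ++Live;
    Max = std::max(Max, Live);
  }
  return Max;
}
```

With three index values and two loaded values overlapping, the estimate counts all of them unless the index values are placed in the ignore set, in which case only the loaded values contribute.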

The patch adds instructions for which isUniformAfterVectorization returns true to the VecValuesToIgnore set. PHI instructions need special handling because collectLoopUniforms() never includes them in the Uniforms set: when the result of a PHI is used only in GetElementPtr or uniform instructions, the PHI is added to the VecValuesToIgnore set as well.

A further follow-up patch is planned: if the estimated vector register usage for a given VF exceeds the number of available hardware vector registers, don't simply reject that VF. Instead, add the extra spill cost to the VectorCost. If the total VectorCost of that VF is still the lowest, the VF is still worth trying.
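A minimal sketch of that planned policy, under stated assumptions: `VFCandidate`, `selectVF`, the per-register `SpillCost`, and the division by VF to compare candidates are all hypothetical and only illustrate the idea of charging for spills instead of rejecting a VF outright.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical per-VF estimates; none of these names exist in LLVM.
struct VFCandidate {
  unsigned VF;       // vectorization factor
  double LoopCost;   // estimated cost of one vector iteration, without spills
  unsigned RegUsage; // estimated maximum number of live vector registers
};

// Picks the VF with the lowest cost per scalar iteration. A VF whose register
// usage exceeds NumRegs is not rejected; it is charged SpillCost for every
// register beyond the limit.
unsigned selectVF(const std::vector<VFCandidate> &Cands, unsigned NumRegs,
                  double SpillCost) {
  assert(!Cands.empty() && "need at least one candidate");
  unsigned Best = Cands.front().VF;
  double BestCost = std::numeric_limits<double>::infinity();
  for (const VFCandidate &C : Cands) {
    assert(C.VF > 0 && "VF must be positive");
    unsigned Excess = C.RegUsage > NumRegs ? C.RegUsage - NumRegs : 0;
    double PerLane = (C.LoopCost + Excess * SpillCost) / C.VF;
    if (PerLane < BestCost) {
      BestCost = PerLane;
      Best = C.VF;
    }
  }
  return Best;
}
```

With cheap spills the widest VF can still win despite exceeding the register budget; with expensive spills the choice falls back to a narrower VF.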

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 57942. May 20 2016, 9:43 AM

wmi retitled this revision from to when calculating RegUsages, ignore instructions which are uniformed after vectorization.

wmi updated this object.

wmi added reviewers: hfinkel, jmolloy.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: llvm-commits, davidxl, congh, mkuper.

Herald added a subscriber: mzolotukhin. May 20 2016, 9:43 AM

Ping.

mkuper added inline comments. Jun 6 2016, 12:50 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6497–6498	This isn't directly related to this patch - but wouldn't this be true only for consecutive GEPs? (e.g. see D20789)

mkuper added inline comments. Jun 7 2016, 6:30 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5663	Another side point - should we remove this? Looking at http://reviews.llvm.org/rL172178, the reason that we only look at loads, stores, and PHIs is that "We don't have a detailed analysis on which values are vectorized and which stay scalars in the vectorized loop so we use another method. We look at reduction variables, loads and stores, which are the only ways to get information in and out of loop iterations". That was true at the time, but since then we've gained a precise way of knowing which instructions are uniform, and with this patch will actually use that for ValuesToIgnore. So this check will now only miss instructions that ought to be taken into account, right?

wmi added inline comments. Jun 8 2016, 2:39 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5663	Yes, it makes sense to remove it.
6497–6498	Every GEP (whether consecutive or not) will be scalarized. That is independent of the load/store using the GEP. If the induction variable is only used in GEPs, it will not be vectorized, right?

mkuper added inline comments. Jun 8 2016, 3:00 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6497–6498	I don't think so - as far as I know, we should be creating vector GEPs for scatter/gather when it's profitable on the target. (I think the only target that supports it right now is AVX-512.)

wmi added inline comments. Jun 8 2016, 5:09 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6497–6498	I see. If a[3i] is vectorized using gather/scatter, it needs a vectorized version of 3i, so it is probably better to generate a vectorized version of i. Then i shouldn't be added to VecValuesToIgnore. Thanks.

Add nonconsecutive pointer values and their dependencies to the uniform-after-vectorization set in collectLoopUniforms if gather/scatter is not supported. The related code in collectValuesToIgnore is removed.

The added test hoo ensures that the pointer of a load for which gather is possible is not added to the VecValuesToIgnore set.

mkuper added inline comments. Jun 22 2016, 4:43 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5064	Are you sure about this? Nonconsecutive pointer values when there is no gather/scatter will be scalarized, but they aren't uniform. So I'm not sure we should be counting them as uniform. This will work correctly for your new use of isUniformAfterVectorization() (since we really don't need vector registers in either case). But I think it may do the wrong thing for the existing use, in getInstructionCost(). We shouldn't be evaluating the cost of non-consecutive loads/stores as if they are a single scalar load/store. Am I confused?

wmi added inline comments. Jun 22 2016, 5:45 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5064	Ah, you are right. I misunderstood what uniform means here. Will fix it.
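The distinction drawn in this exchange (uniform versus scalarized versus gathered) can be sketched as a small classification. `PtrKind`, `PtrTreatment`, and `classifyPointer` are hypothetical names for illustration; the sketch ignores unrolling and only models what the comments above describe.

```cpp
#include <cassert>

enum class PtrKind { Consecutive, NonConsecutive };

// How a pointer feeding a load/store is materialized in the vector loop.
struct PtrTreatment {
  bool NeedsVectorReg;   // a vector of addresses is kept in a vector register
  unsigned ScalarCopies; // scalar address computations per vector iteration
};

PtrTreatment classifyPointer(PtrKind Kind, bool GatherScatterLegal,
                             unsigned VF) {
  assert(VF >= 1 && "VF must be at least 1");
  if (Kind == PtrKind::Consecutive)
    return {false, 1};  // uniform: one scalar address feeds one wide access
  if (GatherScatterLegal)
    return {true, 0};   // vector GEP feeding a gather/scatter
  return {false, VF};   // scalarized: VF scalar addresses, so not uniform
}
```

For register pressure, the uniform and scalarized cases look alike (no vector register is needed), which is why the new use of isUniformAfterVectorization() works. For cost, they differ by a factor of VF, which is the concern raised about getInstructionCost().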

mssimpso added a subscriber: mssimpso. Jun 23 2016, 12:20 PM

mssimpso added inline comments. Jun 23 2016, 2:38 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6531–6532	Hi Wei, I'm joining this review a bit late, so I apologize if I'm not quite up-to-speed yet. But I'm not sure I follow this. Please correct me if I'm missing something! If I take the following test case:

define void @test(i32* %a, i64 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
  %0 = trunc i64 %i to i32
  %1 = getelementptr inbounds i32, i32* %a, i32 %0
  store i32 %0, i32* %1, align 4
  %i.next = add nuw nsw i64 %i, 1
  %cond = icmp eq i64 %i.next, %n
  br i1 %cond, label %for.end, label %for.body

for.end:
  ret void
}

We generate vectorized induction variables so that for the store, we have:

  store <4 x i32> %vec.ind1, <4 x i32>* %3, align 4

However, with your change, it looks to me like we will add the induction variable to VecValuesToIgnore where previously we wouldn't have. Is this right?

wmi added inline comments. Jun 23 2016, 3:07 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6531–6532	Thanks for pointing out the problem. For your test case, %i is only used in the instruction %0 = ..., which is a uniform instruction, so it will be put into VecValuesToIgnore. However, it may be problematic for %0 to be treated as uniform, because it actually has both a scalar and a vector version after vectorization. I will take your test case into consideration. Thanks.

mkuper added inline comments. Jun 23 2016, 4:32 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6531–6532	So the problem isn't here, it's in collectLoopUniforms(), right? That is, the problem is that we're tracing back from the consecutive pointer, and adding all of the operands of the store to the worklist, instead of just the pointer operand?

wmi added inline comments. Jun 23 2016, 4:52 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6531–6532	It looks like all of the operands of the store should be added to the worklist, but only those operands not used in any non-uniform instruction should be regarded as uniform. That requires the collectLoopUniforms algorithm to check the values in the worklist for uniformity in some topological order. Do you think that is the right way to go for collectLoopUniforms?

mkuper added inline comments. Jun 23 2016, 5:16 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6531–6532	You're right, I got confused. We're not tracing back from the store anyway, it's from the use of %0 in the GEP, I didn't read your original comment correctly. The real problem, as you say, is that we're assuming that every operand of a uniform instruction must be uniform, but it simply isn't true in our current definition of "uniform".

Moved the change to collectLoopUniforms into a separate patch: http://reviews.llvm.org/D21755

Extracted the major part of collectLoopUniforms into a helper function, getDependentClosure, so that collectValuesToIgnore can reuse it. For collectLoopUniforms, only the loop compare and the consecutive pointers of loads/stores are the seed uniform instructions in the Worklist.
For collectValuesToIgnore, the loop compare, consecutive pointers, and non-consecutive pointers that cannot use gather/scatter are the seed non-vector instructions in the Worklist.
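The body of getDependentClosure is not shown in this review, so the following is only a sketch of one way a seed-driven closure could work, assuming the rule discussed above: an operand joins the set only once all of its in-loop users are already in it. The string-keyed graph and the name `dependentClosure` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy def-use graph: maps each in-loop value to the in-loop values it uses.
using Operands = std::map<std::string, std::vector<std::string>>;

// Grows Seeds backwards along operand edges. An operand is added only when
// every one of its users is already in the closure.
std::set<std::string> dependentClosure(const Operands &Ops,
                                       const std::set<std::string> &Seeds) {
  // Invert the operand map to get the users of each value.
  std::map<std::string, std::vector<std::string>> Users;
  for (const auto &Entry : Ops)
    for (const std::string &Op : Entry.second)
      Users[Op].push_back(Entry.first);

  std::set<std::string> Closure = Seeds;
  std::vector<std::string> Worklist(Seeds.begin(), Seeds.end());
  while (!Worklist.empty()) {
    std::string V = Worklist.back();
    Worklist.pop_back();
    auto It = Ops.find(V);
    if (It == Ops.end())
      continue;
    for (const std::string &Op : It->second) {
      // Skip values already collected and values defined outside the loop.
      if (Closure.count(Op) || !Ops.count(Op))
        continue;
      bool AllUsersIn = true;
      for (const std::string &U : Users[Op])
        if (!Closure.count(U))
          AllUsersIn = false;
      if (AllUsersIn) {
        Closure.insert(Op);
        Worklist.push_back(Op);
      }
    }
  }
  return Closure;
}
```

Applied to the test case from the inline comments, seeding with the loop compare and the GEP leaves the trunc out of the closure because the store also uses it as a value. The induction PHI and its update never enter the closure either, since each is a user of the other, which is consistent with the patch handling induction PHIs separately.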

Herald added a subscriber: nemanjai. Jul 6 2016, 9:27 PM

wmi added a reviewer: mkuper. Jul 6 2016, 9:29 PM

mssimpso added inline comments. Jul 7 2016, 10:06 AM

test/Transforms/LoopVectorize/reverse_iter.ll
38–41 ↗	(On Diff #63026)	Hi Wei, The change to this test doesn't look right to me. Since indvars.iv feeds into the shl, why is it added to VecValuesToIgnore? The shift remains as vector computation. Am I missing something?

wmi added inline comments. Jul 7 2016, 10:57 AM

test/Transforms/LoopVectorize/reverse_iter.ll
38–41 ↗	(On Diff #63026)	Thanks for catching the problem. My assumption that the chain feeding into a "non-gather/scatter && non-consecutive" getelementptr will only have a scalar version is wrong. Will update the patch.

Fix the problem pointed out by Matthew. When the induction variable is only used by uniform instructions or by non-consecutive pointer instructions that cannot use gather/scatter, the related phi and update are added to the VecValuesToIgnore set.
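The rule in this update can be sketched on a toy def-use model that mirrors the shape of the two std::all_of checks in the diff. The string-based sets and the names `Induction` and `canIgnoreInduction` are hypothetical stand-ins for the LLVM data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using ValueSet = std::set<std::string>;

// Toy description of one induction variable and its direct users.
struct Induction {
  std::string Phi;
  std::string Update;
  std::vector<std::string> PhiUsers;
  std::vector<std::string> UpdateUsers;
};

// True if neither the phi nor its update needs a vector version: every user
// is the other half of the induction, lies outside the loop, is uniform, or
// is a non-consecutive pointer that cannot use gather/scatter.
bool canIgnoreInduction(const Induction &IV, const ValueSet &InLoop,
                        const ValueSet &Uniform,
                        const ValueSet &NonConsecutivePtr) {
  assert(IV.Phi != IV.Update && "phi and update must be distinct values");
  auto StaysScalar = [&](const std::string &U) {
    return !InLoop.count(U) || Uniform.count(U) || NonConsecutivePtr.count(U);
  };
  return std::all_of(IV.PhiUsers.begin(), IV.PhiUsers.end(),
                     [&](const std::string &U) {
                       return U == IV.Update || StaysScalar(U);
                     }) &&
         std::all_of(IV.UpdateUsers.begin(), IV.UpdateUsers.end(),
                     [&](const std::string &U) {
                       return U == IV.Phi || StaysScalar(U);
                     });
}
```

An induction variable that only feeds address arithmetic and the exit compare passes the check, while one that is also added to a loaded value does not, because that add needs a vector operand.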

Sorry, I lost track of this patch.

LGTM, modulo a couple of nits.

lib/Transforms/Vectorize/LoopVectorize.cpp
6505	I was sure we already had a common LI/SI getPointerOperand() helper. Turns out we have (at least!) 6, in: LoopAccessAnalysis, EarlyCSE, DependenceAnalysis, PPCLoopPreIncPrep, Delinearization, and LoadStoreVectorizer. I'm going to refactor this into a common helper somewhere in utils, but can you hoist this into another local helper? It'll be easier for me to keep track of, in case I land after you do. (If you prefer to do the refactoring yourself, let me know. :-) )
test/Transforms/LoopVectorize/X86/reg-usage.ll
46	Could you please document what each of the two new tests actually tries to check?

This revision is now accepted and ready to land. Jul 15 2016, 4:50 PM

Michael, thanks for the review.

lib/Transforms/Vectorize/LoopVectorize.cpp
6505	Added a local helper for it.
test/Transforms/LoopVectorize/X86/reg-usage.ll
46	Comments added.

Addressed Michael's comments.

LGTM

Closed by commit rL275912: Use uniforms set to populate VecValuesToIgnore. (authored by wmi). Jul 18 2016, 2:07 PM

This revision was automatically updated to reflect the committed changes.

samparker mentioned this in D23509: [LoopVectorize] Query TTI when deciding to splat IV. Aug 15 2016, 6:46 AM

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	84 lines
test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll	2 lines
test/Transforms/LoopVectorize/X86/reg-usage.ll	76 lines
test/Transforms/LoopVectorize/reverse_induction.ll	43 lines

Diff 64343

lib/Transforms/Vectorize/LoopVectorize.cpp


[...]	void LoopVectorizationLegality::collectLoopUniforms() {
// Start with the conditional branch.		// Start with the conditional branch.
if (!isOutOfScope(Latch->getTerminator()->getOperand(0))) {		if (!isOutOfScope(Latch->getTerminator()->getOperand(0))) {
Instruction *Cmp = cast<Instruction>(Latch->getTerminator()->getOperand(0));		Instruction *Cmp = cast<Instruction>(Latch->getTerminator()->getOperand(0));
Worklist.insert(Cmp);		Worklist.insert(Cmp);
DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n");		DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n");
}		}

// Also add all consecutive pointer values; these values will be uniform		// Also add all consecutive pointer values; these values will be uniform
// after vectorization (and subsequent cleanup).		// after vectorization (and subsequent cleanup).
for (auto *BB : TheLoop->getBlocks()) {		for (auto *BB : TheLoop->getBlocks()) {
for (auto &I : *BB) {		for (auto &I : *BB) {
if (I.getType()->isPointerTy() && isConsecutivePtr(&I)) {		if (I.getType()->isPointerTy() && isConsecutivePtr(&I)) {
Worklist.insert(&I);		Worklist.insert(&I);
DEBUG(dbgs() << "LV: Found uniform instruction: " << I << "\n");		DEBUG(dbgs() << "LV: Found uniform instruction: " << I << "\n");
}		}
}		}
}		}
[...]	for (Loop::block_iterator bb = TheLoop->block_begin(),
for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {		for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {
Type *T = it->getType();		Type *T = it->getType();

// Skip ignored values.		// Skip ignored values.
if (ValuesToIgnore.count(&*it))		if (ValuesToIgnore.count(&*it))
continue;		continue;

// Only examine Loads, Stores and PHINodes.		// Only examine Loads, Stores and PHINodes.
if (!isa<LoadInst>(it) && !isa<StoreInst>(it) && !isa<PHINode>(it))		if (!isa<LoadInst>(it) && !isa<StoreInst>(it) && !isa<PHINode>(it))
continue;		continue;

// Examine PHI nodes that are reduction variables. Update the type to		// Examine PHI nodes that are reduction variables. Update the type to
// account for the recurrence type.		// account for the recurrence type.
if (PHINode *PN = dyn_cast<PHINode>(it)) {		if (PHINode *PN = dyn_cast<PHINode>(it)) {
if (!Legal->isReductionVariable(PN))		if (!Legal->isReductionVariable(PN))
continue;		continue;
RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN];		RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN];
[...]	bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) {

// Check for a load.		// Check for a load.
if (LoadInst *LI = dyn_cast<LoadInst>(Inst))		if (LoadInst *LI = dyn_cast<LoadInst>(Inst))
return Legal->isConsecutivePtr(LI->getPointerOperand()) != 0;		return Legal->isConsecutivePtr(LI->getPointerOperand()) != 0;

return false;		return false;
}		}

		/// Take the pointer operand from the Load/Store instruction.
		/// Returns NULL if this is not a valid Load/Store instruction.
		static Value *getPointerOperand(Value *I) {
		if (LoadInst *LI = dyn_cast<LoadInst>(I))
		return LI->getPointerOperand();
		if (StoreInst *SI = dyn_cast<StoreInst>(I))
		return SI->getPointerOperand();
		return nullptr;
		}

void LoopVectorizationCostModel::collectValuesToIgnore() {		void LoopVectorizationCostModel::collectValuesToIgnore() {
// Ignore ephemeral values.		// Ignore ephemeral values.
CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore);		CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore);

// Ignore type-promoting instructions we identified during reduction		// Ignore type-promoting instructions we identified during reduction
// detection.		// detection.
for (auto &Reduction : *Legal->getReductionVars()) {		for (auto &Reduction : *Legal->getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}

// Ignore induction phis that are only used in either GetElementPtr or ICmp		// Insert uniform instruction into VecValuesToIgnore.
// instruction to exit loop. Induction variables usually have large types and		// Collect non-gather/scatter and non-consecutive ptr in NonConsecutivePtr.
// can have big impact when estimating register usage.		SmallPtrSet<Instruction *, 8> NonConsecutivePtr;
// This is for when VF > 1.		for (auto *BB : TheLoop->getBlocks()) {
		for (auto &I : *BB) {
		if (Legal->isUniformAfterVectorization(&I))
		VecValuesToIgnore.insert(&I);
		Instruction *PI = dyn_cast_or_null<Instruction>(getPointerOperand(&I));
		if (PI && !Legal->isConsecutivePtr(PI) &&
		!isGatherOrScatterLegal(&I, PI, Legal))
		NonConsecutivePtr.insert(PI);
		}
		}

		// Ignore induction phis that are either used in uniform instructions or
		// NonConsecutivePtr.
for (auto &Induction : *Legal->getInductionVars()) {		for (auto &Induction : *Legal->getInductionVars()) {
auto *PN = Induction.first;		auto *PN = Induction.first;
auto *UpdateV = PN->getIncomingValueForBlock(TheLoop->getLoopLatch());		auto *UpdateV = PN->getIncomingValueForBlock(TheLoop->getLoopLatch());

// Check that the PHI is only used by the induction increment (UpdateV) or
// by GEPs. Then check that UpdateV is only used by a compare instruction,
// the loop header PHI, or by GEPs.
// FIXME: Need precise def-use analysis to determine if this instruction
// variable will be vectorized.
if (std::all_of(PN->user_begin(), PN->user_end(),		if (std::all_of(PN->user_begin(), PN->user_end(),
[&](const User *U) -> bool {		[&](User *U) -> bool {
return U == UpdateV || isa<GetElementPtrInst>(U);		Instruction *UI = dyn_cast<Instruction>(U);
		return U == UpdateV || !TheLoop->contains(UI) ||
		Legal->isUniformAfterVectorization(UI) ||
		NonConsecutivePtr.count(UI);
}) &&		}) &&
std::all_of(UpdateV->user_begin(), UpdateV->user_end(),		std::all_of(UpdateV->user_begin(), UpdateV->user_end(),
[&](const User *U) -> bool {		[&](User *U) -> bool {
return U == PN || isa<ICmpInst>(U) ||		Instruction *UI = dyn_cast<Instruction>(U);
isa<GetElementPtrInst>(U);		return U == PN || !TheLoop->contains(UI) ||
		Legal->isUniformAfterVectorization(UI) ||
		NonConsecutivePtr.count(UI);
})) {		})) {
VecValuesToIgnore.insert(PN);		VecValuesToIgnore.insert(PN);
VecValuesToIgnore.insert(UpdateV);		VecValuesToIgnore.insert(UpdateV);
}		}
}		}

// Ignore instructions that will not be vectorized.
// This is for when VF > 1.
for (auto bb = TheLoop->block_begin(), be = TheLoop->block_end(); bb != be;
++bb) {
for (auto &Inst : **bb) {
switch (Inst.getOpcode())
case Instruction::GetElementPtr: {
// Ignore GEP if its last operand is an induction variable so that it is
// a consecutive load/store and won't be vectorized as scatter/gather
// pattern.

GetElementPtrInst *Gep = cast<GetElementPtrInst>(&Inst);
unsigned NumOperands = Gep->getNumOperands();
unsigned InductionOperand = getGEPInductionOperand(Gep);
bool GepToIgnore = true;

// Check that all of the gep indices are uniform except for the
// induction operand.
for (unsigned i = 0; i != NumOperands; ++i) {
if (i != InductionOperand &&
!PSE.getSE()->isLoopInvariant(PSE.getSCEV(Gep->getOperand(i)),
TheLoop)) {
GepToIgnore = false;
break;
}
}

if (GepToIgnore)
VecValuesToIgnore.insert(&Inst);
break;
}
}
}
}		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore) {		bool IfPredicateStore) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

[...]

test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll

[...]	for.end: ; preds = %for.body3
%cmp = icmp slt i32 %inc11, %mul		%cmp = icmp slt i32 %inc11, %mul
br i1 %cmp, label %for.cond1.preheader, label %for.end12		br i1 %cmp, label %for.cond1.preheader, label %for.end12

for.end12: ; preds = %for.end, %entry		for.end12: ; preds = %for.end, %entry
ret i32 0		ret i32 0

; CHECK-LABEL: @s173		; CHECK-LABEL: @s173
; CHECK: load <4 x float>, <4 x float>*		; CHECK: load <4 x float>, <4 x float>*
; CHECK: add nsw i64 %1, 16000		; CHECK: add nsw i64 %index, 16000
; CHECK: ret i32 0		; CHECK: ret i32 0
}		}

attributes #0 = { nounwind }		attributes #0 = { nounwind }

test/Transforms/LoopVectorize/X86/reg-usage.ll

; RUN: opt < %s -debug-only=loop-vectorize -loop-vectorize -vectorizer-maximize-bandwidth -O2 -S 2>&1 | FileCheck %s		; RUN: opt < %s -debug-only=loop-vectorize -loop-vectorize -vectorizer-maximize-bandwidth -O2 -mtriple=x86_64-unknown-linux -S 2>&1 | FileCheck %s
		; RUN: opt < %s -debug-only=loop-vectorize -loop-vectorize -vectorizer-maximize-bandwidth -O2 -mtriple=x86_64-unknown-linux -mattr=+avx512f -S 2>&1 | FileCheck %s --check-prefix=AVX512F
; REQUIRES: asserts		; REQUIRES: asserts

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@a = global [1024 x i8] zeroinitializer, align 16		@a = global [1024 x i8] zeroinitializer, align 16
@b = global [1024 x i8] zeroinitializer, align 16		@b = global [1024 x i8] zeroinitializer, align 16

define i32 @foo() {		define i32 @foo() {
; This function has a loop of SAD pattern. Here we check when VF = 16 the		; This function has a loop of SAD pattern. Here we check when VF = 16 the
; register usage doesn't exceed 16.		; register usage doesn't exceed 16.
;		;
; CHECK-LABEL: foo		; CHECK-LABEL: foo
[...]	for.body:
%neg = sub nsw i32 0, %sub		%neg = sub nsw i32 0, %sub
%2 = select i1 %ispos, i32 %sub, i32 %neg		%2 = select i1 %ispos, i32 %sub, i32 %neg
%add = add nsw i32 %2, %s.015		%add = add nsw i32 %2, %s.015
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1024		%exitcond = icmp eq i64 %indvars.iv.next, 1024
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

		define i32 @goo() {
		; For indvars.iv used in a computating chain only feeding into getelementptr or cmp,
		; it will not have vector version and the vector register usage will not exceed the
		; available vector register number.
		; CHECK-LABEL: goo
		; CHECK: LV(REG): VF = 4
		; CHECK-NEXT: LV(REG): Found max usage: 4
		; CHECK: LV(REG): VF = 8
		; CHECK-NEXT: LV(REG): Found max usage: 7
		; CHECK: LV(REG): VF = 16
		; CHECK-NEXT: LV(REG): Found max usage: 13
		entry:
		br label %for.body

		for.cond.cleanup: ; preds = %for.body
		%add.lcssa = phi i32 [ %add, %for.body ]
		ret i32 %add.lcssa

		for.body: ; preds = %for.body, %entry
		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
		%s.015 = phi i32 [ 0, %entry ], [ %add, %for.body ]
		%tmp1 = add nsw i64 %indvars.iv, 3
		%arrayidx = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %tmp1
		%tmp = load i8, i8* %arrayidx, align 1
		%conv = zext i8 %tmp to i32
		%tmp2 = add nsw i64 %indvars.iv, 2
		%arrayidx2 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %tmp2
		%tmp3 = load i8, i8* %arrayidx2, align 1
		%conv3 = zext i8 %tmp3 to i32
		%sub = sub nsw i32 %conv, %conv3
		%ispos = icmp sgt i32 %sub, -1
		%neg = sub nsw i32 0, %sub
		%tmp4 = select i1 %ispos, i32 %sub, i32 %neg
		%add = add nsw i32 %tmp4, %s.015
		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
		%exitcond = icmp eq i64 %indvars.iv.next, 1024
		br i1 %exitcond, label %for.cond.cleanup, label %for.body
		}

define i64 @bar(i64* nocapture %a) {		define i64 @bar(i64* nocapture %a) {
; CHECK-LABEL: bar		; CHECK-LABEL: bar
; CHECK: LV(REG): VF = 2		; CHECK: LV(REG): VF = 2
; CHECK: LV(REG): Found max usage: 4		; CHECK: LV(REG): Found max usage: 4
;		;
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
%add2.lcssa = phi i64 [ %add2, %for.body ]		%add2.lcssa = phi i64 [ %add2, %for.body ]
ret i64 %add2.lcssa		ret i64 %add2.lcssa

for.body:		for.body:
%i.012 = phi i64 [ 0, %entry ], [ %inc, %for.body ]		%i.012 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
%s.011 = phi i64 [ 0, %entry ], [ %add2, %for.body ]		%s.011 = phi i64 [ 0, %entry ], [ %add2, %for.body ]
%arrayidx = getelementptr inbounds i64, i64* %a, i64 %i.012		%arrayidx = getelementptr inbounds i64, i64* %a, i64 %i.012
%0 = load i64, i64* %arrayidx, align 8		%0 = load i64, i64* %arrayidx, align 8
%add = add nsw i64 %0, %i.012		%add = add nsw i64 %0, %i.012
store i64 %add, i64* %arrayidx, align 8		store i64 %add, i64* %arrayidx, align 8
%add2 = add nsw i64 %add, %s.011		%add2 = add nsw i64 %add, %s.011
%inc = add nuw nsw i64 %i.012, 1		%inc = add nuw nsw i64 %i.012, 1
%exitcond = icmp eq i64 %inc, 1024		%exitcond = icmp eq i64 %inc, 1024
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

		@d = external global [0 x i64], align 8
		@e = external global [0 x i32], align 4
		@c = external global [0 x i32], align 4

		define void @hoo(i32 %n) {
		; For c[i] = e[d[i]] in the loop, e[d[i]] is not consecutive but its index %tmp can
		; be gathered into a vector. For VF == 16, the vector version of %tmp will be <16 x i64>
		; so the max usage of AVX512 vector register will be 2.
		; AVX512F-LABEL: hoo
		; AVX512F: LV(REG): VF = 16
		; AVX512F: LV(REG): Found max usage: 2
		;
		entry:
		br label %for.body

		for.body: ; preds = %for.body, %entry
		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
		%arrayidx = getelementptr inbounds [0 x i64], [0 x i64]* @d, i64 0, i64 %indvars.iv
		%tmp = load i64, i64* %arrayidx, align 8
		%arrayidx1 = getelementptr inbounds [0 x i32], [0 x i32]* @e, i64 0, i64 %tmp
		%tmp1 = load i32, i32* %arrayidx1, align 4
		%arrayidx3 = getelementptr inbounds [0 x i32], [0 x i32]* @c, i64 0, i64 %indvars.iv
		store i32 %tmp1, i32* %arrayidx3, align 4
		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
		%exitcond = icmp eq i64 %indvars.iv.next, 10000
		br i1 %exitcond, label %for.end, label %for.body

		for.end: ; preds = %for.body
		ret void
		}
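The comment in @hoo states that a <16 x i64> index vector occupies two AVX512 registers. The arithmetic behind that figure can be sketched as below. This is only an illustration: `vectorRegsNeeded` is a hypothetical helper, not a function in LoopVectorize, and it assumes that register usage is simply the vector's total bit width divided by the register width, rounded up, with a minimum of one register.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): number of vector registers needed to
// hold one value of `vf` elements, each `elemBits` wide, when a single vector
// register is `regBits` wide. Rounds up and never returns less than 1.
inline unsigned vectorRegsNeeded(unsigned vf, unsigned elemBits,
                                 unsigned regBits) {
  assert(vf != 0 && elemBits != 0 && regBits != 0 && "sizes must be non-zero");
  const unsigned totalBits = vf * elemBits;
  if (totalBits <= regBits)
    return 1;                                   // fits in a single register
  return (totalBits + regBits - 1) / regBits;   // ceiling division
}
```

Under these assumptions, `vectorRegsNeeded(16, 64, 512)` gives 2 for the gathered i64 index %tmp, while the i32 data loaded from @e fits in one register (`vectorRegsNeeded(16, 32, 512)` gives 1).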

test/Transforms/LoopVectorize/reverse_induction.ll

	; while ((reverse_induction) >= 0) {			; while ((reverse_induction) >= 0) {
	; forward_induction++;			; forward_induction++;
	; a[reverse_induction] = forward_induction;			; a[reverse_induction] = forward_induction;
	; --reverse_induction;			; --reverse_induction;
	; }			; }
	; }			; }

	; CHECK-LABEL: @reverse_forward_induction_i64_i8(			; CHECK-LABEL: @reverse_forward_induction_i64_i8(
	; CHECK: vector.body
	; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]			; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	; CHECK: %vec.ind = phi <4 x i64> [ <i64 1023, i64 1022, i64 1021, i64 1020>, %vector.ph ]			; CHECK: %offset.idx = sub i64 1023, %index
	; CHECK: %step.add = add <4 x i64> %vec.ind, <i64 -4, i64 -4, i64 -4, i64 -4>			; CHECK: %[[a0:.+]] = add i64 %offset.idx, 0
	; CHECK: trunc i64 %index to i8			; CHECK: %[[v0:.+]] = insertelement <4 x i64> undef, i64 %[[a0]], i64 0
				; CHECK: %[[a1:.+]] = add i64 %offset.idx, -1
				; CHECK: %[[v1:.+]] = insertelement <4 x i64> %[[v0]], i64 %[[a1]], i64 1
				; CHECK: %[[a2:.+]] = add i64 %offset.idx, -2
				; CHECK: %[[v2:.+]] = insertelement <4 x i64> %[[v1]], i64 %[[a2]], i64 2
				; CHECK: %[[a3:.+]] = add i64 %offset.idx, -3
				; CHECK: %[[v3:.+]] = insertelement <4 x i64> %[[v2]], i64 %[[a3]], i64 3
				; CHECK: %[[a4:.+]] = add i64 %offset.idx, -4
				; CHECK: %[[v4:.+]] = insertelement <4 x i64> undef, i64 %[[a4]], i64 0
				; CHECK: %[[a5:.+]] = add i64 %offset.idx, -5
				; CHECK: %[[v5:.+]] = insertelement <4 x i64> %[[v4]], i64 %[[a5]], i64 1
				; CHECK: %[[a6:.+]] = add i64 %offset.idx, -6
				; CHECK: %[[v6:.+]] = insertelement <4 x i64> %[[v5]], i64 %[[a6]], i64 2
				; CHECK: %[[a7:.+]] = add i64 %offset.idx, -7
				; CHECK: %[[v7:.+]] = insertelement <4 x i64> %[[v6]], i64 %[[a7]], i64 3

	define void @reverse_forward_induction_i64_i8() {			define void @reverse_forward_induction_i64_i8() {
	entry:			entry:
	br label %while.body			br label %while.body

	while.body:			while.body:
	%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %while.body ]			%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %while.body ]
	%forward_induction.05 = phi i8 [ 0, %entry ], [ %inc, %while.body ]			%forward_induction.05 = phi i8 [ 0, %entry ], [ %inc, %while.body ]
	%inc = add i8 %forward_induction.05, 1			%inc = add i8 %forward_induction.05, 1
	%conv = zext i8 %inc to i32			%conv = zext i8 %inc to i32
	%arrayidx = getelementptr inbounds [1024 x i32], [1024 x i32]* @a, i64 0, i64 %indvars.iv			%arrayidx = getelementptr inbounds [1024 x i32], [1024 x i32]* @a, i64 0, i64 %indvars.iv
	store i32 %conv, i32* %arrayidx, align 4			store i32 %conv, i32* %arrayidx, align 4
	%indvars.iv.next = add i64 %indvars.iv, -1			%indvars.iv.next = add i64 %indvars.iv, -1
	%0 = trunc i64 %indvars.iv to i32			%0 = trunc i64 %indvars.iv to i32
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	br i1 %cmp, label %while.body, label %while.end			br i1 %cmp, label %while.body, label %while.end

	while.end:			while.end:
	ret void			ret void
	}			}
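The updated CHECK lines for @reverse_forward_induction_i64_i8 expect the reverse induction to be rebuilt from scalars: %offset.idx = 1023 - %index, followed by eight adds with offsets 0 through -7 that are inserted into two <4 x i64> vectors. A rough model of those lane values is sketched below. The function name and parameters are hypothetical, and the split into two parts of four lanes is read off the CHECK lines rather than taken from the vectorizer source.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of the expected IR, not LLVM code: returns the four
// scalar indices inserted into the <4 x i64> vector for unrolled part `part`
// (0 or 1 in the test above).
inline std::array<std::int64_t, 4>
reverseLaneIndices(std::int64_t start, std::int64_t index, unsigned part) {
  // Mirrors: %offset.idx = sub i64 1023, %index
  const std::int64_t offsetIdx = start - index;
  std::array<std::int64_t, 4> lanes{};
  for (unsigned lane = 0; lane < 4; ++lane) {
    // Mirrors: add i64 %offset.idx, -(part * 4 + lane)
    lanes[lane] = offsetIdx - static_cast<std::int64_t>(part * 4 + lane);
  }
  return lanes;
}
```

For the first vector iteration (`index` 0), part 0 yields 1023, 1022, 1021, 1020, which matches the constant that seeded %vec.ind in the old CHECK lines; part 1 continues with 1019 down to 1016.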

	; CHECK-LABEL: @reverse_forward_induction_i64_i8_signed(			; CHECK-LABEL: @reverse_forward_induction_i64_i8_signed(
	; CHECK: vector.body:
	; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]			; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	; CHECK: %vec.ind = phi <4 x i64> [ <i64 1023, i64 1022, i64 1021, i64 1020>, %vector.ph ]			; CHECK: %offset.idx = sub i64 1023, %index
	; CHECK: %step.add = add <4 x i64> %vec.ind, <i64 -4, i64 -4, i64 -4, i64 -4>			; CHECK: %[[a0:.+]] = add i64 %offset.idx, 0
				; CHECK: %[[v0:.+]] = insertelement <4 x i64> undef, i64 %[[a0]], i64 0
				; CHECK: %[[a1:.+]] = add i64 %offset.idx, -1
				; CHECK: %[[v1:.+]] = insertelement <4 x i64> %[[v0]], i64 %[[a1]], i64 1
				; CHECK: %[[a2:.+]] = add i64 %offset.idx, -2
				; CHECK: %[[v2:.+]] = insertelement <4 x i64> %[[v1]], i64 %[[a2]], i64 2
				; CHECK: %[[a3:.+]] = add i64 %offset.idx, -3
				; CHECK: %[[v3:.+]] = insertelement <4 x i64> %[[v2]], i64 %[[a3]], i64 3
				; CHECK: %[[a4:.+]] = add i64 %offset.idx, -4
				; CHECK: %[[v4:.+]] = insertelement <4 x i64> undef, i64 %[[a4]], i64 0
				; CHECK: %[[a5:.+]] = add i64 %offset.idx, -5
				; CHECK: %[[v5:.+]] = insertelement <4 x i64> %[[v4]], i64 %[[a5]], i64 1
				; CHECK: %[[a6:.+]] = add i64 %offset.idx, -6
				; CHECK: %[[v6:.+]] = insertelement <4 x i64> %[[v5]], i64 %[[a6]], i64 2
				; CHECK: %[[a7:.+]] = add i64 %offset.idx, -7
				; CHECK: %[[v7:.+]] = insertelement <4 x i64> %[[v6]], i64 %[[a7]], i64 3

	define void @reverse_forward_induction_i64_i8_signed() {			define void @reverse_forward_induction_i64_i8_signed() {
	entry:			entry:
	br label %while.body			br label %while.body

	while.body:			while.body:
	%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %while.body ]			%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %while.body ]
	%forward_induction.05 = phi i8 [ -127, %entry ], [ %inc, %while.body ]			%forward_induction.05 = phi i8 [ -127, %entry ], [ %inc, %while.body ]