This is an archive of the discontinued LLVM Phabricator instance.

lib/Transforms/Vectorize/LoopVectorize.cpp
5176	The comment here needs a bit more detail. Even if the trip count isn't uniform, as long as it isn't tiny, vectorizing still reduces the total number of instructions executed. For example, if you're vectorizing a simple f32 loop to <4 x f32>, the main loop executes "maxTripCount/4" times, and the remainder loop executes at most 3 times. Assuming maxTripCount is large, maxTripCount/4 + 3 is much smaller than maxTripCount.
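For illustration, the arithmetic in this comment can be checked with a small standalone sketch. `IterationCounts` and `countIterations` are hypothetical names made up for this example; they are not part of LoopVectorize.cpp, and the model ignores any scalar epilogue the vectorizer might require for other reasons.

```cpp
#include <cassert>
#include <cstdint>

// Loop-body executions after vectorizing by `vf` with a scalar remainder loop.
struct IterationCounts {
  std::uint64_t vectorIters;    // trips through the vector body
  std::uint64_t remainderIters; // trips through the scalar remainder loop
  std::uint64_t total() const { return vectorIters + remainderIters; }
};

// Hypothetical helper: splits a trip count into vector and remainder trips.
IterationCounts countIterations(std::uint64_t tripCount, std::uint64_t vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  return {tripCount / vf, tripCount % vf};
}
```

For a trip count of 1003 and a factor of 4 this gives 250 vector iterations plus 3 remainder iterations, so 253 in total instead of 1003.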

Add more to comment

ping

In D32729#785795, @arsenm wrote:

ping

I don't understand this patch. Unless there is both a vector and a scalar loop, I don't see how vectorizing the loop affects divergence between threads one way or the other. Do you really mean to prohibit vectorization when you have a dynamic trip count *and also* don't require a scalar loop? If so, you might look at D34373, which is related.

In D32729#785871, @hfinkel wrote:

I don't understand this patch. Unless there is both a vector and a scalar loop, I don't see how vectorizing the loop affects divergence between threads one way or the other. Do you really mean to prohibit vectorization when you have a dynamic trip count *and also* don't require a scalar loop? If so, you might look at D34373, which is related.

Yes, the point is to avoid the additional condition where both loops could execute.

In D32729#785893, @arsenm wrote:

Yes, the point is to avoid the additional condition where both loops could execute.

Ah, okay. Adding that check on top of D34373 seems likely to be easy. I recommend doing that.
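One rough way to picture the "both loops could execute" case is a lockstep model. The sketch below assumes, as a simplification, that a wavefront keeps stepping through a loop for as long as any lane is still active in it. `WavePasses` and `wavePasses` are hypothetical names for this illustration only and do not model any particular hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Wavefront-level passes through each loop under the lockstep assumption.
struct WavePasses {
  std::uint64_t vectorLoop; // passes through the vectorized loop
  std::uint64_t scalarLoop; // passes through the scalar remainder loop
};

// Each lane has its own trip count; the wavefront pays for the slowest lane
// in each of the two loops separately.
WavePasses wavePasses(const std::vector<std::uint64_t> &laneTripCounts,
                      std::uint64_t vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  WavePasses passes{0, 0};
  for (std::uint64_t tc : laneTripCounts) {
    passes.vectorLoop = std::max(passes.vectorLoop, tc / vf);
    passes.scalarLoop = std::max(passes.scalarLoop, tc % vf);
  }
  return passes;
}
```

In this model, uniform trip counts that are a multiple of the factor never enter the scalar loop, whereas per-lane trip counts such as 8, 9, 16 and 3 with a factor of 4 make the wavefront pass through both loops.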

Rebased on the new OptSize handling. Allow forced vectorization with metadata in case the user knows the loop is dynamically uniform, etc.
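At the source level, the forcing metadata is normally produced by Clang's loop pragma, which the remarks in this patch also mention. The sketch below is an assumption about how a user might write such a loop; `addSix` is a hypothetical function mirroring the loop body in the test file, and the pragma is only a hint that other compilers ignore.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adds 6 to every element, mirroring the loop body used in the test file.
void addSix(std::int16_t *data, std::size_t size) {
  assert((data != nullptr || size == 0) && "non-empty input needs a buffer");
  // Clang-specific hint: request vectorization with an interleave count of 1.
  // Compilers that do not recognize the pragma skip it.
#pragma clang loop vectorize(enable) interleave_count(1)
  for (std::size_t i = 0; i < size; ++i)
    data[i] = static_cast<std::int16_t>(data[i] + 6);
}
```

With Clang, this pragma is expected to attach `llvm.loop.vectorize.enable` and `llvm.loop.interleave.count` metadata like the `!3` and `!4` nodes at the end of the test file.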

fhahn added a subscriber: fhahn. Aug 2 2017, 11:26 PM

ping

efriedma added reviewers: mkuper, mssimpso, anna, Ayal. Aug 28 2017, 4:16 PM

Ayal added inline comments. Sep 13 2017, 1:46 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6322	Here we should indeed continue to use `(OptForSize)` only, rather than `(OptForSize || OptForDivergent)`, right?
6323	Should this also be `if ((OptForSize || OptForDivergent) && TC % MaxVF != 0)`? It may be better to have one `AvoidTailLoop` flag, which is set if we're either really optimizing for size, dealing with a tiny loop, or optimizing for a divergent target. Check here if a tail is needed for whatever reason, with a debug dump stating simply that a tail is required when it must be avoided. The reason why a tail must be avoided can be dumped separately when setting `AvoidTailLoop`. Not sure how best to handle the different `ORE` reports, though; perhaps by refactoring out an `isTailLoopNeeded()`?
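A minimal standalone sketch of the single-flag idea, under stated assumptions: `TailPolicy` and `mustGiveUp` are hypothetical names, `isTailLoopNeeded` is modelled as a free function rather than a cost-model member, and an unknown trip count is represented by a value below 2 to mirror the `TC < 2` check in the diff. None of this is code from the patch.

```cpp
#include <cassert>

// Hypothetical model of the reasons to avoid a tail loop.
struct TailPolicy {
  bool optForSize;      // really optimizing for size (-Os/-Oz)
  bool tinyTripCount;   // loop has a tiny trip count
  bool divergentTarget; // target has branch divergence

  // One flag folding together every reason to avoid a tail loop.
  bool avoidTailLoop() const {
    return optForSize || tinyTripCount || divergentTarget;
  }
};

// A tail is needed when the trip count is unknown (modelled as < 2) or is
// not a multiple of the maximum vectorization factor.
bool isTailLoopNeeded(unsigned tripCount, unsigned maxVF) {
  assert(maxVF > 0 && "maximum VF must be positive");
  return tripCount < 2 || tripCount % maxVF != 0;
}

// Vectorization is rejected only when a tail is needed but must be avoided.
bool mustGiveUp(const TailPolicy &policy, unsigned tripCount, unsigned maxVF) {
  return policy.avoidTailLoop() && isTailLoopNeeded(tripCount, maxVF);
}
```

Keeping the reason (size, tiny loop, divergence) separate from the "is a tail needed" check would let each reason emit its own remark while sharing one bail-out path.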

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	75 lines
test/Transforms/LoopVectorize/AMDGPU/divergent-loop-bounds.ll	183 lines

Diff 109459

lib/Transforms/Vectorize/LoopVectorize.cpp


[1,846 lines not shown] LoopVectorizationCostModel(Loop *L, PredicatedScalarEvolution &PSE,
                              AssumptionCache *AC,
                              OptimizationRemarkEmitter *ORE, const Function *F,
                              const LoopVectorizeHints *Hints)
       : TheLoop(L), PSE(PSE), LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB),
         AC(AC), ORE(ORE), TheFunction(F), Hints(Hints) {}

   /// \return An upper bound for the vectorization factor, or None if
   /// vectorization should be avoided up front.
-  Optional<unsigned> computeMaxVF(bool OptForSize);
+  Optional<unsigned> computeMaxVF(bool OptForSize, bool OptForDivergent);

   /// Information about vectorization costs
   struct VectorizationFactor {
     unsigned Width; // Vector width with best cost
     unsigned Cost; // Cost of the loop with that width
   };
   /// \return The most profitable vectorization factor and the cost of that VF.
   /// This method checks every power of two up to MaxVF. If UserVF is not ZERO
[349 lines not shown] LoopVectorizationPlanner(Loop *OrigLoop, LoopInfo *LI,
                            LoopVectorizationLegality *Legal,
                            LoopVectorizationCostModel &CM)
       : OrigLoop(OrigLoop), LI(LI), Legal(Legal), CM(CM) {}

   ~LoopVectorizationPlanner() {}

   /// Plan how to best vectorize, return the best VF and its cost.
   LoopVectorizationCostModel::VectorizationFactor plan(bool OptForSize,
+                                                       bool OptForDivergent,
                                                        unsigned UserVF);

   /// Generate the IR code for the vectorized loop.
   void executePlan(InnerLoopVectorizer &ILV);

 protected:
   /// Collect the instructions from the original loop that would be trivially
   /// dead in the vectorized loop if generated.
[2,938 lines not shown] bool LoopVectorizationLegality::canVectorize() {

   // Check if we can vectorize the instructions and CFG in this loop.
   if (!canVectorizeInstrs()) {
     DEBUG(dbgs() << "LV: Can't vectorize the instructions or CFG\n");
     if (ORE->allowExtraAnalysis())
       Result = false;
     else
       return false;
   }
[Inline comment by efriedma attached here; full text quoted at the top of this page]

   // Go over each instruction and look at memory deps.
   if (!canVectorizeMemory()) {
     DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n");
     if (ORE->allowExtraAnalysis())
       Result = false;
     else
       return false;
[1,081 lines not shown] if (LastMember) {
         continue;
       }
       DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n");
       RequiresScalarEpilogue = true;
     }
   }
 }

-Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(bool OptForSize) {
+Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(bool OptForSize,
+                                                            bool OptForDivergent) {
   if (!EnableCondStoresVectorization && Legal->getNumPredStores()) {
     ORE->emit(createMissedAnalysis("ConditionalStore")
               << "store that is conditionally executed prevents vectorization");
     DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n");
     return None;
   }

-  if (Legal->getRuntimePointerChecking()->Need && TTI.hasBranchDivergence()) {
+  if (Legal->getRuntimePointerChecking()->Need) {
+    if (OptForSize) {
+      ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
+                << "runtime pointer checks needed. Enable vectorization of this "
+                   "loop with '#pragma clang loop vectorize(enable)' when "
+                   "compiling with -Os/-Oz");
+      DEBUG(dbgs()
+            << "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
+
+      return None;
+    }
+
+    if (OptForDivergent) {
       // TODO: It may by useful to do since it's still likely to be dynamically
       // uniform if the target can skip.
       DEBUG(dbgs() << "LV: Not inserting runtime ptr check for divergent target");

       ORE->emit(
           createMissedAnalysis("CantVersionLoopWithDivergentTarget")
           << "runtime pointer checks needed. Not enabled for divergent target");

       return None;
     }

-  if (!OptForSize) // Remaining checks deal with scalar loop when OptForSize.
-    return computeFeasibleMaxVF(OptForSize);
-
-  if (Legal->getRuntimePointerChecking()->Need) {
-    ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
-              << "runtime pointer checks needed. Enable vectorization of this "
-                 "loop with '#pragma clang loop vectorize(enable)' when "
-                 "compiling with -Os/-Oz");
-    DEBUG(dbgs()
-          << "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
-    return None;
   }

   // If we optimize the program for size, avoid creating the tail loop.
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
   DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

   // If we don't know the precise trip count, don't try to vectorize.
-  if (TC < 2) {
+  if (TC < 2 && (OptForSize || OptForDivergent)) {
     ORE->emit(
         createMissedAnalysis("UnknownLoopCountComplexCFG")
         << "unable to calculate the loop count due to complex control flow");
-    DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
+    DEBUG(dbgs() << "LV: Aborting. A tail loop is required with "
+                 << (OptForSize ? "-Os/-Oz.\n" : "divergent target.\n"));
     return None;
   }

   unsigned MaxVF = computeFeasibleMaxVF(OptForSize);
[Inline comment by Ayal (6322) attached here; full text quoted above]

-  if (TC % MaxVF != 0) {
+  if (OptForSize && TC % MaxVF != 0) {
[Inline comment by Ayal (6323) attached here; full text quoted above]
     // If the trip count that we found modulo the vectorization factor is not
     // zero then we require a tail.
     // FIXME: look for a smaller MaxVF that does divide TC rather than give up.
     // FIXME: return None if loop requiresScalarEpilog(<MaxVF>), or look for a
     // smaller MaxVF that does not require a scalar epilog.

     ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize")
               << "cannot optimize for size and vectorize at the "
                  "same time. Enable vectorization of this loop "
                  "with '#pragma clang loop vectorize(enable)' "
                  "when compiling with -Os/-Oz");
     DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
     return None;
   }

   return MaxVF;
 }

 unsigned LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize) {
[1,263 lines not shown] void LoopVectorizationCostModel::collectValuesToIgnore() {
   for (auto &Reduction : *Legal->getReductionVars()) {
     RecurrenceDescriptor &RedDes = Reduction.second;
     SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
     VecValuesToIgnore.insert(Casts.begin(), Casts.end());
   }
 }

 LoopVectorizationCostModel::VectorizationFactor
-LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {
+LoopVectorizationPlanner::plan(bool OptForSize, bool OptForDivergent,
+                               unsigned UserVF) {

   // Width 1 means no vectorize, cost 0 means uncomputed cost.
   const LoopVectorizationCostModel::VectorizationFactor NoVectorization = {1U,
                                                                            0U};
-  Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);
+  Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize, OptForDivergent);
   if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.
     return NoVectorization;

   if (UserVF) {
     DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
     assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
     // Collect the instructions (and their associated costs) that will be more
     // profitable to scalarize.
[198 lines not shown] if (!LVL.canVectorize()) {
     return false;
   }

   // Check the function attributes to find out if this function should be
   // optimized for size.
   bool OptForSize =
       Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();

+  bool OptForDivergent =
+      Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
+      TTI->hasBranchDivergence();
+
   // Check the loop for a trip count threshold: vectorize loops with a tiny trip
   // count by optimizing for size, to minimize overheads.
   unsigned ExpectedTC = SE->getSmallConstantMaxTripCount(L);
   bool HasExpectedTC = (ExpectedTC > 0);

   if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {
     auto EstimatedTC = getLoopEstimatedTripCount(L);
     if (EstimatedTC) {
[53 lines not shown] #endif /* NDEBUG */
   // Use the planner for vectorization.
   LoopVectorizationPlanner LVP(L, LI, &LVL, CM);

   // Get user vectorization factor.
   unsigned UserVF = Hints.getWidth();

   // Plan how to best vectorize, return the best VF and its cost.
   LoopVectorizationCostModel::VectorizationFactor VF =
-      LVP.plan(OptForSize, UserVF);
+      LVP.plan(OptForSize, OptForDivergent, UserVF);

   // Select the interleave count.
   unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);

   // Get user interleave count.
   unsigned UserIC = Hints.getInterleave();

   // Identify the diagnostic messages that should be produced.
[208 lines not shown]

test/Transforms/LoopVectorize/AMDGPU/divergent-loop-bounds.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 -loop-vectorize -simplifycfg < %s | FileCheck -check-prefixes=GCN,GFX9 %s
				; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji -loop-vectorize -simplifycfg < %s | FileCheck -check-prefixes=GCN,VI %s
				; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 -loop-vectorize -pass-remarks-analysis='loop-vectorize' < %s 2>&1 | FileCheck -check-prefixes=GFX9-REMARK %s

				; It may make sense to vectorize this if the condition is uniform, but
				; assume that it isn't for now.

				; GCN-LABEL: @small_loop_i16_unknown_uniform_size(
				; GCN: load i16
				; GCN: add nsw i16
				; GCN: store i16
				; GCN: br i1 %cond

				; GFX9-REMARK: remark: <unknown>:0:0: loop not vectorized: unable to calculate the loop count due to complex control flow
				define amdgpu_kernel void @small_loop_i16_unknown_uniform_size(i16 addrspace(1)* nocapture %inArray, i16 %size) #0 {
				entry:
				%cmp = icmp sgt i16 %size, 0
				br i1 %cmp, label %loop, label %exit

				loop: ; preds = %entry, %loop
				%iv = phi i16 [ %iv1, %loop ], [ 0, %entry ]
				%gep = getelementptr inbounds i16, i16 addrspace(1)* %inArray, i16 %iv
				%load = load i16, i16 addrspace(1)* %gep, align 2
				%add = add nsw i16 %load, 6
				store i16 %add, i16 addrspace(1)* %gep, align 2
				%iv1 = add i16 %iv, 1
				%cond = icmp eq i16 %iv1, %size
				br i1 %cond, label %exit, label %loop

				exit: ; preds = %loop, %entry
				ret void
				}

				; GCN-LABEL: @small_loop_i16_unknown_divergent_size(
				; GCN: load i16
				; GCN: add nsw i16
				; GCN: store i16
				; GCN: br i1 %cond

				; GFX9-REMARK: remark: <unknown>:0:0: loop not vectorized: unable to calculate the loop count due to complex control flow
				define amdgpu_kernel void @small_loop_i16_unknown_divergent_size(i16 addrspace(1)* nocapture %inArray, i16 addrspace(1)* %size.ptr) #0 {
				entry:
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%size.gep = getelementptr inbounds i16, i16 addrspace(1)* %size.ptr, i32 %tid
				%size = load i16, i16 addrspace(1)* %size.gep
				%cmp = icmp sgt i16 %size, 0
				br i1 %cmp, label %loop, label %exit

				loop: ; preds = %entry, %loop
				%iv = phi i16 [ %iv1, %loop ], [ 0, %entry ]
				%gep = getelementptr inbounds i16, i16 addrspace(1)* %inArray, i16 %iv
				%load = load i16, i16 addrspace(1)* %gep, align 2
				%add = add nsw i16 %load, 6
				store i16 %add, i16 addrspace(1)* %gep, align 2
				%iv1 = add i16 %iv, 1
				%cond = icmp eq i16 %iv1, %size
				br i1 %cond, label %exit, label %loop

				exit: ; preds = %loop, %entry
				ret void
				}

				; This loop will be vectorized as the trip count is below the
				; threshold and no scalar iterations are needed.

				; GCN-LABEL: @small_loop_i16_256(
				; GFX9: load <2 x i16>
				; GFX9: add nsw <2 x i16>
				; GFX9: store <2 x i16>
				; GFX9: add i32 %index, 2
				; GFX9: br i1

				; VI-NOT: <2 x i16>
				define amdgpu_kernel void @small_loop_i16_256(i16 addrspace(1)* nocapture %inArray) #0 {
				entry:
				br label %loop

				loop: ; preds = %entry, %loop
				%iv = phi i16 [ %iv1, %loop ], [ 0, %entry ]
				%gep = getelementptr inbounds i16, i16 addrspace(1)* %inArray, i16 %iv
				%load = load i16, i16 addrspace(1)* %gep, align 2
				%add = add nsw i16 %load, 6
				store i16 %add, i16 addrspace(1)* %gep, align 2
				%iv1 = add i16 %iv, 1
				%cond = icmp eq i16 %iv1, 127
				br i1 %cond, label %exit, label %loop

				exit: ; preds = %loop, %entry
				ret void
				}

				; Not divisible by vectorize factor of 2
				; GCN-LABEL: @small_loop_i16_255(
				; GFX9: load <2 x i16>
				; GFX9: add nsw <2 x i16>
				; GFX9: store <2 x i16>
				; GFX9: add i32 %index, 2
				; GFX9: br i1

				; VI-NOT: <2 x i16>
				define amdgpu_kernel void @small_loop_i16_255(i16 addrspace(1)* nocapture %inArray) #0 {
				entry:
				br label %loop

				loop: ; preds = %entry, %loop
				%iv = phi i16 [ %iv1, %loop ], [ 0, %entry ]
				%gep = getelementptr inbounds i16, i16 addrspace(1)* %inArray, i16 %iv
				%load = load i16, i16 addrspace(1)* %gep, align 2
				%add = add nsw i16 %load, 6
				store i16 %add, i16 addrspace(1)* %gep, align 2
				%iv1 = add i16 %iv, 1
				%cond = icmp eq i16 %iv1, 127
				br i1 %cond, label %exit, label %loop

				exit: ; preds = %loop, %entry
				ret void
				}

				; Metadata indicates it should be vectorized even though it may be
				; divergent.
				; GCN-LABEL: @small_loop_i16_unknown_uniform_size_forced(
				; GCN: load <2 x i16>
				; GCN: add nsw <2 x i16>
				; GCN: store <2 x i16>
				; GCN: add i32 %index, 2
				; GCN: br i1
				define amdgpu_kernel void @small_loop_i16_unknown_uniform_size_forced(i16 addrspace(1)* nocapture %inArray, i32 %size) #0 {
				entry:
				%cmp = icmp sgt i32 %size, 0
				br i1 %cmp, label %loop, label %exit

				loop: ; preds = %loop, %entry
				%iv = phi i32 [ %iv1, %loop ], [ 0, %entry ]
				%gep = getelementptr inbounds i16, i16 addrspace(1)* %inArray, i32 %iv
				%load = load i16, i16 addrspace(1)* %gep, align 2, !llvm.mem.parallel_loop_access !2
				%add = add nsw i16 %load, 6
				store i16 %add, i16 addrspace(1)* %gep, align 2, !llvm.mem.parallel_loop_access !2
				%iv1 = add i32 %iv, 1
				%cond = icmp eq i32 %iv1, %size
				br i1 %cond, label %exit, label %loop, !llvm.loop !2

				exit: ; preds = %loop, %entry
				ret void
				}

				; GCN-LABEL: @small_loop_i16_unknown_divergent_size_forced(
				; GCN: load <2 x i16>
				; GCN: add nsw <2 x i16>
				; GCN: store <2 x i16>
				; GCN: add i32 %index, 2
				; GCN: br i1
				define amdgpu_kernel void @small_loop_i16_unknown_divergent_size_forced(i16 addrspace(1)* nocapture %inArray, i16 addrspace(1)* %size.ptr) #0 {
				entry:
				%tid = call i32 @llvm.amdgcn.workitem.id.x()
				%size.gep = getelementptr inbounds i16, i16 addrspace(1)* %size.ptr, i32 %tid
				%size = load i16, i16 addrspace(1)* %size.gep
				%cmp = icmp sgt i16 %size, 0
				br i1 %cmp, label %loop, label %exit

				loop: ; preds = %loop, %entry
				%iv = phi i16 [ %iv1, %loop ], [ 0, %entry ]
				%gep = getelementptr inbounds i16, i16 addrspace(1)* %inArray, i16 %iv
				%load = load i16, i16 addrspace(1)* %gep, align 2, !llvm.mem.parallel_loop_access !2
				%add = add nsw i16 %load, 6
				store i16 %add, i16 addrspace(1)* %gep, align 2, !llvm.mem.parallel_loop_access !2
				%iv1 = add i16 %iv, 1
				%cond = icmp eq i16 %iv1, %size
				br i1 %cond, label %exit, label %loop, !llvm.loop !2

				exit: ; preds = %loop, %entry
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x() #1

				attributes #0 = { nounwind }
				attributes #1 = { nounwind readnone speculatable }

				!0 = distinct !{!0}
				!1 = distinct !{!1}
				!2 = distinct !{!2, !3, !4}
				!3 = !{!"llvm.loop.vectorize.enable", i1 true}
				!4 = !{!"llvm.loop.interleave.count", i32 1}
				No newline at end of file

LV: Don't vectorize with unknown loop counts on divergent targets
Needs ReviewPublic