This is an archive of the discontinued LLVM Phabricator instance.

Differential D91398

[LoopVectorizer] Lower uniform loads as a single load (instead of relying on CSE)
ClosedPublic

Authored by reames on Nov 12 2020, 6:40 PM.

Details

Reviewers

anna
fhahn
greened
Ayal

Commits

rGb06a2ad94f45: [LoopVectorizer] Lower uniform loads as a single load (instead of relying on…

Summary

A uniform load is one which loads from a uniform address across all lanes. As currently implemented, we cost model such loads as if we did a single scalar load + a broadcast, but the actual lowering replicates the load once per lane.

This change tweaks the lowering to use the REPLICATE strategy by marking such loads (and the computation leading to their memory operand) as uniform after vectorization. This is a useful change in itself, but its real purpose is to pave the way for a following change which will generalize our uniformity logic.
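
To make the mismatch in the summary concrete, here is a minimal standalone sketch (not LLVM code). It contrasts a lowering that issues one scalar load per lane with one that issues a single load and broadcasts it. `VF`, `LoadCounter`, `lowerPerLane` and `lowerSingleLoad` are hypothetical names chosen for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical vectorization factor for this illustration.
constexpr std::size_t VF = 4;
using Vec = std::array<int, VF>;

// Counts how many scalar loads a lowering performs (illustrative only).
struct LoadCounter {
  int Loads = 0;
  int load(const int *Ptr) {
    assert(Ptr && "uniform address must be valid");
    ++Loads;
    return *Ptr;
  }
};

// One scalar load per lane: VF loads from the same address, which a later
// CSE pass would have to clean up.
Vec lowerPerLane(const int *UniformPtr, LoadCounter &C) {
  Vec Result{};
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    Result[Lane] = C.load(UniformPtr);
  return Result;
}

// A single scalar load followed by a broadcast to all lanes, which is what
// the cost model already assumed.
Vec lowerSingleLoad(const int *UniformPtr, LoadCounter &C) {
  Vec Result{};
  Result.fill(C.load(UniformPtr));
  return Result;
}
```

Both functions produce the same vector; only the number of loads differs (VF versus one), which is the gap between the costed and the emitted code that the summary describes.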

Diff Detail

Event Timeline

reames created this revision. Nov 12 2020, 6:40 PM

Herald added a project: Restricted Project. Nov 12 2020, 6:40 PM

Herald added subscribers: dantrushin, bollu, hiraditya, mcrosier.

reames requested review of this revision. Nov 12 2020, 6:40 PM

Harbormaster completed remote builds in B78706: Diff 305009. Nov 12 2020, 7:17 PM

Rebase w/over landed tests.

As part of that, include a slight broadening of scope. For some reason, the code wasn't considering any operand which wasn't an instruction to be uniform. Since I happened to write my test with memory addresses as arguments, this fell out.

Remove stray debug output and an extra bit of whitespace.

reames mentioned this in D91451: [LoopVectorizer] Leverage uniformity across unrolled iterations. Nov 13 2020, 11:35 AM

reames added a child revision: D91451: [LoopVectorizer] Leverage uniformity across unrolled iterations.

fhahn added inline comments. Nov 14 2020, 12:17 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	The VPReplicateRecipe contains an `IsUniform` flag. I think it should be possible to pass the flag through from the recipe to `scalarizeInstruction`. Ideally the recipes should contain all information required for code generation, to avoid having to tie code generation directly to the cost model.
5079	nit: message for assert.
6272	nit: message for assert?
6451	this seems an unrelated refactoring, which could be split out and committed independently?

There's D68831 which is potentially related and tries to extend the logic to consider loop-invariant ops as uniform. I suppose I need to rebase it again....

reames added inline comments. Nov 14 2020, 7:54 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	I think you're misreading the code slightly. This isn't checking whether the recipe for the instruction being scalarized is uniform. It's checking whether the input to the instruction is uniform. It seems like overkill to record a per operand uniform flag in the recipe?
6272	Aside from being pedantic, why? I'm happy to comply, but I don't really see any value in cases like this where it's obvious from context.
6451	Happy to do so since you seem to think the interface made sense. I'm very new to vectorizer code and didn't want to jump to that conclusion. :)

fhahn added inline comments. Nov 15 2020, 9:54 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	> I think you're misreading the code slightly. This isn't checking whether the recipe for the instruction being scalarized is uniform. It's checking whether the input to the instruction is uniform.
	Indeed, I was thinking about something slightly different. The underlying point still applies: ideally this code should depend on the information in the corresponding VPlan, not on the cost model.
	> It seems like overkill to record a per operand uniform flag in the recipe?
	Yes, but fortunately I don't think that is necessary, because I think all required information is already encoded in the operands in VPlan: they should be uniform VPReplicateRecipes. Making this information accessible here is currently work-in-progress. With D91500, which I just put up, you should be able to check if `User.getOperand(op)` is a `VPReplicateRecipe` and, if so, `IsUniform` also needs to be true.
	I think it would make sense to take it one step further, and use this logic directly in `VPTransformState::get`: when requesting a particular lane for a uniform VPValue, we should always be able to just return lane 0 and other callers potentially could also benefit. With D91501, the changes to `scalarizeInstruction` should not be needed.
6272	I agree this one is borderline and a message like `"must be called with a uniform memory instruction"` probably does not add too much. It might make it slightly more explicit why this assertion is here for people not familiar with the code. I don't really mind either way :)
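
The "always return lane 0 for a uniform value" idea from the 2667 thread can be sketched in isolation. This is a toy model under stated assumptions, not the real `VPTransformState::get`: `ToyValue` and `getScalar` are hypothetical names, and a uniform value is assumed to have only lane 0 generated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for the scalars generated for one value. This is NOT the
// real VPTransformState; it only models the lane lookup.
struct ToyValue {
  bool IsUniform;           // same value on every lane
  std::vector<int> PerLane; // a uniform value only populates lane 0
};

// Returns the scalar for Lane, redirecting every request for a uniform
// value to lane 0 so callers never ask for a lane that was not generated.
int getScalar(const ToyValue &V, std::size_t Lane) {
  const std::size_t Effective = V.IsUniform ? 0 : Lane;
  assert(Effective < V.PerLane.size() && "lane was never generated");
  return V.PerLane[Effective];
}
```

Putting the redirect inside the lookup, rather than in each caller such as `scalarizeInstruction`, is the design point fhahn raises: any caller asking for lane 3 of a uniform value gets the lane 0 scalar without knowing about uniformity.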

reames mentioned this in rG2240d3d05451: [LoopVec] Introduce an api for detecting uniform memory ops. Nov 16 2020, 1:30 PM

anna added inline comments. Nov 16 2020, 1:31 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	Frankly, I'm finding the code written here in review easier to read :) Mainly because my context with VPlan is limited. Do we need to block this change on two reviews landing? I'm okay either way - but we can also land this and then "clean this up" once D91500 and D91501 land?
	> I think it would make sense to take it one step further, and use this logic directly in VPTransformState::get: when requesting a particular lane for a uniform VPValue, we should always be able to just return lane 0 and other callers potentially could also benefit.
	@fhahn Could you pls clarify further what you mean by "other callers potentially could also benefit"?
5039	Could this be turned into an assert and landed separately? As far as I can see, every caller of `addToWorklistIfAllowed` either already checks for outOfScope or has already checked that the instruction is in the loop.

fhahn added inline comments. Nov 18 2020, 12:28 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	> Frankly, I'm finding the code written here in review easier to read :) Mainly because my context with VPlan is limited. Do we need to block this change on two reviews landing? I'm okay either way - but we can also land this and then "clean this up" once D91500 and D91501 land?
	The linked patches are 'more' code, yes, but most of the code is part of current in-progress improvements, and the uniform handling is mostly a nice side benefit of VPlanization. We don't necessarily need to block this change, I just wanted to note that this is a step backwards in terms of the general direction (quite a bit of work was spent on moving any cost-model references out of code generation). It's just a small step though and we have a clear path forward to resolve this.
	> @fhahn Could you pls clarify further what you mean by "other callers potentially could also benefit"?
	There might be other places that request individual lanes for uniform values. Those would also benefit from only getting the first lane.

reames added inline comments. Nov 22 2020, 4:38 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	I've gone ahead and approved your D91501. I don't think it fully solves the problem, but I see little harm in it.
	Your proposed approach creates a coupling between the lowering strategy chosen and the semantic property of uniformity. To some extent that might be unavoidable, but can we do better?
	As an example case, consider udiv(add(X, Y), Z) where all inputs are loop invariant. If X and Y already have vector uses, we might reasonably decide that a widened vector add is the best lowering for this uniform expression. (Not sure if we do today, but it's a reasonable choice.) When considering the udiv, if we decide to replicate it (because we know it's uniform and expensive), we want to be able to extract lane 0 regardless of the lowering strategy we chose for the add.
	Another approach is to directly skip the cost model and ask the legality question. That works for the example I just gave, but doesn't work with the way we currently handle GEPs feeding wideable loads - which is entirely handled in costing. I suspect we need to move the GEP detection out of the cost model, but I don't fully understand the broader implications of that.
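
The udiv(add(X, Y), Z) case above can be written out as a small standalone sketch (not LLVM code). It assumes the add was widened to a vector and the expensive udiv is computed once from lane 0. `VF`, `widenedAdd` and `uniformUDiv` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical vectorization factor for this illustration.
constexpr std::size_t VF = 4;
using Vec = std::array<unsigned, VF>;

// Widened lowering of the uniform add: every lane holds X + Y.
Vec widenedAdd(unsigned X, unsigned Y) {
  Vec Result{};
  Result.fill(X + Y);
  return Result;
}

// The expensive udiv is computed once. Its input is taken from lane 0 of
// the widened add, independent of how the add itself was lowered.
unsigned uniformUDiv(const Vec &Sum, unsigned Z) {
  assert(Z != 0 && "udiv by zero");
  return Sum[0] / Z;
}
```

The point of the example is that uniformity is a property of the value: lane 0 is a valid input for the udiv whether the add was widened or replicated, so the check should not depend on the add's lowering strategy.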

reames added inline comments. Nov 22 2020, 4:52 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5039	The check is needed for case where a uniform pointer is defined outside the loop but used inside of it. I could move the check to the caller, but the placement seems to make more sense here?
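
A source-level sketch of the case described for line 5039, a uniform pointer defined outside the loop but loaded from inside it. The function name `scaleAll` is hypothetical, and this is only an assumed shape for such a loop, not the test from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scale is a function argument, so the pointer is defined outside the loop
// and is not an instruction inside it. The load of *Scale stays in the loop
// body because the store to Data[I] might alias it.
void scaleAll(std::vector<int> &Data, const int *Scale) {
  assert(Scale && "Scale must point to a valid int");
  for (std::size_t I = 0; I < Data.size(); ++I)
    Data[I] *= *Scale; // same address on every iteration: a uniform load
}
```

In a loop of this shape the address operand of the load is an argument, not a loop instruction, which is the situation where the out-of-scope check in `addToWorklistIfAllowed` matters.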

reames added inline comments. Nov 22 2020, 4:54 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	> We don't necessarily need to block this change, I just wanted to note that this is a step backwards in terms of the general direction (quite a bit of work was spent on moving any cost-model references out of code generation). It's just a small step though and we have a clear path forward to resolve this.
	Could I get an LGTM then? I'm happy to help evolve towards a good overall design, but it's really hard to make progress when the first patch which has a functional effect gets blocked in review.

LGTM, thanks. I also added @Ayal, who might also have some thoughts. It would probably be best to wait a few days with committing, in case there are additional comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667	> Your proposed approach creates a coupling between the lowering strategy chosen and the semantic property of uniformity. To some extent that might be unavoidable, but can we do better?
	Yes, the VPlan changes I put up just ask the same question in a different way. I think we should be able to improve the code generated for your example, but that should probably happen before codegen, e.g. by widening.
	> Another approach is to directly skip the cost model and ask the legality question. That works for the example I just gave, but doesn't work with the way we currently handle GEPs feeding wideable loads - which is entirely handled in costing.
	I think asking legality here comes with the same problematic coupling, which hinders modularizing transforms via VPlan.

This revision is now accepted and ready to land. Nov 23 2020, 1:22 PM

Closed by commit rGb06a2ad94f45: [LoopVectorizer] Lower uniform loads as a single load (instead of relying on… (authored by reames). Nov 23 2020, 3:32 PM

This revision was automatically updated to reflect the committed changes.

reames added a commit: rGb06a2ad94f45: [LoopVectorizer] Lower uniform loads as a single load (instead of relying on….

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	13 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	49 lines
llvm/test/Transforms/LoopVectorize/X86/cost-model-assert.ll	59 lines
llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll	126 lines
llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll	47 lines

Diff 305030

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

@@ ... @@ public:
   /// -1 - Address is consecutive, and decreasing.
   /// NOTE: This method must only be used before modifying the original scalar
   /// loop. Do not use after invoking 'createVectorizedLoopSkeleton' (PR34965).
   int isConsecutivePtr(Value *Ptr);

   /// Returns true if the value V is uniform within the loop.
   bool isUniform(Value *V);

+  /// A uniform memory op is a load or store which accesses the same memory
+  /// location on all lanes.
+  bool isUniformMemOp(Instruction &I) {
+    Value *Ptr = getLoadStorePointerOperand(&I);
+    if (!Ptr)
+      return false;
+    // Note: There's nothing inherent which prevents predicated loads and
+    // stores from being uniform. The current lowering simply doesn't handle
+    // it; in particular, the cost model distinguishes scatter/gather from
+    // scalar w/predication, and we currently rely on the scalar path.
+    return isUniform(Ptr) && !blockNeedsPredication(I.getParent());
+  }
+
   /// Returns the information that we collected about runtime memory check.
   const RuntimePointerChecking *getRuntimePointerChecking() const {
     return LAI->getRuntimePointerChecking();
   }

   const LoopAccessInfo *getLAI() const { return LAI; }

   unsigned getMaxSafeDepDistBytes() { return LAI->getMaxSafeDepDistBytes(); }
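
The note inside `isUniformMemOp` excludes memory ops whose block needs predication. A source-level sketch of the two situations, assuming loops of roughly this shape reach the vectorizer; `addBias` and `addBiasIfPositive` are hypothetical names, not tests from this patch.

```cpp
#include <cassert>
#include <cstddef>

// The load of *Bias executes on every iteration and always reads the same
// address: the kind of load the new predicate is meant to accept.
void addBias(int *Out, const int *Data, std::size_t N, const int *Bias) {
  assert(Bias && "Bias must be non-null");
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = Data[I] + *Bias;
}

// Same address, but the load only runs when the condition holds. Its block
// would need predication once vectorized, so the predicate as written in
// this patch would reject it.
void addBiasIfPositive(int *Out, const int *Data, std::size_t N,
                       const int *Bias) {
  assert(Bias && "Bias must be non-null");
  for (std::size_t I = 0; I < N; ++I)
    if (Data[I] > 0)
      Out[I] = Data[I] + *Bias;
}
```

As the code comment says, nothing inherent prevents the second form from being treated as uniform; it is excluded only because the current lowering does not handle it.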

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ ... @@ void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, VPUser &User,

   Instruction *Cloned = Instr->clone();
   if (!IsVoidRetTy)
     Cloned->setName(Instr->getName() + ".cloned");

   // Replace the operands of the cloned instructions with their scalar
   // equivalents in the new loop.
   for (unsigned op = 0, e = User.getNumOperands(); op != e; ++op) {
-    auto *NewOp = State.get(User.getOperand(op), Instance);
+    auto *Operand = dyn_cast<Instruction>(Instr->getOperand(op));
+    auto InputInstance = Instance;
+    if (!Operand || !OrigLoop->contains(Operand) ||
+        (Cost->isUniformAfterVectorization(Operand, State.VF)))
+      InputInstance.Lane = 0;
+    auto *NewOp = State.get(User.getOperand(op), InputInstance);
     Cloned->setOperand(op, NewOp);
   }
   addNewMetadata(Cloned, Instr);

   // Place the cloned scalar in the new loop.
   Builder.Insert(Cloned);

   // Add the cloned scalar to the scalar map entry.
@@ ... @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
   SetVector<Instruction *> Worklist;
   BasicBlock *Latch = TheLoop->getLoopLatch();

   // Instructions that are scalar with predication must not be considered
   // uniform after vectorization, because that would create an erroneous
   // replicating region where only a single instance out of VF should be formed.
   // TODO: optimize such seldom cases if found important, see PR40816.
   auto addToWorklistIfAllowed = [&](Instruction *I) -> void {
+    if (isOutOfScope(I)) {
+      LLVM_DEBUG(dbgs() << "LV: Found not uniform due to scope: "
+                        << *I << "\n");
+      return;
+    }
     if (isScalarWithPredication(I, VF)) {
       LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
                         << *I << "\n");
       return;
     }
     LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
     Worklist.insert(I);
   };

   // Start with the conditional branch. If the branch condition is an
   // instruction contained in the loop that is only used by the branch, it is
   // uniform.
   auto *Cmp = dyn_cast<Instruction>(Latch->getTerminator()->getOperand(0));
   if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse())
     addToWorklistIfAllowed(Cmp);

   // Holds consecutive and consecutive-like pointers. Consecutive-like pointers
   // are pointers that are treated like consecutive pointers during
   // vectorization. The pointer operands of interleaved accesses are an
   // example.
-  SmallSetVector<Instruction *, 8> ConsecutiveLikePtrs;
+  SmallSetVector<Value *, 8> ConsecutiveLikePtrs;

   // Holds pointer operands of instructions that are possibly non-uniform.
-  SmallPtrSet<Instruction *, 8> PossibleNonUniformPtrs;
+  SmallPtrSet<Value *, 8> PossibleNonUniformPtrs;

   auto isUniformDecision = [&](Instruction *I, ElementCount VF) {
     InstWidening WideningDecision = getWideningDecision(I, VF);
     assert(WideningDecision != CM_Unknown &&
            "Widening decision should be ready at this moment");

+    // The address of a uniform mem op is itself uniform. We exclude stores
+    // here as there's an assumption in the current code that all uses of
+    // uniform instructions are uniform and, as noted below, uniform stores are
+    // still handled via replication (i.e. aren't uniform after vectorization).
+    if (isa<LoadInst>(I) && Legal->isUniformMemOp(*I)) {
+      assert(WideningDecision == CM_Scalarize);
+      return true;
+    }
+
     return (WideningDecision == CM_Widen ||
             WideningDecision == CM_Widen_Reverse ||
             WideningDecision == CM_Interleave);
   };
   // Iterate over the instructions in the loop, and collect all
   // consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible
   // that a consecutive-like pointer operand will be scalarized, we collect it
   // in PossibleNonUniformPtrs instead. We use two sets here because a single
   // getelementptr instruction can be used by both vectorized and scalarized
   // memory instructions. For example, if a loop loads and stores from the same
   // location, but the store is conditional, the store will be scalarized, and
   // the getelementptr won't remain uniform.
   for (auto *BB : TheLoop->blocks())
     for (auto &I : *BB) {
       // If there's no pointer operand, there's nothing to do.
-      auto *Ptr = dyn_cast_or_null<Instruction>(getLoadStorePointerOperand(&I));
+      auto *Ptr = getLoadStorePointerOperand(&I);
       if (!Ptr)
         continue;

+      // For now, avoid walking use lists in other functions.
+      // TODO: Rewrite this algorithm from uses up.
+      if (!isa<Instruction>(Ptr) && !isa<Argument>(Ptr))
+        continue;
+
+      // A uniform memory op is itself uniform. We exclude stores here as we
+      // haven't yet added dedicated logic in the CLONE path and rely on
+      // REPLICATE + DSE for correctness.
+      if (isa<LoadInst>(I) && Legal->isUniformMemOp(I))
+        addToWorklistIfAllowed(&I);
+
       // True if all users of Ptr are memory accesses that have Ptr as their
       // pointer operand.
       auto UsersAreMemAccesses =
           llvm::all_of(Ptr->users(), [&](User *U) -> bool {
             return getLoadStorePointerOperand(U) == Ptr;
           });

       // Ensure the memory instruction will not be scalarized or used by
@@ ... @@ for (auto &I : *BB) {
       else
         ConsecutiveLikePtrs.insert(Ptr);
     }

   // Add to the Worklist all consecutive and consecutive-like pointers that
   // aren't also identified as possibly non-uniform.
   for (auto *V : ConsecutiveLikePtrs)
     if (!PossibleNonUniformPtrs.count(V))
-      addToWorklistIfAllowed(V);
+      if (auto *I = dyn_cast<Instruction>(V))
+        addToWorklistIfAllowed(I);

   // Expand Worklist in topological order: whenever a new instruction
   // is added , its users should be already inside Worklist. It ensures
   // a uniform instruction will only be used by uniform instructions.
   unsigned idx = 0;
   while (idx != Worklist.size()) {
     Instruction *I = Worklist[idx++];
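
The worklist expansion described in the comment above can be sketched on a toy graph. This is a simplified model, not the LLVM implementation: it only keeps the rule that an operand is added once all of its users are already in the worklist, and leaves out the real code's scope and memory-access checks. `ToyInst` and `expandUniforms` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction identified by its index; not LLVM's Instruction class.
struct ToyInst {
  std::vector<int> Operands; // indices of the instructions this one uses
  std::vector<int> Users;    // indices of the instructions that use this one
};

// Grows the seed worklist: an operand is appended only when every one of its
// users is already in the worklist, so a uniform instruction is only ever
// used by uniform instructions.
std::vector<int> expandUniforms(const std::vector<ToyInst> &Insts,
                                std::vector<int> Worklist) {
  auto InWorklist = [&](int Id) {
    return std::find(Worklist.begin(), Worklist.end(), Id) != Worklist.end();
  };
  for (std::size_t Idx = 0; Idx != Worklist.size(); ++Idx) {
    const int Cur = Worklist[Idx]; // copy: push_back below may reallocate
    assert(Cur >= 0 && static_cast<std::size_t>(Cur) < Insts.size() &&
           "worklist entry must be a valid instruction index");
    for (int Op : Insts[Cur].Operands) {
      if (InWorklist(Op))
        continue;
      const std::vector<int> &Users = Insts[Op].Users;
      if (std::all_of(Users.begin(), Users.end(), InWorklist))
        Worklist.push_back(Op);
    }
  }
  return Worklist;
}
```

Seeding the worklist with a uniform load pulls in its address computation as long as nothing non-uniform also uses that address, which is how marking the load uniform extends to "the computation leading to their memory operand" mentioned in the summary.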

@@ ... @@ unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
   bool Reverse = ConsecutiveStride < 0;
   if (Reverse)
     Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
   return Cost;
 }

 unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I,
                                                          ElementCount VF) {
+  assert(Legal->isUniformMemOp(*I));

Type *ValTy = getMemInstValueType(I);		Type *ValTy = getMemInstValueType(I);
auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));		auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));
const Align Alignment = getLoadStoreAlignment(I);		const Align Alignment = getLoadStoreAlignment(I);
unsigned AS = getLoadStoreAddressSpace(I);		unsigned AS = getLoadStoreAddressSpace(I);
enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
if (isa<LoadInst>(I)) {		if (isa<LoadInst>(I)) {
return TTI.getAddressComputationCost(ValTy) +		return TTI.getAddressComputationCost(ValTy) +
TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS,		TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS,
[...]	for (Instruction &I : *BB) {

// TODO: We should generate better code and update the cost model for		// TODO: We should generate better code and update the cost model for
// predicated uniform stores. Today they are treated as any other		// predicated uniform stores. Today they are treated as any other
// predicated store (see added test cases in		// predicated store (see added test cases in
// invariant-store-vectorization.ll).		// invariant-store-vectorization.ll).
if (isa<StoreInst>(&I) && isScalarWithPredication(&I))		if (isa<StoreInst>(&I) && isScalarWithPredication(&I))
NumPredStores++;		NumPredStores++;

if (Legal->isUniform(Ptr) &&		if (Legal->isUniformMemOp(I)) {
		fhahn: this seems an unrelated refactoring, which could be split out and committed independently?
		reames (author): Happy to do so since you seem to think the interface made sense. I'm very new to vectorizer code and didn't want to jump to that conclusion. :)
// Conditional loads and stores should be scalarized and predicated.
// isScalarWithPredication cannot be used here since masked
// gather/scatters are not considered scalar with predication.
!Legal->blockNeedsPredication(I.getParent())) {
// TODO: Avoid replicating loads and stores instead of		// TODO: Avoid replicating loads and stores instead of
// relying on instcombine to remove them.		// relying on instcombine to remove them.
// Load: Scalar load + broadcast		// Load: Scalar load + broadcast
// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract		// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
unsigned Cost = getUniformMemOpCost(&I, VF);		unsigned Cost = getUniformMemOpCost(&I, VF);
setWideningDecision(&I, VF, CM_Scalarize, Cost);		setWideningDecision(&I, VF, CM_Scalarize, Cost);
continue;		continue;
}		}
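The "scalar load + broadcast" shape named in the comment above can be pictured outside LLVM. The following is a rough sketch in plain C++, not vectorizer code: `VF`, `replicatedLoad`, and `broadcastLoad` are hypothetical names chosen for illustration. It assumes nothing writes to the address between the per-lane loads, and under that assumption loading once and splatting yields the same lanes as loading once per lane.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t VF = 4; // assumed vectorization factor, for illustration

// Replicated lowering: one scalar load per lane, all from the same address.
std::array<int, VF> replicatedLoad(const int *Addr) {
  std::array<int, VF> Lanes{};
  for (std::size_t L = 0; L < VF; ++L)
    Lanes[L] = *Addr;
  return Lanes;
}

// Uniform lowering: a single scalar load, then a broadcast to every lane.
std::array<int, VF> broadcastLoad(const int *Addr) {
  const int Scalar = *Addr;
  std::array<int, VF> Lanes{};
  Lanes.fill(Scalar);
  return Lanes;
}
```

Both helpers return identical arrays for any readable address; the second simply performs one memory access instead of `VF`.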
[...]

llvm/test/Transforms/LoopVectorize/X86/cost-model-assert.ll

	[...]
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4			; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
	; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, i8* null, i64 [[TMP1]]			; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, i8* null, i64 [[TMP1]]
	; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[BROADCAST_SPLAT]] to <4 x i32>			; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[BROADCAST_SPLAT]] to <4 x i32>
	; CHECK-NEXT: [[TMP3:%.*]] = shl nuw <4 x i32> [[TMP2]], <i32 24, i32 24, i32 24, i32 24>			; CHECK-NEXT: [[TMP3:%.*]] = shl nuw <4 x i32> [[TMP2]], <i32 24, i32 24, i32 24, i32 24>
	; CHECK-NEXT: [[TMP4:%.*]] = load i8, i8* [[P:%.*]], align 1, !tbaa !1			; CHECK-NEXT: [[TMP4:%.*]] = load i8, i8* [[P:%.*]], align 1, [[TBAA1:!tbaa !.*]]
	; CHECK-NEXT: [[TMP5:%.*]] = load i8, i8* [[P]], align 1, !tbaa !1			; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i8> undef, i8 [[TMP4]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = load i8, i8* [[P]], align 1, !tbaa !1			; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i8> [[BROADCAST_SPLATINSERT1]], <4 x i8> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP7:%.*]] = load i8, i8* [[P]], align 1, !tbaa !1			; CHECK-NEXT: [[TMP5:%.*]] = zext <4 x i8> [[BROADCAST_SPLAT2]] to <4 x i32>
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i8> undef, i8 [[TMP4]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw <4 x i32> [[TMP5]], <i32 16, i32 16, i32 16, i32 16>
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i8> [[TMP8]], i8 [[TMP5]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = or <4 x i32> [[TMP6]], [[TMP3]]
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i8> [[TMP9]], i8 [[TMP6]], i32 2			; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* undef, align 1, [[TBAA1]]
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i8> [[TMP10]], i8 [[TMP7]], i32 3			; CHECK-NEXT: [[TMP9:%.*]] = load i8, i8* undef, align 1, [[TBAA1]]
	; CHECK-NEXT: [[TMP12:%.*]] = zext <4 x i8> [[TMP11]] to <4 x i32>			; CHECK-NEXT: [[TMP10:%.*]] = load i8, i8* undef, align 1, [[TBAA1]]
	; CHECK-NEXT: [[TMP13:%.*]] = shl nuw nsw <4 x i32> [[TMP12]], <i32 16, i32 16, i32 16, i32 16>			; CHECK-NEXT: [[TMP11:%.*]] = load i8, i8* undef, align 1, [[TBAA1]]
	; CHECK-NEXT: [[TMP14:%.*]] = or <4 x i32> [[TMP13]], [[TMP3]]			; CHECK-NEXT: [[TMP12:%.*]] = or <4 x i32> [[TMP7]], zeroinitializer
	; CHECK-NEXT: [[TMP15:%.*]] = load i8, i8* undef, align 1, !tbaa !1			; CHECK-NEXT: [[TMP13:%.*]] = or <4 x i32> [[TMP12]], zeroinitializer
	; CHECK-NEXT: [[TMP16:%.*]] = load i8, i8* undef, align 1, !tbaa !1			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x i32> [[TMP13]], i32 0
	; CHECK-NEXT: [[TMP17:%.*]] = load i8, i8* undef, align 1, !tbaa !1			; CHECK-NEXT: store i32 [[TMP14]], i32* undef, align 4, [[TBAA4:!tbaa !.*]]
	; CHECK-NEXT: [[TMP18:%.*]] = load i8, i8* undef, align 1, !tbaa !1			; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[TMP13]], i32 1
	; CHECK-NEXT: [[TMP19:%.*]] = or <4 x i32> [[TMP14]], zeroinitializer			; CHECK-NEXT: store i32 [[TMP15]], i32* undef, align 4, [[TBAA4]]
	; CHECK-NEXT: [[TMP20:%.*]] = or <4 x i32> [[TMP19]], zeroinitializer			; CHECK-NEXT: [[TMP16:%.*]] = extractelement <4 x i32> [[TMP13]], i32 2
	; CHECK-NEXT: [[TMP21:%.*]] = extractelement <4 x i32> [[TMP20]], i32 0			; CHECK-NEXT: store i32 [[TMP16]], i32* undef, align 4, [[TBAA4]]
	; CHECK-NEXT: store i32 [[TMP21]], i32* undef, align 4, !tbaa !4			; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i32> [[TMP13]], i32 3
	; CHECK-NEXT: [[TMP22:%.*]] = extractelement <4 x i32> [[TMP20]], i32 1			; CHECK-NEXT: store i32 [[TMP17]], i32* undef, align 4, [[TBAA4]]
	; CHECK-NEXT: store i32 [[TMP22]], i32* undef, align 4, !tbaa !4
	; CHECK-NEXT: [[TMP23:%.*]] = extractelement <4 x i32> [[TMP20]], i32 2
	; CHECK-NEXT: store i32 [[TMP23]], i32* undef, align 4, !tbaa !4
	; CHECK-NEXT: [[TMP24:%.*]] = extractelement <4 x i32> [[TMP20]], i32 3
	; CHECK-NEXT: store i32 [[TMP24]], i32* undef, align 4, !tbaa !4
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], 0			; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], 0
	; CHECK-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !6			; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP6:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1, 0			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1, 0
	; CHECK-NEXT: br i1 [[CMP_N]], label [[SW_EPILOG:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[SW_EPILOG:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i8* [ null, [[MIDDLE_BLOCK]] ], [ null, [[IF_THEN]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i8* [ null, [[MIDDLE_BLOCK]] ], [ null, [[IF_THEN]] ]
	; CHECK-NEXT: br label [[FOR_BODY68:%.*]]			; CHECK-NEXT: br label [[FOR_BODY68:%.*]]
	; CHECK: for.body68:			; CHECK: for.body68:
	; CHECK-NEXT: [[P_359:%.*]] = phi i8* [ [[ADD_PTR86:%.*]], [[FOR_BODY68]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[P_359:%.*]] = phi i8* [ [[ADD_PTR86:%.*]], [[FOR_BODY68]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[CONV70:%.*]] = zext i8 [[X]] to i32			; CHECK-NEXT: [[CONV70:%.*]] = zext i8 [[X]] to i32
	; CHECK-NEXT: [[SHL71:%.*]] = shl nuw i32 [[CONV70]], 24			; CHECK-NEXT: [[SHL71:%.*]] = shl nuw i32 [[CONV70]], 24
	; CHECK-NEXT: [[TMP26:%.*]] = load i8, i8* [[P]], align 1, !tbaa !1			; CHECK-NEXT: [[TMP19:%.*]] = load i8, i8* [[P]], align 1, [[TBAA1]]
	; CHECK-NEXT: [[CONV73:%.*]] = zext i8 [[TMP26]] to i32			; CHECK-NEXT: [[CONV73:%.*]] = zext i8 [[TMP19]] to i32
	; CHECK-NEXT: [[SHL74:%.*]] = shl nuw nsw i32 [[CONV73]], 16			; CHECK-NEXT: [[SHL74:%.*]] = shl nuw nsw i32 [[CONV73]], 16
	; CHECK-NEXT: [[OR75:%.*]] = or i32 [[SHL74]], [[SHL71]]			; CHECK-NEXT: [[OR75:%.*]] = or i32 [[SHL74]], [[SHL71]]
	; CHECK-NEXT: [[TMP27:%.*]] = load i8, i8* undef, align 1, !tbaa !1			; CHECK-NEXT: [[TMP20:%.*]] = load i8, i8* undef, align 1, [[TBAA1]]
	; CHECK-NEXT: [[SHL78:%.*]] = shl nuw nsw i32 undef, 8			; CHECK-NEXT: [[SHL78:%.*]] = shl nuw nsw i32 undef, 8
	; CHECK-NEXT: [[OR79:%.*]] = or i32 [[OR75]], [[SHL78]]			; CHECK-NEXT: [[OR79:%.*]] = or i32 [[OR75]], [[SHL78]]
	; CHECK-NEXT: [[CONV81:%.*]] = zext i8 undef to i32			; CHECK-NEXT: [[CONV81:%.*]] = zext i8 undef to i32
	; CHECK-NEXT: [[OR83:%.*]] = or i32 [[OR79]], [[CONV81]]			; CHECK-NEXT: [[OR83:%.*]] = or i32 [[OR79]], [[CONV81]]
	; CHECK-NEXT: store i32 [[OR83]], i32* undef, align 4, !tbaa !4			; CHECK-NEXT: store i32 [[OR83]], i32* undef, align 4, [[TBAA4]]
	; CHECK-NEXT: [[ADD_PTR86]] = getelementptr inbounds i8, i8* [[P_359]], i64 4			; CHECK-NEXT: [[ADD_PTR86]] = getelementptr inbounds i8, i8* [[P_359]], i64 4
	; CHECK-NEXT: [[CMP66:%.*]] = icmp ult i8* [[ADD_PTR86]], undef			; CHECK-NEXT: [[CMP66:%.*]] = icmp ult i8* [[ADD_PTR86]], undef
	; CHECK-NEXT: br i1 [[CMP66]], label [[FOR_BODY68]], label [[SW_EPILOG]], !llvm.loop !8			; CHECK-NEXT: br i1 [[CMP66]], label [[FOR_BODY68]], label [[SW_EPILOG]], [[LOOP8:!llvm.loop !.*]]
	; CHECK: sw.epilog:			; CHECK: sw.epilog:
	; CHECK-NEXT: unreachable			; CHECK-NEXT: unreachable
	; CHECK: Exit:			; CHECK: Exit:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	br i1 %cond, label %if.then, label %Exit			br i1 %cond, label %if.then, label %Exit

	[...]

llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll

	[...]
	; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8			; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 12			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 12
	; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ADDR:%.*]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ADDR:%.*]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP11:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP17:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP18:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP19:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096			; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
	; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]			; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 4097, 4096			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 4097, 4096
	; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[LOAD:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[LOAD:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1			; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 4096			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 4096
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[LOOPEXIT]], label [[FOR_BODY]], [[LOOP2:!llvm.loop !.*]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[LOOPEXIT]], label [[FOR_BODY]], [[LOOP2:!llvm.loop !.*]]
	; CHECK: loopexit:			; CHECK: loopexit:
	; CHECK-NEXT: [[LOAD_LCSSA:%.*]] = phi i32 [ [[LOAD]], [[FOR_BODY]] ], [ [[TMP19]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[LOAD_LCSSA:%.*]] = phi i32 [ [[LOAD]], [[FOR_BODY]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: ret i32 [[LOAD_LCSSA]]			; CHECK-NEXT: ret i32 [[LOAD_LCSSA]]
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]			%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]
	%load = load i32, i32* %addr			%load = load i32, i32* %addr
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond = icmp eq i64 %iv, 4096			%exitcond = icmp eq i64 %iv, 4096
	br i1 %exitcond, label %loopexit, label %for.body			br i1 %exitcond, label %loopexit, label %for.body

	loopexit:			loopexit:
	ret i32 %load			ret i32 %load
	}			}

	define i32 @uniform_load2(i32* align(4) %addr) {			define i32 @uniform_load2(i32* align(4) %addr) {
	; CHECK-LABEL: @uniform_load2(			; CHECK-LABEL: @uniform_load2(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP36:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP37:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI2:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP38:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI2:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP39:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP11:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8			; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 12			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 12
	; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ADDR:%.*]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ADDR:%.*]], align 4
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ADDR]], align 4
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP5]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT5:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT4]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ADDR]], align 4
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT6:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT7:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT6]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <4 x i32> undef, i32 [[TMP7]], i32 0
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP5]], i32 1			; CHECK-NEXT: [[BROADCAST_SPLAT9:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT8]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP6]], i32 2			; CHECK-NEXT: [[TMP8]] = add <4 x i32> [[VEC_PHI]], [[BROADCAST_SPLAT]]
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP7]], i32 3			; CHECK-NEXT: [[TMP9]] = add <4 x i32> [[VEC_PHI1]], [[BROADCAST_SPLAT5]]
	; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP10]] = add <4 x i32> [[VEC_PHI2]], [[BROADCAST_SPLAT7]]
	; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[TMP11]] = add <4 x i32> [[VEC_PHI3]], [[BROADCAST_SPLAT9]]
	; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x i32> undef, i32 [[TMP12]], i32 0
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <4 x i32> [[TMP16]], i32 [[TMP13]], i32 1
	; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x i32> [[TMP17]], i32 [[TMP14]], i32 2
	; CHECK-NEXT: [[TMP19:%.*]] = insertelement <4 x i32> [[TMP18]], i32 [[TMP15]], i32 3
	; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP21:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP22:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP23:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP24:%.*]] = insertelement <4 x i32> undef, i32 [[TMP20]], i32 0
	; CHECK-NEXT: [[TMP25:%.*]] = insertelement <4 x i32> [[TMP24]], i32 [[TMP21]], i32 1
	; CHECK-NEXT: [[TMP26:%.*]] = insertelement <4 x i32> [[TMP25]], i32 [[TMP22]], i32 2
	; CHECK-NEXT: [[TMP27:%.*]] = insertelement <4 x i32> [[TMP26]], i32 [[TMP23]], i32 3
	; CHECK-NEXT: [[TMP28:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP29:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP30:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP31:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[TMP32:%.*]] = insertelement <4 x i32> undef, i32 [[TMP28]], i32 0
	; CHECK-NEXT: [[TMP33:%.*]] = insertelement <4 x i32> [[TMP32]], i32 [[TMP29]], i32 1
	; CHECK-NEXT: [[TMP34:%.*]] = insertelement <4 x i32> [[TMP33]], i32 [[TMP30]], i32 2
	; CHECK-NEXT: [[TMP35:%.*]] = insertelement <4 x i32> [[TMP34]], i32 [[TMP31]], i32 3
	; CHECK-NEXT: [[TMP36]] = add <4 x i32> [[VEC_PHI]], [[TMP11]]
	; CHECK-NEXT: [[TMP37]] = add <4 x i32> [[VEC_PHI1]], [[TMP19]]
	; CHECK-NEXT: [[TMP38]] = add <4 x i32> [[VEC_PHI2]], [[TMP27]]
	; CHECK-NEXT: [[TMP39]] = add <4 x i32> [[VEC_PHI3]], [[TMP35]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096			; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
	; CHECK-NEXT: br i1 [[TMP40]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP4:!llvm.loop !.*]]			; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP4:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[BIN_RDX:%.*]] = add <4 x i32> [[TMP37]], [[TMP36]]			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <4 x i32> [[TMP9]], [[TMP8]]
	; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <4 x i32> [[TMP38]], [[BIN_RDX]]			; CHECK-NEXT: [[BIN_RDX10:%.*]] = add <4 x i32> [[TMP10]], [[BIN_RDX]]
	; CHECK-NEXT: [[BIN_RDX5:%.*]] = add <4 x i32> [[TMP39]], [[BIN_RDX4]]			; CHECK-NEXT: [[BIN_RDX11:%.*]] = add <4 x i32> [[TMP11]], [[BIN_RDX10]]
	; CHECK-NEXT: [[TMP41:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX5]])			; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[BIN_RDX11]])
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 4097, 4096			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 4097, 4096
	; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[TMP41]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[ACCUM:%.*]] = phi i32 [ [[ACCUM_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[ACCUM:%.*]] = phi i32 [ [[ACCUM_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[LOAD:%.*]] = load i32, i32* [[ADDR]], align 4			; CHECK-NEXT: [[LOAD:%.*]] = load i32, i32* [[ADDR]], align 4
	; CHECK-NEXT: [[ACCUM_NEXT]] = add i32 [[ACCUM]], [[LOAD]]			; CHECK-NEXT: [[ACCUM_NEXT]] = add i32 [[ACCUM]], [[LOAD]]
	; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1			; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 4096			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV]], 4096
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[LOOPEXIT]], label [[FOR_BODY]], [[LOOP5:!llvm.loop !.*]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[LOOPEXIT]], label [[FOR_BODY]], [[LOOP5:!llvm.loop !.*]]
	; CHECK: loopexit:			; CHECK: loopexit:
	; CHECK-NEXT: [[ACCUM_NEXT_LCSSA:%.*]] = phi i32 [ [[ACCUM_NEXT]], [[FOR_BODY]] ], [ [[TMP41]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[ACCUM_NEXT_LCSSA:%.*]] = phi i32 [ [[ACCUM_NEXT]], [[FOR_BODY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: ret i32 [[ACCUM_NEXT_LCSSA]]			; CHECK-NEXT: ret i32 [[ACCUM_NEXT_LCSSA]]
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]			%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]
	%accum = phi i32 [%accum.next, %for.body], [0, %entry]			%accum = phi i32 [%accum.next, %for.body], [0, %entry]
	[...]
	; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8			; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 8
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 12			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 12
	; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10			; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10			; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10			; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10			; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP11:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP12:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP17:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP18:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: [[TMP19:%.*]] = load i32, i32* [[A]], align 4, !alias.scope !10
	; CHECK-NEXT: store i32 [[TMP4]], i32* [[B]], align 4, !alias.scope !13, !noalias !10			; CHECK-NEXT: store i32 [[TMP4]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP4]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP4]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP4]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP5]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP5]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP5]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP5]], i32* [[B]], align 4, !alias.scope !13, !noalias !10			; CHECK-NEXT: store i32 [[TMP5]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP6]], i32* [[B]], align 4, !alias.scope !13, !noalias !10			; CHECK-NEXT: store i32 [[TMP6]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP6]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP6]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP6]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP7]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP7]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
				; CHECK-NEXT: store i32 [[TMP7]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP7]], i32* [[B]], align 4, !alias.scope !13, !noalias !10			; CHECK-NEXT: store i32 [[TMP7]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP8]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP9]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP10]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP11]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP12]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP13]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP14]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP15]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP16]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP17]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP18]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: store i32 [[TMP19]], i32* [[B]], align 4, !alias.scope !13, !noalias !10
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096			; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
	; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP15:!llvm.loop !.*]]			; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP15:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 4097, 4096			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 4097, 4096
	; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ], [ 0, [[VECTOR_MEMCHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4096, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ], [ 0, [[VECTOR_MEMCHECK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	[...]

llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll

	[...]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 0, i64 [[TMP2]]			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 0, i64 [[TMP2]]
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[TMP3]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP4]] to <4 x i32>*			; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP4]] to <4 x i32>*
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP5]], align 4, !alias.scope !0			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP5]], align 4, !alias.scope !0
	; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP1]], align 4, !alias.scope !3			; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP1]], align 4, !alias.scope !3
	; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[TMP1]], align 4, !alias.scope !3			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP1]], align 4, !alias.scope !3			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[TMP1]], align 4, !alias.scope !3			; CHECK-NEXT: [[TMP7:%.*]] = add nsw <4 x i32> [[BROADCAST_SPLAT]], [[WIDE_LOAD]]
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> undef, i32 [[TMP6]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 2, i64 [[I]], i64 [[TMP2]]
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP7]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TMP8]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP8]], i32 2			; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32* [[TMP9]] to <4 x i32>*
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP9]], i32 3			; CHECK-NEXT: [[WIDE_LOAD12:%.*]] = load <4 x i32>, <4 x i32>* [[TMP10]], align 4, !alias.scope !5, !noalias !7
	; CHECK-NEXT: [[TMP14:%.*]] = add nsw <4 x i32> [[TMP13]], [[WIDE_LOAD]]			; CHECK-NEXT: [[TMP11:%.*]] = add nsw <4 x i32> [[TMP7]], [[WIDE_LOAD12]]
	; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 2, i64 [[I]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP9]] to <4 x i32>*
	; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i32, i32* [[TMP15]], i32 0			; CHECK-NEXT: store <4 x i32> [[TMP11]], <4 x i32>* [[TMP12]], align 4, !alias.scope !5, !noalias !7
	; CHECK-NEXT: [[TMP17:%.*]] = bitcast i32* [[TMP16]] to <4 x i32>*
	; CHECK-NEXT: [[WIDE_LOAD12:%.*]] = load <4 x i32>, <4 x i32>* [[TMP17]], align 4, !alias.scope !5, !noalias !7
	; CHECK-NEXT: [[TMP18:%.*]] = add nsw <4 x i32> [[TMP14]], [[WIDE_LOAD12]]
	; CHECK-NEXT: [[TMP19:%.*]] = bitcast i32* [[TMP16]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP18]], <4 x i32>* [[TMP19]], align 4, !alias.scope !5, !noalias !7
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !8			; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP8:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[Z]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[Z]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[DOTOUTER]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[DOTOUTER]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[DOTOUTER_PREHEADER]] ], [ 0, [[VECTOR_MEMCHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[DOTOUTER_PREHEADER]] ], [ 0, [[VECTOR_MEMCHECK]] ]
	; CHECK-NEXT: br label [[DOTINNER:%.*]]			; CHECK-NEXT: br label [[DOTINNER:%.*]]
	; CHECK: .exit:			; CHECK: .exit:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: .outer:			; CHECK: .outer:
	; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1			; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
	; CHECK-NEXT: [[EXITCOND_OUTER:%.*]] = icmp eq i64 [[I_NEXT]], 32			; CHECK-NEXT: [[EXITCOND_OUTER:%.*]] = icmp eq i64 [[I_NEXT]], 32
	; CHECK-NEXT: br i1 [[EXITCOND_OUTER]], label [[DOTEXIT:%.*]], label [[DOTOUTER_PREHEADER]]			; CHECK-NEXT: br i1 [[EXITCOND_OUTER]], label [[DOTEXIT:%.*]], label [[DOTOUTER_PREHEADER]]
	; CHECK: .inner:			; CHECK: .inner:
	; CHECK-NEXT: [[J:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[J_NEXT:%.*]], [[DOTINNER]] ]			; CHECK-NEXT: [[J:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[J_NEXT:%.*]], [[DOTINNER]] ]
	; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 0, i64 [[J]]			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 0, i64 [[J]]
	; CHECK-NEXT: [[TMP22:%.*]] = load i32, i32* [[TMP21]], align 4			; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[TMP14]], align 4
	; CHECK-NEXT: [[TMP23:%.*]] = load i32, i32* [[TMP1]], align 4			; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[TMP1]], align 4
	; CHECK-NEXT: [[TMP24:%.*]] = add nsw i32 [[TMP23]], [[TMP22]]			; CHECK-NEXT: [[TMP17:%.*]] = add nsw i32 [[TMP16]], [[TMP15]]
	; CHECK-NEXT: [[TMP25:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 2, i64 [[I]], i64 [[J]]			; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.s* [[OBJ]], i64 0, i32 2, i64 [[I]], i64 [[J]]
	; CHECK-NEXT: [[TMP26:%.*]] = load i32, i32* [[TMP25]], align 4			; CHECK-NEXT: [[TMP19:%.*]] = load i32, i32* [[TMP18]], align 4
	; CHECK-NEXT: [[TMP27:%.*]] = add nsw i32 [[TMP24]], [[TMP26]]			; CHECK-NEXT: [[TMP20:%.*]] = add nsw i32 [[TMP17]], [[TMP19]]
	; CHECK-NEXT: store i32 [[TMP27]], i32* [[TMP25]]			; CHECK-NEXT: store i32 [[TMP20]], i32* [[TMP18]], align 4
	; CHECK-NEXT: [[J_NEXT]] = add nuw nsw i64 [[J]], 1			; CHECK-NEXT: [[J_NEXT]] = add nuw nsw i64 [[J]], 1
	; CHECK-NEXT: [[EXITCOND_INNER:%.*]] = icmp eq i64 [[J_NEXT]], [[Z]]			; CHECK-NEXT: [[EXITCOND_INNER:%.*]] = icmp eq i64 [[J_NEXT]], [[Z]]
	; CHECK-NEXT: br i1 [[EXITCOND_INNER]], label [[DOTOUTER]], label [[DOTINNER]], !llvm.loop !10			; CHECK-NEXT: br i1 [[EXITCOND_INNER]], label [[DOTOUTER]], label [[DOTINNER]], [[LOOP10:!llvm.loop !.*]]
	;			;
	br label %.outer.preheader			br label %.outer.preheader


	.outer.preheader:			.outer.preheader:
	%i = phi i64 [ 0, %0 ], [ %i.next, %.outer ]			%i = phi i64 [ 0, %0 ], [ %i.next, %.outer ]
	%1 = getelementptr inbounds %struct.s, %struct.s* %obj, i64 0, i32 1, i64 %i			%1 = getelementptr inbounds %struct.s, %struct.s* %obj, i64 0, i32 1, i64 %i
	br label %.inner			br label %.inner
Revision Contents

Diff 305030

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/X86/cost-model-assert.ll

llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll

llvm/test/Transforms/LoopVectorize/multiple-strides-vectorization.ll