This is an archive of the discontinued LLVM Phabricator instance.

Differential D121899

[LoopVectorize] Optimise away the icmp when tail-folding for some low trip counts
Abandoned, Public

Authored by david-arm on Mar 17 2022, 3:38 AM.

Details

Reviewers

sdesmalen
kmclaughlin
frasercrmck
dmgreen
fhahn

Summary

For low trip counts the vectoriser will attempt to create a single
predicated loop that folds the scalar tail into the vector body. For
some combinations of the trip count and the VF it is possible to
determine at compile time if there will only be a single vector
iteration. If so, we can avoid creating the comparison at the end of
the loop and just always branch to the loop exit. This improves the
code quality for smaller loops with low trip counts because the
compare + branch add a relatively high cost to the loop.

This optimisation may also apply to unpredicated vector loops with
low trip counts, hence the change in test X86/pr42674.ll.
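The condition in the summary can be illustrated with a small standalone model. This is only a sketch, not LLVM code: both function names are hypothetical, `MinVF` stands in for the known minimum lane count of the chosen VF, and the loop model assumes a tail-folded loop whose vector trip count is the scalar trip count rounded up to a multiple of `UF * VF`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): true when a constant, non-zero trip
// count is known at compile time to fit in a single vector iteration.
// MinVF models the known minimum number of lanes of the chosen VF.
bool fitsInSingleVectorIteration(uint64_t TripCount, unsigned UF,
                                 unsigned MinVF) {
  return TripCount != 0 && TripCount <= uint64_t(UF) * MinVF;
}

// Counts the iterations of a modelled tail-folded vector loop that steps by
// UF * VF and exits once the index equals the rounded-up trip count.
// Intended for small illustrative values only.
uint64_t countVectorIterations(uint64_t TripCount, unsigned UF, unsigned VF) {
  assert(TripCount > 0 && UF > 0 && VF > 0 && "model needs non-zero inputs");
  const uint64_t Step = uint64_t(UF) * VF;
  // Assumption: with tail folding the vector trip count is TripCount rounded
  // up to the next multiple of Step.
  const uint64_t VectorTripCount = (TripCount + Step - 1) / Step * Step;
  uint64_t Iterations = 0;
  uint64_t Index = 0;
  do {
    Index += Step;
    ++Iterations;
  } while (Index != VectorTripCount); // the latch compare the patch removes
  return Iterations;
}
```

Whenever the predicate holds, the modelled loop runs exactly once, so the latch compare always evaluates to true and the branch can go straight to the exit.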

Diff Detail

Unit Tests: Failed

	Time	Test
	60,180 ms	x64 debian > AddressSanitizer-x86_64-linux-dynamic.TestCases::scariness_score_test.cpp
	60,090 ms	x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp

Event Timeline

david-arm created this revision. (Mar 17 2022, 3:38 AM)

Herald added a project: Restricted Project. (Mar 17 2022, 3:38 AM)

Herald added subscribers: pengfei, rogfer01, hiraditya.

david-arm requested review of this revision. (Mar 17 2022, 3:38 AM)

Herald added a project: Restricted Project. (Mar 17 2022, 3:38 AM)

Herald added subscribers: llvm-commits, vkmr.

david-arm added a parent revision: D121595: [LoopVectorize] Permit tail-folding for low trip counts using scalable vectors. (Mar 17 2022, 3:38 AM)

Harbormaster completed remote builds in B154805: Diff 416124. (Mar 17 2022, 3:38 AM)

david-arm added a reviewer: fhahn. (Mar 17 2022, 3:38 AM)

Thanks for this patch Dave, I've left a few comments.

Perhaps it's worth highlighting in the commit message that this is more of a problem for scalable vectors, where the branch/compare isn't constant folded or instcombined away as easily as it is for fixed-width vectors.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8712	Initially I wanted to suggest creating a new (unconditional) branch instruction here, but at this point you don't know which VF will be chosen for the given VPlan, so we have to defer this decision until codegen.
8715	There's nothing wrong with it, but perhaps it's a bit unfortunate this has to be passed into the VPInstruction as an operand. It seems more like a bit of useful knowledge of the loop that the VPInstruction could use to optimise the branch, rather than an actual operand for the conceptual BranchOnCount intrinsic. If the trip-count of the loop is constant, maybe we can store that information as a piece of state somewhere. @fhahn what would be the desired place to add such information about the loop? My understanding was that the uses of the InnerLoopVectorizer in VPlan were being phased out, is that right?
llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	nit: How about
    Value *ConstCmp = nullptr;
    if (auto *C = dyn_cast<ConstantInt>(TC))
      if (C->getZExtValue() <= State.UF * State.VF.getKnownMinValue())
        ConstCmp = Builder.getInt1(true);
    Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);
986–987	I noticed some of this code here has been removed and I'm not sure if this change is then still required. In any case this patch needs a rebase.
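The remark about scalable vectors can be made concrete with a small model. For a scalable VF both operands of the latch compare depend on `vscale`, which is only known at run time, so later passes cannot fold the compare the way they can for a fixed-width VF. The sketch below is hypothetical (the function name and the `MaxVScale` bound are assumptions, and it reuses the rounded-up vector trip count of a tail-folded loop). It checks by brute force that the compare is nevertheless true after the first iteration for every `vscale` once the trip count fits in `UF * MinVF`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not LLVM code: returns true if the latch comparison
// "IndexNext == VectorTripCount" of a tail-folded loop holds after the first
// iteration for every vscale in [1, MaxVScale].
bool exitsAfterFirstIterationForAllVScale(uint64_t TripCount, unsigned UF,
                                          unsigned MinVF, unsigned MaxVScale) {
  assert(TripCount > 0 && UF > 0 && MinVF > 0 && MaxVScale > 0);
  for (unsigned VScale = 1; VScale <= MaxVScale; ++VScale) {
    // The step is a run-time value: UF * (vscale * MinVF).
    const uint64_t Step = uint64_t(UF) * MinVF * VScale;
    // Assumption: the vector trip count is TripCount rounded up to Step.
    const uint64_t VectorTripCount = (TripCount + Step - 1) / Step * Step;
    const uint64_t IndexNext = 0 + Step; // index after the first iteration
    if (IndexNext != VectorTripCount)
      return false;
  }
  return true;
}
```

Because the step only grows with `vscale`, checking the trip count against the known minimum is sufficient; no run-time value has to be inspected.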

david-arm added inline comments. (Apr 27 2022, 8:49 AM)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8715	Yeah I'm not terribly happy about adding this as an operand, but I wasn't sure how else to pass this. I could possibly add a constant trip count to the VPTransformState, which is available when generating the instruction?
llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	It would have to be something like this I think:
    Value *ConstCmp = nullptr;
    if (auto *C = dyn_cast<ConstantInt>(TC)) {
      uint64_t TCVal = C->getZExtValue();
      if (TCVal && TCVal <= State.UF * State.VF.getKnownMinValue())
        ConstCmp = Builder.getInt1(true);
    }
    Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);
because I'm pretty sure I found a test where the trip count was actually defined as a zero constant. I'm happy to change it to the code above, but not sure it looks much better to be honest. :)
986–987	Yeah I noticed that too. Will fix in a new patch!

sdesmalen added inline comments. (Apr 27 2022, 8:59 AM)

llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	If the trip-count is zero, wouldn't `true` be the correct value for the condition?

fhahn added inline comments. (Apr 27 2022, 2:05 PM)

llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	Hmm, I am not sure if doing this late in codegen is the right place. If we simplify the condition and effectively remove the loop, it would be good to remove the branch-on-count recipe directly in VPlan. I think we should have almost everything needed to do that already in place. Let me check.

david-arm added inline comments. (May 10 2022, 5:30 AM)

llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	Hi @fhahn, have you had a chance to look into this at all?

fhahn added inline comments. (May 16 2022, 12:43 AM)

llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	Unfortunately not yet, I've still got a backlog of other things to work through :(. Hopefully I'll be able to check this week, otherwise I think we should proceed with this patch next week.

david-arm added inline comments. (May 20 2022, 7:10 AM)

llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	So the problem with treating zero-trip counts the same as non-zero trip counts is that the loop vectoriser considers loops like this as having a zero trip count:

    entry:
      br label %loop.body

    loop.body:
      %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.body ]
      %a = extractvalue { i64, i64 } %sv, 0
      %b = extractvalue { i64, i64 } %sv, 1
      %addr = getelementptr i64, i64* %dst, i32 %iv
      %add = add i64 %a, %b
      store i64 %add, i64* %addr
      %iv.next = add nsw i32 %iv, 1
      %cond = icmp ne i32 %iv.next, 0
      br i1 %cond, label %loop.body, label %exit

    exit:
      ret void

even though this loop is actually going to execute UINT32_MAX times before finally the IV overflows back to 0! This came from a real test:

    Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

If I fold away the comparison and always branch out of the loop then I've changed the behaviour of the original scalar loop. So I think we either have to:

1. Fix the vectoriser to report the trip count as UINT32_MAX for loops like this, or
2. Only apply my optimisation for low trip counts > 0
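The wrap-around described in this comment can be reproduced with a small standalone sketch. It is an illustration only: it uses an 8-bit unsigned counter instead of i32 so the wrap is reached quickly, unsigned arithmetic so the overflow is well defined in C++, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Runs a do-while loop shaped like the IR in the comment, with an 8-bit
// counter: the loop exits once the incremented counter compares equal to 0.
unsigned countIterationsUntilWrap() {
  unsigned Iterations = 0;
  uint8_t IV = 0;
  do {
    ++Iterations;                      // loop body executes
    IV = static_cast<uint8_t>(IV + 1); // wraps from 255 back to 0
  } while (IV != 0);
  return Iterations;
}

// The iteration count does not fit in the counter's own 8-bit type, so
// expressing it in that type truncates it to 0.
uint8_t tripCountInCounterType() {
  return static_cast<uint8_t>(countIterationsUntilWrap());
}
```

In this model the body runs 256 times, yet the count expressed in the counter's type is 0. That is why a constant trip count of zero cannot safely be read as "the loop never iterates", and why the patch guards the fold with a non-zero check.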

I've removed the need for a separate TripCount VPValue in the VPlan class because we're always going to need the original scalar trip count, and the value is the same for each Part anyway. Now we just store a copy in the VPTransformState so that the execute() functions can access it.

david-arm marked 7 inline comments as done. (May 23 2022, 1:36 AM)

Harbormaster completed remote builds in B165790: Diff 431302. (May 23 2022, 2:14 AM)

fhahn mentioned this in D126680: [VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF. (May 30 2022, 2:41 PM)

fhahn added inline comments. (May 30 2022, 2:43 PM)

llvm/lib/Transforms/Vectorize/VPlan.cpp
796–806	Took a bit longer, but D126680 is what simplification directly on VPlan would look like. It depends on a few upcoming improvements though.

Herald added a subscriber: jsji. (May 30 2022, 2:43 PM)

fhahn mentioned this in rGeaf48dd9b079: [VPlan] Replace BranchOnCount with BranchOnCond if TC <= UF * VF. (Jun 6 2022, 1:39 AM)

Already fixed by an alternative implementation - D126680

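Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	3 lines
llvm/lib/Transforms/Vectorize/VPlan.h	20 lines
llvm/lib/Transforms/Vectorize/VPlan.cpp	23 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll	6 lines
llvm/test/Transforms/LoopVectorize/X86/constant-fold.ll	3 lines
llvm/test/Transforms/LoopVectorize/X86/outer_loop_test1_no_explicit_vect_width.ll	4 lines
llvm/test/Transforms/LoopVectorize/X86/pr34438.ll	3 lines
llvm/test/Transforms/LoopVectorize/X86/pr42674.ll	20 lines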

Diff 431302

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ 8,119 lines hidden @@ VPBasicBlock *HeaderVPBB =
         Plan->getVectorLoopRegion()->getEntryBasicBlock();
     auto NewInsertionPoint = HeaderVPBB->getFirstNonPhi();
     auto *IV = new VPWidenCanonicalIVRecipe(Plan->getCanonicalIV());
     HeaderVPBB->insert(IV, HeaderVPBB->getFirstNonPhi());

     VPBuilder::InsertPointGuard Guard(Builder);
     Builder.setInsertPoint(HeaderVPBB, NewInsertionPoint);
     if (CM.TTI.emitGetActiveLaneMask()) {
-      VPValue *TC = Plan->getOrCreateTripCount();
-      BlockMask = Builder.createNaryOp(VPInstruction::ActiveLaneMask, {IV, TC});
+      BlockMask = Builder.createNaryOp(VPInstruction::ActiveLaneMask, {IV});
     } else {
       VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();
       BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
     }
     return BlockMaskCache[BB] = BlockMask;
   }

   // This is the block mask. We OR all incoming edges.
@@ 567 lines hidden @@ auto *CanonicalIVIncrement =
                         {CanonicalIVPHI}, DL);
   CanonicalIVPHI->addOperand(CanonicalIVIncrement);

   VPBasicBlock *EB = TopRegion->getExitBasicBlock();
   if (IsVPlanNative)
     EB->setCondBit(nullptr);
   EB->appendRecipe(CanonicalIVIncrement);

   auto *BranchOnCount =
       new VPInstruction(VPInstruction::BranchOnCount,
                         {CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
   EB->appendRecipe(BranchOnCount);
 }

 VPlanPtr LoopVectorizationPlanner::buildVPlanWithVPRecipes(
     VFRange &Range, SmallPtrSetImpl<Instruction *> &DeadInstructions,
     const MapVector<Instruction *, Instruction *> &SinkAfter) {

   SmallPtrSet<const InterleaveGroup<Instruction> *, 1> InterleaveGroups;

@@ 2,117 lines hidden @@

llvm/lib/Transforms/Vectorize/VPlan.h

@@ 228 lines hidden @@ struct VPTransformState {
   /// method will delegate the call to ILV in such cases in order to provide
   /// callers a consistent API.
   /// \see set.
   Value *get(VPValue *Def, unsigned Part);

   /// Get the generated Value for a given VPValue and given Part and Lane.
   Value *get(VPValue *Def, const VPIteration &Instance);

+  void setTripCount(Value *V) { TripCount = V; }
+
+  Value *getTripCount() const { return TripCount; }
+
   bool hasVectorValue(VPValue *Def, unsigned Part) {
     auto I = Data.PerPartOutput.find(Def);
     return I != Data.PerPartOutput.end() && Part < I->second.size() &&
            I->second[Part];
   }

   bool hasAnyVectorValue(VPValue *Def) const {
     return Data.PerPartOutput.find(Def) != Data.PerPartOutput.end();
@@ 90 lines hidden @@ struct VPTransformState {
   /// Hold a reference to the IRBuilder used to generate output IR code.
   IRBuilderBase &Builder;

   VPValue2ValueTy VPValue2Value;

   /// Hold the canonical scalar IV of the vector loop (start=0, step=VF*UF).
   Value *CanonicalIV = nullptr;

+  /// Hold the original scalar trip count.
+  Value *TripCount = nullptr;
+
   /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.
   InnerLoopVectorizer *ILV;

   /// Pointer to the VPlan code is generated for.
   VPlan *Plan;

   /// Holds recipes that may generate a poison value that is used after
   /// vectorization, even when their operands are not poison.
@@ 2,101 lines hidden @@ class VPlan {

   /// Holds the name of the VPlan, for printing.
   std::string Name;

   /// Holds all the external definitions created for this VPlan. External
   /// definitions must be immutable and hold a pointer to their underlying IR.
   DenseMap<Value *, VPValue *> VPExternalDefs;

-  /// Represents the trip count of the original loop, for folding
-  /// the tail.
-  VPValue *TripCount = nullptr;
-
   /// Represents the backedge taken count of the original loop, for folding
   /// the tail. It equals TripCount - 1.
   VPValue *BackedgeTakenCount = nullptr;

   /// Represents the vector trip count.
   VPValue VectorTripCount;

   /// Holds a mapping between Values and their corresponding VPValue inside
@@ 22 lines hidden @@ if (Entry) {
       VPValue DummyValue;
       for (VPBlockBase *Block : depth_first(Entry))
         Block->dropAllReferences(&DummyValue);

       VPBlockBase::deleteCFG(Entry);
     }
     for (VPValue *VPV : VPValuesToFree)
       delete VPV;
-    if (TripCount)
-      delete TripCount;
     if (BackedgeTakenCount)
       delete BackedgeTakenCount;
     for (auto &P : VPExternalDefs)
       delete P.second;
   }

   /// Prepare the plan for execution, setting up the required live-in values.
   void prepareToExecute(Value *TripCount, Value *VectorTripCount,
                         Value *CanonicalIVStartValue, VPTransformState &State);

   /// Generate the IR code for this VPlan.
   void execute(struct VPTransformState *State);

   VPBlockBase *getEntry() { return Entry; }
   const VPBlockBase *getEntry() const { return Entry; }

   VPBlockBase *setEntry(VPBlockBase *Block) {
     Entry = Block;
     Block->setPlan(this);
     return Entry;
   }

-  /// The trip count of the original loop.
-  VPValue *getOrCreateTripCount() {
-    if (!TripCount)
-      TripCount = new VPValue();
-    return TripCount;
-  }
-
   /// The backedge taken count of the original loop.
   VPValue *getOrCreateBackedgeTakenCount() {
     if (!BackedgeTakenCount)
       BackedgeTakenCount = new VPValue();
     return BackedgeTakenCount;
   }

   /// The vector trip count.
@@ 500 lines hidden @@

llvm/lib/Transforms/Vectorize/VPlan.cpp

@@ 726 lines hidden @@ case Instruction::Select: {
     Value *V = Builder.CreateSelect(Cond, Op1, Op2);
     State.set(this, V, Part);
     break;
   }
   case VPInstruction::ActiveLaneMask: {
     // Get first lane of vector induction variable.
     Value *VIVElem0 = State.get(getOperand(0), VPIteration(Part, 0));
     // Get the original loop tripcount.
-    Value *ScalarTC = State.get(getOperand(1), Part);
+    Value *ScalarTC = State.getTripCount();

     auto *Int1Ty = Type::getInt1Ty(Builder.getContext());
     auto *PredTy = VectorType::get(Int1Ty, State.VF);
     Instruction *Call = Builder.CreateIntrinsic(
         Intrinsic::get_active_lane_mask, {PredTy, ScalarTC->getType()},
         {VIVElem0, ScalarTC}, nullptr, "active.lane.mask");
     State.set(this, Call, Part);
     break;
@@ 39 lines hidden @@ case VPInstruction::CanonicalIVIncrementNUW: {
     }

     State.set(this, Next, Part);
     break;
   }
   case VPInstruction::BranchOnCount: {
     if (Part != 0)
       break;
-    // First create the compare.
+    // First create the compare if necessary.
     Value *IV = State.get(getOperand(0), Part);
-    Value *TC = State.get(getOperand(1), Part);
-    Value *Cond = Builder.CreateICmpEQ(IV, TC);
+    Value *VTC = State.get(getOperand(1), Part);
+    Value *TC = State.getTripCount();
+
+    Value *ConstCmp = nullptr;
+    // When we know there will only be one vector iteration there is no need to
+    // create the comparison, since we already know the answer.
+    if (auto *C = dyn_cast<ConstantInt>(TC)) {
+      uint64_t TCVal = C->getZExtValue();
+      if (TCVal && TCVal <= State.UF * State.VF.getKnownMinValue())
+        ConstCmp = Builder.getInt1(true);
+    }
+    Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);

     // Now create the branch.
     auto *Plan = getParent()->getPlan();
     VPRegionBlock *TopRegion = Plan->getVectorLoopRegion();
     VPBasicBlock *Header = TopRegion->getEntry()->getEntryBasicBlock();
     if (Header->empty()) {
       assert(EnableVPlanNativePath &&
              "empty entry block only expected in VPlanNativePath");
       Header = cast<VPBasicBlock>(Header->getSingleSuccessor());
     }
@@ 88 lines hidden @@ assert((Opcode == Instruction::FAdd || Opcode == Instruction::FMul ||
          "this op can't take fast-math flags");
   FMF = FMFNew;
 }

 void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
                              Value *CanonicalIVStartValue,
                              VPTransformState &State) {
   // Check if the trip count is needed, and if so build it.
-  if (TripCount && TripCount->getNumUsers()) {
-    for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part)
-      State.set(TripCount, TripCountV, Part);
-  }
+  State.setTripCount(TripCountV);

   // Check if the backedge taken count is needed, and if so build it.
   if (BackedgeTakenCount && BackedgeTakenCount->getNumUsers()) {
     IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
     auto *TCMO = Builder.CreateSub(TripCountV,
                                    ConstantInt::get(TripCountV->getType(), 1),
                                    "trip.count.minus.1");
     auto VF = State.VF;
@@ 58 lines hidden @@ for (auto VPBB : State->CFG.VPBBsToFix) {
     auto *BBTerminator = BB->getTerminator();

     for (VPBlockBase *SuccVPBlock : VPBB->getHierarchicalSuccessors()) {
       VPBasicBlock *SuccVPBB = SuccVPBlock->getEntryBasicBlock();
       BBTerminator->setSuccessor(Idx, State->CFG.VPBB2IRBB[SuccVPBB]);
       ++Idx;
     }
   }

   VPBasicBlock *LatchVPBB = getVectorLoopRegion()->getExitBasicBlock();
   BasicBlock *VectorLatchBB = State->CFG.VPBB2IRBB[LatchVPBB];

   // Fix the latch value of canonical, reduction and first-order recurrences
   // phis in the vector loop.
   VPBasicBlock *Header = getVectorLoopRegion()->getEntryBasicBlock();
   for (VPRecipeBase &R : Header->phis()) {
     // Skip phi-like recipes that generate their backedege values themselves.
     if (isa<VPWidenPHIRecipe>(&R))
@@ 799 lines hidden @@

llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll

	Show All 38 Lines
	define void @trip5_i8(i8* noalias nocapture noundef %dst, i8* noalias nocapture noundef readonly %src) #0 {			define void @trip5_i8(i8* noalias nocapture noundef %dst, i8* noalias nocapture noundef readonly %src) #0 {
	; CHECK-LABEL: @trip5_i8(			; CHECK-LABEL: @trip5_i8(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.]], %vector.body ]			; CHECK-NEXT: [[INDEX:%.]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.]], %vector.body ]
	; CHECK: [[ACTIVE_LANE_MASK:%.]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 {{%.}}, i64 5)			; CHECK: [[ACTIVE_LANE_MASK:%.]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 {{%.}}, i64 5)
	; CHECK: {{%.}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)			; CHECK: {{%.}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
	; CHECK: {{%.}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)			; CHECK: {{%.}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
	; CHECK: call void @llvm.masked.store.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.}}, <vscale x 16 x i8> {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])			; CHECK: call void @llvm.masked.store.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.}}, <vscale x 16 x i8> {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
	; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()			; CHECK: br i1 true, label %middle.block, label %vector.body
	; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 16
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]
	; CHECK-NEXT: [[COND:%.]] = icmp eq i64 [[INDEX_NEXT]], {{%.}}
	; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]			%i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
	%arrayidx = getelementptr inbounds i8, i8* %src, i64 %i.08			%arrayidx = getelementptr inbounds i8, i8* %src, i64 %i.08
	%0 = load i8, i8* %arrayidx, align 1			%0 = load i8, i8* %arrayidx, align 1

llvm/test/Transforms/LoopVectorize/X86/constant-fold.ll

	; CHECK-NEXT: [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16			; CHECK-NEXT: [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
	; CHECK-NEXT: [[TMP0:%.*]] = add i16 [[OFFSET_IDX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i16 [[OFFSET_IDX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = sext i16 [[TMP0]] to i64			; CHECK-NEXT: [[TMP1:%.*]] = sext i16 [[TMP0]] to i64
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr [2 x i16*], [2 x i16*]* @b, i16 0, i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr [2 x i16*], [2 x i16*]* @b, i16 0, i64 [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i16*, i16** [[TMP2]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i16*, i16** [[TMP2]], i32 0
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16** [[TMP3]] to <2 x i16*>*			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16** [[TMP3]] to <2 x i16*>*
	; CHECK-NEXT: store <2 x i16*> <i16* getelementptr inbounds ([1 x %rec8], [1 x %rec8]* @a, i32 0, i32 0, i32 0), i16* getelementptr inbounds ([1 x %rec8], [1 x %rec8]* @a, i32 0, i32 0, i32 0)>, <2 x i16*>* [[TMP4]], align 8			; CHECK-NEXT: store <2 x i16*> <i16* getelementptr inbounds ([1 x %rec8], [1 x %rec8]* @a, i32 0, i32 0, i32 0), i16* getelementptr inbounds ([1 x %rec8], [1 x %rec8]* @a, i32 0, i32 0, i32 0)>, <2 x i16*>* [[TMP4]], align 8
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
	; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], 2			; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]
	; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 2, 2			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 2, 2
	; CHECK-NEXT: br i1 [[CMP_N]], label [[BB3:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[BB3:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i16 [ 2, [[MIDDLE_BLOCK]] ], [ 0, [[BB1:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i16 [ 2, [[MIDDLE_BLOCK]] ], [ 0, [[BB1:%.*]] ]
	; CHECK-NEXT: br label [[BB2:%.*]]			; CHECK-NEXT: br label [[BB2:%.*]]
	; CHECK: bb2:			; CHECK: bb2:
	; CHECK-NEXT: [[C_1_0:%.*]] = phi i16 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[_TMP9:%.*]], [[BB2]] ]			; CHECK-NEXT: [[C_1_0:%.*]] = phi i16 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[_TMP9:%.*]], [[BB2]] ]

llvm/test/Transforms/LoopVectorize/X86/outer_loop_test1_no_explicit_vect_width.ll

	; AVX: call void @llvm.masked.scatter.v8i32.v8p0i32(<8 x i32> %[[StoreVal]], <8 x i32*> %[[AAddr2]], i32 4, <8 x i1> <i1 true, i1 true, i1 true			; AVX: call void @llvm.masked.scatter.v8i32.v8p0i32(<8 x i32> %[[StoreVal]], <8 x i32*> %[[AAddr2]], i32 4, <8 x i1> <i1 true, i1 true, i1 true
	; AVX: %[[InnerPhiNext]] = add nuw nsw <8 x i64> %[[InnerPhi]], <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>			; AVX: %[[InnerPhiNext]] = add nuw nsw <8 x i64> %[[InnerPhi]], <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
	; AVX: %[[VecCond:.*]] = icmp eq <8 x i64> %[[InnerPhiNext]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>			; AVX: %[[VecCond:.*]] = icmp eq <8 x i64> %[[InnerPhiNext]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
	; AVX: %[[InnerCond:.*]] = extractelement <8 x i1> %[[VecCond]], i32 0			; AVX: %[[InnerCond:.*]] = extractelement <8 x i1> %[[VecCond]], i32 0
	; AVX: br i1 %[[InnerCond]], label %[[ForInc]], label %[[InnerLoop]]			; AVX: br i1 %[[InnerCond]], label %[[ForInc]], label %[[InnerLoop]]

	; AVX: [[ForInc]]:			; AVX: [[ForInc]]:
	; AVX: %[[IndNext]] = add nuw i64 %[[Ind]], 8			; AVX: %[[IndNext]] = add nuw i64 %[[Ind]], 8
	; AVX: %[[VecIndNext]] = add <8 x i64> %[[VecInd]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>			; AVX: br i1 true, label %middle.block, label %vector.body
	; AVX: %[[Cmp:.*]] = icmp eq i64 %[[IndNext]], 8
	; AVX: br i1 %[[Cmp]], label %middle.block, label %vector.body

	@arr2 = external global [8 x i32], align 16			@arr2 = external global [8 x i32], align 16
	@arr = external global [8 x [8 x i32]], align 16			@arr = external global [8 x [8 x i32]], align 16

	; Function Attrs: norecurse nounwind uwtable			; Function Attrs: norecurse nounwind uwtable
	define void @foo(i32 %n) {			define void @foo(i32 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

llvm/test/Transforms/LoopVectorize/X86/pr34438.ll

	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[A:%.*]], i64 [[TMP0]]			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[A:%.*]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP4]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP4]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[TMP5]] to <8 x float>*			; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[TMP5]] to <8 x float>*
	; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4, !llvm.access.group !0			; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4, !llvm.access.group !0
	; CHECK-NEXT: [[TMP7:%.*]] = fadd fast <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]			; CHECK-NEXT: [[TMP7:%.*]] = fadd fast <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]
	; CHECK-NEXT: [[TMP8:%.*]] = bitcast float* [[TMP5]] to <8 x float>*			; CHECK-NEXT: [[TMP8:%.*]] = bitcast float* [[TMP5]] to <8 x float>*
	; CHECK-NEXT: store <8 x float> [[TMP7]], <8 x float>* [[TMP8]], align 4, !llvm.access.group !0			; CHECK-NEXT: store <8 x float> [[TMP7]], <8 x float>* [[TMP8]], align 4, !llvm.access.group !0
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
	; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], 8			; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP1:!llvm.loop !.*]]
	; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP1:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 8, 8			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 8, 8
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 8, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 8, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]

llvm/test/Transforms/LoopVectorize/X86/pr42674.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt %s -loop-vectorize -instcombine -simplifycfg -simplifycfg-require-and-preserve-domtree=1 -mtriple=x86_64-unknown-linux-gnu -mattr=avx512vl,avx512dq,avx512bw -S | FileCheck %s			; RUN: opt %s -loop-vectorize -instcombine -simplifycfg -simplifycfg-require-and-preserve-domtree=1 -mtriple=x86_64-unknown-linux-gnu -mattr=avx512vl,avx512dq,avx512bw -S | FileCheck %s

	@bytes = global [128 x i8] zeroinitializer, align 16			@bytes = global [128 x i8] zeroinitializer, align 16

	; Make sure we end up with vector code for this loop. We used to try to create			; Make sure we end up with vector code for this loop. We used to try to create
	; a VF=64,UF=4 loop, but the scalar trip count is only 128 so			; a VF=64,UF=4 loop, but the scalar trip count is only 128 so
	; the vector loop was dead code leaving only a scalar remainder.			; the vector loop was dead code leaving only a scalar remainder.
	define zeroext i8 @sum() {			define zeroext i8 @sum() {
	; CHECK-LABEL: @sum(			; CHECK-LABEL: @sum(
	; CHECK-NEXT: iter.check:			; CHECK-NEXT: iter.check:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds [128 x i8], [128 x i8]* @bytes, i64 0, i64 0
	; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <64 x i8> [ zeroinitializer, [[ENTRY]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <64 x i8> [ zeroinitializer, [[ENTRY]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds [128 x i8], [128 x i8]* @bytes, i64 0, i64 [[INDEX]]
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <64 x i8>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <64 x i8>*
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <64 x i8>, <64 x i8>* [[TMP1]], align 16			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <64 x i8>, <64 x i8>* [[TMP1]], align 16
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[TMP0]], i64 64			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[TMP0]], i64 64
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <64 x i8>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <64 x i8>*
	; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <64 x i8>, <64 x i8>* [[TMP3]], align 16			; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <64 x i8>, <64 x i8>* [[TMP3]], align 16
	; CHECK-NEXT: [[TMP4]] = add <64 x i8> [[WIDE_LOAD]], [[VEC_PHI]]			; CHECK-NEXT: [[TMP4:%.*]] = add <64 x i8> [[WIDE_LOAD]], zeroinitializer
	; CHECK-NEXT: [[TMP5]] = add <64 x i8> [[WIDE_LOAD2]], [[VEC_PHI1]]			; CHECK-NEXT: [[TMP5:%.*]] = add <64 x i8> [[WIDE_LOAD2]], zeroinitializer
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 128			; CHECK-NEXT: [[INDEX_NEXT:%.*]] = add nuw i64 0, 128
	; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX]], 0
	; CHECK-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
	; CHECK: middle.block:
	; CHECK-NEXT: [[BIN_RDX:%.*]] = add <64 x i8> [[TMP5]], [[TMP4]]			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <64 x i8> [[TMP5]], [[TMP4]]
	; CHECK-NEXT: [[TMP7:%.*]] = call i8 @llvm.vector.reduce.add.v64i8(<64 x i8> [[BIN_RDX]])			; CHECK-NEXT: [[TMP6:%.*]] = call i8 @llvm.vector.reduce.add.v64i8(<64 x i8> [[BIN_RDX]])
	; CHECK-NEXT: ret i8 [[TMP7]]			; CHECK-NEXT: ret i8 [[TMP6]]
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %for.body, %entry			for.body: ; preds = %for.body, %entry
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
	%r.010 = phi i8 [ 0, %entry ], [ %add, %for.body ]			%r.010 = phi i8 [ 0, %entry ], [ %add, %for.body ]
	%arrayidx = getelementptr inbounds [128 x i8], [128 x i8]* @bytes, i64 0, i64 %indvars.iv			%arrayidx = getelementptr inbounds [128 x i8], [128 x i8]* @bytes, i64 0, i64 %indvars.iv
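
The test changes above all replace a latch `icmp` + `br` with `br i1 true` when the induction step already equals the vector trip count. The arithmetic behind that can be sketched as below. This is a minimal illustration only, not the code from this revision: `vectorTripCount` and `isSingleVectorIteration` are hypothetical helper names, `Step` stands for VF * UF, and overflow is ignored.

```cpp
#include <cstdint>

// Hypothetical helper (not from the patch): the value the vector loop's
// induction variable reaches before control moves to middle.block.
// Step is VF * UF, the amount added to the index on each vector iteration.
std::uint64_t vectorTripCount(std::uint64_t TripCount, std::uint64_t Step,
                              bool FoldTail) {
  if (FoldTail)
    // Tail folding rounds up: the final iteration runs with some lanes masked.
    return (TripCount + Step - 1) / Step * Step;
  // Without tail folding, round down; the remainder runs in the scalar loop.
  return TripCount - TripCount % Step;
}

// Hypothetical helper (not from the patch): true when the vector body runs
// exactly once, so the latch compare is known at compile time.
bool isSingleVectorIteration(std::uint64_t TripCount, std::uint64_t Step,
                             bool FoldTail) {
  // Guard against a zero step before dividing in vectorTripCount.
  return Step != 0 && vectorTripCount(TripCount, Step, FoldTail) == Step;
}
```

Under these assumptions, the pr42674.ll loop (scalar trip count 128, two `<64 x i8>` loads per iteration, so a step of 128) satisfies the check, whereas the VF=64, UF=4 plan mentioned in its comment gives a step of 256 and a vector trip count of 0, which is why that vector loop was dead. For the tail-folded SVE test, the lane mask is built against a trip count of 5 and the step is `vscale * 16`, so any `vscale >= 1` yields a single masked iteration.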

Revision Contents

Diff 431302
