Details

Reviewers

hsaito
Ayal
fhahn
reames
silvas
dcaballe
SjoerdMeijer
mkuper

Summary

Currently we refuse to vectorize loops that need an epilog if the trip count is less than 16 (as defined by TinyTripCountVectorThreshold) and the loop is cold according to the profile. Thus, in the absence of profile summary information, we never vectorize such loops even if the cost model can prove profitability. This change tries to improve the situation by using block execution count information (if available) to judge whether the loop is hot relative to the function entry.
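As a rough illustration of the "hot relative to function entry" idea, here is a minimal standalone sketch. It is not the patch's code: the helper name `isLocallyHot` and the use of `std::optional` are assumptions made for the example, and the default of 500 mirrors the `local-hotness-threshold` option introduced in the diff.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of LLVM): a block is treated as "locally hot"
// when its execution count exceeds the function entry count by more than a
// factor of `threshold`. A missing count means "unknown", so we answer no.
bool isLocallyHot(std::optional<uint64_t> blockCount,
                  std::optional<uint64_t> entryCount,
                  uint64_t threshold = 500) {
  if (!blockCount || !entryCount)
    return false;
  // The product can overflow for very large counts; a production version
  // would likely want a saturating multiply here.
  return *blockCount > *entryCount * threshold;
}
```

With an entry count of 1, a loop header executed 501 times would qualify and one executed 500 times would not, since the comparison is strict.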

Please note that I moved the related code to getScalarEpilogueLowering. That not only improves maintainability but also fixes a potential issue where the short trip count heuristic overrides the decision made by getScalarEpilogueLowering.

Ideally we should use loop size and the expected performance gain from the cost model, in addition to loop hotness, to make the decision, but that would be a much bigger change and can be a follow-up step.

Note that this change depends on https://reviews.llvm.org/D67690

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 41232
Build 41417: arc lint + arc unit

Event Timeline

ebrevnov created this revision. Sep 20 2019, 1:09 AM

Herald added a project: Restricted Project. Sep 20 2019, 1:09 AM

Herald added subscribers: llvm-commits, rkruppe, zzheng, hiraditya.

Harbormaster completed remote builds in B38329: Diff 220961. Sep 20 2019, 1:14 AM

Minor update

Harbormaster completed remote builds in B38330: Diff 220962. Sep 20 2019, 1:23 AM

Minor changes in tests.

ebrevnov retitled this revision from [LV] Allow vectorization of hot short trip count loops with epilog. to [LV] Allow vectorization of hot short trip count loops with epilog. Sep 20 2019, 1:42 AM

ebrevnov edited the summary of this revision.

Harbormaster completed remote builds in B38345: Diff 220978. Sep 20 2019, 1:42 AM

ebrevnov edited the summary of this revision. Sep 20 2019, 1:47 AM

ebrevnov marked 2 inline comments as done.

ebrevnov added inline comments.

llvm/include/llvm/Transforms/Utils/SizeOpts.h
25 ↗	(On Diff #220978)	This change is motivated by use case in getScalarEpilogueLowering.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
942	just to be consistent with the next line :-)

ebrevnov added reviewers: hsaito, Ayal, fhahn, reames. Sep 20 2019, 1:49 AM

ebrevnov added a parent revision: D67690: [LV][NFC] Factor out calculation of "best" estimated trip count. Sep 20 2019, 1:55 AM

rscottmanley added a subscriber: rscottmanley. Sep 20 2019, 6:26 AM

Test update

Harbormaster completed remote builds in B38655: Diff 222118. Sep 27 2019, 3:19 AM

Minor test update

ping

ebrevnov added reviewers: silvas, dcaballe, SjoerdMeijer. Oct 24 2019, 4:17 AM

ebrevnov added a reviewer: mkuper. Oct 24 2019, 11:11 PM

The heuristic looks pretty reasonable to me, but I'd really like someone more familiar with how all the existing heuristics interact to give the actual LGTM.

llvm/include/llvm/Transforms/Utils/SizeOpts.h
23 ↗	(On Diff #222120)	Please update comments to indicate what None value means. Your verbal explanation was clear, just make sure the code reflects that. After that, please separate and land the refactoring change involving these two methods. Then rebase.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7440	Explicit type this please for readability.

ebrevnov marked an inline comment as done. Nov 20 2019, 12:23 AM

ebrevnov added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7464	Some rationale for the chosen heuristic. In general, if nobody actually asked to optimize for size, it seems reasonable to rely on the cost model to decide whether vectorization is profitable, even for short trip count loops. If we still want some balance between code bloat and performance, we should decide based on potential gain and loop size for all loops. Even though that approach looks simple and reasonable in theory, it will most likely have big implications for existing apps. That's why I decided to take a more conservative approach and give only hot loops a chance to be vectorized.

Rebase.

ebrevnov edited parent revisions, added: D70482: [NFC] Change return type for 'shouldOptimizeForSize'; removed: D67690: [LV][NFC] Factor out calculation of "best" estimated trip count. Nov 20 2019, 3:02 AM

Harbormaster completed remote builds in B41229: Diff 230224. Nov 20 2019, 3:04 AM

Rebase. Attempt2.

ebrevnov added a child revision: D67905: [LV] Vectorizer should adjust trip count in profile information. Nov 20 2019, 3:43 AM

Harbormaster completed remote builds in B41232: Diff 230227. Nov 20 2019, 3:50 AM

Ideally we should be using loop size and expected perf gain from cost model in addition to loop hotness to make the decision but that would be much bigger change and can be a follow up step.

The original motivation to vectorize loops of tiny trip counts as if under opt-for-size, is not to save code size, nor is it related to how hot the loop is. It stems from the fact that LV's cost model currently focuses on the body of the vector loop only, disregarding overheads associated with runtime guards and code outside this body, under the assumption that the execution of the vector loop will dominate. If the trip count is tiny, this assumption no longer holds, and the overheads may outweigh the benefits, w/o LV's cost-model noticing, leading to slowdowns (which become more severe as the loop gets hotter). Under opt-for-size LV generates only the vector loop, with no runtime guards nor scalar epilog, so this assumption holds.

In order to vectorize loops of tiny trip count with a scalar epilogue, the cost model should be enhanced to account for both plus associated overheads.

Regarding hotness, that could help the cost model in general to compare the expected performance with that of the original loop, but not sure how in this case where the trip count is a known constant.
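The overhead argument above can be made concrete with a toy model. This is only a sketch under stated assumptions: the struct, the function names, and every cost number are invented for illustration and are not LV's actual cost model. It compares a scalar loop against a vector loop that pays a fixed guard overhead and runs a scalar epilogue for the remainder iterations.

```cpp
#include <cstdint>

// Toy costs in made-up units; none of these come from LLVM.
struct LoopCosts {
  uint64_t scalarIter;    // cost of one scalar iteration
  uint64_t vectorIter;    // cost of one vector iteration (covers VF elements)
  uint64_t guardOverhead; // runtime checks and setup outside the vector body
};

// Cost of running the original scalar loop.
uint64_t scalarLoopCost(const LoopCosts &c, uint64_t tripCount) {
  return tripCount * c.scalarIter;
}

// Cost of the guarded vector loop plus a scalar epilogue for the remainder.
uint64_t vectorLoopCost(const LoopCosts &c, uint64_t tripCount, uint64_t vf) {
  return c.guardOverhead + (tripCount / vf) * c.vectorIter +
         (tripCount % vf) * c.scalarIter;
}
```

With hypothetical costs `{4, 6, 20}` and VF = 4, a body-only comparison favors the vector loop (6 units per 4 elements versus 16). For a trip count of 5, however, the scalar loop costs 20 while the vectorized version costs 20 + 6 + 4 = 30, so the transformation is a slowdown. At a trip count of 1000 the vector version costs 1520 against 4000, and the assumption that the vector body dominates holds again.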

ebrevnov mentioned this in D70482: [NFC] Change return type for 'shouldOptimizeForSize'. Dec 5 2019, 12:46 AM

I'll start a separate review since the approach should be completely different.

ebrevnov removed a child revision: D67905: [LV] Vectorizer should adjust trip count in profile information. Dec 30 2019, 4:33 AM

Diff 230227

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[... 288 lines not shown ...]

 cl::opt<bool> llvm::EnableLoopInterleaving(
     "interleave-loops", cl::init(true), cl::Hidden,
     cl::desc("Enable loop interleaving in Loop vectorization passes"));
 cl::opt<bool> llvm::EnableLoopVectorization(
     "vectorize-loops", cl::init(true), cl::Hidden,
     cl::desc("Run the Loop vectorization passes"));

+static cl::opt<unsigned> LocalHotnessThreshold(
+    "local-hotness-threshold", cl::init(500), cl::Hidden,
+    cl::desc(
+        "In cases when there is no info on block hotness available from module "
+        "profile we define \"local hotness\" as a ratio of the block to "
+        "function entry execution counts. If the ration is greater than the "
+        "threshold defined by this parameter the block is said to be locally "
+        "hot."));
 /// A helper function for converting Scalar types to vector types.
 /// If the incoming type is void, we return void. If the VF is 1, we return
 /// the scalar type.
 static Type *ToVectorTy(Type *Scalar, unsigned VF) {
   if (Scalar->isVoidTy() || VF == 1)
     return Scalar;
   return VectorType::get(Scalar, VF);
 }
[... 621 lines not shown ...]
 enum ScalarEpilogueLowering {

   // The default: allowing scalar epilogues.
   CM_ScalarEpilogueAllowed,

   // Vectorization with OptForSize: don't allow epilogues.
   CM_ScalarEpilogueNotAllowedOptSize,

   // A special case of vectorisation with OptForSize: loops with a very small
   // trip count are considered for vectorization under OptForSize, thereby
   // making sure the cost of their loop body is dominant, free of runtime
   // guards and scalar iteration overheads.
   CM_ScalarEpilogueNotAllowedLowTripLoop,

   // Loop hint predicate indicating an epilogue is undesired.
   CM_ScalarEpilogueNotNeededUsePredicate
 };
[... 6,481 lines not shown ...]
 static ScalarEpilogueLowering
 getScalarEpilogueLowering(Function *F, Loop *L, LoopVectorizeHints &Hints,
                           ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI,
                           TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
                           AssumptionCache *AC, LoopInfo *LI,
                           ScalarEvolution *SE, DominatorTree *DT,
                           const LoopAccessInfo *LAI) {
   ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
   Optional<bool> IsColdByProfile =
       llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI);
   if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
       (F->hasOptSize() || IsColdByProfile.getValueOr(false)))
     SEL = CM_ScalarEpilogueNotAllowedOptSize;
   else if (PreferPredicateOverEpilog ||
            Hints.getPredicate() == LoopVectorizeHints::FK_Enabled ||
            (TTI->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI) &&
             Hints.getPredicate() != LoopVectorizeHints::FK_Disabled))
     SEL = CM_ScalarEpilogueNotNeededUsePredicate;
+  else {
+    auto ExpectedTC = getSmallBestKnownTC(*SE, L);
+    // Check the loop for a trip count threshold: vectorize loops with a tiny
+    // trip count by optimizing for size, to minimize overheads.
+    if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {
+      // Even short trip count loops may be hot (part of hot region).
+      // In absence of profile summary estimate loop hotness relative to
+      // function entry using execution frequency information.
+      if (!IsColdByProfile && LoopVectorizeWithBlockFrequency && BFI) {
+        Optional<uint64_t> LoopCount =
+            BFI->getBlockProfileCount(L->getHeader(), true);
+        Optional<uint64_t> FunctionCount =
+            BFI->getBlockProfileCount(&F->getEntryBlock(), true);
+        if (LoopCount && FunctionCount &&
+            (*LoopCount > *FunctionCount * LocalHotnessThreshold)) {
+          LLVM_DEBUG(dbgs() << "Allow epilog for short trip count loop due to "
+                               "hotness considerations.");
+          return CM_ScalarEpilogueAllowed;
+        }
+      }
+
+      LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
+                        << "This loop is worth vectorizing only if no scalar "
+                        << "iteration overheads are incurred.");
+
+      if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
+        LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
+      else {
+        LLVM_DEBUG(dbgs() << "\n");
+        SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
+      }
+    }
+  }

   return SEL;
 }

 // Process the loop in the VPlan-native vectorization path. This path builds
 // VPlan upfront in the vectorization pipeline, which allows to apply
 // VPlan-to-VPlan transformations from the very beginning without modifying the
 // input LLVM IR.
[... 112 lines not shown ...]	#endif /* NDEBUG */
   // the incoming IR, we need to build VPlan upfront in the vectorization
   // pipeline.
   if (!L->empty())
     return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
                                         ORE, BFI, PSI, Hints);

   assert(L->empty() && "Inner loop expected.");

-  // Check the loop for a trip count threshold: vectorize loops with a tiny trip
-  // count by optimizing for size, to minimize overheads.
-  auto ExpectedTC = getSmallBestKnownTC(*SE, L);
-  if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {
-    LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
-                      << "This loop is worth vectorizing only if no scalar "
-                      << "iteration overheads are incurred.");
-    if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
-      LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
-    else {
-      LLVM_DEBUG(dbgs() << "\n");
-      SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
-    }
-  }
-
   // Check the function attributes to see if implicit floats are allowed.
   // FIXME: This check doesn't seem possibly correct -- what if the loop is
   // an integer loop and the vector instructions selected are purely integer
   // vector instructions?
   if (F->hasFnAttribute(Attribute::NoImplicitFloat)) {
     reportVectorizationFailure(
         "Can't vectorize when the NoImplicitFloat attribute is used",
         "loop not vectorized due to NoImplicitFloat attribute",
[... 314 lines not shown ...]

llvm/test/Transforms/LoopVectorize/hot_short_tc_loop.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -passes="print<block-freq>,loop-vectorize" -S < %s 2>&1 | FileCheck %s

				; Check vectorization of hot short trip count with epilog. In this case inner
				; loop trip count is not constant and its value is estimated by profile.

				; ModuleID = 'test.cpp'
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				@a = dso_local global [5 x i32] zeroinitializer, align 16
				@b = dso_local global [5 x i32] zeroinitializer, align 16

				; Function Attrs: uwtable
				define dso_local void @_Z3fooi(i32 %M) local_unnamed_addr #0 !prof !11 {
				; CHECK: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP15:%.*]]
				; CHECK: [[TMP18:%.*]] = mul nsw <4 x i32> [[WIDE_LOAD]], [[VEC_IND6:%.*]]
				; CHECK: [[WIDE_LOAD10:%.*]] = load <4 x i32>, <4 x i32>* [[TMP23:%.*]]
				; CHECK: [[TMP26:%.*]] = add nsw <4 x i32> [[WIDE_LOAD10]], [[TMP18]]
				; CHECK: store <4 x i32> [[TMP26]], <4 x i32>* [[TMP28:%.*]]
				;
				entry:
				%a = alloca [5 x i32], align 16
				%b = alloca [5 x i32], align 16
				%0 = bitcast [5 x i32]* %a to i8*
				call void @llvm.lifetime.start.p0i8(i64 20, i8* nonnull %0) #3
				%1 = bitcast [5 x i32]* %b to i8*
				call void @llvm.lifetime.start.p0i8(i64 20, i8* nonnull %1) #3
				%arraydecay = getelementptr inbounds [5 x i32], [5 x i32]* %a, i64 0, i64 0
				br label %for.body.us.preheader

				for.body.us.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %M to i64
				br label %for.body.us

				for.body.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.body.us.preheader
				%j.019.us = phi i32 [ %inc8.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.body.us.preheader ]
				call void @_Z3barPi(i32* nonnull %arraydecay)
				br label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.body.us
				%indvars.iv = phi i64 [ 0, %for.body.us ], [ %indvars.iv.next, %for.body4.us ]
				%arrayidx.us = getelementptr inbounds [5 x i32], [5 x i32]* %b, i64 0, i64 %indvars.iv
				%2 = load i32, i32* %arrayidx.us, align 4, !tbaa !2
				%3 = trunc i64 %indvars.iv to i32
				%mul.us = mul nsw i32 %2, %3
				%arrayidx6.us = getelementptr inbounds [5 x i32], [5 x i32]* %a, i64 0, i64 %indvars.iv
				%4 = load i32, i32* %arrayidx6.us, align 4, !tbaa !2
				%add.us = add nsw i32 %4, %mul.us
				store i32 %add.us, i32* %arrayidx6.us, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us, !prof !10

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us
				%inc8.us = add nuw nsw i32 %j.019.us, 1
				%exitcond21 = icmp eq i32 %inc8.us, 20
				br i1 %exitcond21, label %for.cond.cleanup.loopexit, label %for.body.us, !prof !12

				for.cond.cleanup.loopexit: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us
				br label %for.cond.cleanup

				for.cond.cleanup.loopexit24: ; preds = %for.body
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit24, %for.cond.cleanup.loopexit
				call void @llvm.lifetime.end.p0i8(i64 20, i8* nonnull %1) #3
				call void @llvm.lifetime.end.p0i8(i64 20, i8* nonnull %0) #3
				ret void
				}

				; Check vectorization of hot short trip count with epilog. In this case inner
				; loop trip count is known constant value.

				; Function Attrs: uwtable
				define dso_local void @_Z3fooi2() local_unnamed_addr #0 !prof !11 {
				; CHECK: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP15:%.*]]
				; CHECK: [[TMP18:%.*]] = mul nsw <4 x i32> [[WIDE_LOAD]], [[VEC_IND6:%.*]]
				; CHECK: [[WIDE_LOAD10:%.*]] = load <4 x i32>, <4 x i32>* [[TMP23:%.*]]
				; CHECK: [[TMP26:%.*]] = add nsw <4 x i32> [[WIDE_LOAD10]], [[TMP18]]
				; CHECK: store <4 x i32> [[TMP26]], <4 x i32>* [[TMP28:%.*]]
				;
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.cond.cleanup3
				ret void

				for.body: ; preds = %entry, %for.cond.cleanup3
				%j.018 = phi i32 [ 0, %entry ], [ %inc8, %for.cond.cleanup3 ]
				tail call void @_Z3barPi(i32* getelementptr inbounds ([5 x i32], [5 x i32]* @a, i64 0, i64 0))
				br label %for.body4

				for.cond.cleanup3: ; preds = %for.body4
				%inc8 = add nuw nsw i32 %j.018, 1
				%cmp = icmp ult i32 %inc8, 1000
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !prof !13

				for.body4: ; preds = %for.body, %for.body4
				%i.017 = phi i32 [ 0, %for.body ], [ %inc, %for.body4 ]
				%idxprom = zext i32 %i.017 to i64
				%arrayidx = getelementptr inbounds [5 x i32], [5 x i32]* @b, i64 0, i64 %idxprom
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul = mul nsw i32 %0, %i.017
				%arrayidx6 = getelementptr inbounds [5 x i32], [5 x i32]* @a, i64 0, i64 %idxprom
				%1 = load i32, i32* %arrayidx6, align 4, !tbaa !2
				%add = add nsw i32 %1, %mul
				store i32 %add, i32* %arrayidx6, align 4, !tbaa !2
				%inc = add nuw nsw i32 %i.017, 1
				%cmp2 = icmp ult i32 %inc, 5
				br i1 %cmp2, label %for.body4, label %for.cond.cleanup3
				}

				; This is negative test. Check that vectorization is not performed for COLD
				; short trip count loop requiring epilog. Note that outer loop has only 20
				; iterations and there is no associated profile info.


				; Function Attrs: uwtable
				define dso_local void @_Z3fooi3(i32 %M) local_unnamed_addr #0 !prof !11 {
				; CHECK: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX_US:%.*]]
				; CHECK: [[MUL_US:%.*]] = mul nsw i32 [[TMP2]], [[TMP3:%.*]]
				; CHECK: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX6_US:%.*]]
				; CHECK: [[ADD_US:%.*]] = add nsw i32 [[TMP4]], [[MUL_US]]
				; CHECK: store i32 [[ADD_US]], i32* [[ARRAYIDX6_US]]
				;
				entry:
				%a = alloca [5 x i32], align 16
				%b = alloca [5 x i32], align 16
				%0 = bitcast [5 x i32]* %a to i8*
				call void @llvm.lifetime.start.p0i8(i64 20, i8* nonnull %0) #3
				%1 = bitcast [5 x i32]* %b to i8*
				call void @llvm.lifetime.start.p0i8(i64 20, i8* nonnull %1) #3
				%arraydecay = getelementptr inbounds [5 x i32], [5 x i32]* %a, i64 0, i64 0
				br label %for.body.us.preheader

				for.body.us.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %M to i64
				br label %for.body.us

				for.body.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.body.us.preheader
				%j.019.us = phi i32 [ %inc8.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.body.us.preheader ]
				call void @_Z3barPi(i32* nonnull %arraydecay)
				br label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.body.us
				%indvars.iv = phi i64 [ 0, %for.body.us ], [ %indvars.iv.next, %for.body4.us ]
				%arrayidx.us = getelementptr inbounds [5 x i32], [5 x i32]* %b, i64 0, i64 %indvars.iv
				%2 = load i32, i32* %arrayidx.us, align 4, !tbaa !2
				%3 = trunc i64 %indvars.iv to i32
				%mul.us = mul nsw i32 %2, %3
				%arrayidx6.us = getelementptr inbounds [5 x i32], [5 x i32]* %a, i64 0, i64 %indvars.iv
				%4 = load i32, i32* %arrayidx6.us, align 4, !tbaa !2
				%add.us = add nsw i32 %4, %mul.us
				store i32 %add.us, i32* %arrayidx6.us, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us, !prof !14

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us
				%inc8.us = add nuw nsw i32 %j.019.us, 1
				%exitcond21 = icmp eq i32 %inc8.us, 20
				br i1 %exitcond21, label %for.cond.cleanup.loopexit, label %for.body.us

				for.cond.cleanup.loopexit: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us
				br label %for.cond.cleanup

				for.cond.cleanup.loopexit24: ; preds = %for.body
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit24, %for.cond.cleanup.loopexit
				call void @llvm.lifetime.end.p0i8(i64 20, i8* nonnull %1) #3
				call void @llvm.lifetime.end.p0i8(i64 20, i8* nonnull %0) #3
				ret void
				}

				; Function Attrs: argmemonly nounwind willreturn
				declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1

				declare dso_local void @_Z3barPi(i32*) local_unnamed_addr

				; Function Attrs: argmemonly nounwind willreturn
				declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1

				attributes #0 = { "use-soft-float"="false" }
				attributes #1 = { argmemonly nounwind willreturn }

				!llvm.module.flags = !{!0}
				!llvm.ident = !{!1}

				!0 = !{i32 1, !"wchar_size", i32 4}
				!1 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project f379dd57b978c4e1483d721f422c79e3c0c5ccdc)"}
				!2 = !{!3, !3, i64 0}
				!3 = !{!"int", !4, i64 0}
				!4 = !{!"omnipotent char", !5, i64 0}
				!5 = !{!"Simple C++ TBAA"}
				!6 = distinct !{!6, !7}
				!7 = !{!"llvm.loop.isvectorized", i32 1}
				!8 = distinct !{!8, !9, !7}
				!9 = !{!"llvm.loop.unroll.runtime.disable"}
				!10 = !{!"branch_weights", i32 999, i32 4995}
				!11 = !{!"function_entry_count", i64 1}
				!12 = !{!"branch_weights", i32 1, i32 999}
				!13 = !{!"branch_weights", i32 1000, i32 1}
				!14 = !{!"branch_weights", i32 9, i32 45}
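The branch_weights metadata in the test above is what lets a profile-based trip count estimate come out small. As a rough sketch of the idea, one can divide the backedge weight by the exit weight, rounding to nearest. The helper `estimateTripCount` below is hypothetical and written only for this illustration; LLVM's own estimator may round or offset the result differently.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not LLVM's implementation): estimate the average
// number of iterations per loop entry from latch branch weights as
// backedge weight / exit weight, rounded to nearest.
std::optional<uint64_t> estimateTripCount(uint64_t exitWeight,
                                          uint64_t backedgeWeight) {
  if (exitWeight == 0)
    return std::nullopt; // no recorded exits, so no meaningful estimate
  return (backedgeWeight + exitWeight / 2) / exitWeight;
}
```

For `!10` (exit weight 999, backedge weight 4995) and for `!14` (exit weight 9, backedge weight 45) this sketch yields 5 in both cases, which matches the 5-element arrays used by the test and is below the threshold of 16 mentioned in the summary.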

This is an archive of the discontinued LLVM Phabricator instance.

[LV] Allow vectorization of hot short trip count loops with epilog
Abandoned · Public