
[LV] Take overhead of run-time checks into account during vectorization.
Needs ReviewPublic

Authored by ebrevnov on Dec 5 2019, 2:28 AM.

Details

Summary

Currently the vectorizer rejects short trip count loops in the presence of run-time checks. That is caused by the inability of the cost model to take the overhead of run-time checks into account. While the ideal solution would be to calculate the entire cost of the VPlan, there is no infrastructure in place for that. I see two possibilities to mitigate this.

  1. Make a dry run of SCEV expansion during the main phase of cost modeling: instead of actually expanding the SCEVs, calculate the cost such an expansion would have.
  2. Defer the overhead calculation for run-time checks, and the final decision, until after the run-time checks have been expanded. If the overhead turns out to make vectorization unprofitable, emit an unconditional bypass of the vectorized loop and rely on subsequent optimizations to remove the trivially dead code.

While 1) may look like the better approach, I see it as very problematic in practice. That's why I decided to go with 2).

Please note that loops vectorized before this change remain vectorized in the same way afterwards. Only short trip count loops that were not vectorized before may now get vectorized (if the cost model proves it profitable). Thus the change has a pretty narrow scope and is not expected to have a wide performance impact (neither positive nor negative).

The main motivation for the change is a performance gain on a benchmark based on a publicly available math library for linear algebra (https://github.com/fommil/netlib-java). The following simple loop over the DSPR method from Netlib shows a +55% gain on Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz:

  for (int i = 0; i < 1000; i++) {
    blas.dspr("l", 9, 1.0, vectorD, 1, result);
  }

I measured the performance impact on the LLVM test-suite, but I don't know how to read the results. I ran the original and patched compilers 3 times each, and the results are unstable: even for consecutive runs of the original compiler I observe +100/-100% variation. This is a machine dedicated to performance testing, and no other applications were running.

Besides that, I evaluated performance on SPECjvm98, DaCapo 9.12 and Netlib. All measurements were performed on Linux with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. No performance changes were detected. I can share the numbers if anyone wants to see them.

Note that the feature is disabled by default for now. It is going to be enabled downstream first, to give it extensive testing on a wider range of real-world applications.

Event Timeline

ebrevnov created this revision.Dec 5 2019, 2:28 AM
Herald added a project: Restricted Project.

Is this change inspired by a real world case? If so, how relevant / pervasive is this case?

Also, how different are the run time checks from each other?

If we assume most of them end up being an additional comparison, sometimes with a pointer dereference, and that the values will already be in registers (as they're about to be looped over), I think their costs end up being roughly the same, no?

Or am I missing something obvious?

Thanks!
--renato

> Is this change inspired by a real world case? If so, how relevant / pervasive is this case?

Yes, this case came from a real-world benchmark. This change gives +15% on it.

> Also, how different are the run time checks from each other?
>
> If we assume most of them end up being an additional comparison, sometimes with pointer dereference, and assuming the values will be already in registers (as they're about to be looped over), I think their costs end up being roughly the same, no?
>
> Or am I missing something obvious?

Run-time checks may involve much more than just a comparison and some pointer arithmetic. Here is an example of what a typical overflow check looks like:

  vector.scevcheck:                                 ; preds = %tc.check
    %mul = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %1, i32 %indvar)
    %mul.result = extractvalue { i32, i1 } %mul, 0
    %mul.overflow = extractvalue { i32, i1 } %mul, 1
    %54 = add i32 %loop-predication.iv, %mul.result
    %55 = sub i32 %loop-predication.iv, %mul.result
    %56 = icmp ugt i32 %55, %loop-predication.iv
    %57 = icmp ult i32 %54, %loop-predication.iv
    %58 = select i1 false, i1 %56, i1 %57
    %59 = or i1 %58, %mul.overflow
    %60 = or i1 false, %59
    %mul268 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %2, i32 %indvar)
    %mul.result269 = extractvalue { i32, i1 } %mul268, 0
    %mul.overflow270 = extractvalue { i32, i1 } %mul268, 1
    %61 = add i32 %arg4, %mul.result269
    %62 = sub i32 %arg4, %mul.result269
    %63 = icmp ugt i32 %62, %arg4
    %64 = icmp ult i32 %61, %arg4
    %65 = select i1 false, i1 %63, i1 %64
    %66 = or i1 %65, %mul.overflow270
    %67 = or i1 %60, %66
    br i1 %67, label %scalar.ph, label %vector.memcheck

Thanks
Evgeniy

Thanks!
--renato

> Yes, this case came from the real world benchmark. This change gives +15% on it.

Well, 15% on an unnamed benchmark on an unnamed architecture shouldn't be a reason to change core functionality of the vectoriser.

The first step is to show that, on LLVM's own test-suite (in benchmarking mode), multiple programs are changed and the effect is largely positive.

But you also need to make sure that the change has no significant negative effect on any of the benchmarks people care about, on the major architectures supported by LLVM.

Obvious candidates are SPEC, x86_64 and Arm, but it'd be nice if the architecture owners can chime in with their own acks.

Be wary of geomeans. Large regressions can be hidden by a number of smaller improvements, and that's a bad direction to go. The other direction is also worrying, where one benchmark shows massive improvements and lots of others show regressions. Benchmarks tend to be very specific to one task and their numbers are artificial, whereas regressions on unrelated code mean you've added a base cost to *all other* programs.

With this patch, this is especially true. You need to make sure that the cost you're adding is "insignificant" overall, which I find hard to prove.

> Run-time checks may involve much more than just comparison and some pointer arithmetic. Here is an example how typical overflow check looks like:

My point wasn't how small or big they are, but how constant.

If that's the usual size of a range check, then it almost certainly eclipses the benefits of small loop vectorisation in the majority of cases.

But if the checks range from one ICMP to a chain of MULs, ADDs, etc., then it gets harder to predict the check size, and static assumptions will be hard.

Bear in mind that the following golden rule of compilers applies: whenever something is hard to do statically, we do it conservatively. To make it smarter, we *must* prove it's still safe in the general case.

Today we're in the former state, so you need to prove your change is safe in the general case.

cheers,
--renato

ebrevnov retitled this revision from [LV] Take overhead of run-time checks into account during vectorization. to [WIP][LV] Take overhead of run-time checks into account during vectorization..Dec 10 2019, 3:57 AM
ebrevnov updated this revision to Diff 236998.Thu, Jan 9, 2:44 AM

Introduce control flag for the feature and disable by default + Rebase.

ebrevnov edited the summary of this revision. (Show Details)Thu, Jan 9, 4:37 AM
ebrevnov edited the summary of this revision. (Show Details)Thu, Jan 9, 4:42 AM
ebrevnov retitled this revision from [WIP][LV] Take overhead of run-time checks into account during vectorization. to [LV] Take overhead of run-time checks into account during vectorization..
ebrevnov added reviewers: Ayal, hsaito, fhahn, anna.

Hi Evgeniy,

Hiding it behind a flag defaulting to false is a good idea. Also, now I understand what the benchmark and the architecture were, thanks!

I think the idea is not wholly unreasonable, especially needing a flag to run (for now). This will help people on other architectures test, too.

However, I'm not comfortable reviewing the code, as this seems pretty invasive and VPlan is changing constantly. The people already added as reviewers, and the ones I'm adding now, should have a look and review, too.

About the test-suite: it's not just running it, you need to run it in benchmark mode. It's not trivial, so I'm hoping people can help you with that. At least check the standard C++ benchmarks to see if there's any unexpected negative impact.

Thanks!
--renato