This is an archive of the discontinued LLVM Phabricator instance.

Epilog loop vectorization
Needs Review · Public

Authored by ashutosh.nema on Feb 22 2017, 1:46 AM.

Details

Summary

This is a proposal about epilog loop vectorization.

Currently, the Loop Vectorizer inserts an epilogue loop to handle loops whose iteration counts are not known at compile time.

The Loop Vectorizer supports loops with an unknown trip count. Such a trip count may not be a multiple of the vector width, so the vectorizer has to execute the last few iterations as scalar code; it keeps a scalar copy of the loop for these remaining iterations.

A loop vectorized with a large vector width is likely to execute many scalar iterations.
For example, an i8 data type on a target with 256-bit registers can be vectorized with a vector width of 32; the scalar (epilog) loop can then execute up to 31 iterations, which is significant and worth vectorizing.
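
As a rough illustration, here is a C++ sketch of that shape (the function and the plain inner loops are stand-ins, not the vectorizer's actual output):

  #include <cstdint>

  // Conceptual shape of the vectorizer's output for a loop with unknown trip
  // count n, i8 elements, and VF = 32 (256-bit registers).
  void add_bytes(uint8_t *a, const uint8_t *b, int n) {
    int i = 0;
    // Main vector loop: handles floor(n / 32) * 32 elements.
    for (; i + 32 <= n; i += 32)
      for (int j = 0; j < 32; ++j)   // stands in for one 32-wide vector operation
        a[i + j] += b[i + j];
    // Scalar epilog loop: up to 31 leftover iterations when n % 32 != 0.
    for (; i < n; ++i)
      a[i] += b[i];
  }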

A large vector factor poses the following challenges:

  1. The number of remainder iterations can be substantial.
  2. The actual trip count at runtime can be substantial, yet still below the minimum trip count required to execute the vector loop.

These challenges can be addressed with masked instructions, but such instructions are limited and not available on all targets.

With epilog vectorization, our aim is to vectorize the epilog loop in cases where the original loop is vectorized with a large vector factor and is likely to execute many scalar iterations.

This requires the following changes:

  1. Costing: preserve all profitable vector factors.
  2. Transform: create an additional vector loop with the next profitable vector factor.

Please refer to the attached file (BlockLayout.png) for details about the transformed block layout.
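
For readers without the attachment, a hedged C++ sketch of the intended control flow follows (the VFs of 32 and 8 and the explicit guards are illustrative, not the patch's literal block layout):

  #include <cstdint>

  // Transformed layout: main vector loop (VF = 32), then an epilog vector loop
  // at the next profitable VF (here 8), then a scalar remainder of at most 7
  // iterations.
  void add_bytes_epilog_vectorized(uint8_t *a, const uint8_t *b, int n) {
    int i = 0;
    if (n >= 32)                       // minimum-iteration check, main vector loop
      for (; i + 32 <= n; i += 32)
        for (int j = 0; j < 32; ++j)   // one 32-wide vector operation
          a[i + j] += b[i + j];
    if (n - i >= 8)                    // minimum-iteration check, epilog vector loop
      for (; i + 8 <= n; i += 8)
        for (int j = 0; j < 8; ++j)    // one 8-wide vector operation
          a[i + j] += b[i + j];
    for (; i < n; ++i)                 // scalar remainder
      a[i] += b[i];
  }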

Diff Detail

Repository
rL LLVM

Event Timeline

ashutosh.nema created this revision. Feb 22 2017, 1:46 AM
dorit added a subscriber: dorit. Feb 22 2017, 1:51 AM

Test cases are missing; I will add them.

rengolin edited edge metadata. Feb 22 2017, 3:29 AM

I think this is an interesting idea, though with limited applicability. Furthermore, the way you implemented it makes it completely orthogonal to the vectoriser, and at odds with the VPlan strategy being discussed.

Before VPlans get in, I'd assume you would sort the strategies by cost and, if they were in a beneficial order (ex. 4 > 2 > 1), you'd then split the loop in N parts, one for each size. The problem here is that the trade-offs are not clear, and this is probably only beneficial for *very* large VF (32+), because now you're adding more run time checks, shuffles and moves between scalar and vector register banks, which are not always free.

A few more ideas...

  1. If the second loop needs to be >16, then just unroll however many instructions to match 16 lanes, no loops necessary.
  2. Link this optimisation to code size restrictions. We don't want to run this at -Os.
  3. Once VPlans go in, this would be a separate VPlan that could be applied on top of others, so we may need to change the VPlan implementation to allow that.

Finally, tests. Even though this is just a proposal, tests let you show it "in action" and allow us to discuss specific details of the pass that could be too opaque or intricate to realise just by looking at the code.

cheers,
--renato

lib/Transforms/Vectorize/LoopVectorize.cpp
3429

So, you're assuming that just having more than 16 iterations is "beneficial", but you haven't done any cost analysis. It may very well be that the cost is just not worth it, especially for smaller vector sizes.

delena added a subscriber: delena. Feb 22 2017, 3:43 AM
mkuper edited edge metadata. Feb 22 2017, 10:35 AM

I haven't looked at the patch yet - but I think it's a nice idea.
I know there was some experimentation with using masked epilogues, which, IIRC, did not provide a tangible benefit. Do you have any performance numbers showing that doing it this way does?

ashutosh.nema added inline comments. Mar 6 2017, 12:51 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
3364

We can remove the call to the loop vectorizer's legality check before widening the epilog loop: legality was already proven for the scalar loop while generating the first vector version, and it does not need to be proven again since the scalar body has not changed. Epilog vectorization will rely on the already computed legality. I have tried this change and found no impact in my regular tests & benchmarks.

Please let me know your thoughts on this.

3429

Costing is already done: during the cost calculation for the first vector version we preserved the profitable VFs.
At this point we look for the next profitable vector factor and widen the epilog loop with it.
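
As a hypothetical illustration of that bookkeeping (not the patch's actual code; the struct and member names below are made up):

  #include <algorithm>
  #include <vector>

  // Candidate VFs that the cost model found profitable, recorded while costing
  // the first (main) vector loop.
  struct ProfitableVFInfo {
    std::vector<unsigned> ProfitableVFs;   // e.g. {4, 8, 16, 32}

    // Largest profitable VF strictly smaller than the main loop's VF,
    // or 0 if there is none; used to widen the epilog loop.
    unsigned nextProfitableVF(unsigned MainVF) const {
      unsigned Best = 0;
      for (unsigned VF : ProfitableVFs)
        if (VF < MainVF)
          Best = std::max(Best, VF);
      return Best;
    }
  };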

This change includes:

a) Code refactoring.

b) Removed the call to the loop vectorizer's legality check before widening the epilog loop, as legality was already proven for the scalar loop while generating the first vector version; it does not need to be proven again since the scalar body has not changed. Epilog vectorization will rely on the already computed legality.

c) Added tests.

Block layout description

Hi Ashutosh,

Sorry for the delay. I think the patch is looking good from my end.

I'd add a TODO to maybe unroll+SLP if the next profitable VF and the size of the tail are close to each other. This could also simplify the call graph.

Example: VF=8, so tail is at most 7 and next VF is 4. This would be better without a latch, just if (end - i) > 4 -> 4-way SIMD -> for (i .. end) -> scalar.
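
A C++ sketch of that shape, purely illustrative: with VF = 8 the tail is at most 7, so a single guarded 4-wide step replaces a second vector loop and at most 3 scalar iterations remain:

  #include <cstdint>

  void add_bytes_short_tail(uint8_t *a, const uint8_t *b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8)        // main vector loop, VF = 8
      for (int j = 0; j < 8; ++j)
        a[i + j] += b[i + j];
    if (n - i >= 4) {                 // single guarded 4-wide step, no loop latch
      for (int j = 0; j < 4; ++j)     // stands in for one 4-wide vector operation
        a[i + j] += b[i + j];
      i += 4;
    }
    for (; i < n; ++i)                // scalar remainder, at most 3 iterations
      a[i] += b[i];
  }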

But this is not a job for this patch, I think.

So, now, it would be good to know a few numbers. Can you share some performance improvements?

Also, it would be good to get @mkuper's take on this.

cheers,
--renato

lib/Transforms/Vectorize/LoopVectorize.cpp
3364

Legality is not the problem here, but the cost is. If the transformation is legal for VF, then it's also legal for all VFs < VF.

6420

This makes sense to me, and it's exactly how I would implement it.

6430

You don't need the NOTE in comments...

After this patch was posted, there was an RFC discussion regarding approach. We discussed, for example, just adding metadata (or keeping similar state) to restrict VF and rerunning the vectorizer to vectorize the epilogue loop. Can you please summarize that discussion and how it relates to what's here? Did the design of this patch change as a result of that discussion? If not, why not?
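
For context on that alternative: a per-loop width restriction already exists as llvm.loop.vectorize.width metadata, which Clang exposes at source level through a pragma. The C++ snippet below only illustrates that existing mechanism; it is not what this patch implements:

  // The pragma is lowered to "llvm.loop.vectorize.width" loop metadata, which
  // caps the VF the LoopVectorizer may choose for this loop. The alternative
  // discussed in the RFC was to attach a similar restriction to the generated
  // epilog loop and rerun the vectorizer on it.
  void add_epilog(float *a, const float *b, int n) {
  #pragma clang loop vectorize_width(4)
    for (int i = 0; i < n; ++i)
      a[i] += b[i];
  }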

include/llvm/Transforms/Utils/LoopVersioning.h
151

I don't see why this variable is needed. It only seems to be used to adjust the assert. I see no problem with relaxing the required loop conditions, but I think that you should just relax them and document what they are.

@ashutosh.nema, can you clarify where this patch stands and whether it's going to be updated soon?

This patch is a little old and not updated with the latest vectorization changes.

In the RFC, a few concerns were raised around the new block layout:
a) If the original loop's alias check fails, jump directly to the scalar loop.
b) There should be no extra checks on the critical path, i.e. when the minimum iteration check fails for the original vector loop, jump directly to the scalar loop. In my opinion this is a possible loss of an optimization opportunity: when the minimum iteration check for the original loop fails, we can still try executing the epilog loop, and that requires the epilog loop's minimum iteration check. We can keep this under an option.

I guess both cases can be handled; I'll re-initiate the discussion & update this patch soon.

Is there any update on this patch? I find it very useful, for both cycles and size, for targets that have masked operations not just for loads and stores but also for other operations, including intra operations.

Ayal added a comment. Jan 29 2019, 8:15 AM

BTW, targets that have efficient masked operations may also find it useful for cycles and size to vectorize (including epilogs) under -Os, and following r345705 possibly also with enable-masked-interleaved-mem-accesses turned on.

I agree, but I think this approach is a tad too heavy for masked operations.

In SVE, the masking will happen naturally and the last loop will just have the remaining elements as a consequence of the ISA design.

In non-scalable vector extensions, that's not true, so it would need some emulation. But I wouldn't make it such a special case, with so many checks.

I imagine that, if vectorisation at the full vector length is legal, then so is vectorisation at a smaller length. It may not be profitable on its own, but the fact that it continues the existing patterns (before moving results into scalar registers) will probably make it cheaper than a scalar tail.

This could either be some arithmetic on the mask computation (which would occur on every iteration of the loop + 1) or a tail loop with the same mask computation duplicated from the one just vectorised. I imagine -O3 would have different trade-offs than -Os, so we could potentially have both solutions.
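
To make the masked-tail idea concrete, here is a rough C++ emulation of one predicated tail step (plain scalar code stands in for masked vector operations; no target intrinsics are assumed):

  #include <cstdint>

  void add_bytes_masked_tail(uint8_t *a, const uint8_t *b, int n) {
    const int VF = 8;
    int i = 0;
    for (; i + VF <= n; i += VF)       // unmasked main vector loop
      for (int j = 0; j < VF; ++j)
        a[i + j] += b[i + j];
    if (i < n) {                       // one predicated "vector" step for the tail
      bool mask[VF];
      for (int j = 0; j < VF; ++j)     // mask computation: lane active iff i + j < n
        mask[j] = (i + j) < n;
      for (int j = 0; j < VF; ++j)     // stands in for one masked 8-wide operation
        if (mask[j])
          a[i + j] += b[i + j];
    }
  }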

Herald added a project: Restricted Project. Jul 30 2019, 11:11 AM