This is an archive of the discontinued LLVM Phabricator instance.

Differential D65197

[LV] Tail-loop Folding
ClosedPublic

Authored by SjoerdMeijer on Jul 24 2019, 3:57 AM.

Details

Reviewers

Meinersbur
hsaito
fhahn
samparker
dmgreen
rengolin

Commits

rL367592: [LV] Tail-Loop Folding
rG20b198ec5ea7: [LV] Tail-Loop Folding

Summary

This allows folding of the scalar epilogue loop (the tail) into the main
vectorised loop body when the loop is annotated with a "vector predicate"
metadata hint. To fold the tail, instructions need to be predicated (masked),
enabling/disabling lanes for the remainder iterations.

This depends on D64744 that introduces the llvm.loop.vectorize.predicate.enable
pragma and metadata node, and D64916 which is a refactoring to make tail
folding a more general concept.
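
To illustrate the idea in the summary, here is a minimal scalar C++ model. It is not the code the vectorizer emits: it assumes a hypothetical vectorization factor `VF = 4`, models vector lanes with an inner loop, and uses made-up function names. The first function has a vector body followed by a scalar epilogue; the second folds the tail into the vector body by masking off the lanes that run past the trip count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

// Vector body plus scalar epilogue: the body handles whole groups of VF
// elements, the epilogue handles the remaining N % VF elements one by one.
std::vector<int> addWithScalarEpilogue(const std::vector<int> &B,
                                       const std::vector<int> &C) {
  assert(B.size() == C.size() && "inputs must have the same length");
  const std::size_t N = B.size();
  std::vector<int> A(N);
  std::size_t I = 0;
  for (; I + VF <= N; I += VF) // "vector" body, all lanes active
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      A[I + Lane] = B[I + Lane] + C[I + Lane];
  for (; I < N; ++I) // scalar epilogue (the tail)
    A[I] = B[I] + C[I];
  return A;
}

// Tail folded into the vector body: every iteration is predicated, and the
// lanes past the trip count are simply disabled in the last iteration.
std::vector<int> addWithFoldedTail(const std::vector<int> &B,
                                   const std::vector<int> &C) {
  assert(B.size() == C.size() && "inputs must have the same length");
  const std::size_t N = B.size();
  std::vector<int> A(N);
  for (std::size_t I = 0; I < N; I += VF)
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const bool Active = I + Lane < N; // per-lane predicate (mask bit)
      if (Active)
        A[I + Lane] = B[I + Lane] + C[I + Lane];
    }
  return A;
}
```

Both functions compute the same result; the difference is only in whether a separate remainder loop exists.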

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Jul 24 2019, 3:57 AM

Herald added a project: Restricted Project. Jul 24 2019, 3:57 AM

Herald added subscribers: rkruppe, hiraditya.

SjoerdMeijer added subscribers: huntergr, sdesmalen. Jul 24 2019, 5:04 AM

[serious] There is a LoopVectorizeHints class in LoopVectorizationLegality.cpp that should be used.

[serious] Documentation of llvm.loop.vectorize.predicate.enable is missing.

Just realized that the docs for llvm.loop.vectorize.predicate.enable are part of D64744, which is otherwise a clang-only patch.

Hi Michael, thanks for taking a look again!

Just realized that the docs for llvm.loop.vectorize.predicate.enable are part of D64744, which is otherwise a clang-only patch.

Yep, indeed, so I assume that's all good.

There is a LoopVectorizeHints class in LoopVectorizationLegality.cpp that should be used.

Ah yes, thanks for the suggestion, I will start looking into this, and will move the pragma handling to some function in there.

In D65197#1599082, @SjoerdMeijer wrote:

Just realized that the docs for llvm.loop.vectorize.predicate.enable are part of D64744, which is otherwise a clang-only patch.

Yep, indeed, so I assume that's all good.

Before we had the monorepo, reviewers frequently asked to split patches into the LLVM and Clang parts. With the monorepo, I am not sure the rule still needs to be followed. At least, I did not expect LLVM documentation in a clang patch, so sorry for the non-applicable comment.

In D65197#1599102, @Meinersbur wrote:

Before we had the monorepo, reviewers frequently asked to split patches into the LLVM and Clang parts. With the monorepo, I am not sure the rule still needs to be followed. At least, I did not expect LLVM documentation in a clang patch, so sorry for the non-applicable comment.

It does because we're still committing to SVN. Once we enable write mode on the monorepo, that'll change.

Ha, that's funny, because before noticing these comments here, I was just doing a test commit (366904) with the github monorepo workflow.
With the discussion on the dev list that the transition date is near, and just following the public documentation in https://llvm.org/docs/GettingStarted.html, it really looks like I committed to the clang and llvm repo at the same time using the git llvm push script from a local git monorepo. I was of course aware of separating clang and llvm patches, but again, thought that this new workflow is fully accepted/supported.

Anyway, back to looking at LoopVectorizationLegality.cpp :-)

We probably need to discuss whether vectorize_predicate(enable) should (or should not) implicitly turn on vectorize(enable). I guess the current behavior is "does not", right? We don't have to discuss that in this review, but we still want to make a conscious decision one way or the other, or did I miss that discussion?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
845	I think the nuance here is rather ScalarEpilogueNotNeededPredicatePragma. In other words, if scalar epilogue is needed for some other reason (but still okay to skip scalar epilogue execution when vector code executes), scalar epilogue can be emitted/utilized. Runtime vectorization legality check of all kinds fits in that profile. We shouldn't overload "predicated vector code" pragma with "don't emit scalar epilogue" meaning.
4765	-Os/-Oz message comes out from fall through. Not desired.
4784	How about // Accept MaxVF if we don't have a tail at all. and move the comment inside IF.
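
The fall-through remark on line 4765 can be reproduced in isolation. The sketch below reuses the enumerator names and message texts from the diff, but everything else is hypothetical: it collects messages in a vector instead of using LLVM_DEBUG, and the two function names are made up. With the fall-through, the low-trip-count case also reports the -Os/-Oz message. The second function shows one possible way to avoid that; it is not necessarily what was committed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the enum in the diff (the pragma case is left out here).
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotAllowedOptSize,
  CM_ScalarEpilogueNotAllowedLowTripLoop
};

// Mirrors the reviewed switch: the low-trip-count case falls through into
// the -Os/-Oz case, so both messages are reported.
std::vector<std::string> messagesWithFallThrough(ScalarEpilogueLowering SEL) {
  std::vector<std::string> Msgs;
  switch (SEL) {
  case CM_ScalarEpilogueAllowed:
    break;
  case CM_ScalarEpilogueNotAllowedLowTripLoop:
    Msgs.push_back("LV: Not allowing scalar epilogue due to low trip count.");
    [[fallthrough]];
  case CM_ScalarEpilogueNotAllowedOptSize:
    Msgs.push_back("LV: Not allowing scalar epilogue due to -Os/-Oz.");
    break;
  }
  return Msgs;
}

// One possible fix: each case reports only its own reason.
std::vector<std::string> messagesWithoutFallThrough(ScalarEpilogueLowering SEL) {
  std::vector<std::string> Msgs;
  switch (SEL) {
  case CM_ScalarEpilogueAllowed:
    break;
  case CM_ScalarEpilogueNotAllowedLowTripLoop:
    Msgs.push_back("LV: Not allowing scalar epilogue due to low trip count.");
    break;
  case CM_ScalarEpilogueNotAllowedOptSize:
    Msgs.push_back("LV: Not allowing scalar epilogue due to -Os/-Oz.");
    break;
  }
  assert(Msgs.size() <= 1 && "each case reports at most one message");
  return Msgs;
}
```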

SjoerdMeijer mentioned this in rL366989: [Clang] New loop pragma vectorize_predicate. Jul 25 2019, 12:34 AM

SjoerdMeijer mentioned this in rGa48f58c97fec: [Clang] New loop pragma vectorize_predicate.

SjoerdMeijer mentioned this in rL366993: [LV] Scalar Epilogue Lowering. NFC. Jul 25 2019, 1:05 AM

SjoerdMeijer mentioned this in rG5c606cef796e: [LV] Scalar Epilogue Lowering. NFC. Jul 25 2019, 1:11 AM

About:

We probably need to discuss whether vectorize_predicate(enable) should (or should not) implicitly turn on vectorize(enable). I guess the current behavior is "does not", right? We don't have to discuss that in this review, but we still want to make a conscious decision one way or the other, or did I miss that discussion?

Nope, you're exactly right. We haven't discussed this yet, it had also crossed my mind, and we should discuss it. Your statement about the current behaviour is also right.
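
For reference, this is roughly how the two pragmas under discussion look in source code. `vectorize_predicate` is the Clang loop pragma added in D64744. Since the current behaviour is that it does not imply `vectorize(enable)`, the sketch spells out both. The function name and signature are made up for illustration, and whether the loop is actually vectorised with a folded tail depends on the target and the cost model.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example function. Compilers other than Clang are expected to
// ignore the unknown pragma, possibly with a warning.
void addArrays(int *__restrict A, const int *__restrict B,
               const int *__restrict C, std::size_t N) {
  assert(A && B && C && "null arrays are not supported in this sketch");
#pragma clang loop vectorize(enable) vectorize_predicate(enable)
  for (std::size_t I = 0; I < N; ++I)
    A[I] = B[I] + C[I];
}
```

The pragma only changes how the loop may be lowered; the result of the function is the same with or without it.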

I will first look into addressing previous comments. My responses might be delayed due to an upcoming holiday, but finishing this is my highest priority.

For SVE we found that there are sometimes benefits to using an unpredicated vector body plus a predicated tail. When the main vectorized loop body is unpredicated, we know all lanes in the vector are executed and can produce a more efficient set of instructions. The scalar tail can then still be vectorized using predication to mask off the inactive lanes, or, depending on the cost of vectorizing the tail loop, the compiler may choose not to vectorize the tail loop at all. It would be nice if your design allows for this use case.
So maybe instead of having a boolean 'llvm.loop.vectorize.predicate.enable' you can make it into an enum, or perhaps rename the attribute to emphasise the difference so we can add this logic later?
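
The use case described in this comment, an unpredicated vector body followed by a predicated vector tail, might look like the following scalar model. The vectorization factor `VF` and the function name are hypothetical, and lanes are modelled with an inner loop. The main body runs with all lanes active and no mask; only the final, partial iteration computes a per-lane predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

std::vector<int> addWithPredicatedTail(const std::vector<int> &B,
                                       const std::vector<int> &C) {
  assert(B.size() == C.size() && "inputs must have the same length");
  const std::size_t N = B.size();
  std::vector<int> A(N);
  std::size_t I = 0;

  // Unpredicated main body: every lane is known to be active.
  for (; I + VF <= N; I += VF)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      A[I + Lane] = B[I + Lane] + C[I + Lane];

  // Predicated tail: at most one extra "vector" iteration, with the lanes
  // past the trip count masked off.
  if (I < N)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      if (I + Lane < N)
        A[I + Lane] = B[I + Lane] + C[I + Lane];

  return A;
}
```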

llvm/lib/Analysis/LoopInfo.cpp
516	nit: return Name.equals(S->getString()) && mdconst::extract<ConstantInt>(MD->getOperand(1))->getZExtValue();
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7344	nit: unnecessary whitespace.
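
The nit on LoopInfo.cpp line 516 is the usual simplification of `if (cond) return true; else return false;` into `return cond;`. A self-contained illustration, using a hypothetical stand-in for the metadata operand (the struct and both function names are not from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a (name, value) loop metadata entry.
struct LoopHint {
  std::string Name;
  std::uint64_t Value;
};

// Shape of the code as submitted: explicit true/false branches.
bool isEnabledVerbose(const LoopHint &H, const std::string &Name) {
  assert(!Name.empty() && "expected a non-empty hint name");
  if (Name == H.Name && H.Value)
    return true;
  else
    return false;
}

// Shape suggested by the reviewer: return the condition directly.
bool isEnabledDirect(const LoopHint &H, const std::string &Name) {
  assert(!Name.empty() && "expected a non-empty hint name");
  return Name == H.Name && H.Value != 0;
}
```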

Thanks for taking a look at this!

Some initial thoughts on this:

For SVE we found that there are sometimes benefits to using an unpredicated vector body plus a predicated tail. When the main vectorized loop-body is unpredicated, we know all lanes in the vector are executed and can produce more efficient set of instructions. The scalar tail can then still be vectorized using predication to mask off the inactive lanes, or depending on the cost of vectorizing the tail loop the compiler may want to choose not vectorizing the tail loop at all. It would be nice if your design allows for this use-case.
So maybe instead of having a boolean 'llvm.loop.vectorize.predicate.enable' you can make it into an enum, or perhaps rename the attribute to emphasises the difference so we can add this logic later?

In the current flow, the only use case that we have so far is predicate.enable being set by a pragma. As with any other pragma, it is the user's responsibility to decide whether this makes sense, is profitable, etc.

Another use case is that predicate.enable is set by a loop vectorisation profitability analysis. Whether this is profitable or not will indeed depend on the target (SVE, MVE, AVX, etc.), the core implementation, and different loop properties. So I can imagine that different target hooks will be required for this decision making, which can then result in setting predicate.enable. Thus, I don't think it excludes any use case; in fact, it is the groundwork for other use cases.

I think I've addressed all comments, the main ones are:

I've moved the loop hint handling to LoopVectorizationLegality.cpp
I've renamed ScalarEpilogueNotNeededPredicatePragma
and finally created a helper function to avoid some code duplication that has been bothering me for a while

fhahn added inline comments. Jul 25 2019, 6:08 AM

llvm/test/Transforms/LoopVectorize/tail_loop_folding.ll
2	It looks like assertions are not required for the test case.
6	If this test relies on the x86 cost model/x86 masked instructions, it should go into the subfolder I think.

I've moved the test case to the X86 subfolder and removed the ASSERT.

I obviously want to add some MVE tests too, but will do that later. The X86 cost model and masked instructions are a nice demonstrator :-)

ran clang-format.

Friendly ping :-)

Looks fine to me, but see what the other reviewers say.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
859–869	[nit] this seems unrelated?
7254–7255	[nit] formatting-only change?
7449–7450	[nit] unrelated change?

In D65197#1604940, @SjoerdMeijer wrote:

Friendly ping :-)

Looks like we are converging. One minor comment only.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
845	Thanks for addressing. May not be immediately effective, but should help if someone wants to move towards that direction.
4765	Thanks for taking care of it.
4784	Suggest moving this comment between the Lines 4806 and 4807.

Thanks for taking another look!

Feedback addressed, moved comment.

Sorry for asking, @hsaito, but I was just wondering and wanted to check if your last comment was in fact a LGTM with a minor nit.

LGTM, pending the discussion about the exact meaning of the newly introduced "vector predicate" pragma (expect this to happen outside of this review). Please wait for another day to give others last minute opportunity to give feedback.

This revision is now accepted and ready to land. Jul 31 2019, 9:52 AM

Many thanks for all your help and reviews!

I will start the discussion about the interaction between the vector predicate and vectorize pragmas as soon as I am back in the office. I will have a closer look first and try to form a better opinion, but my first thought at this moment is that enabling "vectorize_predicate" should simply imply "vectorize". As soon as I'm ready, I will upload a patch and perhaps send a message to cfe-dev.

Closed by commit rG20b198ec5ea7: [LV] Tail-Loop Folding (authored by SjoerdMeijer). Aug 1 2019, 11:24 AM

This revision was automatically updated to reflect the committed changes.

Ayal mentioned this in D67764: [LV] Forced vectorization with runtime checks and OptForSize. Sep 21 2019, 8:05 AM

SjoerdMeijer mentioned this in rL372694: [LV] Forced vectorization with runtime checks and OptForSize. Sep 24 2019, 1:03 AM

SjoerdMeijer mentioned this in rG0fcb3afb401c: [LV] Forced vectorization with runtime checks and OptForSize.

Revision Contents

Path	Size
llvm/include/llvm/Analysis/LoopInfo.h	4 lines
llvm/lib/Analysis/LoopInfo.cpp	28 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	84 lines
llvm/test/Transforms/LoopVectorize/tail_loop_folding.ll	79 lines

Diff 211460

llvm/include/llvm/Analysis/LoopInfo.h

[774 lines not shown]	public:
/// Add llvm.loop.unroll.disable to this loop's loop id metadata.		/// Add llvm.loop.unroll.disable to this loop's loop id metadata.
///		///
/// Remove existing unroll metadata and add unroll disable metadata to		/// Remove existing unroll metadata and add unroll disable metadata to
/// indicate the loop has already been unrolled. This prevents a loop		/// indicate the loop has already been unrolled. This prevents a loop
/// from being unrolled more than is directed by a pragma if the loop		/// from being unrolled more than is directed by a pragma if the loop
/// unrolling pass is run more than once (which it generally is).		/// unrolling pass is run more than once (which it generally is).
void setLoopAlreadyUnrolled();		void setLoopAlreadyUnrolled();

		/// Return true if the loop is annotated with pragma
		/// llvm.loop.vectorize.predicate.enable, and false otherwise.
		bool isAnnotatedVectorPredicate() const;

void dump() const;		void dump() const;
void dumpVerbose() const;		void dumpVerbose() const;

/// Return the debug location of the start of this loop.		/// Return the debug location of the start of this loop.
/// This looks for a BB terminating instruction with a known debug		/// This looks for a BB terminating instruction with a known debug
/// location by looking at the preheader and header blocks. If it		/// location by looking at the preheader and header blocks. If it
/// cannot find a terminating instruction with location information,		/// cannot find a terminating instruction with location information,
/// it returns an unknown location.		/// it returns an unknown location.
[457 lines not shown]

llvm/lib/Analysis/LoopInfo.cpp

[488 lines not shown]	void Loop::setLoopAlreadyUnrolled() {
MDNode *DisableUnrollMD =		MDNode *DisableUnrollMD =
MDNode::get(Context, MDString::get(Context, "llvm.loop.unroll.disable"));		MDNode::get(Context, MDString::get(Context, "llvm.loop.unroll.disable"));
MDNode *LoopID = getLoopID();		MDNode *LoopID = getLoopID();
MDNode *NewLoopID = makePostTransformationMetadata(		MDNode *NewLoopID = makePostTransformationMetadata(
Context, LoopID, {"llvm.loop.unroll."}, {DisableUnrollMD});		Context, LoopID, {"llvm.loop.unroll."}, {DisableUnrollMD});
setLoopID(NewLoopID);		setLoopID(NewLoopID);
}		}

		bool Loop::isAnnotatedVectorPredicate() const {
		MDNode *LoopID = getLoopID();
		if (!LoopID)
		return false;

		StringRef Name = "llvm.loop.vectorize.predicate.enable";
		// First operand should refer to the loop id itself.
		assert(LoopID->getNumOperands() > 0 && "requires at least one operand");
		assert(LoopID->getOperand(0) == LoopID && "invalid loop id");

		for (unsigned i = 1, e = LoopID->getNumOperands(); i < e; ++i) {
		MDNode *MD = dyn_cast<MDNode>(LoopID->getOperand(i));
		if (!MD)
		continue;

		MDString *S = dyn_cast<MDString>(MD->getOperand(0));
		if (!S)
		continue;

		if (Name.equals(S->getString()) &&
		sdesmalen: nit: return Name.equals(S->getString()) && mdconst::extract<ConstantInt>(MD->getOperand(1))->getZExtValue();
		mdconst::extract<ConstantInt>(MD->getOperand(1))->getZExtValue())
		return true;
		else
		return false;
		}
		return false;
		}

bool Loop::isAnnotatedParallel() const {		bool Loop::isAnnotatedParallel() const {
MDNode *DesiredLoopIdMetadata = getLoopID();		MDNode *DesiredLoopIdMetadata = getLoopID();

if (!DesiredLoopIdMetadata)		if (!DesiredLoopIdMetadata)
return false;		return false;

MDNode *ParallelAccesses =		MDNode *ParallelAccesses =
findOptionMDForLoop(this, "llvm.loop.parallel_accesses");		findOptionMDForLoop(this, "llvm.loop.parallel_accesses");
[566 lines not shown]
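
As a rough model of what `isAnnotatedVectorPredicate()` looks for: a loop ID is a list of operands, the first referring to the loop ID itself, and each remaining operand may be a node whose first operand is a name string and whose second is a value. The sketch below replaces LLVM's `MDNode` and `MDString` with hypothetical plain structs, so it only shows the search, not the LLVM API. Note one deliberate difference: it keeps scanning past non-matching names, whereas the `else return false` in the diff above ends the scan at the first operand that carries a name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one loop-hint operand, for example
// !{!"llvm.loop.vectorize.predicate.enable", i1 true}.
struct HintNode {
  std::string Name;
  std::uint64_t Value;
};

// Hypothetical stand-in for the loop ID. Operand 0 (the self reference) is
// omitted, so Hints holds the remaining operands.
struct LoopID {
  std::vector<HintNode> Hints;
};

bool isAnnotatedVectorPredicate(const LoopID *ID) {
  if (!ID)
    return false; // no loop metadata at all
  const std::string Name = "llvm.loop.vectorize.predicate.enable";
  for (const HintNode &H : ID->Hints) {
    assert(!H.Name.empty() && "hint nodes are expected to be named");
    if (H.Name == Name)
      return H.Value != 0; // hint present: enabled only if the value is true
  }
  return false; // hint not present
}
```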

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[835 lines not shown]

namespace llvm {		namespace llvm {

// Loop vectorization cost-model hints how the scalar epilogue loop should be		// Loop vectorization cost-model hints how the scalar epilogue loop should be
// lowered.		// lowered.
enum ScalarEpilogueLowering {		enum ScalarEpilogueLowering {
CM_ScalarEpilogueAllowed,		CM_ScalarEpilogueAllowed,
CM_ScalarEpilogueNotAllowedOptSize,		CM_ScalarEpilogueNotAllowedOptSize,
CM_ScalarEpilogueNotAllowedLowTripLoop		CM_ScalarEpilogueNotAllowedLowTripLoop,
		CM_ScalarEpilogueNotAllowedPredicatePragma
		hsaito: I think the nuance here is rather ScalarEpilogueNotNeededPredicatePragma. In other words, if scalar epilogue is needed for some other reason (but still okay to skip scalar epilogue execution when vector code executes), scalar epilogue can be emitted/utilized. Runtime vectorization legality check of all kinds fits in that profile. We shouldn't overload "predicated vector code" pragma with "don't emit scalar epilogue" meaning.
		hsaito: Thanks for addressing. May not be immediately effective, but should help if someone wants to move towards that direction.
};		};

/// LoopVectorizationCostModel - estimates the expected speedups due to		/// LoopVectorizationCostModel - estimates the expected speedups due to
/// vectorization.		/// vectorization.
/// In many cases vectorization is not profitable. This can happen because of		/// In many cases vectorization is not profitable. This can happen because of
/// a number of reasons. In this class we mainly attempt to predict the		/// a number of reasons. In this class we mainly attempt to predict the
/// expected speedup/slowdowns due to the supported instruction set. We use the		/// expected speedup/slowdowns due to the supported instruction set. We use the
/// TargetTransformInfo to query the different backends for the cost of		/// TargetTransformInfo to query the different backends for the cost of
/// different operations.		/// different operations.
class LoopVectorizationCostModel {		class LoopVectorizationCostModel {
public:		public:
LoopVectorizationCostModel(ScalarEpilogueLowering SEL, Loop *L,		LoopVectorizationCostModel(ScalarEpilogueLowering SEL, Loop *L,
PredicatedScalarEvolution &PSE,		PredicatedScalarEvolution &PSE,
LoopInfo *LI, LoopVectorizationLegality *Legal,		LoopInfo *LI, LoopVectorizationLegality *Legal,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
const TargetLibraryInfo *TLI, DemandedBits *DB,		const TargetLibraryInfo *TLI, DemandedBits *DB,
AssumptionCache *AC,		AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, const Function *F,		OptimizationRemarkEmitter *ORE, const Function *F,
const LoopVectorizeHints *Hints,		const LoopVectorizeHints *Hints,
InterleavedAccessInfo &IAI)		InterleavedAccessInfo &IAI)
: IsScalarEpilogueAllowed(SEL), TheLoop(L), PSE(PSE),		: IsScalarEpilogueAllowed(SEL), TheLoop(L), PSE(PSE),
LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE),		LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE),
TheFunction(F), Hints(Hints), InterleaveInfo(IAI) {}		TheFunction(F), Hints(Hints), InterleaveInfo(IAI) {}

		Meinersbur: [nit] this seems unrelated?
/// \return An upper bound for the vectorization factor, or None if		/// \return An upper bound for the vectorization factor, or None if
/// vectorization and interleaving should be avoided up front.		/// vectorization and interleaving should be avoided up front.
Optional<unsigned> computeMaxVF();		Optional<unsigned> computeMaxVF();

		/// \return True if runtime checks are required for vectorization, and false
		/// otherwise.
		bool runtimeChecksRequired();

/// \return The most profitable vectorization factor and the cost of that VF.		/// \return The most profitable vectorization factor and the cost of that VF.
/// This method checks every power of two up to MaxVF. If UserVF is not ZERO		/// This method checks every power of two up to MaxVF. If UserVF is not ZERO
/// then this vectorization factor will be selected if vectorization is		/// then this vectorization factor will be selected if vectorization is
/// possible.		/// possible.
VectorizationFactor selectVectorizationFactor(unsigned MaxVF);		VectorizationFactor selectVectorizationFactor(unsigned MaxVF);

/// Setup cost-based decisions for user vectorization factor.		/// Setup cost-based decisions for user vectorization factor.
void selectUserVectorizationFactor(unsigned UserVF) {		void selectUserVectorizationFactor(unsigned UserVF) {
[3,800 lines not shown]	for (auto &Induction : *Legal->getInductionVars()) {
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");		LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate		LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate
<< "\n");		<< "\n");
}		}

Uniforms[VF].insert(Worklist.begin(), Worklist.end());		Uniforms[VF].insert(Worklist.begin(), Worklist.end());
}		}

Optional<unsigned> LoopVectorizationCostModel::computeMaxVF() {		bool LoopVectorizationCostModel::runtimeChecksRequired() {
if (Legal->getRuntimePointerChecking()->Need && TTI.hasBranchDivergence()) {		LLVM_DEBUG(dbgs() << "LV: Performing code size checks.\n");
// TODO: It may by useful to do since it's still likely to be dynamically
// uniform if the target can skip.
LLVM_DEBUG(
dbgs() << "LV: Not inserting runtime ptr check for divergent target");

ORE->emit(
createMissedAnalysis("CantVersionLoopWithDivergentTarget")
<< "runtime pointer checks needed. Not enabled for divergent target");

return None;
}

unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
if (isScalarEpilogueAllowed())
return computeFeasibleMaxVF(TC);

LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue.\n" <<
"LV: Performing code size checks.\n");

if (Legal->getRuntimePointerChecking()->Need) {		if (Legal->getRuntimePointerChecking()->Need) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime pointer checks needed. Enable vectorization of this "		<< "runtime pointer checks needed. Enable vectorization of this "
"loop with '#pragma clang loop vectorize(enable)' when "		"loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
return None;		return true;
}		}

if (!PSE.getUnionPredicate().getPredicates().empty()) {		if (!PSE.getUnionPredicate().getPredicates().empty()) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime SCEV checks needed. Enable vectorization of this "		<< "runtime SCEV checks needed. Enable vectorization of this "
"loop with '#pragma clang loop vectorize(enable)' when "		"loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime SCEV check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime SCEV check is required with -Os/-Oz.\n");
return None;		return true;
}		}

// FIXME: Avoid specializing for stride==1 instead of bailing out.		// FIXME: Avoid specializing for stride==1 instead of bailing out.
if (!Legal->getLAI()->getSymbolicStrides().empty()) {		if (!Legal->getLAI()->getSymbolicStrides().empty()) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime stride == 1 checks needed. Enable vectorization of "		<< "runtime stride == 1 checks needed. Enable vectorization of "
"this loop with '#pragma clang loop vectorize(enable)' when "		"this loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime stride check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime stride check is required with -Os/-Oz.\n");
		return true;
		}

		return false;
		}

		Optional<unsigned> LoopVectorizationCostModel::computeMaxVF() {
		if (Legal->getRuntimePointerChecking()->Need && TTI.hasBranchDivergence()) {
		// TODO: It may by useful to do since it's still likely to be dynamically
		// uniform if the target can skip.
		LLVM_DEBUG(
		dbgs() << "LV: Not inserting runtime ptr check for divergent target");

		ORE->emit(
		createMissedAnalysis("CantVersionLoopWithDivergentTarget")
		<< "runtime pointer checks needed. Not enabled for divergent target");

return None;		return None;
}		}

// If we optimize the program for size, avoid creating the tail loop.		unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');		LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

if (TC == 1) {		if (TC == 1) {
ORE->emit(createMissedAnalysis("SingleIterationLoop")		ORE->emit(createMissedAnalysis("SingleIterationLoop")
<< "loop trip count is one, irrelevant for vectorization");		<< "loop trip count is one, irrelevant for vectorization");
LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");		LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
return None;		return None;
}		}

// Record that scalar epilogue is not allowed.		switch (IsScalarEpilogueAllowed) {
		default: return None;
		case CM_ScalarEpilogueAllowed:
		return computeFeasibleMaxVF(TC);
		case CM_ScalarEpilogueNotAllowedPredicatePragma:
		LLVM_DEBUG(dbgs() << "LV: vector predicate pragma found.\n"
		<< "LV: creating predicated vector loop.\n");
		break;
		case CM_ScalarEpilogueNotAllowedLowTripLoop:
		hsaito: -Os/-Oz message comes out from fall through. Not desired.
		hsaito: Thanks for taking care of it.
		LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to low trip "
		<< "count.\n");
		case CM_ScalarEpilogueNotAllowedOptSize:
LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");		LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");
		// Bail if runtime checks are required, which are not good when optimising
		// for size.
		if (runtimeChecksRequired())
		return None;
		break;
		}

		// Now try the tail folding

// We don't create an epilogue when optimizing for size.
// Invalidate interleave groups that require an epilogue if we can't mask		// Invalidate interleave groups that require an epilogue if we can't mask
// the interleave-group.		// the interleave-group.
if (!useMaskedInterleavedAccesses(TTI))		if (!useMaskedInterleavedAccesses(TTI))
InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();		InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();

		// Bail if we don't have a tail at all.
		hsaito: How about "// Accept MaxVF if we don't have a tail at all." and move the comment inside IF.
		hsaito: Suggest moving this comment between the Lines 4806 and 4807.
unsigned MaxVF = computeFeasibleMaxVF(TC);		unsigned MaxVF = computeFeasibleMaxVF(TC);

if (TC > 0 && TC % MaxVF == 0) {		if (TC > 0 && TC % MaxVF == 0) {
LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");		LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
return MaxVF;		return MaxVF;
}		}

// If we don't know the precise trip count, or if the trip count that we		// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail		// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.		// by masking.
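
The check above can be read as plain arithmetic. A trip count of 0 stands for "not known at compile time"; otherwise the tail has `TC % MaxVF` iterations. Because candidate vectorization factors are powers of two up to MaxVF, a trip count divisible by MaxVF leaves no tail for any of them. A small sketch with a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper mirroring the condition `TC > 0 && TC % MaxVF == 0`.
// TC == 0 stands for "trip count not known at compile time".
bool noTailForAnyVF(unsigned TC, unsigned MaxVF) {
  assert(MaxVF > 0 && "vectorization factor must be non-zero");
  return TC > 0 && TC % MaxVF == 0;
}
```

For example, a trip count of 64 with MaxVF 8 leaves no tail, while a trip count of 430 leaves 430 % 8 = 6 remainder iterations that would have to be handled by masking.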
[2,448 lines not shown]	static bool processLoopInVPlanNativePath(
Function *F = L->getHeader()->getParent();		Function *F = L->getHeader()->getParent();
InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());		InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());

ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;		ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&		if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
(F->hasOptSize() ||		(F->hasOptSize() ||
llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))		llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
SEL = CM_ScalarEpilogueNotAllowedOptSize;		SEL = CM_ScalarEpilogueNotAllowedOptSize;
		else if (L->isAnnotatedVectorPredicate())
		SEL = CM_ScalarEpilogueNotAllowedPredicatePragma;

LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI,		LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI,
DB, AC, ORE, F, &Hints, IAI);		DB, AC, ORE, F, &Hints, IAI);
// Use the planner for outer loop vectorization.		// Use the planner for outer loop vectorization.
		Meinersbur: [nit] formatting-only change?
// TODO: CM is not used at this point inside the planner. Turn CM into an		// TODO: CM is not used at this point inside the planner. Turn CM into an
// optional argument if we don't need it in the future.		// optional argument if we don't need it in the future.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM);		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM);

// Get user vectorization factor.		// Get user vectorization factor.
const unsigned UserVF = Hints.getWidth();		const unsigned UserVF = Hints.getWidth();

// Plan how to best vectorize, return the best VF and its cost.		// Plan how to best vectorize, return the best VF and its cost.
[72 lines not shown]	if (!LVL.canVectorize(EnableVPlanNativePath)) {
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");		LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

// Check the function attributes and profiles to find out if this function		// Check the function attributes and profiles to find out if this function
// should be optimized for size.		// should be optimized for size.
ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;		ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;

		sdesmalen: nit: unnecessary whitespace.
if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&		if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
(F->hasOptSize() ||		(F->hasOptSize() ||
llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))		llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
SEL = CM_ScalarEpilogueNotAllowedOptSize;		SEL = CM_ScalarEpilogueNotAllowedOptSize;
		else if (L->isAnnotatedVectorPredicate())
		SEL = CM_ScalarEpilogueNotAllowedPredicatePragma;

// Entrance to the VPlan-native vectorization path. Outer loops are processed		// Entrance to the VPlan-native vectorization path. Outer loops are processed
// here. They may require CFG and instruction level transformations before		// here. They may require CFG and instruction level transformations before
// even evaluating whether vectorization is profitable. Since we cannot modify		// even evaluating whether vectorization is profitable. Since we cannot modify
// the incoming IR, we need to build VPlan upfront in the vectorization		// the incoming IR, we need to build VPlan upfront in the vectorization
// pipeline.		// pipeline.
if (!L->empty())		if (!L->empty())
return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,		return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
(82 lines not shown)
#endif /* NDEBUG */

// Analyze interleaved memory accesses.		// Analyze interleaved memory accesses.
if (UseInterleaved) {		if (UseInterleaved) {
IAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));		IAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));
}		}

// Use the cost model.		// Use the cost model.
LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI,		LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI,
DB, AC, ORE, F, &Hints, IAI);		DB, AC, ORE, F, &Hints, IAI);
CM.collectValuesToIgnore();		CM.collectValuesToIgnore();
		Meinersbur: [nit] unrelated change?

// Use the planner for vectorization.		// Use the planner for vectorization.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM);		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM);

// Get user vectorization factor.		// Get user vectorization factor.
unsigned UserVF = Hints.getWidth();		unsigned UserVF = Hints.getWidth();

// Plan how to best vectorize, return the best VF and its cost.		// Plan how to best vectorize, return the best VF and its cost.
(270 lines not shown)

llvm/test/Transforms/LoopVectorize/tail_loop_folding.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt < %s -loop-vectorize -S | FileCheck %s
				fhahn: It looks like assertions are not required for the test case.

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				fhahn: If this test relies on the x86 cost model/x86 masked instructions, it should go into the subfolder I think.
				define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
				; CHECK-LABEL: tail_folding_enabled(
				; CHECK: vector.body:
				; CHECK: %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
				; CHECK: %wide.masked.load1 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
				; CHECK: %8 = add nsw <8 x i32> %wide.masked.load1, %wide.masked.load
				; CHECK: call void @llvm.masked.store.v8i32.p0v8i32(
				; CHECK: %index.next = add i64 %index, 8
				; CHECK: %12 = icmp eq i64 %index.next, 432
				; CHECK: br i1 %12, label %middle.block, label %vector.body, !llvm.loop !0

				entry:
				br label %for.body

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				store i32 %add, i32* %arrayidx4, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 430
				br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !6
				}

				define dso_local void @tail_folding_disabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
				; CHECK-LABEL: tail_folding_disabled(
				; CHECK: vector.body:
				; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(
				; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(
				; CHECK: br i1 %44, label {{.*}}, label %vector.body
				entry:
				br label %for.body

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				store i32 %add, i32* %arrayidx4, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 430
				br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !10
				}

				; CHECK: !0 = distinct !{!0, !1}
				; CHECK-NEXT: !1 = !{!"llvm.loop.isvectorized", i32 1}
				; CHECK-NEXT: !2 = distinct !{!2, !3, !1}
				; CHECK-NEXT: !3 = !{!"llvm.loop.unroll.runtime.disable"}
				; CHECK-NEXT: !4 = distinct !{!4, !1}
				; CHECK-NEXT: !5 = distinct !{!5, !3, !1}

				attributes #0 = { nounwind optsize uwtable "target-cpu"="core-avx2" "target-features"="+avx,+avx2" }

				!6 = distinct !{!6, !7, !8}
				!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
				!8 = !{!"llvm.loop.vectorize.enable", i1 true}

				!10 = distinct !{!10, !11, !12}
				!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 false}
				!12 = !{!"llvm.loop.vectorize.enable", i1 true}
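
For readers following the CHECK lines above: the first test expects the vector loop to exit at 432 even though the scalar loop exits at 430. A minimal sketch of why, assuming a vector width of 8 (taken from the <8 x i32> types in the checks), is given below. It is a plain scalar simulation, not the vectorizer's implementation; the names VF, roundedTripCount, addScalar and addTailFolded are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed vector width, matching the <8 x i32> types in the CHECK lines.
constexpr std::size_t VF = 8;

// Trip count rounded up to a multiple of VF. With tail folding there is no
// scalar epilogue, so the vector loop runs up to this bound.
constexpr std::size_t roundedTripCount(std::size_t N) {
  return (N + VF - 1) / VF * VF;
}

// Scalar reference loop: A[i] = B[i] + C[i].
void addScalar(std::vector<int32_t> &A, const std::vector<int32_t> &B,
               const std::vector<int32_t> &C) {
  for (std::size_t I = 0; I < A.size(); ++I)
    A[I] = B[I] + C[I];
}

// Simulated tail-folded loop. Each outer iteration stands for one vector
// iteration; the per-lane mask plays the role of the predicate passed to the
// masked load/store intrinsics. Returns the number of vector iterations.
std::size_t addTailFolded(std::vector<int32_t> &A,
                          const std::vector<int32_t> &B,
                          const std::vector<int32_t> &C) {
  const std::size_t N = A.size();
  std::size_t Iterations = 0;
  for (std::size_t Index = 0; Index != roundedTripCount(N); Index += VF) {
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const bool Mask = Index + Lane < N; // lane is active only inside [0, N)
      if (Mask)
        A[Index + Lane] = B[Index + Lane] + C[Index + Lane];
    }
    ++Iterations;
  }
  return Iterations;
}
```

With N = 430 the bound is 432 and the loop runs 54 vector iterations; in the last one only the first 6 lanes are active, which is the work the scalar epilogue would otherwise have done.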