This is an archive of the discontinued LLVM Phabricator instance.

Differential D65197

[LV] Tail-loop Folding
ClosedPublic

Authored by SjoerdMeijer on Jul 24 2019, 3:57 AM.

Details

Reviewers

Meinersbur
hsaito
fhahn
samparker
dmgreen
rengolin

Commits

rL367592: [LV] Tail-Loop Folding
rG20b198ec5ea7: [LV] Tail-Loop Folding

Summary

This allows folding of the scalar epilogue loop (the tail) into the main
vectorised loop body when the loop is annotated with a "vector predicate"
metadata hint. To fold the tail, instructions need to be predicated (masked),
enabling/disabling lanes for the remainder iterations.

This depends on D64744 that introduces the llvm.loop.vectorize.predicate.enable
pragma and metadata node, and D64916 which is a refactoring to make tail
folding a more general concept.
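
As a rough illustration of what folding the tail means, the sketch below models a "vector" iteration as an inner loop over lanes. It is not LLVM code: the names `VF`, `addWithScalarEpilogue` and `addWithFoldedTail` are made up for this example, and a vectorization factor of 4 is assumed. The first function keeps a scalar epilogue for the remainder iterations; the second handles the remainder in the same loop by masking off lanes that fall past the end.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Assumed vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

// Vector body over full chunks, followed by a scalar epilogue (the tail).
void addWithScalarEpilogue(const std::vector<int> &a, const std::vector<int> &b,
                           std::vector<int> &out) {
  const std::size_t n = a.size();
  std::size_t i = 0;
  for (; i + VF <= n; i += VF)                    // all VF lanes are active
    for (std::size_t lane = 0; lane < VF; ++lane)
      out[i + lane] = a[i + lane] + b[i + lane];
  for (; i < n; ++i)                              // remainder iterations
    out[i] = a[i] + b[i];
}

// Tail folded into the main loop: every lane is guarded by a mask bit, so the
// last iteration simply disables the lanes beyond the trip count.
void addWithFoldedTail(const std::vector<int> &a, const std::vector<int> &b,
                       std::vector<int> &out) {
  const std::size_t n = a.size();
  for (std::size_t i = 0; i < n; i += VF) {
    std::array<bool, VF> mask{};
    for (std::size_t lane = 0; lane < VF; ++lane)
      mask[lane] = (i + lane) < n;                // lane enabled?
    for (std::size_t lane = 0; lane < VF; ++lane)
      if (mask[lane])                             // predicated operation
        out[i + lane] = a[i + lane] + b[i + lane];
  }
}
```

Both functions compute the same result; the folded version trades the separate epilogue loop for a mask computation in every iteration.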

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Jul 24 2019, 3:57 AM

Herald added a project: Restricted Project. Jul 24 2019, 3:57 AM

Herald added subscribers: rkruppe, hiraditya.

SjoerdMeijer added subscribers: huntergr, sdesmalen. Jul 24 2019, 5:04 AM

[serious] There is a LoopVectorizeHints class in LoopVectorizationLegality.cpp that should be used.

[serious] Documentation of llvm.loop.vectorize.predicate.enable is missing.

Just realized that the docs for llvm.loop.vectorize.predicate.enable are part of D64744, which otherwise is a clang-only patch.

Hi Michael, thanks for taking a look again!

Just realized that the docs for llvm.loop.vectorize.predicate.enable are part of D64744, which otherwise is a clang-only patch.

Yep, indeed, so I assume that's all good.

There is a LoopVectorizeHints class in LoopVectorizationLegality.cpp that should be used.

Ah yes, thanks for the suggestion, I will start looking into this, and will move the pragma handling to some function in there.

In D65197#1599082, @SjoerdMeijer wrote:

Just realized that the docs for llvm.loop.vectorize.predicate.enable are part of D64744, which otherwise is a clang-only patch.

Yep, indeed, so I assume that's all good.

Before we had the monorepo, reviewers frequently asked to split patches into the LLVM and Clang parts. With the monorepo, I am not sure the rule still needs to be followed. At least, I did not expect LLVM documentation in a clang patch, so sorry for the non-applicable comment.

In D65197#1599102, @Meinersbur wrote:

Before we had the monorepo, reviewers frequently asked to split patches into the LLVM and Clang parts. With the monorepo, I am not sure the rule still needs to be followed. At least, I did not expect LLVM documentation in a clang patch, so sorry for the non-applicable comment.

It does because we're still committing to SVN. Once we enable write mode on the monorepo, that'll change.

Ha, that's funny, because before noticing these comments here, I was just doing a test commit (366904) with the github monorepo workflow.
With the discussion on the dev list that the transition date is near, and just following the public documentation in https://llvm.org/docs/GettingStarted.html, it really looks like I committed to the clang and llvm repo at the same time using the git llvm push script from a local git monorepo. I was of course aware of separating clang and llvm patches, but again, thought that this new workflow is fully accepted/supported.

Anyway, back to looking at LoopVectorizationLegality.cpp :-)

We probably need to discuss whether vectorize_predicate(enable) should implicitly turn on vectorize(enable). I guess the current behavior is "does not", right? We don't have to discuss that in this review, but we still want to make a conscious decision one way or the other, or did I miss that discussion?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
849	I think the nuance here is rather ScalarEpilogueNotNeededPredicatePragma. In other words, if scalar epilogue is needed for some other reason (but still okay to skip scalar epilogue execution when vector code executes), scalar epilogue can be emitted/utilized. Runtime vectorization legality check of all kinds fits in that profile. We shouldn't overload "predicated vector code" pragma with "don't emit scalar epilogue" meaning.
4777	-Os/-Oz message comes out from fall through. Not desired.
4804	How about // Accept MaxVF if we don't have a tail at all. and move the comment inside IF.

SjoerdMeijer mentioned this in rL366989: [Clang] New loop pragma vectorize_predicate. Jul 25 2019, 12:34 AM

SjoerdMeijer mentioned this in rGa48f58c97fec: [Clang] New loop pragma vectorize_predicate.

SjoerdMeijer mentioned this in rL366993: [LV] Scalar Epilogue Lowering. NFC.. Jul 25 2019, 1:05 AM

SjoerdMeijer mentioned this in rG5c606cef796e: [LV] Scalar Epilogue Lowering. NFC.. Jul 25 2019, 1:11 AM

About:

We probably need to discuss whether vectorize_predicate(enable) should implicitly turn on vectorize(enable). I guess the current behavior is "does not", right? We don't have to discuss that in this review, but we still want to make a conscious decision one way or the other, or did I miss that discussion?

Nope, you're exactly right. We haven't discussed this yet, it had also crossed my mind, and we should discuss it. Your statement about the current behaviour is also right.
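
For context, a minimal sketch of how the two loop hints can be written together in source. Since, as agreed above, vectorize_predicate(enable) did not imply vectorize(enable) at the time of this review, the example spells out both. The function `scale` is made up for illustration; whether the loop is actually vectorised with a folded tail still depends on the target and the cost model, and compilers other than Clang will ignore the pragma.

```cpp
#include <cstddef>

// Request vectorization and, in addition, predication of the vector loop so
// that no scalar epilogue is needed for the remainder iterations.
void scale(float *dst, const float *src, float k, std::size_t n) {
#pragma clang loop vectorize(enable) vectorize_predicate(enable)
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = k * src[i];
}
```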

I will first look into addressing previous comments. My responses might be delayed due to an upcoming holiday, but finishing this is my highest priority.

For SVE we found that there are sometimes benefits to using an unpredicated vector body plus a predicated tail. When the main vectorized loop body is unpredicated, we know all lanes in the vector are executed and can produce a more efficient set of instructions. The scalar tail can then still be vectorized using predication to mask off the inactive lanes, or, depending on the cost of vectorizing the tail loop, the compiler may choose not to vectorize the tail loop at all. It would be nice if your design allows for this use-case.
So maybe instead of having a boolean 'llvm.loop.vectorize.predicate.enable' you can make it into an enum, or perhaps rename the attribute to emphasise the difference so we can add this logic later?
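
The alternative shape described here, an unpredicated main body followed by a predicated vector tail, can be sketched as follows. This is a lane-by-lane model in plain C++, not SVE or LLVM code; `kVF` and `sumUnpredicatedBodyPredicatedTail` are hypothetical names and a vector length of 4 is assumed.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Assumed number of lanes, for illustration only.
constexpr std::size_t kVF = 4;

int sumUnpredicatedBodyPredicatedTail(const std::vector<int> &v) {
  const std::size_t n = v.size();
  std::array<int, kVF> acc{};                 // one partial sum per lane
  std::size_t i = 0;

  // Main body: every lane is known to be active, so no mask is needed.
  for (; i + kVF <= n; i += kVF)
    for (std::size_t lane = 0; lane < kVF; ++lane)
      acc[lane] += v[i + lane];

  // Tail: at most one extra "vector" iteration, with inactive lanes masked off.
  if (i < n)
    for (std::size_t lane = 0; lane < kVF; ++lane) {
      const bool active = (i + lane) < n;
      if (active)
        acc[lane] += v[i + lane];
    }

  int total = 0;                              // final horizontal reduction
  for (int x : acc)
    total += x;
  return total;
}
```

Compared with folding the tail into the main loop, only the final iteration pays for the mask.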

llvm/lib/Analysis/LoopInfo.cpp
516 ↗	(On Diff #211460)	nit: return Name.equals(S->getString()) && mdconst::extract<ConstantInt>(MD->getOperand(1))->getZExtValue();
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7370	nit: unnecessary whitespace.

Thanks for taking a look at this!

Some initial thoughts on this:

For SVE we found that there are sometimes benefits to using an unpredicated vector body plus a predicated tail. When the main vectorized loop body is unpredicated, we know all lanes in the vector are executed and can produce a more efficient set of instructions. The scalar tail can then still be vectorized using predication to mask off the inactive lanes, or, depending on the cost of vectorizing the tail loop, the compiler may choose not to vectorize the tail loop at all. It would be nice if your design allows for this use-case.
So maybe instead of having a boolean 'llvm.loop.vectorize.predicate.enable' you can make it into an enum, or perhaps rename the attribute to emphasise the difference so we can add this logic later?

In the current flow, the only use case that we have so far is that predicate.enable is set by a pragma. As it is a pragma, like any other pragma, it is the user's responsibility to decide whether this makes sense and is profitable, etc.

Another use case is that predicate.enable is set by a loop vectorisation profitability analysis. Whether this is profitable or not will indeed depend on the target (SVE, MVE, AVX, etc.), the core implementation, and different loop properties. So I can imagine that different target hooks will be required for this decision making, which can then result in setting predicate.enable. Thus, I don't think it excludes any use case, but in fact is the groundwork for other use cases.

I think I've addressed all comments, the main ones are:

I've moved the loop hint handling to LoopVectorizationLegality.cpp
I've renamed ScalarEpilogueNotNeededPredicatePragma
and finally created a helper function to avoid some code duplication that has been bothering me for a while

fhahn added inline comments. Jul 25 2019, 6:08 AM

llvm/test/Transforms/LoopVectorize/tail_loop_folding.ll
1 ↗	(On Diff #211725)	It looks like assertions are not required for the test case.
5 ↗	(On Diff #211725)	If this test relies on the x86 cost model/x86 masked instructions, it should go into the subfolder I think.

I've moved the test case to the X86 subfolder and removed the ASSERT.

I obviously want to add some MVE tests too, but will do that later. The X86 cost model and masked instructions are a nice demonstrator :-)

ran clang-format.

Friendly ping :-)

Looks fine to me, but see what the other reviewers say.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
869–880	[nit] this seems unrelated?
7280–7282	[nit] formatting-only change?
7465–7467	[nit] unrelated change?

In D65197#1604940, @SjoerdMeijer wrote:

Friendly ping :-)

Looks like we are converging. One minor comment only.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
849	Thanks for addressing. May not be immediately effective, but should help if someone wants to move towards that direction.
4777	Thanks for taking care of it.
4804	Suggest moving this comment between the Lines 4806 and 4807.

Thanks for taking another look!

Feedback addressed, moved comment.

Sorry for asking, @hsaito, but I was just wondering and wanted to check if your last comment was in fact a LGTM with a minor nit.

LGTM, pending the discussion about the exact meaning of the newly introduced "vector predicate" pragma (expect this to happen outside of this review). Please wait for another day to give others last minute opportunity to give feedback.

This revision is now accepted and ready to land. Jul 31 2019, 9:52 AM

Many thanks for all your help and reviews!

I will start the discussion about the interaction between the vector predicate and vectorize pragmas as soon as I am back in the office. I will have a closer look first and try to form a better opinion, but my first thought at this moment is that enabling "vectorize_predicate" should simply imply "vectorize". As soon as I'm ready, I will upload a patch and perhaps a message to cfe dev.

Closed by commit rG20b198ec5ea7: [LV] Tail-Loop Folding (authored by SjoerdMeijer). Aug 1 2019, 11:24 AM

This revision was automatically updated to reflect the committed changes.

Ayal mentioned this in D67764: [LV] Forced vectorization with runtime checks and OptForSize. Sep 21 2019, 8:05 AM

SjoerdMeijer mentioned this in rL372694: [LV] Forced vectorization with runtime checks and OptForSize. Sep 24 2019, 1:03 AM

SjoerdMeijer mentioned this in rG0fcb3afb401c: [LV] Forced vectorization with runtime checks and OptForSize.

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	7 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	7 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	146 lines
llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll	78 lines

Diff 212861

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

... 49 lines hidden ...
/// This class keeps a number of loop annotations locally (as member variables)		/// This class keeps a number of loop annotations locally (as member variables)
/// and can, upon request, write them back as metadata on the loop. It will		/// and can, upon request, write them back as metadata on the loop. It will
/// initially scan the loop for existing metadata, and will update the local		/// initially scan the loop for existing metadata, and will update the local
/// values based on information in the loop.		/// values based on information in the loop.
/// We cannot write all values to metadata, as the mere presence of some info,		/// We cannot write all values to metadata, as the mere presence of some info,
/// for example 'force', means a decision has been made. So, we need to be		/// for example 'force', means a decision has been made. So, we need to be
/// careful NOT to add them if the user hasn't specifically asked so.		/// careful NOT to add them if the user hasn't specifically asked so.
class LoopVectorizeHints {		class LoopVectorizeHints {
enum HintKind { HK_WIDTH, HK_UNROLL, HK_FORCE, HK_ISVECTORIZED };		enum HintKind { HK_WIDTH, HK_UNROLL, HK_FORCE, HK_ISVECTORIZED,
		HK_PREDICATE };

/// Hint - associates name and validation with the hint value.		/// Hint - associates name and validation with the hint value.
struct Hint {		struct Hint {
const char *Name;		const char *Name;
unsigned Value; // This may have to change for non-numeric values.		unsigned Value; // This may have to change for non-numeric values.
HintKind Kind;		HintKind Kind;

Hint(const char *Name, unsigned Value, HintKind Kind)		Hint(const char *Name, unsigned Value, HintKind Kind)
... 9 lines hidden ...	class LoopVectorizeHints {
Hint Interleave;		Hint Interleave;

/// Vectorization forced		/// Vectorization forced
Hint Force;		Hint Force;

/// Already Vectorized		/// Already Vectorized
Hint IsVectorized;		Hint IsVectorized;

		/// Vector Predicate
		Hint Predicate;

/// Return the loop metadata prefix.		/// Return the loop metadata prefix.
static StringRef Prefix() { return "llvm.loop."; }		static StringRef Prefix() { return "llvm.loop."; }

/// True if there is any unsafe math in the loop.		/// True if there is any unsafe math in the loop.
bool PotentiallyUnsafe = false;		bool PotentiallyUnsafe = false;

public:		public:
enum ForceKind {		enum ForceKind {
... 12 lines hidden ...	bool allowVectorization(Function *F, Loop *L,
bool VectorizeOnlyWhenForced) const;		bool VectorizeOnlyWhenForced) const;

/// Dumps all the hint information.		/// Dumps all the hint information.
void emitRemarkWithHints() const;		void emitRemarkWithHints() const;

unsigned getWidth() const { return Width.Value; }		unsigned getWidth() const { return Width.Value; }
unsigned getInterleave() const { return Interleave.Value; }		unsigned getInterleave() const { return Interleave.Value; }
unsigned getIsVectorized() const { return IsVectorized.Value; }		unsigned getIsVectorized() const { return IsVectorized.Value; }
		unsigned getPredicate() const { return Predicate.Value; }
enum ForceKind getForce() const {		enum ForceKind getForce() const {
if ((ForceKind)Force.Value == FK_Undefined &&		if ((ForceKind)Force.Value == FK_Undefined &&
hasDisableAllTransformsHint(TheLoop))		hasDisableAllTransformsHint(TheLoop))
return FK_Disabled;		return FK_Disabled;
return (ForceKind)Force.Value;		return (ForceKind)Force.Value;
}		}

/// If hints are provided that force vectorization, use the AlwaysPrint		/// If hints are provided that force vectorization, use the AlwaysPrint
... 364 lines hidden ...

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

... 82 lines hidden ...	bool LoopVectorizeHints::Hint::validate(unsigned Val) {
switch (Kind) {		switch (Kind) {
case HK_WIDTH:		case HK_WIDTH:
return isPowerOf2_32(Val) && Val <= VectorizerParams::MaxVectorWidth;		return isPowerOf2_32(Val) && Val <= VectorizerParams::MaxVectorWidth;
case HK_UNROLL:		case HK_UNROLL:
return isPowerOf2_32(Val) && Val <= MaxInterleaveFactor;		return isPowerOf2_32(Val) && Val <= MaxInterleaveFactor;
case HK_FORCE:		case HK_FORCE:
return (Val <= 1);		return (Val <= 1);
case HK_ISVECTORIZED:		case HK_ISVECTORIZED:
		case HK_PREDICATE:
return (Val == 0 || Val == 1);		return (Val == 0 || Val == 1);
}		}
return false;		return false;
}		}

LoopVectorizeHints::LoopVectorizeHints(const Loop *L,		LoopVectorizeHints::LoopVectorizeHints(const Loop *L,
bool InterleaveOnlyWhenForced,		bool InterleaveOnlyWhenForced,
OptimizationRemarkEmitter &ORE)		OptimizationRemarkEmitter &ORE)
: Width("vectorize.width", VectorizerParams::VectorizationFactor, HK_WIDTH),		: Width("vectorize.width", VectorizerParams::VectorizationFactor, HK_WIDTH),
Interleave("interleave.count", InterleaveOnlyWhenForced, HK_UNROLL),		Interleave("interleave.count", InterleaveOnlyWhenForced, HK_UNROLL),
Force("vectorize.enable", FK_Undefined, HK_FORCE),		Force("vectorize.enable", FK_Undefined, HK_FORCE),
IsVectorized("isvectorized", 0, HK_ISVECTORIZED), TheLoop(L), ORE(ORE) {		IsVectorized("isvectorized", 0, HK_ISVECTORIZED),
		Predicate("vectorize.predicate.enable", 0, HK_PREDICATE), TheLoop(L),
		ORE(ORE) {
// Populate values with existing loop metadata.		// Populate values with existing loop metadata.
getHintsFromMetadata();		getHintsFromMetadata();

// force-vector-interleave overrides DisableInterleaving.		// force-vector-interleave overrides DisableInterleaving.
if (VectorizerParams::isInterleaveForced())		if (VectorizerParams::isInterleaveForced())
Interleave.Value = VectorizerParams::VectorizationInterleave;		Interleave.Value = VectorizerParams::VectorizationInterleave;

if (IsVectorized.Value != 1)		if (IsVectorized.Value != 1)
... 134 lines hidden ...	if (!Name.startswith(Prefix()))
return;		return;
Name = Name.substr(Prefix().size(), StringRef::npos);		Name = Name.substr(Prefix().size(), StringRef::npos);

const ConstantInt *C = mdconst::dyn_extract<ConstantInt>(Arg);		const ConstantInt *C = mdconst::dyn_extract<ConstantInt>(Arg);
if (!C)		if (!C)
return;		return;
unsigned Val = C->getZExtValue();		unsigned Val = C->getZExtValue();

Hint *Hints[] = {&Width, &Interleave, &Force, &IsVectorized};		Hint *Hints[] = {&Width, &Interleave, &Force, &IsVectorized, &Predicate};
for (auto H : Hints) {		for (auto H : Hints) {
if (Name == H->Name) {		if (Name == H->Name) {
if (H->validate(Val))		if (H->validate(Val))
H->Value = Val;		H->Value = Val;
else		else
LLVM_DEBUG(dbgs() << "LV: ignoring invalid hint '" << Name << "'\n");		LLVM_DEBUG(dbgs() << "LV: ignoring invalid hint '" << Name << "'\n");
break;		break;
}		}
... 988 lines hidden ...
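
The validation switch and the name lookup in the LoopVectorizationLegality.cpp hunks above can be modelled outside LLVM roughly as follows. This is a simplified sketch, not the LLVM implementation: `kMaxVectorWidth` and `kMaxInterleave` are stand-in limits, `isPowerOf2` replaces `isPowerOf2_32`, and `applyHint` is a hypothetical helper that plays the role of the loop over `Hints`.

```cpp
#include <string>
#include <vector>

enum HintKind { HK_WIDTH, HK_UNROLL, HK_FORCE, HK_ISVECTORIZED, HK_PREDICATE };

// Stand-in limits (assumptions for this sketch).
constexpr unsigned kMaxVectorWidth = 64;
constexpr unsigned kMaxInterleave = 16;

bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

struct Hint {
  std::string Name;
  unsigned Value;
  HintKind Kind;

  // Same shape as the switch in the diff: the new predicate hint shares the
  // boolean check used for "isvectorized".
  bool validate(unsigned Val) const {
    switch (Kind) {
    case HK_WIDTH:
      return isPowerOf2(Val) && Val <= kMaxVectorWidth;
    case HK_UNROLL:
      return isPowerOf2(Val) && Val <= kMaxInterleave;
    case HK_FORCE:
      return Val <= 1;
    case HK_ISVECTORIZED:
    case HK_PREDICATE:
      return Val == 0 || Val == 1;
    }
    return false;
  }
};

// Stores Val in the hint called Name if it validates; invalid or unknown
// hints are ignored. Returns true when the value was accepted.
bool applyHint(std::vector<Hint> &Hints, const std::string &Name, unsigned Val) {
  for (Hint &H : Hints) {
    if (H.Name != Name)
      continue;
    if (!H.validate(Val))
      return false;
    H.Value = Val;
    return true;
  }
  return false;
}
```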

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

... 833 lines hidden ...	void InnerLoopVectorizer::addMetadata(ArrayRef<Value *> To,
}		}
}		}

namespace llvm {		namespace llvm {

// Loop vectorization cost-model hints how the scalar epilogue loop should be		// Loop vectorization cost-model hints how the scalar epilogue loop should be
// lowered.		// lowered.
enum ScalarEpilogueLowering {		enum ScalarEpilogueLowering {

		// The default: allowing scalar epilogues.
CM_ScalarEpilogueAllowed,		CM_ScalarEpilogueAllowed,

		// Vectorization with OptForSize: don't allow epilogues.
CM_ScalarEpilogueNotAllowedOptSize,		CM_ScalarEpilogueNotAllowedOptSize,
CM_ScalarEpilogueNotAllowedLowTripLoop
		// A special case of vectorisation with OptForSize: loops with a very small
		// trip count are considered for vectorization under OptForSize, thereby
		// making sure the cost of their loop body is dominant, free of runtime
		// guards and scalar iteration overheads.
		CM_ScalarEpilogueNotAllowedLowTripLoop,

		// Loop hint predicate indicating an epilogue is undesired.
		CM_ScalarEpilogueNotNeededPredicatePragma
};		};

/// LoopVectorizationCostModel - estimates the expected speedups due to		/// LoopVectorizationCostModel - estimates the expected speedups due to
/// vectorization.		/// vectorization.
/// In many cases vectorization is not profitable. This can happen because of		/// In many cases vectorization is not profitable. This can happen because of
/// a number of reasons. In this class we mainly attempt to predict the		/// a number of reasons. In this class we mainly attempt to predict the
/// expected speedup/slowdowns due to the supported instruction set. We use the		/// expected speedup/slowdowns due to the supported instruction set. We use the
/// TargetTransformInfo to query the different backends for the cost of		/// TargetTransformInfo to query the different backends for the cost of
/// different operations.		/// different operations.
class LoopVectorizationCostModel {		class LoopVectorizationCostModel {
public:		public:
LoopVectorizationCostModel(ScalarEpilogueLowering SEL, Loop *L,		LoopVectorizationCostModel(ScalarEpilogueLowering SEL, Loop *L,
PredicatedScalarEvolution &PSE,		PredicatedScalarEvolution &PSE, LoopInfo *LI,
LoopInfo *LI, LoopVectorizationLegality *Legal,		LoopVectorizationLegality *Legal,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
const TargetLibraryInfo *TLI, DemandedBits *DB,		const TargetLibraryInfo *TLI, DemandedBits *DB,
AssumptionCache *AC,		AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, const Function *F,		OptimizationRemarkEmitter *ORE, const Function *F,
const LoopVectorizeHints *Hints,		const LoopVectorizeHints *Hints,
InterleavedAccessInfo &IAI)		InterleavedAccessInfo &IAI)
: ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE),		: ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE), LI(LI), Legal(Legal),
LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE),		TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), TheFunction(F),
TheFunction(F), Hints(Hints), InterleaveInfo(IAI) {}		Hints(Hints), InterleaveInfo(IAI) {}

/// \return An upper bound for the vectorization factor, or None if		/// \return An upper bound for the vectorization factor, or None if
/// vectorization and interleaving should be avoided up front.		/// vectorization and interleaving should be avoided up front.
Optional<unsigned> computeMaxVF();		Optional<unsigned> computeMaxVF();

		/// \return True if runtime checks are required for vectorization, and false
		/// otherwise.
		bool runtimeChecksRequired();

/// \return The most profitable vectorization factor and the cost of that VF.		/// \return The most profitable vectorization factor and the cost of that VF.
/// This method checks every power of two up to MaxVF. If UserVF is not ZERO		/// This method checks every power of two up to MaxVF. If UserVF is not ZERO
/// then this vectorization factor will be selected if vectorization is		/// then this vectorization factor will be selected if vectorization is
/// possible.		/// possible.
VectorizationFactor selectVectorizationFactor(unsigned MaxVF);		VectorizationFactor selectVectorizationFactor(unsigned MaxVF);

/// Setup cost-based decisions for user vectorization factor.		/// Setup cost-based decisions for user vectorization factor.
void selectUserVectorizationFactor(unsigned UserVF) {		void selectUserVectorizationFactor(unsigned UserVF) {
... 3,801 lines hidden ...	for (auto &Induction : *Legal->getInductionVars()) {
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");		LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate		LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate
<< "\n");		<< "\n");
}		}

Uniforms[VF].insert(Worklist.begin(), Worklist.end());		Uniforms[VF].insert(Worklist.begin(), Worklist.end());
}		}

Optional<unsigned> LoopVectorizationCostModel::computeMaxVF() {		bool LoopVectorizationCostModel::runtimeChecksRequired() {
if (Legal->getRuntimePointerChecking()->Need && TTI.hasBranchDivergence()) {		LLVM_DEBUG(dbgs() << "LV: Performing code size checks.\n");
// TODO: It may by useful to do since it's still likely to be dynamically
// uniform if the target can skip.
LLVM_DEBUG(
dbgs() << "LV: Not inserting runtime ptr check for divergent target");

ORE->emit(
createMissedAnalysis("CantVersionLoopWithDivergentTarget")
<< "runtime pointer checks needed. Not enabled for divergent target");

return None;
}

unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
if (isScalarEpilogueAllowed())
return computeFeasibleMaxVF(TC);

LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue.\n" <<
"LV: Performing code size checks.\n");

if (Legal->getRuntimePointerChecking()->Need) {		if (Legal->getRuntimePointerChecking()->Need) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime pointer checks needed. Enable vectorization of this "		<< "runtime pointer checks needed. Enable vectorization of this "
"loop with '#pragma clang loop vectorize(enable)' when "		"loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
return None;		return true;
}		}

if (!PSE.getUnionPredicate().getPredicates().empty()) {		if (!PSE.getUnionPredicate().getPredicates().empty()) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime SCEV checks needed. Enable vectorization of this "		<< "runtime SCEV checks needed. Enable vectorization of this "
"loop with '#pragma clang loop vectorize(enable)' when "		"loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime SCEV check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime SCEV check is required with -Os/-Oz.\n");
return None;		return true;
}		}

// FIXME: Avoid specializing for stride==1 instead of bailing out.		// FIXME: Avoid specializing for stride==1 instead of bailing out.
if (!Legal->getLAI()->getSymbolicStrides().empty()) {		if (!Legal->getLAI()->getSymbolicStrides().empty()) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime stride == 1 checks needed. Enable vectorization of "		<< "runtime stride == 1 checks needed. Enable vectorization of "
"this loop with '#pragma clang loop vectorize(enable)' when "		"this loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime stride check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime stride check is required with -Os/-Oz.\n");
		return true;
		}

		return false;
		}

		Optional<unsigned> LoopVectorizationCostModel::computeMaxVF() {
		if (Legal->getRuntimePointerChecking()->Need && TTI.hasBranchDivergence()) {
		// TODO: It may by useful to do since it's still likely to be dynamically
		// uniform if the target can skip.
		LLVM_DEBUG(
		dbgs() << "LV: Not inserting runtime ptr check for divergent target");

		ORE->emit(
		createMissedAnalysis("CantVersionLoopWithDivergentTarget")
		<< "runtime pointer checks needed. Not enabled for divergent target");

return None;		return None;
}		}

// If we optimize the program for size, avoid creating the tail loop.		unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');		LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

if (TC == 1) {		if (TC == 1) {
ORE->emit(createMissedAnalysis("SingleIterationLoop")		ORE->emit(createMissedAnalysis("SingleIterationLoop")
<< "loop trip count is one, irrelevant for vectorization");		<< "loop trip count is one, irrelevant for vectorization");
LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");		LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
return None;		return None;
}		}

// Record that scalar epilogue is not allowed.		switch (ScalarEpilogueStatus) {
LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");		default:
		return None;
		case CM_ScalarEpilogueAllowed:
		return computeFeasibleMaxVF(TC);
		case CM_ScalarEpilogueNotNeededPredicatePragma:
		LLVM_DEBUG(
		dbgs() << "LV: vector predicate hint found.\n"
		<< "LV: Not allowing scalar epilogue, creating predicated "
		<< "vector loop.\n");
		break;
		case CM_ScalarEpilogueNotAllowedLowTripLoop:
		// fallthrough as a special case of OptForSize
		case CM_ScalarEpilogueNotAllowedOptSize:
		if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedOptSize)
		LLVM_DEBUG(
		dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");
		else
		LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to low trip "
		<< "count.\n");

		// Bail if runtime checks are required, which are not good when optimising
		// for size.
		if (runtimeChecksRequired())
		return None;
		break;
		}

		// Now try the tail folding

// We don't create an epilogue when optimizing for size.
// Invalidate interleave groups that require an epilogue if we can't mask		// Invalidate interleave groups that require an epilogue if we can't mask
// the interleave-group.		// the interleave-group.
if (!useMaskedInterleavedAccesses(TTI))		if (!useMaskedInterleavedAccesses(TTI))
InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();		InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();

unsigned MaxVF = computeFeasibleMaxVF(TC);		unsigned MaxVF = computeFeasibleMaxVF(TC);

if (TC > 0 && TC % MaxVF == 0) {		if (TC > 0 && TC % MaxVF == 0) {
		// Accept MaxVF if we do not have a tail.
LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");		LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
return MaxVF;		return MaxVF;
}		}

// If we don't know the precise trip count, or if the trip count that we		// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail		// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.		// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.		// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
[... 2,427 lines not shown ...] void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   // Last (and currently only) operand is a mask.
   InnerLoopVectorizer::VectorParts MaskValues(State.UF);
   VPValue *Mask = User->getOperand(User->getNumOperands() - 1);
   for (unsigned Part = 0; Part < State.UF; ++Part)
     MaskValues[Part] = State.get(Mask, Part);
   State.ILV->vectorizeMemoryInstruction(&Instr, &MaskValues);
 }

+static ScalarEpilogueLowering
+getScalarEpilogueLowering(Function *F, Loop *L, LoopVectorizeHints &Hints,
+                          ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI) {
+  ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
+  if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
+      (F->hasOptSize() ||
+       llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
+    SEL = CM_ScalarEpilogueNotAllowedOptSize;
+  else if (Hints.getPredicate())
+    SEL = CM_ScalarEpilogueNotNeededPredicatePragma;
+
+  return SEL;
+}
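
The precedence encoded in the helper above can be sketched without any LLVM headers. The following is an illustrative model only: `LoopInputs` and `selectLowering` are hypothetical names, and the three booleans stand in for the real queries (`Hints.getForce()`, `F->hasOptSize()` together with `shouldOptimizeForSize`, and `Hints.getPredicate()`). The enumerator names are copied from the hunk.

```cpp
// Enumerators as they appear in the hunk above.
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotAllowedOptSize,
  CM_ScalarEpilogueNotNeededPredicatePragma
};

// Hypothetical stand-in for the inputs the real helper queries.
struct LoopInputs {
  bool VectorizeForced; // models Hints.getForce() == FK_Enabled
  bool OptimizeForSize; // models F->hasOptSize() || shouldOptimizeForSize(...)
  bool PredicateHint;   // models Hints.getPredicate()
};

// Same branch order as the helper: the size check is tested first, and the
// predicate hint is only consulted when that check does not apply.
constexpr ScalarEpilogueLowering selectLowering(const LoopInputs &In) {
  if (!In.VectorizeForced && In.OptimizeForSize)
    return CM_ScalarEpilogueNotAllowedOptSize;
  if (In.PredicateHint)
    return CM_ScalarEpilogueNotNeededPredicatePragma;
  return CM_ScalarEpilogueAllowed;
}

// Optimizing for size without a forced vectorize hint wins over the
// predicate hint.
static_assert(selectLowering({false, true, true}) ==
                  CM_ScalarEpilogueNotAllowedOptSize,
              "size check takes precedence");
```

One consequence of this ordering is that a loop with both `optsize` and a forced vectorize hint still reaches the predicate check, which appears to be the situation exercised by the test added in this revision.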

 // Process the loop in the VPlan-native vectorization path. This path builds
 // VPlan upfront in the vectorization pipeline, which allows to apply
 // VPlan-to-VPlan transformations from the very beginning without modifying the
 // input LLVM IR.
 static bool processLoopInVPlanNativePath(
     Loop *L, PredicatedScalarEvolution &PSE, LoopInfo *LI, DominatorTree *DT,
     LoopVectorizationLegality *LVL, TargetTransformInfo *TTI,
     TargetLibraryInfo *TLI, DemandedBits *DB, AssumptionCache *AC,
     OptimizationRemarkEmitter *ORE, BlockFrequencyInfo *BFI,
     ProfileSummaryInfo *PSI, LoopVectorizeHints &Hints) {

   assert(EnableVPlanNativePath && "VPlan-native path is disabled.");
   Function *F = L->getHeader()->getParent();
   InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());
+  ScalarEpilogueLowering SEL = getScalarEpilogueLowering(F, L, Hints, PSI, BFI);

-  ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
-  if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
-      (F->hasOptSize() ||
-       llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
-    SEL = CM_ScalarEpilogueNotAllowedOptSize;
-
-  LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI,
-                                DB, AC, ORE, F, &Hints, IAI);
+  LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,
+                                &Hints, IAI);
   // Use the planner for outer loop vectorization.
Meinersbur: [nit] formatting-only change?
   // TODO: CM is not used at this point inside the planner. Turn CM into an
   // optional argument if we don't need it in the future.
   LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM);

   // Get user vectorization factor.
   const unsigned UserVF = Hints.getWidth();

   // Plan how to best vectorize, return the best VF and its cost.
[... 71 lines not shown ...] #endif /* NDEBUG */
   if (!LVL.canVectorize(EnableVPlanNativePath)) {
     LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
     Hints.emitRemarkWithHints();
     return false;
   }

   // Check the function attributes and profiles to find out if this function
   // should be optimized for size.
-  ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
+  ScalarEpilogueLowering SEL = getScalarEpilogueLowering(F, L, Hints, PSI, BFI);
sdesmalen: nit: unnecessary whitespace.
-  if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
-      (F->hasOptSize() ||
-       llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
-    SEL = CM_ScalarEpilogueNotAllowedOptSize;

   // Entrance to the VPlan-native vectorization path. Outer loops are processed
   // here. They may require CFG and instruction level transformations before
   // even evaluating whether vectorization is profitable. Since we cannot modify
   // the incoming IR, we need to build VPlan upfront in the vectorization
   // pipeline.
   if (!L->empty())
     return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
[... 32 lines not shown ...] #endif /* NDEBUG */
   if (HasExpectedTC && ExpectedTC < TinyTripCountVectorThreshold) {
     LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
                       << "This loop is worth vectorizing only if no scalar "
                       << "iteration overheads are incurred.");
     if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
       LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
     else {
       LLVM_DEBUG(dbgs() << "\n");
-      // Loops with a very small trip count are considered for vectorization
-      // under OptForSize, thereby making sure the cost of their loop body is
-      // dominant, free of runtime guards and scalar iteration overheads.
       SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
     }
   }

   // Check the function attributes to see if implicit floats are allowed.
   // FIXME: This check doesn't seem possibly correct -- what if the loop is
   // an integer loop and the vector instructions selected are purely integer
   // vector instructions?
[... 30 lines not shown ...] if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)
     UseInterleaved = EnableInterleavedMemAccesses;

   // Analyze interleaved memory accesses.
   if (UseInterleaved) {
     IAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));
   }

   // Use the cost model.
-  LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI,
-                                DB, AC, ORE, F, &Hints, IAI);
+  LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI, DB, AC, ORE,
+                                F, &Hints, IAI);
   CM.collectValuesToIgnore();
Meinersbur: [nit] unrelated change?

   // Use the planner for vectorization.
   LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM);

   // Get user vectorization factor.
   unsigned UserVF = Hints.getWidth();

   // Plan how to best vectorize, return the best VF and its cost.
[... 270 lines not shown ...]

llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -S | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
				; CHECK-LABEL: tail_folding_enabled(
				; CHECK: vector.body:
				; CHECK: %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
				; CHECK: %wide.masked.load1 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
				; CHECK: %8 = add nsw <8 x i32> %wide.masked.load1, %wide.masked.load
				; CHECK: call void @llvm.masked.store.v8i32.p0v8i32(
				; CHECK: %index.next = add i64 %index, 8
				; CHECK: %12 = icmp eq i64 %index.next, 432
				; CHECK: br i1 %12, label %middle.block, label %vector.body, !llvm.loop !0

				entry:
				br label %for.body

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				store i32 %add, i32* %arrayidx4, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 430
				br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !6
				}

				define dso_local void @tail_folding_disabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
				; CHECK-LABEL: tail_folding_disabled(
				; CHECK: vector.body:
				; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(
				; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(
				; CHECK: br i1 %44, label {{.*}}, label %vector.body
				entry:
				br label %for.body

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				store i32 %add, i32* %arrayidx4, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 430
				br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !10
				}

				; CHECK: !0 = distinct !{!0, !1}
				; CHECK-NEXT: !1 = !{!"llvm.loop.isvectorized", i32 1}
				; CHECK-NEXT: !2 = distinct !{!2, !3, !1}
				; CHECK-NEXT: !3 = !{!"llvm.loop.unroll.runtime.disable"}
				; CHECK-NEXT: !4 = distinct !{!4, !1}
				; CHECK-NEXT: !5 = distinct !{!5, !3, !1}

				attributes #0 = { nounwind optsize uwtable "target-cpu"="core-avx2" "target-features"="+avx,+avx2" }

				!6 = distinct !{!6, !7, !8}
				!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
				!8 = !{!"llvm.loop.vectorize.enable", i1 true}

				!10 = distinct !{!10, !11, !12}
				!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 false}
				!12 = !{!"llvm.loop.vectorize.enable", i1 true}
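
A note on the constants in the test above: the scalar loop exits at 430, while the CHECK line for @tail_folding_enabled compares %index.next against 432. The page does not spell out where 432 comes from. One reading, assuming a vectorization factor of 8 and no interleaving, is that the folded loop iterates up to the trip count rounded up to a multiple of the vector width. The helper names in this sketch are hypothetical and only illustrate that arithmetic.

```cpp
#include <cstdint>

// Rounds TripCount up to the next multiple of VF (VF must be non-zero).
constexpr std::uint64_t roundUpToVF(std::uint64_t TripCount, std::uint64_t VF) {
  return (TripCount + VF - 1) / VF * VF;
}

// Number of lanes that remain enabled in the final, partially masked
// vector iteration.
constexpr std::uint64_t activeLanesInLastIteration(std::uint64_t TripCount,
                                                   std::uint64_t VF) {
  return TripCount % VF == 0 ? VF : TripCount % VF;
}

// 430 scalar iterations with 8 lanes: the vector loop counts up to 432 and
// the last iteration has 6 active lanes.
static_assert(roundUpToVF(430, 8) == 432, "matches the CHECK constant");
static_assert(activeLanesInLastIteration(430, 8) == 6, "two lanes masked off");
```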

Revision Contents

Diff 212861

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll
