This is an archive of the discontinued LLVM Phabricator instance.


Differential D69040

[TTI][LV] preferPredicateOverEpilogue
ClosedPublic

Authored by SjoerdMeijer on Oct 16 2019, 7:20 AM.

Details

Reviewers

Ayal
hsaito
fhahn
samparker
efriedma
dorit

Commits

rG6c2a4f5ff93e: [TTI][LV] preferPredicateOverEpilogue

Summary

We currently have two ways to steer creating a predicated vector body over creating a scalar epilogue. To force this, we have 1) a command line option and 2) a pragma available. This adds a third, a target hook to TargetTransformInfo that can be queried whether predication is preferred or not, which allows the vectoriser to make the decision (without forcing it).

I did the initial TTI plumbing for this, added usage of this new hook to the vectoriser where this should be queried, and added the beginning of an ARM MVE implementation. While this isn't complete yet and currently behaves as a non-functional change, it demonstrates the required function interfaces. That is, for MVE we would like the vectoriser to do tail-folding when we know we will be generating a hardware loop, and these checks are implemented in the ARM-specific hook. I will follow up on this soon, but that will be an entirely ARM-specific patch.
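
As a rough illustration of the three sources described in the summary, the sketch below ORs them together, which mirrors the `else if` condition this diff adds to getScalarEpilogueLowering. The struct, enum, and function names are hypothetical stand-ins and not LLVM API; the optimise-for-size case is deliberately left out.

```cpp
// Hypothetical stand-ins for the three inputs; none of these names are LLVM API.
struct PredicationInputs {
  bool CommandLineFlag; // 1) the command line option
  bool PragmaHint;      // 2) the loop pragma
  bool TargetPrefers;   // 3) the answer returned by the new target hook
};

enum class Lowering { ScalarEpilogueAllowed, UsePredicate };

// Any one of the three sources is enough to select a predicated vector body.
constexpr Lowering chooseLowering(PredicationInputs In) {
  return (In.CommandLineFlag || In.PragmaHint || In.TargetPrefers)
             ? Lowering::UsePredicate
             : Lowering::ScalarEpilogueAllowed;
}
```

With all three inputs false the sketch keeps the scalar epilogue, which matches the non-functional-change behaviour of this patch while the ARM hook still returns false.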

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Oct 16 2019, 7:20 AM

Herald added a project: Restricted Project. Oct 16 2019, 7:20 AM

Herald added subscribers: dmgreen, rkruppe, hiraditya, kristof.beyls.

SjoerdMeijer edited the summary of this revision. Oct 16 2019, 7:21 AM

SjoerdMeijer edited the summary of this revision.

samparker added inline comments. Oct 16 2019, 7:32 AM

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1010	What about MVE..? We also need to have masked load/stores enabled.
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll
10	I think it's worth adding your tests now for the conditions that you're adding, even if this means adding an extra backend command to enable/disable tail predication. I imagine we'll be testing for a while so I think it's worth it.

rscottmanley added a subscriber: rscottmanley. Oct 16 2019, 7:50 AM

Thinking about this some more, I think it would be best to at least check some features of the loop for legality:

no vector widths greater than 128 bits.
all vector operations should have the same number of lanes.

Yep, this is exactly what I wanted to do in the follow up as this is mostly target dependent.
But you do have good points, and perhaps addressing some of that now is a good thing. Looking into it now, what I wanted to implement in the target-specific part is something very similar to LoopVectorizationCostModel::getSmallestAndWidestTypes. I will reuse that and also pass the smallest and widest type to preferPredicateOverEpilogue; this is the info we need for the legality checks.

Yeah. It might be better expressed as "what the vectorizer produces will need to end up fitting into 128 bits". You could imagine the vectorizer producing something wider but then folding them in ISel into something legal.

That's maybe something to leave till later, but keep it in mind nonetheless.
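
The two legality conditions discussed above (nothing wider than 128 bits, and the same number of lanes everywhere) could be sketched roughly as follows. This is only an illustration under assumptions: `VectorOp`, `MaxVectorBits`, and `isTailFoldingCandidate` are hypothetical names, and the real check would work on LLVM types rather than on this simplified summary.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one vector operation the vectoriser would emit.
struct VectorOp {
  unsigned ElementBits; // width of one element, e.g. 32 for i32
  unsigned Lanes;       // number of elements in the vector
};

// Assumed limit for this sketch; 128 bits is the width named in the discussion.
constexpr unsigned MaxVectorBits = 128;

// Returns true if every operation fits in MaxVectorBits and all operations
// use the same lane count. An empty list is rejected.
inline bool isTailFoldingCandidate(const std::vector<VectorOp> &Ops) {
  if (Ops.empty())
    return false;
  const unsigned Lanes = Ops.front().Lanes;
  for (const VectorOp &Op : Ops) {
    assert(Op.ElementBits > 0 && Op.Lanes > 0 && "malformed vector op");
    if (Op.ElementBits * Op.Lanes > MaxVectorBits)
      return false; // wider than the assumed vector register
    if (Op.Lanes != Lanes)
      return false; // mixed lane counts
  }
  return true;
}
```

Note that this checks what the vectoriser produces directly; it does not model the case mentioned above where wider vectors are later folded into something legal during ISel.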

hsaito added inline comments. Oct 18 2019, 3:41 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7437	@SjoerdMeijer thanks for the work. I like the direction you are heading to. Now that you are stepping into making this behavior default for some arch for some loop, please think about the hint needed to reverse that default choice, something along the lines of (TTI->prefer...(...) && Hints.getDoNotPredicate())

Thanks, will take that on board, and I will pick this up after the llvm dev conference.
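
One possible reading of the suggestion above is a tri-state hint where an explicit user choice always wins and the target preference only applies when the user said nothing. The sketch below assumes that reading; `PredicateHint`, `EpilogueChoice`, and `chooseWithHint` are hypothetical names and this is not how the patch under review behaves.

```cpp
// Hypothetical tri-state loop hint: the user may force predication on, force
// it off, or express no preference.
enum class PredicateHint { Unset, Enable, Disable };

enum class EpilogueChoice { ScalarEpilogueAllowed, UsePredicate };

// An explicit hint takes priority; the target preference is only consulted
// when the hint is Unset.
constexpr EpilogueChoice chooseWithHint(PredicateHint Hint, bool TargetPrefers) {
  return Hint == PredicateHint::Enable    ? EpilogueChoice::UsePredicate
         : Hint == PredicateHint::Disable ? EpilogueChoice::ScalarEpilogueAllowed
         : TargetPrefers                  ? EpilogueChoice::UsePredicate
                                          : EpilogueChoice::ScalarEpilogueAllowed;
}
```

The point of the ordering is that a target default can then be reversed per loop, which a plain OR of the three sources cannot express.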

I have:

changed the ARM implementation to check the required architecture extensions: it checks for MVE; the remaining checks are done in isHardwareLoopProfitable, so there was no point in duplicating them.
added a new test case to cover these architecture extension combinations and the profitability check.

@samparker, @dmgreen :
I've given it quite some thought to see whether the current interface of preferPredicateOverEpilogue is good enough to make a well-informed decision on whether we prefer to predicate or not. For example, questions were whether we should also pass the Vectorization Factor (VF) and the vector types, but there are quite a few chicken-and-egg problems here. That is, we decide quite early whether we want to predicate or not, long before we have calculated the VF, let alone the vector types. So, should preferPredicateOverEpilogue then possibly return a set of VF candidates, or should we calculate the VF a lot earlier? Well, possibly, but that's also what the rest of the pipeline is trying to figure out. More importantly, I don't think we need that, and leaving it out does not sacrifice too much. That is, we are going to be conservative anyway, and I think the current interface is good enough to decide on tail-predication in this way:

we bail if we find operations on i64, because we don't support (vector) operations on them,
we check whether the data processing instructions all operate on the same, uniform data type (less than or equal to i32),
we look at the loads and stores and check that the access strides are the same and in the same direction. This makes sure we don't generate shuffles, which would mess with data elements and possibly mask/tail-predicate the wrong lanes,
and then we rely on the vectoriser and the rest of the pipeline to choose a vectorisation factor.
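
The conservative checks listed above can be sketched as a single pass over a simplified summary of the loop body. Everything here is an assumption made for illustration: `LoopInst` and `looksTailPredicable` are hypothetical, and the follow-up patch (D69845) is where the actual implementation is being worked on.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-instruction summary; not an LLVM data structure.
struct LoopInst {
  bool IsMemAccess;     // true for loads and stores
  unsigned ElementBits; // element width in bits, e.g. 32 for i32
  int Stride;           // elements advanced per iteration (memory accesses only)
};

// Returns false as soon as one of the conservative checks fails.
inline bool looksTailPredicable(const std::vector<LoopInst> &Body) {
  unsigned CommonBits = 0; // width shared by data processing ops, 0 = none seen
  bool SeenStride = false;
  int CommonStride = 0;
  for (const LoopInst &I : Body) {
    assert(I.ElementBits > 0 && "element width must be known");
    if (I.ElementBits > 32)
      return false; // bail on i64 (and anything wider than i32)
    if (I.IsMemAccess) {
      // All loads and stores must share one stride, sign included.
      if (SeenStride && I.Stride != CommonStride)
        return false;
      SeenStride = true;
      CommonStride = I.Stride;
    } else {
      // All data processing ops must share one element width.
      if (CommonBits != 0 && I.ElementBits != CommonBits)
        return false;
      CommonBits = I.ElementBits;
    }
  }
  return true;
}
```

Choosing the vectorisation factor is intentionally absent from the sketch, since the comment above leaves that to the vectoriser and the rest of the pipeline.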

@hsaito :
Many thanks for the suggestion! I will do that right away after this one as a follow up.

Ok, fair enough.

This revision is now accepted and ready to land. Nov 4 2019, 3:47 AM

SjoerdMeijer mentioned this in D69845: [ARM][MVE] canTailPredicateLoop. Nov 5 2019, 7:31 AM

Thanks all!

I will commit this soon-ish as this is an NFC change, and in the meantime I have uploaded a work-in-progress patch, D69845, just to demonstrate usage of this for MVE. It also contains a test case showing that we do not yet respect the hint vectorize_predicate(disable) as mentioned by Hideki, which I will work on next, as well as finishing D69845.

Closed by commit rG6c2a4f5ff93e: [TTI][LV] preferPredicateOverEpilogue (authored by SjoerdMeijer). Nov 6 2019, 2:22 AM

This revision was automatically updated to reflect the committed changes.

vkmr added a subscriber: vkmr. May 5 2020, 7:12 PM

Revision Contents

Path    Size

llvm/include/llvm/Analysis/TargetTransformInfo.h    20 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h    7 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h    7 lines
llvm/lib/Analysis/TargetTransformInfo.cpp    6 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h    7 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp    44 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp    18 lines
llvm/test/Transforms/LoopVectorize/ARM/prefer-tail-loop-folding.ll    49 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll    35 lines

Diff 228027

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show All 40 Lines

class AssumptionCache;		class AssumptionCache;
class BlockFrequencyInfo;		class BlockFrequencyInfo;
class BranchInst;		class BranchInst;
class Function;		class Function;
class GlobalValue;		class GlobalValue;
class IntrinsicInst;		class IntrinsicInst;
class LoadInst;		class LoadInst;
		class LoopAccessInfo;
class Loop;		class Loop;
class ProfileSummaryInfo;		class ProfileSummaryInfo;
class SCEV;		class SCEV;
class ScalarEvolution;		class ScalarEvolution;
class StoreInst;		class StoreInst;
class SwitchInst;		class SwitchInst;
class TargetLibraryInfo;		class TargetLibraryInfo;
class Type;		class Type;
▲ Show 20 Lines • Show All 456 Lines • ▼ Show 20 Lines	public:

/// Query the target whether it would be profitable to convert the given loop		/// Query the target whether it would be profitable to convert the given loop
/// into a hardware loop.		/// into a hardware loop.
bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) const;		HardwareLoopInfo &HWLoopInfo) const;

		/// Query the target whether it would be preferred to create a predicated vector
		/// loop, which can avoid the need to emit a scalar epilogue loop.
		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		AssumptionCache &AC, TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI) const;

/// @}		/// @}

/// \name Scalar Target Information		/// \name Scalar Target Information
/// @{		/// @{

/// Flags indicating the kind of support for population count.		/// Flags indicating the kind of support for population count.
///		///
/// Compared to the SW implementation, HW support is supposed to		/// Compared to the SW implementation, HW support is supposed to
▲ Show 20 Lines • Show All 667 Lines • ▼ Show 20 Lines	virtual bool rewriteIntrinsicWithAddressSpace(
IntrinsicInst *II, Value *OldV, Value *NewV) const = 0;		IntrinsicInst *II, Value *OldV, Value *NewV) const = 0;
virtual bool isLoweredToCall(const Function *F) = 0;		virtual bool isLoweredToCall(const Function *F) = 0;
virtual void getUnrollingPreferences(Loop *L, ScalarEvolution &,		virtual void getUnrollingPreferences(Loop *L, ScalarEvolution &,
UnrollingPreferences &UP) = 0;		UnrollingPreferences &UP) = 0;
virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) = 0;		HardwareLoopInfo &HWLoopInfo) = 0;
		virtual bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
		ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI) = 0;
virtual bool isLegalAddImmediate(int64_t Imm) = 0;		virtual bool isLegalAddImmediate(int64_t Imm) = 0;
virtual bool isLegalICmpImmediate(int64_t Imm) = 0;		virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,		virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
int64_t BaseOffset, bool HasBaseReg,		int64_t BaseOffset, bool HasBaseReg,
int64_t Scale,		int64_t Scale,
unsigned AddrSpace,		unsigned AddrSpace,
Instruction *I) = 0;		Instruction *I) = 0;
virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,		virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
▲ Show 20 Lines • Show All 254 Lines • ▼ Show 20 Lines	void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
return Impl.getUnrollingPreferences(L, SE, UP);		return Impl.getUnrollingPreferences(L, SE, UP);
}		}
bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) override {		HardwareLoopInfo &HWLoopInfo) override {
return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);		return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
}		}
		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		AssumptionCache &AC, TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI) override {
		return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
		}
bool isLegalAddImmediate(int64_t Imm) override {		bool isLegalAddImmediate(int64_t Imm) override {
return Impl.isLegalAddImmediate(Imm);		return Impl.isLegalAddImmediate(Imm);
}		}
bool isLegalICmpImmediate(int64_t Imm) override {		bool isLegalICmpImmediate(int64_t Imm) override {
return Impl.isLegalICmpImmediate(Imm);		return Impl.isLegalICmpImmediate(Imm);
}		}
bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,		bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale,		bool HasBaseReg, int64_t Scale,
▲ Show 20 Lines • Show All 481 Lines • Show Last 20 Lines

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 207 Lines • ▼ Show 20 Lines	public:

bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) {		HardwareLoopInfo &HWLoopInfo) {
return false;		return false;
}		}

		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		AssumptionCache &AC, TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI) const {
		return false;
		}

void getUnrollingPreferences(Loop *, ScalarEvolution &,		void getUnrollingPreferences(Loop *, ScalarEvolution &,
TTI::UnrollingPreferences &) {}		TTI::UnrollingPreferences &) {}

bool isLegalAddImmediate(int64_t Imm) { return false; }		bool isLegalAddImmediate(int64_t Imm) { return false; }

bool isLegalICmpImmediate(int64_t Imm) { return false; }		bool isLegalICmpImmediate(int64_t Imm) { return false; }

bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,		bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
▲ Show 20 Lines • Show All 702 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 504 Lines • ▼ Show 20 Lines	public:

bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) {		HardwareLoopInfo &HWLoopInfo) {
return BaseT::isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);		return BaseT::isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
}		}

		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		AssumptionCache &AC, TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI) {
		return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
		}

int getInstructionLatency(const Instruction *I) {		int getInstructionLatency(const Instruction *I) {
if (isa<LoadInst>(I))		if (isa<LoadInst>(I))
return getST()->getSchedModel().DefaultLoadLatency;		return getST()->getSchedModel().DefaultLoadLatency;

return BaseT::getInstructionLatency(I);		return BaseT::getInstructionLatency(I);
}		}

virtual Optional<unsigned>		virtual Optional<unsigned>
▲ Show 20 Lines • Show All 1,218 Lines • Show Last 20 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 237 Lines • ▼ Show 20 Lines
	}			}

	bool TargetTransformInfo::isHardwareLoopProfitable(			bool TargetTransformInfo::isHardwareLoopProfitable(
	Loop *L, ScalarEvolution &SE, AssumptionCache &AC,			Loop *L, ScalarEvolution &SE, AssumptionCache &AC,
	TargetLibraryInfo *LibInfo, HardwareLoopInfo &HWLoopInfo) const {			TargetLibraryInfo *LibInfo, HardwareLoopInfo &HWLoopInfo) const {
	return TTIImpl->isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);			return TTIImpl->isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
	}			}

				bool TargetTransformInfo::preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
				ScalarEvolution &SE, AssumptionCache &AC, TargetLibraryInfo *TLI,
				DominatorTree *DT, const LoopAccessInfo *LAI) const {
				return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
				}

	void TargetTransformInfo::getUnrollingPreferences(			void TargetTransformInfo::getUnrollingPreferences(
	Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {			Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {
	return TTIImpl->getUnrollingPreferences(L, SE, UP);			return TTIImpl->getUnrollingPreferences(L, SE, UP);
	}			}

	bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {			bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {
	return TTIImpl->isLegalAddImmediate(Imm);			return TTIImpl->isLegalAddImmediate(Imm);
	}			}
	▲ Show 20 Lines • Show All 1,134 Lines • Show Last 20 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 197 Lines • ▼ Show 20 Lines	int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
bool UseMaskForCond = false,		bool UseMaskForCond = false,
bool UseMaskForGaps = false);		bool UseMaskForGaps = false);

bool isLoweredToCall(const Function *F);		bool isLoweredToCall(const Function *F);
bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo);		HardwareLoopInfo &HWLoopInfo);
		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
		ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI);
void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP);		TTI::UnrollingPreferences &UP);

bool shouldBuildLookupTablesForConstant(Constant *C) const {		bool shouldBuildLookupTablesForConstant(Constant *C) const {
// In the ROPI and RWPI relocation models we can't have pointers to global		// In the ROPI and RWPI relocation models we can't have pointers to global
// variables or functions in constant data, so don't convert switches to		// variables or functions in constant data, so don't convert switches to
// lookup tables if any of the values would need relocation.		// lookup tables if any of the values would need relocation.
if (ST->isROPI() || ST->isRWPI())		if (ST->isROPI() || ST->isRWPI())
Show All 10 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 994 Lines • ▼ Show 20 Lines	bool ARMTTIImpl::isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
HWLoopInfo.CounterInReg = true;		HWLoopInfo.CounterInReg = true;
HWLoopInfo.IsNestingLegal = false;		HWLoopInfo.IsNestingLegal = false;
HWLoopInfo.PerformEntryTest = true;		HWLoopInfo.PerformEntryTest = true;
HWLoopInfo.CountType = Type::getInt32Ty(C);		HWLoopInfo.CountType = Type::getInt32Ty(C);
HWLoopInfo.LoopDecrement = ConstantInt::get(HWLoopInfo.CountType, 1);		HWLoopInfo.LoopDecrement = ConstantInt::get(HWLoopInfo.CountType, 1);
return true;		return true;
}		}

		bool ARMTTIImpl::preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
		ScalarEvolution &SE,
		AssumptionCache &AC,
		TargetLibraryInfo *TLI,
		DominatorTree *DT,
		const LoopAccessInfo *LAI) {
		// Creating a predicated vector loop is the first step for generating a
		// tail-predicated hardware loop, for which we need the MVE masked
		// load/stores instructions:
		if (!ST->hasMVEIntegerOps())
		return false;

		HardwareLoopInfo HWLoopInfo(L);
		if (!HWLoopInfo.canAnalyze(*LI)) {
		LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
		"analyzable.\n");
		return false;
		}

		// This checks if we have the low-overhead branch architecture
		// extension, and if we will create a hardware-loop:
		if (!isHardwareLoopProfitable(L, SE, AC, TLI, HWLoopInfo)) {
		LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
		"profitable.\n");
		return false;
		}

		if (!HWLoopInfo.isHardwareLoopCandidate(SE, *LI, *DT)) {
		LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
		"a candidate.\n");
		return false;
		}

		// TODO: to set up a tail-predicated loop, which works by setting up
		// the total number of elements processed by the loop, we need to
		// determine the element size here, and if it is uniform for all operations
		// in the vector loop. This means we will reject narrowing/widening
		// operations, and don't want to predicate the vector loop, which is
		// the main prep step for tail-predicated loops.

		return false;
		}


void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
// Only currently enable these preferences for M-Class cores.		// Only currently enable these preferences for M-Class cores.
if (!ST->isMClass())		if (!ST->isMClass())
return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);		return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);

// Disable loop unrolling for Oz and Os.		// Disable loop unrolling for Oz and Os.
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;
▲ Show 20 Lines • Show All 88 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 7,417 Lines • ▼ Show 20 Lines	void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
InnerLoopVectorizer::VectorParts MaskValues(State.UF);		InnerLoopVectorizer::VectorParts MaskValues(State.UF);
for (unsigned Part = 0; Part < State.UF; ++Part)		for (unsigned Part = 0; Part < State.UF; ++Part)
MaskValues[Part] = State.get(Mask, Part);		MaskValues[Part] = State.get(Mask, Part);
State.ILV->vectorizeMemoryInstruction(&Instr, &MaskValues);		State.ILV->vectorizeMemoryInstruction(&Instr, &MaskValues);
}		}

static ScalarEpilogueLowering		static ScalarEpilogueLowering
getScalarEpilogueLowering(Function *F, Loop *L, LoopVectorizeHints &Hints,		getScalarEpilogueLowering(Function *F, Loop *L, LoopVectorizeHints &Hints,
ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI) {		ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI,
		TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
		AssumptionCache *AC, LoopInfo *LI,
		ScalarEvolution *SE, DominatorTree *DT,
		const LoopAccessInfo *LAI) {
ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;		ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&		if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
(F->hasOptSize() ||		(F->hasOptSize() ||
llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))		llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
SEL = CM_ScalarEpilogueNotAllowedOptSize;		SEL = CM_ScalarEpilogueNotAllowedOptSize;
else if (PreferPredicateOverEpilog || Hints.getPredicate())		else if (PreferPredicateOverEpilog || Hints.getPredicate() ||
		TTI->preferPredicateOverEpilogue(L, LI, *SE, *AC, TLI, DT, LAI))
SEL = CM_ScalarEpilogueNotNeededUsePredicate;		SEL = CM_ScalarEpilogueNotNeededUsePredicate;

return SEL;		return SEL;
}		}

// Process the loop in the VPlan-native vectorization path. This path builds		// Process the loop in the VPlan-native vectorization path. This path builds
// VPlan upfront in the vectorization pipeline, which allows to apply		// VPlan upfront in the vectorization pipeline, which allows to apply
// VPlan-to-VPlan transformations from the very beginning without modifying the		// VPlan-to-VPlan transformations from the very beginning without modifying the
// input LLVM IR.		// input LLVM IR.
static bool processLoopInVPlanNativePath(		static bool processLoopInVPlanNativePath(
Loop *L, PredicatedScalarEvolution &PSE, LoopInfo *LI, DominatorTree *DT,		Loop *L, PredicatedScalarEvolution &PSE, LoopInfo *LI, DominatorTree *DT,
LoopVectorizationLegality *LVL, TargetTransformInfo *TTI,		LoopVectorizationLegality *LVL, TargetTransformInfo *TTI,
TargetLibraryInfo *TLI, DemandedBits *DB, AssumptionCache *AC,		TargetLibraryInfo *TLI, DemandedBits *DB, AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, BlockFrequencyInfo *BFI,		OptimizationRemarkEmitter *ORE, BlockFrequencyInfo *BFI,
ProfileSummaryInfo *PSI, LoopVectorizeHints &Hints) {		ProfileSummaryInfo *PSI, LoopVectorizeHints &Hints) {

assert(EnableVPlanNativePath && "VPlan-native path is disabled.");		assert(EnableVPlanNativePath && "VPlan-native path is disabled.");
Function *F = L->getHeader()->getParent();		Function *F = L->getHeader()->getParent();
InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());		InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());
ScalarEpilogueLowering SEL = getScalarEpilogueLowering(F, L, Hints, PSI, BFI);
		ScalarEpilogueLowering SEL =
		getScalarEpilogueLowering(F, L, Hints, PSI, BFI, TTI, TLI, AC, LI,
		PSE.getSE(), DT, LVL->getLAI());

LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,		LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,
&Hints, IAI);		&Hints, IAI);
// Use the planner for outer loop vectorization.		// Use the planner for outer loop vectorization.
// TODO: CM is not used at this point inside the planner. Turn CM into an		// TODO: CM is not used at this point inside the planner. Turn CM into an
// optional argument if we don't need it in the future.		// optional argument if we don't need it in the future.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM, IAI);		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM, IAI);

▲ Show 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	#endif /* NDEBUG */
if (!LVL.canVectorize(EnableVPlanNativePath)) {		if (!LVL.canVectorize(EnableVPlanNativePath)) {
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");		LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

// Check the function attributes and profiles to find out if this function		// Check the function attributes and profiles to find out if this function
// should be optimized for size.		// should be optimized for size.
ScalarEpilogueLowering SEL = getScalarEpilogueLowering(F, L, Hints, PSI, BFI);		ScalarEpilogueLowering SEL =
		getScalarEpilogueLowering(F, L, Hints, PSI, BFI, TTI, TLI, AC, LI,
		PSE.getSE(), DT, LVL.getLAI());

// Entrance to the VPlan-native vectorization path. Outer loops are processed		// Entrance to the VPlan-native vectorization path. Outer loops are processed
// here. They may require CFG and instruction level transformations before		// here. They may require CFG and instruction level transformations before
// even evaluating whether vectorization is profitable. Since we cannot modify		// even evaluating whether vectorization is profitable. Since we cannot modify
// the incoming IR, we need to build VPlan upfront in the vectorization		// the incoming IR, we need to build VPlan upfront in the vectorization
// pipeline.		// pipeline.
if (!L->empty())		if (!L->empty())
return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,		return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
▲ Show 20 Lines • Show All 341 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/ARM/prefer-tail-loop-folding.ll

This file was added.

				; RUN: opt -mtriple=thumbv8.1m.main-arm-eabihf < %s -loop-vectorize -S | \
				; RUN: FileCheck %s -check-prefixes=CHECK,NO-FOLDING

				; RUN: opt -mtriple=thumbv8.1m.main-arm-eabihf -mattr=-mve < %s -loop-vectorize -enable-arm-maskedldst=true -S | \
				; RUN: FileCheck %s -check-prefixes=CHECK,NO-FOLDING

				; RUN: opt -mtriple=thumbv8.1m.main-arm-eabihf -mattr=+mve < %s -loop-vectorize -enable-arm-maskedldst=false -S | \
				; RUN: FileCheck %s -check-prefixes=CHECK,NO-FOLDING

				; Disabling the low-overhead branch extension will make
				; 'isHardwareLoopProfitable' return false, so that we test avoiding folding for
				; these cases.
				; RUN: opt -mtriple=thumbv8.1m.main-arm-eabihf -mattr=+mve,-lob < %s -loop-vectorize -enable-arm-maskedldst=true -S | \
				; RUN: FileCheck %s -check-prefixes=CHECK,NO-FOLDING

				; RUN: opt -mtriple=thumbv8.1m.main-arm-eabihf -mattr=+mve < %s -loop-vectorize -enable-arm-maskedldst=true -S | \
				; RUN: FileCheck %s -check-prefixes=CHECK,PREFER-FOLDING

				define dso_local void @tail_folding(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) {
				; CHECK-LABEL: tail_folding(
				;
				; NO-FOLDING-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
				; NO-FOLDING-NOT: call void @llvm.masked.store.v4i32.p0v4i32(
				;
				; TODO: this needs implementation of TTI::preferPredicateOverEpilogue,
				; then this will be tail-folded too:
				;
				; PREFER-FOLDING-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
				; PREFER-FOLDING-NOT: call void @llvm.masked.store.v4i32.p0v4i32(
				;
				entry:
				br label %for.body

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				store i32 %add, i32* %arrayidx4, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 430
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll

	; RUN: opt < %s -loop-vectorize -enable-arm-maskedldst -S | \			; RUN: opt < %s -loop-vectorize -enable-arm-maskedldst -S | \
	; RUN: FileCheck %s -check-prefixes=COMMON,CHECK			; RUN: FileCheck %s -check-prefixes=COMMON,CHECK

	; RUN: opt < %s -loop-vectorize -enable-arm-maskedldst -prefer-predicate-over-epilog -S | \			; RUN: opt < %s -loop-vectorize -enable-arm-maskedldst -prefer-predicate-over-epilog -S | \
	; RUN: FileCheck -check-prefixes=COMMON,PREDFLAG %s			; RUN: FileCheck -check-prefixes=COMMON,PREDFLAG %s

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
	target triple = "thumbv8.1m.main-arm-unknown-eabihf"			target triple = "thumbv8.1m.main-arm-unknown-eabihf"

				define dso_local void @tail_folding(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) #0 {
				; CHECK-LABEL: tail_folding(
				; CHECK: vector.body:
				;
				; This needs implementation of TTI::preferPredicateOverEpilogue,
				; then this will be tail-folded too:
				;
				; CHECK-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
				; CHECK-NOT: call void @llvm.masked.store.v4i32.p0v4i32(
				; CHECK: br i1 %{{.*}}, label %{{.*}}, label %vector.body
				entry:
				br label %for.body

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				store i32 %add, i32* %arrayidx4, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 430
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}


	define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {			define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
	; COMMON-LABEL: tail_folding_enabled(			; COMMON-LABEL: tail_folding_enabled(
	; COMMON: vector.body:			; COMMON: vector.body:
	; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(			; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
	; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(			; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
	; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]			; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]
	; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]]			; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]]
	; COMMON: br i1 %12, label %{{.*}}, label %vector.body			; COMMON: br i1 %12, label %{{.*}}, label %vector.body
	[27 unchanged lines not shown in the archived diff]
	; PREDFLAG-LABEL: tail_folding_disabled(			; PREDFLAG-LABEL: tail_folding_disabled(
	; PREDFLAG: vector.body:			; PREDFLAG: vector.body:
	; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(			; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
	; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(			; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
	; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load			; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load
	; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32(			; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32(
	; PREDFLAG: %index.next = add i64 %index, 4			; PREDFLAG: %index.next = add i64 %index, 4
	; PREDFLAG: %12 = icmp eq i64 %index.next, 432			; PREDFLAG: %12 = icmp eq i64 %index.next, 432
	; PREDFLAG: br i1 %12, label %middle.block, label %vector.body, !llvm.loop !4			; PREDFLAG: br i1 %{{.*}}, label %middle.block, label %vector.body, !llvm.loop !6
	entry:			entry:
	br label %for.body			br label %for.body

	for.cond.cleanup:			for.cond.cleanup:
	ret void			ret void

	for.body:			for.body:
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
	[10 unchanged lines not shown in the archived diff]
	}			}

	; CHECK: !0 = distinct !{!0, !1}			; CHECK: !0 = distinct !{!0, !1}
	; CHECK-NEXT: !1 = !{!"llvm.loop.isvectorized", i32 1}			; CHECK-NEXT: !1 = !{!"llvm.loop.isvectorized", i32 1}
	; CHECK-NEXT: !2 = distinct !{!2, !3, !1}			; CHECK-NEXT: !2 = distinct !{!2, !3, !1}
	; CHECK-NEXT: !3 = !{!"llvm.loop.unroll.runtime.disable"}			; CHECK-NEXT: !3 = !{!"llvm.loop.unroll.runtime.disable"}
	; CHECK-NEXT: !4 = distinct !{!4, !1}			; CHECK-NEXT: !4 = distinct !{!4, !1}
	; CHECK-NEXT: !5 = distinct !{!5, !3, !1}			; CHECK-NEXT: !5 = distinct !{!5, !3, !1}
				; CHECK-NEXT: !6 = distinct !{!6, !1}
	attributes #0 = { nofree norecurse nounwind "target-features"="+armv8.1-m.main,+mve.fp" }			attributes #0 = { nofree norecurse nounwind "target-features"="+armv8.1-m.main,+mve.fp" }

	!6 = distinct !{!6, !7, !8}			!6 = distinct !{!6, !7, !8}
	!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}			!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
	!8 = !{!"llvm.loop.vectorize.enable", i1 true}			!8 = !{!"llvm.loop.vectorize.enable", i1 true}

	!10 = distinct !{!10, !11, !12}			!10 = distinct !{!10, !11, !12}
	!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 false}			!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 false}
	!12 = !{!"llvm.loop.vectorize.enable", i1 true}			!12 = !{!"llvm.loop.vectorize.enable", i1 true}
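
Note on the constants in the checks above: the scalar loop exits at 430, while the PREDFLAG check expects the vector loop to exit at 432. With the tail folded into the vector body, the trip count is rounded up to a multiple of the vector width and the surplus lanes are masked off. A minimal C++ sketch of that arithmetic, assuming a vector width of 4 (taken from the `<4 x i32>` types in the checks); `roundUpToVF` and `laneMask` are hypothetical names for illustration and are not part of LLVM:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Vector width assumed from the <4 x i32> types in the checks above.
constexpr std::uint64_t VF = 4;

// Hypothetical helper (not an LLVM API): round the scalar trip count up
// to the next multiple of VF, which is where a tail-folded loop exits.
constexpr std::uint64_t roundUpToVF(std::uint64_t tripCount) {
  return (tripCount + VF - 1) / VF * VF;
}

// Hypothetical helper: lane i of the vector iteration starting at `index`
// is active only while index + i is still below the scalar trip count.
std::array<bool, VF> laneMask(std::uint64_t index, std::uint64_t tripCount) {
  assert(index % VF == 0 && "vector iterations start at multiples of VF");
  std::array<bool, VF> mask{};
  for (std::uint64_t lane = 0; lane < VF; ++lane)
    mask[lane] = index + lane < tripCount;
  return mask;
}
```

With a trip count of 430, `roundUpToVF(430)` is 432, and `laneMask(428, 430)` enables only the first two lanes, so the last vector iteration handles elements 428 and 429 under the mask instead of leaving them to a scalar epilogue.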


Diff 228027
