This is an archive of the discontinued LLVM Phabricator instance.

Differential D113973

[LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions with no loads/stores
ClosedPublic

Authored by RosieSumpter on Nov 16 2021, 1:49 AM.

Details

Reviewers

sdesmalen
david-arm
kmclaughlin
dmgreen
fhahn

Commits

rG961f51fdf04f: [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions without…

Summary

For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,

float s = 0;
for (int i = 0; i < N; ++i)
  s += (float) i*i;

which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.

In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.
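
To illustrate why the default widest type matters, here is a minimal sketch (not the vectorizer's actual code). It assumes the maximum VF is simply the vector register width divided by the widest element width; the helper name `maxVFForWidestType` and the 128-bit register width used below are assumptions for illustration only.

```cpp
// Hypothetical helper, not part of LLVM: upper bound on the vectorization
// factor when the widest element type in the loop is WidestTypeBits wide and
// the target's vector registers hold RegisterBits bits. This is a
// simplification of what the cost model does.
constexpr unsigned maxVFForWidestType(unsigned RegisterBits,
                                      unsigned WidestTypeBits) {
  return RegisterBits / WidestTypeBits;
}
```

Under these assumptions, a 128-bit register with the default widest width of 8 bits allows VFs up to `maxVFForWidestType(128, 8)`, i.e. 16, whereas using the 32-bit width of `float` caps the VF at 4.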

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RosieSumpter created this revision. Nov 16 2021, 1:49 AM

Herald added a subscriber: hiraditya. Nov 16 2021, 1:49 AM

RosieSumpter requested review of this revision. Nov 16 2021, 1:49 AM

Herald added a project: Restricted Project. Nov 16 2021, 1:49 AM

Herald added a subscriber: llvm-commits.

RosieSumpter edited the summary of this revision. Nov 16 2021, 1:51 AM

lebedev.ri added a subscriber: lebedev.ri. Nov 16 2021, 1:55 AM

lebedev.ri added inline comments.

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48	Where is `ElementTypesInLoop` populated? `LoopVectorizationCostModel::collectElementTypesForWidening()` suggests that PHI nodes are also queried.

Harbormaster completed remote builds in B134454: Diff 387531. Nov 16 2021, 2:30 AM

sdesmalen added reviewers: dmgreen, fhahn. Nov 16 2021, 10:01 AM

sdesmalen added inline comments.

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48	The function `collectElementTypesForWidening()` collects types for the following reasons:

- Types of loads/stores, because these are used in computing a safe dependence distance (and also to have a sensible/natural VF as maximum VF).
- Types of unordered reductions, in order to avoid generating vector PHI nodes that span multiple registers when the VF is too wide. In-order reductions are not considered, because those PHI nodes remain scalar after vectorization.

If we consider more types in the loop in `collectElementTypesForWidening()`, e.g. for in-order reductions or extends, then this leads to regressions; any type larger than the maximum loaded/stored type will limit the maximum VF. If the maximum VF is limited, then any VF up to the wider maximum will not be considered by the cost-model. So in general, it's better not to limit the VF for those reasons and leave it up to the cost-model to choose the best VF.

In the test-case that @RosieSumpter added, there are no loads/stores and there is no vector PHI node because the reduction is ordered, so it considers any VF up to the widest-possible VF based on an i8 element type (even if the loop operates on a larger size). The cost-model then only considers the throughput cost, and the per-lane cost is no different between VF=4 and VF=16, so it favours VF=16. It does not consider the additional code-size cost when the operation is scalarized.

I guess the alternative choices are to:

- Look through the operands of the ordered reduction operation, as well as any casts/extends on those operands, and only consider these element types in `collectElementTypesForWidening()` if there are no loads/stores in the loop. This heuristic gets quite complicated I think.
- Consider the code-size cost for in-order reductions, so that it favours shorter VFs (I'm not sure if this is desirable, as it may regress other cases where code-size isn't relevant).

Because the case is so niche (i.e. a loop with only in-order reductions), @RosieSumpter thought it made sense to fall back to a more sensible default instead.

I think there is no single right choice; I'm not sure why 32 is more right than 8 or 64 here, for everything.
I'd think this should be in the cost model somewhere, not hardcoded.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5961	I suspect this should be querying datalayout for the minimal integer size?

dmgreen added inline comments. Nov 16 2021, 3:00 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5995	This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v8i32)) in a single operation. That might not be as useful for float types - and the example loop you showed with no loads/stores in the loop is much less likely for integer - it will already have been simplified. Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?

sdesmalen added inline comments. Nov 17 2021, 12:44 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5995	> This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v8i32)) in a single operation.

Yes, the same is true for SVE. There is also code in the cost-model to recognise the extend to the wider type. I think the actual reason for the reduction-code being here is to avoid ending up with vector PHIs that are too wide (out-of-loop reduction). The checks for in-loop reductions were added later because (1) there is no vector PHI and (2) that it doesn't limit the VF too early so that it lets the cost-model code consider the wider VFs for the reason you described.

> Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?

I don't think we should add specific knowledge to limit the VF for fp-reductions here, because that means adding target-specific knowledge to the `collectElementTypesForWidening`, which is something that the cost-model should decide on. Also it would lead to regressions.

dmgreen added inline comments. Nov 17 2021, 8:04 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5995	Yeah, we added it in https://reviews.llvm.org/D93476. I would have no objections to removing this for ordered reductions or float types I don't think. There is no such thing (as far as I understand) as an extending in-order float reduction, and we can rely on the UF for things that will become wider-than-legal vectors.

Matt added a subscriber: Matt. Nov 17 2021, 12:05 PM

If there are no element types, only set the max width to 32 if there are no legal int sizes set, otherwise set it to the smallest legal int width.
update X86/funclet.ll test.

sdesmalen added inline comments. Nov 22 2021, 1:41 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5968	I think the MaxWidth should remain 8 if DL returns no smallest legal integer type? (or if there's no specific datalayout defined)

Don't set max width to 32 if there are no legal int widths set (leave as 8)

RosieSumpter marked an inline comment as done. Nov 22 2021, 1:55 AM

Harbormaster completed remote builds in B135357: Diff 388818. Nov 22 2021, 2:37 AM

Remove changes to LoopVectorize/pr32859.ll and LoopVectorize/pr36983.ll tests

fhahn added inline comments. Nov 22 2021, 2:40 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5995	Would it be possible to use TTI to pick the min/max type width to use there for the reduction, which would encode the target specific knowledge?
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48	I think picking the smallest integer type as max width comes with its own problems unfortunately. By choosing 32 bit as max width on AArch64, won't we pessimize codegen for loops with fp16 in-loop reductions (by choosing VF 4 instead of VF 8) for example? It is hard to say whether there are similar impacts on other targets.

Harbormaster completed remote builds in B135366: Diff 388822. Nov 22 2021, 3:05 AM

Change name of collectCastsToIgnore to collectCastInstrs and make it a member function of RecurrenceDescriptor.
Modify collectCastInstrs to also collect cast instructions where the destination type is the same as the recurrence type.
Call collectCastInstrs from LoopVectorizationCostModel::getSmallestAndWidestType in the case of in-loop reductions with no loads/stores to check through casts on recurrence operands when determining the max width.

Harbormaster completed remote builds in B138913: Diff 393814. Dec 13 2021, 2:21 AM

As @fhahn pointed out, using the smallest legal integer type for the default max width negatively impacts performance if the smallest type used in the loop is smaller than the smallest legal integer type. Instead, this update iterates through the recurrences in the loop and sets the maximum width to be that of the smallest type used by the recurrences when ElementTypesInLoop is empty. To determine the smallest type used by recurrences, we need to check for any casts on the recurrences’ input operands, which are now found by collectCastInstrs. This means that the max VF isn’t restricted too much in cases where in-loop reductions use types with width smaller than the target's smallest legal int.
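
The approach described above can be sketched in standalone C++ as follows. This is an illustrative simplification, not the code from the patch: `ReductionInfo` and `smallestAndWidestWidths` are hypothetical names, element types are modelled as plain bit widths, and the fallback to 8 when there are no reductions either is an assumption based on the existing default.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for one reduction in the loop (not an LLVM class).
struct ReductionInfo {
  unsigned RecurrenceTypeBits; // width of the recurrence type
  unsigned MinCastSrcBits;     // narrowest source width of any cast to the
                               // recurrence type, or -1U if there is none
};

// Returns {MinWidth, MaxWidth} in bits. ElementTypeBits models the widths
// collected from loads/stores; it is empty for loops like the one in the
// summary.
std::pair<unsigned, unsigned>
smallestAndWidestWidths(const std::vector<unsigned> &ElementTypeBits,
                        const std::vector<ReductionInfo> &Reductions) {
  unsigned MinWidth = -1U;
  unsigned MaxWidth = 8;
  if (ElementTypeBits.empty() && !Reductions.empty()) {
    // No element types: use the smallest width used by any recurrence,
    // looking through casts on the recurrence operands.
    MaxWidth = -1U;
    for (const ReductionInfo &R : Reductions)
      MaxWidth = std::min({MaxWidth, R.RecurrenceTypeBits, R.MinCastSrcBits});
  } else {
    for (unsigned Bits : ElementTypeBits) {
      MinWidth = std::min(MinWidth, Bits);
      MaxWidth = std::max(MaxWidth, Bits);
    }
  }
  return {MinWidth, MaxWidth};
}
```

For the loop in the summary, the `float` recurrence is fed by a cast from a 32-bit integer, so both widths are 32 and the sketch yields a max width of 32. A 32-bit recurrence whose operand is extended from `i16` would yield 16 instead, which keeps the wider VFs available.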

sdesmalen added inline comments. Dec 16 2021, 5:05 AM

llvm/include/llvm/Analysis/IVDescriptors.h
286	Passing `IgnoreCasts` to `collectCastInstrs` seems a bit contradictory :) Would it make more sense to pass two SmallPtrSet's, one named `CastsToRecurrenceType` (for case `RecurrenceType == Cast->getDstTy()`) and the other `CastsFromRecurrenceType` (for case `RecurrenceType == Cast->getSrcTy()`) and making these two sets available in RecurrenceDescriptor under their respective names. Then either set can be queried depending on what information is needed.

Split the cast instruction SmallPtrSet into 2 separate ones; CastsToRecurrenceType and CastsFromRecurrenceType
removed bool IgnoreCasts parameter in collectCastInstrs
call collectCastInstrs for all recurrence descriptors which get saved at the end of AddReductionVar
query CastsToRecurrenceType when checking casts used by recurrence descriptors in getSmallestAndWidestTypes

Thanks for the latest update @RosieSumpter. The patch seems to improve the case based on the types used in the reductions now, so that seems like an improvement. Just left a few final nits.

llvm/lib/Analysis/IVDescriptors.cpp
486	nit: move to line 274 to be together with `CastsFromRecurTy` ?
521	nit: narrower or wider
525	nit: This is unrelated to your patch, but to me it looks like this mechanism is implemented the wrong way around. I would have expected it to unconditionally capture all cast instructions (to/from recurrence type), and in LoopVectorize to call `computeRecurrenceType`, and make decisions to ignore cast instructions based on the narrower recurrence type.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974–5975	nit: this can be `RdxDesc.getRecurrenceType()->getScalarSizeInBits()`
5981–5982	nit: same here: `cast<CastInst>(I)->getSrcTy()->getScalarSizeInBits()`

Harbormaster completed remote builds in B139673: Diff 394884. Dec 16 2021, 9:18 AM

Addressed nits

LGTM!

This revision is now accepted and ready to land. Dec 16 2021, 9:36 AM

fhahn added inline comments. Dec 16 2021, 10:14 AM

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48	I might have missed it, but did you add a test case for the scenario I mentioned above? It would be good to have a bit wider test coverage.

fhahn added inline comments. Dec 16 2021, 10:17 AM

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48	Oh I see the loop here now operates on `i16`. I think we should have at least one or two other width combinations.

Harbormaster completed remote builds in B139689: Diff 394907. Dec 16 2021, 10:45 AM

fhahn added inline comments. Dec 17 2021, 1:39 AM

llvm/include/llvm/Analysis/IVDescriptors.h
260	this is only used to compute the minimum width, right? It seems like it would be simpler if we would compute the minimum width directly and store that, rather than a set of casts?

Replaced SmallPtrSet CastInstsToRecurrenceType with unsigned MinWidthCastToRecurrenceType (there could possibly be a shorter name for this!)
Added 2 test cases to smallest-and-widest-types.ll
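
As a rough illustration of computing the minimum width directly instead of storing a set of casts, the sketch below walks a tiny hand-rolled expression graph from the reduction's exit value. It is not the LLVM implementation: `Node` and `minWidthCastToRecurrence` are hypothetical, types are reduced to bit widths, and the loop-containment check and the handling of casts from the recurrence type are left out.

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Hypothetical miniature IR node (not an LLVM class).
struct Node {
  bool IsCast;
  unsigned SrcBits;  // meaningful only when IsCast is true
  unsigned DestBits; // meaningful only when IsCast is true
  std::vector<const Node *> Operands;
};

// Walks the operands reachable from Exit and returns the narrowest source
// width of any cast whose destination width equals RecurrenceBits, or -1U
// if no such cast is found.
unsigned minWidthCastToRecurrence(const Node *Exit, unsigned RecurrenceBits) {
  unsigned MinWidth = -1U;
  std::vector<const Node *> Worklist{Exit};
  std::unordered_set<const Node *> Visited;
  while (!Worklist.empty()) {
    const Node *N = Worklist.back();
    Worklist.pop_back();
    Visited.insert(N);
    if (N->IsCast && N->DestBits == RecurrenceBits) {
      // Record the width and do not look through the cast.
      MinWidth = std::min(MinWidth, N->SrcBits);
      continue;
    }
    // Queue operands that have not been visited yet; this also stops the
    // walk from cycling through the phi.
    for (const Node *Op : N->Operands)
      if (!Visited.count(Op))
        Worklist.push_back(Op);
  }
  return MinWidth;
}
```

Keeping only a running minimum, with `-1U` as the "no cast seen" sentinel, means the caller can combine the result with the recurrence type's own width using a plain `std::min`.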

RosieSumpter marked an inline comment as done. Dec 17 2021, 6:46 AM

RosieSumpter added inline comments.

llvm/include/llvm/Analysis/IVDescriptors.h
260	Hi @fhahn, thanks for the suggestion - that does seem to make more sense. Let me know if what I've done is what you had in mind.

Harbormaster completed remote builds in B139839: Diff 395119. Dec 17 2021, 7:37 AM

LGTM, thanks!

llvm/include/llvm/Analysis/IVDescriptors.h
257	it would be good to add `InBits` or something to make clear the value is in bits.

This revision was landed with ongoing or failed builds. Jan 4 2022, 2:26 AM

Closed by commit rG961f51fdf04f: [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions without… (authored by RosieSumpter).

This revision was automatically updated to reflect the committed changes.

RosieSumpter added a commit: rG961f51fdf04f: [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions without….

RosieSumpter marked an inline comment as done. Jan 4 2022, 2:28 AM

RosieSumpter added inline comments.

llvm/include/llvm/Analysis/IVDescriptors.h
257	I made this change before committing. Thanks for your help with this patch :)

Revision Contents

Path                                                                      Size
llvm/include/llvm/Analysis/IVDescriptors.h                                13 lines
llvm/lib/Analysis/IVDescriptors.cpp                                       57 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                           28 lines
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll   73 lines
llvm/test/Transforms/LoopVectorize/X86/funclet.ll                         2 lines

Diff 397235

llvm/include/llvm/Analysis/IVDescriptors.h

[... 71 lines not shown ...]
 /// This struct holds information about recurrence variables.
 class RecurrenceDescriptor {
 public:
   RecurrenceDescriptor() = default;

   RecurrenceDescriptor(Value *Start, Instruction *Exit, RecurKind K,
                        FastMathFlags FMF, Instruction *ExactFP, Type *RT,
                        bool Signed, bool Ordered,
-                       SmallPtrSetImpl<Instruction *> &CI)
+                       SmallPtrSetImpl<Instruction *> &CI,
+                       unsigned MinWidthCastToRecurTy)
       : StartValue(Start), LoopExitInstr(Exit), Kind(K), FMF(FMF),
         ExactFPMathInst(ExactFP), RecurrenceType(RT), IsSigned(Signed),
-        IsOrdered(Ordered) {
+        IsOrdered(Ordered),
+        MinWidthCastToRecurrenceType(MinWidthCastToRecurTy) {
     CastInsts.insert(CI.begin(), CI.end());
   }

   /// This POD struct holds information about a potential recurrence operation.
   class InstDesc {
   public:
     InstDesc(bool IsRecur, Instruction *I, Instruction *ExactFP = nullptr)
         : IsRecurrence(IsRecur), PatternLastInst(I),
[... 154 lines not shown ...] public:
   /// Returns the type of the recurrence. This type can be narrower than the
   /// actual type of the Phi if the recurrence has been type-promoted.
   Type *getRecurrenceType() const { return RecurrenceType; }

   /// Returns a reference to the instructions used for type-promoting the
   /// recurrence.
   const SmallPtrSet<Instruction *, 8> &getCastInsts() const { return CastInsts; }

+  /// Returns the minimum width used by the recurrence in bits.
+  unsigned getMinWidthCastToRecurrenceTypeInBits() const {
+    return MinWidthCastToRecurrenceType;
+  }
+
   /// Returns true if all source operands of the recurrence are SExtInsts.
   bool isSigned() const { return IsSigned; }

   /// Expose an ordered FP reduction to the instance users.
   bool isOrdered() const { return IsOrdered; }

   /// Attempts to find a chain of operations from Phi to LoopExitInst that can
   /// be treated as a set of reductions instructions for in-loop reductions.
[... 9 lines not shown ...]
 private:
   // The starting value of the recurrence.
   // It does not have to be zero!
   TrackingVH<Value> StartValue;
   // The instruction who's value is used outside the loop.
   Instruction *LoopExitInstr = nullptr;
   // The kind of the recurrence.
   RecurKind Kind = RecurKind::None;
   // The fast-math flags on the recurrent instructions. We propagate these
   // fast-math flags into the vectorized FP instructions we generate.
   FastMathFlags FMF;
   // First instance of non-reassociative floating-point in the PHI's use-chain.
   Instruction *ExactFPMathInst = nullptr;
   // The type of the recurrence.
   Type *RecurrenceType = nullptr;
   // True if all source operands of the recurrence are SExtInsts.
   bool IsSigned = false;
   // True if this recurrence can be treated as an in-order reduction.
   // Currently only a non-reassociative FAdd can be considered in-order,
   // if it is also the only FAdd in the PHI's use chain.
   bool IsOrdered = false;
   // Instructions used for type-promoting the recurrence.
   SmallPtrSet<Instruction *, 8> CastInsts;
+  // The minimum width used by the recurrence.
+  unsigned MinWidthCastToRecurrenceType;
 };

 /// A struct for saving information about induction variables.
 class InductionDescriptor {
 public:
   /// This enum represents the kinds of inductions that we support.
   enum InductionKind {
     IK_NoInduction, ///< Not an induction variable.
[... 97 lines not shown ...]

llvm/lib/Analysis/IVDescriptors.cpp

Show First 20 Lines • Show All 155 Lines • ▼ Show 20 Lines	if (!isPowerOf2_64(MaxBitWidth))
MaxBitWidth = NextPowerOf2(MaxBitWidth);		MaxBitWidth = NextPowerOf2(MaxBitWidth);

return std::make_pair(Type::getIntNTy(Exit->getContext(), MaxBitWidth),		return std::make_pair(Type::getIntNTy(Exit->getContext(), MaxBitWidth),
IsSigned);		IsSigned);
}		}

/// Collect cast instructions that can be ignored in the vectorizer's cost		/// Collect cast instructions that can be ignored in the vectorizer's cost
/// model, given a reduction exit value and the minimal type in which the		/// model, given a reduction exit value and the minimal type in which the
/// reduction can be represented.		// reduction can be represented. Also search casts to the recurrence type
static void collectCastsToIgnore(Loop TheLoop, Instruction Exit,		// to find the minimum width used by the recurrence.
		static void collectCastInstrs(Loop TheLoop, Instruction Exit,
Type *RecurrenceType,		Type *RecurrenceType,
SmallPtrSetImpl<Instruction *> &Casts) {		SmallPtrSetImpl<Instruction *> &Casts,
		unsigned &MinWidthCastToRecurTy) {

SmallVector<Instruction *, 8> Worklist;		SmallVector<Instruction *, 8> Worklist;
SmallPtrSet<Instruction *, 8> Visited;		SmallPtrSet<Instruction *, 8> Visited;
Worklist.push_back(Exit);		Worklist.push_back(Exit);
		MinWidthCastToRecurTy = -1U;

while (!Worklist.empty()) {		while (!Worklist.empty()) {
Instruction *Val = Worklist.pop_back_val();		Instruction *Val = Worklist.pop_back_val();
Visited.insert(Val);		Visited.insert(Val);
if (auto *Cast = dyn_cast<CastInst>(Val))		if (auto *Cast = dyn_cast<CastInst>(Val)) {
if (Cast->getSrcTy() == RecurrenceType) {		if (Cast->getSrcTy() == RecurrenceType) {
// If the source type of a cast instruction is equal to the recurrence		// If the source type of a cast instruction is equal to the recurrence
// type, it will be eliminated, and should be ignored in the vectorizer		// type, it will be eliminated, and should be ignored in the vectorizer
// cost model.		// cost model.
Casts.insert(Cast);		Casts.insert(Cast);
continue;		continue;
}		}
		if (Cast->getDestTy() == RecurrenceType) {
		// The minimum width used by the recurrence is found by checking for
		// casts on its operands. The minimum width is used by the vectorizer
		// when finding the widest type for in-loop reductions without any
		// loads/stores.
		MinWidthCastToRecurTy = std::min<unsigned>(
		MinWidthCastToRecurTy, Cast->getSrcTy()->getScalarSizeInBits());
		continue;
		}
		}
// Add all operands to the work list if they are loop-varying values that		// Add all operands to the work list if they are loop-varying values that
// we haven't yet visited.		// we haven't yet visited.
for (Value *O : cast<User>(Val)->operands())		for (Value *O : cast<User>(Val)->operands())
if (auto *I = dyn_cast<Instruction>(O))		if (auto *I = dyn_cast<Instruction>(O))
if (TheLoop->contains(I) && !Visited.count(I))		if (TheLoop->contains(I) && !Visited.count(I))
Worklist.push_back(I);		Worklist.push_back(I);
}		}
}		}
▲ Show 20 Lines • Show All 67 Lines • ▼ Show 20 Lines	bool RecurrenceDescriptor::AddReductionVar(PHINode *Phi, RecurKind Kind,
// the number of instruction we saw from the recognized min/max pattern,		// the number of instruction we saw from the recognized min/max pattern,
// to make sure we only see exactly the two instructions.		// to make sure we only see exactly the two instructions.
unsigned NumCmpSelectPatternInst = 0;		unsigned NumCmpSelectPatternInst = 0;
InstDesc ReduxDesc(false, nullptr);		InstDesc ReduxDesc(false, nullptr);

// Data used for determining if the recurrence has been type-promoted.		// Data used for determining if the recurrence has been type-promoted.
Type *RecurrenceType = Phi->getType();		Type *RecurrenceType = Phi->getType();
SmallPtrSet<Instruction *, 4> CastInsts;		SmallPtrSet<Instruction *, 4> CastInsts;
		unsigned MinWidthCastToRecurrenceType;
Instruction *Start = Phi;		Instruction *Start = Phi;
bool IsSigned = false;		bool IsSigned = false;

SmallPtrSet<Instruction *, 8> VisitedInsts;		SmallPtrSet<Instruction *, 8> VisitedInsts;
SmallVector<Instruction *, 8> Worklist;		SmallVector<Instruction *, 8> Worklist;

// Return early if the recurrence kind does not match the type of Phi. If the		// Return early if the recurrence kind does not match the type of Phi. If the
// recurrence kind is arithmetic, we attempt to look through AND operations		// recurrence kind is arithmetic, we attempt to look through AND operations
▲ Show 20 Lines • Show All 189 Lines • ▼ Show 20 Lines	if (isSelectCmpRecurrenceKind(Kind) && NumCmpSelectPatternInst != 1)
return false;		return false;

if (!FoundStartPHI \|\| !FoundReduxOp \|\| !ExitInstruction)		if (!FoundStartPHI \|\| !FoundReduxOp \|\| !ExitInstruction)
return false;		return false;

const bool IsOrdered = checkOrderedReduction(		const bool IsOrdered = checkOrderedReduction(
Kind, ReduxDesc.getExactFPMathInst(), ExitInstruction, Phi);		Kind, ReduxDesc.getExactFPMathInst(), ExitInstruction, Phi);

if (Start != Phi) {		if (Start != Phi) {
		sdesmalenUnsubmitted Done Reply Inline Actions nit: move to line 274 to be together with `CastsFromRecurTy` ? sdesmalen: nit: move to line 274 to be together with `CastsFromRecurTy` ?
// If the starting value is not the same as the phi node, we speculatively		// If the starting value is not the same as the phi node, we speculatively
// looked through an 'and' instruction when evaluating a potential		// looked through an 'and' instruction when evaluating a potential
// arithmetic reduction to determine if it may have been type-promoted.		// arithmetic reduction to determine if it may have been type-promoted.
//		//
// We now compute the minimal bit width that is required to represent the		// We now compute the minimal bit width that is required to represent the
// reduction. If this is the same width that was indicated by the 'and', we		// reduction. If this is the same width that was indicated by the 'and', we
// can represent the reduction in the smaller type. The 'and' instruction		// can represent the reduction in the smaller type. The 'and' instruction
// will be eliminated since it will essentially be a cast instruction that		// will be eliminated since it will essentially be a cast instruction that
Show All 13 Lines	if (Start != Phi) {
// TODO: We should not rely on InstCombine to rewrite the reduction in the		// TODO: We should not rely on InstCombine to rewrite the reduction in the
// smaller type. We should just generate a correctly typed expression		// smaller type. We should just generate a correctly typed expression
// to begin with.		// to begin with.
Type *ComputedType;		Type *ComputedType;
std::tie(ComputedType, IsSigned) =		std::tie(ComputedType, IsSigned) =
computeRecurrenceType(ExitInstruction, DB, AC, DT);		computeRecurrenceType(ExitInstruction, DB, AC, DT);
if (ComputedType != RecurrenceType)		if (ComputedType != RecurrenceType)
return false;		return false;
		}

// The recurrence expression will be represented in a narrower type. If		// Collect cast instructions and the minimum width used by the recurrence.
// there are any cast instructions that will be unnecessary, collect them		// If the starting value is not the same as the phi node and the computed
// in CastInsts. Note that the 'and' instruction was already included in		// recurrence type is equal to the recurrence type, the recurrence expression
// this list.		// will be represented in a narrower or wider type. If there are any cast
		sdesmalenUnsubmitted Done Reply Inline Actions nit: narrower or wider sdesmalen: nit: narrower or wider
		// instructions that will be unnecessary, collect them in CastsFromRecurTy.
		// Note that the 'and' instruction was already included in this list.
//		//
// TODO: A better way to represent this may be to tag in some way all the		// TODO: A better way to represent this may be to tag in some way all the
		sdesmalenUnsubmitted Not Done Reply Inline Actions nit: This is unrelated to your patch, but to me it looks like this mechanism is implemented the wrong way around. I would have expected it to unconditionally capture all cast instructions (to/from recurrence type), and in LoopVectorize to call `computeRecurrenceType`, and make decisions to ignore cast instructions based on the narrower recurrence type. sdesmalen: nit: This is unrelated to your patch, but to me it looks like this mechanism is implemented the…
// instructions that are a part of the reduction. The vectorizer cost		// instructions that are a part of the reduction. The vectorizer cost
// model could then apply the recurrence type to these instructions,		// model could then apply the recurrence type to these instructions,
// without needing a white list of instructions to ignore.		// without needing a white list of instructions to ignore.
// This may also be useful for the inloop reductions, if it can be		// This may also be useful for the inloop reductions, if it can be
// kept simple enough.		// kept simple enough.
collectCastsToIgnore(TheLoop, ExitInstruction, RecurrenceType, CastInsts);		collectCastInstrs(TheLoop, ExitInstruction, RecurrenceType, CastInsts,
}		MinWidthCastToRecurrenceType);

// We found a reduction var if we have reached the original phi node and we		// We found a reduction var if we have reached the original phi node and we
// only have a single instruction with out-of-loop users.		// only have a single instruction with out-of-loop users.

// The ExitInstruction(Instruction which is allowed to have out-of-loop users)		// The ExitInstruction(Instruction which is allowed to have out-of-loop users)
// is saved as part of the RecurrenceDescriptor.		// is saved as part of the RecurrenceDescriptor.

// Save the description of this reduction variable.		// Save the description of this reduction variable.
RecurrenceDescriptor RD(RdxStart, ExitInstruction, Kind, FMF,		RecurrenceDescriptor RD(RdxStart, ExitInstruction, Kind, FMF,
ReduxDesc.getExactFPMathInst(), RecurrenceType,		ReduxDesc.getExactFPMathInst(), RecurrenceType,
IsSigned, IsOrdered, CastInsts);		IsSigned, IsOrdered, CastInsts,
		MinWidthCastToRecurrenceType);
RedDes = RD;		RedDes = RD;

return true;		return true;
}		}

// We are looking for loops that do something like this:		// We are looking for loops that do something like this:
// int r = 0;		// int r = 0;
// for (int i = 0; i < n; i++) {		// for (int i = 0; i < n; i++) {

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

		if (Result != VectorizationFactor::Disabled())
LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "		LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "
<< Result.Width.getFixedValue() << "\n";);		<< Result.Width.getFixedValue() << "\n";);
return Result;		return Result;
}		}

std::pair<unsigned, unsigned>		std::pair<unsigned, unsigned>
LoopVectorizationCostModel::getSmallestAndWidestTypes() {		LoopVectorizationCostModel::getSmallestAndWidestTypes() {
unsigned MinWidth = -1U;		unsigned MinWidth = -1U;
unsigned MaxWidth = 8;		unsigned MaxWidth = 8;
		lebedev.ri (Not Done): I suspect this should be querying datalayout for the minimal integer size?
const DataLayout &DL = TheFunction->getParent()->getDataLayout();		const DataLayout &DL = TheFunction->getParent()->getDataLayout();
		// For in-loop reductions, no element types are added to ElementTypesInLoop
		// if there are no loads/stores in the loop. In this case, check through the
		// reduction variables to determine the maximum width.
		if (ElementTypesInLoop.empty() && !Legal->getReductionVars().empty()) {
		// Reset MaxWidth so that we can find the smallest type used by recurrences
		// in the loop.
		sdesmalen (Done): I think the MaxWidth should remain 8 if DL returns no smallest legal integer type? (or if there's no specific datalayout defined)
		MaxWidth = -1U;
		for (auto &PhiDescriptorPair : Legal->getReductionVars()) {
		const RecurrenceDescriptor &RdxDesc = PhiDescriptorPair.second;
		// When finding the min width used by the recurrence we need to account
		// for casts on the input operands of the recurrence.
		MaxWidth = std::min<unsigned>(
		MaxWidth, std::min<unsigned>(
		sdesmalen (Done): nit: this can be `RdxDesc.getRecurrenceType()->getScalarSizeInBits()`
		RdxDesc.getMinWidthCastToRecurrenceTypeInBits(),
		RdxDesc.getRecurrenceType()->getScalarSizeInBits()));
		}
		} else {
for (Type *T : ElementTypesInLoop) {		for (Type *T : ElementTypesInLoop) {
MinWidth = std::min<unsigned>(		MinWidth = std::min<unsigned>(
MinWidth, DL.getTypeSizeInBits(T->getScalarType()).getFixedSize());		MinWidth, DL.getTypeSizeInBits(T->getScalarType()).getFixedSize());
		sdesmalen (Done): nit: same here: `cast<CastInst>(I)->getSrcTy()->getScalarSizeInBits()`
MaxWidth = std::max<unsigned>(		MaxWidth = std::max<unsigned>(
MaxWidth, DL.getTypeSizeInBits(T->getScalarType()).getFixedSize());		MaxWidth, DL.getTypeSizeInBits(T->getScalarType()).getFixedSize());
}		}
		}
return {MinWidth, MaxWidth};		return {MinWidth, MaxWidth};
}		}
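
The width selection in the patched `getSmallestAndWidestTypes` can be sketched outside of LLVM as follows. This is only an illustration under stated assumptions: `ReductionWidths` and `smallestAndWidest` are hypothetical names, and plain bit widths stand in for `ElementTypesInLoop` and for the two values the patch reads from each `RecurrenceDescriptor`.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two widths the patch queries on a
// RecurrenceDescriptor; this is not the LLVM class.
struct ReductionWidths {
  unsigned MinWidthCastToRecurrenceTypeInBits; // narrowest cast feeding the recurrence
  unsigned RecurrenceTypeInBits;               // width of the recurrence type itself
};

// Mirrors the control flow of the patched function using plain integers.
std::pair<unsigned, unsigned>
smallestAndWidest(const std::vector<unsigned> &ElementTypeBits,
                  const std::vector<ReductionWidths> &Reductions) {
  unsigned MinWidth = -1U;
  unsigned MaxWidth = 8;
  if (ElementTypeBits.empty() && !Reductions.empty()) {
    // No loads/stores contributed a type: take the smallest width used by
    // any recurrence, including casts on its input operands.
    MaxWidth = -1U;
    for (const ReductionWidths &R : Reductions)
      MaxWidth = std::min(MaxWidth,
                          std::min(R.MinWidthCastToRecurrenceTypeInBits,
                                   R.RecurrenceTypeInBits));
  } else {
    // Usual path: smallest and widest of the collected element types.
    for (unsigned Bits : ElementTypeBits) {
      MinWidth = std::min(MinWidth, Bits);
      MaxWidth = std::max(MaxWidth, Bits);
    }
  }
  return {MinWidth, MaxWidth};
}
```

With no element types and one reduction whose narrowest input cast is 32 bits on a 64-bit recurrence, the sketch returns 4294967295 / 32, which is the pair the `no_loads_stores_32` test in this diff checks for. Note that `MinWidth` is left at `-1U` on the reduction path, as in the patch.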

void LoopVectorizationCostModel::collectElementTypesForWidening() {		void LoopVectorizationCostModel::collectElementTypesForWidening() {
ElementTypesInLoop.clear();		ElementTypesInLoop.clear();
// For each block.		// For each block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
// For each instruction in the loop.		// For each instruction in the loop.
		for (Instruction &I : BB->instructionsWithoutDebug()) {

// Examine PHI nodes that are reduction variables. Update the type to		// Examine PHI nodes that are reduction variables. Update the type to
// account for the recurrence type.		// account for the recurrence type.
if (auto *PN = dyn_cast<PHINode>(&I)) {		if (auto *PN = dyn_cast<PHINode>(&I)) {
if (!Legal->isReductionVariable(PN))		if (!Legal->isReductionVariable(PN))
continue;		continue;
const RecurrenceDescriptor &RdxDesc =		const RecurrenceDescriptor &RdxDesc =
Legal->getReductionVars().find(PN)->second;		Legal->getReductionVars().find(PN)->second;
if (PreferInLoopReductions || useOrderedReductions(RdxDesc) ||		if (PreferInLoopReductions || useOrderedReductions(RdxDesc) ||
dmgreen (Not Done): This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v8i32)) in a single operation. That might not be as useful for float types - and the example loop you showed with no loads/stores in the loop is much less likely for integer - it will already have been simplified. Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?
sdesmalen (Not Done):
> This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v8i32)) in a single operation.
Yes, the same is true for SVE. There is also code in the cost-model to recognise the extend to the wider type. I think the actual reason for the reduction-code being here is to avoid ending up with vector PHIs that are too wide (out-of-loop reduction). The checks for in-loop reductions were added later because (1) there is no vector PHI and (2) that it doesn't limit the VF too early so that it lets the cost-model code consider the wider VFs for the reason you described.
> Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?
I don't think we should add specific knowledge to limit the VF for fp-reductions here, because that means adding target-specific knowledge to the `collectElementTypesForWidening`, which is something that the cost-model should decide on. Also it would lead to regressions.
dmgreen (Not Done): Yeah, we added it in https://reviews.llvm.org/D93476. I would have no objections to removing this for ordered reductions or float types I don't think. There is no such thing (as far as I understand) as an extending in-order float reduction, and we can rely on the UF for things that will become wider-than-legal vectors.
fhahn (Not Done): Would it be possible to use TTI to pick the min/max type width to use there for the reduction, which would encode the target specific knowledge?
TTI.preferInLoopReduction(RdxDesc.getOpcode(),		TTI.preferInLoopReduction(RdxDesc.getOpcode(),
RdxDesc.getRecurrenceType(),		RdxDesc.getRecurrenceType(),
TargetTransformInfo::ReductionFlags()))		TargetTransformInfo::ReductionFlags()))
continue;		continue;
T = RdxDesc.getRecurrenceType();		T = RdxDesc.getRecurrenceType();
}		}

// Examine the stored values.		// Examine the stored values.

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll

; REQUIRES: asserts		; REQUIRES: asserts
; RUN: opt < %s -loop-vectorize -debug-only=loop-vectorize -disable-output 2>&1 | FileCheck %s		; RUN: opt < %s -loop-vectorize -force-target-instruction-cost=1 -debug-only=loop-vectorize -disable-output 2>&1 | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

; CHECK-LABEL: Checking a loop in "interleaved_access"		; CHECK-LABEL: Checking a loop in "interleaved_access"
; CHECK: The Smallest and Widest types: 64 / 64 bits		; CHECK: The Smallest and Widest types: 64 / 64 bits
;		;
define void @interleaved_access(i8** %A, i64 %N) {		define void @interleaved_access(i8** %A, i64 %N) {
		for.body:
store i8* null, i8** %tmp3, align 8		store i8* null, i8** %tmp3, align 8
%i.next.3 = add nsw i64 %i, 4		%i.next.3 = add nsw i64 %i, 4
%cond = icmp slt i64 %i.next.3, %N		%cond = icmp slt i64 %i.next.3, %N
br i1 %cond, label %for.body, label %for.end		br i1 %cond, label %for.body, label %for.end

for.end:		for.end:
ret void		ret void
}		}

		; For in-loop reductions with no loads or stores in the loop the widest type is
		; determined by looking through the recurrences, which allows a sensible VF to be
		; chosen. The following 3 cases check different combinations of widths.

		; CHECK-LABEL: Checking a loop in "no_loads_stores_32"
		; CHECK: The Smallest and Widest types: 4294967295 / 32 bits
		; CHECK: Selecting VF: 4

		define double @no_loads_stores_32(i32 %n) {
		entry:
		br label %for.body

		for.body:
		%s.09 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
		lebedev.ri (Not Done): Where is `ElementTypesInLoop` populated? `LoopVectorizationCostModel::collectElementTypesForWidening()` suggests that PHI nodes are also queried.
		sdesmalen (Not Done): The function `collectElementTypesForWidening()` collects types for the following reasons:
		* Types of loads/stores, because these are used in computing a safe dependence distance (and also to have a sensible/natural VF as maximum VF).
		* Types of unordered reductions, in order to avoid generating vector PHI nodes that span multiple registers when the VF is too wide. In-order reductions are not considered, because those PHI nodes remain scalar after vectorization.
		If we consider more types in the loop in `collectElementTypesForWidening()`, e.g. for in-order reductions or extends, then this leads to regressions; any type larger than the maximum loaded/stored type will limit the maximum VF. If the maximum VF is limited, then any VF up to the wider maximum will not be considered by the cost-model. So in general, it's better not to limit the VF for those reasons and leave it up to the cost-model to choose the best VF.
		In the test-case that @RosieSumpter added, there are no loads/stores and there is no vector PHI node because the reduction is ordered, so it considers any VF up to the widest-possible VF based on an i8 element type (even if the loop operates on a larger size). The cost-model then only considers the throughput cost, and the per-lane cost is no different between VF=4 and VF=16, so it favours VF=16. It does not consider the additional code-size cost when the operation is scalarized.
		I guess the alternative choices are to:
		* Look through the operands of the ordered reduction operation, as well as any casts/extends on those operands, and only consider these element types in `collectElementTypesForWidening()` if there are no loads/stores in the loop. This heuristic gets quite complicated I think.
		* Consider the code-size cost for in-order reductions, so that it favours shorter VFs (I'm not sure if this is desirable, as it may regress other cases where code-size isn't relevant).
		Because the case is so niche (i.e. a loop with only in-order reductions), @RosieSumpter thought it made sense to fall back to a more sensible default instead.
		fhahn (Not Done): I think picking the smallest integer type as max width comes with its own problems unfortunately. By choosing 32 bit as max width on AArch64, won't we pessimize codegen for loops with fp16 in-loop reductions (by choosing VF 4 instead of VF 8) for example? It is hard to say whether there are similar impacts on other targets.
		fhahn (Not Done): I might have missed it, but did you add a test case for the scenario I mentioned above? It would be good to have a bit wider test coverage.
		fhahn (Done): Oh I see the loop here now operates on `i16`. I think we should have at least one or two other width combinations.
		%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
		%conv = sitofp i32 %i.08 to float
		%conv1 = fpext float %conv to double
		%add = fadd double %s.09, %conv1
		%inc = add nuw i32 %i.08, 1
		%exitcond.not = icmp eq i32 %inc, %n
		br i1 %exitcond.not, label %for.end, label %for.body

		for.end:
		%.lcssa = phi double [ %add, %for.body ]
		ret double %.lcssa
		}

		; CHECK-LABEL: Checking a loop in "no_loads_stores_16"
		; CHECK: The Smallest and Widest types: 4294967295 / 16 bits
		; CHECK: Selecting VF: 8

		define double @no_loads_stores_16() {
		entry:
		br label %for.body

		for.body:
		%s.09 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
		%i.08 = phi i16 [ 0, %entry ], [ %inc, %for.body ]
		%conv = sitofp i16 %i.08 to double
		%add = fadd double %s.09, %conv
		%inc = add nuw nsw i16 %i.08, 1
		%exitcond.not = icmp eq i16 %inc, 12345
		br i1 %exitcond.not, label %for.end, label %for.body

		for.end:
		%.lcssa = phi double [ %add, %for.body ]
		ret double %.lcssa
		}

		; CHECK-LABEL: Checking a loop in "no_loads_stores_8"
		; CHECK: The Smallest and Widest types: 4294967295 / 8 bits
		; CHECK: Selecting VF: 16

		define float @no_loads_stores_8() {
		entry:
		br label %for.body

		for.body:
		%s.09 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
		%i.08 = phi i8 [ 0, %entry ], [ %inc, %for.body ]
		%conv = sitofp i8 %i.08 to float
		%add = fadd float %s.09, %conv
		%inc = add nuw nsw i8 %i.08, 1
		%exitcond.not = icmp eq i8 %inc, 12345
		br i1 %exitcond.not, label %for.end, label %for.body

		for.end:
		%.lcssa = phi float [ %add, %for.body ]
		ret float %.lcssa
		}
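
The VFs selected in the three `no_loads_stores_*` cases line up with a simple register-width calculation. The sketch below assumes a 128-bit vector register (as for AArch64 NEON) and assumes the upper bound on the VF is the register width divided by the widest type, rounded down to a power of two. `maxVF` and `powerOf2Floor` are hypothetical helper names written for this illustration, not LLVM functions, and the cost model may still choose a VF below this bound.

```cpp
// Largest power of two that is <= X; returns 0 when X is 0.
constexpr unsigned powerOf2Floor(unsigned X) {
  unsigned P = 0;
  // B wraps to 0 after the top bit, which ends the loop for large X.
  for (unsigned B = 1; B != 0 && B <= X; B <<= 1)
    P = B;
  return P;
}

// Upper bound on the vectorization factor for a given register width and
// widest element type (both in bits). Assumes WidestTypeBits is non-zero.
constexpr unsigned maxVF(unsigned RegisterBits, unsigned WidestTypeBits) {
  return powerOf2Floor(RegisterBits / WidestTypeBits);
}

// Under the 128-bit assumption these match the CHECK lines above.
static_assert(maxVF(128, 32) == 4, "no_loads_stores_32");
static_assert(maxVF(128, 16) == 8, "no_loads_stores_16");
static_assert(maxVF(128, 8) == 16, "no_loads_stores_8");
```

Under the same assumption, the old default widest type of 8 bits gives a bound of 16 regardless of the types the loop actually operates on, which is the behaviour the summary describes.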

llvm/test/Transforms/LoopVectorize/X86/funclet.ll

		try.cont: ; preds = %for.cond.cleanup
ret void		ret void

unreachable: ; preds = %entry		unreachable: ; preds = %entry
unreachable		unreachable
}		}

; CHECK-LABEL: define void @test1(		; CHECK-LABEL: define void @test1(
; CHECK: %[[cpad:.]] = catchpad within {{.}} [i8* null, i32 64, i8* null]		; CHECK: %[[cpad:.]] = catchpad within {{.}} [i8* null, i32 64, i8* null]
; CHECK: call <16 x double> @llvm.floor.v16f64(<16 x double> {{.*}}) [ "funclet"(token %[[cpad]]) ]		; CHECK: call <8 x double> @llvm.floor.v8f64(<8 x double> {{.*}}) [ "funclet"(token %[[cpad]]) ]

declare x86_stdcallcc void @_CxxThrowException(i8, i8)		declare x86_stdcallcc void @_CxxThrowException(i8, i8)

declare i32 @__CxxFrameHandler3(...)		declare i32 @__CxxFrameHandler3(...)

declare double @floor(double) #1		declare double @floor(double) #1

attributes #0 = { "target-features"="+sse2" }		attributes #0 = { "target-features"="+sse2" }
attributes #1 = { nounwind readnone }		attributes #1 = { nounwind readnone }