This is an archive of the discontinued LLVM Phabricator instance.

Differential D59995

[LV] Exclude loop-invariant inputs from scalar cost computation.
Closed · Public

Authored by fhahn on Mar 29 2019, 8:03 AM.

Details

Reviewers

hsaito
rengolin
dcaballe
Ayal

Commits

rG9428d95ce7f8: [LV] Exclude loop-invariant inputs from scalar cost computation.
rL366030: [LV] Exclude loop-invariant inputs from scalar cost computation.

Summary

Loop invariant operands do not need to be scalarized, as we are using
the values outside the loop. We should ignore them when computing the
scalarization overhead.

Fixes PR41294

Diff Detail

Repository: rL LLVM

Event Timeline

fhahn created this revision. Mar 29 2019, 8:03 AM

Herald added a project: Restricted Project. Mar 29 2019, 8:03 AM

Herald added subscribers: jdoerfert, rkruppe, hiraditya, javed.absar.

Harbormaster completed remote builds in B29809: Diff 192826. Mar 29 2019, 8:05 AM

The culprit here is the assumption made by TTI.getOperandsScalarizationOverhead(Operands, VF) that all its Operands will be vectorized according to VF, and would thus require extraction to feed a scalarized/replicated user. But any such Operand might not get vectorized, and possibly must not get vectorized, e.g., due to an incompatible type as demonstrated by PR41294 and the testcase. In some cases an Operand will obviously not be vectorized, such as if it's loop-invariant or live-in. More generally, LV uses the following:

auto needsExtract = [&](Instruction *I) -> bool {
  return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};

which would require passing not only TheLoop into getScalarizationOverhead(I, VF, TTI) but also the CM --- better turn this static function into a method of CM?
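As a rough illustration of the filtering proposed above, the following self-contained sketch models the needsExtract predicate and an operand-overhead sum that skips operands which will not be widened. ToyLoop, operandsScalarizationOverhead and the flat per-lane ExtractCost are hypothetical stand-ins invented for this example, not LLVM APIs; the real TTI hook computes target-specific costs.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the loop and the cost model's scalar analysis.
struct ToyLoop {
  std::set<std::string> Body;           // values defined inside the loop
  std::set<std::string> ScalarAfterVec; // values that stay scalar for the VF

  bool contains(const std::string &V) const { return Body.count(V) != 0; }
  bool isScalarAfterVectorization(const std::string &V) const {
    return ScalarAfterVec.count(V) != 0;
  }
};

// Mirrors the lambda quoted above: only values defined in the loop that
// will be widened need an extract to feed a scalarized user.
bool needsExtract(const ToyLoop &L, const std::string &V) {
  return L.contains(V) && !L.isScalarAfterVectorization(V);
}

// Assumed cost shape: one extract per lane for each operand that needs it.
unsigned operandsScalarizationOverhead(const ToyLoop &L,
                                       const std::vector<std::string> &Ops,
                                       unsigned VF, unsigned ExtractCost) {
  unsigned Cost = 0;
  for (const std::string &Op : Ops)
    if (needsExtract(L, Op))
      Cost += VF * ExtractCost;
  return Cost;
}
```

In this model a live-in such as %sv contributes nothing, because it is not defined inside the loop, which is the behaviour the comment asks for.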

Note that there's also CM.isProfitableToScalarize(I, VF), but it relies on having computed costs, so it is difficult to use when computing the cost (of a User). Skipping it would only affect the accuracy of the resulting cost, considering Operands that can be vectorized but will not be due to profitability.

Fixing getVectorCallCost deserves another testcase.
Seems like getVectorIntrinsicCost also requires fixing?

Would be good to hoist such invariant instructions

%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1

out of the loop before LV, or at least have LV recognize them as uniform.

llvm/test/Transforms/LoopVectorize/AArch64/instructions-with-struct-ops.ll
3 ↗	(On Diff #192826)	Record PR41294 in comment, file name, or testcase function name.
4 ↗	(On Diff #192826)	`for extractelement` >> `for extractvalue`
10 ↗	(On Diff #192826)	`the extractelement` >> `the extractvalue`

Thanks for your comments Ayal! I'll update the patch in a bit.

fhahn mentioned this in D61638: [LV] Move getScalarizationOverhead and vector call cost computations to CM. (NFC). May 7 2019, 3:51 AM

Split out moving various functions to LoopVectorizationCostModel to D61638.

Hoisted needsExtract, which checks whether an operand needs to be extracted/scalarized,
into LoopVectorizationCostModel and extended it to also check for loop-invariant operands.

Harbormaster completed remote builds in B31517: Diff 198430. May 7 2019, 3:54 AM

In D59995#1455952, @Ayal wrote:
The culprit here is the assumption made by TTI.getOperandsScalarizationOverhead(Operands, VF) that all its Operands will be vectorized according to VF, and would thus require extraction to feed a scalarized/replicated user. But any such Operand might not get vectorized, and possibly must not get vectorized, e.g., due to an incompatible type as demonstrated by PR41294 and the testcase. In some cases an Operand will obviously not be vectorized, such as if it's loop-invariant or live-in. More generally, LV uses the following:
auto needsExtract = [&](Instruction *I) -> bool {
  return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};
which would require passing not only TheLoop into getScalarizationOverhead(I, VF, TTI) but also the CM --- better turn this static function into a method of CM?

Done.

Note that there's also CM.isProfitableToScalarize(I, VF), but it relies on having computed costs, so it is difficult to use when computing the cost (of a User). Skipping it would only affect the accuracy of the resulting cost, considering Operands that can be vectorized but will not be due to profitability.

Fixing getVectorCallCost deserves another testcase.

Added a test case with call. But thinking about it again, it does not really test the issue. Not sure it is actually possible to test getVectorCallCost, as there are no vector call functions that take struct values?

Seems like getVectorIntrinsicCost also requires fixing?

Yep, I'll update it in a second. In practice, I don't think there are any intrinsics that take struct types, but maybe in the future it might become a problem.

Would be good to hoist such invariant instructions
%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1
out of the loop before LV, or at least have LV recognize them as uniform.

Yep, I can look into that as a follow up. LICM should hoist those things, but I think in general we cannot guarantee that all loop-invariant instructions are hoisted out before LoopVectorize (the test case came from a fuzzer). Do you think we should try to hoist them in LV? I would assume a later run of LICM will hoist them.

Filter operands for getIntrinsicCallCost as well.

fhahn added a parent revision: D61638: [LV] Move getScalarizationOverhead and vector call cost computations to CM. (NFC). May 7 2019, 4:10 AM

Harbormaster completed remote builds in B31518: Diff 198433. May 7 2019, 4:13 AM

Ping. Ayal, does this address your comments appropriately?

In D59995#1493241, @fhahn wrote:
In D59995#1455952, @Ayal wrote:
The culprit here is the assumption made by TTI.getOperandsScalarizationOverhead(Operands, VF) that all its Operands will be vectorized according to VF, and would thus require extraction to feed a scalarized/replicated user. But any such Operand might not get vectorized, and possibly must not get vectorized, e.g., due to an incompatible type as demonstrated by PR41294 and the testcase. In some cases an Operand will obviously not be vectorized, such as if it's loop-invariant or live-in. More generally, LV uses the following:
auto needsExtract = [&](Instruction *I) -> bool {
  return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};
which would require passing not only TheLoop into getScalarizationOverhead(I, VF, TTI) but also the CM --- better turn this static function into a method of CM?
Done.

Very good, thanks for the refactoring.

Note that there's also CM.isProfitableToScalarize(I, VF), but it relies on having computed costs, so it is difficult to use when computing the cost (of a User). Skipping it would only affect the accuracy of the resulting cost, considering Operands that can be vectorized but will not be due to profitability.

Fixing getVectorCallCost deserves another testcase.

Added a test case with call. But thinking about it again, it does not really test the issue. Not sure it is actually possible to test getVectorCallCost, as there are no vector call functions that take struct values?

Good point. Does the test case pass w/o this patch?

Seems like getVectorIntrinsicCost also requires fixing?

Yep, I'll update it in a second. In practice, I don't think there are any intrinsics that take struct types, but maybe in the future it might become a problem.
Would be good to hoist such invariant instructions
%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1
out of the loop before LV, or at least have LV recognize them as uniform.
Yep, I can look into that as a follow up. LICM should hoist those things, but I think in general we cannot guarantee that all loop-invariant instructions are hoisted out before LoopVectorize (the test case came from a fuzzer). Do you think we should try to hoist them in LV? I would assume a later run of LICM will hoist them.

The thought was to check LICM, and check LV's uniform analysis.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1335 ↗	(On Diff #198433)	This is admittedly part of the original comment, but while we're here - "cannot be scalarized" >> "is expected to be vectorized" - every V can be scalarized. Furthermore, if we're unsure whether V will be vectorized or not it's safer to assume that it will be scalarized rather than vectorized, as the latter might not be possible. So if Scalars.find(VF) isn't set yet, which will cause isScalarAfterVectorization() to assert, better have needsExtract() return false.
5696 ↗	(On Diff #198433)	Yes, there's a phase ordering issue here, and also one regarding isProfitableToScalarize() as mentioned above (which would be good to note somewhere). But the instruction(s) that may trigger an assert, being loads or stores, are the operands of I rather than I itself, right? In any case, suggest to have FilterOperands()/needsExtract() understand if they can use isScalarAfterVectorization() or else be conservative, as noted above. Perhaps make Filter[Extracting]Operands() a method, to be reused.

fhahn mentioned this in rL360758: [LV] Move getScalarizationOverhead and vector call cost computations to CM. May 15 2019, 3:05 AM

fhahn mentioned this in rG9e778e6c730a: [LV] Move getScalarizationOverhead and vector call cost computations to CM.

sidorovd mentioned this in rG2ffabae8b2f6: [LV] Move getScalarizationOverhead and vector call cost computations to CM. May 30 2019, 9:24 AM

sidorovd mentioned this in rG143c5a2e616d: [LV] Move getScalarizationOverhead and vector call cost computations to CM. May 30 2019, 10:24 AM

Addressed Ayal's comments. Thanks, and sorry for the long delay; it somehow dropped off my radar.

In D59995#1502129, @Ayal wrote:

Good point. Does the test case pass w/o this patch?

Yes, both tests fail without the patch.

The only thing that cannot be tested is the combination of intrinsic taking struct
values, but that should be fine.

Seems like getVectorIntrinsicCost also requires fixing?

Yep, I'll update it in a second. In practice, I don't think there are any intrinsics that take struct types, but maybe in the future it might become a problem.
Would be good to hoist such invariant instructions
%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1
out of the loop before LV, or at least have LV recognize them as uniform.
Yep, I can look into that as a follow up. LICM should hoist those things, but I think in general we cannot guarantee that all loop-invariant instructions are hoisted out before LoopVectorize (the test case came from a fuzzer). Do you think we should try to hoist them in LV? I would assume a later run of LICM will hoist them.
The thought was to check LICM, and check LV's uniform analysis.

I am not entirely sure I understand what you mean here. I checked LICM and it will
hoist the problematic case here, but I do not think we should rely on that, as
there might be other reasons why LICM decided against hoisting (or LICM is not run).

Currently they are not marked as uniform in LV, but I can fix that in a follow-up
patch.

Herald added a subscriber: rogfer01. Jul 5 2019, 8:54 AM

Harbormaster completed remote builds in B34438: Diff 208191. Jul 5 2019, 8:56 AM

Some format typos, and clarifying if needsExtract() should assume vectorized or scalarized before scalars are computed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1180 ↗	(On Diff #208191)	`// >> ///` on last line too ;-)
1182 ↗	(On Diff #208191)	indentation?
1343 ↗	(On Diff #208191)	Ahh, is it better to assume we can vectorize V, contrary to the above comment?
3151 ↗	(On Diff #208191)	overhed >> overhead.
4130 ↗	(On Diff #208191)	one line?
5669 ↗	(On Diff #208191)	two lines?
5686 ↗	(On Diff #208191)	Effectively initializing Ops to an empty range by default? This is an orthogonal refactoring; consider instead having another early-exit: if (isa<StoreInst>(I) && TTI.supportsEfficientVectorElementLoadStore()) return Cost; and initializing Ops to I->operands() by default.
6664 ↗	(On Diff #208191)	one line?

clang-format, small refactor

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗	(On Diff #208191)	Ahh, is it better to assume we can vectorize V, contrary to the above comment?

Ah, sorry for the change; I missed adding an explanation here. Assuming vectorization here seems to be in line with the old behavior, and switching it causes a bunch of unit-test failures. I agree it would be safer to assume scalarizing, but the current cost model seems to expect the optimistic choice here. It might be worth investigating flipping it as a follow-up, but I think it will require some additional work to make sure other places in the cost model are in sync with the new assumption?

Harbormaster completed remote builds in B34859: Diff 209476. Jul 12 2019, 6:32 AM

LGTM, with some additional thoughts provoked by this fix.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗	(On Diff #208191)	OK, fair enough. The current optimistic choice is ok, as long as V is of vectorizable type; and the latter seems to be the case given that setCostBasedWideningDecision() deals with loads and stores only, and LVLegality checks that their relevant operands (stored values) are of vectorizable types. Worth augmenting the explanation if agreed.

In any case, the optimistic choice can be confined to guard isScalarAfterVectorization() only, e.g.:

if (VF <= 1)
  return false;
Instruction *I = dyn_cast<Instruction>(V);
if (!I || !TheLoop->contains(I) || TheLoop->isLoopInvariant(I))
  return false;
return (Scalars.find(VF) == Scalars.end() || !isScalarAfterVectorization(I, VF));

On a related note, notice that if `%sv` were a <2 x i64> vector instead of a {i64, i64} struct, LVLegality would bail out given that we can't vectorize its associated `extractelement`:

// Also, we can't vectorize extractelement instructions.
if ((!VectorType::isValidElementType(I.getType()) &&
     !I.getType()->isVoidTy()) ||
    isa<ExtractElementInst>(I)) {
  reportVectorizationFailure("Found unvectorizable type",
      "instruction return type cannot be vectorized",
      "CantVectorizeInstructionReturnType", &I);
  return false;
}

but it does allow vectorizing (loops with scalarizing) `extractvalue`. So a crude way of "fixing" the testcase would be to treat extractvalue similar to extractelement... but it's clearly better to allow vectorizing such loops. OTOH, perhaps we could also allow vectorizing (loops with scalarizing) `extractelement`...? (Surely in a separate patch if so)
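The guarded check suggested in the inline comment above can be exercised in isolation with a small model. This is only a sketch under stated assumptions: ToyValue and ToyCostModel are hypothetical stand-ins, instructions are identified by integer ids, and "in the loop" is a plain flag rather than a query on a real Loop object.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical stand-in for an llvm::Value; not the real class.
struct ToyValue {
  int Id;
  bool IsInstruction; // false for arguments and constants
  bool InLoop;        // true if defined inside the loop being vectorized
};

struct ToyCostModel {
  // Per-VF set of instruction ids known to stay scalar. Filled late.
  std::map<unsigned, std::set<int>> Scalars;

  // Like the real query, this asserts if the scalars were not collected.
  bool isScalarAfterVectorization(const ToyValue &V, unsigned VF) const {
    auto It = Scalars.find(VF);
    assert(It != Scalars.end() && "Scalars not computed for this VF");
    return It->second.count(V.Id) != 0;
  }

  bool needsExtract(const ToyValue &V, unsigned VF) const {
    if (VF <= 1)
      return false;
    // Non-instructions and values defined outside the loop are never widened.
    if (!V.IsInstruction || !V.InLoop)
      return false;
    // Optimistic default: without scalar info, assume V is widened. The
    // find() test short-circuits, so the asserting query is never reached.
    return Scalars.find(VF) == Scalars.end() ||
           !isScalarAfterVectorization(V, VF);
  }
};
```

Placing the Scalars lookup last keeps the cheap, always-safe exits (VF of 1, non-instruction, outside the loop) independent of whether the scalar analysis has run yet.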

This revision is now accepted and ready to land. Jul 14 2019, 8:03 AM

fhahn marked an inline comment as done. Jul 14 2019, 11:34 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗	(On Diff #208191)	OK, fair enough. The current optimistic choice is ok, as long as V is of vectorizable type; and the latter seems to be the case given that setCostBasedWideningDecision() deals with loads and stores only, and LVLegality checks that their relevant operands (stored values) are of vectorizable types. Worth augmenting the explanation if agreed. In any case, the optimistic choice can be confined to guard isScalarAfterVectorization()

Thanks, I'll commit the change with the moved optimistic check as suggested and add a comment.

On a related note, notice that if %sv were a <2 x i64> vector instead of a {i64, i64} struct, LVLegality would bail out given that we can't vectorize its associated extractelement:

I'll look into that as a follow-up, same as marking the invariant instructions as uniform.

Closed by commit rL366030: [LV] Exclude loop-invariant inputs from scalar cost computation. (authored by fhahn). Jul 14 2019, 1:12 PM

This revision was automatically updated to reflect the committed changes.

fhahn marked an inline comment as done. Jul 14 2019, 1:13 PM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗	(On Diff #208191)	OK, fair enough. The current optimistic choice is ok, as long as V is of vectorizable type; and the latter seems to be the case given that setCostBasedWideningDecision() deals with loads and stores only, and LVLegality checks that their relevant operands (stored values) are of vectorizable types. Worth augmenting the explanation if agreed.

I think vectorization should be what is happening in most of those cases, but I think there could be scenarios where we would not vectorize the operands. I've adjusted the comment accordingly.

This triggers failed asserts in some cases, reproable with https://martin.st/temp/loadimage-preproc.cpp.xz, with clang++ -target i686-w64-mingw32 -c -O3 loadimage-preproc.cpp. Will post a proper bug report later today, unless someone else beats me to it.

In D59995#1584967, @mstorsjo wrote:

This triggers failed asserts in some cases, reproable with https://martin.st/temp/loadimage-preproc.cpp.xz, with clang++ -target i686-w64-mingw32 -c -O3 loadimage-preproc.cpp. Will post a proper bug report later today, unless someone else beats me to it.

Thanks for reporting the issue, should be fixed in rL366049.

@Ayal, the problem was that we are not computing the scalarized cost in getVectorIntrinsicCost and TTI::getIntrinsicInstrCost expects the full list of arguments, so we should not filter it. I'll check if there's a reason we do not consider the scalarized cost there.
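One way to picture "expects the full list of arguments" is a cost query that reads its arguments by position: if the caller drops entries, a positional read goes out of range. The sketch below is a toy and not the TTI implementation; toyIntrinsicCost, its three-argument layout, the "const" marker and the cost values are all invented for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical cost query that inspects its third argument by position.
unsigned toyIntrinsicCost(const std::vector<std::string> &Args, unsigned VF) {
  // at() throws std::out_of_range if the caller passed fewer than 3 args.
  const std::string &Third = Args.at(2);
  // Invented rule: a constant third argument halves the cost.
  return Third == "const" ? VF : 2 * VF;
}

// Returns true if the query rejects the given argument list.
bool queryFails(const std::vector<std::string> &Args, unsigned VF) {
  try {
    toyIntrinsicCost(Args, VF);
    return false;
  } catch (const std::out_of_range &) {
    return true;
  }
}
```

With the full list the query succeeds; with one operand filtered out it fails, which is the shape of problem the follow-up fix avoids by passing the unfiltered arguments.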

Ayal added inline comments. Jul 15 2019, 3:26 AM

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
1175	Thanks @fhahn, just pointing out the above documentation is also inaccurate, along with the associated cost.

In D59995#1585031, @fhahn wrote:

In D59995#1584967, @mstorsjo wrote:

This triggers failed asserts in some cases, reproable with https://martin.st/temp/loadimage-preproc.cpp.xz, with clang++ -target i686-w64-mingw32 -c -O3 loadimage-preproc.cpp. Will post a proper bug report later today, unless someone else beats me to it.

Thanks for reporting the issue, should be fixed in rL366049.

Thanks for the very quick fix - it seems to work fine now!

fhahn mentioned this in D65060: [LICM] Make Loop ICM profile aware. Jul 22 2019, 12:24 PM

fhahn mentioned this in D68831: [LV] Mark instructions with loop invariant arguments as uniform. (WIP). Oct 10 2019, 1:41 PM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	64 lines
llvm/trunk/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll	109 lines

Diff 209747

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ 1,166 lines not shown @@ public:
   bool foldTailByMasking() const { return FoldTailByMasking; }
 
   bool blockNeedsPredication(BasicBlock *BB) {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }
 
   /// Estimate cost of an intrinsic call instruction CI if it were vectorized
   /// with factor VF. Return the cost of the instruction, including
   /// scalarization overhead if it's needed.
   unsigned getVectorIntrinsicCost(CallInst *CI, unsigned VF);
 
   /// Estimate cost of a call instruction CI if it were vectorized with factor
   /// VF. Return the cost of the instruction, including scalarization overhead
   /// if it's needed. The flag NeedToScalarize shows if the call needs to be
   /// scalarized -
-  // i.e. either vector version isn't available, or is too expensive.
+  /// i.e. either vector version isn't available, or is too expensive.
   unsigned getVectorCallCost(CallInst *CI, unsigned VF, bool &NeedToScalarize);
 
 private:
   unsigned NumPredStores = 0;
 
   /// \return An upper bound for the vectorization factor, larger than zero.
   /// One is returned if vectorization should best be avoided due to cost.
   unsigned computeFeasibleMaxVF(bool OptForSize, unsigned ConstTripCount);
@@ 136 lines not shown @@ private:
 
   /// Keeps cost model vectorization decision and cost for instructions.
   /// Right now it is used for memory instructions only.
   using DecisionList = DenseMap<std::pair<Instruction *, unsigned>,
                                 std::pair<InstWidening, unsigned>>;
 
   DecisionList WideningDecisions;
 
+  /// Returns true if \p V is expected to be vectorized and it needs to be
+  /// extracted.
+  bool needsExtract(Value *V, unsigned VF) const {
+    Instruction *I = dyn_cast<Instruction>(V);
+    if (VF == 1 || !I || !TheLoop->contains(I) || TheLoop->isLoopInvariant(I))
+      return false;
+
+    // Assume we can vectorize V (and hence we need extraction) if the
+    // scalars are not computed yet. This can happen, because it is called
+    // via getScalarizationOverhead from setCostBasedWideningDecision, before
+    // the scalars are collected. That should be a safe assumption in most
+    // cases, because we check if the operands have vectorizable types
+    // beforehand in LoopVectorizationLegality.
+    return Scalars.find(VF) == Scalars.end() ||
+           !isScalarAfterVectorization(I, VF);
+  };
+
+  /// Returns a range containing only operands needing to be extracted.
+  SmallVector<Value *, 4> filterExtractingOperands(Instruction::op_range Ops,
+                                                   unsigned VF) {
+    return SmallVector<Value *, 4>(make_filter_range(
+        Ops, [this, VF](Value *V) { return this->needsExtract(V, VF); }));
+  }
+
 public:
   /// The loop that we evaluate.
   Loop *TheLoop;
 
   /// Predicated scalar evolution analysis.
   PredicatedScalarEvolution &PSE;
 
   /// Loop Info analysis.
@@ 1,777 lines not shown @@ unsigned LoopVectorizationCostModel::getVectorIntrinsicCost(CallInst *CI,
                                                        unsigned VF) {
   Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
   assert(ID && "Expected intrinsic call!");
 
   FastMathFlags FMF;
   if (auto *FPMO = dyn_cast<FPMathOperator>(CI))
     FMF = FPMO->getFastMathFlags();
 
-  SmallVector<Value *, 4> Operands(CI->arg_operands());
-  return TTI.getIntrinsicInstrCost(ID, CI->getType(), Operands, FMF, VF);
+  // Skip operands that do not require extraction/scalarization and do not incur
+  // any overhead.
+  return TTI.getIntrinsicInstrCost(
+      ID, CI->getType(), filterExtractingOperands(CI->arg_operands(), VF), FMF,
+      VF);
 }
 
 static Type *smallestIntegerVectorType(Type *T1, Type *T2) {
   auto *I1 = cast<IntegerType>(T1->getVectorElementType());
   auto *I2 = cast<IntegerType>(T2->getVectorElementType());
   return I1->getBitWidth() < I2->getBitWidth() ? T1 : T2;
 }
 static Type *largestIntegerVectorType(Type *T1, Type *T2) {
@@ 2,203 lines not shown @@ for (Use &U : I->operands())
       if (auto *J = dyn_cast<Instruction>(U.get()))
         if (isUniformAfterVectorization(J, VF))
           return false;
 
     // Otherwise, we can scalarize the instruction.
     return true;
   };
 
-  // Returns true if an operand that cannot be scalarized must be extracted
-  // from a vector. We will account for this scalarization overhead below. Note
-  // that the non-void predicated instructions are placed in their own blocks,
-  // and their return values are inserted into vectors. Thus, an extract would
-  // still be required.
-  auto needsExtract = [&](Instruction *I) -> bool {
-    return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
-  };
-
   // Compute the expected cost discount from scalarizing the entire expression
   // feeding the predicated instruction. We currently only consider expressions
   // that are single-use instruction chains.
   Worklist.push_back(PredInst);
   while (!Worklist.empty()) {
     Instruction *I = Worklist.pop_back_val();
 
     // If we've already analyzed the instruction, there's nothing to do.
@@ 23 lines not shown @@ while (!Worklist.empty()) {
     // be scalarized, add it to the worklist; otherwise, account for the
     // overhead.
     for (Use &U : I->operands())
       if (auto *J = dyn_cast<Instruction>(U.get())) {
         assert(VectorType::isValidElementType(J->getType()) &&
                "Instruction has non-scalar type");
         if (canBeScalarized(J))
           Worklist.push_back(J);
-        else if (needsExtract(J))
+        else if (needsExtract(J, VF))
           ScalarCost += TTI.getScalarizationOverhead(
               ToVectorTy(J->getType(),VF), false, true);
       }
 
     // Scale the total scalar cost by block probability.
     ScalarCost /= getReciprocalPredBlockProb();
 
     // Compute the discount. A non-negative discount means the vector version
@@ 273 lines not shown @@ unsigned LoopVectorizationCostModel::getScalarizationOverhead(Instruction *I,
   if (!RetTy->isVoidTy() &&
       (!isa<LoadInst>(I) || !TTI.supportsEfficientVectorElementLoadStore()))
     Cost += TTI.getScalarizationOverhead(RetTy, true, false);
 
   // Some targets keep addresses scalar.
   if (isa<LoadInst>(I) && !TTI.prefersVectorizedAddressing())
     return Cost;
 
-  if (CallInst *CI = dyn_cast<CallInst>(I)) {
-    SmallVector<const Value *, 4> Operands(CI->arg_operands());
-    Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
-  } else if (!isa<StoreInst>(I) ||
-             !TTI.supportsEfficientVectorElementLoadStore()) {
-    SmallVector<const Value *, 4> Operands(I->operand_values());
-    Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
-  }
+  // Some targets support efficient element stores.
+  if (isa<StoreInst>(I) && TTI.supportsEfficientVectorElementLoadStore())
+    return Cost;
 
-  return Cost;
+  // Collect operands to consider.
+  CallInst *CI = dyn_cast<CallInst>(I);
+  Instruction::op_range Ops = CI ? CI->arg_operands() : I->operands();
+
+  // Skip operands that do not require extraction/scalarization and do not incur
+  // any overhead.
+  return Cost + TTI.getOperandsScalarizationOverhead(
+                    filterExtractingOperands(Ops, VF), VF);
 }
 
 void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) {
   if (VF == 1)
     return;
   NumPredStores = 0;
   for (BasicBlock *BB : TheLoop->blocks()) {
     // For each instruction in the old loop.
@@ 1,973 lines not shown @@

llvm/trunk/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

				; REQUIRES: asserts

				; RUN: opt -loop-vectorize -mtriple=arm64-apple-ios %s -S -debug -disable-output 2>&1 | FileCheck --check-prefix=CM %s
				; RUN: opt -loop-vectorize -force-vector-width=2 -force-vector-interleave=1 %s -S | FileCheck --check-prefix=FORCED %s

				; Test case from PR41294.

				; Check scalar cost for extractvalue. The constant and loop invariant operands are free,
				; leaving cost 3 for scalarizing the result + 2 for executing the op with VF 2.

				; CM: LV: Scalar loop costs: 7.
				; CM: LV: Found an estimated cost of 5 for VF 2 For instruction: %a = extractvalue { i64, i64 } %sv, 0
				; CM-NEXT: LV: Found an estimated cost of 5 for VF 2 For instruction: %b = extractvalue { i64, i64 } %sv, 1

				; Check that the extractvalue operands are actually free in vector code.

				; FORCED-LABEL: vector.body: ; preds = %vector.body, %vector.ph
				; FORCED-NEXT: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				; FORCED-NEXT: %broadcast.splatinsert = insertelement <2 x i32> undef, i32 %index, i32 0
				; FORCED-NEXT: %broadcast.splat = shufflevector <2 x i32> %broadcast.splatinsert, <2 x i32> undef, <2 x i32> zeroinitializer
				; FORCED-NEXT: %induction = add <2 x i32> %broadcast.splat, <i32 0, i32 1>
				; FORCED-NEXT: %0 = add i32 %index, 0
				; FORCED-NEXT: %1 = extractvalue { i64, i64 } %sv, 0
				; FORCED-NEXT: %2 = extractvalue { i64, i64 } %sv, 0
				; FORCED-NEXT: %3 = insertelement <2 x i64> undef, i64 %1, i32 0
				; FORCED-NEXT: %4 = insertelement <2 x i64> %3, i64 %2, i32 1
				; FORCED-NEXT: %5 = extractvalue { i64, i64 } %sv, 1
				; FORCED-NEXT: %6 = extractvalue { i64, i64 } %sv, 1
				; FORCED-NEXT: %7 = insertelement <2 x i64> undef, i64 %5, i32 0
				; FORCED-NEXT: %8 = insertelement <2 x i64> %7, i64 %6, i32 1
				; FORCED-NEXT: %9 = getelementptr i64, i64* %dst, i32 %0
				; FORCED-NEXT: %10 = add <2 x i64> %4, %8
				; FORCED-NEXT: %11 = getelementptr i64, i64* %9, i32 0
				; FORCED-NEXT: %12 = bitcast i64* %11 to <2 x i64>*
				; FORCED-NEXT: store <2 x i64> %10, <2 x i64>* %12, align 4
				; FORCED-NEXT: %index.next = add i32 %index, 2
				; FORCED-NEXT: %13 = icmp eq i32 %index.next, 0
				; FORCED-NEXT: br i1 %13, label %middle.block, label %vector.body, !llvm.loop !0

				define void @test1(i64* %dst, {i64, i64} %sv) {
				entry:
				br label %loop.body

				loop.body:
				%iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.body ]
				%a = extractvalue { i64, i64 } %sv, 0
				%b = extractvalue { i64, i64 } %sv, 1
				%addr = getelementptr i64, i64* %dst, i32 %iv
				%add = add i64 %a, %b
				store i64 %add, i64* %addr
				%iv.next = add nsw i32 %iv, 1
				%cond = icmp ne i32 %iv.next, 0
				br i1 %cond, label %loop.body, label %exit

				exit:
				ret void
				}


				; Similar to the test case above, but checks getVectorCallCost as well.
				declare float @pow(float, float) readnone nounwind

				; CM: LV: Scalar loop costs: 16.
				; CM: LV: Found an estimated cost of 5 for VF 2 For instruction: %a = extractvalue { float, float } %sv, 0
				; CM-NEXT: LV: Found an estimated cost of 5 for VF 2 For instruction: %b = extractvalue { float, float } %sv, 1

				; FORCED-LABEL: define void @test_getVectorCallCost

				; FORCED-LABEL: vector.body: ; preds = %vector.body, %vector.ph
				; FORCED-NEXT: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				; FORCED-NEXT: %broadcast.splatinsert = insertelement <2 x i32> undef, i32 %index, i32 0
				; FORCED-NEXT: %broadcast.splat = shufflevector <2 x i32> %broadcast.splatinsert, <2 x i32> undef, <2 x i32> zeroinitializer
				; FORCED-NEXT: %induction = add <2 x i32> %broadcast.splat, <i32 0, i32 1>
				; FORCED-NEXT: %0 = add i32 %index, 0
				; FORCED-NEXT: %1 = extractvalue { float, float } %sv, 0
				; FORCED-NEXT: %2 = extractvalue { float, float } %sv, 0
				; FORCED-NEXT: %3 = insertelement <2 x float> undef, float %1, i32 0
				; FORCED-NEXT: %4 = insertelement <2 x float> %3, float %2, i32 1
				; FORCED-NEXT: %5 = extractvalue { float, float } %sv, 1
				; FORCED-NEXT: %6 = extractvalue { float, float } %sv, 1
				; FORCED-NEXT: %7 = insertelement <2 x float> undef, float %5, i32 0
				; FORCED-NEXT: %8 = insertelement <2 x float> %7, float %6, i32 1
				; FORCED-NEXT: %9 = getelementptr float, float* %dst, i32 %0
				; FORCED-NEXT: %10 = call <2 x float> @llvm.pow.v2f32(<2 x float> %4, <2 x float> %8)
				; FORCED-NEXT: %11 = getelementptr float, float* %9, i32 0
				; FORCED-NEXT: %12 = bitcast float* %11 to <2 x float>*
				; FORCED-NEXT: store <2 x float> %10, <2 x float>* %12, align 4
				; FORCED-NEXT: %index.next = add i32 %index, 2
				; FORCED-NEXT: %13 = icmp eq i32 %index.next, 0
				; FORCED-NEXT: br i1 %13, label %middle.block, label %vector.body, !llvm.loop !4

				define void @test_getVectorCallCost(float* %dst, {float, float} %sv) {
				entry:
				br label %loop.body

				loop.body:
				%iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.body ]
				%a = extractvalue { float, float } %sv, 0
				%b = extractvalue { float, float } %sv, 1
				%addr = getelementptr float, float* %dst, i32 %iv
				%p = call float @pow(float %a, float %b)
				store float %p, float* %addr
				%iv.next = add nsw i32 %iv, 1
				%cond = icmp ne i32 %iv.next, 0
				br i1 %cond, label %loop.body, label %exit

				exit:
				ret void
				}
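
The cost comment at the top of the first test case ("cost 3 for scalarizing the result + 2 for executing the op with VF 2") can be read as a simple sum. The sketch below only restates that arithmetic; ScalarizedCost and totalCost are hypothetical names, and the per-copy scalar cost of 1 is an assumption inferred from "2 for executing the op with VF 2", not a value taken from the AArch64 cost tables.

```cpp
#include <cassert>

// Hypothetical breakdown of the cost of a scalarized (replicated) instruction.
struct ScalarizedCost {
  unsigned ResultInsertOverhead;   // packing the scalar results into a vector
  unsigned OperandExtractOverhead; // extracting lanes of widened operands
  unsigned ScalarOpCost;           // one scalar copy of the instruction
};

// Total = result overhead + operand overhead + one scalar copy per lane.
unsigned totalCost(const ScalarizedCost &C, unsigned VF) {
  return C.ResultInsertOverhead + C.OperandExtractOverhead +
         VF * C.ScalarOpCost;
}
```

With a result overhead of 3, an operand overhead of 0 (the constant index and the loop-invariant %sv are free) and VF 2, this gives the 5 that the CM check lines expect.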