This is an archive of the discontinued LLVM Phabricator instance.

[LV] Exclude loop-invariant inputs from scalar cost computation.
ClosedPublic

Authored by fhahn on Mar 29 2019, 8:03 AM.

Details

Summary

Loop-invariant operands do not need to be scalarized, because we can use the
values computed outside the loop directly. We should ignore them when
computing the scalarization overhead.

Fixes PR41294

Diff Detail

Repository
rL LLVM

Event Timeline

fhahn created this revision.Mar 29 2019, 8:03 AM
Ayal added a subscriber: Ayal.Apr 5 2019, 1:03 AM

The culprit here is the assumption made by TTI.getOperandsScalarizationOverhead(Operands, VF) that all its Operands will be vectorized according to VF, and would thus require extraction to feed a scalarized/replicated user. But any such Operand might not get vectorized, and possibly must not get vectorized, e.g., due to an incompatible type as demonstrated by PR41294 and the testcase. In some cases an Operand will obviously not be vectorized, such as if it's loop-invariant or live-in. More generally, LV uses the following:

auto needsExtract = [&](Instruction *I) -> bool {
  return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};

which would require passing not only TheLoop into getScalarizationOverhead(I, VF, TTI) but also the CM --- better turn this static function into a method of CM?
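
For illustration, a minimal sketch of what such a cost-model method might look like, combining the quoted lambda with the loop-invariant check from the patch summary (the name needsExtract and its placement in LoopVectorizationCostModel are taken from this discussion; the body is illustrative, not the committed code):

// Return true if feeding a scalarized/replicated user with V would require
// extracting V from a vector. Values defined outside the loop (or
// loop-invariant ones) are used directly, so no extract is needed.
bool LoopVectorizationCostModel::needsExtract(Value *V, unsigned VF) const {
  Instruction *I = dyn_cast<Instruction>(V);
  if (VF == 1 || !I || !TheLoop->contains(I) || TheLoop->isLoopInvariant(I))
    return false;
  // Operands that stay scalar after vectorization are available as scalars
  // and need no extract either.
  return !isScalarAfterVectorization(I, VF);
}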

Note that there's also CM.isProfitableToScalarize(I, VF), but it relies on having computed costs, so it is difficult to use when computing the cost (of a User). Skipping it would only affect the accuracy of the resulting cost, for Operands that can be vectorized but will not be due to profitability.

Fixing getVectorCallCost deserves another testcase.
Seems like getVectorIntrinsicCost also requires fixing?

Would be good to hoist such invariant instructions

%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1

out of the loop before LV, or at least have LV recognize them as uniform.

llvm/test/Transforms/LoopVectorize/AArch64/instructions-with-struct-ops.ll
3 ↗(On Diff #192826)

Record PR41294 in comment, file name, or testcase function name.

4 ↗(On Diff #192826)

for extractelement >> for extractvalue

10 ↗(On Diff #192826)

the extractelement >> the extractvalue

fhahn planned changes to this revision.Apr 10 2019, 1:03 PM

Thanks for your comments Ayal! I'll update the patch in a bit.

fhahn updated this revision to Diff 198430.May 7 2019, 3:54 AM

Split out moving various functions to LoopVectorizationCostModel to D61638.

Hoisted needsExtract, which checks whether we need to extract/scalarize an operand,
to LoopVectorizationCostModel, and extended it to check for loop-invariant operands.

fhahn marked 3 inline comments as done.May 7 2019, 4:07 AM

The culprit here is the assumption made by TTI.getOperandsScalarizationOverhead(Operands, VF) that all its Operands will be vectorized according to VF, and would thus require extraction to feed a scalarized/replicated user. But any such Operand might not get vectorized, and possibly must not get vectorized, e.g., due to an incompatible type as demonstrated by PR41294 and the testcase. In some cases an Operand will obviously not be vectorized, such as if it's loop-invariant or live-in. More generally, LV uses the following:

auto needsExtract = [&](Instruction *I) -> bool {
  return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};

which would require passing not only TheLoop into getScalarizationOverhead(I, VF, TTI) but also the CM --- better turn this static function into a method of CM?

Done.

Note that there's also CM.isProfitableToScalarize(I, VF), but it relies on having computed costs, so it is difficult to use when computing the cost (of a User). Skipping it would only affect the accuracy of the resulting cost, for Operands that can be vectorized but will not be due to profitability.

Fixing getVectorCallCost deserves another testcase.

Added a test case with a call. But thinking about it again, it does not really test the issue. Not sure it is actually possible to test getVectorCallCost, as there are no vector call functions that take struct values?

Seems like getVectorIntrinsicCost also requires fixing?

Yep, I'll update it in a second. In practice, I don't think there are any intrinsics that take struct types, but it might become a problem in the future.

Would be good to hoist such invariant instructions

%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1

out of the loop before LV, or at least have LV recognize them as uniform.

Yep, I can look into that as a follow up. LICM should hoist those things, but I think in general we cannot guarantee that all loop-invariant instructions are hoisted out before LoopVectorize (the test case came from a fuzzer). Do you think we should try to hoist them in LV? I would assume a later run of LICM will hoist them.

fhahn updated this revision to Diff 198433.May 7 2019, 4:10 AM

Filter operands for getVectorIntrinsicCost as well.

fhahn added a reviewer: Ayal.May 14 2019, 12:32 PM

Ping. Ayal, does this address your comments appropriately?

Ayal added a comment.May 14 2019, 3:06 PM

The culprit here is the assumption made by TTI.getOperandsScalarizationOverhead(Operands, VF) that all its Operands will be vectorized according to VF, and would thus require extraction to feed a scalarized/replicated user. But any such Operand might not get vectorized, and possibly must not get vectorized, e.g., due to an incompatible type as demonstrated by PR41294 and the testcase. In some cases an Operand will obviously not be vectorized, such as if it's loop-invariant or live-in. More generally, LV uses the following:

auto needsExtract = [&](Instruction *I) -> bool {
  return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};

which would require passing not only TheLoop into getScalarizationOverhead(I, VF, TTI) but also the CM --- better turn this static function into a method of CM?

Done.

Very good, thanks for the refactoring.

Note that there's also CM.isProfitableToScalarize(I, VF), but it relies on having computed costs, so it is difficult to use when computing the cost (of a User). Skipping it would only affect the accuracy of the resulting cost, for Operands that can be vectorized but will not be due to profitability.

Fixing getVectorCallCost deserves another testcase.

Added a test case with a call. But thinking about it again, it does not really test the issue. Not sure it is actually possible to test getVectorCallCost, as there are no vector call functions that take struct values?

Good point. Does the test case pass w/o this patch?

Seems like getVectorIntrinsicCost also requires fixing?

Yep, I'll update it in a second. In practice, I don't think there are any intrinsics that take struct types, but it might become a problem in the future.

Would be good to hoist such invariant instructions

%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1

out of the loop before LV, or at least have LV recognize them as uniform.

Yep, I can look into that as a follow up. LICM should hoist those things, but I think in general we cannot guarantee that all loop-invariant instructions are hoisted out before LoopVectorize (the test case came from a fuzzer). Do you think we should try to hoist them in LV? I would assume a later run of LICM will hoist them.

The thought was to check LICM, and check LV's uniform analysis.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1335 ↗(On Diff #198433)

This is admittedly part of the original comment, but while we're here - "cannot be scalarized" >> "is expected to be vectorized" - every V can be scalarized. Furthermore, if we're unsure whether V will be vectorized or not it's safer to assume that it will be scalarized rather than vectorized, as the latter might not be possible. So if Scalars.find(VF) isn't set yet, which will cause isScalarAfterVectorization() to assert, better have needsExtract() return false.
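
A small sketch of the conservative guard being suggested here, assuming needsExtract ends up as a cost-model member (illustrative only; later comments in this review revisit whether to assume vectorized or scalarized):

// Before the per-VF Scalars sets have been computed we cannot call
// isScalarAfterVectorization() without asserting; conservatively assume the
// operand will be scalarized, i.e. no extract is needed.
if (Scalars.find(VF) == Scalars.end())
  return false;
return !isScalarAfterVectorization(I, VF);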

5696 ↗(On Diff #198433)

Yes, there's a phase ordering issue here, and also one regarding isProfitableToScalarize() as mentioned above (which would be good to note somewhere). But the instruction(s) that may trigger an assert, being loads or stores, are the operands of I rather than I itself, right? In any case, I suggest having FilterOperands()/needsExtract() check whether they can use isScalarAfterVectorization(), or else be conservative, as noted above. Perhaps make Filter[Extracting]Operands() a method, to be reused.
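
As a rough illustration of the suggested helper, assuming a name like filterExtractingOperands and that needsExtract is a cost-model member (a sketch, not the exact committed code):

// Keep only the operands that would actually need an extract, so the TTI
// overhead query does not account for values that are never vectorized.
SmallVector<const Value *, 4>
LoopVectorizationCostModel::filterExtractingOperands(Instruction::op_range Ops,
                                                     unsigned VF) {
  return SmallVector<const Value *, 4>(make_filter_range(
      Ops, [this, VF](Value *V) { return this->needsExtract(V, VF); }));
}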

fhahn updated this revision to Diff 208191.Jul 5 2019, 8:54 AM
fhahn marked an inline comment as done.

Addressed Ayal's comments. Thanks, and sorry for the long delay; it somehow dropped off my radar.

Good point. Does the test case pass w/o this patch?

Both tests fail without the patch.

The only thing that cannot be tested is the combination with an intrinsic taking struct
values, but that should be fine.

Seems like getVectorIntrinsicCost also requires fixing?

Yep, I'll update it in a second. In practice, I don't think there are any intrinsics that take struct types, but it might become a problem in the future.

Would be good to hoist such invariant instructions

%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1

out of the loop before LV, or at least have LV recognize them as uniform.

Yep, I can look into that as a follow up. LICM should hoist those things, but I think in general we cannot guarantee that all loop-invariant instructions are hoisted out before LoopVectorize (the test case came from a fuzzer). Do you think we should try to hoist them in LV? I would assume a later run of LICM will hoist them.

The thought was to check LICM, and check LV's uniform analysis.

I am not entirely sure I understand what you mean here. I checked LICM and it does
hoist the problematic case here, but I do not think we should rely on that, as
there might be other reasons why LICM decides against hoisting (or LICM might not be run at all).

Currently they are not marked as uniform in LV, but I can fix that in a follow-up
patch.

Ayal added a comment.Jul 12 2019, 4:27 AM

Some formatting typos, and clarifying whether needsExtract() should assume vectorized or scalarized before scalars are computed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1180 ↗(On Diff #208191)

// >> /// on last line too ;-)

1182 ↗(On Diff #208191)

indentation?

1343 ↗(On Diff #208191)

Ahh, is it better to assume we can vectorize V, contrary to the above comment?

3151 ↗(On Diff #208191)

overhed >> overhead.

4130 ↗(On Diff #208191)

one line?

5669 ↗(On Diff #208191)

two lines?

5686 ↗(On Diff #208191)

Effectively initializing Ops to an empty range by default? This is an orthogonal refactoring; consider instead having another early-exit:

if (isa<StoreInst>(I) && TTI.supportsEfficientVectorElementLoadStore())
  return Cost;

and initializing Ops to I->operands() by default.
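
A sketch of how the suggested restructuring could look (the cost computation for I's own result is elided and assumed unchanged; filterExtractingOperands is the helper discussed above):

// Cost so far covers insert/extract for I's own result. Stores on targets
// with efficient vector element load/store need no operand extraction.
if (isa<StoreInst>(I) && TTI.supportsEfficientVectorElementLoadStore())
  return Cost;

// Otherwise default Ops to I's operands (or the call arguments) and charge
// extraction only for operands that will actually be vectorized.
CallInst *CI = dyn_cast<CallInst>(I);
Instruction::op_range Ops = CI ? CI->arg_operands() : I->operands();
return Cost + TTI.getOperandsScalarizationOverhead(
                  filterExtractingOperands(Ops, VF), VF);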

6664 ↗(On Diff #208191)

one line?

fhahn updated this revision to Diff 209476.Jul 12 2019, 6:30 AM
fhahn marked 7 inline comments as done.

clang-format, small refactor

fhahn added a comment.Jul 12 2019, 6:31 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗(On Diff #208191)

Ahh, is it better to assume we can vectorize V, contrary to the above comment?

Ah, sorry for the change; I missed adding an explanation here. Assuming vectorization here seems to be in line with the old behavior, and switching it causes a bunch of unit-test failures. I agree it would be safer to assume scalarizing, but the current cost model seems to expect the optimistic choice here.

It might be worth investigating flipping it as a follow-up, but I think it will require some additional work to make sure other places in the cost model are in sync with the new assumption.

Ayal accepted this revision.Jul 14 2019, 8:03 AM

LGTM, with some additional thoughts provoked by this fix.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗(On Diff #208191)

OK, fair enough. The current optimistic choice is ok, as long as V is of vectorizable type; and the latter seems to be the case given that setCostBasedWidenDecision() deals with loads and stores only, and LVLegality checks that their relevant operands (stored values) are of vectorizable types. Worth augmenting the explanation if agreed.

In any case, the optimistic choice can be confined to guard isScalarAfterVectorization() only, e.g.,:

if (VF <= 1)
  return false;
Instruction *I = dyn_cast<Instruction>(V);
if (!I || !TheLoop->contains(I) || TheLoop->isLoopInvariant(I))
  return false;
return (Scalars.find(VF) == Scalars.end() || !isScalarAfterVectorization(I, VF));

On a related note, notice that if %sv were a <2 x i64> vector instead of a {i64, i64} struct, LVLegality would bail out given that we can't vectorize its associated extractelement:

// Also, we can't vectorize extractelement instructions.
if ((!VectorType::isValidElementType(I.getType()) &&
     !I.getType()->isVoidTy()) ||
    isa<ExtractElementInst>(I)) {
  reportVectorizationFailure("Found unvectorizable type",
      "instruction return type cannot be vectorized",
      "CantVectorizeInstructionReturnType", &I);
  return false;
}

but it does allow vectorizing (loops with scalarizing) extractvalue. So a crude way of "fixing" the testcase would be to treat extractvalue similarly to extractelement... but it's clearly better to allow vectorizing such loops. OTOH, perhaps we could also allow vectorizing (loops with scalarizing) extractelement...? (Surely in a separate patch if so)

This revision is now accepted and ready to land.Jul 14 2019, 8:03 AM
fhahn marked an inline comment as done.Jul 14 2019, 11:34 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗(On Diff #208191)

OK, fair enough. The current optimistic choice is ok, as long as V is of vectorizable type; and the latter seems to be the case given that setCostBasedWidenDecision() deals with loads and stores only, and LVLegality checks that their relevant operands (stored values) are of vectorizable types. Worth augmenting the explanation if agreed.

In any case, the optimistic choice can be confined to guard isScalarAfterVectorization()

Thanks, I'll commit the change with the moved optimistic check as suggested and add a comment.

On a related note, notice that if %sv were a <2 x i64> vector instead of a {i64, i64} struct, LVLegality would bail out given that we can't vectorize its associated extractelement:

I'll look into that as a follow up, same as marking the invariant instructions as uniform.

This revision was automatically updated to reflect the committed changes.
fhahn marked an inline comment as done.Jul 14 2019, 1:13 PM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1343 ↗(On Diff #208191)

OK, fair enough. The current optimistic choice is ok, as long as V is of vectorizable type; and the latter seems to be the case given that setCostBasedWidenDecision() deals with loads and stores only, and LVLegality checks that their relevant operands (stored values) are of vectorizable types. Worth augmenting the explanation if agreed.

I think vectorization is what happens in most of those cases, but there could be scenarios where we would not vectorize the operands. I've adjusted the comment accordingly.

This triggers failed asserts in some cases, reproducible with https://martin.st/temp/loadimage-preproc.cpp.xz, with clang++ -target i686-w64-mingw32 -c -O3 loadimage-preproc.cpp. Will post a proper bug report later today, unless someone else beats me to it.

fhahn added a comment.Jul 15 2019, 1:51 AM

This triggers failed asserts in some cases, reproducible with https://martin.st/temp/loadimage-preproc.cpp.xz, with clang++ -target i686-w64-mingw32 -c -O3 loadimage-preproc.cpp. Will post a proper bug report later today, unless someone else beats me to it.

Thanks for reporting the issue, should be fixed in rL366049.

@Ayal, the problem was that we are not computing the scalarized cost in getVectorIntrinsicCost, and TTI::getIntrinsicInstrCost expects the full list of arguments, so we should not filter it. I'll check if there's a reason we do not consider the scalarized cost there.
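
Roughly, the fix amounts to going back to passing the unfiltered argument list to the TTI hook; ID and FMF stand for the intrinsic ID and fast-math flags already computed in that function (a sketch of the idea, not the exact rL366049 change):

// In getVectorIntrinsicCost: TTI::getIntrinsicInstrCost expects the full
// argument list of the call, so do not run it through the operand filter.
SmallVector<Value *, 4> Operands(CI->arg_operands());
return TTI.getIntrinsicInstrCost(ID, CI->getType(), Operands, FMF, VF);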

Ayal added inline comments.Jul 15 2019, 3:26 AM
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
1175

Thanks @fhahn, just pointing out that the above documentation is also inaccurate, along with the associated cost.

This triggers failed asserts in some cases, reproducible with https://martin.st/temp/loadimage-preproc.cpp.xz, with clang++ -target i686-w64-mingw32 -c -O3 loadimage-preproc.cpp. Will post a proper bug report later today, unless someone else beats me to it.

Thanks for reporting the issue, should be fixed in rL366049.

Thanks for the very quick fix - it seems to work fine now!