This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions with no loads/stores
ClosedPublic

Authored by RosieSumpter on Nov 16 2021, 1:49 AM.

Details

Summary

For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,

float s = 0;
for (int i = 0; i < N; ++i)
  s += (float) i*i;

which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.

In this patch, for the case where ElementTypesInLoop is emptry, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.

Diff Detail

Event Timeline

RosieSumpter created this revision.Nov 16 2021, 1:49 AM
RosieSumpter requested review of this revision.Nov 16 2021, 1:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2021, 1:49 AM
RosieSumpter edited the summary of this revision. (Show Details)Nov 16 2021, 1:51 AM
lebedev.ri added inline comments.
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48

Where is ElementTypesInLoop populated?
LoopVectorizationCostModel::collectElementTypesForWidening() suggests that PHI nodes are also queried.

sdesmalen added inline comments.
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48

The function collectElementTypesForWidening() collects types for the following reasons:

  • Types of loads/stores, because these are used in computing a safe dependence distance (and also to have a sensible/natural VF as maximum VF).
  • Types of unordered reductions, in order to avoid generating vector PHI nodes that span multiple registers when the VF is too wide. In-order reductions are not considered, because those PHI nodes remain scalar after vectorization.

If we consider *more* types in the loop in collectElementTypesForWidening(), e.g. for in-order reductions or extends, then this leads to regressions; any type larger than the maximum loaded/stored type will limit the maximum VF. If the maximum VF is limited, then any VF upto the wider maximum will not be considered by the cost-model. So in general, it's better not to limit the VF for those reasons and leave it up to the cost-model to choose the best VF.

In the test-case that @RosieSumpter added, there are no loads/stores and there is no vector PHI node because the reduction is ordered, so it considers any VF up-to the widest-possible VF based on an i8 element type (even if the loop operates on a larger size). The cost-model then only considers the throughput cost, and the per-lane cost is no different between VF=4 and VF=16, so it favours VF=16. It does not consider the additional code-size cost when the operation is scalarized.

I guess the alternative choices are to:

  • Look through the operands of the ordered reduction operation, as well as any casts/extends on those operands, and only consider these element types in collectElementTypesForWidening() if there are no loads/stores in the loop. This heuristic gets quite complicated I think.
  • Consider the code-size cost for in-order reductions, so that it favours shorter VFs (I'm not sure if this is desirable, as it may regress other cases where code-size isn't relevant).

Because the case is so niche (i.e. a loop with *only* in-order reductions), @RosieSumpter thought it made sense to fall back to a more sensible default instead.

I think there is no single right choice, i'm not sure why 32 is more right than 8 or 64 here, for everything?
I'd think this should be in costmodel somewhere, not hardcoded.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6320

I suspect this should be querying datalayout for the minimal integer size?

dmgreen added inline comments.Nov 16 2021, 3:00 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6353

This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v8i32)) in a single operation. That might not be as useful for float types - and the example loop you showed with no loads/stores in the loop is much less likely for integer - it will already have been simplified.

Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?

sdesmalen added inline comments.Nov 17 2021, 12:44 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6353

This deliberately excludes InLoopReductions from the maximum width of the register - because the phi remains scalar and it's useful for integer reductions under MVE where they can sum a vecreduce(v16i32 sext(v8i32)) in a single operation.

Yes, the same is true for SVE. There is also code in the cost-model to recognise the extend to the wider type.

I think the actual reason for the reduction-code being here is to avoid ending up with vector PHIs that are too wide (out-of-loop reduction). The checks for in-loop reductions were added later because (1) there is no vector PHI and (2) that it doesn't limit the VF too early so that it lets the cost-model code consider the wider VFs for the reason you described.

Perhaps we just remove this for ordered (float) reductions? Or does that lead to regressions?

I don't think we should add specific knowledge to limit the VF for fp-reductions here, because that means adding target-specific knowledge to the collectElementTypesForWidening, which is something that the cost-model should decide on. Also it would lead to regressions.

dmgreen added inline comments.Nov 17 2021, 8:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6353

Yeah, we added it in https://reviews.llvm.org/D93476. I would have no objections to removing this for ordered reductions or float types I don't think. There is no such thing (as far as I understand) as an extending in-order float reduction, and we can rely on the UF for things that will become wider-than-legal vectors.

Matt added a subscriber: Matt.Nov 17 2021, 12:05 PM
RosieSumpter edited the summary of this revision. (Show Details)
  • If there are no element types, only set the max width to 32 if there are no legal int sizes set, otherwise set it to the smallest legal int width.
  • update X86/funclet.ll test.
sdesmalen added inline comments.Nov 22 2021, 1:41 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6327

I think the MaxWidth should remain 8 if DL returns no smallest legal integer type? (or if there's no specific datalayout defined)

RosieSumpter edited the summary of this revision. (Show Details)
  • Don't set max width to 32 if there are no legal int widths set (leave as 8)
RosieSumpter marked an inline comment as done.Nov 22 2021, 1:55 AM
  • Remove changes to LoopVectorize/pr32859.ll and LoopVectorize/pr36983.ll tests
fhahn added inline comments.Nov 22 2021, 2:40 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6353

Would it be possible to use TTI to pick the min/max type width to use there for the reduction, which would encode the target specific knowledge?

llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48

I think picking the smallest integer type as max width comes with its own problems unfortunately. By choosing 32 bit as max width on AArch64, won't we pessimize codegen for loops with fp16 in loop reductions (by choosing VF 4 instead of VF 8) for example?

It is hard to say what if there are similar impacts on other targets.

RosieSumpter edited the summary of this revision. (Show Details)
  • Change name of collectCastsToIgnore to collectCastInstrs and make it a member function of RecurrenceDescriptor.
  • Modify collectCastInstrs to also collect cast instructions where the destination type is the same as the recurrence type.
  • Call collectCastInstrs from LoopVectorizationCostModel::getSmallestAndWidestType in the case of in-loop reductions with no loads/stores to check through casts on recurrence operands when determining the max width.

As @fhahn pointed out, using the smallest legal integer type for the default max width negatively impacts performance if the smallest type used in the loop is smaller than the smallest legal integer type. Instead, this update iterates through the recurrences in the loop and sets the maximum width to be that of the smallest type used by the recurrences when ElementTypesInLoop is empty. To determine the smallest type used by recurrences, we need to check for any casts on the recurrences’ input operands, which are now found by collectCastInstrs. This means that the max VF isn’t restricted too much in cases where in-loop reductions use types with width smaller than the target's smallest legal int.

sdesmalen added inline comments.Dec 16 2021, 5:05 AM
llvm/include/llvm/Analysis/IVDescriptors.h
279 ↗(On Diff #393814)

Passing IgnoreCasts to collectCastInstrs seems a bit contradictory :)

Would it make more sense to pass two SmallPtrSet's, one named CastsToRecurrenceType (for case RecurrenceType == Cast->getDstTy()) and the other CastsFromRecurrenceType (for case RecurrenceType == Cast->getSrcTy()) and making these two sets available in RecurrenceDescriptor under their respective names. Then either set can be queried depending on what information is needed.

  • Split the cast instruction SmallPtrSet into 2 separate ones; CastsToRecurrenceType and CastsFromRecurrenceType
  • removed bool IgnoreCasts parameter in collectCastInstrs
  • call collectCastInsts for all recurrence descriptors which get saved at the end of AddReductionVar
  • query CastsToRecurrenceType when checking casts used by recurrence descriptors in getSmallestAndWidestTypes

Thanks for the latest update @RosieSumpter. The patch seems to improve the case based on the types used in the reductions now, so that seems like an improvement. Just left a few final nits.

llvm/lib/Analysis/IVDescriptors.cpp
481 ↗(On Diff #394884)

nit: move to line 274 to be together with CastsFromRecurTy ?

518 ↗(On Diff #394884)

nit: narrower or wider

522 ↗(On Diff #394884)

nit: This is unrelated to your patch, but to me it looks like this mechanism is implemented the wrong way around. I would have expected it to unconditionally capture all cast instructions (to/from recurrence type), and in LoopVectorize to call computeRecurrenceType, and make decisions to ignore cast instructions based on the narrower recurrence type.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6348–6349

nit: this can be RdxDesc.getCurrenceType()->getScalarSizeInBits()

6355–6356

nit: same here: cast<CastInst>(I)->getSrcTy()->getScalarSizeInBits()

RosieSumpter marked 4 inline comments as done.
  • Addressed nits
This revision is now accepted and ready to land.Dec 16 2021, 9:36 AM
fhahn added inline comments.Dec 16 2021, 10:14 AM
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48

I might have missed it, but did you add a test case for the scenario I mentioned above? It would be good to have a bit wider test coverage.

fhahn added inline comments.Dec 16 2021, 10:17 AM
llvm/test/Transforms/LoopVectorize/AArch64/smallest-and-widest-types.ll
47–48

Oh I see the loop here now operates on i16. I think we should have at least one or two other width combinations.

fhahn added inline comments.Dec 17 2021, 1:39 AM
llvm/include/llvm/Analysis/IVDescriptors.h
260 ↗(On Diff #394907)

this is only used to compute the minimum width, right? It seems like it would be simpler if we would compute the minimum width directly and store that, rather than a set of casts?

  • Replaced SmallPtrSet CastInstsToRecurrenceType with unsigned MinWidthCastToRecurrenceType (there could possibly be a shorter name for this!)
  • Added 2 test cases to smallest-and-widest-types.ll
RosieSumpter marked an inline comment as done.Dec 17 2021, 6:46 AM
RosieSumpter added inline comments.
llvm/include/llvm/Analysis/IVDescriptors.h
260 ↗(On Diff #394907)

Hi @fhahn, thanks for the suggestion - that does seem to make more sense. Let me know if what I've done is what you had in mind.

fhahn accepted this revision.Dec 20 2021, 2:21 AM

LGTM, thanks!

llvm/include/llvm/Analysis/IVDescriptors.h
257 ↗(On Diff #395119)

it would be good to add InBits or something to make clear the value is in bits.

This revision was landed with ongoing or failed builds.Jan 4 2022, 2:26 AM
This revision was automatically updated to reflect the committed changes.
RosieSumpter marked an inline comment as done.Jan 4 2022, 2:28 AM
RosieSumpter added inline comments.
llvm/include/llvm/Analysis/IVDescriptors.h
257 ↗(On Diff #395119)

I made this change before committing. Thanks for your help with this patch :)