This is an archive of the discontinued LLVM Phabricator instance.

[LV] Account for tripcount when calculation vectorization profitability
ClosedPublic

Authored by dmgreen on May 2 2021, 1:10 PM.

Details

Summary

The loop vectorizer will currently assume a large trip count when calculating which of several vectorization factors are more profitable. That is often not a terrible assumption to make as small trip count loops will usually have been fully unrolled. There are cases however where we will try to vectorize them, and especially when folding the tail by masking can incorrectly choose to vectorize loops that are not beneficial, due to the folded tail rounding the iteration count up for the vectorized loop.

The motivating example here has a trip count of 5, so either performs 5 scalar iterations or 2 vector iterations (with VF=4). At a high enough trip count the vectorization becomes profitable, but the rounding up to 2 vector iterations vs only 5 scalar makes it unprofitable.

This adds an alternative cost calculation when we know the max trip count and are folding tail by masking, rounding the iteration count up to the correct number for the vector width. We still do not account for anything like setup cost or the mixture of vector and scalar loops, but this is at least an improvement in a few cases that we have had reported.

Diff Detail

Event Timeline

dmgreen created this revision.May 2 2021, 1:10 PM
dmgreen requested review of this revision.May 2 2021, 1:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 2 2021, 1:10 PM
nikic added a subscriber: nikic.May 2 2021, 1:20 PM
nikic added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974

There is divideCeil() in MathExtras.

bmahjour added inline comments.May 3 2021, 2:52 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

but the count is the scalar trip count while the width is the vectorization factor....it seems odd that this calculation combines the two concepts together. Also if the trip count is a large constant, wouldn't the ceiling cause the VF to be effectively ignored?

Perhaps you can achieve the desired outcome by adjusting the cost elsewhere (say in expectedCost) and preferably with a target hook.

5982

For this inequality to compare the per-lane costs, CostA above should be multiplied by ceil(...,B.Width) and CostB should be multiplied by ceil(...,A.Width).

dmgreen added inline comments.May 4 2021, 12:51 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974

Oh excellent, thank. I will update the other places I stole this code from too :-)

5978

I'm not sure I understand what you mean. Can you explain in more detail? This doesn't sound like something that should be target dependent.

Let say the trip count is 5. CostA will be the cost of a single iteration with a vector factor of A.Width. If VF is 4 then the total cost of the loop vectorized 4x becomes CostA*ceil(5/4). The rounding up is due to us folding the tail and so executing 2 iterations in this case [*]. If B is scalar then the total scalar cost will be CostB*ceil(5/1) = CostB*5. You then compare the two final costs to find the smallest, it being expected to be the most profitable.

So long as we are working in int64_t and the tripcount/cost are int32, this generalizes for any known trip count without rounding/overflow, I believe.

([*] I thought about adding the non-FoldTailByMasking case too, but there the total cost is more like CostVec*floor(TripCount/VF) + CostScalar*(TripCount%VF) + SomeOverhead, which is more difficult to work out correctly. The existing CostVec/VF is at least an approximation of that.)

dmgreen updated this revision to Diff 342665.May 4 2021, 12:52 AM

Use divideCeil

dmgreen added inline comments.May 4 2021, 12:54 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

Reading this again, perhaps the confusion comes from the old 'ceil' not being some form a fp ceil, but ceil(A/B)? It wasn't the most descriptive name for a function.

bmahjour added inline comments.May 4 2021, 10:53 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

Yes, sorry I saw the lambda ceil and didn't notice that it was doing a divide. My main concern was that the costs were not being normalized by the VF for larger trip counts, but now I see that they are, it's just that the trip count is rounded up and multiplied on both sides of the inequality.

Regarding overflow, I believe it's possible because InstructionCost uses int...so you might want to change it to int64_t.

Regarding target dependency, I'm wondering whether the motivating example is concerned with throughput or latency. The CostA and CostB above are mostly modeling throughput, so if the concern is latency, don't we need to take it into account in a target dependent way?

sdesmalen added inline comments.May 4 2021, 1:29 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974

I think this only makes sense if both factors are fixed-width VFs?
If so, please add this as a condition and use getFixedValue() in the cost-calculation.

5979

should this be B (and the one below be A)?

RKSimon resigned from this revision.May 4 2021, 2:33 PM
RKSimon added a subscriber: RKSimon.
dmgreen added inline comments.May 5 2021, 2:11 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974

Hmm. Sure. Sounds good!

I think we may need something similar for scalable vectors too, eventually. They will run into the same issue with low trip-count loops. It will just not be as obvious what the actual vector width is.

5978

I would expect CostA to be positive and (relatively) small. MaxTripCount can easily be something close to UINTMAX. Width will obviously be a low power of 2. Unsigned felt like the natural type, but changing it to signed sounds fine too.

My motivating case is... concerned with reciprocal throughput.. or that is at least close enough to it. Costing instructions isn't always simple, even on simple architectures. I don't believe it's any different to the existing code below, even if they are both a bit of an approximation.

5979

I'm not sure I following why. Can you give some more details which part would be B/A?

dmgreen updated this revision to Diff 342967.May 5 2021, 2:11 AM

Change to int64_t and guard against scalable vectors.

sdesmalen added inline comments.May 5 2021, 6:34 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974

isScalable?

5974

I think we may need something similar for scalable vectors too, eventually. They will run into the same issue with low trip-count loops. It will just not be as obvious what the actual vector width is.

We may be able to use knowledge about the scalable vectors' runtime width from the vscale_range attribute. When we know nothing about the runtime VF, then I'm not sure if we can make any sensible decisions.

5978

nit: PerVectorIterCost

5979

Sorry, please ignore that comment. You're not calculating the "cost per lane" (like we do below, which switches B/A), but rather calculating the total cost for handling TC scalar iterations by doing ceil(TC/VF) vector iterations.

5980

nit: PerVectorIterCost

5984

nit: getFixedValue (here and below)

dmgreen marked 3 inline comments as done.May 5 2021, 7:29 AM
dmgreen added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974

Doh!

5974

Yep, but comparing a non-scalable VF and a scalable VF will be wrong whichever method we choose, unless the scalable factor happens to be 1. I always imagined that the backend TTI would be telling the vectorizer the correct vscale to use if it was known from -mcpu or guess at a likely one if not (which would probably be 1 or 2 at the moment).

5978

These ones can be scalar too.

dmgreen updated this revision to Diff 343041.May 5 2021, 7:30 AM
dmgreen marked 2 inline comments as done.

Fix typos and whatnot.

sdesmalen accepted this revision.May 5 2021, 7:39 AM

Thanks, LGTM!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

nit: PerVFIterCost in that case?

This revision is now accepted and ready to land.May 5 2021, 7:39 AM
bmahjour added inline comments.May 5 2021, 8:14 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

Sorry if I wasn't clear, but I wasn't asking to change the type of RTCostA and RTCostB to signed int64_t. I was asking to change the underlying cost type inside InstructionCost from int to int64_t. Without that change, it's possible to cause an overflow in expressions like = CostA * ceil(...). The other possible fix would be to enforce an upper bound on the value of MaxTripCount where this heuristic is being applied, but I think that's less desirable.

dmgreen added inline comments.May 5 2021, 9:47 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

Hmm.. divideCeil takes and return uint64_t's.
MaxTripCount will be unsigned so we know (with reasonable assumptions for the size of unsigned (?)) that divideCeil(MaxTripCount, Width) will be less than or equal to UINT_MAX.
CostA will be an an int.

So I'm not sure how int32_t * (uint64_t)uint32_t would then overflow, especially if CostA will be much less than 2^32 (usually much less than 2^16). My understanding is that it would just convert the multiply to a i64 and would be happy enough.

bmahjour added inline comments.May 5 2021, 1:30 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978

Hmm.. divideCeil takes and return uint64_t's.
MaxTripCount will be unsigned so we know (with reasonable assumptions for the size of unsigned (?)) that divideCeil(MaxTripCount, Width) will be less than or equal to UINT_MAX.
CostA will be an an int.

So I'm not sure how int32_t * (uint64_t)uint32_t would then overflow, especially if CostA will be much less than 2^32 (usually much less than 2^16). My understanding is that it would just convert the multiply to a i64 and would be happy enough.

My bad. I thought CostA was of type InstructionCost and didn't notice it was actually InstructionCost::CostType. If it were we would have had an overflow issue due to implicit conversion to InstructionCost, but that's not the case.