This is an archive of the discontinued LLVM Phabricator instance.

[LV] Don't apply "TinyTripCountVectorThreshold" for loops with compile time known TC.
Needs ReviewPublic

Authored by ebrevnov on Dec 14 2021, 1:37 AM.

Details

Summary

When trip count is not known at compile time, there are additional overhead to make sure it's safe to perform next VF iterations. Thus, if vector loop is skipped at runtime then such vectorization is unprofitable. When trip count is known to be small enough there is high chance to get into such situation. Currently, LV is not able to properly cost model in this case since it doesn't account for cost of the epilog loop. Instead "short trip count" heuristic is employed.

While "short trip count" heuristic  makes sense in general (at least for current state) it can be slightly lifted up when trip count is compile time known constant. In   this case it's known at compile time how many vector iterations will be executed and there is no implied overhead by trip count checks as well.  Cost modeling is simple as well, if one vector iteration costs less than one scalar iteration multiple VF then vectorization is profitable.

Note: One may say, that "short trip count" heuristic is the needed to reduce code size in assumption that short trip count loops can't be performance critical. That statement turns out to be false in many cases (for example, nested loops) and should not be driving factor.

Diff Detail

Event Timeline

ebrevnov created this revision.Dec 14 2021, 1:37 AM
ebrevnov requested review of this revision.Dec 14 2021, 1:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2021, 1:37 AM
ebrevnov edited the summary of this revision. (Show Details)Dec 14 2021, 2:42 AM
ebrevnov added reviewers: fhahn, dmgreen, Ayal, lebedev.ri.
Herald added a project: Restricted Project. · View Herald TranscriptJul 19 2022, 2:34 AM
dmgreen added inline comments.Oct 10 2022, 12:04 AM
llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll
7 ↗(On Diff #394187)

I don't think we want this - it is worse. At least that is what my benchmarks suggest.

That was the point of D101726. 1 vector + 1 masked vector iteration when unrolled was worse than 5 scalar because of the overheads of vector instructions. 1 vector + 1 scalar will be in the same boat.

fhahn added inline comments.Oct 10 2022, 12:27 AM
llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll
7 ↗(On Diff #394187)

I also think we would probably need to make this target/CPU dependent. We also have some AArch64 CPUs where usually at least 2 vector iterations are needed to make the vector code profitable if there is a scalar tail.

ebrevnov edited the summary of this revision. (Show Details)Oct 10 2022, 2:07 AM
ebrevnov updated this revision to Diff 467819.Oct 14 2022, 10:12 AM

Rebase + fix problem for platforms which prefer tail folding over scalar epilogue.

Thanks for feedback!
The problem with mve-known-trip-count.ll was caused by 2 things 1) Outdated code base 2) Bug in the implementation caused to skip "preferPredicateOverEpilogue" check and as a result scalar epilogue have been selected. Now both items are fixed mve-known-trip-count.ll works as previously (cost model says vectorization with tail folding is not profitable).

In order to fix above two items (and not introduce new ones) I had to restructure the code responsible for scalar epilog lowering. This restructuring is now parent for this one and in WIP state. While I believe proposed restructuring is a good thing regardless this change I would like to postpone polishing it until we have an agreement on this one.

david-arm added subscribers: dtemirbulatov, david-arm.

Adding @dtemirbulatov since he was the author of the original patch that introduced getMinTripCountTailFoldingThreshold.

paulwalker-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

I don't quite understand this change. The whole point of getMinTripCountTailFoldingThreshold() was to give targets control over this behaviour based on their understanding how the cost model has been implemented.

Admittedly this was in part due to the immaturity of the cost modelling but this change essentially removes that flexibility to the point where there's no value in keeping getMinTripCountTailFoldingThreshold()?

If your previous patches improve the cost model in this regard then I'd rather getMinTripCountTailFoldingThreshold() be removed. That said, @dtemirbulatov can you help here in ascertaining if this option is still required based on this new patch series?

ebrevnov added inline comments.Oct 17 2022, 6:34 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

I think you are missing one thing. Currently we allow vectorization of short trip count loops if there is no run-time overhead and the tail can be folded only. Introduction of getMinTripCountTailFoldingThreshold put additional limitations on vectorization of such loops. This change enables vectorization of short trip count loops with scalar epilogue ONLY If trip count is compile time constant. getMinTripCountTailFoldingThreshold is still applicable if we decide to vectorize by folding the tail.

In other words, this change enables exactly the same way of vectorization for loops with small constant trip count as for loops with unknown or large trip count.

It's a question whether getMinTripCountTailFoldingThreshold threshold should be taken into account if it's decided to fold the tail at step 5) for short trip count loop. Possibly yes... I think this is your original concern, right?

I think the effect this patch has is not well understood, and the description is not informative enough to understand it.
I would perhaps recommend formatting the desired behavior change as a truth table.

I tried running the original benchmarks again with this patch series, it still sees the same decrease in performance I'm afraid. If it was a small change it would be understandable, we accidentally end up on the wrong side of the scalar cost and choose to vectorize where we shouldn't. But the new code is 40% slower than the scalar version, so it's quite a difference. I haven't had a chance to look into the costs it produces, there is a chance we are underestimating the cost of predication or overestimating the cost of scalar. At worst, providing getMinTripCountTailFoldingThreshold works like it should we can always put a limit on the tripcount, providing we can find a decent minimum that works in general for MVE.

The vectorizer doesn't really do any cost modelling for the setup cost, past some obvious things like runtime checks. Due the the pass pipeline, it is usually expected that many low-trip-count loops are unrolled prior to vectorization, and we SLP them instead. D135971 looks like it changes a lot of code in an area that has caused an amount of trouble in the past. We know that the Vectorizer current needs to commit to tail folding vs not tail folding too early. It would be better to have that option as part of the vplan, allowing predicated and non-predicated patters to be costed against one another. I'm not sure I follow D135971 enough to understand the ramifications of those changes.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

Are you assuming that the loop will be unrolled after vectorization, and that unrolling will leave no extra overhead? Scalable vector loops cannot (currently) be unrolled.

ebrevnov updated this revision to Diff 472865.Nov 3 2022, 1:02 AM

Make sure MinTripCountTailFoldingThreshold is honord even for loops with short compile tine known TC.

I tried running the original benchmarks again with this patch series, it still sees the same decrease in performance I'm afraid. If it was a small change it would be understandable, we accidentally end up on the wrong side of the scalar cost and choose to vectorize where we shouldn't. But the new code is 40% slower than the scalar version, so it's quite a difference. I haven't had a chance to look into the costs it produces, there is a chance we are underestimating the cost of predication or overestimating the cost of scalar. At worst, providing getMinTripCountTailFoldingThreshold works like it should we can always put a limit on the tripcount, providing we can find a decent minimum that works in general for MVE.

It sounds like current cost model works really bad for this case and it's essentially the reason why getMinTripCountTailFoldingThreshold had been introduced. If this is the case would you mind trying updated version and see if it helps. Otherwise, I'd probably need a test case I could analyze.

The vectorizer doesn't really do any cost modelling for the setup cost, past some obvious things like runtime checks. Due the the pass pipeline, it is usually expected that many low-trip-count loops are unrolled prior to vectorization, and we SLP them instead. D135971 looks like it changes a lot of code in an area that has caused an amount of trouble in the past. We know that the Vectorizer current needs to commit to tail folding vs not tail folding too early. It would be better to have that option as part of the vplan, allowing predicated and non-predicated patters to be costed against one another. I'm not sure I follow D135971 enough to understand the ramifications of those changes.

I understand the problem you describe here. D135971 has nothing to do with it. It's an NFC. None of tests change their behavior. Main motivation is to simplify unnecessary complicated implementation because I struggled how to push new functionality in it.

ebrevnov added inline comments.Nov 3 2022, 1:13 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

Actually no. Even if we keep vector loop which runs just 1 or 2 iterations I don't see any additional overhead comparing to scalar loop. At least not on IR level. Is there any overhead specific to MVE not visible on IR level?

I've been taking a look at the example that is getting a lot worse. There are certainly some issues with the code generation being non-optimal, but even after a lot of optimizations it looks like it will always be worse than the scalar version. There is a lot of predications and fairly efficient scalar instructions like BFI, which makes accurate cost modelling difficult. There's a lot of setup and spilling too, which is going to hurt for small trip counts. I think for MVE it would make sense to have a way for the target to put a limit on the minumum trip count.

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure in general where this would be useful. The vectorizers handing of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled. It doesn't come up a huge amount and a lot of the cost modelling currently assumes any extra setup costs will be dominated by the loop.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

The getMinTripCountTailFoldingThreshold was added for SVE, it is not (currently) used in MVE.

There will be a certain amount of overhead just from having a second loop, let alone all the extra overhead from the gubbins around it. https://godbolt.org/z/s8xEf7d6K. None of that is particularly MVE specific though, although as it is designed for small energy-efficient cores it may feel the overheads more than other architectures.

I've been taking a look at the example that is getting a lot worse. There are certainly some issues with the code generation being non-optimal, but even after a lot of optimizations it looks like it will always be worse than the scalar version. There is a lot of predications and fairly efficient scalar instructions like BFI, which makes accurate cost modelling difficult.

I would like to understand you case better. Let's take current non optimality of code generation out of the consideration. Is it a problem of inaccurate cost estimation for some particular instructions or not taking overhead which comes with the vectorization into account?

There's a lot of setup and spilling too, which is going to hurt for small trip counts.

Does this setup/spilling happens inside the main vector loop or outside? Is this reflected on IR level or low level (maybe even hardware specific)?

I think for MVE it would make sense to have a way for the target to put a limit on the minumum trip count.

IMHO, this may be an option only if we find out that this is something specific to MVE. Anyway, it looks like a workaround (hopefully temporary :-)) until a better support is there.

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure in general where this would be useful. The vectorizers handing of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled.

Until I'm missing something looks like currently LV comes before Unroller&SLP which makes perfect sense to me. Anyway, I wouldn't stick to any specific pass ordering as LLVM is an infrastructure for building custom compilers. LV should do its job as good as it can. If it can prove that vectorization is beneficial it should do it (until we have an infrastructure to take dependencies between different passes into account).

It doesn't come up a huge amount and a lot of the cost modelling currently assumes any extra setup costs will be dominated by the loop.

ebrevnov added inline comments.Nov 7 2022, 4:45 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

It looks like, in this specific example, vector cost of the main vector loop being estimated as 2x lower than scalar one is pretty much accurate. The problem is that extra cost coming from horizontal reduction after main vector loop is not taken into account. This is something that definitely should be fixed before we can move on in this direction. Essentially, this problem is a major blocking factor to enable short TC loops vectorization in general case (not only compile time known TC).

PS: I wonder if your problematic case on MVE is something similar to this or not?

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure in general where this would be useful. The vectorizers handing of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled.

Until I'm missing something looks like currently LV comes before Unroller&SLP which makes perfect sense to me. Anyway, I wouldn't stick to any specific pass ordering as LLVM is an infrastructure for building custom compilers. LV should do its job as good as it can. If it can prove that vectorization is beneficial it should do it (until we have an infrastructure to take dependencies between different passes into account).

(Unfortunately?) full loop unrolling happens before LV.
I personally think that once LV has outer loop vectorization, that should be considered a bug that should be resolved.

Matt added a subscriber: Matt.Nov 7 2022, 12:39 PM
ebrevnov updated this revision to Diff 474185.Nov 9 2022, 1:01 AM

Avoid vectorization if there is reduction/recurrence induced overhead

ebrevnov added inline comments.Nov 9 2022, 1:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9851–9878

I've update the patch to avoid vectorization in such cases. Please take a look.

Avoid vectorization if there is reduction/recurrence induced overhead

For some architectures like MVE and AArch64 the reduction cost might well be cheap. Non-zero but also not huge. It can depend on the reduction. Have you considered the cost of constants and splats too, that the vectorizer will currently hoist into the preheader? The costmodel functions might also sometimes return cheaper costs for certain shuffles and gathers under the assumption that loop invariant parts can be hoisted.

I've tried to get an example of the case that is getting worse for MVE in rG662b5f18467e. It might be a bit over-reduced, but hopefully it still shows the same problems. The original I can't share, I'm afraid.

@fhahn did you have an example of cases where you had seen AArch64 getting worse? Is it covered by reductions or is it more general?

lebedev.ri resigned from this revision.Jan 12 2023, 5:32 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.