Page MenuHomePhabricator

[LoopVectorizer] Don't perform interleaving of predicated scalar loops
ClosedPublic

Authored by dmgreen on Jan 30 2022, 6:31 AM.

Details

Summary

The vectorizer will choose at times to "vectorize" loop with a scalar factor (VF=1) with interleaving (IC > 1). This can produce better code than the unroller (notable for reductions where it can produce independent reduction chains that are combined after the loop). At times this is not very beneficial though, for example when runtime checks are needed or when the scalar code requires predication.

This addresses the second point, preventing the vectorizer from interleaving when the scalar loop will require predication. This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial. It helps reverse some of the regressions from D118090.

Diff Detail

Event Timeline

dmgreen created this revision.Jan 30 2022, 6:31 AM
dmgreen requested review of this revision.Jan 30 2022, 6:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 30 2022, 6:31 AM

This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial

Can you expand a little bit on why this becomes a bit of a mess? The original scalar loop has control-flow for predication as well, so I guess interleaving would just duplicate such control flow for the second scalar iteration. Is the code generated by the LV less efficient or are we missing any folds/simplification? Or is there a fundamental reason this can never be an improvement? I could imagine a scenario where most of the loop body would benefit from interleaving, but one statement in the loop doesn't because of predication, it would still be beneficial to interleave.

This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial

Can you expand a little bit on why this becomes a bit of a mess? The original scalar loop has control-flow for predication as well, so I guess interleaving would just duplicate such control flow for the second scalar iteration. Is the code generated by the LV less efficient or are we missing any folds/simplification? Or is there a fundamental reason this can never be an improvement? I could imagine a scenario where most of the loop body would benefit from interleaving, but one statement in the loop doesn't because of predication, it would still be beneficial to interleave.

It is less efficient in all the benchmarks I've ran. It won't come up very often - we usually either choose to vectorize or won't choose to interleave. Interleaving is generally only done for smallish loops. When the vectorizer is forced to make serialized predicate blocks (and possibly add scev checks, as in the testcase) - it's hard to see how the code could be so much better than it is now. The patch gives a 50-60% improvement in the places it helps.

Whatever happens, it is best to leave it for the unroller to unroll with its own profitability heuristics (which in this case, it likely will not).

Hi @dmgreen, from what you're saying it sounds like perhaps the problem is due to two related things:

  1. The cost model for interleaving predicated operations is broken in some cases and needs fixing in the long term (since if the code it produces is of such a low quality then it has probably vastly underestimated the cost). I'm not sure if selectInterleaveCount really takes the cost into account for the VF=1 case - it lseems to be a selection of bolted-on workarounds/guesses due to the lack of a proper cost model.
  2. The loop vectoriser is basically rubbish at unrolling and/or scalarising predicated operations and ultimately in the long term we probably want to fix this. I imagine this is also a problem for VF>1 when the predicated operation has to be scalarised.

One thing to note about your patch is that you are calling blockNeedsPredicationForAnyReason, which includes predicated loops (tail-folding). That may be the right thing to do in this case, but I think it's worth adding a test case for it at least?

sdesmalen accepted this revision.Feb 2 2022, 8:02 AM

This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial

Can you expand a little bit on why this becomes a bit of a mess? The original scalar loop has control-flow for predication as well, so I guess interleaving would just duplicate such control flow for the second scalar iteration. Is the code generated by the LV less efficient or are we missing any folds/simplification? Or is there a fundamental reason this can never be an improvement? I could imagine a scenario where most of the loop body would benefit from interleaving, but one statement in the loop doesn't because of predication, it would still be beneficial to interleave.

It is less efficient in all the benchmarks I've ran. It won't come up very often - we usually either choose to vectorize or won't choose to interleave. Interleaving is generally only done for smallish loops. When the vectorizer is forced to make serialized predicate blocks (and possibly add scev checks, as in the testcase) - it's hard to see how the code could be so much better than it is now. The patch gives a 50-60% improvement in the places it helps.

Whatever happens, it is best to leave it for the unroller to unroll with its own profitability heuristics (which in this case, it likely will not).

Okay, I think I see what you mean now. Because the block is predicated, the LV will try to predicate each operation that needs predication (e.g. every load/store), so that we end up executing NumPredicatedOps * UF branch instructions.

I guess the SCEVChecks would be something we could add to the cost-model in a separate patch, e.g. if the LV has to generate SCEVChecks at all don't bother setting UF>1 iff VF=1.

This revision is now accepted and ready to land.Feb 2 2022, 8:02 AM
This revision was landed with ongoing or failed builds.Feb 7 2022, 11:34 AM
This revision was automatically updated to reflect the committed changes.
TiehuZhang added a subscriber: TiehuZhang.EditedJun 22 2022, 6:19 PM

Hi, @dmgreen, is it necessary to add ScalarInterleavingRequiresPredication to the guard ? I met some cases that the current patch wants to prevent from interleaving but still be interleaved, as AggressivelyInterleaveReductions is enabled on our target.

if (AggressivelyInterleaveReductions && !ScalarInterleavingRequiresPredication) {
  LLVM_DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n");
  return IC;
}
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2022, 6:19 PM

Hello.

Hmm. I remember that this was important for performance in some cases on AArch64 - but we don't enable AggressivelyInterleaveReductions. If you have examples of scalar interleaving causing performance problems with AggressivelyInterleaveReductions then it sounds sensible to me to disable it for the if you pointed to as well.

TiehuZhang added a comment.EditedJun 23 2022, 6:52 PM

Hello.

Hmm. I remember that this was important for performance in some cases on AArch64 - but we don't enable AggressivelyInterleaveReductions. If you have examples of scalar interleaving causing performance problems with AggressivelyInterleaveReductions then it sounds sensible to me to disable it for the if you pointed to as well.

OK, thanks! I'll try to extract an example from the benchmark to illustrate my point. By the way, I wonder why the patch does not return 1 after detecting that ScalarInterleavingRequiresPredication is true. Is this patch not intended to filter all predicated scalar loops?