This is an archive of the discontinued LLVM Phabricator instance.

[LV] Support for Remainder loop vectorization
Needs Review · Public

Authored by mivnay on Oct 5 2020, 4:00 AM.

Details

Summary

Ticket: https://bugs.llvm.org/show_bug.cgi?id=46929

The loop vectorizer currently does not support epilogue loop vectorization. The idea is to vectorize the remainder scalar loop left after the initial vectorization, using a vectorization factor (VF) smaller than the original loop's. Once this epilogue loop executes, the remaining iterations (if any) go to the original scalar loop.

  • Iteration checks are performed for both the vectorized and epilog vectorized loops
  • Runtime checks (alias and SCEV checks) are performed only once, for either the vectorized or the epilogue vector loop. If they fail, the original scalar loop is executed.

This helps execute vector code both for loops with a small trip count (less than the original VF) and for loops with a considerable number of remainder iterations after the original vectorization. Currently it is enabled for VF > 16.
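A rough sketch of the intended execution order, with illustrative VFs (the actual transform emits vector IR; strided scalar C loops stand in for the vector bodies here):

#include <stdint.h>

void sketch(int8_t *A, int8_t *B, int8_t *C, int N) {
  int I = 0;
  for (; I + 16 <= N; I += 16)   // main vector loop, VF = 16
    for (int J = I; J < I + 16; ++J)
      A[J] = B[J] + C[J];
  for (; I + 8 <= N; I += 8)     // epilogue vector loop, VF = 8
    for (int J = I; J < I + 8; ++J)
      A[J] = B[J] + C[J];
  for (; I < N; ++I)             // remaining scalar iterations
    A[I] = B[I] + C[I];
}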

This change improves one of the SPEC CPU 2017 benchmarks on both AArch64 and X86 targets: around a 5% gain in x264 on a Ryzen 2700X ref run.

Diff Detail

Event Timeline

mivnay created this revision.Oct 5 2020, 4:00 AM
mivnay requested review of this revision.Oct 5 2020, 4:00 AM
mivnay added a comment.Oct 5 2020, 4:04 AM

The CFG after the optimization of a typical loop will be as follows:

fhahn added a comment.Oct 5 2020, 6:01 AM

Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587)? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal without adding extra complexity to the vectorizer.

mivnay added a comment.Oct 5 2020, 8:12 AM

> Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587)? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal without adding extra complexity to the vectorizer.

Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilogue vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at a higher abstraction level would expose this class completely.

Thanks for working on epilogue vectorization. Incidentally, I've also looked into this recently. There was a long and detailed discussion about this transformation on the mailing list back in 2017: http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-td106322.html. Your patch is able to vectorize epilogue loops with fairly small changes to the LV, but the generated CFG is not optimal. For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily. The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is critical for loops that have a small trip count. Based on the follow-up discussions from the mentioned RFC, the optimal CFG should look more like what I've attached below.
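The attached diagram is not preserved in this archive; a plausible reconstruction from the follow-up discussion, in the same pseudo-C style as the examples later in this review (all details illustrative):

// Epilogue trip-count check first; runtime checks emitted only once.
if (N < VF2)                       // smallest useful trip count checked first
  goto SCALAR_LOOP;                // tiny trip counts pay for a single check
if (!alias_check || !scev_check)   // SCEV/memory checks, one static copy
  goto SCALAR_LOOP;
if (N >= VF1) {
  // main vector loop (VF1)
}
// epilogue vector loop (VF2) over the remaining iterations
// SCALAR_LOOP: any iterations left after the epilogue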

bmahjour requested changes to this revision.Oct 6 2020, 8:07 AM
bmahjour added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
3383

It's extremely hard to "draw" this diagram in text. It's even harder to read it. I think we should create a documentation section under https://llvm.org/docs/Vectorizers.html#loop-vectorizer and upload an image. The link can then be put into the comment for people to view and understand what is being generated.

5636

The vector epilogue loop's VF need not be smaller than the VF of the original loop for it to be profitable. For example, with large interleave counts there may still be a significant number of iterations to execute, and throughput would suffer if a VF smaller than the widest profitable VF is chosen.
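A worked illustration of this point (numbers hypothetical):

// Main loop: VF = 4 with interleave count IC = 4 strips 16 elements per
// iteration, so up to 15 remainder iterations may be left over.
// An epilogue with the same VF = 4 (IC = 1) covers up to 12 of them;
// forcing a smaller VF = 2 would halve the epilogue's throughput.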

5639

Why this limitation?

8469

Can your code handle first-order or reduction recurrences? Please see InnerLoopVectorizer::fixCrossIterationPHIs() and provide a test if they are supported. Otherwise I'm not sure this check is sufficient to catch those cases, especially given that the code guarded by LB.canCreateVectorEpilog() does not preserve LCSSA.
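For reference, a hypothetical test case along these lines would exercise that path (one first-order recurrence plus one reduction):

#include <stdint.h>

// 'prev' carries last iteration's b[i] (a first-order recurrence) and
// 'sum' is a reduction; both cross-iteration values are finalized by
// InnerLoopVectorizer::fixCrossIterationPHIs() when vectorized.
int32_t recurrences(int32_t *a, int32_t *b, int n) {
  int32_t prev = 0, sum = 0;
  for (int i = 0; i < n; ++i) {
    a[i] = prev + b[i]; // uses the previous iteration's value
    sum += a[i];        // reduction across iterations
    prev = b[i];
  }
  return sum;
}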

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
238

Why are these casts hoisted?

test/Transforms/LoopVectorize/epilog-loop-vectorize.ll
99

It would be good to generate more meaningful names for the labels forming the skeleton of the vector epilogue loop. For example vector.ph vs vector.epilogue.ph, vector.body vs vec.epilogue.body, etc.

This revision now requires changes to proceed.Oct 6 2020, 8:07 AM
fhahn added a comment.Oct 6 2020, 8:14 AM

> Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587)? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal without adding extra complexity to the vectorizer.

> Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilogue vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at a higher abstraction level would expose this class completely.

Is the main motivation avoiding re-doing the runtime checks? I think we might be able to annotate the remainder loop with noalias metadata if we emit memory runtime checks, which should avoid generating them again for the remainder (and might be beneficial even if we do not vectorize the remainder). As for the iteration count check, I'd hope that LLVM would already be able to eliminate such a redundant check. If not, we should certainly fix that.

> Thanks for working on epilogue vectorization. Incidentally, I've also looked into this recently. There was a long and detailed discussion about this transformation on the mailing list back in 2017: http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-td106322.html. Your patch is able to vectorize epilogue loops with fairly small changes to the LV, but the generated CFG is not optimal. For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily. The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is critical for loops that have a small trip count. Based on the follow-up discussions from the mentioned RFC, the optimal CFG should look more like what I've attached below.

Thanks for looking into the patch. The idea is to not affect the performance of the original vectorization too much even when epilogue vectorization happens. The CFG you suggested seems to perform the epilogue trip-count check first even when the trip count is good enough for the original vector loop.

I think the optimal CFG is all about profiling information. I ran the SPEC CPU 2017 benchmarks with the current change and did not see any regression, even though many loops got transformed. It gained in one of the benchmarks.

> The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is critical for loops that have a small trip count.

This approach doesn't work when most trip counts are good for the original vector loop. In fact, it even performs one additional trip-count check when both the vector loop and the epilogue vector loop are executed. For example, if the original VF=16 and UF=2, and the epilogue VF=8 and UF=1, a trip count as small as 40 requires 3 trip-count checks, whereas it is 2 in the current implementation.

> For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily.

This optimization is disabled for -Osize. The redundant runtime-check blocks can only be avoided when the epilogue vector loop's trip-count check is done first. But it looks like a code size vs. performance trade-off.

> Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587)? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal without adding extra complexity to the vectorizer.

> Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilogue vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at a higher abstraction level would expose this class completely.

> Is the main motivation avoiding re-doing the runtime checks? I think we might be able to annotate the remainder loop with noalias metadata if we emit memory runtime checks, which should avoid generating them again for the remainder (and might be beneficial even if we do not vectorize the remainder).

Yes, the code changes inside InnerLoopVectorizer are done to get at the various Values (like ResumeValue) and Blocks (like MiddleBlock) easily. If we did the vectorizations independently, we would need a separate analysis to identify the loops, CFG, metadata, llvm::Values, etc.

> As for the iteration count check, I'd hope that LLVM would already be able to eliminate such a redundant check. If not, we should certainly fix that.

The trip-count checks are done for different values (they are not redundant checks): initially for the original trip count, and again for the iterations remaining after the original vector loop executes.

mivnay edited reviewers, added: ashutosh.nema; removed: Ashutosh.
mivnay updated this revision to Diff 296673.Oct 7 2020, 7:41 AM
mivnay marked 3 inline comments as done.

Fixed review comments

lib/Transforms/Vectorize/LoopVectorize.cpp
3383

Sure. I can do it once this patch goes through.

5636

Currently, it is tuned as per the SPEC CPU 2017 benchmarks. It can be fine-tuned based on further data.

5639

There were some issues with the resume values when multiple induction variables are involved. I am planning to handle that later.

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
238

Note that the tests are auto-generated using update_test_checks. The hoisting is done inside the loop vectorizer; I guess there are now redundant casts in the epilogue vector loop.

mivnay marked 2 inline comments as done.Oct 7 2020, 7:43 AM
mivnay marked an inline comment as done.

While I'm not really familiar with LV, I'd like to agree with the previous reviewers.
This doesn't immediately seem like the correct approach, especially if all that is being done is avoiding re-emitting checks, because if they are truly unneeded, they should be getting folded away by other optimization passes.

I do see the elegance of just feeding the epilogue to the vectoriser again, but I also have sympathy for not pushing the responsibility for the clean-up down the line to something else, especially if it is non-trivial. But to progress this discussion, I was wondering if we can say something more about this:

> especially if all that is being done is avoiding re-emitting checks, because if they are truly unneeded, they should be getting folded away by other optimization passes

I haven't looked in much detail at the CFG (re)structure and where all these checks end up, but can we say something about how difficult it would be to clean this up? Is it already supported, or how difficult would it be to support?

fhahn added a comment.Oct 15 2020, 3:45 AM

> Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587)? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal without adding extra complexity to the vectorizer.

> Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilogue vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at a higher abstraction level would expose this class completely.

> Is the main motivation avoiding re-doing the runtime checks? I think we might be able to annotate the remainder loop with noalias metadata if we emit memory runtime checks, which should avoid generating them again for the remainder (and might be beneficial even if we do not vectorize the remainder).

> Yes, the code changes inside InnerLoopVectorizer are done to get at the various Values (like ResumeValue) and Blocks (like MiddleBlock) easily. If we did the vectorizations independently, we would need a separate analysis to identify the loops, CFG, metadata, llvm::Values, etc.

I am not sure I follow here. LoopVectorize preserves LoopInfo, so after LoopVectorizePass::processLoop it should be easy to get the Loop * for the remainder loop, and that should be all that is needed to process it again. We might also need a way to instruct ILV to choose a smaller VF for the remainder, but we might just be able to use the vectorization metadata to do so. It should also be relatively straightforward to skip runtime check generation in the epilogue case.
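For illustration, the vectorization metadata referred to here is the same mechanism that backs the clang loop pragma, so the hint on a remainder loop could be sketched as follows (hypothetical example; the pass would attach the llvm.loop.vectorize.width metadata directly rather than via a pragma):

#include <stdint.h>

void remainder_hint(int8_t *a, int8_t *b, int8_t *c, int n) {
  // The pragma below emits llvm.loop.vectorize.width metadata,
  // which LV consults when choosing the VF for this loop.
#pragma clang loop vectorize_width(8)
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}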

I think Florian answered those questions; that indeed looks like the most sensible way forward then.

I will try to summarize the current changes in the C code below and also try to answer some of the common questions raised.

Before Loop Vectorization:

#include <stdint.h>

void func(int8_t *A, int8_t *B, int8_t *C, int N) {
  for (int I = 0; I < N; ++I)
    A[I] = B[I] + C[I];
}

After Loop Vectorization (with the epilogue disabled):

// After Loop Vectorization
void func1(int8_t *A, int8_t *B, int8_t *C, int N) {
  int I1 = 0;
  int VF1;             // vectorization factor (placeholder)
  bool alias_check_1;  // runtime alias check result (placeholder)
  bool scev_check_1;   // runtime SCEV check result (placeholder)

  if (N >= VF1) { // iteration_check_1
    if (!alias_check_1)
      goto SCALAR_LOOP;

    if (!scev_check_1)
      goto SCALAR_LOOP;

    // vector_loop_1
    for (I1 = 0; I1 + VF1 <= N; I1 += VF1)
      A[I1:(I1 + VF1 - 1)] = B[I1:(I1 + VF1 - 1)] + C[I1:(I1 + VF1 - 1)];

    goto SCALAR_LOOP_WITH_CHECK;
  } else
    goto SCALAR_LOOP;

SCALAR_LOOP_WITH_CHECK:
  if (N - I1 > 0) { // remainder_iteration_check_1
  SCALAR_LOOP:
    for (int I = I1; I < N; ++I)
      A[I] = B[I] + C[I];

    goto EXIT;
  } else
    goto EXIT;

EXIT:
  return;
}

After Epilogue Loop Vectorization:

void func2(int8_t *A, int8_t *B, int8_t *C, int N) {
  int I1 = 0, I2;
  int VF1, VF2;
  bool alias_check_1, alias_check_2;
  bool scev_check_1, scev_check_2;
  bool is_vector_loop_executed = false;

  if (N >= VF1) { // iteration_check_1
    if (!alias_check_1)
      goto SCALAR_LOOP; // optimization_1

    if (!scev_check_1)
      goto SCALAR_LOOP; // optimization_1

    // Vector Loop
    for (I1 = 0; I1 + VF1 <= N; I1 += VF1)
      A[I1:(I1 + VF1 - 1)] = B[I1:(I1 + VF1 - 1)] + C[I1:(I1 + VF1 - 1)];
    is_vector_loop_executed = true;
    goto EPILOG_LOOP_ENTRY_WITH_CHECK;
  } else
    goto EPILOG_LOOP_ENTRY;

EPILOG_LOOP_ENTRY_WITH_CHECK:
  if (N - I1 == 0) { // remainder_iteration_check_1
    goto EXIT;
  }

EPILOG_LOOP_ENTRY:     // I1 is mostly 0 here and ignored in the actual code.
  if (N - I1 >= VF2) { // iteration_check_2

    if (!is_vector_loop_executed) { // optimization_2
      if (!alias_check_2)
        goto SCALAR_LOOP;

      if (!scev_check_2)
        goto SCALAR_LOOP;
    }
    // Epilog Vector Loop
    for (I2 = I1; I2 + VF2 <= N; I2 += VF2)
      A[I2:(I2 + VF2 - 1)] = B[I2:(I2 + VF2 - 1)] + C[I2:(I2 + VF2 - 1)];

    goto SCALAR_LOOP_WITH_CHECK;
  } else
    goto SCALAR_LOOP;

SCALAR_LOOP_WITH_CHECK:
  if (N - I2 > 0) { // remainder_iteration_check_2
  SCALAR_LOOP:
    for (int I = I2; I < N; ++I)
      A[I] = B[I] + C[I];

    goto EXIT;
  } else
    goto EXIT;

EXIT:
  return;
}
NOTE:
  1. Function names are changed just for reference.
  2. VF1 is the original vectorization factor; VF2 is the epilogue vectorization factor.
  3. The SCEV, alias, and iteration checks may not be present for all vectorized loops.
  4. is_vector_loop_executed is actually implemented as a PHI node.

Why epilogue loop vectorization?

There are two kinds of cases where it benefits:

  1. The number of remainder iterations after the original vectorization is large, and there is an opportunity to vectorize them.

    Example: for i8 types, if the VF is 16 and the trip count is 24, epilogue vectorization with VF=8 makes perfect sense.
  2. The original trip count itself is small.

    Example: the original vectorization generates VF = 16, but the trip count is only 8 for i8 types.

We are trying to cover both cases in this patch; the worked numbers below illustrate them.
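Illustrative numbers (an epilogue VF2 = 8 assumed):

// Case 1: i8, VF1 = 16, N = 24 -> the main loop covers 16 elements,
//         leaving 8; an epilogue with VF2 = 8 handles all of them.
// Case 2: i8, VF1 = 16, N = 8  -> the main loop is bypassed entirely,
//         but the epilogue with VF2 = 8 still vectorizes the loop.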

On what basis was the order of checks decided?

The order was decided based on the profile information from the current candidates we have in SPEC CPU 2017. We did not find any regressions with the current order.
Also, the current order of checks does not disturb the original vectorization flow even when epilogue vectorization is done, except for the epilogue loop iteration check (iteration_check_2 in func2()).

Why not re-run the vectorizer?

Short answer:

Re-running the vectorizer is not optimal.

Long answer:

We have the runtime checks in both the vector loop and the epilogue vector loop. This is needed because the iteration check for the original VF (VF1 in func2) might fail and branch directly to the epilogue loop (EPILOG_LOOP_ENTRY). In that case, the original SCEV and alias checks may never be executed before the epilogue vector loop, as the trace below shows.
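A trace through func2() with, say, N = 8, VF1 = 16, VF2 = 8:

// iteration_check_1 (N >= VF1) fails        -> EPILOG_LOOP_ENTRY
// iteration_check_2 (N - I1 >= VF2) passes  -> epilogue path
// is_vector_loop_executed is false, so alias_check_2 and scev_check_2
// must run here: the *_1 checks were never reached.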

There are two optimizations which are done to avoid re-running the checks:

a. optimization_1 in func2(): if either the SCEV or the alias check fails for the original vector loop, go directly to SCALAR_LOOP (instead of to EPILOG_LOOP_ENTRY, as would happen when re-running the vectorizer).
b. optimization_2 in func2(): if the vectorizer already executed the checks and they passed, do not run them again for the epilogue vector loop.

Re-running the vectorizer would not give us access to all these checks in the CFG. That is why the changes are done inside InnerLoopVectorizer. I don't see any existing optimization eliminating the redundant blocks after blindly re-running the vectorizer. This was discussed in the older RFC as well.

> This approach doesn't work when most trip counts are good for the original vector loop. In fact, it even performs one additional trip-count check when both the vector loop and the epilogue vector loop are executed. For example, if the original VF=16 and UF=2, and the epilogue VF=8 and UF=1, a trip count as small as 40 requires 3 trip-count checks, whereas it is 2 in the current implementation.

The relative cost of the extra trip-count check is greater when the trip count is small enough to bypass the vector code. Similarly, the relative cost is lower (in proportion to the actual computation of the loop) when the vector code is executed. As a result, it makes more sense to optimize the case where the cost of the extra trip-count check matters most. If we take your example above and consider the case where the trip count is smaller than 8, then the number of trip-count checks would be 1 (in the CFG I posted), compared to 2 in this patch.
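To spell that case out (say N = 4, smaller than the epilogue's VF2 = 8):

// Suggested CFG: the first check (N < VF2) sends execution straight to
//                the scalar loop -> 1 trip-count check on this path.
// func2():       iteration_check_1 (N >= VF1) fails, then
//                iteration_check_2 (N >= VF2) fails -> 2 checks.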

> This optimization is disabled for -Osize. The redundant runtime-check blocks can only be avoided when the epilogue vector loop's trip-count check is done first. But it looks like a code size vs. performance trade-off.

Code size can also have an impact on performance. It's also much harder to model the cost of a code-size increase than it is to model the cost of the compares and branches required for the trip-count checks, so I'd strongly suggest we go with the alternative CFG, which avoids generating the redundant runtime checks.

I also agree with @mivnay's summary above and the general approach of just running ILV again on the remainder loop with the available vplan. If we mark the epilogue loop and put it back in the worklist, it'll be harder/uglier to then modify the CFG to make it more optimal. I do, however, think that the implementation can be improved (see my note below). Please also note that the SCEV and runtime checks cannot be avoided by marking them "noalias" (or similar tricks) because if the iteration count of the loop is small enough to by-pass the main vector loop and large enough to execute the vector epilogue, then the runtime checks need to be executed for the epilogue loop. The only way to avoid the redundant runtime checks is to generate the smaller trip count check first, as illustrated in the CFG I've posted above.

I have an alternative implementation with the same general approach, but with a bit more modular design that also avoids the extra runtime checks using the mentioned CFG. I've cleaned it up a little but haven't had time to post a patch. I should have it ready by the end of the week.

fhahn added a comment.Oct 15 2020, 9:39 AM

> I also agree with @mivnay's summary above and the general approach of just running ILV again on the remainder loop with the available vplan. If we mark the epilogue loop and put it back in the worklist, it'll be harder/uglier to then modify the CFG to make it more optimal. I do, however, think that the implementation can be improved (see my note below). Please also note that the SCEV and runtime checks cannot be avoided by marking them "noalias" (or similar tricks) because if the iteration count of the loop is small enough to by-pass the main vector loop and large enough to execute the vector epilogue, then the runtime checks need to be executed for the epilogue loop. The only way to avoid the redundant runtime checks is to generate the smaller trip count check first, as illustrated in the CFG I've posted above.

Right, the approach I suggested should work for the case where we only execute the epilogue if we also execute the main vector loop (currently the runtime checks are independent of the VF AFAIK, and the SCEV checks as well (less sure), but not the minimum iteration check).

But setting things up as in the suggested CFG is going to be a bit more tricky and might not turn out to be much simpler in the end. I might give it a try to see if it's feasible.

> This optimization is disabled for -Osize. The redundant runtime-check blocks can only be avoided when the epilogue vector loop's trip-count check is done first. But it looks like a code size vs. performance trade-off.

> Code size can also have an impact on performance. It's also much harder to model the cost of a code-size increase than it is to model the cost of the compares and branches required for the trip-count checks, so I'd strongly suggest we go with the alternative CFG, which avoids generating the redundant runtime checks.

If we have implementations for both, we could just evaluate which one's better on a large set of benchmarks?

> But setting things up as in the suggested CFG is going to be a bit more tricky and might not turn out to be much simpler in the end. I might give it a try to see if it's feasible.

Not sure what you mean by simplicity, as we are trying to generate the most optimal control flow around the loops, but sure, it would be a good idea to try and see if any problems are uncovered with either of these approaches.

> If we have implementations for both, we could just evaluate which one's better on a large set of benchmarks?

The problem is that without an adequate cost model, empirical data may not give us enough information about the optimality of the generated code (or worse, it could send us down the wrong path). The current heuristic is very limited and specific to a particular benchmark in SPEC 2017. I think we should base our decisions on theoretical foundations, develop a cost model, and then do performance verification and tuning using benchmarks and other workloads.

Please see D89566 for the alternative approach.

bmahjour resigned from this revision.Nov 6 2020, 10:28 AM