This is an archive of the discontinued LLVM Phabricator instance.

[LV] Support for Remainder loop vectorization
Needs Review (Public)

Authored by mivnay on Oct 5 2020, 4:00 AM.

Details

Summary

Ticket : https://bugs.llvm.org/show_bug.cgi?id=46929

Loop Vectorize currently doesn't support epilog loop vectorization. The idea is to vectorize the remainder scalar loop after the initial vectorization, with a vectorization factor (VF) smaller than that of the original loop. Once this loop has executed, the remaining iterations (if any) fall through to the original scalar loop.

  • Iteration checks are performed for both the vectorized and epilog-vectorized loops.
  • Runtime checks (alias and SCEV checks) are done only once, for either the vectorized or the epilog vector loop. If they fail, the original scalar loop is executed.

This helps execute vector code either for loops whose trip count is smaller than the original VF or for loops with a considerable number of remainder iterations after the original vectorization. Currently it is enabled for VF > 16.
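
To make the intended structure concrete before the detailed discussion below, here is a minimal standalone C++ sketch (not from the patch; the VF values and the scalar lane loops are purely illustrative) of the loop structure the transformation aims to produce:

#include <stdint.h>

enum { VF1 = 32, VF2 = 8 }; // assumed factors; the patch enables VF > 16

void conceptual(int8_t *A, const int8_t *B, const int8_t *C, int N) {
  int I = 0;
  // Main vector loop: VF1 iterations per trip (lanes shown as an inner loop).
  for (; I + VF1 <= N; I += VF1)
    for (int L = 0; L < VF1; ++L)
      A[I + L] = B[I + L] + C[I + L];
  // Epilog vector loop: VF2 iterations per trip.
  for (; I + VF2 <= N; I += VF2)
    for (int L = 0; L < VF2; ++L)
      A[I + L] = B[I + L] + C[I + L];
  // Scalar remainder.
  for (; I < N; ++I)
    A[I] = B[I] + C[I];
}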

This change improves one of the SPEC CPU 2017 benchmarks on both AArch64 and X86 targets: around a 5% gain in x264 on a Ryzen 2700X ref run.

Diff Detail

Event Timeline

mivnay created this revision.Oct 5 2020, 4:00 AM
mivnay requested review of this revision.Oct 5 2020, 4:00 AM
mivnay added a comment.Oct 5 2020, 4:04 AM

The CFG after the optimization of a typical loop will be as follows:

fhahn added a comment.Oct 5 2020, 6:01 AM

Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587 ? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal, but without adding extra complexity to the vectorizer.

mivnay added a comment.Oct 5 2020, 8:12 AM

Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587 ? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal, but without adding extra complexity to the vectorizer.

Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilog vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at the higher abstraction level exposes this class completely.

Thanks for working on epilogue vectorization. Incidentally, I've also looked into this recently. There was a long and detailed discussion about this transformation on the mailing list back in 2017: http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-td106322.html. Your patch is able to vectorize epilogue loops with fairly small changes to the LV; however, the generated CFG is not optimal. For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily. The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is important for loops that have a small trip count. Based on the follow-up discussions from the mentioned RFC, the optimal CFG should look more like what I've attached below.

bmahjour requested changes to this revision.Oct 6 2020, 8:07 AM
bmahjour added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
3388

It's extremely hard to "draw" this diagram in text. It's even harder to read it. I think we should create a documentation section under https://llvm.org/docs/Vectorizers.html#loop-vectorizer and upload an image. The link can then be put into the comment for people to view and understand what is being generated.

5641

The vector epilogue loop's VF need not be smaller than the VF of the original loop for it to be profitable. For example, with large interleave counts there may still be a significant number of iterations to be executed, and throughput would suffer if a VF smaller than the widest profitable VF is chosen.
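
A small numeric illustration of this point (the VF/UF values are assumed, not taken from the patch):

#include <cstdio>

int main() {
  const int VF = 4, UF = 8; // main loop retires VF * UF = 32 iterations/trip
  const int Ns[] = {33, 40, 63};
  for (int N : Ns) {
    int Rem = N % (VF * UF); // up to VF*UF - 1 = 31 iterations can remain
    // The epilogue can still run full-width vector trips at the widest VF.
    printf("N=%2d: remainder=%2d -> %d epilogue trips at VF=%d\n",
           N, Rem, Rem / VF, VF);
  }
  return 0;
}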

5644

why this limitation?

8471

Can your code handle first-order or reduction recurrences? Please see InnerLoopVectorizer::fixCrossIterationPHIs() and provide a test if they are supported. Otherwise I'm not sure this check is sufficient to catch those cases, especially given that the code guarded by LB.canCreateVectorEpilog() does not preserve LCSSA.
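
For reference, here is a minimal example (illustrative only, not from the patch) of a first-order recurrence where the epilogue loop would need a correct resume value for the carried scalar, not just for the induction variable:

// `Prev` is a cross-iteration PHI: InnerLoopVectorizer::fixCrossIterationPHIs
// has to extract the last value produced by the main vector loop so that the
// epilogue (or scalar) loop can resume with it.
void firstOrderRecurrence(int *Out, const int *In, int N) {
  int Prev = 0;
  for (int I = 0; I < N; ++I) {
    Out[I] = Prev + In[I];
    Prev = In[I];
  }
}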

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
238

why are these casts hoisted?

test/Transforms/LoopVectorize/epilog-loop-vectorize.ll
100

It would be good to generate more meaningful names for the labels forming the skeleton of the vector epilogue loop. For example vector.ph vs vector.epilogue.ph, vector.body vs vec.epilogue.body, etc.

This revision now requires changes to proceed.Oct 6 2020, 8:07 AM
fhahn added a comment.Oct 6 2020, 8:14 AM

Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587 ? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal, but without adding extra complexity to the vectorizer.

Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilog vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at the higher abstraction level exposes this class completely.

Is the main motivation avoiding re-doing the runtime checks? I think we might be able to annotate the remainder loop with noalias metadata if we emit memory runtime checks, which should avoid generating them again for the remainder (and might be beneficial even if we do not vectorize the remainder). As for the iteration count check, I'd hope that LLVM would already be able to eliminate such a redundant check. If not, we should certainly fix that.

Thanks for working on epilogue vectorization. Incidentally, I've also looked into this recently. There was a long and detailed discussion about this transformation on the mailing list back in 2017: http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-td106322.html. Your patch is able to vectorize epilogue loops with fairly small changes to the LV; however, the generated CFG is not optimal. For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily. The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is important for loops that have a small trip count. Based on the follow-up discussions from the mentioned RFC, the optimal CFG should look more like what I've attached below.

Thanks for looking into the patch. The idea is to not affect the performance of the original vectorization too much, even when epilog vectorization has been done. The CFG you suggested seems to perform the epilog trip count check first, even when the trip count is large enough for the original vector loop.

I think the optimal CFG is all about profiling information. I ran the SPEC CPU 2017 benchmarks with the current change and did not see any regression, even though many loops got transformed. It gained in one of the benchmarks.

The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is important for loops that have a small trip count.

This approach doesn't work well when most of the trip counts are large enough for the original vector loop. In fact, it even performs one additional trip count check when both the vector loop and the epilog vector loop are executed. For example, if the original VF=16 and UF=2, and the epilog VF=8 and UF=1, a trip count as small as 40 requires 3 trip count checks, whereas it is 2 in the current implementation.

For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily.

This optimization is disabled for -Osize. Redundant runtime check blocks can only be avoided when the epilog vector loop trip count check is done first, but that looks like a code-size vs. performance trade-off.

Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587 ? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal, but without adding extra complexity to the vectorizer.

Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilog vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at the higher abstraction level exposes this class completely.

Is the main motivation avoiding re-doing the runtime checks? I think we might be able to annotate the remainder loop with noalias metadata if we emit memory runtime checks, which should avoid generating them again for the remainder (and might be beneficial even if we do not vectorize the remainder).

Yes, the code changes inside InnerLoopVectorizer are done to get at the various Values (like ResumeValue) and Blocks (like MiddleBlock) easily. If we did the vectorizations independently, we would need a separate analysis to identify the loops, CFG, metadata, llvm::Values, etc.

As for the iteration count check, I'd hope that LLVM would already be able to eliminate such a redundant check. If not, we should certainly fix that.

The trip count checks are done for different values (they are not redundant checks): the first is done on the original trip count, and the second on the iterations remaining after the original vector loop has executed. For example, with VF1 * UF = 32 and a trip count of 40, the first check compares 40 against 32, while the second compares the remaining 8 against the epilog VF.

mivnay edited reviewers, added: ashutosh.nema; removed: Ashutosh.
mivnay updated this revision to Diff 296673.Oct 7 2020, 7:41 AM
mivnay marked 3 inline comments as done.

Fixed review comments

lib/Transforms/Vectorize/LoopVectorize.cpp
3388

Sure. I can do it once this patch goes through.

5641

Currently, it is tuned per the SPEC CPU 2017 benchmarks. It can be fine-tuned based on further data.

5644

There were some issues with the resume values when multiple induction variables are involved. I am planning to handle that later.

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
238

Note that the tests are auto-generated using update_test_checks. The hoisting is done inside loop vectorize; I guess there are redundant casts now in the epilog vector loop.

mivnay marked 2 inline comments as done.Oct 7 2020, 7:43 AM
mivnay marked an inline comment as done.

While I'm not really familiar with LV, I'd like to agree with the previous reviewers. This doesn't immediately seem like the correct approach, especially if all that is being done is to avoid re-emitting checks - because if they are truly unneeded, they should be getting folded away by other optimization passes.

I do see the elegance of just feeding the epilogue to the vectoriser again, but I also have sympathy for not pushing the responsibility for the clean-up down the line to something else, especially if this is non-trivial. But to progress this discussion, I was wondering if we can say something more about this:

if all that is being done is to avoid re-emitting checks - because if they are truly unneeded, they should be getting folded away by other optimization passes

I haven't looked in much detail at the CFG (re)structure and where all these checks end up, but can we say something about how difficult it is to clean this up? Is it already supported, or how difficult would it be to support?

fhahn added a comment.Oct 15 2020, 3:45 AM

Did you consider supporting this naturally by just having LV re-visit the newly created remainder loops, i.e. remember the created remainder loops and add them to the top-level worklist https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L8587 ? We would need to make sure we do not visit them repeatedly, but overall we should be able to achieve the same goal, but without adding extra complexity to the vectorizer.

Blindly calling the vectorizer for the loop again is not optimal. The current change does that, but at a lower abstraction level. The majority of the changes are about setting up the right overall CFG structure; for example, it is unnecessary to execute the runtime checks twice. "struct EpilogVectorLoopHelper" is just the carrier of information from the original vector loop generation to the epilog vector loop generation. Also, InnerLoopVectorizer doesn't expose the vector loop CFG structure to its users; fixing the CFG structure at the higher abstraction level exposes this class completely.

Is the main motivation avoiding re-doing the runtime checks? I think we might be able to annotate the remainder loop with noalias metadata if we emit memory runtime checks, which should avoid generating them again for the remainder (and might be beneficial even if we do not vectorize the remainder).

Yes, the code changes inside InnerLoopVectorizer are done to get at the various Values (like ResumeValue) and Blocks (like MiddleBlock) easily. If we did the vectorizations independently, we would need a separate analysis to identify the loops, CFG, metadata, llvm::Values, etc.

I am not sure I follow here. LoopVectorize preserves LoopInfo, so I think after LoopVectorizePass::processLoop it should be easy to get the Loop * pointer for the remainder loop? And that should be all that is needed to process it again? We might also need a way to instruct ILV to choose a smaller VF for the remainder, but we might just be able to use the vectorization metadata to do so. It should also be relatively straight-forward to skip runtime check generation in the epilogue case.
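
A minimal sketch of what this could look like (names and placement assumed; this is not code from either patch), reusing the llvm.loop.isvectorized attribute that LV already attaches to transformed loops so that a worklist-based driver does not re-visit them:

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"

using namespace llvm;

// Push a newly created remainder loop only if LV has not already processed
// it; getBooleanLoopAttribute is the existing helper from LoopInfo.h.
static void enqueueRemainder(Loop *Remainder,
                             SmallVectorImpl<Loop *> &Worklist) {
  if (Remainder &&
      !getBooleanLoopAttribute(Remainder, "llvm.loop.isvectorized"))
    Worklist.push_back(Remainder);
}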

I think Florian answered those questions; that indeed looks like the most sensible way forward, then.

I will try to summarize the current changes in the C code below, and also try to answer some of the common questions raised.

Before Loop Vectorization:

#include <stdint.h>

void func(int8_t *A, int8_t *B, int8_t *C, int N) {
  for (int I = 0; I < N; ++I)
    A[I] = B[I] + C[I];
}

After Loop Vectorization (with epilog disabled):

// After Loop Vectorization
void func1(int8_t *A, int8_t *B, int8_t *C, int N) {
  int I1 = 0; // so the scalar loop starts at 0 when the vector loop is bypassed
  int VF1;
  bool alias_check_1;
  bool scev_check_1;

  if (N >= VF1) { // iteration_check_1
    if (!alias_check_1)
      goto SCALAR_LOOP;

    if (!scev_check_1)
      goto SCALAR_LOOP;

    // vector_loop_1
    for (I1 = 0; I1 + VF1 <= N; I1 += VF1)
      A[I1:(I1 + VF1 - 1)] = B[I1:(I1 + VF1 - 1)] + C[I1:(I1 + VF1 - 1)];

    goto SCALAR_LOOP_WITH_CHECK;
  } else
    goto SCALAR_LOOP;

SCALAR_LOOP_WITH_CHECK:
  if (N - I1 > 0) { // remainder_iteration_check_1
  SCALAR_LOOP:
    for (int I = I1; I < N; ++I)
      A[I] = B[I] + C[I];

    goto EXIT;
  } else
    goto EXIT;

EXIT:
  return;
}

After Epilog Loop Vectorization:

void func2(int8_t *A, int8_t *B, int8_t *C, int N) {
  int I1 = 0, I2;
  int VF1, VF2;
  bool alias_check_1, alias_check_2;
  bool scev_check_1, scev_check_2;
  bool is_vector_loop_executed = false;

  if (N >= VF1) { // iteration_check_1
    if (!alias_check_1)
      goto SCALAR_LOOP; // optimization_1

    if (!scev_check_1)
      goto SCALAR_LOOP; // optimization_1

    // Vector Loop
    for (I1 = 0; I1 + VF1 <= N; I1 += VF1)
      A[I1:(I1 + VF1 - 1)] = B[I1:(I1 + VF1 - 1)] + C[I1:(I1 + VF1 - 1)];
    is_vector_loop_executed = true;
    goto EPILOG_LOOP_ENTRY_WITH_CHECK;
  } else
    goto EPILOG_LOOP_ENTRY;

EPILOG_LOOP_ENTRY_WITH_CHECK:
  if (N - I1 == 0) { // remainder_iteration_check_1
    goto EXIT;
  }

EPILOG_LOOP_ENTRY:     // I1 is mostly 0 here and ignored in the actual code.
  I2 = I1;             // resume where the vector loop (if any) stopped
  if (N - I2 >= VF2) { // iteration_check_2

    if (!is_vector_loop_executed) { // optimization_2
      if (!alias_check_2)
        goto SCALAR_LOOP;
        
      if (!scev_check_2)
         goto SCALAR_LOOP;
    }
    // Epilog Vector Loop
    for (; I2 + VF2 <= N; I2 += VF2)
      A[I2:(I2 + VF2 - 1)] = B[I2:(I2 + VF2 - 1)] + C[I2:(I2 + VF2 - 1)];

    goto SCALAR_LOOP_WITH_CHECK;
  } else
    goto SCALAR_LOOP;

SCALAR_LOOP_WITH_CHECK:
  if (N - I2 > 0) { // remainder_iteration_check_2
  SCALAR_LOOP:
    for (int I = I2; I < N; ++I)
      A[I] = B[I] + C[I];

    goto EXIT;
  } else
    goto EXIT;

EXIT:
  return;
}
NOTE:
  1. Function names are changed just for reference purposes.
  2. VF1 is the vectorization factor; VF2 is the epilog vectorization factor.
  3. The SCEV, alias, and iteration checks may not be present for all vectorized loops.
  4. is_vector_loop_executed is actually implemented as a PHI node.
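
To make the control flow of func2 easier to follow, here is a small standalone tracer (not part of the patch; it assumes VF1 = 16, VF2 = 8, and that all SCEV/alias checks pass) that reports which loops execute for a few trip counts:

#include <cstdio>

static void trace(int N, int VF1, int VF2) {
  int I = 0;
  bool RanVec = false, RanEpilog = false;
  if (N >= VF1) {                    // iteration_check_1
    while (I + VF1 <= N) I += VF1;   // vector loop, body elided
    RanVec = true;
  }
  // remainder_iteration_check_1 (only after the vector loop) and
  // iteration_check_2 decide whether the epilog vector loop runs.
  if ((!RanVec || N - I != 0) && N - I >= VF2) {
    while (I + VF2 <= N) I += VF2;   // epilog vector loop, body elided
    RanEpilog = true;
  }
  printf("N=%3d: vector=%-3s epilog=%-3s scalar iterations=%d\n", N,
         RanVec ? "yes" : "no", RanEpilog ? "yes" : "no", N - I);
}

int main() {
  const int Ns[] = {7, 8, 15, 24, 40};
  for (int N : Ns)
    trace(N, /*VF1=*/16, /*VF2=*/8);
  return 0;
}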

Why epilog loop vectorization?

There are two kinds of cases where it benefits:

  1. The number of remainder iterations after the original vectorization is large, and there is an opportunity to vectorize them.

    Example: for i8 types, if the VF is 16 and the trip count is 24, epilog vectorization with VF=8 makes perfect sense.
  2. The original trip count itself is small.

    Example: the original vectorization chooses VF = 16, but the trip count is only 8 for i8 types.

We are trying to cover both of the cases in this patch.

On what basis was the order of checks decided?

The order was decided based on profiling information from the current candidates we have in SPEC CPU 2017. We did not find any regressions with the current order.
Also, the current order of checks does not disturb the original vectorization flow even when epilog vectorization is done, except for the epilog loop iteration check (iteration_check_2 in func2()).

Why not re-run the vectorizer?

Short answer:

Re-running the vectorizer is not optimal.

Long answer:

We have the runtime checks in both the vector loop and the epilog vector loop. This is needed because the iteration check for the original VF (VF1 in func2) might fail and branch directly to the epilog loop (EPILOG_LOOP_ENTRY), so the original SCEV and alias checks may never be executed before the epilog vector loop is reached.

There are two optimizations which are done to avoid re-running the checks:

a. optimization_1 in func2(): if any of the SCEV or alias checks fails for the original vector loop, go directly to SCALAR_LOOP (instead of to EPILOG_LOOP_ENTRY, as would happen when re-running the vectorizer).
b. optimization_2 in func2(): if the vectorizer executes the checks and they pass, do not run them again in the epilog vectorizer.

Re-running the vectorizer would not give us access to all these checks in the CFG. That is why the changes are done inside InnerLoopVectorizer. I don't see any existing optimizations eliminating the redundant blocks after blindly re-running the vectorizer. This has been discussed before in the older RFC as well.

This approach doesn't work well when most of the trip counts are large enough for the original vector loop. In fact, it even performs one additional trip count check when both the vector loop and the epilog vector loop are executed. For example, if the original VF=16 and UF=2, and the epilog VF=8 and UF=1, a trip count as small as 40 requires 3 trip count checks, whereas it is 2 in the current implementation.

The relative cost of the extra trip count check is greater when the trip count is small enough to bypass the vector code. Similarly, the relative cost is lower (in proportion to the actual computation of the loop) when the vector code is executed. As a result, it makes more sense to optimize the case where the cost of the extra trip count check matters the most. If we take your example above and consider the case where the trip count is smaller than 8, then the number of trip count checks would be 1 (in the CFG I posted), compared to 2 in this patch.

This optimization is disabled for -Osize. Redundant runtime check blocks can only be avoided when the epilog vector loop trip count check is done first, but that looks like a code-size vs. performance trade-off.

Code size can also have an impact on performance. It's also much harder to model the cost of a code-size increase than it is to model the cost of the compares and branches required for the trip count checks, so I'd strongly suggest we go with the alternative CFG, which avoids generating the redundant runtime checks.

I also agree with @mivnay's summary above and the general approach of just running ILV again on the remainder loop with the available vplan. If we mark the epilogue loop and put it back in the worklist, it'll be harder/uglier to then modify the CFG to make it more optimal. I do, however, think that the implementation can be improved (see my note below). Please also note that the SCEV and runtime checks cannot be avoided by marking them "noalias" (or similar tricks) because if the iteration count of the loop is small enough to by-pass the main vector loop and large enough to execute the vector epilogue, then the runtime checks need to be executed for the epilogue loop. The only way to avoid the redundant runtime checks is to generate the smaller trip count check first, as illustrated in the CFG I've posted above.

I have an alternative implementation with the same general approach, but with a bit more modular design that also avoids the extra runtime checks using the mentioned CFG. I've cleaned it up a little but haven't had time to post a patch. I should have it ready by the end of the week.
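
Roughly, in the style of the func2 example above, the intended layout is (a sketch only; the VF values and the scalar lane loops are illustrative, and checks_pass stands in for the alias/SCEV checks):

#include <stdint.h>

enum { VF1 = 16, VF2 = 8 }; // assumed factors, as in the examples above

void func3(int8_t *A, const int8_t *B, const int8_t *C, int N,
           bool checks_pass) {
  int I = 0;
  // With the smaller (epilogue) trip count check first, the SCEV/alias
  // checks appear exactly once and guard both vector loops.
  if (N < VF2 || !checks_pass)
    goto SCALAR_LOOP;
  // Main vector loop (lanes shown as an inner loop).
  for (; I + VF1 <= N; I += VF1)
    for (int L = 0; L < VF1; ++L)
      A[I + L] = B[I + L] + C[I + L];
  // Epilog vector loop over what remains.
  for (; I + VF2 <= N; I += VF2)
    for (int L = 0; L < VF2; ++L)
      A[I + L] = B[I + L] + C[I + L];
SCALAR_LOOP:
  for (; I < N; ++I)
    A[I] = B[I] + C[I];
}

With this ordering, a trip count too small for even the epilogue loop reaches the scalar loop after a single check, and the runtime checks exist exactly once in the generated code.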

fhahn added a comment.Oct 15 2020, 9:39 AM

I also agree with @mivnay's summary above and the general approach of just running ILV again on the remainder loop with the available vplan. If we mark the epilogue loop and put it back in the worklist, it'll be harder/uglier to then modify the CFG to make it more optimal. I do, however, think that the implementation can be improved (see my note below). Please also note that the SCEV and runtime checks cannot be avoided by marking them "noalias" (or similar tricks) because if the iteration count of the loop is small enough to by-pass the main vector loop and large enough to execute the vector epilogue, then the runtime checks need to be executed for the epilogue loop. The only way to avoid the redundant runtime checks is to generate the smaller trip count check first, as illustrated in the CFG I've posted above.

Right, the approach I suggested should work for the case where we only execute the epilogue if we also execute the main vector loop (currently the runtime checks are independent of the VF AFAIK, and the SCEV checks as well (less sure), but not the minimum iteration check).

But setting things up as in the suggested CFG is going to be a bit more tricky and might not turn out to be much simpler in the end. I might give it a try to see if it's feasible.

This optimization is disabled for -Osize. Redundant runtime check blocks can only be avoided when the epilog vector loop trip count check is done first, but that looks like a code-size vs. performance trade-off.

Code size can also have an impact on performance. It's also much harder to model the cost of a code-size increase than it is to model the cost of the compares and branches required for the trip count checks, so I'd strongly suggest we go with the alternative CFG, which avoids generating the redundant runtime checks.

If we have implementations for both, we could just evaluate which one's better on a large set of benchmarks?

But setting things up as in the suggested CFG is going to be a bit more tricky and might not turn out to be much simpler in the end. I might give it a try to see if it's feasible.

Not sure what you mean by simplicity, as we are trying to generate the most optimal control flow around the loops, but sure, it would be a good idea to try and see whether any problems are uncovered with either of these approaches.

If we have implementations for both, we could just evaluate which one's better on a large set of benchmarks?

The problem is that without an adequate cost model, empirical data may not give us enough information about the optimality of the generated code (or worse, it could send us down the wrong path). The current heuristic is very limited and specific to a particular benchmark in SPEC CPU 2017. I think we should base our decisions on theoretical foundations, develop a cost model, and then do performance verification and tuning using benchmarks and other workloads.

Please see D89566 for the alternative approach.

bmahjour resigned from this revision.Nov 6 2020, 10:28 AM