This is an archive of the discontinued LLVM Phabricator instance.

[LV] Epilogue Vectorization with Optimal Control Flow
ClosedPublic

Authored by bmahjour on Oct 16 2020, 10:30 AM.

Details

Summary

This is yet another attempt at providing support for epilogue vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none and reviews D30247 and D88819.

Similar to D88819, this patch achieves epilogue vectorization by executing a single VPlan twice: once on the main loop and a second time on the epilogue loop (using a different VF). This implementation differs from D88819 at least in the following ways:

  1. It's able to generate the optimal control flow discussed in the above-mentioned RFC by shortening the path length for small trip counts (those that result in all or most of the vector code getting skipped). It also avoids redundant generation of the runtime memory and SCEV checks needed to guard against pointer aliasing. Please refer to the attached image illustrating the generated CFG.
  2. It uses a more modular approach by using the strategy design pattern and extending the InnerLoopVectorizer class.
  3. It can handle loops with multiple induction variables.
  4. It adds more debug traces.

The heuristic for determining when to perform the transform is overly simplistic and needs to be improved in the future. That work is not in the scope of this patch.

Diff Detail

Event Timeline

bmahjour created this revision.Oct 16 2020, 10:30 AM
bmahjour requested review of this revision.Oct 16 2020, 10:30 AM
mivnay added inline comments.Oct 26 2020, 9:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

Why not enable it by default?

926

A high UF might negate the benefit of epilogue vectorization. I think keeping UF = 1 is a good idea unless there are multiple levels of epilogue loop vectorization.

7454

This function and other core functions like createEpilogueVectorizedLoopSkeleton, createInductionResumeValues, etc. contain a lot of redundant code. Can this be improved?

xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

+1

bmahjour added inline comments.Oct 27 2020, 1:29 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

The reason I'm reluctant to enable it by default is that the heuristic for choosing when to vectorize the epilogue (and by what VF) is quite simplistic and tuned to a specific benchmark, without consideration for the cost of extra branches, code size increase, extra register spills, etc.
My thinking is we can move towards the goal of enabling it by default in smaller steps, by first implementing the transformation (this patch), then improving the cost-model along with performance tuning (future work). What do you think?

926

I think so too, but I also thought it would be a good thing to make the UF configurable in case the need arises in the future (e.g. with increasing vector widths and scalable vector types). In this patch, the only EpilogueLoopVectorizationInfo object is created with an epilogue UF of 1.
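For illustration, here is a minimal standalone sketch of the shape of such an info object; the member and constructor names below are assumptions for exposition, not the exact declaration in the patch. It also folds in the assert suggested later in this thread:

  #include <cassert>

  // Hypothetical sketch: the epilogue UF is a constructor parameter, but the
  // only instantiation in this patch passes 1, as discussed above.
  struct EpilogueLoopVectorizationInfo {
    unsigned MainLoopVF, MainLoopUF;
    unsigned EpilogueVF, EpilogueUF;
    EpilogueLoopVectorizationInfo(unsigned MVF, unsigned MUF, unsigned EVF,
                                  unsigned EUF)
        : MainLoopVF(MVF), MainLoopUF(MUF), EpilogueVF(EVF), EpilogueUF(EUF) {
      assert(EUF == 1 &&
             "A high epilogue UF is not expected to be profitable for now");
    }
  };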

7454

Right, there is a bit of code duplication in exchange for decoupling and separation of concerns. There is very little code reuse opportunity from createEpilogueVectorizedLoopSkeleton(), as it already makes calls to common code like emitMinimumIterationCountCheck, createInductionVariable, completeLoopSkeleton, etc. and the rest of the code is inherently different between the IMLAEMainLoop and IMLAEEpilogueLoop.

I agree this function can be avoided and I can think of a way to reuse code in createInductionResumeValues. I'll post an update soon.

bmahjour updated this revision to Diff 301618.EditedOct 29 2020, 7:53 AM

Removed executePlanForEpilogueVectorization, improved code reuse for createInductionResumeValues, and added more test coverage, including a case with a double IV.

bmahjour marked 2 inline comments as done.Oct 29 2020, 7:55 AM
mivnay added inline comments.Oct 29 2020, 9:01 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

Unless you see some regressions, I think we can enable it by default.

926

It would be good to assert UF == 1 I guess.

fhahn added a comment.Nov 2 2020, 5:50 AM

Thanks for sharing and sorry for the delay. I probably won’t be able to take a closer look this week due to vacation & needing to wrap up some internal work, but plan to take a closer look early next week :)

As for whether this should be on/off by default in general, I think it would be good to gather some numbers on some architectures to establish a baseline and make sure there are no surprises.

I have not looked at any of the details here, but a very high level comment is that this isn't very VPlany. If we do want to push things in that direction, then it is at least worth thinking about how this and vplan will co-exist, even if vplan isn't ready for it yet. It would be great to get to a point where all this information is in the vplan and we can compare epilog remainders vs scalar remainders vs whatever else, and come up with a good total cost based on the estimated trip count. Just something to think about.

Also, on a completely different note, I presume this could be expanded to handle predicated remainders too? So that an unpredicated loop would be given a single predicated remainder iteration, as might be useful for SVE.

Also, +1 to "it would be good to gather some numbers"

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9215

I worry this would not work if we have removed the vplan for the different VF's.

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll
7

PowerPC tests go into LoopVectorize/PowerPC (I think, if it exists).

I have not looked at any of the details here, but a very high level comment is that this isn't very VPlany. If we do want to push things in that direction, then it is at least worth thinking about how this and vplan will co-exist, even if vplan isn't ready for it yet. It would be great to get to a point where all this information is in the vplan and we can compare epilog remainders vs scalar remainders vs whatever else, and come up with a good total cost based on the estimated trip count. Just something to think about.

As it stands right now, VPlan can only model control flow inside a loop. Since epilogue vectorization is concerned with control flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of VPlan is to also model the context around subject loops (e.g. the entire loop nest or other code surrounding loops). I wondered whether it's worth delaying this work until that becomes available, but most of the feedback I received was in the direction of: let's get it done now and then do it in VPlan when it's capable of representing the surrounding context.

Also, on a completely different note, I presume this could be expanded to handle predicated remainders too? So that an unpredicated loop would be given a single predicated remainder iteration, as might be useful for SVE.

Absolutely, that can be a nice follow-on to this work. I think any target that supports predicated vector instructions could benefit, especially if the predicated vector instructions perform better than scalar instructions but not as well as non-predicated vector instructions. If predicated and non-predicated vector instructions have similar throughput and latency, then perhaps tail-folding the main loop would be a better fit.

As it stands right now, VPlan can only model control flow inside a loop. Since epilogue vectorization is concerned with control flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of VPlan is to also model the context around subject loops (e.g. the entire loop nest or other code surrounding loops). I wondered whether it's worth delaying this work until that becomes available, but most of the feedback I received was in the direction of: let's get it done now and then do it in VPlan when it's capable of representing the surrounding context.

Yeah. I don't think VPlan should slow this down. The problem is that if no-one pushes on vplan to have those extra features, they will never appear :)

Absolutely, that can be a nice follow-on to this work. I think any target that supports predicated vector instructions could benefit, especially if the predicated vector instructions perform better than scalar instructions but not as well as non-predicated vector instructions. If predicated and non-predicated vector instructions have similar throughput and latency, then perhaps tail-folding the main loop would be a better fit.

I think this can depend upon.. a lot of things. Good to hear it should be possible.

As it stands right now, VPlan can only model control flow inside a loop. Since epilogue vectorization is concerned with control flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of VPlan is to also model the context around subject loops (e.g. the entire loop nest or other code surrounding loops). I wondered whether it's worth delaying this work until that becomes available, but most of the feedback I received was in the direction of: let's get it done now and then do it in VPlan when it's capable of representing the surrounding context.

Yeah. I don't think VPlan should slow this down. The problem is that if no-one pushes on vplan to have those extra features, they will never appear :)

FWIW, I agree with both. The more features we add to the vectoriser, the more difficult it will be for VPlan to catch up. At the same time, I don't think there's an initiative to do this, so waiting for VPlan would be unreasonable. Also, epilogue vectorisation has been an outstanding issue for so long, so it is time it gets addressed.

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

bmahjour updated this revision to Diff 303480.Nov 6 2020, 9:23 AM
bmahjour marked an inline comment as done.

Addressed code review comments + check for optsize and minsize.

bmahjour added inline comments.Nov 6 2020, 9:24 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

I'm not aware of any functional or performance regressions. Performance results for SPEC2017 x264 should match those from D88819. @mivnay would you mind verifying that on X86 and ARM? I can do a perf run on POWER.

9215

Good catch. We need to make sure that the vplan that's chosen for the main loop supports the requested VF for the epilogue loop. I've updated selectEpilogueVectorizationFactor() to check for this.
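As a rough sketch of the kind of guard described here (hasPlanWithVF is a hypothetical helper name used only for illustration; the actual interface in the patch may differ):

  // Sketch only: reject an epilogue VF unless a VPlan built for the main
  // loop can also be executed at that VF, since the same plan is run twice
  // (main loop, then epilogue loop).
  if (!LVP.hasPlanWithVF(EpilogueVF)) { // hypothetical helper
    LLVM_DEBUG(dbgs() << "LEV: no VPlan supports the requested epilogue VF, "
                         "not vectorizing the epilogue.\n");
    return VectorizationFactor::Disabled();
  }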

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll
7

Done.

bmahjour marked 3 inline comments as done.Nov 6 2020, 9:31 AM

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Have you been following the conversations in D88819 and this revision? Blocking work has certainly not been my intention. I just resigned from D88819, in case people would like to move ahead with it. In terms of testing, I've verified that the x264 opportunity can be caught on POWER; however, I do not have access to other architectures, which is why I would appreciate help with testing on other platforms.

I also think "blocking" is not the right terminology here. But I was asking about this because first we had D88819, then came this one, so we have 2 options. In the end it's pretty simple I guess, because we go for the one with the best code-gen, and this patch looks like the one with the most potential. But would be nice to get that confirmed.

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Have you been following the conversations in D88819 and this revision? Blocking work has certainly not been my intention. I just resigned from D88819, in case people would like to move ahead with it. In terms of testing, I've verified that the x264 opportunity can be caught on POWER; however, I do not have access to other architectures, which is why I would appreciate help with testing on other platforms.

I have been following both the reviews....
What are x264 numbers for both the patches on POWER?

I also think "blocking" is not the right terminology here. But I was asking about this because first we had D88819, then came this one, so we have 2 options. In the end it's pretty simple I guess, because we go for the one with the best code-gen, and this patch looks like the one with the most potential. But would be nice to get that confirmed.

On ThunderX2, x264, 1c, rate:
Base: 3.31
With this patch: 3.50
With D88819: 3.60

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Have you been following the conversations in D88819 and this revision? Blocking work has certainly not been my intention. I just resigned from D88819, in case people would like to move ahead with it. In terms of testing, I've verified that the x264 opportunity can be caught on POWER; however, I do not have access to other architectures, which is why I would appreciate help with testing on other platforms.

It would be great if you could be a reviewer of the other patch. Let us run the benchmarks on Graviton/ThunderX2/X86/POWER and see what happens.

mivnay added a comment.Nov 9 2020, 2:27 AM

I also think "blocking" is not the right terminology here. But I was asking about this because first we had D88819, then came this one, so we have 2 options. In the end it's pretty simple I guess, because we go for the one with the best code-gen, and this patch looks like the one with the most potential. But would be nice to get that confirmed.

On ThunderX2, x264, 1c, rate:
Base: 3.31
With this patch: 3.50
With D88819: 3.60

Here are the numbers with -O3 -flto -fuse-ld=lld :

Here are the numbers for SPEC2017 intrate suite on POWER9:

and here are the number of loops that can be transformed by each patch:

SjoerdMeijer added a comment.EditedNov 9 2020, 10:51 AM

My observations here are:

  • Performance of both patches for SPEC is the same. The advantage of D88819 on ThunderX2 might be a coincidence (a knock-on effect, e.g. (loop) alignment) or have some micro-architectural reason.
  • This patch D89566 can handle more loops, but that is not reflected in the SPEC numbers (cold/not executed code?).
  • The beauty of D88819 is that the changes are very minimal, but it looks like it's worth spending the extra lines of code here in D89566 to help vectorise more.

To me, that shows there's more potential with this patch.
I don't think I can be the arbiter in this, so it's best if other reviewers comment too.

No regressions so enable by default?

bmahjour updated this revision to Diff 304012.Nov 9 2020, 4:48 PM

Update test cases for default enablement.

No regressions so enable by default?

I enabled it in my previous update, but forgot to update the test cases. Now the test cases are updated too.

My observations here are:

  • Performance of both patches for SPEC is the same. The advantage of D88819 on ThunderX2 might be a coincidence (a knock-on effect, e.g. (loop) alignment) or have some micro-architectural reason.
  • This patch D89566 can handle more loops, but that is not reflected in the SPEC numbers (cold/not executed code?).
  • The beauty of D88819 is that the changes are very minimal, but it looks like it's worth spending the extra lines of code here in D89566 to help vectorise more.

To me, that shows there's more potential with this patch.
I don't think I can be the arbiter in this, so it's best if other reviewers comment too.

As you pointed out, the reason D89566 causes no additional gain despite transforming more loops could be that the loops are cold or that the main vectorized loop dominates the execution profile. These issues can probably be addressed with an improved cost model or with PGO data. It's important to have the infrastructure in place to transform these loops for when the cost model gets an upgrade.

This is a big change, and here are some notes from my first pass reading through it. Some high-level questions here, and some more questions inline:

  • Think we need a doc update with the new vectorization skeleton? The picture in the description of this change? Don't know how feasible that is, some ascii art too as comments?
  • Difficult to see for me, but are there tests with Minsize?
  • Thanks for running the SPEC numbers! Would it not be too difficult to run the LLVM test suite too? Hopefully that serves two purposes: throw some more code at this to test it, and it should probably trigger in a few cases.
  • Given that there are no test changes in this area, it doesn't look like this is changing the tail-folding decision making; there is no interaction with that? I.e., I haven't checked, but I guess that is performed first as part of the first step, the "normal" inner loop vectorisation. I guess targets that support this have some interesting decision making to do: tail folding or epilogue vectorisation. But that doesn't seem to be a problem of this patch.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
177

nit: for consistency residual -> epilogue?

950

Bikeshedding names:

perhaps InnerMainLoopAndEpilogueVectorizer -> InnerLoopAndEpilogueVectorizer if Main doesn't add much here?

SjoerdMeijer added inline comments.Nov 10 2020, 2:30 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
986

And some more bikeshedding:

Can we make IMLAEMainLoop a bit more readable? Same for IMLAEEpilogueLoop below.

5791

Why is this 16? Do we need to put this "magic constant" behind a target hook? Or calculate this?

llvm/test/Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.ll
7

Just checking: what is PPC specific about this test?

fhahn added a comment.Nov 10 2020, 3:07 AM

As it stands right now, VPlan can only model control-flow inside a loop. Since epilogue vectorization is concerned with control-flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of vplan is to also model the context around subject loops (eg the entire loop nest or other code surrounding loops). I wondered about whether it's worth to delay this work until that becomes available, but most of the feedback I received was in the direction of let's get it done now and then do it in vplan when it's capable of representing surrounding context.

Yeah. I don't think VPlan should slow this down. The problem is that if no-one pushes on vplan to have those extra features, they will never appear :)

FWIW, I agree with both. The more features we add to the vectoriser, the more difficult it will be for VPlan to catch up. At the same time, I don't think there's an initiative to do this, so waiting for VPlan would be unreasonable. Also, epilogue vectorisation has been an outstanding issue for so long, so it is time it gets addressed.

I anticipate that we will be able to eventually model RT check blocks and the CFG around the vector body somehow in VPlan, which should eventually replace the code added here. But that's not going to happen in the near future, so I don't think this should hinder progress here. I think however we should try to structure the code in a way that will not make our life harder than it needs to be going forward.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1030

IIUC we specialize LV so we can override just createEpilogueVectorizedLoopSkeleton and emitMinimumVectorEpilogueIterCountCheck (which is only used by createEpilogueVectorizedLoopSkeleton).

Could we do a more targeted specialization, e.g. by extracting the skeleton creation code into something like LoopSkeletonCreator and have a regular and an epilogue version of that? Or combine it with EpilogueLoopVectorizationInfo?

5717

I think there are some additional test cases needed to cover all code paths in here?

5721

nit: could use any_of, e.g. any_of(L.getHeader()->phis(), [this](PHINode &Phi){ return Legal->isFirstOrderRecurrence(&Phi) || Legal->isReductionVariable(&Phi); }).

5741

nit: use any_of?

bmahjour marked 10 inline comments as done.Nov 11 2020, 7:11 AM

This is a big change, and here are some notes from my first pass reading through it. Some high-level questions here, and some more questions inline:

  • Think we need a doc update with the new vectorization skeleton? The picture in the description of this change? Don't know how feasible that is, some ascii art too as comments?

Sure, I had a similar comment in D88819. I find the textual graph horrendous. I'll create a section under https://llvm.org/docs/Vectorizers.html and put the diagram there.

  • Difficult to see for me, but are there tests with Minsize?

Added optsize and minsize tests.

  • Thanks for running the SPEC numbers! Would it not be too difficult to run the LLVM test suite too? Hopefully that serves two purposes: throw some more code at this to test it, and it should probably trigger in a few cases.

I've verified that the test-suite is functionally clean. The compile-time and performance numbers fluctuate a lot. Is this normal? I reran what appeared to be a few large (20%+) degradations and they were not reproducible. There were improvements of upwards of 14% in some tests, but I think those are fluctuations as well.

  • Given that there are no test changes in this area, it doesn't look like this is changing the tail-folding decision making; there is no interaction with that? I.e., I haven't checked, but I guess that is performed first as part of the first step, the "normal" inner loop vectorisation. I guess targets that support this have some interesting decision making to do: tail folding or epilogue vectorisation. But that doesn't seem to be a problem of this patch.

This does not affect the tail-folding decision. When tail-folding is requested, no scalar loop is generated, so there is no epilogue to vectorize. However, it is possible to tail-fold the epilogue loop. This patch makes it easier to do that in the future.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
950

sure.

986

Are you referring to their names or their implementation? If the former, would EpilogueVectorizerMainLoop and EpilogueVectorizerEpilogueLoop be preferred?

1030

Currently only the skeleton code is specialized, but future enhancements will likely necessitate more specialization (eg to support widened induction vars and live-out phis). We could create a LoopSkeletonCreator, and modify the LoopVectorizationPlanner interfaces to work with it, but I think extending from the InnerLoopVectorizer fits better with the current design. Note that InnerLoopUnroller also takes a similar approach and extends InnerLoopVectorizer while only overriding a couple of functions.

5717

more tests added.

5791

This part is copied from D88819. The value 16 is chosen because it catches the x264 opportunity. As I've been saying all along, we need to replace this with a better cost-model. The cost-modeling work is not in the scope of this patch.

llvm/test/Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.ll
7

The triple is used to make sure ILV finds the loop profitable and chooses a default VF, triggering vectorization of the main loop and its epilogue. Without this we'd have to force a vectorization factor (through hints) for the main loop; however, this patch does not vectorize epilogue loops that have user hints on them. It's debatable whether we should vectorize epilogues for hinted loops or not, or whether we need a new pragma, etc. It should be fairly easy to implement whatever we agree on, so I think we should leave those discussions for later and focus on the codegen in this patch.

bmahjour updated this revision to Diff 304511.Nov 11 2020, 7:13 AM
bmahjour marked 6 inline comments as done.

Addressed more comments.

bmahjour updated this revision to Diff 304523.Nov 11 2020, 8:14 AM

Update llvm docs.

One more round of high-level remarks before I look at some more details.

  • Thanks for running the SPEC numbers! Would it not be too difficult to run the LLVM test suite too? Hopefully that serves two purposes: throw some more code at this to test it, and it should probably trigger in a few cases.

I've verified that the test-suite is functionally clean. The compile-time and performance numbers fluctuate a lot. Is this normal? I reran what appeared to be a few large (20%+) degradations and they were not reproducible. There were improvements of upwards of 14% in some tests, but I think those are fluctuations as well.

Yeah, that could be the case. There are some (micro)benchmarks with a very small execution time, so more susceptible to fluctuations. Have you tried the CTMark subset and TEST_SUITE_BENCHMARKING_ONLY variable? Anyway, if it is too noisy, it was at least a good testing exercise.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
986

Cheers, much more readable IMHO.

5791

I think that's fine, but we should at least hide this 16 behind a TTI hook that e.g. defaults to 16, which should be a separate patch.

7561

Do we want to refer in a comment here to the doc and skeleton that you've added?

bmahjour updated this revision to Diff 304862.Nov 12 2020, 9:01 AM
bmahjour marked 3 inline comments as done.

Add link to the doc in comments.

Yeah, that could be the case. There are some (micro)benchmarks with a very small execution time, so more susceptible to fluctuations. Have you tried the CTMark subset and TEST_SUITE_BENCHMARKING_ONLY variable? Anyway, if it is too noisy, it was at least a good testing exercise.

I've tried both CTMark and TEST_SUITE_BENCHMARKING_ONLY.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5791

A reasonable cost function can probably be developed with the existing TTI hooks, so a new one may actually not be necessary. I agree to deferring this to a separate patch.

7561

Done.

SjoerdMeijer added inline comments.Nov 16 2020, 2:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5791

Okay, but can we stub it out for now? If we don't need a new TTI function, can we create a new helper function and at least sketch a way this could be calculated? It can still just default to returning 16, but I just don't think we should keep the 16 hard-coded here.

bmahjour updated this revision to Diff 305547.Nov 16 2020, 10:09 AM
bmahjour marked an inline comment as done.
bmahjour added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5791

Ok, sure. I've added an option to make it configurable too.
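For reference, a hedged sketch of what such knobs can look like in LoopVectorize.cpp; the option strings, defaults, and descriptions below are assumptions for illustration (the forced-VF option name EpilogueVectorizationForceVF is the one referenced in the later comments):

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Illustrative only: a knob replacing the hard-coded 16, plus a force
  // option for testing; exact names/defaults may differ from the patch.
  static cl::opt<unsigned> EpilogueVectorizationMinVF(
      "epilogue-vectorization-minimum-VF", cl::init(16), cl::Hidden,
      cl::desc("Only consider epilogue vectorization when the main loop VF "
               "is at least this large."));

  static cl::opt<unsigned> EpilogueVectorizationForceVF(
      "epilogue-vectorization-force-VF", cl::init(1), cl::Hidden,
      cl::desc("A value greater than 1 forces this VF for the epilogue "
               "loop (intended for testing only)."));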

SjoerdMeijer added inline comments.Nov 17 2020, 7:01 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
185

Think we need a test-case for it too, perhaps one that is not the default value 16.

5727

I was hoping, and seem to remember, that there was a LoopUtils helper for this. I had a quick look; perhaps findDefsUsedOutsideOfLoop could be useful, or there was another function here in the vectoriser that was doing the same/similar things for reductions. Perhaps have a look if there's something we can reuse.

5798

I am wondering which option should win: e.g. -Os or EpilogueVectorizationForceVF? I am guessing that is the forced epilogue vectorisation? Should the opt size check be moved?

5800

nit: don't think we need the curly brackets here in LLVM_DEBUG.

5805

same here

5821

nit: != ?

5822

nit: curly bracket

9210

false is for PreserveLCSSA, but do we want that to be true, and does that make formLCSSARecursively redundant? Not sure if I am surprised that this is necessary at all....

nit: I think it should be at least false /* PreserveLCSSA */

bmahjour updated this revision to Diff 307441.Nov 24 2020, 1:44 PM
bmahjour marked 6 inline comments as done.

Rebased, addressed comments, and added more tests (including one for the limited case of live-out supported).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
185

test added

5727

I was trying to match the conditions in fixupIVUsers() and added early exits to make sure we don't waste compile-time collecting information that will be thrown away. Maybe we can add a flag to findDefsUsedOutsideOfLoop to stop collecting instructions as soon as one is found, but then we'll have a new problem in that the checks in this code would be different from what's in fixupIVUsers().

5798

I can see arguments for it either way, but after thinking more about it, I believe it makes it easier to reason about the behaviour of the loop vectorizer when the forced VF wins over minsize. I'll change it.
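A small standalone sketch of the agreed ordering (all names and the fallback heuristic below are illustrative placeholders, not the patch's actual interfaces):

  #include <optional>

  std::optional<unsigned> pickEpilogueVF(unsigned MainLoopVF, bool HasOptSize,
                                         unsigned ForcedEpilogueVF) {
    // An explicitly forced epilogue VF wins over the size-optimization check.
    if (ForcedEpilogueVF > 1)
      return ForcedEpilogueVF;
    // Otherwise optsize/minsize disables epilogue vectorization.
    if (HasOptSize)
      return std::nullopt;
    // Finally the interim heuristic applies (placeholder: require a main
    // loop VF of at least 16 and use half of it for the epilogue).
    if (MainLoopVF >= 16)
      return MainLoopVF / 2;
    return std::nullopt;
  }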

5800

I have developed a habit of adding them because I find it makes the enclosed code more friendly to clang-format. I'll remove the brackets here since they don't make much of a difference to the formatting in this case.

5821

Currently VectorizationFactor only overrides operator==, not operator!=. I'll provide an override and change this to !=.
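For clarity, a minimal standalone sketch of that addition (member names are illustrative):

  struct VectorizationFactor {
    unsigned Width; // vectorization factor
    unsigned Cost;  // estimated cost
    bool operator==(const VectorizationFactor &RHS) const {
      return Width == RHS.Width && Cost == RHS.Cost;
    }
    // The override discussed above: define != in terms of the existing ==.
    bool operator!=(const VectorizationFactor &RHS) const {
      return !(*this == RHS);
    }
  };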

9210

I think we could get away without calling formLCSSARecursively for now and set PreserveLCSSA to true for the limited cases of live-out values supported (see newly added Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.liveout.ll). However in the future if we add support for more live-outs (eg. reductions and first-order recs), I worry that we may be in a state at this point in the code where the original loop is temporarily non-LCSSA which would break simplifyLoop's assumptions if we set PreserveLCSSA to true. Note that when PreserveLCSSA is true, simplifyLoop assumes that the loop is already in LCSSA form.
Maybe I worry too much about it, but I think the current way is a bit more future-proof.
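In code, the comment suggestion and the reasoning above amount to roughly the following (a sketch; the surrounding argument lists are assumptions and may differ between LLVM versions):

  // Keep PreserveLCSSA=false here and restore LCSSA explicitly afterwards,
  // since future live-out support could leave the original loop temporarily
  // non-LCSSA at this point.
  simplifyLoop(OrigLoop, DT, LI, SE, AC, /*MSSAU=*/nullptr,
               /*PreserveLCSSA=*/false);
  formLCSSARecursively(*OrigLoop, *DT, LI, SE);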

fhahn added inline comments.Nov 25 2020, 1:34 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1030

Currently only the skeleton code is specialized, but future enhancements will likely necessitate more specialization (eg to support widened induction vars and live-out phis).

I thought a bit on how to support those cases. IIUC the problem here is mostly setting up the right incoming values for the PHIs in the epilogue loop. Given that this is related to codegen, I *think* we should be able to represent all the required info in the VPlan, i.e. we should be able to adjust the VPlan of the epilogue vector loop to use the correct start values naturally.

We could create a LoopSkeletonCreator, and modify the LoopVectorizationPlanner interfaces to work with it, but I think extending from the InnerLoopVectorizer fits better with the current design. Note that InnerLoopUnroller also takes a similar approach and extends InnerLoopVectorizer while only overriding a couple of functions.

My concern here is that this approach invites more modifications in the sub-classes, instead of modeling the required information directly in VPlan. I think this will lead to more work down the road, as we move more pieces into VPlan. I don't think InnerLoopUnroller is an ideal example here, because IIRC it was created before VPlan. I think working on modeling the incoming values in VPlan would be better aligned with the long term goals and lead to a more modular solution overall.

To illustrate this approach, I went ahead and tried to sketch support for widened induction using VPlan: D92132. It is a bit rough around the edges, but should give an idea of the required pieces. I'm working on a similar patch for reductions as well and we should be able to handle first-order recurrences in a similar fashion.

SjoerdMeijer accepted this revision.Nov 26 2020, 12:54 AM

Thanks for working on this, I am happy with this patch.

The way I understand it is that we want to make this VPlan-future-proof, not make any VPlan work more difficult than necessary. Looking at D92132, that doesn't seem to be the case to me. I.e., even though that patch is work-in-progress, if it is representative of what is needed to achieve that, we don't have anything to worry about. So, I would be happy if this patch lands, provided we also progress D92132 (and friends). But @fhahn can correct me if I am wrong here.

This revision is now accepted and ready to land.Nov 26 2020, 12:54 AM
fhahn added a comment.Nov 26 2020, 1:37 AM

Thanks for working on this, I am happy with this patch.

The way I understand it is that we want to make this VPlan-future-proof, not make any VPlan work more difficult than necessary. Looking at D92132, that doesn't seem to be the case to me. I.e., even though that patch is work-in-progress, if it is representative of what is needed to achieve that, we don't have anything to worry about. So, I would be happy if this patch lands, provided we also progress D92132 (and friends). But @fhahn can correct me if I am wrong here.

The point I tried to highlight with the patch is that I think we can solve the limitations without subclassing/customising code other than the skeleton creation. In that case, I think it would be preferable to go with the more targeted approach discussed (only providing a custom ‘skeleton’ creator). I outlined the potential drawbacks I see inline, but the main one is that it encourages adding more codegen specialisations that we have to untangle again later.

As mentioned earlier, I don’t think it is worth holding things up due to VPlan, but I think if we have a clear and short path towards addressing the limitations using VPlan, we should do that and avoid broad subclassing.

I might be missing some other cases where full subclassing might be needed. I think that’s something worth further discussing before committing.

On an unrelated note, I think it would be good to have some target-independent tests for this or for some additional targets, so this gets wide coverage on the public bots (some of which might not enable the PPC target)

Thank you for your reviews @SjoerdMeijer @fhahn .

I totally see the benefit of D92129 as you illustrated in D92132 and I think it can be used to replace things like fixupIVUsers() which would also make it easier to extend epilogue vectorization. As for broader subclassing, it shouldn't prevent us from extending it in a way that uses VPlan. If specialization at the subclass level is not necessary (ie there is code reuse opportunity) then we should still implement it in the superclasses and inherit it in the subclasses. If specialization is not applicable to any other part of the transform then it will happen at the subclass level where it belongs, whether using VPlan or not.

Having said that there is a timing aspect. I see your point that if we tried to extend epilogue vectorization now via specialization of parts of the codegen, that don't use VPlan currently, then we'd have to change more places in the future. From that perspective a more targeted specialization would help by making it harder to extend epilogue vectorization without having improved VPlan. On the other hand, some improvements to the epilogue vectorization would become contingent on improving VPlan (which is not an issue for me, but others may disagree).

Having a skeleton creator class would require some refactoring of InnerLoopVectorizer and LoopVectorizationPlanner, so it would warrant having a separate patch for it. I can do it after this one lands.

BTW, I found a trick to make some of the tests target independent and expand the coverage. Please see llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization*.

bmahjour updated this revision to Diff 307915.Nov 26 2020, 12:46 PM
bmahjour updated this revision to Diff 307918.Nov 26 2020, 12:58 PM
bmahjour marked 3 inline comments as done.

Forgot to remove target triple and attributes from target independent tests. They're fixed now.

dmgreen added inline comments.Nov 27 2020, 4:27 AM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

I was surprised to see an MVE test like this choose to try to epilogue vectorize. I had presumed that would not happen on MVE - we only have a single vector width with no interleaving - and the benefit of doing a single <8 x i8> iteration after a <16 x i8> main loop is not going to be worth the additional branching/setup we have to do, unfortunately. I ran some extra tests and added a mve-qabs.ll test, where again the <16 x i8> loop is getting a remainder where it isn't beneficial.

I don't believe that MVE is a vector target that would ever benefit from epilogue vectorization, unfortunately. Can we get some sort of target hook that allows us to disable it? Perhaps something that sets a maximum epilogue vectorization factor given a VF * UF main loop? That would allow us to set it to none, whilst others tune it for their needs, like possibly always having the fallback be a 64-bit vector under AArch64 (just a thought, not sure if that's the best idea or not, but it at least allows targets to tune things).

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll
11

-> triple

bmahjour added inline comments.Nov 27 2020, 7:37 AM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

I ran some extra tests and added a mve-qabs.ll test, where again the <16 x i8> loop is getting a remainder where it isn't beneficial.

Is it degrading performance or just not beneficial (harmless)? As I mentioned before, the heuristic in this patch is not very good, but putting the cost modeling in the critical path for getting the codegen implemented is also not desirable. I had suggested disabling this transformation by default until a proper cost model is implemented, but some people disagreed.

In order to come up with a meaningful target hook, it would be helpful to know what machine characteristics of MVE cause epilogue vectorization to not be beneficial. Are there existing TTI hooks that we can use (e.g. getMaxInterleaveFactor() > 1)?

dmgreen added inline comments.Nov 27 2020, 9:09 AM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

Is it degrading performance or just not beneficial (harmless)?

Degrading performance unfortunately. It doesn't happen in a lot of tests, but it was between a 0% and 25% decrease, apparently. The M in MVE stands for microcontroller (umm, I think), so it can be somewhat constrained and can be especially hurt by inefficient codegen, that would not be as bad on other cores/vector architectures.

The max interleaving factor will be 1 for any MVE target. They also only have 128-bit vectors, with no 64-bit wide vectors as would be present in NEON. Essentially that means that for i8 we would either vectorize x16 (which is excellent), or x8 using extends if we can't do x16. The x8 would be beneficial on its own with enough iterations I think, but doing a single iteration at x8 does not overcome the additional cost from outside the vector loops.

Using getMaxInterleaveFactor to limit this for the moment would work for me. I have no strong opinions on enabling this by default or not, but you may want the very initial commit to default to false with a commit soon after to enable it. That way if someone does revert this, at least they are only reverting the flipping of the switch and not the whole patch.

bmahjour updated this revision to Diff 308114.Nov 27 2020, 2:18 PM
bmahjour marked an inline comment as done.
bmahjour added inline comments.Nov 27 2020, 2:19 PM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

Using getMaxInterleaveFactor to limit this for the moment would work for me.

Ok, I'll use that then.
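A hedged sketch of what that gating can look like (the helper name below is illustrative, not necessarily the hook used in the final patch):

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  // Illustrative only: targets that cannot interleave (max interleave
  // factor of 1, e.g. MVE) opt out of epilogue vectorization.
  static bool preferEpilogueVectorization(const TargetTransformInfo &TTI,
                                          unsigned MainLoopVF) {
    return TTI.getMaxInterleaveFactor(MainLoopVF) > 1;
  }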

I have no strong opinions on enabling this by default or not, but you may want the very initial commit to default to false with a commit soon after to enable it. That way if someone does revert this, at least they are only reverting the flipping of the switch and not the whole patch.

Great suggestion! I'll do it that way.

This revision was landed with ongoing or failed builds.Dec 1 2020, 9:05 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Dec 1 2020, 9:11 AM

Thank you for your reviews @SjoerdMeijer @fhahn .

I totally see the benefit of D92129 as you illustrated in D92132 and I think it can be used to replace things like fixupIVUsers() which would also make it easier to extend epilogue vectorization. As for broader subclassing, it shouldn't prevent us from extending it in a way that uses VPlan. If specialization at the subclass level is not necessary (ie there is code reuse opportunity) then we should still implement it in the superclasses and inherit it in the subclasses. If specialization is not applicable to any other part of the transform then it will happen at the subclass level where it belongs, whether using VPlan or not.

Having said that there is a timing aspect. I see your point that if we tried to extend epilogue vectorization now via specialization of parts of the codegen, that don't use VPlan currently, then we'd have to change more places in the future. From that perspective a more targeted specialization would help by making it harder to extend epilogue vectorization without having improved VPlan. On the other hand, some improvements to the epilogue vectorization would become contingent on improving VPlan (which is not an issue for me, but others may disagree).

Having a skeleton creator class would require some refactoring of InnerLoopVectorizer and LoopVectorizationPlanner, so it would warrant having a separate patch for it. I can do it after this one lands.

OK great. It sounds like there's agreement on the further direction overall and we can work on that in-tree. I'll work on getting the pieces ready to handle live-ins/live-outs in VPlan as required. As for the skeleton creator, I might be able to take a look over the next week.

LGTM, thanks (meant to respond a bit earlier today :)

MaskRay added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
159

This variable is mutable.

Please use constexpr char VerboseDebug[] or const char VerboseDebug[]. Namespace-scoped const variables are already internal, no need for static

Please also use #ifndef NDEBUG to avoid -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off builds
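Concretely, the suggestion amounts to something like this (shown as an illustrative sketch; DEBUG_TYPE is the pass's debug name defined in this file):

  #ifndef NDEBUG
  // Internal by default at namespace scope; guarded so builds without
  // assertions do not emit -Wunused-variable.
  const char VerboseDebug[] = DEBUG_TYPE "-verbose";
  #endif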

OK great. It sounds like there's agreement on the further direction overall and we can work on that in-tree. I'll work on getting the pieces ready to handle live-ins/live-outs in VPlan as required. As for the skeleton creator, I might be able to take a look over the next week.

LGTM, thanks (meant to respond a bit earlier today :)

Sounds good. If I get a chance to work on the skeleton builder I'll let you know beforehand to make sure we don't work on it at the same time. Otherwise I'd be happy to help with the review. Thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
159

Right, made the changes in the latest commit. Thanks!

MaskRay added inline comments.Dec 2 2020, 11:15 AM
llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
1

debug-only tests are rare. If you really want them, please add REQUIRES: asserts (which has kindly been fixed by RKSimon)

nikic added a subscriber: nikic.Dec 7 2020, 1:05 PM

Should this really be enabled by default given the current (lack of) cost modelling? Primarily concerned about the code size regression this causes.

Should this really be enabled by default given the current (lack of) cost modelling? Primarily concerned about the code size regression this causes.

Please note that functions with optsize and minsize attributes are not affected. I'm ok with turning it back off (that was my preference as well), but curious about what has regressed (since that information can help with cost-modeling). Any comments @mivnay @xbolva00 ?

mivnay added a comment.Dec 8 2020, 2:12 AM

Should this really be enabled by default given the current (lack of) cost modelling? Primarily concerned about the code size regression this causes.

Please note that functions with optsize and minsize attributes are not affected. I'm ok with turning it back off (that was my preference as well), but curious about what has regressed (since that information can help with cost-modeling). Any comments @mivnay @xbolva00 ?

@bmahjour Thanks for the patch. I will run the experiments and get back to you...