This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

llvm/trunk/
  include/llvm/Analysis/VectorUtils.h
  include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
  lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
  lib/Transforms/Vectorize/LoopVectorize.cpp
  lib/Transforms/Vectorize/VPlan.h
  lib/Transforms/Vectorize/VPlan.cpp
  test/Transforms/LoopVectorize/X86/optsize.ll
  test/Transforms/LoopVectorize/X86/small-size.ll
  test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll

Differential D50480

[LV] Vectorizing loops of arbitrary trip count without remainder under opt for size
ClosedPublic

Authored by Ayal on Aug 8 2018, 3:11 PM.

Details

Reviewers

mkuper
hsaito
dcaballe
fhahn
rengolin
hfinkel

Commits

rGb0b5312e677c: [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt…
rL344743: [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt…

Summary

When optimizing for size, a loop is vectorized only if the resulting vector loop completely replaces the original scalar loop. This holds if no runtime guards are needed, if the original trip-count TC does not overflow, if TC is a known constant and if TC is a multiple of the VF. Targets with efficient vector masking can thereby overcome the last three TC-related conditions: see “Direction #1” in [[ http://lists.llvm.org/pipermail/llvm-dev/2018-August/125042.html | [llvm-dev] Vectorizing remainder loop ]] - this patch applies that transformation of setting the trip-count of the vector loop to be TC rounded-up to a multiple of VF while masking the vector body under a newly introduced "if (i < TC)" condition; or rather "if (i <= TC-1)" to overcome the aforementioned overflow hazard.
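
As a rough illustration of the transformation described in the summary, the following C++ sketch emulates the masked vector loop with scalar code. The function name `addOneFoldedTail`, the loop body, and the inner loop standing in for vector lanes are assumptions made for illustration; this is not the code the patch generates.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar emulation of a vector loop whose tail is folded by masking.
// VF is a hypothetical vector width; the inner loop stands in for the lanes
// of one vector iteration. Every lane is guarded by the mask i <= TC - 1,
// so no scalar remainder loop is needed.
std::vector<int> addOneFoldedTail(const std::vector<int> &In, std::uint64_t VF) {
  assert(VF != 0 && (VF & (VF - 1)) == 0 && "VF must be a power of two");
  const std::uint64_t TC = In.size();
  std::vector<int> Out(In.size(), 0);
  if (TC == 0)
    return Out; // Guard: TC - 1 below would wrap for an empty loop.
  const std::uint64_t BTC = TC - 1; // backedge-taken count
  // TC rounded up to a multiple of VF (assumes TC + VF - 1 does not wrap).
  const std::uint64_t VecTC = (TC + VF - 1) / VF * VF;
  for (std::uint64_t Base = 0; Base < VecTC; Base += VF) {
    for (std::uint64_t Lane = 0; Lane < VF; ++Lane) {
      const std::uint64_t I = Base + Lane;
      if (I <= BTC) // per-lane mask; disabled lanes touch no memory
        Out[I] = In[I] + 1;
    }
  }
  return Out;
}
```

With five elements and VF = 4, the emulated vector loop runs two iterations, and the last three lanes of the second iteration are masked off.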

The patch allows loops with arbitrary trip counts to be vectorized under -Os, subject to the existing cost model considerations. It also applies to loops with small trip counts (under -O2) which are currently handled as if under -Os.

Handling of loops with reductions and live-outs is marked as a TODO for subsequent extensions.

Diff Detail

Repository: rL LLVM

Event Timeline

Ayal created this revision. Aug 8 2018, 3:11 PM

Herald added subscribers: llvm-commits, rogfer01. Aug 8 2018, 3:11 PM

hsaito added inline comments. Aug 8 2018, 5:15 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2663 ↗	(On Diff #159798)	I think there is a danger in assuming UF being a power of two. Granted that there may be other parts of LV already assuming it, I still wouldn't like to see any more of those being added. If new code is assuming power of two UF, it's best if we ensure that is really the case (e.g., when foldTailByMasking() is true, assert that UF is power of two). Better yet, check VF*UF is power of two here since that's the assumption this code has.
2673 ↗	(On Diff #159798)	This Urem creation should be skipped if we aren't generating remainder.
4977 ↗	(On Diff #159798)	I think we need to add if (TC==0) { emit one kind of remark } else { emit another kind of remark } here ---- in order to match previous capability.

Ayal mentioned this in D50474: [LV] Vectorize header phis that feed from if-convertable latch phis. Aug 9 2018, 3:27 PM

sguggill added a subscriber: sguggill. Aug 10 2018, 3:33 PM

Ayal added inline comments. Aug 11 2018, 2:06 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2663 ↗	(On Diff #159798)	OK, will add an assert that VF*UF is a power of 2 below under the `if (Legal->foldTailByMasking())`.
2673 ↗	(On Diff #159798)	This Urem is also used to round N up to a multiple of Step, i.e., when we're not generating remainder.
4977 ↗	(On Diff #159798)	OK, will retain the previous MissedAnalysis remarks here, in addition to the new ones supplied by `canFoldTailByMasking()`.

Thanks, Ayal! Some comments below.
Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

Thanks,
Diego

lib/Transforms/Vectorize/LoopVectorize.cpp
4965 ↗	(On Diff #159798)	I'm trying to understand the purpose of this check. Is it to prevent masked vectorization if TC is lower than `TinyTripCountInterleaveThreshold` (i.e., 128)? Should we use an independent threshold for this?
5218 ↗	(On Diff #159798)	inwhich -> in which?
6355 ↗	(On Diff #159798)	Just curious: could we prevent the computation of interleave groups for these cases instead of doing a reset?
lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first?
1126 ↗	(On Diff #159798)	Instead of using an "empty" VPValue to model the BTC, would it be possible to model the actual operations to compute the BTC? We would only need a sub, right?

In D50480#1196699, @dcaballe wrote:

Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

You should know better

lib/Transforms/Vectorize/LoopVectorize.cpp
4965 ↗	(On Diff #159798)	Ah, this is wrong, good catch! The original purpose (of `TinyTripCountVectorThreshold` rather than `TinyTripCountInterleaveThreshold`) was to prevent vectorization of loops with very short trip counts due to overheads. Later it was extended in r306803 to allow vectorization under OptForSize, as it implies that all iterations are concentrated inside the vector loop for more accurate cost estimation. This still holds when folding the tail by masking, so we should not bail out here.
5218 ↗	(On Diff #159798)	ok
6355 ↗	(On Diff #159798)	That would have been simpler indeed. But there's a subtle phase-ordering issue here: `MaxVF=computeFeasibleMaxVF()` uses tentative interleave groups to `getSmallestAndWidestTypes()`, and is then used in determining if the tail should be folded by masking (i.e., if TC is a multiple of MaxVF), in which case these groups will all be masked/invalid.
lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	An alternative that leverages the `Instruction::ICmp` opcode and the existing `ICmpInst` subclasses to keep the Predicate, in a scalable way, could be (devised jointly w/ Gil):
+ // Introduce the early-exit compare IV <= BTC to form header block mask.
+ // This is used instead of IV < TC because TC may wrap, unlike BTC.
+ VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
+ VPValue *BTC = Plan->getBackedgeTakenCount();
+ Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
+ auto *ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
+ Plan->addDetachedValue(ICmp);
+ BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
  return BlockMaskCache[BB] = BlockMask;
and then have `VPInstruction::generateInstruction()` do
+ case Instruction::ICmp: {
+   Value *IV = State.get(getOperand(0), Part);
+   Value *TC = State.get(getOperand(1), Part);
+   auto *ICmp = cast<ICmpInst>(getUnderlyingValue());
+   Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
+   State.set(this, V, Part);
+   break;
+ }
where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.
1126 ↗	(On Diff #159798)	The BTC is computed by subtracting 1 from the Trip Count, which in turn is generated by SCEVExpander. To model this decrement would require using an "empty" VPValue to model its Trip Count operand. In any case, both involve scalar instructions that take place before the vectorized loop, currently outside the VPlan'd zone.
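
The reason for comparing IV <= BTC rather than IV < TC can be seen in a small C++ sketch. It assumes, purely to make the wrap-around visible, an 8-bit induction variable and a loop of 256 iterations; the names `MaskCounts` and `countEnabledLanes` are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Number of iterations (out of 0..255) enabled by each form of header mask.
struct MaskCounts {
  unsigned UsingTC;  // lanes enabled by IV <  TC
  unsigned UsingBTC; // lanes enabled by IV <= BTC
};

MaskCounts countEnabledLanes() {
  // 256 iterations: the backedge-taken count (255) fits in 8 bits,
  // but the trip count (256) wraps around to 0.
  const std::uint8_t BTC = 255;
  const std::uint8_t TC = static_cast<std::uint8_t>(BTC + 1);
  MaskCounts C{0, 0};
  for (unsigned I = 0; I < 256; ++I) {
    const std::uint8_t IV = static_cast<std::uint8_t>(I);
    if (IV < TC)
      ++C.UsingTC; // never taken, since TC wrapped to 0
    if (IV <= BTC)
      ++C.UsingBTC; // taken for every iteration
  }
  return C;
}
```

In this setting the IV < TC mask disables every lane, while the IV <= BTC mask enables all 256 iterations as intended.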

I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
  if (unlikely(i > M))
    break;
  sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are there any specific plans to do so?

Secondly, are there any plans to enable this approach for anything other than optsize?

test/Transforms/LoopVectorize/X86/optsize.ll
12 ↗	(On Diff #159798)	Testing-wise, expanding out the IR generated w/update-lit-checks, landing the tests without the changes, and then rebasing on top would make it much easier to follow the transform being described for those of us not already expert in the vectorizer code structures. I get that you're following existing practice, but this might be one of the cases which justify changing existing practice in the area. :)

hsaito added inline comments. Aug 14 2018, 3:23 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2673 ↗	(On Diff #159798)	Ouch. Well, given the assertion for VF*UF being power of two (constant), the UREM and other computation should be reasonably optimizable downstream. So, it's probably unfair to ask you to fix the trip count computation ---- so, I won't ask. There is a trade off between generating more optimal output IR and the cost of maintaining the code to do that. Keeping UREM here is opting for lower maintenance. Just for the record.

reames added inline comments. Aug 14 2018, 3:25 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
4948 ↗	(On Diff #159798)	There's a mix of seemingly unrelated changes here. This is one example. It would be good to land these separately.

In D50480#1199900, @reames wrote:
I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
  if (unlikely(i > M))
    break;
  sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are there any specific plans to do so?

Short answer is No.

From vectorizer perspective, mechanics is quite different. In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit", to handle this situation in somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it. There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter. Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Having said that, making VPlan more powerful (like adding a new IF) certainly helps lead to the ability to model the early_exit situation within VPlan eventually. From that perspective, it's a baby step forward.

From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

Secondly, are there any plans to enable this approach for anything other than optsize?

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream come true ---- smaller code, faster compile, and faster execution. Looking forward to hearing such great news.

hsaito added inline comments. Aug 14 2018, 5:00 PM

lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator? The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that. My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?
1126 ↗	(On Diff #159798)	I'm not a big fan of allocating memory that goes unused in many situations. We can initialize this to nullptr, and create an instance once we know BTC is needed. That'll lose the convenience of being able to check NumUsers, but creating needsBackedgeTakenCount() member function shouldn't be that bad. It's just Legal->foldTailByMasking(), until something else needs BTC, right?

dcaballe added inline comments. Aug 14 2018, 6:33 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6355 ↗	(On Diff #159798)	Thanks!
lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places. "My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?" The closer the class hierarchies are, the easier it will be.

hsaito added inline comments. Aug 15 2018, 11:20 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4948 ↗	(On Diff #159798)	This change is relevant in the sense that TC < 2 is split into two parts: TC==1 and TC==0. TC==0 case will then have a chance of hitting Legal->canFoldTailByMasking() later. As a result, TC==1 case can return early here, with a very crisp messaging. Having said that, if you'd like to see the same ORE->emit(...) LLVM_DEBUG() stuff here, I won't go against that. Messaging change can be a separate commit. Ayal, we need ORE->emit() here, in addition to LLVM_DEBUG(), right, regardless of whether we change the actual message?

In D50480#1200014, @hsaito wrote:
In D50480#1199900, @reames wrote:
I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
  if (unlikely(i > M))
    break;
  sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are there any specific plans to do so?
Short answer is No.

From vectorizer perspective, mechanics is quite different.

Ok, I think we're talking past each other a bit. I see these both as forms of predication. It sounds like you have a slightly different view; I'll try to ask clarifying questions in the right spots. I think we have different mental models here and I'm trying to understand where that difference is.

In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit", to handle this situation in somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it.

I'm unfamiliar with this pragma, but the best reference I found was https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-simd-directive-openmp-api

From what I can tell, this provides user guarantees of a couple of legality checks and profitability checks. I don't know enough about openmp to completely follow all the wording, but the key bit appears to be this:
"Each operation before the last lexical early exit of the loop may be executed as if the early exit were not triggered within the SIMD chunk."

We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve. There are two obvious approaches: speculation safety and predication. Unless I'm misreading this patch, it has the same problem and uses predication right?

There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right? If there's not, then we don't need to distinguish which iteration "caused" the exit. This is actually an interesting and useful subcase for me.

Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I'm missing something, this is a restatement of the above, right?

I agree that cases like a[i] >0 are the hard ones. Other examples are things like i < M for loop invariant M. Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

I am very specifically not interested in the language extension aspects. I'm specifically asking about doing the transform for unannotated C code. (i.e. having to prove all the legality the hard way)

Secondly, are there any plans to enable this approach for anything other than optsize?

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream come true ---- smaller code, faster compile, and faster execution. Looking forward to hearing such great news.

I take it you don't see AVX512 as qualifying? Not surprised, but I'd be curious to hear your reasoning. You might be coming at this from a different angle than I am.

In D50480#1199900, @reames wrote:
I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
  if (unlikely(i > M))
    break;
  sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are there any specific plans to do so?

Good question! Replacing the break with a continue vectorizes just fine and produces the same result, albeit spinning uselessly for the last N-M iterations. Dealing with such "breaks" directly deserves more thought :-). In general it's probably better to fold such two upper bounds into one = min(N,M+1), producing a countable unpredicated loop. This is a known optimization for OpenCL1.x kernels, often guarded with "if (get_global_id(0) > M) continue;" due to work_group size constraints, when compiled for CPU.
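
A small C++ sketch of the three equivalent forms mentioned in this reply: the original `break`, the `continue` variant, and the folded upper bound min(N, M+1). The function names are hypothetical and the array-summing body is taken from the example loop; the sketch assumes M + 1 does not overflow and that N does not exceed the array length.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Original form: leave the loop as soon as i exceeds M.
int sumWithBreak(const std::vector<int> &A, int N, int M) {
  int Sum = 0;
  for (int I = 0; I < N; ++I) {
    if (I > M)
      break;
    Sum += A[I];
  }
  return Sum;
}

// Same result with 'continue': the remaining iterations do no work.
int sumWithContinue(const std::vector<int> &A, int N, int M) {
  int Sum = 0;
  for (int I = 0; I < N; ++I) {
    if (I > M)
      continue;
    Sum += A[I];
  }
  return Sum;
}

// Both upper bounds folded into one, giving a countable unpredicated loop.
int sumWithFoldedBound(const std::vector<int> &A, int N, int M) {
  int Sum = 0;
  const int End = std::min(N, M + 1); // assumes M + 1 does not overflow
  for (int I = 0; I < End; ++I)
    Sum += A[I];
  return Sum;
}
```

All three return the same sum for any M; only the folded form avoids both the early exit and the wasted trailing iterations.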

Secondly, are there any plans to enable this approach for anything other than optsize?

We could, for example, consider enabling it under -O2 for loops whose entire (or nearly entire) body is already conditional; e.g.,

for (int i = 0; i < N; i++) {
  if (i*i % 4 != 2) {
    <loop body>
  }
}

otherwise the overhead of predicating code that could otherwise run unpredicated may be detrimental.

lib/Transforms/Vectorize/LoopVectorize.cpp
2673 ↗	(On Diff #159798)	Rounding N down to a multiple of Step is in general N-(N%Step). If Step is a constant power of two (which is currently always the case, and must be the case when folding the tail by masking), it gets optimized downstream to N&(-Step). If Step were some other constant, it may get optimized downstream to use multiplication instead of division, depending on target characteristics. In any case, this takes place before the loop; and is orthogonal to this patch, which simply reuses the existing logic to also round up.
4948 ↗	(On Diff #159798)	Yes, this change is unrelated and should land separately. The original ORE message is wrong. Not sure the TC==1 qualifies for any ORE message - "loops" with a known trip count of one are simply irrelevant for vectorization; though we could vectorize them with a mask...
4965 ↗	(On Diff #159798)	This BTW is caught by vect.omp.force.small-tc.ll; but the -vectorizer-min-trip-count=21 flag it uses is external to OpenMP, afaik.
lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	Extensions of VPInstructions such as VPCmpInst should indeed be uploaded for review and deserve a separate discussion thread and justification. This patch could tentatively make use of it, though for the purpose of this patch an ICmpULE opcode or a detached ICmpInst suffice. An ICmpULE opcode shouldn't be problematic currently, as this early-exit is the only VPInstruction compare with a Predicate, right? Note that detached UnderlyingValues could serve as data containers for all fields already implemented in the IR hierarchy, and could be constructed at any point of VPlan construction for that purpose. Extending VPInstructions to provide a similar API as that of IR Instructions seems to be an orthogonal concern with its own design objectives, and can coexist with detached Values; e.g., a VPCmpInst could hold its Predicate using a detached ICmpInst/FCmpInst.
1126 ↗	(On Diff #159798)	OK. The VPValue can be created on demand, turning `getBackedgeTakenCount()` into `getOrCreateBackedgeTakenCount()`. `NumUsers` should still be checked, as this isolates the decision of creating the IR based on the VPlan. In any case, VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant.
test/Transforms/LoopVectorize/X86/optsize.ll
12 ↗	(On Diff #159798)	Agreed. The original target-independent version of optsize.ll still passes, BTW, (i.e., fails to vectorize), but due to cost-model considerations rather than scalar tail considerations.
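
The rounding arithmetic discussed for line 2673 above can be checked with a short C++ sketch. The helper names are hypothetical, and rounding up by first adding Step - 1 is one possible way to reuse the round-down logic; it is shown here as an assumption, not as a quotation of the patch.

```cpp
#include <cassert>
#include <cstdint>

bool isPowerOfTwo(std::uint64_t X) { return X != 0 && (X & (X - 1)) == 0; }

// General form: works for any non-zero Step.
std::uint64_t roundDownGeneric(std::uint64_t N, std::uint64_t Step) {
  assert(Step != 0);
  return N - (N % Step);
}

// Power-of-two form: N & (-Step), written with unsigned negation.
std::uint64_t roundDownPow2(std::uint64_t N, std::uint64_t Step) {
  assert(isPowerOfTwo(Step));
  return N & (0 - Step);
}

// Rounding up by reusing the round-down logic on N + (Step - 1).
// Assumes N + Step - 1 does not wrap around.
std::uint64_t roundUp(std::uint64_t N, std::uint64_t Step) {
  return roundDownGeneric(N + (Step - 1), Step);
}
```

For a power-of-two Step the two round-down forms agree, which is why the URem is expected to be cheap after downstream optimization.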

In D50480#1201125, @reames wrote:

We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve. There are two obvious approaches: speculation safety and predication. Unless I'm misreading this patch, it has the same problem and uses predication right?

In this particular case, we don't get much of speculation. If you call computing loop index beyond the original upper bound as speculation (and use it in compare), it is, but we know there aren't any safety issues. In your case, what really matters is inside "unlikely(i > M)". If that's just trivial "i > M" (or something that can be converted in that form), we are better off simply changing the loop upper bound and do so prior to hitting the vectorizer. Then, this patch will take care of it. If not (i.e., general compute_some_predicate_value_based_on(i)) the whole speculation safety issue comes up and that's the difficult part to deal with and this patch doesn't deal with any aspect of it.

There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right? If there's not, then we don't need to distinguish which iteration "caused" the exit. This is actually an interesting and useful subcase for me.

I don't know what you mean by "a use of a variable from within the loop down the early exit path". Assume cond becomes true within a vector chunk (say, elem #2), you have to execute B for all prior iters (i.e., elem #0 and #1), and execute A for elem #2.

for (i){
   if (cond){
       A
       break;
   }
   B
}

Assuming that B is lexically below (note: this is vectorization, as such, you need to have some lexical ordering assumption somewhere) all the early exit points, it can be non-speculatively executed under proper predication.
This kind of predication, however, has nothing to do with this patch. General IF-THEN-ELSE and GOTO based control flow needs the same kind of predication.
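
The per-chunk behavior described above (B for the lanes before the exiting element, A for the exiting element) can be emulated in C++ as follows. `ChunkResult` and `runChunk` are hypothetical names, and the sketch takes the exit conditions as already computed, so it leaves aside the speculation-safety question of evaluating them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of one vector-sized chunk of the loop
//   for (i) { if (cond) { A; break; } B; }
struct ChunkResult {
  std::vector<bool> RanB; // lanes in which B executes
  int ExitLane;           // lane in which A executes, or -1 if no exit
};

ChunkResult runChunk(const std::vector<bool> &Cond) {
  ChunkResult R{std::vector<bool>(Cond.size(), false), -1};
  // Find the first lane whose exit condition is true.
  for (std::size_t L = 0; L < Cond.size(); ++L) {
    if (Cond[L]) {
      R.ExitLane = static_cast<int>(L);
      break;
    }
  }
  // B is enabled only for lanes strictly before the exiting lane,
  // or for all lanes if no lane exits.
  for (std::size_t L = 0; L < Cond.size(); ++L)
    R.RanB[L] = (R.ExitLane < 0) || (static_cast<int>(L) < R.ExitLane);
  return R;
}
```

With the condition first becoming true at elem #2 of a four-lane chunk, B runs for elems #0 and #1 and A runs for elem #2; elem #3 runs neither.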

Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I'm missing something, this is a restatement of the above, right?

Sure ---- but unless you are talking about trivial (i.e., not very interesting) "early exit" stuff, how to deal with speculation is the most important aspect of vectorizer's early exit handling.

Other examples are things like i < M for loop invariant M. Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

Sure, but that's not very interesting from vectorization perspective. Vectorizer doesn't have to do what other loop transformation can handle.

I am very specifically not interested in the language extension aspects. I'm specifically asking about doing the transform for unannotated C code. (i.e. having to prove all the legality the hard way)

ICC is doing it. So, let us know if anyone is volunteering before we do so that we can share our learning. It's an important aspect of vectorization but not yet high enough on our priority list. So, we aren't immediately jumping on to it.

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream come true ---- smaller code, faster compile, and faster execution. Looking forward to hearing such great news.

I take it you don't see AVX512 as qualifying?

Qualifying to what?

If your question is whether ICC uses the masked main vector code for AVX512, other than OptForSize case, then the answer is yes it does.

It's a combination of HW and SW. If you know the trip count as a compile time constant, you can evaluate various different ways to vectorize and decide the best one, much better than when you don't know the trip count. The legacy part of LV isn't set up to do such an evaluation. VPlan native part of LV would eventually have such a capability. W/o this capability, we need to go one way or the other rather blindly --- and blindly changing the status quo requires a pretty good justification (like brilliantly fast masked vector execution unit). I'm more interested in doing the evaluation when VPlan native path is ready to do that.

Not surprised, but I'd be curious to hear your reasoning. You might be coming at this from a different angle than I am

If the trip count is unknown, the best AVX512 vectorization strategy so far is go with unmasked (at the top-level) vector main loop. Underlying assumption is that unmasked vector main loop is faster than the masked vector main loop, and a lot of time is spent in executing main vector loop. If such an assumption does not hold, like main vector code isn't executed a lot, programmers should try to communicate the trip count estimation to the compiler so that the compiler can do a better job. As the HW narrows the gap between the two, optimization point moves. We have to evaluate every generation of HW and see what works the best. So, my comment applies to today's HW. I don't know what ARM SVE folks would say for their HW.

Does this make sense to you?

hsaito added inline comments. Aug 15 2018, 2:26 PM

lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	I'm against a detached ICmpInst. We'll be moving VPlan modeling before the cost model, and creating an IR Instruction before deciding to vectorize goes against the VPlan concept. "seems to be an orthogonal concern with its own design objectives": Not quite. We'd like VPInstruction to be easy to use for many LLVM developers, and that has been an integral part of the design/implementation from the beginning. Having said that, new opcode versus VPCmpInst doesn't block the rest of the review. Other parts of the review should proceed while the opcode versus VPCmpInst discussion is in progress on the side.
1126 ↗	(On Diff #159798)	"VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant." VPlan footprint was part of the community concern. We'd like to be better wherever we can. Just as simple as that. Thanks for taking care of it.

dcaballe mentioned this in D50823: [VPlan] Introduce VPCmpInst sub-class in the instruction-level representation. Aug 15 2018, 5:12 PM

dcaballe added inline comments. Aug 15 2018, 5:18 PM

lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	I created D50823 with the VPCmpInst sub-class so that we can make a decision with the code in place.

rkruppe added a subscriber: rkruppe. Aug 16 2018, 6:10 AM

hsaito added inline comments. Aug 16 2018, 12:37 PM

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485 ↗	(On Diff #159798)	I think it's best not to keep this state in the Legal. From the Legal perspective, being able to vectorize the whole loop body under the mask and the actual decision to do so are completely separate issues. Since canFold...() is invoked by CostModel::computeMaxVF, we should be able to keep this state in the CostModel. After all, whether to bail out or continue under FoldTailByMasking is a cost model side of the state, after consulting the Legal.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792 ↗	(On Diff #159798)	By moving FoldTail state to CostModel, we can define CostModel::blockNeedsPredication(BB) as FoldTailByMasking || LAI::blockNeedsPredication(BB) and make Legal version static to Legal.
lib/Transforms/Vectorize/LoopVectorize.cpp
2673 ↗	(On Diff #159798)	"orthogonal to this patch": I agree.

Meinersbur mentioned this in D49281: [Unroll/UnrollAndJam/Vectorizer/Distribute] Add followup loop attributes. Aug 17 2018, 4:23 PM

Addressed review comments.

New test X86/optsize.ll added and vect.omp.force.small-tc.ll augmented with CHECKs, both showing current behavior, to be uploaded separately before this patch. Test small-size.ll includes CHECKs that pass with this patch.

Ayal marked 4 inline comments as done. Aug 20 2018, 3:08 PM

Ayal added inline comments.

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485 ↗	(On Diff #159798)	OK.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792 ↗	(On Diff #159798)	OK, except that LAI::blockNeedsPredication() also asks for DT which CostModel does not have. Let's have CostModel::blockNeedsPredication(BB) return FoldTailByMasking || Legal::blockNeedsPredication(). Hopefully the two will not cause confusion. Making Legal version static should be pursued in a separate patch, if desired.
lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	VPlans should indeed keep the existing IR intact w/o changing it, as they are tentative by design, and also by current implementation. But creating a detached IR Instruction, just for the purpose of holding its attributes, w/o connecting it to any User, Operand (except Undef's) or BasicBlock, is arguably keeping the existing IR intact. Doing so should be quite familiar to LLVM developers, avoids mirroring Instruction's class hierarchy or a subset thereof, and leverages the existing UnderlyingValue pointer that is unutilized by InnerLoopVectorizer. Next uploaded version provides this complete option. Having said that, this patch can surely work with a VP(I)CmpInst just as well, as it merely needs a way for a single compare VPInstruction to hold a single Predicate, and print its name.
test/Transforms/LoopVectorize/X86/optsize.ll
12 ↗	(On Diff #159798)	Expanded IR CHECKs have been added for cases that should get vectorized. For cases that should not, suffice to check that no vector is formed.
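
The split agreed on for line 792 above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the patch's code: `LegalitySketch`, `CostModelSketch`, `demoBlockNeedsPredication` and the string block names are hypothetical stand-ins for LoopVectorizationLegality, the cost model and BasicBlock pointers.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for Legal: it only knows which blocks are
// conditionally executed in the original loop.
struct LegalitySketch {
  std::set<std::string> ConditionalBlocks;
  bool blockNeedsPredication(const std::string &BB) const {
    return ConditionalBlocks.count(BB) != 0;
  }
};

// Hypothetical stand-in for the cost model, which owns the fold-tail decision.
struct CostModelSketch {
  const LegalitySketch &Legal;
  bool FoldTailByMasking = false;

  explicit CostModelSketch(const LegalitySketch &L) : Legal(L) {}

  // Once the tail is folded, every block is predicated, including the header.
  bool blockNeedsPredication(const std::string &BB) const {
    return FoldTailByMasking || Legal.blockNeedsPredication(BB);
  }
};

// Usage: the header block becomes predicated only after folding is chosen.
inline void demoBlockNeedsPredication() {
  LegalitySketch Legal;
  Legal.ConditionalBlocks.insert("if.then");
  CostModelSketch Cost(Legal);
  assert(!Cost.blockNeedsPredication("for.body"));
  Cost.FoldTailByMasking = true;
  assert(Cost.blockNeedsPredication("for.body"));
}
```

Keeping the flag on the cost-model side leaves the legality query a pure property of the loop, which is the separation hsaito asked for.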

dcaballe added inline comments.Aug 20 2018, 3:43 PM

lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	I understand your point, Ayal. However, using UnderlyingValue as a pointer to the actual input IR in the VPlan native path and as a pointer to a detached IR Value in the inner loop path is very likely to be problematic, even in the short term. We would have to special case the code that is shared for both paths to treat the UnderlyingValue differently. The detached IR special semantics in the inner loop path would also make a bit more complicated the convergence of both paths. If there are no major concerns regarding the VPCmpInst, I'd prefer going with that approach.

hsaito added inline comments.Aug 20 2018, 3:53 PM

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485 ↗	(On Diff #159798)	Thank you.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792 ↗	(On Diff #159798)	Thanks, and fair enough.
lib/Transforms/Vectorize/LoopVectorize.cpp
2748 ↗	(On Diff #161564)	Personally, I don't like to see IR like the following going out of the vectorizer, even though it is later cleaned up trivially: %1 = false // unused and thus will be trivially cleaned up later. %2 = icmp ... Changing this part of the patch to Value *CheckMinIters = nullptr; if () .... else CheckMinIters = Builder.getFalse(); would make cleaner IR going out for common cases, at a small price to pay in ease-of-reading. If you agree, great. If not, I won't make a big deal about it. At the end of the day, we should clean up this area of code such that we don't have to rely on CheckMinIters being a "false" constant to clean up the unnecessary min iter check. That improvement can be done as a separate NFC patch.
2990 ↗	(On Diff #161564)	See the comment on CheckMinIters.

For me, the only major issue left is the detached IR instruction. @dcaballe, please try adding the reviewers/subscribers of D50480 to D50823, in the hopes of getting a quicker resolution there, so as not to block D50480 because of that. I will not oppose D50480 for introducing a new ULE opcode of VPInstruction (a design/implementation choice within the VPlan concept), but I will strongly oppose the use of a detached IR instruction (it goes against the VPlan concept).

It's certainly nicer if @Ayal, @dcaballe, and others can agree on VPCmpInst or not quickly enough. I vote in favor of VPCmpInst.

Thanks,
Hideki

lib/Transforms/Vectorize/LoopVectorize.cpp
4948 ↗	(On Diff #159798)	I think every non-vectorized loop that goes through vectorizer's analysis qualifies for ORE. After all, TC==1 knowledge may or may not be available to the programmer otherwise.
4977 ↗	(On Diff #159798)	Thank you.
lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	Detached IR instruction is detrimental to VPlan direction. Please do not use it.
test/Transforms/LoopVectorize/X86/optsize.ll
4 ↗	(On Diff #161564)	Is the test really dependent on the apple triple?

Ayal added inline comments.Aug 22 2018, 5:38 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
2748 ↗	(On Diff #161564)	One could change this part of the patch to create an unconditional branch instead of a conditional one from BB to NewBB; or avoid creating NewBB / calling `emitMinimumIterationCountCheck()` altogether `if (foldTailByMasking())`. Both alternatives will change the dominance structure and thus require special attention when updating DT in `updateAnalysis()`. The latter would also need to record the EntryBlock for cases where `LoopBypassBlocks` remains empty. It's simpler to keep the existing skeletal structure intact, and rely on subsequent trivial dce cleanup. If desired, such alternatives should be proposed as a separate follow-up NFC patch.
2990 ↗	(On Diff #161564)	ditto.
4948 ↗	(On Diff #159798)	ok
test/Transforms/LoopVectorize/X86/optsize.ll
4 ↗	(On Diff #161564)	-mtriple=x86_64-unknown-linux works just as well

Addressing review comments, rebased, added a couple of asserts.

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

The TC==1 part and preliminary CHECK completion of tests are to be uploaded first.

Ayal marked 2 inline comments as done.Aug 22 2018, 9:17 AM

Ayal added inline comments.

lib/Transforms/Vectorize/VPlan.h
609 ↗	(On Diff #159798)	Would be good to clarify the aforementioned discrepancy between VPlan native's use of input IR and the proposed use of detached IR; both should presumably model defs, uses and basic-block ownerships in VPlan rather than the IR Instruction, so the latter can merely be used for storing internal properties, for both paths alike. BTW, SROA.cpp and StraightLineStrengthReduce.cpp, e.g., also make use of detached Instructions. Would also be good to explain why detached Instructions are considered detrimental or what concept of VPlan they allegedly violate, given that their existence keeps the original IR intact. But let's keep this patch out of that discussion, and have it use an ICmpULE extended opcode as originally proposed and reloaded. After all, it plays a very small part in this patch, and can be easily revised later as needed.

Let's give @dcaballe one more day to try getting some traction on D50823. Fair enough to both of you (and others who might be interested)?

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

Since the VPCmpInst code is ready (D50823) and this is a clear use case where we need to model a new compare (including its predicate) that is not in the input IR, I'd appreciate if we could discuss a bit more about using the VPCmpInst approach. At least, I'd like to understand what are the concerns about the VPCmpInst approach and what other people think.

I do have concerns regarding modeling ICmpULE as an opcode only for compare instructions newly created during a VPlan-to-VPlan transformation. For example:

Inconsistent modeling of compare instructions in the VPlan native path. Compare instructions in the input IR will be modeled as VPInstructions with an Instruction::ICmpInst/Instruction::FCmpInst opcode. New compare instructions will be modeled as VPInstructions with predicates as opcodes (VPInstruction::ICmpULE, for now). We'd have to compare the opcode against Instruction::ICmpInst, Instruction::FCmpInst, VPInstruction::ICmpULE and any future predicate opcode to know that a VPInstruction is a comparison. There is a similar inconsistency in getting information about the compare predicate.

Adding ICmpULE as an opcode is paving the way to adding more predicates as opcodes in VPInstruction in the short term. Where would the limit be? Do we want to model the around 30 predicates currently in LLVM CmpInst as opcodes?

The ICmpULE approach may also be detrimental for the Instruction/VPInstruction templatization that we planned to explore.

If these points and the fact that VPCmpInst code is ready to go don't convince you, there isn't much else I can do :). I know this compare representation may sound insignificant but I'm well aware of how painful things can turn when things are built on top of "insignificant decisions" that have to be changed later on. If the problem with VPCmpInst is to rebase this patch on top of D50823, I'm perfectly fine with introducing D50823 after this patch goes in. However, if there are any other concerns regarding the VPCmpInst sub-class, it would be better to know them now. I'd prefer not to keep the ICmpULE opcode representation for a long time.

Thanks,
Diego
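
The first concern above can be made concrete with a standalone sketch that contrasts the two modeling choices. Everything here is hypothetical and simplified: `InstA`, `InstB`, `CmpInstB`, `Pred` and the opcode enumerators are illustrative names, not the VPlan classes or the code in D50823.

```cpp
#include <cassert>

// Hypothetical predicate set; LLVM's CmpInst has many more.
enum class Pred { EQ, ULT, ULE };

// Approach A: each newly needed predicate becomes its own extended opcode.
enum OpcodeA : unsigned { AddA, ICmpA, FCmpA, ICmpULE_A };

struct InstA {
  unsigned Opcode;
};

// Every predicate opcode has to be listed to recognize a compare.
inline bool isCompareA(const InstA &I) {
  return I.Opcode == ICmpA || I.Opcode == FCmpA || I.Opcode == ICmpULE_A;
}

// Approach B: compare opcodes stay as they are; a sub-class stores the predicate.
enum OpcodeB : unsigned { AddB, ICmpB, FCmpB };

struct InstB {
  unsigned Opcode;
  explicit InstB(unsigned Op) : Opcode(Op) {}
};

struct CmpInstB : InstB {
  Pred P;
  CmpInstB(unsigned Op, Pred P) : InstB(Op), P(P) {
    assert((Op == ICmpB || Op == FCmpB) && "not a compare opcode");
  }
};

// The check does not grow when new predicates are needed.
inline bool isCompareB(const InstB &I) {
  return I.Opcode == ICmpB || I.Opcode == FCmpB;
}

inline Pred getPredicateB(const InstB &I) {
  assert(isCompareB(I) && "predicate requested for a non-compare");
  return static_cast<const CmpInstB &>(I).P;
}
```

In approach A, `isCompareA` must be extended for every predicate that is added as an opcode, while in approach B the predicate is data and the opcode test stays fixed.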

Under the assumption that the acceptance of this patch is not a conscious choice between new CmpULE VPInstruction opcode versus VPCmpInst derivation (whose discussion should continue in D50823 or its follow on), I think this patch is ready to land. LGTM.

This revision is now accepted and ready to land.Aug 23 2018, 2:51 PM

In D50480#1210022, @dcaballe wrote:

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

Since the VPCmpInst code is ready (D50823) and this is a clear use case where we need to model a new compare (including its predicate) that is not in the input IR, I'd appreciate if we could discuss a bit more about using the VPCmpInst approach. At least, I'd like to understand what are the concerns about the VPCmpInst approach and what other people think.

I do have concerns regarding modeling ICmpULE as an opcode only for compare instructions newly created during a VPlan-to-VPlan transformation. For example:

...

In D50480#1211580, @hsaito wrote:

Under the assumption that the acceptance of this patch is not a conscious choice between new CmpULE VPInstruction opcode versus VPCmpInst derivation (whose discussion should continue in D50823 or its follow on), I think this patch is ready to land. LGTM.

This patch aims to model a rather special early-exit condition that restricts the execution of the entire loop body to certain iterations, rather than model general compare instructions. If preferred, an "EarlyExit" extended opcode can be introduced instead of the controversial ICmpULE. This should be easy to revisit in the future if needed.

This patch focuses on modeling an early-exit compare and then generating it, w/o making strategic design decisions supporting future vplan-to-vplan transformations, the interfaces they may need, potential templatization, or other long-term high-level VPlan concerns. These should be explained and discussed separately along with pros and cons of alternative solutions for supporting the desired interfaces and for holding their storage, including subclassing VPInstructions, using detached Instructions, or other possibilities.
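
For readers following the thread, the compare in question can be sketched as a scalar model of the tail-folded loop. This is an illustration under assumptions, not the generated IR: `runTailFolded` is a hypothetical helper, the index type is narrowed to `uint8_t` so that the overflow case is cheap to exercise, and VF * UF is fixed at 4. The variable names mirror the "n.rnd.up" and "n.vec" values in the diff.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned Step = 4; // stands for VF * UF; must be a power of 2

// Returns how many lanes execute the loop body when the tail is folded.
// BTC is the backedge-taken count, i.e. TC - 1.
inline unsigned runTailFolded(uint8_t BTC) {
  assert((Step & (Step - 1)) == 0 && "Step must be a power of 2");
  uint8_t TC = static_cast<uint8_t>(BTC + 1);                 // may wrap to 0
  uint8_t RndUp = static_cast<uint8_t>(TC + (Step - 1));      // "n.rnd.up"
  uint8_t NVec = static_cast<uint8_t>(RndUp - RndUp % Step);  // "n.vec"

  unsigned ActiveLanes = 0;
  uint8_t Index = 0;
  do {
    for (unsigned Lane = 0; Lane < Step; ++Lane) {
      uint8_t I = static_cast<uint8_t>(Index + Lane);
      // The mask: "i <= TC - 1" rather than "i < TC", so a wrapped TC is harmless.
      if (I <= BTC)
        ++ActiveLanes;
    }
    Index = static_cast<uint8_t>(Index + Step);
  } while (Index != NVec); // Index wraps to 0 when NVec is 0
  return ActiveLanes;
}
```

With a trip count of 10 the loop runs three vector iterations and masks off the last two lanes. With BTC equal to 255 the trip count wraps to 0, yet the index also wraps to 0 after 64 iterations and all 256 lanes are active.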

In D50480#1213673, @Ayal wrote:

This patch aims to model a rather special early-exit condition that restricts the execution of the entire loop body to certain iterations, rather than model general compare instructions. If preferred, an "EarlyExit" extended opcode can be introduced instead of the controversial ICmpULE. This should be easy to revisit in the future if needed.

This patch is fine as is, or rather much better with ICmpULE than EarlyExit.

This patch focuses on modeling an early-exit compare and then generating it, w/o making strategic design decisions supporting future vplan-to-vplan transformations, the interfaces they may need, potential templatization, or other long-term high-level VPlan concerns. These should be explained and discussed separately along with pros and cons of alternative solutions for supporting the desired interfaces and for holding their storage, including subclassing VPInstructions, using detached Instructions, or other possibilities.

Sure. I agree.

[Full disclosure] I have a big mental barrier in accepting your "early-exit" terminology here since I relate that term to "break out of the loop", but that's just the terminology difference. Nothing to do with the substance of this patch. [End of full disclosure]

Closed by commit rL344743: [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt… (authored by ayalz). Oct 18 2018, 8:05 AM

This revision was automatically updated to reflect the committed changes.

dorit mentioned this in D53559: [LV] Don't have fold-tail under optsize invalidate interleave-groups when masked-interleaving is enabled.Oct 23 2018, 1:43 AM

dorit mentioned this in rL345115: [LV] Don't have fold-tail under optsize invalidate interleave-groups when.Oct 24 2018, 12:13 AM

Ayal mentioned this in D66720: [LV] Fold tail by masking - handle reductions.Aug 25 2019, 10:33 AM

Revision Contents

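Path	Size
llvm/trunk/include/llvm/Analysis/VectorUtils.h	21 lines
llvm/trunk/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	4 lines
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	55 lines
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	126 lines
llvm/trunk/lib/Transforms/Vectorize/VPlan.h	21 lines
llvm/trunk/lib/Transforms/Vectorize/VPlan.cpp	24 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/optsize.ll	85 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll	172 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll	47 lines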

Diff 170090

llvm/trunk/include/llvm/Analysis/VectorUtils.h

	/// between the member and the group in a map.			/// between the member and the group in a map.
	class InterleavedAccessInfo {			class InterleavedAccessInfo {
	public:			public:
	InterleavedAccessInfo(PredicatedScalarEvolution &PSE, Loop *L,			InterleavedAccessInfo(PredicatedScalarEvolution &PSE, Loop *L,
	DominatorTree *DT, LoopInfo *LI,			DominatorTree *DT, LoopInfo *LI,
	const LoopAccessInfo *LAI)			const LoopAccessInfo *LAI)
	: PSE(PSE), TheLoop(L), DT(DT), LI(LI), LAI(LAI) {}			: PSE(PSE), TheLoop(L), DT(DT), LI(LI), LAI(LAI) {}

	~InterleavedAccessInfo() {			~InterleavedAccessInfo() { reset(); }

				/// Analyze the interleaved accesses and collect them in interleave
				/// groups. Substitute symbolic strides using \p Strides.
				/// Consider also predicated loads/stores in the analysis if
				/// \p EnableMaskedInterleavedGroup is true.
				void analyzeInterleaving(bool EnableMaskedInterleavedGroup);

				/// Invalidate groups, e.g., in case all blocks in loop will be predicated
				/// contrary to original assumption. Although we currently prevent group
				/// formation for predicated accesses, we may be able to relax this limitation
				/// in the future once we handle more complicated blocks.
				void reset() {
	SmallPtrSet<InterleaveGroup *, 4> DelSet;			SmallPtrSet<InterleaveGroup *, 4> DelSet;
	// Avoid releasing a pointer twice.			// Avoid releasing a pointer twice.
	for (auto &I : InterleaveGroupMap)			for (auto &I : InterleaveGroupMap)
	DelSet.insert(I.second);			DelSet.insert(I.second);
	for (auto *Ptr : DelSet)			for (auto *Ptr : DelSet)
	delete Ptr;			delete Ptr;
				InterleaveGroupMap.clear();
				RequiresScalarEpilogue = false;
	}			}

	/// Analyze the interleaved accesses and collect them in interleave
	/// groups. Substitute symbolic strides using \p Strides.
	/// Consider also predicated loads/stores in the analysis if
	/// \p EnableMaskedInterleavedGroup is true.
	void analyzeInterleaving(bool EnableMaskedInterleavedGroup);

	/// Check if \p Instr belongs to any interleave group.			/// Check if \p Instr belongs to any interleave group.
	bool isInterleaved(Instruction *Instr) const {			bool isInterleaved(Instruction *Instr) const {
	return InterleaveGroupMap.find(Instr) != InterleaveGroupMap.end();			return InterleaveGroupMap.find(Instr) != InterleaveGroupMap.end();
	}			}

	/// Get the interleave group that \p Instr belongs to.			/// Get the interleave group that \p Instr belongs to.
	///			///

llvm/trunk/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

...	public:
/// This does not mean that it is profitable to vectorize this		/// This does not mean that it is profitable to vectorize this
/// loop, only that it is legal to do so.		/// loop, only that it is legal to do so.
/// Temporarily taking UseVPlanNativePath parameter. If true, take		/// Temporarily taking UseVPlanNativePath parameter. If true, take
/// the new code path being implemented for outer loop vectorization		/// the new code path being implemented for outer loop vectorization
/// (should be functional for inner loop vectorization) based on VPlan.		/// (should be functional for inner loop vectorization) based on VPlan.
/// If false, good old LV code.		/// If false, good old LV code.
bool canVectorize(bool UseVPlanNativePath);		bool canVectorize(bool UseVPlanNativePath);

		/// Return true if we can vectorize this loop while folding its tail by
		/// masking.
		bool canFoldTailByMasking();

/// Returns the primary induction variable.		/// Returns the primary induction variable.
PHINode *getPrimaryInduction() { return PrimaryInduction; }		PHINode *getPrimaryInduction() { return PrimaryInduction; }

/// Returns the reduction variables found in the loop.		/// Returns the reduction variables found in the loop.
ReductionList *getReductionVars() { return &Reductions; }		ReductionList *getReductionVars() { return &Reductions; }

/// Returns the induction variables found in the loop.		/// Returns the induction variables found in the loop.
InductionList *getInductionVars() { return &Inductions; }		InductionList *getInductionVars() { return &Inductions; }

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

...	bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {

// Okay! We've done all the tests. If any have failed, return false. Otherwise		// Okay! We've done all the tests. If any have failed, return false. Otherwise
// we can vectorize, and at this point we don't have any other mem analysis		// we can vectorize, and at this point we don't have any other mem analysis
// which may limit our maximum vectorization factor, so just return true with		// which may limit our maximum vectorization factor, so just return true with
// no restrictions.		// no restrictions.
return Result;		return Result;
}		}

		bool LoopVectorizationLegality::canFoldTailByMasking() {

		LLVM_DEBUG(dbgs() << "LV: checking if tail can be folded by masking.\n");

		if (!PrimaryInduction) {
		ORE->emit(createMissedAnalysis("NoPrimaryInduction")
		<< "Missing a primary induction variable in the loop, which is "
		<< "needed in order to fold tail by masking as required.");
		LLVM_DEBUG(dbgs() << "LV: No primary induction, cannot fold tail by "
		<< "masking.\n");
		return false;
		}

		// TODO: handle reductions when tail is folded by masking.
		if (!Reductions.empty()) {
		ORE->emit(createMissedAnalysis("ReductionFoldingTailByMasking")
		<< "Cannot fold tail by masking in the presence of reductions.");
		LLVM_DEBUG(dbgs() << "LV: Loop has reductions, cannot fold tail by "
		<< "masking.\n");
		return false;
		}

		// TODO: handle outside users when tail is folded by masking.
		for (auto *AE : AllowedExit) {
		// Check that all users of allowed exit values are inside the loop.
		for (User *U : AE->users()) {
		Instruction *UI = cast<Instruction>(U);
		if (TheLoop->contains(UI))
		continue;
		ORE->emit(createMissedAnalysis("LiveOutFoldingTailByMasking")
		<< "Cannot fold tail by masking in the presence of live outs.");
		LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking, loop has an "
		<< "outside user for : " << *UI << '\n');
		return false;
		}
		}

		// The list of pointers that we can safely read and write to remains empty.
		SmallPtrSet<Value *, 8> SafePointers;

		// Check and mark all blocks for predication, including those that ordinarily
		// do not need predication such as the header block.
		for (BasicBlock *BB : TheLoop->blocks()) {
		if (!blockCanBePredicated(BB, SafePointers)) {
		ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator())
		<< "control flow cannot be substituted for a select");
		LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking as required.\n");
		return false;
		}
		}

		LLVM_DEBUG(dbgs() << "LV: can fold tail by masking.\n");
		return true;
		}

} // namespace llvm		} // namespace llvm

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp


...	public:
/// If a non-zero VF has been calculated, we check if I will be scalarized		/// If a non-zero VF has been calculated, we check if I will be scalarized
/// predication for that VF.		/// predication for that VF.
bool isScalarWithPredication(Instruction *I, unsigned VF = 1);		bool isScalarWithPredication(Instruction *I, unsigned VF = 1);

// Returns true if \p I is an instruction that will be predicated either		// Returns true if \p I is an instruction that will be predicated either
// through scalar predication or masked load/store or masked gather/scatter.		// through scalar predication or masked load/store or masked gather/scatter.
// Superset of instructions that return true for isScalarWithPredication.		// Superset of instructions that return true for isScalarWithPredication.
bool isPredicatedInst(Instruction *I) {		bool isPredicatedInst(Instruction *I) {
if (!Legal->blockNeedsPredication(I->getParent()))		if (!blockNeedsPredication(I->getParent()))
return false;		return false;
// Loads and stores that need some form of masked operation are predicated		// Loads and stores that need some form of masked operation are predicated
// instructions.		// instructions.
if (isa<LoadInst>(I) || isa<StoreInst>(I))		if (isa<LoadInst>(I) || isa<StoreInst>(I))
return Legal->isMaskRequired(I);		return Legal->isMaskRequired(I);
return isScalarWithPredication(I);		return isScalarWithPredication(I);
}		}

...	public:
}		}

/// Returns true if an interleaved group requires a scalar iteration		/// Returns true if an interleaved group requires a scalar iteration
/// to handle accesses with gaps.		/// to handle accesses with gaps.
bool requiresScalarEpilogue() const {		bool requiresScalarEpilogue() const {
return InterleaveInfo.requiresScalarEpilogue();		return InterleaveInfo.requiresScalarEpilogue();
}		}

		/// Returns true if all loop blocks should be masked to fold tail loop.
		bool foldTailByMasking() const { return FoldTailByMasking; }

		bool blockNeedsPredication(BasicBlock *BB) {
		return foldTailByMasking() || Legal->blockNeedsPredication(BB);
		}

private:		private:
unsigned NumPredStores = 0;		unsigned NumPredStores = 0;

/// \return An upper bound for the vectorization factor, larger than zero.		/// \return An upper bound for the vectorization factor, larger than zero.
/// One is returned if vectorization should best be avoided due to cost.		/// One is returned if vectorization should best be avoided due to cost.
unsigned computeFeasibleMaxVF(bool OptForSize, unsigned ConstTripCount);		unsigned computeFeasibleMaxVF(bool OptForSize, unsigned ConstTripCount);

/// The vectorization cost is a combination of the cost itself and a boolean		/// The vectorization cost is a combination of the cost itself and a boolean
...	private:
/// scalarized rather than vectorized. The entries are Instruction-Cost		/// scalarized rather than vectorized. The entries are Instruction-Cost
/// pairs.		/// pairs.
using ScalarCostsTy = DenseMap<Instruction *, unsigned>;		using ScalarCostsTy = DenseMap<Instruction *, unsigned>;

/// A set containing all BasicBlocks that are known to present after		/// A set containing all BasicBlocks that are known to present after
/// vectorization as a predicated block.		/// vectorization as a predicated block.
SmallPtrSet<BasicBlock *, 4> PredicatedBBsAfterVectorization;		SmallPtrSet<BasicBlock *, 4> PredicatedBBsAfterVectorization;

		/// All blocks of loop are to be masked to fold tail of scalar iterations.
		bool FoldTailByMasking = false;

/// A map holding scalar costs for different vectorization factors. The		/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the		/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated		/// instruction will be scalarized when vectorizing with the associated
/// vectorization factor. The entries are VF-ScalarCostTy pairs.		/// vectorization factor. The entries are VF-ScalarCostTy pairs.
DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;		DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;

/// Holds the instructions known to be uniform after vectorization.		/// Holds the instructions known to be uniform after vectorization.
/// The data is collected per VF.		/// The data is collected per VF.
...	PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,

return Induction;		return Induction;
}		}

Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) {		Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) {
if (TripCount)		if (TripCount)
return TripCount;		return TripCount;

		assert(L && "Create Trip Count for null loop.");
IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());		IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());
// Find the loop boundaries.		// Find the loop boundaries.
ScalarEvolution *SE = PSE.getSE();		ScalarEvolution *SE = PSE.getSE();
const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();		const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();
assert(BackedgeTakenCount != SE->getCouldNotCompute() &&		assert(BackedgeTakenCount != SE->getCouldNotCompute() &&
"Invalid loop count");		"Invalid loop count");

Type *IdxTy = Legal->getWidestInductionType();		Type *IdxTy = Legal->getWidestInductionType();
...

Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) {		Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) {
if (VectorTripCount)		if (VectorTripCount)
return VectorTripCount;		return VectorTripCount;

Value *TC = getOrCreateTripCount(L);		Value *TC = getOrCreateTripCount(L);
IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());		IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());

		Type *Ty = TC->getType();
		Constant *Step = ConstantInt::get(Ty, VF * UF);

		// If the tail is to be folded by masking, round the number of iterations N
		// up to a multiple of Step instead of rounding down. This is done by first
		// adding Step-1 and then rounding down. Note that it's ok if this addition
		// overflows: the vector induction variable will eventually wrap to zero given
		// that it starts at zero and its Step is a power of two; the loop will then
		// exit, with the last early-exit vector comparison also producing all-true.
		if (Cost->foldTailByMasking()) {
		assert(isPowerOf2_32(VF * UF) &&
		"VF*UF must be a power of 2 when folding tail by masking");
		TC = Builder.CreateAdd(TC, ConstantInt::get(Ty, VF * UF - 1), "n.rnd.up");
		}

// Now we need to generate the expression for the part of the loop that the		// Now we need to generate the expression for the part of the loop that the
// vectorized body will execute. This is equal to N - (N % Step) if scalar		// vectorized body will execute. This is equal to N - (N % Step) if scalar
// iterations are not required for correctness, or N - Step, otherwise. Step		// iterations are not required for correctness, or N - Step, otherwise. Step
// is equal to the vectorization factor (number of SIMD elements) times the		// is equal to the vectorization factor (number of SIMD elements) times the
// unroll factor (number of SIMD instructions).		// unroll factor (number of SIMD instructions).
Constant *Step = ConstantInt::get(TC->getType(), VF * UF);
Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");		Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");

// If there is a non-reversed interleaved group that may speculatively access		// If there is a non-reversed interleaved group that may speculatively access
// memory out-of-bounds, we need to ensure that there will be at least one		// memory out-of-bounds, we need to ensure that there will be at least one
// iteration of the scalar epilogue loop. Thus, if the step evenly divides		// iteration of the scalar epilogue loop. Thus, if the step evenly divides
// the trip count, we set the remainder to be equal to the step. If the step		// the trip count, we set the remainder to be equal to the step. If the step
// does not evenly divide the trip count, no adjustment is necessary since		// does not evenly divide the trip count, no adjustment is necessary since
// there will already be scalar iterations. Note that the minimum iterations		// there will already be scalar iterations. Note that the minimum iterations
...	void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L,

// Generate code to check if the loop's trip count is less than VF * UF, or		// Generate code to check if the loop's trip count is less than VF * UF, or
// equal to it in case a scalar epilogue is required; this implies that the		// equal to it in case a scalar epilogue is required; this implies that the
// vector trip count is zero. This check also covers the case where adding one		// vector trip count is zero. This check also covers the case where adding one
// to the backedge-taken count overflowed leading to an incorrect trip count		// to the backedge-taken count overflowed leading to an incorrect trip count
// of zero. In this case we will also jump to the scalar loop.		// of zero. In this case we will also jump to the scalar loop.
auto P = Cost->requiresScalarEpilogue() ? ICmpInst::ICMP_ULE		auto P = Cost->requiresScalarEpilogue() ? ICmpInst::ICMP_ULE
: ICmpInst::ICMP_ULT;		: ICmpInst::ICMP_ULT;
Value *CheckMinIters = Builder.CreateICmp(
P, Count, ConstantInt::get(Count->getType(), VF * UF), "min.iters.check");		// If tail is to be folded, vector loop takes care of all iterations.
		Value *CheckMinIters = Builder.getFalse();
		if (!Cost->foldTailByMasking())
		CheckMinIters = Builder.CreateICmp(
		P, Count, ConstantInt::get(Count->getType(), VF * UF),
		"min.iters.check");

BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");		BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");
// Update dominator tree immediately if the generated block is a		// Update dominator tree immediately if the generated block is a
// LoopBypassBlock because SCEV expansions to generate loop bypass		// LoopBypassBlock because SCEV expansions to generate loop bypass
// checks may query it before the current function is finished.		// checks may query it before the current function is finished.
DT->addNewBlock(NewBB, BB);		DT->addNewBlock(NewBB, BB);
if (L->getParentLoop())		if (L->getParentLoop())
L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI);		L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI);
...	SCEVExpander Exp(*PSE.getSE(), Bypass->getModule()->getDataLayout(),
"scev.check");		"scev.check");
Value *SCEVCheck =		Value *SCEVCheck =
Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator());		Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator());

if (auto *C = dyn_cast<ConstantInt>(SCEVCheck))		if (auto *C = dyn_cast<ConstantInt>(SCEVCheck))
if (C->isZero())		if (C->isZero())
return;		return;

		assert(!Cost->foldTailByMasking() && "Cannot check stride when folding tail");
// Create a new block containing the stride check.		// Create a new block containing the stride check.
BB->setName("vector.scevcheck");		BB->setName("vector.scevcheck");
auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");		auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");
// Update dominator tree immediately if the generated block is a		// Update dominator tree immediately if the generated block is a
// LoopBypassBlock because SCEV expansions to generate loop bypass		// LoopBypassBlock because SCEV expansions to generate loop bypass
// checks may query it before the current function is finished.		// checks may query it before the current function is finished.
DT->addNewBlock(NewBB, BB);		DT->addNewBlock(NewBB, BB);
if (L->getParentLoop())		if (L->getParentLoop())
[16 lines not shown]	void InnerLoopVectorizer::emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass) {
// faster.		// faster.
Instruction *FirstCheckInst;		Instruction *FirstCheckInst;
Instruction *MemRuntimeCheck;		Instruction *MemRuntimeCheck;
std::tie(FirstCheckInst, MemRuntimeCheck) =		std::tie(FirstCheckInst, MemRuntimeCheck) =
Legal->getLAI()->addRuntimeChecks(BB->getTerminator());		Legal->getLAI()->addRuntimeChecks(BB->getTerminator());
if (!MemRuntimeCheck)		if (!MemRuntimeCheck)
return;		return;

		assert(!Cost->foldTailByMasking() && "Cannot check memory when folding tail");
// Create a new block containing the memory check.		// Create a new block containing the memory check.
BB->setName("vector.memcheck");		BB->setName("vector.memcheck");
auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");		auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");
// Update dominator tree immediately if the generated block is a		// Update dominator tree immediately if the generated block is a
// LoopBypassBlock because SCEV expansions to generate loop bypass		// LoopBypassBlock because SCEV expansions to generate loop bypass
// checks may query it before the current function is finished.		// checks may query it before the current function is finished.
DT->addNewBlock(NewBB, BB);		DT->addNewBlock(NewBB, BB);
if (L->getParentLoop())		if (L->getParentLoop())
[252 lines not shown]	for (auto &InductionEntry : *List) {
for (BasicBlock *BB : LoopBypassBlocks)		for (BasicBlock *BB : LoopBypassBlocks)
BCResumeVal->addIncoming(II.getStartValue(), BB);		BCResumeVal->addIncoming(II.getStartValue(), BB);
OrigPhi->setIncomingValue(BlockIdx, BCResumeVal);		OrigPhi->setIncomingValue(BlockIdx, BCResumeVal);
}		}

// Add a check in the middle block to see if we have completed		// Add a check in the middle block to see if we have completed
// all of the iterations in the first vector loop.		// all of the iterations in the first vector loop.
// If (N - N%VF) == N, then we don't need to run the remainder.		// If (N - N%VF) == N, then we don't need to run the remainder.
Value *CmpN =		// If tail is to be folded, we know we don't need to run the remainder.
		Value *CmpN = Builder.getTrue();
		if (!Cost->foldTailByMasking())
		CmpN =
CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count,		CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count,
CountRoundDown, "cmp.n", MiddleBlock->getTerminator());		CountRoundDown, "cmp.n", MiddleBlock->getTerminator());
ReplaceInstWithInst(MiddleBlock->getTerminator(),		ReplaceInstWithInst(MiddleBlock->getTerminator(),
BranchInst::Create(ExitBlock, ScalarPH, CmpN));		BranchInst::Create(ExitBlock, ScalarPH, CmpN));

// Get ready to start creating new instructions into the vectorized body.		// Get ready to start creating new instructions into the vectorized body.
Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt());		Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt());

// Save the state.		// Save the state.
LoopVectorPreHeader = Lp->getLoopPreheader();		LoopVectorPreHeader = Lp->getLoopPreheader();
[1,457 lines not shown]	for (auto &Induction : *Legal->getInductionVars()) {
LLVM_DEBUG(dbgs() << "LV: Found scalar instruction: " << *IndUpdate		LLVM_DEBUG(dbgs() << "LV: Found scalar instruction: " << *IndUpdate
<< "\n");		<< "\n");
}		}

Scalars[VF].insert(Worklist.begin(), Worklist.end());		Scalars[VF].insert(Worklist.begin(), Worklist.end());
}		}

bool LoopVectorizationCostModel::isScalarWithPredication(Instruction *I, unsigned VF) {		bool LoopVectorizationCostModel::isScalarWithPredication(Instruction *I, unsigned VF) {
if (!Legal->blockNeedsPredication(I->getParent()))		if (!blockNeedsPredication(I->getParent()))
return false;		return false;
switch(I->getOpcode()) {		switch(I->getOpcode()) {
default:		default:
break;		break;
case Instruction::Load:		case Instruction::Load:
case Instruction::Store: {		case Instruction::Store: {
if (!Legal->isMaskRequired(I))		if (!Legal->isMaskRequired(I))
return false;		return false;
[285 lines not shown]	Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(bool OptForSize) {

if (TC == 1) {		if (TC == 1) {
ORE->emit(createMissedAnalysis("SingleIterationLoop")		ORE->emit(createMissedAnalysis("SingleIterationLoop")
<< "loop trip count is one, irrelevant for vectorization");		<< "loop trip count is one, irrelevant for vectorization");
LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");		LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
return None;		return None;
}		}

// If we don't know the precise trip count, don't try to vectorize.		unsigned MaxVF = computeFeasibleMaxVF(OptForSize, TC);

		if (TC > 0 && TC % MaxVF == 0) {
		LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
		return MaxVF;
		}

		// If we don't know the precise trip count, or if the trip count that we
		// found modulo the vectorization factor is not zero, try to fold the tail
		// by masking.
		// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
		// FIXME: return None if loop requiresScalarEpilog(<MaxVF>), or look for a
		// smaller MaxVF that does not require a scalar epilog.
		if (Legal->canFoldTailByMasking()) {
		FoldTailByMasking = true;
		return MaxVF;
		}

if (TC == 0) {		if (TC == 0) {
ORE->emit(		ORE->emit(
createMissedAnalysis("UnknownLoopCountComplexCFG")		createMissedAnalysis("UnknownLoopCountComplexCFG")
<< "unable to calculate the loop count due to complex control flow");		<< "unable to calculate the loop count due to complex control flow");
LLVM_DEBUG(
dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
return None;		return None;
}		}

unsigned MaxVF = computeFeasibleMaxVF(OptForSize, TC);

if (TC % MaxVF != 0) {
// If the trip count that we found modulo the vectorization factor is not
// zero then we require a tail.
// FIXME: look for a smaller MaxVF that does divide TC rather than give up.
// FIXME: return None if loop requiresScalarEpilog(<MaxVF>), or look for a
// smaller MaxVF that does not require a scalar epilog.

ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize")		ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize")
<< "cannot optimize for size and vectorize at the "		<< "cannot optimize for size and vectorize at the same time. "
"same time. Enable vectorization of this loop "		"Enable vectorization of this loop with '#pragma clang loop "
"with '#pragma clang loop vectorize(enable)' "		"vectorize(enable)' when compiling with -Os/-Oz");
"when compiling with -Os/-Oz");
LLVM_DEBUG(
dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
return None;		return None;
}		}

return MaxVF;
}

unsigned		unsigned
LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,		LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,
unsigned ConstTripCount) {		unsigned ConstTripCount) {
MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);		MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);
unsigned SmallestType, WidestType;		unsigned SmallestType, WidestType;
std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();		std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();
unsigned WidestRegister = TTI.getRegisterBitWidth(true);		unsigned WidestRegister = TTI.getRegisterBitWidth(true);

[219 lines not shown]	unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
// We calculate the interleave count using the following formula.		// We calculate the interleave count using the following formula.
// Subtract the number of loop invariants from the number of available		// Subtract the number of loop invariants from the number of available
// registers. These registers are used by all of the interleaved instances.		// registers. These registers are used by all of the interleaved instances.
// Next, divide the remaining registers by the number of registers that is		// Next, divide the remaining registers by the number of registers that is
// required by the loop, in order to estimate how many parallel instances		// required by the loop, in order to estimate how many parallel instances
// fit without causing spills. All of this is rounded down if necessary to be		// fit without causing spills. All of this is rounded down if necessary to be
// a power of two. We want power of two interleave count to simplify any		// a power of two. We want power of two interleave count to simplify any
// addressing operations or alignment considerations.		// addressing operations or alignment considerations.
		// We also want power of two interleave counts to ensure that the induction
		// variable of the vector loop wraps to zero, when tail is folded by masking;
		// this currently happens when OptForSize, in which case IC is set to 1 above.
unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /		unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
R.MaxLocalUsers);		R.MaxLocalUsers);

// Don't count the induction variable as interleaved.		// Don't count the induction variable as interleaved.
if (EnableIndVarRegisterHeur)		if (EnableIndVarRegisterHeur)
IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /		IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /
std::max(1U, (R.MaxLocalUsers - 1)));		std::max(1U, (R.MaxLocalUsers - 1)));

[270 lines not shown]	void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {
// not profitable to scalarize any instructions, the presence of VF in the		// not profitable to scalarize any instructions, the presence of VF in the
// map will indicate that we've analyzed it already.		// map will indicate that we've analyzed it already.
ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];		ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];

// Find all the instructions that are scalar with predication in the loop and		// Find all the instructions that are scalar with predication in the loop and
// determine if it would be better to not if-convert the blocks they are in.		// determine if it would be better to not if-convert the blocks they are in.
// If so, we also record the instructions to scalarize.		// If so, we also record the instructions to scalarize.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
if (!Legal->blockNeedsPredication(BB))		if (!blockNeedsPredication(BB))
continue;		continue;
for (Instruction &I : *BB)		for (Instruction &I : *BB)
if (isScalarWithPredication(&I)) {		if (isScalarWithPredication(&I)) {
ScalarCostsTy ScalarCosts;		ScalarCostsTy ScalarCosts;
// Do not apply discount logic if hacked cost is needed		// Do not apply discount logic if hacked cost is needed
// for emulated masked memrefs.		// for emulated masked memrefs.
if (!useEmulatedMaskMemRefHack(&I) &&		if (!useEmulatedMaskMemRefHack(&I) &&
computePredInstDiscount(&I, ScalarCosts, VF) >= 0)		computePredInstDiscount(&I, ScalarCosts, VF) >= 0)
[148 lines not shown]	for (BasicBlock *BB : TheLoop->blocks()) {
}		}

// If we are vectorizing a predicated block, it will have been		// If we are vectorizing a predicated block, it will have been
// if-converted. This means that the block's instructions (aside from		// if-converted. This means that the block's instructions (aside from
// stores and instructions that may divide by zero) will now be		// stores and instructions that may divide by zero) will now be
// unconditionally executed. For the scalar case, we may not always execute		// unconditionally executed. For the scalar case, we may not always execute
// the predicated block. Thus, scale the block's cost by the probability of		// the predicated block. Thus, scale the block's cost by the probability of
// executing it.		// executing it.
if (VF == 1 && Legal->blockNeedsPredication(BB))		if (VF == 1 && blockNeedsPredication(BB))
BlockCost.first /= getReciprocalPredBlockProb();		BlockCost.first /= getReciprocalPredBlockProb();

Cost.first += BlockCost.first;		Cost.first += BlockCost.first;
Cost.second |= BlockCost.second;		Cost.second |= BlockCost.second;
}		}

return Cost;		return Cost;
}		}
[674 lines not shown]
LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {		LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {
assert(OrigLoop->empty() && "Inner loop expected.");		assert(OrigLoop->empty() && "Inner loop expected.");
// Width 1 means no vectorization, cost 0 means uncomputed cost.		// Width 1 means no vectorization, cost 0 means uncomputed cost.
const VectorizationFactor NoVectorization = {1U, 0U};		const VectorizationFactor NoVectorization = {1U, 0U};
Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);		Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);
if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.		if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.
return NoVectorization;		return NoVectorization;

		// Invalidate interleave groups if all blocks of loop will be predicated.
		if (CM.blockNeedsPredication(OrigLoop->getHeader()))
		CM.InterleaveInfo.reset();

if (UserVF) {		if (UserVF) {
LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.		// profitable to scalarize.
CM.selectUserVectorizationFactor(UserVF);		CM.selectUserVectorizationFactor(UserVF);
buildVPlansWithVPRecipes(UserVF, UserVF);		buildVPlansWithVPRecipes(UserVF, UserVF);
LLVM_DEBUG(printPlans(dbgs()));		LLVM_DEBUG(printPlans(dbgs()));
[40 lines not shown]	void LoopVectorizationPlanner::executePlan(InnerLoopVectorizer &ILV,

// 1. Create a new empty loop. Unlink the old loop and connect the new one.		// 1. Create a new empty loop. Unlink the old loop and connect the new one.
VPCallbackILV CallbackILV(ILV);		VPCallbackILV CallbackILV(ILV);

VPTransformState State{BestVF, BestUF, LI,		VPTransformState State{BestVF, BestUF, LI,
DT, ILV.Builder, ILV.VectorLoopValueMap,		DT, ILV.Builder, ILV.VectorLoopValueMap,
&ILV, CallbackILV};		&ILV, CallbackILV};
State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton();		State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton();
		State.TripCount = ILV.getOrCreateTripCount(nullptr);

//===------------------------------------------------===//		//===------------------------------------------------===//
//		//
// Notice: any optimization or new instruction that go		// Notice: any optimization or new instruction that go
// into the code below should also be implemented in		// into the code below should also be implemented in
// the cost-model.		// the cost-model.
//		//
//===------------------------------------------------===//		//===------------------------------------------------===//
[164 lines not shown]	VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB);		BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB);
if (BCEntryIt != BlockMaskCache.end())		if (BCEntryIt != BlockMaskCache.end())
return BCEntryIt->second;		return BCEntryIt->second;

// All-one mask is modelled as no-mask following the convention for masked		// All-one mask is modelled as no-mask following the convention for masked
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

// Loop incoming mask is all-one.		if (OrigLoop->getHeader() == BB) {
if (OrigLoop->getHeader() == BB)		if (!CM.blockNeedsPredication(BB))
		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

		// Introduce the early-exit compare IV <= BTC to form header block mask.
		// This is used instead of IV < TC because TC may wrap, unlike BTC.
		VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
		VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();
		BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
return BlockMaskCache[BB] = BlockMask;		return BlockMaskCache[BB] = BlockMask;
		}

// This is the block mask. We OR all incoming edges.		// This is the block mask. We OR all incoming edges.
for (auto *Predecessor : predecessors(BB)) {		for (auto *Predecessor : predecessors(BB)) {
VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);		VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);
if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.		if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.
return BlockMaskCache[BB] = EdgeMask;		return BlockMaskCache[BB] = EdgeMask;

if (!BlockMask) { // BlockMask has its initialized nullptr value.		if (!BlockMask) { // BlockMask has its initialized nullptr value.
[349 lines not shown]	void LoopVectorizationPlanner::buildVPlansWithVPRecipes(unsigned MinVF,
for (BasicBlock *BB : OrigLoop->blocks()) {		for (BasicBlock *BB : OrigLoop->blocks()) {
if (BB == Latch)		if (BB == Latch)
continue;		continue;
BranchInst *Branch = dyn_cast<BranchInst>(BB->getTerminator());		BranchInst *Branch = dyn_cast<BranchInst>(BB->getTerminator());
if (Branch && Branch->isConditional())		if (Branch && Branch->isConditional())
NeedDef.insert(Branch->getCondition());		NeedDef.insert(Branch->getCondition());
}		}

		// If the tail is to be folded by masking, the primary induction variable
		// needs to be represented in VPlan for it to model early-exit masking.
		if (CM.foldTailByMasking())
		NeedDef.insert(Legal->getPrimaryInduction());

// Collect instructions from the original loop that will become trivially dead		// Collect instructions from the original loop that will become trivially dead
// in the vectorized loop. We don't need to vectorize these instructions. For		// in the vectorized loop. We don't need to vectorize these instructions. For
// example, original induction update instructions can become dead because we		// example, original induction update instructions can become dead because we
// separately emit induction "steps" when generating code for the new loop.		// separately emit induction "steps" when generating code for the new loop.
// Similarly, we create a new latch condition when setting up the structure		// Similarly, we create a new latch condition when setting up the structure
// of the new loop, so the old one can become dead.		// of the new loop, so the old one can become dead.
SmallPtrSet<Instruction *, 4> DeadInstructions;		SmallPtrSet<Instruction *, 4> DeadInstructions;
collectTriviallyDeadInstructions(DeadInstructions);		collectTriviallyDeadInstructions(DeadInstructions);
[792 lines not shown]
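
The reordered checks in computeMaxVF above can be summarized with a small standalone sketch. It is not the patch's code: `MaxVFDecision` and `decideMaxVF` are hypothetical names, the feasible maximum VF is passed in rather than computed, and `TC == 0` stands for an unknown trip count as in the diff.

```cpp
#include <cassert>

// Hypothetical summary of what computeMaxVF decides when optimizing for size.
struct MaxVFDecision {
  bool Vectorize;         // false corresponds to returning None in the patch
  unsigned MaxVF;         // meaningful only when Vectorize is true
  bool FoldTailByMasking; // true when the tail is folded into the vector loop
};

// TC == 0 models an unknown trip count. FeasibleMaxVF and
// CanFoldTailByMasking are inputs here; the real code queries the cost model
// and the legality analysis for them.
MaxVFDecision decideMaxVF(unsigned TC, unsigned FeasibleMaxVF,
                          bool CanFoldTailByMasking) {
  assert(FeasibleMaxVF != 0 && "VF must be nonzero");
  if (TC == 1)
    return {false, 0, false}; // single iteration, nothing to vectorize
  if (TC > 0 && TC % FeasibleMaxVF == 0)
    return {true, FeasibleMaxVF, false}; // no tail remains
  if (CanFoldTailByMasking)
    return {true, FeasibleMaxVF, true}; // unknown or non-multiple TC: mask
  return {false, 0, false}; // a scalar tail would be needed under -Os/-Oz
}
```

The point of the ordering is that an unknown trip count no longer aborts early; it only aborts once folding the tail has been ruled out.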

llvm/trunk/lib/Transforms/Vectorize/VPlan.h

[311 lines not shown]	struct VPTransformState {
/// Hold a reference to the Value state information used when generating the		/// Hold a reference to the Value state information used when generating the
/// Values of the output IR.		/// Values of the output IR.
VectorizerValueMap &ValueMap;		VectorizerValueMap &ValueMap;

/// Hold a reference to a mapping between VPValues in VPlan and original		/// Hold a reference to a mapping between VPValues in VPlan and original
/// Values they correspond to.		/// Values they correspond to.
VPValue2ValueTy VPValue2Value;		VPValue2ValueTy VPValue2Value;

		/// Hold the trip count of the scalar loop.
		Value *TripCount = nullptr;

/// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.		/// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.
InnerLoopVectorizer *ILV;		InnerLoopVectorizer *ILV;

VPCallback &Callback;		VPCallback &Callback;
};		};

/// VPBlockBase is the building block of the Hierarchical Control-Flow Graph.		/// VPBlockBase is the building block of the Hierarchical Control-Flow Graph.
/// A VPBlockBase can be either a VPBasicBlock or a VPRegionBlock.		/// A VPBlockBase can be either a VPBasicBlock or a VPRegionBlock.
[274 lines not shown]
/// While as any Recipe it may generate a sequence of IR instructions when		/// While as any Recipe it may generate a sequence of IR instructions when
/// executed, these instructions would always form a single-def expression as		/// executed, these instructions would always form a single-def expression as
/// the VPInstruction is also a single def-use vertex.		/// the VPInstruction is also a single def-use vertex.
class VPInstruction : public VPUser, public VPRecipeBase {		class VPInstruction : public VPUser, public VPRecipeBase {
friend class VPlanHCFGTransforms;		friend class VPlanHCFGTransforms;

public:		public:
/// VPlan opcodes, extending LLVM IR with idiomatics instructions.		/// VPlan opcodes, extending LLVM IR with idiomatics instructions.
enum { Not = Instruction::OtherOpsEnd + 1 };		enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };

private:		private:
typedef unsigned char OpcodeTy;		typedef unsigned char OpcodeTy;
OpcodeTy Opcode;		OpcodeTy Opcode;

/// Utility method serving execute(): generates a single instance of the		/// Utility method serving execute(): generates a single instance of the
/// modeled instruction.		/// modeled instruction.
void generateInstruction(VPTransformState &State, unsigned Part);		void generateInstruction(VPTransformState &State, unsigned Part);
[491 lines not shown]	private:

/// Holds all the external definitions created for this VPlan.		/// Holds all the external definitions created for this VPlan.
// TODO: Introduce a specific representation for external definitions in		// TODO: Introduce a specific representation for external definitions in
// VPlan. External definitions must be immutable and hold a pointer to its		// VPlan. External definitions must be immutable and hold a pointer to its
// underlying IR that will be used to implement its structural comparison		// underlying IR that will be used to implement its structural comparison
// (operators '==' and '<').		// (operators '==' and '<').
SmallPtrSet<VPValue *, 16> VPExternalDefs;		SmallPtrSet<VPValue *, 16> VPExternalDefs;

		/// Represents the backedge taken count of the original loop, for folding
		/// the tail.
		VPValue *BackedgeTakenCount = nullptr;

/// Holds a mapping between Values and their corresponding VPValue inside		/// Holds a mapping between Values and their corresponding VPValue inside
/// VPlan.		/// VPlan.
Value2VPValueTy Value2VPValue;		Value2VPValueTy Value2VPValue;

/// Holds the VPLoopInfo analysis for this VPlan.		/// Holds the VPLoopInfo analysis for this VPlan.
VPLoopInfo VPLInfo;		VPLoopInfo VPLInfo;

/// Holds the condition bit values built during VPInstruction to VPRecipe transformation.		/// Holds the condition bit values built during VPInstruction to VPRecipe transformation.
SmallVector<VPValue *, 4> VPCBVs;		SmallVector<VPValue *, 4> VPCBVs;

public:		public:
VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {}		VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {}

~VPlan() {		~VPlan() {
if (Entry)		if (Entry)
VPBlockBase::deleteCFG(Entry);		VPBlockBase::deleteCFG(Entry);
for (auto &MapEntry : Value2VPValue)		for (auto &MapEntry : Value2VPValue)
		if (MapEntry.second != BackedgeTakenCount)
delete MapEntry.second;		delete MapEntry.second;
		if (BackedgeTakenCount)
		delete BackedgeTakenCount; // Delete once, if in Value2VPValue or not.
for (VPValue *Def : VPExternalDefs)		for (VPValue *Def : VPExternalDefs)
delete Def;		delete Def;
for (VPValue *CBV : VPCBVs)		for (VPValue *CBV : VPCBVs)
delete CBV;		delete CBV;
}		}

/// Generate the IR code for this VPlan.		/// Generate the IR code for this VPlan.
void execute(struct VPTransformState *State);		void execute(struct VPTransformState *State);

VPBlockBase *getEntry() { return Entry; }		VPBlockBase *getEntry() { return Entry; }
const VPBlockBase *getEntry() const { return Entry; }		const VPBlockBase *getEntry() const { return Entry; }

VPBlockBase *setEntry(VPBlockBase *Block) { return Entry = Block; }		VPBlockBase *setEntry(VPBlockBase *Block) { return Entry = Block; }

		/// The backedge taken count of the original loop.
		VPValue *getOrCreateBackedgeTakenCount() {
		if (!BackedgeTakenCount)
		BackedgeTakenCount = new VPValue();
		return BackedgeTakenCount;
		}

void addVF(unsigned VF) { VFs.insert(VF); }		void addVF(unsigned VF) { VFs.insert(VF); }

bool hasVF(unsigned VF) { return VFs.count(VF); }		bool hasVF(unsigned VF) { return VFs.count(VF); }

const std::string &getName() const { return Name; }		const std::string &getName() const { return Name; }

void setName(const Twine &newName) { Name = newName.str(); }		void setName(const Twine &newName) { Name = newName.str(); }

[295 lines not shown]
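
The destructor change above has to delete BackedgeTakenCount exactly once, whether or not it was also registered in Value2VPValue. A minimal standalone sketch of that ownership pattern follows; `Node`, `Plan`, `registerSpecial` and `liveNodesAfterScope` are hypothetical stand-ins for illustration, not VPlan classes.

```cpp
#include <cassert>
#include <map>

static int LiveNodes = 0; // counts Node objects currently alive

struct Node {
  Node() { ++LiveNodes; }
  ~Node() { --LiveNodes; }
};

class Plan {
  Node *Special = nullptr;         // lazily created, like BackedgeTakenCount
  std::map<int, Node *> KeyToNode; // owning map, like Value2VPValue

public:
  Plan() = default;
  Plan(const Plan &) = delete; // owning raw pointers: forbid copies
  Plan &operator=(const Plan &) = delete;

  ~Plan() {
    for (auto &Entry : KeyToNode)
      if (Entry.second != Special) // skip it here to avoid a double delete
        delete Entry.second;
    delete Special; // deleted once, registered or not; nullptr is a no-op
  }

  Node *getOrCreateSpecial() {
    if (!Special)
      Special = new Node();
    return Special;
  }

  // Mirrors VPlan::execute() mapping the generated IR value to the count.
  void registerSpecial(int Key) { KeyToNode[Key] = getOrCreateSpecial(); }

  void addNode(int Key) {
    assert(!KeyToNode.count(Key) && "key already present");
    KeyToNode[Key] = new Node();
  }
};

// Builds a plan, optionally registers the special node in the map, and
// returns how many nodes are still alive after the plan is destroyed.
int liveNodesAfterScope(bool Register) {
  {
    Plan P;
    P.addNode(1);
    P.getOrCreateSpecial();
    if (Register)
      P.registerSpecial(2);
  }
  return LiveNodes;
}
```

Both paths should leave no live nodes, which is the property the added `MapEntry.second != BackedgeTakenCount` test protects.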

llvm/trunk/lib/Transforms/Vectorize/VPlan.cpp

[297 lines not shown]	void VPInstruction::generateInstruction(VPTransformState &State,

switch (getOpcode()) {		switch (getOpcode()) {
case VPInstruction::Not: {		case VPInstruction::Not: {
Value *A = State.get(getOperand(0), Part);		Value *A = State.get(getOperand(0), Part);
Value *V = Builder.CreateNot(A);		Value *V = Builder.CreateNot(A);
State.set(this, V, Part);		State.set(this, V, Part);
break;		break;
}		}
		case VPInstruction::ICmpULE: {
		Value *IV = State.get(getOperand(0), Part);
		Value *TC = State.get(getOperand(1), Part);
		Value *V = Builder.CreateICmpULE(IV, TC);
		State.set(this, V, Part);
		break;
		}
default:		default:
llvm_unreachable("Unsupported opcode for instruction");		llvm_unreachable("Unsupported opcode for instruction");
}		}
}		}

void VPInstruction::execute(VPTransformState &State) {		void VPInstruction::execute(VPTransformState &State) {
assert(!State.Instance && "VPInstruction executing an Instance");		assert(!State.Instance && "VPInstruction executing an Instance");
for (unsigned Part = 0; Part < State.UF; ++Part)		for (unsigned Part = 0; Part < State.UF; ++Part)
[9 lines not shown]
void VPInstruction::print(raw_ostream &O) const {		void VPInstruction::print(raw_ostream &O) const {
printAsOperand(O);		printAsOperand(O);
O << " = ";		O << " = ";

switch (getOpcode()) {		switch (getOpcode()) {
case VPInstruction::Not:		case VPInstruction::Not:
O << "not";		O << "not";
break;		break;
		case VPInstruction::ICmpULE:
		O << "icmp ule";
		break;
default:		default:
O << Instruction::getOpcodeName(getOpcode());		O << Instruction::getOpcodeName(getOpcode());
}		}

for (const VPValue *Operand : operands()) {		for (const VPValue *Operand : operands()) {
O << " ";		O << " ";
Operand->printAsOperand(O);		Operand->printAsOperand(O);
}		}
}		}

/// Generate the code inside the body of the vectorized loop. Assumes a single		/// Generate the code inside the body of the vectorized loop. Assumes a single
/// LoopVectorBody basic-block was created for this. Introduce additional		/// LoopVectorBody basic-block was created for this. Introduce additional
/// basic-blocks as needed, and fill them all.		/// basic-blocks as needed, and fill them all.
void VPlan::execute(VPTransformState *State) {		void VPlan::execute(VPTransformState *State) {
		// -1. Check if the backedge taken count is needed, and if so build it.
		if (BackedgeTakenCount && BackedgeTakenCount->getNumUsers()) {
		Value *TC = State->TripCount;
		IRBuilder<> Builder(State->CFG.PrevBB->getTerminator());
		auto *TCMO = Builder.CreateSub(TC, ConstantInt::get(TC->getType(), 1),
		"trip.count.minus.1");
		Value2VPValue[TCMO] = BackedgeTakenCount;
		}

// 0. Set the reverse mapping from VPValues to Values for code generation.		// 0. Set the reverse mapping from VPValues to Values for code generation.
for (auto &Entry : Value2VPValue)		for (auto &Entry : Value2VPValue)
State->VPValue2Value[Entry.second] = Entry.first;		State->VPValue2Value[Entry.second] = Entry.first;

BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;		BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;
BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();		BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();
assert(VectorHeaderBB && "Loop preheader does not have a single successor.");		assert(VectorHeaderBB && "Loop preheader does not have a single successor.");
BasicBlock *VectorLatchBB = VectorHeaderBB;		BasicBlock *VectorLatchBB = VectorHeaderBB;
[111 lines not shown]

void VPlanPrinter::dump() {		void VPlanPrinter::dump() {
Depth = 1;		Depth = 1;
bumpIndent(0);		bumpIndent(0);
OS << "digraph VPlan {\n";		OS << "digraph VPlan {\n";
OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";		OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
if (!Plan.getName().empty())		if (!Plan.getName().empty())
OS << "\\n" << DOT::EscapeString(Plan.getName());		OS << "\\n" << DOT::EscapeString(Plan.getName());
if (!Plan.Value2VPValue.empty()) {		if (!Plan.Value2VPValue.empty() || Plan.BackedgeTakenCount) {
OS << ", where:";		OS << ", where:";
		if (Plan.BackedgeTakenCount)
		OS << "\\n"
		<< *Plan.getOrCreateBackedgeTakenCount() << " := BackedgeTakenCount";
for (auto Entry : Plan.Value2VPValue) {		for (auto Entry : Plan.Value2VPValue) {
OS << "\\n" << *Entry.second;		OS << "\\n" << *Entry.second;
OS << DOT::EscapeString(" := ");		OS << DOT::EscapeString(" := ");
Entry.first->printAsOperand(OS, false);		Entry.first->printAsOperand(OS, false);
}		}
}		}
OS << "\"]\n";		OS << "\"]\n";
OS << "node [shape=rect, fontname=Courier, fontsize=30]\n";		OS << "node [shape=rect, fontname=Courier, fontsize=30]\n";
[179 lines not shown]
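
The header mask compares the induction variable against trip.count.minus.1 (the backedge-taken count) rather than against the trip count, because the trip count itself may wrap to zero. The sketch below illustrates this with 8-bit arithmetic, where a trip count of 256 is represented as 0. The function name, the 8-bit width and the round-up formula are assumptions made for the illustration and are not taken from the lines shown in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Counts the lanes enabled over a whole tail-folded vector loop.
// TC == 0 stands for a trip count of 256 that wrapped in 8 bits.
// UseBTC selects the mask "IV + Lane <= TC - 1"; otherwise the naive
// "IV + Lane < TC" is used.
unsigned countMaskedIterations(uint8_t TC, unsigned VF, bool UseBTC) {
  assert(VF != 0 && (VF & (VF - 1)) == 0 && VF <= 128 &&
         "VF must be a power of two that divides 256");
  const uint8_t BTC = static_cast<uint8_t>(TC - 1); // trip.count.minus.1
  // Round TC up to a multiple of VF in 8 bits; the result may wrap to zero.
  const uint8_t Padded = static_cast<uint8_t>(TC + (VF - 1));
  const uint8_t VectorTC = static_cast<uint8_t>(Padded - Padded % VF);

  unsigned Active = 0;
  uint8_t IV = 0;
  do {
    for (unsigned Lane = 0; Lane < VF; ++Lane) {
      const uint8_t Index = static_cast<uint8_t>(IV + Lane);
      // Both comparisons are unsigned, like icmp ule / icmp ult.
      const bool Enabled = UseBTC ? Index <= BTC : Index < TC;
      if (Enabled)
        ++Active;
    }
    IV = static_cast<uint8_t>(IV + VF); // wraps to zero after 256 / VF steps
  } while (IV != VectorTC);
  return Active;
}
```

With `TC == 0` the naive mask enables no lanes at all, while the BTC mask enables all 256. The sketch also relies on VF being a power of two, so that the induction variable wraps exactly to zero, which is the property the new comment in selectInterleaveCount refers to.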

llvm/trunk/test/Transforms/LoopVectorize/X86/optsize.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; This test verifies that the loop vectorizer will NOT vectorize loops that			; This test verifies that the loop vectorizer will NOT vectorize loops that
	; will produce a tail loop with the optimize for size or the minimize size			; will produce a tail loop with the optimize for size or the minimize size
	; attributes. This is a target-dependent version of the test.			; attributes. This is a target-dependent version of the test.
	; RUN: opt < %s -loop-vectorize -force-vector-width=64 -S -mtriple=x86_64-unknown-linux -mcpu=skx | FileCheck %s			; RUN: opt < %s -loop-vectorize -force-vector-width=64 -S -mtriple=x86_64-unknown-linux -mcpu=skx | FileCheck %s

	target datalayout = "E-m:e-p:32:32-i64:32-f64:32:64-a:0:32-n32-S128"			target datalayout = "E-m:e-p:32:32-i64:32-f64:32:64-a:0:32-n32-S128"

	@tab = common global [32 x i8] zeroinitializer, align 1			@tab = common global [32 x i8] zeroinitializer, align 1

	define i32 @foo_optsize() #0 {			define i32 @foo_optsize() #0 {
	; CHECK-LABEL: @foo_optsize(			; CHECK-LABEL: @foo_optsize(
	; CHECK-NOT: x i8>			; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <64 x i32> undef, i32 [[INDEX]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT]], <64 x i32> undef, <64 x i32> zeroinitializer
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <64 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.*]] = icmp ule <64 x i32> [[INDUCTION]], <i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202>
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[TMP1]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <64 x i8>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <64 x i8> @llvm.masked.load.v64i8.p0v64i8(<64 x i8>* [[TMP4]], i32 1, <64 x i1> [[TMP2]], <64 x i8> undef)
				; CHECK-NEXT: [[TMP5:%.*]] = icmp eq <64 x i8> [[WIDE_MASKED_LOAD]], zeroinitializer
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <64 x i1> [[TMP5]], i32 0
				; CHECK-NEXT: [[TMP7:%.*]] = select <64 x i1> [[TMP5]], <64 x i8> <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2>, <64 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
				; CHECK-NEXT: [[TMP8:%.*]] = bitcast i8* [[TMP3]] to <64 x i8>*
				; CHECK-NEXT: call void @llvm.masked.store.v64i8.p0v64i8(<64 x i8> [[TMP7]], <64 x i8>* [[TMP8]], i32 1, <64 x i1> [[TMP2]])
				; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 64
				; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 256, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 [[I_08]]
				; CHECK-NEXT: [[TMP10:%.*]] = load i8, i8* [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[CMP1:%.*]] = icmp eq i8 [[TMP10]], 0
				; CHECK-NEXT: [[DOT:%.*]] = select i1 [[CMP1]], i8 2, i8 1
				; CHECK-NEXT: store i8 [[DOT]], i8* [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[INC]] = add nsw i32 [[I_08]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[I_08]], 202
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop !2
				; CHECK: for.end:
				; CHECK-NEXT: ret i32 0
				;

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %for.body, %entry			for.body: ; preds = %for.body, %entry
	%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]			%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
	%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08			%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
	%0 = load i8, i8* %arrayidx, align 1			%0 = load i8, i8* %arrayidx, align 1
	%cmp1 = icmp eq i8 %0, 0			%cmp1 = icmp eq i8 %0, 0
	%. = select i1 %cmp1, i8 2, i8 1			%. = select i1 %cmp1, i8 2, i8 1
	store i8 %., i8* %arrayidx, align 1			store i8 %., i8* %arrayidx, align 1
	%inc = add nsw i32 %i.08, 1			%inc = add nsw i32 %i.08, 1
	%exitcond = icmp eq i32 %i.08, 202			%exitcond = icmp eq i32 %i.08, 202
	br i1 %exitcond, label %for.end, label %for.body			br i1 %exitcond, label %for.end, label %for.body

	for.end: ; preds = %for.body			for.end: ; preds = %for.body
	ret i32 0			ret i32 0
	}			}

	attributes #0 = { optsize }			attributes #0 = { optsize }

	define i32 @foo_minsize() #1 {			define i32 @foo_minsize() #1 {
	; CHECK-LABEL: @foo_minsize(			; CHECK-LABEL: @foo_minsize(
	; CHECK-NOT: x i8>			; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <64 x i32> undef, i32 [[INDEX]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT]], <64 x i32> undef, <64 x i32> zeroinitializer
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <64 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.*]] = icmp ule <64 x i32> [[INDUCTION]], <i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202, i32 202>
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[TMP1]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <64 x i8>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <64 x i8> @llvm.masked.load.v64i8.p0v64i8(<64 x i8>* [[TMP4]], i32 1, <64 x i1> [[TMP2]], <64 x i8> undef)
				; CHECK-NEXT: [[TMP5:%.*]] = icmp eq <64 x i8> [[WIDE_MASKED_LOAD]], zeroinitializer
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <64 x i1> [[TMP5]], i32 0
				; CHECK-NEXT: [[TMP7:%.*]] = select <64 x i1> [[TMP5]], <64 x i8> <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2>, <64 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
				; CHECK-NEXT: [[TMP8:%.*]] = bitcast i8* [[TMP3]] to <64 x i8>*
				; CHECK-NEXT: call void @llvm.masked.store.v64i8.p0v64i8(<64 x i8> [[TMP7]], <64 x i8>* [[TMP8]], i32 1, <64 x i1> [[TMP2]])
				; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 64
				; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 256, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 [[I_08]]
				; CHECK-NEXT: [[TMP10:%.*]] = load i8, i8* [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[CMP1:%.*]] = icmp eq i8 [[TMP10]], 0
				; CHECK-NEXT: [[DOT:%.*]] = select i1 [[CMP1]], i8 2, i8 1
				; CHECK-NEXT: store i8 [[DOT]], i8* [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[INC]] = add nsw i32 [[I_08]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[I_08]], 202
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop !5
				; CHECK: for.end:
				; CHECK-NEXT: ret i32 0
				;

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %for.body, %entry			for.body: ; preds = %for.body, %entry
	%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]			%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
	%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08			%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
	%0 = load i8, i8* %arrayidx, align 1			%0 = load i8, i8* %arrayidx, align 1

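The constants in the CHECK lines above (mask bound 202, vector loop exit at 256 for VF=64) follow from rounding the trip count up to a multiple of VF and comparing each lane against the backedge-taken count. The following is a minimal sketch of that arithmetic, not code from the patch; the helper names `foldedVectorTripCount` and `laneIsActive` are hypothetical, and 64-bit arithmetic is assumed not to wrap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): trip count of the vector loop
// when the tail is folded by masking. It works from the backedge-taken count
// BTC == TC - 1, so TC itself never has to be formed.
inline std::uint64_t foldedVectorTripCount(std::uint64_t BTC, std::uint64_t VF) {
  assert(VF != 0 && "vectorization factor must be non-zero");
  std::uint64_t RndUp = BTC + VF;  // equals TC + (VF - 1); assumed not to wrap
  return RndUp - RndUp % VF;       // largest multiple of VF not above RndUp
}

// Hypothetical helper: whether the lane handling scalar iteration Index
// executes. This is the "i <= TC-1" form of the condition, which stays
// correct even when TC would overflow the induction type.
inline bool laneIsActive(std::uint64_t Index, std::uint64_t BTC) {
  return Index <= BTC;
}
```

For example, `foldedVectorTripCount(202, 64)` gives 256, matching the `icmp eq i32 [[INDEX_NEXT]], 256` exit test, and lanes 203 through 255 of the last vector iteration are masked off.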
llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll

; <label>:1 ; preds = %1, %0
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, 256		%exitcond = icmp eq i32 %lftr.wideiv, 256
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

; Can't vectorize in 'optsize' mode because we need a tail.		; Can vectorize in 'optsize' mode by masking the needed tail.
;CHECK-LABEL: @example2(
;CHECK-NOT: store <4 x i32>
;CHECK: ret void
define void @example2(i32 %n, i32 %x) optsize {		define void @example2(i32 %n, i32 %x) optsize {
		; CHECK-LABEL: @example2(
		; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt i32 [[N:%.*]], 0
		; CHECK-NEXT: br i1 [[TMP1]], label [[DOTLR_PH5_PREHEADER:%.*]], label [[DOTPREHEADER:%.*]]
		; CHECK: .lr.ph5.preheader:
		; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[N]], -1
		; CHECK-NEXT: [[TMP3:%.*]] = zext i32 [[TMP2]] to i64
		; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_RND_UP:%.*]] = add nuw nsw i64 [[TMP3]], 4
		; CHECK-NEXT: [[TMP4:%.*]] = and i32 [[TMP2]], 3
		; CHECK-NEXT: [[N_MOD_VF:%.*]] = zext i32 [[TMP4]] to i64
		; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[N_RND_UP]], [[N_MOD_VF]]
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i64> undef, i64 [[TMP3]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT1]], <4 x i64> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE8:%.*]] ]
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> undef, i64 [[INDEX]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
		; CHECK-NEXT: [[TMP5:%.*]] = or i64 [[INDEX]], 1
		; CHECK-NEXT: [[TMP6:%.*]] = or i64 [[INDEX]], 2
		; CHECK-NEXT: [[TMP7:%.*]] = or i64 [[INDEX]], 3
		; CHECK-NEXT: [[TMP8:%.*]] = icmp ule <4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT2]]
		; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i1> [[TMP8]], i32 0
		; CHECK-NEXT: br i1 [[TMP9]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
		; CHECK: pred.store.if:
		; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 [[INDEX]]
		; CHECK-NEXT: store i32 [[X:%.*]], i32* [[TMP10]], align 16
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
		; CHECK: pred.store.continue:
		; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i1> [[TMP8]], i32 1
		; CHECK-NEXT: br i1 [[TMP11]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
		; CHECK: pred.store.if3:
		; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 [[TMP5]]
		; CHECK-NEXT: store i32 [[X]], i32* [[TMP12]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE4]]
		; CHECK: pred.store.continue4:
		; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i1> [[TMP8]], i32 2
		; CHECK-NEXT: br i1 [[TMP13]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
		; CHECK: pred.store.if5:
		; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 [[TMP6]]
		; CHECK-NEXT: store i32 [[X]], i32* [[TMP14]], align 8
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE6]]
		; CHECK: pred.store.continue6:
		; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i1> [[TMP8]], i32 3
		; CHECK-NEXT: br i1 [[TMP15]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8]]
		; CHECK: pred.store.if7:
		; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 [[TMP7]]
		; CHECK-NEXT: store i32 [[X]], i32* [[TMP16]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE8]]
		; CHECK: pred.store.continue8:
		; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
		; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
		; CHECK: middle.block:
		; CHECK-NEXT: br i1 true, label [[DOT_PREHEADER_CRIT_EDGE:%.*]], label [[SCALAR_PH]]
		; CHECK: ._crit_edge:
		; CHECK-NEXT: ret void
		;
%1 = icmp sgt i32 %n, 0		%1 = icmp sgt i32 %n, 0
br i1 %1, label %.lr.ph5, label %.preheader		br i1 %1, label %.lr.ph5, label %.preheader

..preheader_crit_edge: ; preds = %.lr.ph5		..preheader_crit_edge: ; preds = %.lr.ph5
%phitmp = sext i32 %n to i64		%phitmp = sext i32 %n to i64
br label %.preheader		br label %.preheader

.preheader: ; preds = %..preheader_crit_edge, %0		.preheader: ; preds = %..preheader_crit_edge, %0
.lr.ph: ; preds = %.preheader, %.lr.ph
%indvars.iv.next = add i64 %indvars.iv, 1		%indvars.iv.next = add i64 %indvars.iv, 1
%11 = icmp eq i32 %4, 0		%11 = icmp eq i32 %4, 0
br i1 %11, label %._crit_edge, label %.lr.ph		br i1 %11, label %._crit_edge, label %.lr.ph

._crit_edge: ; preds = %.lr.ph, %.preheader		._crit_edge: ; preds = %.lr.ph, %.preheader
ret void		ret void
}		}

; N is unknown, we need a tail. Can't vectorize.		; N is unknown, we need a tail. Can't vectorize because loop has no primary
		; induction.
;CHECK-LABEL: @example3(		;CHECK-LABEL: @example3(
;CHECK-NOT: <4 x i32>		;CHECK-NOT: <4 x i32>
;CHECK: ret void		;CHECK: ret void
define void @example3(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) optsize {		define void @example3(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) optsize {
%1 = icmp eq i32 %n, 0		%1 = icmp eq i32 %n, 0
br i1 %1, label %._crit_edge, label %.lr.ph		br i1 %1, label %._crit_edge, label %.lr.ph

.lr.ph: ; preds = %0, %.lr.ph		.lr.ph: ; preds = %0, %.lr.ph
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[NEXT_GEP]] to <4 x i16>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[NEXT_GEP]] to <4 x i16>*
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 2		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 2
; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i16> [[WIDE_LOAD]] to <4 x i32>		; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i16> [[WIDE_LOAD]] to <4 x i32>
; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw <4 x i32> [[TMP2]], <i32 7, i32 7, i32 7, i32 7>		; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw <4 x i32> [[TMP2]], <i32 7, i32 7, i32 7, i32 7>
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[NEXT_GEP4]] to <4 x i32>*		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[NEXT_GEP4]] to <4 x i32>*
; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 4		; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4		; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !6
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: br i1 true, label [[TMP7:%.*]], label [[SCALAR_PH]]		; CHECK-NEXT: br i1 true, label [[TMP7:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:		; CHECK: scalar.ph:
; CHECK-NEXT: br label [[TMP6:%.*]]		; CHECK-NEXT: br label [[TMP6:%.*]]
; CHECK: br i1 undef, label [[TMP7]], label [[TMP6]], !llvm.loop !5		; CHECK: br i1 undef, label [[TMP7]], label [[TMP6]], !llvm.loop !7
; CHECK: ret void		; CHECK: ret void
;		;
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]
%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]		%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]
%i.02 = phi i32 [ 0, %0 ], [ %7, %1 ]		%i.02 = phi i32 [ 0, %0 ], [ %7, %1 ]
%2 = getelementptr inbounds i16, i16* %.04, i64 1		%2 = getelementptr inbounds i16, i16* %.04, i64 1
%3 = load i16, i16* %.04, align 2		%3 = load i16, i16* %.04, align 2
%4 = zext i16 %3 to i32		%4 = zext i16 %3 to i32
%5 = shl nuw nsw i32 %4, 7		%5 = shl nuw nsw i32 %4, 7
%6 = getelementptr inbounds i32, i32* %.013, i64 1		%6 = getelementptr inbounds i32, i32* %.013, i64 1
store i32 %5, i32* %.013, align 4		store i32 %5, i32* %.013, align 4
%7 = add nsw i32 %i.02, 1		%7 = add nsw i32 %i.02, 1
%exitcond = icmp eq i32 %7, 256		%exitcond = icmp eq i32 %7, 256
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

; We CAN'T vectorize this example because it would entail a tail.		; We CAN vectorize this example by folding the tail it entails.
define void @example23c(i16* noalias nocapture %src, i32* noalias nocapture %dst) optsize {		define void @example23c(i16* noalias nocapture %src, i32* noalias nocapture %dst) optsize {
; CHECK-LABEL: @example23c(		; CHECK-LABEL: @example23c(
; CHECK-NOT: <4 x		; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE22:%.*]] ]
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> undef, i64 [[INDEX]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
		; CHECK-NEXT: [[TMP1:%.*]] = icmp ult <4 x i64> [[INDUCTION]], <i64 257, i64 257, i64 257, i64 257>
		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i1> [[TMP1]], i32 0
		; CHECK-NEXT: br i1 [[TMP2]], label [[PRED_LOAD_IF:%.*]], label [[PRED_LOAD_CONTINUE:%.*]]
		; CHECK: pred.load.if:
		; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i16, i16* [[SRC:%.*]], i64 [[INDEX]]
		; CHECK-NEXT: [[TMP3:%.*]] = load i16, i16* [[NEXT_GEP]], align 2
		; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE]]
		; CHECK: pred.load.continue:
		; CHECK-NEXT: [[TMP4:%.*]] = phi i16 [ undef, [[VECTOR_BODY]] ], [ [[TMP3]], [[PRED_LOAD_IF]] ]
		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i1> [[TMP1]], i32 1
		; CHECK-NEXT: br i1 [[TMP5]], label [[PRED_LOAD_IF11:%.*]], label [[PRED_LOAD_CONTINUE12:%.*]]
		; CHECK: pred.load.if11:
		; CHECK-NEXT: [[TMP6:%.*]] = or i64 [[INDEX]], 1
		; CHECK-NEXT: [[NEXT_GEP4:%.*]] = getelementptr i16, i16* [[SRC]], i64 [[TMP6]]
		; CHECK-NEXT: [[TMP7:%.*]] = load i16, i16* [[NEXT_GEP4]], align 2
		; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE12]]
		; CHECK: pred.load.continue12:
		; CHECK-NEXT: [[TMP8:%.*]] = phi i16 [ undef, [[PRED_LOAD_CONTINUE]] ], [ [[TMP7]], [[PRED_LOAD_IF11]] ]
		; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i1> [[TMP1]], i32 2
		; CHECK-NEXT: br i1 [[TMP9]], label [[PRED_LOAD_IF13:%.*]], label [[PRED_LOAD_CONTINUE14:%.*]]
		; CHECK: pred.load.if13:
		; CHECK-NEXT: [[TMP10:%.*]] = or i64 [[INDEX]], 2
		; CHECK-NEXT: [[NEXT_GEP5:%.*]] = getelementptr i16, i16* [[SRC]], i64 [[TMP10]]
		; CHECK-NEXT: [[TMP11:%.*]] = load i16, i16* [[NEXT_GEP5]], align 2
		; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE14]]
		; CHECK: pred.load.continue14:
		; CHECK-NEXT: [[TMP12:%.*]] = phi i16 [ undef, [[PRED_LOAD_CONTINUE12]] ], [ [[TMP11]], [[PRED_LOAD_IF13]] ]
		; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i1> [[TMP1]], i32 3
		; CHECK-NEXT: br i1 [[TMP13]], label [[PRED_LOAD_IF15:%.*]], label [[PRED_LOAD_CONTINUE16:%.*]]
		; CHECK: pred.load.if15:
		; CHECK-NEXT: [[TMP14:%.*]] = or i64 [[INDEX]], 3
		; CHECK-NEXT: [[NEXT_GEP6:%.*]] = getelementptr i16, i16* [[SRC]], i64 [[TMP14]]
		; CHECK-NEXT: [[TMP15:%.*]] = load i16, i16* [[NEXT_GEP6]], align 2
		; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE16]]
		; CHECK: pred.load.continue16:
		; CHECK-NEXT: [[TMP16:%.*]] = phi i16 [ undef, [[PRED_LOAD_CONTINUE14]] ], [ [[TMP15]], [[PRED_LOAD_IF15]] ]
		; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i1> [[TMP1]], i32 0
		; CHECK-NEXT: br i1 [[TMP17]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
		; CHECK: pred.store.if:
		; CHECK-NEXT: [[TMP18:%.*]] = zext i16 [[TMP4]] to i32
		; CHECK-NEXT: [[TMP19:%.*]] = shl nuw nsw i32 [[TMP18]], 7
		; CHECK-NEXT: [[NEXT_GEP7:%.*]] = getelementptr i32, i32* [[DST:%.*]], i64 [[INDEX]]
		; CHECK-NEXT: store i32 [[TMP19]], i32* [[NEXT_GEP7]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
		; CHECK: pred.store.continue:
		; CHECK-NEXT: [[TMP20:%.*]] = extractelement <4 x i1> [[TMP1]], i32 1
		; CHECK-NEXT: br i1 [[TMP20]], label [[PRED_STORE_IF17:%.*]], label [[PRED_STORE_CONTINUE18:%.*]]
		; CHECK: pred.store.if17:
		; CHECK-NEXT: [[TMP21:%.*]] = zext i16 [[TMP8]] to i32
		; CHECK-NEXT: [[TMP22:%.*]] = shl nuw nsw i32 [[TMP21]], 7
		; CHECK-NEXT: [[TMP23:%.*]] = or i64 [[INDEX]], 1
		; CHECK-NEXT: [[NEXT_GEP8:%.*]] = getelementptr i32, i32* [[DST]], i64 [[TMP23]]
		; CHECK-NEXT: store i32 [[TMP22]], i32* [[NEXT_GEP8]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE18]]
		; CHECK: pred.store.continue18:
		; CHECK-NEXT: [[TMP24:%.*]] = extractelement <4 x i1> [[TMP1]], i32 2
		; CHECK-NEXT: br i1 [[TMP24]], label [[PRED_STORE_IF19:%.*]], label [[PRED_STORE_CONTINUE20:%.*]]
		; CHECK: pred.store.if19:
		; CHECK-NEXT: [[TMP25:%.*]] = zext i16 [[TMP12]] to i32
		; CHECK-NEXT: [[TMP26:%.*]] = shl nuw nsw i32 [[TMP25]], 7
		; CHECK-NEXT: [[TMP27:%.*]] = or i64 [[INDEX]], 2
		; CHECK-NEXT: [[NEXT_GEP9:%.*]] = getelementptr i32, i32* [[DST]], i64 [[TMP27]]
		; CHECK-NEXT: store i32 [[TMP26]], i32* [[NEXT_GEP9]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE20]]
		; CHECK: pred.store.continue20:
		; CHECK-NEXT: [[TMP28:%.*]] = extractelement <4 x i1> [[TMP1]], i32 3
		; CHECK-NEXT: br i1 [[TMP28]], label [[PRED_STORE_IF21:%.*]], label [[PRED_STORE_CONTINUE22]]
		; CHECK: pred.store.if21:
		; CHECK-NEXT: [[TMP29:%.*]] = zext i16 [[TMP16]] to i32
		; CHECK-NEXT: [[TMP30:%.*]] = shl nuw nsw i32 [[TMP29]], 7
		; CHECK-NEXT: [[TMP31:%.*]] = or i64 [[INDEX]], 3
		; CHECK-NEXT: [[NEXT_GEP10:%.*]] = getelementptr i32, i32* [[DST]], i64 [[TMP31]]
		; CHECK-NEXT: store i32 [[TMP30]], i32* [[NEXT_GEP10]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE22]]
		; CHECK: pred.store.continue22:
		; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
		; CHECK-NEXT: [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT]], 260
		; CHECK-NEXT: br i1 [[TMP32]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !8
		; CHECK: middle.block:
		; CHECK-NEXT: br i1 true, label [[TMP34:%.*]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: br label [[TMP33:%.*]]
		; CHECK: br i1 undef, label [[TMP34]], label [[TMP33]], !llvm.loop !9
; CHECK: ret void		; CHECK: ret void
		;
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]
%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]		%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]
%i.02 = phi i64 [ 0, %0 ], [ %7, %1 ]		%i.02 = phi i64 [ 0, %0 ], [ %7, %1 ]
%2 = getelementptr inbounds i16, i16* %.04, i64 1		%2 = getelementptr inbounds i16, i16* %.04, i64 1
%3 = load i16, i16* %.04, align 2		%3 = load i16, i16* %.04, align 2
%4 = zext i16 %3 to i32		%4 = zext i16 %3 to i32
%5 = shl nuw nsw i32 %4, 7		%5 = shl nuw nsw i32 %4, 7
%6 = getelementptr inbounds i32, i32* %.013, i64 1		%6 = getelementptr inbounds i32, i32* %.013, i64 1
store i32 %5, i32* %.013, align 4		store i32 %5, i32* %.013, align 4
%7 = add nsw i64 %i.02, 1		%7 = add nsw i64 %i.02, 1
%exitcond = icmp eq i64 %7, 257		%exitcond = icmp eq i64 %7, 257
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

; We CAN'T vectorize this example because it would entail a tail.		; We CAN'T vectorize this example because it would entail a tail and an
		; induction is used outside the loop.
define i64 @example23d(i16* noalias nocapture %src, i32* noalias nocapture %dst) optsize {		define i64 @example23d(i16* noalias nocapture %src, i32* noalias nocapture %dst) optsize {
;CHECK-LABEL: @example23d(		;CHECK-LABEL: @example23d(
; CHECK-NOT: <4 x		; CHECK-NOT: <4 x
; CHECK: ret i64		; CHECK: ret i64
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]

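The predicated stores checked for @example2 above amount to running the loop over a rounded-up trip count and guarding each lane individually. A rough scalar emulation of that shape follows, as a sketch only: the function name `fillTailFolded` is hypothetical, VF is fixed at 4 to match the test, and the caller is assumed to pass a buffer of at least N elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical emulation (not part of the patch) of a tail-folded loop that
// stores X into B[0..N-1] with VF = 4 and no scalar remainder loop.
inline void fillTailFolded(std::vector<int> &B, std::uint64_t N, int X) {
  constexpr std::uint64_t VF = 4;
  assert(B.size() >= N && "buffer must hold at least N elements");
  if (N == 0)
    return;                              // mirrors the guard on %n before the loop
  const std::uint64_t BTC = N - 1;       // backedge-taken count
  const std::uint64_t RndUp = BTC + VF;  // assumed not to wrap in 64 bits
  const std::uint64_t NVec = RndUp - RndUp % VF;
  for (std::uint64_t Index = 0; Index != NVec; Index += VF) {
    for (std::uint64_t Lane = 0; Lane < VF; ++Lane) {
      // Per-lane predicate, corresponding to the
      // "icmp ule <4 x i64> induction, splat(BTC)" mask.
      if (Index + Lane <= BTC)
        B[static_cast<std::size_t>(Index + Lane)] = X;
    }
  }
}
```

With N = 7 the outer loop runs twice (NVec = 8) and the last lane of the second iteration is skipped, so no element past index 6 is written.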
llvm/trunk/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll

	for.end:			for.end:
	ret void			ret void
	}			}

	!1 = !{!1, !2}			!1 = !{!1, !2}
	!2 = !{!"llvm.loop.vectorize.enable", i1 true}			!2 = !{!"llvm.loop.vectorize.enable", i1 true}

	;			;
	; This loop will not be vectorized as the trip count is below the threshold.			; This loop will be vectorized as the trip count is below the threshold but no
				; scalar iterations are needed thanks to folding its tail.
	;			;
	define void @not_vectorized(float* noalias nocapture %A, float* noalias nocapture readonly %B) {			define void @vectorized1(float* noalias nocapture %A, float* noalias nocapture readonly %B) {
	; CHECK-LABEL: @not_vectorized(			; CHECK-LABEL: @vectorized1(
	; CHECK-NOT: x float>			; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> undef, i64 [[INDEX]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> undef, <8 x i32> zeroinitializer
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <8 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
				; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds float, float* [[B:%.*]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds float, float* [[TMP1]], i32 0
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast float* [[TMP2]] to <8 x float>*
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x float>, <8 x float>* [[TMP3]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[A:%.*]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP4]], i32 0
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[TMP5]] to <8 x float>*
				; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4
				; CHECK-NEXT: [[TMP7:%.*]] = fadd fast <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]
				; CHECK-NEXT: [[TMP8:%.*]] = icmp ule <8 x i64> [[INDUCTION]], <i64 19, i64 19, i64 19, i64 19, i64 19, i64 19, i64 19, i64 19>
				; CHECK-NEXT: [[TMP9:%.*]] = bitcast float* [[TMP5]] to <8 x float>*
				; CHECK-NEXT: call void @llvm.masked.store.v8f32.p0v8f32(<8 x float> [[TMP7]], <8 x float>* [[TMP9]], i32 4, <8 x i1> [[TMP8]])
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
				; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 24
				; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !6
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
	; CHECK: for.end:			; CHECK: for.end:
				; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %B, i64 %indvars.iv			%arrayidx = getelementptr inbounds float, float* %B, i64 %indvars.iv
	%0 = load float, float* %arrayidx, align 4, !llvm.mem.parallel_loop_access !3			%0 = load float, float* %arrayidx, align 4, !llvm.mem.parallel_loop_access !3
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP4]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP4]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[TMP5]] to <8 x float>*			; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[TMP5]] to <8 x float>*
	; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4			; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4
	; CHECK-NEXT: [[TMP7:%.*]] = fadd fast <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]			; CHECK-NEXT: [[TMP7:%.*]] = fadd fast <8 x float> [[WIDE_LOAD]], [[WIDE_LOAD1]]
	; CHECK-NEXT: [[TMP8:%.*]] = bitcast float* [[TMP5]] to <8 x float>*			; CHECK-NEXT: [[TMP8:%.*]] = bitcast float* [[TMP5]] to <8 x float>*
	; CHECK-NEXT: store <8 x float> [[TMP7]], <8 x float>* [[TMP8]], align 4			; CHECK-NEXT: store <8 x float> [[TMP7]], <8 x float>* [[TMP8]], align 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
	; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16			; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
	; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !7			; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !9
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 16, 16			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 16, 16
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[B]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[B]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[ARRAYIDX]], align 4, !llvm.mem.parallel_loop_access !6			; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[ARRAYIDX]], align 4, !llvm.mem.parallel_loop_access !7
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[A]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[A]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP11:%.*]] = load float, float* [[ARRAYIDX2]], align 4, !llvm.mem.parallel_loop_access !6			; CHECK-NEXT: [[TMP11:%.*]] = load float, float* [[ARRAYIDX2]], align 4, !llvm.mem.parallel_loop_access !7
	; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[TMP10]], [[TMP11]]			; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[TMP10]], [[TMP11]]
	; CHECK-NEXT: store float [[ADD]], float* [[ARRAYIDX2]], align 4, !llvm.mem.parallel_loop_access !6			; CHECK-NEXT: store float [[ADD]], float* [[ARRAYIDX2]], align 4, !llvm.mem.parallel_loop_access !7
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 16
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop !8			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop !10
	; CHECK: for.end:			; CHECK: for.end:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
	(16 lines of unchanged context not shown)
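
The new CHECK lines for the folded-tail case compare the induction vector against a splat of 19 with `icmp ule`, pass the result as the mask of `@llvm.masked.store`, and exit the vector loop when the index reaches 24. That matches a scalar trip count of 20 at VF = 8. The following is a minimal scalar C++ sketch of the same idea, not code from the patch: the function names `roundedTripCount` and `addFoldedTail` are hypothetical, and the sketch assumes `TC + VF - 1` does not overflow. It also guards the whole lane body, whereas the test output above masks only the store.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: trip count of the vector loop, i.e. TC rounded up to a
// multiple of VF. Assumes TC + VF - 1 does not overflow.
inline std::uint64_t roundedTripCount(std::uint64_t TC, std::uint64_t VF) {
  return ((TC + VF - 1) / VF) * VF;
}

// Hypothetical emulation of A[i] += B[i] with the tail folded by masking.
// Each outer iteration stands for one vector iteration of VF lanes; lanes
// whose index exceeds TC - 1 are inactive, so no scalar remainder loop is
// needed. A and B must each hold at least TC elements.
inline void addFoldedTail(std::vector<float> &A, const std::vector<float> &B,
                          std::uint64_t TC, std::uint64_t VF) {
  const std::uint64_t VectorTC = roundedTripCount(TC, VF);
  for (std::uint64_t Index = 0; Index != VectorTC; Index += VF) {
    for (std::uint64_t Lane = 0; Lane < VF; ++Lane) {
      const std::uint64_t I = Index + Lane;
      // Lane mask, written as "i <= TC - 1" like the icmp ule in the test.
      // This is only evaluated when TC > 0, so TC - 1 cannot wrap here.
      const bool Active = I <= TC - 1;
      if (Active)
        A[I] = A[I] + B[I];
    }
  }
}
```

With `TC = 20` and `VF = 8`, `roundedTripCount` yields 24, so the outer loop runs three times and the last iteration has only lanes 16 through 19 active.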

Revision Contents

Diff 170090

llvm/trunk/include/llvm/Analysis/VectorUtils.h

llvm/trunk/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/trunk/lib/Transforms/Vectorize/VPlan.h

llvm/trunk/lib/Transforms/Vectorize/VPlan.cpp

llvm/trunk/test/Transforms/LoopVectorize/X86/optsize.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll