This is an archive of the discontinued LLVM Phabricator instance.

Paths

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
lib/Transforms/Vectorize/LoopVectorize.cpp
lib/Transforms/Vectorize/VPlan.h
lib/Transforms/Vectorize/VPlan.cpp
test/Transforms/LoopVectorize/X86/optsize.ll
test/Transforms/LoopVectorize/X86/small-size.ll
Differential D50480

[LV] Vectorizing loops of arbitrary trip count without remainder under opt for size
ClosedPublic

Authored by Ayal on Aug 8 2018, 3:11 PM.

Details

Reviewers

mkuper
hsaito
dcaballe
fhahn
rengolin
hfinkel

Commits

rGb0b5312e677c: [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt…
rL344743: [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt…

Summary

When optimizing for size, a loop is vectorized only if the resulting vector loop completely replaces the original scalar loop. This holds if no runtime guards are needed, if the original trip-count TC does not overflow, if TC is a known constant, and if TC is a multiple of the VF. Targets with efficient vector masking can overcome the last three, TC-related conditions: see "Direction #1" in "[llvm-dev] Vectorizing remainder loop" (http://lists.llvm.org/pipermail/llvm-dev/2018-August/125042.html). This patch applies that transformation: it sets the trip count of the vector loop to TC rounded up to a multiple of VF, and masks the vector body under a newly introduced "if (i < TC)" condition, or rather "if (i <= TC-1)", to overcome the aforementioned overflow hazard.

The patch allows loops with arbitrary trip counts to be vectorized under -Os, subject to the existing cost model considerations. It also applies to loops with small trip counts (under -O2) which are currently handled as if under -Os.

Handling loops with reductions and live-outs is marked as a TODO for subsequent extensions.
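
To make the summary concrete, here is a minimal scalar sketch of folding the tail by masking. It is not the code the patch generates: the vectorization factor `VF = 4`, the function names, and the `+ 1` loop body are assumptions chosen for illustration, and the inner lane loop stands in for a single masked vector operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

// Scalar reference loop: Dst[i] = Src[i] + 1 for i in [0, TC).
void scalarLoop(const std::vector<int> &Src, std::vector<int> &Dst) {
  for (std::size_t I = 0; I < Src.size(); ++I)
    Dst[I] = Src[I] + 1;
}

// Scalar simulation of a tail-folded vector loop. The outer loop runs
// ceil(TC / VF) times and every lane is guarded by the mask (i <= TC - 1).
void tailFoldedLoop(const std::vector<int> &Src, std::vector<int> &Dst) {
  const std::size_t TC = Src.size();
  if (TC == 0)
    return; // This sketch simply skips an empty range.
  const std::size_t BTC = TC - 1;                    // backedge-taken count
  const std::size_t VecTC = (TC + VF - 1) / VF * VF; // TC rounded up to VF
  for (std::size_t Base = 0; Base < VecTC; Base += VF) {
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const std::size_t I = Base + Lane;
      const bool Mask = I <= BTC; // header block mask for this lane
      if (Mask)
        Dst[I] = Src[I] + 1; // stands in for a masked load and store
    }
  }
}
```

With seven elements and VF = 4, the folded loop runs two vector iterations and the mask disables the eighth lane, so no scalar remainder loop is needed.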

Event Timeline

Ayal created this revision. Aug 8 2018, 3:11 PM

Herald added subscribers: llvm-commits, rogfer01. Aug 8 2018, 3:11 PM

hsaito added inline comments. Aug 8 2018, 5:15 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2663	I think there is a danger in assuming that UF is a power of two. Granted that there may be other parts of LV already assuming it, I still wouldn't like to see any more of those being added. If new code is assuming power of two UF, it's best if we ensure that is really the case (e.g., when foldTailByMasking() is true, assert that UF is a power of two). Better yet, check that VF*UF is a power of two here, since that's the assumption this code has.
2673	This Urem creation should be skipped if we aren't generating remainder.
4977	I think we need to add if (TC==0) { emit one kind of remark } else { emit another kind of remark } here ---- in order to match previous capability.

Ayal mentioned this in D50474: [LV] Vectorize header phis that feed from if-convertable latch phis. Aug 9 2018, 3:27 PM

sguggill added a subscriber: sguggill. Aug 10 2018, 3:33 PM

Ayal added inline comments. Aug 11 2018, 2:06 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2663	OK, will add an assert that VF*UF is a power of 2 below under the `if (Legal->foldTailByMasking())`.
2673	This Urem is also used to round N up to a multiple of Step, i.e., when we're not generating remainder.
4977	OK, will retain the previous MissedAnalysis remarks here, in addition to the new ones supplied by `canFoldTailByMasking()`.
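
The exchange on line 2673 can be sketched as follows, assuming plain 64-bit integers in place of IR values. The function name `vectorTripCount` and its parameters are hypothetical; the sketch only shows how one remainder computation can serve both rounding down and, after biasing N by Step-1, rounding up, together with the power-of-2 assertion agreed on for line 2663.

```cpp
#include <cassert>
#include <cstdint>

// True if X is a nonzero power of two.
bool isPowerOfTwo(std::uint64_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Vector trip count: N rounded down to a multiple of Step = VF * UF, or
// rounded up when the tail is folded by masking. One urem serves both cases.
// Overflow of N + Step - 1 is ignored in this sketch.
std::uint64_t vectorTripCount(std::uint64_t N, unsigned VF, unsigned UF,
                              bool FoldTail) {
  const std::uint64_t Step = std::uint64_t(VF) * UF;
  assert(Step != 0 && "VF and UF must be nonzero");
  if (FoldTail) {
    assert(isPowerOfTwo(Step) && "VF*UF must be a power of 2 when folding tail");
    N += Step - 1; // bias so that rounding down yields N rounded up
  }
  const std::uint64_t R = N % Step; // the urem under discussion
  return N - R;
}
```

For N = 10 and Step = 4 this gives 8 without tail folding and 12 with it.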

Thanks, Ayal! Some comments below.
Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

Thanks,
Diego

lib/Transforms/Vectorize/LoopVectorize.cpp
4965	I'm trying to understand the purpose of this check. Is it to prevent masked vectorization if TC is lower than `TinyTripCountInterleaveThreshold` (i.e., 128)? Should we use an independent threshold for this?
5218	inwhich -> in which?
6355	Just curious. Could we prevent the computation of interleave groups for these cases instead of doing a reset?
lib/Transforms/Vectorize/VPlan.h
609	I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first?
1126	Instead of using an "empty" VPValue to model the BTC, would it be possible to model the actual operations to compute the BTC? We would only need a sub, right?

In D50480#1196699, @dcaballe wrote:

Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

You should know better

lib/Transforms/Vectorize/LoopVectorize.cpp
4965	Ah, this is wrong, good catch! The original purpose (of `TinyTripCountVectorThreshold` rather than `TinyTripCountInterleaveThreshold`) was to prevent vectorization of loops with very short trip counts due to overheads. Later it was extended in r306803 to allow vectorization under OptForSize, as it implies that all iterations are concentrated inside the vector loop for more accurate cost estimation. This still holds when folding the tail by masking, so we should not bail out here.
5218	ok
6355	That would have been simpler indeed. But there's a subtle phase-ordering issue here: `MaxVF=computeFeasibleMaxVF()` uses tentative interleave groups to `getSmallestAndWidestTypes()`, and is then used in determining if the tail should be folded by masking (i.e., if TC is a multiple of MaxVF), in which case these groups will all be masked/invalid.
lib/Transforms/Vectorize/VPlan.h
609	An alternative of leveraging the `Instruction::ICmp` opcode and existing `ICmpInst` subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):

+ // Introduce the early-exit compare IV <= BTC to form header block mask.
+ // This is used instead of IV < TC because TC may wrap, unlike BTC.
+ VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
+ VPValue *BTC = Plan->getBackedgeTakenCount();
+ Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
+ auto ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
+ Plan->addDetachedValue(ICmp);
+ BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
  return BlockMaskCache[BB] = BlockMask;

and then have `VPInstruction::generateInstruction()` do

+ case Instruction::ICmp: {
+   Value *IV = State.get(getOperand(0), Part);
+   Value *TC = State.get(getOperand(1), Part);
+   auto ICmp = cast<ICmpInst>(getUnderlyingValue());
+   Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
+   State.set(this, V, Part);
+   break;
+ }

where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.
1126	The BTC is computed by subtracting 1 from the Trip Count, which in turn is generated by SCEVExpander. To model this decrement would require using an "empty" VPValue to model its Trip Count operand. In any case, both involve scalar instructions that take place before the vectorized loop, currently outside the VPlan'd zone.
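
The remark that TC may wrap, unlike BTC, can be checked with a small sketch. It assumes an 8-bit induction variable purely to keep the numbers small; the function names are hypothetical. A loop of 256 iterations has BTC = 255, while TC = BTC + 1 wraps to 0 in 8 bits.

```cpp
#include <cassert>
#include <cstdint>

// Counts how many of the 256 values of an 8-bit induction variable are
// enabled by the mask IV < TC.
unsigned lanesEnabledByLessThanTC(std::uint8_t TC) {
  unsigned Count = 0;
  for (unsigned IV = 0; IV < 256; ++IV)
    if (static_cast<std::uint8_t>(IV) < TC)
      ++Count;
  return Count;
}

// Same count for the mask IV <= BTC, where BTC is the backedge-taken count.
unsigned lanesEnabledByLessEqualBTC(std::uint8_t BTC) {
  unsigned Count = 0;
  for (unsigned IV = 0; IV < 256; ++IV)
    if (static_cast<std::uint8_t>(IV) <= BTC)
      ++Count;
  return Count;
}
```

With BTC = 255, the compare against the wrapped TC enables no lanes at all, whereas the compare against BTC enables all 256. When nothing wraps (BTC = 9, TC = 10), both masks agree.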

I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
  if (unlikely(i > M))
    break;
  sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are there any specific plans to do so?

Secondly, are there any plans to enable this approach for anything other than optsize?

test/Transforms/LoopVectorize/X86/optsize.ll
12	Testing-wise, expanding out the IR generated w/ update-lit-checks, landing the tests without the changes, and then rebasing on top would make it much easier to follow the transform being described for those of us not already expert in the vectorizer code structures. I get that you're following existing practice, but this might be one of the cases which justify changing existing practice in the area. :)

hsaito added inline comments. Aug 14 2018, 3:23 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2673	Ouch. Well, given the assertion for VF*UF being power of two (constant), the UREM and other computation should be reasonably optimizable downstream. So, it's probably unfair to ask you to fix the trip count computation ---- so, I won't ask. There is a trade off between generating more optimal output IR and the cost of maintaining the code to do that. Keeping UREM here is opting for lower maintenance. Just for the record.

reames added inline comments. Aug 14 2018, 3:25 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
4948	There's a mix of seemingly unrelated changes here. This is one example. It would be good to land these separately.

In D50480#1199900, @reames wrote:
I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
if (unlikely(i > M)) 
   break;
sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?

Short answer is No.

From vectorizer perspective, mechanics is quite different. In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit", to handle this situation in somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it. There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter. Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Having said that, making VPlan more powerful (like adding a new IF) certainly helps lead to the ability to model the early_exit situation within VPlan eventually. From that perspective, it's a baby step forward.

From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

Secondly, are there any plans to enable this approach for anything other than optsize?

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream come true ---- smaller code, faster compile, and faster execution. Looking forward to hearing such great news.

hsaito added inline comments. Aug 14 2018, 5:00 PM

lib/Transforms/Vectorize/VPlan.h
609	Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator? The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that. My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?
1126	I'm not a big fan of allocating memory that goes unused in many situations. We can initialize this to nullptr, and create an instance once we know BTC is needed. That'll lose the convenience of being able to check NumUsers, but creating needsBackedgeTakenCount() member function shouldn't be that bad. It's just Legal->foldTailByMasking(), until something else needs BTC, right?
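
The lazy-allocation idea for line 1126 could look roughly like the sketch below. `ValueSketch` and `PlanSketch` are hypothetical stand-ins, not the VPlan classes, and the member names merely mirror those floated in this thread; the real interface may differ.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a VPValue-like placeholder.
struct ValueSketch {
  unsigned NumUsers = 0;
};

// Hypothetical stand-in for a plan that may or may not need the BTC.
class PlanSketch {
  std::unique_ptr<ValueSketch> BackedgeTakenCount; // null until requested

public:
  // Allocates the BTC placeholder the first time it is asked for.
  ValueSketch *getOrCreateBackedgeTakenCount() {
    if (!BackedgeTakenCount)
      BackedgeTakenCount = std::make_unique<ValueSketch>();
    return BackedgeTakenCount.get();
  }

  // Lets code generation skip materializing the BTC when nothing asked for it.
  bool needsBackedgeTakenCount() const { return BackedgeTakenCount != nullptr; }
};
```

A plan that never folds the tail never calls the getter, so it pays for no allocation; repeated calls return the same object.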

dcaballe added inline comments. Aug 14 2018, 6:33 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6355	Thanks!
lib/Transforms/Vectorize/VPlan.h
609	Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places. On "My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?": the closer the class hierarchies are, the easier it will be.

hsaito added inline comments. Aug 15 2018, 11:20 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4948	This change is relevant in the sense that TC < 2 is split into two parts: TC==1 and TC==0. TC==0 case will then have a chance of hitting Legal->canFoldTailByMasking() later. As a result, TC==1 case can return early here, with a very crisp messaging. Having said that, if you'd like to see the same ORE->emit(...) LLVM_DEBUG() stuff here, I won't go against that. Messaging change can be a separate commit. Ayal, we need ORE->emit() here, in addition to LLVM_DEBUG(), right, regardless of whether we change the actual message?

In D50480#1200014, @hsaito wrote:
In D50480#1199900, @reames wrote:
I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
if (unlikely(i > M)) 
   break;
sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?
Short answer is No.

From vectorizer perspective, mechanics is quite different.

Ok, I think we're talking past each other a bit. I see these both as forms of predication. It sounds like you have a slightly different view; I'll try to ask clarifying questions in the right spots. I think we have different mental models here and I'm trying to understand where that difference is.

In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit", to handle this situation in somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it.

I'm unfamiliar with this pragma, but the best reference I found was https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-simd-directive-openmp-api

From what I can tell, this provides user guarantees of a couple of legality checks and profitability checks. I don't know enough about openmp to completely follow all the wording, but the key bit appears to be this:
"Each operation before the last lexical early exit of the loop may be executed as if the early exit were not triggered within the SIMD chunk."

We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve. There are two obvious approaches: speculation safety and predication. Unless I'm misreading this patch, it has the same problem and uses predication right?

There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right? If there's not, then we don't need to distinguish which iteration "caused" the exit. This is actually an interesting and useful subcase for me.

Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I'm missing something, this is a restatement of the above, right?

I agree that cases like a[i] >0 are the hard ones. Other examples are things like i < M for loop invariant M. Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

I am very specifically not interested in the language extension aspects. I'm specifically asking about doing the transform for unannotated C code. (i.e. having to prove all the legality the hard way)

Secondly, are there any plans to enable this approach for anything other than optsize?

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream comes true ---- smaller code, faster compile, and faster execution. Looking forward to hear such a great news.

I take it you don't see AVX512 as qualifying? Not surprised, but I'd be curious to hear your reasoning. You might be coming at this from a different angle than I am

In D50480#1199900, @reames wrote:
I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {
if (unlikely(i > M)) 
   break;
sum += a[i];
}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?

Good question! Replacing the break with a continue vectorizes just fine and produces the same result, albeit spinning uselessly for the last N-M iterations. Dealing with such "breaks" directly deserves more thought :-). In general it's probably better to fold such two upper bounds into one = min(N,M+1), producing a countable unpredicated loop. This is a known optimization for OpenCL1.x kernels, often guarded with "if (get_global_id(0) > M) continue;" due to work_group size constraints, when compiled for CPU.
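
The three loop forms mentioned here can be compared directly. This is a sketch with hypothetical function names; it assumes `N` does not exceed the array size and that `M + 1` does not overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Original form: leaves the loop at the first i > M.
long sumWithBreak(const std::vector<int> &A, int N, int M) {
  long Sum = 0;
  for (int I = 0; I < N; I++) {
    if (I > M)
      break;
    Sum += A[I];
  }
  return Sum;
}

// Same result, but keeps iterating (uselessly) once i > M.
long sumWithContinue(const std::vector<int> &A, int N, int M) {
  long Sum = 0;
  for (int I = 0; I < N; I++) {
    if (I > M)
      continue;
    Sum += A[I];
  }
  return Sum;
}

// Both upper bounds folded into one: a countable, unpredicated loop.
long sumWithFoldedBound(const std::vector<int> &A, int N, int M) {
  const int Bound = std::min(N, M + 1); // assumes M + 1 does not overflow
  long Sum = 0;
  for (int I = 0; I < Bound; I++)
    Sum += A[I];
  return Sum;
}
```

All three return the same sum; only the folded-bound form avoids both the early exit and the wasted iterations.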

Secondly, are there any plans to enable this approach for anything other than optsize?

We could, for example, consider enabling it under -O2 for loops whose entire (or nearly entire) body is already conditional; e.g.,

for (int i = 0; i < N; i++) {
  if (i*i % 4 != 2) {
    <loop body>
  }
}

otherwise the overhead of predicating code that could otherwise run unpredicated may be detrimental.

lib/Transforms/Vectorize/LoopVectorize.cpp
2673	Rounding N down to a multiple of Step is in general N-(N%Step). If Step is a constant power of two (which is currently always the case, and must be the case when folding the tail by masking), it gets optimized downstream to N&(-Step). If Step were some other constant, it may get optimized downstream to use multiplication instead of division, depending on target characteristics. In any case, this takes place before the loop, and is orthogonal to this patch, which simply reuses the existing logic to also round up.
4948	Yes, this change is unrelated and should land separately. The original ORE message is wrong. Not sure the TC==1 qualifies for any ORE message - "loops" with a known trip count of one are simply irrelevant for vectorization; though we could vectorize them with a mask...
4965	This BTW is caught by vect.omp.force.small-tc.ll; but the -vectorizer-min-trip-count=21 flag it uses is external to OpenMP, afaik.
lib/Transforms/Vectorize/VPlan.h
609	Extensions of VPInstructions such as VPCmpInst should indeed be uploaded for review and deserve a separate discussion thread and justification. This patch could tentatively make use of it, though for the purpose of this patch an ICmpULE opcode or a detached ICmpInst suffice. An ICmpULE opcode shouldn't be problematic currently, as this early-exit is the only VPInstruction compare with a Predicate, right? Note that detached UnderlyingValues could serve as data containers for all fields already implemented in the IR hierarchy, and could be constructed at any point of VPlan construction for that purpose. Extending VPInstructions to provide a similar API as that of IR Instructions seems to be an orthogonal concern with its own design objectives, and can coexist with detached Values; e.g., a VPCmpInst could hold its Predicate using a detached ICmpInst/FCmpInst.
1126	OK. The VPValue can be created on demand, turning `getBackedgeTakenCount()` into `getOrCreateBackedgeTakenCount()`. `NumUsers` should still be checked, as this isolates the decision of creating the IR based on the VPlan. In any case, VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant.
test/Transforms/LoopVectorize/X86/optsize.ll
12	Agreed. The original target-independent version of optsize.ll still passes, BTW, (i.e., fails to vectorize), but due to cost-model considerations rather than scalar tail considerations.
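
The rounding identity discussed for line 2673 is easy to check in isolation. The sketch below uses hypothetical helper names and 32-bit unsigned integers; the masked form is only valid when Step is a power of two, which the second helper asserts.

```cpp
#include <cassert>
#include <cstdint>

// General form: works for any nonzero Step.
std::uint32_t roundDownWithURem(std::uint32_t N, std::uint32_t Step) {
  assert(Step != 0);
  return N - (N % Step);
}

// Masked form: valid only when Step is a power of two.
std::uint32_t roundDownWithMask(std::uint32_t N, std::uint32_t Step) {
  assert(Step != 0 && (Step & (Step - 1)) == 0 && "Step must be a power of 2");
  return N & (0u - Step); // 0u - Step is the two's-complement negation of Step
}
```

For example, both helpers map N = 13 with Step = 4 to 12, while a Step of 6 can only go through the remainder form.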

In D50480#1201125, @reames wrote:

We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve. There are two obvious approaches: speculation safety and predication. Unless I'm misreading this patch, it has the same problem and uses predication right?

In this particular case, we don't get much of speculation. If you call computing loop index beyond the original upper bound as speculation (and use it in compare), it is, but we know there aren't any safety issues. In your case, what really matters is inside "unlikely(i > M)". If that's just trivial "i > M" (or something that can be converted in that form), we are better off simply changing the loop upper bound and do so prior to hitting the vectorizer. Then, this patch will take care of it. If not (i.e., general compute_some_predicate_value_based_on(i)) the whole speculation safety issue comes up and that's the difficult part to deal with and this patch doesn't deal with any aspect of it.

There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right? If there's not, then we don't need to distinguish which iteration "caused" the exit. This is actually an interesting and useful subcase for me.

I don't know what you mean by "a use of a variable from within the loop down the early exit path". Assume cond becomes true within a vector chunk (say, elem#2), you have to execute B for all prior iters (i.e., elem#0 and #1), and execute A for elem #2.

for (i){
   if (cond){
       A
       break;
   }
   B
}

Assuming that B is lexically below (note: this is vectorization, as such, you need to have some lexical ordering assumption somewhere) all the early exit points, it can be non-speculatively executed under proper predication.
This kind of predication, however, has nothing to do with this patch. General IF-THEN-ELSE and GOTO based control flow needs the same kind of predication.
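
The A/B example can be simulated in scalar code to show the second strategy, where the vector code finishes the partial chunk under a predicate and then exits. This is a sketch under stated assumptions, not how ICC or LV implements it: `VF = 4`, the `Result` struct, and the choice of `Data[i] < 0` as cond are hypothetical. Evaluating cond for lanes past the exit point is the speculation discussed in this thread; it is safe here only because every in-range element can be read.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // hypothetical vectorization factor

struct Result {
  long SumB = 0;      // effect of B: sum of elements seen before the exit
  int ExitValue = -1; // effect of A: the element that triggered the exit
};

// Scalar reference: for (i) { if (cond) { A; break; } B }
Result scalarEarlyExit(const std::vector<int> &Data) {
  Result R;
  for (std::size_t I = 0; I < Data.size(); ++I) {
    if (Data[I] < 0) {
      R.ExitValue = Data[I]; // A
      break;
    }
    R.SumB += Data[I]; // B
  }
  return R;
}

// Chunked simulation: evaluate cond for a whole chunk, run B only for lanes
// before the first exiting lane, run A for that lane, then leave the loop.
Result chunkedEarlyExit(const std::vector<int> &Data) {
  Result R;
  const std::size_t N = Data.size();
  for (std::size_t Base = 0; Base < N; Base += VF) {
    std::size_t FirstExit = VF; // VF means no lane in this chunk exits
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const std::size_t I = Base + Lane;
      if (I < N && Data[I] < 0 && FirstExit == VF)
        FirstExit = Lane;
    }
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const std::size_t I = Base + Lane;
      if (I < N && Lane < FirstExit) // predicate for B
        R.SumB += Data[I];
    }
    if (FirstExit != VF) {
      R.ExitValue = Data[Base + FirstExit]; // A for the exiting lane
      break;
    }
  }
  return R;
}
```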

Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I missing something, this is a restatement of the above right?

Sure ---- but unless you are talking about trivial (i.e., not very interesting) "early exit" stuff, how to deal with speculation is the most important aspect of vectorizer's early exit handling.

Other examples are things like i < M for loop invariant M. Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

Sure, but that's not very interesting from vectorization perspective. Vectorizer doesn't have to do what other loop transformation can handle.

I am very specifically not interested in the language extension aspects. I'm specifically asking about doing the transform for unannotated C code. (i.e. having to prove all the legality the hard way)

ICC is doing it. So, let us know if anyone is volunteering before we do so that we can share our learning. It's an important aspect of vectorization but not yet high enough on our priority list. So, we aren't immediately jumping on to it.

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream comes true ---- smaller code, faster compile, and faster execution. Looking forward to hear such a great news.

I take it you don't see AVX512 as qualifying?

Qualifying to what?

If your question is whether ICC uses the masked main vector code for AVX512, other than OptForSize case, then the answer is yes it does.

It's a combination of HW and SW. If you know the trip count as a compile time constant, you can evaluate various different ways to vectorize and decide the best one, much better than when you don't know the trip count. The legacy part of LV isn't set up to do such an evaluation. VPlan native part of LV would eventually have such a capability. W/o this capability, we need to go one way or the other rather blindly --- and blindly changing the status quo requires a pretty good justification (like brilliantly fast masked vector execution unit). I'm more interested in doing the evaluation when VPlan native path is ready to do that.

Not surprised, but I'd be curious to hear your reasoning. You might be coming at this from a different angle than I am

If the trip count is unknown, the best AVX512 vectorization strategy so far is go with unmasked (at the top-level) vector main loop. Underlying assumption is that unmasked vector main loop is faster than the masked vector main loop, and a lot of time is spent in executing main vector loop. If such an assumption does not hold, like main vector code isn't executed a lot, programmers should try to communicate the trip count estimation to the compiler so that the compiler can do a better job. As the HW narrows the gap between the two, optimization point moves. We have to evaluate every generation of HW and see what works the best. So, my comment applies to today's HW. I don't know what ARM SVE folks would say for their HW.

Does this make sense to you?

hsaito added inline comments. Aug 15 2018, 2:26 PM

lib/Transforms/Vectorize/VPlan.h
609	I go against the detached ICmpInst. We'll be moving VPlan modeling before the cost model, and creating an IR Instruction before deciding to vectorize is against the VPlan concept. On "seems to be an orthogonal concern with its own design objectives": not quite. We'd like VPInstruction to be easy to use for many LLVM developers, and that has been an integral part of the design/implementation from the beginning. Having said that, new opcode versus VPCmpInst doesn't block the rest of the review. Other parts of the review should proceed while the opcode versus VPCmpInst discussion is in progress on the side.
1126	On "VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant": VPlan footprint was part of the community concern. We'd like to be better wherever we can. Just as simple as that. Thanks for taking care of it.

dcaballe mentioned this in D50823: [VPlan] Introduce VPCmpInst sub-class in the instruction-level representation. Aug 15 2018, 5:12 PM

dcaballe added inline comments. Aug 15 2018, 5:18 PM

lib/Transforms/Vectorize/VPlan.h
609	I created D50823 with the VPCmpInst sub-class so that we can make a decision with the code in place.

rkruppe added a subscriber: rkruppe. Aug 16 2018, 6:10 AM

hsaito added inline comments. Aug 16 2018, 12:37 PM

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485	I think it's best not to keep this state in the Legal. From the Legal perspective, being able to vectorize the whole loop body under the mask and the actual decision to do so are completely separate issues. Since canFold...() is invoked by CostModel::computeMaxVF, we should be able to keep this state in the CostModel. After all, whether to bail out or continue under FoldTailByMasking is a cost model side of the state, after consulting the Legal.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792	By moving FoldTail state to CostModel, we can define CostModel::blockNeedsPredication(BB) as FoldTailByMasking || LAI::blockNeedsPredication(BB) and make Legal version static to Legal.
lib/Transforms/Vectorize/LoopVectorize.cpp
2673	"orthogonal to this patch": I agree.

Meinersbur mentioned this in D49281: [Unroll/UnrollAndJam/Vectorizer/Distribute] Add followup loop attributes. Aug 17 2018, 4:23 PM

Addressed review comments.

New test X86/optsize.ll added and vect.omp.force.small-tc.ll augmented with CHECKs, both showing current behavior, to be uploaded separately before this patch. Test small-size.ll includes CHECKs that pass with this patch.

Ayal marked 4 inline comments as done. Aug 20 2018, 3:08 PM

Ayal added inline comments.

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485	OK.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792	OK, except that LAI::blockNeedsPredication() also asks for DT which CostModel does not have. Let's have CostModel::blockNeedsPredication(BB) return FoldTailByMasking || Legal::blockNeedsPredication(). Hopefully the two will not cause confusion. Making Legal version static should be pursued in a separate patch, if desired.
lib/Transforms/Vectorize/VPlan.h
609	VPlans should indeed keep the existing IR intact w/o changing it, as they are tentative by design, and also by current implementation. But creating a detached IR Instruction, just for the purpose of holding its attributes, w/o connecting it to any User, Operand (except Undef's) or BasicBlock, is arguably keeping the existing IR intact. Doing so should be quite familiar to LLVM developers, avoids mirroring Instruction's class hierarchy or a subset thereof, and leverages the existing UnderlyingValue pointer that is unutilized by InnerLoopVectorizer. Next uploaded version provides this complete option. Having said that, this patch can surely work with a VP(I)CmpInst just as well, as it merely needs a way for a single compare VPInstruction to hold a single Predicate, and print its name.
test/Transforms/LoopVectorize/X86/optsize.ll
12	Expanded IR CHECKs have been added for cases that should get vectorized. For cases that should not, suffice to check that no vector is formed.
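
The arrangement agreed on for line 792 above can be sketched as follows. This is only an illustration under stated assumptions, not the patch's code: `Legality` and `CostModel` are hypothetical stand-ins for the LLVM classes, blocks are plain integers, and `ConditionalBlocks` is an invented placeholder for whatever the real legality analysis computes.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the legality object: it only knows which
// blocks are conditionally executed in the original loop.
struct Legality {
  std::set<int> ConditionalBlocks; // assumption: blocks identified by integer id
  bool blockNeedsPredication(int BB) const {
    return ConditionalBlocks.count(BB) != 0;
  }
};

// Hypothetical stand-in for the cost model: it owns the fold-tail decision
// and consults the legality object for everything else.
struct CostModel {
  explicit CostModel(const Legality &L) : Legal(&L) {}

  bool blockNeedsPredication(int BB) const {
    assert(Legal && "cost model needs a legality object");
    // When the tail is folded, every block is masked, including the header.
    return FoldTailByMasking || Legal->blockNeedsPredication(BB);
  }

  const Legality *Legal;
  bool FoldTailByMasking = false;
};
```

With the flag clear, only the blocks reported by the legality stand-in need predication; once the flag is set, every block does, which is the behavior the two reviewers settle on.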

dcaballe added inline comments. Aug 20 2018, 3:43 PM

lib/Transforms/Vectorize/VPlan.h
609	I understand your point, Ayal. However, using UnderlyingValue as a pointer to the actual input IR in the VPlan native path and as a pointer to a detached IR Value in the inner loop path is very likely to be problematic, even in the short term. We would have to special case the code that is shared for both paths to treat the UnderlyingValue differently. The detached IR special semantics in the inner loop path would also make a bit more complicated the convergence of both paths. If there are no major concerns regarding the VPCmpInst, I'd prefer going with that approach.

hsaito added inline comments. Aug 20 2018, 3:53 PM

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485	Thank you.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792	Thanks, and fair enough.
lib/Transforms/Vectorize/LoopVectorize.cpp
2737	Personally, I don't like to see IR like the following going out of the vectorizer, even though it is trivially cleaned up later: `%1 = false // unused and thus will be trivially cleaned up later`, followed by `%2 = icmp ...`. Changing this part of the patch to `Value *CheckMinIters = nullptr; if () .... else CheckMinIters = Builder.getFalse();` would make cleaner IR going out for common cases, at a small price to pay in ease of reading. If you agree, great. If not, I won't make a big deal about it. At the end of the day, we should clean up this area of code such that we don't have to rely on CheckMinIters being a "false" constant to clean up the unnecessary min iter check. That improvement can be done as a separate NFC patch.
2979	See the comment on CheckMinIters.

For me, the only major issue left is the detached IR instruction. @dcaballe, please try adding the reviewers/subscribers of D50480 to D50823, in the hopes of getting a quicker resolution there, so as not to block D50480 because of that. I will not oppose D50480 for introducing a new ULE opcode of VPInstruction (a design/implementation choice within the VPlan concept), but I will strongly oppose the use of a detached IR instruction (which goes against the VPlan concept).

It's certainly nicer if @Ayal, @dcaballe, and others can agree on VPCmpInst or not quickly enough. I vote in favor of VPCmpInst.

Thanks,
Hideki

lib/Transforms/Vectorize/LoopVectorize.cpp
4948	I think every non-vectorized loop that goes through vectorizer's analysis qualifies for ORE. After all, TC==1 knowledge may or may not be available to the programmer otherwise.
4977	Thank you.
lib/Transforms/Vectorize/VPlan.h
609	Detached IR instruction is detrimental to VPlan direction. Please do not use it.
test/Transforms/LoopVectorize/X86/optsize.ll
5	Is the test really dependent on the apple triple?

Ayal added inline comments. Aug 22 2018, 5:38 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
2737	One could change this part of the patch to create an unconditional branch instead of a conditional one from BB to NewBB; or avoid creating NewBB / calling `emitMinimumIterationCountCheck()` altogether `if (foldTailByMasking())`. Both alternatives will change the dominance structure and thus require special attention when updating DT in `updateAnalysis()`. The latter would also need to record the EntryBlock for cases where `LoopBypassBlocks` remains empty. It's simpler to keep the existing skeletal structure intact, and rely on subsequent trivial dce cleanup. If desired, such alternatives should be proposed as a separate follow-up NFC patch.
2979	ditto.
4948	ok
test/Transforms/LoopVectorize/X86/optsize.ll
5	-mtriple=x86_64-unknown-linux works just as well

Addressing review comments, rebased, added a couple of asserts.

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

The TC==1 part and preliminary CHECK completion of tests are to be uploaded first.

Ayal marked 2 inline comments as done. Aug 22 2018, 9:17 AM

Ayal added inline comments.

lib/Transforms/Vectorize/VPlan.h
609	Would be good to clarify the aforementioned discrepancy between VPlan native's use of input IR and the proposed use of detached IR; both should presumably model defs, uses and basic-block ownerships in VPlan rather than the IR Instruction, so the latter can merely be used for storing internal properties, for both paths alike. BTW, SROA.cpp and StraightLineStrengthReduce.cpp, e.g., also make use of detached Instructions. Would also be good to explain why detached Instructions are considered detrimental or what concept of VPlan they allegedly violate, given that their existence keeps the original IR intact. But let's keep this patch out of that discussion, and have it use an ICmpULE extended opcode as originally proposed and reloaded. After all, it plays a very small part in this patch, and can be easily revised later as needed.

Let's give @dcaballe one more day to try getting some traction on D50823. Fair enough to both of you (and others who might be interested)?

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

Since the VPCmpInst code is ready (D50823) and this is a clear use case where we need to model a new compare (including its predicate) that is not in the input IR, I'd appreciate if we could discuss a bit more about using the VPCmpInst approach. At least, I'd like to understand what are the concerns about the VPCmpInst approach and what other people think.

I do have concerns regarding modeling ICmpULE as an opcode only for compare instructions newly created during a VPlan-to-VPlan transformation. For example:

Inconsistent modeling of compare instructions in the VPlan native path. Compare instructions in the input IR will be modeled as VPInstructions with an Instruction::ICmpInst/Instruction::FCmpInst opcode. New compare instructions will be modeled as VPInstructions with predicates as opcodes (VPInstruction::ICmpULE, for now). We'd have to compare the opcode against Instruction::ICmpInst, Instruction::FCmpInst, VPInstruction::ICmpULE and any future predicate opcode to know that a VPInstruction is a comparison. There is a similar inconsistency in getting information about the compare predicate.

Adding ICmpULE as an opcode is paving the way to adding more predicates as opcodes in VPInstruction in the short term. Where would the limit be? Do we want to model the around 30 predicates currently in LLVM CmpInst as opcodes?

The ICmpULE approach may also be detrimental for the Instruction/VPInstruction templatization that we planned to explore.
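
The first concern above can be made concrete with a small sketch. Everything in it is hypothetical: the enums and the two record types are invented for illustration and are not the LLVM or VPlan definitions, and `CmpRecord` only loosely mirrors the idea behind VPCmpInst in D50823.

```cpp
#include <cassert>

// Hypothetical opcodes and predicates, invented for this sketch.
enum Opcode { Add, ICmp, FCmp, ICmpULE /* predicate folded into the opcode */ };
enum Predicate { NoPred, EQ, ULT, ULE };

// Modeling 1: compares coming from the input IR carry a generic compare
// opcode, while newly created compares carry a predicate-specific opcode.
struct FlatInst {
  Opcode Op;
  Predicate Pred; // only meaningful for ICmp/FCmp
};

inline bool isCompare(const FlatInst &I) {
  // Every predicate-as-opcode addition lengthens this test.
  return I.Op == ICmp || I.Op == FCmp || I.Op == ICmpULE;
}

inline Predicate getPredicate(const FlatInst &I) {
  assert(isCompare(I) && "not a compare");
  // The predicate lives in two places depending on where the compare came from.
  return I.Op == ICmpULE ? ULE : I.Pred;
}

// Modeling 2: one compare record that always stores its predicate.
struct CmpRecord {
  Opcode Op; // ICmp or FCmp only
  Predicate Pred;
};

inline bool isCompare(const CmpRecord &I) { return I.Op == ICmp || I.Op == FCmp; }

inline Predicate getPredicate(const CmpRecord &I) {
  assert(isCompare(I) && "not a compare");
  return I.Pred;
}
```

Under the first modeling both queries need a special case per predicate opcode; under the second they stay fixed as predicates are added.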

If these points and the fact that VPCmpInst code is ready to go don't convince you, there isn't much else I can do :). I know this compare representation may sound insignificant but I'm well aware of how painful things can turn when things are built on top of "insignificant decisions" that have to be changed later on. If the problem with VPCmpInst is to rebase this patch on top of D50823, I'm perfectly fine with introducing D50823 after this patch goes in. However, if there are any other concerns regarding the VPCmpInst sub-class, it would be better to know them now. I'd prefer not to keep the ICmpULE opcode representation for a long time.

Thanks,
Diego

Under the assumption that the acceptance of this patch is not a conscious choice between new CmpULE VPInstruction opcode versus VPCmpInst derivation (whose discussion should continue in D50823 or its follow on), I think this patch is ready to land. LGTM.

This revision is now accepted and ready to land. Aug 23 2018, 2:51 PM

In D50480#1210022, @dcaballe wrote:

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

Since the VPCmpInst code is ready (D50823) and this is a clear use case where we need to model a new compare (including its predicate) that is not in the input IR, I'd appreciate if we could discuss a bit more about using the VPCmpInst approach. At least, I'd like to understand what are the concerns about the VPCmpInst approach and what other people think.

I do have concerns regarding modeling ICmpULE as an opcode only for compare instructions newly created during a VPlan-to-VPlan transformation. For example:

...

In D50480#1211580, @hsaito wrote:

Under the assumption that the acceptance of this patch is not a conscious choice between new CmpULE VPInstruction opcode versus VPCmpInst derivation (whose discussion should continue in D50823 or its follow on), I think this patch is ready to land. LGTM.

This patch aims to model a rather special early-exit condition that restricts the execution of the entire loop body to certain iterations, rather than model general compare instructions. If preferred, an "EarlyExit" extended opcode can be introduced instead of the controversial ICmpULE. This should be easy to revisit in the future if needed.

This patch focuses on modeling an early-exit compare and then generating it, w/o making strategic design decisions supporting future vplan-to-vplan transformations, the interfaces they may need, potential templatization, or other long-term high-level VPlan concerns. These should be explained and discussed separately along with pros and cons of alternative solutions for supporting the desired interfaces and for holding their storage, including subclassing VPInstructions, using detached Instructions, or other possibilities.
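
As a rough illustration of the condition being modeled, the following scalar emulation runs a loop in chunks of VF lanes and guards each lane with `i <= BTC`, where BTC is the backedge-taken count (the trip count minus one). It is a sketch under stated assumptions rather than what the vectorizer emits: `runMasked` is a hypothetical helper, the loop body is an invented `A[i] += 1`, and an 8-bit index is used only so that the trip-count overflow case is easy to exercise.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Emulates "for (i = 0; i <= BTC; ++i) A[i] += 1" as a vector loop of VF lanes
// whose tail is folded by masking. Returns the number of vector iterations.
inline unsigned runMasked(std::vector<int> &A, std::uint8_t BTC, std::uint8_t VF) {
  assert(VF != 0 && (VF & (VF - 1)) == 0 && "VF must be a power of two");
  assert(A.size() > BTC && "array must cover all scalar iterations");

  // The trip count BTC + 1 wraps to 0 when BTC is 255.
  std::uint8_t TC = static_cast<std::uint8_t>(BTC + 1);
  // Round up to a multiple of VF: add VF - 1 (may wrap), then round down.
  std::uint8_t Rounded = static_cast<std::uint8_t>(TC + VF - 1);
  std::uint8_t VecTC = static_cast<std::uint8_t>(Rounded - Rounded % VF);

  unsigned VectorIterations = 0;
  std::uint8_t Index = 0;
  do {
    for (std::uint8_t Lane = 0; Lane < VF; ++Lane) {
      std::uint8_t I = static_cast<std::uint8_t>(Index + Lane);
      if (I <= BTC) // lane mask: compares against TC - 1, which cannot overflow
        A[I] += 1;
    }
    ++VectorIterations;
    Index = static_cast<std::uint8_t>(Index + VF);
  } while (Index != VecTC);
  return VectorIterations;
}
```

For BTC = 9 and VF = 4 the emulation runs three vector iterations with the last two lanes masked off. For BTC = 255 the trip count wraps to zero, yet the index also wraps to zero after 64 iterations and every lane passes `i <= BTC`, so all 256 elements are updated.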

In D50480#1213673, @Ayal wrote:

This patch aims to model a rather special early-exit condition that restricts the execution of the entire loop body to certain iterations, rather than model general compare instructions. If preferred, an "EarlyExit" extended opcode can be introduced instead of the controversial ICmpULE. This should be easy to revisit in the future if needed.

This patch is fine as is, or rather much better with ICmpULE than EarlyExit.

This patch focuses on modeling an early-exit compare and then generating it, w/o making strategic design decisions supporting future vplan-to-vplan transformations, the interfaces they may need, potential templatization, or other long-term high-level VPlan concerns. These should be explained and discussed separately along with pros and cons of alternative solutions for supporting the desired interfaces and for holding their storage, including subclassing VPInstructions, using detached Instructions, or other possibilities.

Sure. I agree.

[Full disclosure] I have a big mental barrier in accepting your "early-exit" terminology here since I relate that term to "break out of the loop", but that's just the terminology difference. Nothing to do with the substance of this patch. [End of full disclosure]

Closed by commit rL344743: [LV] Fold tail by masking to vectorize loops of arbitrary trip count under opt… (authored by ayalz). Oct 18 2018, 8:05 AM

This revision was automatically updated to reflect the committed changes.

dorit mentioned this in D53559: [LV] Don't have fold-tail under optsize invalidate interleave-groups when masked-interleaving is enabled. Oct 23 2018, 1:43 AM

dorit mentioned this in rL345115: [LV] Don't have fold-tail under optsize invalidate interleave-groups when. Oct 24 2018, 12:13 AM

Ayal mentioned this in D66720: [LV] Fold tail by masking - handle reductions. Aug 25 2019, 10:33 AM

Revision Contents

Path	Size
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	10 lines
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	59 lines
lib/Transforms/Vectorize/LoopVectorize.cpp	120 lines
lib/Transforms/Vectorize/VPlan.h	20 lines
lib/Transforms/Vectorize/VPlan.cpp	22 lines
test/Transforms/LoopVectorize/X86/optsize.ll	57 lines
test/Transforms/LoopVectorize/X86/small-size.ll	60 lines

Diff 159798

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

[235 lines not shown]	public:
/// This does not mean that it is profitable to vectorize this		/// This does not mean that it is profitable to vectorize this
/// loop, only that it is legal to do so.		/// loop, only that it is legal to do so.
/// Temporarily taking UseVPlanNativePath parameter. If true, take		/// Temporarily taking UseVPlanNativePath parameter. If true, take
/// the new code path being implemented for outer loop vectorization		/// the new code path being implemented for outer loop vectorization
/// (should be functional for inner loop vectorization) based on VPlan.		/// (should be functional for inner loop vectorization) based on VPlan.
/// If false, good old LV code.		/// If false, good old LV code.
bool canVectorize(bool UseVPlanNativePath);		bool canVectorize(bool UseVPlanNativePath);

		/// Return true if we can vectorize this loop while folding its tail by
		/// masking.
		bool canFoldTailByMasking();

/// Returns the primary induction variable.		/// Returns the primary induction variable.
PHINode *getPrimaryInduction() { return PrimaryInduction; }		PHINode *getPrimaryInduction() { return PrimaryInduction; }

/// Returns the reduction variables found in the loop.		/// Returns the reduction variables found in the loop.
ReductionList *getReductionVars() { return &Reductions; }		ReductionList *getReductionVars() { return &Reductions; }

/// Returns the induction variables found in the loop.		/// Returns the induction variables found in the loop.
InductionList *getInductionVars() { return &Inductions; }		InductionList *getInductionVars() { return &Inductions; }
[59 lines not shown]	public:
}		}

bool hasStride(Value *V) { return LAI->hasStride(V); }		bool hasStride(Value *V) { return LAI->hasStride(V); }

/// Returns true if vector representation of the instruction \p I		/// Returns true if vector representation of the instruction \p I
/// requires mask.		/// requires mask.
bool isMaskRequired(const Instruction *I) { return (MaskedOp.count(I) != 0); }		bool isMaskRequired(const Instruction *I) { return (MaskedOp.count(I) != 0); }

		/// Returns true if all loop blocks should be masked to fold tail loop.
		bool foldTailByMasking() const { return FoldTailByMasking; }

unsigned getNumStores() const { return LAI->getNumStores(); }		unsigned getNumStores() const { return LAI->getNumStores(); }
unsigned getNumLoads() const { return LAI->getNumLoads(); }		unsigned getNumLoads() const { return LAI->getNumLoads(); }

// Returns true if the NoNaN attribute is set on the function.		// Returns true if the NoNaN attribute is set on the function.
bool hasFunNoNaNAttr() const { return HasFunNoNaNAttr; }		bool hasFunNoNaNAttr() const { return HasFunNoNaNAttr; }

private:		private:
/// Return true if the pre-header, exiting and latch blocks of \p Lp and all		/// Return true if the pre-header, exiting and latch blocks of \p Lp and all
[143 lines not shown]	private:

/// The assumption cache analysis is used to compute the minimum type size in		/// The assumption cache analysis is used to compute the minimum type size in
/// which a reduction can be computed.		/// which a reduction can be computed.
AssumptionCache *AC;		AssumptionCache *AC;

/// While vectorizing these instructions we have to generate a		/// While vectorizing these instructions we have to generate a
/// call to the appropriate masked intrinsic		/// call to the appropriate masked intrinsic
SmallPtrSet<const Instruction *, 8> MaskedOp;		SmallPtrSet<const Instruction *, 8> MaskedOp;

		hsaito: I think it's best not to keep this state in the Legal. From the Legal perspective, being able to vectorize the whole loop body under the mask and the actual decision to do so are completely separate issues. Since canFold...() is invoked by CostModel::computeMaxVF, we should be able to keep this state in the CostModel. After all, whether to bail out or continue under FoldTailByMasking is a cost model side of the state, after consulting the Legal.
		Ayal: OK.
		hsaito: Thank you.
		/// All blocks of loop are to be masked to fold tail of scalar iterations.
		bool FoldTailByMasking = false;
};		};

} // namespace llvm		} // namespace llvm

#endif // LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H		#endif // LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

[783 lines not shown]
bool LoopVectorizationLegality::isInductionVariable(const Value *V) {		bool LoopVectorizationLegality::isInductionVariable(const Value *V) {
return isInductionPhi(V) || isCastedInductionVariable(V);		return isInductionPhi(V) || isCastedInductionVariable(V);
}		}

bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) {		bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) {
return FirstOrderRecurrences.count(Phi);		return FirstOrderRecurrences.count(Phi);
}		}

bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) {		bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) {
		hsaito: By moving FoldTail state to CostModel, we can define CostModel::blockNeedsPredication(BB) as FoldTailByMasking || LAI::blockNeedsPredication(BB) and make Legal version static to Legal.
		Ayal: OK, except that LAI::blockNeedsPredication() also asks for DT which CostModel does not have. Let's have CostModel::blockNeedsPredication(BB) return FoldTailByMasking || Legal::blockNeedsPredication(). Hopefully the two will not cause confusion. Making Legal version static should be pursued in a separate patch, if desired.
		hsaito: Thanks, and fair enough.
return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT);		return (FoldTailByMasking ||
		LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT));
}		}

bool LoopVectorizationLegality::blockCanBePredicated(		bool LoopVectorizationLegality::blockCanBePredicated(
BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs) {		BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs) {
const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();		const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();

for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
// Check that we don't have a constant expression that can trap as operand.		// Check that we don't have a constant expression that can trap as operand.
[262 lines not shown]	bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {

// Okay! We've done all the tests. If any have failed, return false. Otherwise		// Okay! We've done all the tests. If any have failed, return false. Otherwise
// we can vectorize, and at this point we don't have any other mem analysis		// we can vectorize, and at this point we don't have any other mem analysis
// which may limit our maximum vectorization factor, so just return true with		// which may limit our maximum vectorization factor, so just return true with
// no restrictions.		// no restrictions.
return Result;		return Result;
}		}

		bool LoopVectorizationLegality::canFoldTailByMasking() {

		LLVM_DEBUG(dbgs() << "LV: checking if tail can be folded by masking.\n");

		if (!PrimaryInduction) {
		ORE->emit(createMissedAnalysis("NoPrimaryInduction")
		<< "Missing a primary induction variable in the loop, which is "
		<< "needed in order to fold tail by masking as required.");
		LLVM_DEBUG(dbgs() << "LV: No primary induction, cannot fold tail by "
		<< "masking.\n");
		return false;
		}

		// TODO: handle reductions when tail is folded by masking.
		if (!Reductions.empty()) {
		ORE->emit(createMissedAnalysis("ReductionFoldingTailByMasking")
		<< "Cannot fold tail by masking in the presence of reductions.");
		LLVM_DEBUG(dbgs() << "LV: Loop has reductions, cannot fold tail by "
		<< "masking.\n");
		return false;
		}

		// TODO: handle outside users when tail is folded by masking.
		for (auto *AE : AllowedExit) {
		// Check that all users of allowed exit values are inside the loop.
		for (User *U : AE->users()) {
		Instruction *UI = cast<Instruction>(U);
		if (TheLoop->contains(UI))
		continue;
		ORE->emit(createMissedAnalysis("LiveOutFoldingTailByMasking")
		<< "Cannot fold tail by masking in the presence of live outs.");
		LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking, loop has an "
		<< "outside user for : " << *UI << '\n');
		return false;
		}
		}

		// The list of pointers that we can safely read and write to remains empty.
		SmallPtrSet<Value *, 8> SafePointers;

		// Check and mark all blocks for predication, including those that ordinarily
		// do not need predication such as the header block.
		for (BasicBlock *BB : TheLoop->blocks()) {
		if (!blockCanBePredicated(BB, SafePointers)) {
		ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator())
		<< "control flow cannot be substituted for a select");
		LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking as required.\n");
		return false;
		}
		}

		LLVM_DEBUG(dbgs() << "LV: can fold tail by masking.\n");
		FoldTailByMasking = true;
		return true;
		}

} // namespace llvm		} // namespace llvm

lib/Transforms/Vectorize/LoopVectorize.cpp

[951 lines not shown]
class InterleavedAccessInfo {		class InterleavedAccessInfo {
public:		public:
InterleavedAccessInfo(PredicatedScalarEvolution &PSE, Loop *L,		InterleavedAccessInfo(PredicatedScalarEvolution &PSE, Loop *L,
DominatorTree *DT, LoopInfo *LI,		DominatorTree *DT, LoopInfo *LI,
const LoopAccessInfo *LAI)		const LoopAccessInfo *LAI)
: PSE(PSE), TheLoop(L), DT(DT), LI(LI), LAI(LAI) {}		: PSE(PSE), TheLoop(L), DT(DT), LI(LI), LAI(LAI) {}

~InterleavedAccessInfo() {		~InterleavedAccessInfo() {
		reset();
		}

		/// Analyze the interleaved accesses and collect them in interleave
		/// groups. Substitute symbolic strides using \p Strides.
		void analyzeInterleaving();

		/// Invalidate groups, e.g., in case all blocks in loop will be predicated
		/// contrary to original assumption. Although we currently prevent group
		/// formation for predicated accesses, we may be able to relax this limitation
		/// in the future once we handle more complicated blocks.
		void reset() {
SmallPtrSet<InterleaveGroup *, 4> DelSet;		SmallPtrSet<InterleaveGroup *, 4> DelSet;
// Avoid releasing a pointer twice.		// Avoid releasing a pointer twice.
for (auto &I : InterleaveGroupMap)		for (auto &I : InterleaveGroupMap)
DelSet.insert(I.second);		DelSet.insert(I.second);
for (auto *Ptr : DelSet)		for (auto *Ptr : DelSet)
delete Ptr;		delete Ptr;
		InterleaveGroupMap.clear();
		RequiresScalarEpilogue = false;
}		}

/// Analyze the interleaved accesses and collect them in interleave
/// groups. Substitute symbolic strides using \p Strides.
void analyzeInterleaving();

/// Check if \p Instr belongs to any interleave group.		/// Check if \p Instr belongs to any interleave group.
bool isInterleaved(Instruction *Instr) const {		bool isInterleaved(Instruction *Instr) const {
return InterleaveGroupMap.count(Instr);		return InterleaveGroupMap.count(Instr);
}		}

/// Get the interleave group that \p Instr belongs to.		/// Get the interleave group that \p Instr belongs to.
///		///
/// \returns nullptr if doesn't have such group.		/// \returns nullptr if doesn't have such group.
[1,609 lines not shown]	PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,

return Induction;		return Induction;
}		}

Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) {		Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) {
if (TripCount)		if (TripCount)
return TripCount;		return TripCount;

		assert(L && "Create Trip Count for null loop.");
IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());		IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());
// Find the loop boundaries.		// Find the loop boundaries.
ScalarEvolution *SE = PSE.getSE();		ScalarEvolution *SE = PSE.getSE();
const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();		const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount();
assert(BackedgeTakenCount != SE->getCouldNotCompute() &&		assert(BackedgeTakenCount != SE->getCouldNotCompute() &&
"Invalid loop count");		"Invalid loop count");

Type *IdxTy = Legal->getWidestInductionType();		Type *IdxTy = Legal->getWidestInductionType();
[32 lines not shown]

Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) {		Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) {
if (VectorTripCount)		if (VectorTripCount)
return VectorTripCount;		return VectorTripCount;

Value *TC = getOrCreateTripCount(L);		Value *TC = getOrCreateTripCount(L);
IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());		IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());

		Type *Ty = TC->getType();
		Constant *Step = ConstantInt::get(Ty, VF * UF);

		// If the tail is to be folded by masking, round the number of iterations N
		// up to a multiple of Step instead of rounding down. This is done by first
		// adding Step-1 and then rounding down. Note that it's ok if this addition
		// overflows: the vector induction variable will eventually wrap to zero given
		// that it starts at zero and its Step is a power of two; the loop will then
		hsaito: I think there is a danger in assuming UF being a power of two. Granted that there may be other parts of LV already assuming it, I still wouldn't like to see any more of those being added. If new code is assuming power of two UF, it's best if we ensure that is really the case (e.g., when foldTailByMasking() is true, assert that UF is power of two). Better yet, check VF*UF is power of two here since that's the assumption this code has.
		Ayal: OK, will add an assert that VF*UF is a power of 2 below under the `if (Legal->foldTailByMasking())`.
		// exit, with the last early-exit vector comparison also producing all-true.
		if (Legal->foldTailByMasking())
		TC = Builder.CreateAdd(TC, ConstantInt::get(Ty, VF * UF - 1), "n.rnd.up");

// Now we need to generate the expression for the part of the loop that the		// Now we need to generate the expression for the part of the loop that the
// vectorized body will execute. This is equal to N - (N % Step) if scalar		// vectorized body will execute. This is equal to N - (N % Step) if scalar
// iterations are not required for correctness, or N - Step, otherwise. Step		// iterations are not required for correctness, or N - Step, otherwise. Step
// is equal to the vectorization factor (number of SIMD elements) times the		// is equal to the vectorization factor (number of SIMD elements) times the
// unroll factor (number of SIMD instructions).		// unroll factor (number of SIMD instructions).
Constant *Step = ConstantInt::get(TC->getType(), VF * UF);
Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");		Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");
		hsaito: This Urem creation should be skipped if we aren't generating remainder.
		Ayal: This Urem is also used to round N up to a multiple of Step, i.e., when we're not generating remainder.
		hsaito: Ouch. Well, given the assertion for VF*UF being power of two (constant), the UREM and other computation should be reasonably optimizable downstream. So, it's probably unfair to ask you to fix the trip count computation, so I won't ask. There is a trade off between generating more optimal output IR and the cost of maintaining the code to do that. Keeping UREM here is opting for lower maintenance. Just for the record.
		Ayal: Rounding N down to a multiple of Step is in general N-(N%Step). If Step is a constant power of two (which is currently always the case, and must be the case when folding the tail by masking), it gets optimized downstream to N&(-Step). If Step would be some other constant it may get optimized downstream to use multiplication instead of division, depending on target characteristics. In any case, this takes place before the loop; and is orthogonal to this patch, which simply reuses the existing logic to also round up.
		hsaito: "orthogonal to this patch": I agree.
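
The arithmetic discussed in this thread can be checked with a small standalone sketch. It is an illustration only: the helper names are hypothetical, `Step` stands for VF * UF, and 32-bit unsigned integers are assumed for the trip count.

```cpp
#include <cassert>
#include <cstdint>

inline bool isPowerOfTwo(std::uint32_t X) { return X != 0 && (X & (X - 1)) == 0; }

// General form: round N down to a multiple of Step.
inline std::uint32_t roundDown(std::uint32_t N, std::uint32_t Step) {
  assert(Step != 0 && "Step must be non-zero");
  return N - (N % Step);
}

// Equivalent form for a power-of-two Step: N & (-Step).
inline std::uint32_t roundDownPow2(std::uint32_t N, std::uint32_t Step) {
  assert(isPowerOfTwo(Step) && "Step must be a power of two");
  return N & (0u - Step);
}

// Folding the tail: add Step - 1 first (this addition may wrap), then
// round down. A wrapped sum yields 0, which a vector induction variable
// that starts at 0 and advances by a power-of-two Step also reaches.
inline std::uint32_t roundUp(std::uint32_t N, std::uint32_t Step) {
  assert(isPowerOfTwo(Step) && "Step must be a power of two");
  return roundDown(N + (Step - 1), Step);
}
```

For example, with Step = 8, a trip count of 21 rounds down to 16 and up to 24, while a trip count of 2^32 - 3 rounds up to 0 because the addition wraps.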

// If there is a non-reversed interleaved group that may speculatively access		// If there is a non-reversed interleaved group that may speculatively access
// memory out-of-bounds, we need to ensure that there will be at least one		// memory out-of-bounds, we need to ensure that there will be at least one
// iteration of the scalar epilogue loop. Thus, if the step evenly divides		// iteration of the scalar epilogue loop. Thus, if the step evenly divides
// the trip count, we set the remainder to be equal to the step. If the step		// the trip count, we set the remainder to be equal to the step. If the step
// does not evenly divide the trip count, no adjustment is necessary since		// does not evenly divide the trip count, no adjustment is necessary since
// there will already be scalar iterations. Note that the minimum iterations		// there will already be scalar iterations. Note that the minimum iterations
// check ensures that N >= Step.		// check ensures that N >= Step.
[45 lines not shown]	void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L,

// Generate code to check if the loop's trip count is less than VF * UF, or		// Generate code to check if the loop's trip count is less than VF * UF, or
// equal to it in case a scalar epilogue is required; this implies that the		// equal to it in case a scalar epilogue is required; this implies that the
// vector trip count is zero. This check also covers the case where adding one		// vector trip count is zero. This check also covers the case where adding one
// to the backedge-taken count overflowed leading to an incorrect trip count		// to the backedge-taken count overflowed leading to an incorrect trip count
// of zero. In this case we will also jump to the scalar loop.		// of zero. In this case we will also jump to the scalar loop.
auto P = Cost->requiresScalarEpilogue() ? ICmpInst::ICMP_ULE		auto P = Cost->requiresScalarEpilogue() ? ICmpInst::ICMP_ULE
: ICmpInst::ICMP_ULT;		: ICmpInst::ICMP_ULT;
Value *CheckMinIters = Builder.CreateICmp(
P, Count, ConstantInt::get(Count->getType(), VF * UF), "min.iters.check");		// If tail is to be folded, vector loop takes care of all iterations.
		Value *CheckMinIters = Builder.getFalse();
		hsaito: Personally, I don't like to see IR like the following going out of the vectorizer, even though it is trivially cleaned up later:
		%1 = false // unused and thus will be trivially cleaned up later.
		%2 = icmp ...
		Changing this part of the patch to
		Value *CheckMinIters = nullptr; if () .... else CheckMinIters = Builder.getFalse();
		would make cleaner IR going out for common cases, at a small price to pay in ease of reading. If you agree, great. If not, I won't make a big deal about it. At the end of the day, we should clean up this area of code such that we don't have to rely on CheckMinIters being a "false" constant to clean up the unnecessary min iter check. That improvement can be done as a separate NFC patch.
		Ayal: One could change this part of the patch to create an unconditional branch instead of a conditional one from BB to NewBB; or avoid creating NewBB / calling `emitMinimumIterationCountCheck()` altogether `if (foldTailByMasking())`. Both alternatives will change the dominance structure and thus require special attention when updating DT in `updateAnalysis()`. The latter would also need to record the EntryBlock for cases where `LoopBypassBlocks` remains empty. It's simpler to keep the existing skeletal structure intact, and rely on subsequent trivial DCE cleanup. If desired, such alternatives should be proposed as a separate follow-up NFC patch.
		if (!Legal->foldTailByMasking())
		CheckMinIters = Builder.CreateICmp(
		P, Count, ConstantInt::get(Count->getType(), VF * UF),
		"min.iters.check");

BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");		BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");
// Update dominator tree immediately if the generated block is a		// Update dominator tree immediately if the generated block is a
// LoopBypassBlock because SCEV expansions to generate loop bypass		// LoopBypassBlock because SCEV expansions to generate loop bypass
// checks may query it before the current function is finished.		// checks may query it before the current function is finished.
DT->addNewBlock(NewBB, BB);		DT->addNewBlock(NewBB, BB);
if (L->getParentLoop())		if (L->getParentLoop())
L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI);		L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI);
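
The effect of the `CheckMinIters` change on the bypass decision can be sketched in plain C++, outside of LLVM. This is a hypothetical model, not the vectorizer's code: `bypassToScalarLoop` and its parameters are names invented for illustration. It assumes the check is an unsigned compare of the trip count against VF * UF (ULE when a scalar epilogue is required, ULT otherwise), and that folding the tail replaces the check with a constant false.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the "min.iters.check" decision; names are illustrative.
// Returns true when control should skip the vector loop and run the scalar loop.
bool bypassToScalarLoop(uint64_t Count, unsigned VF, unsigned UF,
                        bool RequiresScalarEpilogue, bool FoldTailByMasking) {
  // With the tail folded, the masked vector loop executes every iteration,
  // so the check degenerates to a constant false.
  if (FoldTailByMasking)
    return false;
  const uint64_t Step = static_cast<uint64_t>(VF) * UF;
  // ULE if at least one scalar iteration must remain, ULT otherwise.
  return RequiresScalarEpilogue ? Count <= Step : Count < Step;
}
```

Under these assumptions, a trip count that wrapped to zero still takes the scalar path when the tail is not folded, which matches the comment about the overflowed backedge-taken count.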
[...] for (auto &InductionEntry : *List) {
for (BasicBlock *BB : LoopBypassBlocks)		for (BasicBlock *BB : LoopBypassBlocks)
BCResumeVal->addIncoming(II.getStartValue(), BB);		BCResumeVal->addIncoming(II.getStartValue(), BB);
OrigPhi->setIncomingValue(BlockIdx, BCResumeVal);		OrigPhi->setIncomingValue(BlockIdx, BCResumeVal);
}		}

// Add a check in the middle block to see if we have completed		// Add a check in the middle block to see if we have completed
// all of the iterations in the first vector loop.		// all of the iterations in the first vector loop.
// If (N - N%VF) == N, then we don't need to run the remainder.		// If (N - N%VF) == N, then we don't need to run the remainder.
Value *CmpN =		// If tail is to be folded, we know we don't need to run the remainder.
		Value *CmpN = Builder.getTrue();
		hsaito: See the comment on CheckMinIters.
		Ayal: Ditto.
		if (!Legal->foldTailByMasking())
		CmpN =
CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count,		CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count,
CountRoundDown, "cmp.n", MiddleBlock->getTerminator());		CountRoundDown, "cmp.n", MiddleBlock->getTerminator());
ReplaceInstWithInst(MiddleBlock->getTerminator(),		ReplaceInstWithInst(MiddleBlock->getTerminator(),
BranchInst::Create(ExitBlock, ScalarPH, CmpN));		BranchInst::Create(ExitBlock, ScalarPH, CmpN));

// Get ready to start creating new instructions into the vectorized body.		// Get ready to start creating new instructions into the vectorized body.
Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt());		Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt());

// Save the state.		// Save the state.
LoopVectorPreHeader = Lp->getLoopPreheader();		LoopVectorPreHeader = Lp->getLoopPreheader();
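
The middle-block check can be modelled the same way. The sketch below is hypothetical (`skipRemainder` is an invented name) and deliberately ignores the scalar-epilogue adjustment described earlier in the file. It only illustrates the `(N - N%VF) == N` test and how it collapses to a constant true when the tail is folded.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the "cmp.n" decision in the middle block.
// N is the scalar trip count, Step is VF * UF (assumed nonzero).
// Returns true when the scalar remainder loop can be skipped.
bool skipRemainder(uint64_t N, uint64_t Step, bool FoldTailByMasking) {
  if (FoldTailByMasking)
    return true; // every iteration already ran, masked, in the vector loop
  const uint64_t CountRoundDown = N - N % Step; // iterations covered by the vector loop
  return CountRoundDown == N;
}
```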
[...] LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
return None;		return None;
}		}

// If we optimize the program for size, avoid creating the tail loop.		// If we optimize the program for size, avoid creating the tail loop.
LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');		LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

// If we don't know the precise trip count, don't try to vectorize.		if (TC == 1) {
		reames: There's a mix of seemingly unrelated changes here. This is one example. It would be good to land these separately.
		hsaito: This change is relevant in the sense that TC < 2 is split into two parts: TC==1 and TC==0. The TC==0 case will then have a chance of hitting Legal->canFoldTailByMasking() later. As a result, the TC==1 case can return early here, with very crisp messaging. Having said that, if you'd like to see the same ORE->emit(...) LLVM_DEBUG() stuff here, I won't go against that. The messaging change can be a separate commit. Ayal, we need ORE->emit() here, in addition to LLVM_DEBUG(), right, regardless of whether we change the actual message?
		Ayal: Yes, this change is unrelated and should land separately. The original ORE message is wrong. Not sure the TC==1 case qualifies for any ORE message - "loops" with a known trip count of one are simply irrelevant for vectorization; though we could vectorize them with a mask...
		hsaito: I think every non-vectorized loop that goes through the vectorizer's analysis qualifies for ORE. After all, TC==1 knowledge may or may not be available to the programmer otherwise.
		Ayal: ok
if (TC < 2) {		LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
ORE->emit(
createMissedAnalysis("UnknownLoopCountComplexCFG")
<< "unable to calculate the loop count due to complex control flow");
LLVM_DEBUG(
dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
return None;		return None;
}		}

unsigned MaxVF = computeFeasibleMaxVF(OptForSize, TC);		unsigned MaxVF = computeFeasibleMaxVF(OptForSize, TC);

if (TC % MaxVF != 0) {		if (TC > 0 && TC % MaxVF == 0) {
// If the trip count that we found modulo the vectorization factor is not		LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
// zero then we require a tail.		return MaxVF;
// FIXME: look for a smaller MaxVF that does divide TC rather than give up.		}
// FIXME: return None if loop requiresScalarEpilog(<MaxVF>), or look for a
// smaller MaxVF that does not require a scalar epilog.

ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize")		if (TC > 0 && TC < TinyTripCountInterleaveThreshold) {
<< "cannot optimize for size and vectorize at the "		ORE->emit(createMissedAnalysis("TinyTripCount")
"same time. Enable vectorization of this loop "		<< "The trip count of the loop is below the given threshold for "
"with '#pragma clang loop vectorize(enable)' "		<< "loops with scalar iterations.");
"when compiling with -Os/-Oz");		LLVM_DEBUG(dbgs() << "LV: Aborting - trip count below given threshold for "
LLVM_DEBUG(		<< "loop with scalar iterations.\n");
		dcaballe: I'm trying to understand the purpose of this check. Does it prevent masked vectorization if TC is lower than `TinyTripCountInterleaveThreshold` (i.e., 128)? Should we use an independent threshold for this?
		Ayal: Ah, this is wrong, good catch! The original purpose (of `TinyTripCountVectorThreshold` rather than `TinyTripCountInterleaveThreshold`) was to prevent vectorization of loops with very short trip counts due to overheads. Later it was extended in r306803 to allow vectorization under OptForSize, as it implies that all iterations are concentrated inside the vector loop for more accurate cost estimation. This still holds when folding the tail by masking, so we should not bail out here.
		Ayal: This BTW is caught by vect.omp.force.small-tc.ll; but the -vectorizer-min-trip-count=21 flag it uses is external to OpenMP, afaik.
dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
return None;		return None;
}		}

		// If we don't know the precise trip count, or if the trip count that we
		// found modulo the vectorization factor is not zero, try to fold the tail
		// by masking.
		// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
		// FIXME: return None if loop requiresScalarEpilog(<MaxVF>), or look for a
		// smaller MaxVF that does not require a scalar epilog.
		if (Legal->canFoldTailByMasking())
return MaxVF;		return MaxVF;

		hsaito: I think we need to add if (TC==0) { emit one kind of remark } else { emit another kind of remark } here, in order to match the previous capability.
		Ayal: OK, will retain the previous MissedAnalysis remarks here, in addition to the new ones supplied by `canFoldTailByMasking()`.
		hsaito: Thank you.
		return None;
}		}
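
Read together with the review comments, the trip-count part of the -Os decision in this hunk can be summarized by the following sketch. It is an assumption-laden model rather than the patch itself: the function name is invented, a return value of 0 stands in for `None`, TC == 0 stands for an unknown trip count, and the `TinyTripCount` bail-out is left out because the author's reply above calls it wrong. The runtime-check bail-outs that precede this code are not modelled either.

```cpp
#include <cassert>

// Hypothetical model; 0 plays the role of "None" (do not vectorize).
// TC == 0 means the trip count is unknown. MaxVF is assumed nonzero.
unsigned maxVFUnderOptForSize(unsigned TC, unsigned MaxVF,
                              bool CanFoldTailByMasking) {
  const unsigned NoVF = 0;
  if (TC == 1)
    return NoVF;  // single-iteration "loop": nothing to vectorize
  if (TC > 0 && TC % MaxVF == 0)
    return MaxVF; // no tail remains for any chosen VF
  if (CanFoldTailByMasking)
    return MaxVF; // tail is folded into the masked vector loop
  return NoVF;    // a scalar tail loop would be required
}
```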

unsigned		unsigned
LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,		LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,
unsigned ConstTripCount) {		unsigned ConstTripCount) {
MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);		MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);
unsigned SmallestType, WidestType;		unsigned SmallestType, WidestType;
std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();		std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();
[...] unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
// We calculate the interleave count using the following formula.		// We calculate the interleave count using the following formula.
// Subtract the number of loop invariants from the number of available		// Subtract the number of loop invariants from the number of available
// registers. These registers are used by all of the interleaved instances.		// registers. These registers are used by all of the interleaved instances.
// Next, divide the remaining registers by the number of registers that is		// Next, divide the remaining registers by the number of registers that is
// required by the loop, in order to estimate how many parallel instances		// required by the loop, in order to estimate how many parallel instances
// fit without causing spills. All of this is rounded down if necessary to be		// fit without causing spills. All of this is rounded down if necessary to be
// a power of two. We want power of two interleave count to simplify any		// a power of two. We want power of two interleave count to simplify any
// addressing operations or alignment considerations.		// addressing operations or alignment considerations.
		// We also want power of two interleave counts to ensure that the induction
		// variable of the vector loop wraps to zero, when tail is folded by masking;
		// this currently happens when OptForSize, inwhich case IC is set to 1 above.
		dcaballe: inwhich -> in which?
		Ayal: ok
unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /		unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
R.MaxLocalUsers);		R.MaxLocalUsers);

// Don't count the induction variable as interleaved.		// Don't count the induction variable as interleaved.
if (EnableIndVarRegisterHeur)		if (EnableIndVarRegisterHeur)
IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /		IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) /
std::max(1U, (R.MaxLocalUsers - 1)));		std::max(1U, (R.MaxLocalUsers - 1)));
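
The new comment about power-of-two interleave counts can be illustrated with a deliberately tiny induction variable. The sketch below is an illustration under assumptions, not LLVM code: it uses an 8-bit counter and an invented helper, `wrapsToZero`, to show that stepping by a power of two lands exactly on zero when the counter wraps, whereas other steps skip past it.

```cpp
#include <cassert>
#include <cstdint>

// Counts in an 8-bit induction variable to keep the wrap-around easy to see.
// Starting from 0 and repeatedly adding Step (assumed nonzero), returns true
// if the first value after wrapping past 255 is exactly 0.
bool wrapsToZero(uint8_t Step) {
  uint8_t IV = 0;
  for (;;) {
    const uint8_t Next = static_cast<uint8_t>(IV + Step);
    if (Next <= IV) // the counter wrapped around
      return Next == 0;
    IV = Next;
  }
}
```

For example, a step of 4 visits 252 and then wraps to 0, while a step of 3 visits 255 and then wraps to 2.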

[...]
LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {		LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {
assert(OrigLoop->empty() && "Inner loop expected.");		assert(OrigLoop->empty() && "Inner loop expected.");
// Width 1 means no vectorization, cost 0 means uncomputed cost.		// Width 1 means no vectorization, cost 0 means uncomputed cost.
const VectorizationFactor NoVectorization = {1U, 0U};		const VectorizationFactor NoVectorization = {1U, 0U};
Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);		Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);
if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.		if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.
return NoVectorization;		return NoVectorization;

		// Invalidate interleave groups if all blocks of loop will be predicated.
		if (Legal->blockNeedsPredication(OrigLoop->getHeader()))
		CM.InterleaveInfo.reset();
		dcaballe: Just curious. Could we prevent the computation of interleave groups for these cases instead of doing a reset?
		Ayal: That would have been simpler indeed. But there's a subtle phase-ordering issue here: `MaxVF=computeFeasibleMaxVF()` uses tentative interleave groups to `getSmallestAndWidestTypes()`, and is then used in determining if the tail should be folded by masking (i.e., if TC is a multiple of MaxVF), in which case these groups will all be masked/invalid.
		dcaballe: Thanks!

if (UserVF) {		if (UserVF) {
LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.		// profitable to scalarize.
CM.selectUserVectorizationFactor(UserVF);		CM.selectUserVectorizationFactor(UserVF);
buildVPlansWithVPRecipes(UserVF, UserVF);		buildVPlansWithVPRecipes(UserVF, UserVF);
LLVM_DEBUG(printPlans(dbgs()));		LLVM_DEBUG(printPlans(dbgs()));
[...] void LoopVectorizationPlanner::executePlan(InnerLoopVectorizer &ILV,

// 1. Create a new empty loop. Unlink the old loop and connect the new one.		// 1. Create a new empty loop. Unlink the old loop and connect the new one.
VPCallbackILV CallbackILV(ILV);		VPCallbackILV CallbackILV(ILV);

VPTransformState State{BestVF, BestUF, LI,		VPTransformState State{BestVF, BestUF, LI,
DT, ILV.Builder, ILV.VectorLoopValueMap,		DT, ILV.Builder, ILV.VectorLoopValueMap,
&ILV, CallbackILV};		&ILV, CallbackILV};
State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton();		State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton();
		State.TripCount = ILV.getOrCreateTripCount(nullptr);

//===------------------------------------------------===//		//===------------------------------------------------===//
//		//
// Notice: any optimization or new instruction that go		// Notice: any optimization or new instruction that go
// into the code below should also be implemented in		// into the code below should also be implemented in
// the cost-model.		// the cost-model.
//		//
//===------------------------------------------------===//		//===------------------------------------------------===//
[...] VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB);		BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB);
if (BCEntryIt != BlockMaskCache.end())		if (BCEntryIt != BlockMaskCache.end())
return BCEntryIt->second;		return BCEntryIt->second;

// All-one mask is modelled as no-mask following the convention for masked		// All-one mask is modelled as no-mask following the convention for masked
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

// Loop incoming mask is all-one.		if (OrigLoop->getHeader() == BB) {
if (OrigLoop->getHeader() == BB)		if (!Legal->blockNeedsPredication(BB))
		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

		// Introduce the early-exit compare IV <= BTC to form header block mask.
		// This is used instead of IV < TC because TC may wrap, unlike BTC.
		VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
		VPValue *BTC = Plan->getBackedgeTakenCount();
		BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
return BlockMaskCache[BB] = BlockMask;		return BlockMaskCache[BB] = BlockMask;
		}
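
As a rough illustration of the header block mask introduced above, the sketch below evaluates the unsigned `IV <= BTC` compare per lane for one vector iteration. It is a scalar model with invented names (`headerMask`), and it assumes that lane `k` holds the induction value `IV + k` and that `IV + VF - 1` does not overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the header block mask for one vector iteration.
// IV is the induction value of lane 0; lanes hold IV, IV + 1, ..., IV + VF - 1.
// Assumes IV + VF - 1 does not overflow uint64_t.
std::vector<bool> headerMask(uint64_t IV, unsigned VF, uint64_t BTC) {
  std::vector<bool> Mask(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = (IV + Lane) <= BTC; // unsigned "IV <= BTC", per lane
  return Mask;
}
```

With a trip count of 10 (BTC = 9) and VF = 4, the third vector iteration starts at IV = 8, so only its first two lanes are active.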

// This is the block mask. We OR all incoming edges.		// This is the block mask. We OR all incoming edges.
for (auto *Predecessor : predecessors(BB)) {		for (auto *Predecessor : predecessors(BB)) {
VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);		VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);
if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.		if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.
return BlockMaskCache[BB] = EdgeMask;		return BlockMaskCache[BB] = EdgeMask;

if (!BlockMask) { // BlockMask has its initialized nullptr value.		if (!BlockMask) { // BlockMask has its initialized nullptr value.
[...] void LoopVectorizationPlanner::buildVPlansWithVPRecipes(unsigned MinVF,
for (BasicBlock *BB : OrigLoop->blocks()) {		for (BasicBlock *BB : OrigLoop->blocks()) {
if (BB == Latch)		if (BB == Latch)
continue;		continue;
BranchInst *Branch = dyn_cast<BranchInst>(BB->getTerminator());		BranchInst *Branch = dyn_cast<BranchInst>(BB->getTerminator());
if (Branch && Branch->isConditional())		if (Branch && Branch->isConditional())
NeedDef.insert(Branch->getCondition());		NeedDef.insert(Branch->getCondition());
}		}

		// If the tail is to be folded by masking, the primary induction variable
		// needs to be represented in VPlan for it to model early-exit masking.
		if (Legal->foldTailByMasking())
		NeedDef.insert(Legal->getPrimaryInduction());

// Collect instructions from the original loop that will become trivially dead		// Collect instructions from the original loop that will become trivially dead
// in the vectorized loop. We don't need to vectorize these instructions. For		// in the vectorized loop. We don't need to vectorize these instructions. For
// example, original induction update instructions can become dead because we		// example, original induction update instructions can become dead because we
// separately emit induction "steps" when generating code for the new loop.		// separately emit induction "steps" when generating code for the new loop.
// Similarly, we create a new latch condition when setting up the structure		// Similarly, we create a new latch condition when setting up the structure
// of the new loop, so the old one can become dead.		// of the new loop, so the old one can become dead.
SmallPtrSet<Instruction *, 4> DeadInstructions;		SmallPtrSet<Instruction *, 4> DeadInstructions;
collectTriviallyDeadInstructions(DeadInstructions);		collectTriviallyDeadInstructions(DeadInstructions);
[...]

lib/Transforms/Vectorize/VPlan.h

[...] struct VPTransformState {
/// Hold a reference to the Value state information used when generating the		/// Hold a reference to the Value state information used when generating the
/// Values of the output IR.		/// Values of the output IR.
VectorizerValueMap &ValueMap;		VectorizerValueMap &ValueMap;

/// Hold a reference to a mapping between VPValues in VPlan and original		/// Hold a reference to a mapping between VPValues in VPlan and original
/// Values they correspond to.		/// Values they correspond to.
VPValue2ValueTy VPValue2Value;		VPValue2ValueTy VPValue2Value;

		/// Hold the trip count of the scalar loop.
		Value *TripCount = nullptr;

/// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.		/// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.
InnerLoopVectorizer *ILV;		InnerLoopVectorizer *ILV;

VPCallback &Callback;		VPCallback &Callback;
};		};

/// VPBlockBase is the building block of the Hierarchical Control-Flow Graph.		/// VPBlockBase is the building block of the Hierarchical Control-Flow Graph.
/// A VPBlockBase can be either a VPBasicBlock or a VPRegionBlock.		/// A VPBlockBase can be either a VPBasicBlock or a VPRegionBlock.
[...]
/// While as any Recipe it may generate a sequence of IR instructions when		/// While as any Recipe it may generate a sequence of IR instructions when
/// executed, these instructions would always form a single-def expression as		/// executed, these instructions would always form a single-def expression as
/// the VPInstruction is also a single def-use vertex.		/// the VPInstruction is also a single def-use vertex.
class VPInstruction : public VPUser, public VPRecipeBase {		class VPInstruction : public VPUser, public VPRecipeBase {
friend class VPlanHCFGTransforms;		friend class VPlanHCFGTransforms;

public:		public:
/// VPlan opcodes, extending LLVM IR with idiomatics instructions.		/// VPlan opcodes, extending LLVM IR with idiomatics instructions.
enum { Not = Instruction::OtherOpsEnd + 1 };		enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };
		dcaballe: I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first?
		Ayal: An alternative of leveraging the `Instruction::ICmp` opcode and existing `ICmpInst` subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):
		+ // Introduce the early-exit compare IV <= BTC to form header block mask.
		+ // This is used instead of IV < TC because TC may wrap, unlike BTC.
		+ VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
		+ VPValue *BTC = Plan->getBackedgeTakenCount();
		+ Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
		+ auto ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
		+ Plan->addDetachedValue(ICmp);
		+ BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
		return BlockMaskCache[BB] = BlockMask;
		and then have `VPInstruction::generateInstruction()` do
		+ case Instruction::ICmp: {
		+ Value *IV = State.get(getOperand(0), Part);
		+ Value *TC = State.get(getOperand(1), Part);
		+ auto ICmp = cast<ICmpInst>(getUnderlyingValue());
		+ Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
		+ State.set(this, V, Part);
		+ break;
		+ }
		where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.
		hsaito: Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator? The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that. My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?
		dcaballe: Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places. "My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?" The closer the class hierarchies are, the easier it will be.
		Ayal: Extensions of VPInstructions such as VPCmpInst should indeed be uploaded for review and deserve a separate discussion thread and justification. This patch could tentatively make use of it, though for the purpose of this patch an ICmpULE opcode or a detached ICmpInst suffice. An ICmpULE opcode shouldn't be problematic currently, as this early-exit is the only VPInstruction compare with a Predicate, right? Note that detached UnderlyingValues could serve as data containers for all fields already implemented in the IR hierarchy, and could be constructed at any point of VPlan construction for that purpose. Extending VPInstructions to provide a similar API as that of IR Instructions seems to be an orthogonal concern with its own design objectives, and can coexist with detached Values; e.g., a VPCmpInst could hold its Predicate using a detached ICmpInst/FCmpInst.
		hsaito: I go against detached ICmpInst. We'll be moving VPlan modeling before the cost model, and creating an IR Instruction before deciding to vectorize is against the VPlan concept. "seems to be an orthogonal concern with its own design objectives" Not quite. We'd like VPInstruction to be as easy to use for many LLVM developers, and that is an integral part of design/implementation from the beginning. Having said that, new opcode versus VPCmpInst doesn't block the rest of the review. Other parts of the review should proceed while the opcode versus VPCmpInst discussion is in progress on the side.
		dcaballe: I created D50823 with the VPCmpInst sub-class so that we can make a decision with the code in place.
		Ayal: VPlans should indeed keep the existing IR intact w/o changing it, as they are tentative by design, and also by current implementation. But creating a detached IR Instruction, just for the purpose of holding its attributes, w/o connecting it to any User, Operand (except Undef's) or BasicBlock, is arguably keeping the existing IR intact. Doing so should be quite familiar to LLVM developers, avoids mirroring Instruction's class hierarchy or a subset thereof, and leverages the existing UnderlyingValue pointer that is unutilized by InnerLoopVectorizer. Next uploaded version provides this complete option. Having said that, this patch can surely work with a VP(I)CmpInst just as well, as it merely needs a way for a single compare VPInstruction to hold a single Predicate, and print its name.
		dcaballe: I understand your point, Ayal. However, using UnderlyingValue as a pointer to the actual input IR in the VPlan native path and as a pointer to a detached IR Value in the inner loop path is very likely to be problematic, even in the short term. We would have to special-case the code that is shared for both paths to treat the UnderlyingValue differently. The detached IR special semantics in the inner loop path would also make the convergence of both paths a bit more complicated. If there are no major concerns regarding the VPCmpInst, I'd prefer going with that approach.
		hsaito: Detached IR instruction is detrimental to VPlan direction. Please do not use it.
		Ayal: Would be good to clarify the aforementioned discrepancy between VPlan native's use of input IR and the proposed use of detached IR; both should presumably model defs, uses and basic-block ownerships in VPlan rather than the IR Instruction, so the latter can merely be used for storing internal properties, for both paths alike. BTW, SROA.cpp and StraightLineStrengthReduce.cpp, e.g., also make use of detached Instructions. Would also be good to explain why detached Instructions are considered detrimental or what concept of VPlan they allegedly violate, given that their existence keeps the original IR intact. But let's keep this patch out of that discussion, and have it use an ICmpULE extended opcode as originally proposed and reloaded. After all, it plays a very small part in this patch, and can be easily revised later as needed.

private:		private:
typedef unsigned char OpcodeTy;		typedef unsigned char OpcodeTy;
OpcodeTy Opcode;		OpcodeTy Opcode;

/// Utility method serving execute(): generates a single instance of the		/// Utility method serving execute(): generates a single instance of the
/// modeled instruction.		/// modeled instruction.
void generateInstruction(VPTransformState &State, unsigned Part);		void generateInstruction(VPTransformState &State, unsigned Part);
[487 lines not shown]	private:

/// Holds all the external definitions created for this VPlan.		/// Holds all the external definitions created for this VPlan.
// TODO: Introduce a specific representation for external definitions in		// TODO: Introduce a specific representation for external definitions in
// VPlan. External definitions must be immutable and hold a pointer to its		// VPlan. External definitions must be immutable and hold a pointer to its
// underlying IR that will be used to implement its structural comparison		// underlying IR that will be used to implement its structural comparison
// (operators '==' and '<').		// (operators '==' and '<').
SmallPtrSet<VPValue *, 16> VPExternalDefs;		SmallPtrSet<VPValue *, 16> VPExternalDefs;

		/// Represents the backedge taken count of the original loop, for folding
		/// the tail.
		VPValue *BackedgeTakenCount;

/// Holds a mapping between Values and their corresponding VPValue inside		/// Holds a mapping between Values and their corresponding VPValue inside
/// VPlan.		/// VPlan.
Value2VPValueTy Value2VPValue;		Value2VPValueTy Value2VPValue;

/// Holds the VPLoopInfo analysis for this VPlan.		/// Holds the VPLoopInfo analysis for this VPlan.
VPLoopInfo VPLInfo;		VPLoopInfo VPLInfo;

public:		public:
VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {}		VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {
		BackedgeTakenCount = new VPValue();
		dcaballe: Instead of using an "empty" VPValue to model the BTC, would it be possible to model the actual operations that compute the BTC? We would only need a sub, right?
		Ayal (author): The BTC is computed by subtracting 1 from the Trip Count, which in turn is generated by SCEVExpander. Modeling this decrement would require using an "empty" VPValue to model its Trip Count operand. In any case, both involve scalar instructions that take place before the vectorized loop, currently outside the VPlan'd zone.
		hsaito: I'm not a big fan of allocating memory that goes unused in many situations. We can initialize this to nullptr and create an instance once we know the BTC is needed. That loses the convenience of being able to check NumUsers, but creating a needsBackedgeTakenCount() member function shouldn't be that bad. It's just Legal->foldTailByMasking(), until something else needs the BTC, right?
		Ayal (author, marked Done): OK. The VPValue can be created on demand, turning `getBackedgeTakenCount()` into `getOrCreateBackedgeTakenCount()`. `NumUsers` should still be checked, as this isolates the decision of creating the IR based on the VPlan. In any case, VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant.
		hsaito:
		> VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant.
		VPlan footprint was part of the community concern. We'd like to do better wherever we can. It's as simple as that. Thanks for taking care of it.
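		The on-demand creation agreed on in this thread could look roughly like the sketch below. `VPlanSketch` and `VPValueSketch` are hypothetical stand-ins, not the real `VPlan` and `VPValue` classes, and `needsBackedgeTakenCount()` merely borrows the name hsaito suggested; the code that eventually landed may well differ.

```cpp
// Hypothetical stand-in for llvm::VPValue; it only tracks a user count.
struct VPValueSketch {
  unsigned NumUsers = 0;
  unsigned getNumUsers() const { return NumUsers; }
  void addUser() { ++NumUsers; }
};

// Hypothetical stand-in for llvm::VPlan, reduced to the BTC handling.
class VPlanSketch {
  VPValueSketch *BackedgeTakenCount = nullptr; // Not allocated until requested.

public:
  VPlanSketch() = default;
  VPlanSketch(const VPlanSketch &) = delete;            // Sole owner of the BTC.
  VPlanSketch &operator=(const VPlanSketch &) = delete;
  ~VPlanSketch() { delete BackedgeTakenCount; }         // Deleting nullptr is a no-op.

  // Create the BTC placeholder on first request, then keep returning it.
  VPValueSketch *getOrCreateBackedgeTakenCount() {
    if (!BackedgeTakenCount)
      BackedgeTakenCount = new VPValueSketch();
    return BackedgeTakenCount;
  }

  // True only if the placeholder exists and something actually uses it, so
  // the decision to emit IR for the BTC still depends on the plan's contents.
  bool needsBackedgeTakenCount() const {
    return BackedgeTakenCount && BackedgeTakenCount->getNumUsers() > 0;
  }
};
```

		With this shape, a plan that never folds its tail allocates nothing for the BTC, while code generation can still test for users before emitting the decrement.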
		}

~VPlan() {		~VPlan() {
if (Entry)		if (Entry)
VPBlockBase::deleteCFG(Entry);		VPBlockBase::deleteCFG(Entry);
for (auto &MapEntry : Value2VPValue)		for (auto &MapEntry : Value2VPValue)
		if (MapEntry.second != BackedgeTakenCount)
delete MapEntry.second;		delete MapEntry.second;
		delete BackedgeTakenCount; // Delete once, if in Value2VPValue or not.
for (VPValue *Def : VPExternalDefs)		for (VPValue *Def : VPExternalDefs)
delete Def;		delete Def;
}		}

/// Generate the IR code for this VPlan.		/// Generate the IR code for this VPlan.
void execute(struct VPTransformState *State);		void execute(struct VPTransformState *State);

VPBlockBase *getEntry() { return Entry; }		VPBlockBase *getEntry() { return Entry; }
const VPBlockBase *getEntry() const { return Entry; }		const VPBlockBase *getEntry() const { return Entry; }

VPBlockBase *setEntry(VPBlockBase *Block) { return Entry = Block; }		VPBlockBase *setEntry(VPBlockBase *Block) { return Entry = Block; }

		/// The backedge taken count of the original loop.
		VPValue *getBackedgeTakenCount() { return BackedgeTakenCount; }

void addVF(unsigned VF) { VFs.insert(VF); }		void addVF(unsigned VF) { VFs.insert(VF); }

bool hasVF(unsigned VF) { return VFs.count(VF); }		bool hasVF(unsigned VF) { return VFs.count(VF); }

const std::string &getName() const { return Name; }		const std::string &getName() const { return Name; }

void setName(const Twine &newName) { Name = newName.str(); }		void setName(const Twine &newName) { Name = newName.str(); }

[290 lines not shown]

lib/Transforms/Vectorize/VPlan.cpp

[243 lines not shown]	void VPInstruction::generateInstruction(VPTransformState &State,

switch (getOpcode()) {		switch (getOpcode()) {
case VPInstruction::Not: {		case VPInstruction::Not: {
Value *A = State.get(getOperand(0), Part);		Value *A = State.get(getOperand(0), Part);
Value *V = Builder.CreateNot(A);		Value *V = Builder.CreateNot(A);
State.set(this, V, Part);		State.set(this, V, Part);
break;		break;
}		}
		case VPInstruction::ICmpULE: {
		Value *IV = State.get(getOperand(0), Part);
		Value *TC = State.get(getOperand(1), Part);
		Value *V = Builder.CreateICmpULE(IV, TC);
		State.set(this, V, Part);
		break;
		}
default:		default:
llvm_unreachable("Unsupported opcode for instruction");		llvm_unreachable("Unsupported opcode for instruction");
}		}
}		}
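
The new `ICmpULE` opcode compares the induction variable against the backedge-taken count (`TC - 1`) rather than against `TC` itself. The scalar sketch below illustrates why, under the assumption of an 8-bit induction variable and a vectorization factor of 4 (both chosen only for illustration): a loop of 256 iterations has a trip count that wraps to 0, so `i < TC` masks every lane off, whereas `i <= TC - 1` keeps every lane on. The function names `maskULE` and `maskULT` are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned VF = 4; // Assumed vectorization factor.

// Mask for the VF lanes starting at index Base, in the "icmp ule IV, BTC"
// form. BTC = TC - 1 is computed in wrapping 8-bit arithmetic.
std::array<bool, VF> maskULE(uint8_t Base, uint8_t TC) {
  const uint8_t BTC = static_cast<uint8_t>(TC - 1); // 0 - 1 wraps to 255.
  std::array<bool, VF> Mask{};
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = static_cast<uint8_t>(Base + Lane) <= BTC;
  return Mask;
}

// The naive "icmp ult IV, TC" form, shown only for contrast.
std::array<bool, VF> maskULT(uint8_t Base, uint8_t TC) {
  std::array<bool, VF> Mask{};
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = static_cast<uint8_t>(Base + Lane) < TC;
  return Mask;
}
```

For a trip count that does not wrap, such as 10, both forms agree: the iteration starting at index 8 enables lanes 0 and 1 and disables lanes 2 and 3.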

void VPInstruction::execute(VPTransformState &State) {		void VPInstruction::execute(VPTransformState &State) {
assert(!State.Instance && "VPInstruction executing an Instance");		assert(!State.Instance && "VPInstruction executing an Instance");
for (unsigned Part = 0; Part < State.UF; ++Part)		for (unsigned Part = 0; Part < State.UF; ++Part)
[9 lines not shown]
void VPInstruction::print(raw_ostream &O) const {		void VPInstruction::print(raw_ostream &O) const {
printAsOperand(O);		printAsOperand(O);
O << " = ";		O << " = ";

switch (getOpcode()) {		switch (getOpcode()) {
case VPInstruction::Not:		case VPInstruction::Not:
O << "not";		O << "not";
break;		break;
		case VPInstruction::ICmpULE:
		O << "icmp ule";
		break;
default:		default:
O << Instruction::getOpcodeName(getOpcode());		O << Instruction::getOpcodeName(getOpcode());
}		}

for (const VPValue *Operand : operands()) {		for (const VPValue *Operand : operands()) {
O << " ";		O << " ";
Operand->printAsOperand(O);		Operand->printAsOperand(O);
}		}
}		}

/// Generate the code inside the body of the vectorized loop. Assumes a single		/// Generate the code inside the body of the vectorized loop. Assumes a single
/// LoopVectorBody basic-block was created for this. Introduce additional		/// LoopVectorBody basic-block was created for this. Introduce additional
/// basic-blocks as needed, and fill them all.		/// basic-blocks as needed, and fill them all.
void VPlan::execute(VPTransformState *State) {		void VPlan::execute(VPTransformState *State) {
		// -1. Check if the backedge taken count is needed, and if so build it.
		if (BackedgeTakenCount->getNumUsers()) {
		Value *TC = State->TripCount;
		IRBuilder<> Builder(State->CFG.PrevBB->getTerminator());
		auto *TCMO = Builder.CreateSub(TC, ConstantInt::get(TC->getType(), 1),
		"trip.count.minus.1");
		Value2VPValue[TCMO] = BackedgeTakenCount;
		}

// 0. Set the reverse mapping from VPValues to Values for code generation.		// 0. Set the reverse mapping from VPValues to Values for code generation.
for (auto &Entry : Value2VPValue)		for (auto &Entry : Value2VPValue)
State->VPValue2Value[Entry.second] = Entry.first;		State->VPValue2Value[Entry.second] = Entry.first;

BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;		BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;
BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();		BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();
assert(VectorHeaderBB && "Loop preheader does not have a single successor.");		assert(VectorHeaderBB && "Loop preheader does not have a single successor.");
BasicBlock *VectorLatchBB = VectorHeaderBB;		BasicBlock *VectorLatchBB = VectorHeaderBB;
[88 lines not shown]

void VPlanPrinter::dump() {		void VPlanPrinter::dump() {
Depth = 1;		Depth = 1;
bumpIndent(0);		bumpIndent(0);
OS << "digraph VPlan {\n";		OS << "digraph VPlan {\n";
OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";		OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
if (!Plan.getName().empty())		if (!Plan.getName().empty())
OS << "\\n" << DOT::EscapeString(Plan.getName());		OS << "\\n" << DOT::EscapeString(Plan.getName());
		OS << ", where: \\n" << *Plan.getBackedgeTakenCount()
		<< " := BackedgeTakenCount";
if (!Plan.Value2VPValue.empty()) {		if (!Plan.Value2VPValue.empty()) {
OS << ", where:";
for (auto Entry : Plan.Value2VPValue) {		for (auto Entry : Plan.Value2VPValue) {
OS << "\\n" << *Entry.second;		OS << "\\n" << *Entry.second;
OS << DOT::EscapeString(" := ");		OS << DOT::EscapeString(" := ");
Entry.first->printAsOperand(OS, false);		Entry.first->printAsOperand(OS, false);
}		}
}		}
OS << "\"]\n";		OS << "\"]\n";
OS << "node [shape=rect, fontname=Courier, fontsize=30]\n";		OS << "node [shape=rect, fontname=Courier, fontsize=30]\n";
[177 lines not shown]

test/Transforms/LoopVectorize/X86/optsize.ll

				; This test verifies that the loop vectorizer WILL vectorize WITHOUT producing
				; a tail loop, with the optimize for size or the minimize size attributes,
				; using masking to fold the tail.
				; RUN: opt < %s -loop-vectorize -S -mtriple=x86_64-apple-darwin -mcpu=skx | FileCheck %s

				hsaito: Is the test really dependent on the apple triple?
				Ayal (author, marked Done): -mtriple=x86_64-unknown-linux works just as well.
				target datalayout = "E-m:e-p:32:32-i64:32-f64:32:64-a:0:32-n32-S128"

				@tab = common global [32 x i8] zeroinitializer, align 1

				define i32 @foo_optsize() #0 {
				; CHECK-LABEL: @foo_optsize(
				; CHECK: x i8>
				reames: Testing-wise, expanding out the IR generated with update-lit-checks, landing the tests without the changes, and then rebasing on top would make it much easier to follow the transform being described for those of us who are not already experts in the vectorizer code structures. I get that you're following existing practice, but this might be one of the cases that justify changing existing practice in the area. :)
				Ayal (author): Agreed. The original target-independent version of optsize.ll still passes, BTW (i.e., fails to vectorize), but due to cost-model considerations rather than scalar tail considerations.
				Ayal (author): Expanded IR CHECKs have been added for cases that should get vectorized. For cases that should not, it suffices to check that no vector is formed.

				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
				%0 = load i8, i8* %arrayidx, align 1
				%cmp1 = icmp eq i8 %0, 0
				%. = select i1 %cmp1, i8 2, i8 1
				store i8 %., i8* %arrayidx, align 1
				%inc = add nsw i32 %i.08, 1
				%exitcond = icmp eq i32 %i.08, 202
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret i32 0
				}

				attributes #0 = { optsize }

				define i32 @foo_minsize() #1 {
				; CHECK-LABEL: @foo_minsize(
				; CHECK: x i8>

				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
				%0 = load i8, i8* %arrayidx, align 1
				%cmp1 = icmp eq i8 %0, 0
				%. = select i1 %cmp1, i8 2, i8 1
				store i8 %., i8* %arrayidx, align 1
				%inc = add nsw i32 %i.08, 1
				%exitcond = icmp eq i32 %i.08, 202
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret i32 0
				}

				attributes #1 = { minsize }

test/Transforms/LoopVectorize/X86/small-size.ll

[40 lines not shown]	; <label>:1 ; preds = %1, %0
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, 256		%exitcond = icmp eq i32 %lftr.wideiv, 256
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

; Can't vectorize in 'optsize' mode because we need a tail.		; We can vectorize the first loop in 'optsize' mode by masking its tail.
		; Can't vectorize the second loop because it has no primary induction.
;CHECK-LABEL: @example2(		;CHECK-LABEL: @example2(
;CHECK-NOT: store <4 x i32>		;CHECK: <4 x i32>
		;CHECK: middle.block
		;CHECK-NOT: <4 x i32>
;CHECK: ret void		;CHECK: ret void
define void @example2(i32 %n, i32 %x) optsize {		define void @example2(i32 %n, i32 %x) optsize {
%1 = icmp sgt i32 %n, 0		%1 = icmp sgt i32 %n, 0
br i1 %1, label %.lr.ph5, label %.preheader		br i1 %1, label %.lr.ph5, label %.preheader

..preheader_crit_edge: ; preds = %.lr.ph5		..preheader_crit_edge: ; preds = %.lr.ph5
%phitmp = sext i32 %n to i64		%phitmp = sext i32 %n to i64
br label %.preheader		br label %.preheader
[26 lines not shown]	.lr.ph: ; preds = %.preheader, %.lr.ph
%indvars.iv.next = add i64 %indvars.iv, 1		%indvars.iv.next = add i64 %indvars.iv, 1
%11 = icmp eq i32 %4, 0		%11 = icmp eq i32 %4, 0
br i1 %11, label %._crit_edge, label %.lr.ph		br i1 %11, label %._crit_edge, label %.lr.ph

._crit_edge: ; preds = %.lr.ph, %.preheader		._crit_edge: ; preds = %.lr.ph, %.preheader
ret void		ret void
}		}

; N is unknown, we need a tail. Can't vectorize.		; N is unknown, we need a tail. Can't vectorize by masking it because the loop
		; has no primary induction.
;CHECK-LABEL: @example3(		;CHECK-LABEL: @example3(
;CHECK-NOT: <4 x i32>		;CHECK-NOT: <4 x i32>
;CHECK: ret void		;CHECK: ret void
define void @example3(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) optsize {		define void @example3(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) optsize {
%1 = icmp eq i32 %n, 0		%1 = icmp eq i32 %n, 0
br i1 %1, label %._crit_edge, label %.lr.ph		br i1 %1, label %._crit_edge, label %.lr.ph

.lr.ph: ; preds = %0, %.lr.ph		.lr.ph: ; preds = %0, %.lr.ph
[58 lines not shown]	; <label>:1 ; preds = %1, %0
%7 = add nsw i32 %i.02, 1		%7 = add nsw i32 %i.02, 1
%exitcond = icmp eq i32 %7, 256		%exitcond = icmp eq i32 %7, 256
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

		; We CAN vectorize this example by folding the tail using masking.
		;CHECK-LABEL: @example23c(
		;CHECK: <4 x i32>
		;CHECK: ret void
		define void @example23c(i16* noalias nocapture %src, i32* noalias nocapture %dst) optsize {
		br label %1

		; <label>:1 ; preds = %1, %0
		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]
		%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]
		%i.02 = phi i64 [ 0, %0 ], [ %7, %1 ]
		%2 = getelementptr inbounds i16, i16* %.04, i64 1
		%3 = load i16, i16* %.04, align 2
		%4 = zext i16 %3 to i32
		%5 = shl nuw nsw i32 %4, 7
		%6 = getelementptr inbounds i32, i32* %.013, i64 1
		store i32 %5, i32* %.013, align 4
		%7 = add nsw i64 %i.02, 1
		%exitcond = icmp eq i64 %7, 257
		br i1 %exitcond, label %8, label %1

		; <label>:8 ; preds = %1
		ret void
		}
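
		As a worked illustration of what folding the tail means for `@example23c`: the loop runs 257 iterations, and the CHECK line expects `<4 x i32>`, i.e. VF = 4. The helper names below are hypothetical, and the arithmetic assumes a non-zero trip count and no overflow in `TC + VF - 1`.

```cpp
#include <cstdint>

// Trip count of the vector loop: TC rounded up to a multiple of VF.
constexpr uint64_t roundedTripCount(uint64_t TC, uint64_t VF) {
  return (TC + VF - 1) / VF * VF;
}

// Number of iterations the masked vector loop executes.
constexpr uint64_t vectorIterations(uint64_t TC, uint64_t VF) {
  return roundedTripCount(TC, VF) / VF;
}

// Lanes whose mask bit is set in the final vector iteration.
constexpr uint64_t activeLanesInLastIteration(uint64_t TC, uint64_t VF) {
  return TC % VF == 0 ? VF : TC % VF;
}

static_assert(roundedTripCount(257, 4) == 260, "257 rounds up to 260");
static_assert(vectorIterations(257, 4) == 65, "65 masked vector iterations");
static_assert(activeLanesInLastIteration(257, 4) == 1, "one live lane at the end");
```

		So the masked vector loop runs 65 times with no scalar remainder, and only one of the four lanes is enabled in the last iteration.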

		; We CAN'T vectorize this example because an induction variable is used outside
		; the loop.
		;CHECK-LABEL: @example23d(
		;CHECK-NOT: <4 x i32>
		;CHECK: ret i64
		define i64 @example23d(i16* noalias nocapture %src, i32* noalias nocapture %dst) optsize {
		br label %1

		; <label>:1 ; preds = %1, %0
		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]
		%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]
		%i.02 = phi i64 [ 0, %0 ], [ %7, %1 ]
		%2 = getelementptr inbounds i16, i16* %.04, i64 1
		%3 = load i16, i16* %.04, align 2
		%4 = zext i16 %3 to i32
		%5 = shl nuw nsw i32 %4, 7
		%6 = getelementptr inbounds i32, i32* %.013, i64 1
		store i32 %5, i32* %.013, align 4
		%7 = add nsw i64 %i.02, 1
		%exitcond = icmp eq i64 %7, 257
		br i1 %exitcond, label %8, label %1

		; <label>:8 ; preds = %1
		ret i64 %7
		}

Diff 159798