This is an archive of the discontinued LLVM Phabricator instance.

[LV] Vectorizing loops of arbitrary trip count without remainder under opt for size
ClosedPublic

Authored by Ayal on Aug 8 2018, 3:11 PM.

Details

Summary

When optimizing for size, a loop is vectorized only if the resulting vector loop completely replaces the original scalar loop. This holds if no runtime guards are needed, if the original trip-count TC does not overflow, if TC is a known constant and if TC is a multiple of the VF. Targets with efficient vector masking can thereby overcome the last three TC-related conditions: see “Direction #1” in [[ http://lists.llvm.org/pipermail/llvm-dev/2018-August/125042.html | [llvm-dev] Vectorizing remainder loop ]] - this patch applies that transformation of setting the trip-count of the vector loop to be TC rounded-up to a multiple of VF while masking the vector body under a newly introduced "if (i < TC)" condition; or rather "if (i <= TC-1)" to overcome the aforementioned overflow hazard.

The patch allows loops with arbitrary trip counts to be vectorized under -Os, subject to the existing cost model considerations. It also applies to loops with small trip counts (under -O2) which are currently handled as if under -Os.

Handling loops with reductions and live-outs are marked as TODOs for subsequent extensions.

Diff Detail

Repository
rL LLVM

Event Timeline

Ayal created this revision.Aug 8 2018, 3:11 PM
hsaito added inline comments.Aug 8 2018, 5:15 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2663 ↗(On Diff #159798)

I think there is a danger in assuming UF being a power of two. Granted that there may be other parts of LV already assuming it, I still wouldn't like to see any more of those being added. If new code is assuming power of two UF, it's best if we ensure that is really the case (e.g., when foldTailByMansking() is true, assert that UF is power of two). Better yet, check VF*UF is power of two here since that's the assumption this code has.

2673 ↗(On Diff #159798)

This Urem creation should be skipped if we aren't generating remainder.

4977 ↗(On Diff #159798)

I think we need to add

if (TC==0) { emit one kind of remark }
else { emit another kind of remark }

here ---- in order to match previous capability.

Ayal added inline comments.Aug 11 2018, 2:06 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2663 ↗(On Diff #159798)

OK, will add an assert that VF*UF is a power of 2 below under the if (Legal->foldTailByMasking()).

2673 ↗(On Diff #159798)

This Urem is also used to round N up to a multiple of Step, i.e., when we're not generating remainder.

4977 ↗(On Diff #159798)

OK, will retain the previous MissedAnalysis remarks here, in addition to the new ones supplied by canFoldTailByMasking().

Thanks, Ayal! Some comments below.
Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

Thanks,
Diego

lib/Transforms/Vectorize/LoopVectorize.cpp
4965 ↗(On Diff #159798)

I'm trying to understand the purpose of thsi check. Prevent masked vectorization if TC is lower than TinyTripCountInterleaveThreshold (i.e., 128)?. Should we use an independent threshold for this?

5218 ↗(On Diff #159798)

inwhich -> in which?

6355 ↗(On Diff #159798)

Just curious. Could we prevent the computation or interleave groups for these cases instead of doing a reset?

lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first?

1126 ↗(On Diff #159798)

Instead of using an "empty" VPValue to model the BTC, would it be possible to model the actual operations to compute the BTC? We would only need a sub, right?

Ayal added a comment.Aug 14 2018, 12:25 AM

Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

You should know better

lib/Transforms/Vectorize/LoopVectorize.cpp
4965 ↗(On Diff #159798)

Ah, this is wrong, good catch!
The original purpose (of TinyTripCountVectorThreshold rather than TinyTripCountInterleaveThreshold) was to prevent vectorization of loops with very short trip counts due to overheads. Later it was extended in r306803 to allow vectorization under OptForSize, as it implies that all iterations are concentrated inside the vector loop for more accurate cost estimation. This still holds when folding the tail by masking, so we should not bail out here.

5218 ↗(On Diff #159798)

ok

6355 ↗(On Diff #159798)

That would have been simpler indeed. But there's a subtle phase-ordering issue here: MaxVF=computeFeasibleMaxVF() uses tentative interleave groups to getSmallestAndWidestTypes(), and is then used in determining if the tail should be folded by masking (i.e., if TC is a multiple of MaxVF), in which case these groups will all be masked/invalid.

lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

An alternative of leveraging Instruction::ICmp opcode and existing ICmpInst subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):

+    // Introduce the early-exit compare IV <= BTC to form header block mask.
+    // This is used instead of IV < TC because TC may wrap, unlike BTC.
+    VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
+    VPValue *BTC = Plan->getBackedgeTakenCount();
+    Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
+    auto *ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
+    Plan->addDetachedValue(ICmp);
+    BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
     return BlockMaskCache[BB] = BlockMask;

and then have VPInstruction::generateInstruction() do

+  case Instruction::ICmp: {
+    Value *IV = State.get(getOperand(0), Part);
+    Value *TC = State.get(getOperand(1), Part);
+    auto *ICmp = cast<ICmpInst>(getUnderlyingValue());
+    Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
+    State.set(this, V, Part);
+    break;
+  }

where VPlan::addDetachedValue() is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to UndefValue's.

1126 ↗(On Diff #159798)

The BTC is computed by subtracting 1 from the Trip Count, which in turn is generated by SCEVExpander. To model this decrement would require using an "empty" VPValue to model its Trip Count operand. In any case, both involve scalar instructions that take place before the vectorized loop, currently outside the VPlan'd zone.

reames added a subscriber: reames.Aug 14 2018, 3:20 PM

I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {

if (unlikely(i > M)) 
   break;
sum += a[i];

}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?

Secondly, are there any plans to enable this approach for anything other than optsize?

test/Transforms/LoopVectorize/X86/optsize.ll
12 ↗(On Diff #159798)

Testing wise, expanding out the IR generated w/update-lit-checks and landing the tests without the changes and then rebasing on top would make it much easier to follow the transform being described for those us not already expert in the vectorizer code structures. I get that your following existing practice, but this might be one of the cases which justify changing existing practice in the area. :)

hsaito added inline comments.Aug 14 2018, 3:23 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2673 ↗(On Diff #159798)

Ouch. Well, given the assertion for VF*UF being power of two (constant), the UREM and other computation should be reasonably optimizable downstream. So, it's probably unfair to ask you to fix the trip count computation ---- so, I won't ask. There is a trade off between generating more optimal output IR and the cost of maintaining the code to do that. Keeping UREM here is opting for lower maintenance. Just for the record.

reames added inline comments.Aug 14 2018, 3:25 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4948 ↗(On Diff #159798)

There's a mix of seemingly unrelated changes here. This is one example. It would be good to land these separately.

I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {

if (unlikely(i > M)) 
   break;
sum += a[i];

}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?

Short answer is No.

From vectorizer perspective, mechanics is quite different. In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit", to handle this situation in somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it. There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code
do the unfinished work and then break. ICC's simd early_exit implements the latter. Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Having said that, making VPlan more powerful (like adding a new IF) certainly help lead to the ability to model early_exit situation within the VPlan eventually. From that perspective, it's a baby step forward.

From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

Secondly, are there any plans to enable this approach for anything other than optsize?

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream comes true ---- smaller code, faster compile, and faster execution. Looking forward to hear such a great news.

hsaito added inline comments.Aug 14 2018, 5:00 PM
lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator?

The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that.

My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?

1126 ↗(On Diff #159798)

I'm not a big fan of allocating memory that goes unused in many situations. We can initialize this to nullptr, and create an instance once we know BTC is needed. That'll lose the convenience of being able to check NumUsers, but creating needsBackedgeTakenCount() member function shouldn't be that bad. It's just Legal->foldTailByMasking(), until something else needs BTC, right?

dcaballe added inline comments.Aug 14 2018, 6:33 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6355 ↗(On Diff #159798)

Thanks!

lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places.

My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?

The closer the class hierarchies are, the easier will be.

hsaito added inline comments.Aug 15 2018, 11:20 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
4948 ↗(On Diff #159798)

This change is relevant in the sense that TC < 2 is split into two parts: TC==1 and TC==0. TC==0 case will then have a chance of hitting Legal->canFoldTailByMasking() later. As a result, TC==1 case can return early here, with a very crisp messaging.

Having said that, if you'd like to see the same ORE->emit(...) LLVM_DEBUG() stuff here, I won't go against that. Messaging change can be a separate commit.

Ayal, we need ORE->emit() here, in addition to LLVM_DEBUG(), right, regardless of whether we change the actual message?

I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {

if (unlikely(i > M)) 
   break;
sum += a[i];

}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?

Short answer is No.

From vectorizer perspective, mechanics is quite different.

Ok, I think we're talking past each other a bit. I see these both as forms of predication. It sounds like you have a slightly different view; I'll try to ask clarifying questions in the right spots. I think we have different mental models here and I'm trying to understand where that difference is.

In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit", to handle this situation in somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it.

I'm unfamiliar with this pragma, but the best reference I found was https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-simd-directive-openmp-api

From what I can tell, this provides user guarantees of a couple of legality checks and profitability checks. I don't know enough about openmp to completely follow all the wording, but the key bit appears to be this:
"Each operation before the last lexical early exit of the loop may be executed as if the early exit were not triggered within the SIMD chunk."

We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve. There are two obvious approaches: speculation safety and predication. Unless I'm misreading this patch, it has the same problem and uses predication right?

There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code
do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right? If there's not, then we don't need to distinguish which iteration "caused" the exit. This is actually an interesting and useful subcase for me.

Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I missing something, this is a restatement of the above right?

I agree that cases like a[i] >0 are the hard ones. Other examples are things like i < M for loop invariant M. Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

I am very specifically not interested in the language extension aspects. I'm specifically asking about doing the transform for unannotated C code. (i.e. having to prove all the legality the hard way)

Secondly, are there any plans to enable this approach for anything other than optsize?

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream comes true ---- smaller code, faster compile, and faster execution. Looking forward to hear such a great news.

I take it you don't see AVX512 as qualifying? Not surprised, but I'd be curious to hear your reasoning. You might be coming at this from a different angle than I am

Ayal added a comment.Aug 15 2018, 12:54 PM

I have a general question about direction, not specific to this patch.

It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
for (int i = 0; i < N; i++) {

if (unlikely(i > M)) 
   break;
sum += a[i];

}

Can the infrastructure in this patch be generalized to handle such cases? And if so, are their any specific plans to do so?

Good question! Replacing the break with a continue vectorizes just fine and produces the same result, albeit spinning uselessly for the last N-M iterations. Dealing with such "breaks" directly deserves more thought :-). In general it's probably better to fold such two upper bounds into one = min(N,M+1), producing a countable unpredicated loop. This is a known optimization for OpenCL1.x kernels, often guarded with "if (get_global_id(0) > M) continue;" due to work_group size constraints, when compiled for CPU.

Secondly, are there any plans to enable this approach for anything other than optsize?

We could, for example, consider enabling it under -O2 for loops whose entire (or nearly entire) body is already conditional; e.g.,

for (int i = 0; i < N; i++) {
  if (i*i % 4 != 2) {
    <loop body>
  }
}

otherwise the overhead of predicating code that could otherwise run unpredicated may be detrimental.

lib/Transforms/Vectorize/LoopVectorize.cpp
2673 ↗(On Diff #159798)

Rounding N down to a multiple of Step is in general N-(N%Step). If Step is a constant multiple of two (which is currently always the case, and must be the case when folding the tail by masking), it gets optimized downstream to N&(-Step). If Step would be some other constant it may get optimized downstream to use multiplication instead of division, depending on target characteristics. In any case, this takes place before the loop; and is orthogonal to this patch, which simply reuses the existing logic to also round up.

4948 ↗(On Diff #159798)

Yes, this change is unrelated and should land separately. The original ORE message is wrong. Not sure the TC==1 qualifies for any ORE message - "loops" with a known trip count of one are simply irrelevant for vectorization; though we could vectorize them with a mask...

4965 ↗(On Diff #159798)

This BTW is caught by vect.omp.force.small-tc.ll; but the -vectorizer-min-trip-count=21 flag it uses is external to OpenMP, afaik.

lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

Extensions of VPInstructions such as VPCmpInst should indeed be uploaded for review and deserve a separate discussion thread and justification. This patch could tentatively make use of it, though for the purpose of this patch an ICmpULE opcode or a detached ICmpInst suffice. An ICmpULE opcode shouldn't be problematic currently, as this early-exit is the only VPInstruction compare with a Predicate, right? Note that detached UnderlyingValues could serve as data containers for all fields already implemented in the IR hierarchy, and could be constructed at any point of VPlan construction for that purpose. Extending VPInstructions to provide a similar API as that of IR Instructions seems to be an orthogonal concern with its own design objectives, and can coexist with detached Values; e.g., a VPCmpInst could hold its Predicate using a detached ICmpInst/FCmpInst.

1126 ↗(On Diff #159798)

OK. The VPValue can be created on demand, turning getBackedgeTakenCount() into getOrCreateBackedgeTakenCount(). NumUsers should still be checked, as this isolates the decision of creating the IR based on the VPlan.
In any case, VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant.

test/Transforms/LoopVectorize/X86/optsize.ll
12 ↗(On Diff #159798)

Agreed. The original target-independent version of optsize.ll still passes, BTW, (i.e., fails to vectorize), but due to cost-model considerations rather than scalar tail considerations.

We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve. There are two obvious approaches: speculation safety and predication. Unless I'm misreading this patch, it has the same problem and uses predication right?

In this particular case, we don't get much of speculation. If you call computing loop index beyond the original upper bound as speculation (and use it in compare), it is, but we know there aren't any safety issues. In your case, what really matters is inside "unlikely(i > M)". If that's just trivial "i > M" (or something that can be converted in that form), we are better off simply changing the loop upper bound and do so prior to hitting the vectorizer. Then, this patch will take care of it. If not (i.e., general compute_some_predicate_value_based_on(i)) the whole speculation safety issue comes up and that's the difficult part to deal with and this patch doesn't deal with any aspect of it.

There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code
do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right? If there's not, then we don't need to distinguish which iteration "caused" the exit. This is actually an interesting and useful subcase for me.

I don't know what you mean by "a use of a variable from within the loop down the early exit path". Assume cond becomes true within a vector chunk (say, elem#2), you have to execute B for all prior iters (i.e., elem#0 and #1),
and execute A for elem #2.

for (i){
   if (cond){
       A
       break;
   }
   B
}

Assuming that B is lexically below (note: this is vectorization, as such, you need to have some lexical ordering assumption somewhere) all the early exit points, it can be non-speculatively executed under proper predication.
This kind of predication, however, has nothing to do with this patch. General IF-THEN-ELSE and GOTO based control flow needs the same kind of predication.

Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I missing something, this is a restatement of the above right?

Sure ---- but unless you are talking about trivial (i.e., not very interesting) "early exit" stuff, how to deal with speculation is the most important aspect of vectorizer's early exit handling.

Other examples are things like i < M for loop invariant M. Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

Sure, but that's not very interesting from vectorization perspective. Vectorizer doesn't have to do what other loop transformation can handle.

I am very specifically not interested in the language extension aspects. I'm specifically asking about doing the transform for unannotated C code. (i.e. having to prove all the legality the hard way)

ICC is doing it. So, let us know if anyone is volunteering before we do so that we can share our learning. It's an important aspect of vectorization but not yet high enough on our priority list. So, we aren't immediately jumping on to it.

If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream comes true ---- smaller code, faster compile, and faster execution. Looking forward to hear such a great news.

I take it you don't see AVX512 as qualifying?

Qualifying to what?

If your question is whether ICC uses the masked main vector code for AVX512, other than OptForSize case, then the answer is yes it does.

It's a combination of HW and SW. If you know the trip count as a compile time constant, you can evaluate various different ways to vectorize and decide the best one, much better than when you don't know the trip count. The legacy part of LV isn't set up to do such an evaluation. VPlan native part of LV would eventually have such a capability. W/o this capability, we need to go one way or the other rather blindly --- and blindly changing the status quo requires a pretty good justification (like brilliantly fast masked vector execution unit). I'm more interested in doing the evaluation when VPlan native path is ready to do that.

Not surprised, but I'd be curious to hear your reasoning. You might be coming at this from a different angle than I am

If the trip count is unknown, the best AVX512 vectorization strategy so far is go with unmasked (at the top-level) vector main loop. Underlying assumption is that unmasked vector main loop is faster than the masked vector main loop, and a lot of time is spent in executing main vector loop. If such an assumption does not hold, like main vector code isn't executed a lot, programmers should try to communicate the trip count estimation to the compiler so that the compiler can do a better job. As the HW narrows the gap between the two, optimization point moves. We have to evaluate every generation of HW and see what works the best. So, my comment applies to today's HW. I don't know what ARM SVE folks would say for their HW.

Does this make sense to you?

hsaito added inline comments.Aug 15 2018, 2:26 PM
lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

I go against detached ICmpInst. We'll be moving VPlan modeling before the cost model and creating an IR Instruction before deciding to vectorize is against the VPlan concept.

seems to be an orthogonal concern with its own design objectives

Not quite. We'd like VPInstruction as easy to use to many LLVM developers and that is an integral part of design/implementation from the beginning.

Having said that, new opcode versus VPCmpInst doesn't block the rest of the review. Other parts of the review should proceed while opcode versus VPCmpInst discussion is in progress on the side.

1126 ↗(On Diff #159798)

VPlan in general is a tentative construct, destined for destruction w/o being materialized except for the >BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always >used, seems insignificant.

VPlan footprint was part of the community concern. We'd like to be better wherever we can. Just as simple as that. Thanks for taking care of it.

dcaballe added inline comments.Aug 15 2018, 5:18 PM
lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

I created D50823 with the VPCmpInst sub-class so that we can make a decision with the code in place.

hsaito added inline comments.Aug 16 2018, 12:37 PM
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485 ↗(On Diff #159798)

I think it's best not to keep this state in the Legal. From the Legal perspective, being able to vectorize the whole loop body under the mask and the actual decision to do so are completely separate issues.

Since canFold...() is invoked by CostModel::computeMaxVF, we should be able to keep this state in the CostModel. After all, whether to bail out or continue under FoldTailByMasking is a cost model side of the state, after consulting the Legal.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792 ↗(On Diff #159798)

By moving FoldTail state to CostModel, we can define CostModel::blockNeedsPredication(BB) as FoldTailByMasking || LAI::blockNeedsPredication(BB) and make Legal version static to Legal.

lib/Transforms/Vectorize/LoopVectorize.cpp
2673 ↗(On Diff #159798)

orthogonal to this patch

I agree.

Ayal updated this revision to Diff 161564.Aug 20 2018, 3:00 PM

Addressed review comments.

New test X86/optsize.ll added and vect.omp.force.small-tc.ll augmented with CHECKs, both showing current behavior, to be uploaded separately before this patch. Test small-size.ll includes CHECKs that pass with this patch.

Ayal marked 4 inline comments as done.Aug 20 2018, 3:08 PM
Ayal added inline comments.
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485 ↗(On Diff #159798)

OK.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792 ↗(On Diff #159798)

OK, except that LAI::blockNeedsPredication() also asks for DT which CostModel does not have. Let's have CostModel::blockNeedsPredication(BB) return FoldTailByMasking || Legal::blockNeedsPredication(). Hopefully the two will not cause confusion.

Making Legal version static should be pursued in a separate patch, if desired.

lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

VPlans should indeed keep the existing IR intact w/o changing it, as they are tentative by design, and also by current implementation. But creating a detached IR Instruction, just for the purpose of holding its attributes, w/o connecting it to any User, Operand (except Undef's) or BasicBlock, is arguably keeping the existing IR intact. Doing so should be quite familiar to LLVM developers, avoids mirroring Instruction's class hierarchy or a subset thereof, and leverages the existing UnderlyingValue pointer that is unutilized by InnerLoopVectorizer. Next uploaded version provides this complete option.

Having said that, this patch can surely work with a VP(I)CmpInst just as well, as it merely needs a way for a single compare VPInstruction to hold a single Predicate, and print its name.

test/Transforms/LoopVectorize/X86/optsize.ll
12 ↗(On Diff #159798)

Expanded IR CHECKs have been added for cases that should get vectorized. For cases that should not, suffice to check that no vector is formed.

dcaballe added inline comments.Aug 20 2018, 3:43 PM
lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

I understand your point, Ayal. However, using UnderlyingValue as a pointer to the actual input IR in the VPlan native path and as a pointer to a detached IR Value in the inner loop path is very likely to be problematic, even in the short term. We would have to special case the code that is shared for both paths to treat the UnderlyingValue differently. The detached IR special semantics in the inner loop path would also make a bit more complicated the convergence of both paths. If there are no major concerns regarding the VPCmpInst, I'd prefer going with that approach.

hsaito added inline comments.Aug 20 2018, 3:53 PM
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
485 ↗(On Diff #159798)

Thank you.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
792 ↗(On Diff #159798)

Thanks, and fair enough.

lib/Transforms/Vectorize/LoopVectorize.cpp
2748 ↗(On Diff #161564)

Personally, I don't like to see the IR like the following going out of the vectorizer, even though that's later cleaned up tirivially.

%1 = false       // unused and thus will be trivially cleaned up later.
%2 = icmp ...

Changing this part of the patch to

Value *CheckMinIters = nullptr;
if ()
    ....
else
    CheckMinIters = Builder.getFalse();

would make cleaner IR going out for common cases, at a small price to pay in ease-of-reading.

If you agree, great. If not, I won't make a big deal about it. At the end of the day, we should clean up this area of code such that we don't have to rely on CheckMinIters being "false" constant to cleanup the unnecessary min iter check. That improvement can be done as a separate NFC patch.

2990 ↗(On Diff #161564)

See the comment on CheckMinIters.

For me, the only major issue left is the detached IR instruction. @dcaballe, please try adding the reviewers/subscribers of D50480 to D50823, in the hopes of getting a quicker resolution there, so as not to block D50480 because of that. I will not oppose to D50480 for introducing new ULE opcode of VPInstruction (design/implementation choice within VPlan concept), but I will strongly oppose for the use of detahced IR instruction (goes against VPlan concept).

It's certainly nicer if @Ayal, @dcaballe, and others can agree on VPCmpInst or not quickly enough. I vote in favor of VPCmpInst.

Thanks,
Hideki

lib/Transforms/Vectorize/LoopVectorize.cpp
4948 ↗(On Diff #159798)

I think every non-vectorized loop that goes through vectorizer's analysis qualifies for ORE. After all, TC==1 knowledge may or may not be available to the programmer otherwise.

4977 ↗(On Diff #159798)

Thank you.

lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

Detached IR instruction is detrimental to VPlan direction. Please do not use it.

test/Transforms/LoopVectorize/X86/optsize.ll
4 ↗(On Diff #161564)

Is the test really dependent on the apple triple?

Ayal added inline comments.Aug 22 2018, 5:38 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2748 ↗(On Diff #161564)

One could change this part of the patch to create an unconditional branch instead of a conditional one from BB to NewBB; or avoid creating NewBB / calling emitMinimumIterationCountCheck() altogether if (foldTailByMasking()). Both alternatives will change the dominance structure and thus require special attention when updating DT in updateAnalysis(). The latter would also need to record the EntryBlock for cases where LoopBypassBlocks remains empty.

It's simpler to keep the existing skeletal structure intact, and rely on subsequent trivial dce cleanup.

If desired, such alternatives should be proposed as a separate follow-up NFC patch.

2990 ↗(On Diff #161564)

ditto.

4948 ↗(On Diff #159798)

ok

test/Transforms/LoopVectorize/X86/optsize.ll
4 ↗(On Diff #161564)

-mtriple=x86_64-unknown-linux works just as well

Ayal updated this revision to Diff 161965.Aug 22 2018, 8:39 AM

Addressing review comments, rebased, added a couple of asserts.

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

The TC==1 part and preliminary CHECK completion of tests are to be uploaded first.

Ayal marked 2 inline comments as done.Aug 22 2018, 9:17 AM
Ayal added inline comments.
lib/Transforms/Vectorize/VPlan.h
609 ↗(On Diff #159798)

Would be good to clarify the aforementioned discrepancy between VPlan native's use of input IR and the proposed use of detached IR; both should presumably model defs, uses and basic-block ownerships in VPlan rather than the IR Instruction, so the latter can merely be used for storing internal properties, for both paths alike. BTW, SROA.cpp and StraightLineStrengthReduce.cpp, e.g., also make use of detached Instructions. Would also be good to explain why detached Instructions are considered detrimental or what concept of VPlan they allegedly violate, given that their existence keeps the original IR intact.

But let's keep this patch out of that discussion, and have it use an ICmpULE extended opcode as originally proposed and reloaded. After all, it plays a very small part in this patch, and can be easily revised later as needed.

Let's give @dcaballe one more day to try getting some traction on D50823. Fair enough to both of you (and others who might be interested)?

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

Since the VPCmpInst code is ready (D50823) and this is a clear use case where we need to model a new compare (including its predicate) that is not in the input IR, I'd appreciate if we could discuss a bit more about using the VPCmpInst approach. At least, I'd like to understand what are the concerns about the VPCmpInst approach and what other people think.

I do have concerns regarding modeling ICmpULE as an opcode only for compare instructions newly created during a VPlan-to-VPlan transformation. For example:

  1. Inconsistent modeling of compare instructions in the VPlan native path. Compare instructions in the input IR will be modeled as VPInstructions with a Instruction::ICmpInst/Instruction::FCmpInst opcode. New compare instructions will be modeled as VPInstructions with predicates as opcodes (VPInstruction::ICmpULE, for now). We'd have to compare the opcode against Instruction::ICmpInst, Instruction::ICmpInst, VPInstruction::ICmpULE and any future predicate opcode to know that a VPInstruction is a comparison. Similar inconsistency to get information about the compare predicate.
  1. Adding ICmpULE as an opcode is paving the way to adding more predicates as opcodes in VPInstruction in the short term. Where would the limit be? Do we want to model the around 30 predicates currently in LLVM CmpInst as opcodes?
  1. The ICmpULE approach may also be detrimental for the Instruction/VPInstruction templatization that we planned to explore.

If these points and the fact that VPCmpInst code is ready to go don't convince you, there isn't much else I can do :). I know this compare representation may sound insignificant but I'm well aware of how painful things can turn when things are built on top of "insignificant decisions" that have to be changed later on. If the problem with VPCmpInst is to rebase this patch on top of D50823, I'm perfectly fine with introducing D50823 after this patch goes in. However, if there are any other concerns regarding the VPCmpInst sub-class, it would be better to know them now. I'd prefer not to keep the ICmpULE opcode representation for a long time.

Thanks,
Diego

hsaito accepted this revision.Aug 23 2018, 2:51 PM

Under the assumption that the acceptance of this patch is not a conscious choice between new CmpULE VPInstruction opcode versus VPCmpInst derivation (whose discussion should continue in D50823 or its follow on), I think this patch is ready to land. LGTM.

This revision is now accepted and ready to land.Aug 23 2018, 2:51 PM
Ayal added a comment.Aug 26 2018, 7:21 AM

Reverted to use the original ICmpULE extended opcode instead of detached ICmpInst. This can be revised quite easily once VPInstructions acquire any other form of modeling compares.

Since the VPCmpInst code is ready (D50823) and this is a clear use case where we need to model a new compare (including its predicate) that is not in the input IR, I'd appreciate if we could discuss a bit more about using the VPCmpInst approach. At least, I'd like to understand what are the concerns about the VPCmpInst approach and what other people think.

I do have concerns regarding modeling ICmpULE as an opcode only for compare instructions newly created during a VPlan-to-VPlan transformation. For example:

...

Under the assumption that the acceptance of this patch is not a conscious choice between new CmpULE VPInstruction opcode versus VPCmpInst derivation (whose discussion should continue in D50823 or its follow on), I think this patch is ready to land. LGTM.

This patch aims to model a rather special early-exit condition that restricts the execution of the entire loop body to certain iterations, rather than model general compare instructions. If preferred, an "EarlyExit" extended opcode can be introduced instead of the controversial ICmpULE. This should be easy to revisit in the future if needed.

This patch focuses on modeling an early-exit compare and then generating it, w/o making strategic design decisions supporting future vplan-to-vplan transformations, the interfaces they may need, potential templatization, or other long-term high-level VPlan concerns. These should be explained and discussed separately along with pros and cons of alternative solutions for supporting the desired interfaces and for holding their storage, including subclassing VPInstructions, using detached Instructions, or other possibilities.

This patch aims to model a rather special early-exit condition that restricts the execution of the entire loop body to certain iterations, rather than model general compare instructions. If preferred, an "EarlyExit" extended opcode can be introduced instead of the controversial ICmpULE. This should be easy to revisit in the future if needed.

This patch is fine as is, or rather much better with ICmpULE than EarlyExit.

This patch focuses on modeling an early-exit compare and then generating it, w/o making strategic design decisions supporting future vplan-to-vplan transformations, the interfaces they may need, potential templatization, or other long-term high-level VPlan concerns. These should be explained and discussed separately along with pros and cons of alternative solutions for supporting the desired interfaces and for holding their storage, including subclassing VPInstructions, using detached Instructions, or other possibilities.

Sure. I agree.

[Full disclosure] I have a big mental barrier in accepting your "early-exit" terminology here since I relate that term to "break out of the loop", but that's just the terminology difference. Nothing to do with the substance of this patch. [End of full disclosure]

This revision was automatically updated to reflect the committed changes.