This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Improve computation of scalarization overhead.
Needs Review · Public

Authored by jonpa on Oct 30 2018, 5:12 AM.

Details

Summary

I found loops whose vectorization cost was much overestimated when two instructions were both scalarized. Since one was using the result of the other, the defining instruction does not need to do the inserts, and the user does not have to extract any elements.

I experimented with this patch, which makes the LoopVectorizer collect any instructions that the target will scalarize (expand). This set is then used to find these cases and is (eventually) passed to getScalarizationOverhead(), which then returns a reduced value.
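The intended effect can be sketched roughly as follows (a pseudocode-level sketch against the LLVM C++ API; the helper name and parameters are hypothetical and do not reflect the actual patch):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"

// Hypothetical helper: the insert/extract overhead for one instruction that
// the target is known to expand into scalar operations. Inserts are skipped
// when every user is also expanded, and extracts are skipped for operands
// whose defining instruction is expanded as well.
static unsigned targetScalarizationOverhead(
    const llvm::Instruction *I, unsigned VF,
    const llvm::SmallPtrSetImpl<const llvm::Instruction *> &TargetExpanded,
    unsigned InsertCost, unsigned ExtractCost) {
  unsigned Overhead = 0;

  // Inserts for the result are only needed if some user still wants the
  // value as a vector.
  for (const llvm::User *U : I->users()) {
    const auto *UI = llvm::dyn_cast<llvm::Instruction>(U);
    if (!UI || !TargetExpanded.count(UI)) {
      Overhead += VF * InsertCost;
      break;
    }
  }

  // Extracts are only needed for operands that are produced as vector
  // values, i.e. whose defining instruction is not itself expanded.
  for (const llvm::Value *Op : I->operands())
    if (const auto *OpI = llvm::dyn_cast<llvm::Instruction>(Op))
      if (!TargetExpanded.count(OpI))
        Overhead += VF * ExtractCost;

  return Overhead;
}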

So far, in practice on SystemZ, this amounts to more float loops being vectorized, typically with the only benefit being vectorized memory operations. I am not sure how beneficial this is.

This should be easily usable by other targets as well, but so far it is SystemZ only.

Is this useful enough to include in the loop vectorizer?

Diff Detail

Event Timeline

jonpa created this revision. Oct 30 2018, 5:12 AM
nhaehnle removed a subscriber: nhaehnle. Nov 6 2018, 7:51 AM

I have found some more potential use for this:

  • Currently, a load is always widened if possible. However, if it is known that the user of the load will be scalarized by the target, I think it might be better to also scalarize the load rather than first loading the vector register and then extracting all the elements. This would be true if an extract costs more than 1, I think (see the rough comparison below this list).
  • In addition, if the target knows that both the load and its user will be scalarized, it can also consider folding the load as a memory operand into the user.
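As a rough back-of-the-envelope comparison for the first point (assuming, hypothetically, unit cost for both a scalar load and a widened vector load): widening costs about 1 + VF * C_extract once the users' extracts are counted, while VF scalar loads cost about VF, so scalarizing the load wins once an extract costs roughly 1 or more, in line with the guess above.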

So, please take a look anyone :-)

PS. A related subject here is the presence of loop-invariant operands, which will affect CodeGen. Did anyone try passing the Loop* or some kind of LoopInvariant flag for the operands?

jonpa added a comment. Nov 27 2018, 8:17 AM

ping!

Is this a fundamentally good idea, considering the future with VPlan, for instance?
Would this be acceptable if it would only affect the specific target(s) that desired this?

Sorry, I must have missed this review.

VPlan based cost modeling (plus VPlan based code motion) should naturally capture this kind of situation, but only to the extent that producer/consumer can reside in the same BB. It's taking a lot longer than I wanted to stabilize it (i.e., to compute exactly the same value as the existing cost model in LV). If you are interested in developing that area of the VPlan based cost model, I can clean up my workspace and upload what I have to Phabricator as a WIP patch.

When producer/consumer have to reside in separate BBs for some reason, the current recipe-based modeling (where a recipe resides in a BB) won't help much. As such, from a generalized implementation perspective, some kind of use-def based mapping (like the one used here) may be inevitable. So, from the technical aspect, we should discuss what the plausible scenarios are that force producer and consumer to reside in separate BBs, and whether those situations are rare enough to ignore.

typically with the only benefit being vectorized memory operations

I have mixed feelings here. ICC vectorizes relatively aggressively, and it has good enough reasons to do so. Having said that, it comes with the code size and compile time associated with aggressive vectorization (on top of vectorization not always being profitable in terms of execution time). So, if we are doing this, we should make it easy to tune (by the vectorizer developer as well as by the programmer using the compiler). This comment is not specific to this patch, though.

Sorry, I must have missed this review.

VPlan based cost modeling (plus VPlan based code motion) should naturally capture this kind of situation, but only to the extent that producer/consumer can reside in the same BB. It's taking a lot longer than I wanted to stabilize it (i.e., to compute exactly the same value as the existing cost model in LV).

Thanks for taking a look! IIUC, my patch is not useful since VPlan will soon improve this area without it.

I am curious as to how VPlan will accomplish this: will it also add some kind of check with TLI as to whether the instruction will be expanded, and propagate this information? Or is there some other way this may be accomplished?

Ayal added a comment.Nov 28 2018, 12:45 PM

Making better decisions about what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this is best done in a target-independent way; e.g., the way computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

Sorry, I must have missed this review.

VPlan based cost modeling (plus VPlan based code motion) should naturally capture this kind of situation, but only to the extent that producer/consumer can reside in the same BB. It's taking a lot longer than I wanted to stabilize it (i.e., to compute exactly the same value as the existing cost model in LV).

Thanks for taking a look! IIUC, my patch is not useful since VPlan will soon improve this area without it.

Not so quick.
The underlying supporting mechanism is VPReplicateRecipe, in VPlan.h. The parent of a VP*Recipe is VPBasicBlock. If both use and def belong to the same ReplicateRecipe, things are simple.
Your map-based query becomes a "do the instructions belong to the same Recipe" query. The question is, of course, can we always do that? If the answer is NO, then this approach has a hole that needs to be filled by some other means.

I am curious as to how VPlan will accomplish this: will it also add some kind of check with TLI as to whether the instruction will be expanded, and propagate this information? Or is there some other way this may be accomplished?

Recipes make instruction grouping (within a VPBasicBlock) easier to identify. If code motion across VPBasicBlocks is legal, we want to merge two or more ReplicateRecipes. In ideal cases, both use and def are in the same Recipe, so you don't need a map. You just ask for the cost of the ReplicateRecipe: scalar compute * VF for each instruction in the recipe + an extract for each live-in + an insert for each live-out. In the general case, however, use and def are in different ReplicateRecipes, so things aren't that simple. For each live-in, we should check whether it is computed in scalar form, and ditto for each live-out.
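For illustration (hypothetically assuming unit extract/insert and scalar-operation costs, and VF = 4): a recipe containing two chained scalarized instructions with one vector live-in and one vector live-out would cost 2 * 4 (scalar compute) + 4 (extracts for the live-in) + 4 (inserts for the live-out) = 16, with no extract/insert charged between the two instructions themselves.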

My question back to you is why Scalars is not good enough for your purpose. Do you get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()? If so, that's probably where you want to dig in.

/// Collect the instructions that are scalar after vectorization. An
/// instruction is scalar if it is known to be uniform or will be scalarized
/// during vectorization. Non-uniform scalarized instructions will be
/// represented by VF values in the vectorized loop, each corresponding to an
/// iteration of the original scalar loop.
void collectLoopScalars(unsigned VF);

As for the computation of the cost of a scalarized instruction:

I think the TTI-based per-instruction cost should fully account for extracts and inserts. I don't think we should change that.

What we need to change is the caller side. If the instruction is scalarized, its base cost should be the scalar execution cost * VF. We should then add an extract cost for each vector source (but only if not already accounted for at another scalar use site) and an insert cost if one or more of the uses is vector.

jonpa added a comment. Nov 29 2018, 2:08 AM

Making better decisions about what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this is best done in a target-independent way; e.g., the way computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

The only target-specific part I am thinking about is which instructions will later be expanded during *isel*.

My question back to you is why Scalars is not good enough for your purpose. Do you get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()?

My understanding is that currently the LoopVectorizer's notion of a scalarized instruction refers to an instruction scalarized at the *LLVM IR* level; in other words, the instructions it will itself emit in scalarized form. These are the instructions contained in Scalars[VF].

As an example, consider this loop:

define void @fun(i64 %NumIters, float* %Ptr1, float* %Ptr2, float* %Dst) {
entry:
  br label %for.body

for.body:
  %IV  = phi i64 [ 0, %entry ], [ %IVNext, %for.body ]
  %GEP1 = getelementptr inbounds float, float* %Ptr1, i64 %IV
  %LD1 = load float, float* %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr2, i64 %IV
  %LD2 = load float, float* %GEP2
  %mul = fmul float %LD1, %LD2
  %add = fadd float %mul, %LD2
  store float %add, float* %GEP1
  %IVNext = add nuw nsw i64 %IV, 1
  %exitcond = icmp eq i64 %IVNext, %NumIters
  br i1 %exitcond, label %exit, label %for.body

exit:
  ret void
}

This loop is interesting because on z13 vector float operations are not supported, so they are expanded during instruction selection to scalar instructions. If forced to vectorize (even though costs would normally prevent it) with

clang -S -o - -O3 -march=z13 tc_targscal.ll -mllvm -unroll-count=1 -mllvm -debug-only=loop-vectorize -mllvm -force-vector-width=4

, the loop vectorizer produces this loop:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %0 = add i64 %index, 0
  %1 = getelementptr inbounds float, float* %Ptr1, i64 %0
  %2 = getelementptr inbounds float, float* %1, i32 0
  %3 = bitcast float* %2 to <4 x float>*
  %wide.load = load <4 x float>, <4 x float>* %3, align 4, !alias.scope !0, !noalias !3
  %4 = getelementptr inbounds float, float* %Ptr2, i64 %0
  %5 = getelementptr inbounds float, float* %4, i32 0
  %6 = bitcast float* %5 to <4 x float>*
  %wide.load6 = load <4 x float>, <4 x float>* %6, align 4, !alias.scope !3
  %7 = fmul <4 x float> %wide.load, %wide.load6
  %8 = fadd <4 x float> %wide.load6, %7
  %9 = bitcast float* %2 to <4 x float>*
  store <4 x float> %8, <4 x float>* %9, align 4, !alias.scope !0, !noalias !3
  %index.next = add i64 %index, 4
  %10 = icmp eq i64 %index.next, %n.vec
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !5

The cost computation looked like:

LV: Found an estimated cost of 0 for VF 4 For instruction:   %IV = phi i64 [ 0, %entry ], [ %IVNext, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %GEP1 = getelementptr inbounds float, float* %Ptr1, i64 %IV
LV: Found an estimated cost of 1 for VF 4 For instruction:   %LD1 = load float, float* %GEP1, align 4
LV: Found an estimated cost of 0 for VF 4 For instruction:   %GEP2 = getelementptr inbounds float, float* %Ptr2, i64 %IV
LV: Found an estimated cost of 1 for VF 4 For instruction:   %LD2 = load float, float* %GEP2, align 4
LV: Found an estimated cost of 16 for VF 4 For instruction:   %mul = fmul float %LD1, %LD2
LV: Found an estimated cost of 16 for VF 4 For instruction:   %add = fadd float %LD2, %mul
LV: Found an estimated cost of 1 for VF 4 For instruction:   store float %add, float* %GEP1, align 4
LV: Found an estimated cost of 1 for VF 4 For instruction:   %IVNext = add nuw nsw i64 %IV, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond = icmp eq i64 %IVNext, %NumIters
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond, label %exit, label %for.body

So the costs for the vectorized float operations have been calculated by the target as 2x4 extracts + 4 mul/add + 4 inserts = 16. The loop vectorizer has produced vector instructions, and as far as it is concerned, that's what they are.

However, the assembly output looks like:

.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vl      %v2, 0(%r1,%r3)
        vl      %v3, 0(%r1,%r4)
        vrepf   %v0, %v3, 1
        vrepf   %v1, %v2, 1
        vrepf   %v4, %v2, 2
        vrepf   %v5, %v2, 3
        vrepf   %v6, %v3, 2
        vrepf   %v7, %v3, 3
        meebr   %f1, %f0
        meebr   %f2, %f3
        meebr   %f4, %f6
        meebr   %f5, %f7
        aebr    %f5, %f7
        aebr    %f4, %f6
        aebr    %f2, %f3
        aebr    %f1, %f0
        aghi    %r5, -4
        vmrhf   %v4, %v4, %v5
        vmrhf   %v0, %v2, %v1
        vmrhg   %v0, %v0, %v4
        vst     %v0, 0(%r1,%r3)
        la      %r1, 16(%r1)
        jne     .LBB0_4

, which is 2 vector loads + 6 extracts (fp element 0 overlaps the vector register and does not need an extract) + 4 fp multiplies (meebr) + 4 fp adds (aebr) + 3 inserts + a vector store.

There is no need to insert into and extract from a vector register between the meebr and aebr instructions. This is where the costs of 16 are wrong; they should have been lower.
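As a rough back-of-the-envelope with the same unit costs as in the 2x4 extracts + 4 operations + 4 inserts = 16 breakdown above: the fmul would then cost 8 extracts + 4 multiplies = 12 (no inserts, since its only user is itself scalarized), and the fadd would cost 4 extracts (for %wide.load6) + 4 adds + 4 inserts = 12 (no extracts needed for %mul), instead of 16 each.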

My question then is how to fix this? My idea was to let the loop vectorizer keep thinking about these instructions as vectorized, while being aware of a later expansion during isel (perhaps ISelExpanded would be a better name than TargetScalarized?).

This could be used to compute better scalarization overhead costs. It would also be interesting at least on SystemZ to use to do scalar stores/loads instead of widening if the def / user is "isel-expanded". This could perhaps be controlled by TTI.supportsEfficientVectorElementLoadStore(), which is already available.

Is this making sense?

Making better decisions about what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this is best done in a target-independent way; e.g., the way computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

The only target-specific part I am thinking about is which instructions will later be expanded during *isel*.

My question back to you is why Scalars is not good enough for your purpose. Do you get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()?

My understanding is that currently the LoopVectorizer's notion of a scalarized instruction refers to an instruction scalarized at the *LLVM IR* level; in other words, the instructions it will itself emit in scalarized form. These are the instructions contained in Scalars[VF].

I think you are either 1) arm-twisting the vectorizer to emit vector code which you know will be scalar, or 2) arm-twisting the vectorizer's cost model into believing that what you are emitting as "vector" is really scalar. I certainly do not see why "you have to" do that, because letting the vectorizer emit scalar IR instructions in those cases should be "equivalent". So, why "do you WANT to" do that? The IR coming out of the vectorizer may be more compact, but what that will accomplish is cheating all downstream optimizers and their cost models.

It seems to me that what you are trying to do here is bring a "packed scalar" notion into LLVM IR, i.e., part of the IR uses a vector data type but it's really just a compact form of VF scalar operations. I suggest bringing this to a wider RFC discussion with the rest of the community, but I'm sure one of the questions people will ask would be "why?".

My question then is how to fix this? My idea was to let the loop vectorizer keep thinking about these instructions as vectorized, while being aware of a later expansion during isel (perhaps ISelExpanded would be a better name than TargetScalarized?).

So there are basically two possible ways to model sequences like this:

  1. The vectorizer models/emits the instructions as "vector" instructions, but gives a discount to back-to-back instructions which will be scalarized.
  2. The vectorizer compares vector and scalar instructions based on the cost model, and explicitly scalarizes instruction sequences if they would be cheaper.

This patch implements the first possibility, but the second is more useful, I think; it's closer to how the underlying target actually works, and it composes with other cost modeling more easily. For example, for some operations, a vector instruction exists, but it's only a little cheaper than transforming it into an extract+scalar+insert sequence.

jonpa added a comment. Dec 12 2018, 2:01 AM

Thank you for your feedback!

I think you are either 1) arm-twisting the vectorizer to emit vector code which you know will be scalar, or 2) arm-twisting the vectorizer's cost model into believing that what you are emitting as "vector" is really scalar. I certainly do not see why "you have to" do that, because letting the vectorizer emit scalar IR instructions in those cases should be "equivalent". So, why "do you WANT to" do that? The IR coming out of the vectorizer may be more compact, but what that will accomplish is cheating all downstream optimizers and their cost models.

I am just trying to keep it simple by not changing how LV generates code, but merely improving the cost computations. Changing the output of a vectorized loop seems like a much bigger project, which I did not attempt.

So there are basically two possible ways to model sequences like this:

  1. The vectorizer models/emits the instructions as "vector" instructions, but gives a discount to back-to-back instructions which will be scalarized.
  2. The vectorizer compares vector and scalar instructions based on the cost model, and explicitly scalarizes instruction sequences if they would be cheaper.

I can see the benefit of (2) in cases where an instruction is vectorizable but sits in between two scalarized instructions, like Scal -> Vec -> Scal. In such a case it may be better to also scalarize the vectorizable (Vec) instruction.

Are you proposing some kind of search over instruction sequences with some limited lookahead? Like comparing the cost of scalarizing I1, I2, I3 with the cost of vectorizing them? So {Extr + Scal1 + Scal2 + Scal3 + Ins} would be compared against {Vec1 + Vec2 + Vec3}, where a target-expanded vector instruction would automatically (as it currently does) include the scalarization cost...
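For illustration (with VF = 4 and hypothetical unit costs throughout): {Extr + Scal1 + Scal2 + Scal3 + Ins} would come to 4 + 3 * 4 + 4 = 20, to be weighed against {Vec1 + Vec2 + Vec3} = 3 if all three operations have cheap vector forms, or against something like 3 * 16 = 48 if all three would be target-expanded as in the example above.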

I suspect that an algorithm that makes these decisions would benefit from knowing which instructions the target *must* scalarize (no vector instruction available). I think those would be the starting points for these searches, wouldn't they? After all, if all instructions are vectorizable, this step could simply be skipped. In that sense, my patch might be a first step in this direction.

Are you proposing some kind of search over instruction sequences with some limited lookahead?

Yes, something like this.

I suspect that an algorithm that makes these decisions would benefit from knowing which instructions the target *must* scalarize

I think you would just start with instructions where the cost model says that VectorOperationCost > ScalarOperationCost*VF. It doesn't really matter why it's expensive. (I guess at some point, we might want to model which execution units are used by a vector instruction, but I think you'd want a different interface for that.)

I think you are either 1) arm-twisting the vectorizer to emit vector code which you know will be scalar, or 2) arm-twisting the vectorizer's cost model into believing that what you are emitting as "vector" is really scalar. I certainly do not see why "you have to" do that, because letting the vectorizer emit scalar IR instructions in those cases should be "equivalent". So, why "do you WANT to" do that? The IR coming out of the vectorizer may be more compact, but what that will accomplish is cheating all downstream optimizers and their cost models.

I am just trying to keep it simple by not changing how LV generates code, but merely improving the cost computations. Changing the output of a vectorized loop seems like a much bigger project, which I did not attempt.

From my perspective, that's not simple at all. LV already has a mechanism to scalarize instructions, and ways to compute the cost of those scalarized instructions. If everyone keeps adding their own way of scalarizing (or of computing the cost for it), that adds unnecessary complexity.

Please see if you can add a TTI-based "can this instruction be vectorized" check inside this function. Consider something like isScalarWithoutPredication(), analogous to isScalarWithPredication().

void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF)

Once the instruction is added to InstsToScalarize, the rest of the cost model you desire should just happen; if not, that's where you'd want to contribute, to the benefit of everyone including you.
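A minimal sketch of the shape such a check could take (simplified and hypothetical; the cost inputs stand in for whatever per-instruction cost queries the cost model provides):

// Hypothetical sketch: an instruction is worth scalarizing "without
// predication" when the target's cost for the vector form exceeds the cost
// of simply replicating the scalar form VF times (e.g. because the target
// would expand the vector instruction into scalars during isel anyway).
static bool shouldScalarizeWithoutPredication(unsigned VectorFormCost,
                                              unsigned ScalarFormCost,
                                              unsigned VF) {
  return VectorFormCost > ScalarFormCost * VF;
}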

Then, go look at the Recipe construction. Again, we need a check similar to isScalarWithPredication, and we should return false here.

bool VPRecipeBuilder::tryToWiden(Instruction *I, VPBasicBlock *VPBB, VFRange &Range)

That should force the creation of a replicate recipe.

Hopefully, that's all you need to do to serialize those vector FP operations for which you don't have HW vector support.

Ayal added a comment. Dec 12 2018, 1:39 PM

Are you proposing some kind of search over instruction sequences with some limited lookahead?

Yes, something like this.

Agreed. Again, computePredInstDiscount() and sinkScalarOperands() already perform such a search. See also "Throttled SLP" by Vasileios in LLVM Dev 2015, PACT 2015.

I suspect that an algorithm that makes these decisions would benefit from knowing which instructions the target *must* scalarize

I think you would just start with instructions where the cost model says that VectorOperationCost > ScalarOperationCost*VF. It doesn't really matter why it's expensive. (I guess at some point, we might want to model which execution units are used by a vector instruction, but I think you'd want a different interface for that.)

The interface is already there: expectedCost() notes whether the best cost for a given VF is obtained by using vector or scalar instructions. But this is only used to report whether any vector instructions will actually be generated, to comply with the user's "vectorize" directive.

jonpa updated this revision to Diff 178627. Dec 18 2018, 2:21 AM

Thanks for the review and suggestions! I tried your ideas and was pleasantly surprised that explicitly scalarizing the instructions, instead of just correcting the costs for them as I first did, actually worked quite "out of the box", as you said. :-)

I still think this patch is experimental in the sense that I haven't seen any convincing benchmark improvements on my target (SystemZ) to motivate this extra code in the LoopVectorizer, although it seems to do what it's intended for. I see two main parts in this patch now.

  1. Find instructions (sequences) that will be scalarized by the target, and explicitly scalarize them with correct costs, just like for predicated instructions.
  2. Scalarize loads and stores that are connected to a scalarized user/def, instead of always widening them, when that is cheaper.
  • For (1), I implemented isScalarWithoutPredication(), per your suggestion, to compare the "scalar * VF" cost against the "vector" cost. Unfortunately, I had to add an "InPassedVF" argument to getInstructionCost() and getMemoryInstructionCost(), so as not to confuse those methods, which may pass the Instruction pointer to the target implementation. When the scalar cost (VF=1) is queried for a vectorized context, this argument is set to false so that the target does not treat the instruction as a scalar instruction in a scalar context and find folding opportunities that may not be present in the vectorized loop. I then took inspiration from computePredInstDiscount() and implemented computeInstDiscount() to find the scalarized sequences. I did not attempt to search past vectorizable instructions, but it recognizes scalarized loads/stores.

At this point I noticed that on SystemZ (z13) this primarily affected ~100 loops containing the float (fp32) type, which is scalarized. Typically, these are of the form "load; op(s); store", with some variations. The patch made those vector loops cheaper (with more correct scalarization costs), while the scalar version still suffered a bit from not considering loads folded into the scalar operations such as fadd. The operand scalarization cost was also improved by checking for loop-invariant operands, which is not currently done. (Side notes: I saw that VF=2 now gets good treatment since the loop is scalarized, and the DAG problem of expansion to VF=4 disappeared. Also, I had to disable this for integer div/rem, since we still see spilling if those loops are vectorized.)

  • (2) needs (1) to work, since the correct cost is only attained if, e.g., the scalarized instruction using the load does not also get an (incorrect) extraction cost. Basically, if the cost of the vector instruction plus extractions/insertions is greater than the cost of the scalar instructions, the memory access is scalarized instead of widened.

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else; it had to be done in setCostBasedWideningDecision(). It was also not possible to use the new isScalarWithoutPredication() at this point, since getInstructionCost() cannot be called before the scalars have been collected. I therefore used the TLI check to see which instructions would be scalarized, wrapped in a target hook I named preferScalarizedMemFor(). Perhaps there is a better way to do this, but for SystemZ I wanted to avoid extractions into GPRs (integer extractions), and I also prefer to store 2 x 64 bits instead of 128 bits if the wide store requires inserts.
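In essence, the decision described above boils down to a comparison along these lines (a simplified sketch with hypothetical names and parameters, not the actual hook):

// Hypothetical sketch of the widening decision in (2): prefer scalarizing a
// load/store when the widened access, plus the per-lane extracts/inserts its
// scalarized users/defs would then need, is costlier than VF scalar accesses
// (which may later be folded into the scalar operations during isel).
static bool preferScalarizedMemAccess(unsigned WidenedAccessCost,
                                      unsigned PerLaneTransferCost,
                                      unsigned ScalarAccessCost, unsigned VF) {
  return WidenedAccessCost + VF * PerLaneTransferCost > VF * ScalarAccessCost;
}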

This seemed beneficial for loads in places, but I saw that LSR generated poor code for the 2 x 64-bit stores, with duplicated address computations in address registers instead of using an immediate offset.

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else.

If we change "memory widening decision" from "hard decision" to "soft decision to be finalized after collect scalars", we might be able to solve that. In general, I don't think this is any different from data flow based optimization problem

  • and the standard solution there is do this iteratively. So, some form of going back and revise the prior decision is inevitable if we want to improve this.

In any case, the newer code you have written is more generically applicable, and the improvements needed on top of that are also generically applicable. Best of all, we don't have to teach downstream optimizers about weirdness in cost modeling.
That's what I was aiming for in reviewing your change.

Is the uploaded code ready for review? Or shall we continue discussing the approach?

jonpa added a comment. Jan 6 2019, 11:53 PM

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else.

If we change "memory widening decision" from "hard decision" to "soft decision to be finalized after collect scalars", we might be able to solve that. In general, I don't think this is any different from data flow based optimization problem

  • and the standard solution there is do this iteratively. So, some form of going back and revise the prior decision is inevitable if we want to improve this.

I was not sure this is a natural extension of the current algorithm. Currently, the memory widening decisions are made first, and the loop uniforms and scalars are then collected based on them. Are you saying that changing a decision from e.g. 'widen' to 'scalarize' could be done without affecting the set of uniforms/scalars? Perhaps it could... Scalarizing a widened access might just mean some extra accesses with immediate offsets, which would not require any changes to the induction variables...

In any case, the newer code you have written is more generically applicable, and the improvements needed on top of that are also generically applicable. Best of all, we don't have to teach downstream optimizers about weirdness in cost modeling.
That's what I was aiming for in reviewing your change.

Yes, I agree it looks better now :-)

Is the uploaded code ready for review? Or shall we continue discussing the approach?

It's quite ready, except for our discussion above.

Also, in the spirit of keeping the LoopVectorizer as clean as possible, I think we should have some more people agreeing that this is a general improvement. If that were the case, I would be happy to commit, but I can't say at the moment whether this really improves things enough to go in... It may be that this is mostly a theoretical improvement which may or may not be worth having, depending on the complexity of the patch...