This is an archive of the discontinued LLVM Phabricator instance.

Don't vectorize loops when everything will be scalarized
ClosedPublic

Authored by hfinkel on Mar 28 2016, 6:09 PM.

Details

Summary

This patch prevents the loop vectorizer from vectorizing when all of the vector types it generates will be scalarized. I run into this problem on the PPC's QPX vector ISA, which only holds floating-point vector types. The loop vectorizer will, however, happily vectorize loops with purely integer computation. Here's an example:

LV: The Smallest and Widest types: 32 / 32 bits.
LV: The Widest register is: 256 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 1 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Scalar loop costs: 3.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 2 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 2 for VF 2 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 2 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 0 for VF 4 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 4 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 4 for VF 4 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 4 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 0 for VF 8 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 8 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 8 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 8 for VF 8 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 8 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 8 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 8 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #4 Interval # 1
LV(REG): At #5 Interval # 1
LV(REG): VF = 8

The problem is that the cost model here is not wrong, exactly. Since all of these operations are scalarized, their cost (aside from the uniform ones) are indeed VF*(scalar cost), just as the model suggests. In fact, the larger the VF picked, the lower the relative overhead from the loop itself (and the induction-variable update and check), and so in a sense, picking the largest VF here is the right thing to do.

The problem is that vectorizing like this, where all of the vectors will be scalarized in the backend, isn't really vectorizing, but rather interleaving. By itself, this would be okay, but then the vectorizer itself also interleaves, and that's where the problem manifests itself. There's aren't actually enough scalar registers to support the normal interleave factor multiplied by a factor of VF (8 in this example). In other words, the problem with this is that our register-pressure heuristic does not account for scalarization.

While we might want to improve our register-pressure heuristic, I don't think this is the right motivating case for that work. Here we have a more-basic problem: The job of the vectorizer is to vectorize things (interleaving aside), and if the IR it generates won't generate any actual vector code, then something is wrong. Thus, if every type looks like it will be scalarized (i.e. will be split into VF or more parts), then don't consider that VF.

Diff Detail

Repository
rL LLVM

Event Timeline

hfinkel updated this revision to Diff 51864.Mar 28 2016, 6:09 PM
hfinkel retitled this revision from to Don't vectorize loops when everything will be scalarized.
hfinkel updated this object.
hfinkel added reviewers: anemet, nadav, congh, jmolloy.
hfinkel added a subscriber: llvm-commits.
spatel added a subscriber: spatel.Mar 28 2016, 7:36 PM
anemet accepted this revision.Mar 28 2016, 9:24 PM
anemet edited edge metadata.

LGTM.

I have two suggestions you may want to consider.

lib/Transforms/Vectorize/LoopVectorize.cpp
1535–1539 ↗(On Diff #51864)

Assuming I understand the logic correctly, you may want to mention it here that for ActuallyVectorized we currently use type legality. (I.e. we could still have a loop where we don't vectorize anything because none of the *instructions* are legal.)

5557–5558 ↗(On Diff #51864)

It seems that the current logic requires ActuallyVectorized to be initialized to false by the caller. I think that it would less error-prone if you initialized to false here inside the API.

What do you think?

This revision is now accepted and ready to land.Mar 28 2016, 9:24 PM
nadav edited edge metadata.Mar 29 2016, 2:03 PM

Hal, I am not sure I understand the problem. Is the problem register pressure or the fact that store <8 x i32> is more expensive than 8 times store i32?

This looks like a problem with the PPC cost model that does not take into account the cost of scalarization.

Hal, I am not sure I understand the problem. Is the problem register pressure or the fact that store <8 x i32> is more expensive than 8 times store i32?

It is really just register pressure. Since there is no legal vector type of i32 in this configuration, everything is just scalarized (or perhaps I should say expanded to avoid overloading terminology here -- the point is that it is type legalization, not operation legalization).

This looks like a problem with the PPC cost model that does not take into account the cost of scalarization.

No, there is no scalarization cost, because nothing is ever a vector. In fact, if you take my test case and turn off interleaving, you get pretty nice-looking code (which is even interleaved in practice, because that's what type legalization gives us). However, between the vector expansion (type legalization) and the interleaving the targets generally requests, the register pressure is too high.

Also, FWIW, Sanjay says that this patch also fixes PR26837 (which applies to SSE).

nadav added a comment.Mar 29 2016, 2:20 PM

I don't like the approach of passing in address-of-bool as parameter argument, especially since you did not document the parameter (is it IN, is it OUT, etc).

Please change the getCost return value to return a struct, or std::pair that ties the cost and the bit that says if vectorization happened. Please declare a type for this pair:

using VectorizationCostTy = std::pair<unsigned, bool>;

Then, update the Doxygen comments of the functions that use it.

Also, the name "ActuallyVectorized" is not descriptive. Maybe "AnyInstructionVectorized" ?

I don't like the approach of passing in address-of-bool as parameter argument, especially since you did not document the parameter (is it IN, is it OUT, etc).

Please change the getCost return value to return a struct, or std::pair that ties the cost and the bit that says if vectorization happened. Please declare a type for this pair:

using VectorizationCostTy = std::pair<unsigned, bool>;

Then, update the Doxygen comments of the functions that use it.

Also, the name "ActuallyVectorized" is not descriptive. Maybe "AnyInstructionVectorized" ?

Sure. Will do.

hfinkel updated this revision to Diff 52018.Mar 29 2016, 7:24 PM
hfinkel edited edge metadata.

Address review comments.

Address review comments.

Nadav, this version uses a std::pair (with an appropriately-commented typedef). What do you think of this version?

Hal, the code looks much better. I was thinking about this patch on the way to work this morning and I was wondering if you could mark the type <4 x i32> as legal, because you _can_ load it and store it to memory. Right?

This revision was automatically updated to reflect the committed changes.