Page MenuHomePhabricator

when calculating RegUsages, ignore instructions which are uniformed after vectorization
ClosedPublic

Authored by wmi on May 20 2016, 9:43 AM.

Details

Summary

This is following patch for http://reviews.llvm.org/D15177.

Just as Hal's comment in http://reviews.llvm.org/D15177?id=41809#inline-126094,
For induction variable only used in GetElementPtr and ICmp, D15177 knows it will not have vectorized version and can be added into VecValuesToIgnore, but it didn't consider the case that induction variable may be used by a add/sub before used in GetElementPtr.

for loop like below:

char a[1000];
char b[1000];
for (long i = 0; i < N; i++)

a[i] = b[i] * 6 + (b[i] + b[i + 1]) * 4 + b[i - 2] + b[i + 2];

When we are computing RegUsages for VF==8 and VF==16,
it is important for the register usages estimation component to know array index exprs like i, i+1, i+2 and i+3 will not have vectorized version after the loop being vectorized, and their live ranges should not be counted as vector register usages, or else it is likely to exaggerate the vector register pressure.

The patch adds instructions for which isUniformAfterVectorization returns true into the VecValuesToIgnore set. A special case is that PHI instructions are never included into the Uniforms set in collectLoopUniforms(), so a special handling for PHI is that when the result of PHI is only used in GetElementPtr or Uniform instructions, the PHI will be added into VecValuesToIgnore set too.

Another following patch in plan is if estimated vector register usage is more than the number of available hardware vector registers for certain VF, don't simply give up the VF. Just add the extra spill cost into the VectorCost. If the total VectorCost of the VF is the lowest, the VF is still worthy to try.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 57942.May 20 2016, 9:43 AM
wmi retitled this revision from to when calculating RegUsages, ignore instructions which are uniformed after vectorization.
wmi updated this object.
wmi added reviewers: hfinkel, jmolloy.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: llvm-commits, davidxl, congh, mkuper.
mkuper added inline comments.Jun 6 2016, 12:50 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6175 ↗(On Diff #57942)

This isn't directly related to this patch - but wouldn't this be true only for consecutive GEPs? (e.g. see D20789)

mkuper added inline comments.Jun 7 2016, 6:30 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5350 ↗(On Diff #57942)

Another side point - should we remove this?
Looking at http://reviews.llvm.org/rL172178, the reason that we only look at loads, stores, and PHIs is that "We don't have a detailed analysis on which values are vectorized and which stay scalars in the vectorized loop so we use another method. We look at reduction variables, loads and stores, which are the only ways to get information in and out of loop iterations".

That was true at the time, but since then we've gained a precise way of knowing which instructions are uniform, and with this patch will actually use that for ValuesToIgnore. So this check will now only miss instructions that ought to be taken into account, right?

wmi added inline comments.Jun 8 2016, 2:39 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5350 ↗(On Diff #57942)

Yes, it makes sense to remove it.

6175 ↗(On Diff #57942)

Every GEP (no matter it is consecutive or not) will be scalarized. It is not related with the load/store using the GEP. If the induction variable is only used in GEP, it will not be vectorized, right?

mkuper added inline comments.Jun 8 2016, 3:00 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6175 ↗(On Diff #57942)

I don't think so - as far as I know, we should be creating vector GEPs for scatter/gather when it's profitable on the target. (I think the only target that supports it right now is AVX-512.)

wmi added inline comments.Jun 8 2016, 5:09 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6175 ↗(On Diff #57942)

I see. If a[3*i] are vectorized using gather/scatter. It needs a vectorized version of 3*i so probably it is better to generate a vectorized version of i. Then i shouldn't be added into VecValuesToIgnore. Thanks.

wmi updated this revision to Diff 61442.Jun 21 2016, 2:16 PM

Add nonconsecutive pointer values and their dependency into the uniform after vectorization set in collectLoopUniforms if gather/scatter is not supported. The related code in collectValuesToIgnore is removed.

The added test hoo is to ensure the ptr of the load for which gather is possible will not be added into VecValuesToIgnore set.

mkuper added inline comments.Jun 22 2016, 4:43 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4919 ↗(On Diff #61442)

Are you sure about this?

Nonconsecutive pointer values when there is no gather/scatter will be scalarized, but they aren't uniform.
So I'm not sure we should be counting them as uniform. This will work correctly for your new use of isUniformAfterVectorization() (since we really don't need vector registers in either case). But I think it may do the wrong thing for the existing use, in getInstructionCost(). We shouldn't be evaluating the cost of non-consecutive loads/stores as if they are a single scalar load/store.

Am I confused?

wmi added inline comments.Jun 22 2016, 5:45 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4919 ↗(On Diff #61442)

Ah, you are right. I misunderstood what uniform means here. Will fix it.

mssimpso added inline comments.Jun 23 2016, 2:38 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6247–6249 ↗(On Diff #61442)

Hi Wei,

I'm joining this review a bit late, so I apologize if I'm not quite up-to-speed yet. But I'm not sure I follow this. Please correct me if I'm missing something!

If I take the following test case:

define void @test(i32* %a, i64 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
  %0 = trunc i64 %i to i32 
  %1 = getelementptr inbounds i32, i32* %a, i32 %0
  store i32 %0, i32* %1, align 4
  %i.next = add nuw nsw i64 %i, 1
  %cond = icmp eq i64 %i.next, %n
  br i1 %cond, label %for.end, label %for.body

for.end:
  ret void
}

We generate vectorized induction variables so that for the store, we have:

store <4 x i32> %vec.ind1, <4 x i32>* %3, align 4

However, with your change, it looks to me like we will add the induction variable to VecValuesToIngore where previously we wouldn't have. Is this right?

wmi added inline comments.Jun 23 2016, 3:07 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6247–6249 ↗(On Diff #61442)

Thanks for pointing out the problem. For your testcase, %1 is only used in instruction %0 = ... which is a uniform instruction so it will be put into VecValuesToIngore. However, it may be problematic for %0 to be uniform because actually it has both scalar and vector version after vectorization.

I will add your testcase into consideration. Thanks.

mkuper added inline comments.Jun 23 2016, 4:32 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6247–6249 ↗(On Diff #61442)

So the problem isn't here, it's in collectLoopUniforms(), right?
That is, the problem is that we're tracing back from the consecutive pointer, and adding all of the operands of the store to the worklist, instead of just the pointer operand?

wmi added inline comments.Jun 23 2016, 4:52 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6247–6249 ↗(On Diff #61442)

Looks like all of the operands of the store should be added to the worklist, but only those operands not being used in any nonUniform instruction should be regarded as uniform. But that requires collectLoopUniforms algorithm to use some topological order to do uniform check for the values in worklist. Do you think it is the right way to go for collectLoopUniform?

mkuper added inline comments.Jun 23 2016, 5:16 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6247–6249 ↗(On Diff #61442)

You're right, I got confused. We're not tracing back from the store anyway, it's from the use of %0 in the GEP, I didn't read your original comment correctly.
The real problem, as you say, is that we're assuming that every operand of a uniform instruction must be uniform, but it simply isn't true in our current definition of "uniform".

wmi added a comment.Jun 27 2016, 11:01 AM

Put the change in collectLoopUniform into a separate patch: http://reviews.llvm.org/D21755

wmi updated this revision to Diff 63026.Jul 6 2016, 9:27 PM

Extract the major part of collectLoopUniforms into a helper func getDependentClosure so it can be reused by collectValuesToIgnore. For collectLoopUniforms, only loop compare and consecutive ptrs of load/store will be the seed uniform instructions in the WorkList.
For collectValuesToIgnore, loop compare, consecutive ptrs, non-gather/scatter and non-consecutive ptrs will be the seed non-vector instructions in the Worklist.

wmi added a reviewer: mkuper.Jul 6 2016, 9:29 PM
mssimpso added inline comments.Jul 7 2016, 10:06 AM
test/Transforms/LoopVectorize/reverse_iter.ll
38–41 ↗(On Diff #63026)

Hi Wei,

The change to this test doesn't look right to me. Since indvars.iv feeds into the shl, why is it added to VecValuesToIngore? The shift remains as vector computation. Am I missing something?

wmi added inline comments.Jul 7 2016, 10:57 AM
test/Transforms/LoopVectorize/reverse_iter.ll
38–41 ↗(On Diff #63026)

Thanks for catching the problem. My assumption that the chain feeding into "non-gather/scatter && non-consecutive" getelementptr will only have scalar version is wrong. Will update the patch.

wmi updated this revision to Diff 63156.Jul 7 2016, 4:19 PM

Fix the problem pointed out by Matthew. When induction var is only used by uniform instruction or non-consecutive/non-gather scatter ptr instructions, the related phi and update will be added into VecValuesToIgnore set.

mkuper accepted this revision.Jul 15 2016, 4:50 PM
mkuper edited edge metadata.

Sorry, I lost track of this patch.

LGTM, modulo a couple of nits.

lib/Transforms/Vectorize/LoopVectorize.cpp
6495 ↗(On Diff #63156)

I was sure we already had a helper for a common LI/SI getPointerOperand() helper.
Turns out we have (at least!) 6, in:

  • LoopAccessAnalysis
  • EarlyCSE
  • DependenceAnalysis
  • PPCLoopPreIncPrep
  • Delinearization
  • LoadStoreVectorizer

I'm going to refactor this into a common helper somewhere in utils, but can you hoist this into another local helper? It'll be easier for me to keep track of, in case I land after you do.
(If you prefer to do the refactoring yourself, let me know. :-) )

test/Transforms/LoopVectorize/X86/reg-usage.ll
46 ↗(On Diff #63156)

Could you please document what each of the two new tests actually tries to check?

This revision is now accepted and ready to land.Jul 15 2016, 4:50 PM
wmi added a comment.Jul 18 2016, 10:20 AM

Michael, thanks for the review.

lib/Transforms/Vectorize/LoopVectorize.cpp
6495 ↗(On Diff #63156)

Added a local helper for it.

test/Transforms/LoopVectorize/X86/reg-usage.ll
46 ↗(On Diff #63156)

Comments added.

wmi updated this revision to Diff 64343.Jul 18 2016, 10:20 AM
wmi edited edge metadata.

Addressed Michael's comments.

This revision was automatically updated to reflect the committed changes.