This is an archive of the discontinued LLVM Phabricator instance.

[LV] Untangle the concepts of uniform and scalar
ClosedPublic

Authored by mssimpso on Jul 27 2016, 8:22 AM.

Details

Summary

This patch refactors the logic in collectLoopUniforms and collectValuesToIgnore, untangling the concepts of "uniform" and "scalar". It adds isScalarAfterVectorization along side isUniformAfterVectorization to distinguish the two. Known scalar values include those that are uniform, getelementptr instructions that won't be vectorized, and induction variables and induction variable update instructions whose users are all known to be scalar.

This patch includes the following functional changes:

  • In collectLoopUniforms, we mark uniform the pointer operands of interleaved accesses. Although non-consecutive, these pointers are treated like consecutive pointers during vectorization.
  • In collectValuesToIgnore, we insert a value into VecValuesToIgnore if it isScalarAfterVectorization rather than isUniformAfterVectorization. This differs from the previous functionality in that we now add getelementptr instructions that will not be vectorized into VecValuesToIgnore.

This patch also removes the ValuesNotWidened set used for induction variable scalarization since, after the above changes, it is now equivalent to isScalarAfterVectorization.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 65745.Jul 27 2016, 8:22 AM
mssimpso retitled this revision from to [LV] Mark scalarized GEPs uniform.
mssimpso updated this object.
mssimpso added reviewers: mkuper, wmi.
mssimpso added subscribers: llvm-commits, mcrosier.
mssimpso updated this revision to Diff 65750.Jul 27 2016, 8:35 AM

Removed unused SmallPtrSet.

anemet added a subscriber: anemet.Jul 27 2016, 9:32 AM
wmi added inline comments.Jul 27 2016, 10:34 AM
test/Transforms/LoopVectorize/induction.ll
295 ↗(On Diff #65750)

Hi Matthew,

A problem I see to make getelementptr as uniform when it is non-consecutive is:

For the testcase here, if we don't enable interleave memory access,
we will generate vectorized version for "%0 = shl nsw i64 %i, 2". However with your patch "%0 = shl nsw i64 %i, 2" will also be marked as uniform because "%1 = getelementptr inbounds i32, i32* %a, i64 %0" is marked as uniform. These are contradicted results.

Even if we generate scalarized version for "%0 = shl nsw i64 %i, 2", the instruction cost for "%0 = shl nsw i64 %i, 2" should be VF. Marking it as uniform will lower its cost estimation to be only 1.

Thanks,
Wei.

anemet added inline comments.Jul 27 2016, 10:38 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
1518–1519 ↗(On Diff #65750)

I was looking at this code recently and was surprised to see that we never actually describe what uniformity is (especially because we use it for other things than just loop-invariant addresses). As a first step, would you mind filling this gap?

My take is that we currently call something uniform if we don't need to generate values for each horizontal value in a vector loop iteration, more precisely we only need to generate the first one. This is true for a few things: the induction variable, loop-invariant addresses, pointers for consecutive accesses (because the vector access instruction implicitly generates the horizontal addresses).

mssimpso added inline comments.Jul 27 2016, 12:23 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
1518–1519 ↗(On Diff #65750)

I completely agree and don't mind doing this at all. Thanks for putting that into words!

test/Transforms/LoopVectorize/induction.ll
295 ↗(On Diff #65750)

You're right - the GEP will only be uniform here if the loads/stores are in interleaved groups. This makes some sense to me because when interleaving we treat the pointer as if it was consecutive. Thanks! I will update the patch.

mssimpso updated this revision to Diff 65945.Jul 28 2016, 9:15 AM
mssimpso retitled this revision from [LV] Mark scalarized GEPs uniform to [LV] Untangle the concepts of uniform and scalar.
mssimpso updated this object.
mssimpso added a reviewer: anemet.

Addressed comments from Adam and Wei.

  • Properly defined uniform after vectorization and updated comments.
  • Untangled the concepts of "uniform" and "scalar".
  • Updated the revision title and summary.

Thanks!

Actually, I don't this this revision is quite right yet. I"ll post another update shortly. Sorry!

mssimpso updated this revision to Diff 65977.Jul 28 2016, 1:19 PM
mssimpso updated this object.

Addressed issues I alluded to in my previous comment.

This revision completely separates the decision to scalarize an induction variable from the logic in collectLoopUniforms and collectValuesToIgnore. The need to do this became apparent after better defining what we mean by uniform (see the added comment above the collectLoopUniforms declaration). Since scalar doesn't imply uniformity, it doesn't make sense to use collectLoopUniforms to determine which induction variables will be scalar.

anemet edited edge metadata.Jul 28 2016, 8:57 PM

Very nice clean-up, Matt!

I liked it in the old description that it listed the functional changes. These are gone now (I am assuming you will be using the description for the commit log).

I still want to think a little bit about the functionality changes but would be good if you could take care of the one thing below.

lib/Transforms/Vectorize/LoopVectorize.cpp
1480–1490 ↗(On Diff #65977)

Moving this inside Legal seems orthogonal to this patch, if so please split this part out and commit.

mkuper edited edge metadata.Jul 29 2016, 12:14 AM

I'll look at this tomorrow, I hope.
For now, just a note about what Adam suggested.

lib/Transforms/Vectorize/LoopVectorize.cpp
1487 ↗(On Diff #65977)

If you're going to split this out and commit - note that this sequence is what we have the getPointerOperand() helper (that I still want to unify with the 6 others we have throughout the code base...) for.

mssimpso added inline comments.Jul 29 2016, 7:13 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
1480–1490 ↗(On Diff #65977)

Moving this inside Legal seems orthogonal to this patch, if so please split this part out and commit.

Right. Will do.

If you're going to split this out and commit - note that this sequence is what we have the getPointerOperand() helper (that I still want to unify with the 6 others we have throughout the code base...) for.

Right. Shall I go ahead and commit an additional prequel patch that does this unification?

mssimpso updated this revision to Diff 66125.Jul 29 2016, 8:23 AM
mssimpso edited edge metadata.

Rebased on top of the two (yet to be committed) NFC's.

The first NFC will move the getPointerOperand helper and use it throughout LoopVectorize.cpp where appropriate. I'll commit it once Michael gives the go-ahead. The second will move isGatherOrScatterLegal into LoopVectorizationLegality (using the getPointerOperand helper).

Regarding the functional changes, yes I'll includ those in the commit log. They are:

  • Marking the pointer operands of interleaved accesses uniform. They're treated as if they're consecutive when vectorizing interleaved groups. We have to collect the interleaved groups before the uniforms now.
  • Moving the induction variable logic out of collectValuesToIgnore. IVs and their updates will now only be placed in VecValuesToIgnore if they're uniform. I believe the existing code was there because the IV scalarization logic needed VecValuesToIgnore to decide if a scalar IV was desired. But now, with a clear distinction between uniform and scalar, I think we should move this out. The IVs are handled by isScalarAfterVectorization now.
  • Moving the induction variable logic out of collectValuesToIgnore. IVs and their updates will now only be placed in VecValuesToIgnore if they're uniform. I believe the existing code was there because the IV scalarization logic needed VecValuesToIgnore to decide if a scalar IV was desired. But now, with a clear distinction between uniform and scalar, think we should move this out. The IVs are handled by isScalarAfterVectorization now.

Matt, however, not just IV scalarization logic needs VecValuesToIgnore, calculateRegisterUsage to compute vector register pressure also needs it. Because IV's live range always spreads across the whole loop body, and it is often promoted to i64 type, it is important to include IV in VecValuesToIgnore if there is no vector version IV.

Thanks,
Wei.

Yes, calculuateRegisterUsage uses VecValuesToIgnore. But if we scalarize an induction variable that isn't uniform, we will still create a value for it corresponding to every iteration of the original scalar loop. For example,

%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%3 = or i64 %index, 1
%4 = or i64 %index, 2
%5 = or i64 %index, 3
...
%10 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %index, i32 1
%11 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %3, i32 1
%12 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %4, i32 1
%13 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %5, i32 1

Do we not want to consider the ORs in calculateRegisterUsage (or am I misunderstanding the purpose of this function)? Even so, if the GEPs aren't added to VecValuesToIgnore in this case (they're not), why should we add the IV?

Wei,

I should've also mentioned that the uniform scalar IVs will still be placed in VecValuesToIgnore. The non-uniform scalar IVs will not. I tried to clarify the distinction in the comment above collectLoopUniforms.

Matt.

anemet added inline comments.Jul 29 2016, 9:55 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6356–6360 ↗(On Diff #66125)

This is essentially LGTM now (assuming that @wmi an @mkuper are happy too) and I am only wondering about test-coverage now and that the patch really only changes the described functionality.

  1. Do we already have test-coverage for this in the original version?
  1. Did you try to perf-test this patch or in combination with some other patches? The new version is way better and we can actually reason about what's going on (thank you!) but want to make sure that we didn't miss anything.

Thanks for your work!

mssimpso added inline comments.Jul 29 2016, 10:08 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6356–6360 ↗(On Diff #66125)

Thanks Adam!

Do we already have test-coverage for this in the original version?

We have the existing induction variable scalarization tests, which all continue to pass. Previously, we relied on the IVs being in VecValuesToIgnore. Beyond that we have at least one register usage test (X86), which continues to pass. calculateRegisterUsage also relies on VecValuesToIgnore.

Did you try to perf-test this patch or in combination with some other patches?

Yes, I tested this patch together with D22869 on spec2006 and spec2000 (LTO, Kryo). I saw small improvements in several benchmarks with no non-noise regressions. The most significant improvements were in spec2006/h264ref and spec2000/mesa and were on the order of 1-2%.

mssimpso added inline comments.Jul 29 2016, 10:33 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
6356–6360 ↗(On Diff #66125)

I should have also pointed out to everyone for further clarity that the code I removed here was already (mostly) duplicated. Over in collectLoopUniforms, we mark IVs uniform if they or their updates are only used by uniforms. Thus, here in collectValuesToIgnore, the uniform IVs would have already been added to VecValuesToIgnore (line 6269).

The functional change here is that now we are not adding to VecValuesToIgnore the non-uniform IVs that are used by non-consecutive pointers. This makes sense to me, as I mentioned in my reply to Wei, since we will still create a scalar value for such IVs corresponding to every iteration of the original scalar loop.

wmi edited edge metadata.Jul 29 2016, 10:37 AM

Yes, calculuateRegisterUsage uses VecValuesToIgnore. But if we scalarize an induction variable that isn't uniform, we will still create a value for it corresponding to every iteration of the original scalar loop. For example,

%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%3 = or i64 %index, 1
%4 = or i64 %index, 2
%5 = or i64 %index, 3
...
%10 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %index, i32 1
%11 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %3, i32 1
%12 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %4, i32 1
%13 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %5, i32 1

Do we not want to consider the ORs in calculateRegisterUsage (or am I misunderstanding the purpose of this function)? Even so, if the GEPs aren't added to VecValuesToIgnore in this case (they're not), why should we add the IV?

There are two factors here:

  1. Now calculateRegisterUsage only considers the pressure of vector register, not scalar register. TTI.getNumberOfRegisters only returns vector register number. Scalar register pressure is not modeled in loop vectorization.
  2. induction variable has larger impact on register pressure than ORs above, it has live range across the loop while ORs generated have limited live range. If we don't add iv to VecValuesToIgnore, it may involve some impact from scalar register pressure increase into consideration, but it will definitely pessimize the maximum vector register pressure.

I think it is also better to add GEP to VecValuesToIgnore too for non-consecutive memory access case. although GEP has less impact on vector register pressure than iv.

Thanks,
Wei.

LGTM but please wait for the other reviewers.

In D22867#500844, @wmi wrote:

There are two factors here:

  1. Now calculateRegisterUsage only considers the pressure of vector register, not scalar register. TTI.getNumberOfRegisters only returns vector register number. Scalar register pressure is not modeled in loop vectorization.

This doesn't make sense to me. We use register pressure when deciding to unroll, and we surely do that when we're not vectorizing (VF = 1). I believe TTI.getNumberOfRegisters(true) is what returns the number of vector registers, but in a quick scan of calculateRegisterUsage it doesn't even look like we consider this. I think we just count the number of live intervals. Am I missing something?

  1. induction variable has larger impact on register pressure than ORs above, it has live range across the loop while ORs generated have limited live range. If we don't add iv to VecValuesToIgnore, it may involve some impact from scalar register pressure increase into consideration, but it will definitely pessimize the maximum vector register pressure.

I think it is also better to add GEP to VecValuesToIgnore too for non-consecutive memory access case. although GEP has less impact on vector register pressure than iv.

It sounds like you're advocating to add values to VecValuesToIgnore if isScalarAfterVectorization is true rather than isUniformAfterVectorization? This would make sense to me if calculateRegisterUsage is only intended to track the pressure of vector registers, but I'm not sure that it is. Can you please clarify?

First of all, thanks a lot for working on untangling this!

I'll try to restate my understanding of what we should be doing - that I think aligns with Wei's.

We have three types of values:

  1. Uniform scalar - every value for which we generate a single scalar - the primary IV, base pointer for consecutive pointer values, etc.
  2. Non-uniform "scalar" - values that are non-uniform, but for which we generate VF scalar values, instead of a single vector. For instance, the "scalar IVs".
  3. Non-uniform non-scalar - values that actually get vectorized.

(Perhaps we can adjust the terminology for (2) and (3) - my preference would be to call (3) "vector" and (2) "multi-scalar", but I'm open to other suggestions.)

In any case, as Wei wrote, we need two different groupings:
(a) One for 1 vs. 2 + 3, for operation cost estimation. This is what should go into ValuesNotWidened
(b) The other, for 1 + 2 vs. 3, for vector register pressure estimation.

After this patch we still seem to conflate the two - calculateRegisterUsage() assumes VecValuesToIgnore contains values in groups 3, but it will get 2 + 3.

lib/Transforms/Vectorize/LoopVectorize.cpp
2005 ↗(On Diff #66125)

Maybe return false early if !ScalarInd ?
There's no need to compute ScalarIndUpdate in that case, is there?

In D22867#500844, @wmi wrote:

There are two factors here:

  1. Now calculateRegisterUsage only considers the pressure of vector register, not scalar register. TTI.getNumberOfRegisters only returns vector register number. Scalar register pressure is not modeled in loop vectorization.

This doesn't make sense to me. We use register pressure when deciding to unroll, and we surely do that when we're not vectorizing (VF = 1). I believe TTI.getNumberOfRegisters(true) is what returns the number of vector registers, but in a quick scan of calculateRegisterUsage it doesn't even look like we consider this. I think we just count the number of live intervals. Am I missing something?

calculateRegisterUsage() will skip values in VecValuesToIgnore when VF > 1. So when we're only unrolling it will estimate scalar register pressure (by counting all live intervals, which are all scalar), but when we're both vectorizing and unrolling, it will estimate vector register pressure (by considering only the live intervals that correspond to values that will actually be vectorized).
I'm not saying this necessarily makes sense, just describing the state of the world.

From Wei:

What I mean is only vector register pressure is considered when selectVectorizationFactor. scalar register pressure is only considered in selectInterleaveCount and only when VF==1. When selectVectorizationFactor, scalar register pressure is not considered, so we don't want to count scalarized iv as a live range usage.

All right, I think I can agree with that. I think the confusing part here is that the "meaning" of calculateRegisterUsage seems to be different depending on who calls it.

Yes, calculateRegisterUsage is only intended to track the pressure of vector registers when selectVectorizationFactor. The reason is explained above. Yes, I think it is better to add values to VecValuesToIgnore if isScalarAfterVectorization is true.

Okay. I think we're saying that calcuateRegisterUsage is used to track vector register pressure for vector factor selection, vector register pressure for unrolling (VF > 1), and scalar register pressure for unrolling (VF = 1). It doesn't make a distinction between general purpose and vector registers. So if the loop will have both scalar and vector values, the estimate may be less precise. In that case, I think I agree here. If VF > 1, the intended purpose is to only track the pressure of vector registers, so we should ignore all scalar values.

From Michael:

We have three types of values:

  1. Uniform scalar - every value for which we generate a single scalar - the primary IV, base pointer for consecutive pointer values, etc.
  2. Non-uniform "scalar" - values that are non-uniform, but for which we generate VF scalar values, instead of a single vector. For instance, the "scalar IVs".
  3. Non-uniform non-scalar - values that actually get vectorized.

...

In any case, as Wei wrote, we need two different groupings:
(a) One for 1 vs. 2 + 3, for operation cost estimation. This is what should go into ValuesNotWidened
(b) The other, for 1 + 2 vs. 3, for vector register pressure estimation.

I think I agree, but want to clarify since I removed ValuesNotWidened.

For cost estimation, we will consider (1): isUniformAfterVectorization(), and for register pressure estimation and IV scalarization, we will consider (3): isScalarAfterVectorization. In that case, I think if I replace isUniformAfterVectorization with isScalarAfterVectorization in collectValuesToIgnore, we will all be happy.

lib/Transforms/Vectorize/LoopVectorize.cpp
2005 ↗(On Diff #66125)

Right, good point! I removed the optimization when trying to make the code easier to read. I'll update the patch.

Also, if we use isScalarAfterVectorization in collectValuesToIgnore and for IV scalarization, it probably makes sense to pre-compute this like we do for collectLoopUniforms. I think this will be a little cleaner than passing around ValuesNotWidened as before. I'll update the patch! Thanks for all the feedback.

Matt.

mssimpso updated this revision to Diff 66168.Jul 29 2016, 1:37 PM
mssimpso edited edge metadata.

Addressed comments from Michael and Wei.

  • Use isScalarAfterVectorization in collectValuesToIngore instead of isUniformAfterVectorization
  • Precompute isScalarAfterVectorization like we do for isUniformAfterVectorization.
mkuper added inline comments.Jul 29 2016, 2:14 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
1539 ↗(On Diff #66168)

represted -> represented.

4955 ↗(On Diff #66168)

This is still not quite right.
getUniformPtr() doesn't actually return a pointer if I has a uniform pointer argument. It does for the consecutive and interleaved case, but not for "real" scalar loop-invariant uniform pointers (e.g. a load from a global) - that is, uniform in the Legal->isUniform() sense.

Have I already thanked you for untangling this? :-)

mssimpso added inline comments.Aug 1 2016, 7:34 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
1539 ↗(On Diff #66168)

Thanks.

4955 ↗(On Diff #66168)

Good point. Actually, given the smallness of the change here (just adding the interleaved case), I'm thinking it might make sense to move getUniformPtr back inside collectLoopUniforms, with the appropriate updates to the comments. There's probably value in avoiding creating additional uses of the term "uniform".

mssimpso updated this revision to Diff 66326.Aug 1 2016, 8:28 AM
mssimpso updated this object.

Addressed Michael's comments.

  • Fixed spelling mistake.
  • Moved getUniformPtr functionality inside collectLoopUniforms to avoid creating further "uniform" confusion.
  • Updated summary to reflect functional changes.
mkuper accepted this revision.Aug 1 2016, 10:52 AM
mkuper edited edge metadata.

LGTM

This revision is now accepted and ready to land.Aug 1 2016, 10:52 AM
wmi added a comment.Aug 1 2016, 11:05 AM

LGTM. Thanks for the great untangle work!

This revision was automatically updated to reflect the committed changes.