This is an archive of the discontinued LLVM Phabricator instance.

[LV] Don't mark pointers used by scalarized memory accesses uniform
ClosedPublic

Authored by mssimpso on Sep 6 2016, 11:21 AM.

Details

Summary

Previously, all consecutive pointers were marked uniform after vectorization. However, if a consecutive pointer is used by a memory access that is eventually scalarized, the pointer may not remain uniform after all. An example is predicated stores. Even though a predicated store may be consecutive, it will still be scalarized, making it's pointer operand non-uniform.

This patch updates the logic in collectLoopUniforms to consider the cases where a memory access may be scalarized. If a memory access may be scalarized, its pointer operand is not marked uniform.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 70435.Sep 6 2016, 11:21 AM
mssimpso retitled this revision from to [LV] Don't mark pointers used by scalarized memory accesses uniform.
mssimpso updated this object.
mssimpso added reviewers: mkuper, wmi, anemet.
mkuper edited edge metadata.Sep 6 2016, 10:53 PM

Thanks, Matt!

lib/Transforms/Vectorize/LoopVectorize.cpp
5290 ↗(On Diff #70435)

I think this will also return true for "reverse consecutive".
Is that true? If so, is it intentional?

5309 ↗(On Diff #70435)

Any chance this and vectorizeMemoryInstruction can share code?
Having this duplicated scares me a bit.

mssimpso added inline comments.Sep 7 2016, 7:42 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5290 ↗(On Diff #70435)

Good point. I'll check this out and add a test case, regardless.

5309 ↗(On Diff #70435)

Sure, we can pull this out into a shared helper function.

5315–5317 ↗(On Diff #70435)

Ideally, we could share this check with vectorizeMemoryInstruction as well. But the version in vectorizeMemoryInstruction uses the actual VF. We don't have the VF available to us yet in Legality. And I think the VF selection is to some extent affected by the uniforms. So the check here is essentially the same as the one in vectorizeMemoryInstruction, but with VF = 1. I'm wondering if it makes sense to replace the check in vectorizeMemoryInstruction with this one. Do you know of any cases that we would miss vectorizing?

mssimpso updated this revision to Diff 70542.Sep 7 2016, 9:00 AM
mssimpso edited edge metadata.
mssimpso marked 3 inline comments as done.

Addressed Michael's comments. Thanks!

  • Added a test case for reverse consecutive pointers. Reverse consecutive pointers and the pointers for reverse interleaved groups remain uniform.
  • Added a shared helper for predicated stores
  • Added a shared helper for irregular types (parameterized by VF)
mssimpso updated this revision to Diff 70557.Sep 7 2016, 10:10 AM

Fixed a typo.

mkuper added inline comments.Sep 7 2016, 10:56 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5337 ↗(On Diff #70557)

Now that I thought about this a bit more, checking this with "1" is probably not very useful. Maybe 2? (since, as discussed above, we don't have the final VF). Although I'm not 100% sure 2 makes sense either.

5326 ↗(On Diff #70542)

What I meant was that I would have preferred a single shared function that performs all of these checks, so that it's easier to keep memoryInstructionMayBeScalarized and vectorizeMemoryInstruction in sync.
(I didn't notice the VF issue, but you could pass VF as part of the interface to that shared helper).

5332–5334 ↗(On Diff #70542)

I think in vectorizeMemoryInstruction you really want VF.

Let's say you have i20, and the ABI alignment for i20 is 1 byte.
Assuming I didn't get the math wrong:
For scalars, you'll have ScalarAllocatedSize == VectorElementSize == 3, so the type is regular.
But for VF=4, you'll DL.getTypeStoreSize(VectorTy) == 10, so VectorElementSize ==2, and the type is irregular.

Which is exactly what we want to happen - storing 4 consecutive i20s writes 12 bytes, not 10, so we can't vectorize the store.

mssimpso added inline comments.Sep 7 2016, 11:40 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5326 ↗(On Diff #70557)

Ah, I see. Yes, I think we can move all of this into a single function.

5332–5334 ↗(On Diff #70557)

Oh I see. I think I was misinterpreting the difference between allocated size and store size. But aren't we really just wanting to know if there's padding been successive elements of a given type? That is, in your example if we have a <4 x i20>, we can't just store this type to memory without the 4 bits between each element.

What about DL.getTypeSizeInBits(ScalarType) != DL.getTypeAllocSizeInBits(ScalarType)? Using "type size" instead of "store size" should prevent rounding up to the next byte. For the i20 example, we would have 20 != 24, so the type is irregular.

mssimpso added inline comments.Sep 7 2016, 11:55 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5332–5334 ↗(On Diff #70557)

What about DL.getTypeSizeInBits(ScalarType) != DL.getTypeAllocSizeInBits(ScalarType)?

After further digging, I think this is indeed equivalent to the existing code in vectorlizeMemoryInstruction. The reason is that DL.getTypeStoreSize(VectorType) is implemented as VF * DL.getTypeSizeInBits(ElementType) in DataLayout.h. So the VF's cancel. I'll update the patch accordingly.

mssimpso updated this revision to Diff 70601.Sep 7 2016, 2:33 PM
mssimpso marked 2 inline comments as done.

Addressed Michael's comments. Thanks!

  • Pulled out all scalarization checks from vectorizeMemoryInstruction into shared function.
  • Updated irregular type check to remove VF as previously mentioned.
  • Added some additional comments.
mkuper added a comment.Sep 7 2016, 3:23 PM

This LGTM, except for the type size bit, which I'm still not sure I completely understand.

lib/Transforms/Vectorize/LoopVectorize.cpp
5325 ↗(On Diff #70601)

I'm still not 100% convinced.

I mean, I'm fairly sure this is correct, I'm just not sure this doesn't miss any cases we would have caught before, where the sizes in bits don't match, but the rounded sizes do. (And to be honest, I'm not sure what the correct behavior is anymore)

To be more explicit originally we were comparing:
alignTo((getTypeSizeInBits(Ty) + 7) / 8, getABITypeAlignment(Ty))
to
(((VF * getTypeSizeInBits(Ty)) + 7) / 8) / VF

And now we're comparing:
alignTo((getTypeSizeInBits(Ty) + 7) / 8, getABITypeAlignment(Ty)) * 8
to
getTypeSizeInBits(Ty)

Are you absolutely sure these are equivalent (or the second one is "more correct")?

mssimpso added inline comments.Sep 8 2016, 9:01 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5325 ↗(On Diff #70601)

Michael,

Thanks for digging deeper on this issue. I appreciate it! You've convinced me that these implementations aren't doing the same thing. Which one is "more correct" is unfortunately debatable at this point.

I think it helps somewhat to look at the original patch (rL180196, PR15758). The intent it seems was to ensure an array of type ScalarTy is "bitcast compatible" with a vector of type ScalarTy. It would seem the most straightforward implementation of that concept is:

VF * DL.getTypeAllocSize(ScalarTy) == DL.getTypeStoreSize(VectorTy)  // (1)

(The existing code moved the VF to the other side of the equation, but the floor division looks a little suspicious to me).

The equivalence of the two implementations (and whether we need the VF or not) boils down to whether vector types are packed or not. If we always assume vectors are packed, we can just check to see if there's padding in the array representation, and the VF is not needed. This is what I assumed was the case. However, it doesn't look like we've made a clear decision about the packed question yet (PR1784). The recent discussions on the mailing list about <N x i1> types seem related as well.

So I think the best thing to do here, at least from vectorizeMemoryInstruction, is to keep the type check as it is for now.

But since the check in collectLoopUniforms doesn't yet have the VF, I think we should be conservative and only mark an instruction uniform if we are sure it will be. The most conservative form of the type check is the one where we just check for padding in the array representation. This is because if:

DL.getTypeAllocSizeInBits(ScalarTy) == DL.getTypeSizeInBits(ScalarTy)  // (2)

The other equation must be true as well. So I think it makes sense to use (1) from vectorizeMemoryInstruction and getInstructionCost* and (2) from collectLoopUniforms. *I forgot to replace the scalarization checks in the cost model with the shared function.

I'll make these updates and post a new version of the patch. Thanks!

mssimpso updated this revision to Diff 70724.Sep 8 2016, 10:44 AM

Implemented changes mentioned in my last comment.

  • Restored previous type check when VF is known. We use a conservative estimate when VF is not known.
  • Updated getInstructionCost to use the shared scalarization function.
mkuper accepted this revision.Sep 8 2016, 11:15 AM
mkuper edited edge metadata.

LGTM

This revision is now accepted and ready to land.Sep 8 2016, 11:15 AM

Thanks, Michael!

This revision was automatically updated to reflect the committed changes.