
[VectorCombine] try to create vector loads from scalar loads
ClosedPublic

Authored by spatel on Jun 12 2020, 2:29 PM.

Details

Summary

This should allow this pass or others to fold subsequent scalar ops together more easily because they will see extractelement ops from a single vector rather than incomplete parts of that vector.

Currently, this transform makes no overall difference for the most basic patterns because the backend (DAGCombiner) will narrow the loads back down to scalars via narrowExtractedVectorLoad().

For now, we do not get any differences for scalar integer loads because those extracts are not free. We will need to match larger patterns and/or adjust the cost equation to allow that.

Diff Detail

Event Timeline

spatel created this revision.Jun 12 2020, 2:29 PM
Herald added a project: Restricted Project.Jun 12 2020, 2:29 PM

lebedev.ri added inline comments.

Some thoughts.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
57

I'm not sure why we'd care whether the load is of a bitcast.
Why are we using the bitcast's source type as the source of truth?
Are we trying to avoid introducing some cache issues?

I'd think we should instead assess (check, brute-force) each possible wider
load type: first check the cost, then isSafeToLoadUnconditionally().

101

Shouldn't this be +=?

120–121

++NumLoadsVectorized;

nikic added a subscriber: nikic.Jun 13 2020, 1:00 PM

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

lebedev.ri requested changes to this revision.Jun 14 2020, 1:38 AM

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

Ah, good point, too.

This revision now requires changes to proceed.Jun 14 2020, 1:38 AM

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

Ok, let me see if I can rework this using just the cost model. This patch started within InstCombine, so we didn't have access to costs and didn't want to do the transform too loosely. The bitcast was used as a proxy for "cost effective" - it indicated that either the original code or the vectorizers had validated the vector type as a legitimate type for the target.

spatel updated this revision to Diff 280132.Thu, Jul 23, 8:13 AM
spatel retitled this revision from [VectorCombine] try to create vector loads from bitcasted scalar pointers to [VectorCombine] try to create vector loads from scalar loads.
spatel edited the summary of this revision. (Show Details)
spatel marked 3 inline comments as done.Thu, Jul 23, 8:16 AM

Patch updated - this uses the target, cost model, and load attributes only (not pointer types/casts) to decide if we can create a vector load.

nikic added inline comments.Thu, Jul 23, 2:26 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
86

Will other middle-end passes be able to handle this equally well? I don't have anything specific in mind here, but I would suspect that some passes can deal with a plain load better than with a bitcast + load + extractelement sequence.

spatel marked 2 inline comments as done.Fri, Jul 24, 9:29 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
86

Other middle-end passes almost certainly will *not* handle this as well. :)
It's not quite the same pattern/problem since we're creating vector ops here, but that's what led to removing the generic LoadCombine IR pass ( http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html ).
I'm assuming that VectorCombine is running late enough (after GVN, etc.) that we've already done all of the general IR optimizations that can be done with the narrow ops. I could try to cobble together some PhaseOrdering tests to enforce that, but they would be negative tests currently.

spatel marked an inline comment as done.Fri, Jul 31, 8:22 AM

Ping.

RKSimon added inline comments.Sat, Aug 1, 5:10 AM
llvm/test/Transforms/VectorCombine/X86/load.ll
60 ↗(On Diff #280132)

do we have test coverage for non-zero gep indices?

spatel marked an inline comment as done.Sat, Aug 1, 7:20 AM
spatel added inline comments.
llvm/test/Transforms/VectorCombine/X86/load.ll
60 ↗(On Diff #280132)

No, that was missing. Added with rGd620a6fe98f7.

spatel updated this revision to Diff 282395.Sat, Aug 1, 7:26 AM
spatel marked an inline comment as done.

Patch updated:
No code changes, but added tests for non-zero gep offsets. This actually works in more cases than I was expecting, given that we're only using the base stripPointerCasts(). If there are enough dereferenceable bytes, isSafeToLoadUnconditionally() can still allow widening the load at the gep offset.

nikic added inline comments.Sat, Aug 1, 8:15 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
86

I'm not sure how safe that assumption is. For example, in Rust it is possible for code to go through two ThinLTO links at different hierarchy levels. In that case the first one would convert everything to vector loads (given ample dereferenceability information), and the second one might have trouble optimizing based on that.

spatel updated this revision to Diff 283886.Fri, Aug 7, 6:27 AM

Patch updated:
Adjusted to match the most basic pattern that starts with an insertelement (so no extract is created here). Hopefully, that removes any concern about interactions with other passes. I.e., the transform should almost always be profitable. (We could argue that this could be part of canonicalization, but we conservatively try not to create vector ops from scalar ops in passes like instcombine.)

LGTM with one minor.

@nikic Does this look OK now?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
84

Since you're inserting into undef, this is really a BUILD_VECTOR - you might get better results with getScalarizationOverhead?

spatel updated this revision to Diff 283925.Fri, Aug 7, 9:15 AM

Patch updated:
Use TTI.getScalarizationOverhead() to model insert cost of original code. I had not used this API before and the documentation comment isn't entirely clear to me, so please see if that looks as intended.

nikic added a comment.Fri, Aug 7, 9:39 AM

@nikic Does this look OK now?

Looks good to me as well.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
64

assert / use cast<>? Don't think the pointer operand can have a non-pointer type.

RKSimon accepted this revision.Sat, Aug 8, 1:24 AM

LGTM - the getScalarizationOverhead() change is OK

lebedev.ri accepted this revision.Sat, Aug 8, 1:31 AM
This revision is now accepted and ready to land.Sat, Aug 8, 1:31 AM
xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
64

Not resolved?

spatel updated this revision to Diff 284132.Sat, Aug 8, 8:30 AM
spatel marked 2 inline comments as done.

Patch updated:
Changed pointer check to an assert.

xbolva00 accepted this revision.Sat, Aug 8, 8:34 AM
srj added a subscriber: srj. (Edited) Tue, Aug 11, 3:57 PM

This appears to have broken some of Halide's codegen for Hexagon/HVX; as of this revision, some of our tests are now failing with

llvm/lib/IR/Type.cpp:617: static llvm::FixedVectorType* llvm::FixedVectorType::get(llvm::Type*, unsigned int): Assertion `NumElts > 0 && "#Elements of a VectorType must be greater than 0"' failed.

(It's not yet clear whether this is a regression on LLVM's part or a change that reveals a latent bug in Halide; I'm investigating to determine which.)

srj added a comment.Tue, Aug 11, 5:05 PM

Update: it appears that VectorSize (from TTI.getMinVectorRegisterBitWidth()) is zero in this case, which causes the assertion failure.

This appears to be the case because HexagonTTIImpl::getMinVectorRegisterBitWidth() returns 0 if useHVX() isn't true... and useHVX() returns false if the HexagonAutoHVX option isn't enabled.

By design, Halide doesn't enable the HexagonAutoHVX option; we like to do all the vectorization ourselves.

I'm not sure how to resolve this issue -- the flaw here seems to lie in HexagonTTIImpl's assumption that disabling HexagonAutoHVX should cause it to report zero-width vectors. That seems to be a dubious decision, and one that is at odds with every other implementation of getMinVectorRegisterBitWidth() that I see in trunk LLVM (none of them appear to ever return 0).

Would it make sense to consider backing out this change until this can be resolved, since it clearly appears to have bad consequences for Hexagon/HVX codegen?

In D81766#2211936, @srj wrote:

Update: it appears that VectorSize (from TTI.getMinVectorRegisterBitWidth()) is zero in this case, which causes the assertion failure.

This appears to be the case because HexagonTTIImpl::getMinVectorRegisterBitWidth() returns 0 if useHVX() isn't true... and useHVX() returns false if the HexagonAutoHVX option isn't enabled.

By design, Halide doesn't enable the HexagonAutoHVX option; we like to do all the vectorization ourselves.

I'm not sure how to resolve this issue -- the flaw here seems to lie in HexagonTTIImpl's assumption that disabling HexagonAutoHVX should cause it to report zero-width vectors. That seems to be a dubious decision, and one that is at odds with every other implementation of getMinVectorRegisterBitWidth() that I see in trunk LLVM (none of them appear to ever return 0).

Would it make sense to consider backing out this change until this can be resolved, since it clearly appears to have bad consequences for Hexagon/HVX codegen?

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

srj added a comment.Tue, Aug 11, 5:17 PM

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)

In D81766#2211952, @srj wrote:

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)

rGb0b95dab1ce2
I'll see if I can find a better API and/or test case for that tomorrow.

Test added here:
rGb97e402ca5ba

The Loop and SLP vectorizers check this:

// If the target claims to have no vector registers don't attempt
// vectorization.
if (!TTI->getNumberOfRegisters(TTI->getRegisterClassForType(true)))
  return false;

So I should probably add that check to this pass too to be safer.

The Hexagon issue is fixed in rGa2dc19b81b1e.