This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] widen a load with subvector insert
ClosedPublic

Authored by spatel on Nov 3 2022, 8:11 AM.

Details

Summary

This adapts/copies code from the existing fold that allows widening of a scalar load + insert. It can help in IR because it removes a shuffle, and the backend can already narrow loads if that is profitable in codegen.

We might be able to consolidate more of the logic, but handling this basic pattern should be enough to make a small difference on one of the motivating examples from issue #17113. The final goal of combining the loads in those patterns is not solved by this patch, though.
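For reference, here is a hypothetical before/after sketch of the pattern this fold targets (function names and attributes are illustrative, not taken from the patch's tests):

```llvm
; Before: a half-width load whose result is widened by a shuffle.
define <8 x float> @widen(ptr align 16 dereferenceable(32) %p) {
  %l = load <4 x float>, ptr %p, align 16
  %s = shufflevector <4 x float> %l, <4 x float> poison,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x float> %s
}

; After: the shuffle is removed and the load is widened. The extra
; lanes were undef in the mask, so loading them is fine as long as the
; wider access is known dereferenceable and the cost model agrees.
define <8 x float> @widen.folded(ptr align 16 dereferenceable(32) %p) {
  %l = load <8 x float>, ptr %p, align 16
  ret <8 x float> %l
}
```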

Diff Detail

Event Timeline

spatel created this revision.Nov 3 2022, 8:11 AM
Herald added a project: Restricted Project.Nov 3 2022, 8:11 AM
spatel requested review of this revision.Nov 3 2022, 8:11 AM
spatel added inline comments.Nov 3 2022, 8:15 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
131

Note: I created this helper function to reduce code duplication. I didn't push that NFC change to main yet though (pending feedback here to do something different).

Thanks for looking at this!

How useful/feasible would this be to support inserting into an upper subvector as well?

define <8 x float> @upper(ptr dereferenceable(32) %x) {
  %offset = getelementptr inbounds float, ptr %x, i64 4
  %load = load <4 x float>, ptr %offset, align 4
  %insert = shufflevector <4 x float> undef, <4 x float> %load, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %insert
}
spatel added a comment.Nov 6 2022, 6:42 AM

> Thanks for looking at this!
>
> How useful/feasible would this be to support inserting into an upper subvector as well?
>
> define <8 x float> @upper(ptr dereferenceable(32) %x) {
>   %offset = getelementptr inbounds float, ptr %x, i64 4
>   %load = load <4 x float>, ptr %offset, align 4
>   %insert = shufflevector <4 x float> undef, <4 x float> %load, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   ret <8 x float> %insert
> }

I don't have a real example for that, but yes, it's doable. It would be similar to enhancements that we made in VectorCombine::vectorizeLoadInsert(), so we'd hopefully be able to share some more code for the two patterns. We have to (very carefully...) allow peeking through a GEP and translating that as a shuffle index offset. I can add a TODO.
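For the upper-subvector case, the GEP offset would have to be folded into the shuffle mask, roughly like this (hypothetical output for the @upper example above; not what the patch currently produces):

```llvm
; Load all 32 bytes from the base pointer and select elements 4..7,
; i.e. the GEP offset of 4 floats becomes a shuffle index offset.
; Note that this mask is an identity-with-undef mask, so the shuffle
; could fold away entirely, leaving just the wide load.
define <8 x float> @upper.widened(ptr dereferenceable(32) %x) {
  %wide = load <8 x float>, ptr %x, align 4
  %insert = shufflevector <8 x float> %wide, <8 x float> poison,
            <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %insert
}
```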

RKSimon added inline comments.Nov 6 2022, 7:36 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
279

Do we need to check for cases where the identity comes from the second operand? IIRC we got burnt by something similar recently.

spatel planned changes to this revision.Nov 6 2022, 7:57 AM
spatel marked an inline comment as done.
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
279

Good catch. If we have a very non-canonical shuffle where both operands are valid loads, but we're actually only shuffling elements from op1 (op0 is unused), this would use the wrong operand.
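In IR, the non-canonical form in question looks something like this (illustrative test, not copied from the patch):

```llvm
; The load is operand 1 of the shuffle; mask indices 4..7 select its
; elements 0..3, while operand 0 (poison) is never used. A matcher
; that only looks at operand 0 would widen the wrong value here.
define <8 x float> @identity_from_op1(ptr align 16 dereferenceable(32) %p) {
  %l = load <4 x float>, ptr %p, align 16
  %s = shufflevector <4 x float> poison, <4 x float> %l,
       <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x float> %s
}
```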

spatel updated this revision to Diff 473502.Nov 6 2022, 8:55 AM
spatel marked an inline comment as done.

Patch updated:
There are no guarantees about what shuffles the other vectorizer passes can produce, and we don't know that instcombine runs before this pass, so I added code and a test to handle a non-canonical identity shuffle that chooses from operand 1 of the shuffle.

arsenm added inline comments.Nov 6 2022, 9:38 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

Shouldn't need to create an addrspacecast here

spatel added inline comments.Nov 7 2022, 8:13 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

I copied this line from the existing fold for a load+insertelt, and that was last changed with D121787.

Is that not relevant with this transform and/or the change to opaque pointers? I'm not familiar with all of the addrspace corner-cases, so I'm not sure what to do here.

Add tests derived from D121787?

define <4 x i32> @load_from_other_as(ptr addrspace(5) align 16 dereferenceable(16) %p) {
  %asc = addrspacecast ptr addrspace(5) %p to ptr
  %l = load <2 x i32>, ptr %asc, align 4
  %s = shufflevector <2 x i32> %l, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  ret <4 x i32> %s
}
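If the transform strips the addrspacecast, it has to cast the stripped pointer back before creating the wide load; conceptually the result would look like this (hypothetical output, with the alignment taken from the parameter attributes):

```llvm
; Widened form: the underlying addrspace(5) pointer is known
; dereferenceable(16) align 16, but the load itself must stay on a
; generic ('ptr') pointer, hence the re-inserted addrspacecast.
define <4 x i32> @load_from_other_as.widened(ptr addrspace(5) align 16 dereferenceable(16) %p) {
  %asc = addrspacecast ptr addrspace(5) %p to ptr
  %w = load <4 x i32>, ptr %asc, align 16
  ret <4 x i32> %w
}
```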
arsenm added inline comments.Nov 7 2022, 8:21 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

Opaque pointers are unrelated to the address space; there's no change there.

I don't see how it's relevant for this transform. You're widening a load, which should always produce a load with the same address space as the original load. The only cast you should need here is the element bitcast for typed pointers.

spatel added inline comments.Nov 7 2022, 9:51 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

I think the issue is that stripPointerCasts() will peek through addrspacecasts.

So if we do that, then we need to cast the source pointer back to the required destination addrspace. There was a comment in the code about this before D121787 changed the code to include a cast.

It's not clear to me from the descriptions if we can use a different stripPointer* API to avoid the issue. But if we do that, then it would be better to change the existing code too, so these 2 transforms are not diverging in implementation.

I'll add a test with addrspacecast and update here, so we have some test coverage for this.

arsenm added inline comments.Nov 7 2022, 9:56 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

OK, I didn't realize this was stripping away casts. Per the new LangRef address space rules, it should be legal to change the address space of a non-volatile, known-dereferenceable pointer back to the original address space. However, I don't believe this pass should be responsible for changing it here. I'd expect pure load widening not to look through addrspacecast.

spatel updated this revision to Diff 473724.Nov 7 2022, 9:57 AM

Rebased with a new test (last test/diff in the test file) that includes addrspacecast. This will crash if we don't re-insert a cast.

I think we could reduce the wordiness by calling CreatePointerCast() rather than CreateBitCastOrAddrSpaceCast(), but I'd make that change to the existing code too to keep these in sync (assuming that works in the other transform).

> Rebased with a new test (last test/diff in the test file) that includes addrspacecast. This will crash if we don't re-insert a cast.
>
> I think we could reduce the wordiness by calling CreatePointerCast() rather than CreateBitCastOrAddrSpaceCast(), but I'd make that change to the existing code too to keep these in sync (assuming that works in the other transform).

CreateBitCast should be adequate. If you need to, there's stripPointerCastsSameRepresentation

> CreateBitCast should be adequate. If you need to, there's stripPointerCastsSameRepresentation

Hmm...CreateBitCast crashes with:

"Assertion failed: (castIsValid(op, S, Ty) && "Invalid cast!")"

stripPointerCastsSameRepresentation() avoids the crash, but then we don't capture the expected alignment from the ptr param.

> stripPointerCastsSameRepresentation() avoids the crash, but then we don't capture the expected alignment from the ptr param.

For alignment purposes, stripPointerCasts is OK.

However, why does the vector combiner need to look at the underlying pointer for the alignment? Doesn't instcombine increase the alignment based on the underlying pointer already, such that this can just read the alignment direct from the instruction?

> However, why does the vector combiner need to look at the underlying pointer for the alignment? Doesn't instcombine increase the alignment based on the underlying pointer already, such that this can just read the alignment direct from the instruction?

It's correct that InstCombine improves alignment like that, but it's similar reasoning to why we handled a non-canonical shuffle mask in this patch: vector-combine isn't guaranteed to see canonical code. Currently, it's sitting directly after SLP in the normal opt pipelines, and there's a good chance that SLP has created non-canonical code. We could adjust the opt pipeline to avoid that, but then we increase compile-time, so there's no easy answer AFAIK.
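A small example of the non-canonical alignment case being discussed (illustrative; instcombine would normally raise the align on the load itself):

```llvm
; The load only claims align 4, but the argument guarantees align 16.
; If the widening fold reads the alignment straight off the
; instruction, the new <8 x float> load gets the weaker align 4;
; computing known alignment from %p recovers align 16.
define <8 x float> @underaligned(ptr align 16 dereferenceable(32) %p) {
  %l = load <4 x float>, ptr %p, align 4
  %s = shufflevector <4 x float> %l, <4 x float> poison,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x float> %s
}
```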

RKSimon accepted this revision.Nov 9 2022, 8:35 AM

The addrspacecast issue seems to be independent of the rest of the patch, and it matches the equivalent code we already have in vectorizeLoadInsert.

I don't know whether there's a suitable bug you could raise, but it shouldn't stop us getting this in.

This revision is now accepted and ready to land.Nov 9 2022, 8:35 AM
This revision was landed with ongoing or failed builds.Nov 10 2022, 11:13 AM
This revision was automatically updated to reflect the committed changes.