This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] [AMDGPU] Allow vector loads in MatchLoadCombine
ClosedPublic

Authored by jrbyrnes on Sep 9 2022, 8:59 AM.

Details

Summary

Since SROA chooses promotion based on the loads / stores that reach an alloca, we may run into scenarios in which we alloca a vector but promote it to an integer. The result is the familiar LoadCombine pattern (i.e. ZEXT, SHL, OR); however, instead of coming directly from distinct loads, the elements to be combined come from ExtractVectorElements that stem from a shared vector load.

This patch identifies such a pattern and combines it into a load.
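For illustration, a minimal hypothetical IR shape of the pattern described above (little-endian byte order assumed; names and types are invented for this sketch and are not taken from the patch's tests):

```
define i32 @assemble_i32(ptr %p) {
  ; one shared vector load feeds every extract
  %vec = load <4 x i8>, ptr %p, align 4
  %e0 = extractelement <4 x i8> %vec, i32 0
  %e1 = extractelement <4 x i8> %vec, i32 1
  %e2 = extractelement <4 x i8> %vec, i32 2
  %e3 = extractelement <4 x i8> %vec, i32 3
  ; the familiar LoadCombine shape: zext, shl, or
  %z0 = zext i8 %e0 to i32
  %z1 = zext i8 %e1 to i32
  %z2 = zext i8 %e2 to i32
  %z3 = zext i8 %e3 to i32
  %s1 = shl i32 %z1, 8
  %s2 = shl i32 %z2, 16
  %s3 = shl i32 %z3, 24
  %o1 = or i32 %z0, %s1
  %o2 = or i32 %o1, %s2
  %o3 = or i32 %o2, %s3
  ; ideally the whole chain folds to a single i32 load of %p
  ret i32 %o3
}
```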

Change-Id: I0bc06588f11e88a0a975cde1fd71e9143e6c42dd

Diff Detail

Event Timeline

jrbyrnes created this revision. Sep 9 2022, 8:59 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 9 2022, 8:59 AM
jrbyrnes requested review of this revision. Sep 9 2022, 8:59 AM
arsenm added inline comments. Sep 9 2022, 10:15 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7970

Why implicitly cast to int to assert the value? Just use uint64_t?

7978

Early return on the condition instead of wrapping it in a ternary operator

llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll
2

Don’t need -O3?

jrbyrnes updated this revision to Diff 459217. Sep 9 2022, 3:57 PM
jrbyrnes marked an inline comment as done.

Address comments.

jrbyrnes marked 2 inline comments as done. Sep 9 2022, 3:58 PM

Adding reviewers for additional perspective.

jmmartinez added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8000

Is it normal that the condition is checking whether VectorIndex is 0? Shouldn't it be auto BPVectorIndex = VectorIndex.value_or(0U); instead?

8314

This variable seems unused.

Precommit combine-vload-extract.ll with current (trunk) codegen and rebase the patch to show the codegen diff.

Both SLP and VectorCombine should try to make patterns like this better in IR, so there might be some target cost/legality checks that need adjusting.
There's also an in-progress patch for -aggressive-instcombine that could be relevant:
D127392

Would it be better to transform this before codegen?
https://alive2.llvm.org/ce/z/uyxHSW

jrbyrnes added a comment (edited). Sep 19 2022, 10:59 AM

Both SLP and VectorCombine should try to make patterns like this better in IR, so there might be some target cost/legality checks that need adjusting.
There's also an in-progress patch for -aggressive-instcombine that could be relevant:
D127392

Would it be better to transform this before codegen?
https://alive2.llvm.org/ce/z/uyxHSW

Hi, thanks for your comment! I tagged you because you seem to be involved in the issues most closely related to the one here (D67841, https://bugs.llvm.org/show_bug.cgi?id=42708). It seems the conclusion there was to have vectorization passes (and optimization passes in general) leave LoadCombine patterns untouched and resolve them in the backend, no? That was the logic I used for the design here.

On the other hand, it seems D127392 is using the opposite design approach. Is the current consensus to do load combining in the optimizer?

At a glance, D127392 will not address the issue identified here because it does not handle vector loads.

Both SLP and VectorCombine should try to make patterns like this better in IR, so there might be some target cost/legality checks that need adjusting.
There's also an in-progress patch for -aggressive-instcombine that could be relevant:
D127392

Would it be better to transform this before codegen?
https://alive2.llvm.org/ce/z/uyxHSW

Hi, thanks for your comment! I tagged you because you seem to be involved in the issues most closely related to the one here (D67841, https://bugs.llvm.org/show_bug.cgi?id=42708). It seems the conclusion there was to have vectorization passes (and optimization passes in general) leave LoadCombine patterns untouched and resolve them in the backend, no? That was the logic I used for the design here.

On the other hand, it seems D127392 is using the opposite design approach. Is the current approach to do load combining in the optimizer?

LLVM has gone back and forth on this. There was a general load combine pass for IR, but it was removed because it interfered with other transforms in IR. So we started hacking away at codegen instead, but there are programs where doing the transform in codegen is too late to get the optimal results. So we have some limited transforms in the vectorization passes, and now we're trying to reintroduce load combining as a canonicalization (but in very limited cases and gated by target-specific legality checks).

At a glance, D127392 will not address the issue identified here because it does not handle vector loads.

Right - getting that to work correctly on the most basic integer load patterns is the first step, but we could enhance the transform for more cases (hopefully without too much work).

LLVM has gone back and forth on this. There was a general load combine pass for IR, but it was removed because it interfered with other transforms in IR. So we started hacking away at codegen instead, but there are programs where doing the transform in codegen is too late to get the optimal results. So we have some limited transforms in the vectorization passes, and now we're trying to reintroduce load combining as a canonicalization (but in very limited cases and gated by target-specific legality checks).

With this in mind, perhaps the most consistent / best way to handle this pattern is to catch it in CodeGen (this patch) and, in a separate patch, handle this pattern in a vectorization / instcombine pass (gated by legality checks). It seems to me that catching it in CodeGen will only help things (e.g. in scenarios where it is not handled by transform passes).

LLVM has gone back and forth on this. There was a general load combine pass for IR, but it was removed because it interfered with other transforms in IR. So we started hacking away at codegen instead, but there are programs where doing the transform in codegen is too late to get the optimal results. So we have some limited transforms in the vectorization passes, and now we're trying to reintroduce load combining as a canonicalization (but in very limited cases and gated by target-specific legality checks).

With this in mind, perhaps the most consistent / best way to handle this pattern is to catch it in CodeGen (this patch) and, in a separate patch, handle this pattern in a vectorization / instcombine pass (gated by legality checks). It seems to me that catching it in CodeGen will only help things (e.g. in scenarios where it is not handled by transform passes).

Sure - I didn't look at the diffs closely, but I don't object to improving the SDAG implementation. Just wanted to let you know that there are other potential places to try this kind of transform.

jrbyrnes updated this revision to Diff 461322. Sep 19 2022, 12:46 PM
jrbyrnes marked 2 inline comments as done.

Address review comments -- update usage of Optional API.

Sure - I didn't look at the diffs closely, but I don't object to improving the SDAG implementation. Just wanted to let you know that there are other potential places to try this kind of transform.

Thanks, I appreciate your feedback & thoughts on this -- especially with regard to the approach -- as you seem to be very knowledgeable about the design decisions for this issue.

Anyway, it sounds like the approach for now is to land this upstream. Are there additional thoughts about the implementation details, or is this ready?

spatel added inline comments. Sep 22 2022, 9:35 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7885–7889

I added more tests with:
ef7d61d67cb9
...so please update the auto-generated checks there.

It might help to add some test or code comments to explain the transforms in those examples. Just looking at this code, I can't tell what the relationship is between the 3 index parameters.

7899

and -> an?

8224

Can this just be updated to use getScalarSizeInBits()?

jrbyrnes updated this revision to Diff 462296. Sep 22 2022, 2:38 PM
jrbyrnes marked 2 inline comments as done.

Address review comments.

Update tests in ef7d61d67cb9

jrbyrnes added inline comments. Sep 22 2022, 2:39 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7885–7889

Thanks for taking a look! I updated the tests you added; please let me know if you need additional info on the results.

spatel added inline comments. Sep 23 2022, 7:39 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7894

Check that Depth != 0 instead of adding the Root parameter?

8293–8294

Still need to resolve this - if this can't be:

unsigned LoadBitWidth = P.Load->getMemoryVT().getScalarSizeInBits();

...then please add a test to demonstrate how that fails or a code comment to explain why the simpler code is not valid.

8342–8343

Do we have a test where VectorOffset is non-zero? If not, please add one (add baseline tests as needed, no pre-commit review is necessary).

jrbyrnes updated this revision to Diff 462504. Sep 23 2022, 8:34 AM
jrbyrnes marked 3 inline comments as done.

Address review comments (remove unnecessary "Root" parameter).

I'm still not confident in my understanding of the various index values even with the added code comments.
I'll try to step through some of these tests in the debugger to get a better idea, but it would be good if another reviewer could have a look too for a second opinion.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8342–8343

This is marked as done, but I can't tell from just looking at the tests which one exercises that path. Add an example/comment here or next to a test, so it's clearer what this looks like?

RKSimon added inline comments. Sep 27 2022, 8:24 AM
llvm/include/llvm/CodeGen/SelectionDAGAddressAnalysis.h
52

Would it be better to replace this with addToOffset() that adds to the existing offset (or set it if == None)?

jrbyrnes updated this revision to Diff 463269. Sep 27 2022, 10:02 AM
jrbyrnes marked 2 inline comments as done.

replace setOffset with addToOffset

I'm still not confident in my understanding of the various index values even with the added code comments.
I'll try to step through some of these tests in the debugger to get a better idea, but it would be good if another reviewer could have a look too for a second opinion.

Hi -- thanks for your comments and reviews. I realize I never submitted my last round of comments, so I've done so here. Apologies for the delay -- this has probably added to the incomplete understanding.

I have included a description of the various index parameters, with an example, in an inline comment.

llvm/include/llvm/CodeGen/SelectionDAGAddressAnalysis.h
52

That does make more sense, thanks!

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7894

Thanks for pointing this out!

In fact, this patch did not add the Root parameter, but it does look unnecessary, so I think it makes sense to remove it along with the other changes in this patch.

8342–8343

The non-zero VectorOffset path is covered by any test that uses an ExtractElement with a non-zero index. As an example, extractelement <4 x i8> %ld, i32 1 will have a VectorOffset of 1.

CalculateByteProvider will only allow such a VectorOffset if we are trying to provide for byte 1. We enforce this by making sure that StartingIndex == VectorOffset. The idea is to match bytes {0, 1, 2, 3} with VectorLoad -> ExtractElement {0, 1, 2, 3}. If we find an ExtractElement index that does not match the VectorOffset, we conservatively assume that we are shuffling the elements in the vector and cannot combine into a load.

So this covers the VectorIndex (i.e. VectorOffset) and StartingIndex parameters in CalculateByteProvider. The remaining parameter is the original Index parameter, which ensures that the shift and load width we find are able to provide for the relevant byte. For example, if we are trying to provide for byte 2, then we must find either a load of 8+ bits -> SHL 16, a load of 16+ bits -> SHL 8, or a load of 24+ bits. Basically, the combination of shift and byte width must cover the byte we are trying to provide for. If so, then the check (Index >= NarrowByteWidth) will be false, and we will return the ByteProvider.
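
To make the relationship concrete, here is a hypothetical fragment (the function name and types are invented for this sketch) annotated with how the parameters line up for byte 1:

```
define i32 @provide_byte1(ptr %p) {
  %vec = load <4 x i8>, ptr %p, align 4
  %e1  = extractelement <4 x i8> %vec, i32 1  ; VectorOffset = 1
  %z1  = zext i8 %e1 to i32
  %s1  = shl i32 %z1, 8                       ; puts the loaded byte at byte 1 of the result
  ret i32 %s1
}
```

Roughly, when CalculateByteProvider is asked who provides byte 1 of the result (StartingIndex = 1), looking through the shl of 8 leaves Index asking for byte 0 of the zext'd value, so Index >= NarrowByteWidth is false and that check passes; the extract is then accepted only because its VectorOffset equals StartingIndex. An extract with index 2 in this position would be treated as a shuffle and rejected.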

RKSimon added inline comments. Sep 27 2022, 10:36 AM
llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll
75

Please can you pre-commit these to trunk and then rebase to show the codegen change from this patch?

jrbyrnes marked an inline comment as done. Sep 27 2022, 11:38 AM
jrbyrnes updated this revision to Diff 464357. Sep 30 2022, 12:10 PM

Extend ByteProvider / VectorOffset handling to support vector scalar types wider than 1 byte (see the sketch below).

Additional comments, tests
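
As a rough sketch of what this extension covers (2-byte elements; names and layout are assumptions, not taken from the precommitted tests):

```
define i32 @assemble_from_i16s(ptr %p) {
  %vec = load <2 x i16>, ptr %p, align 4
  %e0  = extractelement <2 x i16> %vec, i32 0  ; provides bytes 0-1
  %e1  = extractelement <2 x i16> %vec, i32 1  ; provides bytes 2-3
  %z0  = zext i16 %e0 to i32
  %z1  = zext i16 %e1 to i32
  %s1  = shl i32 %z1, 16
  %r   = or i32 %z0, %s1
  ; with 2-byte elements, element 1 provides bytes 2 and 3, so the
  ; VectorOffset bookkeeping has to scale by the element size
  ret i32 %r
}
```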

jrbyrnes updated this revision to Diff 464377. Sep 30 2022, 1:08 PM

Rebase on top of precommitted tests.

spatel accepted this revision. Oct 3 2022, 9:16 AM

I still can't say that I see all of the potential corner cases, but the extra comments help to explain what's going on, and it seems to work as expected, so LGTM.
Might still be good to get a 2nd approval from another reviewer since several people have commented.

This revision is now accepted and ready to land. Oct 3 2022, 9:16 AM

I still can't say that I see all of the potential corner cases, but the extra comments help to explain what's going on, and it seems to work as expected, so LGTM.
Might still be good to get a 2nd approval from another reviewer since several people have commented.

Hey @spatel, thanks! I appreciate your help and thoughts on this patch. I'll keep the review up for a bit longer while I do some testing to see if anyone else is willing to approve.

arsenm accepted this revision. Oct 3 2022, 11:54 AM

LGTM with nit

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7895

deMorgan this

jrbyrnes updated this revision to Diff 465035. Oct 4 2022, 9:00 AM

Resolve nit, fix a few comment typos. NFC

arsenm accepted this revision. Oct 4 2022, 9:03 AM