This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Fix issue in shl(or) combine
ClosedPublic

Authored by ruiling on May 9 2023, 8:47 PM.

Details

Summary

The code is doing the optimization:
((a | c1) << c2) ==> (a << c2) + (c1 << c2)
But this is only valid if a and c1 have no common bits being set.

Diff Detail

Event Timeline

ruiling created this revision.May 9 2023, 8:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2023, 8:47 PM
ruiling requested review of this revision.May 9 2023, 8:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2023, 8:47 PM
foad added a comment.May 10 2023, 12:02 AM

The right way is to transform the pattern to (a << c2) | (c1 << c2)
But the right transformation does not do any help on folding the
constant offset into the memory instructions.

It should help because SelectionDAG::isBaseWithConstantOffset knows how to match OR (if the known bits do not overlap) as well as ADD.

It should help because SelectionDAG::isBaseWithConstantOffset knows how to match OR (if the known bits do not overlap) as well as ADD.

Do you mean we can optimize for shl(or) with the help of knownbits? I think I agree with you. But I am not sure whether it is really helpful in practical cases. The isBaseWithConstantOffset you mentioned specifically designed to work on stack slot access. I am not sure if such patterns can also be observed more broadly. And I think such kind of optimization should be done separately, maybe it should be added in the common LLVM code. And the lit-test should also be redesigned. I would rather fix the problematic transformation first. sounds ok to you?

ruiling updated this revision to Diff 521303.May 11 2023, 7:30 AM

fix the issue instead of removing the optimization.

ruiling retitled this revision from AMDGPU: remove an illegal transform for shl(or) to AMDGPU: Fix issue in shl(or) combine.May 11 2023, 7:32 AM
ruiling edited the summary of this revision. (Show Details)
arsenm accepted this revision.May 12 2023, 1:44 AM

Is there a negative test for the common bits case?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9577

Don’t know why this has to change

This revision is now accepted and ready to land.May 12 2023, 1:44 AM

Is there a negative test for the common bits case?

I have added one: shl_or_ptr_not_combine_2use_lds

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9577

I don't know what happened:( will fix it.

This revision was landed with ongoing or failed builds.May 12 2023, 4:51 AM
This revision was automatically updated to reflect the committed changes.
foad added a comment.May 15 2023, 3:33 AM

I think this is OK, but wouldn't it be simpler to transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2) and remove the knownbits check? Or do you think that would make the generated code worse overall?

I think this is OK, but wouldn't it be simpler to transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2) and remove the knownbits check? Or do you think that would make the generated code worse overall?

I observed more assembly instructions in one typical IR with the transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2). That's why I wanted to remove the code in the first version. But later I find common code tries hard to make (shl (or x, c1), c2) -> add (shl x, c2), (shl c1, c2) happen. So I just fix the issue to help possible cases.