This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] fold (bswap(srl (bswap c), 8*x)) -> (shl c, 8*x)
AbandonedPublic

Authored by Chenbing.Zheng on Feb 28 2022, 12:57 AM.

Details

Summary

Combine
t2 = bswap t1; t3 = srl t2, x; bswap t3
to shl t1, x; if x %8 == 0.

Diff Detail

Event Timeline

Chenbing.Zheng requested review of this revision.Feb 28 2022, 12:57 AM

Is this a target-independent optimization that's worth doing in general DAGCombine?

Is this a target-independent optimization that's worth doing in general DAGCombine?

+1 This looks like a useful generic fold to me.

Chenbing.Zheng retitled this revision from [RISCV] DAG combine bswap(srl (bswap t), x) to shl t, x to [RISCV] DAGcombine fold (bswap(srl (bswap c), x)) -> (shl c, x).
Chenbing.Zheng added a reviewer: RKSimon.

address comments

I think you need to confirm that the shift amount is a multiple of 8 ? https://alive2.llvm.org/ce/z/5DdJRS

lebedev.ri requested changes to this revision.Feb 28 2022, 4:57 AM
lebedev.ri added a subscriber: lebedev.ri.

Could you please post the legality reasoning, not just state what the fold is?

https://alive2.llvm.org/ce/z/gRNCw5

This revision now requires changes to proceed.Feb 28 2022, 4:57 AM
Chenbing.Zheng edited the summary of this revision. (Show Details)

add a condition x % 8 == 0, x is left shift num

Could you please post the legality reasoning, not just state what the fold is?

https://alive2.llvm.org/ce/z/gRNCw5

I think you need to confirm that the shift amount is a multiple of 8 ? https://alive2.llvm.org/ce/z/5DdJRS

Thanks, I agree with you

craig.topper added a comment.EditedFeb 28 2022, 9:52 AM

I was looking at solving the same test cases by doing

(bitreverse (srl X, C)) -> (shl (bitreverse X), C)

I've posted my patch here D120667

TBH, this pattern is probably more likely to occur in general code than anything with a bitreverse

craig.topper added a comment.EditedFeb 28 2022, 10:41 AM

TBH, this pattern is probably more likely to occur in general code than anything with a bitreverse

Fair enough, but in that case we need a different set of test cases. All of the examples here could legally be a single brev8(reverse bits within each byte) instruction. The shifts are artifacts of type legalization and shouldn't be there. No amount of DAG combines after type legalization can get to the optimal codegen. The only way to do it is to combine (bitreverse (bswap X)) to brev8 pre-type legalization and then type legalize brev8 with any_extend.

spatel added a subscriber: spatel.Feb 28 2022, 10:49 AM

We'd probably do better with a smaller match/fold. This is really just recognizing that you can move a logical shift before/after a swap by reversing the direction. We don't get this in IR either if I'm seeing it correctly:
https://alive2.llvm.org/ce/z/2UmMSu

TBH, this pattern is probably more likely to occur in general code than anything with a bitreverse

Fair enough, but in that case we need a different set of test cases. All of the examples here could legally be a single brev8(reverse bits within each byte) instruction. The shifts are artifacts of type legalization and shouldn't be there. No amount of DAG combines after type legalization can get to the optimal codegen. The only way to do it is to combine (bitreverse (bswap X)) to brev8 pre-type legalization and then type legalize brev8 with any_extend.

I agree that for these cases in bswap-bitreverse.ll combine (bitreverse (bswap X)) can get more optimizations. So should we keep this combine?Maybe it can be used elsewhere.

We'd probably do better with a smaller match/fold. This is really just recognizing that you can move a logical shift before/after a swap by reversing the direction. We don't get this in IR either if I'm seeing it correctly:
https://alive2.llvm.org/ce/z/2UmMSu

Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2022, 5:40 PM

Ping

We need test cases that are targetted specifically at the change being made. If we add a RISCV DAGCombine for (bitreverse (bswap X)) -> brev8 pre-type legalization, then this code change will become untested.

Chenbing.Zheng retitled this revision from [RISCV] DAGcombine fold (bswap(srl (bswap c), x)) -> (shl c, x) to [DAGCombine] fold (bswap(srl (bswap c), x)) -> (shl c, x).

add tests

Please can you precommit the additional tests and then rebase this patch so it shows the changes?

rebase precommit tests D121504

Can you rebase again. I think the changes to bswap-bitreverse.ll have been fixed a different way and no longer apply.

lebedev.ri added inline comments.Mar 17 2022, 12:44 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9727
9729–9730

I'm guessing it isn't worth it to instead check with the knownbits that low 3 bits are zeros?

Chenbing.Zheng retitled this revision from [DAGCombine] fold (bswap(srl (bswap c), x)) -> (shl c, x) to [DAGCombine] fold (bswap(srl (bswap c), 8*x)) -> (shl c, 8*x).

address comments

Chenbing.Zheng marked an inline comment as done.Mar 17 2022, 1:37 AM
Chenbing.Zheng added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9729–9730

It seems that x % 8 == 0 is simple and clear.

lebedev.ri added inline comments.Mar 17 2022, 1:38 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9729–9730

It seems that x % 8 == 0 is simple and clear.

Sure, but does it handle variable shift amounts?

RKSimon added inline comments.Mar 17 2022, 8:05 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9729–9730

It would also add vector support, which probably isn't that relevant but could maybe turn up.

New patch title sounds confusing.

I didn't see a reply to my earlier suggestion - is there a problem with a more general pattern match (independent of the question of using knownbits on the shift amount)?

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 8e383ce85cb7..498d2f51bbd5 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -9745,6 +9745,16 @@ SDValue DAGCombiner::visitBSWAP(SDNode *N) {
     }
   }
 
+  if ((N0.getOpcode() == ISD::SHL || N0.getOpcode() == ISD::SRL) &&
+      N0.hasOneUse()) {
+    auto *ShAmt = dyn_cast<ConstantSDNode>(N0.getOperand(1));
+    if (ShAmt && ShAmt->getZExtValue() % 8 == 0) {
+      SDValue NewSwap = DAG.getNode(ISD::BSWAP, DL, VT, N0.getOperand(0));
+      unsigned InverseShift = N0.getOpcode() == ISD::SHL ? ISD::SRL : ISD::SHL;
+      return DAG.getNode(InverseShift, DL, VT, NewSwap, N0.getOperand(1));
+    }
+  }
+
   return SDValue();
 }
craig.topper added inline comments.Mar 17 2022, 9:32 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
9730

Where did we verify that calling getZExtValue() wouldn't assert if the shift was enormuously large. It would definitely be UB, but we can't guarantee it was folded yet.

I didn't see a reply to my earlier suggestion - is there a problem with a more general pattern match (independent of the question of using knownbits on the shift amount)?

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 8e383ce85cb7..498d2f51bbd5 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -9745,6 +9745,16 @@ SDValue DAGCombiner::visitBSWAP(SDNode *N) {
     }
   }
 
+  if ((N0.getOpcode() == ISD::SHL || N0.getOpcode() == ISD::SRL) &&
+      N0.hasOneUse()) {
+    auto *ShAmt = dyn_cast<ConstantSDNode>(N0.getOperand(1));
+    if (ShAmt && ShAmt->getZExtValue() % 8 == 0) {
+      SDValue NewSwap = DAG.getNode(ISD::BSWAP, DL, VT, N0.getOperand(0));
+      unsigned InverseShift = N0.getOpcode() == ISD::SHL ? ISD::SRL : ISD::SHL;
+      return DAG.getNode(InverseShift, DL, VT, NewSwap, N0.getOperand(1));
+    }
+  }
+
   return SDValue();
 }

I thank this is a correct transform, it change bswap(shl x) -> srl (bswap x) or bswap(srl x) -> shl (bswap x) and it may provide more opportunity for optimization;
but looking at this transformation alone, it doesn't do any optimization. Optimization needs to depend on other instructions before and after. Why don't we do accurate optimization?

I thank this is a correct transform, it change bswap(shl x) -> srl (bswap x) or bswap(srl x) -> shl (bswap x) and it may provide more opportunity for optimization;
but looking at this transformation alone, it doesn't do any optimization. Optimization needs to depend on other instructions before and after. Why don't we do accurate optimization?

We want to have the more flexible/robust pattern-matching because it will handle cases that may not be obvious/visible yet. By making the minimal transform, we try to ensure that the code will be in a "canonical" form that other transforms can then act on. (That is also why I propose adding the transform in IR in D122010.)

In this case, the key optimization is that a bswap-of-bswap (or bitreverse-of-bitreverse) cancels itself out. That optimization already exists, so we don't need to reimplement it here and other places. Just try to get the code into a form that will allow the existing code to apply. This is a fundamental feature of iterative combining. It may seem less efficient, but it's actually more efficient because we don't need to duplicate as much combiner code to get the same level of consistent optimization.

For example, we do not have this test in D121504 (but I think it should be included):

define i32 @test_bswap_shli_8_bswap_i32(i32 %a) nounwind {
    %1 = call i32 @llvm.bswap.i32(i32 %a)
    %2 = shl i32 %1, 8
    %3 = call i32 @llvm.bswap.i32(i32 %2)
    ret i32 %3
}

We could adjust this patch to also deal with the opposite direction shift, but then it will need more redundant pattern-matching code (especially if we need to duplicate it again for bitreverse).

I didn't step through it, but that might be why this patch fails to make some "rev16" optimizations for ARM code that we will get with the transform that I suggested.

Chenbing.Zheng abandoned this revision.Mar 28 2022, 1:25 AM

I thank this is a correct transform, it change bswap(shl x) -> srl (bswap x) or bswap(srl x) -> shl (bswap x) and it may provide more opportunity for optimization;
but looking at this transformation alone, it doesn't do any optimization. Optimization needs to depend on other instructions before and after. Why don't we do accurate optimization?

We want to have the more flexible/robust pattern-matching because it will handle cases that may not be obvious/visible yet. By making the minimal transform, we try to ensure that the code will be in a "canonical" form that other transforms can then act on. (That is also why I propose adding the transform in IR in D122010.)

In this case, the key optimization is that a bswap-of-bswap (or bitreverse-of-bitreverse) cancels itself out. That optimization already exists, so we don't need to reimplement it here and other places. Just try to get the code into a form that will allow the existing code to apply. This is a fundamental feature of iterative combining. It may seem less efficient, but it's actually more efficient because we don't need to duplicate as much combiner code to get the same level of consistent optimization.

For example, we do not have this test in D121504 (but I think it should be included):

define i32 @test_bswap_shli_8_bswap_i32(i32 %a) nounwind {
    %1 = call i32 @llvm.bswap.i32(i32 %a)
    %2 = shl i32 %1, 8
    %3 = call i32 @llvm.bswap.i32(i32 %2)
    ret i32 %3
}

We could adjust this patch to also deal with the opposite direction shift, but then it will need more redundant pattern-matching code (especially if we need to duplicate it again for bitreverse).

I didn't step through it, but that might be why this patch fails to make some "rev16" optimizations for ARM code that we will get with the transform that I suggested.

Sorry for my delayed reply,I think what you said makes sense. I will abundant this patch and I have update D121504 according to your suggestion. Would you mind review it and creat your patch to solve it.
There is a similar problem with bitreverse-shift, and I add tests in D121507.

Sorry for my delayed reply,I think what you said makes sense. I will abundant this patch and I have update D121504 according to your suggestion. Would you mind review it and creat your patch to solve it.

Thanks for committing the extra tests. I posted D122655 to transform those. We can generalize it to handle bitreverse as a follow-up patch if that is needed.