Combine 
t2 = bswap t1;  t3 = srl t2, x;  bswap t3
to shl t1, x; if x %8 == 0.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I think you need to confirm that the shift amount is a multiple of 8 ? https://alive2.llvm.org/ce/z/5DdJRS
I was looking at solving the same test cases by doing
(bitreverse (srl X, C)) -> (shl (bitreverse X), C)
I've posted my patch here D120667
TBH, this pattern is probably more likely to occur in general code than anything with a bitreverse
Fair enough, but in that case we need a different set of test cases. All of the examples here could legally be a single brev8(reverse bits within each byte) instruction. The shifts are artifacts of type legalization and shouldn't be there. No amount of DAG combines after type legalization can get to the optimal codegen. The only way to do it is to combine (bitreverse (bswap X)) to brev8 pre-type legalization and then type legalize brev8 with any_extend.
We'd probably do better with a smaller match/fold. This is really just recognizing that you can move a logical shift before/after a swap by reversing the direction. We don't get this in IR either if I'm seeing it correctly:
https://alive2.llvm.org/ce/z/2UmMSu
I agree that for these cases in bswap-bitreverse.ll combine (bitreverse (bswap X)) can get more optimizations. So should we keep this combine?Maybe it can be used elsewhere.
We need test cases that are targetted specifically at the change being made. If we add a RISCV DAGCombine for (bitreverse (bswap X)) -> brev8 pre-type legalization, then this code change will become untested.
Please can you precommit the additional tests and then rebase this patch so it shows the changes?
Can you rebase again. I think the changes to bswap-bitreverse.ll have been fixed a different way and no longer apply.
| llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 9729–9730 | It seems that x % 8 == 0 is simple and clear. | |
| llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 9729–9730 | 
 Sure, but does it handle variable shift amounts? | |
| llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 9729–9730 | It would also add vector support, which probably isn't that relevant but could maybe turn up. | |
I didn't see a reply to my earlier suggestion - is there a problem with a more general pattern match (independent of the question of using knownbits on the shift amount)?
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp index 8e383ce85cb7..498d2f51bbd5 100644 --- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp @@ -9745,6 +9745,16 @@ SDValue DAGCombiner::visitBSWAP(SDNode *N) { } } + if ((N0.getOpcode() == ISD::SHL || N0.getOpcode() == ISD::SRL) && + N0.hasOneUse()) { + auto *ShAmt = dyn_cast<ConstantSDNode>(N0.getOperand(1)); + if (ShAmt && ShAmt->getZExtValue() % 8 == 0) { + SDValue NewSwap = DAG.getNode(ISD::BSWAP, DL, VT, N0.getOperand(0)); + unsigned InverseShift = N0.getOpcode() == ISD::SHL ? ISD::SRL : ISD::SHL; + return DAG.getNode(InverseShift, DL, VT, NewSwap, N0.getOperand(1)); + } + } + return SDValue(); }
| llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 9730 | Where did we verify that calling getZExtValue() wouldn't assert if the shift was enormuously large. It would definitely be UB, but we can't guarantee it was folded yet. | |
I thank this is a correct transform, it change bswap(shl x) -> srl (bswap x) or bswap(srl x) -> shl (bswap x)  and it may provide more opportunity for optimization;
but looking at this transformation alone, it doesn't do any optimization. Optimization needs to depend on other instructions before and after. Why don't we do accurate optimization?
We want to have the more flexible/robust pattern-matching because it will handle cases that may not be obvious/visible yet. By making the minimal transform, we try to ensure that the code will be in a "canonical" form that other transforms can then act on. (That is also why I propose adding the transform in IR in D122010.)
In this case, the key optimization is that a bswap-of-bswap (or bitreverse-of-bitreverse) cancels itself out. That optimization already exists, so we don't need to reimplement it here and other places. Just try to get the code into a form that will allow the existing code to apply. This is a fundamental feature of iterative combining. It may seem less efficient, but it's actually more efficient because we don't need to duplicate as much combiner code to get the same level of consistent optimization.
For example, we do not have this test in D121504 (but I think it should be included):
define i32 @test_bswap_shli_8_bswap_i32(i32 %a) nounwind {
    %1 = call i32 @llvm.bswap.i32(i32 %a)
    %2 = shl i32 %1, 8
    %3 = call i32 @llvm.bswap.i32(i32 %2)
    ret i32 %3
}We could adjust this patch to also deal with the opposite direction shift, but then it will need more redundant pattern-matching code (especially if we need to duplicate it again for bitreverse).
I didn't step through it, but that might be why this patch fails to make some "rev16" optimizations for ARM code that we will get with the transform that I suggested.
Thanks for committing the extra tests. I posted D122655 to transform those. We can generalize it to handle bitreverse as a follow-up patch if that is needed.
