This is an archive of the discontinued LLVM Phabricator instance.

[SDAG] try to replace subtract-from-constant with xor
ClosedPublic

Authored by spatel on Jun 18 2022, 8:11 AM.

Details

Summary

This is almost the same as the abandoned D48529, but it allows splat vector constants too.
This restores the original generic combine, replacing the x86-specific code that was added with the alternate patch D48557.
This transform is a less-restricted form of an existing InstCombine fold and of the SDAG equivalent proposed for it in D128080:
https://alive2.llvm.org/ce/z/OUm6N_
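For illustration, here is a hypothetical sketch of the fold (not taken from the patch's tests; the combine itself operates on SelectionDAG nodes, but the IR equivalent is easier to show). The condition is that every possibly-set bit of the variable operand is also set in the constant, so no bit of the subtraction can borrow from an adjacent bit:

define i8 @sub_from_const(i8 %a) {
  %x = and i8 %a, 7                      ; possibly-set bits of %x: 0b00000111
  %r = sub i8 7, %x                      ; no borrow possible -> xor i8 %x, 7
  ret i8 %r
}

define <2 x i8> @sub_from_const_splat(<2 x i8> %a) {
  %x = and <2 x i8> %a, <i8 7, i8 7>
  %r = sub <2 x i8> <i8 7, i8 7>, %x     ; splat-vector constants are now handled too
  ret <2 x i8> %r
}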

Diff Detail

Event Timeline

spatel created this revision.Jun 18 2022, 8:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 18 2022, 8:11 AM
spatel requested review of this revision.Jun 18 2022, 8:11 AM
RKSimon added inline comments.Jun 20 2022, 1:32 AM
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

Based off @foad's comments on D128080 - we need to either add a TLI override to control this or add a way for AMDGPU to reverse it.

RKSimon added inline comments.Jun 20 2022, 1:35 AM
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

Hmm - how similar is TargetLowering::preferIncOfAddToSubOfNot?

asb accepted this revision.Jun 20 2022, 2:07 AM

Nice change! RISC-V changes LGTM, and it seems to cause various small improvements in codegen for the GCC torture suite.

This revision is now accepted and ready to land.Jun 20 2022, 2:07 AM
spatel planned changes to this revision.Jun 20 2022, 3:33 AM
spatel added inline comments.
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

At first glance, that seems more about constant materialization, while this is about a "sub-from" instruction with an immediate operand.

That hook defaults to true but is overridden to false for vector types by ARM/AArch64/PowerPC.

deadalnix added inline comments.Jun 26 2022, 6:40 PM
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

TBH, it seems like a missed opportunity from the AMDGPU backend rather than a major problem with this patch.

Isn't there someone from AMD we can get help from here?

foad added inline comments.Jun 27 2022, 4:42 AM
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

AMDGPUDAGToDAGISel::SelectDS1Addr1Offset would have to be taught to match appropriate uses of ISD::XOR as well as ISD::SUB (similar to how addressing mode matchers have to match OR as well as ADD?). This code was added by @arsenm in D12283. It doesn't sound like a very important real world use case so maybe we could just ignore the regression?
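For context, a rough, hypothetical reconstruction of the kind of pattern this test exercises (not copied from ds-sub-offset.ll): the workitem ID is known to have only low bits set, so the sub-from-constant passes the no-borrow check and becomes an xor, which SelectDS1Addr1Offset no longer folds into the DS instruction's offset field.

define amdgpu_kernel void @ds_sub_offset_sketch() {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()   ; known to be small
  %shl = shl i32 %tid, 2
  %sub = sub i32 65532, %shl                     ; previously selected as base + offset
  %ptr = inttoptr i32 %sub to ptr addrspace(3)
  store i32 123, ptr addrspace(3) %ptr
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()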

deadalnix added inline comments.Jun 27 2022, 2:17 PM
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

I think there is the regression itself, which is probably okay, but then there is what the test is actually testing for: the maximum offset.

So maybe a second test case for the maximum offset - one that does not trigger the regression - should be added here, and then we can move on.

spatel updated this revision to Diff 441412.Jun 30 2022, 8:35 AM

Patch updated:
No code changes. Based on review comments, I added 2 AMDGPU offset transform tests that show no diffs with this patch, and added comments to the changed tests in case there is motivation to restore those patterns.

This revision is now accepted and ready to land.Jun 30 2022, 8:35 AM
spatel added a comment.Jul 7 2022, 7:51 AM

Ping (AMDGPU) - are the test updates sufficient?

arsenm added inline comments.Jul 7 2022, 8:25 AM
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll
277–329

It might be nice if isBaseWithConstantOffset recognized this, as it does for add/or, but I don't think this is that important.

llvm/test/CodeGen/AMDGPU/setcc-multiple-use.ll
18–19

It's a bit weird to see a vector operation traded for a scalar one, but this seems fine.

This revision was landed with ongoing or failed builds.Jul 8 2022, 5:17 AM
This revision was automatically updated to reflect the committed changes.
barannikov88 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3766

It might be more profitable to do this transformation in SimplifyDemandedBits. Some bits of the result might not be used downstream, in which case the check can be relaxed.
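As an illustrative sketch of that idea (my own example, not from the review): if a user of the sub only demands the low bits, only those bits need to be borrow-free, so a demanded-bits driven version could fire even where the current whole-value known-bits check fails.

define i8 @sub_low_bits_demanded(i8 %x) {
  %s = sub i8 7, %x     ; %x is fully unknown, so 0xFF is not a subset of 7 and
                        ; the current check fails...
  %r = and i8 %s, 7     ; ...but only bits 0-2 are demanded, borrows only propagate
                        ; upward, and those bits match xor i8 %x, 7
  ret i8 %r
}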

spatel marked an inline comment as done.Jul 8 2022, 8:35 AM
spatel added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3766

The known-bits vs. demanded-bits transforms are two different things if I'm seeing it correctly, although there can be some overlap. Demanded bits may already be used to shrink the constants, so that could enable this transform.

If you have an example in mind that doesn't transform as expected, would you please file an issue, so we can track the fix? If it's not working here in codegen, it may also be missing from the IR equivalent.

I updated the IR side to match this patch with:
79bb915fb60b

barannikov88 added inline comments.Jul 8 2022, 11:32 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3766

> The known-bits vs. demanded-bits transforms are two different things if I'm seeing it correctly

My understanding is as follows. SimplifyDemandedBits recurses down to the operands of the node, propagating information about which bits are needed of them. It might help to simplify the operands. On the way back from the recursion, the operands tell the node which bits of them are known so that the node itself can be simplified. I could be wrong.
I noticed that SimplifyDemandedBits is mainly used in DAGCombiner, mostly as a last resort if no other transformation could be applied. For some reason this function is not called for ISD::SUB.

> in which case the check can be relaxed.

On second thought, it does not look so simple. Suppose we have two 2-bit operands:

C = 0b10
X = 0b1?

and we're demanding bit 1 (counting from 0) of the SUB. We still have to check for a possible borrow out of bit 0: with X = 0b11, the sub gives 0b11 (bit 1 set) while the xor gives 0b01 (bit 1 clear).

barannikov88 added inline comments.Jul 8 2022, 3:09 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3766

By the way, a more general check would be to ignore a borrow from the (nonexistent) bit above the MSB. E.g.:
(~X.KnownZero).isSubsetOf(C | SignMask)
or
(~(X.KnownZero | SignMask)).isSubsetOf(C)
or, which is probably a bit cleaner,
(C - ~X.KnownZero) == (C ^ ~X.KnownZero)
if I'm not mistaken.
This, however, does not affect any existing tests. At least not in the backend.

spatel marked an inline comment as done.Jul 10 2022, 4:08 AM
spatel added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3766

Yes, that looks like a good enhancement. Here's an attempt at a general proof and a specific example for a regression test:
https://alive2.llvm.org/ce/z/KABK4Z
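A hypothetical IR instance of the relaxed check (not necessarily the example behind the alive2 link): the only possibly-set bit of %x that is not covered by the constant is the sign bit, and the borrow out of the sign bit is discarded by the fixed-width subtraction.

define i8 @sub_ignore_msb_borrow(i8 %a) {
  %x = and i8 %a, -128   ; %x is either 0 or 0x80
  %r = sub i8 64, %x     ; fails the current check (0x80 is not a subset of 64),
                         ; but still equals xor i8 %x, 64 for both values of %x
  ret i8 %r
}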

bjope added a subscriber: bjope.Jul 11 2022, 1:08 PM

Hi @spatel

I've seen some regressions with this (or maybe with the similar update in instcombine that replaces sub by xor). Those patterns typically involve a sub that is being used as an index in a GEP. And I think that both SCEV for regular IR and our downstream machine IR scalar evolution are having a hard time understanding that the xor is a subtract in disguise. So instead of ending up with a tight loop with a negative stride for the memory accesses, we now end up with xor operations inside the loop. Not quite sure how to deal with the regressions. Maybe the scalar evolution implementations can be improved here. Or maybe our downstream ISel needs to select sub instead of xor (if the reverse transform is easy).

My target has lots of different instructions that involve add/sub. We can fold it into "multiply and add/subtract", we can fold it into addressing modes as an offset or by using a post-update on the pointer, and we can do add/subtract for both "general purpose" registers and for "pointer" registers. However, logical operations are more limited (especially when it comes to pointer arithmetic, since we will also end up moving values between GPRs that can do the logical operations and the pointer registers that can be used for addressing).

Since this is one instruction instead of another single instruction (we do not reduce the number of instructions in these combines), I'm interested to understand what deems an xor to be better than a sub. Are logical operations considered "better" than arithmetic operations in general, or what is the rule?

> Hi @spatel
>
> I've seen some regressions with this (or maybe with the similar update in instcombine that replaces sub by xor). Those patterns typically involve a sub that is being used as an index in a GEP. And I think that both SCEV for regular IR and our downstream machine IR scalar evolution are having a hard time understanding that the xor is a subtract in disguise. So instead of ending up with a tight loop with a negative stride for the memory accesses, we now end up with xor operations inside the loop. Not quite sure how to deal with the regressions. Maybe the scalar evolution implementations can be improved here. Or maybe our downstream ISel needs to select sub instead of xor (if the reverse transform is easy).
>
> My target has lots of different instructions that involve add/sub. We can fold it into "multiply and add/subtract", we can fold it into addressing modes as an offset or by using a post-update on the pointer, and we can do add/subtract for both "general purpose" registers and for "pointer" registers. However, logical operations are more limited (especially when it comes to pointer arithmetic, since we will also end up moving values between GPRs that can do the logical operations and the pointer registers that can be used for addressing).
>
> Since this is one instruction instead of another single instruction (we do not reduce the number of instructions in these combines), I'm interested to understand what deems an xor to be better than a sub. Are logical operations considered "better" than arithmetic operations in general, or what is the rule?

Thanks for letting me know. The main codegen motivation for this transform is that it can allow using an immediate-form xor instruction rather than a separate load of the immediate when used as Op0 of a subtract. That kind of improvement is seen in the RISCV and AArch64 diffs. Our folds for bitwise logic tend to be better than sub too, and known-bits / demanded-bits are also easier with xor. The bit-tracking improvement is why I figured we should also do this in instcombine.

But it was already a borderline codegen transform because of the AMDGPU diffs, so we do very likely need to restrict this with a TLI hook. If you have examples of the regressions you're seeing that can be translated to an in-tree target, that would be great.
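To make the immediate-form point concrete (my own sketch, not one of the patch's test diffs): on RISC-V, for example, there is no reverse-subtract-with-immediate, so the sub form needs the constant materialized into a register first, while the xor form can typically be selected as a single xori.

define i32 @sub_vs_xor_imm(i32 %a) {
  %x = and i32 %a, 63
  %r = sub i32 63, %x    ; roughly li+sub before this combine, a single xori after
  ret i32 %r
}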

bjope added a comment.Jul 11 2022, 3:16 PM

Not really sure if this ends up as a regression or not, but with this source we can see some differences if, for example, using hexagon or systemz as targets.
(This example is related to doing sub->xor in instcombine, so it is not exactly the patch in this review.)

Input C program (sub_xor.c):

static unsigned ARR[100];

extern void populate(unsigned*);

unsigned foo(unsigned *a) {
    populate(ARR);
    unsigned sum = 0;
    for (int j = 1; j < 4; ++j) {
        unsigned *ptr = &ARR[99];
        for (int i = 0; i < 70; ++i) {
            sum += j * *(ptr - i);
        }
    }
    return sum;
}

Compile with clang -target hexagon -O2 sub_xor.c -o sub_xor.ll -emit-llvm -S and the only difference will be some subs ending up as xors (depending on whether the instcombine transform from commit 79bb915fb60b2cd2 is reverted or not).
If we also compare the result of llc -O2 sub_xor.ll -o - -print-after-all, given those two different versions of sub_xor.ll from the prior step, we can see that Loop Strength Reduce can eliminate the sub but it does not eliminate the xor.

Well, this is the best in-tree example I have so far (while aiming at having a full test from C code; I guess this could be reduced to a much smaller test just running LSR to show that we get a different result for sub vs. xor).

> Not really sure if this ends up as a regression or not, but with this source we can see some differences if, for example, using hexagon or systemz as targets.
> (This example is related to doing sub->xor in instcombine, so it is not exactly the patch in this review.)

Thanks for the example. The instcombine variation of this fold has a different problem that I noticed in another example.
Can you try the patch below and see if that fixes the regressions for your target?

diff --git a/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp b/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
index 535a7736454c..c51ce0e8d1c3 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
@@ -1966,13 +1966,6 @@ Instruction *InstCombinerImpl::visitSub(BinaryOperator &I) {
       return BinaryOperator::CreateAdd(X, ConstantExpr::getSub(C, C2));
   }
 
-  // If there's no chance any bit will need to borrow from an adjacent bit:
-  // sub C, X --> xor X, C
-  const APInt *Op0C;
-  if (match(Op0, m_APInt(Op0C)) &&
-      (~computeKnownBits(Op1, 0, &I).Zero).isSubsetOf(*Op0C))
-    return BinaryOperator::CreateXor(Op1, Op0);
-
   {
     Value *Y;
     // X-(X+Y) == -Y    X-(Y+X) == -Y
@@ -2231,7 +2224,20 @@ Instruction *InstCombinerImpl::visitSub(BinaryOperator &I) {
         I, Builder.CreateIntrinsic(Intrinsic::ctpop, {I.getType()},
                                    {Builder.CreateNot(X)}));
 
-  return TryToNarrowDeduceFlags();
+  if (Instruction *R = TryToNarrowDeduceFlags())
+    return R;
+
+  // If there's no chance any bit will need to borrow from an adjacent bit:
+  // sub C, X --> xor X, C
+  // Avoid this fold if the sub has no-wrap flags because that could be an
+  // information-losing transform that we cannot recover from.
+  const APInt *Op0C;
+  if (!I.hasNoSignedWrap() && !I.hasNoUnsignedWrap() &&
+      match(Op0, m_APInt(Op0C)) &&
+      (~computeKnownBits(Op1, 0, &I).Zero).isSubsetOf(*Op0C))
+    return BinaryOperator::CreateXor(Op1, Op0);
+
+  return nullptr;
 }
 
 /// This eliminates floating-point negation in either 'fneg(X)' or
foad added a comment.Jul 12 2022, 8:49 AM
> +  // If there's no chance any bit will need to borrow from an adjacent bit:
> +  // sub C, X --> xor X, C
> +  // Avoid this fold if the sub has no-wrap flags because that could be an
> +  // information-losing transform that we cannot recover from.
> +  const APInt *Op0C;
> +  if (!I.hasNoSignedWrap() && !I.hasNoUnsignedWrap() &&
> +      match(Op0, m_APInt(Op0C)) &&
> +      (~computeKnownBits(Op1, 0, &I).Zero).isSubsetOf(*Op0C))
> +    return BinaryOperator::CreateXor(Op1, Op0);

If there's no chance any bit will need to borrow from an adjacent bit, then there is no unsigned wrap by definition, so I'm not sure losing the nuw flag would actually lose any information.

> If there's no chance any bit will need to borrow from an adjacent bit, then there is no unsigned wrap by definition, so I'm not sure losing the nuw flag would actually lose any information.

Yes, good point. That suggests that we should also improve later passes to recognize xor as sub when profitable.

The test that I was looking at where I don't think we can recover is:

define i1 @test_negative_combined_sub_signed_overflow(i8 %x) {
; CHECK-LABEL: @test_negative_combined_sub_signed_overflow(
; CHECK-NEXT:    ret i1 false
;
  %y = sub nsw i8 127, %x
  %z = icmp slt i8 %y, -1
  ret i1 %z
}

If we convert that to xor (after applying the equivalent of the SDAG patch in d0eec5f7e787), then we would have:

%y = xor i8 %x, 127
%z = icmp slt i8 %y, -1

And now we can't fold the whole thing to 'false' anymore (with nsw, %x must be non-negative, so %y stays in the range [0, 127] and is never signed-less-than -1; the xor form loses that information).

foad added a comment.Jul 12 2022, 10:00 AM

Right, my argument only applies for nuw. I agree that losing nsw can still lose information.

bjope added a comment.Jul 12 2022, 1:37 PM

> Not really sure if this ends up as a regression or not, but with this source we can see some differences if, for example, using hexagon or systemz as targets.
> (This example is related to doing sub->xor in instcombine, so it is not exactly the patch in this review.)
>
> Thanks for the example. The instcombine variation of this fold has a different problem that I noticed in another example.
> Can you try the patch below and see if that fixes the regressions for your target?

I reran some benchmarks with that patch, and the cycle counts were restored in those tests that earlier showed a regression. So that is great.
There is still a small fear that xor is harder for us to lower into machine IR compared to sub (considering register constraints, etc.). But I guess that is something we should deal with downstream somehow (investigate whether we can select xor as a sub when possible).