This is almost the same as the abandoned D48529, but it allows splat vector constants too.
This replaces the x86-specific code that was added by the alternate patch D48557 with the original generic combine.
This transform is a less restricted form of an existing InstCombine and the proposed SDAG equivalent for that in D128080:
https://alive2.llvm.org/ce/z/OUm6N_
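For reference, a minimal standalone sketch of the no-borrow condition behind this fold, written against LLVM's APInt/KnownBits helpers; the function name is mine, but the condition mirrors the check used in the patch below.

#include "llvm/ADT/APInt.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// sub C, X can be rewritten as xor X, C when every bit that may be set in X
// is also set in C: then no bit position ever needs to borrow from its
// neighbor, and the per-bit subtraction degenerates to a per-bit xor.
static bool canFoldSubToXor(const APInt &C, const KnownBits &KnownX) {
  // Bits of X that might be 1 are the complement of its known-zero bits.
  APInt MaybeOneX = ~KnownX.Zero;
  return MaybeOneX.isSubsetOf(C);
}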
Event Timeline
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll:277–329
Hmm - how similar is TargetLowering::preferIncOfAddToSubOfNot?
Nice change! RISC-V changes LGTM, and it seems to cause various small improvements in codegen for the GCC torture suite.
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll:277–329
At first glance, that seems more about constant materialization, while this is about a "sub-from" instruction with an immediate operand. That hook defaults to true but is overridden to false for vector types by ARM/AArch64/PowerPC (see the sketch below).
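To make that concrete, here is a hedged sketch of how a target overrides that hook; MyTargetLowering is a placeholder, and the body is modeled on (not copied from) the vector-type opt-outs mentioned above.

// preferIncOfAddToSubOfNot is a virtual TargetLowering hook that defaults to
// true. A target that prefers the sub-of-not form for vectors can opt out:
bool MyTargetLowering::preferIncOfAddToSubOfNot(EVT VT) const {
  // Keep the default behavior for scalars; opt out for vector types.
  return !VT.isVector();
}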
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll:277–329
TBH, it seems like a missed opportunity in the AMDGPU backend rather than a major problem with this patch. Isn't there someone from AMD we can get help from here?
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll:277–329
AMDGPUDAGToDAGISel::SelectDS1Addr1Offset would have to be taught to match appropriate uses of ISD::XOR as well as ISD::SUB (similar to how addressing-mode matchers have to match OR as well as ADD; see the sketch below). This code was added by @arsenm in D12283. It doesn't sound like a very important real-world use case, so maybe we could just ignore the regression?
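A rough sketch (my own, not actual AMDGPU code) of the kind of check such a matcher might use: treat (xor X, C) as the (sub C, X) it replaced when the constant's set bits cover every bit that may be set in X, mirroring the OR-as-ADD trick.

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Returns true if this xor is guaranteed to compute the same value as the
// sub it replaced, so a sub-based addressing matcher could accept it too.
static bool xorActsLikeSub(SelectionDAG &DAG, SDValue Xor) {
  if (Xor.getOpcode() != ISD::XOR)
    return false;
  auto *C = dyn_cast<ConstantSDNode>(Xor.getOperand(1));
  if (!C)
    return false;
  KnownBits Known = DAG.computeKnownBits(Xor.getOperand(0));
  // No possibly-set bit of the variable operand falls outside the constant,
  // so no bit of the original sub could have needed a borrow.
  return (~Known.Zero).isSubsetOf(C->getAPIntValue());
}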
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll:277–329
I think there is the regression, which is probably okay, but then there is what the test is actually testing for: the maximum offset. So maybe a second test case for the maximum offset, one that does not trigger the regression, should be added here, and then we can move on.
Patch updated:
No code changes. Based on review comments, I added 2 AMDGPU offset transform tests that show no diffs with this patch, plus test comments on the changed tests in case there is motivation to restore those patterns.
llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll:277–329
It might be nice if isBaseWithConstantOffset would recognize this like it does for add/or, but I don't think this is that important.

llvm/test/CodeGen/AMDGPU/setcc-multiple-use.ll:18–19
It's a bit weird to see a vector operation traded for a scalar one, but this seems fine.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:3766
It might be more profitable to do this transformation in SimplifyDemandedBits. Some bits of the result might not be used downstream, in which case the check can be relaxed.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:3766
The known-bits vs. demanded-bits transforms are two different things if I'm seeing it correctly, although there can be some overlap. Demanded bits may already be used to shrink the constants, so that could enable this transform. If you have an example in mind that doesn't transform as expected, would you please file an issue so we can track the fix? If it's not working here in codegen, it may also be missing from the IR equivalent. I updated the IR side to match this patch with:
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:3766
My understanding is as follows. SimplifyDemandedBits recurses down to the operands of the node, propagating information about which of their bits are needed; that might help to simplify the operands. On the way back from the recursion, the operands tell the node which of their bits are known, so that the node itself can be simplified. I could be wrong.
On second thought, it does not look so simple. Suppose we have two 2-bit operands, C = 0b10 and X = 0b1?, and we're demanding bit 1 (counting from 0) of the SUB. We still have to check for a possible borrow out of bit 0 (a small check of this is sketched below).
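A tiny standalone check of that 2-bit case (my illustration, not part of the review): even though only bit 1 of the result is demanded, a borrow out of the undemanded bit 0 can still flip bit 1, so sub and xor disagree there.

#include <cstdio>

int main() {
  // C = 0b10; X = 0b1? means bit 1 of X is known set and bit 0 is unknown.
  const unsigned C = 0b10;
  const unsigned Xs[] = {0b10u, 0b11u};
  for (unsigned X : Xs) {
    unsigned Sub = (C - X) & 0b11; // 2-bit wrap-around subtraction
    unsigned Xor = (C ^ X) & 0b11;
    std::printf("X=%u: sub bit1=%u, xor bit1=%u\n", X, (Sub >> 1) & 1,
                (Xor >> 1) & 1);
  }
  // For X=0b11 the borrow out of bit 0 makes sub bit1=1 while xor bit1=0.
  return 0;
}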
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:3766
By the way, a more general check would be to ignore a borrow from the bit after the MSB, e.g. as in the brute-force check sketched below.
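An illustrative brute-force check of that relaxation (the i8 example is mine, under the reading that a borrow generated at the sign bit simply falls off the top of the value): with C = 127, sub C, X and xor X, C agree for every X even though X's MSB is unconstrained.

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  const unsigned C = 0x7F; // i8 constant 127: every bit below the MSB is set.
  for (unsigned X = 0; X < 256; ++X)
    assert(uint8_t(C - X) == uint8_t(C ^ X)); // holds for all 256 values
  std::printf("sub %u, X == xor X, %u for every i8 X\n", C, C);
  return 0;
}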
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:3766
Yes, that looks like a good enhancement. Here's a try at a general proof and specific example for a regression test:
Hi @spatel
I've seen some regressions with this (or maybe with the similar update in instcombine that replaces sub with xor). Those patterns typically involve a sub that is used as an index in a GEP, and I think that both SCEV for regular IR and our downstream machine-IR scalar evolution have a hard time understanding that the xor is a subtract in disguise. So instead of ending up with a tight loop with a negative stride for the memory accesses, we now end up with xor operations inside the loop. I'm not quite sure how to deal with the regressions. Maybe the scalar evolution implementations can be improved here, or maybe our downstream ISel needs to select sub instead of xor (if the reverse transform is easy).
My target has lots of different instructions that involve add/sub. We can fold it into "multiply and add/subtract", we can fold it into addressing modes as an offset or by using a post-update on the pointer, and we can add/subtract in both "general purpose" registers and "pointer" registers. However, logical operations are more limited (especially when it comes to pointer arithmetic, since we also end up moving values between the GPRs that can do the logical operations and the pointer registers that can be used for addressing).
Since this is one instruction traded for another single instruction (we do not reduce the number of instructions in these combines), I'm interested to understand what deems an xor to be better than a sub. Are logical operations considered "better" than arithmetic operations in general, or what is the rule?
Thanks for letting me know. The main codegen motivation for this transform is that it can allow using an immediate-form xor instruction rather than a separate load of the immediate when used as Op0 of a subtract. That kind of improvement is seen in the RISCV and AArch64 diffs. Our folds for bitwise logic tend to be better than sub too, and known-bits / demanded-bits are also easier with xor. The bit-tracking improvement is why I figured we should also do this in instcombine.
But it was already a borderline codegen transform because of the AMDGPU diffs, so we do very likely need to restrict this with a TLI hook. If you have examples of the regressions you're seeing that can be translated to an in-tree target, that would be great.
I'm not really sure if this ends up as a regression or not, but with this source we can see some differences when using, for example, hexagon or systemz as targets.
(This example is related to doing sub->xor in instcombine so not exactly the patch in this review.)
Input C program (sub_xor.c):
static unsigned ARR[100];

extern void populate(unsigned *);

unsigned foo(unsigned *a) {
  populate(ARR);
  unsigned sum = 0;
  for (int j = 1; j < 4; ++j) {
    unsigned *ptr = &ARR[99];
    for (int i = 0; i < 70; ++i) {
      sum += j * *(ptr - i);
    }
  }
  return sum;
}
Compile with clang -target hexagon -O2 sub_xor.c -o sub_xor.ll -emit-llvm -S, and the only difference will be some subs ending up as xors (depending on whether the instcombine transform from commit 79bb915fb60b2cd2 is reverted or not).
If we also compare the output of llc -O2 sub_xor.ll -o - -print-after-all for those two different versions of sub_xor.ll from the prior step, we can see that Loop Strength Reduction can eliminate the sub but does not eliminate the xor.
Well, this is the best in-tree example I have so far (rather than aiming at a full test from C code, I guess this could be reduced to a much smaller test that just runs LSR to show that we get a different result for sub vs. xor).
Thanks for the example. The instcombine variation of this fold has a different problem that I noticed in another example.
Can you try the patch below and see if that fixes the regressions for your target?
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp b/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
index 535a7736454c..c51ce0e8d1c3 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
@@ -1966,13 +1966,6 @@ Instruction *InstCombinerImpl::visitSub(BinaryOperator &I) {
       return BinaryOperator::CreateAdd(X, ConstantExpr::getSub(C, C2));
   }
 
-  // If there's no chance any bit will need to borrow from an adjacent bit:
-  // sub C, X --> xor X, C
-  const APInt *Op0C;
-  if (match(Op0, m_APInt(Op0C)) &&
-      (~computeKnownBits(Op1, 0, &I).Zero).isSubsetOf(*Op0C))
-    return BinaryOperator::CreateXor(Op1, Op0);
-
   {
     Value *Y;
     // X-(X+Y) == -Y    X-(Y+X) == -Y
@@ -2231,7 +2224,20 @@ Instruction *InstCombinerImpl::visitSub(BinaryOperator &I) {
         I, Builder.CreateIntrinsic(Intrinsic::ctpop, {I.getType()},
                                    {Builder.CreateNot(X)}));
 
-  return TryToNarrowDeduceFlags();
+  if (Instruction *R = TryToNarrowDeduceFlags())
+    return R;
+
+  // If there's no chance any bit will need to borrow from an adjacent bit:
+  // sub C, X --> xor X, C
+  // Avoid this fold if the sub has no-wrap flags because that could be an
+  // information-losing transform that we cannot recover from.
+  const APInt *Op0C;
+  if (!I.hasNoSignedWrap() && !I.hasNoUnsignedWrap() &&
+      match(Op0, m_APInt(Op0C)) &&
+      (~computeKnownBits(Op1, 0, &I).Zero).isSubsetOf(*Op0C))
+    return BinaryOperator::CreateXor(Op1, Op0);
+
+  return nullptr;
 }
 
 /// This eliminates floating-point negation in either 'fneg(X)' or
If there's no chance any bit will need to borrow from an adjacent bit then there is no unsigned wrap by definition, so I'm not sure losing the nuw flag would actually lose any information.
Yes, good point. That suggests that we should also improve later passes to recognize xor as sub when profitable.
The test that I was looking at, where I don't think we can recover, is:
define i1 @test_negative_combined_sub_signed_overflow(i8 %x) {
; CHECK-LABEL: @test_negative_combined_sub_signed_overflow(
; CHECK-NEXT:    ret i1 false
;
  %y = sub nsw i8 127, %x
  %z = icmp slt i8 %y, -1
  ret i1 %z
}
If we convert that to xor (after applying the equivalent of the SDAG patch in d0eec5f7e787), then we would have:
  %y = xor i8 %x, 127
  %z = icmp slt i8 %y, -1
And now we can't fold the whole thing to 'false' anymore.
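A quick standalone check of that point (the example values are my own): without the nsw fact, xor i8 %x, 127 really can produce results that are slt -1, so the fold to false is only justified by the nsw on the original sub.

#include <cstdint>
#include <cstdio>

int main() {
  for (int X = -128; X <= 127; ++X) {
    int8_t Y = int8_t(127 ^ uint8_t(X)); // the post-transform xor form
    if (Y < -1) {
      // e.g. x = -127 gives 127 ^ 0x81 = 0xFE = -2, which is slt -1.
      std::printf("x = %d -> xor result %d is slt -1\n", X, Y);
      return 0;
    }
  }
  std::printf("no such value\n");
  return 0;
}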
Right, my argument only applies for nuw. I agree that losing nsw can still lose information.
I reran some benchmarks with that patch, and the cycle counts were restored in the tests that earlier showed a regression. So that is great.
I still have a small fear that xor is harder for us to lower into machine IR compared to sub (considering register constraints etc.). But I guess that is something we should deal with downstream somehow (investigate whether we can select xor as a sub when possible).