This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine][X86][AArch64][ARM] (C - x) + y -> (y - x) + C fold
ClosedPublic

Authored by lebedev.ri on May 22 2019, 12:11 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.May 22 2019, 12:11 PM

So the DAG is simpler, but that particular case is worse for the backend.

It's "simpler", in some sense, but it's more complicated to lower because it's using the carry output of a uaddo in a non-addcarry operation, which is legal, but more expensive than using in an addcarry, particularly on ARM. I think the missing combine is that we should be able to turn the subtraction into some form of addcarry/subcarry.

So the DAG is simpler, but that particular case is worse for the backend.

It's "simpler", in some sense, but it's more complicated to lower because it's using the carry output of a uaddo in a non-addcarry operation, which is legal, but more expensive than using in an addcarry, particularly on ARM. I think the missing combine is that we should be able to turn the subtraction into some form of addcarry/subcarry.

Hmm, that does not seem to be all.
I've added

bool HasSUBCARRY = TLI.isOperationLegalOrCustom(ISD::SUBCARRY, VT);
bool HasADDCARRY = TLI.isOperationLegalOrCustom(ISD::ADDCARRY, VT);
if (HasSUBCARRY || HasADDCARRY) {
  unsigned ANYCARRY = HasSUBCARRY ? ISD::SUBCARRY : ISD::ADDCARRY;
  // (sub X, Carry) -> (subcarry/addcarry X, 0, Carry)
  if (SDValue Carry = getAsCarry(TLI, N1)) {
    return DAG.getNode(ANYCARRY, DL, DAG.getVTList(VT, Carry.getValueType()),
                       N0, DAG.getConstant(0, DL, VT), Carry);
  }

  if (HasSUBCARRY) {
    // (sub Carry, X) -> (subcarry 0, X, Carry)
    if (SDValue Carry = getAsCarry(TLI, N0)) {
      return DAG.getNode(ISD::SUBCARRY, DL,
                         DAG.getVTList(VT, Carry.getValueType()),
                         DAG.getConstant(0, DL, VT), N1, Carry);
    }
  }
}

at the end of DAGCombiner::visitSUB(), and get this

SelectionDAG has 17 nodes:
  t0: ch = EntryToken
          t6: i32,ch = CopyFromReg t0, Register:i32 %2
            t4: i32,ch = CopyFromReg t0, Register:i32 %1
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t30: i32,i32 = uaddo t4, t2
        t31: i32,i32 = subcarry Constant:i32<0>, t6, t30:1
      t39: i32 = and t31, Constant:i32<65535>
    t40: ch = br_cc t0, seteq:ch, t39, Constant:i32<65535>, BasicBlock:ch<if.end 0x55706fb58828>
  t19: ch = br t40, BasicBlock:ch<for.cond.preheader 0x55706fb58698>

which is then resulting in

SelectionDAG has 22 nodes:
  t0: ch = EntryToken
            t6: i32,ch = CopyFromReg t0, Register:i32 %2
                    t4: i32,ch = CopyFromReg t0, Register:i32 %1
                    t2: i32,ch = CopyFromReg t0, Register:i32 %0
                  t51: i32,i32 = ARMISD::ADDC t4, t2
                t52: i32,i32 = ARMISD::ADDE Constant:i32<0>, Constant:i32<0>, t51:1
              t45: i32 = sub Constant:i32<1>, t52
            t46: i32,i32 = ARMISD::SUBC t45, Constant:i32<1>
          t47: i32,i32 = ARMISD::SUBE Constant:i32<0>, t6, t46:1
        t39: i32 = and t47, Constant:i32<65535>
      t41: glue = ARMISD::CMPZ t39, Constant:i32<65535>
    t43: ch = ARMISD::BRCOND t0, BasicBlock:ch<if.end 0x55706fb58828>, Constant:i32<0>, Register:i32 $cpsr, t41
  t19: ch = br t43, BasicBlock:ch<for.cond.preheader 0x55706fb58698>

So some other ARM-specific combine is missing, too?
Or am i totally misreading this?

uaddo+subcarry doesn't really work the way you'd want it to on ARM... I think you have to invert the carry bit, and there isn't any convenient way to do that. (Note the extra "SUB" between the ARMISD::ADDE and the ARMISD::SUBC.) So I think to get the optimal code here you have to produce neg+addcarry instead of subcarry.

uaddo+subcarry doesn't really work the way you'd want it to on ARM... I think you have to invert the carry bit, and there isn't any convenient way to do that. (Note the extra "SUB" between the ARMISD::ADDE and the ARMISD::SUBC.) So I think to get the optimal code here you have to produce neg+addcarry instead of subcarry.

Hmm. I'm sorry, that did not make the problem any clearer to me.
Sadly, i have literally never encountered 'carry' before, so this makes very little sense to me.
Are you saying that the following transform should be performed: (sub Carry, X) -> (addcarry (sub 0, X), 0, !Carry)?
That regresses the test even further..

Are you saying that the following transform should be performed: (sub Carry, X) -> (addcarry (sub 0, X), 0, !Carry)?

The correct transform is (sub Carry, X) -> (add Carry, (sub 0, X)) -> (addcarry (sub 0, X), 0, Carry), no "!".

Sorry, I wasn't really clear; the carry bit doesn't need to be inverted in target-independent code. It's just that on ARM, "subs"/"sbcs" produce a carry bit which has the opposite meaning of the "overflow" bit from usubo, and "sbc(s)" consumes a carry bit that's inverted from the carry bit operand of subcarry. If the result of a usubo is consumed by a subcarry, the two inversions cancel; otherwise, we need to transfer the the carry bit to a GPR, invert it, then transfer it back to the flags register. (Maybe this could be optimized a bit in certain cases, but it doesn't really optimize in general.)

Are you saying that the following transform should be performed: (sub Carry, X) -> (addcarry (sub 0, X), 0, !Carry)?

The correct transform is (sub Carry, X) -> (add Carry, (sub 0, X)) -> (addcarry (sub 0, X), 0, Carry), no "!".

Oh right, i think i may have mistyped here. D62392

Sorry, I wasn't really clear; the carry bit doesn't need to be inverted in target-independent code. It's just that on ARM, "subs"/"sbcs" produce a carry bit which has the opposite meaning of the "overflow" bit from usubo, and "sbc(s)" consumes a carry bit that's inverted from the carry bit operand of subcarry. If the result of a usubo is consumed by a subcarry, the two inversions cancel; otherwise, we need to transfer the the carry bit to a GPR, invert it, then transfer it back to the flags register. (Maybe this could be optimized a bit in certain cases, but it doesn't really optimize in general.)

Thanks! That helps improves half of the cases (although still regressed as compared to LHS of this diff).

This comment was removed by lebedev.ri.
efriedma added inline comments.May 24 2019, 3:03 PM
test/CodeGen/ARM/addsubcarry-promotion.ll
10 ↗(On Diff #200806)

It looks like this test didn't get rebased correctly?

35 ↗(On Diff #200806)

The big difference that's making the new code worse, now, is that somehow the compare x==0 is getting transformed to something more like x==-1... and I guess something isn't handling that well. That's probably something the ARM backend should be handling in target-specific code, though; feel free to just file a bug for the missed optimization on void a(int s, void f()) { if ((short)s==-1)f(); }

lebedev.ri marked 2 inline comments as done.May 24 2019, 3:23 PM
lebedev.ri added inline comments.
test/CodeGen/ARM/addsubcarry-promotion.ll
10 ↗(On Diff #200806)

No, D62392 is a child revision of this.

35 ↗(On Diff #200806)

I understand how we get here. In IR terms, this is just two folds:
https://rise4fun.com/Alive/fTpN
https://godbolt.org/z/6RLVTQ

And thus we end with:

Optimized type-legalized selection DAG: %bb.0 'fn1:entry'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
              t6: i32,ch = CopyFromReg t0, Register:i32 %2
            t37: i32 = sub Constant:i32<0>, t6
          t39: i32 = sign_extend_inreg t37, ValueType:ch:i16
            t4: i32,ch = CopyFromReg t0, Register:i32 %1
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t36: i32,i32 = uaddo t4, t2
        t47: i32,i32 = addcarry t39, Constant:i32<0>, t36:1
      t46: i32 = and t47, Constant:i32<65535>
    t50: ch = br_cc t0, seteq:ch, t46, Constant:i32<65535>, BasicBlock:ch<if.end 0x5618ef707f18>
  t19: ch = br t50, BasicBlock:ch<for.cond.preheader 0x5618ef707d88>

I wonder, maybe this can still be undone in DAGCombine.
The important bits are:

    t47: i32,i32 = addcarry t39, Constant:i32<0>, t36:1
  t46: i32 = and t47, Constant:i32<65535>
t50: ch = br_cc t0, seteq:ch, t46, Constant:i32<65535>, BasicBlock:ch<if.end 0x5618ef707f18>

I wonder if the problem can be fixed by transforming this back into:

  t47: i32,i32 = addcarry t39, Constant:i32<65535>, t36:1
t50: ch = br_cc t0, seteq:ch, t47, Constant:i32<0>, BasicBlock:ch<if.end 0x5618ef707f18>
lebedev.ri marked an inline comment as done.May 24 2019, 3:31 PM
lebedev.ri added inline comments.
test/CodeGen/ARM/addsubcarry-promotion.ll
35 ↗(On Diff #200806)

Or more specifically, https://rise4fun.com/Alive/slGX
This would be an ok transform here because the second argument of addcarry is 0 here anyway.

lebedev.ri marked an inline comment as done.May 24 2019, 3:42 PM
lebedev.ri added inline comments.
test/CodeGen/ARM/addsubcarry-promotion.ll
35 ↗(On Diff #200806)

Or without hardcoded constants: https://rise4fun.com/Alive/B2k
Not sure if it's worthwhile doing this if it requires creating whole new 'add',
but here we should be able to do it.

lebedev.ri marked 2 inline comments as done.May 25 2019, 10:28 AM
lebedev.ri added inline comments.
test/CodeGen/ARM/addsubcarry-promotion.ll
35 ↗(On Diff #200806)

Done, D62450.
I think that fixes (and improves upon?) the last regression, feel free to review these three patches :)

lebedev.ri edited the summary of this revision. (Show Details)May 25 2019, 10:29 AM
lebedev.ri marked an inline comment as done.May 28 2019, 3:10 PM
lebedev.ri added inline comments.
test/CodeGen/ARM/addsubcarry-promotion.ll
35 ↗(On Diff #200806)

ping @efriedma :)

Hello. I found this code that seems to loop forever:

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv6m-arm-none-eabi"

@a = dso_local local_unnamed_addr global i32 0, align 4
@b = dso_local local_unnamed_addr global i32 0, align 4

; Function Attrs: minsize nounwind optsize
define dso_local i32 @c() local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* @a, align 4
  %sub = sub nsw i32 2000, %0
  %call = tail call i32 bitcast (i32 (...)* @d to i32 (i32)*)(i32 %sub) #2
  %1 = load i32, i32* @b, align 4
  %cmp = icmp sgt i32 %1, 1999
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %call1 = tail call i32 bitcast (i32 (...)* @e to i32 ()*)() #2
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret i32 undef
}

declare dso_local i32 @d(...) local_unnamed_addr #1
declare dso_local i32 @e(...) local_unnamed_addr #1

Hello. I found this code that seems to loop forever:

*This* code? As in *this* patch, correct?

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv6m-arm-none-eabi"

@a = dso_local local_unnamed_addr global i32 0, align 4
@b = dso_local local_unnamed_addr global i32 0, align 4

; Function Attrs: minsize nounwind optsize
define dso_local i32 @c() local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* @a, align 4
  %sub = sub nsw i32 2000, %0
  %call = tail call i32 bitcast (i32 (...)* @d to i32 (i32)*)(i32 %sub) #2
  %1 = load i32, i32* @b, align 4
  %cmp = icmp sgt i32 %1, 1999
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %call1 = tail call i32 bitcast (i32 (...)* @e to i32 ()*)() #2
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret i32 undef
}

declare dso_local i32 @d(...) local_unnamed_addr #1
declare dso_local i32 @e(...) local_unnamed_addr #1

Thanks for the reproducer.
There may be one more (as in, an unrelated to this one) endless loop in D62257 in test-suite, did not reduce it yet (i reverted that patch)

*This* code? As in *this* patch, correct?

Hello. I meant this .ll file seems to loop forever. With this patch or one of the three others (D62257, D62392, D62450). I only tested them together, so it may be the same as the problem from the test-suite.

*This* code? As in *this* patch, correct?

Hello. I meant this .ll file seems to loop forever. With this patch or one of the three others (D62257, D62392, D62450). I only tested them together, so it may be the same as the problem from the test-suite.

Confirmed, thank you for the reproducer.
This is another hang from what i have seen with D62257.

ping @efriedma.

Rebased, fixed endless combine loop (by not hoisting if constant is opaque, since we don't constant-fold those).

efriedma accepted this revision.May 31 2019, 2:18 PM

LGTM. I'm okay with working on adding more test coverage and fixes for the ADDCARRY cases in a followup.

This revision is now accepted and ready to land.May 31 2019, 2:18 PM
lebedev.ri updated this revision to Diff 202545.Jun 1 2019, 5:43 AM

Rebased, NFC.

This revision was automatically updated to reflect the committed changes.