Differential D62266
[DAGCombine][X86][AArch64][ARM] (C - x) + y -> (y - x) + C fold
Authored by lebedev.ri on May 22 2019, 12:11 PM.

Details
All changes except ARM look great.
https://rise4fun.com/Alive/R2M
The regression in test/CodeGen/ARM/addsubcarry-promotion.ll is recovered fully by D62392 + D62450.
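
For orientation, the fold in the title can be sketched as a DAGCombiner-style fragment. The following is a hypothetical illustration only, not the patch's actual code; it assumes it sits in a visitADD-style function where N0 is the (sub C, x) operand, N1 is y, DL/VT are the usual node location and value type, and it reuses DAGCombiner's isConstantOrConstantVector helper:

// Hypothetical sketch: fold (add (sub C, x), y) -> (add (sub y, x), C)
// so the constant becomes an outer operand and can fold into later uses.
// Assumes: SDValue N0 = N->getOperand(0); // (sub C, x)
//          SDValue N1 = N->getOperand(1); // y
//          SDLoc DL(N); EVT VT = N0.getValueType();
if (N0.getOpcode() == ISD::SUB &&
    isConstantOrConstantVector(N0.getOperand(0), /*NoOpaques=*/true)) {
  SDValue NewSub = DAG.getNode(ISD::SUB, DL, VT, N1, N0.getOperand(1)); // y - x
  return DAG.getNode(ISD::ADD, DL, VT, NewSub, N0.getOperand(0));       // + C
}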

It's "simpler", in some sense, but it's more complicated to lower, because it uses the carry output of a uaddo in a non-addcarry operation. That is legal, but more expensive than using it in an addcarry, particularly on ARM. I think the missing combine is that we should be able to turn the subtraction into some form of addcarry/subcarry.

Hmm, that does not seem to be all. If I add

bool HasSUBCARRY = TLI.isOperationLegalOrCustom(ISD::SUBCARRY, VT);
bool HasADDCARRY = TLI.isOperationLegalOrCustom(ISD::ADDCARRY, VT);
if (HasSUBCARRY || HasADDCARRY) {
  unsigned ANYCARRY = HasSUBCARRY ? ISD::SUBCARRY : ISD::ADDCARRY;
  // (sub X, Carry) -> (subcarry/addcarry X, 0, Carry)
  if (SDValue Carry = getAsCarry(TLI, N1)) {
    return DAG.getNode(ANYCARRY, DL, DAG.getVTList(VT, Carry.getValueType()),
                       N0, DAG.getConstant(0, DL, VT), Carry);
  }
  if (HasSUBCARRY) {
    // (sub Carry, X) -> (subcarry 0, X, Carry)
    if (SDValue Carry = getAsCarry(TLI, N0)) {
      return DAG.getNode(ISD::SUBCARRY, DL,
                         DAG.getVTList(VT, Carry.getValueType()),
                         DAG.getConstant(0, DL, VT), N1, Carry);
    }
  }
}

at the end of DAGCombiner::visitSUB(), I get this:

SelectionDAG has 17 nodes:
  t0: ch = EntryToken
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t30: i32,i32 = uaddo t4, t2
  t31: i32,i32 = subcarry Constant:i32<0>, t6, t30:1
  t39: i32 = and t31, Constant:i32<65535>
  t40: ch = br_cc t0, seteq:ch, t39, Constant:i32<65535>, BasicBlock:ch<if.end 0x55706fb58828>
  t19: ch = br t40, BasicBlock:ch<for.cond.preheader 0x55706fb58698>

which then results in:

SelectionDAG has 22 nodes:
  t0: ch = EntryToken
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t51: i32,i32 = ARMISD::ADDC t4, t2
  t52: i32,i32 = ARMISD::ADDE Constant:i32<0>, Constant:i32<0>, t51:1
  t45: i32 = sub Constant:i32<1>, t52
  t46: i32,i32 = ARMISD::SUBC t45, Constant:i32<1>
  t47: i32,i32 = ARMISD::SUBE Constant:i32<0>, t6, t46:1
  t39: i32 = and t47, Constant:i32<65535>
  t41: glue = ARMISD::CMPZ t39, Constant:i32<65535>
  t43: ch = ARMISD::BRCOND t0, BasicBlock:ch<if.end 0x55706fb58828>, Constant:i32<0>, Register:i32 $cpsr, t41
  t19: ch = br t43, BasicBlock:ch<for.cond.preheader 0x55706fb58698>

So some other ARM-specific combine is missing, too?

uaddo+subcarry doesn't really work the way you'd want it to on ARM... I think you have to invert the carry bit, and there isn't any convenient way to do that. (Note the extra "SUB" between the ARMISD::ADDE and the ARMISD::SUBC.) So I think to get the optimal code here you have to produce neg+addcarry instead of subcarry.

Hmm. I'm sorry, that did not make the problem any clearer to me.

The correct transform is (sub Carry, X) -> (add Carry, (sub 0, X)) -> (addcarry (sub 0, X), 0, Carry), with no "!". Sorry, I wasn't really clear; the carry bit doesn't need to be inverted in target-independent code. It's just that on ARM, "subs"/"sbcs" produce a carry bit which has the opposite meaning of the "overflow" bit from usubo, and "sbc(s)" consumes a carry bit that's inverted from the carry bit operand of subcarry. If the result of a usubo is consumed by a subcarry, the two inversions cancel; otherwise, we need to transfer the carry bit to a GPR, invert it, then transfer it back to the flags register. (Maybe this could be optimized a bit in certain cases, but it doesn't really optimize in general.)

Oh right, I think I may have mistyped here. D62392
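
To give that suggestion a concrete shape, here is a rough, hypothetical sketch of the (sub Carry, X) -> (addcarry (sub 0, X), 0, Carry) form described above, in the same DAGCombiner style as the earlier snippet (reusing its getAsCarry helper and N0/N1/DL/VT setup); it is an illustration, not the code that eventually landed:

// Hypothetical sketch: (sub Carry, X) -> (addcarry (sub 0, X), 0, Carry).
// Using ADDCARRY rather than SUBCARRY avoids the extra carry-bit inversion
// that a SUBCARRY consuming a uaddo carry would need when lowered on ARM.
if (TLI.isOperationLegalOrCustom(ISD::ADDCARRY, VT)) {
  if (SDValue Carry = getAsCarry(TLI, N0)) {
    SDValue NegX =
        DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), N1); // 0 - X
    return DAG.getNode(ISD::ADDCARRY, DL,
                       DAG.getVTList(VT, Carry.getValueType()),
                       NegX, DAG.getConstant(0, DL, VT), Carry);
  }
}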

Thanks! That helps improve half of the cases (although they are still regressed compared to the LHS of this diff).

Hello. I found this code that seems to loop forever:

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv6m-arm-none-eabi"

@a = dso_local local_unnamed_addr global i32 0, align 4
@b = dso_local local_unnamed_addr global i32 0, align 4

; Function Attrs: minsize nounwind optsize
define dso_local i32 @c() local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* @a, align 4
  %sub = sub nsw i32 2000, %0
  %call = tail call i32 bitcast (i32 (...)* @d to i32 (i32)*)(i32 %sub) #2
  %1 = load i32, i32* @b, align 4
  %cmp = icmp sgt i32 %1, 1999
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  %call1 = tail call i32 bitcast (i32 (...)* @e to i32 ()*)() #2
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret i32 undef
}

declare dso_local i32 @d(...) local_unnamed_addr #1
declare dso_local i32 @e(...) local_unnamed_addr #1

*This* code? As in *this* patch, correct?
Thanks for the reproducer.

Confirmed, thank you for the reproducer.

ping @efriedma. Rebased, fixed the endless combine loop (by not hoisting if the constant is opaque, since we don't constant-fold those).

LGTM. I'm okay with working on adding more test coverage and fixes for the ADDCARRY cases in a followup.
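
For completeness, a tiny hypothetical sketch of the kind of guard that "not hoisting if the constant is opaque" refers to; this only illustrates the idea and is not the exact change in the diff:

// Hypothetical guard: only hoist C if it is not an opaque constant, since
// opaque constants are never constant-folded, so hoisting one can ping-pong
// with other combines and loop forever.
auto *C = dyn_cast<ConstantSDNode>(N0.getOperand(0));
if (C && !C->isOpaque()) {
  // ... perform the (add (sub C, x), y) -> (add (sub y, x), C) rewrite ...
}
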
It looks like this test didn't get rebased correctly?