Page MenuHomePhabricator

[ARM] Use ADDCARRY / SUBCARRY
ClosedPublic

Authored by rogfer01 on Jul 10 2017, 2:40 AM.

Details

Summary

This is a preparatory step for D34515

This patch:

  • makes nodes ISD::ADDCARRY and ISD::SUBCARRY legal for i32
  • lowering is done by first converting the boolean value into the carry flag using (_, C) ← (ARMISD::ADDC R, -1) and converted back to an integer value using (R, _) ← (ARMISD::ADDE 0, 0, C). An ARMISD::ADDE between the two operations does the actual addition.
  • for subtraction, given that ISD::SUBCARRY second result is actually a borrow, we need to invert the value of the second operand and result before and after using ARMISD::SUBE. We need to invert the carry result of ARMISD::SUBE to preserve the semantics.
  • given that the generic combiner may lower ISD::ADDCARRY and ISD::SUBCARRYinto ISD::UADDO and ISD::USUBO we need to update their lowering as well otherwise i64 operations now would require branches. This implies updating the corresponding test for unsigned.
  • add new combiner to remove the redundant conversions from/to carry flags to/from boolean values (ARMISD::ADDC (ARMISD::ADDE 0, 0, C), -1)C

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Hi @efriedma, thanks again for the comments. I have updated the diff with your suggestions.

Please find a comment for the ARMTargetLowering::computeKnownBitsForTargetNode, I think it is now OK but I will appreciate some more insight.

Kind regards,
Roger

lib/Target/ARM/ARMISelLowering.cpp
12823 ↗(On Diff #105841)

My understanding is that ISD::ADDE/ISD::SUBE and (the now deprecated) ISD::ADDC/ISD::SUBC are immediately lowered into ARMISD::ADDE and ARMISD::ADDC respectively so I think that for the second result nothing has changed. The extra check for the first is because the extra hops that we do now when lowering ADDCARRY / SUBCARRY.

That said, you were right to point me at this part of code because I introduced a bug here, which I fixed in this update.

Sadly I'm not sure how can I test that these bits are correctly accounted for this second operand (for the first operand is easy and already caght by the regression testsuite: not knowing it will introduce spurious and operations to make sure only 1 bit is used).

I find difficult to generate an ARMISD::ADDE (using IR) where some combiner wants to check the outcome of the second operand, even more now that the patch legalizes ADDCARRY / SUBCARRY and we do some extra hops to get in and out the carry flag itself. It would be a bit worrying that this were wrong at this point because it'd mean I cannot do exactly that.

It could well be that I'm missing something obvious here.

efriedma added inline comments.Jul 13 2017, 2:49 PM
lib/Target/ARM/ARMISelLowering.cpp
12823 ↗(On Diff #105841)

The second result of ARMISD::SUBE isn't a boolean; it's a copy of CPSR.

rogfer01 updated this revision to Diff 106593.Jul 14 2017, 2:17 AM

ChangeLog:

  • Remove incorrect computation of known bits for an operand that contains a copy of the CPSR and not a boolean.

Thanks again @efriedma.

From your comment I think that we should remove this case as it was wrong already. Let me know if I understood you correctly.

Regards,
Roger

efriedma added inline comments.Jul 14 2017, 12:51 PM
lib/Target/ARM/ARMISelLowering.cpp
4011 ↗(On Diff #106593)

Oops, I should have spotted the issue here earlier.

The value produced by sbc is a+~b+C. So (ARMISD::SUBE i32 1, i32 0) is computing 1+-1+C==C. The value you actually need to produce here is 1-C.

In practice, this means we're generating wrong code for the usub_overflow testcase.

rogfer01 updated this revision to Diff 107118.Jul 18 2017, 9:33 AM
rogfer01 edited the summary of this revision. (Show Details)

ChangeLog:

  • Correctly handle the second operand of ISD::SUBCARRY. We need to invert its value (i.e. 1-x) before passing it to ARMISD::SUBE, also the result of ARMISD::SUBE has to be inverted again to match what ISD::SUBCARRY means.
  • Fix USUBO lowering as a consequence of the above.
  • Add a combiner for (SUBC 1, (SUBC 1, X)) → X
  • Compute the known bits of (SUBC 1, (ADDE 0, 0, C)) (1 bit) to avoid spurious and instructions.
  • Update the tests for usub_builtin_overflow with the correct code generation.
  • Remove combiners that became pointless after the changes above.
rogfer01 marked an inline comment as done.Jul 18 2017, 9:36 AM

Hi @efriedma, thanks for the review, please find the updated changes. I was wrongly handling ISD::SUBCARRY because its operand is a borrow while ARMISD::SUBE expects a carry.

lib/Target/ARM/ARMISelLowering.cpp
4011 ↗(On Diff #106593)

Thanks Eli, this was a serious bug. Sorry for that. I used the obvious approach instead.

efriedma added inline comments.Jul 18 2017, 10:32 AM
lib/Target/ARM/ARMISelLowering.cpp
4014 ↗(On Diff #107118)

Could you use ISD::SUB here, rather than ARMISD::SUBC?

rogfer01 updated this revision to Diff 107301.Jul 19 2017, 7:41 AM
rogfer01 marked an inline comment as done.
rogfer01 edited the summary of this revision. (Show Details)

ChangeLog:

  • Use ISD::SUB instead of ARMISD::SUBC when inverting the borrow / carry.
  • Due to the above, remove now unnecessary target-specific combiner.
rogfer01 marked an inline comment as done.Jul 19 2017, 7:42 AM

Hi @efriedma thanks for your suggestion, this way I can avoid one target-specific combiner which I have now removed.

Regards,
Roger

efriedma added inline comments.Jul 19 2017, 11:08 AM
lib/Target/ARM/ARMISelLowering.cpp
7478 ↗(On Diff #107301)

Maybe you could refactor this to share a little more code for ADDCARRY/SUBCARRY?

9996 ↗(On Diff #107301)

Do we still need the code for SUBC here?

12828 ↗(On Diff #107301)

Do we still need the code for SUBC here?

rogfer01 updated this revision to Diff 108291.Jul 26 2017, 9:02 AM
rogfer01 marked 2 inline comments as done.

ChangeLog:

  • Remove unnecessary combiner for ARMISD::SUBC
  • Refactor a bit the conversion carry to/from boolean
efriedma accepted this revision.Jul 27 2017, 11:35 AM

LGTM, with minor comments.

Please make sure to update the commit message to account for the changes during the review.

lib/Target/ARM/ARMISelLowering.cpp
7477 ↗(On Diff #108291)

Weird indentation (maybe try clang-format?)

This revision is now accepted and ready to land.Jul 27 2017, 11:35 AM
rogfer01 updated this revision to Diff 109107.Aug 1 2017, 7:21 AM
rogfer01 edited the summary of this revision. (Show Details)

ChangeLog:

  • clang-format fixes

I plan to submit this tomorrow morning (BST time).

This revision was automatically updated to reflect the committed changes.
rogfer01 reopened this revision.Aug 3 2017, 11:45 PM

Reopening: this is causing PR34045.

This revision is now accepted and ready to land.Aug 3 2017, 11:45 PM
rogfer01 updated this revision to Diff 114381.Sep 8 2017, 9:45 AM

ChangeLog:

  • Fix problem in S/UMLAL combiner that caused PR34045.
  • New bugpoint-reduced test based on the reproducer of PR34045.
rogfer01 added inline comments.Sep 8 2017, 9:49 AM
lib/Target/ARM/ARMISelLowering.cpp
9891–9894 ↗(On Diff #114381)

This fixes the problem in PR34045.

I wonder if AddeNode + LowAdd might have the same problem but I don't think it can happen.

The rest of boringssl has built correctly using -mthumb, though.

efriedma accepted this revision.Sep 8 2017, 11:34 AM

LGTM

lib/Target/ARM/ARMISelLowering.cpp
9891–9894 ↗(On Diff #114381)

I wonder if AddeNode + LowAdd might have the same problem but I don't think it can happen.

loAdd is a predecessor of AddcNode, which is a predecessor of AddeNode, so I don't think you can run into problems there. (It's possible for loAdd to use the result of the UMUL_LOHI, but you don't end up with a cycle in that case, just a redundant multiply.)

I guess this issue only shows up with your patch because it's impossible to create a DAG like that without ADDCARRY transforms?

Thanks for the review Eli!

lib/Target/ARM/ARMISelLowering.cpp
9891–9894 ↗(On Diff #114381)

I think so. Before the ADDCARRY transforms the DAG is slightly simpler at the point this combiner kicks in.

This revision was automatically updated to reflect the committed changes.
rogfer01 reopened this revision.Sep 12 2017, 9:41 AM

We're still hitting PR34045. I also plan integrate the fix in D37690 to ease reverting if needed.

This revision is now accepted and ready to land.Sep 12 2017, 9:41 AM
rogfer01 updated this revision to Diff 114996.Sep 13 2017, 1:30 AM

ChangeLog:

  • fix for the second instance of PR34045
  • also integrates the changes to fix PR34564, fall-out of the original change, that was reverted as well
rogfer01 requested review of this revision.EditedSep 13 2017, 1:34 AM
rogfer01 edited edge metadata.

In the second instance of PR34045 what happens is that ADDC is exactly HiAdd, not just a predecessor of the latter.

That said, let me first check I can build Chrome for ARM with this change. It is not a matter of bothering Google folks :)

efriedma accepted this revision.Sep 13 2017, 11:35 AM

Makes sense.

This revision is now accepted and ready to land.Sep 13 2017, 11:35 AM
rogfer01 added a comment.EditedSep 14 2017, 6:09 AM

Ok, Chrome for Linux Arm took a bit to build with a debug compiler but finished successfully.

That said tomorrow I'm on holidays tomorrow until Tuesday so I will commit this then (I want to monitor this change as it has proved more disruptive than expected).

This revision was automatically updated to reflect the committed changes.

Roger, We are seeing a fail in ChromeOS with the CL r313618. I'll try to send a test case soon.

rogfer01 reopened this revision.Nov 1 2017, 7:08 AM

This is causing PR35103, so reopening after the revert, so I can investigate what is going wrong there.

This revision is now accepted and ready to land.Nov 1 2017, 7:08 AM
rogfer01 planned changes to this revision.Nov 1 2017, 7:09 AM

Hi @efriedma ,

I'm trying to understand what went wrong in the following input

define i32 @foo(i32 %var1, i32 %var2, i32 %var3, i32 %var4, i32 %var5) local_unnamed_addr #0 {
entry:
  %conv1 = zext i32 %var3 to i64
  %conv2 = zext i32 %var1 to i64
  %add3 = add nuw nsw i64 %conv1, %conv2
  %shr = lshr i64 %add3, 32
  %conv5 = trunc i64 %shr to i32
  %conv7 = and i64 %add3, 4294967295
  %add10 = add nuw nsw i64 %conv7, %conv2
  %shr12 = lshr i64 %add10, 32
  %conv13 = trunc i64 %shr12 to i32
  %add14 = add nuw nsw i32 %conv13, %conv5
  %conv16 = zext i32 %var4 to i64
  %conv18 = zext i32 %var2 to i64
  %add19 = add nuw nsw i64 %conv16, %conv18
  %shr21 = lshr i64 %add19, 32
  %conv24 = zext i32 %var5 to i64
  %add25 = add nuw nsw i64 %shr21, %conv24
  %shr28 = lshr i64 %add25, 32
  %conv29 = trunc i64 %shr28 to i32
  %add30 = add nuw nsw i32 %add14, %conv29
  ret i32 %add30
}

which (with ADDCARRY / SUBCARRY nodes enabled) generates the following selection using

./bin/llc -mtriple armv8 -O3 -o - bad.ll
SelectionDAG has 28 nodes:
  t0: ch = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
    t6: i32,ch = CopyFromReg t0, Register:i32 %vreg2
  t75: i32,i32 = ADDSrr t6, t2, TargetConstant:i32<14>, Register:i32 %noreg
          t58: i32 = MOVi TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %noreg
            t69: i32,i32 = ADDSrr t75, t2, TargetConstant:i32<14>, Register:i32 %noreg
          t84: ch,glue = CopyToReg t0, Register:i32 %CPSR, t69:1
        t70: i32,i32 = ADCri t58, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %noreg, t84:1
        t83: ch,glue = CopyToReg t0, Register:i32 %CPSR, t75:1
      t66: i32,i32 = ADCri t70, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %noreg, t83:1
      t47: i32,ch = LDRi12<Mem:LD4[FixedStack-1](align=8)> TargetFrameIndex:i32<-1>, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0
          t8: i32,ch = CopyFromReg t0, Register:i32 %vreg3
          t4: i32,ch = CopyFromReg t0, Register:i32 %vreg1
        t72: i32,i32 = ADDSrr t8, t4, TargetConstant:i32<14>, Register:i32 %noreg
      t81: ch,glue = CopyToReg t0, Register:i32 %CPSR, t72:1
    t62: i32,i32 = ADCrr t66, t47, TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %noreg, t81:1
  t35: ch,glue = CopyToReg t0, Register:i32 %R0, t62
  t36: ch = BX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %R0, t35, t35:1

Where t75 second value is consumed by t83 CopyToReg which is glued to t66 ADCri. Similarly t69 second value is consumed by t84 CopytoReg which is glued to t70. I think this is not constraining enough the schedule which decides to do

BB#0: derived from LLVM BB %entry
    Live Ins: %R0 %R1 %R2 %R3
	%vreg3<def> = COPY %R3; GPR:%vreg3
	%vreg2<def> = COPY %R2; GPR:%vreg2
	%vreg1<def> = COPY %R1; GPR:%vreg1
	%vreg0<def> = COPY %R0; GPR:%vreg0
	%vreg4<def> = MOVi 0, pred:14, pred:%noreg, opt:%noreg; GPR:%vreg4
	%vreg5<def> = ADDrr %vreg2, %vreg0, pred:14, pred:%noreg, opt:%CPSR<def>; GPR:%vreg5,%vreg2,%vreg0
	%vreg6<def> = ADDrr %vreg5<kill>, %vreg0, pred:14, pred:%noreg, opt:%CPSR<def>; GPR:%vreg6,%vreg5,%vreg0
	%vreg7<def> = ADCri %vreg4<kill>, 0, pred:14, pred:%noreg, opt:%noreg, %CPSR<imp-use>; GPR:%vreg7,%vreg4
  ...

This is, the CPSR computed in %vreg5 is clobbered by the CPSR computed in %vreg6.

Looks like (after we have lowered addcarry/subcarry into ARMISD::ADDC/ARMISD::ADDE and removed the redundant nodes to make sure the carry flag is correctly propagated) we have ended with something that cannot be scheduled sensibly without storing the carry flag somewhere. Given that it is found inside CPSR this will mean MRS / MSR which will move more bits than the ones I need from/to the flags.

Before addcarry/subcarry, the code can be scheduled without having to preserve the flags anywhere so I must be missing something very obvious here. Perhaps the combiners we do in the ARMISelLowering have to be more conservative?

Kind regards,

Uploaded

DAG after isel.

The DAG scheduler, my least favorite part of SelectionDAG. :(

There's supposed to be code that ensures we don't end up with a schedule like that (by either changing the schedule, or introducing an cross-class copy). See ScheduleDAGRRList::DelayForLiveRegsBottomUp and ScheduleDAGRRList::PickNodeToScheduleBottomUp. It's possible there's a bug somewhere in the handling of optional defs, since they're only used on ARM.

Thanks a lot for the pointer Eli! :-)

Hmm, scheduling is correct. I was all wrong, the legalized DAG is already wrong.

typedef unsigned int mp_digit;
typedef unsigned long long mp_word;

int
foo(mp_digit vreg0, mp_digit vreg1, mp_digit vreg2, mp_digit vreg3, mp_digit vreg4)
{
    mp_digit carry = 0;

    mp_word w1 = ((mp_word)0) + (vreg2) + (vreg0);
    mp_digit vreg2_2 = (mp_digit)(w1);
    mp_digit carry2 = (mp_digit)((w1) >> (8 * sizeof(mp_digit)));

    // int r8_1 = carry2;

    mp_word w2 = ((mp_word)0) + (vreg2_2) + (vreg0);
    mp_digit carry4 = (mp_digit)((w2) >> (8 * sizeof(mp_digit)));

    // int r8_2 = rb8_1 + carry4;
    int r8_2 = carry2 + carry4;

    mp_word w3 = ((mp_word)0) + (vreg3) + (vreg1);
    mp_digit carry6 = (mp_digit)((w3) >> (8 * sizeof(mp_digit)));

    mp_word w4 = ((mp_word)carry6) + (vreg4) + (0);
    mp_digit carry7 = (mp_digit)((w4) >> (8 * sizeof(mp_digit)));

    int r8_3 = r8_2 + carry7;

    return r8_3;
}

Two steps of the combiner

. I believe node t57 in the right is wrong.

A combiner kicks in for t33 because it is of the form (add, X, (addcarry Y, 0, C) and then it becomes (addcarry X, Y, C) which looks correct but I think it interacts badly with the nodes t24, t44 and t42 (which also have similar forms).

If my reading is correct, after this combination the final carry7 is not computed.

@deadalnix can you take a look at the screenshot above to check if my findings make sense?

We believe there is a problem here as the generic combiner presumes we're going to use the first result (0) but actually the code is using the second (1)

diff --git a/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 5387a7ed73e3a3f22931f981578859f3c0b08e90..5ec4cdeaa272988ca0a97b35b993b507dddd10b6 100644
--- a/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -2150,7 +2150,8 @@ SDValue DAGCombiner::visitADDLike(SDValue N0, SDValue N1, SDNode *LocReference)
   }
 
   // (add X, (addcarry Y, 0, Carry)) -> (addcarry X, Y, Carry)
-  if (N1.getOpcode() == ISD::ADDCARRY && isNullConstant(N1.getOperand(1)))
+  if (N1.getOpcode() == ISD::ADDCARRY && isNullConstant(N1.getOperand(1)) &&
+      N1.getResNo() == 0)
     return DAG.getNode(ISD::ADDCARRY, DL, N1->getVTList(),
                        N0, N1.getOperand(0), N1.getOperand(2));
 
-- 

@efriedma, @deadalnix is our diagnostic correct?

(Kudos to @avieira who did an amazing job chasing this)

Yes, that looks correct.

rogfer01 updated this revision to Diff 125993.Dec 7 2017, 10:40 AM
  • Fix combiner with addcarry that causes PR35103
  • Add regression test
This revision is now accepted and ready to land.Dec 7 2017, 10:40 AM
rogfer01 requested review of this revision.Dec 7 2017, 10:41 AM
rogfer01 edited edge metadata.
efriedma accepted this revision.Dec 8 2017, 2:37 PM

LGTM

This revision is now accepted and ready to land.Dec 8 2017, 2:37 PM
This revision was automatically updated to reflect the committed changes.
chrib added a subscriber: chrib.Feb 7 2018, 11:50 PM