[ARM] Optimize {s,u}{add,sub}.with.overflow
ClosedPublic

Authored by jgalenson on Jul 19 2017, 10:12 AM.

Details

Summary

The AArch64 backend contains code to optimize {s,u}{add,sub}.with.overflow during ISel. This commit ports that code to the ARM backend.

Diff Detail

Repository
rL LLVM


There doesn't seem to be any test of the inversion logic.

Yes, I can't get a testcase to trigger those paths. Do you have any suggestions for how to do so?

Please include full context when you upload patches to Phabricator; see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface.

This sort of overlaps with https://reviews.llvm.org/D35192.

lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

We should be able to optimize away the AND here using known bits; does that not happen on trunk?

test/CodeGen/ARM/su_addsub_overflow.ll
20 ↗(On Diff #107331)

These CHECK lines are not very good; try generating checks with utils/update_llc_test_checks.py. Also, please commit this first and rebase your patch on top of it; that makes it easy to see what happens to the generated code.

jgalenson added inline comments.Jul 19 2017, 11:27 AM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

It doesn't because ARM doesn't call setBooleanContents. If there's a way to set that correctly, it would indeed be better. Do you think we could do that?

test/CodeGen/ARM/su_addsub_overflow.ll
20 ↗(On Diff #107331)

That tool does make the changes to the testfiles more obvious, but it also generates overly specific testcases that seem more brittle. I'll see if I can find a nice middle ground and update it in a bit.

efriedma added inline comments.Jul 19 2017, 11:33 AM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

Really? Wow, that's a big oversight. We should fix that.

This is the code for my testcase before my patch. I'll now update it after my patch.

Improve CHECK lines on the test.

jgalenson added inline comments.Jul 19 2017, 12:58 PM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

Do you know the correct way to set it? I tried, but a few tests failed, and I don't know the architecture well enough to know if that was my fault or if the tests just needed to be tweaked.

If we do make that change, then I'll be able to simplify this code, of course.

test/CodeGen/ARM/su_addsub_overflow.ll
20 ↗(On Diff #107331)

Okay, I've uploaded both the old and new versions of the test here. Thanks for that tool, by the way; it's pretty convenient.

efriedma added inline comments.Jul 21 2017, 4:08 PM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

It's likely the tests just need to be tweaked; a lot of them are pretty sensitive to the exact compiler output.

jgalenson added inline comments.Jul 21 2017, 5:42 PM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

So which argument to setBooleanContents is correct for ARM? Is it ZeroOrOne?

efriedma added inline comments.Jul 21 2017, 5:59 PM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

ZeroOrOne is probably best; among other things, it matches the AAPCS calling convention rules for bool.

jgalenson added inline comments.Jul 21 2017, 6:04 PM
lib/Target/ARM/ARMISelLowering.cpp
4484 ↗(On Diff #107331)

Okay, thanks. I'll look into a patch that simply adds that and then rebase this commit on top of it, which will allow me to remove the AND-checking logic here.

jgalenson updated this revision to Diff 112190.Aug 22 2017, 9:51 AM
jgalenson added a reviewer: efriedma.

Now that the setBooleanContents patch has landed, I can remove the AND checks here.

I also added the same optimization to BRCOND, which allows it to optimize another case.

efriedma added inline comments.Aug 23 2017, 12:27 PM
test/CodeGen/ARM/su-addsub-overflow.ll
49 ↗(On Diff #112190)

This looks weird; we're generating a cmp, then a sub with exactly the same operands?

jgalenson added inline comments.Aug 23 2017, 12:33 PM
test/CodeGen/ARM/su-addsub-overflow.ll
49 ↗(On Diff #112190)

Well, as you can see in diff 2, we're currently doing that with a lot of other instructions. This removes most of those other instructions. I have another patch I was going to send after this one that fixes this.

Specifically, ARMBaseInstrInfo::optimizeCompareInstr tries to remove the cmp, but it runs right after the MachineSink pass, which sinks the sub into the cont basic block, which stops the optimization from working.

Friendly ping.

If it would help, I can submit my follow-up patch now, although I don't think this should depend on it.

I'm kind of waiting for https://reviews.llvm.org/D35192 to land here... unfortunately, it's been bouncing in and out of the tree due to regressions, but once it lands, I'd like to see what it does to our lowering here.

I think it might make more sense to wait until we have ARMISD::ADDC nodes in the DAG, then try to optimize away the ARMISD::CMP nodes. That way, you don't have to worry about trying to optimize away redundant instructions after SelectionDAG.

jgalenson updated this revision to Diff 115914.Sep 19 2017, 4:10 PM

Now that https://reviews.llvm.org/D35192 has landed, I've rebased this patch on top of it.

The existing patch actually applied cleanly and worked. However, the two unsigned cases generated slightly worse code (still better than in-tree, however). I managed to improve the uadd case by having getARMXALUOOp return an ADDC instead of an ADD, but this did not work for the usub case (although I left the change in). There's probably a way to fix the usub case, but I'm not worried much about it because my follow-up patch that modifies ARMBaseInstrInfo::optimizeCompareInstr handles it.

So does this change to getARMXALUOOp look correct?

The change to getARMXALUOOp is wrong; ADDC produces two results, so you're making a node with the wrong type.


So on trunk, for the llvm.uadd.with.overflow.i32 case, we produce a sequence like this:

adds    r0, r0, r1
mov     r2, #0
adc     r1, r2, #0
cmp     r1, #1

This is obviously not great... but the ARMISD::ADDE+ARMISD::CMP pattern is something you could DAGCombine away after legalization. I would prefer to do that, rather than try to clean it up after isel. Everything gets more complicated when you're dealing with MachineInstrs (you might end up optimizing simple cases, but not more complex ones), and the pattern works automatically with llvm.uadd.with.overflow.i64 etc.

I haven't thought through how exactly that extends to signed overflow, though.

jgalenson updated this revision to Diff 115940.Sep 19 2017, 5:52 PM

Thanks for the info about ADC. How come no test/assert picked that up?

Here's the rebased patch without that change. You can see that uadd is slightly worse than sadd, but it's still an improvement. I'll look into DAGCombine tomorrow, although I do think that it's fine to do at least some of it at the MI level, since the code there already handles some cases and I'm extending it to catch a few more that it misses. But if it can be done at the DAG level, that's worth doing.

In terms of assertions, we have checks for a lot of nodes in SelectionDAG::getNode... but not all (and I guess ISD::ADDC isn't one of the ones we check). And legalization isn't very picky either.

In terms of tests, the ISD::ADDC probably got lowered to something else before it could do any damage. :)

jgalenson updated this revision to Diff 116084.Sep 20 2017, 2:56 PM

Here's a modification of my incorrect commit from yesterday afternoon that correctly extracts the value from the ADDC.

efriedma added inline comments.Sep 20 2017, 3:08 PM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

Still the wrong type... you need to call getVTList to get the right type for an ADDC.

Also, are you sure you don't want an ARMISD::ADDC, rather than an ISD::ADDC?

jgalenson updated this revision to Diff 116090.Sep 20 2017, 3:19 PM

Oops, sorry, I must have lost that bit of my commit somehow.

Is it really necessary to have two different copies of almost identical code to generate an ARMISD::BRCOND? (I would rather have an explicit check for an AND than two versions of the code, if that's the issue.)

Is it really necessary to have two different copies of almost identical code to generate an ARMISD::BRCOND? (I would rather have an explicit check for an AND than two versions of the code, if that's the issue.)

Both of them are used in the attached testcase. The br_cc case handles almost everything, but the brcond case is needed once (when the brcond isn't combined into a br_cc).

We could outline the two cases into a helper function, although they're different enough that I'm not sure that it would help too much. What do you think?

Friendly ping.

when the brcond isn't combined into a br_cc

BRCOND isn't legal on ARM; it will always eventually get transformed to BR_CC.

when the brcond isn't combined into a br_cc

BRCOND isn't legal on ARM; it will always eventually get transformed to BR_CC.

True. In most of my testcases, brcond is combined into br_cc, so we hit my new br_cc case. However, one time the brcond is not combined into br_cc. It is legalized into a br_cc shortly afterwards, but we lower saddo/uaddo/etc. the normal way before we lower that new br_cc node (which is what would hit my new optimized case). That's why I needed the brcond case.

What is the best way to handle that? Adding code to the normal saddo/etc. lowering code not to lower it if there's a br_cc seems the wrong way to go about it (and I don't think it's even possible). And I don't know how to change the order we see nodes.

but then we lower saddo/uaddo/etc. the normal way before we lower the new br_cc node

Oh, hmm, someone else hit the same issue recently (in a different context). It should be possible to fix; SelectionDAG::Legalize is the relevant code. It doesn't quite work the way you want it to because we don't update the ordering when we add new nodes to the DAG. But I haven't thought through how to compute the right order to visit without recomputing the entire topological order from scratch.

Alternatively, you could try DAGCombining ARMISD::BRCOND after legalization?

It doesn't quite work the way you want it to because we don't update the ordering when we add new nodes to the DAG. But I haven't thought through how to compute the right order to visit without recomputing the entire topological order from scratch.

This does sound like it would solve the problem (and solve other problems as well). It seems a bit out of scope for this commit, though.

Alternatively, you could try DAGCombining ARMISD::BRCOND after legalization?

Good idea. However, the ARMISD::BRCOND isn't combined until after saddo is lowered.

For reference, here's the sequence of events I'm seeing in this one example:

brcond is legalized to br_cc
saddo is legalized
br_cc is legalized to ARMISD::BRCOND
ARMISD::BRCOND is legalized
everything is combined

So combining ARMISD::BRCOND runs into the same problem as using br_cc; they're both too late. Thus without changing the ordering of new DAG nodes (which seems a bit difficult), I don't see how I can do this efficiently without keeping the brcond case.

Good idea. However, the ARMISD::BRCOND isn't combined until after saddo is lowered.

That shouldn't completely block all transforms; you could pattern-match the ARMISD nodes. But I guess the patterns become a lot more complicated, so maybe not worth doing.

lib/Target/ARM/ARMISelLowering.cpp
4582 ↗(On Diff #116090)

Is this assert actually guaranteed somehow? I mean, it should be possible to transform any relevant condition to an equivalent SETEQ or SETNE, but I don't see any code to actually ensure this.

4601 ↗(On Diff #116090)

I would probably write "if ((CC == ISD::SETNE) != isOneConstant(RHS))".

jgalenson updated this revision to Diff 119716.Oct 20 2017, 2:40 PM

Good idea. However, the ARMISD::BRCOND isn't combined until after saddo is lowered.

That shouldn't completely block all transforms; you could pattern-match the ARMISD nodes. But I guess the patterns become a lot more complicated, so maybe not worth doing.

I'm not sure what you mean. After the ARMISD::BRCOND is created, only a few unrelated instructions are matched before the saddo is transformed. So without the reordering we've discussed, I don't see how this would help. I'm probably missing something, but it sounds like it might well not be worth doing.

jgalenson marked an inline comment as done.Oct 20 2017, 2:43 PM
jgalenson added inline comments.
lib/Target/ARM/ARMISelLowering.cpp
4582 ↗(On Diff #116090)

Good point. I actually copied that part of the code over from the AArch64 backend. I would guess that the assumption is that you should only be checking if the overflow bit is set or unset, hence EQ or NE, but I could imagine something transforming those into other operations. I changed it to be part of the condition to avoid the problem. Do you think it's worth doing something similar in the AArch64 backend?

I'm not sure what you mean. After the ARMISD::BRCOND is created, only a few unrelated instructions are matched before the saddo is transformed. So without the reordering we've discussed, I don't see how this would help. I'm probably missing something, but it sounds like it might well not be worth doing.

I mean you could pattern match after the uaddo/saddo is transformed. For unsigned add, you get something like ARMISD::BRCOND(ARMISD::CMPZ(ARMISD::ADDE(0, 0, N), 1)), which you can transform by eliminating the CMPZ+ADDE. The pattern for signed add is slightly different, but similar.

lib/Target/ARM/ARMISelLowering.cpp
4582 ↗(On Diff #116090)

Yes.

I mean you could pattern match after the uaddo/saddo is transformed. For unsigned add, you get something like ARMISD::BRCOND(ARMISD::CMPZ(ARMISD::ADDE(0, 0, N), 1)), which you can transform by eliminating the CMPZ+ADDE. The pattern for signed add is slightly different, but similar.

Ah yes. I considered that, but as you said, it would have more complicated matching and it would be more brittle if the saddo lowering changes (like it just did), so I preferred doing it this way.

lib/Target/ARM/ARMISelLowering.cpp
4582 ↗(On Diff #116090)

Okay, I'll work on submitting a patch there (it should be NFC).

jgalenson retitled this revision from Optimize {s,u}{add,sub}.with.overflow on ARM to [ARM] Optimize {s,u}{add,sub}.with.overflow.Nov 9 2017, 9:31 AM

Are there any other comments? What can I do to get this approved?

rogfer01 added inline comments.Nov 29 2017, 10:16 AM
lib/Target/ARM/ARMISelLowering.cpp
4542–4543 ↗(On Diff #119716)

Apologies if this question looks a bit clueless about all the transformation, but why do you need to reverse the condition code here?

jgalenson added inline comments.Nov 29 2017, 11:20 AM
lib/Target/ARM/ARMISelLowering.cpp
4542–4543 ↗(On Diff #119716)

No, it's a good question, and I'm a bit confused about this myself.

The getARMXALUOOp function seems to return the condition for when there is *no* overflow. It doesn't seem to be documented anywhere, but for example, in its SADDO case, it uses the VC flag, which is the "no overflow" case.

Assuming that's right, then since these branches are branching when an overflow occurs, I need to invert the condition.

Does that logic seem right?

rogfer01 added inline comments.Dec 7 2017, 6:23 AM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

This comment now looks odd because I had to revert D35192 (still working on it, though).

What was the reason to use a target-specific node here? Did you want to save some round trips lowering this, or did you need it for better codegen?

3916 ↗(On Diff #119716)

We could return std::tuple<SDValue, SDValue, SDValue> here, I think, but we'd better leave this for a later change.

4542–4543 ↗(On Diff #119716)

It is certainly confusing.

It returns three things: the arithmetic operation itself (in Value), the comparison that checks for overflow (in OverflowCmp), and a condition code related to OverflowCmp (in ARMcc). For unsigned addition (UADDO) it does the following

Value: add z, x, y
OverflowCmp: cmp z, x
ARMcc: HS

The comparison will compute z - x and update the flags. Using the HS condition code means z >= x and this is exactly *no overflow* (because z ← x + y).

Similarly when doing a signed addition (SADDO)

Value: add z, x, y
OverflowCmp: cmp z, x
ARMcc: VC

This case is less obvious, but if there is no overflow in the add, there must not be overflow in the comparison either (VC): since cmp computes z - x to update the flags, and z = x + y did not overflow, the subtraction yields exactly y, which does not overflow.

Similarly for USUBO and SSUBO, but here the comparison is cmp x, y (instead of cmp z, x), which is (surprisingly) more intuitive: unsigned x - y does not overflow if x >= y, and the signed case holds trivially, as not overflowing is exactly VC.

So after this wall of text, I believe your logic is sensible here.

Given that getARMXALUOOp is now going to be used in three places after your change, would it be possible to add a comment to that function like the one below (or an improved wording if you come up with a better one!)

// This function returns three things: the arithmetic computation itself
// (Value), a comparison (OverflowCmp) and a condition code (ARMcc).  The
// comparison and the condition code define the case in which the arithmetic
// computation *does not* overflow.
std::pair<SDValue, SDValue>
ARMTargetLowering::getARMXALUOOp(SDValue Op, SelectionDAG &DAG,
                                 SDValue &ARMcc) const {
...
jgalenson added inline comments.Dec 7 2017, 10:06 AM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

I used the target-specific node here because efriedma suggested it, but using the generic ISD::ADDC does seem to work (it passes all the tests). Should I change it? I don't know much about the difference between them.

It is interesting that this still works after you reverted your patch.

3916 ↗(On Diff #119716)

Yes, that does seem cleaner to me. I can make a follow-up patch after this lands.

4542–4543 ↗(On Diff #119716)

Thanks for the long double-checking! That comment looks good to me, so I'll add it.

jgalenson updated this revision to Diff 125991.Dec 7 2017, 10:23 AM

Added rogfer01's comment to getARMXALUOOp.

rogfer01 added inline comments.Dec 8 2017, 12:44 AM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

My understanding is that several generic nodes (ISD::*) have to be lowered at some point to Arm-specific ones (ARMISD::*). ISD::ADD and ISD::SUB are among those nodes. Using Arm-specific nodes earlier might bring finer control but may miss some of the generic combiners done on generic nodes.

But it is true that when this function is used, the generic node will be found among specific ones (like ARMISD::CMP), so generic combiners are unlikely to kick in here, which might be a reason to use the specific one directly and not bother with the generic one (which I presume will require an extra iteration replacing it with a specific one).

Changing just this case will make this function look a bit odd. If Eli's comment was along the lines of "let's make this function only use ARMISD" (instead of ISD::ADD and ISD::SUB), we can make this change in a later patch and leave this function untouched (except for the comment).

@efriedma am I missing something?

jgalenson updated this revision to Diff 126175.Dec 8 2017, 10:08 AM

I've removed the modification to getARMXALUOOp that uses ADDC instead of ADD. From the discussion, it's not clear exactly how that part should work, and it's no longer needed, so this makes things simpler.

Now that D35192 has re-landed I need to update this patch.

Note that I added back the use of ARMISD::ADDC, which we were discussing a bit before.

efriedma added inline comments.Dec 12 2017, 3:52 PM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

Please don't use ISD::ADDC here; now that we're using ADDCARRY, we shouldn't produce that node at all on ARM. (LowerADDC_ADDE_SUBC_SUBE should have been removed as part of D35192. ARMISD::ADDC is a different node which is still useful.)

jgalenson updated this revision to Diff 126642.Dec 12 2017, 4:01 PM

Add the ARMISD::ADDC node for real. Thanks for catching that!

rogfer01 added inline comments.Dec 13 2017, 12:28 AM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

Thanks @efriedma. Now I see that I should have removed LowerADDC_ADDE_SUBC_SUBE in D35192.

That said, I fail to understand why we want to update this function in this patch. @jgalenson, does keeping ISD::ADD (not to be confused with ISD::ADDC) impact the code generation?

I mean, I understand we can use ISD::ADDCARRY here but maybe we want to do this in a later patch?

jgalenson added inline comments.Dec 13 2017, 8:54 AM
lib/Target/ARM/ARMISelLowering.cpp
3942 ↗(On Diff #116084)

Yes, with D35192 landed using ISD::ADD produces worse code. It's possible there's a different way to fix that, but doing this works as well.

Ping.

What else should I do to get this approved? It seems like there's agreement that this patch looks at least mostly good, and it (and the follow-ups) significantly improve the performance of overflow-checked code for ARM.

efriedma accepted this revision.Dec 20 2017, 1:37 PM

LGTM.

This revision is now accepted and ready to land.Dec 20 2017, 1:37 PM
This revision was automatically updated to reflect the committed changes.

Thanks for all the comments on this!