The AArch64 backend contains code to optimize {s,u}{add,sub}.with.overflow during ISel. This commit ports that code to the ARM backend.
Yes, I can't get a testcase to trigger those paths. Do you have any suggestions for how to do so?
Please include full context when you upload patches to Phabricator; see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface.
This sort of overlaps with https://reviews.llvm.org/D35192.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | We should be able to optimize away the AND here using known bits; does that not happen on trunk? | |
test/CodeGen/ARM/su_addsub_overflow.ll | ||
20 ↗ | (On Diff #107331) | These CHECK lines are not very good; try generating checks with utils/update_llc_test_checks.py. Also, please commit this first and rebase your patch on top of it; that makes it easy to see what happens to the generated code. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | It doesn't because ARM doesn't call setBooleanContents. If there's a way to set that correctly, it would indeed be better. Do you think we could do that? | |
test/CodeGen/ARM/su_addsub_overflow.ll | ||
20 ↗ | (On Diff #107331) | That tool does make the changes to the testfiles more obvious, but it also generates overly-specific testcases that seem more brittle. I'll see if I can make a nice middle-ground and update it in a bit. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | Really? Wow, that's a big oversight. We should fix that. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | Do you know the correct way to set it? I tried, but a few tests failed, and I don't know the architecture well enough to know if that was my fault or if the tests just needed to be tweaked. If we do make that change, then I'll be able to simplify this code, of course. | |
test/CodeGen/ARM/su_addsub_overflow.ll | ||
20 ↗ | (On Diff #107331) | Okay, I've uploaded both the old and new versions of the test here. Thanks for that tool, by the way; it's pretty convenient. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | It's likely the tests just need to be tweaked; a lot of them are pretty sensitive to the exact compiler output. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | So which argument to setBooleanContents is correct for ARM? Is it ZeroOrOne? |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | ZeroOrOne is probably best; among other things, it matches the AAPCS calling convention rules for bool. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4484 | Okay, thanks. I'll look into a patch that simply adds that and then rebase this commit on top of it, which will allow me to remove the AND-checking logic here. |
Now that the setBooleanContents patch has landed, I can remove the AND checks here.
I also added the same optimization to BRCOND, which allows it to optimize another case.
test/CodeGen/ARM/su-addsub-overflow.ll | ||
---|---|---|
50 | This looks weird; we're generating a cmp, then a sub with exactly the same operands? |
test/CodeGen/ARM/su-addsub-overflow.ll | ||
---|---|---|
50 | Well, as you can see in diff 2, we're currently doing that with a lot of other instructions. This removes most of those other instructions. I have another patch I was going to send after this one that fixes this. Specifically, ARMBaseInstrInfo::optimizeCompareInstr tries to remove the cmp, but it runs right after the MachineSink pass, which sinks the sub into the cont basic block, which stops the optimization from working. |
Friendly ping.
If it would help, I can submit my follow-up patch now, although I don't think this should depend on it.
I'm kind of waiting for https://reviews.llvm.org/D35192 to land here... unfortunately, it's been bouncing in and out of the tree due to regressions, but once it lands, I'd like to see what it does to our lowering here.
I think it might make more sense to wait until we have ARMISD::ADDC nodes in the DAG, then try to optimize away the ARMISD::CMP nodes. That way, you don't have to worry about trying to optimize away redundant instructions after SelectionDAG.
Now that https://reviews.llvm.org/D35192 has landed, I've rebased this patch on top of it.
The existing patch actually applied cleanly and worked. However, the two unsigned cases generated slightly worse code (still better than in-tree, however). I managed to improve the uadd case by having getARMXALUOOp return an ADDC instead of an ADD, but this did not work for the usub case (although I left the change in). There's probably a way to fix the usub case, but I'm not worried much about it because my follow-up patch that modifies ARMBaseInstrInfo::optimizeCompareInstr handles it.
So does this change to getARMXALUOOp look correct?
The change to getARMXALUOOp is wrong; ADDC produces two results, so you're making a node with the wrong type.
So on trunk, for the llvm.uadd.with.overflow.i32 case, we produce a sequence like this:
  adds r0, r0, r1
  mov r2, #0
  adc r1, r2, #0
  cmp r1, #1
This is obviously not great... but the ARMISD::ADDE+ARMISD::CMP pattern is something you could DAGCombine away after legalization. I would prefer to do that, rather than try to clean it up after isel. Everything gets more complicated when you're dealing with MachineInstrs (you might end up optimizing simple cases, but not more complex ones), and the pattern works automatically with llvm.uadd.with.overflow.i64 etc.
I haven't thought through how exactly that extends to signed overflow, though.
Thanks for the info about ADC. How come no test/assert picked that up?
Here's the rebased patch without that change. You can see that uadd is slightly worse than sadd, but it's still an improvement. I'll look into DAGCombine tomorrow, although I do think that it's fine to do at least some of it at the MI level, since the code there already handles some cases and I'm extending it to catch a few more that it misses. But if it can be done at the DAG level, that's worth doing.
In terms of assertions, we have checks for a lot of nodes in SelectionDAG::getNode... but not all (and I guess ISD::ADDC isn't one of the ones we check). And legalization isn't very picky either.
In terms of tests, the ISD::ADDC probably got lowered to something else before it could do any damage. :)
Here's a modification of my incorrect commit from yesterday afternoon that properly extracts the result value from the ADDC.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
3942 | Still the wrong type... you need to call getVTList to get the right type for an ADDC. Also, are you sure you don't want an ARMISD::ADDC, rather than an ISD::ADDC? |
Is it really necessary to have two different versions of almost identical code to generate an ARMISD::BRCOND? (I would rather have an explicit check for an AND than two versions of the code, if that's the issue.)
Both of them are used in the attached testcase. The br_cc case handles almost everything, but the brcond case is needed once (when the brcond isn't combined into a br_cc).
We could outline the two cases into a helper function, although they're different enough that I'm not sure that it would help too much. What do you think?
when the brcond isn't combined into a br_cc
BRCOND isn't legal on ARM; it will always eventually get transformed to BR_CC.
True. In most of my testcases, brcond is combined into br_cc, so we hit my new br_cc case. However, in one case the brcond is not combined into br_cc. It is legalized into it shortly afterwards, but then we lower saddo/uaddo/etc. the normal way before we lower the new br_cc node (where my new optimized case would kick in). That's why I needed the brcond case.
What is the best way to handle that? Adding code to the normal saddo/etc. lowering to skip it when there's a br_cc user seems like the wrong way to go about it (and I don't think it's even possible). And I don't know how to change the order in which we see nodes.
but then we lower saddo/uaddo/etc. the normal way before we lower the new br_cc node
Oh, hmm, someone else hit the same issue recently (in a different context). It should be possible to fix; SelectionDAG::Legalize is the relevant code. It doesn't quite work the way you want it to because we don't update the ordering when we add new nodes to the DAG. But I haven't thought through how to compute the right order to visit without recomputing the entire topological order from scratch.
Alternatively, you could try DAGCombining ARMISD::BRCOND after legalization?
It doesn't quite work the way you want it to because we don't update the ordering when we add new nodes to the DAG. But I haven't thought through how to compute the right order to visit without recomputing the entire topological order from scratch.
This does sound like it would solve the problem (and solve other problems as well). It seems a bit out of scope for this commit, though.
Alternatively, you could try DAGCombining ARMISD::BRCOND after legalization?
Good idea. However, the ARMISD::BRCOND isn't combined until after saddo is lowered.
For reference, here's the sequence of events I'm seeing in this one example:
brcond is legalized to br_cc
saddo is legalized
br_cc is legalized to ARMISD::BRCOND
ARMISD::BRCOND is legalized
everything is combined
So combining ARMISD::BRCOND runs into the same problem as using br_cc; they're both too late. Thus without changing the ordering of new DAG nodes (which seems a bit difficult), I don't see how I can do this efficiently without keeping the brcond case.
Good idea. However, the ARMISD::BRCOND isn't combined until after saddo is lowered.
That shouldn't completely block all transforms; you could pattern-match the ARMISD nodes. But I guess the patterns become a lot more complicated, so maybe not worth doing.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4581 | Is this assert actually guaranteed somehow? I mean, it should be possible to transform any relevant condition to an equivalent SETEQ or SETNE, but I don't see any code to actually ensure this. | |
4600 | I would probably write "if ((CC == ISD::SETNE) != isOneConstant(RHS))". |
Good idea. However, the ARMISD::BRCOND isn't combined until after saddo is lowered.
That shouldn't completely block all transforms; you could pattern-match the ARMISD nodes. But I guess the patterns become a lot more complicated, so maybe not worth doing.
I'm not sure what you mean. After the ARMISD::BRCOND is created, only a few unrelated instructions are matched before the saddo is transformed. So without the reordering we've discussed, I don't see how this would help. I'm probably missing something, but it sounds like it might well not be worth doing.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4581 | Good point. I actually copied that part of the code over from the AArch64 backend. I would guess that the assumption is that you should only be checking if the overflow bit is set or unset, hence EQ or NE, but I could imagine something transforming those into other operations. I changed it to be part of the condition to avoid the problem. Do you think it's worth doing something similar in the AArch64 backend? |
I'm not sure what you mean. After the ARMISD::BRCOND is created, only a few unrelated instructions are matched before the saddo is transformed. So without the reordering we've discussed, I don't see how this would help. I'm probably missing something, but it sounds like it might well not be worth doing.
I mean you could pattern match after the uaddo/saddo is transformed. For unsigned add, you get something like ARMISD::BRCOND(ARMISD::CMPZ(ARMISD::ADDE(0, 0, N), 1)), which you can transform by eliminating the CMPZ+ADDE. The pattern for signed add is slightly different, but similar.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4581 | Yes. |
I mean you could pattern match after the uaddo/saddo is transformed. For unsigned add, you get something like ARMISD::BRCOND(ARMISD::CMPZ(ARMISD::ADDE(0, 0, N), 1)), which you can transform by eliminating the CMPZ+ADDE. The pattern for signed add is slightly different, but similar.
Ah yes. I considered that, but as you said, it would have more complicated matching and it would be more brittle if the saddo lowering changes (like it just did), so I preferred doing it this way.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4581 | Okay, I'll work on submitting a patch there (it should be NFC). |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4542–4543 | Apologies if this question looks a bit clueless about the transformation, but why do you need to reverse the condition code here? |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
4542–4543 | No, it's a good question, and I'm a bit confused about this myself. The getARMXALUOOp function seems to return the condition for whether there is not an overflow. It doesn't seem to be documented anywhere, but for example, in its SADDO case, it uses the VC flag, which is the "no overflow" case. Assuming that's right, then since these branches are branching when an overflow occurs, I need to invert the condition. Does that logic seem right? |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
3916 | We could return std::tuple<SDValue, SDValue, SDValue> here, I think, but it is better to leave this for a later change. | |
3942 | This comment now looks odd because I had to revert D35192 (still working on it, though). What was the reason to use a target-specific node here? Did you want to save some round trips lowering this, or did you need it for better codegen? | |
4542–4543 | It is certainly confusing. It returns three things: the intended arithmetic operation (in Value), the comparison that computes the overflow (in OverflowCmp) and a condition code related to OverflowCmp (in ARMcc).

For unsigned addition (UADDO) it does the following:

  Value:       add z, x, y
  OverflowCmp: cmp z, x
  ARMcc:       HS

The comparison computes z - x and updates the flags. The HS condition code means z >= x (unsigned), and this is exactly *no overflow* (because z = x + y).

Similarly, for signed addition (SADDO):

  Value:       add z, x, y
  OverflowCmp: cmp z, x
  ARMcc:       VC

This case is less obvious, but if the add did not overflow, then the comparison must not overflow either (VC): since the cmp computes z - x to update the flags, it would compute y, which does not overflow.

USUBO and SSUBO are similar, but here the comparison is cmp x, y (instead of cmp z, x), which is (surprisingly) more intuitive: unsigned x - y does not overflow exactly when x >= y, and trivially for the signed case not overflowing is exactly VC.

So after this wall of text, I believe your logic is sensible here. Given that getARMXALUOOp is now going to be used in three places after your change, would it be possible to add a comment to that function like the one below (or improved wording if you come up with something better)?

  // This function returns three things: the arithmetic computation itself
  // (Value), a comparison (OverflowCmp) and a condition code (ARMcc). The
  // comparison and the condition code define the case in which the arithmetic
  // computation *does not* overflow.
  std::pair<SDValue, SDValue>
  ARMTargetLowering::getARMXALUOOp(SDValue Op, SelectionDAG &DAG,
                                   SDValue &ARMcc) const {
    ...
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
3916 | Yes, that does seem cleaner to me. I can make a follow-up patch after this lands. | |
3942 | I used the target-specific node here because efriedma suggested it, but using the generic ISD::ADDC does seem to work (it passes all the tests). Should I change it? I don't know much about the difference between them. It is interesting that this still works after you reverted your patch. | |
4542–4543 | Thanks for the long double-checking! That comment looks good to me, so I'll add it. |
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
3942 | My understanding is that several generic nodes (ISD::*) have to be lowered at some point to Arm-specific ones (ARMISD::*). ISD::ADD and ISD::SUB are among those nodes. Using Arm-specific nodes earlier might give finer control but may miss some of the combines done on generic nodes. But it is true that when this function is used, the generic node will be found among specific ones (like ARMISD::CMP), so generic combines are unlikely to kick in here, which might be a reason to use the specific one directly and not bother with the generic one (which I presume would require an extra iteration replacing it with a specific one). Changing just this case will make this function look a bit odd. If Eli's comment was along the lines of "let's make this function use only ARMISD nodes" (instead of ISD::ADD and ISD::SUB), we can make this change in a later patch and leave this function untouched (except for the comment). @efriedma am I missing something? |
I've removed the modification to getARMXALUOOp that uses ADDC instead of ADD. From the discussion, it's not clear exactly how that part should work, and it's no longer needed, so this makes things simpler.
Now that D35192 has re-landed I need to update this patch.
Note that I added back the use of ARMISD::ADDC, which we were discussing a bit before.
lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
3942 | Thanks @efriedma. Now I see that I should have removed LowerADDC_ADDE_SUBC_SUBE in D35192. That said, I fail to understand why we want to update this function in this patch. @jgalenson does keeping ISD::ADD (not to be confused with ISD::ADDC) impact the code generation? I mean, I understand we can use ISD::ADDCARRY here, but maybe we want to do this in a later patch? |
Ping.
What else should I do to get this approved? It seems like there's agreement that this patch looks at least mostly good, and it (and the follow-ups) significantly improve the performance of overflow-checked code for ARM.