This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Do several rounds of combine for addcarry nodes.
Needs ReviewPublic

Authored by deadalnix on Jul 3 2019, 4:57 PM.

Details

Summary

addcarry can explore severallevels of the DAG docompute its result. This means that a transformation in the DAG that isn't immediately related to a node might affect it.

We therefore add these nodes to the DeepPatternNodes set so that they can be processed again in case the DAG is modfied.

In the future, other node may be added to the set as need arises.

This is a variation on D57367 that specialize in the specific case of large intereger optimization.

Diff Detail

Event Timeline

deadalnix created this revision.Jul 3 2019, 4:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 3 2019, 4:57 PM
craig.topper added inline comments.Jul 8 2019, 9:43 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
155

tomatch-> to match

craig.topper added inline comments.Jul 8 2019, 9:55 PM
test/CodeGen/X86/addcarry.ll
326

Doesn't this add and adc compute the same result as line 321 and 325?

deadalnix marked an inline comment as done.Jul 11 2019, 4:35 AM
deadalnix added inline comments.
test/CodeGen/X86/addcarry.ll
326

There is a lot of duplication that is generated by this. But once the carry propagation is "linearized" because you removed all the diamonds, then a simple set of optimization can get rid of it all. See D57317 for that specific case.

deadalnix updated this revision to Diff 209187.Jul 11 2019, 5:46 AM

tomatch => to match

deadalnix updated this revision to Diff 210109.Jul 16 2019, 8:37 AM

Rebase now that D59208 landed on master.

subcarry related test cases now get much better codegen.

RKSimon added inline comments.Jul 16 2019, 9:51 AM
test/CodeGen/X86/addcarry.ll
326

Would adding X86ISD::ADD to X86TargetLowering::isCommutativeBinOp help with this?

craig.topper added inline comments.Jul 16 2019, 9:55 AM
test/CodeGen/X86/addcarry.ll
326

If I remember right from what I saw in the DAG. We need to CSE ISD::ADDCARRY with commuted operands in DAG combine. Not sure about the X86ISD node.

RKSimon added inline comments.Jul 16 2019, 10:00 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1217

superfluous change - revert ?

RKSimon added inline comments.Jul 16 2019, 10:02 AM
test/CodeGen/X86/addcarry.ll
326

By the looks of this, DAGCombiner::combine should handle it - but is disabled for commutative binop nodes with more than 1 output value.....

deadalnix marked an inline comment as done.Jul 20 2019, 6:36 PM
deadalnix added inline comments.
test/CodeGen/X86/addcarry.ll
326

@RKSimon It doesn't help, because ADDCARRY is not a binary op. It has 3 inputs, 2 of which are commutative. However, D57317 is ready to go and address that exact problem. I did not land D57317 because as far as I know, this doesn't happen in the wild without this patch.

deadalnix updated this revision to Diff 210981.Jul 20 2019, 7:12 PM

Revert the addition of an unecessary empty line.

Rebase and ping

IMHO this is very unusual and papers over some other problem.

IMHO this is very unusual and papers over some other problem.

The problem it paper over is that the pattern we are interested in are deep. The same technique benefits any deep pattern, such as anything using simplifydemandedbits, but there was concern that doing it in all cases would hurt performance. The reason you end up with deep patterns in that case is because you find yourself facing things such as:

(addcarry (uaddo a, b), 0, c) and now you have two carries that propagate and the rest of the DAG is a total mess. This benefits cryptographic computation - and I'd expect many large integer computation in general - substantially.

The approach isn't that unusual as it is similar to what InstCombine does, only more focused on specific nodes that especially benefit to ensure performances stay good in case this isn't required.

While InstCombine does iterate the entire graph multiple times, it rarely does it more than twice. And the second time is just to detect there no changes. InstCombine largely avoids issues like this because it visits the instructions mostly from the beginning of the basic block down. This means as new instructions are formed their users likely haven’t even been visited yet.

DAGCombine on the other hand visits nodes from the end of the basic block up for the first round. So nodes early in the block haven’t been simplified yet. This results in needing to match long patterns or multiple variations since we can’t rely on canonicalization other than what was done in IR. After the first round the order is basically to visit any new nodes created by the previous legalize. Because they are end of the allnodes list which isn’t re-sorted during legalize or before DAGCombine Not sure about the order here. Then the nodes that did not get modified are visited. Not sure about the order, but I think it’s from the end of the basic block again.

So the addcarry nodes later in the DAG get formed first and need to be revisited when earlier ones are created to get additional optimizations. If the earlier ones were formed first by going from beginning to end we could probably get the optimizations in one pass. That would require changing how DAG combine builds it worklist and doing a topological sort before each DAGCombine run. This will likely require adding new pattern matches and maybe modifying existing ones. Not sure how much this will be.

I also wonder if there’s some better IR representation we should be using. Should we have an addcarry intrinsic that InstCombine could form? Is global isel better designed to handle this then DAGCombine? Would having a better IR representation help both?

I also wonder if there’s some better IR representation we should be using. Should we have an addcarry intrinsic that InstCombine could form? Is global isel better designed to handle this then DAGCombine? Would having a better IR representation help both?

This doesn't really work here. These nodes are often created during to legalization, which happens at the Selection DAG level, not at the IR level. I toyed with the idea of introducing an early legalize pass at the IR level, but it did not seem to get traction.

It is indeed true that InstCombine tends to do better due to node being processed in order. DAGCombiner has several problems here, the main one being that the order in which nodes are traversed is not very predictable, as it depends on the order in which the node were created and added to the DAG. After legalization, you typically end up visiting them in an order that is 100% implementation defined and isn't strictly top/down or bottom up. Changing this lead very quickly to a world of pain, because there are a ton of cases where DAGCombiner transforms A into B if it sees A first, and transform B into A if it sees B first. An example of this is (fadd NaN, undef) and (fadd undef, NaN) which transform into each others.

Topological sorting tends to be an expensive operation, it can easily go into n^2 territory, so I'm not convinced this is a win because any insertion into the DAG would break it as there is no nothing of insertion at a given position. Do you see a way to create/maintain a worklist that is ordered as expected?

Changing this lead very quickly to a world of pain

Probably, yes. :)

Topological sorting tends to be an expensive operation, it can easily go into n^2 territory

Sorting once should be linear in the number of edges, and SelectionDAG DAGs generally have few edges. And it's not that expensive in practice; we do it twice already as part of normal compilation. (See AssignTopologicalOrder.)

Do you see a way to create/maintain a worklist that is ordered as expected?

isel itself is an algorithm which incrementally transforms a SelectionDAG DAG and maintains topological ordering as it works. It visits the wrong direction for what we want, but it should be possible to reverse. Granted, this is maybe not the best approach; most DAGCombines take a node and produce one or more new nodes, just like isel, but I'm not sure how you'd handle the ones that don't.

Or you could probably reproduce something similar without actually maintaining the full topological sort on the DAG itself. Topological sort at the beginning, to build an initial worklist. Then, whenever new nodes are created, keep a separate list of those nodes, and visit them in order of creation before continuing with the regular worklist. For most combines, which visit a node, and replace it with one or more new nodes, all the new nodes should have operands which are either new nodes or nodes that have already been visited, and the results should only be used by new nodes and nodes which have not yet been visited. So you maintain sorted visitation order for the existing nodes. And creation order should be naturally sorted for the new nodes.

SO I came up with a plan to get things processed in topological order. However, as mentioned previously, this is indeed leading into a world of pain. I have a prototype, but it causes numerous regression that I want to investigate.

The general outline is as follow:
1/ Remove as much as possible of the manual management of the worklist. It is not realistic to expect every combine for every opcode + target specific combines to be able to operate the worklist consistently, so utility method to help them do so has to be added. Fortunately, often, this management is not doing anything useful at all and can simply be removed.
2/ Modify the utility methods that are expected to actually manipulate the worklist to mostly preserve topological order. It turn out it is not too difficult to have them do a good enough job.
3/ Do a topological sort at the beginning of the process, before adding all the nodes to the worklist.
4/ Modify the logic that adds the argument of a node to the worklist in case they where not present already to do lazy fixup when required to keep the processing topological. Because of 2 and 3, this ends up to be a lightweight process.

As it turns out, many transform done by DAGCombiner are dependent on the processing order. Because this order is not specifically defined, this means such transform are unreliable in practice, and indeed many do break when the processing order is modified. This is another reason why 2 and 3 are useful, as they allow to move gradually instead of producing one giant diff of death. The opposite also happens, where pattern that were not picked upon before now are being exploited.

I already started to execute step 1 of this plan. If people are happy with taking that path, I'd be happy to push that effort forward. I could do with some help to investigate the regression, as I'm not an expert in every single backends.

That sounds good to me personally.
I found it surprizing that DAGCombine processes things in different order as compared to InstCombine.
Thank you for looking into it.

RKSimon added a subscriber: davezarzycki.

Adding @davezarzycki who was working on D70079 recently

Please rebase too

arsenm resigned from this revision.Feb 13 2020, 2:53 PM