
[DAGCombine] Do several rounds of combine.
Needs ReviewPublic

Authored by deadalnix on May 26 2017, 1:25 AM.

Details

Summary

DAGCombine does one pass over all nodes and then gives up. This is suboptimal because one node's combine can expose a combine for another node.

Because some of the combines create infinite loops, we limit things to 3 rounds. Ideally, we'd like to continue combining as long as something is combinable, but that would require removing the combines that do and undo each other, so in the meantime, capping at 3 rounds seems like a good tradeoff.

Event Timeline

deadalnix created this revision.May 26 2017, 1:25 AM
deadalnix updated this revision to Diff 100373.May 26 2017, 1:34 AM

Improve checks in constant_sextload_v8i16_to_v8i32 .

RKSimon edited reviewers, added: efriedma; removed: eli.friedman.May 26 2017, 2:27 AM
RKSimon added a subscriber: RKSimon.

Effects on performance? How many of these cases are just where multiple nodes were created and not added to the worklist?

@RKSimon Most of these cases aren't due to nodes not being added to the worklist, but to patterns that are somewhat deep - such as anything depending on KnownBits. Consider the following DAG:

C
|
B
|
A

Now imagine node A is visited and no combine was found. Then B and C are visited and C is transformed into D:

D
|
B
|
A

In this scenario, A was already visited and isn't added back to the worklist; only B is. We could recursively add uses of uses to the worklist, but it turns out this is not very efficient, as it requires adding half of the DAG back to the worklist on average every time a combine is done. Plus, it wouldn't catch combines based on getNodeIfExists and various other cases. Simply adding direct uses catches the most common cases, and having an extra pass over all nodes catches everything else, and is more economical as long as more than 2 combines are done (on average).

As for performance (I assume you meant compile time impact), here are the times I get for a test suite run:

Without this patch:

real    2m41.665s
user    26m33.900s
sys     2m44.728s

With this patch:

real    2m42.729s
user    26m45.772s
sys     2m42.796s

It doesn't look like the impact is that significant, and it seems worth it to me.

filcab added a subscriber: filcab.May 26 2017, 6:06 AM

As for performance (I assume you meant compile time impact), here are the times I get for a test suite run:

Without this patch:

real    2m41.665s
user    26m33.900s
sys     2m44.728s

With this patch:

real    2m42.729s
user    26m45.772s
sys     2m42.796s

It doesn't look like the impact is that significant, and it seems worth it to me.

I think for things like this, a run of llc on an LTO'd clang .bc would be better to show if there's a problem.

Thank you,
Filipe

spatel added a subscriber: spatel.May 26 2017, 7:23 AM

I usually do not work with clang. Do you have instructions I can follow to get that .bc file?

I usually do not work with clang. Do you have instructions I can follow to get that .bc file?

You need to build llvm+clang as mentioned in the getting started guide, but set LLVM_ENABLE_LTO=ON in cmake.
clang was an example of "a large code-base" with LTO. I'm ok with timings from other big programs (but clang is usually easier to compare with other people).

I'm getting a bunch of

/usr/bin/ranlib: TypeDatabaseVisitor.cpp.o: plugin needed to handle lto object

when doing so.

Alright, so I ended up being able to create an LTO build of clang. I'm not sure how to get the .bc file to do the benchmarking.

So, on the full clang .bc, post-optimization:

Without this patch:

real    9m26.373s
user    9m24.256s
sys     0m1.948s

With this patch:

real    9m44.870s
user    9m42.484s
sys     0m2.228s
deadalnix updated this revision to Diff 101028.Jun 1 2017, 8:13 AM

Rebase, fix merge conflicts.

niravd added a subscriber: niravd.Jun 1 2017, 9:42 AM
arsenm added inline comments.Jun 1 2017, 11:52 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1315–1318

Spelling

Can we get the same optimization results if we run the DAG combiner pass multiple times instead of iterating internally within the pass?
If so, I think it is better to run the pass, without the internal iteration, multiple times, providing more flexibility to control the tradeoff between code quality and compilation time (e.g. based on optimization level).

davide added a subscriber: davide.Jun 1 2017, 1:48 PM

If I read these numbers correctly, this makes instruction selection ~4-5% slower on large testcases (in your case, an LTO build of clang).
This is quite a bit, and it requires further justification (i.e. it needs to be backed by the performance improvement we get on these testcases for the additional compile time we pay).

@inouehrs That wouldn't be the same, as this will bail when no more combines are found.
@davide It's more like 3% as far as I can tell. The sad truth here, looking into it, is that there are a lot of combines that do and undo themselves, and most of the perf hit comes from there. These transforms are the very reason I limited the number of iterations to begin with.

As far as benefits go, it's very helpful for code that's legalized. I've been caring about this a lot lately, because I have workloads that involve a lot of cryptography, and the gains are pretty substantial. In addition, I have various transforms that I haven't published yet because they simply cannot kick in reliably without the mechanism in this diff. Any pattern that matches more than 2 levels deep suffers from not having this patch.

One thing that could be done is to only enable this when optimizations are on. We can then weed out the cases that loop over time, and enable it consistently once there are not so many of them anymore.

davide added a comment.Jun 1 2017, 2:48 PM

@inouehrs That wouldn't be the same, as this will bail when no more combines are found.
@davide It's more like 3% as far as I can tell. The sad truth here, looking into it, is that there are a lot of combines that do and undo themselves, and most of the perf hit comes from there. These transforms are the very reason I limited the number of iterations to begin with.

As far as benefits go, it's very helpful for code that's legalized. I've been caring about this a lot lately, because I have workloads that involve a lot of cryptography, and the gains are pretty substantial. In addition, I have various transforms that I haven't published yet because they simply cannot kick in reliably without the mechanism in this diff. Any pattern that matches more than 2 levels deep suffers from not having this patch.

One thing that could be done is to only enable this when optimizations are on. We can then weed out the cases that loop over time, and enable it consistently once there are not so many of them anymore.

This wouldn't actually help the worst case, i.e. LTO, where optimizations are almost always on.
I think the impact is still quite significant, and we should have numbers before trying to be more aggressive, as SelectionDAG is already very expensive.

niravd added a comment.Jun 1 2017, 8:23 PM

It sounds like the underlying cause of this is related to the conversation on diamond nodes you started on llvm-dev. Reading your description, I think I understand the issue better, and I believe I have a solution that should fix the underlying issue without needing to loop over all nodes multiple times.

The problem with optimizations on a deeper DAG is that we cannot leverage the fact that we add both changed/new nodes and their users to guarantee that we consider the key node (i.e. the one that triggers the optimization). You can work around this by checking for the optimization from all nodes you match on. This increases the number of checks, but much less than revisiting all nodes. You should be able to avoid doing the check off of each node you match against and only check alternating layers (which might let you do the optimization check only at the fork and join points of a diamond shape).

RKSimon edited edge metadata.Aug 1 2017, 10:36 AM

@niravd Did you look into your alternative approach any further? @deadalnix has updated D33840, which was supposed to help reduce the performance impact, but the results are mild at best.

deadalnix marked an inline comment as done.Aug 9 2017, 2:32 PM

@RKSimon I'll find a way to make this fast, or find an alternative such as activating it only in some specific situations. In addition to solving my specific problem, it seems to improve numerous other things, especially in the AMDGPU backend. In any case, I think D33840 is a good thing and we should proceed with it.

deadalnix updated this revision to Diff 183737.Jan 26 2019, 5:42 PM

I'd like to resurrect this diff.

To me, it seems like the proper thing to do, at least at some optimisation levels. As long as there are combines that we know how to do, we should be doing them.

I see a lot of people adding more and more clever patterns over time that are just not necessary, as they are combinations of simpler patterns. This is a losing battle anyway, because there is a combinatorial explosion. In addition, many patterns are just not very useful in isolation, but are useful for putting things into a canonical form that can be picked up later on. These types of transformations are not beneficial right now unless they are very simple.

I was able to write various patches that leverage this canonicalisation mechanism and get great results for the use cases I'm interested in (mostly cryptography, which involves a lot of big integer manipulation). I'm sure people interested in the performance of other types of code will find it beneficial as well. Ultimately, I could do without this patch, but the alternative boils down to doing something similar in specific cases - the ones I care about - instead of all cases, which seems like a big missed opportunity.

The results are better in numerous cases, terrific in some, and there are a few regressions in terms of codegen quality. I'm happy to work on these regressions, but I'd like to ensure I'm not wasting my time if this patch has no chance of getting in.

TL;DR: Not doing this creates work and complexity, in addition to making some optimisations prohibitively complex. I think we should do this, at least at higher optimisation levels such as O2/O3.

craig.topper added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1362

Don't you need to reset CombineNodes on each iteration?

Have we looked into visiting the nodes bottom up instead of top down as is currently done? It would require an explicit topological sort instead of using whatever order the previous legalizer step left us with.

deadalnix marked an inline comment as done.Jan 26 2019, 8:07 PM

I'm not sure how processing nodes bottom up really helps. Problems arise when you want to use patterns of depth > 2, because then nodes beyond the direct parent/child are not processed again, even though such a pattern may now be available. It seems to me that both top-down and bottom-up approaches would suffer from the same problem, but maybe there is something I'm missing.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1362

Both would be correct, but the semantics obviously differ. Let me investigate this. Good catch.

Currently we largely visit nodes before their operands. So when we simplify the operands several layers down, we don't revisit the later nodes to match patterns. But if we visited the operands first, then the first visit of the later nodes would occur afterwards, so we wouldn't need to revisit them.

deadalnix marked an inline comment as done.Jan 27 2019, 6:41 AM

After thinking more about this, I do not think going bottom up is a good idea. All patterns match a node plus its operands, and so benefit from the operands having been combined already. I do not think changing all the patterns to match uses rather than operands is a good idea: it is a ton of work, and it is unclear there is any benefit at all.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1362

So I was playing with resetting/not resetting this, and even removing it altogether.

The first thing worth noticing is that the intent here is very similar, but more limited in scope. However, results sometimes differ depending on whether this is executed or not. That is because various patterns in there depend on execution order - something I've noticed before, for instance in D41235.

I do not think we need to reset it, as all existing nodes are inserted into the worklist at the start, so this ends up only adding nodes to the worklist that have been created by a preceding successful combine. I noticed zero codegen difference when resetting the set between iterations, so resetting seems to just create more work without any benefit.

For bottom up, I was referring to the order of the nodes in the initial worklist, not the patterns themselves. Right now we visit the last instruction in the basic block first. What if we visited nodes with no operands first and the last instruction last? That would match what IR instcombine does.

I get it now. We had mismatched interpretations of which side is up and which side is down. If this is indeed the order in which nodes are processed, then it would be beneficial to change it.

xbolva00 added inline comments.
test/CodeGen/X86/mmx-cvt.ll
271 ↗(On Diff #183737)

Weird instruction increase

test/CodeGen/X86/not-and-simplify.ll
22 ↗(On Diff #183737)

Bug?

RKSimon added inline comments.Jan 27 2019, 10:28 AM
test/CodeGen/X86/not-and-simplify.ll
22 ↗(On Diff #183737)

Alive says its ok: https://rise4fun.com/Alive/MDFW

And it replaces a load with a rematerializable constant

deadalnix marked 5 inline comments as done.Jan 27 2019, 6:51 PM
deadalnix added inline comments.
test/CodeGen/X86/mmx-cvt.ll
271 ↗(On Diff #183737)

Yes. It looks like there are a few regressions, even though overall codegen looks better. I'm happy to investigate them, but I'd like to know whether this is likely to go forward in principle before investing too much effort that would be wasted.

test/CodeGen/X86/not-and-simplify.ll
22 ↗(On Diff #183737)

I assume this is an improvement, right?

test/CodeGen/X86/shift-double-x86_64.ll
16 ↗(On Diff #183737)

I'm not sure what is going on here, I assume there is a bug somewhere.

test/CodeGen/X86/shift-double.ll
300 ↗(On Diff #183737)

ditto

test/CodeGen/X86/unfold-masked-merge-scalar-constmask-innerouter.ll
40 ↗(On Diff #183737)

Is there any difference between these two in terms of codegen?

Have you done any compile time measurements of this? InstCombine already gets blamed for being a compile time problem. I'm worried about repeating that criticism with DAG combine.

Also do you have examples of the kinds of things that you're seeing in your workloads? Would be good to have test cases for those so we can understand them and so they don't regress in the future if we go forward with this.

Also why is only X86 showing changes from this?

My concern about this is that DAGCombine is relatively expensive, and doing it 3 times will make a nontrivial difference in large vector compute blocks. At the very minimum we'd want to disable the most expensive merge operations for most of the rounds (store merge and maybe some vector combines).

The only reason this is necessary is that some transforms rely on looking deeper than their operands to decide if they're valid, which means their triggering condition may change without them being put on the worklist. Last time this was being discussed, I suggested changing the key node in such transforms so they were either the earliest node or a user of the earliest node, so DAG changes would happen before the transform was no longer considered.

Maybe we could do the dual and more aggressively add user nodes to the worklist on a combine. I've looked at a few test cases on the rebased patch, and an initial glance at the debug trace seems to indicate both are due to a change that enables a SimplifyDemandedBits-based combine from a node a few steps deep. If we could add further descendants in those cases (or alternatively recenter the SimplifyDemandedBits combines on computationally earlier nodes), we may be able to fold this back down to one pass with only marginal additional work.

Hi @craig.topper,

First, about the kind of code I'm trying to get better codegen for: it's mostly large integer manipulation. I already added a fair amount of reduced test cases in addcarry.ll/subcarry.ll. I'm at a stage where the patterns I have to work with are somewhat deep; see D57302 for an example. These patterns do not do anything useful if other transforms cannot pick up from where they left off.

Hi @niravd ,

I wanted to collect some performance data today, but I ran into problems generating a large .bc file of an existing program, as LTO seems to be broken on my end for some reason. I did explore the idea of adding more descendants; for instance, sext/zext seems valuable to punch through. But it seems to me you would want to add more and more of these over time, and you'd end up with a complicated version of what we have here that misses opportunities.

You raise a good point though. This should only run iteration n if iteration n - 1 actually changed the DAG. You'd expect that the code shouldn't run 3 times that often, but in practice it does, because there are a lot of transforms that do A -> B -> A. This is why I limited it to 3 and not anything more. I do think that investigating these and removing them over time is probably preferable.

I suggested changing the key node in such transforms so they were either the earliest node or a user of the earliest node, so DAG changes would happen before the transform was no longer considered.

I do not think this is a very realistic path forward, as there are numerous transforms looking 2+ levels deep. As you rightly point out, anything SimplifyDemandedBits-based does, for instance.

I think as a first step we can hide this behavior behind a flag that defaults to not doing the transform, until we can tune things a bit more and figure out in which cases we want to do this. Would that be acceptable to you?

Hi @craig.topper,

First, about the kind of code I'm trying to get better codegen for: it's mostly large integer manipulation. I already added a fair amount of reduced test cases in addcarry.ll/subcarry.ll. I'm at a stage where the patterns I have to work with are somewhat deep; see D57302 for an example. These patterns do not do anything useful if other transforms cannot pick up from where they left off.

I don't see changes to addcarry.ll and subborrow.ll in this patch. So do we not have test cases from your workloads that show the benefit of this patch?

Are there non-X86 changes from this patch as well that haven't been captured here? Or is X86 somehow the only target affected by this?

I don't see changes to addcarry.ll and subborrow.ll in this patch. So do we not have test cases from your workloads that show the benefit of this patch?

That is because I have other transforms that have no effect without this patch. To be able to do anything more than what's already done, I need to linearize carry propagation as in D57302. Then I can do various transforms such as D57317 (I have others to submit). Without reworking the carry propagation, there is no hope of getting nice chains of adc (or whatever the equivalent is on the target), and without this, breaking diamond propagation doesn't work reliably, as the patterns are too deep.

I can submit other patches, but at this time it looks like it would only clutter the review queue, as they'd all depend on D57302, which doesn't work reliably without this one. As mentioned earlier, punching through zext/sext and other ops that often find themselves on the path of carries works as well for me, but it seems like a missed opportunity considering what we get out of SimplifyDemandedBits and the like.

Are there non-X86 changes from this patch as well that haven't been captured here? Or is X86 somehow the only target affected by this?

There are other changes, most of them in the AMDGPU backend. I will get them sorted out before committing anything, but I would like us to decide on a path forward so I can avoid maintaining them manually for a long time. The x86 ones are easy to maintain thanks to utils/update_llc_test_checks.py.

I was thinking about ways to reduce the overhead created by this change. I came up with D57367, which is an alternative that focuses on nodes likely to benefit from the change instead of the whole DAG. It misses several opportunities that exist in this patch, but it seems to be a tradeoff worth making.

nikic added a subscriber: nikic.Feb 10 2019, 11:13 AM
Herald added a project: Restricted Project.Feb 10 2019, 11:13 AM