This is an archive of the discontinued LLVM Phabricator instance.

[SDAG] Make the DAGCombine worklist not grow endlessly due to duplicate insertions.
ClosedPublic

Authored by chandlerc on Jul 22 2014, 3:45 AM.

Details

Summary

The old behavior could cause arbitrarily bad memory usage in the DAG
combiner if there was heavy traffic of adding nodes already on the
worklist to it. This commit switches the DAG combine worklist to work
the same way as the instcombine worklist where we null-out removed
entries and only add new entries to the worklist. This results in
subtle, frustrating churn in the particular order in which DAG combines
are applied which causes a number of minor regressions where we fail to
match a pattern previously matched by accident. AFAICT, all of these
should be using AddToWorklist to directly or should be written in a less
brittle way. None of the changes seem drastically bad, and a few of the
changes seem distinctly better.

A major change required to make this work is to significantly harden the
way in which the DAG combiner handle nodes which become dead
(zero-uses). Previously, we relied on the ability to "priority-bump"
them on the combine worklist to achieve recursive deletion of these
nodes and ensure that the frontier of remaining live nodes all were
added to the worklist. Instead, I've introduced a routine to just
implement that precise logic with no indirection. It is a significantly
simpler operation than that of the combiner worklist proper. I suspect
this will also fix some other problems with the combiner.

Note that I have *NO IDEA* what the changes on any architecture other than x86
really imply, please check these for your target! I just don't know how to
evaluate them. I've literally transcribed the test case changes necessary to
pass, but it may be more useful to patch in this change and compare A/B to
understand the differences for a particular test case.

I think the x86 changes are really minor and uninteresting, but the avx512 at
least is hiding a "regression" (but the test case is just noise, not testing
some performance invariant) that might be looked into. Not sure if any of the
others impact specific "important" code paths, but they didn't look terribly
interesting to me, or the changes were really minor.

However, maybe this entire approach is just deeply flawed? What do folks think,
is this worthwhile?

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc updated this revision to Diff 11743.Jul 22 2014, 3:45 AM
chandlerc retitled this revision from to [SDAG] Make the DAGCombine worklist not grow endlessly due to duplicate insertions..
chandlerc updated this object.
chandlerc added reviewers: hfinkel, arsenm, grosbach.
chandlerc added a subscriber: Unknown Object (MLST).

Hi Chandler,

I've taken a look at the ARM changes, and I think they're mostly innocuous.

On the patch as a whole, I don't think relying on nodes being visited in a particular order for good codegen is a good idea anyway. It usually means you're only handling one particular edge-case of a more general construct. So I wouldn't worry too much about that myself. Still annoying for the person like you who comes along to try & make things better though.

I can't think of any problems with the new implementation, though I'm probably not the best person around for picking the right data structure. Hopefully someone else is better there.

Cheers.

Tim.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
101–107 ↗(On Diff #11743)

This comment now seems out of date.

1096 ↗(On Diff #11743)

Is this possible? Isn't that an immediate cycle?

test/CodeGen/ARM/sxt_rot.ll
12–13 ↗(On Diff #11743)

This doesn't look ideal, but it's just a deficiency in the ARM patterns. I've got a fix ready to go, but I'll wait until this is in to avoid giving you conflicts to resolve or anything.

reames added a subscriber: reames.Jul 22 2014, 10:08 AM

Chandler,

Overall, this seems quite reasonable. My only minor comment would be
that you should improve the naming and documentation of
recursivelyDeleteUnusedNodes. It is not currently clear from the
interface that items are added to the worklist at all, or that it is the
frontier of used nodes which are added. Given this seems to be
important from your description, it should be documented.

LGTM -- Mind you, I'm not particularly familiar with this code. So take
my LGTM with a grain or two of salt. :)

Philip

hfinkel accepted this revision.Jul 22 2014, 11:07 AM
hfinkel edited edge metadata.

I just ran the test suite on PPC64/Linux with this patch, and it introduced no failures. ;)

Regarding the PowerPC test changes:

test/CodeGen/PowerPC/complex-return.ll looks like a CodeGen improvement (better store-to-load forwarding).

test/CodeGen/PowerPC/subsumes-pred-regs.ll is also a CodeGen improvement (we changed from comparing the value to 0 to comparing that it is not equal to 1 so that we can reuse the loaded '1' value later).

So I see only improvements here - LGTM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1082 ↗(On Diff #11743)

It seems like we now have a number of routines that do something like this. SDAG has an:

void RemoveDeadNode(SDNode *N);

which is also supposed to have behavior close to this (except that it does not updated the DC worklist?). Maybe some refactoring could consolidate things?

1096 ↗(On Diff #11743)

And if it is possible, do you need a isPredecessorOf check instead?

This revision is now accepted and ready to land.Jul 22 2014, 11:07 AM

Couple of comments inline.

test/CodeGen/X86/block-placement.ll
240 ↗(On Diff #11743)

Inquiring minds here? It looks like it was deleted because it's no longer applicable? Does the code look ok?

test/CodeGen/X86/divide-by-constant.ll
60 ↗(On Diff #11743)

Assume the additional check lines aren't because the code generated is actually worse?

arsenm added inline comments.Jul 22 2014, 2:52 PM
test/CodeGen/R600/r600-export-fix.ll
6 ↗(On Diff #11743)

I'm pretty sure these are fine. I don't think the specific vector components matter, and Evergreen tests in general are overly sensitive to minor changes in scheduling

I've made the requested minor changes. Let me know if you'd like a fresh patch to review. See detailed replies to questions and comments below.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
101–107 ↗(On Diff #11743)

Yea, updated comments here.

1082 ↗(On Diff #11743)

The intent was for this to be a DAG-combiner specific deletion mechanism.

Maybe it would be more clear as a method on DAGCombiner? I think I like that better. It also (IMO) removes some of the problems with clarifying that it manages the worklist.

1096 ↗(On Diff #11743)

I saw crashes due to this at some point in my testing but I can't reproduce them. I'll remove it for now.

Note that we don't really need to understand the nature of the cycle, just avoid creating an infinite loop in this function because we've already removed N from the set.

test/CodeGen/ARM/sxt_rot.ll
12–13 ↗(On Diff #11743)

Yea, this seemed clearly like a regression. Thanks for prepping the fix.

test/CodeGen/X86/block-placement.ll
240 ↗(On Diff #11743)

This test was trying to check for a very precise misbehavior when we formed a particularly convoluted loop structure (see the comment). After this patch, we don't form that loop structure. I have no way of re-reducing a test case that still does, and the value seems very low.

test/CodeGen/X86/divide-by-constant.ll
60 ↗(On Diff #11743)

Correct. As it happens, this code is quite a bit better (fewer reg-reg dependencies).

test/CodeGen/X86/narrow-shl-load.ll
34–61 ↗(On Diff #11743)

Note that the code for this test case is actually worse after my patch.

Before:

test2:                                  # @test2
# BB#0:                                 # %entry
      pushq   %rax
      movl    $127, 4(%rsp)
      movb    $0, 3(%rsp)
      movl    4(%rsp), %eax
      addl    %eax, %eax
      movsbl  %al, %eax
      sarl    %eax
      cmpl    $-1, %eax
      jne     .LBB1_2

After:

test2:                                  # @test2
# BB#0:                                 # %entry
      pushq   %rax
      movl    $127, 4(%rsp)
      movb    $0, 3(%rsp)
      movl    4(%rsp), %eax
      addl    %eax, %eax
      movsbl  %al, %eax
      shrl    %eax
      movzbl  %al, %eax
      cmpl    $255, %eax
      jne     .LBB1_2

I haven't tracked this down, but the test doesn't really give *any* information about what this was actually trying to test. The change it went in with added a single value type test to *avoid* one DAG combine. Without more context for what the desirable code was, it seemed a bad idea to keep the test at all.

We probably should have something which recognizes that we could do "sarl, cmpl -1" rather than "shrl, movzbl, cmpl 255" (or equivalent other constants). But I'm not fussed about letting this regress minorly and letting someone who wants come along and clean it up later.

test/CodeGen/X86/store-narrow.ll
37 ↗(On Diff #11743)

Note that this is probably a (very minor) regression as we don't actually create partial register stalls here.

test/CodeGen/X86/vec_extract-sse4.ll
7 ↗(On Diff #11743)

This is also likely a minor regression as we fail to avoid the cross-register-bank copy, although in this case it seems likely to be quite minor.

I think it's fine to hit the regressions later if/when they show up.

grosbach accepted this revision.Jul 22 2014, 3:59 PM
grosbach edited edge metadata.

This LGTM w/ the recent updates. I second Tim's comment about node ordering assumptions. That's bad news anyway.

chandlerc closed this revision.Jul 23 2014, 12:17 AM
chandlerc updated this revision to Diff 11804.

Closed by commit rL213727 (authored by @chandlerc).

llvm/trunk/test/CodeGen/X86/divide-by-constant.ll