This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Skip re-visiting EntryToken to avoid compile time explosion
ClosedPublic

Authored by uweigand on Sep 1 2020, 10:36 AM.

Download Raw Diff

Details

Reviewers

spatel
RKSimon
niravd
arsenm

Commits

rG1a25133bcdfe: [DAGCombine] Skip re-visiting EntryToken to avoid compile time explosion

Summary

Compiling the SingleSource/UnitTests/matrix-types-spec.cpp test case on SystemZ takes a very long time (many hours, possibly more than a day), which is causing the test to time out and the build bot to be red since that test was added.

Interestingly, on Intel compilation completes after a minute or so. However, it turns out this is simply because on Intel, LLVM considers SSE vector support to be available by default, while the default SystemZ target has no vector support. Changing those defaults causes the behavior to match again: compiling on Intel with -mno-sse also takes hours, compiling on SystemZ with -march=z15 also takes just a minute.

The following is a reduced test case (basically the transposeSpec) function of matrix-types-spec.cpp:

void test(float *X, float *ResSpec) {
  constexpr int R0 = 31;
  constexpr int C0 = 17;

  using matrix_trans_t = float __attribute__((matrix_type(C0, R0)));
  auto M = __builtin_matrix_column_major_load(X, R0, C0, R0);
  matrix_trans_t Res;
  for (int c = 0; c < C0; ++c)
    for (int r = 0; r < R0; ++r)
      Res[c][r] = M[r][c];
  __builtin_matrix_column_major_store(Res, ResSpec, C0);
}

Building this with -O3 -fenable-matrix (and -mno-sse on Intel) demonstrates the problem.

What seems to happen is that DAGCombine optimization exposes many dead stores to the temporary variable which are optimized away. This results in the store node being replaced by its input chain. Given that many of the stores have no dependency on each other, that input chain will often be the EntryToken. Now DAGCombiner::Run will add the result node and all its users back to its worklist. In case that result node is the EntryToken, all users of the EntryToken are added back -- and those will be a large number of stores ...

So over all, this seems to scale as something like a quadratic or cubic function of the number of stores. (This also seems to explain the difference when vector support is present, because the number of stores is reduced by a factor of 4 then.) The details may be more complicated, but it's a bit hard to analyze all the details given the large number of nodes involved.

Now, I'm wondering what the proper fix for this type of problem is. However, I notice that the fact that an input chain to a node is now the EntryToken basically never allows any significant follow-on optimization to occur. So I tried simply skipping adding the result node and its users to the worklist in the case where that result node is the EntryToken. And in fact, I was unable to find any difference in code generation due to that change. However, the above test case now completes compilation in about half a minute ...

Diff Detail

Unit TestsFailed

	Time	Test
	950 ms	linux > libomp.lock::omp_init_lock.c

Event Timeline

uweigand created this revision.Sep 1 2020, 10:36 AM

Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2020, 10:36 AM

Herald added subscribers: llvm-commits, ecnelises, steven.zhang, hiraditya. · View Herald Transcript

uweigand requested review of this revision.Sep 1 2020, 10:36 AM

Harbormaster completed remote builds in B70264: Diff 289211.Sep 1 2020, 11:09 AM

Could you share the IR before DAGcombine that triggers the problem? DAGCombine already has limits in a few places that should prevent store merging from causing an explosion in compile-time. Would be good to know where this case slips through.

In D86963#2250247, @fhahn wrote:

Could you share the IR before DAGcombine that triggers the problem? DAGCombine already has limits in a few places that should prevent store merging from causing an explosion in compile-time. Would be good to know where this case slips through.

Sure, here's the IR of the function (which remains basically unchanged until ISel), and also the SelectionDAG of the basic block just before the DAGCombine that takes a very long time:

matrix-test.ll49 KBDownload

matrix-test.isel16 MBDownload

Ping.

I'd like to make some progress on this soon, given that this still causes every LNT run on the SystemZ build bot to fail ...

arsenm accepted this revision.Sep 9 2020, 9:24 AM

This revision is now accepted and ready to land.Sep 9 2020, 9:24 AM

Herald added a subscriber: wdng. · View Herald TranscriptSep 9 2020, 9:24 AM

This revision was landed with ongoing or failed builds.Sep 9 2020, 10:14 AM

Closed by commit rG1a25133bcdfe: [DAGCombine] Skip re-visiting EntryToken to avoid compile time explosion (authored by uweigand). · Explain Why

This revision was automatically updated to reflect the committed changes.

uweigand added a commit: rG1a25133bcdfe: [DAGCombine] Skip re-visiting EntryToken to avoid compile time explosion.

Revision Contents

Path

Size

llvm/

lib/

CodeGen/

SelectionDAG/

DAGCombiner.cpp

12 lines

Diff 289211

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 1,552 Lines • ▼ Show 20 Lines	while (SDNode *N = getNextWorklistEntry()) {
if (N->getNumValues() == RV.getNode()->getNumValues())		if (N->getNumValues() == RV.getNode()->getNumValues())
DAG.ReplaceAllUsesWith(N, RV.getNode());		DAG.ReplaceAllUsesWith(N, RV.getNode());
else {		else {
assert(N->getValueType(0) == RV.getValueType() &&		assert(N->getValueType(0) == RV.getValueType() &&
N->getNumValues() == 1 && "Type mismatch");		N->getNumValues() == 1 && "Type mismatch");
DAG.ReplaceAllUsesWith(N, &RV);		DAG.ReplaceAllUsesWith(N, &RV);
}		}

// Push the new node and any users onto the worklist		// Push the new node and any users onto the worklist. Omit this if the
		// new node is the EntryToken (e.g. if a store managed to get optimized
		// out), because re-visiting the EntryToken and its users will not uncover
		// any additional opportunities, but there may be a large number of such
		// users, potentially causing compile time explosion.
		if (RV.getOpcode() != ISD::EntryToken) {
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
AddUsersToWorklist(RV.getNode());		AddUsersToWorklist(RV.getNode());
		}

// Finally, if the node is now dead, remove it from the graph. The node		// Finally, if the node is now dead, remove it from the graph. The node
// may not be dead if the replacement process recursively simplified to		// may not be dead if the replacement process recursively simplified to
// something else needing this node. This will also take care of adding any		// something else needing this node. This will also take care of adding any
// operands which have lost a user to the worklist.		// operands which have lost a user to the worklist.
recursivelyDeleteUnusedNodes(N);		recursivelyDeleteUnusedNodes(N);
}		}

▲ Show 20 Lines • Show All 20,633 Lines • Show Last 20 Lines