This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Skip re-visiting EntryToken to avoid compile time explosion
ClosedPublic

Authored by uweigand on Sep 1 2020, 10:36 AM.

Details

Summary

Compiling the SingleSource/UnitTests/matrix-types-spec.cpp test case on SystemZ takes a very long time (many hours, possibly more than a day), which is causing the test to time out and the build bot to be red since that test was added.

Interestingly, on Intel compilation completes after a minute or so. However, it turns out this is simply because on Intel, LLVM considers SSE vector support to be available by default, while the default SystemZ target has no vector support. Changing those defaults causes the behavior to match again: compiling on Intel with -mno-sse also takes hours, compiling on SystemZ with -march=z15 also takes just a minute.

The following is a reduced test case (basically the transposeSpec) function of matrix-types-spec.cpp:

void test(float *X, float *ResSpec) {
  constexpr int R0 = 31;
  constexpr int C0 = 17;

  using matrix_trans_t = float __attribute__((matrix_type(C0, R0)));
  auto M = __builtin_matrix_column_major_load(X, R0, C0, R0);
  matrix_trans_t Res;
  for (int c = 0; c < C0; ++c)
    for (int r = 0; r < R0; ++r)
      Res[c][r] = M[r][c];
  __builtin_matrix_column_major_store(Res, ResSpec, C0);
}

Building this with -O3 -fenable-matrix (and -mno-sse on Intel) demonstrates the problem.

What seems to happen is that DAGCombine optimization exposes many dead stores to the temporary variable which are optimized away. This results in the store node being replaced by its input chain. Given that many of the stores have no dependency on each other, that input chain will often be the EntryToken. Now DAGCombiner::Run will add the result node and all its users back to its worklist. In case that result node is the EntryToken, all users of the EntryToken are added back -- and those will be a large number of stores ...

So over all, this seems to scale as something like a quadratic or cubic function of the number of stores. (This also seems to explain the difference when vector support is present, because the number of stores is reduced by a factor of 4 then.) The details may be more complicated, but it's a bit hard to analyze all the details given the large number of nodes involved.

Now, I'm wondering what the proper fix for this type of problem is. However, I notice that the fact that an input chain to a node is now the EntryToken basically never allows any significant follow-on optimization to occur. So I tried simply skipping adding the result node and its users to the worklist in the case where that result node is the EntryToken. And in fact, I was unable to find any difference in code generation due to that change. However, the above test case now completes compilation in about half a minute ...

Diff Detail

Event Timeline

uweigand created this revision.Sep 1 2020, 10:36 AM
uweigand requested review of this revision.Sep 1 2020, 10:36 AM
fhahn added a subscriber: fhahn.Sep 1 2020, 11:35 AM

Could you share the IR before DAGcombine that triggers the problem? DAGCombine already has limits in a few places that should prevent store merging from causing an explosion in compile-time. Would be good to know where this case slips through.

Could you share the IR before DAGcombine that triggers the problem? DAGCombine already has limits in a few places that should prevent store merging from causing an explosion in compile-time. Would be good to know where this case slips through.

Sure, here's the IR of the function (which remains basically unchanged until ISel), and also the SelectionDAG of the basic block just before the DAGCombine that takes a very long time:


I'd like to make some progress on this soon, given that this still causes every LNT run on the SystemZ build bot to fail ...

arsenm accepted this revision.Sep 9 2020, 9:24 AM
This revision is now accepted and ready to land.Sep 9 2020, 9:24 AM
This revision was landed with ongoing or failed builds.Sep 9 2020, 10:14 AM
This revision was automatically updated to reflect the committed changes.