Compiling the SingleSource/UnitTests/matrix-types-spec.cpp test case on SystemZ takes a very long time (many hours, possibly more than a day), which is causing the test to time out and the build bot to be red since that test was added.
Interestingly, on Intel compilation completes after a minute or so. However, it turns out this is simply because on Intel, LLVM considers SSE vector support to be available by default, while the default SystemZ target has no vector support. Changing those defaults causes the behavior to match again: compiling on Intel with -mno-sse also takes hours, compiling on SystemZ with -march=z15 also takes just a minute.
The following is a reduced test case (basically the transposeSpec) function of matrix-types-spec.cpp:
void test(float *X, float *ResSpec) { constexpr int R0 = 31; constexpr int C0 = 17; using matrix_trans_t = float __attribute__((matrix_type(C0, R0))); auto M = __builtin_matrix_column_major_load(X, R0, C0, R0); matrix_trans_t Res; for (int c = 0; c < C0; ++c) for (int r = 0; r < R0; ++r) Res[c][r] = M[r][c]; __builtin_matrix_column_major_store(Res, ResSpec, C0); }
Building this with -O3 -fenable-matrix (and -mno-sse on Intel) demonstrates the problem.
What seems to happen is that DAGCombine optimization exposes many dead stores to the temporary variable which are optimized away. This results in the store node being replaced by its input chain. Given that many of the stores have no dependency on each other, that input chain will often be the EntryToken. Now DAGCombiner::Run will add the result node and all its users back to its worklist. In case that result node is the EntryToken, all users of the EntryToken are added back -- and those will be a large number of stores ...
So over all, this seems to scale as something like a quadratic or cubic function of the number of stores. (This also seems to explain the difference when vector support is present, because the number of stores is reduced by a factor of 4 then.) The details may be more complicated, but it's a bit hard to analyze all the details given the large number of nodes involved.
Now, I'm wondering what the proper fix for this type of problem is. However, I notice that the fact that an input chain to a node is now the EntryToken basically never allows any significant follow-on optimization to occur. So I tried simply skipping adding the result node and its users to the worklist in the case where that result node is the EntryToken. And in fact, I was unable to find any difference in code generation due to that change. However, the above test case now completes compilation in about half a minute ...