Page MenuHomePhabricator

[Scheduling] Fall back to the fast cluster algorithm if the DAG is too complex
ClosedPublic

Authored by steven.zhang on Oct 26 2020, 3:22 AM.

Details

Summary

We have added a new load/store cluster algorithm in D85517. However, AArch64 see some compiling deg with the new algorithm as the IsReachable() is not cheap if the DAG is complex. O(M+N) See https://bugs.llvm.org/show_bug.cgi?id=47966

So, this patch added a heuristic to switch to old cluster algorithm if the DAG is too complex. For now, I used the heuristic value 5M for NumOfStores * NumOfSUnits. This is a rough threshold as I don't see compiling time impact with it. Welcome for more comments.

The case in https://bugs.llvm.org/show_bug.cgi?id=47966 is that, we have ~10000 LdSt and ~15000 SUnits in one region, which blow up the compiling time.

Diff Detail

Event Timeline

steven.zhang created this revision.Oct 26 2020, 3:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 26 2020, 3:22 AM
steven.zhang requested review of this revision.Oct 26 2020, 3:22 AM
steven.zhang edited the summary of this revision. (Show Details)Oct 26 2020, 3:29 AM

Did the concept of using the static DAG not work?

llvm/lib/CodeGen/MachineScheduler.cpp
1586

s/be/been/

1587

"they don't have dependencies on each other"

1708

This threshold happens to work for the testcase in question, but my gut tells me that the threshold should be substantially lower than this. It might make sense to generate some artificial test cases and measure how big of a cluster is needed to make compile time explode.

Address comments.

Did the concept of using the static DAG not work?

If we want to close the gap with old algorithm, we have to have the O(1) query algorithm, which requires O(M^2) space and some time effort to construct the index. This is a known graph problem and I have no good idea. The scheduling DAG has property that seldom removing the edges but keeping adding edges. Do you have any good idea to improve the IsReachable() which will also benefit the compiling time of scheduler ?

Regarding to the threshold, I did a test with your case by removing the stores to reduce the DAG. This is the data.

LdSt Num: 794, DAG Size: 1191, 794 x 1191 = 945654 
new algorithm: 0.53s
old algorithm: 0.34s

They are completely the same when it is:

LdSt Num: 669, DAG Size: 1015, 669 x 1015 = 679035 
new algorithm: 0.39s
old algorithm: 0.41s

The time complexity for this algorithm is: O(M*(N+E)) if I understand correctly. M is the number of the load/store, and N is the number of the SUnits and E is the number of edges. For now, I choose the 1000000 as threshold as it seems not deg too much. Is it ok ?

resistor accepted this revision.Oct 28 2020, 11:42 AM
This revision is now accepted and ready to land.Oct 28 2020, 11:42 AM
This revision was landed with ongoing or failed builds.Nov 1 2020, 6:12 PM
This revision was automatically updated to reflect the committed changes.