This is an archive of the discontinued LLVM Phabricator instance.

Make the insertion of predecessor deps in the schedule graph not quadratic in the number of predecessor deps.
Needs Review · Public

Authored by chandlerc on Jun 16 2016, 3:37 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

Before this we would do a linear scan of all existing predecessor deps to
check for overlap. When inserting N non-duplicate predecessor deps, this
is trivially O(N^2).
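
To make the shape of the problem concrete, here is a minimal, self-contained sketch of that pattern; `Dep`, `Node`, and `addPred` are hypothetical stand-ins for `SDep`/`SUnit::addPred`, not the actual LLVM code:

```cpp
#include <vector>

struct Dep {
  int FromNode; // hypothetical stand-in for the SUnit* the edge points at
  int Kind;     // hypothetical stand-in for the SDep kind/latency payload
  bool overlaps(const Dep &Other) const {
    return FromNode == Other.FromNode && Kind == Other.Kind;
  }
};

struct Node {
  std::vector<Dep> Preds;

  // Mirrors the shape of the pre-patch insertion: every call scans the
  // whole Preds vector, so inserting N distinct deps costs O(N^2) overall.
  bool addPred(const Dep &D) {
    for (const Dep &P : Preds)   // linear scan on every insertion
      if (P.overlaps(D))
        return false;            // an existing edge already covers this dep
    Preds.push_back(D);
    return true;
  }
};
```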

Unfortunately, there are two different definitions of "overlap" used, so
a single map is insufficient. Instead, we use an index map and a count
map. The count map handles the preds that aren't required if there is
an existing edge, and the index map allows us to find an exact existing
dep and update it in constant time w.r.t. the number of predecessor
deps.
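
A rough sketch of that two-map scheme, again with hypothetical stand-in types rather than the real `SDep`/`DenseMap`-based code in the patch:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Dep {
  int FromNode;               // hypothetical stand-in for the SUnit* target
  int Kind;                   // hypothetical stand-in for the SDep kind
  bool NotRequiredIfExisting; // e.g. a weak edge that any existing edge covers
};

struct Node {
  std::vector<Dep> Preds;
  // Packed (node, kind) -> slot in Preds, for exact-duplicate updates.
  std::unordered_map<int64_t, size_t> PredIndex;
  // node -> number of edges already pointing at it, for the weaker check.
  std::unordered_map<int, unsigned> PredCount;

  static int64_t key(const Dep &D) {
    return (int64_t(D.FromNode) << 32) | uint32_t(D.Kind);
  }

  bool addPred(const Dep &D) {
    auto It = PredIndex.find(key(D));
    if (It != PredIndex.end()) {
      Preds[It->second] = D;   // update the existing edge in place, O(1)
      return false;
    }
    if (D.NotRequiredIfExisting && PredCount.count(D.FromNode))
      return false;            // some edge to this node already exists
    PredIndex[key(D)] = Preds.size();
    ++PredCount[D.FromNode];
    Preds.push_back(D);
    return true;
  }
};
```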

This doesn't fix the linear scan over *successor* deps, but for whatever
reason, even with the *insane* schedule graph formed by
test/CodeGen/AMDGPU/spill-scavenge-offset.ll, that scan doesn't show up
as particularly hot in my profile. If I can find a way to make it show
up, I'll look at fixing that linear scan as well.

However, one possible reason I can't see it is that fixing this
quadratic behavior immediately uncovers a second quadratic behavior. I'm
going to try to fix that next.

This fix alone is good for a 20% to 55% speedup on the above test case
as it was prior to r272860, which somewhat avoided triggering this
quadratic behavior. A debug build for me drops from 40s to 32s for the
entire test, and an optimized build from 6s to 4.5s. This shaves about
4s off of my 'ninja check-llvm' time in debug builds, where this test is
one of the tall poles. I had really hoped for more dramatic improvements,
but there appears to be too much overhead and too many other quadratic
things going on...

Given that this requires two map data structures, one of which has
a decidedly non-trivial key, I'm not 100% certain this is the best
approach. It would be so much nicer to have a tiered structure from
SUnit to the set of deps on that edge... But that looks like a much more
invasive change. Thoughts? Personally, I still lean toward not having
a quadratic algorithm. =]
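
For comparison, a rough sketch of that tiered alternative (hypothetical types again, not the existing SUnit/SDep layout): group deps by the node they point at, so both overlap checks only walk the handful of deps on that one edge:

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

struct DepInfo { int Kind; int Latency; };

struct TieredNode {
  // First tier: which predecessor node the edge points at.
  // Second tier: the dep records carried on that edge.
  std::unordered_map<int, std::vector<DepInfo>> PredEdges;

  bool addPred(int PredNode, const DepInfo &D) {
    std::vector<DepInfo> &Edge = PredEdges[PredNode];
    for (DepInfo &Existing : Edge)   // scan only this edge's few deps
      if (Existing.Kind == D.Kind) {
        Existing.Latency = std::max(Existing.Latency, D.Latency);
        return false;
      }
    Edge.push_back(D);
    return true;
  }
};
```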

Note that these quadratic algorithms impact both the SDAG scheduling and
MI scheduling because both build the Schedule DAG. =[ =[ =[ =[

Diff Detail

Event Timeline

chandlerc updated this revision to Diff 60957.Jun 16 2016, 3:37 AM
chandlerc retitled this revision from to Make the insertion of predecessor deps in the schedule graph not quadratic in the number of predecessor deps..
chandlerc updated this object.
chandlerc added a subscriber: llvm-commits.
MatzeB added a subscriber: atrick.Jun 16 2016, 3:50 PM
arsenm added a subscriber: arsenm.Jun 16 2016, 3:55 PM

Looks reasonable to me

lib/CodeGen/ScheduleDAG.cpp
90–92

range loop?

TL;DR: The DAG builder is never supposed to form "insane" DAGs.

Why do we have a compile-time stress test checked into the unit tests if we don't want to bloat 'make check' time? This test should be attached to a bug report instead!

It's well understood that scheduling DAGs are quadratic if they naively represent all dependencies. That is a problem regardless of how the predecessor/successor edges are uniqued. We don't want to use quadratic space either.

The normal strategy for dealing with the DAG size is to pinch off the dependencies at some reasonable limit. Why isn't that happening? For example, that's why we don't include stack pointer manipulation in the DAG. Are these register or memory dependencies? Support for alias analysis was recently enabled, and I expect to see situations like this as fallout. If that's the case, the fix is to teach the DAG builder to better limit the number of memory dependencies that are being tracked.
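
To illustrate the pinching idea abstractly (this is a hypothetical sketch of the strategy, not the DAG builder's actual mechanism): track a bounded number of individual deps and, past the limit, collapse everything onto a single conservative barrier edge so the edge count stays linear:

```cpp
#include <vector>

constexpr unsigned DepLimit = 16;   // hypothetical threshold

struct MemDeps {
  std::vector<int> TrackedPreds;    // precise per-instruction deps
  bool PinchedToBarrier = false;    // past the limit: one barrier dep only

  void addMemDep(int PredNode) {
    if (PinchedToBarrier)
      return;                       // the barrier edge already orders everything
    if (TrackedPreds.size() >= DepLimit) {
      TrackedPreds.clear();         // give up precision...
      PinchedToBarrier = true;      // ...and keep a single conservative edge
      return;
    }
    TrackedPreds.push_back(PredNode);
  }
};
```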

The obvious downside to this patch is that it increases the size of the DAG and adds constant overhead. That needs to be measured. I don't think the common case should be penalized, particularly because there are other ways to prevent pathological behavior in the DAG builder.

Sorry I don't have time to investigate this test case. I wish I did.

[Note: There are a lot of inefficiencies in the scheduling DAG because it's shared with ISEL. What makes that especially painful is that it's quite silly for ISEL to build this DAG at all.]