This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Cache live-ins and register pressure in scheduler
ClosedPublic

Authored by rampitec on May 12 2017, 12:00 AM.

Details

Summary

Querying LiveIntervals (LIS) can be quite expensive, so this patch caches the
calculated region live-ins and pressure. The caching does two things:

  1. Caches the info for the second stage, when we schedule with decreased target occupancy.
  2. Tracks the basic block from top to bottom, thus eliminating the need to scan the liveness of the whole register file at every region split in the middle of the block.

Scheduling is now done in three stages instead of two; the first stage
is effectively a no-op and is used only to collect the scheduling regions
as sent by the scheduler driver.
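
To illustrate the top-to-bottom tracking and per-region caching described above, here is a minimal standalone sketch. It is not the code from this patch: `Inst`, `Region`, `RegionInfo`, and `computeBlockInfo()` are hypothetical names, registers are plain integers, and "pressure" is simply the size of the live set after each instruction. The point is only that one forward walk over the block can record live-ins and peak pressure for every region, so later stages can reuse them instead of recomputing liveness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction (hypothetical): registers it defines and registers
// whose last use it is.
struct Inst {
  std::vector<unsigned> Defs;
  std::vector<unsigned> Kills;
};

// Half-open range [Begin, End) of instruction indices inside one block.
struct Region {
  std::size_t Begin;
  std::size_t End;
};

// Cached per-region data that a later scheduling stage could reuse.
struct RegionInfo {
  std::set<unsigned> LiveIns;   // live set on entry to the region
  std::size_t MaxPressure = 0;  // peak live-set size seen in the region
};

// Walk the block once from top to bottom and fill the per-region cache.
// Assumes Regions are sorted by Begin, non-empty, and non-overlapping.
// Live is the live-in set of the block and is updated as we advance.
std::vector<RegionInfo> computeBlockInfo(const std::vector<Inst> &Block,
                                         const std::vector<Region> &Regions,
                                         std::set<unsigned> Live) {
  std::vector<RegionInfo> Cache(Regions.size());
  std::size_t R = 0; // index of the region we are in or approaching
  for (std::size_t I = 0; I < Block.size() && R < Regions.size(); ++I) {
    assert(Regions[R].Begin < Regions[R].End && "empty region");
    if (I == Regions[R].Begin) {
      // Entering a region: the current live set is its live-in set.
      Cache[R].LiveIns = Live;
      Cache[R].MaxPressure = Live.size();
    }
    // Advance over instruction I: drop killed registers, add defined ones.
    for (unsigned K : Block[I].Kills)
      Live.erase(K);
    for (unsigned D : Block[I].Defs)
      Live.insert(D);
    if (I >= Regions[R].Begin && I < Regions[R].End)
      Cache[R].MaxPressure = std::max(Cache[R].MaxPressure, Live.size());
    if (I + 1 == Regions[R].End)
      ++R; // leaving the region; the live set carries over to the next one
  }
  return Cache;
}
```

In this toy model a later stage would index the returned vector by region number rather than rescanning liveness at each region boundary.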

There is no functional change to the current behavior; only compilation
speed is affected. In general, computeBlockPressure() could be simplified
if we switch to a backward RP tracker, because the scheduler sends regions
within a block starting from the last one and moving upward. We could use
the natural order of the upward tracker to move seamlessly between regions
of the same block, since the live reg set left after tracking one region
would become the live-out of the next region. That, however, requires fixing
the upward tracker to properly account for defs and uses of the same
instruction, as both contribute to the current pressure. Once the two
trackers converge on the produced pressure, we should be able to switch
between them back and forth. In addition, the backward tracker is less
expensive, as it uses LIS in recede() less often than the forward tracker
uses it in advance().
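
The live-set handoff between regions in an upward walk can be sketched as below. This is a simplified model under stated assumptions, not the GCN tracker: `Instr`, `Range`, `recede()`, and `liveOutsBottomUp()` are hypothetical, registers are plain integers, and the sketch deliberately ignores the pressure-accounting issue for defs and uses of the same instruction mentioned above. It shows only that, when regions arrive bottom-up, the live set left after receding through one region is already the live-out set of the next one.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction (hypothetical): registers it defines and registers it reads.
struct Instr {
  std::vector<unsigned> Defs;
  std::vector<unsigned> Uses;
};

// Half-open range [Begin, End) of instruction indices inside one block.
struct Range {
  std::size_t Begin;
  std::size_t End;
};

// Step upward across one instruction: a def ends liveness above it,
// a use makes the register live above it.
void recede(const Instr &MI, std::set<unsigned> &Live) {
  for (unsigned D : MI.Defs)
    Live.erase(D);
  for (unsigned U : MI.Uses)
    Live.insert(U);
}

// Regions are given bottom-up (last region of the block first), the order
// in which the summary says the scheduler sends them. Live starts as the
// live-out set of the block. Returns the live-out set of each region,
// in the order the regions were visited.
std::vector<std::set<unsigned>>
liveOutsBottomUp(const std::vector<Instr> &Block,
                 const std::vector<Range> &Regions,
                 std::set<unsigned> Live) {
  std::vector<std::set<unsigned>> LiveOuts;
  std::size_t I = Block.size(); // one past the instruction to recede next
  for (const Range &R : Regions) {
    assert(R.Begin < R.End && R.End <= I && "regions must be bottom-up");
    while (I > R.End)            // skip instructions between regions
      recede(Block[--I], Live);
    LiveOuts.push_back(Live);    // live set here is this region's live-out
    while (I > R.Begin)          // recede through the region itself
      recede(Block[--I], Live);
  }
  return LiveOuts;
}
```

No per-region liveness query is needed in this model; the single `Live` set is carried across region boundaries.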

At the moment, the worst known compilation time has improved from 26
minutes to 8.5 minutes.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision. May 12 2017, 12:00 AM
rampitec updated this revision to Diff 98836. May 12 2017, 1:48 PM
rampitec edited the summary of this revision.

Added a second, very small cache to reuse the calculated live set of a block as the live-in set of its only successor.
Simplified the loop inside computeBlockPressure() to catch end() of the block without code duplication.
Adjusted the initial block iterator to the first instruction of the first (topologically) region to save a few iterations in some blocks.
Diff is now created against its parent review D33105 to show only the real changes.

rampitec updated this revision to Diff 99017. May 15 2017, 9:55 AM

Skip debug values when crossing basic block boundary.

rampitec updated this revision to Diff 99097. May 15 2017, 7:47 PM

Switch to stateful tracker.

vpykhtin accepted this revision. May 16 2017, 8:01 AM

LGTM.

lib/Target/AMDGPU/GCNSchedStrategy.cpp
340 ↗(On Diff #99097)

Turn this into a helper debug func?

This revision is now accepted and ready to land. May 16 2017, 8:01 AM
rampitec added inline comments. May 16 2017, 8:04 AM
lib/Target/AMDGPU/GCNSchedStrategy.cpp
340 ↗(On Diff #99097)

I want to switch to your print() function from the tracker shortly. It is kept this way for now so that the logs can be compared with the old implementation.

rampitec added inline comments. May 16 2017, 8:22 AM
lib/Target/AMDGPU/GCNSchedStrategy.cpp
340 ↗(On Diff #99097)

Actually, printLivesAt() is a heavy one, as it recomputes the live set. How about factoring this into GCNRPTracker::printLiveRegs()?

vpykhtin added inline comments. May 16 2017, 8:53 AM
lib/Target/AMDGPU/GCNSchedStrategy.cpp
340 ↗(On Diff #99097)

Later commit :)

rampitec updated this revision to Diff 99158. May 16 2017, 9:23 AM
rampitec edited the summary of this revision.

Rebase to master.

This revision was automatically updated to reflect the committed changes.
rampitec marked 4 inline comments as done. May 16 2017, 9:34 AM
rampitec marked an inline comment as done. May 16 2017, 9:34 AM