This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Iterative scheduling infrastructure + minimal registry scheduler
ClosedPublic

Authored by vpykhtin on Mar 16 2017, 11:31 AM.

Details

Summary

An iterative approach to finding the best schedule is essential for the GCN architecture. This change combines a number of ideas for iterative scheduling and presents the infrastructure.

Lightweight scheduling

Default schedulers schedule directly on MIR, reordering instructions and updating IR data such as LiveIntervals as they go. This is relatively heavy; instead, a scheduling strategy can return an array of MachineInstr pointers (or equivalent, as SIScheduler does) that defines a particular schedule. Such a lightweight schedule can be scored against other variants and applied only once. There are two types of lightweight schedules:

  1. an array of pointers to DAG SUnits - supposed to be returned by strategies. The benefit here is that a scoring function can use the DAG SUnits. Doesn't include debug values.
  2. an array of pointers to MachineInstr - a so-called 'detached' schedule in the sense that it no longer depends on the DAG state, and it includes debug values. This is useful when there is a need to store some variants for later selection.
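The "score many, apply once" idea behind lightweight schedules can be sketched as follows. This is a hedged sketch with mocked types (`Instr`, `pickBest` are illustrative names, not the patch's API): candidate orderings are plain vectors of instruction pointers, scored without touching MIR or LiveIntervals, and only the winner is ever applied.

```cpp
#include <functional>
#include <string>
#include <vector>

// Mocked stand-in for MachineInstr; a lightweight schedule is just an
// ordering of instruction pointers.
struct Instr { std::string Name; };
using LightweightSchedule = std::vector<Instr *>;

// Score every candidate ordering (e.g. by its peak register pressure)
// and return the one with the lowest score. Candidates must be
// non-empty. No MIR is reordered here; the caller applies the winner
// exactly once afterwards.
const LightweightSchedule &
pickBest(const std::vector<LightweightSchedule> &Candidates,
         const std::function<unsigned(const LightweightSchedule &)> &Score) {
  const LightweightSchedule *Best = &Candidates.front();
  for (const auto &S : Candidates)
    if (Score(S) < Score(*Best))
      Best = &S;
  return *Best;
}
```

The point of the detached (MachineInstr-based) form is that such candidates can be stored and compared even after the DAG they came from is gone.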

Scheduling with different strategies requires each strategy to preserve the DAG state so that other strategies can reuse the same DAG. This can be achieved either by saving the touched DAG data or, better, by not touching the DAG at all and instead annotating DAG SUnits with the information relevant to a particular strategy: SUnit has a NodeNum field which allows easy annotation without using maps. The minreg strategy implements the latter approach.
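The NodeNum trick can be sketched like this. `SUnit` is mocked here and `MinRegAnnotations` is an illustrative name, not the patch's actual class: since NodeNum is a dense 0..N-1 index, per-strategy data can live in a plain vector owned by the strategy, leaving the shared DAG untouched.

```cpp
#include <vector>

// Mocked stand-in for llvm::SUnit; in LLVM, NodeNum is a dense index
// into the DAG's SUnits array.
struct SUnit { unsigned NodeNum; };

// Strategy-private side data, indexed by SUnit::NodeNum instead of a
// DenseMap<SUnit*, ...>. The DAG itself is never modified, so another
// strategy can run on the same DAG afterwards.
struct MinRegAnnotations {
  std::vector<unsigned> Priority; // indexed by SUnit::NodeNum
  explicit MinRegAnnotations(unsigned NumSUnits) : Priority(NumSUnits, 0) {}
  unsigned &priority(const SUnit &SU) { return Priority[SU.NodeNum]; }
};
```

Throwing the vector away at the end of the strategy run is all the cleanup that is needed.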

GCNUpwardRPTracker

Lightweight schedules cannot be tracked with the LLVM RP trackers; for this purpose GCNUpwardRPTracker was introduced. As the name states, it can only go upward, instruction by instruction. The instruction order is defined by the tracker's caller, so it can be used both for tracking lightweight schedules and plain IR sequences. Upward tracking is easier to implement because it only requires the region's live-out set to operate, except for one case: when we need to find the used lane mask for a use of a large register. Even though LiveIntervals isn't updated yet for a given instruction of a lightweight schedule, it can still be used, because the lane mask for a use would not change for any schedule, as all defs should dominate the use. The defs can be reordered, but the overall mask should remain the same.

TODO: save live-out sets for every region when recording and reuse them for subsequent RP tracking, as live-outs don't depend on the schedule.
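The upward-tracking idea can be sketched as follows. This is a hedged toy (mocked instructions, integer register ids, no subregister lane masks, not the actual GCNUpwardRPTracker): starting from the region's live-out set, the tracker walks the caller-supplied order bottom-up, erasing liveness at defs and creating it at uses.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Mocked instruction: which registers it defines and which it uses.
struct Instr {
  std::vector<int> Defs;
  std::vector<int> Uses;
};

// Walk a scheduled sequence bottom-up starting from the region's
// live-out set; return the maximum number of simultaneously live regs.
// Pressure is sampled after processing each instruction, i.e. at the
// program point just above it.
unsigned maxPressureUpward(const std::vector<Instr> &Sched,
                           std::set<int> Live /* live-outs */) {
  unsigned Max = Live.size();
  for (auto I = Sched.rbegin(); I != Sched.rend(); ++I) {
    for (int D : I->Defs)
      Live.erase(D); // going upward, a def ends the live range
    for (int U : I->Uses)
      Live.insert(U); // a use makes the register live above this point
    Max = std::max<unsigned>(Max, Live.size());
  }
  return Max;
}
```

Only the live-out set is needed as input, which is why the same routine works for any reordering of the region, exactly the property the TODO above wants to exploit.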

GCNRegPressure

The structure used to track register pressure. It contains the number of SGPRs/VGPRs used, weights for large SGPR/VGPR tuples, and a compare function: the pressure giving the maximum occupancy wins; otherwise, the pressure with the lowest large-register weight wins.
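The comparison rule can be sketched like this. It is a hedged sketch of a GCNRegPressure-like structure; the field names and the occupancy model (256 VGPRs / 800 SGPRs per SIMD, at most 10 waves) are illustrative placeholders, not the patch's actual code or limits.

```cpp
#include <algorithm>

// Toy stand-in for GCNRegPressure.
struct RegPressure {
  unsigned NumSGPR = 0, NumVGPR = 0;
  unsigned LargeSGPRWeight = 0, LargeVGPRWeight = 0; // tuple weights
};

// Toy occupancy: waves are limited by whichever register file runs out.
unsigned occupancy(const RegPressure &P) {
  unsigned W = 10;
  if (P.NumVGPR) W = std::min(W, 256u / P.NumVGPR);
  if (P.NumSGPR) W = std::min(W, 800u / P.NumSGPR);
  return W;
}

// True if A is preferable to B: higher occupancy wins; on a tie, the
// pressure with the lower combined large-register weight wins.
bool betterPressure(const RegPressure &A, const RegPressure &B) {
  unsigned OccA = occupancy(A), OccB = occupancy(B);
  if (OccA != OccB)
    return OccA > OccB;
  return A.LargeSGPRWeight + A.LargeVGPRWeight <
         B.LargeSGPRWeight + B.LargeVGPRWeight;
}
```

The tuple-weight tie-break is what the reviewers discuss below: within the same occupancy level, a schedule that keeps fewer wide tuples alive is preferred.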

Minimal register scheduler (example)

This is an experimental, simple scheduler whose main purpose is to learn how to use as few registers as possible for a region (it doesn't care about performance at all). It doesn't always return the minimal usage but works relatively well on large regions with unrolled loops. It's also used in the tryMaximizeOccupancy scheduling pass.

Legacy Max occupancy scheduler

Included as an example; it mimics the current behaviour. It doesn't use lightweight schedules but shows how legacy and lightweight schedulers can be intermixed. The main difference is that it first collects all the regions to schedule and sorts them by register pressure. This way it starts with the fattest region first, knowing the best achievable occupancy beforehand. It also includes the tryMaximizeOccupancy pass, which tries to minimize register usage with the minreg strategy for the most consuming regions.

None of these schedulers are turned on by default in this change.

Testing:
Legacy Max occupancy scheduler fully passes lit tests.
Minreg runs lit tests without asserts.

No performance impact so far.

Tests to be added soon.

Event Timeline

vpykhtin created this revision.Mar 16 2017, 11:31 AM
axeldavy edited edge metadata.Mar 16 2017, 1:33 PM

The minimal register scheduler strategy seems to schedule top-down, giving priority to instructions that have as few non-ready successors (then as many ready successors) as possible.

I did try such a strategy with sisched, and it doesn't work well in some corner cases.
Imagine this scenario:

SU 6 is required by a lot of SUs that reduce register usage.
SU 6 requires SUs 1 to 5, which are ready to be scheduled.

SU 7 can be scheduled but increases register usage by 1.
SU 8 can be scheduled if SU 7 is scheduled, but will increase register usage by 1.
SU 9 reduces register usage by 2, but requires SU 6.

From SU 10 to 50, we get exactly the same scenario as with SU 7, 8, 9.

With the proposed strategy, the order of schedule will be:
7 10 13 16 etc
then
8 11 14 17 etc
then
1 2 3 4 5 6 etc
Register usage gets quite high before we schedule SU 6 and can begin reducing it.

To prevent that, I tried propagating a score describing the potential to release registers by scheduling a given SU.
It was heavy and didn't work up to my hopes. There were bad corner cases as well.

The problem is solved by:
https://reviews.llvm.org/D30147

My understanding is that later instead of picking the schedule with best occupancy, a score function will determine which schedule to pick.

It would be interesting to have sisched in the loop. Any idea how to adapt this patch and sisched to achieve that goal?

Also I'd like to get some of GCNSchedStrategy's functionalities into sisched, in particular:

  1. Accurate count of liveins at the beginning of a region
  2. Accurate limit for the number of registers we can use
  3. Once we determine the occupancy we can hope, schedule all again with that in mind.

1. and 2. would make it possible to remove all the hand-picked limits in sisched.

Thus if the goal is to integrate sisched into this framework too, we should try to make these functionalities available to the schedule strategies.

arsenm added inline comments.Mar 16 2017, 1:43 PM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
280

I think it's better to put the call to the function under DEBUG rather than wrapping an entire function body

297

llvm_unreachable

498

Typo Registry->Register

lib/Target/AMDGPU/GCNMinRegStrategy.cpp
177

Swap order of check?

184–185

pop_back_val

lib/Target/AMDGPU/GCNRegPressure.cpp
21

I think there is a more refined macro to check to not include dump() functions

86–87

I'd rather avoid having more switches over valid vreg classes. We already have isSGPRClass, and you can check the size for whether it is a tuple.

Should the 64-bit cases really count as "large" vgprs? They are common and not that big. If it's just indicating it's a tuple of any size it probably should be renamed

116–118

Sink to uses

148

llvm_unreachable

247

Remove this assert, it's undefined to have a null reference

284

Ditto

Hi Axel, thanks for reply!

All of this successor magic is a bit misleading here; it's left over from the time when I experimented a lot with it. Actually, the most important part of minreg is the bumpPredsPriority function, which computes a closure on predecessors and bumps the priority of those ready SUs that would allow continuing toward the deepest SU. This way the scheduler always strives to go deep.

As for including sisched, I have a patch to include it in this, though it's a bit outdated and should be updated.

Then in my example, what is the result of this strategy? Does it avoid scheduling SU 6 late?

I think one blocker is that sisched doesn't support subRegLiveness being enabled yet. I'm on it.

I have several sisched patches pending review. It would be great if you could take a look in order to get them merged.

vpykhtin added inline comments.Mar 16 2017, 2:19 PM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
280

This is a callback, I cannot wrap it. It should do nothing because of the postponed scheduling, but I decided to let it print something useful :)

lib/Target/AMDGPU/GCNRegPressure.cpp
86–87

It's a good idea not to duplicate switches, I'll try to avoid it.

Ok, I'll rename. This is used to compare pressures; the pressure with the lower tuple weight should win for the same occupancy.

rampitec added inline comments.Mar 16 2017, 7:35 PM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
421

I believe it should be limited by other pre-existing factors affecting occupancy (such as LDS usage).

472

Not really :) The second pass was run only if we shall increase occupancy because of fat block(s). Otherwise it is a waste of compile time. No magic.

lib/Target/AMDGPU/GCNRegPressure.cpp
107

llvm_unreachable

175

Maybe in this case VGPRs should win?

260

Space before ?

273

It can be any power of 2, not just 1. I.e. !(mask & (mask - 1)).
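The check rampitec presumably means is the bitwise form (note `&`, not the logical `&&`, which would accept almost every mask), plus a nonzero guard so an empty mask doesn't pass. A hedged sketch; `isSingleLane` is an illustrative name, not the patch's:

```cpp
#include <cstdint>

// A lane mask covers a single 32-bit lane iff exactly one bit is set,
// i.e. the mask is a nonzero power of two. Clearing the lowest set bit
// with Mask & (Mask - 1) must leave nothing behind.
bool isSingleLane(uint64_t Mask) {
  return Mask != 0 && (Mask & (Mask - 1)) == 0;
}
```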

lib/Target/AMDGPU/GCNSchedStrategy.cpp
349

I think we can drop it, since the scheduler is created anew for each function. It was my fault to initialize it here.

532

Change to setTargetOccupancy() since it is defined?

From your example it's not clear how the following SUs relate: SU6 <-> SU7, SU9 <-> SU7, SU9 <-> SU8

Ok, I'll try to take a look, thanks.

vpykhtin added inline comments.Mar 17 2017, 2:47 AM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
421

It's limited by TargetOcc later in this function. This function is currently called with TargetOcc = LDSOcc.

472

I feel I should rerun the performance tests for all of this to be sure of what I'm doing :-)

lib/Target/AMDGPU/GCNRegPressure.cpp
175

Yes, I think we can go this way. Another thought was to compare how far the pressure is from the next occupancy level.

273

Well, I see

LaneBitmask MachineRegisterInfo::getMaxLaneMaskForVReg(unsigned Reg) const {
  // Lane masks are only defined for vregs.
  assert(TargetRegisterInfo::isVirtualRegister(Reg));
  const TargetRegisterClass &TRC = *getRegClass(Reg);
  return TRC.getLaneMask();
}

and

/// Returns the combination of all lane masks of register in this class.
/// The lane masks of the registers are the combination of all lane masks
/// of their subregisters. Returns 1 if there are no subregisters.
LaneBitmask TargetRegisterClass::getLaneMask() const {
  return LaneMask;
}

Have you seen a maximum mask other than 1 for a 32-bit reg?

lib/Target/AMDGPU/GCNSchedStrategy.cpp
349

Agreed.

Ok, I'll add more details to make it more realistic.

Initially, SUs 1, 2, 3, 4, 5, 7, 10, 13, 16 etc (up to 50) can be scheduled (no dependencies)
SU 6 depends on SUs 1 2 3 4 5 only
SU 8 depends only on SU 7.
SU 9 depends on SUs 6 and 8
SU 11 depends only on SU 10
SU 12 depends on SUs 6, 9 and 11
etc

A possible pseudocode giving that scenario is:
SU 1 to 5: loads from memory
SU 6: read a constant b from memory. (And we'll say that the load order is enforced in this schedule.)
SU (7+3*i): read a constant a_i from memory
SU (8+3*i): compute the square of a_i
SU (9+3*i): current sum += a_i^2

That is, we have
b loaded from memory (SU 6)
b += a_1^2 (SU 7 8 9)
b += a_2^2 (SU 10 11 12)
etc

I think with the minReg policy, the schedule will be
SU 7 10 13 etc
then
SU 8 11 14 etc
then
SU 1 2 3 4 5 6 9 12 etc

SU 7 10 13 increase register usage, and the register usage is only reduced when 9, 12, etc. are scheduled.

I think I should write a test to show how it works, but theoretically:

Before: all ready SUs are put into the ready queue with priority = 0.

Step 1: 1, 2, 3, 4, 5 have one non-ready successor (SU6) and lose. SUs 7, 10, 13 ... are equivalent as they all have one ready successor, and 7 is selected by default order. This adds SU8 to the ready queue with priority = the current step_no (1).
Step 2: SU8 is the only SU with the highest priority (1), so it is selected. Now that SU8 has the non-ready successor SU9, it bumps the priority of every predecessor of SU9 in the ready queue to step_no, which is 2. This gives SUs 1, 2, 3, 4, 5 priority 2.
Step 3: SUs 1, 2, 3, 4, 5 have the highest priority (2) in the queue, and as they are equivalent, SU1 is selected by default. As SU1 has the non-ready successor SU6, we bump the priority of the predecessors of SU6 in the ready queue to 3.
... continue until all predecessors of SU6 are selected, then select SU6.

Continue with the rest of the SUs.
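The steps above can be simulated on Axel's 12-SU example. This is a hedged toy reconstruction, not the real GCNMinRegStrategy: the tie-breaking rules (prefer an SU whose scheduling immediately readies a successor, then lowest SU number) are my reading of the discussion, and the DAG edges come from the summation example.

```cpp
#include <functional>
#include <map>
#include <set>
#include <vector>

// Toy DAG: integer SU ids with predecessor/successor edge sets.
struct Dag {
  std::map<int, std::set<int>> Preds, Succs;
  void edge(int From, int To) { Succs[From].insert(To); Preds[To].insert(From); }
};

std::vector<int> minRegOrder(Dag &D, int NumSU) {
  std::set<int> Done;
  std::map<int, int> Prio; // ready SUs -> current priority
  auto ready = [&](int U) {
    for (int P : D.Preds[U])
      if (!Done.count(P)) return false;
    return true;
  };
  for (int U = 1; U <= NumSU; ++U)
    if (ready(U)) Prio[U] = 0;

  // Bump every ready SU in the predecessor closure of U to priority Step.
  std::function<void(int, int)> bump = [&](int U, int Step) {
    for (int P : D.Preds[U]) {
      if (Done.count(P)) continue;
      if (Prio.count(P)) Prio[P] = Step;
      else bump(P, Step); // closure through non-ready predecessors
    }
  };

  std::vector<int> Order;
  for (int Step = 1; !Prio.empty(); ++Step) {
    // Would scheduling U immediately make one of its successors ready?
    auto readies = [&](int U) {
      Done.insert(U);
      bool R = false;
      for (int S : D.Succs[U]) R |= ready(S);
      Done.erase(U);
      return R;
    };
    // Pick max priority; ties go to SUs that ready a successor, then to
    // the lowest SU number (default order).
    int Best = -1;
    for (auto &KV : Prio)
      if (Best < 0 || KV.second > Prio[Best] ||
          (KV.second == Prio[Best] && readies(KV.first) && !readies(Best)))
        Best = KV.first;
    Prio.erase(Best);
    Done.insert(Best);
    Order.push_back(Best);
    for (int S : D.Succs[Best])
      if (ready(S)) Prio[S] = Step; // newly ready, tagged with the step
      else bump(S, Step);            // bump preds of non-ready successors
  }
  return Order;
}
```

With the example's edges (1-5 → 6; 7 → 8; 6, 8 → 9; 10 → 11; 6, 9, 11 → 12), this toy produces 7, 8, 1, 2, 3, 4, 5, 6, 10, 11, 9, 12: SU6 is scheduled right after its five predecessors, avoiding the register blow-up. SU9 lands slightly later than in the hand trace above because SU10 ties with it after a bump and this toy's tie-break resolves the tie differently.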

I see. I didn't understand before that you were propagating the priority to non-direct predecessors.
In that case, the worst-case scenario is if I take exactly the same example with the numbers reversed.
That is, equivalently, if SUs with high NodeNum were selected first.

A good policy in my experience is to take the node with the maximum Height when you have several choices.
That is probably a better choice than relying on NodeNum.

Why? Let's assume 10, 11 and 12 are the last SUs in the sequence of summation and I select max NodeNum first. This would select 10, 11 and would bump 1, 2, 3, 4, 5 again, and the order would be 5, 4, 3, 2, 1, 6, 7, 8, 9.

In the summation example, 12 depends on 9, thus 7, 8, 9 would get bumped as well, and thus 7 would be taken first.

Right, this is a good example. The idea with the height looks good then, thanks!

rampitec added inline comments.Mar 17 2017, 9:47 AM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
472

In the above comment I really meant "if we shall decrease occupancy".

lib/Target/AMDGPU/GCNRegPressure.cpp
273

Not in our BE, but I have seen it once :( VGPR32 and SGPR32 have 1.

rampitec edited edge metadata.Mar 17 2017, 9:47 AM

Can you please add more RUN lines with new strategies to the test/CodeGen/AMDGPU/schedule-regpressure-limit.ll

Ok

axeldavy added inline comments.Mar 18 2017, 6:16 AM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
363

Why is all that needed?
If it's something handleMove is missing, perhaps it's handleMove that should be updated.

vpykhtin updated this revision to Diff 92346.Mar 20 2017, 9:49 AM

Added tests + per review updates.

vpykhtin marked 25 inline comments as done.Mar 20 2017, 9:54 AM

I optimistically moved the dump function down in GCNRegPressure.cpp and all the comments got out of sync with the source, sorry.

vpykhtin updated this revision to Diff 92350.Mar 20 2017, 10:06 AM

Restored function order

rampitec added inline comments.Mar 20 2017, 10:22 AM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
472

Please check the original code:

void GCNScheduleDAGMILive::finalizeSchedule() {
  // Retry function scheduling if we found resulting occupancy and it is
  // lower than used for first pass scheduling. This will give more freedom
  // to schedule low register pressure blocks.
  // Code is partially copied from MachineSchedulerBase::scheduleRegions().

  if (!LIS || StartingOccupancy <= MinOccupancy)
    return;

I.e., the second pass was inhibited if we stayed within the starting occupancy.

test/CodeGen/AMDGPU/schedule-regpressure-limit.ll
6

Maybe change check: {{[12][0-9]$}} for GCNMaxOccupancy and {{[01][0-9]$}} for MinReg?

test/CodeGen/AMDGPU/schedule-regpressure-limit2.ll
12

Do not want an optimistic [0-...]? ;)

vpykhtin added inline comments.Mar 20 2017, 10:43 AM
lib/Target/AMDGPU/GCNIterativeScheduler.cpp
472

Well, it's constrained by const int NumPasses = Occ < TgtOcc ? 2 : 1; above, on line 464.

Actually the strangest thing is that I don't need the second pass at all, because I now target occupancy beforehand due to sorting by pressure. The reason I left two passes is that it somehow helps with performance.

test/CodeGen/AMDGPU/schedule-regpressure-limit.ll
6

After you introduced alias analysis, this test contains a lot of merged loads/stores, therefore the register pressure isn't much different now.

test/CodeGen/AMDGPU/schedule-regpressure-limit2.ll
12

I'm not sure it uses a leading zero; it should probably be [1-9]?[0-9].

rampitec accepted this revision.Mar 20 2017, 10:48 AM

LGTM

test/CodeGen/AMDGPU/schedule-regpressure-limit2.ll
12

I think you are right.

This revision is now accepted and ready to land.Mar 20 2017, 10:48 AM
This revision was automatically updated to reflect the committed changes.