This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][MISCHED] GCNBalancedSchedStrategy.
Needs ReviewPublic

Authored by alex-t on Aug 20 2023, 8:36 AM.

Details

Summary

The change implements the scheduling strategy to find a reasonable trade-off between the ILP and occupancy.
For that purpose, it computes the heuristic metric to decide if the current schedule is worth to be kept.
This is an attempt to use the idea in the https://reviews.llvm.org/D139710 to replace the shouldRevertScheduling function. This approach avoids additional computations in compile time by utilizing iterative metric computation during scheduling. Unlike the https://reviews.llvm.org/D139710 the heuristic is applied to all scheduling stages.

Diff Detail

Unit TestsFailed

TimeTest
350 msLinux x64 > LLVM.CodeGen/AMDGPU::gfx-callable-return-types.ll
Exit Code: 1 Command Output (stderr):
690 msWindows x64 > LLVM.CodeGen/AMDGPU::gfx-callable-return-types.ll
Exit Code: 1 Command Output (stdout):

Event Timeline

alex-t created this revision.Aug 20 2023, 8:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2023, 8:36 AM
alex-t requested review of this revision.Aug 20 2023, 8:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2023, 8:36 AM
Herald added subscribers: wangpc, wdng. · View Herald Transcript

Please don't be frustrated too much. This review mostly aims to keep the discussion and collect opinions regarding the balancing scheduling strategy.

alex-t edited the summary of this revision. (Show Details)Aug 20 2023, 8:45 AM
jrbyrnes added inline comments.Aug 21 2023, 9:40 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1451

I think this is a lower bound on what we have previously called ScheduleLength?

1464

Does this mean we always take the result from MaxOccupancy stage? I wonder if there is a way to have an initial target metric for the MaxxOcc stage?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
470

I'm not sure I understand the purpose of UnclusteredHighRPReschedule stage in the context of a balanced scheduler.

alex-t updated this revision to Diff 552108.EditedAug 21 2023, 12:51 PM

Fixed the case when both Top and Bot current cycles are 0. The schedule length is not really zero but it does not make sense to asses such a short MBBs.

alex-t added inline comments.Aug 21 2023, 12:57 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1451

Exactly. I realized that the ratio of the total stall cycles to the total amount of instructions better reflects the metric than the ratio of total stalls to the modeled length (i.e., the total amount of working cycles + the total amount of stalls).

1464

The PrevMetric is zero only when we're in OccInitialSchedule stage because this stage is the first one. We compare the current stage metric with the previous best known on each next stage. Please note that we only record stage metrics if we don't revert the stage schedule.

jrbyrnes added inline comments.Aug 21 2023, 5:01 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1451

That makes sense -- though, I would think we would still want to capture the latency as well. For example, StallTotal of say 15 when the total instruction latency is 30 means something different than if the total instruction latency was 150 (even if number of instructions is the same) -- the second schedule is able to hide latency better, thus has better ILP.

alex-t added inline comments.Aug 22 2023, 5:02 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1451

Oops... I was wrong here.
The SchedBoundary::bumpNode sets the boundary current cycle considering the scheduled instruction latency. The current cycle is set to the recently scheduled instruction "ready cycle" - the latency is already counted.
Thus, Top.CurrCycle + Bot.CurrCycle gives us the total instruction latency, indeed.

jrbyrnes added inline comments.Aug 22 2023, 4:28 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1451

Yes, bumpNode resolves the required latency for the instruction currently being scheduled, so Top.CurrCycle includes how many stalls are required before issuing the instructions in the top-down portion of schedule -- similar story for Bot.CurrCyle.

However, we don't know how many stalls are required between the instructions in the bottom-up portion of schedule and top-down portion of schedule. If none are required, then the sum of CurrCycle is the ScheduleLength. However, if, for example, there is dependency between last node in TopDown and last node in BottomUp, the stalls required to resolve that latency won't be accounted for in Top.CurrCycle + Bot.CurrCycle.

alex-t added inline comments.Aug 23 2023, 6:43 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1451

if, for example, there is dependency between last node in TopDown and last node in BottomUp, the stalls required to resolve that latency won't be accounted for in Top.CurrCycle + Bot.CurrCycle.

Let's see what will happen in this case. The TopReadyCycle refers to the number of cycles necessary for the instruction result to be available to its user. If the user is the top instruction scheduled bottom-up, this number has already been counted in Top.bumpNode() and Top.CurrCycle has been updated.

jrbyrnes added inline comments.Aug 23 2023, 6:09 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1464

I see -- thanks.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
471

Probably shouldn't allow occupancy drops in ClusteredLowOccupancyReschedule, otherwise we will need to rerun the phase.

alex-t updated this revision to Diff 553553.Aug 25 2023, 11:25 AM

Schedule length is now computed, accounting for the possible gap between the Top and Bottom scheduling bounds.

alex-t marked 2 inline comments as done.Aug 25 2023, 11:48 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
470

As I understood, the UnclusteredHighRPReschedule attempts to achieve better occupancy by removing most mutations, allowing for loose instruction placement.
Since our metric aims for the balance between better occupancy and better ILP, we might accept the result of the UnclusteredHighRPReschedule phase if it was managed to achieve better occupancy w/o the loss in ILP.

471

Okay, it may achieve better ILP, sacrificing the occupancy, and the resulting metric will be the same or even better. Did you mean we should have a separate check that reverts the ClusteredLowOccupancyReschedule if the occupancy has been dropped?

jrbyrnes added inline comments.Aug 25 2023, 12:59 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1468

The FurthestPred won't always have the edge that creates the gap.

1477

TopDistance should be based on successor used in TheHighestInBot calculation.

auto TopToBotDistance = 0;
for (auto &Succ : TheFurthestPred->Succs) {
  if (!Succ.isAssignedRegDep) continue;
  unsigned SuccDistance = M.computeInstrLatency(Succ.getSUnit()->getInstr()) + TheFurthestPred->TopReadyCycle;
  if (SuccDistance <= Top.getReadyCycle()) cotninue;
  unsigned BotReadyCycle = Succ.getSUnit()->BotReadyCycle;
  unsigned BotDistance = Bot.getCurrCycle() - BotReadyCycle
  unsigned Gapsize = TopDistance - BotDistance;;
  if (GapSize > MaxGapSize) {
    TopToBotDistance = GapSize;
  }
}
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
470

The phase is only run for a subset of regions -- DAG.RegionsWithHighRP are in that subset. If we have high RP such that we are in danger of dropping occupancy, we reschedule the region without mutations to attempt to reduce RP and save our occupancy.

The meta heuristic used to determine which regions need rescheduling is based on register pressure and occupancy -- this doesn't seem in accordance with the spirit of a "balanced" scheduler.

471

Did you mean we should have a separate check that reverts the ClusteredLowOccupancyReschedule if the occupancy has been dropped?

Yes something like that -- if ClusteredLowOccupancy drops the occupancy then revert. Phase needs a stable occupancy in order for it to achieve its purpose.

alex-t updated this revision to Diff 554427.Aug 29 2023, 10:56 AM
alex-t marked 2 inline comments as done.

In this version, I decided to abandon the iterative metric computation during the scheduling
because it is impossible to predict the final total latency (schedule length) due to the bidirectional scheduling.
The full DAG traversal is needed when the scheduling is done.

alex-t added inline comments.Aug 29 2023, 11:26 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
470

The problem is that we report we are running out of registers right away when the occupancy is going under 4. Just because the VGPR limit is 32. That is exactly why we initially started to look for the heuristic: we have got a regression because of bumping the occupancy from 3 to 4 but introducing the huge latency.
Once again, nobody proved that this "magic" occupancy 4 is the optimal one for all the cases.
The current "naive" objective function:

new_occupancy/old_occupancy * old_metric/new_metric

is just a starting point. It already takes into account the change in occupancy versus change in ILP. The goal is to improve it to reflect our needs.

471

I would prefer to change the heuristic in such a way that dropping occupancy leads to the low metric and the schedule is reverted automatically.

alex-t updated this revision to Diff 556666.Sep 13 2023, 8:07 AM

Changed the criteria for reverting current stage schedule and for the HighOccupancyRPStage running on region

jrbyrnes added inline comments.Sep 13 2023, 5:11 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
993–994

This means we will always prioritize not letting occupancy drop, perhaps at the expensive of throwing away a good ILP schedule.

For this phase specifically, I think we should only do this check if we are using the MaxOccupancy strategy -- or perhaps, implement computeScheduleMetric for the MaxOccupancy strategy and use it in place of GCNSchedStage::shouldRevertScheduling

1007

Seems to me this phase should just not be run if !isRegionWithExcessRP()

alex-t marked 2 inline comments as done.Sep 14 2023, 1:33 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
993–994

I am pretty sure this check is useless in this particular place and could be removed.

1007

Some regions were excess RP and the UnclasteredHighRP stage was called to try making them better. They might no longer have excess RP in case it succeeded. Unfortunately, decreasing the RP they lost ILP and they might have worse metrics. Thus they would have been reverted. This check is here to avoid this.

alex-t updated this revision to Diff 556810.Sep 14 2023, 1:57 PM
alex-t marked 2 inline comments as done.

Unused and unnecessary code cleanup