This change implements a scheduling strategy that looks for a reasonable trade-off between ILP and occupancy.
For that purpose, it computes a heuristic metric to decide whether the current schedule is worth keeping.
This is an attempt to use the idea from https://reviews.llvm.org/D139710 to replace the shouldRevertScheduling function. This approach avoids extra compile-time cost by computing the metric iteratively during scheduling. Unlike https://reviews.llvm.org/D139710, the heuristic is applied to all scheduling stages.
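As a rough illustration of the balancing idea, here is a minimal sketch in the spirit of the patch. All names here (ScheduleMetric, shouldKeepSchedule) are hypothetical, not the patch's actual code; the metric is lower-is-better, counting average stall cycles per instruction in scaled integer arithmetic:

```cpp
#include <cassert>

// Hypothetical lower-is-better schedule metric: average stall cycles per
// instruction, scaled by 100 to stay in integer arithmetic.
struct ScheduleMetric {
  unsigned StallCycles; // total stall cycles in the region
  unsigned NumInstrs;   // total instructions in the region
  unsigned value() const {
    return NumInstrs ? (StallCycles * 100) / NumInstrs : 0;
  }
};

// Keep the new schedule when NewOcc/OldOcc * OldMetric/NewMetric >= 1,
// i.e. the occupancy gain outweighs the ILP (stall) loss. Rearranged to
// cross-multiplication to avoid integer division.
bool shouldKeepSchedule(unsigned OldOcc, unsigned NewOcc,
                        const ScheduleMetric &Old, const ScheduleMetric &New) {
  if (New.value() == 0)
    return true; // a schedule with no stalls at all can only be an improvement
  return NewOcc * Old.value() >= OldOcc * New.value();
}
```

Under this shape, a one-step occupancy bump is rejected when it worsens the stall ratio by more than the same proportion.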
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
| Time | Test |
|---|---|
| 350 ms | Linux x64 > LLVM.CodeGen/AMDGPU::gfx-callable-return-types.ll (Exit Code: 1) |
| 690 ms | Windows x64 > LLVM.CodeGen/AMDGPU::gfx-callable-return-types.ll (Exit Code: 1) |
Event Timeline
Please don't be too frustrated. This review mostly aims to start a discussion and collect opinions regarding the balancing scheduling strategy.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1451 | I think this is a lower bound on what we have previously called ScheduleLength? | |
1464 | Does this mean we always take the result from the MaxOccupancy stage? I wonder if there is a way to have an initial target metric for the MaxOccupancy stage? | |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h | ||
470 | I'm not sure I understand the purpose of UnclusteredHighRPReschedule stage in the context of a balanced scheduler. |
Fixed the case when both Top and Bot current cycles are 0. The schedule length is not really zero, but it does not make sense to assess such short MBBs.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1451 | Exactly. I realized that the ratio of total stall cycles to the total number of instructions reflects the metric better than the ratio of total stalls to the modeled length (i.e., the total number of working cycles plus the total number of stalls). | |
1464 | The PrevMetric is zero only when we're in the OccInitialSchedule stage, because that stage is the first one. On each subsequent stage, we compare the current stage metric with the previous best known one. Please note that we only record stage metrics if we don't revert the stage schedule. |
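The bookkeeping described in that reply could look roughly like the following sketch (hypothetical MetricTracker type, not the patch's code): the first stage has no previous metric, later stages compare against the best recorded one, and a stage's metric is recorded only when its schedule is kept. Lower metric is better here:

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper: tracks the best (lowest) metric across stages.
struct MetricTracker {
  std::optional<unsigned> BestMetric; // empty during the first stage

  // Revert only when a previous metric exists and the new one is worse.
  bool shouldRevert(unsigned StageMetric) const {
    return BestMetric && StageMetric > *BestMetric;
  }

  // Called only for stages whose schedule was kept.
  void recordKept(unsigned StageMetric) {
    if (!BestMetric || StageMetric < *BestMetric)
      BestMetric = StageMetric;
  }
};
```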
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1451 | That makes sense -- though, I would think we would still want to capture the latency as well. For example, StallTotal of say 15 when the total instruction latency is 30 means something different than if the total instruction latency was 150 (even if number of instructions is the same) -- the second schedule is able to hide latency better, thus has better ILP. |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1451 | Oops... I was wrong here. |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1451 | Yes, bumpNode resolves the required latency for the instruction currently being scheduled, so Top.CurrCycle includes however many stalls are required before issuing the instructions in the top-down portion of the schedule -- a similar story for Bot.CurrCycle. However, we don't know how many stalls are required between the instructions in the bottom-up portion of the schedule and the top-down portion. If none are required, then the sum of the CurrCycles is the ScheduleLength. However, if, for example, there is a dependency between the last node in TopDown and the last node in BottomUp, the stalls required to resolve that latency won't be accounted for in Top.CurrCycle + Bot.CurrCycle. |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1451 |
Let's see what will happen in this case. The TopReadyCycle refers to the number of cycles necessary for the instruction result to be available to its user. If the user is the top instruction scheduled bottom-up, this number has already been counted in Top.bumpNode() and Top.CurrCycle has been updated. |
Schedule length is now computed, accounting for the possible gap between the Top and Bottom scheduling bounds.
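The gap-aware length computation might be modeled like this. The structs and the absolute-cycle bookkeeping below are a deliberately simplified assumption, not the SchedBoundary API: each cross-boundary latency edge runs from a node in the top-down half (ready PredReadyCycle cycles from the region start) to a node in the bottom-up half (issued BotReadyCycle cycles before the region end):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of a latency edge crossing the Top/Bot boundary.
struct CrossEdge {
  unsigned PredReadyCycle; // pred's TopReadyCycle, from the region start
  unsigned Latency;        // latency of the edge
  unsigned BotReadyCycle;  // succ's distance from the region end
};

// Top.CurrCycle + Bot.CurrCycle underestimates the length when a
// cross-boundary edge needs extra stall cycles; extend by the largest gap.
unsigned scheduleLength(unsigned TopCurrCycle, unsigned BotCurrCycle,
                        const std::vector<CrossEdge> &Edges) {
  unsigned Gap = 0;
  for (const CrossEdge &E : Edges) {
    // Cycle (from the region start) at which the successor's operand is ready.
    unsigned ReadyAt = E.PredReadyCycle + E.Latency;
    // Cycle at which the successor would issue if the halves were abutted.
    unsigned IssueAt = TopCurrCycle + (BotCurrCycle - E.BotReadyCycle);
    if (ReadyAt > IssueAt)
      Gap = std::max(Gap, ReadyAt - IssueAt);
  }
  return TopCurrCycle + BotCurrCycle + Gap;
}
```

With no crossing edges (or edges already resolved inside the top-down half), this degenerates to the plain Top.CurrCycle + Bot.CurrCycle sum.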
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h | ||
---|---|---|
470 | As I understand it, UnclusteredHighRPReschedule attempts to achieve better occupancy by removing most mutations, allowing looser instruction placement. | |
471 | Okay, it may achieve better ILP, sacrificing the occupancy, and the resulting metric will be the same or even better. Did you mean we should have a separate check that reverts the ClusteredLowOccupancyReschedule if the occupancy has been dropped? |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
1468 | The FurthestPred won't always have the edge that creates the gap. | |
1477 | TopDistance should be based on the successor used in the TheHighestInBot calculation:

    unsigned TopToBotDistance = 0;
    for (auto &Succ : TheFurthestPred->Succs) {
      if (!Succ.isAssignedRegDep())
        continue;
      unsigned SuccDistance =
          M.computeInstrLatency(Succ.getSUnit()->getInstr()) +
          TheFurthestPred->TopReadyCycle;
      if (SuccDistance <= Top.getCurrCycle())
        continue;
      unsigned BotReadyCycle = Succ.getSUnit()->BotReadyCycle;
      unsigned BotDistance = Bot.getCurrCycle() - BotReadyCycle;
      unsigned GapSize = SuccDistance - BotDistance;
      if (GapSize > TopToBotDistance)
        TopToBotDistance = GapSize;
    }

| |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h | ||
470 | The phase is only run for a subset of regions -- DAG.RegionsWithHighRP are in that subset. If we have high RP such that we are in danger of dropping occupancy, we reschedule the region without mutations to attempt to reduce RP and save our occupancy. The meta heuristic used to determine which regions need rescheduling is based on register pressure and occupancy -- this doesn't seem in accordance with the spirit of a "balanced" scheduler. | |
471 | Yes, something like that -- if ClusteredLowOccupancy drops the occupancy, then revert. The phase needs stable occupancy in order to achieve its purpose. |
In this version, I decided to abandon the iterative metric computation during scheduling,
because it is impossible to predict the final total latency (schedule length) under bidirectional scheduling.
A full DAG traversal is needed once scheduling is done.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.h | ||
---|---|---|
470 | The problem is that we report running out of registers as soon as the occupancy drops below 4, simply because the VGPR limit is 32. That is exactly why we initially started looking for this heuristic: we got a regression from bumping the occupancy from 3 to 4 while introducing huge latency. new_occupancy/old_occupancy * old_metric/new_metric is just a starting point; it already accounts for the change in occupancy versus the change in ILP. The goal is to refine it to reflect our needs. | |
471 | I would prefer to change the heuristic in such a way that dropping occupancy leads to a low metric and the schedule is reverted automatically. |
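One hypothetical way to get that automatic behavior (purely illustrative, not the patch's code) is to fold occupancy into a single higher-is-better score, so that an occupancy drop lowers the score with no separate revert check:

```cpp
#include <cassert>

// Hypothetical combined score: reward occupancy, penalize stalls per
// instruction. Higher is better; integer-only arithmetic.
unsigned balancedScore(unsigned Occupancy, unsigned StallCycles,
                       unsigned NumInstrs) {
  // Scaled stalls-per-instruction penalty (100 == one stall per instruction).
  unsigned StallPenalty = NumInstrs ? (StallCycles * 100) / NumInstrs : 0;
  return Occupancy * 1000 / (100 + StallPenalty);
}
```

A stage schedule would then be reverted whenever its score falls below the previous best, which covers both the occupancy-drop case and the ILP-loss case in one comparison.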
Changed the criteria for reverting the current stage schedule and for running the HighOccupancyRPStage on a region.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
993–994 | This means we will always prioritize not letting occupancy drop, perhaps at the expense of throwing away a good ILP schedule. For this phase specifically, I think we should only do this check if we are using the MaxOccupancy strategy -- or perhaps, implement computeScheduleMetric for the MaxOccupancy strategy and use it in place of GCNSchedStage::shouldRevertScheduling | |
1007 | Seems to me this phase should just not be run if !isRegionWithExcessRP() |
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp | ||
---|---|---|
993–994 | I am pretty sure this check is useless in this particular place and could be removed. | |
1007 | Some regions had excess RP, and the UnclusteredHighRP stage was called to try to make them better. They might no longer have excess RP if it succeeded. Unfortunately, in decreasing RP they lost ILP and might have worse metrics, so they would have been reverted. This check is here to avoid that. |