Requested SchedGroup pipelines may be non-trivial to satisfy. A minimal example: the requested pipeline is {2 VMEM, 2 VALU, 2 VMEM} and the original order of SUnits is {VMEM, VALU, VMEM, VALU, VMEM}. Because of existing dependencies, the choice of which SchedGroup the middle VMEM goes into affects how closely we are able to match the requested pipeline. Minimizing the degree of misfit (as measured by the number of edges we can't add) with respect to the instruction -> SchedGroup mapping seems to be an NP-hard problem. This patch implements the PipelineSolver class, which produces a solution to this problem for the sched_group_barrier mutation. The solver has both an exponential-time exact algorithm and a greedy algorithm. The patch includes controls which allow the user to select between the greedy and exact algorithms.
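The greedy side of the idea can be sketched with a toy model. This is not the patch's implementation: it ignores DAG dependencies entirely and simply places each SUnit into the first pipeline SchedGroup (in requested order) that accepts its instruction class and still has room. `SUClass`, `SchedGroup`, and `greedyAssign` are all illustrative names, not the patch's actual types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified model: each SchedGroup accepts one instruction
// class (a bitmask) and has a fixed capacity; each "SUnit" is just a class.
enum SUClass : uint32_t { VMEM = 1, VALU = 2 };

struct SchedGroup {
  uint32_t Mask;  // which instruction classes this group accepts
  int Capacity;   // how many SUnits it may hold
  int Used = 0;
};

// Greedy pass: walk SUnits in order and place each one into the first
// group (in pipeline order) that still has room and accepts its class.
// Returns the number of SUnits that could not be placed at all.
int greedyAssign(const std::vector<uint32_t> &SUs,
                 std::vector<SchedGroup> &Pipeline) {
  int Unplaced = 0;
  for (uint32_t SU : SUs) {
    bool Placed = false;
    for (SchedGroup &SG : Pipeline) {
      if ((SG.Mask & SU) && SG.Used < SG.Capacity) {
        ++SG.Used;
        Placed = true;
        break;
      }
    }
    if (!Placed)
      ++Unplaced;
  }
  return Unplaced;
}
```

On the example above (pipeline {2 VMEM, 2 VALU, 2 VMEM}, order {VMEM, VALU, VMEM, VALU, VMEM}) this toy pass places every SUnit; the real problem only becomes hard once DAG edges constrain which placements are legal, which is why the actual greedy solver scores candidates by missed edges.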
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks! I like the idea behind the greedy solver. Not sure about SchedGroupSU. Maybe just a map between SUs and lists of schedgroups? I think trying to track sched_group_barriers by their order and assigning that an index is a bit confusing.
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

- Line 77: Should we have the greedy solver be the default?
- Lines 108–109: Remove commented code.
- Line 179: Why have a default constructor?
- Line 193: The only requirement I had was that the SCHED_GROUP_BARRIER ends up at the end of the SG. No need to respect NodeNum if we can get a better pipeline otherwise.
- Line 217: Use SmallVectorImpl for parameter types.
- Line 264: Can't SyncID be negative?
- Line 462: Do we need a cl option for the greedy solver if that is what we do by default?
- Line 761: Why change this from a map? Can we really index by SyncID?
- Line 863: Can continue earlier if A == B.
- Line 868: `if (A == B || DAG->IsReachable(B, A)) continue;` Doesn't tryAddEdge already check for all of these conditions anyway?
- Line 1105: What is BarrierID, the same as SyncID?
- Line 1117: I think we can just search the entire block. SCHED_GROUP_BARRIERs that appear later will get higher priority since we initialize them bottom up.
- Line 1125: Could use a set/map.
- Line 1130: Is SGSU just caching whether an SU can be inserted into a SchedGroup?

llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir

- Line 95: This seems like a regression. SCHED_GROUP_BARRIER should be at the end of its group.
Austin, thanks for looking at this and for the comments.
I can certainly see how the sched_group_barrier indexing is confusing -- I needed a bidirectional iterator to support the backtracking nature of the exact solver. My solution was to use an array, and I needed a way to make sure I still supported the user-provided SyncID -- thus the mapping between SyncIDs and array indices. However, transforming the data structure to support bidirectional iteration seems like a problem that should be resolved within the PipelineSolver.
I think if I apply this design principle, I can redo the SchedGroupSUs and clean up the collection phase in general.
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

- Line 77: Sure -- makes sense.
- Line 264: SyncID (i.e. *SyncGroupIdx) here is the index into the pipeline array -- it should never be negative.
- Line 761: I didn't see any bidirectional iterator for DenseMap -- which I needed for the exact solver. Thus, I converted to an array. To make sure we are still synchronizing a pipeline in the correct way, I created a map BarrierIDToPipelineID which maps the user-provided IDs to a legal array index.
- Line 868: Yeah, I'll clean this up a bit. Thanks.
- Line 1105: BarrierID is SyncID in your version (the user-provided synchronization ID). This is different from the PipelineSolver's *SyncGroupIdx, which is an array index. I'll rework the naming to make things more consistent and less confusing.
- Line 1117: I'll take another look at this collection phase to see if I can shave off any iterations.
- Line 1130: Yeah, a map between SU -> candidate SchedGroups is a fundamental piece of the exact algorithm.
Address Review Comments.
Refactor to decouple SchedBarrierDAGMutation and PipelineSolver a bit -- convert the data structures in PipelineSolver rather than having SchedBarrierDAGMutation construct data structures in a way that is amenable to PipelineSolver parsing.
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

- Line 763: typo: create
- Lines 763–767: I find it difficult to tell at a glance what relationships these maps are defining. Maybe adding more description or breaking it up with a typedef would help.
- Line 868: Will missed edges ever be non-zero? tryAddEdge will return false if B is a successor of A, and that is already checked above. What is the intended meaning of a missed edge here?
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

- Line 868: A missed edge is simply when we need the DAG to have an edge to conform to the pipeline, but the DAG doesn't support it. The missed edge is the unit used in calculating cost -- the main idea is quantifying how well we are able to mutate the DAG. tryAddEdge will return false if B is already a successor of A, but that does not represent a failed opportunity, because the DAG already has the edge we want. Since the DAG conforms to our pipeline for these two nodes, we don't add any cost -- that is why we check for this condition early. In my tests, I have frequently seen 0 missed edges / cost. This occurs whenever we are able to add an edge between the SU and all of the SUs in the current SG.
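A minimal sketch of the missed-edge cost described here, assuming a DAG reachability predicate: an edge SU -> Other that the DAG already implies costs nothing, while one that cannot be added (because it would create a cycle) counts as one missed edge. `missedEdges` and the predicate signature are illustrative, not the patch's actual API.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// For a candidate SU and the SUs already in a SchedGroup, we want an edge
// SU -> Other for each member. IsReachable(A, B) means B is reachable
// from A in the DAG.
int missedEdges(int SU, const std::vector<int> &GroupSUs,
                const std::function<bool(int, int)> &IsReachable) {
  int Missed = 0;
  for (int Other : GroupSUs) {
    if (SU == Other || IsReachable(SU, Other))
      continue; // edge already implied by the DAG: no cost
    if (IsReachable(Other, SU))
      ++Missed; // adding SU -> Other would form a cycle: missed edge
  }
  return Missed;
}
```

A zero result means every SU in the group could be (or already is) ordered after the candidate, matching the "0 missed edges / cost" outcome described above.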
Address Review comments.
Bring in the precommit tests.
Use the PipelineSolver from IGroupLP mutation (not just sched_group_barrier mutation).
Add IGLP tests -- currently, only supports one hardcoded pipeline.
Add some features to help the performance / usability of the exact PipelineSolver, including:

- Timeout feature (& corresponding CLI option)
- Guiding heuristic, choosing the fit with fewest missed edges first (& corresponding CLI option)
- Running the greedy algorithm before the exact one to improve pruning (useful if not using the cost heuristic)
This is still somewhat a WIP while experimentation is ongoing. Pushing to make visible the new features.
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

- Line 372: We cannot do this. The compiler should be deterministic. Maybe time out based on a number of iterations or something.
Replace timeout feature with deterministic max branches explored feature, and properly handle early termination condition.
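The deterministic cutoff can be illustrated with a toy search that counts branches against a fixed budget instead of consulting a wall clock, so every compile explores exactly the same subtree. `BudgetedSearch` and `explore` are hypothetical names, not the patch's actual interface.

```cpp
#include <cassert>

// Explore a binary decision tree, charging one unit of budget per node
// visited; once the budget is exhausted, subtrees terminate early.
// The same inputs always visit the same nodes -- no clock involved.
struct BudgetedSearch {
  unsigned MaxBranches;
  unsigned Branches = 0;

  // Returns how many leaves were actually reached before cutoff.
  unsigned explore(unsigned Depth) {
    if (Branches >= MaxBranches)
      return 0; // budget exhausted: give up on this subtree
    ++Branches;
    if (Depth == 0)
      return 1;
    return explore(Depth - 1) + explore(Depth - 1);
  }
};
```

With a large enough budget the search is exhaustive; with a small one it degrades gracefully and reproducibly, which is what the review comment asked for in place of a timeout.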
Not sure how much effort you are willing to put into the exact algorithm. But maybe you can improve the performance by adding some lower bounds on future costs to improve pruning?
More specifically, when evaluating a partial solution S with cost C, and comparing it against the currently best solution BS with cost BC, you currently prune if C > BC.
I'm proposing to compute some lower bound LC on the cost of the decisions that have not yet been made (e.g. adding the costs of cheapest assignments), and pruning if C + LC > BC.
I did not review the algorithm in enough detail to make a more concrete proposal, and it could very well be that computing a non-trivial lower bound (i.e. one that is not zero) is difficult.
However, in similar problems the technique above often helps a lot.
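Jannik's suggestion is classic branch-and-bound pruning. A toy sketch under made-up costs: `Cost[i][g]` stands in for the missed-edge cost of assigning SU `i` to group `g`, and the lower bound LC for the unassigned suffix is the sum of each remaining SU's cheapest assignment, exactly the "cheapest assignments" bound proposed above. All names here are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Lower bound LC on the cost of all decisions from index From onward:
// sum of each remaining SU's cheapest possible group assignment.
int lowerBound(const Matrix &Cost, size_t From) {
  int LB = 0;
  for (size_t I = From; I < Cost.size(); ++I)
    LB += *std::min_element(Cost[I].begin(), Cost[I].end());
  return LB;
}

// Enumerate assignments depth-first, pruning when the partial cost C
// plus the lower bound LC cannot beat the best complete cost BC.
void solve(const Matrix &Cost, size_t I, int Partial, int &Best) {
  if (Partial + lowerBound(Cost, I) >= Best)
    return; // C + LC >= BC: this subtree cannot improve on the incumbent
  if (I == Cost.size()) {
    Best = Partial; // complete assignment, strictly better than Best
    return;
  }
  for (int C : Cost[I])
    solve(Cost, I + 1, Partial + C, Best);
}
```

The bound is admissible (it never overestimates the remaining cost), so pruning with it cannot discard an optimal solution; how much it helps depends on how tight the per-SU minima are for the real missed-edge costs.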
Hi Jannik. Thanks for the tip / suggestion! Unfortunately, we are not able to make significant changes to this review currently due to delivery timeline. However, the lower bound feature is certainly something I'll look into should the need / opportunity for optimization arise.