Page MenuHomePhabricator

[AMDGPU] Implement pipeline solver for non-trivial pipelines
ClosedPublic

Authored by jrbyrnes on Jul 29 2022, 1:07 PM.

Details

Summary

Requested SchedGroup pipelines may be non-trivial to satisify. A minimimal example is if the requested pipeline is {2 VMEM, 2 VALU, 2 VMEM} and the original order of SUnits is {VMEM, VALU, VMEM, VALU, VMEM}. Because of existing dependencies, the choice of which SchedGroup the middle VMEM goes into impacts how closely we are able to match the requested pipeline. It seems minimizing the degree of misfit (as measured by the number of edges we can't add) w.r.t the choice we make when mapping an instruction -> SchedGroup is an NP problem. This patch implements the PipelineSolver class which produces a solution for the defined problem for the sched_group_barrier mutation. The solver has both an exponential time exact algorithm and a greedy algorithm. The patch includes some controls which allows the user to select the greedy/exact algorithm.

Diff Detail

Event Timeline

jrbyrnes created this revision.Jul 29 2022, 1:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 29 2022, 1:07 PM
jrbyrnes requested review of this revision.Jul 29 2022, 1:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 29 2022, 1:07 PM

Thanks! I like the idea behind the greedy solver. Not sure about SchedGroupSU. Maybe just a map between SUs and lists of schedgroups? I think trying to track sched_group_barriers by their order and assigning that an index is a bit confusing.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
77

Should we have the greedy solver be the default?

108–109

Remove commented code.

179

Why have a default constructor?

193

The only requirement I had was that the SCHED_GROUP_BARRIER ends up at the end of the SG. No need to respect NodeNum if we can get a better pipeline otherwise.

217

Use SmallVectorImpl for parameter types.

264

Can't SyncID be negative?

462

Do we need a cl option for the greedy solver if that is what we do by default?

761

Why change this from a map. Can we really index by SyncID?

863

Can continue earlier if A == B.

868

if (A == B || DAG->IsReachable(B, A))

continue;

Doesn't tryAddEdge already check for all of these conditions anyway?

1105

What is BarrierID, the same as SyncID?

1117

I think we can just search the entire block. SCHED_GROUP_BARRIERS that appear later will get higher priority since we initialize them bottom up.

1125

Could use a set/map.

1130

Is SGSU just caching whether an SU can be inserted into a SchedGroup?

llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir
95

This seems like a regression. SCHED_GROUP_BARRIER should be at the end of its group.

Thanks! I like the idea behind the greedy solver. Not sure about SchedGroupSU. Maybe just a map between SUs and lists of schedgroups? I think trying to track sched_group_barriers by their order and assigning that an index is a bit confusing.

Austin thanks for looking at this and for the comments.

I can certainly see how the sched_group_barrier indexing is confusing -- I needed a bidirectional iterator in order to support the backtracking nature of the exact solver. My solution was to use an array and I needed a way to make sure I still supported the user inputted SyncID -- thus the mapping between SyncIDs and array indices. However, transforming the data structure to support bidirection seems to be a problem that should be resolved in the PipelineSolver.

I think if I apply this design principle, I can redo the SchedGroupSUs and clean up the collection phase in general.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
77

Sure -- makes sense

264

SyncID (i.e. *SyncGroupIdx) here is the index into the pipeline array -- it should never be negative

761

I didn't see any bidirectional iterator for DenseMap -- which I needed for the exact solver. Thus, I converted to an array.

To make sure the we are still synchronizing a pipeline in the correct way, I created a map BarrierIDToPipelineID which maps the user inputted IDs to a legal array index.

868

Yeah, I'll clean this up a bit. Thanks.

1105

BarrierID is SyncID in your version (the user inputted syncrhonization ID).

This is different from the PipelineSolver's *SyncGroupIdx which is an array index. I'll rework naming to make things more consistent and less confusing.

1117

I'll take another look at this collection phase to see if I can shave off any iterations

1117

I'll take another look at this collection phase to see if I can shave off any iterations

1130

Yeah, a map between SU -> Candidate SchedGroups is a fundamental piece of the exact algorithm

jrbyrnes updated this revision to Diff 449439.EditedAug 2 2022, 2:21 PM
jrbyrnes marked 15 inline comments as done.

Address Review Comments.

Refactor to decouple SchedBarrierDAGMutation and PipelineSolver a bit -- convert the data structures in PipelineSolver rather than having SchedBarrierDAGMutation construct data structures in a way that is amenable to PipelineSolver parsing.

kerbowa added inline comments.Aug 4 2022, 11:47 PM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
763

typo: create

763–767

I find it difficult to tell at a glance what the relationships these maps are defining. Maybe adding more description or breaking it up with a typedef would help.

868

Will missed edges ever be non-zero? Try add edge will return false if B is a successor of A and that is already checked above. What is the intended meaning of a missed edge here?

jrbyrnes added inline comments.Aug 5 2022, 8:35 AM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
868

A missed edge is simply when we need the DAG to have an edge to conform to the pipeline, but the DAG doesn't support it. Missed edge is the unit used in calculating cost. -- the main idea is quantifying how well we are able to mutate the DAG.

TryAddEdge will return false if B is already a successor of A but that does not represent a failed opportunity because the DAG already has the edge we want. Since the DAG conforms to our pipeline for these two nodes, we don't add any cost -- that is why we early check for this condition.

In my tests, I have frequently seen 0 missed-edges / cost. This occurs whenever we are able to add an edge between the SU and all of the SUs in the current SG.

jrbyrnes updated this revision to Diff 450403.Aug 5 2022, 2:15 PM
jrbyrnes marked 2 inline comments as done.

Address Review comments.

Bring in the precommit tests.

Use the PipelineSolver from IGroupLP mutation (not just sched_group_barrier mutation).

Add IGLP tests -- currently, only supports one hardcoded pipeline.

jrbyrnes updated this revision to Diff 451255.Aug 9 2022, 1:30 PM

Add some features to help performance / usability of exact PipelineSolver, including:

Timeout feature (& corresponding CLI option)
Guiding heuristic, choosing the fit with fewest missed edges first (& corresponding CLI option)
Run the greedy algorithm before exact to improve pruning (useful if not using cost heuristic)

This is still somewhat a WIP while experimentation is ongoing. Pushing to make visible the new features.

kerbowa added inline comments.Aug 9 2022, 6:27 PM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
372

We cannot do this. The compiler should be deterministic. Maybe timeout with the number of iterations of something.

jrbyrnes updated this revision to Diff 451481.Aug 10 2022, 8:19 AM

Replace timeout feature with deterministic max branches explored feature, and properly handle early termination condition.

Not sure how much effort you are willing to put into the exact algorithm. But maybe you can improve the performance by adding some lower bounds on future costs to improve pruning?

More specifically, when evaluating a partial solution S with cost C, and comparing it against the currently best solution BS with cost BC, you currently prune if C > BC.
I'm proposing to compute some lower bound LC on the cost of the decisions that have not yet been made (e.g. adding the costs of cheapest assignments), and pruning if C + LC > BC.

I did not review the algorithm in enough detail to make a more concrete proposal, and it could very well be that computing a non-trivial lower bound (i.e. one that is not zero) is difficult.
However, in similar problems the technique above often helps a lot.

Not sure how much effort you are willing to put into the exact algorithm. But maybe you can improve the performance by adding some lower bounds on future costs to improve pruning?

More specifically, when evaluating a partial solution S with cost C, and comparing it against the currently best solution BS with cost BC, you currently prune if C > BC.
I'm proposing to compute some lower bound LC on the cost of the decisions that have not yet been made (e.g. adding the costs of cheapest assignments), and pruning if C + LC > BC.

I did not review the algorithm in enough detail to make a more concrete proposal, and it could very well be that computing a non-trivial lower bound (i.e. one that is not zero) is difficult.
However, in similar problems the technique above often helps a lot.

Hi Jannik. Thanks for the tip / suggestion! Unfortunately, we are not able to make significant changes to this review currently due to delivery timeline. However, the lower bound feature is certainly something I'll look into should the need / opportunity for optimization arise.

kerbowa accepted this revision.Aug 17 2022, 1:40 PM

LGTM

This revision is now accepted and ready to land.Aug 17 2022, 1:40 PM
This revision was landed with ongoing or failed builds.Aug 17 2022, 4:22 PM
This revision was automatically updated to reflect the committed changes.