This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow for MFMA Inst Clustering
ClosedPublic

Authored by jrbyrnes on Apr 29 2022, 8:35 AM.

Details

Summary

This patch adds cluster edges between independent MFMA instructions. Additionally, it propogates all predecessors of cluster insts to the root of the cluster(s), and all successors to the leaf(ves) of the cluster(s) -- this is done to remove the possibility that those insts will be interspersed within the cluster.

Diff Detail

Event Timeline

jrbyrnes created this revision.Apr 29 2022, 8:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 8:35 AM
jrbyrnes requested review of this revision.Apr 29 2022, 8:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 8:35 AM
foad added inline comments.Apr 29 2022, 9:19 AM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
926 ↗(On Diff #426066)

Nit: please use full capitalized punctuated English sentences for all your comments: https://llvm.org/docs/CodingStandards.html#commenting

943 ↗(On Diff #426066)

Typo "multive".

arsenm added inline comments.Apr 29 2022, 9:21 AM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗(On Diff #426066)

What about copies before they are lowered to accvgpr_write?

rampitec added inline comments.Apr 29 2022, 9:59 AM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗(On Diff #426066)

They are not MAI. This ideom used to filter MFMA from 2 other MAI encoded instructions.

jrbyrnes added inline comments.Apr 29 2022, 10:05 AM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗(On Diff #426066)

Thanks for your comments. I looked into it, and these types of copies are not flagged as MAI to begin with -- the check to exclude accvgpr_write is thus irrelevant.

Would be nice to have some tests that show the results of the clustering as well.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
846 ↗(On Diff #426066)

Should this be moved to a new file I.e. AMDGPUMacroFusion and AMDGPUExportClustering?

856 ↗(On Diff #426066)

I guess isMAI above would handle that?

882–886 ↗(On Diff #426066)

Could these two loops be combined?

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
2

Add # REQUIRES: asserts, to the top of this test.

One thing to mention: transferring all successors and predecessors technically does not guarantee that an independent instruction will not be scheduled in between. I see that you are adding SDep::Cluster, but I remember it didn't use to work in the very similar scenario before. Not sure it is still so. It can happen that in a bigger program the cluster may be broken.

Also we probably need to consider dropping MFMA clustering along with load clustering during GCNScheduleDAGMILive::UnclusteredReschedule stage.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
898 ↗(On Diff #426066)

What if one feeds another and you swap them?

900 ↗(On Diff #426066)

I believe you need to check the result of addEgde.

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
4

You need '; REQUIRES: asserts' if you want to inspect debug output.

34

You also need to check resulting MIR.

kerbowa added inline comments.Apr 29 2022, 10:33 AM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗(On Diff #426066)

I'm not sure, I think it will be flagged. We use this idiom enough that we really should just add an isMFMA function to siinstrinfo.

898 ↗(On Diff #426066)

This is only looking at independent MFMA.

rampitec added inline comments.Apr 29 2022, 10:54 AM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗(On Diff #426066)

These 2 are also isMAI, just based on encoding. It may make sense to create a helper, agree.

898 ↗(On Diff #426066)

Ah, right. Interesting if we may want to cluster dependent MFMA too or not.

One thing to mention: transferring all successors and predecessors technically does not guarantee that an independent instruction will not be scheduled in between. I see that you are adding SDep::Cluster, but I remember it didn't use to work in the very similar scenario before. Not sure it is still so. It can happen that in a bigger program the cluster may be broken.

Also we probably need to consider dropping MFMA clustering along with load clustering during GCNScheduleDAGMILive::UnclusteredReschedule stage.

Hey Stas -- thanks for your comments.

Regarding your note -- yes this is something I spent some time thinking about – Sdep::Cluster doesn’t gaurantee a single cluster. In fact, I believe there is a hardware dependency between MFMA’s, so the scheduler will try to fill this gap with an independent instruction.

It seems that if we want to offer unbroken clusters, we will want to offer a scheduler with modified heuristic priorities to prioritize cluster edges. We can also address the issue of UnclusteredReschedule pass in this scheduler. Worth mentioning is that in such a scheme, a cluster can still be broken if the preds / succs of instructions are not handled, which is why I’ve handled them here.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
898 ↗(On Diff #426066)

As the code currently is, an instruction that has dependency with any instruction in Cluster A will be used in Cluster B, and so on. We may be able to cluster instructions with dependencies, but this would create complexities. I thought it best to get the simple version for review first.

Regarding your note -- yes this is something I spent some time thinking about – Sdep::Cluster doesn’t gaurantee a single cluster. In fact, I believe there is a hardware dependency between MFMA’s, so the scheduler will try to fill this gap with an independent instruction.

This is conflicting thing, we need to make sure it does not succeed to fill the gap. Probably it needs some tweaking in FillMFMAShadowMutation and GCNHazardRecognizer::ShouldPreferAnother if this option is set. In any way you need some more tests with different clustering/non-clustering scenarios and check the final code, do we get resulting clusters? Especially given that post-RA scheduler will try to torn them.

It seems that if we want to offer unbroken clusters, we will want to offer a scheduler with modified heuristic priorities to prioritize cluster edges. We can also address the issue of UnclusteredReschedule pass in this scheduler. Worth mentioning is that in such a scheme, a cluster can still be broken if the preds / succs of instructions are not handled, which is why I’ve handled them here.

Fair enough. UnclusteredReschedule is also todo for a future patch.

Regarding your note -- yes this is something I spent some time thinking about – Sdep::Cluster doesn’t gaurantee a single cluster. In fact, I believe there is a hardware dependency between MFMA’s, so the scheduler will try to fill this gap with an independent instruction.

This is conflicting thing, we need to make sure it does not succeed to fill the gap. Probably it needs some tweaking in FillMFMAShadowMutation and GCNHazardRecognizer::ShouldPreferAnother if this option is set. In any way you need some more tests with different clustering/non-clustering scenarios and check the final code, do we get resulting clusters? Especially given that post-RA scheduler will try to torn them.

It may be that we want the cluster edges to be a suggestion rather than a hard limit, the cluster edges already work this way but the priority for them is low so it usually doesn't matter.

I believe GCNHazardRecognizer::ShouldPreferAnother will not be used now that we use post-RA machine scheduler.

jrbyrnes updated this revision to Diff 426552.May 2 2022, 6:21 PM

Updating D124678: [AMDGPU] Allow for MFMA Inst Clustering

Refactored code to address several review comments:

  1. Pulled MFMA Clustering into its own files
  2. Pulled dependency propagation into its own function to facilitate potential "weak" clustering feature

Added MIR tests

jrbyrnes marked 3 inline comments as done.May 2 2022, 6:23 PM
jrbyrnes marked 2 inline comments as done.
kerbowa added inline comments.May 2 2022, 6:44 PM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
2

This test might be difficult to maintain. You might want to add two different test files. One file that doesn't require asserts which has many tests and has check lines autogenerated with update_mir_test_checks.py.

The test that requires asserts could just be a simple function to see if the clustering is working via the debug output, something that wouldn't be expected to change.

29

Generally, the check lines should come before the function.

rampitec added inline comments.May 3 2022, 11:01 AM
.arcconfig
7 ↗(On Diff #426552)

It does not belong here.

arsenm added inline comments.May 3 2022, 11:02 AM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
2

update_mir_test_checks lines and the manual debug output checks can coexist. It's just update_mir_test_checks isn't very smart and it requires hacking out the irregular run lines before running it

jrbyrnes updated this revision to Diff 426784.May 3 2022, 11:13 AM

Resolve silly arcconfig issue.

Split up tests for maintenance purposes.

Autogen check lines for MIR test.

rampitec added inline comments.May 3 2022, 11:18 AM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
4

We need to run it until after post-RA scheduler to see that clusters hold.

jrbyrnes updated this revision to Diff 426791.May 3 2022, 11:40 AM
jrbyrnes marked 4 inline comments as done.

Add test to MIR after both scheduling passes in sequence to confirm clustering results hold.

jrbyrnes updated this revision to Diff 426792.May 3 2022, 11:41 AM

Remove extraneous file

rampitec added inline comments.May 3 2022, 12:11 PM
llvm/test/CodeGen/AMDGPU/mfma-cluster-edges.mir
2

No space before colon and move it below run lines.

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155

So the cluster does not really hold?

jrbyrnes added inline comments.May 3 2022, 12:43 PM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155

Currently, clusters will be broken by: 1. higher priority instructions, or by 2. independent instructions.

Here we see an independent instruction filling in the gap caused by hardware hazard. I have tried disabling fillMFMAShadow but this does not change the behavior. I think if we want unbroken clusters using SDep::Cluster, we need to address this via a different SchedStrategy (specifically, the logic in tryCandidate and pickNode). Should I start thinking about this?

In the context of CK -- broken clusters will cause problems if clusters of different type blend together. However, I think this won't happen due to dependencies -- hard to say without sample MIR.

rampitec added inline comments.May 3 2022, 12:45 PM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155

If that is caused by a hazard there is nothing we can really do about it, it will be broken that way or another. Thanks.

jrbyrnes added inline comments.May 4 2022, 9:23 AM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155

Hey Stas -- thanks for your thoughts on this.

Based on your comment yesterday, I looked deeper into the broken cluster issue and actually found a couple flaws in this clustering algorithm which can result in avoidable broken clusters. I have addressed these and will release patch soon.

However, these changes will not affect the hazard issue identified here. I have just experimented with a feature that resolves MAI hazards in the scheduler before picking the next node. With this scheduler hack, we have perfect clustering for these tests. If we want unbroken clusters, I think we will need to expose a hacked scheduler. I have also considered bundling instead of clustering, but I think that will not work.

kerbowa added inline comments.May 4 2022, 9:41 AM
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155

I don't think it is necessarily a problem if they are not perfect clusters. The main idea is to have MAC clusters and VMEM/LDS clusters, not to have a specific number of perfectly sequential MFMA.

jrbyrnes updated this revision to Diff 427056.May 4 2022, 10:15 AM

Fix algorithmic flaws:

  1. Use chain as cluster shape (A->B->C->D) instead of fanout (A->{B,C,D}). With a chain, the scheduler will not miss cluster edges due to multiple cluster succs.
  2. Create artificial edges in the cluster. This will coerce the scheduler to start from either the root or leaf of the cluster rather than potentially selecting the middle. In post RA scheduling, if the scheduler selects the middle, it will lose the cluster prefix.

LGTM, but please wait for Austin too.

kerbowa accepted this revision.May 6 2022, 3:03 PM

LGTM with nits. Thanks!

llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.cpp
24

Can you give this its own debug type or else use machine-scheduler?

34

NIT: insts->instructions on help text

114

NIT: Maybe this should be 'MaxMFMAClusterSize'.

153

I actually don't think this should be gated by having multiple waves. This transformation could also be useful for intra-wave scheduling and I think we should leave it up to the users when it is enabled. If we need more precision we could eventually allow for specifying functions where the clustering should be enabled.

This revision is now accepted and ready to land.May 6 2022, 3:03 PM
jrbyrnes updated this revision to Diff 427793.May 6 2022, 4:59 PM

Resolve NITs

kerbowa accepted this revision.May 9 2022, 8:46 AM
This revision was landed with ongoing or failed builds.May 10 2022, 12:59 PM
This revision was automatically updated to reflect the committed changes.