This patch adds cluster edges between independent MFMA instructions. Additionally, it propagates all predecessors of cluster instructions to the root of each cluster, and all successors to the leaf of each cluster -- this is done to remove the possibility that those instructions will be interspersed within the cluster.
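The propagation idea can be sketched on a toy DAG. This is a minimal, self-contained model; `Node` and `propagateDeps` are hypothetical names invented for illustration, not the actual patch code, which operates on LLVM `SUnit`s and `SDep` edges:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy DAG node standing in for an LLVM SUnit (hypothetical simplification).
struct Node {
  int ID = 0;
  std::set<int> Preds, Succs; // edges, by node ID
};

// Propagate every outside predecessor of any cluster member to the cluster's
// root, and every outside successor of any member to the cluster's leaf, so
// outside instructions cannot be scheduled between cluster members.
void propagateDeps(std::vector<Node> &DAG, const std::vector<int> &Cluster) {
  int Root = Cluster.front(), Leaf = Cluster.back();
  std::set<int> InCluster(Cluster.begin(), Cluster.end());
  for (int M : Cluster) {
    for (int P : DAG[M].Preds)
      if (!InCluster.count(P)) {
        DAG[Root].Preds.insert(P);
        DAG[P].Succs.insert(Root);
      }
    for (int S : DAG[M].Succs)
      if (!InCluster.count(S)) {
        DAG[Leaf].Succs.insert(S);
        DAG[S].Preds.insert(Leaf);
      }
  }
}
```

In the real DAG the added edges would be artificial `SDep` edges; the sketch only shows the pred-to-root / succ-to-leaf bookkeeping.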
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
926 ↗ | (On Diff #426066) | Nit: please use full capitalized punctuated English sentences for all your comments: https://llvm.org/docs/CodingStandards.html#commenting |
943 ↗ | (On Diff #426066) | Typo "multive". |
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
856 ↗ | (On Diff #426066) | What about copies before they are lowered to accvgpr_write? |
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
856 ↗ | (On Diff #426066) | They are not MAI. This idiom is used to filter MFMA from the two other MAI-encoded instructions. |
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
856 ↗ | (On Diff #426066) | Thanks for your comments. I looked into it, and these types of copies are not flagged as MAI to begin with -- the check to exclude accvgpr_write is thus irrelevant. |
Would be nice to have some tests that show the results of the clustering as well.
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
846 ↗ | (On Diff #426066) | Should this be moved to a new file, like AMDGPUMacroFusion and AMDGPUExportClustering? |
856 ↗ | (On Diff #426066) | I guess isMAI above would handle that? |
882–886 ↗ | (On Diff #426066) | Could these two loops be combined? |
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
2 | Add # REQUIRES: asserts, to the top of this test. |
One thing to mention: transferring all successors and predecessors technically does not guarantee that an independent instruction will not be scheduled in between. I see that you are adding SDep::Cluster, but I remember it did not work in a very similar scenario before. Not sure if that is still the case. It can happen that in a bigger program the cluster may be broken.
Also, we probably need to consider dropping MFMA clustering along with load clustering during the GCNScheduleDAGMILive::UnclusteredReschedule stage.
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
898 ↗ | (On Diff #426066) | What if one feeds another and you swap them? |
900 ↗ | (On Diff #426066) | I believe you need to check the result of addEdge. |
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
4 | You need '; REQUIRES: asserts' if you want to inspect debug output. | |
34 | You also need to check resulting MIR. |
Hey Stas -- thanks for your comments.
Regarding your note -- yes, this is something I spent some time thinking about. SDep::Cluster doesn't guarantee a single cluster. In fact, I believe there is a hardware dependency between MFMAs, so the scheduler will try to fill this gap with an independent instruction.
It seems that if we want unbroken clusters, we will need a scheduler with modified heuristic priorities that prioritize cluster edges. We can also address the UnclusteredReschedule issue in that scheduler. Worth mentioning: even in such a scheme, a cluster can still be broken if the preds/succs of instructions are not handled, which is why I've handled them here.
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp | ||
---|---|---|
898 ↗ | (On Diff #426066) | As the code currently stands, an instruction that has a dependency with any instruction in Cluster A will be used in Cluster B, and so on. We may be able to cluster instructions with dependencies, but this would create complexities. I thought it best to get the simple version up for review first. |
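The grouping rule described in this comment can be sketched as a greedy partition. This is a hypothetical, self-contained model with invented names (`dependent`, `partitionMFMAs`); the real code queries the scheduling DAG rather than an explicit edge set:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// True if either instruction depends on the other, per a toy edge set of
// (consumer, producer) pairs. Stands in for a real DAG dependency query.
static bool dependent(int A, int B, const std::set<std::pair<int, int>> &Deps) {
  return Deps.count({A, B}) || Deps.count({B, A});
}

// Greedily partition MFMAs into clusters of mutually independent
// instructions: an MFMA that depends on (or feeds) a member of the current
// cluster closes it and starts the next one.
std::vector<std::vector<int>>
partitionMFMAs(const std::vector<int> &MFMAs,
               const std::set<std::pair<int, int>> &Deps) {
  std::vector<std::vector<int>> Clusters;
  for (int M : MFMAs) {
    bool Fits = !Clusters.empty();
    if (Fits)
      for (int Member : Clusters.back())
        if (dependent(M, Member, Deps)) {
          Fits = false;
          break;
        }
    if (Fits)
      Clusters.back().push_back(M);
    else
      Clusters.push_back({M});
  }
  return Clusters;
}
```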
This is the conflicting part: we need to make sure it does not succeed in filling the gap. It probably needs some tweaking in FillMFMAShadowMutation and GCNHazardRecognizer::ShouldPreferAnother when this option is set. In any case, you need some more tests with different clustering/non-clustering scenarios that check the final code: do we get the resulting clusters? Especially given that the post-RA scheduler will try to tear them apart.
It seems that if we want unbroken clusters, we will need a scheduler with modified heuristic priorities that prioritize cluster edges. We can also address the UnclusteredReschedule issue in that scheduler. Worth mentioning: even in such a scheme, a cluster can still be broken if the preds/succs of instructions are not handled, which is why I've handled them here.
Fair enough. UnclusteredReschedule is also todo for a future patch.
It may be that we want the cluster edges to be a suggestion rather than a hard limit. The cluster edges already work this way, but their priority is low, so it usually doesn't matter.
I believe GCNHazardRecognizer::ShouldPreferAnother will not be used now that we use post-RA machine scheduler.
Updating D124678: [AMDGPU] Allow for MFMA Inst Clustering
Refactored code to address several review comments:
- Pulled MFMA Clustering into its own files
- Pulled dependency propagation into its own function to facilitate potential "weak" clustering feature
Added MIR tests
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
2 | This test might be difficult to maintain. You might want to add two different test files. One file that doesn't require asserts which has many tests and has check lines autogenerated with update_mir_test_checks.py. The test that requires asserts could just be a simple function to see if the clustering is working via the debug output, something that wouldn't be expected to change. | |
29 | Generally, the check lines should come before the function. |
.arcconfig | ||
---|---|---|
7 ↗ | (On Diff #426552) | It does not belong here. |
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
2 | update_mir_test_checks lines and the manual debug output checks can coexist. It's just that update_mir_test_checks isn't very smart; it requires hacking out the irregular RUN lines before running it. |
Resolve silly arcconfig issue.
Split up tests for maintenance purposes.
Autogen check lines for MIR test.
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
4 | We need to run it until after the post-RA scheduler to see that the clusters hold. |
Add test to MIR after both scheduling passes in sequence to confirm clustering results hold.
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
155 | Currently, clusters will be broken by: 1. higher-priority instructions, or 2. independent instructions. Here we see an independent instruction filling in the gap caused by a hardware hazard. I have tried disabling FillMFMAShadowMutation, but this does not change the behavior. I think that if we want unbroken clusters using SDep::Cluster, we need to address this via a different SchedStrategy (specifically, the logic in tryCandidate and pickNode). Should I start thinking about this? In the context of CK -- broken clusters will cause problems if clusters of different types blend together. However, I think this won't happen due to dependencies; it's hard to say without sample MIR. |
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
155 | If that is caused by a hazard, there is nothing we can really do about it; it will be broken one way or another. Thanks. |
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
155 | Hey Stas -- thanks for your thoughts on this. Based on your comment yesterday, I looked deeper into the broken-cluster issue and actually found a couple of flaws in this clustering algorithm which can result in avoidable broken clusters. I have addressed these and will post the updated patch soon. However, these changes will not affect the hazard issue identified here. I have just experimented with a feature that resolves MAI hazards in the scheduler before picking the next node. With this scheduler hack, we have perfect clustering for these tests. If we want unbroken clusters, I think we will need to expose a hacked scheduler. I have also considered bundling instead of clustering, but I think that will not work. |
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir | ||
---|---|---|
155 | I don't think it is necessarily a problem if they are not perfect clusters. The main idea is to have MAC clusters and VMEM/LDS clusters, not to have a specific number of perfectly sequential MFMA. |
Fix algorithmic flaws:
- Use a chain as the cluster shape (A->B->C->D) instead of a fanout (A->{B,C,D}). With a chain, the scheduler will not miss cluster edges due to multiple cluster succs.
- Create artificial edges in the cluster. This will coerce the scheduler to start from either the root or the leaf of the cluster rather than potentially selecting the middle. In post-RA scheduling, if the scheduler selects the middle, it loses the cluster prefix.
LGTM with nits. Thanks!
llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.cpp | ||
---|---|---|
24 | Can you give this its own debug type or else use machine-scheduler? | |
34 | NIT: insts->instructions on help text | |
114 | NIT: Maybe this should be 'MaxMFMAClusterSize'. | |
153 | I actually don't think this should be gated by having multiple waves. This transformation could also be useful for intra-wave scheduling and I think we should leave it up to the users when it is enabled. If we need more precision we could eventually allow for specifying functions where the clustering should be enabled. |