This is an archive of the discontinued LLVM Phabricator instance.

Differential D124678

[AMDGPU] Allow for MFMA Inst Clustering
Closed · Public

Authored by jrbyrnes on Apr 29 2022, 8:35 AM.

Details

Reviewers

kerbowa
rampitec
vangthao95
arsenm

Commits

rGf822db7670d4: [AMDGPU] Allow for MFMA Inst Clustering

Summary

This patch adds cluster edges between independent MFMA instructions. It also propagates all predecessors of the clustered instructions to the root of each cluster, and all successors to the leaf of each cluster, so that those instructions cannot be interspersed within the cluster.
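
To make the dependency propagation concrete, here is a minimal standalone sketch on a toy graph. It does not use the LLVM scheduler types: `ToyDAG` and `propagateClusterDeps` are hypothetical names chosen for illustration, and as a simplification the sketch copies the combined dependencies onto every cluster member rather than only the root and leaf. It also assumes the cluster members are mutually independent, so the extra edges cannot form a cycle.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy DAG (hypothetical, not llvm::ScheduleDAG): Preds[N] holds the nodes
// N depends on, Succs[N] holds the nodes that depend on N.
struct ToyDAG {
  std::map<unsigned, std::set<unsigned>> Preds, Succs;

  // Record that To must be scheduled after From.
  void addEdge(unsigned From, unsigned To) {
    assert(From != To && "self edges are not allowed");
    Succs[From].insert(To);
    Preds[To].insert(From);
  }
};

// Give every cluster member the union of all members' outside predecessors
// and successors, so no outside node is forced in between two members.
void propagateClusterDeps(ToyDAG &DAG, const std::vector<unsigned> &Cluster) {
  std::set<unsigned> Members(Cluster.begin(), Cluster.end());
  std::set<unsigned> AllPreds, AllSuccs;
  for (unsigned M : Cluster) {
    for (unsigned P : DAG.Preds[M])
      if (!Members.count(P))
        AllPreds.insert(P);
    for (unsigned S : DAG.Succs[M])
      if (!Members.count(S))
        AllSuccs.insert(S);
  }
  for (unsigned M : Cluster) {
    for (unsigned P : AllPreds)
      DAG.addEdge(P, M);
    for (unsigned S : AllSuccs)
      DAG.addEdge(M, S);
  }
}
```

With a cluster {2, 5} where node 0 feeds 2 and node 1 feeds 5, both members end up depending on 0 and 1, so neither producer can be placed between the two clustered nodes.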

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,040 ms	x64 debian > libFuzzer.libFuzzer::fuzzer-leak.test
	60,100 ms	x64 debian > libFuzzer.libFuzzer::large.test
	60,020 ms	x64 debian > libFuzzer.libFuzzer::minimize_crash.test
	60,020 ms	x64 debian > libFuzzer.libFuzzer::out-of-process-fuzz.test
	60,020 ms	x64 debian > libFuzzer.libFuzzer::value-profile-load.test

Event Timeline

jrbyrnes created this revision. Apr 29 2022, 8:35 AM

Herald added a project: Restricted Project. Apr 29 2022, 8:35 AM

Herald added subscribers: hsmhsm, foad, hiraditya and 7 others.

jrbyrnes requested review of this revision. Apr 29 2022, 8:35 AM

Herald added a project: Restricted Project. Apr 29 2022, 8:35 AM

Herald added subscribers: llvm-commits, wdng.

foad added inline comments. Apr 29 2022, 9:19 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
926 ↗	(On Diff #426066)	Nit: please use full capitalized punctuated English sentences for all your comments: https://llvm.org/docs/CodingStandards.html#commenting
943 ↗	(On Diff #426066)	Typo "multive".

Harbormaster completed remote builds in B161990: Diff 426066. Apr 29 2022, 9:19 AM

arsenm added inline comments. Apr 29 2022, 9:21 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗	(On Diff #426066)	What about copies before they are lowered to accvgpr_write?

rampitec added inline comments. Apr 29 2022, 9:59 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗	(On Diff #426066)	They are not MAI. This idiom is used to filter MFMA from the 2 other MAI-encoded instructions.

jrbyrnes added inline comments. Apr 29 2022, 10:05 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗	(On Diff #426066)	Thanks for your comments. I looked into it, and these types of copies are not flagged as MAI to begin with -- the check to exclude accvgpr_write is thus irrelevant.

Would be nice to have some tests that show the results of the clustering as well.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
846 ↗	(On Diff #426066)	Should this be moved to a new file, as was done for AMDGPUMacroFusion and AMDGPUExportClustering?
856 ↗	(On Diff #426066)	I guess isMAI above would handle that?
882–886 ↗	(On Diff #426066)	Could these two loops be combined?
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
2	Add `# REQUIRES: asserts`, to the top of this test.

One thing to mention: transferring all successors and predecessors technically does not guarantee that an independent instruction will not be scheduled in between. I see that you are adding SDep::Cluster, but I remember it didn't use to work in the very similar scenario before. Not sure it is still so. It can happen that in a bigger program the cluster may be broken.

Also we probably need to consider dropping MFMA clustering along with load clustering during GCNScheduleDAGMILive::UnclusteredReschedule stage.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
898 ↗	(On Diff #426066)	What if one feeds another and you swap them?
900 ↗	(On Diff #426066)	I believe you need to check the result of addEdge.
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
4	You need '; REQUIRES: asserts' if you want to inspect debug output.
34	You also need to check resulting MIR.

kerbowa added inline comments. Apr 29 2022, 10:33 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗	(On Diff #426066)	I'm not sure, I think it will be flagged. We use this idiom enough that we really should just add an isMFMA function to siinstrinfo.
898 ↗	(On Diff #426066)	This is only looking at independent MFMA.

rampitec added inline comments. Apr 29 2022, 10:54 AM

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
856 ↗	(On Diff #426066)	These 2 are also isMAI, just based on encoding. It may make sense to create a helper, agree.
898 ↗	(On Diff #426066)	Ah, right. Interesting if we may want to cluster dependent MFMA too or not.
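
The helper idea discussed in the comments on line 856 could look roughly like the following. This is a self-contained toy, not the SIInstrInfo API: `ToyOpcode`, `ToyInst`, and this `isMFMA` are hypothetical stand-ins. It assumes only what the thread states, namely that the two accvgpr moves are MAI-encoded but are not MFMA.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real opcodes; only the distinction between
// the two accvgpr moves and everything else matters here.
enum class ToyOpcode { AccVGPRWrite, AccVGPRRead, MFMA_F32_4X4X1F32, VMov };

struct ToyInst {
  ToyOpcode Opcode;
  bool IsMAI; // Set for every MAI-encoded instruction, including the moves.
};

// Treat an instruction as MFMA if it is MAI-encoded but is neither of the
// two accvgpr move instructions.
bool isMFMA(const ToyInst &MI) {
  return MI.IsMAI && MI.Opcode != ToyOpcode::AccVGPRWrite &&
         MI.Opcode != ToyOpcode::AccVGPRRead;
}
```

Centralizing the check in one predicate would let call sites read `isMFMA(MI)` instead of repeating the three-way condition.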

In D124678#3482965, @rampitec wrote:

One thing to mention: transferring all successors and predecessors technically does not guarantee that an independent instruction will not be scheduled in between. I see that you are adding SDep::Cluster, but I remember it didn't use to work in the very similar scenario before. Not sure it is still so. It can happen that in a bigger program the cluster may be broken.

Also we probably need to consider dropping MFMA clustering along with load clustering during GCNScheduleDAGMILive::UnclusteredReschedule stage.

Hey Stas -- thanks for your comments.

Regarding your note -- yes, this is something I spent some time thinking about – SDep::Cluster doesn’t guarantee a single cluster. In fact, I believe there is a hardware dependency between MFMAs, so the scheduler will try to fill this gap with an independent instruction.

It seems that if we want to offer unbroken clusters, we will want to offer a scheduler with modified heuristic priorities to prioritize cluster edges. We can also address the issue of UnclusteredReschedule pass in this scheduler. Worth mentioning is that in such a scheme, a cluster can still be broken if the preds / succs of instructions are not handled, which is why I’ve handled them here.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
898 ↗	(On Diff #426066)	As the code currently is, an instruction that has dependency with any instruction in Cluster A will be used in Cluster B, and so on. We may be able to cluster instructions with dependencies, but this would create complexities. I thought it best to get the simple version for review first.

In D124678#3483663, @jrbyrnes wrote:

Regarding your note -- yes, this is something I spent some time thinking about – SDep::Cluster doesn’t guarantee a single cluster. In fact, I believe there is a hardware dependency between MFMAs, so the scheduler will try to fill this gap with an independent instruction.

This is a conflicting thing: we need to make sure it does not succeed in filling the gap. Probably it needs some tweaking in FillMFMAShadowMutation and GCNHazardRecognizer::ShouldPreferAnother if this option is set. In any case you need some more tests with different clustering/non-clustering scenarios that check the final code: do we get the resulting clusters? Especially given that the post-RA scheduler will try to tear them apart.

It seems that if we want to offer unbroken clusters, we will want to offer a scheduler with modified heuristic priorities to prioritize cluster edges. We can also address the issue of UnclusteredReschedule pass in this scheduler. Worth mentioning is that in such a scheme, a cluster can still be broken if the preds / succs of instructions are not handled, which is why I’ve handled them here.

Fair enough. UnclusteredReschedule is also todo for a future patch.

In D124678#3483673, @rampitec wrote:

In D124678#3483663, @jrbyrnes wrote:

Regarding your note -- yes, this is something I spent some time thinking about – SDep::Cluster doesn’t guarantee a single cluster. In fact, I believe there is a hardware dependency between MFMAs, so the scheduler will try to fill this gap with an independent instruction.

This is a conflicting thing: we need to make sure it does not succeed in filling the gap. Probably it needs some tweaking in FillMFMAShadowMutation and GCNHazardRecognizer::ShouldPreferAnother if this option is set. In any case you need some more tests with different clustering/non-clustering scenarios that check the final code: do we get the resulting clusters? Especially given that the post-RA scheduler will try to tear them apart.

It may be that we want the cluster edges to be a suggestion rather than a hard limit, the cluster edges already work this way but the priority for them is low so it usually doesn't matter.

I believe GCNHazardRecognizer::ShouldPreferAnother will not be used now that we use post-RA machine scheduler.

Updating D124678: [AMDGPU] Allow for MFMA Inst Clustering

Refactored code to address several review comments:

Pulled MFMA Clustering into its own files
Pulled dependency propagation into its own function to facilitate potential "weak" clustering feature

Added MIR tests

Herald added a subscriber: mgorny. May 2 2022, 6:22 PM

jrbyrnes marked 3 inline comments as done. May 2 2022, 6:23 PM

jrbyrnes marked 2 inline comments as done.

kerbowa added inline comments. May 2 2022, 6:44 PM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
2	This test might be difficult to maintain. You might want to add two different test files. One file that doesn't require asserts which has many tests and has check lines autogenerated with update_mir_test_checks.py. The test that requires asserts could just be a simple function to see if the clustering is working via the debug output, something that wouldn't be expected to change.
29	Generally, the check lines should come before the function.

Harbormaster completed remote builds in B162353: Diff 426552. May 2 2022, 7:30 PM

rampitec added inline comments. May 3 2022, 11:01 AM

.arcconfig
7 ↗	(On Diff #426552)	It does not belong here.

arsenm added inline comments. May 3 2022, 11:02 AM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
2	update_mir_test_checks lines and the manual debug output checks can coexist. It's just update_mir_test_checks isn't very smart and it requires hacking out the irregular run lines before running it

Resolve silly arcconfig issue.

Split up tests for maintenance purposes.

Autogen check lines for MIR test.

rampitec added inline comments. May 3 2022, 11:18 AM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
4	We need to run it until after post-RA scheduler to see that clusters hold.

Add test to MIR after both scheduling passes in sequence to confirm clustering results hold.

Remove extraneous file

rampitec added inline comments. May 3 2022, 12:11 PM

llvm/test/CodeGen/AMDGPU/mfma-cluster-edges.mir
2	No space before colon and move it below run lines.
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155	So the cluster does not really hold?

jrbyrnes added inline comments. May 3 2022, 12:43 PM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155	Currently, clusters will be broken by: 1. higher priority instructions, or by 2. independent instructions. Here we see an independent instruction filling in the gap caused by hardware hazard. I have tried disabling fillMFMAShadow but this does not change the behavior. I think if we want unbroken clusters using SDep::Cluster, we need to address this via a different SchedStrategy (specifically, the logic in tryCandidate and pickNode). Should I start thinking about this? In the context of CK -- broken clusters will cause problems if clusters of different type blend together. However, I think this won't happen due to dependencies -- hard to say without sample MIR.

rampitec added inline comments. May 3 2022, 12:45 PM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155	If that is caused by a hazard there is nothing we can really do about it, it will be broken that way or another. Thanks.

Harbormaster completed remote builds in B162524: Diff 426792. May 3 2022, 1:19 PM

jrbyrnes added inline comments. May 4 2022, 9:23 AM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155	Hey Stas -- thanks for your thoughts on this. Based on your comment yesterday, I looked deeper into the broken cluster issue and actually found a couple flaws in this clustering algorithm which can result in avoidable broken clusters. I have addressed these and will release patch soon. However, these changes will not affect the hazard issue identified here. I have just experimented with a feature that resolves MAI hazards in the scheduler before picking the next node. With this scheduler hack, we have perfect clustering for these tests. If we want unbroken clusters, I think we will need to expose a hacked scheduler. I have also considered bundling instead of clustering, but I think that will not work.

Herald added a subscriber: jsilvanus. May 4 2022, 9:23 AM

kerbowa added inline comments. May 4 2022, 9:41 AM

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir
155	I don't think it is necessarily a problem if they are not perfect clusters. The main idea is to have MAC clusters and VMEM/LDS clusters, not to have a specific number of perfectly sequential MFMA.

Fix algorithmic flaws:

Use chain as cluster shape (A->B->C->D) instead of fanout (A->{B,C,D}). With a chain, the scheduler will not miss cluster edges due to multiple cluster succs.
Create artificial edges in the cluster. This will coerce the scheduler to start from either the root or leaf of the cluster rather than potentially selecting the middle. In post RA scheduling, if the scheduler selects the middle, it will lose the cluster prefix.
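
The difference between the two cluster shapes can be sketched with plain edge lists. This is an illustrative toy under assumed names (`Edge`, `fanoutEdges`, `chainEdges`, `maxClusterSuccs` are hypothetical and not part of the patch); it only shows that a chain gives every node at most one cluster successor, while a fanout gives the root several.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<unsigned, unsigned>; // (cluster pred, cluster succ)

// Fanout shape A->{B,C,D}: the first node precedes every other node.
std::vector<Edge> fanoutEdges(const std::vector<unsigned> &Nodes) {
  std::vector<Edge> Edges;
  for (std::size_t I = 1; I < Nodes.size(); ++I)
    Edges.emplace_back(Nodes[0], Nodes[I]);
  return Edges;
}

// Chain shape A->B->C->D: each node precedes only the next one.
std::vector<Edge> chainEdges(const std::vector<unsigned> &Nodes) {
  std::vector<Edge> Edges;
  for (std::size_t I = 1; I < Nodes.size(); ++I)
    Edges.emplace_back(Nodes[I - 1], Nodes[I]);
  return Edges;
}

// Largest number of cluster successors attached to any single node.
std::size_t maxClusterSuccs(const std::vector<Edge> &Edges) {
  std::map<unsigned, std::size_t> Count;
  std::size_t Max = 0;
  for (const Edge &E : Edges)
    Max = std::max(Max, ++Count[E.first]);
  return Max;
}
```

For four nodes, the chain never exceeds one cluster successor per node, whereas the fanout root has three, which is the situation the update note describes as causing missed cluster edges.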

Herald added a subscriber: mgrang. May 4 2022, 10:15 AM

Harbormaster completed remote builds in B162730: Diff 427056. May 4 2022, 11:39 AM

LGTM, but please wait for Austin too.

LGTM with nits. Thanks!

llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.cpp
23	Can you give this its own debug type or else use machine-scheduler?
33	NIT: insts->instructions on help text
113	NIT: Maybe this should be 'MaxMFMAClusterSize'.
152	I actually don't think this should be gated by having multiple waves. This transformation could also be useful for intra-wave scheduling and I think we should leave it up to the users when it is enabled. If we need more precision we could eventually allow for specifying functions where the clustering should be enabled.

This revision is now accepted and ready to land. May 6 2022, 3:03 PM

Resolve NITs

Harbormaster completed remote builds in B163264: Diff 427793. May 6 2022, 6:35 PM

kerbowa accepted this revision. May 9 2022, 8:46 AM

This revision was landed with ongoing or failed builds. May 10 2022, 12:59 PM

Closed by commit rGf822db7670d4: [AMDGPU] Allow for MFMA Inst Clustering (authored by jrbyrnes).

This revision was automatically updated to reflect the committed changes.

jrbyrnes added a commit: rGf822db7670d4: [AMDGPU] Allow for MFMA Inst Clustering.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.h	21 lines
llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.cpp	177 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	3 lines
llvm/lib/Target/AMDGPU/CMakeLists.txt	1 line
llvm/test/CodeGen/AMDGPU/mfma-cluster-edges.mir	71 lines
llvm/test/CodeGen/AMDGPU/mfma-cluster.mir	354 lines

Diff 427056

llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.h

This file was added.

				//===- AMDGPUMFMAClustering.h - AMDGPU MFMA Clustering ------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUMFMACLUSTERING_H
				#define LLVM_LIB_TARGET_AMDGPU_AMDGPUMFMACLUSTERING_H

				#include "llvm/CodeGen/ScheduleDAGMutation.h"
				#include <memory>

				namespace llvm {

				std::unique_ptr<ScheduleDAGMutation> createMFMAClusterDAGMutation();

				} // namespace llvm

				#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPUMFMACLUSTERING_H

llvm/lib/Target/AMDGPU/AMDGPUMFMAClustering.cpp

This file was added.

				//===--- AMDGPUMFMAClustering.cpp - AMDGPU MFMA Clustering -----------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file This file contains a DAG scheduling mutation to cluster MFMA
				/// instructions.
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPUMFMAClustering.h"
				#include "AMDGPUTargetMachine.h"
				#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
				#include "SIInstrInfo.h"
				#include "SIMachineFunctionInfo.h"
				#include "llvm/CodeGen/MachineScheduler.h"

				using namespace llvm;

				#define DEBUG_TYPE "amdgpu-subtarget"

				namespace {

				static cl::opt<bool> EnableMFMACluster("amdgpu-mfma-cluster",
				                                       cl::desc("Enable MFMA clustering"),
				                                       cl::init(false));

				static cl::opt<unsigned>
				    MFMAClusterSize("amdgpu-mfma-cluster-size", cl::init(5), cl::Hidden,
				                    cl::desc("The maximum number of MFMA insts to "
				                             "attempt to cluster together."));

				class MFMAClusterDAGMutation : public ScheduleDAGMutation {
				  const SIInstrInfo *TII;
				  ScheduleDAGMI *DAG;

				public:
				  MFMAClusterDAGMutation() = default;
				  void apply(ScheduleDAGInstrs *DAGInstrs) override;
				};

				static void collectMFMASUnits(SmallVectorImpl<SUnit *> &MFMASUnits,
				                              const SIInstrInfo *TII, ScheduleDAGInstrs *DAG) {
				  for (SUnit &SU : DAG->SUnits) {
				    MachineInstr &MAI = *SU.getInstr();
				    if (!TII->isMAI(MAI) ||
				        MAI.getOpcode() == AMDGPU::V_ACCVGPR_WRITE_B32_e64 ||
				        MAI.getOpcode() == AMDGPU::V_ACCVGPR_READ_B32_e64)
				      continue;

				    MFMASUnits.push_back(&SU);

				    LLVM_DEBUG(dbgs() << "Found MFMA: "; DAG->dumpNode(SU););
				  }

				  // Sorting the MFMAs in NodeNum order results in a good clustering order
				  std::sort(MFMASUnits.begin(), MFMASUnits.end(),
				            [](SUnit *a, SUnit *b) { return a->NodeNum < b->NodeNum; });
				}

				static void propagateDeps(DenseMap<unsigned, unsigned> &SUnit2ClusterInfo,
				                          llvm::ArrayRef<SDep> ClusterPreds,
				                          llvm::ArrayRef<SDep> ClusterSuccs,
				                          unsigned ClusterNum, ScheduleDAGInstrs *DAG) {

				  for (auto Node : SUnit2ClusterInfo) {
				    if (Node.second != ClusterNum)
				      continue; // Only add the combined succs to the current cluster

				    LLVM_DEBUG(dbgs() << "Copying Deps To SU(" << Node.first << ")\n");

				    for (const SDep &Succ : ClusterSuccs) {
				      LLVM_DEBUG(dbgs() << "Copying Succ SU(" << Succ.getSUnit()->NodeNum
				                        << ")\n");
				      DAG->addEdge(Succ.getSUnit(),
				                   SDep(&DAG->SUnits[Node.first], SDep::Artificial));
				    }

				    for (const SDep &Pred : ClusterPreds) {
				      LLVM_DEBUG(dbgs() << "Copying Pred SU(" << Pred.getSUnit()->NodeNum
				                        << ")\n");
				      if (Pred.getSUnit()->NodeNum == ClusterNum)
				        continue;
				      DAG->addEdge(&DAG->SUnits[Node.first],
				                   SDep(Pred.getSUnit(), SDep::Artificial));
				    }
				  }
				}

				static void clusterNeighboringMFMAs(llvm::ArrayRef<SUnit *> MFMASUnits,
				                                    ScheduleDAGInstrs *DAG) {

				  DenseMap<unsigned, unsigned> SUnit2ClusterInfo;

				  for (unsigned Idx = 0, End = MFMASUnits.size(); Idx < (End - 1); ++Idx) {
				    if (SUnit2ClusterInfo.count(MFMASUnits[Idx]->NodeNum))
				      continue; // We don't want to cluster against a different cluster

				    auto MFMAOpa = MFMASUnits[Idx];
				    auto ClusterBase = MFMAOpa;
				    unsigned ClusterNum = ClusterBase->NodeNum;
				    SmallVector<SDep, 4> ClusterSuccs(MFMAOpa->Succs);
				    SmallVector<SDep, 4> ClusterPreds(MFMAOpa->Preds);
				    unsigned NextIdx = Idx + 1;
				    unsigned ClusterSize = 1;

				    // Attempt to cluster all the remaining MFMASunits in a chain
				    // starting at ClusterBase/MFMAOpa.
				    for (; NextIdx < End; ++NextIdx) {
				      if (ClusterSize >= MFMAClusterSize || NextIdx >= End)
				        break;
				      // Only add independent MFMAs that have not been previously clustered
				      if (SUnit2ClusterInfo.count(MFMASUnits[NextIdx]->NodeNum) ||
				          DAG->IsReachable(MFMASUnits[NextIdx], ClusterBase) ||
				          DAG->IsReachable(ClusterBase, MFMASUnits[NextIdx]))
				        continue;

				      auto MFMAOpb = MFMASUnits[NextIdx];
				      // Aggregate the cluster inst dependencies for dep propagation
				      ClusterPreds.append(MFMAOpb->Preds);
				      ClusterSuccs.append(MFMAOpb->Succs);
				      if (!DAG->addEdge(MFMAOpb, SDep(MFMAOpa, SDep::Cluster)))
				        continue;

				      // Enforce ordering to ensure root/leaf of cluster chain gets
				      // scheduled first/last
				      DAG->addEdge(MFMAOpb, SDep(MFMAOpa, SDep::Artificial));

				      LLVM_DEBUG(dbgs() << "Cluster MFMA SU(" << MFMAOpa->NodeNum << ") - SU("
				                        << MFMAOpb->NodeNum << ")\n");

				      SUnit2ClusterInfo[MFMAOpb->NodeNum] = ClusterNum;
				      SUnit2ClusterInfo[MFMAOpa->NodeNum] = ClusterNum;
				      ++ClusterSize;
				      MFMAOpa = MFMAOpb;
				    }
				    propagateDeps(SUnit2ClusterInfo, ClusterPreds, ClusterSuccs, ClusterNum,
				                  DAG);
				  }
				}

				void MFMAClusterDAGMutation::apply(ScheduleDAGInstrs *DAGInstrs) {
				  const GCNSubtarget &ST = DAGInstrs->MF.getSubtarget<GCNSubtarget>();
				  TII = ST.getInstrInfo();
				  const SIMachineFunctionInfo *MFI =
				      DAGInstrs->MF.getInfo<SIMachineFunctionInfo>();
				  // The purpose of clustering is to aid with multi wave scheduling.
				  // If our occupancy doesn't support multi waves, bypass clustering
				  if (!ST.hasMAIInsts() || MFI->getOccupancy() < 2)
				    return;
				  DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);
				  const TargetSchedModel *TSchedModel = DAGInstrs->getSchedModel();
				  if (!TSchedModel || DAG->SUnits.empty())
				    return;

				  SmallVector<SUnit *, 32> MFMASUnits;
				  collectMFMASUnits(MFMASUnits, TII, DAG);

				  if (MFMASUnits.size() < 2)
				    return;

				  clusterNeighboringMFMAs(MFMASUnits, DAG);
				}

				} // namespace

				namespace llvm {

				std::unique_ptr<ScheduleDAGMutation> createMFMAClusterDAGMutation() {
				  return EnableMFMACluster ? std::make_unique<MFMAClusterDAGMutation>()
				                           : nullptr;
				}

				} // end namespace llvm
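
The greedy chain-building loop in the file above can be approximated outside of LLVM with a toy graph. The following is a sketch under stated assumptions, not the patch's implementation: `ToyDAG` and `clusterChains` are hypothetical names, node numbers are assumed to be topologically ordered (edges only go from lower to higher numbers), and only the independence check, the "already clustered" check, and the size cap are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy dependency graph over node numbers 0..N-1 (not llvm::ScheduleDAG).
struct ToyDAG {
  std::vector<std::vector<unsigned>> Succs;

  explicit ToyDAG(unsigned NumNodes) : Succs(NumNodes) {}

  void addEdge(unsigned From, unsigned To) {
    assert(From < Succs.size() && To < Succs.size());
    Succs[From].push_back(To);
  }

  // Depth-first search along successor edges.
  bool isReachable(unsigned From, unsigned To) const {
    std::vector<bool> Seen(Succs.size(), false);
    std::vector<unsigned> Work{From};
    while (!Work.empty()) {
      unsigned N = Work.back();
      Work.pop_back();
      if (N == To)
        return true;
      if (Seen[N])
        continue;
      Seen[N] = true;
      for (unsigned S : Succs[N])
        Work.push_back(S);
    }
    return false;
  }
};

// Greedily link candidates (sorted by node number) into chains of at most
// MaxSize nodes. A candidate joins a chain only if it is not already
// clustered and is independent of the chain's base node. Each link is also
// added to the graph, so later candidates that depend on any chain member
// become reachable from the base and are skipped.
std::vector<std::pair<unsigned, unsigned>>
clusterChains(ToyDAG &DAG, const std::vector<unsigned> &Candidates,
              unsigned MaxSize) {
  std::vector<std::pair<unsigned, unsigned>> Links;
  std::set<unsigned> Clustered;
  for (std::size_t I = 0; I + 1 < Candidates.size(); ++I) {
    if (Clustered.count(Candidates[I]))
      continue;
    unsigned Base = Candidates[I], Tail = Base, Size = 1;
    for (std::size_t J = I + 1; J < Candidates.size() && Size < MaxSize; ++J) {
      unsigned Cand = Candidates[J];
      if (Clustered.count(Cand) || DAG.isReachable(Cand, Base) ||
          DAG.isReachable(Base, Cand))
        continue;
      DAG.addEdge(Tail, Cand);
      Links.emplace_back(Tail, Cand);
      Clustered.insert(Tail);
      Clustered.insert(Cand);
      Tail = Cand;
      ++Size;
    }
  }
  return Links;
}
```

On a toy graph with candidates 2, 6, 10, 11, 12 and a single 6 -> 11 dependency, this yields the links 2-6, 6-10, 10-12 with a cap of 5, and 2-6, 10-11 with a cap of 2. That has the same shape as the PRERA and TWOLIMIT check lines for basic_cluster in mfma-cluster-edges.mir, although the real scheduling DAG has more edges, so this is only an analogy.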

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[10 lines not shown]
 /// information needed to emit code for SI+ GPUs.
 //
 //===----------------------------------------------------------------------===//

 #include "AMDGPUTargetMachine.h"
 #include "AMDGPU.h"
 #include "AMDGPUAliasAnalysis.h"
 #include "AMDGPUExportClustering.h"
+#include "AMDGPUMFMAClustering.h"
 #include "AMDGPUMacroFusion.h"
 #include "AMDGPUTargetObjectFile.h"
 #include "AMDGPUTargetTransformInfo.h"
 #include "GCNIterativeScheduler.h"
 #include "GCNSchedStrategy.h"
 #include "R600.h"
 #include "R600TargetMachine.h"
 #include "SIMachineFunctionInfo.h"
[362 lines not shown]
 static ScheduleDAGInstrs *createSIMachineScheduler(MachineSchedContext *C) {
   return new SIScheduleDAGMI(C);
 }

 static ScheduleDAGInstrs *
 createGCNMaxOccupancyMachineScheduler(MachineSchedContext *C) {
   ScheduleDAGMILive *DAG =
     new GCNScheduleDAGMILive(C, std::make_unique<GCNMaxOccupancySchedStrategy>(C));
   DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
+  DAG->addMutation(createMFMAClusterDAGMutation());
   DAG->addMutation(createAMDGPUMacroFusionDAGMutation());
   DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
   return DAG;
 }

 static ScheduleDAGInstrs *
 createIterativeGCNMaxOccupancyMachineScheduler(MachineSchedContext *C) {
   auto DAG = new GCNIterativeScheduler(C,
[465 lines not shown]
 public:
   createMachineScheduler(MachineSchedContext *C) const override;

   ScheduleDAGInstrs *
   createPostMachineScheduler(MachineSchedContext *C) const override {
     ScheduleDAGMI *DAG = createGenericSchedPostRA(C);
     const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>();
     DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
     DAG->addMutation(ST.createFillMFMAShadowMutation(DAG->TII));
+    DAG->addMutation(createMFMAClusterDAGMutation());
     return DAG;
   }

   bool addPreISel() override;
   void addMachineSSAOptimization() override;
   bool addILPOpts() override;
   bool addInstSelector() override;
   bool addIRTranslator() override;
[672 lines not shown]

llvm/lib/Target/AMDGPU/CMakeLists.txt

[69 lines not shown]
 add_llvm_target(AMDGPUCodeGen
   AMDGPULowerKernelArguments.cpp
   AMDGPULowerKernelAttributes.cpp
   AMDGPULowerModuleLDSPass.cpp
   AMDGPUMachineCFGStructurizer.cpp
   AMDGPUMachineFunction.cpp
   AMDGPUMachineModuleInfo.cpp
   AMDGPUMacroFusion.cpp
   AMDGPUMCInstLower.cpp
+  AMDGPUMFMAClustering.cpp
   AMDGPUMIRFormatter.cpp
   AMDGPUOpenCLEnqueuedBlockLowering.cpp
   AMDGPUPerfHintAnalysis.cpp
   AMDGPUPostLegalizerCombiner.cpp
   AMDGPUPreLegalizerCombiner.cpp
   AMDGPUPrintfRuntimeBinding.cpp
   AMDGPUPromoteAlloca.cpp
   AMDGPUPropagateAttributes.cpp
[103 lines not shown]

llvm/test/CodeGen/AMDGPU/mfma-cluster-edges.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=machine-scheduler %s -o - -amdgpu-mfma-cluster=1 --debug-only=amdgpu-subtarget,machine-scheduler 2>&1 | FileCheck -check-prefix=PRERA %s
				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=machine-scheduler %s -o - -amdgpu-mfma-cluster=1 -amdgpu-mfma-cluster-size=2 --debug-only=amdgpu-subtarget,machine-scheduler 2>&1 | FileCheck -check-prefix=TWOLIMIT %s
				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=postmisched %s -o - -amdgpu-mfma-cluster=1 --debug-only=amdgpu-subtarget,machine-scheduler 2>&1 | FileCheck -check-prefix=POSTRA %s
				# REQUIRES: asserts

				# PRERA: Cluster MFMA SU(2) - SU(6)
				# PRERA-NEXT: Cluster MFMA SU(6) - SU(10)
				# PRERA-NEXT: Cluster MFMA SU(10) - SU(12)

				# TWOLIMIT: Cluster MFMA SU(2) - SU(6)
				# TWOLIMIT: Cluster MFMA SU(10) - SU(11)

				# POSTRA: Cluster MFMA SU(2) - SU(6)
				# POSTRA-NEXT: Cluster MFMA SU(6) - SU(10)
				# POSTRA-NEXT: Cluster MFMA SU(10) - SU(12)

				---
				name: basic_cluster
				tracksRegLiveness: true
				body: |
				  bb.0:
				    liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				    $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				    $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				    $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				    $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				    $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				    $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				    $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				    $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				    $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				...

				# PRERA: Cluster MFMA SU(12) - SU(16)
				# PRERA-NEXT: Cluster MFMA SU(16) - SU(20)

				# POSTRA: Cluster MFMA SU(12) - SU(16)
				# POSTRA-NEXT: Cluster MFMA SU(16) - SU(20)

				---
				name: complex_cluster
				tracksRegLiveness: true
				body: |
				  bb.0:
				    liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $sgpr0, $vgpr10_vgpr11
				    $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				    $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				    $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				    $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				    GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				    $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				    $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				    $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				    $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				    $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				    $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				    $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				    $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				    $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				    $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				    $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				...

llvm/test/CodeGen/AMDGPU/mfma-cluster.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=machine-scheduler %s -o - -amdgpu-mfma-cluster=1 2>&1 | FileCheck -check-prefix=PRERA %s
				kerbowa [Done]: Add `# REQUIRES: asserts` to the top of this test.
				kerbowa [Not Done]: This test might be difficult to maintain. You might want to add two different test files. One file would not require asserts, would have many tests, and would have its check lines autogenerated with update_mir_test_checks.py. The test that requires asserts could just be a simple function to see if the clustering is working via the debug output, something that wouldn't be expected to change.
				arsenm [Not Done]: update_mir_test_checks lines and the manual debug output checks can coexist. It's just that update_mir_test_checks isn't very smart, and it requires hacking out the irregular run lines before running it.
				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=machine-scheduler %s -o - 2>&1 | FileCheck -check-prefix=DEFAULT %s
				# RUN: llc -march=amdgcn -mcpu=gfx90a -start-before=machine-scheduler -stop-after=postmisched %s -o - -amdgpu-mfma-cluster=1 2>&1 | FileCheck -check-prefix=BOTHSCHEDPASS %s
				rampitec [Done]: You need '; REQUIRES: asserts' if you want to inspect debug output.
				rampitec [Not Done]: We need to run it until after the post-RA scheduler to see that clusters hold.
				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=machine-scheduler %s -o - -amdgpu-mfma-cluster=1 -amdgpu-mfma-cluster-size=2 2>&1 | FileCheck -check-prefix=TWOLIMIT %s
				# RUN: llc -march=amdgcn -mcpu=gfx90a -run-pass=postmisched %s -o - -amdgpu-mfma-cluster=1 2>&1 | FileCheck -check-prefix=POSTRA %s


				---
				name: no_cluster
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $sgpr0, $vgpr10_vgpr11
				; PRERA-LABEL: name: no_cluster
				; PRERA: liveins: $sgpr0, $vgpr10_vgpr11
				; PRERA-NEXT: {{ $}}
				; PRERA-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; PRERA-NEXT: GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; PRERA-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				; PRERA-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; PRERA-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; PRERA-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; PRERA-NEXT: $vgpr7 = GLOBAL_LOAD_USHORT $vgpr8_vgpr9, 0, 0, implicit $exec
				; PRERA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; PRERA-NEXT: $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				kerbowa [Not Done]: Generally, the check lines should come before the function.
				; DEFAULT-LABEL: name: no_cluster
				; DEFAULT: liveins: $sgpr0, $vgpr10_vgpr11
				; DEFAULT-NEXT: {{ $}}
				; DEFAULT-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				rampitec [Done]: You also need to check resulting MIR.
				; DEFAULT-NEXT: $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; DEFAULT-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				; DEFAULT-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; DEFAULT-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; DEFAULT-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; DEFAULT-NEXT: $vgpr7 = GLOBAL_LOAD_USHORT $vgpr8_vgpr9, 0, 0, implicit $exec
				; DEFAULT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; DEFAULT-NEXT: $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				; BOTHSCHEDPASS-LABEL: name: no_cluster
				; BOTHSCHEDPASS: liveins: $sgpr0, $vgpr10_vgpr11
				; BOTHSCHEDPASS-NEXT: {{ $}}
				; BOTHSCHEDPASS-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr1 = V_ADD_F16_e32 killed $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: GLOBAL_STORE_DWORD killed $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr3 = DS_READ_U16_gfx9 killed $vgpr2, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr7 = GLOBAL_LOAD_USHORT killed $vgpr8_vgpr9, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, killed $vgpr0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr6 = V_MUL_LO_U32_e64 killed $vgpr1, killed $sgpr0, implicit $exec
				; TWOLIMIT-LABEL: name: no_cluster
				; TWOLIMIT: liveins: $sgpr0, $vgpr10_vgpr11
				; TWOLIMIT-NEXT: {{ $}}
				; TWOLIMIT-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; TWOLIMIT-NEXT: $vgpr7 = GLOBAL_LOAD_USHORT $vgpr8_vgpr9, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				; POSTRA-LABEL: name: no_cluster
				; POSTRA: liveins: $sgpr0, $vgpr10_vgpr11
				; POSTRA-NEXT: {{ $}}
				; POSTRA-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr1 = V_ADD_F16_e32 killed $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; POSTRA-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; POSTRA-NEXT: GLOBAL_STORE_DWORD killed $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr3 = DS_READ_U16_gfx9 killed $vgpr2, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr7 = GLOBAL_LOAD_USHORT killed $vgpr8_vgpr9, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, killed $vgpr0, implicit $exec
				; POSTRA-NEXT: $vgpr6 = V_MUL_LO_U32_e64 killed $vgpr1, killed $sgpr0, implicit $exec
				$vgpr1 = V_MOV_B32_e32 1, implicit $exec
				$vgpr0 = V_MOV_B32_e32 1, implicit $exec
				$vgpr8 = V_MOV_B32_e32 0, implicit $exec
				$vgpr9 = V_MOV_B32_e32 9, implicit $exec
				$vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				$vgpr2 = V_MOV_B32_e32 1, implicit $exec
				$vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				$vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				$vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				$vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				$vgpr7 = GLOBAL_LOAD_USHORT $vgpr8_vgpr9, 0, 0, implicit $exec
				...


				---
				name: basic_cluster
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				; PRERA-LABEL: name: basic_cluster
				; PRERA: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				; PRERA-NEXT: {{ $}}
				; PRERA-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; PRERA-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-LABEL: name: basic_cluster
				; DEFAULT: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				; DEFAULT-NEXT: {{ $}}
				; DEFAULT-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; DEFAULT-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-LABEL: name: basic_cluster
				; BOTHSCHEDPASS: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				; BOTHSCHEDPASS-NEXT: {{ $}}
				; BOTHSCHEDPASS-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				rampitec [Not Done]: So the cluster does not really hold?
				jrbyrnes (author) [Done]: Currently, clusters will be broken by (1) higher-priority instructions or (2) independent instructions. Here we see an independent instruction filling in the gap caused by a hardware hazard. I have tried disabling fillMFMAShadow, but this does not change the behavior. I think if we want unbroken clusters using SDep::Cluster, we need to address this via a different SchedStrategy (specifically, the logic in tryCandidate and pickNode). Should I start thinking about this? In the context of CK, broken clusters will cause problems if clusters of different types blend together. However, I think this won't happen due to dependencies, though it is hard to say without sample MIR.
				rampitec [Not Done]: If that is caused by a hazard, there is nothing we can really do about it; it will be broken one way or another. Thanks.
				jrbyrnes (author) [Done]: Hey Stas, thanks for your thoughts on this. Based on your comment yesterday, I looked deeper into the broken cluster issue and actually found a couple of flaws in this clustering algorithm that can result in avoidable broken clusters. I have addressed these and will release a patch soon. However, these changes will not affect the hazard issue identified here. I have just experimented with a feature that resolves MAI hazards in the scheduler before picking the next node. With this scheduler hack, we have perfect clustering for these tests. If we want unbroken clusters, I think we will need to expose a hacked scheduler. I have also considered bundling instead of clustering, but I think that will not work.
				kerbowa [Not Done]: I don't think it is necessarily a problem if they are not perfect clusters. The main idea is to have MAC clusters and VMEM/LDS clusters, not to have a specific number of perfectly sequential MFMAs.
				; BOTHSCHEDPASS-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr3, killed $vgpr4, killed $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr1, killed $vgpr0, killed $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr5, killed $vgpr6, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-LABEL: name: basic_cluster
				; TWOLIMIT: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				; TWOLIMIT-NEXT: {{ $}}
				; TWOLIMIT-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-LABEL: name: basic_cluster
				; POSTRA: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15
				; POSTRA-NEXT: {{ $}}
				; POSTRA-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; POSTRA-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr3, killed $vgpr4, killed $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr1, killed $vgpr0, killed $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr5, killed $vgpr6, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				$vgpr1 = V_MOV_B32_e32 1, implicit $exec
				$vgpr0 = V_MOV_B32_e32 1, implicit $exec
				$agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				$vgpr2 = V_MOV_B32_e32 1, implicit $exec
				$vgpr3 = V_MOV_B32_e32 1, implicit $exec
				$vgpr4 = V_MOV_B32_e32 1, implicit $exec
				$agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				$vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				$vgpr5 = V_MOV_B32_e32 1, implicit $exec
				$vgpr6 = V_MOV_B32_e32 1, implicit $exec
				$agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				$agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				$agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				...
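One way to make the "does the cluster hold?" question from the inline thread above concrete is to measure how long each unbroken run of MFMA instructions is in a scheduled sequence. The following is only a minimal sketch and is not part of the patch: `mfmaRunLengths` is a hypothetical helper, and treating any opcode whose name starts with `V_MFMA` as an MFMA is an assumption made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of D124678: returns the length of every
// maximal run of consecutive MFMA instructions in a scheduled sequence.
// Assumption: an opcode is an MFMA if its name starts with "V_MFMA".
std::vector<unsigned> mfmaRunLengths(const std::vector<std::string> &Opcodes) {
  std::vector<unsigned> Runs;
  unsigned Current = 0;
  for (const std::string &Op : Opcodes) {
    if (Op.rfind("V_MFMA", 0) == 0) {
      // Still inside a run of MFMAs.
      ++Current;
    } else if (Current != 0) {
      // A non-MFMA instruction ends the current run.
      Runs.push_back(Current);
      Current = 0;
    }
  }
  // Close a run that extends to the end of the block.
  if (Current != 0)
    Runs.push_back(Current);
  return Runs;
}
```

Applied to the opcode order in the BOTHSCHEDPASS check lines for basic_cluster, this yields runs of 1, 1 and 3: the first two MFMAs are each followed by a V_MOV_B32_e32, and only the last three are adjacent. The PRERA check lines for the same function give 3, 1 and 1.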


				---
				name: complex_cluster
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $sgpr0, $vgpr10_vgpr11
				; PRERA-LABEL: name: complex_cluster
				; PRERA: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $sgpr0, $vgpr10_vgpr11
				; PRERA-NEXT: {{ $}}
				; PRERA-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; PRERA-NEXT: GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; PRERA-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				; PRERA-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; PRERA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; PRERA-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; PRERA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; PRERA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; PRERA-NEXT: $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				; PRERA-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; PRERA-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; DEFAULT-LABEL: name: complex_cluster
				; DEFAULT: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $sgpr0, $vgpr10_vgpr11
				; DEFAULT-NEXT: {{ $}}
				; DEFAULT-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; DEFAULT-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				; DEFAULT-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; DEFAULT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; DEFAULT-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; DEFAULT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; DEFAULT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; DEFAULT-NEXT: $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				; DEFAULT-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; DEFAULT-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; BOTHSCHEDPASS-LABEL: name: complex_cluster
				; BOTHSCHEDPASS: liveins: $sgpr0, $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $vgpr10_vgpr11
				; BOTHSCHEDPASS-NEXT: {{ $}}
				; BOTHSCHEDPASS-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr1 = V_ADD_F16_e32 killed $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: GLOBAL_STORE_DWORD killed $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr3 = DS_READ_U16_gfx9 killed $vgpr2, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr3, killed $vgpr4, killed $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr5, killed $vgpr6, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, killed $vgpr0, implicit $exec
				; BOTHSCHEDPASS-NEXT: $vgpr6 = V_MUL_LO_U32_e64 killed $vgpr1, killed $sgpr0, implicit $exec
				; TWOLIMIT-LABEL: name: complex_cluster
				; TWOLIMIT: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $sgpr0, $vgpr10_vgpr11
				; TWOLIMIT-NEXT: {{ $}}
				; TWOLIMIT-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; TWOLIMIT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; TWOLIMIT-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; TWOLIMIT-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; TWOLIMIT-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; TWOLIMIT-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; POSTRA-LABEL: name: complex_cluster
				; POSTRA: liveins: $agpr0_agpr1_agpr2_agpr3, $agpr4_agpr5_agpr6_agpr7, $agpr8_agpr9_agpr10_agpr11, $agpr12_agpr13_agpr14_agpr15, $sgpr0, $vgpr10_vgpr11
				; POSTRA-NEXT: {{ $}}
				; POSTRA-NEXT: $vgpr1 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr0 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr3 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr4 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $vgpr1 = V_ADD_F16_e32 killed $vgpr1, $vgpr0, implicit $mode, implicit $exec
				; POSTRA-NEXT: GLOBAL_STORE_DWORD killed $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr3 = DS_READ_U16_gfx9 killed $vgpr2, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				; POSTRA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				; POSTRA-NEXT: $vgpr5 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr6 = V_MOV_B32_e32 1, implicit $exec
				; POSTRA-NEXT: $vgpr8 = V_MOV_B32_e32 0, implicit $exec
				; POSTRA-NEXT: $vgpr9 = V_MOV_B32_e32 9, implicit $exec
				; POSTRA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr3, killed $vgpr4, killed $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 killed $vgpr5, killed $vgpr6, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				; POSTRA-NEXT: $vgpr5 = V_XOR_B32_e32 $vgpr1, killed $vgpr0, implicit $exec
				; POSTRA-NEXT: $vgpr6 = V_MUL_LO_U32_e64 killed $vgpr1, killed $sgpr0, implicit $exec
				$vgpr1 = V_MOV_B32_e32 1, implicit $exec
				$vgpr0 = V_MOV_B32_e32 1, implicit $exec
				$vgpr8 = V_MOV_B32_e32 0, implicit $exec
				$vgpr9 = V_MOV_B32_e32 9, implicit $exec
				$agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
				$vgpr2 = V_MOV_B32_e32 1, implicit $exec
				$vgpr3 = V_MOV_B32_e32 1, implicit $exec
				$vgpr4 = V_MOV_B32_e32 1, implicit $exec
				$vgpr1 = V_ADD_F16_e32 $vgpr1, $vgpr0, implicit $mode, implicit $exec
				GLOBAL_STORE_DWORD $vgpr10_vgpr11, $vgpr1, 0, 0, implicit $exec
				$vgpr3 = DS_READ_U16_gfx9 $vgpr2, 0, 0, implicit $exec
				$vgpr4 = GLOBAL_LOAD_USHORT $vgpr0_vgpr1, 0, 0, implicit $exec
				$agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				$vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				$vgpr5 = V_MOV_B32_e32 1, implicit $exec
				$vgpr6 = V_MOV_B32_e32 1, implicit $exec
				$agpr8_agpr9_agpr10_agpr11 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, $agpr8_agpr9_agpr10_agpr11, 0, 0, 0, implicit $mode, implicit $exec
				$agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr5, $vgpr6, $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
				$vgpr5 = V_XOR_B32_e32 $vgpr1, $vgpr0, implicit $exec
				$vgpr6 = V_MUL_LO_U32_e64 $vgpr1, $sgpr0, implicit $exec
				$agpr12_agpr13_agpr14_agpr15 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, $agpr12_agpr13_agpr14_agpr15, 0, 0, 0, implicit $mode, implicit $exec
				...