This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Ignore artificial edges when forming store chains
Closed, Public

Authored by foad on Dec 19 2019, 9:59 AM.

Details

Summary

BaseMemOpClusterMutation::apply forms store chains by looking for
control (i.e. non-data) dependencies from one mem op to another.

In the test case, clusterNeighboringMemOps successfully clusters the
loads, and then adds artificial edges to the loads' successors as
described in the comment:

// Copy successor edges from SUa to SUb. Interleaving computation
// dependent on SUa can prevent load combining due to register reuse.

The effect of this is that *data* dependencies from one load to a store
are copied as *artificial* dependencies from a different load to the
same store.

Then when BaseMemOpClusterMutation::apply looks at the stores, it finds
that some of them have a control dependency on a previous load. This
breaks the chains, so the stores are not all considered part of the same
chain and will not all be clustered.

The fix is to only consider non-artificial control dependencies when
forming chains.
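
For reference, the change amounts to tightening the predicate used to choose a chain predecessor. Below is a simplified sketch of the relevant loop in BaseMemOpClusterMutation::apply (bookkeeping elided; details may differ slightly from the exact patch):

    // Group mem ops into chains keyed by their chain predecessor.
    for (SUnit &SU : DAG->SUnits) {
      if ((IsLoad && !SU.getInstr()->mayLoad()) ||
          (!IsLoad && !SU.getInstr()->mayStore()))
        continue;

      // Find the node this mem op is chained after by scanning its
      // predecessor edges. Previously any control (non-data) edge was
      // accepted, including the artificial edges added by
      // clusterNeighboringMemOps; now artificial edges are skipped so that
      // only real control dependencies determine chain membership.
      unsigned ChainPredID = DAG->SUnits.size();
      for (const SDep &Pred : SU.Preds) {
        if (Pred.isCtrl() && !Pred.isArtificial()) { // was: Pred.isCtrl()
          ChainPredID = Pred.getSUnit()->NodeNum;
          break;
        }
      }
      // Mem ops that end up with the same ChainPredID form one chain and
      // are later handed to clusterNeighboringMemOps together.
    }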

Diff Detail

Event Timeline

foad created this revision. Dec 19 2019, 9:59 AM
Herald added a project: Restricted Project. Dec 19 2019, 9:59 AM

Incidentally I did wonder if SDep::isCtrl itself should be taught that an artificial dependency should not be considered a control dependency.
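
For context, SDep::isCtrl is just shorthand for "not a data edge", so artificial edges currently count as control dependencies. A hypothetical sketch of the alternative pondered above (this is not what the patch does, since it would change behaviour for every caller of isCtrl):

    // Current helper in llvm/include/llvm/CodeGen/ScheduleDAG.h (paraphrased):
    // any non-data edge is treated as a control dependency.
    bool SDep::isCtrl() const { return getKind() != Data; }

    // Hypothetical alternative: exclude artificial edges as well. Shown as a
    // free function only for illustration; it is not part of this patch.
    bool isCtrlIgnoringArtificial(const SDep &Dep) {
      return Dep.getKind() != SDep::Data && !Dep.isArtificial();
    }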

foad edited the summary of this revision. Dec 19 2019, 10:03 AM

Unit tests: pass. 61028 tests passed, 0 failed and 728 were skipped.

clang-tidy: pass.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

This might be dangerous. You may cluster nodes in a way that makes it impossible to schedule them according to the artificial edges. Then again, these edges are there for a reason, aren't they? I also do not understand why it would only affect stores; there does not seem to be a check.

Maybe the logic that creates the artificial edges needs to be revised instead?

foad added a comment. Dec 20 2019, 1:23 AM

Maybe the logic that creates the artificial edges needs to be revised instead?

Maybe. I don't really understand this logic. If it is not required for correctness, maybe it should add Weak edges instead of Artificial edges?
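
For context, here is a simplified sketch of the successor-copying loop in clusterNeighboringMemOps that the quoted comment comes from, with the weak-edge variant shown alongside (the weak-edge line only illustrates the idea above and is not part of any patch):

    // After clustering a pair SUa -> SUb, copy SUa's successor edges to SUb
    // so that computation dependent on SUa cannot be scheduled between the
    // clustered mem ops (simplified from clusterNeighboringMemOps).
    for (const SDep &Succ : SUa->Succs) {
      if (Succ.getSUnit() == SUb)
        continue;
      // Today: a strong artificial edge that the scheduler must respect.
      DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));
      // Hypothetical alternative discussed above: a weak edge that the
      // scheduling strategy may violate if it can prove it is safe to do so.
      // DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Weak));
    }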

foad added a reviewer: MatzeB. Dec 20 2019, 1:38 AM

Maybe the logic that creates the artificial edges needs to be revised instead?

Maybe. I don't really understand this logic. If it is not required for correctness, maybe it should add Weak edges instead of Artificial edges?

AFAIR the logic is to cross-transfer the successors of every node in the cluster to every other node, to prevent any of those successors from being scheduled inside the cluster. The same is done for predecessors, to prevent any predecessor from being scheduled inside the cluster. But that way we end up with a forest of cross edges.

Also, any user transformation may insert artificial edges for any other reason. I do not think we can simply ignore them without removing them.

Maybe what we need instead of cross edges is a single post-dominator node which becomes the single predecessor for all successors of the nodes in a cluster. The same goes for predecessors: we could just use a single dominator as the common successor of the predecessors. I.e. we would have just two guard nodes instead of all of the nodes being guards. I think the first and the last node in a cluster could become such single-use guards.
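
A rough illustration of the guard-node suggestion (the names MemOps and LastInCluster are assumptions made for this sketch; a real implementation of the idea appears as D73509 below):

    // Instead of cross-copying edges between every pair of nodes in the
    // cluster, route every external successor through the last clustered
    // node, which acts as the single "guard". MemOps is assumed to hold the
    // clustered SUnits in order.
    SUnit *LastInCluster = MemOps.back();
    for (SUnit *SU : MemOps) {
      if (SU == LastInCluster)
        continue;
      for (const SDep &Succ : SU->Succs) {
        // Edges that stay inside the cluster are left alone.
        if (llvm::is_contained(MemOps, Succ.getSUnit()))
          continue;
        // External successors now depend on the last clustered node, so
        // nothing can be scheduled in the middle of the cluster, and far
        // fewer edges are created than with full cross-copying.
        DAG->addEdge(Succ.getSUnit(), SDep(LastInCluster, SDep::Artificial));
      }
    }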

Here is the patch to only copy successors to a single post-dominator: D73509. It works as desired and does not ignore any edges; instead, it simply does not create the unneeded edges in the first place.

rampitec accepted this revision. Jan 28 2020, 12:52 PM

I have realized that BaseMemOpClusterMutation::apply() in fact does not check all control dependencies and just breaks at the first one. I.e. this change just skips some SDeps, preferring another one. More or less, we are lucky to find a correct SDep which can form a useful chain. We might not be that lucky if the order of the SDeps were different and we ended up using another register (for example a data register). We probably need a callback to check whether that register belongs to the pointer operand, and skip the SDep otherwise. Alternatively we may need a full search to find the best SDep in the list.
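
To make the suggestion concrete, here is a sketch of what such a fuller search might look like; dependsOnPointerOperand is a made-up placeholder for the callback described above, not an existing hook:

    // Hypothetical refinement (not part of this patch): instead of breaking
    // at the first control dependency, scan all of them and only accept one
    // that feeds the address computation of this mem op.
    unsigned ChainPredID = DAG->SUnits.size();
    for (const SDep &Pred : SU.Preds) {
      if (!Pred.isCtrl() || Pred.isArtificial())
        continue;
      // dependsOnPointerOperand() stands in for the suggested target
      // callback that checks whether the dependency is on a register used
      // by the pointer operand; other SDeps are skipped rather than ending
      // the search.
      if (!dependsOnPointerOperand(SU, Pred))
        continue;
      ChainPredID = Pred.getSUnit()->NodeNum;
      break;
    }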

This LGTM, but can you please add the cluster_load_valu_cluster_store function from the test case in D73509? At the moment the stores are not properly clustered:

flat_store_dword v[2:3], v4
v_add_u32_e32 v1, 1, v5
flat_store_dword v[2:3], v6 offset:16
flat_store_dword v[2:3], v1 offset:8
flat_store_dword v[2:3], v0 offset:24

This revision is now accepted and ready to land. Jan 28 2020, 12:52 PM
foad added a comment. Jan 29 2020, 3:32 AM

I can add it but I get different results. With D73509:

	flat_store_dword v[2:3], v4
	flat_store_dword v[2:3], v6 offset:16
	flat_store_dword v[2:3], v0 offset:8
	flat_store_dword v[2:3], v7 offset:24

With this patch:

	flat_store_dword v[2:3], v4
	flat_store_dword v[2:3], v0 offset:8
	flat_store_dword v[2:3], v6 offset:16
	flat_store_dword v[2:3], v7 offset:24

I can see from the debug output that all four stores are being clustered now.

Do you prefer tests that just check the generated code, instead of checking the -debug-only output? It seems to me that there is a high chance of stores getting clustered by accident, even if the scheduler is not doing the right thing. E.g. the scheduler could do nothing at all and the test would still pass, because the loads and stores are already in the correct order before scheduling!

Right, the sort order is different, but now the stores are clustered and before they were not. I think it is OK to use the debug output in this case; I agree it is very easy to get them accidentally clustered.

This revision was automatically updated to reflect the committed changes.