This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Ignore artificial edges when forming store chains
Closed, Public

Authored by foad on Dec 19 2019, 9:59 AM.

Details

Summary

BaseMemOpClusterMutation::apply forms store chains by looking for
control (i.e. non-data) dependencies from one mem op to another.

In the test case, clusterNeighboringMemOps successfully clusters the
loads, and then adds artificial edges to the loads' successors as
described in the comment:

// Copy successor edges from SUa to SUb. Interleaving computation
// dependent on SUa can prevent load combining due to register reuse.

The effect of this is that *data* dependencies from one load to a store
are copied as *artificial* dependencies from a different load to the
same store.

Then when BaseMemOpClusterMutation::apply looks at the stores, it finds
that some of them have a control dependency on a previous load. This
breaks the chains, so the stores are not all considered part of the same
chain and will not all be clustered.

The fix is to only consider non-artificial control dependencies when
forming chains.
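
For reference, the change amounts to tightening the predicate used to choose a chain predecessor. Below is a simplified sketch of the relevant loop in BaseMemOpClusterMutation::apply (bookkeeping elided; details may differ slightly from the exact patch):

    // Group mem ops into chains keyed by their chain predecessor.
    for (SUnit &SU : DAG->SUnits) {
      if ((IsLoad && !SU.getInstr()->mayLoad()) ||
          (!IsLoad && !SU.getInstr()->mayStore()))
        continue;

      // Find the node this mem op is chained after by scanning its
      // predecessor edges. Previously any control (non-data) edge was
      // accepted, including the artificial edges added by
      // clusterNeighboringMemOps; now artificial edges are skipped so that
      // only real control dependencies determine chain membership.
      unsigned ChainPredID = DAG->SUnits.size();
      for (const SDep &Pred : SU.Preds) {
        if (Pred.isCtrl() && !Pred.isArtificial()) { // was: Pred.isCtrl()
          ChainPredID = Pred.getSUnit()->NodeNum;
          break;
        }
      }
      // Mem ops that end up with the same ChainPredID form one chain and
      // are later handed to clusterNeighboringMemOps together.
    }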

Diff Detail

Event Timeline

foad created this revision. Dec 19 2019, 9:59 AM
Herald added a project: Restricted Project. Dec 19 2019, 9:59 AM

Incidentally I did wonder if SDep::isCtrl itself should be taught that an artificial dependency should not be considered a control dependency.
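
For context, SDep::isCtrl is just shorthand for "not a data edge", so artificial edges currently count as control dependencies. A hypothetical sketch of the alternative pondered above (this is not what the patch does, since it would change behaviour for every caller of isCtrl):

    // Current helper in llvm/include/llvm/CodeGen/ScheduleDAG.h (paraphrased):
    // any non-data edge is treated as a control dependency.
    bool SDep::isCtrl() const { return getKind() != Data; }

    // Hypothetical alternative: exclude artificial edges as well. Shown as a
    // free function only for illustration; it is not part of this patch.
    bool isCtrlIgnoringArtificial(const SDep &Dep) {
      return Dep.getKind() != SDep::Data && !Dep.isArtificial();
    }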

foad edited the summary of this revision. Dec 19 2019, 10:03 AM

Unit tests: pass. 61028 tests passed, 0 failed and 728 were skipped.

clang-tidy: pass.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

This might be dangerous. You may cluster nodes in a way that makes it impossible to schedule them according to the artificial edges. Then again, these edges are there for a reason, aren't they? I also do not understand why it would only affect stores; there does not seem to be a check.

Maybe the logic that creates the artificial edges needs to be revised instead?

foad added a comment. Dec 20 2019, 1:23 AM

Maybe the logic that creates the artificial edges needs to be revised instead?

Maybe. I don't really understand this logic. If it is not required for correctness, maybe it should add Weak edges instead of Artificial edges?
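
For context, here is a simplified sketch of the successor-copying loop in clusterNeighboringMemOps that the quoted comment comes from, with the weak-edge variant shown alongside (the weak-edge line only illustrates the idea above and is not part of any patch):

    // After clustering a pair SUa -> SUb, copy SUa's successor edges to SUb
    // so that computation dependent on SUa cannot be scheduled between the
    // clustered mem ops (simplified from clusterNeighboringMemOps).
    for (const SDep &Succ : SUa->Succs) {
      if (Succ.getSUnit() == SUb)
        continue;
      // Today: a strong artificial edge that the scheduler must respect.
      DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));
      // Hypothetical alternative discussed above: a weak edge that the
      // scheduling strategy may violate if it can prove it is safe to do so.
      // DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Weak));
    }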

foad added a reviewer: MatzeB. Dec 20 2019, 1:38 AM

Maybe the logic that creates the artificial edges needs to be revised instead?

Maybe. I don't really understand this logic. If it is not required for correctness, maybe it should add Weak edges instead of Artificial edges?

AFAIR the logic is to cross-transfer the successors of every node in the cluster to every other node, to prevent any of those successors from being scheduled inside the cluster. The same is done for predecessors, to prevent any predecessor from being scheduled inside the cluster. But that way we end up with a forest of cross edges.

Also, any user transformation may insert artificial edges for any other reason. I do not think we can simply ignore them without removing them.

Maybe what we need instead of cross edges is a single post-dominator node which becomes the single predecessor for all successors of the nodes in a cluster. The same goes for predecessors: we could just use a single dominator as the common successor of the predecessors. I.e. we would have just two guard nodes instead of all of the nodes being guards. I think the first and the last node in a cluster could become such single-use guards.
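
A rough illustration of the guard-node suggestion (the names MemOps and LastInCluster are assumptions made for this sketch; a real implementation of the idea appears as D73509 below):

    // Instead of cross-copying edges between every pair of nodes in the
    // cluster, route every external successor through the last clustered
    // node, which acts as the single "guard". MemOps is assumed to hold the
    // clustered SUnits in order.
    SUnit *LastInCluster = MemOps.back();
    for (SUnit *SU : MemOps) {
      if (SU == LastInCluster)
        continue;
      for (const SDep &Succ : SU->Succs) {
        // Edges that stay inside the cluster are left alone.
        if (llvm::is_contained(MemOps, Succ.getSUnit()))
          continue;
        // External successors now depend on the last clustered node, so
        // nothing can be scheduled in the middle of the cluster, and far
        // fewer edges are created than with full cross-copying.
        DAG->addEdge(Succ.getSUnit(), SDep(LastInCluster, SDep::Artificial));
      }
    }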

Here is the patch to only copy successors to a single post-dominator: D73509. It works as desired and does not ignore any edges; instead, it simply does not create the unneeded edges in the first place.

rampitec accepted this revision. Jan 28 2020, 12:52 PM

I have realized that BaseMemOpClusterMutation::apply() in fact does not check all control dependencies and just breaks at the first one. I.e. this change just skips some SDeps, preferring another one. More or less, we are lucky to find a correct SDep which can form a useful chain. We might not be that lucky if the order of the SDeps were different and we ended up using another register (for example a data register). We probably need a callback to check whether that register belongs to the pointer operand, and skip the SDep otherwise. Alternatively we may need a full search to find the best SDep in the list.
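
To make the suggestion concrete, here is a sketch of what such a fuller search might look like; dependsOnPointerOperand is a made-up placeholder for the callback described above, not an existing hook:

    // Hypothetical refinement (not part of this patch): instead of breaking
    // at the first control dependency, scan all of them and only accept one
    // that feeds the address computation of this mem op.
    unsigned ChainPredID = DAG->SUnits.size();
    for (const SDep &Pred : SU.Preds) {
      if (!Pred.isCtrl() || Pred.isArtificial())
        continue;
      // dependsOnPointerOperand() stands in for the suggested target
      // callback that checks whether the dependency is on a register used
      // by the pointer operand; other SDeps are skipped rather than ending
      // the search.
      if (!dependsOnPointerOperand(SU, Pred))
        continue;
      ChainPredID = Pred.getSUnit()->NodeNum;
      break;
    }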

This LGTM, but can you please add the cluster_load_valu_cluster_store function from the test case in D73509? At the moment the stores are not properly clustered:

flat_store_dword v[2:3], v4
v_add_u32_e32 v1, 1, v5
flat_store_dword v[2:3], v6 offset:16
flat_store_dword v[2:3], v1 offset:8
flat_store_dword v[2:3], v0 offset:24

This revision is now accepted and ready to land. Jan 28 2020, 12:52 PM
foad added a comment. Jan 29 2020, 3:32 AM

I can add it but I get different results. With D73509:

	flat_store_dword v[2:3], v4
	flat_store_dword v[2:3], v6 offset:16
	flat_store_dword v[2:3], v0 offset:8
	flat_store_dword v[2:3], v7 offset:24

With this patch:

	flat_store_dword v[2:3], v4
	flat_store_dword v[2:3], v0 offset:8
	flat_store_dword v[2:3], v6 offset:16
	flat_store_dword v[2:3], v7 offset:24

I can see from the debug output that all four stores are being clustered now.

Do you prefer tests that just check the generated code, instead of checking the -debug-only output? It seems to me that there is a high chance of stores getting clustered by accident, even if the scheduler is not doing the right thing. E.g. the scheduler could do nothing at all and the test would still pass, because the loads and stores are already in the correct order before scheduling!

Right, the sort order is different, but now the stores are clustered and before they were not. I think it is OK to use the debug output in this case; I agree it is very easy to get them accidentally clustered.

This revision was automatically updated to reflect the committed changes.