This is an archive of the discontinued LLVM Phabricator instance.

MachineScheduler: Enable macro fusion in post-RA scheduler
AbandonedPublic

Authored by MatzeB on Sep 22 2016, 6:22 PM.

Details

Summary

The post-RA scheduler should respect macro fusion opportunities.

This also changes the strategy to enforce adjacent scheduling: While we
previously added a weak clustering edge between the fusing nodes we now
add strong artificial ordering edges towards all other nodes to enforce
the order. This was found necessary to avoid cases in which the cost
heuristic for the weak edges did not have any effect because some nodes
were still in the pending queue and not even considered by
tryCandidate().

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB updated this revision to Diff 72232.Sep 22 2016, 6:22 PM
MatzeB retitled this revision from to MachineScheduler: Enable macro fusion in post-RA scheduler.
MatzeB updated this object.
MatzeB added reviewers: atrick, tstellarAMD, jonpa.
MatzeB set the repository for this revision to rL LLVM.
MatzeB added a subscriber: llvm-commits.
kparzysz accepted this revision.Sep 26 2016, 5:16 AM
kparzysz edited edge metadata.

This looks fairly straightforward... LGTM.

lib/CodeGen/MachineScheduler.cpp
2870

delay -> Delay

This revision is now accepted and ready to land.Sep 26 2016, 5:16 AM
MatzeB updated this revision to Diff 72937.Sep 28 2016, 6:11 PM
MatzeB updated this object.
MatzeB edited edge metadata.

It turned out the approach taken previously was not enough: Currently nodes predicted to stall will end up in the pending queue and not even get consider in tryCandidate() for the usual heuristics. However a possible stalls should not get in the way of the macrofusion heuristic so instead of adjusting the picking heuristic this patch adds artificial scheduling edges to the roots to enforce the adjacent scheduling.

PS: In some internal discussions we decided that a nice long term solution would be to merge the fusing nodes instead in the preparation step (by creating an instruction bundle and merging the ScheduleDAG nodes). However merging nodes after creating the scheduling DAG turns out to be tricky because the existing code expects a fixed number of SUnits, we would need to update or recompute the topological ordering, etc. So to get this specific problem under control I found adding edges to root nodes a robust and simpler solution for now.

atrick edited edge metadata.Sep 28 2016, 10:01 PM

Marking a DAG node as "dead" for the purpose of scheduling should be an easy thing to do, relative to supporting instruction bundles. But adding the extra DAG edges is also a fine solution, just not quite as direct.

MatzeB updated this revision to Diff 73369.Oct 3 2016, 4:14 PM
MatzeB edited edge metadata.

Update the patch so that addFusionEdges() does not walk over all edges in the graph anymore. (Just all nodes and the predecessors edges of ExitSU now).

Doesn't this force macro fusion for all targets/subtargets? If we wanted to do that, we wouldn't need the cluster edge and scheduler heuristic anymore. Shouldn't there be a TII->forceMacroFusion() option?

Doesn't this force macro fusion for all targets/subtargets? If we wanted to do that, we wouldn't need the cluster edge and scheduler heuristic anymore. Shouldn't there be a TII->forceMacroFusion() option?

If the target doesn't implement TII::shouldScheduleAdjacent() then no fusion will happen. Of course this commit forces fusion to be respected even in the presence of possible stalls in the scheduling model. This is a switch of priorities and indeed does not allow you any more to insert the macrofusion check at any place in the heuristic (the scheduling model/stalls aren't part of that heuristic either so I couldn't just move the cluster edge heuristic to an earlier place in tryCandidate() so I went this route).

Do you think a TII->forceMacroFusion() hook is necessary? Currently the only targets implementing shouldScheduleAdjacent() are X86 and AArch64 and they should both prefer fusion over reported stalls, making the cluster/weak solution untested/dead code.

On x86 it's meant to be a heuristic. If you think all subtargets should instead force macro-fusion before scheduling (and have benchmarks to prove it) then we should delete the code that implements the heuristic.

On x86 it's meant to be a heuristic. If you think all subtargets should instead force macro-fusion before scheduling (and have benchmarks to prove it) then we should delete the code that implements the heuristic.

My experience has been that this mostly matters/changes the outcome when scheduling top-down, which currently happens only in the PostRAScheduler. The combination of PostRAScheduling and MacroOpFusion enabled only seems to happen in the BtVer2/Jaguar scheduling model at the moment for which I have no hardware to test. So I have no good way of benchmarking this (but also no indication why this would ever be good as a heuristic).

Anyway I to keep the possibility of weak edges, I'd add make shouldScheduleAdjacent() return an enum value to indicate whether weak/hard edges should be used. I'll update that in the next days.

MatzeB abandoned this revision.Jul 19 2017, 4:16 PM

This is out of date. Nowadays targets can decide themselfes whether they want post-ra fusion by overriding TaretPassConfig::createPostMachineScheduler() and adding a dag mutation there.