This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Improve the consistency of instruction fusion
ClosedPublic

Authored by evandro on Aug 14 2017, 11:50 AM.

Details

Summary

When either instruction in a fused pair has no other dependency, besides on the other instruction, make sure that other instructions do not get scheduled between them. Additionally, avoid fusing an instruction more than once along the same dependency chain.

Diff Detail

Event Timeline

evandro created this revision.Aug 14 2017, 11:50 AM
evandro updated this revision to Diff 111417.Aug 16 2017, 1:42 PM
evandro retitled this revision from [CodeGen] Minimize the dependency graph of fused instructions to [CodeGen] Improve the consistency of instruction fusion.
evandro edited the summary of this revision. (Show Details)
evandro removed a subscriber: hiraditya.

Expand the scope of the patch to make it more generic.

fhahn edited edge metadata.Aug 16 2017, 3:20 PM

I think this changes makes sense and thanks for getting rid of the if (that's not needed any longer) and making the debug message clearer too! I guess adding a test for the case where SecondSU has dependencies that are scheduled between FirstSU and SecondSU is difficult, as in most fused instruction pairs the second instruction only depends on the first instruction?

I am not sure about excluding control dependencies, but I am probably missing a constraint on control dependencies. Couldn't it be the case that an instruction has a control dependency on FirstSU, so after scheduling FirstSU, we could schedule this instruction instead of SecondSU?

There may be situations when the SecondSU has a couple of predecessors, but only one of them makes a pair and becomes the FirstSU. I think that if one chooses the scheduler to be only either top-down or bottom-up, this change would keep the results consistent.

Control dependencies are weaker than data dependencies and, since I could not find different results, it seemed sensible to trim the graph.

There may be situations when the SecondSU has a couple of predecessors, but only one of them makes a pair and becomes the FirstSU. I think that if one chooses the scheduler to be only either top-down or bottom-up, this change would keep the results consistent.

Yep that's a good call I think.

Control dependencies are weaker than data dependencies and, since I could not find different results, it seemed sensible to trim the graph.

OK. IIRC only adding artificial edges for data dependencies was something I tried in my initial patch, but there were some cases where the results are worse. I'll try to chase this up tomorrow.

fhahn added a comment.Aug 22 2017, 2:20 AM

The following tests fail for me on latest master for me (check-llvm-codegen) with this patch applied.

LLVM :: CodeGen/AMDGPU/addrspacecast.ll
LLVM :: CodeGen/AMDGPU/macro-fusion-cluster-vcc-uses.mir
LLVM :: CodeGen/AMDGPU/select-vectors.ll
LLVM :: CodeGen/AMDGPU/udivrem.ll
evandro added a subscriber: arsenm.

Looping @arsenm in to understand these miscomparisons.

The AMDGPU tests failed with:

test/CodeGen/AMDGPU/addrspacecast.ll:272:7: error: expected string not found in input
; CI: s_lshr_b32 flat_scratch_hi, [[ADD]], 8
      ^
<stdin>:1645:2: note: scanning from here
 s_waitcnt lgkmcnt(0)
 ^
<stdin>:1645:2: note: with variable "ADD" equal to "s8"
 s_waitcnt lgkmcnt(0)
 ^
<stdin>:1661:41: note: possible intended match here
 .size store_flat_scratch, .Lfunc_end15-store_flat_scratch
                                        ^
test/CodeGen/AMDGPU/fceil64.ll:18:11: error: expected string not found in input
; SI-DAG: s_add_i32 [[SEXP0:s[0-9]+]], [[SEXP]], 0xfffffc01
          ^
<stdin>:27:2: note: scanning from here
 s_addk_i32 s7, 0xfc01
 ^
<stdin>:27:2: note: with variable "SEXP" equal to "s7"
 s_addk_i32 s7, 0xfc01
 ^
test/CodeGen/AMDGPU/macro-fusion-cluster-vcc-uses.mir:205:13: error: expected string not found in input
# GCN-NEXT: %3 = V_MOV_B32_e32 0, implicit %exec
            ^
<stdin>:447:2: note: scanning from here
 dead %6 = V_CNDMASK_B32_e64 %1, %3, %4, implicit %exec
 ^
<stdin>:447:22: note: possible intended match here
 dead %6 = V_CNDMASK_B32_e64 %1, %3, %4, implicit %exec
                     ^
test/CodeGen/AMDGPU/select-vectors.ll:219:8: error: expected string not found in input
; GCN: v_mov_b32_e32 v{{[0-9]+}}, s[[BLO]]
       ^
<stdin>:923:20: note: scanning from here
 v_cndmask_b32_e32 v1, v0, v1, vcc
                   ^
<stdin>:923:20: note: with variable "BLO" equal to "8"
 v_cndmask_b32_e32 v1, v0, v1, vcc
                   ^
<stdin>:925:2: note: possible intended match here
 v_cndmask_b32_e32 v0, v2, v3, vcc
 ^
evandro updated this revision to Diff 116087.Sep 20 2017, 3:02 PM
evandro edited the summary of this revision. (Show Details)
evandro set the repository for this revision to rL LLVM.

Ping πŸ””

MatzeB added inline comments.Sep 29 2017, 1:59 PM
llvm/lib/CodeGen/MacroFusion.cpp
38–46

Given the code below that adds all succs of first as succs of second, and vice versa for preds of seconds. How can the situation you are checking here even happen without us failing the cycle check in addEdge anyway?

74

Why did you move the SecondSU == ExitSU check into the loop? The condition won't change with different elements.

83–95

This seems unnecessary for SecondSU == ExitSU, maybe shortcut it?

119–122

I'm not sure if (A || (!B && !C)) continue is easier to understand than if (A || B || C) continue...

And I probably should have complained in D34144 already, but this comment doesn't relect reality anymore! And we should document our assumption that fusion can only happen when there is a data dependency or and Order dependency in case of physreg deps (as X86 eflags).

evandro marked 2 inline comments as done.Sep 29 2017, 3:12 PM
evandro added inline comments.
llvm/lib/CodeGen/MacroFusion.cpp
38–46

It can happen that an instr can be eligible for fusion for more than one other instruction. This check limits the fusion to the first seen eligible one.

83–95

Methinks that it is then, to avoid the predecessors of the ExitSU from being scheduled between it and the FirstSU.

evandro updated this revision to Diff 117237.Sep 29 2017, 3:21 PM
MatzeB added inline comments.Sep 29 2017, 3:23 PM
llvm/lib/CodeGen/MacroFusion.cpp
38–46

Hmm ok I was thinking of this:

  A
 / \
B   C

If we see A+B first and fuse it the loop further down will add an extra edge between B and C:

  A
 / \
B  |
 \ |
   C

at this point fusing A+C is pointless because we always must schedule B in between. I was wrong however in thinking that the following addEdge would fail because of this. Nonetheless it would be good to have a check for this situation to avoid adding cluster edges for pairs that cannot be scheduled next to each other anyway. Maybe we should have reachability checks from FirstSU successors to SecondSU and bail if one is found.

evandro added inline comments.Sep 29 2017, 3:26 PM
llvm/lib/CodeGen/MacroFusion.cpp
38–46

What do you mean by reachability exactly? Also, do you think that addEdge() should have this check added?

MatzeB added inline comments.Sep 29 2017, 3:29 PM
llvm/lib/CodeGen/MacroFusion.cpp
38–46

Having a path between the two SDNodes, there is Topo.IsReachable(A, B). It is used by addEdge, canAddEdge already to avoid situations in which we would form cycles which would be plain invalid.
This case however would not be an invalid ScheduleDAG, just one where we would know beforehand that we cannot fulfill the cluster dependency. So this should be checked outside of addEdge() IMO.

evandro marked 5 inline comments as done.Oct 3 2017, 11:37 AM

Ping πŸ””

MatzeB edited edge metadata.Oct 10 2017, 2:34 PM

You can push this if you want, but I still think that first too loops in fuseInstructionPair() are a poor mans version of the reachability check I outlined in the comments (which would catch more cases).

You can push this if you want, but I still think that first too loops in fuseInstructionPair() are a poor mans version of the reachability check I outlined in the comments (which would catch more cases).

I thought that our comments crossed past each other. Let me give this some thought...

evandro updated this revision to Diff 123840.Nov 21 2017, 1:22 PM

Add comment hopefully reflecting conversation at the US Developers Meeting.

fhahn added inline comments.Nov 27 2017, 9:00 AM
llvm/lib/CodeGen/MacroFusion.cpp
79

Would it make sense to move this code to a function (maybe just a closure in this function? I suspect the fact that we use isPred/isSucc makes things slightly tricky? :(

evandro added inline comments.Nov 27 2017, 12:16 PM
llvm/lib/CodeGen/MacroFusion.cpp
79

Where would you suggest, as a new method in SUinit?

evandro added inline comments.Nov 28 2017, 8:40 AM
llvm/lib/CodeGen/MacroFusion.cpp
79

Effectively, this test is equivalent to SDep::getKind() == SDep::Order || SDep::getKind() == SDep::Output.

evandro marked an inline comment as done.Nov 28 2017, 8:43 AM
evandro added inline comments.
llvm/lib/CodeGen/MacroFusion.cpp
79

Hit send before finishing editing the comment above.

Again, effectively, this test is equivalent to SDep::getKind() == SDep::Anti || SDep::getKind() == SDep::Output.

evandro updated this revision to Diff 124602.Nov 28 2017, 10:14 AM
evandro marked 4 inline comments as done.

Move frequent check into separate function.

Ping πŸ””

fhahn accepted this revision.Dec 6 2017, 7:03 AM

Thanks Evandro. I think this looks good to me now. I still think it's unfortunate that we basically duplicate the loop to add artificial dependences, but I don't see a way to share the code without making it more complicated than it is now. Please wait a bit with committing, to give Matthias a chance to voice additional concerns.

llvm/lib/CodeGen/MacroFusion.cpp
79

Hard to tell in Phabricator, could you check if that could go on the previous line?

This revision is now accepted and ready to land.Dec 6 2017, 7:03 AM
evandro marked an inline comment as done.Dec 6 2017, 1:05 PM
evandro added inline comments.
llvm/lib/CodeGen/MacroFusion.cpp
79

Yes, it would.

evandro marked an inline comment as done.Dec 6 2017, 1:05 PM
evandro marked an inline comment as done.Dec 11 2017, 9:02 AM
This revision was automatically updated to reflect the committed changes.