When either instruction in a fused pair has no other dependency, besides on the other instruction, make sure that other instructions do not get scheduled between them. Additionally, avoid fusing an instruction more than once along the same dependency chain.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I think this changes makes sense and thanks for getting rid of the if (that's not needed any longer) and making the debug message clearer too! I guess adding a test for the case where SecondSU has dependencies that are scheduled between FirstSU and SecondSU is difficult, as in most fused instruction pairs the second instruction only depends on the first instruction?
I am not sure about excluding control dependencies, but I am probably missing a constraint on control dependencies. Couldn't it be the case that an instruction has a control dependency on FirstSU, so after scheduling FirstSU, we could schedule this instruction instead of SecondSU?
There may be situations when the SecondSU has a couple of predecessors, but only one of them makes a pair and becomes the FirstSU. I think that if one chooses the scheduler to be only either top-down or bottom-up, this change would keep the results consistent.
Control dependencies are weaker than data dependencies and, since I could not find different results, it seemed sensible to trim the graph.
Yep that's a good call I think.
Control dependencies are weaker than data dependencies and, since I could not find different results, it seemed sensible to trim the graph.
OK. IIRC only adding artificial edges for data dependencies was something I tried in my initial patch, but there were some cases where the results are worse. I'll try to chase this up tomorrow.
The following tests fail for me on latest master for me (check-llvm-codegen) with this patch applied.
LLVM :: CodeGen/AMDGPU/addrspacecast.ll LLVM :: CodeGen/AMDGPU/macro-fusion-cluster-vcc-uses.mir LLVM :: CodeGen/AMDGPU/select-vectors.ll LLVM :: CodeGen/AMDGPU/udivrem.ll
The AMDGPU tests failed with:
test/CodeGen/AMDGPU/addrspacecast.ll:272:7: error: expected string not found in input ; CI: s_lshr_b32 flat_scratch_hi, [[ADD]], 8 ^ <stdin>:1645:2: note: scanning from here s_waitcnt lgkmcnt(0) ^ <stdin>:1645:2: note: with variable "ADD" equal to "s8" s_waitcnt lgkmcnt(0) ^ <stdin>:1661:41: note: possible intended match here .size store_flat_scratch, .Lfunc_end15-store_flat_scratch ^
test/CodeGen/AMDGPU/fceil64.ll:18:11: error: expected string not found in input ; SI-DAG: s_add_i32 [[SEXP0:s[0-9]+]], [[SEXP]], 0xfffffc01 ^ <stdin>:27:2: note: scanning from here s_addk_i32 s7, 0xfc01 ^ <stdin>:27:2: note: with variable "SEXP" equal to "s7" s_addk_i32 s7, 0xfc01 ^
test/CodeGen/AMDGPU/macro-fusion-cluster-vcc-uses.mir:205:13: error: expected string not found in input # GCN-NEXT: %3 = V_MOV_B32_e32 0, implicit %exec ^ <stdin>:447:2: note: scanning from here dead %6 = V_CNDMASK_B32_e64 %1, %3, %4, implicit %exec ^ <stdin>:447:22: note: possible intended match here dead %6 = V_CNDMASK_B32_e64 %1, %3, %4, implicit %exec ^
test/CodeGen/AMDGPU/select-vectors.ll:219:8: error: expected string not found in input ; GCN: v_mov_b32_e32 v{{[0-9]+}}, s[[BLO]] ^ <stdin>:923:20: note: scanning from here v_cndmask_b32_e32 v1, v0, v1, vcc ^ <stdin>:923:20: note: with variable "BLO" equal to "8" v_cndmask_b32_e32 v1, v0, v1, vcc ^ <stdin>:925:2: note: possible intended match here v_cndmask_b32_e32 v0, v2, v3, vcc ^
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
42β50 | Given the code below that adds all succs of first as succs of second, and vice versa for preds of seconds. How can the situation you are checking here even happen without us failing the cycle check in addEdge anyway? | |
106 | Why did you move the SecondSU == ExitSU check into the loop? The condition won't change with different elements. | |
115β127 | This seems unnecessary for SecondSU == ExitSU, maybe shortcut it? | |
151β154 | I'm not sure if (A || (!B && !C)) continue is easier to understand than if (A || B || C) continue... And I probably should have complained in D34144 already, but this comment doesn't relect reality anymore! And we should document our assumption that fusion can only happen when there is a data dependency or and Order dependency in case of physreg deps (as X86 eflags). |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
42β50 | It can happen that an instr can be eligible for fusion for more than one other instruction. This check limits the fusion to the first seen eligible one. | |
115β127 | Methinks that it is then, to avoid the predecessors of the ExitSU from being scheduled between it and the FirstSU. |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
42β50 | Hmm ok I was thinking of this: A / \ B C If we see A+B first and fuse it the loop further down will add an extra edge between B and C: A / \ B | \ | C at this point fusing A+C is pointless because we always must schedule B in between. I was wrong however in thinking that the following addEdge would fail because of this. Nonetheless it would be good to have a check for this situation to avoid adding cluster edges for pairs that cannot be scheduled next to each other anyway. Maybe we should have reachability checks from FirstSU successors to SecondSU and bail if one is found. |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
42β50 | What do you mean by reachability exactly? Also, do you think that addEdge() should have this check added? |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
42β50 | Having a path between the two SDNodes, there is Topo.IsReachable(A, B). It is used by addEdge, canAddEdge already to avoid situations in which we would form cycles which would be plain invalid. |
You can push this if you want, but I still think that first too loops in fuseInstructionPair() are a poor mans version of the reachability check I outlined in the comments (which would catch more cases).
I thought that our comments crossed past each other. Let me give this some thought...
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
94 | Would it make sense to move this code to a function (maybe just a closure in this function? I suspect the fact that we use isPred/isSucc makes things slightly tricky? :( |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
94 | Where would you suggest, as a new method in SUinit? |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
94 | Effectively, this test is equivalent to SDep::getKind() == SDep::Order || SDep::getKind() == SDep::Output. |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
94 | Hit send before finishing editing the comment above. Again, effectively, this test is equivalent to SDep::getKind() == SDep::Anti || SDep::getKind() == SDep::Output. |
Thanks Evandro. I think this looks good to me now. I still think it's unfortunate that we basically duplicate the loop to add artificial dependences, but I don't see a way to share the code without making it more complicated than it is now. Please wait a bit with committing, to give Matthias a chance to voice additional concerns.
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
94 | Hard to tell in Phabricator, could you check if that could go on the previous line? |
llvm/lib/CodeGen/MacroFusion.cpp | ||
---|---|---|
94 | Yes, it would. |
Given the code below that adds all succs of first as succs of second, and vice versa for preds of seconds. How can the situation you are checking here even happen without us failing the cycle check in addEdge anyway?