Bundles coming to the scheduler are considered free, i.e. they have zero latency.
Fixed.
Details
- Reviewers: foad, kerbowa
- Commits: rGcd69e4c74c17: [AMDGPU] Fix bundle scheduling

Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp:758
This seems like a core scheduler flaw, not something to put in a target DAG mutation.
I'm surprised this isn't already a standard mutation that targets can override, but LGTM.
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp:758
That is not so general. Different targets have different levels of parallelism here, so latencies can add up, be fully parallel, or follow some more peculiar pattern. In the general case only the target knows what the latency will be. There could be a way to describe this generically, but it does not exist yet.
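For illustration, below is a minimal sketch of the mutation-based approach being discussed: a target DAG mutation that gives data edges leaving a BUNDLE a non-zero latency. The class name `BundleLatencyMutation`, its placement, and the placeholder latency value are assumptions for the sketch, not the committed code. Note that it also mirrors the latency onto the successor's corresponding pred edge; updating only the succ side leads to the mismatch pointed out further down the thread.

```cpp
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/ScheduleDAG.h"
#include "llvm/CodeGen/ScheduleDAGInstrs.h"
#include "llvm/CodeGen/ScheduleDAGMutation.h"
using namespace llvm;

namespace {
// Illustrative sketch only: fix up zero-latency data edges out of BUNDLEs.
class BundleLatencyMutation : public ScheduleDAGMutation {
  void apply(ScheduleDAGInstrs *DAG) override {
    for (SUnit &SU : DAG->SUnits) {
      const MachineInstr *MI = SU.getInstr();
      if (!MI || !MI->isBundle())
        continue;
      for (SDep &SuccDep : SU.Succs) {
        if (SuccDep.getKind() != SDep::Data)
          continue;
        // Placeholder: a real implementation would derive this from the
        // bundled instruction that actually defines SuccDep.getReg().
        unsigned Lat = 1;
        SuccDep.setLatency(Lat);
        // Keep the successor's mirrored pred edge in sync; updating only
        // the succ side produces the Latency=1 / Latency=0 mismatch shown
        // later in this thread.
        SUnit *UseSU = SuccDep.getSUnit();
        for (SDep &PredDep : UseSU->Preds)
          if (PredDep.getSUnit() == &SU && PredDep.getKind() == SDep::Data &&
              PredDep.getReg() == SuccDep.getReg())
            PredDep.setLatency(Lat);
      }
    }
  }
};
} // end anonymous namespace
```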
This is only relevant for post-ra scheduling because we don't have any bundles when we do pre-ra scheduling, right?
Doing it as a DAG mutation seems a bit late. One problem is that you set the latency for the bundle's succ, but not for the corresponding pred, so you see odd mismatches like this:
SU(5): BUNDLE implicit-def $sgpr6_sgpr7, implicit-def $sgpr6, implicit-def $sgpr7, implicit-def $scc
  # preds left  : 2
  # succs left  : 1
  # rdefs left  : 0
  Latency       : 3
  Depth         : 0
  Height        : 1
  Predecessors:
    SU(0): Anti Latency=0
    SU(0): Anti Latency=0
  Successors:
    SU(11): Data Latency=1 Reg=$sgpr6_sgpr7
...
SU(11): S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3
  # preds left  : 6
  # succs left  : 0
  # rdefs left  : 0
  Latency       : 1
  Depth         : 2
  Height        : 0
  Predecessors:
    SU(10): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(9): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(8): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(7): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(6): Data Latency=0 Reg=$sgpr4
    SU(5): Data Latency=0 Reg=$sgpr6_sgpr7
(Latency for succ of SU(5) is 1, but latency for pred of SU(11) is 0.)
Instead how about doing this and implementing it in adjustSchedDependency for AMDGPU?
diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 96a1f86c3e0..ef5926e4f8f 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -269,9 +269,9 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
     if (!ImplicitPseudoDef && !ImplicitPseudoUse) {
       Dep.setLatency(SchedModel.computeOperandLatency(SU->getInstr(), OperIdx,
                                                       RegUse, UseOp));
-      ST.adjustSchedDependency(SU, UseSU, Dep);
     } else
       Dep.setLatency(0);
+    ST.adjustSchedDependency(SU, UseSU, Dep);

     UseSU->addPred(Dep);
   }
Also, just curious, why did you update so many tests? I only found 4 that were failing with the patch:
Failing Tests (4):
    LLVM :: CodeGen/AMDGPU/llvm.amdgcn.ds.gws.barrier.ll
    LLVM :: CodeGen/AMDGPU/llvm.amdgcn.ds.gws.init.ll
    LLVM :: CodeGen/AMDGPU/misched-killflags.mir
    LLVM :: CodeGen/AMDGPU/mul24-pass-ordering.ll
In fact we have many more bundles coming to the pre-RA scheduler when we have xnack and form memory clauses. However, I have tried to use the same mutation in the pre-RA scheduler and did not notice any scheduling difference at all, even with the memory clause tests.
> Doing it as a DAG mutation seems a bit late. One problem is that you set the latency for the bundle's succ, but not for the corresponding pred, so you see odd mismatches like this: [...]
> Instead how about doing this and implementing it in adjustSchedDependency for AMDGPU? [...]
You are right, it seems I get correct scheduling with this change, but the preds are still wrong. I can either extend the mutation or try this adjustment. I will explore it.
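For reference, a rough sketch of what a bundle-aware adjustSchedDependency() for AMDGPU could look like is below, assuming the ScheduleDAGInstrs change from the diff above is in place. The latency heuristic and the exact calls used are illustrative assumptions, not the code that was eventually committed; it would live in the AMDGPU subtarget (e.g. AMDGPUSubtarget.cpp).

```cpp
// Sketch only: walk the instructions inside the BUNDLE and take the latency
// of the one that actually defines the register carried by the dependency.
void GCNSubtarget::adjustSchedDependency(SUnit *Def, SUnit *Use,
                                         SDep &Dep) const {
  if (Dep.getKind() != SDep::Data || !Dep.getReg() ||
      !Def->isInstr() || !Def->getInstr()->isBundle())
    return;

  const SIRegisterInfo *TRI = getRegisterInfo();
  const MachineInstr *BundleMI = Def->getInstr();
  unsigned Lat = 0;

  // Iterate over the bundled instructions following the BUNDLE header.
  auto I = std::next(BundleMI->getIterator());
  auto E = BundleMI->getParent()->instr_end();
  for (; I != E && I->isBundledWithPred(); ++I) {
    if (I->modifiesRegister(Dep.getReg(), TRI))
      // Assumption: use the per-instruction latency of the defining
      // instruction inside the bundle.
      Lat = getInstrInfo()->getInstrLatency(getInstrItineraryData(), *I);
    else if (Lat)
      --Lat; // later bundled instructions hide part of the latency
  }
  Dep.setLatency(Lat);
}
```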
> Also, just curious, why did you update so many tests? I only found 4 that were failing with the patch: [...]
Yeah, this is a split from another change. I am preparing a patch which will send memory operation clusters to the post-RA scheduler instead of the current DAG mutation, so it looks like some tests from that patch have sneaked in here.
Looks like this patch would break some internal logic inside Hexagon's adjustSchedDependency()...