This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix bundle scheduling
ClosedPublic

Authored by rampitec on Jan 9 2020, 3:42 PM.

Details

Summary

Bundles coming to scheduler considered free, i.e. zero latency.
Fixed.

Diff Detail

Event Timeline

rampitec created this revision.Jan 9 2020, 3:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 9 2020, 3:42 PM
arsenm added inline comments.Jan 9 2020, 3:48 PM
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
758

This seems like a core scheduler flaw, not something to put in a target dag mutation

kerbowa accepted this revision.Jan 9 2020, 3:49 PM

I'm surprised this isn't already a standard mutation that targets can override but,

LGTM

This revision is now accepted and ready to land.Jan 9 2020, 3:49 PM
rampitec marked an inline comment as done.Jan 9 2020, 3:59 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
758

That is not so general. Different targets have different level of parallelism here, so latencies can either add up, be parallel or use some more peculiar pattern. In a general case only target knows what will be the latency in this case. There can be a way to describe it somehow, it is just not there.

This revision was automatically updated to reflect the committed changes.
foad added a comment.Jan 10 2020, 3:32 AM

This is only relevant for post-ra scheduling because we don't have any bundles when we do pre-ra scheduling, right?

Doing it as a DAG mutation seems a bit late. One problem is that you set the latency for the bundle's succ, but not for the corresponding pred, so you see odd mismatches like this:

SU(5):   BUNDLE implicit-def $sgpr6_sgpr7, implicit-def $sgpr6, implicit-def $sgpr7, implicit-def $scc
  # preds left       : 2
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 1
  Predecessors:
    SU(0): Anti Latency=0
    SU(0): Anti Latency=0
  Successors:
    SU(11): Data Latency=1 Reg=$sgpr6_sgpr7
...
SU(11):   S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3
  # preds left       : 6
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 1
  Depth              : 2
  Height             : 0
  Predecessors:
    SU(10): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(9): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(8): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(7): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(6): Data Latency=0 Reg=$sgpr4
    SU(5): Data Latency=0 Reg=$sgpr6_sgpr7

(Latency for succ of SU(5) is 1, but latency for pred of SU(11) is 0.)

Instead how about doing this and implementing it in adjustSchedDependency for AMDGPU?

diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 96a1f86c3e0..ef5926e4f8f 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -269,9 +269,9 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
       if (!ImplicitPseudoDef && !ImplicitPseudoUse) {
         Dep.setLatency(SchedModel.computeOperandLatency(SU->getInstr(), OperIdx,
                                                         RegUse, UseOp));
-        ST.adjustSchedDependency(SU, UseSU, Dep);
       } else
         Dep.setLatency(0);
+      ST.adjustSchedDependency(SU, UseSU, Dep);
 
       UseSU->addPred(Dep);
     }

Also, just curious, why did you update so many tests? I only found 4 that were failing with the patch:

Failing Tests (4):
    LLVM :: CodeGen/AMDGPU/llvm.amdgcn.ds.gws.barrier.ll
    LLVM :: CodeGen/AMDGPU/llvm.amdgcn.ds.gws.init.ll
    LLVM :: CodeGen/AMDGPU/misched-killflags.mir
    LLVM :: CodeGen/AMDGPU/mul24-pass-ordering.ll

This is only relevant for post-ra scheduling because we don't have any bundles when we do pre-ra scheduling, right?

In fact we have much more bundles coming to pre-RA scheduler when we have xnack and form memory clauses. However, I have tried to use the same mutation in the pre-RA scheduler and did not notice any scheduling difference at all, even with memory clause tests.

Doing it as a DAG mutation seems a bit late. One problem is that you set the latency for the bundle's succ, but not for the corresponding pred, so you see odd mismatches like this:

SU(5):   BUNDLE implicit-def $sgpr6_sgpr7, implicit-def $sgpr6, implicit-def $sgpr7, implicit-def $scc
  # preds left       : 2
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 1
  Predecessors:
    SU(0): Anti Latency=0
    SU(0): Anti Latency=0
  Successors:
    SU(11): Data Latency=1 Reg=$sgpr6_sgpr7
...
SU(11):   S_NOP 0, implicit killed $sgpr6_sgpr7, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3
  # preds left       : 6
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 1
  Depth              : 2
  Height             : 0
  Predecessors:
    SU(10): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(9): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(8): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(7): Data Latency=0 Reg=$vgpr0_vgpr1_vgpr2_vgpr3
    SU(6): Data Latency=0 Reg=$sgpr4
    SU(5): Data Latency=0 Reg=$sgpr6_sgpr7

(Latency for succ of SU(5) is 1, but latency for pred of SU(11) is 0.)

Instead how about doing this and implementing it in adjustSchedDependency for AMDGPU?

diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 96a1f86c3e0..ef5926e4f8f 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -269,9 +269,9 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
       if (!ImplicitPseudoDef && !ImplicitPseudoUse) {
         Dep.setLatency(SchedModel.computeOperandLatency(SU->getInstr(), OperIdx,
                                                         RegUse, UseOp));
-        ST.adjustSchedDependency(SU, UseSU, Dep);
       } else
         Dep.setLatency(0);
+      ST.adjustSchedDependency(SU, UseSU, Dep);
 
       UseSU->addPred(Dep);
     }

You are right, it seem I've got correct schedilng with this change, but preds are still wrong. I can either extend the mutation or try this adjustment. I will explore it.

Also, just curious, why did you update so many tests? I only found 4 that were failing with the patch:

Failing Tests (4):
    LLVM :: CodeGen/AMDGPU/llvm.amdgcn.ds.gws.barrier.ll
    LLVM :: CodeGen/AMDGPU/llvm.amdgcn.ds.gws.init.ll
    LLVM :: CodeGen/AMDGPU/misched-killflags.mir
    LLVM :: CodeGen/AMDGPU/mul24-pass-ordering.ll

Yeah, this is a split from another change. I am preparing the patch which will send memory operation clusters to the post-RA scheduler instead of current DAG mutation, so it looks like some tests from that patch have sneaked here.

Instead how about doing this and implementing it in adjustSchedDependency for AMDGPU?

diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 96a1f86c3e0..ef5926e4f8f 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -269,9 +269,9 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
       if (!ImplicitPseudoDef && !ImplicitPseudoUse) {
         Dep.setLatency(SchedModel.computeOperandLatency(SU->getInstr(), OperIdx,
                                                         RegUse, UseOp));
-        ST.adjustSchedDependency(SU, UseSU, Dep);
       } else
         Dep.setLatency(0);
+      ST.adjustSchedDependency(SU, UseSU, Dep);
 
       UseSU->addPred(Dep);
     }

Looks like this patch would break some internal logic inside Hexagon's adjustSchedDependency()...

Instead how about doing this and implementing it in adjustSchedDependency for AMDGPU?

diff --git a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
index 96a1f86c3e0..ef5926e4f8f 100644
--- a/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -269,9 +269,9 @@ void ScheduleDAGInstrs::addPhysRegDataDeps(SUnit *SU, unsigned OperIdx) {
       if (!ImplicitPseudoDef && !ImplicitPseudoUse) {
         Dep.setLatency(SchedModel.computeOperandLatency(SU->getInstr(), OperIdx,
                                                         RegUse, UseOp));
-        ST.adjustSchedDependency(SU, UseSU, Dep);
       } else
         Dep.setLatency(0);
+      ST.adjustSchedDependency(SU, UseSU, Dep);
 
       UseSU->addPred(Dep);
     }

Looks like this patch would break some internal logic inside Hexagon's adjustSchedDependency()...

D72535