Page MenuHomePhabricator

[AMDGPU] Correct NumLoads in clustering
ClosedPublic

Authored by rampitec on Jan 23 2020, 1:27 PM.

Details

Summary

Scheduler sends NumLoads argument into shouldClusterMemOps()
one less the actual cluster length. So for 2 instructions
it will pass just 1. Correct this number.

This is NFC for in tree targets.

Diff Detail

Event Timeline

rampitec created this revision.Jan 23 2020, 1:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 23 2020, 1:27 PM
foad added a comment.Jan 23 2020, 2:20 PM

I tried something similar in D72325.

I tried something similar in D72325.

Comments there argue about how much should we cluster, but regardless I do not think we should use a wrong data. If we want more clustering we need to increase thresholds, but still rely on a correct input.

I also have an early version of a patch which can reschedule without clustering if no optimal schedule were found for a region.

We may postpone this change, use that patch and then finally correct this number while increasing clustering threshold at the same time. It may be easier in terms of tests update and less prone to performance regressions.

foad added a comment.Jan 24 2020, 1:10 AM

I tried something similar in D72325.

Comments there argue about how much should we cluster, but regardless I do not think we should use a wrong data. If we want more clustering we need to increase thresholds, but still rely on a correct input.

I agree. I also think we should fix this properly in MachineScheduler:

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1584,7 +1584,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {

... and adjust any other target implementations of shouldClusterMemOps accordingly.

I tried something similar in D72325.

Comments there argue about how much should we cluster, but regardless I do not think we should use a wrong data. If we want more clustering we need to increase thresholds, but still rely on a correct input.

I agree. I also think we should fix this properly in MachineScheduler:

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1584,7 +1584,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {

... and adjust any other target implementations of shouldClusterMemOps accordingly.

I tried something similar in D72325.

Comments there argue about how much should we cluster, but regardless I do not think we should use a wrong data. If we want more clustering we need to increase thresholds, but still rely on a correct input.

I agree. I also think we should fix this properly in MachineScheduler:

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -1584,7 +1584,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     SUnit *SUb = MemOpRecords[Idx+1].SU;
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
-                                 ClusterLength)) {
+                                 ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {

... and adjust any other target implementations of shouldClusterMemOps accordingly.

That I agree too. I just cannot deal with any single performance regression in all of the targets, including out of tree targets. This bug is here at least for 6 years when I first noticed it. I belive every in-tree target need to do the same as in this patch and then we can collectively switch away from this bug. Sounds like the only discussion is about the process how to get there.

foad added a comment.Jan 24 2020, 3:45 AM

I just cannot deal with any single performance regression in all of the targets, including out of tree targets.

We can adjust AArch64 (the only other affected in-tree target) so there's no change in behaviour, and out-of-tree target maintainers can adjust their own targets.

vpykhtin accepted this revision.Jan 24 2020, 6:26 AM

LGTM.

This revision is now accepted and ready to land.Jan 24 2020, 6:26 AM
rampitec updated this revision to Diff 240263.Jan 24 2020, 12:07 PM
rampitec edited the summary of this revision. (Show Details)
rampitec added a reviewer: t.p.northover.

Changed patch to only correct clusterNeighboringMemOps. Threshold logic adjusted accordingly.
We will need to retune our threshold in the AMDGPU separately, likely after scheduler will gain the ability to break clustering under a high pressure.

rampitec requested review of this revision.Jan 24 2020, 12:07 PM
foad accepted this revision.Jan 24 2020, 12:21 PM

LGTM! Perhaps add a comment in TargetInstrInfo.h documenting the argument? E.g. "the number of loads that will be in the cluster if this hook returns true"...?

This revision is now accepted and ready to land.Jan 24 2020, 12:21 PM
rampitec updated this revision to Diff 240272.Jan 24 2020, 12:37 PM

Added operand description to TargetInstrInfo::shouldClusterMemOps()

This revision was automatically updated to reflect the committed changes.