This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] MachineScheduler: schedule execution metric added for the UnclusteredHighRPStage
ClosedPublic

Authored by alex-t on Dec 9 2022, 5:58 AM.

Details

Summary

Since divergence-driven ISel was fully enabled, we have more VGPRs available.

The MachineScheduler, trying to take advantage of that, bumps up the occupancy at the cost of hiding memory access latency. This spoils an initially good schedule. A new metric reflecting the latency-hiding quality of a schedule has been added to balance occupancy against latency. The metric is based on a latency model that computes the ratio of bubble cycles to working cycles. We then use this ratio to decide whether the higher-occupancy schedule is profitable, as follows:

    Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric

Diff Detail

Event Timeline

alex-t created this revision.Dec 9 2022, 5:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2022, 5:58 AM
alex-t requested review of this revision.Dec 9 2022, 5:58 AM
Herald added a subscriber: wdng. · View Herald Transcript
alex-t updated this revision to Diff 481614.Dec 9 2022, 6:05 AM

odd changes removed

alex-t updated this revision to Diff 481618.Dec 9 2022, 6:13 AM

yet one more minor code cleanup

Do you have any performance measurements?
Also guard it with an option to turn it off.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

You probably do not need to compute it always, just in UnclusteredHighRPStage?

1022

No else after return.

1031

Avoid using float. Use scaled integers.

vpykhtin added inline comments.Dec 10 2022, 1:56 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
947

Model -> ReadyCycles?

962

max

967

Model[&MI] = ReadyCycle;

vpykhtin added inline comments.Dec 10 2022, 2:04 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
954

if (!SU) continue;

What is it, BTW? A debug instruction?

alex-t updated this revision to Diff 482123.Dec 12 2022, 7:36 AM
alex-t marked 6 inline comments as done.

changes as requested

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

I only need this at the stage preceding the UnclusteredHighRPStage because it is the "MetricBefore" computation. The UnclusteredHighRPStage only runs for regions that satisfy the condition:

bool UnclusteredHighRPStage::initGCNRegion() {
  // Only reschedule regions with the minimum occupancy or regions that may have
  // spilling (excess register pressure).
  if ((!DAG.RegionsWithMinOcc[RegionIdx] ||
       DAG.MinOccupancy <= InitialOccupancy) &&
      !DAG.RegionsWithExcessRP[RegionIdx])
    return false;

  return GCNSchedStage::initGCNRegion();
}

What I should have done here is avoid running this for the GCNSchedStageID::ClusteredLowOccupancyReschedule stage.

947

What do you mean? I don't understand the question.

954

Not only debug instructions. A copy, for example.

1031

What is the evil in float? I would agree if we were targeting an embedded platform with no or very expensive floating-point support. Could you explain where the overhead is for an x86-like target (for example)?

vpykhtin added inline comments.Dec 12 2022, 7:43 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
947

Sorry, I mean rename Model to ReadyCycles

alex-t updated this revision to Diff 482135.Dec 12 2022, 8:10 AM

Model map renamed to ReadyCycles

alex-t marked an inline comment as done.Dec 12 2022, 8:10 AM
rampitec added inline comments.Dec 12 2022, 10:58 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

I only need this at the stage preceding the UnclusteredHighRPStage because it is the "MetricBefore" computation. The UnclusteredHighRPStage only runs for regions that satisfy the condition:

bool UnclusteredHighRPStage::initGCNRegion() {
  // Only reschedule regions with the minimum occupancy or regions that may have
  // spilling (excess register pressure).
  if ((!DAG.RegionsWithMinOcc[RegionIdx] ||
       DAG.MinOccupancy <= InitialOccupancy) &&
      !DAG.RegionsWithExcessRP[RegionIdx])
    return false;

  return GCNSchedStage::initGCNRegion();
}

What I should have done here is avoid running this for the GCNSchedStageID::ClusteredLowOccupancyReschedule stage.

I believe it is better to place it into UnclusteredHighRPStage::initGCNSchedStage(). This will allow rearranging stages and also skipping it if the UnclusteredHighRPStage itself is skipped.

1031

What is the evil in float? I would agree if we were targeting an embedded platform with no or very expensive floating-point support. Could you explain where the overhead is for an x86-like target (for example)?

We are not only running on x86. For instance, our lit tests are run on a variety of platforms that llvm supports. Besides, you do not want to get different results depending on the host platform, or even the host compiler used, due to different rounding. In addition, it is always good to avoid float so that llvm itself does not need to link in soft float on other platforms simply because their compiler also supports codegen for amdgpu.

alex-t updated this revision to Diff 482244.Dec 12 2022, 1:17 PM

floating points changed to scaled integers

alex-t marked 2 inline comments as done.Dec 12 2022, 1:20 PM
alex-t added inline comments.Dec 12 2022, 2:11 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

I compute, estimate, and store metrics per region, and only for those with minimal occupancy. initGCNSchedStage() has no regions yet. initGCNRegion() is also not a proper place, as we need the schedule of the given (i.e. previous) stage to be done before we can compute its metrics.

rampitec added inline comments.Dec 12 2022, 2:36 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

Regions are created by the first stage. By the time UnclusteredHighRPStage::initGCNSchedStage() runs, they should already be initialized. The schedule at initialization is what the previous stage has left.

alex-t added inline comments.Dec 12 2022, 3:04 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

We don't need metrics for the whole function, just for several regions.
To compute and store the metrics for the regions with min occupancy (for which the UnclusteredHighRPStage will actually run), we would have to loop over all regions in initGCNSchedStage().
I put it into finalizeGCNRegion() because there I can easily check whether the metrics are needed for the concrete region and compute them if necessary.

kerbowa added inline comments.Dec 12 2022, 10:37 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

It might be better in UnclusteredHighRPStage::initGCNRegion(), but I guess the problem is that you need the DAG to be built already? Maybe we even need a virtual getSchedulerMetrics(), since I can imagine this being used for more than the UnclusteredHighRPStage eventually, and right now it only does something in the initOccStage. If it stays in the initOccStage, can you add an assert that the next stage is UnclusteredHighRP?

936

Is this debug only?

949

Can't this just iterate over SUnits?

961

What about unbuffered resources like MFMA? I guess maybe it should be considered in future patches.

962

Should this be TargetSchedModel::computeInstrLatency? We don't use itineraries so I think it is just falling back here.

1032

What happens if the original schedule has no stalls, i.e. a metric of 0? Does that mean that no amount of occupancy increase can result in a profitable tradeoff?

This seems heavily biased towards increases in ILP. A single-stall cycle can be weighted as more important than an increase in occupancy from 2 to 3. I don't think this was the intention, was it? I think we need some sort of Occupancy scaling factor in the Profit calculation to have some way of tuning the importance of occupancy.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
130

Could this be moved into the class below so it is not in the llvm namespace?

135

NIT: LLVM formatting.

alex-t marked an inline comment as done.Dec 13 2022, 12:13 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

Your guess is quite correct. My first intention was exactly UnclusteredHighRPStage::initGCNRegion(). Then I found that, unfortunately, I have no DAG there yet :)

1032

I agree with you, and a scaling factor was planned. The reason it doesn't exist yet is that we still don't have access to HW other than Navi, so we can't run enough tests to determine a reasonable scaling factor.
The idea behind this initial heuristic was: "This fixes the regression. Since we don't have the HW to test with, let's fix it and let QA see what happens."
But your comment is a good catch. Indeed, I should have added a compiler option for the scaling factor with a default value of 1.

alex-t updated this revision to Diff 483134.Dec 15 2022, 4:57 AM
alex-t marked an inline comment as done.
  1. Metric calculation calls localized to UnclusteredHighRPStage::shouldRevertScheduling. Now getScheduleMetrics is only called when it is really necessary.
  2. The -amdgpu-schedule-metric-bias=<unsigned value> compiler option was added to ease further testing and tuning. It defaults to 10, which means a schedule without bubbles gets a 10-point reward.
  3. Several other changes according to the reviewers' requests.
alex-t marked 7 inline comments as done.Dec 15 2022, 5:06 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
860

What I really need is to get a metric for the schedule before the UnclusteredHighRPStage and after it. The only place I need it is UnclusteredHighRPStage::shouldRevertScheduling,
where I already have SUnits, which reflects the order before, and the BB, which reflects the current order. So, making getScheduleMetrics accept a vector of SUnits looks like a perfect solution. I don't have to care about any stages except the UnclusteredHighRPStage.

949

It now does, as I pass the SUnits vector to getScheduleMetrics.

alex-t marked an inline comment as done.Dec 15 2022, 5:15 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
961

This is the very first, draft implementation, aimed at finding out what it yields.
All the specifics, like HW hazards and unbuffered resources, may be considered in further patches if we observe that we need a more precise model.

I still want to have an option to disable this heuristic.
Also, have any performance measurements been done?

rampitec added inline comments.Dec 15 2022, 10:55 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1048

If NewMetric is 0 you will divide by 0.

alex-t marked an inline comment as done.Dec 15 2022, 12:45 PM

I still want to have an option to disable this heuristic.
Also, have any performance measurements been done?

We still have no HW available to run the benchmarks.
Because we've been waiting too long for that, the idea behind this patch is:
let's commit it, and then QA will have to run all the benchmarks for me :)

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1048

It is never 0. Any metric with a bubble share of less than 1% is clamped to 1.

unsigned getMetric() const {
  unsigned Metric = (BubbleCycles * ScaleFactor) / ScheduleLength;
  // Metric is zero if the amount of bubbles is less than 1%, which is too
  // small. So, return 1.
  return Metric ? Metric : 1;
}
alex-t marked an inline comment as done.Dec 15 2022, 12:54 PM

@rampitec

amdgpu-schedule-metric-bias=100

makes the scheduler always prefer the occupancy over the latency.
but the schedule metrics are still computed.
Did you mean the option to completely switch the getScheduleMetrics OFF?

@rampitec

amdgpu-schedule-metric-bias=100

makes the scheduler always prefer the occupancy over the latency.
but the schedule metrics are still computed.
Did you mean the option to completely switch the getScheduleMetrics OFF?

OK, thanks. Document it in the help text of the option.

alex-t updated this revision to Diff 483319.Dec 15 2022, 1:27 PM

amdgpu-schedule-metric-bias description updated

vpykhtin added inline comments.Dec 16 2022, 2:24 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1030

Unneeded copy. Add a second version of getScheduleMetrics that can operate on MachineInstr*; you can translate an instruction to an SUnit inside.

1049

This Profit formula is very hard to read; could you split it?
Instead of the last division by SF, maybe compare Profit < (SF * SF)?

vpykhtin added inline comments.Dec 16 2022, 2:51 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
951

auto &SU

1030

Sorry, I meant not a second version but a template, something like:

// shim accessors
SUnit &getSUnit(ScheduleDAGMI &DAG, MachineInstr *MI) { return *DAG.getSUnit(MI); }
SUnit &getSUnit(ScheduleDAGMI &DAG, SUnit &SU) { return SU; }

template <typename Range>
ScheduleMetrics GCNSchedStage::getScheduleMetrics(Range &&S) {
...
  for (auto &X : S) {
     SUnit &SU = getSUnit(DAG, X);
alex-t updated this revision to Diff 483993.Dec 19 2022, 9:50 AM

refactored getScheduleMetrics to avoid copying

alex-t marked 3 inline comments as done.Dec 19 2022, 10:01 AM
alex-t updated this revision to Diff 484202.EditedDec 20 2022, 3:14 AM

schedule printing function moved out of the GCNSchedStage class.
objects passed by reference are marked as const

rampitec added inline comments.Jan 4 2023, 12:19 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
30

Do you still need it?

848

Keep this. It is unrelated to the patch.

953

Could you please use a more specific name for the struct?

970

auto &I.

989

auto &SU.

1004

Any chance for CurrCycle to be zero?

1038

Any chance for CurrCycle to be zero?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
19

You do not seem to need it here.

291

SUnit is not a small class, how about SUnit*?

alex-t updated this revision to Diff 486640.Jan 5 2023, 11:31 AM
alex-t marked 10 inline comments as done.

minor changes according to the reviewer request

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1004

Since we have

CurrCycle = ++ReadyCycle;

the only chance is an empty input. Since the input is DAG.SUnits, we'd never get there, as it is never empty.

1038

No chance, for the same reason as at line 1006.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
291

It benefits from passing llvm::ScheduleDAG::SUnits by reference.
The llvm::ScheduleDAG class member SUnits already has the type std::vector<llvm::SUnit>.
To pass a vector of addresses, I would have to have a loop that fills it in.

rampitec accepted this revision.Jan 5 2023, 11:45 AM

LGTM with a nit.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
961

Indent is off.

This revision is now accepted and ready to land.Jan 5 2023, 11:45 AM
alex-t marked an inline comment as done.Jan 5 2023, 12:04 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
961

clang-format does this! :)
What indent would you like to see?

rampitec added inline comments.Jan 5 2023, 12:05 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
961

Really?! OK then.

alex-t marked 2 inline comments as done.Jan 5 2023, 12:08 PM