This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] MachineScheduler: schedule execution metric added for the UnclusteredHighRPStage
ClosedPublic

Authored by alex-t on Dec 9 2022, 5:58 AM.

Details

Summary

Since the divergence-driven ISel was fully enabled, we have more VGPRs available.

The MachineScheduler, trying to take advantage of that, bumps up the occupancy at the
cost of memory-latency hiding, which spoils an initially good schedule. A new metric
reflecting the latency-hiding quality of a schedule has been added to balance occupancy
against latency. The metric is based on a latency model that computes the ratio of bubble
cycles to working cycles. We then use this ratio to decide whether the higher-occupancy
schedule is profitable as follows:

    Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric
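For illustration only, a minimal sketch of this decision in scaled-integer arithmetic (the names `profit`, `keepNewSchedule`, and the `ScaleFactor` value are hypothetical, not the patch's actual identifiers; the scaling mirrors the later review discussion about avoiding floats):

```cpp
#include <cassert>

// Illustrative only: decide whether a higher-occupancy schedule pays off.
// Profit = (NewOcc/OldOcc) * (OldMetric/NewMetric), kept in integer
// arithmetic by scaling each ratio before dividing.
constexpr unsigned ScaleFactor = 100;

unsigned profit(unsigned OldOcc, unsigned NewOcc, unsigned OldMetric,
                unsigned NewMetric) {
  assert(OldOcc && NewMetric && "divisors must be non-zero");
  // Scale each ratio so integer division does not discard the fractional
  // part prematurely.
  unsigned OccRatio = NewOcc * ScaleFactor / OldOcc;
  unsigned MetricRatio = OldMetric * ScaleFactor / NewMetric;
  return OccRatio * MetricRatio / ScaleFactor; // result scaled by ScaleFactor
}

// The new schedule is kept only when Profit >= 1.0, i.e. >= ScaleFactor.
bool keepNewSchedule(unsigned P) { return P >= ScaleFactor; }
```

For example, raising occupancy from 2 to 3 while the metric worsens from 10 to 20 yields a scaled profit of 75 (&lt; 100), so the new schedule would be reverted.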

Diff Detail

Event Timeline

alex-t created this revision.Dec 9 2022, 5:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2022, 5:58 AM
alex-t requested review of this revision.Dec 9 2022, 5:58 AM
Herald added a subscriber: wdng. · View Herald Transcript
alex-t updated this revision to Diff 481614.Dec 9 2022, 6:05 AM

odd changes removed

alex-t updated this revision to Diff 481618.Dec 9 2022, 6:13 AM

one more minor code cleanup

Do you have any performance measurements?
Also guard it with an option to turn it off.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

You probably do not need to compute it always, just in UnclusteredHighRPStage?

1079

No else after return.

1088

Avoid using float. Use scaled integers.

vpykhtin added inline comments.Dec 10 2022, 1:56 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
949

Model -> ReadyCycles?

964

max

969

Model[&MI] = ReadyCycle;

vpykhtin added inline comments.Dec 10 2022, 2:04 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
956

if (!SU) continue;

what is it BTW? Debug instruction?

alex-t updated this revision to Diff 482123.Dec 12 2022, 7:36 AM
alex-t marked 6 inline comments as done.

changes as requested

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

I only need this at the stage preceding the UnclusteredHighRPStage because it is the "MetricBefore" computation. The UnclusteredHighRPStage only runs for regions that satisfy the condition:

bool UnclusteredHighRPStage::initGCNRegion() {
  // Only reschedule regions with the minimum occupancy or regions that may have
  // spilling (excess register pressure).
  if ((!DAG.RegionsWithMinOcc[RegionIdx] ||
       DAG.MinOccupancy <= InitialOccupancy) &&
      !DAG.RegionsWithExcessRP[RegionIdx])
    return false;

  return GCNSchedStage::initGCNRegion();
}

What I should have done here is avoid running this for the GCNSchedStageID::ClusteredLowOccupancyReschedule stage.

949

What do you mean? I don't understand the question.

956

Not only debug instructions. A copy, for example.

1088

What is wrong with float? I would agree if we were targeting an embedded platform with no or very expensive floating-point support. Could you explain where the overhead is on an x86-like host, for example?

vpykhtin added inline comments.Dec 12 2022, 7:43 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
949

Sorry, I mean rename Model to ReadyCycles

alex-t updated this revision to Diff 482135.Dec 12 2022, 8:10 AM

Model map renamed to ReadyCycles

alex-t marked an inline comment as done.Dec 12 2022, 8:10 AM
rampitec added inline comments.Dec 12 2022, 10:58 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

I believe it is better to place it into UnclusteredHighRPStage::initGCNSchedStage(). This will allow rearranging stages and also skipping it if the UnclusteredHighRPStage itself is skipped.

1088

We do not run only on x86; for instance, our lit tests run on a variety of platforms that llvm supports. Besides, you do not want to get different results depending on the host platform or even the host compiler due to different rounding. It is also good to avoid float so that llvm itself does not need to link in soft-float support on platforms whose compiler merely supports codegen for amdgpu.

alex-t updated this revision to Diff 482244.Dec 12 2022, 1:17 PM

floating points changed to scaled integers

alex-t marked 2 inline comments as done.Dec 12 2022, 1:20 PM
alex-t added inline comments.Dec 12 2022, 2:11 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

I compute and store metrics per region, and only for regions with minimal occupancy. initGCNSchedStage() has no regions yet. initGCNRegion() is also not a proper place, as the previous stage's schedule must be complete before we can compute its metrics.

rampitec added inline comments.Dec 12 2022, 2:36 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

Regions are created by the first stage. By the time UnclusteredHighRPStage::initGCNSchedStage() runs, they should already be initialized. The schedule at initialization is what the previous stage has left.

alex-t added inline comments.Dec 12 2022, 3:04 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

We don't need metrics for the whole function, just for several regions.
To compute and store metrics for the regions with min occupancy (for which the UnclusteredHighRPStage will actually run), we would need a loop over all regions in initGCNSchedStage().
I put it into finalizeGCNRegion() because there I can easily check whether metrics are needed for the concrete region and compute them if necessary.

kerbowa added inline comments.Dec 12 2022, 10:37 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

It might be better in UnclusteredHighRPStage::initGCNRegion(), but I guess the problem is that you need the DAG to be built already? Maybe we even need a virtual getSchedulerMetrics(), since I can imagine this being used for more than the UnclusteredHighRPStage eventually, and for now it only does something in the initOccStage. If it stays in the initOccStage, can you add an assert that the next stage is UnclusteredHighRP?

938

Is this debug only?

951

Can't this just iterate over SUnits?

963

What about unbuffered resources like MFMA? I guess maybe it should be considered in future patches.

964

Should this be TargetSchedModel::computeInstrLatency? We don't use itineraries so I think it is just falling back here.

1089

What happens if the original schedule has no stalls, i.e. a metric of 0? Does that mean that no amount of occupancy increase can result in a profitable tradeoff?

This seems heavily biased towards increases in ILP. A single-stall cycle can be weighted as more important than an increase in occupancy from 2 to 3. I don't think this was the intention, was it? I think we need some sort of Occupancy scaling factor in the Profit calculation to have some way of tuning the importance of occupancy.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
129

Could this be moved into the class below so it is not in the llvm namespace?

134

NIT: LLVM formatting.

alex-t marked an inline comment as done.Dec 13 2022, 12:13 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

Your guess is quite correct. My first intention was exactly UnclusteredHighRPStage::initGCNRegion(). Then I found that unfortunately, I have no DAG yet :)

1089

I agree with you. And a scaling factor was planned. The reason it doesn't exist yet is that we still don't have access to HW other than Navi and we can't run enough tests to determine a reasonable scaling factor.
The idea behind this initial heuristic was: “This fixes the regression. Since we don't have an HW to test with, let's fix it and let QA see what happens."
But your comment is a good catch. Indeed, I should have added a compiler option for the scaling factor with a default value of 1.

alex-t updated this revision to Diff 483134.Dec 15 2022, 4:57 AM
alex-t marked an inline comment as done.
  1. Metric calculation calls localized to UnclusteredHighRPStage::shouldRevertScheduling. Now getScheduleMetrics is only called when it is really necessary.
  2. The -amdgpu-schedule-metric-bias=<unsigned value> compiler option was added to ease further testing and tuning. It defaults to 10, which means a schedule without bubbles gets a 10-point reward.
  3. Several other changes according to reviewer requests.
alex-t marked 7 inline comments as done.Dec 15 2022, 5:06 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
862

What I really need is a metric for the schedule before the UnclusteredHighRPStage and after it. The only place I need it is UnclusteredHighRPStage::shouldRevertScheduling,
where I already have SUnits, which reflects the order before, and the BB, which reflects the current order. So, making getScheduleMetrics accept a vector of SUnit looks like a perfect solution. I don't have to care about any stages except the UnclusteredHighRPStage.

951

It now does, as I pass the SUnits vector to getScheduleMetrics.

alex-t marked an inline comment as done.Dec 15 2022, 5:15 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

This is the very first, draft implementation, which aims to find out what it yields.
Specifics like HW hazards and unbuffered resources may be considered in future patches if we observe that a more precise model is needed.
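The draft latency model under discussion can be sketched as a toy (the structure, the one-instruction-per-cycle issue assumption, and all names here are illustrative, not the patch's actual code): each instruction becomes ready when all of its predecessors' results are ready, and cycles spent waiting for operands count as bubbles.

```cpp
#include <algorithm>
#include <vector>

// Toy instruction: predecessor indices within the schedule plus a latency.
struct ToyInstr {
  std::vector<unsigned> Preds; // indices of predecessor instructions
  unsigned Latency;
};

// Count stall (bubble) cycles for a given linear schedule, assuming one
// instruction issues per cycle once its operands are ready.
unsigned countBubbles(const std::vector<ToyInstr> &Schedule) {
  std::vector<unsigned> ReadyCycles(Schedule.size(), 0);
  unsigned CurrCycle = 0, Bubbles = 0;
  for (unsigned I = 0; I < Schedule.size(); ++I) {
    unsigned ReadyCycle = CurrCycle;
    for (unsigned P : Schedule[I].Preds)
      ReadyCycle = std::max(ReadyCycle, ReadyCycles[P]);
    Bubbles += ReadyCycle - CurrCycle;              // stall waiting for operands
    ReadyCycles[I] = ReadyCycle + Schedule[I].Latency; // result available here
    CurrCycle = ReadyCycle + 1;                     // issue the instruction
  }
  return Bubbles;
}
```

Placing an independent instruction between a long-latency producer and its consumer hides part of the latency, which is exactly the effect the metric rewards.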

I still want to have an option to disable this heuristic.
Also are there any performance measurements done?

rampitec added inline comments.Dec 15 2022, 10:55 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1105

If NewMetric is 0 you will divide by 0.

alex-t marked an inline comment as done.Dec 15 2022, 12:45 PM

We still have no HW available to run the benchmarks.
Since we've been waiting too long for that, the idea behind this patch is:
let's commit it, and then QA will have to run all the benchmarks for me :)

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1105

It is never 0. All metrics with a bubble amount of less than 1% are clamped to 1.

unsigned getMetric() const {
  unsigned Metric = (BubbleCycles * ScaleFactor) / ScheduleLength;
  // Metric is zero if the amount of bubbles is less than 1% which is too
  // small. So, return 1.
  return Metric ? Metric : 1;
}
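As a standalone, hedged re-implementation of the snippet above (the ScaleFactor value of 100 is an assumption consistent with the "less than 1%" comment; the function name is illustrative):

```cpp
// Re-implementation of getMetric() for illustration. With ScaleFactor = 100
// the metric is the bubble percentage of the schedule, clamped to at least 1
// so later ratio computations never divide by zero.
constexpr unsigned ScaleFactor = 100;

unsigned getMetricSketch(unsigned BubbleCycles, unsigned ScheduleLength) {
  unsigned Metric = BubbleCycles * ScaleFactor / ScheduleLength;
  return Metric ? Metric : 1; // clamp: never return 0
}
```

So a 400-cycle schedule with 3 bubble cycles (0.75%) still reports a metric of 1 rather than 0.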
alex-t marked an inline comment as done.Dec 15 2022, 12:54 PM

@rampitec

amdgpu-schedule-metric-bias=100

makes the scheduler always prefer occupancy over latency,
but the schedule metrics are still computed.
Did you mean the option to completely switch the getScheduleMetrics OFF?

OK, thanks. Document it in the help text of the option.

alex-t updated this revision to Diff 483319.Dec 15 2022, 1:27 PM

amdgpu-schedule-metric-bias description updated

vpykhtin added inline comments.Dec 16 2022, 2:24 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1087

Unneeded copy. Add a second version of getScheduleMetrics that can operate on MachineInstr*; you can translate an instruction to an SUnit inside.

1106

This Profit formula is very hard to read; could you split it?
Instead of the last division by SF, maybe compare Profit < (SF * SF)?
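A minimal sketch of this suggestion (the function and parameter names are hypothetical): both ratios already carry a factor of SF, so their product is scaled by SF * SF and the final lossy division can be dropped by moving the scale to the other side of the comparison.

```cpp
// OccRatio and MetricRatio are each pre-scaled by SF, so their product is
// scaled by SF * SF. Comparing against SF * SF avoids a final integer
// division that would discard precision.
bool lessProfitable(unsigned OccRatio, unsigned MetricRatio, unsigned SF) {
  return OccRatio * MetricRatio < SF * SF;
}
```

For instance, with SF = 100, an occupancy ratio of 1.5 (150) and a metric ratio of 0.5 (50) multiply to 7500 &lt; 10000, so the new schedule is not profitable.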

vpykhtin added inline comments.Dec 16 2022, 2:51 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
953

auto &SU

1087

Sorry, I meant not a second version but a template, something like:

// shim accessors
SUnit &getSUnit(ScheduleDAGInstrs &DAG, MachineInstr *MI) { return *DAG.getSUnit(MI); }
SUnit &getSUnit(ScheduleDAGInstrs &, SUnit &SU) { return SU; }

template <typename Range>
ScheduleMetrics GCNSchedStage::getScheduleMetrics(Range &&S) {
  ...
  for (auto &X : S) {
    SUnit &SU = getSUnit(DAG, X);
    ...
  }
}

alex-t updated this revision to Diff 483993.Dec 19 2022, 9:50 AM

refactored getScheduleMetrics to avoid copying

alex-t marked 3 inline comments as done.Dec 19 2022, 10:01 AM
alex-t updated this revision to Diff 484202.EditedDec 20 2022, 3:14 AM

Schedule printing function moved out of the GCNSchedStage class.
Objects passed by reference are marked as const.

rampitec added inline comments.Jan 4 2023, 12:19 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
30

Do you still need it?

848

Keep this. It is unrelated to the patch.

955

Could you please use a more specific name for the struct?

972

auto &I.

991

auto &SU.

1006

Any chance for CurrCycle to be zero?

1040

Any chance for CurrCycle to be zero?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
19

You do not seem to need it here.

290

SUnit is not a small class, how about SUnit*?

alex-t updated this revision to Diff 486640.Jan 5 2023, 11:31 AM
alex-t marked 10 inline comments as done.

minor changes according to the reviewer request

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
1006

Since we have

CurrCycle = ++ReadyCycle;

the only chance is an empty input. Since the input is DAG.SUnits, we'd never get there, as it is never empty.

1040

No chance, for the same reason as at line 1006.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
290

It benefits from passing llvm::ScheduleDAG::SUnits by reference.
The llvm::ScheduleDAG member SUnits already has the type std::vector<llvm::SUnit>.
To pass a vector of pointers, I would need a loop that fills it in.

rampitec accepted this revision.Jan 5 2023, 11:45 AM

LGTM with a nit.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

Indent is off.

This revision is now accepted and ready to land.Jan 5 2023, 11:45 AM
alex-t marked an inline comment as done.Jan 5 2023, 12:04 PM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

clang-format does this! :)
What indent would you like to see?

rampitec added inline comments.Jan 5 2023, 12:05 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
963

Really?! OK then.

alex-t marked 2 inline comments as done.Jan 5 2023, 12:08 PM