This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Machine scheduler should take care of the 'amdgpu-num-vgp' function attribute
Needs ReviewPublic

Authored by alex-t on Sep 5 2023, 9:44 AM.

Details

Summary

We currently have the 'amdgpu-num-vgpr' attribute parsed during the TargetRegisterInfo initialization.

GCNSchedStage class method computing occupancy does not read the attribute and consider all the available
 registers for the given target even if the number of VGPRs was explicitly restricted.

Diff Detail

Event Timeline

alex-t created this revision.Sep 5 2023, 9:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2023, 9:44 AM
alex-t requested review of this revision.Sep 5 2023, 9:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2023, 9:44 AM
Herald added a subscriber: wdng. · View Herald Transcript

I thought this already was used in the excess VGPRs pressure and wave-per-eu is that just not true?

I thought this already was used in the excess VGPRs pressure and wave-per-eu is that just not true?

I thought it is buried inside the logic of waves per eu calculation/getEffectiveWavesPerEU/getWavesPerEU, but apparently it is not.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
677

'llvm::' is not needed.

682

Why not just call getOccupancyWithNumVGPRs here? This logic is already in the getNumWavesPerEUWithNumVGPRs.

I thought this already was used in the excess VGPRs pressure and wave-per-eu is that just not true?

trying to give a detailed response:

We measure the occupancy after the current schedule stage by querying the RegisterPressure class:

unsigned WavesAfter =
      std::min(TargetOccupancy, PressureAfter.getOccupancy(ST));

which does not care about attributes set on function.
Then couple of lines after we query the right method which returns the real number of VGPRs available
for the function and we mark the region as excess RP accordingly:

unsigned MaxVGPRs = ST.getMaxNumVGPRs(MF);
if (PressureAfter.getVGPRNum(false) > MaxVGPRs ||
       *****
    DAG.RegionsWithExcessRP[RegionIdx] = true;
  }

the problem is that we decide if we need to revert the schedule based on both - excess RP and WavesAfter.
The latter is computed w/o the knowledge of the restrictions applied by the attribute:

bool GCNSchedStage::mayCauseSpilling(unsigned WavesAfter) {
  if (WavesAfter <= MFI.getMinWavesPerEU() &&
      !PressureAfter.less(ST, PressureBefore) &&
      isRegionWithExcessRP()) {
    LLVM_DEBUG(dbgs() << "New pressure will result in more spilling.\n");
    return true;
  }

So, we never revert the "bad" schedule!

Thanks to all who took the time to look at this.
The aim of this review is to discuss and find the right approach.
We have a target description interface that keeps the HW details.
We also have a set of user-defined attributes like: "amdgpu-waves-per-eu", "amdgpu-num-vgpr" etc. which affect the defaults.
We also take into account the LDS size.
Shouldn't we compute the occupancy uniformly accounting for all constraints together?

alex-t added a comment.Sep 5 2023, 1:37 PM

One more case that we may want to consider: with "amdgpu-num-vgpr"=8 the difference between 2 RPs of 2 is significant.
The schedule that spends 7 VGPRs does not spill but another one, which spends 9 will.
Currently, we have a "less" operator defined for the GCNRegPressure which only compares the occupancy based on the default target number of available registers and the number of VGPRs from the RP. To this method, RPs in my case are equal as they both have occupancy 10. This is yet another reason why we cannot use the excess RP set to decide if we need to keep or revert the current schedule.

Thanks to all who took the time to look at this.
The aim of this review is to discuss and find the right approach.
We have a target description interface that keeps the HW details.
We also have a set of user-defined attributes like: "amdgpu-waves-per-eu", "amdgpu-num-vgpr" etc. which affect the defaults.
We also take into account the LDS size.
Shouldn't we compute the occupancy uniformly accounting for all constraints together?

There is GCNSubtarget::computeOccupancy which aims to do exactly that.

One more case that we may want to consider: with "amdgpu-num-vgpr"=8 the difference between 2 RPs of 2 is significant.
The schedule that spends 7 VGPRs does not spill but another one, which spends 9 will.
Currently, we have a "less" operator defined for the GCNRegPressure which only compares the occupancy based on the default target number of available registers and the number of VGPRs from the RP. To this method, RPs in my case are equal as they both have occupancy 10. This is yet another reason why we cannot use the excess RP set to decide if we need to keep or revert the current schedule.

8 vgpr case is artificial and only used for stress testing. There is no other reason to use anything below maximum occupancy.

arsenm added a comment.Sep 5 2023, 2:05 PM

amdgpu-num-vgprs is deprecated. Do we really need to try handling it?

alex-t added a comment.Sep 5 2023, 2:18 PM

Thanks to all who took the time to look at this.
The aim of this review is to discuss and find the right approach.
We have a target description interface that keeps the HW details.
We also have a set of user-defined attributes like: "amdgpu-waves-per-eu", "amdgpu-num-vgpr" etc. which affect the defaults.
We also take into account the LDS size.
Shouldn't we compute the occupancy uniformly accounting for all constraints together?

There is GCNSubtarget::computeOccupancy which aims to do exactly that.

No, it does not care of "amdgpu-num-vgpr" attribute.

if (NumVGPRs)
   Occupancy = std::min(Occupancy, getOccupancyWithNumVGPRs(NumVGPRs));
unsigned GCNSubtarget::getOccupancyWithNumVGPRs(unsigned NumVGPRs) const {
  return AMDGPU::IsaInfo::getNumWavesPerEUWithNumVGPRs(this, NumVGPRs);
}
unsigned getNumWavesPerEUWithNumVGPRs(const MCSubtargetInfo *STI,
                                      unsigned NumVGPRs) {
  unsigned MaxWaves = getMaxWavesPerEU(STI);
  unsigned Granule = getVGPRAllocGranule(STI);
  if (NumVGPRs < Granule)
    return MaxWaves;
  unsigned RoundedRegs = alignTo(NumVGPRs, Granule);
  return std::min(std::max(**getTotalNumVGPRs(STI) **/ RoundedRegs, 1u), MaxWaves);
}

getTotalNumVGPRs only takes the defaults.

One more case that we may want to consider: with "amdgpu-num-vgpr"=8 the difference between 2 RPs of 2 is significant.
The schedule that spends 7 VGPRs does not spill but another one, which spends 9 will.
Currently, we have a "less" operator defined for the GCNRegPressure which only compares the occupancy based on the default target number of available registers and the number of VGPRs from the RP. To this method, RPs in my case are equal as they both have occupancy 10. This is yet another reason why we cannot use the excess RP set to decide if we need to keep or revert the current schedule.

8 vgpr case is artificial and only used for stress testing. There is no other reason to use anything below maximum occupancy.

But it is still a valid input and we don't want that tests to become spilling.
Another question: What this attribute is intended for? If it imposes the constraint we have to take it into account when computing the occupancy, otherwise, we schedule/allocate more registers than was requested by the user and this will be the UB.

alex-t added a comment.Sep 5 2023, 2:20 PM

amdgpu-num-vgprs is deprecated. Do we really need to try handling it?

That is a strong reason to abandon this. But in this case, we need to do something with the tests using it.