This is an archive of the discontinued LLVM Phabricator instance.


Differential D134723

[AMDGPU] Set memory bound occupancy based on addressable VGPRs
Abandoned · Public

Authored by critson on Sep 27 2022, 2:06 AM.

Details

Reviewers

foad
rampitec
Joe_Nash
kerbowa

Summary

With GFX11 the minimum occupancy (for memory bound) shaders should
be set based on the number of addressable VGPRs and total VGPRs,
i.e. it should be higher than 4 in many cases to avoid spilling.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,040 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

critson created this revision. Sep 27 2022, 2:06 AM

Herald added a project: Restricted Project. Sep 27 2022, 2:06 AM

Herald added subscribers: kosarev, kerbowa, javed.absar and 9 others.

critson requested review of this revision. Sep 27 2022, 2:06 AM

Herald added a project: Restricted Project. Sep 27 2022, 2:06 AM

Herald added subscribers: llvm-commits, wdng.

I'm not sure where this is currently tested to add a new test.
If any reviewers know?

See also D83674 (which I more or less gave up on after the reviewer feedback).

In D134723#3817497, @critson wrote:

I'm not sure where this is currently tested to add a new test.
If any reviewers know?

Not sure. test/CodeGen/AMDGPU/perfhint.ll tests whether a function is marked as memory bound or not, but I don't know if we have tests for the scheduler allowing memory bound functions to get a lower occupancy.

Harbormaster completed remote builds in B188897: Diff 463156. Sep 27 2022, 3:54 AM

rampitec added a reviewer: kerbowa. Sep 27 2022, 9:42 AM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
Line 965: I do not really understand the logic. TotalNumVGPR does not tell you anything about spilling. I would understand if this were expressed in terms of maximum occupancy, so that you return a fraction of the maximum occupancy. Moreover, this limit is really about the number of concurrent potential memory requests, so that they do not evict each other or conflict. So if your CU can run more waves, you likely want to drop occupancy here and give the shader more registers in the memory bound case. That said, this is not supposed to be a strong limit: if the scheduler has to spill, it should drop the occupancy below this limit. At least it was written that way at some point. I.e. I do not think this is the right place to fix spilling; you do not know the real pressure here anyway. If this limit causes spilling, then the scheduler itself has to be fixed.

So this was based on dealing with the temporary 128 VGPR limit on GFX11, which has now been removed.
On that basis I will probably abandon this revision; however, I do still have concerns about this magic number 4.

Because of this number, the scheduler would pick a schedule that used more than 128 VGPRs and spill, when in fact a schedule without spilling was possible.
Essentially, once the memory bound heuristic has been triggered, the scheduler behaves in a manner which artificially reduces occupancy to this value of 4.
With 256 addressable VGPRs, 4 does work as a minimum without spilling for all hardware variants,
although logically the value would be 6 for GFX11 with the extra VGPRs in Wave32.
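
The arithmetic behind these numbers can be sketched as a standalone model of the patched getMinAllowedOccupancy() from this revision. This is only an illustration: the function name and its plain-integer parameters are hypothetical stand-ins for the SIMachineFunctionInfo and GCNSubtarget state, and the register-file sizes used in the usage note (1024 and 1536 VGPRs) are assumed values chosen to reproduce the 4 and 6 mentioned above.

```cpp
#include <algorithm>

// Standalone model of the patched getMinAllowedOccupancy() in this revision.
// The parameters are plain integers standing in for LLVM state; they are
// illustrative and not part of any LLVM API.
unsigned minAllowedOccupancy(unsigned Occupancy, bool MemoryBoundOrWaveLimited,
                             unsigned TotalNumVGPRs,
                             unsigned AddressableNumVGPRs) {
  // Functions that are neither memory bound nor wave limited keep their
  // current occupancy as the minimum.
  if (!MemoryBoundOrWaveLimited)
    return Occupancy;
  // Number of waves that still fit when each wave uses every addressable VGPR.
  unsigned SpillThreshold = TotalNumVGPRs / AddressableNumVGPRs;
  // Never go below the pre-existing floor of 4 waves.
  SpillThreshold = std::max(SpillThreshold, 4u);
  // Same as (Occupancy < SpillThreshold) ? Occupancy : SpillThreshold.
  return std::min(Occupancy, SpillThreshold);
}
```

Under these assumptions, minAllowedOccupancy(10, true, 1024, 256) gives 4, minAllowedOccupancy(10, true, 1536, 256) gives 6, and with the temporary 128-VGPR limit minAllowedOccupancy(10, true, 1024, 128) gives 8.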

i.e. it should be higher than 4 in many cases to avoid spilling.

As I understand it: the scheduler came up with a schedule that had a max register pressure of 128. For some reason the register allocator (or other post-scheduler passes) failed to meet this limit and actually tried to use more than 128 registers. Normally this would mean that we miss the scheduler's occupancy target, but because 128 was the maximum number of usable registers (on GFX11 before D133723) we spilled instead.

Perhaps it is worth trying to understand *why* we failed to meet the scheduler's estimated limit of 128 registers? It may not be possible for the scheduler to accurately estimate the final register usage in all cases, but perhaps there are some simple problems that we could fix to make the estimate more reliable?
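
The distinction drawn in this comment can be sketched as a small classification. This is a simplified model, not LLVM code: the enum, the function, and its parameters are all hypothetical, and it ignores allocation granularity, so exceeding the estimate is reported only as a possible miss of the occupancy target.

```cpp
// Possible outcomes once register allocation has run (hypothetical names).
enum class Outcome { MeetsTarget, MayMissOccupancyTarget, Spill };

// EstimatedVGPRs:      max VGPR pressure the scheduler planned for.
// ActualVGPRs:         VGPRs the allocator ended up wanting.
// AddressableNumVGPRs: hard per-wave limit on usable VGPRs.
Outcome afterAllocation(unsigned EstimatedVGPRs, unsigned ActualVGPRs,
                        unsigned AddressableNumVGPRs) {
  // Beyond the addressable limit there is no register left to use, so the
  // allocator has to spill.
  if (ActualVGPRs > AddressableNumVGPRs)
    return Outcome::Spill;
  // Within the addressable limit, using more than estimated costs occupancy
  // rather than causing spills.
  if (ActualVGPRs > EstimatedVGPRs)
    return Outcome::MayMissOccupancyTarget;
  return Outcome::MeetsTarget;
}
```

With an estimate of 128 and an addressable limit of 128, wanting 130 registers is classified as a spill; with an addressable limit of 256, the same 130 registers only risk missing the occupancy target.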

In D134723#3819661, @critson wrote:

So this was based on dealing with the temporary 128 VGPR limit on GFX11, which has now been removed.
On that basis I will probably abandon this revision; however, I do still have concerns about this magic number 4.

The magic number was borrowed from SC and is empirical. The experience is that increasing occupancy beyond 4 for memory bound kernels does not help performance on average. But it is not linked to spilling in any way, just to cache thrashing. It is likely that this number may differ for later targets, but again it is not linked to spilling.

critson abandoned this revision. Oct 3 2022, 4:07 AM

Revision Contents

	Path	Size
	llvm/lib/Target/AMDGPU/GCNIterativeScheduler.cpp	4 lines
	llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp	4 lines
	llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp	2 lines
	llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h	7 lines

Diff 463156

llvm/lib/Target/AMDGPU/GCNIterativeScheduler.cpp

[466 lines not shown] unsigned GCNIterativeScheduler::tryMaximizeOccupancy(unsigned TargetOcc) {

   return std::max(NewOcc, Occ);
 }

 void GCNIterativeScheduler::scheduleLegacyMaxOccupancy(
   bool TryMaximizeOccupancy) {
   const auto &ST = MF.getSubtarget<GCNSubtarget>();
   SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
-  auto TgtOcc = MFI->getMinAllowedOccupancy();
+  auto TgtOcc = MFI->getMinAllowedOccupancy(ST);

   sortRegionsByPressure(TgtOcc);
   auto Occ = Regions.front()->MaxPressure.getOccupancy(ST);

   if (TryMaximizeOccupancy && Occ < TgtOcc)
     Occ = tryMaximizeOccupancy(TgtOcc);

   // This is really weird but for some magic scheduling regions twice
[71 lines not shown]

 ///////////////////////////////////////////////////////////////////////////////
 // ILP scheduler port

 void GCNIterativeScheduler::scheduleILP(
   bool TryMaximizeOccupancy) {
   const auto &ST = MF.getSubtarget<GCNSubtarget>();
   SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
-  auto TgtOcc = MFI->getMinAllowedOccupancy();
+  auto TgtOcc = MFI->getMinAllowedOccupancy(ST);

   sortRegionsByPressure(TgtOcc);
   auto Occ = Regions.front()->MaxPressure.getOccupancy(ST);

   if (TryMaximizeOccupancy && Occ < TgtOcc)
     Occ = tryMaximizeOccupancy(TgtOcc);

   TgtOcc = std::min(Occ, TgtOcc);
[27 lines not shown]

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp

[886 lines not shown] void GCNSchedStage::checkScheduling() {
   // scheduled region. We might still be able to revert scheduling if the
   // occupancy before was higher, or if the current schedule has register
   // pressure higher than the excess limits which could lead to more spilling.
   unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);

   // Allow memory bound functions to drop to 4 waves if not limited by an
   // attribute.
   if (WavesAfter < WavesBefore && WavesAfter < DAG.MinOccupancy &&
-      WavesAfter >= MFI.getMinAllowedOccupancy()) {
+      WavesAfter >= MFI.getMinAllowedOccupancy(ST)) {
     LLVM_DEBUG(dbgs() << "Function is memory bound, allow occupancy drop up to "
-                      << MFI.getMinAllowedOccupancy() << " waves\n");
+                      << MFI.getMinAllowedOccupancy(ST) << " waves\n");
     NewOccupancy = WavesAfter;
   }

   if (NewOccupancy < DAG.MinOccupancy) {
     DAG.MinOccupancy = NewOccupancy;
     MFI.limitOccupancy(DAG.MinOccupancy);
     DAG.RegionsWithMinOcc.reset();
     LLVM_DEBUG(dbgs() << "Occupancy lowered for the function to "
[470 lines not shown]
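
The occupancy decision in this hunk can be modelled in isolation as follows. This is a sketch of only the fragment shown above: the free function and its integer parameters are hypothetical stand-ins for the GCNSchedStage and DAG state, and the rest of checkScheduling() (including any decision to revert the schedule) is not modelled.

```cpp
#include <algorithm>

// Models the NewOccupancy computation from the hunk above.
// MinAllowedOccupancy stands in for MFI.getMinAllowedOccupancy(ST).
unsigned newOccupancy(unsigned WavesBefore, unsigned WavesAfter,
                      unsigned DAGMinOccupancy, unsigned MinAllowedOccupancy) {
  // By default, do not record a loss of occupancy.
  unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
  // Accept the drop only if the new schedule stays at or above the
  // memory bound floor.
  if (WavesAfter < WavesBefore && WavesAfter < DAGMinOccupancy &&
      WavesAfter >= MinAllowedOccupancy)
    NewOccupancy = WavesAfter;
  return NewOccupancy;
}
```

For example, a drop from 8 to 5 waves is accepted when the floor is 4 (the result is 5), but is not accepted when the floor is raised to 6 (the result stays 8), which is the behavioural difference this revision would have introduced on targets where the floor exceeds 4.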

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp

[199 lines not shown] bool SIFormMemoryClauses::checkPressure(const MachineInstr &MI,
   // spilling just to form a soft clause.
   //
   // FIXME: This pressure check is fundamentally broken. First, this is checking
   // the global pressure, not the pressure at this specific point in the
   // program. Second, it's not accounting for the increased liveness of the use
   // operands due to the early clobber we will introduce. Third, the pressure
   // tracking does not account for the alignment requirements for SGPRs, or the
   // fragmentation of registers the allocator will need to satisfy.
-  if (Occupancy >= MFI->getMinAllowedOccupancy() &&
+  if (Occupancy >= MFI->getMinAllowedOccupancy(*ST) &&
       MaxPressure.getVGPRNum(ST->hasGFX90AInsts()) <= MaxVGPRs / 2 &&
       MaxPressure.getSGPRNum() <= MaxSGPRs / 2) {
     LastRecordedOccupancy = Occupancy;
     return true;
   }
   return false;
 }

[208 lines not shown]
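
The condition in this hunk can be read as a three-part predicate. The sketch below restates it with plain integers; the function name and parameters are hypothetical, and the register limits used in the usage note are made-up values for illustration only.

```cpp
// Restates the condition guarding clause formation in the hunk above.
// MinAllowedOccupancy stands in for MFI->getMinAllowedOccupancy(*ST).
bool pressureAllowsClause(unsigned Occupancy, unsigned MinAllowedOccupancy,
                          unsigned VGPRPressure, unsigned SGPRPressure,
                          unsigned MaxVGPRs, unsigned MaxSGPRs) {
  // Occupancy must not fall below the floor, and both register classes
  // must stay within half of their respective budgets.
  return Occupancy >= MinAllowedOccupancy &&
         VGPRPressure <= MaxVGPRs / 2 &&
         SGPRPressure <= MaxSGPRs / 2;
}
```

With an occupancy of 6 and pressures within half of the assumed budgets, the predicate holds for a floor of 4 but fails for a floor of 8, so a higher value from getMinAllowedOccupancy would also make this pass more conservative about forming clauses.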

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h

[953 lines not shown] public:
   getGWSPSV(const AMDGPUTargetMachine &TM) {
     return &GWSResourcePSV;
   }

   unsigned getOccupancy() const {
     return Occupancy;
   }

-  unsigned getMinAllowedOccupancy() const {
+  unsigned getMinAllowedOccupancy(const GCNSubtarget &ST) const {
     if (!isMemoryBound() && !needsWaveLimiter())
       return Occupancy;
-    return (Occupancy < 4) ? Occupancy : 4;
+    unsigned spillThreshold = ST.getTotalNumVGPRs() / ST.getAddressableNumVGPRs();
+    if (spillThreshold < 4)
+      spillThreshold = 4;
+    return (Occupancy < spillThreshold) ? Occupancy : spillThreshold;
   }

   void limitOccupancy(const MachineFunction &MF);

   void limitOccupancy(unsigned Limit) {
     if (Occupancy > Limit)
       Occupancy = Limit;
   }
[22 lines not shown]