This is an archive of the discontinued LLVM Phabricator instance.


Differential D83674

[AMDGPU] Calculate minimum allowed occupancy based on threads per lane
Needs Revision · Public

Authored by foad on Jul 13 2020, 6:11 AM.

Details

Reviewers

rampitec
nhaehnle
arsenm

Summary

Previously the minimum allowed occupancy (for memory-bound functions)
was a constant 4 waves per SIMD. Now it is a constant 16 threads per
SIMD lane, which better accounts for the architectural changes in GFX10.
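
The "threads per SIMD lane" figure can be converted back into a wave count with a little arithmetic. The sketch below is not code from the patch: `minWavesPerSimd` is a hypothetical helper, and the SIMD widths passed to it (16 lanes for GFX9 and below, 32 lanes for GFX10) are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical helper, not part of the patch: convert a "threads per SIMD
// lane" budget into a number of waves per SIMD.
//
// A wave of WaveSize threads running on a SIMD with SimdLanes lanes places
// WaveSize / SimdLanes threads on every lane, so
//   waves = ThreadsPerLane * SimdLanes / WaveSize.
unsigned minWavesPerSimd(unsigned ThreadsPerLane, unsigned SimdLanes,
                         unsigned WaveSize) {
  assert(WaveSize != 0 && "wave size must be non-zero");
  return ThreadsPerLane * SimdLanes / WaveSize;
}
```

With a budget of 16 threads per lane, `minWavesPerSimd(16, 16, 64)` gives 4 (GFX9 and below), `minWavesPerSimd(16, 32, 64)` gives 8 (GFX10 wave64), and `minWavesPerSimd(16, 32, 32)` gives 16 (GFX10 wave32).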

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	440 ms	linux > SanitizerCommon-asan-x86_64-Linux.Linux::Unknown Unit Message ("")
	280 ms	linux > SanitizerCommon-lsan-x86_64-Linux.Linux::Unknown Unit Message ("")
	380 ms	linux > SanitizerCommon-msan-x86_64-Linux.Linux::Unknown Unit Message ("")
	460 ms	linux > SanitizerCommon-tsan-x86_64-Linux.Linux::Unknown Unit Message ("")
	420 ms	linux > SanitizerCommon-ubsan-x86_64-Linux.Linux::Unknown Unit Message ("")

Event Timeline

foad created this revision. · Jul 13 2020, 6:11 AM

Herald added a project: Restricted Project. · Jul 13 2020, 6:11 AM

Herald added subscribers: llvm-commits, kerbowa, javed.absar and 8 others.

Harbormaster failed remote builds in B63944: Diff 277402! · Jul 13 2020, 6:49 AM

rampitec requested changes to this revision. · Jul 13 2020, 11:25 AM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200	This is not expressible as a number of VGPRs. The limiter is about memory and not about register budget at all. Meanwhile it looks like 8 waves might be more reasonable for wave32, but wave64 shall still use the same value across the targets, because at the end of the day it boils down to a number of concurrent memory operations.

This revision now requires changes to proceed. · Jul 13 2020, 11:25 AM

nhaehnle added inline comments. · Jul 16 2020, 12:48 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200	I agree that using number of VGPRs here is an awkward roundabout way of doing it. The point isn't really about VGPRs, it's about how many threads are alive per lane in order to allow a good mix of VALU and VMEM instruction issue in the face of certain (IMHO underspecified) memory pressure situations. I also disagree that 8 is the right answer for Wave32, for precisely the same reason: it is about the number of concurrent memory operations. 16x Wave32 gives you the same amount of memory operations per CU (2 SIMD * 16 waves * 32 = 1024 workitems) as 4 waves on <=gfx9 (4 SIMD * 4 waves * 64 = 1024 workitems).
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919	The `ST` argument requirement seems annoying. Wouldn't it make sense for the MachineFunctionInfo to know the underlying MachineFunction, which would allow access to the GCNSubtarget (i.e. most-derived TargetSubtargetInfo)?
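
The workitem arithmetic in nhaehnle's comment on SIMachineFunctionInfo.cpp line 200 can be checked with a small sketch. `workitemsPerCU` is a hypothetical helper introduced here for illustration. The SIMD counts per CU (2 for GFX10, 4 for GFX9 and below) are the ones quoted in that comment.

```cpp
#include <cassert>

// Hypothetical helper: number of workitems resident on one compute unit
// when every SIMD holds WavesPerSimd waves of WaveSize threads each.
unsigned workitemsPerCU(unsigned SimdsPerCU, unsigned WavesPerSimd,
                        unsigned WaveSize) {
  assert((WaveSize == 32 || WaveSize == 64) && "expected wave32 or wave64");
  return SimdsPerCU * WavesPerSimd * WaveSize;
}
```

Under these figures, `workitemsPerCU(2, 16, 32)` and `workitemsPerCU(4, 4, 64)` both evaluate to 1024, while the 8-wave limit suggested for wave32 would give `workitemsPerCU(2, 8, 32)`, which is 512.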

arsenm added inline comments. · Jul 16 2020, 12:52 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919	No, I'm actively trying to remove references to the machine function state from MachineFunctionInfo

rampitec added inline comments. · Jul 16 2020, 12:59 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200	I can understand the logic behind it, but using getTotalNumVGPRs() is not really what we want here. It probably needs to use getMaxWavesPerEU() and wavefront size to estimate a number of threads in flight. Besides, the same logic will give the answer 16 even on gfx1030, where 16 is the absolute maximum. The latter does not sound right.
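
The gfx1030 concern can be made concrete with a standalone sketch. Nothing here is LLVM code: `SubtargetModel` is a hypothetical stand-in for the two subtarget queries mentioned in the thread, and the numbers fed into it are assumptions. The 1024 VGPRs and 16-wave maximum for gfx1030 wave32 are inferred from the comment above, and the 10-wave maximum used for GFX9 is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical stand-in for the subtarget queries discussed in the thread.
struct SubtargetModel {
  unsigned TotalNumVGPRs; // stands in for a getTotalNumVGPRs() query
  unsigned MaxWavesPerEU; // stands in for a getMaxWavesPerEU() query
};

// The heuristic from the patch: one wave of minimum occupancy per 64 VGPRs.
unsigned patchMinOccupancy(const SubtargetModel &ST) {
  return ST.TotalNumVGPRs / 64;
}

// True if the heuristic leaves any room to drop below the hardware maximum.
bool allowsAnyDrop(const SubtargetModel &ST) {
  assert(ST.MaxWavesPerEU > 0 && "a target supports at least one wave");
  return patchMinOccupancy(ST) < ST.MaxWavesPerEU;
}
```

For `SubtargetModel{1024, 16}` the heuristic returns 16 and `allowsAnyDrop` is false, so a memory-bound function could never lower its occupancy. For `SubtargetModel{256, 10}` it returns 4 and leaves headroom.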

nhaehnle added inline comments. · Jul 16 2020, 1:01 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919	Why is that? MachineFunctionInfo is explicitly created with the MachineFunction as a constructor argument?

arsenm added inline comments. · Jul 16 2020, 1:02 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919	That's what I'm trying to fix in D80249

nhaehnle added inline comments. · Jul 16 2020, 1:03 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200	I agree that gfx10 can probably drop the requirement from the 16 that would be equivalent to gfx9. I've heard numbers in the 10-12 range being thrown around as where dropping lower than that hurts performance. On gfx9, the consensus seemed to be 4 waves for the same kind of folklore. I don't think we really have enough experience and thorough performance studies with gfx10 yet to decide this finally. Maybe it should be a cl::opt... (not a function attribute, since this is specifically about the default heuristic).
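
One way to picture the cl::opt idea without pulling in LLVM headers is the plain C++ model below. It is only a sketch: `minAllowedOccupancy` and its parameters are hypothetical, and the convention that an override of 0 means "not set" is an assumption of this example, not something proposed in the review.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model of a default heuristic with a command-line override.
// Override == 0 means the option was not given, so DefaultMin applies.
unsigned minAllowedOccupancy(unsigned Occupancy, bool MemoryBoundOrLimited,
                             unsigned DefaultMin, unsigned Override) {
  assert(DefaultMin > 0 && "default minimum must be at least one wave");
  // Functions that are neither memory bound nor wave limited keep their
  // current occupancy, as in the patch.
  if (!MemoryBoundOrLimited)
    return Occupancy;
  unsigned Min = Override != 0 ? Override : DefaultMin;
  return std::min(Occupancy, Min);
}
```

With a default of 16, `minAllowedOccupancy(20, true, 16, 0)` returns 16, while an override of 10 (in the 10-12 range mentioned above) makes `minAllowedOccupancy(20, true, 16, 10)` return 10. A function that is not memory bound is unaffected: `minAllowedOccupancy(20, false, 16, 10)` returns 20.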

rampitec added inline comments. · Jul 16 2020, 1:05 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200	Probably, and I do not think we can use simple divide x by y logic here.

nhaehnle added inline comments. · Jul 16 2020, 1:11 PM

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919	I see. I commented on that change because I don't think it's quite right, but regardless of the outcome of that discussion: D80249 still gives the MFI access to the TargetSubtargetInfo, which is what we're talking about here. Even with that change, passing the ST yet again into this method should be unnecessary.

foad mentioned this in D134723: [AMDGPU] Set memory bound occupancy based on addressable VGPRs. · Sep 27 2022, 2:12 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/GCNIterativeScheduler.cpp	4 lines
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp	4 lines
llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp	2 lines
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h	7 lines
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp	11 lines

Diff 277402

llvm/lib/Target/AMDGPU/GCNIterativeScheduler.cpp

@@ 487 lines not shown @@ unsigned GCNIterativeScheduler::tryMaximizeOccupancy(unsigned TargetOcc) {

   return std::max(NewOcc, Occ);
 }

 void GCNIterativeScheduler::scheduleLegacyMaxOccupancy(
   bool TryMaximizeOccupancy) {
   const auto &ST = MF.getSubtarget<GCNSubtarget>();
   SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
-  auto TgtOcc = MFI->getMinAllowedOccupancy();
+  auto TgtOcc = MFI->getMinAllowedOccupancy(ST);

   sortRegionsByPressure(TgtOcc);
   auto Occ = Regions.front()->MaxPressure.getOccupancy(ST);

   if (TryMaximizeOccupancy && Occ < TgtOcc)
     Occ = tryMaximizeOccupancy(TgtOcc);

   // This is really weird but for some magic scheduling regions twice
@@ 71 lines not shown @@

 ///////////////////////////////////////////////////////////////////////////////
 // ILP scheduler port

 void GCNIterativeScheduler::scheduleILP(
   bool TryMaximizeOccupancy) {
   const auto &ST = MF.getSubtarget<GCNSubtarget>();
   SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
-  auto TgtOcc = MFI->getMinAllowedOccupancy();
+  auto TgtOcc = MFI->getMinAllowedOccupancy(ST);

   sortRegionsByPressure(TgtOcc);
   auto Occ = Regions.front()->MaxPressure.getOccupancy(ST);

   if (TryMaximizeOccupancy && Occ < TgtOcc)
     Occ = tryMaximizeOccupancy(TgtOcc);

   TgtOcc = std::min(Occ, TgtOcc);
@@ 27 lines not shown @@

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp

@@ 352 lines not shown @@ LLVM_DEBUG(dbgs() << "Occupancy before scheduling: " << WavesBefore
                     << ", after " << WavesAfter << ".\n");

   // We could not keep current target occupancy because of the just scheduled
   // region. Record new occupancy for next scheduling cycle.
   unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
   // Allow memory bound functions to drop to 4 waves if not limited by an
   // attribute.
   if (WavesAfter < WavesBefore && WavesAfter < MinOccupancy &&
-      WavesAfter >= MFI.getMinAllowedOccupancy()) {
+      WavesAfter >= MFI.getMinAllowedOccupancy(ST)) {
     LLVM_DEBUG(dbgs() << "Function is memory bound, allow occupancy drop up to "
-                      << MFI.getMinAllowedOccupancy() << " waves\n");
+                      << MFI.getMinAllowedOccupancy(ST) << " waves\n");
     NewOccupancy = WavesAfter;
   }
   if (NewOccupancy < MinOccupancy) {
     MinOccupancy = NewOccupancy;
     MFI.limitOccupancy(MinOccupancy);
     LLVM_DEBUG(dbgs() << "Occupancy lowered for the function to "
                       << MinOccupancy << ".\n");
   }
@@ 240 lines not shown @@
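
The occupancy decision in the GCNSchedStrategy.cpp hunk above can be isolated into a standalone function to see how the threshold changes the outcome. This is a sketch, not the scheduler's code: `updateMinOccupancy` is a hypothetical name, and `MinAllowed` stands in for the value returned by `MFI.getMinAllowedOccupancy(ST)`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical extraction of the decision shown in the hunk: returns the
// function-wide minimum occupancy after one region has been scheduled.
unsigned updateMinOccupancy(unsigned WavesBefore, unsigned WavesAfter,
                            unsigned MinOccupancy, unsigned MinAllowed) {
  assert(WavesBefore > 0 && WavesAfter > 0 && "occupancy is at least 1 wave");
  // By default the higher of the two occupancies is recorded.
  unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
  // A drop is accepted only if it stays at or above the allowed minimum.
  if (WavesAfter < WavesBefore && WavesAfter < MinOccupancy &&
      WavesAfter >= MinAllowed)
    NewOccupancy = WavesAfter;
  return std::min(MinOccupancy, NewOccupancy);
}
```

For a region that falls from 8 to 5 waves, a threshold of 4 accepts the drop (`updateMinOccupancy(8, 5, 8, 4)` returns 5), whereas a threshold of 8, the value the patch would compute for GFX10 wave64, rejects it (`updateMinOccupancy(8, 5, 8, 8)` returns 8).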

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp

@@ 245 lines not shown @@ bool SIFormMemoryClauses::checkPressure(const MachineInstr &MI,
                                         GCNDownwardRPTracker &RPT) {
   // NB: skip advanceBeforeNext() call. Since all defs will be marked
   // early-clobber they will all stay alive at least to the end of the
   // clause. Therefor we should not decrease pressure even if load
   // pointer becomes dead and could otherwise be reused for destination.
   RPT.advanceToNext();
   GCNRegPressure MaxPressure = RPT.moveMaxPressure();
   unsigned Occupancy = MaxPressure.getOccupancy(*ST);
-  if (Occupancy >= MFI->getMinAllowedOccupancy() &&
+  if (Occupancy >= MFI->getMinAllowedOccupancy(*ST) &&
       MaxPressure.getVGPRNum() <= MaxVGPRs &&
       MaxPressure.getSGPRNum() <= MaxSGPRs) {
     LastRecordedOccupancy = Occupancy;
     return true;
   }
   return false;
 }

@@ 148 lines not shown @@

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h

@@ 30 lines not shown @@
 #include "llvm/Support/ErrorHandling.h"
 #include <array>
 #include <cassert>
 #include <utility>
 #include <vector>

 namespace llvm {

+class GCNSubtarget;
 class MachineFrameInfo;
 class MachineFunction;
 class TargetRegisterClass;

 class AMDGPUPseudoSourceValue : public PseudoSourceValue {
 public:
   enum AMDGPUPSVKind : unsigned {
     PSVBuffer = PseudoSourceValue::TargetCustom,
@@ 863 lines not shown @@ const AMDGPUGWSResourcePseudoSourceValue *getGWSPSV(const SIInstrInfo &TII) {

     return GWSResourcePSV.get();
   }

   unsigned getOccupancy() const {
     return Occupancy;
   }

-  unsigned getMinAllowedOccupancy() const {
-    if (!isMemoryBound() && !needsWaveLimiter())
-      return Occupancy;
-    return (Occupancy < 4) ? Occupancy : 4;
-  }
+  unsigned getMinAllowedOccupancy(const GCNSubtarget &ST) const;

   void limitOccupancy(const MachineFunction &MF);

   void limitOccupancy(unsigned Limit) {
     if (Occupancy > Limit)
       Occupancy = Limit;
   }

@@ 10 lines not shown @@

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp

Lint (pre-merge checks): clang-format suggested style edits: break the line after `unsigned` in the definition of SIMachineFunctionInfo::getMinAllowedOccupancy(const GCNSubtarget &ST) const.

 //===- SIMachineFunctionInfo.cpp - SI Machine Function Info ---------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//

 #include "SIMachineFunctionInfo.h"
@@ 175 lines not shown @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const MachineFunction &MF)
   if (!S.empty())
     S.consumeInteger(0, HighBitsOf32BitAddress);

   S = F.getFnAttribute("amdgpu-gds-size").getValueAsString();
   if (!S.empty())
     S.consumeInteger(0, GDSSize);
 }

+unsigned SIMachineFunctionInfo::getMinAllowedOccupancy(const GCNSubtarget &ST) const {
+  if (!isMemoryBound() && !needsWaveLimiter())
+    return Occupancy;
+  // Allow a minimum of 16 threads per SIMD lane, which works out as:
+  // - 4 waves per SIMD for GFX9 and below
+  // - 8 waves per SIMD for GFX10 wave64
+  // - 16 waves per SIMD for GFX10 wave32
+  unsigned MinOccupancy = ST.getTotalNumVGPRs() / 64;
+  return std::min(Occupancy, MinOccupancy);
+}
+
 void SIMachineFunctionInfo::limitOccupancy(const MachineFunction &MF) {
   limitOccupancy(getMaxWavesPerEU());
   const GCNSubtarget& ST = MF.getSubtarget<GCNSubtarget>();
   limitOccupancy(ST.getOccupancyWithLocalMemSize(getLDSSize(),
                  MF.getFunction()));
 }

 Register SIMachineFunctionInfo::addPrivateSegmentBuffer(
@@ 389 lines not shown @@