This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Set threshold for regbanks reassign pass
ClosedPublic

Authored by rampitec on Feb 22 2021, 12:28 PM.

Details

Reviewers

alex-t
foad

Commits

rGd1b92c91afd0: [AMDGPU] Set threshold for regbanks reassign pass

Summary

This is to limit compile time. I experimented with some
inputs and found that compile time stays reasonable for this
pass if we have fewer than 100000 virtual registers, and then
starts to explode somewhere between 100000 and 150000.
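
The gate described above can be pictured with a small stand-alone sketch. This is not the pass's code: `shouldRunRegBankReassign` and `DefaultVRegThreshold` are hypothetical names used only for illustration. The only details taken from the diff are the default of 100000 and the fact that the function is skipped when the vreg count is strictly greater than the threshold.

```cpp
// Hypothetical stand-in for the default of the hidden cl::opt
// "amdgpu-regbanks-reassign-threshold" added by this revision.
constexpr unsigned DefaultVRegThreshold = 100000;

// Returns true when the pass would still run. The diff skips a function
// when NumVirtRegs > threshold, so a count equal to the threshold runs.
constexpr bool shouldRunRegBankReassign(
    unsigned NumVirtRegs, unsigned Threshold = DefaultVRegThreshold) {
  return NumVirtRegs <= Threshold;
}
```

Because the threshold is declared as a `cl::opt`, a build containing this revision should accept `-amdgpu-regbanks-reassign-threshold=<N>` on the `llc` command line to raise or lower the limit.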

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. · Feb 22 2021, 12:28 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others. · Feb 22 2021, 12:28 PM

rampitec requested review of this revision. · Feb 22 2021, 12:28 PM

Herald added a project: Restricted Project. · Feb 22 2021, 12:28 PM

Herald added a subscriber: wdng.

Fixed debug output.

LGTM

This revision is now accepted and ready to land. · Feb 23 2021, 3:38 AM

Closed by commit rGd1b92c91afd0: [AMDGPU] Set threshold for regbanks reassign pass (authored by rampitec). · Feb 23 2021, 10:22 AM

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGd1b92c91afd0: [AMDGPU] Set threshold for regbanks reassign pass.

arsenm added inline comments. · Feb 23 2021, 10:27 AM

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp
54	This seems like a pretty low threshold

rampitec added inline comments. · Feb 23 2021, 12:22 PM

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp
54	It is. Here is what happens when we have ~150000 vregs:

	147.0184 ( 75.8%)   0.0000 (  0.0%)  147.0184 ( 75.5%)  147.1465 ( 75.5%)  GCN RegBank Reassign
	 14.0944 (  7.3%)   0.0800 ( 12.7%)   14.1743 (  7.3%)   14.1812 (  7.3%)  Machine Instruction Scheduler

	And when we have ~100000 the pass is not even visible at the top of -time-passes. So unless there are better ideas we need to limit it.

	One idea I have is to use some kind of heuristic that accounts not only for the number of vregs, but also for the number of registers allocated. What makes it slow is checkInterference() for every probed register at every conflict. Obviously the time will be proportional to the number of overlapping LIs at the point of conflict, and that can be more or less approximated by the number of registers, at least in the most "fat" portion of a program. Moreover, the more overlapping LIs we have, the less chance we have of finding a combination of registers that resolves a conflict.

	If there were a cheap way to estimate register pressure at a given instruction we could skip individual instructions from the search, but I am afraid RPT is not a cheap way.
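
The cost argument in the comment above can be written down as a toy model. The sketch below is not LLVM code and does not reproduce the pass; `estimateReassignWork` and `exceedsBudget` are hypothetical helpers, and the multiplicative form is an assumption drawn from the description (one `checkInterference()` call per probed register per conflict, each costing time proportional to the number of overlapping live intervals).

```cpp
#include <cstdint>

// Toy cost model, not LLVM code. All names here are hypothetical.
// Work is modeled as one interference check per probed register per
// conflict, with each check costing time proportional to the number of
// live intervals overlapping the conflict point.
constexpr std::uint64_t estimateReassignWork(
    std::uint64_t NumConflicts, std::uint64_t ProbedRegsPerConflict,
    std::uint64_t OverlappingLIs) {
  return NumConflicts * ProbedRegsPerConflict * OverlappingLIs;
}

// Hypothetical gate: give up when the estimated work exceeds a budget.
constexpr bool exceedsBudget(std::uint64_t Work, std::uint64_t Budget) {
  return Work > Budget;
}
```

Under this model, doubling both the number of conflicts and the number of overlapping intervals quadruples the estimated work, which is one way to read why a plain vreg count is only a rough proxy for the pass's running time.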

arsenm added inline comments. · Feb 23 2021, 12:25 PM

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp
668	Can you redo this search process in terms of regunits instead?

rampitec added inline comments. · Feb 23 2021, 12:28 PM

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp
668	I will need to supply a physreg for VRM at the end. Plus all these isAllocatable() et al. checks are not for reg units.

arsenm added inline comments. · Feb 23 2021, 12:31 PM

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp
668	VRM should be using regunits internally. It does have LiveRegMatrix::checkRegUnitInterference. Overall we need to rewrite everything considering registers to operate on regunits instead (including reframing reserved registers in terms of reserved regunits)

arsenm added inline comments. · Feb 23 2021, 12:33 PM

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp
668	I guess checkRegUnitInterference is part of the implementation, but that just means you're repeating that multiple times by scanning over all of the registers in tuple classes
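
The point about repeated work in tuple classes can be illustrated with a back-of-the-envelope count. The sketch below is a simplified model, not LLVM code: it assumes a register file of `NumUnits` 32-bit units and a tuple class that contains one tuple starting at every unit (an assumption; real classes may be aligned or restricted), and all function names are hypothetical.

```cpp
// Toy model, not LLVM code. All names here are hypothetical.

// Number of Width-wide tuples when a tuple may start at any unit.
constexpr unsigned numTuples(unsigned NumUnits, unsigned Width) {
  return (Width != 0 && NumUnits >= Width) ? NumUnits - Width + 1 : 0;
}

// Unit-level interference checks when every tuple is probed on its own:
// each tuple re-checks all Width units it covers.
constexpr unsigned unitChecksPerTupleScan(unsigned NumUnits, unsigned Width) {
  return numTuples(NumUnits, Width) * Width;
}

// Unit-level checks if each unit's result were computed once and reused.
constexpr unsigned unitChecksPerUnitScan(unsigned NumUnits, unsigned Width) {
  return numTuples(NumUnits, Width) != 0 ? NumUnits : 0;
}
```

For 256 units and 4-wide tuples this model gives 1012 unit checks for the per-tuple scan versus 256 for the per-unit scan, which is the kind of repetition the comment describes.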

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp	18 lines

Diff 325831

llvm/lib/Target/AMDGPU/GCNRegBankReassign.cpp

	... (42 lines not shown) ...

 using namespace llvm;

 static cl::opt<unsigned> VerifyStallCycles("amdgpu-verify-regbanks-reassign",
   cl::desc("Verify stall cycles in the regbanks reassign pass"),
   cl::value_desc("0|1|2"),
   cl::init(0), cl::Hidden);

+// Threshold to keep compile time reasonable.
+static cl::opt<unsigned> VRegThresh("amdgpu-regbanks-reassign-threshold",
+  cl::desc("Max number of vregs to run the regbanks reassign pass"),
+  cl::init(100000), cl::Hidden);
+
 #define DEBUG_TYPE "amdgpu-regbanks-reassign"

 #define NUM_VGPR_BANKS 4
 #define NUM_SGPR_BANKS 8
 #define NUM_BANKS (NUM_VGPR_BANKS + NUM_SGPR_BANKS)
 #define SGPR_BANK_OFFSET NUM_VGPR_BANKS
 #define VGPR_BANK_MASK 0xf
 #define SGPR_BANK_MASK 0xff0
	... (596 lines not shown) ...
 MCRegister GCNRegBankReassign::scavengeReg(LiveInterval &LI, unsigned Bank,
                                            unsigned SubReg) const {
   const TargetRegisterClass *RC = MRI->getRegClass(LI.reg());
   unsigned MaxNumRegs = (Bank < NUM_VGPR_BANKS) ? MaxNumVGPRs
                                                 : MaxNumSGPRs;
   unsigned MaxReg = MaxNumRegs + (Bank < NUM_VGPR_BANKS ? AMDGPU::VGPR0
                                                         : AMDGPU::SGPR0);

   for (MCRegister Reg : RC->getRegisters()) {
     // Check occupancy limit.
     if (TRI->isSubRegisterEq(Reg, MaxReg))
       break;

     if (!MRI->isAllocatable(Reg) || getPhysRegBank(Reg, SubReg) != Bank)
       continue;

     for (unsigned I = 0; CSRegs[I]; ++I)
	... (130 lines not shown) ...
 }

 bool GCNRegBankReassign::runOnMachineFunction(MachineFunction &MF) {
   ST = &MF.getSubtarget<GCNSubtarget>();
   if (!ST->hasRegisterBanking() || skipFunction(MF.getFunction()))
     return false;

   MRI = &MF.getRegInfo();

+  LLVM_DEBUG(dbgs() << "=== RegBanks reassign analysis on function " << MF.getName()
+               << "\nNumVirtRegs = " << MRI->getNumVirtRegs() << "\n\n");
+
+  if (MRI->getNumVirtRegs() > VRegThresh) {
+    LLVM_DEBUG(dbgs() << "NumVirtRegs > " << VRegThresh
+                      << " threshold, skipping function.\n\n");
+    return false;
+  }
+
   TRI = ST->getRegisterInfo();
   MLI = &getAnalysis<MachineLoopInfo>();
   VRM = &getAnalysis<VirtRegMap>();
   LRM = &getAnalysis<LiveRegMatrix>();
   LIS = &getAnalysis<LiveIntervals>();

   const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
   unsigned Occupancy = MFI->getOccupancy();
   MaxNumVGPRs = ST->getMaxNumVGPRs(MF);
   MaxNumSGPRs = ST->getMaxNumSGPRs(MF);
   MaxNumVGPRs = std::min(ST->getMaxNumVGPRs(Occupancy), MaxNumVGPRs);
   MaxNumSGPRs = std::min(ST->getMaxNumSGPRs(Occupancy, true), MaxNumSGPRs);

   CSRegs = MRI->getCalleeSavedRegs();
   unsigned NumRegBanks = AMDGPU::VGPR_32RegClass.getNumRegs() +
                          // Not a tight bound
                          AMDGPU::SReg_32RegClass.getNumRegs() / 2 + 1;
   RegsUsed.resize(NumRegBanks);

-  LLVM_DEBUG(dbgs() << "=== RegBanks reassign analysis on function " << MF.getName()
-               << '\n');
-
   unsigned StallCycles = collectCandidates(MF);
   NumStallsDetected += StallCycles;

   LLVM_DEBUG(dbgs() << "=== " << StallCycles << " stall cycles detected in "
                   "function " << MF.getName() << '\n');

   LLVM_DEBUG(Candidates.dump(this));

	... (31 lines not shown) ...