This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Implement SGPR spilling with scalar stores
ClosedPublic

Authored by arsenm on Oct 13 2016, 3:22 AM.


Details

Reviewers

Summary

This avoids the nasty problems caused by using
memory instructions that read the exec mask while
spilling / restoring registers used for control flow
masking, but only on VI, where scalar stores were added.

When enabled, this currently always uses the scalar stores,
but it may be better to still try to spill to a VGPR first
and use scalar stores only on the fallback memory path.
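The ordering of spill paths described above can be sketched roughly as follows. This is a simplified model, not code from the patch: the enum and function names are hypothetical, and only the order of the checks mirrors what the diff does in `SIRegisterInfo::spillSGPR` (scalar memory first when the subtarget has scalar stores and the `amdgpu-spill-sgpr-to-smem` option is on, then a VGPR lane, then vector memory).

```cpp
#include <cassert>

// Hypothetical names for illustration; not part of LLVM.
enum class SpillPath { ScalarMemory, VGPRLane, VectorMemory };

// Mirrors the order of checks in the patch: scalar stores win whenever they
// are available and enabled, so the VGPR lane path is never tried in that case.
SpillPath chooseSGPRSpillPath(bool HasScalarStores, bool EnableSpillToSMEM,
                              bool HasFreeVGPRLane) {
  if (HasScalarStores && EnableSpillToSMEM)
    return SpillPath::ScalarMemory;
  if (HasFreeVGPRLane)
    return SpillPath::VGPRLane;
  return SpillPath::VectorMemory;
}
```

The alternative raised in the summary would amount to checking `HasFreeVGPRLane` first and using scalar memory only in place of the final vector memory fallback.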

The cache also needs to be flushed before wave termination
if a scalar store is used.
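As a rough illustration of the flush rule, the sketch below models a basic block as a list of opcodes and inserts a write-back before each termination point that is not already covered by one. The `Op` enum and `flushBeforeTermination` are assumptions made for this example; the real pass works on `MachineInstr`s and inserts `S_DCACHE_WB` with `BuildMI`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy opcode set standing in for machine instructions (hypothetical).
enum class Op { ScalarStore, DCacheWB, EndPgm, Other };

// Inserts a DCacheWB before every EndPgm unless a DCacheWB has already been
// seen with no scalar store after it. Returns true if the block was changed.
bool flushBeforeTermination(std::vector<Op> &Block, bool HaveScalarStores) {
  if (!HaveScalarStores)
    return false;

  bool Changed = false;
  bool SeenDCacheWB = false;
  for (std::size_t I = 0; I < Block.size(); ++I) {
    if (Block[I] == Op::DCacheWB)
      SeenDCacheWB = true;
    else if (Block[I] == Op::ScalarStore)
      SeenDCacheWB = false;

    if (Block[I] == Op::EndPgm && !SeenDCacheWB) {
      Block.insert(Block.begin() + static_cast<std::ptrdiff_t>(I), Op::DCacheWB);
      ++I; // Step past the EndPgm that was shifted one slot to the right.
      Changed = true;
    }
  }
  return Changed;
}
```

`HaveScalarStores` is passed in rather than computed here because, in the patch, it is a function-wide flag: a store in any block forces a flush in every terminating block.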

Diff Detail

Event Timeline

arsenm updated this revision to Diff 74489. Oct 13 2016, 3:22 AM

arsenm retitled this revision from to AMDGPU: Implement SGPR spilling with scalar stores.

arsenm updated this object.

arsenm added a subscriber: llvm-commits.

Herald added a reviewer: tstellarAMD. Oct 13 2016, 3:22 AM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 3 others.

nhaehnle added inline comments. Oct 14 2016, 1:45 AM

lib/Target/AMDGPU/SIInsertWaits.cpp
626–627	Please also handle the SI_RETURN case here and below.

arsenm added inline comments. Oct 14 2016, 3:51 AM

lib/Target/AMDGPU/SIInsertWaits.cpp
626–627	I was specifically not handling that, but I guess it isn't a normal function return

nhaehnle added inline comments. Oct 14 2016, 4:29 AM

lib/Target/AMDGPU/SIInsertWaits.cpp
626–627	Yes. The way it's used, we're just concatenating the binaries of multiple shader parts. Only the middle part should ever contain register spills (well, unless you compile with -O0, but we never do that), so it makes sense that all the necessary handling is confined to the shader part where it happens.

Handle si_return

Herald edited edge metadata. Oct 14 2016, 10:29 AM

Fix if offset is 0

Herald edited edge metadata. Oct 28 2016, 4:12 PM

One question about the offsets at which spills happen, though this could stay a TODO for now. Apart from that, LGTM.

test/CodeGen/AMDGPU/si-spill-sgpr-stack.ll
27–37	I'm a bit surprised by the offsets as they seem too far apart. I guess the offset allocation assumes VGPR spilling, and this is a TODO to be fixed later? Should probably be mentioned here in the test and in the appropriate location in the code.

arsenm added inline comments. Nov 7 2016, 11:24 AM

test/CodeGen/AMDGPU/si-spill-sgpr-stack.ll
27–37	The offsets need to be multiplied by the wave size, so they end up looking big. It would be a code size improvement to use the previous one when splitting the spill (although this is mitigated by using the wider instructions in the follow up patch)
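A small worked example of the offset arithmetic may help here. It restates the expression from the patch, `ST.getWavefrontSize() * (FrOffset + 4 * i)`, as a standalone function; the function name is made up for this sketch, and the wavefront size of 64 and frame offset of 0 are assumptions chosen to match the `0x100` steps seen in the test.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset added to the scratch wave offset for dword number SubRegIdx of
// a spill slot that starts at per-lane frame offset FrameOffset.
// Hypothetical helper; the patch computes this inline.
std::int64_t scalarSpillOffset(unsigned WavefrontSize, std::int64_t FrameOffset,
                               unsigned SubRegIdx) {
  return static_cast<std::int64_t>(WavefrontSize) *
         (FrameOffset + 4 * static_cast<std::int64_t>(SubRegIdx));
}
```

Under those assumptions, the four dwords of a 16-byte spill land at offsets 0, 0x100, 0x200 and 0x300: each 4-byte step in the per-lane frame becomes a 256-byte step once scaled by the 64 lanes of the wave.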

I guess the comment on D26005 about leaving 63 elements wasted belongs here? It would be nice to have a TODO about this here, since it seems to be just an arbitrary limitation of how the spill slots are currently allocated.

In D25551#591815, @nhaehnle wrote:

I guess the comment on D26005 about leaving 63 elements wasted belongs here? It would be nice to have a TODO about this here, since it seems to be just an arbitrary limitation of how the spill slots are currently allocated.

D26104 is the patch to use the wide stores which fixes this

LGTM.

This revision is now accepted and ready to land. Nov 11 2016, 4:23 PM

r286766

Revision Contents

Path

Size

lib/

Target/

AMDGPU/

SIInsertWaits.cpp

43 lines

SIInstrInfo.cpp

14 lines

SIRegisterInfo.cpp

106 lines

test/

CodeGen/

AMDGPU/

attr-amdgpu-flat-work-group-size.ll

4 lines

attr-amdgpu-num-sgpr.ll

18 lines

basic-branch.ll

2 lines

si-spill-sgpr-stack.ll

44 lines

spill-m0.ll

19 lines

MIR/

AMDGPU/

scalar-store-cache-flush.mir

173 lines

Diff 76270

lib/Target/AMDGPU/SIInsertWaits.cpp

Show First 20 Lines • Show All 516 Lines • ▼ Show 20 Lines
bool SIInsertWaits::runOnMachineFunction(MachineFunction &MF) {		bool SIInsertWaits::runOnMachineFunction(MachineFunction &MF) {
bool Changes = false;		bool Changes = false;

ST = &MF.getSubtarget<SISubtarget>();		ST = &MF.getSubtarget<SISubtarget>();
TII = ST->getInstrInfo();		TII = ST->getInstrInfo();
TRI = &TII->getRegisterInfo();		TRI = &TII->getRegisterInfo();
MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
IV = getIsaVersion(ST->getFeatureBits());		IV = getIsaVersion(ST->getFeatureBits());
		const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();

HardwareLimits.Named.VM = getVmcntBitMask(IV);		HardwareLimits.Named.VM = getVmcntBitMask(IV);
HardwareLimits.Named.EXP = getExpcntBitMask(IV);		HardwareLimits.Named.EXP = getExpcntBitMask(IV);
HardwareLimits.Named.LGKM = getLgkmcntBitMask(IV);		HardwareLimits.Named.LGKM = getLgkmcntBitMask(IV);

WaitedOn = ZeroCounts;		WaitedOn = ZeroCounts;
DelayedWaitOn = ZeroCounts;		DelayedWaitOn = ZeroCounts;
LastIssued = ZeroCounts;		LastIssued = ZeroCounts;
LastOpcodeType = OTHER;		LastOpcodeType = OTHER;
LastInstWritesM0 = false;		LastInstWritesM0 = false;
ReturnsVoid = MF.getInfo<SIMachineFunctionInfo>()->returnsVoid();		ReturnsVoid = MFI->returnsVoid();

memset(&UsedRegs, 0, sizeof(UsedRegs));		memset(&UsedRegs, 0, sizeof(UsedRegs));
memset(&DefinedRegs, 0, sizeof(DefinedRegs));		memset(&DefinedRegs, 0, sizeof(DefinedRegs));

SmallVector<MachineInstr *, 4> RemoveMI;		SmallVector<MachineInstr *, 4> RemoveMI;
		SmallVector<MachineBasicBlock *, 4> EndPgmBlocks;

		bool HaveScalarStores = false;

for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();		for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();
BI != BE; ++BI) {		BI != BE; ++BI) {

MachineBasicBlock &MBB = *BI;		MachineBasicBlock &MBB = *BI;

for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();		for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
I != E; ++I) {		I != E; ++I) {

		if (!HaveScalarStores && TII->isScalarStore(*I))
		HaveScalarStores = true;

if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS) {		if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS) {
// There is a hardware bug on CI/SI where SMRD instruction may corrupt		// There is a hardware bug on CI/SI where SMRD instruction may corrupt
// vccz bit, so when we detect that an instruction may read from a		// vccz bit, so when we detect that an instruction may read from a
// corrupt vccz bit, we need to:		// corrupt vccz bit, we need to:
// 1. Insert s_waitcnt lgkm(0) to wait for all outstanding SMRD operations to		// 1. Insert s_waitcnt lgkm(0) to wait for all outstanding SMRD operations to
// complete.		// complete.
// 2. Restore the correct value of vccz by writing the current value		// 2. Restore the correct value of vccz by writing the current value
// of vcc back to vcc.		// of vcc back to vcc.
▲ Show 20 Lines • Show All 52 Lines • ▼ Show 20 Lines	for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();

if (countersNonZero(Required) || countersNonZero(Increment))		if (countersNonZero(Required) || countersNonZero(Increment))
increaseCounters(Required, DelayedWaitOn);		increaseCounters(Required, DelayedWaitOn);

Changes |= insertWait(MBB, I, Required);		Changes |= insertWait(MBB, I, Required);

pushInstruction(MBB, I, Increment);		pushInstruction(MBB, I, Increment);
handleSendMsg(MBB, I);		handleSendMsg(MBB, I);

		if (I->getOpcode() == AMDGPU::S_ENDPGM ||
		I->getOpcode() == AMDGPU::SI_RETURN)
		EndPgmBlocks.push_back(&MBB);
}		}

// Wait for everything at the end of the MBB		// Wait for everything at the end of the MBB
Changes |= insertWait(MBB, MBB.getFirstTerminator(), LastIssued);		Changes |= insertWait(MBB, MBB.getFirstTerminator(), LastIssued);
}		}

		if (HaveScalarStores) {
		// If scalar writes are used, the cache must be flushed or else the next
		// wave to reuse the same scratch memory can be clobbered.
		//
		// Insert s_dcache_wb at wave termination points if there were any scalar
		// stores, and only if the cache hasn't already been flushed. This could be
		// improved by looking across blocks for flushes in postdominating blocks
		// from the stores but an explicitly requested flush is probably very rare.
		for (MachineBasicBlock *MBB : EndPgmBlocks) {
		bool SeenDCacheWB = false;

		for (MachineBasicBlock::iterator I = MBB->begin(), E = MBB->end();
		I != E; ++I) {

		if (I->getOpcode() == AMDGPU::S_DCACHE_WB)
		SeenDCacheWB = true;
		else if (TII->isScalarStore(*I))
		SeenDCacheWB = false;

		// FIXME: It would be better to insert this before a waitcnt if any.
		if ((I->getOpcode() == AMDGPU::S_ENDPGM ||
		I->getOpcode() == AMDGPU::SI_RETURN) && !SeenDCacheWB) {
		Changes = true;
		BuildMI(*MBB, I, I->getDebugLoc(), TII->get(AMDGPU::S_DCACHE_WB));
		}
		}
		}
		}

for (MachineInstr *I : RemoveMI)		for (MachineInstr *I : RemoveMI)
I->eraseFromParent();		I->eraseFromParent();

return Changes;		return Changes;
}		}

lib/Target/AMDGPU/SIInstrInfo.cpp

Show First 20 Lines • Show All 605 Lines • ▼ Show 20 Lines	if (RI.isSGPRClass(RC)) {

// The SGPR spill/restore instructions only work on number sgprs, so we need		// The SGPR spill/restore instructions only work on number sgprs, so we need
// to make sure we are using the correct register class.		// to make sure we are using the correct register class.
if (TargetRegisterInfo::isVirtualRegister(SrcReg) && RC->getSize() == 4) {		if (TargetRegisterInfo::isVirtualRegister(SrcReg) && RC->getSize() == 4) {
MachineRegisterInfo &MRI = MF->getRegInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
MRI.constrainRegClass(SrcReg, &AMDGPU::SReg_32_XM0RegClass);		MRI.constrainRegClass(SrcReg, &AMDGPU::SReg_32_XM0RegClass);
}		}

BuildMI(MBB, MI, DL, OpDesc)		MachineInstrBuilder Spill = BuildMI(MBB, MI, DL, OpDesc)
.addReg(SrcReg, getKillRegState(isKill)) // data		.addReg(SrcReg, getKillRegState(isKill)) // data
.addFrameIndex(FrameIndex) // addr		.addFrameIndex(FrameIndex) // addr
.addMemOperand(MMO)		.addMemOperand(MMO)
.addReg(MFI->getScratchRSrcReg(), RegState::Implicit)		.addReg(MFI->getScratchRSrcReg(), RegState::Implicit)
.addReg(MFI->getScratchWaveOffsetReg(), RegState::Implicit);		.addReg(MFI->getScratchWaveOffsetReg(), RegState::Implicit);
// Add the scratch resource registers as implicit uses because we may end up		// Add the scratch resource registers as implicit uses because we may end up
// needing them, and need to ensure that the reserved registers are		// needing them, and need to ensure that the reserved registers are
// correctly handled.		// correctly handled.

		if (ST.hasScalarStores()) {
		// m0 is used for offset to scalar stores if used to spill.
		Spill.addReg(AMDGPU::M0, RegState::ImplicitDefine);
		}

return;		return;
}		}

if (!ST.isVGPRSpillingEnabled(*MF->getFunction())) {		if (!ST.isVGPRSpillingEnabled(*MF->getFunction())) {
LLVMContext &Ctx = MF->getFunction()->getContext();		LLVMContext &Ctx = MF->getFunction()->getContext();
Ctx.emitError("SIInstrInfo::storeRegToStackSlot - Do not know how to"		Ctx.emitError("SIInstrInfo::storeRegToStackSlot - Do not know how to"
" spill register");		" spill register");
BuildMI(MBB, MI, DL, get(AMDGPU::KILL))		BuildMI(MBB, MI, DL, get(AMDGPU::KILL))
▲ Show 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	if (RI.isSGPRClass(RC)) {
// FIXME: Maybe this should not include a memoperand because it will be		// FIXME: Maybe this should not include a memoperand because it will be
// lowered to non-memory instructions.		// lowered to non-memory instructions.
const MCInstrDesc &OpDesc = get(getSGPRSpillRestoreOpcode(RC->getSize()));		const MCInstrDesc &OpDesc = get(getSGPRSpillRestoreOpcode(RC->getSize()));
if (TargetRegisterInfo::isVirtualRegister(DestReg) && RC->getSize() == 4) {		if (TargetRegisterInfo::isVirtualRegister(DestReg) && RC->getSize() == 4) {
MachineRegisterInfo &MRI = MF->getRegInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
MRI.constrainRegClass(DestReg, &AMDGPU::SReg_32_XM0RegClass);		MRI.constrainRegClass(DestReg, &AMDGPU::SReg_32_XM0RegClass);
}		}

BuildMI(MBB, MI, DL, OpDesc, DestReg)		MachineInstrBuilder Spill = BuildMI(MBB, MI, DL, OpDesc, DestReg)
.addFrameIndex(FrameIndex) // addr		.addFrameIndex(FrameIndex) // addr
.addMemOperand(MMO)		.addMemOperand(MMO)
.addReg(MFI->getScratchRSrcReg(), RegState::Implicit)		.addReg(MFI->getScratchRSrcReg(), RegState::Implicit)
.addReg(MFI->getScratchWaveOffsetReg(), RegState::Implicit);		.addReg(MFI->getScratchWaveOffsetReg(), RegState::Implicit);

		if (ST.hasScalarStores()) {
		// m0 is used for offset to scalar stores if used to spill.
		Spill.addReg(AMDGPU::M0, RegState::ImplicitDefine);
		}

return;		return;
}		}

if (!ST.isVGPRSpillingEnabled(*MF->getFunction())) {		if (!ST.isVGPRSpillingEnabled(*MF->getFunction())) {
LLVMContext &Ctx = MF->getFunction()->getContext();		LLVMContext &Ctx = MF->getFunction()->getContext();
Ctx.emitError("SIInstrInfo::loadRegFromStackSlot - Do not know how to"		Ctx.emitError("SIInstrInfo::loadRegFromStackSlot - Do not know how to"
" restore register");		" restore register");
BuildMI(MBB, MI, DL, get(AMDGPU::IMPLICIT_DEF), DestReg);		BuildMI(MBB, MI, DL, get(AMDGPU::IMPLICIT_DEF), DestReg);
▲ Show 20 Lines • Show All 2,843 Lines • Show Last 20 Lines

lib/Target/AMDGPU/SIRegisterInfo.cpp

Show All 18 Lines
#include "llvm/CodeGen/MachineFrameInfo.h"		#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"		#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/RegisterScavenging.h"		#include "llvm/CodeGen/RegisterScavenging.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/LLVMContext.h"		#include "llvm/IR/LLVMContext.h"

using namespace llvm;		using namespace llvm;

		static cl::opt<bool> EnableSpillSGPRToSMEM(
		"amdgpu-spill-sgpr-to-smem",
		cl::desc("Use scalar stores to spill SGPRs if supported by subtarget"),
		cl::init(true));


static bool hasPressureSet(const int *PSets, unsigned PSetID) {		static bool hasPressureSet(const int *PSets, unsigned PSetID) {
for (unsigned i = 0; PSets[i] != -1; ++i) {		for (unsigned i = 0; PSets[i] != -1; ++i) {
if (PSets[i] == (int)PSetID)		if (PSets[i] == (int)PSetID)
return true;		return true;
}		}
return false;		return false;
}		}

▲ Show 20 Lines • Show All 435 Lines • ▼ Show 20 Lines	BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_SUB_U32), ScratchOffset)
.addReg(ScratchOffset)		.addReg(ScratchOffset)
.addImm(OriginalImmOffset);		.addImm(OriginalImmOffset);
}		}
}		}

void SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI,		void SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI,
int Index,		int Index,
RegScavenger *RS) const {		RegScavenger *RS) const {
MachineFunction *MF = MI->getParent()->getParent();
MachineRegisterInfo &MRI = MF->getRegInfo();
MachineBasicBlock *MBB = MI->getParent();		MachineBasicBlock *MBB = MI->getParent();
SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();		MachineFunction *MF = MBB->getParent();
MachineFrameInfo &FrameInfo = MF->getFrameInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
const SISubtarget &ST = MF->getSubtarget<SISubtarget>();		const SISubtarget &ST = MF->getSubtarget<SISubtarget>();
const SIInstrInfo *TII = ST.getInstrInfo();		const SIInstrInfo *TII = ST.getInstrInfo();
const DebugLoc &DL = MI->getDebugLoc();

unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());		unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());
unsigned SuperReg = MI->getOperand(0).getReg();		unsigned SuperReg = MI->getOperand(0).getReg();
bool IsKill = MI->getOperand(0).isKill();		bool IsKill = MI->getOperand(0).isKill();
		const DebugLoc &DL = MI->getDebugLoc();

		SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
		MachineFrameInfo &FrameInfo = MF->getFrameInfo();

		bool SpillToSMEM = ST.hasScalarStores() && EnableSpillSGPRToSMEM;

// SubReg carries the "Kill" flag when SubReg == SuperReg.		// SubReg carries the "Kill" flag when SubReg == SuperReg.
unsigned SubKillState = getKillRegState((NumSubRegs == 1) && IsKill);		unsigned SubKillState = getKillRegState((NumSubRegs == 1) && IsKill);
for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {		for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);		unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
unsigned SubReg = NumSubRegs == 1 ?		unsigned SubReg = NumSubRegs == 1 ?
SuperReg : getSubReg(SuperReg, getSubRegFromChannel(i));		SuperReg : getSubReg(SuperReg, getSubRegFromChannel(i));

		if (SpillToSMEM) {
		if (SuperReg == AMDGPU::M0) {
		assert(NumSubRegs == 1);
		unsigned CopyM0
		= MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);

		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::COPY), CopyM0)
		.addReg(AMDGPU::M0, getKillRegState(IsKill));

		// The real spill now kills the temp copy.
		SubReg = SuperReg = CopyM0;
		IsKill = true;
		}

		int64_t FrOffset = FrameInfo.getObjectOffset(Index);
		unsigned Size = FrameInfo.getObjectSize(Index);
		unsigned Align = FrameInfo.getObjectAlignment(Index);
		MachinePointerInfo PtrInfo
		= MachinePointerInfo::getFixedStack(*MF, Index);
		MachineMemOperand *MMO
		= MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOStore,
		Size, Align);

		unsigned OffsetReg = AMDGPU::M0;
		// Add i * 4 wave offset.
		//
		// SMEM instructions only support a single offset, so increment the wave
		// offset.

		int64_t Offset = ST.getWavefrontSize() * (FrOffset + 4 * i);
		if (Offset != 0) {
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), OffsetReg)
		.addReg(MFI->getScratchWaveOffsetReg())
		.addImm(Offset);
		} else {
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg)
		.addReg(MFI->getScratchWaveOffsetReg());
		}

		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_BUFFER_STORE_DWORD_SGPR))
		.addReg(SubReg, getKillRegState(IsKill)) // sdata
		.addReg(MFI->getScratchRSrcReg()) // sbase
		.addReg(OffsetReg) // soff
		.addImm(0) // glc
		.addMemOperand(MMO);

		continue;
		}

struct SIMachineFunctionInfo::SpilledReg Spill =		struct SIMachineFunctionInfo::SpilledReg Spill =
MFI->getSpilledReg(MF, Index, i);		MFI->getSpilledReg(MF, Index, i);
if (Spill.hasReg()) {		if (Spill.hasReg()) {
if (SuperReg == AMDGPU::M0) {		if (SuperReg == AMDGPU::M0) {
assert(NumSubRegs == 1);		assert(NumSubRegs == 1);
unsigned CopyM0		unsigned CopyM0
= MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);		= MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), CopyM0)		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), CopyM0)
Show All 10 Lines	if (Spill.hasReg()) {
.addReg(SubReg, getKillRegState(IsKill))		.addReg(SubReg, getKillRegState(IsKill))
.addImm(Spill.Lane);		.addImm(Spill.Lane);

// FIXME: Since this spills to another register instead of an actual		// FIXME: Since this spills to another register instead of an actual
// frame index, we should delete the frame index when all references to		// frame index, we should delete the frame index when all references to
// it are fixed.		// it are fixed.
} else {		} else {
// Spill SGPR to a frame index.		// Spill SGPR to a frame index.
// FIXME we should use S_STORE_DWORD here for VI.		// TODO: Should VI try to spill to VGPR and then spill to SMEM?

MachineInstrBuilder Mov		MachineInstrBuilder Mov
= BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), TmpReg)		= BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), TmpReg)
.addReg(SubReg, SubKillState);		.addReg(SubReg, SubKillState);


// There could be undef components of a spilled super register.		// There could be undef components of a spilled super register.
// TODO: Can we detect this and skip the spill?		// TODO: Can we detect this and skip the spill?
if (NumSubRegs > 1) {		if (NumSubRegs > 1) {
Show All 34 Lines	void SIRegisterInfo::restoreSGPR(MachineBasicBlock::iterator MI,
SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();		SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
MachineFrameInfo &FrameInfo = MF->getFrameInfo();		MachineFrameInfo &FrameInfo = MF->getFrameInfo();
const SISubtarget &ST = MF->getSubtarget<SISubtarget>();		const SISubtarget &ST = MF->getSubtarget<SISubtarget>();
const SIInstrInfo *TII = ST.getInstrInfo();		const SIInstrInfo *TII = ST.getInstrInfo();
const DebugLoc &DL = MI->getDebugLoc();		const DebugLoc &DL = MI->getDebugLoc();

unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());		unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());
unsigned SuperReg = MI->getOperand(0).getReg();		unsigned SuperReg = MI->getOperand(0).getReg();
		bool SpillToSMEM = ST.hasScalarStores() && EnableSpillSGPRToSMEM;

// m0 is not allowed as with readlane/writelane, so a temporary SGPR and		// m0 is not allowed as with readlane/writelane, so a temporary SGPR and
// extra copy is needed.		// extra copy is needed.
bool IsM0 = (SuperReg == AMDGPU::M0);		bool IsM0 = (SuperReg == AMDGPU::M0);
if (IsM0) {		if (IsM0) {
assert(NumSubRegs == 1);		assert(NumSubRegs == 1);
SuperReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);		SuperReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);
}		}

		int64_t FrOffset = FrameInfo.getObjectOffset(Index);

for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {		for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);		unsigned TmpReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
unsigned SubReg = NumSubRegs == 1 ?		unsigned SubReg = NumSubRegs == 1 ?
SuperReg : getSubReg(SuperReg, getSubRegFromChannel(i));		SuperReg : getSubReg(SuperReg, getSubRegFromChannel(i));

		if (SpillToSMEM) {
		unsigned Size = FrameInfo.getObjectSize(Index);
		unsigned Align = FrameInfo.getObjectAlignment(Index);
		MachinePointerInfo PtrInfo
		= MachinePointerInfo::getFixedStack(*MF, Index);
		MachineMemOperand *MMO
		= MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOLoad,
		Size, Align);

		unsigned OffsetReg = AMDGPU::M0;

		// Add i * 4 offset
		int64_t Offset = ST.getWavefrontSize() * (FrOffset + 4 * i);
		if (Offset != 0) {
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), OffsetReg)
		.addReg(MFI->getScratchWaveOffsetReg())
		.addImm(Offset);
		} else {
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg)
		.addReg(MFI->getScratchWaveOffsetReg());
		}

		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_BUFFER_LOAD_DWORD_SGPR), SubReg)
		.addReg(MFI->getScratchRSrcReg()) // sbase
		.addReg(OffsetReg) // soff
		.addImm(0) // glc
		.addMemOperand(MMO)
		.addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine);

		continue;
		}

SIMachineFunctionInfo::SpilledReg Spill		SIMachineFunctionInfo::SpilledReg Spill
= MFI->getSpilledReg(MF, Index, i);		= MFI->getSpilledReg(MF, Index, i);

if (Spill.hasReg()) {		if (Spill.hasReg()) {
BuildMI(*MBB, MI, DL,		BuildMI(*MBB, MI, DL,
TII->getMCOpcodeFromPseudo(AMDGPU::V_READLANE_B32),		TII->getMCOpcodeFromPseudo(AMDGPU::V_READLANE_B32),
SubReg)		SubReg)
.addReg(Spill.VGPR)		.addReg(Spill.VGPR)
▲ Show 20 Lines • Show All 577 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/attr-amdgpu-flat-work-group-size.ll

	Show All 28 Lines
	; CHECK: NumVGPRsForWavesPerEU: 1			; CHECK: NumVGPRsForWavesPerEU: 1
	define void @min_128_max_128() #2 {			define void @min_128_max_128() #2 {
	entry:			entry:
	ret void			ret void
	}			}
	attributes #2 = {"amdgpu-flat-work-group-size"="128,128"}			attributes #2 = {"amdgpu-flat-work-group-size"="128,128"}

	; CHECK-LABEL: {{^}}min_1024_max_2048			; CHECK-LABEL: {{^}}min_1024_max_2048
	; CHECK: SGPRBlocks: 2			; CHECK: SGPRBlocks: 1
	; CHECK: VGPRBlocks: 7			; CHECK: VGPRBlocks: 7
	; CHECK: NumSGPRsForWavesPerEU: 19			; CHECK: NumSGPRsForWavesPerEU: 15
	; CHECK: NumVGPRsForWavesPerEU: 32			; CHECK: NumVGPRsForWavesPerEU: 32
	@var = addrspace(1) global float 0.0			@var = addrspace(1) global float 0.0
	define void @min_1024_max_2048() #3 {			define void @min_1024_max_2048() #3 {
	%val0 = load volatile float, float addrspace(1)* @var			%val0 = load volatile float, float addrspace(1)* @var
	%val1 = load volatile float, float addrspace(1)* @var			%val1 = load volatile float, float addrspace(1)* @var
	%val2 = load volatile float, float addrspace(1)* @var			%val2 = load volatile float, float addrspace(1)* @var
	%val3 = load volatile float, float addrspace(1)* @var			%val3 = load volatile float, float addrspace(1)* @var
	%val4 = load volatile float, float addrspace(1)* @var			%val4 = load volatile float, float addrspace(1)* @var
	▲ Show 20 Lines • Show All 82 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -amdgpu-spill-sgpr-to-smem=0 -verify-machineinstrs < %s | FileCheck -check-prefix=TOSGPR -check-prefix=ALL %s
				; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=fiji -amdgpu-spill-sgpr-to-smem=1 -verify-machineinstrs < %s | FileCheck -check-prefix=TOSMEM -check-prefix=ALL %s

	; CHECK-LABEL: {{^}}max_14_sgprs:			; If spilling to smem, additional registers are used for the resource
				; descriptor.

				; ALL-LABEL: {{^}}max_14_sgprs:

	; FIXME: Should be ablo to skip this copying of the private segment			; FIXME: Should be ablo to skip this copying of the private segment
	; buffer because all the SGPR spills are to VGPRs.			; buffer because all the SGPR spills are to VGPRs.

	; CHECK: s_mov_b64 s[6:7], s[2:3]			; ALL: s_mov_b64 s[6:7], s[2:3]
	; CHECK: s_mov_b64 s[4:5], s[0:1]			; ALL: s_mov_b64 s[4:5], s[0:1]
				; ALL: SGPRBlocks: 1
	; CHECK: SGPRBlocks: 1			; ALL: NumSGPRsForWavesPerEU: 14
	; CHECK: NumSGPRsForWavesPerEU: 14
	define void @max_14_sgprs(i32 addrspace(1)* %out1,			define void @max_14_sgprs(i32 addrspace(1)* %out1,

	i32 addrspace(1)* %out2,			i32 addrspace(1)* %out2,
	i32 addrspace(1)* %out3,			i32 addrspace(1)* %out3,
	i32 addrspace(1)* %out4,			i32 addrspace(1)* %out4,
	i32 %one, i32 %two, i32 %three, i32 %four) #0 {			i32 %one, i32 %two, i32 %three, i32 %four) #0 {
	store i32 %one, i32 addrspace(1)* %out1			store i32 %one, i32 addrspace(1)* %out1
	store i32 %two, i32 addrspace(1)* %out2			store i32 %two, i32 addrspace(1)* %out2
	store i32 %three, i32 addrspace(1)* %out3			store i32 %three, i32 addrspace(1)* %out3
	store i32 %four, i32 addrspace(1)* %out4			store i32 %four, i32 addrspace(1)* %out4
	▲ Show 20 Lines • Show All 101 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/basic-branch.ll

	; RUN: llc -O0 -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCNNOOPT -check-prefix=GCN %s			; RUN: llc -O0 -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCNNOOPT -check-prefix=GCN %s
	; RUN: llc -O0 -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=GCNNOOPT -check-prefix=GCN %s			; RUN: llc -O0 -march=amdgcn -mcpu=tonga -amdgpu-spill-sgpr-to-smem=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCNNOOPT -check-prefix=GCN %s
	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCNOPT -check-prefix=GCN %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCNOPT -check-prefix=GCN %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=GCNOPT -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=GCNOPT -check-prefix=GCN %s

	; GCN-LABEL: {{^}}test_branch:			; GCN-LABEL: {{^}}test_branch:
	; GCNNOOPT: v_writelane_b32			; GCNNOOPT: v_writelane_b32
	; GCNNOOPT: v_writelane_b32			; GCNNOOPT: v_writelane_b32
	; GCN: s_cbranch_scc1 [[END:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_scc1 [[END:BB[0-9]+_[0-9]+]]

	▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/si-spill-sgpr-stack.ll

	; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-spill-sgpr-to-smem=0 -verify-machineinstrs < %s | FileCheck -check-prefix=ALL -check-prefix=SGPR %s
				; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-spill-sgpr-to-smem=1 -verify-machineinstrs < %s | FileCheck -check-prefix=ALL -check-prefix=SMEM %s

	; Make sure this doesn't crash.			; Make sure this doesn't crash.
	; CHECK: {{^}}test:			; ALL-LABEL: {{^}}test:

	; Make sure we are handling hazards correctly.			; Make sure we are handling hazards correctly.
	; CHECK: buffer_load_dword [[VHI:v[0-9]+]], off, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}} offset:12			; SGPR: buffer_load_dword [[VHI:v[0-9]+]], off, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}} offset:12
	; CHECK-NEXT: s_waitcnt vmcnt(0)			; SGPR-NEXT: s_waitcnt vmcnt(0)
	; CHECK-NEXT: v_readfirstlane_b32 s[[HI:[0-9]+]], [[VHI]]			; SGPR-NEXT: v_readfirstlane_b32 s[[HI:[0-9]+]], [[VHI]]
	; CHECK-NEXT: s_nop 4			; SGPR-NEXT: s_nop 4
	; CHECK-NEXT: buffer_store_dword v0, off, s[0:[[HI]]{{\]}}, 0			; SGPR-NEXT: buffer_store_dword v0, off, s[0:[[HI]]{{\]}}, 0
	; CHECK: s_endpgm

				; Make sure scratch wave offset register is correctly incremented and
				; then restored.
				; SMEM: s_mov_b32 m0, s97{{$}}
				; SMEM: s_buffer_store_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Spill
				; SMEM: s_add_u32 m0, s97, 0x100{{$}}
				; SMEM: s_buffer_store_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Spill
				; SMEM: s_add_u32 m0, s97, 0x200{{$}}
				; SMEM: s_buffer_store_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Spill
				; SMEM: s_add_u32 m0, s97, 0x300{{$}}
				; SMEM: s_buffer_store_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Spill


				; SMEM: s_mov_b32 m0, s97{{$}}
				; SMEM: s_buffer_load_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Reload
				; SMEM: s_add_u32 m0, s97, 0x100{{$}}
				; SMEM: s_waitcnt lgkmcnt(0)
				; SMEM: s_buffer_load_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Reload
				; SMEM: s_add_u32 m0, s97, 0x200{{$}}
				; SMEM: s_waitcnt lgkmcnt(0)
				; SMEM: s_buffer_load_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Reload
				; SMEM: s_add_u32 m0, s97, 0x300{{$}}
				; SMEM: s_waitcnt lgkmcnt(0)
				; SMEM: s_buffer_load_dword s{{[0-9]+}}, s[92:95], m0 ; 16-byte Folded Reload

				; ALL: s_endpgm
	define void @test(i32 addrspace(1)* %out, i32 %in) {			define void @test(i32 addrspace(1)* %out, i32 %in) {
	call void asm sideeffect "", "~{SGPR0_SGPR1_SGPR2_SGPR3_SGPR4_SGPR5_SGPR6_SGPR7}" ()			call void asm sideeffect "", "~{SGPR0_SGPR1_SGPR2_SGPR3_SGPR4_SGPR5_SGPR6_SGPR7}" ()
	call void asm sideeffect "", "~{SGPR8_SGPR9_SGPR10_SGPR11_SGPR12_SGPR13_SGPR14_SGPR15}" ()			call void asm sideeffect "", "~{SGPR8_SGPR9_SGPR10_SGPR11_SGPR12_SGPR13_SGPR14_SGPR15}" ()
	call void asm sideeffect "", "~{SGPR16_SGPR17_SGPR18_SGPR19_SGPR20_SGPR21_SGPR22_SGPR23}" ()			call void asm sideeffect "", "~{SGPR16_SGPR17_SGPR18_SGPR19_SGPR20_SGPR21_SGPR22_SGPR23}" ()
	call void asm sideeffect "", "~{SGPR24_SGPR25_SGPR26_SGPR27_SGPR28_SGPR29_SGPR30_SGPR31}" ()			call void asm sideeffect "", "~{SGPR24_SGPR25_SGPR26_SGPR27_SGPR28_SGPR29_SGPR30_SGPR31}" ()
	call void asm sideeffect "", "~{SGPR32_SGPR33_SGPR34_SGPR35_SGPR36_SGPR37_SGPR38_SGPR39}" ()			call void asm sideeffect "", "~{SGPR32_SGPR33_SGPR34_SGPR35_SGPR36_SGPR37_SGPR38_SGPR39}" ()
	call void asm sideeffect "", "~{SGPR40_SGPR41_SGPR42_SGPR43_SGPR44_SGPR45_SGPR46_SGPR47}" ()			call void asm sideeffect "", "~{SGPR40_SGPR41_SGPR42_SGPR43_SGPR44_SGPR45_SGPR46_SGPR47}" ()
	call void asm sideeffect "", "~{SGPR48_SGPR49_SGPR50_SGPR51_SGPR52_SGPR53_SGPR54_SGPR55}" ()			call void asm sideeffect "", "~{SGPR48_SGPR49_SGPR50_SGPR51_SGPR52_SGPR53_SGPR54_SGPR55}" ()

test/CodeGen/AMDGPU/spill-m0.ll

	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVGPR -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVGPR -check-prefix=GCN %s
	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -mcpu=tonga -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVGPR -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -amdgpu-spill-sgpr-to-smem=0 -march=amdgcn -mcpu=tonga -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVGPR -check-prefix=GCN %s
	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVMEM -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVMEM -check-prefix=GCN %s
	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mattr=+vgpr-spilling -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=TOVMEM -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -amdgpu-spill-sgpr-to-smem=0 -march=amdgcn -mcpu=tonga -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOVMEM -check-prefix=GCN %s
				; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -amdgpu-spill-sgpr-to-smem=1 -march=amdgcn -mcpu=tonga -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck -check-prefix=TOSMEM -check-prefix=GCN %s

	; XXX - Why does it like to use vcc?			; XXX - Why does it like to use vcc?

	; GCN-LABEL: {{^}}spill_m0:			; GCN-LABEL: {{^}}spill_m0:
	; TOSMEM: s_mov_b32 s88, SCRATCH_RSRC_DWORD0			; TOSMEM: s_mov_b32 s88, SCRATCH_RSRC_DWORD0

	; GCN: s_cmp_lg_u32			; GCN: s_cmp_lg_u32

	; TOVGPR: s_mov_b32 vcc_hi, m0			; TOVGPR: s_mov_b32 vcc_hi, m0
	; TOVGPR: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], vcc_hi, 0			; TOVGPR: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], vcc_hi, 0

	; TOVMEM: v_mov_b32_e32 [[SPILL_VREG:v[0-9]+]], m0			; TOVMEM: v_mov_b32_e32 [[SPILL_VREG:v[0-9]+]], m0
	; TOVMEM: buffer_store_dword [[SPILL_VREG]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} ; 4-byte Folded Spill			; TOVMEM: buffer_store_dword [[SPILL_VREG]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} ; 4-byte Folded Spill
	; TOVMEM: s_waitcnt vmcnt(0)			; TOVMEM: s_waitcnt vmcnt(0)

				; TOSMEM: s_mov_b32 vcc_hi, m0
				; TOSMEM: s_mov_b32 m0, s3{{$}}
				; TOSMEM-NOT: vcc_hi
				; TOSMEM: s_buffer_store_dword vcc_hi, s[88:91], m0 ; 4-byte Folded Spill
				; TOSMEM: s_waitcnt lgkmcnt(0)

	; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: [[ENDIF]]:			; GCN: [[ENDIF]]:
	; TOVGPR: v_readlane_b32 vcc_hi, [[SPILL_VREG]], 0			; TOVGPR: v_readlane_b32 vcc_hi, [[SPILL_VREG]], 0
	; TOVGPR: s_mov_b32 m0, vcc_hi			; TOVGPR: s_mov_b32 m0, vcc_hi

	; TOVMEM: buffer_load_dword [[RELOAD_VREG:v[0-9]+]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} ; 4-byte Folded Reload			; TOVMEM: buffer_load_dword [[RELOAD_VREG:v[0-9]+]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} ; 4-byte Folded Reload
	; TOVMEM: s_waitcnt vmcnt(0)			; TOVMEM: s_waitcnt vmcnt(0)
	; TOVMEM: v_readfirstlane_b32 vcc_hi, [[RELOAD_VREG]]			; TOVMEM: v_readfirstlane_b32 vcc_hi, [[RELOAD_VREG]]
	; TOVMEM: s_mov_b32 m0, vcc_hi			; TOVMEM: s_mov_b32 m0, vcc_hi

				; TOSMEM: s_mov_b32 m0, s3{{$}}
				; TOSMEM: s_buffer_load_dword vcc_hi, s[88:91], m0 ; 4-byte Folded Reload
				; TOSMEM-NOT: vcc_hi
				; TOSMEM: s_mov_b32 m0, vcc_hi

	; GCN: s_add_i32 m0, m0, 1			; GCN: s_add_i32 m0, m0, 1
	define void @spill_m0(i32 %cond, i32 addrspace(1)* %out) #0 {			define void @spill_m0(i32 %cond, i32 addrspace(1)* %out) #0 {
	entry:			entry:
	%m0 = call i32 asm sideeffect "s_mov_b32 m0, 0", "={M0}"() #0			%m0 = call i32 asm sideeffect "s_mov_b32 m0, 0", "={M0}"() #0
	%cmp0 = icmp eq i32 %cond, 0			%cmp0 = icmp eq i32 %cond, 0
	br i1 %cmp0, label %if, label %endif			br i1 %cmp0, label %if, label %endif

	if:			if:
	call void asm sideeffect "v_nop", ""() #0			call void asm sideeffect "v_nop", ""() #0
	br label %endif			br label %endif

	endif:			endif:
	%foo = call i32 asm sideeffect "s_add_i32 $0, $1, 1", "=s,{M0}"(i32 %m0) #0			%foo = call i32 asm sideeffect "s_add_i32 $0, $1, 1", "=s,{M0}"(i32 %m0) #0
	store i32 %foo, i32 addrspace(1)* %out			store i32 %foo, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	@lds = internal addrspace(3) global [64 x float] undef			@lds = internal addrspace(3) global [64 x float] undef

	; GCN-LABEL: {{^}}spill_m0_lds:			; GCN-LABEL: {{^}}spill_m0_lds:
	; GCN-NOT: v_readlane_b32 m0			; GCN-NOT: v_readlane_b32 m0
				; GCN-NOT: s_buffer_store_dword m0
				; GCN-NOT: s_buffer_load_dword m0
	define amdgpu_ps void @spill_m0_lds(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg) #0 {			define amdgpu_ps void @spill_m0_lds(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg) #0 {
	main_body:			main_body:
	%4 = call float @llvm.SI.fs.constant(i32 0, i32 0, i32 %3)			%4 = call float @llvm.SI.fs.constant(i32 0, i32 0, i32 %3)
	%cmp = fcmp ueq float 0.0, %4			%cmp = fcmp ueq float 0.0, %4
	br i1 %cmp, label %if, label %else			br i1 %cmp, label %if, label %else

	if:			if:
	%lds_ptr = getelementptr [64 x float], [64 x float] addrspace(3)* @lds, i32 0, i32 0			%lds_ptr = getelementptr [64 x float], [64 x float] addrspace(3)* @lds, i32 0, i32 0

test/CodeGen/MIR/AMDGPU/scalar-store-cache-flush.mir

This file was added.

				# RUN: llc -march=amdgcn -run-pass si-insert-waits %s -o - | FileCheck %s

				--- |
				define void @basic_insert_dcache_wb() {
				ret void
				}

				define void @explicit_flush_after() {
				ret void
				}

				define void @explicit_flush_before() {
				ret void
				}

				define void @no_scalar_store() {
				ret void
				}

				define void @multi_block_store() {
				bb0:
				br i1 undef, label %bb1, label %bb2

				bb1:
				ret void

				bb2:
				ret void
				}

				define void @one_block_store() {
				bb0:
				br i1 undef, label %bb1, label %bb2

				bb1:
				ret void

				bb2:
				ret void
				}

				define amdgpu_ps float @si_return() {
				ret float undef
				}

				...
				---
				# CHECK-LABEL: name: basic_insert_dcache_wb
				# CHECK: bb.0:
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				name: basic_insert_dcache_wb
				tracksRegLiveness: false

				body: |
				bb.0:
				S_STORE_DWORD_SGPR undef %sgpr2, undef %sgpr0_sgpr1, undef %m0, 0
				S_ENDPGM
				...
				---
				# Already has an explicitly requested flush after the last store.
				# CHECK-LABEL: name: explicit_flush_after
				# CHECK: bb.0:
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				name: explicit_flush_after
				tracksRegLiveness: false

				body: |
				bb.0:
				S_STORE_DWORD_SGPR undef %sgpr2, undef %sgpr0_sgpr1, undef %m0, 0
				S_DCACHE_WB
				S_ENDPGM
				...
				---
				# Already has an explicitly requested flush before the last store.
				# CHECK-LABEL: name: explicit_flush_before
				# CHECK: bb.0:
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				name: explicit_flush_before
				tracksRegLiveness: false

				body: |
				bb.0:
				S_DCACHE_WB
				S_STORE_DWORD_SGPR undef %sgpr2, undef %sgpr0_sgpr1, undef %m0, 0
				S_ENDPGM
				...
				---
				# CHECK-LABEL: no_scalar_store
				# CHECK: bb.0
				# CHECK-NEXT: S_ENDPGM
				name: no_scalar_store
				tracksRegLiveness: false

				body: |
				bb.0:
				S_ENDPGM
				...

				# CHECK-LABEL: name: multi_block_store
				# CHECK: bb.0:
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				# CHECK: bb.1:
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				name: multi_block_store
				tracksRegLiveness: false

				body: |
				bb.0:
				S_STORE_DWORD_SGPR undef %sgpr2, undef %sgpr0_sgpr1, undef %m0, 0
				S_ENDPGM

				bb.1:
				S_STORE_DWORD_SGPR undef %sgpr4, undef %sgpr6_sgpr7, undef %m0, 0
				S_ENDPGM
				...
				...

				# This one should be able to omit the flush in the storeless block but
				# this isn't handled now.

				# CHECK-LABEL: name: one_block_store
				# CHECK: bb.0:
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				# CHECK: bb.1:
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: S_ENDPGM

				name: one_block_store
				tracksRegLiveness: false

				body: |
				bb.0:
				S_ENDPGM

				bb.1:
				S_STORE_DWORD_SGPR undef %sgpr4, undef %sgpr6_sgpr7, undef %m0, 0
				S_ENDPGM
				...
				---
				# CHECK-LABEL: name: si_return
				# CHECK: bb.0:
				# CHECK-NEXT: S_STORE_DWORD
				# CHECK-NEXT: S_WAITCNT
				# CHECK-NEXT: S_DCACHE_WB
				# CHECK-NEXT: SI_RETURN

				name: si_return
				tracksRegLiveness: false

				body: |
				bb.0:
				S_STORE_DWORD_SGPR undef %sgpr2, undef %sgpr0_sgpr1, undef %m0, 0
				SI_RETURN undef %vgpr0
				...
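
The behaviour these MIR checks expect can be summarised with a small model. This is a sketch inferred from the CHECK lines above, not the actual SIInsertWaits code. Blocks are modelled as lists of opcode strings, the function name `insert_cache_flushes` is hypothetical, and the `S_WAITCNT` that the pass also emits in the `si_return` case is deliberately left out.

```python
SCALAR_STORE_PREFIX = "S_STORE_DWORD"      # matches S_STORE_DWORD_SGPR etc.
FLUSH = "S_DCACHE_WB"
TERMINATORS = ("S_ENDPGM", "SI_RETURN")    # points where the wave ends


def insert_cache_flushes(blocks):
    """Hypothetical model: return a copy of `blocks` with S_DCACHE_WB added.

    If the function contains any scalar store, every terminating instruction
    gets a flush in front of it, unless a flush already follows the last
    scalar store in that block. Blocks without stores are flushed too, which
    mirrors the "isn't handled now" note on one_block_store.
    """
    has_store = any(op.startswith(SCALAR_STORE_PREFIX)
                    for block in blocks for op in block)
    if not has_store:
        return [list(block) for block in blocks]

    result = []
    for block in blocks:
        new_block = []
        flushed = False
        for op in block:
            if op == FLUSH:
                flushed = True          # an explicit flush is already present
            elif op.startswith(SCALAR_STORE_PREFIX):
                flushed = False         # a later store invalidates it
            elif op in TERMINATORS and not flushed:
                new_block.append(FLUSH)
            new_block.append(op)
        result.append(new_block)
    return result


# one_block_store: the storeless bb.0 still receives a flush.
print(insert_cache_flushes([["S_ENDPGM"],
                            ["S_STORE_DWORD_SGPR", "S_ENDPGM"]]))
# prints [['S_DCACHE_WB', 'S_ENDPGM'], ['S_STORE_DWORD_SGPR', 'S_DCACHE_WB', 'S_ENDPGM']]
```

Tracking whether any block in the function stores would be enough to reproduce every case in this file; omitting the flush in storeless blocks would need extra analysis, which the test comment notes is not handled yet.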

Revision Contents

Diff 76270

lib/Target/AMDGPU/SIInsertWaits.cpp

lib/Target/AMDGPU/SIInstrInfo.cpp

lib/Target/AMDGPU/SIRegisterInfo.cpp

test/CodeGen/AMDGPU/attr-amdgpu-flat-work-group-size.ll

test/CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll

test/CodeGen/AMDGPU/basic-branch.ll

test/CodeGen/AMDGPU/si-spill-sgpr-stack.ll

test/CodeGen/AMDGPU/spill-m0.ll

test/CodeGen/MIR/AMDGPU/scalar-store-cache-flush.mir
