
Details

Reviewers

tstellarAMD
arsenm

Commits

rGf52c3cf27251: AMDGPU: fix local stack slot allocation bugs
rL275108: AMDGPU: fix local stack slot allocation bugs

Summary

The main bug fix here is using the 32-bit encoding of V_ADD_I32 in
materializeFrameBaseRegister and resolveFrameIndex, so that arbitrary
immediates work.

The second part is that we may now require the SegmentWaveByteOffset
even when there are initially no stack objects and VGPR spilling isn't
enabled, for stack slots that are allocated later. This means that some
bits become effectively dead and can be cleaned up.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96602

Diff Detail

Event Timeline

nhaehnle updated this revision to Diff 61349. Jun 21 2016, 3:40 AM

nhaehnle retitled this revision to AMDGPU: fix local stack slot allocation bugs.

nhaehnle updated this object.

nhaehnle added reviewers: arsenm, tstellarAMD.

nhaehnle added a subscriber: llvm-commits.

Herald added subscribers: kzhuravl, arsenm. Jun 21 2016, 3:40 AM

arsenm added inline comments. Jun 22 2016, 11:00 AM

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
111–112	We shouldn't need to always enable this if a new vreg is created for the constant
lib/Target/AMDGPU/SIRegisterInfo.cpp
286–290	This is all pre-RA, so there's no issue creating new virtual registers. A new vreg can be created if the immediate isn't a valid inline immediate for the constant, and then it can be folded/shrunk later if needed.

nhaehnle added inline comments. Jun 22 2016, 12:21 PM

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
111–112	Right, but is there a way to tell at this stage? This is the MachineFunctionInfo constructor...
lib/Target/AMDGPU/SIRegisterInfo.cpp
286–290	Sure, the problem wasn't creating the virtual register, the problem was that the _e64 encoding only allows a very limited set of immediates. That is, the testcase in the diff fails with an "illegal immediate operand" error. If some other pass is supposed to pull the immediate out into a V_MOV before the instruction verifier runs, then that wasn't successful...

arsenm added inline comments. Jun 24 2016, 12:32 AM

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
111–112	This logic is just if private access is going to be needed. LocalStackSlotAllocation is purely an optimization and doesn't change that. You can create new vregs at that point so you don't need to worry about the special reserved spill registers
lib/Target/AMDGPU/SIRegisterInfo.cpp
286–290	Yes, but the solution isn't to change the instruction encoding here. You don't need to pick the encoding because the instruction shrinking pass later will reduce it. Whenever possible we pick the e64 form and optimize it down later. You can either just always emit the mov of the constant to a new virtual register with the e64 form (which hopefully SIFoldOperands will take care of when legal), or directly check if it's a legal offset and change the emitted instruction
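
The second option described above (check whether the offset is a legal inline immediate, and otherwise move it into a temporary register first) can be modeled outside of LLVM. The following is a minimal sketch, not the code from this revision: `isInlineImmediate32` and `emitFrameAdd` are hypothetical names, the emitted strings are illustrative pseudo-assembly, and the legality rule is a simplified version of the SI inline-constant set (integers from -16 to 64 plus a few float bit patterns), ignoring any per-generation additions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Simplified model (assumption): a 32-bit operand is an inline constant if
// it is an integer in [-16, 64] or the bit pattern of +-0.5, +-1.0, +-2.0
// or +-4.0. Anything else cannot be encoded directly in the e64 form.
bool isInlineImmediate32(int32_t Imm) {
  if (Imm >= -16 && Imm <= 64)
    return true;
  const float Floats[] = {0.5f, -0.5f, 1.0f, -1.0f, 2.0f, -2.0f, 4.0f, -4.0f};
  for (float F : Floats) {
    int32_t Bits;
    std::memcpy(&Bits, &F, sizeof(Bits)); // reinterpret the float's bits
    if (Bits == Imm)
      return true;
  }
  return false;
}

// Hypothetical helper: returns the instructions (as illustrative text) that
// would compute vDst = vBase + Offset with the e64 add. Register names are
// placeholders, not real allocations.
std::vector<std::string> emitFrameAdd(int32_t Offset) {
  assert(Offset != 0 && "Non-zero offset expected");
  std::vector<std::string> Out;
  if (isInlineImmediate32(Offset)) {
    // Legal inline immediate: use it directly as an operand.
    Out.push_back("v_add_i32_e64 vDst, sCarry, " + std::to_string(Offset) +
                  ", vBase");
  } else {
    // Not encodable: materialize the constant in a temporary register.
    Out.push_back("v_mov_b32_e32 vTmp, " + std::to_string(Offset));
    Out.push_back("v_add_i32_e64 vDst, sCarry, vTmp, vBase");
  }
  return Out;
}
```

In this model, `emitFrameAdd(16)` yields a single add, while `emitFrameAdd(4096)` yields a move followed by an add. The first option from the comment (always emit the move and rely on a later folding pass) corresponds to taking the `else` branch unconditionally.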

Change encoding back to ADD_I32_e64, use temporary register for immediates.

The other issue isn't that simple, I believe, and the problem actually doesn't
have anything to do with local stack slot allocation per se. It persists
even when I disable that pass.

The issue is that if
(1) VGPR spilling isn't enabled and
(2) there aren't any stack objects initially but
(3) during instruction selection, new stack objects are created (in order to
lower extractelement instructions)
this messes with some assumptions around the existence of stack objects.
Virtual registers or not, in this case we do need to reserve various SGPRs
after the fact because they're needed for frame finalization. But reserved
registers are frozen during instruction selection.

Basically, we would need a hook either when a new stack object is created
during instruction selection or somewhere just before the end of instruction
selection (but again not after, because we have to reserve SGPRs).
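
The ordering problem in conditions (1) to (3) can be shown with a small standalone model. This is only a sketch under stated assumptions: `FrameInfoModel`, `FunctionInfoModel` and `isStale` are hypothetical stand-ins for the frame info and the SIMachineFunctionInfo constructor logic, and they mirror only the `HasStackObjects || MaySpill` check from the diff.

```cpp
#include <cassert>

// Hypothetical stand-in for the frame info: just counts stack objects.
struct FrameInfoModel {
  int NumObjects = 0;
  bool hasStackObjects() const { return NumObjects > 0; }
  int createStackObject() { return NumObjects++; } // returns the new index
};

// Hypothetical stand-in for the function info. The decision is taken once,
// in the constructor, from the state of the frame at that moment.
struct FunctionInfoModel {
  bool PrivateSegmentWaveByteOffset = false;
  FunctionInfoModel(const FrameInfoModel &FI, bool MaySpill) {
    if (FI.hasStackObjects() || MaySpill)
      PrivateSegmentWaveByteOffset = true;
  }
};

// True when the frame has stack objects but the wave byte offset input was
// never requested, i.e. the constructor's assumption no longer holds.
bool isStale(const FrameInfoModel &FI, const FunctionInfoModel &Info) {
  return FI.hasStackObjects() && !Info.PrivateSegmentWaveByteOffset;
}
```

Constructing `FunctionInfoModel` with an empty frame and `MaySpill == false`, then calling `createStackObject()` (as instruction selection does when lowering extractelement), makes `isStale` return true. With `MaySpill == true` the flag is set up front and the later stack object causes no mismatch, which matches the observation that enabling VGPR spilling avoids the problem.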

To be honest, this is becoming a bit of a waste of time. In Mesa, we're
enabling VGPR spilling all the time anyway. Why is the option to disable
that even there? I'm a bit tempted to just enable VGPR spilling in the test
case and commit only the SIRegisterInfo part.

Fix the bug affecting Mesa while keeping an XFAIL test for the case without
VGPR spilling enabled.

curan added a subscriber: curan. Jul 10 2016, 5:30 AM

@nhaehnle: you can have my Tested-by: Kai Wasserbäch <kai@dev.carbon-project.org> for Diff 62661 as well.

ping

We should probably just remove the VGPR spilling option

arsenm added inline comments. Jul 11 2016, 1:54 PM

test/CodeGen/AMDGPU/local-stack-slot-bug.ll
10	This should check a few of the stack instructions for the add to the pointer operand

Added some CHECK lines.

Removing the vgpr-spilling option is best left to a separate commit.

LGTM

This revision is now accepted and ready to land. Jul 11 2016, 2:42 PM

Closed by commit rL275108: AMDGPU: fix local stack slot allocation bugs (authored by nha). Jul 11 2016, 2:51 PM

This revision was automatically updated to reflect the committed changes.

chapuni added a subscriber: chapuni. Jul 11 2016, 7:39 PM

chapuni added inline comments.

llvm/trunk/test/CodeGen/AMDGPU/selected-stack-object.ll
1 ↗	(On Diff #63584)	You shouldn't use XFAIL when the expected failure is an assertion failure. With -Asserts, the behavior is unknown: it might appear to pass, or it might run into an infinite loop. Tweaked in r275144.

Diff 61349

lib/Target/AMDGPU/SIISelLowering.cpp

[...] SDValue SITargetLowering::LowerFormalArguments(
}		}

if (Info->hasWorkGroupInfo()) {		if (Info->hasWorkGroupInfo()) {
unsigned Reg = Info->addWorkGroupInfo();		unsigned Reg = Info->addWorkGroupInfo();
MF.addLiveIn(Reg, &AMDGPU::SReg_32RegClass);		MF.addLiveIn(Reg, &AMDGPU::SReg_32RegClass);
CCInfo.AllocateReg(Reg);		CCInfo.AllocateReg(Reg);
}		}

if (Info->hasPrivateSegmentWaveByteOffset()) {
// Scratch wave offset passed in system SGPR.		// Scratch wave offset passed in system SGPR.
unsigned PrivateSegmentWaveByteOffsetReg;		unsigned PrivateSegmentWaveByteOffsetReg;

if (AMDGPU::isShader(CallConv)) {		if (AMDGPU::isShader(CallConv)) {
PrivateSegmentWaveByteOffsetReg = findFirstFreeSGPR(CCInfo);		PrivateSegmentWaveByteOffsetReg = findFirstFreeSGPR(CCInfo);
Info->setPrivateSegmentWaveByteOffset(PrivateSegmentWaveByteOffsetReg);		Info->setPrivateSegmentWaveByteOffset(PrivateSegmentWaveByteOffsetReg);
} else		} else
PrivateSegmentWaveByteOffsetReg = Info->addPrivateSegmentWaveByteOffset();		PrivateSegmentWaveByteOffsetReg = Info->addPrivateSegmentWaveByteOffset();

MF.addLiveIn(PrivateSegmentWaveByteOffsetReg, &AMDGPU::SGPR_32RegClass);		MF.addLiveIn(PrivateSegmentWaveByteOffsetReg, &AMDGPU::SGPR_32RegClass);
CCInfo.AllocateReg(PrivateSegmentWaveByteOffsetReg);		CCInfo.AllocateReg(PrivateSegmentWaveByteOffsetReg);
}

// Now that we've figured out where the scratch register inputs are, see if		// Now that we've figured out where the scratch register inputs are, see if
// should reserve the arguments and use them directly.		// should reserve the arguments and use them directly.
bool HasStackObjects = MF.getFrameInfo()->hasStackObjects();		bool HasStackObjects = MF.getFrameInfo()->hasStackObjects();
// Record that we know we have non-spill stack objects so we don't need to		// Record that we know we have non-spill stack objects so we don't need to
// check all stack objects later.		// check all stack objects later.
if (HasStackObjects)		if (HasStackObjects)
Info->setHasNonSpillStackObjects(true);		Info->setHasNonSpillStackObjects(true);

lib/Target/AMDGPU/SIMachineFunctionInfo.h

[...] private:
bool GridWorkgroupCountY : 1;		bool GridWorkgroupCountY : 1;
bool GridWorkgroupCountZ : 1;		bool GridWorkgroupCountZ : 1;

// Feature bits required for inputs passed in system SGPRs.		// Feature bits required for inputs passed in system SGPRs.
bool WorkGroupIDX : 1; // Always initialized.		bool WorkGroupIDX : 1; // Always initialized.
bool WorkGroupIDY : 1;		bool WorkGroupIDY : 1;
bool WorkGroupIDZ : 1;		bool WorkGroupIDZ : 1;
bool WorkGroupInfo : 1;		bool WorkGroupInfo : 1;
bool PrivateSegmentWaveByteOffset : 1;

bool WorkItemIDX : 1; // Always initialized.		bool WorkItemIDX : 1; // Always initialized.
bool WorkItemIDY : 1;		bool WorkItemIDY : 1;
bool WorkItemIDZ : 1;		bool WorkItemIDZ : 1;


MCPhysReg getNextUserSGPR() const {		MCPhysReg getNextUserSGPR() const {
assert(NumSystemSGPRs == 0 && "System SGPRs must be added after user SGPRs");		assert(NumSystemSGPRs == 0 && "System SGPRs must be added after user SGPRs");
[...] public:
bool hasWorkGroupIDZ() const {		bool hasWorkGroupIDZ() const {
return WorkGroupIDZ;		return WorkGroupIDZ;
}		}

bool hasWorkGroupInfo() const {		bool hasWorkGroupInfo() const {
return WorkGroupInfo;		return WorkGroupInfo;
}		}

bool hasPrivateSegmentWaveByteOffset() const {
return PrivateSegmentWaveByteOffset;
}

bool hasWorkItemIDX() const {		bool hasWorkItemIDX() const {
return WorkItemIDX;		return WorkItemIDX;
}		}

bool hasWorkItemIDY() const {		bool hasWorkItemIDY() const {
return WorkItemIDY;		return WorkItemIDY;
}		}


lib/Target/AMDGPU/SIMachineFunctionInfo.cpp

[...] : AMDGPUMachineFunction(MF),
FlatScratchInit(false),		FlatScratchInit(false),
GridWorkgroupCountX(false),		GridWorkgroupCountX(false),
GridWorkgroupCountY(false),		GridWorkgroupCountY(false),
GridWorkgroupCountZ(false),		GridWorkgroupCountZ(false),
WorkGroupIDX(false),		WorkGroupIDX(false),
WorkGroupIDY(false),		WorkGroupIDY(false),
WorkGroupIDZ(false),		WorkGroupIDZ(false),
WorkGroupInfo(false),		WorkGroupInfo(false),
PrivateSegmentWaveByteOffset(false),
WorkItemIDX(false),		WorkItemIDX(false),
WorkItemIDY(false),		WorkItemIDY(false),
WorkItemIDZ(false) {		WorkItemIDZ(false) {
const AMDGPUSubtarget &ST = MF.getSubtarget<AMDGPUSubtarget>();		const AMDGPUSubtarget &ST = MF.getSubtarget<AMDGPUSubtarget>();
const Function *F = MF.getFunction();		const Function *F = MF.getFunction();

PSInputAddr = AMDGPU::getInitialPSInputAddr(*F);		PSInputAddr = AMDGPU::getInitialPSInputAddr(*F);

[...] SIMachineFunctionInfo::SIMachineFunctionInfo(const MachineFunction &MF)
if (F->hasFnAttribute("amdgpu-work-item-id-z"))		if (F->hasFnAttribute("amdgpu-work-item-id-z"))
WorkItemIDZ = true;		WorkItemIDZ = true;

// X, XY, and XYZ are the only supported combinations, so make sure Y is		// X, XY, and XYZ are the only supported combinations, so make sure Y is
// enabled if Z is.		// enabled if Z is.
if (WorkItemIDZ)		if (WorkItemIDZ)
WorkItemIDY = true;		WorkItemIDY = true;

bool MaySpill = ST.isVGPRSpillingEnabled(*F);
bool HasStackObjects = FrameInfo->hasStackObjects();		bool HasStackObjects = FrameInfo->hasStackObjects();

if (HasStackObjects || MaySpill)
PrivateSegmentWaveByteOffset = true;

if (ST.isAmdHsaOS()) {		if (ST.isAmdHsaOS()) {
if (HasStackObjects || MaySpill)
PrivateSegmentBuffer = true;		PrivateSegmentBuffer = true;

if (F->hasFnAttribute("amdgpu-dispatch-ptr"))		if (F->hasFnAttribute("amdgpu-dispatch-ptr"))
DispatchPtr = true;		DispatchPtr = true;

if (F->hasFnAttribute("amdgpu-queue-ptr"))		if (F->hasFnAttribute("amdgpu-queue-ptr"))
QueuePtr = true;		QueuePtr = true;
}		}


lib/Target/AMDGPU/SIRegisterInfo.cpp

[...] void SIRegisterInfo::materializeFrameBaseRegister(MachineBasicBlock *MBB,
const TargetInstrInfo *TII = Subtarget.getInstrInfo();		const TargetInstrInfo *TII = Subtarget.getInstrInfo();

if (Offset == 0) {		if (Offset == 0) {
BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_MOV_B32_e32), BaseReg)		BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_MOV_B32_e32), BaseReg)
.addFrameIndex(FrameIdx);		.addFrameIndex(FrameIdx);
return;		return;
}		}

MachineRegisterInfo &MRI = MF->getRegInfo();		BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_ADD_I32_e32), BaseReg)
unsigned UnusedCarry = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);

BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_ADD_I32_e64), BaseReg)
.addReg(UnusedCarry, RegState::Define | RegState::Dead)
.addImm(Offset)		.addImm(Offset)
.addFrameIndex(FrameIdx);		.addFrameIndex(FrameIdx);
}		}

void SIRegisterInfo::resolveFrameIndex(MachineInstr &MI, unsigned BaseReg,		void SIRegisterInfo::resolveFrameIndex(MachineInstr &MI, unsigned BaseReg,
int64_t Offset) const {		int64_t Offset) const {

MachineBasicBlock *MBB = MI.getParent();		MachineBasicBlock *MBB = MI.getParent();
[...] #endif

// The offset is not legal, so we must insert an add of the offset.		// The offset is not legal, so we must insert an add of the offset.
MachineRegisterInfo &MRI = MF->getRegInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
unsigned NewReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);		unsigned NewReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
DebugLoc DL = MI.getDebugLoc();		DebugLoc DL = MI.getDebugLoc();

assert(Offset != 0 && "Non-zero offset expected");		assert(Offset != 0 && "Non-zero offset expected");

unsigned UnusedCarry = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);

// In the case the instruction already had an immediate offset, here only		// In the case the instruction already had an immediate offset, here only
// the requested new offset is added because we are leaving the original		// the requested new offset is added because we are leaving the original
// immediate in place.		// immediate in place.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_I32_e64), NewReg)		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_ADD_I32_e32), NewReg)
.addReg(UnusedCarry, RegState::Define | RegState::Dead)
.addImm(Offset)		.addImm(Offset)
.addReg(BaseReg);		.addReg(BaseReg);

FIOp->ChangeToRegister(NewReg, false);		FIOp->ChangeToRegister(NewReg, false);
}		}

bool SIRegisterInfo::isFrameOffsetLegal(const MachineInstr *MI,		bool SIRegisterInfo::isFrameOffsetLegal(const MachineInstr *MI,
unsigned BaseReg,		unsigned BaseReg,

test/CodeGen/AMDGPU/local-stack-slot-bug.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck %s
				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck %s

				; This used to fail due to a v_add_i32 instruction with an illegal immediate
				; operand that was created during Local Stack Slot Allocation. Test case derived
				; from https://bugs.freedesktop.org/show_bug.cgi?id=96602
				;
				; CHECK-LABEL: {{^}}main:
				define amdgpu_ps float @main(i32 %idx) {
				main_body:
				%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
				%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
				%r = fadd float %v1, %v2
				ret float %r
				}

This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: fix local stack slot allocation bugs
ClosedPublic
