This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions
Closed · Public

Authored by • tstellarAMD on Oct 26 2016, 9:53 AM.


Details

Reviewers

tony-tye
arsenm

Commits

rG6695ba0440f3: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions
rL285479: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions

Summary

Flat instructions can return out of order, so we always need to wait
for all outstanding flat operations.
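
A minimal sketch of the rule described in the summary, written as standalone C++ rather than as code from the pass. The helper name `vmcntToWaitFor` and its parameters are hypothetical and only illustrate the reasoning: with in-order return, waiting until the counter drops to the number of newer operations is enough, but once an operation that may return out of order is outstanding, only a wait for 0 is safe.

```cpp
#include <cassert>

// Hypothetical helper, not part of SIInsertWaits. Returns the vmcnt value to
// wait for so that one particular memory operation is known to be complete.
//   NewerOps        - number of vmcnt-tracked operations issued after it.
//   FlatOutstanding - true if an operation that may return out of order has
//                     been issued since the last wait.
constexpr unsigned vmcntToWaitFor(unsigned NewerOps, bool FlatOutstanding) {
  // In-order return: once the counter has dropped to NewerOps, every older
  // operation has completed. Out-of-order return: the counter value says
  // nothing about which operations completed, so wait for all of them.
  return FlatOutstanding ? 0u : NewerOps;
}

static_assert(vmcntToWaitFor(1, false) == 1, "ordered: vmcnt(1) is enough");
static_assert(vmcntToWaitFor(1, true) == 0, "unordered: must use vmcnt(0)");
```

These two cases correspond to the vmcnt(1) and vmcnt(0) waits that the test added in this revision checks for.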

Diff Detail

Repository: rL LLVM

Event Timeline

• tstellarAMD updated this revision to Diff 75913. Oct 26 2016, 9:53 AM

• tstellarAMD retitled this revision to AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions.

• tstellarAMD updated this object.

• tstellarAMD added reviewers: arsenm, tony-tye.

• tstellarAMD added a subscriber: llvm-commits.

Herald added subscribers: yaxunl, nhaehnle, wdng, kzhuravl. · Oct 26 2016, 9:53 AM

arsenm added inline comments. Oct 26 2016, 12:23 PM

lib/Target/AMDGPU/SIInsertWaits.cpp
300–301 ↗	(On Diff #75913)	Is this too strict? I know the manual says something like the only sensible value to use is 0, but from the reasoning before it, it sounds like that's only the case when accessing a generic address. We could check the MMO and see if it is really global, which is the common case.

Only treat flat operations as unordered if they access the flat address space.

LGTM with test fixed

test/CodeGen/MIR/AMDGPU/waitcnt.mir
50 ↗	(On Diff #76033)	This looks like it has a mem operand although the comment on the check line says it doesn't

This revision is now accepted and ready to land. Oct 27 2016, 5:07 PM

• tstellarAMD added inline comments. Oct 28 2016, 7:38 AM

test/CodeGen/MIR/AMDGPU/waitcnt.mir
50 ↗	(On Diff #76033)	The first load is the one without the mem operand; I can clarify this in the comment.

Closed by commit rL285479: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions (authored by tstellar). · Oct 28 2016, 5:03 PM

This revision was automatically updated to reflect the committed changes.

Feedback on overall pass.

lib/Target/AMDGPU/SIInsertWaits.cpp
54 ↗	(On Diff #75913)	Is there a named constant for the maximum register number rather than using 512?
88 ↗	(On Diff #75913)	Would be helpful to state what the bits mean. It seems 1 is EXPORT and 2 is MEM-WRITE and perhaps have an enumeration that is used. Would need to add 4 for GDS when supported.
137 ↗	(On Diff #75913)	instrucitons -> instructions
197 ↗	(On Diff #75913)	Only GFX6 uses exp_cnt for stores. Later targets do not increment this count, but stores of more than 2 dwords have a hardware hazard that requires at least one instruction between the store and the next write of the register. So the EXP_CNT property should only be put on M*BUF instructions for GFX6 and not later.
300–301 ↗	(On Diff #75913)	If the flat operation is known not to access LDS then it cannot return out of order. For example, flat is used in 64 bit to access the global address space. So I wonder if we should also check the address space of the operation and only do this if the address space is FLAT (and not when GLOBAL)?
316 ↗	(On Diff #75913)	Should this be querying a subtarget feature instead of a specific target generation? The feature here seems to be that soft clauses are supported.
320–325 ↗	(On Diff #75913)	This comment indicates that both SMEM and VMEM clauses must be broken, but the following code only handles VMEM as SMEM is handled elsewhere. The rules for VMEM only have to be followed when XNACK is supported. However, the rules for SMEM need to be followed regardless of whether XNACK is enabled as SMEM operations can complete out of order.
326–331 ↗	(On Diff #75913)	Don't VMEM clauses only have to ensure input registers are not modified inside the clause when XNACK is being supported? We now have a subtarget feature to indicate that so should that be used here instead of checking the generation? So should this NOP insertion only be done when the XNACK feature is enabled?
377–378 ↗	(On Diff #75913)	This is conservatively correct. But a better approach would be to not increment the vmcnt for flat instructions that are to generic address space, and record in the DefinedRegs of the destination as maxint. That would allow non-0 vmcnt for using registers produced by non flat instructions (it would be conservative as the value would assume the flat may have completed early), and 0 for the result of the flat instruction.
381 ↗	(On Diff #75913)	Is this still true with current hardware? Pre-SI I think this was the case, but I thought SI onwards no longer used the export counter for VMEM instructions?
97 ↗	(On Diff #76033)	Given that this is only for flat instructions that can complete early, not all flat, should this be renamed?
193 ↗	(On Diff #76033)	Are BUFFER_CACHE_INV* marked as updating vmcnt? Are FLAT* marked as updating vmcnt? Are GDS instructions marked as lgkmcnt and expcnt? GDS needs waitcnt 0 before EXEC can be updated.
199 ↗	(On Diff #76033)	// LGKM counters may be incremented by more than 1.
200 ↗	(On Diff #76033)	Are S_DCACHE_* marked as updating lgkmcnt? Are FLAT* marked as updating lgkmcnt?
202 ↗	(On Diff #76033)	This check should also apply to scalar writes.
216 ↗	(On Diff #76033)	The scalar data cache invalidate and writeback instructions do not affect the lgkm counter so should not be marked in the td file as affecting the counter.
224 ↗	(On Diff #76033)	Why is this needed as Result is initialized to 0?
241 ↗	(On Diff #76033)	also GDS
246 ↗	(On Diff #76033)	Are any source operands that are registers with a counter value that has not yet been satisfied (ie counter < value already waitedOn)?
268–275 ↗	(On Diff #76033)	Only GFX6 requires to use the expcnt to determine if the input value is InUse. There is also a hardware hazard if input is larger than N dwords which requires M instructions before register can be used as a destination (is that hazard checked?).
333 ↗	(On Diff #76033)	Should this also include scalar writes?
333–337 ↗	(On Diff #76033)	Why is this if nested inside the enclosing if? Seems tracking the lastOpcodeType should be done regardless of breaking the soft clauses for consistency.
341 ↗	(On Diff #76033)	Add another bit for GDS. Exports are kept in order only within each export type (color/null, position, parameter cache) so need separate bits.
359 ↗	(On Diff #76033)	For UsedRegs only the expcnt needs to be waited on before the register is available. For VMEM store in GFX6 both vmcnt and expcnt will be present in Limit; for GDS both expcnt and lgkm will be present in Limit. So should this just update the expcnt?
382 ↗	(On Diff #76033)	Should this be != 3? If both are seen then it will be 3, so 3 means it is NOT ordered, not that it IS ordered? If adding other bits for GDS and export types then it is better to use BitCount(ExpInstrTypesSeen) == 1
385 ↗	(On Diff #76033)	Currently the LGKM counter is always assumed unordered but this could be improved by tracking the classes of instruction that update it (as is done for EXP_CNT) and then can use non-0 waitcnt when only a single class of instructions have been seen since the last waitcnt for LGKM. This would potentially benefit the DS_* instructions greatly.
401 ↗	(On Diff #76033)	If Required is 0 then no wait is needed on this counter so Value should be set to Hardware limit. Only if Required is non 0 does it mean that there is an instruction in this BB that we must wait on.
459 ↗	(On Diff #76033)	Need delayed waitcnt if Counts is trying to wait on an instruction after the WaitedOn. So this should be: if (Counts.Array[i] < LastIssued.Array[i] - WaitedOn.Array[i])
487 ↗	(On Diff #76033)	Only the expcnt should be considered. When UseRegs is set it includes all counters and we do not need to wait for a GFX6 store to complete before being able to use the source register. That only has to be waited for before using the destination register.
501 ↗	(On Diff #76033)	Should this be a target feature?
558 ↗	(On Diff #76033)	Seems better if this was a target feature that was tested.
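
The comments on lines 88, 341 and 382 above suggest naming the bits kept in ExpInstrTypesSeen and treating EXP_CNT as ordered only when exactly one instruction class has been seen. A minimal standalone sketch of that idea follows. The enumerator names and the `countBits` / `isExpCntOrdered` helpers are hypothetical; only the reading "1 is EXPORT, 2 is MEM-WRITE" and the proposed extra GDS bit come from the review.

```cpp
#include <cassert>

// Hypothetical names. Values 1 and 2 follow the reading given in the review;
// 4 is the GDS bit the review proposes to add.
enum ExpInstrType : unsigned {
  EXP_TYPE_EXPORT = 1u << 0,
  EXP_TYPE_MEM_WRITE = 1u << 1,
  EXP_TYPE_GDS = 1u << 2,
};

// Number of set bits in V (clears the lowest set bit on each step).
constexpr unsigned countBits(unsigned V) {
  return V == 0 ? 0u : 1u + countBits(V & (V - 1));
}

// The counter is treated as ordered only if exactly one instruction class has
// been seen since the last wait. With nothing seen this returns false, which
// is conservative but harmless because no wait is needed in that case.
constexpr bool isExpCntOrdered(unsigned TypesSeen) {
  return countBits(TypesSeen) == 1;
}

static_assert(isExpCntOrdered(EXP_TYPE_EXPORT), "single class is ordered");
static_assert(!isExpCntOrdered(EXP_TYPE_EXPORT | EXP_TYPE_MEM_WRITE),
              "two classes (value 3) are not ordered");
```

Under this formulation the value 3 means "not ordered", which is the opposite of what the `ExpInstrTypesSeen == 3` check questioned in the comment on line 382 computes.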

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/SIInsertWaits.cpp	13 lines
llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h	2 lines
llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp	14 lines
llvm/trunk/test/CodeGen/MIR/AMDGPU/waitcnt.mir	59 lines

Diff 76275

llvm/trunk/lib/Target/AMDGPU/SIInsertWaits.cpp

@@ ... @@ private:
   /// \brief Different export instruction types seen since last wait.
   unsigned ExpInstrTypesSeen;

   /// \brief Type of the last opcode.
   InstType LastOpcodeType;

   bool LastInstWritesM0;

+  /// Whether or not we have flat operations outstanding.
+  bool IsFlatOutstanding;
+
   /// \brief Whether the machine function returns void
   bool ReturnsVoid;

   /// Whether the VCCZ bit is possibly corrupt
   bool VCCZCorrupt;

   /// \brief Get increment/decrement amount for this instruction.
   Counters getHwCounts(MachineInstr &MI);
@@ ... @@
 void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
                                     MachineBasicBlock::iterator I,
                                     const Counters &Increment) {

   // Get the hardware counter increments and sum them up
   Counters Limit = ZeroCounts;
   unsigned Sum = 0;

+  if (TII->mayAccessFlatAddressSpace(*I))
+    IsFlatOutstanding = true;
+
   for (unsigned i = 0; i < 3; ++i) {
     LastIssued.Array[i] += Increment.Array[i];
     if (Increment.Array[i])
       Limit.Array[i] = LastIssued.Array[i];
     Sum += Increment.Array[i];
   }

   // If we don't increase anything then that's it
@@ ... @@ bool SIInsertWaits::insertWait(MachineBasicBlock &MBB,
   // A function not returning void needs to wait, because other bytecode will
   // be appended after it and we don't know what it will be.
   if (I != MBB.end() && I->getOpcode() == AMDGPU::S_ENDPGM && ReturnsVoid)
     return false;

   // Figure out if the async instructions execute in order
   bool Ordered[3];

-  // VM_CNT is always ordered
-  Ordered[0] = true;
+  // VM_CNT is always ordered except when there are flat instructions, which
+  // can return out of order.
+  Ordered[0] = !IsFlatOutstanding;

   // EXP_CNT is unordered if we have both EXP & VM-writes
   Ordered[1] = ExpInstrTypesSeen == 3;

   // LGKM_CNT is handled as always unordered. TODO: Handle LDS and GDS
   Ordered[2] = false;

   // The values we are going to put into the S_WAITCNT instruction
@@ ... @@ bool SIInsertWaits::insertWait(MachineBasicBlock &MBB,
   BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_WAITCNT))
     .addImm(encodeWaitcnt(IV,
                           Counts.Named.VM,
                           Counts.Named.EXP,
                           Counts.Named.LGKM));

   LastOpcodeType = OTHER;
   LastInstWritesM0 = false;
+  IsFlatOutstanding = false;
   return true;
 }

 /// \brief helper function for handleOperands
 static void increaseCounters(Counters &Dst, const Counters &Src) {

   for (unsigned i = 0; i < 3; ++i)
     Dst.Array[i] = std::max(Dst.Array[i], Src.Array[i]);
@@ ... @@ bool SIInsertWaits::runOnMachineFunction(MachineFunction &MF) {
   HardwareLimits.Named.EXP = getExpcntBitMask(IV);
   HardwareLimits.Named.LGKM = getLgkmcntBitMask(IV);

   WaitedOn = ZeroCounts;
   DelayedWaitOn = ZeroCounts;
   LastIssued = ZeroCounts;
   LastOpcodeType = OTHER;
   LastInstWritesM0 = false;
+  IsFlatOutstanding = false;
   ReturnsVoid = MF.getInfo<SIMachineFunctionInfo>()->returnsVoid();

   memset(&UsedRegs, 0, sizeof(UsedRegs));
   memset(&DefinedRegs, 0, sizeof(DefinedRegs));

   SmallVector<MachineInstr *, 4> RemoveMI;

   for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h

@@ ... @@ public:

   unsigned isLoadFromStackSlot(const MachineInstr &MI,
                                int &FrameIndex) const override;
   unsigned isStoreToStackSlot(const MachineInstr &MI,
                               int &FrameIndex) const override;

   unsigned getInstSizeInBytes(const MachineInstr &MI) const override;

+  bool mayAccessFlatAddressSpace(const MachineInstr &MI) const;
+
   ArrayRef<std::pair<int, const char *>>
   getSerializableTargetIndices() const override;

   ScheduleHazardRecognizer *
   CreateTargetPostRAHazardRecognizer(const InstrItineraryData *II,
                                      const ScheduleDAG *DAG) const override;

   ScheduleHazardRecognizer *

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp

@@ ... @@ case TargetOpcode::INLINEASM: {
     const char *AsmStr = MI.getOperand(0).getSymbolName();
     return getInlineAsmLength(AsmStr, *MF->getTarget().getMCAsmInfo());
   }
   default:
     llvm_unreachable("unable to find instruction size");
   }
 }

+bool SIInstrInfo::mayAccessFlatAddressSpace(const MachineInstr &MI) const {
+  if (!isFLAT(MI))
+    return false;
+
+  if (MI.memoperands_empty())
+    return true;
+
+  for (const MachineMemOperand *MMO : MI.memoperands()) {
+    if (MMO->getAddrSpace() == AMDGPUAS::FLAT_ADDRESS)
+      return true;
+  }
+  return false;
+}
+
 ArrayRef<std::pair<int, const char *>>
 SIInstrInfo::getSerializableTargetIndices() const {
   static const std::pair<int, const char *> TargetIndices[] = {
       {AMDGPU::TI_CONSTDATA_START, "amdgpu-constdata-start"},
       {AMDGPU::TI_SCRATCH_RSRC_DWORD0, "amdgpu-scratch-rsrc-dword0"},
       {AMDGPU::TI_SCRATCH_RSRC_DWORD1, "amdgpu-scratch-rsrc-dword1"},
       {AMDGPU::TI_SCRATCH_RSRC_DWORD2, "amdgpu-scratch-rsrc-dword2"},
       {AMDGPU::TI_SCRATCH_RSRC_DWORD3, "amdgpu-scratch-rsrc-dword3"}};

llvm/trunk/test/CodeGen/MIR/AMDGPU/waitcnt.mir

+# RUN: llc -march=amdgcn -mcpu=fiji -run-pass si-insert-waits %s -o - | FileCheck %s
+
+--- |
+  define void @flat_zero_waitcnt(i32 addrspace(1)* %global4,
+                                 <4 x i32> addrspace(1)* %global16,
+                                 i32 addrspace(4)* %flat4,
+                                 <4 x i32> addrspace(4)* %flat16) {
+    ret void
+  }
+...
+---
+
+# CHECK-LABEL: name: flat_zero_waitcnt
+
+# CHECK-LABEL: bb.0:
+# CHECK: FLAT_LOAD_DWORD
+# CHECK: FLAT_LOAD_DWORDX4
+# Global loads will return in order so we should:
+# s_waitcnt vmcnt(1) lgkmcnt(0)
+# CHECK-NEXT: S_WAITCNT 113
+
+# CHECK-LABEL: bb.1:
+# CHECK: FLAT_LOAD_DWORD
+# CHECK: FLAT_LOAD_DWORDX4
+# The first load has no mem operand, so we should assume it accesses the flat
+# address space.
+# s_waitcnt vmcnt(0) lgkmcnt(0)
+# CHECK-NEXT: S_WAITCNT 112
+
+# CHECK-LABEL: bb.2:
+# CHECK: FLAT_LOAD_DWORD
+# CHECK: FLAT_LOAD_DWORDX4
+# One outstand loads access the flat address space.
+# s_waitcnt vmcnt(0) lgkmcnt(0)
+# CHECK-NEXT: S_WAITCNT 112
+
+name: flat_zero_waitcnt
+
+body: |
+  bb.0:
+    successors: %bb.1
+    %vgpr0 = FLAT_LOAD_DWORD %vgpr1_vgpr2, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4 from %ir.global4)
+    %vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 %vgpr7_vgpr8, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 16 from %ir.global16)
+    %vgpr0 = V_MOV_B32_e32 %vgpr1, implicit %exec
+    S_BRANCH %bb.1
+
+  bb.1:
+    successors: %bb.2
+    %vgpr0 = FLAT_LOAD_DWORD %vgpr1_vgpr2, 0, 0, 0, implicit %exec, implicit %flat_scr
+    %vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 %vgpr7_vgpr8, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 16 from %ir.global16)
+    %vgpr0 = V_MOV_B32_e32 %vgpr1, implicit %exec
+    S_BRANCH %bb.2
+
+  bb.2:
+    %vgpr0 = FLAT_LOAD_DWORD %vgpr1_vgpr2, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4 from %ir.flat4)
+    %vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 %vgpr7_vgpr8, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 16 from %ir.flat16)
+    %vgpr0 = V_MOV_B32_e32 %vgpr1, implicit %exec
+    S_ENDPGM
+...
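
The immediates 113 and 112 checked in the test can be related to the s_waitcnt comments next to them with a small standalone sketch. It assumes the field layout used by SI/CI/VI targets such as fiji (vmcnt in bits 3:0, expcnt in bits 6:4, lgkmcnt in bits 11:8); the `packWaitcnt` and `unpackWaitcnt` names are hypothetical and are not the backend's own helpers.

```cpp
#include <cassert>

// Decoded S_WAITCNT fields. The layout is an assumption that holds for
// SI/CI/VI: vmcnt = bits 3:0, expcnt = bits 6:4, lgkmcnt = bits 11:8.
struct Waitcnt {
  unsigned VmCnt;
  unsigned ExpCnt;
  unsigned LgkmCnt;
};

// Hypothetical helper: pack the three counters into the 16-bit immediate.
constexpr unsigned packWaitcnt(unsigned VmCnt, unsigned ExpCnt,
                               unsigned LgkmCnt) {
  return (VmCnt & 0xFu) | ((ExpCnt & 0x7u) << 4) | ((LgkmCnt & 0xFu) << 8);
}

// Hypothetical helper: split an immediate back into its counters.
constexpr Waitcnt unpackWaitcnt(unsigned Imm) {
  return Waitcnt{Imm & 0xFu, (Imm >> 4) & 0x7u, (Imm >> 8) & 0xFu};
}

// expcnt = 7 is the field maximum, i.e. no wait on the export counter.
static_assert(packWaitcnt(1, 7, 0) == 113, "vmcnt(1) lgkmcnt(0)");
static_assert(packWaitcnt(0, 7, 0) == 112, "vmcnt(0) lgkmcnt(0)");
static_assert(unpackWaitcnt(113).VmCnt == 1, "bb.0 keeps a non-zero vmcnt");
static_assert(unpackWaitcnt(112).VmCnt == 0, "bb.1 and bb.2 wait for all");
```

Under that assumption, bb.0 (global loads only) keeps vmcnt(1), while bb.1 and bb.2, which contain a load that may access the flat address space, fall back to vmcnt(0).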