This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions
ClosedPublic

Authored by • tstellarAMD on Oct 26 2016, 9:53 AM.

Details

Reviewers

tony-tye
arsenm

Commits

rG6695ba0440f3: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions
rL285479: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions

Summary

Flat instructions can return out of order, so we always need to wait
for all the outstanding flat operations.
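
To illustrate why out-of-order completion forces a zero wait count, here is a minimal standalone sketch of the per-counter computation, modeled on the loop in insertWait from this diff. The function name waitcntValue and its parameters are hypothetical and are not part of the pass.

```cpp
#include <algorithm>

// Hypothetical helper, modeled on the loop in SIInsertWaits::insertWait.
// LastIssued: running count of operations issued on this counter.
// Required:   running count of the operation whose result is needed
//             (assumed to be <= LastIssued).
// Ordered:    true if operations on this counter complete in issue order.
// HwLimit:    largest value the S_WAITCNT field can encode.
unsigned waitcntValue(unsigned LastIssued, unsigned Required, bool Ordered,
                      unsigned HwLimit) {
  if (!Ordered) {
    // Any outstanding operation could be the one we need, so wait for all.
    return 0;
  }
  // In-order completion: operations issued after Required may stay in flight.
  return std::min(LastIssued - Required, HwLimit);
}
```

With five operations issued and the third one required, an ordered counter can wait with a count of 2. An unordered counter, which is how this patch treats VM_CNT once a flat instruction is outstanding, has to wait with 0.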

Diff Detail

Build Status

Buildable 800
Build 800: arc lint + arc unit

Event Timeline

• tstellarAMD updated this revision to Diff 75913. Oct 26 2016, 9:53 AM

• tstellarAMD retitled this revision from to AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions.

• tstellarAMD updated this object.

• tstellarAMD added reviewers: arsenm, tony-tye.

• tstellarAMD added a subscriber: llvm-commits.

Herald added subscribers: yaxunl, nhaehnle, wdng, kzhuravl. Oct 26 2016, 9:53 AM

arsenm added inline comments. Oct 26 2016, 12:23 PM

lib/Target/AMDGPU/SIInsertWaits.cpp
300–301	Is this too strict? I know the manual says something like the only sensible value to use is 0, but from the reasoning before it, it sounds like that's only if accessing a generic address. We could check the MMO and see if it is really global, which is the common case.

Only treat flat operations as unordered if they access the flat address space.

LGTM with test fixed

test/CodeGen/MIR/AMDGPU/waitcnt.mir
51	This looks like it has a mem operand although the comment on the check line says it doesn't

This revision is now accepted and ready to land. Oct 27 2016, 5:07 PM

• tstellarAMD added inline comments. Oct 28 2016, 7:38 AM

test/CodeGen/MIR/AMDGPU/waitcnt.mir
51	The first load is the one without the mem operand, I can clarify this in the comment.
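
The address-space check discussed above could look roughly like the following standalone sketch. The types AddrSpace and MemOperand and the function flatMayReturnOutOfOrder are hypothetical stand-ins, not the pass's real API. Treating an instruction with no memory operand as possibly flat is an assumption made here so that the result stays conservative.

```cpp
#include <vector>

// Hypothetical stand-ins for the address-space information that a memory
// operand (MMO) carries; these are not LLVM types.
enum class AddrSpace { Global, Local, Private, Flat };

struct MemOperand {
  AddrSpace AS;
};

// Returns true if a flat instruction has to be treated as possibly
// returning out of order.
bool flatMayReturnOutOfOrder(const std::vector<MemOperand> &MemOperands) {
  // Assumption: without a memory operand the address space is unknown,
  // so be conservative.
  if (MemOperands.empty())
    return true;

  // Only an access through the flat (generic) address space is treated as
  // unordered; a flat instruction known to access global memory is not.
  for (const MemOperand &MMO : MemOperands)
    if (MMO.AS == AddrSpace::Flat)
      return true;

  return false;
}
```

Under these assumptions a flat load known to access global memory keeps VM_CNT ordered, while a flat load with no memory operand still forces a wait count of 0.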

Closed by commit rL285479: AMDGPU/SI: Don't use non-0 waitcnt values when waiting on Flat instructions (authored by tstellar). Oct 28 2016, 5:03 PM

This revision was automatically updated to reflect the committed changes.

Feedback on overall pass.

lib/Target/AMDGPU/SIInsertWaits.cpp
54	Is there a named constant for the maximum register number rather than using 512?
88	Would be helpful to state what the bits mean. It seems 1 is EXPORT and 2 is MEM-WRITE and perhaps have an enumeration that is used. Would need to add 4 for GDS when supported.
97	Given that this is only for flat instructions that can complete early, not all flat, should this be renamed?
137	instrucitons -> instructions
193	Are BUFFER_CACHE_INV* marked as updating vmcnt? Are FLAT* marked as updating vmcnt? Are GDS instructions marked as lgkmcnt and expcnt? GDS needs waitcnt 0 before EXEC can be updated.
197	Only GFX6 uses exp_cnt for stores. Later targets do not increment this count, but stores of more than 2 dwords have a hardware hazard that requires at least one instruction between the store and the next write of the register. So EXP_CNT property should only be put on M*BUF instructions for GFX6 and not later.
199	// LGKM counters may be incremented by more than 1.
200	Are S_DCACHE_* marked as updating lgkmcnt? Are FLAT* marked as updating lgkmcnt?
202	This check should also apply to scalar writes.
216	The scalar data cache invalidate and writeback instructions do not affect the lgkm counter so should not be marked in the td file as affecting the counter.
224	Why is this needed as Result is initialized to 0?
241	also GDS
246	Are any source operands that are registers with a counter value that has not yet been satisfied (ie counter < value already waitedOn)?
268–275	Only GFX6 requires to use the expcnt to determine if the input value is InUse. There is also a hardware hazard if input is larger than N dwords which requires M instructions before register can be used as a destination (is that hazard checked?).
300–301	If the flat operation is known not to access LDS then it cannot return out of order. For example, flat is used in 64 bit to access the global address space. So wonder if should also check the address space of the operation and only do this if the address space is FLAT (and not when GLOBAL)?
316	Should this be querying a subtarget feature instead of a specific target generation? The feature here seems to be that soft clauses are supported.
320–325	This comment indicates that both SMEM and VMEM clauses must be broken, but the following code only handles VMEM as SMEM is handled elsewhere. The rules for VMEM only have to be followed when XNACK is supported. However, the rules for SMEM need to be followed regardless of whether XNACK is enabled as SMEM operations can complete out of order.
326–331	Don't VMEM clauses only have to ensure input registers are not modified inside the clause when XNACK is being supported? We now have a subtarget feature to indicate that so should that be used here instead of checking the generation? So should this NOP insertion only be done when the XNACK feature is enabled?
333	Should this also include scalar writes?
333–337	Why is this if nested inside the enclosing if? Seems tracking the lastOpcodeType should be done regardless of breaking the soft clauses for consistency.
341	Add another bit for GDS. Exports are kept in order only within each export type (color/null, position, parameter cache) so need separate bits.
359	For UsedRegs only the expcnt needs to be waited on before the register is available. For VMEM store in GFX6 both vmcnt and expcnt will be present in Limit; for GDS both expcnt and lgkm will be present in Limit. So should this just update the expcnt?
377–378	This is conservatively correct. But a better approach would be to not increment the vmcnt for flat instructions that are to generic address space, and record in the DefinedRegs of the destination as maxint. That would allow non-0 vmcnt for using registers produced by non flat instructions (it would be conservative as the value would assume the flat may have completed early), and 0 for the result of the flat instruction.
381	Is this still true with current hardware? Pre-SI I think this was the case, but I thought SI onwards no longer used the export counter for VMEM instructions?
382	Should this be != 3? If both are seen then it will be 3, so 3 means it is NOT ordered, not that it IS ordered? If adding other bits for GDS and export types then better to use BitCount(ExpInstrTypesSeen) == 1.
385	Currently the LGKM counter is always assumed unordered but this could be improved by tracking the classes of instruction that update it (as is done for EXP_CNT) and then can use non-0 waitcnt when only a single class of instructions has been seen since the last waitcnt for LGKM. This would potentially benefit the DS_* instructions greatly.
401	If Required is 0 then no wait is needed on this counter so Value should be set to Hardware limit. Only if Required is non 0 does it mean that there is an instruction in this BB that we must wait on.
459	Need delayed waitcnt if Counts is trying to wait on an instruction after the WaitedOn. So this should be: if (Counts.Array[i] < LastIssued.Array[i] - WaitedOn.Array[i])
487	Only the expcnt should be considered. When UseRegs is set it includes all counters and we do not need to wait for a GFX6 store to complete before being able to use the source register. That only has to be waited for before using the destination register.
501	Should this be a target feature?
558	Seems better if this was a target feature that was tested.
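
The suggestions for lines 88, 341 and 382 can be sketched together as follows. The enumerator names and the helper isExpCntOrdered are hypothetical. The values 1 and 2 follow the reviewer's reading of the existing code, and the GDS bit is the proposed addition, not something the pass has today.

```cpp
#include <bitset>

// Hypothetical names for the bits of ExpInstrTypesSeen.
enum ExpInstrType : unsigned {
  EXP_TYPE_EXPORT = 1u << 0,    // EXP instructions
  EXP_TYPE_MEM_WRITE = 1u << 1, // memory writes that increment EXP_CNT
  EXP_TYPE_GDS = 1u << 2,       // proposed extra bit for GDS
};

// EXP_CNT is treated as ordered only when exactly one class of instruction
// has been seen since the last wait, as suggested in the review.
bool isExpCntOrdered(unsigned ExpInstrTypesSeen) {
  return std::bitset<32>(ExpInstrTypesSeen).count() == 1;
}
```

A mask of 0 (nothing seen) also reports unordered under this definition, which is conservative. Separate bits per export type, as requested for line 341, would extend the enumeration the same way.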

Revision Contents

Path	Size
lib/Target/AMDGPU/SIInsertWaits.cpp	13 lines
test/CodeGen/MIR/AMDGPU/waitcnt.mir	24 lines

Diff 75913

lib/Target/AMDGPU/SIInsertWaits.cpp

Show First 20 Lines • Show All 45 Lines • ▼ Show 20 Lines
} Counters;		} Counters;

typedef enum {		typedef enum {
OTHER,		OTHER,
SMEM,		SMEM,
VMEM		VMEM
} InstType;		} InstType;

typedef Counters RegCounters[512];		typedef Counters RegCounters[512];
		tony-tye: Is there a named constant for the maximum register number rather than using 512?
typedef std::pair<unsigned, unsigned> RegInterval;		typedef std::pair<unsigned, unsigned> RegInterval;

class SIInsertWaits : public MachineFunctionPass {		class SIInsertWaits : public MachineFunctionPass {

private:		private:
const SISubtarget *ST;		const SISubtarget *ST;
const SIInstrInfo *TII;		const SIInstrInfo *TII;
const SIRegisterInfo *TRI;		const SIRegisterInfo *TRI;
Show All 17 Lines	private:
Counters LastIssued;		Counters LastIssued;

/// \brief Registers used by async instructions.		/// \brief Registers used by async instructions.
RegCounters UsedRegs;		RegCounters UsedRegs;

/// \brief Registers defined by async instructions.		/// \brief Registers defined by async instructions.
RegCounters DefinedRegs;		RegCounters DefinedRegs;

/// \brief Different export instruction types seen since last wait.		/// \brief Different export instruction types seen since last wait.
		tony-tye: Would be helpful to state what the bits mean. It seems 1 is EXPORT and 2 is MEM-WRITE and perhaps have an enumeration that is used. Would need to add 4 for GDS when supported.
unsigned ExpInstrTypesSeen;		unsigned ExpInstrTypesSeen;

/// \brief Type of the last opcode.		/// \brief Type of the last opcode.
InstType LastOpcodeType;		InstType LastOpcodeType;

bool LastInstWritesM0;		bool LastInstWritesM0;

		/// Whether or not we have flat operations outstanding.
		bool IsFlatOutstanding;
		tony-tye: Given that this is only for flat instructions that can complete early, not all flat, should this be renamed?

/// \brief Whether the machine function returns void		/// \brief Whether the machine function returns void
bool ReturnsVoid;		bool ReturnsVoid;

/// Whether the VCCZ bit is possibly corrupt		/// Whether the VCCZ bit is possibly corrupt
bool VCCZCorrupt;		bool VCCZCorrupt;

/// \brief Get increment/decrement amount for this instruction.		/// \brief Get increment/decrement amount for this instruction.
Counters getHwCounts(MachineInstr &MI);		Counters getHwCounts(MachineInstr &MI);
Show All 22 Lines	private:
bool unorderedDefines(MachineInstr &MI);		bool unorderedDefines(MachineInstr &MI);

/// \brief Resolve all operand dependencies to counter requirements		/// \brief Resolve all operand dependencies to counter requirements
Counters handleOperands(MachineInstr &MI);		Counters handleOperands(MachineInstr &MI);

/// \brief Insert S_NOP between an instruction writing M0 and S_SENDMSG.		/// \brief Insert S_NOP between an instruction writing M0 and S_SENDMSG.
void handleSendMsg(MachineBasicBlock &MBB, MachineBasicBlock::iterator I);		void handleSendMsg(MachineBasicBlock &MBB, MachineBasicBlock::iterator I);

/// Return true if there are LGKM instrucitons that haven't been waited on		/// Return true if there are LGKM instrucitons that haven't been waited on
		tony-tye: instrucitons -> instructions
/// yet.		/// yet.
bool hasOutstandingLGKM() const;		bool hasOutstandingLGKM() const;

public:		public:
static char ID;		static char ID;

SIInsertWaits() :		SIInsertWaits() :
MachineFunctionPass(ID),		MachineFunctionPass(ID),
Show All 39 Lines
bool SIInsertWaits::hasOutstandingLGKM() const {		bool SIInsertWaits::hasOutstandingLGKM() const {
return WaitedOn.Named.LGKM != LastIssued.Named.LGKM;		return WaitedOn.Named.LGKM != LastIssued.Named.LGKM;
}		}

Counters SIInsertWaits::getHwCounts(MachineInstr &MI) {		Counters SIInsertWaits::getHwCounts(MachineInstr &MI) {
uint64_t TSFlags = MI.getDesc().TSFlags;		uint64_t TSFlags = MI.getDesc().TSFlags;
Counters Result = { { 0, 0, 0 } };		Counters Result = { { 0, 0, 0 } };

Result.Named.VM = !!(TSFlags & SIInstrFlags::VM_CNT);		Result.Named.VM = !!(TSFlags & SIInstrFlags::VM_CNT);
		tony-tye: Are BUFFER_CACHE_INV* marked as updating vmcnt? Are FLAT* marked as updating vmcnt? Are GDS instructions marked as lgkmcnt and expcnt? GDS needs waitcnt 0 before EXEC can be updated.

// Only consider stores or EXP for EXP_CNT		// Only consider stores or EXP for EXP_CNT
Result.Named.EXP = !!(TSFlags & SIInstrFlags::EXP_CNT &&		Result.Named.EXP = !!(TSFlags & SIInstrFlags::EXP_CNT &&
(MI.getOpcode() == AMDGPU::EXP || MI.getDesc().mayStore()));		(MI.getOpcode() == AMDGPU::EXP || MI.getDesc().mayStore()));
		tony-tye: Only GFX6 uses exp_cnt for stores. Later targets do not increment this count, but stores of more than 2 dwords have a hardware hazard that requires at least one instruction between the store and the next write of the register. So EXP_CNT property should only be put on M*BUF instructions for GFX6 and not later.

// LGKM may uses larger values		// LGKM may uses larger values
		tony-tye: // LGKM counters may be incremented by more than 1.
if (TSFlags & SIInstrFlags::LGKM_CNT) {		if (TSFlags & SIInstrFlags::LGKM_CNT) {
		tony-tye: Are S_DCACHE_* marked as updating lgkmcnt? Are FLAT* marked as updating lgkmcnt?

if (TII->isSMRD(MI)) {		if (TII->isSMRD(MI)) {
		tony-tye: This check should also apply to scalar writes.

if (MI.getNumOperands() != 0) {		if (MI.getNumOperands() != 0) {
assert(MI.getOperand(0).isReg() &&		assert(MI.getOperand(0).isReg() &&
"First LGKM operand must be a register!");		"First LGKM operand must be a register!");

// XXX - What if this is a write into a super register?		// XXX - What if this is a write into a super register?
const TargetRegisterClass *RC = TII->getOpRegClass(MI, 0);		const TargetRegisterClass *RC = TII->getOpRegClass(MI, 0);
unsigned Size = RC->getSize();		unsigned Size = RC->getSize();
Result.Named.LGKM = Size > 4 ? 2 : 1;		Result.Named.LGKM = Size > 4 ? 2 : 1;
} else {		} else {
// s_dcache_inv etc. do not have a a destination register. Assume we		// s_dcache_inv etc. do not have a a destination register. Assume we
// want a wait on these.		// want a wait on these.
// XXX - What is the right value?		// XXX - What is the right value?
Result.Named.LGKM = 1;		Result.Named.LGKM = 1;
		tony-tye: The scalar data cache invalidate and writeback instructions do not affect the lgkm counter so should not be marked in the td file as affecting the counter.
}		}
} else {		} else {
// DS		// DS
Result.Named.LGKM = 1;		Result.Named.LGKM = 1;
}		}

} else {		} else {
Result.Named.LGKM = 0;		Result.Named.LGKM = 0;
		tony-tye: Why is this needed as Result is initialized to 0?
}		}

return Result;		return Result;
}		}

bool SIInsertWaits::isOpRelevant(MachineOperand &Op) {		bool SIInsertWaits::isOpRelevant(MachineOperand &Op) {
// Constants are always irrelevant		// Constants are always irrelevant
if (!Op.isReg() || !TRI->isInAllocatableClass(Op.getReg()))		if (!Op.isReg() || !TRI->isInAllocatableClass(Op.getReg()))
return false;		return false;

// Defines are always relevant		// Defines are always relevant
if (Op.isDef())		if (Op.isDef())
return true;		return true;

// For exports all registers are relevant		// For exports all registers are relevant
MachineInstr &MI = *Op.getParent();		MachineInstr &MI = *Op.getParent();
if (MI.getOpcode() == AMDGPU::EXP)		if (MI.getOpcode() == AMDGPU::EXP)
		tony-tye: also GDS
return true;		return true;

// For stores the stored value is also relevant		// For stores the stored value is also relevant
if (!MI.getDesc().mayStore())		if (!MI.getDesc().mayStore())
return false;		return false;
		tony-tye: Are any source operands that are registers with a counter value that has not yet been satisfied (ie counter < value already waitedOn)?

// Check if this operand is the value being stored.		// Check if this operand is the value being stored.
// Special case for DS/FLAT instructions, since the address		// Special case for DS/FLAT instructions, since the address
// operand comes before the value operand and it may have		// operand comes before the value operand and it may have
// multiple data operands.		// multiple data operands.

if (TII->isDS(MI) || TII->isFLAT(MI)) {		if (TII->isDS(MI) || TII->isFLAT(MI)) {
MachineOperand *Data = TII->getNamedOperand(MI, AMDGPU::OpName::data);		MachineOperand *Data = TII->getNamedOperand(MI, AMDGPU::OpName::data);
if (Data && Op.isIdenticalTo(*Data))		if (Data && Op.isIdenticalTo(*Data))
return true;		return true;
}		}

if (TII->isDS(MI)) {		if (TII->isDS(MI)) {
MachineOperand *Data0 = TII->getNamedOperand(MI, AMDGPU::OpName::data0);		MachineOperand *Data0 = TII->getNamedOperand(MI, AMDGPU::OpName::data0);
if (Data0 && Op.isIdenticalTo(*Data0))		if (Data0 && Op.isIdenticalTo(*Data0))
return true;		return true;

MachineOperand *Data1 = TII->getNamedOperand(MI, AMDGPU::OpName::data1);		MachineOperand *Data1 = TII->getNamedOperand(MI, AMDGPU::OpName::data1);
return Data1 && Op.isIdenticalTo(*Data1);		return Data1 && Op.isIdenticalTo(*Data1);
}		}

// NOTE: This assumes that the value operand is before the		// NOTE: This assumes that the value operand is before the
// address operand, and that there is only one value operand.		// address operand, and that there is only one value operand.
for (MachineInstr::mop_iterator I = MI.operands_begin(),		for (MachineInstr::mop_iterator I = MI.operands_begin(),
E = MI.operands_end(); I != E; ++I) {		E = MI.operands_end(); I != E; ++I) {

if (I->isReg() && I->isUse())		if (I->isReg() && I->isUse())
return Op.isIdenticalTo(*I);		return Op.isIdenticalTo(*I);
}		}
		tony-tye: Only GFX6 requires to use the expcnt to determine if the input value is InUse. There is also a hardware hazard if input is larger than N dwords which requires M instructions before register can be used as a destination (is that hazard checked?).

return false;		return false;
}		}

RegInterval SIInsertWaits::getRegInterval(const TargetRegisterClass *RC,		RegInterval SIInsertWaits::getRegInterval(const TargetRegisterClass *RC,
const MachineOperand &Reg) const {		const MachineOperand &Reg) const {
unsigned Size = RC->getSize();		unsigned Size = RC->getSize();
assert(Size >= 4);		assert(Size >= 4);

RegInterval Result;		RegInterval Result;
Result.first = TRI->getEncodingValue(Reg.getReg());		Result.first = TRI->getEncodingValue(Reg.getReg());
Result.second = Result.first + Size / 4;		Result.second = Result.first + Size / 4;

return Result;		return Result;
}		}

void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,		void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
const Counters &Increment) {		const Counters &Increment) {

// Get the hardware counter increments and sum them up		// Get the hardware counter increments and sum them up
Counters Limit = ZeroCounts;		Counters Limit = ZeroCounts;
unsigned Sum = 0;		unsigned Sum = 0;

		if (TII->isFLAT(*I))
		IsFlatOutstanding = true;
		tony-tye: If the flat operation is known not to access LDS then it cannot return out of order. For example, flat is used in 64 bit to access the global address space. So wonder if should also check the address space of the operation and only do this if the address space is FLAT (and not when GLOBAL)?
		arsenm: Is this too strict? I know the manual says something like the only sensible value to use is 0, but from the reasoning before it, it sounds like that's only if accessing a generic address. We could check the MMO and see if it is really global, which is the common case.

for (unsigned i = 0; i < 3; ++i) {		for (unsigned i = 0; i < 3; ++i) {
LastIssued.Array[i] += Increment.Array[i];		LastIssued.Array[i] += Increment.Array[i];
if (Increment.Array[i])		if (Increment.Array[i])
Limit.Array[i] = LastIssued.Array[i];		Limit.Array[i] = LastIssued.Array[i];
Sum += Increment.Array[i];		Sum += Increment.Array[i];
}		}

// If we don't increase anything then that's it		// If we don't increase anything then that's it
if (Sum == 0) {		if (Sum == 0) {
LastOpcodeType = OTHER;		LastOpcodeType = OTHER;
return;		return;
}		}

if (ST->getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {		if (ST->getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {
		tony-tye: Should this be querying a subtarget feature instead of a specific target generation? The feature here seems to be that soft clauses are supported.
// Any occurrence of consecutive VMEM or SMEM instructions forms a VMEM		// Any occurrence of consecutive VMEM or SMEM instructions forms a VMEM
// or SMEM clause, respectively.		// or SMEM clause, respectively.
//		//
// The temporary workaround is to break the clauses with S_NOP.		// The temporary workaround is to break the clauses with S_NOP.
//		//
// The proper solution would be to allocate registers such that all source		// The proper solution would be to allocate registers such that all source
// and destination registers don't overlap, e.g. this is illegal:		// and destination registers don't overlap, e.g. this is illegal:
// r0 = load r2		// r0 = load r2
// r2 = load r0		// r2 = load r0
		tony-tye: This comment indicates that both SMEM and VMEM clauses must be broken, but the following code only handles VMEM as SMEM is handled elsewhere. The rules for VMEM only have to be followed when XNACK is supported. However, the rules for SMEM need to be followed regardless of whether XNACK is enabled as SMEM operations can complete out of order.
if (LastOpcodeType == VMEM && Increment.Named.VM) {		if (LastOpcodeType == VMEM && Increment.Named.VM) {
// Insert a NOP to break the clause.		// Insert a NOP to break the clause.
BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_NOP))		BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_NOP))
.addImm(0);		.addImm(0);
LastInstWritesM0 = false;		LastInstWritesM0 = false;
}		}
		tony-tye: Don't VMEM clauses only have to ensure input registers are not modified inside the clause when XNACK is being supported? We now have a subtarget feature to indicate that so should that be used here instead of checking the generation? So should this NOP insertion only be done when the XNACK feature is enabled?

if (TII->isSMRD(*I))		if (TII->isSMRD(*I))
		tony-tye: Should this also include scalar writes?
LastOpcodeType = SMEM;		LastOpcodeType = SMEM;
else if (Increment.Named.VM)		else if (Increment.Named.VM)
LastOpcodeType = VMEM;		LastOpcodeType = VMEM;
}		}
		tony-tye: Why is this if nested inside the enclosing if? Seems tracking the lastOpcodeType should be done regardless of breaking the soft clauses for consistency.

// Remember which export instructions we have seen		// Remember which export instructions we have seen
if (Increment.Named.EXP) {		if (Increment.Named.EXP) {
ExpInstrTypesSeen |= I->getOpcode() == AMDGPU::EXP ? 1 : 2;		ExpInstrTypesSeen |= I->getOpcode() == AMDGPU::EXP ? 1 : 2;
		tony-tye: Add another bit for GDS. Exports are kept in order only within each export type (color/null, position, parameter cache) so need separate bits.
}		}

for (unsigned i = 0, e = I->getNumOperands(); i != e; ++i) {		for (unsigned i = 0, e = I->getNumOperands(); i != e; ++i) {
MachineOperand &Op = I->getOperand(i);		MachineOperand &Op = I->getOperand(i);
if (!isOpRelevant(Op))		if (!isOpRelevant(Op))
continue;		continue;

const TargetRegisterClass *RC = TII->getOpRegClass(*I, i);		const TargetRegisterClass *RC = TII->getOpRegClass(*I, i);
RegInterval Interval = getRegInterval(RC, Op);		RegInterval Interval = getRegInterval(RC, Op);
for (unsigned j = Interval.first; j < Interval.second; ++j) {		for (unsigned j = Interval.first; j < Interval.second; ++j) {

// Remember which registers we define		// Remember which registers we define
if (Op.isDef())		if (Op.isDef())
DefinedRegs[j] = Limit;		DefinedRegs[j] = Limit;

// and which one we are using		// and which one we are using
if (Op.isUse())		if (Op.isUse())
UsedRegs[j] = Limit;		UsedRegs[j] = Limit;
		tony-tye: For UsedRegs only the expcnt needs to be waited on before the register is available. For VMEM store in GFX6 both vmcnt and expcnt will be present in Limit; for GDS both expcnt and lgkm will be present in Limit. So should this just update the expcnt?
}		}
}		}
}		}

bool SIInsertWaits::insertWait(MachineBasicBlock &MBB,		bool SIInsertWaits::insertWait(MachineBasicBlock &MBB,
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
const Counters &Required) {		const Counters &Required) {

// End of program? No need to wait on anything		// End of program? No need to wait on anything
// A function not returning void needs to wait, because other bytecode will		// A function not returning void needs to wait, because other bytecode will
// be appended after it and we don't know what it will be.		// be appended after it and we don't know what it will be.
if (I != MBB.end() && I->getOpcode() == AMDGPU::S_ENDPGM && ReturnsVoid)		if (I != MBB.end() && I->getOpcode() == AMDGPU::S_ENDPGM && ReturnsVoid)
return false;		return false;

// Figure out if the async instructions execute in order		// Figure out if the async instructions execute in order
bool Ordered[3];		bool Ordered[3];

// VM_CNT is always ordered		// VM_CNT is always ordered except when there are flat instructions, which
Ordered[0] = true;		// can return out of order.
		tony-tye: This is conservatively correct. But a better approach would be to not increment the vmcnt for flat instructions that are to generic address space, and record in the DefinedRegs of the destination as maxint. That would allow non-0 vmcnt for using registers produced by non flat instructions (it would be conservative as the value would assume the flat may have completed early), and 0 for the result of the flat instruction.
		Ordered[0] = !IsFlatOutstanding;

// EXP_CNT is unordered if we have both EXP & VM-writes		// EXP_CNT is unordered if we have both EXP & VM-writes
		tony-tye: Is this still true with current hardware? Pre-SI I think this was the case, but I thought SI onwards no longer used the export counter for VMEM instructions?
Ordered[1] = ExpInstrTypesSeen == 3;		Ordered[1] = ExpInstrTypesSeen == 3;
		tony-tye: Should this be != 3? If both are seen then it will be 3, so 3 means it is NOT ordered, not that it IS ordered? If adding other bits for GDS and export types then better to use BitCount(ExpInstrTypesSeen) == 1.

// LGKM_CNT is handled as always unordered. TODO: Handle LDS and GDS		// LGKM_CNT is handled as always unordered. TODO: Handle LDS and GDS
Ordered[2] = false;		Ordered[2] = false;
		tony-tye: Currently the LGKM counter is always assumed unordered but this could be improved by tracking the classes of instruction that update it (as is done for EXP_CNT) and then can use non-0 waitcnt when only a single class of instructions has been seen since the last waitcnt for LGKM. This would potentially benefit the DS_* instructions greatly.

// The values we are going to put into the S_WAITCNT instruction		// The values we are going to put into the S_WAITCNT instruction
Counters Counts = HardwareLimits;		Counters Counts = HardwareLimits;

// Do we really need to wait?		// Do we really need to wait?
bool NeedWait = false;		bool NeedWait = false;

for (unsigned i = 0; i < 3; ++i) {		for (unsigned i = 0; i < 3; ++i) {

if (Required.Array[i] <= WaitedOn.Array[i])		if (Required.Array[i] <= WaitedOn.Array[i])
continue;		continue;

NeedWait = true;		NeedWait = true;

if (Ordered[i]) {		if (Ordered[i]) {
unsigned Value = LastIssued.Array[i] - Required.Array[i];		unsigned Value = LastIssued.Array[i] - Required.Array[i];
		tony-tye: If Required is 0 then no wait is needed on this counter so Value should be set to Hardware limit. Only if Required is non 0 does it mean that there is an instruction in this BB that we must wait on.

// Adjust the value to the real hardware possibilities.		// Adjust the value to the real hardware possibilities.
Counts.Array[i] = std::min(Value, HardwareLimits.Array[i]);		Counts.Array[i] = std::min(Value, HardwareLimits.Array[i]);

} else		} else
Counts.Array[i] = 0;		Counts.Array[i] = 0;

// Remember on what we have waited on.		// Remember on what we have waited on.
Show All 11 Lines	bool SIInsertWaits::insertWait(MachineBasicBlock &MBB,
BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_WAITCNT))		BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_WAITCNT))
.addImm(encodeWaitcnt(IV,		.addImm(encodeWaitcnt(IV,
Counts.Named.VM,		Counts.Named.VM,
Counts.Named.EXP,		Counts.Named.EXP,
Counts.Named.LGKM));		Counts.Named.LGKM));

LastOpcodeType = OTHER;		LastOpcodeType = OTHER;
LastInstWritesM0 = false;		LastInstWritesM0 = false;
		IsFlatOutstanding = true;
return true;		return true;
}		}
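
As a rough illustration of the loop above: for an ordered counter, the S_WAITCNT operand is the number of operations that may still be outstanding (LastIssued minus Required, clamped to the hardware limit), while for an unordered counter the only safe operand is 0. The sketch below is a standalone model of that arithmetic for a single counter, not the pass's code; `computeWaitValue` and its parameter list are hypothetical, and it assumes Required never exceeds LastIssued.

```cpp
#include <algorithm>

// Standalone model of the per-counter computation in insertWait().
// The function name and signature are hypothetical; only the arithmetic
// mirrors the diff. Assumes Required <= LastIssued.
constexpr unsigned computeWaitValue(unsigned LastIssued, unsigned Required,
                                    unsigned WaitedOn, unsigned HardwareLimit,
                                    bool Ordered) {
  // Nothing new to wait for: leave the counter at its maximum (no wait).
  if (Required <= WaitedOn)
    return HardwareLimit;
  // Unordered completion: any operation may finish last, so wait for all.
  if (!Ordered)
    return 0;
  // Ordered completion: operations issued after Required may remain
  // outstanding, clamped to what the encoding can express.
  return std::min(LastIssued - Required, HardwareLimit);
}
```

With five operations issued and the third one required, an ordered counter yields 2 and an unordered one yields 0. In this model a Required of 0 can never exceed WaitedOn, so that counter stays at the hardware limit.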

/// \brief helper function for handleOperands		/// \brief helper function for handleOperands
static void increaseCounters(Counters &Dst, const Counters &Src) {		static void increaseCounters(Counters &Dst, const Counters &Src) {

for (unsigned i = 0; i < 3; ++i)		for (unsigned i = 0; i < 3; ++i)
Dst.Array[i] = std::max(Dst.Array[i], Src.Array[i]);		Dst.Array[i] = std::max(Dst.Array[i], Src.Array[i]);
Show All 13 Lines	void SIInsertWaits::handleExistingWait(MachineBasicBlock::iterator I) {
unsigned Imm = I->getOperand(0).getImm();		unsigned Imm = I->getOperand(0).getImm();
Counters Counts, WaitOn;		Counters Counts, WaitOn;

Counts.Named.VM = decodeVmcnt(IV, Imm);		Counts.Named.VM = decodeVmcnt(IV, Imm);
Counts.Named.EXP = decodeExpcnt(IV, Imm);		Counts.Named.EXP = decodeExpcnt(IV, Imm);
Counts.Named.LGKM = decodeLgkmcnt(IV, Imm);		Counts.Named.LGKM = decodeLgkmcnt(IV, Imm);

for (unsigned i = 0; i < 3; ++i) {		for (unsigned i = 0; i < 3; ++i) {
if (Counts.Array[i] <= LastIssued.Array[i])		if (Counts.Array[i] <= LastIssued.Array[i])
		tony-tye: Need a delayed waitcnt if Counts is trying to wait on an instruction after the WaitedOn. So this should be: if (Counts.Array[i] < LastIssued.Array[i] - WaitedOn.Array[i])
WaitOn.Array[i] = LastIssued.Array[i] - Counts.Array[i];		WaitOn.Array[i] = LastIssued.Array[i] - Counts.Array[i];
else		else
WaitOn.Array[i] = 0;		WaitOn.Array[i] = 0;
}		}

increaseCounters(DelayedWaitOn, WaitOn);		increaseCounters(DelayedWaitOn, WaitOn);
}		}
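
To make the two conditions discussed for handleExistingWait() easier to compare, here is a minimal single-counter model. Both function names are hypothetical, and the second function is only one reading of the condition proposed in the inline comment; it assumes WaitedOn never exceeds LastIssued.

```cpp
// Model of the check in the diff: a count of N means "at most N operations
// may remain outstanding", so everything up to issue number
// LastIssued - N must have completed. Names are hypothetical.
constexpr unsigned existingWaitToIssueNumber(unsigned Count,
                                             unsigned LastIssued) {
  return Count <= LastIssued ? LastIssued - Count : 0;
}

// Variant using the condition from the inline comment: only record a
// delayed wait when the existing count is below the number of operations
// still outstanding (LastIssued - WaitedOn).
// Assumes WaitedOn <= LastIssued, so the subtraction cannot wrap.
constexpr unsigned existingWaitToIssueNumberSuggested(unsigned Count,
                                                      unsigned LastIssued,
                                                      unsigned WaitedOn) {
  return Count < LastIssued - WaitedOn ? LastIssued - Count : 0;
}
```

The two versions differ when everything issued so far has already been waited on: with LastIssued = 5, WaitedOn = 5 and an existing count of 0, the first records a delayed wait on issue number 5, while the second records none.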

Show All 11 Lines	for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
if (!Op.isReg() || !TRI->isInAllocatableClass(Op.getReg()))		if (!Op.isReg() || !TRI->isInAllocatableClass(Op.getReg()))
continue;		continue;

const TargetRegisterClass *RC = TII->getOpRegClass(MI, i);		const TargetRegisterClass *RC = TII->getOpRegClass(MI, i);
RegInterval Interval = getRegInterval(RC, Op);		RegInterval Interval = getRegInterval(RC, Op);
for (unsigned j = Interval.first; j < Interval.second; ++j) {		for (unsigned j = Interval.first; j < Interval.second; ++j) {

if (Op.isDef()) {		if (Op.isDef()) {
increaseCounters(Result, UsedRegs[j]);		increaseCounters(Result, UsedRegs[j]);
		tony-tye: Only the expcnt should be considered. When UsedRegs is set it includes all counters, and we do not need to wait for a GFX6 store to complete before being able to use the source register. That only has to be waited for before using the destination register.
increaseCounters(Result, DefinedRegs[j]);		increaseCounters(Result, DefinedRegs[j]);
}		}

if (Op.isUse())		if (Op.isUse())
increaseCounters(Result, DefinedRegs[j]);		increaseCounters(Result, DefinedRegs[j]);
}		}
}		}

return Result;		return Result;
}		}
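
One possible reading of the expcnt suggestion for the Op.isDef() case, sketched as standalone code. `CounterArray`, `ExpIndex`, and `requiredForDef` are hypothetical names, and the index order (VM, EXP, LGKM) is an assumption based on the argument order passed to encodeWaitcnt in the diff.

```cpp
#include <algorithm>
#include <array>

// Hypothetical stand-in for the pass's Counters union.
// Assumed index order: 0 = VM, 1 = EXP, 2 = LGKM.
using CounterArray = std::array<unsigned, 3>;
constexpr unsigned ExpIndex = 1;

// Element-wise maximum, as in increaseCounters() from the diff.
inline void increaseCounters(CounterArray &Dst, const CounterArray &Src) {
  for (unsigned i = 0; i < 3; ++i)
    Dst[i] = std::max(Dst[i], Src[i]);
}

// Sketch of the suggestion: when a register is redefined, pending uses of
// that register contribute only their export counter, while pending
// definitions still contribute all three counters.
inline CounterArray requiredForDef(const CounterArray &Used,
                                   const CounterArray &Defined) {
  CounterArray Result = {0, 0, 0};
  Result[ExpIndex] = Used[ExpIndex];
  increaseCounters(Result, Defined);
  return Result;
}
```

Compared with the code in the diff, the VM and LGKM entries of UsedRegs no longer force a wait when the register is overwritten.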

void SIInsertWaits::handleSendMsg(MachineBasicBlock &MBB,		void SIInsertWaits::handleSendMsg(MachineBasicBlock &MBB,
MachineBasicBlock::iterator I) {		MachineBasicBlock::iterator I) {
if (ST->getGeneration() < SISubtarget::VOLCANIC_ISLANDS)		if (ST->getGeneration() < SISubtarget::VOLCANIC_ISLANDS)
		tony-tye: Should this be a target feature?
return;		return;

// There must be "S_NOP 0" between an instruction writing M0 and S_SENDMSG.		// There must be "S_NOP 0" between an instruction writing M0 and S_SENDMSG.
if (LastInstWritesM0 && I->getOpcode() == AMDGPU::S_SENDMSG) {		if (LastInstWritesM0 && I->getOpcode() == AMDGPU::S_SENDMSG) {
BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_NOP)).addImm(0);		BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_NOP)).addImm(0);
LastInstWritesM0 = false;		LastInstWritesM0 = false;
return;		return;
}		}
Show All 25 Lines	bool SIInsertWaits::runOnMachineFunction(MachineFunction &MF) {
HardwareLimits.Named.EXP = getExpcntBitMask(IV);		HardwareLimits.Named.EXP = getExpcntBitMask(IV);
HardwareLimits.Named.LGKM = getLgkmcntBitMask(IV);		HardwareLimits.Named.LGKM = getLgkmcntBitMask(IV);

WaitedOn = ZeroCounts;		WaitedOn = ZeroCounts;
DelayedWaitOn = ZeroCounts;		DelayedWaitOn = ZeroCounts;
LastIssued = ZeroCounts;		LastIssued = ZeroCounts;
LastOpcodeType = OTHER;		LastOpcodeType = OTHER;
LastInstWritesM0 = false;		LastInstWritesM0 = false;
		IsFlatOutstanding = false;
ReturnsVoid = MF.getInfo<SIMachineFunctionInfo>()->returnsVoid();		ReturnsVoid = MF.getInfo<SIMachineFunctionInfo>()->returnsVoid();

memset(&UsedRegs, 0, sizeof(UsedRegs));		memset(&UsedRegs, 0, sizeof(UsedRegs));
memset(&DefinedRegs, 0, sizeof(DefinedRegs));		memset(&DefinedRegs, 0, sizeof(DefinedRegs));

SmallVector<MachineInstr *, 4> RemoveMI;		SmallVector<MachineInstr *, 4> RemoveMI;

for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();		for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();
BI != BE; ++BI) {		BI != BE; ++BI) {

MachineBasicBlock &MBB = *BI;		MachineBasicBlock &MBB = *BI;
for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();		for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
I != E; ++I) {		I != E; ++I) {

if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS) {		if (ST->getGeneration() <= SISubtarget::SEA_ISLANDS) {
		tony-tye: It seems better if this were a target feature that was tested.
// There is a hardware bug on CI/SI where SMRD instruction may corrupt		// There is a hardware bug on CI/SI where SMRD instruction may corrupt
// vccz bit, so when we detect that an instruction may read from a		// vccz bit, so when we detect that an instruction may read from a
// corrupt vccz bit, we need to:		// corrupt vccz bit, we need to:
// 1. Insert s_waitcnt lgkm(0) to wait for all outstanding SMRD operations to		// 1. Insert s_waitcnt lgkm(0) to wait for all outstanding SMRD operations to
// complete.		// complete.
// 2. Restore the correct value of vccz by writing the current value		// 2. Restore the correct value of vccz by writing the current value
// of vcc back to vcc.		// of vcc back to vcc.


test/CodeGen/MIR/AMDGPU/waitcnt.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=fiji -run-pass si-insert-waits %s -o - | FileCheck %s

				--- |
				define void @flat_zero_waitcnt() { ret void }
				...
				---

				# CHECK-LABEL: name: flat_zero_waitcnt

				# CHECK-LABEL: bb.0:
				# CHECK: FLAT_LOAD_DWORD
				# CHECK: FLAT_LOAD_DWORDX4
				# s_waitcnt vmcnt(0) lgkmcnt(0)
				# CHECK: S_WAITCNT 112

				name: flat_zero_waitcnt

				body: |
				bb.0:
				%vgpr0 = FLAT_LOAD_DWORD %vgpr1_vgpr2, 0, 0, 0, implicit %exec, implicit %flat_scr
				%vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 %vgpr7_vgpr8, 0, 0, 0, implicit %exec, implicit %flat_scr
				%vgpr0 = V_MOV_B32_e32 %vgpr1, implicit %exec
				S_ENDPGM
				...
				arsenm: This looks like it has a mem operand, although the comment on the check line says it doesn't.
				tstellarAMD (Author): The first load is the one without the mem operand; I can clarify this in the comment.
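
The immediate 112 in the CHECK line can be reproduced by hand. The sketch below assumes the S_WAITCNT layout used by this generation of hardware (vmcnt in bits 3:0, expcnt in bits 6:4, lgkmcnt in bits 11:8). `encodeWaitcntSketch` is a hypothetical helper written for this illustration; it is not LLVM's encodeWaitcnt.

```cpp
// Packs the three counters into an S_WAITCNT immediate.
// Assumed field layout: vmcnt [3:0], expcnt [6:4], lgkmcnt [11:8].
constexpr unsigned encodeWaitcntSketch(unsigned Vmcnt, unsigned Expcnt,
                                       unsigned Lgkmcnt) {
  return (Vmcnt & 0xF) | ((Expcnt & 0x7) << 4) | ((Lgkmcnt & 0xF) << 8);
}

// vmcnt(0) lgkmcnt(0), with expcnt left at its maximum of 7 (no wait).
static_assert(encodeWaitcntSketch(0, 7, 0) == 112,
              "matches the S_WAITCNT 112 expected by the test");
```

Only the expcnt field is non-zero here (7 << 4 = 112), which is consistent with the test waiting for all outstanding VM and LGKM operations after the two flat loads.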