This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Produce waitcounts for LDS DMA
Closed, Public

Authored by rampitec on Apr 28 2022, 11:14 AM.

Details

Reviewers: arsenm, foad

Commits: rG51e02409f022: [AMDGPU] Produce waitcounts for LDS DMA

Summary

MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
can be accessed. A load from LDS to VMEM does not need a wait.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Apr 28 2022, 11:14 AM

Herald added a project: Restricted Project. Apr 28 2022, 11:14 AM

Herald added subscribers: hsmhsm, kerbowa, hiraditya and 7 others.

rampitec requested review of this revision. Apr 28 2022, 11:14 AM

Herald added a project: Restricted Project. Apr 28 2022, 11:14 AM

Herald added a subscriber: wdng.

arsenm added inline comments. Apr 28 2022, 11:37 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
502–503: isVALU is redundant with isMUBUF || isFLAT?

rampitec added inline comments. Apr 28 2022, 11:43 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
502–503: MUBUF and FLAT instructions have VALU = 0 unless they are LDS DMA operations, which are both memory and VALU instructions. They are the only instructions of that kind.
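
The reply above can be illustrated with a small standalone model of the predicate. This is only a sketch: the flag constants and the ToyInst struct are hypothetical stand-ins for the instruction flags that SIInstrInfo::isVALU, isMUBUF and isFLAT query, and IsLDSToMemory stands in for the BUFFER_STORE_LDS_DWORD opcode check in the patch.

```cpp
#include <cstdint>

// Hypothetical flag bits; the names and values are illustrative, not LLVM's.
constexpr std::uint32_t kVALU = 1u << 0;
constexpr std::uint32_t kMUBUF = 1u << 1;
constexpr std::uint32_t kFLAT = 1u << 2;

// Toy instruction description (an assumption made for this sketch).
struct ToyInst {
  std::uint32_t Flags;
  bool IsLDSToMemory; // models BUFFER_STORE_LDS_DWORD: reads LDS, writes VMEM
};

// Same shape as mayWriteLDSThroughDMA in the patch: an instruction that is
// both VALU and MUBUF/FLAT is an LDS DMA operation, and it writes LDS unless
// it is the LDS-to-memory store.
constexpr bool mayWriteLDSThroughDMA(ToyInst I) {
  return (I.Flags & kVALU) != 0 && (I.Flags & (kMUBUF | kFLAT)) != 0 &&
         !I.IsLDSToMemory;
}
```

In this model an ordinary buffer load (kMUBUF only) is rejected because the VALU bit is clear, which is why the isVALU check is not redundant.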

arsenm added inline comments. Apr 28 2022, 11:45 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
655: Swap these checks, Inst.mayStore() && ..

Swapped checks.

Harbormaster completed remote builds in B161857: Diff 425869. Apr 28 2022, 2:24 PM

I also feel like EXTRA_VGPR_LDS was made specifically for this scenario, where we have no VGPR result to track the dependency on. It was unused, dead code. Unfortunately we have practically no tests for this pass, and this place is certainly not covered. If I remove the isDS() call nothing fails, but probably nothing should, since it was dead code. I cannot prove that, though, as I have no idea what was originally behind it. But it fits so well that I guess it was DMA.

rampitec marked an inline comment as done. Apr 28 2022, 5:02 PM

foad added inline comments. Apr 29 2022, 2:11 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1100: Was this a pre-existing bug, even if you don't use DMA instructions? E.g. for DS_STORE followed by FLAT_STORE were we missing a wait on expcnt to guarantee WAW order?

rampitec added inline comments. Apr 29 2022, 2:57 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1100: I thought so too, but this scenario did not get a wait before the change and does not get one after it, because there is no pending VM event. I just believe that, if anything, it should be vmcnt, not expcnt, although I do not see that documented. For DMA the reason for vmcnt is that we must wait for VMEM to be read, although a wait on lgkmcnt is not needed (this does not seem to be properly documented either). For a pure ds_write it is likely not needed because we don't touch VMEM.
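
The "no pending VM event" point can be sketched with a toy version of the score bookkeeping. This is a simplified model under stated assumptions, not the pass itself: ToyVmBracket, issueVmem and neededVmcnt are hypothetical names, and the real pass tracks several counters and clamps the result to the hardware maximum.

```cpp
#include <cstdint>

// Toy model of the vmcnt bracket plus the artificial LDS slot.
struct ToyVmBracket {
  std::uint32_t LB = 0;       // scores at or below this have completed
  std::uint32_t UB = 0;       // score of the most recently issued VMEM op
  std::uint32_t LdsScore = 0; // score stored in the EXTRA_VGPR_LDS-like slot
};

// Issue one VMEM operation; an LDS DMA load also stamps the LDS slot.
constexpr ToyVmBracket issueVmem(ToyVmBracket B, bool WritesLDS) {
  B.UB += 1;
  if (WritesLDS)
    B.LdsScore = B.UB;
  return B;
}

// vmcnt value needed before an LDS access, or -1 if no VM event that wrote
// LDS is still pending. Later VMEM ops may stay in flight, hence UB - score.
constexpr int neededVmcnt(ToyVmBracket B) {
  return B.LdsScore > B.LB ? static_cast<int>(B.UB - B.LdsScore) : -1;
}
```

With one DMA load issued the model asks for vmcnt(0); with an ordinary VMEM load issued after it, vmcnt(1); and with no DMA write pending, no wait at all. These correspond to the first, second and last cases in lds-dma-waitcnt.mir.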

A side question: is it possible to extend the MIR syntax so that waitcounts are written symbolically instead of as a raw number?

Quite literally, I am using -start-before instead of -run-pass to see what is really produced while testing.

In D124626#3482176, @rampitec wrote:

A side question: is it possible to extend the MIR syntax so that waitcounts are written symbolically instead of as a raw number?

You would have to add something new to support it. The closest existing thing is probably the comments inlineasm uses for the register class encoded by magic constants.
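
Until such syntax exists, the raw S_WAITCNT immediates in the test can be decoded by hand. The sketch below assumes the GFX9 field layout (vmcnt[3:0] in bits 3:0, expcnt in bits 6:4, lgkmcnt in bits 11:8, vmcnt[5:4] in bits 15:14); the Waitcnt struct and decodeWaitcntGfx9 are hypothetical names for illustration, not an LLVM API.

```cpp
#include <cstdint>

// Decoded counter values (hypothetical helper type for this sketch).
struct Waitcnt {
  unsigned Vmcnt;
  unsigned Expcnt;
  unsigned Lgkmcnt;
};

// Split an S_WAITCNT immediate into its fields, assuming the GFX9 layout
// described above. A field at its maximum value means "do not wait".
constexpr Waitcnt decodeWaitcntGfx9(std::uint32_t Imm) {
  return Waitcnt{(Imm & 0xFu) | (((Imm >> 14) & 0x3u) << 4), // vmcnt lo | hi
                 (Imm >> 4) & 0x7u,                          // expcnt
                 (Imm >> 8) & 0xFu};                         // lgkmcnt
}
```

Under that assumption 3952 (0xF70) decodes to vmcnt(0) with expcnt and lgkmcnt left at their maxima of 7 and 15, and 3953 (0xF71) decodes to vmcnt(1), matching the comments in the test file.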

arsenm accepted this revision. Apr 29 2022, 10:33 AM

This revision is now accepted and ready to land. Apr 29 2022, 10:33 AM

This revision was landed with ongoing or failed builds. Apr 29 2022, 11:14 AM

Closed by commit rG51e02409f022: [AMDGPU] Produce waitcounts for LDS DMA (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rG51e02409f022: [AMDGPU] Produce waitcounts for LDS DMA.

Revision Contents

Path: llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Size: 14 lines

Path: llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir
Size: 98 lines

Diff 426130

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

[...]
 // NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
 // We reserve a fixed number of VGPR slots in the scoring tables for
 // special tokens like SCMEM_LDS (needed for buffer load to LDS).
 enum RegisterMapping {
   SQ_MAX_PGM_VGPRS = 512, // Maximum programmable VGPRs across all targets.
   AGPR_OFFSET = 256, // Maximum programmable ArchVGPRs across all targets.
   SQ_MAX_PGM_SGPRS = 256, // Maximum programmable SGPRs across all targets.
   NUM_EXTRA_VGPRS = 1, // A reserved slot for DS.
-  EXTRA_VGPR_LDS = 0, // This is a placeholder the Shader algorithm uses.
+  EXTRA_VGPR_LDS = 0, // An artificial register to track LDS writes.
   NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_EXTRA_VGPRS, // Where SGPR starts.
 };

 // Enumerate different types of result-returning VMEM operations. Although
 // s_waitcnt orders them all with a single vmcnt counter, in the absence of
 // s_waitcnt only instructions of the same VmemType are guaranteed to write
 // their results in order -- so there is no need to insert an s_waitcnt between
 // two instructions of the same type that write the same vgpr.
[...] void WaitcntBrackets::setExpScore(const MachineInstr *MI,
                                   unsigned Val) {
   RegInterval Interval = getRegInterval(MI, TII, MRI, TRI, OpNo);
   assert(TRI->isVectorRegister(*MRI, MI->getOperand(OpNo).getReg()));
   for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
     setRegScore(RegNo, EXP_CNT, Val);
   }
 }

+// MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
+// can be accessed. A load from LDS to VMEM does not need a wait.
+static bool mayWriteLDSThroughDMA(const MachineInstr &MI) {
+  return SIInstrInfo::isVALU(MI) &&
+         (SIInstrInfo::isMUBUF(MI) || SIInstrInfo::isFLAT(MI)) &&
+         MI.getOpcode() != AMDGPU::BUFFER_STORE_LDS_DWORD;
+}

 void WaitcntBrackets::updateByEvent(const SIInstrInfo *TII,
                                     const SIRegisterInfo *TRI,
                                     const MachineRegisterInfo *MRI,
                                     WaitEventType E, MachineInstr &Inst) {
   InstCounterType T = eventCounter(E);
   unsigned CurrScore = getScoreUB(T) + 1;
   if (CurrScore == 0)
     report_fatal_error("InsertWaitcnt score wraparound");
[...] for (unsigned I = 0, E = Inst.getNumOperands(); I != E; ++I) {
           for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo)
             VgprVmemTypes[RegNo] |= 1 << V;
         }
       }
       for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
         setRegScore(RegNo, T, CurrScore);
       }
     }
-    if (TII->isDS(Inst) && Inst.mayStore()) {
+    if (Inst.mayStore() && (TII->isDS(Inst) || mayWriteLDSThroughDMA(Inst))) {
       setRegScore(SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS, T, CurrScore);
     }
   }
 }

 void WaitcntBrackets::print(raw_ostream &OS) {
   OS << '\n';
   for (auto T : inst_counter_types()) {
[...] if (MI.isCall() && callWaitsOnFunctionEntry(MI)) {
       for (const MachineMemOperand *Memop : MI.memoperands()) {
         const Value *Ptr = Memop->getValue();
         if (Memop->isStore() && SLoadAddresses.count(Ptr)) {
           addWait(Wait, LGKM_CNT, 0);
           if (PDT->dominates(MI.getParent(), SLoadAddresses.find(Ptr)->second))
             SLoadAddresses.erase(Ptr);
         }
         unsigned AS = Memop->getAddrSpace();
-        if (AS != AMDGPUAS::LOCAL_ADDRESS)
+        if (AS != AMDGPUAS::LOCAL_ADDRESS && AS != AMDGPUAS::FLAT_ADDRESS)
           continue;
         unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
         // VM_CNT is only relevant to vgpr or LDS.
         ScoreBrackets.determineWait(
             VM_CNT, ScoreBrackets.getRegScore(RegNo, VM_CNT), Wait);
         if (Memop->isStore()) {
           ScoreBrackets.determineWait(
               EXP_CNT, ScoreBrackets.getRegScore(RegNo, EXP_CNT), Wait);
[...]

llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir

This file was added.

# RUN: llc -march=amdgcn -mcpu=gfx940 -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck -check-prefix=GCN %s

# GCN-LABEL: name: buffer_load_dword_lds_ds_read
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: buffer_load_dword_lds_ds_read
body: |
  bb.0:
    $m0 = S_MOV_B32 0
    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(1)* undef` + 4), (store (s32) into `i32 addrspace(3)* undef` + 4)
    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `i32 addrspace(3)* undef`)
    S_ENDPGM 0

...

# GCN-LABEL: name: buffer_load_dword_lds_vmcnt_1
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: buffer_load_dword_lds_vmcnt_1
body: |
  bb.0:
    $m0 = S_MOV_B32 0
    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(1)* undef`), (store (s32) into `i32 addrspace(3)* undef`)
    $vgpr10 = BUFFER_LOAD_DWORD_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(1)* undef`)
    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `i32 addrspace(3)* undef`)
    S_ENDPGM 0

...

# GCN-LABEL: name: buffer_load_dword_lds_flat_read
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: FLAT_LOAD_DWORD
---
name: buffer_load_dword_lds_flat_read
body: |
  bb.0:
    $m0 = S_MOV_B32 0
    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(1)* undef`), (store (s32) into `i32 addrspace(3)* undef`)
    $vgpr0 = FLAT_LOAD_DWORD $vgpr0_vgpr1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32) from `i32* undef`)

    S_ENDPGM 0

...

# GCN-LABEL: name: global_load_lds_dword_ds_read
# GCN: GLOBAL_LOAD_LDS_DWORD
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: global_load_lds_dword_ds_read
body: |
  bb.0:
    $m0 = S_MOV_B32 0
    GLOBAL_LOAD_LDS_DWORD $vgpr0_vgpr1, 4, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(1)* undef` + 4), (store (s32) into `i32 addrspace(3)* undef` + 4)
    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `i32 addrspace(3)* undef`)
    S_ENDPGM 0

...

# GCN-LABEL: name: scratch_load_lds_dword_ds_read
# GCN: SCRATCH_LOAD_LDS_DWORD
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: scratch_load_lds_dword_ds_read
body: |
  bb.0:
    $m0 = S_MOV_B32 0
    SCRATCH_LOAD_LDS_DWORD $vgpr0, 4, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(5)* undef` + 4), (store (s32) into `i32 addrspace(3)* undef` + 4)
    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `i32 addrspace(3)* undef`)
    S_ENDPGM 0

...

# GCN-LABEL: name: buffer_store_lds_dword_ds_read
# GCN: BUFFER_STORE_LDS_DWORD
# GCN-NEXT: DS_READ_B32_gfx9
---
name: buffer_store_lds_dword_ds_read
body: |
  bb.0:
    $m0 = S_MOV_B32 0
    BUFFER_STORE_LDS_DWORD $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `i32 addrspace(3)* undef` + 4), (store (s32) into `i32 addrspace(1)* undef` + 4)
    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `i32 addrspace(3)* undef`)
    S_ENDPGM 0

...