This is an archive of the discontinued LLVM Phabricator instance.

Differential D19203

AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic
ClosedPublic

Authored by nhaehnle on Apr 17 2016, 3:20 PM.


Details

Reviewers

tstellarAMD
arsenm
mareko

Commits

rGf66bdb5ea865: AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic
rL267729: AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic

Summary

So it appears that to guarantee some of the ordering requirements of a GLSL
memoryBarrier() executed in the shader, we need to emit an s_waitcnt.

(We can't use an s_barrier, because memoryBarrier() may appear anywhere in
the shader, in particular it may appear in non-uniform control flow.)

Diff Detail

Repository: rL LLVM

Event Timeline

nhaehnle updated this revision to Diff 54014. Apr 17 2016, 3:20 PM

nhaehnle retitled this revision to AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic.

nhaehnle updated this object.

nhaehnle added reviewers: arsenm, mareko, tstellarAMD.

nhaehnle added a subscriber: llvm-commits.

Herald added a subscriber: arsenm. Apr 17 2016, 3:20 PM

What are the memory barrier requirements for GLSL? Do you need consistency between multiple workgroups?

tstellarAMD added inline comments. Apr 18 2016, 7:35 AM

include/llvm/IR/IntrinsicsAMDGPU.td
Line 71 (on Diff #54014): I think the intrinsic should be int_amdgpu_s_waitcnt, and we should expose the configuration bits with an input argument.

arsenm added inline comments. Apr 18 2016, 8:10 AM

include/llvm/IR/IntrinsicsAMDGPU.td
Line 71 (on Diff #54014): I'm not sure waitcnt should be directly exposed; it should probably be a memfence intrinsic. However, I'm not clear whether what is really wanted here is the cache flush intrinsics, as is necessary for OpenCL 2.0. This also doesn't need to be convergent.

Yes, we need consistency between all shader invocations, which can span all the CUs and SEs on the chip. There isn't really a notion of workgroups for GLSL graphics shaders. Basically, the instruction needs to make sure that all past memory writes by the shader (actually, only 'coherent' and 'volatile' ones) are visible to all other shaders. I'm not sure about what OpenCL needs.

With this patch, the idea is to implement this by setting glc=1 on the coherent/volatile writes and using a wait. I believe (but have not tried) that an alternative would be to always use glc=0 and wait + explicitly request an L1 cache flush at the memory barrier.

Tom, do you want the numeric counts as input, or just bits that indicate whether to wait for vm/exp/lgkm?

In D19203#404014, @nhaehnle wrote:

Tom, do you want the numeric counts as input, or just bits that indicate whether to wait for vm/exp/lgkm?

Just the bits. So, the input to the intrinsic and the instruction are the same.
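To make the "just the bits" answer concrete, here is a minimal C++ sketch of packing and unpacking the s_waitcnt immediate. The field layout (vmcnt in bits 0-3, expcnt in bits 4-6, lgkmcnt in bits 8-11) is taken from handleExistingWait in this revision; other hardware generations may use a different layout. The names WaitCounts, encodeWaitcnt and decodeWaitcnt are hypothetical and are not part of LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the three hardware counters named in the review.
struct WaitCounts {
  unsigned VM;   // outstanding vector-memory operations allowed
  unsigned EXP;  // outstanding export operations allowed
  unsigned LGKM; // outstanding LDS/GDS/constant/message operations allowed
};

// Pack the counters into the immediate, assuming the bit layout that
// handleExistingWait decodes in this revision.
inline uint32_t encodeWaitcnt(const WaitCounts &C) {
  assert(C.VM <= 0xF && C.EXP <= 0x7 && C.LGKM <= 0xF && "field out of range");
  return C.VM | (C.EXP << 4) | (C.LGKM << 8);
}

// Inverse of encodeWaitcnt, mirroring the masks used in handleExistingWait.
inline WaitCounts decodeWaitcnt(uint32_t Imm) {
  return WaitCounts{Imm & 0xF, (Imm >> 4) & 0x7, (Imm >> 8) & 0xF};
}
```

Under this layout, encodeWaitcnt({0, 0, 15}) yields 0xF00 (3840), the value passed to the intrinsic in the test file: wait until no vector-memory or export operations are outstanding, and leave lgkmcnt at its maximum so that it imposes no wait.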

Changed the intrinsic to take a single argument.

Also changed the SIInsertWaits logic so that the intrinsic will be delayed
up to the next counter-incrementing instruction or the next "natural" wait
(and then merged with it).
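The delaying behaviour described above can be sketched outside of LLVM roughly as follows. This is a simplified model, not the pass itself: WaitTracker and its method names are hypothetical, the three counters are held in a plain array in the order VM, EXP, LGKM, and the WaitedOn bookkeeping and the actual insertion of s_waitcnt are left out.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Counter triple in the order VM, EXP, LGKM (an assumption of this sketch).
using Counters = std::array<unsigned, 3>;

static bool anyNonZero(const Counters &C) {
  return std::any_of(C.begin(), C.end(), [](unsigned V) { return V != 0; });
}

// Hypothetical stand-in for the state kept by SIInsertWaits.
struct WaitTracker {
  Counters LastIssued{};    // running count of issued operations per counter
  Counters DelayedWaitOn{}; // issue indices an explicit wait still has to cover

  // An explicit wait says "allow at most Allowed[i] operations outstanding".
  // Convert that into an absolute issue index and remember it for later.
  void recordExplicitWait(const Counters &Allowed) {
    for (std::size_t i = 0; i < 3; ++i) {
      unsigned WaitOn =
          Allowed[i] <= LastIssued[i] ? LastIssued[i] - Allowed[i] : 0;
      DelayedWaitOn[i] = std::max(DelayedWaitOn[i], WaitOn);
    }
  }

  // Requirement to enforce before the next instruction. The delayed wait is
  // folded in only when that instruction already needs a wait ("natural"
  // wait) or will increment one of the counters.
  Counters requiredBefore(Counters Required, const Counters &Increment) const {
    if (anyNonZero(Required) || anyNonZero(Increment))
      for (std::size_t i = 0; i < 3; ++i)
        Required[i] = std::max(Required[i], DelayedWaitOn[i]);
    return Required;
  }

  // Account for the counter increments of an instruction that was issued.
  void issue(const Counters &Increment) {
    for (std::size_t i = 0; i < 3; ++i)
      LastIssued[i] += Increment[i];
  }
};
```

In this model, an explicit wait recorded after one vector-memory operation has no effect on a following instruction that neither needs a wait nor increments a counter, but it is merged into the requirement of the next instruction that does either.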

LGTM.

This revision is now accepted and ready to land. Apr 25 2016, 2:03 PM

Closed by commit rL267729: AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic (authored by nha). Apr 27 2016, 8:51 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path (Size)

llvm/trunk/include/llvm/IR/IntrinsicsAMDGPU.td (2 lines)
llvm/trunk/lib/Target/AMDGPU/SIInsertWaits.cpp (83 lines)
llvm/trunk/lib/Target/AMDGPU/SIInstructions.td (9 lines)
llvm/trunk/test/CodeGen/AMDGPU/llvm.amdgcn.s.waitcnt.ll (38 lines)

Diff 55243

llvm/trunk/include/llvm/IR/IntrinsicsAMDGPU.td

[62 unchanged lines not shown]
 defm int_amdgcn_workitem_id : AMDGPUReadPreloadRegisterIntrinsic_xyz <
   "__builtin_amdgcn_workitem_id">;
 defm int_amdgcn_workgroup_id : AMDGPUReadPreloadRegisterIntrinsic_xyz <
   "__builtin_amdgcn_workgroup_id">;

 def int_amdgcn_s_barrier : GCCBuiltin<"__builtin_amdgcn_s_barrier">,
   Intrinsic<[], [], [IntrConvergent]>;

+def int_amdgcn_s_waitcnt : Intrinsic<[], [llvm_i32_ty], []>;
+
 def int_amdgcn_div_scale : Intrinsic<
   // 1st parameter: Numerator
   // 2nd parameter: Denominator
   // 3rd parameter: Constant to select select between first and
   // second. (0 = first, 1 = second).
   [llvm_anyfloat_ty, llvm_i1_ty],
   [LLVMMatchType<0>, LLVMMatchType<0>, llvm_i1_ty],
   [IntrNoMem]
[314 unchanged lines not shown]

llvm/trunk/lib/Target/AMDGPU/SIInsertWaits.cpp

[62 unchanged lines not shown; context: private:]
   static const Counters WaitCounts;

   /// \brief Constant zero value
   static const Counters ZeroCounts;

   /// \brief Counter values we have already waited on.
   Counters WaitedOn;

+  /// \brief Counter values that we must wait on before the next counter
+  /// increase.
+  Counters DelayedWaitOn;
+
   /// \brief Counter values for last instruction issued.
   Counters LastIssued;

   /// \brief Registers used by async instructions.
   RegCounters UsedRegs;

   /// \brief Registers defined by async instructions.
   RegCounters DefinedRegs;
[19 unchanged lines not shown; context: private:]
   bool isOpRelevant(MachineOperand &Op);

   /// \brief Get register interval an operand affects.
   RegInterval getRegInterval(const TargetRegisterClass *RC,
                              const MachineOperand &Reg) const;

   /// \brief Handle instructions async components
   void pushInstruction(MachineBasicBlock &MBB,
-                       MachineBasicBlock::iterator I);
+                       MachineBasicBlock::iterator I,
+                       const Counters& Increment);

   /// \brief Insert the actual wait instruction
   bool insertWait(MachineBasicBlock &MBB,
                   MachineBasicBlock::iterator I,
                   const Counters &Counts);

+  /// \brief Handle existing wait instructions (from intrinsics)
+  void handleExistingWait(MachineBasicBlock::iterator I);
+
   /// \brief Do we need def2def checks?
   bool unorderedDefines(MachineInstr &MI);

   /// \brief Resolve all operand dependencies to counter requirements
   Counters handleOperands(MachineInstr &MI);

   /// \brief Insert S_NOP between an instruction writing M0 and S_SENDMSG.
   void handleSendMsg(MachineBasicBlock &MBB, MachineBasicBlock::iterator I);
[161 unchanged lines not shown; context: RegInterval SIInsertWaits::getRegInterval(const TargetRegisterClass *RC,]
   RegInterval Result;
   Result.first = TRI->getEncodingValue(Reg.getReg());
   Result.second = Result.first + Size / 4;

   return Result;
 }

 void SIInsertWaits::pushInstruction(MachineBasicBlock &MBB,
-                                    MachineBasicBlock::iterator I) {
+                                    MachineBasicBlock::iterator I,
+                                    const Counters &Increment) {

   // Get the hardware counter increments and sum them up
-  Counters Increment = getHwCounts(*I);
   Counters Limit = ZeroCounts;
   unsigned Sum = 0;

   for (unsigned i = 0; i < 3; ++i) {
     LastIssued.Array[i] += Increment.Array[i];
     if (Increment.Array[i])
       Limit.Array[i] = LastIssued.Array[i];
     Sum += Increment.Array[i];
[123 unchanged lines not shown]

 /// \brief helper function for handleOperands
 static void increaseCounters(Counters &Dst, const Counters &Src) {

   for (unsigned i = 0; i < 3; ++i)
     Dst.Array[i] = std::max(Dst.Array[i], Src.Array[i]);
 }

+/// \brief check whether any of the counters is non-zero
+static bool countersNonZero(const Counters &Counter) {
+  for (unsigned i = 0; i < 3; ++i)
+    if (Counter.Array[i])
+      return true;
+  return false;
+}
+
+void SIInsertWaits::handleExistingWait(MachineBasicBlock::iterator I) {
+  assert(I->getOpcode() == AMDGPU::S_WAITCNT);
+
+  unsigned Imm = I->getOperand(0).getImm();
+  Counters Counts, WaitOn;
+
+  Counts.Named.VM = Imm & 0xF;
+  Counts.Named.EXP = (Imm >> 4) & 0x7;
+  Counts.Named.LGKM = (Imm >> 8) & 0xF;
+
+  for (unsigned i = 0; i < 3; ++i) {
+    if (Counts.Array[i] <= LastIssued.Array[i])
+      WaitOn.Array[i] = LastIssued.Array[i] - Counts.Array[i];
+    else
+      WaitOn.Array[i] = 0;
+  }
+
+  increaseCounters(DelayedWaitOn, WaitOn);
+}
+
 Counters SIInsertWaits::handleOperands(MachineInstr &MI) {

   Counters Result = ZeroCounts;

-  // S_SENDMSG implicitly waits for all outstanding LGKM transfers to finish,
-  // but we also want to wait for any other outstanding transfers before
-  // signalling other hardware blocks
-  if (MI.getOpcode() == AMDGPU::S_SENDMSG)
-    return LastIssued;
-
   // For each register affected by this instruction increase the result
   // sequence.
   //
   // TODO: We could probably just look at explicit operands if we removed VCC /
   // EXEC from SMRD dest reg classes.
   for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
     MachineOperand &Op = MI.getOperand(i);
     if (!Op.isReg() || !TRI->isInAllocatableClass(Op.getReg()))
[88 unchanged lines not shown; context: bool SIInsertWaits::runOnMachineFunction(MachineFunction &MF) {]
   TII = static_cast<const SIInstrInfo *>(MF.getSubtarget().getInstrInfo());
   TRI =
       static_cast<const SIRegisterInfo *>(MF.getSubtarget().getRegisterInfo());

   const AMDGPUSubtarget &ST = MF.getSubtarget<AMDGPUSubtarget>();
   MRI = &MF.getRegInfo();

   WaitedOn = ZeroCounts;
+  DelayedWaitOn = ZeroCounts;
   LastIssued = ZeroCounts;
   LastOpcodeType = OTHER;
   LastInstWritesM0 = false;
   ReturnsVoid = MF.getInfo<SIMachineFunctionInfo>()->returnsVoid();

   memset(&UsedRegs, 0, sizeof(UsedRegs));
   memset(&DefinedRegs, 0, sizeof(DefinedRegs));

+  SmallVector<MachineInstr *, 4> RemoveMI;
+
   for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();
        BI != BE; ++BI) {

     MachineBasicBlock &MBB = *BI;
     for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
          I != E; ++I) {

       if (ST.getGeneration() <= AMDGPUSubtarget::SEA_ISLANDS) {
[39 unchanged lines not shown; context: for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();]
       }

       // Insert required wait states for SMRD reading an SGPR written by a VALU
       // instruction.
       if (ST.getGeneration() <= AMDGPUSubtarget::SOUTHERN_ISLANDS &&
           I->getOpcode() == AMDGPU::V_READFIRSTLANE_B32)
         TII->insertWaitStates(MBB, std::next(I), 4);

+      // Record pre-existing, explicitly requested waits
+      if (I->getOpcode() == AMDGPU::S_WAITCNT) {
+        handleExistingWait(*I);
+        RemoveMI.push_back(I);
+        continue;
+      }
+
+      Counters Required;
+
       // Wait for everything before a barrier.
-      if (I->getOpcode() == AMDGPU::S_BARRIER)
-        Changes |= insertWait(MBB, I, LastIssued);
+      //
+      // S_SENDMSG implicitly waits for all outstanding LGKM transfers to finish,
+      // but we also want to wait for any other outstanding transfers before
+      // signalling other hardware blocks
+      if (I->getOpcode() == AMDGPU::S_BARRIER ||
+          I->getOpcode() == AMDGPU::S_SENDMSG)
+        Required = LastIssued;
       else
-        Changes |= insertWait(MBB, I, handleOperands(*I));
+        Required = handleOperands(*I);

-      pushInstruction(MBB, I);
+      Counters Increment = getHwCounts(*I);
+
+      if (countersNonZero(Required) || countersNonZero(Increment))
+        increaseCounters(Required, DelayedWaitOn);
+
+      Changes |= insertWait(MBB, I, Required);
+
+      pushInstruction(MBB, I, Increment);
       handleSendMsg(MBB, I);
     }

     // Wait for everything at the end of the MBB
     Changes |= insertWait(MBB, MBB.getFirstTerminator(), LastIssued);
   }

+  for (MachineInstr *I : RemoveMI)
+    I->eraseFromParent();
+
   return Changes;
 }

llvm/trunk/lib/Target/AMDGPU/SIInstructions.td

[37 unchanged lines not shown]
 def has32BankLDS : Predicate<"Subtarget->getLDSBankCount() == 32">;

 def SWaitMatchClass : AsmOperandClass {
   let Name = "SWaitCnt";
   let RenderMethod = "addImmOperands";
   let ParserMethod = "parseSWaitCntOps";
 }

-def WAIT_FLAG : InstFlag<"printWaitFlag"> {
+def WAIT_FLAG : Operand <i32> {
   let ParserMatchClass = SWaitMatchClass;
+  let PrintMethod = "printWaitFlag";
 }

 let SubtargetPredicate = isGCN in {

 //===----------------------------------------------------------------------===//
 // EXP Instructions
 //===----------------------------------------------------------------------===//

[445 unchanged lines not shown]
 > {
   let SchedRW = [WriteBarrier];
   let simm16 = 0;
   let mayLoad = 1;
   let mayStore = 1;
   let isConvergent = 1;
 }

+let mayLoad = 1, mayStore = 1, hasSideEffects = 1 in
 def S_WAITCNT : SOPP <0x0000000c, (ins WAIT_FLAG:$simm16), "s_waitcnt $simm16">;
 def S_SETHALT : SOPP <0x0000000d, (ins i16imm:$simm16), "s_sethalt $simm16">;

 // On SI the documentation says sleep for approximately 64 * low 2
 // bits, consistent with the reported maximum of 448. On VI the
 // maximum reported is 960 cycles, so 960 / 64 = 15 max, so is the
 // maximum really 15 on VI?
 def S_SLEEP : SOPP <0x0000000e, (ins i32imm:$simm16),
[1,930 unchanged lines not shown; context: def : Pat <]
   (i32 (addc i32:$src0, i32:$src1)),
   (S_ADD_U32 $src0, $src1)
 >;

 //===----------------------------------------------------------------------===//
 // SOPP Patterns
 //===----------------------------------------------------------------------===//

+def : Pat <
+  (int_amdgcn_s_waitcnt i32:$simm16),
+  (S_WAITCNT (as_i16imm $simm16))
+>;
+
 // FIXME: These should be removed eventually
 def : Pat <
   (int_AMDGPU_barrier_global),
   (S_BARRIER)
 >;

 def : Pat <
   (int_AMDGPU_barrier_local),
[1,147 unchanged lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/llvm.amdgcn.s.waitcnt.ll

+; RUN: llc -march=amdgcn -mcpu=SI -verify-machineinstrs < %s | FileCheck -check-prefix=CHECK %s
+; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=CHECK %s
+
+; CHECK-LABEL: {{^}}test1:
+; CHECK: image_store
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0){{$}}
+; CHECK-NEXT: image_store
+; CHECK-NEXT: s_endpgm
+define amdgpu_ps void @test1(<8 x i32> inreg %rsrc, <4 x float> %d0, <4 x float> %d1, i32 %c0, i32 %c1) {
+  call void @llvm.amdgcn.image.store.i32(<4 x float> %d0, i32 %c0, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 1, i1 0)
+  call void @llvm.amdgcn.s.waitcnt(i32 3840) ; 0xf00
+  call void @llvm.amdgcn.image.store.i32(<4 x float> %d1, i32 %c1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 1, i1 0)
+  ret void
+}
+
+; Test that the intrinsic is merged with automatically generated waits and
+; emitted as late as possible.
+;
+; CHECK-LABEL: {{^}}test2:
+; CHECK: image_load
+; CHECK-NOT: s_waitcnt vmcnt(0){{$}}
+; CHECK: s_waitcnt
+; CHECK-NEXT: image_store
+define amdgpu_ps void @test2(<8 x i32> inreg %rsrc, i32 %c) {
+  %t = call <4 x float> @llvm.amdgcn.image.load.i32(i32 %c, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  call void @llvm.amdgcn.s.waitcnt(i32 3840) ; 0xf00
+  %c.1 = mul i32 %c, 2
+  call void @llvm.amdgcn.image.store.i32(<4 x float> %t, i32 %c.1, <8 x i32> %rsrc, i32 15, i1 0, i1 0, i1 0, i1 0)
+  ret void
+}
+
+declare void @llvm.amdgcn.s.waitcnt(i32) #0
+
+declare <4 x float> @llvm.amdgcn.image.load.i32(i32, <8 x i32>, i32, i1, i1, i1, i1) #1
+declare void @llvm.amdgcn.image.store.i32(<4 x float>, i32, <8 x i32>, i32, i1, i1, i1, i1) #0
+
+attributes #0 = { nounwind }
+attributes #1 = { nounwind readonly }