This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic
ClosedPublic

Authored by kerbowa on Apr 29 2022, 2:28 PM.

Details

Reviewers

rampitec
vangthao95
jrbyrnes
foad
arsenm

Commits

rG2db700215a2e: [AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic

Summary

Adds an intrinsic/builtin that can be used to fine-tune scheduler behavior. When
highly optimized codegen is needed and kernel developers know about inter-wave
runtime behavior that the compiler cannot see, this builtin can be used to tune
scheduling.

This intrinsic creates a barrier between scheduling regions. The immediate
parameter is a mask to determine the types of instructions that should be
prevented from crossing the sched_barrier. In this initial patch, there are only
two variations. A mask of 0 means that no instructions may be scheduled across
the sched_barrier. A mask of 1 means that non-memory, non-side-effect inducing
instructions may cross the sched_barrier.
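
The two mask values described above can be modeled as a small predicate. This is only a sketch of the documented semantics, not code from the patch: `InstrTraits` and `mayCrossSchedBarrier` are hypothetical names, and treating masks other than 0 and 1 as blocking everything is an assumption of this sketch.

```cpp
#include <cstdint>

// Hypothetical, simplified description of an instruction for this sketch.
struct InstrTraits {
  bool MayLoad;
  bool MayStore;
  bool HasSideEffects;
};

// Returns true if an instruction with the given traits may be scheduled
// across a sched_barrier with the given mask.
//   Mask 0: nothing may cross.
//   Mask 1: only non-memory, non-side-effect instructions may cross.
// Other masks are not defined by the summary; this sketch blocks everything
// for them (an assumption, chosen to be conservative).
bool mayCrossSchedBarrier(std::uint32_t Mask, const InstrTraits &I) {
  if (Mask == 0)
    return false;
  if (Mask == 1)
    return !I.MayLoad && !I.MayStore && !I.HasSideEffects;
  return false;
}
```

Under this model a plain ALU instruction crosses a mask-1 barrier, while loads, stores, and side-effecting instructions do not.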

Note that this intrinsic is only meant to work with the scheduling passes. Any
other transformations that may move code will not be impacted in the ways
described above.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kerbowa created this revision. · Apr 29 2022, 2:28 PM

Herald added a project: Restricted Project. · Apr 29 2022, 2:28 PM

Herald added subscribers: hsmhsm, foad, hiraditya and 8 others.

kerbowa requested review of this revision. · Apr 29 2022, 2:28 PM

Herald added projects: Restricted Project, Restricted Project. · Apr 29 2022, 2:28 PM

Herald added subscribers: llvm-commits, cfe-commits, wdng.

Add mir tests.

kerbowa added reviewers: rampitec, vangthao95, jrbyrnes, foad, arsenm. · Apr 29 2022, 2:50 PM

You do not handle masks other than 0 yet?

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
219	Since you are going to extend it, I'd suggest this be -1. Then you will start carving bits out of it. That way, if someone starts to use it, it will still work after the update.
222	Why not a full i32? This is an immediate anyway, but you will have more bits for the future.
llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
213	Use hex?

In D124700#3483556, @rampitec wrote:

You do not handle masks other than 0 yet?

We handle 0 and 1 only.

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
219	Since the most common use case will be to block all instruction types I thought having that be MASK = 0 made the most sense. After that, we carve out bits for types of instructions that should be scheduled across it. There may be modes where we restrict certain types of memops, so we cannot have MASK = 1 above changed to -1. Since this (MASK = 1) is allowing all ALU across we could define which bits mean VALU/SALU/MFMA etc and use that mask if you think it's better. I'm worried we won't be able to anticipate all the types that we could want to be maskable. It might be better to just have a single bit that can mean all ALU, or all MemOps, and so on to avoid this problem.
222	Good point thanks.

In D124700#3483609, @kerbowa wrote:

We handle 0 and 1 only.

Do you mean 1 is supported simply because it has side effects? If I understand it right you will need to remove this to support more flexible masks, right?

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
219	Ok. Let it be 1.

Harbormaster completed remote builds in B162063: Diff 426169. · Apr 29 2022, 3:52 PM

In D124700#3483633, @rampitec wrote:

Do you mean 1 is supported simply because it has side effects? If I understand it right you will need to remove this to support more flexible masks, right?

Yes.

In D124700#3483715, @kerbowa wrote:

Yes.

LGTM given that. But change imm to i32 before committing.

This revision is now accepted and ready to land. · Apr 29 2022, 5:29 PM

Can you add a test to make sure the hazard recognizer and code size estimate don't think this is a real instruction?

Use i32.
Output hex.
Fix hazard rec tests for pseudo instructions.

Herald added a subscriber: jsilvanus. · May 6 2022, 2:22 PM

LGTM

Harbormaster completed remote builds in B163225: Diff 427747. · May 6 2022, 5:21 PM

Closed by commit rG2db700215a2e: [AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic (authored by kerbowa). · May 11 2022, 1:41 PM

This revision was automatically updated to reflect the committed changes.

kerbowa added a commit: rG2db700215a2e: [AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic.

Herald added a subscriber: kosarev. · May 11 2022, 1:41 PM

Revision Contents

Path	Size
clang/include/clang/Basic/BuiltinsAMDGPU.def	1 line
clang/test/CodeGenOpenCL/builtins-amdgcn.cl	13 lines
clang/test/SemaOpenCL/builtins-amdgcn-error.cl	5 lines
llvm/include/llvm/IR/IntrinsicsAMDGPU.td	9 lines
llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp	10 lines
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp	4 lines
llvm/lib/Target/AMDGPU/SIInstructions.td	12 lines
llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp	1 line
llvm/test/CodeGen/AMDGPU/hazard-pseudo-machineinstrs.mir	69 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.barrier.ll	23 lines
llvm/test/CodeGen/AMDGPU/sched_barrier.mir	99 lines

Diff 428763

clang/include/clang/Basic/BuiltinsAMDGPU.def

	[56 lines not shown]
	BUILTIN(__builtin_amdgcn_s_getreg, "UiIi", "n")			BUILTIN(__builtin_amdgcn_s_getreg, "UiIi", "n")
	BUILTIN(__builtin_amdgcn_s_setreg, "vIiUi", "n")			BUILTIN(__builtin_amdgcn_s_setreg, "vIiUi", "n")
	BUILTIN(__builtin_amdgcn_s_getpc, "WUi", "n")			BUILTIN(__builtin_amdgcn_s_getpc, "WUi", "n")
	BUILTIN(__builtin_amdgcn_s_waitcnt, "vIi", "n")			BUILTIN(__builtin_amdgcn_s_waitcnt, "vIi", "n")
	BUILTIN(__builtin_amdgcn_s_sendmsg, "vIiUi", "n")			BUILTIN(__builtin_amdgcn_s_sendmsg, "vIiUi", "n")
	BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")			BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")
	BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")			BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")
	BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")			BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")
				BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")
	BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")			BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
	BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")			BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_init, "vUiUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_init, "vUiUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_barrier, "vUiUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_barrier, "vUiUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_sema_v, "vUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_sema_v, "vUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_sema_br, "vUiUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_sema_br, "vUiUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_sema_p, "vUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_sema_p, "vUi", "n")
	BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")			BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
	[248 lines not shown]

clang/test/CodeGenOpenCL/builtins-amdgcn.cl

	[390 lines not shown]

	// CHECK-LABEL: @test_wave_barrier			// CHECK-LABEL: @test_wave_barrier
	// CHECK: call void @llvm.amdgcn.wave.barrier(			// CHECK: call void @llvm.amdgcn.wave.barrier(
	void test_wave_barrier()			void test_wave_barrier()
	{			{
	__builtin_amdgcn_wave_barrier();			__builtin_amdgcn_wave_barrier();
	}			}

				// CHECK-LABEL: @test_sched_barrier
				// CHECK: call void @llvm.amdgcn.sched.barrier(i32 0)
				// CHECK: call void @llvm.amdgcn.sched.barrier(i32 1)
				// CHECK: call void @llvm.amdgcn.sched.barrier(i32 4)
				// CHECK: call void @llvm.amdgcn.sched.barrier(i32 15)
				void test_sched_barrier()
				{
				__builtin_amdgcn_sched_barrier(0);
				__builtin_amdgcn_sched_barrier(1);
				__builtin_amdgcn_sched_barrier(4);
				__builtin_amdgcn_sched_barrier(15);
				}

	// CHECK-LABEL: @test_s_sleep			// CHECK-LABEL: @test_s_sleep
	// CHECK: call void @llvm.amdgcn.s.sleep(i32 1)			// CHECK: call void @llvm.amdgcn.s.sleep(i32 1)
	// CHECK: call void @llvm.amdgcn.s.sleep(i32 15)			// CHECK: call void @llvm.amdgcn.s.sleep(i32 15)
	void test_s_sleep()			void test_s_sleep()
	{			{
	__builtin_amdgcn_s_sleep(1);			__builtin_amdgcn_s_sleep(1);
	__builtin_amdgcn_s_sleep(15);			__builtin_amdgcn_s_sleep(15);
	}			}
	[358 lines not shown]
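
For reference, a call site might look like the following sketch. The `SCHED_BARRIER` macro, its no-op fallback, and the `twoPhases` function are hypothetical; only `__builtin_amdgcn_sched_barrier` and its constant-integer argument come from this revision. The `__has_builtin` guard is there so the same source still builds with compilers or targets that lack the builtin.

```cpp
// Expand to the builtin only where the compiler provides it for the target.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_amdgcn_sched_barrier)
#    define SCHED_BARRIER(mask) __builtin_amdgcn_sched_barrier(mask)
#  endif
#endif
#ifndef SCHED_BARRIER
// No-op fallback for other targets (an assumption of this sketch).
#  define SCHED_BARRIER(mask) ((void)0)
#endif

// Hypothetical function with two phases separated by scheduling barriers.
int twoPhases(int a, int b) {
  int x = a * a;
  SCHED_BARRIER(0); // mask 0: nothing may be scheduled across this point
  int y = b * b;
  SCHED_BARRIER(1); // mask 1: only non-memory, non-side-effect instructions may cross
  return x + y;
}
```

The mask must be a compile-time constant, which is why the macro is given literals. As the summary notes, the barrier constrains only the scheduling passes, so other transformations may still move or fold these computations.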

clang/test/SemaOpenCL/builtins-amdgcn-error.cl

	[54 lines not shown]
	}			}

	void test_s_setprio(int x)			void test_s_setprio(int x)
	{			{
	__builtin_amdgcn_s_setprio(x); // expected-error {{argument to '__builtin_amdgcn_s_setprio' must be a constant integer}}			__builtin_amdgcn_s_setprio(x); // expected-error {{argument to '__builtin_amdgcn_s_setprio' must be a constant integer}}
	__builtin_amdgcn_s_setprio(65536); // expected-warning {{implicit conversion from 'int' to 'short' changes value from 65536 to 0}}			__builtin_amdgcn_s_setprio(65536); // expected-warning {{implicit conversion from 'int' to 'short' changes value from 65536 to 0}}
	}			}

				void test_sched_barrier(int x)
				{
				__builtin_amdgcn_sched_barrier(x); // expected-error {{argument to '__builtin_amdgcn_sched_barrier' must be a constant integer}}
				}

	void test_sicmp_i32(global ulong* out, int a, int b, uint c)			void test_sicmp_i32(global ulong* out, int a, int b, uint c)
	{			{
	*out = __builtin_amdgcn_sicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_sicmp' must be a constant integer}}			*out = __builtin_amdgcn_sicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_sicmp' must be a constant integer}}
	}			}

	void test_uicmp_i32(global ulong* out, uint a, uint b, uint c)			void test_uicmp_i32(global ulong* out, uint a, uint b, uint c)
	{			{
	*out = __builtin_amdgcn_uicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_uicmp' must be a constant integer}}			*out = __builtin_amdgcn_uicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_uicmp' must be a constant integer}}
	[142 lines not shown]

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

[207 lines not shown]
def int_amdgcn_s_sendmsghalt : GCCBuiltin<"__builtin_amdgcn_s_sendmsghalt">,
[ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects]>;		[ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects]>;

def int_amdgcn_s_barrier : GCCBuiltin<"__builtin_amdgcn_s_barrier">,		def int_amdgcn_s_barrier : GCCBuiltin<"__builtin_amdgcn_s_barrier">,
Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects, IntrConvergent, IntrWillReturn]>;		Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects, IntrConvergent, IntrWillReturn]>;

def int_amdgcn_wave_barrier : GCCBuiltin<"__builtin_amdgcn_wave_barrier">,		def int_amdgcn_wave_barrier : GCCBuiltin<"__builtin_amdgcn_wave_barrier">,
Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects, IntrConvergent, IntrWillReturn]>;		Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects, IntrConvergent, IntrWillReturn]>;

		// The 1st parameter is a mask for the types of instructions that may be allowed
		// to cross the SCHED_BARRIER during scheduling.
		// MASK = 0: No instructions may be scheduled across SCHED_BARRIER.
		// MASK = 1: Non-memory, non-side-effect producing instructions may be
		// scheduled across SCHED_BARRIER, i.e. allow ALU instructions to pass.
		def int_amdgcn_sched_barrier : GCCBuiltin<"__builtin_amdgcn_sched_barrier">,
		Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem,
		IntrHasSideEffects, IntrConvergent, IntrWillReturn]>;

def int_amdgcn_s_waitcnt : GCCBuiltin<"__builtin_amdgcn_s_waitcnt">,		def int_amdgcn_s_waitcnt : GCCBuiltin<"__builtin_amdgcn_s_waitcnt">,
Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects, IntrWillReturn]>;		Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects, IntrWillReturn]>;

def int_amdgcn_div_scale : Intrinsic<		def int_amdgcn_div_scale : Intrinsic<
// 1st parameter: Numerator		// 1st parameter: Numerator
// 2nd parameter: Denominator		// 2nd parameter: Denominator
// 3rd parameter: Select quotient. Must equal Numerator or Denominator.		// 3rd parameter: Select quotient. Must equal Numerator or Denominator.
// (0 = Denominator, 1 = Numerator).		// (0 = Denominator, 1 = Numerator).
[1,841 lines not shown]

llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp

[201 lines not shown]
if (MI->isBundle()) {
}		}

if (MI->getOpcode() == AMDGPU::WAVE_BARRIER) {		if (MI->getOpcode() == AMDGPU::WAVE_BARRIER) {
if (isVerbose())		if (isVerbose())
OutStreamer->emitRawComment(" wave barrier");		OutStreamer->emitRawComment(" wave barrier");
return;		return;
}		}

		if (MI->getOpcode() == AMDGPU::SCHED_BARRIER) {
		if (isVerbose()) {
		std::string HexString;
		raw_string_ostream HexStream(HexString);
		HexStream << format_hex(MI->getOperand(0).getImm(), 10, true);
		OutStreamer->emitRawComment(" sched_barrier mask(" + HexString + ")");
		}
		return;
		}

if (MI->getOpcode() == AMDGPU::SI_MASKED_UNREACHABLE) {		if (MI->getOpcode() == AMDGPU::SI_MASKED_UNREACHABLE) {
if (isVerbose())		if (isVerbose())
OutStreamer->emitRawComment(" divergent unreachable");		OutStreamer->emitRawComment(" divergent unreachable");
return;		return;
}		}

if (MI->isMetaInstruction()) {		if (MI->isMetaInstruction()) {
if (isVerbose())		if (isVerbose())
[62 lines not shown]
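
The verbose comment emitted for SCHED_BARRIER prints the mask as `0x` followed by eight uppercase hex digits, matching the `mask(0x0000000F)` lines checked in llvm.amdgcn.sched.barrier.ll. A standard-library stand-in for that formatting, assuming a 32-bit mask, might look like this. `schedBarrierComment` is a hypothetical helper and does not use LLVM's `format_hex`.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Builds the comment text for a given mask, e.g. " sched_barrier mask(0x0000000F)".
// "%08X" pads to eight uppercase hex digits; the "0x" prefix is added literally,
// giving ten characters in total.
std::string schedBarrierComment(std::uint32_t Mask) {
  char Buf[16];
  std::snprintf(Buf, sizeof(Buf), "0x%08X", static_cast<unsigned>(Mask));
  return " sched_barrier mask(" + std::string(Buf) + ")";
}
```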

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

[1,767 lines not shown]
unsigned SIInstrInfo::getNumWaitStates(const MachineInstr &MI) {
case AMDGPU::S_NOP:		case AMDGPU::S_NOP:
return MI.getOperand(0).getImm() + 1;		return MI.getOperand(0).getImm() + 1;

// FIXME: Any other pseudo instruction?		// FIXME: Any other pseudo instruction?
// SI_RETURN_TO_EPILOG is a fallthrough to code outside of the function. The		// SI_RETURN_TO_EPILOG is a fallthrough to code outside of the function. The
// hazard, even if one exist, won't really be visible. Should we handle it?		// hazard, even if one exist, won't really be visible. Should we handle it?
case AMDGPU::SI_MASKED_UNREACHABLE:		case AMDGPU::SI_MASKED_UNREACHABLE:
case AMDGPU::WAVE_BARRIER:		case AMDGPU::WAVE_BARRIER:
		case AMDGPU::SCHED_BARRIER:
return 0;		return 0;
}		}
}		}

bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {		bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
const SIRegisterInfo *TRI = ST.getRegisterInfo();		const SIRegisterInfo *TRI = ST.getRegisterInfo();
MachineBasicBlock &MBB = *MI.getParent();		MachineBasicBlock &MBB = *MI.getParent();
DebugLoc DL = MBB.findDebugLoc(MI);		DebugLoc DL = MBB.findDebugLoc(MI);
[1,701 lines not shown]
bool SIInstrInfo::isSchedulingBoundary(const MachineInstr &MI,
// Terminators and labels can't be scheduled around.		// Terminators and labels can't be scheduled around.
if (MI.isTerminator() || MI.isPosition())		if (MI.isTerminator() || MI.isPosition())
return true;		return true;

// INLINEASM_BR can jump to another block		// INLINEASM_BR can jump to another block
if (MI.getOpcode() == TargetOpcode::INLINEASM_BR)		if (MI.getOpcode() == TargetOpcode::INLINEASM_BR)
return true;		return true;

		if (MI.getOpcode() == AMDGPU::SCHED_BARRIER && MI.getOperand(0).getImm() == 0)
		return true;

// Target-independent instructions do not have an implicit-use of EXEC, even		// Target-independent instructions do not have an implicit-use of EXEC, even
// when they operate on VGPRs. Treating EXEC modifications as scheduling		// when they operate on VGPRs. Treating EXEC modifications as scheduling
// boundaries prevents incorrect movements of such instructions.		// boundaries prevents incorrect movements of such instructions.
return MI.modifiesRegister(AMDGPU::EXEC, &RI) ||		return MI.modifiesRegister(AMDGPU::EXEC, &RI) ||
MI.getOpcode() == AMDGPU::S_SETREG_IMM32_B32 ||		MI.getOpcode() == AMDGPU::S_SETREG_IMM32_B32 ||
MI.getOpcode() == AMDGPU::S_SETREG_B32 ||		MI.getOpcode() == AMDGPU::S_SETREG_B32 ||
changesVGPRIndexingMode(MI);		changesVGPRIndexingMode(MI);
}		}
[4,939 lines not shown]
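
The `isSchedulingBoundary` change treats a SCHED_BARRIER with mask 0 as a boundary, so instructions before and after it fall into separate scheduling regions. The effect can be illustrated with a simplified model. `Instr`, `isBoundary`, and `splitRegions` are hypothetical names, and the real scheduler's region formation is considerably more involved than this sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record for this sketch.
struct Instr {
  std::string Name;
  bool IsSchedBarrier;
  std::int64_t Mask;
};

// Mirrors the condition added in this revision: only mask 0 is a boundary.
bool isBoundary(const Instr &I) { return I.IsSchedBarrier && I.Mask == 0; }

// Splits a block into regions at boundary instructions. The boundary itself
// belongs to no region; a barrier with a nonzero mask stays inside its region.
std::vector<std::vector<std::string>>
splitRegions(const std::vector<Instr> &Block) {
  std::vector<std::vector<std::string>> Regions(1);
  for (const Instr &I : Block) {
    if (isBoundary(I)) {
      Regions.emplace_back();
      continue;
    }
    Regions.back().push_back(I.Name);
  }
  return Regions;
}
```

In this model a mask-1 barrier does not split the region. Per the review discussion, its ordering effect comes from the pseudo instruction being marked as having side effects.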

llvm/lib/Target/AMDGPU/SIInstructions.td

[307 lines not shown]
def WAVE_BARRIER : SPseudoInstSI<(outs), (ins),
let hasSideEffects = 1;		let hasSideEffects = 1;
let mayLoad = 0;		let mayLoad = 0;
let mayStore = 0;		let mayStore = 0;
let isConvergent = 1;		let isConvergent = 1;
let FixedSize = 1;		let FixedSize = 1;
let Size = 0;		let Size = 0;
}		}

		def SCHED_BARRIER : SPseudoInstSI<(outs), (ins i32imm:$mask),
		[(int_amdgcn_sched_barrier (i32 timm:$mask))]> {
		let SchedRW = [];
		let hasNoSchedulingInfo = 1;
		let hasSideEffects = 1;
		let mayLoad = 0;
		let mayStore = 0;
		let isConvergent = 1;
		let FixedSize = 1;
		let Size = 0;
		}

// SI pseudo instructions. These are used by the CFG structurizer pass		// SI pseudo instructions. These are used by the CFG structurizer pass
// and should be lowered to ISA instructions prior to codegen.		// and should be lowered to ISA instructions prior to codegen.

let isTerminator = 1 in {		let isTerminator = 1 in {

let OtherPredicates = [EnableLateCFGStructurize] in {		let OtherPredicates = [EnableLateCFGStructurize] in {
def SI_NON_UNIFORM_BRCOND_PSEUDO : CFPseudoInstSI <		def SI_NON_UNIFORM_BRCOND_PSEUDO : CFPseudoInstSI <
(outs),		(outs),
[2,910 lines not shown]

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

[142 lines not shown]
bool isReallyAClobber(const Value *Ptr, MemoryDef *Def, AAResults *AA) {

if (isa<FenceInst>(DefInst))		if (isa<FenceInst>(DefInst))
return false;		return false;

if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(DefInst)) {		if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(DefInst)) {
switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::amdgcn_s_barrier:		case Intrinsic::amdgcn_s_barrier:
case Intrinsic::amdgcn_wave_barrier:		case Intrinsic::amdgcn_wave_barrier:
		case Intrinsic::amdgcn_sched_barrier:
return false;		return false;
default:		default:
break;		break;
}		}
}		}

// Ignore atomics not aliasing with the original load, any atomic is a		// Ignore atomics not aliasing with the original load, any atomic is a
// universal MemoryDef from MSSA's point of view too, just like a fence.		// universal MemoryDef from MSSA's point of view too, just like a fence.
[61 lines not shown]

llvm/test/CodeGen/AMDGPU/hazard-pseudo-machineinstrs.mir

	# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass post-RA-sched %s -o - | FileCheck -check-prefix=GCN %s			# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -run-pass post-RA-hazard-rec %s -o - | FileCheck -check-prefix=GCN %s

				# WAVE_BARRIER and SI_MASKED_UNREACHABLE ect. are not really instructions. To
				# fix the hazard (m0 def followed by V_INTERP), the compiler should insert a
				# S_NOP.

	# WAVE_BARRIER and SI_MASKED_UNREACHABLE are not really instructions.
	# To fix the hazard (m0 def followed by V_INTERP), the scheduler
	# should move another instruction into the slot.
	---			---
	# CHECK-LABEL: name: hazard_wave_barrier
	# CHECK-LABEL: bb.0:
	# GCN: $m0 = S_MOV_B32 killed renamable $sgpr0
	# GCN-NEXT: WAVE_BARRIER
	# GCN-NEXT: S_MOV_B32 0
	# GCN-NEXT: V_INTERP_MOV_F32
	name: hazard_wave_barrier			name: hazard_wave_barrier
	tracksRegLiveness: true			tracksRegLiveness: true
	body: |			body: |
	bb.0:			bb.0:
	liveins: $sgpr0			liveins: $sgpr0

				; GCN-LABEL: name: hazard_wave_barrier
				; GCN: liveins: $sgpr0
				; GCN-NEXT: {{ $}}
				; GCN-NEXT: $m0 = S_MOV_B32 killed renamable $sgpr0
				; GCN-NEXT: WAVE_BARRIER
				; GCN-NEXT: S_NOP 0
				; GCN-NEXT: renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec
				; GCN-NEXT: renamable $sgpr1 = S_MOV_B32 0
				; GCN-NEXT: S_ENDPGM 0
	$m0 = S_MOV_B32 killed renamable $sgpr0			$m0 = S_MOV_B32 killed renamable $sgpr0
	WAVE_BARRIER			WAVE_BARRIER
	renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec			renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec
	renamable $sgpr1 = S_MOV_B32 0			renamable $sgpr1 = S_MOV_B32 0
	S_ENDPGM 0			S_ENDPGM 0

	...			...
	# GCN-LABEL: name: hazard-masked-unreachable
	# CHECK-LABEL: bb.0:
	# GCN: $m0 = S_MOV_B32 killed renamable $sgpr0
	# GCN-NEXT: SI_MASKED_UNREACHABLE
	# GCN-NEXT: S_MOV_B32 0
	# GCN-NEXT: V_INTERP_MOV_F32
	---			---
	name: hazard-masked-unreachable			name: hazard-masked-unreachable
	tracksRegLiveness: true			tracksRegLiveness: true
	body: |			body: |
				; GCN-LABEL: name: hazard-masked-unreachable
				; GCN: bb.0:
				; GCN-NEXT: successors: %bb.1(0x80000000)
				; GCN-NEXT: liveins: $sgpr0
				; GCN-NEXT: {{ $}}
				; GCN-NEXT: $m0 = S_MOV_B32 killed renamable $sgpr0
				; GCN-NEXT: SI_MASKED_UNREACHABLE
				; GCN-NEXT: S_NOP 0
				; GCN-NEXT: renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec
				; GCN-NEXT: renamable $sgpr1 = S_MOV_B32 0
				; GCN-NEXT: {{ $}}
				; GCN-NEXT: bb.1:
				; GCN-NEXT: S_ENDPGM 0
	bb.0:			bb.0:
	liveins: $sgpr0			liveins: $sgpr0

	$m0 = S_MOV_B32 killed renamable $sgpr0			$m0 = S_MOV_B32 killed renamable $sgpr0
	SI_MASKED_UNREACHABLE			SI_MASKED_UNREACHABLE
	renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec			renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec
	renamable $sgpr1 = S_MOV_B32 0			renamable $sgpr1 = S_MOV_B32 0
	bb.1:			bb.1:
	S_ENDPGM 0			S_ENDPGM 0
	...			...

				---
				name: hazard_sched_barrier
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $sgpr0

				; GCN-LABEL: name: hazard_sched_barrier
				; GCN: liveins: $sgpr0
				; GCN-NEXT: {{ $}}
				; GCN-NEXT: $m0 = S_MOV_B32 killed renamable $sgpr0
				; GCN-NEXT: SCHED_BARRIER 0
				; GCN-NEXT: S_NOP 0
				; GCN-NEXT: renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec
				; GCN-NEXT: renamable $sgpr1 = S_MOV_B32 0
				; GCN-NEXT: S_ENDPGM 0
				$m0 = S_MOV_B32 killed renamable $sgpr0
				SCHED_BARRIER 0
				renamable $vgpr0 = V_INTERP_MOV_F32 2, 0, 0, implicit $mode, implicit $m0, implicit $exec
				renamable $sgpr1 = S_MOV_B32 0
				S_ENDPGM 0

				...
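
The updated checks expect an `S_NOP 0` after each pseudo instruction because pseudos contribute no wait states toward resolving the hazard. The arithmetic can be sketched as follows. `Instr`, `numWaitStates`, and `missingWaitStates` are hypothetical names. The default of one wait state per real instruction and the requirement of one wait state between the m0 def and V_INTERP are assumptions inferred from the expected output, not taken from the hazard recognizer's source.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified instruction record for this sketch.
struct Instr {
  std::string Opcode;
  int Imm;
};

// Modeled on the getNumWaitStates hunk: S_NOP covers Imm + 1 wait states,
// the listed pseudo instructions cover none, and anything else is assumed
// to cover one.
int numWaitStates(const Instr &I) {
  if (I.Opcode == "S_NOP")
    return I.Imm + 1;
  if (I.Opcode == "WAVE_BARRIER" || I.Opcode == "SCHED_BARRIER" ||
      I.Opcode == "SI_MASKED_UNREACHABLE")
    return 0;
  return 1;
}

// Wait states still missing between a def and its hazardous use, given the
// instructions in between and the number of wait states required.
int missingWaitStates(const std::vector<Instr> &Between, int Required) {
  int Have = 0;
  for (const Instr &I : Between)
    Have += numWaitStates(I);
  return Have >= Required ? 0 : Required - Have;
}
```

With only `SCHED_BARRIER 0` between the m0 def and V_INTERP, one wait state is missing under these assumptions, which a single `S_NOP 0` supplies.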

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.barrier.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

				define amdgpu_kernel void @test_wave_barrier() #0 {
				; GCN-LABEL: test_wave_barrier:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: ; sched_barrier mask(0x00000000)
				; GCN-NEXT: ; sched_barrier mask(0x00000001)
				; GCN-NEXT: ; sched_barrier mask(0x00000004)
				; GCN-NEXT: ; sched_barrier mask(0x0000000F)
				; GCN-NEXT: s_endpgm
				entry:
				call void @llvm.amdgcn.sched.barrier(i32 0) #1
				call void @llvm.amdgcn.sched.barrier(i32 1) #1
				call void @llvm.amdgcn.sched.barrier(i32 4) #1
				call void @llvm.amdgcn.sched.barrier(i32 15) #1
				ret void
				}

				declare void @llvm.amdgcn.sched.barrier(i32) #1

				attributes #0 = { nounwind }
				attributes #1 = { convergent nounwind }

llvm/test/CodeGen/AMDGPU/sched_barrier.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -march=amdgcn -mcpu=gfx908 -run-pass=machine-scheduler -verify-misched -o - %s | FileCheck %s

				--- |
				define amdgpu_kernel void @no_sched_barrier(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) { ret void }
				define amdgpu_kernel void @sched_barrier_0(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) { ret void }
				define amdgpu_kernel void @sched_barrier_1(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) { ret void }

				!0 = distinct !{!0}
				!1 = !{!1, !0}
				...

				---
				name: no_sched_barrier
				tracksRegLiveness: true
				body: |
				bb.0:
				; CHECK-LABEL: name: no_sched_barrier
				; CHECK: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[GLOBAL_LOAD_DWORD_SADDR]], implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_1:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR1]], [[GLOBAL_LOAD_DWORD_SADDR1]], implicit $exec
				; CHECK-NEXT: S_NOP 0
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_]], [[DEF]], 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_1]], [[DEF]], 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: S_ENDPGM 0
				%0:sreg_64 = IMPLICIT_DEF
				%1:vgpr_32 = IMPLICIT_DEF
				%3:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%4:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %3, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %4, %0, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_NOP 0
				%5:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%6:vgpr_32 = nsw V_MUL_LO_U32_e64 %5, %5, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %6, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_ENDPGM 0
				...

				---
				name: sched_barrier_0
				tracksRegLiveness: true
				body: |
				bb.0:
				; CHECK-LABEL: name: sched_barrier_0
				; CHECK: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[GLOBAL_LOAD_DWORD_SADDR]], implicit $exec
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_]], [[DEF]], 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: S_NOP 0
				; CHECK-NEXT: SCHED_BARRIER 0
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_1:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR1]], [[GLOBAL_LOAD_DWORD_SADDR1]], implicit $exec
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_1]], [[DEF]], 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: S_ENDPGM 0
				%0:sreg_64 = IMPLICIT_DEF
				%1:vgpr_32 = IMPLICIT_DEF
				%3:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%4:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %3, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %4, %0, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_NOP 0
				SCHED_BARRIER 0
				%5:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%6:vgpr_32 = nsw V_MUL_LO_U32_e64 %5, %5, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %6, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_ENDPGM 0
				...

				---
				name: sched_barrier_1
				tracksRegLiveness: true
				body: |
				bb.0:
				; CHECK-LABEL: name: sched_barrier_1
				; CHECK: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[GLOBAL_LOAD_DWORD_SADDR]], implicit $exec
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_]], [[DEF]], 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: SCHED_BARRIER 1
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_1:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR1]], [[GLOBAL_LOAD_DWORD_SADDR1]], implicit $exec
				; CHECK-NEXT: S_NOP 0
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_1]], [[DEF]], 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: S_ENDPGM 0
				%0:sreg_64 = IMPLICIT_DEF
				%1:vgpr_32 = IMPLICIT_DEF
				%3:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%4:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %3, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %4, %0, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_NOP 0
				SCHED_BARRIER 1
				%5:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%6:vgpr_32 = nsw V_MUL_LO_U32_e64 %5, %5, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %6, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_ENDPGM 0
				...