This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add amdgcn_sched_group_barrier builtin
ClosedPublic

Authored by kerbowa on Jun 20 2022, 12:13 AM.


Details

Reviewers

rampitec
jrbyrnes
vangthao95
arsenm

Commits

rGf5b21680d122: [AMDGPU] Add amdgcn_sched_group_barrier builtin

Summary

This builtin allows the creation of custom scheduling pipelines on a per-region
basis. Like the sched_barrier builtin, it is intended for testing, for
situations where the default scheduler heuristics cannot be improved, or for
critical kernels where users are trying to get performance close to that of
handwritten assembly. Using these builtins requires extra work from the kernel
writer to maintain the desired behavior.

The builtin can be used to create groups of instructions called "scheduling
groups" where ordering between the groups is enforced by the scheduler.
__builtin_amdgcn_sched_group_barrier takes three parameters. The first parameter
is a mask that determines the types of instructions that you would like to
synchronize around and add to a scheduling group. These instructions will be
selected from the bottom up starting from the sched_group_barrier's location
during instruction scheduling. The second parameter is the number of matching
instructions that will be associated with this sched_group_barrier. The third
parameter is an identifier which is used to describe what other
sched_group_barriers should be synchronized with. Note that multiple
sched_group_barriers must be added in order for them to be useful since they
only synchronize with other sched_group_barriers. Only "scheduling groups" with
a matching third parameter will have any enforced ordering between them.
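The role of the third parameter can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Barrier`, `bucketBySyncID` and `orderedPairs` are hypothetical names invented here, and the model ignores instruction selection and the scheduling DAG entirely. It only shows that ordering constraints arise between barriers that share an identifier.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical record of one sched_group_barrier call: (mask, size, sync ID).
struct Barrier {
  unsigned Mask;
  unsigned Size;
  int SyncID;
};

// Bucket barrier indices by sync ID, preserving program order in each bucket.
std::map<int, std::vector<std::size_t>>
bucketBySyncID(const std::vector<Barrier> &Barriers) {
  std::map<int, std::vector<std::size_t>> Buckets;
  for (std::size_t I = 0; I < Barriers.size(); ++I)
    Buckets[Barriers[I].SyncID].push_back(I);
  return Buckets;
}

// Return every (earlier, later) pair of barriers that would be ordered
// relative to each other in this model: pairs with the same sync ID only.
std::vector<std::pair<std::size_t, std::size_t>>
orderedPairs(const std::vector<Barrier> &Barriers) {
  std::vector<std::pair<std::size_t, std::size_t>> Pairs;
  const auto Buckets = bucketBySyncID(Barriers);
  for (const auto &Entry : Buckets) {
    const std::vector<std::size_t> &Ids = Entry.second;
    for (std::size_t I = 0; I < Ids.size(); ++I)
      for (std::size_t J = I + 1; J < Ids.size(); ++J)
        Pairs.emplace_back(Ids[I], Ids[J]);
  }
  return Pairs;
}
```

In this model a barrier whose identifier is not shared with any other barrier produces no pairs, which matches the note that a single sched_group_barrier is not useful on its own.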

As an example, the code below tries to create a pipeline of 1 VMEM_READ
instruction followed by 1 VALU instruction followed by 5 MFMA instructions...
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0)
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 3 VALU
__builtin_amdgcn_sched_group_barrier(2, 3, 0)
// 2 VMEM_WRITE
__builtin_amdgcn_sched_group_barrier(64, 2, 0)
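
As a reading aid for the numeric masks in the example, the sketch below decodes a mask into instruction class names. The bit positions follow the sched_barrier mask values listed in this revision (for example 0x40 for VMEM write); `SchedMaskBit` and `describeMask` are hypothetical names used for illustration and are not part of LLVM or Clang.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bit values as documented for the sched_barrier mask in this revision.
// The enumerator names are local to this sketch.
enum SchedMaskBit : uint32_t {
  ALU = 1u << 0,
  VALU = 1u << 1,
  SALU = 1u << 2,
  MFMA = 1u << 3,
  VMEM = 1u << 4,
  VMEM_READ = 1u << 5,
  VMEM_WRITE = 1u << 6,
  DS = 1u << 7,
  DS_READ = 1u << 8,
  DS_WRITE = 1u << 9,
};

// Hypothetical helper: list the instruction classes selected by a mask,
// from the lowest bit to the highest.
std::vector<std::string> describeMask(uint32_t Mask) {
  static const char *const Names[] = {
      "ALU",       "VALU",       "SALU", "MFMA",    "VMEM",
      "VMEM_READ", "VMEM_WRITE", "DS",   "DS_READ", "DS_WRITE"};
  // This sketch only models the ten bits listed above.
  assert((Mask >> 10) == 0 && "unknown mask bit");
  std::vector<std::string> Out;
  for (unsigned Bit = 0; Bit < 10; ++Bit)
    if (Mask & (1u << Bit))
      Out.push_back(Names[Bit]);
  return Out;
}
```

With these values, the masks 32, 2, 8 and 64 used in the example decode to VMEM_READ, VALU, MFMA and VMEM_WRITE respectively.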

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kerbowa created this revision. Jun 20 2022, 12:13 AM

Herald added a project: Restricted Project. Jun 20 2022, 12:13 AM

Herald added subscribers: kosarev, jsilvanus, foad and 8 others.

kerbowa requested review of this revision. Jun 20 2022, 12:13 AM

Herald added projects: Restricted Project, Restricted Project. Jun 20 2022, 12:13 AM

Herald added subscribers: llvm-commits, cfe-commits, wdng.

Somewhat WIP; needs more tests and cleanup. Posted for dependent work.

Harbormaster completed remote builds in B170775: Diff 438269. Jun 20 2022, 1:10 AM

antc added a subscriber: antc. Jun 20 2022, 11:56 PM

Hey Austin -- I like the removal of canAddMIs. In the original design, I was leaving open the possibility for users to pass in canAddMIs rather than a mask / SchedGroup name, but it looks like this isn't the direction we're going, and the classification functions defined in a general canAddMI makes things easier.

I see this is a WIP, but I've added some thoughts I had from reading it over. I may have more as I use the design for my patch.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
199	I find it confusing that SchedBarrier uses inversion while SchedGroupBarrier doesn't.
306	As in the update to IGroupLP.cpp in trunk, seems like we are not supposed to use hasValue.
349	Not possible to have unsized groups?
445	If both types of barriers are present -- the SchedBarriers are handled first. However, if there is a conflict between SchedBarrier and SchedGroupBarrier, should SchedBarrier always get the priority? Maybe SchedBarrier should only handle groups not present in SchedGroupBarrier?
llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir
104	I think you are aware of this issue. But the ability for the mutation to match the pipeline is dependent upon which instructions go into which group (when an instruction can be mapped to multiple groups). If we had SchedGroups: 2 VMEM_READ, 1 VALU, 1 MFMA, 2 VMEM_READ and initial schedule: VMEMR, VALU, VMEMR, MFMA, VMEMR, with a dependency between middle VMEMR->MFMA. initSchedGroup will add the middle VMEMR to the last VMEMR group, but we could get a more accurate pipeline by adding it to the first group.
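
The pipeline-fitting concern in the last comment can be reproduced with a simplified standalone model. This is a sketch under stated assumptions, not the code in this diff: it assumes every group starts collecting from the bottom of the region, fills the last group first, and ignores dependencies between instructions. `Kind`, `GroupSpec` and `assignBottomUp` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { VMEM_READ, VALU, MFMA };

struct GroupSpec {
  Kind Type;        // instruction class the group accepts
  std::size_t Size; // number of instructions the group wants
};

// Greedy bottom-up assignment. Returns, for each instruction in Schedule, the
// index of the group it was placed in, or -1 if it was left unassigned.
std::vector<int> assignBottomUp(const std::vector<Kind> &Schedule,
                                const std::vector<GroupSpec> &Groups) {
  std::vector<int> GroupOf(Schedule.size(), -1);
  // Groups that come later in the pipeline get to pick first.
  for (std::size_t G = Groups.size(); G-- > 0;) {
    std::size_t Taken = 0;
    // Walk the schedule from the last instruction to the first.
    for (std::size_t I = Schedule.size(); Taken < Groups[G].Size && I-- > 0;) {
      if (GroupOf[I] == -1 && Schedule[I] == Groups[G].Type) {
        GroupOf[I] = static_cast<int>(G);
        ++Taken;
      }
    }
  }
  return GroupOf;
}
```

For the schedule VMEMR, VALU, VMEMR, MFMA, VMEMR and the groups 2 VMEM_READ, 1 VALU, 1 MFMA, 2 VMEM_READ, this model places the middle VMEMR in the last group (index 3), after the MFMA group (index 2). If the MFMA depends on that VMEMR, the requested order cannot be honored, whereas placing it in the first group would fit.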

arsenm added inline comments. Jul 5 2022, 10:43 AM

clang/test/SemaOpenCL/builtins-amdgcn-error.cl
70	Test error for each argument?

Fix some bugs. Add better pipeline fitting. Address comments.

Harbormaster completed remote builds in B176366: Diff 445965. Jul 19 2022, 5:00 PM

LGTM

jrbyrnes accepted this revision. Jul 28 2022, 9:37 AM

This revision is now accepted and ready to land. Jul 28 2022, 9:37 AM

This revision was landed with ongoing or failed builds. Jul 28 2022, 10:43 AM

Closed by commit rGf5b21680d122: [AMDGPU] Add amdgcn_sched_group_barrier builtin (authored by kerbowa).

This revision was automatically updated to reflect the committed changes.

kerbowa added a commit: rGf5b21680d122: [AMDGPU] Add amdgcn_sched_group_barrier builtin.

uabelho added a subscriber: uabelho. Jul 29 2022, 10:13 PM

uabelho added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
306	Compiling with gcc, I get a warning that this function is unused. I'm wondering, there seems to be both a const and a non-const version of the isFull method now, but they are identical? Perhaps the non-const version could be removed?

kerbowa marked an inline comment as done. Jul 30 2022, 7:48 AM

kerbowa added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
306	Removed in 7898426a72, thanks!
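
The point about the duplicate overload can be shown with a small standalone class. This is a sketch, not the LLVM code: `GroupModel` is a hypothetical stand-in for SchedGroup, and `std::optional` (C++17) is assumed here in place of `llvm::Optional`. A single const-qualified `isFull` can be called on both const and non-const objects, so a second, identical non-const overload adds nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for SchedGroup, reduced to the size bookkeeping.
class GroupModel {
  std::optional<unsigned> MaxSize; // no value means the group is unsized
  std::vector<int> Collection;

public:
  explicit GroupModel(std::optional<unsigned> MaxSize) : MaxSize(MaxSize) {}

  void add(int Id) { Collection.push_back(Id); }

  // One const-qualified member is enough: it is callable through const and
  // non-const references alike.
  bool isFull() const {
    return MaxSize.has_value() && Collection.size() >= *MaxSize;
  }
};
```

In this model an unsized group (constructed with `std::nullopt`) never reports itself as full.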

Revision Contents

Path	Size
clang/include/clang/Basic/BuiltinsAMDGPU.def	1 line
clang/test/CodeGenOpenCL/builtins-amdgcn.cl	13 lines
clang/test/SemaOpenCL/builtins-amdgcn-error.cl	5 lines
llvm/include/llvm/IR/IntrinsicsAMDGPU.td	13 lines
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp	616 lines
llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp	13 lines
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp	1 line
llvm/lib/Target/AMDGPU/SIInstructions.td	14 lines
llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp	1 line
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll	23 lines
llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir	173 lines

Diff 438269

clang/include/clang/Basic/BuiltinsAMDGPU.def

	BUILTIN(__builtin_amdgcn_s_setreg, "vIiUi", "n")			BUILTIN(__builtin_amdgcn_s_setreg, "vIiUi", "n")
	BUILTIN(__builtin_amdgcn_s_getpc, "WUi", "n")			BUILTIN(__builtin_amdgcn_s_getpc, "WUi", "n")
	BUILTIN(__builtin_amdgcn_s_waitcnt, "vIi", "n")			BUILTIN(__builtin_amdgcn_s_waitcnt, "vIi", "n")
	BUILTIN(__builtin_amdgcn_s_sendmsg, "vIiUi", "n")			BUILTIN(__builtin_amdgcn_s_sendmsg, "vIiUi", "n")
	BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")			BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")
	BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")			BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")
	BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")			BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")
	BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")			BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")
				BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n")
	BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")			BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
	BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")			BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_init, "vUiUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_init, "vUiUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_barrier, "vUiUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_barrier, "vUiUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_sema_v, "vUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_sema_v, "vUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_sema_br, "vUiUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_sema_br, "vUiUi", "n")
	BUILTIN(__builtin_amdgcn_ds_gws_sema_p, "vUi", "n")			BUILTIN(__builtin_amdgcn_ds_gws_sema_p, "vUi", "n")
	BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")			BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")

clang/test/CodeGenOpenCL/builtins-amdgcn.cl

	void test_sched_barrier()			void test_sched_barrier()
	{			{
	__builtin_amdgcn_sched_barrier(0);			__builtin_amdgcn_sched_barrier(0);
	__builtin_amdgcn_sched_barrier(1);			__builtin_amdgcn_sched_barrier(1);
	__builtin_amdgcn_sched_barrier(4);			__builtin_amdgcn_sched_barrier(4);
	__builtin_amdgcn_sched_barrier(15);			__builtin_amdgcn_sched_barrier(15);
	}			}

				// CHECK-LABEL: @test_sched_group_barrier
				// CHECK: call void @llvm.amdgcn.sched.group.barrier(i32 0, i32 1, i32 2)
				// CHECK: call void @llvm.amdgcn.sched.group.barrier(i32 1, i32 2, i32 4)
				// CHECK: call void @llvm.amdgcn.sched.group.barrier(i32 4, i32 8, i32 16)
				// CHECK: call void @llvm.amdgcn.sched.group.barrier(i32 15, i32 10000, i32 -1)
				void test_sched_group_barrier()
				{
				__builtin_amdgcn_sched_group_barrier(0, 1, 2);
				__builtin_amdgcn_sched_group_barrier(1, 2, 4);
				__builtin_amdgcn_sched_group_barrier(4, 8, 16);
				__builtin_amdgcn_sched_group_barrier(15, 10000, -1);
				}

	// CHECK-LABEL: @test_s_sleep			// CHECK-LABEL: @test_s_sleep
	// CHECK: call void @llvm.amdgcn.s.sleep(i32 1)			// CHECK: call void @llvm.amdgcn.s.sleep(i32 1)
	// CHECK: call void @llvm.amdgcn.s.sleep(i32 15)			// CHECK: call void @llvm.amdgcn.s.sleep(i32 15)
	void test_s_sleep()			void test_s_sleep()
	{			{
	__builtin_amdgcn_s_sleep(1);			__builtin_amdgcn_s_sleep(1);
	__builtin_amdgcn_s_sleep(15);			__builtin_amdgcn_s_sleep(15);
	}			}

clang/test/SemaOpenCL/builtins-amdgcn-error.cl

void test_s_setprio(int x)
__builtin_amdgcn_s_setprio(65536); // expected-warning {{implicit conversion from 'int' to 'short' changes value from 65536 to 0}}		__builtin_amdgcn_s_setprio(65536); // expected-warning {{implicit conversion from 'int' to 'short' changes value from 65536 to 0}}
}		}

void test_sched_barrier(int x)		void test_sched_barrier(int x)
{		{
__builtin_amdgcn_sched_barrier(x); // expected-error {{argument to '__builtin_amdgcn_sched_barrier' must be a constant integer}}		__builtin_amdgcn_sched_barrier(x); // expected-error {{argument to '__builtin_amdgcn_sched_barrier' must be a constant integer}}
}		}

		void test_sched_group_barrier(int x)
		{
		__builtin_amdgcn_sched_group_barrier(x, 0, 1); // expected-error {{argument to '__builtin_amdgcn_sched_group_barrier' must be a constant integer}}
		}

void test_sicmp_i32(global ulong* out, int a, int b, uint c)		void test_sicmp_i32(global ulong* out, int a, int b, uint c)
{		{
*out = __builtin_amdgcn_sicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_sicmp' must be a constant integer}}		*out = __builtin_amdgcn_sicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_sicmp' must be a constant integer}}
}		}

void test_uicmp_i32(global ulong* out, uint a, uint b, uint c)		void test_uicmp_i32(global ulong* out, uint a, uint b, uint c)
{		{
*out = __builtin_amdgcn_uicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_uicmp' must be a constant integer}}		*out = __builtin_amdgcn_uicmp(a, b, c); // expected-error {{argument to '__builtin_amdgcn_uicmp' must be a constant integer}}

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

	// MASK = 0x0000 0040: VMEM write instructions may be scheduled across SCHED_BARRIER.			// MASK = 0x0000 0040: VMEM write instructions may be scheduled across SCHED_BARRIER.
	// MASK = 0x0000 0080: ALL DS instructions may be scheduled across SCHED_BARRIER.			// MASK = 0x0000 0080: ALL DS instructions may be scheduled across SCHED_BARRIER.
	// MASK = 0x0000 0100: ALL DS read instructions may be scheduled across SCHED_BARRIER.			// MASK = 0x0000 0100: ALL DS read instructions may be scheduled across SCHED_BARRIER.
	// MASK = 0x0000 0200: ALL DS write instructions may be scheduled across SCHED_BARRIER.			// MASK = 0x0000 0200: ALL DS write instructions may be scheduled across SCHED_BARRIER.
	def int_amdgcn_sched_barrier : GCCBuiltin<"__builtin_amdgcn_sched_barrier">,			def int_amdgcn_sched_barrier : GCCBuiltin<"__builtin_amdgcn_sched_barrier">,
	Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects, IntrConvergent,			Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects, IntrConvergent,
	IntrWillReturn]>;			IntrWillReturn]>;

				// The first parameter is a mask that determines the types of instructions that
				// you would like to synchronize around and add to a scheduling group. The
				// values of the mask are defined above for sched_barrier. These instructions
				// will be selected from the bottom up starting from the sched_group_barrier's
				// location during instruction scheduling. The second parameter is the number of
				// matching instructions that will be associated with this sched_group_barrier.
				// The third parameter is an identifier which is used to describe what other
				// sched_group_barriers should be synchronized with.
				def int_amdgcn_sched_group_barrier : GCCBuiltin<"__builtin_amdgcn_sched_group_barrier">,
				Intrinsic<[], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
				[ImmArg<ArgIndex<0>>, ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>, IntrNoMem, IntrHasSideEffects,
				IntrConvergent, IntrWillReturn]>;

	def int_amdgcn_s_waitcnt : GCCBuiltin<"__builtin_amdgcn_s_waitcnt">,			def int_amdgcn_s_waitcnt : GCCBuiltin<"__builtin_amdgcn_s_waitcnt">,
	Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects, IntrWillReturn]>;			Intrinsic<[], [llvm_i32_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, IntrHasSideEffects, IntrWillReturn]>;

	def int_amdgcn_div_scale : Intrinsic<			def int_amdgcn_div_scale : Intrinsic<
	// 1st parameter: Numerator			// 1st parameter: Numerator
	// 2nd parameter: Denominator			// 2nd parameter: Denominator
	// 3rd parameter: Select quotient. Must equal Numerator or Denominator.			// 3rd parameter: Select quotient. Must equal Numerator or Denominator.
	// (0 = Denominator, 1 = Numerator).			// (0 = Denominator, 1 = Numerator).

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPUIGroupLP.h"		#include "AMDGPUIGroupLP.h"
#include "AMDGPUTargetMachine.h"		#include "AMDGPUTargetMachine.h"
#include "MCTargetDesc/AMDGPUMCTargetDesc.h"		#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "SIMachineFunctionInfo.h"		#include "SIMachineFunctionInfo.h"
#include "llvm/ADT/BitmaskEnum.h"		#include "llvm/ADT/BitmaskEnum.h"
		#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineScheduler.h"		#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/CodeGen/TargetOpcodes.h"		#include "llvm/CodeGen/TargetOpcodes.h"

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "machine-scheduler"		#define DEBUG_TYPE "machine-scheduler"

namespace {		namespace {
LDRGroupMaxSize("amdgpu-igrouplp-ldr-group-size", cl::init(None),
"in lds/gds read group."));		"in lds/gds read group."));

static cl::opt<Optional<unsigned>>		static cl::opt<Optional<unsigned>>
LDWGroupMaxSize("amdgpu-igrouplp-ldw-group-size", cl::init(None),		LDWGroupMaxSize("amdgpu-igrouplp-ldw-group-size", cl::init(None),
cl::Hidden,		cl::Hidden,
cl::desc("The maximum number of instructions to include "		cl::desc("The maximum number of instructions to include "
"in lds/gds write group."));		"in lds/gds write group."));

typedef function_ref<bool(const MachineInstr &, const SIInstrInfo *)>		// Components of the mask that determines which instruction types may be
CanAddMIFn;		// classified into a SchedGroup.
		enum class SchedGroupMask {
		NONE = 0u,
		ALU = 1u << 0,
		VALU = 1u << 1,
		SALU = 1u << 2,
		MFMA = 1u << 3,
		VMEM = 1u << 4,
		VMEM_READ = 1u << 5,
		VMEM_WRITE = 1u << 6,
		DS = 1u << 7,
		DS_READ = 1u << 8,
		DS_WRITE = 1u << 9,
		ALL = ALU | VALU | SALU | MFMA | VMEM | VMEM_READ | VMEM_WRITE | DS |
		DS_READ | DS_WRITE,
		LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ ALL)
		};

// Classify instructions into groups to enable fine tuned control over the		// Classify instructions into groups to enable fine tuned control over the
// scheduler. These groups may be more specific than current SchedModel		// scheduler. These groups may be more specific than current SchedModel
// instruction classes.		// instruction classes.
class SchedGroup {		class SchedGroup {
private:		private:
// Function that returns true if a non-bundle MI may be inserted into this		// Mask that defines which instruction types can be classified into this
// group.		// SchedGroup. The instruction types correspond to the mask from SCHED_BARRIER
const CanAddMIFn canAddMI;		// and SCHED_GROUP_BARRIER.
		SchedGroupMask SGMask;

// Maximum number of SUnits that can be added to this group.		// Maximum number of SUnits that can be added to this group.
Optional<unsigned> MaxSize;		Optional<unsigned> MaxSize;

		// SchedGroups will only synchronize with other SchedGroups that have the same
		// SyncID.
		int SyncID = 0;

// Collection of SUnits that are classified as members of this group.		// Collection of SUnits that are classified as members of this group.
SmallVector<SUnit *, 32> Collection;		SmallVector<SUnit *, 32> Collection;

ScheduleDAGInstrs *DAG;		ScheduleDAGInstrs *DAG;

void tryAddEdge(SUnit *A, SUnit *B) {		const SIInstrInfo *TII;

		// Try to add an edge from SU A to SU B.
		bool tryAddEdge(SUnit *A, SUnit *B);

		// Use SGMask to determine whether we can classify MI as a member of this
		// SchedGroup object.
		bool canAddMI(const MachineInstr &MI) const;

		// Returns true if SU can be added to this SchedGroup.
		bool canAddSU(SUnit &SU) const;

		// Returns true if no more instructions may be added to this group.
		bool isFull() const;

		// Add SU to the SchedGroup.
		void add(SUnit &SU) { Collection.push_back(&SU); }

		public:
		// Add DAG dependencies from all SUnits in this SchedGroup and this SU. If
		// MakePred is true, SU will be a predecessor of the SUnits in this
		// SchedGroup, otherwise SU will be a successor.
		void link(SUnit &SU, bool MakePred = false);

		// Add DAG dependencies from all SUnits in this SchedGroup and this SU. Use
		// the predicate to determine whether SU should be a predecessor (P = true)
		// or a successor (P = false) of this SchedGroup.
		void link(SUnit &SU, function_ref<bool(const SUnit *A, const SUnit *B)> P);

		// Add DAG dependencies such that SUnits in this group shall be ordered
		// before SUnits in OtherGroup.
		void link(SchedGroup &OtherGroup);

		// Identify and add all relevant SUs from the DAG to this SchedGroup.
		void initSchedGroup();

		// Add instructions to the SchedGroup bottom up starting from RIter.
		// ConflictedInstrs is a set of instructions that should not be added to the
		// SchedGroup even when the other conditions for adding it are satisfied.
		// RIter will be added to the SchedGroup as well, and dependencies will be
		// added so that RIter will always be scheduled at the end of the group.
		void initSchedGroup(std::vector<SUnit>::reverse_iterator RIter,
		DenseSet<SUnit *> &ConflictedInstrs);

		int getSyncID() { return SyncID; }

		SchedGroup(SchedGroupMask SGMask, Optional<unsigned> MaxSize,
		ScheduleDAGInstrs *DAG, const SIInstrInfo *TII)
		: SGMask(SGMask), MaxSize(MaxSize), DAG(DAG), TII(TII) {}

		SchedGroup(SchedGroupMask SGMask, Optional<unsigned> MaxSize, int SyncID,
		ScheduleDAGInstrs *DAG, const SIInstrInfo *TII)
		: SGMask(SGMask), MaxSize(MaxSize), SyncID(SyncID), DAG(DAG), TII(TII) {}
		};

		class IGroupLPDAGMutation : public ScheduleDAGMutation {
		public:
		const SIInstrInfo *TII;
		ScheduleDAGMI *DAG;

		IGroupLPDAGMutation() = default;
		void apply(ScheduleDAGInstrs *DAGInstrs) override;
		};

		// DAG mutation that coordinates with the SCHED_BARRIER instruction and
		// corresponding builtin. The mutation adds edges from specific instruction
		// classes determined by the SCHED_BARRIER mask so that they cannot be
		class SchedBarrierDAGMutation : public ScheduleDAGMutation {
		private:
		const SIInstrInfo *TII;

		ScheduleDAGMI *DAG;

		// Organize lists of SchedGroups by their SyncID. SchedGroups /
		// SCHED_GROUP_BARRIERs with different SyncIDs will have no edges added
		// between them.
		DenseMap<int, SmallVector<SchedGroup, 4>> SyncedSchedGroupsMap;

		// Used to track instructions that are already added to a different
		// SchedGroup with the same SyncID.
		DenseMap<int, DenseSet<SUnit *>> SyncedInstrsMap;

		// Add DAG edges that enforce SCHED_BARRIER ordering.
		void addSchedBarrierEdges(SUnit &SU);

		// Use a SCHED_BARRIER's mask to identify instruction SchedGroups that should
		// not be reordered across the SCHED_BARRIER. This is used for the base
		// SCHED_BARRIER, and not SCHED_GROUP_BARRIER. The difference is that
		// SCHED_BARRIER will always block all instructions that can be classified
		// into a particular SchedClass, whereas SCHED_GROUP_BARRIER has a fixed size
		// and may only synchronize with some SchedGroups. Returns the inverse of
		// Mask. SCHED_BARRIER's mask describes which instruction types should be
		// allowed to be scheduled across it. Invert the mask to get the
		// SchedGroupMask of instructions that should be barred.
		SchedGroupMask invertSchedBarrierMask(SchedGroupMask Mask) const;

		// Create SchedGroups for a SCHED_GROUP_BARRIER.
		void initSchedGroupBarrier(std::vector<SUnit>::reverse_iterator RIter);

		// Add DAG edges that try to enforce ordering defined by SCHED_GROUP_BARRIER
		// instructions.
		void addSchedGroupBarrierEdges();

		public:
		void apply(ScheduleDAGInstrs *DAGInstrs) override;

		SchedBarrierDAGMutation() = default;
		};

		bool SchedGroup::tryAddEdge(SUnit *A, SUnit *B) {
if (A != B && DAG->canAddEdge(B, A)) {		if (A != B && DAG->canAddEdge(B, A)) {
DAG->addEdge(B, SDep(A, SDep::Artificial));		DAG->addEdge(B, SDep(A, SDep::Artificial));
LLVM_DEBUG(dbgs() << "Adding edge...\n"		LLVM_DEBUG(dbgs() << "Adding edge...\n"
<< "from: SU(" << A->NodeNum << ") " << *A->getInstr()		<< "from: SU(" << A->NodeNum << ") " << *A->getInstr()
<< "to: SU(" << B->NodeNum << ") " << *B->getInstr());		<< "to: SU(" << B->NodeNum << ") " << *B->getInstr());
		return true;
}		}
		return false;
}		}

public:		bool SchedGroup::canAddMI(const MachineInstr &MI) const {
// Add DAG dependencies from all SUnits in this SchedGroup and this SU. If		bool Result = false;
// MakePred is true, SU will be a predecessor of the SUnits in this		if (MI.isMetaInstruction())
// SchedGroup, otherwise SU will be a successor.		Result = false;
void link(SUnit &SU, bool MakePred = false) {
		else if (((SGMask & SchedGroupMask::ALU) != SchedGroupMask::NONE) &&
		(TII->isVALU(MI) || TII->isMFMA(MI) || TII->isSALU(MI)))
		Result = true;

		else if (((SGMask & SchedGroupMask::VALU) != SchedGroupMask::NONE) &&
		TII->isVALU(MI) && !TII->isMFMA(MI))
		Result = true;

		else if (((SGMask & SchedGroupMask::SALU) != SchedGroupMask::NONE) &&
		TII->isSALU(MI))
		Result = true;

		else if (((SGMask & SchedGroupMask::MFMA) != SchedGroupMask::NONE) &&
		TII->isMFMA(MI))
		Result = true;

		else if (((SGMask & SchedGroupMask::VMEM) != SchedGroupMask::NONE) &&
		(TII->isVMEM(MI) || (TII->isFLAT(MI) && !TII->isDS(MI))))
		Result = true;

		else if (((SGMask & SchedGroupMask::VMEM_READ) != SchedGroupMask::NONE) &&
		MI.mayLoad() &&
		(TII->isVMEM(MI) || (TII->isFLAT(MI) && !TII->isDS(MI))))
		Result = true;

		else if (((SGMask & SchedGroupMask::VMEM_WRITE) != SchedGroupMask::NONE) &&
		MI.mayStore() &&
		(TII->isVMEM(MI) || (TII->isFLAT(MI) && !TII->isDS(MI))))
		Result = true;

		else if (((SGMask & SchedGroupMask::DS) != SchedGroupMask::NONE) &&
		TII->isDS(MI))
		Result = true;

		else if (((SGMask & SchedGroupMask::DS_READ) != SchedGroupMask::NONE) &&
		MI.mayLoad() && TII->isDS(MI))
		Result = true;

		else if (((SGMask & SchedGroupMask::DS_WRITE) != SchedGroupMask::NONE) &&
		MI.mayStore() && TII->isDS(MI))
		Result = true;

		LLVM_DEBUG(dbgs() << "For SchedGroup with mask "
		<< format_hex((int)SGMask, 10, true)
		<< (Result ? " added " : " unable to add ") << MI);

		return Result;
		}

		void SchedGroup::link(SUnit &SU, bool MakePred) {
for (auto A : Collection) {		for (auto A : Collection) {
SUnit *B = &SU;		SUnit *B = &SU;
if (MakePred)		if (MakePred)
std::swap(A, B);		std::swap(A, B);

tryAddEdge(A, B);		tryAddEdge(A, B);
}		}
}		}

// Add DAG dependencies from all SUnits in this SchedGroup and this SU. Use		void SchedGroup::link(SUnit &SU,
// the predicate to determine whether SU should be a predecessor (P = true)		function_ref<bool(const SUnit *A, const SUnit *B)> P) {
// or a successor (P = false) of this SchedGroup.
void link(SUnit &SU, function_ref<bool(const SUnit *A, const SUnit *B)> P) {
for (auto A : Collection) {		for (auto A : Collection) {
SUnit *B = &SU;		SUnit *B = &SU;
if (P(A, B))		if (P(A, B))
std::swap(A, B);		std::swap(A, B);

tryAddEdge(A, B);		tryAddEdge(A, B);
}		}
}		}

// Add DAG dependencies such that SUnits in this group shall be ordered		void SchedGroup::link(SchedGroup &OtherGroup) {
// before SUnits in OtherGroup.
void link(SchedGroup &OtherGroup) {
for (auto B : OtherGroup.Collection)		for (auto B : OtherGroup.Collection)
link(*B);		link(*B);
}		}

// Returns true if no more instructions may be added to this group.		bool SchedGroup::isFull() const {
bool isFull() { return MaxSize.hasValue() && Collection.size() >= *MaxSize; }		return MaxSize.hasValue() && Collection.size() >= *MaxSize;
		}
// Returns true if SU can be added to this SchedGroup.
bool canAddSU(SUnit &SU, const SIInstrInfo *TII) {
if (isFull())
return false;

		bool SchedGroup::canAddSU(SUnit &SU) const {
MachineInstr &MI = *SU.getInstr();		MachineInstr &MI = *SU.getInstr();
if (MI.getOpcode() != TargetOpcode::BUNDLE)		if (MI.getOpcode() != TargetOpcode::BUNDLE)
return canAddMI(MI, TII);		return canAddMI(MI);

// Special case for bundled MIs.		// Special case for bundled MIs.
const MachineBasicBlock *MBB = MI.getParent();		const MachineBasicBlock *MBB = MI.getParent();
MachineBasicBlock::instr_iterator B = MI.getIterator(), E = ++B;		MachineBasicBlock::instr_iterator B = MI.getIterator(), E = ++B;
while (E != MBB->end() && E->isBundledWithPred())		while (E != MBB->end() && E->isBundledWithPred())
++E;		++E;

// Return true if all of the bundled MIs can be added to this group.		// Return true if all of the bundled MIs can be added to this group.
return std::all_of(		return std::all_of(B, E, [this](MachineInstr &MI) { return canAddMI(MI); });
B, E, [this, TII](MachineInstr &MI) { return canAddMI(MI, TII); });
}		}

void add(SUnit &SU) { Collection.push_back(&SU); }		void SchedGroup::initSchedGroup() {
		for (auto &SU : DAG->SUnits) {
SchedGroup(CanAddMIFn canAddMI, Optional<unsigned> MaxSize,		if (isFull())
ScheduleDAGInstrs *DAG)		break;
: canAddMI(canAddMI), MaxSize(MaxSize), DAG(DAG) {}
};

bool isMFMASGMember(const MachineInstr &MI, const SIInstrInfo *TII) {		if (canAddSU(SU))
return TII->isMFMA(MI);		add(SU);
}		}

bool isVALUSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {
return TII->isVALU(MI) && !TII->isMFMA(MI);
}		}

bool isSALUSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {		void SchedGroup::initSchedGroup(std::vector<SUnit>::reverse_iterator RIter,
return TII->isSALU(MI);		DenseSet<SUnit *> &UsedInstrs) {
}		SUnit &InitSU = *RIter;
		for (auto E = DAG->SUnits.rend(); RIter != E; ++RIter) {
		auto &SU = *RIter;
		if (isFull())
		break;

bool isVMEMSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {		if (!UsedInstrs.count(&SU) && canAddSU(SU)) {
return TII->isVMEM(MI) || (TII->isFLAT(MI) && !TII->isDS(MI));		add(SU);
		UsedInstrs.insert(&SU);
}		}

bool isVMEMReadSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {
return MI.mayLoad() &&
(TII->isVMEM(MI) || (TII->isFLAT(MI) && !TII->isDS(MI)));
}		}

bool isVMEMWriteSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {		add(InitSU);
return MI.mayStore() &&		assert(MaxSize.hasValue());
(TII->isVMEM(MI) || (TII->isFLAT(MI) && !TII->isDS(MI)));		(*MaxSize)++;
}

bool isDSWriteSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {		link(InitSU);
return MI.mayStore() && TII->isDS(MI);
}		}

bool isDSReadSGMember(const MachineInstr &MI, const SIInstrInfo *TII) {		// Create a pipeline from the SchedGroups in PipelineOrderGroups such that we
return MI.mayLoad() && TII->isDS(MI);		// try to enforce the relative ordering of instructions in each group.
		static void makePipeline(SmallVectorImpl<SchedGroup> &PipelineOrderGroups) {
		auto I = PipelineOrderGroups.begin();
		auto E = PipelineOrderGroups.end();
		for (; I != E; ++I) {
		auto &GroupA = *I;
		for (auto J = std::next(I); J != E; ++J) {
		auto &GroupB = *J;
		GroupA.link(GroupB);
		}
		}
}		}

class IGroupLPDAGMutation : public ScheduleDAGMutation {		// Same as makePipeline but with reverse ordering.
public:		static void
const SIInstrInfo *TII;		makeReversePipeline(SmallVectorImpl<SchedGroup> &PipelineOrderGroups) {
ScheduleDAGMI *DAG;		auto I = PipelineOrderGroups.rbegin();
		auto E = PipelineOrderGroups.rend();
IGroupLPDAGMutation() = default;		for (; I != E; ++I) {
void apply(ScheduleDAGInstrs *DAGInstrs) override;		auto &GroupA = *I;
};		for (auto J = std::next(I); J != E; ++J) {
		auto &GroupB = *J;
// DAG mutation that coordinates with the SCHED_BARRIER instruction and		GroupA.link(GroupB);
// corresponding builtin. The mutation adds edges from specific instruction		}
// classes determined by the SCHED_BARRIER mask so that they cannot be		}
// scheduled around the SCHED_BARRIER.		}
class SchedBarrierDAGMutation : public ScheduleDAGMutation {
private:
const SIInstrInfo *TII;

ScheduleDAGMI *DAG;

// Components of the mask that determines which instructions may not be
// scheduled across the SCHED_BARRIER.
enum class SchedBarrierMasks {
NONE = 0u,
ALU = 1u << 0,
VALU = 1u << 1,
SALU = 1u << 2,
MFMA = 1u << 3,
VMEM = 1u << 4,
VMEM_READ = 1u << 5,
VMEM_WRITE = 1u << 6,
DS = 1u << 7,
DS_READ = 1u << 8,
DS_WRITE = 1u << 9,
LLVM_MARK_AS_BITMASK_ENUM(/* LargestFlag = */ DS_WRITE)
};

// Cache SchedGroups of each type if we have multiple SCHED_BARRIERs in a
// region.
//
std::unique_ptr<SchedGroup> MFMASchedGroup = nullptr;
std::unique_ptr<SchedGroup> VALUSchedGroup = nullptr;
std::unique_ptr<SchedGroup> SALUSchedGroup = nullptr;
std::unique_ptr<SchedGroup> VMEMReadSchedGroup = nullptr;
std::unique_ptr<SchedGroup> VMEMWriteSchedGroup = nullptr;
std::unique_ptr<SchedGroup> DSWriteSchedGroup = nullptr;
std::unique_ptr<SchedGroup> DSReadSchedGroup = nullptr;

// Use a SCHED_BARRIER's mask to identify instruction SchedGroups that should
// not be reordered across the SCHED_BARRIER.
void getSchedGroupsFromMask(int32_t Mask,
SmallVectorImpl<SchedGroup *> &SchedGroups);

// Add DAG edges that enforce SCHED_BARRIER ordering.
void addSchedBarrierEdges(SUnit &SU);

// Classify instructions and add them to the SchedGroup.
void initSchedGroup(SchedGroup *SG);

// Remove all existing edges from a SCHED_BARRIER.
void resetSchedBarrierEdges(SUnit &SU);

public:
void apply(ScheduleDAGInstrs *DAGInstrs) override;

SchedBarrierDAGMutation() = default;
};

void IGroupLPDAGMutation::apply(ScheduleDAGInstrs *DAGInstrs) {		void IGroupLPDAGMutation::apply(ScheduleDAGInstrs *DAGInstrs) {
const GCNSubtarget &ST = DAGInstrs->MF.getSubtarget<GCNSubtarget>();		const GCNSubtarget &ST = DAGInstrs->MF.getSubtarget<GCNSubtarget>();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();
DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);		DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);
const TargetSchedModel *TSchedModel = DAGInstrs->getSchedModel();		const TargetSchedModel *TSchedModel = DAGInstrs->getSchedModel();
if (!TSchedModel || DAG->SUnits.empty())		if (!TSchedModel || DAG->SUnits.empty())
return;		return;

LLVM_DEBUG(dbgs() << "Applying IGroupLPDAGMutation...\n");		LLVM_DEBUG(dbgs() << "Applying IGroupLPDAGMutation...\n");

// The order of InstructionGroups in this vector defines the		// The order of InstructionGroups in this vector defines the
// order in which edges will be added. In other words, given the		// order in which edges will be added. In other words, given the
// present ordering, we will try to make each VMEMRead instruction		// present ordering, we will try to make each VMEMRead instruction
// a predecessor of each DSRead instruction, and so on.		// a predecessor of each DSRead instruction, and so on.
SmallVector<SchedGroup, 4> PipelineOrderGroups = {		SmallVector<SchedGroup, 4> PipelineOrderGroups = {
SchedGroup(isVMEMSGMember, VMEMGroupMaxSize, DAG),		SchedGroup(SchedGroupMask::VMEM, VMEMGroupMaxSize, DAG, TII),
SchedGroup(isDSReadSGMember, LDRGroupMaxSize, DAG),		SchedGroup(SchedGroupMask::DS_READ, LDRGroupMaxSize, DAG, TII),
SchedGroup(isMFMASGMember, MFMAGroupMaxSize, DAG),		SchedGroup(SchedGroupMask::MFMA, MFMAGroupMaxSize, DAG, TII),
SchedGroup(isDSWriteSGMember, LDWGroupMaxSize, DAG)};		SchedGroup(SchedGroupMask::DS_WRITE, LDWGroupMaxSize, DAG, TII)};

for (SUnit &SU : DAG->SUnits) {
LLVM_DEBUG(dbgs() << "Checking Node"; DAG->dumpNode(SU));
for (auto &SG : PipelineOrderGroups)		for (auto &SG : PipelineOrderGroups)
if (SG.canAddSU(SU, TII))		SG.initSchedGroup();
SG.add(SU);
}

for (unsigned i = 0; i < PipelineOrderGroups.size() - 1; i++) {		makePipeline(PipelineOrderGroups);
auto &GroupA = PipelineOrderGroups[i];
for (unsigned j = i + 1; j < PipelineOrderGroups.size(); j++) {
auto &GroupB = PipelineOrderGroups[j];
GroupA.link(GroupB);
}
}		}

		// Remove all existing edges from a SCHED_BARRIER or SCHED_GROUP_BARRIER.
		static void resetEdges(SUnit &SU, ScheduleDAGInstrs *DAG) {
		assert(SU.getInstr()->getOpcode() == AMDGPU::SCHED_BARRIER ||
		SU.getInstr()->getOpcode() == AMDGPU::SCHED_GROUP_BARRIER);

		while (!SU.Preds.empty())
		for (auto &P : SU.Preds)
		SU.removePred(P);

		while (!SU.Succs.empty())
		for (auto &S : SU.Succs)
		for (auto &SP : S.getSUnit()->Preds)
		if (SP.getSUnit() == &SU)
		S.getSUnit()->removePred(SP);
}		}

void SchedBarrierDAGMutation::apply(ScheduleDAGInstrs *DAGInstrs) {		void SchedBarrierDAGMutation::apply(ScheduleDAGInstrs *DAGInstrs) {
const TargetSchedModel *TSchedModel = DAGInstrs->getSchedModel();		const TargetSchedModel *TSchedModel = DAGInstrs->getSchedModel();
if (!TSchedModel || DAGInstrs->SUnits.empty())		if (!TSchedModel || DAGInstrs->SUnits.empty())
return;		return;

LLVM_DEBUG(dbgs() << "Applying SchedBarrierDAGMutation...\n");		LLVM_DEBUG(dbgs() << "Applying SchedBarrierDAGMutation...\n");

const GCNSubtarget &ST = DAGInstrs->MF.getSubtarget<GCNSubtarget>();		const GCNSubtarget &ST = DAGInstrs->MF.getSubtarget<GCNSubtarget>();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();
DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);		DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);
for (auto &SU : DAG->SUnits)		for (auto R = DAG->SUnits.rbegin(), E = DAG->SUnits.rend(); R != E; ++R) {
if (SU.getInstr()->getOpcode() == AMDGPU::SCHED_BARRIER)		if (R->getInstr()->getOpcode() == AMDGPU::SCHED_BARRIER)
addSchedBarrierEdges(SU);		addSchedBarrierEdges(*R);

		else if (R->getInstr()->getOpcode() == AMDGPU::SCHED_GROUP_BARRIER)
		initSchedGroupBarrier(R);
		}

		// SCHED_GROUP_BARRIER edges can only be added after we have found and
		// initialized all of the SCHED_GROUP_BARRIER SchedGroups.
		addSchedGroupBarrierEdges();
		jrbyrnes: If both types of barriers are present, the SchedBarriers are handled first. However, if there is a conflict between SchedBarrier and SchedGroupBarrier, should SchedBarrier always get the priority? Maybe SchedBarrier should only handle groups not present in SchedGroupBarrier?
}		}

void SchedBarrierDAGMutation::addSchedBarrierEdges(SUnit &SchedBarrier) {		void SchedBarrierDAGMutation::addSchedBarrierEdges(SUnit &SchedBarrier) {
MachineInstr &MI = *SchedBarrier.getInstr();		MachineInstr &MI = *SchedBarrier.getInstr();
assert(MI.getOpcode() == AMDGPU::SCHED_BARRIER);		assert(MI.getOpcode() == AMDGPU::SCHED_BARRIER);
// Remove all existing edges from the SCHED_BARRIER that were added due to the		// Remove all existing edges from the SCHED_BARRIER that were added due to the
// instruction having side effects.		// instruction having side effects.
resetSchedBarrierEdges(SchedBarrier);		resetEdges(SchedBarrier, DAG);
SmallVector<SchedGroup *, 4> SchedGroups;		auto InvertedMask =
int32_t Mask = MI.getOperand(0).getImm();		invertSchedBarrierMask((SchedGroupMask)MI.getOperand(0).getImm());
getSchedGroupsFromMask(Mask, SchedGroups);		SchedGroup SG(InvertedMask, None, DAG, TII);
for (auto SG : SchedGroups)		SG.initSchedGroup();
SG->link(		// Preserve original instruction ordering relative to the SCHED_BARRIER.
SchedBarrier, (function_ref<bool(const SUnit *A, const SUnit *B)>)[](		SG.link(
const SUnit *A, const SUnit *B) {		SchedBarrier,
return A->NodeNum > B->NodeNum;		(function_ref<bool(const SUnit *A, const SUnit *B)>)[](
});		const SUnit *A, const SUnit *B) { return A->NodeNum > B->NodeNum; });
}		}

void SchedBarrierDAGMutation::getSchedGroupsFromMask(		SchedGroupMask
int32_t Mask, SmallVectorImpl<SchedGroup *> &SchedGroups) {		SchedBarrierDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
SchedBarrierMasks SBMask = (SchedBarrierMasks)Mask;		// Invert mask and erase bits for types of instructions that are implied to be
// See IntrinsicsAMDGPU.td for an explanation of these masks and their		// allowed past the SCHED_BARRIER.
// mappings.		SchedGroupMask InvertedMask = ~Mask;
//
if ((SBMask & SchedBarrierMasks::VALU) == SchedBarrierMasks::NONE &&		// ALU implies VALU, SALU, MFMA.
(SBMask & SchedBarrierMasks::ALU) == SchedBarrierMasks::NONE) {		if ((InvertedMask & SchedGroupMask::ALU) == SchedGroupMask::NONE)
if (!VALUSchedGroup) {		InvertedMask &=
VALUSchedGroup = std::make_unique<SchedGroup>(isVALUSGMember, None, DAG);		~SchedGroupMask::VALU & ~SchedGroupMask::SALU & ~SchedGroupMask::MFMA;
initSchedGroup(VALUSchedGroup.get());		// VALU, SALU, MFMA implies ALU.
}		else if ((InvertedMask & SchedGroupMask::VALU) == SchedGroupMask::NONE ||
		(InvertedMask & SchedGroupMask::SALU) == SchedGroupMask::NONE ||
SchedGroups.push_back(VALUSchedGroup.get());		(InvertedMask & SchedGroupMask::MFMA) == SchedGroupMask::NONE)
}		InvertedMask &= ~SchedGroupMask::ALU;

if ((SBMask & SchedBarrierMasks::SALU) == SchedBarrierMasks::NONE &&		// VMEM implies VMEM_READ, VMEM_WRITE.
(SBMask & SchedBarrierMasks::ALU) == SchedBarrierMasks::NONE) {		if ((InvertedMask & SchedGroupMask::VMEM) == SchedGroupMask::NONE)
if (!SALUSchedGroup) {		InvertedMask &= ~SchedGroupMask::VMEM_READ & ~SchedGroupMask::VMEM_WRITE;
SALUSchedGroup = std::make_unique<SchedGroup>(isSALUSGMember, None, DAG);		// VMEM_READ, VMEM_WRITE implies VMEM.
initSchedGroup(SALUSchedGroup.get());		else if ((InvertedMask & SchedGroupMask::VMEM_READ) == SchedGroupMask::NONE ||
}		(InvertedMask & SchedGroupMask::VMEM_WRITE) == SchedGroupMask::NONE)
		InvertedMask &= ~SchedGroupMask::VMEM;
SchedGroups.push_back(SALUSchedGroup.get());
}		// DS implies DS_READ, DS_WRITE.
		if ((InvertedMask & SchedGroupMask::DS) == SchedGroupMask::NONE)
if ((SBMask & SchedBarrierMasks::MFMA) == SchedBarrierMasks::NONE &&		InvertedMask &= ~SchedGroupMask::DS_READ & ~SchedGroupMask::DS_WRITE;
(SBMask & SchedBarrierMasks::ALU) == SchedBarrierMasks::NONE) {		// DS_READ, DS_WRITE implies DS.
if (!MFMASchedGroup) {		else if ((InvertedMask & SchedGroupMask::DS_READ) == SchedGroupMask::NONE ||
MFMASchedGroup = std::make_unique<SchedGroup>(isMFMASGMember, None, DAG);		(InvertedMask & SchedGroupMask::DS_WRITE) == SchedGroupMask::NONE)
initSchedGroup(MFMASchedGroup.get());		InvertedMask &= ~SchedGroupMask::DS;
}
		return InvertedMask;
SchedGroups.push_back(MFMASchedGroup.get());		}
}
		void SchedBarrierDAGMutation::initSchedGroupBarrier(
if ((SBMask & SchedBarrierMasks::VMEM_READ) == SchedBarrierMasks::NONE &&		std::vector<SUnit>::reverse_iterator RIter) {
(SBMask & SchedBarrierMasks::VMEM) == SchedBarrierMasks::NONE) {		// Remove all existing edges from the SCHED_GROUP_BARRIER that were added due
if (!VMEMReadSchedGroup) {		// to the instruction having side effects.
VMEMReadSchedGroup =		resetEdges(*RIter, DAG);
std::make_unique<SchedGroup>(isVMEMReadSGMember, None, DAG);		MachineInstr &SGB = *RIter->getInstr();
initSchedGroup(VMEMReadSchedGroup.get());		assert(SGB.getOpcode() == AMDGPU::SCHED_GROUP_BARRIER);
}		int32_t SGMask = SGB.getOperand(0).getImm();
		int32_t Size = SGB.getOperand(1).getImm();
SchedGroups.push_back(VMEMReadSchedGroup.get());		int32_t SyncID = SGB.getOperand(2).getImm();
}		// Create a new SchedGroup and add it to a list that is mapped to the SyncID.
		// SchedGroups only enforce ordering between SchedGroups with the same SyncID.
if ((SBMask & SchedBarrierMasks::VMEM_WRITE) == SchedBarrierMasks::NONE &&		auto &SG = SyncedSchedGroupsMap[SyncID].emplace_back((SchedGroupMask)SGMask,
(SBMask & SchedBarrierMasks::VMEM) == SchedBarrierMasks::NONE) {		Size, SyncID, DAG, TII);
if (!VMEMWriteSchedGroup) {
VMEMWriteSchedGroup =		// SyncedInstrsMap is used here to avoid adding the same SUs in
std::make_unique<SchedGroup>(isVMEMWriteSGMember, None, DAG);		// multiple SchedGroups that have the same SyncID. This only matters for
initSchedGroup(VMEMWriteSchedGroup.get());		// SCHED_GROUP_BARRIER and not SCHED_BARRIER.
}		SG.initSchedGroup(RIter, SyncedInstrsMap[SG.getSyncID()]);
		}
SchedGroups.push_back(VMEMWriteSchedGroup.get());
}		void SchedBarrierDAGMutation::addSchedGroupBarrierEdges() {
		// Since we traversed the DAG in reverse order when initializing
if ((SBMask & SchedBarrierMasks::DS_READ) == SchedBarrierMasks::NONE &&		// SCHED_GROUP_BARRIERs we need to reverse the order in the vector to maintain
(SBMask & SchedBarrierMasks::DS) == SchedBarrierMasks::NONE) {		// user intentions and program order.
if (!DSReadSchedGroup) {		for (auto &SchedGroups : SyncedSchedGroupsMap)
DSReadSchedGroup =		makeReversePipeline(SchedGroups.second);
std::make_unique<SchedGroup>(isDSReadSGMember, None, DAG);
initSchedGroup(DSReadSchedGroup.get());
}

SchedGroups.push_back(DSReadSchedGroup.get());
}

if ((SBMask & SchedBarrierMasks::DS_WRITE) == SchedBarrierMasks::NONE &&
(SBMask & SchedBarrierMasks::DS) == SchedBarrierMasks::NONE) {
if (!DSWriteSchedGroup) {
DSWriteSchedGroup =
std::make_unique<SchedGroup>(isDSWriteSGMember, None, DAG);
initSchedGroup(DSWriteSchedGroup.get());
}

SchedGroups.push_back(DSWriteSchedGroup.get());
}
}

void SchedBarrierDAGMutation::initSchedGroup(SchedGroup *SG) {
assert(SG);
for (auto &SU : DAG->SUnits)
if (SG->canAddSU(SU, TII))
SG->add(SU);
}

void SchedBarrierDAGMutation::resetSchedBarrierEdges(SUnit &SU) {
assert(SU.getInstr()->getOpcode() == AMDGPU::SCHED_BARRIER);
for (auto &P : SU.Preds)
SU.removePred(P);

for (auto &S : SU.Succs) {
for (auto &SP : S.getSUnit()->Preds) {
if (SP.getSUnit() == &SU) {
S.getSUnit()->removePred(SP);
}
}
}
}		}

} // namespace		} // namespace

namespace llvm {		namespace llvm {

std::unique_ptr<ScheduleDAGMutation> createIGroupLPDAGMutation() {		std::unique_ptr<ScheduleDAGMutation> createIGroupLPDAGMutation() {
return EnableIGroupLP ? std::make_unique<IGroupLPDAGMutation>() : nullptr;		return EnableIGroupLP ? std::make_unique<IGroupLPDAGMutation>() : nullptr;
}		}

std::unique_ptr<ScheduleDAGMutation> createSchedBarrierDAGMutation() {		std::unique_ptr<ScheduleDAGMutation> createSchedBarrierDAGMutation() {
return std::make_unique<SchedBarrierDAGMutation>();		return std::make_unique<SchedBarrierDAGMutation>();
}		}

} // end namespace llvm		} // end namespace llvm
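
The mask inversion in invertSchedBarrierMask can be tried out in isolation. The following is a minimal standalone sketch, not the patch's code: it uses plain uint32_t constants instead of the bitmask enum, the bit values are copied from the SchedBarrierMasks enum above and are assumed to be unchanged in SchedGroupMask, and invertMask and ALL are hypothetical names. It needs C++14 or later for the constexpr function body.

```cpp
#include <cstdint>

// Bit values copied from the SchedBarrierMasks enum in the diff; assumed to
// carry over unchanged to SchedGroupMask.
constexpr uint32_t NONE = 0u;
constexpr uint32_t ALU = 1u << 0;
constexpr uint32_t VALU = 1u << 1;
constexpr uint32_t SALU = 1u << 2;
constexpr uint32_t MFMA = 1u << 3;
constexpr uint32_t VMEM = 1u << 4;
constexpr uint32_t VMEM_READ = 1u << 5;
constexpr uint32_t VMEM_WRITE = 1u << 6;
constexpr uint32_t DS = 1u << 7;
constexpr uint32_t DS_READ = 1u << 8;
constexpr uint32_t DS_WRITE = 1u << 9;
constexpr uint32_t ALL = (1u << 10) - 1; // hypothetical: every defined bit

// Hypothetical free-function version of the inversion. The input lists the
// classes allowed to cross the SCHED_BARRIER; the result lists the classes
// that must keep their position relative to it.
constexpr uint32_t invertMask(uint32_t Mask) {
  uint32_t Inv = ~Mask & ALL;

  // ALU implies VALU, SALU, MFMA; any of those three implies ALU.
  if ((Inv & ALU) == NONE)
    Inv &= ~(VALU | SALU | MFMA);
  else if ((Inv & VALU) == NONE || (Inv & SALU) == NONE ||
           (Inv & MFMA) == NONE)
    Inv &= ~ALU;

  // VMEM implies VMEM_READ, VMEM_WRITE; either of those implies VMEM.
  if ((Inv & VMEM) == NONE)
    Inv &= ~(VMEM_READ | VMEM_WRITE);
  else if ((Inv & VMEM_READ) == NONE || (Inv & VMEM_WRITE) == NONE)
    Inv &= ~VMEM;

  // DS implies DS_READ, DS_WRITE; either of those implies DS.
  if ((Inv & DS) == NONE)
    Inv &= ~(DS_READ | DS_WRITE);
  else if ((Inv & DS_READ) == NONE || (Inv & DS_WRITE) == NONE)
    Inv &= ~DS;

  return Inv;
}

// Compile-time checks of a few masks.
static_assert(invertMask(NONE) == ALL, "mask 0 pins every class");
static_assert(invertMask(ALU) == 0x3F0u, "ALU also releases VALU/SALU/MFMA");
static_assert(invertMask(VMEM_READ) == 0x3CFu, "VMEM_READ also clears VMEM");
```

For example, a sched_barrier mask of 0x1 (ALU) leaves only the VMEM and DS classes pinned, which is why a single SchedGroup built from the inverted mask can replace the seven cached per-class groups on the left-hand side of the diff.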

llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp

if (MI->getOpcode() == AMDGPU::SCHED_BARRIER) {
std::string HexString;		std::string HexString;
raw_string_ostream HexStream(HexString);		raw_string_ostream HexStream(HexString);
HexStream << format_hex(MI->getOperand(0).getImm(), 10, true);		HexStream << format_hex(MI->getOperand(0).getImm(), 10, true);
OutStreamer->emitRawComment(" sched_barrier mask(" + HexString + ")");		OutStreamer->emitRawComment(" sched_barrier mask(" + HexString + ")");
}		}
return;		return;
}		}

		if (MI->getOpcode() == AMDGPU::SCHED_GROUP_BARRIER) {
		if (isVerbose()) {
		std::string HexString;
		raw_string_ostream HexStream(HexString);
		HexStream << format_hex(MI->getOperand(0).getImm(), 10, true);
		OutStreamer->emitRawComment(
		" sched_group_barrier mask(" + HexString + ") size(" +
		Twine(MI->getOperand(1).getImm()) + ") SyncID(" +
		Twine(MI->getOperand(2).getImm()) + ")");
		}
		return;
		}

if (MI->getOpcode() == AMDGPU::SI_MASKED_UNREACHABLE) {		if (MI->getOpcode() == AMDGPU::SI_MASKED_UNREACHABLE) {
if (isVerbose())		if (isVerbose())
OutStreamer->emitRawComment(" divergent unreachable");		OutStreamer->emitRawComment(" divergent unreachable");
return;		return;
}		}

if (MI->isMetaInstruction()) {		if (MI->isMetaInstruction()) {
if (isVerbose())		if (isVerbose())

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

case AMDGPU::S_NOP:
return MI.getOperand(0).getImm() + 1;		return MI.getOperand(0).getImm() + 1;

// FIXME: Any other pseudo instruction?		// FIXME: Any other pseudo instruction?
// SI_RETURN_TO_EPILOG is a fallthrough to code outside of the function. The		// SI_RETURN_TO_EPILOG is a fallthrough to code outside of the function. The
// hazard, even if one exist, won't really be visible. Should we handle it?		// hazard, even if one exist, won't really be visible. Should we handle it?
case AMDGPU::SI_MASKED_UNREACHABLE:		case AMDGPU::SI_MASKED_UNREACHABLE:
case AMDGPU::WAVE_BARRIER:		case AMDGPU::WAVE_BARRIER:
case AMDGPU::SCHED_BARRIER:		case AMDGPU::SCHED_BARRIER:
		case AMDGPU::SCHED_GROUP_BARRIER:
return 0;		return 0;
}		}
}		}

bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {		bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
const SIRegisterInfo *TRI = ST.getRegisterInfo();		const SIRegisterInfo *TRI = ST.getRegisterInfo();
MachineBasicBlock &MBB = *MI.getParent();		MachineBasicBlock &MBB = *MI.getParent();
DebugLoc DL = MBB.findDebugLoc(MI);		DebugLoc DL = MBB.findDebugLoc(MI);

llvm/lib/Target/AMDGPU/SIInstructions.td

def SCHED_BARRIER : SPseudoInstSI<(outs), (ins i32imm:$mask),
let hasSideEffects = 1;		let hasSideEffects = 1;
let mayLoad = 0;		let mayLoad = 0;
let mayStore = 0;		let mayStore = 0;
let isConvergent = 1;		let isConvergent = 1;
let FixedSize = 1;		let FixedSize = 1;
let Size = 0;		let Size = 0;
}		}

		def SCHED_GROUP_BARRIER : SPseudoInstSI<
		(outs),
		(ins i32imm:$mask, i32imm:$size, i32imm:$syncid),
		[(int_amdgcn_sched_group_barrier (i32 timm:$mask), (i32 timm:$size), (i32 timm:$syncid))]> {
		let SchedRW = [];
		let hasNoSchedulingInfo = 1;
		let hasSideEffects = 1;
		let mayLoad = 0;
		let mayStore = 0;
		let isConvergent = 1;
		let FixedSize = 1;
		let Size = 0;
		}

// SI pseudo instructions. These are used by the CFG structurizer pass		// SI pseudo instructions. These are used by the CFG structurizer pass
// and should be lowered to ISA instructions prior to codegen.		// and should be lowered to ISA instructions prior to codegen.

let isTerminator = 1 in {		let isTerminator = 1 in {

let OtherPredicates = [EnableLateCFGStructurize] in {		let OtherPredicates = [EnableLateCFGStructurize] in {
def SI_NON_UNIFORM_BRCOND_PSEUDO : CFPseudoInstSI <		def SI_NON_UNIFORM_BRCOND_PSEUDO : CFPseudoInstSI <
(outs),		(outs),

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

bool isReallyAClobber(const Value *Ptr, MemoryDef *Def, AAResults *AA) {
if (isa<FenceInst>(DefInst))		if (isa<FenceInst>(DefInst))
return false;		return false;

if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(DefInst)) {		if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(DefInst)) {
switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::amdgcn_s_barrier:		case Intrinsic::amdgcn_s_barrier:
case Intrinsic::amdgcn_wave_barrier:		case Intrinsic::amdgcn_wave_barrier:
case Intrinsic::amdgcn_sched_barrier:		case Intrinsic::amdgcn_sched_barrier:
		case Intrinsic::amdgcn_sched_group_barrier:
return false;		return false;
default:		default:
break;		break;
}		}
}		}

// Ignore atomics not aliasing with the original load, any atomic is a		// Ignore atomics not aliasing with the original load, any atomic is a
// universal MemoryDef from MSSA's point of view too, just like a fence.		// universal MemoryDef from MSSA's point of view too, just like a fence.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

				define amdgpu_kernel void @test_sched_group_barrier() #0 {
				; GCN-LABEL: test_sched_group_barrier:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: ; sched_group_barrier mask(0x00000000) size(1) SyncID(2)
				; GCN-NEXT: ; sched_group_barrier mask(0x00000001) size(2) SyncID(4)
				; GCN-NEXT: ; sched_group_barrier mask(0x00000004) size(8) SyncID(16)
				; GCN-NEXT: ; sched_group_barrier mask(0x0000000F) size(10000) SyncID(-1)
				; GCN-NEXT: s_endpgm
				entry:
				call void @llvm.amdgcn.sched.group.barrier(i32 0, i32 1, i32 2) #1
				call void @llvm.amdgcn.sched.group.barrier(i32 1, i32 2, i32 4) #1
				call void @llvm.amdgcn.sched.group.barrier(i32 4, i32 8, i32 16) #1
				call void @llvm.amdgcn.sched.group.barrier(i32 15, i32 10000, i32 -1) #1
				ret void
				}

				declare void @llvm.amdgcn.sched.group.barrier(i32, i32, i32) #1

				attributes #0 = { nounwind }
				attributes #1 = { convergent nounwind }
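
The GCN-NEXT lines in this test are the verbose-mode comments emitted in AMDGPUMCInstLower.cpp. The sketch below reproduces that comment text with the C standard library instead of format_hex and Twine, to show where the ten-character hex field comes from. schedGroupBarrierComment is a hypothetical name, and the sketch assumes the mask fits in 32 bits and that the assembly streamer adds the leading `;` itself.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: builds the string that the patch passes to
// emitRawComment for a SCHED_GROUP_BARRIER. "0x%08X" stands in for
// format_hex(Imm, 10, /*Upper=*/true): width 10 including the "0x" prefix.
std::string schedGroupBarrierComment(uint32_t Mask, int32_t Size,
                                     int32_t SyncID) {
  char Buf[96]; // longest possible result is well under 96 bytes
  std::snprintf(Buf, sizeof(Buf),
                " sched_group_barrier mask(0x%08X) size(%d) SyncID(%d)",
                static_cast<unsigned>(Mask), static_cast<int>(Size),
                static_cast<int>(SyncID));
  return std::string(Buf);
}
```

Calling schedGroupBarrierComment(15, 10000, -1) gives the text of the last GCN-NEXT line above, minus the comment character.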

llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -march=amdgcn -mcpu=gfx908 -misched-cluster=false -amdgpu-disable-power-sched=true -run-pass=machine-scheduler -verify-misched -o - %s | FileCheck %s

				--- |
				define amdgpu_kernel void @no_sched_group_barrier(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) { ret void }
				define amdgpu_kernel void @sched_group_barrier_1_VMEM_READ_1_VALU_5_MFMA_1_VMEM_READ_3_VALU_2_VMEM_WRITE(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) { ret void }
				define amdgpu_kernel void @sched_group_barrier_2_VMEM_1000_ALU_5_MFMA_2_VMEM_WRITE(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %in) { ret void }

				!0 = distinct !{!0}
				!1 = !{!1, !0}
				...

				---
				name: no_sched_group_barrier
				tracksRegLiveness: true
				body: |
				bb.0:
				; CHECK-LABEL: name: no_sched_group_barrier
				; CHECK: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[DEF2:%[0-9]+]]:areg_128 = IMPLICIT_DEF
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[GLOBAL_LOAD_DWORD_SADDR]], implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF2]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_1:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF1]], implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_1:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_2:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF1]], implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_2:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_1]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_]], [[DEF]], 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_3:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_2]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_3:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR1]], [[GLOBAL_LOAD_DWORD_SADDR1]], implicit $exec
				; CHECK-NEXT: S_NOP 0
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_4:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_3]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_3]], [[DEF]], 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: S_ENDPGM 0, implicit [[V_MUL_LO_U32_e64_1]], implicit [[V_MUL_LO_U32_e64_2]], implicit [[V_MFMA_F32_4X4X1F32_e64_4]]
				%0:sreg_64 = IMPLICIT_DEF
				%1:vgpr_32 = IMPLICIT_DEF
				%2:areg_128 = IMPLICIT_DEF
				%3:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%4:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %3, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %4, %0, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				%5:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %1, implicit $exec
				%6:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %1, implicit $exec
				S_NOP 0
				%7:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %2, 0, 0, 0, implicit $mode, implicit $exec
				%8:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %7, 0, 0, 0, implicit $mode, implicit $exec
				%9:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %8, 0, 0, 0, implicit $mode, implicit $exec
				%10:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %9, 0, 0, 0, implicit $mode, implicit $exec
				%11:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %10, 0, 0, 0, implicit $mode, implicit $exec
				%12:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%13:vgpr_32 = nsw V_MUL_LO_U32_e64 %12, %12, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %13, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				S_ENDPGM 0, implicit %5, implicit %6, implicit %11
				...

				---
				name: sched_group_barrier_1_VMEM_READ_1_VALU_5_MFMA_1_VMEM_READ_3_VALU_2_VMEM_WRITE
				tracksRegLiveness: true
				body: |
				bb.0:
				; CHECK-LABEL: name: sched_group_barrier_1_VMEM_READ_1_VALU_5_MFMA_1_VMEM_READ_3_VALU_2_VMEM_WRITE
				; CHECK: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[DEF2:%[0-9]+]]:areg_128 = IMPLICIT_DEF
				; CHECK-NEXT: SCHED_GROUP_BARRIER 32, 1, 0
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[GLOBAL_LOAD_DWORD_SADDR]], implicit $exec
				; CHECK-NEXT: SCHED_GROUP_BARRIER 2, 1, 0
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF2]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_1:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_2:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_1]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_3:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_2]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_4:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_3]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: SCHED_GROUP_BARRIER 8, 5, 0
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: S_NOP 0
				; CHECK-NEXT: SCHED_GROUP_BARRIER 32, 1, 0
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_1:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF1]], implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_2:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR1]], [[GLOBAL_LOAD_DWORD_SADDR1]], implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_3:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF1]], implicit $exec
				; CHECK-NEXT: SCHED_GROUP_BARRIER 2, 3, 0
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_]], [[DEF]], 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_2]], [[DEF]], 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: SCHED_GROUP_BARRIER 64, 2, 0
				; CHECK-NEXT: S_ENDPGM 0, implicit [[V_MUL_LO_U32_e64_1]], implicit [[V_MUL_LO_U32_e64_3]], implicit [[V_MFMA_F32_4X4X1F32_e64_4]]
				%0:sreg_64 = IMPLICIT_DEF
				%1:vgpr_32 = IMPLICIT_DEF
				%2:areg_128 = IMPLICIT_DEF
				%3:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%4:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %3, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %4, %0, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				%5:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %1, implicit $exec
				%6:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %1, implicit $exec
				S_NOP 0
				%7:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %2, 0, 0, 0, implicit $mode, implicit $exec
				%8:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %7, 0, 0, 0, implicit $mode, implicit $exec
				%9:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %8, 0, 0, 0, implicit $mode, implicit $exec
				%10:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %9, 0, 0, 0, implicit $mode, implicit $exec
				%11:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %10, 0, 0, 0, implicit $mode, implicit $exec
				%12:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%13:vgpr_32 = nsw V_MUL_LO_U32_e64 %12, %12, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %13, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; 1 VMEM_READ
				jrbyrnes: I think you are aware of this issue. But the ability for the mutation to match the pipeline is dependent upon which instructions go into which group (when an instruction can be mapped to multiple groups). If we had SchedGroups: 2 VMEM_READ, 1 VALU, 1 MFMA, 2 VMEM_READ and initial schedule: VMEMR, VALU, VMEMR, MFMA, VMEMR, with a dependency between middle VMEMR->MFMA. initSchedGroup will add the middle VMEMR to the last VMEMR group, but we could get a more accurate pipeline by adding it to the first group.
				SCHED_GROUP_BARRIER 32, 1, 0
				; 1 VALU
				SCHED_GROUP_BARRIER 2, 1, 0
				; 5 MFMA
				SCHED_GROUP_BARRIER 8, 5, 0
				; 1 VMEM_READ
				SCHED_GROUP_BARRIER 32, 1, 0
				; 3 VALU
				SCHED_GROUP_BARRIER 2, 3, 0
				; 2 VMEM_WRITE
				SCHED_GROUP_BARRIER 64, 2, 0
				S_ENDPGM 0, implicit %5, implicit %6, implicit %11
				...
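
The group-assignment concern raised in the inline comment above can be reproduced with a toy model. This is a simplified sketch and not the patch's initSchedGroup: it assumes all SCHED_GROUP_BARRIERs share one SyncID and sit at the end of the region (as in this test), represents instruction classes as plain strings, and ignores data dependencies. GroupSpec and assignBottomUp are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one SCHED_GROUP_BARRIER: the class it accepts
// and the maximum number of instructions it may claim.
struct GroupSpec {
  std::string Kind;
  int Size;
};

// Returns, for each instruction, the index of the group that claimed it,
// or -1 if no group did.
std::vector<int> assignBottomUp(const std::vector<std::string> &Instrs,
                                const std::vector<GroupSpec> &Groups) {
  std::vector<int> Owner(Instrs.size(), -1);
  // Later barriers are visited first, mirroring the reverse walk over SUnits.
  for (std::size_t G = Groups.size(); G-- > 0;) {
    int Remaining = Groups[G].Size;
    // Scan from the bottom of the region upward, skipping instructions that
    // a group with the same SyncID has already claimed.
    for (std::size_t I = Instrs.size(); I-- > 0 && Remaining > 0;) {
      if (Owner[I] == -1 && Instrs[I] == Groups[G].Kind) {
        Owner[I] = static_cast<int>(G);
        --Remaining;
      }
    }
  }
  return Owner;
}
```

With the schedule VMEM_READ, VALU, VMEM_READ, MFMA, VMEM_READ and the groups 2 VMEM_READ, 1 VALU, 1 MFMA, 2 VMEM_READ, this model assigns the middle VMEM_READ to the last group (index 3) and leaves the first group with a single instruction, which is the situation the comment describes.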

				---
				name: sched_group_barrier_2_VMEM_1000_ALU_5_MFMA_2_VMEM_WRITE
				tracksRegLiveness: true
				body: |
				bb.0:
				; CHECK-LABEL: name: sched_group_barrier_2_VMEM_1000_ALU_5_MFMA_2_VMEM_WRITE
				; CHECK: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[GLOBAL_LOAD_DWORD_SADDR1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR [[DEF]], [[DEF1]], 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				; CHECK-NEXT: [[DEF2:%[0-9]+]]:areg_128 = IMPLICIT_DEF
				; CHECK-NEXT: SCHED_GROUP_BARRIER 16, 2, 0
				; CHECK-NEXT: S_NOP 0
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[GLOBAL_LOAD_DWORD_SADDR]], implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_1:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF1]], implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_2:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF1]], implicit $exec
				; CHECK-NEXT: [[V_MUL_LO_U32_e64_3:%[0-9]+]]:vgpr_32 = nsw V_MUL_LO_U32_e64 [[GLOBAL_LOAD_DWORD_SADDR1]], [[GLOBAL_LOAD_DWORD_SADDR1]], implicit $exec
				; CHECK-NEXT: SCHED_GROUP_BARRIER 1, 1000, 0
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[DEF2]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_1:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_2:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_1]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_3:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_2]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: [[V_MFMA_F32_4X4X1F32_e64_4:%[0-9]+]]:areg_128 = V_MFMA_F32_4X4X1F32_e64 [[DEF1]], [[GLOBAL_LOAD_DWORD_SADDR]], [[V_MFMA_F32_4X4X1F32_e64_3]], 0, 0, 0, implicit $mode, implicit $exec
				; CHECK-NEXT: SCHED_GROUP_BARRIER 8, 5, 0
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_]], [[DEF]], 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: GLOBAL_STORE_DWORD_SADDR [[DEF1]], [[V_MUL_LO_U32_e64_3]], [[DEF]], 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; CHECK-NEXT: SCHED_GROUP_BARRIER 64, 2, 0
				; CHECK-NEXT: S_ENDPGM 0, implicit [[V_MUL_LO_U32_e64_1]], implicit [[V_MUL_LO_U32_e64_2]], implicit [[V_MFMA_F32_4X4X1F32_e64_4]]
				%0:sreg_64 = IMPLICIT_DEF
				%1:vgpr_32 = IMPLICIT_DEF
				%2:areg_128 = IMPLICIT_DEF
				%3:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%4:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %3, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %4, %0, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				%5:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %1, implicit $exec
				%6:vgpr_32 = nsw V_MUL_LO_U32_e64 %3, %1, implicit $exec
				S_NOP 0
				%7:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %2, 0, 0, 0, implicit $mode, implicit $exec
				%8:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %7, 0, 0, 0, implicit $mode, implicit $exec
				%9:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %8, 0, 0, 0, implicit $mode, implicit $exec
				%10:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %9, 0, 0, 0, implicit $mode, implicit $exec
				%11:areg_128 = V_MFMA_F32_4X4X1F32_e64 %1, %3, %10, 0, 0, 0, implicit $mode, implicit $exec
				%12:vgpr_32 = GLOBAL_LOAD_DWORD_SADDR %0, %1, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
				%13:vgpr_32 = nsw V_MUL_LO_U32_e64 %12, %12, implicit $exec
				GLOBAL_STORE_DWORD_SADDR %1, %13, %0, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
				; 2 VMEM
				SCHED_GROUP_BARRIER 16, 2, 0
				; 1000 ALU
				SCHED_GROUP_BARRIER 1, 1000, 0
				; 5 MFMA
				SCHED_GROUP_BARRIER 8, 5, 0
				; 2 VMEM_WRITE
				SCHED_GROUP_BARRIER 64, 2, 0
				S_ENDPGM 0, implicit %5, implicit %6, implicit %11
				...
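To make the numeric operands in these tests easier to read, the sketch below decodes a `SCHED_GROUP_BARRIER mask, size, sync_id` triple into text. The bit-to-name table contains only the values annotated in the test comments above (1, 2, 8, 16, 32, 64); any other bit is reported as unknown rather than guessed. `describe_barrier` is a hypothetical helper written for illustration and is not part of the patch.

```python
# Mask bits exactly as annotated in the test comments of this diff.
# Bits not annotated there are intentionally left out.
MASK_NAMES = {
    1: "ALU",
    2: "VALU",
    8: "MFMA",
    16: "VMEM",
    32: "VMEM_READ",
    64: "VMEM_WRITE",
}


def describe_barrier(mask, size, sync_id):
    """Render one barrier as '<size> x <types> (sync <id>)'."""
    names = [name for bit, name in sorted(MASK_NAMES.items()) if mask & bit]
    unknown = mask & ~sum(MASK_NAMES)  # bits this table does not cover
    if unknown:
        names.append(f"unknown(0x{unknown:x})")
    return f"{size} x {'|'.join(names)} (sync {sync_id})"


# The pipeline requested by the last test in this file.
pipeline = [(16, 2, 0), (1, 1000, 0), (8, 5, 0), (64, 2, 0)]
descriptions = [describe_barrier(*barrier) for barrier in pipeline]
```

Note that the size operand behaves as an upper bound in the expected output above: the group requesting 1000 ALU instructions is satisfied by the handful of ALU instructions actually present in the region.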

Revision Contents

Diff 438269

clang/include/clang/Basic/BuiltinsAMDGPU.def

clang/test/CodeGenOpenCL/builtins-amdgcn.cl

clang/test/SemaOpenCL/builtins-amdgcn-error.cl

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp

llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

llvm/lib/Target/AMDGPU/SIInstructions.td

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll

llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir
