This is an archive of the discontinued LLVM Phabricator instance.

include/llvm/IR/IntrinsicsAMDGPU.td
397–399	I don't like having this as a pointer. It really isn't, so we're really kidding ourselves here. This could just be an i32. Alternatively, and honestly preferably, this should be just the pointer to the honest-to-goodness address, and the waveID a separate operand. It should be possible to add the DAG nodes to build the M0 value in LowerINTRINSIC_W_CHAIN. Unless the DAG combine is unable to optimize that? (Or possibly even then, and we should fix the DAG combines...)
401–403	The ordering / scope / isVolatile is a bit weird, but it's weird in the same way as the existing intrinsics, so fine by me.
lib/Target/AMDGPU/SIISelLowering.cpp
5227–5228	Use `report_fatal_error` (to support the non-Mesa compiler case).
5245	Same here.

mareko added inline comments. Oct 8 2018, 8:14 AM

include/llvm/IR/IntrinsicsAMDGPU.td
397–399	That was what the previous version did and it produced worse code. It selected correct instructions, but we want CSE of the m0 expression across basic blocks and simplifications based on the numbers of known zero bits in the inputs. Doing it in LowerINTRINSIC_W_CHAIN is too late for those.

nhaehnle added inline comments. Oct 8 2018, 9:26 AM

include/llvm/IR/IntrinsicsAMDGPU.td
397–399	I see, that sucks. We really need better CSE at the MachineInstr level. Well, the problem is that pretending that this thing is a pointer is pretty far from correct. Since it makes for a cleaner "API", I'd personally prefer to have the real pointer here, and the waveID as a separate argument, even if it means a few additional instructions.

mareko added inline comments. Oct 8 2018, 2:56 PM

include/llvm/IR/IntrinsicsAMDGPU.td
397–399	I wouldn't like more instructions, because I may be instruction-bound. The pointer is made by inttoptr anyway, currently even inttoptr((NULL << 16) | waveID) = inttoptr(waveID). inttoptr is most likely to be used in all cases, because GDS ordered count offsets are constant and have only 14 bits that can change. i32 for m0 would be better.

mareko added inline comments. Oct 8 2018, 2:59 PM

include/llvm/IR/IntrinsicsAMDGPU.td
397–399	One optimization for drivers is to pass a non-zero GDS offset in a user SGPR in the high 16bits, so that the shader doesn't have to shift. Then it's just OR'd with WaveID. If the offset is 0, there is no offset to OR.
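
For readers following the M0 discussion, the packing can be sketched outside of LLVM. This is a minimal illustration only, assuming the layout stated in the intrinsic comment (M0 = {hi16: address, lo16: waveID}). The helper names `packM0` and `packM0PreShifted` are hypothetical, and masking the wave ID to 16 bits is an assumption of the sketch, not something the patch does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: build M0 as {hi16: GDS base address, lo16: wave ID}.
// Masking the wave ID to 16 bits is an assumption made for this sketch.
constexpr uint32_t packM0(uint32_t gdsBase, uint32_t waveId) {
  return (gdsBase << 16) | (waveId & 0xFFFFu);
}

// Hypothetical variant for the driver optimization described above: the base
// arrives already shifted into the high 16 bits, so only an OR is needed.
constexpr uint32_t packM0PreShifted(uint32_t gdsBaseHi16, uint32_t waveId) {
  return gdsBaseHi16 | (waveId & 0xFFFFu);
}

// With a zero base, M0 is just the wave ID, matching
// inttoptr((NULL << 16) | waveID) = inttoptr(waveID).
static_assert(packM0(0, 0x2A) == 0x2Au, "zero base leaves only the wave ID");
static_assert(packM0(0x100, 0x2A) == 0x0100002Au, "base lands in the high half");
static_assert(packM0PreShifted(0x100u << 16, 0x2A) == packM0(0x100, 0x2A),
              "pre-shifted base gives the same M0 without the shift");
```

The pre-shifted variant shows why passing the offset in the high 16 bits of a user SGPR saves an instruction: the shift disappears and only the OR remains.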

arsenm added inline comments. Oct 9 2018, 7:18 PM

include/llvm/IR/IntrinsicsAMDGPU.td
397–399	I've made a few attempts to improve CSE involving M0 so I think it's possible to fix this. This shouldn't be a fake pointer

Does this need to be marked as isSourceOfDivergence?

In D52944#1259919, @arsenm wrote:

Does this need to be marked as isSourceOfDivergence?

Now that you mention it, yes, even though it's for stupid reasons: I believe the ds_ordered_count instruction executes only in a single lane, so it's intuitively a uniform operation; however, it returns its result only in lane 0, so it's formally non-uniform.

In D52944#1260081, @nhaehnle wrote:

In D52944#1259919, @arsenm wrote:

Does this need to be marked as isSourceOfDivergence?

Now that you mention it, yes, even though it's for stupid reasons: I believe the ds_ordered_count instruction executes only in a single lane, so it's intuitively a uniform operation; however, it returns its result only in lane 0, so it's formally non-uniform.

ds_ordered_count hangs if more than 1 lane is active.

Would everybody tolerate "i32 m0" in the intrinsic?

I might change the intrinsic to add the option to insert "s_and_saveexec s[N:M], 1" and "s_mov_b64 exec, s[N:M]" around the intrinsic to get an optimal single-lane block.
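
The save/restore idea can be modeled with a toy EXEC mask. This is a rough sketch, not LLVM or hardware code: `Wave`, `andSaveExec`, `activeLaneCount` and `runInLane0Only` are hypothetical names. Skipping the body when EXEC becomes empty is an assumption of the sketch, motivated by DS_ORDERED_COUNT being added to hasUnwantedEffectsWhenEXECEmpty in this patch.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Toy model of a wave's 64-bit EXEC mask (bit i set = lane i active).
// Illustration only; not an LLVM or hardware interface.
struct Wave {
  uint64_t exec;
};

// Models "s_and_saveexec s[N:M], src": returns the old EXEC value and
// leaves EXEC = src & EXEC.
inline uint64_t andSaveExec(Wave &wave, uint64_t src) {
  const uint64_t saved = wave.exec;
  wave.exec = src & wave.exec;
  return saved;
}

// Number of lanes currently enabled in EXEC.
inline unsigned activeLaneCount(const Wave &wave) {
  return static_cast<unsigned>(std::bitset<64>(wave.exec).count());
}

// Runs `body` with at most lane 0 active, then restores EXEC
// (the "s_mov_b64 exec, s[N:M]" step). Returns false if lane 0 was
// inactive, in which case the body is skipped (assumption of this sketch).
template <typename Body>
bool runInLane0Only(Wave &wave, Body body) {
  const uint64_t saved = andSaveExec(wave, 1);
  const bool ran = wave.exec != 0;
  if (ran)
    body(wave);
  wave.exec = saved;
  return ran;
}
```

Calling `runInLane0Only` on a wave whose EXEC is `0xF0F1` runs the body with exactly one active lane and leaves EXEC at `0xF0F1` afterwards. With EXEC `0xF0` (lane 0 inactive) the body does not run at all.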

If I make m0 integer, DS_ORDERED_COUNT won't be a mem node.

Any comments?

Use report_fatal_error, add SourceOfDivergence lines

Harbormaster completed remote builds in B25360: Diff 175389. Nov 26 2018, 8:40 PM

Ping

FYI, if there is no feedback on January 15, I'll commit this to get it into LLVM 8.0.

This revision was not accepted when it landed; it landed in state Needs Review. Jan 16 2019, 7:47 AM

Closed by commit rL351351: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap (authored by mareko).

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
include/llvm/IR/IntrinsicsAMDGPU.td	18 lines
lib/Target/AMDGPU/AMDGPU.h	2 lines
lib/Target/AMDGPU/AMDGPUISelLowering.h	1 line
lib/Target/AMDGPU/AMDGPUISelLowering.cpp	1 line
lib/Target/AMDGPU/AMDGPUSearchableTables.td	2 lines
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	2 lines
lib/Target/AMDGPU/DSInstructions.td	5 lines
lib/Target/AMDGPU/GCNHazardRecognizer.cpp	22 lines
	61 lines
	14 lines
	2 lines
	13 lines
	5 lines
test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.add.ll	96 lines
test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll	45 lines
Diff 175389

include/llvm/IR/IntrinsicsAMDGPU.td

Intrinsic<[llvm_float_ty],
[LLVMQualPointerType<llvm_float_ty, 3>,		[LLVMQualPointerType<llvm_float_ty, 3>,
llvm_float_ty,		llvm_float_ty,
llvm_i32_ty, // ordering		llvm_i32_ty, // ordering
llvm_i32_ty, // scope		llvm_i32_ty, // scope
llvm_i1_ty], // isVolatile		llvm_i1_ty], // isVolatile
[IntrArgMemOnly, NoCapture<0>]		[IntrArgMemOnly, NoCapture<0>]
>;		>;

		class AMDGPUDSOrderedIntrinsic : Intrinsic<
		[llvm_i32_ty],
		// M0 = {hi16:address, lo16:waveID}. Allow passing M0 as a pointer, so that
		// the bit packing can be optimized at the IR level.
		[LLVMQualPointerType<llvm_i32_ty, 2>, // IntToPtr(M0)
		llvm_i32_ty, // value to add or swap
		llvm_i32_ty, // ordering
		llvm_i32_ty, // scope
		llvm_i1_ty, // isVolatile
		llvm_i32_ty, // ordered count index (OA index), also added to the address
		llvm_i1_ty, // wave release, usually set to 1
		llvm_i1_ty], // wave done, set to 1 for the last ordered instruction
		[NoCapture<0>]
		>;

		def int_amdgcn_ds_ordered_add : AMDGPUDSOrderedIntrinsic;
		def int_amdgcn_ds_ordered_swap : AMDGPUDSOrderedIntrinsic;

def int_amdgcn_ds_fadd : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_faddf">;		def int_amdgcn_ds_fadd : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_faddf">;
def int_amdgcn_ds_fmin : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_fminf">;		def int_amdgcn_ds_fmin : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_fminf">;
def int_amdgcn_ds_fmax : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_fmaxf">;		def int_amdgcn_ds_fmax : AMDGPULDSF32Intrin<"__builtin_amdgcn_ds_fmaxf">;

} // TargetPrefix = "amdgcn"		} // TargetPrefix = "amdgcn"

// New-style image intrinsics		// New-style image intrinsics


lib/Target/AMDGPU/AMDGPU.h

	/// memory locations.			/// memory locations.
	namespace AMDGPUAS {			namespace AMDGPUAS {
	enum : unsigned {			enum : unsigned {
	// The maximum value for flat, generic, local, private, constant and region.			// The maximum value for flat, generic, local, private, constant and region.
	MAX_AMDGPU_ADDRESS = 6,			MAX_AMDGPU_ADDRESS = 6,

	FLAT_ADDRESS = 0, ///< Address space for flat memory.			FLAT_ADDRESS = 0, ///< Address space for flat memory.
	GLOBAL_ADDRESS = 1, ///< Address space for global memory (RAT0, VTX0).			GLOBAL_ADDRESS = 1, ///< Address space for global memory (RAT0, VTX0).
	REGION_ADDRESS = 2, ///< Address space for region memory.			REGION_ADDRESS = 2, ///< Address space for region memory. (GDS)

	CONSTANT_ADDRESS = 4, ///< Address space for constant memory (VTX2)			CONSTANT_ADDRESS = 4, ///< Address space for constant memory (VTX2)
	LOCAL_ADDRESS = 3, ///< Address space for local memory.			LOCAL_ADDRESS = 3, ///< Address space for local memory.
	PRIVATE_ADDRESS = 5, ///< Address space for private memory.			PRIVATE_ADDRESS = 5, ///< Address space for private memory.

	CONSTANT_ADDRESS_32BIT = 6, ///< Address space for 32-bit constant memory			CONSTANT_ADDRESS_32BIT = 6, ///< Address space for 32-bit constant memory

	/// Address space for direct addressible parameter memory (CONST0)			/// Address space for direct addressible parameter memory (CONST0)

lib/Target/AMDGPU/AMDGPUISelLowering.h

enum NodeType : unsigned {
FIRST_MEM_OPCODE_NUMBER = ISD::FIRST_TARGET_MEMORY_OPCODE,		FIRST_MEM_OPCODE_NUMBER = ISD::FIRST_TARGET_MEMORY_OPCODE,
STORE_MSKOR,		STORE_MSKOR,
LOAD_CONSTANT,		LOAD_CONSTANT,
TBUFFER_STORE_FORMAT,		TBUFFER_STORE_FORMAT,
TBUFFER_STORE_FORMAT_X3,		TBUFFER_STORE_FORMAT_X3,
TBUFFER_STORE_FORMAT_D16,		TBUFFER_STORE_FORMAT_D16,
TBUFFER_LOAD_FORMAT,		TBUFFER_LOAD_FORMAT,
TBUFFER_LOAD_FORMAT_D16,		TBUFFER_LOAD_FORMAT_D16,
		DS_ORDERED_COUNT,
ATOMIC_CMP_SWAP,		ATOMIC_CMP_SWAP,
ATOMIC_INC,		ATOMIC_INC,
ATOMIC_DEC,		ATOMIC_DEC,
ATOMIC_LOAD_FADD,		ATOMIC_LOAD_FADD,
ATOMIC_LOAD_FMIN,		ATOMIC_LOAD_FMIN,
ATOMIC_LOAD_FMAX,		ATOMIC_LOAD_FMAX,
BUFFER_LOAD,		BUFFER_LOAD,
BUFFER_LOAD_FORMAT,		BUFFER_LOAD_FORMAT,

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
NODE_NAME_CASE(INTERP_P2)		NODE_NAME_CASE(INTERP_P2)
NODE_NAME_CASE(STORE_MSKOR)		NODE_NAME_CASE(STORE_MSKOR)
NODE_NAME_CASE(LOAD_CONSTANT)		NODE_NAME_CASE(LOAD_CONSTANT)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT_X3)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT_X3)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT_D16)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT_D16)
NODE_NAME_CASE(TBUFFER_LOAD_FORMAT)		NODE_NAME_CASE(TBUFFER_LOAD_FORMAT)
NODE_NAME_CASE(TBUFFER_LOAD_FORMAT_D16)		NODE_NAME_CASE(TBUFFER_LOAD_FORMAT_D16)
		NODE_NAME_CASE(DS_ORDERED_COUNT)
NODE_NAME_CASE(ATOMIC_CMP_SWAP)		NODE_NAME_CASE(ATOMIC_CMP_SWAP)
NODE_NAME_CASE(ATOMIC_INC)		NODE_NAME_CASE(ATOMIC_INC)
NODE_NAME_CASE(ATOMIC_DEC)		NODE_NAME_CASE(ATOMIC_DEC)
NODE_NAME_CASE(ATOMIC_LOAD_FADD)		NODE_NAME_CASE(ATOMIC_LOAD_FADD)
NODE_NAME_CASE(ATOMIC_LOAD_FMIN)		NODE_NAME_CASE(ATOMIC_LOAD_FMIN)
NODE_NAME_CASE(ATOMIC_LOAD_FMAX)		NODE_NAME_CASE(ATOMIC_LOAD_FMAX)
NODE_NAME_CASE(BUFFER_LOAD)		NODE_NAME_CASE(BUFFER_LOAD)
NODE_NAME_CASE(BUFFER_LOAD_FORMAT)		NODE_NAME_CASE(BUFFER_LOAD_FORMAT)

lib/Target/AMDGPU/AMDGPUSearchableTables.td

	def : SourceOfDivergence<int_amdgcn_buffer_atomic_smax>;			def : SourceOfDivergence<int_amdgcn_buffer_atomic_smax>;
	def : SourceOfDivergence<int_amdgcn_buffer_atomic_umax>;			def : SourceOfDivergence<int_amdgcn_buffer_atomic_umax>;
	def : SourceOfDivergence<int_amdgcn_buffer_atomic_and>;			def : SourceOfDivergence<int_amdgcn_buffer_atomic_and>;
	def : SourceOfDivergence<int_amdgcn_buffer_atomic_or>;			def : SourceOfDivergence<int_amdgcn_buffer_atomic_or>;
	def : SourceOfDivergence<int_amdgcn_buffer_atomic_xor>;			def : SourceOfDivergence<int_amdgcn_buffer_atomic_xor>;
	def : SourceOfDivergence<int_amdgcn_buffer_atomic_cmpswap>;			def : SourceOfDivergence<int_amdgcn_buffer_atomic_cmpswap>;
	def : SourceOfDivergence<int_amdgcn_ps_live>;			def : SourceOfDivergence<int_amdgcn_ps_live>;
	def : SourceOfDivergence<int_amdgcn_ds_swizzle>;			def : SourceOfDivergence<int_amdgcn_ds_swizzle>;
				def : SourceOfDivergence<int_amdgcn_ds_ordered_add>;
				def : SourceOfDivergence<int_amdgcn_ds_ordered_swap>;

	foreach intr = AMDGPUImageDimAtomicIntrinsics in			foreach intr = AMDGPUImageDimAtomicIntrinsics in
	def : SourceOfDivergence<intr>;			def : SourceOfDivergence<intr>;

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

unsigned GCNTTIImpl::getMaxInterleaveFactor(unsigned VF) {
return 8;		return 8;
}		}

bool GCNTTIImpl::getTgtMemIntrinsic(IntrinsicInst *Inst,		bool GCNTTIImpl::getTgtMemIntrinsic(IntrinsicInst *Inst,
MemIntrinsicInfo &Info) const {		MemIntrinsicInfo &Info) const {
switch (Inst->getIntrinsicID()) {		switch (Inst->getIntrinsicID()) {
case Intrinsic::amdgcn_atomic_inc:		case Intrinsic::amdgcn_atomic_inc:
case Intrinsic::amdgcn_atomic_dec:		case Intrinsic::amdgcn_atomic_dec:
		case Intrinsic::amdgcn_ds_ordered_add:
		case Intrinsic::amdgcn_ds_ordered_swap:
case Intrinsic::amdgcn_ds_fadd:		case Intrinsic::amdgcn_ds_fadd:
case Intrinsic::amdgcn_ds_fmin:		case Intrinsic::amdgcn_ds_fmin:
case Intrinsic::amdgcn_ds_fmax: {		case Intrinsic::amdgcn_ds_fmax: {
auto *Ordering = dyn_cast<ConstantInt>(Inst->getArgOperand(2));		auto *Ordering = dyn_cast<ConstantInt>(Inst->getArgOperand(2));
auto *Volatile = dyn_cast<ConstantInt>(Inst->getArgOperand(4));		auto *Volatile = dyn_cast<ConstantInt>(Inst->getArgOperand(4));
if (!Ordering || !Volatile)		if (!Ordering || !Volatile)
return false; // Invalid.		return false; // Invalid.


lib/Target/AMDGPU/DSInstructions.td

	defm : DSAtomicRetPat_mc<DS_XOR_RTN_B64, i64, "atomic_load_xor_local">;			defm : DSAtomicRetPat_mc<DS_XOR_RTN_B64, i64, "atomic_load_xor_local">;
	defm : DSAtomicRetPat_mc<DS_MIN_RTN_I64, i64, "atomic_load_min_local">;			defm : DSAtomicRetPat_mc<DS_MIN_RTN_I64, i64, "atomic_load_min_local">;
	defm : DSAtomicRetPat_mc<DS_MAX_RTN_I64, i64, "atomic_load_max_local">;			defm : DSAtomicRetPat_mc<DS_MAX_RTN_I64, i64, "atomic_load_max_local">;
	defm : DSAtomicRetPat_mc<DS_MIN_RTN_U64, i64, "atomic_load_umin_local">;			defm : DSAtomicRetPat_mc<DS_MIN_RTN_U64, i64, "atomic_load_umin_local">;
	defm : DSAtomicRetPat_mc<DS_MAX_RTN_U64, i64, "atomic_load_umax_local">;			defm : DSAtomicRetPat_mc<DS_MAX_RTN_U64, i64, "atomic_load_umax_local">;

	defm : DSAtomicCmpXChg_mc<DS_CMPST_RTN_B64, i64, "atomic_cmp_swap_local">;			defm : DSAtomicCmpXChg_mc<DS_CMPST_RTN_B64, i64, "atomic_cmp_swap_local">;

				def : Pat <
				(SIds_ordered_count i32:$value, i16:$offset),
				(DS_ORDERED_COUNT $value, (as_i16imm $offset))
				>;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Real instructions			// Real instructions
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// SIInstructions.td			// SIInstructions.td
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//


lib/Target/AMDGPU/GCNHazardRecognizer.cpp

static bool isSMovRel(unsigned Opcode) {
case AMDGPU::S_MOVRELD_B32:		case AMDGPU::S_MOVRELD_B32:
case AMDGPU::S_MOVRELD_B64:		case AMDGPU::S_MOVRELD_B64:
return true;		return true;
default:		default:
return false;		return false;
}		}
}		}

static bool isSendMsgTraceDataOrGDS(const MachineInstr &MI) {		static bool isSendMsgTraceDataOrGDS(const SIInstrInfo &TII,
		const MachineInstr &MI) {
		if (TII.isAlwaysGDS(MI.getOpcode()))
		return true;

switch (MI.getOpcode()) {		switch (MI.getOpcode()) {
case AMDGPU::S_SENDMSG:		case AMDGPU::S_SENDMSG:
case AMDGPU::S_SENDMSGHALT:		case AMDGPU::S_SENDMSGHALT:
case AMDGPU::S_TTRACEDATA:		case AMDGPU::S_TTRACEDATA:
return true;		return true;
		// These DS opcodes don't support GDS.
		case AMDGPU::DS_NOP:
		case AMDGPU::DS_PERMUTE_B32:
		case AMDGPU::DS_BPERMUTE_B32:
		return false;
default:		default:
// TODO: GDS		if (TII.isDS(MI.getOpcode())) {
		int GDS = AMDGPU::getNamedOperandIdx(MI.getOpcode(),
		AMDGPU::OpName::gds);
		if (MI.getOperand(GDS).getImm())
		return true;
		}
return false;		return false;
}		}
}		}

static unsigned getHWReg(const SIInstrInfo *TII, const MachineInstr &RegInstr) {		static unsigned getHWReg(const SIInstrInfo *TII, const MachineInstr &RegInstr) {
const MachineOperand *RegOp = TII->getNamedOperand(RegInstr,		const MachineOperand *RegOp = TII->getNamedOperand(RegInstr,
AMDGPU::OpName::simm16);		AMDGPU::OpName::simm16);
return RegOp->getImm() & AMDGPU::Hwreg::ID_MASK_;		return RegOp->getImm() & AMDGPU::Hwreg::ID_MASK_;
GCNHazardRecognizer::getHazardType(SUnit *SU, int Stalls) {
if (isRFE(MI->getOpcode()) && checkRFEHazards(MI) > 0)		if (isRFE(MI->getOpcode()) && checkRFEHazards(MI) > 0)
return NoopHazard;		return NoopHazard;

if (ST.hasReadM0MovRelInterpHazard() &&		if (ST.hasReadM0MovRelInterpHazard() &&
(TII.isVINTRP(*MI) || isSMovRel(MI->getOpcode())) &&		(TII.isVINTRP(*MI) || isSMovRel(MI->getOpcode())) &&
checkReadM0Hazards(MI) > 0)		checkReadM0Hazards(MI) > 0)
return NoopHazard;		return NoopHazard;

if (ST.hasReadM0SendMsgHazard() && isSendMsgTraceDataOrGDS(*MI) &&		if (ST.hasReadM0SendMsgHazard() && isSendMsgTraceDataOrGDS(TII, *MI) &&
checkReadM0Hazards(MI) > 0)		checkReadM0Hazards(MI) > 0)
return NoopHazard;		return NoopHazard;

if (MI->isInlineAsm() && checkInlineAsmHazards(MI) > 0)		if (MI->isInlineAsm() && checkInlineAsmHazards(MI) > 0)
return NoopHazard;		return NoopHazard;

if (checkAnyInstHazards(MI) > 0)		if (checkAnyInstHazards(MI) > 0)
return NoopHazard;		return NoopHazard;
unsigned GCNHazardRecognizer::PreEmitNoops(MachineInstr *MI) {

if (isRFE(MI->getOpcode()))		if (isRFE(MI->getOpcode()))
return std::max(WaitStates, checkRFEHazards(MI));		return std::max(WaitStates, checkRFEHazards(MI));

if (ST.hasReadM0MovRelInterpHazard() && (TII.isVINTRP(*MI) ||		if (ST.hasReadM0MovRelInterpHazard() && (TII.isVINTRP(*MI) ||
isSMovRel(MI->getOpcode())))		isSMovRel(MI->getOpcode())))
return std::max(WaitStates, checkReadM0Hazards(MI));		return std::max(WaitStates, checkReadM0Hazards(MI));

if (ST.hasReadM0SendMsgHazard() && isSendMsgTraceDataOrGDS(*MI))		if (ST.hasReadM0SendMsgHazard() && isSendMsgTraceDataOrGDS(TII, *MI))
return std::max(WaitStates, checkReadM0Hazards(MI));		return std::max(WaitStates, checkReadM0Hazards(MI));

return WaitStates;		return WaitStates;
}		}

void GCNHazardRecognizer::EmitNoop() {		void GCNHazardRecognizer::EmitNoop() {
EmittedInstrs.push_front(nullptr);		EmittedInstrs.push_front(nullptr);
}		}

lib/Target/AMDGPU/SIISelLowering.cpp

if (Attr.hasFnAttribute(Attribute::ReadOnly)) {
Info.flags |= MachineMemOperand::MOVolatile;		Info.flags |= MachineMemOperand::MOVolatile;
}		}
return true;		return true;
}		}

switch (IntrID) {		switch (IntrID) {
case Intrinsic::amdgcn_atomic_inc:		case Intrinsic::amdgcn_atomic_inc:
case Intrinsic::amdgcn_atomic_dec:		case Intrinsic::amdgcn_atomic_dec:
		case Intrinsic::amdgcn_ds_ordered_add:
		case Intrinsic::amdgcn_ds_ordered_swap:
case Intrinsic::amdgcn_ds_fadd:		case Intrinsic::amdgcn_ds_fadd:
case Intrinsic::amdgcn_ds_fmin:		case Intrinsic::amdgcn_ds_fmin:
case Intrinsic::amdgcn_ds_fmax: {		case Intrinsic::amdgcn_ds_fmax: {
Info.opc = ISD::INTRINSIC_W_CHAIN;		Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::getVT(CI.getType());		Info.memVT = MVT::getVT(CI.getType());
Info.ptrVal = CI.getOperand(0);		Info.ptrVal = CI.getOperand(0);
Info.align = 0;		Info.align = 0;
Info.flags = MachineMemOperand::MOLoad | MachineMemOperand::MOStore;		Info.flags = MachineMemOperand::MOLoad | MachineMemOperand::MOStore;
[11 lines not shown]
}		}

bool SITargetLowering::getAddrModeArguments(IntrinsicInst *II,		bool SITargetLowering::getAddrModeArguments(IntrinsicInst *II,
SmallVectorImpl<Value*> &Ops,		SmallVectorImpl<Value*> &Ops,
Type *&AccessTy) const {		Type *&AccessTy) const {
switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::amdgcn_atomic_inc:		case Intrinsic::amdgcn_atomic_inc:
case Intrinsic::amdgcn_atomic_dec:		case Intrinsic::amdgcn_atomic_dec:
		case Intrinsic::amdgcn_ds_ordered_add:
		case Intrinsic::amdgcn_ds_ordered_swap:
case Intrinsic::amdgcn_ds_fadd:		case Intrinsic::amdgcn_ds_fadd:
case Intrinsic::amdgcn_ds_fmin:		case Intrinsic::amdgcn_ds_fmin:
case Intrinsic::amdgcn_ds_fmax: {		case Intrinsic::amdgcn_ds_fmax: {
Value *Ptr = II->getArgOperand(0);		Value *Ptr = II->getArgOperand(0);
AccessTy = II->getType();		AccessTy = II->getType();
Ops.push_back(Ptr);		Ops.push_back(Ptr);
return true;		return true;
}		}
[4,293 lines not shown]
}		}

SDValue SITargetLowering::LowerINTRINSIC_W_CHAIN(SDValue Op,		SDValue SITargetLowering::LowerINTRINSIC_W_CHAIN(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
unsigned IntrID = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();		unsigned IntrID = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();
SDLoc DL(Op);		SDLoc DL(Op);

switch (IntrID) {		switch (IntrID) {
		case Intrinsic::amdgcn_ds_ordered_add:
		case Intrinsic::amdgcn_ds_ordered_swap: {
		MemSDNode *M = cast<MemSDNode>(Op);
		SDValue Chain = M->getOperand(0);
		SDValue M0 = M->getOperand(2);
		SDValue Value = M->getOperand(3);
		unsigned OrderedCountIndex = M->getConstantOperandVal(7);
		unsigned WaveRelease = M->getConstantOperandVal(8);
		unsigned WaveDone = M->getConstantOperandVal(9);
		unsigned ShaderType;
		unsigned Instruction;

		switch (IntrID) {
		case Intrinsic::amdgcn_ds_ordered_add:
		Instruction = 0;
		break;
		case Intrinsic::amdgcn_ds_ordered_swap:
		Instruction = 1;
		break;
		}

		if (WaveDone && !WaveRelease)
		report_fatal_error("ds_ordered_count: wave_done requires wave_release");

		switch (DAG.getMachineFunction().getFunction().getCallingConv()) {
		case CallingConv::AMDGPU_CS:
		case CallingConv::AMDGPU_KERNEL:
		ShaderType = 0;
		break;
		case CallingConv::AMDGPU_PS:
		ShaderType = 1;
		break;
		case CallingConv::AMDGPU_VS:
		ShaderType = 2;
		break;
		case CallingConv::AMDGPU_GS:
		ShaderType = 3;
		break;
		default:
		report_fatal_error("ds_ordered_count unsupported for this calling conv");
		}

		unsigned Offset0 = OrderedCountIndex << 2;
		unsigned Offset1 = WaveRelease | (WaveDone << 1) | (ShaderType << 2) |
		(Instruction << 4);
		unsigned Offset = Offset0 \| (Offset1 << 8);

		SDValue Ops[] = {
		Chain,
		Value,
		DAG.getTargetConstant(Offset, DL, MVT::i16),
		copyToM0(DAG, Chain, DL, M0).getValue(1), // Glue
		};
		return DAG.getMemIntrinsicNode(AMDGPUISD::DS_ORDERED_COUNT, DL,
		M->getVTList(), Ops, M->getMemoryVT(),
		M->getMemOperand());
		}
case Intrinsic::amdgcn_atomic_inc:		case Intrinsic::amdgcn_atomic_inc:
case Intrinsic::amdgcn_atomic_dec:		case Intrinsic::amdgcn_atomic_dec:
case Intrinsic::amdgcn_ds_fadd:		case Intrinsic::amdgcn_ds_fadd:
case Intrinsic::amdgcn_ds_fmin:		case Intrinsic::amdgcn_ds_fmin:
case Intrinsic::amdgcn_ds_fmax: {		case Intrinsic::amdgcn_ds_fmax: {
MemSDNode *M = cast<MemSDNode>(Op);		MemSDNode *M = cast<MemSDNode>(Op);
unsigned Opc;		unsigned Opc;
switch (IntrID) {		switch (IntrID) {
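
The 16-bit offset field built in the ds_ordered lowering above can be reproduced as a standalone function for checking values by hand. This is only a sketch mirroring the arithmetic in the hunk. The function name `encodeDsOrderedOffset` and the enum names are hypothetical, and the exception stands in for `report_fatal_error`. Like the patch, it does not range-check the ordered count index.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical enums; the numeric values mirror the lowering code above.
enum class OrderedOp : unsigned { Add = 0, Swap = 1 };
enum class ShaderType : unsigned {
  Compute = 0,  // AMDGPU_CS and AMDGPU_KERNEL
  Pixel = 1,    // AMDGPU_PS
  Vertex = 2,   // AMDGPU_VS
  Geometry = 3  // AMDGPU_GS
};

// Hypothetical helper that packs the fields the same way as the patch:
//   Offset0 = OrderedCountIndex << 2
//   Offset1 = WaveRelease | WaveDone << 1 | ShaderType << 2 | Instruction << 4
//   Offset  = Offset0 | Offset1 << 8
inline uint16_t encodeDsOrderedOffset(unsigned orderedCountIndex,
                                      bool waveRelease, bool waveDone,
                                      ShaderType shader, OrderedOp op) {
  // Stand-in for the report_fatal_error call in the patch.
  if (waveDone && !waveRelease)
    throw std::invalid_argument("wave_done requires wave_release");

  const unsigned offset0 = orderedCountIndex << 2;
  const unsigned offset1 = static_cast<unsigned>(waveRelease) |
                           (static_cast<unsigned>(waveDone) << 1) |
                           (static_cast<unsigned>(shader) << 2) |
                           (static_cast<unsigned>(op) << 4);
  return static_cast<uint16_t>(offset0 | (offset1 << 8));
}
```

For example, an add from a compute shader with index 0, wave release set and wave done clear encodes as `0x0100`. A swap from a pixel shader with index 1 and both flags set encodes as `0x1704`.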

lib/Target/AMDGPU/SIInsertWaitcnts.cpp

if (TII->isDS(Inst) && (Inst.mayStore() || Inst.mayLoad())) {
if (Inst.getOpcode() != AMDGPU::DS_APPEND &&		if (Inst.getOpcode() != AMDGPU::DS_APPEND &&
Inst.getOpcode() != AMDGPU::DS_CONSUME) {		Inst.getOpcode() != AMDGPU::DS_CONSUME) {
setExpScore(		setExpScore(
&Inst, TII, TRI, MRI,		&Inst, TII, TRI, MRI,
AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::addr),		AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::addr),
CurrScore);		CurrScore);
}		}
if (Inst.mayStore()) {		if (Inst.mayStore()) {
		if (AMDGPU::getNamedOperandIdx(Inst.getOpcode(),
		AMDGPU::OpName::data0) != -1) {
setExpScore(		setExpScore(
&Inst, TII, TRI, MRI,		&Inst, TII, TRI, MRI,
AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data0),		AMDGPU::getNamedOperandIdx(Inst.getOpcode(), AMDGPU::OpName::data0),
CurrScore);		CurrScore);
		}
if (AMDGPU::getNamedOperandIdx(Inst.getOpcode(),		if (AMDGPU::getNamedOperandIdx(Inst.getOpcode(),
AMDGPU::OpName::data1) != -1) {		AMDGPU::OpName::data1) != -1) {
setExpScore(&Inst, TII, TRI, MRI,		setExpScore(&Inst, TII, TRI, MRI,
AMDGPU::getNamedOperandIdx(Inst.getOpcode(),		AMDGPU::getNamedOperandIdx(Inst.getOpcode(),
AMDGPU::OpName::data1),		AMDGPU::OpName::data1),
CurrScore);		CurrScore);
}		}
} else if (AMDGPU::getAtomicNoRetOp(Inst.getOpcode()) != -1 &&		} else if (AMDGPU::getAtomicNoRetOp(Inst.getOpcode()) != -1 &&
[719 lines not shown]

void SIInsertWaitcnts::updateEventWaitcntAfter(		void SIInsertWaitcnts::updateEventWaitcntAfter(
MachineInstr &Inst, BlockWaitcntBrackets *ScoreBrackets) {		MachineInstr &Inst, BlockWaitcntBrackets *ScoreBrackets) {
// Now look at the instruction opcode. If it is a memory access		// Now look at the instruction opcode. If it is a memory access
// instruction, update the upper-bound of the appropriate counter's		// instruction, update the upper-bound of the appropriate counter's
// bracket and the destination operand scores.		// bracket and the destination operand scores.
// TODO: Use the (TSFlags & SIInstrFlags::LGKM_CNT) property everywhere.		// TODO: Use the (TSFlags & SIInstrFlags::LGKM_CNT) property everywhere.
if (TII->isDS(Inst) && TII->usesLGKM_CNT(Inst)) {		if (TII->isDS(Inst) && TII->usesLGKM_CNT(Inst)) {
if (TII->hasModifiersSet(Inst, AMDGPU::OpName::gds)) {		if (TII->isAlwaysGDS(Inst.getOpcode()) ||
		TII->hasModifiersSet(Inst, AMDGPU::OpName::gds)) {
ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_ACCESS, Inst);		ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_ACCESS, Inst);
ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_GPR_LOCK, Inst);		ScoreBrackets->updateByEvent(TII, TRI, MRI, GDS_GPR_LOCK, Inst);
} else {		} else {
ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);		ScoreBrackets->updateByEvent(TII, TRI, MRI, LDS_ACCESS, Inst);
}		}
} else if (TII->isFLAT(Inst)) {		} else if (TII->isFLAT(Inst)) {
assert(Inst.mayLoad() || Inst.mayStore());		assert(Inst.mayLoad() || Inst.mayStore());


lib/Target/AMDGPU/SIInstrInfo.h

public:
static bool isDS(const MachineInstr &MI) {		static bool isDS(const MachineInstr &MI) {
return MI.getDesc().TSFlags & SIInstrFlags::DS;		return MI.getDesc().TSFlags & SIInstrFlags::DS;
}		}

bool isDS(uint16_t Opcode) const {		bool isDS(uint16_t Opcode) const {
return get(Opcode).TSFlags & SIInstrFlags::DS;		return get(Opcode).TSFlags & SIInstrFlags::DS;
}		}

		bool isAlwaysGDS(uint16_t Opcode) const;

static bool isMIMG(const MachineInstr &MI) {		static bool isMIMG(const MachineInstr &MI) {
return MI.getDesc().TSFlags & SIInstrFlags::MIMG;		return MI.getDesc().TSFlags & SIInstrFlags::MIMG;
}		}

bool isMIMG(uint16_t Opcode) const {		bool isMIMG(uint16_t Opcode) const {
return get(Opcode).TSFlags & SIInstrFlags::MIMG;		return get(Opcode).TSFlags & SIInstrFlags::MIMG;
}		}


lib/Target/AMDGPU/SIInstrInfo.cpp

bool SIInstrInfo::isSchedulingBoundary(const MachineInstr &MI,
// boundaries prevents incorrect movements of such instructions.		// boundaries prevents incorrect movements of such instructions.
return TargetInstrInfo::isSchedulingBoundary(MI, MBB, MF) ||		return TargetInstrInfo::isSchedulingBoundary(MI, MBB, MF) ||
MI.modifiesRegister(AMDGPU::EXEC, &RI) ||		MI.modifiesRegister(AMDGPU::EXEC, &RI) ||
MI.getOpcode() == AMDGPU::S_SETREG_IMM32_B32 ||		MI.getOpcode() == AMDGPU::S_SETREG_IMM32_B32 ||
MI.getOpcode() == AMDGPU::S_SETREG_B32 ||		MI.getOpcode() == AMDGPU::S_SETREG_B32 ||
changesVGPRIndexingMode(MI);		changesVGPRIndexingMode(MI);
}		}

		bool SIInstrInfo::isAlwaysGDS(uint16_t Opcode) const {
		return Opcode == AMDGPU::DS_ORDERED_COUNT ||
		Opcode == AMDGPU::DS_GWS_INIT ||
		Opcode == AMDGPU::DS_GWS_SEMA_V ||
		Opcode == AMDGPU::DS_GWS_SEMA_BR ||
		Opcode == AMDGPU::DS_GWS_SEMA_P ||
		Opcode == AMDGPU::DS_GWS_SEMA_RELEASE_ALL ||
		Opcode == AMDGPU::DS_GWS_BARRIER;
		}

bool SIInstrInfo::hasUnwantedEffectsWhenEXECEmpty(const MachineInstr &MI) const {		bool SIInstrInfo::hasUnwantedEffectsWhenEXECEmpty(const MachineInstr &MI) const {
unsigned Opcode = MI.getOpcode();		unsigned Opcode = MI.getOpcode();

if (MI.mayStore() && isSMRD(MI))		if (MI.mayStore() && isSMRD(MI))
return true; // scalar store or atomic		return true; // scalar store or atomic

// These instructions cause shader I/O that may cause hardware lockups		// These instructions cause shader I/O that may cause hardware lockups
// when executed with an empty EXEC mask.		// when executed with an empty EXEC mask.
//		//
// Note: exp with VM = DONE = 0 is automatically skipped by hardware when		// Note: exp with VM = DONE = 0 is automatically skipped by hardware when
// EXEC = 0, but checking for that case here seems not worth it		// EXEC = 0, but checking for that case here seems not worth it
// given the typical code patterns.		// given the typical code patterns.
if (Opcode == AMDGPU::S_SENDMSG \|\| Opcode == AMDGPU::S_SENDMSGHALT \|\|		if (Opcode == AMDGPU::S_SENDMSG \|\| Opcode == AMDGPU::S_SENDMSGHALT \|\|
Opcode == AMDGPU::EXP \|\| Opcode == AMDGPU::EXP_DONE)		Opcode == AMDGPU::EXP \|\| Opcode == AMDGPU::EXP_DONE \|\|
		Opcode == AMDGPU::DS_ORDERED_COUNT)
return true;		return true;

if (MI.isInlineAsm())		if (MI.isInlineAsm())
return true; // conservative assumption		return true; // conservative assumption

// These are like SALU instructions in terms of effects, so it's questionable		// These are like SALU instructions in terms of effects, so it's questionable
// whether we should return true for those.		// whether we should return true for those.
//		//
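The hunk above adds DS_ORDERED_COUNT to the opcodes that must not execute with an empty EXEC mask, which is why the conditional swap test in this revision expects an `s_cbranch_execz` around the instruction. The following is a minimal standalone sketch of that decision, not the backend's code: the `Op` enum, `hasUnwantedEffectsWhenExecEmpty`, and `needsSkipBranch` are hypothetical names, and only the opcode comparison from the hunk is modeled (the scalar-store and inline-asm checks are left out).

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-ins for a few AMDGPU opcodes; this is not the real enum.
enum class Op {
  S_SENDMSG,
  S_SENDMSGHALT,
  EXP,
  EXP_DONE,
  DS_ORDERED_COUNT,
  V_ADD_U32,   // ordinary VALU instruction, harmless with EXEC == 0
  DS_READ_B32  // ordinary LDS read, harmless with EXEC == 0
};

// Models only the opcode list from the hunk above.
inline bool hasUnwantedEffectsWhenExecEmpty(Op op) {
  switch (op) {
  case Op::S_SENDMSG:
  case Op::S_SENDMSGHALT:
  case Op::EXP:
  case Op::EXP_DONE:
  case Op::DS_ORDERED_COUNT:
    return true;
  default:
    return false;
  }
}

// A block guarded by an EXEC mask needs an explicit branch over it if any
// instruction inside must not run when every lane is disabled.
inline bool needsSkipBranch(std::initializer_list<Op> block) {
  for (Op op : block)
    if (hasUnwantedEffectsWhenExecEmpty(op))
      return true;
  return false;
}
```

Under these assumptions, `needsSkipBranch({Op::V_ADD_U32, Op::DS_ORDERED_COUNT})` is true, while a block holding only `V_ADD_U32` and `DS_READ_B32` could simply run with all lanes masked off.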

lib/Target/AMDGPU/SIInstrInfo.td


 def AMDGPUclamp : SDNode<"AMDGPUISD::CLAMP", SDTFPUnaryOp>;

 def SIsbuffer_load : SDNode<"AMDGPUISD::SBUFFER_LOAD",
   SDTypeProfile<1, 3, [SDTCisVT<1, v4i32>, SDTCisVT<2, i32>, SDTCisVT<3, i1>]>,
   [SDNPMayLoad, SDNPMemOperand]
 >;

+def SIds_ordered_count : SDNode<"AMDGPUISD::DS_ORDERED_COUNT",
+  SDTypeProfile<1, 2, [SDTCisVT<0, i32>, SDTCisVT<1, i32>, SDTCisVT<2, i16>]>,
+  [SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain, SDNPInGlue]
+>;
+
 def SIatomic_inc : SDNode<"AMDGPUISD::ATOMIC_INC", SDTAtomic2,
   [SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]
 >;

 def SIatomic_dec : SDNode<"AMDGPUISD::ATOMIC_DEC", SDTAtomic2,
   [SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]
 >;


test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.add.ll

This file was added.

				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VIGFX9,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VIGFX9,FUNC %s

				; FUNC-LABEL: {{^}}ds_ordered_add:
				; GCN-DAG: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN-DAG: s_mov_b32 m0,
				; GCN: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:772 gds
				define amdgpu_kernel void @ds_ordered_add(i32 addrspace(2)* inreg %gds, i32 addrspace(1)* %out) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				store i32 %val, i32 addrspace(1)* %out
				ret void
				}

				; Below are various modifications of input operands and shader types.

				; FUNC-LABEL: {{^}}ds_ordered_add_counter2:
				; GCN-DAG: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN-DAG: s_mov_b32 m0,
				; GCN: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:776 gds
				define amdgpu_kernel void @ds_ordered_add_counter2(i32 addrspace(2)* inreg %gds, i32 addrspace(1)* %out) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 2, i1 true, i1 true)
				store i32 %val, i32 addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ds_ordered_add_nodone:
				; GCN-DAG: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN-DAG: s_mov_b32 m0,
				; GCN: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:260 gds
				define amdgpu_kernel void @ds_ordered_add_nodone(i32 addrspace(2)* inreg %gds, i32 addrspace(1)* %out) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 true, i1 false)
				store i32 %val, i32 addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ds_ordered_add_norelease:
				; GCN-DAG: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN-DAG: s_mov_b32 m0,
				; GCN: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:4 gds
				define amdgpu_kernel void @ds_ordered_add_norelease(i32 addrspace(2)* inreg %gds, i32 addrspace(1)* %out) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 false, i1 false)
				store i32 %val, i32 addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ds_ordered_add_cs:
				; GCN: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN: s_mov_b32 m0, s0
				; VIGFX9-NEXT: s_nop 0
				; GCN-NEXT: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:772 gds
				; GCN-NEXT: s_waitcnt expcnt(0) lgkmcnt(0)
				define amdgpu_cs float @ds_ordered_add_cs(i32 addrspace(2)* inreg %gds) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				%r = bitcast i32 %val to float
				ret float %r
				}

				; FUNC-LABEL: {{^}}ds_ordered_add_ps:
				; GCN: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN: s_mov_b32 m0, s0
				; VIGFX9-NEXT: s_nop 0
				; GCN-NEXT: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:1796 gds
				; GCN-NEXT: s_waitcnt expcnt(0) lgkmcnt(0)
				define amdgpu_ps float @ds_ordered_add_ps(i32 addrspace(2)* inreg %gds) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				%r = bitcast i32 %val to float
				ret float %r
				}

				; FUNC-LABEL: {{^}}ds_ordered_add_vs:
				; GCN: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN: s_mov_b32 m0, s0
				; VIGFX9-NEXT: s_nop 0
				; GCN-NEXT: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:2820 gds
				; GCN-NEXT: s_waitcnt expcnt(0) lgkmcnt(0)
				define amdgpu_vs float @ds_ordered_add_vs(i32 addrspace(2)* inreg %gds) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				%r = bitcast i32 %val to float
				ret float %r
				}

				; FUNC-LABEL: {{^}}ds_ordered_add_gs:
				; GCN: v_mov_b32_e32 v[[INCR:[0-9]+]], 31
				; GCN: s_mov_b32 m0, s0
				; VIGFX9-NEXT: s_nop 0
				; GCN-NEXT: ds_ordered_count v{{[0-9]+}}, v[[INCR]] offset:3844 gds
				; GCN-NEXT: s_waitcnt expcnt(0) lgkmcnt(0)
				define amdgpu_gs float @ds_ordered_add_gs(i32 addrspace(2)* inreg %gds) {
				%val = call i32@llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* %gds, i32 31, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				%r = bitcast i32 %val to float
				ret float %r
				}

				declare i32 @llvm.amdgcn.ds.ordered.add(i32 addrspace(2)* nocapture, i32, i32, i32, i1, i32, i1, i1)
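The `offset:` immediates expected by these tests follow a regular pattern in the last three intrinsic arguments (counter index, wave release, wave done) and the calling convention. The sketch below reproduces that pattern. The bit layout is inferred from the expected values in the two test files of this revision, not copied from the lowering code, and every name in it (`ShaderType`, `OrderedOp`, `encodeOrderedCountOffset`) is hypothetical. The `amdgpu_kernel` tests expect the same offset as `amdgpu_cs`, so the sketch assumes both map to the value 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the numeric values are inferred from the test offsets.
enum class ShaderType : uint32_t { CS = 0, PS = 1, VS = 2, GS = 3 };
enum class OrderedOp : uint32_t { Add = 0, Swap = 1 };

// Rebuilds the 16-bit offset field checked by the ds_ordered_count tests.
constexpr uint32_t encodeOrderedCountOffset(uint32_t counterIndex,
                                            bool waveRelease, bool waveDone,
                                            ShaderType shader, OrderedOp op) {
  // Low part: the counter index scaled by 4.
  const uint32_t offset0 = counterIndex << 2;
  // High part: release flag, done flag, shader type, and add-vs-swap selector.
  const uint32_t offset1 = (waveRelease ? 1u : 0u) |
                           ((waveDone ? 1u : 0u) << 1) |
                           (static_cast<uint32_t>(shader) << 2) |
                           (static_cast<uint32_t>(op) << 4);
  return offset0 | (offset1 << 8);
}

// ds_ordered_add: index 1, release and done set, compute shader.
static_assert(encodeOrderedCountOffset(1, true, true, ShaderType::CS,
                                       OrderedOp::Add) == 772,
              "matches offset:772 in ds_ordered_add");
```

With the same function, clearing `waveDone` yields 260 (`ds_ordered_add_nodone`), clearing both flags yields 4 (`ds_ordered_add_norelease`), and switching the shader type to PS, VS, or GS yields 1796, 2820, and 3844 respectively.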

test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll

This file was added.

				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VIGFX9,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VIGFX9,FUNC %s

				; FUNC-LABEL: {{^}}ds_ordered_swap:
				; GCN: s_mov_b32 m0, s0
				; VIGFX9-NEXT: s_nop 0
				; GCN-NEXT: ds_ordered_count v{{[0-9]+}}, v0 offset:4868 gds
				; GCN-NEXT: s_waitcnt expcnt(0) lgkmcnt(0)
				define amdgpu_cs float @ds_ordered_swap(i32 addrspace(2)* inreg %gds, i32 %value) {
				%val = call i32@llvm.amdgcn.ds.ordered.swap(i32 addrspace(2)* %gds, i32 %value, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				%r = bitcast i32 %val to float
				ret float %r
				}

				; FUNC-LABEL: {{^}}ds_ordered_swap_conditional:
				; GCN: v_cmp_ne_u32_e32 vcc, 0, v0
				; GCN: s_and_saveexec_b64 s[[SAVED:\[[0-9]+:[0-9]+\]]], vcc
				; // We have to use s_cbranch, because ds_ordered_count has side effects with EXEC=0
				; GCN: s_cbranch_execz [[BB:BB._.]]
				; GCN: s_mov_b32 m0, s0
				; VIGFX9-NEXT: s_nop 0
				; GCN-NEXT: ds_ordered_count v{{[0-9]+}}, v0 offset:4868 gds
				; GCN-NEXT: [[BB]]:
				; // Wait for expcnt(0) before modifying EXEC
				; GCN-NEXT: s_waitcnt expcnt(0)
				; GCN-NEXT: s_or_b64 exec, exec, s[[SAVED]]
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				define amdgpu_cs float @ds_ordered_swap_conditional(i32 addrspace(2)* inreg %gds, i32 %value) {
				entry:
				%c = icmp ne i32 %value, 0
				br i1 %c, label %if-true, label %endif

				if-true:
				%val = call i32@llvm.amdgcn.ds.ordered.swap(i32 addrspace(2)* %gds, i32 %value, i32 0, i32 0, i1 false, i32 1, i1 true, i1 true)
				br label %endif

				endif:
				%v = phi i32 [ %val, %if-true ], [ undef, %entry ]
				%r = bitcast i32 %v to float
				ret float %r
				}

				declare i32 @llvm.amdgcn.ds.ordered.swap(i32 addrspace(2)* nocapture, i32, i32, i32, i1, i32, i1, i1)


AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap
Closed, Public

Revision Contents

Diff 175389

include/llvm/IR/IntrinsicsAMDGPU.td

lib/Target/AMDGPU/AMDGPU.h

lib/Target/AMDGPU/AMDGPUISelLowering.h

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

lib/Target/AMDGPU/AMDGPUSearchableTables.td

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

lib/Target/AMDGPU/DSInstructions.td

lib/Target/AMDGPU/GCNHazardRecognizer.cpp

lib/Target/AMDGPU/SIISelLowering.cpp

lib/Target/AMDGPU/SIInsertWaitcnts.cpp

lib/Target/AMDGPU/SIInstrInfo.h

lib/Target/AMDGPU/SIInstrInfo.cpp

lib/Target/AMDGPU/SIInstrInfo.td

test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.add.ll

test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll
