This is an archive of the discontinued LLVM Phabricator instance.

It uses a single mem operand with both load and store in addrspace(4). The addrspace(4) is common to all buffer intrinsic memops. In fact, neither MemSDNode nor MemIntrinsicSDNode can have 2 mem ops; only a MachineSDNode and the final MI can. I certainly do not want to create a MachineSDNode here and duplicate a lot of the buffer operation logic. If we believe we really want 2 mem ops, these can be split in FinalizeLowering.

It will also need to be rebased on top of D124550 to handle hazard between M0 initialization and LDS DMA.

Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly

In D124884#3489729, @arsenm wrote:

Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly

We cannot match this pattern. If you look at the addressing mode this is byzantine. Yet another addtid instruction on steroids.

arsenm added inline comments.May 3 2022, 3:44 PM

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4289 ↗	(On Diff #426850)	Why is this needed if it's in the MMO?
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	Should use the return / data type?

Harbormaster completed remote builds in B162572: Diff 426850.May 3 2022, 4:11 PM

rampitec added inline comments.May 3 2022, 4:11 PM

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4289 ↗	(On Diff #426850)	I believe to select based on the MMO I would need to write a complex pattern.

rampitec updated this revision to Diff 426876.May 3 2022, 4:15 PM

rampitec marked an inline comment as done.

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	Changed to return. What do you mean by 'use data type'?

arsenm added inline comments.May 3 2022, 4:19 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	You're looking at the pointer element type instead of the return / data type. i.e. we would have i8/i16/i32 return types and you don't need to look at the pointer
1199–1200	I just noticed there is no return type, so this just introduces a dependency on typed pointers, which is a no-go. I don't actually see why we can't match these from the buffer intrinsic plus LDS access?

rampitec added inline comments.May 3 2022, 4:20 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	It does not return anything. This instruction does not have vdata. The only way to know the size is by looking at the overloaded LDS base pointer pointee.

rampitec added inline comments.May 3 2022, 4:24 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	`LDS address = LDS_base + LDS_offset + inst_offset + (TIDinWave * 4)` We do not have TIDinWave, certainly not after selection. Even before selection it is extremely problematic. Why is a typed pointer a no-go if that works?
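
The address formula quoted in this comment can be made concrete with a small sketch. The helper name and parameter names below are illustrative only and are not LLVM APIs; the example values are assumptions. The point is that each lane's LDS destination depends on its lane ID within the wave, which is what makes the access hard to pattern-match from ordinary IR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an LLVM API: the per-lane LDS destination address
// following the formula
//   LDS address = LDS_base + LDS_offset + inst_offset + (TIDinWave * 4)
constexpr uint32_t ldsDmaAddress(uint32_t LdsBase, uint32_t LdsOffset,
                                 uint32_t InstOffset, uint32_t TidInWave) {
  return LdsBase + LdsOffset + InstOffset + TidInWave * 4;
}

// Assumed example values: consecutive lanes land exactly one dword apart.
static_assert(ldsDmaAddress(0x100, 0, 16, 1) -
                      ldsDmaAddress(0x100, 0, 16, 0) == 4,
              "adjacent lanes are 4 bytes apart");
```

Because the lane ID term is implicit in the hardware, there is no IR-level value that a matcher could tie it to.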

rampitec added inline comments.May 3 2022, 4:33 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	On top of that, MEM_ADDR also depends on the TID. It is not the same address as a normal buffer_load would use with the same operands.

arsenm added inline comments.May 3 2022, 4:34 PM

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1277	This should be addrspace 3 pointers only also
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	Pointee types have been removed from the IR. If this really needs the type it would need to use an attribute on the parameter to carry it which may be new territory

rampitec added inline comments.May 3 2022, 4:44 PM

llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1277	It cannot be overloaded on the pointee type, infrastructure limitation. We are using llvm_anyptr_ty everywhere in such context. If in turn we cannot use pointee types at all then this could be non overloaded pointer to void in addrspace 3.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200	It does not really need type but it needs size. I can add immediate to the intrinsic and switch to void* for LDS base.

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

In D124884#3489967, @rampitec wrote:

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

Or maybe after the pointer, although it is less convenient for lowering.

Harbormaster completed remote builds in B162590: Diff 426876.May 3 2022, 5:45 PM

In D124884#3489967, @rampitec wrote:

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

In D124884#3490643, @arsenm wrote:

In D124884#3489967, @rampitec wrote:

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

Apparently this isn't well developed but works. The verifier is hardcoding these intrinsics (it's also looking at the call site instead of the intrinsic declaration attributes)

In D124884#3490649, @arsenm wrote:

In D124884#3490643, @arsenm wrote:

In D124884#3489967, @rampitec wrote:

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

Apparently this isn't well developed but works. The verifier is hardcoding these intrinsics (it's also looking at the call site instead of the intrinsic declaration attributes)

What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.

In D124884#3490731, @nhaehnle wrote:

What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.

I'd prefer to keep any intrinsics that look like a load or store to look more like the regular load or store instructions. All of these arbitrary immediate parameters are uglier (e.g. the memory ordering arguments that don't actually work on some of the atomics)

In D124884#3490643, @arsenm wrote:

In D124884#3489967, @rampitec wrote:

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

This will need a clang builtin to produce the attribute. I am not sure we really want to expose it as a clang builtin.

Herald added a subscriber: jsilvanus. May 4 2022, 12:42 PM

In D124884#3490745, @arsenm wrote:

In D124884#3490731, @nhaehnle wrote:

What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.

I'd prefer to keep any intrinsics that look like a load or store to look more like the regular load or store instructions. All of these arbitrary immediate parameters are uglier (e.g. the memory ordering arguments that don't actually work on some of the atomics)

It is more or less similar to memcpy, and memcpy uses size argument.

asroy added a subscriber: asroy.May 4 2022, 2:30 PM

asroy added inline comments.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17	m0 holds the size of LDS, should we save the value of m0 before overwriting it, and write the value back before issuing ds_read?

arsenm added inline comments.May 4 2022, 2:33 PM

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17	Every user of m0 is supposed to set it itself, and we hopefully clean up the redundant rewrites. It's not something that's generally saved and restored per operation

rampitec added inline comments.May 4 2022, 2:34 PM

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17	DS_* do not read M0 since gfx9. These intrinsics are only available since gfx9. Moreover, on gfx8 and earlier, selection of DS opcodes takes care of M0 initialization right before the opcode.

Removed the overload and added i32 %size operand instead.
LDS pointer is i8 addrspace(3) now qualified with the address space.
Rebased on the change to handle hazards between m0 initialization and these operations.

Harbormaster completed remote builds in B162803: Diff 427150.May 4 2022, 5:15 PM

rampitec added a child revision: D125034: [AMDGPU] Add llvm.amdgcn.struct.buffer.load.lds intrinsic.May 5 2022, 12:02 PM

In D124884#3489729, @arsenm wrote:

Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly

In D124884#3492464, @rampitec wrote:

Removed the overload and added i32 %size operand instead.

LDS pointer is i8 addrspace(3) now qualified with the address space.

Rebased on the change to handle hazards between m0 initialization and these operations.

Just to be clear, is your expectation that the intrinsic user saves and restores m0 before calling the buffer_load lds intrinsic?

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17	Just to be clear, is your expectation that the intrinsic user saves and restores m0 before calling the buffer_load lds intrinsic?

In D124884#3496841, @ramjana wrote:

Just to be clear, is your expectation that the intrinsic user saves and restores m0 before calling the buffer_load lds intrinsic?

No, you do not have to.

Do not split voffset because inst_offset is applied to both VMEM and LDS address and voffset is not. Add a separate operand instead.

rampitec mentioned this in D125034: [AMDGPU] Add llvm.amdgcn.struct.buffer.load.lds intrinsic.May 10 2022, 12:09 PM

Harbormaster completed remote builds in B163751: Diff 428454.May 10 2022, 1:54 PM

Removed support for wider than DWORD ops. See D125409.

Herald added a subscriber: kosarev. May 11 2022, 1:39 PM

Harbormaster completed remote builds in B163972: Diff 428759.May 11 2022, 4:22 PM

piotr added a subscriber: piotr.May 13 2022, 2:31 AM

piotr added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4277–4279 ↗	(On Diff #428759)	What is preventing this from clobbering M0? There are intrinsics like int_amdgcn_interp* that have a dependency on M0. Shouldn't the code save the existing M0 and restore it after the load?

arsenm added inline comments.May 13 2022, 2:55 AM

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4277–4279 ↗	(On Diff #428759)	m0 isn't treated as a preserved value. Each user is supposed to initialize m0 itself
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8261	I don't see how / where this preserves the LDS bit
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
389	If you're going to rely on the memory operand, the verifier needs to start enforcing these have one memory operand (well, 2 actually with the same sizes)

Return false from getMemOperandsWithOffsetWidth() instead of checking mem operand.

Harbormaster completed remote builds in B164340: Diff 429278.May 13 2022, 11:28 AM

Switched to direct selection, which allows using 2 separate memory operands.
The patch now handles both raw and struct intrinsics.

rampitec marked 2 inline comments as done.May 13 2022, 1:58 PM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8261	It has a different number of operands compared to SIbuffer_load, so it selects into the _LDS versions of the opcodes. In fact, now that I have removed the offset split (because we cannot do it on one pointer only) and dropped multi-dword support, I am starting to think it might be better to drop SIbuffer_load_lds and its patterns and produce a MachineSDNode right here (as in D125279 for global load). It would not be much code anymore, and I would be able to produce 2 separate memory operands.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
389	On second thought, it is better to just return false here. We cannot have a reasonable pointer here on either side anyway, and in fact even the 2 memory operands it should ideally have would be of different sizes for sub-dword operations. A load can be sub-dword, but the store is always extended to a dword.
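
The size mismatch described in this comment (a load that may be sub-dword paired with a store that is always a dword) can be sketched as follows. The struct and function names are hypothetical stand-ins and do not correspond to llvm::MachineMemOperand or any other LLVM API; only the sizes are modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Illustrative stand-in for a memory operand; not llvm::MachineMemOperand.
struct MemOpSketch {
  bool IsLoad;
  bool IsStore;
  uint32_t SizeInBytes;
};

constexpr uint32_t DwordBytes = 4;

// Builds the (load, store) pair: the buffer load reads `Size` bytes
// (1, 2 or 4), while the LDS store is always extended to a full dword.
inline std::pair<MemOpSketch, MemOpSketch> makeLdsDmaMemOps(uint32_t Size) {
  MemOpSketch Load{true, false, Size};
  MemOpSketch Store{false, true, DwordBytes};
  return {Load, Store};
}
```

For a byte-sized transfer the two operands therefore disagree on size, which is one reason a single combined load-and-store operand cannot describe the access precisely.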

Harbormaster completed remote builds in B164387: Diff 429347.May 13 2022, 4:04 PM

arsenm added inline comments.May 16 2022, 2:10 PM

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3080	Should return false, the verifier isn't enforcing this
3130	The verifier should probably be enforcing MMO ordering if you're going to rely on that
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8254	Ditto, verifier isn't enforcing this so shouldn't assert

rampitec marked 2 inline comments as done.May 16 2022, 2:13 PM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3130	I do not rely on their order. Not anymore.

Changed asserts to cannot select.
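
A minimal sketch of the "cannot select" behaviour: an unsupported size makes selection fail instead of tripping an assert. The function below is illustrative only and returns opcode names as strings rather than real opcode enumerators; the names follow the BUFFER_LOAD_*_LDS_* pattern used in this revision's selector, and the function name itself is an assumption.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative only: returns the name of the opcode that would be picked, or
// std::nullopt when the size is unsupported (the "cannot select" path).
inline std::optional<std::string> pickLdsLoadOpcode(unsigned Size,
                                                    bool HasVIndex,
                                                    bool HasVOffset) {
  const char *Width;
  switch (Size) {
  case 1: Width = "UBYTE"; break;
  case 2: Width = "USHORT"; break;
  case 4: Width = "DWORD"; break;
  default: return std::nullopt; // fail selection instead of asserting
  }
  // Addressing mode depends on which of vindex / voffset are present.
  const char *Mode = HasVIndex ? (HasVOffset ? "BOTHEN" : "IDXEN")
                               : (HasVOffset ? "OFFEN" : "OFFSET");
  return "BUFFER_LOAD_" + std::string(Width) + "_LDS_" + Mode;
}
```

Returning an empty result for any size other than 1, 2 or 4 mirrors the reviewer's point that the IR verifier does not enforce the size operand, so the selector has to reject bad values gracefully.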

Harbormaster completed remote builds in B164749: Diff 429847.May 16 2022, 3:09 PM

arsenm accepted this revision.May 17 2022, 9:03 AM

This revision is now accepted and ready to land.May 17 2022, 9:03 AM

This revision was landed with ongoing or failed builds.May 17 2022, 10:32 AM

Closed by commit rG791ec1c68e3b: [AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rG791ec1c68e3b: [AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds.

rampitec mentioned this in D125731: [AMDGPU] No need to wait before issuing LDS DMA.May 17 2022, 10:51 AM

Revision Contents

Path	Size
llvm/include/llvm/IR/IntrinsicsAMDGPU.td	34 lines
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h	1 line
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp	95 lines
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp	29 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	90 lines
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp	3 lines
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp	2 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll	113 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.lds.ll	126 lines

Diff 430118

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

class AMDGPUBufferAtomicFP : Intrinsic <
llvm_i32_ty, // vindex(VGPR)		llvm_i32_ty, // vindex(VGPR)
llvm_i32_ty, // offset(SGPR/VGPR/imm)		llvm_i32_ty, // offset(SGPR/VGPR/imm)
llvm_i1_ty], // slc(imm)		llvm_i1_ty], // slc(imm)
[ImmArg<ArgIndex<4>>, IntrWillReturn], "", [SDNPMemOperand]>,		[ImmArg<ArgIndex<4>>, IntrWillReturn], "", [SDNPMemOperand]>,
AMDGPURsrcIntrinsic<1, 0>;		AMDGPURsrcIntrinsic<1, 0>;

// Legacy form of the intrinsic. raw and struct forms should be preferred.		// Legacy form of the intrinsic. raw and struct forms should be preferred.
def int_amdgcn_buffer_atomic_fadd : AMDGPUBufferAtomicFP;		def int_amdgcn_buffer_atomic_fadd : AMDGPUBufferAtomicFP;

		class AMDGPURawBufferLoadLDS : Intrinsic <
		[],
		[llvm_v4i32_ty, // rsrc(SGPR)
		LLVMQualPointerType<llvm_i8_ty, 3>, // LDS base offset
		llvm_i32_ty, // Data byte size: 1/2/4
		llvm_i32_ty, // voffset(VGPR, included in bounds checking and swizzling)
		llvm_i32_ty, // soffset(SGPR/imm, excluded from bounds checking and swizzling)
		llvm_i32_ty, // imm offset(imm, included in bounds checking and swizzling)
		llvm_i32_ty], // auxiliary data (imm, cachepolicy (bit 0 = glc,
		// bit 1 = slc,
		// bit 2 = dlc on gfx10+))
		// swizzled buffer (bit 3 = swz))
		[IntrWillReturn, NoCapture<ArgIndex<1>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<5>>,
		ImmArg<ArgIndex<6>>], "", [SDNPMemOperand]>, AMDGPURsrcIntrinsic<0>;
		def int_amdgcn_raw_buffer_load_lds : AMDGPURawBufferLoadLDS;

		class AMDGPUStructBufferLoadLDS : Intrinsic <
		[],
		[llvm_v4i32_ty, // rsrc(SGPR)
		LLVMQualPointerType<llvm_i8_ty, 3>, // LDS base offset
		llvm_i32_ty, // Data byte size: 1/2/4
		llvm_i32_ty, // vindex(VGPR)
		llvm_i32_ty, // voffset(VGPR, included in bounds checking and swizzling)
		llvm_i32_ty, // soffset(SGPR/imm, excluded from bounds checking and swizzling)
		llvm_i32_ty, // imm offset(imm, included in bounds checking and swizzling)
		llvm_i32_ty], // auxiliary data (imm, cachepolicy (bit 0 = glc,
		// bit 1 = slc,
		// bit 2 = dlc on gfx10+))
		// swizzled buffer (bit 3 = swz))
		[IntrWillReturn, NoCapture<ArgIndex<1>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<6>>,
		ImmArg<ArgIndex<7>>], "", [SDNPMemOperand]>, AMDGPURsrcIntrinsic<0>;
		def int_amdgcn_struct_buffer_load_lds : AMDGPUStructBufferLoadLDS;

} // defset AMDGPUBufferIntrinsics		} // defset AMDGPUBufferIntrinsics

// Uses that do not set the done bit should set IntrWriteMem on the		// Uses that do not set the done bit should set IntrWriteMem on the
// call site.		// call site.
def int_amdgcn_exp : Intrinsic <[], [		def int_amdgcn_exp : Intrinsic <[], [
llvm_i32_ty, // tgt,		llvm_i32_ty, // tgt,
llvm_i32_ty, // en		llvm_i32_ty, // en
llvm_any_ty, // src0 (f32 or i32)		llvm_any_ty, // src0 (f32 or i32)

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h

private:
bool selectG_GLOBAL_VALUE(MachineInstr &I) const;		bool selectG_GLOBAL_VALUE(MachineInstr &I) const;
bool selectG_PTRMASK(MachineInstr &I) const;		bool selectG_PTRMASK(MachineInstr &I) const;
bool selectG_EXTRACT_VECTOR_ELT(MachineInstr &I) const;		bool selectG_EXTRACT_VECTOR_ELT(MachineInstr &I) const;
bool selectG_INSERT_VECTOR_ELT(MachineInstr &I) const;		bool selectG_INSERT_VECTOR_ELT(MachineInstr &I) const;
bool selectG_SHUFFLE_VECTOR(MachineInstr &I) const;		bool selectG_SHUFFLE_VECTOR(MachineInstr &I) const;
bool selectAMDGPU_BUFFER_ATOMIC_FADD(MachineInstr &I) const;		bool selectAMDGPU_BUFFER_ATOMIC_FADD(MachineInstr &I) const;
bool selectGlobalAtomicFadd(MachineInstr &I, MachineOperand &AddrOp,		bool selectGlobalAtomicFadd(MachineInstr &I, MachineOperand &AddrOp,
MachineOperand &DataOp) const;		MachineOperand &DataOp) const;
		bool selectBufferLoadLds(MachineInstr &MI) const;
bool selectBVHIntrinsic(MachineInstr &I) const;		bool selectBVHIntrinsic(MachineInstr &I) const;
bool selectSMFMACIntrin(MachineInstr &I) const;		bool selectSMFMACIntrin(MachineInstr &I) const;
bool selectWaveAddress(MachineInstr &I) const;		bool selectWaveAddress(MachineInstr &I) const;

std::pair<Register, unsigned> selectVOP3ModsImpl(MachineOperand &Root,		std::pair<Register, unsigned> selectVOP3ModsImpl(MachineOperand &Root,
bool AllowAbs = true) const;		bool AllowAbs = true) const;

InstructionSelector::ComplexRendererFns		InstructionSelector::ComplexRendererFns

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp

bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
case Intrinsic::amdgcn_ds_append:		case Intrinsic::amdgcn_ds_append:
return selectDSAppendConsume(I, true);		return selectDSAppendConsume(I, true);
case Intrinsic::amdgcn_ds_consume:		case Intrinsic::amdgcn_ds_consume:
return selectDSAppendConsume(I, false);		return selectDSAppendConsume(I, false);
case Intrinsic::amdgcn_s_barrier:		case Intrinsic::amdgcn_s_barrier:
return selectSBarrier(I);		return selectSBarrier(I);
case Intrinsic::amdgcn_global_atomic_fadd:		case Intrinsic::amdgcn_global_atomic_fadd:
return selectGlobalAtomicFadd(I, I.getOperand(2), I.getOperand(3));		return selectGlobalAtomicFadd(I, I.getOperand(2), I.getOperand(3));
		case Intrinsic::amdgcn_raw_buffer_load_lds:
		case Intrinsic::amdgcn_struct_buffer_load_lds:
		return selectBufferLoadLds(I);
default: {		default: {
return selectImpl(I, *CoverageInfo);		return selectImpl(I, *CoverageInfo);
}		}
}		}
}		}

bool AMDGPUInstructionSelector::selectG_SELECT(MachineInstr &I) const {		bool AMDGPUInstructionSelector::selectG_SELECT(MachineInstr &I) const {
if (selectImpl(I, *CoverageInfo))		if (selectImpl(I, *CoverageInfo))
...
auto MIB = BuildMI(*MBB, &MI, DL, TII.get(Opc))
.addImm(Addr.second)		.addImm(Addr.second)
.addImm(0) // cpol		.addImm(0) // cpol
.cloneMemRefs(MI);		.cloneMemRefs(MI);

MI.eraseFromParent();		MI.eraseFromParent();
return constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI);		return constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI);
}		}

		bool AMDGPUInstructionSelector::selectBufferLoadLds(MachineInstr &MI) const {
		unsigned Opc;
		unsigned Size = MI.getOperand(3).getImm();

		// The struct intrinsic variants add one additional operand over raw.
		const bool HasVIndex = MI.getNumOperands() == 9;
		Register VIndex;
		int OpOffset = 0;
		if (HasVIndex) {
		VIndex = MI.getOperand(4).getReg();
		OpOffset = 1;
		}

		Register VOffset = MI.getOperand(4 + OpOffset).getReg();
		Optional<ValueAndVReg> MaybeVOffset =
		getIConstantVRegValWithLookThrough(VOffset, *MRI);
		const bool HasVOffset = !MaybeVOffset || MaybeVOffset->Value.getZExtValue();

		switch (Size) {
		default:
		return false;
		case 1:
		Opc = HasVIndex ? HasVOffset ? AMDGPU::BUFFER_LOAD_UBYTE_LDS_BOTHEN
		: AMDGPU::BUFFER_LOAD_UBYTE_LDS_IDXEN
		: HasVOffset ? AMDGPU::BUFFER_LOAD_UBYTE_LDS_OFFEN
		: AMDGPU::BUFFER_LOAD_UBYTE_LDS_OFFSET;
		break;
		case 2:
		Opc = HasVIndex ? HasVOffset ? AMDGPU::BUFFER_LOAD_USHORT_LDS_BOTHEN
		: AMDGPU::BUFFER_LOAD_USHORT_LDS_IDXEN
		: HasVOffset ? AMDGPU::BUFFER_LOAD_USHORT_LDS_OFFEN
		: AMDGPU::BUFFER_LOAD_USHORT_LDS_OFFSET;
		break;
		case 4:
		Opc = HasVIndex ? HasVOffset ? AMDGPU::BUFFER_LOAD_DWORD_LDS_BOTHEN
		: AMDGPU::BUFFER_LOAD_DWORD_LDS_IDXEN
		: HasVOffset ? AMDGPU::BUFFER_LOAD_DWORD_LDS_OFFEN
		: AMDGPU::BUFFER_LOAD_DWORD_LDS_OFFSET;
		break;
		}

		MachineBasicBlock *MBB = MI.getParent();
		const DebugLoc &DL = MI.getDebugLoc();
		BuildMI(*MBB, &MI, DL, TII.get(AMDGPU::COPY), AMDGPU::M0)
		.add(MI.getOperand(2));

		auto MIB = BuildMI(*MBB, &MI, DL, TII.get(Opc));

		if (HasVIndex && HasVOffset) {
		Register IdxReg = MRI->createVirtualRegister(TRI.getVGPR64Class());
		BuildMI(*MBB, &*MIB, DL, TII.get(AMDGPU::REG_SEQUENCE), IdxReg)
		.addReg(VIndex)
		.addImm(AMDGPU::sub0)
		.addReg(VOffset)
		.addImm(AMDGPU::sub1);

		MIB.addReg(IdxReg);
		} else if (HasVIndex) {
		MIB.addReg(VIndex);
		} else if (HasVOffset) {
		MIB.addReg(VOffset);
		}

		MIB.add(MI.getOperand(1)); // rsrc
		MIB.add(MI.getOperand(5 + OpOffset)); // soffset
		MIB.add(MI.getOperand(6 + OpOffset)); // imm offset
		unsigned Aux = MI.getOperand(7 + OpOffset).getImm();
		MIB.addImm(Aux & AMDGPU::CPol::ALL); // cpol
		MIB.addImm((Aux >> 3) & 1); // swz

		MachineMemOperand *LoadMMO = *MI.memoperands_begin();
		MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
		LoadPtrI.Offset = MI.getOperand(6 + OpOffset).getImm();
		MachinePointerInfo StorePtrI = LoadPtrI;
		StorePtrI.V = nullptr;
		StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;

		auto F = LoadMMO->getFlags() &
		~(MachineMemOperand::MOStore | MachineMemOperand::MOLoad);
		LoadMMO = MF->getMachineMemOperand(LoadPtrI, F | MachineMemOperand::MOLoad,
		Size, LoadMMO->getBaseAlign());

		MachineMemOperand *StoreMMO =
		MF->getMachineMemOperand(StorePtrI, F | MachineMemOperand::MOStore,
		sizeof(int32_t), LoadMMO->getBaseAlign());

		MIB.setMemRefs({LoadMMO, StoreMMO});

		MI.eraseFromParent();
		return constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI);
		}

bool AMDGPUInstructionSelector::selectBVHIntrinsic(MachineInstr &MI) const{		bool AMDGPUInstructionSelector::selectBVHIntrinsic(MachineInstr &MI) const{
MI.setDesc(TII.get(MI.getOperand(1).getImm()));		MI.setDesc(TII.get(MI.getOperand(1).getImm()));
MI.removeOperand(1);		MI.removeOperand(1);
MI.addImplicitDefUseOperands(*MI.getParent()->getParent());		MI.addImplicitDefUseOperands(*MI.getParent()->getParent());
return true;		return true;
}		}

bool AMDGPUInstructionSelector::selectSMFMACIntrin(MachineInstr &MI) const {		bool AMDGPUInstructionSelector::selectSMFMACIntrin(MachineInstr &MI) const {

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp

case Intrinsic::amdgcn_s_sendmsghalt: {
// FIXME: Should this use a waterfall loop?		// FIXME: Should this use a waterfall loop?
constrainOpWithReadfirstlane(MI, MRI, 2); // M0		constrainOpWithReadfirstlane(MI, MRI, 2); // M0
return;		return;
}		}
case Intrinsic::amdgcn_s_setreg: {		case Intrinsic::amdgcn_s_setreg: {
constrainOpWithReadfirstlane(MI, MRI, 2);		constrainOpWithReadfirstlane(MI, MRI, 2);
return;		return;
}		}
		case Intrinsic::amdgcn_raw_buffer_load_lds: {
		applyDefaultMapping(OpdMapper);
		constrainOpWithReadfirstlane(MI, MRI, 1); // rsrc
		constrainOpWithReadfirstlane(MI, MRI, 2); // M0
		constrainOpWithReadfirstlane(MI, MRI, 5); // soffset
		return;
		}
		case Intrinsic::amdgcn_struct_buffer_load_lds: {
		applyDefaultMapping(OpdMapper);
		constrainOpWithReadfirstlane(MI, MRI, 1); // rsrc
		constrainOpWithReadfirstlane(MI, MRI, 2); // M0
		constrainOpWithReadfirstlane(MI, MRI, 6); // soffset
		return;
		}
default: {		default: {
if (const AMDGPU::RsrcIntrinsic *RSrcIntrin =		if (const AMDGPU::RsrcIntrinsic *RSrcIntrin =
AMDGPU::lookupRsrcIntrinsic(IntrID)) {		AMDGPU::lookupRsrcIntrinsic(IntrID)) {
// Non-images can have complications from operands that allow both SGPR		// Non-images can have complications from operands that allow both SGPR
// and VGPR. For now it's too complicated to figure out the final opcode		// and VGPR. For now it's too complicated to figure out the final opcode
// to derive the register bank from the MCInstrDesc.		// to derive the register bank from the MCInstrDesc.
if (RSrcIntrin->IsImage) {		if (RSrcIntrin->IsImage) {
applyMappingImage(MI, OpdMapper, MRI, RSrcIntrin->RsrcArg);		applyMappingImage(MI, OpdMapper, MRI, RSrcIntrin->RsrcArg);
...
case Intrinsic::amdgcn_raw_tbuffer_load: {
// FIXME: Should make intrinsic ID the last operand of the instruction,		// FIXME: Should make intrinsic ID the last operand of the instruction,
// then this would be the same as store		// then this would be the same as store
OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);		OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);		OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);		OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);		OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
break;		break;
}		}
		case Intrinsic::amdgcn_raw_buffer_load_lds: {
		OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
		OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
		OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
		OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
		break;
		}
case Intrinsic::amdgcn_raw_buffer_store:		case Intrinsic::amdgcn_raw_buffer_store:
case Intrinsic::amdgcn_raw_buffer_store_format:		case Intrinsic::amdgcn_raw_buffer_store_format:
case Intrinsic::amdgcn_raw_tbuffer_store: {		case Intrinsic::amdgcn_raw_tbuffer_store: {
OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);		OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);		OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);		OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);		OpdsMapping[4] = getSGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
break;		break;
}		}
case Intrinsic::amdgcn_struct_buffer_load:		case Intrinsic::amdgcn_struct_buffer_load:
case Intrinsic::amdgcn_struct_tbuffer_load: {		case Intrinsic::amdgcn_struct_tbuffer_load: {
OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);		OpdsMapping[0] = getVGPROpMapping(MI.getOperand(0).getReg(), MRI, *TRI);
OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);		OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);		OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);		OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);		OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
break;		break;
}		}
		case Intrinsic::amdgcn_struct_buffer_load_lds: {
		OpdsMapping[1] = getSGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
		OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
		OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
		OpdsMapping[5] = getVGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
		OpdsMapping[6] = getSGPROpMapping(MI.getOperand(6).getReg(), MRI, *TRI);
		break;
		}
case Intrinsic::amdgcn_struct_buffer_store:		case Intrinsic::amdgcn_struct_buffer_store:
case Intrinsic::amdgcn_struct_tbuffer_store: {		case Intrinsic::amdgcn_struct_tbuffer_store: {
OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);		OpdsMapping[1] = getVGPROpMapping(MI.getOperand(1).getReg(), MRI, *TRI);
OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);		OpdsMapping[2] = getSGPROpMapping(MI.getOperand(2).getReg(), MRI, *TRI);
OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);		OpdsMapping[3] = getVGPROpMapping(MI.getOperand(3).getReg(), MRI, *TRI);
OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);		OpdsMapping[4] = getVGPROpMapping(MI.getOperand(4).getReg(), MRI, *TRI);
OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);		OpdsMapping[5] = getSGPROpMapping(MI.getOperand(5).getReg(), MRI, *TRI);
break;		break;

llvm/lib/Target/AMDGPU/SIISelLowering.cpp


if (Attr.hasFnAttr(Attribute::ReadOnly)) {
...
ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::getVT(CI.getArgOperand(0)->getType());
Info.flags |= MachineMemOperand::MOLoad |
MachineMemOperand::MOStore |
MachineMemOperand::MODereferenceable;

// XXX - Should this be volatile without known ordering?
Info.flags |= MachineMemOperand::MOVolatile;

		switch (IntrID) {
		default:
		break;
		case Intrinsic::amdgcn_raw_buffer_load_lds:
		case Intrinsic::amdgcn_struct_buffer_load_lds: {
		unsigned Width = cast<ConstantInt>(CI.getArgOperand(2))->getZExtValue();
		arsenm: Should use the return / data type?
		rampitec (Author): Changed to return. What do you mean by 'use data type'?
		arsenm: You're looking at the pointer element type instead of the return / data type. i.e. we would have i8/i16/i32 return types and you don't need to look at the pointer.
		arsenm: I just noticed there is no return type, so this is just introducing a dependency on typed pointers, which is a no-go. I don't actually see why we can't match these from the buffer intrinsic plus LDS access?
		rampitec (Author): `LDS address = LDS_base + LDS_offset + inst_offset + (TIDinWave * 4)` We do not have TIDinWave, certainly not after selection. Even before selection it is extremely problematic. Why is a typed pointer a no-go if that works?
		rampitec (Author): On top of that, MEM_ADDR also depends on the TID. It is not the same address as a normal buffer_load would use with the same operands.
		rampitec (Author): It does not return anything. This instruction does not have vdata. The only way to know the size is by looking at the overloaded LDS base pointer pointee.
		arsenm: Pointee types have been removed from the IR. If this really needs the type, it would need to use an attribute on the parameter to carry it, which may be new territory.
		rampitec (Author): It does not really need the type, but it needs the size. I can add an immediate to the intrinsic and switch to void* for the LDS base.
		Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
		return true;
		}
		}
}
return true;
}

switch (IntrID) {
case Intrinsic::amdgcn_atomic_inc:
case Intrinsic::amdgcn_atomic_dec:
case Intrinsic::amdgcn_ds_ordered_add:
case Intrinsic::amdgcn_struct_buffer_store_format: {
...
// Handle BUFFER_STORE_BYTE/SHORT overloaded intrinsics
EVT VDataType = VData.getValueType().getScalarType();
if (!IsD16 && !VDataVT.isVector() && EltType.getSizeInBits() < 32)
return handleByteShortBufferStores(DAG, VDataType, DL, Ops, M);

return DAG.getMemIntrinsicNode(Opc, DL, Op->getVTList(), Ops,
M->getMemoryVT(), M->getMemOperand());
}
		case Intrinsic::amdgcn_raw_buffer_load_lds:
		case Intrinsic::amdgcn_struct_buffer_load_lds: {
		unsigned Opc;
		bool HasVIndex = IntrinsicID == Intrinsic::amdgcn_struct_buffer_load_lds;
		unsigned OpOffset = HasVIndex ? 1 : 0;
		SDValue VOffset = Op.getOperand(5 + OpOffset);
		auto CVOffset = dyn_cast<ConstantSDNode>(VOffset);
		bool HasVOffset = !CVOffset || !CVOffset->isZero();
		unsigned Size = Op->getConstantOperandVal(4);

		switch (Size) {
		default:
		return SDValue();
		arsenm: Ditto, verifier isn't enforcing this so shouldn't assert.
		case 1:
		Opc = HasVIndex ? HasVOffset ? AMDGPU::BUFFER_LOAD_UBYTE_LDS_BOTHEN
		: AMDGPU::BUFFER_LOAD_UBYTE_LDS_IDXEN
		: HasVOffset ? AMDGPU::BUFFER_LOAD_UBYTE_LDS_OFFEN
		: AMDGPU::BUFFER_LOAD_UBYTE_LDS_OFFSET;
		break;
		case 2:
		arsenm: I don't see how / where this preserves the LDS bit.
		rampitec (Author): It has a different number of operands compared to SIbuffer_load, so it selects into the _LDS versions of the opcodes. In fact, after removing the offset split (we cannot do it on one pointer only) and dropping multi-dword support, I am starting to think it might be better to drop SIbuffer_load_lds and its patterns and produce a MachineSDNode right here (as in D125279 for global load). It will not be that much code anymore, and I will be able to produce 2 separate memory operands.
		Opc = HasVIndex ? HasVOffset ? AMDGPU::BUFFER_LOAD_USHORT_LDS_BOTHEN
		: AMDGPU::BUFFER_LOAD_USHORT_LDS_IDXEN
		: HasVOffset ? AMDGPU::BUFFER_LOAD_USHORT_LDS_OFFEN
		: AMDGPU::BUFFER_LOAD_USHORT_LDS_OFFSET;
		break;
		case 4:
		Opc = HasVIndex ? HasVOffset ? AMDGPU::BUFFER_LOAD_DWORD_LDS_BOTHEN
		: AMDGPU::BUFFER_LOAD_DWORD_LDS_IDXEN
		: HasVOffset ? AMDGPU::BUFFER_LOAD_DWORD_LDS_OFFEN
		: AMDGPU::BUFFER_LOAD_DWORD_LDS_OFFSET;
		break;
		}

		SDValue M0Val = copyToM0(DAG, Chain, DL, Op.getOperand(3));

		SmallVector<SDValue, 8> Ops;

		if (HasVIndex && HasVOffset)
		Ops.push_back(DAG.getBuildVector(MVT::v2i32, DL,
		{ Op.getOperand(5), // VIndex
		VOffset }));
		else if (HasVIndex)
		Ops.push_back(Op.getOperand(5));
		else if (HasVOffset)
		Ops.push_back(VOffset);

		Ops.push_back(Op.getOperand(2)); // rsrc
		Ops.push_back(Op.getOperand(6 + OpOffset)); // soffset
		Ops.push_back(Op.getOperand(7 + OpOffset)); // imm offset
		unsigned Aux = Op.getConstantOperandVal(8 + OpOffset);
		Ops.push_back(
		DAG.getTargetConstant(Aux & AMDGPU::CPol::ALL, DL, MVT::i8)); // cpol
		Ops.push_back(
		DAG.getTargetConstant((Aux >> 3) & 1, DL, MVT::i8)); // swz
		Ops.push_back(M0Val.getValue(0)); // Chain
		Ops.push_back(M0Val.getValue(1)); // Glue

		auto *M = cast<MemSDNode>(Op);
		MachineMemOperand *LoadMMO = M->getMemOperand();
		MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
		LoadPtrI.Offset = Op->getConstantOperandVal(7 + OpOffset);
		MachinePointerInfo StorePtrI = LoadPtrI;
		StorePtrI.V = nullptr;
		StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;

		auto F = LoadMMO->getFlags() &
		~(MachineMemOperand::MOStore | MachineMemOperand::MOLoad);
		LoadMMO = MF.getMachineMemOperand(LoadPtrI, F | MachineMemOperand::MOLoad,
		Size, LoadMMO->getBaseAlign());

		MachineMemOperand *StoreMMO =
		MF.getMachineMemOperand(StorePtrI, F | MachineMemOperand::MOStore,
		sizeof(int32_t), LoadMMO->getBaseAlign());

		auto Load = DAG.getMachineNode(Opc, DL, M->getVTList(), Ops);
		DAG.setNodeMemRefs(Load, {LoadMMO, StoreMMO});

		return SDValue(Load, 0);
		}
case Intrinsic::amdgcn_end_cf:
return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
Op->getOperand(2), Chain), 0);

default: {
if (const AMDGPU::ImageDimIntrinsicInfo *ImageDimIntr =
AMDGPU::getImageDimIntrinsicInfo(IntrinsicID))
return lowerImage(Op, ImageDimIntr, DAG, true);
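
The opcode selection in the lowering above depends on two things: the transfer size (1, 2 or 4 bytes) and whether a vindex and/or a non-zero voffset is present. The following is a minimal standalone sketch of that decision, not the patch's code: `AddrMode`, `selectAddrMode` and `selectLdsLoadName` are hypothetical names, and strings stand in for the real `AMDGPU::BUFFER_LOAD_*_LDS_*` opcode enumerators so the block compiles without LLVM.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the four MUBUF addressing variants named in the patch.
// The enumerator order must match the Suffix table below.
enum class AddrMode { Offset, OffEn, IdxEn, BothEn };

// Mirrors the nested ternary in the patch: the presence of vindex and voffset
// picks one of the four addressing variants.
constexpr AddrMode selectAddrMode(bool HasVIndex, bool HasVOffset) {
  return HasVIndex ? (HasVOffset ? AddrMode::BothEn : AddrMode::IdxEn)
                   : (HasVOffset ? AddrMode::OffEn : AddrMode::Offset);
}

// Returns the opcode name for a given size, or an empty string for an
// unsupported size (the patch returns SDValue() in that case).
std::string selectLdsLoadName(unsigned Size, bool HasVIndex, bool HasVOffset) {
  const char *Base = nullptr;
  switch (Size) {
  case 1: Base = "BUFFER_LOAD_UBYTE_LDS_"; break;
  case 2: Base = "BUFFER_LOAD_USHORT_LDS_"; break;
  case 4: Base = "BUFFER_LOAD_DWORD_LDS_"; break;
  default: return {};
  }
  static const char *const Suffix[] = {"OFFSET", "OFFEN", "IDXEN", "BOTHEN"};
  const int Idx = static_cast<int>(selectAddrMode(HasVIndex, HasVOffset));
  assert(Idx >= 0 && Idx < 4 && "addressing mode out of range");
  return std::string(Base) + Suffix[Idx];
}
```

A table-driven form like this keeps the three size cases from repeating the same ternary, at the cost of relying on a fixed enumerator order.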

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

if (MI.isCall() && callWaitsOnFunctionEntry(MI)) {
...
if (Memop->isStore() && SLoadAddresses.count(Ptr)) {
addWait(Wait, LGKM_CNT, 0);
if (PDT->dominates(MI.getParent(), SLoadAddresses.find(Ptr)->second))
SLoadAddresses.erase(Ptr);
}
unsigned AS = Memop->getAddrSpace();
if (AS != AMDGPUAS::LOCAL_ADDRESS && AS != AMDGPUAS::FLAT_ADDRESS)
continue;
		// No need to wait before load from VMEM to LDS.
		if (mayWriteLDSThroughDMA(MI))
		continue;
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
// VM_CNT is only relevant to vgpr or LDS.
ScoreBrackets.determineWait(
VM_CNT, ScoreBrackets.getRegScore(RegNo, VM_CNT), Wait);
if (Memop->isStore()) {
ScoreBrackets.determineWait(
EXP_CNT, ScoreBrackets.getRegScore(RegNo, EXP_CNT), Wait);
}

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp


if (SOffset) {
...
BaseOps.push_back(SOffset);
else
Offset += SOffset->getImm();
}
// Get appropriate operand, and compute width accordingly.
DataOpIdx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::vdst);
if (DataOpIdx == -1)
DataOpIdx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::vdata);
		if (DataOpIdx == -1) // LDS DMA
		return false;
		arsenm: If you're going to rely on the memory operand, the verifier needs to start enforcing that these have one memory operand (well, 2 actually, with the same sizes).
		rampitec (Author): On second thought, it is better to just return false here. We cannot have a reasonable pointer here on either side anyway, and in fact even the 2 memory operands it should ideally have should be of different sizes for sub-dword operations. A load can be sub-dword, but the store is always extended to a dword.
Width = getOpSize(LdSt, DataOpIdx);
return true;
}

if (isMIMG(LdSt)) {
int SRsrcIdx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::srsrc);
BaseOps.push_back(&LdSt.getOperand(SRsrcIdx));
int VAddr0Idx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::vaddr0);
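
The review discussion notes that an LDS DMA instruction ideally carries two memory operands of different sizes: the load side may be sub-dword, while the store to LDS is always extended to a dword. A minimal standalone sketch of that split follows, assuming nothing beyond what the SIISelLowering.cpp hunk shows (the load uses `Size`, the store uses `sizeof(int32_t)` and targets LDS). `MemOpSketch` and `splitLdsDmaMemOps` are hypothetical names and do not model the real `MachineMemOperand`.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical, simplified stand-in for a memory operand.
struct MemOpSketch {
  bool IsLoad;   // reads memory
  bool IsStore;  // writes memory
  bool IsLDS;    // true when the access targets LDS (the local address space)
  unsigned Size; // access size in bytes
};

// Builds the (load, store) pair for one LDS DMA transfer.
// The load keeps the requested size; the store is always one dword.
std::pair<MemOpSketch, MemOpSketch> splitLdsDmaMemOps(unsigned LoadSize) {
  assert((LoadSize == 1 || LoadSize == 2 || LoadSize == 4) &&
         "only byte, short and dword transfers are handled in the patch");
  MemOpSketch Load{true, false, false, LoadSize};
  MemOpSketch Store{false, true, true,
                    static_cast<unsigned>(sizeof(int32_t))};
  return {Load, Store};
}
```

This size mismatch is why returning `false` for LDS DMA is the simpler choice in the hunk above: there is no single width that describes both sides of the transfer.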

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN
				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN

				declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)

				define amdgpu_ps float @buffer_load_lds_dword(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds) {
				; GCN-LABEL: buffer_load_lds_dword:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword off, s[0:3], 0 lds
				; GCN-NEXT: buffer_load_dword off, s[0:3], 0 offset:4 glc lds
				; GCN-NEXT: buffer_load_dword off, s[0:3], 0 offset:8 slc lds
				; GCN-NEXT: v_mov_b32_e32 v0, s4
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: ds_read_b32 v0, v0
				asroy: m0 holds the size of LDS. Should we save the value of m0 before overwriting it, and write the value back before issuing ds_read?
				rampitec (Author): DS_* do not read M0 since gfx9. These intrinsics are only available since gfx9. Moreover, on gfx8 and earlier, selection of DS opcodes takes care of M0 initialization right before the opcode.
				arsenm: Every user of m0 is supposed to set it itself, and we hopefully clean up the redundant rewrites. It's not something that's generally saved and restored per operation.
				ramjana: Just to be clear, is your expectation that the intrinsic user saves and restores m0 before calling the buffer_load lds intrinsic?
				; GCN-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-NEXT: ; return to shader part epilog
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 0, i32 0, i32 0, i32 0)
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 0, i32 0, i32 4, i32 1)
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 0, i32 0, i32 8, i32 2)
				%ptr = bitcast i8 addrspace(3)* %lds to float addrspace(3)*
				%res = load float, float addrspace(3)* %ptr
				ret float %res
				}

				define amdgpu_ps void @buffer_load_lds_dword_imm_voffset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds) {
				; GCN-LABEL: buffer_load_lds_dword_imm_voffset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: v_mov_b32_e32 v0, 0x800
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v0, s[0:3], 0 offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 2048, i32 0, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_v_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %voffset) {
				; GCN-LABEL: buffer_load_lds_dword_v_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v0, s[0:3], 0 offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %voffset, i32 0, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_s_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 inreg %soffset) {
				; GCN-LABEL: buffer_load_lds_dword_s_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword off, s[0:3], s5 lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 0, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_vs_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %voffset, i32 inreg %soffset) {
				; GCN-LABEL: buffer_load_lds_dword_vs_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v0, s[0:3], s5 offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_vs_imm_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %voffset, i32 inreg %soffset) {
				; GCN-LABEL: buffer_load_lds_dword_vs_imm_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v0, s[0:3], s5 offen offset:2048 lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %voffset, i32 %soffset, i32 2048, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_ushort(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds) {
				; GCN-LABEL: buffer_load_lds_ushort:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: v_mov_b32_e32 v0, 0x800
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_ushort v0, s[0:3], 0 offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 2, i32 2048, i32 0, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_ubyte(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds) {
				; GCN-LABEL: buffer_load_lds_ubyte:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_ubyte off, s[0:3], 0 offset:2048 lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 1, i32 0, i32 0, i32 2048, i32 0)
				ret void
				}
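
The `%aux` argument in these tests maps onto instruction modifiers: the calls with `i32 1` and `i32 2` produce `glc` and `slc` respectively, and the lowering extracts `swz` with `(Aux >> 3) & 1`. The sketch below decodes just those three bits. It is an illustration under assumptions: bit 0 = glc and bit 1 = slc are inferred from the check lines above, `AuxBitsSketch` and `decodeAux` are hypothetical names, and any further bits covered by `AMDGPU::CPol::ALL` are deliberately not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded view of the aux immediate (only three bits modeled).
struct AuxBitsSketch {
  bool Glc; // bit 0, inferred from the aux=1 test producing "glc"
  bool Slc; // bit 1, inferred from the aux=2 test producing "slc"
  bool Swz; // bit 3, matching (Aux >> 3) & 1 in the lowering
};

constexpr AuxBitsSketch decodeAux(uint32_t Aux) {
  return AuxBitsSketch{(Aux & 1u) != 0, (Aux & 2u) != 0,
                       ((Aux >> 3) & 1u) != 0};
}
```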

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.lds.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,SDAG
				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GISEL

				declare void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* nocapture, i32 %size, i32 %vindex, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)

				define amdgpu_ps float @buffer_load_lds_dword(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds) {
				; SDAG-LABEL: buffer_load_lds_dword:
				; SDAG: ; %bb.0: ; %main_body
				; SDAG-NEXT: v_mov_b32_e32 v0, 8
				; SDAG-NEXT: s_mov_b32 m0, s4
				; SDAG-NEXT: s_nop 0
				; SDAG-NEXT: buffer_load_dword v0, s[0:3], 0 idxen lds
				; SDAG-NEXT: buffer_load_dword v0, s[0:3], 0 idxen offset:4 glc lds
				; SDAG-NEXT: buffer_load_dword v0, s[0:3], 0 idxen offset:8 slc lds
				; SDAG-NEXT: v_mov_b32_e32 v0, s4
				; SDAG-NEXT: s_waitcnt vmcnt(0)
				; SDAG-NEXT: ds_read_b32 v0, v0
				; SDAG-NEXT: s_waitcnt lgkmcnt(0)
				; SDAG-NEXT: ; return to shader part epilog
				;
				; GISEL-LABEL: buffer_load_lds_dword:
				; GISEL: ; %bb.0: ; %main_body
				; GISEL-NEXT: s_mov_b32 m0, s4
				; GISEL-NEXT: v_mov_b32_e32 v0, 8
				; GISEL-NEXT: buffer_load_dword v0, s[0:3], 0 idxen lds
				; GISEL-NEXT: buffer_load_dword v0, s[0:3], 0 idxen offset:4 glc lds
				; GISEL-NEXT: buffer_load_dword v0, s[0:3], 0 idxen offset:8 slc lds
				; GISEL-NEXT: v_mov_b32_e32 v0, s4
				; GISEL-NEXT: s_waitcnt vmcnt(0)
				; GISEL-NEXT: ds_read_b32 v0, v0
				; GISEL-NEXT: s_waitcnt lgkmcnt(0)
				; GISEL-NEXT: ; return to shader part epilog
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 8, i32 0, i32 0, i32 0, i32 0)
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 8, i32 0, i32 0, i32 4, i32 1)
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 8, i32 0, i32 0, i32 8, i32 2)
				%ptr = bitcast i8 addrspace(3)* %lds to float addrspace(3)*
				%res = load float, float addrspace(3)* %ptr
				ret float %res
				}

				define amdgpu_ps void @buffer_load_lds_dword_imm_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex) {
				; GCN-LABEL: buffer_load_lds_dword_imm_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v0, s[0:3], 0 idxen offset:2048 lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %vindex, i32 0, i32 0, i32 2048, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_v_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex, i32 %voffset) {
				; GCN-LABEL: buffer_load_lds_dword_v_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v[0:1], s[0:3], 0 idxen offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %vindex, i32 %voffset, i32 0, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_s_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex, i32 inreg %soffset) {
				; GCN-LABEL: buffer_load_lds_dword_s_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v0, s[0:3], s5 idxen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %vindex, i32 0, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_vs_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex, i32 %voffset, i32 inreg %soffset) {
				; GCN-LABEL: buffer_load_lds_dword_vs_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v[0:1], s[0:3], s5 idxen offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %vindex, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_dword_vs_imm_offset(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex, i32 %voffset, i32 inreg %soffset) {
				; GCN-LABEL: buffer_load_lds_dword_vs_imm_offset:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_dword v[0:1], s[0:3], s5 idxen offen offset:2048 lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 4, i32 %vindex, i32 %voffset, i32 %soffset, i32 2048, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_ushort(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex) {
				; GCN-LABEL: buffer_load_lds_ushort:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: v_mov_b32_e32 v1, 0x800
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_ushort v[0:1], s[0:3], 0 idxen offen lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 2, i32 %vindex, i32 2048, i32 0, i32 0, i32 0)
				ret void
				}

				define amdgpu_ps void @buffer_load_lds_ubyte(<4 x i32> inreg %rsrc, i8 addrspace(3)* inreg %lds, i32 %vindex) {
				; GCN-LABEL: buffer_load_lds_ubyte:
				; GCN: ; %bb.0: ; %main_body
				; GCN-NEXT: s_mov_b32 m0, s4
				; GCN-NEXT: s_nop 0
				; GCN-NEXT: buffer_load_ubyte v0, s[0:3], 0 idxen offset:2048 lds
				; GCN-NEXT: s_endpgm
				main_body:
				call void @llvm.amdgcn.struct.buffer.load.lds(<4 x i32> %rsrc, i8 addrspace(3)* %lds, i32 1, i32 %vindex, i32 0, i32 0, i32 2048, i32 0)
				ret void
				}
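
The inline discussion on SIISelLowering.cpp quotes the LDS address computation as `LDS address = LDS_base + LDS_offset + inst_offset + (TIDinWave * 4)`. The per-lane term is what prevents matching the operation from an ordinary buffer load plus an LDS store. The sketch below simply evaluates that formula for each lane of a wave. The function names are hypothetical, the parameter names follow the quoted formula, and nothing here models how the hardware derives each input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Evaluates the formula quoted in the review for a single lane.
uint32_t ldsDmaAddress(uint32_t LdsBase, uint32_t LdsOffset,
                       uint32_t InstOffset, uint32_t TidInWave) {
  return LdsBase + LdsOffset + InstOffset + TidInWave * 4;
}

// Evaluates the formula for lanes 0..WaveSize-1; consecutive lanes land
// one dword (4 bytes) apart.
std::vector<uint32_t> ldsDmaAddresses(uint32_t LdsBase, uint32_t LdsOffset,
                                      uint32_t InstOffset, uint32_t WaveSize) {
  std::vector<uint32_t> Addrs;
  Addrs.reserve(WaveSize);
  for (uint32_t Tid = 0; Tid < WaveSize; ++Tid)
    Addrs.push_back(ldsDmaAddress(LdsBase, LdsOffset, InstOffset, Tid));
  return Addrs;
}
```

Because each lane writes a full dword slot regardless of the load size, this is also consistent with the remark that the store side is always extended to a dword.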


[AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds (Closed, Public)

Diff 430118

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.lds.ll
