This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Add llvm.amdgcn.s.buffer.load intrinsic
Needs Review · Public

Authored by • tstellarAMD on Dec 8 2016, 11:49 AM.

Details

Reviewers

nhaehnle
mareko
arsenm

Summary

This patch also adds a new address space to represent loads from the constant
address space that use a buffer resource.

Note that this new address space does not support GEP instructions due to
limitations in the instruction selector.

Diff Detail

Build Status

Buildable 3454
Build 3454: arc lint + arc unit

Event Timeline

• tstellarAMD updated this revision to Diff 80800. Dec 8 2016, 11:49 AM

• tstellarAMD retitled this revision from to AMDGPU/SI: Add llvm.amdgcn.s.buffer.load intrinsic.

• tstellarAMD updated this object.

• tstellarAMD added a reviewer: arsenm.

• tstellarAMD added a subscriber: llvm-commits.

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 2 others. Dec 8 2016, 11:49 AM

• tstellarAMD added reviewers: mareko, nhaehnle. Dec 14 2016, 7:29 AM

arsenm added inline comments. Dec 14 2016, 9:47 AM

include/llvm/IR/IntrinsicsAMDGPU.td
421	This should probably be a more general type so that you can also mangle with FP types

mareko added inline comments. Dec 14 2016, 11:20 AM

include/llvm/IR/IntrinsicsAMDGPU.td
425	Do these flags prevent CSE? Mesa re-loads TGSI constants on each use. There is no reuse. It relies on CSE to do its job.
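
To illustrate why the question matters: a CSE pass can only merge two calls if the intrinsic is known to read memory without side effects and both calls have identical operands. The standalone C++ sketch below models that idea outside of LLVM. `LoadCall` and `findLeaders` are hypothetical names, the integer ids stand in for SSA values, and the sketch assumes that no intervening write can change the loaded memory.

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one call to the intrinsic. The integers are
// ids of SSA values, not actual addresses or offsets.
struct LoadCall {
  int RsrcId;   // buffer resource operand
  int OffsetId; // byte offset operand
  bool Glc;     // glc flag
};

// For each call, return the index of the earliest call with identical
// operands. A call whose leader is an earlier index is redundant and could
// reuse that earlier result. Assumes the loaded memory never changes.
std::vector<std::size_t> findLeaders(const std::vector<LoadCall> &Calls) {
  std::map<std::tuple<int, int, bool>, std::size_t> Seen;
  std::vector<std::size_t> Leaders;
  Leaders.reserve(Calls.size());
  for (std::size_t I = 0; I < Calls.size(); ++I) {
    auto Key =
        std::make_tuple(Calls[I].RsrcId, Calls[I].OffsetId, Calls[I].Glc);
    // emplace leaves an existing entry untouched, so the first call wins.
    auto Result = Seen.emplace(Key, I);
    Leaders.push_back(Result.first->second);
  }
  return Leaders;
}
```

With Mesa re-emitting a load at every use, every call after the first with the same operands would map to the first call's index and could be dropped.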

More test cases and rebase on top of latest master.

Herald added a subscriber: tpr. Jan 31 2017, 11:44 AM

• tstellarAMD added inline comments. Jan 31 2017, 11:46 AM

include/llvm/IR/IntrinsicsAMDGPU.td
421	The only other option would be llvm_any_ty, I think tablegen needs some more work to make this happen. But we can always change this in the future without breaking backwards compatibility, because the intrinsic is already overloaded.

How is this different from using amdgcn.buffer.load if D28993 lands (which is not certain)?

In D27586#662394, @mareko wrote:

How is this different from using amdgcn.buffer.load if D28993 lands (which is not certain)?

I don't think it's legal to select amdgcn.buffer.load to SMRD unless you can prove that it is uniform. llvm.amdgcn.s.buffer.load is known to always be uniform.

arsenm added inline comments. Jan 31 2017, 4:49 PM

lib/Target/AMDGPU/AMDGPU.h
172	Why 42?
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
167	Does this need to specify non-integral? Also there are a handful of places that assume 64-bit max we should take care of
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
87	This should be last

In D27586#662432, @tstellarAMD wrote:

In D27586#662394, @mareko wrote:

How is this different from using amdgcn.buffer.load if D28993 lands (which is not certain)?

I don't think it's legal to select amdgcn.buffer.load to SMRD unless you can prove that it is uniform. llvm.amdgcn.s.buffer.load is known to always be uniform.

lib/Target/AMDGPU/AMDGPU.h
172	4 dword resource size for address space 2
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
167	We want to be able to use inttoptr, so I don't think we can say non-integral. Do you know where any of these places are?
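
For reference, the `p42:128:128` entry this patch adds to the data layout string declares a 128-bit pointer (four dwords, matching the resource size mentioned above) with 128-bit ABI alignment for address space 42. The following is a minimal standalone C++ sketch, not LLVM's own parser, that extracts pointer widths from such a string. `PointerSpec` and `parsePointerSpecs` are hypothetical names used only for illustration.

```cpp
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper type: pointer size and ABI alignment, both in bits.
struct PointerSpec {
  unsigned SizeInBits;
  unsigned ABIAlignInBits;
};

// Parse only the "p[n]:<size>:<abi>" entries of a data layout string.
// A bare "p" describes address space 0. All other entries are ignored.
std::map<unsigned, PointerSpec> parsePointerSpecs(const std::string &Layout) {
  std::map<unsigned, PointerSpec> Specs;
  std::istringstream In(Layout);
  std::string Entry;
  while (std::getline(In, Entry, '-')) {
    if (Entry.empty() || Entry[0] != 'p')
      continue;
    unsigned AS = 0, Size = 0, Align = 0;
    if (std::sscanf(Entry.c_str(), "p%u:%u:%u", &AS, &Size, &Align) == 3) {
      Specs[AS] = {Size, Align};
    } else if (std::sscanf(Entry.c_str(), "p:%u:%u", &Size, &Align) == 2) {
      Specs[0] = {Size, Align};
    }
  }
  return Specs;
}
```

Running this over the string from the patch would report address space 42 as the only pointer wider than 64 bits, which is the case arsenm's "64-bit max" remark is concerned with.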

In D27586#662432, @tstellarAMD wrote:

In D27586#662394, @mareko wrote:

How is this different from using amdgcn.buffer.load if D28993 lands (which is not certain)?

I don't think it's legal to select amdgcn.buffer.load to SMRD unless you can prove that it is uniform. llvm.amdgcn.s.buffer.load is known to always be uniform.

Well, I can't prove that it is uniform, but neither does Mesa for non-constant offsets. The idea of my patch is that SMRD is selected first and moveToVALU will lower it if necessary.
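
The "select SMRD first, let moveToVALU lower it" approach described here can be sketched as a two-step model. This is a standalone C++ illustration under simplifying assumptions, not code from either patch: `BufferLoad`, `selectOptimistically`, and `moveToVALUIfNeeded` are hypothetical names, and only a single dword load with one offset operand is modelled.

```cpp
// Register bank that holds the offset operand.
enum class RegBank { SGPR, VGPR };

// Two of the possible encodings for a dword buffer load.
enum class Opcode { S_BUFFER_LOAD_DWORD, BUFFER_LOAD_DWORD_OFFEN };

struct BufferLoad {
  Opcode Op;
  RegBank OffsetBank;
};

// Step 1: instruction selection optimistically picks the scalar form
// without proving that the offset is uniform.
BufferLoad selectOptimistically(RegBank OffsetBank) {
  return {Opcode::S_BUFFER_LOAD_DWORD, OffsetBank};
}

// Step 2: a later fix-up rewrites scalar loads whose offset ended up in a
// VGPR, since a scalar instruction cannot read a vector register.
void moveToVALUIfNeeded(BufferLoad &Load) {
  if (Load.Op == Opcode::S_BUFFER_LOAD_DWORD &&
      Load.OffsetBank == RegBank::VGPR)
    Load.Op = Opcode::BUFFER_LOAD_DWORD_OFFEN;
}
```

The trade-off under discussion is where the uniformity decision lives: in the frontend (by choosing a dedicated intrinsic) or in the backend fix-up shown in step 2.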

I haven't looked in too much detail yet. I assume getelementptr doesn't work with these pointers, so it would be good to have a negative test which ensures that GEP use fails.

In D27586#662991, @nhaehnle wrote:

I haven't looked in too much detail yet. I assume getelementptr doesn't work with these pointers, so it would be good to have a negative test which ensures that GEP use fails.

That's correct. In order to support GEP, we would need to make i128 legal in the backend, which would be a pretty significant change. GlobalISel will make this much much easier, so I'm not sure it's even worth trying to support with SelectionDAG.

Would you please describe the purpose of this patch? It's not obvious why it's useful.

What is the behavior of the constant address space in LLVM? Note that vertex buffers and 'restrict' read-only buffers use immutable memory too, so those are also 'constant' and selecting SMEM for those is desirable if it's possible.

OpenGL Constant buffers aren't limited to SMEM. They can be read with a VGPR offset too. LLVM doesn't have the capability to recognize a non-constant SGPR offset at ISel. If you wanna select SMEM with a non-constant offset, you also need to update moveToVALU.

In D27586#663090, @mareko wrote:

Would you please describe the purpose of this patch? It's not obvious why it's useful.

The main reason it is useful is because it tells the compiler that this is a load from a constant value without needing any more analysis. It's also useful because s_buffer_load_* instructions have a much simpler resource descriptor, so if they do end up getting selected to MUBUF you don't have to worry about swizzled addressing. It is true, however, that you could just use a single llvm.amdgcn.buffer.load.i32 intrinsic for everything, but you may end up with worse code if you are unable to do the analysis required to select it to SMRD instructions.

What is the behavior of the constant address space in LLVM? Note that vertex buffers and 'restrict' read-only buffers use immutable memory too, so those are also 'constant' and selecting SMEM for those is desirable if it's possible.

Constant address space just means that the memory is unchanged for the life of the program.

OpenGL Constant buffers aren't limited to SMEM. They can be read with a VGPR offset too. LLVM doesn't have the capability to recognize a non-constant SGPR offset at ISel. If you wanna select SMEM with a non-constant offset, you also need to update moveToVALU.

In D27586#663420, @tstellarAMD wrote:

In D27586#663090, @mareko wrote:

Would you please describe the purpose of this patch? It's not obvious why it's useful.

The main reason it is useful is because it tells the compiler that this is a load from a constant value without needing any more analysis. It's also useful because s_buffer_load_* instructions have a much simpler resource descriptor, so if they do end up getting selected to MUBUF you don't have to worry about swizzled addressing. It is true, however, that you could just use a single llvm.amdgcn.buffer.load.i32 intrinsic for everything, but you may end up with worse code if you are unable to do the analysis required to select it to SMRD instructions.

For OpenGL, we'll also need to support non-constant SGPR offsets. And for those, moveToVALU needs to have the corresponding lowering. I don't think having an intrinsic that only accepts constant nodes for the offset is useful.

Also for OpenGL, llvm.amdgcn.buffer.load for SMRD might not be the best choice, because Mesa also needs to determine whether SMRD can be selected with regard to SQC L1 coherency. Therefore, I'd define llvm.amdgcn.s.buffer.load as: "do llvm.amdgcn.buffer.load, but allow SMRD selection for constant and non-constant offsets, and also the referenced memory is immutable (which is why we can use SMRD)". I don't think any other definition is useful for Mesa. Do we agree?

What is the behavior of the constant address space in LLVM? Note that vertex buffers and 'restrict' read-only buffers use immutable memory too, so those are also 'constant' and selecting SMEM for those is desirable if it's possible.

Constant address space just means that the memory is unchanged for the life of the program.

Yes, but what effect on the compiler does it have? Will it allow code sinking and arbitrary scheduling?

t-tye added a subscriber: t-tye. Mar 22 2017, 6:38 PM

tony-tye removed a subscriber: tony-tye. Mar 22 2017, 6:50 PM

dstuttard mentioned this in D28993: AMDGPU: Try to select SMEM opcodes for llvm.amdgcn.buffer.load. May 3 2017, 5:49 AM

I've got a scenario that will benefit from this change. It seems to me that this might be a more workable solution than D28993 (although a more generic approach like that is attractive). This change doesn't necessarily preclude something like D28993 at a later stage does it?

Is it possible to make the non-const offset change that mareko talks about as well?

airlied added a subscriber: airlied. May 8 2017, 12:19 AM

In D27586#744594, @dstuttard wrote:

I've got a scenario that will benefit from this change. It seems to me that this might be a more workable solution than D28993 (although a more generic approach like that is attractive). This change doesn't necessarily preclude something like D28993 at a later stage does it?

Is it possible to make the non-const offset change that mareko talks about as well?

Yes, we'd like to support non-const SGPR offsets as well. Constant offsets are not that interesting to us. Constant offsets can be supported first just to get things going. Non-const offsets can be added afterwards, and the SALU->VALU lowering for s_load_dword should also do inst_offset folding at least. The idea is that the lowering shouldn't generate worse code than amdgcn.buffer.load.dword used directly.
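
As a rough illustration of the inst_offset folding mentioned above: when a scalar load is lowered to a vector buffer load, a constant part of the byte offset can go into the instruction's immediate field instead of costing an extra add. The standalone C++ sketch below assumes a 12-bit unsigned immediate field; that width is an assumption of this sketch and is not stated in the thread. `FoldedOffset` and `foldConstantOffset` are hypothetical names.

```cpp
#include <cstdint>

// Assumption for this sketch: the vector buffer load has a 12-bit unsigned
// immediate offset field, so 4095 is the largest foldable value.
constexpr std::uint32_t MaxImmOffset = 4095;

struct FoldedOffset {
  std::uint32_t Imm;       // goes into the instruction's immediate field
  std::uint32_t Remainder; // must still be added to the register offset
};

// Fold the constant part of an offset into the immediate field when it
// fits. Otherwise leave it all to be materialized in a register.
FoldedOffset foldConstantOffset(std::uint32_t ConstBytes) {
  if (ConstBytes <= MaxImmOffset)
    return {ConstBytes, 0};
  return {0, ConstBytes};
}
```

A lowering that skips this step would emit an extra add for small constant offsets, which is the "worse code than amdgcn.buffer.load" outcome the comment wants to avoid.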

arsenm resigned from this revision. Feb 21 2019, 6:52 PM

Herald added subscribers: jdoerfert, jvesely, arsenm. Feb 21 2019, 6:52 PM

Revision Contents

Path	Size
include/llvm/IR/IntrinsicsAMDGPU.td	8 lines
lib/Target/AMDGPU/AMDGPU.h	1 line
lib/Target/AMDGPU/AMDGPUISelLowering.h	1 line
lib/Target/AMDGPU/AMDGPUISelLowering.cpp	1 line
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	2 lines
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h	6 lines
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	34 lines
(file name missing)	4 lines
(file name missing)	2 lines
(file name missing)	60 lines
(file name missing)	10 lines
(file name missing)	15 lines
test/CodeGen/AMDGPU/mubuf.ll	22 lines
test/CodeGen/AMDGPU/smrd.ll	49 lines
test/Transforms/EarlyCSE/AMDGPU/intrinsics.ll	36 lines

Diff 86472

include/llvm/IR/IntrinsicsAMDGPU.td

class AMDGPUBufferLoad : Intrinsic <
llvm_i32_ty, // vindex(VGPR)		llvm_i32_ty, // vindex(VGPR)
llvm_i32_ty, // offset(SGPR/VGPR/imm)		llvm_i32_ty, // offset(SGPR/VGPR/imm)
llvm_i1_ty, // glc(imm)		llvm_i1_ty, // glc(imm)
llvm_i1_ty], // slc(imm)		llvm_i1_ty], // slc(imm)
[IntrReadMem]>;		[IntrReadMem]>;
def int_amdgcn_buffer_load_format : AMDGPUBufferLoad;		def int_amdgcn_buffer_load_format : AMDGPUBufferLoad;
def int_amdgcn_buffer_load : AMDGPUBufferLoad;		def int_amdgcn_buffer_load : AMDGPUBufferLoad;


		def int_amdgcn_s_buffer_load : Intrinsic <
		[llvm_anyint_ty],
		arsenm: This should probably be a more general type so that you can also mangle with FP types
		tstellarAMD: The only other option would be llvm_any_ty, I think tablegen needs some more work to make this happen. But we can always change this in the future without breaking backwards compatibility, because the intrinsic is already overloaded.
		[LLVMQualPointerType<LLVMMatchType<0>, 42>,
		llvm_i32_ty, // byte offset
		llvm_i1_ty], // glc
		[IntrReadMem, IntrArgMemOnly, NoCapture<0>]>;
		mareko: Do these flags prevent CSE? Mesa re-loads TGSI constants on each use. There is no reuse. It relies on CSE to do its job.

class AMDGPUBufferStore : Intrinsic <		class AMDGPUBufferStore : Intrinsic <
[],		[],
[llvm_anyfloat_ty, // vdata(VGPR) -- can currently only select f32, v2f32, v4f32		[llvm_anyfloat_ty, // vdata(VGPR) -- can currently only select f32, v2f32, v4f32
llvm_v4i32_ty, // rsrc(SGPR)		llvm_v4i32_ty, // rsrc(SGPR)
llvm_i32_ty, // vindex(VGPR)		llvm_i32_ty, // vindex(VGPR)
llvm_i32_ty, // offset(SGPR/VGPR/imm)		llvm_i32_ty, // offset(SGPR/VGPR/imm)
llvm_i1_ty, // glc(imm)		llvm_i1_ty, // glc(imm)
llvm_i1_ty], // slc(imm)		llvm_i1_ty], // slc(imm)

lib/Target/AMDGPU/AMDGPU.h

enum AddressSpaces : unsigned {
CONSTANT_BUFFER_9 = 17,		CONSTANT_BUFFER_9 = 17,
CONSTANT_BUFFER_10 = 18,		CONSTANT_BUFFER_10 = 18,
CONSTANT_BUFFER_11 = 19,		CONSTANT_BUFFER_11 = 19,
CONSTANT_BUFFER_12 = 20,		CONSTANT_BUFFER_12 = 20,
CONSTANT_BUFFER_13 = 21,		CONSTANT_BUFFER_13 = 21,
CONSTANT_BUFFER_14 = 22,		CONSTANT_BUFFER_14 = 22,
CONSTANT_BUFFER_15 = 23,		CONSTANT_BUFFER_15 = 23,

		CONSTANT_ADDRESS_W_RSRC = 42,
		arsenm: Why 42?
		tstellarAMD: 4 dword resource size for address space 2
// Some places use this if the address space can't be determined.		// Some places use this if the address space can't be determined.
UNKNOWN_ADDRESS_SPACE = ~0u		UNKNOWN_ADDRESS_SPACE = ~0u
};		};

} // namespace AMDGPUAS		} // namespace AMDGPUAS

#endif		#endif

lib/Target/AMDGPU/AMDGPUISelLowering.h

enum NodeType : unsigned {
STORE_MSKOR,		STORE_MSKOR,
LOAD_CONSTANT,		LOAD_CONSTANT,
TBUFFER_STORE_FORMAT,		TBUFFER_STORE_FORMAT,
ATOMIC_CMP_SWAP,		ATOMIC_CMP_SWAP,
ATOMIC_INC,		ATOMIC_INC,
ATOMIC_DEC,		ATOMIC_DEC,
BUFFER_LOAD,		BUFFER_LOAD,
BUFFER_LOAD_FORMAT,		BUFFER_LOAD_FORMAT,
		SBUFFER_LOAD,
LAST_AMDGPU_ISD_NUMBER		LAST_AMDGPU_ISD_NUMBER
};		};


} // End namespace AMDGPUISD		} // End namespace AMDGPUISD

} // End namespace llvm		} // End namespace llvm

#endif		#endif

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
NODE_NAME_CASE(STORE_MSKOR)		NODE_NAME_CASE(STORE_MSKOR)
NODE_NAME_CASE(LOAD_CONSTANT)		NODE_NAME_CASE(LOAD_CONSTANT)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT)
NODE_NAME_CASE(ATOMIC_CMP_SWAP)		NODE_NAME_CASE(ATOMIC_CMP_SWAP)
NODE_NAME_CASE(ATOMIC_INC)		NODE_NAME_CASE(ATOMIC_INC)
NODE_NAME_CASE(ATOMIC_DEC)		NODE_NAME_CASE(ATOMIC_DEC)
NODE_NAME_CASE(BUFFER_LOAD)		NODE_NAME_CASE(BUFFER_LOAD)
NODE_NAME_CASE(BUFFER_LOAD_FORMAT)		NODE_NAME_CASE(BUFFER_LOAD_FORMAT)
		NODE_NAME_CASE(SBUFFER_LOAD)
case AMDGPUISD::LAST_AMDGPU_ISD_NUMBER: break;		case AMDGPUISD::LAST_AMDGPU_ISD_NUMBER: break;
}		}
return nullptr;		return nullptr;
}		}

SDValue AMDGPUTargetLowering::getSqrtEstimate(SDValue Operand,		SDValue AMDGPUTargetLowering::getSqrtEstimate(SDValue Operand,
SelectionDAG &DAG, int Enabled,		SelectionDAG &DAG, int Enabled,
int &RefinementSteps,		int &RefinementSteps,

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

static StringRef computeDataLayout(const Triple &TT) {
if (TT.getArch() == Triple::r600) {		if (TT.getArch() == Triple::r600) {
// 32-bit pointers.		// 32-bit pointers.
return "e-p:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128"		return "e-p:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128"
"-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64";		"-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64";
}		}

// 32-bit private, local, and region pointers. 64-bit global, constant and		// 32-bit private, local, and region pointers. 64-bit global, constant and
// flat.		// flat.
return "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32"		return "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p42:128:128"
		arsenm: Does this need to specify non-integral? Also there are a handful of places that assume 64-bit max we should take care of
		tstellarAMD: We want to be able to use inttoptr, so I don't think we can say non-integral. Do you know where any of these places are?
"-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128"		"-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128"
"-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64";		"-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64";
}		}

LLVM_READNONE		LLVM_READNONE
static StringRef getGPUOrDefault(const Triple &TT, StringRef GPU) {		static StringRef getGPUOrDefault(const Triple &TT, StringRef GPU) {
if (!GPU.empty())		if (!GPU.empty())
return GPU;		return GPU;

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

unsigned getFlatAddressSpace() const {
// Don't bother running InferAddressSpaces pass on graphics shaders which		// Don't bother running InferAddressSpaces pass on graphics shaders which
// don't use flat addressing.		// don't use flat addressing.
if (IsGraphicsShader)		if (IsGraphicsShader)
return -1;		return -1;
return ST->hasFlatAddressSpace() ? AMDGPUAS::FLAT_ADDRESS : -1;		return ST->hasFlatAddressSpace() ? AMDGPUAS::FLAT_ADDRESS : -1;
}		}

unsigned getVectorSplitCost() { return 0; }		unsigned getVectorSplitCost() { return 0; }

		bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info);

		Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
		Type *ExpectedType);

};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

unsigned AMDGPUTTIImpl::getRegisterBitWidth(bool Vector) {		unsigned AMDGPUTTIImpl::getRegisterBitWidth(bool Vector) {
return Vector ? 0 : 32;		return Vector ? 0 : 32;
}		}

unsigned AMDGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {		unsigned AMDGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {
switch (AddrSpace) {		switch (AddrSpace) {
case AMDGPUAS::GLOBAL_ADDRESS:		case AMDGPUAS::GLOBAL_ADDRESS:
case AMDGPUAS::CONSTANT_ADDRESS:		case AMDGPUAS::CONSTANT_ADDRESS:
		case AMDGPUAS::CONSTANT_ADDRESS_W_RSRC:
		arsenm: This should be last
case AMDGPUAS::FLAT_ADDRESS:		case AMDGPUAS::FLAT_ADDRESS:
return 128;		return 128;
case AMDGPUAS::LOCAL_ADDRESS:		case AMDGPUAS::LOCAL_ADDRESS:
case AMDGPUAS::REGION_ADDRESS:		case AMDGPUAS::REGION_ADDRESS:
return 64;		return 64;
case AMDGPUAS::PRIVATE_ADDRESS:		case AMDGPUAS::PRIVATE_ADDRESS:
return 8 * ST->getMaxPrivateElementSize();		return 8 * ST->getMaxPrivateElementSize();
default:		default:
[...]
bool AMDGPUTTIImpl::isSourceOfDivergence(const Value *V) const {
}		}

// Assume all function calls are a source of divergence.		// Assume all function calls are a source of divergence.
if (isa<CallInst>(V) || isa<InvokeInst>(V))		if (isa<CallInst>(V) || isa<InvokeInst>(V))
return true;		return true;

return false;		return false;
}		}

		bool AMDGPUTTIImpl::getTgtMemIntrinsic(IntrinsicInst *Inst,
		MemIntrinsicInfo &Info) {
		IRBuilder<> Builder(Inst);
		switch (Inst->getIntrinsicID()) {
		default:
		return false;
		case Intrinsic::amdgcn_s_buffer_load:
		Info.ReadMem = true;
		Info.WriteMem = false;
		Info.IsSimple = true;
		Info.NumMemRefs = 1;

		// We can only set this if the intrinsic is functionally equivalent to a
		// load/store.
		if (auto Offset = dyn_cast<ConstantInt>(Inst->getArgOperand(1))) {
		if (Offset->isZero() &&
		static_cast<ConstantInt*>(Inst->getArgOperand(2))->isZero()) {
		Info.PtrVal = Inst->getArgOperand(0);
		}
		}
		break;
		}
		return true;
		}

		Value *AMDGPUTTIImpl::getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
		Type *ExpectedType) {
		if (Inst->getType() == ExpectedType)
		return Inst;

		return nullptr;
		}

lib/Target/AMDGPU/BUFInstructions.td

	// int_SI_vs_load_input			// int_SI_vs_load_input
	def : Pat<			def : Pat<
	(SIload_input v4i32:$tlst, imm:$attr_offset, i32:$buf_idx_vgpr),			(SIload_input v4i32:$tlst, imm:$attr_offset, i32:$buf_idx_vgpr),
	(BUFFER_LOAD_FORMAT_XYZW_IDXEN $buf_idx_vgpr, $tlst, (i32 0), imm:$attr_offset, 0, 0, 0)			(BUFFER_LOAD_FORMAT_XYZW_IDXEN $buf_idx_vgpr, $tlst, (i32 0), imm:$attr_offset, 0, 0, 0)
	>;			>;

	// Offset in an 32-bit VGPR			// Offset in an 32-bit VGPR
	def : Pat <			def : Pat <
	(SIload_constant v4i32:$sbase, i32:$voff),			(i32 (SIsbuffer_load v4i32:$sbase, i32:$offset, i1:$glc)),
	(BUFFER_LOAD_DWORD_OFFEN $voff, $sbase, (i32 0), 0, 0, 0, 0)			(BUFFER_LOAD_DWORD_OFFEN $offset, $sbase, (i32 0), 0, (as_i1imm $glc), 0, 0)
	>;			>;


	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// buffer_load/store_format patterns			// buffer_load/store_format patterns
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	multiclass MUBUF_LoadIntrinsicPat<SDPatternOperator name, ValueType vt,			multiclass MUBUF_LoadIntrinsicPat<SDPatternOperator name, ValueType vt,

lib/Target/AMDGPU/SIISelLowering.h

public:
MachineBasicBlock *		MachineBasicBlock *
EmitInstrWithCustomInserter(MachineInstr &MI,		EmitInstrWithCustomInserter(MachineInstr &MI,
MachineBasicBlock *BB) const override;		MachineBasicBlock *BB) const override;
bool enableAggressiveFMAFusion(EVT VT) const override;		bool enableAggressiveFMAFusion(EVT VT) const override;
EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,		EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,
EVT VT) const override;		EVT VT) const override;
MVT getScalarShiftAmountTy(const DataLayout &, EVT) const override;		MVT getScalarShiftAmountTy(const DataLayout &, EVT) const override;
bool isFMAFasterThanFMulAndFAdd(EVT VT) const override;		bool isFMAFasterThanFMulAndFAdd(EVT VT) const override;
		void LowerOperationWrapper(SDNode *N, SmallVectorImpl<SDValue> &Results,
		SelectionDAG &DAG) const override;
SDValue LowerOperation(SDValue Op, SelectionDAG &DAG) const override;		SDValue LowerOperation(SDValue Op, SelectionDAG &DAG) const override;
void ReplaceNodeResults(SDNode *N, SmallVectorImpl<SDValue> &Results,		void ReplaceNodeResults(SDNode *N, SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const override;		SelectionDAG &DAG) const override;

SDValue PerformDAGCombine(SDNode *N, DAGCombinerInfo &DCI) const override;		SDValue PerformDAGCombine(SDNode *N, DAGCombinerInfo &DCI) const override;
SDNode *PostISelFolding(MachineSDNode *N, SelectionDAG &DAG) const override;		SDNode *PostISelFolding(MachineSDNode *N, SelectionDAG &DAG) const override;
void AdjustInstrPostInstrSelection(MachineInstr &MI,		void AdjustInstrPostInstrSelection(MachineInstr &MI,
SDNode *Node) const override;		SDNode *Node) const override;

lib/Target/AMDGPU/SIISelLowering.cpp

#include "llvm/ADT/APFloat.h"		#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"		#include "llvm/ADT/APInt.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/BitVector.h"		#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"		#include "llvm/ADT/StringSwitch.h"
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
		#include "llvm/Analysis/Loads.h"
#include "llvm/CodeGen/Analysis.h"		#include "llvm/CodeGen/Analysis.h"
#include "llvm/CodeGen/CallingConvLower.h"		#include "llvm/CodeGen/CallingConvLower.h"
#include "llvm/CodeGen/DAGCombine.h"		#include "llvm/CodeGen/DAGCombine.h"
#include "llvm/CodeGen/ISDOpcodes.h"		#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFrameInfo.h"		#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"		#include "llvm/CodeGen/MachineInstr.h"
[...]
case Intrinsic::amdgcn_atomic_dec:
Info.opc = ISD::INTRINSIC_W_CHAIN;		Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::getVT(CI.getType());		Info.memVT = MVT::getVT(CI.getType());
Info.ptrVal = CI.getOperand(0);		Info.ptrVal = CI.getOperand(0);
Info.align = 0;		Info.align = 0;
Info.vol = false;		Info.vol = false;
Info.readMem = true;		Info.readMem = true;
Info.writeMem = true;		Info.writeMem = true;
return true;		return true;
		case Intrinsic::amdgcn_s_buffer_load:
		Info.opc = AMDGPUISD::SBUFFER_LOAD;
		Info.memVT = MVT::getVT(CI.getType());
		Info.ptrVal = CI.getOperand(0);
		Info.align = 0;
		Info.vol = false;
		Info.readMem = true;
		Info.writeMem = false;
		return true;
default:		default:
return false;		return false;
}		}
}		}

bool SITargetLowering::isShuffleMaskLegal(const SmallVectorImpl<int> &,		bool SITargetLowering::isShuffleMaskLegal(const SmallVectorImpl<int> &,
EVT) const {		EVT) const {
// SI has some legal vector types, but no legal vector operations. Say no		// SI has some legal vector types, but no legal vector operations. Say no
[...]
if (Subtarget->getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {
// by setting the stride value in the resource descriptor which would		// by setting the stride value in the resource descriptor which would
// increase the size limit to (stride * 4GB). However, this is risky,		// increase the size limit to (stride * 4GB). However, this is risky,
// because it has never been validated.		// because it has never been validated.
return isLegalFlatAddressingMode(AM);		return isLegalFlatAddressingMode(AM);
}		}

return isLegalMUBUFAddressingMode(AM);		return isLegalMUBUFAddressingMode(AM);

		case AMDGPUAS::CONSTANT_ADDRESS_W_RSRC:
case AMDGPUAS::CONSTANT_ADDRESS:		case AMDGPUAS::CONSTANT_ADDRESS:
// If the offset isn't a multiple of 4, it probably isn't going to be		// If the offset isn't a multiple of 4, it probably isn't going to be
// correctly aligned.		// correctly aligned.
// FIXME: Can we get the real alignment here?		// FIXME: Can we get the real alignment here?
if (AM.BaseOffs % 4 != 0)		if (AM.BaseOffs % 4 != 0)
return isLegalMUBUFAddressingMode(AM);		return isLegalMUBUFAddressingMode(AM);

// There are no SMRD extloads, so if we have to do a small type access we		// There are no SMRD extloads, so if we have to do a small type access we
[...]
bool SITargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {

return false;		return false;
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Custom DAG Lowering Operations		// Custom DAG Lowering Operations
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//


		void SITargetLowering::LowerOperationWrapper(SDNode *N,
		SmallVectorImpl<SDValue> &Results,
		SelectionDAG &DAG) const {

		if (N->getOpcode() != AMDGPUISD::SBUFFER_LOAD) {
		TargetLowering::LowerOperationWrapper(N, Results, DAG);
		return;
		}

		SDLoc DL(N);
		MemSDNode *M = cast<MemSDNode>(N);
		SDValue Ops[] = {
		M->getOperand(0), // Chain
		DAG.getNode(ISD::BITCAST, DL, MVT::v4i32, M->getOperand(1)), // Ptr
		M->getOperand(2), // Offset
		DAG.getTargetConstant(cast<ConstantSDNode>(
		M->getOperand(3))->getZExtValue(), DL, MVT::i1) // glc
		};

		auto MMO = M->getMemOperand();
		if (isDereferenceablePointer(MMO->getValue(), DAG.getDataLayout()))
		MMO->setFlags(MachineMemOperand::MODereferenceable);

		SDValue LD = DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
		M->getVTList(), Ops, M->getMemoryVT(),
		M->getMemOperand());
		Results.push_back(LD);
		Results.push_back(LD.getValue(1));
		}

SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {		SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
switch (Op.getOpcode()) {		switch (Op.getOpcode()) {
default: return AMDGPUTargetLowering::LowerOperation(Op, DAG);		default: return AMDGPUTargetLowering::LowerOperation(Op, DAG);
case ISD::BRCOND: return LowerBRCOND(Op, DAG);		case ISD::BRCOND: return LowerBRCOND(Op, DAG);
case ISD::LOAD: {		case ISD::LOAD: {
SDValue Result = LowerLOAD(Op, DAG);		SDValue Result = LowerLOAD(Op, DAG);
assert((!Result.getNode() ||		assert((!Result.getNode() ||
Result.getNode()->getNumValues() == 2) &&		Result.getNode()->getNumValues() == 2) &&
[...]
case Intrinsic::r600_read_tidig_y:
return CreateLiveInRegister(DAG, &AMDGPU::VGPR_32RegClass,		return CreateLiveInRegister(DAG, &AMDGPU::VGPR_32RegClass,
TRI->getPreloadedValue(MF, SIRegisterInfo::WORKITEM_ID_Y), VT);		TRI->getPreloadedValue(MF, SIRegisterInfo::WORKITEM_ID_Y), VT);
case Intrinsic::amdgcn_workitem_id_z:		case Intrinsic::amdgcn_workitem_id_z:
case Intrinsic::r600_read_tidig_z:		case Intrinsic::r600_read_tidig_z:
return CreateLiveInRegister(DAG, &AMDGPU::VGPR_32RegClass,		return CreateLiveInRegister(DAG, &AMDGPU::VGPR_32RegClass,
TRI->getPreloadedValue(MF, SIRegisterInfo::WORKITEM_ID_Z), VT);		TRI->getPreloadedValue(MF, SIRegisterInfo::WORKITEM_ID_Z), VT);
case AMDGPUIntrinsic::SI_load_const: {		case AMDGPUIntrinsic::SI_load_const: {
SDValue Ops[] = {		SDValue Ops[] = {
Op.getOperand(1),		DAG.getEntryNode(), // Chain
Op.getOperand(2)		Op.getOperand(1), // Ptr
		Op.getOperand(2), // Offset
		DAG.getTargetConstant(0, DL, MVT::i1) // glc
};		};

MachineMemOperand *MMO = MF.getMachineMemOperand(		MachineMemOperand *MMO = MF.getMachineMemOperand(
MachinePointerInfo(),		MachinePointerInfo(),
MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |		MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant,		MachineMemOperand::MOInvariant,
VT.getStoreSize(), 4);		VT.getStoreSize(), 4);
return DAG.getMemIntrinsicNode(AMDGPUISD::LOAD_CONSTANT, DL,		SDVTList VTList = DAG.getVTList(MVT::i32, MVT::Other);
Op->getVTList(), Ops, VT, MMO);		SDValue Load = DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
		VTList, Ops, MVT::i32, MMO);

		SDValue MergeOps[] = {
		DAG.getNode(ISD::BITCAST, DL, MVT::f32, Load),
		Load.getValue(1)
		};

		return DAG.getMergeValues(MergeOps, DL);
}		}
case AMDGPUIntrinsic::amdgcn_fdiv_fast:		case AMDGPUIntrinsic::amdgcn_fdiv_fast:
return lowerFDIV_FAST(Op, DAG);		return lowerFDIV_FAST(Op, DAG);
case AMDGPUIntrinsic::SI_vs_load_input:		case AMDGPUIntrinsic::SI_vs_load_input:
return DAG.getNode(AMDGPUISD::LOAD_INPUT, DL, VT,		return DAG.getNode(AMDGPUISD::LOAD_INPUT, DL, VT,
Op.getOperand(1),		Op.getOperand(1),
Op.getOperand(2),		Op.getOperand(2),
Op.getOperand(3));		Op.getOperand(3));

lib/Target/AMDGPU/SIInstrInfo.td

def SIEncodingFamily {
int SI = 0;		int SI = 0;
int VI = 1;		int VI = 1;
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// SI DAG Nodes		// SI DAG Nodes
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

def SIload_constant : SDNode<"AMDGPUISD::LOAD_CONSTANT",
SDTypeProfile<1, 2, [SDTCisVT<0, f32>, SDTCisVT<1, v4i32>, SDTCisVT<2, i32>]>,
[SDNPMayLoad, SDNPMemOperand]
>;

def SIatomic_inc : SDNode<"AMDGPUISD::ATOMIC_INC", SDTAtomic2,		def SIatomic_inc : SDNode<"AMDGPUISD::ATOMIC_INC", SDTAtomic2,
[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]		[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]
>;		>;

def SIatomic_dec : SDNode<"AMDGPUISD::ATOMIC_DEC", SDTAtomic2,		def SIatomic_dec : SDNode<"AMDGPUISD::ATOMIC_DEC", SDTAtomic2,
[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]		[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]
>;		>;


//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// PatFrags for global memory operations		// PatFrags for global memory operations
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

defm atomic_inc_global : global_binary_atomic_op<SIatomic_inc>;		defm atomic_inc_global : global_binary_atomic_op<SIatomic_inc>;
defm atomic_dec_global : global_binary_atomic_op<SIatomic_dec>;		defm atomic_dec_global : global_binary_atomic_op<SIatomic_dec>;

		def SIsbuffer_load : SDNode<"AMDGPUISD::SBUFFER_LOAD",
		SDTypeProfile<1, 3, [SDTCisVT<1, v4i32>, SDTCisVT<2, i32>, SDTCisVT<3, i1>]>,
		[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]
		>;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// SDNodes and PatFrag for local loads and stores to enable s_mov_b32 m0, -1		// SDNodes and PatFrag for local loads and stores to enable s_mov_b32 m0, -1
// to be glued to the memory instructions.		// to be glued to the memory instructions.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

def SIld_local : SDNode <"ISD::LOAD", SDTLoad,		def SIld_local : SDNode <"ISD::LOAD", SDTLoad,
[SDNPHasChain, SDNPMayLoad, SDNPMemOperand, SDNPInGlue]		[SDNPHasChain, SDNPMayLoad, SDNPMemOperand, SDNPInGlue]
>;		>;

lib/Target/AMDGPU/SMInstructions.td


	defm : SMRD_Pattern <"S_LOAD_DWORD", i32>;			defm : SMRD_Pattern <"S_LOAD_DWORD", i32>;
	defm : SMRD_Pattern <"S_LOAD_DWORDX2", v2i32>;			defm : SMRD_Pattern <"S_LOAD_DWORDX2", v2i32>;
	defm : SMRD_Pattern <"S_LOAD_DWORDX4", v4i32>;			defm : SMRD_Pattern <"S_LOAD_DWORDX4", v4i32>;
	defm : SMRD_Pattern <"S_LOAD_DWORDX8", v8i32>;			defm : SMRD_Pattern <"S_LOAD_DWORDX8", v8i32>;
	defm : SMRD_Pattern <"S_LOAD_DWORDX16", v16i32>;			defm : SMRD_Pattern <"S_LOAD_DWORDX16", v16i32>;

	// 1. Offset as an immediate			// 1. Offset as an immediate
	def SM_LOAD_PATTERN : Pat < // name this pattern to reuse AddedComplexity on CI			// name this pattern to reuse AddedComplexity on CI
	(SIload_constant v4i32:$sbase, (SMRDBufferImm i32:$offset)),			def SM_LOAD_PATTERN : Pat <
	(S_BUFFER_LOAD_DWORD_IMM $sbase, $offset, 0)			(i32 (SIsbuffer_load v4i32:$sbase, (SMRDBufferImm i32:$offset), i1:$glc)),
				(S_BUFFER_LOAD_DWORD_IMM $sbase, $offset, (as_i1imm $glc))
	>;			>;

	// 2. Offset loaded in a 32-bit SGPR			// 2. Offset loaded in a 32-bit SGPR
	def : Pat <			def : Pat <
	(SIload_constant v4i32:$sbase, (SMRDBufferSgpr i32:$offset)),			(i32 (SIsbuffer_load v4i32:$sbase, (SMRDBufferSgpr i32:$offset), i1:$glc)),
	(S_BUFFER_LOAD_DWORD_SGPR $sbase, $offset, 0)			(S_BUFFER_LOAD_DWORD_SGPR $sbase, $offset, (as_i1imm $glc))
	>;			>;

	} // End let AddedComplexity = 100			} // End let AddedComplexity = 100

	} // let Predicates = [isGCN]			} // let Predicates = [isGCN]

	let Predicates = [isVI] in {			let Predicates = [isVI] in {


	def : SMRD_Pattern_ci <"S_LOAD_DWORD", i32>;			def : SMRD_Pattern_ci <"S_LOAD_DWORD", i32>;
	def : SMRD_Pattern_ci <"S_LOAD_DWORDX2", v2i32>;			def : SMRD_Pattern_ci <"S_LOAD_DWORDX2", v2i32>;
	def : SMRD_Pattern_ci <"S_LOAD_DWORDX4", v4i32>;			def : SMRD_Pattern_ci <"S_LOAD_DWORDX4", v4i32>;
	def : SMRD_Pattern_ci <"S_LOAD_DWORDX8", v8i32>;			def : SMRD_Pattern_ci <"S_LOAD_DWORDX8", v8i32>;
	def : SMRD_Pattern_ci <"S_LOAD_DWORDX16", v16i32>;			def : SMRD_Pattern_ci <"S_LOAD_DWORDX16", v16i32>;

	def : Pat <			def : Pat <
	(SIload_constant v4i32:$sbase, (SMRDBufferImm32 i32:$offset)),			(i32 (SIsbuffer_load v4i32:$sbase, (SMRDBufferImm32 i32:$offset), i1:$glc)),
	(S_BUFFER_LOAD_DWORD_IMM_ci $sbase, $offset, 0)> {			(S_BUFFER_LOAD_DWORD_IMM_ci $sbase, $offset, $glc)> {
	let Predicates = [isCI]; // should this be isCIOnly?			let Predicates = [isCI]; // should this be isCIOnly?
	}			}

	} // End let AddedComplexity = SM_LOAD_PATTERN.AddedComplexity			} // End let AddedComplexity = SM_LOAD_PATTERN.AddedComplexity

test/CodeGen/AMDGPU/mubuf.ll

main_body:
%tmp1 = load <16 x i8>, <16 x i8> addrspace(2)* %tmp0		%tmp1 = load <16 x i8>, <16 x i8> addrspace(2)* %tmp0
%tmp2 = shl i32 %6, 2		%tmp2 = shl i32 %6, 2
%tmp3 = call i32 @llvm.SI.buffer.load.dword.i32.i32(<16 x i8> %tmp1, i32 %tmp2, i32 65, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0)		%tmp3 = call i32 @llvm.SI.buffer.load.dword.i32.i32(<16 x i8> %tmp1, i32 %tmp2, i32 65, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0)
%tmp4 = add i32 %6, 16		%tmp4 = add i32 %6, 16
call void @llvm.SI.tbuffer.store.i32(<16 x i8> %tmp1, i32 %tmp3, i32 1, i32 %tmp4, i32 %4, i32 0, i32 4, i32 4, i32 1, i32 0, i32 1, i32 1, i32 0)		call void @llvm.SI.tbuffer.store.i32(<16 x i8> %tmp1, i32 %tmp3, i32 1, i32 %tmp4, i32 %4, i32 0, i32 4, i32 4, i32 1, i32 0, i32 1, i32 1, i32 0)
ret void		ret void
}		}

		; Using the load.const intrinsic with a VGPR offset
		; CHECK-LABEL: {{^}}s_buffer_load:
		; CHECK: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+}}:{{[0-9]+}}], 0 offen
		; CHECK: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+}}:{{[0-9]+}}], 0 offen
		define amdgpu_ps void @s_buffer_load(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float, i32 addrspace(42)* addrspace(2)* inreg %in) {
		main_body:
		%tid = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
		%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0
		%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20
		%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 %tid)
		%23 = load i32 addrspace(42)*, i32 addrspace(42)* addrspace(2)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %23, i32 %tid, i1 false)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %s.buffer.float)
		ret void
		}

;;;==========================================================================;;;		;;;==========================================================================;;;
;;; MUBUF STORE TESTS		;;; MUBUF STORE TESTS
;;;==========================================================================;;;		;;;==========================================================================;;;

; MUBUF store with an immediate byte offset that fits into 12-bits		; MUBUF store with an immediate byte offset that fits into 12-bits
; CHECK-LABEL: {{^}}mubuf_store0:		; CHECK-LABEL: {{^}}mubuf_store0:
; CHECK: buffer_store_dword v{{[0-9]}}, off, s[{{[0-9]:[0-9]}}], 0 offset:4 ; encoding: [0x04,0x00,0x70,0xe0		; CHECK: buffer_store_dword v{{[0-9]}}, off, s[{{[0-9]:[0-9]}}], 0 offset:4 ; encoding: [0x04,0x00,0x70,0xe0
define void @mubuf_store0(i32 addrspace(1)* %out) {		define void @mubuf_store0(i32 addrspace(1)* %out) {
; CHECK: buffer_store_dword v{{[0-9]+}}, v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64		; CHECK: buffer_store_dword v{{[0-9]+}}, v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64
define void @store_vgpr_ptr(i32 addrspace(1)* %out) #0 {		define void @store_vgpr_ptr(i32 addrspace(1)* %out) #0 {
%tid = call i32 @llvm.amdgcn.workitem.id.x() readnone		%tid = call i32 @llvm.amdgcn.workitem.id.x() readnone
%out.gep = getelementptr i32, i32 addrspace(1)* %out, i32 %tid		%out.gep = getelementptr i32, i32 addrspace(1)* %out, i32 %tid
store i32 99, i32 addrspace(1)* %out.gep, align 4		store i32 99, i32 addrspace(1)* %out.gep, align 4
ret void		ret void
}		}

		declare i32 @llvm.amdgcn.mbcnt.lo(i32, i32) #1
declare i32 @llvm.SI.buffer.load.dword.i32.i32(<16 x i8>, i32, i32, i32, i32, i32, i32, i32, i32) #0		declare i32 @llvm.SI.buffer.load.dword.i32.i32(<16 x i8>, i32, i32, i32, i32, i32, i32, i32, i32) #0
declare void @llvm.SI.tbuffer.store.i32(<16 x i8>, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32)		declare void @llvm.SI.tbuffer.store.i32(<16 x i8>, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32)
		declare float @llvm.SI.load.const(<16 x i8>, i32) #1
		declare i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* nocapture, i32, i1)
		declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)

attributes #0 = { nounwind readonly }		attributes #0 = { nounwind readonly }
		attributes #1 = { nounwind readnone }

test/CodeGen/AMDGPU/smrd.ll

entry:
%1 = load i32, i32 addrspace(2)* %0		%1 = load i32, i32 addrspace(2)* %0
store i32 %1, i32 addrspace(1)* %out		store i32 %1, i32 addrspace(1)* %out
ret void		ret void
}		}

; SMRD load using the load.const intrinsic with an immediate offset		; SMRD load using the load.const intrinsic with an immediate offset
; GCN-LABEL: {{^}}smrd_load_const0:		; GCN-LABEL: {{^}}smrd_load_const0:
; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x4 ; encoding: [0x04		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x4 ; encoding: [0x04
		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x4 ; encoding: [0x04
		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x10
; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x10		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x10
define amdgpu_ps void @smrd_load_const0(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float) {		define amdgpu_ps void @smrd_load_const0(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float, i32 addrspace(42)* addrspace(2)* inreg %in) {
main_body:		main_body:
%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0		%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0
%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20		%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20
%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 16)		%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 16)
call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %22)		%23 = load i32 addrspace(42)*, i32 addrspace(42)* addrspace(2)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %23, i32 16, i1 false)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %s.buffer.float)
ret void		ret void
}		}

; SMRD load using the load.const intrinsic with the largest possible immediate		; SMRD load using the load.const intrinsic with the largest possible immediate
; offset.		; offset.
; GCN-LABEL: {{^}}smrd_load_const1:		; GCN-LABEL: {{^}}smrd_load_const1:
; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xff ; encoding: [0xff		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xff ; encoding: [0xff
		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xff ; encoding: [0xff
		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3fc
; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3fc		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3fc
define amdgpu_ps void @smrd_load_const1(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float) {		define amdgpu_ps void @smrd_load_const1(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float, i32 addrspace(42)* addrspace(2)* inreg %in) {
main_body:		main_body:
%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0		%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0
%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20		%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20
%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1020)		%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1020)
call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %22)		%23 = load i32 addrspace(42)*, i32 addrspace(42)* addrspace(2)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %23, i32 1020, i1 false)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %s.buffer.float)
ret void		ret void
}		}
; SMRD load using the load.const intrinsic with an offset greater than the		; SMRD load using the load.const intrinsic with an offset greater than the
; largest possible immediate offset.		; largest possible immediate offset.
; GCN-LABEL: {{^}}smrd_load_const2:		; GCN-LABEL: {{^}}smrd_load_const2:
; SI: s_movk_i32 s[[OFFSET:[0-9]]], 0x400		; SI: s_movk_i32 s[[OFFSET:[0-9]]], 0x400
; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], s[[OFFSET]] ; encoding: [0x0[[OFFSET]]		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], s[[OFFSET]] ; encoding: [0x0[[OFFSET]]
		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], s[[OFFSET]] ; encoding: [0x0[[OFFSET]]
		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x100
; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x100		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x100
; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x400		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x400
define amdgpu_ps void @smrd_load_const2(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float) {		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x400
		define amdgpu_ps void @smrd_load_const2(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float, i32 addrspace(42)* addrspace(2)* inreg %in) {
main_body:		main_body:
%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0		%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0
%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20		%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20
%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1024)		%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1024)
call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %22)		%23 = load i32 addrspace(42)*, i32 addrspace(42)* addrspace(2)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %23, i32 1024, i1 false)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %s.buffer.float)
ret void		ret void
}		}

; SMRD load with the largest possible immediate offset on VI		; SMRD load with the largest possible immediate offset on VI
; GCN-LABEL: {{^}}smrd_load_const3:		; GCN-LABEL: {{^}}smrd_load_const3:
; SI: s_mov_b32 [[OFFSET:s[0-9]+]], 0xffffc		; SI: s_mov_b32 [[OFFSET:s[0-9]+]], 0xffffc
; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3ffff
; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3ffff		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3ffff
; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xffffc		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xffffc
define amdgpu_ps void @smrd_load_const3(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float) {		; VI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xffffc
		define amdgpu_ps void @smrd_load_const3(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float, i32 addrspace(42)* addrspace(2)* inreg %in) {
main_body:		main_body:
%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0		%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0
%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20		%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20
%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1048572)		%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1048572)
call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %22)		%23 = load i32 addrspace(42)*, i32 addrspace(42)* addrspace(2)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %23, i32 1048572, i1 false)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %s.buffer.float)
ret void		ret void
}		}

; SMRD load with an offset greater than the largest possible immediate on VI		; SMRD load with an offset greater than the largest possible immediate on VI
; GCN-LABEL: {{^}}smrd_load_const4:		; GCN-LABEL: {{^}}smrd_load_const4:
; SIVI: s_mov_b32 [[OFFSET:s[0-9]+]], 0x100000		; SIVI: s_mov_b32 [[OFFSET:s[0-9]+]], 0x100000
; SIVI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]		; SIVI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; SIVI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x40000
; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x40000		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x40000
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_ps void @smrd_load_const4(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float) {		define amdgpu_ps void @smrd_load_const4(<16 x i8> addrspace(2)* inreg, <16 x i8> addrspace(2)* inreg, <32 x i8> addrspace(2)* inreg, i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, float, float, float, float, i32 addrspace(42)* addrspace(2)* inreg %in) {
main_body:		main_body:
%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0		%20 = getelementptr <16 x i8>, <16 x i8> addrspace(2)* %0, i32 0
%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20		%21 = load <16 x i8>, <16 x i8> addrspace(2)* %20
%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1048576)		%22 = call float @llvm.SI.load.const(<16 x i8> %21, i32 1048576)
call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %22)		%23 = load i32 addrspace(42)*, i32 addrspace(42)* addrspace(2)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %23, i32 1048576, i1 false)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.SI.export(i32 15, i32 1, i32 1, i32 0, i32 0, float %22, float %22, float %22, float %s.buffer.float)
ret void		ret void
}		}

; Function Attrs: nounwind readnone		; Function Attrs: nounwind readnone
declare float @llvm.SI.load.const(<16 x i8>, i32) #0		declare float @llvm.SI.load.const(<16 x i8>, i32) #0

declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)		declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)

		declare i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* nocapture, i32, i1)

attributes #0 = { nounwind readnone }		attributes #0 = { nounwind readnone }
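
Note on the smrd.ll expectations above: the CHECK lines imply a different offset operand per hardware generation for the same byte offset. The sketch below restates that behaviour as a small standalone C++ function. It is inferred only from the CHECK lines in this test (for example 1024 bytes becoming an SGPR on SI, 0x100 on CI and 0x400 on VI), not from the ISA documentation or from the SMRDBufferImm selection code, and the names Gen, OffsetOperand and selectOffset are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the offset operand chosen for s_buffer_load_dword,
// reconstructed from the CHECK lines in smrd.ll only.
enum class Gen { SI, CI, VI };

struct OffsetOperand {
  bool inSgpr;          // true: offset is materialized in an SGPR
  std::uint32_t value;  // immediate as printed, or the value moved to the SGPR
};

inline OffsetOperand selectOffset(Gen gen, std::uint32_t byteOffset) {
  assert(byteOffset % 4 == 0 && "sketch covers dword-aligned offsets only");
  switch (gen) {
  case Gen::SI:
    // Immediate is counted in dwords and must fit in 8 bits (1020 -> 0xff);
    // larger offsets are moved to an SGPR as a byte offset (1024 -> 0x400).
    if (byteOffset / 4 <= 0xff)
      return {false, byteOffset / 4};
    return {true, byteOffset};
  case Gen::CI:
    // The tests show a dword-counted immediate even for large offsets
    // (1024 -> 0x100, 1048576 -> 0x40000).
    return {false, byteOffset / 4};
  case Gen::VI:
    // Immediate is a byte offset of up to 20 bits (1048572 -> 0xffffc);
    // anything larger goes to an SGPR (1048576 -> 0x100000).
    if (byteOffset <= 0xfffff)
      return {false, byteOffset};
    return {true, byteOffset};
  }
  return {true, byteOffset};  // unreachable for the three cases above
}
```

For instance, selectOffset(Gen::SI, 16) yields the immediate 0x4 and selectOffset(Gen::VI, 16) yields 0x10, matching the SICI and VI lines of smrd_load_const0.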

test/Transforms/EarlyCSE/AMDGPU/intrinsics.ll

This file was added.

				; RUN: opt < %s -S -mtriple=amdgcn-- -early-cse | FileCheck %s

				; CHECK-LABEL: @no_cse
				; CHECK: call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 0, i1 false)
				; CHECK: call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 4, i1 false)
				define void @no_cse(i32 addrspace(1)* %out, i32 addrspace(42)* %in) {
				%a = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 0, i1 false)
				%b = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 4, i1 false)
				%c = add i32 %a, %b
				store i32 %c, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @cse_zero_offset
				; CHECK: [[CSE:%[a-z0-9A-Z]+]] = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 0, i1 false)
				; CHECK: add i32 [[CSE]], [[CSE]]
				define void @cse_zero_offset(i32 addrspace(1)* %out, i32 addrspace(42)* %in) {
				%a = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 0, i1 false)
				%b = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 0, i1 false)
				%c = add i32 %a, %b
				store i32 %c, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @cse_nonzero_offset
				; CHECK: [[CSE:%[a-z0-9A-Z]+]] = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 4, i1 false)
				; CHECK: add i32 [[CSE]], [[CSE]]
				define void @cse_nonzero_offset(i32 addrspace(1)* %out, i32 addrspace(42)* %in) {
				%a = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 4, i1 false)
				%b = call i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* %in, i32 4, i1 false)
				%c = add i32 %a, %b
				store i32 %c, i32 addrspace(1)* %out
				ret void
				}

				declare i32 @llvm.amdgcn.s.buffer.load.i32(i32 addrspace(42)* nocapture, i32, i1)
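
Note on the EarlyCSE test above: the three functions check that two calls are merged only when the resource, offset and glc operands are all identical. The toy model below shows that keying idea in standalone C++. It is not how EarlyCSE is implemented, and it assumes the calls have no side effects and that the memory they read does not change between them; SBufferLoad and cseLoads are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one call to the s.buffer.load intrinsic.
struct SBufferLoad {
  int resource;          // identifies the buffer pointer operand
  std::uint32_t offset;  // byte offset operand
  bool glc;              // glc operand
};

// For each call, return the index of the first earlier call with identical
// operands (its "leader"), or the call's own index if there is none.
inline std::vector<std::size_t> cseLoads(const std::vector<SBufferLoad> &calls) {
  std::map<std::tuple<int, std::uint32_t, bool>, std::size_t> seen;
  std::vector<std::size_t> leader;
  leader.reserve(calls.size());
  for (std::size_t i = 0; i < calls.size(); ++i) {
    auto key = std::make_tuple(calls[i].resource, calls[i].offset, calls[i].glc);
    // emplace keeps the existing entry when the key was already seen.
    auto it = seen.emplace(key, i).first;
    leader.push_back(it->second);
  }
  assert(leader.size() == calls.size());
  return leader;
}
```

With offsets 0 and 4 on the same resource, each call is its own leader, as in @no_cse. With two calls at offset 4, the second maps to the first, as in @cse_nonzero_offset.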


Revision Contents

Diff 86472

include/llvm/IR/IntrinsicsAMDGPU.td

lib/Target/AMDGPU/AMDGPU.h

lib/Target/AMDGPU/AMDGPUISelLowering.h

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

lib/Target/AMDGPU/BUFInstructions.td

lib/Target/AMDGPU/SIISelLowering.h

lib/Target/AMDGPU/SIISelLowering.cpp

lib/Target/AMDGPU/SIInstrInfo.td

lib/Target/AMDGPU/SMInstructions.td

test/CodeGen/AMDGPU/mubuf.ll

test/CodeGen/AMDGPU/smrd.ll

test/Transforms/EarlyCSE/AMDGPU/intrinsics.ll
