This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Round up kernel argument allocation size
ClosedPublic

Authored by arsenm on May 25 2018, 5:50 AM.


Details

Reviewers

t-tye
rampitec
kzhuravl
b-sumner

Summary

AFAIK the driver's allocation will actually have to round this
up anyway. It is useful to track the rounded-up size so that
the end of the kernel argument segment is known to be
dereferenceable, which allows a wider s_load_dword to be used
for a short argument at the end of the segment.

One thing I'm unclear on is whether we really need to allocate the
implicit arg bytes if they aren't actually used.
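
As a rough illustration of the rounding described in this summary, the sketch below restates the size computation as a standalone function. The names `kernArgSegmentSize`, `ImplicitBytes`, and `ImplicitAlign` are hypothetical; in the patch itself the implicit values come from `getImplicitArgNumBytes(F)` and `getAlignmentForImplicitArgPtr()`, and the local `alignTo` is only a stand-in for the LLVM helper of the same name.

```cpp
#include <cassert>
#include <cstdint>

// Round Value up to the next multiple of Align (Align must be nonzero).
// Stand-in for llvm::alignTo so the sketch builds without LLVM headers.
static uint64_t alignTo(uint64_t Value, uint64_t Align) {
  return (Value + Align - 1) / Align * Align;
}

// Hypothetical free-function restatement of the size computation.
// ImplicitBytes and ImplicitAlign are plain parameters here (assumption);
// the patch queries them from the subtarget instead.
static uint64_t kernArgSegmentSize(uint64_t ExplicitArgBytes,
                                   uint64_t ImplicitBytes,
                                   uint64_t ImplicitAlign) {
  uint64_t TotalSize = ExplicitArgBytes;
  if (ImplicitBytes != 0)
    TotalSize = alignTo(ExplicitArgBytes, ImplicitAlign) + ImplicitBytes;
  // Pad to a dword so a 4-byte load of a short trailing argument
  // stays inside the segment.
  return alignTo(TotalSize, 4);
}
```

Under these assumptions, a kernel taking an 8-byte pointer and an `i8` has 9 explicit bytes, so `kernArgSegmentSize(9, 0, 8)` yields 12, while 16 explicit bytes stay at 16.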

Diff Detail

Event Timeline

arsenm created this revision. May 25 2018, 5:50 AM

Herald added subscribers: tpr, dstuttard, yaxunl and 2 others. May 25 2018, 5:50 AM

Are we sure that is what RT(s) do?

In D47370#1112729, @rampitec wrote:

Are we sure that is what RT(s) do?

It doesn't really matter if it does or not, since we're now requesting the larger allocation

t-tye added inline comments. May 25 2018, 12:05 PM

lib/Target/AMDGPU/AMDGPUSubtarget.cpp
426	I believe you can align this to 16. See the HSA spec at www.hsafoundation.com/html_spec111/HSA_Library.htm#PRM/Topics/04_SyntaxSemantics/kernarg_segment.htm which says: "The alignment of the base address of the kernel's kernarg segment variables is the larger of 16 bytes and the maximum alignment of the kernel's kernarg segment variables." I suspect that the OpenCL runtime simply aligns all kernarg allocations to 256, but I am not sure about other languages.

In D47370#1112733, @arsenm wrote:

In D47370#1112729, @rampitec wrote:

Are we sure that is what RT(s) do?

It doesn't really matter if it does or not, since we're now requesting the larger allocation

Note that technically the HSA spec requires the kernarg segment size to match the size and alignment of the kernel arguments. So it ought not to be larger than deduced from the arguments. In this context the implicit arguments are just extra arguments.
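
The base-alignment rule quoted from the HSA spec in the inline comment can be sketched as follows. This is only an illustration of that one sentence; the function name `kernargBaseAlign` and its argument list are hypothetical and do not appear in the patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>

// Alignment of the kernarg segment base under the quoted HSA rule:
// the larger of 16 bytes and the maximum alignment of any kernarg variable.
// ArgAligns lists the alignment of each kernel argument (hypothetical input).
static uint32_t kernargBaseAlign(std::initializer_list<uint32_t> ArgAligns) {
  uint32_t MaxAlign = 16; // floor given by the quoted spec text
  for (uint32_t A : ArgAligns)
    MaxAlign = std::max(MaxAlign, A);
  return MaxAlign;
}
```

For example, arguments aligned to 8 and 1 bytes give a base alignment of 16, while an argument aligned to 32 bytes raises it to 32.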

arsenm added inline comments. May 25 2018, 12:09 PM

lib/Target/AMDGPU/AMDGPUSubtarget.cpp
426	I think we really only need to pad up to 4 at the end to avoid vmem extloads. Wider scalar loads at the end may be useful in some cases, but we currently do so badly at this that I wouldn't worry about it yet.
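
To make the padding argument concrete, here is a small sketch that checks whether the aligned 4-byte word containing a short argument lies inside the segment. The helper `dwordLoadInBounds` is hypothetical and assumes the argument is at most 4 bytes and does not straddle a dword boundary.

```cpp
#include <cstdint>

// True if the naturally aligned 4-byte word holding the argument at
// [Offset, Offset + Size) lies entirely inside a segment of SegmentSize bytes.
// Assumption: Size <= 4; an argument that straddles a dword boundary
// is rejected.
static bool dwordLoadInBounds(uint64_t Offset, uint64_t Size,
                              uint64_t SegmentSize) {
  uint64_t WordStart = Offset & ~uint64_t(3); // round down to a dword
  uint64_t WordEnd = WordStart + 4;
  return Offset + Size <= WordEnd && WordEnd <= SegmentSize;
}
```

With an `i8` at offset 8, a 9-byte segment fails the check, whereas the same segment padded to 12 bytes passes, so a dword load of that argument would stay in bounds.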

LGTM

This revision is now accepted and ready to land. May 25 2018, 12:13 PM

t-tye added inline comments. May 25 2018, 1:58 PM

lib/Target/AMDGPU/AMDGPUSubtarget.cpp
426	Confirmed that OpenCL chooses to always align kernarg up to 128. I don't think the compiler should rely on anything higher than the 16 defined by the HSA spec, but it is interesting to know.

r333456

Revision Contents

Path                                                      Size
lib/Target/AMDGPU/AMDGPUMachineFunction.h                 1 line
lib/Target/AMDGPU/AMDGPUSubtarget.cpp                     12 lines
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp               5 lines
test/CodeGen/AMDGPU/kernel-args.ll                        53 lines
test/CodeGen/AMDGPU/llvm.amdgcn.kernarg.segment.ptr.ll    48 lines

Diff 148590

lib/Target/AMDGPU/AMDGPUMachineFunction.h


	namespace llvm {			namespace llvm {

	class AMDGPUMachineFunction : public MachineFunctionInfo {			class AMDGPUMachineFunction : public MachineFunctionInfo {
	/// A map to keep track of local memory objects and their offsets within the			/// A map to keep track of local memory objects and their offsets within the
	/// local memory space.			/// local memory space.
	SmallDenseMap<const GlobalValue *, unsigned, 4> LocalMemoryObjects;			SmallDenseMap<const GlobalValue *, unsigned, 4> LocalMemoryObjects;

				protected:
	uint64_t KernArgSize;			uint64_t KernArgSize;
	unsigned MaxKernArgAlign;			unsigned MaxKernArgAlign;

	/// Number of bytes in the LDS that are being used.			/// Number of bytes in the LDS that are being used.
	unsigned LDSSize;			unsigned LDSSize;

	// FIXME: This should probably be removed.			// FIXME: This should probably be removed.
	/// Start of implicit kernel args			/// Start of implicit kernel args

lib/Target/AMDGPU/AMDGPUSubtarget.cpp

	}			}

	bool SISubtarget::isVGPRSpillingEnabled(const Function& F) const {			bool SISubtarget::isVGPRSpillingEnabled(const Function& F) const {
	return EnableVGPRSpilling || !AMDGPU::isShader(F.getCallingConv());			return EnableVGPRSpilling || !AMDGPU::isShader(F.getCallingConv());
	}			}

	unsigned SISubtarget::getKernArgSegmentSize(const Function &F,			unsigned SISubtarget::getKernArgSegmentSize(const Function &F,
	unsigned ExplicitArgBytes) const {			unsigned ExplicitArgBytes) const {
				uint64_t TotalSize = ExplicitArgBytes;
	unsigned ImplicitBytes = getImplicitArgNumBytes(F);			unsigned ImplicitBytes = getImplicitArgNumBytes(F);
	if (ImplicitBytes == 0)
	return ExplicitArgBytes;

				if (ImplicitBytes != 0) {
	unsigned Alignment = getAlignmentForImplicitArgPtr();			unsigned Alignment = getAlignmentForImplicitArgPtr();
	return alignTo(ExplicitArgBytes, Alignment) + ImplicitBytes;			TotalSize = alignTo(ExplicitArgBytes, Alignment) + ImplicitBytes;
				}

				// Being able to dereference past the end is useful for emitting scalar loads.
				return alignTo(TotalSize, 4);
	}			}

	unsigned SISubtarget::getOccupancyWithNumSGPRs(unsigned SGPRs) const {			unsigned SISubtarget::getOccupancyWithNumSGPRs(unsigned SGPRs) const {
	if (getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {			if (getGeneration() >= SISubtarget::VOLCANIC_ISLANDS) {
	if (SGPRs <= 80)			if (SGPRs <= 80)
	return 10;			return 10;
	if (SGPRs <= 88)			if (SGPRs <= 88)
	return 9;			return 9;

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp

if (!isEntryFunction()) {		if (!isEntryFunction()) {
ArgInfo.PrivateSegmentBuffer =		ArgInfo.PrivateSegmentBuffer =
ArgDescriptor::createRegister(ScratchRSrcReg);		ArgDescriptor::createRegister(ScratchRSrcReg);
ArgInfo.PrivateSegmentWaveByteOffset =		ArgInfo.PrivateSegmentWaveByteOffset =
ArgDescriptor::createRegister(ScratchWaveOffsetReg);		ArgDescriptor::createRegister(ScratchWaveOffsetReg);

if (F.hasFnAttribute("amdgpu-implicitarg-ptr"))		if (F.hasFnAttribute("amdgpu-implicitarg-ptr"))
ImplicitArgPtr = true;		ImplicitArgPtr = true;
} else {		} else {
if (F.hasFnAttribute("amdgpu-implicitarg-ptr"))		if (F.hasFnAttribute("amdgpu-implicitarg-ptr")) {
KernargSegmentPtr = true;		KernargSegmentPtr = true;
		assert(MaxKernArgAlign == 0);
		MaxKernArgAlign = ST.getAlignmentForImplicitArgPtr();
		}
}		}

CallingConv::ID CC = F.getCallingConv();		CallingConv::ID CC = F.getCallingConv();
if (CC == CallingConv::AMDGPU_KERNEL || CC == CallingConv::SPIR_KERNEL) {		if (CC == CallingConv::AMDGPU_KERNEL || CC == CallingConv::SPIR_KERNEL) {
if (!F.arg_empty())		if (!F.arg_empty())
KernargSegmentPtr = true;		KernargSegmentPtr = true;
WorkGroupIDX = true;		WorkGroupIDX = true;
WorkItemIDX = true;		WorkItemIDX = true;

test/CodeGen/AMDGPU/kernel-args.ll

	; RUN: llc < %s -march=amdgcn -verify-machineinstrs | FileCheck -enable-var-scope --check-prefixes=SI,GCN,MESA-GCN,FUNC %s			; RUN: llc < %s -march=amdgcn -verify-machineinstrs | FileCheck -enable-var-scope --check-prefixes=SI,GCN,MESA-GCN,FUNC %s
	; RUN: llc < %s -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs | FileCheck -enable-var-scope -check-prefixes=VI,GCN,MESA-VI,MESA-GCN,FUNC %s			; RUN: llc < %s -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs | FileCheck -enable-var-scope -check-prefixes=VI,GCN,MESA-VI,MESA-GCN,FUNC %s
	; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs | FileCheck -enable-var-scope -check-prefixes=VI,GCN,HSA-VI,FUNC %s			; RUN: llc < %s -mtriple=amdgcn--amdhsa -mcpu=fiji -verify-machineinstrs | FileCheck -enable-var-scope -check-prefixes=VI,GCN,HSA-VI,FUNC %s
	; RUN: llc < %s -march=r600 -mcpu=redwood -verify-machineinstrs | FileCheck -enable-var-scope -check-prefix=EG --check-prefix=FUNC %s			; RUN: llc < %s -march=r600 -mcpu=redwood -verify-machineinstrs | FileCheck -enable-var-scope -check-prefix=EG --check-prefix=FUNC %s
	; RUN: llc < %s -march=r600 -mcpu=cayman -verify-machineinstrs | FileCheck -enable-var-scope --check-prefix=EG --check-prefix=FUNC %s			; RUN: llc < %s -march=r600 -mcpu=cayman -verify-machineinstrs | FileCheck -enable-var-scope --check-prefix=EG --check-prefix=FUNC %s

	; FUNC-LABEL: {{^}}i8_arg:			; FUNC-LABEL: {{^}}i8_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: AND_INT {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z			; EG: AND_INT {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xb			; SI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xb
	; MESA-VI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x2c			; MESA-VI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x2c
	; MESA-GCN: s_and_b32 s{{[0-9]+}}, [[VAL]], 0xff			; MESA-GCN: s_and_b32 s{{[0-9]+}}, [[VAL]], 0xff
	; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8			; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8
	; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0			; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0
	; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]
	; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]
	; FIXME: Should be using s_load_dword			; FIXME: Should be using s_load_dword
	; HSA-VI: flat_load_ubyte v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}			; HSA-VI: flat_load_ubyte v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}

	define amdgpu_kernel void @i8_arg(i32 addrspace(1)* nocapture %out, i8 %in) nounwind {			define amdgpu_kernel void @i8_arg(i32 addrspace(1)* nocapture %out, i8 %in) nounwind {
	entry:			entry:
	%0 = zext i8 %in to i32			%0 = zext i8 %in to i32
	store i32 %0, i32 addrspace(1)* %out, align 4			store i32 %0, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i8_zext_arg:			; FUNC-LABEL: {{^}}i8_zext_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z			; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb			; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb
	; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c			; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c
	; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8			; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8
	; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0			; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0
	; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]
	; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]
	; FIXME: Should be using s_load_dword			; FIXME: Should be using s_load_dword
	; HSA-VI: flat_load_ubyte v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}			; HSA-VI: flat_load_ubyte v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}

	define amdgpu_kernel void @i8_zext_arg(i32 addrspace(1)* nocapture %out, i8 zeroext %in) nounwind {			define amdgpu_kernel void @i8_zext_arg(i32 addrspace(1)* nocapture %out, i8 zeroext %in) nounwind {
	entry:			entry:
	%0 = zext i8 %in to i32			%0 = zext i8 %in to i32
	store i32 %0, i32 addrspace(1)* %out, align 4			store i32 %0, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i8_sext_arg:			; FUNC-LABEL: {{^}}i8_sext_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z			; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb			; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb
	; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c			; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c
	; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8			; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8
	; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0			; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0
	; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]
	; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]
	; FIXME: Should be using s_load_dword			; FIXME: Should be using s_load_dword
	; HSA-VI: flat_load_sbyte v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}			; HSA-VI: flat_load_sbyte v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}

	define amdgpu_kernel void @i8_sext_arg(i32 addrspace(1)* nocapture %out, i8 signext %in) nounwind {			define amdgpu_kernel void @i8_sext_arg(i32 addrspace(1)* nocapture %out, i8 signext %in) nounwind {
	entry:			entry:
	%0 = sext i8 %in to i32			%0 = sext i8 %in to i32
	store i32 %0, i32 addrspace(1)* %out, align 4			store i32 %0, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i16_arg:			; FUNC-LABEL: {{^}}i16_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG: AND_INT {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z			; EG: AND_INT {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xb			; SI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xb
	; MESA-VI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x2c			; MESA-VI: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x2c
	; MESA-GCN: s_and_b32 s{{[0-9]+}}, [[VAL]], 0xff			; MESA-GCN: s_and_b32 s{{[0-9]+}}, [[VAL]], 0xff
	; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8			; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8
	; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0			; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0
	; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]
	; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]
	; FIXME: Should be using s_load_dword			; FIXME: Should be using s_load_dword
	; HSA-VI: flat_load_ushort v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}			; HSA-VI: flat_load_ushort v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}

	define amdgpu_kernel void @i16_arg(i32 addrspace(1)* nocapture %out, i16 %in) nounwind {			define amdgpu_kernel void @i16_arg(i32 addrspace(1)* nocapture %out, i16 %in) nounwind {
	entry:			entry:
	%0 = zext i16 %in to i32			%0 = zext i16 %in to i32
	store i32 %0, i32 addrspace(1)* %out, align 4			store i32 %0, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i16_zext_arg:			; FUNC-LABEL: {{^}}i16_zext_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z			; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb			; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb
	; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c			; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c
	; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8			; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8
	; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0			; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0
	; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]
	; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]
	; FIXME: Should be using s_load_dword			; FIXME: Should be using s_load_dword
	; HSA-VI: flat_load_ushort v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}			; HSA-VI: flat_load_ushort v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}

	define amdgpu_kernel void @i16_zext_arg(i32 addrspace(1)* nocapture %out, i16 zeroext %in) nounwind {			define amdgpu_kernel void @i16_zext_arg(i32 addrspace(1)* nocapture %out, i16 zeroext %in) nounwind {
	entry:			entry:
	%0 = zext i16 %in to i32			%0 = zext i16 %in to i32
	store i32 %0, i32 addrspace(1)* %out, align 4			store i32 %0, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i16_sext_arg:			; FUNC-LABEL: {{^}}i16_sext_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z			; EG: MOV {{[ ]}}T{{[0-9]+\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb			; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb
	; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c			; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c
	; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8			; HSA-VI: s_add_u32 [[SPTR_LO:s[0-9]+]], s4, 8
	; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0			; HSA-VI: s_addc_u32 [[SPTR_HI:s[0-9]+]], s5, 0
	; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_LO:[0-9]+]], [[SPTR_LO]]
	; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]			; HSA-VI: v_mov_b32_e32 v[[VPTR_HI:[0-9]+]], [[SPTR_HI]]
	; FIXME: Should be using s_load_dword			; FIXME: Should be using s_load_dword
	; HSA-VI: flat_load_sshort v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}			; HSA-VI: flat_load_sshort v{{[0-9]+}}, v{{\[}}[[VPTR_LO]]:[[VPTR_HI]]]{{$}}

	define amdgpu_kernel void @i16_sext_arg(i32 addrspace(1)* nocapture %out, i16 signext %in) nounwind {			define amdgpu_kernel void @i16_sext_arg(i32 addrspace(1)* nocapture %out, i16 signext %in) nounwind {
	entry:			entry:
	%0 = sext i16 %in to i32			%0 = sext i16 %in to i32
	store i32 %0, i32 addrspace(1)* %out, align 4			store i32 %0, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i32_arg:			; FUNC-LABEL: {{^}}i32_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG: T{{[0-9]\.[XYZW]}}, KC0[2].Z			; EG: T{{[0-9]\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb			; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb
	; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c			; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c
	; HSA-VI: s_load_dword s{{[0-9]}}, s[4:5], 0x8			; HSA-VI: s_load_dword s{{[0-9]}}, s[4:5], 0x8
	define amdgpu_kernel void @i32_arg(i32 addrspace(1)* nocapture %out, i32 %in) nounwind {			define amdgpu_kernel void @i32_arg(i32 addrspace(1)* nocapture %out, i32 %in) nounwind {
	entry:			entry:
	store i32 %in, i32 addrspace(1)* %out, align 4			store i32 %in, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}f32_arg:			; FUNC-LABEL: {{^}}f32_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: T{{[0-9]\.[XYZW]}}, KC0[2].Z			; EG: T{{[0-9]\.[XYZW]}}, KC0[2].Z
	; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb			; SI: s_load_dword s{{[0-9]}}, s[0:1], 0xb
	; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c			; MESA-VI: s_load_dword s{{[0-9]}}, s[0:1], 0x2c
	; HSA-VI: s_load_dword s{{[0-9]+}}, s[4:5], 0x8			; HSA-VI: s_load_dword s{{[0-9]+}}, s[4:5], 0x8
	define amdgpu_kernel void @f32_arg(float addrspace(1)* nocapture %out, float %in) nounwind {			define amdgpu_kernel void @f32_arg(float addrspace(1)* nocapture %out, float %in) nounwind {
	entry:			entry:
	store float %in, float addrspace(1)* %out, align 4			store float %in, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v2i8_arg:			; FUNC-LABEL: {{^}}v2i8_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	define amdgpu_kernel void @v2i8_arg(<2 x i8> addrspace(1)* %out, <2 x i8> %in) {			define amdgpu_kernel void @v2i8_arg(<2 x i8> addrspace(1)* %out, <2 x i8> %in) {
	entry:			entry:
	store <2 x i8> %in, <2 x i8> addrspace(1)* %out			store <2 x i8> %in, <2 x i8> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v2i16_arg:			; FUNC-LABEL: {{^}}v2i16_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16

	; SI: buffer_load_ushort			; SI: buffer_load_ushort
	; SI: buffer_load_ushort			; SI: buffer_load_ushort

	; VI: s_load_dword s			; VI: s_load_dword s
	define amdgpu_kernel void @v2i16_arg(<2 x i16> addrspace(1)* %out, <2 x i16> %in) {			define amdgpu_kernel void @v2i16_arg(<2 x i16> addrspace(1)* %out, <2 x i16> %in) {
	entry:			entry:
	store <2 x i16> %in, <2 x i16> addrspace(1)* %out			store <2 x i16> %in, <2 x i16> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v2i32_arg:			; FUNC-LABEL: {{^}}v2i32_arg:
				; HSA-VI: kernarg_segment_byte_size = 16
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].X
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[2].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[2].W
	; SI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xb			; SI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xb
	; MESA-VI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x2c			; MESA-VI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x2c
	; HSA-VI: s_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x8			; HSA-VI: s_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x8
	define amdgpu_kernel void @v2i32_arg(<2 x i32> addrspace(1)* nocapture %out, <2 x i32> %in) nounwind {			define amdgpu_kernel void @v2i32_arg(<2 x i32> addrspace(1)* nocapture %out, <2 x i32> %in) nounwind {
	entry:			entry:
	store <2 x i32> %in, <2 x i32> addrspace(1)* %out, align 4			store <2 x i32> %in, <2 x i32> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v2f32_arg:			; FUNC-LABEL: {{^}}v2f32_arg:
				; HSA-VI: kernarg_segment_byte_size = 16
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].X
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[2].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[2].W
	; SI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xb			; SI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xb
	; MESA-VI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x2c			; MESA-VI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x2c
	; HSA-VI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[4:5], 0x8			; HSA-VI: s_load_dwordx2 s{{\[[0-9]:[0-9]\]}}, s[4:5], 0x8
	define amdgpu_kernel void @v2f32_arg(<2 x float> addrspace(1)* nocapture %out, <2 x float> %in) nounwind {			define amdgpu_kernel void @v2f32_arg(<2 x float> addrspace(1)* nocapture %out, <2 x float> %in) nounwind {
	entry:			entry:
	store <2 x float> %in, <2 x float> addrspace(1)* %out, align 4			store <2 x float> %in, <2 x float> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v3i8_arg:			; FUNC-LABEL: {{^}}v3i8_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG-DAG: VTX_READ_8 T{{[0-9]}}.X, T{{[0-9]}}.X, 40			; EG-DAG: VTX_READ_8 T{{[0-9]}}.X, T{{[0-9]}}.X, 40
	; EG-DAG: VTX_READ_8 T{{[0-9]}}.X, T{{[0-9]}}.X, 41			; EG-DAG: VTX_READ_8 T{{[0-9]}}.X, T{{[0-9]}}.X, 41
	; EG-DAG: VTX_READ_8 T{{[0-9]}}.X, T{{[0-9]}}.X, 42			; EG-DAG: VTX_READ_8 T{{[0-9]}}.X, T{{[0-9]}}.X, 42
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	define amdgpu_kernel void @v3i8_arg(<3 x i8> addrspace(1)* nocapture %out, <3 x i8> %in) nounwind {			define amdgpu_kernel void @v3i8_arg(<3 x i8> addrspace(1)* nocapture %out, <3 x i8> %in) nounwind {
	entry:			entry:
	store <3 x i8> %in, <3 x i8> addrspace(1)* %out, align 4			store <3 x i8> %in, <3 x i8> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v3i16_arg:			; FUNC-LABEL: {{^}}v3i16_arg:
				; HSA-VI: kernarg_segment_byte_size = 16
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4

	; EG-DAG: VTX_READ_16 T{{[0-9]}}.X, T{{[0-9]}}.X, 44			; EG-DAG: VTX_READ_16 T{{[0-9]}}.X, T{{[0-9]}}.X, 44
	; EG-DAG: VTX_READ_16 T{{[0-9]}}.X, T{{[0-9]}}.X, 46			; EG-DAG: VTX_READ_16 T{{[0-9]}}.X, T{{[0-9]}}.X, 46
	; EG-DAG: VTX_READ_16 T{{[0-9]}}.X, T{{[0-9]}}.X, 48			; EG-DAG: VTX_READ_16 T{{[0-9]}}.X, T{{[0-9]}}.X, 48
	; MESA-GCN: buffer_load_ushort			; MESA-GCN: buffer_load_ushort
	; MESA-GCN: buffer_load_ushort			; MESA-GCN: buffer_load_ushort
	; MESA-GCN: buffer_load_ushort			; MESA-GCN: buffer_load_ushort
	; HSA-VI: flat_load_ushort			; HSA-VI: flat_load_ushort
	; HSA-VI: flat_load_ushort			; HSA-VI: flat_load_ushort
	; HSA-VI: flat_load_ushort			; HSA-VI: flat_load_ushort
	define amdgpu_kernel void @v3i16_arg(<3 x i16> addrspace(1)* nocapture %out, <3 x i16> %in) nounwind {			define amdgpu_kernel void @v3i16_arg(<3 x i16> addrspace(1)* nocapture %out, <3 x i16> %in) nounwind {
	entry:			entry:
	store <3 x i16> %in, <3 x i16> addrspace(1)* %out, align 4			store <3 x i16> %in, <3 x i16> addrspace(1)* %out, align 4
	ret void			ret void
	}			}
	; FUNC-LABEL: {{^}}v3i32_arg:			; FUNC-LABEL: {{^}}v3i32_arg:
				; HSA-VI: kernarg_segment_byte_size = 32
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W
	; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0xd			; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0xd
	; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0x34			; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0x34
	; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10			; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10
	define amdgpu_kernel void @v3i32_arg(<3 x i32> addrspace(1)* nocapture %out, <3 x i32> %in) nounwind {			define amdgpu_kernel void @v3i32_arg(<3 x i32> addrspace(1)* nocapture %out, <3 x i32> %in) nounwind {
	entry:			entry:
	store <3 x i32> %in, <3 x i32> addrspace(1)* %out, align 4			store <3 x i32> %in, <3 x i32> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v3f32_arg:			; FUNC-LABEL: {{^}}v3f32_arg:
				; HSA-VI: kernarg_segment_byte_size = 32
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W
	; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0xd			; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0xd
	; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0x34			; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]+\]}}, s[0:1], 0x34
	; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10			; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10
	define amdgpu_kernel void @v3f32_arg(<3 x float> addrspace(1)* nocapture %out, <3 x float> %in) nounwind {			define amdgpu_kernel void @v3f32_arg(<3 x float> addrspace(1)* nocapture %out, <3 x float> %in) nounwind {
	entry:			entry:
	store <3 x float> %in, <3 x float> addrspace(1)* %out, align 4			store <3 x float> %in, <3 x float> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v4i8_arg:			; FUNC-LABEL: {{^}}v4i8_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; MESA-GCN: buffer_load_ubyte			; MESA-GCN: buffer_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	define amdgpu_kernel void @v4i8_arg(<4 x i8> addrspace(1)* %out, <4 x i8> %in) {			define amdgpu_kernel void @v4i8_arg(<4 x i8> addrspace(1)* %out, <4 x i8> %in) {
	entry:			entry:
	store <4 x i8> %in, <4 x i8> addrspace(1)* %out			store <4 x i8> %in, <4 x i8> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v4i16_arg:			; FUNC-LABEL: {{^}}v4i16_arg:
				; HSA-VI: kernarg_segment_byte_size = 16
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16

	; SI: buffer_load_ushort			; SI: buffer_load_ushort
	; SI: buffer_load_ushort			; SI: buffer_load_ushort
	; SI: buffer_load_ushort			; SI: buffer_load_ushort
	; SI: buffer_load_ushort			; SI: buffer_load_ushort

	; VI: s_load_dword s			; VI: s_load_dword s
	; VI: s_load_dword s			; VI: s_load_dword s
	define amdgpu_kernel void @v4i16_arg(<4 x i16> addrspace(1)* %out, <4 x i16> %in) {			define amdgpu_kernel void @v4i16_arg(<4 x i16> addrspace(1)* %out, <4 x i16> %in) {
	entry:			entry:
	store <4 x i16> %in, <4 x i16> addrspace(1)* %out			store <4 x i16> %in, <4 x i16> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v4i32_arg:			; FUNC-LABEL: {{^}}v4i32_arg:
				; HSA-VI: kernarg_segment_byte_size = 32
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].X

	; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xd			; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xd
	; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x34			; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x34
	; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10			; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10
	define amdgpu_kernel void @v4i32_arg(<4 x i32> addrspace(1)* nocapture %out, <4 x i32> %in) nounwind {			define amdgpu_kernel void @v4i32_arg(<4 x i32> addrspace(1)* nocapture %out, <4 x i32> %in) nounwind {
	entry:			entry:
	store <4 x i32> %in, <4 x i32> addrspace(1)* %out, align 4			store <4 x i32> %in, <4 x i32> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v4f32_arg:			; FUNC-LABEL: {{^}}v4f32_arg:
				; HSA-VI: kernarg_segment_byte_size = 32
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[3].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].X
	; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xd			; SI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0xd
	; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x34			; MESA-VI: s_load_dwordx4 s{{\[[0-9]:[0-9]\]}}, s[0:1], 0x34
	; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10			; HSA-VI: s_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x10
	define amdgpu_kernel void @v4f32_arg(<4 x float> addrspace(1)* nocapture %out, <4 x float> %in) nounwind {			define amdgpu_kernel void @v4f32_arg(<4 x float> addrspace(1)* nocapture %out, <4 x float> %in) nounwind {
	entry:			entry:
	store <4 x float> %in, <4 x float> addrspace(1)* %out, align 4			store <4 x float> %in, <4 x float> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v8i8_arg:			; FUNC-LABEL: {{^}}v8i8_arg:
				; HSA-VI: kernarg_segment_byte_size = 16
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	[15 lines not shown]
	; HSA-GCN: float_load_ubyte			; HSA-GCN: float_load_ubyte
	define amdgpu_kernel void @v8i8_arg(<8 x i8> addrspace(1)* %out, <8 x i8> %in) {			define amdgpu_kernel void @v8i8_arg(<8 x i8> addrspace(1)* %out, <8 x i8> %in) {
	entry:			entry:
	store <8 x i8> %in, <8 x i8> addrspace(1)* %out			store <8 x i8> %in, <8 x i8> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v8i16_arg:			; FUNC-LABEL: {{^}}v8i16_arg:
				; HSA-VI: kernarg_segment_byte_size = 32
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	[14 lines not shown]
	; VI: s_load_dword s			; VI: s_load_dword s
	define amdgpu_kernel void @v8i16_arg(<8 x i16> addrspace(1)* %out, <8 x i16> %in) {			define amdgpu_kernel void @v8i16_arg(<8 x i16> addrspace(1)* %out, <8 x i16> %in) {
	entry:			entry:
	store <8 x i16> %in, <8 x i16> addrspace(1)* %out			store <8 x i16> %in, <8 x i16> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v8i32_arg:			; FUNC-LABEL: {{^}}v8i32_arg:
				; HSA-VI: kernarg_segment_byte_size = 64
	; HSA-VI: kernarg_segment_alignment = 5			; HSA-VI: kernarg_segment_alignment = 5
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].X
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].X
	; SI: s_load_dwordx8 s{{\[[0-9]+:[0-9]+\]}}, s[0:1], 0x11			; SI: s_load_dwordx8 s{{\[[0-9]+:[0-9]+\]}}, s[0:1], 0x11
	; MESA-VI: s_load_dwordx8 s{{\[[0-9]+:[0-9]+\]}}, s[0:1], 0x44			; MESA-VI: s_load_dwordx8 s{{\[[0-9]+:[0-9]+\]}}, s[0:1], 0x44
	; HSA-VI: s_load_dwordx8 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x20			; HSA-VI: s_load_dwordx8 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x20
	define amdgpu_kernel void @v8i32_arg(<8 x i32> addrspace(1)* nocapture %out, <8 x i32> %in) nounwind {			define amdgpu_kernel void @v8i32_arg(<8 x i32> addrspace(1)* nocapture %out, <8 x i32> %in) nounwind {
	entry:			entry:
	store <8 x i32> %in, <8 x i32> addrspace(1)* %out, align 4			store <8 x i32> %in, <8 x i32> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v8f32_arg:			; FUNC-LABEL: {{^}}v8f32_arg:
				; HSA-VI: kernarg_segment_byte_size = 64
	; HSA-VI: kernarg_segment_alignment = 5			; HSA-VI: kernarg_segment_alignment = 5
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[4].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].X
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[5].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].X
	; SI: s_load_dwordx8 s{{\[[0-9]+:[0-9]+\]}}, s[0:1], 0x11			; SI: s_load_dwordx8 s{{\[[0-9]+:[0-9]+\]}}, s[0:1], 0x11
	define amdgpu_kernel void @v8f32_arg(<8 x float> addrspace(1)* nocapture %out, <8 x float> %in) nounwind {			define amdgpu_kernel void @v8f32_arg(<8 x float> addrspace(1)* nocapture %out, <8 x float> %in) nounwind {
	entry:			entry:
	store <8 x float> %in, <8 x float> addrspace(1)* %out, align 4			store <8 x float> %in, <8 x float> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v16i8_arg:			; FUNC-LABEL: {{^}}v16i8_arg:
				; HSA-VI: kernarg_segment_byte_size = 32
	; HSA-VI: kernarg_segment_alignment = 4			; HSA-VI: kernarg_segment_alignment = 4
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	; EG: VTX_READ_8			; EG: VTX_READ_8
	[40 lines not shown]
	; HSA-VI: flat_load_ubyte			; HSA-VI: flat_load_ubyte
	define amdgpu_kernel void @v16i8_arg(<16 x i8> addrspace(1)* %out, <16 x i8> %in) {			define amdgpu_kernel void @v16i8_arg(<16 x i8> addrspace(1)* %out, <16 x i8> %in) {
	entry:			entry:
	store <16 x i8> %in, <16 x i8> addrspace(1)* %out			store <16 x i8> %in, <16 x i8> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v16i16_arg:			; FUNC-LABEL: {{^}}v16i16_arg:
				; HSA-VI: kernarg_segment_byte_size = 64
	; HSA-VI: kernarg_segment_alignment = 5			; HSA-VI: kernarg_segment_alignment = 5
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	; EG: VTX_READ_16			; EG: VTX_READ_16
	[34 lines not shown]
	; VI: s_load_dword s			; VI: s_load_dword s
	define amdgpu_kernel void @v16i16_arg(<16 x i16> addrspace(1)* %out, <16 x i16> %in) {			define amdgpu_kernel void @v16i16_arg(<16 x i16> addrspace(1)* %out, <16 x i16> %in) {
	entry:			entry:
	store <16 x i16> %in, <16 x i16> addrspace(1)* %out			store <16 x i16> %in, <16 x i16> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v16i32_arg:			; FUNC-LABEL: {{^}}v16i32_arg:
				; HSA-VI: kernarg_segment_byte_size = 128
	; HSA-VI: kernarg_segment_alignment = 6			; HSA-VI: kernarg_segment_alignment = 6
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].X
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].W
	[11 lines not shown]
	; HSA-VI: s_load_dwordx16 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x40			; HSA-VI: s_load_dwordx16 s[{{[0-9]+:[0-9]+}}], s[4:5], 0x40
	define amdgpu_kernel void @v16i32_arg(<16 x i32> addrspace(1)* nocapture %out, <16 x i32> %in) nounwind {			define amdgpu_kernel void @v16i32_arg(<16 x i32> addrspace(1)* nocapture %out, <16 x i32> %in) nounwind {
	entry:			entry:
	store <16 x i32> %in, <16 x i32> addrspace(1)* %out, align 4			store <16 x i32> %in, <16 x i32> addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}v16f32_arg:			; FUNC-LABEL: {{^}}v16f32_arg:
				; HSA-VI: kernarg_segment_byte_size = 128
	; HSA-VI: kernarg_segment_alignment = 6			; HSA-VI: kernarg_segment_alignment = 6
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[6].W
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].X			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].X
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Y			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Y
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Z			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].Z
	; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].W			; EG-DAG: T{{[0-9]\.[XYZW]}}, KC0[7].W
	[43 lines not shown]
	; XGCN: s_load_dwordx2			; XGCN: s_load_dwordx2
	; XGCN: buffer_store_dwordx2			; XGCN: buffer_store_dwordx2
	; define amdgpu_kernel void @kernel_arg_v1i64(<1 x i64> addrspace(1)* %out, <1 x i64> %a) nounwind {			; define amdgpu_kernel void @kernel_arg_v1i64(<1 x i64> addrspace(1)* %out, <1 x i64> %a) nounwind {
	; store <1 x i64> %a, <1 x i64> addrspace(1)* %out, align 8			; store <1 x i64> %a, <1 x i64> addrspace(1)* %out, align 8
	; ret void			; ret void
	; }			; }

	; FUNC-LABEL: {{^}}i1_arg:			; FUNC-LABEL: {{^}}i1_arg:
				; HSA-VI: kernarg_segment_byte_size = 12
				; HSA-VI: kernarg_segment_alignment = 4

	; SI: buffer_load_ubyte			; SI: buffer_load_ubyte
	; SI: v_and_b32_e32			; SI: v_and_b32_e32
	; SI: buffer_store_byte			; SI: buffer_store_byte
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @i1_arg(i1 addrspace(1)* %out, i1 %x) nounwind {			define amdgpu_kernel void @i1_arg(i1 addrspace(1)* %out, i1 %x) nounwind {
	store i1 %x, i1 addrspace(1)* %out, align 1			store i1 %x, i1 addrspace(1)* %out, align 1
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i1_arg_zext_i32:			; FUNC-LABEL: {{^}}i1_arg_zext_i32:
				; HSA-VI: kernarg_segment_byte_size = 12
				; HSA-VI: kernarg_segment_alignment = 4

	; SI: buffer_load_ubyte			; SI: buffer_load_ubyte
	; SI: buffer_store_dword			; SI: buffer_store_dword
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @i1_arg_zext_i32(i32 addrspace(1)* %out, i1 %x) nounwind {			define amdgpu_kernel void @i1_arg_zext_i32(i32 addrspace(1)* %out, i1 %x) nounwind {
	%ext = zext i1 %x to i32			%ext = zext i1 %x to i32
	store i32 %ext, i32 addrspace(1)* %out, align 4			store i32 %ext, i32 addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i1_arg_zext_i64:			; FUNC-LABEL: {{^}}i1_arg_zext_i64:
				; HSA-VI: kernarg_segment_byte_size = 12
				; HSA-VI: kernarg_segment_alignment = 4

	; SI: buffer_load_ubyte			; SI: buffer_load_ubyte
	; SI: buffer_store_dwordx2			; SI: buffer_store_dwordx2
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @i1_arg_zext_i64(i64 addrspace(1)* %out, i1 %x) nounwind {			define amdgpu_kernel void @i1_arg_zext_i64(i64 addrspace(1)* %out, i1 %x) nounwind {
	%ext = zext i1 %x to i64			%ext = zext i1 %x to i64
	store i64 %ext, i64 addrspace(1)* %out, align 8			store i64 %ext, i64 addrspace(1)* %out, align 8
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i1_arg_sext_i32:			; FUNC-LABEL: {{^}}i1_arg_sext_i32:
				; HSA-VI: kernarg_segment_byte_size = 12
				; HSA-VI: kernarg_segment_alignment = 4

	; SI: buffer_load_ubyte			; SI: buffer_load_ubyte
	; SI: buffer_store_dword			; SI: buffer_store_dword
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @i1_arg_sext_i32(i32 addrspace(1)* %out, i1 %x) nounwind {			define amdgpu_kernel void @i1_arg_sext_i32(i32 addrspace(1)* %out, i1 %x) nounwind {
	%ext = sext i1 %x to i32			%ext = sext i1 %x to i32
	store i32 %ext, i32addrspace(1)* %out, align 4			store i32 %ext, i32addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}i1_arg_sext_i64:			; FUNC-LABEL: {{^}}i1_arg_sext_i64:
				; HSA-VI: kernarg_segment_byte_size = 12
				; HSA-VI: kernarg_segment_alignment = 4

	; SI: buffer_load_ubyte			; SI: buffer_load_ubyte
	; SI: v_bfe_i32			; SI: v_bfe_i32
	; SI: v_ashrrev_i32			; SI: v_ashrrev_i32
	; SI: buffer_store_dwordx2			; SI: buffer_store_dwordx2
	; SI: s_endpgm			; SI: s_endpgm
	define amdgpu_kernel void @i1_arg_sext_i64(i64 addrspace(1)* %out, i1 %x) nounwind {			define amdgpu_kernel void @i1_arg_sext_i64(i64 addrspace(1)* %out, i1 %x) nounwind {
	%ext = sext i1 %x to i64			%ext = sext i1 %x to i64
	store i64 %ext, i64 addrspace(1)* %out, align 8			store i64 %ext, i64 addrspace(1)* %out, align 8
	ret void			ret void
	}			}
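
The new `kernarg_segment_byte_size` checks in this file appear to follow from laying the arguments out in order and then rounding the total up to a multiple of 4. A minimal sketch of that arithmetic, not the backend code itself: `align_to` and `explicit_kernarg_size` are hypothetical helper names, and the `(size, align)` pairs are assumptions about how each argument type is laid out.

```python
def align_to(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def explicit_kernarg_size(args, round_to=4):
    """Place each (size, align) argument in order, then round the total up."""
    offset = 0
    for size, align in args:
        offset = align_to(offset, align) + size
    return align_to(offset, round_to)


PTR = (8, 8)  # assumption: 64-bit global pointer, 8-byte aligned

# Assumption: each vector argument is aligned to its own size; i1 takes 1 byte.
kernels = {
    "i1_arg": [PTR, (1, 1)],
    "v4i16_arg": [PTR, (8, 8)],
    "v4i32_arg": [PTR, (16, 16)],
    "v8i32_arg": [PTR, (32, 32)],
    "v16i32_arg": [PTR, (64, 64)],
}

sizes = {name: explicit_kernarg_size(args) for name, args in kernels.items()}
for name, size in sizes.items():
    print(name, size)
```

Under these assumptions the results agree with the HSA-VI values checked above: only `i1_arg` is affected by the final rounding (9 bytes become 12), while the vector kernels already end on a 4-byte boundary.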

test/CodeGen/AMDGPU/llvm.amdgcn.kernarg.segment.ptr.ll

	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s \| FileCheck -check-prefixes=CO-V2,HSA,ALL %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s \| FileCheck -check-prefixes=CO-V2,HSA,ALL %s
	; RUN: llc -mtriple=amdgcn-mesa-mesa3d -verify-machineinstrs < %s \| FileCheck -check-prefixes=CO-V2,OS-MESA3D,MESA,ALL %s			; RUN: llc -mtriple=amdgcn-mesa-mesa3d -verify-machineinstrs < %s \| FileCheck -check-prefixes=CO-V2,OS-MESA3D,MESA,ALL %s
	; RUN: llc -mtriple=amdgcn-mesa-unknown -verify-machineinstrs < %s \| FileCheck -check-prefixes=OS-UNKNOWN,MESA,ALL %s			; RUN: llc -mtriple=amdgcn-mesa-unknown -verify-machineinstrs < %s \| FileCheck -check-prefixes=OS-UNKNOWN,MESA,ALL %s

	; ALL-LABEL: {{^}}test:			; ALL-LABEL: {{^}}test:
	; CO-V2: enable_sgpr_kernarg_segment_ptr = 1			; CO-V2: enable_sgpr_kernarg_segment_ptr = 1
				; HSA: kernarg_segment_byte_size = 8
				; HSA: kernarg_segment_alignment = 4

	; CO-V2: s_load_dword s{{[0-9]+}}, s[4:5], 0xa			; CO-V2: s_load_dword s{{[0-9]+}}, s[4:5], 0xa

	; OS-UNKNOWN: s_load_dword s{{[0-9]+}}, s[0:1], 0xa			; OS-UNKNOWN: s_load_dword s{{[0-9]+}}, s[0:1], 0xa
	define amdgpu_kernel void @test(i32 addrspace(1)* %out) #1 {			define amdgpu_kernel void @test(i32 addrspace(1)* %out) #1 {
	%kernarg.segment.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()			%kernarg.segment.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()
	%header.ptr = bitcast i8 addrspace(4)* %kernarg.segment.ptr to i32 addrspace(4)*			%header.ptr = bitcast i8 addrspace(4)* %kernarg.segment.ptr to i32 addrspace(4)*
	%gep = getelementptr i32, i32 addrspace(4)* %header.ptr, i64 10			%gep = getelementptr i32, i32 addrspace(4)* %header.ptr, i64 10
	%value = load i32, i32 addrspace(4)* %gep			%value = load i32, i32 addrspace(4)* %gep
	store i32 %value, i32 addrspace(1)* %out			store i32 %value, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}test_implicit:			; ALL-LABEL: {{^}}test_implicit:
				; HSA: kernarg_segment_byte_size = 8
				; OS-MESA3D: kernarg_segment_byte_size = 24
				; CO-V2: kernarg_segment_alignment = 4

	; 10 + 9 (36 prepended implicit bytes) + 2(out pointer) = 21 = 0x15			; 10 + 9 (36 prepended implicit bytes) + 2(out pointer) = 21 = 0x15
	; OS-UNKNOWN: s_load_dword s{{[0-9]+}}, s[0:1], 0x15			; OS-UNKNOWN: s_load_dword s{{[0-9]+}}, s[0:1], 0x15
	define amdgpu_kernel void @test_implicit(i32 addrspace(1)* %out) #1 {			define amdgpu_kernel void @test_implicit(i32 addrspace(1)* %out) #1 {
	%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()			%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
	%header.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*			%header.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*
	%gep = getelementptr i32, i32 addrspace(4)* %header.ptr, i64 10			%gep = getelementptr i32, i32 addrspace(4)* %header.ptr, i64 10
	%value = load i32, i32 addrspace(4)* %gep			%value = load i32, i32 addrspace(4)* %gep
	store i32 %value, i32 addrspace(1)* %out			store i32 %value, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}test_implicit_alignment			; ALL-LABEL: {{^}}test_implicit_alignment:
	; HSA: kernarg_segment_byte_size = 10			; HSA: kernarg_segment_byte_size = 12
	; OS-MESA3D: kernarg_segment_byte_size = 28			; OS-MESA3D: kernarg_segment_byte_size = 28
				; CO-V2: kernarg_segment_alignment = 4


	; OS-UNKNOWN: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xc			; OS-UNKNOWN: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xc
	; HSA: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x4			; HSA: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x4
	; OS-MESA3D: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x3			; OS-MESA3D: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x3
	; ALL: v_mov_b32_e32 [[V_VAL:v[0-9]+]], [[VAL]]			; ALL: v_mov_b32_e32 [[V_VAL:v[0-9]+]], [[VAL]]
	; MESA: buffer_store_dword [[V_VAL]]			; MESA: buffer_store_dword [[V_VAL]]
	; HSA: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[V_VAL]]			; HSA: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[V_VAL]]
	define amdgpu_kernel void @test_implicit_alignment(i32 addrspace(1)* %out, <2 x i8> %in) #1 {			define amdgpu_kernel void @test_implicit_alignment(i32 addrspace(1)* %out, <2 x i8> %in) #1 {
	%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()			%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
	%arg.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*			%arg.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*
	%val = load i32, i32 addrspace(4)* %arg.ptr			%val = load i32, i32 addrspace(4)* %arg.ptr
	store i32 %val, i32 addrspace(1)* %out			store i32 %val, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}opencl_test_implicit_alignment			; ALL-LABEL: {{^}}opencl_test_implicit_alignment
	; HSA: kernarg_segment_byte_size = 64			; HSA: kernarg_segment_byte_size = 64
	; OS-MESA3D: kernarg_segment_byte_size = 28			; OS-MESA3D: kernarg_segment_byte_size = 28
				; CO-V2: kernarg_segment_alignment = 4


	; OS-UNKNOWN: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xc			; OS-UNKNOWN: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0xc
	; HSA: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x4			; HSA: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x4
	; OS-MESA3D: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x3			; OS-MESA3D: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9]+:[0-9]+}}], 0x3
	; ALL: v_mov_b32_e32 [[V_VAL:v[0-9]+]], [[VAL]]			; ALL: v_mov_b32_e32 [[V_VAL:v[0-9]+]], [[VAL]]
	; MESA: buffer_store_dword [[V_VAL]]			; MESA: buffer_store_dword [[V_VAL]]
	; HSA: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[V_VAL]]			; HSA: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[V_VAL]]
	define amdgpu_kernel void @opencl_test_implicit_alignment(i32 addrspace(1)* %out, <2 x i8> %in) #2 {			define amdgpu_kernel void @opencl_test_implicit_alignment(i32 addrspace(1)* %out, <2 x i8> %in) #2 {
	%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()			%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
	%arg.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*			%arg.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*
	%val = load i32, i32 addrspace(4)* %arg.ptr			%val = load i32, i32 addrspace(4)* %arg.ptr
	store i32 %val, i32 addrspace(1)* %out			store i32 %val, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; ALL-LABEL: {{^}}test_no_kernargs:			; ALL-LABEL: {{^}}test_no_kernargs:
	; HSA: enable_sgpr_kernarg_segment_ptr = 1			; CO-V2: enable_sgpr_kernarg_segment_ptr = 1
				; HSA: kernarg_segment_byte_size = 0
				; OS-MESA3D: kernarg_segment_byte_size = 16
				; CO-V2: kernarg_segment_alignment = 32

	; HSA: s_load_dword s{{[0-9]+}}, s[4:5]			; HSA: s_load_dword s{{[0-9]+}}, s[4:5]
	define amdgpu_kernel void @test_no_kernargs() #1 {			define amdgpu_kernel void @test_no_kernargs() #1 {
	%kernarg.segment.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()			%kernarg.segment.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()
	%header.ptr = bitcast i8 addrspace(4)* %kernarg.segment.ptr to i32 addrspace(4)*			%header.ptr = bitcast i8 addrspace(4)* %kernarg.segment.ptr to i32 addrspace(4)*
	%gep = getelementptr i32, i32 addrspace(4)* %header.ptr, i64 10			%gep = getelementptr i32, i32 addrspace(4)* %header.ptr, i64 10
	%value = load i32, i32 addrspace(4)* %gep			%value = load i32, i32 addrspace(4)* %gep
	store volatile i32 %value, i32 addrspace(1)* undef			store volatile i32 %value, i32 addrspace(1)* undef
	ret void			ret void
	}			}

				; GCN-LABEL: {{^}}opencl_test_implicit_alignment_no_explicit_kernargs:
				; HSA: kernarg_segment_byte_size = 48
				; OS-MESA3d: kernarg_segment_byte_size = 16
				; CO-V2: kernarg_segment_alignment = 4
				define amdgpu_kernel void @opencl_test_implicit_alignment_no_explicit_kernargs() #2 {
				%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
				%arg.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*
				%val = load volatile i32, i32 addrspace(4)* %arg.ptr
				store volatile i32 %val, i32 addrspace(1)* null
				ret void
				}

				; GCN-LABEL: {{^}}opencl_test_implicit_alignment_no_explicit_kernargs_round_up:
				; HSA: kernarg_segment_byte_size = 40
				; OS-MESA3D: kernarg_segment_byte_size = 16
				; CO-V2: kernarg_segment_alignment = 4
				define amdgpu_kernel void @opencl_test_implicit_alignment_no_explicit_kernargs_round_up() #3 {
				%implicitarg.ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr()
				%arg.ptr = bitcast i8 addrspace(4)* %implicitarg.ptr to i32 addrspace(4)*
				%val = load volatile i32, i32 addrspace(4)* %arg.ptr
				store volatile i32 %val, i32 addrspace(1)* null
				ret void
				}

	declare i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr() #0			declare i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr() #0
	declare i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr() #0			declare i8 addrspace(4)* @llvm.amdgcn.implicitarg.ptr() #0

	attributes #0 = { nounwind readnone }			attributes #0 = { nounwind readnone }
	attributes #1 = { nounwind }			attributes #1 = { nounwind }
	attributes #2 = { nounwind "amdgpu-implicitarg-num-bytes"="48" }			attributes #2 = { nounwind "amdgpu-implicitarg-num-bytes"="48" }
				attributes #3 = { nounwind "amdgpu-implicitarg-num-bytes"="38" }
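
The HSA sizes expected in this file can be reproduced by padding the explicit arguments, appending the `amdgpu-implicitarg-num-bytes` value, and rounding the result up to 4. This is a sketch under stated assumptions rather than the patch's implementation: `kernarg_segment_size` is a hypothetical helper, and the 8-byte alignment of the implicit argument block is an assumption chosen because it matches the expected values.

```python
def align_to(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def kernarg_segment_size(explicit_bytes, implicit_bytes=0,
                         implicit_align=8, round_to=4):
    """Explicit bytes, plus an aligned implicit block if any, rounded up."""
    total = explicit_bytes
    if implicit_bytes:
        # Assumption: the implicit block starts at an 8-byte boundary.
        total = align_to(explicit_bytes, implicit_align) + implicit_bytes
    return align_to(total, round_to)


# Explicit bytes: 8-byte pointer, plus 2 bytes for <2 x i8> where present.
cases = {
    "test": kernarg_segment_size(8),
    "test_implicit_alignment": kernarg_segment_size(10),
    "opencl_test_implicit_alignment": kernarg_segment_size(10, 48),
    "opencl_test_implicit_alignment_no_explicit_kernargs":
        kernarg_segment_size(0, 48),
    "opencl_test_implicit_alignment_no_explicit_kernargs_round_up":
        kernarg_segment_size(0, 38),
}
for name, size in cases.items():
    print(name, size)
```

The sketch covers only the `HSA` prefix; the `OS-MESA3D` values also include bytes that this model does not account for.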