This is an archive of the discontinued LLVM Phabricator instance.


Differential D76567

AMDGPU: Implement getMemcpyLoopLoweringType
Abandoned · Public

Authored by arsenm on Mar 22 2020, 8:23 AM.


Details

Reviewers

rampitec
kerbowa
nhaehnle
foad

Diff Detail

Event Timeline

arsenm created this revision. Mar 22 2020, 8:23 AM

Herald added subscribers: hiraditya, t-tye, tpr and 5 others. Mar 22 2020, 8:23 AM

foad added inline comments. Mar 23 2020, 2:26 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
325	I don't follow the reasoning here. When and why do we need to respect the alignment? Could you add comments? The converse of `MinAlign <= 1 || MinAlign >= 4` is `MinAlign == 2` so maybe swap the `if` around: `if (MinAlign == 2) ... else ...`?
357	Are you intentionally ignoring the alignment arguments?
363	Did you mean just `while (RemainingBytes >= 8)` here? If not the whole thing would be clearer structured as: if (RemainingBytes % 8 == 0) { ... i64 loop ... } else if (RemainingBytes % 4 == 0) { ... i32 loop ... } else { ... i8 loop ... }

Address comments, fix residual part

I still don't understand the logic for when to use 2-byte accesses. Is it something like: use 1, 4, 8 and 16-byte accesses unconditionally, but 2-byte accesses only when we know source and destination are at least 2-byte aligned? Why is the implementation of this different depending on whether the length is a known constant or not?

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
344	"16-byte"?
359	Should those `==` be `>=`?
384	Should those `>` be `>=`?

Try again for 2 byte cases. I'm still somewhat unsure what we should be doing with 2-byte accesses, but try to use them for now

Wrong diff

foad added inline comments. Mar 25 2020, 1:28 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	`<=`? You can't do unaligned dword (or multi-dword) accesses, can you?
349–351	Don't all these (multi-)dword cases need to be guarded by `MinAlign >= 4`?

arsenm marked 2 inline comments as done. Mar 25 2020, 12:31 PM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	Yes, you can on anything remotely new. It's also not critical to get this exactly right, since the loads will still be legalized later.
349–351	No, unaligned access is supposed to be enabled by default on targets that support it. I didn't bother trying to worry about what's best without the support. We also really need some microbenchmarks to make precise decisions here (although that will probably never happen)

foad added inline comments. Mar 26 2020, 2:37 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	Then I don't understand why you have a special case for `MinAlign == 2` at all. Why not just use unaligned (multi-)dword accesses, like you would for `MinAlign == 1`?

arsenm marked an inline comment as done. Mar 26 2020, 6:46 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	See D74345, we recently discovered 2 byte aligned accesses end up getting executed as multiple 1 byte accesses

foad added inline comments. Mar 26 2020, 9:10 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	So accesses are slow if `(run time address) % 4 == 2`. If `MinAlign == 2` then there's a 50% chance of the access being slow. If `MinAlign == 1` then it's only a 25% chance. Is that right?

arsenm marked an inline comment as done. Mar 26 2020, 9:14 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	Yes, that is my understanding
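
The 50% and 25% figures in this exchange can be checked by brute force. The following is a minimal sketch, not part of the patch: it assumes run-time addresses are uniformly distributed over the multiples of the known alignment, and the function name `slowAccessFraction` is hypothetical.

```cpp
#include <cassert>

// Fraction of addresses that are 2-byte but not 4-byte aligned
// (Addr % 4 == 2), given only that Addr is a multiple of MinAlign.
// Assumption: addresses are uniformly distributed over the allowed values.
// Precondition: MinAlign is a non-zero power of two no larger than 16.
double slowAccessFraction(unsigned MinAlign) {
  unsigned Allowed = 0;
  unsigned Slow = 0;
  // One period of 16 bytes covers every residue pattern for these alignments.
  for (unsigned Addr = 0; Addr < 16; ++Addr) {
    if (Addr % MinAlign != 0)
      continue; // This address is ruled out by the known alignment.
    ++Allowed;
    if (Addr % 4 == 2)
      ++Slow;
  }
  return static_cast<double>(Slow) / Allowed;
}
```

Under that assumption the function yields one quarter for `MinAlign == 1`, one half for `MinAlign == 2`, and zero for `MinAlign == 4`, which matches the reasoning above.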

foad added inline comments. Mar 27 2020, 1:38 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
332	Then please add a comment that checking MinAlign is just a way to estimate the likelihood that the run time address will be 2-but-not-4 aligned (or "will end in 0b10"?). But if "2 byte aligned accesses end up getting executed as multiple 1 byte accesses" I still don't understand why it's better for us to generate multiple 1-byte instructions, rather than just letting the hardware do that thing.
353	Can't you remove these "degenerate" cases and return v4i32 unconditionally? Surely the generic logic knows that for a short copy it should execute the v4i32 loop zero times, and then fall into the "residual" logic below?
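
The point about short copies can be illustrated with the arithmetic involved. This is a sketch under assumptions, not the generic lowering code: `CopyPlan` and `planCopy` are hypothetical names, and whether the generic expansion really skips an empty loop is the reviewer's expectation rather than something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of how a constant-length copy splits into a main
// loop and a residual sequence for a given loop access width.
struct CopyPlan {
  uint64_t LoopIterations; // Number of full-width loop iterations.
  uint64_t ResidualBytes;  // Bytes left over for the residual accesses.
};

CopyPlan planCopy(uint64_t Length, uint64_t LoopOpBytes) {
  assert(LoopOpBytes != 0 && "loop access width must be non-zero");
  return {Length / LoopOpBytes, Length % LoopOpBytes};
}
```

With a 16-byte (v4i32) loop type, a 1025-byte copy gives 64 iterations plus 1 residual byte, while a 7-byte copy gives 0 iterations and 7 residual bytes, so the "degenerate" sizes would be handled entirely by the residual path.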

foad added inline comments. Mar 30 2020, 7:01 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
338	Shouldn't this `&&` be `||`, so we avoid forming either a 128-bit load or store?

Just to give a concrete counter-proposal, I was thinking this should look more like D77057.

foad mentioned this in D77057: AMDGPU: Implement getMemcpyLoopLoweringType. Mar 30 2020, 7:09 AM

arsenm abandoned this revision. Mar 30 2020, 3:16 PM

Revision Contents

Path                                                    Size
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h      12 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp    88 lines
llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll        1481 lines

Diff 252349

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

[161 lines not shown; context: bool isLegalToVectorizeMemChain(unsigned ChainSizeInBytes,]
                                   unsigned Alignment,
                                   unsigned AddrSpace) const;
   bool isLegalToVectorizeLoadChain(unsigned ChainSizeInBytes,
                                    unsigned Alignment,
                                    unsigned AddrSpace) const;
   bool isLegalToVectorizeStoreChain(unsigned ChainSizeInBytes,
                                     unsigned Alignment,
                                     unsigned AddrSpace) const;
+  Type *getMemcpyLoopLoweringType(LLVMContext &Context, Value *Length,
+                                  unsigned SrcAddrSpace, unsigned DestAddrSpace,
+                                  unsigned SrcAlign, unsigned DestAlign) const;
+
+  void getMemcpyLoopResidualLoweringType(SmallVectorImpl<Type *> &OpsOut,
+                                         LLVMContext &Context,
+                                         unsigned RemainingBytes,
+                                         unsigned SrcAddrSpace,
+                                         unsigned DestAddrSpace,
+                                         unsigned SrcAlign,
+                                         unsigned DestAlign) const;
   unsigned getMaxInterleaveFactor(unsigned VF);

   bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info) const;

   int getArithmeticInstrCost(
     unsigned Opcode, Type *Ty,
     TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
     TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
[102 lines not shown]

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

[305 lines not shown]
 }

 bool GCNTTIImpl::isLegalToVectorizeStoreChain(unsigned ChainSizeInBytes,
                                               unsigned Alignment,
                                               unsigned AddrSpace) const {
   return isLegalToVectorizeMemChain(ChainSizeInBytes, Alignment, AddrSpace);
 }

+// FIXME: Really we would like to issue multiple 128-bit loads and stores per
+// iteration. Should we report a larger size and let it legalize?
+//
+// FIXME: Should we use narrower types for local/region, or account for when
+// unaligned access is legal?
+//
+// FIXME: This could use fine tuning and microbenchmarks.
+Type *GCNTTIImpl::getMemcpyLoopLoweringType(LLVMContext &Context, Value *Length,
+                                            unsigned SrcAddrSpace,
+                                            unsigned DestAddrSpace,
+                                            unsigned SrcAlign,
+                                            unsigned DstAlign) const {
+  const ConstantInt *ConstLen = dyn_cast<ConstantInt>(Length);
+  unsigned MinAlign = std::min(SrcAlign, DstAlign);
+
+  if (!ConstLen) {
+    // 2-byte aligned accesses are executed as multiple 1-byte accesses, so
+    // don't introduce them.
+    if (MinAlign == 2)
+      return Type::getInt8Ty(Context);
+
+    // Not all subtargets have 128-bit DS instructions, and we currently don't
+    // form them by default.
+    if ((SrcAddrSpace == AMDGPUAS::LOCAL_ADDRESS ||
+         SrcAddrSpace == AMDGPUAS::REGION_ADDRESS) &&
+        (DestAddrSpace == AMDGPUAS::LOCAL_ADDRESS ||
+         DestAddrSpace == AMDGPUAS::REGION_ADDRESS)) {
+      return VectorType::get(Type::getInt32Ty(Context), 2);
+    }
+
+    // Global memory works best with 16-byte accesses. Private memory will also
+    // hit this, although they'll be decomposed.
+    return VectorType::get(Type::getInt32Ty(Context), 4);
+  }
+
+  uint64_t Size = ConstLen->getZExtValue();
+  if (Size >= 16)
+    return VectorType::get(Type::getInt32Ty(Context), 4);

+  // These cases are a bit degenerate since we don't want to introduce loops for
+  // them anyway.
+  if (Size >= 8)
+    return VectorType::get(Type::getInt32Ty(Context), 2);

+  if (Size >= 4)
+    return Type::getInt32Ty(Context);

+  if (Size >= 2 && MinAlign >= 2)
+    return Type::getInt16Ty(Context);

+  return Type::getInt8Ty(Context);
+}
+
+void GCNTTIImpl::getMemcpyLoopResidualLoweringType(
+    SmallVectorImpl<Type *> &OpsOut, LLVMContext &Context,
+    unsigned RemainingBytes, unsigned SrcAddrSpace, unsigned DestAddrSpace,
+    unsigned SrcAlign, unsigned DestAlign) const {
+  assert(RemainingBytes < 16);
+
+  Type *I64Ty = Type::getInt64Ty(Context);
+  Type *I32Ty = Type::getInt32Ty(Context);
+
+  while (RemainingBytes >= 8) {
+    OpsOut.push_back(I64Ty);
+    RemainingBytes -= 8;
+  }
+
+  while (RemainingBytes >= 4) {
+    OpsOut.push_back(I32Ty);
+    RemainingBytes -= 4;
+  }

+  if (SrcAlign >= 2 && DestAlign >= 2) {
+    Type *I16Ty = Type::getInt16Ty(Context);
+
+    while (RemainingBytes >= 2) {
+      OpsOut.push_back(I16Ty);
+      RemainingBytes -= 2;
+    }
+  }
+
+  Type *I8Ty = Type::getInt8Ty(Context);
+  while (RemainingBytes) {
+    OpsOut.push_back(I8Ty);
+    --RemainingBytes;
+  }
+}

 unsigned GCNTTIImpl::getMaxInterleaveFactor(unsigned VF) {
   // Disable unrolling if the loop is not vectorized.
   // TODO: Enable this again.
   if (VF == 1)
     return 1;

   return 8;
 }
[776 lines not shown]
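
To make the selection logic in this file easier to follow, here is a standalone sketch that mirrors the unknown-length path of `getMemcpyLoopLoweringType` and the whole of `getMemcpyLoopResidualLoweringType` using plain byte widths instead of LLVM types. It is an illustration only: the `AddrSpace` enum and the names `isLDSLike`, `loopAccessBytes`, and `residualAccessBytes` are hypothetical stand-ins, not LLVM APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the AMDGPUAS address-space constants.
enum class AddrSpace { Global, Local, Region, Private };

static bool isLDSLike(AddrSpace AS) {
  return AS == AddrSpace::Local || AS == AddrSpace::Region;
}

// Byte width of the main-loop access when the copy length is not a constant,
// following the branch order of the patch.
unsigned loopAccessBytes(AddrSpace Src, AddrSpace Dst, unsigned SrcAlign,
                         unsigned DstAlign) {
  unsigned MinAlign = std::min(SrcAlign, DstAlign);
  if (MinAlign == 2)
    return 1; // i8: avoid accesses that may be split into byte accesses.
  if (isLDSLike(Src) && isLDSLike(Dst))
    return 8; // <2 x i32>: no 128-bit DS accesses.
  return 16;  // <4 x i32>
}

// Byte widths of the residual accesses for fewer than 16 remaining bytes.
std::vector<unsigned> residualAccessBytes(unsigned RemainingBytes,
                                          unsigned SrcAlign,
                                          unsigned DstAlign) {
  assert(RemainingBytes < 16);
  std::vector<unsigned> Ops;
  while (RemainingBytes >= 8) {
    Ops.push_back(8);
    RemainingBytes -= 8;
  }
  while (RemainingBytes >= 4) {
    Ops.push_back(4);
    RemainingBytes -= 4;
  }
  // 2-byte accesses are used only when both sides are at least 2-byte aligned.
  if (SrcAlign >= 2 && DstAlign >= 2) {
    while (RemainingBytes >= 2) {
      Ops.push_back(2);
      RemainingBytes -= 2;
    }
  }
  while (RemainingBytes) {
    Ops.push_back(1);
    --RemainingBytes;
  }
  return Ops;
}
```

In this sketch a 15-byte residual with 4-byte aligned operands decomposes into widths 8, 4, 2, 1, and with 1-byte aligned operands into 8, 4, 1, 1, 1. A copy from local to global memory still gets the 16-byte loop type because of the `&&`, which is the behaviour questioned in the inline comment on line 338.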

llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt -S -amdgpu-lower-intrinsics %s | FileCheck -check-prefix=OPT %s
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-lower-intrinsics -amdgpu-mem-intrinsic-expand-size=1024 %s | FileCheck -check-prefixes=OPT,MAX1024 %s
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-lower-intrinsics -amdgpu-mem-intrinsic-expand-size=-1 %s | FileCheck -check-prefixes=OPT,ALL %s

 declare void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* nocapture, i8 addrspace(1)* nocapture readonly, i64, i1) #1
 declare void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* nocapture, i8 addrspace(3)* nocapture readonly, i32, i1) #1
+declare void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* nocapture, i8 addrspace(1)* nocapture readonly, i32, i1) #1
+declare void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* nocapture, i8 addrspace(5)* nocapture readonly, i32, i1) #1
+declare void @llvm.memcpy.p3i8.p3i8.i32(i8 addrspace(3)* nocapture, i8 addrspace(3)* nocapture readonly, i32, i1) #1

 declare void @llvm.memmove.p1i8.p1i8.i64(i8 addrspace(1)* nocapture, i8 addrspace(1)* nocapture readonly, i64, i1) #1
+declare void @llvm.memmove.p1i8.p3i8.i32(i8 addrspace(1)* nocapture, i8 addrspace(3)* nocapture readonly, i32, i1) #1
+declare void @llvm.memmove.p5i8.p5i8.i32(i8 addrspace(5)* nocapture, i8 addrspace(5)* nocapture readonly, i32, i1) #1

 declare void @llvm.memset.p1i8.i64(i8 addrspace(1)* nocapture, i8, i64, i1) #1

 ; Test the upper bound for sizes to leave
 define amdgpu_kernel void @max_size_small_static_memcpy_caller0(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
-; OPT-LABEL: @max_size_small_static_memcpy_caller0(
-; OPT-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* [[DST:%.*]], i8 addrspace(1)* [[SRC:%.*]], i64 1024, i1 false)
-; OPT-NEXT: ret void
+; MAX1024-LABEL: @max_size_small_static_memcpy_caller0(
+; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* [[DST:%.*]], i8 addrspace(1)* [[SRC:%.*]], i64 1024, i1 false)
+; MAX1024-NEXT: ret void
+;
+; ALL-LABEL: @max_size_small_static_memcpy_caller0(
+; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
+; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
+; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
+; ALL: load-store-loop:
+; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
+; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
+; ALL-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 1
+; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
+; ALL-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 1
+; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
+; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
+; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
+; ALL: memcpy-split:
+; ALL-NEXT: ret void
 ;
   call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 1024, i1 false)
   ret void
 }

 ; Smallest static size which will be expanded
 define amdgpu_kernel void @min_size_large_static_memcpy_caller0(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
 ; OPT-LABEL: @min_size_large_static_memcpy_caller0(
+; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
+; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
 ; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
 ; OPT: load-store-loop:
-; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load i8, i8 addrspace(1)* [[TMP1]], align 1
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store i8 [[TMP2]], i8 addrspace(1)* [[TMP3]], align 1
-; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1025
-; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
+; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 1
+; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 1
+; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
+; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
+; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
 ; OPT: memcpy-split:
+; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
+; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP8]], i64 1024
+; OPT-NEXT: [[TMP10:%.*]] = load i8, i8 addrspace(1)* [[TMP9]], align 1
+; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
+; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP11]], i64 1024
+; OPT-NEXT: store i8 [[TMP10]], i8 addrspace(1)* [[TMP12]], align 1
 ; OPT-NEXT: ret void
 ;
   call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 1025, i1 false)
   ret void
 }

 define amdgpu_kernel void @max_size_small_static_memmove_caller0(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
-; OPT-LABEL: @max_size_small_static_memmove_caller0(
-; OPT-NEXT: call void @llvm.memmove.p1i8.p1i8.i64(i8 addrspace(1)* [[DST:%.*]], i8 addrspace(1)* [[SRC:%.*]], i64 1024, i1 false)
-; OPT-NEXT: ret void
+; MAX1024-LABEL: @max_size_small_static_memmove_caller0(
+; MAX1024-NEXT: call void @llvm.memmove.p1i8.p1i8.i64(i8 addrspace(1)* [[DST:%.*]], i8 addrspace(1)* [[SRC:%.*]], i64 1024, i1 false)
+; MAX1024-NEXT: ret void
+;
+; ALL-LABEL: @max_size_small_static_memmove_caller0(
+; ALL-NEXT: [[COMPARE_SRC_DST:%.*]] = icmp ult i8 addrspace(1)* [[SRC:%.*]], [[DST:%.*]]
+; ALL-NEXT: [[COMPARE_N_TO_0:%.*]] = icmp eq i64 1024, 0
+; ALL-NEXT: br i1 [[COMPARE_SRC_DST]], label [[COPY_BACKWARDS:%.*]], label [[COPY_FORWARD:%.*]]
+; ALL: copy_backwards:
+; ALL-NEXT: br i1 [[COMPARE_N_TO_0]], label [[MEMMOVE_DONE:%.*]], label [[COPY_BACKWARDS_LOOP:%.*]]
+; ALL: copy_backwards_loop:
+; ALL-NEXT: [[TMP1:%.*]] = phi i64 [ [[INDEX_PTR:%.*]], [[COPY_BACKWARDS_LOOP]] ], [ 1024, [[COPY_BACKWARDS]] ]
+; ALL-NEXT: [[INDEX_PTR]] = sub i64 [[TMP1]], 1
+; ALL-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC]], i64 [[INDEX_PTR]]
+; ALL-NEXT: [[ELEMENT:%.*]] = load i8, i8 addrspace(1)* [[TMP2]], align 1
+; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST]], i64 [[INDEX_PTR]]
+; ALL-NEXT: store i8 [[ELEMENT]], i8 addrspace(1)* [[TMP3]], align 1
+; ALL-NEXT: [[TMP4:%.*]] = icmp eq i64 [[INDEX_PTR]], 0
+; ALL-NEXT: br i1 [[TMP4]], label [[MEMMOVE_DONE]], label [[COPY_BACKWARDS_LOOP]]
+; ALL: copy_forward:
+; ALL-NEXT: br i1 [[COMPARE_N_TO_0]], label [[MEMMOVE_DONE]], label [[COPY_FORWARD_LOOP:%.*]]
+; ALL: copy_forward_loop:
+; ALL-NEXT: [[INDEX_PTR1:%.*]] = phi i64 [ [[INDEX_INCREMENT:%.*]], [[COPY_FORWARD_LOOP]] ], [ 0, [[COPY_FORWARD]] ]
+; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC]], i64 [[INDEX_PTR1]]
+; ALL-NEXT: [[ELEMENT2:%.*]] = load i8, i8 addrspace(1)* [[TMP5]], align 1
+; ALL-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST]], i64 [[INDEX_PTR1]]
+; ALL-NEXT: store i8 [[ELEMENT2]], i8 addrspace(1)* [[TMP6]], align 1
+; ALL-NEXT: [[INDEX_INCREMENT]] = add i64 [[INDEX_PTR1]], 1
+; ALL-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_INCREMENT]], 1024
+; ALL-NEXT: br i1 [[TMP7]], label [[MEMMOVE_DONE]], label [[COPY_FORWARD_LOOP]]
+; ALL: memmove_done:
+; ALL-NEXT: ret void
 ;
   call void @llvm.memmove.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 1024, i1 false)
   ret void
 }

 define amdgpu_kernel void @min_size_large_static_memmove_caller0(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
 ; OPT-LABEL: @min_size_large_static_memmove_caller0(
 ; OPT-NEXT: [[COMPARE_SRC_DST:%.*]] = icmp ult i8 addrspace(1)* [[SRC:%.*]], [[DST:%.*]]
[24 lines not shown]
 ; OPT: memmove_done:
 ; OPT-NEXT: ret void
 ;
   call void @llvm.memmove.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 1025, i1 false)
   ret void
 }

 define amdgpu_kernel void @max_size_small_static_memset_caller0(i8 addrspace(1)* %dst, i8 %val) #0 {
-; OPT-LABEL: @max_size_small_static_memset_caller0(
-; OPT-NEXT: call void @llvm.memset.p1i8.i64(i8 addrspace(1)* [[DST:%.*]], i8 [[VAL:%.*]], i64 1024, i1 false)
-; OPT-NEXT: ret void
+; MAX1024-LABEL: @max_size_small_static_memset_caller0(
+; MAX1024-NEXT: call void @llvm.memset.p1i8.i64(i8 addrspace(1)* [[DST:%.*]], i8 [[VAL:%.*]], i64 1024, i1 false)
+; MAX1024-NEXT: ret void
+;
+; ALL-LABEL: @max_size_small_static_memset_caller0(
+; ALL-NEXT: br i1 false, label [[SPLIT:%.*]], label [[LOADSTORELOOP:%.*]]
+; ALL: loadstoreloop:
+; ALL-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP3:%.*]], [[LOADSTORELOOP]] ]
+; ALL-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[TMP1]]
+; ALL-NEXT: store i8 [[VAL:%.*]], i8 addrspace(1)* [[TMP2]], align 1
+; ALL-NEXT: [[TMP3]] = add i64 [[TMP1]], 1
+; ALL-NEXT: [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 1024
+; ALL-NEXT: br i1 [[TMP4]], label [[LOADSTORELOOP]], label [[SPLIT]]
+; ALL: split:
+; ALL-NEXT: ret void
 ;
   call void @llvm.memset.p1i8.i64(i8 addrspace(1)* %dst, i8 %val, i64 1024, i1 false)
   ret void
 }

 define amdgpu_kernel void @min_size_large_static_memset_caller0(i8 addrspace(1)* %dst, i8 %val) #0 {
 ; OPT-LABEL: @min_size_large_static_memset_caller0(
 ; OPT-NEXT: br i1 false, label [[SPLIT:%.*]], label [[LOADSTORELOOP:%.*]]
 ; OPT: loadstoreloop:
 ; OPT-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP3:%.*]], [[LOADSTORELOOP]] ]
 ; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[TMP1]]
 ; OPT-NEXT: store i8 [[VAL:%.*]], i8 addrspace(1)* [[TMP2]], align 1
 ; OPT-NEXT: [[TMP3]] = add i64 [[TMP1]], 1
 ; OPT-NEXT: [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 1025
 ; OPT-NEXT: br i1 [[TMP4]], label [[LOADSTORELOOP]], label [[SPLIT]]
 ; OPT: split:
 ; OPT-NEXT: ret void
 ;
   call void @llvm.memset.p1i8.i64(i8 addrspace(1)* %dst, i8 %val, i64 1025, i1 false)
   ret void
 }

 define amdgpu_kernel void @variable_memcpy_caller0(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n) #0 {
 ; OPT-LABEL: @variable_memcpy_caller0(
-; OPT-NEXT: [[TMP1:%.*]] = icmp ne i64 [[N:%.*]], 0
-; OPT-NEXT: br i1 [[TMP1]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
+; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
+; OPT-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
+; OPT-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
+; OPT-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
+; OPT-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
+; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
 ; OPT: loop-memcpy-expansion:
-; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP5:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
-; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP3:%.*]] = load i8, i8 addrspace(1)* [[TMP2]], align 1
-; OPT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store i8 [[TMP3]], i8 addrspace(1)* [[TMP4]], align 1
-; OPT-NEXT: [[TMP5]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP6:%.*]] = icmp ult i64 [[TMP5]], [[N]]
-; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
+; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 1
+; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
+; OPT-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX]], 1
+; OPT-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
+; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
+; OPT: loop-memcpy-residual:
+; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
+; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
+; OPT-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
+; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
+; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 1
+; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
+; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
+; OPT-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
+; OPT-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
+; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
 ; OPT: post-loop-memcpy-expansion:
 ; OPT-NEXT: ret void
+; OPT: loop-memcpy-residual-header:
+; OPT-NEXT: [[TMP20:%.*]] = icmp ne i64 [[TMP4]], 0
+; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n, i1 false)
   ret void
 }

	define amdgpu_kernel void @variable_memcpy_caller1(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n) #0 {			define amdgpu_kernel void @variable_memcpy_caller1(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n) #0 {
	; OPT-LABEL: @variable_memcpy_caller1(			; OPT-LABEL: @variable_memcpy_caller1(
	; OPT-NEXT: [[TMP1:%.*]] = icmp ne i64 [[N:%.*]], 0			; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
	; OPT-NEXT: br i1 [[TMP1]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]			; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
	; OPT: loop-memcpy-expansion:			; OPT: loop-memcpy-expansion:
	; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP5:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]			; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
	; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC:%.*]], i64 [[LOOP_INDEX]]			; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
	; OPT-NEXT: [[TMP3:%.*]] = load i8, i8 addrspace(1)* [[TMP2]], align 1			; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 1
	; OPT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[LOOP_INDEX]]			; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
	; OPT-NEXT: store i8 [[TMP3]], i8 addrspace(1)* [[TMP4]], align 1			; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
	; OPT-NEXT: [[TMP5]] = add i64 [[LOOP_INDEX]], 1			; OPT-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX]], 1
	; OPT-NEXT: [[TMP6:%.*]] = icmp ult i64 [[TMP5]], [[N]]			; OPT-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
	; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION]], label [[POST_LOOP_MEMCPY_EXPANSION]]			; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 1
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
	; OPT: post-loop-memcpy-expansion:			; OPT: post-loop-memcpy-expansion:
	; OPT-NEXT: ret void			; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i64 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
	;			;
	call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n, i1 false)			call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n, i1 false)
	ret void			ret void
	}			}

	define amdgpu_kernel void @memcpy_multi_use_one_function(i8 addrspace(1)* %dst0, i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 %n, i64 %m) #0 {			define amdgpu_kernel void @memcpy_multi_use_one_function(i8 addrspace(1)* %dst0, i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 %n, i64 %m) #0 {
	; OPT-LABEL: @memcpy_multi_use_one_function(			; OPT-LABEL: @memcpy_multi_use_one_function(
	; OPT-NEXT: [[TMP1:%.*]] = icmp ne i64 [[N:%.*]], 0			; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
	; OPT-NEXT: br i1 [[TMP1]], label [[LOOP_MEMCPY_EXPANSION2:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION1:%.*]]			; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST0:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION2:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER5:%.*]]
	; OPT: loop-memcpy-expansion2:			; OPT: loop-memcpy-expansion2:
	; OPT-NEXT: [[LOOP_INDEX3:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP5:%.*]], [[LOOP_MEMCPY_EXPANSION2]] ]			; OPT-NEXT: [[LOOP_INDEX3:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION2]] ]
	; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC:%.*]], i64 [[LOOP_INDEX3]]			; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX3]]
	; OPT-NEXT: [[TMP3:%.*]] = load i8, i8 addrspace(1)* [[TMP2]], align 1			; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 1
	; OPT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST0:%.*]], i64 [[LOOP_INDEX3]]			; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX3]]
	; OPT-NEXT: store i8 [[TMP3]], i8 addrspace(1)* [[TMP4]], align 1			; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
	; OPT-NEXT: [[TMP5]] = add i64 [[LOOP_INDEX3]], 1			; OPT-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX3]], 1
	; OPT-NEXT: [[TMP6:%.*]] = icmp ult i64 [[TMP5]], [[N]]			; OPT-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
	; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION2]], label [[POST_LOOP_MEMCPY_EXPANSION1]]			; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION2]], label [[LOOP_MEMCPY_RESIDUAL_HEADER5]]
				; OPT: loop-memcpy-residual4:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX6:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER5]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL4:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX6]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 1
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX6]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL4]], label [[POST_LOOP_MEMCPY_EXPANSION1:%.*]]
	; OPT: post-loop-memcpy-expansion1:			; OPT: post-loop-memcpy-expansion1:
	; OPT-NEXT: [[TMP7:%.*]] = icmp ne i64 [[M:%.*]], 0			; OPT-NEXT: [[TMP20:%.*]] = bitcast i8 addrspace(1)* [[SRC]] to <4 x i32> addrspace(1)*
	; OPT-NEXT: br i1 [[TMP7]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]			; OPT-NEXT: [[TMP21:%.*]] = bitcast i8 addrspace(1)* [[DST1:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP22:%.*]] = udiv i64 [[M:%.*]], 16
				; OPT-NEXT: [[TMP23:%.*]] = urem i64 [[M]], 16
				; OPT-NEXT: [[TMP24:%.*]] = sub i64 [[M]], [[TMP23]]
				; OPT-NEXT: [[TMP25:%.*]] = icmp ne i64 [[TMP22]], 0
				; OPT-NEXT: br i1 [[TMP25]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
	; OPT: loop-memcpy-expansion:			; OPT: loop-memcpy-expansion:
	; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[POST_LOOP_MEMCPY_EXPANSION1]] ], [ [[TMP11:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]			; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[POST_LOOP_MEMCPY_EXPANSION1]] ], [ [[TMP29:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
	; OPT-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC]], i64 [[LOOP_INDEX]]			; OPT-NEXT: [[TMP26:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP20]], i64 [[LOOP_INDEX]]
	; OPT-NEXT: [[TMP9:%.*]] = load i8, i8 addrspace(1)* [[TMP8]], align 1			; OPT-NEXT: [[TMP27:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP26]], align 1
	; OPT-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST1:%.*]], i64 [[LOOP_INDEX]]			; OPT-NEXT: [[TMP28:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP21]], i64 [[LOOP_INDEX]]
	; OPT-NEXT: store i8 [[TMP9]], i8 addrspace(1)* [[TMP10]], align 1			; OPT-NEXT: store <4 x i32> [[TMP27]], <4 x i32> addrspace(1)* [[TMP28]], align 1
	; OPT-NEXT: [[TMP11]] = add i64 [[LOOP_INDEX]], 1			; OPT-NEXT: [[TMP29]] = add i64 [[LOOP_INDEX]], 1
	; OPT-NEXT: [[TMP12:%.*]] = icmp ult i64 [[TMP11]], [[M]]			; OPT-NEXT: [[TMP30:%.*]] = icmp ult i64 [[TMP29]], [[TMP22]]
	; OPT-NEXT: br i1 [[TMP12]], label [[LOOP_MEMCPY_EXPANSION]], label [[POST_LOOP_MEMCPY_EXPANSION]]			; OPT-NEXT: br i1 [[TMP30]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP37:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP31:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP20]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP21]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP33:%.*]] = add i64 [[TMP24]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP34:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP31]], i64 [[TMP33]]
				; OPT-NEXT: [[TMP35:%.*]] = load i8, i8 addrspace(1)* [[TMP34]], align 1
				; OPT-NEXT: [[TMP36:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP32]], i64 [[TMP33]]
				; OPT-NEXT: store i8 [[TMP35]], i8 addrspace(1)* [[TMP36]], align 1
				; OPT-NEXT: [[TMP37]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP38:%.*]] = icmp ult i64 [[TMP37]], [[TMP23]]
				; OPT-NEXT: br i1 [[TMP38]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
	; OPT: post-loop-memcpy-expansion:			; OPT: post-loop-memcpy-expansion:
	; OPT-NEXT: ret void			; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP39:%.*]] = icmp ne i64 [[TMP23]], 0
				; OPT-NEXT: br i1 [[TMP39]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				; OPT: loop-memcpy-residual-header5:
				; OPT-NEXT: [[TMP40:%.*]] = icmp ne i64 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP40]], label [[LOOP_MEMCPY_RESIDUAL4]], label [[POST_LOOP_MEMCPY_EXPANSION1]]
	;			;
	call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst0, i8 addrspace(1)* %src, i64 %n, i1 false)			call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst0, i8 addrspace(1)* %src, i64 %n, i1 false)
	call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 %m, i1 false)			call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 %m, i1 false)
	ret void			ret void
	}			}

	define amdgpu_kernel void @memcpy_alt_type(i8 addrspace(1)* %dst, i8 addrspace(3)* %src, i32 %n) #0 {			define amdgpu_kernel void @memcpy_alt_type(i8 addrspace(1)* %dst, i8 addrspace(3)* %src, i32 %n) #0 {
	; OPT-LABEL: @memcpy_alt_type(			; OPT-LABEL: @memcpy_alt_type(
	; OPT-NEXT: [[TMP1:%.*]] = icmp ne i32 [[N:%.*]], 0			; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(3)* [[SRC:%.*]] to <4 x i32> addrspace(3)*
	; OPT-NEXT: br i1 [[TMP1]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]			; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i32 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i32 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i32 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
	; OPT: loop-memcpy-expansion:			; OPT: loop-memcpy-expansion:
	; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP5:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]			; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
	; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[SRC:%.*]], i32 [[LOOP_INDEX]]			; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(3)* [[TMP1]], i32 [[LOOP_INDEX]]
	; OPT-NEXT: [[TMP3:%.*]] = load i8, i8 addrspace(3)* [[TMP2]], align 1			; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(3)* [[TMP7]], align 1
	; OPT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i32 [[LOOP_INDEX]]			; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i32 [[LOOP_INDEX]]
	; OPT-NEXT: store i8 [[TMP3]], i8 addrspace(1)* [[TMP4]], align 1			; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
	; OPT-NEXT: [[TMP5]] = add i32 [[LOOP_INDEX]], 1			; OPT-NEXT: [[TMP10]] = add i32 [[LOOP_INDEX]], 1
	; OPT-NEXT: [[TMP6:%.*]] = icmp ult i32 [[TMP5]], [[N]]			; OPT-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP3]]
	; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION]], label [[POST_LOOP_MEMCPY_EXPANSION]]			; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(3)* [[TMP1]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = add i32 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP12]], i32 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(3)* [[TMP15]], align 1
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i32 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i32 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
	; OPT: post-loop-memcpy-expansion:			; OPT: post-loop-memcpy-expansion:
	; OPT-NEXT: ret void			; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i32 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
	;			;
	call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* %dst, i8 addrspace(3)* %src, i32 %n, i1 false)			call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* %dst, i8 addrspace(3)* %src, i32 %n, i1 false)
	ret void			ret void
	}			}

	; One of the uses in the function should be expanded, the other left alone.			; One of the uses in the function should be expanded, the other left alone.
	define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(i8 addrspace(1)* %dst0, i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 %n) #0 {			define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(i8 addrspace(1)* %dst0, i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 %n) #0 {
	; OPT-LABEL: @memcpy_multi_use_one_function_keep_small(			; MAX1024-LABEL: @memcpy_multi_use_one_function_keep_small(
				; MAX1024-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; MAX1024-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST0:%.*]] to <4 x i32> addrspace(1)*
				; MAX1024-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
				; MAX1024-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
				; MAX1024-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
				; MAX1024-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
				; MAX1024-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; MAX1024: loop-memcpy-expansion:
				; MAX1024-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; MAX1024-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; MAX1024-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 1
				; MAX1024-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; MAX1024-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
				; MAX1024-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX]], 1
				; MAX1024-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
				; MAX1024-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; MAX1024: loop-memcpy-residual:
				; MAX1024-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; MAX1024-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; MAX1024-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; MAX1024-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; MAX1024-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
				; MAX1024-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 1
				; MAX1024-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
				; MAX1024-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
				; MAX1024-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
				; MAX1024-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
				; MAX1024-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; MAX1024: post-loop-memcpy-expansion:
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* [[DST1:%.*]], i8 addrspace(1)* [[SRC]], i64 102, i1 false)
				; MAX1024-NEXT: ret void
				; MAX1024: loop-memcpy-residual-header:
				; MAX1024-NEXT: [[TMP20:%.*]] = icmp ne i64 [[TMP4]], 0
				; MAX1024-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				; ALL-LABEL: @memcpy_multi_use_one_function_keep_small(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST0:%.*]] to <4 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
				; ALL-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
				; ALL-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
				; ALL-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
				; ALL-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; ALL: loop-memcpy-expansion:
				; ALL-NEXT: [[LOOP_INDEX1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; ALL-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX1]]
				; ALL-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 1
				; ALL-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX1]]
				; ALL-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
				; ALL-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX1]], 1
				; ALL-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
				; ALL-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; ALL: loop-memcpy-residual:
				; ALL-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; ALL-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; ALL-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; ALL-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; ALL-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
				; ALL-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 1
				; ALL-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
				; ALL-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
				; ALL-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
				; ALL-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; ALL: post-loop-memcpy-expansion:
				; ALL-NEXT: [[TMP20:%.*]] = bitcast i8 addrspace(1)* [[SRC]] to <4 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP21:%.*]] = bitcast i8 addrspace(1)* [[DST1:%.*]] to <4 x i32> addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[POST_LOOP_MEMCPY_EXPANSION]] ], [ [[TMP25:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP22:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP20]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP23:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP22]], align 1
				; ALL-NEXT: [[TMP24:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP21]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store <4 x i32> [[TMP23]], <4 x i32> addrspace(1)* [[TMP24]], align 1
				; ALL-NEXT: [[TMP25]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP26:%.*]] = icmp ult i64 [[TMP25]], 6
				; ALL-NEXT: br i1 [[TMP26]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: [[TMP27:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP20]] to i32 addrspace(1)*
				; ALL-NEXT: [[TMP28:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP27]], i64 24
				; ALL-NEXT: [[TMP29:%.*]] = load i32, i32 addrspace(1)* [[TMP28]], align 1
				; ALL-NEXT: [[TMP30:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP21]] to i32 addrspace(1)*
				; ALL-NEXT: [[TMP31:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP30]], i64 24
				; ALL-NEXT: store i32 [[TMP29]], i32 addrspace(1)* [[TMP31]], align 1
				; ALL-NEXT: [[TMP32:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP20]] to i8 addrspace(1)*
				; ALL-NEXT: [[TMP33:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP32]], i64 100
				; ALL-NEXT: [[TMP34:%.*]] = load i8, i8 addrspace(1)* [[TMP33]], align 1
				; ALL-NEXT: [[TMP35:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP21]] to i8 addrspace(1)*
				; ALL-NEXT: [[TMP36:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP35]], i64 100
				; ALL-NEXT: store i8 [[TMP34]], i8 addrspace(1)* [[TMP36]], align 1
				; ALL-NEXT: [[TMP37:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP20]] to i8 addrspace(1)*
				; ALL-NEXT: [[TMP38:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP37]], i64 101
				; ALL-NEXT: [[TMP39:%.*]] = load i8, i8 addrspace(1)* [[TMP38]], align 1
				; ALL-NEXT: [[TMP40:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP21]] to i8 addrspace(1)*
				; ALL-NEXT: [[TMP41:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP40]], i64 101
				; ALL-NEXT: store i8 [[TMP39]], i8 addrspace(1)* [[TMP41]], align 1
				; ALL-NEXT: ret void
				; ALL: loop-memcpy-residual-header:
				; ALL-NEXT: [[TMP42:%.*]] = icmp ne i64 [[TMP4]], 0
				; ALL-NEXT: br i1 [[TMP42]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst0, i8 addrspace(1)* %src, i64 %n, i1 false)
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %dst1, i8 addrspace(1)* %src, i64 102, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1028(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1028(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP8]], i64 256
				; OPT-NEXT: [[TMP10:%.*]] = load i32, i32 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP11]], i64 256
				; OPT-NEXT: store i32 [[TMP10]], i32 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1028, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1025(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1025(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP8]], i64 1024
				; OPT-NEXT: [[TMP10:%.*]] = load i8, i8 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP11]], i64 1024
				; OPT-NEXT: store i8 [[TMP10]], i8 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1025, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1026(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1026(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP8]], i64 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP11]], i64 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1026, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1032(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1032(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.]] = bitcast <4 x i32> addrspace(1) [[TMP1]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.]] = getelementptr inbounds i64, i64 addrspace(1) [[TMP8]], i64 128
				; OPT-NEXT: [[TMP10:%.]] = load i64, i64 addrspace(1) [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.]] = bitcast <4 x i32> addrspace(1) [[TMP2]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.]] = getelementptr inbounds i64, i64 addrspace(1) [[TMP11]], i64 128
				; OPT-NEXT: store i64 [[TMP10]], i64 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1032, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1034(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1034(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP8]], i64 128
				; OPT-NEXT: [[TMP10:%.*]] = load i64, i64 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP11]], i64 128
				; OPT-NEXT: store i64 [[TMP10]], i64 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP13]], i64 516
				; OPT-NEXT: [[TMP15:%.*]] = load i16, i16 addrspace(1)* [[TMP14]], align 4
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP16]], i64 516
				; OPT-NEXT: store i16 [[TMP15]], i16 addrspace(1)* [[TMP17]], align 4
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1034, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1035(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1035(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP8]], i64 128
				; OPT-NEXT: [[TMP10:%.*]] = load i64, i64 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP11]], i64 128
				; OPT-NEXT: store i64 [[TMP10]], i64 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP13]], i64 516
				; OPT-NEXT: [[TMP15:%.*]] = load i16, i16 addrspace(1)* [[TMP14]], align 4
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP16]], i64 516
				; OPT-NEXT: store i16 [[TMP15]], i16 addrspace(1)* [[TMP17]], align 4
				; OPT-NEXT: [[TMP18:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP18]], i64 1034
				; OPT-NEXT: [[TMP20:%.*]] = load i8, i8 addrspace(1)* [[TMP19]], align 2
				; OPT-NEXT: [[TMP21:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP21]], i64 1034
				; OPT-NEXT: store i8 [[TMP20]], i8 addrspace(1)* [[TMP22]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1035, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1036(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1036(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP8]], i64 128
				; OPT-NEXT: [[TMP10:%.*]] = load i64, i64 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP11]], i64 128
				; OPT-NEXT: store i64 [[TMP10]], i64 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP13]], i64 258
				; OPT-NEXT: [[TMP15:%.*]] = load i32, i32 addrspace(1)* [[TMP14]], align 4
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP16]], i64 258
				; OPT-NEXT: store i32 [[TMP15]], i32 addrspace(1)* [[TMP17]], align 4
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1036, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1039(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1039(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP8]], i64 128
				; OPT-NEXT: [[TMP10:%.*]] = load i64, i64 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP11]], i64 128
				; OPT-NEXT: store i64 [[TMP10]], i64 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP13]], i64 258
				; OPT-NEXT: [[TMP15:%.*]] = load i32, i32 addrspace(1)* [[TMP14]], align 4
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP16]], i64 258
				; OPT-NEXT: store i32 [[TMP15]], i32 addrspace(1)* [[TMP17]], align 4
				; OPT-NEXT: [[TMP18:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP19:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP18]], i64 518
				; OPT-NEXT: [[TMP20:%.*]] = load i16, i16 addrspace(1)* [[TMP19]], align 4
				; OPT-NEXT: [[TMP21:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP21]], i64 518
				; OPT-NEXT: store i16 [[TMP20]], i16 addrspace(1)* [[TMP22]], align 4
				; OPT-NEXT: [[TMP23:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP23]], i64 1038
				; OPT-NEXT: [[TMP25:%.*]] = load i8, i8 addrspace(1)* [[TMP24]], align 2
				; OPT-NEXT: [[TMP26:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP27:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP26]], i64 1038
				; OPT-NEXT: store i8 [[TMP25]], i8 addrspace(1)* [[TMP27]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1039, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align2_global_align2_1039(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align2_global_align2_1039(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 2
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 2
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP8]], i64 128
				; OPT-NEXT: [[TMP10:%.*]] = load i64, i64 addrspace(1)* [[TMP9]], align 2
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i64 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, i64 addrspace(1)* [[TMP11]], i64 128
				; OPT-NEXT: store i64 [[TMP10]], i64 addrspace(1)* [[TMP12]], align 2
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP13]], i64 258
				; OPT-NEXT: [[TMP15:%.*]] = load i32, i32 addrspace(1)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i32 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP16]], i64 258
				; OPT-NEXT: store i32 [[TMP15]], i32 addrspace(1)* [[TMP17]], align 2
				; OPT-NEXT: [[TMP18:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP19:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP18]], i64 518
				; OPT-NEXT: [[TMP20:%.*]] = load i16, i16 addrspace(1)* [[TMP19]], align 2
				; OPT-NEXT: [[TMP21:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP21]], i64 518
				; OPT-NEXT: store i16 [[TMP20]], i16 addrspace(1)* [[TMP22]], align 2
				; OPT-NEXT: [[TMP23:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP23]], i64 1038
				; OPT-NEXT: [[TMP25:%.*]] = load i8, i8 addrspace(1)* [[TMP24]], align 2
				; OPT-NEXT: [[TMP26:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP27:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP26]], i64 1038
				; OPT-NEXT: store i8 [[TMP25]], i8 addrspace(1)* [[TMP27]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 2 %dst, i8 addrspace(1)* align 2 %src, i64 1039, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1027(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP8]], i64 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP11]], i64 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(1)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP16]], i64 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(1)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align2_global_align4_1027(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align2_global_align4_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 2
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP8]], i64 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP11]], i64 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(1)* [[TMP12]], align 2
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(1)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP16]], i64 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(1)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 2 %dst, i8 addrspace(1)* align 4 %src, i64 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align2_1027(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align2_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 2
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP8]], i64 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(1)* [[TMP9]], align 2
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP11]], i64 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(1)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(1)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP16]], i64 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(1)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 2 %src, i64 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_private_align4_private_align4_1027(i8 addrspace(5)* %dst, i8 addrspace(5)* %src) #0 {
				; OPT-LABEL: @memcpy_private_align4_private_align4_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(5)* [[SRC:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(5)* [[DST:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(5)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(5)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP8]], i32 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(5)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP11]], i32 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(5)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP13]], i32 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(5)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP16]], i32 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(5)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* align 4 %dst, i8 addrspace(5)* align 4 %src, i32 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_private_align2_private_align4_1027(i8 addrspace(5)* %dst, i8 addrspace(5)* %src) #0 {
				; OPT-LABEL: @memcpy_private_align2_private_align4_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(5)* [[SRC:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(5)* [[DST:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(5)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(5)* [[TMP5]], align 2
				; OPT-NEXT: [[TMP6]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP8]], i32 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(5)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP11]], i32 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(5)* [[TMP12]], align 2
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP13]], i32 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(5)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP16]], i32 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(5)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* align 2 %dst, i8 addrspace(5)* align 4 %src, i32 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_private_align1_private_align4_1027(i8 addrspace(5)* %dst, i8 addrspace(5)* %src) #0 {
				; OPT-LABEL: @memcpy_private_align1_private_align4_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(5)* [[SRC:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(5)* [[DST:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(5)* [[TMP3]], align 4
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(5)* [[TMP5]], align 1
				; OPT-NEXT: [[TMP6]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP8]], i32 1024
				; OPT-NEXT: [[TMP10:%.*]] = load i8, i8 addrspace(5)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP11]], i32 1024
				; OPT-NEXT: store i8 [[TMP10]], i8 addrspace(5)* [[TMP12]], align 1
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP13]], i32 1025
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(5)* [[TMP14]], align 1
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP16]], i32 1025
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(5)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP18]], i32 1026
				; OPT-NEXT: [[TMP20:%.*]] = load i8, i8 addrspace(5)* [[TMP19]], align 2
				; OPT-NEXT: [[TMP21:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP21]], i32 1026
				; OPT-NEXT: store i8 [[TMP20]], i8 addrspace(5)* [[TMP22]], align 1
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* align 1 %dst, i8 addrspace(5)* align 4 %src, i32 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_private_align4_private_align2_1027(i8 addrspace(5)* %dst, i8 addrspace(5)* %src) #0 {
				; OPT-LABEL: @memcpy_private_align4_private_align2_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(5)* [[SRC:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(5)* [[DST:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(5)* [[TMP3]], align 2
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(5)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP8]], i32 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(5)* [[TMP9]], align 2
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP11]], i32 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(5)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP13]], i32 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(5)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP16]], i32 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(5)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* align 4 %dst, i8 addrspace(5)* align 2 %src, i32 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_private_align4_private_align1_1027(i8 addrspace(5)* %dst, i8 addrspace(5)* %src) #0 {
				; OPT-LABEL: @memcpy_private_align4_private_align1_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(5)* [[SRC:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(5)* [[DST:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(5)* [[TMP3]], align 1
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(5)* [[TMP5]], align 4
				; OPT-NEXT: [[TMP6]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP8]], i32 1024
				; OPT-NEXT: [[TMP10:%.*]] = load i8, i8 addrspace(5)* [[TMP9]], align 1
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP11]], i32 1024
				; OPT-NEXT: store i8 [[TMP10]], i8 addrspace(5)* [[TMP12]], align 4
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP13]], i32 1025
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(5)* [[TMP14]], align 1
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP16]], i32 1025
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(5)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP18]], i32 1026
				; OPT-NEXT: [[TMP20:%.*]] = load i8, i8 addrspace(5)* [[TMP19]], align 1
				; OPT-NEXT: [[TMP21:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP21]], i32 1026
				; OPT-NEXT: store i8 [[TMP20]], i8 addrspace(5)* [[TMP22]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* align 4 %dst, i8 addrspace(5)* align 1 %src, i32 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_private_align2_private_align2_1027(i8 addrspace(5)* %dst, i8 addrspace(5)* %src) #0 {
				; OPT-LABEL: @memcpy_private_align2_private_align2_1027(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(5)* [[SRC:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(5)* [[DST:%.*]] to <4 x i32> addrspace(5)*
				; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; OPT: load-store-loop:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(5)* [[TMP3]], align 2
				; OPT-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(5)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(5)* [[TMP5]], align 2
				; OPT-NEXT: [[TMP6]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP6]], 64
				; OPT-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; OPT: memcpy-split:
				; OPT-NEXT: [[TMP8:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP8]], i32 512
				; OPT-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(5)* [[TMP9]], align 2
				; OPT-NEXT: [[TMP11:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i16 addrspace(5)*
				; OPT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(5)* [[TMP11]], i32 512
				; OPT-NEXT: store i16 [[TMP10]], i16 addrspace(5)* [[TMP12]], align 2
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP1]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP13]], i32 1026
				; OPT-NEXT: [[TMP15:%.*]] = load i8, i8 addrspace(5)* [[TMP14]], align 2
				; OPT-NEXT: [[TMP16:%.*]] = bitcast <4 x i32> addrspace(5)* [[TMP2]] to i8 addrspace(5)*
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(5)* [[TMP16]], i32 1026
				; OPT-NEXT: store i8 [[TMP15]], i8 addrspace(5)* [[TMP17]], align 2
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p5i8.p5i8.i32(i8 addrspace(5)* align 2 %dst, i8 addrspace(5)* align 2 %src, i32 1027, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_variable(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n) #0 {
				; OPT-LABEL: @memcpy_global_align4_global_align4_variable(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 4
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
				; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 4
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 4
				; OPT-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i64 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align2_global_align2_variable(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n) #0 {
				; OPT-LABEL: @memcpy_global_align2_global_align2_variable(
				; OPT-NEXT: [[TMP1:%.*]] = icmp ne i64 [[N:%.*]], 0
				; OPT-NEXT: br i1 [[TMP1]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP5:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC:%.*]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP3:%.*]] = load i8, i8 addrspace(1)* [[TMP2]], align 1
				; OPT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store i8 [[TMP3]], i8 addrspace(1)* [[TMP4]], align 1
				; OPT-NEXT: [[TMP5]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP6:%.*]] = icmp ult i64 [[TMP5]], [[N]]
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 2 %dst, i8 addrspace(1)* align 2 %src, i64 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align1_global_align1_variable(i8 addrspace(1)* %dst, i8 addrspace(1)* %src, i64 %n) #0 {
				; OPT-LABEL: @memcpy_global_align1_global_align1_variable(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i64 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i64 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i64 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i64 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 1
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 1
				; OPT-NEXT: [[TMP10]] = add i64 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP11:%.*]] = icmp ult i64 [[TMP10]], [[TMP3]]
				; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = add i64 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i64 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 1
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i64 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i64 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i64 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 1 %dst, i8 addrspace(1)* align 1 %src, i64 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_local_align4_local_align4_variable(i8 addrspace(3)* %dst, i8 addrspace(3)* %src, i32 %n) #0 {
				; OPT-LABEL: @memcpy_local_align4_local_align4_variable(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(3)* [[SRC:%.*]] to <2 x i32> addrspace(3)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(3)* [[DST:%.*]] to <2 x i32> addrspace(3)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i32 [[N:%.*]], 8
				; OPT-NEXT: [[TMP4:%.*]] = urem i32 [[N]], 8
				; OPT-NEXT: [[TMP5:%.*]] = sub i32 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(3)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP8:%.*]] = load <2 x i32>, <2 x i32> addrspace(3)* [[TMP7]], align 4
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(3)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <2 x i32> [[TMP8]], <2 x i32> addrspace(3)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP10]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP3]]
				; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <2 x i32> addrspace(3)* [[TMP1]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <2 x i32> addrspace(3)* [[TMP2]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP14:%.*]] = add i32 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP12]], i32 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(3)* [[TMP15]], align 4
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP13]], i32 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(3)* [[TMP17]], align 4
				; OPT-NEXT: [[TMP18]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i32 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i32 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p3i8.p3i8.i32(i8 addrspace(3)* align 4 %dst, i8 addrspace(3)* align 4 %src, i32 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_local_align2_local_align2_variable(i8 addrspace(3)* %dst, i8 addrspace(3)* %src, i32 %n) #0 {
				; OPT-LABEL: @memcpy_local_align2_local_align2_variable(
				; OPT-NEXT: [[TMP1:%.*]] = icmp ne i32 [[N:%.*]], 0
				; OPT-NEXT: br i1 [[TMP1]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP5:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[SRC:%.*]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP3:%.*]] = load i8, i8 addrspace(3)* [[TMP2]], align 1
				; OPT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[DST:%.*]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store i8 [[TMP3]], i8 addrspace(3)* [[TMP4]], align 1
				; OPT-NEXT: [[TMP5]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP6:%.*]] = icmp ult i32 [[TMP5]], [[N]]
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				;
				call void @llvm.memcpy.p3i8.p3i8.i32(i8 addrspace(3)* align 2 %dst, i8 addrspace(3)* align 2 %src, i32 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_local_align1_local_align1_variable(i8 addrspace(3)* %dst, i8 addrspace(3)* %src, i32 %n) #0 {
				; OPT-LABEL: @memcpy_local_align1_local_align1_variable(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(3)* [[SRC:%.*]] to <2 x i32> addrspace(3)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(3)* [[DST:%.*]] to <2 x i32> addrspace(3)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i32 [[N:%.*]], 8
				; OPT-NEXT: [[TMP4:%.*]] = urem i32 [[N]], 8
				; OPT-NEXT: [[TMP5:%.*]] = sub i32 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(3)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP8:%.*]] = load <2 x i32>, <2 x i32> addrspace(3)* [[TMP7]], align 1
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(3)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <2 x i32> [[TMP8]], <2 x i32> addrspace(3)* [[TMP9]], align 1
				; OPT-NEXT: [[TMP10]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP3]]
				; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <2 x i32> addrspace(3)* [[TMP1]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <2 x i32> addrspace(3)* [[TMP2]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP14:%.*]] = add i32 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP12]], i32 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(3)* [[TMP15]], align 1
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP13]], i32 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(3)* [[TMP17]], align 1
				; OPT-NEXT: [[TMP18]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i32 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i32 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p3i8.p3i8.i32(i8 addrspace(3)* align 1 %dst, i8 addrspace(3)* align 1 %src, i32 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_local_align4_global_align4_variable(i8 addrspace(3)* %dst, i8 addrspace(1)* %src, i32 %n) #0 {
				; OPT-LABEL: @memcpy_local_align4_global_align4_variable(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(3)* [[DST:%.*]] to <4 x i32> addrspace(3)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i32 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i32 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i32 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP7]], align 4
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(3)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(3)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP10]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP3]]
				; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP1]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(3)* [[TMP2]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP14:%.*]] = add i32 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP12]], i32 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(1)* [[TMP15]], align 4
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP13]], i32 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(3)* [[TMP17]], align 4
				; OPT-NEXT: [[TMP18]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i32 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i32 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %dst, i8 addrspace(1)* align 4 %src, i32 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_local_align4_variable(i8 addrspace(1)* %dst, i8 addrspace(3)* %src, i32 %n) #0 {
				; OPT-LABEL: @memcpy_global_align4_local_align4_variable(
				; OPT-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(3)* [[SRC:%.*]] to <4 x i32> addrspace(3)*
				; OPT-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; OPT-NEXT: [[TMP3:%.*]] = udiv i32 [[N:%.*]], 16
				; OPT-NEXT: [[TMP4:%.*]] = urem i32 [[N]], 16
				; OPT-NEXT: [[TMP5:%.*]] = sub i32 [[N]], [[TMP4]]
				; OPT-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP3]], 0
				; OPT-NEXT: br i1 [[TMP6]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
				; OPT: loop-memcpy-expansion:
				; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP10:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
				; OPT-NEXT: [[TMP7:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(3)* [[TMP1]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: [[TMP8:%.*]] = load <4 x i32>, <4 x i32> addrspace(3)* [[TMP7]], align 4
				; OPT-NEXT: [[TMP9:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i32 [[LOOP_INDEX]]
				; OPT-NEXT: store <4 x i32> [[TMP8]], <4 x i32> addrspace(1)* [[TMP9]], align 4
				; OPT-NEXT: [[TMP10]] = add i32 [[LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP11:%.*]] = icmp ult i32 [[TMP10]], [[TMP3]]
				; OPT-NEXT: br i1 [[TMP11]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
				; OPT: loop-memcpy-residual:
				; OPT-NEXT: [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP18:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
				; OPT-NEXT: [[TMP12:%.*]] = bitcast <4 x i32> addrspace(3)* [[TMP1]] to i8 addrspace(3)*
				; OPT-NEXT: [[TMP13:%.*]] = bitcast <4 x i32> addrspace(1)* [[TMP2]] to i8 addrspace(1)*
				; OPT-NEXT: [[TMP14:%.*]] = add i32 [[TMP5]], [[RESIDUAL_LOOP_INDEX]]
				; OPT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, i8 addrspace(3)* [[TMP12]], i32 [[TMP14]]
				; OPT-NEXT: [[TMP16:%.*]] = load i8, i8 addrspace(3)* [[TMP15]], align 4
				; OPT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[TMP13]], i32 [[TMP14]]
				; OPT-NEXT: store i8 [[TMP16]], i8 addrspace(1)* [[TMP17]], align 4
				; OPT-NEXT: [[TMP18]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
				; OPT-NEXT: [[TMP19:%.*]] = icmp ult i32 [[TMP18]], [[TMP4]]
				; OPT-NEXT: br i1 [[TMP19]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
				; OPT: post-loop-memcpy-expansion:
				; OPT-NEXT: ret void
				; OPT: loop-memcpy-residual-header:
				; OPT-NEXT: [[TMP20:%.*]] = icmp ne i32 [[TMP4]], 0
				; OPT-NEXT: br i1 [[TMP20]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
				;
				call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %dst, i8 addrspace(3)* align 4 %src, i32 %n, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_16(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_16(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 16, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_16(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <4 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <4 x i32> addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32> addrspace(1)* [[TMP3]], align 4
				; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store <4 x i32> [[TMP4]], <4 x i32> addrspace(1)* [[TMP5]], align 4
				; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 1
				; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 16, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_12(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_12(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 12, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_12(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <2 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <2 x i32> addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32> addrspace(1)* [[TMP3]], align 4
				; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store <2 x i32> [[TMP4]], <2 x i32> addrspace(1)* [[TMP5]], align 4
				; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 1
				; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: [[TMP8:%.*]] = bitcast <2 x i32> addrspace(1)* [[TMP1]] to i32 addrspace(1)*
				; ALL-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP8]], i64 2
				; ALL-NEXT: [[TMP10:%.*]] = load i32, i32 addrspace(1)* [[TMP9]], align 4
				; ALL-NEXT: [[TMP11:%.*]] = bitcast <2 x i32> addrspace(1)* [[TMP2]] to i32 addrspace(1)*
				; ALL-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP11]], i64 2
				; ALL-NEXT: store i32 [[TMP10]], i32 addrspace(1)* [[TMP12]], align 4
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 12, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_8(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_8(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 8, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_8(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <2 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <2 x i32> addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32> addrspace(1)* [[TMP3]], align 4
				; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store <2 x i32> [[TMP4]], <2 x i32> addrspace(1)* [[TMP5]], align 4
				; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 1
				; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 8, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_10(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_10(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 10, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_10(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to <2 x i32> addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to <2 x i32> addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32> addrspace(1)* [[TMP3]], align 4
				; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store <2 x i32> [[TMP4]], <2 x i32> addrspace(1)* [[TMP5]], align 4
				; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 1
				; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: [[TMP8:%.*]] = bitcast <2 x i32> addrspace(1)* [[TMP1]] to i16 addrspace(1)*
				; ALL-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP8]], i64 4
				; ALL-NEXT: [[TMP10:%.*]] = load i16, i16 addrspace(1)* [[TMP9]], align 4
				; ALL-NEXT: [[TMP11:%.*]] = bitcast <2 x i32> addrspace(1)* [[TMP2]] to i16 addrspace(1)*
				; ALL-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP11]], i64 4
				; ALL-NEXT: store i16 [[TMP10]], i16 addrspace(1)* [[TMP12]], align 4
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 10, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_4(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_4(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 4, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_4(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to i32 addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to i32 addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP4:%.*]] = load i32, i32 addrspace(1)* [[TMP3]], align 4
				; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32 addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store i32 [[TMP4]], i32 addrspace(1)* [[TMP5]], align 4
				; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 1
				; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 4, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_2(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_2(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 2, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_2(
				; ALL-NEXT: [[TMP1:%.*]] = bitcast i8 addrspace(1)* [[SRC:%.*]] to i16 addrspace(1)*
				; ALL-NEXT: [[TMP2:%.*]] = bitcast i8 addrspace(1)* [[DST:%.*]] to i16 addrspace(1)*
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP1]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP4:%.*]] = load i16, i16 addrspace(1)* [[TMP3]], align 2
				; ALL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, i16 addrspace(1)* [[TMP2]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store i16 [[TMP4]], i16 addrspace(1)* [[TMP5]], align 2
				; ALL-NEXT: [[TMP6]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 1
				; ALL-NEXT: br i1 [[TMP7]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 2, i1 false)
				ret void
				}

				define amdgpu_kernel void @memcpy_global_align4_global_align4_1(i8 addrspace(1)* %dst, i8 addrspace(1)* %src) #0 {
				; MAX1024-LABEL: @memcpy_global_align4_global_align4_1(
				; MAX1024-NEXT: call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 [[DST:%.*]], i8 addrspace(1)* align 4 [[SRC:%.*]], i64 1, i1 false)
				; MAX1024-NEXT: ret void
				;
				; ALL-LABEL: @memcpy_global_align4_global_align4_1(
				; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
				; ALL: load-store-loop:
				; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
				; ALL-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[SRC:%.*]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: [[TMP2:%.*]] = load i8, i8 addrspace(1)* [[TMP1]], align 1
				; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8 addrspace(1)* [[DST:%.*]], i64 [[LOOP_INDEX]]
				; ALL-NEXT: store i8 [[TMP2]], i8 addrspace(1)* [[TMP3]], align 1
				; ALL-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
				; ALL-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1
				; ALL-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
				; ALL: memcpy-split:
				; ALL-NEXT: ret void
				;
				call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* align 4 %dst, i8 addrspace(1)* align 4 %src, i64 1, i1 false)
				ret void
				}

				attributes #0 = { nounwind }
				attributes #1 = { argmemonly nounwind }

Revision Contents

Diff 252349

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll

AMDGPU: Implement getMemcpyLoopLoweringType
AbandonedPublic