This is an archive of the discontinued LLVM Phabricator instance.

[Mem2Reg] Also handle memcpy
Needs Revision · Public

Authored by loladiro on Sep 15 2017, 3:39 PM.

Details

Reviewers

chandlerc
• dberlin

Summary

In Julia, when we know we're moving data between two memory locations,
we always emit that as a memcpy rather than a load/store pair. However,
this can give worse optimization results in certain cases because some
optimizations that can handle load/store pairs cannot handle memcpys.
Mem2reg is one of these optimizations. This patch adds rudimentary
support to mem2reg for recognizing memcpys that cover the whole alloca
we're promoting. While several more sophisticated passes (SROA, GVN)
can get similar optimizations, it is preferable to have these kinds
of cases caught early to expose optimization opportunities before
getting to these later passes. The approach taken here is to split
the memcpy into a load/store pair early (after legality analysis)
and run the rest of the analysis only on loads/stores. It would
of course be possible to leave the memcpy as is and generate the
leftover load or store only on demand. However, that would entail
a significantly larger patch for unclear benefit.
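
To make the approach in the summary concrete, here is a minimal, self-contained C++ sketch of the legality check and the split. It is a toy model under stated assumptions, not the patch itself: `Slot`, `Inst`, `isSplittable`, and `splitMemCpys` are hypothetical names standing in for an alloca, an instruction, the legality check, and the rewrite, and no LLVM API is used.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an alloca: a named stack slot of known size.
struct Slot {
  std::string name;
  uint64_t sizeInBytes; // models the alloc size of the allocated type
  bool isArray;         // models an array allocation
};

enum class Op { Load, Store, MemCpy };

// Hypothetical stand-in for an instruction.
struct Inst {
  Op op;
  std::string dst; // Load: temporary written; Store/MemCpy: slot written
  std::string src; // Load/MemCpy: slot read; Store: value stored
  uint64_t length; // MemCpy only, in bytes
  bool isVolatile;
};

// Legality: the slot is not an array allocation, the copy is not volatile,
// and the copy covers exactly the whole slot.
bool isSplittable(const Inst &copy, const Slot &slot) {
  return copy.op == Op::MemCpy && !slot.isArray && !copy.isVolatile &&
         copy.length == slot.sizeInBytes;
}

// Replace every splittable memcpy that reads or writes `slot` with a load
// from the source followed by a store to the destination.
std::vector<Inst> splitMemCpys(const std::vector<Inst> &body,
                               const Slot &slot) {
  std::vector<Inst> out;
  int nextTmp = 0;
  for (const Inst &I : body) {
    bool touches =
        I.op == Op::MemCpy && (I.dst == slot.name || I.src == slot.name);
    if (touches && isSplittable(I, slot)) {
      std::string tmp = "%tmp" + std::to_string(nextTmp++);
      out.push_back({Op::Load, tmp, I.src, 0, false});
      out.push_back({Op::Store, I.dst, tmp, 0, false});
    } else {
      out.push_back(I); // anything else is left untouched
    }
  }
  return out;
}
```

After the split, the body contains only loads and stores on the slot, which is the shape the rest of a load/store-only analysis expects. A copy shorter than the slot is left alone in this model, mirroring the summary's restriction to memcpys that cover the whole alloca.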

Diff Detail

Build Status

Buildable 10344
Build 10344: arc lint + arc unit

Event Timeline

loladiro created this revision. Sep 15 2017, 3:39 PM

Herald added a reviewer: • dberlin. Sep 15 2017, 3:39 PM

Harbormaster completed remote builds in B10321: Diff 115514. Sep 15 2017, 3:39 PM

Fix a small bug and also look through a single level of bitcasts.
Since IRBuilder automatically inserts bitcasts to i8*, it seems
prudent to handle that case.
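
The bitcast handling described in this update can be pictured with a small standalone C++ model. This is only a sketch under assumptions: `Kind`, `CastUser`, `Use`, `castUsersOk`, and `isPromotable` are hypothetical names, and the `splittable` flag stands in for the result of the memcpy legality check. The model allows exactly one level of cast between the slot and its lifetime markers or memcpys.

```cpp
#include <cassert>
#include <vector>

enum class Kind { Load, Store, Lifetime, MemCpy, BitCast, Other };

// A user of a cast of the slot (hypothetical model type).
struct CastUser {
  Kind kind;
  bool splittable; // meaningful for MemCpy only
};

// A direct user of the slot (hypothetical model type).
struct Use {
  Kind kind;
  bool splittable;                 // meaningful for MemCpy only
  std::vector<CastUser> castUsers; // meaningful for BitCast only
};

// A cast is acceptable only if every one of its users is a lifetime marker
// or a splittable memcpy. A cast of a cast, or a load through the cast,
// is rejected, so only a single level is looked through.
bool castUsersOk(const std::vector<CastUser> &users) {
  for (const CastUser &U : users) {
    if (U.kind == Kind::Lifetime)
      continue;
    if (U.kind == Kind::MemCpy && U.splittable)
      continue;
    return false;
  }
  return true;
}

bool isPromotable(const std::vector<Use> &uses) {
  for (const Use &U : uses) {
    switch (U.kind) {
    case Kind::Load:
    case Kind::Store:
    case Kind::Lifetime:
      break;
    case Kind::MemCpy:
      if (!U.splittable)
        return false;
      break;
    case Kind::BitCast:
      if (!castUsersOk(U.castUsers))
        return false;
      break;
    default:
      return false;
    }
  }
  return true;
}
```

Keeping the cast users in a separate, non-recursive type is a deliberate choice in this sketch: it makes the "single level" restriction part of the data model rather than a depth counter.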

Harbormaster completed remote builds in B10344: Diff 115591. Sep 17 2017, 7:30 PM

clang's standard pass pipeline never uses the mem2reg pass; it uses SROA instead. What pass pipeline are you using where this matters?

Hi @efriedma,

this is the julia pass pipeline (https://github.com/JuliaLang/julia/blob/master/src/jitlayers.cpp#L148). IIRC the original list of passes came from VMKit,
but the pass list was adjusted as needed over the years.

In D37939#875329, @loladiro wrote:

Hi @efriedma,

this is the julia pass pipeline (https://github.com/JuliaLang/julia/blob/master/src/jitlayers.cpp#L148). IIRC the original list of passes came from VMKit,
but the pass list was adjusted as needed over the years.

I would still suggest just switching to SROA. You can (and should) run it quite early in the pipeline, but that seems much more likely to be a good long term solution.

I would look at the current early pass pipeline in LLVM for ideas about an effective sequencing here.

(marking as needing changes to clear dashboard)

This revision now requires changes to proceed. Sep 19 2017, 6:27 PM

Revision Contents

Path                                                Size
lib/Transforms/Utils/PromoteMemoryToRegister.cpp    172 lines
test/Transforms/Mem2Reg/memcpy.ll                   101 lines

Diff 115591

lib/Transforms/Utils/PromoteMemoryToRegister.cpp

[43 lines elided]

#define DEBUG_TYPE "mem2reg"		#define DEBUG_TYPE "mem2reg"

STATISTIC(NumLocalPromoted, "Number of alloca's promoted within one block");		STATISTIC(NumLocalPromoted, "Number of alloca's promoted within one block");
STATISTIC(NumSingleStore, "Number of alloca's promoted with a single store");		STATISTIC(NumSingleStore, "Number of alloca's promoted with a single store");
STATISTIC(NumDeadAlloca, "Number of dead alloca's removed");		STATISTIC(NumDeadAlloca, "Number of dead alloca's removed");
STATISTIC(NumPHIInsert, "Number of PHI nodes inserted");		STATISTIC(NumPHIInsert, "Number of PHI nodes inserted");

		static bool isSplittableMemCpy(const MemCpyInst *MCI, const AllocaInst *AI) {
		// Punt if this alloca is an array allocation
		if (AI->isArrayAllocation())
		return false;
		if (MCI->isVolatile())
		return false;
		Value *Length = MCI->getLength();
		if (!isa<ConstantInt>(Length))
		return false;
		// Anything less than the full alloca, we leave for SROA
		const DataLayout &DL = AI->getModule()->getDataLayout();
		size_t AIElSize = DL.getTypeAllocSize(AI->getAllocatedType());
		if (cast<ConstantInt>(Length)->getZExtValue() != AIElSize)
		return false;
		// If the other argument is also an alloca, we need to be sure that either
		// the types are bitcastable, or the other alloca is not eligible for
		// promotion (e.g. because the memcpy is for less than the whole size of
		// that alloca), otherwise we risk turning an allocatable alloca into a
		// non-allocatable one when splitting the memcpy.
		AllocaInst *OtherAI = dyn_cast<AllocaInst>(
		AI == MCI->getSource() ? MCI->getDest() : MCI->getSource());
		if (OtherAI) {
		if (!CastInst::isBitCastable(AI->getAllocatedType(),
		OtherAI->getAllocatedType()) &&
		DL.getTypeAllocSize(OtherAI->getAllocatedType()) == AIElSize)
		return false;
		}
		return true;
		}

		/// Look at the result of a bitcast and see if it's only used by lifetime
		/// intrinsics or splittable memcpys. This is needed, because IRBuilder
		/// will always insert a bitcast to i8* for these intrinsics.
		static bool onlyHasCanonicalizableUsers(const AllocaInst *AI, const Value *V) {
		for (const User *U : V->users()) {
		const IntrinsicInst *II = dyn_cast<IntrinsicInst>(U);
		if (!II)
		return false;

		if (isa<MemCpyInst>(II)) {
		if (!isSplittableMemCpy(cast<MemCpyInst>(II), AI))
		return false;
		continue;
		}

		if (II->getIntrinsicID() != Intrinsic::lifetime_start &&
		II->getIntrinsicID() != Intrinsic::lifetime_end)
		return false;
		}
		return true;
		}

bool llvm::isAllocaPromotable(const AllocaInst *AI) {		bool llvm::isAllocaPromotable(const AllocaInst *AI) {
// FIXME: If the memory unit is of pointer or integer type, we can permit		// FIXME: If the memory unit is of pointer or integer type, we can permit
// assignments to subsections of the memory unit.		// assignments to subsections of the memory unit.
unsigned AS = AI->getType()->getAddressSpace();		unsigned AS = AI->getType()->getAddressSpace();

// Only allow direct and non-volatile loads and stores...		// Only allow direct and non-volatile loads and stores...
for (const User *U : AI->users()) {		for (const User *U : AI->users()) {
if (const LoadInst *LI = dyn_cast<LoadInst>(U)) {		if (const LoadInst *LI = dyn_cast<LoadInst>(U)) {
// Note that atomic loads can be transformed; atomic semantics do		// Note that atomic loads can be transformed; atomic semantics do
// not have any meaning for a local alloca.		// not have any meaning for a local alloca.
if (LI->isVolatile())		if (LI->isVolatile())
return false;		return false;
} else if (const StoreInst *SI = dyn_cast<StoreInst>(U)) {		} else if (const StoreInst *SI = dyn_cast<StoreInst>(U)) {
if (SI->getOperand(0) == AI)		if (SI->getOperand(0) == AI)
return false; // Don't allow a store OF the AI, only INTO the AI.		return false; // Don't allow a store OF the AI, only INTO the AI.
// Note that atomic stores can be transformed; atomic semantics do		// Note that atomic stores can be transformed; atomic semantics do
// not have any meaning for a local alloca.		// not have any meaning for a local alloca.
if (SI->isVolatile())		if (SI->isVolatile())
return false;		return false;
		} else if (const MemCpyInst *MCI = dyn_cast<MemCpyInst>(U)) {
		if (!isSplittableMemCpy(MCI, AI))
		return false;
} else if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(U)) {		} else if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(U)) {
if (II->getIntrinsicID() != Intrinsic::lifetime_start &&		if (II->getIntrinsicID() != Intrinsic::lifetime_start &&
II->getIntrinsicID() != Intrinsic::lifetime_end)		II->getIntrinsicID() != Intrinsic::lifetime_end)
return false;		return false;
} else if (const BitCastInst *BCI = dyn_cast<BitCastInst>(U)) {		} else if (const BitCastInst *BCI = dyn_cast<BitCastInst>(U)) {
if (BCI->getType() != Type::getInt8PtrTy(U->getContext(), AS))		if (BCI->getType() != Type::getInt8PtrTy(U->getContext(), AS))
return false;		return false;
if (!onlyUsedByLifetimeMarkers(BCI))		if (!onlyHasCanonicalizableUsers(AI, BCI))
return false;		return false;
} else if (const GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(U)) {		} else if (const GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(U)) {
if (GEPI->getType() != Type::getInt8PtrTy(U->getContext(), AS))		if (GEPI->getType() != Type::getInt8PtrTy(U->getContext(), AS))
return false;		return false;
if (!GEPI->hasAllZeroIndices())		if (!GEPI->hasAllZeroIndices())
return false;		return false;
if (!onlyUsedByLifetimeMarkers(GEPI))		if (!onlyUsedByLifetimeMarkers(GEPI))
return false;		return false;
} else {		} else {
return false;		return false;
}		}
}		}

return true;		return true;
}		}

namespace {		namespace {

struct AllocaInfo {		struct AllocaInfo {
SmallVector<BasicBlock *, 32> DefiningBlocks;		SmallVector<BasicBlock *, 32> DefiningBlocks;
SmallVector<BasicBlock *, 32> UsingBlocks;		SmallVector<BasicBlock *, 32> UsingBlocks;

		// This gets updated with stores we find as we get along. Our use of
		// a vector for DefiningBlocks has the side effect of counting the number
		// of stores, so if DefiningBlocks.size() == 1, there is only one store
		// and we can quickly find it here.
StoreInst *OnlyStore;		StoreInst *OnlyStore;
BasicBlock *OnlyBlock;		BasicBlock *OnlyBlock;
bool OnlyUsedInOneBlock;		bool OnlyUsedInOneBlock;

Value *AllocaPointerVal;
DbgDeclareInst *DbgDeclare;		DbgDeclareInst *DbgDeclare;

void clear() {		void clear() {
DefiningBlocks.clear();		DefiningBlocks.clear();
UsingBlocks.clear();		UsingBlocks.clear();
OnlyStore = nullptr;		OnlyStore = nullptr;
OnlyBlock = nullptr;		OnlyBlock = nullptr;
OnlyUsedInOneBlock = true;		OnlyUsedInOneBlock = true;
AllocaPointerVal = nullptr;
DbgDeclare = nullptr;		DbgDeclare = nullptr;
}		}

/// Scan the uses of the specified alloca, filling in the AllocaInfo used		/// Scan the uses of the specified alloca, filling in the AllocaInfo used
/// by the rest of the pass to reason about the uses of this alloca.		/// by the rest of the pass to reason about the uses of this alloca.
void AnalyzeAlloca(AllocaInst *AI) {		void AnalyzeAlloca(AllocaInst *AI) {
clear();		clear();

// As we scan the uses of the alloca instruction, keep track of stores,		// As we scan the uses of the alloca instruction, keep track of stores,
// and decide whether all of the loads and stores to the alloca are within		// and decide whether all of the loads and stores to the alloca are within
// the same basic block.		// the same basic block.
for (auto UI = AI->user_begin(), E = AI->user_end(); UI != E;) {		for (auto UI = AI->user_begin(), E = AI->user_end(); UI != E;) {
Instruction *User = cast<Instruction>(*UI++);		Instruction *User = cast<Instruction>(*UI++);

if (StoreInst *SI = dyn_cast<StoreInst>(User)) {		if (StoreInst *SI = dyn_cast<StoreInst>(User)) {
// Remember the basic blocks which define new values for the alloca		// Remember the basic blocks which define new values for the alloca
DefiningBlocks.push_back(SI->getParent());		DefiningBlocks.push_back(SI->getParent());
AllocaPointerVal = SI->getOperand(0);
OnlyStore = SI;		OnlyStore = SI;
} else {		} else {
LoadInst *LI = cast<LoadInst>(User);		LoadInst *LI = cast<LoadInst>(User);
// Otherwise it must be a load instruction, keep track of variable		// Keep track of variable reads.
// reads.
UsingBlocks.push_back(LI->getParent());		UsingBlocks.push_back(LI->getParent());
AllocaPointerVal = LI;
}		}

if (OnlyUsedInOneBlock) {		if (OnlyUsedInOneBlock) {
if (!OnlyBlock)		if (!OnlyBlock)
OnlyBlock = User->getParent();		OnlyBlock = User->getParent();
else if (OnlyBlock != User->getParent())		else if (OnlyBlock != User->getParent())
OnlyUsedInOneBlock = false;		OnlyUsedInOneBlock = false;
}		}
[27 lines elided]	class LargeBlockInfo {
/// The index starts out as the number of the instruction from the start of		/// The index starts out as the number of the instruction from the start of
/// the block.		/// the block.
DenseMap<const Instruction *, unsigned> InstNumbers;		DenseMap<const Instruction *, unsigned> InstNumbers;

public:		public:

/// This code only looks at accesses to allocas.		/// This code only looks at accesses to allocas.
static bool isInterestingInstruction(const Instruction *I) {		static bool isInterestingInstruction(const Instruction *I) {
		if (isa<MemCpyInst>(I)) {
		const MemCpyInst *MCI = cast<MemCpyInst>(I);
		return isa<AllocaInst>(MCI->getSource()) ||
		isa<AllocaInst>(MCI->getDest());
		} else {
return (isa<LoadInst>(I) && isa<AllocaInst>(I->getOperand(0))) ||		return (isa<LoadInst>(I) && isa<AllocaInst>(I->getOperand(0))) ||
(isa<StoreInst>(I) && isa<AllocaInst>(I->getOperand(1)));		(isa<StoreInst>(I) && isa<AllocaInst>(I->getOperand(1)));
}		}
		}

/// Get or calculate the index of the specified instruction.		/// Get or calculate the index of the specified instruction.
unsigned getInstructionIndex(const Instruction *I) {		unsigned getInstructionIndex(const Instruction *I) {
assert(isInterestingInstruction(I) &&		assert(isInterestingInstruction(I) &&
"Not a load/store to/from an alloca?");		"Not a load/store to/from an alloca?");

// If we already have this instruction number, return it.		// If we already have this instruction number, return it.
DenseMap<const Instruction *, unsigned>::iterator It = InstNumbers.find(I);		DenseMap<const Instruction *, unsigned>::iterator It = InstNumbers.find(I);
[9 lines elided]	for (const Instruction &BBI : *BB)
if (isInterestingInstruction(&BBI))		if (isInterestingInstruction(&BBI))
InstNumbers[&BBI] = InstNo++;		InstNumbers[&BBI] = InstNo++;
It = InstNumbers.find(I);		It = InstNumbers.find(I);

assert(It != InstNumbers.end() && "Didn't insert instruction?");		assert(It != InstNumbers.end() && "Didn't insert instruction?");
return It->second;		return It->second;
}		}

		// When we split a memcpy intrinsic, we need to update the numbering in this
		// struct. To make sure the relative ordering remains the same, we give both
		// the LI and the SI the number that the MCI used to have (if they are both
		// interesting). This means that they will have equal numbers, which usually
		// can't happen. However, since they can never reference the same alloca
		// (since memcpy operands may not overlap), this is fine, because we will
		// never compare instruction indices for instructions that operate on distinct
		// allocas.
		void splitMemCpy(MemCpyInst *MCI, LoadInst *LI, StoreInst *SI) {
		DenseMap<const Instruction *, unsigned>::iterator It =
		InstNumbers.find(MCI);
		if (It == InstNumbers.end())
		return;
		unsigned MemCpyNumber = It->second;
		InstNumbers[LI] = MemCpyNumber;
		InstNumbers[SI] = MemCpyNumber;
		deleteValue(MCI);
		}

void deleteValue(const Instruction *I) { InstNumbers.erase(I); }		void deleteValue(const Instruction *I) { InstNumbers.erase(I); }

void clear() { InstNumbers.clear(); }		void clear() { InstNumbers.clear(); }
};		};

struct PromoteMem2Reg {		struct PromoteMem2Reg {
/// The alloca instructions being promoted.		/// The alloca instructions being promoted.
std::vector<AllocaInst *> Allocas;		std::vector<AllocaInst *> Allocas;
[81 lines elided]	static void addAssumeNonNull(AssumptionCache *AC, LoadInst *LI) {
ICmpInst *LoadNotNull = new ICmpInst(ICmpInst::ICMP_NE, LI,		ICmpInst *LoadNotNull = new ICmpInst(ICmpInst::ICMP_NE, LI,
Constant::getNullValue(LI->getType()));		Constant::getNullValue(LI->getType()));
LoadNotNull->insertAfter(LI);		LoadNotNull->insertAfter(LI);
CallInst *CI = CallInst::Create(AssumeIntrinsic, {LoadNotNull});		CallInst *CI = CallInst::Create(AssumeIntrinsic, {LoadNotNull});
CI->insertAfter(LoadNotNull);		CI->insertAfter(LoadNotNull);
AC->registerAssumption(CI);		AC->registerAssumption(CI);
}		}

static void removeLifetimeIntrinsicUsers(AllocaInst *AI) {		/// Split a memcpy instruction into the corresponding load/store. It is a little
// Knowing that this alloca is promotable, we know that it's safe to kill all		/// more complicated than one might imagine, because we need to deal with the
// instructions except for load and store.		/// fact that the side of the copy we're not currently processing might also
		/// be a promotable alloca. We need to be careful to not break the promotable
		/// predicate for that other alloca (if any).
		static void doMemCpySplit(LargeBlockInfo &LBI, MemCpyInst *MCI,
		AllocaInst *AI) {
		AAMDNodes AA;
		MCI->getAAMetadata(AA);
		Value *MCISrc = MCI->getSource();
		Type *LoadType = AI->getAllocatedType();
		AllocaInst *SrcAI = dyn_cast<AllocaInst>(MCISrc);
		if (SrcAI && SrcAI->getType() != AI->getType()) {
		if (CastInst::isBitCastable(SrcAI->getAllocatedType(), LoadType))
		LoadType = SrcAI->getAllocatedType();
		}
		if (cast<PointerType>(MCISrc->getType())->getElementType() != LoadType)
		MCISrc = CastInst::Create(
		Instruction::BitCast, MCISrc,
		LoadType->getPointerTo(
		cast<PointerType>(MCISrc->getType())->getAddressSpace()),
		"", MCI);
		// This might add to the end of the use list, but that's fine. At worst,
		// we'd not visit the instructions we insert here, but we don't care
		// about them in this loop anyway.
		LoadInst *LI = new LoadInst(LoadType, MCISrc, "", MCI->isVolatile(),
		MCI->getAlignment(), MCI);
		Value *Val = LI;
		Value *MCIDest = MCI->getDest();
		AllocaInst *DestAI = dyn_cast<AllocaInst>(MCIDest);
		Type *DestElTy = DestAI ? DestAI->getAllocatedType() : AI->getAllocatedType();
		if (LI->getType() != DestElTy &&
		CastInst::isBitCastable(LI->getType(), DestElTy))
		Val = CastInst::Create(Instruction::BitCast, Val, DestElTy, "", MCI);
		if (cast<PointerType>(MCIDest->getType())->getElementType() != Val->getType())
		MCIDest = CastInst::Create(
		Instruction::BitCast, MCIDest,
		Val->getType()->getPointerTo(
		cast<PointerType>(MCIDest->getType())->getAddressSpace()),
		"", MCI);
		StoreInst *SI =
		new StoreInst(Val, MCIDest, MCI->isVolatile(), MCI->getAlignment(), MCI);
		LI->setAAMetadata(AA);
		SI->setAAMetadata(AA);
		LBI.splitMemCpy(MCI, LI, SI);
		MCI->eraseFromParent();
		}

		static void canonicalizeUsers(LargeBlockInfo &LBI, AllocaInst *AI) {
		// Knowing that this alloca is promotable, we know that it's safe to split
		// MTIs into load/store and to kill all other instructions except for
		// load and store.

for (auto UI = AI->user_begin(), UE = AI->user_end(); UI != UE;) {		for (auto UI = AI->user_begin(), UE = AI->user_end(); UI != UE;) {
Instruction *I = cast<Instruction>(*UI);		Instruction *I = cast<Instruction>(*UI);
++UI;		++UI;
if (isa<LoadInst>(I) || isa<StoreInst>(I))		if (isa<LoadInst>(I) || isa<StoreInst>(I))
continue;		continue;

		if (isa<MemCpyInst>(I)) {
		MemCpyInst *MCI = cast<MemCpyInst>(I);
		doMemCpySplit(LBI, MCI, AI);
		continue;
		}

if (!I->getType()->isVoidTy()) {		if (!I->getType()->isVoidTy()) {
// The only users of this bitcast/GEP instruction are lifetime intrinsics.		// The only users of this bitcast/GEP instruction are lifetime/memcpy
// Follow the use/def chain to erase them now instead of leaving it for		// intrinsics. Split memcpys and delete lifetime intrinsics.
// dead code elimination later.
for (auto UUI = I->user_begin(), UUE = I->user_end(); UUI != UUE;) {		for (auto UUI = I->user_begin(), UUE = I->user_end(); UUI != UUE;) {
Instruction *Inst = cast<Instruction>(*UUI);		Instruction *Inst = cast<Instruction>(*UUI);
++UUI;		++UUI;
		if (isa<MemCpyInst>(Inst)) {
		doMemCpySplit(LBI, cast<MemCpyInst>(Inst), AI);
		} else {
		// Must be a lifetime intrinsic
Inst->eraseFromParent();		Inst->eraseFromParent();
}		}
}		}
		}
I->eraseFromParent();		I->eraseFromParent();
}		}
}		}

/// \brief Rewrite as many loads as possible given a single store.		/// \brief Rewrite as many loads as possible given a single store.
///		///
/// When there is only a single store, we can use the domtree to trivially		/// When there is only a single store, we can use the domtree to trivially
/// replace all of the dominated loads with the stored value. Do so, and return		/// replace all of the dominated loads with the stored value. Do so, and return
[201 lines elided]	void PromoteMem2Reg::run() {

for (unsigned AllocaNum = 0; AllocaNum != Allocas.size(); ++AllocaNum) {		for (unsigned AllocaNum = 0; AllocaNum != Allocas.size(); ++AllocaNum) {
AllocaInst *AI = Allocas[AllocaNum];		AllocaInst *AI = Allocas[AllocaNum];

assert(isAllocaPromotable(AI) && "Cannot promote non-promotable alloca!");		assert(isAllocaPromotable(AI) && "Cannot promote non-promotable alloca!");
assert(AI->getParent()->getParent() == &F &&		assert(AI->getParent()->getParent() == &F &&
"All allocas should be in the same function, which is same as DF!");		"All allocas should be in the same function, which is same as DF!");

removeLifetimeIntrinsicUsers(AI);		canonicalizeUsers(LBI, AI);

if (AI->use_empty()) {		if (AI->use_empty()) {
// If there are no uses of the alloca, just delete it now.		// If there are no uses of the alloca, just delete it now.
AI->eraseFromParent();		AI->eraseFromParent();

// Remove the alloca from the Allocas list, since it has been processed		// Remove the alloca from the Allocas list, since it has been processed
RemoveFromAllocasList(AllocaNum);		RemoveFromAllocasList(AllocaNum);
++NumDeadAlloca;		++NumDeadAlloca;
[436 lines elided]
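
The numbering argument in the comment on `splitMemCpy` above can be modelled with a short standalone C++ sketch. It is an illustration under assumptions, not the patch's code: `BlockNumbering` and `splitCopy` are hypothetical names, and instructions are identified by strings instead of pointers. The load and store that replace a memcpy both inherit the memcpy's index, so their order relative to every other numbered instruction in the block is unchanged.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a per-block instruction numbering cache.
struct BlockNumbering {
  std::unordered_map<std::string, unsigned> index;

  // Give `load` and `store` the index that `copy` had, then forget `copy`.
  // Returns false (and changes nothing) if `copy` was never numbered.
  bool splitCopy(const std::string &copy, const std::string &load,
                 const std::string &store) {
    auto it = index.find(copy);
    if (it == index.end())
      return false;
    unsigned n = it->second; // read before inserting; insertion may rehash
    index[load] = n;
    index[store] = n;
    index.erase(copy);
    return true;
  }
};
```

The equal indices are harmless in this model for the reason the comment gives: the load and the store refer to different slots, and indices are only ever compared between accesses to the same slot.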

test/Transforms/Mem2Reg/memcpy.ll

This file was added.

				; RUN: opt < %s -mem2reg -S | FileCheck %s

				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

				declare void @llvm.memcpy.p0i128.p0i64.i32(i128 *, i64 *, i32, i32, i1)
				declare void @llvm.memcpy.p0i8.p0i8.i32(i8 *, i8 *, i32, i32, i1)
				declare void @llvm.memcpy.p0i64.p0i64.i32(i64 *, i64 *, i32, i32, i1)
				declare void @llvm.memcpy.p0f64.p0i64.i32(double *, i64 *, i32, i32, i1)

				define i128 @test_cpy_different(i64) {
				; CHECK-LABEL: @test_cpy_different
				; CHECK-NOT: alloca i64
				; CHECK: store i64 %0
				%a = alloca i64
				%b = alloca i128
				store i128 0, i128 *%b
				store i64 %0, i64 *%a
				call void @llvm.memcpy.p0i128.p0i64.i32(i128 *%b, i64 *%a, i32 8, i32 0, i1 0)
				%loaded = load i128, i128 *%b
				ret i128 %loaded
				}

				define i64 @test_cpy_same(i64) {
				; CHECK-LABEL: @test_cpy_same
				; CHECK-NOT: alloca
				; CHECK: ret i64 %0
				%a = alloca i64
				%b = alloca i64
				store i64 %0, i64 *%a
				call void @llvm.memcpy.p0i64.p0i64.i32(i64 *%b, i64 *%a, i32 8, i32 0, i1 0)
				%loaded = load i64, i64 *%b
				ret i64 %loaded
				}

				define double @test_cpy_different_type(i64) {
				; CHECK-LABEL: @test_cpy_different_type
				; CHECK-NOT: alloca
				; CHECK: bitcast i64 %0 to double
				%a = alloca i64
				%b = alloca double
				store i64 %0, i64 *%a
				call void @llvm.memcpy.p0f64.p0i64.i32(double *%b, i64 *%a, i32 8, i32 0, i1 0)
				%loaded = load double, double *%b
				ret double %loaded
				}

				define i128 @test_cpy_differenti8(i64) {
				; CHECK-LABEL: @test_cpy_differenti8
				; CHECK-NOT: alloca i64
				; CHECK: store i64 %0
				%a = alloca i64
				%b = alloca i128
				store i128 0, i128 *%b
				store i64 %0, i64 *%a
				%acast = bitcast i64* %a to i8*
				%bcast = bitcast i128* %b to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8 *%bcast, i8 *%acast, i32 8, i32 0, i1 0)
				%loaded = load i128, i128 *%b
				ret i128 %loaded
				}

				define i64 @test_cpy_samei8(i64) {
				; CHECK-LABEL: @test_cpy_samei8
				; CHECK-NOT: alloca
				; CHECK: ret i64 %0
				%a = alloca i64
				%b = alloca i64
				store i64 %0, i64 *%a
				%acast = bitcast i64* %a to i8*
				%bcast = bitcast i64* %b to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8 *%bcast, i8 *%acast, i32 8, i32 0, i1 0)
				%loaded = load i64, i64 *%b
				ret i64 %loaded
				}

				define double @test_cpy_different_typei8(i64) {
				; CHECK-LABEL: @test_cpy_different_typei8
				; CHECK-NOT: alloca
				; CHECK: bitcast i64 %0 to double
				%a = alloca i64
				%b = alloca double
				store i64 %0, i64 *%a
				%acast = bitcast i64* %a to i8*
				%bcast = bitcast double* %b to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8 *%bcast, i8 *%acast, i32 8, i32 0, i1 0)
				%loaded = load double, double *%b
				ret double %loaded
				}

				define i64 @test_cpy_differenti8_reverse(i128) {
				; CHECK-LABEL: @test_cpy_differenti8_reverse
				; CHECK-NOT: alloca i64
				%a = alloca i64
				%b = alloca i128
				store i128 %0, i128 *%b
				%acast = bitcast i64* %a to i8*
				%bcast = bitcast i128* %b to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8 *%acast, i8 *%bcast, i32 8, i32 0, i1 0)
				%loaded = load i64, i64 *%a
				ret i64 %loaded
				}