This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Patch to improve memcpy inlined assembly sequence.
Needs ReviewPublic

Authored by rs on Nov 1 2016, 7:11 AM.

Details

Summary

memcpy calls copying <= 64 bytes (see ARMSubtarget::getMaxInlineSizeThreshold) can be inlined into a sequence of loads/stores; if the number of bytes is greater than 64, the memcpy library function is called instead.

For a copy where the number of bytes is a multiple of 4, the memcpy inlining code can load full words from the source using the ldr instruction (or multiple words using ldm) and then store them to the destination using str (or stm for multiple words). In most cases the optimal sequence is the one that loads and stores multiple words at a time, as a single ldm/stm is faster than a series of single-word loads and stores.

When the number of bytes to copy isn't a multiple of 4, memcpy inlining ends up using ldrb/strb if the remainder is 1, ldrh/strh if the remainder is 2, and ldrb/strb plus ldrh/strh if it's 3. If the number of bytes had been a multiple of 4, the backend could have folded the tail load/store into a preceding ldm/stm or emitted a single ldr/str. When the remainder is 1 or 2 bytes, the ldr/str takes the same time as the corresponding ldrb/strb or ldrh/strh; but when the remainder is 3 bytes and the backend would otherwise generate ldrb/strb/ldrh/strh, a single ldr/str is much better. Also, even when the remainder is 1 or 2 bytes and multiple words are being copied, the backend may already be able to fold the loads/stores into a multi-word ldm/stm.

This patch tries to implement this simple optimization by padding the destination and source and rounding the number of bytes to copy up to a multiple of 4 bytes (one word) when doing a memcpy operation.

The patch implements a pass that looks for the memcpy intrinsic and uses the simple heuristic below to decide whether to pad the destination/source or not:

  1. Is the destination a stack-allocated constant array?
  2. Is the source a constant?
  3. Is the number of bytes to copy a constant?
  4. Is destination array size == constant source size == number of bytes?

If the answer to all of those questions is yes, the pass pads the destination and source and increases the number of bytes to copy in the memcpy operation accordingly.

The pass is implemented as a midend IR-level pass but is only added when the target is a 32-bit ARM core. The reason it's implemented as a midend pass rather than a backend pass (or in ARMTargetLowering::getOptimalMemOpType or ARMSelectionDAGInfo::EmitTargetCodeForMemcpy) is that in previous attempts I found it wasn't possible to pad the source/destination, as the IR objects at that level are immutable. It might be possible to pad the SelectionDAG nodes, but I would have to implement some analysis to make sure it's safe to do so. Perhaps adding a midend analysis pass that propagates information to the backend about which memcpy's are safe to pad would work? If you want to see my previous attempt at doing it in ARMSelectionDAGInfo::EmitTargetCodeForMemcpy then let me know. Ideally I would have liked to do this optimization close to where the memcpy is inlined.

Diff Detail

Event Timeline

rs updated this revision to Diff 76552.Nov 1 2016, 7:11 AM
rs retitled this revision from to [ARM] Patch to improve memcpy lined assembly sequence..
rs updated this object.
rs added reviewers: rengolin, t.p.northover.
rs added a subscriber: llvm-commits.
rs retitled this revision from [ARM] Patch to improve memcpy lined assembly sequence. to [ARM] Patch to improve memcpy inlined assembly sequence..Nov 1 2016, 7:12 AM
t.p.northover edited edge metadata.Nov 15 2016, 9:55 AM

This seems like a ridiculously specific optimization, even in its outline form. This is compounded by even more assumptions the actual implementation makes (that there will be GEPs for example).

The implementation is also really rather broken, but before we even get into details like that I think we need to sort out what benefit we'd get from a correct version. Basically, the high-level intent seems to be to optimize

char arr[] = "whatever";

occurring in function scope. Is that really common enough to be worth writing an entire pass for?

rs added a comment.Nov 16 2016, 9:34 AM

Thanks for the review Tim, I really appreciate it.

There are a few occurrences of this pattern in the LNT test suite that can benefit from this type of optimization. I agree that this pass is very specific; I uploaded it as a way to start a discussion on how this could be implemented in a better way without using a pass. As I said, I've attempted this a few times in other parts of the backend but have been unsuccessful because the destination/source IR node objects are immutable, so I can't pad them. If possible, I would like the opinion of an expert such as yourself on how the padding of the destination/source could be done on the SelectionDAG IR nodes.

The IR is definitely the right place to do this... trying to do the sort of modifications required for this any later would be messy at best.

This needs to be generalized beyond handling just i8 arrays; this would probably trigger with some frequency on structs with small members.

This is probably interesting for other targets to some extent; other common architectures don't have LDM/STM, but they have larger registers which could benefit from a similar transformation (for example, on x86, SSE registers are used to lower memcpy.)

Granted, I'm also skeptical that this actually triggers frequently enough to be worth bothering; saying it only triggers a few times in the entirety of LNT isn't exactly encouraging.

rs added a comment.Nov 16 2016, 11:41 AM

Thanks for your review comments Eli.

The IR is definitely the right place to do this... trying to do the sort of modifications required for this any later would be messy at best.
This needs to be generalized beyond handling just i8 arrays; this would probably trigger with some frequency on structs with small members.

Do you have any suggestions for where this modification could be plugged in? Or do you think it's fine as a pass but needs to be generalised?

This is probably interesting for other targets to some extent; other common architectures don't have LDM/STM, but they have larger registers which could benefit from a similar transformation (for example, on x86, SSE registers are used to lower memcpy.)

ok

Granted, I'm also skeptical that this actually triggers frequently enough to be worth bothering; saying it only triggers a few times in the entirety of LNT isn't exactly encouraging.

After generalising it a bit more it might be able to optimise more examples in LNT.

It's fine as a pass. I mean, it would be nice to include as part of some existing pass, but I'm not sure where it would fit in.

rs added a comment.Nov 17 2016, 7:54 AM

It's fine as a pass. I mean, it would be nice to include as part of some existing pass, but I'm not sure where it would fit in.

A colleague of mine has mentioned GlobalOpt as a potentially good place to plug this optimization in; I'm currently looking to see if it can fit nicely in there.

john.brawn resigned from this revision.May 12 2020, 6:39 AM