This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Modify codegen for memcpy intrinsic to prefer LDM/STM.
ClosedPublic

Authored by scott-0 on Sep 29 2015, 3:22 AM.

Details

Reviewers

scott-0
jmolloy
pcc

Commits

rG953f908173e3: [ARM] Modify codegen for memcpy intrinsic to prefer LDM/STM.

Summary

We were previously codegen'ing memcpy as regular load/store operations and
hoping that the register allocator would allocate registers in ascending order
so that we could apply an LDM/STM combine after register allocation. According
to the commit that first introduced this code (r37179), we planned to teach
the register allocator to allocate the registers in ascending order. This
never got implemented, and up to now we've been stuck with very poor codegen.

A much simpler approach for achieving better codegen is to create MEMCPY pseudo
instructions, attach scratch virtual registers to them and then, post register
allocation, expand the MEMCPYs into LDM/STM pairs using the scratch registers.
The register allocator will have picked arbitrary registers which we sort when
expanding the MEMCPY. This approach also avoids the need to repeatedly
calculate offsets which ultimately ought to be eliminated pre-RA in order to
decrease register pressure.
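The idea can be modelled outside LLVM. The sketch below is a simplified, hypothetical model, not LLVM code: `RegEncoding`, `sortScratchRegs` and `copyWords` are illustrative names, registers are reduced to their encoding numbers, and a 16-entry array stands in for the register file. It relies on the fact that LDM and STM transfer registers in ascending register-number order, so once the allocator's arbitrary picks are sorted, the load and the store agree on which word lives in which register.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a physical register: only its encoding matters.
using RegEncoding = unsigned;

// Sort the register allocator's arbitrary picks into ascending order,
// which is the order LDM/STM register lists use.
std::vector<RegEncoding> sortScratchRegs(std::vector<RegEncoding> Regs) {
  std::sort(Regs.begin(), Regs.end());
  return Regs;
}

// Model of one LDM/STM pair: load one word per scratch register from Src,
// then store the same registers to Dst. Assumes unique encodings below 16.
void copyWords(const std::vector<RegEncoding> &ScratchRegs,
               const std::uint32_t *Src, std::uint32_t *Dst) {
  std::uint32_t RegFile[16] = {};
  const std::vector<RegEncoding> Sorted = sortScratchRegs(ScratchRegs);
  for (RegEncoding R : Sorted)
    assert(R < 16 && "this model only has 16 registers");
  for (std::size_t I = 0; I < Sorted.size(); ++I)
    RegFile[Sorted[I]] = Src[I]; // LDM: lowest register gets lowest address
  for (std::size_t I = 0; I < Sorted.size(); ++I)
    Dst[I] = RegFile[Sorted[I]]; // STM: same order, so words land in place
}
```

Calling `copyWords({5, 1, 12, 3}, Src, Dst)` copies four words in their original order even though the register picks were unordered.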

Fixes PR9199 and PR23768.

[This is based on Peter Collingbourne's r238473 which was reverted. I'm happy to produce the diff from r238473 if that's helpful.]

Event Timeline

scott-0 updated this revision to Diff 35953. Sep 29 2015, 3:22 AM

scott-0 retitled this revision to [ARM] Modify codegen for memcpy intrinsic to prefer LDM/STM.

scott-0 updated this object.

scott-0 added a reviewer: pcc.

scott-0 added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. Sep 29 2015, 3:22 AM

Hi Scott,

In general I like this change. It's an elegant solution to the problem. I've got a bikeshed to paint regarding the name - I think MEMCPY might be more appropriate as that's exactly what it's doing.

I think there should be more documentation about the inputs and outputs of the node - it took me a bit to realise that the outputs are the updated base registers (which means you can chain them - neat!)

There also look to be few, if any, non-Thumb1 tests for this. This happens in all modes, so I'd expect the same amount of testing for each of ARM, T1 and T2.

The instprinter stuff is ugly. But I see why it's needed.

Cheers,

James
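The chaining property mentioned in the comment above (the outputs are the updated base registers) can be illustrated with a small model. This is a hypothetical sketch, not the node's implementation: `MemcpyResult`, `memcpyWords` and `copySevenWords` are made-up names, and plain pointers stand in for the base registers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the node's two results: the incremented addresses.
struct MemcpyResult {
  std::uint32_t *NewDst;
  const std::uint32_t *NewSrc;
};

// Copies NumRegs words and returns both base addresses advanced past them,
// mirroring "%newdst, %newsrc = MEMCPY %dst, %src, N".
MemcpyResult memcpyWords(std::uint32_t *Dst, const std::uint32_t *Src,
                         unsigned NumRegs) {
  assert(NumRegs > 0 && "a node copies at least one word in this model");
  for (unsigned I = 0; I != NumRegs; ++I)
    Dst[I] = Src[I];
  return {Dst + NumRegs, Src + NumRegs};
}

// Copies 7 words as a chain of a 3-word and a 4-word node: the second node
// simply consumes the results of the first, so no offsets are recomputed.
void copySevenWords(std::uint32_t *Dst, const std::uint32_t *Src) {
  MemcpyResult First = memcpyWords(Dst, Src, 3);
  memcpyWords(First.NewDst, First.NewSrc, 4);
}
```

Because each step hands its updated addresses to the next, a long copy needs no separate address arithmetic between steps.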

lib/Target/ARM/ARMISelLowering.h
190	Can we call this MEMCPY instead of MCOPY? It self-describes a bit better, I think.
lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
752 ↗	(On Diff #35953)	You should be able to construct the vector using the iterator construction syntax: std::vector<MCOperand> RegOps(MI->begin() + OpNum, MI->end());
754 ↗	(On Diff #35953)	Use std::stable_sort instead of std::sort, for determinism.
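As a minimal illustration of the iterator-range construction suggested for line 752, here is a standalone sketch. `Operand` and `registerListFrom` are hypothetical stand-ins (the real code works on `MCOperand`s inside the instruction printer); only the `std::vector` range constructor itself is standard.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an instruction operand.
struct Operand {
  unsigned Reg;
};

// Builds the register list from operand OpNum onwards in a single step,
// using std::vector's iterator-range constructor instead of a copy loop.
std::vector<Operand> registerListFrom(const std::vector<Operand> &Ops,
                                      std::size_t OpNum) {
  assert(OpNum <= Ops.size() && "first register operand out of range");
  return std::vector<Operand>(
      Ops.begin() + static_cast<std::ptrdiff_t>(OpNum), Ops.end());
}
```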

This revision now requires changes to proceed. Oct 1 2015, 5:09 AM

New version on its way.

lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
754 ↗	(On Diff #35953)	Ok, but the registers being sorted are unique.
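The point about uniqueness can be checked with a small standalone sketch. The names here (`Reg`, `idsAfterSort`, `idsAfterStableSort`) are hypothetical. `std::sort` and `std::stable_sort` can only differ in how they order elements that compare equal, so when every encoding is distinct both yield the same, fully determined order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical register record: Id is an arbitrary label, Encoding the sort key.
struct Reg {
  unsigned Id;
  unsigned Encoding;
};

inline bool byEncoding(const Reg &A, const Reg &B) {
  return A.Encoding < B.Encoding;
}

inline std::vector<unsigned> idsOf(const std::vector<Reg> &Regs) {
  std::vector<unsigned> Ids;
  for (const Reg &R : Regs)
    Ids.push_back(R.Id);
  return Ids;
}

std::vector<unsigned> idsAfterSort(std::vector<Reg> Regs) {
  std::sort(Regs.begin(), Regs.end(), byEncoding);
  assert(std::is_sorted(Regs.begin(), Regs.end(), byEncoding));
  return idsOf(Regs);
}

std::vector<unsigned> idsAfterStableSort(std::vector<Reg> Regs) {
  std::stable_sort(Regs.begin(), Regs.end(), byEncoding);
  assert(std::is_sorted(Regs.begin(), Regs.end(), byEncoding));
  return idsOf(Regs);
}
```

With duplicate encodings the two functions would be allowed to disagree, which is the case the reviewer's suggestion guards against.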

Thanks for the review.

Definitely a good idea on the whole.

lib/Target/ARM/ARMISelLowering.cpp
8097–8098	Why's the second MI needed? Does regalloc not bother to allocate <def,dead> operands or something?
lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
753–754 ↗	(On Diff #36262)	I think it'd be better to do the sorting when the instruction is expanded, it seems like a useful property to have if anyone wants to analyse an LDM/STM.

Thanks for the review. I'll put up a new version after the weekend.

lib/Target/ARM/ARMISelLowering.cpp
8097–8098	I tried it with <def,kill> and BuildMI refused; I'll investigate using <def,dead>.
lib/Target/ARM/InstPrinter/ARMInstPrinter.cpp
753–754 ↗	(On Diff #36262)	I'll investigate moving it.

Addressed comments.

Looks good to me now, with 1+epsilon nits:

lib/Target/ARM/ARMBaseInstrInfo.cpp
1252	This is idiomatically: AddDefaultPred(LDM.addOperand(MI->getOperand(3)));
1259–1263	I'm not entirely convinced this is clearer than for(unsigned I = 5; I < MI->getNumOperands(); ++I) ScratchRegs.push_back(MI->getOperand(I).getReg()); but that's just bike-shedding and probably personal biases, feel free to leave it if you disagree.
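For comparison, the two styles under discussion can be sketched side by side in standalone form. `Operand`, `regsViaTransform` and `regsViaLoop` are hypothetical names, and `std::vector` stands in for `SmallVector`. One detail worth noting: the `std::transform` form writes into a pre-sized container, whereas the `push_back` form must start from an empty one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a machine operand holding a register number.
struct Operand {
  unsigned Reg;
  unsigned getReg() const { return Reg; }
};

// Lambda style: the destination is sized up front and filled in place.
std::vector<unsigned> regsViaTransform(const std::vector<Operand> &Ops,
                                       std::size_t First) {
  assert(First <= Ops.size());
  std::vector<unsigned> Regs(Ops.size() - First);
  std::transform(Ops.begin() + static_cast<std::ptrdiff_t>(First), Ops.end(),
                 Regs.begin(),
                 [](const Operand &Op) { return Op.getReg(); });
  return Regs;
}

// Loop style: the destination starts empty and grows with push_back.
std::vector<unsigned> regsViaLoop(const std::vector<Operand> &Ops,
                                  std::size_t First) {
  assert(First <= Ops.size());
  std::vector<unsigned> Regs;
  for (std::size_t I = First; I < Ops.size(); ++I)
    Regs.push_back(Ops[I].getReg());
  return Regs;
}
```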

I think it'd be better to do the sorting when the instruction is expanded, it seems like a useful property to have if anyone wants to analyse an LDM/STM.

Interestingly, adding an assertion to require ascending order in ARMInstPrinter.cpp fails some other cases, e.g. stack_guard_remat.ll
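The kind of check described here could look roughly like the following standalone sketch. It is an assumption about the shape of such an assertion, not the code that was actually tried; `isStrictlyAscending` and `requireAscending` are hypothetical names and plain integers stand in for register encodings.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True if every encoding is strictly greater than the one before it.
bool isStrictlyAscending(const std::vector<unsigned> &Encodings) {
  return std::adjacent_find(Encodings.begin(), Encodings.end(),
                            [](unsigned A, unsigned B) { return A >= B; }) ==
         Encodings.end();
}

// Hypothetical hook a printer could call before emitting a register list.
void requireAscending(const std::vector<unsigned> &Encodings) {
  assert(isStrictlyAscending(Encodings) &&
         "register list is not in ascending order");
  (void)Encodings; // keeps release builds free of unused-parameter warnings
}
```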

Thanks for the review; I will do those final edits and commit.

lib/Target/ARM/ARMBaseInstrInfo.cpp
1259–1263	I agree; I was just a bit lambda-happy.

Committed as r249322

jmolloy accepted this revision. Oct 6 2015, 1:51 AM

jmolloy edited edge metadata.

This revision is now accepted and ready to land. Oct 6 2015, 1:51 AM

scott-0 closed this revision. Oct 6 2015, 1:51 AM

jevinskie added a subscriber: jevinskie. Nov 24 2015, 12:00 PM

Revision Contents

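Path	Size
lib/Target/ARM/ARMBaseInstrInfo.h	2 lines
lib/Target/ARM/ARMBaseInstrInfo.cpp	63 lines
lib/Target/ARM/ARMISelLowering.h	4 lines
lib/Target/ARM/ARMISelLowering.cpp	35 lines
lib/Target/ARM/ARMInstrInfo.td	21 lines
lib/Target/ARM/ARMInstrThumb.td	1 line
lib/Target/ARM/ARMSelectionDAGInfo.cpp	56 lines
lib/Target/ARM/Thumb2SizeReduction.cpp	19 lines
test/CodeGen/ARM/ldm-stm-base-materialization.ll	93 lines
test/CodeGen/ARM/load-store-flags.ll	4 lines
test/CodeGen/ARM/memcpy-ldm-stm.ll	94 lines
test/CodeGen/Thumb/ldm-stm-base-materialization-thumb2.ll	93 lines
test/CodeGen/Thumb/ldm-stm-base-materialization.ll	77 lines
test/CodeGen/Thumb/thumb-memcpy-ldm-stm.ll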

Diff 36514

lib/Target/ARM/ARMBaseInstrInfo.h

[...] context: private:

/// verifyInstruction - Perform target specific instruction verification.		/// verifyInstruction - Perform target specific instruction verification.
bool verifyInstruction(const MachineInstr *MI,		bool verifyInstruction(const MachineInstr *MI,
StringRef &ErrInfo) const override;		StringRef &ErrInfo) const override;

virtual void expandLoadStackGuard(MachineBasicBlock::iterator MI,		virtual void expandLoadStackGuard(MachineBasicBlock::iterator MI,
Reloc::Model RM) const = 0;		Reloc::Model RM) const = 0;

		void expandMEMCPY(MachineBasicBlock::iterator) const;

private:		private:
/// Modeling special VFP / NEON fp MLA / MLS hazards.		/// Modeling special VFP / NEON fp MLA / MLS hazards.

/// MLxEntryMap - Map fp MLA / MLS to the corresponding entry in the internal		/// MLxEntryMap - Map fp MLA / MLS to the corresponding entry in the internal
/// MLx table.		/// MLx table.
DenseMap<unsigned, unsigned> MLxEntryMap;		DenseMap<unsigned, unsigned> MLxEntryMap;

/// MLxHazardOpcodes - Set of add / sub and multiply opcodes that would cause		/// MLxHazardOpcodes - Set of add / sub and multiply opcodes that would cause
[...]

lib/Target/ARM/ARMBaseInstrInfo.cpp

[...]
	}			}

	unsigned ARMBaseInstrInfo::isLoadFromStackSlotPostFE(const MachineInstr *MI,			unsigned ARMBaseInstrInfo::isLoadFromStackSlotPostFE(const MachineInstr *MI,
	int &FrameIndex) const {			int &FrameIndex) const {
	const MachineMemOperand *Dummy;			const MachineMemOperand *Dummy;
	return MI->mayLoad() && hasLoadFromStackSlot(MI, Dummy, FrameIndex);			return MI->mayLoad() && hasLoadFromStackSlot(MI, Dummy, FrameIndex);
	}			}

				/// \brief Expands MEMCPY to either LDMIA/STMIA or LDMIA_UPD/STMIA_UPD
				/// depending on whether the result is used.
				void ARMBaseInstrInfo::expandMEMCPY(MachineBasicBlock::iterator MBBI) const {
				  bool isThumb1 = Subtarget.isThumb1Only();
				  bool isThumb2 = Subtarget.isThumb2();
				  const ARMBaseInstrInfo *TII = Subtarget.getInstrInfo();

				  MachineInstr *MI = MBBI;
				  DebugLoc dl = MI->getDebugLoc();
				  MachineBasicBlock *BB = MI->getParent();

				  MachineInstrBuilder LDM, STM;
				  if (isThumb1 || !MI->getOperand(1).isDead()) {
				    LDM = BuildMI(*BB, MI, dl, TII->get(isThumb2 ? ARM::t2LDMIA_UPD
				                                        : isThumb1 ? ARM::tLDMIA_UPD
				                                                   : ARM::LDMIA_UPD))
				              .addOperand(MI->getOperand(1));
				  } else {
				    LDM = BuildMI(*BB, MI, dl, TII->get(isThumb2 ? ARM::t2LDMIA : ARM::LDMIA));
				  }

				  if (isThumb1 || !MI->getOperand(0).isDead()) {
				    STM = BuildMI(*BB, MI, dl, TII->get(isThumb2 ? ARM::t2STMIA_UPD
				                                        : isThumb1 ? ARM::tSTMIA_UPD
				                                                   : ARM::STMIA_UPD))
				              .addOperand(MI->getOperand(0));
				  } else {
				    STM = BuildMI(*BB, MI, dl, TII->get(isThumb2 ? ARM::t2STMIA : ARM::STMIA));
				  }

				  LDM.addOperand(MI->getOperand(3)).addImm(ARMCC::AL).addReg(0);
				[Inline comment, t.p.northover] This is idiomatically: AddDefaultPred(LDM.addOperand(MI->getOperand(3)));
				  STM.addOperand(MI->getOperand(2)).addImm(ARMCC::AL).addReg(0);

				  // Sort the scratch registers into ascending order.
				  const TargetRegisterInfo &TRI = getRegisterInfo();
				  unsigned NumScratch = MI->getOperand(4).getImm();
				  llvm::SmallVector<unsigned, 6> ScratchRegs(NumScratch);
				  std::transform(MI->operands_begin() + 5, MI->operands_end(),
				                 ScratchRegs.begin(),
				                 [] (const MachineOperand &Op) {
				                   return Op.getReg();
				                 });
				[Inline comment, t.p.northover] I'm not entirely convinced this is clearer than for(unsigned I = 5; I < MI->getNumOperands(); ++I) ScratchRegs.push_back(MI->getOperand(I).getReg()); but that's just bike-shedding and probably personal biases, feel free to leave it if you disagree.
				[Inline reply, scott-0] I agree; I was just a bit lambda-happy.
				  std::sort(ScratchRegs.begin(), ScratchRegs.end(),
				            [&TRI](const unsigned &Reg1,
				                   const unsigned &Reg2) -> bool {
				              return TRI.getEncodingValue(Reg1) <
				                     TRI.getEncodingValue(Reg2);
				            });

				  for (const auto &Reg : ScratchRegs) {
				    LDM.addReg(Reg, RegState::Define);
				    STM.addReg(Reg, RegState::Kill);
				  }

				  BB->erase(MBBI);
				}


	bool			bool
	ARMBaseInstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {			ARMBaseInstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {
	MachineFunction &MF = *MI->getParent()->getParent();			MachineFunction &MF = *MI->getParent()->getParent();
	Reloc::Model RM = MF.getTarget().getRelocationModel();			Reloc::Model RM = MF.getTarget().getRelocationModel();

	if (MI->getOpcode() == TargetOpcode::LOAD_STACK_GUARD) {			if (MI->getOpcode() == TargetOpcode::LOAD_STACK_GUARD) {
	assert(getSubtarget().getTargetTriple().isOSBinFormatMachO() &&			assert(getSubtarget().getTargetTriple().isOSBinFormatMachO() &&
	"LOAD_STACK_GUARD currently supported only for MachO.");			"LOAD_STACK_GUARD currently supported only for MachO.");
	expandLoadStackGuard(MI, RM);			expandLoadStackGuard(MI, RM);
	MI->getParent()->erase(MI);			MI->getParent()->erase(MI);
	return true;			return true;
	}			}

				if (MI->getOpcode() == ARM::MEMCPY) {
				expandMEMCPY(MI);
				return true;
				}

	// This hook gets to expand COPY instructions before they become			// This hook gets to expand COPY instructions before they become
	// copyPhysReg() calls. Look for VMOVS instructions that can legally be			// copyPhysReg() calls. Look for VMOVS instructions that can legally be
	// widened to VMOVD. We prefer the VMOVD when possible because it may be			// widened to VMOVD. We prefer the VMOVD when possible because it may be
	// changed into a VORR that can go down the NEON pipeline.			// changed into a VORR that can go down the NEON pipeline.
	if (!WidenVMOVS || !MI->isCopy() || Subtarget.isCortexA15() ||			if (!WidenVMOVS || !MI->isCopy() || Subtarget.isCortexA15() ||
	Subtarget.isFPOnlySP())			Subtarget.isFPOnlySP())
	return false;			return false;

[...]
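The opcode choice made in expandMEMCPY above can be restated as a small standalone decision function. This is a hedged sketch: `Mode`, `Opcode` and `selectLoadOpcode` are hypothetical stand-ins for the subtarget queries and `ARM::` opcodes, and only the load side is shown (the store side follows the same pattern on operand 0).

```cpp
#include <cassert>

// Hypothetical stand-ins for the subtarget mode and the ARM:: opcodes.
enum class Mode { ARM, Thumb1, Thumb2 };
enum class Opcode { LDMIA, LDMIA_UPD, t2LDMIA, t2LDMIA_UPD, tLDMIA_UPD };

// Picks the writeback (_UPD) form when the updated base address is still
// needed, or unconditionally in Thumb1; otherwise the plain form.
Opcode selectLoadOpcode(Mode M, bool UpdatedBaseIsDead) {
  Opcode Result;
  if (M == Mode::Thumb1 || !UpdatedBaseIsDead) {
    Result = M == Mode::Thumb2   ? Opcode::t2LDMIA_UPD
             : M == Mode::Thumb1 ? Opcode::tLDMIA_UPD
                                 : Opcode::LDMIA_UPD;
  } else {
    Result = M == Mode::Thumb2 ? Opcode::t2LDMIA : Opcode::LDMIA;
  }
  // In this model Thumb1 always ends up with the writeback pseudo.
  assert(M != Mode::Thumb1 || Result == Opcode::tLDMIA_UPD);
  return Result;
}
```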

lib/Target/ARM/ARMISelLowering.h

[...] context: enum NodeType : unsigned {
// Vector OR with immediate		// Vector OR with immediate
VORRIMM,		VORRIMM,
// Vector AND with NOT of immediate		// Vector AND with NOT of immediate
VBICIMM,		VBICIMM,

// Vector bitwise select		// Vector bitwise select
VBSL,		VBSL,

		// Pseudo-instruction representing a memory copy using ldm/stm
		// instructions.
		MEMCPY,
		[Inline comment, jmolloy] Can we call this MEMCPY instead of MCOPY? It self-describes a bit better, I think.

// Vector load N-element structure to all lanes:		// Vector load N-element structure to all lanes:
VLD2DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,		VLD2DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
VLD3DUP,		VLD3DUP,
VLD4DUP,		VLD4DUP,

// NEON loads with post-increment base updates:		// NEON loads with post-increment base updates:
VLD1_UPD,		VLD1_UPD,
VLD2_UPD,		VLD2_UPD,
[...]

lib/Target/ARM/ARMISelLowering.cpp

[...] context: const char *ARMTargetLowering::getTargetNodeName(unsigned Opcode) const {
case ARMISD::VMULLu: return "ARMISD::VMULLu";		case ARMISD::VMULLu: return "ARMISD::VMULLu";
case ARMISD::UMLAL: return "ARMISD::UMLAL";		case ARMISD::UMLAL: return "ARMISD::UMLAL";
case ARMISD::SMLAL: return "ARMISD::SMLAL";		case ARMISD::SMLAL: return "ARMISD::SMLAL";
case ARMISD::BUILD_VECTOR: return "ARMISD::BUILD_VECTOR";		case ARMISD::BUILD_VECTOR: return "ARMISD::BUILD_VECTOR";
case ARMISD::BFI: return "ARMISD::BFI";		case ARMISD::BFI: return "ARMISD::BFI";
case ARMISD::VORRIMM: return "ARMISD::VORRIMM";		case ARMISD::VORRIMM: return "ARMISD::VORRIMM";
case ARMISD::VBICIMM: return "ARMISD::VBICIMM";		case ARMISD::VBICIMM: return "ARMISD::VBICIMM";
case ARMISD::VBSL: return "ARMISD::VBSL";		case ARMISD::VBSL: return "ARMISD::VBSL";
		case ARMISD::MEMCPY: return "ARMISD::MEMCPY";
case ARMISD::VLD2DUP: return "ARMISD::VLD2DUP";		case ARMISD::VLD2DUP: return "ARMISD::VLD2DUP";
case ARMISD::VLD3DUP: return "ARMISD::VLD3DUP";		case ARMISD::VLD3DUP: return "ARMISD::VLD3DUP";
case ARMISD::VLD4DUP: return "ARMISD::VLD4DUP";		case ARMISD::VLD4DUP: return "ARMISD::VLD4DUP";
case ARMISD::VLD1_UPD: return "ARMISD::VLD1_UPD";		case ARMISD::VLD1_UPD: return "ARMISD::VLD1_UPD";
case ARMISD::VLD2_UPD: return "ARMISD::VLD2_UPD";		case ARMISD::VLD2_UPD: return "ARMISD::VLD2_UPD";
case ARMISD::VLD3_UPD: return "ARMISD::VLD3_UPD";		case ARMISD::VLD3_UPD: return "ARMISD::VLD3_UPD";
case ARMISD::VLD4_UPD: return "ARMISD::VLD4_UPD";		case ARMISD::VLD4_UPD: return "ARMISD::VLD4_UPD";
case ARMISD::VLD2LN_UPD: return "ARMISD::VLD2LN_UPD";		case ARMISD::VLD2LN_UPD: return "ARMISD::VLD2LN_UPD";
[...] context: case ARM::COPY_STRUCT_BYVAL_I32:
return EmitStructByval(MI, BB);		return EmitStructByval(MI, BB);
case ARM::WIN__CHKSTK:		case ARM::WIN__CHKSTK:
return EmitLowered__chkstk(MI, BB);		return EmitLowered__chkstk(MI, BB);
case ARM::WIN__DBZCHK:		case ARM::WIN__DBZCHK:
return EmitLowered__dbzchk(MI, BB);		return EmitLowered__dbzchk(MI, BB);
}		}
}		}

		/// \brief Attaches vregs to MEMCPY that it will use as scratch registers
		/// when it is expanded into LDM/STM. This is done as a post-isel lowering
		/// instead of as a custom inserter because we need the use list from the SDNode.
		static void attachMEMCPYScratchRegs(const ARMSubtarget *Subtarget,
		                                    MachineInstr *MI, const SDNode *Node) {
		  bool isThumb1 = Subtarget->isThumb1Only();

		  DebugLoc DL = MI->getDebugLoc();
		  MachineFunction *MF = MI->getParent()->getParent();
		  MachineRegisterInfo &MRI = MF->getRegInfo();
		  MachineInstrBuilder MIB(*MF, MI);

		  // If the new dst/src is unused mark it as dead.
		  if (!Node->hasAnyUseOfValue(0)) {
		    MI->getOperand(0).setIsDead(true);
		  }
		  if (!Node->hasAnyUseOfValue(1)) {
		    MI->getOperand(1).setIsDead(true);
		  }

		  // The MEMCPY both defines and kills the scratch registers.
		  const ARMBaseInstrInfo *TII = Subtarget->getInstrInfo();
		  for (unsigned I = 0; I != MI->getOperand(4).getImm(); ++I) {
		[Inline comment, t.p.northover] Why's the second MI needed? Does regalloc not bother to allocate <def,dead> operands or something?
		[Inline reply, scott-0] I tried it with <def,kill> and BuildMI refused; I'll investigate using <def,dead>.
		    unsigned TmpReg = MRI.createVirtualRegister(isThumb1 ? &ARM::tGPRRegClass
		                                                         : &ARM::GPRRegClass);
		    MIB.addReg(TmpReg, RegState::Define|RegState::Dead);
		  }
		}

void ARMTargetLowering::AdjustInstrPostInstrSelection(MachineInstr *MI,		void ARMTargetLowering::AdjustInstrPostInstrSelection(MachineInstr *MI,
SDNode *Node) const {		SDNode *Node) const {
		if (MI->getOpcode() == ARM::MEMCPY) {
		attachMEMCPYScratchRegs(Subtarget, MI, Node);
		return;
		}

const MCInstrDesc *MCID = &MI->getDesc();		const MCInstrDesc *MCID = &MI->getDesc();
// Adjust potentially 's' setting instructions after isel, i.e. ADC, SBC, RSB,		// Adjust potentially 's' setting instructions after isel, i.e. ADC, SBC, RSB,
// RSC. Coming out of isel, they have an implicit CPSR def, but the optional		// RSC. Coming out of isel, they have an implicit CPSR def, but the optional
// operand is still set to noreg. If needed, set the optional operand's		// operand is still set to noreg. If needed, set the optional operand's
// register to CPSR, and remove the redundant implicit def.		// register to CPSR, and remove the redundant implicit def.
//		//
// e.g. ADCS (..., CPSR<imp-def>) -> ADC (... opt:CPSR<def>).		// e.g. ADCS (..., CPSR<imp-def>) -> ADC (... opt:CPSR<def>).

[...]

lib/Target/ARM/ARMInstrInfo.td

[...]

	def SDT_ARMTCRET : SDTypeProfile<0, 1, [SDTCisPtrTy<0>]>;			def SDT_ARMTCRET : SDTypeProfile<0, 1, [SDTCisPtrTy<0>]>;

	def SDT_ARMBFI : SDTypeProfile<1, 3, [SDTCisVT<0, i32>, SDTCisVT<1, i32>,			def SDT_ARMBFI : SDTypeProfile<1, 3, [SDTCisVT<0, i32>, SDTCisVT<1, i32>,
	SDTCisVT<2, i32>, SDTCisVT<3, i32>]>;			SDTCisVT<2, i32>, SDTCisVT<3, i32>]>;

	def SDT_WIN__DBZCHK : SDTypeProfile<0, 1, [SDTCisVT<0, i32>]>;			def SDT_WIN__DBZCHK : SDTypeProfile<0, 1, [SDTCisVT<0, i32>]>;

				def SDT_ARMMEMCPY : SDTypeProfile<2, 3, [SDTCisVT<0, i32>, SDTCisVT<1, i32>,
				SDTCisVT<2, i32>, SDTCisVT<3, i32>,
				SDTCisVT<4, i32>]>;

	def SDTBinaryArithWithFlags : SDTypeProfile<2, 2,			def SDTBinaryArithWithFlags : SDTypeProfile<2, 2,
	[SDTCisSameAs<0, 2>,			[SDTCisSameAs<0, 2>,
	SDTCisSameAs<0, 3>,			SDTCisSameAs<0, 3>,
	SDTCisInt<0>, SDTCisVT<1, i32>]>;			SDTCisInt<0>, SDTCisVT<1, i32>]>;

	// SDTBinaryArithWithFlagsInOut - RES1, CPSR = op LHS, RHS, CPSR			// SDTBinaryArithWithFlagsInOut - RES1, CPSR = op LHS, RHS, CPSR
	def SDTBinaryArithWithFlagsInOut : SDTypeProfile<2, 3,			def SDTBinaryArithWithFlagsInOut : SDTypeProfile<2, 3,
	[SDTCisSameAs<0, 2>,			[SDTCisSameAs<0, 2>,
[...]

	def ARMrbit : SDNode<"ARMISD::RBIT", SDTIntUnaryOp>;			def ARMrbit : SDNode<"ARMISD::RBIT", SDTIntUnaryOp>;

	def ARMtcret : SDNode<"ARMISD::TC_RETURN", SDT_ARMTCRET,			def ARMtcret : SDNode<"ARMISD::TC_RETURN", SDT_ARMTCRET,
	[SDNPHasChain, SDNPOptInGlue, SDNPVariadic]>;			[SDNPHasChain, SDNPOptInGlue, SDNPVariadic]>;

	def ARMbfi : SDNode<"ARMISD::BFI", SDT_ARMBFI>;			def ARMbfi : SDNode<"ARMISD::BFI", SDT_ARMBFI>;

				def ARMmemcopy : SDNode<"ARMISD::MEMCPY", SDT_ARMMEMCPY,
				[SDNPHasChain, SDNPInGlue, SDNPOutGlue,
				SDNPMayStore, SDNPMayLoad]>;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// ARM Instruction Predicate Definitions.			// ARM Instruction Predicate Definitions.
	//			//
	def HasV4T : Predicate<"Subtarget->hasV4TOps()">,			def HasV4T : Predicate<"Subtarget->hasV4TOps()">,
	AssemblerPredicate<"HasV4TOps", "armv4t">;			AssemblerPredicate<"HasV4TOps", "armv4t">;
	def NoV4T : Predicate<"!Subtarget->hasV4TOps()">;			def NoV4T : Predicate<"!Subtarget->hasV4TOps()">;
	def HasV5T : Predicate<"Subtarget->hasV5TOps()">,			def HasV5T : Predicate<"Subtarget->hasV5TOps()">,
	AssemblerPredicate<"HasV5TOps", "armv5t">;			AssemblerPredicate<"HasV5TOps", "armv5t">;
[...]

	let usesCustomInserter = 1 in {			let usesCustomInserter = 1 in {
	def COPY_STRUCT_BYVAL_I32 : PseudoInst<			def COPY_STRUCT_BYVAL_I32 : PseudoInst<
	(outs), (ins GPR:$dst, GPR:$src, i32imm:$size, i32imm:$alignment),			(outs), (ins GPR:$dst, GPR:$src, i32imm:$size, i32imm:$alignment),
	NoItinerary,			NoItinerary,
	[(ARMcopystructbyval GPR:$dst, GPR:$src, imm:$size, imm:$alignment)]>;			[(ARMcopystructbyval GPR:$dst, GPR:$src, imm:$size, imm:$alignment)]>;
	}			}

				let hasPostISelHook = 1, Constraints = "$newdst = $dst, $newsrc = $src" in {
				  // %newdst, %newsrc = MEMCPY %dst, %src, N, ...N scratch regs...
				  // Copies N registers worth of memory from address %src to address %dst
				  // and returns the incremented addresses. N scratch registers will
				  // be attached for the copy to use.
				  def MEMCPY : PseudoInst<
				    (outs GPR:$newdst, GPR:$newsrc),
				    (ins GPR:$dst, GPR:$src, i32imm:$nreg, variable_ops),
				    NoItinerary,
				    [(set GPR:$newdst, GPR:$newsrc,
				          (ARMmemcopy GPR:$dst, GPR:$src, imm:$nreg))]>;
				}

	def ldrex_1 : PatFrag<(ops node:$ptr), (int_arm_ldrex node:$ptr), [{			def ldrex_1 : PatFrag<(ops node:$ptr), (int_arm_ldrex node:$ptr), [{
	return cast<MemIntrinsicSDNode>(N)->getMemoryVT() == MVT::i8;			return cast<MemIntrinsicSDNode>(N)->getMemoryVT() == MVT::i8;
	}]>;			}]>;

	def ldrex_2 : PatFrag<(ops node:$ptr), (int_arm_ldrex node:$ptr), [{			def ldrex_2 : PatFrag<(ops node:$ptr), (int_arm_ldrex node:$ptr), [{
	return cast<MemIntrinsicSDNode>(N)->getMemoryVT() == MVT::i16;			return cast<MemIntrinsicSDNode>(N)->getMemoryVT() == MVT::i16;
	}]>;			}]>;

[...]

lib/Target/ARM/ARMInstrThumb.td

[...] context: def tLDMIA : T1I<(outs), (ins tGPR:$Rn, pred:$p, reglist:$regs, variable_ops),
bits<8> regs;		bits<8> regs;
let Inst{10-8} = Rn;		let Inst{10-8} = Rn;
let Inst{7-0} = regs;		let Inst{7-0} = regs;
}		}

// Writeback version is just a pseudo, as there's no encoding difference.		// Writeback version is just a pseudo, as there's no encoding difference.
// Writeback happens iff the base register is not in the destination register		// Writeback happens iff the base register is not in the destination register
// list.		// list.
		let mayLoad = 1, hasExtraDefRegAllocReq = 1 in
def tLDMIA_UPD :		def tLDMIA_UPD :
InstTemplate<AddrModeNone, 0, IndexModeNone, Pseudo, GenericDomain,		InstTemplate<AddrModeNone, 0, IndexModeNone, Pseudo, GenericDomain,
"$Rn = $wb", IIC_iLoad_mu>,		"$Rn = $wb", IIC_iLoad_mu>,
PseudoInstExpansion<(tLDMIA tGPR:$Rn, pred:$p, reglist:$regs)> {		PseudoInstExpansion<(tLDMIA tGPR:$Rn, pred:$p, reglist:$regs)> {
let Size = 2;		let Size = 2;
let OutOperandList = (outs GPR:$wb);		let OutOperandList = (outs GPR:$wb);
let InOperandList = (ins GPR:$Rn, pred:$p, reglist:$regs, variable_ops);		let InOperandList = (ins GPR:$Rn, pred:$p, reglist:$regs, variable_ops);
let Pattern = [];		let Pattern = [];
[...]

lib/Target/ARM/ARMSelectionDAGInfo.cpp

[...] context: ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(SelectionDAG &DAG, SDLoc dl,

unsigned BytesLeft = SizeVal & 3;		unsigned BytesLeft = SizeVal & 3;
unsigned NumMemOps = SizeVal >> 2;		unsigned NumMemOps = SizeVal >> 2;
unsigned EmittedNumMemOps = 0;		unsigned EmittedNumMemOps = 0;
EVT VT = MVT::i32;		EVT VT = MVT::i32;
unsigned VTSize = 4;		unsigned VTSize = 4;
unsigned i = 0;		unsigned i = 0;
// Emit a maximum of 4 loads in Thumb1 since we have fewer registers		// Emit a maximum of 4 loads in Thumb1 since we have fewer registers
const unsigned MAX_LOADS_IN_LDM = Subtarget.isThumb1Only() ? 4 : 6;		const unsigned MaxLoadsInLDM = Subtarget.isThumb1Only() ? 4 : 6;
SDValue TFOps[6];		SDValue TFOps[6];
SDValue Loads[6];		SDValue Loads[6];
uint64_t SrcOff = 0, DstOff = 0;		uint64_t SrcOff = 0, DstOff = 0;

// Emit up to MAX_LOADS_IN_LDM loads, then a TokenFactor barrier, then the		// FIXME: We should invent a VMEMCPY pseudo-instruction that lowers to
// same number of stores. The loads and stores will get combined into		// VLDM/VSTM and make this code emit it when appropriate. This would reduce
// ldm/stm later on.		// pressure on the general purpose registers. However this seems harder to map
while (EmittedNumMemOps < NumMemOps) {		// onto the register allocator's view of the world.
for (i = 0;
i < MAX_LOADS_IN_LDM && EmittedNumMemOps + i < NumMemOps; ++i) {		// The number of MEMCPY pseudo-instructions to emit. We use up to
Loads[i] = DAG.getLoad(VT, dl, Chain,		// MaxLoadsInLDM registers per MEMCPY, which will get lowered into ldm/stm
DAG.getNode(ISD::ADD, dl, MVT::i32, Src,		// later on. This is a lower bound on the number of MEMCPY operations we must
DAG.getConstant(SrcOff, dl, MVT::i32)),		// emit.
SrcPtrInfo.getWithOffset(SrcOff), isVolatile,		unsigned NumMEMCPYs = (NumMemOps + MaxLoadsInLDM - 1) / MaxLoadsInLDM;
false, false, 0);
TFOps[i] = Loads[i].getValue(1);		SDVTList VTs = DAG.getVTList(MVT::i32, MVT::i32, MVT::Other, MVT::Glue);
SrcOff += VTSize;
}		for (unsigned I = 0; I != NumMEMCPYs; ++I) {
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other,		// Evenly distribute registers among MEMCPY operations to reduce register
makeArrayRef(TFOps, i));		// pressure.
		unsigned NextEmittedNumMemOps = NumMemOps * (I + 1) / NumMEMCPYs;
		unsigned NumRegs = NextEmittedNumMemOps - EmittedNumMemOps;

		Dst = DAG.getNode(ARMISD::MEMCPY, dl, VTs, Chain, Dst, Src,
		DAG.getConstant(NumRegs, dl, MVT::i32));
		Src = Dst.getValue(1);
		Chain = Dst.getValue(2);

for (i = 0;		DstPtrInfo = DstPtrInfo.getWithOffset(NumRegs * VTSize);
i < MAX_LOADS_IN_LDM && EmittedNumMemOps + i < NumMemOps; ++i) {		SrcPtrInfo = SrcPtrInfo.getWithOffset(NumRegs * VTSize);
TFOps[i] = DAG.getStore(Chain, dl, Loads[i],
DAG.getNode(ISD::ADD, dl, MVT::i32, Dst,
DAG.getConstant(DstOff, dl, MVT::i32)),
DstPtrInfo.getWithOffset(DstOff),
isVolatile, false, 0);
DstOff += VTSize;
}
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other,
makeArrayRef(TFOps, i));

EmittedNumMemOps += i;		EmittedNumMemOps = NextEmittedNumMemOps;
}		}

if (BytesLeft == 0)		if (BytesLeft == 0)
return Chain;		return Chain;

// Issue loads / stores for the trailing (1 - 3) bytes.		// Issue loads / stores for the trailing (1 - 3) bytes.
unsigned BytesLeftSave = BytesLeft;		unsigned BytesLeftSave = BytesLeft;
i = 0;		i = 0;
[...]
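The register distribution in the new EmitTargetCodeForMemcpy loop above can be checked in isolation. The sketch below lifts just the arithmetic into a standalone function; `distributeRegs` is a hypothetical name, and it returns the `NumRegs` value each MEMCPY would receive for a copy of the given size.

```cpp
#include <cassert>
#include <vector>

// Returns the number of registers given to each MEMCPY for a copy of
// SizeInBytes bytes, using the same arithmetic as the loop in the diff.
std::vector<unsigned> distributeRegs(unsigned SizeInBytes, bool IsThumb1) {
  const unsigned MaxLoadsInLDM = IsThumb1 ? 4 : 6;
  const unsigned NumMemOps = SizeInBytes >> 2; // whole 32-bit words
  const unsigned NumMEMCPYs = (NumMemOps + MaxLoadsInLDM - 1) / MaxLoadsInLDM;

  std::vector<unsigned> RegsPerMEMCPY;
  unsigned EmittedNumMemOps = 0;
  for (unsigned I = 0; I != NumMEMCPYs; ++I) {
    // Evenly distribute registers among the MEMCPY operations.
    unsigned NextEmittedNumMemOps = NumMemOps * (I + 1) / NumMEMCPYs;
    RegsPerMEMCPY.push_back(NextEmittedNumMemOps - EmittedNumMemOps);
    EmittedNumMemOps = NextEmittedNumMemOps;
  }
  // Every word must be covered by exactly one MEMCPY.
  assert(EmittedNumMemOps == NumMemOps);
  return RegsPerMEMCPY;
}
```

For a 28-byte copy in ARM mode this yields 3 and 4 registers rather than 6 and 1, which matches the `ldm`/`stm` register-list sizes that the `foo28` test in this revision checks for.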

lib/Target/ARM/Thumb2SizeReduction.cpp

[...] context: namespace {
{ ARM::t2STRBi12,ARM::tSTRBi, 0, 5, 0, 1, 0, 0,0, 0,1,0 },		{ ARM::t2STRBi12,ARM::tSTRBi, 0, 5, 0, 1, 0, 0,0, 0,1,0 },
{ ARM::t2STRBs, ARM::tSTRBr, 0, 0, 0, 1, 0, 0,0, 0,1,0 },		{ ARM::t2STRBs, ARM::tSTRBr, 0, 0, 0, 1, 0, 0,0, 0,1,0 },
{ ARM::t2STRHi12,ARM::tSTRHi, 0, 5, 0, 1, 0, 0,0, 0,1,0 },		{ ARM::t2STRHi12,ARM::tSTRHi, 0, 5, 0, 1, 0, 0,0, 0,1,0 },
{ ARM::t2STRHs, ARM::tSTRHr, 0, 0, 0, 1, 0, 0,0, 0,1,0 },		{ ARM::t2STRHs, ARM::tSTRHr, 0, 0, 0, 1, 0, 0,0, 0,1,0 },

{ ARM::t2LDMIA, ARM::tLDMIA, 0, 0, 0, 1, 1, 1,1, 0,1,0 },		{ ARM::t2LDMIA, ARM::tLDMIA, 0, 0, 0, 1, 1, 1,1, 0,1,0 },
{ ARM::t2LDMIA_RET,0, ARM::tPOP_RET, 0, 0, 1, 1, 1,1, 0,1,0 },		{ ARM::t2LDMIA_RET,0, ARM::tPOP_RET, 0, 0, 1, 1, 1,1, 0,1,0 },
{ ARM::t2LDMIA_UPD,ARM::tLDMIA_UPD,ARM::tPOP,0, 0, 1, 1, 1,1, 0,1,0 },		{ ARM::t2LDMIA_UPD,ARM::tLDMIA_UPD,ARM::tPOP,0, 0, 1, 1, 1,1, 0,1,0 },
// ARM::t2STM (with no basereg writeback) has no Thumb1 equivalent		// ARM::t2STMIA (with no basereg writeback) has no Thumb1 equivalent.
		// tSTMIA_UPD is a change in semantics which can only be used if the base
		// register is killed. This difference is correctly handled elsewhere.
		{ ARM::t2STMIA, ARM::tSTMIA_UPD, 0, 0, 0, 1, 1, 1,1, 0,1,0 },
{ ARM::t2STMIA_UPD,ARM::tSTMIA_UPD, 0, 0, 0, 1, 1, 1,1, 0,1,0 },		{ ARM::t2STMIA_UPD,ARM::tSTMIA_UPD, 0, 0, 0, 1, 1, 1,1, 0,1,0 },
{ ARM::t2STMDB_UPD, 0, ARM::tPUSH, 0, 0, 1, 1, 1,1, 0,1,0 }		{ ARM::t2STMDB_UPD, 0, ARM::tPUSH, 0, 0, 1, 1, 1,1, 0,1,0 }
};		};

class Thumb2SizeReduce : public MachineFunctionPass {		class Thumb2SizeReduce : public MachineFunctionPass {
public:		public:
static char ID;		static char ID;
Thumb2SizeReduce(std::function<bool(const Function &)> Ftor);		Thumb2SizeReduce(std::function<bool(const Function &)> Ftor);
[...] context: case ARM::t2LDMIA: {

if (!isOK)		if (!isOK)
return false;		return false;

OpNum = 0;		OpNum = 0;
isLdStMul = true;		isLdStMul = true;
break;		break;
}		}
		case ARM::t2STMIA: {
		// If the base register is killed, we don't care what its value is after the
		// instruction, so we can use an updating STMIA.
		if (!MI->getOperand(0).isKill())
		return false;

		break;
		}
case ARM::t2LDMIA_RET: {		case ARM::t2LDMIA_RET: {
unsigned BaseReg = MI->getOperand(1).getReg();		unsigned BaseReg = MI->getOperand(1).getReg();
if (BaseReg != ARM::SP)		if (BaseReg != ARM::SP)
return false;		return false;
Opc = Entry.NarrowOpc2; // tPOP_RET		Opc = Entry.NarrowOpc2; // tPOP_RET
OpNum = 2;		OpNum = 2;
isLdStMul = true;		isLdStMul = true;
break;		break;
[...] context: if (HasImmOffset) {
if ((OffsetImm & (Scale - 1)) || OffsetImm > MaxOffset)		if ((OffsetImm & (Scale - 1)) || OffsetImm > MaxOffset)
// Make sure the immediate field fits.		// Make sure the immediate field fits.
return false;		return false;
}		}

// Add the 16-bit load / store instruction.		// Add the 16-bit load / store instruction.
DebugLoc dl = MI->getDebugLoc();		DebugLoc dl = MI->getDebugLoc();
MachineInstrBuilder MIB = BuildMI(MBB, MI, dl, TII->get(Opc));		MachineInstrBuilder MIB = BuildMI(MBB, MI, dl, TII->get(Opc));

		// tSTMIA_UPD takes a defining register operand. We've already checked that
		// the register is killed, so mark it as dead here.
		if (Entry.WideOpc == ARM::t2STMIA)
		MIB.addReg(MI->getOperand(0).getReg(), RegState::Define | RegState::Dead);

if (!isLdStMul) {		if (!isLdStMul) {
MIB.addOperand(MI->getOperand(0));		MIB.addOperand(MI->getOperand(0));
MIB.addOperand(MI->getOperand(1));		MIB.addOperand(MI->getOperand(1));

if (HasImmOffset)		if (HasImmOffset)
MIB.addImm(OffsetImm / Scale);		MIB.addImm(OffsetImm / Scale);

assert((!HasShift || OffsetReg) && "Invalid so_reg load / store address!");		assert((!HasShift || OffsetReg) && "Invalid so_reg load / store address!");
[...]

test/CodeGen/ARM/ldm-stm-base-materialization.ll

This file was added.

				; RUN: llc -mtriple armv7a-none-eabi -mattr=-neon < %s -verify-machineinstrs -o - | FileCheck %s

				; Thumb1 (thumbv6m) is tested in tests/Thumb

				@a = external global i32*
				@b = external global i32*

				; Function Attrs: nounwind
				define void @foo24() #0 {
				entry:
				; CHECK-LABEL: foo24:
				; We use '[rl0-9]+' to allow 'r0'..'r12', 'lr'
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: add [[NSB:[rl0-9]+]], [[SB]], #4
				; CHECK-NEXT: ldm [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]], [[R5:[rl0-9]+]], [[R6:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]], {[[R1]], [[R2]], [[R3]], [[R4]], [[R5]], [[R6]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 24, i32 4, i1 false)
				ret void
				}

				define void @foo28() #0 {
				entry:
				; CHECK-LABEL: foo28:
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: add [[NSB:[rl0-9]+]], [[SB]], #4
				; CHECK-NEXT: ldm [[NLB]]!, {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]]!, {[[R1]], [[R2]], [[R3]]}
				; CHECK-NEXT: ldm [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]], {[[R1]], [[R2]], [[R3]], [[R4]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 28, i32 4, i1 false)
				ret void
				}

				define void @foo32() #0 {
				entry:
				; CHECK-LABEL: foo32:
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: add [[NSB:[rl0-9]+]], [[SB]], #4
				; CHECK-NEXT: ldm [[NLB]]!, {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]]!, {[[R1]], [[R2]], [[R3]], [[R4]]}
				; CHECK-NEXT: ldm [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]], {[[R1]], [[R2]], [[R3]], [[R4]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 32, i32 4, i1 false)
				ret void
				}

				define void @foo36() #0 {
				entry:
				; CHECK-LABEL: foo36:
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: add [[NSB:[rl0-9]+]], [[SB]], #4
				; CHECK-NEXT: ldm [[NLB]]!, {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]]!, {[[R1]], [[R2]], [[R3]], [[R4]]}
				; CHECK-NEXT: ldm [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]], [[R5:[rl0-9]+]]}
				; CHECK-NEXT: stm [[NSB]], {[[R1]], [[R2]], [[R3]], [[R4]], [[R5]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 36, i32 4, i1 false)
				ret void
				}

				; Function Attrs: nounwind
				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture readonly, i32, i32, i1) #1

test/CodeGen/ARM/load-store-flags.ll

	; RUN: llc -mtriple=thumbv7-apple-ios7.0 -o - %s -verify-machineinstrs | FileCheck %s			; RUN: llc -mtriple=thumbv7-apple-ios7.0 -o - %s -verify-machineinstrs | FileCheck %s

	; The base register for the store is killed by the last instruction, but is			; The base register for the store is killed by the last instruction, but is
	; actually also used during as part of the store itself. If an extra ADD is			; actually also used during as part of the store itself. If an extra ADD is
	; inserted, it should not kill the base.			; inserted, it should not kill the base.
	define void @test_base_kill(i32 %v0, i32 %v1, i32* %addr) {			define void @test_base_kill(i32 %v0, i32 %v1, i32* %addr) {
	; CHECK-LABEL: test_base_kill:			; CHECK-LABEL: test_base_kill:
	; CHECK: adds [[NEWBASE:r[0-9]+]], r2, #4			; CHECK: adds [[NEWBASE:r[0-9]+]], r2, #4
	; CHECK: stm.w [[NEWBASE]], {r0, r1, r2}			; CHECK: stm [[NEWBASE]]!, {r0, r1, r2}

	%addr.1 = getelementptr i32, i32* %addr, i32 1			%addr.1 = getelementptr i32, i32* %addr, i32 1
	store i32 %v0, i32* %addr.1			store i32 %v0, i32* %addr.1

	%addr.2 = getelementptr i32, i32* %addr, i32 2			%addr.2 = getelementptr i32, i32* %addr, i32 2
	store i32 %v1, i32* %addr.2			store i32 %v1, i32* %addr.2

	%addr.3 = getelementptr i32, i32* %addr, i32 3			%addr.3 = getelementptr i32, i32* %addr, i32 3
	%val = ptrtoint i32* %addr to i32			%val = ptrtoint i32* %addr to i32
	store i32 %val, i32* %addr.3			store i32 %val, i32* %addr.3

	ret void			ret void
	}			}

	; Similar, but it's not sufficient to look at just the last instruction (where			; Similar, but it's not sufficient to look at just the last instruction (where
	; liveness of the base is determined). An intervening instruction might be moved			; liveness of the base is determined). An intervening instruction might be moved
	; past it to form the STM.			; past it to form the STM.
	define void @test_base_kill_mid(i32 %v0, i32* %addr, i32 %v1) {			define void @test_base_kill_mid(i32 %v0, i32* %addr, i32 %v1) {
	; CHECK-LABEL: test_base_kill_mid:			; CHECK-LABEL: test_base_kill_mid:
	; CHECK: adds [[NEWBASE:r[0-9]+]], r1, #4			; CHECK: adds [[NEWBASE:r[0-9]+]], r1, #4
	; CHECK: stm.w [[NEWBASE]], {r0, r1, r2}			; CHECK: stm [[NEWBASE]]!, {r0, r1, r2}

	%addr.1 = getelementptr i32, i32* %addr, i32 1			%addr.1 = getelementptr i32, i32* %addr, i32 1
	store i32 %v0, i32* %addr.1			store i32 %v0, i32* %addr.1

	%addr.2 = getelementptr i32, i32* %addr, i32 2			%addr.2 = getelementptr i32, i32* %addr, i32 2
	%val = ptrtoint i32* %addr to i32			%val = ptrtoint i32* %addr to i32
	store i32 %val, i32* %addr.2			store i32 %val, i32* %addr.2

	%addr.3 = getelementptr i32, i32* %addr, i32 3			%addr.3 = getelementptr i32, i32* %addr, i32 3
	store i32 %v1, i32* %addr.3			store i32 %v1, i32* %addr.3

	ret void			ret void
	}			}

test/CodeGen/ARM/memcpy-ldm-stm.ll

This file was added.

				; RUN: llc -mtriple=thumbv6m-eabi -verify-machineinstrs %s -o - | \
				; RUN: FileCheck %s --check-prefix=CHECK --check-prefix=CHECKV6
				; RUN: llc -mtriple=thumbv6m-eabi -O=0 -verify-machineinstrs %s -o - | \
				; RUN: FileCheck %s --check-prefix=CHECK --check-prefix=CHECKV6
				; RUN: llc -mtriple=thumbv7a-eabi -mattr=-neon -verify-machineinstrs %s -o - | \
				; RUN: FileCheck %s --check-prefix=CHECK --check-prefix=CHECKV7
				; RUN: llc -mtriple=armv7a-eabi -mattr=-neon -verify-machineinstrs %s -o - | \
				; RUN: FileCheck %s --check-prefix=CHECK --check-prefix=CHECKV7

				@d = external global [64 x i32]
				@s = external global [64 x i32]

				; Function Attrs: nounwind
				define void @t1() #0 {
				entry:
				; CHECK-LABEL: t1:
				; CHECKV6: ldr [[LB:r[0-7]]],
				; CHECKV6-NEXT: ldr [[SB:r[0-7]]],
				; We use '[rl0-9]+' to allow 'r0'..'r12', 'lr'
				; CHECKV7: movt [[LB:[rl0-9]+]], :upper16:d
				; CHECKV7-NEXT: movt [[SB:[rl0-9]+]], :upper16:s
				; CHECK-NEXT: ldm{{(\.w)?}} [[LB]]!,
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]]!,
				; Think of the monstrosity '{{\[}}[[LB]]]' as '[ [[LB]] ]' without the spaces.
				; CHECK-NEXT: ldrb{{(\.w)?}} {{.*}}, {{\[}}[[LB]]]
				; CHECK-NEXT: strb{{(\.w)?}} {{.*}}, {{\[}}[[SB]]]
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* bitcast ([64 x i32]* @s to i8*), i8* bitcast ([64 x i32]* @d to i8*), i32 17, i32 4, i1 false)
				ret void
				}

				; Function Attrs: nounwind
				define void @t2() #0 {
				entry:
				; CHECK-LABEL: t2:
				; CHECKV6: ldr [[LB:r[0-7]]],
				; CHECKV6-NEXT: ldr [[SB:r[0-7]]],
				; CHECKV7: movt [[LB:[rl0-9]+]], :upper16:d
				; CHECKV7-NEXT: movt [[SB:[rl0-9]+]], :upper16:s
				; CHECK-NEXT: ldm{{(\.w)?}} [[LB]]!,
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]]!,
				; CHECK-NEXT: ldrh{{(\.w)?}} {{.*}}, {{\[}}[[LB]]]
				; CHECK-NEXT: ldrb{{(\.w)?}} {{.*}}, {{\[}}[[LB]], #2]
				; CHECK-NEXT: strb{{(\.w)?}} {{.*}}, {{\[}}[[SB]], #2]
				; CHECK-NEXT: strh{{(\.w)?}} {{.*}}, {{\[}}[[SB]]]
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* bitcast ([64 x i32]* @s to i8*), i8* bitcast ([64 x i32]* @d to i8*), i32 15, i32 4, i1 false)
				ret void
				}

				; PR23768
				%struct.T = type { i8, i64, i8 }

				@copy = external global %struct.T, align 8
				@etest = external global %struct.T, align 8

				define void @t3() {
				call void @llvm.memcpy.p0i8.p0i8.i32(
				i8* getelementptr inbounds (%struct.T, %struct.T* @copy, i32 0, i32 0),
				i8* getelementptr inbounds (%struct.T, %struct.T* @etest, i32 0, i32 0),
				i32 24, i32 8, i1 false)
				call void @llvm.memcpy.p0i8.p0i8.i32(
				i8* getelementptr inbounds (%struct.T, %struct.T* @copy, i32 0, i32 0),
				i8* getelementptr inbounds (%struct.T, %struct.T* @etest, i32 0, i32 0),
				i32 24, i32 8, i1 false)
				ret void
				}

				%struct.S = type { [12 x i32] }

				; CHECK-LABEL: test3
				define void @test3(%struct.S* %d, %struct.S* %s) #0 {
				%1 = bitcast %struct.S* %d to i8*
				%2 = bitcast %struct.S* %s to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %2, i32 48, i32 4, i1 false)
				; 3 ldm/stm pairs in v6; 2 in v7
				; CHECK: ldm{{(\.w)?}} {{[rl0-9]+!?}}, [[REGLIST1:{.*}]]
				; CHECK: stm{{(\.w)?}} {{[rl0-9]+!?}}, [[REGLIST1]]
				; CHECK: ldm{{(\.w)?}} {{[rl0-9]+!?}}, [[REGLIST2:{.*}]]
				; CHECK: stm{{(\.w)?}} {{[rl0-9]+!?}}, [[REGLIST2]]
				; CHECKV6: ldm {{r[0-7]!?}}, [[REGLIST3:{.*}]]
				; CHECKV6: stm {{r[0-7]!?}}, [[REGLIST3]]
				; CHECKV7-NOT: ldm
				; CHECKV7-NOT: stm
				%arrayidx = getelementptr inbounds %struct.S, %struct.S* %s, i32 0, i32 0, i32 1
				tail call void @g(i32* %arrayidx) #3
				ret void
				}

				declare void @g(i32*)

				; Set "no-frame-pointer-elim" to increase register pressure
				attributes #0 = { "no-frame-pointer-elim"="true" }

				; Function Attrs: nounwind
				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture readonly, i32, i32, i1) #1

test/CodeGen/Thumb/ldm-stm-base-materialization-thumb2.ll

This file was added.

				; RUN: llc -mattr=-neon < %s -verify-machineinstrs -o - | FileCheck %s

				target triple = "thumbv7a-none--eabi"

				@a = external global i32*
				@b = external global i32*

				; Function Attrs: nounwind
				define void @foo24() #0 {
				entry:
				; CHECK-LABEL: foo24:
				; We use '[rl0-9]*' to allow 'r0'..'r12', 'lr'
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add{{s?}}{{(\.w)?}} [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: adds [[SB]], #4
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]], [[R5:[rl0-9]+]], [[R6:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]], {[[R1]], [[R2]], [[R3]], [[R4]], [[R5]], [[R6]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 24, i32 4, i1 false)
				ret void
				}

				define void @foo28() #0 {
				entry:
				; CHECK-LABEL: foo28:
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add{{(\.w)?}} [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: adds [[SB]], #4
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]]!, {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]]!, {[[R1]], [[R2]], [[R3]]}
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]], {[[R1]], [[R2]], [[R3]], [[R4]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 28, i32 4, i1 false)
				ret void
				}

				define void @foo32() #0 {
				entry:
				; CHECK-LABEL: foo32:
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add{{(\.w)?}} [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: adds [[SB]], #4
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]]!, {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]]!, {[[R1]], [[R2]], [[R3]], [[R4]]}
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]], {[[R1]], [[R2]], [[R3]], [[R4]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 32, i32 4, i1 false)
				ret void
				}

				define void @foo36() #0 {
				entry:
				; CHECK-LABEL: foo36:
				; CHECK: movt [[LB:[rl0-9]+]], :upper16:b
				; CHECK: movt [[SB:[rl0-9]+]], :upper16:a
				; CHECK: add{{(\.w)?}} [[NLB:[rl0-9]+]], [[LB]], #4
				; CHECK: adds [[SB]], #4
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]]!, {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]]!, {[[R1]], [[R2]], [[R3]], [[R4]]}
				; CHECK-NEXT: ldm{{(\.w)?}} [[NLB]], {[[R1:[rl0-9]+]], [[R2:[rl0-9]+]], [[R3:[rl0-9]+]], [[R4:[rl0-9]+]], [[R5:[rl0-9]+]]}
				; CHECK-NEXT: stm{{(\.w)?}} [[SB]], {[[R1]], [[R2]], [[R3]], [[R4]], [[R5]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 36, i32 4, i1 false)
				ret void
				}

				; Function Attrs: nounwind
				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture readonly, i32, i32, i1) #1

test/CodeGen/Thumb/ldm-stm-base-materialization.ll

	; RUN: llc < %s -mtriple=thumbv6m-eabi -verify-machineinstrs -o - | FileCheck %s			; RUN: llc < %s -mtriple=thumbv6m-eabi -verify-machineinstrs -o - | FileCheck %s
	target datalayout = "e-m:e-p:32:32-i1:8:32-i8:8:32-i16:16:32-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-i1:8:32-i8:8:32-i16:16:32-i64:64-v128:64:128-a:0:32-n32-S64"
	target triple = "thumbv6m-none--eabi"			target triple = "thumbv6m-none--eabi"

	@a = external global i32*			@a = external global i32*
	@b = external global i32*			@b = external global i32*

	; Function Attrs: nounwind			; Function Attrs: nounwind
	define void @foo() #0 {			define void @foo24() #0 {
	entry:			entry:
	; CHECK-LABEL: foo:			; CHECK-LABEL: foo24:
	; CHECK: ldr r[[SB:[0-9]]], .LCPI
	; CHECK: ldr r[[LB:[0-9]]], .LCPI			; CHECK: ldr r[[LB:[0-9]]], .LCPI
	; CHECK: adds r[[NLB:[0-9]]], r[[LB]], #4			; CHECK: adds r[[NLB:[0-9]]], r[[LB]], #4
	; CHECK-NEXT: ldm r[[NLB]],			; CHECK: ldr r[[SB:[0-9]]], .LCPI
	; CHECK: adds r[[NSB:[0-9]]], r[[SB]], #4			; CHECK: adds r[[NSB:[0-9]]], r[[SB]], #4
	; CHECK-NEXT: stm r[[NSB]]			; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]]}
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]]}
	%0 = load i32, i32* @a, align 4			%0 = load i32, i32* @a, align 4
	%arrayidx = getelementptr inbounds i32, i32* %0, i32 1			%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
	%1 = bitcast i32* %arrayidx to i8*			%1 = bitcast i32* %arrayidx to i8*
	%2 = load i32, i32* @b, align 4			%2 = load i32, i32* @b, align 4
	%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1			%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
	%3 = bitcast i32* %arrayidx1 to i8*			%3 = bitcast i32* %arrayidx1 to i8*
	tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 24, i32 4, i1 false)			tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 24, i32 4, i1 false)
	ret void			ret void
	}			}

				define void @foo28() #0 {
				entry:
				; CHECK-LABEL: foo28:
				; CHECK: ldr r[[LB:[0-9]]], .LCPI
				; CHECK: adds r[[NLB:[0-9]]], r[[LB]], #4
				; CHECK: ldr r[[SB:[0-9]]], .LCPI
				; CHECK: adds r[[NSB:[0-9]]], r[[SB]], #4
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]]}
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]], r[[R4:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]], r[[R4]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 28, i32 4, i1 false)
				ret void
				}

				define void @foo32() #0 {
				entry:
				; CHECK-LABEL: foo32:
				; CHECK: ldr r[[LB:[0-9]]], .LCPI
				; CHECK: adds r[[NLB:[0-9]]], r[[LB]], #4
				; CHECK: ldr r[[SB:[0-9]]], .LCPI
				; CHECK: adds r[[NSB:[0-9]]], r[[SB]], #4
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]], r[[R4:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]], r[[R4]]}
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]], r[[R4:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]], r[[R4]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 32, i32 4, i1 false)
				ret void
				}

				define void @foo36() #0 {
				entry:
				; CHECK-LABEL: foo36:
				; CHECK: ldr r[[LB:[0-9]]], .LCPI
				; CHECK: adds r[[NLB:[0-9]]], r[[LB]], #4
				; CHECK: ldr r[[SB:[0-9]]], .LCPI
				; CHECK: adds r[[NSB:[0-9]]], r[[SB]], #4
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]]}
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]]}
				; CHECK-NEXT: ldm r[[NLB]]!, {r[[R1:[0-9]]], r[[R2:[0-9]]], r[[R3:[0-9]]]}
				; CHECK-NEXT: stm r[[NSB]]!, {r[[R1]], r[[R2]], r[[R3]]}
				%0 = load i32, i32* @a, align 4
				%arrayidx = getelementptr inbounds i32, i32* %0, i32 1
				%1 = bitcast i32* %arrayidx to i8*
				%2 = load i32, i32* @b, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %2, i32 1
				%3 = bitcast i32* %arrayidx1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %3, i32 36, i32 4, i1 false)
				ret void
				}

	; Function Attrs: nounwind			; Function Attrs: nounwind
	declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture readonly, i32, i32, i1) #1			declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture readonly, i32, i32, i1) #1
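
The register counts checked in the foo24/foo28/foo32/foo36 tests are consistent with spreading the copied words evenly over as few LDM/STM pairs as possible. The sketch below is inferred from the CHECK lines only and is not taken from the patch: `splitIntoChunks` is a hypothetical name, and the per-instruction limits (6 registers for ARM/Thumb2, 4 for Thumb1) are assumptions read off the expected output.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of the patch: split `numWords` 32-bit words
// into LDM/STM chunks of at most `maxRegs` registers each, distributing the
// words as evenly as possible so that no single chunk needs many registers.
std::vector<unsigned> splitIntoChunks(unsigned numWords, unsigned maxRegs) {
  std::vector<unsigned> chunks;
  if (numWords == 0 || maxRegs == 0)
    return chunks;

  // Fewest chunks that can hold all words under the register limit.
  unsigned numChunks = (numWords + maxRegs - 1) / maxRegs;

  unsigned emitted = 0;
  for (unsigned i = 0; i != numChunks; ++i) {
    // Cumulative word count after chunk i; any remainder lands in later chunks.
    unsigned next = numWords * (i + 1) / numChunks;
    chunks.push_back(next - emitted);
    emitted = next;
  }
  return chunks;
}
```

Under these assumptions, `splitIntoChunks(9, 6)` gives {4, 5}, matching foo36 on ARMv7, and `splitIntoChunks(9, 4)` gives {3, 3, 3}, matching foo36 on Thumb1.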

test/CodeGen/Thumb/thumb-memcpy-ldm-stm.ll

This file was deleted.

	; RUN: llc -mtriple=thumbv6m-eabi -verify-machineinstrs %s -o - | FileCheck %s
	@d = external global [64 x i32]
	@s = external global [64 x i32]

	; Function Attrs: nounwind
	define void @t1() #0 {
	entry:
	; CHECK-LABEL: t1:
	; CHECK: ldr r[[LB:[0-9]]],
	; CHECK-NEXT: ldm r[[LB]]!,
	; CHECK-NEXT: ldr r[[SB:[0-9]]],
	; CHECK-NEXT: stm r[[SB]]!,
	; CHECK-NEXT: ldrb {{.*}}, [r[[LB]]]
	; CHECK-NEXT: strb {{.*}}, [r[[SB]]]
	tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* bitcast ([64 x i32]* @s to i8*), i8* bitcast ([64 x i32]* @d to i8*), i32 17, i32 4, i1 false)
	ret void
	}

	; Function Attrs: nounwind
	define void @t2() #0 {
	entry:
	; CHECK-LABEL: t2:
	; CHECK: ldr r[[LB:[0-9]]],
	; CHECK-NEXT: ldm r[[LB]]!,
	; CHECK-NEXT: ldr r[[SB:[0-9]]],
	; CHECK-NEXT: stm r[[SB]]!,
	; CHECK-NEXT: ldrh {{.*}}, [r[[LB]]]
	; CHECK-NEXT: ldrb {{.*}}, [r[[LB]], #2]
	; CHECK-NEXT: strb {{.*}}, [r[[SB]], #2]
	; CHECK-NEXT: strh {{.*}}, [r[[SB]]]
	tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* bitcast ([64 x i32]* @s to i8*), i8* bitcast ([64 x i32]* @d to i8*), i32 15, i32 4, i1 false)
	ret void
	}

	; Function Attrs: nounwind
	declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture readonly, i32, i32, i1) #1