This is an archive of the discontinued LLVM Phabricator instance.

Thumb2: Modify codegen for memcpy intrinsic to prefer LDM/STM.
ClosedPublic

Authored by pcc on May 5 2015, 3:55 PM.

Details

Summary

We were previously codegen'ing these as regular load/store operations and
hoping that the register allocator would allocate registers in ascending order
so that we could apply an LDM/STM combine after register allocation. According
to the commit that first introduced this code (r37179), we planned to teach
the register allocator to allocate the registers in ascending order. This
never got implemented, and up to now we've been stuck with very poor codegen.

A much simpler approach for achieving better codegen is to create LDM/STM
instructions with identical sets of virtual registers, let the register
allocator pick arbitrary registers, and order the register lists when
printing an MCInst. This approach also avoids the need to repeatedly
calculate offsets, which ultimately ought to be eliminated pre-RA in order
to decrease register pressure.
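
For illustration, ordering the register list at print time could look roughly
like the sketch below. This is only a sketch, not the code in this diff: the
helper name is hypothetical, and it assumes the register list occupies the
trailing operands of the MCInst and that register-number order is an
acceptable stand-in for encoding order.

  // Hypothetical sketch: sort the register-list operands of an LDM/STM
  // MCInst into ascending order just before emission.
  #include "llvm/MC/MCInst.h"
  #include <algorithm>

  static void sortRegisterList(llvm::MCInst &MI, unsigned FirstListOp) {
    std::sort(MI.begin() + FirstListOp, MI.end(),
              [](const llvm::MCOperand &A, const llvm::MCOperand &B) {
                return A.getReg() < B.getReg();
              });
  }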

This is implemented by lowering the memcpy intrinsic to a series of SD-only
MCOPY pseudo-instructions, each of which performs a memory copy using a given
number of registers. During SD->MI lowering, we lower MCOPY to LDM/STM. This
is a little unusual, but it avoids the need to encode register lists in the
SD, and we can take advantage of SD use lists to decide whether to use the
_UPD variant of the instructions.
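
As a rough illustration of the use-list check mentioned above (the helper and
the result numbering are assumptions for the sketch, not the patch's actual
code):

  // Hypothetical sketch, inside the ARM backend where the ARM:: opcode
  // enum is visible: pick the writeback (_UPD) form only when the
  // incremented source pointer produced by the MCOPY node is used again.
  // Assumes result 0 of the node is the updated source pointer.
  #include "llvm/CodeGen/SelectionDAGNodes.h"

  static unsigned pickThumb2LoadOpcode(const llvm::SDNode *MCopy) {
    return MCopy->hasAnyUseOfValue(0) ? ARM::t2LDMIA_UPD : ARM::t2LDMIA;
  }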

Fixes PR9199.

Diff Detail

Event Timeline

pcc updated this revision to Diff 24992.May 5 2015, 3:55 PM
pcc retitled this revision from to Thumb2: Modify codegen for memcpy intrinsic to prefer LDM/STM..
pcc updated this object.
pcc edited the test plan for this revision.
pcc added a reviewer: rengolin.
pcc added a subscriber: Unknown Object (MLST).

This looks like a generally good idea to me, though there are some points that need to be looked at (in the inline comments).

If this were to be committed, it does make me wonder what should become of the load/store optimizer. Perhaps ARMPreAllocLoadStoreOpt should be made to introduce MCOPY? That's something for another day though.

lib/Target/ARM/ARMISelLowering.cpp
7567

The assumption here is that !isThumb2 means ARM, but it could also mean Thumb1. This means for a Thumb1 target we emit invalid instructions. Either LowerMCOPY should handle Thumb1, or we shouldn't be turning memcpy into MCOPY for Thumb1.
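
One way to take the second option would be a guard like the sketch below,
assuming it sits at the top of ARMSelectionDAGInfo::EmitTargetCodeForMemcpy,
where returning an empty SDValue falls back to the generic lowering:

  // Hypothetical guard (not necessarily what a later diff does): skip the
  // MCOPY path entirely on Thumb1 and let the generic memcpy expansion
  // handle it.
  const ARMSubtarget &Subtarget =
      DAG.getMachineFunction().getSubtarget<ARMSubtarget>();
  if (Subtarget.isThumb1Only())
    return SDValue();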

lib/Target/ARM/ARMSelectionDAGInfo.cpp
28

EmitTargetCodeForMemcpy is called by SelectionDAG::getMemcpy only when getMemcpyLoadsAndStores fails to generate a load/store sequence. ARMTargetLowering::ARMTargetLowering currently sets MaxStoresPerMemcpy to 4, so this function will only be triggered for memcpys of >16 bytes. If MCOPY gives better results than individual loads and stores, then maybe MaxStoresPerMemcpy should be lowered to 0 so that this function is always used?

67–68

If the number of words to be copied is not an exact multiple of MAX_LOADS_IN_LDM, then splitting it up in this way may not be the best idea. Consider, for example, a copy of 7 words. Splitting it into 6+1 means that the total number of registers that need to be available is 8 (source, dest, and a 6-register list), but if we were to split it as 3+4 then the total is 6 (or maybe 5 if the source and dest are dead after the memcpy). That would reduce register pressure and in some cases mean that fewer callee-saved registers need to be saved.

Of course that's the current behaviour as well, but lumping everything together into an MCOPY may make things harder for the register allocator, which may have had more freedom with individual loads and stores. I don't know enough about LLVM's register allocation to say whether that's actually true, or whether it ever turns out to be a problem.

87

I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7, i.e. after some number of MCOPYs of MAX_LOADS_IN_LDM size have been emitted we don't want a one-word LDM for the remainder, so up to 7 bytes can be left over.

Maybe it would be clearer if the calculation of BytesLeft were done in one go after the MCOPY generation instead of being split around it, possibly with a more explanatory comment.
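
For what it's worth, the "in one go" version could be as small as the
following sketch (hypothetical names, assuming the MCOPY loop copies whole
words only):

  // Hypothetical helper: with SizeVal total bytes and WordsCopied whole
  // words already covered by MCOPYs, the tail length falls out directly.
  #include <cassert>

  static unsigned trailingBytes(unsigned SizeVal, unsigned WordsCopied) {
    unsigned BytesLeft = SizeVal - 4 * WordsCopied;
    assert(BytesLeft < 8 && "at most one word plus a sub-word tail remains");
    return BytesLeft;
  }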

rengolin removed a subscriber: john.brawn.
pcc updated this revision to Diff 26554.May 26 2015, 4:57 PM
  • Fix Thumb-1
  • Evenly distribute registers among MCOPYs
  • Add MCOPY to ARMTargetLowering::getTargetNodeName
  • Improve comments
  • Improve test
lib/Target/ARM/ARMISelLowering.cpp
7567

Done

lib/Target/ARM/ARMSelectionDAGInfo.cpp
28

If the memcpy is short enough (and the target supports it), getMemcpyLoadsAndStores uses the extension registers, which normally ends up being shorter due to less pressure on the general-purpose registers, so I left this as is in order to take advantage of that.

(Ideally I think we should have something like a VMCOPY pseudo-instruction that lowers to VLDM/VSTM, but this seems harder to map onto the register allocator's view of the world, as VLDM/VSTM take a consecutive register range.)

67–68

The code now calculates the number of LDM/STMs we need anyway, and divides the registers evenly among them.
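
Roughly speaking, the even split amounts to something like this (a
hypothetical helper for illustration, not the code in the diff):

  #include "llvm/ADT/SmallVector.h"

  // Divide NumWords words over the minimum number of LDM/STM pairs needed,
  // keeping the per-copy register counts within one of each other.
  static void splitCopies(unsigned NumWords, unsigned MaxLoadsInLDM,
                          llvm::SmallVectorImpl<unsigned> &CopySizes) {
    unsigned NumCopies = (NumWords + MaxLoadsInLDM - 1) / MaxLoadsInLDM;
    for (unsigned I = 0; I != NumCopies; ++I)
      // Spread the remainder one word at a time over the leading copies.
      CopySizes.push_back(NumWords / NumCopies +
                          (I < NumWords % NumCopies ? 1 : 0));
  }

For the 7-word example above with MAX_LOADS_IN_LDM == 6, this gives a 4+3
split, needing 6 registers rather than 8.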

Of course that's the current behaviour as well, but lumping everything together into an MCOPY may make things harder for the register allocator, which may have had more freedom with individual loads and stores. I don't know enough about LLVM's register allocation to say whether that's actually true, or whether it ever turns out to be a problem.

I'd be surprised if it were a problem right now. The register allocator is basically solving the same problem except distributed over a smaller number of instructions.

87

I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7, i.e. after some number of MCOPYs of MAX_LOADS_IN_LDM size have been emitted we don't want a one-word LDM for the remainder, so up to 7 bytes can be left over.

Right. Now that we're distributing the registers among the MCOPYs, we save on overall register pressure when emitting the extra LDM/STM pair.

john.brawn accepted this revision.May 28 2015, 4:20 AM
john.brawn edited edge metadata.

One minor nitpick, otherwise looks good to me.

lib/Target/ARM/Thumb2SizeReduction.cpp
128–129

This comment should be updated - tSTMIA_UPD isn't the equivalent of t2STMIA, but the difference is correctly handled elsewhere.

This revision is now accepted and ready to land.May 28 2015, 4:20 AM
This revision was automatically updated to reflect the committed changes.