This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Teach the ARM Load Store Optimizer to collapse ldr/str's to ldrd/strd's
Needs Review · Public

Authored by rs on Apr 27 2015, 10:00 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This patch adds support to the ARM Load Store Optimizer to generate ldrd/strd instructions for V7-M class cores. I've adapted code from the AArch64 Load Store Optimizer to implement this optimization in the ARM Load Store Optimizer and I've kept the comments the same in some places.

This patch only collapses ldr/str pairs to ldrd/strd for V7-M; a follow-up patch will add support for generating these instruction sequences for V7-A/R class cores.

Diff Detail

Event Timeline

rs updated this revision to Diff 24484 on Apr 27 2015, 10:00 AM.
rs retitled this revision from to [ARM] Teach the ARM Load Store Optimizer to collapse ldr/str's to ldrd/strd's.
rs updated this object.
rs edited the test plan for this revision. (Show Details)

I don't know much about the ARM Load/Store Optimizer, but from looking at it there's already some machinery for generating LDRD/STRD. Why is this a separate LoadStoreToDoubleOpti function instead of being integrated into LoadStoreMultipleOpti?

Always favouring LDRD/STRD is probably a little simplistic. LDRD/STRD is better than LDM/STM in that:

  • it has more flexible addressing, so it can be used in cases where LDM/STM can't
  • it may be faster on the CPU you're compiling for (from a brief perusal of TRMs: on pre-7-A cores and Cortex-M3/M4, LDM may be faster; otherwise LDRD is at least as fast or faster)

Also: when optimizing for size we would want to use LDM if it means fewer bytes' worth of instructions.
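
A minimal sketch of how those two criteria might combine (everything here is illustrative; none of these names come from the patch under review):

    // Illustrative only: fold the size and speed considerations above into
    // one predicate. OptForSize and HasFastLDRD would come from function
    // attributes and the subtarget, respectively.
    bool shouldUseLDRD(bool OptForSize, bool HasFastLDRD, unsigned NumRegs) {
      // A single Thumb2 LDM of N registers is one 4-byte instruction,
      // whereas the same registers need N/2 LDRDs at 4 bytes each.
      if (OptForSize && NumRegs > 2)
        return false;
      // Otherwise let the target say whether LDRD is at least as fast.
      return HasFastLDRD;
    }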

lib/Target/ARM/ARMLoadStoreOptimizer.cpp
67–70

By default always using ldrd/strd without reference to whether it's faster on the target CPU sounds like a bad idea: e.g. according to the Cortex-M3 TRM, LDRD takes 3 cycles, but LDM takes 2 + (number of registers - 1).
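
Working through those numbers (taken from the TRM figures quoted above; treat them as an assumption here) makes the point concrete:

    // Cortex-M3 cycle counts as quoted above (per the TRM):
    //   one LDRD      = 3 cycles
    //   LDM of N regs = 2 + (N - 1) cycles
    unsigned ldmCycles(unsigned NumRegs)   { return 2 + (NumRegs - 1); }
    unsigned ldrdCycles(unsigned NumPairs) { return 3 * NumPairs; }
    // N = 2: LDM = 3 cycles, one LDRD = 3 cycles -- a tie.
    // N = 4: LDM = 5 cycles, two LDRDs = 6 cycles -- LDM wins on M3.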

1821–1831

Actually the even/odd restriction is an A32 restriction, not a non-M-class restriction, i.e. in 7-A/R T32 there should be no problem.
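
For reference, a small predicate capturing that encoding rule (my paraphrase of the architectural constraint, ignoring the additional PC/SP restrictions; not code from the patch):

    // A32 LDRD/STRD require an even-numbered first register and the next
    // consecutive register as the second; T32 (Thumb2) LDRD only requires
    // two distinct registers.
    bool isLegalDoublePair(bool IsT32, unsigned Rt, unsigned Rt2) {
      if (IsT32)
        return Rt != Rt2;
      return (Rt % 2 == 0) && (Rt2 == Rt + 1);
    }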

I agree with John; it would be much better if this code were an extension of the current machinery, not a copy.

I've adapted code from the AArch64 Load Store Optimizer to implement this optimization in the ARM Load Store Optimizer and I've kept the comments the same in some places.

Can you be more specific about what you adapted? Keeping the same comments in the same places is not always the correct thing to do, but more importantly, copy-and-paste is most definitely not the right thing to do.

cheers,
--renato

rengolin added inline comments on Apr 28 2015, 5:33 AM.
lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1821–1831

Certainly the wrong way. A better way would be to have a flag in TableGen (like fast-double-store or whatever). The best way would be to have a cost model, like we have for the vectorizer, but that would be a big change for this small patch.
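
Something along these lines, perhaps (the feature name and accessor are invented for illustration):

    // Hypothetical TableGen feature; in ARM.td it might read:
    //   def FeatureFastLDRD : SubtargetFeature<"fast-ldrd", "HasFastLDRD",
    //       "true", "LDRD/STRD are as fast as LDM/STM">;
    // The pass would then query the subtarget instead of assuming:
    bool shouldFormDoubles(const ARMSubtarget &STI) {
      return STI.hasFastLDRD(); // generated accessor, hypothetical name
    }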

rs added a comment on Apr 29 2015, 8:32 AM.

Hi John and Renato,

Thanks for the review comments.

I don't know much about the ARM Load/Store Optimizer, but from looking at it there's already some machinery for generating LDRD/STRD

The other machinery you're referring to is part of the ARM pre-register-allocation pass: "Pre-register allocation pass that moves loads/stores from consecutive locations closer together to make it more likely they will be combined later."
I haven't looked too much at this pass, but I do see the method 'CanFormLdStDWord', which I could use.

Why is this a separate LoadStoreToDoubleOpti function instead of being integrated into LoadStoreMultipleOpti?

I thought the code would be cleaner if this was done in a function separate from LoadStoreMultipleOpti.
The grouping algorithm I use (the one AArch64 uses) for finding ldr/str instructions to pair together is
different from what LoadStoreMultipleOpti uses. But I guess I can plug my code in inside the
if (TryMerge) { ... } region and, instead of collapsing all the loads/stores into an ldm/stm for V7-M,
iterate through the list two at a time, roughly as sketched below.
If you prefer it to be part of LoadStoreMultipleOpti then I can rework the patch to make it so.
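
That is, something of this shape inside that region (all of these names are placeholders, not the real patch code):

    // Placeholder sketch: for V7-M, walk the merge candidates pairwise
    // instead of emitting a single LDM/STM. MemOps, CanFormLdStDWord and
    // MergeToDouble stand in for the real structures and helpers.
    for (unsigned i = 0; i + 1 < MemOps.size(); i += 2) {
      MachineInstr *First  = MemOps[i];
      MachineInstr *Second = MemOps[i + 1];
      if (CanFormLdStDWord(First, Second)) // cf. the pre-RA pass's helper
        MergeToDouble(First, Second);      // emit one LDRD/STRD
    }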

Also: when optimizing for size we would want to use LDM if it means fewer bytes' worth of instructions.

Ok.

Can you be more specific about what you adapted?

The AArch64 Load/Store optimizer collapses pairs of ldr/str instructions to ldp/stp instructions.
The algorithm it uses to find pairs of ldr/str instructions and the merging step
is what I needed to get the ARM Load/Store Optimizer to collapse ldr/str instructions into ldrd/strd instructions. So what
I took from the AArch64 backend is the following:

  • AArch64LoadStoreOpt::optimizeBlock - renamed to ARMLoadStoreOpt::LoadStoreToDoubleOpti
    • main loop is almost the same, except it looks for Thumb2 ldr/str instructions to collapse
  • AArch64LoadStoreOpt::findMatchingInsn - kept as ARMLoadStoreOpt::findMatchingInsn (see the sketch below)
    • mostly the same, except I've removed the unscaled-offset checking. I also added a check for "Cortex-M3 errata 602117".
  • AArch64LoadStoreOpt::mergePairedInsns - kept as ARMLoadStoreOpt::mergePairedInsns
    • almost the same, but I removed everything to do with sign extensions that was in the AArch64 Load/Store Optimizer.
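
As a self-contained illustration of the scan findMatchingInsn performs (a simplified model, not the LLVM code; the real function also checks aliasing, liveness and register constraints):

    #include <cstdlib>
    #include <vector>

    // MemOp is a stand-in for a MachineInstr load or store.
    struct MemOp { bool IsLoad; int Base; int Offset; };

    // Look forward from Ops[I] for a partner: same kind, same base
    // register, offset exactly one word (4 bytes) away, giving up after
    // Limit instructions. Returns the partner's index, or -1 if none.
    int findMatchingInsn(const std::vector<MemOp> &Ops, size_t I,
                         size_t Limit) {
      for (size_t J = I + 1; J < Ops.size() && J - I <= Limit; ++J) {
        if (Ops[J].IsLoad == Ops[I].IsLoad && Ops[J].Base == Ops[I].Base &&
            std::abs(Ops[J].Offset - Ops[I].Offset) == 4)
          return static_cast<int>(J);
      }
      return -1;
    }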

Ideally I wouldn't have had to copy over code from the AArch64 backend but there isn't a shared code directory for the AArch64 and ARM backends.

Keeping the same comments in the same places is not always the correct thing to do

I've kept the same comments in places where I couldn't have described the behaviour of the code any better myself.

rs added inline comments on Apr 29 2015, 8:34 AM.
lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1821–1831

A better way would be to have a flag in TableGen (like fast-double-store or whatever).

OK, this approach sounds better; I'll do it this way in my next patch.

In D9298#163286, @rs wrote:

If you prefer it to be part of LoadStoreMultipleOpti then I can rework the patch to make it so.

Yes, that would be better. It looks like MergeOps is the function that gets a set of registers and then tries to generate an LDM from them. There you could put some logic to decide whether to instead break it up into a sequence of LDRDs, along the lines sketched below. It also looks like you may need to adjust MergeLDR_STR: it only collects ascending sequences, but Thumb2 LDRD doesn't require that.
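
In sketch form, the decision point inside MergeOps might look like this (ShouldUseLDRD and EmitDoubles are hypothetical; the real function builds the LDM/STM machine instruction):

    // Sketch: once MergeOps has a register set and a base, choose between
    // one LDM/STM and a run of LDRD/STRDs before the existing emission.
    if (ShouldUseLDRD(STI, Regs.size())) {
      EmitDoubles(MBB, InsertPos, Base, Regs); // register pairs -> LDRD/STRD
      return true;
    }
    // ... otherwise fall through to the current LDM/STM path ...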

lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1821–1831

There's some machinery in ARMBaseInstrInfo, e.g. getOperandLatency and getNumMicroOps, that appears to understand the timing of LDM/STM instructions. Maybe it's possible to use that plus LDRD/STRD timing information? That is, instead of putting something in TableGen and then assuming "well, LDRD is fast, so I'm going to guess that here it'll be faster than LDM", calculate what the timing of LDRD and LDM would be for loading a given set of registers and use whichever is quickest, as in the sketch below.
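
In sketch form (the cycle-count wrappers are hypothetical; the real queries would go through ARMBaseInstrInfo and the itinerary data):

    // Hypothetical cost comparison instead of a boolean "fast" flag:
    // model both timings for this register set and pick the cheaper form.
    bool preferLDRD(unsigned NumRegs) {
      unsigned LdmCycles  = cyclesForLDM(NumRegs);  // hypothetical wrapper
      unsigned LdrdCycles = cyclesForLDRD() * ((NumRegs + 1) / 2);
      return LdrdCycles < LdmCycles;
    }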