This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Introduce t2DoLoopStartTP
ClosedPublic

Authored by dmgreen on Nov 2 2020, 1:22 AM.


Details

Reviewers

samparker
samtebbs
SjoerdMeijer
efriedma
simon_tatham

Commits

rG08d1c2d4701f: [ARM] Introduce t2DoLoopStartTP

Summary

This introduces a new pseudo instruction, almost identical to a t2DoLoopStart but taking 2 parameters - the original loop iteration count needed for a low overhead loop, plus the VCTP element count needed for a DLSTP instruction setting up a tail predicated loop. The idea is that the instruction holds both values and the backend ARMLowOverheadLoops pass can pick between the two, depending on whether it creates a tail predicated loop or falls back to a low overhead loop.

To do that there needs to be something that converts a t2DoLoopStart to a t2DoLoopStartTP, for which this patch repurposes the MVEVPTOptimisationsPass as a "tail predication and vpt optimisation" pass. The extra operand for the t2DoLoopStartTP is chosen based on the operands of VCTP's in the loop, and the instruction is moved as late in the block as possible to attempt to increase the likelihood of making tail predicated loops.
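To make the two operands described in the summary concrete, here is a small self-contained model. It is not LLVM code: `LoopStartTP`, `iterationsFor` and `pickStartOperand` are hypothetical names, and the sketch assumes a 128-bit MVE vector, so a loop over 32-bit elements processes 4 lanes per iteration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a t2DoLoopStartTP: it carries both counts.
struct LoopStartTP {
  uint32_t IterCount; // iterations, as a plain low overhead loop needs
  uint32_t ElemCount; // elements, as a tail predicated loop needs
};

// Number of iterations needed to process ElemCount elements, Lanes at a time.
inline uint32_t iterationsFor(uint32_t ElemCount, uint32_t Lanes) {
  return (ElemCount + Lanes - 1) / Lanes;
}

// Model of the late decision: use the element count if tail predication
// turns out to be legal, otherwise fall back to the iteration count.
inline uint32_t pickStartOperand(const LoopStartTP &Start, bool TailPredLegal) {
  return TailPredLegal ? Start.ElemCount : Start.IterCount;
}
```

For 10 elements and 4 lanes the model holds an iteration count of 3 and an element count of 10; a tail predicated loop would start from 10, a plain low overhead loop from 3.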

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. Nov 2 2020, 1:22 AM

Herald added a project: Restricted Project. Nov 2 2020, 1:22 AM

Herald added subscribers: danielkiss, nikic, hiraditya, kristof.beyls.

dmgreen requested review of this revision. Nov 2 2020, 1:22 AM

The TP at the end of the name somewhat implies that this is only for tail predication. Would it make sense to change t2DoLoopStart to take the extra register instead?

Yeah, I was wondering which way to go with that. The t2DoLoopStartTP is meant to mean "a t2DoLoopStart that is almost certainly going to become a DLSTP". A t2DoLoopStart is for all the low overhead loops that are not expected to change into tail predicated loops. That way we could treat them differently elsewhere in the pipeline if we need to.

In D90591#2367928, @dmgreen wrote:

Yeah, I was wondering which way to go with that. The t2DoLoopStartTP is meant to mean "a t2DoLoopStart that is almost certainly going to become a DLSTP". A t2DoLoopStart is for all the low overhead loops that are not expected to change into tail predicated loops. That way we could treat them differently elsewhere in the pipeline if we need to.

That makes sense 👍 Do you envisage us moving all of the TP validation to the MVEVPTOptimisationsPass and then only doing the actual codegen in ARMLowOverheadLoops?

That makes sense 👍 Do you envisage us moving all of the TP validation to the MVEVPTOptimisationsPass and then only doing the actual codegen in ARMLowOverheadLoops?

I was thinking of putting some bits here, there is some stuff about terminators that produce values that I was hoping to do here too. Also something about ensuring lr is used as a predicate. We can't do everything at this early stage as we have to be very careful about being able to handle loops that end up being reverted. Some things are better to do pre-RA though, like the changes here, and let the backend pass do what it can from there like removing the unused arg.

This sounds like a good idea to me, but if we're going to do stuff earlier, to make our lives easier, maybe we should do this at the IR level and use a target-specific intrinsic and free ourselves completely from searching at the MI level?

Searching at the pre-RA level seems simple enough so long as we are in SSA form, and there are extra things I would like to add at the same point if I can make sure they work. Doing those post-ISel has some benefits in making sure we know the instructions we are looking at. Things like reverting calls are easier to do at this point (as it's easier to make sure we catch all calls).

Things like reverting calls are easier to do at this point.

What calls do you mean?
What specifically do you need from the pre-RA MachineInstr form that means this shouldn't be done at the IR level? The search here is certainly not simpler or more manageable than doing this in MVETailPredication, where the components are also found, and handling phis at this level is just an added pain.

Sorry, I missed your comment.

This code seemed simple enough to me, perhaps because I wrote essentially the same thing a few times. We are just looking through vregs for the PHI's. There may be COPY's in the way but they are simple enough to deal with. It's all just SSA form still.
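The "look through vregs for the PHI, skipping COPYs" search can be sketched on a toy SSA model. This is only an illustration under stated assumptions: `Kind`, `Def`, `DefMap` and `lookThroughCopiesToPhi` are hypothetical names rather than the MachineInstr API, virtual registers are plain non-zero integers, and 0 is used to mean "not found".

```cpp
#include <cassert>
#include <unordered_map>

// Toy SSA model: every virtual register has exactly one defining instruction.
enum class Kind { Phi, Copy, Other };

struct Def {
  Kind K;
  unsigned Src; // source vreg, only meaningful for Copy
};

using DefMap = std::unordered_map<unsigned, Def>;

// Follow COPYs from Reg until a PHI is found. Returns the vreg defined by
// the PHI, or 0 if the chain ends in anything else. Assumes well-formed SSA,
// so a chain of COPYs alone cannot form a cycle.
inline unsigned lookThroughCopiesToPhi(const DefMap &Defs, unsigned Reg) {
  for (;;) {
    auto It = Defs.find(Reg);
    if (It == Defs.end())
      return 0; // no known definition
    if (It->second.K == Kind::Phi)
      return Reg;
    if (It->second.K != Kind::Copy)
      return 0; // some other instruction is in the way
    Reg = It->second.Src; // step through the COPY
  }
}
```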

The other things I would like to put here start with combining t2LoopDec and t2LoopEnd into a single instruction. A t2LoopEndDec is what I've called it. That's a terminator that produces a value, but it seems to work OK in the testing I've been trying, given a few adjustments. I was going the "they can never spill" approach, which I've not found any problems with so long as there are no calls in the loop (or inline asm). It's obviously more accurate to determine what was made into a call as opposed to trying to guess at what will become a call from pre-ISel. And if we get it wrong, the compiler would crash, so we do have to be a bit careful. It is using a lot of the same code as this, and also removing COPYs from the loop, which seemed generally useful.

Unfortunately I found out the AMDGPU backend also has terminators that produce values, but they work differently. I will have to figure something out there.

I was also going to add an LR predicate to MVE instructions to make sure they would never go wrong, done in the same place hopefully. That's another one that involves updating 100's of tests unfortunately.
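As a rough illustration of what "a terminator that produces a value" means for the proposed t2LoopEndDec, the sketch below models the separate decrement and branch against a combined form. It is a toy model and not the pseudo instructions themselves: the function names are hypothetical, and it assumes the loop branches back while the remaining count is non-zero.

```cpp
#include <cassert>
#include <cstdint>

// Result of the combined form: the new count plus the branch decision.
struct LoopEndDecResult {
  uint32_t NewCount;
  bool BranchBack;
};

// Separate form: the decrement defines a value, the terminator only reads it.
inline uint32_t loopDec(uint32_t Count, uint32_t Size) { return Count - Size; }
inline bool loopEnd(uint32_t Count) { return Count != 0; }

// Combined form: one terminator both defines the new count and decides
// whether to branch back to the loop header.
inline LoopEndDecResult loopEndDec(uint32_t Count, uint32_t Size) {
  uint32_t New = loopDec(Count, Size);
  return {New, loopEnd(New)};
}
```

Because the combined form defines the count at the very end of the block, there is no room after it to insert a spill, which is why the "never spill" question comes up in the discussion.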

I would like to put here start with combining t2LoopDec and t2LoopEnd

Ok, fair enough.

t2LoopEndDec is what I've called it.

It's a good name.

I was going the "they can never spill" approach

And how is this done? Is it a generic codegen change and that's why AMDGPU is problematic?

I was also going to add an LR predicate to MVE instructions to make sure they would never go wrong

I'm not sure what you mean, are you talking about adding another optional register operand?

A bit late, but have a high-level comment about this:

repurposes the MVEVPTOptimisationsPass as a "tail predication and vpt optimisation" pass

This feels messy because we are doing tail-predication now in 2 different places, and that doesn't sound ideal. Happily unaware of some of the details here, but I am wondering if this is necessary?

And how is this done? Is it a generic codegen change and that's why AMDGPU is problematic?

There are just some changes in phi elimination and the register allocator to make a terminator that produces a value never spill. I was thinking that this was not something that already existed, so we could invent some semantics for it - only use it very carefully in cases we know are going to be OK. Unfortunately the AMDGPU backend is already using it in a different way, where they have pseudo copy terminators for spills. I have to take a look at what we can do there. It also might just be a bad idea in general to say that these can never spill, but I'd like to try it if we can and I've not seen any problems with all the testing I've run.

I'm not sure what you mean, are you talking about adding another optional register operand?

Yep. As you might imagine a lot of tests have changed. But it will make sure that the condition from a VPT block is never wrong, that we don't spill lr around an MVE instruction that would be using it as the predicate.

With the MVE instructions taking LR, maybe you wouldn't need to make the terminator change, have you tried it yet? I guess I'm naively assuming that the register allocator will be much less likely to spill something that is used by most / all of the MVE instructions within the loop.

In D90591#2375686, @samparker wrote:

With the MVE instructions taking LR, maybe you wouldn't need to make the terminator change, have you tried it yet? I guess I'm naively assuming that the register allocator will be much less likely to spill something that is used by most / all of the MVE instructions within the loop.

"Much less likely" isn't a fantastic answer for something that we should really never be reverting, unfortunately. You can always have a number of MVE instructions at the start of the loop followed by a load of scalar instructions. I think when I had it that way in experimental patches there were still some reverts happening.

That's all about different patches though! Any comments on this one? If we can get this in, I can commit a few others without breaking performance on some important machine learning matrix multiply kernels, which would be good.

That's all about different patches though! Any comments on this one?

Sorry, it's just difficult to gauge this without a view on the wider implementation idea. For the other stuff that you're talking about, of course we need that in MI but, like I said at the start, we could currently achieve this patch with probably 10x less code at the IR level.

we could currently achieve this patch with probably 10x less code at the IR level.

10x sounds like a lot. I think that the code would still need a t2DoLoopStartTP instruction, so those changes would all have to stay. The findLoopComponents is used in other patches - that is what they are built upon and seems generally useful so that would have to stay, along with the "loop" boilerplate changes for that pass. That just leaves ConvertTailPredLoop. If we did it pre-ISel then we would need an extra intrinsic for the space of time between where we convert it and ISel, plus the lowering in ISel. It would still need to recognize the instruction and do some checks for things like VCTPs being the same, find the incoming phi value for the loop, then create the new instruction like is done here. This also moves the instruction to the end of the block, which we would need to either find a new place for or it would lead to worse performance.

I don't think I see a lot of difference in it, and would prefer not to add a new intrinsic if we can avoid it.
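The "VCTPs being the same" check mentioned above could look roughly like the following. This is a minimal sketch on toy data, not the code from the patch: `VCTPInfo` and `commonElementCount` are hypothetical names, and registers are modelled as plain integers.

```cpp
#include <cassert>
#include <vector>

// Toy description of a VCTP found in the loop; not an LLVM type.
struct VCTPInfo {
  unsigned Lanes;    // 16, 8, 4 or 2 for VCTP8/16/32/64
  unsigned CountReg; // vreg holding the remaining element count
};

// Returns true and sets CountReg if every VCTP agrees on both the lane count
// and the element-count operand. Returns false otherwise, in which case the
// loop start would be left as a plain t2DoLoopStart.
inline bool commonElementCount(const std::vector<VCTPInfo> &VCTPs,
                               unsigned &CountReg) {
  if (VCTPs.empty())
    return false;
  const VCTPInfo &First = VCTPs.front();
  for (const VCTPInfo &V : VCTPs)
    if (V.Lanes != First.Lanes || V.CountReg != First.CountReg)
      return false;
  CountReg = First.CountReg;
  return true;
}
```

The register found this way is what would become the extra operand of the t2DoLoopStartTP in this model.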

Rebase and fix up some comments I noticed reading this again.

Remove some debug asserts that shouldn't be necessary.

I think that the code would still need a t2DoLoopStartTP instruction, so those changes would all have to stay

Agreed, and I think this pseudo is a very good idea.

The findLoopComponents is used in other patches

But I think it could be then added with those patches when needed.

AFAICT the search you're describing would be trivial in MVETailPredication now that you've, almost, added start_loop_iterations - and that's my problem. I don't understand why adding a target-specific intrinsic and a tablegen pattern to match it would not be the favoured approach?

samparker added inline comments. Nov 9 2020, 4:03 AM

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
442	This change highlights how many times we query the LoopStart opcode, and it looks worthwhile to have this IsDo as a little helper!
627–628	Call getLoopStartOperand instead?
1444–1445	Can use getLoopStartOperand again.
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
189	What about only checking that it's not predicated in the case where there's more than one VCTP? At least then we can handle the VPT -> VCTP case.
230	I don't follow what's happening here, what uses could there be which we need to schedule for?

Added IsDo and altered the code about checking predicates.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
189	Would the backend pass handle multiple VCTP's in different blocks? If so we could just remove the check. I've done that here, but can add it back in if you think it might cause problems.
230	There can be COPY's between the LoopStart and the PHI.

samparker accepted this revision. Nov 9 2020, 7:46 AM

samparker added inline comments.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
189	Yep, this should be fine. The backend can handle any number of VCTPs and VPT blocks.
llvm/test/CodeGen/Thumb2/mve-vldshuffle.ll
169 ↗	(On Diff #303867)	How was the register allocator getting this so wrong?! I wonder if there's some logic missing, or some priorities need to be re-ordered, to handle the case of register classes with only one register.

This revision is now accepted and ready to land. Nov 9 2020, 7:46 AM

Closed by commit rG08d1c2d4701f: [ARM] Introduce t2DoLoopStartTP (authored by dmgreen). Nov 10 2020, 10:08 AM

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG08d1c2d4701f: [ARM] Introduce t2DoLoopStartTP.

pirama added a subscriber: pirama. Nov 10 2020, 11:27 AM

pirama added inline comments.

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
234	This variable seems unused outside of debugging purposes:

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp:234:23: warning: unused variable 'MI' [-Wunused-variable]
  MachineInstrBuilder MI = BuildMI(*MBB, InsertPt, LoopStart->getDebugLoc(),
                      ^
1 warning generated.

Ah, Thanks! I'll try and fix that up now
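For readers unfamiliar with this class of warning: a variable that is only read inside a debug-only macro becomes unused once that macro expands to nothing in release builds. One common idiom for silencing it is a cast to void, sketched below. This is not necessarily how the follow-up fix was done; `SKETCH_DEBUG`, `buildInstr` and `convertLoop` are hypothetical stand-ins for LLVM_DEBUG, BuildMI and the surrounding function.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in for LLVM_DEBUG: expands to nothing when NDEBUG is defined.
#ifndef NDEBUG
#define SKETCH_DEBUG(X) do { X; } while (false)
#else
#define SKETCH_DEBUG(X) do { } while (false)
#endif

// Hypothetical stand-in for BuildMI(...): returns a handle to the new
// instruction.
inline int buildInstr(int Opcode) { return Opcode + 1; }

inline bool convertLoop(int Opcode) {
  int MI = buildInstr(Opcode);
  (void)MI; // only read inside SKETCH_DEBUG, so silence -Wunused-variable
  SKETCH_DEBUG(std::fprintf(stderr, "Inserted instr %d\n", MI));
  return true;
}
```

An alternative with the same effect is to drop the variable and call the builder for its side effect only, moving any debug printing elsewhere.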

dmgreen mentioned this in D103236: [ARM] Introduce t2WhileLoopStartTP. May 27 2021, 3:43 AM

dmgreen mentioned this in rGbee2f618d599: [ARM] Introduce t2WhileLoopStartTP. Jun 13 2021, 5:56 AM

Revision Contents

Path	Size
llvm/
lib/
Target/
ARM/
ARMBaseInstrInfo.h	1 line
ARMBaseInstrInfo.cpp	4 lines
ARMInstrThumb2.td	3 lines
ARMLowOverheadLoops.cpp	176 lines
MVEVPTOptimisationsPass.cpp	210 lines
test/
CodeGen/
ARM/
O3-pipeline.ll	3 lines
Thumb2/
LowOverheadLoops/
exitcount.ll	2 lines
mov-operand.ll	2 lines
mve-tail-data-types.ll	12 lines
reductions.ll	4 lines
mve-fma-loops.ll	69 lines
mve-gather-scatter-tailpred.ll	8 lines
mve-postinc-dct.ll	113 lines
mve-postinc-lsr.ll	20 lines

Diff 304244

llvm/lib/Target/ARM/ARMBaseInstrInfo.h

   case ARM::MVE_VCTP64:
     return true;
   }
   return false;
 }

 static inline
 bool isLoopStart(MachineInstr &MI) {
   return MI.getOpcode() == ARM::t2DoLoopStart ||
+         MI.getOpcode() == ARM::t2DoLoopStartTP ||
          MI.getOpcode() == ARM::t2WhileLoopStart;
 }

 static inline
 bool isCondBranchOpcode(int Opc) {
   return Opc == ARM::Bcc || Opc == ARM::tBcc || Opc == ARM::t2Bcc;
 }

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp

   if (Opc == ARM::tPICADD || Opc == ARM::PICADD || Opc == ARM::PICSTR ||
       Opc == ARM::PICLDRB || Opc == ARM::PICLDRH || Opc == ARM::PICLDRSB ||
       Opc == ARM::PICLDRSH || Opc == ARM::t2LDRpci_pic ||
       Opc == ARM::t2MOVi16_ga_pcrel || Opc == ARM::t2MOVTi16_ga_pcrel ||
       Opc == ARM::t2MOV_ga_pcrel)
     return outliner::InstrType::Illegal;

   // Be conservative with ARMv8.1 MVE instructions.
   if (Opc == ARM::t2BF_LabelPseudo || Opc == ARM::t2DoLoopStart ||
-      Opc == ARM::t2WhileLoopStart || Opc == ARM::t2LoopDec ||
-      Opc == ARM::t2LoopEnd)
+      Opc == ARM::t2DoLoopStartTP || Opc == ARM::t2WhileLoopStart ||
+      Opc == ARM::t2LoopDec || Opc == ARM::t2LoopEnd)
     return outliner::InstrType::Illegal;

   const MCInstrDesc &MCID = MI.getDesc();
   uint64_t MIFlags = MCID.TSFlags;
   if ((MIFlags & ARMII::DomainMask) == ARMII::DomainMVE)
     return outliner::InstrType::Illegal;

   // Is this a terminator for a basic block?

llvm/lib/Target/ARM/ARMInstrThumb2.td

 let Predicates = [IsThumb2, HasV8_1MMainline, HasLOB] in {

 let usesCustomInserter = 1 in
 def t2DoLoopStart :
   t2PseudoInst<(outs GPRlr:$X), (ins rGPR:$elts), 4, IIC_Br,
   [(set GPRlr:$X, (int_start_loop_iterations rGPR:$elts))]>;

+def t2DoLoopStartTP :
+  t2PseudoInst<(outs GPRlr:$X), (ins rGPR:$elts, rGPR:$count), 4, IIC_Br, []>;
+
 let hasSideEffects = 0 in
 def t2LoopDec :
   t2PseudoInst<(outs GPRlr:$Rm), (ins GPRlr:$Rn, imm0_7:$size),
                4, IIC_Br, []>, Sched<[WriteBr]>;

 let isBranch = 1, isTerminator = 1, hasSideEffects = 1, Defs = [CPSR] in {
 // Set WhileLoopStart and LoopEnd to occupy 8 bytes because they may
 // get converted into t2CMP and t2Bcc.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp

@@ static bool isDomainMVE(MachineInstr *MI) {
   return Domain == ARMII::DomainMVE;
 }

 static bool shouldInspect(MachineInstr &MI) {
   return isDomainMVE(&MI) || isVectorPredicate(&MI) ||
     hasVPRUse(&MI);
 }

+static bool isDo(MachineInstr *MI) {
+  return MI->getOpcode() != ARM::t2WhileLoopStart;
+}
+
 namespace {

 using InstSet = SmallPtrSetImpl<MachineInstr *>;

 class PostOrderLoopTraversal {
   MachineLoop &ML;
   MachineLoopInfo &MLI;
   SmallPtrSet<MachineBasicBlock*, 4> Visited;
@@ SmallVectorImpl<VPTState> &getVPTBlocks() {
       return VPTState::Blocks;
     }

     // Return the operand for the loop start instruction. This will be the loop
     // iteration count, or the number of elements if we're tail predicating.
     MachineOperand &getLoopStartOperand() {
       if (IsTailPredicationLegal())
         return TPNumElements;
-      return Start->getOpcode() == ARM::t2DoLoopStart ? Start->getOperand(1)
-                                                      : Start->getOperand(0);
+      return isDo(Start) ? Start->getOperand(1) : Start->getOperand(0);
     }

     unsigned getStartOpcode() const {
-      bool IsDo = Start->getOpcode() == ARM::t2DoLoopStart;
+      bool IsDo = isDo(Start);
       if (!IsTailPredicationLegal())
         return IsDo ? ARM::t2DLS : ARM::t2WLS;

       return VCTPOpcodeToLSTP(VCTPs.back()->getOpcode(), IsDo);
     }

     void dump() const {
       if (Start) dbgs() << "ARM Loops: Found Loop Start: " << *Start;
@@ bool LowOverheadLoop::ValidateTailPredicate() {
   if (!ValidateLiveOuts()) {
     LLVM_DEBUG(dbgs() << "ARM Loops: Invalid live outs.\n");
     return false;
   }

   // Check that creating a [W|D]LSTP, which will define LR with an element
   // count instead of iteration count, won't affect any other instructions
   // than the LoopStart and LoopDec.
   // TODO: We should try to insert the [W|D]LSTP after any of the other uses.
-  Register StartReg = Start->getOpcode() == ARM::t2DoLoopStart
-                          ? Start->getOperand(1).getReg()
-                          : Start->getOperand(0).getReg();
+  Register StartReg = isDo(Start) ? Start->getOperand(1).getReg()
+                                  : Start->getOperand(0).getReg();
   if (StartInsertPt == Start && StartReg == ARM::LR) {
-    if (auto *IterCount = RDA.getMIOperand(
-            Start, Start->getOpcode() == ARM::t2DoLoopStart ? 1 : 0)) {
+    if (auto *IterCount = RDA.getMIOperand(Start, isDo(Start) ? 1 : 0)) {
       SmallPtrSet<MachineInstr *, 2> Uses;
       RDA.getGlobalUses(IterCount, MCRegister::from(ARM::LR), Uses);
       for (auto *Use : Uses) {
         if (Use != Start && Use != Dec) {
           LLVM_DEBUG(dbgs() << " ARM Loops: Found LR use: " << *Use);
           return false;
         }
       }
     }
   }

   // For tail predication, we need to provide the number of elements, instead
   // of the iteration count, to the loop start instruction. The number of
   // elements is provided to the vctp instruction, so we need to check that
   // we can use this register at InsertPt.
   MachineInstr *VCTP = VCTPs.back();
+  if (Start->getOpcode() == ARM::t2DoLoopStartTP) {
+    TPNumElements = Start->getOperand(2);
+    StartInsertPt = Start;
+    StartInsertBB = Start->getParent();
+  } else {
     TPNumElements = VCTP->getOperand(1);
     MCRegister NumElements = TPNumElements.getReg().asMCReg();

     // If the register is defined within loop, then we can't perform TP.
     // TODO: Check whether this is just a mov of a register that would be
     // available.
     if (RDA.hasLocalDefBefore(VCTP, NumElements)) {
       LLVM_DEBUG(dbgs() << "ARM Loops: VCTP operand is defined in the loop.\n");
       return false;
     }

     // The element count register maybe defined after InsertPt, in which case we
     // need to try to move either InsertPt or the def so that the [w|d]lstp can
     // use the value.

     if (StartInsertPt != StartInsertBB->end() &&
         !RDA.isReachingDefLiveOut(&*StartInsertPt, NumElements)) {
-      if (auto *ElemDef = RDA.getLocalLiveOutMIDef(StartInsertBB, NumElements)) {
+      if (auto *ElemDef =
+              RDA.getLocalLiveOutMIDef(StartInsertBB, NumElements)) {
         if (RDA.isSafeToMoveForwards(ElemDef, &*StartInsertPt)) {
           ElemDef->removeFromParent();
           StartInsertBB->insert(StartInsertPt, ElemDef);
-          LLVM_DEBUG(dbgs() << "ARM Loops: Moved element count def: "
-                            << *ElemDef);
+          LLVM_DEBUG(dbgs()
+                     << "ARM Loops: Moved element count def: " << *ElemDef);
         } else if (RDA.isSafeToMoveBackwards(&*StartInsertPt, ElemDef)) {
           StartInsertPt->removeFromParent();
           StartInsertBB->insertAfter(MachineBasicBlock::iterator(ElemDef),
                                      &*StartInsertPt);
           LLVM_DEBUG(dbgs() << "ARM Loops: Moved start past: " << *ElemDef);
         } else {
           // If we fail to move an instruction and the element count is provided
           // by a mov, use the mov operand if it will have the same value at the
           // insertion point
           MachineOperand Operand = ElemDef->getOperand(1);
           if (isMovRegOpcode(ElemDef->getOpcode()) &&
               RDA.getUniqueReachingMIDef(ElemDef, Operand.getReg().asMCReg()) ==
                   RDA.getUniqueReachingMIDef(&*StartInsertPt,
                                              Operand.getReg().asMCReg())) {
             TPNumElements = Operand;
             NumElements = TPNumElements.getReg();
           } else {
             LLVM_DEBUG(dbgs()
                        << "ARM Loops: Unable to move element count to loop "
                        << "start instruction.\n");
             return false;
           }
         }
       }
     }

-  // Could inserting the [W|D]LSTP cause some unintended affects? In a perfect
-  // world the [w|d]lstp instruction would be last instruction in the preheader
-  // and so it would only affect instructions within the loop body. But due to
-  // scheduling, and/or the logic in this pass (above), the insertion point can
-  // be moved earlier. So if the Loop Start isn't the last instruction in the
-  // preheader, and if the initial element count is smaller than the vector
-  // width, the Loop Start instruction will immediately generate one or more
-  // false lane mask which can, incorrectly, affect the proceeding MVE
-  // instructions in the preheader.
-  auto CannotInsertWDLSTPBetween = [](MachineBasicBlock::iterator I,
-                                      MachineBasicBlock::iterator E) {
-    for (; I != E; ++I) {
-      if (shouldInspect(*I)) {
-        LLVM_DEBUG(dbgs() << "ARM Loops: Instruction blocks [W|D]LSTP"
-                          << " insertion: " << *I);
-        return true;
-      }
-    }
-    return false;
-  };
-
-  if (CannotInsertWDLSTPBetween(StartInsertPt, StartInsertBB->end()))
-    return false;
-
     // Especially in the case of while loops, InsertBB may not be the
     // preheader, so we need to check that the register isn't redefined
     // before entering the loop.
     auto CannotProvideElements = [this](MachineBasicBlock *MBB,
                                         MCRegister NumElements) {
       if (MBB->empty())
         return false;
       // NumElements is redefined in this block.
       if (RDA.hasLocalDefBefore(&MBB->back(), NumElements))
         return true;

       // Don't continue searching up through multiple predecessors.
       if (MBB->pred_size() > 1)
         return true;

       return false;
     };

     // Search backwards for a def, until we get to InsertBB.
     MachineBasicBlock *MBB = Preheader;
     while (MBB && MBB != StartInsertBB) {
       if (CannotProvideElements(MBB, NumElements)) {
         LLVM_DEBUG(dbgs() << "ARM Loops: Unable to provide element count.\n");
         return false;
       }
       MBB = *MBB->pred_begin();
     }
+  }
+
+  // Could inserting the [W|D]LSTP cause some unintended affects? In a perfect
+  // world the [w|d]lstp instruction would be last instruction in the preheader
+  // and so it would only affect instructions within the loop body. But due to
+  // scheduling, and/or the logic in this pass (above), the insertion point can
+  // be moved earlier. So if the Loop Start isn't the last instruction in the
+  // preheader, and if the initial element count is smaller than the vector
+  // width, the Loop Start instruction will immediately generate one or more
+  // false lane mask which can, incorrectly, affect the proceeding MVE
+  // instructions in the preheader.
+  auto CannotInsertWDLSTPBetween = [](MachineBasicBlock::iterator I,
+                                      MachineBasicBlock::iterator E) {
+    for (; I != E; ++I) {
+      if (shouldInspect(*I)) {
+        LLVM_DEBUG(dbgs() << "ARM Loops: Instruction blocks [W|D]LSTP"
+                          << " insertion: " << *I);
+        return true;
+      }
+    }
+    return false;
+  };
+
+  if (CannotInsertWDLSTPBetween(StartInsertPt, StartInsertBB->end()))
+    return false;

   // Check that the value change of the element count is what we expect and
   // that the predication will be equivalent. For this we need:
   // NumElements = NumElements - VectorWidth. The sub will be a sub immediate
   // and we can also allow register copies within the chain too.
   auto IsValidSub = [](MachineInstr *MI, int ExpectedVecWidth) {
     return -getAddSubImmediate(*MI) == ExpectedVecWidth;
   };

-  MBB = VCTP->getParent();
+  MachineBasicBlock *MBB = VCTP->getParent();
   // Remove modifications to the element count since they have no purpose in a
   // tail predicated loop. Explicitly refer to the vctp operand no matter which
   // register NumElements has been assigned to, since that is what the
   // modifications will be using
   if (auto *Def = RDA.getUniqueReachingMIDef(
       &MBB->back(), VCTP->getOperand(1).getReg().asMCReg())) {
     SmallPtrSet<MachineInstr*, 2> ElementChain;
     SmallPtrSet<MachineInstr*, 2> Ignore;
[...]	void LowOverheadLoop::Validate(ARMBasicBlockUtils *BBUtils) {
// be able to safely define LR.		// be able to safely define LR.
auto FindStartInsertionPoint = [](MachineInstr *Start, MachineInstr *Dec,		auto FindStartInsertionPoint = [](MachineInstr *Start, MachineInstr *Dec,
MachineBasicBlock::iterator &InsertPt,		MachineBasicBlock::iterator &InsertPt,
MachineBasicBlock *&InsertBB,		MachineBasicBlock *&InsertBB,
ReachingDefAnalysis &RDA,		ReachingDefAnalysis &RDA,
InstSet &ToRemove) {		InstSet &ToRemove) {
// For a t2DoLoopStart it is always valid to use the start insertion point.		// For a t2DoLoopStart it is always valid to use the start insertion point.
// For WLS we can define LR if LR already contains the same value.		// For WLS we can define LR if LR already contains the same value.
if (Start->getOpcode() == ARM::t2DoLoopStart ||		if (isDo(Start) || Start->getOperand(0).getReg() == ARM::LR) {
Start->getOperand(0).getReg() == ARM::LR) {
InsertPt = MachineBasicBlock::iterator(Start);		InsertPt = MachineBasicBlock::iterator(Start);
InsertBB = Start->getParent();		InsertBB = Start->getParent();
return true;		return true;
}		}

// We've found no suitable LR def and Start doesn't use LR directly. Can we		// We've found no suitable LR def and Start doesn't use LR directly. Can we
// just define LR anyway?		// just define LR anyway?
if (!RDA.isSafeToDefRegAt(Start, MCRegister::from(ARM::LR)))		if (!RDA.isSafeToDefRegAt(Start, MCRegister::from(ARM::LR)))
[...]
// see the comment below how this chain could look like.		// see the comment below how this chain could look like.
//		//
void ARMLowOverheadLoops::IterationCountDCE(LowOverheadLoop &LoLoop) {		void ARMLowOverheadLoops::IterationCountDCE(LowOverheadLoop &LoLoop) {
if (!LoLoop.IsTailPredicationLegal())		if (!LoLoop.IsTailPredicationLegal())
return;		return;

LLVM_DEBUG(dbgs() << "ARM Loops: Trying DCE on loop iteration count.\n");		LLVM_DEBUG(dbgs() << "ARM Loops: Trying DCE on loop iteration count.\n");

MachineInstr *Def = RDA->getMIOperand(		MachineInstr *Def =
LoLoop.Start, LoLoop.Start->getOpcode() == ARM::t2DoLoopStart ? 1 : 0);		RDA->getMIOperand(LoLoop.Start, isDo(LoLoop.Start) ? 1 : 0);
		samparker: Can use getLoopStartOperand again.
if (!Def) {		if (!Def) {
LLVM_DEBUG(dbgs() << "ARM Loops: Couldn't find iteration count.\n");		LLVM_DEBUG(dbgs() << "ARM Loops: Couldn't find iteration count.\n");
return;		return;
}		}

// Collect and remove the users of iteration count.		// Collect and remove the users of iteration count.
SmallPtrSet<MachineInstr*, 4> Killed = { LoLoop.Start, LoLoop.Dec,		SmallPtrSet<MachineInstr*, 4> Killed = { LoLoop.Start, LoLoop.Dec,
LoLoop.End };		LoLoop.End };
if (!TryRemove(Def, *RDA, LoLoop.ToRemove, Killed))		if (!TryRemove(Def, *RDA, LoLoop.ToRemove, Killed))
LLVM_DEBUG(dbgs() << "ARM Loops: Unsafe to remove loop iteration count.\n");		LLVM_DEBUG(dbgs() << "ARM Loops: Unsafe to remove loop iteration count.\n");
}		}

MachineInstr* ARMLowOverheadLoops::ExpandLoopStart(LowOverheadLoop &LoLoop) {		MachineInstr* ARMLowOverheadLoops::ExpandLoopStart(LowOverheadLoop &LoLoop) {
LLVM_DEBUG(dbgs() << "ARM Loops: Expanding LoopStart.\n");		LLVM_DEBUG(dbgs() << "ARM Loops: Expanding LoopStart.\n");
// When using tail-predication, try to delete the dead code that was used to		// When using tail-predication, try to delete the dead code that was used to
// calculate the number of loop iterations.		// calculate the number of loop iterations.
IterationCountDCE(LoLoop);		IterationCountDCE(LoLoop);

MachineBasicBlock::iterator InsertPt = LoLoop.StartInsertPt;		MachineBasicBlock::iterator InsertPt = LoLoop.StartInsertPt;
MachineInstr *Start = LoLoop.Start;		MachineInstr *Start = LoLoop.Start;
MachineBasicBlock *MBB = LoLoop.StartInsertBB;		MachineBasicBlock *MBB = LoLoop.StartInsertBB;
bool IsDo = Start->getOpcode() == ARM::t2DoLoopStart;
unsigned Opc = LoLoop.getStartOpcode();		unsigned Opc = LoLoop.getStartOpcode();
MachineOperand &Count = LoLoop.getLoopStartOperand();		MachineOperand &Count = LoLoop.getLoopStartOperand();

MachineInstrBuilder MIB =		MachineInstrBuilder MIB =
BuildMI(*MBB, InsertPt, Start->getDebugLoc(), TII->get(Opc));		BuildMI(*MBB, InsertPt, Start->getDebugLoc(), TII->get(Opc));

MIB.addDef(ARM::LR);		MIB.addDef(ARM::LR);
MIB.add(Count);		MIB.add(Count);
if (!IsDo)		if (!isDo(Start))
MIB.add(Start->getOperand(1));		MIB.add(Start->getOperand(1));

LoLoop.ToRemove.insert(Start);		LoLoop.ToRemove.insert(Start);
LLVM_DEBUG(dbgs() << "ARM Loops: Inserted start: " << *MIB);		LLVM_DEBUG(dbgs() << "ARM Loops: Inserted start: " << *MIB);
return &*MIB;		return &*MIB;
}		}
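
The summary describes t2DoLoopStartTP as holding both the iteration count and the VCTP element count, with the backend pass picking one when it expands the loop start. A minimal standalone sketch of that choice follows. `ToyLoopStartTP`, `ToyStart` and `expandStart` are hypothetical names for illustration only; they model the decision and are not LLVM API.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a t2DoLoopStartTP pseudo: it carries both counts.
struct ToyLoopStartTP {
  unsigned IterCountReg; // register holding the loop iteration count
  unsigned ElemCountReg; // register holding the VCTP element count
};

// Hypothetical description of the real instruction that gets emitted.
struct ToyStart {
  std::string Opcode;
  unsigned CountReg;
};

// Pick the opcode and the count operand: a tail predicated loop consumes the
// element count, a plain low overhead loop consumes the iteration count.
ToyStart expandStart(const ToyLoopStartTP &Start, bool TailPredicate) {
  if (TailPredicate)
    return {"dlstp", Start.ElemCountReg};
  return {"dls", Start.IterCountReg};
}
```

In this model, falling back from tail predication loses nothing, because the iteration count is still available on the same pseudo instruction.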

void ARMLowOverheadLoops::ConvertVPTBlocks(LowOverheadLoop &LoLoop) {		void ARMLowOverheadLoops::ConvertVPTBlocks(LowOverheadLoop &LoLoop) {
[...]

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp

//===-- MVEVPTOptimisationsPass.cpp ---------------------------------------===//		//===-- MVEVPTOptimisationsPass.cpp ---------------------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
/// \file This pass does a few optimisations related to MVE VPT blocks before		/// \file This pass does a few optimisations related to Tail predicated loops
/// register allocation is performed. The goal is to maximize the sizes of the		/// and MVE VPT blocks before register allocation is performed. For VPT blocks
/// blocks that will be created by the MVE VPT Block Insertion pass (which runs		/// the goal is to maximize the sizes of the blocks that will be created by the
/// after register allocation). The first optimisation done by this pass is the		/// MVE VPT Block Insertion pass (which runs after register allocation). For
/// replacement of "opposite" VCMPs with VPNOTs, so the Block Insertion pass		/// tail predicated loops we transform the loop into something that will
/// can delete them later to create larger VPT blocks.		/// hopefully make the backend ARMLowOverheadLoops pass's job easier.
/// The second optimisation replaces re-uses of old VCCR values with VPNOTs when		///
/// inside a block of predicated instructions. This is done to avoid
/// spill/reloads of VPR in the middle of a block, which prevents the Block
/// Insertion pass from creating large blocks.
//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "ARM.h"		#include "ARM.h"
#include "ARMSubtarget.h"		#include "ARMSubtarget.h"
#include "MCTargetDesc/ARMBaseInfo.h"		#include "MCTargetDesc/ARMBaseInfo.h"
#include "Thumb2InstrInfo.h"		#include "Thumb2InstrInfo.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
		#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"		#include "llvm/CodeGen/MachineInstr.h"
		#include "llvm/CodeGen/MachineLoopInfo.h"
		#include "llvm/InitializePasses.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include <cassert>		#include <cassert>

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "arm-mve-vpt-opts"		#define DEBUG_TYPE "arm-mve-vpt-opts"

namespace {		namespace {
class MVEVPTOptimisations : public MachineFunctionPass {		class MVEVPTOptimisations : public MachineFunctionPass {
public:		public:
static char ID;		static char ID;
const Thumb2InstrInfo *TII;		const Thumb2InstrInfo *TII;
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;

MVEVPTOptimisations() : MachineFunctionPass(ID) {}		MVEVPTOptimisations() : MachineFunctionPass(ID) {}

bool runOnMachineFunction(MachineFunction &Fn) override;		bool runOnMachineFunction(MachineFunction &Fn) override;

		void getAnalysisUsage(AnalysisUsage &AU) const override {
		AU.addRequired<MachineLoopInfo>();
		AU.addPreserved<MachineLoopInfo>();
		AU.addRequired<MachineDominatorTree>();
		AU.addPreserved<MachineDominatorTree>();
		MachineFunctionPass::getAnalysisUsage(AU);
		}

StringRef getPassName() const override {		StringRef getPassName() const override {
return "ARM MVE VPT Optimisation Pass";		return "ARM MVE TailPred and VPT Optimisation Pass";
}		}

private:		private:
		bool ConvertTailPredLoop(MachineLoop *ML, MachineDominatorTree *DT);
MachineInstr &ReplaceRegisterUseWithVPNOT(MachineBasicBlock &MBB,		MachineInstr &ReplaceRegisterUseWithVPNOT(MachineBasicBlock &MBB,
MachineInstr &Instr,		MachineInstr &Instr,
MachineOperand &User,		MachineOperand &User,
Register Target);		Register Target);
bool ReduceOldVCCRValueUses(MachineBasicBlock &MBB);		bool ReduceOldVCCRValueUses(MachineBasicBlock &MBB);
bool ReplaceVCMPsByVPNOTs(MachineBasicBlock &MBB);		bool ReplaceVCMPsByVPNOTs(MachineBasicBlock &MBB);
bool ConvertVPSEL(MachineBasicBlock &MBB);		bool ConvertVPSEL(MachineBasicBlock &MBB);
};		};

char MVEVPTOptimisations::ID = 0;		char MVEVPTOptimisations::ID = 0;

} // end anonymous namespace		} // end anonymous namespace

INITIALIZE_PASS(MVEVPTOptimisations, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(MVEVPTOptimisations, DEBUG_TYPE,
"ARM MVE VPT Optimisations pass", false, false)		"ARM MVE TailPred and VPT Optimisations pass", false,
		false)
		INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)
		INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
		INITIALIZE_PASS_END(MVEVPTOptimisations, DEBUG_TYPE,
		"ARM MVE TailPred and VPT Optimisations pass", false, false)

		static MachineInstr *LookThroughCOPY(MachineInstr *MI,
		MachineRegisterInfo *MRI) {
		while (MI && MI->getOpcode() == TargetOpcode::COPY &&
		MI->getOperand(1).getReg().isVirtual())
		MI = MRI->getVRegDef(MI->getOperand(1).getReg());
		return MI;
		}
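
To illustrate what the copy-skipping helper above does, here is a small standalone model, assuming a toy instruction type and a map from virtual register to defining instruction. `ToyInstr`, `DefMap` and `lookThroughCopy` are hypothetical names; the real helper works on `MachineInstr` and `MachineRegisterInfo`.

```cpp
#include <cassert>
#include <unordered_map>

enum class ToyOpcode { Copy, Other };

// Hypothetical instruction: a Copy reads SrcReg, anything else is opaque.
struct ToyInstr {
  ToyOpcode Opcode;
  unsigned SrcReg;   // only meaningful for Copy
  bool SrcIsVirtual; // copies from physical registers are not followed
};

// Hypothetical stand-in for "unique definition of a virtual register".
using DefMap = std::unordered_map<unsigned, const ToyInstr *>;

// Follow a chain of copies back to the first non-copy definition.
// Returns nullptr if a source register has no known definition.
const ToyInstr *lookThroughCopy(const ToyInstr *MI, const DefMap &Defs) {
  while (MI && MI->Opcode == ToyOpcode::Copy && MI->SrcIsVirtual) {
    auto It = Defs.find(MI->SrcReg);
    MI = (It == Defs.end()) ? nullptr : It->second;
  }
  return MI;
}
```

The loop stops at a copy from a physical register and returns that copy itself, which mirrors the `isVirtual()` guard in the helper above.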

		// Given a loop ML, this attempts to find the t2LoopEnd, t2LoopDec and
		// corresponding PHI that make up a low overhead loop. Only handles 'do' loops
		// at the moment, returning a t2DoLoopStart in LoopStart.
		static bool findLoopComponents(MachineLoop *ML, MachineRegisterInfo *MRI,
		MachineInstr *&LoopStart, MachineInstr *&LoopPhi,
		MachineInstr *&LoopDec, MachineInstr *&LoopEnd) {
		MachineBasicBlock *Header = ML->getHeader();
		MachineBasicBlock *Latch = ML->getLoopLatch();
		if (!Header || !Latch) {
		LLVM_DEBUG(dbgs() << " no Loop Latch or Header\n");
		return false;
		}

		// Find the loop end from the terminators.
		LoopEnd = nullptr;
		for (auto &T : Latch->terminators()) {
		if (T.getOpcode() == ARM::t2LoopEnd && T.getOperand(1).getMBB() == Header) {
		LoopEnd = &T;
		break;
		}
		}
		if (!LoopEnd) {
		LLVM_DEBUG(dbgs() << " no LoopEnd\n");
		return false;
		}
		LLVM_DEBUG(dbgs() << " found loop end: " << *LoopEnd);

		// Find the dec from the use of the end. There may be copies between
		// instructions. We expect the loop to look like:
		// $vs = t2DoLoopStart ...
		// loop:
		// $vp = phi [ $vs ], [ $vd ]
		// ...
		// $vd = t2LoopDec $vp
		// ...
		// t2LoopEnd $vd, loop
		LoopDec =
		LookThroughCOPY(MRI->getVRegDef(LoopEnd->getOperand(0).getReg()), MRI);
		if (!LoopDec || LoopDec->getOpcode() != ARM::t2LoopDec) {
		LLVM_DEBUG(dbgs() << " didn't find LoopDec where we expected!\n");
		return false;
		}
		LLVM_DEBUG(dbgs() << " found loop dec: " << *LoopDec);

		LoopPhi =
		LookThroughCOPY(MRI->getVRegDef(LoopDec->getOperand(1).getReg()), MRI);
		if (!LoopPhi || LoopPhi->getOpcode() != TargetOpcode::PHI ||
		LoopPhi->getNumOperands() != 5 ||
		(LoopPhi->getOperand(2).getMBB() != Latch &&
		LoopPhi->getOperand(4).getMBB() != Latch)) {
		LLVM_DEBUG(dbgs() << " didn't find PHI where we expected!\n");
		return false;
		}
		LLVM_DEBUG(dbgs() << " found loop phi: " << *LoopPhi);

		Register StartReg = LoopPhi->getOperand(2).getMBB() == Latch
		? LoopPhi->getOperand(3).getReg()
		: LoopPhi->getOperand(1).getReg();
		LoopStart = LookThroughCOPY(MRI->getVRegDef(StartReg), MRI);
		if (!LoopStart || LoopStart->getOpcode() != ARM::t2DoLoopStart) {
		LLVM_DEBUG(dbgs() << " didn't find Start where we expected!\n");
		return false;
		}
		LLVM_DEBUG(dbgs() << " found loop start: " << *LoopStart);

		return true;
		}
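
The PHI checks above rely on the machine PHI operand layout: operand 0 is the result, followed by (value, predecessor block) pairs, which is why a two-input PHI has five operands. A minimal sketch of picking the value that comes from outside the loop follows. `ToyPhi` and `startValue` are hypothetical names, and blocks are modelled as plain integers.

```cpp
#include <cassert>

// Hypothetical two-input PHI: Def = phi [Val0, Block0], [Val1, Block1].
struct ToyPhi {
  unsigned Def;
  unsigned Val0;
  int Block0;
  unsigned Val1;
  int Block1;
};

// If one incoming edge is from the latch, write the value from the other edge
// (the loop start value) to Out and return true. Otherwise return false.
bool startValue(const ToyPhi &Phi, int Latch, unsigned &Out) {
  if (Phi.Block0 != Latch && Phi.Block1 != Latch)
    return false;
  Out = (Phi.Block0 == Latch) ? Phi.Val1 : Phi.Val0;
  return true;
}
```

The same selection appears twice in the patch: once to find the t2DoLoopStart feeding the loop counter PHI, and once to find the element count feeding the VCTP PHI.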

		// Convert t2DoLoopStart to t2DoLoopStartTP if the loop contains VCTP
		// instructions. This keeps the VCTP count reg operand on the t2DoLoopStartTP
		// instruction, making the backend ARMLowOverheadLoops pass's job of finding the
		// VCTP operand much simpler.
		bool MVEVPTOptimisations::ConvertTailPredLoop(MachineLoop *ML,
		MachineDominatorTree *DT) {
		LLVM_DEBUG(dbgs() << "ConvertTailPredLoop on loop "
		<< ML->getHeader()->getName() << "\n");

		// Find some loop components including the LoopEnd/Dec/Start, and any VCTP's
		// in the loop.
		MachineInstr *LoopEnd, *LoopPhi, *LoopStart, *LoopDec;
		if (!findLoopComponents(ML, MRI, LoopStart, LoopPhi, LoopDec, LoopEnd))
		return false;

		SmallVector<MachineInstr *, 4> VCTPs;
		for (MachineBasicBlock *BB : ML->blocks())
		for (MachineInstr &MI : *BB)
		if (isVCTP(&MI))
		VCTPs.push_back(&MI);

		if (VCTPs.empty()) {
		LLVM_DEBUG(dbgs() << " no VCTPs\n");
		return false;
		}

		// Check all VCTPs are the same.
		MachineInstr *FirstVCTP = *VCTPs.begin();
		for (MachineInstr *VCTP : VCTPs) {
		LLVM_DEBUG(dbgs() << " with VCTP " << *VCTP);
		if (VCTP->getOpcode() != FirstVCTP->getOpcode() ||
		samparker: What about only checking that it's not predicated in the case where there's more than one VCTP? At least then we can handle the VPT -> VCTP case.
		dmgreen: Would the backend pass handle multiple VCTP's in different blocks? If so we could just remove the check. I've done that here, but can add it back in if you think it might cause problems.
		samparker: Yep, this should be fine. The backend can handle any number of VCTPs and VPT blocks.
		VCTP->getOperand(0).getReg() != FirstVCTP->getOperand(0).getReg()) {
		LLVM_DEBUG(dbgs() << " VCTP's are not identical\n");
		return false;
		}
		}

		// Check for the register being used can be setup before the loop. We expect
		// this to be:
		// $vx = ...
		// loop:
		// $vp = PHI [ $vx ], [ $vd ]
		// ..
		// $vpr = VCTP $vp
		// ..
		// $vd = t2SUBri $vp, #n
		// ..
		Register CountReg = FirstVCTP->getOperand(1).getReg();
		if (!CountReg.isVirtual()) {
		LLVM_DEBUG(dbgs() << " cannot determine VCTP PHI\n");
		return false;
		}
		MachineInstr *Phi = LookThroughCOPY(MRI->getVRegDef(CountReg), MRI);
		if (!Phi || Phi->getOpcode() != TargetOpcode::PHI ||
		Phi->getNumOperands() != 5 ||
		(Phi->getOperand(2).getMBB() != ML->getLoopLatch() &&
		Phi->getOperand(4).getMBB() != ML->getLoopLatch())) {
		LLVM_DEBUG(dbgs() << " cannot determine VCTP Count\n");
		return false;
		}
		CountReg = Phi->getOperand(2).getMBB() == ML->getLoopLatch()
		? Phi->getOperand(3).getReg()
		: Phi->getOperand(1).getReg();

		// Replace the t2DoLoopStart with the t2DoLoopStartTP, move it to the end of
		// the preheader and add the new CountReg to it. We attempt to place it late
		// in the preheader, but may need to move that earlier based on uses.
		MachineBasicBlock *MBB = LoopStart->getParent();
		MachineBasicBlock::iterator InsertPt = MBB->getFirstTerminator();
		for (MachineInstr &Use :
		MRI->use_instructions(LoopStart->getOperand(0).getReg()))
		if ((InsertPt != MBB->end() && !DT->dominates(&*InsertPt, &Use)) ||
		samparker: I don't follow what's happening here, what uses could there be which we need to schedule for?
		dmgreen: There can be COPY's between the LoopStart and the PHI.
		!DT->dominates(ML->getHeader(), Use.getParent()))
		InsertPt = &Use;

		MachineInstrBuilder MI = BuildMI(*MBB, InsertPt, LoopStart->getDebugLoc(),
		pirama: This variable seems unused outside of debugging purposes: llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp:234:23: warning: unused variable 'MI' [-Wunused-variable] MachineInstrBuilder MI = BuildMI(*MBB, InsertPt, LoopStart->getDebugLoc(), ^ 1 warning generated.
		TII->get(ARM::t2DoLoopStartTP))
		.add(LoopStart->getOperand(0))
		.add(LoopStart->getOperand(1))
		.addReg(CountReg);
		LLVM_DEBUG(dbgs() << "Replacing " << *LoopStart << " with "
		<< *MI.getInstr());
		MRI->constrainRegClass(CountReg, &ARM::rGPRRegClass);
		LoopStart->eraseFromParent();

		return true;
		}
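
The insertion point logic above starts at the first terminator of the preheader and moves earlier whenever a use of the loop start value (for example a COPY feeding the PHI) would otherwise come before the new instruction. A simplified sketch follows, assuming every relevant use sits in the same block so that "dominates" reduces to "comes first". `chooseInsertIndex` is a hypothetical name, and instructions are modelled as positions in the block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the latest position at which the new start instruction can be
// inserted so that it still precedes every use of its result in this block.
std::size_t chooseInsertIndex(std::size_t FirstTerminator,
                              const std::vector<std::size_t> &UseIndices) {
  std::size_t InsertPt = FirstTerminator;
  for (std::size_t Use : UseIndices)
    if (Use < InsertPt) // the current point does not come before this use
      InsertPt = Use;
  return InsertPt;
}
```

This model ignores uses in other blocks, which the real code handles through the dominator tree; it only shows why the start ends up as late in the preheader as its in-block uses allow.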

// Returns true if Opcode is any VCMP Opcode.		// Returns true if Opcode is any VCMP Opcode.
static bool IsVCMP(unsigned Opcode) { return VCMPOpcodeToVPT(Opcode) != 0; }		static bool IsVCMP(unsigned Opcode) { return VCMPOpcodeToVPT(Opcode) != 0; }

// Returns true if a VCMP with this Opcode can have its operands swapped.		// Returns true if a VCMP with this Opcode can have its operands swapped.
// There is 2 kind of VCMP that can't have their operands swapped: Float VCMPs,		// There is 2 kind of VCMP that can't have their operands swapped: Float VCMPs,
// and VCMPr instructions (since the r is always on the right).		// and VCMPr instructions (since the r is always on the right).
static bool CanHaveSwappedOperands(unsigned Opcode) {		static bool CanHaveSwappedOperands(unsigned Opcode) {
[...]	bool MVEVPTOptimisations::runOnMachineFunction(MachineFunction &Fn) {
const ARMSubtarget &STI =		const ARMSubtarget &STI =
static_cast<const ARMSubtarget &>(Fn.getSubtarget());		static_cast<const ARMSubtarget &>(Fn.getSubtarget());

if (!STI.isThumb2() || !STI.hasMVEIntegerOps())		if (!STI.isThumb2() || !STI.hasMVEIntegerOps())
return false;		return false;

TII = static_cast<const Thumb2InstrInfo *>(STI.getInstrInfo());		TII = static_cast<const Thumb2InstrInfo *>(STI.getInstrInfo());
MRI = &Fn.getRegInfo();		MRI = &Fn.getRegInfo();
		MachineLoopInfo *MLI = &getAnalysis<MachineLoopInfo>();
		MachineDominatorTree *DT = &getAnalysis<MachineDominatorTree>();

LLVM_DEBUG(dbgs() << "******** ARM MVE VPT Optimisations ********\n"		LLVM_DEBUG(dbgs() << "******** ARM MVE VPT Optimisations ********\n"
<< "********** Function: " << Fn.getName() << '\n');		<< "********** Function: " << Fn.getName() << '\n');

bool Modified = false;		bool Modified = false;
		for (MachineLoop *ML : MLI->getBase().getLoopsInPreorder())
		Modified |= ConvertTailPredLoop(ML, DT);

for (MachineBasicBlock &MBB : Fn) {		for (MachineBasicBlock &MBB : Fn) {
Modified |= ReplaceVCMPsByVPNOTs(MBB);		Modified |= ReplaceVCMPsByVPNOTs(MBB);
Modified |= ReduceOldVCCRValueUses(MBB);		Modified |= ReduceOldVCCRValueUses(MBB);
Modified |= ConvertVPSEL(MBB);		Modified |= ConvertVPSEL(MBB);
}		}

LLVM_DEBUG(dbgs() << "**************************************\n");		LLVM_DEBUG(dbgs() << "**************************************\n");
return Modified;		return Modified;
}		}

/// createMVEVPTOptimisationsPass		/// createMVEVPTOptimisationsPass
FunctionPass *llvm::createMVEVPTOptimisationsPass() {		FunctionPass *llvm::createMVEVPTOptimisationsPass() {
return new MVEVPTOptimisations();		return new MVEVPTOptimisations();
}		}

llvm/test/CodeGen/ARM/O3-pipeline.ll

	Show First 20 Lines • Show All 90 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: Early Machine Loop Invariant Code Motion			; CHECK-NEXT: Early Machine Loop Invariant Code Motion
	; CHECK-NEXT: MachineDominator Tree Construction			; CHECK-NEXT: MachineDominator Tree Construction
	; CHECK-NEXT: Machine Block Frequency Analysis			; CHECK-NEXT: Machine Block Frequency Analysis
	; CHECK-NEXT: Machine Common Subexpression Elimination			; CHECK-NEXT: Machine Common Subexpression Elimination
	; CHECK-NEXT: MachinePostDominator Tree Construction			; CHECK-NEXT: MachinePostDominator Tree Construction
	; CHECK-NEXT: Machine code sinking			; CHECK-NEXT: Machine code sinking
	; CHECK-NEXT: Peephole Optimizations			; CHECK-NEXT: Peephole Optimizations
	; CHECK-NEXT: Remove dead machine instructions			; CHECK-NEXT: Remove dead machine instructions
	; CHECK-NEXT: MVE VPT Optimisation Pass			; CHECK-NEXT: MachineDominator Tree Construction
				; CHECK-NEXT: MVE TailPred and VPT Optimisation Pass
	; CHECK-NEXT: ARM MLA / MLS expansion pass			; CHECK-NEXT: ARM MLA / MLS expansion pass
	; CHECK-NEXT: MachineDominator Tree Construction			; CHECK-NEXT: MachineDominator Tree Construction
	; CHECK-NEXT: ARM pre- register allocation load / store optimization pass			; CHECK-NEXT: ARM pre- register allocation load / store optimization pass
	; CHECK-NEXT: ARM A15 S->D optimizer			; CHECK-NEXT: ARM A15 S->D optimizer
	; CHECK-NEXT: Detect Dead Lanes			; CHECK-NEXT: Detect Dead Lanes
	; CHECK-NEXT: Process Implicit Definitions			; CHECK-NEXT: Process Implicit Definitions
	; CHECK-NEXT: Remove unreachable machine basic blocks			; CHECK-NEXT: Remove unreachable machine basic blocks
	; CHECK-NEXT: Live Variable Analysis			; CHECK-NEXT: Live Variable Analysis
	▲ Show 20 Lines • Show All 78 Lines • Show Last 20 Lines

llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled -o - %s | FileCheck %s			; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled -o - %s | FileCheck %s
	%struct.SpeexPreprocessState_ = type { i32, i32, half, half }			%struct.SpeexPreprocessState_ = type { i32, i32, half, half }

	define void @foo(%struct.SpeexPreprocessState_* nocapture readonly %st, i16* %x) {			define void @foo(%struct.SpeexPreprocessState_* nocapture readonly %st, i16* %x) {
	; CHECK-LABEL: foo:			; CHECK-LABEL: foo:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: ldrd r12, r4, [r0]			; CHECK-NEXT: ldrd r12, r4, [r0]
	; CHECK-NEXT: ldrd r2, r3, [r0, #8]			; CHECK-NEXT: ldrd r2, r3, [r0, #8]
	; CHECK-NEXT: rsb r12, r12, r4, lsl #1			; CHECK-NEXT: rsb r12, r12, r4, lsl #1
	; CHECK-NEXT: mov r4, r12			; CHECK-NEXT: mov r4, r12
	; CHECK-NEXT: dlstp.16 lr, r4			; CHECK-NEXT: dlstp.16 lr, r12
	; CHECK-NEXT: .LBB0_1: @ %do.body			; CHECK-NEXT: .LBB0_1: @ %do.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vldrh.u16 q0, [r3], #16			; CHECK-NEXT: vldrh.u16 q0, [r3], #16
	; CHECK-NEXT: vstrh.16 q0, [r2], #16			; CHECK-NEXT: vstrh.16 q0, [r2], #16
	; CHECK-NEXT: letp lr, .LBB0_1			; CHECK-NEXT: letp lr, .LBB0_1
	; CHECK-NEXT: @ %bb.2: @ %do.end			; CHECK-NEXT: @ %bb.2: @ %do.end
	; CHECK-NEXT: ldr r2, [r0]			; CHECK-NEXT: ldr r2, [r0]
	; CHECK-NEXT: ldr r0, [r0, #8]			; CHECK-NEXT: ldr r0, [r0, #8]
	▲ Show 20 Lines • Show All 78 Lines • Show Last 20 Lines

llvm/test/CodeGen/Thumb2/LowOverheadLoops/mov-operand.ll

	Show All 10 Lines
	; CHECK-NEXT: it ge			; CHECK-NEXT: it ge
	; CHECK-NEXT: movge r4, #4			; CHECK-NEXT: movge r4, #4
	; CHECK-NEXT: movs r3, #1			; CHECK-NEXT: movs r3, #1
	; CHECK-NEXT: subs r4, r1, r4			; CHECK-NEXT: subs r4, r1, r4
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q0, #0x0
	; CHECK-NEXT: adds r4, #3			; CHECK-NEXT: adds r4, #3
	; CHECK-NEXT: add.w r12, r3, r4, lsr #2			; CHECK-NEXT: add.w r12, r3, r4, lsr #2
	; CHECK-NEXT: mov r3, r1			; CHECK-NEXT: mov r3, r1
	; CHECK-NEXT: dlstp.32 lr, r3			; CHECK-NEXT: dlstp.32 lr, r1
	; CHECK-NEXT: mov r4, r0			; CHECK-NEXT: mov r4, r0
	; CHECK-NEXT: .LBB0_1: @ %do.body.i			; CHECK-NEXT: .LBB0_1: @ %do.body.i
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vldrw.u32 q1, [r4], #16			; CHECK-NEXT: vldrw.u32 q1, [r4], #16
	; CHECK-NEXT: vadd.f32 q0, q0, q1			; CHECK-NEXT: vadd.f32 q0, q0, q1
	; CHECK-NEXT: letp lr, .LBB0_1			; CHECK-NEXT: letp lr, .LBB0_1
	; CHECK-NEXT: @ %bb.2: @ %arm_mean_f32_mve.exit			; CHECK-NEXT: @ %bb.2: @ %arm_mean_f32_mve.exit
	; CHECK-NEXT: vmov s4, r1			; CHECK-NEXT: vmov s4, r1
	▲ Show 20 Lines • Show All 86 Lines • Show Last 20 Lines

llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll

	Show First 20 Lines • Show All 408 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: sub.w r5, r12, #1			; CHECK-NEXT: sub.w r5, r12, #1
	; CHECK-NEXT: and r9, r12, #3			; CHECK-NEXT: and r9, r12, #3
	; CHECK-NEXT: cmp r5, #3			; CHECK-NEXT: cmp r5, #3
	; CHECK-NEXT: bhs .LBB5_6			; CHECK-NEXT: bhs .LBB5_6
	; CHECK-NEXT: @ %bb.3:			; CHECK-NEXT: @ %bb.3:
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
	; CHECK-NEXT: b .LBB5_8			; CHECK-NEXT: b .LBB5_8
	; CHECK-NEXT: .LBB5_4: @ %vector.ph			; CHECK-NEXT: .LBB5_4: @ %vector.ph
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r6, #0
	; CHECK-NEXT: dlstp.32 lr, r12			; CHECK-NEXT: dlstp.32 lr, r12
	; CHECK-NEXT: .LBB5_5: @ %vector.body			; CHECK-NEXT: .LBB5_5: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r6, #4
	; CHECK-NEXT: vldrb.u32 q0, [r0], #4			; CHECK-NEXT: vldrb.u32 q0, [r0], #4
	; CHECK-NEXT: vldrb.u32 q1, [r1], #4			; CHECK-NEXT: vldrb.u32 q1, [r1], #4
	; CHECK-NEXT: vmlas.u32 q1, q0, r2			; CHECK-NEXT: vmlas.u32 q1, q0, r2
	; CHECK-NEXT: vstrw.32 q1, [r3], #16			; CHECK-NEXT: vstrw.32 q1, [r3], #16
	; CHECK-NEXT: letp lr, .LBB5_5			; CHECK-NEXT: letp lr, .LBB5_5
	; CHECK-NEXT: b .LBB5_11			; CHECK-NEXT: b .LBB5_11
	; CHECK-NEXT: .LBB5_6: @ %for.body.preheader.new			; CHECK-NEXT: .LBB5_6: @ %for.body.preheader.new
	; CHECK-NEXT: bic r5, r12, #3			; CHECK-NEXT: bic r5, r12, #3
	▲ Show 20 Lines • Show All 281 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: sub.w r5, r12, #1			; CHECK-NEXT: sub.w r5, r12, #1
	; CHECK-NEXT: and r9, r12, #3			; CHECK-NEXT: and r9, r12, #3
	; CHECK-NEXT: cmp r5, #3			; CHECK-NEXT: cmp r5, #3
	; CHECK-NEXT: bhs .LBB7_6			; CHECK-NEXT: bhs .LBB7_6
	; CHECK-NEXT: @ %bb.3:			; CHECK-NEXT: @ %bb.3:
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
	; CHECK-NEXT: b .LBB7_8			; CHECK-NEXT: b .LBB7_8
	; CHECK-NEXT: .LBB7_4: @ %vector.ph			; CHECK-NEXT: .LBB7_4: @ %vector.ph
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r6, #0
	; CHECK-NEXT: dlstp.32 lr, r12			; CHECK-NEXT: dlstp.32 lr, r12
	; CHECK-NEXT: .LBB7_5: @ %vector.body			; CHECK-NEXT: .LBB7_5: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r6, #4
	; CHECK-NEXT: vldrb.u32 q0, [r0], #4			; CHECK-NEXT: vldrb.u32 q0, [r0], #4
	; CHECK-NEXT: vldrb.u32 q1, [r1], #4			; CHECK-NEXT: vldrb.u32 q1, [r1], #4
	; CHECK-NEXT: vmlas.u32 q1, q0, r2			; CHECK-NEXT: vmlas.u32 q1, q0, r2
	; CHECK-NEXT: vstrw.32 q1, [r3], #16			; CHECK-NEXT: vstrw.32 q1, [r3], #16
	; CHECK-NEXT: letp lr, .LBB7_5			; CHECK-NEXT: letp lr, .LBB7_5
	; CHECK-NEXT: b .LBB7_11			; CHECK-NEXT: b .LBB7_11
	; CHECK-NEXT: .LBB7_6: @ %for.body.preheader.new			; CHECK-NEXT: .LBB7_6: @ %for.body.preheader.new
	; CHECK-NEXT: bic r5, r12, #3			; CHECK-NEXT: bic r5, r12, #3
	▲ Show 20 Lines • Show All 281 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: sub.w r5, r12, #1			; CHECK-NEXT: sub.w r5, r12, #1
	; CHECK-NEXT: and r9, r12, #3			; CHECK-NEXT: and r9, r12, #3
	; CHECK-NEXT: cmp r5, #3			; CHECK-NEXT: cmp r5, #3
	; CHECK-NEXT: bhs .LBB9_6			; CHECK-NEXT: bhs .LBB9_6
	; CHECK-NEXT: @ %bb.3:			; CHECK-NEXT: @ %bb.3:
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
	; CHECK-NEXT: b .LBB9_8			; CHECK-NEXT: b .LBB9_8
	; CHECK-NEXT: .LBB9_4: @ %vector.ph			; CHECK-NEXT: .LBB9_4: @ %vector.ph
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r6, #0
	; CHECK-NEXT: dlstp.32 lr, r12			; CHECK-NEXT: dlstp.32 lr, r12
	; CHECK-NEXT: .LBB9_5: @ %vector.body			; CHECK-NEXT: .LBB9_5: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r6, #4
	; CHECK-NEXT: vldrw.u32 q0, [r0], #16			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vldrw.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: vmlas.u32 q1, q0, r2			; CHECK-NEXT: vmlas.u32 q1, q0, r2
	; CHECK-NEXT: vstrw.32 q1, [r3], #16			; CHECK-NEXT: vstrw.32 q1, [r3], #16
	; CHECK-NEXT: letp lr, .LBB9_5			; CHECK-NEXT: letp lr, .LBB9_5
	; CHECK-NEXT: b .LBB9_11			; CHECK-NEXT: b .LBB9_11
	; CHECK-NEXT: .LBB9_6: @ %for.body.preheader.new			; CHECK-NEXT: .LBB9_6: @ %for.body.preheader.new
	; CHECK-NEXT: bic r5, r12, #3			; CHECK-NEXT: bic r5, r12, #3
	▲ Show 20 Lines • Show All 245 Lines • Show Last 20 Lines

llvm/test/CodeGen/Thumb2/LowOverheadLoops/reductions.ll

	Show First 20 Lines • Show All 544 Lines • ▼ Show 20 Lines
	define dso_local arm_aapcs_vfpcc void @two_reductions_mul_add_v8i16(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %N) local_unnamed_addr {			define dso_local arm_aapcs_vfpcc void @two_reductions_mul_add_v8i16(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %N) local_unnamed_addr {
	; CHECK-LABEL: two_reductions_mul_add_v8i16:			; CHECK-LABEL: two_reductions_mul_add_v8i16:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: vpush {d8, d9}			; CHECK-NEXT: vpush {d8, d9}
	; CHECK-NEXT: cbz r2, .LBB7_4			; CHECK-NEXT: cbz r2, .LBB7_4
	; CHECK-NEXT: @ %bb.1: @ %vector.ph			; CHECK-NEXT: @ %bb.1: @ %vector.ph
	; CHECK-NEXT: adds r3, r2, #7			; CHECK-NEXT: adds r3, r2, #7
	; CHECK-NEXT: movs r4, #1
	; CHECK-NEXT: bic r3, r3, #7
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q0, #0x0
				; CHECK-NEXT: bic r3, r3, #7
				; CHECK-NEXT: movs r4, #1
	; CHECK-NEXT: subs r3, #8			; CHECK-NEXT: subs r3, #8
	; CHECK-NEXT: vmov q3, q0			; CHECK-NEXT: vmov q3, q0
	; CHECK-NEXT: add.w lr, r4, r3, lsr #3			; CHECK-NEXT: add.w lr, r4, r3, lsr #3
	; CHECK-NEXT: mov r3, r0			; CHECK-NEXT: mov r3, r0
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: mov r4, r1			; CHECK-NEXT: mov r4, r1
	; CHECK-NEXT: .LBB7_2: @ %vector.body			; CHECK-NEXT: .LBB7_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	▲ Show 20 Lines • Show All 194 Lines • Show Last 20 Lines

llvm/test/CodeGen/Thumb2/mve-fma-loops.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled %s -o - \| FileCheck %s			; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled %s -o - \| FileCheck %s

	define arm_aapcs_vfpcc void @fmas1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {			define arm_aapcs_vfpcc void @fmas1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
	; CHECK-LABEL: fmas1:			; CHECK-LABEL: fmas1:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r4, pc}			; CHECK-NEXT: poplt {r4, pc}
	; CHECK-NEXT: .LBB0_1: @ %vector.ph			; CHECK-NEXT: .LBB0_1: @ %vector.ph
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r4, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB0_2: @ %vector.body			; CHECK-NEXT: .LBB0_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r4, #4
	; CHECK-NEXT: vldrw.u32 q0, [r1], #16			; CHECK-NEXT: vldrw.u32 q0, [r1], #16
	; CHECK-NEXT: vldrw.u32 q1, [r0], #16			; CHECK-NEXT: vldrw.u32 q1, [r0], #16
	; CHECK-NEXT: vfmas.f32 q1, q0, r12			; CHECK-NEXT: vfmas.f32 q1, q0, r12
	; CHECK-NEXT: vstrw.32 q1, [r2], #16			; CHECK-NEXT: vstrw.32 q1, [r2], #16
	; CHECK-NEXT: letp lr, .LBB0_2			; CHECK-NEXT: letp lr, .LBB0_2
	(45 lines not shown)
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r4, pc}			; CHECK-NEXT: poplt {r4, pc}
	; CHECK-NEXT: .LBB1_1: @ %vector.ph			; CHECK-NEXT: .LBB1_1: @ %vector.ph
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r4, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB1_2: @ %vector.body			; CHECK-NEXT: .LBB1_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r4, #4
	; CHECK-NEXT: vldrw.u32 q0, [r0], #16			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vldrw.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: vfmas.f32 q1, q0, r12			; CHECK-NEXT: vfmas.f32 q1, q0, r12
	; CHECK-NEXT: vstrw.32 q1, [r2], #16			; CHECK-NEXT: vstrw.32 q1, [r2], #16
	; CHECK-NEXT: letp lr, .LBB1_2			; CHECK-NEXT: letp lr, .LBB1_2
	(46 lines not shown)
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r4, pc}			; CHECK-NEXT: poplt {r4, pc}
	; CHECK-NEXT: .LBB2_1: @ %vector.ph			; CHECK-NEXT: .LBB2_1: @ %vector.ph
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r4, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB2_2: @ %vector.body			; CHECK-NEXT: .LBB2_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r4, #4
	; CHECK-NEXT: vldrw.u32 q0, [r0], #16			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vldrw.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: vfma.f32 q1, q0, r12			; CHECK-NEXT: vfma.f32 q1, q0, r12
	; CHECK-NEXT: vstrw.32 q1, [r2], #16			; CHECK-NEXT: vstrw.32 q1, [r2], #16
	; CHECK-NEXT: letp lr, .LBB2_2			; CHECK-NEXT: letp lr, .LBB2_2
	(45 lines not shown)
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r4, pc}			; CHECK-NEXT: poplt {r4, pc}
	; CHECK-NEXT: .LBB3_1: @ %vector.ph			; CHECK-NEXT: .LBB3_1: @ %vector.ph
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r4, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB3_2: @ %vector.body			; CHECK-NEXT: .LBB3_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r4, #4
	; CHECK-NEXT: vldrw.u32 q0, [r0], #16			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vldrw.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: vfma.f32 q1, q0, r12			; CHECK-NEXT: vfma.f32 q1, q0, r12
	; CHECK-NEXT: vstrw.32 q1, [r2], #16			; CHECK-NEXT: vstrw.32 q1, [r2], #16
	; CHECK-NEXT: letp lr, .LBB3_2			; CHECK-NEXT: letp lr, .LBB3_2
	(177 lines not shown)
	; CHECK-LABEL: fmss3:			; CHECK-LABEL: fmss3:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r7, lr}			; CHECK-NEXT: .save {r7, lr}
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r7, pc}			; CHECK-NEXT: poplt {r7, pc}
	; CHECK-NEXT: .LBB6_1: @ %vector.ph			; CHECK-NEXT: .LBB6_1: @ %vector.ph
	; CHECK-NEXT: add.w r12, r3, #3
	; CHECK-NEXT: mov.w lr, #1
	; CHECK-NEXT: bic r12, r12, #3
	; CHECK-NEXT: sub.w r12, r12, #4
	; CHECK-NEXT: add.w lr, lr, r12, lsr #2
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vdup.32 q0, r12			; CHECK-NEXT: vdup.32 q0, r12
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB6_2: @ %vector.body			; CHECK-NEXT: .LBB6_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r3
	; CHECK-NEXT: add.w r12, r12, #4			; CHECK-NEXT: add.w r12, r12, #4
	; CHECK-NEXT: subs r3, #4
	; CHECK-NEXT: vmov q3, q0			; CHECK-NEXT: vmov q3, q0
	; CHECK-NEXT: vpstt			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: vldrwt.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q2, [r0], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
	; CHECK-NEXT: vfms.f32 q3, q2, q1			; CHECK-NEXT: vfms.f32 q3, q2, q1
	; CHECK-NEXT: vpst			; CHECK-NEXT: vstrw.32 q3, [r2], #16
	; CHECK-NEXT: vstrwt.32 q3, [r2], #16			; CHECK-NEXT: letp lr, .LBB6_2
	; CHECK-NEXT: le lr, .LBB6_2
	; CHECK-NEXT: @ %bb.3: @ %for.cond.cleanup			; CHECK-NEXT: @ %bb.3: @ %for.cond.cleanup
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%cmp8 = icmp sgt i32 %n, 0			%cmp8 = icmp sgt i32 %n, 0
	br i1 %cmp8, label %vector.ph, label %for.cond.cleanup			br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %n, 3			%n.rnd.up = add i32 %n, 3
	(37 lines not shown)
	; CHECK-LABEL: fmss4:			; CHECK-LABEL: fmss4:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r7, lr}			; CHECK-NEXT: .save {r7, lr}
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r7, pc}			; CHECK-NEXT: poplt {r7, pc}
	; CHECK-NEXT: .LBB7_1: @ %vector.ph			; CHECK-NEXT: .LBB7_1: @ %vector.ph
	; CHECK-NEXT: add.w r12, r3, #3
	; CHECK-NEXT: mov.w lr, #1
	; CHECK-NEXT: bic r12, r12, #3
	; CHECK-NEXT: sub.w r12, r12, #4
	; CHECK-NEXT: add.w lr, lr, r12, lsr #2
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vdup.32 q0, r12			; CHECK-NEXT: vdup.32 q0, r12
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB7_2: @ %vector.body			; CHECK-NEXT: .LBB7_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r3
	; CHECK-NEXT: add.w r12, r12, #4			; CHECK-NEXT: add.w r12, r12, #4
	; CHECK-NEXT: subs r3, #4
	; CHECK-NEXT: vmov q3, q0			; CHECK-NEXT: vmov q3, q0
	; CHECK-NEXT: vpstt			; CHECK-NEXT: vldrw.u32 q1, [r0], #16
	; CHECK-NEXT: vldrwt.u32 q1, [r0], #16			; CHECK-NEXT: vldrw.u32 q2, [r1], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
	; CHECK-NEXT: vfms.f32 q3, q2, q1			; CHECK-NEXT: vfms.f32 q3, q2, q1
	; CHECK-NEXT: vpst			; CHECK-NEXT: vstrw.32 q3, [r2], #16
	; CHECK-NEXT: vstrwt.32 q3, [r2], #16			; CHECK-NEXT: letp lr, .LBB7_2
	; CHECK-NEXT: le lr, .LBB7_2
	; CHECK-NEXT: @ %bb.3: @ %for.cond.cleanup			; CHECK-NEXT: @ %bb.3: @ %for.cond.cleanup
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%cmp8 = icmp sgt i32 %n, 0			%cmp8 = icmp sgt i32 %n, 0
	br i1 %cmp8, label %vector.ph, label %for.cond.cleanup			br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %n, 3			%n.rnd.up = add i32 %n, 3
	(102 lines not shown)
	; CHECK-LABEL: fms2:			; CHECK-LABEL: fms2:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r7, lr}			; CHECK-NEXT: .save {r7, lr}
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r7, pc}			; CHECK-NEXT: poplt {r7, pc}
	; CHECK-NEXT: .LBB9_1: @ %vector.ph			; CHECK-NEXT: .LBB9_1: @ %vector.ph
	; CHECK-NEXT: add.w r12, r3, #3
	; CHECK-NEXT: mov.w lr, #1
	; CHECK-NEXT: bic r12, r12, #3
	; CHECK-NEXT: sub.w r12, r12, #4
	; CHECK-NEXT: add.w lr, lr, r12, lsr #2
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vdup.32 q0, r12			; CHECK-NEXT: vdup.32 q0, r12
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB9_2: @ %vector.body			; CHECK-NEXT: .LBB9_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r3
	; CHECK-NEXT: add.w r12, r12, #4			; CHECK-NEXT: add.w r12, r12, #4
	; CHECK-NEXT: subs r3, #4			; CHECK-NEXT: vldrw.u32 q1, [r0], #16
	; CHECK-NEXT: vpstt			; CHECK-NEXT: vldrw.u32 q2, [r1], #16
	; CHECK-NEXT: vldrwt.u32 q1, [r0], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
	; CHECK-NEXT: vfms.f32 q2, q1, q0			; CHECK-NEXT: vfms.f32 q2, q1, q0
	; CHECK-NEXT: vpst			; CHECK-NEXT: vstrw.32 q2, [r2], #16
	; CHECK-NEXT: vstrwt.32 q2, [r2], #16			; CHECK-NEXT: letp lr, .LBB9_2
	; CHECK-NEXT: le lr, .LBB9_2
	; CHECK-NEXT: @ %bb.3: @ %for.cond.cleanup			; CHECK-NEXT: @ %bb.3: @ %for.cond.cleanup
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%cmp8 = icmp sgt i32 %n, 0			%cmp8 = icmp sgt i32 %n, 0
	br i1 %cmp8, label %vector.ph, label %for.cond.cleanup			br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %n, 3			%n.rnd.up = add i32 %n, 3
	(38 lines not shown)
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r4, pc}			; CHECK-NEXT: poplt {r4, pc}
	; CHECK-NEXT: .LBB10_1: @ %vector.ph			; CHECK-NEXT: .LBB10_1: @ %vector.ph
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r4, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB10_2: @ %vector.body			; CHECK-NEXT: .LBB10_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vldrw.u32 q0, [r0], #16			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vldrw.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r4, #4
	; CHECK-NEXT: vneg.f32 q1, q1			; CHECK-NEXT: vneg.f32 q1, q1
	; CHECK-NEXT: vfma.f32 q1, q0, r12			; CHECK-NEXT: vfma.f32 q1, q0, r12
	; CHECK-NEXT: vstrw.32 q1, [r2], #16			; CHECK-NEXT: vstrw.32 q1, [r2], #16
	(47 lines not shown)
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: cmp r3, #1			; CHECK-NEXT: cmp r3, #1
	; CHECK-NEXT: it lt			; CHECK-NEXT: it lt
	; CHECK-NEXT: poplt {r4, pc}			; CHECK-NEXT: poplt {r4, pc}
	; CHECK-NEXT: .LBB11_1: @ %vector.ph			; CHECK-NEXT: .LBB11_1: @ %vector.ph
	; CHECK-NEXT: vmov r12, s0			; CHECK-NEXT: vmov r12, s0
	; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: movs r4, #0			; CHECK-NEXT: movs r4, #0
				; CHECK-NEXT: dlstp.32 lr, r3
	; CHECK-NEXT: .LBB11_2: @ %vector.body			; CHECK-NEXT: .LBB11_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vldrw.u32 q0, [r0], #16			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vldrw.u32 q1, [r1], #16			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: adds r4, #4			; CHECK-NEXT: adds r4, #4
	; CHECK-NEXT: vneg.f32 q1, q1			; CHECK-NEXT: vneg.f32 q1, q1
	; CHECK-NEXT: vfma.f32 q1, q0, r12			; CHECK-NEXT: vfma.f32 q1, q0, r12
	; CHECK-NEXT: vstrw.32 q1, [r2], #16			; CHECK-NEXT: vstrw.32 q1, [r2], #16
	(49 lines not shown)
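
Note on the fmss3, fmss4 and fms2 hunks above: the left-hand side sets up a `dls` loop with an `add #3` / `bic #3` / `sub #4` / `add ..., lsr #2` sequence and predicates the body with `vctp.32`, while the right-hand side uses a single `dlstp.32 lr, r3` fed by the element count. The sketch below is only an illustration of the arithmetic that the removed sequence appears to perform, assuming 4 lanes of 32-bit elements and n >= 1 (the `cmp r3, #1` guard). The function names are hypothetical and are not part of the patch.

```python
LANES = 4  # assumption: 32-bit elements in a 128-bit MVE vector


def dls_trip_count(n: int) -> int:
    """Iteration count as the removed scalar sequence computes it (n >= 1)."""
    rounded = (n + 3) & ~3           # add.w r12, r3, #3 ; bic r12, r12, #3
    return 1 + ((rounded - 4) >> 2)  # sub.w r12, r12, #4 ; add.w lr, lr, r12, lsr #2 (lr = 1)


def ceil_div_lanes(n: int) -> int:
    """Reference value: number of LANES-wide vectors needed to cover n elements."""
    return -(-n // LANES)


def active_lanes(n: int):
    """Active lanes per iteration when the element count drops by LANES each time,
    which is what vctp.32 followed by 'subs r3, #4' does on the left-hand side."""
    remaining = n
    while remaining > 0:
        yield min(remaining, LANES)
        remaining -= LANES
```

For example, `list(active_lanes(10))` gives three iterations with the last one only partly active, matching `dls_trip_count(10)`. With the tail-predicated form the element count alone is enough to drive the loop, which is consistent with the scalar setup disappearing from the right-hand side.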

llvm/test/CodeGen/Thumb2/mve-gather-scatter-tailpred.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -enable-arm-maskedldst -enable-mem-access-versioning=false -tail-predication=force-enabled %s -o - \| FileCheck %s			; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -enable-arm-maskedldst -enable-mem-access-versioning=false -tail-predication=force-enabled %s -o - \| FileCheck %s

	define dso_local void @mve_gather_qi_wb(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %n, i32 %m, i32 %l) {			define dso_local void @mve_gather_qi_wb(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %n, i32 %m, i32 %l) {
	; CHECK-LABEL: mve_gather_qi_wb:			; CHECK-LABEL: mve_gather_qi_wb:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r7, lr}			; CHECK-NEXT: .save {r7, lr}
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: add.w r12, r0, r3, lsl #2			; CHECK-NEXT: add.w r12, r0, r3, lsl #2
	; CHECK-NEXT: adr r0, .LCPI0_0			; CHECK-NEXT: adr r0, .LCPI0_0
	; CHECK-NEXT: vldrw.u32 q0, [r0]			; CHECK-NEXT: vldrw.u32 q0, [r0]
	; CHECK-NEXT: movw lr, #1250			; CHECK-NEXT: movw lr, #1250
	; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vmov.i32 q1, #0x0			; CHECK-NEXT: vmov.i32 q1, #0x0
				; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vadd.i32 q0, q0, r1			; CHECK-NEXT: vadd.i32 q0, q0, r1
	; CHECK-NEXT: adds r1, r3, #4			; CHECK-NEXT: adds r1, r3, #4
	; CHECK-NEXT: .LBB0_1: @ %vector.body			; CHECK-NEXT: .LBB0_1: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r3			; CHECK-NEXT: vctp.32 r3
	; CHECK-NEXT: vmov q2, q1			; CHECK-NEXT: vmov q2, q1
	; CHECK-NEXT: vpstt			; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q1, [r12], #16			; CHECK-NEXT: vldrwt.u32 q1, [r12], #16
	(50 lines not shown)

	define dso_local void @mve_gatherscatter_offset(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %n, i32 %m, i32 %l) {			define dso_local void @mve_gatherscatter_offset(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %n, i32 %m, i32 %l) {
	; CHECK-LABEL: mve_gatherscatter_offset:			; CHECK-LABEL: mve_gatherscatter_offset:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, lr}			; CHECK-NEXT: .save {r4, lr}
	; CHECK-NEXT: push {r4, lr}			; CHECK-NEXT: push {r4, lr}
	; CHECK-NEXT: .vsave {d8, d9}			; CHECK-NEXT: .vsave {d8, d9}
	; CHECK-NEXT: vpush {d8, d9}			; CHECK-NEXT: vpush {d8, d9}
	; CHECK-NEXT: movw lr, #1250
	; CHECK-NEXT: add.w r4, r0, r3, lsl #2			; CHECK-NEXT: add.w r4, r0, r3, lsl #2
	; CHECK-NEXT: adr r0, .LCPI1_0			; CHECK-NEXT: adr r0, .LCPI1_0
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: movw lr, #1250
	; CHECK-NEXT: vldrw.u32 q1, [r0]			; CHECK-NEXT: vldrw.u32 q1, [r0]
	; CHECK-NEXT: add.w r12, r3, #4			; CHECK-NEXT: add.w r12, r3, #4
	; CHECK-NEXT: vmov.i32 q2, #0x0			; CHECK-NEXT: vmov.i32 q2, #0x0
	; CHECK-NEXT: vmov.i32 q0, #0x14			; CHECK-NEXT: vmov.i32 q0, #0x14
				; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: .LBB1_1: @ %vector.body			; CHECK-NEXT: .LBB1_1: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r3			; CHECK-NEXT: vctp.32 r3
	; CHECK-NEXT: vmov q3, q2			; CHECK-NEXT: vmov q3, q2
	; CHECK-NEXT: vpstt			; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q2, [r1, q1, uxtw #2]			; CHECK-NEXT: vldrwt.u32 q2, [r1, q1, uxtw #2]
	; CHECK-NEXT: vldrwt.u32 q4, [r4], #16			; CHECK-NEXT: vldrwt.u32 q4, [r4], #16
	; CHECK-NEXT: subs r3, #4			; CHECK-NEXT: subs r3, #4
	(54 lines not shown)
	; CHECK-LABEL: mve_scatter_qi:			; CHECK-LABEL: mve_scatter_qi:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r7, lr}			; CHECK-NEXT: .save {r7, lr}
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: add.w r12, r0, r3, lsl #2			; CHECK-NEXT: add.w r12, r0, r3, lsl #2
	; CHECK-NEXT: adr r0, .LCPI2_0			; CHECK-NEXT: adr r0, .LCPI2_0
	; CHECK-NEXT: vldrw.u32 q0, [r0]			; CHECK-NEXT: vldrw.u32 q0, [r0]
	; CHECK-NEXT: movw lr, #1250			; CHECK-NEXT: movw lr, #1250
	; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vmov.i32 q1, #0x0			; CHECK-NEXT: vmov.i32 q1, #0x0
				; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: vadd.i32 q0, q0, r1			; CHECK-NEXT: vadd.i32 q0, q0, r1
	; CHECK-NEXT: adds r1, r3, #4			; CHECK-NEXT: adds r1, r3, #4
	; CHECK-NEXT: vmov.i32 q2, #0x3			; CHECK-NEXT: vmov.i32 q2, #0x3
	; CHECK-NEXT: .LBB2_1: @ %vector.body			; CHECK-NEXT: .LBB2_1: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r3			; CHECK-NEXT: vctp.32 r3
	; CHECK-NEXT: vmov q3, q1			; CHECK-NEXT: vmov q3, q1
	; CHECK-NEXT: vpst			; CHECK-NEXT: vpst
	(293 lines not shown)

llvm/test/CodeGen/Thumb2/mve-postinc-dct.ll

	(9 lines not shown)
	; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, lr}			; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, lr}
	; CHECK-NEXT: ldr r3, [r0, #4]			; CHECK-NEXT: ldr r3, [r0, #4]
	; CHECK-NEXT: sub.w r12, r3, #1			; CHECK-NEXT: sub.w r12, r3, #1
	; CHECK-NEXT: cmp.w r12, #2			; CHECK-NEXT: cmp.w r12, #2
	; CHECK-NEXT: blo .LBB0_5			; CHECK-NEXT: blo .LBB0_5
	; CHECK-NEXT: @ %bb.1: @ %for.body.preheader			; CHECK-NEXT: @ %bb.1: @ %for.body.preheader
	; CHECK-NEXT: ldr r5, [r0, #8]			; CHECK-NEXT: ldr r5, [r0, #8]
	; CHECK-NEXT: ldr r3, [r0]			; CHECK-NEXT: ldr r3, [r0]
	; CHECK-NEXT: adds r0, r5, #3
	; CHECK-NEXT: bic r0, r0, #3
	; CHECK-NEXT: add.w r4, r3, r5, lsl #2			; CHECK-NEXT: add.w r4, r3, r5, lsl #2
	; CHECK-NEXT: subs r3, r0, #4
	; CHECK-NEXT: movs r0, #1			; CHECK-NEXT: movs r0, #1
	; CHECK-NEXT: lsl.w r9, r5, #2			; CHECK-NEXT: lsl.w r9, r5, #2
	; CHECK-NEXT: add.w r8, r0, r3, lsr #2
	; CHECK-NEXT: .LBB0_2: @ %for.body			; CHECK-NEXT: .LBB0_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB0_3 Depth 2			; CHECK-NEXT: @ Child Loop BB0_3 Depth 2
	; CHECK-NEXT: dls lr, r8
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q0, #0x0
				; CHECK-NEXT: dlstp.32 lr, r5
	; CHECK-NEXT: mov r7, r1			; CHECK-NEXT: mov r7, r1
	; CHECK-NEXT: mov r3, r4			; CHECK-NEXT: mov r3, r4
	; CHECK-NEXT: mov r6, r5			; CHECK-NEXT: mov r6, r5
	; CHECK-NEXT: .LBB0_3: @ %vector.body			; CHECK-NEXT: .LBB0_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB0_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB0_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: vctp.32 r6			; CHECK-NEXT: vldrw.u32 q1, [r7], #16
	; CHECK-NEXT: subs r6, #4			; CHECK-NEXT: vldrw.u32 q2, [r3], #16
	; CHECK-NEXT: vpsttt			; CHECK-NEXT: vfma.f32 q0, q2, q1
	; CHECK-NEXT: vldrwt.u32 q1, [r7], #16			; CHECK-NEXT: letp lr, .LBB0_3
	; CHECK-NEXT: vldrwt.u32 q2, [r3], #16
	; CHECK-NEXT: vfmat.f32 q0, q2, q1
	; CHECK-NEXT: le lr, .LBB0_3
	; CHECK-NEXT: @ %bb.4: @ %middle.block			; CHECK-NEXT: @ %bb.4: @ %middle.block
	; CHECK-NEXT: @ in Loop: Header=BB0_2 Depth=1			; CHECK-NEXT: @ in Loop: Header=BB0_2 Depth=1
	; CHECK-NEXT: vadd.f32 s4, s2, s3			; CHECK-NEXT: vadd.f32 s4, s2, s3
	; CHECK-NEXT: add.w r3, r2, r0, lsl #2			; CHECK-NEXT: add.w r3, r2, r0, lsl #2
	; CHECK-NEXT: vadd.f32 s0, s0, s1			; CHECK-NEXT: vadd.f32 s0, s0, s1
	; CHECK-NEXT: adds r0, #1			; CHECK-NEXT: adds r0, #1
	; CHECK-NEXT: add r4, r9			; CHECK-NEXT: add r4, r9
	; CHECK-NEXT: cmp r0, r12			; CHECK-NEXT: cmp r0, r12
	(68 lines not shown)
	; CHECK-NEXT: cmp r1, #2			; CHECK-NEXT: cmp r1, #2
	; CHECK-NEXT: blo .LBB1_5			; CHECK-NEXT: blo .LBB1_5
	; CHECK-NEXT: @ %bb.1: @ %for.body.preheader			; CHECK-NEXT: @ %bb.1: @ %for.body.preheader
	; CHECK-NEXT: ldr.w r12, [r0, #8]			; CHECK-NEXT: ldr.w r12, [r0, #8]
	; CHECK-NEXT: movs r4, #1			; CHECK-NEXT: movs r4, #1
	; CHECK-NEXT: ldr r3, [r0]			; CHECK-NEXT: ldr r3, [r0]
	; CHECK-NEXT: add.w r0, r12, #3			; CHECK-NEXT: add.w r0, r12, #3
	; CHECK-NEXT: bic r0, r0, #3			; CHECK-NEXT: bic r0, r0, #3
	; CHECK-NEXT: add.w r6, r3, r12, lsl #2			; CHECK-NEXT: add.w r5, r3, r12, lsl #2
	; CHECK-NEXT: subs r0, #4			; CHECK-NEXT: subs r0, #4
	; CHECK-NEXT: add.w r7, r3, r12, lsl #3			; CHECK-NEXT: add.w r7, r3, r12, lsl #3
	; CHECK-NEXT: lsl.w r10, r12, #3			; CHECK-NEXT: lsl.w r9, r12, #3
	; CHECK-NEXT: add.w r8, r4, r0, lsr #2			; CHECK-NEXT: add.w r8, r4, r0, lsr #2
	; CHECK-NEXT: .LBB1_2: @ %for.body			; CHECK-NEXT: .LBB1_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB1_3 Depth 2			; CHECK-NEXT: @ Child Loop BB1_3 Depth 2
	; CHECK-NEXT: dls lr, r8			; CHECK-NEXT: dls lr, r8
				; CHECK-NEXT: ldr r6, [sp] @ 4-byte Reload
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q0, #0x0
	; CHECK-NEXT: ldr r5, [sp] @ 4-byte Reload
	; CHECK-NEXT: add.w r11, r4, #1			; CHECK-NEXT: add.w r11, r4, #1
	; CHECK-NEXT: mov r3, r6			; CHECK-NEXT: mov r3, r5
	; CHECK-NEXT: mov r0, r7			; CHECK-NEXT: mov r0, r7
	; CHECK-NEXT: vmov q1, q0			; CHECK-NEXT: vmov q1, q0
	; CHECK-NEXT: mov r9, r12			; CHECK-NEXT: mov r10, r12
	; CHECK-NEXT: .LBB1_3: @ %vector.body			; CHECK-NEXT: .LBB1_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB1_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB1_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: vctp.32 r9			; CHECK-NEXT: vctp.32 r10
	; CHECK-NEXT: sub.w r9, r9, #4			; CHECK-NEXT: sub.w r10, r10, #4
	; CHECK-NEXT: vpstttt			; CHECK-NEXT: vpstttt
	; CHECK-NEXT: vldrwt.u32 q2, [r5], #16			; CHECK-NEXT: vldrwt.u32 q2, [r6], #16
	; CHECK-NEXT: vldrwt.u32 q3, [r3], #16			; CHECK-NEXT: vldrwt.u32 q3, [r3], #16
	; CHECK-NEXT: vfmat.f32 q1, q3, q2			; CHECK-NEXT: vfmat.f32 q1, q3, q2
	; CHECK-NEXT: vldrwt.u32 q3, [r0], #16			; CHECK-NEXT: vldrwt.u32 q3, [r0], #16
	; CHECK-NEXT: vpst			; CHECK-NEXT: vpst
	; CHECK-NEXT: vfmat.f32 q0, q3, q2			; CHECK-NEXT: vfmat.f32 q0, q3, q2
	; CHECK-NEXT: le lr, .LBB1_3			; CHECK-NEXT: le lr, .LBB1_3
	; CHECK-NEXT: @ %bb.4: @ %middle.block			; CHECK-NEXT: @ %bb.4: @ %middle.block
	; CHECK-NEXT: @ in Loop: Header=BB1_2 Depth=1			; CHECK-NEXT: @ in Loop: Header=BB1_2 Depth=1
	; CHECK-NEXT: vadd.f32 s8, s2, s3			; CHECK-NEXT: vadd.f32 s8, s2, s3
	; CHECK-NEXT: add.w r0, r2, r11, lsl #2			; CHECK-NEXT: add.w r0, r2, r11, lsl #2
	; CHECK-NEXT: vadd.f32 s0, s0, s1			; CHECK-NEXT: vadd.f32 s0, s0, s1
	; CHECK-NEXT: add r6, r10			; CHECK-NEXT: add r5, r9
	; CHECK-NEXT: vadd.f32 s2, s6, s7			; CHECK-NEXT: vadd.f32 s2, s6, s7
	; CHECK-NEXT: add r7, r10			; CHECK-NEXT: add r7, r9
	; CHECK-NEXT: vadd.f32 s4, s4, s5			; CHECK-NEXT: vadd.f32 s4, s4, s5
	; CHECK-NEXT: vadd.f32 s0, s0, s8			; CHECK-NEXT: vadd.f32 s0, s0, s8
	; CHECK-NEXT: vadd.f32 s2, s4, s2			; CHECK-NEXT: vadd.f32 s2, s4, s2
	; CHECK-NEXT: vstr s0, [r0]			; CHECK-NEXT: vstr s0, [r0]
	; CHECK-NEXT: add.w r0, r2, r4, lsl #2			; CHECK-NEXT: add.w r0, r2, r4, lsl #2
	; CHECK-NEXT: adds r4, #2			; CHECK-NEXT: adds r4, #2
	; CHECK-NEXT: cmp r4, r1			; CHECK-NEXT: cmp r4, r1
	; CHECK-NEXT: vstr s2, [r0]			; CHECK-NEXT: vstr s2, [r0]
	(97 lines not shown)
	; CHECK-NEXT: bic r3, r3, #3			; CHECK-NEXT: bic r3, r3, #3
	; CHECK-NEXT: lsls r7, r0, #2			; CHECK-NEXT: lsls r7, r0, #2
	; CHECK-NEXT: subs r3, #4			; CHECK-NEXT: subs r3, #4
	; CHECK-NEXT: add.w r3, r5, r3, lsr #2			; CHECK-NEXT: add.w r3, r5, r3, lsr #2
	; CHECK-NEXT: str r3, [sp] @ 4-byte Spill			; CHECK-NEXT: str r3, [sp] @ 4-byte Spill
	; CHECK-NEXT: .LBB2_2: @ %for.body			; CHECK-NEXT: .LBB2_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB2_3 Depth 2			; CHECK-NEXT: @ Child Loop BB2_3 Depth 2
	; CHECK-NEXT: ldr r0, [sp] @ 4-byte Reload			; CHECK-NEXT: ldrd r0, r10, [sp] @ 8-byte Folded Reload
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q0, #0x0
	; CHECK-NEXT: add.w r9, r5, #2			; CHECK-NEXT: add.w r9, r5, #2
	; CHECK-NEXT: add.w r11, r5, #1			; CHECK-NEXT: add.w r11, r5, #1
	; CHECK-NEXT: dls lr, r0			; CHECK-NEXT: dls lr, r0
	; CHECK-NEXT: mov r3, r12
	; CHECK-NEXT: ldr r6, [sp, #12] @ 4-byte Reload			; CHECK-NEXT: ldr r6, [sp, #12] @ 4-byte Reload
				; CHECK-NEXT: mov r3, r12
	; CHECK-NEXT: mov r0, r1			; CHECK-NEXT: mov r0, r1
	; CHECK-NEXT: ldr.w r10, [sp, #4] @ 4-byte Reload
	; CHECK-NEXT: mov r4, r8			; CHECK-NEXT: mov r4, r8
	; CHECK-NEXT: vmov q2, q0			; CHECK-NEXT: vmov q2, q0
	; CHECK-NEXT: vmov q1, q0			; CHECK-NEXT: vmov q1, q0
	; CHECK-NEXT: .LBB2_3: @ %vector.body			; CHECK-NEXT: .LBB2_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB2_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB2_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: vctp.32 r10			; CHECK-NEXT: vctp.32 r10
	; CHECK-NEXT: sub.w r10, r10, #4			; CHECK-NEXT: sub.w r10, r10, #4
	(145 lines not shown)
	; CHECK-NEXT: bic r0, r0, #3			; CHECK-NEXT: bic r0, r0, #3
	; CHECK-NEXT: lsls r7, r3, #4			; CHECK-NEXT: lsls r7, r3, #4
	; CHECK-NEXT: subs r0, #4			; CHECK-NEXT: subs r0, #4
	; CHECK-NEXT: add.w r0, r6, r0, lsr #2			; CHECK-NEXT: add.w r0, r6, r0, lsr #2
	; CHECK-NEXT: strd r0, r3, [sp, #4] @ 8-byte Folded Spill			; CHECK-NEXT: strd r0, r3, [sp, #4] @ 8-byte Folded Spill
	; CHECK-NEXT: .LBB3_2: @ %for.body			; CHECK-NEXT: .LBB3_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB3_3 Depth 2			; CHECK-NEXT: @ Child Loop BB3_3 Depth 2
	; CHECK-NEXT: ldr r0, [sp, #4] @ 4-byte Reload
	; CHECK-NEXT: vmov.i32 q0, #0x0
	; CHECK-NEXT: mov r3, r8
	; CHECK-NEXT: mov r5, r9
	; CHECK-NEXT: dls lr, r0
	; CHECK-NEXT: adds r0, r6, #3			; CHECK-NEXT: adds r0, r6, #3
	; CHECK-NEXT: str r0, [sp, #28] @ 4-byte Spill			; CHECK-NEXT: str r0, [sp, #28] @ 4-byte Spill
	; CHECK-NEXT: adds r0, r6, #2			; CHECK-NEXT: adds r0, r6, #2
	; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
	; CHECK-NEXT: mov r4, r10
	; CHECK-NEXT: ldr.w r11, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: vmov q1, q0
	; CHECK-NEXT: str r0, [sp, #24] @ 4-byte Spill			; CHECK-NEXT: str r0, [sp, #24] @ 4-byte Spill
	; CHECK-NEXT: adds r0, r6, #1			; CHECK-NEXT: adds r0, r6, #1
	; CHECK-NEXT: str r0, [sp, #20] @ 4-byte Spill			; CHECK-NEXT: str r0, [sp, #20] @ 4-byte Spill
				; CHECK-NEXT: ldrd r0, r11, [sp, #4] @ 8-byte Folded Reload
				; CHECK-NEXT: vmov.i32 q0, #0x0
				; CHECK-NEXT: mov r3, r8
				; CHECK-NEXT: mov r5, r9
				; CHECK-NEXT: dls lr, r0
				; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
	; CHECK-NEXT: mov r0, r12			; CHECK-NEXT: mov r0, r12
				; CHECK-NEXT: mov r4, r10
				; CHECK-NEXT: vmov q1, q0
	; CHECK-NEXT: vmov q2, q0			; CHECK-NEXT: vmov q2, q0
	; CHECK-NEXT: vmov q3, q0			; CHECK-NEXT: vmov q3, q0
	; CHECK-NEXT: .LBB3_3: @ %vector.body			; CHECK-NEXT: .LBB3_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB3_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB3_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: vctp.32 r11			; CHECK-NEXT: vctp.32 r11
	; CHECK-NEXT: sub.w r11, r11, #4			; CHECK-NEXT: sub.w r11, r11, #4
	; CHECK-NEXT: vpstttt			; CHECK-NEXT: vpstttt
	(169 lines not shown)
	; CHECK-NEXT: add.w r1, r0, r1, lsr #2			; CHECK-NEXT: add.w r1, r0, r1, lsr #2
	; CHECK-NEXT: str r1, [sp, #8] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #8] @ 4-byte Spill
	; CHECK-NEXT: add.w r1, r3, r3, lsl #2			; CHECK-NEXT: add.w r1, r3, r3, lsl #2
	; CHECK-NEXT: lsls r1, r1, #2			; CHECK-NEXT: lsls r1, r1, #2
	; CHECK-NEXT: str r1, [sp, #4] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #4] @ 4-byte Spill
	; CHECK-NEXT: .LBB4_2: @ %for.body			; CHECK-NEXT: .LBB4_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB4_3 Depth 2			; CHECK-NEXT: @ Child Loop BB4_3 Depth 2
	; CHECK-NEXT: ldr r1, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: vmov.i32 q1, #0x0
	; CHECK-NEXT: add.w r10, r0, #2
	; CHECK-NEXT: adds r7, r0, #1
	; CHECK-NEXT: dls lr, r1
	; CHECK-NEXT: adds r1, r0, #4			; CHECK-NEXT: adds r1, r0, #4
	; CHECK-NEXT: str r1, [sp, #28] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #28] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #3			; CHECK-NEXT: adds r1, r0, #3
	; CHECK-NEXT: str r1, [sp, #24] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #24] @ 4-byte Spill
	; CHECK-NEXT: mov r3, r8			; CHECK-NEXT: ldrd r1, r11, [sp, #8] @ 8-byte Folded Reload
				; CHECK-NEXT: vmov.i32 q1, #0x0
				; CHECK-NEXT: add.w r10, r0, #2
				; CHECK-NEXT: adds r7, r0, #1
				; CHECK-NEXT: dls lr, r1
	; CHECK-NEXT: ldr r1, [sp, #20] @ 4-byte Reload			; CHECK-NEXT: ldr r1, [sp, #20] @ 4-byte Reload
				; CHECK-NEXT: mov r3, r8
	; CHECK-NEXT: vmov q0, q1			; CHECK-NEXT: vmov q0, q1
	; CHECK-NEXT: ldr.w r11, [sp, #12] @ 4-byte Reload
	; CHECK-NEXT: vmov q3, q1			; CHECK-NEXT: vmov q3, q1
	; CHECK-NEXT: vmov q2, q1			; CHECK-NEXT: vmov q2, q1
	; CHECK-NEXT: vmov q4, q1			; CHECK-NEXT: vmov q4, q1
	; CHECK-NEXT: .LBB4_3: @ %vector.body			; CHECK-NEXT: .LBB4_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB4_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB4_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: add.w r9, r3, r5			; CHECK-NEXT: add.w r9, r3, r5
	; CHECK-NEXT: vctp.32 r11			; CHECK-NEXT: vctp.32 r11
	; CHECK-NEXT: add.w r1, r0, r1, lsr #2			; CHECK-NEXT: add.w r1, r0, r1, lsr #2
	; CHECK-NEXT: str r1, [sp, #4] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #4] @ 4-byte Spill
	; CHECK-NEXT: add.w r1, r3, r3, lsl #1			; CHECK-NEXT: add.w r1, r3, r3, lsl #1
	; CHECK-NEXT: lsls r1, r1, #3			; CHECK-NEXT: lsls r1, r1, #3
	; CHECK-NEXT: str r1, [sp] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp] @ 4-byte Spill
	; CHECK-NEXT: .LBB5_2: @ %for.body			; CHECK-NEXT: .LBB5_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB5_3 Depth 2			; CHECK-NEXT: @ Child Loop BB5_3 Depth 2
	; CHECK-NEXT: ldr r1, [sp, #4] @ 4-byte Reload
	; CHECK-NEXT: vmov.i32 q1, #0x0
	; CHECK-NEXT: add.w r11, r0, #2
	; CHECK-NEXT: adds r4, r0, #1
	; CHECK-NEXT: dls lr, r1
	; CHECK-NEXT: adds r1, r0, #5			; CHECK-NEXT: adds r1, r0, #5
	; CHECK-NEXT: str r1, [sp, #28] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #28] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #4			; CHECK-NEXT: adds r1, r0, #4
	; CHECK-NEXT: str r1, [sp, #24] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #24] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #3			; CHECK-NEXT: adds r1, r0, #3
	; CHECK-NEXT: str r1, [sp, #20] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #20] @ 4-byte Spill
	; CHECK-NEXT: mov r3, r9			; CHECK-NEXT: ldrd r1, r8, [sp, #4] @ 8-byte Folded Reload
				; CHECK-NEXT: vmov.i32 q1, #0x0
				; CHECK-NEXT: add.w r11, r0, #2
				; CHECK-NEXT: adds r4, r0, #1
				; CHECK-NEXT: dls lr, r1
	; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload			; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
				; CHECK-NEXT: mov r3, r9
	; CHECK-NEXT: vmov q3, q1			; CHECK-NEXT: vmov q3, q1
	; CHECK-NEXT: ldr.w r8, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: vmov q4, q1			; CHECK-NEXT: vmov q4, q1
	; CHECK-NEXT: vmov q0, q1			; CHECK-NEXT: vmov q0, q1
	; CHECK-NEXT: vmov q5, q1			; CHECK-NEXT: vmov q5, q1
	; CHECK-NEXT: vmov q2, q1			; CHECK-NEXT: vmov q2, q1
	; CHECK-NEXT: .LBB5_3: @ %vector.body			; CHECK-NEXT: .LBB5_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB5_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB5_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: add.w r12, r3, r5			; CHECK-NEXT: add.w r12, r3, r5
	; CHECK-NEXT: add.w r1, r0, r1, lsr #2			; CHECK-NEXT: add.w r1, r0, r1, lsr #2
	; CHECK-NEXT: str r1, [sp, #16] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #16] @ 4-byte Spill
	; CHECK-NEXT: rsb r1, r3, r3, lsl #3			; CHECK-NEXT: rsb r1, r3, r3, lsl #3
	; CHECK-NEXT: lsls r1, r1, #2			; CHECK-NEXT: lsls r1, r1, #2
	; CHECK-NEXT: str r1, [sp, #12] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #12] @ 4-byte Spill
	; CHECK-NEXT: .LBB6_2: @ %for.body			; CHECK-NEXT: .LBB6_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB6_3 Depth 2			; CHECK-NEXT: @ Child Loop BB6_3 Depth 2
	; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
	; CHECK-NEXT: vmov.i32 q2, #0x0
	; CHECK-NEXT: adds r4, r0, #2
	; CHECK-NEXT: add.w r8, r0, #1
	; CHECK-NEXT: dls lr, r1
	; CHECK-NEXT: adds r1, r0, #6			; CHECK-NEXT: adds r1, r0, #6
	; CHECK-NEXT: str r1, [sp, #44] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #44] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #5			; CHECK-NEXT: adds r1, r0, #5
	; CHECK-NEXT: str r1, [sp, #40] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #40] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #4			; CHECK-NEXT: adds r1, r0, #4
	; CHECK-NEXT: str r1, [sp, #36] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #36] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #3			; CHECK-NEXT: adds r1, r0, #3
	; CHECK-NEXT: str r1, [sp, #32] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #32] @ 4-byte Spill
	; CHECK-NEXT: mov r3, r12			; CHECK-NEXT: ldrd r3, r1, [sp, #16] @ 8-byte Folded Reload
				; CHECK-NEXT: vmov.i32 q2, #0x0
				; CHECK-NEXT: adds r4, r0, #2
				; CHECK-NEXT: add.w r8, r0, #1
				; CHECK-NEXT: dls lr, r3
	; CHECK-NEXT: ldr.w r9, [sp, #28] @ 4-byte Reload			; CHECK-NEXT: ldr.w r9, [sp, #28] @ 4-byte Reload
				; CHECK-NEXT: mov r3, r12
	; CHECK-NEXT: vmov q4, q2			; CHECK-NEXT: vmov q4, q2
	; CHECK-NEXT: ldr r1, [sp, #20] @ 4-byte Reload
	; CHECK-NEXT: vmov q5, q2			; CHECK-NEXT: vmov q5, q2
	; CHECK-NEXT: vmov q3, q2			; CHECK-NEXT: vmov q3, q2
	; CHECK-NEXT: vmov q6, q2			; CHECK-NEXT: vmov q6, q2
	; CHECK-NEXT: vmov q1, q2			; CHECK-NEXT: vmov q1, q2
	; CHECK-NEXT: vstrw.32 q2, [sp, #64] @ 16-byte Spill			; CHECK-NEXT: vstrw.32 q2, [sp, #64] @ 16-byte Spill
	; CHECK-NEXT: .LBB6_3: @ %vector.body			; CHECK-NEXT: .LBB6_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB6_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB6_2 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: lsls r5, r3, #2			; CHECK-NEXT: lsls r5, r3, #2
	; CHECK-NEXT: add.w r1, r0, r1, lsr #2			; CHECK-NEXT: add.w r1, r0, r1, lsr #2
	; CHECK-NEXT: str r1, [sp, #16] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #16] @ 4-byte Spill
	; CHECK-NEXT: lsls r1, r3, #5			; CHECK-NEXT: lsls r1, r3, #5
	; CHECK-NEXT: str r1, [sp, #12] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #12] @ 4-byte Spill
	; CHECK-NEXT: .LBB7_2: @ %for.body			; CHECK-NEXT: .LBB7_2: @ %for.body
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB7_3 Depth 2			; CHECK-NEXT: @ Child Loop BB7_3 Depth 2
	; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
	; CHECK-NEXT: vmov.i32 q3, #0x0
	; CHECK-NEXT: adds r4, r0, #3
	; CHECK-NEXT: add.w r8, r0, #2
	; CHECK-NEXT: dls lr, r1
	; CHECK-NEXT: adds r1, r0, #7			; CHECK-NEXT: adds r1, r0, #7
	; CHECK-NEXT: str r1, [sp, #44] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #44] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #6			; CHECK-NEXT: adds r1, r0, #6
				; CHECK-NEXT: ldrd r3, r10, [sp, #16] @ 8-byte Folded Reload
	; CHECK-NEXT: str r1, [sp, #40] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #40] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #5			; CHECK-NEXT: adds r1, r0, #5
	; CHECK-NEXT: str r1, [sp, #36] @ 4-byte Spill			; CHECK-NEXT: str r1, [sp, #36] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #4			; CHECK-NEXT: adds r1, r0, #4
				; CHECK-NEXT: str r1, [sp, #32] @ 4-byte Spill
				; CHECK-NEXT: dls lr, r3
	; CHECK-NEXT: ldr.w r12, [sp, #28] @ 4-byte Reload			; CHECK-NEXT: ldr.w r12, [sp, #28] @ 4-byte Reload
				; CHECK-NEXT: vmov.i32 q3, #0x0
				; CHECK-NEXT: adds r4, r0, #3
				; CHECK-NEXT: add.w r8, r0, #2
				; CHECK-NEXT: adds r1, r0, #1
	; CHECK-NEXT: mov r3, r9			; CHECK-NEXT: mov r3, r9
	; CHECK-NEXT: ldr.w r10, [sp, #20] @ 4-byte Reload
	; CHECK-NEXT: vmov q5, q3			; CHECK-NEXT: vmov q5, q3
	; CHECK-NEXT: str r1, [sp, #32] @ 4-byte Spill
	; CHECK-NEXT: adds r1, r0, #1
	; CHECK-NEXT: vmov q6, q3			; CHECK-NEXT: vmov q6, q3
	; CHECK-NEXT: vmov q4, q3			; CHECK-NEXT: vmov q4, q3
	; CHECK-NEXT: vmov q7, q3			; CHECK-NEXT: vmov q7, q3
	; CHECK-NEXT: vmov q2, q3			; CHECK-NEXT: vmov q2, q3
	; CHECK-NEXT: vstrw.32 q3, [sp, #64] @ 16-byte Spill			; CHECK-NEXT: vstrw.32 q3, [sp, #64] @ 16-byte Spill
	; CHECK-NEXT: vstrw.32 q3, [sp, #80] @ 16-byte Spill			; CHECK-NEXT: vstrw.32 q3, [sp, #80] @ 16-byte Spill
	; CHECK-NEXT: .LBB7_3: @ %vector.body			; CHECK-NEXT: .LBB7_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB7_2 Depth=1			; CHECK-NEXT: @ Parent Loop BB7_2 Depth=1

llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll

	; CHECK-NEXT: ldr.w r12, [r0, r9, lsl #2]			; CHECK-NEXT: ldr.w r12, [r0, r9, lsl #2]
	; CHECK-NEXT: subs r0, r2, r2			; CHECK-NEXT: subs r0, r2, r2
	; CHECK-NEXT: ble .LBB5_3			; CHECK-NEXT: ble .LBB5_3
	; CHECK-NEXT: @ %bb.6: @ %for.body24.preheader			; CHECK-NEXT: @ %bb.6: @ %for.body24.preheader
	; CHECK-NEXT: @ in Loop: Header=BB5_5 Depth=1			; CHECK-NEXT: @ in Loop: Header=BB5_5 Depth=1
	; CHECK-NEXT: ldr.w r11, [sp, #88]			; CHECK-NEXT: ldr.w r11, [sp, #88]
	; CHECK-NEXT: mov r6, r12			; CHECK-NEXT: mov r6, r12
	; CHECK-NEXT: ldr r1, [sp, #12] @ 4-byte Reload			; CHECK-NEXT: ldr r1, [sp, #12] @ 4-byte Reload
	; CHECK-NEXT: dlstp.16 lr, r11
	; CHECK-NEXT: ldr r5, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: mov r8, r12			; CHECK-NEXT: mov r8, r12
				; CHECK-NEXT: dlstp.16 lr, r11
				; CHECK-NEXT: ldr r0, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: mla r3, r9, r11, r1			; CHECK-NEXT: mla r3, r9, r11, r1
	; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload			; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
	; CHECK-NEXT: ldrd r7, r0, [sp] @ 8-byte Folded Reload			; CHECK-NEXT: ldrd r7, r5, [sp] @ 8-byte Folded Reload
	; CHECK-NEXT: mov r10, r12			; CHECK-NEXT: mov r10, r12
	; CHECK-NEXT: .LBB5_7: @ %for.body24			; CHECK-NEXT: .LBB5_7: @ %for.body24
	; CHECK-NEXT: @ Parent Loop BB5_5 Depth=1			; CHECK-NEXT: @ Parent Loop BB5_5 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: vldrb.s16 q0, [r7], #8			; CHECK-NEXT: vldrb.s16 q0, [r7], #8
	; CHECK-NEXT: vadd.i16 q1, q0, r4			; CHECK-NEXT: vadd.i16 q1, q0, r4
	; CHECK-NEXT: vldrb.s16 q0, [r3], #8			; CHECK-NEXT: vldrb.s16 q0, [r3], #8
	; CHECK-NEXT: vmlava.s16 r12, q0, q1			; CHECK-NEXT: vmlava.s16 r12, q0, q1
	; CHECK-NEXT: vldrb.s16 q1, [r5], #8			; CHECK-NEXT: vldrb.s16 q1, [r0], #8
	; CHECK-NEXT: vadd.i16 q1, q1, r4			; CHECK-NEXT: vadd.i16 q1, q1, r4
	; CHECK-NEXT: vmlava.s16 r6, q0, q1			; CHECK-NEXT: vmlava.s16 r6, q0, q1
	; CHECK-NEXT: vldrb.s16 q1, [r0], #8			; CHECK-NEXT: vldrb.s16 q1, [r5], #8
	; CHECK-NEXT: vadd.i16 q1, q1, r4			; CHECK-NEXT: vadd.i16 q1, q1, r4
	; CHECK-NEXT: vmlava.s16 r8, q0, q1			; CHECK-NEXT: vmlava.s16 r8, q0, q1
	; CHECK-NEXT: vldrb.s16 q1, [r1], #8			; CHECK-NEXT: vldrb.s16 q1, [r1], #8
	; CHECK-NEXT: vadd.i16 q1, q1, r4			; CHECK-NEXT: vadd.i16 q1, q1, r4
	; CHECK-NEXT: vmlava.s16 r10, q0, q1			; CHECK-NEXT: vmlava.s16 r10, q0, q1
	; CHECK-NEXT: letp lr, .LBB5_7			; CHECK-NEXT: letp lr, .LBB5_7
	; CHECK-NEXT: b .LBB5_4			; CHECK-NEXT: b .LBB5_4
	; CHECK-NEXT: .LBB5_8: @ %if.end			; CHECK-NEXT: .LBB5_8: @ %if.end
	; CHECK-NEXT: ldr.w r12, [r0, r9, lsl #2]			; CHECK-NEXT: ldr.w r12, [r0, r9, lsl #2]
	; CHECK-NEXT: subs r0, r2, r2			; CHECK-NEXT: subs r0, r2, r2
	; CHECK-NEXT: ble .LBB6_6			; CHECK-NEXT: ble .LBB6_6
	; CHECK-NEXT: @ %bb.4: @ %for.body24.preheader			; CHECK-NEXT: @ %bb.4: @ %for.body24.preheader
	; CHECK-NEXT: @ in Loop: Header=BB6_3 Depth=1			; CHECK-NEXT: @ in Loop: Header=BB6_3 Depth=1
	; CHECK-NEXT: ldr.w r11, [sp, #88]			; CHECK-NEXT: ldr.w r11, [sp, #88]
	; CHECK-NEXT: mov r6, r12			; CHECK-NEXT: mov r6, r12
	; CHECK-NEXT: ldr r1, [sp, #12] @ 4-byte Reload			; CHECK-NEXT: ldr r1, [sp, #12] @ 4-byte Reload
	; CHECK-NEXT: dlstp.16 lr, r11
	; CHECK-NEXT: ldr r5, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: mov r8, r12			; CHECK-NEXT: mov r8, r12
				; CHECK-NEXT: dlstp.16 lr, r11
				; CHECK-NEXT: ldr r0, [sp, #8] @ 4-byte Reload
	; CHECK-NEXT: mla r3, r9, r11, r1			; CHECK-NEXT: mla r3, r9, r11, r1
	; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload			; CHECK-NEXT: ldr r1, [sp, #16] @ 4-byte Reload
	; CHECK-NEXT: ldrd r7, r0, [sp] @ 8-byte Folded Reload			; CHECK-NEXT: ldrd r7, r5, [sp] @ 8-byte Folded Reload
	; CHECK-NEXT: mov r10, r12			; CHECK-NEXT: mov r10, r12
	; CHECK-NEXT: .LBB6_5: @ %for.body24			; CHECK-NEXT: .LBB6_5: @ %for.body24
	; CHECK-NEXT: @ Parent Loop BB6_3 Depth=1			; CHECK-NEXT: @ Parent Loop BB6_3 Depth=1
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=2			; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
	; CHECK-NEXT: vldrb.s16 q0, [r7], #8			; CHECK-NEXT: vldrb.s16 q0, [r7], #8
	; CHECK-NEXT: vadd.i16 q1, q0, r4			; CHECK-NEXT: vadd.i16 q1, q0, r4
	; CHECK-NEXT: vldrb.s16 q0, [r3], #8			; CHECK-NEXT: vldrb.s16 q0, [r3], #8
	; CHECK-NEXT: vmlava.s16 r12, q0, q1			; CHECK-NEXT: vmlava.s16 r12, q0, q1
	; CHECK-NEXT: vldrb.s16 q1, [r5], #8			; CHECK-NEXT: vldrb.s16 q1, [r0], #8
	; CHECK-NEXT: vadd.i16 q1, q1, r4			; CHECK-NEXT: vadd.i16 q1, q1, r4
	; CHECK-NEXT: vmlava.s16 r6, q0, q1			; CHECK-NEXT: vmlava.s16 r6, q0, q1
	; CHECK-NEXT: vldrb.s16 q1, [r0], #8			; CHECK-NEXT: vldrb.s16 q1, [r5], #8
	; CHECK-NEXT: vadd.i16 q1, q1, r4			; CHECK-NEXT: vadd.i16 q1, q1, r4
	; CHECK-NEXT: vmlava.s16 r8, q0, q1			; CHECK-NEXT: vmlava.s16 r8, q0, q1
	; CHECK-NEXT: vldrb.s16 q1, [r1], #8			; CHECK-NEXT: vldrb.s16 q1, [r1], #8
	; CHECK-NEXT: vadd.i16 q1, q1, r4			; CHECK-NEXT: vadd.i16 q1, q1, r4
	; CHECK-NEXT: vmlava.s16 r10, q0, q1			; CHECK-NEXT: vmlava.s16 r10, q0, q1
	; CHECK-NEXT: letp lr, .LBB6_5			; CHECK-NEXT: letp lr, .LBB6_5
	; CHECK-NEXT: b .LBB6_7			; CHECK-NEXT: b .LBB6_7
	; CHECK-NEXT: .LBB6_6: @ in Loop: Header=BB6_3 Depth=1			; CHECK-NEXT: .LBB6_6: @ in Loop: Header=BB6_3 Depth=1

Revision Contents

Diff 304244

llvm/lib/Target/ARM/ARMBaseInstrInfo.h

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp

llvm/lib/Target/ARM/ARMInstrThumb2.td

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp

llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp

llvm/test/CodeGen/ARM/O3-pipeline.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/mov-operand.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/reductions.ll

llvm/test/CodeGen/Thumb2/mve-fma-loops.ll

llvm/test/CodeGen/Thumb2/mve-gather-scatter-tailpred.ll

llvm/test/CodeGen/Thumb2/mve-postinc-dct.ll

llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
