This is an archive of the discontinued LLVM Phabricator instance.

Differential D75533

[ARM][LowOverheadLoops] Handle reductions
ClosedPublic

Authored by samparker on Mar 3 2020, 8:34 AM.

Details

Reviewers

SjoerdMeijer
dmgreen

Commits

rG3ee580d0176f: [ARM][LowOverheadLoops] Handle reductions

Summary

While validating live-out values, record instructions that look like a reduction. This will comprise a vector op (for now only vadd), a vorr (vmov) which stores the previous value of the vadd, and then a vpsel in the exit block which is predicated upon a vctp. Under that vctp predicate, the vpsel will combine the last two iterations, using the vmov and vadd, into a vector which can then be consumed by a vaddv.
Once we have determined that it's safe to perform tail-predication, we need to change this sequence of instructions so that the predication doesn't produce incorrect code. This involves changing the register allocation of the vadd so that it updates itself; the predication on the final iteration will then not update the falsely predicated lanes. This mimics what the vmov, vctp and vpsel do, so we then don't need any of those instructions.
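
To illustrate why the rewritten sequence is equivalent, here is a minimal scalar model in C++. It is a sketch, not the pass itself: it assumes a 4-lane vector, models inactive lanes as reading an arbitrary junk value, and all names (`laneActive`, `reduceWithVpsel`, `reduceTailPredicated`, `addAcrossLanes`) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t Lanes = 4; // assumed vector width for this sketch
using Vec = std::array<int, Lanes>;

// Models the predicate a vctp produces: lane I is active while it still
// maps to a real element.
static bool laneActive(std::size_t Remaining, std::size_t I) {
  return I < Remaining;
}

// Models vaddv: sum all lanes into a scalar.
static int addAcrossLanes(const Vec &V) {
  int Sum = 0;
  for (int Lane : V)
    Sum += Lane;
  return Sum;
}

// Before the fixup: an unpredicated vadd, a vmov (vorr) keeping the previous
// accumulator, and a vpsel in the exit block merging the two under the vctp.
int reduceWithVpsel(const std::vector<int> &Data) {
  Vec Acc{}, Prev{};
  std::size_t Remaining = Data.size(), LastRemaining = 0;
  for (std::size_t Base = 0; Base < Data.size(); Base += Lanes) {
    Prev = Acc; // vmov: remember the value from the previous iteration
    for (std::size_t I = 0; I < Lanes; ++I) {
      // Assumption: inactive lanes hold junk that must not reach the result.
      int Elem = 0x7777;
      if (laneActive(Remaining, I)) {
        assert(Base + I < Data.size());
        Elem = Data[Base + I];
      }
      Acc[I] += Elem; // vadd on every lane
    }
    LastRemaining = Remaining;
    Remaining -= std::min(Remaining, Lanes);
  }
  Vec Merged{};
  for (std::size_t I = 0; I < Lanes; ++I) // vpsel under the final vctp
    Merged[I] = laneActive(LastRemaining, I) ? Acc[I] : Prev[I];
  return addAcrossLanes(Merged);
}

// After the fixup: the vadd accumulates into itself and tail predication
// leaves inactive lanes untouched, so no vmov, vctp or vpsel is needed.
int reduceTailPredicated(const std::vector<int> &Data) {
  Vec Acc{};
  std::size_t Remaining = Data.size();
  for (std::size_t Base = 0; Base < Data.size(); Base += Lanes) {
    for (std::size_t I = 0; I < Lanes; ++I)
      if (laneActive(Remaining, I))
        Acc[I] += Data[Base + I]; // predicated vadd
    Remaining -= std::min(Remaining, Lanes);
  }
  return addAcrossLanes(Acc);
}
```

For any input length, both functions should return the plain sum of the elements; the second simply never lets the junk lanes into the accumulator in the first place.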

Diff Detail

Event Timeline

samparker created this revision. Mar 3 2020, 8:34 AM

Herald added a project: Restricted Project. Mar 3 2020, 8:34 AM

Herald added subscribers: hiraditya, kristof.beyls.

samparker added a parent revision: D75452: [ARM][MVE] Validate tail predication values. Mar 3 2020, 8:35 AM

Harbormaster failed remote builds in B47916: Diff 247913! Mar 3 2020, 8:53 AM

Big patch: this is just a first scan of the code, and a first round of nits. Now going to look again, to let things sink in.

llvm/include/llvm/CodeGen/ReachingDefAnalysis.h
155	you renamed this `Defs` to `Incoming`...
157	So rename this one too?
233	and here too?
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
179	Can you comment what these members are? I agree that most are self-explanatory, but I am e.g. interested in `Init`, and if it e.g. can be null (there is a check in the fixup function), and what the meaning is of that.
554	nit: perhaps an assert that VPSEL is a vpsel would be good.
574	nit, I think nicer to read is: MO.getReg() == 0 -> !MO.getReg().isValid()
582	Nit: just a bit shorter would be: for (auto &MO : MI->uses()) { if (MO.isImm() && MO.getImm() == Imm) return true; return false;
630	nit: was just curious why we expect the first item in the set to be the vpsel. Can we rely on that with a set?
644	Could this be a good candidate for a helper function in ARMBaseInstrInfo.h?
661	I guess you mean VMOV can be an alias for VORR, which you're checking here?
688	Can or should this not be checked much earlier?
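
For context on the line 582 nit: the lambda in the patch decides on the first immediate operand only, whereas the shorter loop accepts a match on any immediate operand. The toy sketch below shows an operand list where the two differ. `Operand`, `firstImmIs` and `anyImmIs` are hypothetical stand-ins, not LLVM's `MachineOperand` API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for MachineOperand: a register or an immediate.
struct Operand {
  bool IsImm;
  int64_t Value;
};

// Mirrors the lambda in the patch: only the first immediate operand decides.
bool firstImmIs(const std::vector<Operand> &Uses, int64_t Imm) {
  for (const Operand &MO : Uses) {
    if (!MO.IsImm)
      continue;
    return MO.Value == Imm;
  }
  return false;
}

// Mirrors the shorter form from the nit: any immediate operand may match.
bool anyImmIs(const std::vector<Operand> &Uses, int64_t Imm) {
  for (const Operand &MO : Uses)
    if (MO.IsImm && MO.Value == Imm)
      return true;
  return false;
}
```

With uses `{reg, imm 2, imm 14}`, both forms agree when asked about 2, but only the second reports a match for 14. Whether that matters depends on which immediates the checked opcodes actually carry.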

samparker marked 5 inline comments as done. Mar 16 2020, 8:28 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
179	'Init' is the possible instruction that may be initialising our result register, such as a mov #0, but we won't necessarily have an instruction doing this.
630	Because the set only has one member if we get here.
644	I'm not sure how relevant it is to the rest of the backend, but it would be more readable here as a local helper - especially once we add more supported opcodes.
661	Yes, I don't think we actually have a MVE VMOV instruction.
688	Yes, at some point in an unknown patch. I've moved it into one of the legality helpers.

Rebased, which has made checking the VPSEL predicate a much simpler task.

Herald added a subscriber: danielkiss. Jun 29 2020, 5:33 AM

SjoerdMeijer mentioned this in D82773: [ARM][MVE] Tail-predication: clean-up removing unused code. Jun 29 2020, 8:04 AM

SjoerdMeijer added inline comments. Jun 29 2020, 8:26 AM

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
768	typo: that
803	Perhaps it's good to add a comment here, or in the description of the algorithm on lines 733-753, that reductions need special treatment as they define values that are not used by predicated instructions inside the loop.
1282	Do we have a test case with more than 1 reduction?

Rebased after adding tests in reductions.ll.
Re-added support for VADDi8 and VADDi16.
Added some extra comments and TODOs.

Thanks, nice optimisation.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
250–254	is it not const anymore?
777	Do we have a (negative) test case with a float reduction?
821	nit: perhaps move this comment down a bit...
823	...and check this earlier.

This revision is now accepted and ready to land. Jun 30 2020, 8:39 AM

SjoerdMeijer mentioned this in rGaf45907653fd: [ARM][MVE] Tail-predication: clean-up of unused code. Jun 30 2020, 9:13 AM

samparker marked an inline comment as done. Jul 1 2020, 12:24 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
777	The vectorizer doesn't seem to want to produce vector float reduction loops, so I've left them out of testing.

Closed by commit rG3ee580d0176f: [ARM][LowOverheadLoops] Handle reductions (authored by samparker). Jul 1 2020, 1:02 AM

This revision was automatically updated to reflect the committed changes.

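Revision Contents

Path	Size
llvm/include/llvm/CodeGen/ReachingDefAnalysis.h	12 lines
llvm/lib/CodeGen/ReachingDefAnalysis.cpp	7 lines
llvm/lib/Target/ARM/ARMBaseInstrInfo.h	4 lines
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp	304 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/cond-vector-reduce-mve-codegen.ll	24 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/constant-init-reduction.mir	349 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/constant-reduction.mir	137 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/matrix.mir	27 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/nested-reductions.mir	275 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/reductions-8-16.mir	592 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/two-reducing-loops.mir	304 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vector-arith-codegen.ll	59 lines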

Diff 247913

llvm/include/llvm/CodeGen/ReachingDefAnalysis.h

[146 lines not shown]	public:
/// reaching def instuction of PhysReg that reaches MI.		/// reaching def instuction of PhysReg that reaches MI.
int getClearance(MachineInstr *MI, MCPhysReg PhysReg) const;		int getClearance(MachineInstr *MI, MCPhysReg PhysReg) const;

/// Provides the uses, in the same block as MI, of register that MI defines.		/// Provides the uses, in the same block as MI, of register that MI defines.
/// This does not consider live-outs.		/// This does not consider live-outs.
void getReachingLocalUses(MachineInstr *MI, int PhysReg,		void getReachingLocalUses(MachineInstr *MI, int PhysReg,
InstSet &Uses) const;		InstSet &Uses) const;

/// Search MBB for a definition of PhysReg and insert it into Defs. If no		/// Search MBB for a definition of PhysReg and insert it into Incoming. If no
		SjoerdMeijer: you renamed this `Defs` to `Incoming`...
/// definition is found, recursively search the predecessor blocks for them.		/// definition is found, recursively search the successor blocks for them.
void getLiveOuts(MachineBasicBlock *MBB, int PhysReg, InstSet &Defs,		void getLiveOuts(MachineBasicBlock *MBB, int PhysReg, InstSet &Defs) const;
		SjoerdMeijer: So rename this one too?
BlockSet &VisitedBBs) const;

/// For the given block, collect the instructions that use the live-in		/// For the given block, collect the instructions that use the live-in
/// value of the provided register. Return whether the value is still		/// value of the provided register. Return whether the value is still
/// live on exit.		/// live on exit.
bool getLiveInUses(MachineBasicBlock *MBB, int PhysReg,		bool getLiveInUses(MachineBasicBlock *MBB, int PhysReg,
InstSet &Uses) const;		InstSet &Uses) const;

/// Collect the users of the value stored in PhysReg, which is defined		/// Collect the users of the value stored in PhysReg, which is defined
[56 lines not shown]	private:

/// Provides the MI, from the given block, corresponding to the Id or a		/// Provides the MI, from the given block, corresponding to the Id or a
/// nullptr if the id does not refer to the block.		/// nullptr if the id does not refer to the block.
MachineInstr *getInstFromId(MachineBasicBlock *MBB, int InstId) const;		MachineInstr *getInstFromId(MachineBasicBlock *MBB, int InstId) const;

/// Provides the instruction of the closest reaching def instruction of		/// Provides the instruction of the closest reaching def instruction of
/// PhysReg that reaches MI, relative to the begining of MI's basic block.		/// PhysReg that reaches MI, relative to the begining of MI's basic block.
MachineInstr *getReachingLocalMIDef(MachineInstr *MI, int PhysReg) const;		MachineInstr *getReachingLocalMIDef(MachineInstr *MI, int PhysReg) const;

		/// Search MBB for a definition of PhysReg and insert it into Defs. If no
		/// definition is found, recursively search the predecessor blocks for them.
		void getLiveOuts(MachineBasicBlock *MBB, int PhysReg, InstSet &Defs,
		SjoerdMeijer: and here too?
		BlockSet &VisitedBBs) const;
};		};

} // namespace llvm		} // namespace llvm

#endif // LLVM_CODEGEN_REACHINGDEFSANALYSIS_H		#endif // LLVM_CODEGEN_REACHINGDEFSANALYSIS_H
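
The doc comment on the private `getLiveOuts` overload describes a recursive search: take the definition in the block if there is one, otherwise look in the predecessors, using a visited set to stop on cycles. The following is a minimal standalone sketch of that shape over a toy CFG. `Block` and `collectLiveOutDefs` are hypothetical, and the sketch leaves out the live-out register check that the real function performs.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical toy block for one register of interest.
struct Block {
  int DefId;              // id of the defining instruction, or -1 if none
  std::vector<int> Preds; // predecessor indices into the CFG vector
};

// Recursive worker: record the block's def, or keep searching predecessors.
void collectLiveOutDefs(const std::vector<Block> &CFG, int BB,
                        std::set<int> &Defs, std::set<int> &Visited) {
  if (!Visited.insert(BB).second)
    return; // already seen; this also terminates loops in the CFG
  if (CFG[BB].DefId != -1) {
    Defs.insert(CFG[BB].DefId);
    return;
  }
  for (int Pred : CFG[BB].Preds)
    collectLiveOutDefs(CFG, Pred, Defs, Visited);
}

// Convenience wrapper that owns the visited set, in the same spirit as the
// public overload added by this patch.
std::set<int> collectLiveOutDefs(const std::vector<Block> &CFG, int BB) {
  std::set<int> Defs, Visited;
  collectLiveOutDefs(CFG, BB, Defs, Visited);
  return Defs;
}
```

Splitting the function this way keeps the visited set out of the caller's hands, which is presumably why the patch makes the four-argument overload private.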

llvm/lib/CodeGen/ReachingDefAnalysis.cpp

[320 lines not shown]	while (!ToVisit.empty()) {
if (getLiveInUses(MBB, PhysReg, Uses))		if (getLiveInUses(MBB, PhysReg, Uses))
ToVisit.insert(ToVisit.end(), MBB->successors().begin(),		ToVisit.insert(ToVisit.end(), MBB->successors().begin(),
MBB->successors().end());		MBB->successors().end());
Visited.insert(MBB);		Visited.insert(MBB);
}		}
}		}
}		}

void		void
		Lint: Pre-merge checks: clang-format: please reformat the code.
ReachingDefAnalysis::getLiveOuts(MachineBasicBlock *MBB, int PhysReg,		ReachingDefAnalysis::getLiveOuts(MachineBasicBlock *MBB, int PhysReg,
		InstSet &Defs) const {
		SmallPtrSet<MachineBasicBlock*, 2> VisitedBBs;
		return getLiveOuts(MBB, PhysReg, Defs, VisitedBBs);
		}

		void
		ReachingDefAnalysis::getLiveOuts(MachineBasicBlock *MBB, int PhysReg,
InstSet &Defs, BlockSet &VisitedBBs) const {		InstSet &Defs, BlockSet &VisitedBBs) const {
if (VisitedBBs.count(MBB))		if (VisitedBBs.count(MBB))
return;		return;

VisitedBBs.insert(MBB);		VisitedBBs.insert(MBB);
LivePhysRegs LiveRegs(*TRI);		LivePhysRegs LiveRegs(*TRI);
LiveRegs.addLiveOuts(*MBB);		LiveRegs.addLiveOuts(*MBB);
if (!LiveRegs.contains(PhysReg))		if (!LiveRegs.contains(PhysReg))
[270 lines not shown]

llvm/lib/Target/ARM/ARMBaseInstrInfo.h

	[492 lines not shown]
	Opc == ARM::tSUBSi3 || Opc == ARM::tSUBSi8 ||			Opc == ARM::tSUBSi3 || Opc == ARM::tSUBSi8 ||
	Opc == ARM::t2SUBri || Opc == ARM::t2SUBri12 || Opc == ARM::t2SUBSri;			Opc == ARM::t2SUBri || Opc == ARM::t2SUBri12 || Opc == ARM::t2SUBSri;
	}			}

	static inline bool isMovRegOpcode(int Opc) {			static inline bool isMovRegOpcode(int Opc) {
	return Opc == ARM::MOVr || Opc == ARM::tMOVr || Opc == ARM::t2MOVr;			return Opc == ARM::MOVr || Opc == ARM::tMOVr || Opc == ARM::t2MOVr;
	}			}

				static inline bool isLSRImmOpcode(int Opc) {
				return Opc == ARM::LSRi || Opc == ARM::tLSRri || Opc == ARM::t2LSRri;
				}

	/// isValidCoprocessorNumber - decide whether an explicit coprocessor			/// isValidCoprocessorNumber - decide whether an explicit coprocessor
	/// number is legal in generic instructions like CDP. The answer can			/// number is legal in generic instructions like CDP. The answer can
	/// vary with the subtarget.			/// vary with the subtarget.
	static inline bool isValidCoprocessorNumber(unsigned Num,			static inline bool isValidCoprocessorNumber(unsigned Num,
	const FeatureBitset& featureBits) {			const FeatureBitset& featureBits) {
	// Armv8-A disallows everything other than 111x (CP14 and CP15).			// Armv8-A disallows everything other than 111x (CP14 and CP15).
	if (featureBits[ARM::HasV8Ops] && (Num & 0xE) != 0xE)			if (featureBits[ARM::HasV8Ops] && (Num & 0xE) != 0xE)
	return false;			return false;
	[106 lines not shown]

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp

[55 lines not shown]

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "arm-low-overhead-loops"		#define DEBUG_TYPE "arm-low-overhead-loops"
#define ARM_LOW_OVERHEAD_LOOPS_NAME "ARM Low Overhead Loops pass"		#define ARM_LOW_OVERHEAD_LOOPS_NAME "ARM Low Overhead Loops pass"

namespace {		namespace {

		using InstSet = SmallPtrSetImpl<MachineInstr *>;
		Lint: Pre-merge checks: clang-format: please reformat the code.

class PostOrderLoopTraversal {		class PostOrderLoopTraversal {
MachineLoop &ML;		MachineLoop &ML;
MachineLoopInfo &MLI;		MachineLoopInfo &MLI;
SmallPtrSet<MachineBasicBlock*, 4> Visited;		SmallPtrSet<MachineBasicBlock*, 4> Visited;
SmallVector<MachineBasicBlock*, 4> Order;		SmallVector<MachineBasicBlock*, 4> Order;

public:		public:
PostOrderLoopTraversal(MachineLoop &ML, MachineLoopInfo &MLI)		PostOrderLoopTraversal(MachineLoop &ML, MachineLoopInfo &MLI)
[96 lines not shown]	public:
}		}

unsigned size() const { return Insts.size(); }		unsigned size() const { return Insts.size(); }
SmallVectorImpl<PredicatedMI> &getInsts() { return Insts; }		SmallVectorImpl<PredicatedMI> &getInsts() { return Insts; }
MachineInstr *getVPST() const { return VPST->MI; }		MachineInstr *getVPST() const { return VPST->MI; }
PredicatedMI *getDivergent() const { return Divergent; }		PredicatedMI *getDivergent() const { return Divergent; }
};		};

		struct Reduction {
		MachineInstr *Init;
		SjoerdMeijer: Can you comment what these members are? I agree that most are self-explanatory, but I am e.g. interested in `Init`, and if it e.g. can be null (there is a check in the fixup function), and what the meaning is of that.
		samparker: 'Init' is the possible instruction that may be initialising our result register, such as a mov #0, but we won't necessarily have an instruction doing this.
		MachineInstr &Copy;
		MachineInstr &Reduce;
		MachineInstr &VPSEL;
		MachineInstr &VCTP;

		Reduction(MachineInstr *Init, MachineInstr *Mov, MachineInstr *Add,
		MachineInstr *Sel, MachineInstr *Pred)
		: Init(Init), Copy(*Mov), Reduce(*Add), VPSEL(*Sel), VCTP(*Pred) { }
		Lint: Pre-merge checks: clang-format: please reformat the code.
		};

struct LowOverheadLoop {		struct LowOverheadLoop {

MachineLoop &ML;		MachineLoop &ML;
MachineLoopInfo &MLI;		MachineLoopInfo &MLI;
ReachingDefAnalysis &RDA;		ReachingDefAnalysis &RDA;
const TargetRegisterInfo &TRI;		const TargetRegisterInfo &TRI;
		const ARMBaseInstrInfo &TII;
MachineFunction *MF = nullptr;		MachineFunction *MF = nullptr;
		MachineBasicBlock *Preheader = nullptr;
MachineInstr *InsertPt = nullptr;		MachineInstr *InsertPt = nullptr;
MachineInstr *Start = nullptr;		MachineInstr *Start = nullptr;
MachineInstr *Dec = nullptr;		MachineInstr *Dec = nullptr;
MachineInstr *End = nullptr;		MachineInstr *End = nullptr;
MachineInstr *VCTP = nullptr;		MachineInstr *VCTP = nullptr;
VPTBlock *CurrentBlock = nullptr;		VPTBlock *CurrentBlock = nullptr;
SetVector<MachineInstr*> CurrentPredicate;		SetVector<MachineInstr*> CurrentPredicate;
SmallVector<VPTBlock, 4> VPTBlocks;		SmallVector<VPTBlock, 4> VPTBlocks;
SmallPtrSet<MachineInstr*, 4> ToRemove;		SmallPtrSet<MachineInstr*, 4> ToRemove;
		SmallVector<std::unique_ptr<Reduction>, 1> Reductions;
bool Revert = false;		bool Revert = false;
bool CannotTailPredicate = false;		bool CannotTailPredicate = false;

LowOverheadLoop(MachineLoop &ML, MachineLoopInfo &MLI,		LowOverheadLoop(MachineLoop &ML, MachineLoopInfo &MLI,
ReachingDefAnalysis &RDA, const TargetRegisterInfo &TRI)		ReachingDefAnalysis &RDA, const TargetRegisterInfo &TRI,
: ML(ML), MLI(MLI), RDA(RDA), TRI(TRI) {		const ARMBaseInstrInfo &TII)
		: ML(ML), MLI(MLI), RDA(RDA), TRI(TRI), TII(TII) {
		Lint: Pre-merge checks: clang-format: please reformat the code.
MF = ML.getHeader()->getParent();		MF = ML.getHeader()->getParent();
		if (auto *MBB = ML.getLoopPreheader())
		Preheader = MBB;
		else if (auto *MBB = MLI.findLoopPreheader(&ML, true))
		Preheader = MBB;
}		}

// If this is an MVE instruction, check that we know how to use tail		// If this is an MVE instruction, check that we know how to use tail
// predication with it. Record VPT blocks and return whether the		// predication with it. Record VPT blocks and return whether the
// instruction is valid for tail predication.		// instruction is valid for tail predication.
bool ValidateMVEInst(MachineInstr *MI);		bool ValidateMVEInst(MachineInstr *MI);

void AnalyseMVEInst(MachineInstr *MI) {		void AnalyseMVEInst(MachineInstr *MI) {
CannotTailPredicate = !ValidateMVEInst(MI);		CannotTailPredicate = !ValidateMVEInst(MI);
}		}

bool IsTailPredicationLegal() const {		bool IsTailPredicationLegal() const {
// For now, let's keep things really simple and only support a single		// For now, let's keep things really simple and only support a single
// block for tail predication.		// block for tail predication.
return !Revert && FoundAllComponents() && VCTP &&		return !Revert && FoundAllComponents() && VCTP &&
!CannotTailPredicate && ML.getNumBlocks() == 1;		!CannotTailPredicate && ML.getNumBlocks() == 1;
}		}

// Check that the predication in the loop will be equivalent once we		// Check that the predication in the loop will be equivalent once we
// perform the conversion. Also ensure that we can provide the number		// perform the conversion. Also ensure that we can provide the number
// of elements to the loop start instruction.		// of elements to the loop start instruction.
bool ValidateTailPredicate(MachineInstr *StartInsertPt);		bool ValidateTailPredicate(MachineInstr *StartInsertPt);

		// See whether the live-out instructions are a reduction that we can fixup
		// later.
		bool FindValidReduction(InstSet &LiveMIs, InstSet &LiveOutUsers);

// Check that any values available outside of the loop will be the same		// Check that any values available outside of the loop will be the same
// after tail predication conversion.		// after tail predication conversion.
bool ValidateLiveOuts() const;		bool ValidateLiveOuts();

		// Is the vpsel in the exit block predicated upon the element count in
		// a way that allows it to combine values from two iterations?
		MachineInstr* getMergePredicate(MachineInstr *VPSEL) const;
		Lint: Pre-merge checks: clang-format: please reformat the code.
		SjoerdMeijer: is it not const anymore?

// Is it safe to define LR with DLS/WLS?		// Is it safe to define LR with DLS/WLS?
// LR can be defined if it is the operand to start, because it's the same		// LR can be defined if it is the operand to start, because it's the same
// value, or if it's going to be equivalent to the operand to Start.		// value, or if it's going to be equivalent to the operand to Start.
MachineInstr *isSafeToDefineLR();		MachineInstr *isSafeToDefineLR();

// Check the branch targets are within range and we satisfy our		// Check the branch targets are within range and we satisfy our
// restrictions.		// restrictions.
[73 lines not shown]	private:
void RevertWhile(MachineInstr *MI) const;		void RevertWhile(MachineInstr *MI) const;

bool RevertLoopDec(MachineInstr *MI) const;		bool RevertLoopDec(MachineInstr *MI) const;

void RevertLoopEnd(MachineInstr *MI, bool SkipCmp = false) const;		void RevertLoopEnd(MachineInstr *MI, bool SkipCmp = false) const;

void ConvertVPTBlocks(LowOverheadLoop &LoLoop);		void ConvertVPTBlocks(LowOverheadLoop &LoLoop);

		void FixupReductions(LowOverheadLoop &LoLoop) const;

MachineInstr *ExpandLoopStart(LowOverheadLoop &LoLoop);		MachineInstr *ExpandLoopStart(LowOverheadLoop &LoLoop);

void Expand(LowOverheadLoop &LoLoop);		void Expand(LowOverheadLoop &LoLoop);

void IterationCountDCE(LowOverheadLoop &LoLoop);		void IterationCountDCE(LowOverheadLoop &LoLoop);
};		};
}		}

[115 lines not shown]	auto CannotProvideElements = [this](MachineBasicBlock *MBB,
// Don't continue searching up through multiple predecessors.		// Don't continue searching up through multiple predecessors.
if (MBB->pred_size() > 1)		if (MBB->pred_size() > 1)
return true;		return true;

return false;		return false;
};		};

// First, find the block that looks like the preheader.		// First, find the block that looks like the preheader.
MachineBasicBlock *MBB = MLI.findLoopPreheader(&ML, true);		MachineBasicBlock *MBB = Preheader;
if (!MBB) {		if (!MBB) {
LLVM_DEBUG(dbgs() << "ARM Loops: Didn't find preheader.\n");		LLVM_DEBUG(dbgs() << "ARM Loops: Didn't find preheader.\n");
return false;		return false;
}		}

// Then search backwards for a def, until we get to InsertBB.		// Then search backwards for a def, until we get to InsertBB.
while (MBB != InsertBB) {		while (MBB != InsertBB) {
if (CannotProvideElements(MBB, NumElements)) {		if (CannotProvideElements(MBB, NumElements)) {
[59 lines not shown]	static bool isVectorPredicated(MachineInstr *MI) {
return PIdx != -1 && MI->getOperand(PIdx + 1).getReg() == ARM::VPR;		return PIdx != -1 && MI->getOperand(PIdx + 1).getReg() == ARM::VPR;
}		}

static bool isRegInClass(const MachineOperand &MO,		static bool isRegInClass(const MachineOperand &MO,
const TargetRegisterClass *Class) {		const TargetRegisterClass *Class) {
return MO.isReg() && MO.getReg() && Class->contains(MO.getReg());		return MO.isReg() && MO.getReg() && Class->contains(MO.getReg());
}		}

bool LowOverheadLoop::ValidateLiveOuts() const {		MachineInstr* LowOverheadLoop::getMergePredicate(MachineInstr *VPSEL) const {
		Lint: Pre-merge checks: clang-format: please reformat the code.
		unsigned VPRIdx = llvm::findFirstVPTPredOperandIdx(*VPSEL) + 1;
		SjoerdMeijer: nit: perhaps an assert that VPSEL is a vpsel would be good.
		MachineInstr *Pred = RDA.getMIOperand(VPSEL, VPRIdx);
		if (!Pred || Pred->getOpcode() != VCTP->getOpcode())
		return nullptr;

		MachineInstr *ExitBlockElems = RDA.getMIOperand(Pred, 1);
		if (!ExitBlockElems)
		return nullptr;

		auto FirstMIUse = [this](MachineInstr *MI) -> MachineInstr* {
		Lint: Pre-merge checks: clang-format: please reformat the code.
		for (auto &MO : MI->uses()) {
		if (!MO.isReg() || MO.getReg() == 0)
		continue;
		return RDA.getMIOperand(MI, MO);
		}
		return nullptr;
		};

		auto LastMIUse = [this](MachineInstr *MI) -> MachineInstr* {
		Lint: Pre-merge checks: clang-format: please reformat the code.
		for (auto &MO : reverse(MI->uses())) {
		if (!MO.isReg() || MO.getReg() == 0)
		SjoerdMeijer: nit, I think nicer to read is: MO.getReg() == 0 -> !MO.getReg().isValid()
		continue;
		return RDA.getMIOperand(MI, MO);
		}
		return nullptr;
		};

		auto FirstImmUse = [](MachineInstr *MI, int64_t Imm) {
		for (auto &MO : MI->uses()) {
		SjoerdMeijer: Nit: just a bit shorter would be: for (auto &MO : MI->uses()) { if (MO.isImm() && MO.getImm() == Imm) return true; return false;
		if (!MO.isImm())
		continue;
		return MO.getImm() == Imm;
		}
		return false;
		};

		// Check if the VCTP is using the exiting element count calculated in the
		// preheader. The instructions will look like something like this, where
		// X is the vector factor:
		// BackedgeCount = (SUB (BIC (ADD TotalElems, X-1), X-1), X)
		// TripCount = (ADD BackedgeCount, 1)
		// ExitBlockElems = (SUB TotalElems, (LSR BackedgeCount, log2(X)))

		MachineInstr *TripCount = RDA.getMIOperand(Start, 0);
		if (!TripCount)
		return nullptr;

		if (auto *LSR = LastMIUse(ExitBlockElems)) {
		if (!isLSRImmOpcode(LSR->getOpcode()))
		return nullptr;
		unsigned ShiftAmt = log2(getTailPredVectorWidth(VCTP->getOpcode()));
		if (FirstImmUse(LSR, ShiftAmt))
		if (auto *BackedgeCount = FirstMIUse(LSR))
		if (BackedgeCount == LastMIUse(TripCount))
		return Pred;
		}

		return nullptr;
		}

		bool
		Lint: Pre-merge checks: clang-format: please reformat the code.
		LowOverheadLoop::FindValidReduction(InstSet &LiveMIs, InstSet &LiveOutUsers) {
		// Also check for reductions where the operation needs to be merging values
		// from the last and previous loop iterations. This means an instruction
		// producing a value and a vmov storing the value calculated in the previous
		// iteration. So we can have two live-out regs, one produced by a vmov and
		// both being consumed by a vpsel.
		LLVM_DEBUG(dbgs() << "ARM Loops: Found loop live-outs:\n";
		Lint: Pre-merge checks: clang-format: please reformat the code.
		for (auto *MI : LiveMIs)
		dbgs() << " - " << *MI);

		// Expect a vmov, a vadd and a single vpsel user.
		if (LiveMIs.size() != 2 || LiveOutUsers.size() != 1)
		return false;

		MachineInstr *VPSEL = *LiveOutUsers.begin();
		if (VPSEL->getOpcode() != ARM::MVE_VPSEL)
		SjoerdMeijer: nit: was just curious why we expect the first item in the set to be the vpsel. Can we rely on that with a set?
		samparker: Because the set only has one member if we get here.
		return false;

		MachineInstr *MergePred = getMergePredicate(VPSEL);
		if (!MergePred) {
		LLVM_DEBUG(dbgs() << "ARM Loops: Not using equivalent predicate.\n");
		return false;
		}

		MachineInstr *Reduce = RDA.getMIOperand(VPSEL, 1);
		if (!Reduce)
		return false;

		// TODO: Support more operations that VADD.
		switch (VCTP->getOpcode()) {
		SjoerdMeijer (inline comment, Not Done): Could this be a good candidate for a helper function in ARMBaseInstrInfo.h?
		samparker (author reply, Done): I'm not sure how relevant it is to the rest of the backend, but it would be more readable here as a local helper - especially once we add more supported opcodes.
		default:
		return false;
		case ARM::MVE_VCTP32:
		if (Reduce->getOpcode() != ARM::MVE_VADDi32)
		return false;
		break;
		case ARM::MVE_VCTP16:
		if (Reduce->getOpcode() != ARM::MVE_VADDi16)
		return false;
		break;
		case ARM::MVE_VCTP8:
		if (Reduce->getOpcode() != ARM::MVE_VADDi8)
		return false;
		break;
		}

		// Check that the VORR is actually a VMOV.
		SjoerdMeijer (inline comment, Not Done): I guess you mean VMOV can be an alias for VORR, which you're checking here?
		samparker (author reply, Done): Yes, I don't think we actually have a MVE VMOV instruction.
		MachineInstr *Copy = RDA.getMIOperand(VPSEL, 2);
		if (!Copy || Copy->getOpcode() != ARM::MVE_VORR ||
		!Copy->getOperand(1).isReg() || !Copy->getOperand(2).isReg() ||
		Copy->getOperand(1).getReg() != Copy->getOperand(2).getReg())
		return false;

		assert((LiveMIs.count(Reduce) && LiveMIs.count(Copy)) &&
		"Expected live outs to be consumed by vpsel");

		assert((Reduce->getOperand(0).getReg() == Reduce->getOperand(1).getReg() ||
		Reduce->getOperand(0).getReg() == Reduce->getOperand(2).getReg()) &&
		"Expected VADD to be overwriting one of its operands");

		// Check that the vadd and vmov are only used by each other and the vpsel.
		SmallPtrSet<MachineInstr*, 2> CopyUsers;
		RDA.getGlobalUses(Copy, Copy->getOperand(0).getReg(), CopyUsers);
		if (CopyUsers.size() > 2 || !CopyUsers.count(Reduce))
		return false;

		SmallPtrSet<MachineInstr*, 2> ReduceUsers;
		RDA.getGlobalUses(Reduce, Reduce->getOperand(0).getReg(), ReduceUsers);
		if (ReduceUsers.size() > 2 || !ReduceUsers.count(Copy))
		return false;

		// Then find whether there's an instruction initialising the register that
		// is storing the reduction.
		if (!Preheader)
		SjoerdMeijer (inline comment, Not Done): Can or should this not be checked much earlier?
		samparker (author reply, Done): Yes, at some point in an unknown patch. I've moved it into one of the legality helpers.
		return false;

		SmallPtrSet<MachineInstr*, 2> Incoming;
		RDA.getLiveOuts(Preheader, Copy->getOperand(1).getReg(), Incoming);
		if (Incoming.size() > 1)
		return false;

		MachineInstr *Init = Incoming.empty() ? nullptr : *Incoming.begin();
		LLVM_DEBUG(dbgs() << "ARM Loops: Found a reduction:\n"
		<< " - " << *Copy << " - " << *Reduce
		<< " - " << *MergePred << " - " << *VPSEL);
		Reductions.push_back(std::make_unique<Reduction>(Init, Copy, Reduce,
		VPSEL, MergePred));
		return true;
		}
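
Following the inline exchange about moving the VCTP/VADD opcode switch into a local helper, here is a minimal, self-contained sketch of what such a helper could look like. The `Opcode` enum and the name `isCompatibleReduceOp` are hypothetical stand-ins chosen for illustration; they are not the `ARM::MVE_*` opcodes and not a function from this patch.

```cpp
// Hypothetical stand-ins for the ARM::MVE_* opcodes used in the switch above.
enum class Opcode {
  VCTP8, VCTP16, VCTP32, VCTP64,
  VADDi8, VADDi16, VADDi32, VSUBi32
};

// Returns true when the reducing operation is a VADD whose element width
// matches the element width covered by the given VCTP. Every other
// combination is rejected, mirroring the default case of the switch.
bool isCompatibleReduceOp(Opcode Vctp, Opcode Reduce) {
  switch (Vctp) {
  case Opcode::VCTP32:
    return Reduce == Opcode::VADDi32;
  case Opcode::VCTP16:
    return Reduce == Opcode::VADDi16;
  case Opcode::VCTP8:
    return Reduce == Opcode::VADDi8;
  default:
    return false;
  }
}
```

Supporting another operation would then amount to extending each case with the matching opcode of the same element width.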

		bool LowOverheadLoop::ValidateLiveOuts() {
// Collect Q-regs that are live in the exit blocks. We don't collect scalars		// Collect Q-regs that are live in the exit blocks. We don't collect scalars
// because they won't be affected by lane predication.		// because they won't be affected by lane predication.
const TargetRegisterClass *QPRs = TRI.getRegClass(ARM::MQPRRegClassID);		const TargetRegisterClass *QPRs = TRI.getRegClass(ARM::MQPRRegClassID);
SmallSet<Register, 2> LiveOuts;		SmallSet<Register, 2> LiveOuts;
SmallVector<MachineBasicBlock *, 2> ExitBlocks;		SmallVector<MachineBasicBlock *, 2> ExitBlocks;
ML.getExitBlocks(ExitBlocks);		ML.getExitBlocks(ExitBlocks);
for (auto *MBB : ExitBlocks)		assert(ExitBlocks.size() == 1 && "Expected a single exit block");
for (const MachineBasicBlock::RegisterMaskPair &RegMask : MBB->liveins())
if (QPRs->contains(RegMask.PhysReg))		SmallPtrSet<MachineInstr*, 2> LiveOutUsers;
LiveOuts.insert(RegMask.PhysReg);		MachineBasicBlock *MBB = ExitBlocks.front();
		for (const MachineBasicBlock::RegisterMaskPair &RegMask : MBB->liveins()) {
		Register PhysReg = RegMask.PhysReg;
		if (QPRs->contains(PhysReg)) {
		LiveOuts.insert(PhysReg);
		RDA.getLiveInUses(MBB, PhysReg, LiveOutUsers);
		}
		}

// Collect the instructions in the loop body that define the live-out values.		// Collect the instructions in the loop body that define the live-out values.
SmallPtrSet<MachineInstr *, 2> LiveMIs;		SmallPtrSet<MachineInstr *, 2> LiveMIs;
assert(ML.getNumBlocks() == 1 && "Expected single block loop!");		assert(ML.getNumBlocks() == 1 && "Expected single block loop!");
MachineBasicBlock *MBB = ML.getHeader();		MBB = ML.getHeader();
for (auto Reg : LiveOuts)		for (auto Reg : LiveOuts) {
if (auto *MI = RDA.getLocalLiveOutMIDef(MBB, Reg))		if (auto *MI = RDA.getLocalLiveOutMIDef(MBB, Reg))
		if (!isVectorPredicated(MI))
LiveMIs.insert(MI);		LiveMIs.insert(MI);
		}

LLVM_DEBUG(dbgs() << "ARM Loops: Found loop live-outs:\n";		// If we have any non-predicated live-outs, they need to be part of a
for (auto *MI : LiveMIs)		// reduction that we can fixup later.
dbgs() << " - " << *MI);		if (!LiveMIs.empty() && !FindValidReduction(LiveMIs, LiveOutUsers))
// We've already validated that any VPT predication within the loop will be
// equivalent when we perform the predication transformation; so we know that
// any VPT predicated instruction is predicated upon VCTP. Any live-out
// instruction needs to be predicated, so check this here.
for (auto *MI : LiveMIs)
if (!isVectorPredicated(MI))
return false;		return false;

// We want to find out if the tail-predicated version of this loop will		// We want to find out if the tail-predicated version of this loop will
// produce the same values as the loop in its original form. For this to		// produce the same values as the loop in its original form. For this to
// be true, the newly inserted implicit predication must not change the		// be true, the newly inserted implicit predication must not change the
// the results. All MVE loads and stores have to be predicated, so we know		// the results. All MVE loads and stores have to be predicated, so we know
// that any load operands, or stored results are equivalent already.		// that any load operands, or stored results are equivalent already.
// Other explicitly predicated instructions will perform the same operation		// Other explicitly predicated instructions will perform the same operation
// in the original loop and the tail-predicated form too. Here, we call		// in the original loop and the tail-predicated form too. Here, we call
// predicated instructions 'Known' and their users are also Known if it		// predicated instructions 'Known' and their users are also Known if it
// overwrites an Known operand. This is because tail predication will mean		// overwrites an Known operand. This is because tail predication will mean
// that the false lanes are not updated, but we know the output because we		// that the false lanes are not updated, but we know the output because we
// know the input.		// know the input.
// For any 'Unknown' instructions, we can check that they're only consumed by		// For any 'Unknown' instructions, we can check that they're only consumed by
// Known instructions because it means that the unknown false lane(s) are		// Known instructions because it means that the unknown false lane(s) are
// replaced with known lane(s).		// replaced with known lane(s).
// What this should result in is that all instructions are dependent upon		// What this should result in is that all instructions are dependent upon
// predicated inputs. It also means that for an iteration of the loop, a		// predicated inputs. It also means that for an iteration of the loop, a
// register should hold the same value whether tail predication has happened		// register should hold the same value whether tail predication has happened
// or not, with the exception when the differences are masked away by their		// or not, with the exception when the differences are masked away by their
// user(s) and not observable elsewhere.		// user(s) and not observable elsewhere.
SetVector<MachineInstr *> Unknowns;		SetVector<MachineInstr *> Unknowns;
SmallPtrSet<MachineInstr *, 4> Knowns;		SmallPtrSet<MachineInstr *, 4> Knowns;
		Knowns.insert(LiveMIs.begin(), LiveMIs.end());
for (auto &MI : *MBB) {		for (auto &MI : *MBB) {
if (isVectorPredicated(&MI)) {		if (isVectorPredicated(&MI)) {
Knowns.insert(&MI);		Knowns.insert(&MI);
continue;		continue;
}		}

// Only evaluate instructions which produce a single value.		// Only evaluate instructions which produce a single value.
if (MI.getNumDefs() != 1 || !MI.defs().begin()->isReg()) {		if (MI.getNumDefs() != 1 || !MI.defs().begin()->isReg()) {
		SjoerdMeijer (inline comment, Not Done): typo: that
Unknowns.insert(&MI);		Unknowns.insert(&MI);
continue;		continue;
}		}

Register DefReg = MI.defs().begin()->getReg();		Register DefReg = MI.defs().begin()->getReg();
for (auto &MO : MI.operands()) {		for (auto &MO : MI.operands()) {
if (!isRegInClass(MO, QPRs) || !MO.isUse() || MO.getReg() != DefReg)		if (!isRegInClass(MO, QPRs) || !MO.isUse() || MO.getReg() != DefReg)
continue;		continue;

		SjoerdMeijer (inline comment, Not Done): Do we have a (negative) test case with a float reduction?
		samparker (author reply, Done): The vectorizer doesn't seem to want to produce vector float reductions loops, so I've left them out on testing.
// If this instruction overwrites one of its operands, and that register		// If this instruction overwrites one of its operands, and that register
// has known lanes, then this instruction also has known predicated false		// has known lanes, then this instruction also has known predicated false
// lanes.		// lanes.
if (auto *OpDef = RDA.getMIOperand(&MI, MO)) {		if (auto *OpDef = RDA.getMIOperand(&MI, MO)) {
if (Knowns.count(OpDef)) {		if (Knowns.count(OpDef)) {
Knowns.insert(&MI);		Knowns.insert(&MI);
break;		break;
}		}
}		}
}		}
if (!Knowns.count(&MI))		if (!Knowns.count(&MI))
Unknowns.insert(&MI);		Unknowns.insert(&MI);
}		}

auto HasKnownUsers = [this](MachineInstr *MI, const MachineOperand &MO,		auto HasKnownUsers = [this](MachineInstr *MI, const MachineOperand &MO,
SmallPtrSetImpl<MachineInstr *> &Knowns) {		InstSet &Knowns) {
SmallPtrSet<MachineInstr *, 2> Uses;		SmallPtrSet<MachineInstr *, 2> Uses;
RDA.getGlobalUses(MI, MO.getReg(), Uses);		RDA.getGlobalUses(MI, MO.getReg(), Uses);
for (auto *Use : Uses) {		for (auto *Use : Uses) {
if (Use != MI && !Knowns.count(Use))		if (Use != MI && !Knowns.count(Use))
return false;		return false;
}		}
return true;		return true;
};		};

// Now for all the unknown values, see if they're only consumed by known		// Now for all the unknown values, see if they're only consumed by known
		SjoerdMeijer (inline comment, Not Done): Perhaps it's good to add a comment here or in the description of the algorithm on line 733 - 753 that reductions need special treatment as they define values that are not used by predicated instructions inside the loop.
// instructions.		// instructions.
for (auto *MI : reverse(Unknowns)) {		for (auto *MI : reverse(Unknowns)) {
for (auto &MO : MI->operands()) {		for (auto &MO : MI->operands()) {
if (!isRegInClass(MO, QPRs) || !MO.isDef())		if (!isRegInClass(MO, QPRs) || !MO.isDef())
continue;		continue;
if (!HasKnownUsers(MI, MO, Knowns)) {		if (!HasKnownUsers(MI, MO, Knowns)) {
LLVM_DEBUG(dbgs() << "ARM Loops: Found an unknown def of : "		LLVM_DEBUG(dbgs() << "ARM Loops: Found an unknown def of : "
<< TRI.getRegAsmName(MO.getReg()) << " at " << *MI);		<< TRI.getRegAsmName(MO.getReg()) << " at " << *MI);
return false;		return false;
}		}
}		}
Knowns.insert(MI);		Knowns.insert(MI);
}		}
return true;		return true;
}		}
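
The Known/Unknown reasoning described in the comments of `ValidateLiveOuts` can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Inst`, `reachingDef`, `usersOf` and `validateLiveOuts` are hypothetical toy definitions, the reaching-definition queries are replaced by a cyclic scan over a single-block loop body, and the `InReduction` flag stands in for whatever `FindValidReduction` would have accepted. It is not the pass's implementation.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy stand-in for a MachineInstr in a single-block loop body.
struct Inst {
  bool Predicated;       // explicitly predicated on the VCTP
  int Def;               // vector register written, or -1 for none
  std::vector<int> Uses; // vector registers read
  bool LiveOut;          // value is live in the exit block
  bool InReduction;      // assumed accepted by a FindValidReduction-like check
};

// Index of the closest preceding definition of Reg, following the back-edge,
// or -1 when no other instruction in the body defines it.
int reachingDef(const std::vector<Inst> &Body, std::size_t At, int Reg) {
  const std::size_t N = Body.size();
  for (std::size_t K = 1; K < N; ++K) {
    std::size_t I = (At + N - K) % N;
    if (Body[I].Def == Reg)
      return static_cast<int>(I);
  }
  return -1;
}

// Indices of the other instructions that read the value defined at Body[At],
// following the back-edge and stopping once the register is redefined.
std::vector<std::size_t> usersOf(const std::vector<Inst> &Body, std::size_t At) {
  std::vector<std::size_t> Users;
  const std::size_t N = Body.size();
  const int Reg = Body[At].Def;
  for (std::size_t K = 1; K < N; ++K) {
    std::size_t I = (At + K) % N;
    for (int U : Body[I].Uses)
      if (U == Reg) {
        Users.push_back(I);
        break;
      }
    if (Body[I].Def == Reg)
      break;
  }
  return Users;
}

bool validateLiveOuts(const std::vector<Inst> &Body) {
  std::set<std::size_t> Knowns;
  std::vector<std::size_t> Unknowns;

  // Unpredicated live-outs are only tolerated as part of a reduction.
  for (std::size_t I = 0; I < Body.size(); ++I) {
    if (!Body[I].LiveOut || Body[I].Predicated)
      continue;
    if (!Body[I].InReduction)
      return false;
    Knowns.insert(I);
  }

  // Predicated instructions are Known; so is an instruction that overwrites
  // an operand whose reaching definition is already Known.
  for (std::size_t I = 0; I < Body.size(); ++I) {
    const Inst &MI = Body[I];
    if (MI.Predicated) {
      Knowns.insert(I);
      continue;
    }
    if (MI.Def >= 0) {
      for (int U : MI.Uses) {
        if (U != MI.Def)
          continue;
        int D = reachingDef(Body, I, U);
        if (D >= 0 && Knowns.count(static_cast<std::size_t>(D))) {
          Knowns.insert(I);
          break;
        }
      }
    }
    if (!Knowns.count(I))
      Unknowns.push_back(I);
  }

  // Walking backwards, every Unknown value must be consumed only by Knowns.
  for (auto It = Unknowns.rbegin(); It != Unknowns.rend(); ++It) {
    if (Body[*It].Def >= 0)
      for (std::size_t U : usersOf(Body, *It))
        if (!Knowns.count(U))
          return false;
    Knowns.insert(*It);
  }
  return true;
}
```

In this model an unpredicated multiply feeding a predicated store is accepted, two unpredicated instructions feeding each other across the back-edge are rejected, and an unpredicated live-out accumulator is accepted only when it is flagged as part of a reduction.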

void LowOverheadLoop::CheckLegality(ARMBasicBlockUtils *BBUtils) {		void LowOverheadLoop::CheckLegality(ARMBasicBlockUtils *BBUtils) {
if (Revert)		if (Revert)
		SjoerdMeijer (inline comment, Not Done): nit: perhaps move this comment down a bit...
return;		return;

		SjoerdMeijer (inline comment, Not Done): ...and check this earlier.
if (!End->getOperand(1).isMBB())		if (!End->getOperand(1).isMBB())
report_fatal_error("Expected LoopEnd to target basic block");		report_fatal_error("Expected LoopEnd to target basic block");

// TODO Maybe there's cases where the target doesn't have to be the header,		// TODO Maybe there's cases where the target doesn't have to be the header,
// but for now be safe and revert.		// but for now be safe and revert.
if (End->getOperand(1).getMBB() != ML.getHeader()) {		if (End->getOperand(1).getMBB() != ML.getHeader()) {
LLVM_DEBUG(dbgs() << "ARM Loops: LoopEnd is not targetting header.\n");		LLVM_DEBUG(dbgs() << "ARM Loops: LoopEnd is not targetting header.\n");
Revert = true;		Revert = true;
▲ Show 20 Lines • Show All 108 Lines • ▼ Show 20 Lines	if ((Flags & ARMII::ValidForTailPredication) == 0 && !IsUse) {
return false;		return false;
}		}

// If the instruction is already explicitly predicated, then the conversion		// If the instruction is already explicitly predicated, then the conversion
// will be fine, but ensure that all memory operations are predicated.		// will be fine, but ensure that all memory operations are predicated.
return !IsUse && MI->mayLoadOrStore() ? false : true;		return !IsUse && MI->mayLoadOrStore() ? false : true;
}		}


bool ARMLowOverheadLoops::runOnMachineFunction(MachineFunction &mf) {		bool ARMLowOverheadLoops::runOnMachineFunction(MachineFunction &mf) {
const ARMSubtarget &ST = static_cast<const ARMSubtarget&>(mf.getSubtarget());		const ARMSubtarget &ST = static_cast<const ARMSubtarget&>(mf.getSubtarget());
if (!ST.hasLOB())		if (!ST.hasLOB())
return false;		return false;

MF = &mf;		MF = &mf;
LLVM_DEBUG(dbgs() << "ARM Loops on " << MF->getName() << " ------------- \n");		LLVM_DEBUG(dbgs() << "ARM Loops on " << MF->getName() << " ------------- \n");

▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	for (auto &MI : *MBB) {
if (isLoopStart(MI))		if (isLoopStart(MI))
return &MI;		return &MI;
}		}
if (MBB->pred_size() == 1)		if (MBB->pred_size() == 1)
return SearchForStart(*MBB->pred_begin());		return SearchForStart(*MBB->pred_begin());
return nullptr;		return nullptr;
};		};

LowOverheadLoop LoLoop(ML, MLI, RDA, TRI);		LowOverheadLoop LoLoop(ML, MLI, RDA, TRI, *TII);
// Search the preheader for the start intrinsic.		// Search the preheader for the start intrinsic.
// FIXME: I don't see why we shouldn't be supporting multiple predecessors		// FIXME: I don't see why we shouldn't be supporting multiple predecessors
// with potentially multiple set.loop.iterations, so we need to enable this.		// with potentially multiple set.loop.iterations, so we need to enable this.
if (auto *Preheader = ML->getLoopPreheader())		if (LoLoop.Preheader)
LoLoop.Start = SearchForStart(Preheader);		LoLoop.Start = SearchForStart(LoLoop.Preheader);
else if (auto *Preheader = MLI->findLoopPreheader(ML, true))
LoLoop.Start = SearchForStart(Preheader);
else		else
return false;		return false;

// Find the low-overhead loop components and decide whether or not to fall		// Find the low-overhead loop components and decide whether or not to fall
// back to a normal loop. Also look for a vctp instructions and decide		// back to a normal loop. Also look for a vctp instructions and decide
// whether we can convert that predicate using tail predication.		// whether we can convert that predicate using tail predication.
for (auto *MBB : reverse(ML->getBlocks())) {		for (auto *MBB : reverse(ML->getBlocks())) {
for (auto &MI : *MBB) {		for (auto &MI : *MBB) {
▲ Show 20 Lines • Show All 237 Lines • ▼ Show 20 Lines	MachineInstr* ARMLowOverheadLoops::ExpandLoopStart(LowOverheadLoop &LoLoop) {
// If we're inserting at a mov lr, then remove it as it's redundant.		// If we're inserting at a mov lr, then remove it as it's redundant.
if (InsertPt != Start)		if (InsertPt != Start)
LoLoop.ToRemove.insert(InsertPt);		LoLoop.ToRemove.insert(InsertPt);
LoLoop.ToRemove.insert(Start);		LoLoop.ToRemove.insert(Start);
LLVM_DEBUG(dbgs() << "ARM Loops: Inserted start: " << *MIB);		LLVM_DEBUG(dbgs() << "ARM Loops: Inserted start: " << *MIB);
return &*MIB;		return &*MIB;
}		}

		void ARMLowOverheadLoops::FixupReductions(LowOverheadLoop &LoLoop) const {
		LLVM_DEBUG(dbgs() << "ARM Loops: Fixing up reduction(s).\n");
		auto BuildMov = [this](MachineInstr &InsertPt, Register To, Register From) {
		MachineBasicBlock *MBB = InsertPt.getParent();
		MachineInstrBuilder MIB =
		BuildMI(*MBB, &InsertPt, InsertPt.getDebugLoc(), TII->get(ARM::MVE_VORR));
		MIB.addDef(To);
		MIB.addReg(From);
		MIB.addReg(From);
		MIB.addImm(0);
		MIB.addReg(0);
		MIB.addReg(To);
		LLVM_DEBUG(dbgs() << "ARM Loops: Inserted VMOV: " << *MIB);
		};

		for (auto &Reduction : LoLoop.Reductions) {
		SjoerdMeijer (inline comment, Not Done): Do we have a test case with more than 1 reduction?
		MachineInstr &Copy = Reduction->Copy;
		MachineInstr &Reduce = Reduction->Reduce;
		Register DestReg = Copy.getOperand(0).getReg();

		// Change the initialiser if present
		if (Reduction->Init) {
		MachineInstr *Init = Reduction->Init;

		for (unsigned i = 0; i < Init->getNumOperands(); ++i) {
		MachineOperand &MO = Init->getOperand(i);
		if (MO.isReg() && MO.isUse() && MO.isTied() &&
		Init->findTiedOperandIdx(i) == 0)
		Init->getOperand(i).setReg(DestReg);
		}
		Init->getOperand(0).setReg(DestReg);
		LLVM_DEBUG(dbgs() << "ARM Loops: Changed init regs: " << *Init);
		} else
		BuildMov(LoLoop.Preheader->instr_back(), DestReg, Copy.getOperand(1).getReg());

		// Change the reducing op to write to the register that is used to copy
		// its value on the next iteration. Also update the tied-def operand.
		Reduce.getOperand(0).setReg(DestReg);
		Reduce.getOperand(5).setReg(DestReg);
		LLVM_DEBUG(dbgs() << "ARM Loops: Changed reduction regs: " << Reduce);

		// Instead of a vpsel, just copy the register into the necessary one.
		MachineInstr &VPSEL = Reduction->VPSEL;
		if (VPSEL.getOperand(0).getReg() != DestReg)
		BuildMov(VPSEL, VPSEL.getOperand(0).getReg(), DestReg);

		// Remove the unnecessary instructions.
		LLVM_DEBUG(dbgs() << "ARM Loops: Removing:\n"
		<< " - " << Copy
		<< " - " << VPSEL
		<< " - " << Reduction->VCTP);
		Copy.eraseFromParent();
		VPSEL.eraseFromParent();
		Reduction->VCTP.eraseFromParent();
		}
		}
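
To see why the fix-up can drop the vmov, vctp and vpsel, it may help to compare both loop shapes on plain arrays. The following is a lane-level sketch only, assuming a four-lane 32-bit add reduction; `Vec`, `loadWithGarbage`, `reduceWithVpsel` and `reduceTailPredicated` are hypothetical names, and the garbage constant models the unspecified contents of inactive load lanes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t Lanes = 4;
using Vec = std::array<std::uint32_t, Lanes>;

// Loads Lanes elements starting at Base; lanes past the end of A hold an
// arbitrary value to model unspecified inactive lanes.
Vec loadWithGarbage(const std::vector<std::uint32_t> &A, std::size_t Base) {
  Vec V{};
  for (std::size_t L = 0; L < Lanes; ++L)
    V[L] = Base + L < A.size() ? A[Base + L] : 0xDEADBEEFu;
  return V;
}

// Models vaddv: adds all lanes with wrap-around.
std::uint32_t sumLanes(const Vec &V) {
  std::uint32_t S = 0;
  for (std::uint32_t X : V)
    S += X;
  return S;
}

// Original shape: a vmov keeps the previous accumulator, the vadd updates
// every lane, and a vpsel in the exit block picks the previous value for the
// lanes that were inactive in the final iteration.
std::uint32_t reduceWithVpsel(const std::vector<std::uint32_t> &A) {
  Vec Acc{}, Prev{};
  for (std::size_t Base = 0; Base < A.size(); Base += Lanes) {
    Prev = Acc; // vmov
    Vec X = loadWithGarbage(A, Base);
    for (std::size_t L = 0; L < Lanes; ++L)
      Acc[L] += X[L]; // unpredicated vadd
  }
  // Number of active lanes in the final iteration (the exit-block vctp).
  std::size_t Last = A.empty() ? 0 : A.size() - ((A.size() - 1) / Lanes) * Lanes;
  Vec Sel{};
  for (std::size_t L = 0; L < Lanes; ++L)
    Sel[L] = L < Last ? Acc[L] : Prev[L]; // vpsel
  return sumLanes(Sel);
}

// Tail-predicated shape: the vadd updates the accumulator in place and
// leaves the inactive lanes untouched, so no copy or select is needed.
std::uint32_t reduceTailPredicated(const std::vector<std::uint32_t> &A) {
  Vec Acc{};
  for (std::size_t Base = 0; Base < A.size(); Base += Lanes) {
    Vec X = loadWithGarbage(A, Base);
    for (std::size_t L = 0; L < Lanes; ++L)
      if (Base + L < A.size())
        Acc[L] += X[L]; // implicitly predicated vadd
  }
  return sumLanes(Acc);
}
```

Both functions return the plain sum of the input for any length, which is the equivalence the rewritten register assignment relies on.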

void ARMLowOverheadLoops::ConvertVPTBlocks(LowOverheadLoop &LoLoop) {		void ARMLowOverheadLoops::ConvertVPTBlocks(LowOverheadLoop &LoLoop) {
auto RemovePredicate = [](MachineInstr *MI) {		auto RemovePredicate = [](MachineInstr *MI) {
LLVM_DEBUG(dbgs() << "ARM Loops: Removing predicate from: " << *MI);		LLVM_DEBUG(dbgs() << "ARM Loops: Removing predicate from: " << *MI);
if (int PIdx = llvm::findFirstVPTPredOperandIdx(*MI)) {		if (int PIdx = llvm::findFirstVPTPredOperandIdx(*MI)) {
assert(MI->getOperand(PIdx).getImm() == ARMVCC::Then &&		assert(MI->getOperand(PIdx).getImm() == ARMVCC::Then &&
"Expected Then predicate!");		"Expected Then predicate!");
MI->getOperand(PIdx).setImm(ARMVCC::None);		MI->getOperand(PIdx).setImm(ARMVCC::None);
MI->getOperand(PIdx+1).setReg(0);		MI->getOperand(PIdx+1).setReg(0);
▲ Show 20 Lines • Show All 110 Lines • ▼ Show 20 Lines	else
LoLoop.Start->eraseFromParent();		LoLoop.Start->eraseFromParent();
bool FlagsAlreadySet = RevertLoopDec(LoLoop.Dec);		bool FlagsAlreadySet = RevertLoopDec(LoLoop.Dec);
RevertLoopEnd(LoLoop.End, FlagsAlreadySet);		RevertLoopEnd(LoLoop.End, FlagsAlreadySet);
} else {		} else {
LoLoop.Start = ExpandLoopStart(LoLoop);		LoLoop.Start = ExpandLoopStart(LoLoop);
RemoveDeadBranch(LoLoop.Start);		RemoveDeadBranch(LoLoop.Start);
LoLoop.End = ExpandLoopEnd(LoLoop);		LoLoop.End = ExpandLoopEnd(LoLoop);
RemoveDeadBranch(LoLoop.End);		RemoveDeadBranch(LoLoop.End);
if (LoLoop.IsTailPredicationLegal())		if (LoLoop.IsTailPredicationLegal()) {
ConvertVPTBlocks(LoLoop);		ConvertVPTBlocks(LoLoop);
		FixupReductions(LoLoop);
		}
for (auto *I : LoLoop.ToRemove) {		for (auto *I : LoLoop.ToRemove) {
LLVM_DEBUG(dbgs() << "ARM Loops: Erasing " << *I);		LLVM_DEBUG(dbgs() << "ARM Loops: Erasing " << *I);
I->eraseFromParent();		I->eraseFromParent();
}		}
}		}

PostOrderLoopTraversal DFS(LoLoop.ML, *MLI);		PostOrderLoopTraversal DFS(LoLoop.ML, *MLI);
DFS.ProcessLoop();		DFS.ProcessLoop();

llvm/test/CodeGen/Thumb2/LowOverheadLoops/cond-vector-reduce-mve-codegen.ll

	; CHECK-LABEL: and_mul_reduce_add:			; CHECK-LABEL: and_mul_reduce_add:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: push {r4, r5, r7, lr}			; CHECK-NEXT: push {r4, r5, r7, lr}
	; CHECK-NEXT: ldr.w r12, [sp, #16]			; CHECK-NEXT: ldr.w r12, [sp, #16]
	; CHECK-NEXT: cmp.w r12, #0			; CHECK-NEXT: cmp.w r12, #0
	; CHECK-NEXT: beq .LBB2_4			; CHECK-NEXT: beq .LBB2_4
	; CHECK-NEXT: @ %bb.1: @ %vector.ph			; CHECK-NEXT: @ %bb.1: @ %vector.ph
	; CHECK-NEXT: add.w r4, r12, #3			; CHECK-NEXT: add.w r4, r12, #3
	; CHECK-NEXT: vmov.i32 q1, #0x0			; CHECK-NEXT: vmov.i32 q0, #0x0
	; CHECK-NEXT: bic r4, r4, #3			; CHECK-NEXT: bic r4, r4, #3
	; CHECK-NEXT: subs r5, r4, #4			; CHECK-NEXT: subs r5, r4, #4
	; CHECK-NEXT: movs r4, #1
	; CHECK-NEXT: add.w lr, r4, r5, lsr #2
	; CHECK-NEXT: lsrs r4, r5, #2			; CHECK-NEXT: lsrs r4, r5, #2
	; CHECK-NEXT: sub.w r4, r12, r4, lsl #2			; CHECK-NEXT: sub.w r4, r12, r4, lsl #2
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: dlstp.32 lr, r12
	; CHECK-NEXT: .LBB2_2: @ %vector.body			; CHECK-NEXT: .LBB2_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r12			; CHECK-NEXT: vldrw.u32 q1, [r1], #16
	; CHECK-NEXT: vmov q0, q1			; CHECK-NEXT: vldrw.u32 q2, [r0], #16
	; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q1, [r1], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
	; CHECK-NEXT: sub.w r12, r12, #4
	; CHECK-NEXT: vsub.i32 q1, q2, q1			; CHECK-NEXT: vsub.i32 q1, q2, q1
	; CHECK-NEXT: vpsttt			; CHECK-NEXT: vcmp.i32 eq, q1, zr
	; CHECK-NEXT: vcmpt.i32 eq, q1, zr			; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q1, [r3], #16			; CHECK-NEXT: vldrwt.u32 q1, [r3], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r2], #16			; CHECK-NEXT: vldrwt.u32 q2, [r2], #16
	; CHECK-NEXT: vmul.i32 q1, q2, q1			; CHECK-NEXT: vmul.i32 q1, q2, q1
	; CHECK-NEXT: vadd.i32 q1, q1, q0			; CHECK-NEXT: vadd.i32 q0, q1, q0
	; CHECK-NEXT: le lr, .LBB2_2			; CHECK-NEXT: letp lr, .LBB2_2
	; CHECK-NEXT: @ %bb.3: @ %middle.block			; CHECK-NEXT: @ %bb.3: @ %middle.block
	; CHECK-NEXT: vctp.32 r4
	; CHECK-NEXT: vpsel q0, q1, q0
	; CHECK-NEXT: vaddv.u32 r0, q0			; CHECK-NEXT: vaddv.u32 r0, q0
	; CHECK-NEXT: pop {r4, r5, r7, pc}			; CHECK-NEXT: pop {r4, r5, r7, pc}
	; CHECK-NEXT: .LBB2_4:			; CHECK-NEXT: .LBB2_4:
	; CHECK-NEXT: movs r0, #0			; CHECK-NEXT: movs r0, #0
	; CHECK-NEXT: pop {r4, r5, r7, pc}			; CHECK-NEXT: pop {r4, r5, r7, pc}
	i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {			i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {
	entry:			entry:
	%cmp8 = icmp eq i32 %N, 0			%cmp8 = icmp eq i32 %N, 0
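
The scalar setup in the pre-patch checks above (add, bic, subs, movs, add with lsr, then lsrs and sub with lsl) derives the iteration count for `dls` and the element count fed to the exit-block `vctp.32`. A small sketch of that arithmetic, assuming four 32-bit lanes and a non-zero element count; `LoopCounts` and `computeCounts` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

struct LoopCounts {
  std::uint32_t Iterations;    // value placed in lr for dls
  std::uint32_t LastIterElems; // elements processed by the final iteration
};

// Mirrors the checked instruction sequence for N > 0 (the function branches
// around the loop when N == 0).
LoopCounts computeCounts(std::uint32_t N) {
  std::uint32_t Rounded = (N + 3u) & ~3u;        // add.w r4, r12, #3 ; bic r4, r4, #3
  std::uint32_t R5 = Rounded - 4u;               // subs r5, r4, #4
  std::uint32_t Iterations = 1u + (R5 >> 2);     // movs r4, #1 ; add.w lr, r4, r5, lsr #2
  std::uint32_t Last = N - ((R5 >> 2) << 2);     // lsrs r4, r5, #2 ; sub.w r4, r12, r4, lsl #2
  return {Iterations, Last};
}
```

For example, six elements give two iterations with two elements left for the last one. With `dlstp.32 lr, r12` the hardware derives both from the element count in `r12`, which is why the `movs`/`add.w lr` pair disappears on the right-hand side.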

llvm/test/CodeGen/Thumb2/LowOverheadLoops/constant-init-reduction.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define dso_local arm_aapcs_vfpcc signext i16 @constant_init_sub_reduction(i8* nocapture readonly %a, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%0 = add i32 %N, 7
				%1 = lshr i32 %0, 3
				%2 = shl nuw i32 %1, 3
				%3 = add i32 %2, -8
				%4 = lshr i32 %3, 3
				%5 = add nuw nsw i32 %4, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %5)
				%6 = shl i32 %4, 3
				%7 = sub i32 %N, %6
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%lsr.iv = phi i8* [ %scevgep, %vector.body ], [ %a, %vector.ph ]
				%vec.phi = phi <8 x i16> [ <i16 32767, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0>, %vector.ph ], [ %13, %vector.body ]
				%8 = phi i32 [ %5, %vector.ph ], [ %14, %vector.body ]
				%9 = phi i32 [ %N, %vector.ph ], [ %11, %vector.body ]
				%lsr.iv15 = bitcast i8* %lsr.iv to <8 x i8>*
				%10 = call <8 x i1> @llvm.arm.mve.vctp16(i32 %9)
				%11 = sub i32 %9, 8
				%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %lsr.iv15, i32 1, <8 x i1> %10, <8 x i8> undef)
				%12 = zext <8 x i8> %wide.masked.load to <8 x i16>
				%13 = sub <8 x i16> %vec.phi, %12
				%scevgep = getelementptr i8, i8* %lsr.iv, i32 8
				%14 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %8, i32 1)
				%15 = icmp ne i32 %14, 0
				br i1 %15, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]
				%.lcssa = phi <8 x i16> [ %13, %vector.body ]
				%16 = call <8 x i1> @llvm.arm.mve.vctp16(i32 %7)
				%17 = select <8 x i1> %16, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa
				%18 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> %17)
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %middle.block, %entry
				%res.0.lcssa = phi i16 [ 32767, %entry ], [ %18, %middle.block ]
				ret i16 %res.0.lcssa
				}

				; Function Attrs: norecurse nounwind readonly
				define dso_local arm_aapcs_vfpcc signext i16 @constant_init_add_reduction(i8* nocapture readonly %a, i32 %N) local_unnamed_addr #0 {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%0 = add i32 %N, 7
				%1 = lshr i32 %0, 3
				%2 = shl nuw i32 %1, 3
				%3 = add i32 %2, -8
				%4 = lshr i32 %3, 3
				%5 = add nuw nsw i32 %4, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %5)
				%6 = shl i32 %4, 3
				%7 = sub i32 %N, %6
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%lsr.iv = phi i8* [ %scevgep, %vector.body ], [ %a, %vector.ph ]
				%vec.phi = phi <8 x i16> [ zeroinitializer, %vector.ph ], [ %13, %vector.body ]
				%8 = phi i32 [ %5, %vector.ph ], [ %14, %vector.body ]
				%9 = phi i32 [ %N, %vector.ph ], [ %11, %vector.body ]
				%lsr.iv15 = bitcast i8* %lsr.iv to <8 x i8>*
				%10 = call <8 x i1> @llvm.arm.mve.vctp16(i32 %9)
				%11 = sub i32 %9, 8
				%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %lsr.iv15, i32 1, <8 x i1> %10, <8 x i8> undef)
				%12 = zext <8 x i8> %wide.masked.load to <8 x i16>
				%13 = add <8 x i16> %vec.phi, %12
				%scevgep = getelementptr i8, i8* %lsr.iv, i32 8
				%14 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %8, i32 1)
				%15 = icmp ne i32 %14, 0
				br i1 %15, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]
				%.lcssa = phi <8 x i16> [ %13, %vector.body ]
				%16 = call <8 x i1> @llvm.arm.mve.vctp16(i32 %7)
				%17 = select <8 x i1> %16, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa
				%18 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> %17)
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %middle.block, %entry
				%res.0.lcssa = phi i16 [ 0, %entry ], [ %18, %middle.block ]
				ret i16 %res.0.lcssa
				}

				declare <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>*, i32 immarg, <8 x i1>, <8 x i8>)
				declare i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <8 x i1> @llvm.arm.mve.vctp16(i32)

				...
				---
				name: constant_init_sub_reduction
				alignment: 16
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants:
				- id: 0
				value: '<8 x i16> <i16 32767, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0>'
				alignment: 16
				isTargetSpecific: false
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: constant_init_sub_reduction
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1
				; CHECK: tCMPi8 renamable $r1, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2IT 0, 2, implicit-def $itstate
				; CHECK: renamable $r0 = t2MOVi16 32767, 0 /* CC::eq */, $cpsr, implicit killed $r0, implicit $itstate
				; CHECK: renamable $r0 = tSXTH killed renamable $r0, 0 /* CC::eq */, $cpsr, implicit killed $r0, implicit $itstate
				; CHECK: tBX_RET 0 /* CC::eq */, killed $cpsr, implicit $r0, implicit killed $itstate
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $r2, dead $cpsr = tADDi3 renamable $r1, 7, 14 /* CC::al */, $noreg
				; CHECK: renamable $r3, dead $cpsr = tMOVi8 1, 14 /* CC::al */, $noreg
				; CHECK: renamable $r2 = t2BICri killed renamable $r2, 7, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14 /* CC::al */, $noreg
				; CHECK: renamable $lr = nuw nsw t2ADDrs killed renamable $r3, renamable $r2, 27, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r2, dead $cpsr = tLSRri killed renamable $r2, 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $r3 = tLEApcrel %const.0, 14 /* CC::al */, $noreg
				; CHECK: renamable $r2 = t2SUBrs renamable $r1, killed renamable $r2, 26, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q0 = MVE_VLDRWU32 killed renamable $r3, 0, 0, $noreg :: (load 16 from constant-pool)
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q0, $r0, $r1, $r2
				; CHECK: renamable $vpr = MVE_VCTP16 renamable $r1, 0, $noreg
				; CHECK: $q1 = MVE_VORR killed $q0, killed $q0, 0, $noreg, undef $q1
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRBU16_post killed renamable $r0, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv15, align 1)
				; CHECK: renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 8, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VSUBi16 renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q1, $r2
				; CHECK: renamable $vpr = MVE_VCTP16 killed renamable $r2, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				; CHECK: $sp = t2LDMIA_UPD $sp, 14 /* CC::al */, $noreg, def $r7, def $lr
				; CHECK: renamable $r0 = tSXTH killed renamable $r0, 14 /* CC::al */, $noreg
				; CHECK: tBX_RET 14 /* CC::al */, $noreg, implicit killed $r0
				; CHECK: bb.3 (align 16):
				; CHECK: CONSTPOOL_ENTRY 0, %const.0, 16
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $lr

				tCMPi8 renamable $r1, 0, 14, $noreg, implicit-def $cpsr
				t2IT 0, 2, implicit-def $itstate
				renamable $r0 = t2MOVi16 32767, 0, $cpsr, implicit killed $r0, implicit $itstate
				renamable $r0 = tSXTH killed renamable $r0, 0, $cpsr, implicit killed $r0, implicit $itstate
				tBX_RET 0, killed $cpsr, implicit $r0, implicit killed $itstate
				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $r2, dead $cpsr = tADDi3 renamable $r1, 7, 14, $noreg
				renamable $r3, dead $cpsr = tMOVi8 1, 14, $noreg
				renamable $r2 = t2BICri killed renamable $r2, 7, 14, $noreg, $noreg
				renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14, $noreg
				renamable $lr = nuw nsw t2ADDrs killed renamable $r3, renamable $r2, 27, 14, $noreg, $noreg
				renamable $r2, dead $cpsr = tLSRri killed renamable $r2, 3, 14, $noreg
				renamable $r3 = tLEApcrel %const.0, 14, $noreg
				renamable $r2 = t2SUBrs renamable $r1, killed renamable $r2, 26, 14, $noreg, $noreg
				renamable $q0 = MVE_VLDRWU32 killed renamable $r3, 0, 0, $noreg :: (load 16 from constant-pool)
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q0, $r0, $r1, $r2

				renamable $vpr = MVE_VCTP16 renamable $r1, 0, $noreg
				$q1 = MVE_VORR killed $q0, $q0, 0, $noreg, undef $q1
				MVE_VPST 8, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRBU16_post killed renamable $r0, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv15, align 1)
				renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 8, 14, $noreg
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VSUBi16 renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q1, $r2

				renamable $vpr = MVE_VCTP16 killed renamable $r2, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				$sp = t2LDMIA_UPD $sp, 14, $noreg, def $r7, def $lr
				renamable $r0 = tSXTH killed renamable $r0, 14, $noreg
				tBX_RET 14, $noreg, implicit killed $r0

				bb.3 (align 16):
				CONSTPOOL_ENTRY 0, %const.0, 16

				...
				---
				name: constant_init_add_reduction
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: constant_init_add_reduction
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1
				; CHECK: tCMPi8 renamable $r1, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2IT 0, 2, implicit-def $itstate
				; CHECK: renamable $r0 = tMOVi8 $noreg, 0, 0 /* CC::eq */, $cpsr, implicit killed $r0, implicit $itstate
				; CHECK: renamable $r0 = tSXTH killed renamable $r0, 0 /* CC::eq */, $cpsr, implicit killed $r0, implicit $itstate
				; CHECK: tBX_RET 0 /* CC::eq */, killed $cpsr, implicit $r0, implicit killed $itstate
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $r2, dead $cpsr = tADDi3 renamable $r1, 7, 14 /* CC::al */, $noreg
				; CHECK: renamable $r3, dead $cpsr = tMOVi8 1, 14 /* CC::al */, $noreg
				; CHECK: renamable $r2 = t2BICri killed renamable $r2, 7, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14 /* CC::al */, $noreg
				; CHECK: renamable $lr = nuw nsw t2ADDrs killed renamable $r3, renamable $r2, 27, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r2, dead $cpsr = tLSRri killed renamable $r2, 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $r2 = t2SUBrs renamable $r1, killed renamable $r2, 26, 14 /* CC::al */, $noreg, $noreg
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q0, $r0, $r1, $r2
				; CHECK: renamable $vpr = MVE_VCTP16 renamable $r1, 0, $noreg
				; CHECK: $q1 = MVE_VORR killed $q0, killed $q0, 0, $noreg, undef $q1
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRBU16_post killed renamable $r0, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv15, align 1)
				; CHECK: renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 8, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VMOVLu8bh killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q0 = MVE_VADDi16 renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q1, $r2
				; CHECK: renamable $vpr = MVE_VCTP16 killed renamable $r2, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				; CHECK: $sp = t2LDMIA_UPD $sp, 14 /* CC::al */, $noreg, def $r7, def $lr
				; CHECK: renamable $r0 = tSXTH killed renamable $r0, 14 /* CC::al */, $noreg
				; CHECK: tBX_RET 14 /* CC::al */, $noreg, implicit killed $r0
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $lr

				tCMPi8 renamable $r1, 0, 14, $noreg, implicit-def $cpsr
				t2IT 0, 2, implicit-def $itstate
				renamable $r0 = tMOVi8 $noreg, 0, 0, $cpsr, implicit killed $r0, implicit $itstate
				renamable $r0 = tSXTH killed renamable $r0, 0, $cpsr, implicit killed $r0, implicit $itstate
				tBX_RET 0, killed $cpsr, implicit $r0, implicit killed $itstate
				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $r2, dead $cpsr = tADDi3 renamable $r1, 7, 14, $noreg
				renamable $r3, dead $cpsr = tMOVi8 1, 14, $noreg
				renamable $r2 = t2BICri killed renamable $r2, 7, 14, $noreg, $noreg
				renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14, $noreg
				renamable $lr = nuw nsw t2ADDrs killed renamable $r3, renamable $r2, 27, 14, $noreg, $noreg
				renamable $r2, dead $cpsr = tLSRri killed renamable $r2, 3, 14, $noreg
				renamable $r2 = t2SUBrs renamable $r1, killed renamable $r2, 26, 14, $noreg, $noreg
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q0, $r0, $r1, $r2

				renamable $vpr = MVE_VCTP16 renamable $r1, 0, $noreg
				$q1 = MVE_VORR killed $q0, $q0, 0, $noreg, undef $q1
				MVE_VPST 8, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRBU16_post killed renamable $r0, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv15, align 1)
				renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 8, 14, $noreg
				renamable $q0 = MVE_VMOVLu8bh killed renamable $q0, 0, $noreg, undef renamable $q0
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VADDi16 renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q1, $r2

				renamable $vpr = MVE_VCTP16 killed renamable $r2, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				$sp = t2LDMIA_UPD $sp, 14, $noreg, def $r7, def $lr
				renamable $r0 = tSXTH killed renamable $r0, 14, $noreg
				tBX_RET 14, $noreg, implicit killed $r0

				...

llvm/test/CodeGen/Thumb2/LowOverheadLoops/constant-reduction.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define dso_local arm_aapcs_vfpcc i32 @constant_reduction(i8* nocapture readonly %a) {
				entry:
				call void @llvm.set.loop.iterations.i32(i32 250)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%lsr.iv = phi i8* [ %scevgep, %vector.body ], [ %a, %entry ]
				%vec.phi = phi <4 x i32> [ zeroinitializer, %entry ], [ %5, %vector.body ]
				%0 = phi i32 [ 250, %entry ], [ %6, %vector.body ]
				%1 = phi i32 [ 999, %entry ], [ %3, %vector.body ]
				%lsr.iv10 = bitcast i8* %lsr.iv to <4 x i8>*
				%2 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %1)
				%3 = sub i32 %1, 4
				%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %lsr.iv10, i32 1, <4 x i1> %2, <4 x i8> undef)
				%4 = zext <4 x i8> %wide.masked.load to <4 x i32>
				%5 = add <4 x i32> %vec.phi, %4
				%scevgep = getelementptr i8, i8* %lsr.iv, i32 4
				%6 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %0, i32 1)
				%7 = icmp ne i32 %6, 0
				br i1 %7, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <4 x i32> [ %vec.phi, %vector.body ]
				%.lcssa = phi <4 x i32> [ %5, %vector.body ]
				%8 = call <4 x i1> @llvm.arm.mve.vctp32(i32 3)
				%9 = select <4 x i1> %8, <4 x i32> %.lcssa, <4 x i32> %vec.phi.lcssa
				%10 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %9)
				ret i32 %10
				}

				declare <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>*, i32 immarg, <4 x i1>, <4 x i8>)
				declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)
				...
				---
				name: constant_reduction
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: constant_reduction
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $lr = t2MOVi 250, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $r1 = t2MOVi16 999, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VMOVimmi32 255, 0, $noreg, undef renamable $q1
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q0, $q1, $r0, $r1
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r1, 0, $noreg
				; CHECK: $q2 = MVE_VORR killed $q0, killed $q0, 0, $noreg, undef $q2
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRBU32_post killed renamable $r0, 4, 1, killed renamable $vpr :: (load 4 from %ir.lsr.iv10, align 1)
				; CHECK: renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VAND killed renamable $q0, renamable $q1, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q0 = MVE_VADDi32 renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q2
				; CHECK: renamable $r0, dead $cpsr = tMOVi8 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $vpr = MVE_VCTP32 killed renamable $r0, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q2, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu32no_acc killed renamable $q0, 0, $noreg
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $lr

				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $lr = t2MOVi 250, 14, $noreg, $noreg
				renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				renamable $r1 = t2MOVi16 999, 14, $noreg
				renamable $q1 = MVE_VMOVimmi32 255, 0, $noreg, undef renamable $q1
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q0, $q1, $r0, $r1

				renamable $vpr = MVE_VCTP32 renamable $r1, 0, $noreg
				$q2 = MVE_VORR killed $q0, $q0, 0, $noreg, undef $q2
				MVE_VPST 8, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRBU32_post killed renamable $r0, 4, 1, killed renamable $vpr :: (load 4 from %ir.lsr.iv10, align 1)
				renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 4, 14, $noreg
				renamable $q0 = MVE_VAND killed renamable $q0, renamable $q1, 0, $noreg, undef renamable $q0
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VADDi32 renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q2

				renamable $r0, dead $cpsr = tMOVi8 3, 14, $noreg
				renamable $vpr = MVE_VCTP32 killed renamable $r0, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q2, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu32no_acc killed renamable $q0, 0, $noreg
				tPOP_RET 14, $noreg, def $r7, def $pc, implicit killed $r0

				...
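
The constant_reduction test above runs 250 iterations over 999 elements with 4 lanes, so the final iteration has 3 active lanes, which is why the middle block uses a vctp32 of 3. The following is a rough Python model of that pattern, not the pass's implementation: all function names and the `poison` stand-in for undef lanes are hypothetical. It compares the vpsel form (unpredicated add, select in the exit block) with a tail-predicated form in which the add itself leaves inactive lanes untouched.

```python
LANES = 4  # <4 x i32> vectors, as in constant_reduction


def vctp(remaining, lanes=LANES):
    """Lane mask with the first min(remaining, lanes) lanes active."""
    return [i < remaining for i in range(lanes)]


def masked_load(data, offset, mask, poison):
    """Masked load; inactive lanes receive an arbitrary stand-in value."""
    return [data[offset + i] if m else poison for i, m in enumerate(mask)]


def reduce_with_vpsel(data, poison=12345):
    """Unpredicated add in the loop, select between the last two
    accumulator values in the exit block (the shape of the test IR)."""
    n = len(data)
    iterations = (n + LANES - 1) // LANES
    acc = [0] * LANES
    prev = acc
    remaining, offset = n, 0
    for _ in range(iterations):
        mask = vctp(remaining)
        loaded = masked_load(data, offset, mask, poison)
        prev = acc                                   # the vorr/vmov copy
        acc = [a + x for a, x in zip(prev, loaded)]  # unpredicated vadd
        remaining -= LANES
        offset += LANES
    # Exit block: elements handled by the final iteration.
    tail = n - (iterations - 1) * LANES
    exit_mask = vctp(tail)
    selected = [a if m else p for a, p, m in zip(acc, prev, exit_mask)]
    return sum(selected)                             # the vaddv


def reduce_tail_predicated(data):
    """Predicated add: inactive lanes keep their previous value, so no
    copy, vctp or select is needed after the loop."""
    n = len(data)
    acc = [0] * LANES
    remaining, offset = n, 0
    while remaining > 0:
        mask = vctp(remaining)
        acc = [a + data[offset + i] if m else a
               for i, (a, m) in enumerate(zip(acc, mask))]
        remaining -= LANES
        offset += LANES
    return sum(acc)


data = [i % 256 for i in range(999)]  # 999 byte-sized elements
assert reduce_with_vpsel(data) == sum(data)
assert reduce_tail_predicated(data) == sum(data)
```

In this model both forms agree for any non-empty input, which is the equivalence the CHECK lines for tail-predicated loops rely on.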

llvm/test/CodeGen/Thumb2/LowOverheadLoops/matrix.mir

; CHECK: bb.4.for.cond4.preheader.us.preheader:		; CHECK: bb.4.for.cond4.preheader.us.preheader:
; CHECK: successors: %bb.5(0x80000000)		; CHECK: successors: %bb.5(0x80000000)
; CHECK: liveins: $r4, $r5, $r7, $r8, $r10, $r12		; CHECK: liveins: $r4, $r5, $r7, $r8, $r10, $r12
; CHECK: renamable $r0 = t2ADDri $r10, 3, 14 /* CC::al */, $noreg, $noreg		; CHECK: renamable $r0 = t2ADDri $r10, 3, 14 /* CC::al */, $noreg, $noreg
; CHECK: $lr = tMOVr $r10, 14 /* CC::al */, $noreg		; CHECK: $lr = tMOVr $r10, 14 /* CC::al */, $noreg
; CHECK: renamable $r0 = t2BICri killed renamable $r0, 3, 14 /* CC::al */, $noreg, $noreg		; CHECK: renamable $r0 = t2BICri killed renamable $r0, 3, 14 /* CC::al */, $noreg, $noreg
; CHECK: renamable $r3 = t2LSLri $r10, 1, 14 /* CC::al */, $noreg, $noreg		; CHECK: renamable $r3 = t2LSLri $r10, 1, 14 /* CC::al */, $noreg, $noreg
; CHECK: renamable $r1, dead $cpsr = tSUBi3 killed renamable $r0, 4, 14 /* CC::al */, $noreg		; CHECK: renamable $r1, dead $cpsr = tSUBi3 killed renamable $r0, 4, 14 /* CC::al */, $noreg
; CHECK: renamable $r0, dead $cpsr = tMOVi8 1, 14 /* CC::al */, $noreg
; CHECK: renamable $q0 = MVE_VDUP32 renamable $r7, 0, $noreg, undef renamable $q0		; CHECK: renamable $q0 = MVE_VDUP32 renamable $r7, 0, $noreg, undef renamable $q0
; CHECK: renamable $r0 = nuw nsw t2ADDrs killed renamable $r0, renamable $r1, 19, 14 /* CC::al */, $noreg, $noreg
; CHECK: renamable $r1, dead $cpsr = tLSRri killed renamable $r1, 2, 14 /* CC::al */, $noreg		; CHECK: renamable $r1, dead $cpsr = tLSRri killed renamable $r1, 2, 14 /* CC::al */, $noreg
; CHECK: renamable $r9 = t2SUBrs $r10, killed renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg		; CHECK: renamable $r9 = t2SUBrs $r10, killed renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg
; CHECK: bb.5.for.cond4.preheader.us:		; CHECK: bb.5.for.cond4.preheader.us:
; CHECK: successors: %bb.6(0x80000000)		; CHECK: successors: %bb.6(0x80000000)
; CHECK: liveins: $lr, $q0, $r0, $r3, $r4, $r5, $r7, $r8, $r9, $r10, $r12		; CHECK: liveins: $lr, $q0, $q2, $r0, $r3, $r4, $r5, $r7, $r8, $r9, $r10, $r12
; CHECK: renamable $r1 = t2LDRs renamable $r4, renamable $r7, 2, 14 /* CC::al */, $noreg :: (load 4 from %ir.arrayidx12.us)		; CHECK: renamable $r1 = t2LDRs renamable $r4, renamable $r7, 2, 14 /* CC::al */, $noreg :: (load 4 from %ir.arrayidx12.us)
; CHECK: $q1 = MVE_VORR $q0, $q0, 0, $noreg, undef $q1		; CHECK: dead $q1 = MVE_VORR $q0, $q0, 0, $noreg, undef $q1
; CHECK: $r2 = tMOVr killed $lr, 14 /* CC::al */, $noreg		; CHECK: $r2 = tMOVr killed $lr, 14 /* CC::al */, $noreg
; CHECK: renamable $q1 = MVE_VMOV_to_lane_32 killed renamable $q1, killed renamable $r1, 0, 14 /* CC::al */, $noreg		; CHECK: $q2 = MVE_VMOV_to_lane_32 killed $q2, killed renamable $r1, 0, 14 /* CC::al */, $noreg
; CHECK: $r6 = tMOVr $r5, 14 /* CC::al */, $noreg		; CHECK: $r6 = tMOVr $r5, 14 /* CC::al */, $noreg
; CHECK: $r1 = tMOVr $r8, 14 /* CC::al */, $noreg		; CHECK: $r1 = tMOVr $r8, 14 /* CC::al */, $noreg
; CHECK: $lr = t2DLS renamable $r0		; CHECK: $lr = MVE_DLSTP_32 killed renamable $r2
; CHECK: bb.6.vector.body:		; CHECK: bb.6.vector.body:
; CHECK: successors: %bb.6(0x7c000000), %bb.7(0x04000000)		; CHECK: successors: %bb.6(0x7c000000), %bb.7(0x04000000)
; CHECK: liveins: $lr, $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $r9, $r10, $r12		; CHECK: liveins: $lr, $q0, $q2, $r0, $r1, $r3, $r4, $r5, $r6, $r7, $r8, $r9, $r10, $r12
; CHECK: renamable $vpr = MVE_VCTP32 renamable $r2, 0, $noreg		; CHECK: renamable $r6, renamable $q1 = MVE_VLDRHS32_post killed renamable $r6, 8, 0, $noreg :: (load 8 from %ir.lsr.iv1012, align 2)
; CHECK: $q2 = MVE_VORR killed $q1, killed $q1, 0, $noreg, undef $q2		; CHECK: renamable $r1, renamable $q3 = MVE_VLDRHS32_post killed renamable $r1, 8, 0, killed $noreg :: (load 8 from %ir.lsr.iv46, align 2)
; CHECK: MVE_VPST 4, implicit $vpr
; CHECK: renamable $r6, renamable $q1 = MVE_VLDRHS32_post killed renamable $r6, 8, 1, renamable $vpr :: (load 8 from %ir.lsr.iv1012, align 2)
; CHECK: renamable $r1, renamable $q3 = MVE_VLDRHS32_post killed renamable $r1, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv46, align 2)
; CHECK: renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 4, 14 /* CC::al */, $noreg
; CHECK: renamable $q1 = nsw MVE_VMULi32 killed renamable $q3, killed renamable $q1, 0, $noreg, undef renamable $q1		; CHECK: renamable $q1 = nsw MVE_VMULi32 killed renamable $q3, killed renamable $q1, 0, $noreg, undef renamable $q1
; CHECK: renamable $q1 = MVE_VADDi32 killed renamable $q1, renamable $q2, 0, $noreg, undef renamable $q1		; CHECK: $q2 = MVE_VADDi32 renamable $q1, killed renamable $q2, 0, $noreg, undef $q2
; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.6		; CHECK: $lr = MVE_LETP killed renamable $lr, %bb.6
; CHECK: bb.7.middle.block:		; CHECK: bb.7.middle.block:
; CHECK: successors: %bb.8(0x04000000), %bb.5(0x7c000000)		; CHECK: successors: %bb.8(0x04000000), %bb.5(0x7c000000)
; CHECK: liveins: $q0, $q1, $q2, $r0, $r3, $r4, $r5, $r7, $r8, $r9, $r10, $r12		; CHECK: liveins: $q0, $q1, $q2, $r0, $r3, $r4, $r5, $r7, $r8, $r9, $r10, $r12
; CHECK: renamable $vpr = MVE_VCTP32 renamable $r9, 0, $noreg
; CHECK: renamable $r5 = tADDhirr killed renamable $r5, renamable $r3, 14 /* CC::al */, $noreg		; CHECK: renamable $r5 = tADDhirr killed renamable $r5, renamable $r3, 14 /* CC::al */, $noreg
; CHECK: renamable $q1 = MVE_VPSEL killed renamable $q1, killed renamable $q2, 0, killed renamable $vpr		; CHECK: $q1 = MVE_VORR $q2, $q2, 0, $noreg, killed $q1
; CHECK: $lr = tMOVr $r10, 14 /* CC::al */, $noreg		; CHECK: $lr = tMOVr $r10, 14 /* CC::al */, $noreg
; CHECK: renamable $r2 = MVE_VADDVu32no_acc killed renamable $q1, 0, $noreg		; CHECK: renamable $r2 = MVE_VADDVu32no_acc killed renamable $q1, 0, $noreg
; CHECK: t2STRs killed renamable $r2, renamable $r4, renamable $r7, 2, 14 /* CC::al */, $noreg :: (store 4 into %ir.27)		; CHECK: t2STRs killed renamable $r2, renamable $r4, renamable $r7, 2, 14 /* CC::al */, $noreg :: (store 4 into %ir.27)
; CHECK: renamable $r7, dead $cpsr = nuw nsw tADDi8 killed renamable $r7, 1, 14 /* CC::al */, $noreg		; CHECK: renamable $r7, dead $cpsr = nuw nsw tADDi8 killed renamable $r7, 1, 14 /* CC::al */, $noreg
; CHECK: tCMPhir renamable $r7, $r10, 14 /* CC::al */, $noreg, implicit-def $cpsr		; CHECK: tCMPhir renamable $r7, $r10, 14 /* CC::al */, $noreg, implicit-def $cpsr
; CHECK: tBcc %bb.5, 1 /* CC::ne */, killed $cpsr		; CHECK: tBcc %bb.5, 1 /* CC::ne */, killed $cpsr
; CHECK: bb.8.for.end16:		; CHECK: bb.8.for.end16:
; CHECK: successors: %bb.9(0x50000000), %bb.13(0x30000000)		; CHECK: successors: %bb.9(0x50000000), %bb.13(0x30000000)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/nested-reductions.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define dso_local arm_aapcs_vfpcc i32 @nested_reduction(i16** nocapture readonly %a, i16* nocapture readonly %b, i32 %N, i32 %M) {
				entry:
				%cmp23 = icmp eq i32 %N, 0
				%cmp220 = icmp eq i32 %M, 0
				%or.cond = or i1 %cmp23, %cmp220
				br i1 %or.cond, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%n.rnd.up = add i32 %M, 3
				%n.vec = and i32 %n.rnd.up, -4
				%0 = add i32 %n.vec, -4
				%1 = lshr i32 %0, 2
				%2 = add nuw nsw i32 %1, 1
				%3 = shl i32 %1, 2
				%4 = sub i32 %M, %3
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.preheader.us.preheader, %middle.block
				%i.025.us = phi i32 [ %inc9.us, %middle.block ], [ 0, %for.cond1.preheader.us.preheader ]
				%res.024.us = phi i32 [ %19, %middle.block ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i16, i16* %a, i32 %i.025.us
				%5 = load i16, i16* %arrayidx.us, align 4
				%6 = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 %res.024.us, i32 0
				call void @llvm.set.loop.iterations.i32(i32 %2)
				br label %vector.body

				vector.body: ; preds = %vector.body, %for.cond1.preheader.us
				%lsr.iv38 = phi i16* [ %scevgep39, %vector.body ], [ %b, %for.cond1.preheader.us ]
				%lsr.iv = phi i16* [ %scevgep, %vector.body ], [ %5, %for.cond1.preheader.us ]
				%vec.phi = phi <4 x i32> [ %6, %for.cond1.preheader.us ], [ %14, %vector.body ]
				%7 = phi i32 [ %2, %for.cond1.preheader.us ], [ %15, %vector.body ]
				%8 = phi i32 [ %M, %for.cond1.preheader.us ], [ %10, %vector.body ]
				%lsr.iv3840 = bitcast i16* %lsr.iv38 to <4 x i16>*
				%lsr.iv37 = bitcast i16* %lsr.iv to <4 x i16>*
				%9 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %8)
				%10 = sub i32 %8, 4
				%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %lsr.iv37, i32 2, <4 x i1> %9, <4 x i16> undef)
				%11 = sext <4 x i16> %wide.masked.load to <4 x i32>
				%wide.masked.load32 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %lsr.iv3840, i32 2, <4 x i1> %9, <4 x i16> undef)
				%12 = sext <4 x i16> %wide.masked.load32 to <4 x i32>
				%13 = add <4 x i32> %vec.phi, %11
				%14 = sub <4 x i32> %13, %12
				%scevgep = getelementptr i16, i16* %lsr.iv, i32 4
				%scevgep39 = getelementptr i16, i16* %lsr.iv38, i32 4
				%15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %7, i32 1)
				%16 = icmp ne i32 %15, 0
				br i1 %16, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <4 x i32> [ %vec.phi, %vector.body ]
				%.lcssa = phi <4 x i32> [ %14, %vector.body ]
				%17 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %4)
				%18 = select <4 x i1> %17, <4 x i32> %.lcssa, <4 x i32> %vec.phi.lcssa
				%19 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %18)
				%inc9.us = add nuw i32 %i.025.us, 1
				%exitcond29 = icmp eq i32 %inc9.us, %N
				br i1 %exitcond29, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %middle.block, %entry
				%res.0.lcssa = phi i32 [ 0, %entry ], [ %19, %middle.block ]
				ret i32 %res.0.lcssa
				}

				declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>)
				declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)

				...
				---
				name: nested_reduction
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 32
				offsetAdjustment: -24
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r6', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 3, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r5', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 4, name: '', type: spill-slot, offset: -20, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 5, name: '', type: spill-slot, offset: -24, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r10', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 6, name: '', type: spill-slot, offset: -28, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r9', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 7, name: '', type: spill-slot, offset: -32, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r8', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: nested_reduction
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.5(0x55555555), %bb.1(0x2aaaaaab)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $lr, $r8, $r9, $r10
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $r5, killed $r6, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 20
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: frame-setup CFI_INSTRUCTION offset $r6, -12
				; CHECK: frame-setup CFI_INSTRUCTION offset $r5, -16
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -20
				; CHECK: $r7 = frame-setup tADDrSPi $sp, 3, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa $r7, 8
				; CHECK: $sp = frame-setup t2STMDB_UPD $sp, 14 /* CC::al */, $noreg, killed $r8, killed $r9, killed $r10
				; CHECK: frame-setup CFI_INSTRUCTION offset $r10, -24
				; CHECK: frame-setup CFI_INSTRUCTION offset $r9, -28
				; CHECK: frame-setup CFI_INSTRUCTION offset $r8, -32
				; CHECK: tCMPi8 renamable $r2, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: renamable $r12 = t2MOVi 0, 14 /* CC::al */, $noreg, $noreg
				; CHECK: t2IT 1, 8, implicit-def $itstate
				; CHECK: tCMPi8 renamable $r3, 0, 1 /* CC::ne */, killed $cpsr, implicit-def $cpsr, implicit killed $itstate
				; CHECK: tBcc %bb.5, 0 /* CC::eq */, killed $cpsr
				; CHECK: bb.1.for.cond1.preheader.us.preheader:
				; CHECK: successors: %bb.2(0x80000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3
				; CHECK: $r8 = tMOVr killed $r1, 14 /* CC::al */, $noreg
				; CHECK: renamable $r1, dead $cpsr = tADDi3 renamable $r3, 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $r1 = t2BICri killed renamable $r1, 3, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r6, dead $cpsr = tMOVi8 1, 14 /* CC::al */, $noreg
				; CHECK: renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $r5, dead $cpsr = tMOVi8 0, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VDUP32 renamable $r5, 0, $noreg, undef renamable $q0
				; CHECK: renamable $r12 = t2MOVi 0, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r10 = nuw nsw t2ADDrs killed renamable $r6, renamable $r1, 19, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r1, dead $cpsr = tLSRri killed renamable $r1, 2, 14 /* CC::al */, $noreg
				; CHECK: renamable $r9 = t2SUBrs renamable $r3, killed renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg
				; CHECK: bb.2.for.cond1.preheader.us:
				; CHECK: successors: %bb.3(0x80000000)
				; CHECK: liveins: $q0, $r0, $r2, $r3, $r5, $r8, $r9, $r10, $r12
				; CHECK: renamable $r6 = t2LDRs renamable $r0, renamable $r5, 2, 14 /* CC::al */, $noreg :: (load 4 from %ir.arrayidx.us)
				; CHECK: $q2 = MVE_VORR $q0, $q0, 0, $noreg, undef $q2
				; CHECK: renamable $q2 = MVE_VMOV_to_lane_32 killed renamable $q2, killed renamable $r12, 0, 14 /* CC::al */, $noreg
				; CHECK: $r1 = tMOVr $r8, 14 /* CC::al */, $noreg
				; CHECK: $lr = t2DLS renamable $r10
				; CHECK: $r4 = tMOVr $r3, 14 /* CC::al */, $noreg
				; CHECK: bb.3.vector.body:
				; CHECK: successors: %bb.3(0x7c000000), %bb.4(0x04000000)
				; CHECK: liveins: $lr, $q0, $q2, $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r8, $r9, $r10
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r4, 0, $noreg
				; CHECK: $q1 = MVE_VORR killed $q2, killed $q2, 0, $noreg, undef $q1
				; CHECK: MVE_VPST 4, implicit $vpr
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRHU32_post killed renamable $r1, 8, 1, renamable $vpr :: (load 8 from %ir.lsr.iv3840, align 2)
				; CHECK: renamable $r6, renamable $q3 = MVE_VLDRHU32_post killed renamable $r6, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv37, align 2)
				; CHECK: renamable $r4, dead $cpsr = tSUBi8 killed renamable $r4, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $q3 = MVE_VADDi32 renamable $q1, killed renamable $q3, 0, $noreg, undef renamable $q3
				; CHECK: renamable $q2 = MVE_VSUBi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.3
				; CHECK: bb.4.middle.block:
				; CHECK: successors: %bb.5(0x04000000), %bb.2(0x7c000000)
				; CHECK: liveins: $q0, $q1, $q2, $r0, $r2, $r3, $r5, $r8, $r9, $r10
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r9, 0, $noreg
				; CHECK: renamable $r5, dead $cpsr = nuw tADDi8 killed renamable $r5, 1, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VPSEL killed renamable $q2, killed renamable $q1, 0, killed renamable $vpr
				; CHECK: tCMPr renamable $r5, renamable $r2, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: renamable $r12 = MVE_VADDVu32no_acc killed renamable $q1, 0, $noreg
				; CHECK: tBcc %bb.2, 1 /* CC::ne */, killed $cpsr
				; CHECK: bb.5.for.cond.cleanup:
				; CHECK: liveins: $r12
				; CHECK: $r0 = tMOVr killed $r12, 14 /* CC::al */, $noreg
				; CHECK: $sp = t2LDMIA_UPD $sp, 14 /* CC::al */, $noreg, def $r8, def $r9, def $r10
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $r5, def $r6, def $r7, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.5(0x80000000), %bb.1(0x40000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $lr, $r8, $r9, $r10

				frame-setup tPUSH 14, $noreg, killed $r4, killed $r5, killed $r6, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 20
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				frame-setup CFI_INSTRUCTION offset $r6, -12
				frame-setup CFI_INSTRUCTION offset $r5, -16
				frame-setup CFI_INSTRUCTION offset $r4, -20
				$r7 = frame-setup tADDrSPi $sp, 3, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa $r7, 8
				$sp = frame-setup t2STMDB_UPD $sp, 14, $noreg, killed $r8, killed $r9, killed $r10
				frame-setup CFI_INSTRUCTION offset $r10, -24
				frame-setup CFI_INSTRUCTION offset $r9, -28
				frame-setup CFI_INSTRUCTION offset $r8, -32
				tCMPi8 renamable $r2, 0, 14, $noreg, implicit-def $cpsr
				renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
				t2IT 1, 8, implicit-def $itstate
				tCMPi8 renamable $r3, 0, 1, killed $cpsr, implicit-def $cpsr, implicit killed $itstate
				tBcc %bb.5, 0, killed $cpsr

				bb.1.for.cond1.preheader.us.preheader:
				successors: %bb.2(0x80000000)
				liveins: $r0, $r1, $r2, $r3

				$r8 = tMOVr killed $r1, 14, $noreg
				renamable $r1, dead $cpsr = tADDi3 renamable $r3, 3, 14, $noreg
				renamable $r1 = t2BICri killed renamable $r1, 3, 14, $noreg, $noreg
				renamable $r6, dead $cpsr = tMOVi8 1, 14, $noreg
				renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 4, 14, $noreg
				renamable $r5, dead $cpsr = tMOVi8 0, 14, $noreg
				renamable $q0 = MVE_VDUP32 renamable $r5, 0, $noreg, undef renamable $q0
				renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
				renamable $r10 = nuw nsw t2ADDrs killed renamable $r6, renamable $r1, 19, 14, $noreg, $noreg
				renamable $r1, dead $cpsr = tLSRri killed renamable $r1, 2, 14, $noreg
				renamable $r9 = t2SUBrs renamable $r3, killed renamable $r1, 18, 14, $noreg, $noreg

				bb.2.for.cond1.preheader.us:
				successors: %bb.3(0x80000000)
				liveins: $q0, $r0, $r2, $r3, $r5, $r8, $r9, $r10, $r12

				renamable $r6 = t2LDRs renamable $r0, renamable $r5, 2, 14, $noreg :: (load 4 from %ir.arrayidx.us)
				$q2 = MVE_VORR $q0, $q0, 0, $noreg, undef $q2
				renamable $q2 = MVE_VMOV_to_lane_32 killed renamable $q2, killed renamable $r12, 0, 14, $noreg
				$r1 = tMOVr $r8, 14, $noreg
				$lr = tMOVr $r10, 14, $noreg
				$r4 = tMOVr $r3, 14, $noreg
				t2DoLoopStart renamable $r10

				bb.3.vector.body:
				successors: %bb.3(0x7c000000), %bb.4(0x04000000)
				liveins: $lr, $q0, $q2, $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r8, $r9, $r10

				renamable $vpr = MVE_VCTP32 renamable $r4, 0, $noreg
				$q1 = MVE_VORR killed $q2, $q2, 0, $noreg, undef $q1
				MVE_VPST 4, implicit $vpr
				renamable $r1, renamable $q2 = MVE_VLDRHU32_post killed renamable $r1, 8, 1, renamable $vpr :: (load 8 from %ir.lsr.iv3840, align 2)
				renamable $r6, renamable $q3 = MVE_VLDRHU32_post killed renamable $r6, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv37, align 2)
				renamable $r4, dead $cpsr = tSUBi8 killed renamable $r4, 4, 14, $noreg
				renamable $q3 = MVE_VADDi32 renamable $q1, killed renamable $q3, 0, $noreg, undef renamable $q3
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q2 = MVE_VSUBi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
				t2LoopEnd renamable $lr, %bb.3, implicit-def dead $cpsr
				tB %bb.4, 14, $noreg

				bb.4.middle.block:
				successors: %bb.5(0x04000000), %bb.2(0x7c000000)
				liveins: $q0, $q1, $q2, $r0, $r2, $r3, $r5, $r8, $r9, $r10

				renamable $vpr = MVE_VCTP32 renamable $r9, 0, $noreg
				renamable $r5, dead $cpsr = nuw tADDi8 killed renamable $r5, 1, 14, $noreg
				renamable $q1 = MVE_VPSEL killed renamable $q2, killed renamable $q1, 0, killed renamable $vpr
				tCMPr renamable $r5, renamable $r2, 14, $noreg, implicit-def $cpsr
				renamable $r12 = MVE_VADDVu32no_acc killed renamable $q1, 0, $noreg
				tBcc %bb.2, 1, killed $cpsr

				bb.5.for.cond.cleanup:
				liveins: $r12

				$r0 = tMOVr killed $r12, 14, $noreg
				$sp = t2LDMIA_UPD $sp, 14, $noreg, def $r8, def $r9, def $r10
				tPOP_RET 14, $noreg, def $r4, def $r5, def $r6, def $r7, def $pc, implicit killed $r0

				...

llvm/test/CodeGen/Thumb2/LowOverheadLoops/reductions-8-16.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define dso_local arm_aapcs_vfpcc zeroext i8 @test_add_reduce_8(i8* nocapture readonly %a, i8* nocapture readonly %b, i8* nocapture readonly %c) {
				entry:
				call void @llvm.set.loop.iterations.i32(i32 32)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%lsr.iv24 = phi i8* [ %scevgep25, %vector.body ], [ %c, %entry ]
				%lsr.iv21 = phi i8* [ %scevgep22, %vector.body ], [ %b, %entry ]
				%lsr.iv = phi i8* [ %scevgep, %vector.body ], [ %a, %entry ]
				%vec.phi = phi <16 x i8> [ zeroinitializer, %entry ], [ %6, %vector.body ]
				%0 = phi i32 [ 32, %entry ], [ %7, %vector.body ]
				%1 = phi i32 [ 499, %entry ], [ %3, %vector.body ]
				%lsr.iv2426 = bitcast i8* %lsr.iv24 to <16 x i8>*
				%lsr.iv2123 = bitcast i8* %lsr.iv21 to <16 x i8>*
				%lsr.iv20 = bitcast i8* %lsr.iv to <16 x i8>*
				%2 = call <16 x i1> @llvm.arm.mve.vctp8(i32 %1)
				%3 = sub i32 %1, 16
				%wide.masked.load = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %lsr.iv20, i32 1, <16 x i1> %2, <16 x i8> undef)
				%wide.masked.load16 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %lsr.iv2123, i32 1, <16 x i1> %2, <16 x i8> undef)
				%4 = mul <16 x i8> %wide.masked.load16, %wide.masked.load
				%wide.masked.load17 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %lsr.iv2426, i32 1, <16 x i1> %2, <16 x i8> undef)
				%5 = add <16 x i8> %wide.masked.load17, %vec.phi
				%6 = add <16 x i8> %5, %4
				%scevgep = getelementptr i8, i8* %lsr.iv, i32 16
				%scevgep22 = getelementptr i8, i8* %lsr.iv21, i32 16
				%scevgep25 = getelementptr i8, i8* %lsr.iv24, i32 16
				%7 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %0, i32 1)
				%8 = icmp ne i32 %7, 0
				br i1 %8, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <16 x i8> [ %vec.phi, %vector.body ]
				%.lcssa = phi <16 x i8> [ %6, %vector.body ]
				%9 = call <16 x i1> @llvm.arm.mve.vctp8(i32 3)
				%10 = select <16 x i1> %9, <16 x i8> %.lcssa, <16 x i8> %vec.phi.lcssa
				%11 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> %10)
				ret i8 %11
				}

				define dso_local arm_aapcs_vfpcc zeroext i8 @test_sub_reduce_8(i8* nocapture readonly %a, i8* nocapture readonly %b, i8* nocapture readonly %c) {
				entry:
				call void @llvm.set.loop.iterations.i32(i32 32)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%lsr.iv23 = phi i8* [ %scevgep24, %vector.body ], [ %c, %entry ]
				%lsr.iv20 = phi i8* [ %scevgep21, %vector.body ], [ %b, %entry ]
				%lsr.iv = phi i8* [ %scevgep, %vector.body ], [ %a, %entry ]
				%vec.phi = phi <16 x i8> [ <i8 -1, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0>, %entry ], [ %6, %vector.body ]
				%0 = phi i32 [ 32, %entry ], [ %7, %vector.body ]
				%1 = phi i32 [ 499, %entry ], [ %3, %vector.body ]
				%lsr.iv2325 = bitcast i8* %lsr.iv23 to <16 x i8>*
				%lsr.iv2022 = bitcast i8* %lsr.iv20 to <16 x i8>*
				%lsr.iv19 = bitcast i8* %lsr.iv to <16 x i8>*
				%2 = call <16 x i1> @llvm.arm.mve.vctp8(i32 %1)
				%3 = sub i32 %1, 16
				%wide.masked.load = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %lsr.iv19, i32 1, <16 x i1> %2, <16 x i8> undef)
				%wide.masked.load15 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %lsr.iv2022, i32 1, <16 x i1> %2, <16 x i8> undef)
				%4 = mul <16 x i8> %wide.masked.load15, %wide.masked.load
				%wide.masked.load16 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %lsr.iv2325, i32 1, <16 x i1> %2, <16 x i8> undef)
				%5 = sub <16 x i8> %vec.phi, %wide.masked.load16
				%6 = sub <16 x i8> %5, %4
				%scevgep = getelementptr i8, i8* %lsr.iv, i32 16
				%scevgep21 = getelementptr i8, i8* %lsr.iv20, i32 16
				%scevgep24 = getelementptr i8, i8* %lsr.iv23, i32 16
				%7 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %0, i32 1)
				%8 = icmp ne i32 %7, 0
				br i1 %8, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <16 x i8> [ %vec.phi, %vector.body ]
				%.lcssa = phi <16 x i8> [ %6, %vector.body ]
				%9 = call <16 x i1> @llvm.arm.mve.vctp8(i32 3)
				%10 = select <16 x i1> %9, <16 x i8> %.lcssa, <16 x i8> %vec.phi.lcssa
				%11 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> %10)
				ret i8 %11
				}

				define dso_local arm_aapcs_vfpcc zeroext i16 @test_add_reduce_16(i16* nocapture readonly %a, i16* nocapture readonly %b) {
				entry:
				call void @llvm.set.loop.iterations.i32(i32 38)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%lsr.iv18 = phi i16* [ %scevgep19, %vector.body ], [ %b, %entry ]
				%lsr.iv = phi i16* [ %scevgep, %vector.body ], [ %a, %entry ]
				%vec.phi = phi <8 x i16> [ zeroinitializer, %entry ], [ %5, %vector.body ]
				%0 = phi i32 [ 38, %entry ], [ %6, %vector.body ]
				%1 = phi i32 [ 299, %entry ], [ %3, %vector.body ]
				%lsr.iv1820 = bitcast i16* %lsr.iv18 to <8 x i16>*
				%lsr.iv17 = bitcast i16* %lsr.iv to <8 x i16>*
				%2 = call <8 x i1> @llvm.arm.mve.vctp16(i32 %1)
				%3 = sub i32 %1, 8
				%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %lsr.iv17, i32 2, <8 x i1> %2, <8 x i16> undef)
				%wide.masked.load14 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %lsr.iv1820, i32 2, <8 x i1> %2, <8 x i16> undef)
				%4 = and <8 x i16> %wide.masked.load14, %wide.masked.load
				%5 = add <8 x i16> %4, %vec.phi
				%scevgep = getelementptr i16, i16* %lsr.iv, i32 8
				%scevgep19 = getelementptr i16, i16* %lsr.iv18, i32 8
				%6 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %0, i32 1)
				%7 = icmp ne i32 %6, 0
				br i1 %7, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]
				%.lcssa = phi <8 x i16> [ %5, %vector.body ]
				%8 = call <8 x i1> @llvm.arm.mve.vctp16(i32 3)
				%9 = select <8 x i1> %8, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa
				%10 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> %9)
				ret i16 %10
				}

				define dso_local arm_aapcs_vfpcc zeroext i16 @test_sub_reduce_16(i16* nocapture readonly %a, i16* nocapture readonly %b) {
				entry:
				call void @llvm.set.loop.iterations.i32(i32 38)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%lsr.iv18 = phi i16* [ %scevgep19, %vector.body ], [ %b, %entry ]
				%lsr.iv = phi i16* [ %scevgep, %vector.body ], [ %a, %entry ]
				%vec.phi = phi <8 x i16> [ <i16 -1, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0>, %entry ], [ %5, %vector.body ]
				%0 = phi i32 [ 38, %entry ], [ %6, %vector.body ]
				%1 = phi i32 [ 299, %entry ], [ %3, %vector.body ]
				%lsr.iv1820 = bitcast i16* %lsr.iv18 to <8 x i16>*
				%lsr.iv17 = bitcast i16* %lsr.iv to <8 x i16>*
				%2 = call <8 x i1> @llvm.arm.mve.vctp16(i32 %1)
				%3 = sub i32 %1, 8
				%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %lsr.iv17, i32 2, <8 x i1> %2, <8 x i16> undef)
				%wide.masked.load14 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %lsr.iv1820, i32 2, <8 x i1> %2, <8 x i16> undef)
				%4 = and <8 x i16> %wide.masked.load14, %wide.masked.load
				%5 = sub <8 x i16> %vec.phi, %4
				%scevgep = getelementptr i16, i16* %lsr.iv, i32 8
				%scevgep19 = getelementptr i16, i16* %lsr.iv18, i32 8
				%6 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %0, i32 1)
				%7 = icmp ne i32 %6, 0
				br i1 %7, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]
				%.lcssa = phi <8 x i16> [ %5, %vector.body ]
				%8 = call <8 x i1> @llvm.arm.mve.vctp16(i32 3)
				%9 = select <8 x i1> %8, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa
				%10 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> %9)
				ret i16 %10
				}

				declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)
				declare i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8>)
				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
				declare i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <16 x i1> @llvm.arm.mve.vctp8(i32)
				declare <8 x i1> @llvm.arm.mve.vctp16(i32)

				...
				---
				name: test_add_reduce_8
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_add_reduce_8
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $lr = t2MOVi 32, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q1 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q1
				; CHECK: renamable $r3 = t2MOVi16 499, 14 /* CC::al */, $noreg
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q1, $r0, $r1, $r2, $r3
				; CHECK: renamable $vpr = MVE_VCTP8 renamable $r3, 0, $noreg
				; CHECK: $q0 = MVE_VORR killed $q1, killed $q1, 0, $noreg, undef $q0
				; CHECK: MVE_VPST 2, implicit $vpr
				; CHECK: renamable $r0, renamable $q1 = MVE_VLDRBU8_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv20, align 1)
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRBU8_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv2123, align 1)
				; CHECK: renamable $r2, renamable $q3 = MVE_VLDRBU8_post killed renamable $r2, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv2426, align 1)
				; CHECK: renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 16, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VMULi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				; CHECK: renamable $q2 = MVE_VADDi8 killed renamable $q3, renamable $q0, 0, $noreg, undef renamable $q2
				; CHECK: renamable $q1 = MVE_VADDi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q1
				; CHECK: renamable $r0, dead $cpsr = tMOVi8 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $vpr = MVE_VCTP8 killed renamable $r0, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q1, killed renamable $q0, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu8no_acc killed renamable $q0, 0, $noreg
				; CHECK: renamable $r0 = tUXTB killed renamable $r0, 14 /* CC::al */, $noreg
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $lr

				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $lr = t2MOVi 32, 14, $noreg, $noreg
				renamable $q1 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q1
				renamable $r3 = t2MOVi16 499, 14, $noreg
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q1, $r0, $r1, $r2, $r3

				renamable $vpr = MVE_VCTP8 renamable $r3, 0, $noreg
				$q0 = MVE_VORR killed $q1, $q1, 0, $noreg, undef $q0
				MVE_VPST 2, implicit $vpr
				renamable $r0, renamable $q1 = MVE_VLDRBU8_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv20, align 1)
				renamable $r1, renamable $q2 = MVE_VLDRBU8_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv2123, align 1)
				renamable $r2, renamable $q3 = MVE_VLDRBU8_post killed renamable $r2, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv2426, align 1)
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 16, 14, $noreg
				renamable $q1 = MVE_VMULi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $q2 = MVE_VADDi8 killed renamable $q3, renamable $q0, 0, $noreg, undef renamable $q2
				renamable $q1 = MVE_VADDi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q1

				renamable $r0, dead $cpsr = tMOVi8 3, 14, $noreg
				renamable $vpr = MVE_VCTP8 killed renamable $r0, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q1, killed renamable $q0, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu8no_acc killed renamable $q0, 0, $noreg
				renamable $r0 = tUXTB killed renamable $r0, 14, $noreg
				tPOP_RET 14, $noreg, def $r7, def $pc, implicit killed $r0

				...
				---
				name: test_sub_reduce_8
				alignment: 16
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants:
				- id: 0
				value: '<16 x i8> <i8 -1, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0>'
				alignment: 16
				isTargetSpecific: false
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_sub_reduce_8
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $r3 = tLEApcrel %const.0, 14 /* CC::al */, $noreg
				; CHECK: renamable $lr = t2MOVi 32, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q1 = MVE_VLDRWU32 killed renamable $r3, 0, 0, $noreg :: (load 16 from constant-pool)
				; CHECK: renamable $r3 = t2MOVi16 499, 14 /* CC::al */, $noreg
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q1, $r0, $r1, $r2, $r3
				; CHECK: renamable $vpr = MVE_VCTP8 renamable $r3, 0, $noreg
				; CHECK: $q0 = MVE_VORR killed $q1, killed $q1, 0, $noreg, undef $q0
				; CHECK: MVE_VPST 2, implicit $vpr
				; CHECK: renamable $r0, renamable $q1 = MVE_VLDRBU8_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv19, align 1)
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRBU8_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv2022, align 1)
				; CHECK: renamable $r2, renamable $q3 = MVE_VLDRBU8_post killed renamable $r2, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv2325, align 1)
				; CHECK: renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 16, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VMULi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				; CHECK: renamable $q2 = MVE_VSUBi8 renamable $q0, killed renamable $q3, 0, $noreg, undef renamable $q2
				; CHECK: renamable $q1 = MVE_VSUBi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q1
				; CHECK: renamable $r0, dead $cpsr = tMOVi8 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $vpr = MVE_VCTP8 killed renamable $r0, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q1, killed renamable $q0, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu8no_acc killed renamable $q0, 0, $noreg
				; CHECK: renamable $r0 = tUXTB killed renamable $r0, 14 /* CC::al */, $noreg
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc, implicit killed $r0
				; CHECK: bb.3 (align 16):
				; CHECK: CONSTPOOL_ENTRY 0, %const.0, 16
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $lr

				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $r3 = tLEApcrel %const.0, 14, $noreg
				renamable $lr = t2MOVi 32, 14, $noreg, $noreg
				renamable $q1 = MVE_VLDRWU32 killed renamable $r3, 0, 0, $noreg :: (load 16 from constant-pool)
				renamable $r3 = t2MOVi16 499, 14, $noreg
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q1, $r0, $r1, $r2, $r3

				renamable $vpr = MVE_VCTP8 renamable $r3, 0, $noreg
				$q0 = MVE_VORR killed $q1, $q1, 0, $noreg, undef $q0
				MVE_VPST 2, implicit $vpr
				renamable $r0, renamable $q1 = MVE_VLDRBU8_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv19, align 1)
				renamable $r1, renamable $q2 = MVE_VLDRBU8_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv2022, align 1)
				renamable $r2, renamable $q3 = MVE_VLDRBU8_post killed renamable $r2, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv2325, align 1)
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 16, 14, $noreg
				renamable $q1 = MVE_VMULi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $q2 = MVE_VSUBi8 renamable $q0, killed renamable $q3, 0, $noreg, undef renamable $q2
				renamable $q1 = MVE_VSUBi8 killed renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q1

				renamable $r0, dead $cpsr = tMOVi8 3, 14, $noreg
				renamable $vpr = MVE_VCTP8 killed renamable $r0, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q1, killed renamable $q0, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu8no_acc killed renamable $q0, 0, $noreg
				renamable $r0 = tUXTB killed renamable $r0, 14, $noreg
				tPOP_RET 14, $noreg, def $r7, def $pc, implicit killed $r0

				bb.3 (align 16):
				CONSTPOOL_ENTRY 0, %const.0, 16

				...
				---
				name: test_add_reduce_16
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_add_reduce_16
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $lr = t2MOVi 38, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $r2 = t2MOVi16 299, 14 /* CC::al */, $noreg
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q0, $r0, $r1, $r2
				; CHECK: renamable $vpr = MVE_VCTP16 renamable $r2, 0, $noreg
				; CHECK: $q1 = MVE_VORR killed $q0, killed $q0, 0, $noreg, undef $q1
				; CHECK: MVE_VPST 4, implicit $vpr
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRHU16_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv17, align 2)
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRHU16_post killed renamable $r1, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv1820, align 2)
				; CHECK: renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VAND killed renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q0 = MVE_VADDi16 killed renamable $q0, renamable $q1, 0, $noreg, undef renamable $q0
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q1
				; CHECK: renamable $r0, dead $cpsr = tMOVi8 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $vpr = MVE_VCTP16 killed renamable $r0, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				; CHECK: renamable $r0 = tUXTH killed renamable $r0, 14 /* CC::al */, $noreg
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $lr

				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $lr = t2MOVi 38, 14, $noreg, $noreg
				renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				renamable $r2 = t2MOVi16 299, 14, $noreg
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q0, $r0, $r1, $r2

				renamable $vpr = MVE_VCTP16 renamable $r2, 0, $noreg
				$q1 = MVE_VORR killed $q0, $q0, 0, $noreg, undef $q1
				MVE_VPST 4, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRHU16_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv17, align 2)
				renamable $r1, renamable $q2 = MVE_VLDRHU16_post killed renamable $r1, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv1820, align 2)
				renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14, $noreg
				renamable $q0 = MVE_VAND killed renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VADDi16 killed renamable $q0, renamable $q1, 0, $noreg, undef renamable $q0
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q1

				renamable $r0, dead $cpsr = tMOVi8 3, 14, $noreg
				renamable $vpr = MVE_VCTP16 killed renamable $r0, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				renamable $r0 = tUXTH killed renamable $r0, 14, $noreg
				tPOP_RET 14, $noreg, def $r7, def $pc, implicit killed $r0

				...
				---
				name: test_sub_reduce_16
				alignment: 16
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants:
				- id: 0
				value: '<8 x i16> <i16 -1, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0>'
				alignment: 16
				isTargetSpecific: false
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_sub_reduce_16
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: dead $r7 = frame-setup tMOVr $sp, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_register $r7
				; CHECK: renamable $r2 = tLEApcrel %const.0, 14 /* CC::al */, $noreg
				; CHECK: renamable $lr = t2MOVi 38, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $q0 = MVE_VLDRWU32 killed renamable $r2, 0, 0, $noreg :: (load 16 from constant-pool)
				; CHECK: renamable $r2 = t2MOVi16 299, 14 /* CC::al */, $noreg
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: bb.1.vector.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $lr, $q0, $r0, $r1, $r2
				; CHECK: renamable $vpr = MVE_VCTP16 renamable $r2, 0, $noreg
				; CHECK: $q1 = MVE_VORR killed $q0, killed $q0, 0, $noreg, undef $q1
				; CHECK: MVE_VPST 4, implicit $vpr
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRHU16_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv17, align 2)
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRHU16_post killed renamable $r1, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv1820, align 2)
				; CHECK: renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VAND killed renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q0 = MVE_VSUBi16 renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.middle.block:
				; CHECK: liveins: $q0, $q1
				; CHECK: renamable $r0, dead $cpsr = tMOVi8 3, 14 /* CC::al */, $noreg
				; CHECK: renamable $vpr = MVE_VCTP16 killed renamable $r0, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				; CHECK: renamable $r0 = tUXTH killed renamable $r0, 14 /* CC::al */, $noreg
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc, implicit killed $r0
				; CHECK: bb.3 (align 16):
				; CHECK: CONSTPOOL_ENTRY 0, %const.0, 16
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $lr

				frame-setup tPUSH 14, $noreg, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				$r7 = frame-setup tMOVr $sp, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa_register $r7
				renamable $r2 = tLEApcrel %const.0, 14, $noreg
				renamable $lr = t2MOVi 38, 14, $noreg, $noreg
				renamable $q0 = MVE_VLDRWU32 killed renamable $r2, 0, 0, $noreg :: (load 16 from constant-pool)
				renamable $r2 = t2MOVi16 299, 14, $noreg
				t2DoLoopStart renamable $lr

				bb.1.vector.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $lr, $q0, $r0, $r1, $r2

				renamable $vpr = MVE_VCTP16 renamable $r2, 0, $noreg
				$q1 = MVE_VORR killed $q0, $q0, 0, $noreg, undef $q1
				MVE_VPST 4, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRHU16_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.lsr.iv17, align 2)
				renamable $r1, renamable $q2 = MVE_VLDRHU16_post killed renamable $r1, 16, 1, killed renamable $vpr :: (load 16 from %ir.lsr.iv1820, align 2)
				renamable $r2, dead $cpsr = tSUBi8 killed renamable $r2, 8, 14, $noreg
				renamable $q0 = MVE_VAND killed renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VSUBi16 renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				t2LoopEnd renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14, $noreg

				bb.2.middle.block:
				liveins: $q0, $q1

				renamable $r0, dead $cpsr = tMOVi8 3, 14, $noreg
				renamable $vpr = MVE_VCTP16 killed renamable $r0, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu16no_acc killed renamable $q0, 0, $noreg
				renamable $r0 = tUXTH killed renamable $r0, 14, $noreg
				tPOP_RET 14, $noreg, def $r7, def $pc, implicit killed $r0

				bb.3 (align 16):
				CONSTPOOL_ENTRY 0, %const.0, 16

				...
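
The loop constants in the tests above look consistent with a simple round-up division. The 16-bit tests load 38 into `$lr` and 299 into `$r2`, subtract 8 per iteration, and feed 3 to the `MVE_VCTP16` in `middle.block`; the preceding 8-bit test uses 32, 499, 16 and 3. A minimal sketch of that arithmetic follows. `loop_constants` is a hypothetical helper written for illustration only and is not part of the patch.

```python
def loop_constants(num_elements, lanes):
    """Return (iteration count, active lanes in the final iteration).

    Hypothetical helper: assumes num_elements > 0 and that each
    iteration consumes `lanes` elements, as the tSUBi8 in the loop does.
    """
    iterations = (num_elements + lanes - 1) // lanes   # round up
    tail = num_elements - (iterations - 1) * lanes     # left for the last pass
    return iterations, tail


# 16-bit tests: 299 elements, 8 lanes per vector.
print(loop_constants(299, 8))
# 8-bit test: 499 elements, 16 lanes per vector.
print(loop_constants(499, 16))
```

The second value is what the exit-block `vctp` needs in order to select only the lanes that were genuinely active in the final iteration.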

llvm/test/CodeGen/Thumb2/LowOverheadLoops/two-reducing-loops.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s
				--- |
				define dso_local arm_aapcs_vfpcc i32 @two_reducing_loops(i16* nocapture readonly %a, i16* nocapture readonly %b, i8* nocapture readonly %c, i32 %N) {
				entry:
				%cmp28 = icmp eq i32 %N, 0
				%0 = add i32 %N, 3
				%1 = lshr i32 %0, 2
				%2 = shl nuw i32 %1, 2
				%3 = add i32 %2, -4
				%4 = lshr i32 %3, 2
				%5 = add nuw nsw i32 %4, 1
				br i1 %cmp28, label %for.cond.cleanup7, label %vector.ph

				vector.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %5)
				%6 = shl i32 %4, 2
				%7 = sub i32 %N, %6
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%lsr.iv66 = phi i16* [ %scevgep67, %vector.body ], [ %b, %vector.ph ]
				%lsr.iv63 = phi i16* [ %scevgep64, %vector.body ], [ %a, %vector.ph ]
				%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %15, %vector.body ]
				%8 = phi i32 [ %5, %vector.ph ], [ %16, %vector.body ]
				%9 = phi i32 [ %N, %vector.ph ], [ %11, %vector.body ]
				%lsr.iv6668 = bitcast i16* %lsr.iv66 to <4 x i16>*
				%lsr.iv6365 = bitcast i16* %lsr.iv63 to <4 x i16>*
				%10 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %9)
				%11 = sub i32 %9, 4
				%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %lsr.iv6365, i32 2, <4 x i1> %10, <4 x i16> undef)
				%12 = zext <4 x i16> %wide.masked.load to <4 x i32>
				%wide.masked.load36 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %lsr.iv6668, i32 2, <4 x i1> %10, <4 x i16> undef)
				%13 = zext <4 x i16> %wide.masked.load36 to <4 x i32>
				%14 = mul nuw nsw <4 x i32> %13, %12
				%15 = add <4 x i32> %14, %vec.phi
				%scevgep64 = getelementptr i16, i16* %lsr.iv63, i32 4
				%scevgep67 = getelementptr i16, i16* %lsr.iv66, i32 4
				%16 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %8, i32 1)
				%17 = icmp ne i32 %16, 0
				br i1 %17, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <4 x i32> [ %vec.phi, %vector.body ]
				%.lcssa70 = phi <4 x i32> [ %15, %vector.body ]
				%18 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %7)
				%19 = icmp eq i32 %N, 0
				%20 = select <4 x i1> %18, <4 x i32> %.lcssa70, <4 x i32> %vec.phi.lcssa
				%21 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %20)
				%22 = add i32 %N, 3
				%23 = lshr i32 %22, 2
				%24 = shl nuw i32 %23, 2
				%25 = add i32 %24, -4
				%26 = lshr i32 %25, 2
				%27 = add nuw nsw i32 %26, 1
				br i1 %19, label %for.cond.cleanup7, label %vector.ph40

				vector.ph40: ; preds = %middle.block
				%28 = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 %21, i32 0
				call void @llvm.set.loop.iterations.i32(i32 %27)
				%29 = shl i32 %26, 2
				%30 = sub i32 %N, %29
				br label %vector.body39

				vector.body39: ; preds = %vector.body39, %vector.ph40
				%lsr.iv = phi i8* [ %scevgep, %vector.body39 ], [ %c, %vector.ph40 ]
				%vec.phi51 = phi <4 x i32> [ %28, %vector.ph40 ], [ %36, %vector.body39 ]
				%31 = phi i32 [ %27, %vector.ph40 ], [ %37, %vector.body39 ]
				%32 = phi i32 [ %N, %vector.ph40 ], [ %34, %vector.body39 ]
				%lsr.iv62 = bitcast i8* %lsr.iv to <4 x i8>*
				%33 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %32)
				%34 = sub i32 %32, 4
				%wide.masked.load54 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %lsr.iv62, i32 1, <4 x i1> %33, <4 x i8> undef)
				%35 = zext <4 x i8> %wide.masked.load54 to <4 x i32>
				%36 = sub <4 x i32> %vec.phi51, %35
				%scevgep = getelementptr i8, i8* %lsr.iv, i32 4
				%37 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %31, i32 1)
				%38 = icmp ne i32 %37, 0
				br i1 %38, label %vector.body39, label %middle.block37

				middle.block37: ; preds = %vector.body39
				%vec.phi51.lcssa = phi <4 x i32> [ %vec.phi51, %vector.body39 ]
				%.lcssa = phi <4 x i32> [ %36, %vector.body39 ]
				%39 = call <4 x i1> @llvm.arm.mve.vctp32(i32 %30)
				%40 = select <4 x i1> %39, <4 x i32> %.lcssa, <4 x i32> %vec.phi51.lcssa
				%41 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %40)
				br label %for.cond.cleanup7

				for.cond.cleanup7: ; preds = %middle.block37, %entry, %middle.block
				%res.1.lcssa = phi i32 [ %21, %middle.block ], [ 0, %entry ], [ %41, %middle.block37 ]
				ret i32 %res.1.lcssa
				}

				declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>) #1
				declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>) #2
				declare <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>*, i32 immarg, <4 x i1>, <4 x i8>) #1
				declare void @llvm.set.loop.iterations.i32(i32) #3
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #3
				declare <4 x i1> @llvm.arm.mve.vctp32(i32) #4
				...
				---
				name: two_reducing_loops
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 16
				offsetAdjustment: -8
				maxAlignment: 4
				fixedStack: []
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r5', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 3, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: two_reducing_loops
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.8(0x30000000), %bb.1(0x50000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2, $r3, $r4, $r5
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $r5, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 16
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: frame-setup CFI_INSTRUCTION offset $r5, -12
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -16
				; CHECK: dead $r7 = frame-setup tADDrSPi $sp, 2, 14 /* CC::al */, $noreg
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa $r7, 8
				; CHECK: tCMPi8 renamable $r3, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: tBcc %bb.8, 0 /* CC::eq */, killed $cpsr
				; CHECK: bb.1.vector.ph:
				; CHECK: successors: %bb.2(0x80000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3
				; CHECK: renamable $r5, dead $cpsr = tADDi3 renamable $r3, 3, 14 /* CC::al */, $noreg
				; CHECK: $q1 = MVE_VMOVimmi32 0, 0, $noreg, undef $q1
				; CHECK: renamable $r5 = t2BICri killed renamable $r5, 3, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r4, dead $cpsr = tSUBi3 killed renamable $r5, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $r5, dead $cpsr = tLSRri renamable $r4, 2, 14 /* CC::al */, $noreg
				; CHECK: renamable $r12 = t2SUBrs renamable $r3, killed renamable $r5, 18, 14 /* CC::al */, $noreg, $noreg
				; CHECK: $r5 = tMOVr $r3, 14 /* CC::al */, $noreg
				; CHECK: $lr = MVE_DLSTP_32 killed renamable $r5
				; CHECK: bb.2.vector.body:
				; CHECK: successors: %bb.2(0x7c000000), %bb.3(0x04000000)
				; CHECK: liveins: $lr, $q1, $r0, $r1, $r2, $r3, $r4, $r12
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRHU32_post killed renamable $r0, 8, 0, $noreg :: (load 8 from %ir.lsr.iv6365, align 2)
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRHU32_post killed renamable $r1, 8, 0, killed $noreg :: (load 8 from %ir.lsr.iv6668, align 2)
				; CHECK: renamable $q0 = nuw nsw MVE_VMULi32 killed renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: $q1 = MVE_VADDi32 renamable $q0, killed renamable $q1, 0, $noreg, undef $q1
				; CHECK: $lr = MVE_LETP killed renamable $lr, %bb.2
				; CHECK: bb.3.middle.block:
				; CHECK: successors: %bb.7(0x30000000), %bb.4(0x50000000)
				; CHECK: liveins: $q0, $q1, $r2, $r3, $r4, $r12
				; CHECK: $q0 = MVE_VORR killed $q1, killed $q1, 0, $noreg, killed $q0
				; CHECK: renamable $r0 = MVE_VADDVu32no_acc killed renamable $q0, 0, $noreg
				; CHECK: tCBZ $r3, %bb.7
				; CHECK: bb.4.vector.ph40:
				; CHECK: successors: %bb.5(0x80000000)
				; CHECK: liveins: $r0, $r2, $r3, $r4, $r12
				; CHECK: renamable $r1, dead $cpsr = tMOVi8 1, 14 /* CC::al */, $noreg
				; CHECK: renamable $lr = nuw nsw t2ADDrs killed renamable $r1, killed renamable $r4, 19, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r1, dead $cpsr = tMOVi8 0, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VMOVimmi32 255, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q1 = MVE_VDUP32 killed renamable $r1, 0, $noreg, undef renamable $q1
				; CHECK: $lr = t2DLS killed renamable $lr
				; CHECK: renamable $q1 = MVE_VMOV_to_lane_32 killed renamable $q1, killed renamable $r0, 0, 14 /* CC::al */, $noreg
				; CHECK: bb.5.vector.body39:
				; CHECK: successors: %bb.5(0x7c000000), %bb.6(0x04000000)
				; CHECK: liveins: $lr, $q0, $q1, $r2, $r3, $r12
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				; CHECK: $q2 = MVE_VORR killed $q1, killed $q1, 0, $noreg, undef $q2
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r2, renamable $q1 = MVE_VLDRBU32_post killed renamable $r2, 4, 1, killed renamable $vpr :: (load 4 from %ir.lsr.iv62, align 1)
				; CHECK: renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VAND killed renamable $q1, renamable $q0, 0, $noreg, undef renamable $q1
				; CHECK: renamable $q1 = MVE_VSUBi32 renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.5
				; CHECK: bb.6.middle.block37:
				; CHECK: successors: %bb.7(0x80000000)
				; CHECK: liveins: $q1, $q2, $r12
				; CHECK: renamable $vpr = MVE_VCTP32 killed renamable $r12, 0, $noreg
				; CHECK: renamable $q0 = MVE_VPSEL killed renamable $q1, killed renamable $q2, 0, killed renamable $vpr
				; CHECK: renamable $r0 = MVE_VADDVu32no_acc killed renamable $q0, 0, $noreg
				; CHECK: bb.7.for.cond.cleanup7:
				; CHECK: liveins: $r0
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $r5, def $r7, def $pc, implicit killed $r0
				; CHECK: bb.8:
				; CHECK: renamable $r0, dead $cpsr = tMOVi8 0, 14 /* CC::al */, $noreg
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $r5, def $r7, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.8(0x30000000), %bb.1(0x50000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $r5, $lr

				frame-setup tPUSH 14, $noreg, killed $r4, killed $r5, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 16
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				frame-setup CFI_INSTRUCTION offset $r5, -12
				frame-setup CFI_INSTRUCTION offset $r4, -16
				$r7 = frame-setup tADDrSPi $sp, 2, 14, $noreg
				frame-setup CFI_INSTRUCTION def_cfa $r7, 8
				tCMPi8 renamable $r3, 0, 14, $noreg, implicit-def $cpsr
				tBcc %bb.8, 0, killed $cpsr

				bb.1.vector.ph:
				successors: %bb.2(0x80000000)
				liveins: $r0, $r1, $r2, $r3

				renamable $r5, dead $cpsr = tADDi3 renamable $r3, 3, 14, $noreg
				renamable $q0 = MVE_VMOVimmi32 0, 0, $noreg, undef renamable $q0
				renamable $r5 = t2BICri killed renamable $r5, 3, 14, $noreg, $noreg
				renamable $r4, dead $cpsr = tSUBi3 killed renamable $r5, 4, 14, $noreg
				renamable $r5, dead $cpsr = tMOVi8 1, 14, $noreg
				renamable $lr = nuw nsw t2ADDrs killed renamable $r5, renamable $r4, 19, 14, $noreg, $noreg
				renamable $r5, dead $cpsr = tLSRri renamable $r4, 2, 14, $noreg
				renamable $r12 = t2SUBrs renamable $r3, killed renamable $r5, 18, 14, $noreg, $noreg
				$r5 = tMOVr $r3, 14, $noreg
				t2DoLoopStart renamable $lr

				bb.2.vector.body:
				successors: %bb.2(0x7c000000), %bb.3(0x04000000)
				liveins: $lr, $q0, $r0, $r1, $r2, $r3, $r4, $r5, $r12

				renamable $vpr = MVE_VCTP32 renamable $r5, 0, $noreg
				$q1 = MVE_VORR killed $q0, $q0, 0, $noreg, undef $q1
				MVE_VPST 4, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRHU32_post killed renamable $r0, 8, 1, renamable $vpr :: (load 8 from %ir.lsr.iv6365, align 2)
				renamable $r1, renamable $q2 = MVE_VLDRHU32_post killed renamable $r1, 8, 1, killed renamable $vpr :: (load 8 from %ir.lsr.iv6668, align 2)
				renamable $r5, dead $cpsr = tSUBi8 killed renamable $r5, 4, 14, $noreg
				renamable $q0 = nuw nsw MVE_VMULi32 killed renamable $q2, killed renamable $q0, 0, $noreg, undef renamable $q0
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VADDi32 killed renamable $q0, renamable $q1, 0, $noreg, undef renamable $q0
				t2LoopEnd renamable $lr, %bb.2, implicit-def dead $cpsr
				tB %bb.3, 14, $noreg

				bb.3.middle.block:
				successors: %bb.7(0x30000000), %bb.4(0x50000000)
				liveins: $q0, $q1, $r2, $r3, $r4, $r12

				renamable $vpr = MVE_VCTP32 renamable $r12, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q0, killed renamable $q1, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu32no_acc killed renamable $q0, 0, $noreg
				tCBZ $r3, %bb.7

				bb.4.vector.ph40:
				successors: %bb.5(0x80000000)
				liveins: $r0, $r2, $r3, $r4, $r12

				renamable $r1, dead $cpsr = tMOVi8 1, 14, $noreg
				renamable $lr = nuw nsw t2ADDrs killed renamable $r1, killed renamable $r4, 19, 14, $noreg, $noreg
				renamable $r1, dead $cpsr = tMOVi8 0, 14, $noreg
				renamable $q0 = MVE_VMOVimmi32 255, 0, $noreg, undef renamable $q0
				renamable $q1 = MVE_VDUP32 killed renamable $r1, 0, $noreg, undef renamable $q1
				t2DoLoopStart renamable $lr
				renamable $q1 = MVE_VMOV_to_lane_32 killed renamable $q1, killed renamable $r0, 0, 14, $noreg

				bb.5.vector.body39:
				successors: %bb.5(0x7c000000), %bb.6(0x04000000)
				liveins: $lr, $q0, $q1, $r2, $r3, $r12

				renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				$q2 = MVE_VORR killed $q1, $q1, 0, $noreg, undef $q2
				MVE_VPST 8, implicit $vpr
				renamable $r2, renamable $q1 = MVE_VLDRBU32_post killed renamable $r2, 4, 1, killed renamable $vpr :: (load 4 from %ir.lsr.iv62, align 1)
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14, $noreg
				renamable $q1 = MVE_VAND killed renamable $q1, renamable $q0, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q1 = MVE_VSUBi32 renamable $q2, killed renamable $q1, 0, $noreg, undef renamable $q1
				t2LoopEnd renamable $lr, %bb.5, implicit-def dead $cpsr
				tB %bb.6, 14, $noreg

				bb.6.middle.block37:
				successors: %bb.7(0x80000000)
				liveins: $q1, $q2, $r12

				renamable $vpr = MVE_VCTP32 killed renamable $r12, 0, $noreg
				renamable $q0 = MVE_VPSEL killed renamable $q1, killed renamable $q2, 0, killed renamable $vpr
				renamable $r0 = MVE_VADDVu32no_acc killed renamable $q0, 0, $noreg

				bb.7.for.cond.cleanup7:
				liveins: $r0

				tPOP_RET 14, $noreg, def $r4, def $r5, def $r7, def $pc, implicit killed $r0

				bb.8:
				renamable $r0, dead $cpsr = tMOVi8 0, 14, $noreg
				tPOP_RET 14, $noreg, def $r4, def $r5, def $r7, def $pc, implicit killed $r0

				...
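
To make the expected output of the first loop easier to follow, here is a rough scalar model of the two forms. It assumes 4 lanes of 32 bits, models the inactive lanes of a predicated load as an arbitrary `junk` value, and mirrors the `%0`..`%7` setup arithmetic from the embedded IR. All function names are hypothetical and exist only for this sketch; it is not how the pass is implemented.

```python
LANES = 4  # 32-bit lanes in one 128-bit MVE Q register


def ir_loop_setup(n):
    """Mirror %0..%7 of the embedded IR for n > 0: (iterations, tail count)."""
    rounded = ((n + 3) >> 2) << 2       # %0, %1, %2
    full = (rounded - 4) >> 2           # %3, %4
    return full + 1, n - (full << 2)    # %5 and %7


def vctp(remaining):
    """Model of vctp.32: lane i is active while i < remaining."""
    return [i < remaining for i in range(LANES)]


def reduce_with_vpsel(a, b, junk=12345):
    """Input form: predicated loads, unpredicated vadd, vpsel after the loop."""
    iterations, tail = ir_loop_setup(len(a))
    acc = [0] * LANES
    prev = acc
    remaining = len(a)
    for it in range(iterations):
        mask = vctp(remaining)
        prev = acc  # the vorr copy of the accumulator
        base = it * LANES
        # Inactive lanes are never read from memory; they hold `junk`.
        x = [a[base + i] if mask[i] else junk for i in range(LANES)]
        y = [b[base + i] if mask[i] else junk for i in range(LANES)]
        acc = [x[i] * y[i] + prev[i] for i in range(LANES)]
        remaining -= LANES
    keep = vctp(tail)
    merged = [acc[i] if keep[i] else prev[i] for i in range(LANES)]
    return sum(merged) & 0xFFFFFFFF  # 32-bit result of the final vaddv


def reduce_tail_predicated(a, b):
    """Output form: the accumulator updates itself in active lanes only."""
    acc = [0] * LANES
    remaining = len(a)
    base = 0
    while remaining > 0:
        mask = vctp(remaining)
        for i in range(LANES):
            if mask[i]:
                acc[i] = a[base + i] * b[base + i] + acc[i]
        remaining -= LANES
        base += LANES
    return sum(acc) & 0xFFFFFFFF


a = list(range(1, 11))   # 10 elements: two full vectors plus a tail of 2
b = [2] * 10
print(reduce_with_vpsel(a, b), reduce_tail_predicated(a, b))
```

Both functions agree whatever `junk` is, which is the property that lets the `MVE_VCTP32`, `MVE_VORR` and `MVE_VPSEL` of the first loop be dropped in the CHECK lines. The second loop (`bb.5`), which accumulates with `MVE_VSUBi32`, keeps its `MVE_VCTP32` and `MVE_VPSEL` in the expected output, in line with the summary's "for now only vadd".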

llvm/test/CodeGen/Thumb2/LowOverheadLoops/vector-arith-codegen.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=armv8.1m.main -mattr=+mve -disable-mve-tail-predication=false --verify-machineinstrs %s -o - | FileCheck %s			; RUN: llc -mtriple=armv8.1m.main -mattr=+mve -disable-mve-tail-predication=false --verify-machineinstrs %s -o - | FileCheck %s

	define dso_local i32 @mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {			define dso_local i32 @mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {
	; CHECK-LABEL: mul_reduce_add:			; CHECK-LABEL: mul_reduce_add:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: cmp r2, #0			; CHECK-NEXT: cmp r2, #0
	; CHECK-NEXT: itt eq			; CHECK-NEXT: itt eq
	; CHECK-NEXT: moveq r0, #0			; CHECK-NEXT: moveq r0, #0
	; CHECK-NEXT: bxeq lr			; CHECK-NEXT: bxeq lr
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: adds r3, r2, #3			; CHECK-NEXT: adds r3, r2, #3
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q1, #0x0
	; CHECK-NEXT: bic r3, r3, #3			; CHECK-NEXT: bic r3, r3, #3
	; CHECK-NEXT: sub.w r12, r3, #4			; CHECK-NEXT: sub.w r12, r3, #4
	; CHECK-NEXT: movs r3, #1
	; CHECK-NEXT: add.w lr, r3, r12, lsr #2
	; CHECK-NEXT: lsr.w r3, r12, #2			; CHECK-NEXT: lsr.w r3, r12, #2
	; CHECK-NEXT: sub.w r3, r2, r3, lsl #2			; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: dlstp.32 lr, r2
	; CHECK-NEXT: .LBB0_1: @ %vector.body			; CHECK-NEXT: .LBB0_1: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r2			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vmov q1, q0			; CHECK-NEXT: vldrw.u32 q2, [r1], #16
	; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q0, [r0], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
	; CHECK-NEXT: subs r2, #4
	; CHECK-NEXT: vmul.i32 q0, q2, q0			; CHECK-NEXT: vmul.i32 q0, q2, q0
	; CHECK-NEXT: vadd.i32 q0, q0, q1			; CHECK-NEXT: vadd.i32 q1, q0, q1
	; CHECK-NEXT: le lr, .LBB0_1			; CHECK-NEXT: letp lr, .LBB0_1
	; CHECK-NEXT: @ %bb.2: @ %middle.block			; CHECK-NEXT: @ %bb.2: @ %middle.block
	; CHECK-NEXT: vctp.32 r3			; CHECK-NEXT: vmov q0, q1
	; CHECK-NEXT: vpsel q0, q0, q1
	; CHECK-NEXT: vaddv.u32 r0, q0			; CHECK-NEXT: vaddv.u32 r0, q0
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%cmp8 = icmp eq i32 %N, 0			%cmp8 = icmp eq i32 %N, 0
	br i1 %cmp8, label %for.cond.cleanup, label %vector.ph			br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %N, 3			%n.rnd.up = add i32 %N, 3
	[36 lines not shown]
	; CHECK-LABEL: mul_reduce_add_const:			; CHECK-LABEL: mul_reduce_add_const:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: cmp r2, #0			; CHECK-NEXT: cmp r2, #0
	; CHECK-NEXT: itt eq			; CHECK-NEXT: itt eq
	; CHECK-NEXT: moveq r0, #0			; CHECK-NEXT: moveq r0, #0
	; CHECK-NEXT: bxeq lr			; CHECK-NEXT: bxeq lr
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: adds r1, r2, #3			; CHECK-NEXT: adds r1, r2, #3
	; CHECK-NEXT: movs r3, #1
	; CHECK-NEXT: bic r1, r1, #3			; CHECK-NEXT: bic r1, r1, #3
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q1, #0x0
	; CHECK-NEXT: subs r1, #4			; CHECK-NEXT: subs r1, #4
	; CHECK-NEXT: add.w lr, r3, r1, lsr #2
	; CHECK-NEXT: lsrs r1, r1, #2			; CHECK-NEXT: lsrs r1, r1, #2
	; CHECK-NEXT: sub.w r1, r2, r1, lsl #2			; CHECK-NEXT: sub.w r1, r2, r1, lsl #2
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: dlstp.32 lr, r2
	; CHECK-NEXT: .LBB1_1: @ %vector.body			; CHECK-NEXT: .LBB1_1: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r2			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vmov q1, q0			; CHECK-NEXT: vadd.i32 q1, q0, q1
	; CHECK-NEXT: vpst			; CHECK-NEXT: letp lr, .LBB1_1
	; CHECK-NEXT: vldrwt.u32 q0, [r0], #16
	; CHECK-NEXT: subs r2, #4
	; CHECK-NEXT: vadd.i32 q0, q0, q1
	; CHECK-NEXT: le lr, .LBB1_1
	; CHECK-NEXT: @ %bb.2: @ %middle.block			; CHECK-NEXT: @ %bb.2: @ %middle.block
	; CHECK-NEXT: vctp.32 r1			; CHECK-NEXT: vmov q0, q1
	; CHECK-NEXT: vpsel q0, q0, q1
	; CHECK-NEXT: vaddv.u32 r0, q0			; CHECK-NEXT: vaddv.u32 r0, q0
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%cmp6 = icmp eq i32 %N, 0			%cmp6 = icmp eq i32 %N, 0
	br i1 %cmp6, label %for.cond.cleanup, label %vector.ph			br i1 %cmp6, label %for.cond.cleanup, label %vector.ph

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %N, 3			%n.rnd.up = add i32 %N, 3
	[32 lines not shown]
	; CHECK-LABEL: add_reduce_add_const:			; CHECK-LABEL: add_reduce_add_const:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: cmp r2, #0			; CHECK-NEXT: cmp r2, #0
	; CHECK-NEXT: itt eq			; CHECK-NEXT: itt eq
	; CHECK-NEXT: moveq r0, #0			; CHECK-NEXT: moveq r0, #0
	; CHECK-NEXT: bxeq lr			; CHECK-NEXT: bxeq lr
	; CHECK-NEXT: push {r7, lr}			; CHECK-NEXT: push {r7, lr}
	; CHECK-NEXT: adds r1, r2, #3			; CHECK-NEXT: adds r1, r2, #3
	; CHECK-NEXT: movs r3, #1
	; CHECK-NEXT: bic r1, r1, #3			; CHECK-NEXT: bic r1, r1, #3
	; CHECK-NEXT: vmov.i32 q0, #0x0			; CHECK-NEXT: vmov.i32 q1, #0x0
	; CHECK-NEXT: subs r1, #4			; CHECK-NEXT: subs r1, #4
	; CHECK-NEXT: add.w lr, r3, r1, lsr #2
	; CHECK-NEXT: lsrs r1, r1, #2			; CHECK-NEXT: lsrs r1, r1, #2
	; CHECK-NEXT: sub.w r1, r2, r1, lsl #2			; CHECK-NEXT: sub.w r1, r2, r1, lsl #2
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: dlstp.32 lr, r2
	; CHECK-NEXT: .LBB2_1: @ %vector.body			; CHECK-NEXT: .LBB2_1: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r2			; CHECK-NEXT: vldrw.u32 q0, [r0], #16
	; CHECK-NEXT: vmov q1, q0			; CHECK-NEXT: vadd.i32 q1, q0, q1
	; CHECK-NEXT: vpst			; CHECK-NEXT: letp lr, .LBB2_1
	; CHECK-NEXT: vldrwt.u32 q0, [r0], #16
	; CHECK-NEXT: subs r2, #4
	; CHECK-NEXT: vadd.i32 q0, q0, q1
	; CHECK-NEXT: le lr, .LBB2_1
	; CHECK-NEXT: @ %bb.2: @ %middle.block			; CHECK-NEXT: @ %bb.2: @ %middle.block
	; CHECK-NEXT: vctp.32 r1			; CHECK-NEXT: vmov q0, q1
	; CHECK-NEXT: vpsel q0, q0, q1
	; CHECK-NEXT: vaddv.u32 r0, q0			; CHECK-NEXT: vaddv.u32 r0, q0
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%cmp6 = icmp eq i32 %N, 0			%cmp6 = icmp eq i32 %N, 0
	br i1 %cmp6, label %for.cond.cleanup, label %vector.ph			br i1 %cmp6, label %for.cond.cleanup, label %vector.ph

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %N, 3			%n.rnd.up = add i32 %N, 3
	[248 lines not shown]
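
The before/after CHECK lines for `mul_reduce_add_const` and `add_reduce_add_const` can be read as two ways of computing the same sum. The sketch below is a toy scalar model of that, not code from the patch: `Vec`, `kJunk`, `vctp32`, `vaddv`, `sum_with_vpsel`, `sum_tail_predicated` and `models_agree` are all hypothetical names introduced here. It assumes a four-lane 32-bit vector, and it deliberately treats false-predicated lanes of the old predicated load as junk that must not be relied on, which is why the left-hand sequence needs the `vpsel` in the exit block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Toy model of a Q register holding four 32-bit lanes (assumption of this sketch).
using Vec = std::array<uint32_t, 4>;

// Stand-in for the contents of a false-predicated load lane; the model
// assumes such lanes must not be relied on.
constexpr uint32_t kJunk = 0xDEADBEEFu;

// Models vctp.32: lane i is active while i < remaining.
static std::array<bool, 4> vctp32(std::size_t remaining) {
  std::array<bool, 4> p{};
  for (std::size_t i = 0; i < 4; ++i) p[i] = i < remaining;
  return p;
}

// Models vaddv.u32: horizontal add across the lanes.
static uint32_t vaddv(const Vec& v) {
  return std::accumulate(v.begin(), v.end(), uint32_t{0});
}

// Left-hand column: dls/le loop, explicit vctp, unpredicated vadd,
// and a vctp + vpsel in middle.block to discard the junk lanes.
uint32_t sum_with_vpsel(const std::vector<uint32_t>& a) {
  const std::size_t n = a.size();
  if (n == 0) return 0;                        // early "moveq r0, #0" exit
  const std::size_t iterations = (n + 3) / 4;  // value placed in lr
  Vec q0{}, q1{};
  for (std::size_t it = 0; it < iterations; ++it) {
    const std::size_t base = it * 4;
    const auto p = vctp32(n - base);           // vctp.32 r2
    q1 = q0;                                   // vmov q1, q0
    for (std::size_t i = 0; i < 4; ++i)        // vpst; vldrwt.u32 q0, [r0], #16
      q0[i] = p[i] ? a[base + i] : kJunk;
    for (std::size_t i = 0; i < 4; ++i)        // vadd.i32 q0, q0, q1 (all lanes)
      q0[i] += q1[i];
  }
  const std::size_t last = n - 4 * (iterations - 1);  // r1 in middle.block
  const auto p = vctp32(last);                 // vctp.32 r1
  for (std::size_t i = 0; i < 4; ++i)          // vpsel q0, q0, q1
    q0[i] = p[i] ? q0[i] : q1[i];
  return vaddv(q0);
}

// Right-hand column: dlstp/letp loop where the vadd accumulates into its own
// register and inactive lanes are simply left untouched.
uint32_t sum_tail_predicated(const std::vector<uint32_t>& a) {
  const std::size_t n = a.size();
  Vec q1{};                                    // vmov.i32 q1, #0x0
  for (std::size_t base = 0; base < n; base += 4) {
    const auto p = vctp32(n - base);           // implicit predicate from dlstp/letp
    for (std::size_t i = 0; i < 4; ++i)
      if (p[i]) q1[i] += a[base + i];          // vldrw + vadd.i32 q1, q0, q1
  }
  return vaddv(q1);                            // vmov q0, q1; vaddv.u32 r0, q0
}

// Compares both models against a plain scalar sum of 1..n.
bool models_agree(std::size_t n) {
  std::vector<uint32_t> a(n);
  std::iota(a.begin(), a.end(), uint32_t{1});
  const uint32_t expected = std::accumulate(a.begin(), a.end(), uint32_t{0});
  return sum_with_vpsel(a) == expected && sum_tail_predicated(a) == expected;
}
```

Calling `models_agree(n)` for element counts that are not a multiple of four (for example 5 or 11) exercises the final partial iteration, which is the case the `vmov`/`vctp`/`vpsel` sequence on the left exists to handle and which the self-updating `vadd` on the right handles through predication alone.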
