This is an archive of the discontinued LLVM Phabricator instance.

Differential D76235

[ARM][LowOverheadLoops] Add checks for narrowing
Closed · Public

Authored by samparker on Mar 16 2020, 10:14 AM.

Details

Reviewers

SjoerdMeijer
dmgreen

Commits

rG94cacebccadf: [ARM][LowOverheadLoops] Add checks for narrowing

Summary

Modify ValidateLiveOuts to track 'FalseZeros' more precisely, including checks on specific operations that can generate non-zeros from zero values, e.g. VMVN. We can then check that any instruction that retains some information in its output register (all narrowing instructions) only uses and defines registers that always have zeros in their falsely predicated bytes, whether or not tail predication happens.
Most of the logic remains the same; the data structures and helpers have just been renamed to reflect the change in logic. The key change, apart from the opcode checkers, is that the FalseZeros set now strictly contains only instructions which will always generate zeros, and not instructions that could also have their false bytes masked away later.
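
The classification described in this summary can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the pass itself: `ToyInst`, `Verdict` and `classify` are hypothetical names, and the real code works on `MachineInstr` and queries `ReachingDefAnalysis` rather than using operand indices.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for an MVE instruction in the loop body.
struct ToyInst {
  bool IsPredicatedLoad;     // masked load: false lanes are written as zero
  bool CanGenerateNonZeros;  // e.g. VMVN, VORN, VCLZ
  bool RetainsPreviousHalf;  // narrowing instruction
  std::vector<int> Operands; // indices of the instructions defining its inputs
};

enum class Verdict { Ok, Unsafe };

// Walk the instructions in program order. An instruction joins FalseLaneZeros
// only if it cannot create non-zeros and all its inputs are already known to
// have zeroed false lanes. A narrowing instruction that fails this test makes
// the whole check fail; anything else is recorded as Unknown.
Verdict classify(const std::vector<ToyInst> &Insts,
                 std::set<int> &FalseLaneZeros, std::set<int> &Unknown) {
  for (int I = 0; I < static_cast<int>(Insts.size()); ++I) {
    const ToyInst &TI = Insts[I];
    if (TI.IsPredicatedLoad) {
      FalseLaneZeros.insert(I);
      continue;
    }
    bool AllZeroInputs = !TI.CanGenerateNonZeros;
    for (int Op : TI.Operands)
      AllZeroInputs = AllZeroInputs && FalseLaneZeros.count(Op) != 0;
    if (AllZeroInputs)
      FalseLaneZeros.insert(I);
    else if (TI.RetainsPreviousHalf)
      return Verdict::Unsafe;
    else
      Unknown.insert(I);
  }
  return Verdict::Ok;
}
```

In this model, a VCLZ-like instruction fed by a masked load lands in `Unknown`, and a narrowing instruction that consumes its result makes `classify` return `Verdict::Unsafe`. That is the shape of loop that ctlz-non-zeros.mir exercises with its `CHECK-NOT: LETP` line.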

Diff Detail

Event Timeline

samparker created this revision. Mar 16 2020, 10:14 AM

Herald added subscribers: hiraditya, kristof.beyls. Mar 16 2020, 10:14 AM

SjoerdMeijer added inline comments. Mar 18 2020, 1:54 AM

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
521	This is becoming a very tricky pass (if it wasn't already), for a bunch of different reasons: the analysis, the transformations, etc. To keep things readable, I am now going to nitpick on comments. I believe there is one school of thought that code should be self-documenting, and that comments will always be out of date or misleading, which I think I can mostly agree with. But in this case, a function that checks certain opcodes and returns true/false, named `canGenerateNonZeros`, doesn't tell me much. So my request here is: can you define what exactly canGenerateNonZeros means, and why these opcodes?
538	Same request here: what is `retainsPreviousValue`, and why these opcodes?
583	Same here
588	Probably this comment describes the function's purpose, so can be moved up.

samparker marked 2 inline comments as done. Mar 19 2020, 2:20 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
521	Fair! Here, any MVE instruction that could generate a non-zero result given nothing but zeroed registers should be listed. This allows us to track that zeros in means zeros out.
538	All these instructions operate upon half a lane of the source register, writing to the destination, but they also leave the other half of the destination register untouched. The reference manual uses this sentence: 'The other half of the destination vector element retains its previous value'.
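
The "retains its previous value" behaviour described in the 538 reply can be shown with a toy model of a 'bottom' narrowing move in the style of VMOVNB.I16. This is an illustrative sketch only: `Vec8x16` and `narrowBottom` are hypothetical names, and the model ignores predication entirely.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight 16-bit elements, standing in for a 128-bit Q register.
using Vec8x16 = std::array<uint16_t, 8>;

// The low byte of each source element is written to the bottom half of the
// matching destination element. The top half of each destination element is
// not written, so it keeps whatever value it held before.
Vec8x16 narrowBottom(Vec8x16 Dst, const Vec8x16 &Src) {
  for (std::size_t I = 0; I < Dst.size(); ++I)
    Dst[I] = static_cast<uint16_t>((Dst[I] & 0xFF00) | (Src[I] & 0x00FF));
  return Dst;
}
```

Because half of every destination element survives the operation, the result is only zero in a given lane if both the source and the previous contents of the destination were zero there. This is why the patch looks at the definitions of all register operands of such an instruction, including the register it overwrites.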

SjoerdMeijer added inline comments. Mar 19 2020, 3:21 AM

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
521	Thanks, makes sense. If you can later add this as comments, that would be much appreciated. I am now going to do a second pass over this patch, let things sink in, and continue the review.
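
To make the "zeros in means zeros out" property from the 521 thread concrete, here is a per-lane sketch on 8-bit lanes. The function names are hypothetical and the models are deliberately minimal; they only show which kinds of operation break the property when every input lane is zero.

```cpp
#include <cassert>
#include <cstdint>

// Lane-wise add: zero inputs give a zero result.
uint8_t laneAdd(uint8_t A, uint8_t B) { return static_cast<uint8_t>(A + B); }

// Bitwise NOT, in the style of VMVN: a zero input gives all ones.
uint8_t laneNot(uint8_t A) { return static_cast<uint8_t>(~A); }

// OR with complement, in the style of VORN: zero inputs give all ones.
uint8_t laneOrn(uint8_t A, uint8_t B) { return static_cast<uint8_t>(A | ~B); }

// Count leading zeros, in the style of VCLZ on 8-bit lanes: a zero input
// gives the lane width, 8.
uint8_t laneClz(uint8_t A) {
  uint8_t N = 0;
  for (uint8_t Bit = 0x80; Bit != 0 && (A & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}
```

An add of two zeroed lanes stays zero, so the zeroed false lanes of a masked load propagate through it. The other three operations turn a zeroed lane into a non-zero one, which is why opcodes of this kind have to be listed explicitly.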

I am now up to date with this and don't have any questions about the logic; that looks good.
I would like to go over it once more when (some of) the nits have been addressed, as I would like to get an overview of how things read and look, since this introduces quite a few new things and concepts.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
538	Thanks for pointing this out. Just a nit/suggestion for the function name then: `retainsPreviousValue` -> `laneRetainsPreviousValue` A few questions about the list below. Is this list meant to be complete? From looking at the ArmARM a bit more for this behaviour, it looks like there are more, the first search result e.g. showed VQMOVN. Are we for example not interested in this one? Also curious if it would be worth describing this property in a different way, or if this switch will do.
583	nit: perhaps something along the lines of: `alwaysZero` -> `laneAlwaysZero`
617	nit: bytes -> lanes?
625	nit: KnownFalseZeros -> Predicated?
629	nit: perhaps something with "lane" in it: AlwaysFalseLaneZero, or FalseLaneZero, or FalsePredLaneZero

samparker marked an inline comment as done. Mar 23 2020, 1:59 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
538	It is supposed to be complete, so thanks! Which also reminds me that there are no tests for any movn either!

samparker mentioned this in D76608: [ARM][MVE] Add target flag for narrowing insts. Mar 23 2020, 6:34 AM

Added/moved some comments.
Renamed variables/functions.
Added/removed tests.

samparker added a parent revision: D76608: [ARM][MVE] Add target flag for narrowing insts. Mar 23 2020, 8:03 AM

Cheers, looks good to me.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
547	nit: retainsPreviousHalf -> retainsPreviousHalfElement
550	nit, just: return Flags & ARMII::RetainsPreviousHalfElement;
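
A note on why the exact form of a flag test like this matters: in C++, `!=` binds more tightly than `&`, so an expression written as `Flags & Mask != 0` is parsed as `Flags & (Mask != 0)` and only tests bit 0. The sketch below shows the difference with a made-up flag value; `ToyRetainsFlag` is hypothetical and is not the real `ARMII` constant.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bit; the real ARMII value is not shown in this review.
constexpr uint64_t ToyRetainsFlag = 1ull << 5;

// What `Flags & ToyRetainsFlag != 0` means to the compiler: the comparison
// is evaluated first and yields 1, so only bit 0 of Flags is tested.
bool asParsedWithoutParens(uint64_t Flags) {
  return (Flags & static_cast<uint64_t>(ToyRetainsFlag != 0)) != 0;
}

// The intended test: is the flag bit set?
bool withParens(uint64_t Flags) { return (Flags & ToyRetainsFlag) != 0; }
```

Either the parenthesised comparison or the shorter `return Flags & Mask;` suggested in the nit tests the intended bit.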

This revision is now accepted and ready to land. Mar 23 2020, 12:18 PM

Closed by commit rG94cacebccadf: [ARM][LowOverheadLoops] Add checks for narrowing (authored by samparker). Mar 24 2020, 2:08 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Mar 24 2020, 2:08 AM

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp	111 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/ctlz-non-zeros.mir	330 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/safe-retaining.mir	273 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/unsafe-retaining.mir	281 lines

Diff 252044

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp

	[... 55 lines not shown ...]

	using namespace llvm;			using namespace llvm;

	#define DEBUG_TYPE "arm-low-overhead-loops"			#define DEBUG_TYPE "arm-low-overhead-loops"
	#define ARM_LOW_OVERHEAD_LOOPS_NAME "ARM Low Overhead Loops pass"			#define ARM_LOW_OVERHEAD_LOOPS_NAME "ARM Low Overhead Loops pass"

	namespace {			namespace {

				using InstSet = SmallPtrSetImpl<MachineInstr *>;

	class PostOrderLoopTraversal {			class PostOrderLoopTraversal {
	MachineLoop &ML;			MachineLoop &ML;
	MachineLoopInfo &MLI;			MachineLoopInfo &MLI;
	SmallPtrSet<MachineBasicBlock*, 4> Visited;			SmallPtrSet<MachineBasicBlock*, 4> Visited;
	SmallVector<MachineBasicBlock*, 4> Order;			SmallVector<MachineBasicBlock*, 4> Order;

	public:			public:
	PostOrderLoopTraversal(MachineLoop &ML, MachineLoopInfo &MLI)			PostOrderLoopTraversal(MachineLoop &ML, MachineLoopInfo &MLI)
	[... 439 lines not shown ...]
	static bool isVectorPredicated(MachineInstr *MI) {			static bool isVectorPredicated(MachineInstr *MI) {
	int PIdx = llvm::findFirstVPTPredOperandIdx(*MI);			int PIdx = llvm::findFirstVPTPredOperandIdx(*MI);
	return PIdx != -1 && MI->getOperand(PIdx + 1).getReg() == ARM::VPR;			return PIdx != -1 && MI->getOperand(PIdx + 1).getReg() == ARM::VPR;
	}			}

	static bool isRegInClass(const MachineOperand &MO,			static bool isRegInClass(const MachineOperand &MO,
	const TargetRegisterClass *Class) {			const TargetRegisterClass *Class) {
	return MO.isReg() && MO.getReg() && Class->contains(MO.getReg());			return MO.isReg() && MO.getReg() && Class->contains(MO.getReg());
	}			}

				// Can this instruction generate a non-zero result when given only zeroed
				// operands? This allows us to know that, given operands with false bytes
				// zeroed by masked loads, that the result will also contain zeros in those
				// bytes.
				static bool canGenerateNonZeros(const MachineInstr &MI) {
				switch (MI.getOpcode()) {
				default:
				break;
				// FIXME: FP minus 0?
				//case ARM::MVE_VNEGf16:
				//case ARM::MVE_VNEGf32:
				case ARM::MVE_VMVN:
				case ARM::MVE_VORN:
				case ARM::MVE_VCLZs8:
				case ARM::MVE_VCLZs16:
				case ARM::MVE_VCLZs32:
				return true;
				}
				return false;
				}

				// MVE 'narrowing' instructions operate on half a lane, reading from
				// half and writing to half, which are referred to as the top and
				// bottom halves. The other half retains its previous value.
				static bool retainsPreviousHalf(const MachineInstr &MI) {
				const MCInstrDesc &MCID = MI.getDesc();
				uint64_t Flags = MCID.TSFlags;
				return (Flags & ARMII::RetainsPreviousHalf) != 0;
				}

				// Look at its register uses to see if it can only receive zeros into
				// its false lanes, which would then produce zeros. Also check that
				// the output register is defined by a FalseLaneZeros instruction
				// so that if tail-predication happens, the lanes that aren't updated will
				// still be zeros.
				static bool producesFalseLaneZeros(MachineInstr &MI,
				const TargetRegisterClass *QPRs,
				const ReachingDefAnalysis &RDA,
				InstSet &FalseLaneZeros) {
				if (canGenerateNonZeros(MI))
				return false;
				for (auto &MO : MI.operands()) {
				if (!MO.isReg() || !MO.getReg())
				continue;
				if (auto *OpDef = RDA.getMIOperand(&MI, MO))
				if (FalseLaneZeros.count(OpDef))
				continue;
				return false;
				}
				LLVM_DEBUG(dbgs() << "ARM Loops: Always False Zeros: " << MI);
				return true;
				}

	bool LowOverheadLoop::ValidateLiveOuts() const {			bool LowOverheadLoop::ValidateLiveOuts() const {
	// We want to find out if the tail-predicated version of this loop will			// We want to find out if the tail-predicated version of this loop will
	// produce the same values as the loop in its original form. For this to			// produce the same values as the loop in its original form. For this to
	// be true, the newly inserted implicit predication must not change the			// be true, the newly inserted implicit predication must not change the
	// (observable) results.			// (observable) results.
	// We're doing this because many instructions in the loop will not be			// We're doing this because many instructions in the loop will not be
	// predicated and so the conversion from VPT predication to tail-predication			// predicated and so the conversion from VPT predication to tail-predication
	// can result in different values being produced; due to the tail-predication			// can result in different values being produced; due to the tail-predication
	// preventing many instructions from updating their falsely predicated			// preventing many instructions from updating their falsely predicated
	// lanes. This analysis assumes that all the instructions perform lane-wise			// lanes. This analysis assumes that all the instructions perform lane-wise
	// operations and don't perform any exchanges.			// operations and don't perform any exchanges.
	// A masked load, whether through VPT or tail predication, will write zeros			// A masked load, whether through VPT or tail predication, will write zeros
	// to any of the falsely predicated bytes. So, from the loads, we know that			// to any of the falsely predicated bytes. So, from the loads, we know that
	// the false lanes are zeroed and here we're trying to track that those false			// the false lanes are zeroed and here we're trying to track that those false
	// lanes remain zero, or where they change, the differences are masked away			// lanes remain zero, or where they change, the differences are masked away
	// by their user(s).			// by their user(s).
	// All MVE loads and stores have to be predicated, so we know that any load			// All MVE loads and stores have to be predicated, so we know that any load
	// operands, or stored results are equivalent already. Other explicitly			// operands, or stored results are equivalent already. Other explicitly
	// predicated instructions will perform the same operation in the original			// predicated instructions will perform the same operation in the original
	// loop and the tail-predicated form too. Because of this, we can insert			// loop and the tail-predicated form too. Because of this, we can insert
	// loads, stores and other predicated instructions into our KnownFalseZeros			// loads, stores and other predicated instructions into our Predicated
	// set and build from there.			// set and build from there.
	const TargetRegisterClass *QPRs = TRI.getRegClass(ARM::MQPRRegClassID);			const TargetRegisterClass *QPRs = TRI.getRegClass(ARM::MQPRRegClassID);
	SetVector<MachineInstr *> UnknownFalseLanes;			SetVector<MachineInstr *> Unknown;
	SmallPtrSet<MachineInstr *, 4> KnownFalseZeros;			SmallPtrSet<MachineInstr *, 4> FalseLaneZeros;
				SmallPtrSet<MachineInstr *, 4> Predicated;
	MachineBasicBlock *MBB = ML.getHeader();			MachineBasicBlock *MBB = ML.getHeader();

	for (auto &MI : *MBB) {			for (auto &MI : *MBB) {
	const MCInstrDesc &MCID = MI.getDesc();			const MCInstrDesc &MCID = MI.getDesc();
	uint64_t Flags = MCID.TSFlags;			uint64_t Flags = MCID.TSFlags;
	if ((Flags & ARMII::DomainMask) != ARMII::DomainMVE)			if ((Flags & ARMII::DomainMask) != ARMII::DomainMVE)
	continue;			continue;

	if (isVectorPredicated(&MI)) {			if (isVectorPredicated(&MI)) {
	KnownFalseZeros.insert(&MI);			if (MI.mayLoad())
				FalseLaneZeros.insert(&MI);
				Predicated.insert(&MI);
	continue;			continue;
	}			}

	if (MI.getNumDefs() == 0)			if (MI.getNumDefs() == 0)
	continue;			continue;

	// Only evaluate instructions which produce a single value.			if (producesFalseLaneZeros(MI, QPRs, RDA, FalseLaneZeros))
	assert((MI.getNumDefs() == 1 && MI.defs().begin()->isReg()) &&			FalseLaneZeros.insert(&MI);
	"Expected no more than one register def");			else if (retainsPreviousHalf(MI))
				return false;
	Register DefReg = MI.defs().begin()->getReg();			else
	for (auto &MO : MI.operands()) {			Unknown.insert(&MI);
	if (!isRegInClass(MO, QPRs) || !MO.isUse() || MO.getReg() != DefReg)
	continue;

	// If this instruction overwrites one of its operands, and that register
	// has known lanes, then this instruction also has known predicated false
	// lanes.
	if (auto *OpDef = RDA.getMIOperand(&MI, MO)) {
	if (KnownFalseZeros.count(OpDef)) {
	KnownFalseZeros.insert(&MI);
	break;
	}
	}
	}
	if (!KnownFalseZeros.count(&MI))
	UnknownFalseLanes.insert(&MI);
	}			}

	auto HasKnownUsers = [this](MachineInstr *MI, const MachineOperand &MO,			auto HasPredicatedUsers = [this](MachineInstr *MI, const MachineOperand &MO,
	SmallPtrSetImpl<MachineInstr *> &Knowns) {			SmallPtrSetImpl<MachineInstr *> &Predicated) {
	SmallPtrSet<MachineInstr *, 2> Uses;			SmallPtrSet<MachineInstr *, 2> Uses;
	RDA.getGlobalUses(MI, MO.getReg(), Uses);			RDA.getGlobalUses(MI, MO.getReg(), Uses);
	for (auto *Use : Uses) {			for (auto *Use : Uses) {
	if (Use != MI && !Knowns.count(Use))			if (Use != MI && !Predicated.count(Use))
	return false;			return false;
	}			}
	return true;			return true;
	};			};

	// Now for all the unknown values, see if they're only consumed by known			// Visit the unknowns in reverse so that we can start at the values being
	// instructions. Visit in reverse so that we can start at the values being
	// stored and then we can work towards the leaves, hopefully adding more			// stored and then we can work towards the leaves, hopefully adding more
	// instructions to KnownFalseZeros.			// instructions to Predicated.
	for (auto *MI : reverse(UnknownFalseLanes)) {			for (auto *MI : reverse(Unknown)) {
	for (auto &MO : MI->operands()) {			for (auto &MO : MI->operands()) {
	if (!isRegInClass(MO, QPRs) || !MO.isDef())			if (!isRegInClass(MO, QPRs) || !MO.isDef())
	continue;			continue;
	if (!HasKnownUsers(MI, MO, KnownFalseZeros)) {			if (!HasPredicatedUsers(MI, MO, Predicated)) {
	LLVM_DEBUG(dbgs() << "ARM Loops: Found an unknown def of : "			LLVM_DEBUG(dbgs() << "ARM Loops: Found an unknown def of : "
	<< TRI.getRegAsmName(MO.getReg()) << " at " << *MI);			<< TRI.getRegAsmName(MO.getReg()) << " at " << *MI);
	return false;			return false;
	}			}
	}			}
	// Any unknown false lanes have been masked away by the user(s).			// Any unknown false lanes have been masked away by the user(s).
	KnownFalseZeros.insert(MI);			Predicated.insert(MI);
	}			}

	// Collect Q-regs that are live in the exit blocks. We don't collect scalars			// Collect Q-regs that are live in the exit blocks. We don't collect scalars
	// because they won't be affected by lane predication.			// because they won't be affected by lane predication.
	SmallSet<Register, 2> LiveOuts;			SmallSet<Register, 2> LiveOuts;
	SmallVector<MachineBasicBlock *, 2> ExitBlocks;			SmallVector<MachineBasicBlock *, 2> ExitBlocks;
	ML.getExitBlocks(ExitBlocks);			ML.getExitBlocks(ExitBlocks);
	for (auto *MBB : ExitBlocks)			for (auto *MBB : ExitBlocks)
	[... 492 lines not shown ...]
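
The reverse walk over the unknown instructions in ValidateLiveOuts can be sketched in isolation. This is a simplified model with hypothetical names (`allUnknownsMasked`, `Users`): instructions are plain indices, and the use lists are given directly instead of being computed by `ReachingDefAnalysis`.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Users[I] lists the instructions that read the value defined by
// instruction I. Unknown holds the unknown instructions in program order.
bool allUnknownsMasked(const std::vector<std::vector<int>> &Users,
                       const std::vector<int> &Unknown,
                       std::set<int> Predicated) {
  // Walk from the last unknown towards the first, so that a value feeding a
  // predicated store is accepted before the values that feed it in turn.
  for (auto It = Unknown.rbegin(); It != Unknown.rend(); ++It) {
    for (int User : Users[*It])
      if (User != *It && Predicated.count(User) == 0)
        return false; // some user would observe the unknown false lanes
    // Every user masks the false lanes away, so treat this one as known.
    Predicated.insert(*It);
  }
  return true;
}
```

Visiting in reverse matters for chains: with instruction 0 feeding 1 and 1 feeding a predicated store 2, instruction 1 is accepted first, which then lets instruction 0 be accepted as well.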

llvm/test/CodeGen/Thumb2/LowOverheadLoops/ctlz-non-zeros.mir

This file was added.

				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				# CHECK-NOT: LETP

				--- |
				define arm_aapcs_vfpcc void @test_ctlz_i8(<8 x i16>* %a, <8 x i16>* %b, <8 x i16>* %c, i32 %elts, i32 %iters) #0 {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <8 x i16>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <8 x i16>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <8 x i16>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <8 x i1> @llvm.arm.mve.vctp16(i32 %count)
				%elts.rem = sub i32 %count, 8
				%masked.load.a = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %addr.a, i32 2, <8 x i1> %pred, <8 x i16> undef)
				%masked.load.b = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %addr.b, i32 2, <8 x i1> %pred, <8 x i16> undef)
				%bitcast.a = bitcast <8 x i16> %masked.load.a to <16 x i8>
				%ctlz = call <16 x i8> @llvm.ctlz.v16i8(<16 x i8> %bitcast.a, i1 false)
				%shrn = call <16 x i8> @llvm.arm.mve.vshrn.v16i8.v8i16(<16 x i8> %ctlz, <8 x i16> %masked.load.b, i32 1, i32 1, i32 0, i32 1, i32 0, i32 1)
				%bitcast = bitcast <16 x i8> %shrn to <8 x i16>
				call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %bitcast, <8 x i16>* %addr.c, i32 2, <8 x i1> %pred)
				%addr.a.next = getelementptr <8 x i16>, <8 x i16>* %addr.a, i32 1
				%addr.b.next = getelementptr <8 x i16>, <8 x i16>* %addr.b, i32 1
				%addr.c.next = getelementptr <8 x i16>, <8 x i16>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				define arm_aapcs_vfpcc void @test_ctlz_i16(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %c, i32 %elts, i32 %iters) #0 {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <4 x i32>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <4 x i32>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <4 x i32>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <4 x i1> @llvm.arm.mve.vctp32(i32 %count)
				%elts.rem = sub i32 %count, 4
				%masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.a, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.b, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%bitcast.a = bitcast <4 x i32> %masked.load.a to <8 x i16>
				%ctlz = call <8 x i16> @llvm.ctlz.v8i16(<8 x i16> %bitcast.a, i1 false)
				%shrn = call <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16> %ctlz, <4 x i32> %masked.load.b, i32 3, i32 1, i32 0, i32 1, i32 0, i32 1)
				%bitcast = bitcast <8 x i16> %shrn to <4 x i32>
				call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %bitcast, <4 x i32>* %addr.c, i32 4, <4 x i1> %pred)
				%addr.a.next = getelementptr <4 x i32>, <4 x i32>* %addr.a, i32 1
				%addr.b.next = getelementptr <4 x i32>, <4 x i32>* %addr.b, i32 1
				%addr.c.next = getelementptr <4 x i32>, <4 x i32>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				define arm_aapcs_vfpcc void @test_ctlz_i32(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %c, i32 %elts, i32 %iters) #0 {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <4 x i32>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <4 x i32>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <4 x i32>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <4 x i1> @llvm.arm.mve.vctp32(i32 %count)
				%elts.rem = sub i32 %count, 4
				%masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.a, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.b, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%ctlz = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %masked.load.b, i1 false)
				%bitcast.a = bitcast <4 x i32> %masked.load.a to <8 x i16>
				%shrn = call <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16> %bitcast.a, <4 x i32> %ctlz, i32 3, i32 1, i32 0, i32 1, i32 0, i32 1)
				%bitcast = bitcast <8 x i16> %shrn to <4 x i32>
				call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %bitcast, <4 x i32>* %addr.c, i32 4, <4 x i1> %pred)
				%addr.a.next = getelementptr <4 x i32>, <4 x i32>* %addr.a, i32 1
				%addr.b.next = getelementptr <4 x i32>, <4 x i32>* %addr.b, i32 1
				%addr.c.next = getelementptr <4 x i32>, <4 x i32>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>, i1 immarg)
				declare <8 x i16> @llvm.ctlz.v8i16(<8 x i16>, i1 immarg)
				declare <16 x i8> @llvm.ctlz.v16i8(<16 x i8>, i1 immarg)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
				declare <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16>, <4 x i32>, i32, i32, i32, i32, i32, i32)
				declare <8 x i1> @llvm.arm.mve.vctp16(i32)
				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
				declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
				declare <16 x i8> @llvm.arm.mve.vshrn.v16i8.v8i16(<16 x i8>, <8 x i16>, i32, i32, i32, i32, i32, i32)

				...
				---
				name: test_ctlz_i8
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r12 = t2LDRi12 $sp, 8, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r12
				$r4 = tMOVr killed $r12, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r4

				renamable $vpr = MVE_VCTP16 renamable $r3, 0, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r1, renamable $q0 = MVE_VLDRHU16_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 2)
				renamable $q1 = MVE_VLDRHU16 killed renamable $r0, 0, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 2)
				$lr = tMOVr $r4, 14 /* CC::al */, $noreg
				renamable $r4, dead $cpsr = tSUBi8 killed $r4, 1, 14 /* CC::al */, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 8, 14 /* CC::al */, $noreg
				renamable $q1 = MVE_VCLZs8 killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				$r0 = tMOVr $r1, 14 /* CC::al */, $noreg
				renamable $q1 = MVE_VQSHRUNs16th killed renamable $q1, killed renamable $q0, 1, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRHU16_post killed renamable $q1, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 2)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...
				---
				name: test_ctlz_i16
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r4
				$r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r12

				renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				$lr = tMOVr $r12, 14 /* CC::al */, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r1, renamable $q0 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				renamable $r0, renamable $q1 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				renamable $q1 = MVE_VCLZs16 killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q1 = MVE_VQSHRUNs32th killed renamable $q1, killed renamable $q0, 3, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRWU32_post killed renamable $q1, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...
				---
				name: test_ctlz_i32
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r4
				$r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r12

				renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				$lr = tMOVr $r12, 14 /* CC::al */, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				renamable $r1, renamable $q1 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				renamable $q1 = MVE_VCLZs32 killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VQSHRUNs32th killed renamable $q0, killed renamable $q1, 3, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRWU32_post killed renamable $q0, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...

llvm/test/CodeGen/Thumb2/LowOverheadLoops/safe-retaining.mir

This file was added.
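
The narrowing instruction exercised in this file, MVE_VQSHRUNs32th, writes only half of each destination lane and retains the other half. This is why the summary asks that such instructions only use and def registers whose falsely predicated bytes are zero. The block below is a rough scalar model of that behaviour, not code from the patch. The names `Vec4`, `satU16` and `narrowTop` are hypothetical, and both the reading of the `th` suffix as "write the top half" and the saturation details are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a vector register holding four 32-bit lanes.
using Vec4 = std::array<std::uint32_t, 4>;

// Saturate a non-negative value to the unsigned 16-bit range.
inline std::uint32_t satU16(std::uint64_t v) {
  return v > 0xFFFFu ? 0xFFFFu : static_cast<std::uint32_t>(v);
}

// Assumed semantics of a "top" saturating narrowing shift: for each lane,
// treat the source as signed, shift it right, saturate to unsigned 16 bits,
// write that to the top half of the destination lane, and keep the bottom
// half of the destination unchanged.
inline Vec4 narrowTop(Vec4 dst, const Vec4 &src, unsigned shift) {
  assert(shift >= 1 && shift <= 16 && "shift assumed to be in [1, 16]");
  for (std::size_t i = 0; i < dst.size(); ++i) {
    const std::int64_t s = static_cast<std::int32_t>(src[i]);
    // Negative inputs saturate to zero; otherwise shift, then saturate.
    const std::uint32_t narrowed =
        s < 0 ? 0u : satU16(static_cast<std::uint64_t>(s) >> shift);
    dst[i] = (narrowed << 16) | (dst[i] & 0xFFFFu);
  }
  return dst;
}
```

In this model a lane stays zero only when both the source lane and the retained half of the destination lane are zero, which matches the condition the summary describes for the tests in this file.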

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define arm_aapcs_vfpcc void @test_vqrshruntq_n_s32(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %c, i32 %elts, i32 %iters) {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <4 x i32>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <4 x i32>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <4 x i32>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <4 x i1> @llvm.arm.mve.vctp32(i32 %count)
				%elts.rem = sub i32 %count, 4
				%masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.a, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.b, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%bitcast.a = bitcast <4 x i32> %masked.load.a to <8 x i16>
				%shrn = call <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16> %bitcast.a, <4 x i32> %masked.load.b, i32 3, i32 1, i32 0, i32 1, i32 0, i32 1)
				%bitcast = bitcast <8 x i16> %shrn to <4 x i32>
				call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %bitcast, <4 x i32>* %addr.c, i32 4, <4 x i1> %pred)
				%addr.a.next = getelementptr <4 x i32>, <4 x i32>* %addr.a, i32 1
				%addr.b.next = getelementptr <4 x i32>, <4 x i32>* %addr.b, i32 1
				%addr.c.next = getelementptr <4 x i32>, <4 x i32>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				define arm_aapcs_vfpcc void @test_vqrshruntq_n_s16(<8 x i16>* %a, <8 x i16>* %b, <8 x i16>* %c, i32 %elts, i32 %iters) {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <8 x i16>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <8 x i16>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <8 x i16>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <8 x i1> @llvm.arm.mve.vctp16(i32 %count)
				%elts.rem = sub i32 %count, 8
				%masked.load.a = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %addr.a, i32 2, <8 x i1> %pred, <8 x i16> undef)
				%masked.load.b = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %addr.b, i32 2, <8 x i1> %pred, <8 x i16> undef)
				%bitcast.a = bitcast <8 x i16> %masked.load.a to <16 x i8>
				%shrn = call <16 x i8> @llvm.arm.mve.vshrn.v16i8.v8i16(<16 x i8> %bitcast.a, <8 x i16> %masked.load.b, i32 1, i32 1, i32 0, i32 1, i32 0, i32 1)
				%bitcast = bitcast <16 x i8> %shrn to <8 x i16>
				call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %bitcast, <8 x i16>* %addr.c, i32 2, <8 x i1> %pred)
				%addr.a.next = getelementptr <8 x i16>, <8 x i16>* %addr.b, i32 1
				%addr.b.next = getelementptr <8 x i16>, <8 x i16>* %addr.b, i32 1
				%addr.c.next = getelementptr <8 x i16>, <8 x i16>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
				declare <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16>, <4 x i32>, i32, i32, i32, i32, i32, i32)
				declare <8 x i1> @llvm.arm.mve.vctp16(i32)
				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
				declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
				declare <16 x i8> @llvm.arm.mve.vshrn.v16i8.v8i16(<16 x i8>, <8 x i16>, i32, i32, i32, i32, i32, i32)

				...
				---
				name: test_vqrshruntq_n_s32
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				restorePoint: ''
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_vqrshruntq_n_s32
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2, $r3, $r4
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -8
				; CHECK: tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2IT 11, 8, implicit-def $itstate
				; CHECK: tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				; CHECK: renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				; CHECK: dead $lr = MVE_DLSTP_32 killed renamable $r3
				; CHECK: $r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg
				; CHECK: bb.1.loop.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $r0, $r1, $r2, $r12
				; CHECK: $lr = tMOVr $r12, 14 /* CC::al */, $noreg
				; CHECK: renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r1, renamable $q0 = MVE_VLDRWU32_post killed renamable $r1, 16, 0, $noreg :: (load 16 from %ir.addr.b, align 4)
				; CHECK: renamable $r0, renamable $q1 = MVE_VLDRWU32_post killed renamable $r0, 16, 0, $noreg :: (load 16 from %ir.addr.a, align 4)
				; CHECK: renamable $q1 = MVE_VQSHRUNs32th killed renamable $q1, killed renamable $q0, 3, 0, $noreg
				; CHECK: renamable $r2 = MVE_VSTRWU32_post killed renamable $q1, killed renamable $r2, 16, 0, killed $noreg :: (store 16 into %ir.addr.c, align 4)
				; CHECK: dead $lr = MVE_LETP killed renamable $lr, %bb.1
				; CHECK: bb.2.exit:
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r4
				$r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r12

				renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				$lr = tMOVr $r12, 14 /* CC::al */, $noreg
				renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r1, renamable $q0 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				renamable $r0, renamable $q1 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q1 = MVE_VQSHRUNs32th killed renamable $q1, killed renamable $q0, 3, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRWU32_post killed renamable $q1, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...
				---
				name: test_vqrshruntq_n_s16
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_vqrshruntq_n_s16
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2, $r3, $r4
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -8
				; CHECK: tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2IT 11, 8, implicit-def $itstate
				; CHECK: tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				; CHECK: renamable $r12 = t2LDRi12 $sp, 8, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				; CHECK: dead $lr = MVE_DLSTP_16 killed renamable $r3
				; CHECK: $r4 = tMOVr killed $r12, 14 /* CC::al */, $noreg
				; CHECK: bb.1.loop.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $r0, $r1, $r2, $r4
				; CHECK: $lr = tMOVr $r4, 14 /* CC::al */, $noreg
				; CHECK: renamable $r4, dead $cpsr = tSUBi8 killed $r4, 1, 14 /* CC::al */, $noreg
				; CHECK: renamable $r1, renamable $q0 = MVE_VLDRHU16_post killed renamable $r1, 16, 0, $noreg :: (load 16 from %ir.addr.b, align 2)
				; CHECK: renamable $q1 = MVE_VLDRHU16 killed renamable $r0, 0, 0, $noreg :: (load 16 from %ir.addr.a, align 2)
				; CHECK: $r0 = tMOVr $r1, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VQSHRUNs16th killed renamable $q1, killed renamable $q0, 1, 0, $noreg
				; CHECK: renamable $r2 = MVE_VSTRHU16_post killed renamable $q1, killed renamable $r2, 16, 0, killed $noreg :: (store 16 into %ir.addr.c, align 2)
				; CHECK: dead $lr = MVE_LETP killed renamable $lr, %bb.1
				; CHECK: bb.2.exit:
				; CHECK: tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r12 = t2LDRi12 $sp, 8, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r12
				$r4 = tMOVr killed $r12, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r4

				renamable $vpr = MVE_VCTP16 renamable $r3, 0, $noreg
				$lr = tMOVr $r4, 14 /* CC::al */, $noreg
				renamable $r4, dead $cpsr = tSUBi8 killed $r4, 1, 14 /* CC::al */, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 8, 14 /* CC::al */, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r1, renamable $q0 = MVE_VLDRHU16_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 2)
				renamable $q1 = MVE_VLDRHU16 killed renamable $r0, 0, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 2)
				renamable $lr = t2LoopDec killed renamable $lr, 1
				$r0 = tMOVr $r1, 14 /* CC::al */, $noreg
				renamable $q1 = MVE_VQSHRUNs16th killed renamable $q1, killed renamable $q0, 1, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRHU16_post killed renamable $q1, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 2)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...

llvm/test/CodeGen/Thumb2/LowOverheadLoops/unsafe-retaining.mir

This file was added.
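
The tests in this diff exercise VMVN, VORN and VCLZ ahead of a narrowing instruction. What these operations have in common is that an all-zero input lane does not give an all-zero output lane. The block below is a scalar illustration of that arithmetic only. The function names are hypothetical and nothing here is taken from ARMLowOverheadLoops.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Scalar analogue of VMVN: bitwise NOT.
inline std::uint32_t bitNot(std::uint32_t x) { return ~x; }

// Scalar analogue of VORN: OR with the complement of the second operand.
inline std::uint32_t orNot(std::uint32_t a, std::uint32_t b) { return a | ~b; }

// Scalar analogue of VCLZ on a 32-bit element: number of leading zero bits.
// A zero input yields 32, which is itself non-zero.
inline unsigned countLeadingZeros(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// For contrast, XOR of two zero lanes stays zero.
inline std::uint32_t bitXor(std::uint32_t a, std::uint32_t b) { return a ^ b; }

// Small self-check of the zero-input behaviour described above.
inline void demo() {
  assert(bitNot(0u) != 0u);
  assert(orNot(0u, 0u) != 0u);
  assert(countLeadingZeros(0u) != 0u);
  assert(bitXor(0u, 0u) == 0u);
}
```

Under these assumptions, a lane that was zeroed by a predicated load can become non-zero again once it passes through one of the first three operations unpredicated.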

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define arm_aapcs_vfpcc void @test_vmvn(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %c, i32 %elts, i32 %iters) #0 {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <4 x i32>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <4 x i32>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <4 x i32>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <4 x i1> @llvm.arm.mve.vctp32(i32 %count)
				%elts.rem = sub i32 %count, 4
				%masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.a, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.b, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%not = xor <4 x i32> %masked.load.b, <i32 -1, i32 -1, i32 -1, i32 -1>
				%bitcast.a = bitcast <4 x i32> %masked.load.a to <8 x i16>
				%shrn = call <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16> %bitcast.a, <4 x i32> %not, i32 15, i32 1, i32 0, i32 0, i32 0, i32 0)
				%bitcast = bitcast <8 x i16> %shrn to <4 x i32>
				call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %bitcast, <4 x i32>* %addr.c, i32 4, <4 x i1> %pred)
				%addr.a.next = getelementptr <4 x i32>, <4 x i32>* %addr.a, i32 1
				%addr.b.next = getelementptr <4 x i32>, <4 x i32>* %addr.b, i32 1
				%addr.c.next = getelementptr <4 x i32>, <4 x i32>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				define arm_aapcs_vfpcc void @test_vorn(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %c, i32 %elts, i32 %iters) #0 {
				entry:
				%cmp = icmp slt i32 %elts, 1
				br i1 %cmp, label %exit, label %loop.ph

				loop.ph: ; preds = %entry
				call void @llvm.set.loop.iterations.i32(i32 %iters)
				br label %loop.body

				loop.body: ; preds = %loop.body, %loop.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %loop.body ], [ %iters, %loop.ph ]
				%count = phi i32 [ %elts, %loop.ph ], [ %elts.rem, %loop.body ]
				%addr.a = phi <4 x i32>* [ %a, %loop.ph ], [ %addr.a.next, %loop.body ]
				%addr.b = phi <4 x i32>* [ %b, %loop.ph ], [ %addr.b.next, %loop.body ]
				%addr.c = phi <4 x i32>* [ %c, %loop.ph ], [ %addr.c.next, %loop.body ]
				%pred = call <4 x i1> @llvm.arm.mve.vctp32(i32 %count)
				%elts.rem = sub i32 %count, 4
				%masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.a, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr.b, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%not = xor <4 x i32> %masked.load.b, <i32 -1, i32 -1, i32 -1, i32 -1>
				%or = or <4 x i32> %not, %masked.load.a
				%bitcast.a = bitcast <4 x i32> %masked.load.a to <8 x i16>
				%shrn = call <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16> %bitcast.a, <4 x i32> %or, i32 3, i32 1, i32 0, i32 1, i32 0, i32 1)
				%bitcast = bitcast <8 x i16> %shrn to <4 x i32>
				call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %bitcast, <4 x i32>* %addr.c, i32 4, <4 x i1> %pred)
				%addr.a.next = getelementptr <4 x i32>, <4 x i32>* %addr.a, i32 1
				%addr.b.next = getelementptr <4 x i32>, <4 x i32>* %addr.b, i32 1
				%addr.c.next = getelementptr <4 x i32>, <4 x i32>* %addr.c, i32 1
				%loop.dec = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%end = icmp ne i32 %loop.dec, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %end, label %loop.body, label %exit

				exit: ; preds = %loop.body, %entry
				ret void
				}

				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.arm.mve.vctp32(i32)
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
				declare <8 x i16> @llvm.arm.mve.vshrn.v8i16.v4i32(<8 x i16>, <4 x i32>, i32, i32, i32, i32, i32, i32)

				...
				---
				name: test_vmvn
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_vmvn
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2, $r3, $r4
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -8
				; CHECK: tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2IT 11, 8, implicit-def $itstate
				; CHECK: frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				; CHECK: renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				; CHECK: dead $lr = t2DLS renamable $r4
				; CHECK: $r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg
				; CHECK: bb.1.loop.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r12
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				; CHECK: $lr = tMOVr $r12, 14 /* CC::al */, $noreg
				; CHECK: MVE_VPST 4, implicit $vpr
				; CHECK: renamable $r0, renamable $q0 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				; CHECK: renamable $r1, renamable $q1 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				; CHECK: renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $q1 = MVE_VMVN killed renamable $q1, 0, $noreg, undef renamable $q1
				; CHECK: renamable $q0 = MVE_VQSHRNbhs32 killed renamable $q0, killed renamable $q1, 15, 0, $noreg
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r2 = MVE_VSTRWU32_post killed renamable $q0, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				; CHECK: dead $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.exit:
				; CHECK: frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r4
				$r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r12

				renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				$lr = tMOVr $r12, 14 /* CC::al */, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r0, renamable $q0 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				renamable $r1, renamable $q1 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				renamable $q1 = MVE_VMVN killed renamable $q1, 0, $noreg, undef renamable $q1
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q0 = MVE_VQSHRNbhs32 killed renamable $q0, killed renamable $q1, 15, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRWU32_post killed renamable $q0, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...
				---
				name: test_vorn
				alignment: 2
				tracksRegLiveness: true
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				stackSize: 8
				offsetAdjustment: 0
				maxAlignment: 4
				fixedStack:
				- { id: 0, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: test_vorn
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x80000000)
				; CHECK: liveins: $lr, $r0, $r1, $r2, $r3, $r4
				; CHECK: frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 8
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -8
				; CHECK: tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2IT 11, 8, implicit-def $itstate
				; CHECK: frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				; CHECK: renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				; CHECK: dead $lr = t2DLS renamable $r4
				; CHECK: $r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg
				; CHECK: bb.1.loop.body:
				; CHECK: successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r12
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				; CHECK: MVE_VPST 4, implicit $vpr
				; CHECK: renamable $r1, renamable $q0 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				; CHECK: renamable $r0, renamable $q1 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				; CHECK: $lr = tMOVr $r12, 14 /* CC::al */, $noreg
				; CHECK: renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				; CHECK: renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				; CHECK: renamable $q0 = MVE_VORN renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q1 = MVE_VQSHRUNs32th killed renamable $q1, killed renamable $q0, 3, 0, $noreg
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r2 = MVE_VSTRWU32_post killed renamable $q1, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				; CHECK: dead $lr = t2LEUpdate killed renamable $lr, %bb.1
				; CHECK: bb.2.exit:
				; CHECK: frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc
				bb.0.entry:
				successors: %bb.1(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $lr

				frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 8
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r4, -8
				tCMPi8 renamable $r3, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2IT 11, 8, implicit-def $itstate
				frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r4, def $pc, implicit killed $itstate
				renamable $r4 = tLDRspi $sp, 2, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
				t2DoLoopStart renamable $r4
				$r12 = tMOVr killed $r4, 14 /* CC::al */, $noreg

				bb.1.loop.body:
				successors: %bb.1(0x7c000000), %bb.2(0x04000000)
				liveins: $r0, $r1, $r2, $r3, $r12

				renamable $vpr = MVE_VCTP32 renamable $r3, 0, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $r1, renamable $q0 = MVE_VLDRWU32_post killed renamable $r1, 16, 1, renamable $vpr :: (load 16 from %ir.addr.b, align 4)
				renamable $r0, renamable $q1 = MVE_VLDRWU32_post killed renamable $r0, 16, 1, renamable $vpr :: (load 16 from %ir.addr.a, align 4)
				$lr = tMOVr $r12, 14 /* CC::al */, $noreg
				renamable $r12 = t2SUBri killed $r12, 1, 14 /* CC::al */, $noreg, $noreg
				renamable $r3, dead $cpsr = tSUBi8 killed renamable $r3, 4, 14 /* CC::al */, $noreg
				renamable $q0 = MVE_VORN renamable $q1, killed renamable $q0, 0, $noreg, undef renamable $q0
				renamable $lr = t2LoopDec killed renamable $lr, 1
				renamable $q1 = MVE_VQSHRUNs32th killed renamable $q1, killed renamable $q0, 3, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r2 = MVE_VSTRWU32_post killed renamable $q1, killed renamable $r2, 16, 1, killed renamable $vpr :: (store 16 into %ir.addr.c, align 4)
				t2LoopEnd killed renamable $lr, %bb.1, implicit-def dead $cpsr
				tB %bb.2, 14 /* CC::al */, $noreg

				bb.2.exit:
				frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $pc

				...