This is an archive of the discontinued LLVM Phabricator instance.

Differential D72509

[ARM][LowOverheadLoops] Allow all MVE instrs.
ClosedPublic

Authored by samparker on Jan 10 2020, 7:44 AM.

Details

Reviewers

SjoerdMeijer
dmgreen

Commits

rGe27632c30263: [ARM][LowOverheadLoops] Allow all MVE instrs.

Summary

We have a whitelist of instructions that we allow when tail predicating, since these are trivial ones that we've deemed need no special handling. Now change ARMLowOverheadLoops to also allow the non-trivial instructions if they're contained within a valid VPT block. A valid block is one that is predicated upon the VCTP, so we know that these non-trivial instructions will still behave as expected once the implicit predication is used instead.
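
As a rough illustration of the rule described in this summary, the following C++ sketch models the decision with a toy instruction record. `ToyInst`, its fields, and both helper functions are hypothetical names invented for this example; the actual pass works on `MachineInstr` and `TSFlags`, and validates the predicates of VPT blocks in a separate step.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for a machine instruction. The real pass
// inspects MachineInstr operands and TSFlags instead of plain booleans.
struct ToyInst {
  bool IsMVE;                   // instruction is in the MVE domain
  bool ValidForTailPredication; // instruction is on the whitelist
  bool UsesVPR;                 // instruction is predicated (inside a VPT block)
  bool BlockPredicatedOnVCTP;   // assumption: the enclosing block derives from the VCTP
};

// Returns true if this instruction alone does not prevent tail predication.
inline bool allowsTailPredication(const ToyInst &I) {
  if (!I.IsMVE)
    return true; // non-MVE instructions are not affected by this check
  if (I.ValidForTailPredication)
    return true; // trivial, whitelisted instruction
  // Non-trivial instruction: only fine when predicated within a valid block.
  return I.UsesVPR && I.BlockPredicatedOnVCTP;
}

// A loop body qualifies only if every instruction passes the check.
inline bool loopCanTailPredicate(const std::vector<ToyInst> &Body) {
  for (const ToyInst &I : Body)
    if (!allowsTailPredication(I))
      return false;
  return true;
}
```

In this model a non-whitelisted MVE instruction outside any VPT block is enough to reject the whole loop, which mirrors how `CannotTailPredicate` is sticky in the patch.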

Event Timeline

samparker created this revision. Jan 10 2020, 7:44 AM

Herald added a project: Restricted Project. Jan 10 2020, 7:44 AM

Herald added subscribers: hiraditya, kristof.beyls.

samparker added a parent revision: D72504: [ARM][LowOverheadLoops] Change predicate inspection. Jan 10 2020, 7:44 AM

SjoerdMeijer: LGTM, with just a nit inlined.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
479	nit: how about `RecordAllowedMVEInsts` to do a bit more justice to what this function is doing?

This revision is now accepted and ready to land. Jan 13 2020, 5:34 AM

dmgreen: This looks sensible, using the VPT blocks to know that the instructions are predicated on something that makes tail predication valid.

I had a couple of questions about some of the existing details. The VPSEL question is the only one really relevant here; the others are probably best looked at elsewhere (or they may be handled already in here somewhere, or I may just be thinking about them wrongly). They shouldn't block this improvement.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
525	What if we had something like a `VMOV s0 s1`? I know that's not anything to do with this patch, but is that outlawed anywhere at the moment?
530	What about VPNOT or VPSEL? They would use both the vpr and the tail predicate to different extents. VPSEL we seem to mark as IsValidForTailPredication, which I'm not sure about. We need to make sure it's using the "same" predicate as before (and making sure it doesn't just use whatever we had last put there, if we delete the vctp!)
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
174	Another unrelated one, but there is nothing in the instructions of this loop that says that this is predicated on a hardware register, right? Or that it has unmodelled side effects? The refs to vpr and the vctp are removed but no other ref is added.

samparker marked 3 inline comments as done. Jan 13 2020, 6:59 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
525	It's not outlawed; why do we need to be concerned about it?
530	Good point. Yes, it looks like the problem is that we're only recording VPT blocks whereas, as you say, there are other instructions that also need handling. It looks like VPSEL/VPNOT would currently be stuffed into the previous VPT block, even if they're not in one. This will mean that the predicates are checked for correctness, but it could also cause an assert! I'll need to put up a different patch for that change.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
174	Yes, at this point it's completely implicit, which I don't think is a problem currently. Maybe we could add a new register and a predicate def for LSTP and LETP, but I have no idea how invasive that would be for the predicate users.

dmgreen added inline comments. Jan 13 2020, 7:18 AM

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
525	If we did `VMOV s7, s0; VMOV s6, s1; VMOV s5, s2; VMOV s4, s3`, that would be a reverse shuffle of an i32 vector. I would presume that would be trouble for tail predication as it can no longer really be sure about what lanes are and are not predicated.
530	Sounds good. I think they are both currently outside of IT blocks. We should be folding VPNOT into blocks where we can (creating else's), but are not yet. The same thing could be done for VPSEL in the future, turning it into a VMOVt.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
174	It may be possible to add something similar to VPR to the MVE_P base of all MVE instructions. There might be some differences in register orders though. Like you said, this is very late so might not be a problem currently. It's best not to be implicit if we can help it though.
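
To illustrate the reverse-shuffle concern raised for line 525, here is a small C++ sketch. It models the VCTP32 predicate as the lowest min(n, 4) lanes being active and counts how many active lanes would receive data from inactive lanes after a full lane reversal. The names `vctp32Mask` and `activeLanesFedByInactive` are hypothetical and not part of LLVM; the sketch only models lane bookkeeping, not what the pass does.

```cpp
#include <array>
#include <cstdint>

using LaneMask = std::array<bool, 4>;

// Model of the VCTP32 predicate: the lowest min(Remaining, 4) lanes are active.
inline LaneMask vctp32Mask(std::uint32_t Remaining) {
  LaneMask Mask{};
  for (std::uint32_t Lane = 0; Lane < 4; ++Lane)
    Mask[Lane] = Lane < Remaining;
  return Mask;
}

// After reversing the four 32-bit lanes (destination lane D reads source lane
// 3 - D), count the active destination lanes whose data came from a source
// lane that the tail predicate treats as inactive.
inline int activeLanesFedByInactive(std::uint32_t Remaining) {
  const LaneMask Mask = vctp32Mask(Remaining);
  int Count = 0;
  for (int Dst = 0; Dst < 4; ++Dst) {
    const int Src = 3 - Dst;
    if (Mask[Dst] && !Mask[Src])
      ++Count;
  }
  return Count;
}
```

With a full vector (four or more remaining elements) the count is zero, so the problem can only show up in the final, partial iteration.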

samparker added a child revision: D72629: [ARM][MVE] Disallow VPSEL for tail predication. Jan 13 2020, 9:12 AM

dmgreen: LGTM by the way. I hadn't meant to tread on Sjoerd's toes; I didn't see he was taking a look already. The issues were mostly unrelated, and this looks like a nice improvement on its own.

samparker marked an inline comment as done. Jan 14 2020, 1:45 AM

samparker added inline comments.

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
525	The conversion has to be about predication conversion, just making sure that it is equivalent, so these shouldn't be a problem. What needs to happen, though, is checking for values live in and out, and then checking whether these scalar regs are aliasing the Q regs. I'm currently working on this.
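
The aliasing check mentioned here relies on the register file layout, in which each MVE Q register overlays four consecutive S registers (q0 covers s0 to s3, q1 covers s4 to s7, and so on up to q7). A minimal C++ sketch of that mapping follows; `QLane` and `sRegToQLane` are hypothetical names for illustration and do not come from LLVM.

```cpp
#include <cassert>

// Position of a 32-bit S register inside the Q register file.
struct QLane {
  unsigned Q;    // index of the aliased Q register (0..7)
  unsigned Lane; // 32-bit lane within that Q register (0..3)
};

// Map S register index S (0..31) to the Q register and lane it aliases.
inline QLane sRegToQLane(unsigned S) {
  assert(S < 32 && "only s0..s31 alias q0..q7");
  return QLane{S / 4, S % 4};
}
```

Under this mapping, `VMOV s7, s0` copies lane 0 of q0 into lane 3 of q1, which is why a scalar move can silently change vector lanes.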

Closed by commit rGe27632c30263: [ARM][LowOverheadLoops] Allow all MVE instrs. (authored by samparker). Jan 14 2020, 4:10 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp	44 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir	255 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir	239 lines

Diff 237324

llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp

@@ 132 lines not shown @@ struct LowOverheadLoop {
   SmallVector<VPTBlock, 4> VPTBlocks;
   bool Revert = false;
   bool CannotTailPredicate = false;

   LowOverheadLoop(MachineLoop *ML) : ML(ML) {
     MF = ML->getHeader()->getParent();
   }

-  bool RecordVPTBlocks(MachineInstr *MI);
-
   // If this is an MVE instruction, check that we know how to use tail
-  // predication with it.
-  void AnalyseMVEInst(MachineInstr *MI) {
-    if (CannotTailPredicate)
-      return;
-
-    if (!RecordVPTBlocks(MI)) {
-      CannotTailPredicate = true;
-      return;
-    }
-
-    const MCInstrDesc &MCID = MI->getDesc();
-    uint64_t Flags = MCID.TSFlags;
-    if ((Flags & ARMII::DomainMask) != ARMII::DomainMVE)
-      return;
-
-    if ((Flags & ARMII::ValidForTailPredication) == 0) {
-      LLVM_DEBUG(dbgs() << "ARM Loops: Can't tail predicate: " << *MI);
-      CannotTailPredicate = true;
-    }
-  }
+  // predication with it. Record VPT blocks and return whether the
+  // instruction is valid for tail predication.
+  bool RecordMVEInsts(MachineInstr *MI);

   bool IsTailPredicationLegal() const {
     // For now, let's keep things really simple and only support a single
     // block for tail predication.
     return !Revert && FoundAllComponents() && VCTP &&
            !CannotTailPredicate && ML->getNumBlocks() == 1;
   }

@@ 318 lines not shown @@ void LowOverheadLoop::CheckLegality(ARMBasicBlockUtils *BBUtils,

   assert(ML->getBlocks().size() == 1 &&
          "Shouldn't be processing a loop with more than one block");
   CannotTailPredicate = !ValidateTailPredicate(InsertPt, RDA, MLI);
   LLVM_DEBUG(if (CannotTailPredicate)
              dbgs() << "ARM Loops: Couldn't validate tail predicate.\n");
 }

-bool LowOverheadLoop::RecordVPTBlocks(MachineInstr* MI) {
+bool LowOverheadLoop::RecordMVEInsts(MachineInstr* MI) {
+  if (CannotTailPredicate)
+    return false;
+
   // Only support a single vctp.
   if (isVCTP(MI) && VCTP)
     return false;

   // Start a new vpt block when we discover a vpt.
   if (MI->getOpcode() == ARM::MVE_VPST) {
     VPTBlocks.emplace_back(MI, CurrentPredicate);
     CurrentBlock = &VPTBlocks.back();
@@ 24 lines not shown @@ bool LowOverheadLoop::RecordMVEInsts(MachineInstr* MI) {
   // If we find a vpr def that is not already predicated on the vctp, we've
   // got disjoint predicates that may not be equivalent when we do the
   // conversion.
   if (IsDef && !IsUse && VCTP && !isVCTP(MI)) {
     LLVM_DEBUG(dbgs() << "ARM Loops: Found disjoint vpr def: " << *MI);
     return false;
   }

+  uint64_t Flags = MCID.TSFlags;
+  if ((Flags & ARMII::DomainMask) != ARMII::DomainMVE)
+    return true;
+
+  // If we find an instruction that has been marked as not valid for tail
+  // predication, only allow the instruction if it's contained within a valid
+  // VPT block.
+  if ((Flags & ARMII::ValidForTailPredication) == 0 && !IsUse) {
+    LLVM_DEBUG(dbgs() << "ARM Loops: Can't tail predicate: " << *MI);
+    CannotTailPredicate = true;
+  }
+
   return true;
 }

 bool ARMLowOverheadLoops::runOnMachineFunction(MachineFunction &mf) {
   const ARMSubtarget &ST = static_cast<const ARMSubtarget&>(mf.getSubtarget());
   if (!ST.hasLOB())
     return false;

@@ 75 lines not shown @@ for (auto &MI : *MBB) {
           // TODO: Though the call will require LE to execute again, does this
           // mean we should revert? Always executing LE hopefully should be
           // faster than performing a sub,cmp,br or even subs,br.
           LoLoop.Revert = true;
           LLVM_DEBUG(dbgs() << "ARM Loops: Found call.\n");
         } else {
           // Record VPR defs and build up their corresponding vpt blocks.
           // Check we know how to tail predicate any mve instructions.
-          LoLoop.AnalyseMVEInst(&MI);
+          LoLoop.RecordMVEInsts(&MI);
         }

         // We need to ensure that LR is not used or defined inbetween LoopDec and
         // LoopEnd.
         if (!LoLoop.Dec || LoLoop.End || LoLoop.Revert)
           continue;

         // If we find that LR has been written or read between LoopDec and
@@ 407 lines not shown @@

llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define hidden i32 @max_min_add_reduce(i8* %input_1_vect, i8* %input_2_vect, i32 %input_1_offset, i32 %input_2_offset, i32* %output, i32 %out_offset, i32 %out_mult, i32 %out_shift, i32 %out_activation_min, i32 %out_activation_max, i32 %block_size) local_unnamed_addr #0 {
				entry:
				%add = add i32 %block_size, 3
				%div = lshr i32 %add, 2
				%0 = call i1 @llvm.test.set.loop.iterations.i32(i32 %div)
				br i1 %0, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				%.splatinsert.i41 = insertelement <4 x i32> undef, i32 %out_activation_min, i32 0
				%.splat.i42 = shufflevector <4 x i32> %.splatinsert.i41, <4 x i32> undef, <4 x i32> zeroinitializer
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %out_activation_max, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				%scevgep = getelementptr i32, i32* %output, i32 -1
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret i32 0

				for.body: ; preds = %for.body, %for.body.lr.ph
				%lsr.iv3 = phi i32 [ %lsr.iv.next, %for.body ], [ %div, %for.body.lr.ph ]
				%lsr.iv = phi i32* [ %scevgep1, %for.body ], [ %scevgep, %for.body.lr.ph ]
				%input_1_vect.addr.052 = phi i8* [ %input_1_vect, %for.body.lr.ph ], [ %add.ptr, %for.body ]
				%input_2_vect.addr.051 = phi i8* [ %input_2_vect, %for.body.lr.ph ], [ %add.ptr14, %for.body ]
				%num_elements.049 = phi i32 [ %block_size, %for.body.lr.ph ], [ %sub, %for.body ]
				%input_2_cast = bitcast i8* %input_2_vect.addr.051 to <4 x i32>*
				%input_1_cast = bitcast i8* %input_1_vect.addr.052 to <4 x i32>*
				%scevgep2 = getelementptr i32, i32* %lsr.iv, i32 1
				%pred = tail call <4 x i1> @llvm.arm.mve.vctp32(i32 %num_elements.049)
				%load.1 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %input_1_cast, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%insert.input_1_offset = insertelement <4 x i32> undef, i32 %input_1_offset, i32 0
				%splat.input_1_offset = shufflevector <4 x i32> %insert.input_1_offset, <4 x i32> undef, <4 x i32> zeroinitializer
				%insert.input_2_offset = insertelement <4 x i32> undef, i32 %input_2_offset, i32 0
				%splat.input_2_offset = shufflevector <4 x i32> %insert.input_2_offset, <4 x i32> undef, <4 x i32> zeroinitializer
				%add.1 = add <4 x i32> %load.1, %splat.input_1_offset
				%load.2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %input_2_cast, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%add.2 = add <4 x i32> %load.2, %splat.input_2_offset
				%mul = mul <4 x i32> %add.1, %add.2
				%insert.output = insertelement <4 x i32> undef, i32 %out_offset, i32 0
				%splat.output = shufflevector <4 x i32> %insert.output, <4 x i32> undef, <4 x i32> zeroinitializer
				%add7 = add <4 x i32> %mul, %splat.output
				%max = tail call <4 x i32> @llvm.arm.mve.max.predicated.v4i32.v4i1(<4 x i32> %add7, <4 x i32> %.splat.i42, i32 1, <4 x i1> %pred, <4 x i32> undef)
				%min = tail call <4 x i32> @llvm.arm.mve.min.predicated.v4i32.v4i1(<4 x i32> %max, <4 x i32> %.splat.i, i32 1, <4 x i1> %pred, <4 x i32> undef)
				%reduce = tail call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %min)
				store i32 %reduce, i32* %scevgep2
				%add.ptr = getelementptr inbounds i8, i8* %input_1_vect.addr.052, i32 4
				%add.ptr14 = getelementptr inbounds i8, i8* %input_2_vect.addr.051, i32 4
				%sub = add i32 %num_elements.049, -4
				%iv.next = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv3, i32 1)
				%cmp = icmp ne i32 %iv.next, 0
				%scevgep1 = getelementptr i32, i32* %lsr.iv, i32 1
				%lsr.iv.next = add i32 %lsr.iv3, -1
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}
				declare <4 x i1> @llvm.arm.mve.vctp32(i32) #1
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>) #2
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>) #3
				declare <4 x i32> @llvm.arm.mve.max.predicated.v4i32.v4i1(<4 x i32>, <4 x i32>, i32, <4 x i1>, <4 x i32>) #1
				declare <4 x i32> @llvm.arm.mve.min.predicated.v4i32.v4i1(<4 x i32>, <4 x i32>, i32, <4 x i1>, <4 x i32>) #1
				declare i1 @llvm.test.set.loop.iterations.i32(i32) #4
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #4
				declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>) #5

				...
				---
				name: max_min_add_reduce
				alignment: 2
				exposesReturnsTwice: false
				legalized: false
				regBankSelected: false
				selected: false
				failedISel: false
				tracksRegLiveness: true
				hasWinCFI: false
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				isFrameAddressTaken: false
				isReturnAddressTaken: false
				hasStackMap: false
				hasPatchPoint: false
				stackSize: 24
				offsetAdjustment: 0
				maxAlignment: 4
				adjustsStack: false
				hasCalls: false
				stackProtector: ''
				maxCallFrameSize: 0
				cvBytesOfCalleeSavedRegisters: 0
				hasOpaqueSPAdjustment: false
				hasVAStart: false
				hasMustTailInVarArgFunc: false
				localFrameSize: 0
				savePoint: ''
				restorePoint: ''
				fixedStack:
				- { id: 0, type: default, offset: 24, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, type: default, offset: 20, size: 4, alignment: 4, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, type: default, offset: 16, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 3, type: default, offset: 12, size: 4, alignment: 4, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 4, type: default, offset: 8, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 5, type: default, offset: 4, size: 4, alignment: 4, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 6, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r8', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 3, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r6', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 4, name: '', type: spill-slot, offset: -20, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r5', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 5, name: '', type: spill-slot, offset: -24, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: max_min_add_reduce
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $lr
				; CHECK: $sp = frame-setup t2STMDB_UPD $sp, 14, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $r8, killed $lr
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 24
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r8, -8
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -12
				; CHECK: frame-setup CFI_INSTRUCTION offset $r6, -16
				; CHECK: frame-setup CFI_INSTRUCTION offset $r5, -20
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -24
				; CHECK: renamable $r12 = t2LDRi12 $sp, 48, 14, $noreg :: (load 4 from %fixed-stack.6, align 8)
				; CHECK: renamable $r5 = t2ADDri renamable $r12, 3, 14, $noreg, $noreg
				; CHECK: renamable $r7, dead $cpsr = tLSRri killed renamable $r5, 2, 14, $noreg
				; CHECK: $lr = t2WLS renamable $r7, %bb.3
				; CHECK: bb.1.for.body.lr.ph:
				; CHECK: successors: %bb.2(0x80000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r7, $r12
				; CHECK: $r6, $r5 = t2LDRDi8 $sp, 40, 14, $noreg :: (load 4 from %fixed-stack.4, align 8), (load 4 from %fixed-stack.5)
				; CHECK: $r4 = tMOVr killed $r7, 14, $noreg
				; CHECK: $r7, $r8 = t2LDRDi8 $sp, 24, 14, $noreg :: (load 4 from %fixed-stack.0, align 8), (load 4 from %fixed-stack.1)
				; CHECK: renamable $q0 = MVE_VDUP32 killed renamable $r5, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q1 = MVE_VDUP32 killed renamable $r6, 0, $noreg, undef renamable $q1
				; CHECK: renamable $r5, dead $cpsr = tSUBi3 killed renamable $r7, 4, 14, $noreg
				; CHECK: bb.2.for.body:
				; CHECK: successors: %bb.2(0x7c000000), %bb.3(0x04000000)
				; CHECK: liveins: $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r5, $r8, $r12
				; CHECK: renamable $vpr = MVE_VCTP32 renamable $r12, 0, $noreg
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRWU32_post killed renamable $r1, 4, 1, renamable $vpr :: (load 16 from %ir.input_2_cast, align 4)
				; CHECK: MVE_VPST 8, implicit $vpr
				; CHECK: renamable $r0, renamable $q3 = MVE_VLDRWU32_post killed renamable $r0, 4, 1, renamable $vpr :: (load 16 from %ir.input_1_cast, align 4)
				; CHECK: renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r3, 0, $noreg, undef renamable $q2
				; CHECK: renamable $q3 = MVE_VADD_qr_i32 killed renamable $q3, renamable $r2, 0, $noreg, undef renamable $q3
				; CHECK: $lr = tMOVr $r4, 14, $noreg
				; CHECK: renamable $q2 = MVE_VMULi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
				; CHECK: renamable $r4, dead $cpsr = tSUBi8 killed $r4, 1, 14, $noreg
				; CHECK: renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r8, 0, $noreg, undef renamable $q2
				; CHECK: renamable $r12 = t2SUBri killed renamable $r12, 4, 14, $noreg, $noreg
				; CHECK: MVE_VPST 4, implicit $vpr
				; CHECK: renamable $q2 = MVE_VMAXu32 killed renamable $q2, renamable $q1, 1, renamable $vpr, undef renamable $q2
				; CHECK: renamable $q2 = MVE_VMINu32 killed renamable $q2, renamable $q0, 1, killed renamable $vpr, undef renamable $q2
				; CHECK: renamable $r6 = MVE_VADDVu32no_acc killed renamable $q2, 0, $noreg
				; CHECK: early-clobber renamable $r5 = t2STR_PRE killed renamable $r6, killed renamable $r5, 4, 14, $noreg :: (store 4 into %ir.scevgep2)
				; CHECK: $lr = t2LEUpdate killed renamable $lr, %bb.2
				; CHECK: bb.3.for.cond.cleanup:
				; CHECK: $r0, dead $cpsr = tMOVi8 0, 14, $noreg
				; CHECK: $sp = t2LDMIA_RET $sp, 14, $noreg, def $r4, def $r5, def $r6, def $r7, def $r8, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.1(0x40000000), %bb.3(0x40000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $lr

				$sp = frame-setup t2STMDB_UPD $sp, 14, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $r8, killed $lr
				frame-setup CFI_INSTRUCTION def_cfa_offset 24
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r8, -8
				frame-setup CFI_INSTRUCTION offset $r7, -12
				frame-setup CFI_INSTRUCTION offset $r6, -16
				frame-setup CFI_INSTRUCTION offset $r5, -20
				frame-setup CFI_INSTRUCTION offset $r4, -24
				renamable $r12 = t2LDRi12 $sp, 48, 14, $noreg :: (load 4 from %fixed-stack.0, align 8)
				renamable $r5 = t2ADDri renamable $r12, 3, 14, $noreg, $noreg
				renamable $r7, dead $cpsr = tLSRri killed renamable $r5, 2, 14, $noreg
				t2WhileLoopStart renamable $r7, %bb.3, implicit-def dead $cpsr
				tB %bb.1, 14, $noreg

				bb.1.for.body.lr.ph:
				successors: %bb.2(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r7, $r12

				$r6, $r5 = t2LDRDi8 $sp, 40, 14, $noreg :: (load 4 from %fixed-stack.2, align 8), (load 4 from %fixed-stack.1)
				$r4 = tMOVr killed $r7, 14, $noreg
				$r7, $r8 = t2LDRDi8 $sp, 24, 14, $noreg :: (load 4 from %fixed-stack.6, align 8), (load 4 from %fixed-stack.5)
				renamable $q0 = MVE_VDUP32 killed renamable $r5, 0, $noreg, undef renamable $q0
				renamable $q1 = MVE_VDUP32 killed renamable $r6, 0, $noreg, undef renamable $q1
				renamable $r5, dead $cpsr = tSUBi3 killed renamable $r7, 4, 14, $noreg

				bb.2.for.body:
				successors: %bb.2(0x7c000000), %bb.3(0x04000000)
				liveins: $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r5, $r8, $r12

				renamable $vpr = MVE_VCTP32 renamable $r12, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r1, renamable $q2 = MVE_VLDRWU32_post killed renamable $r1, 4, 1, renamable $vpr :: (load 16 from %ir.input_2_cast, align 4)
				MVE_VPST 8, implicit $vpr
				renamable $r0, renamable $q3 = MVE_VLDRWU32_post killed renamable $r0, 4, 1, renamable $vpr :: (load 16 from %ir.input_1_cast, align 4)
				renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r3, 0, $noreg, undef renamable $q2
				renamable $q3 = MVE_VADD_qr_i32 killed renamable $q3, renamable $r2, 0, $noreg, undef renamable $q3
				$lr = tMOVr $r4, 14, $noreg
				renamable $q2 = MVE_VMULi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
				renamable $r4, dead $cpsr = tSUBi8 killed $r4, 1, 14, $noreg
				renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r8, 0, $noreg, undef renamable $q2
				renamable $r12 = t2SUBri killed renamable $r12, 4, 14, $noreg, $noreg
				MVE_VPST 4, implicit $vpr
				renamable $q2 = MVE_VMAXu32 killed renamable $q2, renamable $q1, 1, renamable $vpr, undef renamable $q2
				renamable $q2 = MVE_VMINu32 killed renamable $q2, renamable $q0, 1, killed renamable $vpr, undef renamable $q2
				renamable $r6 = MVE_VADDVu32no_acc killed renamable $q2, 0, $noreg
				early-clobber renamable $r5 = t2STR_PRE killed renamable $r6, killed renamable $r5, 4, 14, $noreg :: (store 4 into %ir.scevgep2)
				renamable $lr = t2LoopDec killed renamable $lr, 1
				t2LoopEnd killed renamable $lr, %bb.2, implicit-def dead $cpsr
				tB %bb.3, 14, $noreg

				bb.3.for.cond.cleanup:
				$r0, dead $cpsr = tMOVi8 0, 14, $noreg
				$sp = t2LDMIA_RET $sp, 14, $noreg, def $r4, def $r5, def $r6, def $r7, def $r8, def $pc, implicit killed $r0

				...

llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -run-pass=arm-low-overhead-loops %s -o - | FileCheck %s

				--- |
				define hidden i32 @vmldava_in_vpt(i8* %input_1_vect, i8* %input_2_vect, i32 %input_1_offset, i32 %input_2_offset, i32 %out_offset, i32 %out_mult, i32 %out_shift, i32 %out_activation_min, i32 %out_activation_max, i32 %block_size) local_unnamed_addr #0 {
				entry:
				%add = add i32 %block_size, 3
				%div = lshr i32 %add, 2
				%0 = call i1 @llvm.test.set.loop.iterations.i32(i32 %div)
				br i1 %0, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				%.splatinsert.i41 = insertelement <4 x i32> undef, i32 %out_activation_min, i32 0
				%.splat.i42 = shufflevector <4 x i32> %.splatinsert.i41, <4 x i32> undef, <4 x i32> zeroinitializer
				%.splatinsert.i = insertelement <4 x i32> undef, i32 %out_activation_max, i32 0
				%.splat.i = shufflevector <4 x i32> %.splatinsert.i, <4 x i32> undef, <4 x i32> zeroinitializer
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				%res = phi i32 [ 0, %entry ], [ %acc.next, %for.body ]
				ret i32 %res

				for.body: ; preds = %for.body, %for.body.lr.ph
				%lsr.iv = phi i32 [ %lsr.iv.next, %for.body ], [ %div, %for.body.lr.ph ]
				%input_1_vect.addr.052 = phi i8* [ %input_1_vect, %for.body.lr.ph ], [ %add.ptr, %for.body ]
				%input_2_vect.addr.051 = phi i8* [ %input_2_vect, %for.body.lr.ph ], [ %add.ptr14, %for.body ]
				%num_elements.049 = phi i32 [ %block_size, %for.body.lr.ph ], [ %sub, %for.body ]
				%acc = phi i32 [ 0, %for.body.lr.ph ], [ %acc.next, %for.body ]
				%input_2_cast = bitcast i8* %input_2_vect.addr.051 to <4 x i32>*
				%input_1_cast = bitcast i8* %input_1_vect.addr.052 to <4 x i32>*
				%pred = tail call <4 x i1> @llvm.arm.mve.vctp32(i32 %num_elements.049)
				%load.1 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %input_1_cast, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%insert.input_1_offset = insertelement <4 x i32> undef, i32 %input_1_offset, i32 0
				%splat.input_1_offset = shufflevector <4 x i32> %insert.input_1_offset, <4 x i32> undef, <4 x i32> zeroinitializer
				%insert.input_2_offset = insertelement <4 x i32> undef, i32 %input_2_offset, i32 0
				%splat.input_2_offset = shufflevector <4 x i32> %insert.input_2_offset, <4 x i32> undef, <4 x i32> zeroinitializer
				%add.1 = add <4 x i32> %load.1, %splat.input_1_offset
				%load.2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %input_2_cast, i32 4, <4 x i1> %pred, <4 x i32> undef)
				%add.2 = add <4 x i32> %load.2, %splat.input_2_offset
				%mul = mul <4 x i32> %add.1, %add.2
				%insert.output = insertelement <4 x i32> undef, i32 %out_offset, i32 0
				%splat.output = shufflevector <4 x i32> %insert.output, <4 x i32> undef, <4 x i32> zeroinitializer
				%add7 = add <4 x i32> %mul, %splat.output
				%max = tail call <4 x i32> @llvm.arm.mve.max.predicated.v4i32.v4i1(<4 x i32> %add7, <4 x i32> %.splat.i42, i32 1, <4 x i1> %pred, <4 x i32> undef)
				%min = tail call <4 x i32> @llvm.arm.mve.min.predicated.v4i32.v4i1(<4 x i32> %max, <4 x i32> %.splat.i, i32 1, <4 x i1> %pred, <4 x i32> undef)
				%acc.next = call i32 @llvm.arm.mve.vmldava.predicated.v4i32.v4i1(i32 0, i32 0, i32 0, i32 %acc, <4 x i32> %min, <4 x i32> %max, <4 x i1> %pred)
				%add.ptr = getelementptr inbounds i8, i8* %input_1_vect.addr.052, i32 4
				%add.ptr14 = getelementptr inbounds i8, i8* %input_2_vect.addr.051, i32 4
				%sub = add i32 %num_elements.049, -4
				%iv.next = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
				%cmp = icmp ne i32 %iv.next, 0
				%lsr.iv.next = add i32 %lsr.iv, -1
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}
				declare <4 x i1> @llvm.arm.mve.vctp32(i32) #1
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>) #2
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>) #3
				declare <4 x i32> @llvm.arm.mve.min.predicated.v4i32.v4i1(<4 x i32>, <4 x i32>, i32, <4 x i1>, <4 x i32>) #1
				declare <4 x i32> @llvm.arm.mve.max.predicated.v4i32.v4i1(<4 x i32>, <4 x i32>, i32, <4 x i1>, <4 x i32>) #1
				declare i32 @llvm.arm.mve.vmldava.predicated.v4i32.v4i1(i32, i32, i32, i32, <4 x i32>, <4 x i32>, <4 x i1>) #1
				declare i1 @llvm.test.set.loop.iterations.i32(i32) #4
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #4
				...
				---
				name: vmldava_in_vpt
				alignment: 2
				exposesReturnsTwice: false
				legalized: false
				regBankSelected: false
				selected: false
				failedISel: false
				tracksRegLiveness: true
				hasWinCFI: false
				registers: []
				liveins:
				- { reg: '$r0', virtual-reg: '' }
				- { reg: '$r1', virtual-reg: '' }
				- { reg: '$r2', virtual-reg: '' }
				- { reg: '$r3', virtual-reg: '' }
				frameInfo:
				isFrameAddressTaken: false
				isReturnAddressTaken: false
				hasStackMap: false
				hasPatchPoint: false
				stackSize: 20
				offsetAdjustment: 0
				maxAlignment: 4
				adjustsStack: false
				hasCalls: false
				stackProtector: ''
				maxCallFrameSize: 0
				cvBytesOfCalleeSavedRegisters: 0
				hasOpaqueSPAdjustment: false
				hasVAStart: false
				hasMustTailInVarArgFunc: false
				localFrameSize: 0
				savePoint: ''
				restorePoint: ''
				fixedStack:
				- { id: 0, type: default, offset: 20, size: 4, alignment: 4, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, type: default, offset: 16, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, type: default, offset: 12, size: 4, alignment: 4, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 3, type: default, offset: 8, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 4, type: default, offset: 4, size: 4, alignment: 4, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 5, type: default, offset: 0, size: 4, alignment: 8, stack-id: default,
				isImmutable: true, isAliased: false, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				stack:
				- { id: 0, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r6', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 3, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r5', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 4, name: '', type: spill-slot, offset: -20, size: 4, alignment: 4,
				stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				callSites: []
				constants: []
				machineFunctionInfo: {}
				body: |
				; CHECK-LABEL: name: vmldava_in_vpt
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $lr
				; CHECK: frame-setup tPUSH 14, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $lr, implicit-def $sp, implicit $sp
				; CHECK: frame-setup CFI_INSTRUCTION def_cfa_offset 20
				; CHECK: frame-setup CFI_INSTRUCTION offset $lr, -4
				; CHECK: frame-setup CFI_INSTRUCTION offset $r7, -8
				; CHECK: frame-setup CFI_INSTRUCTION offset $r6, -12
				; CHECK: frame-setup CFI_INSTRUCTION offset $r5, -16
				; CHECK: frame-setup CFI_INSTRUCTION offset $r4, -20
				; CHECK: renamable $r7 = tLDRspi $sp, 10, 14, $noreg :: (load 4 from %fixed-stack.5)
				; CHECK: renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
				; CHECK: $lr = MVE_WLSTP_32 renamable $r7, %bb.3
				; CHECK: bb.1.for.body.lr.ph:
				; CHECK: successors: %bb.2(0x80000000)
				; CHECK: liveins: $r0, $r1, $r2, $r3, $r5, $r7
				; CHECK: $r6 = tMOVr killed $r5, 14, $noreg
				; CHECK: $r5, $r12 = t2LDRDi8 $sp, 32, 14, $noreg :: (load 4 from %fixed-stack.3), (load 4 from %fixed-stack.4, align 8)
				; CHECK: renamable $r4 = tLDRspi $sp, 5, 14, $noreg :: (load 4 from %fixed-stack.0, align 8)
				; CHECK: renamable $q0 = MVE_VDUP32 killed renamable $r12, 0, $noreg, undef renamable $q0
				; CHECK: renamable $q1 = MVE_VDUP32 killed renamable $r5, 0, $noreg, undef renamable $q1
				; CHECK: renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
				; CHECK: bb.2.for.body:
				; CHECK: successors: %bb.2(0x7c000000), %bb.3(0x04000000)
				; CHECK: liveins: $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r6, $r7, $r12
				; CHECK: renamable $r1, renamable $q2 = MVE_VLDRWU32_post killed renamable $r1, 4, 0, $noreg :: (load 16 from %ir.input_2_cast, align 4)
				; CHECK: renamable $r0, renamable $q3 = MVE_VLDRWU32_post killed renamable $r0, 4, 0, $noreg :: (load 16 from %ir.input_1_cast, align 4)
				; CHECK: renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r3, 0, $noreg, undef renamable $q2
				; CHECK: renamable $q3 = MVE_VADD_qr_i32 killed renamable $q3, renamable $r2, 0, $noreg, undef renamable $q3
				; CHECK: $lr = tMOVr $r6, 14, $noreg
				; CHECK: renamable $q2 = MVE_VMULi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
				; CHECK: renamable $r6, dead $cpsr = tSUBi8 killed $r6, 1, 14, $noreg
				; CHECK: renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r4, 0, $noreg, undef renamable $q2
				; CHECK: renamable $q2 = MVE_VMAXu32 killed renamable $q2, renamable $q1, 0, $noreg, undef renamable $q2
				; CHECK: renamable $q3 = MVE_VMINu32 renamable $q2, renamable $q0, 0, $noreg, undef renamable $q3
				; CHECK: renamable $r12 = MVE_VMLADAVas32 killed renamable $r12, killed renamable $q3, killed renamable $q2, 0, killed $noreg
				dmgreen: Another unrelated one, but there is nothing in the instructions of this loop that says that this is predicated on a hardware register, right? Or that it has unmodelled side effects? The refs to vpr and the vctp are removed but no other ref is added.
				samparker (author): Yes, at this point it's completely implicit, which I don't think is a problem currently. Maybe we could add a new register and a predicate def for LSTP and LETP, but I have no idea how invasive that would be for the predicate users.
				dmgreen: It may be possible to add something similar to VPR to the MVE_P base of all MVE instructions. There might be some differences in register orders, though. Like you said, this is very late, so it might not be a problem currently. It's best not to be implicit if we can help it, though.
				; CHECK: $lr = MVE_LETP killed renamable $lr, %bb.2
				; CHECK: bb.3.for.cond.cleanup:
				; CHECK: liveins: $r12
				; CHECK: $r0 = tMOVr killed $r12, 14, $noreg
				; CHECK: tPOP_RET 14, $noreg, def $r4, def $r5, def $r6, def $r7, def $pc, implicit killed $r0
				bb.0.entry:
				successors: %bb.1(0x40000000), %bb.3(0x40000000)
				liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $lr

				frame-setup tPUSH 14, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $lr, implicit-def $sp, implicit $sp
				frame-setup CFI_INSTRUCTION def_cfa_offset 20
				frame-setup CFI_INSTRUCTION offset $lr, -4
				frame-setup CFI_INSTRUCTION offset $r7, -8
				frame-setup CFI_INSTRUCTION offset $r6, -12
				frame-setup CFI_INSTRUCTION offset $r5, -16
				frame-setup CFI_INSTRUCTION offset $r4, -20
				renamable $r7 = tLDRspi $sp, 10, 14, $noreg :: (load 4 from %fixed-stack.0)
				renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
				renamable $r4, dead $cpsr = tADDi3 renamable $r7, 3, 14, $noreg
				renamable $r5, dead $cpsr = tLSRri killed renamable $r4, 2, 14, $noreg
				t2WhileLoopStart renamable $r5, %bb.3, implicit-def dead $cpsr
				tB %bb.1, 14, $noreg

				bb.1.for.body.lr.ph:
				successors: %bb.2(0x80000000)
				liveins: $r0, $r1, $r2, $r3, $r5, $r7

				$r6 = tMOVr killed $r5, 14, $noreg
				$r5, $r12 = t2LDRDi8 $sp, 32, 14, $noreg :: (load 4 from %fixed-stack.2), (load 4 from %fixed-stack.1, align 8)
				renamable $r4 = tLDRspi $sp, 5, 14, $noreg :: (load 4 from %fixed-stack.5, align 8)
				renamable $q0 = MVE_VDUP32 killed renamable $r12, 0, $noreg, undef renamable $q0
				renamable $q1 = MVE_VDUP32 killed renamable $r5, 0, $noreg, undef renamable $q1
				renamable $r12 = t2MOVi 0, 14, $noreg, $noreg

				bb.2.for.body:
				successors: %bb.2(0x7c000000), %bb.3(0x04000000)
				liveins: $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r6, $r7, $r12

				renamable $vpr = MVE_VCTP32 renamable $r7, 0, $noreg
				MVE_VPST 8, implicit $vpr
				renamable $r1, renamable $q2 = MVE_VLDRWU32_post killed renamable $r1, 4, 1, renamable $vpr :: (load 16 from %ir.input_2_cast, align 4)
				MVE_VPST 8, implicit $vpr
				renamable $r0, renamable $q3 = MVE_VLDRWU32_post killed renamable $r0, 4, 1, renamable $vpr :: (load 16 from %ir.input_1_cast, align 4)
				renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r3, 0, $noreg, undef renamable $q2
				renamable $q3 = MVE_VADD_qr_i32 killed renamable $q3, renamable $r2, 0, $noreg, undef renamable $q3
				$lr = tMOVr $r6, 14, $noreg
				renamable $q2 = MVE_VMULi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
				renamable $r6, dead $cpsr = tSUBi8 killed $r6, 1, 14, $noreg
				renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r4, 0, $noreg, undef renamable $q2
				renamable $r7, dead $cpsr = tSUBi8 killed renamable $r7, 4, 14, $noreg
				MVE_VPST 2, implicit $vpr
				renamable $q2 = MVE_VMAXu32 killed renamable $q2, renamable $q1, 1, renamable $vpr, undef renamable $q2
				renamable $q3 = MVE_VMINu32 renamable $q2, renamable $q0, 1, renamable $vpr, undef renamable $q3
				renamable $r12 = MVE_VMLADAVas32 killed renamable $r12, killed renamable $q3, killed renamable $q2, 1, killed renamable $vpr
				renamable $lr = t2LoopDec killed renamable $lr, 1
				t2LoopEnd killed renamable $lr, %bb.2, implicit-def dead $cpsr
				tB %bb.3, 14, $noreg

				bb.3.for.cond.cleanup:
				liveins: $r12

				$r0 = tMOVr killed $r12, 14, $noreg
				tPOP_RET 14, $noreg, def $r4, def $r5, def $r6, def $r7, def $pc, implicit killed $r0

				...
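
The test above starts from a loop whose trip count is `(block_size + 3) >> 2` and whose lanes are masked by an explicit `MVE_VCTP32` on the remaining element count, and checks that it becomes a `MVE_WLSTP_32` / `MVE_LETP` loop driven by the element count with no explicit predicate. As a rough illustration of why the two forms enable the same lanes, here is a minimal Python sketch. It is a simplified model, not the pass's logic: the function names (`vctp32`, `explicit_loop`, `tail_predicated_loop`) are hypothetical, and the hardware behaviour is reduced to "lane i is active if i is less than the remaining element count".

```python
LANES = 4  # a 128-bit vector holds four 32-bit lanes


def vctp32(remaining):
    """Model of a VCTP32-style mask: lane i is active when i < remaining."""
    return [i < remaining for i in range(LANES)]


def explicit_loop(block_size):
    """Pre-transform shape: trip count computed up front, mask from VCTP."""
    iterations = (block_size + 3) >> 2  # mirrors the add 3 / lshr 2 in the IR
    remaining = block_size
    masks = []
    for _ in range(iterations):
        masks.append(vctp32(remaining))
        remaining -= LANES  # mirrors the 'sub 4' on the element count
    return masks


def tail_predicated_loop(block_size):
    """Post-transform shape (assumed model): the loop is driven by the
    element count itself and the lane mask is implicit."""
    remaining = block_size
    masks = []
    while remaining > 0:
        masks.append([i < remaining for i in range(LANES)])
        remaining -= LANES
    return masks
```

For any non-negative `block_size`, both functions return the same sequence of lane masks, which is the property the VPT-block check relies on: instructions predicated on the VCTP keep their behaviour once the predication becomes implicit.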