This is an archive of the discontinued LLVM Phabricator instance.

Differential D65884

[ARM] MVE Tail Predication
ClosedPublic

Authored by samparker on Aug 7 2019, 8:20 AM.

Details

Reviewers

dmgreen
SjoerdMeijer
simon_tatham
olista01
samtebbs

Commits

rG312409e464cd: [ARM] MVE Tail Predication
rL371179: [ARM] MVE Tail Predication

Summary

The MVE and LOB extensions of Armv8.1m can be combined to enable 'tail predication', which removes the need for a scalar remainder loop after vectorization. Lane predication is performed implicitly via a system register. The effects of predication are described in Section B5.6.3 of the Armv8.1-m Arch Reference Manual, the key points being:

For vector operations that perform reduction across the vector and produce a scalar result, whether the value is accumulated or not.
For non-load instructions, the predicate flags determine if the destination register byte is updated with the new value or if the previous value is preserved.
For vector store instructions, whether the store occurs or not.
For vector load instructions, whether the value is loaded or whether zeros are written to that element of the destination register.

This patch implements a pass that takes a hardware loop, containing masked vector instructions, and converts it into something that resembles an MVE tail predicated loop. Currently, if we had code generation, we'd generate a loop in which the VCTP would generate the predicate and VPST would then set up the value of VPR.P0. The loads and stores would be placed in VPT blocks so this is not tail predication, but normal VPT predication with the predicate based upon an element counting induction variable. Further work needs to be done to finally produce a true tail predicated loop.

Because only the loads and stores are predicated, in both the LLVM IR and MIR level, we will restrict support to only lane-wise operations (no horizontal reductions). We will perform a final check on MIR during loop finalisation too.

Another restriction, specific to MVE, is that all the vector instructions need to operate on the same number of elements. This is because predication is performed at the byte level and this is set on entry to the loop, or by the VCTP instead.

Diff Detail

Repository: rL LLVM

Event Timeline

samparker created this revision. Aug 7 2019, 8:20 AM

Herald added subscribers: zzheng, kristof.beyls, javed.absar, mgorny. Aug 7 2019, 8:20 AM

Hi Sam, nice one!

A first scan from my side, just some questions and nits.

Further work needs to be done to finally produce a true tail predicated loop.

Can you elaborate a little bit on this? I guess you mean instruction selection patterns to finally produce code? Can this work be committed without that being ready? I guess the question rephrased is: can you outline the plan?

You nicely state assumptions in the description, e.g.:

Because only the loads and stores are predicated, in both the LLVM IR and MIR level, we will restrict support to only lane-wise operations ..
..
Another restriction, specific to MVE, is that all the vector instructions need to operate on the same number of elements.

I think it would be good to add that to the code too as comments.

lib/Target/ARM/Thumb2TailPredication.cpp
1 ↗	(On Diff #213894)	nit: MVETailPredication.cpp -> Thumb2TailPredication.cpp?
19 ↗	(On Diff #213894)	nit: typo "can generate as"?
292 ↗	(On Diff #213894)	Just wondering if this utility function is interesting and generic enough to be moved to e.g. LoopInfo.
303 ↗	(On Diff #213894)	nit: magic constant?
328 ↗	(On Diff #213894)	nit: newline
369 ↗	(On Diff #213894)	nit: this nested-if looks a bit funny, perhaps just getting rid of the brackets helps

First up, patch context! Everyone seems to be forgetting recently :)

Why does the llvm_arm_vctp32 not return a <4xi1> directly? I didn't understand the "from a software and hardware perspective, is a 16-bit scalar", or at least why that would matter in IR. Would it not just be represented as a vector of i1's? As it already is for cmps/selects for example. What does having to use the separate conversion help with?

include/llvm/IR/IntrinsicsARM.td
785 ↗	(On Diff #213894)	This, at least in isel, is called a PREDICATE_CAST. (It used to be called a REINTERPRET_CAST, but I thought that name was overly broad). I think from IR a better name might be int_arm_mve_vmsr / int_arm_mve_vmrs, to represent that it is converting from a predicate to the scalar, as the instruction does.

samparker edited the summary of this revision. Aug 8 2019, 12:16 AM

Why does the llvm_arm_vctp32 not return a <4xi1> directly?

The vctp family are defined like that because the ACLE specifies that they return a mve_pred16_t and I'm assuming this is a scalar - but I can't find a definition! I think that all the user facing predicate generators will produce a scalar and we will need to do the conversion to make it nice and LLVMy.

Can you elaborate a little bit on this? I guess you mean instruction selection patterns to finally produce code? Can this work be committed without that being ready? I guess the question rephrased is: can you outline the plan?

The overall plan is as follows:

Enable vectorizer.
Teach the vectorizer to produce a predicated loop, using a pragma and a target profitability hook. Think we also need to specify that we support masked load/store intrinsics too.
This pass converts some vector icmps to vctp intrinsics.
A bit of isel for vctp and final bits for other MVE instructions.
The loop is finalised in the LowOverheadLoop pass where we check for validity and remove the VCTP and VPST instructions, as well as their VPT blocks.
Beer.

include/llvm/IR/IntrinsicsARM.td
785 ↗	(On Diff #213894)	Sounds good.

Updated with context.

In D65884#1620474, @samparker wrote:

Why does the llvm_arm_vctp32 not return a <4xi1> directly?

The vctp family are defined like that because the ACLE specifies that they return a mve_pred16_t and I'm assuming this is a scalar - but I can't find a definition! I think that all the user facing predicate generators will produce a scalar and we will need to do the conversion to make it nice and LLVMy.

Sure, the ACLE intrinsic needs to return an i16, but does that mean the IR intrinsic needs to? It could be expanded to two instructions, llvm_arm_vctp32 and llvm_arm_vmrs, with the i16 coming from the vmrs. This kind of thing sounds like it would be useful already for things like masked loads. I.e. I'm saying can we invert where the conversion happens?

So if we started with acle:

mve_pred16_t pred = vctp8q(i)
l = vldrbq_z_s8(a, pred)

It would get expanded to become:

// vctp8q
<4 x i1> t1 = llvm.arm.vctp(i)
i16 pred = llvm.arm.vmrs(t1)
// vldrbq_z_s8
<4 x i1> t2 = llvm.arm.vmsr(pred)
l = llvm.masked.load(a, t2)

And you could use instcombine to fold out the converts (vmsr(vmrs(a)) == a), into

t1 = llvm.arm.vctp(i)
llvm.masked.load(a, t1)

It would work even better for compares that already have a predicate that llvm knows about. The whole thing would just become llvm IR and we can let it optimise away. This is getting a bit much into intrinsic design, though, which isn't this patch's problem!
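
The fold mentioned above, vmsr(vmrs(a)) == a, can be sketched with a scalar model of the two conversions. This is an illustration under assumptions rather than the intrinsics' definition: `pred_to_i16` and `i16_to_pred` are hypothetical stand-ins for the vmrs and vmsr conversions, the 16-bit predicate is modelled with one bit per byte (so a 32-bit lane covers four bits), and reading a lane back from the lowest bit of its group is an assumption made only for this sketch.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the predicate-to-scalar conversion (vmrs):
// each 32-bit lane covers four bytes, so its flag is replicated over
// four bits of the 16-bit value.
std::uint16_t pred_to_i16(const std::array<bool, 4> &lanes) {
  std::uint16_t bits = 0;
  for (unsigned lane = 0; lane < 4; ++lane)
    if (lanes[lane])
      bits |= static_cast<std::uint16_t>(0xFu << (4 * lane));
  return bits;
}

// Hypothetical stand-in for the scalar-to-predicate conversion (vmsr).
// Assumption: a lane is read from the lowest bit of its 4-bit group.
std::array<bool, 4> i16_to_pred(std::uint16_t bits) {
  std::array<bool, 4> lanes{};
  for (unsigned lane = 0; lane < 4; ++lane)
    lanes[lane] = ((bits >> (4 * lane)) & 1u) != 0;
  return lanes;
}
```

In this model i16_to_pred(pred_to_i16(a)) returns a for every lane pattern, which is the property the proposed fold relies on. The opposite composition is not an identity in general, since a 16-bit value whose bits differ within one 4-bit group has no exact <4 x i1> equivalent here.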

Sam's suggestion to me for the ACLE intrinsics was that there should be an IR intrinsic that converts the i16 provided by the user into an <n x i1> for whatever n makes sense. In my unpushed (and unpolished) draft implementation there's also one that converts back again, which the ACLE intrinsics will need for the return value of vcmp. So it could be used here as well if that's useful.

The overall plan is as follows:

Enable vectorizer.

Teach the vectorizer to produce a predicated loop, using a pragma and a target profitability hook. Think we also need to specify that we support masked load/store intrinsics too.

This pass converts some vector icmps to vctp intrinsics.

A bit of isel for vctp and final bits for other MVE instructions.

The loop is finalised in the LowOverheadLoop pass where we check for validity and remove the VCTP and VPST instructions, as well as their VPT blocks.

Beer.

Thanks for this! That's quite a lot of moving parts!
Thinking out loud what the strategy could be so that we properly test the whole flow..... One would be to start with the ground work: enabling the vectorizer is flipping a switch, we can already produce a predicated loop with a pragma (target profitability hook is missing), and a bit of isel should also be easy. With the small(er) and easier groundwork in, we then have the 2 biggies remaining: this patch, and the LowOverheadLoop pass modifications. But when we have these 2 ready, we can test the flow and check if we haven't missed anything (that could change the design/flow).
Another option is to commit this once we're happy with it as it won't be enabled/triggering. Then we will have to see how it behaves when tail predication support is ready in LowOverheadLoop (and the other bits and pieces).

In D65884#1621103, @dmgreen wrote:

Sure, the ACLE intrinsic needs to return an i16, but does that mean the IR intrinsic needs to? It could be expanded to two instructions, llvm_arm_vctp32 and llvm_arm_vmrs, with the i16 coming from the vmrs. This kind of thing sounds like it would be useful already for things like masked loads. I.e. I'm saying can we invert where the conversion happens?

It would work even better for compares that already have a predicate that llvm knows about. The whole thing would just become llvm IR and we can let it optimise away.

I'm not following your suggestions here. I don't see how either approach is better for the IR, we're just talking about two intrinsics that convert between a scalar and a vector, and I don't know why we'd need to get instcombine involved. I don't mind how these intrinsics end up getting implemented, as long as we have normal vectors in the end, and for this patch it makes sense for me to have vctp looking like the acle intrinsic.

In D65884#1621209, @SjoerdMeijer wrote:

Thinking out loud what the strategy could be so that we properly test the whole flow.....

I think it will be okay to land this when we can, it's disabled and the vectorizer won't be generating suitable input anyway. The finalisation can happen last and with no real rush - isel should still generate a valid loop, the vctp instructions will be more efficient than doing the arithmetic with vectors in the loop. So really, isel for masked load/stores and the vctp is the next most important bit because then we can at least run llc. Then we can prod the vectorizer into generating masked loops and we can find all the bugs :)

You had me convinced until the last line. I think it's probably simpler for the vctp to produce a v4i1, as we don't need the convert at all in this patch (unless I'm missing something).

Essentially, the vctp can either look more like the acle intrinsic (produce an i16, makes the acle->IR simpler, but the IR->instruction needs to match on a convert(vctp)), or more like the VCTP instruction (produce a v4i1, makes the IR->Instruction simpler). I would go with the second option, but Simon is the expert on all things Intrinsics. Go with whatever he thinks is OK!

lib/Target/ARM/CMakeLists.txt
61 ↗	(On Diff #214078)	I would go with MVETailPredicationPass.cpp

In D65884#1621209, @SjoerdMeijer wrote:

Thinking out loud what the strategy could be so that we properly test the whole flow.....
I think it will be okay to land this when we can, it's disabled and the vectorizer won't be generating suitable input anyway. The finalisation can happen last and with no real rush - isel should still generate a valid loop, the vctp instructions will be more efficient than doing the arithmetic with vectors in the loop. So really, isel for masked load/stores and the vctp is the next most important bit because then we can at least run llc. Then we can prod the vectorizer into generating masked loops and we can find all the bugs :)

Ok, sounds like a good plan to me.

SjoerdMeijer added inline comments. Aug 28 2019, 5:39 AM

lib/Target/ARM/Thumb2TailPredication.cpp
95 ↗	(On Diff #214078)	nit: perhaps slightly more informative function name, something with "LoopIterations" in it?
134 ↗	(On Diff #214078)	"Wrong Subtarget" -> perhaps better/clearer to say that Subtarget does not support TP.
138 ↗	(On Diff #214078)	Perhaps better to move this message to the next block as it might not be really running at this point (if there's no preheader)?
145 ↗	(On Diff #214078)	Bit of a nit, but perhaps setting 'Setup' can be a function, so that we don't have the `if (!Setup)` check below twice, but simply a return when we found it.
152 ↗	(On Diff #214078)	nit: extra whitespace in "pre- preheader"
187 ↗	(On Diff #214078)	I am not sure, but I'm wondering if this function is misnamed? I think I would expect `isTailPredicate` to work on a Loop, not an Instruction/Value.
204 ↗	(On Diff #214078)	I need to read this function again, but it isn't clear to me yet why this is then a tail-predicated loop and not some other vectorized loop.

samparker marked 5 inline comments as done. Aug 28 2019, 7:54 AM

samparker added inline comments.

lib/Target/ARM/Thumb2TailPredication.cpp
187 ↗	(On Diff #214078)	TPCandidate has to be constructed with a loop and will only operate upon that loop. This method needs to answer whether V is the value that produces the tail predicate. The tail predicate being defined below in the comments. I will make it explicit that we're answering whether V == %pred.
204 ↗	(On Diff #214078)	It's only answering whether the given value, V, is equivalent to a vctp instruction. So we're not talking about MVE tail predication yet, we're just matching vctp. I expect that most of these loops will then be converted to a tail predicated form.

samparker marked 2 inline comments as done. Aug 29 2019, 2:06 AM

Hopefully addressed all the comments. VCTP intrinsics now return a vector so I've removed the predication conversion intrinsics as I don't need them here.

SjoerdMeijer added inline comments. Aug 29 2019, 6:00 AM

lib/Target/ARM/MVETailPredication.cpp
364 ↗	(On Diff #217801)	What we are doing here, is looking at generated code to get `%Elems`. But the vectorizer has generated this code, so somewhere this knowledge is present. The only question is, I guess, how easily accessible. Can we query SCEV or something else to avoid this pattern matching?

SjoerdMeijer added inline comments. Aug 29 2019, 7:57 AM

lib/Target/ARM/MVETailPredication.cpp
301 ↗	(On Diff #217801)	A few nits I forgot to mention in my previous message: Replace constant 128 with `TTI.getRegisterBitWidth(true)`? And I don't think I understand the `Lanes == 128` part.

samparker marked an inline comment as done. Aug 30 2019, 5:45 AM

samparker added inline comments.

lib/Target/ARM/MVETailPredication.cpp
364 ↗	(On Diff #217801)	I've had a play with some other parts of SCEV, the helpers for delinearization, to see if it could answer the size of the array accessed by the loop. I can't seem to get this to work with the GEPs in the tests as they're not represented by clean AddRecExprs. I presume it's not unreasonable for SCEV to not be that useful after vectorization..? It does look like it could be used though, but I don't think I have the knowledge to do it safely! I'll add a TODO / FIXME.

Added a couple of comments.

SjoerdMeijer added inline comments. Aug 30 2019, 5:56 AM

lib/Target/ARM/MVETailPredication.cpp
364 ↗	(On Diff #217801)	Alright, fair enough. Last question then, last crazy idea, could this information be attached to the loop as metadata?

samparker marked an inline comment as done. Sep 3 2019, 1:08 AM

samparker added inline comments.

lib/Target/ARM/MVETailPredication.cpp
364 ↗	(On Diff #217801)	That doesn't seem to be the way metadata is used. It would require it to be kept up-to-date by any transform, which is unreasonable. Loop properties are provided by analysis passes, so this could live in one of them... but it seems a bit niche and should probably be done in a better way :)

SjoerdMeijer added inline comments. Sep 3 2019, 5:10 AM

lib/Target/ARM/MVETailPredication.cpp
364 ↗	(On Diff #217801)	Okay, that's also fair enough. I really cannot admit I am a big fan of this pattern matching here, which is a bit of an understatement, but I guess it is what it is for now. I.e., your TODO captures it well for me. The pattern matching is extremely 'focused', and looking into SCEV and friends could be a project in itself...and so while that is considered, this is okay'ish. I will now continue reviewing the remaining bits I haven't looked at.

If I am not mistaken, there is no test with a nested-loop. Would probably be good to test that too.

lib/Target/ARM/MVETailPredication.cpp
301 ↗	(On Diff #217801)	thanks for adding the comment, I still think you can use TTI.getRegisterBitWidth(true) here
177 ↗	(On Diff #218073)	For me, personally, creating a TPCandidate class doesn't add an abstraction that is really useful or clearer as literally the only thing it captures is 1 list, i.e. `MaskedInsts`, and we don't have multiple candidates and the lifetime of this object is extremely short. So, a simplification/clean-up in my opinion, is to get rid of the class and then we just have 3 local helper functions, and one variable that we pass through. This is probably a nit, and/or a style preference.

Cheers!

Now using TTI.
Removed TPCandidate.
Added some nested loop tests.

Thanks, this now looks reasonable to me as an initial commit.
It is disabled by default, and we can now experiment with this and iterate on this if required.

This revision is now accepted and ready to land. Sep 4 2019, 8:52 AM

Closed by commit rL371179: [ARM] MVE Tail Predication (authored by sam_parker). Sep 6 2019, 1:23 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Sep 6 2019, 1:23 AM

I think this might have caused a regression (my git-bisect is nearly complete):

FAIL: LLVM :: CodeGen/ARM/O3-pipeline.ll (24373 of 51236)
******************** TEST 'LLVM :: CodeGen/ARM/O3-pipeline.ll' FAILED ********************
Script:
--
: 'RUN: at line 1';   /tmp/_update_lc/t/bin/llc -mtriple=arm -O3 -debug-pass=Structure < /home/dave/s/lp/llvm/test/CodeGen/ARM/O3-pipeline.ll -o /dev/null 2>&1 | grep -v "Verify generated machine code" | /tmp/_update_lc/t/bin/FileCheck /home/dave/s/lp/llvm/test/CodeGen/ARM/O3-pipeline.ll
--
Exit Code: 1

Command Output (stderr):
--
/home/dave/s/lp/llvm/test/CodeGen/ARM/O3-pipeline.ll:55:15: error: CHECK-NEXT: is not on the line after the previous match
; CHECK-NEXT: Safe Stack instrumentation pass
              ^
<stdin>:65:2: note: 'next' match was here
 Safe Stack instrumentation pass
 ^
<stdin>:61:25: note: previous match ended here
 Hardware Loop Insertion
                        ^
<stdin>:62:1: note: non-matching line after previous match is here
 Scalar Evolution Analysis
^

--

********************
Testing Time: 64.36s
********************
Failing Tests (1):
    LLVM :: CodeGen/ARM/O3-pipeline.ll

  Expected Passes    : 50558
  Expected Failures  : 168
  Unsupported Tests  : 509
  Unexpected Failures: 1

simon_tatham mentioned this in D70485: [ARM,MVE] Add intrinsics to deal with predicates. Nov 21 2019, 9:15 AM

simon_tatham mentioned this in D70592: [ARM,MVE] Rename and clean up VCTP IR intrinsics. Nov 22 2019, 3:47 AM

simon_tatham mentioned this in rG48cce077efcc: [ARM,MVE] Rename and clean up VCTP IR intrinsics. Dec 2 2019, 8:24 AM

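Revision Contents

Path	Size
llvm/trunk/include/llvm/IR/IntrinsicsARM.td	4 lines
llvm/trunk/lib/Target/ARM/ARM.h	2 lines
llvm/trunk/lib/Target/ARM/ARMTargetMachine.cpp	5 lines
llvm/trunk/lib/Target/ARM/CMakeLists.txt	1 line
llvm/trunk/lib/Target/ARM/MVETailPredication.cpp	469 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll	385 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/nested.ll	152 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-narrow.ll	54 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-pattern-fail.ll	505 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-widen.ll	173 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll	118 lines
llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/vector-unroll.ll	118 lines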

Diff 219044

llvm/trunk/include/llvm/IR/IntrinsicsARM.td

[...]
 class Neon_Dot_Intrinsic
   : Intrinsic<[llvm_anyvector_ty],
               [LLVMMatchType<0>, llvm_anyvector_ty,
                LLVMMatchType<1>],
               [IntrNoMem]>;
 def int_arm_neon_udot : Neon_Dot_Intrinsic;
 def int_arm_neon_sdot : Neon_Dot_Intrinsic;

+def int_arm_vctp8 : Intrinsic<[llvm_v16i1_ty], [llvm_i32_ty], [IntrNoMem]>;
+def int_arm_vctp16 : Intrinsic<[llvm_v8i1_ty], [llvm_i32_ty], [IntrNoMem]>;
+def int_arm_vctp32 : Intrinsic<[llvm_v4i1_ty], [llvm_i32_ty], [IntrNoMem]>;
+def int_arm_vctp64 : Intrinsic<[llvm_v2i1_ty], [llvm_i32_ty], [IntrNoMem]>;
+
 // GNU eabi mcount
 def int_arm_gnu_eabi_mcount : Intrinsic<[],
                                         [],
                                         [IntrReadMem, IntrWriteMem]>;

 } // end TargetPrefix

llvm/trunk/lib/Target/ARM/ARM.h

[...]
 class FunctionPass;
 class InstructionSelector;
 class MachineBasicBlock;
 class MachineFunction;
 class MachineInstr;
 class MCInst;
 class PassRegistry;

+Pass *createMVETailPredicationPass();
 FunctionPass *createARMLowOverheadLoopsPass();
 Pass *createARMParallelDSPPass();
 FunctionPass *createARMISelDag(ARMBaseTargetMachine &TM,
                                CodeGenOpt::Level OptLevel);
 FunctionPass *createA15SDOptimizerPass();
 FunctionPass *createARMLoadStoreOptimizationPass(bool PreAlloc = false);
 FunctionPass *createARMExpandPseudoPass();
 FunctionPass *createARMCodeGenPreparePass();
[...]
 void initializeARMPreAllocLoadStoreOptPass(PassRegistry &);
 void initializeARMCodeGenPreparePass(PassRegistry &);
 void initializeARMConstantIslandsPass(PassRegistry &);
 void initializeARMExpandPseudoPass(PassRegistry &);
 void initializeThumb2SizeReducePass(PassRegistry &);
 void initializeThumb2ITBlockPass(PassRegistry &);
 void initializeMVEVPTBlockPass(PassRegistry &);
 void initializeARMLowOverheadLoopsPass(PassRegistry &);
+void initializeMVETailPredicationPass(PassRegistry &);

 } // end namespace llvm

 #endif // LLVM_LIB_TARGET_ARM_ARM_H

llvm/trunk/lib/Target/ARM/ARMTargetMachine.cpp

[...]
 extern "C" void LLVMInitializeARMTarget() {
[...]
   initializeARMPreAllocLoadStoreOptPass(Registry);
   initializeARMParallelDSPPass(Registry);
   initializeARMCodeGenPreparePass(Registry);
   initializeARMConstantIslandsPass(Registry);
   initializeARMExecutionDomainFixPass(Registry);
   initializeARMExpandPseudoPass(Registry);
   initializeThumb2SizeReducePass(Registry);
   initializeMVEVPTBlockPass(Registry);
+  initializeMVETailPredicationPass(Registry);
   initializeARMLowOverheadLoopsPass(Registry);
 }

 static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {
   if (TT.isOSBinFormatMachO())
     return std::make_unique<TargetLoweringObjectFileMachO>();
   if (TT.isOSWindows())
     return std::make_unique<TargetLoweringObjectFileCOFF>();
[...]
   if ((TM->getOptLevel() != CodeGenOpt::None &&
[...]
     // expect it to be generally either beneficial or harmless. On Mach-O it
     // is disabled as we emit the .subsections_via_symbols directive which
     // means that merging extern globals is not safe.
     bool MergeExternalByDefault = !TM->getTargetTriple().isOSBinFormatMachO();
     addPass(createGlobalMergePass(TM, 127, OnlyOptimizeForSize,
                                   MergeExternalByDefault));
   }

-  if (TM->getOptLevel() != CodeGenOpt::None)
+  if (TM->getOptLevel() != CodeGenOpt::None) {
     addPass(createHardwareLoopsPass());
+    addPass(createMVETailPredicationPass());
+  }

   return false;
 }

 bool ARMPassConfig::addInstSelector() {
   addPass(createARMISelDag(getARMTargetMachine(), getOptLevel()));
   return false;
 }
[...]

llvm/trunk/lib/Target/ARM/CMakeLists.txt

[...]
 add_llvm_target(ARMCodeGen
[...]
   ARMOptimizeBarriersPass.cpp
   ARMRegisterBankInfo.cpp
   ARMSelectionDAGInfo.cpp
   ARMSubtarget.cpp
   ARMTargetMachine.cpp
   ARMTargetObjectFile.cpp
   ARMTargetTransformInfo.cpp
   MLxExpansionPass.cpp
+  MVETailPredication.cpp
   MVEVPTBlockPass.cpp
   Thumb1FrameLowering.cpp
   Thumb1InstrInfo.cpp
   ThumbRegisterInfo.cpp
   Thumb2ITBlockPass.cpp
   Thumb2InstrInfo.cpp
   Thumb2SizeReduction.cpp
   )

 add_subdirectory(AsmParser)
 add_subdirectory(Disassembler)
 add_subdirectory(MCTargetDesc)
 add_subdirectory(TargetInfo)
 add_subdirectory(Utils)

llvm/trunk/lib/Target/ARM/MVETailPredication.cpp

				//===- MVETailPredication.cpp - MVE Tail Predication ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// Armv8.1m introduced MVE, M-Profile Vector Extension, and low-overhead
				/// branches to help accelerate DSP applications. These two extensions can be
				/// combined to provide implicit vector predication within a low-overhead loop.
				/// The HardwareLoops pass inserts intrinsics identifying loops that the
				/// backend will attempt to convert into a low-overhead loop. The vectorizer is
				/// responsible for generating a vectorized loop in which the lanes are
				/// predicated upon the iteration counter. This pass looks at these predicated
				/// vector loops, that are targets for low-overhead loops, and prepares it for
				/// code generation. Once the vectorizer has produced a masked loop, there's a
				/// couple of final forms:
				/// - A tail-predicated loop, with implicit predication.
				/// - A loop containing multiple VCTP instructions, predicating multiple VPT
				/// blocks of instructions operating on different vector types.

				#include "llvm/Analysis/LoopInfo.h"
				#include "llvm/Analysis/LoopPass.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/ScalarEvolutionExpander.h"
				#include "llvm/Analysis/ScalarEvolutionExpressions.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/PatternMatch.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "ARM.h"
				#include "ARMSubtarget.h"

				using namespace llvm;

				#define DEBUG_TYPE "mve-tail-predication"
				#define DESC "Transform predicated vector loops to use MVE tail predication"

				static cl::opt<bool>
				DisableTailPredication("disable-mve-tail-predication", cl::Hidden,
				cl::init(true),
				cl::desc("Disable MVE Tail Predication"));
				namespace {

				class MVETailPredication : public LoopPass {
				SmallVector<IntrinsicInst*, 4> MaskedInsts;
				Loop *L = nullptr;
				ScalarEvolution *SE = nullptr;
				TargetTransformInfo *TTI = nullptr;

				public:
				static char ID;

				MVETailPredication() : LoopPass(ID) { }

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<ScalarEvolutionWrapperPass>();
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addRequired<TargetPassConfig>();
				AU.addRequired<TargetTransformInfoWrapperPass>();
				AU.addPreserved<LoopInfoWrapperPass>();
				AU.setPreservesCFG();
				}

				bool runOnLoop(Loop *L, LPPassManager&) override;

				private:

				/// Perform the relevant checks on the loop and convert if possible.
				bool TryConvert(Value *TripCount);

				/// Return whether this is a vectorized loop, that contains masked
				/// load/stores.
				bool IsPredicatedVectorLoop();

				/// Compute a value for the total number of elements that the predicated
				/// loop will process.
				Value *ComputeElements(Value *TripCount, VectorType *VecTy);

				/// Is the icmp that generates an i1 vector, based upon a loop counter
				/// and a limit that is defined outside the loop.
				bool isTailPredicate(Value *Predicate, Value *NumElements);
				};

				} // end namespace

				static bool IsDecrement(Instruction &I) {
				auto *Call = dyn_cast<IntrinsicInst>(&I);
				if (!Call)
				return false;

				Intrinsic::ID ID = Call->getIntrinsicID();
				return ID == Intrinsic::loop_decrement_reg;
				}

				static bool IsMasked(Instruction *I) {
				auto *Call = dyn_cast<IntrinsicInst>(I);
				if (!Call)
				return false;

				Intrinsic::ID ID = Call->getIntrinsicID();
				// TODO: Support gather/scatter expand/compress operations.
				return ID == Intrinsic::masked_store || ID == Intrinsic::masked_load;
				}

				bool MVETailPredication::runOnLoop(Loop *L, LPPassManager&) {
				if (skipLoop(L) || DisableTailPredication)
				return false;

				Function &F = *L->getHeader()->getParent();
				auto &TPC = getAnalysis<TargetPassConfig>();
				auto &TM = TPC.getTM<TargetMachine>();
				auto *ST = &TM.getSubtarget<ARMSubtarget>(F);
				TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
				SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
				this->L = L;

				// The MVE and LOB extensions are combined to enable tail-predication, but
				// there's nothing preventing us from generating VCTP instructions for v8.1m.
				if (!ST->hasMVEIntegerOps() || !ST->hasV8_1MMainlineOps()) {
				LLVM_DEBUG(dbgs() << "TP: Not a v8.1m.main+mve target.\n");
				return false;
				}

				BasicBlock *Preheader = L->getLoopPreheader();
				if (!Preheader)
				return false;

				auto FindLoopIterations = [](BasicBlock *BB) -> IntrinsicInst* {
				for (auto &I : *BB) {
				auto *Call = dyn_cast<IntrinsicInst>(&I);
				if (!Call)
				continue;

				Intrinsic::ID ID = Call->getIntrinsicID();
				if (ID == Intrinsic::set_loop_iterations ||
				ID == Intrinsic::test_set_loop_iterations)
				return cast<IntrinsicInst>(&I);
				}
				return nullptr;
				};

				// Look for the hardware loop intrinsic that sets the iteration count.
				IntrinsicInst *Setup = FindLoopIterations(Preheader);

				// The test.set iteration could live in the pre-preheader.
				if (!Setup) {
				if (!Preheader->getSinglePredecessor())
				return false;
				Setup = FindLoopIterations(Preheader->getSinglePredecessor());
				if (!Setup)
				return false;
				}

				// Search for the hardware loop intrinsic that decrements the loop counter.
				IntrinsicInst *Decrement = nullptr;
				for (auto *BB : L->getBlocks()) {
				for (auto &I : *BB) {
				if (IsDecrement(I)) {
				Decrement = cast<IntrinsicInst>(&I);
				break;
				}
				}
				}

				if (!Decrement)
				return false;

				LLVM_DEBUG(dbgs() << "TP: Running on Loop: " << *L
				<< *Setup << "\n"
				<< *Decrement << "\n");
				bool Changed = TryConvert(Setup->getArgOperand(0));
				return Changed;
				}

				bool MVETailPredication::isTailPredicate(Value *V, Value *NumElements) {
				// Look for the following:

				// %trip.count.minus.1 = add i32 %N, -1
				// %broadcast.splatinsert10 = insertelement <4 x i32> undef,
				// i32 %trip.count.minus.1, i32 0
				// %broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10,
				// <4 x i32> undef,
				// <4 x i32> zeroinitializer
				// ...
				// ...
				// %index = phi i32
				// %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				// %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert,
				// <4 x i32> undef,
				// <4 x i32> zeroinitializer
				// %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				// %pred = icmp ule <4 x i32> %induction, %broadcast.splat11

				// And return whether V == %pred.

				using namespace PatternMatch;

				CmpInst::Predicate Pred;
				Instruction *Shuffle = nullptr;
				Instruction *Induction = nullptr;

				// The vector icmp
				if (!match(V, m_ICmp(Pred, m_Instruction(Induction),
				m_Instruction(Shuffle))) ||
				Pred != ICmpInst::ICMP_ULE || !L->isLoopInvariant(Shuffle))
				return false;

				// First find the stuff outside the loop which is setting up the limit
				// vector....
				// The invariant shuffle that broadcasts the limit into a vector.
				Instruction *Insert = nullptr;
				if (!match(Shuffle, m_ShuffleVector(m_Instruction(Insert), m_Undef(),
				m_Zero())))
				return false;

				// Insert the limit into a vector.
				Instruction *BECount = nullptr;
				if (!match(Insert, m_InsertElement(m_Undef(), m_Instruction(BECount),
				m_Zero())))
				return false;

				// The limit calculation, backedge count.
				Value *TripCount = nullptr;
				if (!match(BECount, m_Add(m_Value(TripCount), m_AllOnes())))
				return false;

				if (TripCount != NumElements)
				return false;

				// Now back to searching inside the loop body...
				// Find the add which takes the index iv and adds a constant vector to it.
				Instruction *BroadcastSplat = nullptr;
				Constant *Const = nullptr;
				if (!match(Induction, m_Add(m_Instruction(BroadcastSplat),
				m_Constant(Const))))
				return false;

				// Check that we're adding <0, 1, 2, 3...
				if (auto *CDS = dyn_cast<ConstantDataSequential>(Const)) {
				for (unsigned i = 0; i < CDS->getNumElements(); ++i) {
				if (CDS->getElementAsInteger(i) != i)
				return false;
				}
				} else
				return false;

				// The shuffle which broadcasts the index iv into a vector.
				if (!match(BroadcastSplat, m_ShuffleVector(m_Instruction(Insert), m_Undef(),
				m_Zero())))
				return false;

				// The insert element which initialises a vector with the index iv.
				Instruction *IV = nullptr;
				if (!match(Insert, m_InsertElement(m_Undef(), m_Instruction(IV), m_Zero())))
				return false;

				// The index iv.
				auto *Phi = dyn_cast<PHINode>(IV);
				if (!Phi)
				return false;

				// TODO: Don't think we need to check the entry value.
				Value *OnEntry = Phi->getIncomingValueForBlock(L->getLoopPreheader());
				if (!match(OnEntry, m_Zero()))
				return false;

				Value *InLoop = Phi->getIncomingValueForBlock(L->getLoopLatch());
				unsigned Lanes = cast<VectorType>(Insert->getType())->getNumElements();

				Instruction *LHS = nullptr;
				if (!match(InLoop, m_Add(m_Instruction(LHS), m_SpecificInt(Lanes))))
				return false;

				return LHS == Phi;
				}

				static VectorType* getVectorType(IntrinsicInst *I) {
				unsigned TypeOp = I->getIntrinsicID() == Intrinsic::masked_load ? 0 : 1;
				auto *PtrTy = cast<PointerType>(I->getOperand(TypeOp)->getType());
				return cast<VectorType>(PtrTy->getElementType());
				}

				bool MVETailPredication::IsPredicatedVectorLoop() {
				// Check that the loop contains at least one masked load/store intrinsic.
				// We only support 'normal' vector instructions - other than masked
				// load/stores.
				for (auto *BB : L->getBlocks()) {
				for (auto &I : *BB) {
				if (IsMasked(&I)) {
				VectorType *VecTy = getVectorType(cast<IntrinsicInst>(&I));
				unsigned Lanes = VecTy->getNumElements();
				unsigned ElementWidth = VecTy->getScalarSizeInBits();
				// MVE vectors are 128-bit, but don't support 128 x i1.
				// TODO: Can we support vectors larger than 128-bits?
				unsigned MaxWidth = TTI->getRegisterBitWidth(true);
				if (Lanes * ElementWidth != MaxWidth || Lanes == MaxWidth)
				return false;
				MaskedInsts.push_back(cast<IntrinsicInst>(&I));
				} else if (auto *Int = dyn_cast<IntrinsicInst>(&I)) {
				for (auto &U : Int->args()) {
				if (isa<VectorType>(U->getType()))
				return false;
				}
				}
				}
				}

				return !MaskedInsts.empty();
				}

				Value* MVETailPredication::ComputeElements(Value *TripCount,
				VectorType *VecTy) {
				const SCEV *TripCountSE = SE->getSCEV(TripCount);
				ConstantInt *VF = ConstantInt::get(cast<IntegerType>(TripCount->getType()),
				VecTy->getNumElements());

				if (VF->equalsInt(1))
				return nullptr;

				// TODO: Support constant trip counts.
				auto VisitAdd = [&](const SCEVAddExpr *S) -> const SCEVMulExpr* {
				if (auto *Const = dyn_cast<SCEVConstant>(S->getOperand(0))) {
				if (Const->getAPInt() != -VF->getValue())
				return nullptr;
				} else
				return nullptr;
				return dyn_cast<SCEVMulExpr>(S->getOperand(1));
				};

				auto VisitMul = [&](const SCEVMulExpr *S) -> const SCEVUDivExpr* {
				if (auto *Const = dyn_cast<SCEVConstant>(S->getOperand(0))) {
				if (Const->getValue() != VF)
				return nullptr;
				} else
				return nullptr;
				return dyn_cast<SCEVUDivExpr>(S->getOperand(1));
				};

				auto VisitDiv = [&](const SCEVUDivExpr *S) -> const SCEV* {
				if (auto *Const = dyn_cast<SCEVConstant>(S->getRHS())) {
				if (Const->getValue() != VF)
				return nullptr;
				} else
				return nullptr;

				if (auto *RoundUp = dyn_cast<SCEVAddExpr>(S->getLHS())) {
				if (auto *Const = dyn_cast<SCEVConstant>(RoundUp->getOperand(0))) {
				if (Const->getAPInt() != (VF->getValue() - 1))
				return nullptr;
				} else
				return nullptr;

				return RoundUp->getOperand(1);
				}
				return nullptr;
				};

				// TODO: Can we use SCEV helpers, such as findArrayDimensions, and friends to
				// determine the numbers of elements instead? Looks like this is what is used
				// for delinearization, but I'm not sure if it can be applied to the
				// vectorized form - at least not without a bit more work than I feel
				// comfortable with.

				// Search for Elems in the following SCEV:
				// (1 + ((-VF + (VF * (((VF - 1) + %Elems) /u VF))<nuw>) /u VF))<nuw><nsw>
				const SCEV *Elems = nullptr;
				if (auto *TC = dyn_cast<SCEVAddExpr>(TripCountSE))
				if (auto *Div = dyn_cast<SCEVUDivExpr>(TC->getOperand(1)))
				if (auto *Add = dyn_cast<SCEVAddExpr>(Div->getLHS()))
				if (auto *Mul = VisitAdd(Add))
				if (auto *Div = VisitMul(Mul))
				if (auto *Res = VisitDiv(Div))
				Elems = Res;

				if (!Elems)
				return nullptr;

				Instruction *InsertPt = L->getLoopPreheader()->getTerminator();
				if (!isSafeToExpandAt(Elems, InsertPt, *SE))
				return nullptr;

				auto DL = L->getHeader()->getModule()->getDataLayout();
				SCEVExpander Expander(*SE, DL, "elements");
				return Expander.expandCodeFor(Elems, Elems->getType(), InsertPt);
				}

				bool MVETailPredication::TryConvert(Value *TripCount) {
				if (!IsPredicatedVectorLoop())
				return false;

				LLVM_DEBUG(dbgs() << "TP: Found predicated vector loop.\n");

				// Walk through the masked intrinsics and try to find whether the predicate
				// operand is generated from an induction variable.
				Module *M = L->getHeader()->getModule();
				Type *Ty = IntegerType::get(M->getContext(), 32);
				SmallPtrSet<Value*, 4> Predicates;

				for (auto *I : MaskedInsts) {
				Intrinsic::ID ID = I->getIntrinsicID();
				unsigned PredOp = ID == Intrinsic::masked_load ? 2 : 3;
				Value *Predicate = I->getArgOperand(PredOp);
				if (Predicates.count(Predicate))
				continue;

				VectorType *VecTy = getVectorType(I);
				Value *NumElements = ComputeElements(TripCount, VecTy);
				if (!NumElements)
				continue;

				if (!isTailPredicate(Predicate, NumElements)) {
				LLVM_DEBUG(dbgs() << "TP: Not tail predicate: " << *Predicate << "\n");
				continue;
				}

				LLVM_DEBUG(dbgs() << "TP: Found tail predicate: " << *Predicate << "\n");
				Predicates.insert(Predicate);

				// Insert a phi to count the number of elements processed by the loop.
				IRBuilder<> Builder(L->getHeader()->getFirstNonPHI());
				PHINode *Processed = Builder.CreatePHI(Ty, 2);
				Processed->addIncoming(NumElements, L->getLoopPreheader());

				// Insert the intrinsic to represent the effect of tail predication.
				Builder.SetInsertPoint(cast<Instruction>(Predicate));
				ConstantInt *Factor =
				ConstantInt::get(cast<IntegerType>(Ty), VecTy->getNumElements());
				Intrinsic::ID VCTPID;
				switch (VecTy->getNumElements()) {
				default:
				llvm_unreachable("unexpected number of lanes");
				case 2: VCTPID = Intrinsic::arm_vctp64; break;
				case 4: VCTPID = Intrinsic::arm_vctp32; break;
				case 8: VCTPID = Intrinsic::arm_vctp16; break;
				case 16: VCTPID = Intrinsic::arm_vctp8; break;
				}
				Function *VCTP = Intrinsic::getDeclaration(M, VCTPID);
				// TODO: This add likely already exists in the loop.
				Value *Remaining = Builder.CreateSub(Processed, Factor);
				Value *TailPredicate = Builder.CreateCall(VCTP, Remaining);
				Predicate->replaceAllUsesWith(TailPredicate);

				// Add the incoming value to the new phi.
				Processed->addIncoming(Remaining, L->getLoopLatch());
				LLVM_DEBUG(dbgs() << "TP: Insert processed elements phi: "
				<< *Processed << "\n"
				<< "TP: Inserted VCTP: " << *TailPredicate << "\n");
				}

				for (auto I : L->blocks())
				DeleteDeadPHIs(I);

				return true;
				}

				Pass *llvm::createMVETailPredicationPass() {
				return new MVETailPredication();
				}

				char MVETailPredication::ID = 0;

				INITIALIZE_PASS_BEGIN(MVETailPredication, DEBUG_TYPE, DESC, false, false)
				INITIALIZE_PASS_END(MVETailPredication, DEBUG_TYPE, DESC, false, false)
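
The arithmetic that ComputeElements and isTailPredicate rely on can be checked outside LLVM. The block below is only a sketch under stated assumptions: hwLoopIterations, icmpUleLane, countLane and masksAgree are hypothetical helpers that are not part of this patch, and countLane models an element-count predicate as "lane is active iff lane < elements left at the start of the iteration", which is an assumption about the intended mask rather than a description of the generated code. It evaluates the trip-count shape (1 + ((-VF + (VF * (((VF - 1) + Elems) /u VF))) /u VF)) with wrapping 32-bit arithmetic and compares the vectorizer's icmp ule mask with the element-count mask on every iteration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the patch): the iteration count in the shape
// that ComputeElements searches for,
//   1 + ((-VF + VF * (((VF - 1) + Elems) /u VF)) /u VF)
// using wrapping unsigned arithmetic. Only meaningful for Elems >= 1, since
// the tests guard the loop with an N == 0 check.
constexpr uint32_t hwLoopIterations(uint32_t Elems, uint32_t VF) {
  uint32_t RoundedUp = VF * ((VF - 1 + Elems) / VF);
  return 1 + (RoundedUp - VF) / VF; // RoundedUp - VF == -VF + RoundedUp
}

// One lane of the vectorizer's mask: icmp ule (Index + Lane), (N - 1).
constexpr bool icmpUleLane(uint32_t Index, uint32_t Lane, uint32_t N) {
  return Index + Lane <= N - 1;
}

// One lane of an element-count mask (assumed model): the lane is active
// iff it is below the number of elements still to be processed.
constexpr bool countLane(uint32_t Lane, uint32_t ElemsLeft) {
  return Lane < ElemsLeft;
}

// Walk every iteration for the given N and VF and compare the two masks.
bool masksAgree(uint32_t N, uint32_t VF) {
  uint32_t Iterations = hwLoopIterations(N, VF);
  uint32_t ElemsLeft = N;
  for (uint32_t It = 0, Index = 0; It < Iterations; ++It, Index += VF) {
    for (uint32_t Lane = 0; Lane < VF; ++Lane)
      if (icmpUleLane(Index, Lane, N) != countLane(Lane, ElemsLeft))
        return false;
    ElemsLeft -= VF; // May wrap after the final iteration; the loop has ended.
  }
  return true;
}
```

For N = 10 and VF = 4 the helper gives three iterations, and on the last one only lanes 0 and 1 are active under either mask.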

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve,+lob %s -S -o - | FileCheck %s

				; CHECK-LABEL: mul_v16i8
				; CHECK: vector.body:
				; CHECK: %index = phi i32
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 16
				; CHECK: [[VCTP:%[^ ]+]] = call <16 x i1> @llvm.arm.vctp8(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* {{.*}}, i32 4, <16 x i1> [[VCTP]], <16 x i8> undef)
				; CHECK: [[LD1:%[^ ]+]] = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* {{.*}}, i32 4, <16 x i1> [[VCTP]], <16 x i8> undef)
				; CHECK: tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> {{.*}}, <16 x i8>* {{.*}}, i32 4, <16 x i1> [[VCTP]])
				define dso_local arm_aapcs_vfpcc void @mul_v16i8(i8* noalias nocapture readonly %a, i8* noalias nocapture readonly %b, i8* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 15
				%tmp9 = lshr i32 %tmp8, 4
				%tmp10 = shl nuw i32 %tmp9, 4
				%tmp11 = add i32 %tmp10, -16
				%tmp12 = lshr i32 %tmp11, 4
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <16 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <16 x i32> %broadcast.splatinsert10, <16 x i32> undef, <16 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer
				%induction = add <16 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%tmp = getelementptr inbounds i8, i8* %a, i32 %index
				%tmp1 = icmp ule <16 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i8* %tmp to <16 x i8>*
				%wide.masked.load = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp2, i32 4, <16 x i1> %tmp1, <16 x i8> undef)
				%tmp3 = getelementptr inbounds i8, i8* %b, i32 %index
				%tmp4 = bitcast i8* %tmp3 to <16 x i8>*
				%wide.masked.load2 = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp4, i32 4, <16 x i1> %tmp1, <16 x i8> undef)
				%mul = mul nsw <16 x i8> %wide.masked.load2, %wide.masked.load
				%tmp6 = getelementptr inbounds i8, i8* %c, i32 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i8>*
				tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> %mul, <16 x i8>* %tmp7, i32 4, <16 x i1> %tmp1)
				%index.next = add i32 %index, 16
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; CHECK-LABEL: mul_v8i16
				; CHECK: vector.body:
				; CHECK: %index = phi i32
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 8
				; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.vctp16(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* {{.*}}, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
				; CHECK: [[LD1:%[^ ]+]] = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* {{.*}}, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
				; CHECK: tail call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> {{.*}}, <8 x i16>* {{.*}}, i32 4, <8 x i1> [[VCTP]])
				define dso_local arm_aapcs_vfpcc void @mul_v8i16(i16* noalias nocapture readonly %a, i16* noalias nocapture readonly %b, i16* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 7
				%tmp9 = lshr i32 %tmp8, 3
				%tmp10 = shl nuw i32 %tmp9, 3
				%tmp11 = add i32 %tmp10, -8
				%tmp12 = lshr i32 %tmp11, 3
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <8 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <8 x i32> %broadcast.splatinsert10, <8 x i32> undef, <8 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%tmp = getelementptr inbounds i16, i16* %a, i32 %index
				%tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i16* %tmp to <8 x i16>*
				%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
				%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index
				%tmp4 = bitcast i16* %tmp3 to <8 x i16>*
				%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
				%mul = mul nsw <8 x i16> %wide.masked.load2, %wide.masked.load
				%tmp6 = getelementptr inbounds i16, i16* %c, i32 %index
				%tmp7 = bitcast i16* %tmp6 to <8 x i16>*
				tail call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %mul, <8 x i16>* %tmp7, i32 4, <8 x i1> %tmp1)
				%index.next = add i32 %index, 8
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; CHECK-LABEL: mul_v4i32
				; CHECK: vector.body:
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 4
				; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.vctp32(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: [[LD1:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.*}}, <4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]])
				define dso_local arm_aapcs_vfpcc void @mul_v4i32(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%mul = mul nsw <4 x i32> %wide.masked.load2, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %mul, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; CHECK-LABEL: copy_v2i64
				; CHECK: vector.body:
				; CHECK: %index = phi i32
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 2
				; CHECK: [[VCTP:%[^ ]+]] = call <2 x i1> @llvm.arm.vctp64(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>* {{.*}}, i32 4, <2 x i1> [[VCTP]], <2 x i64> undef)
				; CHECK: tail call void @llvm.masked.store.v2i64.p0v2i64(<2 x i64> [[LD0]], <2 x i64>* {{.*}}, i32 4, <2 x i1> [[VCTP]])
				define void @copy_v2i64(i64* %a, i64* %b, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 1
				%tmp9 = lshr i32 %tmp8, 1
				%tmp10 = shl nuw i32 %tmp9, 1
				%tmp11 = add i32 %tmp10, -2
				%tmp12 = lshr i32 %tmp11, 1
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <2 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <2 x i32> %broadcast.splatinsert10, <2 x i32> undef, <2 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <2 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <2 x i32> %broadcast.splatinsert, <2 x i32> undef, <2 x i32> zeroinitializer
				%induction = add <2 x i32> %broadcast.splat, <i32 0, i32 1>
				%tmp1 = icmp ule <2 x i32> %induction, %broadcast.splat11
				%tmp = getelementptr inbounds i64, i64* %a, i32 %index
				%tmp2 = bitcast i64* %tmp to <2 x i64>*
				%wide.masked.load = tail call <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>* %tmp2, i32 4, <2 x i1> %tmp1, <2 x i64> undef)
				%tmp3 = getelementptr inbounds i64, i64* %b, i32 %index
				%tmp7 = bitcast i64* %tmp3 to <2 x i64>*
				tail call void @llvm.masked.store.v2i64.p0v2i64(<2 x i64> %wide.masked.load, <2 x i64>* %tmp7, i32 4, <2 x i1> %tmp1)
				%index.next = add i32 %index, 2
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; CHECK-LABEL: split_vector
				; CHECK: vector.body:
				; CHECK: %index = phi i32
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 4
				; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.vctp32(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: [[LD1:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.*}}, <4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]])
				define dso_local arm_aapcs_vfpcc void @split_vector(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%extract.1.low = shufflevector <4 x i32> %wide.masked.load, <4 x i32> undef, < 2 x i32> < i32 0, i32 2>
				%extract.1.high = shufflevector <4 x i32> %wide.masked.load, <4 x i32> undef, < 2 x i32> < i32 1, i32 3>
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%extract.2.low = shufflevector <4 x i32> %wide.masked.load2, <4 x i32> undef, < 2 x i32> < i32 0, i32 2>
				%extract.2.high = shufflevector <4 x i32> %wide.masked.load2, <4 x i32> undef, < 2 x i32> < i32 1, i32 3>
				%mul = mul nsw <2 x i32> %extract.1.low, %extract.2.low
				%sub = sub nsw <2 x i32> %extract.1.high, %extract.2.high
				%combine = shufflevector <2 x i32> %mul, <2 x i32> %sub, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %combine, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; One of the loads now uses ult predicate.
				; CHECK-LABEL: mismatch_load_pred
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 4
				; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.vctp32(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: [[LD1:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> %wrong, <4 x i32> undef)
				; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.*}}, <4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]])
				define dso_local arm_aapcs_vfpcc void @mismatch_load_pred(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%wrong = icmp ult <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %wrong, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The store now uses ult predicate.
				; CHECK-LABEL: mismatch_store_pred
				; CHECK: %index = phi i32
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
				; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 4
				; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.vctp32(i32 [[REMAINING]])
				; CHECK: [[LD0:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: [[LD1:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.*}}, <4 x i32>* {{.*}}, i32 4, <4 x i1> %wrong)
				define dso_local arm_aapcs_vfpcc void @mismatch_store_pred(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%wrong = icmp ult <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %wrong)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)
				declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)
				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
				declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
				declare void @llvm.masked.store.v2i64.p0v2i64(<2 x i64>, <2 x i64>*, i32 immarg, <2 x i1>)
				declare <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>*, i32 immarg, <2 x i1>, <2 x i64>)
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
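
Note on the predicate tests above: the following is a minimal C++ sketch, not part of the patch, of why an `icmp ule` against a splat of %N - 1 can be replaced by a VCTP while an `icmp ult` against the same splat cannot. It assumes VCTP32 activates the first min(elems, 4) lanes; the helper names (icmpUle, icmpUlt, vctp32, uleMatchesVctp) are hypothetical and exist only for illustration.

```cpp
#include <array>
#include <cstdint>

// Four-lane predicate, lane 0 first.
using Mask4 = std::array<bool, 4>;

// Models `icmp ule <4 x i32> %induction, %splat`, where
// %induction = splat(index) + <0, 1, 2, 3> and %splat holds `bound`.
Mask4 icmpUle(uint32_t index, uint32_t bound) {
  Mask4 m{};
  for (uint32_t lane = 0; lane < 4; ++lane)
    m[lane] = (index + lane) <= bound;
  return m;
}

// Same comparison with the `ult` predicate used by %wrong.
Mask4 icmpUlt(uint32_t index, uint32_t bound) {
  Mask4 m{};
  for (uint32_t lane = 0; lane < 4; ++lane)
    m[lane] = (index + lane) < bound;
  return m;
}

// Assumed model of VCTP32: lanes numbered below `elems` are active.
Mask4 vctp32(uint32_t elems) {
  Mask4 m{};
  for (uint32_t lane = 0; lane < 4; ++lane)
    m[lane] = lane < elems;
  return m;
}

// True when `ule` against n - 1 agrees with vctp32(n - index) on every
// vector iteration. Assumes n >= 1 and small enough that index + 4 does not wrap.
bool uleMatchesVctp(uint32_t n) {
  for (uint32_t index = 0; index < n; index += 4)
    if (icmpUle(index, n - 1) != vctp32(n - index))
      return false;
  return true;
}
```

Under this model, the `ult` form drops the last active lane (for N = 10 and index = 8 it enables one lane where two elements remain), which is consistent with mismatch_store_pred keeping %wrong on the store and with wrong_pred_opcode being rejected.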

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/nested.ll

				; RUN: opt -mtriple=armv8.1m.main -mattr=+mve -S -mve-tail-predication -disable-mve-tail-predication=false %s -o - | FileCheck %s

				; TODO: Support extending loads
				; CHECK-LABEL: mat_vec_sext_i16
				; CHECK-NOT: call {{.*}} @llvm.arm.vctp
				define void @mat_vec_sext_i16(i16** nocapture readonly %A, i16* nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
				entry:
				%cmp24 = icmp eq i32 %N, 0
				br i1 %cmp24, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%n.rnd.up = add i32 %N, 3
				%n.vec = and i32 %n.rnd.up, -4
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert28 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat29 = shufflevector <4 x i32> %broadcast.splatinsert28, <4 x i32> undef, <4 x i32> zeroinitializer
				%tmp = add i32 %n.vec, -4
				%tmp1 = lshr i32 %tmp, 2
				%tmp2 = add nuw nsw i32 %tmp1, 1
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %middle.block, %for.cond1.preheader.us.preheader
				%i.025.us = phi i32 [ %inc10.us, %middle.block ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i16*, i16** %A, i32 %i.025.us
				%tmp3 = load i16*, i16** %arrayidx.us, align 4
				%arrayidx8.us = getelementptr inbounds i32, i32* %C, i32 %i.025.us
				%arrayidx8.promoted.us = load i32, i32* %arrayidx8.us, align 4
				%tmp4 = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 %arrayidx8.promoted.us, i32 0
				call void @llvm.set.loop.iterations.i32(i32 %tmp2)
				br label %vector.body

				vector.body: ; preds = %vector.body, %for.cond1.preheader.us
				%index = phi i32 [ 0, %for.cond1.preheader.us ], [ %index.next, %vector.body ]
				%vec.phi = phi <4 x i32> [ %tmp4, %for.cond1.preheader.us ], [ %tmp14, %vector.body ]
				%tmp5 = phi i32 [ %tmp2, %for.cond1.preheader.us ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp6 = getelementptr inbounds i16, i16* %tmp3, i32 %index
				%tmp7 = icmp ule <4 x i32> %induction, %broadcast.splat29
				%tmp8 = bitcast i16* %tmp6 to <4 x i16>*
				%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %tmp8, i32 2, <4 x i1> %tmp7, <4 x i16> undef)
				%tmp9 = sext <4 x i16> %wide.masked.load to <4 x i32>
				%tmp10 = getelementptr inbounds i16, i16* %B, i32 %index
				%tmp11 = bitcast i16* %tmp10 to <4 x i16>*
				%wide.masked.load30 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %tmp11, i32 2, <4 x i1> %tmp7, <4 x i16> undef)
				%tmp12 = sext <4 x i16> %wide.masked.load30 to <4 x i32>
				%tmp13 = mul nsw <4 x i32> %tmp12, %tmp9
				%tmp14 = add nsw <4 x i32> %tmp13, %vec.phi
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp5, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%tmp17 = select <4 x i1> %tmp7, <4 x i32> %tmp14, <4 x i32> %vec.phi
				%tmp18 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %tmp17)
				store i32 %tmp18, i32* %arrayidx8.us, align 4
				%inc10.us = add nuw i32 %i.025.us, 1
				%exitcond27 = icmp eq i32 %inc10.us, %N
				br i1 %exitcond27, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %middle.block, %entry
				ret void
				}

				; CHECK-LABEL: mat_vec_i32
				; CHECK: phi
				; CHECK: phi
				; CHECK: phi
				; CHECK: [[IV:%[^ ]+]] = phi i32 [ %N, %for.cond1.preheader.us ], [ [[REM:%[^ ]+]], %vector.body ]
				; CHECK: [[REM]] = sub i32 [[IV]], 4
				; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.vctp32(i32 [[REM]])
				; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
				define void @mat_vec_i32(i32** nocapture readonly %A, i32* nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
				entry:
				%cmp23 = icmp eq i32 %N, 0
				br i1 %cmp23, label %for.cond.cleanup, label %for.cond1.preheader.us.preheader

				for.cond1.preheader.us.preheader: ; preds = %entry
				%n.rnd.up = add i32 %N, 3
				%n.vec = and i32 %n.rnd.up, -4
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert27 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat28 = shufflevector <4 x i32> %broadcast.splatinsert27, <4 x i32> undef, <4 x i32> zeroinitializer
				%tmp = add i32 %n.vec, -4
				%tmp1 = lshr i32 %tmp, 2
				%tmp2 = add nuw nsw i32 %tmp1, 1
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %middle.block, %for.cond1.preheader.us.preheader
				%i.024.us = phi i32 [ %inc9.us, %middle.block ], [ 0, %for.cond1.preheader.us.preheader ]
				%arrayidx.us = getelementptr inbounds i32*, i32** %A, i32 %i.024.us
				%tmp3 = load i32*, i32** %arrayidx.us, align 4
				%arrayidx7.us = getelementptr inbounds i32, i32* %C, i32 %i.024.us
				%arrayidx7.promoted.us = load i32, i32* %arrayidx7.us, align 4
				%tmp4 = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 %arrayidx7.promoted.us, i32 0
				call void @llvm.set.loop.iterations.i32(i32 %tmp2)
				br label %vector.body

				vector.body: ; preds = %vector.body, %for.cond1.preheader.us
				%index = phi i32 [ 0, %for.cond1.preheader.us ], [ %index.next, %vector.body ]
				%vec.phi = phi <4 x i32> [ %tmp4, %for.cond1.preheader.us ], [ %tmp12, %vector.body ]
				%tmp5 = phi i32 [ %tmp2, %for.cond1.preheader.us ], [ %tmp13, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp6 = getelementptr inbounds i32, i32* %tmp3, i32 %index
				%tmp7 = icmp ule <4 x i32> %induction, %broadcast.splat28
				%tmp8 = bitcast i32* %tmp6 to <4 x i32>*
				%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp8, i32 4, <4 x i1> %tmp7, <4 x i32> undef)
				%tmp9 = getelementptr inbounds i32, i32* %B, i32 %index
				%tmp10 = bitcast i32* %tmp9 to <4 x i32>*
				%wide.masked.load29 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp10, i32 4, <4 x i1> %tmp7, <4 x i32> undef)
				%tmp11 = mul nsw <4 x i32> %wide.masked.load29, %wide.masked.load
				%tmp12 = add nsw <4 x i32> %vec.phi, %tmp11
				%index.next = add i32 %index, 4
				%tmp13 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp5, i32 1)
				%tmp14 = icmp ne i32 %tmp13, 0
				br i1 %tmp14, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%tmp15 = select <4 x i1> %tmp7, <4 x i32> %tmp12, <4 x i32> %vec.phi
				%tmp16 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %tmp15)
				store i32 %tmp16, i32* %arrayidx7.us, align 4
				%inc9.us = add nuw i32 %i.024.us, 1
				%exitcond26 = icmp eq i32 %inc9.us, %N
				br i1 %exitcond26, label %for.cond.cleanup, label %for.cond1.preheader.us

				for.cond.cleanup: ; preds = %middle.block, %entry
				ret void
				}

				; Function Attrs: argmemonly nounwind readonly willreturn
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>) #0

				; Function Attrs: argmemonly nounwind readonly willreturn
				declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>) #0

				; Function Attrs: nounwind readnone willreturn
				declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>) #1

				; Function Attrs: noduplicate nounwind
				declare void @llvm.set.loop.iterations.i32(i32) #2

				; Function Attrs: noduplicate nounwind
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #2

				attributes #0 = { argmemonly nounwind readonly willreturn }
				attributes #1 = { nounwind readnone willreturn }
				attributes #2 = { noduplicate nounwind }
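
Note on the iteration counts above: the value passed to @llvm.set.loop.iterations is computed in two equivalent ways in these tests. The C++ sketch below, not part of the patch, mirrors both sequences with wrapping 32-bit arithmetic and compares them with ceil(N / 4). The function names are hypothetical. The result is only meaningful for N > 0, which the tests guarantee by branching around the loop when %N is zero.

```cpp
#include <cstdint>

// Shift form (%tmp8 .. %tmp13):
//   ((((N + 3) >> 2) << 2) - 4) >> 2) + 1
uint32_t iterationsShiftForm(uint32_t n) {
  uint32_t rounded = ((n + 3u) >> 2) << 2;  // %tmp10: N rounded up to a multiple of 4
  return ((rounded - 4u) >> 2) + 1u;        // %tmp13
}

// Mask form used in nested.ll (%n.rnd.up, %n.vec, %tmp .. %tmp2):
//   ((((N + 3) & -4) - 4) >> 2) + 1
uint32_t iterationsMaskForm(uint32_t n) {
  uint32_t nVec = (n + 3u) & ~3u;           // %n.vec; -4 as i32 has the same bits as ~3u
  return ((nVec - 4u) >> 2) + 1u;           // %tmp2
}

// Reference value: ceil(n / 4) without overflow.
uint32_t ceilDiv4(uint32_t n) {
  return n / 4u + (n % 4u != 0u ? 1u : 0u);
}
```

For example, N = 10 gives three vector iterations in both forms, with the final iteration covering the two leftover elements under predication.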

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-narrow.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve,+lob %s -S -o - | FileCheck %s

				; TODO: We should be able to generate a vctp for the loads.
				; CHECK-LABEL: trunc_v4i32_v4i16
				; CHECK-NOT: vctp
				define void @trunc_v4i32_v4i16(i32* readonly %a, i32* readonly %b, i16* %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%mul = mul nsw <4 x i32> %wide.masked.load2, %wide.masked.load
				%trunc = trunc <4 x i32> %mul to <4 x i16>
				%tmp6 = getelementptr inbounds i16, i16* %c, i32 %index
				%tmp7 = bitcast i16* %tmp6 to <4 x i16>*
				tail call void @llvm.masked.store.v4i16.p0v4i16(<4 x i16> %trunc, <4 x i16>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
				declare void @llvm.masked.store.v4i16.p0v4i16(<4 x i16>, <4 x i16>*, i32 immarg, <4 x i1>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-pattern-fail.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve,+lob %s -S -o - | FileCheck %s

				; The following functions should all fail to become tail-predicated.
				; CHECK-NOT: call {{.*}} @llvm.arm.vctp

				; trip.count.minus.1 has been inserted into element 1, not 0.
				define dso_local arm_aapcs_vfpcc void @wrong_ph_insert_0(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 1
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The insert isn't using an undef for operand 0.
				define dso_local arm_aapcs_vfpcc void @wrong_ph_insert_def(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> <i32 1, i32 1, i32 1, i32 1>, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The shuffle uses a defined value for operand 1.
				define dso_local arm_aapcs_vfpcc void @wrong_ph_shuffle_1(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The shuffle uses a non zero value for operand 2.
				define dso_local arm_aapcs_vfpcc void @wrong_ph_shuffle_2(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The splatted bound is %N - 2 rather than %N - 1 (and is inserted at element 1).
				define dso_local arm_aapcs_vfpcc void @trip_count_minus_2(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.2 = add i32 %N, -2
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.2, i32 1
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; index has been inserted at element 1, not 0.
				define dso_local arm_aapcs_vfpcc void @wrong_loop_insert(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 1
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The value splatted in the loop is %index + 1, not %index.
				define dso_local arm_aapcs_vfpcc void @wrong_loop_invalid_index_splat(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%incorrect = add i32 %index, 1
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %incorrect, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; Now using ult, not ule for the vector icmp
				define dso_local arm_aapcs_vfpcc void @wrong_pred_opcode(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ult <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; The add in the body uses 1, 2, 3, 4
				define void @wrong_body_broadcast_splat(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 1, i32 2, i32 3, i32 4>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; Using a variable for the loop body broadcast.
				define void @wrong_body_broadcast_splat_2(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N, <4 x i32> %offsets) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, %offsets
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; adding 5, instead of 4, to index.
				define void @wrong_index_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 5
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>) #1
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>) #2
				declare void @llvm.set.loop.iterations.i32(i32) #3
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #3
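
The functions in this file each perturb one piece of the lane-mask pattern (the compare opcode, the lane offsets, the index step). The following is a minimal Python sketch, not part of the patch, of why the unperturbed mask `icmp ule (index + lane), N - 1` can be expressed as an element count. `icmp_ule_mask`, `element_count_mask` and `icmp_ult_mask` are hypothetical helper names, and the sketch assumes N >= 1 and no 32-bit wraparound.

```python
def icmp_ule_mask(index, n, lanes):
    """Mask as written in the tests: (index + lane) <= (n - 1), per lane."""
    return [(index + lane) <= (n - 1) for lane in range(lanes)]


def element_count_mask(remaining, lanes):
    """Hypothetical model of an element-count predicate:
    lanes numbered below `remaining` are active, the rest are inactive."""
    return [lane < remaining for lane in range(lanes)]


def icmp_ult_mask(index, n, lanes):
    """The variant used by @wrong_pred_opcode: ult instead of ule."""
    return [(index + lane) < (n - 1) for lane in range(lanes)]


def masks_agree(n, lanes):
    """True if both formulations give the same mask on every iteration."""
    return all(
        icmp_ule_mask(i, n, lanes) == element_count_mask(n - i, lanes)
        for i in range(0, n, lanes)
    )
```

With N = 6 and four lanes, the second iteration (index 4) yields two active lanes under `ule`, matching a count of the 2 remaining elements, whereas the `ult` variant yields only one active lane and so no longer corresponds to an element count.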

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-widen.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve,+lob %s -S -o - | FileCheck %s

				; CHECK-LABEL: expand_v8i16_v8i32
				; CHECK-NOT: call i32 @llvm.arm.vctp
				define void @expand_v8i16_v8i32(i16* noalias nocapture readonly %a, i16* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 7
				%tmp9 = lshr i32 %tmp8, 3
				%tmp10 = shl nuw i32 %tmp9, 3
				%tmp11 = add i32 %tmp10, -8
				%tmp12 = lshr i32 %tmp11, 3
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <8 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <8 x i32> %broadcast.splatinsert10, <8 x i32> undef, <8 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%tmp = getelementptr inbounds i16, i16* %a, i32 %index
				%tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i16* %tmp to <8 x i16>*
				%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
				%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index
				%tmp4 = bitcast i16* %tmp3 to <8 x i16>*
				%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
				%expand.1 = zext <8 x i16> %wide.masked.load to <8 x i32>
				%expand.2 = zext <8 x i16> %wide.masked.load2 to <8 x i32>
				%mul = mul nsw <8 x i32> %expand.2, %expand.1
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
				%tmp7 = bitcast i32* %tmp6 to <8 x i32>*
				tail call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> %mul, <8 x i32>* %tmp7, i32 4, <8 x i1> %tmp1)
				%index.next = add i32 %index, 8
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; CHECK-LABEL: expand_v8i16_v4i32
				; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[ELEMS_REM:%[^ ]+]], %vector.body ]
				; CHECK: [[ELEMS_REM]] = sub i32 [[ELEMS]], 8
				; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.vctp16(i32 [[ELEMS_REM]])
				; CHECK: tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* {{.*}}, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
				; CHECK: %store.pred = icmp ule <4 x i32> %induction.store
				; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.*}}, <4 x i32>* {{.*}}, i32 4, <4 x i1> %store.pred)
				; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.*}}, <4 x i32>* {{.*}}, i32 4, <4 x i1> %store.pred)
				define void @expand_v8i16_v4i32(i16* readonly %a, i16* readonly %b, i32* %c, i32* %d, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 7
				%tmp9 = lshr i32 %tmp8, 3
				%tmp10 = shl nuw i32 %tmp9, 3
				%tmp11 = add i32 %tmp10, -8
				%tmp12 = lshr i32 %tmp11, 3
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <8 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <8 x i32> %broadcast.splatinsert10, <8 x i32> undef, <8 x i32> zeroinitializer
				%broadcast.splatinsert10.store = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11.store = shufflevector <4 x i32> %broadcast.splatinsert10.store, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%store.idx = phi i32 [ 0, %vector.ph ], [ %store.idx.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%tmp = getelementptr inbounds i16, i16* %a, i32 %index
				%tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i16* %tmp to <8 x i16>*
				%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
				%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index
				%tmp4 = bitcast i16* %tmp3 to <8 x i16>*
				%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
				%extract.2.low = shufflevector <8 x i16> %wide.masked.load2, <8 x i16> undef, < 4 x i32> <i32 0, i32 1, i32 2, i32 3>
				%extract.2.high = shufflevector <8 x i16> %wide.masked.load2, <8 x i16> undef, < 4 x i32> <i32 4, i32 5, i32 6, i32 7>
				%expand.1 = zext <4 x i16> %extract.2.low to <4 x i32>
				%expand.2 = zext <4 x i16> %extract.2.high to <4 x i32>
				%mul = mul nsw <4 x i32> %expand.2, %expand.1
				%sub = mul nsw <4 x i32> %expand.1, %expand.2
				%broadcast.splatinsert.store = insertelement <4 x i32> undef, i32 %store.idx, i32 0
				%broadcast.splat.store = shufflevector <4 x i32> %broadcast.splatinsert.store, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction.store = add <4 x i32> %broadcast.splat.store, <i32 0, i32 1, i32 2, i32 3>
				%store.pred = icmp ule <4 x i32> %induction.store, %broadcast.splat11.store
				%tmp6 = getelementptr inbounds i32, i32* %c, i32 %store.idx
				%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %mul, <4 x i32>* %tmp7, i32 4, <4 x i1> %store.pred)
				%gep = getelementptr inbounds i32, i32* %d, i32 %store.idx
				%cast.gep = bitcast i32* %gep to <4 x i32>*
				tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %sub, <4 x i32>* %cast.gep, i32 4, <4 x i1> %store.pred)
				%store.idx.next = add i32 %store.idx, 4
				%index.next = add i32 %index, 8
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				; CHECK-LABEL: expand_v4i32_v4i64
				; CHECK-NOT: call i32 @llvm.arm.vctp
				define void @expand_v4i32_v4i64(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i64* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 3
				%tmp9 = lshr i32 %tmp8, 2
				%tmp10 = shl nuw i32 %tmp9, 2
				%tmp11 = add i32 %tmp10, -4
				%tmp12 = lshr i32 %tmp11, 2
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
				%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
				%tmp = getelementptr inbounds i32, i32* %a, i32 %index
				%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i32* %tmp to <4 x i32>*
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
				%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
				%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
				%expand.1 = zext <4 x i32> %wide.masked.load to <4 x i64>
				%expand.2 = zext <4 x i32> %wide.masked.load2 to <4 x i64>
				%mul = mul nsw <4 x i64> %expand.2, %expand.1
				%tmp6 = getelementptr inbounds i64, i64* %c, i32 %index
				%tmp7 = bitcast i64* %tmp6 to <4 x i64>*
				tail call void @llvm.masked.store.v4i64.p0v4i64(<4 x i64> %mul, <4 x i64>* %tmp7, i32 4, <4 x i1> %tmp1)
				%index.next = add i32 %index, 4
				%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
				%tmp16 = icmp ne i32 %tmp15, 0
				br i1 %tmp16, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
				declare void @llvm.masked.store.v8i32.p0v8i32(<8 x i32>, <8 x i32>*, i32 immarg, <8 x i1>)
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
				declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
				declare void @llvm.masked.store.v4i64.p0v4i64(<4 x i64>, <4 x i64>*, i32 immarg, <4 x i1>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
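
Every test above computes the value passed to `@llvm.set.loop.iterations.i32` with the same add/shift sequence (`%tmp8` through `%tmp13`). As a rough illustration only, the Python sketch below mirrors that sequence and shows that it equals ceil(N / VF) for N >= 1. The function name `loop_iterations` is hypothetical, `vf` is assumed to be a power of two, and 32-bit wraparound is ignored.

```python
def loop_iterations(n, vf):
    """Mirror of the %tmp8..%tmp13 sequence for a vector factor `vf`."""
    shift = vf.bit_length() - 1   # log2(vf), valid because vf is a power of two
    tmp8 = n + (vf - 1)           # add i32 %N, vf-1
    tmp9 = tmp8 >> shift          # lshr i32 %tmp8, log2(vf)
    tmp10 = tmp9 << shift         # shl nuw i32 %tmp9, log2(vf)
    tmp11 = tmp10 - vf            # add i32 %tmp10, -vf
    tmp12 = tmp11 >> shift        # lshr i32 %tmp11, log2(vf)
    return tmp12 + 1              # add nuw nsw i32 %tmp12, 1
```

For example, N = 5 with four lanes gives two iterations: one full vector and one partially predicated vector.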

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - | FileCheck %s

				; CHECK-LABEL: reduction_i32
				; CHECK: phi i32 [ 0, %entry ]
				; CHECK: phi <8 x i16> [ zeroinitializer, %entry ]
				; CHECK: phi i32
				; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %entry ], [ [[ELEMS:%[^ ]+]], %vector.body ]
				; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8
				; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.vctp16(i32 [[ELEMS]])
				; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
				; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp6, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
				define i16 @reduction_i32(i16* nocapture readonly %A, i16* nocapture readonly %B, i32 %N) {
				entry:
				%tmp = add i32 %N, -1
				%n.rnd.up = add nuw nsw i32 %tmp, 8
				%n.vec = and i32 %n.rnd.up, -8
				%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0
				%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer
				%0 = add i32 %n.vec, -8
				%1 = lshr i32 %0, 3
				%2 = add nuw nsw i32 %1, 1
				call void @llvm.set.loop.iterations.i32(i32 %2)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <8 x i16> [ zeroinitializer, %entry ], [ %tmp8, %vector.body ]
				%3 = phi i32 [ %2, %entry ], [ %4, %vector.body ]
				%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index
				%tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				%tmp4 = bitcast i16* %tmp2 to <8 x i16>*
				%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
				%tmp5 = getelementptr inbounds i16, i16* %B, i32 %index
				%tmp6 = bitcast i16* %tmp5 to <8 x i16>*
				%wide.masked.load3 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp6, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
				%tmp7 = add <8 x i16> %wide.masked.load, %vec.phi
				%tmp8 = add <8 x i16> %tmp7, %wide.masked.load3
				%index.next = add nuw nsw i32 %index, 8
				%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)
				%5 = icmp ne i32 %4, 0
				br i1 %5, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]
				%.lcssa3 = phi <8 x i1> [ %tmp3, %vector.body ]
				%.lcssa = phi <8 x i16> [ %tmp8, %vector.body ]
				%tmp10 = select <8 x i1> %.lcssa3, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa
				%rdx.shuf = shufflevector <8 x i16> %tmp10, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <8 x i16> %rdx.shuf, %tmp10
				%rdx.shuf4 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx5 = add <8 x i16> %rdx.shuf4, %bin.rdx
				%rdx.shuf6 = shufflevector <8 x i16> %bin.rdx5, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx7 = add <8 x i16> %rdx.shuf6, %bin.rdx5
				%tmp11 = extractelement <8 x i16> %bin.rdx7, i32 0
				ret i16 %tmp11
				}

				; CHECK-LABEL: reduction_i32_with_scalar
				; CHECK: phi i32 [ 0, %entry ]
				; CHECK: phi <8 x i16> [ zeroinitializer, %entry ]
				; CHECK: phi i32
				; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %entry ], [ [[ELEMS:%[^ ]+]], %vector.body ]
				; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8
				; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.vctp16(i32 [[ELEMS]])
				; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
				define i16 @reduction_i32_with_scalar(i16* nocapture readonly %A, i16 %B, i32 %N) local_unnamed_addr {
				entry:
				%tmp = add i32 %N, -1
				%n.rnd.up = add nuw nsw i32 %tmp, 8
				%n.vec = and i32 %n.rnd.up, -8
				%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0
				%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer
				%broadcast.splatinsert3 = insertelement <8 x i16> undef, i16 %B, i32 0
				%broadcast.splat4 = shufflevector <8 x i16> %broadcast.splatinsert3, <8 x i16> undef, <8 x i32> zeroinitializer
				%0 = add i32 %n.vec, -8
				%1 = lshr i32 %0, 3
				%2 = add nuw nsw i32 %1, 1
				call void @llvm.set.loop.iterations.i32(i32 %2)
				br label %vector.body

				vector.body: ; preds = %vector.body, %entry
				%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <8 x i16> [ zeroinitializer, %entry ], [ %tmp6, %vector.body ]
				%3 = phi i32 [ %2, %entry ], [ %4, %vector.body ]
				%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index
				%tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				%tmp4 = bitcast i16* %tmp2 to <8 x i16>*
				%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
				%tmp5 = add <8 x i16> %vec.phi, %broadcast.splat4
				%tmp6 = add <8 x i16> %tmp5, %wide.masked.load
				%index.next = add nuw nsw i32 %index, 8
				%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)
				%5 = icmp ne i32 %4, 0
				br i1 %5, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%tmp8 = select <8 x i1> %tmp3, <8 x i16> %tmp6, <8 x i16> %vec.phi
				%rdx.shuf = shufflevector <8 x i16> %tmp8, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <8 x i16> %rdx.shuf, %tmp8
				%rdx.shuf5 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx6 = add <8 x i16> %rdx.shuf5, %bin.rdx
				%rdx.shuf7 = shufflevector <8 x i16> %bin.rdx6, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx8 = add <8 x i16> %rdx.shuf7, %bin.rdx6
				%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0
				ret i16 %tmp9
				}

				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
				declare void @llvm.set.loop.iterations.i32(i32)
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
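
The `middle.block` in both reduction tests folds the eight i16 lanes with three shuffle-and-add steps (shifts of 4, 2 and 1) and then extracts lane 0. A small Python sketch of that tree is given below as an illustration only. `tree_reduce_add_i16` is a hypothetical name, and the `undef` shuffle lanes are modelled as 0, which is safe here only because they never feed lane 0.

```python
def tree_reduce_add_i16(v):
    """Model of the rdx.shuf / bin.rdx sequence on an <8 x i16> value."""
    assert len(v) == 8
    vec = [x & 0xFFFF for x in v]
    for shift in (4, 2, 1):
        # shufflevector: lane i takes lane i + shift; the remaining lanes are undef.
        shuffled = [vec[i + shift] if i < shift else 0 for i in range(8)]
        # add <8 x i16>: wraps modulo 2**16.
        vec = [(a + b) & 0xFFFF for a, b in zip(shuffled, vec)]
    # extractelement ..., i32 0
    return vec[0]
```

The result equals the sum of all eight lanes modulo 2^16.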

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/vector-unroll.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve,+lob %s -S -o - | FileCheck %s

				; TODO: The unrolled pattern is preventing the transform
				; CHECK-LABEL: mul_v16i8_unroll
				; CHECK-NOT: call i32 @llvm.arm.vctp
				define void @mul_v16i8_unroll(i8* noalias nocapture readonly %a, i8* noalias nocapture readonly %b, i8* noalias nocapture %c, i32 %N) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				%tmp8 = add i32 %N, 15
				%tmp9 = lshr i32 %tmp8, 4
				%tmp10 = shl nuw i32 %tmp9, 4
				%tmp11 = add i32 %tmp10, -16
				%tmp12 = lshr i32 %tmp11, 4
				%tmp13 = add nuw nsw i32 %tmp12, 1
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph: ; preds = %entry
				%trip.count.minus.1 = add i32 %N, -1
				%broadcast.splatinsert10 = insertelement <16 x i32> undef, i32 %trip.count.minus.1, i32 0
				%broadcast.splat11 = shufflevector <16 x i32> %broadcast.splatinsert10, <16 x i32> undef, <16 x i32> zeroinitializer
				%xtraiter = and i32 %tmp13, 1
				%0 = icmp ult i32 %tmp12, 1
				br i1 %0, label %for.cond.cleanup.loopexit.unr-lcssa, label %vector.ph.new

				vector.ph.new: ; preds = %vector.ph
				call void @llvm.set.loop.iterations.i32(i32 %tmp13)
				%unroll_iter = sub i32 %tmp13, %xtraiter
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph.new
				%index = phi i32 [ 0, %vector.ph.new ], [ %index.next.1, %vector.body ]
				%niter = phi i32 [ %unroll_iter, %vector.ph.new ], [ %niter.nsub.1, %vector.body ]
				%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer
				%induction = add <16 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%tmp = getelementptr inbounds i8, i8* %a, i32 %index
				%tmp1 = icmp ule <16 x i32> %induction, %broadcast.splat11
				%tmp2 = bitcast i8* %tmp to <16 x i8>*
				%wide.masked.load = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp2, i32 4, <16 x i1> %tmp1, <16 x i8> undef)
				%tmp3 = getelementptr inbounds i8, i8* %b, i32 %index
				%tmp4 = bitcast i8* %tmp3 to <16 x i8>*
				%wide.masked.load2 = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp4, i32 4, <16 x i1> %tmp1, <16 x i8> undef)
				%mul = mul nsw <16 x i8> %wide.masked.load2, %wide.masked.load
				%tmp6 = getelementptr inbounds i8, i8* %c, i32 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i8>*
				tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> %mul, <16 x i8>* %tmp7, i32 4, <16 x i1> %tmp1)
				%index.next = add nuw nsw i32 %index, 16
				%niter.nsub = sub i32 %niter, 1
				%broadcast.splatinsert.1 = insertelement <16 x i32> undef, i32 %index.next, i32 0
				%broadcast.splat.1 = shufflevector <16 x i32> %broadcast.splatinsert.1, <16 x i32> undef, <16 x i32> zeroinitializer
				%induction.1 = add <16 x i32> %broadcast.splat.1, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%tmp.1 = getelementptr inbounds i8, i8* %a, i32 %index.next
				%tmp1.1 = icmp ule <16 x i32> %induction.1, %broadcast.splat11
				%tmp2.1 = bitcast i8* %tmp.1 to <16 x i8>*
				%wide.masked.load.1 = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp2.1, i32 4, <16 x i1> %tmp1.1, <16 x i8> undef)
				%tmp3.1 = getelementptr inbounds i8, i8* %b, i32 %index.next
				%tmp4.1 = bitcast i8* %tmp3.1 to <16 x i8>*
				%wide.masked.load2.1 = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp4.1, i32 4, <16 x i1> %tmp1.1, <16 x i8> undef)
				%mul.1 = mul nsw <16 x i8> %wide.masked.load2.1, %wide.masked.load.1
				%tmp6.1 = getelementptr inbounds i8, i8* %c, i32 %index.next
				%tmp7.1 = bitcast i8* %tmp6.1 to <16 x i8>*
				tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> %mul.1, <16 x i8>* %tmp7.1, i32 4, <16 x i1> %tmp1.1)
				%index.next.1 = add i32 %index.next, 16
				%niter.nsub.1 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %niter.nsub, i32 1)
				%niter.ncmp.1 = icmp ne i32 %niter.nsub.1, 0
				br i1 %niter.ncmp.1, label %vector.body, label %for.cond.cleanup.loopexit.unr-lcssa.loopexit

				for.cond.cleanup.loopexit.unr-lcssa.loopexit: ; preds = %vector.body
				%index.unr.ph = phi i32 [ %index.next.1, %vector.body ]
				%tmp14.unr.ph = phi i32 [ -2, %vector.body ]
				br label %for.cond.cleanup.loopexit.unr-lcssa

				for.cond.cleanup.loopexit.unr-lcssa: ; preds = %for.cond.cleanup.loopexit.unr-lcssa.loopexit, %vector.ph
				%index.unr = phi i32 [ 0, %vector.ph ], [ %index.unr.ph, %for.cond.cleanup.loopexit.unr-lcssa.loopexit ]
				%tmp14.unr = phi i32 [ %tmp13, %vector.ph ], [ %tmp14.unr.ph, %for.cond.cleanup.loopexit.unr-lcssa.loopexit ]
				%lcmp.mod = icmp ne i32 %xtraiter, 0
				br i1 %lcmp.mod, label %vector.body.epil.preheader, label %for.cond.cleanup.loopexit

				vector.body.epil.preheader: ; preds = %for.cond.cleanup.loopexit.unr-lcssa
				br label %vector.body.epil

				vector.body.epil: ; preds = %vector.body.epil.preheader
				%index.epil = phi i32 [ %index.unr, %vector.body.epil.preheader ]
				%tmp14.epil = phi i32 [ %tmp14.unr, %vector.body.epil.preheader ]
				%broadcast.splatinsert.epil = insertelement <16 x i32> undef, i32 %index.epil, i32 0
				%broadcast.splat.epil = shufflevector <16 x i32> %broadcast.splatinsert.epil, <16 x i32> undef, <16 x i32> zeroinitializer
				%induction.epil = add <16 x i32> %broadcast.splat.epil, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%tmp.epil = getelementptr inbounds i8, i8* %a, i32 %index.epil
				%tmp1.epil = icmp ule <16 x i32> %induction.epil, %broadcast.splat11
				%tmp2.epil = bitcast i8* %tmp.epil to <16 x i8>*
				%wide.masked.load.epil = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp2.epil, i32 4, <16 x i1> %tmp1.epil, <16 x i8> undef)
				%tmp3.epil = getelementptr inbounds i8, i8* %b, i32 %index.epil
				%tmp4.epil = bitcast i8* %tmp3.epil to <16 x i8>*
				%wide.masked.load2.epil = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp4.epil, i32 4, <16 x i1> %tmp1.epil, <16 x i8> undef)
				%mul.epil = mul nsw <16 x i8> %wide.masked.load2.epil, %wide.masked.load.epil
				%tmp6.epil = getelementptr inbounds i8, i8* %c, i32 %index.epil
				%tmp7.epil = bitcast i8* %tmp6.epil to <16 x i8>*
				tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> %mul.epil, <16 x i8>* %tmp7.epil, i32 4, <16 x i1> %tmp1.epil)
				%index.next.epil = add i32 %index.epil, 16
				%tmp15.epil = add nuw nsw i32 %tmp14.epil, -1
				%tmp16.epil = icmp ne i32 %tmp15.epil, 0
				br label %for.cond.cleanup.loopexit.epilog-lcssa

				for.cond.cleanup.loopexit.epilog-lcssa: ; preds = %vector.body.epil
				br label %for.cond.cleanup.loopexit

				for.cond.cleanup.loopexit: ; preds = %for.cond.cleanup.loopexit.unr-lcssa, %for.cond.cleanup.loopexit.epilog-lcssa
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				ret void
				}

				declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>) #1
				declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>) #2
				declare void @llvm.set.loop.iterations.i32(i32) #3
				declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #3

Diff 219044

llvm/trunk/include/llvm/IR/IntrinsicsARM.td

llvm/trunk/lib/Target/ARM/ARM.h

llvm/trunk/lib/Target/ARM/ARMTargetMachine.cpp

llvm/trunk/lib/Target/ARM/CMakeLists.txt

llvm/trunk/lib/Target/ARM/MVETailPredication.cpp

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/nested.ll

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-narrow.ll

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-pattern-fail.ll

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-widen.ll

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll

llvm/trunk/test/CodeGen/Thumb2/LowOverheadLoops/vector-unroll.ll
