This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/ARM/
-
Target/
-
ARM/
7/32
MVETailPredication.cpp
-
test/CodeGen/Thumb2/
-
CodeGen/
-
Thumb2/
-
LowOverheadLoops/
-
basic-tail-pred.ll
-
clear-maskedinsts.ll
-
cond-vector-reduce-mve-codegen.ll
-
extending-loads.ll
-
fast-fp-loops.ll
-
mve-tail-data-types.ll
-
nested.ll
1/2
tail-pred-const.ll
-
tail-pred-widen.ll
-
tail-reduce.ll
-
vector-arith-codegen.ll
-
vector-reduce-mve-tail.ll
-
mve-fma-loops.ll

Differential D79175

[ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC
ClosedPublic

Authored by SjoerdMeijer on Apr 30 2020, 6:59 AM.

Download Raw Diff

Details

Reviewers

samparker
dmgreen
efriedma

Commits

rGd1522513d4c4: [ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask

Summary

With D79100 we can rely on @llvm.get.active.lane.mask() that is generated by the vectoriser to get the number of elements processed by the loop, which is required to set up tail-predication. This intrinsic generates the predicate for the masked loads/stores, and consumes the Backedge Taken Count (BTC) as its second argument; we can now simply extract and use that to set up tail-predication.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Apr 30 2020, 6:59 AM

Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2020, 6:59 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. · View Herald Transcript

SjoerdMeijer mentioned this in D79100: [LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops.Apr 30 2020, 7:27 AM

I was really hoping to see a lot of stuff being deleted from this pass, not added... isn't half of the original code now redundant?

In D79175#2012797, @samparker wrote:

I was really hoping to see a lot of stuff being deleted from this pass, not added... isn't half of the original code now redundant?

Copied from the description of this change:

Now we pick up this intrinsic the number of elements, which simplifies the pattern matching we were doing to find this value. I have not yet removed the pattern matching because that would require changing of a lot of tests. Thus, for now, the intrinsic and pattern matching coexist together, but as a follow up we probably want to remove this.

I propose we focus on handling first this new intrinsic. At this moment the original code is still used, because it will be triggered by all existing tests which need updating.
Handling intrinsics, removing half the code, and updating all tests is a massive change that doesn't make reviewing it easier, so thought that this is best done in steps, and this is the first one.

But fair enough, I will start working on that, and will do that here or in a separate diff while I wait for feedback on its parent D79100.

Spring clean up: this deletes half the pass, i.e. all the pattern matching.
This is possible because of intrinsic @llvm.get.active.lane.mask()that will be generated by D79100 and friends.

I am posting this for review, while I am fixing up the remaining test cases. I.e., I have modified one to the new situation: llvm/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll, but now I need to fix up the others too, but that will be more of the same, so thought it was good to post this already.

efriedma added inline comments.May 27 2020, 4:29 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
544	This addition can overflow.
611–617	Do you need to check that the first argument to get_active_lane_mask is an induction variable for the loop "L"? Do you need to check the types of the arguments to get_active_lane_mask?

efriedma added inline comments.May 27 2020, 5:14 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
576	This subtraction can also overflow.

A glorious amount of red in this diff.

llvm/lib/Target/ARM/MVETailPredication.cpp
544	And that's not okay, right? Trip count will always be BTC + 1 and we don't handle uncountable loops.
576	But that's okay, right? This predication is only really useful when wrapping happens and the intrinsic reflects that overflow can/will happen.

efriedma added inline comments.May 28 2020, 12:52 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
544	The masks are wrong if it overflows.
576	The problem here is that the original get_active_lane_mask can do something like this: Iteration 1: get_active_lane_mask(0, 5) -> all-true Iteration 2: get_active_lane_mask(4, 5) -> <true, false, false, false> Iteration 3: get_active_lane_mask(8, 5) -> all-false In the rewritten code, you end up with this: Iteration 1: vctp(5) -> all-true Iteration 2: vctp(1)- > <true, false, false, false> Iteration 3: vctp(-3)- > all-true There are a couple ways you could deal with this: Use saturating subtraction Prove the trip count is small enough that this never happens, i.e. `ceil(ElementCount / VectorWidth) >= TripCount`.

efriedma added inline comments.May 28 2020, 12:57 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
576	Using saturating subtraction would require proving that the induction variable used as the first argument to llvm.get.active.mask doesn't overflow. But I guess you have to check that anyway.

SjoerdMeijer mentioned this in rG7fb8a40e5220: New intrinsic @llvm.get.active.lane.mask().May 29 2020, 1:03 AM

samparker added inline comments.May 29 2020, 1:10 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
576	I would be hesitate to introduce saturating math here. I think the reasonable assumption is that the sub will wrap, but only on the final iteration. So just asserting that the element count is within the bounds of the trip count should be fine.

SjoerdMeijer added inline comments.May 29 2020, 7:40 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
576	The problem here is that the original get_active_lane_mask can do something like this: Iteration 1: get_active_lane_mask(0, 5) -> all-true Iteration 2: get_active_lane_mask(4, 5) -> <true, false, false, false> Iteration 3: get_active_lane_mask(8, 5) -> all-false In general, I see the problem, and I see that this can happen. But here in this context, I was wondering if it not actually boils down to Sam's earlier remark about countable loops. That is, before we were pattern matching a particular pattern produced by the vectoriser, we are handling (vector) loops produced by the vectoriser. Thus, we will never execute Iteration #3 from the example above, because `ceil(ElementCount / VectorWidth) >= TripCount` will always hold for these loops. My question is, with the intrinsic approach, can we still rely on that, would that be a valid assumption to make? Along these same lines, this pass also relies on a check `IsPredicatedVectorLoop` and presence of masked loads/stores currently produced by the vectoriser, which gets its predicate from @llvm.get.active.lane.mask also generated by the vectoriser.

efriedma added inline comments.May 29 2020, 11:06 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
576	It's not safe to assume get_active_lane_mask was produced by the LLVM vectorizer. I mean, the ACLE has a vctp intrinsic; it's not that much of a stretch that we could expose get_active_lane_mask to users at some point. And even if it was produced by the LLVM vectorizer, other passes that can modify the loop structure run between the vectorizer and the MVETailPredication pass. And even if you're looking at the unmodified vectorizer output, you still need to verify the "L" is actually the loop that was vectorized. In summary, I don't think it's a good idea to make assumptions beyond what LangRef actually promises. It's still a lot easier to pattern-match than it would be without the intrinsic.

SjoerdMeijer marked an inline comment as done.May 29 2020, 11:41 AM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/MVETailPredication.cpp
576	Ok, cheers, got it. It's still a lot easier to pattern-match than it would be without the intrinsic. Yep, will have a go at this next week.

I am not entire done yet, but this adds IsSafeActiveMask, which performs checks on the induction variable and backedge-taken count that are arguments to @llvm.get.active.lane.mask, and tests have been added for this to test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll. I have also now updated all other tests to the new situation, i.e. manually added the @llvm.get.active.lane.mask instead of the icmp.

The approach for the overflow check is to use SCEV and query if the loop entry is protected by a conditional BTC + 1 < 0. In other words, if the scalar trip count overflows and becomes negative, we shouldn't enter the loop and create a tripcount expression BTC + 1 as that won't be valid. As I said, not entirely done yet, but wanted to check this after our overflow discussion while I fix up these things:

the vctp can be cloned in the exit block, and looks like I am missing that at the moment.
I am always creating a new num.elements = BTC + 1 expression, but it looks like that value might exist and I can reuse, hopefully reducing some of the codegen changes that we see in some of the tests.

efriedma added inline comments.Jun 2 2020, 5:22 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
436	I think you want llvm::cannotBeMaxInLoop?
460	You also need to check that the AddRec is associated with the right loop.
576	(It looks like the current version doesn't try to address the potential overflow in the subtraction.)

This should include everything now. Main additions are:

check for potential overflow in the subtraction.
check that the induction/addRec is associated with the right loop.

Updated test

This should include everything now.

:-)
If @llvm.get.active.lane.mask can't be lowered to a VCTP (e.g. overflow) I guess that means we will have to revert it to an icmp, in order to avoid a backend isel match error. This shouldn't happen yet, but will add this.

SjoerdMeijer mentioned this in D79001: [ARM][MVE] Tail-predication: support nested loops with dependent iterators..Jun 4 2020, 1:17 AM

I'd like to see a few negative testcases, where we can't transform the llvm.get.active.lane.mask.

llvm/lib/Target/ARM/MVETailPredication.cpp
431	getNumElements() is fine on a FixedVectorType.
463	!isKnownNonNegative?
482	I was expecting something more like `IVExpr->getLoop() == L`. L might not be the innermost loop.

I was just about to upload a new diff when I noticed your review. Many thanks again.

This includes 2 new function:

getNumElements(): this looks in the preheader to see whether the number of elements value is already present, to avoid recreating it.
RevertActiveLaneMask(): if it is not safe to lower @llvm.get.active.lane.mask to a VCTP, we recreate the icmp.

About testing:

I'd like to see a few negative testcases, where we can't transform the llvm.get.active.lane.mask.

If have put most negative tests in test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll:

@overflow: this for over flow if BTC is a constant and UINT_MAX
@IV_not_an_induction
@IV_wrong_step
@IV_step_not_constant
@outerloop_phi
@overflow_in_sub

And there is also one in test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll:

@reduction_not_guarded: this checks for overflow if BTC is a runtime variable not guarded by a loop check.

These negative tests include checks to see if we recreate the icmp; it shouldn't emit @llvm.get.active.lane.mask or the VCTP.

I have already addressed the 2 minor comments, but TODO: I still need to look into the question about !isKnownNonNegative, as that doesn't seem to work for me (most tail predication start failing and are rejected because of overflow); need to look if I don't properly construct that SCEV expression, or something else.

SjoerdMeijer marked an inline comment as done.Jun 5 2020, 12:11 PM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/MVETailPredication.cpp
463	I played with SCEV today, and tried to use `isKnownNonNegative` (and similar ones). These SCEV helpers don't seem to provide the required information, i.e. they are not able to find precise enough value ranges to tell us values are non-negative, and so the `isKnownNonNegative` SCEV helpers and friends don't seem to have the context of the loop. Our loops usually look like this, they have this or a similar loop guard: %cmp = icmp sgt %N, 0 br i1 %cmp, label %vector.preheader, label %exit For this example, %N is our ElementCount. When we construct our overflow check: ceil(ElementCount / VectorWidth) >= TripCount and query SCEV, it doesn't have the context that %N > 0, resulting in a negative lowerbound, and thus `isKnownNonNegative` returns False. Looking into how I could add more context to SCEV, I checked for example `getSCEVAtScope` hoping this would be more context sensitive, and some others too. PredicatedScalarEvolution looked promising, I think it is designed for exactly this (I haven't used it yet), but the LoopUtils helpers `isKnownNegativeInLoop` and `cannotBeMaxInLoop` provide this with a convenience interface. These LoopUtils helpers were actually contributed by @samparker after a similar experience (which he might be able to confirm here). Long story short, it looks like helpers `isKnownNegativeInLoop` and `cannotBeMaxInLoop` are actually the right choice here (also confirmed after further debugging and tracing the tests that I mentioned previously).

efriedma added inline comments.Jun 5 2020, 1:12 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
463	The description of isKnownNegativeInLoop says "Returns true if we can prove that \p S is defined and always negative in loop L." If it returns false, we have proven nothing, so you can't use it like this. I wasn't trying to imply you shouldn't use isKnownNonNegativeInLoop, if that's appropriate.

Short story:

isKnownNonNegativeInLoop is unfortunately not able to give an answer for this expression, and as a result most/all loops would be rejected. I have added a FIXMEs, and am using isKnownNegativeInLoop as that is at least able to catch some cases (the test cases with constant values) and is probably better than nothing. I have tried several SCEV helpers, but just none of them seem to support this expression. I think teaching SCEV about this expression is a separate issue. @efriedma, @samparker : please let me know what you think, and what you think the order of events should be.

Longer story:

I wasn't trying to imply you shouldn't use isKnownNonNegativeInLoop, if that's appropriate.

Thanks for confirming. I indeed got confused, briefly went onto the wrong track, but rediscovered isKnownNonNegativeInLoop and experimented further with that.

While evaluating this expression and if it is non-negative:

(((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount

and dumping KnownNonNegative information for the intermediate expressions, I see that SCEV is able to determine KnownNonNegative for all intermediate expressions, except the last one:

BTC: (-1 + %N)
BTC KnownNonNegative: 1
elemcount: %N
elemcount + vlen-1: (3 + %N)
KnownNonNegative: 1
Ceil: ((3 + %N) /u 4)
Ceil KnownNonNegative: 1
TripCount: (1 + ((-4 + (4 * ((3 + %N)/u 4))<nuw>) /u 4))<nuw><nsw>
TripCount KnownNonNegative: 1
ECMinusTC: (-1 + (-1 * ((-4 + (4 * ((3 + %N) /u 4))<nuw>) /u 4))<nsw> + ((3 + %N) /u 4))
KnownNonNegative: 0

When I request signed integer ranges for rounded element count (Ceil) and the trip count (TC) I see this:

Range Ceil: [0,1073741824)
Range TC: [1,1073741825)

And that looks very sensible and promising. I wanted to add support for this here, but then discovered a case that worked slightly differently, and it needs some more thinking and investigation, and probably best be added as a helper somewhere to looputils/SCEV. I have traced SCEV and its decision making, and roughly see where it is rejecting this, but need to investigate that further.

Maybe instead of querying SCEV about (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount, it would be easier for SCEV to reason about ((ElementCount + (VectorWidth - 1)) - TripCount * VectorWidth?

Maybe instead of querying SCEV about (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount, it would be easier for SCEV to reason about ((ElementCount + (VectorWidth - 1)) - TripCount * VectorWidth?

Thanks! I might have tried this (have tried many different expression, with/without overflow flags, etc), but will double check tomorrow.

I was actually just uploading an approach using integer ranges. If we know that:

Range(ElementCount + (VectorWidth-1) / VectorWidth) - Range(TC) == 0

If this set difference results in the empty set, we know overflow doesn't happen.
This seems to work for all cases, i.e. when values are runtime values or constants, and is a simple check.

In D79175#2080874, @efriedma wrote:

Maybe instead of querying SCEV about (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount, it would be easier for SCEV to reason about ((ElementCount + (VectorWidth - 1)) - TripCount * VectorWidth?

No luck with that one too: isKnownNonNegativeInLoop is not able to prove that for this expression.

How about the current implementation, and just looking at the signed ranges?

efriedma added inline comments.Jun 9 2020, 1:29 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
456	Can `ElementCount + (VW-1)` overflow? Do we need to check for that?
483	The general idea here makes sense. The precise way you're implementing it seems a little strange; it's fine if TripCount is smaller than BTC, I think.
516	Is it really legal for the induction variable to be stepping in either direction?
537	Not sure we can safely assume "I" is an add instruction.

SjoerdMeijer marked 2 inline comments as done.Jun 10 2020, 8:57 AM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/MVETailPredication.cpp
456	We are not generating code for `ElementCount + (VW-1)` , so that one is fine. We do want to know about overflow for `Ceil`, so will add a check for that.
516	It's definitely a case we want to support. In D77635, the vectoriser was taught to create a vector induction variable when a primary induction is initially absent, which is the case with decrementing loops, i.e. a step value of -1. We have a test case for counting down loops here in `test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll` with function `@foo4`. As this is produced by the vectoriser, I didn't see a problem with this, but will give it some more thoughts if we need to check more for this, but if it helps to get a first version in, I can remove this and address this in a follow up.

Addressed the other comments, simplified the value range check.

efriedma added inline comments.Jun 10 2020, 11:50 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
456	Not sure I understand; even if we aren't generating code, we're using it as input to the safety check. Does the math there work correctly even if it overflows?
462	Ceil is the result of a UDiv; it trivially can't be negative.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll
253	I don't understand how this loop is supposed to work. %index is zero in the first iteration, and UINT_MAX-3 in the second iteration.

SjoerdMeijer marked 3 inline comments as done.Jun 10 2020, 12:33 PM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/MVETailPredication.cpp
456	The Ceil expression doesn't have the non-wrapping flags. Therefore, my understanding is, that this ceiling calculation is done in a higher bit range and so no information is lost.
462	Ah yeah, that of course doesn't make any sense, I am removing it.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll
253	yep, thanks for catching, doesn't make sense, some sort of copy-paste mistake.

Removed the overflow check on the Ceil expression, the udiv.
Removed recognising the -Step case and corresponding test. That is not produced by the vectoriser, so don't need to recognise it. I am guessing this is no longer produced since the vectoriser now understands decrementing loops.

efriedma added inline comments.Jun 10 2020, 1:18 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
456	Therefore, my understanding is, that this ceiling calculation is done in a higher bit range SCEV math is modular math; it happens in the width of SCEV::getType(). (So Add, Mul, and AddRec can overflow.) If you want wider math, you need to explicitly zero-extend.

SjoerdMeijer marked an inline comment as done.Jun 10 2020, 1:55 PM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/MVETailPredication.cpp
456	Ahhhh, thanks for explaining. This is a real puzzle..... I think I am going to solve this differently then, because I am afraid we wouldn't be able to put any meaningful bound on `ElementCount + (VW-1)` (have seen this already but will double-check). I think I am going to use the TripCount (TC) for this, which usually looks like this: (1 + ((-4 + (4 * ((3 + %N) /u 4))<nuw>) /u 4))<nuw><nsw> For which we are able to find useful value ranges like this: TC: [1,1073741825) Because TC uses %N, and is also used in `ElementCount + (VW-1), I think that means that if: upperbound(TC) <= UINT_MAX - VectorWidth that we are okay.

Added overflow check for the rounding expression + test case.

some minor tweaks:

added an option to force tail-prediction,
removed the unreachable, and generate the splat BTC in the preheader if it doesn't exist.

Sorry for being a bit impatient, but was wondering if this is okay as an initial commit?
As there are several moving parts involved here, an initial commit and a first in-tree version would be convenient to iterate on, for example:

I want to have a look at codegen to see if we can improve it,
I have added a test case @Correlation that we want to support. It's currently rejected because of possible overflow, that's why I have added an option to force it, but I would like to have a closer look at this again.

I don't think the overflow check (step 2.2) in IsSafeActiveMask is right. The way it's comparing ranges doesn't seem sound: the "range" is conservative, so it's possible the actual value of the trip count at runtime is only one of the values in the range. Comparing the overlap on the ranges produces a result that doesn't really correspond to what you're looking for. Not sure if I'm explaining that clearly.

If it's really too hard to come up with the relevant proofs, we might want to revise the definition of llvm.get.active.lane.mask. I'm not sure what the revision looks like. We could say that it never produces an all-false vector; instead, it produces poison in that case. Or we could change the arguments somehow to make them easier to reason about.

That said, I'm okay with committing this as-is with the understanding that the pass will stay disabled until we resolve the issues with the overflow check. It looks substantially like what I expect the final form to be. LGTM

This revision is now accepted and ready to land.Jun 16 2020, 7:07 PM

Many thanks @efriedma and @samparker for all your help, really appreciated, will also mention that in the commit message.

And many thanks for your thoughts on the overflow behaviour. Looks like I will need to explore a few different strategies. Easiest would be if SCEV just understands our expressions. When I debugged SCEV, and while I didn't yet get fully to the bottom of it, my impression is that SCEV (its different helpers) is doing a lot of pattern matching, tries different strategies, and finally compares the expression against the loop guard expression. So, I don't know yet how easy or feasible it is to fit this analysis in. That's why I would tend to have a preference for revising the definition of llvm.get.active.lane.mask, because that looks easier. I appreciate that is somewhat working around SCEV limitations, which may or may not be a strong argument.

Closed by commit rGd1522513d4c4: [ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask (authored by SjoerdMeijer). · Explain WhyJun 17 2020, 7:32 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer mentioned this in D82773: [ARM][MVE] Tail-predication: clean-up removing unused code.Jun 29 2020, 8:04 AM

SjoerdMeijer mentioned this in rGaf45907653fd: [ARM][MVE] Tail-predication: clean-up of unused code.Jun 30 2020, 9:13 AM

SjoerdMeijer mentioned this in D86147: [LangRef] Revise semantics of get.active.lane.mask.Aug 18 2020, 9:26 AM

SjoerdMeijer mentioned this in rG2002bb487898: [LangRef] Revise semantics of intrinsic get.active.lane.mask.Aug 25 2020, 8:24 AM

SjoerdMeijer mentioned this in D86074: [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value.Sep 14 2020, 3:55 PM

Revision Contents

Path

Size

llvm/

lib/

Target/

ARM/

MVETailPredication.cpp

575 lines

test/

CodeGen/

Thumb2/

LowOverheadLoops/

basic-tail-pred.ll

63 lines

clear-maskedinsts.ll

52 lines

cond-vector-reduce-mve-codegen.ll

138 lines

extending-loads.ll

30 lines

fast-fp-loops.ll

18 lines

mve-tail-data-types.ll

97 lines

30 lines

359 lines

17 lines

136 lines

vector-arith-codegen.ll

64 lines

vector-reduce-mve-tail.ll

27 lines

mve-fma-loops.ll

191 lines

Diff 269250

llvm/lib/Target/ARM/MVETailPredication.cpp

Show All 25 Lines
/// vector loops, that are targets for low-overhead loops, and prepares it for		/// vector loops, that are targets for low-overhead loops, and prepares it for
/// code generation. Once the vectorizer has produced a masked loop, there's a		/// code generation. Once the vectorizer has produced a masked loop, there's a
/// couple of final forms:		/// couple of final forms:
/// - A tail-predicated loop, with implicit predication.		/// - A tail-predicated loop, with implicit predication.
/// - A loop containing multiple VCPT instructions, predicating multiple VPT		/// - A loop containing multiple VCPT instructions, predicating multiple VPT
/// blocks of instructions operating on different vector types.		/// blocks of instructions operating on different vector types.
///		///
/// This pass:		/// This pass:
/// 1) Pattern matches the scalar iteration count produced by the vectoriser.		/// 1) Checks if the predicates of the masked load/store instructions are
/// The scalar loop iteration count represents the number of elements to be		/// generated by intrinsic @llvm.get.active.lanes(). This intrinsic consumes
/// processed.		/// the Backedge Taken Count (BTC) of the scalar loop as its second argument,
/// TODO: this could be emitted using an intrinsic, similar to the hardware		/// which we extract to set up the number of elements processed by the loop.
/// loop intrinsics, so that we don't need to pattern match this here.		/// 2) Intrinsic @llvm.get.active.lanes() is then replaced by the MVE target
/// 2) Inserts the VCTP intrinsic to represent the effect of		/// specific VCTP intrinsic to represent the effect of tail predication.
/// tail predication. This will be picked up by the ARM Low-overhead loop		/// This will be picked up by the ARM Low-overhead loop pass, which performs
/// pass, which performs the final transformation to a DLSTP or WLSTP		/// the final transformation to a DLSTP or WLSTP tail-predicated loop.
/// tail-predicated loop.

#include "ARM.h"		#include "ARM.h"
#include "ARMSubtarget.h"		#include "ARMSubtarget.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/LoopPass.h"		#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"		#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
Show All 14 Lines
#define DESC "Transform predicated vector loops to use MVE tail predication"		#define DESC "Transform predicated vector loops to use MVE tail predication"

cl::opt<bool>		cl::opt<bool>
DisableTailPredication("disable-mve-tail-predication", cl::Hidden,		DisableTailPredication("disable-mve-tail-predication", cl::Hidden,
cl::init(true),		cl::init(true),
cl::desc("Disable MVE Tail Predication"));		cl::desc("Disable MVE Tail Predication"));
namespace {		namespace {

// Bookkeeping for pattern matching the loop trip count and the number of
// elements processed by the loop.
struct TripCountPattern {
// An icmp instruction that calculates a predicate of active/inactive lanes
// used by the masked loads/stores.
Instruction *Predicate = nullptr;

// The add instruction that increments the IV.
Value *TripCount = nullptr;

// The number of elements processed by the vector loop.
Value *NumElements = nullptr;

// Other instructions in the icmp chain that calculate the predicate.
FixedVectorType *VecTy = nullptr;
Instruction *Shuffle = nullptr;
Instruction *Induction = nullptr;

TripCountPattern(Instruction P, Value TC, FixedVectorType *VT)
: Predicate(P), TripCount(TC), VecTy(VT){};
};

class MVETailPredication : public LoopPass {		class MVETailPredication : public LoopPass {
SmallVector<IntrinsicInst*, 4> MaskedInsts;		SmallVector<IntrinsicInst*, 4> MaskedInsts;
Loop *L = nullptr;		Loop *L = nullptr;
LoopInfo *LI = nullptr;		LoopInfo *LI = nullptr;
const DataLayout *DL;		const DataLayout *DL;
DominatorTree *DT = nullptr;		DominatorTree *DT = nullptr;
ScalarEvolution *SE = nullptr;		ScalarEvolution *SE = nullptr;
TargetTransformInfo *TTI = nullptr;		TargetTransformInfo *TTI = nullptr;
TargetLibraryInfo *TLI = nullptr;		TargetLibraryInfo *TLI = nullptr;
bool ClonedVCTPInExitBlock = false;		bool ClonedVCTPInExitBlock = false;
		IntrinsicInst *ActiveLaneMask = nullptr;
		FixedVectorType *VecTy = nullptr;

public:		public:
static char ID;		static char ID;

MVETailPredication() : LoopPass(ID) { }		MVETailPredication() : LoopPass(ID) { }

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<ScalarEvolutionWrapperPass>();		AU.addRequired<ScalarEvolutionWrapperPass>();
Show All 11 Lines
private:		private:
/// Perform the relevant checks on the loop and convert if possible.		/// Perform the relevant checks on the loop and convert if possible.
bool TryConvert(Value *TripCount);		bool TryConvert(Value *TripCount);

/// Return whether this is a vectorized loop, that contains masked		/// Return whether this is a vectorized loop, that contains masked
/// load/stores.		/// load/stores.
bool IsPredicatedVectorLoop();		bool IsPredicatedVectorLoop();

/// Compute a value for the total number of elements that the predicated		/// Perform checks on the arguments of @llvm.get.active.lane.mask
/// loop will process if it is a runtime value.		/// intrinsic: check if the first is a loop induction variable, and for the
bool ComputeRuntimeElements(TripCountPattern &TCP);		/// the second check that no overflow can occur in the expression that use
		/// this backedge-taken count.
/// Return whether this is the icmp that generates an i1 vector, based		bool IsSafeActiveMask(Value TripCount, FixedVectorType VecTy);
/// upon a loop counter and a limit that is defined outside the loop,
/// that generates the active/inactive lanes required for tail-predication.
bool isTailPredicate(TripCountPattern &TCP);

/// Insert the intrinsic to represent the effect of tail predication.		/// Insert the intrinsic to represent the effect of tail predication.
void InsertVCTPIntrinsic(TripCountPattern &TCP,		void InsertVCTPIntrinsic(IntrinsicInst ActiveLaneMask, Value TripCount,
		FixedVectorType *VecTy,
DenseMap<Instruction , Instruction > &NewPredicates);		DenseMap<Instruction , Instruction > &NewPredicates);

/// Rematerialize the iteration count in exit blocks, which enables		/// Rematerialize the iteration count in exit blocks, which enables
/// ARMLowOverheadLoops to better optimise away loop update statements inside		/// ARMLowOverheadLoops to better optimise away loop update statements inside
/// hardware-loops.		/// hardware-loops.
void RematerializeIterCount();		void RematerializeIterCount();

		/// If it is not safe to lower @llvm.get.active.lane.mask to a VCTP, it needs
		/// to be lowered to an icmp.
		void RevertActiveLaneMask();
};		};

} // end namespace		} // end namespace

static bool IsDecrement(Instruction &I) {		static bool IsDecrement(Instruction &I) {
auto *Call = dyn_cast<IntrinsicInst>(&I);		auto *Call = dyn_cast<IntrinsicInst>(&I);
if (!Call)		if (!Call)
return false;		return false;
Show All 17 Lines	void MVETailPredication::RematerializeIterCount() {
SCEVExpander Rewriter(SE, DL, "mvetp");		SCEVExpander Rewriter(SE, DL, "mvetp");
ReplaceExitVal ReplaceExitValue = AlwaysRepl;		ReplaceExitVal ReplaceExitValue = AlwaysRepl;

formLCSSARecursively(L, DT, LI, SE);		formLCSSARecursively(L, DT, LI, SE);
rewriteLoopExitValues(L, LI, TLI, SE, TTI, Rewriter, DT, ReplaceExitValue,		rewriteLoopExitValues(L, LI, TLI, SE, TTI, Rewriter, DT, ReplaceExitValue,
DeadInsts);		DeadInsts);
}		}

		void MVETailPredication::RevertActiveLaneMask() {
		if (!ActiveLaneMask)
		return;

		int VectorWidth = VecTy->getElementCount().Min;
		IRBuilder<> Builder(ActiveLaneMask->getParent()->getFirstNonPHI());

		// 1. Create the vector induction step. This %induction will be the LHS of
		// the icmp:
		//
		// %splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		// %splat = shufflevector <4 x i32> %splatinsert, <4 x i32> undef, <4 x i32> 0
		// %induction = add <4 x i32> %splat, <i32 0, i32 1, i32 2, i32 3>

		Value *Index = ActiveLaneMask->getOperand(0);
		Value *SplatIndex =
		Builder.CreateVectorSplat(VectorWidth, Index, "lane.mask");

		SmallVector<Constant *, 8> Indices;
		for (int i = 0; i < VectorWidth; ++i)
		Indices.push_back(ConstantInt::get(Index->getType(), i));

		Constant *CV = ConstantVector::get(Indices);
		Value *Induction = Builder.CreateAdd(SplatIndex, CV, "lane.mask.induction");

		LLVM_DEBUG(dbgs() << "ARM TP: New index: " << *SplatIndex << "\n";
		dbgs() << "ARM TP: New Induction: " << *Induction << "\n");

		// 2. In the Preheader, we expect this pattern splatting the BTC to
		// a vector. Find this %broadcast.splat, which will be the RHS of the
		// the icmp:
		//
		// %TC.minus.1 = add i32 %N, -1
		// %splatinsert = insertelement <4 x i32> undef, i32 %TC.minus.1, i32 0
		// %splat = shufflevector <4 x i32> %splatinsert, <4 x i32> undef, <16 x i32> 0

		auto *Preheader = L->getLoopPreheader();
		auto *BTC = ActiveLaneMask->getOperand(1);
		Value *SplatBTC = nullptr;

		if (auto *C = dyn_cast<ConstantInt>(BTC)) {
		Builder.SetInsertPoint(Preheader->getTerminator());
		SplatBTC = Builder.CreateVectorSplat(VectorWidth, C);
		} else {
		Instruction *InsertElem;
		for (auto &V : *Preheader) {
		InsertElem = dyn_cast<InsertElementInst>(&V);
		if (!InsertElem)
		continue;
		ConstantInt *CI = dyn_cast<ConstantInt>(InsertElem->getOperand(2));
		if (!CI)
		continue;
		if (InsertElem->getOperand(1) != BTC \|\| CI->getSExtValue() != 0)
		continue;
		if ((SplatBTC = dyn_cast<ShuffleVectorInst>(*InsertElem->users().begin())))
		break;
		}
		}
		if (!SplatBTC)
		llvm_unreachable("Couldn't revert intrinsic to icmp");

		Builder.SetInsertPoint(ActiveLaneMask);
		Value *ICmp = Builder.CreateICmp(ICmpInst::ICMP_ULE, Induction, SplatBTC);
		LLVM_DEBUG(dbgs() << "ARM TP: new compare: " << *ICmp << "\n");
		ActiveLaneMask->replaceAllUsesWith(ICmp);
		ActiveLaneMask->eraseFromParent();
		}

bool MVETailPredication::runOnLoop(Loop *L, LPPassManager&) {		bool MVETailPredication::runOnLoop(Loop *L, LPPassManager&) {
if (skipLoop(L) \|\| DisableTailPredication)		if (skipLoop(L) \|\| DisableTailPredication)
return false;		return false;

MaskedInsts.clear();		MaskedInsts.clear();
Function &F = *L->getHeader()->getParent();		Function &F = *L->getHeader()->getParent();
auto &TPC = getAnalysis<TargetPassConfig>();		auto &TPC = getAnalysis<TargetPassConfig>();
auto &TM = TPC.getTM<TargetMachine>();		auto &TM = TPC.getTM<TargetMachine>();
auto *ST = &TM.getSubtarget<ARMSubtarget>(F);		auto *ST = &TM.getSubtarget<ARMSubtarget>(F);
DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();		DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();		SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();		auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
TLI = TLIP ? &TLIP->getTLI(*L->getHeader()->getParent()) : nullptr;		TLI = TLIP ? &TLIP->getTLI(*L->getHeader()->getParent()) : nullptr;
DL = &L->getHeader()->getModule()->getDataLayout();		DL = &L->getHeader()->getModule()->getDataLayout();
this->L = L;		this->L = L;
		ActiveLaneMask = nullptr;

// The MVE and LOB extensions are combined to enable tail-predication, but		// The MVE and LOB extensions are combined to enable tail-predication, but
// there's nothing preventing us from generating VCTP instructions for v8.1m.		// there's nothing preventing us from generating VCTP instructions for v8.1m.
if (!ST->hasMVEIntegerOps() \|\| !ST->hasV8_1MMainlineOps()) {		if (!ST->hasMVEIntegerOps() \|\| !ST->hasV8_1MMainlineOps()) {
LLVM_DEBUG(dbgs() << "ARM TP: Not a v8.1m.main+mve target.\n");		LLVM_DEBUG(dbgs() << "ARM TP: Not a v8.1m.main+mve target.\n");
return false;		return false;
}		}

▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	bool MVETailPredication::runOnLoop(Loop *L, LPPassManager&) {
ClonedVCTPInExitBlock = false;		ClonedVCTPInExitBlock = false;
LLVM_DEBUG(dbgs() << "ARM TP: Running on Loop: " << L << Setup << "\n"		LLVM_DEBUG(dbgs() << "ARM TP: Running on Loop: " << L << Setup << "\n"
<< *Decrement << "\n");		<< *Decrement << "\n");

if (TryConvert(Setup->getArgOperand(0))) {		if (TryConvert(Setup->getArgOperand(0))) {
if (ClonedVCTPInExitBlock)		if (ClonedVCTPInExitBlock)
RematerializeIterCount();		RematerializeIterCount();
return true;		return true;
}

LLVM_DEBUG(dbgs() << "ARM TP: Can't tail-predicate this loop.\n");
return false;
}

// Pattern match predicates/masks and determine if they use the loop induction
// variable to control the number of elements processed by the loop. If so,
// the loop is a candidate for tail-predication.
bool MVETailPredication::isTailPredicate(TripCountPattern &TCP) {
using namespace PatternMatch;

// Pattern match the loop body and find the add with takes the index iv
// and adds a constant vector to it:
//
// vector.body:
// ..
// %index = phi i32
// %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
// %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert,
// <4 x i32> undef,
// <4 x i32> zeroinitializer
// %induction = [add\|or] <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
// %pred = icmp ule <4 x i32> %induction, %broadcast.splat11
//
// Please note that the 'or' is equivalent to the 'and' here, this relies on
// BroadcastSplat being the IV which we know is a phi with 0 start and Lanes
// increment, which is all being checked below.
Instruction *BroadcastSplat = nullptr;
Constant *Const = nullptr;
if (!match(TCP.Induction,
m_Add(m_Instruction(BroadcastSplat), m_Constant(Const))) &&
!match(TCP.Induction,
m_Or(m_Instruction(BroadcastSplat), m_Constant(Const))))
return false;

// Check that we're adding <0, 1, 2, 3...
if (auto *CDS = dyn_cast<ConstantDataSequential>(Const)) {
for (unsigned i = 0; i < CDS->getNumElements(); ++i) {
if (CDS->getElementAsInteger(i) != i)
return false;
}
} else		} else
return false;		RevertActiveLaneMask();

Instruction *Insert = nullptr;
// The shuffle which broadcasts the index iv into a vector.
if (!match(BroadcastSplat,
m_Shuffle(m_Instruction(Insert), m_Undef(), m_ZeroMask())))
return false;

// The insert element which initialises a vector with the index iv.
Instruction *IV = nullptr;
if (!match(Insert, m_InsertElt(m_Undef(), m_Instruction(IV), m_Zero())))
return false;

// The index iv.
auto *Phi = dyn_cast<PHINode>(IV);
if (!Phi)
return false;

// TODO: Don't think we need to check the entry value.
Value *OnEntry = Phi->getIncomingValueForBlock(L->getLoopPreheader());
if (!match(OnEntry, m_Zero()))
return false;

Value *InLoop = Phi->getIncomingValueForBlock(L->getLoopLatch());		LLVM_DEBUG(dbgs() << "ARM TP: Can't tail-predicate this loop.\n");
unsigned Lanes = cast<FixedVectorType>(Insert->getType())->getNumElements();

Instruction *LHS = nullptr;
if (!match(InLoop, m_Add(m_Instruction(LHS), m_SpecificInt(Lanes))))
return false;		return false;

return LHS == Phi;
}		}

static FixedVectorType getVectorType(IntrinsicInst I) {		static FixedVectorType getVectorType(IntrinsicInst I) {
unsigned TypeOp = I->getIntrinsicID() == Intrinsic::masked_load ? 0 : 1;		unsigned TypeOp = I->getIntrinsicID() == Intrinsic::masked_load ? 0 : 1;
auto *PtrTy = cast<PointerType>(I->getOperand(TypeOp)->getType());		auto *PtrTy = cast<PointerType>(I->getOperand(TypeOp)->getType());
return cast<FixedVectorType>(PtrTy->getElementType());		auto *VecTy = cast<FixedVectorType>(PtrTy->getElementType());
		assert(VecTy && "No scalable vectors expected here");
		return VecTy;
}		}

bool MVETailPredication::IsPredicatedVectorLoop() {		bool MVETailPredication::IsPredicatedVectorLoop() {
// Check that the loop contains at least one masked load/store intrinsic.		// Check that the loop contains at least one masked load/store intrinsic.
// We only support 'normal' vector instructions - other than masked		// We only support 'normal' vector instructions - other than masked
// load/stores.		// load/stores.
for (auto *BB : L->getBlocks()) {		for (auto *BB : L->getBlocks()) {
for (auto &I : *BB) {		for (auto &I : *BB) {
Show All 16 Lines	for (auto &I : *BB) {
}		}
}		}
}		}
}		}

return !MaskedInsts.empty();		return !MaskedInsts.empty();
}		}

// Pattern match the predicate, which is an icmp with a constant vector of this
// form:
//
// icmp ult <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>
//
// and return the constant, i.e. 32002 in this example. This is assumed to be
// the scalar loop iteration count: the number of loop elements by the
// the vector loop. Further checks are performed in function isTailPredicate(),
// to verify 'induction' behaves as an induction variable.
//
static bool ComputeConstElements(TripCountPattern &TCP) {
if (!dyn_cast<ConstantInt>(TCP.TripCount))
return false;

ConstantInt *VF = ConstantInt::get(
cast<IntegerType>(TCP.TripCount->getType()), TCP.VecTy->getNumElements());
using namespace PatternMatch;
CmpInst::Predicate CC;

if (!match(TCP.Predicate, m_ICmp(CC, m_Instruction(TCP.Induction),
m_AnyIntegralConstant())) \|\|
CC != ICmpInst::ICMP_ULT)
return false;

LLVM_DEBUG(dbgs() << "ARM TP: icmp with constants: "; TCP.Predicate->dump(););
Value *ConstVec = TCP.Predicate->getOperand(1);

auto *CDS = dyn_cast<ConstantDataSequential>(ConstVec);
if (!CDS \|\| CDS->getNumElements() != VF->getSExtValue())
return false;

if ((TCP.NumElements = CDS->getSplatValue())) {
assert(dyn_cast<ConstantInt>(TCP.NumElements)->getSExtValue() %
VF->getSExtValue() !=
0 &&
"tail-predication: trip count should not be a multiple of the VF");
LLVM_DEBUG(dbgs() << "ARM TP: Found const elem count: " << *TCP.NumElements
<< "\n");
return true;
}
return false;
}

// Pattern match the loop iteration count setup:
//
// %trip.count.minus.1 = add i32 %N, -1
// %broadcast.splatinsert10 = insertelement <4 x i32> undef,
// i32 %trip.count.minus.1, i32 0
// %broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10,
// <4 x i32> undef,
// <4 x i32> zeroinitializer
// ..
// vector.body:
// ..
//
static bool MatchElemCountLoopSetup(Loop L, Instruction Shuffle,
Value *NumElements) {
using namespace PatternMatch;
Instruction *Insert = nullptr;

if (!match(Shuffle,
m_Shuffle(m_Instruction(Insert), m_Undef(), m_ZeroMask())))
return false;

// Insert the limit into a vector.
Instruction *BECount = nullptr;
if (!match(Insert,
m_InsertElt(m_Undef(), m_Instruction(BECount), m_Zero())))
return false;

// The limit calculation, backedge count.
Value *TripCount = nullptr;
if (!match(BECount, m_Add(m_Value(TripCount), m_AllOnes())))
return false;

if (TripCount != NumElements \|\| !L->isLoopInvariant(BECount))
return false;

return true;
}

bool MVETailPredication::ComputeRuntimeElements(TripCountPattern &TCP) {
using namespace PatternMatch;
const SCEV *TripCountSE = SE->getSCEV(TCP.TripCount);
ConstantInt *VF = ConstantInt::get(
cast<IntegerType>(TCP.TripCount->getType()), TCP.VecTy->getNumElements());

if (VF->equalsInt(1))
return false;

CmpInst::Predicate Pred;
if (!match(TCP.Predicate, m_ICmp(Pred, m_Instruction(TCP.Induction),
m_Instruction(TCP.Shuffle))) \|\|
Pred != ICmpInst::ICMP_ULE)
return false;

LLVM_DEBUG(dbgs() << "Computing number of elements for vector trip count: ";
TCP.TripCount->dump());

// Otherwise, continue and try to pattern match the vector iteration
// count expression
auto VisitAdd = [&](const SCEVAddExpr S) -> const SCEVMulExpr {
if (auto *Const = dyn_cast<SCEVConstant>(S->getOperand(0))) {
if (Const->getAPInt() != -VF->getValue())
return nullptr;
} else
return nullptr;
return dyn_cast<SCEVMulExpr>(S->getOperand(1));
};

auto VisitMul = [&](const SCEVMulExpr S) -> const SCEVUDivExpr {
if (auto *Const = dyn_cast<SCEVConstant>(S->getOperand(0))) {
if (Const->getValue() != VF)
return nullptr;
} else
return nullptr;
return dyn_cast<SCEVUDivExpr>(S->getOperand(1));
};

auto VisitDiv = [&](const SCEVUDivExpr S) -> const SCEV {
if (auto *Const = dyn_cast<SCEVConstant>(S->getRHS())) {
if (Const->getValue() != VF)
return nullptr;
} else
return nullptr;

if (auto *RoundUp = dyn_cast<SCEVAddExpr>(S->getLHS())) {
if (auto *Const = dyn_cast<SCEVConstant>(RoundUp->getOperand(0))) {
if (Const->getAPInt() != (VF->getValue() - 1))
return nullptr;
} else
return nullptr;

return RoundUp->getOperand(1);
}
return nullptr;
};

// TODO: Can we use SCEV helpers, such as findArrayDimensions, and friends to
// determine the numbers of elements instead? Looks like this is what is used
// for delinearization, but I'm not sure if it can be applied to the
// vectorized form - at least not without a bit more work than I feel
// comfortable with.

// Search for Elems in the following SCEV:
// (1 + ((-VF + (VF * (((VF - 1) + %Elems) /u VF))<nuw>) /u VF))<nuw><nsw>
const SCEV *Elems = nullptr;
if (auto *TC = dyn_cast<SCEVAddExpr>(TripCountSE))
if (auto *Div = dyn_cast<SCEVUDivExpr>(TC->getOperand(1)))
if (auto *Add = dyn_cast<SCEVAddExpr>(Div->getLHS()))
if (auto *Mul = VisitAdd(Add))
if (auto *Div = VisitMul(Mul))
if (auto *Res = VisitDiv(Div))
Elems = Res;

if (!Elems)
return false;

Instruction *InsertPt = L->getLoopPreheader()->getTerminator();
if (!isSafeToExpandAt(Elems, InsertPt, *SE))
return false;

auto DL = L->getHeader()->getModule()->getDataLayout();
SCEVExpander Expander(*SE, DL, "elements");
TCP.NumElements = Expander.expandCodeFor(Elems, Elems->getType(), InsertPt);

if (!MatchElemCountLoopSetup(L, TCP.Shuffle, TCP.NumElements))
return false;

return true;
}

// Look through the exit block to see whether there's a duplicate predicate		// Look through the exit block to see whether there's a duplicate predicate
// instruction. This can happen when we need to perform a select on values		// instruction. This can happen when we need to perform a select on values
// from the last and previous iteration. Instead of doing a straight		// from the last and previous iteration. Instead of doing a straight
// replacement of that predicate with the vctp, clone the vctp and place it		// replacement of that predicate with the vctp, clone the vctp and place it
// in the block. This means that the VPR doesn't have to be live into the		// in the block. This means that the VPR doesn't have to be live into the
// exit block which should make it easier to convert this loop into a proper		// exit block which should make it easier to convert this loop into a proper
// tail predicated loop.		// tail predicated loop.
static bool Cleanup(DenseMap<Instruction, Instruction> &NewPredicates,		static bool Cleanup(DenseMap<Instruction, Instruction> &NewPredicates,
Show All 31 Lines	while (!MaybeDead.empty()) {
MaybeDead.remove(I);		MaybeDead.remove(I);
if (I->hasNUsesOrMore(1))		if (I->hasNUsesOrMore(1))
continue;		continue;

for (auto &U : I->operands())		for (auto &U : I->operands())
if (auto *OpI = dyn_cast<Instruction>(U))		if (auto *OpI = dyn_cast<Instruction>(U))
MaybeDead.insert(OpI);		MaybeDead.insert(OpI);

I->dropAllReferences();
Dead.insert(I);		Dead.insert(I);
}		}

for (auto *I : Dead) {		for (auto *I : Dead) {
LLVM_DEBUG(dbgs() << "ARM TP: removing dead insn: "; I->dump());		LLVM_DEBUG(dbgs() << "ARM TP: removing dead insn: "; I->dump());
I->eraseFromParent();		I->eraseFromParent();
}		}

for (auto I : L->blocks())		for (auto I : L->blocks())
DeleteDeadPHIs(I);		DeleteDeadPHIs(I);

return ClonedVCTPInExitBlock;		return ClonedVCTPInExitBlock;
}		}

void MVETailPredication::InsertVCTPIntrinsic(TripCountPattern &TCP,		// The active lane intrinsic has this form:
		//
		// @llvm.get.active.lane.mask(IV, BTC)
		//
		// Here we perform checks that this intrinsic behaves as expected,
		// which means:
		//
		// 1) The element count, which is calculated with BTC + 1, cannot overflow.
		// 2) The IV must be an induction phi with an increment equal to the
		// vector width.
		// 3) FIXME: The element count needs to be sufficiently large that the decrement of
		// element counter doesn't overflow, which means that we need to prove:
		// ceil(ElementCount / VectorWidth) >= TripCount
		// by rounding up ElementCount up:
		// ((ElementCount + (VectorWidth - 1)) / VectorWidth
		// and evaluate if expression isKnownNonNegative:
		// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
		efriedmaUnsubmitted Not Done Reply Inline Actions getNumElements() is fine on a FixedVectorType. efriedma: getNumElements() is fine on a FixedVectorType.
		// FIXME: isKnownNonNegative doesn't understand this expression, and cannot
		// prove it is non-negative.
		bool MVETailPredication::IsSafeActiveMask(Value *TripCount,
		FixedVectorType *VecTy) {
		int VectorWidth = VecTy->getNumElements();
		efriedmaUnsubmitted Not Done Reply Inline Actions I think you want llvm::cannotBeMaxInLoop? efriedma: I think you want llvm::cannotBeMaxInLoop?
		auto *BackedgeTakenCount = ActiveLaneMask->getOperand(1);
		auto *BTC = SE->getSCEV(BackedgeTakenCount);

		// 1) Test whether entry to the loop is protected by a conditional
		// BTC + 1 < 0. In other words, if the scalar trip count overflows,
		// becomes negative, we shouldn't enter the loop and creating
		// tripcount expression BTC + 1 is not safe. So, check that BTC
		// isn't max. This is evaluated in unsigned, because the semantics
		// of @get.active.lane.mask is a ULE comparison.
		if (!llvm::cannotBeMaxInLoop(BTC, L, SE, false /Signed*/)) {
		LLVM_DEBUG(dbgs() << "ARM TP: overflow detected in: ";
		BTC->dump());
		return false;
		}

		// 2) Prove that the sub expression is non-negative, i.e. it doesn't overflow:
		// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
		auto *One = SE->getOne(TripCount->getType());
		auto *ElementCount = SE->getAddExpr(BTC, One);
		auto *Tmp = SE->getAddExpr(ElementCount,
		efriedmaUnsubmitted Not Done Reply Inline Actions Can `ElementCount + (VW-1)` overflow? Do we need to check for that? efriedma: Can `ElementCount + (VW-1)` overflow? Do we need to check for that?
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions We are not generating code for `ElementCount + (VW-1)` , so that one is fine. We do want to know about overflow for `Ceil`, so will add a check for that. SjoerdMeijer: We are not generating code for `ElementCount + (VW-1) `, so that one is fine. We do want to…
		efriedmaUnsubmitted Not Done Reply Inline Actions Not sure I understand; even if we aren't generating code, we're using it as input to the safety check. Does the math there work correctly even if it overflows? efriedma: Not sure I understand; even if we aren't generating code, we're using it as input to the safety…
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions The Ceil expression doesn't have the non-wrapping flags. Therefore, my understanding is, that this ceiling calculation is done in a higher bit range and so no information is lost. SjoerdMeijer: The Ceil expression doesn't have the non-wrapping flags. Therefore, my understanding is, that…
		efriedmaUnsubmitted Not Done Reply Inline Actions Therefore, my understanding is, that this ceiling calculation is done in a higher bit range SCEV math is modular math; it happens in the width of SCEV::getType(). (So Add, Mul, and AddRec can overflow.) If you want wider math, you need to explicitly zero-extend. efriedma: > Therefore, my understanding is, that this ceiling calculation is done in a higher bit range…
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions Ahhhh, thanks for explaining. This is a real puzzle..... I think I am going to solve this differently then, because I am afraid we wouldn't be able to put any meaningful bound on `ElementCount + (VW-1)` (have seen this already but will double-check). I think I am going to use the TripCount (TC) for this, which usually looks like this: (1 + ((-4 + (4 * ((3 + %N) /u 4))<nuw>) /u 4))<nuw><nsw> For which we are able to find useful value ranges like this: TC: [1,1073741825) Because TC uses %N, and is also used in `ElementCount + (VW-1), I think that means that if: upperbound(TC) <= UINT_MAX - VectorWidth that we are okay. SjoerdMeijer: Ahhhh, thanks for explaining. This is a real puzzle..... I think I am going to solve this…
		SE->getSCEV(ConstantInt::get(TripCount->getType(), VectorWidth - 1)));
		auto Ceil = SE->getUDivExpr(Tmp,
		SE->getSCEV(ConstantInt::get(TripCount->getType(), VectorWidth)));
		auto *ECMinusTC = SE->getMinusSCEV(Ceil, SE->getSCEV(TripCount));
		efriedmaUnsubmitted Not Done Reply Inline Actions You also need to check that the AddRec is associated with the right loop. efriedma: You also need to check that the AddRec is associated with the right loop.
		// FIXME: we should be using isKnownNonNegativeInLoop here, but that is
		// not able to understand this expression, and would reject most/all cases.
		efriedmaUnsubmitted Not Done Reply Inline Actions Ceil is the result of a UDiv; it trivially can't be negative. efriedma: Ceil is the result of a UDiv; it trivially can't be negative.
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions Ah yeah, that of course doesn't make any sense, I am removing it. SjoerdMeijer: Ah yeah, that of course doesn't make any sense, I am removing it.
		// Using isKnownNegativeInLoop is probably better than nothing.
		efriedmaUnsubmitted Not Done Reply Inline Actions !isKnownNonNegative? efriedma: !isKnownNonNegative?
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions I played with SCEV today, and tried to use `isKnownNonNegative` (and similar ones). These SCEV helpers don't seem to provide the required information, i.e. they are not able to find precise enough value ranges to tell us values are non-negative, and so the `isKnownNonNegative` SCEV helpers and friends don't seem to have the context of the loop. Our loops usually look like this, they have this or a similar loop guard: %cmp = icmp sgt %N, 0 br i1 %cmp, label %vector.preheader, label %exit For this example, %N is our ElementCount. When we construct our overflow check: ceil(ElementCount / VectorWidth) >= TripCount and query SCEV, it doesn't have the context that %N > 0, resulting in a negative lowerbound, and thus `isKnownNonNegative` returns False. Looking into how I could add more context to SCEV, I checked for example `getSCEVAtScope` hoping this would be more context sensitive, and some others too. PredicatedScalarEvolution looked promising, I think it is designed for exactly this (I haven't used it yet), but the LoopUtils helpers `isKnownNegativeInLoop` and `cannotBeMaxInLoop` provide this with a convenience interface. These LoopUtils helpers were actually contributed by @samparker after a similar experience (which he might be able to confirm here). Long story short, it looks like helpers `isKnownNegativeInLoop` and `cannotBeMaxInLoop` are actually the right choice here (also confirmed after further debugging and tracing the tests that I mentioned previously). SjoerdMeijer: I played with SCEV today, and tried to use `isKnownNonNegative` (and similar ones). These SCEV…
		efriedmaUnsubmitted Not Done Reply Inline Actions The description of isKnownNegativeInLoop says "Returns true if we can prove that \p S is defined and always negative in loop L." If it returns false, we have proven nothing, so you can't use it like this. I wasn't trying to imply you shouldn't use isKnownNonNegativeInLoop, if that's appropriate. efriedma: The description of isKnownNegativeInLoop says "Returns true if we can prove that \p S is…
		if (llvm::isKnownNegativeInLoop(ECMinusTC, L, *SE)) {
		LLVM_DEBUG(dbgs() << "ARM TP: overflow in element count decrement\n");
		return false;
		}

		// 3) Find out if IV is an induction phi. Note that We can't use Loop
		// helpers here to get the induction variable, because the hardware loop is
		// no longer in loopsimplify form, and also the hwloop intrinsic use a
		// different counter. Using SCEV, we check that the induction is of the
		// form i = i + 4, where the increment must be equal to the VectorWidth.
		auto *IV = ActiveLaneMask->getOperand(0);
		auto *IVExpr = SE->getSCEV(IV);
		auto *AddExpr = dyn_cast<SCEVAddRecExpr>(IVExpr);
		if (!AddExpr) {
		LLVM_DEBUG(dbgs() << "ARM TP: induction not an add expr: "; IVExpr->dump());
		return false;
		}
		// Check that this AddRec is associated with this loop.
		if (AddExpr->getLoop() != L) {
		efriedmaUnsubmitted Not Done Reply Inline Actions I was expecting something more like `IVExpr->getLoop() == L`. L might not be the innermost loop. efriedma: I was expecting something more like `IVExpr->getLoop() == L`. L might not be the innermost…
		LLVM_DEBUG(dbgs() << "ARM TP: phi not part of this loop\n");
		efriedmaUnsubmitted Not Done Reply Inline Actions The general idea here makes sense. The precise way you're implementing it seems a little strange; it's fine if TripCount is smaller than BTC, I think. efriedma: The general idea here makes sense. The precise way you're implementing it seems a little…
		return false;
		}
		auto *Step = dyn_cast<SCEVConstant>(AddExpr->getOperand(1));
		if (!Step) {
		LLVM_DEBUG(dbgs() << "ARM TP: induction step is not a constant: ";
		AddExpr->getOperand(1)->dump());
		return false;
		}
		auto StepValue = Step->getValue()->getSExtValue();
		if (VectorWidth == StepValue \|\|
		VectorWidth == -StepValue)
		return true;

		LLVM_DEBUG(dbgs() << "ARM TP: step value " << StepValue << " doesn't match "
		"vector width : " << VectorWidth << "\n");

		return false;
		}

		// Materialize NumElements in the preheader block.
		static Value getNumElements(BasicBlock Preheader, Value *BTC) {
		// First, check the preheader if it not already exist:
		//
		// preheader:
		// %BTC = add i32 %N, -1
		// ..
		// vector.body:
		//
		// if %BTC already exists. We don't need to emit %NumElems = %BTC + 1,
		// but instead can just return %N.
		for (auto &I : *Preheader) {
		if (&I == BTC) {
		LLVM_DEBUG(dbgs() << "ARM TP: Found num elems: " << I << "\n");
		efriedmaUnsubmitted Not Done Reply Inline Actions Is it really legal for the induction variable to be stepping in either direction? efriedma: Is it really legal for the induction variable to be stepping in either direction?
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions It's definitely a case we want to support. In D77635, the vectoriser was taught to create a vector induction variable when a primary induction is initially absent, which is the case with decrementing loops, i.e. a step value of -1. We have a test case for counting down loops here in `test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll` with function `@foo4`. As this is produced by the vectoriser, I didn't see a problem with this, but will give it some more thoughts if we need to check more for this, but if it helps to get a first version in, I can remove this and address this in a follow up. SjoerdMeijer: It's definitely a case we want to support. In D77635, the vectoriser was taught to create a…
		return I.getOperand(0);
		}
		}

		// But we do need to materialise BTC if it is not already there,
		// e.g. if it is a constant.
		IRBuilder<> Builder(Preheader->getTerminator());
		Value *NumElements = Builder.CreateAdd(BTC,
		ConstantInt::get(BTC->getType(), 1), "num.elements");
		LLVM_DEBUG(dbgs() << "ARM TP: Created num elems: " << *NumElements << "\n");
		return NumElements;
		}

		void MVETailPredication::InsertVCTPIntrinsic(IntrinsicInst *ActiveLaneMask,
		Value TripCount, FixedVectorType VecTy,
DenseMap<Instruction, Instruction> &NewPredicates) {		DenseMap<Instruction, Instruction> &NewPredicates) {
IRBuilder<> Builder(L->getHeader()->getFirstNonPHI());		IRBuilder<> Builder(L->getLoopPreheader()->getTerminator());
Module *M = L->getHeader()->getModule();		Module *M = L->getHeader()->getModule();
Type *Ty = IntegerType::get(M->getContext(), 32);		Type *Ty = IntegerType::get(M->getContext(), 32);

		// The backedge-taken count in @llvm.get.active.lane.mask, its 2nd operand,
		efriedmaUnsubmitted Not Done Reply Inline Actions Not sure we can safely assume "I" is an add instruction. efriedma: Not sure we can safely assume "I" is an add instruction.
		// is one less than the trip count. So we need to find or create
		// %num.elements = %BTC + 1 in the preheader.
		Value *BTC = ActiveLaneMask->getOperand(1);
		Builder.SetInsertPoint(L->getLoopPreheader()->getTerminator());
		Value *NumElements = getNumElements(L->getLoopPreheader(), BTC);

// Insert a phi to count the number of elements processed by the loop.		// Insert a phi to count the number of elements processed by the loop.
		efriedmaUnsubmitted Not Done Reply Inline Actions This addition can overflow. efriedma: This addition can overflow.
		samparkerUnsubmitted Not Done Reply Inline Actions And that's not okay, right? Trip count will always be BTC + 1 and we don't handle uncountable loops. samparker: And that's not okay, right? Trip count will always be BTC + 1 and we don't handle uncountable…
		efriedmaUnsubmitted Not Done Reply Inline Actions The masks are wrong if it overflows. efriedma: The masks are wrong if it overflows.
		Builder.SetInsertPoint(L->getHeader()->getFirstNonPHI() );
PHINode *Processed = Builder.CreatePHI(Ty, 2);		PHINode *Processed = Builder.CreatePHI(Ty, 2);
Processed->addIncoming(TCP.NumElements, L->getLoopPreheader());		Processed->addIncoming(NumElements, L->getLoopPreheader());

// Insert the intrinsic to represent the effect of tail predication.		// Replace @llvm.get.active.mask() with the ARM specific VCTP intrinic, and thus
Builder.SetInsertPoint(cast<Instruction>(TCP.Predicate));		// represent the effect of tail predication.
		Builder.SetInsertPoint(ActiveLaneMask);
ConstantInt *Factor =		ConstantInt *Factor =
ConstantInt::get(cast<IntegerType>(Ty), TCP.VecTy->getNumElements());		ConstantInt::get(cast<IntegerType>(Ty), VecTy->getNumElements());

Intrinsic::ID VCTPID;		Intrinsic::ID VCTPID;
switch (TCP.VecTy->getNumElements()) {		switch (VecTy->getNumElements()) {
default:		default:
llvm_unreachable("unexpected number of lanes");		llvm_unreachable("unexpected number of lanes");
case 4: VCTPID = Intrinsic::arm_mve_vctp32; break;		case 4: VCTPID = Intrinsic::arm_mve_vctp32; break;
case 8: VCTPID = Intrinsic::arm_mve_vctp16; break;		case 8: VCTPID = Intrinsic::arm_mve_vctp16; break;
case 16: VCTPID = Intrinsic::arm_mve_vctp8; break;		case 16: VCTPID = Intrinsic::arm_mve_vctp8; break;

// FIXME: vctp64 currently not supported because the predicate		// FIXME: vctp64 currently not supported because the predicate
// vector wants to be <2 x i1>, but v2i1 is not a legal MVE		// vector wants to be <2 x i1>, but v2i1 is not a legal MVE
// type, so problems happen at isel time.		// type, so problems happen at isel time.
// Intrinsic::arm_mve_vctp64 exists for ACLE intrinsics		// Intrinsic::arm_mve_vctp64 exists for ACLE intrinsics
// purposes, but takes a v4i1 instead of a v2i1.		// purposes, but takes a v4i1 instead of a v2i1.
}		}
Function *VCTP = Intrinsic::getDeclaration(M, VCTPID);		Function *VCTP = Intrinsic::getDeclaration(M, VCTPID);
Value *TailPredicate = Builder.CreateCall(VCTP, Processed);		Value *VCTPCall = Builder.CreateCall(VCTP, Processed);
TCP.Predicate->replaceAllUsesWith(TailPredicate);		ActiveLaneMask->replaceAllUsesWith(VCTPCall);
NewPredicates[TCP.Predicate] = cast<Instruction>(TailPredicate);		NewPredicates[ActiveLaneMask] = cast<Instruction>(VCTPCall);

// Add the incoming value to the new phi.		// Add the incoming value to the new phi.
// TODO: This add likely already exists in the loop.		// TODO: This add likely already exists in the loop.
Value *Remaining = Builder.CreateSub(Processed, Factor);		Value *Remaining = Builder.CreateSub(Processed, Factor);
		efriedmaUnsubmitted Not Done Reply Inline Actions This subtraction can also overflow. efriedma: This subtraction can also overflow.
		samparkerUnsubmitted Not Done Reply Inline Actions But that's okay, right? This predication is only really useful when wrapping happens and the intrinsic reflects that overflow can/will happen. samparker: But that's okay, right? This predication is only really useful when wrapping happens and the…
		efriedmaUnsubmitted Not Done Reply Inline Actions The problem here is that the original get_active_lane_mask can do something like this: Iteration 1: get_active_lane_mask(0, 5) -> all-true Iteration 2: get_active_lane_mask(4, 5) -> <true, false, false, false> Iteration 3: get_active_lane_mask(8, 5) -> all-false In the rewritten code, you end up with this: Iteration 1: vctp(5) -> all-true Iteration 2: vctp(1)- > <true, false, false, false> Iteration 3: vctp(-3)- > all-true There are a couple ways you could deal with this: Use saturating subtraction Prove the trip count is small enough that this never happens, i.e. `ceil(ElementCount / VectorWidth) >= TripCount`. efriedma: The problem here is that the original get_active_lane_mask can do something like this…
		efriedmaUnsubmitted Not Done Reply Inline Actions Using saturating subtraction would require proving that the induction variable used as the first argument to llvm.get.active.mask doesn't overflow. But I guess you have to check that anyway. efriedma: Using saturating subtraction would require proving that the induction variable used as the…
		samparkerUnsubmitted Not Done Reply Inline Actions I would be hesitate to introduce saturating math here. I think the reasonable assumption is that the sub will wrap, but only on the final iteration. So just asserting that the element count is within the bounds of the trip count should be fine. samparker: I would be hesitate to introduce saturating math here. I think the reasonable assumption is…
		SjoerdMeijerAuthorUnsubmitted Not Done Reply Inline Actions The problem here is that the original get_active_lane_mask can do something like this: Iteration 1: get_active_lane_mask(0, 5) -> all-true Iteration 2: get_active_lane_mask(4, 5) -> <true, false, false, false> Iteration 3: get_active_lane_mask(8, 5) -> all-false In general, I see the problem, and I see that this can happen. But here in this context, I was wondering if it not actually boils down to Sam's earlier remark about countable loops. That is, before we were pattern matching a particular pattern produced by the vectoriser, we are handling (vector) loops produced by the vectoriser. Thus, we will never execute Iteration #3 from the example above, because `ceil(ElementCount / VectorWidth) >= TripCount` will always hold for these loops. My question is, with the intrinsic approach, can we still rely on that, would that be a valid assumption to make? Along these same lines, this pass also relies on a check `IsPredicatedVectorLoop` and presence of masked loads/stores currently produced by the vectoriser, which gets its predicate from @llvm.get.active.lane.mask also generated by the vectoriser. SjoerdMeijer: > The problem here is that the original get_active_lane_mask can do something like this: > >…
		efriedmaUnsubmitted Not Done Reply Inline Actions It's not safe to assume get_active_lane_mask was produced by the LLVM vectorizer. I mean, the ACLE has a vctp intrinsic; it's not that much of a stretch that we could expose get_active_lane_mask to users at some point. And even if it was produced by the LLVM vectorizer, other passes that can modify the loop structure run between the vectorizer and the MVETailPredication pass. And even if you're looking at the unmodified vectorizer output, you still need to verify the "L" is actually the loop that was vectorized. In summary, I don't think it's a good idea to make assumptions beyond what LangRef actually promises. It's still a lot easier to pattern-match than it would be without the intrinsic. efriedma: It's not safe to assume get_active_lane_mask was produced by the LLVM vectorizer. I mean, the…
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions Ok, cheers, got it. It's still a lot easier to pattern-match than it would be without the intrinsic. Yep, will have a go at this next week. SjoerdMeijer: Ok, cheers, got it. > It's still a lot easier to pattern-match than it would be without the…
		efriedmaUnsubmitted Not Done Reply Inline Actions (It looks like the current version doesn't try to address the potential overflow in the subtraction.) efriedma: (It looks like the current version doesn't try to address the potential overflow in the…
Processed->addIncoming(Remaining, L->getLoopLatch());		Processed->addIncoming(Remaining, L->getLoopLatch());
LLVM_DEBUG(dbgs() << "ARM TP: Insert processed elements phi: "		LLVM_DEBUG(dbgs() << "ARM TP: Insert processed elements phi: "
<< *Processed << "\n"		<< *Processed << "\n"
<< "ARM TP: Inserted VCTP: " << *TailPredicate << "\n");		<< "ARM TP: Inserted VCTP: " << *VCTPCall << "\n");
}		}

bool MVETailPredication::TryConvert(Value *TripCount) {		bool MVETailPredication::TryConvert(Value *TripCount) {
if (!IsPredicatedVectorLoop()) {		if (!IsPredicatedVectorLoop()) {
LLVM_DEBUG(dbgs() << "ARM TP: no masked instructions in loop.\n");		LLVM_DEBUG(dbgs() << "ARM TP: no masked instructions in loop.\n");
return false;		return false;
}		}

LLVM_DEBUG(dbgs() << "ARM TP: Found predicated vector loop.\n");		LLVM_DEBUG(dbgs() << "ARM TP: Found predicated vector loop.\n");

// Walk through the masked intrinsics and try to find whether the predicate
// operand is generated from an induction variable.
SetVector<Instruction*> Predicates;		SetVector<Instruction*> Predicates;
DenseMap<Instruction, Instruction> NewPredicates;		DenseMap<Instruction, Instruction> NewPredicates;

#ifndef NDEBUG		// Walk through the masked intrinsics and try to find whether the predicate
// For debugging purposes, use this to indicate we have been able to		// operand is generated by intrinsic @llvm.get.active.lane.mask().
// pattern match the scalar loop trip count.
bool FoundScalarTC = false;
#endif

for (auto *I : MaskedInsts) {		for (auto *I : MaskedInsts) {
Intrinsic::ID ID = I->getIntrinsicID();		unsigned PredOp = I->getIntrinsicID() == Intrinsic::masked_load ? 2 : 3;
// First, find the icmp used by this masked load/store.
unsigned PredOp = ID == Intrinsic::masked_load ? 2 : 3;
auto *Predicate = dyn_cast<Instruction>(I->getArgOperand(PredOp));		auto *Predicate = dyn_cast<Instruction>(I->getArgOperand(PredOp));
if (!Predicate \|\| Predicates.count(Predicate))		if (!Predicate \|\| Predicates.count(Predicate))
continue;		continue;

// Step 1: using this icmp, now calculate the number of elements		ActiveLaneMask = dyn_cast<IntrinsicInst>(Predicate);
// processed by this loop.		if (!ActiveLaneMask \|\|
TripCountPattern TCP(Predicate, TripCount, getVectorType(I));		ActiveLaneMask->getIntrinsicID() != Intrinsic::get_active_lane_mask)
if (!(ComputeConstElements(TCP) \|\| ComputeRuntimeElements(TCP)))
continue;		continue;

LLVM_DEBUG(FoundScalarTC = true);

if (!isTailPredicate(TCP)) {
LLVM_DEBUG(dbgs() << "ARM TP: Not an icmp that generates tail predicate: "
<< *Predicate << "\n");
continue;
}

LLVM_DEBUG(dbgs() << "ARM TP: Found icmp generating tail predicate: "
<< *Predicate << "\n");
Predicates.insert(Predicate);		Predicates.insert(Predicate);
		LLVM_DEBUG(dbgs() << "ARM TP: Found active lane mask: "
		<< *ActiveLaneMask << "\n");

// Step 2: emit the VCTP intrinsic representing the effect of TP.		VecTy = getVectorType(I);
InsertVCTPIntrinsic(TCP, NewPredicates);		if (!IsSafeActiveMask(TripCount, VecTy)) {
}		LLVM_DEBUG(dbgs() << "ARM TP: Not safe to insert VCTP.\n");

if (!NewPredicates.size()) {
LLVM_DEBUG(if (!FoundScalarTC)
dbgs() << "ARM TP: Can't determine loop itertion count\n");
return false;		return false;
}		}
		LLVM_DEBUG(dbgs() << "ARM TP: Safe to insert VCTP.\n");
		InsertVCTPIntrinsic(ActiveLaneMask, TripCount, VecTy, NewPredicates);
		efriedmaUnsubmitted Not Done Reply Inline Actions Do you need to check that the first argument to get_active_lane_mask is an induction variable for the loop "L"? Do you need to check the types of the arguments to get_active_lane_mask? efriedma: Do you need to check that the first argument to get_active_lane_mask is an induction variable…
		}

// Now clean up.		// Now clean up.
ClonedVCTPInExitBlock = Cleanup(NewPredicates, Predicates, L);		ClonedVCTPInExitBlock = Cleanup(NewPredicates, Predicates, L);
return true;		return true;
}		}

Pass *llvm::createMVETailPredicationPass() {		Pass *llvm::createMVETailPredicationPass() {
return new MVETailPredication();		return new MVETailPredication();
}		}

char MVETailPredication::ID = 0;		char MVETailPredication::ID = 0;

INITIALIZE_PASS_BEGIN(MVETailPredication, DEBUG_TYPE, DESC, false, false)		INITIALIZE_PASS_BEGIN(MVETailPredication, DEBUG_TYPE, DESC, false, false)
INITIALIZE_PASS_END(MVETailPredication, DEBUG_TYPE, DESC, false, false)		INITIALIZE_PASS_END(MVETailPredication, DEBUG_TYPE, DESC, false, false)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll

	Show All 28 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer			%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer
	%induction = or <16 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>			%induction = or <16 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
	%tmp = getelementptr inbounds i8, i8* %a, i32 %index			%tmp = getelementptr inbounds i8, i8* %a, i32 %index
	%tmp1 = icmp ule <16 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <16 x i32> %induction, %broadcast.splat11
				%active.lane.mask = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp2 = bitcast i8* %tmp to <16 x i8>*			%tmp2 = bitcast i8* %tmp to <16 x i8>*
	%wide.masked.load = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp2, i32 4, <16 x i1> %tmp1, <16 x i8> undef)			%wide.masked.load = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp2, i32 4, <16 x i1> %active.lane.mask, <16 x i8> undef)
	%tmp3 = getelementptr inbounds i8, i8* %b, i32 %index			%tmp3 = getelementptr inbounds i8, i8* %b, i32 %index
	%tmp4 = bitcast i8* %tmp3 to <16 x i8>*			%tmp4 = bitcast i8* %tmp3 to <16 x i8>*
	%wide.masked.load2 = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp4, i32 4, <16 x i1> %tmp1, <16 x i8> undef)			%wide.masked.load2 = tail call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %tmp4, i32 4, <16 x i1> %active.lane.mask, <16 x i8> undef)
	%mul = mul nsw <16 x i8> %wide.masked.load2, %wide.masked.load			%mul = mul nsw <16 x i8> %wide.masked.load2, %wide.masked.load
	%tmp6 = getelementptr inbounds i8, i8* %c, i32 %index			%tmp6 = getelementptr inbounds i8, i8* %c, i32 %index
	%tmp7 = bitcast i8* %tmp6 to <16 x i8>*			%tmp7 = bitcast i8* %tmp6 to <16 x i8>*
	tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> %mul, <16 x i8>* %tmp7, i32 4, <16 x i1> %tmp1)			tail call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> %mul, <16 x i8>* %tmp7, i32 4, <16 x i1> %active.lane.mask)
	%index.next = add i32 %index, 16			%index.next = add i32 %index, 16
	%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)			%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
	%tmp16 = icmp ne i32 %tmp15, 0			%tmp16 = icmp ne i32 %tmp15, 0
	br i1 %tmp16, label %vector.body, label %for.cond.cleanup			br i1 %tmp16, label %vector.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %vector.body, %entry			for.cond.cleanup: ; preds = %vector.body, %entry
	ret void			ret void
	}			}
	Show All 27 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
	%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%tmp = getelementptr inbounds i16, i16* %a, i32 %index			%tmp = getelementptr inbounds i16, i16* %a, i32 %index
	%tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				%active.lane.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp2 = bitcast i16* %tmp to <8 x i16>*			%tmp2 = bitcast i16* %tmp to <8 x i16>*
	%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)			%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %active.lane.mask, <8 x i16> undef)
	%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index			%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index
	%tmp4 = bitcast i16* %tmp3 to <8 x i16>*			%tmp4 = bitcast i16* %tmp3 to <8 x i16>*
	%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)			%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %active.lane.mask, <8 x i16> undef)
	%mul = mul nsw <8 x i16> %wide.masked.load2, %wide.masked.load			%mul = mul nsw <8 x i16> %wide.masked.load2, %wide.masked.load
	%tmp6 = getelementptr inbounds i16, i16* %c, i32 %index			%tmp6 = getelementptr inbounds i16, i16* %c, i32 %index
	%tmp7 = bitcast i16* %tmp6 to <8 x i16>*			%tmp7 = bitcast i16* %tmp6 to <8 x i16>*
	tail call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %mul, <8 x i16>* %tmp7, i32 4, <8 x i1> %tmp1)			tail call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %mul, <8 x i16>* %tmp7, i32 4, <8 x i1> %active.lane.mask)
	%index.next = add i32 %index, 8			%index.next = add i32 %index, 8
	%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)			%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
	%tmp16 = icmp ne i32 %tmp15, 0			%tmp16 = icmp ne i32 %tmp15, 0
	br i1 %tmp16, label %vector.body, label %for.cond.cleanup			br i1 %tmp16, label %vector.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %vector.body, %entry			for.cond.cleanup: ; preds = %vector.body, %entry
	ret void			ret void
	}			}
	Show All 26 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp = getelementptr inbounds i32, i32* %a, i32 %index			%tmp = getelementptr inbounds i32, i32* %a, i32 %index
	%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11			; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
	%tmp2 = bitcast i32* %tmp to <4 x i32>*			%tmp2 = bitcast i32* %tmp to <4 x i32>*
	%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)
				%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index			%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
	%tmp4 = bitcast i32* %tmp3 to <4 x i32>*			%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
	%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%mul = mul nsw <4 x i32> %wide.masked.load2, %wide.masked.load			%mul = mul nsw <4 x i32> %wide.masked.load2, %wide.masked.load
	%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index			%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
	%tmp7 = bitcast i32* %tmp6 to <4 x i32>*			%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
	tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %mul, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)			tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %mul, <4 x i32>* %tmp7, i32 4, <4 x i1> %active.lane.mask)
	%index.next = add i32 %index, 4			%index.next = add i32 %index, 4
	%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)			%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
	%tmp16 = icmp ne i32 %tmp15, 0			%tmp16 = icmp ne i32 %tmp15, 0
	br i1 %tmp16, label %vector.body, label %for.cond.cleanup			br i1 %tmp16, label %vector.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %vector.body, %entry			for.cond.cleanup: ; preds = %vector.body, %entry
	ret void			ret void
	}			}
	Show All 27 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp = getelementptr inbounds i32, i32* %a, i32 %index			%tmp = getelementptr inbounds i32, i32* %a, i32 %index
	%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11			; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)
	%tmp2 = bitcast i32* %tmp to <4 x i32>*			%tmp2 = bitcast i32* %tmp to <4 x i32>*
	%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%extract.1.low = shufflevector <4 x i32> %wide.masked.load, <4 x i32> undef, < 2 x i32> < i32 0, i32 2>			%extract.1.low = shufflevector <4 x i32> %wide.masked.load, <4 x i32> undef, < 2 x i32> < i32 0, i32 2>
	%extract.1.high = shufflevector <4 x i32> %wide.masked.load, <4 x i32> undef, < 2 x i32> < i32 1, i32 3>			%extract.1.high = shufflevector <4 x i32> %wide.masked.load, <4 x i32> undef, < 2 x i32> < i32 1, i32 3>
	%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index			%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
	%tmp4 = bitcast i32* %tmp3 to <4 x i32>*			%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
	%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%extract.2.low = shufflevector <4 x i32> %wide.masked.load2, <4 x i32> undef, < 2 x i32> < i32 0, i32 2>			%extract.2.low = shufflevector <4 x i32> %wide.masked.load2, <4 x i32> undef, < 2 x i32> < i32 0, i32 2>
	%extract.2.high = shufflevector <4 x i32> %wide.masked.load2, <4 x i32> undef, < 2 x i32> < i32 1, i32 3>			%extract.2.high = shufflevector <4 x i32> %wide.masked.load2, <4 x i32> undef, < 2 x i32> < i32 1, i32 3>
	%mul = mul nsw <2 x i32> %extract.1.low, %extract.2.low			%mul = mul nsw <2 x i32> %extract.1.low, %extract.2.low
	%sub = sub nsw <2 x i32> %extract.1.high, %extract.2.high			%sub = sub nsw <2 x i32> %extract.1.high, %extract.2.high
	%combine = shufflevector <2 x i32> %mul, <2 x i32> %sub, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%combine = shufflevector <2 x i32> %mul, <2 x i32> %sub, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index			%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
	%tmp7 = bitcast i32* %tmp6 to <4 x i32>*			%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
	tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %combine, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)			tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %combine, <4 x i32>* %tmp7, i32 4, <4 x i1> %active.lane.mask)
	%index.next = add i32 %index, 4			%index.next = add i32 %index, 4
	%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)			%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
	%tmp16 = icmp ne i32 %tmp15, 0			%tmp16 = icmp ne i32 %tmp15, 0
	br i1 %tmp16, label %vector.body, label %for.cond.cleanup			br i1 %tmp16, label %vector.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %vector.body, %entry			for.cond.cleanup: ; preds = %vector.body, %entry
	ret void			ret void
	}			}
	Show All 26 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp = getelementptr inbounds i32, i32* %a, i32 %index			%tmp = getelementptr inbounds i32, i32* %a, i32 %index
	%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%wrong = icmp ult <4 x i32> %induction, %broadcast.splat11			%wrong = icmp ult <4 x i32> %induction, %broadcast.splat11
	%tmp2 = bitcast i32* %tmp to <4 x i32>*			%tmp2 = bitcast i32* %tmp to <4 x i32>*
	%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index			%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
	%tmp4 = bitcast i32* %tmp3 to <4 x i32>*			%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
	%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %wrong, <4 x i32> undef)			%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %wrong, <4 x i32> undef)
	%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load			%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
	%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index			%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
	%tmp7 = bitcast i32* %tmp6 to <4 x i32>*			%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
	tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %tmp1)			tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %active.lane.mask)
	%index.next = add i32 %index, 4			%index.next = add i32 %index, 4
	%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)			%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
	%tmp16 = icmp ne i32 %tmp15, 0			%tmp16 = icmp ne i32 %tmp15, 0
	br i1 %tmp16, label %vector.body, label %for.cond.cleanup			br i1 %tmp16, label %vector.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %vector.body, %entry			for.cond.cleanup: ; preds = %vector.body, %entry
	ret void			ret void
	}			}

	; The store now uses ult predicate.			; The store now uses ult predicate.
	; CHECK-LABEL: mismatch_store_pred			; CHECK-LABEL: mismatch_store_pred
				; CHECK: vector.body:
	; CHECK: %index = phi i32			; CHECK: %index = phi i32
	; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]			; CHECK: [[ELEMS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[REMAINING:%[^ ]+]], %vector.body ]
	; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[ELEMS]])			; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[ELEMS]])
	; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 4			; CHECK: [[REMAINING]] = sub i32 [[ELEMS]], 4
	; CHECK: [[LD0:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)			; CHECK: [[LD0:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
	; CHECK: [[LD1:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)			; CHECK: [[LD1:%[^ ]+]] = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]], <4 x i32> undef)
	; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.}}, <4 x i32> {{.*}}, i32 4, <4 x i1> %wrong)			; CHECK: tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> {{.}}, <4 x i32> {{.*}}, i32 4, <4 x i1> %wrong)
	define dso_local arm_aapcs_vfpcc void @mismatch_store_pred(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {			define dso_local arm_aapcs_vfpcc void @mismatch_store_pred(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture %c, i32 %N) {
	Show All 16 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp = getelementptr inbounds i32, i32* %a, i32 %index			%tmp = getelementptr inbounds i32, i32* %a, i32 %index
	%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%wrong = icmp ult <4 x i32> %induction, %broadcast.splat11			%wrong = icmp ult <4 x i32> %induction, %broadcast.splat11
	%tmp2 = bitcast i32* %tmp to <4 x i32>*			%tmp2 = bitcast i32* %tmp to <4 x i32>*
	%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index			%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
	%tmp4 = bitcast i32* %tmp3 to <4 x i32>*			%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
	%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load12 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
	%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load			%tmp5 = mul nsw <4 x i32> %wide.masked.load12, %wide.masked.load
	%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index			%tmp6 = getelementptr inbounds i32, i32* %c, i32 %index
	%tmp7 = bitcast i32* %tmp6 to <4 x i32>*			%tmp7 = bitcast i32* %tmp6 to <4 x i32>*
	tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %wrong)			tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %tmp5, <4 x i32>* %tmp7, i32 4, <4 x i1> %wrong)
	%index.next = add i32 %index, 4			%index.next = add i32 %index, 4
	%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)			%tmp15 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %tmp14, i32 1)
	%tmp16 = icmp ne i32 %tmp15, 0			%tmp16 = icmp ne i32 %tmp15, 0
	br i1 %tmp16, label %vector.body, label %for.cond.cleanup			br i1 %tmp16, label %vector.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %vector.body, %entry			for.cond.cleanup: ; preds = %vector.body, %entry
	ret void			ret void
	}			}

	declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)			declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)
	declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)			declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)
	declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)			declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
	declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)			declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
	declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)			declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
	declare void @llvm.masked.store.v2i64.p0v2i64(<2 x i64>, <2 x i64>*, i32 immarg, <2 x i1>)			declare void @llvm.masked.store.v2i64.p0v2i64(<2 x i64>, <2 x i64>*, i32 immarg, <2 x i1>)
	declare <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>*, i32 immarg, <2 x i1>, <2 x i64>)			declare <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>*, i32 immarg, <2 x i1>, <2 x i64>)
	declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)			declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
	declare void @llvm.set.loop.iterations.i32(i32)			declare void @llvm.set.loop.iterations.i32(i32)
	declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)			declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
				declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)
				declare <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/clear-maskedinsts.ll

Show All 17 Lines
; CHECK-NEXT: br i1 [[TOBOOL]], label [[VECTOR_BODY75_PREHEADER:%.]], label [[VECTOR_PH:%.]]		; CHECK-NEXT: br i1 [[TOBOOL]], label [[VECTOR_BODY75_PREHEADER:%.]], label [[VECTOR_PH:%.]]
; CHECK: vector.body75.preheader:		; CHECK: vector.body75.preheader:
; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP2]])		; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP2]])
; CHECK-NEXT: br label [[VECTOR_BODY75:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY75:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[BROADCAST_SPLATINSERT71:%.*]] = insertelement <4 x i32> undef, i32 [[X]], i32 0		; CHECK-NEXT: [[BROADCAST_SPLATINSERT71:%.*]] = insertelement <4 x i32> undef, i32 [[X]], i32 0
; CHECK-NEXT: [[BROADCAST_SPLAT72:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT71]], <4 x i32> undef, <4 x i32> zeroinitializer		; CHECK-NEXT: [[BROADCAST_SPLAT72:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT71]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP3]])		; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP3]])
		; CHECK-NEXT: [[NUM_ELEMENTS:%.*]] = add i32 [[TRIP_COUNT_MINUS_183]], 1
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[LSR_IV9:%.]] = phi i32 [ [[SCEVGEP10:%.]], [[VECTOR_BODY]] ], [ [[D:%.]], [[VECTOR_PH]] ]		; CHECK-NEXT: [[LSR_IV9:%.]] = phi i32 [ [[SCEVGEP10:%.]], [[VECTOR_BODY]] ], [ [[D:%.]], [[VECTOR_PH]] ]
; CHECK-NEXT: [[TMP4:%.]] = phi i32 [ [[TMP3]], [[VECTOR_PH]] ], [ [[TMP8:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP5:%.]] = phi i32 [ [[N]], [[VECTOR_PH]] ], [ [[TMP7:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[TMP4:%.]] = phi i32 [ [[TMP3]], [[VECTOR_PH]] ], [ [[TMP10:%.]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP5:%.]] = phi i32 [ [[NUM_ELEMENTS]], [[VECTOR_PH]] ], [ [[TMP9:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[LSR_IV911:%.]] = bitcast i32 [[LSR_IV9]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV911:%.]] = bitcast i32 [[LSR_IV9]] to <4 x i32>*
; CHECK-NEXT: [[TMP6:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP5]])		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0
; CHECK-NEXT: [[TMP7]] = sub i32 [[TMP5]], 4		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[BROADCAST_SPLAT72]], <4 x i32>* [[LSR_IV911]], i32 4, <4 x i1> [[TMP6]])		; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>
		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> undef, i32 [[TRIP_COUNT_MINUS_183]], i32 0
		; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[TMP8:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP5]])
		; CHECK-NEXT: [[TMP9]] = sub i32 [[TMP5]], 4
		; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[BROADCAST_SPLAT72]], <4 x i32>* [[LSR_IV911]], i32 4, <4 x i1> [[TMP8]])
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
; CHECK-NEXT: [[SCEVGEP10]] = getelementptr i32, i32* [[LSR_IV9]], i32 4		; CHECK-NEXT: [[SCEVGEP10]] = getelementptr i32, i32* [[LSR_IV9]], i32 4
; CHECK-NEXT: [[TMP8]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP4]], i32 1)		; CHECK-NEXT: [[TMP10]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP4]], i32 1)
; CHECK-NEXT: [[TMP9:%.*]] = icmp ne i32 [[TMP8]], 0		; CHECK-NEXT: [[TMP11:%.*]] = icmp ne i32 [[TMP10]], 0
; CHECK-NEXT: br i1 [[TMP9]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]]		; CHECK-NEXT: br i1 [[TMP11]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]]
; CHECK: vector.body75:		; CHECK: vector.body75:
; CHECK-NEXT: [[LSR_IV6:%.]] = phi i32 [ [[S1:%.]], [[VECTOR_BODY75_PREHEADER]] ], [ [[SCEVGEP7:%.]], [[VECTOR_BODY75]] ]		; CHECK-NEXT: [[LSR_IV6:%.]] = phi i32 [ [[S1:%.]], [[VECTOR_BODY75_PREHEADER]] ], [ [[SCEVGEP7:%.]], [[VECTOR_BODY75]] ]
; CHECK-NEXT: [[LSR_IV3:%.]] = phi i32 [ [[S2:%.]], [[VECTOR_BODY75_PREHEADER]] ], [ [[SCEVGEP4:%.]], [[VECTOR_BODY75]] ]		; CHECK-NEXT: [[LSR_IV3:%.]] = phi i32 [ [[S2:%.]], [[VECTOR_BODY75_PREHEADER]] ], [ [[SCEVGEP4:%.]], [[VECTOR_BODY75]] ]
; CHECK-NEXT: [[LSR_IV:%.]] = phi i32 [ [[D]], [[VECTOR_BODY75_PREHEADER]] ], [ [[SCEVGEP:%.*]], [[VECTOR_BODY75]] ]		; CHECK-NEXT: [[LSR_IV:%.]] = phi i32 [ [[D]], [[VECTOR_BODY75_PREHEADER]] ], [ [[SCEVGEP:%.*]], [[VECTOR_BODY75]] ]
; CHECK-NEXT: [[INDEX80:%.]] = phi i32 [ [[INDEX_NEXT81:%.]], [[VECTOR_BODY75]] ], [ 0, [[VECTOR_BODY75_PREHEADER]] ]		; CHECK-NEXT: [[INDEX80:%.]] = phi i32 [ [[INDEX_NEXT81:%.]], [[VECTOR_BODY75]] ], [ 0, [[VECTOR_BODY75_PREHEADER]] ]
; CHECK-NEXT: [[TMP10:%.]] = phi i32 [ [[TMP2]], [[VECTOR_BODY75_PREHEADER]] ], [ [[TMP15:%.]], [[VECTOR_BODY75]] ]		; CHECK-NEXT: [[TMP12:%.]] = phi i32 [ [[TMP2]], [[VECTOR_BODY75_PREHEADER]] ], [ [[TMP17:%.]], [[VECTOR_BODY75]] ]
; CHECK-NEXT: [[LSR_IV68:%.]] = bitcast i32 [[LSR_IV6]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV68:%.]] = bitcast i32 [[LSR_IV6]] to <4 x i32>*
; CHECK-NEXT: [[LSR_IV35:%.]] = bitcast i32 [[LSR_IV3]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV35:%.]] = bitcast i32 [[LSR_IV3]] to <4 x i32>*
; CHECK-NEXT: [[LSR_IV2:%.]] = bitcast i32 [[LSR_IV]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV2:%.]] = bitcast i32 [[LSR_IV]] to <4 x i32>*
; CHECK-NEXT: [[BROADCAST_SPLATINSERT84:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX80]], i32 0		; CHECK-NEXT: [[BROADCAST_SPLATINSERT84:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX80]], i32 0
; CHECK-NEXT: [[BROADCAST_SPLAT85:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT84]], <4 x i32> undef, <4 x i32> zeroinitializer		; CHECK-NEXT: [[BROADCAST_SPLAT85:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT84]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT: [[INDUCTION86:%.*]] = add <4 x i32> [[BROADCAST_SPLAT85]], <i32 0, i32 1, i32 2, i32 3>		; CHECK-NEXT: [[INDUCTION86:%.*]] = add <4 x i32> [[BROADCAST_SPLAT85]], <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> undef, i32 [[TRIP_COUNT_MINUS_183]], i32 0		; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> undef, i32 [[TRIP_COUNT_MINUS_183]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <4 x i32> [[TMP11]], <4 x i32> undef, <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <4 x i32> [[TMP13]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP13:%.*]] = icmp ule <4 x i32> [[INDUCTION86]], [[TMP12]]		; CHECK-NEXT: [[TMP15:%.*]] = icmp ule <4 x i32> [[INDUCTION86]], [[TMP14]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV68]], i32 4, <4 x i1> [[TMP13]], <4 x i32> undef)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV68]], i32 4, <4 x i1> [[TMP15]], <4 x i32> undef)
; CHECK-NEXT: [[WIDE_MASKED_LOAD89:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV35]], i32 4, <4 x i1> [[TMP13]], <4 x i32> undef)		; CHECK-NEXT: [[WIDE_MASKED_LOAD89:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV35]], i32 4, <4 x i1> [[TMP15]], <4 x i32> undef)
; CHECK-NEXT: [[TMP14:%.*]] = call <4 x i32> @llvm.usub.sat.v4i32(<4 x i32> [[WIDE_MASKED_LOAD89]], <4 x i32> [[WIDE_MASKED_LOAD]])		; CHECK-NEXT: [[TMP16:%.*]] = call <4 x i32> @llvm.usub.sat.v4i32(<4 x i32> [[WIDE_MASKED_LOAD89]], <4 x i32> [[WIDE_MASKED_LOAD]])
; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP14]], <4 x i32>* [[LSR_IV2]], i32 4, <4 x i1> [[TMP13]])		; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP16]], <4 x i32>* [[LSR_IV2]], i32 4, <4 x i1> [[TMP15]])
; CHECK-NEXT: [[INDEX_NEXT81]] = add i32 [[INDEX80]], 4		; CHECK-NEXT: [[INDEX_NEXT81]] = add i32 [[INDEX80]], 4
; CHECK-NEXT: [[SCEVGEP]] = getelementptr i32, i32* [[LSR_IV]], i32 4		; CHECK-NEXT: [[SCEVGEP]] = getelementptr i32, i32* [[LSR_IV]], i32 4
; CHECK-NEXT: [[SCEVGEP4]] = getelementptr i32, i32* [[LSR_IV3]], i32 4		; CHECK-NEXT: [[SCEVGEP4]] = getelementptr i32, i32* [[LSR_IV3]], i32 4
; CHECK-NEXT: [[SCEVGEP7]] = getelementptr i32, i32* [[LSR_IV6]], i32 4		; CHECK-NEXT: [[SCEVGEP7]] = getelementptr i32, i32* [[LSR_IV6]], i32 4
; CHECK-NEXT: [[TMP15]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP10]], i32 1)		; CHECK-NEXT: [[TMP17]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP12]], i32 1)
; CHECK-NEXT: [[TMP16:%.*]] = icmp ne i32 [[TMP15]], 0		; CHECK-NEXT: [[TMP18:%.*]] = icmp ne i32 [[TMP17]], 0
; CHECK-NEXT: br i1 [[TMP16]], label [[VECTOR_BODY75]], label [[FOR_COND_CLEANUP]]		; CHECK-NEXT: br i1 [[TMP18]], label [[VECTOR_BODY75]], label [[FOR_COND_CLEANUP]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret i32 0		; CHECK-NEXT: ret i32 0
;		;
entry:		entry:
%cmp63 = icmp sgt i32 %n, 0		%cmp63 = icmp sgt i32 %n, 0
br i1 %cmp63, label %for.body.lr.ph, label %for.cond.cleanup		br i1 %cmp63, label %for.body.lr.ph, label %for.cond.cleanup

for.body.lr.ph: ; preds = %entry		for.body.lr.ph: ; preds = %entry
Show All 22 Lines	vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%4 = phi i32 [ %3, %vector.ph ], [ %8, %vector.body ]		%4 = phi i32 [ %3, %vector.ph ], [ %8, %vector.body ]
%lsr.iv911 = bitcast i32* %lsr.iv9 to <4 x i32>*		%lsr.iv911 = bitcast i32* %lsr.iv9 to <4 x i32>*
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%5 = insertelement <4 x i32> undef, i32 %trip.count.minus.183, i32 0		%5 = insertelement <4 x i32> undef, i32 %trip.count.minus.183, i32 0
%6 = shufflevector <4 x i32> %5, <4 x i32> undef, <4 x i32> zeroinitializer		%6 = shufflevector <4 x i32> %5, <4 x i32> undef, <4 x i32> zeroinitializer
%7 = icmp ule <4 x i32> %induction, %6		%7 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.183)
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %broadcast.splat72, <4 x i32>* %lsr.iv911, i32 4, <4 x i1> %7)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %broadcast.splat72, <4 x i32>* %lsr.iv911, i32 4, <4 x i1> %7)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%scevgep10 = getelementptr i32, i32* %lsr.iv9, i32 4		%scevgep10 = getelementptr i32, i32* %lsr.iv9, i32 4
%8 = call i32 @llvm.loop.decrement.reg.i32(i32 %4, i32 1)		%8 = call i32 @llvm.loop.decrement.reg.i32(i32 %4, i32 1)
%9 = icmp ne i32 %8, 0		%9 = icmp ne i32 %8, 0
br i1 %9, label %vector.body, label %for.cond.cleanup		br i1 %9, label %vector.body, label %for.cond.cleanup

vector.body75: ; preds = %vector.body75, %vector.body75.preheader		vector.body75: ; preds = %vector.body75, %vector.body75.preheader
Show All 26 Lines
for.cond.cleanup: ; preds = %vector.body, %vector.body75, %entry		for.cond.cleanup: ; preds = %vector.body, %vector.body75, %entry
ret i32 0		ret i32 0
}		}
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
declare <4 x i32> @llvm.usub.sat.v4i32(<4 x i32>, <4 x i32>)		declare <4 x i32> @llvm.usub.sat.v4i32(<4 x i32>, <4 x i32>)
declare void @llvm.set.loop.iterations.i32(i32)		declare void @llvm.set.loop.iterations.i32(i32)
declare i32 @llvm.loop.decrement.reg.i32(i32, i32)		declare i32 @llvm.loop.decrement.reg.i32(i32, i32)

		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
		declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)
		declare <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/cond-vector-reduce-mve-codegen.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -disable-mve-tail-predication=false --verify-machineinstrs %s -o - \| FileCheck %s		; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -disable-mve-tail-predication=false --verify-machineinstrs %s -o - \| FileCheck %s

define dso_local i32 @vpsel_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c, i32 %N) {		define dso_local i32 @vpsel_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c, i32 %N) {
; CHECK-LABEL: vpsel_mul_reduce_add:		; CHECK-LABEL: vpsel_mul_reduce_add:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: itt eq		; CHECK-NEXT: itt eq
; CHECK-NEXT: moveq r0, #0		; CHECK-NEXT: moveq r0, #0
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r4, r5, r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: sub sp, #4		; CHECK-NEXT: sub sp, #4
; CHECK-NEXT: adds r4, r3, #3		; CHECK-NEXT: add.w r12, r3, #3
		; CHECK-NEXT: mov.w lr, #1
		; CHECK-NEXT: bic r12, r12, #3
; CHECK-NEXT: vmov.i32 q1, #0x0		; CHECK-NEXT: vmov.i32 q1, #0x0
; CHECK-NEXT: bic r4, r4, #3		; CHECK-NEXT: sub.w r12, r12, #4
; CHECK-NEXT: sub.w r12, r4, #4		; CHECK-NEXT: add.w lr, lr, r12, lsr #2
; CHECK-NEXT: movs r4, #1		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: add.w lr, r4, r12, lsr #2
; CHECK-NEXT: lsr.w r4, r12, #2
; CHECK-NEXT: sub.w r12, r3, r4, lsl #2
; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB0_1: @ %vector.body		; CHECK-NEXT: .LBB0_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r3		; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: and r5, r4, #15		; CHECK-NEXT: and r4, r12, #15
; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill		; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill
; CHECK-NEXT: vdup.32 q3, r5		; CHECK-NEXT: vdup.32 q3, r4
; CHECK-NEXT: vmov q0, q1		; CHECK-NEXT: vmov q0, q1
; CHECK-NEXT: vpstt		; CHECK-NEXT: vpstt
; CHECK-NEXT: vldrwt.u32 q1, [r2], #16		; CHECK-NEXT: vldrwt.u32 q1, [r2], #16
; CHECK-NEXT: vldrwt.u32 q2, [r1], #16		; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
; CHECK-NEXT: vcmp.i32 eq, q3, zr		; CHECK-NEXT: vcmp.i32 eq, q3, zr
; CHECK-NEXT: adds r4, #4		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vpsel q1, q2, q1		; CHECK-NEXT: vpsel q1, q2, q1
; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload		; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q2, [r0], #16		; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
; CHECK-NEXT: vmul.i32 q1, q1, q2		; CHECK-NEXT: vmul.i32 q1, q1, q2
; CHECK-NEXT: subs r3, #4		; CHECK-NEXT: subs r3, #4
; CHECK-NEXT: vadd.i32 q1, q1, q0		; CHECK-NEXT: vadd.i32 q1, q1, q0
; CHECK-NEXT: le lr, .LBB0_1		; CHECK-NEXT: le lr, .LBB0_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r12
; CHECK-NEXT: vpsel q0, q1, q0		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: add sp, #4		; CHECK-NEXT: add sp, #4
; CHECK-NEXT: pop {r4, r5, r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp eq i32 %N, 0		%cmp8 = icmp eq i32 %N, 0
br i1 %cmp8, label %for.cond.cleanup, label %vector.ph		br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%tmp = getelementptr inbounds i32, i32* %a, i32 %index		%tmp = getelementptr inbounds i32, i32* %a, i32 %index
%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		%tmp1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%tmp2 = bitcast i32* %tmp to <4 x i32>*		%tmp2 = bitcast i32* %tmp to <4 x i32>*
%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index		%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
%tmp4 = bitcast i32* %tmp3 to <4 x i32>*		%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%tmp5 = getelementptr inbounds i32, i32* %c, i32 %index		%tmp5 = getelementptr inbounds i32, i32* %c, i32 %index
%tmp6 = bitcast i32* %tmp5 to <4 x i32>*		%tmp6 = bitcast i32* %tmp5 to <4 x i32>*
%wide.masked.load.c = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp6, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.c = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp6, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
Show All 16 Lines
for.cond.cleanup: ; preds = %middle.block, %entry		for.cond.cleanup: ; preds = %middle.block, %entry
%res.0.lcssa = phi i32 [ 0, %entry ], [ %tmp9, %middle.block ]		%res.0.lcssa = phi i32 [ 0, %entry ], [ %tmp9, %middle.block ]
ret i32 %res.0.lcssa		ret i32 %res.0.lcssa
}		}

define dso_local i32 @vpsel_mul_reduce_add_2(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b,		define dso_local i32 @vpsel_mul_reduce_add_2(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b,
; CHECK-LABEL: vpsel_mul_reduce_add_2:		; CHECK-LABEL: vpsel_mul_reduce_add_2:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r4, r5, r6, lr}		; CHECK-NEXT: push {r4, r5, r7, lr}
; CHECK-NEXT: sub sp, #4		; CHECK-NEXT: sub sp, #4
; CHECK-NEXT: ldr.w r12, [sp, #20]		; CHECK-NEXT: ldr.w r12, [sp, #20]
; CHECK-NEXT: cmp.w r12, #0		; CHECK-NEXT: cmp.w r12, #0
; CHECK-NEXT: beq .LBB1_4		; CHECK-NEXT: beq .LBB1_4
; CHECK-NEXT: @ %bb.1: @ %vector.ph		; CHECK-NEXT: @ %bb.1: @ %vector.ph
; CHECK-NEXT: add.w r5, r12, #3		; CHECK-NEXT: add.w r4, r12, #3
; CHECK-NEXT: vmov.i32 q1, #0x0		; CHECK-NEXT: vmov.i32 q1, #0x0
; CHECK-NEXT: bic r5, r5, #3		; CHECK-NEXT: bic r4, r4, #3
; CHECK-NEXT: subs r4, r5, #4		; CHECK-NEXT: sub.w lr, r4, #4
; CHECK-NEXT: movs r5, #1		; CHECK-NEXT: movs r4, #1
; CHECK-NEXT: add.w lr, r5, r4, lsr #2		; CHECK-NEXT: add.w lr, r4, lr, lsr #2
; CHECK-NEXT: lsrs r4, r4, #2		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: sub.w r4, r12, r4, lsl #2
; CHECK-NEXT: movs r5, #0
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB1_2: @ %vector.body		; CHECK-NEXT: .LBB1_2: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r12		; CHECK-NEXT: vctp.32 r12
; CHECK-NEXT: and r6, r5, #15		; CHECK-NEXT: and r5, r4, #15
; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill		; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill
; CHECK-NEXT: vmov q0, q1		; CHECK-NEXT: vmov q0, q1
; CHECK-NEXT: vpstt		; CHECK-NEXT: vpstt
; CHECK-NEXT: vldrwt.u32 q1, [r3], #16		; CHECK-NEXT: vldrwt.u32 q1, [r3], #16
; CHECK-NEXT: vldrwt.u32 q2, [r2], #16		; CHECK-NEXT: vldrwt.u32 q2, [r2], #16
; CHECK-NEXT: vdup.32 q3, r6		; CHECK-NEXT: vdup.32 q3, r5
; CHECK-NEXT: vsub.i32 q1, q2, q1		; CHECK-NEXT: vsub.i32 q1, q2, q1
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q2, [r1], #16		; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
; CHECK-NEXT: vcmp.i32 eq, q3, zr		; CHECK-NEXT: vcmp.i32 eq, q3, zr
; CHECK-NEXT: adds r5, #4		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vpsel q1, q1, q2		; CHECK-NEXT: vpsel q1, q1, q2
; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload		; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q2, [r0], #16		; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
; CHECK-NEXT: vmul.i32 q1, q1, q2		; CHECK-NEXT: vmul.i32 q1, q1, q2
; CHECK-NEXT: sub.w r12, r12, #4		; CHECK-NEXT: sub.w r12, r12, #4
; CHECK-NEXT: vadd.i32 q1, q1, q0		; CHECK-NEXT: vadd.i32 q1, q1, q0
; CHECK-NEXT: le lr, .LBB1_2		; CHECK-NEXT: le lr, .LBB1_2
; CHECK-NEXT: @ %bb.3: @ %middle.block		; CHECK-NEXT: @ %bb.3: @ %middle.block
; CHECK-NEXT: vctp.32 r4
; CHECK-NEXT: vpsel q0, q1, q0		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: add sp, #4		; CHECK-NEXT: add sp, #4
; CHECK-NEXT: pop {r4, r5, r6, pc}		; CHECK-NEXT: pop {r4, r5, r7, pc}
; CHECK-NEXT: .LBB1_4:		; CHECK-NEXT: .LBB1_4:
; CHECK-NEXT: movs r0, #0		; CHECK-NEXT: movs r0, #0
; CHECK-NEXT: add sp, #4		; CHECK-NEXT: add sp, #4
; CHECK-NEXT: pop {r4, r5, r6, pc}		; CHECK-NEXT: pop {r4, r5, r7, pc}
i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {		i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {
entry:		entry:
%cmp8 = icmp eq i32 %N, 0		%cmp8 = icmp eq i32 %N, 0
br i1 %cmp8, label %for.cond.cleanup, label %vector.ph		br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%tmp = getelementptr inbounds i32, i32* %a, i32 %index		%tmp = getelementptr inbounds i32, i32* %a, i32 %index
%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		%tmp1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%tmp2 = bitcast i32* %tmp to <4 x i32>*		%tmp2 = bitcast i32* %tmp to <4 x i32>*
%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index		%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
%tmp4 = bitcast i32* %tmp3 to <4 x i32>*		%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%tmp5 = getelementptr inbounds i32, i32* %c, i32 %index		%tmp5 = getelementptr inbounds i32, i32* %c, i32 %index
%tmp6 = bitcast i32* %tmp5 to <4 x i32>*		%tmp6 = bitcast i32* %tmp5 to <4 x i32>*
%wide.masked.load.c = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp6, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.c = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp6, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
Show All 20 Lines
for.cond.cleanup: ; preds = %middle.block, %entry		for.cond.cleanup: ; preds = %middle.block, %entry
%res.0.lcssa = phi i32 [ 0, %entry ], [ %reduce, %middle.block ]		%res.0.lcssa = phi i32 [ 0, %entry ], [ %reduce, %middle.block ]
ret i32 %res.0.lcssa		ret i32 %res.0.lcssa
}		}

define dso_local i32 @and_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b,		define dso_local i32 @and_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b,
; CHECK-LABEL: and_mul_reduce_add:		; CHECK-LABEL: and_mul_reduce_add:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r4, r5, r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: ldr.w r12, [sp, #16]		; CHECK-NEXT: sub sp, #4
		; CHECK-NEXT: ldr.w r12, [sp, #12]
; CHECK-NEXT: cmp.w r12, #0		; CHECK-NEXT: cmp.w r12, #0
; CHECK-NEXT: beq .LBB2_4		; CHECK-NEXT: beq .LBB2_4
; CHECK-NEXT: @ %bb.1: @ %vector.ph		; CHECK-NEXT: @ %bb.1: @ %vector.ph
; CHECK-NEXT: add.w r4, r12, #3		; CHECK-NEXT: add.w r4, r12, #3
; CHECK-NEXT: vmov.i32 q1, #0x0		; CHECK-NEXT: vmov.i32 q1, #0x0
; CHECK-NEXT: bic r4, r4, #3		; CHECK-NEXT: bic r4, r4, #3
; CHECK-NEXT: subs r5, r4, #4		; CHECK-NEXT: sub.w lr, r4, #4
; CHECK-NEXT: movs r4, #1		; CHECK-NEXT: movs r4, #1
; CHECK-NEXT: add.w lr, r4, r5, lsr #2		; CHECK-NEXT: add.w lr, r4, lr, lsr #2
; CHECK-NEXT: lsrs r4, r5, #2		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: sub.w r4, r12, r4, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB2_2: @ %vector.body		; CHECK-NEXT: .LBB2_2: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r12		; CHECK-NEXT: vctp.32 r12
; CHECK-NEXT: vmov q0, q1		; CHECK-NEXT: vmov q0, q1
; CHECK-NEXT: vpstt		; CHECK-NEXT: vpstt
; CHECK-NEXT: vldrwt.u32 q1, [r1], #16		; CHECK-NEXT: vldrwt.u32 q1, [r1], #16
; CHECK-NEXT: vldrwt.u32 q2, [r0], #16		; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
; CHECK-NEXT: sub.w r12, r12, #4		; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill
; CHECK-NEXT: vsub.i32 q1, q2, q1		; CHECK-NEXT: vsub.i32 q1, q2, q1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vpsttt		; CHECK-NEXT: vpsttt
; CHECK-NEXT: vcmpt.i32 eq, q1, zr		; CHECK-NEXT: vcmpt.i32 eq, q1, zr
; CHECK-NEXT: vldrwt.u32 q1, [r3], #16		; CHECK-NEXT: vldrwt.u32 q1, [r3], #16
; CHECK-NEXT: vldrwt.u32 q2, [r2], #16		; CHECK-NEXT: vldrwt.u32 q2, [r2], #16
		; CHECK-NEXT: sub.w r12, r12, #4
; CHECK-NEXT: vmul.i32 q1, q2, q1		; CHECK-NEXT: vmul.i32 q1, q2, q1
; CHECK-NEXT: vadd.i32 q1, q1, q0		; CHECK-NEXT: vadd.i32 q1, q1, q0
; CHECK-NEXT: le lr, .LBB2_2		; CHECK-NEXT: le lr, .LBB2_2
; CHECK-NEXT: @ %bb.3: @ %middle.block		; CHECK-NEXT: @ %bb.3: @ %middle.block
; CHECK-NEXT: vctp.32 r4		; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload
; CHECK-NEXT: vpsel q0, q1, q0		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r4, r5, r7, pc}		; CHECK-NEXT: add sp, #4
		; CHECK-NEXT: pop {r4, pc}
; CHECK-NEXT: .LBB2_4:		; CHECK-NEXT: .LBB2_4:
; CHECK-NEXT: movs r0, #0		; CHECK-NEXT: movs r0, #0
; CHECK-NEXT: pop {r4, r5, r7, pc}		; CHECK-NEXT: add sp, #4
		; CHECK-NEXT: pop {r4, pc}
i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {		i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {
entry:		entry:
%cmp8 = icmp eq i32 %N, 0		%cmp8 = icmp eq i32 %N, 0
br i1 %cmp8, label %for.cond.cleanup, label %vector.ph		br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%tmp = getelementptr inbounds i32, i32* %a, i32 %index		%tmp = getelementptr inbounds i32, i32* %a, i32 %index
%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		%tmp1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%tmp2 = bitcast i32* %tmp to <4 x i32>*		%tmp2 = bitcast i32* %tmp to <4 x i32>*
%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index		%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
%tmp4 = bitcast i32* %tmp3 to <4 x i32>*		%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%sub = sub <4 x i32> %wide.masked.load.a, %wide.masked.load.b		%sub = sub <4 x i32> %wide.masked.load.a, %wide.masked.load.b
%cmp = icmp eq <4 x i32> %sub, <i32 0, i32 0, i32 0, i32 0>		%cmp = icmp eq <4 x i32> %sub, <i32 0, i32 0, i32 0, i32 0>
%mask = and <4 x i1> %cmp, %tmp1		%mask = and <4 x i1> %cmp, %tmp1
Show All 17 Lines
for.cond.cleanup: ; preds = %middle.block, %entry		for.cond.cleanup: ; preds = %middle.block, %entry
%res.0.lcssa = phi i32 [ 0, %entry ], [ %reduce, %middle.block ]		%res.0.lcssa = phi i32 [ 0, %entry ], [ %reduce, %middle.block ]
ret i32 %res.0.lcssa		ret i32 %res.0.lcssa
}		}

define dso_local i32 @or_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {		define dso_local i32 @or_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {
; CHECK-LABEL: or_mul_reduce_add:		; CHECK-LABEL: or_mul_reduce_add:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r4, r5, r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: ldr.w r12, [sp, #16]		; CHECK-NEXT: sub sp, #4
		; CHECK-NEXT: ldr.w r12, [sp, #12]
; CHECK-NEXT: cmp.w r12, #0		; CHECK-NEXT: cmp.w r12, #0
; CHECK-NEXT: beq .LBB3_4		; CHECK-NEXT: beq .LBB3_4
; CHECK-NEXT: @ %bb.1: @ %vector.ph		; CHECK-NEXT: @ %bb.1: @ %vector.ph
; CHECK-NEXT: add.w r4, r12, #3		; CHECK-NEXT: add.w r4, r12, #3
; CHECK-NEXT: vmov.i32 q1, #0x0		; CHECK-NEXT: vmov.i32 q1, #0x0
; CHECK-NEXT: bic r4, r4, #3		; CHECK-NEXT: bic r4, r4, #3
; CHECK-NEXT: subs r5, r4, #4		; CHECK-NEXT: sub.w lr, r4, #4
; CHECK-NEXT: movs r4, #1		; CHECK-NEXT: movs r4, #1
; CHECK-NEXT: add.w lr, r4, r5, lsr #2		; CHECK-NEXT: add.w lr, r4, lr, lsr #2
; CHECK-NEXT: lsrs r4, r5, #2		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: sub.w r4, r12, r4, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB3_2: @ %vector.body		; CHECK-NEXT: .LBB3_2: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r12		; CHECK-NEXT: vctp.32 r12
; CHECK-NEXT: vmov q0, q1		; CHECK-NEXT: vmov q0, q1
; CHECK-NEXT: vpstt		; CHECK-NEXT: vpstt
; CHECK-NEXT: vldrwt.u32 q1, [r1], #16		; CHECK-NEXT: vldrwt.u32 q1, [r1], #16
; CHECK-NEXT: vldrwt.u32 q2, [r0], #16		; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
; CHECK-NEXT: vpnot		; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill
; CHECK-NEXT: vsub.i32 q1, q2, q1		; CHECK-NEXT: vsub.i32 q1, q2, q1
; CHECK-NEXT: sub.w r12, r12, #4		; CHECK-NEXT: vpnot
; CHECK-NEXT: vpstee		; CHECK-NEXT: vpstee
; CHECK-NEXT: vcmpt.i32 ne, q1, zr		; CHECK-NEXT: vcmpt.i32 ne, q1, zr
; CHECK-NEXT: vldrwe.u32 q1, [r3], #16		; CHECK-NEXT: vldrwe.u32 q1, [r3], #16
; CHECK-NEXT: vldrwe.u32 q2, [r2], #16		; CHECK-NEXT: vldrwe.u32 q2, [r2], #16
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vmul.i32 q1, q2, q1		; CHECK-NEXT: vmul.i32 q1, q2, q1
		; CHECK-NEXT: sub.w r12, r12, #4
; CHECK-NEXT: vadd.i32 q1, q1, q0		; CHECK-NEXT: vadd.i32 q1, q1, q0
; CHECK-NEXT: le lr, .LBB3_2		; CHECK-NEXT: le lr, .LBB3_2
; CHECK-NEXT: @ %bb.3: @ %middle.block		; CHECK-NEXT: @ %bb.3: @ %middle.block
; CHECK-NEXT: vctp.32 r4		; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload
; CHECK-NEXT: vpsel q0, q1, q0		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r4, r5, r7, pc}		; CHECK-NEXT: add sp, #4
		; CHECK-NEXT: pop {r4, pc}
; CHECK-NEXT: .LBB3_4:		; CHECK-NEXT: .LBB3_4:
; CHECK-NEXT: movs r0, #0		; CHECK-NEXT: movs r0, #0
; CHECK-NEXT: pop {r4, r5, r7, pc}		; CHECK-NEXT: add sp, #4
		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp eq i32 %N, 0		%cmp8 = icmp eq i32 %N, 0
br i1 %cmp8, label %for.cond.cleanup, label %vector.ph		br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %add, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%tmp = getelementptr inbounds i32, i32* %a, i32 %index		%tmp = getelementptr inbounds i32, i32* %a, i32 %index
%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		%tmp1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%tmp2 = bitcast i32* %tmp to <4 x i32>*		%tmp2 = bitcast i32* %tmp to <4 x i32>*
%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index		%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
%tmp4 = bitcast i32* %tmp3 to <4 x i32>*		%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)		%wide.masked.load.b = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
%sub = sub <4 x i32> %wide.masked.load.a, %wide.masked.load.b		%sub = sub <4 x i32> %wide.masked.load.a, %wide.masked.load.b
%cmp = icmp eq <4 x i32> %sub, <i32 0, i32 0, i32 0, i32 0>		%cmp = icmp eq <4 x i32> %sub, <i32 0, i32 0, i32 0, i32 0>
%mask = or <4 x i1> %cmp, %tmp1		%mask = or <4 x i1> %cmp, %tmp1
Show All 21 Lines

define dso_local void @continue_on_zero(i32* noalias nocapture %arg, i32* noalias nocapture readonly %arg1, i32 %arg2) {		define dso_local void @continue_on_zero(i32* noalias nocapture %arg, i32* noalias nocapture readonly %arg1, i32 %arg2) {
; CHECK-LABEL: continue_on_zero:		; CHECK-LABEL: continue_on_zero:
; CHECK: @ %bb.0: @ %bb		; CHECK: @ %bb.0: @ %bb
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r2, #0		; CHECK-NEXT: cmp r2, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: dlstp.32 lr, r2		; CHECK-NEXT: dlstp.32 lr, r2
; CHECK-NEXT: .LBB4_1: @ %bb9		; CHECK-NEXT: .LBB4_1: @ %bb9
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r1], #16
; CHECK-NEXT: vcmp.i32 ne, q0, zr		; CHECK-NEXT: vcmp.i32 ne, q0, zr
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q1, [r0]		; CHECK-NEXT: vldrwt.u32 q1, [r0]
; CHECK-NEXT: vmul.i32 q0, q1, q0		; CHECK-NEXT: vmul.i32 q0, q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vstrwt.32 q0, [r0], #16		; CHECK-NEXT: vstrwt.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB4_1		; CHECK-NEXT: letp lr, .LBB4_1
Show All 12 Lines	bb3: ; preds = %bb
br label %bb9		br label %bb9

bb9: ; preds = %bb9, %bb3		bb9: ; preds = %bb9, %bb3
%tmp10 = phi i32 [ 0, %bb3 ], [ %tmp25, %bb9 ]		%tmp10 = phi i32 [ 0, %bb3 ], [ %tmp25, %bb9 ]
%tmp11 = insertelement <4 x i32> undef, i32 %tmp10, i32 0		%tmp11 = insertelement <4 x i32> undef, i32 %tmp10, i32 0
%tmp12 = shufflevector <4 x i32> %tmp11, <4 x i32> undef, <4 x i32> zeroinitializer		%tmp12 = shufflevector <4 x i32> %tmp11, <4 x i32> undef, <4 x i32> zeroinitializer
%tmp13 = add <4 x i32> %tmp12, <i32 0, i32 1, i32 2, i32 3>		%tmp13 = add <4 x i32> %tmp12, <i32 0, i32 1, i32 2, i32 3>
%tmp14 = getelementptr inbounds i32, i32* %arg1, i32 %tmp10		%tmp14 = getelementptr inbounds i32, i32* %arg1, i32 %tmp10
%tmp15 = icmp ule <4 x i32> %tmp13, %tmp8
		; %tmp15 = icmp ule <4 x i32> %tmp13, %tmp8
		%tmp15 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %tmp10, i32 %tmp6)

%tmp16 = bitcast i32* %tmp14 to <4 x i32>*		%tmp16 = bitcast i32* %tmp14 to <4 x i32>*
%tmp17 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp16, i32 4, <4 x i1> %tmp15, <4 x i32> undef)		%tmp17 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp16, i32 4, <4 x i1> %tmp15, <4 x i32> undef)
%tmp18 = icmp ne <4 x i32> %tmp17, zeroinitializer		%tmp18 = icmp ne <4 x i32> %tmp17, zeroinitializer
%tmp19 = getelementptr inbounds i32, i32* %arg, i32 %tmp10		%tmp19 = getelementptr inbounds i32, i32* %arg, i32 %tmp10
%tmp20 = and <4 x i1> %tmp18, %tmp15		%tmp20 = and <4 x i1> %tmp18, %tmp15
%tmp21 = bitcast i32* %tmp19 to <4 x i32>*		%tmp21 = bitcast i32* %tmp19 to <4 x i32>*
%tmp22 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp21, i32 4, <4 x i1> %tmp20, <4 x i32> undef)		%tmp22 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp21, i32 4, <4 x i1> %tmp20, <4 x i32> undef)
%tmp23 = mul nsw <4 x i32> %tmp22, %tmp17		%tmp23 = mul nsw <4 x i32> %tmp22, %tmp17
Show All 9 Lines

define dso_local arm_aapcs_vfpcc void @range_test(i32* noalias nocapture %arg, i32* noalias nocapture readonly %arg1, i32 %arg2, i32 %arg3) {		define dso_local arm_aapcs_vfpcc void @range_test(i32* noalias nocapture %arg, i32* noalias nocapture readonly %arg1, i32 %arg2, i32 %arg3) {
; CHECK-LABEL: range_test:		; CHECK-LABEL: range_test:
; CHECK: @ %bb.0: @ %bb		; CHECK: @ %bb.0: @ %bb
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB5_1: @ %bb12		; CHECK-NEXT: .LBB5_1: @ %bb12
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vldrw.u32 q0, [r0]		; CHECK-NEXT: vldrw.u32 q0, [r0]
; CHECK-NEXT: vptt.i32 ne, q0, zr		; CHECK-NEXT: vptt.i32 ne, q0, zr
; CHECK-NEXT: vcmpt.s32 le, q0, r2		; CHECK-NEXT: vcmpt.s32 le, q0, r2
; CHECK-NEXT: vldrwt.u32 q1, [r1], #16		; CHECK-NEXT: vldrwt.u32 q1, [r1], #16
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vmul.i32 q0, q1, q0		; CHECK-NEXT: vmul.i32 q0, q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vstrwt.32 q0, [r0], #16		; CHECK-NEXT: vstrwt.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB5_1		; CHECK-NEXT: letp lr, .LBB5_1
; CHECK-NEXT: @ %bb.2: @ %bb32		; CHECK-NEXT: @ %bb.2: @ %bb32
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
bb:		bb:
%tmp = icmp eq i32 %arg3, 0		%tmp = icmp eq i32 %arg3, 0
Show All 10 Lines	bb4: ; preds = %bb
br label %bb12		br label %bb12

bb12: ; preds = %bb12, %bb4		bb12: ; preds = %bb12, %bb4
%tmp13 = phi i32 [ 0, %bb4 ], [ %tmp30, %bb12 ]		%tmp13 = phi i32 [ 0, %bb4 ], [ %tmp30, %bb12 ]
%tmp14 = insertelement <4 x i32> undef, i32 %tmp13, i32 0		%tmp14 = insertelement <4 x i32> undef, i32 %tmp13, i32 0
%tmp15 = shufflevector <4 x i32> %tmp14, <4 x i32> undef, <4 x i32> zeroinitializer		%tmp15 = shufflevector <4 x i32> %tmp14, <4 x i32> undef, <4 x i32> zeroinitializer
%tmp16 = add <4 x i32> %tmp15, <i32 0, i32 1, i32 2, i32 3>		%tmp16 = add <4 x i32> %tmp15, <i32 0, i32 1, i32 2, i32 3>
%tmp17 = getelementptr inbounds i32, i32* %arg, i32 %tmp13		%tmp17 = getelementptr inbounds i32, i32* %arg, i32 %tmp13
%tmp18 = icmp ule <4 x i32> %tmp16, %tmp9
		; %tmp18 = icmp ule <4 x i32> %tmp16, %tmp9
		%tmp18= call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %tmp13, i32 %tmp7)

%tmp19 = bitcast i32* %tmp17 to <4 x i32>*		%tmp19 = bitcast i32* %tmp17 to <4 x i32>*
%tmp20 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp19, i32 4, <4 x i1> %tmp18, <4 x i32> undef)		%tmp20 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp19, i32 4, <4 x i1> %tmp18, <4 x i32> undef)
%tmp21 = icmp ne <4 x i32> %tmp20, zeroinitializer		%tmp21 = icmp ne <4 x i32> %tmp20, zeroinitializer
%tmp22 = icmp sle <4 x i32> %tmp20, %tmp11		%tmp22 = icmp sle <4 x i32> %tmp20, %tmp11
%tmp23 = getelementptr inbounds i32, i32* %arg1, i32 %tmp13		%tmp23 = getelementptr inbounds i32, i32* %arg1, i32 %tmp13
%tmp24 = and <4 x i1> %tmp22, %tmp21		%tmp24 = and <4 x i1> %tmp22, %tmp21
%tmp25 = and <4 x i1> %tmp24, %tmp18		%tmp25 = and <4 x i1> %tmp24, %tmp18
%tmp26 = bitcast i32* %tmp23 to <4 x i32>*		%tmp26 = bitcast i32* %tmp23 to <4 x i32>*
Show All 10 Lines
}		}

; Function Attrs: argmemonly nounwind readonly willreturn		; Function Attrs: argmemonly nounwind readonly willreturn
declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)

; Function Attrs: nounwind readnone willreturn		; Function Attrs: nounwind readnone willreturn
declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)		declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)

		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/extending-loads.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -disable-mve-tail-predication=false %s -o - \| FileCheck %s		; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -disable-mve-tail-predication=false %s -o - \| FileCheck %s

define dso_local arm_aapcs_vfpcc void @sext_i8(i16* noalias nocapture %a, i8* nocapture readonly %b, i32 %N) {		define dso_local arm_aapcs_vfpcc void @sext_i8(i16* noalias nocapture %a, i8* nocapture readonly %b, i32 %N) {
; CHECK-LABEL: sext_i8:		; CHECK-LABEL: sext_i8:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r2, #0		; CHECK-NEXT: cmp r2, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: dlstp.16 lr, r2		; CHECK-NEXT: dlstp.16 lr, r2
; CHECK-NEXT: .LBB0_1: @ %vector.body		; CHECK-NEXT: .LBB0_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r3, #8
; CHECK-NEXT: vldrb.s16 q0, [r1], #8		; CHECK-NEXT: vldrb.s16 q0, [r1], #8
; CHECK-NEXT: vldrh.u16 q1, [r0]		; CHECK-NEXT: vldrh.u16 q1, [r0]
; CHECK-NEXT: vadd.i16 q0, q1, q0		; CHECK-NEXT: vadd.i16 q0, q1, q0
; CHECK-NEXT: vstrh.16 q0, [r0], #16		; CHECK-NEXT: vstrh.16 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB0_1		; CHECK-NEXT: letp lr, .LBB0_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer		%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
%induction = or <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%induction = or <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%0 = getelementptr inbounds i8, i8* %b, i32 %index		%0 = getelementptr inbounds i8, i8* %b, i32 %index
%1 = icmp ule <8 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <8 x i32> %induction, %broadcast.splat11
		%1 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i8* %0 to <8 x i8>*		%2 = bitcast i8* %0 to <8 x i8>*
%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %2, i32 1, <8 x i1> %1, <8 x i8> undef)		%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %2, i32 1, <8 x i1> %1, <8 x i8> undef)
%3 = sext <8 x i8> %wide.masked.load to <8 x i16>		%3 = sext <8 x i8> %wide.masked.load to <8 x i16>
%4 = getelementptr inbounds i16, i16* %a, i32 %index		%4 = getelementptr inbounds i16, i16* %a, i32 %index
%5 = bitcast i16* %4 to <8 x i16>*		%5 = bitcast i16* %4 to <8 x i16>*
%wide.masked.load12 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %5, i32 2, <8 x i1> %1, <8 x i16> undef)		%wide.masked.load12 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %5, i32 2, <8 x i1> %1, <8 x i16> undef)
%6 = add <8 x i16> %wide.masked.load12, %3		%6 = add <8 x i16> %wide.masked.load12, %3
%7 = bitcast i16* %4 to <8 x i16>*		%7 = bitcast i16* %4 to <8 x i16>*
Show All 9 Lines
; Function Attrs: nofree norecurse nounwind		; Function Attrs: nofree norecurse nounwind
define dso_local arm_aapcs_vfpcc void @zext_i8(i16* noalias nocapture %a, i8* nocapture readonly %b, i32 %N) local_unnamed_addr #0 {		define dso_local arm_aapcs_vfpcc void @zext_i8(i16* noalias nocapture %a, i8* nocapture readonly %b, i32 %N) local_unnamed_addr #0 {
; CHECK-LABEL: zext_i8:		; CHECK-LABEL: zext_i8:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r2, #0		; CHECK-NEXT: cmp r2, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: dlstp.16 lr, r2		; CHECK-NEXT: dlstp.16 lr, r2
; CHECK-NEXT: .LBB1_1: @ %vector.body		; CHECK-NEXT: .LBB1_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r3, #8
; CHECK-NEXT: vldrb.u16 q0, [r1], #8		; CHECK-NEXT: vldrb.u16 q0, [r1], #8
; CHECK-NEXT: vldrh.u16 q1, [r0]		; CHECK-NEXT: vldrh.u16 q1, [r0]
; CHECK-NEXT: vadd.i16 q0, q1, q0		; CHECK-NEXT: vadd.i16 q0, q1, q0
; CHECK-NEXT: vstrh.16 q0, [r0], #16		; CHECK-NEXT: vstrh.16 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB1_1		; CHECK-NEXT: letp lr, .LBB1_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer		%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
%induction = or <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%induction = or <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%0 = getelementptr inbounds i8, i8* %b, i32 %index		%0 = getelementptr inbounds i8, i8* %b, i32 %index
%1 = icmp ule <8 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <8 x i32> %induction, %broadcast.splat11
		%1 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i8* %0 to <8 x i8>*		%2 = bitcast i8* %0 to <8 x i8>*
%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %2, i32 1, <8 x i1> %1, <8 x i8> undef)		%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %2, i32 1, <8 x i1> %1, <8 x i8> undef)
%3 = zext <8 x i8> %wide.masked.load to <8 x i16>		%3 = zext <8 x i8> %wide.masked.load to <8 x i16>
%4 = getelementptr inbounds i16, i16* %a, i32 %index		%4 = getelementptr inbounds i16, i16* %a, i32 %index
%5 = bitcast i16* %4 to <8 x i16>*		%5 = bitcast i16* %4 to <8 x i16>*
%wide.masked.load12 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %5, i32 2, <8 x i1> %1, <8 x i16> undef)		%wide.masked.load12 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %5, i32 2, <8 x i1> %1, <8 x i16> undef)
%6 = add <8 x i16> %wide.masked.load12, %3		%6 = add <8 x i16> %wide.masked.load12, %3
%7 = bitcast i16* %4 to <8 x i16>*		%7 = bitcast i16* %4 to <8 x i16>*
Show All 9 Lines
; Function Attrs: nofree norecurse nounwind		; Function Attrs: nofree norecurse nounwind
define dso_local arm_aapcs_vfpcc void @sext_i16(i32* noalias nocapture %a, i16* nocapture readonly %b, i32 %N) local_unnamed_addr #0 {		define dso_local arm_aapcs_vfpcc void @sext_i16(i32* noalias nocapture %a, i16* nocapture readonly %b, i32 %N) local_unnamed_addr #0 {
; CHECK-LABEL: sext_i16:		; CHECK-LABEL: sext_i16:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r2, #0		; CHECK-NEXT: cmp r2, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: dlstp.32 lr, r2		; CHECK-NEXT: dlstp.32 lr, r2
; CHECK-NEXT: .LBB2_1: @ %vector.body		; CHECK-NEXT: .LBB2_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: vldrh.s32 q0, [r1], #8		; CHECK-NEXT: vldrh.s32 q0, [r1], #8
; CHECK-NEXT: vldrw.u32 q1, [r0]		; CHECK-NEXT: vldrw.u32 q1, [r0]
; CHECK-NEXT: vadd.i32 q0, q1, q0		; CHECK-NEXT: vadd.i32 q0, q1, q0
; CHECK-NEXT: vstrw.32 q0, [r0], #16		; CHECK-NEXT: vstrw.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB2_1		; CHECK-NEXT: letp lr, .LBB2_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i16, i16* %b, i32 %index		%0 = getelementptr inbounds i16, i16* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <4 x i16>*		%2 = bitcast i16* %0 to <4 x i16>*
%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)
%3 = sext <4 x i16> %wide.masked.load to <4 x i32>		%3 = sext <4 x i16> %wide.masked.load to <4 x i32>
%4 = getelementptr inbounds i32, i32* %a, i32 %index		%4 = getelementptr inbounds i32, i32* %a, i32 %index
%5 = bitcast i32* %4 to <4 x i32>*		%5 = bitcast i32* %4 to <4 x i32>*
%wide.masked.load10 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %5, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load10 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %5, i32 4, <4 x i1> %1, <4 x i32> undef)
%6 = add nsw <4 x i32> %wide.masked.load10, %3		%6 = add nsw <4 x i32> %wide.masked.load10, %3
%7 = bitcast i32* %4 to <4 x i32>*		%7 = bitcast i32* %4 to <4 x i32>*
Show All 9 Lines
; Function Attrs: nofree norecurse nounwind		; Function Attrs: nofree norecurse nounwind
define dso_local arm_aapcs_vfpcc void @zext_i16(i32* noalias nocapture %a, i16* nocapture readonly %b, i32 %N) local_unnamed_addr #0 {		define dso_local arm_aapcs_vfpcc void @zext_i16(i32* noalias nocapture %a, i16* nocapture readonly %b, i32 %N) local_unnamed_addr #0 {
; CHECK-LABEL: zext_i16:		; CHECK-LABEL: zext_i16:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r2, #0		; CHECK-NEXT: cmp r2, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: dlstp.32 lr, r2		; CHECK-NEXT: dlstp.32 lr, r2
; CHECK-NEXT: .LBB3_1: @ %vector.body		; CHECK-NEXT: .LBB3_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: vldrh.u32 q0, [r1], #8		; CHECK-NEXT: vldrh.u32 q0, [r1], #8
; CHECK-NEXT: vldrw.u32 q1, [r0]		; CHECK-NEXT: vldrw.u32 q1, [r0]
; CHECK-NEXT: vadd.i32 q0, q1, q0		; CHECK-NEXT: vadd.i32 q0, q1, q0
; CHECK-NEXT: vstrw.32 q0, [r0], #16		; CHECK-NEXT: vstrw.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB3_1		; CHECK-NEXT: letp lr, .LBB3_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i16, i16* %b, i32 %index		%0 = getelementptr inbounds i16, i16* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <4 x i16>*		%2 = bitcast i16* %0 to <4 x i16>*
%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)
%3 = zext <4 x i16> %wide.masked.load to <4 x i32>		%3 = zext <4 x i16> %wide.masked.load to <4 x i32>
%4 = getelementptr inbounds i32, i32* %a, i32 %index		%4 = getelementptr inbounds i32, i32* %a, i32 %index
%5 = bitcast i32* %4 to <4 x i32>*		%5 = bitcast i32* %4 to <4 x i32>*
%wide.masked.load10 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %5, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load10 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %5, i32 4, <4 x i1> %1, <4 x i32> undef)
%6 = add <4 x i32> %wide.masked.load10, %3		%6 = add <4 x i32> %wide.masked.load10, %3
%7 = bitcast i32* %4 to <4 x i32>*		%7 = bitcast i32* %4 to <4 x i32>*
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %6, <4 x i32>* %7, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %6, <4 x i32>* %7, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%8 = icmp eq i32 %index.next, %n.vec		%8 = icmp eq i32 %index.next, %n.vec
br i1 %8, label %for.cond.cleanup, label %vector.body		br i1 %8, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

declare <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>*, i32 immarg, <8 x i1>, <8 x i8>)		declare <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>*, i32 immarg, <8 x i1>, <8 x i8>)
declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)		declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)		declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>)		declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>)
declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)

		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
		declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll

Show All 29 Lines
; CHECK-NEXT: subs r6, r3, #1		; CHECK-NEXT: subs r6, r3, #1
; CHECK-NEXT: and r7, r3, #3		; CHECK-NEXT: and r7, r3, #3
; CHECK-NEXT: cmp r6, #3		; CHECK-NEXT: cmp r6, #3
; CHECK-NEXT: bhs .LBB0_6		; CHECK-NEXT: bhs .LBB0_6
; CHECK-NEXT: @ %bb.3:		; CHECK-NEXT: @ %bb.3:
; CHECK-NEXT: mov.w r12, #0		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: b .LBB0_8		; CHECK-NEXT: b .LBB0_8
; CHECK-NEXT: .LBB0_4: @ %vector.ph		; CHECK-NEXT: .LBB0_4: @ %vector.ph
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB0_5: @ %vector.body		; CHECK-NEXT: .LBB0_5: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r1], #16
; CHECK-NEXT: vldrw.u32 q1, [r2], #16		; CHECK-NEXT: vldrw.u32 q1, [r2], #16
; CHECK-NEXT: vmul.f32 q0, q1, q0		; CHECK-NEXT: vmul.f32 q0, q1, q0
; CHECK-NEXT: vstrw.32 q0, [r0], #16		; CHECK-NEXT: vstrw.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB0_5		; CHECK-NEXT: letp lr, .LBB0_5
; CHECK-NEXT: b .LBB0_11		; CHECK-NEXT: b .LBB0_11
; CHECK-NEXT: .LBB0_6: @ %for.body.preheader.new		; CHECK-NEXT: .LBB0_6: @ %for.body.preheader.new
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
▲ Show 20 Lines • Show All 81 Lines • ▼ Show 20 Lines	vector.ph: ; preds = %vector.memcheck
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%2 = getelementptr inbounds float, float* %b, i32 %index		%2 = getelementptr inbounds float, float* %b, i32 %index
%3 = icmp ule <4 x i32> %induction, %broadcast.splat22
		; %3 = icmp ule <4 x i32> %induction, %broadcast.splat22
		%3 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%4 = bitcast float* %2 to <4 x float>*		%4 = bitcast float* %2 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %3, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %3, <4 x float> undef)
%5 = getelementptr inbounds float, float* %c, i32 %index		%5 = getelementptr inbounds float, float* %c, i32 %index
%6 = bitcast float* %5 to <4 x float>*		%6 = bitcast float* %5 to <4 x float>*
%wide.masked.load23 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %6, i32 4, <4 x i1> %3, <4 x float> undef)		%wide.masked.load23 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %6, i32 4, <4 x i1> %3, <4 x float> undef)
%7 = fmul fast <4 x float> %wide.masked.load23, %wide.masked.load		%7 = fmul fast <4 x float> %wide.masked.load23, %wide.masked.load
%8 = getelementptr inbounds float, float* %a, i32 %index		%8 = getelementptr inbounds float, float* %a, i32 %index
%9 = bitcast float* %8 to <4 x float>*		%9 = bitcast float* %8 to <4 x float>*
▲ Show 20 Lines • Show All 72 Lines • ▼ Show 20 Lines
; CHECK-NEXT: cbz r2, .LBB1_4		; CHECK-NEXT: cbz r2, .LBB1_4
; CHECK-NEXT: @ %bb.1: @ %vector.ph		; CHECK-NEXT: @ %bb.1: @ %vector.ph
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB1_2: @ %vector.body		; CHECK-NEXT: .LBB1_2: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpstt		; CHECK-NEXT: vpstt
; CHECK-NEXT: vldrwt.u32 q2, [r0], #16		; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
; CHECK-NEXT: vldrwt.u32 q3, [r1], #16		; CHECK-NEXT: vldrwt.u32 q3, [r1], #16
; CHECK-NEXT: vfma.f32 q0, q3, q2		; CHECK-NEXT: vfma.f32 q0, q3, q2
; CHECK-NEXT: le lr, .LBB1_2		; CHECK-NEXT: le lr, .LBB1_2
; CHECK-NEXT: @ %bb.3: @ %middle.block		; CHECK-NEXT: @ %bb.3: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vmov.f32 s4, s2		; CHECK-NEXT: vmov.f32 s4, s2
; CHECK-NEXT: vmov.f32 s5, s3		; CHECK-NEXT: vmov.f32 s5, s3
; CHECK-NEXT: vadd.f32 q0, q0, q1		; CHECK-NEXT: vadd.f32 q0, q0, q1
; CHECK-NEXT: vmov r0, s1		; CHECK-NEXT: vmov r0, s1
; CHECK-NEXT: vadd.f32 q0, q0, r0		; CHECK-NEXT: vadd.f32 q0, q0, r0
; CHECK-NEXT: @ kill: def $s0 killed $s0 killed $q0		; CHECK-NEXT: @ kill: def $s0 killed $s0 killed $q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
Show All 19 Lines

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x float> [ zeroinitializer, %vector.ph ], [ %6, %vector.body ]		%vec.phi = phi <4 x float> [ zeroinitializer, %vector.ph ], [ %6, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %b, i32 %index		%0 = getelementptr inbounds float, float* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %c, i32 %index		%3 = getelementptr inbounds float, float* %c, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load13 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load13 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fmul fast <4 x float> %wide.masked.load13, %wide.masked.load		%5 = fmul fast <4 x float> %wide.masked.load13, %wide.masked.load
%6 = fadd fast <4 x float> %5, %vec.phi		%6 = fadd fast <4 x float> %5, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
▲ Show 20 Lines • Show All 295 Lines • ▼ Show 20 Lines
declare <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>*, i32 immarg, <4 x i1>, <4 x float>)		declare <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>*, i32 immarg, <4 x i1>, <4 x float>)

; Function Attrs: argmemonly nounwind willreturn		; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32 immarg, <4 x i1>)

; Function Attrs: argmemonly nounwind readonly willreturn		; Function Attrs: argmemonly nounwind readonly willreturn
declare <4 x half> @llvm.masked.load.v4f16.p0v4f16(<4 x half>*, i32 immarg, <4 x i1>, <4 x half>)		declare <4 x half> @llvm.masked.load.v4f16.p0v4f16(<4 x half>*, i32 immarg, <4 x i1>, <4 x half>)

		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll

Show All 9 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB0_1: @ %vector.body		; CHECK-NEXT: .LBB0_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrbt.u32 q2, [r1], #4		; CHECK-NEXT: vldrbt.u32 q2, [r1], #4
; CHECK-NEXT: vmla.u32 q0, q2, r0		; CHECK-NEXT: vmla.u32 q0, q2, r0
; CHECK-NEXT: le lr, .LBB0_1		; CHECK-NEXT: le lr, .LBB0_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp7 = icmp eq i32 %N, 0		%cmp7 = icmp eq i32 %N, 0
br i1 %cmp7, label %for.cond.cleanup, label %vector.ph		br i1 %cmp7, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
Show All 9 Lines

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i8, i8* %b, i32 %index		%0 = getelementptr inbounds i8, i8* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i8* %0 to <4 x i8>*		%2 = bitcast i8* %0 to <4 x i8>*
%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %2, i32 1, <4 x i1> %1, <4 x i8> undef)		%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %2, i32 1, <4 x i1> %1, <4 x i8> undef)
%3 = zext <4 x i8> %wide.masked.load to <4 x i32>		%3 = zext <4 x i8> %wide.masked.load to <4 x i32>
%4 = mul nuw nsw <4 x i32> %broadcast.splat13, %3		%4 = mul nuw nsw <4 x i32> %broadcast.splat13, %3
%5 = add nuw nsw <4 x i32> %4, %vec.phi		%5 = add nuw nsw <4 x i32> %4, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%6 = icmp eq i32 %index.next, %n.vec		%6 = icmp eq i32 %index.next, %n.vec
br i1 %6, label %middle.block, label %vector.body		br i1 %6, label %middle.block, label %vector.body
Show All 17 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB1_1: @ %vector.body		; CHECK-NEXT: .LBB1_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrht.s32 q2, [r1], #8		; CHECK-NEXT: vldrht.s32 q2, [r1], #8
; CHECK-NEXT: vmla.u32 q0, q2, r0		; CHECK-NEXT: vmla.u32 q0, q2, r0
; CHECK-NEXT: le lr, .LBB1_1		; CHECK-NEXT: le lr, .LBB1_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp7 = icmp eq i32 %N, 0		%cmp7 = icmp eq i32 %N, 0
br i1 %cmp7, label %for.cond.cleanup, label %vector.ph		br i1 %cmp7, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
Show All 9 Lines

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i16, i16* %b, i32 %index		%0 = getelementptr inbounds i16, i16* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <4 x i16>*		%2 = bitcast i16* %0 to <4 x i16>*
%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)
%3 = sext <4 x i16> %wide.masked.load to <4 x i32>		%3 = sext <4 x i16> %wide.masked.load to <4 x i32>
%4 = mul nsw <4 x i32> %broadcast.splat13, %3		%4 = mul nsw <4 x i32> %broadcast.splat13, %3
%5 = add nsw <4 x i32> %4, %vec.phi		%5 = add nsw <4 x i32> %4, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%6 = icmp eq i32 %index.next, %n.vec		%6 = icmp eq i32 %index.next, %n.vec
br i1 %6, label %middle.block, label %vector.body		br i1 %6, label %middle.block, label %vector.body
Show All 17 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB2_1: @ %vector.body		; CHECK-NEXT: .LBB2_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrbt.u32 q2, [r1], #4		; CHECK-NEXT: vldrbt.u32 q2, [r1], #4
; CHECK-NEXT: vmla.u32 q0, q2, r0		; CHECK-NEXT: vmla.u32 q0, q2, r0
; CHECK-NEXT: le lr, .LBB2_1		; CHECK-NEXT: le lr, .LBB2_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp7 = icmp eq i32 %N, 0		%cmp7 = icmp eq i32 %N, 0
br i1 %cmp7, label %for.cond.cleanup, label %vector.ph		br i1 %cmp7, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
Show All 9 Lines

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i8, i8* %b, i32 %index		%0 = getelementptr inbounds i8, i8* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i8* %0 to <4 x i8>*		%2 = bitcast i8* %0 to <4 x i8>*
%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %2, i32 1, <4 x i1> %1, <4 x i8> undef)		%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %2, i32 1, <4 x i1> %1, <4 x i8> undef)
%3 = zext <4 x i8> %wide.masked.load to <4 x i32>		%3 = zext <4 x i8> %wide.masked.load to <4 x i32>
%4 = mul nuw nsw <4 x i32> %broadcast.splat13, %3		%4 = mul nuw nsw <4 x i32> %broadcast.splat13, %3
%5 = add nuw nsw <4 x i32> %4, %vec.phi		%5 = add nuw nsw <4 x i32> %4, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%6 = icmp eq i32 %index.next, %n.vec		%6 = icmp eq i32 %index.next, %n.vec
br i1 %6, label %middle.block, label %vector.body		br i1 %6, label %middle.block, label %vector.body
Show All 17 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB3_1: @ %vector.body		; CHECK-NEXT: .LBB3_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrht.u32 q2, [r1], #8		; CHECK-NEXT: vldrht.u32 q2, [r1], #8
; CHECK-NEXT: vmla.u32 q0, q2, r0		; CHECK-NEXT: vmla.u32 q0, q2, r0
; CHECK-NEXT: le lr, .LBB3_1		; CHECK-NEXT: le lr, .LBB3_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp7 = icmp eq i32 %N, 0		%cmp7 = icmp eq i32 %N, 0
br i1 %cmp7, label %for.cond.cleanup, label %vector.ph		br i1 %cmp7, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
Show All 9 Lines

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i16, i16* %b, i32 %index		%0 = getelementptr inbounds i16, i16* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <4 x i16>*		%2 = bitcast i16* %0 to <4 x i16>*
%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)
%3 = zext <4 x i16> %wide.masked.load to <4 x i32>		%3 = zext <4 x i16> %wide.masked.load to <4 x i32>
%4 = mul nsw <4 x i32> %broadcast.splat13, %3		%4 = mul nsw <4 x i32> %broadcast.splat13, %3
%5 = add nsw <4 x i32> %4, %vec.phi		%5 = add nsw <4 x i32> %4, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%6 = icmp eq i32 %index.next, %n.vec		%6 = icmp eq i32 %index.next, %n.vec
br i1 %6, label %middle.block, label %vector.body		br i1 %6, label %middle.block, label %vector.body
Show All 17 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB4_1: @ %vector.body		; CHECK-NEXT: .LBB4_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q2, [r1], #16		; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
; CHECK-NEXT: vmla.u32 q0, q2, r0		; CHECK-NEXT: vmla.u32 q0, q2, r0
; CHECK-NEXT: le lr, .LBB4_1		; CHECK-NEXT: le lr, .LBB4_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp6 = icmp eq i32 %N, 0		%cmp6 = icmp eq i32 %N, 0
br i1 %cmp6, label %for.cond.cleanup, label %vector.ph		br i1 %cmp6, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert9 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert9 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat10 = shufflevector <4 x i32> %broadcast.splatinsert9, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat10 = shufflevector <4 x i32> %broadcast.splatinsert9, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %a, i32 0		%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %a, i32 0
%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %4, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %4, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i32, i32* %b, i32 %index		%0 = getelementptr inbounds i32, i32* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat10
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat10
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i32* %0 to <4 x i32>*		%2 = bitcast i32* %0 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
%3 = mul nsw <4 x i32> %wide.masked.load, %broadcast.splat12		%3 = mul nsw <4 x i32> %wide.masked.load, %broadcast.splat12
%4 = add nsw <4 x i32> %3, %vec.phi		%4 = add nsw <4 x i32> %3, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%5 = icmp eq i32 %index.next, %n.vec		%5 = icmp eq i32 %index.next, %n.vec
br i1 %5, label %middle.block, label %vector.body		br i1 %5, label %middle.block, label %vector.body

Show All 37 Lines
; CHECK-NEXT: sub.w r5, r12, #1		; CHECK-NEXT: sub.w r5, r12, #1
; CHECK-NEXT: and r9, r12, #3		; CHECK-NEXT: and r9, r12, #3
; CHECK-NEXT: cmp r5, #3		; CHECK-NEXT: cmp r5, #3
; CHECK-NEXT: bhs .LBB5_6		; CHECK-NEXT: bhs .LBB5_6
; CHECK-NEXT: @ %bb.3:		; CHECK-NEXT: @ %bb.3:
; CHECK-NEXT: mov.w r12, #0		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: b .LBB5_8		; CHECK-NEXT: b .LBB5_8
; CHECK-NEXT: .LBB5_4: @ %vector.ph		; CHECK-NEXT: .LBB5_4: @ %vector.ph
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r12		; CHECK-NEXT: dlstp.32 lr, r12
; CHECK-NEXT: .LBB5_5: @ %vector.body		; CHECK-NEXT: .LBB5_5: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrb.u32 q0, [r0], #4		; CHECK-NEXT: vldrb.u32 q0, [r0], #4
; CHECK-NEXT: vldrb.u32 q1, [r1], #4		; CHECK-NEXT: vldrb.u32 q1, [r1], #4
; CHECK-NEXT: vmlas.u32 q1, q0, r2		; CHECK-NEXT: vmlas.u32 q1, q0, r2
; CHECK-NEXT: vstrw.32 q1, [r3], #16		; CHECK-NEXT: vstrw.32 q1, [r3], #16
; CHECK-NEXT: letp lr, .LBB5_5		; CHECK-NEXT: letp lr, .LBB5_5
; CHECK-NEXT: b .LBB5_11		; CHECK-NEXT: b .LBB5_11
; CHECK-NEXT: .LBB5_6: @ %for.body.preheader.new		; CHECK-NEXT: .LBB5_6: @ %for.body.preheader.new
; CHECK-NEXT: bic r5, r12, #3		; CHECK-NEXT: bic r5, r12, #3
▲ Show 20 Lines • Show All 84 Lines • ▼ Show 20 Lines	vector.ph: ; preds = %for.body.lr.ph
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%2 = getelementptr inbounds i8, i8* %a, i32 %index		%2 = getelementptr inbounds i8, i8* %a, i32 %index
%3 = icmp ule <4 x i32> %induction, %broadcast.splat20
		; %3 = icmp ule <4 x i32> %induction, %broadcast.splat20
		%3 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%4 = bitcast i8* %2 to <4 x i8>*		%4 = bitcast i8* %2 to <4 x i8>*
%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %4, i32 1, <4 x i1> %3, <4 x i8> undef)		%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %4, i32 1, <4 x i1> %3, <4 x i8> undef)
%5 = zext <4 x i8> %wide.masked.load to <4 x i32>		%5 = zext <4 x i8> %wide.masked.load to <4 x i32>
%6 = getelementptr inbounds i8, i8* %b, i32 %index		%6 = getelementptr inbounds i8, i8* %b, i32 %index
%7 = bitcast i8* %6 to <4 x i8>*		%7 = bitcast i8* %6 to <4 x i8>*
%wide.masked.load21 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %7, i32 1, <4 x i1> %3, <4 x i8> undef)		%wide.masked.load21 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %7, i32 1, <4 x i1> %3, <4 x i8> undef)
%8 = zext <4 x i8> %wide.masked.load21 to <4 x i32>		%8 = zext <4 x i8> %wide.masked.load21 to <4 x i32>
%9 = mul nuw nsw <4 x i32> %8, %5		%9 = mul nuw nsw <4 x i32> %8, %5
▲ Show 20 Lines • Show All 86 Lines • ▼ Show 20 Lines
define arm_aapcs_vfpcc void @test_vec_mul_scalar_add_short(i16* nocapture readonly %a, i16* nocapture readonly %b, i16 signext %c, i32* nocapture %res, i32 %N) {		define arm_aapcs_vfpcc void @test_vec_mul_scalar_add_short(i16* nocapture readonly %a, i16* nocapture readonly %b, i16 signext %c, i32* nocapture %res, i32 %N) {
; CHECK-LABEL: test_vec_mul_scalar_add_short:		; CHECK-LABEL: test_vec_mul_scalar_add_short:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r4, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: ldr.w r12, [sp, #8]		; CHECK-NEXT: ldr.w r12, [sp, #8]
; CHECK-NEXT: cmp.w r12, #0		; CHECK-NEXT: cmp.w r12, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r4, pc}		; CHECK-NEXT: popeq {r4, pc}
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r12		; CHECK-NEXT: dlstp.32 lr, r12
; CHECK-NEXT: .LBB6_1: @ %vector.body		; CHECK-NEXT: .LBB6_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrh.s32 q0, [r0], #8		; CHECK-NEXT: vldrh.s32 q0, [r0], #8
; CHECK-NEXT: vldrh.s32 q1, [r1], #8		; CHECK-NEXT: vldrh.s32 q1, [r1], #8
; CHECK-NEXT: vmlas.u32 q1, q0, r2		; CHECK-NEXT: vmlas.u32 q1, q0, r2
; CHECK-NEXT: vstrw.32 q1, [r3], #16		; CHECK-NEXT: vstrw.32 q1, [r3], #16
; CHECK-NEXT: letp lr, .LBB6_1		; CHECK-NEXT: letp lr, .LBB6_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r4, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
Show All 12 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i16, i16* %a, i32 %index		%0 = getelementptr inbounds i16, i16* %a, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat13
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat13
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <4 x i16>*		%2 = bitcast i16* %0 to <4 x i16>*
%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)
%3 = sext <4 x i16> %wide.masked.load to <4 x i32>		%3 = sext <4 x i16> %wide.masked.load to <4 x i32>
%4 = getelementptr inbounds i16, i16* %b, i32 %index		%4 = getelementptr inbounds i16, i16* %b, i32 %index
%5 = bitcast i16* %4 to <4 x i16>*		%5 = bitcast i16* %4 to <4 x i16>*
%wide.masked.load14 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %5, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load14 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %5, i32 2, <4 x i1> %1, <4 x i16> undef)
%6 = sext <4 x i16> %wide.masked.load14 to <4 x i32>		%6 = sext <4 x i16> %wide.masked.load14 to <4 x i32>
%7 = mul nsw <4 x i32> %6, %3		%7 = mul nsw <4 x i32> %6, %3
Show All 39 Lines
; CHECK-NEXT: sub.w r5, r12, #1		; CHECK-NEXT: sub.w r5, r12, #1
; CHECK-NEXT: and r9, r12, #3		; CHECK-NEXT: and r9, r12, #3
; CHECK-NEXT: cmp r5, #3		; CHECK-NEXT: cmp r5, #3
; CHECK-NEXT: bhs .LBB7_6		; CHECK-NEXT: bhs .LBB7_6
; CHECK-NEXT: @ %bb.3:		; CHECK-NEXT: @ %bb.3:
; CHECK-NEXT: mov.w r12, #0		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: b .LBB7_8		; CHECK-NEXT: b .LBB7_8
; CHECK-NEXT: .LBB7_4: @ %vector.ph		; CHECK-NEXT: .LBB7_4: @ %vector.ph
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r12		; CHECK-NEXT: dlstp.32 lr, r12
; CHECK-NEXT: .LBB7_5: @ %vector.body		; CHECK-NEXT: .LBB7_5: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrb.u32 q0, [r0], #4		; CHECK-NEXT: vldrb.u32 q0, [r0], #4
; CHECK-NEXT: vldrb.u32 q1, [r1], #4		; CHECK-NEXT: vldrb.u32 q1, [r1], #4
; CHECK-NEXT: vmlas.u32 q1, q0, r2		; CHECK-NEXT: vmlas.u32 q1, q0, r2
; CHECK-NEXT: vstrw.32 q1, [r3], #16		; CHECK-NEXT: vstrw.32 q1, [r3], #16
; CHECK-NEXT: letp lr, .LBB7_5		; CHECK-NEXT: letp lr, .LBB7_5
; CHECK-NEXT: b .LBB7_11		; CHECK-NEXT: b .LBB7_11
; CHECK-NEXT: .LBB7_6: @ %for.body.preheader.new		; CHECK-NEXT: .LBB7_6: @ %for.body.preheader.new
; CHECK-NEXT: bic r5, r12, #3		; CHECK-NEXT: bic r5, r12, #3
▲ Show 20 Lines • Show All 84 Lines • ▼ Show 20 Lines	vector.ph: ; preds = %for.body.lr.ph
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%2 = getelementptr inbounds i8, i8* %a, i32 %index		%2 = getelementptr inbounds i8, i8* %a, i32 %index
%3 = icmp ule <4 x i32> %induction, %broadcast.splat20
		; %3 = icmp ule <4 x i32> %induction, %broadcast.splat20
		%3 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%4 = bitcast i8* %2 to <4 x i8>*		%4 = bitcast i8* %2 to <4 x i8>*
%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %4, i32 1, <4 x i1> %3, <4 x i8> undef)		%wide.masked.load = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %4, i32 1, <4 x i1> %3, <4 x i8> undef)
%5 = zext <4 x i8> %wide.masked.load to <4 x i32>		%5 = zext <4 x i8> %wide.masked.load to <4 x i32>
%6 = getelementptr inbounds i8, i8* %b, i32 %index		%6 = getelementptr inbounds i8, i8* %b, i32 %index
%7 = bitcast i8* %6 to <4 x i8>*		%7 = bitcast i8* %6 to <4 x i8>*
%wide.masked.load21 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %7, i32 1, <4 x i1> %3, <4 x i8> undef)		%wide.masked.load21 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* %7, i32 1, <4 x i1> %3, <4 x i8> undef)
%8 = zext <4 x i8> %wide.masked.load21 to <4 x i32>		%8 = zext <4 x i8> %wide.masked.load21 to <4 x i32>
%9 = mul nuw nsw <4 x i32> %8, %5		%9 = mul nuw nsw <4 x i32> %8, %5
▲ Show 20 Lines • Show All 86 Lines • ▼ Show 20 Lines
define arm_aapcs_vfpcc void @test_vec_mul_scalar_add_ushort(i16* nocapture readonly %a, i16* nocapture readonly %b, i16 signext %c, i32* nocapture %res, i32 %N) {		define arm_aapcs_vfpcc void @test_vec_mul_scalar_add_ushort(i16* nocapture readonly %a, i16* nocapture readonly %b, i16 signext %c, i32* nocapture %res, i32 %N) {
; CHECK-LABEL: test_vec_mul_scalar_add_ushort:		; CHECK-LABEL: test_vec_mul_scalar_add_ushort:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r4, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: ldr.w r12, [sp, #8]		; CHECK-NEXT: ldr.w r12, [sp, #8]
; CHECK-NEXT: cmp.w r12, #0		; CHECK-NEXT: cmp.w r12, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r4, pc}		; CHECK-NEXT: popeq {r4, pc}
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r12		; CHECK-NEXT: dlstp.32 lr, r12
; CHECK-NEXT: .LBB8_1: @ %vector.body		; CHECK-NEXT: .LBB8_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrh.u32 q0, [r0], #8		; CHECK-NEXT: vldrh.u32 q0, [r0], #8
; CHECK-NEXT: vldrh.u32 q1, [r1], #8		; CHECK-NEXT: vldrh.u32 q1, [r1], #8
; CHECK-NEXT: vmlas.u32 q1, q0, r2		; CHECK-NEXT: vmlas.u32 q1, q0, r2
; CHECK-NEXT: vstrw.32 q1, [r3], #16		; CHECK-NEXT: vstrw.32 q1, [r3], #16
; CHECK-NEXT: letp lr, .LBB8_1		; CHECK-NEXT: letp lr, .LBB8_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r4, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
Show All 12 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i16, i16* %a, i32 %index		%0 = getelementptr inbounds i16, i16* %a, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat13
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat13
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <4 x i16>*		%2 = bitcast i16* %0 to <4 x i16>*
%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %2, i32 2, <4 x i1> %1, <4 x i16> undef)
%3 = zext <4 x i16> %wide.masked.load to <4 x i32>		%3 = zext <4 x i16> %wide.masked.load to <4 x i32>
%4 = getelementptr inbounds i16, i16* %b, i32 %index		%4 = getelementptr inbounds i16, i16* %b, i32 %index
%5 = bitcast i16* %4 to <4 x i16>*		%5 = bitcast i16* %4 to <4 x i16>*
%wide.masked.load14 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %5, i32 2, <4 x i1> %1, <4 x i16> undef)		%wide.masked.load14 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %5, i32 2, <4 x i1> %1, <4 x i16> undef)
%6 = zext <4 x i16> %wide.masked.load14 to <4 x i32>		%6 = zext <4 x i16> %wide.masked.load14 to <4 x i32>
%7 = mul nuw nsw <4 x i32> %6, %3		%7 = mul nuw nsw <4 x i32> %6, %3
Show All 39 Lines
; CHECK-NEXT: sub.w r5, r12, #1		; CHECK-NEXT: sub.w r5, r12, #1
; CHECK-NEXT: and r9, r12, #3		; CHECK-NEXT: and r9, r12, #3
; CHECK-NEXT: cmp r5, #3		; CHECK-NEXT: cmp r5, #3
; CHECK-NEXT: bhs .LBB9_6		; CHECK-NEXT: bhs .LBB9_6
; CHECK-NEXT: @ %bb.3:		; CHECK-NEXT: @ %bb.3:
; CHECK-NEXT: mov.w r12, #0		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: b .LBB9_8		; CHECK-NEXT: b .LBB9_8
; CHECK-NEXT: .LBB9_4: @ %vector.ph		; CHECK-NEXT: .LBB9_4: @ %vector.ph
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r12		; CHECK-NEXT: dlstp.32 lr, r12
; CHECK-NEXT: .LBB9_5: @ %vector.body		; CHECK-NEXT: .LBB9_5: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r0], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r1], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vmlas.u32 q1, q0, r2		; CHECK-NEXT: vmlas.u32 q1, q0, r2
; CHECK-NEXT: vstrw.32 q1, [r3], #16		; CHECK-NEXT: vstrw.32 q1, [r3], #16
; CHECK-NEXT: letp lr, .LBB9_5		; CHECK-NEXT: letp lr, .LBB9_5
; CHECK-NEXT: b .LBB9_11		; CHECK-NEXT: b .LBB9_11
; CHECK-NEXT: .LBB9_6: @ %for.body.preheader.new		; CHECK-NEXT: .LBB9_6: @ %for.body.preheader.new
; CHECK-NEXT: bic r5, r12, #3		; CHECK-NEXT: bic r5, r12, #3
▲ Show 20 Lines • Show All 81 Lines • ▼ Show 20 Lines	vector.ph: ; preds = %vector.memcheck
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%2 = getelementptr inbounds i32, i32* %a, i32 %index		%2 = getelementptr inbounds i32, i32* %a, i32 %index
%3 = icmp ule <4 x i32> %induction, %broadcast.splat22
		; %3 = icmp ule <4 x i32> %induction, %broadcast.splat22
		%3 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%4 = bitcast i32* %2 to <4 x i32>*		%4 = bitcast i32* %2 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %4, i32 4, <4 x i1> %3, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %4, i32 4, <4 x i1> %3, <4 x i32> undef)
%5 = getelementptr inbounds i32, i32* %b, i32 %index		%5 = getelementptr inbounds i32, i32* %b, i32 %index
%6 = bitcast i32* %5 to <4 x i32>*		%6 = bitcast i32* %5 to <4 x i32>*
%wide.masked.load23 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %6, i32 4, <4 x i1> %3, <4 x i32> undef)		%wide.masked.load23 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %6, i32 4, <4 x i1> %3, <4 x i32> undef)
%7 = mul nsw <4 x i32> %wide.masked.load23, %wide.masked.load		%7 = mul nsw <4 x i32> %wide.masked.load23, %wide.masked.load
%8 = add nsw <4 x i32> %7, %broadcast.splat25		%8 = add nsw <4 x i32> %7, %broadcast.splat25
%9 = getelementptr inbounds i32, i32* %res, i32 %index		%9 = getelementptr inbounds i32, i32* %res, i32 %index
▲ Show 20 Lines • Show All 73 Lines • ▼ Show 20 Lines

define dso_local arm_aapcs_vfpcc void @test_v8i8_to_v8i16(i16* noalias nocapture %a, i8* nocapture readonly %b, i8* nocapture readonly %c, i32 %N) {		define dso_local arm_aapcs_vfpcc void @test_v8i8_to_v8i16(i16* noalias nocapture %a, i8* nocapture readonly %b, i8* nocapture readonly %c, i32 %N) {
; CHECK-LABEL: test_v8i8_to_v8i16:		; CHECK-LABEL: test_v8i8_to_v8i16:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.16 lr, r3		; CHECK-NEXT: dlstp.16 lr, r3
; CHECK-NEXT: .LBB10_1: @ %vector.body		; CHECK-NEXT: .LBB10_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #8
; CHECK-NEXT: vldrb.u16 q0, [r1], #8		; CHECK-NEXT: vldrb.u16 q0, [r1], #8
; CHECK-NEXT: vldrb.u16 q1, [r2], #8		; CHECK-NEXT: vldrb.u16 q1, [r2], #8
; CHECK-NEXT: vmul.i16 q0, q1, q0		; CHECK-NEXT: vmul.i16 q0, q1, q0
; CHECK-NEXT: vstrh.16 q0, [r0], #16		; CHECK-NEXT: vstrh.16 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB10_1		; CHECK-NEXT: letp lr, .LBB10_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer		%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%0 = getelementptr inbounds i8, i8* %b, i32 %index		%0 = getelementptr inbounds i8, i8* %b, i32 %index
%1 = icmp ule <8 x i32> %induction, %broadcast.splat13
		; %1 = icmp ule <8 x i32> %induction, %broadcast.splat13
		%1 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i8* %0 to <8 x i8>*		%2 = bitcast i8* %0 to <8 x i8>*
%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %2, i32 1, <8 x i1> %1, <8 x i8> undef)		%wide.masked.load = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %2, i32 1, <8 x i1> %1, <8 x i8> undef)
%3 = zext <8 x i8> %wide.masked.load to <8 x i16>		%3 = zext <8 x i8> %wide.masked.load to <8 x i16>
%4 = getelementptr inbounds i8, i8* %c, i32 %index		%4 = getelementptr inbounds i8, i8* %c, i32 %index
%5 = bitcast i8* %4 to <8 x i8>*		%5 = bitcast i8* %4 to <8 x i8>*
%wide.masked.load14 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %5, i32 1, <8 x i1> %1, <8 x i8> undef)		%wide.masked.load14 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* %5, i32 1, <8 x i1> %1, <8 x i8> undef)
%6 = zext <8 x i8> %wide.masked.load14 to <8 x i16>		%6 = zext <8 x i8> %wide.masked.load14 to <8 x i16>
%7 = mul nuw <8 x i16> %6, %3		%7 = mul nuw <8 x i16> %6, %3
Show All 10 Lines

declare <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>*, i32 immarg, <4 x i1>, <4 x i8>)		declare <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>*, i32 immarg, <4 x i1>, <4 x i8>)
declare <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>*, i32 immarg, <8 x i1>, <8 x i8>)		declare <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>*, i32 immarg, <8 x i1>, <8 x i8>)
declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>)		declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32 immarg, <4 x i1>, <4 x i16>)
declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)		declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)		declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
		declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/nested.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -mtriple=armv8.1m.main -mattr=+mve -S -mve-tail-predication -disable-mve-tail-predication=false %s -o - \| FileCheck %s			; RUN: opt -mtriple=armv8.1m.main -mattr=+mve -S -mve-tail-predication -disable-mve-tail-predication=false %s -o - \| FileCheck %s

	define void @mat_vec_sext_i16(i16** nocapture readonly %A, i16* nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {			define void @mat_vec_sext_i16(i16** nocapture readonly %A, i16* nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
	; CHECK-LABEL: @mat_vec_sext_i16(			; CHECK-LABEL: @mat_vec_sext_i16(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[CMP24:%.]] = icmp eq i32 [[N:%.]], 0			; CHECK-NEXT: [[CMP24:%.]] = icmp eq i32 [[N:%.]], 0
	; CHECK-NEXT: br i1 [[CMP24]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.]]			; CHECK-NEXT: br i1 [[CMP24]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.]]
	; CHECK: for.cond1.preheader.us.preheader:			; CHECK: for.cond1.preheader.us.preheader:
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3
	; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4			; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = add i32 [[N]], -1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT28:%.*]] = insertelement <4 x i32> undef, i32 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT29:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT28]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP:%.*]] = add i32 [[N_VEC]], -4			; CHECK-NEXT: [[TMP:%.*]] = add i32 [[N_VEC]], -4
	; CHECK-NEXT: [[TMP1:%.*]] = lshr i32 [[TMP]], 2			; CHECK-NEXT: [[TMP1:%.*]] = lshr i32 [[TMP]], 2
	; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i32 [[TMP1]], 1			; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i32 [[TMP1]], 1
	; CHECK-NEXT: br label [[FOR_COND1_PREHEADER_US:%.*]]			; CHECK-NEXT: br label [[FOR_COND1_PREHEADER_US:%.*]]
	; CHECK: for.cond1.preheader.us:			; CHECK: for.cond1.preheader.us:
	; CHECK-NEXT: [[I_025_US:%.]] = phi i32 [ [[INC10_US:%.]], [[MIDDLE_BLOCK:%.*]] ], [ 0, [[FOR_COND1_PREHEADER_US_PREHEADER]] ]			; CHECK-NEXT: [[I_025_US:%.]] = phi i32 [ [[INC10_US:%.]], [[MIDDLE_BLOCK:%.*]] ], [ 0, [[FOR_COND1_PREHEADER_US_PREHEADER]] ]
	; CHECK-NEXT: [[ARRAYIDX_US:%.]] = getelementptr inbounds i16, i16** [[A:%.*]], i32 [[I_025_US]]			; CHECK-NEXT: [[ARRAYIDX_US:%.]] = getelementptr inbounds i16, i16** [[A:%.*]], i32 [[I_025_US]]
	; CHECK-NEXT: [[TMP3:%.]] = load i16, i16** [[ARRAYIDX_US]], align 4			; CHECK-NEXT: [[TMP3:%.]] = load i16, i16** [[ARRAYIDX_US]], align 4
	; CHECK-NEXT: [[ARRAYIDX8_US:%.]] = getelementptr inbounds i32, i32 [[C:%.*]], i32 [[I_025_US]]			; CHECK-NEXT: [[ARRAYIDX8_US:%.]] = getelementptr inbounds i32, i32 [[C:%.*]], i32 [[I_025_US]]
	; CHECK-NEXT: [[ARRAYIDX8_PROMOTED_US:%.]] = load i32, i32 [[ARRAYIDX8_US]], align 4			; CHECK-NEXT: [[ARRAYIDX8_PROMOTED_US:%.]] = load i32, i32 [[ARRAYIDX8_US]], align 4
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 [[ARRAYIDX8_PROMOTED_US]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 [[ARRAYIDX8_PROMOTED_US]], i32 0
	; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP2]])			; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP2]])
				; CHECK-NEXT: [[NUM_ELEMENTS:%.*]] = add i32 [[TRIP_COUNT_MINUS_1]], 1
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[FOR_COND1_PREHEADER_US]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[FOR_COND1_PREHEADER_US]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.]] = phi <4 x i32> [ [[TMP4]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP14:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.]] = phi <4 x i32> [ [[TMP4]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP14:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP5:%.]] = phi i32 [ [[TMP2]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP15:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[TMP5:%.]] = phi i32 [ [[TMP2]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP15:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ [[N]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP2:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ [[NUM_ELEMENTS]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP2:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>
	; CHECK-NEXT: [[TMP6:%.]] = getelementptr inbounds i16, i16 [[TMP3]], i32 [[INDEX]]			; CHECK-NEXT: [[TMP6:%.]] = getelementptr inbounds i16, i16 [[TMP3]], i32 [[INDEX]]
	; CHECK-NEXT: [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP0]])			; CHECK-NEXT: [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP0]])
	; CHECK-NEXT: [[TMP2]] = sub i32 [[TMP0]], 4			; CHECK-NEXT: [[TMP2]] = sub i32 [[TMP0]], 4
	; CHECK-NEXT: [[TMP8:%.]] = bitcast i16 [[TMP6]] to <4 x i16>*			; CHECK-NEXT: [[TMP8:%.]] = bitcast i16 [[TMP6]] to <4 x i16>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16> [[TMP8]], i32 2, <4 x i1> [[TMP1]], <4 x i16> undef)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16> [[TMP8]], i32 2, <4 x i1> [[TMP1]], <4 x i16> undef)
	; CHECK-NEXT: [[TMP9:%.*]] = sext <4 x i16> [[WIDE_MASKED_LOAD]] to <4 x i32>			; CHECK-NEXT: [[TMP9:%.*]] = sext <4 x i16> [[WIDE_MASKED_LOAD]] to <4 x i32>
	; CHECK-NEXT: [[TMP10:%.]] = getelementptr inbounds i16, i16 [[B:%.*]], i32 [[INDEX]]			; CHECK-NEXT: [[TMP10:%.]] = getelementptr inbounds i16, i16 [[B:%.*]], i32 [[INDEX]]
	; CHECK-NEXT: [[TMP11:%.]] = bitcast i16 [[TMP10]] to <4 x i16>*			; CHECK-NEXT: [[TMP11:%.]] = bitcast i16 [[TMP10]] to <4 x i16>*
	▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines
	vector.body: ; preds = %vector.body, %for.cond1.preheader.us			vector.body: ; preds = %vector.body, %for.cond1.preheader.us
	%index = phi i32 [ 0, %for.cond1.preheader.us ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %for.cond1.preheader.us ], [ %index.next, %vector.body ]
	%vec.phi = phi <4 x i32> [ %tmp4, %for.cond1.preheader.us ], [ %tmp14, %vector.body ]			%vec.phi = phi <4 x i32> [ %tmp4, %for.cond1.preheader.us ], [ %tmp14, %vector.body ]
	%tmp5 = phi i32 [ %tmp2, %for.cond1.preheader.us ], [ %tmp15, %vector.body ]			%tmp5 = phi i32 [ %tmp2, %for.cond1.preheader.us ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp6 = getelementptr inbounds i16, i16* %tmp3, i32 %index			%tmp6 = getelementptr inbounds i16, i16* %tmp3, i32 %index
	%tmp7 = icmp ule <4 x i32> %induction, %broadcast.splat29
				; %tmp7 = icmp ule <4 x i32> %induction, %broadcast.splat29
				%tmp7 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp8 = bitcast i16* %tmp6 to <4 x i16>*			%tmp8 = bitcast i16* %tmp6 to <4 x i16>*
	%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %tmp8, i32 2, <4 x i1> %tmp7, <4 x i16> undef)			%wide.masked.load = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %tmp8, i32 2, <4 x i1> %tmp7, <4 x i16> undef)
	%tmp9 = sext <4 x i16> %wide.masked.load to <4 x i32>			%tmp9 = sext <4 x i16> %wide.masked.load to <4 x i32>
	%tmp10 = getelementptr inbounds i16, i16* %B, i32 %index			%tmp10 = getelementptr inbounds i16, i16* %B, i32 %index
	%tmp11 = bitcast i16* %tmp10 to <4 x i16>*			%tmp11 = bitcast i16* %tmp10 to <4 x i16>*
	%wide.masked.load30 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %tmp11, i32 2, <4 x i1> %tmp7, <4 x i16> undef)			%wide.masked.load30 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* %tmp11, i32 2, <4 x i1> %tmp7, <4 x i16> undef)
	%tmp12 = sext <4 x i16> %wide.masked.load30 to <4 x i32>			%tmp12 = sext <4 x i16> %wide.masked.load30 to <4 x i32>
	%tmp13 = mul nsw <4 x i32> %tmp12, %tmp9			%tmp13 = mul nsw <4 x i32> %tmp12, %tmp9
	Show All 18 Lines
	define void @mat_vec_i32(i32** nocapture readonly %A, i32* nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {			define void @mat_vec_i32(i32** nocapture readonly %A, i32* nocapture readonly %B, i32* noalias nocapture %C, i32 %N) {
	; CHECK-LABEL: @mat_vec_i32(			; CHECK-LABEL: @mat_vec_i32(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[CMP23:%.]] = icmp eq i32 [[N:%.]], 0			; CHECK-NEXT: [[CMP23:%.]] = icmp eq i32 [[N:%.]], 0
	; CHECK-NEXT: br i1 [[CMP23]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.]]			; CHECK-NEXT: br i1 [[CMP23]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.]]
	; CHECK: for.cond1.preheader.us.preheader:			; CHECK: for.cond1.preheader.us.preheader:
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3
	; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4			; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = add i32 [[N]], -1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT27:%.*]] = insertelement <4 x i32> undef, i32 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT28:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT27]], <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP:%.*]] = add i32 [[N_VEC]], -4			; CHECK-NEXT: [[TMP:%.*]] = add i32 [[N_VEC]], -4
	; CHECK-NEXT: [[TMP1:%.*]] = lshr i32 [[TMP]], 2			; CHECK-NEXT: [[TMP1:%.*]] = lshr i32 [[TMP]], 2
	; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i32 [[TMP1]], 1			; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i32 [[TMP1]], 1
	; CHECK-NEXT: br label [[FOR_COND1_PREHEADER_US:%.*]]			; CHECK-NEXT: br label [[FOR_COND1_PREHEADER_US:%.*]]
	; CHECK: for.cond1.preheader.us:			; CHECK: for.cond1.preheader.us:
	; CHECK-NEXT: [[I_024_US:%.]] = phi i32 [ [[INC9_US:%.]], [[MIDDLE_BLOCK:%.*]] ], [ 0, [[FOR_COND1_PREHEADER_US_PREHEADER]] ]			; CHECK-NEXT: [[I_024_US:%.]] = phi i32 [ [[INC9_US:%.]], [[MIDDLE_BLOCK:%.*]] ], [ 0, [[FOR_COND1_PREHEADER_US_PREHEADER]] ]
	; CHECK-NEXT: [[ARRAYIDX_US:%.]] = getelementptr inbounds i32, i32** [[A:%.*]], i32 [[I_024_US]]			; CHECK-NEXT: [[ARRAYIDX_US:%.]] = getelementptr inbounds i32, i32** [[A:%.*]], i32 [[I_024_US]]
	; CHECK-NEXT: [[TMP3:%.]] = load i32, i32** [[ARRAYIDX_US]], align 4			; CHECK-NEXT: [[TMP3:%.]] = load i32, i32** [[ARRAYIDX_US]], align 4
	; CHECK-NEXT: [[ARRAYIDX7_US:%.]] = getelementptr inbounds i32, i32 [[C:%.*]], i32 [[I_024_US]]			; CHECK-NEXT: [[ARRAYIDX7_US:%.]] = getelementptr inbounds i32, i32 [[C:%.*]], i32 [[I_024_US]]
	; CHECK-NEXT: [[ARRAYIDX7_PROMOTED_US:%.]] = load i32, i32 [[ARRAYIDX7_US]], align 4			; CHECK-NEXT: [[ARRAYIDX7_PROMOTED_US:%.]] = load i32, i32 [[ARRAYIDX7_US]], align 4
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 [[ARRAYIDX7_PROMOTED_US]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> <i32 undef, i32 0, i32 0, i32 0>, i32 [[ARRAYIDX7_PROMOTED_US]], i32 0
	; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP2]])			; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 [[TMP2]])
				; CHECK-NEXT: [[NUM_ELEMENTS:%.*]] = add i32 [[TRIP_COUNT_MINUS_1]], 1
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[FOR_COND1_PREHEADER_US]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[FOR_COND1_PREHEADER_US]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.]] = phi <4 x i32> [ [[TMP4]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP12:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.]] = phi <4 x i32> [ [[TMP4]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP12:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP5:%.]] = phi i32 [ [[TMP2]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP13:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[TMP5:%.]] = phi i32 [ [[TMP2]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP13:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ [[N]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP2:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ [[NUM_ELEMENTS]], [[FOR_COND1_PREHEADER_US]] ], [ [[TMP2:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>
	; CHECK-NEXT: [[TMP6:%.]] = getelementptr inbounds i32, i32 [[TMP3]], i32 [[INDEX]]			; CHECK-NEXT: [[TMP6:%.]] = getelementptr inbounds i32, i32 [[TMP3]], i32 [[INDEX]]
	; CHECK-NEXT: [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP0]])			; CHECK-NEXT: [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP0]])
	; CHECK-NEXT: [[TMP2]] = sub i32 [[TMP0]], 4			; CHECK-NEXT: [[TMP2]] = sub i32 [[TMP0]], 4
	; CHECK-NEXT: [[TMP8:%.]] = bitcast i32 [[TMP6]] to <4 x i32>*			; CHECK-NEXT: [[TMP8:%.]] = bitcast i32 [[TMP6]] to <4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[TMP8]], i32 4, <4 x i1> [[TMP1]], <4 x i32> undef)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[TMP8]], i32 4, <4 x i1> [[TMP1]], <4 x i32> undef)
	; CHECK-NEXT: [[TMP9:%.]] = getelementptr inbounds i32, i32 [[B:%.*]], i32 [[INDEX]]			; CHECK-NEXT: [[TMP9:%.]] = getelementptr inbounds i32, i32 [[B:%.*]], i32 [[INDEX]]
	; CHECK-NEXT: [[TMP10:%.]] = bitcast i32 [[TMP9]] to <4 x i32>*			; CHECK-NEXT: [[TMP10:%.]] = bitcast i32 [[TMP9]] to <4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD29:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[TMP10]], i32 4, <4 x i1> [[TMP1]], <4 x i32> undef)			; CHECK-NEXT: [[WIDE_MASKED_LOAD29:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[TMP10]], i32 4, <4 x i1> [[TMP1]], <4 x i32> undef)
	▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines
	vector.body: ; preds = %vector.body, %for.cond1.preheader.us			vector.body: ; preds = %vector.body, %for.cond1.preheader.us
	%index = phi i32 [ 0, %for.cond1.preheader.us ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %for.cond1.preheader.us ], [ %index.next, %vector.body ]
	%vec.phi = phi <4 x i32> [ %tmp4, %for.cond1.preheader.us ], [ %tmp12, %vector.body ]			%vec.phi = phi <4 x i32> [ %tmp4, %for.cond1.preheader.us ], [ %tmp12, %vector.body ]
	%tmp5 = phi i32 [ %tmp2, %for.cond1.preheader.us ], [ %tmp13, %vector.body ]			%tmp5 = phi i32 [ %tmp2, %for.cond1.preheader.us ], [ %tmp13, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp6 = getelementptr inbounds i32, i32* %tmp3, i32 %index			%tmp6 = getelementptr inbounds i32, i32* %tmp3, i32 %index
	%tmp7 = icmp ule <4 x i32> %induction, %broadcast.splat28
				; %tmp7 = icmp ule <4 x i32> %induction, %broadcast.splat28
				%tmp7 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp8 = bitcast i32* %tmp6 to <4 x i32>*			%tmp8 = bitcast i32* %tmp6 to <4 x i32>*
	%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp8, i32 4, <4 x i1> %tmp7, <4 x i32> undef)			%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp8, i32 4, <4 x i1> %tmp7, <4 x i32> undef)
	%tmp9 = getelementptr inbounds i32, i32* %B, i32 %index			%tmp9 = getelementptr inbounds i32, i32* %B, i32 %index
	%tmp10 = bitcast i32* %tmp9 to <4 x i32>*			%tmp10 = bitcast i32* %tmp9 to <4 x i32>*
	%wide.masked.load29 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp10, i32 4, <4 x i1> %tmp7, <4 x i32> undef)			%wide.masked.load29 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp10, i32 4, <4 x i1> %tmp7, <4 x i32> undef)
	%tmp11 = mul nsw <4 x i32> %wide.masked.load29, %wide.masked.load			%tmp11 = mul nsw <4 x i32> %wide.masked.load29, %wide.masked.load
	%tmp12 = add nsw <4 x i32> %vec.phi, %tmp11			%tmp12 = add nsw <4 x i32> %vec.phi, %tmp11
	%index.next = add i32 %index, 4			%index.next = add i32 %index, 4
	Show All 23 Lines
	declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>) #1			declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>) #1

	; Function Attrs: noduplicate nounwind			; Function Attrs: noduplicate nounwind
	declare void @llvm.set.loop.iterations.i32(i32) #2			declare void @llvm.set.loop.iterations.i32(i32) #2

	; Function Attrs: noduplicate nounwind			; Function Attrs: noduplicate nounwind
	declare i32 @llvm.loop.decrement.reg.i32(i32, i32) #2			declare i32 @llvm.loop.decrement.reg.i32(i32, i32) #2

				declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

	attributes #0 = { argmemonly nounwind readonly willreturn }			attributes #0 = { argmemonly nounwind readonly willreturn }
	attributes #1 = { nounwind readnone willreturn }			attributes #1 = { nounwind readnone willreturn }
	attributes #2 = { noduplicate nounwind }			attributes #2 = { noduplicate nounwind }

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - \| FileCheck %s		; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - \| FileCheck %s

define dso_local void @foo(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {		define dso_local void @foo(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
; CHECK-LABEL: @foo(		; CHECK-LABEL: @foo(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 8001)		; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 8001)
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[LSR_IV14:%.]] = phi i32 [ [[SCEVGEP15:%.]], [[VECTOR_BODY]] ], [ [[A:%.]], [[ENTRY:%.*]] ]		; CHECK-NEXT: [[LSR_IV14:%.]] = phi i32 [ [[SCEVGEP15:%.]], [[VECTOR_BODY]] ], [ [[A:%.]], [[ENTRY:%.*]] ]
; CHECK-NEXT: [[LSR_IV11:%.]] = phi i32 [ [[SCEVGEP12:%.]], [[VECTOR_BODY]] ], [ [[C:%.]], [[ENTRY]] ]		; CHECK-NEXT: [[LSR_IV11:%.]] = phi i32 [ [[SCEVGEP12:%.]], [[VECTOR_BODY]] ], [ [[C:%.]], [[ENTRY]] ]
; CHECK-NEXT: [[LSR_IV:%.]] = phi i32 [ [[SCEVGEP:%.]], [[VECTOR_BODY]] ], [ [[B:%.]], [[ENTRY]] ]		; CHECK-NEXT: [[LSR_IV:%.]] = phi i32 [ [[SCEVGEP:%.]], [[VECTOR_BODY]] ], [ [[B:%.]], [[ENTRY]] ]
		; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[ENTRY]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ 8001, [[ENTRY]] ], [ [[TMP5:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ 8001, [[ENTRY]] ], [ [[TMP5:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP1:%.]] = phi i32 [ 32002, [[ENTRY]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[TMP1:%.]] = phi i32 [ 32003, [[ENTRY]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[LSR_IV1416:%.]] = bitcast i32 [[LSR_IV14]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV1416:%.]] = bitcast i32 [[LSR_IV14]] to <4 x i32>*
; CHECK-NEXT: [[LSR_IV1113:%.]] = bitcast i32 [[LSR_IV11]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV1113:%.]] = bitcast i32 [[LSR_IV11]] to <4 x i32>*
; CHECK-NEXT: [[LSR_IV10:%.]] = bitcast i32 [[LSR_IV]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV10:%.]] = bitcast i32 [[LSR_IV]] to <4 x i32>*
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP2:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP1]])		; CHECK-NEXT: [[TMP2:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP1]])
; CHECK-NEXT: [[TMP3]] = sub i32 [[TMP1]], 4		; CHECK-NEXT: [[TMP3]] = sub i32 [[TMP1]], 4
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV10]], i32 4, <4 x i1> [[TMP2]], <4 x i32> undef)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV10]], i32 4, <4 x i1> [[TMP2]], <4 x i32> undef)
; CHECK-NEXT: [[WIDE_MASKED_LOAD9:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV1113]], i32 4, <4 x i1> [[TMP2]], <4 x i32> undef)		; CHECK-NEXT: [[WIDE_MASKED_LOAD9:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV1113]], i32 4, <4 x i1> [[TMP2]], <4 x i32> undef)
; CHECK-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_LOAD9]], [[WIDE_MASKED_LOAD]]		; CHECK-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_LOAD9]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP4]], <4 x i32>* [[LSR_IV1416]], i32 4, <4 x i1> [[TMP2]])		; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP4]], <4 x i32>* [[LSR_IV1416]], i32 4, <4 x i1> [[TMP2]])
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
; CHECK-NEXT: [[SCEVGEP]] = getelementptr i32, i32* [[LSR_IV]], i32 4		; CHECK-NEXT: [[SCEVGEP]] = getelementptr i32, i32* [[LSR_IV]], i32 4
; CHECK-NEXT: [[SCEVGEP12]] = getelementptr i32, i32* [[LSR_IV11]], i32 4		; CHECK-NEXT: [[SCEVGEP12]] = getelementptr i32, i32* [[LSR_IV11]], i32 4
; CHECK-NEXT: [[SCEVGEP15]] = getelementptr i32, i32* [[LSR_IV14]], i32 4		; CHECK-NEXT: [[SCEVGEP15]] = getelementptr i32, i32* [[LSR_IV14]], i32 4
; CHECK-NEXT: [[TMP5]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP0]], i32 1)		; CHECK-NEXT: [[TMP5]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP0]], i32 1)
; CHECK-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP5]], 0		; CHECK-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP5]], 0
; CHECK-NEXT: br i1 [[TMP6]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[TMP6]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
Show All 9 Lines	vector.body:
%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%1 = icmp ult <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		; %1 = icmp ult <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 32002)

%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%scevgep = getelementptr i32, i32* %lsr.iv, i32 4		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
▲ Show 20 Lines • Show All 108 Lines • ▼ Show 20 Lines	vector.body:
%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

; UGT here:		; UGT here:
%1 = icmp ugt <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>		%1 = icmp ugt <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>

%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%scevgep = getelementptr i32, i32* %lsr.iv, i32 4		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
Show All 12 Lines
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 8001)		; CHECK-NEXT: call void @llvm.set.loop.iterations.i32(i32 8001)
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[LSR_IV14:%.]] = phi i32 [ [[SCEVGEP15:%.]], [[VECTOR_BODY]] ], [ [[A:%.]], [[ENTRY:%.*]] ]		; CHECK-NEXT: [[LSR_IV14:%.]] = phi i32 [ [[SCEVGEP15:%.]], [[VECTOR_BODY]] ], [ [[A:%.]], [[ENTRY:%.*]] ]
; CHECK-NEXT: [[LSR_IV11:%.]] = phi i32 [ [[SCEVGEP12:%.]], [[VECTOR_BODY]] ], [ [[C:%.]], [[ENTRY]] ]		; CHECK-NEXT: [[LSR_IV11:%.]] = phi i32 [ [[SCEVGEP12:%.]], [[VECTOR_BODY]] ], [ [[C:%.]], [[ENTRY]] ]
; CHECK-NEXT: [[LSR_IV:%.]] = phi i32 [ [[SCEVGEP:%.]], [[VECTOR_BODY]] ], [ [[B:%.]], [[ENTRY]] ]		; CHECK-NEXT: [[LSR_IV:%.]] = phi i32 [ [[SCEVGEP:%.]], [[VECTOR_BODY]] ], [ [[B:%.]], [[ENTRY]] ]
; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[ENTRY]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[ENTRY]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ 8001, [[ENTRY]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[TMP0:%.]] = phi i32 [ 8001, [[ENTRY]] ], [ [[TMP5:%.]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP1:%.]] = phi i32 [ 32003, [[ENTRY]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[LSR_IV1416:%.]] = bitcast i32 [[LSR_IV14]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV1416:%.]] = bitcast i32 [[LSR_IV14]] to <4 x i32>*
; CHECK-NEXT: [[LSR_IV1113:%.]] = bitcast i32 [[LSR_IV11]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV1113:%.]] = bitcast i32 [[LSR_IV11]] to <4 x i32>*
; CHECK-NEXT: [[LSR_IV10:%.]] = bitcast i32 [[LSR_IV]] to <4 x i32>*		; CHECK-NEXT: [[LSR_IV10:%.]] = bitcast i32 [[LSR_IV]] to <4 x i32>*
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[INDEX]], i32 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>		; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP1:%.*]] = icmp ult <4 x i32> [[INDUCTION]], <i32 32002, i32 32002, i32 32002, i32 32002>		; CHECK-NEXT: [[TMP2:%.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[TMP1]])
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV10]], i32 4, <4 x i1> [[TMP1]], <4 x i32> undef)		; CHECK-NEXT: [[TMP3]] = sub i32 [[TMP1]], 4
; CHECK-NEXT: [[WIDE_MASKED_LOAD9:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV1113]], i32 4, <4 x i1> [[TMP1]], <4 x i32> undef)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV10]], i32 4, <4 x i1> [[TMP2]], <4 x i32> undef)
; CHECK-NEXT: [[TMP2:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_LOAD9]], [[WIDE_MASKED_LOAD]]		; CHECK-NEXT: [[WIDE_MASKED_LOAD9:%.]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> [[LSR_IV1113]], i32 4, <4 x i1> [[TMP2]], <4 x i32> undef)
; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP2]], <4 x i32>* [[LSR_IV1416]], i32 4, <4 x i1> [[TMP1]])		; CHECK-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_LOAD9]], [[WIDE_MASKED_LOAD]]
		; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP4]], <4 x i32>* [[LSR_IV1416]], i32 4, <4 x i1> [[TMP2]])
; CHECK-NEXT: [[INDEX_NEXT]] = sub i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = sub i32 [[INDEX]], 4
; CHECK-NEXT: [[SCEVGEP]] = getelementptr i32, i32* [[LSR_IV]], i32 4		; CHECK-NEXT: [[SCEVGEP]] = getelementptr i32, i32* [[LSR_IV]], i32 4
; CHECK-NEXT: [[SCEVGEP12]] = getelementptr i32, i32* [[LSR_IV11]], i32 4		; CHECK-NEXT: [[SCEVGEP12]] = getelementptr i32, i32* [[LSR_IV11]], i32 4
; CHECK-NEXT: [[SCEVGEP15]] = getelementptr i32, i32* [[LSR_IV14]], i32 4		; CHECK-NEXT: [[SCEVGEP15]] = getelementptr i32, i32* [[LSR_IV14]], i32 4
; CHECK-NEXT: [[TMP3]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP0]], i32 1)		; CHECK-NEXT: [[TMP5]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP0]], i32 1)
; CHECK-NEXT: [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0		; CHECK-NEXT: [[TMP6:%.*]] = icmp ne i32 [[TMP5]], 0
; CHECK-NEXT: br i1 [[TMP4]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[TMP6]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
call void @llvm.set.loop.iterations.i32(i32 8001)		call void @llvm.set.loop.iterations.i32(i32 8001)
br label %vector.body		br label %vector.body

vector.body:		vector.body:
%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%1 = icmp ult <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		; %1 = icmp ult <4 x i32> %induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 32002)
		efriedmaUnsubmitted Not Done Reply Inline Actions I don't understand how this loop is supposed to work. %index is zero in the first iteration, and UINT_MAX-3 in the second iteration. efriedma: I don't understand how this loop is supposed to work. %index is zero in the first iteration…
		SjoerdMeijerAuthorUnsubmitted Done Reply Inline Actions yep, thanks for catching, doesn't make sense, some sort of copy-paste mistake. SjoerdMeijer: yep, thanks for catching, doesn't make sense, some sort of copy-paste mistake.

%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)

; Counting down:		; Counting down:
%index.next = sub i32 %index, 4		%index.next = sub i32 %index, 4
%scevgep = getelementptr i32, i32* %lsr.iv, i32 4		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	vector.body:
%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

; non-uniform constant vector here:		; Non-uniform constant vector here. This can't be represented with
		; @llvm.get.active.lane.mask, but let's keep this test as a sanity check:
%1 = icmp ult <4 x i32> %induction, <i32 0, i32 3200, i32 32002, i32 32002>		%1 = icmp ult <4 x i32> %induction, <i32 0, i32 3200, i32 32002, i32 32002>

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
		%index.next = add i32 %index, 4
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
		%4 = icmp ne i32 %3, 0
		br i1 %4, label %vector.body, label %for.cond.cleanup

		for.cond.cleanup:
		ret void
		}

		; CHECK-LABEL: @overflow(
		;
		; CHECK-NOT: @llvm.arm.mve.vctp32
		; CHECK-NOT: @llvm.get.active.lane.mask
		;
		; CHECK: %lane.mask.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		; CHECK: %lane.mask.splat = shufflevector <4 x i32> %lane.mask.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: %lane.mask.induction = add <4 x i32> %lane.mask.splat, <i32 0, i32 1, i32 2, i32 3>
		; CHECK: %[[ICMP:.*]] = icmp ule <4 x i32> %lane.mask.induction, <i32 -1, i32 -1, i32 -1, i32 -1>
		; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ICMP]], <4 x i32> undef)
		;
		; CHECK: ret void
		;
		define dso_local void @overflow(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
		entry:
		call void @llvm.set.loop.iterations.i32(i32 8001)
		br label %vector.body

		vector.body:
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

		; BTC = UINT_MAX, and scalar trip count BTC + 1 would overflow:
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 4294967295)

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
		%index.next = add i32 %index, 4
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
		%4 = icmp ne i32 %3, 0
		br i1 %4, label %vector.body, label %for.cond.cleanup

		for.cond.cleanup:
		ret void
		}

		; CHECK-LABEL: @IV_not_an_induction(
		;
		; CHECK-NOT: @llvm.arm.mve.vctp32
		; CHECK-NOT: @llvm.get.active.lane.mask
		;
		; CHECK: %lane.mask.splatinsert = insertelement <4 x i32> undef, i32 %N, i32 0
		; CHECK: %lane.mask.splat = shufflevector <4 x i32> %lane.mask.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: %lane.mask.induction = add <4 x i32> %lane.mask.splat, <i32 0, i32 1, i32 2, i32 3>
		; CHECK: %[[ICMP:.*]] = icmp ule <4 x i32> %lane.mask.induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ICMP]], <4 x i32> undef)
		; CHECK: ret void
		;
		define dso_local void @IV_not_an_induction(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
		entry:
		call void @llvm.set.loop.iterations.i32(i32 8001)
		br label %vector.body

		vector.body:
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

		; The induction variable %D is not an IV:
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %N, i32 32002)

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
		%index.next = add i32 %index, 4
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
		%4 = icmp ne i32 %3, 0
		br i1 %4, label %vector.body, label %for.cond.cleanup

		for.cond.cleanup:
		ret void
		}

		; CHECK-LABEL: @IV_wrong_step(
		;
		; CHECK-NOT: @llvm.arm.mve.vctp32
		; CHECK-NOT: @llvm.get.active.lane.mask
		;
		; CHECK: %lane.mask.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		; CHECK: %lane.mask.splat = shufflevector <4 x i32> %lane.mask.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: %lane.mask.induction = add <4 x i32> %lane.mask.splat, <i32 0, i32 1, i32 2, i32 3>
		; CHECK: %[[ICMP:.*]] = icmp ule <4 x i32> %lane.mask.induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ICMP]], <4 x i32> undef)
		; CHECK: ret void
		;
		define dso_local void @IV_wrong_step(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
		entry:
		call void @llvm.set.loop.iterations.i32(i32 8001)
		br label %vector.body

		vector.body:
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 32002)

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)

		; %index is incremented with 3 and not 4, which is the vectorisation factor
		; that we expect here:
		%index.next = add i32 %index, 3

		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
		%4 = icmp ne i32 %3, 0
		br i1 %4, label %vector.body, label %for.cond.cleanup

		for.cond.cleanup:
		ret void
		}

		; CHECK-LABEL: @IV_step_not_constant(
		;
		; CHECK-NOT: @llvm.arm.mve.vctp32
		; CHECK-NOT: @llvm.get.active.lane.mask
		;
		; CHECK: %lane.mask.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		; CHECK: %lane.mask.splat = shufflevector <4 x i32> %lane.mask.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: %lane.mask.induction = add <4 x i32> %lane.mask.splat, <i32 0, i32 1, i32 2, i32 3>
		; CHECK: %[[ICMP:.*]] = icmp ule <4 x i32> %lane.mask.induction, <i32 32002, i32 32002, i32 32002, i32 32002>
		; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ICMP]], <4 x i32> undef)
		; CHECK: ret void
		;
		define dso_local void @IV_step_not_constant(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
		entry:
		call void @llvm.set.loop.iterations.i32(i32 8001)
		br label %vector.body

		vector.body:
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 32002)
		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)

		; %index is incremented with some runtime value, i.e. not a constant:
		%index.next = add i32 %index, %N

		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
		%4 = icmp ne i32 %3, 0
		br i1 %4, label %vector.body, label %for.cond.cleanup

		for.cond.cleanup:
		ret void
		}

		; CHECK-LABEL: @outerloop_phi(
		;
		; CHECK-NOT: @llvm.arm.mve.vctp32
		; CHECK-NOT: @llvm.get.active.lane.mask
		; CHECK: %lane.mask.splatinsert = insertelement <4 x i32> undef, i32 %j.025, i32 0
		; CHECK: %lane.mask.splat = shufflevector <4 x i32> %lane.mask.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: %lane.mask.induction = add <4 x i32> %lane.mask.splat, <i32 0, i32 1, i32 2, i32 3>
		; CHECK: %[[ICMP:.*]] = icmp ule <4 x i32> %lane.mask.induction, <i32 4096, i32 4096, i32 4096, i32 4096>
		; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ICMP]], <4 x i32> undef)
		;
		; CHECK: ret void
		;
		define dso_local void @outerloop_phi(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32 %N) local_unnamed_addr #0 {
		entry:
		%cmp24 = icmp eq i32 %N, 0
		br i1 %cmp24, label %for.cond.cleanup, label %vector.ph.preheader

		vector.ph.preheader: ; preds = %entry
		br label %vector.ph

		vector.ph: ; preds = %vector.ph.preheader, %for.cond.cleanup3
		%lsr.iv36 = phi i32* [ %B, %vector.ph.preheader ], [ %scevgep37, %for.cond.cleanup3 ]
		%lsr.iv31 = phi i32* [ %C, %vector.ph.preheader ], [ %scevgep32, %for.cond.cleanup3 ]
		%lsr.iv = phi i32* [ %A, %vector.ph.preheader ], [ %scevgep, %for.cond.cleanup3 ]
		%j.025 = phi i32 [ %inc11, %for.cond.cleanup3 ], [ 0, %vector.ph.preheader ]
		call void @llvm.set.loop.iterations.i32(i32 1025)
		br label %vector.body

		vector.body: ; preds = %vector.body, %vector.ph
		%lsr.iv38 = phi i32* [ %scevgep39, %vector.body ], [ %lsr.iv36, %vector.ph ]
		%lsr.iv33 = phi i32* [ %scevgep34, %vector.body ], [ %lsr.iv31, %vector.ph ]
		%lsr.iv28 = phi i32* [ %scevgep29, %vector.body ], [ %lsr.iv, %vector.ph ]
		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		%0 = phi i32 [ 1025, %vector.ph ], [ %2, %vector.body ]
		%lsr.iv3840 = bitcast i32* %lsr.iv38 to <4 x i32>*
		%lsr.iv3335 = bitcast i32* %lsr.iv33 to <4 x i32>*
		%lsr.iv2830 = bitcast i32* %lsr.iv28 to <4 x i32>*

		; It's using %j.025, the induction variable from its outer loop:
		%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %j.025, i32 4096)

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv3840, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%wide.masked.load27 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv3335, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%1 = add nsw <4 x i32> %wide.masked.load27, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %1, <4 x i32>* %lsr.iv2830, i32 4, <4 x i1> %active.lane.mask)
		%index.next = add i32 %index, 4
		%scevgep29 = getelementptr i32, i32* %lsr.iv28, i32 4
		%scevgep34 = getelementptr i32, i32* %lsr.iv33, i32 4
		%scevgep39 = getelementptr i32, i32* %lsr.iv38, i32 4
		%2 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
		%3 = icmp ne i32 %2, 0
		br i1 %3, label %vector.body, label %for.cond.cleanup3

		for.cond.cleanup: ; preds = %for.cond.cleanup3, %entry
		ret void

		for.cond.cleanup3: ; preds = %vector.body
		%inc11 = add nuw i32 %j.025, 1
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 1
		%scevgep32 = getelementptr i32, i32* %lsr.iv31, i32 1
		%scevgep37 = getelementptr i32, i32* %lsr.iv36, i32 1
		%exitcond26 = icmp eq i32 %inc11, %N
		br i1 %exitcond26, label %for.cond.cleanup, label %vector.ph
		}

		; CHECK-LABEL: @overflow_in_sub(
		; CHECK-NOT: @llvm.arm.mve.vctp32
		; CHECK-NOT: @llvm.get.active.lane.mask
		; CHECK: ret void
		;
		define dso_local void @overflow_in_sub(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
		entry:
		call void @llvm.set.loop.iterations.i32(i32 8001)
		br label %vector.body

		vector.body:
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
		%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

		; Overflow in the substraction. This should hold:
		;
		; ceil(ElementCount / VectorWidth) >= TripCount
		;
		; But we have:
		;
		; ceil(3200 / 4) >= 8001
		; 8000 >= 8001
		;
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 31999)

%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load		%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%scevgep = getelementptr i32, i32* %lsr.iv, i32 4		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4		%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
%4 = icmp ne i32 %3, 0		%4 = icmp ne i32 %3, 0
br i1 %4, label %vector.body, label %for.cond.cleanup		br i1 %4, label %vector.body, label %for.cond.cleanup

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void
}		}


declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>) #1		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>) #1
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>) #2		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>) #2
declare i32 @llvm.loop.decrement.reg.i32(i32 , i32 )		declare i32 @llvm.loop.decrement.reg.i32(i32 , i32 )
declare void @llvm.set.loop.iterations.i32(i32)		declare void @llvm.set.loop.iterations.i32(i32)
		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-widen.ll

	Show All 21 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
	%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%tmp = getelementptr inbounds i16, i16* %a, i32 %index			%tmp = getelementptr inbounds i16, i16* %a, i32 %index
	%tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				%tmp1 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp2 = bitcast i16* %tmp to <8 x i16>*			%tmp2 = bitcast i16* %tmp to <8 x i16>*
	%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)			%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
	%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index			%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index
	%tmp4 = bitcast i16* %tmp3 to <8 x i16>*			%tmp4 = bitcast i16* %tmp3 to <8 x i16>*
	%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)			%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
	%expand.1 = zext <8 x i16> %wide.masked.load to <8 x i32>			%expand.1 = zext <8 x i16> %wide.masked.load to <8 x i32>
	%expand.2 = zext <8 x i16> %wide.masked.load2 to <8 x i32>			%expand.2 = zext <8 x i16> %wide.masked.load2 to <8 x i32>
	%mul = mul nsw <8 x i32> %expand.2, %expand.1			%mul = mul nsw <8 x i32> %expand.2, %expand.1
	Show All 40 Lines
	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%store.idx = phi i32 [ 0, %vector.ph ], [ %store.idx.next, %vector.body ]			%store.idx = phi i32 [ 0, %vector.ph ], [ %store.idx.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
	%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%tmp = getelementptr inbounds i16, i16* %a, i32 %index			%tmp = getelementptr inbounds i16, i16* %a, i32 %index
	%tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <8 x i32> %induction, %broadcast.splat11
				%tmp1 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp2 = bitcast i16* %tmp to <8 x i16>*			%tmp2 = bitcast i16* %tmp to <8 x i16>*
	%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)			%wide.masked.load = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp2, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
	%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index			%tmp3 = getelementptr inbounds i16, i16* %b, i32 %index
	%tmp4 = bitcast i16* %tmp3 to <8 x i16>*			%tmp4 = bitcast i16* %tmp3 to <8 x i16>*
	%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)			%wide.masked.load2 = tail call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp1, <8 x i16> undef)
	%extract.2.low = shufflevector <8 x i16> %wide.masked.load2, <8 x i16> undef, < 4 x i32> <i32 0, i32 1, i32 2, i32 3>			%extract.2.low = shufflevector <8 x i16> %wide.masked.load2, <8 x i16> undef, < 4 x i32> <i32 0, i32 1, i32 2, i32 3>
	%extract.2.high = shufflevector <8 x i16> %wide.masked.load2, <8 x i16> undef, < 4 x i32> <i32 4, i32 5, i32 6, i32 7>			%extract.2.high = shufflevector <8 x i16> %wide.masked.load2, <8 x i16> undef, < 4 x i32> <i32 4, i32 5, i32 6, i32 7>
	%expand.1 = zext <4 x i16> %extract.2.low to <4 x i32>			%expand.1 = zext <4 x i16> %extract.2.low to <4 x i32>
	▲ Show 20 Lines • Show All 42 Lines • ▼ Show 20 Lines

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]			%tmp14 = phi i32 [ %tmp13, %vector.ph ], [ %tmp15, %vector.body ]
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%tmp = getelementptr inbounds i32, i32* %a, i32 %index			%tmp = getelementptr inbounds i32, i32* %a, i32 %index
	%tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				; %tmp1 = icmp ule <4 x i32> %induction, %broadcast.splat11
				%tmp1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%tmp2 = bitcast i32* %tmp to <4 x i32>*			%tmp2 = bitcast i32* %tmp to <4 x i32>*
	%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp2, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
	%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index			%tmp3 = getelementptr inbounds i32, i32* %b, i32 %index
	%tmp4 = bitcast i32* %tmp3 to <4 x i32>*			%tmp4 = bitcast i32* %tmp3 to <4 x i32>*
	%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)			%wide.masked.load2 = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %tmp4, i32 4, <4 x i1> %tmp1, <4 x i32> undef)
	%expand.1 = zext <4 x i32> %wide.masked.load to <4 x i64>			%expand.1 = zext <4 x i32> %wide.masked.load to <4 x i64>
	%expand.2 = zext <4 x i32> %wide.masked.load2 to <4 x i64>			%expand.2 = zext <4 x i32> %wide.masked.load2 to <4 x i64>
	%mul = mul nsw <4 x i64> %expand.2, %expand.1			%mul = mul nsw <4 x i64> %expand.2, %expand.1
	Show All 11 Lines

	declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)			declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
	declare void @llvm.masked.store.v8i32.p0v8i32(<8 x i32>, <8 x i32>*, i32 immarg, <8 x i1>)			declare void @llvm.masked.store.v8i32.p0v8i32(<8 x i32>, <8 x i32>*, i32 immarg, <8 x i1>)
	declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)			declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
	declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)			declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
	declare void @llvm.masked.store.v4i64.p0v4i64(<4 x i64>, <4 x i64>*, i32 immarg, <4 x i1>)			declare void @llvm.masked.store.v4i64.p0v4i64(<4 x i64>, <4 x i64>*, i32 immarg, <4 x i1>)
	declare void @llvm.set.loop.iterations.i32(i32)			declare void @llvm.set.loop.iterations.i32(i32)
	declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)			declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
				declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll

	; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - \| FileCheck %s			; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - \| FileCheck %s

	; CHECK-LABEL: reduction_i32			; CHECK-LABEL: reduction_i32
	; CHECK: phi i32 [ 0, %entry ]			; CHECK: phi i32 [ 0, %vector.ph ]
	; CHECK: phi <8 x i16> [ zeroinitializer, %entry ]			; CHECK: phi <8 x i16> [ zeroinitializer, %vector.ph ]
	; CHECK: phi i32			; CHECK: phi i32
	; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %entry ], [ [[ELEMS:%[^ ]+]], %vector.body ]			; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[ELEMS:%[^ ]+]], %vector.body ]
	; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.mve.vctp16(i32 [[PHI]])			; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.mve.vctp16(i32 [[PHI]])
	; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8			; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8
	; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)			; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
	; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp6, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)			; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp6, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
	define i16 @reduction_i32(i16* nocapture readonly %A, i16* nocapture readonly %B, i32 %N) {			define i16 @reduction_i32(i16* nocapture readonly %A, i16* nocapture readonly %B, i32 %N) {
	entry:			entry:
				%cmp8 = icmp eq i32 %N, 0
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph:
	%tmp = add i32 %N, -1			%tmp = add i32 %N, -1
	%n.rnd.up = add nuw nsw i32 %tmp, 8			%n.rnd.up = add i32 %tmp, 8
	%n.vec = and i32 %n.rnd.up, -8			%n.vec = and i32 %n.rnd.up, -8
	%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0			%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0
	%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer
	%0 = add i32 %n.vec, -8			%0 = add i32 %n.vec, -8
	%1 = lshr i32 %0, 3			%1 = lshr i32 %0, 3
	%2 = add nuw nsw i32 %1, 1			%2 = add i32 %1, 1
	call void @llvm.set.loop.iterations.i32(i32 %2)			call void @llvm.set.loop.iterations.i32(i32 %2)
	br label %vector.body			br label %vector.body

	vector.body: ; preds = %vector.body, %entry			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph], [ %index.next, %vector.body ]
	%vec.phi = phi <8 x i16> [ zeroinitializer, %entry ], [ %tmp8, %vector.body ]			%vec.phi = phi <8 x i16> [ zeroinitializer, %vector.ph], [ %tmp8, %vector.body ]
	%3 = phi i32 [ %2, %entry ], [ %4, %vector.body ]			%3 = phi i32 [ %2, %vector.ph], [ %4, %vector.body ]
	%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
	%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index			%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index
	%tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				; %tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				%tmp3 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %tmp)

	%tmp4 = bitcast i16* %tmp2 to <8 x i16>*			%tmp4 = bitcast i16* %tmp2 to <8 x i16>*
	%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)			%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
	%tmp5 = getelementptr inbounds i16, i16* %B, i32 %index			%tmp5 = getelementptr inbounds i16, i16* %B, i32 %index
	%tmp6 = bitcast i16* %tmp5 to <8 x i16>*			%tmp6 = bitcast i16* %tmp5 to <8 x i16>*
	%wide.masked.load3 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp6, i32 4, <8 x i1> %tmp3, <8 x i16> undef)			%wide.masked.load3 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp6, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
	%tmp7 = add <8 x i16> %wide.masked.load, %vec.phi			%tmp7 = add <8 x i16> %wide.masked.load, %vec.phi
	%tmp8 = add <8 x i16> %tmp7, %wide.masked.load3			%tmp8 = add <8 x i16> %tmp7, %wide.masked.load3
	%index.next = add nuw nsw i32 %index, 8			%index.next = add i32 %index, 8
	%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)			%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)
	%5 = icmp ne i32 %4, 0			%5 = icmp ne i32 %4, 0
	br i1 %5, label %vector.body, label %middle.block			br i1 %5, label %vector.body, label %middle.block

	middle.block: ; preds = %vector.body			middle.block: ; preds = %vector.body
	%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]			%vec.phi.lcssa = phi <8 x i16> [ %vec.phi, %vector.body ]
	%.lcssa3 = phi <8 x i1> [ %tmp3, %vector.body ]			%.lcssa3 = phi <8 x i1> [ %tmp3, %vector.body ]
	%.lcssa = phi <8 x i16> [ %tmp8, %vector.body ]			%.lcssa = phi <8 x i16> [ %tmp8, %vector.body ]
	%tmp10 = select <8 x i1> %.lcssa3, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa			%tmp10 = select <8 x i1> %.lcssa3, <8 x i16> %.lcssa, <8 x i16> %vec.phi.lcssa
	%rdx.shuf = shufflevector <8 x i16> %tmp10, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			%rdx.shuf = shufflevector <8 x i16> %tmp10, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	%bin.rdx = add <8 x i16> %rdx.shuf, %tmp10			%bin.rdx = add <8 x i16> %rdx.shuf, %tmp10
	%rdx.shuf4 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%rdx.shuf4 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%bin.rdx5 = add <8 x i16> %rdx.shuf4, %bin.rdx			%bin.rdx5 = add <8 x i16> %rdx.shuf4, %bin.rdx
	%rdx.shuf6 = shufflevector <8 x i16> %bin.rdx5, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%rdx.shuf6 = shufflevector <8 x i16> %bin.rdx5, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%bin.rdx7 = add <8 x i16> %rdx.shuf6, %bin.rdx5			%bin.rdx7 = add <8 x i16> %rdx.shuf6, %bin.rdx5
	%tmp11 = extractelement <8 x i16> %bin.rdx7, i32 0			%tmp11 = extractelement <8 x i16> %bin.rdx7, i32 0
	ret i16 %tmp11			ret i16 %tmp11

				for.cond.cleanup:
				%res.0 = phi i16 [ 0, %entry ]
				ret i16 %res.0
	}			}

	; CHECK-LABEL: reduction_i32_with_scalar			; CHECK-LABEL: reduction_i32_with_scalar
	; CHECK: phi i32 [ 0, %entry ]			; CHECK: vector.body:
	; CHECK: phi <8 x i16> [ zeroinitializer, %entry ]			; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	; CHECK: phi i32			; CHECK: %vec.phi = phi <8 x i16> [ zeroinitializer, %vector.ph ], [ %{{.*}}, %vector.body ]
	; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %entry ], [ [[ELEMS:%[^ ]+]], %vector.body ]			; CHECK: %{{.}} = phi i32 [ %{{.}}, %vector.ph ], [ %{{.*}}, %vector.body ]
				; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[ELEMS:%[^ ]+]], %vector.body ]
	; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.mve.vctp16(i32 [[PHI]])			; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.mve.vctp16(i32 [[PHI]])
	; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8			; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8
	; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)			; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> [[VCTP]], <8 x i16> undef)
	define i16 @reduction_i32_with_scalar(i16* nocapture readonly %A, i16 %B, i32 %N) local_unnamed_addr {			define i16 @reduction_i32_with_scalar(i16* nocapture readonly %A, i16 %B, i32 %N) local_unnamed_addr {
	entry:			entry:
				%cmp8 = icmp eq i32 %N, 0
				br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

				vector.ph:
	%tmp = add i32 %N, -1			%tmp = add i32 %N, -1
	%n.rnd.up = add nuw nsw i32 %tmp, 8			%n.rnd.up = add nuw nsw i32 %tmp, 8
	%n.vec = and i32 %n.rnd.up, -8			%n.vec = and i32 %n.rnd.up, -8
	%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0			%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0
	%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer
	%broadcast.splatinsert3 = insertelement <8 x i16> undef, i16 %B, i32 0			%broadcast.splatinsert3 = insertelement <8 x i16> undef, i16 %B, i32 0
	%broadcast.splat4 = shufflevector <8 x i16> %broadcast.splatinsert3, <8 x i16> undef, <8 x i32> zeroinitializer			%broadcast.splat4 = shufflevector <8 x i16> %broadcast.splatinsert3, <8 x i16> undef, <8 x i32> zeroinitializer
	%0 = add i32 %n.vec, -8			%0 = add i32 %n.vec, -8
	%1 = lshr i32 %0, 3			%1 = lshr i32 %0, 3
	%2 = add nuw nsw i32 %1, 1			%2 = add nuw nsw i32 %1, 1
	call void @llvm.set.loop.iterations.i32(i32 %2)			call void @llvm.set.loop.iterations.i32(i32 %2)
	br label %vector.body			br label %vector.body

	vector.body: ; preds = %vector.body, %entry			vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph], [ %index.next, %vector.body ]
				%vec.phi = phi <8 x i16> [ zeroinitializer, %vector.ph], [ %tmp6, %vector.body ]
				%3 = phi i32 [ %2, %vector.ph], [ %4, %vector.body ]
				%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index

				; %tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				%tmp3 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %tmp)

				%tmp4 = bitcast i16* %tmp2 to <8 x i16>*
				%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
				%tmp5 = add <8 x i16> %vec.phi, %broadcast.splat4
				%tmp6 = add <8 x i16> %tmp5, %wide.masked.load
				%index.next = add nuw nsw i32 %index, 8
				%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)
				%5 = icmp ne i32 %4, 0
				br i1 %5, label %vector.body, label %middle.block

				middle.block: ; preds = %vector.body
				%tmp8 = select <8 x i1> %tmp3, <8 x i16> %tmp6, <8 x i16> %vec.phi
				%rdx.shuf = shufflevector <8 x i16> %tmp8, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <8 x i16> %rdx.shuf, %tmp8
				%rdx.shuf5 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx6 = add <8 x i16> %rdx.shuf5, %bin.rdx
				%rdx.shuf7 = shufflevector <8 x i16> %bin.rdx6, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx8 = add <8 x i16> %rdx.shuf7, %bin.rdx6
				%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0
				ret i16 %tmp9

				for.cond.cleanup:
				%res.0 = phi i16 [ 0, %entry ]
				ret i16 %res.0
				}

				; The vector loop is not guarded with an entry check (N == 0).
				; This means we can't calculate a precise range for the backedge count in
				; @llvm.get.active.lane.mask, and are assuming overflow can happen and thus
				; we can't insert the VCTP here.
				;
				; CHECK-LABEL: @reduction_not_guarded
				;
				; CHECK-NOT: @llvm.arm.mve.vctp
				; CHECK-NOT: @llvm.get.active.lane.mask.v8i1.i32
				;
				; CHECK: entry:
				; CHECK: %[[ELEMCOUNT:.*]] = add i32 %N, -1
				; CHECK: %broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %[[ELEMCOUNT]], i32 0
				; CHECK %broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer
				;
				; CHECK: vector.body:
				; CHECK: %lane.mask.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
				; CHECK: %lane.mask.splat = shufflevector <8 x i32> %lane.mask.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
				; CHECK: %lane.mask.induction = add <8 x i32> %lane.mask.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK: %[[ICMP:.*]] = icmp ule <8 x i32> %lane.mask.induction, %broadcast.splat2
				; CHECK: call <8 x i16> @llvm.masked.load.v8i16.p0v8i16({{.*}}, <8 x i1> %[[ICMP]], <8 x i16> undef)
				; CHECK: ret
				;
				define i16 @reduction_not_guarded(i16* nocapture readonly %A, i16 %B, i32 %N) local_unnamed_addr {
				entry:
				%tmp = add i32 %N, -1
				%n.rnd.up = add nuw nsw i32 %tmp, 8
				%n.vec = and i32 %n.rnd.up, -8
				%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0
				%broadcast.splat2 = shufflevector <8 x i32> %broadcast.splatinsert1, <8 x i32> undef, <8 x i32> zeroinitializer
				%broadcast.splatinsert3 = insertelement <8 x i16> undef, i16 %B, i32 0
				%broadcast.splat4 = shufflevector <8 x i16> %broadcast.splatinsert3, <8 x i16> undef, <8 x i32> zeroinitializer
				%0 = add i32 %n.vec, -8
				%1 = lshr i32 %0, 3
				%2 = add nuw nsw i32 %1, 1
				call void @llvm.set.loop.iterations.i32(i32 %2)
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %entry], [ %index.next, %vector.body ]
	%vec.phi = phi <8 x i16> [ zeroinitializer, %entry ], [ %tmp6, %vector.body ]			%vec.phi = phi <8 x i16> [ zeroinitializer, %entry], [ %tmp6, %vector.body ]
	%3 = phi i32 [ %2, %entry ], [ %4, %vector.body ]			%3 = phi i32 [ %2, %entry ], [ %4, %vector.body ]
	%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer			%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
	%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index			%tmp2 = getelementptr inbounds i16, i16* %A, i32 %index
	%tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				; %tmp3 = icmp ule <8 x i32> %induction, %broadcast.splat2
				%tmp3 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %tmp)

	%tmp4 = bitcast i16* %tmp2 to <8 x i16>*			%tmp4 = bitcast i16* %tmp2 to <8 x i16>*
	%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)			%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %tmp4, i32 4, <8 x i1> %tmp3, <8 x i16> undef)
	%tmp5 = add <8 x i16> %vec.phi, %broadcast.splat4			%tmp5 = add <8 x i16> %vec.phi, %broadcast.splat4
	%tmp6 = add <8 x i16> %tmp5, %wide.masked.load			%tmp6 = add <8 x i16> %tmp5, %wide.masked.load
	%index.next = add nuw nsw i32 %index, 8			%index.next = add nuw nsw i32 %index, 8
	%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)			%4 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %3, i32 1)
	%5 = icmp ne i32 %4, 0			%5 = icmp ne i32 %4, 0
	br i1 %5, label %vector.body, label %middle.block			br i1 %5, label %vector.body, label %middle.block

	middle.block: ; preds = %vector.body			middle.block: ; preds = %vector.body
	%tmp8 = select <8 x i1> %tmp3, <8 x i16> %tmp6, <8 x i16> %vec.phi			%tmp8 = select <8 x i1> %tmp3, <8 x i16> %tmp6, <8 x i16> %vec.phi
	%rdx.shuf = shufflevector <8 x i16> %tmp8, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			%rdx.shuf = shufflevector <8 x i16> %tmp8, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	%bin.rdx = add <8 x i16> %rdx.shuf, %tmp8			%bin.rdx = add <8 x i16> %rdx.shuf, %tmp8
	%rdx.shuf5 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%rdx.shuf5 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%bin.rdx6 = add <8 x i16> %rdx.shuf5, %bin.rdx			%bin.rdx6 = add <8 x i16> %rdx.shuf5, %bin.rdx
	%rdx.shuf7 = shufflevector <8 x i16> %bin.rdx6, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%rdx.shuf7 = shufflevector <8 x i16> %bin.rdx6, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%bin.rdx8 = add <8 x i16> %rdx.shuf7, %bin.rdx6			%bin.rdx8 = add <8 x i16> %rdx.shuf7, %bin.rdx6
	%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0			%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0
	ret i16 %tmp9			ret i16 %tmp9
	}			}

	declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)			declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
	declare void @llvm.set.loop.iterations.i32(i32)			declare void @llvm.set.loop.iterations.i32(i32)
	declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)			declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
				declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/vector-arith-codegen.ll

Show All 9 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r3, r2, #3		; CHECK-NEXT: adds r3, r2, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: bic r3, r3, #3		; CHECK-NEXT: bic r3, r3, #3
; CHECK-NEXT: sub.w r12, r3, #4		; CHECK-NEXT: sub.w r12, r3, #4
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: add.w lr, r3, r12, lsr #2		; CHECK-NEXT: add.w lr, r3, r12, lsr #2
; CHECK-NEXT: lsr.w r3, r12, #2		; CHECK-NEXT: movs r3, #0
; CHECK-NEXT: sub.w r3, r2, r3, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB0_1: @ %vector.body		; CHECK-NEXT: .LBB0_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpstt		; CHECK-NEXT: vpstt
; CHECK-NEXT: vldrwt.u32 q0, [r0], #16		; CHECK-NEXT: vldrwt.u32 q0, [r0], #16
; CHECK-NEXT: vldrwt.u32 q2, [r1], #16		; CHECK-NEXT: vldrwt.u32 q2, [r1], #16
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: adds r3, #4
; CHECK-NEXT: vmul.i32 q0, q2, q0		; CHECK-NEXT: vmul.i32 q0, q2, q0
		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vadd.i32 q0, q0, q1		; CHECK-NEXT: vadd.i32 q0, q0, q1
; CHECK-NEXT: le lr, .LBB0_1		; CHECK-NEXT: le lr, .LBB0_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r3
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp8 = icmp eq i32 %N, 0		%cmp8 = icmp eq i32 %N, 0
br i1 %cmp8, label %for.cond.cleanup, label %vector.ph		br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %6, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %6, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i32, i32* %a, i32 %index		%0 = getelementptr inbounds i32, i32* %a, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat12
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i32* %0 to <4 x i32>*		%2 = bitcast i32* %0 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
%3 = getelementptr inbounds i32, i32* %b, i32 %index		%3 = getelementptr inbounds i32, i32* %b, i32 %index
%4 = bitcast i32* %3 to <4 x i32>*		%4 = bitcast i32* %3 to <4 x i32>*
%wide.masked.load13 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %4, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load13 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %4, i32 4, <4 x i1> %1, <4 x i32> undef)
%5 = mul nsw <4 x i32> %wide.masked.load13, %wide.masked.load		%5 = mul nsw <4 x i32> %wide.masked.load13, %wide.masked.load
%6 = add nsw <4 x i32> %5, %vec.phi		%6 = add nsw <4 x i32> %5, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
Show All 19 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r1, r2, #3		; CHECK-NEXT: adds r1, r2, #3
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: bic r1, r1, #3		; CHECK-NEXT: bic r1, r1, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: subs r1, #4		; CHECK-NEXT: subs r1, #4
; CHECK-NEXT: add.w lr, r3, r1, lsr #2		; CHECK-NEXT: add.w lr, r3, r1, lsr #2
; CHECK-NEXT: lsrs r1, r1, #2		; CHECK-NEXT: movs r1, #0
; CHECK-NEXT: sub.w r1, r2, r1, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB1_1: @ %vector.body		; CHECK-NEXT: .LBB1_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q0, [r0], #16		; CHECK-NEXT: vldrwt.u32 q0, [r0], #16
		; CHECK-NEXT: adds r1, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vadd.i32 q0, q0, q1		; CHECK-NEXT: vadd.i32 q0, q0, q1
; CHECK-NEXT: le lr, .LBB1_1		; CHECK-NEXT: le lr, .LBB1_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r1
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp6 = icmp eq i32 %N, 0		%cmp6 = icmp eq i32 %N, 0
br i1 %cmp6, label %for.cond.cleanup, label %vector.ph		br i1 %cmp6, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert9 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert9 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat10 = shufflevector <4 x i32> %broadcast.splatinsert9, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat10 = shufflevector <4 x i32> %broadcast.splatinsert9, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i32, i32* %a, i32 %index		%0 = getelementptr inbounds i32, i32* %a, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat10
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat10
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i32* %0 to <4 x i32>*		%2 = bitcast i32* %0 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
%3 = add nsw <4 x i32> %wide.masked.load, %vec.phi		%3 = add nsw <4 x i32> %wide.masked.load, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%4 = icmp eq i32 %index.next, %n.vec		%4 = icmp eq i32 %index.next, %n.vec
br i1 %4, label %middle.block, label %vector.body		br i1 %4, label %middle.block, label %vector.body

middle.block: ; preds = %vector.body		middle.block: ; preds = %vector.body
Show All 15 Lines
; CHECK-NEXT: bxeq lr		; CHECK-NEXT: bxeq lr
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: adds r1, r2, #3		; CHECK-NEXT: adds r1, r2, #3
; CHECK-NEXT: movs r3, #1		; CHECK-NEXT: movs r3, #1
; CHECK-NEXT: bic r1, r1, #3		; CHECK-NEXT: bic r1, r1, #3
; CHECK-NEXT: vmov.i32 q0, #0x0		; CHECK-NEXT: vmov.i32 q0, #0x0
; CHECK-NEXT: subs r1, #4		; CHECK-NEXT: subs r1, #4
; CHECK-NEXT: add.w lr, r3, r1, lsr #2		; CHECK-NEXT: add.w lr, r3, r1, lsr #2
; CHECK-NEXT: lsrs r1, r1, #2		; CHECK-NEXT: movs r1, #0
; CHECK-NEXT: sub.w r1, r2, r1, lsl #2
; CHECK-NEXT: dls lr, lr		; CHECK-NEXT: dls lr, lr
; CHECK-NEXT: .LBB2_1: @ %vector.body		; CHECK-NEXT: .LBB2_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vctp.32 r2		; CHECK-NEXT: vctp.32 r2
; CHECK-NEXT: vmov q1, q0		; CHECK-NEXT: vmov q1, q0
; CHECK-NEXT: vpst		; CHECK-NEXT: vpst
; CHECK-NEXT: vldrwt.u32 q0, [r0], #16		; CHECK-NEXT: vldrwt.u32 q0, [r0], #16
		; CHECK-NEXT: adds r1, #4
; CHECK-NEXT: subs r2, #4		; CHECK-NEXT: subs r2, #4
; CHECK-NEXT: vadd.i32 q0, q0, q1		; CHECK-NEXT: vadd.i32 q0, q0, q1
; CHECK-NEXT: le lr, .LBB2_1		; CHECK-NEXT: le lr, .LBB2_1
; CHECK-NEXT: @ %bb.2: @ %middle.block		; CHECK-NEXT: @ %bb.2: @ %middle.block
; CHECK-NEXT: vctp.32 r1
; CHECK-NEXT: vpsel q0, q0, q1		; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: vaddv.u32 r0, q0		; CHECK-NEXT: vaddv.u32 r0, q0
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp6 = icmp eq i32 %N, 0		%cmp6 = icmp eq i32 %N, 0
br i1 %cmp6, label %for.cond.cleanup, label %vector.ph		br i1 %cmp6, label %for.cond.cleanup, label %vector.ph

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %N, 3		%n.rnd.up = add i32 %N, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %N, -1		%trip.count.minus.1 = add i32 %N, -1
%broadcast.splatinsert9 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert9 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat10 = shufflevector <4 x i32> %broadcast.splatinsert9, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat10 = shufflevector <4 x i32> %broadcast.splatinsert9, <4 x i32> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]		%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i32, i32* %a, i32 %index		%0 = getelementptr inbounds i32, i32* %a, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat10
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat10
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i32* %0 to <4 x i32>*		%2 = bitcast i32* %0 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
%3 = add nsw <4 x i32> %wide.masked.load, %vec.phi		%3 = add nsw <4 x i32> %wide.masked.load, %vec.phi
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%4 = icmp eq i32 %index.next, %n.vec		%4 = icmp eq i32 %index.next, %n.vec
br i1 %4, label %middle.block, label %vector.body		br i1 %4, label %middle.block, label %vector.body

middle.block: ; preds = %vector.body		middle.block: ; preds = %vector.body
%5 = select <4 x i1> %1, <4 x i32> %3, <4 x i32> %vec.phi		%5 = select <4 x i1> %1, <4 x i32> %3, <4 x i32> %vec.phi
%6 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %5)		%6 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %5)
br label %for.cond.cleanup		br label %for.cond.cleanup

for.cond.cleanup: ; preds = %middle.block, %entry		for.cond.cleanup: ; preds = %middle.block, %entry
%res.0.lcssa = phi i32 [ 0, %entry ], [ %6, %middle.block ]		%res.0.lcssa = phi i32 [ 0, %entry ], [ %6, %middle.block ]
ret i32 %res.0.lcssa		ret i32 %res.0.lcssa
}		}

define dso_local void @vector_mul_const(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c, i32 %N) {		define dso_local void @vector_mul_const(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c, i32 %N) {
; CHECK-LABEL: vector_mul_const:		; CHECK-LABEL: vector_mul_const:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB3_1: @ %vector.body		; CHECK-NEXT: .LBB3_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r1], #16
; CHECK-NEXT: vmul.i32 q0, q0, r2		; CHECK-NEXT: vmul.i32 q0, q0, r2
; CHECK-NEXT: vstrw.32 q0, [r0], #16		; CHECK-NEXT: vstrw.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB3_1		; CHECK-NEXT: letp lr, .LBB3_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp6 = icmp eq i32 %N, 0		%cmp6 = icmp eq i32 %N, 0
Show All 10 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i32, i32* %b, i32 %index		%0 = getelementptr inbounds i32, i32* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i32* %0 to <4 x i32>*		%2 = bitcast i32* %0 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
%3 = mul nsw <4 x i32> %wide.masked.load, %broadcast.splat11		%3 = mul nsw <4 x i32> %wide.masked.load, %broadcast.splat11
%4 = getelementptr inbounds i32, i32* %a, i32 %index		%4 = getelementptr inbounds i32, i32* %a, i32 %index
%5 = bitcast i32* %4 to <4 x i32>*		%5 = bitcast i32* %4 to <4 x i32>*
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %3, <4 x i32>* %5, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %3, <4 x i32>* %5, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%6 = icmp eq i32 %index.next, %n.vec		%6 = icmp eq i32 %index.next, %n.vec
br i1 %6, label %for.cond.cleanup, label %vector.body		br i1 %6, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define dso_local void @vector_add_const(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c, i32 %N) {		define dso_local void @vector_add_const(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c, i32 %N) {
; CHECK-LABEL: vector_add_const:		; CHECK-LABEL: vector_add_const:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB4_1: @ %vector.body		; CHECK-NEXT: .LBB4_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r1], #16
; CHECK-NEXT: vadd.i32 q0, q0, r2		; CHECK-NEXT: vadd.i32 q0, q0, r2
; CHECK-NEXT: vstrw.32 q0, [r0], #16		; CHECK-NEXT: vstrw.32 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB4_1		; CHECK-NEXT: letp lr, .LBB4_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
%cmp6 = icmp eq i32 %N, 0		%cmp6 = icmp eq i32 %N, 0
Show All 10 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds i32, i32* %b, i32 %index		%0 = getelementptr inbounds i32, i32* %b, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat9
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i32* %0 to <4 x i32>*		%2 = bitcast i32* %0 to <4 x i32>*
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %2, i32 4, <4 x i1> %1, <4 x i32> undef)
%3 = add nsw <4 x i32> %wide.masked.load, %broadcast.splat11		%3 = add nsw <4 x i32> %wide.masked.load, %broadcast.splat11
%4 = getelementptr inbounds i32, i32* %a, i32 %index		%4 = getelementptr inbounds i32, i32* %a, i32 %index
%5 = bitcast i32* %4 to <4 x i32>*		%5 = bitcast i32* %4 to <4 x i32>*
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %3, <4 x i32>* %5, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %3, <4 x i32>* %5, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%6 = icmp eq i32 %index.next, %n.vec		%6 = icmp eq i32 %index.next, %n.vec
br i1 %6, label %for.cond.cleanup, label %vector.body		br i1 %6, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define dso_local arm_aapcs_vfpcc void @vector_mul_vector_i8(i8* noalias nocapture %a, i8* noalias nocapture readonly %b, i8* noalias nocapture readonly %c, i32 %N) {		define dso_local arm_aapcs_vfpcc void @vector_mul_vector_i8(i8* noalias nocapture %a, i8* noalias nocapture readonly %b, i8* noalias nocapture readonly %c, i32 %N) {
; CHECK-LABEL: vector_mul_vector_i8:		; CHECK-LABEL: vector_mul_vector_i8:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.8 lr, r3		; CHECK-NEXT: dlstp.8 lr, r3
; CHECK-NEXT: .LBB5_1: @ %vector.body		; CHECK-NEXT: .LBB5_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #16
; CHECK-NEXT: vldrb.u8 q0, [r1], #16		; CHECK-NEXT: vldrb.u8 q0, [r1], #16
; CHECK-NEXT: vldrb.u8 q1, [r2], #16		; CHECK-NEXT: vldrb.u8 q1, [r2], #16
; CHECK-NEXT: vmul.i8 q0, q1, q0		; CHECK-NEXT: vmul.i8 q0, q1, q0
; CHECK-NEXT: vstrb.8 q0, [r0], #16		; CHECK-NEXT: vstrb.8 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB5_1		; CHECK-NEXT: letp lr, .LBB5_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <16 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x i32> %broadcast.splatinsert, <16 x i32> undef, <16 x i32> zeroinitializer
%induction = add <16 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%induction = add <16 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%0 = getelementptr inbounds i8, i8* %b, i32 %index		%0 = getelementptr inbounds i8, i8* %b, i32 %index
%1 = icmp ule <16 x i32> %induction, %broadcast.splat13
		; %1 = icmp ule <16 x i32> %induction, %broadcast.splat13
		%1 = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i8* %0 to <16 x i8>*		%2 = bitcast i8* %0 to <16 x i8>*
%wide.masked.load = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %2, i32 1, <16 x i1> %1, <16 x i8> undef)		%wide.masked.load = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %2, i32 1, <16 x i1> %1, <16 x i8> undef)
%3 = getelementptr inbounds i8, i8* %c, i32 %index		%3 = getelementptr inbounds i8, i8* %c, i32 %index
%4 = bitcast i8* %3 to <16 x i8>*		%4 = bitcast i8* %3 to <16 x i8>*
%wide.masked.load14 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %4, i32 1, <16 x i1> %1, <16 x i8> undef)		%wide.masked.load14 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %4, i32 1, <16 x i1> %1, <16 x i8> undef)
%5 = mul <16 x i8> %wide.masked.load14, %wide.masked.load		%5 = mul <16 x i8> %wide.masked.load14, %wide.masked.load
%6 = getelementptr inbounds i8, i8* %a, i32 %index		%6 = getelementptr inbounds i8, i8* %a, i32 %index
%7 = bitcast i8* %6 to <16 x i8>*		%7 = bitcast i8* %6 to <16 x i8>*
Show All 9 Lines
; Function Attrs: nofree norecurse nounwind		; Function Attrs: nofree norecurse nounwind
define dso_local arm_aapcs_vfpcc void @vector_mul_vector_i16(i16* noalias nocapture %a, i16* noalias nocapture readonly %b, i16* noalias nocapture readonly %c, i32 %N) local_unnamed_addr #0 {		define dso_local arm_aapcs_vfpcc void @vector_mul_vector_i16(i16* noalias nocapture %a, i16* noalias nocapture readonly %b, i16* noalias nocapture readonly %c, i32 %N) local_unnamed_addr #0 {
; CHECK-LABEL: vector_mul_vector_i16:		; CHECK-LABEL: vector_mul_vector_i16:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #0		; CHECK-NEXT: cmp r3, #0
; CHECK-NEXT: it eq		; CHECK-NEXT: it eq
; CHECK-NEXT: popeq {r7, pc}		; CHECK-NEXT: popeq {r7, pc}
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.16 lr, r3		; CHECK-NEXT: dlstp.16 lr, r3
; CHECK-NEXT: .LBB6_1: @ %vector.body		; CHECK-NEXT: .LBB6_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #8
; CHECK-NEXT: vldrh.u16 q0, [r1], #16		; CHECK-NEXT: vldrh.u16 q0, [r1], #16
; CHECK-NEXT: vldrh.u16 q1, [r2], #16		; CHECK-NEXT: vldrh.u16 q1, [r2], #16
; CHECK-NEXT: vmul.i16 q0, q1, q0		; CHECK-NEXT: vmul.i16 q0, q1, q0
; CHECK-NEXT: vstrh.16 q0, [r0], #16		; CHECK-NEXT: vstrh.16 q0, [r0], #16
; CHECK-NEXT: letp lr, .LBB6_1		; CHECK-NEXT: letp lr, .LBB6_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 9 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <8 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer		%broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> undef, <8 x i32> zeroinitializer
%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%induction = add <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%0 = getelementptr inbounds i16, i16* %b, i32 %index		%0 = getelementptr inbounds i16, i16* %b, i32 %index
%1 = icmp ule <8 x i32> %induction, %broadcast.splat13
		; %1 = icmp ule <8 x i32> %induction, %broadcast.splat13
		%1 = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast i16* %0 to <8 x i16>*		%2 = bitcast i16* %0 to <8 x i16>*
%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %2, i32 2, <8 x i1> %1, <8 x i16> undef)		%wide.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %2, i32 2, <8 x i1> %1, <8 x i16> undef)
%3 = getelementptr inbounds i16, i16* %c, i32 %index		%3 = getelementptr inbounds i16, i16* %c, i32 %index
%4 = bitcast i16* %3 to <8 x i16>*		%4 = bitcast i16* %3 to <8 x i16>*
%wide.masked.load14 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %4, i32 2, <8 x i1> %1, <8 x i16> undef)		%wide.masked.load14 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* %4, i32 2, <8 x i1> %1, <8 x i16> undef)
%5 = mul <8 x i16> %wide.masked.load14, %wide.masked.load		%5 = mul <8 x i16> %wide.masked.load14, %wide.masked.load
%6 = getelementptr inbounds i16, i16* %a, i32 %index		%6 = getelementptr inbounds i16, i16* %a, i32 %index
%7 = bitcast i16* %6 to <8 x i16>*		%7 = bitcast i16* %6 to <8 x i16>*
call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %5, <8 x i16>* %7, i32 2, <8 x i1> %1)		call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> %5, <8 x i16>* %7, i32 2, <8 x i1> %1)
%index.next = add i32 %index, 8		%index.next = add i32 %index, 8
%8 = icmp eq i32 %index.next, %n.vec		%8 = icmp eq i32 %index.next, %n.vec
br i1 %8, label %for.cond.cleanup, label %vector.body		br i1 %8, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)		declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)
declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)		declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)		declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)
declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)		declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)		declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
		declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)
		declare <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/vector-reduce-mve-tail.ll


	; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - \| FileCheck %s			; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -disable-mve-tail-predication=false -mattr=+mve %s -S -o - \| FileCheck %s

	; CHECK-LABEL: vec_mul_reduce_add			; CHECK-LABEL: vec_mul_reduce_add

	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK: call void @llvm.set.loop.iterations.i32			; CHECK: call void @llvm.set.loop.iterations.i32
	; CHECK: [[UF:%[^ ]+]] = shl i32 %{{.*}}, 2
	; CHECK: [[REMAT_ITER:%[^ ]+]] = sub i32 %N, [[UF]]
	; CHECK: br label %vector.body			; CHECK: br label %vector.body

	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NOT: phi i32 [ 0, %vector.ph ]
	; CHECK: [[ELTS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[SUB:%[^ ]+]], %vector.body ]			; CHECK: [[ELTS:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[SUB:%[^ ]+]], %vector.body ]
	; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[ELTS]])			; CHECK: [[VCTP:%[^ ]+]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[ELTS]])
	; CHECK: [[SUB]] = sub i32 [[ELTS]], 4			; CHECK: [[SUB]] = sub i32 [[ELTS]], 4
	; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]]			; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]]
	; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]],			; CHECK: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{.*}}, i32 4, <4 x i1> [[VCTP]],

	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK: [[VCTP_CLONE:%[^ ]+]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[REMAT_ITER]])			; CHECK: [[VCTP_CLONE:%[^ ]+]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 [[REMAT_ITER:%.*]])
	; CHECK: [[VPSEL:%[^ ]+]] = select <4 x i1> [[VCTP_CLONE]],			; CHECK: [[VPSEL:%[^ ]+]] = select <4 x i1> [[VCTP_CLONE]],
	; CHECK: call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[VPSEL]])			; CHECK: call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[VPSEL]])

	define i32 @vec_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {			define i32 @vec_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32 %N) {
	entry:			entry:
	%cmp8 = icmp eq i32 %N, 0			%cmp8 = icmp eq i32 %N, 0
	%0 = add i32 %N, 3			%0 = add i32 %N, 3
	%1 = lshr i32 %0, 2			%1 = lshr i32 %0, 2
	%2 = shl nuw i32 %1, 2			%2 = shl nuw i32 %1, 2
	%3 = add i32 %2, -4			%3 = add i32 %2, -4
	%4 = lshr i32 %3, 2			%4 = lshr i32 %3, 2
	%5 = add nuw nsw i32 %4, 1			%5 = add nuw nsw i32 %4, 1
	br i1 %cmp8, label %for.cond.cleanup, label %vector.ph			br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%trip.count.minus.1 = add i32 %N, -1			%trip.count.minus.1 = add i32 %N, -1
	%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0			%broadcast.splatinsert11 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
	%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat12 = shufflevector <4 x i32> %broadcast.splatinsert11, <4 x i32> undef, <4 x i32> zeroinitializer
	call void @llvm.set.loop.iterations.i32(i32 %5)			call void @llvm.set.loop.iterations.i32(i32 %5)
	br label %vector.body			br label %vector.body

	vector.body: ; preds = %vector.body, %vector.ph			vector.body: ; preds = %vector.body, %vector.ph
	%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]			%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
	%lsr.iv2 = phi i32* [ %scevgep3, %vector.body ], [ %a, %vector.ph ]			%lsr.iv2 = phi i32* [ %scevgep3, %vector.body ], [ %a, %vector.ph ]
	%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %b, %vector.ph ]			%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %b, %vector.ph ]
	%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]			%vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]
	%6 = phi i32 [ %5, %vector.ph ], [ %10, %vector.body ]			%6 = phi i32 [ %5, %vector.ph ], [ %10, %vector.body ]
	%lsr.iv24 = bitcast i32* %lsr.iv2 to <4 x i32>*			%lsr.iv24 = bitcast i32* %lsr.iv2 to <4 x i32>*
	%lsr.iv1 = bitcast i32* %lsr.iv to <4 x i32>*			%lsr.iv1 = bitcast i32* %lsr.iv to <4 x i32>*
	%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0			%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
	%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer			%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
	%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>			%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
	%7 = icmp ule <4 x i32> %induction, %broadcast.splat12
				; %7 = icmp ule <4 x i32> %induction, %broadcast.splat12
				%7 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv24, i32 4, <4 x i1> %7, <4 x i32> undef)			%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv24, i32 4, <4 x i1> %7, <4 x i32> undef)
	%wide.masked.load13 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1, i32 4, <4 x i1> %7, <4 x i32> undef)			%wide.masked.load13 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1, i32 4, <4 x i1> %7, <4 x i32> undef)
	%8 = mul nsw <4 x i32> %wide.masked.load13, %wide.masked.load			%8 = mul nsw <4 x i32> %wide.masked.load13, %wide.masked.load
	%9 = add nsw <4 x i32> %8, %vec.phi			%9 = add nsw <4 x i32> %8, %vec.phi
	%index.next = add i32 %index, 4			%index.next = add i32 %index, 4
	%scevgep = getelementptr i32, i32* %lsr.iv, i32 4			%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
	%scevgep3 = getelementptr i32, i32* %lsr.iv2, i32 4			%scevgep3 = getelementptr i32, i32* %lsr.iv2, i32 4
	%10 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %6, i32 1)			%10 = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %6, i32 1)
	%11 = icmp ne i32 %10, 0			%11 = icmp ne i32 %10, 0
	br i1 %11, label %vector.body, label %middle.block			br i1 %11, label %vector.body, label %middle.block

	middle.block: ; preds = %vector.body			middle.block: ; preds = %vector.body
	%12 = icmp ule <4 x i32> %induction, %broadcast.splat12			; TODO: check that the intrinsic is also emitted here by the loop vectoriser
				; %12 = icmp ule <4 x i32> %induction, %broadcast.splat12
				%12 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

	%13 = select <4 x i1> %12, <4 x i32> %9, <4 x i32> %vec.phi			%13 = select <4 x i1> %12, <4 x i32> %9, <4 x i32> %vec.phi
	%14 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %13)			%14 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %13)
	br label %for.cond.cleanup			br label %for.cond.cleanup

	for.cond.cleanup: ; preds = %middle.block, %entry			for.cond.cleanup: ; preds = %middle.block, %entry
	%res.0.lcssa = phi i32 [ 0, %entry ], [ %14, %middle.block ]			%res.0.lcssa = phi i32 [ 0, %entry ], [ %14, %middle.block ]
	ret i32 %res.0.lcssa			ret i32 %res.0.lcssa
	}			}

	declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)			declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
	declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)			declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
	declare void @llvm.set.loop.iterations.i32(i32)			declare void @llvm.set.loop.iterations.i32(i32)
	declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)			declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32)
				declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/mve-fma-loops.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -disable-mve-tail-predication=false %s -o - \| FileCheck %s		; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -disable-mve-tail-predication=false %s -o - \| FileCheck %s

define arm_aapcs_vfpcc void @fmas1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fmas1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fmas1:		; CHECK-LABEL: fmas1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB0_1: @ %vector.body		; CHECK-NEXT: .LBB0_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r1], #16
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r0], #16
; CHECK-NEXT: vfmas.f32 q1, q0, r12		; CHECK-NEXT: vfmas.f32 q1, q0, r12
; CHECK-NEXT: vstrw.32 q1, [r2], #16		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB0_1		; CHECK-NEXT: letp lr, .LBB0_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %wide.masked.load12, <4 x float> %broadcast.splat14)		%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %wide.masked.load12, <4 x float> %broadcast.splat14)
%6 = getelementptr inbounds float, float* %z, i32 %index		%6 = getelementptr inbounds float, float* %z, i32 %index
%7 = bitcast float* %6 to <4 x float>*		%7 = bitcast float* %6 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %5, <4 x float>* %7, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %5, <4 x float>* %7, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%8 = icmp eq i32 %index.next, %n.vec		%8 = icmp eq i32 %index.next, %n.vec
br i1 %8, label %for.cond.cleanup, label %vector.body		br i1 %8, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fmas2(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fmas2(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fmas2:		; CHECK-LABEL: fmas2:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB1_1: @ %vector.body		; CHECK-NEXT: .LBB1_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r0], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r1], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vfmas.f32 q1, q0, r12		; CHECK-NEXT: vfmas.f32 q1, q0, r12
; CHECK-NEXT: vstrw.32 q1, [r2], #16		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB1_1		; CHECK-NEXT: letp lr, .LBB1_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fmul fast <4 x float> %wide.masked.load12, %wide.masked.load		%5 = fmul fast <4 x float> %wide.masked.load12, %wide.masked.load
%6 = fadd fast <4 x float> %5, %broadcast.splat14		%6 = fadd fast <4 x float> %5, %broadcast.splat14
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
%8 = bitcast float* %7 to <4 x float>*		%8 = bitcast float* %7 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%9 = icmp eq i32 %index.next, %n.vec		%9 = icmp eq i32 %index.next, %n.vec
br i1 %9, label %for.cond.cleanup, label %vector.body		br i1 %9, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fma1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fma1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fma1:		; CHECK-LABEL: fma1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB2_1: @ %vector.body		; CHECK-NEXT: .LBB2_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r0], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r1], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vfma.f32 q1, q0, r12		; CHECK-NEXT: vfma.f32 q1, q0, r12
; CHECK-NEXT: vstrw.32 q1, [r2], #16		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB2_1		; CHECK-NEXT: letp lr, .LBB2_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %broadcast.splat14, <4 x float> %wide.masked.load12)		%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %broadcast.splat14, <4 x float> %wide.masked.load12)
%6 = getelementptr inbounds float, float* %z, i32 %index		%6 = getelementptr inbounds float, float* %z, i32 %index
%7 = bitcast float* %6 to <4 x float>*		%7 = bitcast float* %6 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %5, <4 x float>* %7, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %5, <4 x float>* %7, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%8 = icmp eq i32 %index.next, %n.vec		%8 = icmp eq i32 %index.next, %n.vec
br i1 %8, label %for.cond.cleanup, label %vector.body		br i1 %8, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fma2(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fma2(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fma2:		; CHECK-LABEL: fma2:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB3_1: @ %vector.body		; CHECK-NEXT: .LBB3_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r0], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r1], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vfma.f32 q1, q0, r12		; CHECK-NEXT: vfma.f32 q1, q0, r12
; CHECK-NEXT: vstrw.32 q1, [r2], #16		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB3_1		; CHECK-NEXT: letp lr, .LBB3_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert12 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert12 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat13 = shufflevector <4 x float> %broadcast.splatinsert12, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat13 = shufflevector <4 x float> %broadcast.splatinsert12, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = fmul fast <4 x float> %wide.masked.load, %broadcast.splat13		%3 = fmul fast <4 x float> %wide.masked.load, %broadcast.splat13
%4 = getelementptr inbounds float, float* %y, i32 %index		%4 = getelementptr inbounds float, float* %y, i32 %index
%5 = bitcast float* %4 to <4 x float>*		%5 = bitcast float* %4 to <4 x float>*
%wide.masked.load14 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %5, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load14 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %5, i32 4, <4 x i1> %1, <4 x float> undef)
%6 = fadd fast <4 x float> %3, %wide.masked.load14		%6 = fadd fast <4 x float> %3, %wide.masked.load14
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
%8 = bitcast float* %7 to <4 x float>*		%8 = bitcast float* %7 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%9 = icmp eq i32 %index.next, %n.vec		%9 = icmp eq i32 %index.next, %n.vec
br i1 %9, label %for.cond.cleanup, label %vector.body		br i1 %9, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fmss1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fmss1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fmss1:		; CHECK-LABEL: fmss1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r4, s0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: eor r12, r12, #-2147483648		; CHECK-NEXT: eor r12, r4, #-2147483648
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: .LBB4_1: @ %vector.body		; CHECK-NEXT: .LBB4_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r1], #16
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r0], #16
; CHECK-NEXT: vfmas.f32 q1, q0, r12		; CHECK-NEXT: vfmas.f32 q1, q0, r12
; CHECK-NEXT: vstrw.32 q1, [r2], #16		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB4_1		; CHECK-NEXT: letp lr, .LBB4_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%fneg = fneg fast float %a		%fneg = fneg fast float %a
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %fneg, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %fneg, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %wide.masked.load12, <4 x float> %broadcast.splat14)		%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %wide.masked.load12, <4 x float> %broadcast.splat14)
%6 = getelementptr inbounds float, float* %z, i32 %index		%6 = getelementptr inbounds float, float* %z, i32 %index
%7 = bitcast float* %6 to <4 x float>*		%7 = bitcast float* %6 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %5, <4 x float>* %7, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %5, <4 x float>* %7, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%8 = icmp eq i32 %index.next, %n.vec		%8 = icmp eq i32 %index.next, %n.vec
br i1 %8, label %for.cond.cleanup, label %vector.body		br i1 %8, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fmss2(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fmss2(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fmss2:		; CHECK-LABEL: fmss2:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r4, s0
; CHECK-NEXT: vdup.32 q0, r12		; CHECK-NEXT: vdup.32 q0, r4
; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: vneg.f32 q0, q0		; CHECK-NEXT: vneg.f32 q0, q0
		; CHECK-NEXT: mov.w r12, #0
		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB5_1: @ %vector.body		; CHECK-NEXT: .LBB5_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vmov q3, q0		; CHECK-NEXT: vmov q3, q0
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r0], #16
; CHECK-NEXT: vldrw.u32 q2, [r1], #16		; CHECK-NEXT: vldrw.u32 q2, [r1], #16
; CHECK-NEXT: vfma.f32 q3, q2, q1		; CHECK-NEXT: vfma.f32 q3, q2, q1
; CHECK-NEXT: vstrw.32 q3, [r2], #16		; CHECK-NEXT: vstrw.32 q3, [r2], #16
; CHECK-NEXT: letp lr, .LBB5_1		; CHECK-NEXT: letp lr, .LBB5_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fmul fast <4 x float> %wide.masked.load12, %wide.masked.load		%5 = fmul fast <4 x float> %wide.masked.load12, %wide.masked.load
%6 = fsub fast <4 x float> %5, %broadcast.splat14		%6 = fsub fast <4 x float> %5, %broadcast.splat14
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
Show All 12 Lines
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r7, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r7, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
; CHECK-NEXT: vdup.32 q0, r12		; CHECK-NEXT: vdup.32 q0, r12
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB6_1: @ %vector.body		; CHECK-NEXT: .LBB6_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vmov q3, q0		; CHECK-NEXT: vmov q3, q0
; CHECK-NEXT: vldrw.u32 q1, [r1], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vldrw.u32 q2, [r0], #16		; CHECK-NEXT: vldrw.u32 q2, [r0], #16
; CHECK-NEXT: vfms.f32 q3, q2, q1		; CHECK-NEXT: vfms.f32 q3, q2, q1
; CHECK-NEXT: vstrw.32 q3, [r2], #16		; CHECK-NEXT: vstrw.32 q3, [r2], #16
; CHECK-NEXT: letp lr, .LBB6_1		; CHECK-NEXT: letp lr, .LBB6_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
Show All 12 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fneg fast <4 x float> %wide.masked.load12		%5 = fneg fast <4 x float> %wide.masked.load12
%6 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %5, <4 x float> %broadcast.splat14)		%6 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %5, <4 x float> %broadcast.splat14)
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
Show All 12 Lines
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r7, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r7, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
; CHECK-NEXT: vdup.32 q0, r12		; CHECK-NEXT: vdup.32 q0, r12
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB7_1: @ %vector.body		; CHECK-NEXT: .LBB7_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vmov q3, q0		; CHECK-NEXT: vmov q3, q0
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r0], #16
; CHECK-NEXT: vldrw.u32 q2, [r1], #16		; CHECK-NEXT: vldrw.u32 q2, [r1], #16
; CHECK-NEXT: vfms.f32 q3, q2, q1		; CHECK-NEXT: vfms.f32 q3, q2, q1
; CHECK-NEXT: vstrw.32 q3, [r2], #16		; CHECK-NEXT: vstrw.32 q3, [r2], #16
; CHECK-NEXT: letp lr, .LBB7_1		; CHECK-NEXT: letp lr, .LBB7_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
Show All 12 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fmul fast <4 x float> %wide.masked.load12, %wide.masked.load		%5 = fmul fast <4 x float> %wide.masked.load12, %wide.masked.load
%6 = fsub fast <4 x float> %broadcast.splat14, %5		%6 = fsub fast <4 x float> %broadcast.splat14, %5
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
%8 = bitcast float* %7 to <4 x float>*		%8 = bitcast float* %7 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%9 = icmp eq i32 %index.next, %n.vec		%9 = icmp eq i32 %index.next, %n.vec
br i1 %9, label %for.cond.cleanup, label %vector.body		br i1 %9, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fms1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fms1(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fms1:		; CHECK-LABEL: fms1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r4, s0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: eor r12, r12, #-2147483648		; CHECK-NEXT: eor r12, r4, #-2147483648
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: .LBB8_1: @ %vector.body		; CHECK-NEXT: .LBB8_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vldrw.u32 q0, [r0], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r1], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vfma.f32 q1, q0, r12		; CHECK-NEXT: vfma.f32 q1, q0, r12
; CHECK-NEXT: vstrw.32 q1, [r2], #16		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB8_1		; CHECK-NEXT: letp lr, .LBB8_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%fneg = fneg fast float %a		%fneg = fneg fast float %a
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %fneg, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %fneg, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %broadcast.splat14, <4 x float> %wide.masked.load12)		%5 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %broadcast.splat14, <4 x float> %wide.masked.load12)
%6 = getelementptr inbounds float, float* %z, i32 %index		%6 = getelementptr inbounds float, float* %z, i32 %index
%7 = bitcast float* %6 to <4 x float>*		%7 = bitcast float* %6 to <4 x float>*
Show All 11 Lines
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r7, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r7, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r7, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
; CHECK-NEXT: vdup.32 q0, r12		; CHECK-NEXT: vdup.32 q0, r12
		; CHECK-NEXT: mov.w r12, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB9_1: @ %vector.body		; CHECK-NEXT: .LBB9_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add.w r12, r12, #4
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r0], #16
; CHECK-NEXT: vldrw.u32 q2, [r1], #16		; CHECK-NEXT: vldrw.u32 q2, [r1], #16
; CHECK-NEXT: vfms.f32 q2, q1, q0		; CHECK-NEXT: vfms.f32 q2, q1, q0
; CHECK-NEXT: vstrw.32 q2, [r2], #16		; CHECK-NEXT: vstrw.32 q2, [r2], #16
; CHECK-NEXT: letp lr, .LBB9_1		; CHECK-NEXT: letp lr, .LBB9_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r7, pc}
entry:		entry:
Show All 11 Lines	vector.ph: ; preds = %entry
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fmul fast <4 x float> %wide.masked.load, %broadcast.splat14		%5 = fmul fast <4 x float> %wide.masked.load, %broadcast.splat14
%6 = fsub fast <4 x float> %wide.masked.load12, %5		%6 = fsub fast <4 x float> %wide.masked.load12, %5
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
%8 = bitcast float* %7 to <4 x float>*		%8 = bitcast float* %7 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%9 = icmp eq i32 %index.next, %n.vec		%9 = icmp eq i32 %index.next, %n.vec
br i1 %9, label %for.cond.cleanup, label %vector.body		br i1 %9, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fms3(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fms3(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fms3:		; CHECK-LABEL: fms3:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB10_1: @ %vector.body		; CHECK-NEXT: .LBB10_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vneg.f32 q0, q0		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vfma.f32 q0, q1, r12		; CHECK-NEXT: vneg.f32 q1, q1
; CHECK-NEXT: vstrw.32 q0, [r2], #16		; CHECK-NEXT: vfma.f32 q1, q0, r12
		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB10_1		; CHECK-NEXT: letp lr, .LBB10_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert13 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat14 = shufflevector <4 x float> %broadcast.splatinsert13, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = getelementptr inbounds float, float* %y, i32 %index		%3 = getelementptr inbounds float, float* %y, i32 %index
%4 = bitcast float* %3 to <4 x float>*		%4 = bitcast float* %3 to <4 x float>*
%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load12 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %4, i32 4, <4 x i1> %1, <4 x float> undef)
%5 = fneg fast <4 x float> %wide.masked.load12		%5 = fneg fast <4 x float> %wide.masked.load12
%6 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %broadcast.splat14, <4 x float> %5)		%6 = call fast <4 x float> @llvm.fma.v4f32(<4 x float> %wide.masked.load, <4 x float> %broadcast.splat14, <4 x float> %5)
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
%8 = bitcast float* %7 to <4 x float>*		%8 = bitcast float* %7 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%9 = icmp eq i32 %index.next, %n.vec		%9 = icmp eq i32 %index.next, %n.vec
br i1 %9, label %for.cond.cleanup, label %vector.body		br i1 %9, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

define arm_aapcs_vfpcc void @fms4(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {		define arm_aapcs_vfpcc void @fms4(float* nocapture readonly %x, float* nocapture readonly %y, float* noalias nocapture %z, float %a, i32 %n) {
; CHECK-LABEL: fms4:		; CHECK-LABEL: fms4:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: .save {r7, lr}		; CHECK-NEXT: .save {r4, lr}
; CHECK-NEXT: push {r7, lr}		; CHECK-NEXT: push {r4, lr}
; CHECK-NEXT: cmp r3, #1		; CHECK-NEXT: cmp r3, #1
; CHECK-NEXT: it lt		; CHECK-NEXT: it lt
; CHECK-NEXT: poplt {r7, pc}		; CHECK-NEXT: poplt {r4, pc}
; CHECK-NEXT: vmov r12, s0		; CHECK-NEXT: vmov r12, s0
		; CHECK-NEXT: movs r4, #0
; CHECK-NEXT: dlstp.32 lr, r3		; CHECK-NEXT: dlstp.32 lr, r3
; CHECK-NEXT: .LBB11_1: @ %vector.body		; CHECK-NEXT: .LBB11_1: @ %vector.body
; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1		; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
; CHECK-NEXT: vldrw.u32 q0, [r1], #16		; CHECK-NEXT: vldrw.u32 q0, [r0], #16
; CHECK-NEXT: vldrw.u32 q1, [r0], #16		; CHECK-NEXT: vldrw.u32 q1, [r1], #16
; CHECK-NEXT: vneg.f32 q0, q0		; CHECK-NEXT: adds r4, #4
; CHECK-NEXT: vfma.f32 q0, q1, r12		; CHECK-NEXT: vneg.f32 q1, q1
; CHECK-NEXT: vstrw.32 q0, [r2], #16		; CHECK-NEXT: vfma.f32 q1, q0, r12
		; CHECK-NEXT: vstrw.32 q1, [r2], #16
; CHECK-NEXT: letp lr, .LBB11_1		; CHECK-NEXT: letp lr, .LBB11_1
; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup		; CHECK-NEXT: @ %bb.2: @ %for.cond.cleanup
; CHECK-NEXT: pop {r7, pc}		; CHECK-NEXT: pop {r4, pc}
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %vector.ph, label %for.cond.cleanup		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

vector.ph: ; preds = %entry		vector.ph: ; preds = %entry
%n.rnd.up = add i32 %n, 3		%n.rnd.up = add i32 %n, 3
%n.vec = and i32 %n.rnd.up, -4		%n.vec = and i32 %n.rnd.up, -4
%trip.count.minus.1 = add i32 %n, -1		%trip.count.minus.1 = add i32 %n, -1
%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0		%broadcast.splatinsert10 = insertelement <4 x i32> undef, i32 %trip.count.minus.1, i32 0
%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat11 = shufflevector <4 x i32> %broadcast.splatinsert10, <4 x i32> undef, <4 x i32> zeroinitializer
%broadcast.splatinsert12 = insertelement <4 x float> undef, float %a, i32 0		%broadcast.splatinsert12 = insertelement <4 x float> undef, float %a, i32 0
%broadcast.splat13 = shufflevector <4 x float> %broadcast.splatinsert12, <4 x float> undef, <4 x i32> zeroinitializer		%broadcast.splat13 = shufflevector <4 x float> %broadcast.splatinsert12, <4 x float> undef, <4 x i32> zeroinitializer
br label %vector.body		br label %vector.body

vector.body: ; preds = %vector.body, %vector.ph		vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0		%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer		%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>		%induction = or <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = getelementptr inbounds float, float* %x, i32 %index		%0 = getelementptr inbounds float, float* %x, i32 %index
%1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		; %1 = icmp ule <4 x i32> %induction, %broadcast.splat11
		%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)

%2 = bitcast float* %0 to <4 x float>*		%2 = bitcast float* %0 to <4 x float>*
%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %2, i32 4, <4 x i1> %1, <4 x float> undef)
%3 = fmul fast <4 x float> %wide.masked.load, %broadcast.splat13		%3 = fmul fast <4 x float> %wide.masked.load, %broadcast.splat13
%4 = getelementptr inbounds float, float* %y, i32 %index		%4 = getelementptr inbounds float, float* %y, i32 %index
%5 = bitcast float* %4 to <4 x float>*		%5 = bitcast float* %4 to <4 x float>*
%wide.masked.load14 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %5, i32 4, <4 x i1> %1, <4 x float> undef)		%wide.masked.load14 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %5, i32 4, <4 x i1> %1, <4 x float> undef)
%6 = fsub fast <4 x float> %3, %wide.masked.load14		%6 = fsub fast <4 x float> %3, %wide.masked.load14
%7 = getelementptr inbounds float, float* %z, i32 %index		%7 = getelementptr inbounds float, float* %z, i32 %index
%8 = bitcast float* %7 to <4 x float>*		%8 = bitcast float* %7 to <4 x float>*
call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)		call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %6, <4 x float>* %8, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4		%index.next = add i32 %index, 4
%9 = icmp eq i32 %index.next, %n.vec		%9 = icmp eq i32 %index.next, %n.vec
br i1 %9, label %for.cond.cleanup, label %vector.body		br i1 %9, label %for.cond.cleanup, label %vector.body

for.cond.cleanup: ; preds = %vector.body, %entry		for.cond.cleanup: ; preds = %vector.body, %entry
ret void		ret void
}		}

declare <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>*, i32 immarg, <4 x i1>, <4 x float>)		declare <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>*, i32 immarg, <4 x i1>, <4 x float>)
declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)		declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)
declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32 immarg, <4 x i1>)
		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

This is an archive of the discontinued LLVM Phabricator instance.

[ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTCClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 269250

llvm/lib/Target/ARM/MVETailPredication.cpp

llvm/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/clear-maskedinsts.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/cond-vector-reduce-mve-codegen.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/extending-loads.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-tail-data-types.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/nested.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-widen.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/vector-arith-codegen.ll

llvm/test/CodeGen/Thumb2/LowOverheadLoops/vector-reduce-mve-tail.ll

llvm/test/CodeGen/Thumb2/mve-fma-loops.ll

[ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC
ClosedPublic