This is an archive of the discontinued LLVM Phabricator instance.

Differential D125301

[LoopVectorize] Add option to use active lane mask for loop control flow
Closed, Public

Authored by david-arm on May 10 2022, 3:40 AM.

Details

Reviewers

sdesmalen
kmclaughlin
CarolineConcatto
RosieSumpter
fhahn
dmgreen
paulwalker-arm
Ayal

Commits

rG03fee6712a39: [LoopVectorize] Add option to use active lane mask for loop control flow

Summary

Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

1. Generating the active lane mask for the next iteration of the
   vector loop, rather than the current one. If there are still any
   remaining iterations then at least the first bit of the mask will
   be set.
2. Extracting the first bit of this mask and using this bit for the
   conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
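
The loop shape described in the summary can be modelled in plain C++. This is only a minimal sketch, not the code the patch generates: `activeLaneMask` and `maskedMemset` are hypothetical names, the vector width `VF` is fixed, and an ordinary `std::vector<bool>` stands in for the predicate register.

```cpp
#include <cstdint>
#include <vector>

// Models llvm.get.active.lane.mask(Base, TripCount) for a fixed VF:
// lane I is active when Base + I < TripCount.
std::vector<bool> activeLaneMask(uint64_t Base, uint64_t TripCount,
                                 unsigned VF) {
  std::vector<bool> Mask(VF);
  for (unsigned I = 0; I < VF; ++I)
    Mask[I] = Base + I < TripCount;
  return Mask;
}

// Hypothetical tail-folded memset whose back edge is controlled by lane 0 of
// the mask computed for the *next* iteration. Returns the number of vector
// iterations that were executed.
unsigned maskedMemset(std::vector<int> &Dst, int Val, unsigned VF) {
  const uint64_t TripCount = Dst.size();
  // Simplification for this sketch: nothing to do for an empty range.
  if (TripCount == 0)
    return 0;

  uint64_t Index = 0;
  // Entry mask, set up before the loop (the incoming value of the mask PHI).
  std::vector<bool> Mask = activeLaneMask(0, TripCount, VF);
  unsigned Iterations = 0;
  do {
    // Masked store: only active lanes write.
    for (unsigned I = 0; I < VF; ++I)
      if (Mask[I])
        Dst[Index + I] = Val;
    Index += VF;
    // Mask for the next iteration; it is carried around the loop.
    Mask = activeLaneMask(Index, TripCount, VF);
    ++Iterations;
  } while (Mask[0]); // Branch on the first lane only; no scalar IV compare.
  return Iterations;
}
```

With 10 elements and `VF = 4` the sketch runs three vector iterations, the last one with only two active lanes.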

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. May 10 2022, 3:40 AM

Herald added a project: Restricted Project. May 10 2022, 3:40 AM

Herald added subscribers: rogfer01, zzheng, hiraditya.

david-arm requested review of this revision. May 10 2022, 3:40 AM

Herald added subscribers: llvm-commits, vkmr.

david-arm edited reviewers, added: dmgreen; removed: greened. May 10 2022, 3:41 AM

Harbormaster completed remote builds in B163662: Diff 428328. May 10 2022, 4:15 AM

sdesmalen added inline comments. May 25 2022, 1:38 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
536–541	I find this interface a bit cumbersome; can we make `emitGetActiveLaneMask` return an enum value that explains how the lane mask must be used, something like: `enum class ActiveLaneMask { None = 0, PredicateOnly, PredicateAndControlFlow };` (but then with possibly better names)
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8035–8038	This comment is no longer correct.
8044	Can you add a comment explaining what this does? From my understanding, it sets the predicate for the vectorized loop's header block (which would be the vectorized loop's main predicate) to the ActiveLaneMaskPhi cached in the Plan. Is that right?
8596	nit: s/UseLaneMaskForControlFlow/UseLaneMaskForLoopControlFlow/
llvm/lib/Transforms/Vectorize/VPlan.cpp
593–594	nit: s/CanonicalIVIncrementParts/CanonicalIVPIncrementForPart/ ?
605	nit: IV is probably not a suitable name, because it's the active lane mask, not the induction variable.
982	Is this extra scope necessary? There are no uses of IR builder after this scope, so I think you can just as well use the scope of the loop body for the InsertPointGuard.
llvm/lib/Transforms/Vectorize/VPlan.h
787	It would be nice to have a one-line description here of what this does and how it's different from CanonicalIVIncrement.

Would it be better for this to be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

In D125301#3536643, @dmgreen wrote:

Would it be better for this to be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

Hi @dmgreen, it's worth pointing out that this behaviour is toggled under a TTI interface so it is currently only enabled for SVE targets where this structure happens to be the most optimal. Any targets that rely upon the existing structure (e.g. MVE) remain unaffected.

In D125301#3536664, @david-arm wrote:

In D125301#3536643, @dmgreen wrote:

Would it be better for this to be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

Hi @dmgreen, it's worth pointing out that this behaviour is toggled under a TTI interface so it is currently only enabled for SVE targets where this structure happens to be the most optimal. Any targets that rely upon the existing structure (e.g. MVE) remain unaffected.

Sure - that's good, but it wasn't my point. My point was that loops have a structure to them that usually includes a latch block controlled by an induction variable. My worry is for SVE, if we are creating loops that do not have that structure. Do all the transforms such as LSR and anything that might happen in an LTO pipeline still behave optimally, or do they work less well because the loop is less analyzable? A simple example that comes up quite a lot is that during the compile step of LTO the loop is vectorized, then during the LTO step the trip count becomes a known constant. The loop should be fully unrolled and simplified away nicely. I think the simplification should still happen given the chance because the active lane mask will propagate constants, but would the unrolling happen if the trip count is not obvious any more? There may be any number of other similar transforms that could happen during an LTO phase.

I haven't looked at the details, it sounds like this is a good thing to do. But my gut would say don't mess with the structure of the loop in the vectorizer, do the transform later.

Changed TTI interface emitGetActiveLaneMask to return a new ActiveLaneMask enum class.
Added/moved comments in a few places.
Tidied up the ::execute function.

Hi @dmgreen, thanks for the explanation! The reason for doing this in LoopVectorize.cpp is that it feels much more natural to express it this way than writing a complicated pass just before lowering to transform it afterwards. There is no guarantee that such a transform would even succeed if subsequent passes change aspects of the loop. I believe this is actually the way it's done in GCC as well. From a performance perspective I've not observed any degradations in code quality or performance so far with this approach.

I think that in general LLVM is moving towards supporting predication in IR and this patch is aligned with that. You're right that it's possible we may occasionally have work to do in future looking for backedge counts, but I don't feel we should be afraid of that. The information is there if we need it, since it's inferred from the trip count passed to the active lane mask intrinsic.
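
As a rough illustration of why the backedge count stays recoverable from the trip count: for a tail-folded loop that advances by a fixed number of elements per iteration, the arithmetic is a rounded-up division. The sketch below is not how any LLVM analysis computes it; the function names are hypothetical, `Step` is assumed to be the fixed number of elements processed per vector iteration (VF times the interleave count), and `TripCount` is assumed to be non-zero.

```cpp
#include <cstdint>

// Hypothetical helper: number of vector iterations of a tail-folded loop
// that processes Step elements per iteration. Assumes Step > 0.
constexpr uint64_t vectorIterations(uint64_t TripCount, uint64_t Step) {
  return TripCount / Step + (TripCount % Step != 0 ? 1 : 0);
}

// Hypothetical helper: the back edge is taken one time fewer than the loop
// body executes. Assumes TripCount > 0.
constexpr uint64_t backedgeTakenCount(uint64_t TripCount, uint64_t Step) {
  return vectorIterations(TripCount, Step) - 1;
}
```

For a trip count of 10 and a step of 4 this gives three vector iterations and a backedge-taken count of 2.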

Harbormaster completed remote builds in B166281: Diff 432006. May 25 2022, 9:11 AM

Matt added a subscriber: Matt. May 25 2022, 2:05 PM

dtemirbulatov added a subscriber: dtemirbulatov. May 26 2022, 7:19 AM

paulwalker-arm added a subscriber: paulwalker-arm. Jun 6 2022, 5:31 AM

paulwalker-arm added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
163	Please say if I'm being picky, but to me `ActiveLaneMask` is a thing, as in, it's the active lane mask. Here I think you're asking the target to choose the style of predication. So something like the following reads better to me: `enum class PredicationStyle { None, Data, DataAndControlFlow };`
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
338–339	Will we do the correct thing if the vectoriser chooses fixed length vectorisation? Not sure what "correct" really means here other than not crash or regress performance? I just want to ensure we've considered the possibility.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8642	Given the new block is already `EB->appendRecipe`ing several instructions I don't see any value in breaking out the last instruction like this.
9086–9087	This idiom occurs in a few places. Is it worth adding something like `CM.useActiveLaneMaskForControlFlow()`? I think it makes sense for this to imply tail folding by mask, as why else would you be doing it?
llvm/lib/Transforms/Vectorize/VPlan.cpp
603	no active lanes?
984	Why is this here? and not where `StartMask` is assigned a couple of lines below.

Changed name of new enum to PredicationStyle.
Addressed other refactoring comments.
Added new test simple_memset_v4i32 to demonstrate that the new style of tail-folding works when choosing a fixed-width VF. This can happen because SVE is enabled, but the cost model may still decide to choose a fixed-width VF.
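
The interface shape discussed above can be sketched as follows. The enum matches the one named in the review; everything else is an assumption made for illustration: `MockTargetInfo` and its two flags are hypothetical stand-ins for a target's TTI implementation, and the free function only mirrors the idea behind the suggested `CM.useActiveLaneMaskForControlFlow()` helper.

```cpp
// Enum as named in the review.
enum class PredicationStyle { None, Data, DataAndControlFlow };

// Hypothetical stand-in for a target's TTI implementation.
struct MockTargetInfo {
  bool HasMaskedMemOps;        // assumption: target has masked loads/stores
  bool CheapMaskedLoopControl; // assumption: branching on the mask is cheap

  // Tells the vectorizer how the active lane mask should be used.
  PredicationStyle emitGetActiveLaneMask() const {
    if (CheapMaskedLoopControl)
      return PredicationStyle::DataAndControlFlow;
    if (HasMaskedMemOps)
      return PredicationStyle::Data;
    return PredicationStyle::None;
  }
};

// Using the mask for control flow only makes sense when the tail is folded
// by masking, so the helper implies that condition.
bool useActiveLaneMaskForControlFlow(bool FoldTailByMasking,
                                     const MockTargetInfo &TTI) {
  return FoldTailByMasking &&
         TTI.emitGetActiveLaneMask() == PredicationStyle::DataAndControlFlow;
}
```

A single query with three outcomes keeps targets that only want predicated memory operations on the existing loop structure.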

david-arm marked 6 inline comments as done. Jun 7 2022, 9:16 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
338–339	I think it does, yeah. I've added a new test @simple_memset_v4i32 in sve-tail-folding.ll to defend the case where SVE is enabled and we've chosen a fixed-width VF. The codegen side of things is safe because we have tests to defend lowering of get.active.lane.mask calls for fixed-width too.

Harbormaster completed remote builds in B168324: Diff 434840. Jun 7 2022, 10:08 AM

fhahn added inline comments. Jun 7 2022, 2:18 PM

llvm/lib/Transforms/Vectorize/VPlan.cpp
986	The start value should be created in the preheader block in VPlan instead of in the phi node here. There should also be no need for a TC argument?
llvm/lib/Transforms/Vectorize/VPlan.h
793	Can `BranchOnCond` be used instead of the dedicated `BranchOnActiveLaneMask`?
1880	TC would need to be an operand instead of just a member variable, otherwise things will break if something does `TC->replaceAllUsesWith()` outside.

fhahn added inline comments. Jun 7 2022, 2:18 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	It looks like the update for the phi doesn't depend on the phi, but only on trip count & main induction. I might be missing something, but is the phi actually needed or would it be possible to compute the active lane mask at the beginning of each iteration instead of at the end of the iteration?

david-arm marked an inline comment as done. Jun 8 2022, 1:08 AM

david-arm added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	Hi @fhahn, perhaps I've misunderstood your question here, but actually the point of this patch is to do the exact opposite of that, i.e. we don't want to generate the active lane mask for the current iteration - we want to create the mask for the next iteration. This requires a PHI to carry the live value around the loop and is the only way to use the mask for control flow because at the point of branching we want to know if there are any active lanes in the mask for the next iteration. With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on. Effectively, the whilelo instruction is doing the comparison already, which makes the traditional scalar IV comparison redundant. I could be wrong, but I believe this form of vectorised loop would be beneficial for some other targets with a predicated instruction set too, such as RISC-V.
llvm/lib/Transforms/Vectorize/VPlan.h
793	I created this patch a month ago, which predated your BranchOnCond work. That's why I haven't used it. I can certainly look into this and see if it's possible though?

david-arm added inline comments. Jun 10 2022, 12:51 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
986	Hi @fhahn, I'm not really sure what you mean here. Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block? If so, then VPInstruction::ActiveLaneMask needs a trip count operand. Or do you mean I should deviate from the expected recipe creation somehow in a VPlan function such as prepareToExecute by manually adding the incoming value?

fhahn added inline comments. Jun 13 2022, 1:05 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	> With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on.
	Oh right, I missed this in the test case I was looking at! I originally thought the intention of the new phi recipe was to encode some extra guarantees/information like we do for inductions or reductions. From the latest comment, it sounds like there would be no need to have a new recipe class for the phi, but perhaps VPWidenPHIRecipe could be used instead, if the setup can be moved to the pre-header.
llvm/lib/Transforms/Vectorize/VPlan.cpp
986	> Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block?
	Yes exactly.
	> If so, then VPInstruction::ActiveLaneMask needs a trip count operand.
	Looking at `VPInstruction::generateInstruction`, it looks like it already takes a trip count operand? The first operand can be set to zero by wrapping a zero constant in a VPValue.
	case VPInstruction::ActiveLaneMask: {
	  // Get first lane of vector induction variable.
	  Value *VIVElem0 = State.get(getOperand(0), VPIteration(Part, 0));
	  // Get the original loop tripcount.
	  Value *ScalarTC = State.get(getOperand(1), Part);

david-arm added inline comments. Jun 13 2022, 1:12 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:
	void VPWidenPHIRecipe::execute(VPTransformState &State) {
	  State.ILV->widenPHIInstruction(cast<PHINode>(getUnderlyingValue()), this, State);
	}
	but the PHI node does not exist in the original scalar loop. It's not currently possible to widen a PHI that didn't previously exist, which means I would have to modify VPWidenPHIRecipe to test for the existence of an underlying value and take different paths accordingly.

Removed the creation of the initial lane mask values from ActiveLaneMaskPHIRecipe::execute. We now create the appropriate set of VPInstructions in the vplan preheader block and feed these into the ActiveLaneMaskPHI recipe.

david-arm marked 6 inline comments as done. Jun 14 2022, 4:14 AM

david-arm added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.h
793	So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type. It's possible to do this, but then I wasn't sure about the semantics of this new instruction. When passing in a scalar constant of 0 for the lane, it gets widened to something like <vscale x 4 x i32> zeroinitializer for every part. However, I only need a single lane so I'd have to do something like:
	case VPInstruction::ExtractElement: {
	  Value *Vec = State.get(getOperand(0), Part);
	  Value *Lane = State.get(getOperand(1), VPIteration(0, 0));
	  Value *V = Builder.CreateExtractElement(Vec, Lane);
	  State.set(this, V, Part);
	  break;
	}
	It feels quite inefficient to go to all the effort of widening, only to discard everything! If you still prefer me to proceed with this approach I'm happy to try if you can provide your thoughts on what the new ExtractElement operation should look like?

Harbormaster completed remote builds in B169684: Diff 436733. Jun 14 2022, 4:55 AM

fhahn added inline comments. Jun 19 2022, 2:21 PM

llvm/lib/Transforms/Vectorize/VPlan.h
793	> So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type.
	Extracts are not modeled explicitly at the moment and usually `State.get` will take care of inserting an extract when requesting scalar lanes if it is needed. I think when using a `VPInstruction::not` as operand for `BranchOnCond`, `State.get` should insert the extract for the first lane, as this is what `BranchOnCond` uses.

fhahn added inline comments. Jun 19 2022, 2:27 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	> So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:
	Yeah this was a bit unfortunate! I landed 2 patches that removed the unnecessary dependence on the underlying instruction by instead using the type of its operand.

david-arm added inline comments. Jun 20 2022, 12:56 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	OK well thanks a lot for tidying that class up! The only problem is that even after your patches landed VPWidenPHIRecipe::execute only ever deals with Part 0, whereas we need a Phi for every Part. So unfortunately I still can't use the class in its current state. If you have suggestions about how to fix this I'm happy to take another look? Perhaps I can add a boolean flag to the VPWidenPHIRecipe constructor to indicate how many parts to generate?
llvm/lib/Transforms/Vectorize/VPlan.h
793	OK I'll take another look and give this a try!

Removed BranchOnActiveLaneMask and now use BranchOnCond instead

david-arm edited the summary of this revision. Jun 20 2022, 5:15 AM

david-arm marked an inline comment as done.

Harbormaster completed remote builds in B170827: Diff 438346. Jun 20 2022, 6:27 AM

Out of interest, is the test output produced manually or via an update script? I ask because I cannot see any of the normal headers and am wondering why.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8591–8593	This comment is now out of date. Perhaps it's best to replace it with something more generic and have the implementation specific part nearer the code?
8632–8633	Please ignore if too awkward but I think the output IR will be more readable if we have tighter control over the name used for `EntryALM` and `ALM`. Currently the output is just littered with `active.lane.mask###` making it harder to understand the data flow. If the entry masks were called `active.lane.mask.entry###`, next iteration masks called `active.lane.mask.next###` and the PHIs keeping their current `active.lane.mask###` name, the loops would be easier to read.
8656–8657	This seems a bit wasteful given it's only the first lane we care about. Perhaps `VPInstruction::BranchOnCount` could take an `invert` flag that switches the branch destinations?
llvm/lib/Transforms/Vectorize/VPlan.cpp
578–580	I know it'll get folded away by later passes, but would it be wrong to handle the `Part == 0` case here, so that we don't litter the loop vectorize tests with pointless adds of zero? I see `CanonicalIVIncrement` doing similar for the non zero parts so I'll take that as precedent for doing such common sense early optimisation.
llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions-tf.ll
9	Given the structure has changed, is it worth showing how the entry active lane mask is constructed?
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
3–4	What's this protecting?

Fixed some incorrect comments.
Added more descriptive names for ActiveLaneMask VPInstructions.
Minor improvement in code quality for some canonical IV increments.
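
The per-part bookkeeping discussed in this round (one mask per unroll part, and no pointless add of zero for part 0) can be sketched as below. This is a model under stated assumptions rather than the VPlan code: `indexForPart` and `masksForAllParts` are hypothetical names, and `VF` is a fixed vector width, whereas a scalable vector would scale the step by vscale.

```cpp
#include <cstdint>
#include <vector>

// Start index handled by a given unroll part. Part 0 returns Index unchanged,
// so no "add 0" is ever materialised for it.
uint64_t indexForPart(uint64_t Index, unsigned Part, unsigned VF) {
  if (Part == 0)
    return Index;
  return Index + static_cast<uint64_t>(Part) * VF;
}

// One lane mask per part: lane I of part P is active when
// indexForPart(Index, P, VF) + I < TripCount.
std::vector<std::vector<bool>> masksForAllParts(uint64_t Index,
                                                uint64_t TripCount,
                                                unsigned VF, unsigned UF) {
  std::vector<std::vector<bool>> Masks(UF, std::vector<bool>(VF));
  for (unsigned Part = 0; Part < UF; ++Part) {
    const uint64_t Base = indexForPart(Index, Part, VF);
    for (unsigned I = 0; I < VF; ++I)
      Masks[Part][I] = Base + I < TripCount;
  }
  return Masks;
}
```

With a trip count of 6, `VF = 4` and two parts, part 0 is fully active and part 1 has only its first two lanes active.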

Harbormaster completed remote builds in B172693: Diff 440927. Jun 29 2022, 4:03 AM

fhahn added inline comments. Jun 29 2022, 4:04 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	Hm, let me double check!

In D125301#3616224, @paulwalker-arm wrote:

Out of interest, is the test output produced manually or via an update script? I ask because I cannot see any of the normal headers and am wondering why.

So we've been using the update_test_checks.py to generate the CHECK lines, but then manually deleting CHECK lines after the middle.block since they were unnecessary for the tests. That reduced the amount of noise, but it does mean we can't really have the NOTE at the top of the tests to say "CHECK lines generated by update_test_checks.py".

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8656–8657	It's not as trivial as you'd think passing in flags to VPInstructions. :( You have to create an IR Constant, pass it in as a 3rd VPValue argument to a VPInstruction and mark it as IR "live value in" operand. We do end up with nice code after InstCombine, since it swaps the labels around for us anyway. If you really prefer me to go down this route I can - it's just a little awkward that's all.

A few comment related issues but otherwise looks good to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8606	remove the "it"?
8656–8657	No need if this is how the interface works. Perhaps VPInstruction could do with some domain specific flags, which might also simplify the NUW split. Either way it's not directly relevant to this patch.
llvm/lib/Transforms/Vectorize/VPlan.cpp
580–581	This looks to be copied from `case VPInstruction::CanonicalIVIncrement` but it's not really correct within the context of this opcode.
llvm/lib/Transforms/Vectorize/VPlan.h
2519	Could do with a comment.

This revision is now accepted and ready to land. Jun 30 2022, 5:58 AM

fhahn added a reviewer: Ayal. Jun 30 2022, 7:35 AM

fhahn added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
536–541	Looks like this comment needs updating, as the interface is more fine grained now?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629	Ok it looks like the only reason is that VPWidenPHIRecipe is only used in plans for outer-loop vectorization and there interleaving isn't supported. So I think it should be fine to just extend it to generate phis for all parts. It would be beneficial to ensure we use generic recipes where possible to avoid duplication and confusion. This will also help with further unifying the split between native/inner loop planning.
llvm/lib/Transforms/Vectorize/VPlan.cpp
981	Can `Builder.CreatePhi` be used without the need to manually pick the insertion point?
llvm/lib/Transforms/Vectorize/VPlan.h
802	This should be documented more clearly. At the moment this is used as the name of the generated IR instruction? Also, this is only used for `VPInstruction::ActiveLaneMask` which is a bit surprising.
2598	Do we need to keep this as member variable? I think at the moment we generally avoid having links to recipes here, as this requires transforms that may remove or replace recipes to be aware of this state (see the similar `getCanonicalIV`) It should be trivial to find the active lane mask phi recipe, if there is one (just look at the phi recipes of the header)

david-arm marked an inline comment as done. Jun 30 2022, 8:56 AM

david-arm added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.cpp
982	Hi @fhahn, if I use VPWidenPHIRecipe instead it's not obvious how I add the incoming start mask. It looks like VPWidenPHIRecipe::execute doesn't add it - do you know how?

fhahn mentioned this in D128937: [VPlan] Generalize VPWidenPHIRecipe to generate values for all parts. Jun 30 2022, 12:38 PM

fhahn added inline comments. Jun 30 2022, 12:40 PM

llvm/lib/Transforms/Vectorize/VPlan.cpp
982	I think D128937 should do the trick, together with the snippet below. If D128937 allows VPWidenPHIRecipe to be used here I'll send it for review.

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index d8fb7a9c12b0..1ba76bf52cc6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3646,8 +3646,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
     truncateToMinimalBitwidths(State);

   // Fix widened non-induction PHIs by setting up the PHI operands.
-  if (EnableVPlanNativePath)
-    fixNonInductionPHIs(Plan, State);
+  fixNonInductionPHIs(Plan, State);

   // At this point every instruction in the original loop is widened to a
   // vector form. Now we need to fix the recurrences in the loop. These PHI

Changed getActiveLaneMaskPHI to look up the recipe from the header, and removed the member variable.
Improved comments in code.
Split out the VPInstruction Name parameter work into a separate NFC patch - D128982.

Herald added a subscriber: shiva0217. Jul 1 2022, 3:59 AM

david-arm added a parent revision: D128982: [LoopVectorize][NFC] Add optional Name parameter to VPInstruction. Jul 1 2022, 3:59 AM

Harbormaster completed remote builds in B173216: Diff 441660. Jul 1 2022, 4:00 AM

david-arm marked 8 inline comments as done. Jul 1 2022, 4:01 AM

david-arm added inline comments. Jul 4 2022, 6:20 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
982	Hi @fhahn, if you're happy with everything else in the patch, would you be ok with submitting this patch as it is now, then look into using VPWidenPHIRecipe as a follow-on piece of refactoring? We're quite keen to get this work submitted soonish so we can get sufficient testing in the run-up to the LLVM 15 release. We also have quite a few other pieces of work on-going that depend upon this too.

fhahn added inline comments. Jul 5 2022, 6:03 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
571	nit: stray whitespace.
596	Is this covered by a test? The comment and logic above applies specifically to `BranchOnCount`. It is not entirely clear to me that it would be safe for any `BranchOnCond`?
982	Ok sounds good to me, could you please add a FIXME/TODO to the code in the meantime?
	> We also have quite a few other pieces of work on-going that depend upon this too.
	If other stuff depends on the patch it might be helpful to link them together as a patch stack, to get a better idea on the wider impact.
992	Should this also print the operands here?
llvm/lib/Transforms/Vectorize/VPlan.h
2596	What about multiple VPActiveLaneMasks? Should the verifier check to make sure there's only a single one? It would probably also be good to place this after `getCanonicalIV`, which seems related.
llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
202	This still references `BranchOnActiveLaneMask`?

david-arm added inline comments. Jul 6 2022, 5:48 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
596	It is covered by sve-low-trip-count.ll, since we create BranchOnCond using the active lane mask. I did this after you suggested in earlier review comments to use BranchOnCond instead of BranchOnActiveLaneMask because we don't want to regress for low trip counts. We also don't have access to the TripCount in addCanonicalIVRecipes, so I can't use that to determine what type of Branch instruction to use. It's a shame that there is no easy way to add a flag to the VPInstruction to mark it as safe for optimisation here. For example, it would be nice to do something like (Term->getOpcode() == VPInstruction::BranchOnCount || (Term->getOpcode() == VPInstruction::BranchOnCond && Term->isBranchSafeToOptimise()))

Updated new VPActiveLaneMaskPHIRecipe::print function to write out the operands as well.
Add a check to the vplan verifier that we only have one active lane mask phi recipe.
Added a TODO above VPActiveLaneMaskPHIRecipe::execute to try to use VPWidenPHIRecipe.

david-arm marked 5 inline comments as done. Jul 6 2022, 6:46 AM

Harbormaster completed remote builds in B173885: Diff 442552. Jul 6 2022, 7:32 AM

fhahn added inline comments. Jul 6 2022, 10:10 PM

llvm/lib/Transforms/Vectorize/VPlan.cpp
596	Thanks, I don't think a flag is actually needed. IIUC it should be sufficient to check if the operand of the BranchOnCond is fed by the current active lane mask and update the comment?

david-arm added inline comments. Jul 7 2022, 12:18 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
596	OK, it's more complicated than that though, since we have to check for not(active_lane_mask).

david-arm updated this revision to Diff 442820. Jul 7 2022, 1:48 AM

david-arm marked 2 inline comments as done.

david-arm added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.cpp
596	I've added explicit checks for the right flavour of BranchOnCond now, although the logic is considerably more complicated now. :)
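
The kind of check discussed in this exchange, recognising a `BranchOnCond` whose condition is `not(active lane mask)`, can be sketched on a mock expression tree. None of this is the VPlan API: `Opcode`, `Node` and `isCountOrLaneMaskBranch` are hypothetical names chosen for illustration, and the real recipe classes carry far more state.

```cpp
#include <vector>

// Hypothetical opcodes, loosely named after the VPInstructions in the review.
enum class Opcode { ActiveLaneMask, Not, BranchOnCond, BranchOnCount, Other };

// Minimal mock of an instruction with operands.
struct Node {
  Opcode Op;
  std::vector<const Node *> Operands;
};

// True if Term is a BranchOnCount, or a BranchOnCond whose single operand is
// not(ActiveLaneMask). Any other BranchOnCond is rejected.
bool isCountOrLaneMaskBranch(const Node &Term) {
  if (Term.Op == Opcode::BranchOnCount)
    return true;
  if (Term.Op != Opcode::BranchOnCond || Term.Operands.size() != 1)
    return false;
  const Node *Cond = Term.Operands[0];
  if (!Cond || Cond->Op != Opcode::Not || Cond->Operands.size() != 1)
    return false;
  const Node *Mask = Cond->Operands[0];
  return Mask && Mask->Op == Opcode::ActiveLaneMask;
}
```

Matching the operand chain explicitly avoids adding a flag to the terminator, at the cost of a longer condition.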

Harbormaster completed remote builds in B174085: Diff 442820. Jul 7 2022, 2:42 AM

LGTM, thanks! I left a few inline comments where some of the comments could still be updated. Also it would be good to add a TODO to replace the phi recipe and do this as a follow-up soon.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1515	nit: would be good to add a quick comment here.
llvm/lib/Transforms/Vectorize/VPlan.cpp
595–596	nit: comment still needs updating?
llvm/lib/Transforms/Vectorize/VPlan.h
1878	nit: TODO/FIXME to replace this by the more general VPWidenPHIRecipe?

This revision was landed with ongoing or failed builds. Jul 11 2022, 5:47 AM

Closed by commit rG03fee6712a39: [LoopVectorize] Add option to use active lane mask for loop control flow (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rG03fee6712a39: [LoopVectorize] Add option to use active lane mask for loop control flow.

Herald added subscribers: pcwang-thead, frasercrmck, luismarques and 19 others. Jul 11 2022, 5:47 AM

Hi @fhahn, I've tried working on a patch to use VPWidenPHIRecipe instead of VPActiveLaneMaskPHIRecipe (see D129744), which is based off another WIP of yours (D128937). However, there are still problems with this approach:

Still quite a lot of code assumes that VPWidenPHIRecipe has an underlying value, which I've tried to hack my way around.
Updating VPlan::getActiveLaneMaskPHI() is tricky now - I've hacked this by looking for a VPWidenPHIRecipe without an underlying value.
In addCanonicalIVRecipes the loop exit block that I add as an incoming block for the VPWidenPHIRecipe is not the final exit block that you see in VPlan::execute. In fixNonInductionPHIs we crash because we cannot map this bad exit VPBasicBlock to a real BasicBlock. It looks like I still may have to let VPlan::execute add the latch incoming value for VPWidenPHIRecipes.

david-arm mentioned this in D142887: [LoopVectorize][TTI] NFCI: Clarify enum for the tail folding style. Jan 31 2023, 2:51 AM

dmgreen mentioned this in D148010: [Pipelines] Don't run module optimization in full LTO pre-link. Apr 20 2023, 12:44 AM

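Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	14 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	4 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h	2 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	2 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	6 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h	2 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp	6 lines
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h	5 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	95 lines
llvm/lib/Transforms/Vectorize/VPlan.h	48 lines
llvm/lib/Transforms/Vectorize/VPlan.cpp	35 lines
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp	48 lines
llvm/lib/Transforms/Vectorize/VPlanValue.h	2 lines
llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp	22 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions-tf.ll	5 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll	10 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll	41 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll	50 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-unroll.ll	336 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll	246 lines
llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll	19 lines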

Diff 443614

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 154 Lines • ▼ Show 20 Lines	public:

bool isTypeBasedOnly() const {		bool isTypeBasedOnly() const {
return Arguments.empty();		return Arguments.empty();
}		}

bool skipScalarizationCost() const { return ScalarizationCost.isValid(); }		bool skipScalarizationCost() const { return ScalarizationCost.isValid(); }
};		};

		enum class PredicationStyle { None, Data, DataAndControlFlow };
		paulwalker-arm (Done): Please say if I'm being picky but to me `ActiveLaneMask` is a thing, as in, it's the active lane mask. Here I think you're asking the target to choose the style of predication. So something like the following reads better to me: enum class PredicationStyle { None, Data, DataAndControlFlow };

class TargetTransformInfo;		class TargetTransformInfo;
typedef TargetTransformInfo TTI;		typedef TargetTransformInfo TTI;

/// This pass provides access to the codegen interfaces that are needed		/// This pass provides access to the codegen interfaces that are needed
/// for IR-level transformations.		/// for IR-level transformations.
class TargetTransformInfo {		class TargetTransformInfo {
public:		public:
/// Construct a TTI object using a type implementing the \c Concept		/// Construct a TTI object using a type implementing the \c Concept
▲ Show 20 Lines • Show All 355 Lines • ▼ Show 20 Lines	public:
/// Query the target whether it would be prefered to create a predicated		/// Query the target whether it would be prefered to create a predicated
/// vector loop, which can avoid the need to emit a scalar epilogue loop.		/// vector loop, which can avoid the need to emit a scalar epilogue loop.
bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) const;		const LoopAccessInfo *LAI) const;

/// Query the target whether lowering of the llvm.get.active.lane.mask		/// Query the target whether lowering of the llvm.get.active.lane.mask
/// intrinsic is supported.		/// intrinsic is supported and how the mask should be used. A return value
bool emitGetActiveLaneMask() const;		/// of PredicationStyle::Data indicates the mask is used as data only,
		/// whereas PredicationStyle::DataAndControlFlow indicates we should also use
		/// the mask for control flow in the loop. If unsupported the return value is
		/// PredicationStyle::None.
		PredicationStyle emitGetActiveLaneMask() const;
		sdesmalen (Done): I find this interface a bit cumbersome, can we make `emitGetActiveLaneMask` return an enum value that explains how the lane mask must be used, something like: enum class ActiveLaneMask { None = 0, PredicateOnly, PredicateAndControlFlow }; (but then with possibly better names)
		fhahn (Done): Looks like this comment needs updating, as the interface is more fine grained now?

// Parameters that control the loop peeling transformation		// Parameters that control the loop peeling transformation
struct PeelingPreferences {		struct PeelingPreferences {
/// A forced peeling factor (the number of bodied of the original loop		/// A forced peeling factor (the number of bodied of the original loop
/// that should be peeled off before the loop body). When set to 0, the		/// that should be peeled off before the loop body). When set to 0, the
/// a peeling factor based on profile information and other factors.		/// a peeling factor based on profile information and other factors.
unsigned PeelCount;		unsigned PeelCount;
/// Allow peeling off loop iterations.		/// Allow peeling off loop iterations.
▲ Show 20 Lines • Show All 1,004 Lines • ▼ Show 20 Lines	public:
virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) = 0;		HardwareLoopInfo &HWLoopInfo) = 0;
virtual bool		virtual bool
preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT, const LoopAccessInfo *LAI) = 0;		DominatorTree *DT, const LoopAccessInfo *LAI) = 0;
virtual bool emitGetActiveLaneMask() = 0;		virtual PredicationStyle emitGetActiveLaneMask() = 0;
virtual Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,		virtual Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
IntrinsicInst &II) = 0;		IntrinsicInst &II) = 0;
virtual Optional<Value *>		virtual Optional<Value *>
simplifyDemandedUseBitsIntrinsic(InstCombiner &IC, IntrinsicInst &II,		simplifyDemandedUseBitsIntrinsic(InstCombiner &IC, IntrinsicInst &II,
APInt DemandedMask, KnownBits &Known,		APInt DemandedMask, KnownBits &Known,
bool &KnownBitsComputed) = 0;		bool &KnownBitsComputed) = 0;
virtual Optional<Value *> simplifyDemandedVectorEltsIntrinsic(		virtual Optional<Value *> simplifyDemandedVectorEltsIntrinsic(
InstCombiner &IC, IntrinsicInst &II, APInt DemandedElts, APInt &UndefElts,		InstCombiner &IC, IntrinsicInst &II, APInt DemandedElts, APInt &UndefElts,
▲ Show 20 Lines • Show All 362 Lines • ▼ Show 20 Lines	bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);		return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
}		}
bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) override {		const LoopAccessInfo *LAI) override {
return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);		return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
}		}
bool emitGetActiveLaneMask() override {		PredicationStyle emitGetActiveLaneMask() override {
return Impl.emitGetActiveLaneMask();		return Impl.emitGetActiveLaneMask();
}		}
Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,		Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
IntrinsicInst &II) override {		IntrinsicInst &II) override {
return Impl.instCombineIntrinsic(IC, II);		return Impl.instCombineIntrinsic(IC, II);
}		}
Optional<Value *>		Optional<Value *>
simplifyDemandedUseBitsIntrinsic(InstCombiner &IC, IntrinsicInst &II,		simplifyDemandedUseBitsIntrinsic(InstCombiner &IC, IntrinsicInst &II,
▲ Show 20 Lines • Show All 626 Lines • Show Last 20 Lines
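
The hunks above thread the new return type through the public wrapper, the abstract `Concept` interface and the `Model` adaptor. The following is a minimal, self-contained sketch of that forwarding pattern, not the LLVM implementation: `TTISketch`, `DefaultImpl` and `ScalableTargetImpl` are hypothetical names introduced for illustration, and only the `PredicationStyle` enum and the `emitGetActiveLaneMask` name come from the patch.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Mirrors the enum added by the patch.
enum class PredicationStyle { None, Data, DataAndControlFlow };

// Hypothetical default implementation: the lane mask is unsupported.
struct DefaultImpl {
  PredicationStyle emitGetActiveLaneMask() const {
    return PredicationStyle::None;
  }
};

// Hypothetical target whose answer depends on a single feature bit.
struct ScalableTargetImpl {
  bool HasScalableVectors;
  PredicationStyle emitGetActiveLaneMask() const {
    return HasScalableVectors ? PredicationStyle::DataAndControlFlow
                              : PredicationStyle::None;
  }
};

// Simplified stand-in for the wrapper class and its Concept/Model pair.
class TTISketch {
  // Abstract interface that every wrapped implementation must satisfy.
  struct Concept {
    virtual ~Concept() = default;
    virtual PredicationStyle emitGetActiveLaneMask() = 0;
  };

  // Adaptor that forwards each virtual call to the concrete implementation.
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T I) : Impl(std::move(I)) {}
    PredicationStyle emitGetActiveLaneMask() override {
      return Impl.emitGetActiveLaneMask();
    }
  };

  std::unique_ptr<Concept> TTIImpl;

public:
  template <typename T>
  explicit TTISketch(T Impl)
      : TTIImpl(std::make_unique<Model<T>>(std::move(Impl))) {}

  // Public entry point: clients see only the enum, never the target type.
  PredicationStyle emitGetActiveLaneMask() const {
    return TTIImpl->emitGetActiveLaneMask();
  }
};
```

A caller would construct the wrapper from any implementation type and compare the result against the enum values, much as `useActiveLaneMaskForControlFlow()` in LoopVectorize.cpp compares against `PredicationStyle::DataAndControlFlow`.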

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 161 Lines • ▼ Show 20 Lines	public:

bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) const {		const LoopAccessInfo *LAI) const {
return false;		return false;
}		}

bool emitGetActiveLaneMask() const {		PredicationStyle emitGetActiveLaneMask() const {
return false;		return PredicationStyle::None;
}		}

Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,		Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
IntrinsicInst &II) const {		IntrinsicInst &II) const {
return None;		return None;
}		}

Optional<Value *>		Optional<Value *>
▲ Show 20 Lines • Show All 1,093 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 601 Lines • ▼ Show 20 Lines	public:

bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) {		const LoopAccessInfo *LAI) {
return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);		return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
}		}

bool emitGetActiveLaneMask() {		PredicationStyle emitGetActiveLaneMask() {
return BaseT::emitGetActiveLaneMask();		return BaseT::emitGetActiveLaneMask();
}		}

Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,		Optional<Instruction *> instCombineIntrinsic(InstCombiner &IC,
IntrinsicInst &II) {		IntrinsicInst &II) {
return BaseT::instCombineIntrinsic(IC, II);		return BaseT::instCombineIntrinsic(IC, II);
}		}

▲ Show 20 Lines • Show All 1,744 Lines • Show Last 20 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 292 Lines • ▼ Show 20 Lines

	bool TargetTransformInfo::preferPredicateOverEpilogue(			bool TargetTransformInfo::preferPredicateOverEpilogue(
	Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,			Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,
	TargetLibraryInfo *TLI, DominatorTree *DT,			TargetLibraryInfo *TLI, DominatorTree *DT,
	const LoopAccessInfo *LAI) const {			const LoopAccessInfo *LAI) const {
	return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);			return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
	}			}

	bool TargetTransformInfo::emitGetActiveLaneMask() const {			PredicationStyle TargetTransformInfo::emitGetActiveLaneMask() const {
	return TTIImpl->emitGetActiveLaneMask();			return TTIImpl->emitGetActiveLaneMask();
	}			}

	Optional<Instruction *>			Optional<Instruction *>
	TargetTransformInfo::instCombineIntrinsic(InstCombiner &IC,			TargetTransformInfo::instCombineIntrinsic(InstCombiner &IC,
	IntrinsicInst &II) const {			IntrinsicInst &II) const {
	return TTIImpl->instCombineIntrinsic(IC, II);			return TTIImpl->instCombineIntrinsic(IC, II);
	}			}
	▲ Show 20 Lines • Show All 925 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 328 Lines • ▼ Show 20 Lines	shouldConsiderAddressTypePromotion(const Instruction &I,
bool &AllowPromotionWithoutCommonHeader);		bool &AllowPromotionWithoutCommonHeader);

bool shouldExpandReduction(const IntrinsicInst *II) const { return false; }		bool shouldExpandReduction(const IntrinsicInst *II) const { return false; }

unsigned getGISelRematGlobalCost() const {		unsigned getGISelRematGlobalCost() const {
return 2;		return 2;
}		}

bool emitGetActiveLaneMask() const {		PredicationStyle emitGetActiveLaneMask() const {
return ST->hasSVE();		if (ST->hasSVE())
		return PredicationStyle::DataAndControlFlow;
		paulwalker-arm (Done): Will we do the correct thing if the vectoriser chooses fixed length vectorisation? Not sure what "correct" really means here other than not crash or regress performance? I just want to ensure we've considered the possibility.
		david-arm (Author, Done): I think it does yeah. I've added a new test @simple_memset_v4i32 in sve-tail-folding.ll to defend the case where SVE is enabled and we've chosen a fixed-width VF. The codegen side of things is safe because we have tests to defend lowering of get.active.lane.mask calls for fixed-width too.
		return PredicationStyle::None;
}		}

bool supportsScalableVectors() const { return ST->hasSVE(); }		bool supportsScalableVectors() const { return ST->hasSVE(); }

bool enableScalableVectorization() const { return ST->hasSVE(); }		bool enableScalableVectorization() const { return ST->hasSVE(); }

bool isLegalToVectorizeReduction(const RecurrenceDescriptor &RdxDesc,		bool isLegalToVectorizeReduction(const RecurrenceDescriptor &RdxDesc,
ElementCount VF) const;		ElementCount VF) const;
Show All 15 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 292 Lines • ▼ Show 20 Lines	bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *TLI,		TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI);		const LoopAccessInfo *LAI);
void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP,		TTI::UnrollingPreferences &UP,
OptimizationRemarkEmitter *ORE);		OptimizationRemarkEmitter *ORE);

bool emitGetActiveLaneMask() const;		PredicationStyle emitGetActiveLaneMask() const;

void getPeelingPreferences(Loop *L, ScalarEvolution &SE,		void getPeelingPreferences(Loop *L, ScalarEvolution &SE,
TTI::PeelingPreferences &PP);		TTI::PeelingPreferences &PP);
bool shouldBuildLookupTablesForConstant(Constant *C) const {		bool shouldBuildLookupTablesForConstant(Constant *C) const {
// In the ROPI and RWPI relocation models we can't have pointers to global		// In the ROPI and RWPI relocation models we can't have pointers to global
// variables or functions in constant data, so don't convert switches to		// variables or functions in constant data, so don't convert switches to
// lookup tables if any of the values would need relocation.		// lookup tables if any of the values would need relocation.
if (ST->isROPI() || ST->isRWPI())		if (ST->isROPI() || ST->isRWPI())
Show All 39 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 2,241 Lines • ▼ Show 20 Lines	if (!HWLoopInfo.isHardwareLoopCandidate(SE, LI, DT)) {
LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "		LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
"a candidate.\n");		"a candidate.\n");
return false;		return false;
}		}

return canTailPredicateLoop(L, LI, SE, DL, LAI);		return canTailPredicateLoop(L, LI, SE, DL, LAI);
}		}

bool ARMTTIImpl::emitGetActiveLaneMask() const {		PredicationStyle ARMTTIImpl::emitGetActiveLaneMask() const {
if (!ST->hasMVEIntegerOps() || !EnableTailPredication)		if (!ST->hasMVEIntegerOps() || !EnableTailPredication)
return false;		return PredicationStyle::None;

// Intrinsic @llvm.get.active.lane.mask is supported.		// Intrinsic @llvm.get.active.lane.mask is supported.
// It is used in the MVETailPredication pass, which requires the number of		// It is used in the MVETailPredication pass, which requires the number of
// elements processed by this vector loop to setup the tail-predicated		// elements processed by this vector loop to setup the tail-predicated
// loop.		// loop.
return true;		return PredicationStyle::Data;
}		}
void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP,		TTI::UnrollingPreferences &UP,
OptimizationRemarkEmitter *ORE) {		OptimizationRemarkEmitter *ORE) {
// Enable Upper bound unrolling universally, not dependant upon the conditions		// Enable Upper bound unrolling universally, not dependant upon the conditions
// below.		// below.
UP.UpperBound = true;		UP.UpperBound = true;

▲ Show 20 Lines • Show All 121 Lines • Show Last 20 Lines

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

Show First 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	public:
InstructionCost getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,		InstructionCost getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
const APInt &Imm, Type *Ty,		const APInt &Imm, Type *Ty,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);		TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);

bool shouldExpandReduction(const IntrinsicInst *II) const;		bool shouldExpandReduction(const IntrinsicInst *II) const;
bool supportsScalableVectors() const { return ST->hasVInstructions(); }		bool supportsScalableVectors() const { return ST->hasVInstructions(); }
bool emitGetActiveLaneMask() const { return ST->hasVInstructions(); }		PredicationStyle emitGetActiveLaneMask() const {
		return ST->hasVInstructions() ? PredicationStyle::Data
		: PredicationStyle::None;
		}
Optional<unsigned> getMaxVScale() const;		Optional<unsigned> getMaxVScale() const;
Optional<unsigned> getVScaleForTuning() const;		Optional<unsigned> getVScaleForTuning() const;

TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const;		TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const;

unsigned getRegUsageForType(Type *Ty);		unsigned getRegUsageForType(Type *Ty);

InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,		InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
▲ Show 20 Lines • Show All 206 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 1,020 Lines • ▼ Show 20 Lines	while (!Worklist.empty()) {

// Prune search if we find another recipe generating a widen memory		// Prune search if we find another recipe generating a widen memory
// instruction. Widen memory instructions involved in address computation		// instruction. Widen memory instructions involved in address computation
// will lead to gather/scatter instructions, which don't need to be		// will lead to gather/scatter instructions, which don't need to be
// handled.		// handled.
if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||		if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
isa<VPInterleaveRecipe>(CurRec) ||		isa<VPInterleaveRecipe>(CurRec) ||
isa<VPScalarIVStepsRecipe>(CurRec) ||		isa<VPScalarIVStepsRecipe>(CurRec) ||
isa<VPCanonicalIVPHIRecipe>(CurRec))		isa<VPCanonicalIVPHIRecipe>(CurRec) ||
		isa<VPActiveLaneMaskPHIRecipe>(CurRec))
continue;		continue;

// This recipe contributes to the address computation of a widen		// This recipe contributes to the address computation of a widen
// load/store. Collect recipe if its underlying instruction has		// load/store. Collect recipe if its underlying instruction has
// poison-generating flags.		// poison-generating flags.
Instruction *Instr = CurRec->getUnderlyingInstr();		Instruction *Instr = CurRec->getUnderlyingInstr();
if (Instr && Instr->hasPoisonGeneratingFlags())		if (Instr && Instr->hasPoisonGeneratingFlags())
State.MayGeneratePoisonRecipes.insert(CurRec);		State.MayGeneratePoisonRecipes.insert(CurRec);
▲ Show 20 Lines • Show All 468 Lines • ▼ Show 20 Lines	public:
/// loop hint annotation.		/// loop hint annotation.
bool isScalarEpilogueAllowed() const {		bool isScalarEpilogueAllowed() const {
return ScalarEpilogueStatus == CM_ScalarEpilogueAllowed;		return ScalarEpilogueStatus == CM_ScalarEpilogueAllowed;
}		}

/// Returns true if all loop blocks should be masked to fold tail loop.		/// Returns true if all loop blocks should be masked to fold tail loop.
bool foldTailByMasking() const { return FoldTailByMasking; }		bool foldTailByMasking() const { return FoldTailByMasking; }

		/// Returns true if we're tail-folding and want to use the active lane mask
		fhahn (Not Done): nit: would be good to add a quick comment here.
		/// for vector loop control flow.
		bool useActiveLaneMaskForControlFlow() const {
		return FoldTailByMasking &&
		TTI.emitGetActiveLaneMask() == PredicationStyle::DataAndControlFlow;
		}

/// Returns true if the instructions in this block requires predication		/// Returns true if the instructions in this block requires predication
/// for any reason, e.g. because tail folding now requires a predicate		/// for any reason, e.g. because tail folding now requires a predicate
/// or because the block in the original loop was predicated.		/// or because the block in the original loop was predicated.
bool blockNeedsPredicationForAnyReason(BasicBlock *BB) const {		bool blockNeedsPredicationForAnyReason(BasicBlock *BB) const {
return foldTailByMasking() || Legal->blockNeedsPredication(BB);		return foldTailByMasking() || Legal->blockNeedsPredication(BB);
}		}

/// A SmallMapVector to store the InLoop reduction op chains, mapping phi		/// A SmallMapVector to store the InLoop reduction op chains, mapping phi
▲ Show 20 Lines • Show All 6,489 Lines • ▼ Show 20 Lines	VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
// All-one mask is modelled as no-mask following the convention for masked		// All-one mask is modelled as no-mask following the convention for masked
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

if (OrigLoop->getHeader() == BB) {		if (OrigLoop->getHeader() == BB) {
if (!CM.blockNeedsPredicationForAnyReason(BB))		if (!CM.blockNeedsPredicationForAnyReason(BB))
return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

		assert(CM.foldTailByMasking() && "must fold the tail");

		// If we're using the active lane mask for control flow, then we get the
		// mask from the active lane mask PHI that is cached in the VPlan.
		PredicationStyle EmitGetActiveLaneMask = CM.TTI.emitGetActiveLaneMask();
		if (EmitGetActiveLaneMask == PredicationStyle::DataAndControlFlow)
		return BlockMaskCache[BB] = Plan->getActiveLaneMaskPhi();

// Introduce the early-exit compare IV <= BTC to form header block mask.		// Introduce the early-exit compare IV <= BTC to form header block mask.
// This is used instead of IV < TC because TC may wrap, unlike BTC. Start by		// This is used instead of IV < TC because TC may wrap, unlike BTC. Start by
// constructing the desired canonical IV in the header block as its first		// constructing the desired canonical IV in the header block as its first
// non-phi instructions.		// non-phi instructions.
		sdesmalen (Done): this comment is no longer correct.
assert(CM.foldTailByMasking() && "must fold the tail");
VPBasicBlock *HeaderVPBB =		VPBasicBlock *HeaderVPBB =
Plan->getVectorLoopRegion()->getEntryBasicBlock();		Plan->getVectorLoopRegion()->getEntryBasicBlock();
auto NewInsertionPoint = HeaderVPBB->getFirstNonPhi();		auto NewInsertionPoint = HeaderVPBB->getFirstNonPhi();
auto *IV = new VPWidenCanonicalIVRecipe(Plan->getCanonicalIV());		auto *IV = new VPWidenCanonicalIVRecipe(Plan->getCanonicalIV());
HeaderVPBB->insert(IV, HeaderVPBB->getFirstNonPhi());		HeaderVPBB->insert(IV, HeaderVPBB->getFirstNonPhi());
		sdesmalen (Done): Can you add a comment explaining what this does? From my understanding, it sets the predicate for the vectorized loop's header block (which would be the vectorized loop's main predicate) to the ActiveLaneMaskPhi cached in the Plan. Is that right?

VPBuilder::InsertPointGuard Guard(Builder);		VPBuilder::InsertPointGuard Guard(Builder);
Builder.setInsertPoint(HeaderVPBB, NewInsertionPoint);		Builder.setInsertPoint(HeaderVPBB, NewInsertionPoint);
if (CM.TTI.emitGetActiveLaneMask()) {		if (EmitGetActiveLaneMask != PredicationStyle::None) {
VPValue *TC = Plan->getOrCreateTripCount();		VPValue *TC = Plan->getOrCreateTripCount();
BlockMask = Builder.createNaryOp(VPInstruction::ActiveLaneMask, {IV, TC},		BlockMask = Builder.createNaryOp(VPInstruction::ActiveLaneMask, {IV, TC},
nullptr, "active.lane.mask");		nullptr, "active.lane.mask");
} else {		} else {
VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();		VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();
BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});		BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
}		}
return BlockMaskCache[BB] = BlockMask;		return BlockMaskCache[BB] = BlockMask;
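
The source comment in this hunk notes that the header mask compares IV <= BTC rather than IV < TC because the trip count may wrap while the backedge-taken count does not. A small sketch of that arithmetic follows. It assumes an 8-bit counter purely to make the wrap visible (real induction variables are wider), and `maskFromBTC` and `maskFromTC` are hypothetical helper names, not LLVM functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Mask = std::vector<bool>;

// Lane I is active when (IV + I) <= BTC: the ICmpULE form from the diff.
Mask maskFromBTC(uint8_t IV, uint8_t BTC, unsigned VF) {
  Mask M(VF);
  for (unsigned I = 0; I < VF; ++I)
    M[I] = static_cast<uint8_t>(IV + I) <= BTC;
  return M;
}

// Lane I is active when (IV + I) < TC, with TC = BTC + 1 computed in the
// same 8-bit width. TC wraps to 0 when BTC == 255.
Mask maskFromTC(uint8_t IV, uint8_t BTC, unsigned VF) {
  uint8_t TC = static_cast<uint8_t>(BTC + 1);
  Mask M(VF);
  for (unsigned I = 0; I < VF; ++I)
    M[I] = static_cast<uint8_t>(IV + I) < TC;
  return M;
}
```

With BTC = 9 both forms agree on every lane. With BTC = 255 the trip count wraps to zero, so the IV < TC form disables every lane of the first iteration, while the IV <= BTC form keeps them enabled.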
▲ Show 20 Lines • Show All 526 Lines • ▼ Show 20 Lines	void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
auto MaxVFPlusOne = MaxVF.getWithIncrement(1);		auto MaxVFPlusOne = MaxVF.getWithIncrement(1);
for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFPlusOne);) {		for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFPlusOne);) {
VFRange SubRange = {VF, MaxVFPlusOne};		VFRange SubRange = {VF, MaxVFPlusOne};
VPlans.push_back(		VPlans.push_back(
buildVPlanWithVPRecipes(SubRange, DeadInstructions, SinkAfter));		buildVPlanWithVPRecipes(SubRange, DeadInstructions, SinkAfter));
VF = SubRange.End;		VF = SubRange.End;
}		}
}		}

// Add a VPCanonicalIVPHIRecipe starting at 0 to the header, a		// Add the necessary canonical IV and branch recipes required to control the
// CanonicalIVIncrement{NUW} VPInstruction to increment it by VF * UF and a		// loop.
		paulwalker-arm (Done): This comment is now out of date. Perhaps it's best to replace it with something more generic and have the implementation specific part nearer the code?
// BranchOnCount VPInstruction to the latch.
static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, DebugLoc DL,		static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, DebugLoc DL,
bool HasNUW) {		bool HasNUW,
		bool UseLaneMaskForLoopControlFlow) {
		sdesmalen (Done): nit: s/UseLaneMaskForControlFlow/UseLaneMaskForLoopControlFlow/
Value *StartIdx = ConstantInt::get(IdxTy, 0);		Value *StartIdx = ConstantInt::get(IdxTy, 0);
auto *StartV = Plan.getOrAddVPValue(StartIdx);		auto *StartV = Plan.getOrAddVPValue(StartIdx);

		// Add a VPCanonicalIVPHIRecipe starting at 0 to the header.
auto *CanonicalIVPHI = new VPCanonicalIVPHIRecipe(StartV, DL);		auto *CanonicalIVPHI = new VPCanonicalIVPHIRecipe(StartV, DL);
VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();		VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();
VPBasicBlock *Header = TopRegion->getEntryBasicBlock();		VPBasicBlock *Header = TopRegion->getEntryBasicBlock();
Header->insert(CanonicalIVPHI, Header->begin());		Header->insert(CanonicalIVPHI, Header->begin());

		// Add a CanonicalIVIncrement{NUW} VPInstruction to increment the scalar
		paulwalker-arm (Done): remove the "it"?
		// IV by VF * UF.
auto *CanonicalIVIncrement =		auto *CanonicalIVIncrement =
new VPInstruction(HasNUW ? VPInstruction::CanonicalIVIncrementNUW		new VPInstruction(HasNUW ? VPInstruction::CanonicalIVIncrementNUW
: VPInstruction::CanonicalIVIncrement,		: VPInstruction::CanonicalIVIncrement,
{CanonicalIVPHI}, DL, "index.next");		{CanonicalIVPHI}, DL, "index.next");
CanonicalIVPHI->addOperand(CanonicalIVIncrement);		CanonicalIVPHI->addOperand(CanonicalIVIncrement);

VPBasicBlock *EB = TopRegion->getExitingBasicBlock();		VPBasicBlock *EB = TopRegion->getExitingBasicBlock();
EB->appendRecipe(CanonicalIVIncrement);		EB->appendRecipe(CanonicalIVIncrement);

auto *BranchOnCount =		if (UseLaneMaskForLoopControlFlow) {
new VPInstruction(VPInstruction::BranchOnCount,		// Create the active lane mask instruction in the vplan preheader.
		VPBasicBlock *Preheader = Plan.getEntry()->getEntryBasicBlock();

		// We can't use StartV directly in the ActiveLaneMask VPInstruction, since
		// we have to take unrolling into account. Each part needs to start at
		// Part * VF
		auto *CanonicalIVIncrementParts =
		new VPInstruction(HasNUW ? VPInstruction::CanonicalIVIncrementForPartNUW
		: VPInstruction::CanonicalIVIncrementForPart,
		{StartV}, DL, "index.part.next");
		Preheader->appendRecipe(CanonicalIVIncrementParts);

		fhahnUnsubmitted Done Reply Inline Actions It looks like the update for the phi doesn't depend on the phi, but only on trip count & main induction. I might be missing something, but is the phi actually needed or would it be possible to compute the active lane mask at the beginning of each iteration instead of at the end of the iteration? fhahn: It looks like the update for the phi doesn't depend on the phi, but only on trip count & main…
		david-arm: Hi @fhahn, perhaps I've misunderstood your question here, but the point of this patch is to do the exact opposite of that, i.e. we don't want to generate the active lane mask for the current iteration; we want to create the mask for the next iteration. This requires a PHI to carry the live value around the loop and is the only way to use the mask for control flow, because at the point of branching we want to know if there are any active lanes in the mask for the next iteration. With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on. Effectively, the whilelo instruction is doing the comparison already, which makes the traditional scalar IV comparison redundant. I could be wrong, but I believe this form of vectorised loop would be beneficial for some other targets with a predicated instruction set too, such as RISC-V.
		fhahn:
		> With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on.
		Oh right, I missed this in the test case I was looking at! I originally thought the intention of the new phi recipe was to encode some extra guarantees/information like we do for inductions or reductions. From the latest comment, it sounds like there would be no need to have a new recipe class for the phi, but perhaps VPWidenPHIRecipe could be used instead, if the setup can be moved to the pre-header.
		david-arm: So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp: `void VPWidenPHIRecipe::execute(VPTransformState &State) { State.ILV->widenPHIInstruction(cast<PHINode>(getUnderlyingValue()), this, State); }` but the PHI node does not exist in the original scalar loop. It's not currently possible to widen a PHI that didn't previously exist, which means I would have to modify VPWidenPHIRecipe to test for the existence of an underlying value and take different paths accordingly.
		fhahn:
		> So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:
		Yeah this was a bit unfortunate! I landed 2 patches that removed the unnecessary dependence on the underlying instruction by instead using the type of its operand.
		david-arm: OK well thanks a lot for tidying that class up! The only problem is that even after your patches landed, VPWidenPHIRecipe::execute only ever deals with Part 0, whereas we need a Phi for every Part. So unfortunately I still can't use the class in its current state. If you have suggestions about how to fix this I'm happy to take another look? Perhaps I can add a boolean flag to the VPWidenPHIRecipe constructor to indicate how many parts to generate?
		fhahn: Hm, let me double check!
		fhahn: Ok it looks like the only reason is that VPWidenPHIRecipe is only used in plans for outer-loop vectorization and there interleaving isn't supported. So I think it should be fine to just extend it to generate phis for all parts. It would be beneficial to ensure we use generic recipes where possible to avoid duplication and confusion. This will also help with further unifying the split between native/inner loop planning.
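		Illustration (not part of the patch): the loop form david-arm describes at the top of this thread can be modelled with a small standalone C++ sketch. It is not LLVM code. It assumes the mask semantics "lane I is active iff Base + I < N", a single unrolled part, and a loop that is always entered once; the names `activeLaneMask`, `maskedSum` and `LoopResult` are hypothetical. The back-edge is taken on the first lane of the mask computed for the next iteration, so no separate comparison of the IV against the trip count is needed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Model of an active lane mask: lane I is active iff Base + I < N.
// Assumes Base + I does not overflow (true for the small values used here).
std::vector<bool> activeLaneMask(std::uint64_t Base, std::uint64_t N,
                                 unsigned VF) {
  std::vector<bool> Mask(VF);
  for (unsigned I = 0; I < VF; ++I)
    Mask[I] = Base + I < N;
  return Mask;
}

struct LoopResult {
  std::uint64_t Sum;   // sum of all elements
  unsigned Iterations; // number of "vector" iterations executed
};

// Tail-folded sum over Data. The mask used by an iteration was computed
// either in the preheader or at the end of the previous iteration, which is
// the role the PHI plays in the patch.
LoopResult maskedSum(const std::vector<std::uint64_t> &Data, unsigned VF) {
  assert(VF > 0 && "need at least one lane");
  const std::uint64_t N = Data.size();
  LoopResult R{0, 0};
  std::uint64_t Index = 0;
  // Preheader: mask for the first iteration ("active.lane.mask.entry").
  std::vector<bool> Mask = activeLaneMask(0, N, VF);
  do {
    // Predicated loop body: only active lanes touch memory.
    for (unsigned I = 0; I < VF; ++I)
      if (Mask[I])
        R.Sum += Data[Index + I];
    ++R.Iterations;
    Index += VF;
    // Latch: mask for the next iteration ("active.lane.mask.next").
    Mask = activeLaneMask(Index, N, VF);
    // Branch back only if the first lane of the next mask is active.
  } while (Mask[0]);
  return R;
}
```

		With ten elements and four lanes, the sketch runs three iterations; the last one has only two active lanes, and the mask computed after it has no active first lane, which ends the loop.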
		// Create the ActiveLaneMask instruction using the correct start values.
		VPValue *TC = Plan.getOrCreateTripCount();
		auto *EntryALM = new VPInstruction(VPInstruction::ActiveLaneMask,
		{CanonicalIVIncrementParts, TC}, DL,
		paulwalker-arm: Please ignore if too awkward, but I think the output IR will be more readable if we have tighter control over the name used for `EntryALM` and `ALM`. Currently the output is just littered with `active.lane.mask###`, making it harder to understand the data flow. If the entry masks were called `active.lane.mask.entry###`, next iteration masks called `active.lane.mask.next###` and the PHIs kept their current `active.lane.mask###` name, the loops would be easier to read.
		"active.lane.mask.entry");
		Preheader->appendRecipe(EntryALM);

		// Now create the ActiveLaneMaskPhi recipe in the main loop using the
		// preheader ActiveLaneMask instruction.
		auto *LaneMaskPhi = new VPActiveLaneMaskPHIRecipe(EntryALM, DebugLoc());
		Header->insert(LaneMaskPhi, Header->getFirstNonPhi());

		// Create the active lane mask for the next iteration of the loop.
		paulwalker-arm: Given the new block is already `EB->appendRecipe`ing several instructions I don't see any value in breaking out the last instruction like this.
		CanonicalIVIncrementParts =
		new VPInstruction(HasNUW ? VPInstruction::CanonicalIVIncrementForPartNUW
		: VPInstruction::CanonicalIVIncrementForPart,
		{CanonicalIVIncrement}, DL);
		EB->appendRecipe(CanonicalIVIncrementParts);

		auto *ALM = new VPInstruction(VPInstruction::ActiveLaneMask,
		{CanonicalIVIncrementParts, TC}, DL,
		"active.lane.mask.next");
		EB->appendRecipe(ALM);
		LaneMaskPhi->addOperand(ALM);

		// We have to invert the mask here because a true condition means jumping
		// to the exit block.
		auto *NotMask = new VPInstruction(VPInstruction::Not, ALM, DL);
		paulwalker-arm: This seems a bit wasteful given it's only the first lane we care about. Perhaps `VPInstruction::BranchOnCount` could take an `invert` flag that switches the branch destinations?
		david-arm: It's not as trivial as you'd think passing in flags to VPInstructions. :( You have to create an IR Constant, pass it in as a 3rd VPValue argument to a VPInstruction and mark it as an IR "live value in" operand. We do end up with nice code after InstCombine, since it swaps the labels around for us anyway. If you really prefer me to go down this route I can; it's just a little awkward, that's all.
		paulwalker-arm: No need if this is how the interface works. Perhaps VPInstruction could do with some domain-specific flags, which might also simplify the NUW split. Either way it's not directly relevant to this patch.
		EB->appendRecipe(NotMask);

		VPInstruction *BranchBack =
		new VPInstruction(VPInstruction::BranchOnCond, {NotMask}, DL);
		EB->appendRecipe(BranchBack);
		} else {
		// Add the BranchOnCount VPInstruction to the latch.
		VPInstruction *BranchBack = new VPInstruction(
		VPInstruction::BranchOnCount,
{CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);		{CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
EB->appendRecipe(BranchOnCount);		EB->appendRecipe(BranchBack);
		}
}		}

// Add exit values to \p Plan. VPLiveOuts are added for each LCSSA phi in the		// Add exit values to \p Plan. VPLiveOuts are added for each LCSSA phi in the
// original exit block.		// original exit block.
static void addUsersInExitBlock(VPBasicBlock *HeaderVPBB,		static void addUsersInExitBlock(VPBasicBlock *HeaderVPBB,
VPBasicBlock *MiddleVPBB, Loop *OrigLoop,		VPBasicBlock *MiddleVPBB, Loop *OrigLoop,
VPlan &Plan) {		VPlan &Plan) {
BasicBlock *ExitBB = OrigLoop->getUniqueExitBlock();		BasicBlock *ExitBB = OrigLoop->getUniqueExitBlock();
(84 lines hidden)	VPlanPtr LoopVectorizationPlanner::buildVPlanWithVPRecipes(
VPBlockUtils::insertBlockAfter(TopRegion, Preheader);		VPBlockUtils::insertBlockAfter(TopRegion, Preheader);
VPBasicBlock *MiddleVPBB = new VPBasicBlock("middle.block");		VPBasicBlock *MiddleVPBB = new VPBasicBlock("middle.block");
VPBlockUtils::insertBlockAfter(MiddleVPBB, TopRegion);		VPBlockUtils::insertBlockAfter(MiddleVPBB, TopRegion);

Instruction *DLInst =		Instruction *DLInst =
getDebugLocFromInstOrOperands(Legal->getPrimaryInduction());		getDebugLocFromInstOrOperands(Legal->getPrimaryInduction());
addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(),		addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(),
DLInst ? DLInst->getDebugLoc() : DebugLoc(),		DLInst ? DLInst->getDebugLoc() : DebugLoc(),
!CM.foldTailByMasking());		!CM.foldTailByMasking(),
		CM.useActiveLaneMaskForControlFlow());

// Scan the body of the loop in a topological order to visit each basic block		// Scan the body of the loop in a topological order to visit each basic block
// after having visited its predecessor basic blocks.		// after having visited its predecessor basic blocks.
LoopBlocksDFS DFS(OrigLoop);		LoopBlocksDFS DFS(OrigLoop);
DFS.perform(LI);		DFS.perform(LI);

VPBasicBlock *VPBB = HeaderVPBB;		VPBasicBlock *VPBB = HeaderVPBB;
SmallVector<VPWidenIntOrFpInductionRecipe *> InductionsToMove;		SmallVector<VPWidenIntOrFpInductionRecipe *> InductionsToMove;
(298 lines hidden)	VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {

// Remove the existing terminator of the exiting block of the top-most region.		// Remove the existing terminator of the exiting block of the top-most region.
// A BranchOnCount will be added instead when adding the canonical IV recipes.		// A BranchOnCount will be added instead when adding the canonical IV recipes.
auto *Term =		auto *Term =
Plan->getVectorLoopRegion()->getExitingBasicBlock()->getTerminator();		Plan->getVectorLoopRegion()->getExitingBasicBlock()->getTerminator();
Term->eraseFromParent();		Term->eraseFromParent();

addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), DebugLoc(),		addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), DebugLoc(),
true);		true, CM.useActiveLaneMaskForControlFlow());
return Plan;		return Plan;
		paulwalker-arm: This idiom occurs in a few places. Is it worth adding something like `CM.useActiveLaneMaskForControlFlow()`? I think it makes sense for this to imply tail folding by mask, as why else would you be doing it.
}		}

// Adjust the recipes for reductions. For in-loop reductions the chain of		// Adjust the recipes for reductions. For in-loop reductions the chain of
// instructions leading from the loop exit instr to the phi need to be converted		// instructions leading from the loop exit instr to the phi need to be converted
// to reductions, with one operand being vector and the other being the scalar		// to reductions, with one operand being vector and the other being the scalar
// reduction chain. For other reductions, a select is introduced between the phi		// reduction chain. For other reductions, a select is introduced between the phi
// and live-out recipes when folding the tail.		// and live-out recipes when folding the tail.
void LoopVectorizationPlanner::adjustRecipesForReductions(		void LoopVectorizationPlanner::adjustRecipesForReductions(
(1,573 lines hidden)

llvm/lib/Transforms/Vectorize/VPlan.h

(778 lines hidden)	FirstOrderRecurrenceSplice =
// values of a first-order recurrence.		// values of a first-order recurrence.
Not,		Not,
ICmpULE,		ICmpULE,
SLPLoad,		SLPLoad,
SLPStore,		SLPStore,
ActiveLaneMask,		ActiveLaneMask,
CanonicalIVIncrement,		CanonicalIVIncrement,
CanonicalIVIncrementNUW,		CanonicalIVIncrementNUW,
		// The next two are similar to the above, but instead increment the
		sdesmalen: It would be nice to have a one-line description here of what this does and how it's different from CanonicalIVIncrement.
		// canonical IV separately for each unrolled part.
		CanonicalIVIncrementForPart,
		CanonicalIVIncrementForPartNUW,
BranchOnCount,		BranchOnCount,
BranchOnCond		BranchOnCond
};		};
		fhahn: Can `BranchOnCond` be used instead of the dedicated `BranchOnActiveLaneMask`?
		david-arm: I created this patch a month ago, which predated your BranchOnCond work. That's why I haven't used it. I can certainly look into this and see if it's possible though?
		david-arm: So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type. It's possible to do this, but then I wasn't sure about the semantics of this new instruction. When passing in a scalar constant of 0 for the lane, it gets widened to something like <vscale x 4 x i32> zeroinitializer for every part. However, I only need a single lane so I'd have to do something like: `case VPInstruction::ExtractElement: { Value *Vec = State.get(getOperand(0), Part); Value *Lane = State.get(getOperand(1), VPIteration(0, 0)); Value *V = Builder.CreateExtractElement(Vec, Lane); State.set(this, V, Part); break; }` It feels quite inefficient to go to all the effort of widening, only to discard everything! If you still prefer me to proceed with this approach I'm happy to try, if you can provide your thoughts on what the new ExtractElement operation should look like?
		fhahn:
		> So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type.
		Extracts are not modeled explicitly at the moment, and usually `State.get` will take care of inserting an extract when requesting scalar lanes if it is needed. I think when using a `VPInstruction::Not` as operand for `BranchOnCond`, `State.get` should insert the extract for the first lane, as this is what `BranchOnCond` uses.
		david-arm: OK I'll take another look and give this a try!

private:		private:
typedef unsigned char OpcodeTy;		typedef unsigned char OpcodeTy;
OpcodeTy Opcode;		OpcodeTy Opcode;
FastMathFlags FMF;		FastMathFlags FMF;
DebugLoc DL;		DebugLoc DL;

/// An optional name that can be used for the generated IR instruction.		/// An optional name that can be used for the generated IR instruction.
const std::string Name;		const std::string Name;
		fhahn: This should be documented more clearly. At the moment this is used as the name of the generated IR instruction? Also, this is only used for `VPInstruction::ActiveLaneMask`, which is a bit surprising.

/// Utility method serving execute(): generates a single instance of the		/// Utility method serving execute(): generates a single instance of the
/// modeled instruction.		/// modeled instruction.
void generateInstruction(VPTransformState &State, unsigned Part);		void generateInstruction(VPTransformState &State, unsigned Part);

protected:		protected:
void setUnderlyingInstr(Instruction *I) { setUnderlyingValue(I); }		void setUnderlyingInstr(Instruction *I) { setUnderlyingValue(I); }

(89 lines hidden)	bool onlyFirstLaneUsed(const VPValue *Op) const override {
if (getOperand(0) != Op)		if (getOperand(0) != Op)
return false;		return false;
switch (getOpcode()) {		switch (getOpcode()) {
default:		default:
return false;		return false;
case VPInstruction::ActiveLaneMask:		case VPInstruction::ActiveLaneMask:
case VPInstruction::CanonicalIVIncrement:		case VPInstruction::CanonicalIVIncrement:
case VPInstruction::CanonicalIVIncrementNUW:		case VPInstruction::CanonicalIVIncrementNUW:
		case VPInstruction::CanonicalIVIncrementForPart:
		case VPInstruction::CanonicalIVIncrementForPartNUW:
case VPInstruction::BranchOnCount:		case VPInstruction::BranchOnCount:
return true;		return true;
};		};
llvm_unreachable("switch should return");		llvm_unreachable("switch should return");
}		}
};		};

/// VPWidenRecipe is a recipe for producing a copy of vector type its		/// VPWidenRecipe is a recipe for producing a copy of vector type its
(212 lines hidden)	protected:
}		}

public:		public:
~VPHeaderPHIRecipe() override = default;		~VPHeaderPHIRecipe() override = default;

/// Method to support type inquiry through isa, cast, and dyn_cast.		/// Method to support type inquiry through isa, cast, and dyn_cast.
static inline bool classof(const VPRecipeBase *B) {		static inline bool classof(const VPRecipeBase *B) {
return B->getVPDefID() == VPRecipeBase::VPCanonicalIVPHISC ||		return B->getVPDefID() == VPRecipeBase::VPCanonicalIVPHISC ||
		B->getVPDefID() == VPRecipeBase::VPActiveLaneMaskPHISC ||
B->getVPDefID() == VPRecipeBase::VPFirstOrderRecurrencePHISC ||		B->getVPDefID() == VPRecipeBase::VPFirstOrderRecurrencePHISC ||
B->getVPDefID() == VPRecipeBase::VPReductionPHISC ||		B->getVPDefID() == VPRecipeBase::VPReductionPHISC ||
B->getVPDefID() == VPRecipeBase::VPWidenIntOrFpInductionSC ||		B->getVPDefID() == VPRecipeBase::VPWidenIntOrFpInductionSC ||
B->getVPDefID() == VPRecipeBase::VPWidenPHISC;		B->getVPDefID() == VPRecipeBase::VPWidenPHISC;
}		}
static inline bool classof(const VPValue *V) {		static inline bool classof(const VPValue *V) {
return V->getVPValueID() == VPValue::VPVCanonicalIVPHISC ||		return V->getVPValueID() == VPValue::VPVCanonicalIVPHISC ||
		V->getVPValueID() == VPValue::VPVActiveLaneMaskPHISC ||
V->getVPValueID() == VPValue::VPVFirstOrderRecurrencePHISC ||		V->getVPValueID() == VPValue::VPVFirstOrderRecurrencePHISC ||
V->getVPValueID() == VPValue::VPVReductionPHISC ||		V->getVPValueID() == VPValue::VPVReductionPHISC ||
V->getVPValueID() == VPValue::VPVWidenIntOrFpInductionSC ||		V->getVPValueID() == VPValue::VPVWidenIntOrFpInductionSC ||
V->getVPValueID() == VPValue::VPVWidenPHISC;		V->getVPValueID() == VPValue::VPVWidenPHISC;
}		}

/// Generate the phi nodes.		/// Generate the phi nodes.
void execute(VPTransformState &State) override = 0;		void execute(VPTransformState &State) override = 0;
(713 lines hidden)	#endif
/// Returns true if the recipe only uses the first lane of operand \p Op.		/// Returns true if the recipe only uses the first lane of operand \p Op.
bool onlyFirstLaneUsed(const VPValue *Op) const override {		bool onlyFirstLaneUsed(const VPValue *Op) const override {
assert(is_contained(operands(), Op) &&		assert(is_contained(operands(), Op) &&
"Op must be an operand of the recipe");		"Op must be an operand of the recipe");
return true;		return true;
}		}
};		};

		/// A recipe for generating the active lane mask for the vector loop that is
		/// used to predicate the vector operations.
		/// TODO: It would be good to use the existing VPWidenPHIRecipe instead and
		fhahn: nit: TODO/FIXME to replace this by the more general VPWidenPHIRecipe?
		/// remove VPActiveLaneMaskPHIRecipe.
		class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
		fhahn: TC would need to be an operand instead of just a member variable, otherwise things will break if something does `TC->replaceAllUsesWith()` outside.
		DebugLoc DL;

		public:
		VPActiveLaneMaskPHIRecipe(VPValue *StartMask, DebugLoc DL)
		: VPHeaderPHIRecipe(VPValue::VPVActiveLaneMaskPHISC,
		VPActiveLaneMaskPHISC, nullptr, StartMask),
		DL(DL) {}

		~VPActiveLaneMaskPHIRecipe() override = default;

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPDef *D) {
		return D->getVPDefID() == VPActiveLaneMaskPHISC;
		}
		static inline bool classof(const VPHeaderPHIRecipe *D) {
		return D->getVPDefID() == VPActiveLaneMaskPHISC;
		}
		static inline bool classof(const VPValue *V) {
		return V->getVPValueID() == VPValue::VPVActiveLaneMaskPHISC;
		}

		/// Generate the active lane mask phi of the vector loop.
		void execute(VPTransformState &State) override;

		#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
		/// Print the recipe.
		void print(raw_ostream &O, const Twine &Indent,
		VPSlotTracker &SlotTracker) const override;
		#endif
		};

/// A Recipe for widening the canonical induction variable of the vector loop.		/// A Recipe for widening the canonical induction variable of the vector loop.
class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {		class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
public:		public:
VPWidenCanonicalIVRecipe(VPCanonicalIVPHIRecipe *CanonicalIV)		VPWidenCanonicalIVRecipe(VPCanonicalIVPHIRecipe *CanonicalIV)
: VPRecipeBase(VPWidenCanonicalIVSC, {CanonicalIV}),		: VPRecipeBase(VPWidenCanonicalIVSC, {CanonicalIV}),
VPValue(VPValue::VPVWidenCanonicalIVSC, nullptr, this) {}		VPValue(VPValue::VPVWidenCanonicalIVSC, nullptr, this) {}

~VPWidenCanonicalIVRecipe() override = default;		~VPWidenCanonicalIVRecipe() override = default;
(591 lines hidden)	class VPlan {
/// Represents the trip count of the original loop, for folding		/// Represents the trip count of the original loop, for folding
/// the tail.		/// the tail.
VPValue *TripCount = nullptr;		VPValue *TripCount = nullptr;

/// Represents the backedge taken count of the original loop, for folding		/// Represents the backedge taken count of the original loop, for folding
/// the tail. It equals TripCount - 1.		/// the tail. It equals TripCount - 1.
VPValue *BackedgeTakenCount = nullptr;		VPValue *BackedgeTakenCount = nullptr;

/// Represents the vector trip count.		/// Represents the vector trip count.
		paulwalker-arm: Could do with a comment.
VPValue VectorTripCount;		VPValue VectorTripCount;

/// Holds a mapping between Values and their corresponding VPValue inside		/// Holds a mapping between Values and their corresponding VPValue inside
/// VPlan.		/// VPlan.
Value2VPValueTy Value2VPValue;		Value2VPValueTy Value2VPValue;

/// Contains all VPValues that been allocated by addVPValue directly and need		/// Contains all VPValues that been allocated by addVPValue directly and need
/// to be free when the plan's destructor is called.		/// to be free when the plan's destructor is called.
(60 lines hidden)	public:
VPValue *getOrCreateBackedgeTakenCount() {		VPValue *getOrCreateBackedgeTakenCount() {
if (!BackedgeTakenCount)		if (!BackedgeTakenCount)
BackedgeTakenCount = new VPValue();		BackedgeTakenCount = new VPValue();
return BackedgeTakenCount;		return BackedgeTakenCount;
}		}

/// The vector trip count.		/// The vector trip count.
VPValue &getVectorTripCount() { return VectorTripCount; }		VPValue &getVectorTripCount() { return VectorTripCount; }

		fhahn: What about multiple VPActiveLaneMasks? Should the verifier check to make sure there's only a single one? It would probably also be good to place this after `getCanonicalIV`, which seems related.
/// Mark the plan to indicate that using Value2VPValue is not safe any		/// Mark the plan to indicate that using Value2VPValue is not safe any
/// longer, because it may be stale.		/// longer, because it may be stale.
		fhahn: Do we need to keep this as a member variable? I think at the moment we generally avoid having links to recipes here, as this requires transforms that may remove or replace recipes to be aware of this state (see the similar `getCanonicalIV`). It should be trivial to find the active lane mask phi recipe, if there is one (just look at the phi recipes of the header).
void disableValue2VPValue() { Value2VPValueEnabled = false; }		void disableValue2VPValue() { Value2VPValueEnabled = false; }

void addVF(ElementCount VF) { VFs.insert(VF); }		void addVF(ElementCount VF) { VFs.insert(VF); }

bool hasVF(ElementCount VF) { return VFs.count(VF); }		bool hasVF(ElementCount VF) { return VFs.count(VF); }

const std::string &getName() const { return Name; }		const std::string &getName() const { return Name; }

(92 lines hidden)	VPCanonicalIVPHIRecipe *getCanonicalIV() {
VPBasicBlock *EntryVPBB = getVectorLoopRegion()->getEntryBasicBlock();		VPBasicBlock *EntryVPBB = getVectorLoopRegion()->getEntryBasicBlock();
if (EntryVPBB->empty()) {		if (EntryVPBB->empty()) {
// VPlan native path.		// VPlan native path.
EntryVPBB = cast<VPBasicBlock>(EntryVPBB->getSingleSuccessor());		EntryVPBB = cast<VPBasicBlock>(EntryVPBB->getSingleSuccessor());
}		}
return cast<VPCanonicalIVPHIRecipe>(&*EntryVPBB->begin());		return cast<VPCanonicalIVPHIRecipe>(&*EntryVPBB->begin());
}		}

		/// Find and return the VPActiveLaneMaskPHIRecipe from the header - there
		/// be only one at most. If there isn't one, then return nullptr.
		VPActiveLaneMaskPHIRecipe *getActiveLaneMaskPhi();

void addLiveOut(PHINode *PN, VPValue *V);		void addLiveOut(PHINode *PN, VPValue *V);

void clearLiveOuts() {		void clearLiveOuts() {
for (auto &KV : LiveOuts)		for (auto &KV : LiveOuts)
delete KV.second;		delete KV.second;
LiveOuts.clear();		LiveOuts.clear();
}		}

(350 lines hidden)

llvm/lib/Transforms/Vectorize/VPlan.cpp

(562 lines hidden)	for (auto *BlockBase : depth_first(Entry)) {
BlockBase->print(O, NewIndent, SlotTracker);		BlockBase->print(O, NewIndent, SlotTracker);
}		}
O << Indent << "}\n";		O << Indent << "}\n";

printSuccessors(O, Indent);		printSuccessors(O, Indent);
}		}
#endif		#endif

		VPActiveLaneMaskPHIRecipe *VPlan::getActiveLaneMaskPhi() {
		fhahn: nit: stray whitespace.
		VPBasicBlock *Header = getVectorLoopRegion()->getEntryBasicBlock();
		for (VPRecipeBase &R : Header->phis()) {
		if (isa<VPActiveLaneMaskPHIRecipe>(&R))
		return cast<VPActiveLaneMaskPHIRecipe>(&R);
		}
		return nullptr;
		}

		static bool canSimplifyBranchOnCond(VPInstruction *Term) {
		paulwalker-arm: I know it'll get folded away by later passes, but would it be wrong to handle the `Part == 0` case here, so that we don't litter the loop vectorize tests with pointless adds of zero? I see `CanonicalIVIncrement` doing similar for the non-zero parts, so I'll take that as precedent for doing such common-sense early optimisation.
		VPInstruction *Not = dyn_cast<VPInstruction>(Term->getOperand(0));
		paulwalker-arm: This looks to be copied from `case VPInstruction::CanonicalIVIncrement` but it's not really correct within the context of this opcode.
		if (!Not || Not->getOpcode() != VPInstruction::Not)
		return false;

		VPInstruction *ALM = dyn_cast<VPInstruction>(Not->getOperand(0));
		return ALM && ALM->getOpcode() == VPInstruction::ActiveLaneMask;
		}

void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,		void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
Value *CanonicalIVStartValue,		Value *CanonicalIVStartValue,
VPTransformState &State,		VPTransformState &State,
bool IsEpilogueVectorization) {		bool IsEpilogueVectorization) {

VPBasicBlock *ExitingVPBB = getVectorLoopRegion()->getExitingBasicBlock();		VPBasicBlock *ExitingVPBB = getVectorLoopRegion()->getExitingBasicBlock();
		sdesmalen: nit: s/CanonicalIVIncrementParts/CanonicalIVPIncrementForPart/ ?
auto *Term = dyn_cast<VPInstruction>(&ExitingVPBB->back());		auto *Term = dyn_cast<VPInstruction>(&ExitingVPBB->back());
// Try to simplify BranchOnCount to 'BranchOnCond true' if TC <= VF * UF when		// Try to simplify the branch condition if TC <= VF * UF when preparing to
		fhahn: Is this covered by a test? The comment and logic above apply specifically to `BranchOnCount`. It is not entirely clear to me that it would be safe for any `BranchOnCond`?
		david-arm: It is covered by sve-low-trip-count.ll, since we create BranchOnCond using the active lane mask. I did this after you suggested in earlier review comments to use BranchOnCond instead of BranchOnActiveLaneMask, because we don't want to regress for low trip counts. We also don't have access to the TripCount in addCanonicalIVRecipes, so I can't use that to determine what type of Branch instruction to use. It's a shame that there is no easy way to add a flag to the VPInstruction to mark it as safe for optimisation here. For example, it would be nice to do something like `(Term->getOpcode() == VPInstruction::BranchOnCount || (Term->getOpcode() == VPInstruction::BranchOnCond && Term->isBranchSafeToOptimise()))`
		fhahn: Thanks, I don't think a flag is actually needed. IIUC it should be sufficient to check if the operand of the BranchOnCond is fed by the current active lane mask and update the comment?
		david-arm: OK, it's more complicated than that though, since we have to check for not(active_lane_mask).
		david-arm: I've added explicit checks for the right flavour of BranchOnCond now, although the logic is considerably more complicated now. :)
		fhahn: nit: comment still needs updating?
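		Illustration (not part of the patch): a minimal standalone C++ sketch of why the low-trip-count simplification discussed in this thread also holds for the Not(ActiveLaneMask) flavour of BranchOnCond. It is not LLVM code. The function names are hypothetical, and the sketch assumes the first lane of the next-iteration mask is active iff the number of elements already processed is below the trip count. For scalable vectors the real element count per iteration is vscale times the known minimum, so it can only be larger than the value used in the guard.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the shape of the guard in the diff: a known, non-zero trip count
// that does not exceed the minimum number of elements handled by one vector
// iteration (MinVF lanes times UF unrolled parts).
bool singleIterationSuffices(std::uint64_t TC, std::uint64_t MinVF,
                             std::uint64_t UF) {
  return TC != 0 && TC <= MinVF * UF;
}

// First lane of the mask computed for the next iteration, after one vector
// iteration has processed ElementsPerIteration elements starting at index 0.
bool firstLaneOfNextMask(std::uint64_t TC, std::uint64_t ElementsPerIteration) {
  return ElementsPerIteration < TC;
}
```

		Whenever `singleIterationSuffices` holds, `firstLaneOfNextMask` is false for every vscale of at least 1, so the back-edge is never taken and the terminator can be replaced by an unconditional exit.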
// preparing to execute the plan for the main vector loop.		// execute the plan for the main vector loop. We only do this if the
if (!IsEpilogueVectorization && Term &&		// terminator is:
Term->getOpcode() == VPInstruction::BranchOnCount &&		// 1. BranchOnCount, or
isa<ConstantInt>(TripCountV)) {		// 2. BranchOnCond where the input is Not(ActiveLaneMask).
		if (!IsEpilogueVectorization && Term && isa<ConstantInt>(TripCountV) &&
		(Term->getOpcode() == VPInstruction::BranchOnCount ||
		(Term->getOpcode() == VPInstruction::BranchOnCond &&
		paulwalker-arm: no active lanes?
		canSimplifyBranchOnCond(Term)))) {
ConstantInt *C = cast<ConstantInt>(TripCountV);		ConstantInt *C = cast<ConstantInt>(TripCountV);
		sdesmalen: nit: IV is probably not a suitable name, because it's the active lane mask, not the induction variable.
uint64_t TCVal = C->getZExtValue();		uint64_t TCVal = C->getZExtValue();
if (TCVal && TCVal <= State.VF.getKnownMinValue() * State.UF) {		if (TCVal && TCVal <= State.VF.getKnownMinValue() * State.UF) {
auto *BOC =		auto *BOC =
new VPInstruction(VPInstruction::BranchOnCond,		new VPInstruction(VPInstruction::BranchOnCond,
{getOrAddExternalDef(State.Builder.getTrue())});		{getOrAddExternalDef(State.Builder.getTrue())});
Term->eraseFromParent();		Term->eraseFromParent();
ExitingVPBB->appendRecipe(BOC);		ExitingVPBB->appendRecipe(BOC);
// TODO: Further simplifications are possible		// TODO: Further simplifications are possible
(102 lines hidden)	for (VPRecipeBase &R : Header->phis()) {

auto *PhiR = cast<VPHeaderPHIRecipe>(&R);		auto *PhiR = cast<VPHeaderPHIRecipe>(&R);
// For canonical IV, first-order recurrences and in-order reduction phis,		// For canonical IV, first-order recurrences and in-order reduction phis,
// only a single part is generated, which provides the last part from the		// only a single part is generated, which provides the last part from the
// previous iteration. For non-ordered reductions all UF parts are		// previous iteration. For non-ordered reductions all UF parts are
// generated.		// generated.
bool SinglePartNeeded = isa<VPCanonicalIVPHIRecipe>(PhiR) ||		bool SinglePartNeeded = isa<VPCanonicalIVPHIRecipe>(PhiR) ||
isa<VPFirstOrderRecurrencePHIRecipe>(PhiR) ||		isa<VPFirstOrderRecurrencePHIRecipe>(PhiR) ||
cast<VPReductionPHIRecipe>(PhiR)->isOrdered();		(isa<VPReductionPHIRecipe>(PhiR) &&
		cast<VPReductionPHIRecipe>(PhiR)->isOrdered());
unsigned LastPartForNewPhi = SinglePartNeeded ? 1 : State->UF;		unsigned LastPartForNewPhi = SinglePartNeeded ? 1 : State->UF;

for (unsigned Part = 0; Part < LastPartForNewPhi; ++Part) {		for (unsigned Part = 0; Part < LastPartForNewPhi; ++Part) {
Value *Phi = State->get(PhiR, Part);		Value *Phi = State->get(PhiR, Part);
Value *Val = State->get(PhiR->getBackedgeValue(),		Value *Val = State->get(PhiR->getBackedgeValue(),
SinglePartNeeded ? State->UF - 1 : Part);		SinglePartNeeded ? State->UF - 1 : Part);
cast<PHINode>(Phi)->addIncoming(Val, VectorLatchBB);		cast<PHINode>(Phi)->addIncoming(Val, VectorLatchBB);
}		}
}		}

#endif		#endif

template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);		template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);

void VPValue::replaceAllUsesWith(VPValue *New) {		void VPValue::replaceAllUsesWith(VPValue *New) {
for (unsigned J = 0; J < getNumUsers();) {		for (unsigned J = 0; J < getNumUsers();) {
VPUser *User = Users[J];		VPUser *User = Users[J];
		fhahn (Done): Can `Builder.CreatePhi` be used without the need to manually pick the insertion point?
unsigned NumUsers = getNumUsers();		unsigned NumUsers = getNumUsers();
		sdesmalen (Done): Is this extra scope necessary? There are no uses of IR builder after this scope, so I think you can just as well use the scope of the loop body for the InsertPointGuard.
		david-arm (Author, Done): Hi @fhahn, if I use VPWidenPHIRecipe instead it's not obvious how I add the incoming start mask. It looks like VPWidenPHIRecipe::execute doesn't add it - do you know how?
		fhahn (Not Done): I think D128937 should do the trick, together with the snippet below. If D128937 allows `VPWidenPHIRecipe` to be used here I'll send it for review.
		diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
		index d8fb7a9c12b0..1ba76bf52cc6 100644
		--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
		+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
		@@ -3646,8 +3646,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
		   truncateToMinimalBitwidths(State);
		 
		   // Fix widened non-induction PHIs by setting up the PHI operands.
		-  if (EnableVPlanNativePath)
		-    fixNonInductionPHIs(Plan, State);
		+  fixNonInductionPHIs(Plan, State);
		 
		   // At this point every instruction in the original loop is widened to a
		   // vector form. Now we need to fix the recurrences in the loop. These PHI
		david-arm (Author, Done): Hi @fhahn, if you're happy with everything else in the patch, would you be ok with submitting this patch as it is now, then look into using VPWidenPHIRecipe as a follow-on piece of refactoring? We're quite keen to get this work submitted soonish so we can get sufficient testing in the run-up to the LLVM 15 release. We also have quite a few other pieces of work on-going that depend upon this too.
		fhahn (Done): Ok sounds good to me, could you please add a FIXME/TODO to the code in the meantime?
		> We also have quite a few other pieces of work on-going that depend upon this too.
		If other stuff depends on the patch it might be helpful to link them together as patch stack, to get a better idea on the wider impact.
for (unsigned I = 0, E = User->getNumOperands(); I < E; ++I)		for (unsigned I = 0, E = User->getNumOperands(); I < E; ++I)
if (User->getOperand(I) == this)		if (User->getOperand(I) == this)
		paulwalker-arm (Done): Why is this here? and not where `StartMask` is assigned a couple of lines below.
User->setOperand(I, New);		User->setOperand(I, New);
// If a user got removed after updating the current user, the next user to		// If a user got removed after updating the current user, the next user to
		fhahn (Done): The start value should be created in the preheader block in VPlan instead of in the phi node here. Then there should also be no need for a TC argument?
		david-arm (Author, Done): Hi @fhahn, I'm not really sure what you mean here. Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block? If so, then VPInstruction::ActiveLaneMask needs a trip count operand. Or do you mean I should deviate from the expected recipe creation somehow in a VPlan function such as prepareToExecute by manually adding the incoming value?
		fhahn (Done):
		> Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block?
		Yes exactly.
		> If so, then VPInstruction::ActiveLaneMask needs a trip count operand.
		Looking at `VPInstruction::generateInstruction`, it looks like it already takes a trip count operand? The first operand can be set to zero by wrapping a zero constant in a VPValue.
		case VPInstruction::ActiveLaneMask: {
		  // Get first lane of vector induction variable.
		  Value *VIVElem0 = State.get(getOperand(0), VPIteration(Part, 0));
		  // Get the original loop tripcount.
		  Value *ScalarTC = State.get(getOperand(1), Part);
// update will be moved to the current position, so we only need to		// update will be moved to the current position, so we only need to
// increment the index if the number of users did not change.		// increment the index if the number of users did not change.
if (NumUsers == getNumUsers())		if (NumUsers == getNumUsers())
J++;		J++;
}		}
}		}
		fhahn (Done): Should this also print the operands here?

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)		#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
void VPValue::printAsOperand(raw_ostream &OS, VPSlotTracker &Tracker) const {		void VPValue::printAsOperand(raw_ostream &OS, VPSlotTracker &Tracker) const {
if (const Value *UV = getUnderlyingValue()) {		if (const Value *UV = getUnderlyingValue()) {
OS << "ir<";		OS << "ir<";
UV->printAsOperand(OS, false);		UV->printAsOperand(OS, false);
OS << ">";		OS << ">";
return;		return;

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

if (Part == 0) {
Next = Builder.CreateAdd(Phi, Step, Name, IsNUW, false);		Next = Builder.CreateAdd(Phi, Step, Name, IsNUW, false);
} else {		} else {
Next = State.get(this, 0);		Next = State.get(this, 0);
}		}

State.set(this, Next, Part);		State.set(this, Next, Part);
break;		break;
}		}

		case VPInstruction::CanonicalIVIncrementForPart:
		case VPInstruction::CanonicalIVIncrementForPartNUW: {
		bool IsNUW = getOpcode() == VPInstruction::CanonicalIVIncrementForPartNUW;
		auto *IV = State.get(getOperand(0), VPIteration(0, 0));
		if (Part == 0) {
		State.set(this, IV, Part);
		break;
		}

		// The canonical IV is incremented by the vectorization factor (num of SIMD
		// elements) times the unroll part.
		Value *Step = createStepForVF(Builder, IV->getType(), State.VF, Part);
		Value *Next = Builder.CreateAdd(IV, Step, Name, IsNUW, false);
		State.set(this, Next, Part);
		break;
		}
case VPInstruction::BranchOnCond: {		case VPInstruction::BranchOnCond: {
if (Part != 0)		if (Part != 0)
break;		break;

Value *Cond = State.get(getOperand(0), VPIteration(Part, 0));		Value *Cond = State.get(getOperand(0), VPIteration(Part, 0));
VPRegionBlock *ParentRegion = getParent()->getParent();		VPRegionBlock *ParentRegion = getParent()->getParent();
VPBasicBlock *Header = ParentRegion->getEntryBasicBlock();		VPBasicBlock *Header = ParentRegion->getEntryBasicBlock();

case VPInstruction::CanonicalIVIncrement:
O << "VF * UF + ";		O << "VF * UF + ";
break;		break;
case VPInstruction::CanonicalIVIncrementNUW:		case VPInstruction::CanonicalIVIncrementNUW:
O << "VF * UF +(nuw) ";		O << "VF * UF +(nuw) ";
break;		break;
case VPInstruction::BranchOnCond:		case VPInstruction::BranchOnCond:
O << "branch-on-cond";		O << "branch-on-cond";
break;		break;
		case VPInstruction::CanonicalIVIncrementForPart:
		O << "VF * Part + ";
		break;
		case VPInstruction::CanonicalIVIncrementForPartNUW:
		O << "VF * Part +(nuw) ";
		break;
case VPInstruction::BranchOnCount:		case VPInstruction::BranchOnCount:
O << "branch-on-count ";		O << "branch-on-count ";
break;		break;
default:		default:
O << Instruction::getOpcodeName(getOpcode());		O << Instruction::getOpcodeName(getOpcode());
}		}

O << FMF;		O << FMF;
if (getNumOperands() != OriginalPhi->getNumOperands()) {
return;		return;
}		}

printAsOperand(O, SlotTracker);		printAsOperand(O, SlotTracker);
O << " = phi ";		O << " = phi ";
printOperands(O, SlotTracker);		printOperands(O, SlotTracker);
}		}
#endif		#endif

		// TODO: It would be good to use the existing VPWidenPHIRecipe instead and
		// remove VPActiveLaneMaskPHIRecipe.
		void VPActiveLaneMaskPHIRecipe::execute(VPTransformState &State) {
		BasicBlock *VectorPH = State.CFG.getPreheaderBBFor(this);
		for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part) {
		Value *StartMask = State.get(getOperand(0), Part);
		PHINode *EntryPart =
		State.Builder.CreatePHI(StartMask->getType(), 2, "active.lane.mask");
		EntryPart->addIncoming(StartMask, VectorPH);
		EntryPart->setDebugLoc(DL);
		State.set(this, EntryPart, Part);
		}
		}

		#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
		void VPActiveLaneMaskPHIRecipe::print(raw_ostream &O, const Twine &Indent,
		VPSlotTracker &SlotTracker) const {
		O << Indent << "ACTIVE-LANE-MASK-PHI ";

		printAsOperand(O, SlotTracker);
		O << " = phi ";
		printOperands(O, SlotTracker);
		}
		#endif
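The `CanonicalIVIncrementForPart` case earlier in this file gives unroll part 0 the canonical IV unchanged and every later part `IV + VF * Part`. A rough sketch of that arithmetic, assuming a fixed-width VF and plain integers in place of IR values (`partStartIndices` is a hypothetical name used only for illustration):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fixed-width model: for each unroll part, return the index of
// that part's first lane. Part 0 reuses the canonical IV as-is; later parts
// are offset by VF * Part, matching the printed form "VF * Part + <iv>".
std::vector<uint64_t> partStartIndices(uint64_t IV, uint64_t VF, unsigned UF) {
  std::vector<uint64_t> Starts;
  Starts.reserve(UF);
  for (unsigned Part = 0; Part < UF; ++Part)
    Starts.push_back(Part == 0 ? IV : IV + VF * Part);
  return Starts;
}
```

In the patch the step is built with `createStepForVF`, so for scalable vectors the offset also scales with `vscale`; the sketch ignores that.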

llvm/lib/Transforms/Vectorize/VPlanValue.h

enum {
VPVWidenCallSC,		VPVWidenCallSC,
VPVWidenCanonicalIVSC,		VPVWidenCanonicalIVSC,
VPVWidenGEPSC,		VPVWidenGEPSC,
VPVWidenSelectSC,		VPVWidenSelectSC,

// Phi-like VPValues. Need to be kept together.		// Phi-like VPValues. Need to be kept together.
VPVBlendSC,		VPVBlendSC,
VPVCanonicalIVPHISC,		VPVCanonicalIVPHISC,
		VPVActiveLaneMaskPHISC,
VPVFirstOrderRecurrencePHISC,		VPVFirstOrderRecurrencePHISC,
VPVWidenPHISC,		VPVWidenPHISC,
VPVWidenIntOrFpInductionSC,		VPVWidenIntOrFpInductionSC,
VPVWidenPointerInductionSC,		VPVWidenPointerInductionSC,
VPVPredInstPHI,		VPVPredInstPHI,
VPVReductionPHISC,		VPVReductionPHISC,
};		};

using VPRecipeTy = enum {
VPWidenGEPSC,		VPWidenGEPSC,
VPWidenMemoryInstructionSC,		VPWidenMemoryInstructionSC,
VPWidenSC,		VPWidenSC,
VPWidenSelectSC,		VPWidenSelectSC,

// Phi-like recipes. Need to be kept together.		// Phi-like recipes. Need to be kept together.
VPBlendSC,		VPBlendSC,
VPCanonicalIVPHISC,		VPCanonicalIVPHISC,
		VPActiveLaneMaskPHISC,
VPFirstOrderRecurrencePHISC,		VPFirstOrderRecurrencePHISC,
VPWidenPHISC,		VPWidenPHISC,
VPWidenIntOrFpInductionSC,		VPWidenIntOrFpInductionSC,
VPWidenPointerInductionSC,		VPWidenPointerInductionSC,
VPPredInstPHISC,		VPPredInstPHISC,
VPReductionPHISC,		VPReductionPHISC,
VPFirstPHISC = VPBlendSC,		VPFirstPHISC = VPBlendSC,
VPLastPHISC = VPReductionPHISC,		VPLastPHISC = VPReductionPHISC,

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp

bool VPlanVerifier::verifyPlanIsValid(const VPlan &Plan) {
auto Iter = depth_first(		auto Iter = depth_first(
VPBlockRecursiveTraversalWrapper<const VPBlockBase *>(Plan.getEntry()));		VPBlockRecursiveTraversalWrapper<const VPBlockBase *>(Plan.getEntry()));
for (const VPBasicBlock *VPBB :		for (const VPBasicBlock *VPBB :
VPBlockUtils::blocksOnly<const VPBasicBlock>(Iter)) {		VPBlockUtils::blocksOnly<const VPBasicBlock>(Iter)) {
// Verify that phi-like recipes are at the beginning of the block, with no		// Verify that phi-like recipes are at the beginning of the block, with no
// other recipes in between.		// other recipes in between.
auto RecipeI = VPBB->begin();		auto RecipeI = VPBB->begin();
auto End = VPBB->end();		auto End = VPBB->end();
while (RecipeI != End && RecipeI->isPhi())		unsigned NumActiveLaneMaskPhiRecipes = 0;
		while (RecipeI != End && RecipeI->isPhi()) {
		if (isa<VPActiveLaneMaskPHIRecipe>(RecipeI))
		NumActiveLaneMaskPhiRecipes++;
RecipeI++;		RecipeI++;
		}

		if (NumActiveLaneMaskPhiRecipes > 1) {
		errs() << "There should be no more than one VPActiveLaneMaskPHIRecipe";
		return false;
		}

while (RecipeI != End) {		while (RecipeI != End) {
if (RecipeI->isPhi() && !isa<VPBlendRecipe>(&*RecipeI)) {		if (RecipeI->isPhi() && !isa<VPBlendRecipe>(&*RecipeI)) {
errs() << "Found phi-like recipe after non-phi recipe";		errs() << "Found phi-like recipe after non-phi recipe";

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)		#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
errs() << ": ";		errs() << ": ";
RecipeI->dump();		RecipeI->dump();
#endif

const VPBasicBlock *Exiting = dyn_cast<VPBasicBlock>(TopRegion->getExiting());		const VPBasicBlock *Exiting = dyn_cast<VPBasicBlock>(TopRegion->getExiting());
if (!Exiting) {		if (!Exiting) {
errs() << "VPlan exiting block is not a VPBasicBlock\n";		errs() << "VPlan exiting block is not a VPBasicBlock\n";
return false;		return false;
}		}

if (Exiting->empty()) {		if (Exiting->empty()) {
errs() << "VPlan vector loop exiting block must end with BranchOnCount "		errs() << "VPlan vector loop exiting block must end with BranchOnCount or "
"VPInstruction but is empty\n";		"BranchOnCond VPInstruction but is empty\n";
return false;		return false;
}		}

auto *LastInst = dyn_cast<VPInstruction>(std::prev(Exiting->end()));		auto *LastInst = dyn_cast<VPInstruction>(std::prev(Exiting->end()));
if (!LastInst || LastInst->getOpcode() != VPInstruction::BranchOnCount) {		if (!LastInst || (LastInst->getOpcode() != VPInstruction::BranchOnCount &&
errs() << "VPlan vector loop exit must end with BranchOnCount "		LastInst->getOpcode() != VPInstruction::BranchOnCond)) {
"VPInstruction\n";		errs() << "VPlan vector loop exit must end with BranchOnCount or "
		"BranchOnCond VPInstruction\n";
		fhahn (Done): This still references `BranchOnActiveLaneMask`?
return false;		return false;
}		}

for (const VPRegionBlock *Region :		for (const VPRegionBlock *Region :
VPBlockUtils::blocksOnly<const VPRegionBlock>(		VPBlockUtils::blocksOnly<const VPRegionBlock>(
depth_first(VPBlockRecursiveTraversalWrapper<const VPBlockBase *>(		depth_first(VPBlockRecursiveTraversalWrapper<const VPBlockBase *>(
Plan.getEntry())))) {		Plan.getEntry())))) {
if (Region->getEntry()->getNumPredecessors() != 0) {		if (Region->getEntry()->getNumPredecessors() != 0) {

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions-tf.ll

	; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilogue=predicate-dont-vectorize \			; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilogue=predicate-dont-vectorize \
	; RUN: -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S | FileCheck %s			; RUN: -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S | FileCheck %s

	define void @invariant_store_red_exit_is_phi(i32* %dst, i32* readonly %src, i64 %n) {			define void @invariant_store_red_exit_is_phi(i32* %dst, i32* readonly %src, i64 %n) {
	; CHECK-LABEL: @invariant_store_red_exit_is_phi(			; CHECK-LABEL: @invariant_store_red_exit_is_phi(
				; CHECK: vector.ph:
				; CHECK: %[[ACTIVE_LANE_MASK_ENTRY:.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 %n)
	; CHECK: vector.body:			; CHECK: vector.body:
				; CHECK: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1> [ %[[ACTIVE_LANE_MASK_ENTRY]], %vector.ph ], [ %[[ACTIVE_LANE_MASK_NEXT:.*]], %vector.body ]
				paulwalker-arm (Done): Given the structure has changed, is it worth showing how the entry active lane mask is constructed?
	; CHECK: %[[VEC_PHI:.*]] = phi <vscale x 4 x i32> [ zeroinitializer, %vector.ph ], [ %[[PREDPHI:.*]], %vector.body ]			; CHECK: %[[VEC_PHI:.*]] = phi <vscale x 4 x i32> [ zeroinitializer, %vector.ph ], [ %[[PREDPHI:.*]], %vector.body ]
	; CHECK: %[[ACTIVE_LANE_MASK:.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 {{%.*}}, i64 %n)
	; CHECK: %[[LOAD:.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32			; CHECK: %[[LOAD:.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32
	; CHECK-NEXT: %[[ADD:.*]] = add <vscale x 4 x i32> %[[VEC_PHI]], %[[LOAD]]			; CHECK-NEXT: %[[ADD:.*]] = add <vscale x 4 x i32> %[[VEC_PHI]], %[[LOAD]]
	; CHECK-NEXT: %[[SELECT:.*]] = select <vscale x 4 x i1> %[[ACTIVE_LANE_MASK]], <vscale x 4 x i32> %[[ADD]], <vscale x 4 x i32> %[[VEC_PHI]]			; CHECK-NEXT: %[[SELECT:.*]] = select <vscale x 4 x i1> %[[ACTIVE_LANE_MASK]], <vscale x 4 x i32> %[[ADD]], <vscale x 4 x i32> %[[VEC_PHI]]
				; CHECK: %[[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %{{.*}}, i64 %n)
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: %[[SUM:.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %[[SELECT]])			; CHECK-NEXT: %[[SUM:.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %[[SELECT]])
	; CHECK-NEXT: store i32 %[[SUM]], i32* %dst, align 4			; CHECK-NEXT: store i32 %[[SUM]], i32* %dst, align 4
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.inc			for.body: ; preds = %entry, %for.inc
	%red = phi i32 [ 0, %entry ], [ %storemerge, %for.body ]			%red = phi i32 [ 0, %entry ], [ %storemerge, %for.body ]

llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll

	; RUN: opt -loop-vectorize -S < %s | FileCheck %s			; RUN: opt -loop-vectorize -S < %s | FileCheck %s

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	define void @trip7_i64(i64* noalias nocapture noundef %dst, i64* noalias nocapture noundef readonly %src) #0 {			define void @trip7_i64(i64* noalias nocapture noundef %dst, i64* noalias nocapture noundef readonly %src) #0 {
	; CHECK-LABEL: @trip7_i64(			; CHECK-LABEL: @trip7_i64(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
	; CHECK: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 {{%.*}}, i64 7)			; CHECK: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ {{%.*}}, %vector.ph ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %vector.body ]
	; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)			; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)
	; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)			; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)
	; CHECK: call void @llvm.masked.store.nxv2i64.p0nxv2i64(<vscale x 2 x i64> {{%.*}}, <vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])			; CHECK: call void @llvm.masked.store.nxv2i64.p0nxv2i64(<vscale x 2 x i64> {{%.*}}, <vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
	; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()			; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 2			; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 2
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]
	; CHECK-NEXT: [[COND:%.*]] = icmp eq i64 [[INDEX_NEXT]], {{%.*}}			; CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 7)
				; CHECK-NEXT: [[ACTIVE_LANE_MASK_NOT:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i32 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)
				; CHECK-NEXT: [[COND:%.*]] = extractelement <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NOT]], i32 0
	; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body			; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%i.06 = phi i64 [ 0, %entry ], [ %inc, %for.body ]			%i.06 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
	%arrayidx = getelementptr inbounds i64, i64* %src, i64 %i.06			%arrayidx = getelementptr inbounds i64, i64* %src, i64 %i.06
	for.end: ; preds = %for.body			for.end: ; preds = %for.body
	ret void			ret void
	}			}

	define void @trip5_i8(i8* noalias nocapture noundef %dst, i8* noalias nocapture noundef readonly %src) #0 {			define void @trip5_i8(i8* noalias nocapture noundef %dst, i8* noalias nocapture noundef readonly %src) #0 {
	; CHECK-LABEL: @trip5_i8(			; CHECK-LABEL: @trip5_i8(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
	; CHECK: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 {{%.*}}, i64 5)			; CHECK: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ {{%.*}}, %vector.ph ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %vector.body ]
	; CHECK: {{%.*}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)			; CHECK: {{%.*}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
	; CHECK: {{%.*}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)			; CHECK: {{%.*}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
	; CHECK: call void @llvm.masked.store.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, <vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])			; CHECK: call void @llvm.masked.store.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, <vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
	; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()			; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 16			; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 16
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 5)
				; CHECK-NEXT: [[ACTIVE_LANE_MASK_NOT:%.*]] = xor <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 16 x i1> insertelement (<vscale x 16 x i1> poison, i1 true, i32 0), <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer)
	; CHECK-NEXT: br i1 true, label %middle.block, label %vector.body			; CHECK-NEXT: br i1 true, label %middle.block, label %vector.body
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]			%i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
	%arrayidx = getelementptr inbounds i8, i8* %src, i64 %i.08			%arrayidx = getelementptr inbounds i8, i8* %src, i64 %i.08

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

	; RUN: opt -S -loop-vectorize < %s | FileCheck %s			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; REQUIRES: asserts
				; RUN: opt -S -loop-vectorize -debug-only=loop-vectorize < %s 2>%t | FileCheck %s
				; RUN: cat %t | FileCheck %s --check-prefix=VPLANS

	; These tests ensure that tail-folding is enabled when the predicate.enable			; These tests ensure that tail-folding is enabled when the predicate.enable
	; loop attribute is set to true.			; loop attribute is set to true.

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

				; VPLANS-LABEL: Checking a loop in 'simple_memset'
				; VPLANS: VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2,vscale x 4},UF>=1' {
				; VPLANS-NEXT: vector.ph:
				; VPLANS-NEXT: EMIT vp<%2> = VF * Part + ir<0>
				; VPLANS-NEXT: EMIT vp<%3> = active lane mask vp<%2> <badref>
				; VPLANS-NEXT: Successor(s): vector loop
				; VPLANS-EMPTY:
				; VPLANS-NEXT: <x1> vector loop: {
				; VPLANS-NEXT: vector.body:
				; VPLANS-NEXT: EMIT vp<%4> = CANONICAL-INDUCTION
				; VPLANS-NEXT: ACTIVE-LANE-MASK-PHI vp<%5> = phi vp<%3>, vp<%10>
				; VPLANS-NEXT: vp<%6> = SCALAR-STEPS vp<%4>, ir<0>, ir<1>
				; VPLANS-NEXT: CLONE ir<%gep> = getelementptr ir<%ptr>, vp<%6>
				; VPLANS-NEXT: WIDEN store ir<%gep>, ir<%val>, vp<%5>
				; VPLANS-NEXT: EMIT vp<%8> = VF * UF + vp<%4>
				; VPLANS-NEXT: EMIT vp<%9> = VF * Part + vp<%8>
				; VPLANS-NEXT: EMIT vp<%10> = active lane mask vp<%9> <badref>
				; VPLANS-NEXT: EMIT vp<%11> = not vp<%10>
				; VPLANS-NEXT: EMIT branch-on-cond vp<%11>
				; VPLANS-NEXT: No successors
				; VPLANS-NEXT: }

	define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {			define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {
	; CHECK-LABEL: @simple_memset(			; CHECK-LABEL: @simple_memset(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)			; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
	; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[UMAX]]			; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[UMAX]]
	; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4			; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
	; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]			; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]
	; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])
	; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4			; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP14]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP14]]
	; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: [[TMP15:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP16:%.*]] = extractelement <vscale x 4 x i1> [[TMP15]], i32 0
				; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll

	; RUN: opt -loop-vectorize -S < %s | FileCheck %s			; RUN: opt -loop-vectorize -S < %s | FileCheck %s

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	define void @trip1024_i64(i64* noalias nocapture noundef %dst, i64* noalias nocapture noundef readonly %src) #0 {			define void @trip1024_i64(i64* noalias nocapture noundef %dst, i64* noalias nocapture noundef readonly %src) #0 {
	; CHECK-LABEL: @trip1024_i64(			; CHECK-LABEL: @trip1024_i64(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
				; CHECK-NEXT: [[TMP2:%.*]] = icmp ult i64 -1025, [[TMP1]]
				; CHECK-NEXT: br i1 [[TMP2]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 2
				; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 2
				; CHECK-NEXT: [[TMP7:%.*]] = sub i64 [[TMP6]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 1024, [[TMP7]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP4]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 1024)
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 {{%.*}}, i64 1024)			; CHECK-NEXT: [[ACTIVE_LANE_MASK1:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK3:%.*]], [[VECTOR_BODY]] ]
	; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)			; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX]], 0
	; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)			; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, i64* [[SRC:%.*]], i64 [[TMP8]]
	; CHECK: call void @llvm.masked.store.nxv2i64.p0nxv2i64(<vscale x 2 x i64> {{%.*}}, <vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i64, i64* [[TMP9]], i32 0
	; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP11:%.*]] = bitcast i64* [[TMP10]] to <vscale x 2 x i64>*
	; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 2			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* [[TMP11]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK1]], <vscale x 2 x i64> poison)
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]			; CHECK-NEXT: [[TMP12:%.*]] = shl nsw <vscale x 2 x i64> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
	; CHECK-NEXT: [[COND:%.*]] = icmp eq i64 [[INDEX_NEXT]], {{%.*}}			; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i64, i64* [[DST:%.*]], i64 [[TMP8]]
	; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i64, i64* [[TMP13]], i32 0
				; CHECK-NEXT: [[TMP15:%.*]] = bitcast i64* [[TMP14]] to <vscale x 2 x i64>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* [[TMP15]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK1]], <vscale x 2 x i64> poison)
				; CHECK-NEXT: [[TMP16:%.*]] = add nsw <vscale x 2 x i64> [[WIDE_MASKED_LOAD2]], [[TMP12]]
				; CHECK-NEXT: [[TMP17:%.*]] = bitcast i64* [[TMP14]] to <vscale x 2 x i64>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv2i64.p0nxv2i64(<vscale x 2 x i64> [[TMP16]], <vscale x 2 x i64>* [[TMP17]], i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK1]])
				; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP19]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK3]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 1024)
				; CHECK-NEXT: [[TMP20:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK3]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i32 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP21:%.*]] = extractelement <vscale x 2 x i1> [[TMP20]], i32 0
				; CHECK-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%i.06 = phi i64 [ 0, %entry ], [ %inc, %for.body ]			%i.06 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
	%arrayidx = getelementptr inbounds i64, i64* %src, i64 %i.06			%arrayidx = getelementptr inbounds i64, i64* %src, i64 %i.06
	%0 = load i64, i64* %arrayidx, align 8			%0 = load i64, i64* %arrayidx, align 8

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-unroll.ll

	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 16			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 16
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
				; CHECK-NEXT: [[INDEX_PART_NEXT:%.*]] = add i64 0, [[TMP10]]
				; CHECK-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 8
				; CHECK-NEXT: [[INDEX_PART_NEXT1:%.*]] = add i64 0, [[TMP12]]
				; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 12
				; CHECK-NEXT: [[INDEX_PART_NEXT2:%.*]] = add i64 0, [[TMP14]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK3:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK4:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT1]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK5:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT2]], i64 [[UMAX]])
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT11:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT5]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT12:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT11]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT13:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT7]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT14:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT13]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT9:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT15:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT10:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT9]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT16:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT15]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX6:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT17:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[ACTIVE_LANE_MASK7:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK22:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[ACTIVE_LANE_MASK8:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK3]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK23:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 4			; CHECK-NEXT: [[ACTIVE_LANE_MASK9:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK4]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK24:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP12:%.*]] = add i64 [[TMP11]], 0			; CHECK-NEXT: [[ACTIVE_LANE_MASK10:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK5]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK25:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP13:%.*]] = mul i64 [[TMP12]], 1			; CHECK-NEXT: [[TMP15:%.*]] = add i64 [[INDEX6]], 0
	; CHECK-NEXT: [[TMP14:%.*]] = add i64 [[INDEX1]], [[TMP13]]			; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
	; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 8			; CHECK-NEXT: [[TMP18:%.*]] = add i64 [[TMP17]], 0
	; CHECK-NEXT: [[TMP17:%.*]] = add i64 [[TMP16]], 0			; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 1
	; CHECK-NEXT: [[TMP18:%.*]] = mul i64 [[TMP17]], 1			; CHECK-NEXT: [[TMP20:%.*]] = add i64 [[INDEX6]], [[TMP19]]
	; CHECK-NEXT: [[TMP19:%.*]] = add i64 [[INDEX1]], [[TMP18]]			; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 8
	; CHECK-NEXT: [[TMP21:%.*]] = mul i64 [[TMP20]], 12			; CHECK-NEXT: [[TMP23:%.*]] = add i64 [[TMP22]], 0
	; CHECK-NEXT: [[TMP22:%.*]] = add i64 [[TMP21]], 0			; CHECK-NEXT: [[TMP24:%.*]] = mul i64 [[TMP23]], 1
	; CHECK-NEXT: [[TMP23:%.*]] = mul i64 [[TMP22]], 1			; CHECK-NEXT: [[TMP25:%.*]] = add i64 [[INDEX6]], [[TMP24]]
	; CHECK-NEXT: [[TMP24:%.*]] = add i64 [[INDEX1]], [[TMP23]]			; CHECK-NEXT: [[TMP26:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP27:%.*]] = mul i64 [[TMP26]], 12
	; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP14]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP28:%.*]] = add i64 [[TMP27]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK3:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP19]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP29:%.*]] = mul i64 [[TMP28]], 1
	; CHECK-NEXT: [[ACTIVE_LANE_MASK4:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP24]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP30:%.*]] = add i64 [[INDEX6]], [[TMP29]]
	; CHECK-NEXT: [[TMP25:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP31:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP15]]
	; CHECK-NEXT: [[TMP26:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP14]]			; CHECK-NEXT: [[TMP32:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP20]]
	; CHECK-NEXT: [[TMP27:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP19]]			; CHECK-NEXT: [[TMP33:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP25]]
	; CHECK-NEXT: [[TMP28:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP24]]			; CHECK-NEXT: [[TMP34:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP30]]
	; CHECK-NEXT: [[TMP29:%.*]] = getelementptr i32, i32* [[TMP25]], i32 0			; CHECK-NEXT: [[TMP35:%.*]] = getelementptr i32, i32* [[TMP31]], i32 0
	; CHECK-NEXT: [[TMP30:%.*]] = bitcast i32* [[TMP29]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP36:%.*]] = bitcast i32* [[TMP35]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP30]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP36]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK7]])
	; CHECK-NEXT: [[TMP31:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP37:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP32:%.*]] = mul i32 [[TMP31]], 4			; CHECK-NEXT: [[TMP38:%.*]] = mul i32 [[TMP37]], 4
	; CHECK-NEXT: [[TMP33:%.*]] = getelementptr i32, i32* [[TMP25]], i32 [[TMP32]]			; CHECK-NEXT: [[TMP39:%.*]] = getelementptr i32, i32* [[TMP31]], i32 [[TMP38]]
	; CHECK-NEXT: [[TMP34:%.*]] = bitcast i32* [[TMP33]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP40:%.*]] = bitcast i32* [[TMP39]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT6]], <vscale x 4 x i32>* [[TMP34]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT12]], <vscale x 4 x i32>* [[TMP40]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK8]])
	; CHECK-NEXT: [[TMP35:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP41:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP36:%.*]] = mul i32 [[TMP35]], 8			; CHECK-NEXT: [[TMP42:%.*]] = mul i32 [[TMP41]], 8
	; CHECK-NEXT: [[TMP37:%.*]] = getelementptr i32, i32* [[TMP25]], i32 [[TMP36]]			; CHECK-NEXT: [[TMP43:%.*]] = getelementptr i32, i32* [[TMP31]], i32 [[TMP42]]
	; CHECK-NEXT: [[TMP38:%.*]] = bitcast i32* [[TMP37]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP44:%.*]] = bitcast i32* [[TMP43]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT8]], <vscale x 4 x i32>* [[TMP38]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK3]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT14]], <vscale x 4 x i32>* [[TMP44]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK9]])
	; CHECK-NEXT: [[TMP39:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP45:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP40:%.*]] = mul i32 [[TMP39]], 12			; CHECK-NEXT: [[TMP46:%.*]] = mul i32 [[TMP45]], 12
	; CHECK-NEXT: [[TMP41:%.*]] = getelementptr i32, i32* [[TMP25]], i32 [[TMP40]]			; CHECK-NEXT: [[TMP47:%.*]] = getelementptr i32, i32* [[TMP31]], i32 [[TMP46]]
	; CHECK-NEXT: [[TMP42:%.*]] = bitcast i32* [[TMP41]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP48:%.*]] = bitcast i32* [[TMP47]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT10]], <vscale x 4 x i32>* [[TMP42]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT16]], <vscale x 4 x i32>* [[TMP48]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK10]])
	; CHECK-NEXT: [[TMP43:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP49:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP44:%.*]] = mul i64 [[TMP43]], 16			; CHECK-NEXT: [[TMP50:%.*]] = mul i64 [[TMP49]], 16
	; CHECK-NEXT: [[INDEX_NEXT11]] = add i64 [[INDEX1]], [[TMP44]]			; CHECK-NEXT: [[INDEX_NEXT17]] = add i64 [[INDEX6]], [[TMP50]]
	; CHECK-NEXT: [[TMP45:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC]]			; CHECK-NEXT: [[TMP51:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: br i1 [[TMP45]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: [[TMP52:%.*]] = mul i64 [[TMP51]], 4
				; CHECK-NEXT: [[INDEX_PART_NEXT19:%.*]] = add i64 [[INDEX_NEXT17]], [[TMP52]]
				; CHECK-NEXT: [[TMP53:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP54:%.*]] = mul i64 [[TMP53]], 8
				; CHECK-NEXT: [[INDEX_PART_NEXT20:%.*]] = add i64 [[INDEX_NEXT17]], [[TMP54]]
				; CHECK-NEXT: [[TMP55:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP56:%.*]] = mul i64 [[TMP55]], 12
				; CHECK-NEXT: [[INDEX_PART_NEXT21:%.*]] = add i64 [[INDEX_NEXT17]], [[TMP56]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK22]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT17]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK23]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT19]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK24]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT20]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK25]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT21]], i64 [[UMAX]])
				; CHECK-NEXT: [[TMP57:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK22]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP58:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK23]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP59:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK24]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP60:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK25]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP61:%.*]] = extractelement <vscale x 4 x i1> [[TMP57]], i32 0
				; CHECK-NEXT: br i1 [[TMP61]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	Show All 20 Lines
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 16			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 16
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
				; CHECK-NEXT: [[INDEX_PART_NEXT:%.*]] = add i64 0, [[TMP10]]
				; CHECK-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 8
				; CHECK-NEXT: [[INDEX_PART_NEXT1:%.*]] = add i64 0, [[TMP12]]
				; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 12
				; CHECK-NEXT: [[INDEX_PART_NEXT2:%.*]] = add i64 0, [[TMP14]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK3:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK4:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT1]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK5:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT2]], i64 [[UMAX]])
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT14:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT9:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT8]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT15:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT14]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT10:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT16:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT11:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT10]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT17:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT16]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT12:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT18:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT13:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT12]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT19:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT18]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT14:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX6:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT20:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[ACTIVE_LANE_MASK7:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK25:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[ACTIVE_LANE_MASK8:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK3]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK26:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 4			; CHECK-NEXT: [[ACTIVE_LANE_MASK9:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK4]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK27:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP12:%.*]] = add i64 [[TMP11]], 0			; CHECK-NEXT: [[ACTIVE_LANE_MASK10:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK5]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK28:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP13:%.*]] = mul i64 [[TMP12]], 1			; CHECK-NEXT: [[TMP15:%.*]] = add i64 [[INDEX6]], 0
	; CHECK-NEXT: [[TMP14:%.*]] = add i64 [[INDEX1]], [[TMP13]]			; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
	; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 8			; CHECK-NEXT: [[TMP18:%.*]] = add i64 [[TMP17]], 0
	; CHECK-NEXT: [[TMP17:%.*]] = add i64 [[TMP16]], 0			; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 1
	; CHECK-NEXT: [[TMP18:%.*]] = mul i64 [[TMP17]], 1			; CHECK-NEXT: [[TMP20:%.*]] = add i64 [[INDEX6]], [[TMP19]]
	; CHECK-NEXT: [[TMP19:%.*]] = add i64 [[INDEX1]], [[TMP18]]			; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 8
	; CHECK-NEXT: [[TMP21:%.*]] = mul i64 [[TMP20]], 12			; CHECK-NEXT: [[TMP23:%.*]] = add i64 [[TMP22]], 0
	; CHECK-NEXT: [[TMP22:%.*]] = add i64 [[TMP21]], 0			; CHECK-NEXT: [[TMP24:%.*]] = mul i64 [[TMP23]], 1
	; CHECK-NEXT: [[TMP23:%.*]] = mul i64 [[TMP22]], 1			; CHECK-NEXT: [[TMP25:%.*]] = add i64 [[INDEX6]], [[TMP24]]
	; CHECK-NEXT: [[TMP24:%.*]] = add i64 [[INDEX1]], [[TMP23]]			; CHECK-NEXT: [[TMP26:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP27:%.*]] = mul i64 [[TMP26]], 12
	; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP14]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP28:%.*]] = add i64 [[TMP27]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK3:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP19]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP29:%.*]] = mul i64 [[TMP28]], 1
	; CHECK-NEXT: [[ACTIVE_LANE_MASK4:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP24]], i64 [[UMAX]])			; CHECK-NEXT: [[TMP30:%.*]] = add i64 [[INDEX6]], [[TMP29]]
	; CHECK-NEXT: [[TMP25:%.*]] = getelementptr i32, i32* [[COND_PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP31:%.*]] = getelementptr i32, i32* [[COND_PTR:%.*]], i64 [[TMP15]]
	; CHECK-NEXT: [[TMP26:%.*]] = getelementptr i32, i32* [[COND_PTR]], i64 [[TMP14]]			; CHECK-NEXT: [[TMP32:%.*]] = getelementptr i32, i32* [[COND_PTR]], i64 [[TMP20]]
	; CHECK-NEXT: [[TMP27:%.*]] = getelementptr i32, i32* [[COND_PTR]], i64 [[TMP19]]			; CHECK-NEXT: [[TMP33:%.*]] = getelementptr i32, i32* [[COND_PTR]], i64 [[TMP25]]
	; CHECK-NEXT: [[TMP28:%.*]] = getelementptr i32, i32* [[COND_PTR]], i64 [[TMP24]]			; CHECK-NEXT: [[TMP34:%.*]] = getelementptr i32, i32* [[COND_PTR]], i64 [[TMP30]]
	; CHECK-NEXT: [[TMP29:%.*]] = getelementptr i32, i32* [[TMP25]], i32 0			; CHECK-NEXT: [[TMP35:%.*]] = getelementptr i32, i32* [[TMP31]], i32 0
	; CHECK-NEXT: [[TMP30:%.*]] = bitcast i32* [[TMP29]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP36:%.*]] = bitcast i32* [[TMP35]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP30]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP36]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK7]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP31:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP37:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP32:%.*]] = mul i32 [[TMP31]], 4			; CHECK-NEXT: [[TMP38:%.*]] = mul i32 [[TMP37]], 4
	; CHECK-NEXT: [[TMP33:%.*]] = getelementptr i32, i32* [[TMP25]], i32 [[TMP32]]			; CHECK-NEXT: [[TMP39:%.*]] = getelementptr i32, i32* [[TMP31]], i32 [[TMP38]]
	; CHECK-NEXT: [[TMP34:%.*]] = bitcast i32* [[TMP33]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP40:%.*]] = bitcast i32* [[TMP39]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP34]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD11:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP40]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK8]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP35:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP41:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP36:%.*]] = mul i32 [[TMP35]], 8			; CHECK-NEXT: [[TMP42:%.*]] = mul i32 [[TMP41]], 8
	; CHECK-NEXT: [[TMP37:%.*]] = getelementptr i32, i32* [[TMP25]], i32 [[TMP36]]			; CHECK-NEXT: [[TMP43:%.*]] = getelementptr i32, i32* [[TMP31]], i32 [[TMP42]]
	; CHECK-NEXT: [[TMP38:%.*]] = bitcast i32* [[TMP37]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP44:%.*]] = bitcast i32* [[TMP43]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD6:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP38]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK3]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD12:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP44]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK9]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP39:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP45:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP40:%.*]] = mul i32 [[TMP39]], 12			; CHECK-NEXT: [[TMP46:%.*]] = mul i32 [[TMP45]], 12
	; CHECK-NEXT: [[TMP41:%.*]] = getelementptr i32, i32* [[TMP25]], i32 [[TMP40]]			; CHECK-NEXT: [[TMP47:%.*]] = getelementptr i32, i32* [[TMP31]], i32 [[TMP46]]
	; CHECK-NEXT: [[TMP42:%.*]] = bitcast i32* [[TMP41]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP48:%.*]] = bitcast i32* [[TMP47]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD7:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP42]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD13:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP48]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK10]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP43:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], zeroinitializer			; CHECK-NEXT: [[TMP49:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], zeroinitializer
	; CHECK-NEXT: [[TMP44:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD5]], zeroinitializer			; CHECK-NEXT: [[TMP50:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD11]], zeroinitializer
	; CHECK-NEXT: [[TMP45:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD6]], zeroinitializer			; CHECK-NEXT: [[TMP51:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD12]], zeroinitializer
	; CHECK-NEXT: [[TMP46:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD7]], zeroinitializer			; CHECK-NEXT: [[TMP52:%.*]] = icmp ne <vscale x 4 x i32> [[WIDE_MASKED_LOAD13]], zeroinitializer
	; CHECK-NEXT: [[TMP47:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP53:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP15]]
	; CHECK-NEXT: [[TMP48:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP14]]			; CHECK-NEXT: [[TMP54:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP20]]
	; CHECK-NEXT: [[TMP49:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP19]]			; CHECK-NEXT: [[TMP55:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP25]]
	; CHECK-NEXT: [[TMP50:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP24]]			; CHECK-NEXT: [[TMP56:%.*]] = getelementptr i32, i32* [[PTR]], i64 [[TMP30]]
	; CHECK-NEXT: [[TMP51:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP43]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP57:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK7]], <vscale x 4 x i1> [[TMP49]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[TMP52:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i1> [[TMP44]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP58:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK8]], <vscale x 4 x i1> [[TMP50]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[TMP53:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK3]], <vscale x 4 x i1> [[TMP45]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP59:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK9]], <vscale x 4 x i1> [[TMP51]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[TMP54:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], <vscale x 4 x i1> [[TMP46]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP60:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK10]], <vscale x 4 x i1> [[TMP52]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[TMP55:%.*]] = getelementptr i32, i32* [[TMP47]], i32 0			; CHECK-NEXT: [[TMP61:%.*]] = getelementptr i32, i32* [[TMP53]], i32 0
	; CHECK-NEXT: [[TMP56:%.*]] = bitcast i32* [[TMP55]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP62:%.*]] = bitcast i32* [[TMP61]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP56]], i32 4, <vscale x 4 x i1> [[TMP51]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP62]], i32 4, <vscale x 4 x i1> [[TMP57]])
	; CHECK-NEXT: [[TMP57:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP63:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP58:%.*]] = mul i32 [[TMP57]], 4			; CHECK-NEXT: [[TMP64:%.*]] = mul i32 [[TMP63]], 4
	; CHECK-NEXT: [[TMP59:%.*]] = getelementptr i32, i32* [[TMP47]], i32 [[TMP58]]			; CHECK-NEXT: [[TMP65:%.*]] = getelementptr i32, i32* [[TMP53]], i32 [[TMP64]]
	; CHECK-NEXT: [[TMP60:%.*]] = bitcast i32* [[TMP59]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP66:%.*]] = bitcast i32* [[TMP65]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT9]], <vscale x 4 x i32>* [[TMP60]], i32 4, <vscale x 4 x i1> [[TMP52]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT15]], <vscale x 4 x i32>* [[TMP66]], i32 4, <vscale x 4 x i1> [[TMP58]])
	; CHECK-NEXT: [[TMP61:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP67:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP62:%.*]] = mul i32 [[TMP61]], 8			; CHECK-NEXT: [[TMP68:%.*]] = mul i32 [[TMP67]], 8
	; CHECK-NEXT: [[TMP63:%.*]] = getelementptr i32, i32* [[TMP47]], i32 [[TMP62]]			; CHECK-NEXT: [[TMP69:%.*]] = getelementptr i32, i32* [[TMP53]], i32 [[TMP68]]
	; CHECK-NEXT: [[TMP64:%.*]] = bitcast i32* [[TMP63]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP70:%.*]] = bitcast i32* [[TMP69]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT11]], <vscale x 4 x i32>* [[TMP64]], i32 4, <vscale x 4 x i1> [[TMP53]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT17]], <vscale x 4 x i32>* [[TMP70]], i32 4, <vscale x 4 x i1> [[TMP59]])
	; CHECK-NEXT: [[TMP65:%.*]] = call i32 @llvm.vscale.i32()			; CHECK-NEXT: [[TMP71:%.*]] = call i32 @llvm.vscale.i32()
	; CHECK-NEXT: [[TMP66:%.*]] = mul i32 [[TMP65]], 12			; CHECK-NEXT: [[TMP72:%.*]] = mul i32 [[TMP71]], 12
	; CHECK-NEXT: [[TMP67:%.*]] = getelementptr i32, i32* [[TMP47]], i32 [[TMP66]]			; CHECK-NEXT: [[TMP73:%.*]] = getelementptr i32, i32* [[TMP53]], i32 [[TMP72]]
	; CHECK-NEXT: [[TMP68:%.*]] = bitcast i32* [[TMP67]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP74:%.*]] = bitcast i32* [[TMP73]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT13]], <vscale x 4 x i32>* [[TMP68]], i32 4, <vscale x 4 x i1> [[TMP54]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT19]], <vscale x 4 x i32>* [[TMP74]], i32 4, <vscale x 4 x i1> [[TMP60]])
	; CHECK-NEXT: [[TMP69:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP75:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP70:%.*]] = mul i64 [[TMP69]], 16			; CHECK-NEXT: [[TMP76:%.*]] = mul i64 [[TMP75]], 16
	; CHECK-NEXT: [[INDEX_NEXT14]] = add i64 [[INDEX1]], [[TMP70]]			; CHECK-NEXT: [[INDEX_NEXT20]] = add i64 [[INDEX6]], [[TMP76]]
	; CHECK-NEXT: [[TMP71:%.*]] = icmp eq i64 [[INDEX_NEXT14]], [[N_VEC]]			; CHECK-NEXT: [[TMP77:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: br i1 [[TMP71]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]			; CHECK-NEXT: [[TMP78:%.*]] = mul i64 [[TMP77]], 4
				; CHECK-NEXT: [[INDEX_PART_NEXT22:%.*]] = add i64 [[INDEX_NEXT20]], [[TMP78]]
				; CHECK-NEXT: [[TMP79:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP80:%.*]] = mul i64 [[TMP79]], 8
				; CHECK-NEXT: [[INDEX_PART_NEXT23:%.*]] = add i64 [[INDEX_NEXT20]], [[TMP80]]
				; CHECK-NEXT: [[TMP81:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP82:%.*]] = mul i64 [[TMP81]], 12
				; CHECK-NEXT: [[INDEX_PART_NEXT24:%.*]] = add i64 [[INDEX_NEXT20]], [[TMP82]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK25]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT20]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK26]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT22]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK27]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT23]], i64 [[UMAX]])
				; CHECK-NEXT: [[ACTIVE_LANE_MASK28]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT24]], i64 [[UMAX]])
				; CHECK-NEXT: [[TMP83:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK25]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP84:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK26]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP85:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK27]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP86:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK28]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP87:%.*]] = extractelement <vscale x 4 x i1> [[TMP83]], i32 0
				; CHECK-NEXT: br i1 [[TMP87]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.end ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.end ], [ 0, %entry ]

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll

	; RUN: opt -S -hints-allow-reordering=false -loop-vectorize -prefer-predicate-over-epilogue=predicate-dont-vectorize -prefer-inloop-reductions < %s \| FileCheck %s			; RUN: opt -S -hints-allow-reordering=false -loop-vectorize -prefer-predicate-over-epilogue=predicate-dont-vectorize -prefer-inloop-reductions < %s \| FileCheck %s
	; RUN: opt -S -hints-allow-reordering=false -loop-vectorize -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -prefer-inloop-reductions < %s \| FileCheck %s			; RUN: opt -S -hints-allow-reordering=false -loop-vectorize -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -prefer-inloop-reductions < %s \| FileCheck %s

	; CHECK-NOT: vector.body:

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"
				paulwalker-arm: What's this protecting?


	define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {			define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {
	; CHECK-LABEL: @simple_memset(			; CHECK-LABEL: @simple_memset(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)			; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
	; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[UMAX]]			; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[UMAX]]
	; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4			; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
	; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]			; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]
	; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])
	; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4			; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP14]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP14]]
	; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: [[TMP15:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP16:%.*]] = extractelement <vscale x 4 x i1> [[TMP15]], i32 0
				; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	%gep = getelementptr i32, i32* %ptr, i64 %index			%gep = getelementptr i32, i32* %ptr, i64 %index
	store i32 %val, i32* %gep			store i32 %val, i32* %gep
	%index.next = add nsw i64 %index, 1			%index.next = add nsw i64 %index, 1
	%cmp10 = icmp ult i64 %index.next, %n			%cmp10 = icmp ult i64 %index.next, %n
	br i1 %cmp10, label %while.body, label %while.end.loopexit, !llvm.loop !0			br i1 %cmp10, label %while.body, label %while.end.loopexit, !llvm.loop !0

	while.end.loopexit: ; preds = %while.body			while.end.loopexit: ; preds = %while.body
	ret void			ret void
	}			}


				define void @simple_memset_v4i32(i32 %val, i32* %ptr, i64 %n) #0 {
				; CHECK-LABEL: @simple_memset_v4i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], 3
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 [[UMAX]])
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[VAL:%.*]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i32, i32* [[TMP1]], i32 0
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*
				; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[BROADCAST_SPLAT]], <4 x i32>* [[TMP3]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK2]])
				; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], 4
				; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
				; CHECK-NEXT: [[TMP4:%.*]] = xor <4 x i1> [[ACTIVE_LANE_MASK4]], <i1 true, i1 true, i1 true, i1 true>
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i1> [[TMP4]], i32 0
				; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%gep = getelementptr i32, i32* %ptr, i64 %index
				store i32 %val, i32* %gep
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit, !llvm.loop !3

				while.end.loopexit: ; preds = %while.body
				ret void
				}


	define void @simple_memcpy(i32* noalias %dst, i32* noalias %src, i64 %n) #0 {			define void @simple_memcpy(i32* noalias %dst, i32* noalias %src, i64 %n) #0 {
	; CHECK-LABEL: @simple_memcpy(			; CHECK-LABEL: @simple_memcpy(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)			; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
	; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[UMAX]]			; CHECK-NEXT: [[TMP2:%.*]] = sub i64 -1, [[UMAX]]
	; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4			; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
	; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]			; CHECK-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]
	; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[SRC:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[SRC:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, i32* [[DST:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, i32* [[DST:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[TMP13]], i32 0			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[TMP13]], i32 0
	; CHECK-NEXT: [[TMP15:%.*]] = bitcast i32* [[TMP14]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP15:%.*]] = bitcast i32* [[TMP14]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32>* [[TMP15]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32>* [[TMP15]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])
	; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4			; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP17]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP17]]
	; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]			; CHECK-NEXT: [[TMP18:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP19:%.*]] = extractelement <vscale x 4 x i1> [[TMP18]], i32 0
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 4			; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 4
	; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4			; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
	; CHECK-NEXT: [[TMP11:%.*]] = sub i64 [[TMP10]], 1			; CHECK-NEXT: [[TMP11:%.*]] = sub i64 [[TMP10]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP2]], [[TMP11]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP2]], [[TMP11]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP8]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP8]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
	; CHECK-NEXT: [[IND_END:%.*]] = mul i64 [[N_VEC]], 4			; CHECK-NEXT: [[IND_END:%.*]] = mul i64 [[N_VEC]], 4
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
	; CHECK-NEXT: [[TMP12:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()			; CHECK-NEXT: [[TMP12:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
	; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i64> [[TMP12]], zeroinitializer			; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i64> [[TMP12]], zeroinitializer
	; CHECK-NEXT: [[TMP14:%.*]] = mul <vscale x 4 x i64> [[TMP13]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 4, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-NEXT: [[TMP14:%.*]] = mul <vscale x 4 x i64> [[TMP13]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 4, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP14]]			; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP14]]
	; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4			; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
	; CHECK-NEXT: [[TMP17:%.*]] = mul i64 4, [[TMP16]]			; CHECK-NEXT: [[TMP17:%.*]] = mul i64 4, [[TMP16]]
	; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP17]], i32 0			; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP17]], i32 0
	; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0			; CHECK-NEXT: [[TMP18:%.*]] = getelementptr i32, i32* [[SRC:%.*]], <vscale x 4 x i64> [[VEC_IND]]
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[TMP18]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> undef)
	; CHECK-NEXT: [[TMP18:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()			; CHECK-NEXT: [[TMP19:%.*]] = getelementptr i32, i32* [[DST:%.*]], <vscale x 4 x i64> [[VEC_IND]]
	; CHECK-NEXT: [[TMP19:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP18]]			; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_GATHER]], <vscale x 4 x i32*> [[TMP19]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])
	; CHECK-NEXT: [[VEC_IV:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP19]]			; CHECK-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP20:%.*]] = extractelement <vscale x 4 x i64> [[VEC_IV]], i32 0			; CHECK-NEXT: [[TMP21:%.*]] = mul i64 [[TMP20]], 4
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP20]], i64 [[TMP2]])			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP21]]
	; CHECK-NEXT: [[TMP21:%.*]] = getelementptr i32, i32* [[SRC:%.*]], <vscale x 4 x i64> [[VEC_IND]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[TMP2]])
	; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[TMP21]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> undef)			; CHECK-NEXT: [[TMP22:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK-NEXT: [[TMP22:%.*]] = getelementptr i32, i32* [[DST:%.*]], <vscale x 4 x i64> [[VEC_IND]]
	; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_GATHER]], <vscale x 4 x i32*> [[TMP22]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
	; CHECK-NEXT: [[TMP23:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP24:%.*]] = mul i64 [[TMP23]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP24]]
	; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]			; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
	; CHECK-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[TMP23:%.*]] = extractelement <vscale x 4 x i1> [[TMP22]], i32 0
	; CHECK-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[IND:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[IND:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, i32* [[SRC:%.*]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]]			; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, i32* [[SRC:%.*]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]]
	; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[TMP13]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[TMP13]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> undef)
	; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[DST:%.*]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]]			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[DST:%.*]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]]
	; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_GATHER]], <vscale x 4 x i32*> [[TMP14]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_GATHER]], <vscale x 4 x i32*> [[TMP14]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])
	; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4			; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP16]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP16]]
	; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]			; CHECK-NEXT: [[TMP17:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP18:%.*]] = extractelement <vscale x 4 x i1> [[TMP17]], i32 0
				; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK1:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK2:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[N]])
	; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[SRC:%.*]], align 4			; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[SRC:%.*]], align 4
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP10]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP10]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32* [[TMP11]], i32 0			; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32* [[TMP11]], i32 0
	; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP12]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP12]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP13]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x i32>* [[TMP13]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK1]])
	; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP15:%.*]] = mul i64 [[TMP14]], 4			; CHECK-NEXT: [[TMP15:%.*]] = mul i64 [[TMP14]], 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP15]]			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP15]]
	; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK2]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[N]])
	; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]			; CHECK-NEXT: [[TMP16:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP17:%.*]] = extractelement <vscale x 4 x i1> [[TMP16]], i32 0
				; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
	;			;

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32*> poison, i32* [[SRC:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32*> poison, i32* [[SRC:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32*> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32*> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32*> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32*> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[N]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[COND:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[COND:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP13:%.*]] = icmp eq <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], zeroinitializer			; CHECK-NEXT: [[TMP13:%.*]] = icmp eq <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], zeroinitializer
	; CHECK-NEXT: [[TMP14:%.*]] = xor <vscale x 4 x i1> [[TMP13]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-NEXT: [[TMP14:%.*]] = xor <vscale x 4 x i1> [[TMP13]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK-NEXT: [[TMP15:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP14]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP15:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i1> [[TMP14]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[BROADCAST_SPLAT]], i32 4, <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[BROADCAST_SPLAT]], i32 4, <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> undef)
	; CHECK-NEXT: [[TMP16:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP13]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP16:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i1> [[TMP13]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP16]], <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32> [[WIDE_MASKED_GATHER]]			; CHECK-NEXT: [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP16]], <vscale x 4 x i32> zeroinitializer, <vscale x 4 x i32> [[WIDE_MASKED_GATHER]]
	; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP18:%.*]] = or <vscale x 4 x i1> [[TMP15]], [[TMP16]]			; CHECK-NEXT: [[TMP18:%.*]] = or <vscale x 4 x i1> [[TMP15]], [[TMP16]]
	; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP17]], i32 0			; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP17]], i32 0
	; CHECK-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <vscale x 4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[PREDPHI]], <vscale x 4 x i32>* [[TMP20]], i32 4, <vscale x 4 x i1> [[TMP18]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[PREDPHI]], <vscale x 4 x i32>* [[TMP20]], i32 4, <vscale x 4 x i1> [[TMP18]])
	; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4			; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP22]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP22]]
	; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[N]])
	; CHECK-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]			; CHECK-NEXT: [[TMP23:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP24:%.*]] = extractelement <vscale x 4 x i1> [[TMP23]], i32 0
				; CHECK-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
	;			;

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %if.end			for.body: ; preds = %entry, %if.end
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32*> poison, i32* [[DST:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32*> poison, i32* [[DST:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32*> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32*> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32*> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32*> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK1:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK2:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[N]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[SRC:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[SRC:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK1]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32*> [[BROADCAST_SPLAT]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32*> [[BROADCAST_SPLAT]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK1]])
	; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4			; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP14]]			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP14]]
	; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK2]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[N]])
	; CHECK-NEXT: br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]			; CHECK-NEXT: [[TMP15:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP16:%.*]] = extractelement <vscale x 4 x i1> [[TMP15]], i32 0
				; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
	;			;

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK5:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr float, float* [[SRC:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr float, float* [[SRC:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr float, float* [[DST:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr float, float* [[DST:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP12:%.*]] = getelementptr float, float* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP12:%.*]] = getelementptr float, float* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP13:%.*]] = bitcast float* [[TMP12]] to <vscale x 4 x float>*			; CHECK-NEXT: [[TMP13:%.*]] = bitcast float* [[TMP12]] to <vscale x 4 x float>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP13]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x float> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP13]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x float> poison)
	; CHECK-NEXT: [[TMP14:%.*]] = getelementptr float, float* [[TMP11]], i32 0			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr float, float* [[TMP11]], i32 0
	; CHECK-NEXT: [[TMP15:%.*]] = bitcast float* [[TMP14]] to <vscale x 4 x float>*			; CHECK-NEXT: [[TMP15:%.*]] = bitcast float* [[TMP14]] to <vscale x 4 x float>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP15]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x float> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD3:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP15]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x float> poison)
	; CHECK-NEXT: [[TMP16:%.*]] = fdiv <vscale x 4 x float> [[WIDE_MASKED_LOAD]], [[WIDE_MASKED_LOAD2]]			; CHECK-NEXT: [[TMP16:%.*]] = fdiv <vscale x 4 x float> [[WIDE_MASKED_LOAD]], [[WIDE_MASKED_LOAD3]]
	; CHECK-NEXT: [[TMP17:%.*]] = bitcast float* [[TMP14]] to <vscale x 4 x float>*			; CHECK-NEXT: [[TMP17:%.*]] = bitcast float* [[TMP14]] to <vscale x 4 x float>*
	; CHECK-NEXT: call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> [[TMP16]], <vscale x 4 x float>* [[TMP17]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])			; CHECK-NEXT: call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> [[TMP16]], <vscale x 4 x float>* [[TMP17]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]])
	; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 4			; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 4
	; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP19]]			; CHECK-NEXT: [[INDEX_NEXT4]] = add i64 [[INDEX1]], [[TMP19]]
	; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT3]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK5]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT4]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]			; CHECK-NEXT: [[TMP20:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK5]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP21:%.*]] = extractelement <vscale x 4 x i1> [[TMP20]], i32 0
				; CHECK-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP18:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP15:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP15:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP13:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[TMP13:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP14:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP13]])			; CHECK-NEXT: [[TMP14:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP13]])
	; CHECK-NEXT: [[TMP15]] = add i32 [[TMP14]], [[VEC_PHI]]			; CHECK-NEXT: [[TMP15]] = add i32 [[TMP14]], [[VEC_PHI]]
	; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4			; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP17]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP17]]
	; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP18:![0-9]+]]			; CHECK-NEXT: [[TMP18:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP19:%.*]] = extractelement <vscale x 4 x i1> [[TMP18]], i32 0
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	Show All 22 Lines
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT3:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, [[VECTOR_PH]] ], [ [[TMP14:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, [[VECTOR_PH]] ], [ [[TMP14:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[UMAX]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr float, float* [[PTR:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr float, float* [[PTR:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr float, float* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr float, float* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast float* [[TMP11]] to <vscale x 4 x float>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast float* [[TMP11]] to <vscale x 4 x float>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x float> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x float> poison)
	; CHECK-NEXT: [[TMP13:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x float> [[WIDE_MASKED_LOAD]], <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float -0.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-NEXT: [[TMP13:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK2]], <vscale x 4 x float> [[WIDE_MASKED_LOAD]], <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float -0.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK-NEXT: [[TMP14]] = call float @llvm.vector.reduce.fadd.nxv4f32(float [[VEC_PHI]], <vscale x 4 x float> [[TMP13]])			; CHECK-NEXT: [[TMP14]] = call float @llvm.vector.reduce.fadd.nxv4f32(float [[VEC_PHI]], <vscale x 4 x float> [[TMP13]])
	; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4			; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
	; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP16]]			; CHECK-NEXT: [[INDEX_NEXT3]] = add i64 [[INDEX1]], [[TMP16]]
	; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK4]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT3]], i64 [[UMAX]])
	; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]			; CHECK-NEXT: [[TMP17:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK4]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP18:%.*]] = extractelement <vscale x 4 x i1> [[TMP17]], i32 0
				; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	Show All 21 Lines
	; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4			; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
	; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4			; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
	; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1			; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[TMP7]], 1
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP8]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK1:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK3:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 7, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 7, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP9]], i64 [[N]])
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[COND:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[COND:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[ACTIVE_LANE_MASK1]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP13:%.*]] = icmp eq <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 5, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-NEXT: [[TMP13:%.*]] = icmp eq <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 5, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[A:%.*]], i64 [[TMP9]]			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[A:%.*]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP15:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP13]], <vscale x 4 x i1> zeroinitializer			; CHECK-NEXT: [[TMP15:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK1]], <vscale x 4 x i1> [[TMP13]], <vscale x 4 x i1> zeroinitializer
	; CHECK-NEXT: [[TMP16:%.*]] = getelementptr i32, i32* [[TMP14]], i32 0			; CHECK-NEXT: [[TMP16:%.*]] = getelementptr i32, i32* [[TMP14]], i32 0
	; CHECK-NEXT: [[TMP17:%.*]] = bitcast i32* [[TMP16]] to <vscale x 4 x i32>*			; CHECK-NEXT: [[TMP17:%.*]] = bitcast i32* [[TMP16]] to <vscale x 4 x i32>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP17]], i32 4, <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP17]], i32 4, <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> poison)
	; CHECK-NEXT: [[TMP18:%.*]] = select <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD1]], <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[TMP18:%.*]] = select <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD2]], <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP19:%.*]] = call i32 @llvm.vector.reduce.xor.nxv4i32(<vscale x 4 x i32> [[TMP18]])			; CHECK-NEXT: [[TMP19:%.*]] = call i32 @llvm.vector.reduce.xor.nxv4i32(<vscale x 4 x i32> [[TMP18]])
	; CHECK-NEXT: [[TMP20]] = xor i32 [[TMP19]], [[VEC_PHI]]			; CHECK-NEXT: [[TMP20]] = xor i32 [[TMP19]], [[VEC_PHI]]
	; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4			; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP22]]			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP22]]
	; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[ACTIVE_LANE_MASK3]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[N]])
	; CHECK-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]			; CHECK-NEXT: [[TMP23:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK3]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP24:%.*]] = extractelement <vscale x 4 x i1> [[TMP23]], i32 0
				; CHECK-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP24:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]
	▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines

	while.end.loopexit: ; preds = %while.body			while.end.loopexit: ; preds = %while.body
	ret void			ret void
	}			}

	!0 = distinct !{!0, !1, !2}			!0 = distinct !{!0, !1, !2}
	!1 = !{!"llvm.loop.vectorize.width", i32 4}			!1 = !{!"llvm.loop.vectorize.width", i32 4}
	!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}			!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!3 = distinct !{!3, !4}
				!4 = !{!"llvm.loop.vectorize.width", i32 4}

	attributes #0 = { "target-features"="+sve" }			attributes #0 = { "target-features"="+sve" }

llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll

	; RUN: opt -loop-vectorize -scalable-vectorization=off -force-vector-width=4 -prefer-predicate-over-epilogue=predicate-dont-vectorize -S < %s \| FileCheck %s			; RUN: opt -loop-vectorize -scalable-vectorization=off -force-vector-width=4 -prefer-predicate-over-epilogue=predicate-dont-vectorize -S < %s \| FileCheck %s

	; NOTE: These tests aren't really target-specific, but it's convenient to target AArch64			; NOTE: These tests aren't really target-specific, but it's convenient to target AArch64
	; so that TTI.isLegalMaskedLoad can return true.			; so that TTI.isLegalMaskedLoad can return true.

	target triple = "aarch64-linux-gnu"			target triple = "aarch64-linux-gnu"

	; The original loop had an unconditional uniform load. Let's make sure			; The original loop had an unconditional uniform load. Let's make sure
	; we don't artificially create new predicated blocks for the load.			; we don't artificially create new predicated blocks for the load.
	define void @uniform_load(i32* noalias %dst, i32* noalias readonly %src, i64 %n) #0 {			define void @uniform_load(i32* noalias %dst, i32* noalias readonly %src, i64 %n) #0 {
	; CHECK-LABEL: @uniform_load(			; CHECK-LABEL: @uniform_load(
				; CHECK: vector.ph:
				; CHECK: [[INIT_ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 %n)
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[IDX_NEXT:%.*]], %vector.body ]			; CHECK-NEXT: [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[IDX_NEXT:%.*]], %vector.body ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <4 x i1> [ [[INIT_ACTIVE_LANE_MASK]], %vector.ph ], [ [[NEXT_ACTIVE_LANE_MASK:%.*]], %vector.body ]
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[IDX]], 0			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[IDX]], 0
	; CHECK-NEXT: [[LOOP_PRED:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP3]], i64 %n)
	; CHECK-NEXT: [[LOAD_VAL:%.*]] = load i32, i32* %src, align 4			; CHECK-NEXT: [[LOAD_VAL:%.*]] = load i32, i32* %src, align 4
	; CHECK-NOT: load i32, i32* %src, align 4			; CHECK-NOT: load i32, i32* %src, align 4
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[LOAD_VAL]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[LOAD_VAL]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> zeroinitializer			; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* %dst, i64 [[TMP3]]			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* %dst, i64 [[TMP3]]
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP6]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP6]], i32 0
	; CHECK-NEXT: [[STORE_PTR:%.*]] = bitcast i32* [[TMP7]] to <4 x i32>*			; CHECK-NEXT: [[STORE_PTR:%.*]] = bitcast i32* [[TMP7]] to <4 x i32>*
	; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP5]], <4 x i32>* [[STORE_PTR]], i32 4, <4 x i1> [[LOOP_PRED]])			; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> [[TMP5]], <4 x i32>* [[STORE_PTR]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]])
	; CHECK-NEXT: [[IDX_NEXT]] = add i64 [[IDX]], 4			; CHECK-NEXT: [[IDX_NEXT]] = add i64 [[IDX]], 4
	; CHECK-NEXT: [[CMP:%.*]] = icmp eq i64 [[IDX_NEXT]], %n.vec			; CHECK-NEXT: [[NEXT_ACTIVE_LANE_MASK]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[IDX_NEXT]], i64 %n)
	; CHECK-NEXT: br i1 [[CMP]], label %middle.block, label %vector.body			; CHECK-NEXT: [[NOT_ACTIVE_LANE_MASK:%.*]] = xor <4 x i1> [[NEXT_ACTIVE_LANE_MASK]], <i1 true, i1 true, i1 true, i1 true>
				; CHECK-NEXT: [[FIRST_LANE_SET:%.*]] = extractelement <4 x i1> [[NOT_ACTIVE_LANE_MASK]], i32 0
				; CHECK-NEXT: br i1 [[FIRST_LANE_SET]], label %middle.block, label %vector.body

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
	%val = load i32, i32* %src, align 4			%val = load i32, i32* %src, align 4
	%arrayidx = getelementptr inbounds i32, i32* %dst, i64 %indvars.iv			%arrayidx = getelementptr inbounds i32, i32* %dst, i64 %indvars.iv
	store i32 %val, i32* %arrayidx, align 4			store i32 %val, i32* %arrayidx, align 4
	%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1			%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
	%exitcond.not = icmp eq i64 %indvars.iv.next, %n			%exitcond.not = icmp eq i64 %indvars.iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body			br i1 %exitcond.not, label %for.end, label %for.body

	for.end: ; preds = %for.body, %entry			for.end: ; preds = %for.body, %entry
	ret void			ret void
	}			}

	; The original loop had a conditional uniform load. In this case we actually			; The original loop had a conditional uniform load. In this case we actually
	; do need to perform conditional loads and so we end up using a gather instead.			; do need to perform conditional loads and so we end up using a gather instead.
	; However, we at least ensure the mask is the overlap of the loop predicate			; However, we at least ensure the mask is the overlap of the loop predicate
	; and the original condition.			; and the original condition.
	define void @cond_uniform_load(i32* nocapture %dst, i32* nocapture readonly %src, i32* nocapture readonly %cond, i64 %n) #0 {			define void @cond_uniform_load(i32* nocapture %dst, i32* nocapture readonly %src, i32* nocapture readonly %cond, i64 %n) #0 {
	; CHECK-LABEL: @cond_uniform_load(			; CHECK-LABEL: @cond_uniform_load(
	; CHECK: vector.ph:			; CHECK: vector.ph:
				; CHECK: [[INIT_ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 %n)
	; CHECK: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* %src, i32 0			; CHECK: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* %src, i32 0
	; CHECK-NEXT: [[SRC_SPLAT:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer			; CHECK-NEXT: [[SRC_SPLAT:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[IDX_NEXT:%.*]], %vector.body ]			; CHECK-NEXT: [[IDX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[IDX_NEXT:%.*]], %vector.body ]
				; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <4 x i1> [ [[INIT_ACTIVE_LANE_MASK]], %vector.ph ], [ [[NEXT_ACTIVE_LANE_MASK:%.*]], %vector.body ]
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[IDX]], 0			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[IDX]], 0
	; CHECK-NEXT: [[LOOP_PRED:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP3]], i64 %n)			; CHECK: [[COND_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{%.*}}, i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)
	; CHECK: [[COND_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* {{%.*}}, i32 4, <4 x i1> [[LOOP_PRED]], <4 x i32> poison)
	; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <4 x i32> [[COND_LOAD]], zeroinitializer			; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <4 x i32> [[COND_LOAD]], zeroinitializer
	; CHECK-NEXT: [[TMP5:%.*]] = xor <4 x i1> [[TMP4]], <i1 true, i1 true, i1 true, i1 true>			; CHECK-NEXT: [[TMP5:%.*]] = xor <4 x i1> [[TMP4]], <i1 true, i1 true, i1 true, i1 true>
	; CHECK-NEXT: [[MASK:%.*]] = select <4 x i1> [[LOOP_PRED]], <4 x i1> [[TMP5]], <4 x i1> zeroinitializer			; CHECK-NEXT: [[MASK:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i1> [[TMP5]], <4 x i1> zeroinitializer
	; CHECK-NEXT: call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[SRC_SPLAT]], i32 4, <4 x i1> [[MASK]], <4 x i32> undef)			; CHECK-NEXT: call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[SRC_SPLAT]], i32 4, <4 x i1> [[MASK]], <4 x i32> undef)
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %if.end			for.body: ; preds = %entry, %if.end
	%index = phi i64 [ %index.next, %if.end ], [ 0, %entry ]			%index = phi i64 [ %index.next, %if.end ], [ 0, %entry ]
	%arrayidx = getelementptr inbounds i32, i32* %cond, i64 %index			%arrayidx = getelementptr inbounds i32, i32* %cond, i64 %index
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	Show All 20 Lines
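
The updated CHECK lines above all follow the same pattern: the mask for the first iteration is computed in vector.ph, the mask for the next iteration is computed at the end of vector.body, and the loop exits when lane 0 of that next mask is clear. As a rough scalar model of this control flow (not code from the patch), the sketch below emulates it in C++ for a fixed VF of 4. The names `activeLaneMask` and `maskedSplatStore` are hypothetical and chosen only for illustration; the lane rule "lane i is active iff base + i < n" is the documented behaviour of `llvm.get.active.lane.mask`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t VF = 4;  // assumed fixed vector width, as in the v4i1 tests
using Mask = std::array<bool, VF>;

// Hypothetical scalar model of llvm.get.active.lane.mask(base, n):
// lane i is active iff base + i < n (written so that base + i cannot wrap).
Mask activeLaneMask(std::uint64_t base, std::uint64_t n) {
  Mask m{};
  for (std::size_t i = 0; i < VF; ++i)
    m[i] = base < n && i < n - base;
  return m;
}

// Hypothetical model of the tail-folded @uniform_load loop: store `val` to
// every element of dst, using the mask both as store predicate and as the
// loop exit condition. Returns the number of vector iterations executed.
std::size_t maskedSplatStore(std::vector<std::int32_t> &dst, std::int32_t val) {
  const std::uint64_t n = dst.size();
  std::uint64_t idx = 0;
  std::size_t iterations = 0;
  Mask mask = activeLaneMask(0, n);  // vector.ph: mask for the first iteration
  do {
    for (std::size_t i = 0; i < VF; ++i)  // masked store of the splat value
      if (mask[i])
        dst[idx + i] = val;
    idx += VF;
    mask = activeLaneMask(idx, n);  // mask for the NEXT iteration
    ++iterations;
  } while (mask[0]);  // exit to middle.block once lane 0 is inactive
  return iterations;
}
```

With ten elements this model runs three vector iterations, and the last one has only lanes 0 and 1 active. Note that the model never compares `idx` against a rounded-up trip count, which corresponds to the `icmp eq ... [[N_VEC]]` branch condition being replaced on the right-hand side of the diff.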

Diff 443614