This is an archive of the discontinued LLVM Phabricator instance.

[LV] Don't apply "TinyTripCountVectorThreshold" for loops with compile time known TC.
Needs ReviewPublic

Authored by ebrevnov on Dec 14 2021, 1:37 AM.

Details

Summary

When the trip count is not known at compile time, there is additional overhead to make sure it's safe to perform the next VF iterations. Thus, if the vector loop is skipped at runtime, such vectorization is unprofitable. When the trip count is known to be small enough, there is a high chance of getting into such a situation. Currently, LV is not able to cost model this case properly since it doesn't account for the cost of the epilogue loop. Instead, a "short trip count" heuristic is employed.

While the "short trip count" heuristic makes sense in general (at least in the current state), it can be slightly relaxed when the trip count is a compile-time-known constant. In this case, the number of vector iterations to be executed is known at compile time, and there is no overhead implied by trip count checks either. Cost modeling is simple as well: if one vector iteration costs less than one scalar iteration multiplied by VF, then vectorization is profitable.

Note: One may say that the "short trip count" heuristic is needed to reduce code size, under the assumption that short trip count loops can't be performance critical. That statement turns out to be false in many cases (for example, nested loops) and should not be the driving factor.

Diff Detail

Event Timeline

ebrevnov created this revision.Dec 14 2021, 1:37 AM
ebrevnov requested review of this revision.Dec 14 2021, 1:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2021, 1:37 AM
ebrevnov edited the summary of this revision. (Show Details)Dec 14 2021, 2:42 AM
ebrevnov added reviewers: fhahn, dmgreen, Ayal, lebedev.ri.
Herald added a project: Restricted Project. · View Herald TranscriptJul 19 2022, 2:34 AM
dmgreen added inline comments.Oct 10 2022, 12:04 AM
llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll
7

I don't think we want this - it is worse. At least that is what my benchmarks suggest.

That was the point of D101726. 1 vector + 1 masked vector iteration when unrolled was worse than 5 scalar because of the overheads of vector instructions. 1 vector + 1 scalar will be in the same boat.

fhahn added inline comments.Oct 10 2022, 12:27 AM
llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll
7

I also think we would probably need to make this target/CPU dependent. We also have some AArch64 CPUs where usually at least 2 vector iterations are needed to make the vector code profitable if there is a scalar tail.

ebrevnov edited the summary of this revision. (Show Details)Oct 10 2022, 2:07 AM
ebrevnov updated this revision to Diff 467819.Oct 14 2022, 10:12 AM

Rebase + fix a problem for platforms which prefer tail folding over a scalar epilogue.

Thanks for the feedback!
The problem with mve-known-trip-count.ll was caused by two things: 1) an outdated code base, and 2) a bug in the implementation that caused the "preferPredicateOverEpilogue" check to be skipped, so a scalar epilogue was selected. Now that both items are fixed, mve-known-trip-count.ll works as previously (the cost model says vectorization with tail folding is not profitable).

In order to fix the above two items (without introducing new ones) I had to restructure the code responsible for scalar epilogue lowering. This restructuring is now a parent of this revision and is in a WIP state. While I believe the proposed restructuring is a good thing regardless of this change, I would like to postpone polishing it until we have agreement on this one.

david-arm added subscribers: dtemirbulatov, david-arm.

Adding @dtemirbulatov since he was the author of the original patch that introduced getMinTripCountTailFoldingThreshold.

paulwalker-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

I don't quite understand this change. The whole point of getMinTripCountTailFoldingThreshold() was to give targets control over this behaviour based on their understanding of how the cost model has been implemented.

Admittedly this was in part due to the immaturity of the cost modelling, but this change essentially removes that flexibility, to the point where there's no value in keeping getMinTripCountTailFoldingThreshold()?

If your previous patches improve the cost model in this regard then I'd rather getMinTripCountTailFoldingThreshold() be removed. That said, @dtemirbulatov can you help here in ascertaining if this option is still required based on this new patch series?

ebrevnov added inline comments.Oct 17 2022, 6:34 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

I think you are missing one thing. Currently we allow vectorization of short trip count loops only if there is no run-time overhead and the tail can be folded. The introduction of getMinTripCountTailFoldingThreshold put additional limitations on vectorization of such loops. This change enables vectorization of short trip count loops with a scalar epilogue ONLY if the trip count is a compile-time constant. getMinTripCountTailFoldingThreshold is still applicable if we decide to vectorize by folding the tail.

In other words, this change enables exactly the same kind of vectorization for loops with a small constant trip count as for loops with an unknown or large trip count.

It's a question whether the getMinTripCountTailFoldingThreshold threshold should be taken into account if it's decided to fold the tail at step 5) for a short trip count loop. Possibly yes... I think this is your original concern, right?

I think the effect this patch has is not well understood, and the description is not informative enough to understand it.
I would perhaps recommend formatting the desired behavior change as a truth table.

I tried running the original benchmarks again with this patch series; it still sees the same decrease in performance, I'm afraid. If it was a small change it would be understandable: we accidentally end up on the wrong side of the scalar cost and choose to vectorize where we shouldn't. But the new code is 40% slower than the scalar version, so it's quite a difference. I haven't had a chance to look into the costs it produces; there is a chance we are underestimating the cost of predication or overestimating the cost of scalar. At worst, provided getMinTripCountTailFoldingThreshold works like it should, we can always put a limit on the trip count, providing we can find a decent minimum that works in general for MVE.

The vectorizer doesn't really do any cost modelling for the setup cost, past some obvious things like runtime checks. Due to the pass pipeline, it is usually expected that many low-trip-count loops are unrolled prior to vectorization, and we SLP them instead. D135971 looks like it changes a lot of code in an area that has caused an amount of trouble in the past. We know that the vectorizer currently needs to commit to tail folding vs not tail folding too early. It would be better to have that option as part of the vplan, allowing predicated and non-predicated patterns to be costed against one another. I'm not sure I follow D135971 enough to understand the ramifications of those changes.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

Are you assuming that the loop will be unrolled after vectorization, and that unrolling will leave no extra overhead? Scalable vector loops cannot (currently) be unrolled.

ebrevnov updated this revision to Diff 472865.Nov 3 2022, 1:02 AM

Make sure MinTripCountTailFoldingThreshold is honored even for loops with a short compile-time-known TC.

I tried running the original benchmarks again with this patch series; it still sees the same decrease in performance, I'm afraid. If it was a small change it would be understandable: we accidentally end up on the wrong side of the scalar cost and choose to vectorize where we shouldn't. But the new code is 40% slower than the scalar version, so it's quite a difference. I haven't had a chance to look into the costs it produces; there is a chance we are underestimating the cost of predication or overestimating the cost of scalar. At worst, provided getMinTripCountTailFoldingThreshold works like it should, we can always put a limit on the trip count, providing we can find a decent minimum that works in general for MVE.

It sounds like the current cost model works really badly for this case, and that is essentially the reason why getMinTripCountTailFoldingThreshold was introduced. If this is the case, would you mind trying the updated version and seeing if it helps? Otherwise, I'd probably need a test case I could analyze.

The vectorizer doesn't really do any cost modelling for the setup cost, past some obvious things like runtime checks. Due to the pass pipeline, it is usually expected that many low-trip-count loops are unrolled prior to vectorization, and we SLP them instead. D135971 looks like it changes a lot of code in an area that has caused an amount of trouble in the past. We know that the vectorizer currently needs to commit to tail folding vs not tail folding too early. It would be better to have that option as part of the vplan, allowing predicated and non-predicated patterns to be costed against one another. I'm not sure I follow D135971 enough to understand the ramifications of those changes.

I understand the problem you describe here. D135971 has nothing to do with it. It's an NFC. None of the tests change their behavior. The main motivation is to simplify an unnecessarily complicated implementation, because I struggled to add new functionality to it.

ebrevnov added inline comments.Nov 3 2022, 1:13 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

Actually, no. Even if we keep a vector loop which runs just 1 or 2 iterations, I don't see any additional overhead compared to the scalar loop. At least not at the IR level. Is there any MVE-specific overhead not visible at the IR level?

I've been taking a look at the example that is getting a lot worse. There are certainly some issues with the code generation being non-optimal, but even after a lot of optimizations it looks like it will always be worse than the scalar version. There is a lot of predication, and fairly efficient scalar instructions like BFI, which makes accurate cost modelling difficult. There's a lot of setup and spilling too, which is going to hurt for small trip counts. I think for MVE it would make sense to have a way for the target to put a limit on the minimum trip count.

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure in general where this would be useful. The vectorizer's handling of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled. It doesn't come up a huge amount, and a lot of the cost modelling currently assumes any extra setup costs will be dominated by the loop.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

The getMinTripCountTailFoldingThreshold was added for SVE, it is not (currently) used in MVE.

There will be a certain amount of overhead just from having a second loop, let alone all the extra overhead from the gubbins around it. https://godbolt.org/z/s8xEf7d6K. None of that is particularly MVE specific though, although as it is designed for small energy-efficient cores it may feel the overheads more than other architectures.

I've been taking a look at the example that is getting a lot worse. There are certainly some issues with the code generation being non-optimal, but even after a lot of optimizations it looks like it will always be worse than the scalar version. There is a lot of predication, and fairly efficient scalar instructions like BFI, which makes accurate cost modelling difficult.

I would like to understand your case better. Let's take the current non-optimality of code generation out of consideration. Is it a problem of inaccurate cost estimation for some particular instructions, or of not taking into account the overhead which comes with vectorization?

There's a lot of setup and spilling too, which is going to hurt for small trip counts.

Does this setup/spilling happen inside the main vector loop or outside? Is this reflected at the IR level or at a lower level (maybe even hardware-specific)?

I think for MVE it would make sense to have a way for the target to put a limit on the minumum trip count.

IMHO, this may be an option only if we find out that this is something specific to MVE. Anyway, it looks like a workaround (hopefully a temporary one :-)) until better support is there.

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure in general where this would be useful. The vectorizer's handling of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled.

Unless I'm missing something, it looks like LV currently comes before the unroller & SLP, which makes perfect sense to me. Anyway, I wouldn't stick to any specific pass ordering, as LLVM is an infrastructure for building custom compilers. LV should do its job as well as it can. If it can prove that vectorization is beneficial, it should do it (until we have an infrastructure to take dependencies between different passes into account).

It doesn't come up a huge amount, and a lot of the cost modelling currently assumes any extra setup costs will be dominated by the loop.

ebrevnov added inline comments.Nov 7 2022, 4:45 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

It looks like, in this specific example, the vector cost of the main vector loop being estimated as 2x lower than the scalar one is pretty much accurate. The problem is that the extra cost coming from the horizontal reduction after the main vector loop is not taken into account. This is something that definitely should be fixed before we can move on in this direction. Essentially, this problem is a major blocking factor for enabling vectorization of short-TC loops in the general case (not only with a compile-time-known TC).

PS: I wonder if your problematic case on MVE is something similar to this or not?

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure in general where this would be useful. The vectorizer's handling of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled.

Unless I'm missing something, it looks like LV currently comes before the unroller & SLP, which makes perfect sense to me. Anyway, I wouldn't stick to any specific pass ordering, as LLVM is an infrastructure for building custom compilers. LV should do its job as well as it can. If it can prove that vectorization is beneficial, it should do it (until we have an infrastructure to take dependencies between different passes into account).

(Unfortunately?) full loop unrolling happens before LV.
I personally think that, once LV has outer loop vectorization, that should be considered a bug to be resolved.

Matt added a subscriber: Matt.Nov 7 2022, 12:39 PM
ebrevnov updated this revision to Diff 474185.Nov 9 2022, 1:01 AM

Avoid vectorization if there is reduction/recurrence induced overhead

ebrevnov added inline comments.Nov 9 2022, 1:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10064–10065

I've updated the patch to avoid vectorization in such cases. Please take a look.

Avoid vectorization if there is reduction/recurrence induced overhead

For some architectures like MVE and AArch64 the reduction cost might well be cheap: non-zero, but also not huge. It can depend on the reduction. Have you considered the cost of constants and splats too, which the vectorizer will currently hoist into the preheader? The cost model functions might also sometimes return cheaper costs for certain shuffles and gathers under the assumption that loop-invariant parts can be hoisted.

I've tried to get an example of the case that is getting worse for MVE in rG662b5f18467e. It might be a bit over-reduced, but hopefully it still shows the same problems. The original I can't share, I'm afraid.

@fhahn did you have an example of cases where you had seen AArch64 getting worse? Is it covered by reductions or is it more general?

lebedev.ri resigned from this revision.Jan 12 2023, 5:32 PM

This review may be stuck/dead; consider abandoning it if no longer relevant.
Removing myself as a reviewer in an attempt to clean my dashboard.