When the trip count is not known at compile time, there is additional overhead to make sure it is safe to perform the next VF iterations. Thus, if the vector loop is skipped at runtime, such vectorization is unprofitable. When the trip count is known to be small enough, there is a high chance of getting into this situation. Currently, LV is not able to cost model this case properly since it does not account for the cost of the epilogue loop; instead, the "short trip count" heuristic is employed.
While "short trip count" heuristic makes sense in general (at least for current state) it can be slightly lifted up when trip count is compile time known constant. In this case it's known at compile time how many vector iterations will be executed and there is no implied overhead by trip count checks as well. Cost modeling is simple as well, if one vector iteration costs less than one scalar iteration multiple VF then vectorization is profitable.
Note: one may argue that the "short trip count" heuristic is needed to reduce code size, on the assumption that short trip count loops cannot be performance critical. That assumption turns out to be false in many cases (for example, nested loops; see the sketch below) and should not be the driving factor.
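As an illustration (the function below is a made-up example, not taken from any particular benchmark), an inner loop with a tiny compile-time-constant trip count can dominate total runtime when it sits inside a hot outer loop:

```cpp
// The inner loop has a trip count of 8, known at compile time, yet it
// accounts for essentially all of the work: vectorizing it pays off on
// every iteration of the (large, runtime-trip-count) outer loop.
void addRows(float *A, const float *B, int Rows) {
  for (int i = 0; i < Rows; ++i)     // large, runtime trip count
    for (int j = 0; j < 8; ++j)      // short, compile-time-constant trip count
      A[i * 8 + j] += B[i * 8 + j];
}
```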
I don't quite understand this change. The whole point of getMinTripCountTailFoldingThreshold() was to give targets control over this behaviour based on their understanding of how the cost model has been implemented.
Admittedly this was in part due to the immaturity of the cost modelling, but this change essentially removes that flexibility, to the point where there's no value in keeping getMinTripCountTailFoldingThreshold()?
If your previous patches improve the cost model in this regard then I'd rather getMinTripCountTailFoldingThreshold() be removed entirely. That said, @dtemirbulatov can you help ascertain whether this option is still required based on this new patch series?